Ainur: Harmonizing Speed and Quality in Deep Music Generation Through Lyrics-Audio Embeddings / Concialdi, Giuseppe; Koudounas, Alkis; Pastor, Eliana; Di Eugenio, Barbara; Baralis, Elena. - (2024), pp. 1146-1150. (Paper presented at the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, held in Seoul (KOR) on 14-19 April 2024) [10.1109/icassp48485.2024.10448078].

Ainur: Harmonizing Speed and Quality in Deep Music Generation Through Lyrics-Audio Embeddings

Koudounas, Alkis; Pastor, Eliana; Baralis, Elena
2024

Abstract

In the domain of music generation, prevailing methods focus on text-to-music tasks and predominantly rely on diffusion models. However, they fail to achieve good vocal quality in synthetic music compositions. To tackle this critical challenge, we present Ainur, a hierarchical diffusion model that concentrates on the lyrics-to-music generation task. Through its use of multimodal Contrastive Lyrics-Audio Spectrogram Pre-training (CLASP) embeddings, Ainur distinguishes itself from past approaches by specifically enhancing the vocal quality of synthetically produced music. Notably, Ainur's training and testing processes are highly efficient, requiring only a single GPU. According to experimental results, Ainur meets or exceeds the quality of other state-of-the-art models such as MusicGen, MusicLM, and AudioLDM2 in both objective and subjective evaluations. Additionally, Ainur offers near real-time inference speed, which facilitates its use in practical, real-world applications.
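The CLASP name suggests a CLIP-style contrastive pre-training stage that aligns lyrics and audio-spectrogram embeddings in a shared space. As a rough illustration only, the sketch below shows a symmetric InfoNCE objective of the kind such a stage could optimize; the encoder outputs, embedding dimension, temperature value, and the function name clasp_contrastive_loss are assumptions made for this example, not the paper's implementation.

# Minimal sketch (assumption-based) of a CLIP-style contrastive loss between
# lyrics embeddings and audio-spectrogram embeddings. It is NOT the authors'
# code; dimensions, temperature, and names are illustrative only.
import torch
import torch.nn.functional as F

def clasp_contrastive_loss(lyrics_emb: torch.Tensor,
                           audio_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (lyrics, audio) embeddings.

    lyrics_emb, audio_emb: (batch, dim) outputs of the two modality encoders.
    """
    # L2-normalize so dot products become cosine similarities.
    lyrics_emb = F.normalize(lyrics_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = lyrics_emb @ audio_emb.t() / temperature

    # Matched lyrics/audio pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average of lyrics-to-audio and audio-to-lyrics cross-entropy terms.
    loss_l2a = F.cross_entropy(logits, targets)
    loss_a2l = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_l2a + loss_a2l)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    batch, dim = 8, 512
    lyrics = torch.randn(batch, dim)
    audio = torch.randn(batch, dim)
    print(clasp_contrastive_loss(lyrics, audio))

In a pipeline of this kind, the resulting embeddings would condition a downstream generative model; here they only serve to make the contrastive alignment idea concrete.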
ISBN: 979-8-3503-4485-1
Files in this record:

Ainur_Harmonizing_Speed_and_Quality_in_Deep_Music_Generation_Through_Lyrics-Audio_Embeddings.pdf

Not available

Type: 2a Post-print editorial version / Version of Record
License: Non-public - Private/restricted access
Size: 1.41 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2989181