In the domain of music generation, prevailing methods focus on text-to-music tasks, predominantly relying on diffusion models. However, they fail to achieve good vocal quality in synthetic music compositions. To tackle this critical challenge, we present Ainur, a hierarchical diffusion model that concentrates on the lyrics-to-music generation task. Through its use of multimodal Contrastive Lyrics-Audio Spectrogram Pre-training (CLASP) embeddings, Ainur distinguishes itself from past approaches by specifically enhancing the vocal quality of synthetically produced music. Notably, Ainur’s training and testing processes are highly efficient, requiring only a single GPU. According to experimental results, Ainur meets or exceeds the quality of other state-of-the-art models such as MusicGen, MusicLM, and AudioLDM2 in both objective and subjective evaluations. Additionally, Ainur offers near real-time inference speed, which facilitates its use in practical, real-world applications.
Ainur: Harmonizing Speed and Quality in Deep Music Generation Through Lyrics-Audio Embeddings / Concialdi, Giuseppe; Koudounas, Alkis; Pastor, Eliana; Di Eugenio, Barbara; Baralis, Elena. - (2024), pp. 1146-1150. (Paper presented at the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, held in Seoul (KOR), 14-19 April 2024) [10.1109/icassp48485.2024.10448078].
Ainur: Harmonizing Speed and Quality in Deep Music Generation Through Lyrics-Audio Embeddings
Koudounas, Alkis; Pastor, Eliana; Baralis, Elena
2024
File: Ainur_Harmonizing_Speed_and_Quality_in_Deep_Music_Generation_Through_Lyrics-Audio_Embeddings.pdf (Adobe PDF, 1.41 MB)
Type: 2a Post-print editorial version / Version of Record
License: Non-public - Private/restricted access (full text not available; copy available on request)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2989181