Multimodal Fusion Techniques to Enhance Voice Disorder Diagnoses / Liu, Qingqing; Ciravegna, Gabriele; Koudounas, Alkis; Cerquitelli, Tania; Baralis, Elena. - 3946:(2025). (Paper presented at the Workshops of the EDBT/ICDT 2025 Joint Conference, DARLI-AP Workshop, held in Barcelona (ESP), 25-28 March 2025).

Multimodal Fusion Techniques to Enhance Voice Disorder Diagnoses

Alkis Koudounas;Tania Cerquitelli;Elena Baralis
2025

Abstract

Voice disorders are a significant health concern, with an annual prevalence of approximately 7% among adults, and they adversely affect patients’ quality of life, including social and occupational functioning. Moreover, most diagnostic methods still depend on invasive techniques, while non-invasive automated approaches have not yet been extensively investigated. This study introduces a transformer-based method for detecting voice disorders that aims to improve detection performance through multimodal fusion. Working with two distinct types of voice recordings (sentence reading and vowel emission), we devised and assessed five multimodal fusion strategies across three stages: early, mid, and late. Our experiments indicate that the cross-attention mid-fusion method exploits the benefits of both data types, achieving a detection accuracy of 0.885 and a macro F1 score of 0.843 on an internal dataset. This corresponds to an improvement of 0.03 to 0.06 in accuracy and 0.02 to 0.05 in macro F1 score over unimodal models (trained on sentence or vowel data only). This study is a step toward effective non-invasive detection of voice disorders and provides insights for clinical practice.
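To make the idea of cross-attention mid-fusion concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each recording type has already been encoded into a sequence of frame embeddings; the single attention head, the random weights, the mean pooling, the dimensions, and all function and parameter names (cross_attention, mid_fusion, params) are hypothetical choices made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query frame attends over the context frames."""
    q = queries @ w_q                            # (Tq, d)
    k = context @ w_k                            # (Tc, d)
    v = context @ w_v                            # (Tc, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (Tq, Tc) scaled dot products
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights

def mid_fusion(sentence_feats, vowel_feats, params):
    """Fuse the two streams at the feature level, then classify (assumed design)."""
    # Sentence frames query the vowel frames, and vice versa.
    s2v, _ = cross_attention(sentence_feats, vowel_feats, *params["s2v"])
    v2s, _ = cross_attention(vowel_feats, sentence_feats, *params["v2s"])
    # Mean-pool over time and concatenate the two attended representations.
    fused = np.concatenate([s2v.mean(axis=0), v2s.mean(axis=0)])
    logit = fused @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))          # probability of "disorder"

# Toy data: random embeddings stand in for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
d_in, d = 16, 8

def make_proj():
    # Query, key, and value projection matrices.
    return tuple(rng.normal(scale=0.1, size=(d_in, d)) for _ in range(3))

params = {
    "s2v": make_proj(),
    "v2s": make_proj(),
    "w_out": rng.normal(scale=0.1, size=2 * d),
    "b_out": 0.0,
}
sentence_feats = rng.normal(size=(50, d_in))     # 50 frames from sentence reading
vowel_feats = rng.normal(size=(20, d_in))        # 20 frames from vowel emission
prob = mid_fusion(sentence_feats, vowel_feats, params)
print(f"predicted probability: {prob:.3f}")
```

In this sketch, early fusion would instead combine the inputs before any encoder, and late fusion would combine the outputs of two separately trained classifiers; the mid-fusion variant lets each stream condition on the other at the representation level.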
Files in this item:
DARLI-AP-10.pdf

open access

Type: 2. Post-print / Author's Accepted Manuscript
License: Creative Commons
Size: 1.02 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2999238