Barahona, S.; Silnova, A.; Mosner, L.; Peng, J.; Plchot, O.; Rohdin, J.; Zhang, L.; Han, J.; Palka, P.; Landini, F.; Burget, L.; Stafylakis, T.; Cumani, S.; Bobos, D.; Hlavacek, M.; Kodovsky, M.; Pavlicek, T. (2025). Analysis of ABC Frontend Audio Systems for the NIST-SRE24. In Proceedings of Interspeech 2025, Rotterdam (NL), 17-21 August 2025, pp. 5763-5767. doi:10.21437/Interspeech.2025-2737
Analysis of ABC Frontend Audio Systems for the NIST-SRE24
Cumani S.
2025
Abstract
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, containing 110 thousand speakers across multiple languages. We observed good performance and robustness of the VoxBlink-trained models, and our experiments provide practical recipes for developing state-of-the-art frontends for speaker recognition.
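The record does not detail which pooling mechanisms were explored on top of the ResNet frontends. As one concrete illustration, the following is a minimal PyTorch sketch of attentive statistics pooling, a mechanism commonly used to aggregate frame-level features into a fixed-size utterance representation; the class and parameter names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attentive statistics pooling: a learned, weighted mean and standard
    deviation over time. Illustrative sketch only -- the exact pooling
    variants used in the paper are not specified in this record."""

    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        # Small attention network producing one score per channel and frame.
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) frame-level features from the encoder.
        w = torch.softmax(self.attention(x), dim=2)    # attention weights over time
        mu = torch.sum(w * x, dim=2)                   # weighted mean, (batch, channels)
        var = torch.sum(w * x ** 2, dim=2) - mu ** 2   # weighted variance
        sd = torch.sqrt(var.clamp(min=1e-8))           # weighted standard deviation
        return torch.cat([mu, sd], dim=1)              # (batch, 2 * channels)
```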
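Similarly, for the self-supervised branch, the sketch below shows how frame-level features can be extracted from a pre-trained XLS-R checkpoint with the HuggingFace transformers library. The specific checkpoint (facebook/wav2vec2-xls-r-300m) and the dummy waveform are assumptions for illustration, not the paper's configuration.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Public XLS-R checkpoint; whether the paper used this model size is an assumption.
name = "facebook/wav2vec2-xls-r-300m"
fe = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

wav = torch.randn(16000)  # 1 s of dummy 16 kHz audio standing in for a real recording
inputs = fe(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 1024) frame-level features

# In a full system, a pooling layer such as the one sketched above would
# aggregate these frame-level features into a fixed-size speaker embedding.
```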
| File | Type | License | Size | Format |
|---|---|---|---|---|
| barahona25_interspeech.pdf (open access) | 2a Post-print editorial version / Version of Record | Public - All rights reserved | 244.14 kB | Adobe PDF |
https://hdl.handle.net/11583/3007721
