We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the pre-dominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In open condition, we train on VoxBlink2 dataset, containing 110 thousand speakers across multiple languages. We observed a good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-theart frontends for speaker recognition.
Analysis of ABC Frontend Audio Systems for the NIST-SRE24 / Barahona, S., Silnova, A., Mosner, L., Peng, J., Plchot, O., Rohdin, J., Zhang, L., Han, J., Palka, P., Landini, F., Burget, L., Stafylakis, T., Cumani, S., Bobos, D., Hlavacek, M., Kodovsky, M., Pavlicek, T.. - (2025), pp. 5763-5767. (Interspeech 2025 Rotterdam (NL) 17 - 21 August 2025) [10.21437/Interspeech.2025-2737].
Analysis of ABC Frontend Audio Systems for the NIST-SRE24
Cumani S.;
2025
Abstract
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the pre-dominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In open condition, we train on VoxBlink2 dataset, containing 110 thousand speakers across multiple languages. We observed a good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-theart frontends for speaker recognition.| File | Dimensione | Formato | |
|---|---|---|---|
|
barahona25_interspeech.pdf
accesso aperto
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Pubblico - Tutti i diritti riservati
Dimensione
244.14 kB
Formato
Adobe PDF
|
244.14 kB | Adobe PDF | Visualizza/Apri |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/3007721
