We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the pre-dominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In open condition, we train on VoxBlink2 dataset, containing 110 thousand speakers across multiple languages. We observed a good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-theart frontends for speaker recognition.

Analysis of ABC Frontend Audio Systems for the NIST-SRE24 / Barahona, S., Silnova, A., Mosner, L., Peng, J., Plchot, O., Rohdin, J., Zhang, L., Han, J., Palka, P., Landini, F., Burget, L., Stafylakis, T., Cumani, S., Bobos, D., Hlavacek, M., Kodovsky, M., Pavlicek, T.. - (2025), pp. 5763-5767. (Interspeech 2025 Rotterdam (NL) 17 - 21 August 2025) [10.21437/Interspeech.2025-2737].

Analysis of ABC Frontend Audio Systems for the NIST-SRE24

Cumani S.;
2025

Abstract

We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the pre-dominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In open condition, we train on VoxBlink2 dataset, containing 110 thousand speakers across multiple languages. We observed a good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-theart frontends for speaker recognition.
File in questo prodotto:
File Dimensione Formato  
barahona25_interspeech.pdf

accesso aperto

Tipologia: 2a Post-print versione editoriale / Version of Record
Licenza: Pubblico - Tutti i diritti riservati
Dimensione 244.14 kB
Formato Adobe PDF
244.14 kB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11583/3007721