Speaker recognition by means of acoustic and phonetically informed GMMs / Cumani, Sandro; Laface, Pietro; Kulsoom, Farzana. - PRINT. - 1:(2015), pp. 200-204. (Paper presented at the INTERSPEECH 2015 conference, held in Dresden (Germany), 6-10 September 2015).
Speaker recognition by means of acoustic and phonetically informed GMMs
Cumani, Sandro; Laface, Pietro; Kulsoom, Farzana
2015
Abstract
In this work we assess the recently proposed hybrid Deep Neural Network/Gaussian Mixture Model (DNN/GMM) approach for speaker recognition, considering the effects of the granularity of the phonetic DNN model and of the precision of the corresponding GMM models, which will be referred to as the phonetic GMMs. The aim of this work is to better understand the contribution of the phonetic information provided by the DNN model relative to the accuracy of the acoustic GMMs in fitting the distribution of the features associated with a given context-dependent phone state. The testbed for this work was the text-independent speaker recognition task defined by NIST for the 2012 Speaker Recognition Evaluation. Our experiment confirms that the acoustic and the phonetic GMMs are complementary. Thus, their score combination yields very good results if the DNN is trained on data collected in an environment similar to the one used for testing. We show, however, that using a single Gaussian per DNN state is not the best choice: the best single system was obtained by balancing the phonetic and acoustic precision of a DNN/GMM system.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2612954