
Scanzio, S.; Cumani, S.; Gemello, R.; Mana, F.; Laface, P. (2010). Parallel implementation of Artificial Neural Network training for speech recognition. Pattern Recognition Letters, 31(11), pp. 1302-1309. ISSN 0167-8655. Print. DOI: 10.1016/j.patrec.2010.02.003

Parallel implementation of Artificial Neural Network training for speech recognition

Cumani, Sandro; Laface, Pietro
2010

Abstract

In this paper we describe the implementation of a complete ANN training procedure using the block-mode back-propagation learning algorithm for sequential patterns, such as the observation feature vectors of a speech recognition system, exploiting the high-performance SIMD architecture of GPUs through CUDA and its C-like language interface. We also compare the speed-up obtained by implementing the training procedure using only the multi-thread capabilities of multi-core processors. Our implementation takes into account all the aspects peculiar to training on large-scale sequential patterns, in particular the re-segmentation of the training sentences, the block size for the feed-forward and back-propagation steps, and the transfer of huge amounts of data from host memory to the GPU card. Our approach has been tested by training acoustic models for large vocabulary speech recognition tasks, showing a six-fold reduction in the time required to train real-world, large-size networks with respect to an already optimized implementation based on the Intel MKL libraries. Thanks to these optimizations and to the support of the GPU, the training time for a language with a huge set of training sentences (about one million for Italian) can be reduced from approximately a month to five days.
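To give a concrete flavour of the block-mode processing the abstract refers to, the following minimal CUDA sketch shows a single-layer feed-forward step in which a whole block of observation vectors is transferred to the device and multiplied by the weight matrix in one cuBLAS call, followed by a parallel sigmoid. The layer sizes, block size, and the use of cublasSgemm are illustrative assumptions and are not taken from the paper's actual implementation.

```cuda
// Minimal sketch (not the authors' code): block-mode feed-forward of one ANN
// layer on the GPU. Dimensions and block size are assumptions for illustration.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Element-wise sigmoid applied to all activations of the block in parallel.
__global__ void sigmoid_kernel(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = 1.0f / (1.0f + expf(-a[i]));
}

int main() {
    const int in_dim  = 429;   // stacked feature-vector dimension (assumption)
    const int out_dim = 512;   // hidden units (assumption)
    const int block   = 1024;  // frames processed per block (assumption)

    // Host-side input block and weight matrix (zero-filled, illustration only).
    float *h_x = new float[in_dim * block]();
    float *h_w = new float[out_dim * in_dim]();

    float *d_x, *d_w, *d_y;
    cudaMalloc(&d_x, sizeof(float) * in_dim * block);
    cudaMalloc(&d_w, sizeof(float) * out_dim * in_dim);
    cudaMalloc(&d_y, sizeof(float) * out_dim * block);

    // Transfer the block of observation vectors and the weights to the GPU card.
    cudaMemcpy(d_x, h_x, sizeof(float) * in_dim * block, cudaMemcpyHostToDevice);
    cudaMemcpy(d_w, h_w, sizeof(float) * out_dim * in_dim, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Y = W * X  (out_dim x block): the whole block is processed as one large
    // matrix product, which is what makes block-mode training efficient on GPUs.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                out_dim, block, in_dim,
                &alpha, d_w, out_dim, d_x, in_dim, &beta, d_y, out_dim);

    // Non-linearity over all out_dim * block activations.
    int n = out_dim * block;
    sigmoid_kernel<<<(n + 255) / 256, 256>>>(d_y, n);
    cudaDeviceSynchronize();

    printf("Processed a block of %d frames\n", block);

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_w); cudaFree(d_y);
    delete[] h_x; delete[] h_w;
    return 0;
}
```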
Files in this record:

cuda_v10.pdf (open access)
Type: 1. Preprint / submitted version (pre-review)
License: PUBLIC - All rights reserved
Size: 289.23 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2303844
Warning: the displayed data have not been validated by the university.