Multiprocessor system-on-chip such as embedded GPUs are becoming very popular in safety-critical applications, such as autonomous and semi-autonomous vehicles. However, these devices can suffer from the effects of soft-errors, such as those produced by radiation effects. These effects are able to generate unpredictable misbehaviors. Fault tolerance oriented to multi-threaded software introduces severe performance degradations due to the redundancy, voting and correction threads operations. In this paper, we propose a new fault injection environment for NVIDIA GPGPU devices and a fault tolerance approach based on error detection and correction threads executed during data transfer operations on embedded GPUs. The fault injection environment is capable of automatically injecting faults into the instructions at SASS level by instrumenting the CUDA binary executable file. The mitigation approach is based on concurrent error detection threads running simultaneously with the memory stream device to host data transfer operations. With several benchmark applications, we evaluate the impact of softerrors classifying Silent Data Corruption, Detection, Unrecoverable Error and Hang. Finally, the proposed mitigation approach has been validated by soft-error fault injection campaigns on an NVIDIA Pascal Architecture GPU controlled by Quad-Core A57 ARM processor (JETSON TX2) demonstrating an advantage of more than 37% with respect to state of the art solution.
Analysis and Mitigation of Soft-Errors on High Performance Embedded GPUs / Sterpone, L.; Azimi, S.; De Sio, C.; Parisi, F.. - ELETTRONICO. - (2022), pp. 91-98. (Intervento presentato al convegno 21st IEEE International Symposium on Parallel and Distributed Computing tenutosi a Basel (Switzerland) nel 11-13 July 2022) [10.1109/ISPDC55340.2022.00022].
Analysis and Mitigation of Soft-Errors on High Performance Embedded GPUs
L. Sterpone;S. Azimi;C. De Sio;
2022
Abstract
Multiprocessor system-on-chip such as embedded GPUs are becoming very popular in safety-critical applications, such as autonomous and semi-autonomous vehicles. However, these devices can suffer from the effects of soft-errors, such as those produced by radiation effects. These effects are able to generate unpredictable misbehaviors. Fault tolerance oriented to multi-threaded software introduces severe performance degradations due to the redundancy, voting and correction threads operations. In this paper, we propose a new fault injection environment for NVIDIA GPGPU devices and a fault tolerance approach based on error detection and correction threads executed during data transfer operations on embedded GPUs. The fault injection environment is capable of automatically injecting faults into the instructions at SASS level by instrumenting the CUDA binary executable file. The mitigation approach is based on concurrent error detection threads running simultaneously with the memory stream device to host data transfer operations. With several benchmark applications, we evaluate the impact of softerrors classifying Silent Data Corruption, Detection, Unrecoverable Error and Hang. Finally, the proposed mitigation approach has been validated by soft-error fault injection campaigns on an NVIDIA Pascal Architecture GPU controlled by Quad-Core A57 ARM processor (JETSON TX2) demonstrating an advantage of more than 37% with respect to state of the art solution.File | Dimensione | Formato | |
---|---|---|---|
ISPDC_2022.pdf
accesso aperto
Tipologia:
2. Post-print / Author's Accepted Manuscript
Licenza:
PUBBLICO - Tutti i diritti riservati
Dimensione
416.1 kB
Formato
Adobe PDF
|
416.1 kB | Adobe PDF | Visualizza/Apri |
Analysis_and_Mitigation_of_Soft-Errors_on_High_Performance_Embedded_GPUs.pdf
non disponibili
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Non Pubblico - Accesso privato/ristretto
Dimensione
1.11 MB
Formato
Adobe PDF
|
1.11 MB | Adobe PDF | Visualizza/Apri Richiedi una copia |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2971150