Analyzing the Architectural Impact of Transient Fault Effects in SFUs of GPUs / Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Patiño Núñez, Edwar J.; Limas, Robert; Reorda, Matteo Sonza. - (2023), pp. 1-6. (Paper presented at the 2023 IEEE 24th Latin American Test Symposium (LATS), held in Veracruz (Mexico), 21-24 March 2023) [10.1109/LATS58125.2023.10154504].
Analyzing the Architectural Impact of Transient Fault Effects in SFUs of GPUs
Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Limas, Robert; Reorda, Matteo Sonza
2023
Abstract
Graphics Processing Units (GPUs) are crucial in modern safety-critical systems for implementing complex and dense algorithms, so their reliability plays an essential role in several domains (e.g., automotive and autonomous machines). Reliability evaluations of GPUs and their internal units are of special interest because of their high parallelism and the need to identify vulnerable structures. In particular, Special Function Unit (SFU) cores inside GPUs are heavily used in multimedia, scientific computing, and the training of neural networks. However, the reliability of SFUs has remained largely unexplored. This work evaluates the impact of transient faults in the hardware structures of SFUs for GPUs. We focus on evaluating and analyzing two SFU architectures (‘fused’ and ‘modular’) and how they relate to energy, area, and reliability impact on GPU workloads. The evaluation relies on a fine-grain analysis with experiments using an open-source RTL GPU model (FlexGripPlus) instrumented with both SFUs. The experimental results indicate that modular SFUs are less vulnerable to transient faults (by up to 47% for the analyzed workloads) and more power efficient (by up to 36.6%), but incur an additional area cost (about 27%) compared with a fused SFU architecture (the basis for commercial devices), which appears more vulnerable to faults but is area efficient.

This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data and Quantum Computing.

File | Availability | Type | License | Size | Format |
---|---|---|---|---|---|
Analyzing_the_Architectural_Impact_of_Transient_Fault_Effects_in_SFUs_of_GPUs.pdf | Not available | 2a Post-print, publisher's version / Version of Record | Not public - private/restricted access | 687.76 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2980493