Analyzing the Architectural Impact of Transient Fault Effects in SFUs of GPUs / Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Patiño Núñez, Edwar J.; Limas, Robert; Reorda, Matteo Sonza. - (2023), pp. 1-6. (Paper presented at the 2023 IEEE 24th Latin American Test Symposium (LATS), held in Veracruz (Mexico), 21-24 March 2023) [10.1109/LATS58125.2023.10154504].

Analyzing the Architectural Impact of Transient Fault Effects in SFUs of GPUs

Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Patiño Núñez, Edwar J.; Limas, Robert; Reorda, Matteo Sonza
2023

Abstract

Graphics Processing Units (GPUs) are crucial in modern safety-critical systems for implementing complex and dense algorithms, so their reliability plays an essential role in several domains (e.g., automotive and autonomous machines). Reliability evaluations of GPUs and their internal units are therefore of special interest, both because of their high parallelism and to identify vulnerable structures. In particular, Special Function Unit (SFU) cores inside GPUs are heavily used in multimedia, scientific computing, and the training of neural networks. However, the reliability of SFUs has remained largely unexplored. This work evaluates the impact of transient faults in the hardware structures of SFUs for GPUs. We evaluate and analyze two SFU architectures (‘fused’ and ‘modular’) and how they relate to energy, area, and reliability impact on GPU workloads. The evaluation relies on a fine-grained analysis, with experiments on an open-source RTL GPU model (FlexGripPlus) instrumented with both SFUs. The experimental results indicate that modular SFUs are less vulnerable to transient faults (by up to 47% for the analyzed workloads) and more power efficient (by up to 36.6%), but incur an additional area cost (about 27%) compared with a fused SFU architecture (the basis for commercial devices), which appears more vulnerable to faults but is area efficient. This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data and Quantum Computing.
Year: 2023
ISBN: 979-8-3503-2597-3
Files in this item:

File: Analyzing_the_Architectural_Impact_of_Transient_Fault_Effects_in_SFUs_of_GPUs.pdf
Access: restricted
Type: 2a Post-print, publisher's version / Version of Record
License: Not public - private/restricted access
Size: 687.76 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2980493