Evaluating the Prevalence of SFUs in the Reliability of GPUs / Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Patiño Núñez, Edwar J.; Limas, Robert; Reorda, Matteo Sonza. - (2023), pp. 1-6. (Paper presented at the 2023 IEEE European Test Symposium (ETS), held in Venice (IT), 22-26 May 2023) [10.1109/ETS56758.2023.10174110].

Evaluating the Prevalence of SFUs in the Reliability of GPUs

Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Limas, Robert; Reorda, Matteo Sonza
2023

Abstract

Graphics Processing Units (GPUs) are now extensively used in several safety-critical domains to implement complex operations where reliability is a major concern. Some internal cores, such as Special Function Units (SFUs), are increasingly adopted because they are crucial to achieving the necessary performance in multimedia, scientific computing, and neural network training. Unfortunately, the impact of these cores on reliability remains largely unexplored. In this work, we evaluate the influence of SFUs on the reliability of GPUs affected by soft errors. First, we analyze the impact of SFU cores on the GPU's reliability and on the running workloads. We use applications configured either to use or to avoid the SFU cores, and we evaluate the effect of soft errors with a software-based fault injection environment (NVBITFI) on an NVIDIA Ampere GPU. Then, we focus on the impact of soft errors arising inside the SFUs. A fine-grain RTL evaluation determines the soft error effects on two SFU architectures for GPUs ('fused' and 'modular'). The experiments use an open-source GPU (FlexGripPlus) instrumented with both SFU architectures. The results suggest that workloads using SFUs are more vulnerable to faults (from 1 up to 5 orders of magnitude for the analyzed applications). Moreover, the RTL results show that modular SFUs are less vulnerable to faults (by up to 47% for the analyzed workloads) than fused SFUs (the basis of commercial devices), which allows us to identify the more robust SFU architecture.
ISBN: 979-8-3503-3634-4
File in this record:
Evaluating_the_Prevalence_of_SFUs_in_the_Reliability_of_GPUs.pdf
Access: restricted
Type: 2a Post-print, publisher's version / Version of Record
License: Not public - private/restricted access
Size: 1.43 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2980492