Investigating and Reducing the Architectural Impact of Transient Faults in Special Function Units for GPUs

Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Patiño Núñez, Edwar J.; Limas, Robert; Sonza Reorda, Matteo

doi:10.1007/s10836-024-06107-9

Ensuring the reliability of GPUs and their internal components is paramount, especially in safety-critical domains like autonomous machines and self-driving cars. These cutting-edge applications heavily rely on GPUs to implement complex algorithms due to their implicit programming flexibility and parallelism, which is crucial for efficient operation. However, as integration technologies advance, there is a growing concern regarding the potential increase in fault sensitivity of the internal components of current GPU generations. In particular, Special Function Unit (SFU) cores inside GPUs are used in multimedia, High-Performance Computing, and neural network training. Despite their frequent usage and critical role in several domains, reliability evaluations on SFUs and the development of effective mitigation solutions have yet to be studied and remain unexplored. This work evaluates the impact of transient faults in the main hardware structures of SFUs in GPUs. In addition, we analyze the main overhead costs and benefits of developing selective-hardening mechanisms for SFUs. We focus on evaluating and analyzing two SFU architectures for GPUs (’fused’ and ’modular’) and their relations to energy, area, and reliability impact on parallel applications. The experiments resort to fine-grain fault injection campaigns on an RTL GPU model (FlexGripPlus) instrumented with both SFUs. The results on both SFU architectures indicate that fused SFUs (in commercial-grade devices) require lower area overhead (about 27%) for their integration in GPUs but are more vulnerable to transient faults (in up to 47% for the analyzed cases) and less power efficient (in up to 36.6%) than modular SFUs. Moreover, the reliability estimation shows that Modular SFUs are structurally more resilient than Fused ones in up to one order of magnitude. Similarly, selective-hardening mechanism based on Triple-Modular Redundancy (TMR) shows that coarse-grain strategies might increase the reliability of the overall SFUs under feasible overhead costs.

Investigating and Reducing the Architectural Impact of Transient Faults in Special Function Units for GPUs / Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Patiño Núñez, Edwar J.; Limas, Robert; Sonza Reorda, Matteo. - In: JOURNAL OF ELECTRONIC TESTING. - ISSN 0923-8174. - ELETTRONICO. - 40:(2024), pp. 215-228. [10.1007/s10836-024-06107-9]

Investigating and Reducing the Architectural Impact of Transient Faults in Special Function Units for GPUs

Rodriguez Condia, Josie E.;Guerrero-Balaguera, Juan-David;Patiño Núñez, Edwar J.;Limas, Robert;Sonza Reorda, Matteo

2024

Abstract

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno del prodotto
	
				2024
			
	Codice DOI
	
				https://dx.doi.org/10.1007/s10836-024-06107-9
			
	Titolo della Rivista
	
				JOURNAL OF ELECTRONIC TESTING
			
	Appare nelle tipologie
	
				1.1 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
s10836-024-06107-9.pdf accesso aperto Tipologia: 2a Post-print versione editoriale / Version of Record Licenza: Creative commons Dimensione 2.29 MB Formato Adobe PDF Visualizza/Apri	2.29 MB	Adobe PDF	Visualizza/Apri

Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11583/2987204

Nome	Dominio	Durata	Descrizione
s_.*	plu.mx	sessione	recupero grafico citazioni sociali da plumx
A_.*	core.ac.uk	7 giorni	recupero pubblicazioni consigliate per il pannello core-recommander
GS_.*	gstatic.com	richiesta http	visualizza grafico citazioni
CC_.*	creativecommons.org	richiesta http	visualizza licenza bitstream

PORTO @ Archivio Istituzionale della Ricerca

Investigating and Reducing the Architectural Impact of Transient Faults in Special Function Units for GPUs

Rodriguez Condia, Josie E.;Guerrero-Balaguera, Juan-David;Patiño Núñez, Edwar J.;Limas, Robert;Sonza Reorda, Matteo

2024

Abstract

Scheda breve Scheda completa Scheda completa (DC)

Pubblicazioni consigliate

Informazioni

Conferma cancellazione

Scheda breve

Scheda completa

Scheda completa (DC)