Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats

Limas Sierra, Robert Alexander; Guerrero-Balaguera, Juan-David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo

doi:10.1007/978-3-031-70947-0_8

Modern Graphics Processing Units (GPUs) include in-chip hardware accelerators (Tensor Core Units, or TCUs) to increase the performance of machine learning applications. Unfortunately, cutting-edge semiconductor technologies are increasingly prone to faults that affect devices during their operation. Moreover, the execution of safety-critical and High-Performance Computing (HPC) applications on GPUs strongly stresses crucial resources, such as TCUs, which increases the likelihood of different kinds of failures. Thus, the resilience analysis of GPUs and their critical units (TCUs) is vital in safety-critical domains, e.g., automotive, space, and autonomous robotics, to develop effective countermeasures or improve designs. Recently, new arithmetic formats particularly suited to neural network processing have been proposed. However, an effective reliability characterization of TCUs supporting different arithmetic formats is still missing. In this work, we propose a hierarchical multi-level strategy to assess the reliability impact of permanent faults arising in TCUs inside GPUs when using two number formats, i.e., Floating Point (FP) and Posit. The proposed strategy combines a fine-grain micro-architectural characterization of hardware faults in TCUs with a higher-level structural evaluation to observe the interactions with other GPU structures and the error propagation effects. The micro-architectural characterization resorts to two representative descriptions of the main components in TCUs (Dot-Product Units), one for each format (FP and Posit). Then, the fine-grain findings feed a structural TCU model (PyOpenTCU) to propagate and observe the principal error effects. The experimental results show the advantages in performance and accuracy of using clever methods for the reliability assessment of large hardware accelerators, such as TCUs, and identify a relation between the corrupted spatial areas in the output matrices and the TCU’s scheduling policies. Finally, the results demonstrate that Posit formats are less affected by faults than Floating Point formats by several orders of magnitude.

Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats / Limas Sierra, Robert Alexander; Guerrero-Balaguera, Juan-David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo. - 680:(2024), pp. 149-176. (31st IFIP WG 10.5/IEEE International Conference on Very Large Scale Integration, VLSI-SoC 2023, Sharjah, United Arab Emirates, October 16–18, 2023) [10.1007/978-3-031-70947-0_8].


Year of publication: 2024
Series title: IFIP ADVANCES IN INFORMATION AND COMMUNICATION TECHNOLOGY
ISBN: 978-3-031-70946-3; 978-3-031-70947-0
Appears in categories: 4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
VLSI_Soc2023_Book_Chapter.pdf (under embargo until 29/12/2026; Type: 2. Post-print / Author's Accepted Manuscript; License: Public - All rights reserved)	24.88 MB	Adobe PDF
978-3-031-70947-0_8.pdf (restricted access; Type: 2a Post-print, publisher's version / Version of Record; License: Non-public - Private/restricted access)	1.66 MB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2991415

PORTO @ Archivio Istituzionale della Ricerca
