Modern Graphics Processing Units (GPUs) accelerate tiled matrix multiplications by making extensive use of on-chip accelerators (Tensor Core Units, or TCUs). Unfortunately, cutting-edge semiconductor technologies are increasingly prone to faults. Such faults may affect TCUs while they process massive amounts of data in classical floating-point formats, raising reliability concerns when TCUs are used in the safety-critical and High-Performance Computing (HPC) domains. A characterization of faulty TCUs supporting different arithmetic formats is still missing. This work quantitatively evaluates, for the first time, the effects of hardware faults arising in TCU structures when two different formats are used for real number representation (i.e., Floating-Point and Posit). For the experimental evaluation, we use an architectural description of a TCU core (PyOpenTCU) and perform 60 fault simulation campaigns, injecting 57,344 faults per campaign and requiring around 24 days of computation. The experimental results indicate a relation between the corrupted spatial areas in the output matrices and the TCU’s scheduling policies. Moreover, the numeric analysis shows that, in most cases, hardware faults in TCUs affect up to 2 bits in the output results for both formats considered. The results also show that the Posit formats are less affected by faults than the Floating-Point formats, by up to one order of magnitude.

Analyzing the Impact of Different Real Number Formats on the Structural Reliability of TCUs in GPUs / Sierra, Robert Limas; Guerrero-Balaguera, Juan-David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo. - (2023), pp. 1-6. (Paper presented at the IFIP/IEEE Conference on Very Large Scale Integration (VLSI-SoC 2023), held in Dubai (United Arab Emirates), 16-18 October 2023) [10.1109/VLSI-SoC57769.2023.10321881].

Analyzing the Impact of Different Real Number Formats on the Structural Reliability of TCUs in GPUs

Sierra, Robert Limas; Guerrero-Balaguera, Juan-David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo
2023

ISBN: 979-8-3503-2599-7
Files in this item:
File: Analyzing_the_Impact_of_Different_Real_Number_Formats_on_the_Structural_Reliability_of_TCUs_in_GPUs.pdf (not available)
Type: 2a Post-print publisher's version / Version of Record
License: Non-public - Private/restricted access
Size: 1.09 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2985553