Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats / Limas Sierra, Robert Alexander; Guerrero-Balaguera, Juan-David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo (IFIP Advances in Information and Communication Technology). - In: VLSI-SoC 2023: Silicon Innovations for Trustworthy Artificial Intelligence. - [s.l] : Springer, 2024. - ISBN 978-3-031-70946-3. - pp. 149-176 [10.1007/978-3-031-70947-0_8]
Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats
Robert Limas Sierra; Juan-David Guerrero-Balaguera; Josie E. Rodriguez Condia; Matteo Sonza Reorda
2024
Abstract
Modern Graphics Processing Units (GPUs) include on-chip hardware accelerators (Tensor Core Units, or TCUs) to increase the performance of machine learning applications. Unfortunately, cutting-edge semiconductor technologies are increasingly prone to faults that affect devices during their operation. Moreover, the execution of safety-critical and High-Performance Computing (HPC) applications on GPUs strongly stresses crucial resources, such as TCUs, which increases the likelihood of different kinds of failures. Thus, the resilience analysis of GPUs and their critical units (TCUs) is vital in safety-critical domains, e.g., automotive, space, and autonomous robotics, to develop effective countermeasures or improve designs. Recently, new arithmetic formats particularly suited to neural network processing have been proposed. However, an effective reliability characterization of TCUs supporting different arithmetic formats is still missing. In this work, we propose a hierarchical multi-level strategy to assess the effects of permanent faults arising in TCUs inside GPUs when using two number formats, i.e., Floating Point (FP) and Posit. The proposed strategy combines a fine-grain micro-architectural characterization of hardware faults in TCUs with a higher-level structural evaluation to observe the interactions with other GPU structures and the error propagation effects. The micro-architectural characterization relies on two representative descriptions of the main components in TCUs (Dot-Product Units) for both formats (FP and Posit). Then, the fine-grain findings feed a structural TCU model (PyOpenTCU) to propagate and observe the principal error effects. The experimental results show the advantages in performance and accuracy of the proposed multi-level strategy for the reliability assessment of large hardware accelerators, such as TCUs, and identify a relation between the corrupted spatial areas in the output matrices and the TCU’s scheduling policies. Finally, the results demonstrate that Posit formats are less affected by faults than Floating Point formats by several orders of magnitude.

File | Size | Format | |
---|---|---|---|
VLSI_Soc2023_Book_Chapter.pdf (embargoed until 29/12/2026; type: 2. Post-print / Author's Accepted Manuscript; license: Public - All rights reserved) | 24.88 MB | Adobe PDF | |
978-3-031-70947-0_8.pdf (restricted access; type: 2a. Post-print, publisher's version / Version of Record; license: Not public - Private/restricted access) | 1.66 MB | Adobe PDF | |
https://hdl.handle.net/11583/2991415