Effective Application-level Error Modeling of Permanent Faults on AI Accelerators / Pessia, Francesco; Guerrero-Balaguera, Juan-David; Limas Sierra, Robert; Rodriguez Condia, Josie E.; Levorato, Marco; Sonza Reorda, Matteo. - ELECTRONIC. - (2024), pp. 1-7. (Paper presented at the International Symposium on On-Line Testing and Robust System Design (IOLTS), held in Rennes (FRA), 03-05 July 2024) [10.1109/iolts60994.2024.10616087].

Effective Application-level Error Modeling of Permanent Faults on AI Accelerators

Pessia, Francesco; Guerrero-Balaguera, Juan-David; Limas Sierra, Robert; Rodriguez Condia, Josie E.; Levorato, Marco; Sonza Reorda, Matteo
2024

Abstract

The deployment of Machine Learning (ML) applications relies heavily on Matrix Multiplication (MM) operations running on modern accelerators such as Graphics Processing Units (GPUs), which employ Tensor Core Units (TCUs) to execute MMs efficiently. However, reliability concerns arise in devices built with cutting-edge semiconductor technologies (7 nm or less), as faults can compromise structures such as TCUs during operation. In safety-critical applications, such faults can produce wrong Deep Neural Network (DNN) outcomes and cause unpredictable and unacceptable actions. Evaluating the impact of these faults is therefore crucial to ensure that TCUs and GPUs meet the requirements of safety standards (e.g., ISO 26262). Currently, the reliability of complex applications with respect to hardware faults is assessed through fault injection (FI) campaigns. Unfortunately, low-level FI campaigns can be computationally prohibitive for GPUs executing massive applications like DNNs. In this work, we propose an error modeling approach that accurately describes the corruptions caused by permanent faults on TCUs during the operation of MMs. This approach enables realistic reliability evaluations of computationally expensive MM-based workloads and is up to 225× faster than hardware-level FI. Our experimental results show good accuracy, with up to 93% correlation between our error modeling approach and FI campaigns conducted on TCUs.
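
To make the idea of application-level error modeling concrete, the sketch below applies a permanent stuck-at fault to the results of a software matrix multiplication and lists the corrupted output positions. It is an illustrative assumption, not the model proposed in the paper: the tile size `TILE`, the fault tuple layout, and the helpers `stuck_at`, `matmul`, and `corrupted_positions` are hypothetical names, and a real TCU fault may corrupt outputs in other patterns.

```python
import struct

TILE = 4  # hypothetical tile size; real TCU tile shapes depend on the architecture


def stuck_at(value, bit, stuck_to):
    """Force one bit of the float32 encoding of `value` to 0 or 1."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    raw = raw | (1 << bit) if stuck_to else raw & ~(1 << bit)
    (out,) = struct.unpack("<f", struct.pack("<I", raw))
    return out


def matmul(a, b, fault=None):
    """Plain matrix multiply.

    `fault` is an assumed tuple (row_in_tile, col_in_tile, bit, stuck_to):
    every output element that maps to that position inside a tile gets the
    same stuck-at bit, mimicking a permanent fault in one hardware lane.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = sum((a[i][p] * b[p][j] for p in range(k)), 0.0)
            if fault and (i % TILE, j % TILE) == fault[:2]:
                acc = stuck_at(acc, fault[2], fault[3])
            c[i][j] = acc
    return c


def corrupted_positions(golden, faulty):
    """Return (row, col) pairs where the faulty result differs from the golden one."""
    return [
        (i, j)
        for i, row in enumerate(golden)
        for j, g in enumerate(row)
        if g != faulty[i][j]
    ]


# Small integer-valued inputs keep every result exactly representable in float32.
a = [[float(i + j) for j in range(8)] for i in range(8)]
b = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]

golden = matmul(a, b)
faulty = matmul(a, b, fault=(1, 2, 31, 1))  # sign bit stuck at 1 in lane (1, 2)
hits = corrupted_positions(golden, faulty)
```

Because the fault is permanent, the same lane is hit in every tile, so `hits` repeats with the tile period. Comparing such software-level corruption patterns against hardware-level FI results is the kind of correlation the abstract reports.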
ISBN: 979-8-3503-7055-3
Files in this item:
File: Effective_Application-level_Error_Modeling_of_Permanent_Faults_on_AI_Accelerators.pdf (not available)
Type: 2a Post-print, publisher's version / Version of Record
License: Non-public - private/restricted access
Size: 565.78 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2991567