Computing capability demand has grown massively in recent years. Modern GPU chips are designed to deliver extreme performance for graphics and for data-parallel general purpose computing workloads (GPGPU computing) as well. Many GPGPU applications require high reliability, thus reliability evaluation has become a crucial step during their design. State-of-the-art techniques to assess the reliability of a system are fault injection and ACE analysis. The former can produce accurate results despite eternal time while the latter is very fast but it lacks accuracy of the results. In this paper we introduce a new sampling methodology based on cluster sampling that enables the exploitation of ACE analysis to accelerate the fault injection process. In our experiments we demonstrate that state-of-the-art fault injection techniques, generating random faults according to a uniform distribution, is outperformed by the proposed sampling technique, thus enabling several advantages in terms of accuracy and evaluation time. To quantify the introduced benefits we analyzed the micro-architecture reliability of an AMD Southern Islands GPU in presence of single bit upset affecting the vector register file for 6 benchmarks. One of the most important achievements is that considering all the benchmarks, on average, we are one order of magnitude faster/more accurate than uniform-sampling-based techniques in case of non exhaustive fault injection campaigns, while more than two orders of magnitude in case of exhaustive campaigns.

Combining cluster sampling and ACE analysis to improve fault-injection based reliability evaluation of GPU-based systems / Vallero, A.; Di Carlo, S.. - STAMPA. - (2019), pp. 8138-8143. (Intervento presentato al convegno 32nd IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2019 tenutosi a Noordwijk, Netherlands nel 2-4 Oct. 2019) [10.1109/DFT.2019.8875392].

Combining cluster sampling and ACE analysis to improve fault-injection based reliability evaluation of GPU-based systems

Vallero A.;Di Carlo S.
2019

Abstract

Computing capability demand has grown massively in recent years. Modern GPU chips are designed to deliver extreme performance for graphics and for data-parallel general purpose computing workloads (GPGPU computing) as well. Many GPGPU applications require high reliability, thus reliability evaluation has become a crucial step during their design. State-of-the-art techniques to assess the reliability of a system are fault injection and ACE analysis. The former can produce accurate results despite eternal time while the latter is very fast but it lacks accuracy of the results. In this paper we introduce a new sampling methodology based on cluster sampling that enables the exploitation of ACE analysis to accelerate the fault injection process. In our experiments we demonstrate that state-of-the-art fault injection techniques, generating random faults according to a uniform distribution, is outperformed by the proposed sampling technique, thus enabling several advantages in terms of accuracy and evaluation time. To quantify the introduced benefits we analyzed the micro-architecture reliability of an AMD Southern Islands GPU in presence of single bit upset affecting the vector register file for 6 benchmarks. One of the most important achievements is that considering all the benchmarks, on average, we are one order of magnitude faster/more accurate than uniform-sampling-based techniques in case of non exhaustive fault injection campaigns, while more than two orders of magnitude in case of exhaustive campaigns.
2019
978-1-7281-2260-1
File in questo prodotto:
File Dimensione Formato  
conference_041818.pdf

accesso aperto

Tipologia: 2. Post-print / Author's Accepted Manuscript
Licenza: PUBBLICO - Tutti i diritti riservati
Dimensione 440.47 kB
Formato Adobe PDF
440.47 kB Adobe PDF Visualizza/Apri
08875392.pdf

non disponibili

Tipologia: 2a Post-print versione editoriale / Version of Record
Licenza: Non Pubblico - Accesso privato/ristretto
Dimensione 555.49 kB
Formato Adobe PDF
555.49 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11583/2785914