Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training / Fernandes dos Santos, Fernando; Cavagnero, Niccolo; Ciccone, Marco; Averta, Giuseppe; Kritikakou, Angeliki; Sentieys, Olivier; Rech, Paolo; Tommasi, Tatiana. - In: IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING. - ISSN 2168-6750. - (2025). [10.1109/TETC.2024.3520672]

Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training

Fernando Fernandes dos Santos; Niccolo Cavagnero; Marco Ciccone; Giuseppe Averta; Angeliki Kritikakou; Olivier Sentieys; Paolo Rech; Tatiana Tommasi
2025

Abstract

Deep Neural Networks (DNNs) have revolutionized several fields, including safety- and mission-critical applications such as autonomous driving and space exploration. However, recent studies have highlighted that transient hardware faults can corrupt the model's output, leading to high misprediction probabilities. Since traditional reliability strategies, based on modular hardware, software replication, or matrix multiplication checksums, impose high overhead, there is a pressing need for efficient and effective hardening solutions tailored to DNNs. In this paper, we present several network design choices and a training procedure that increase the robustness of standard deep models, and we thoroughly evaluate these strategies with experimental analyses on vision classification tasks. We name DieHardNet the specialized DNN obtained by applying all our hardening techniques, which combine knowledge from experimental hardware fault characterization and machine learning studies. We conduct extensive ablation studies to quantify the reliability gain of each hardening component in DieHardNet. We perform over 10,000 instruction-level fault injections to validate our approach and expose DieHardNet, executed on GPUs, to an accelerated neutron beam equivalent to more than 570,000 years of natural radiation. Our evaluation demonstrates that DieHardNet can reduce the critical error rate (i.e., errors that modify the inference outcome) by up to 100 times compared to the unprotected baseline model, without increasing inference time.
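The abstract names two ingredients without detailing them: fault-aware design choices and a fault-injection-based reliability evaluation. The PyTorch sketch below illustrates the general flavor of both under stated assumptions; it is not the authors' implementation, and the helpers (harden_activations, flip_random_bit, critical_error_rate) are hypothetical names. Bounding activations (here with ReLU6 as a stand-in for a hardened design choice) limits how far a corrupted value can propagate, and flipping random bits in float32 weights gives a software-level proxy for the critical error rate measured in the paper.

```python
# Illustrative sketch only, not the DieHardNet recipe from the paper.
# Assumptions: float32 weights; ReLU6 as an example of a bounded activation;
# single-bit weight flips as a crude stand-in for transient hardware faults.
import random
import struct

import torch
import torch.nn as nn


def harden_activations(module: nn.Module) -> None:
    """Recursively replace unbounded ReLUs with the bounded ReLU6, so a
    corrupted activation is clamped to [0, 6] instead of propagating an
    arbitrarily large value through downstream layers."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.ReLU6(inplace=True))
        else:
            harden_activations(child)


def flip_random_bit(t: torch.Tensor) -> None:
    """Flip one random bit of one random float32 element, in place."""
    flat = t.view(-1)
    i = random.randrange(flat.numel())
    bits = struct.unpack("<I", struct.pack("<f", float(flat[i])))[0]
    bits ^= 1 << random.randrange(32)
    flat[i] = struct.unpack("<f", struct.pack("<I", bits))[0]


@torch.no_grad()
def critical_error_rate(model: nn.Module, x: torch.Tensor, trials: int = 1000) -> float:
    """Fraction of single-bit weight flips that change the predicted class,
    i.e., a software proxy for the 'critical errors' of the abstract."""
    clean = model(x).argmax(dim=1)
    params = list(model.parameters())
    hits = 0
    for _ in range(trials):
        p = random.choice(params)
        saved = p.detach().clone()   # restore the weight after each trial
        flip_random_bit(p.data)
        if not torch.equal(model(x).argmax(dim=1), clean):
            hits += 1
        p.data.copy_(saved)
    return hits / trials
```

Under these assumptions, one would compare critical_error_rate(model, batch) before and after harden_activations(model) on the same inputs; a drop in the rate mirrors, at a much smaller scale, the kind of improvement the paper quantifies with instruction-level fault injection and neutron-beam experiments.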
Files in this record:

File: tetc_2023_diehardnet.pdf (restricted access)
Type: 2. Post-print / Author's Accepted Manuscript
License: Non-public - Private/restricted access
Size: 3.45 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2996545