Microarchitectural Reliability Evaluation of a Block Scheduling Controller in GPUs / Rodriguez Condia, Josie E.; Faggiano, R; Reorda, MS. - (2022), pp. 26-31. (Paper presented at the IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2022), held in Nicosia, Cyprus, 4-6 July 2022) [10.1109/ISVLSI54635.2022.00018].
Microarchitectural Reliability Evaluation of a Block Scheduling Controller in GPUs
Rodriguez Condia, Josie E.; Faggiano, R; Reorda, MS
2022
Abstract
Graphics Processing Units (GPUs) are currently adopted in several domains with substantial reliability requirements, such as automotive and robotics. Thus, evaluating the impact of possible faults affecting the internal components of a device is a crucial step towards developing products certified according to industrial standards (e.g., ISO 26262). Block scheduling controllers play an important role in resource management and task operation in GPUs. Hence, understanding the sensitivity of such modules to faults is crucial for developing mitigation mechanisms and effective countermeasures. This work evaluates the impact of transient faults on the block controller in a GPU. For this purpose, we extended a low-level micro-architecture GPU model (FlexGripPlus) to support the management of the different execution cores (i.e., the Streaming Multiprocessors or SIMD Engines) and to allow the analysis of fault effects. A set of typical workloads was employed in the reliability evaluation. The experimental results show that the scheduler is most vulnerable to faults arising during the device's configuration and during the exchange of tasks from an application. Moreover, when considering faults in the controller, multi-core GPUs appear to be less sensitive than single-core GPUs. Finally, the parallel distribution of tasks (in blocks) also plays a significant role in the scheduler's vulnerability to faults.

File | Availability | Type | License | Size | Format
---|---|---|---|---|---
Microarchitectural_Reliability_Evaluation_of_a_Block_Scheduling_Controller_in_GPUs.pdf | Not available | 2a Post-print, publisher's version / Version of Record | Not public (private/restricted access) | 857.35 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2978951