General-purpose graphics processing units (GPGPUs) are extensively used in high-performance computing. However, it is well known that the reliability of these devices can be limited by faults arising at the hardware level. This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units of these parallel devices. The solution adds spare modules that perform two in-field operations: fault detection and fault mitigation. It exploits the regularity of the device's execution units to avoid significant design changes and to reduce overhead. The solution was evaluated in terms of reliability improvement and of area, performance, and power overhead, using an open-source micro-architectural GPGPU model (FlexGripPlus). Experimental results show that the proposed solution can extend reliability by up to 57%, with overhead costs lower than 2% in area and 8% in power.
DYRE: a DYnamic REconfigurable solution to increase GPGPU's reliability / Rodriguez Condia, Josie E.; Narducci, Pierpaolo; Sonza Reorda, Matteo; Sterpone, Luca. - In: THE JOURNAL OF SUPERCOMPUTING. - ISSN 0920-8542. - ELECTRONIC. - (2021).
|Title:||DYRE: a DYnamic REconfigurable solution to increase GPGPU's reliability|
|Publication date:||2021|
|Digital Object Identifier (DOI):||http://dx.doi.org/10.1007/s11227-021-03751-2|
|Appears in types:||1.1 Journal article|
Files in this item:
|Condia2021_Article_DYREADYnamicREconfigurableSolu.pdf||2a Post-print, publisher's version / Version of Record||Visible to all|