Risso, Matteo; Burrello, Alessio; Sarda, Giuseppe Maria; Benini, Luca; Macii, Enrico; Poncino, Massimo; Verhelst, Marian; Jahier Pagliari, Daniele (2023). Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference. Paper presented at the International Symposium on Low Power Electronics and Design (ISLPED) 2023, Vienna, Austria, 7-8 August 2023, pp. 1-6. DOI: 10.1109/ISLPED58423.2023.10244311
Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference
Matteo Risso; Alessio Burrello; Giuseppe Maria Sarda; Luca Benini; Enrico Macii; Massimo Poncino; Marian Verhelst; Daniele Jahier Pagliari
2023
Abstract
The need to execute Deep Neural Networks (DNNs) at low latency and low power at the edge has spurred the development of new heterogeneous Systems-on-Chips (SoCs) encapsulating a diverse set of hardware accelerators. How to optimally map a DNN onto such multi-accelerator systems is an open problem. We propose ODiMO, a hardware-aware tool that performs a fine-grain mapping across different accelerators on-chip, splitting individual layers and executing them in parallel, to reduce inference energy consumption or latency, while taking into account each accelerator's quantization precision to maintain accuracy. Pareto-optimal networks in the accuracy vs. energy or latency space are pursued for three popular dataset/DNN pairs, and deployed on the DIANA heterogeneous ultra-low power edge AI SoC. We show that ODiMO reduces energy/latency by up to 33%/31% with limited accuracy drop (-0.53%/-0.32%) compared to manual heuristic mappings.
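For intuition, the sketch below illustrates the kind of per-layer mapping decision the abstract describes: splitting one layer's output channels between two engines that run in parallel, as on DIANA, which pairs a digital accelerator with an analog in-memory core. This is a minimal sketch, not ODiMO itself: all names (`Accelerator`, `split_cost`, `best_split_for_latency`) and cost numbers are hypothetical placeholders, and the actual tool optimizes the mapping jointly with DNN training so that each engine's quantization precision is weighed against accuracy.

```python
# Minimal sketch of fine-grain layer splitting across two accelerators.
# All cost figures and identifiers below are hypothetical, chosen only
# to illustrate the trade-off; they are not values from the paper or DIANA.

from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    precision_bits: int        # quantization precision of this engine
    cycles_per_channel: float  # hypothetical latency cost per output channel
    energy_per_channel: float  # hypothetical energy cost per output channel (uJ)

DIGITAL = Accelerator("digital-simd", 8, cycles_per_channel=4.0, energy_per_channel=0.8)
ANALOG  = Accelerator("analog-imc",   2, cycles_per_channel=1.5, energy_per_channel=0.2)

def split_cost(n_channels: int, k: int) -> tuple[float, float]:
    """Cost of assigning k output channels to DIGITAL and the rest to ANALOG.

    The two engines execute concurrently, so latency is the max of the
    two partial latencies, while energy is the sum.
    """
    lat = max(k * DIGITAL.cycles_per_channel,
              (n_channels - k) * ANALOG.cycles_per_channel)
    en = (k * DIGITAL.energy_per_channel
          + (n_channels - k) * ANALOG.energy_per_channel)
    return lat, en

def best_split_for_latency(n_channels: int) -> int:
    """Exhaustively pick the channel split that minimizes layer latency."""
    return min(range(n_channels + 1),
               key=lambda k: split_cost(n_channels, k)[0])

if __name__ == "__main__":
    n = 64  # output channels of one convolutional layer
    k = best_split_for_latency(n)
    lat, en = split_cost(n, k)
    print(f"{k} channels -> {DIGITAL.name} ({DIGITAL.precision_bits}b), "
          f"{n - k} -> {ANALOG.name} ({ANALOG.precision_bits}b): "
          f"latency {lat:.1f} cycles, energy {en:.2f} uJ")
```

Because the engines run concurrently, latency composes as a max while energy composes as a sum, which is why the latency-optimal and energy-optimal splits generally differ and a Pareto front arises once accuracy is added as a third objective.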
| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| ISLPED23___ODiMO__arXiv_.pdf | Not available (request a copy) | 1. Preprint / submitted version (pre-review) | Non-public, private/restricted access | 826.1 kB | Adobe PDF |
| 135_Final_Manuscript (1).pdf | Open access | 2. Post-print / Author's Accepted Manuscript | Public, all rights reserved | 812.33 kB | Adobe PDF |
| Precision-aware_Latency_and_Energy_Balancing_on_Multi-Accelerator_Platforms_for_DNN_Inference.pdf | Not available (request a copy) | 2a. Post-print, editorial version / Version of Record | Non-public, private/restricted access | 863.91 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2979263