Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference / Risso, Matteo; Burrello, Alessio; Sarda, Giuseppe Maria; Benini, Luca; Macii, Enrico; Poncino, Massimo; Verhelst, Marian; Jahier Pagliari, Daniele. - ELECTRONIC. - (2023), pp. 1-6. (Paper presented at the International Symposium on Low Power Electronics and Design (ISLPED) 2023, held in Vienna, Austria, on 7-8 August 2023) [10.1109/ISLPED58423.2023.10244311].

Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference

Matteo Risso; Alessio Burrello; Giuseppe Maria Sarda; Luca Benini; Enrico Macii; Massimo Poncino; Marian Verhelst; Daniele Jahier Pagliari
2023

Abstract

The need to execute Deep Neural Networks (DNNs) at low latency and low power at the edge has spurred the development of new heterogeneous Systems-on-Chips (SoCs) encapsulating a diverse set of hardware accelerators. How to optimally map a DNN onto such multi-accelerator systems is an open problem. We propose ODiMO, a hardware-aware tool that performs a fine-grain mapping across different accelerators on-chip, splitting individual layers and executing them in parallel, to reduce inference energy consumption or latency, while taking into account each accelerator's quantization precision to maintain accuracy. Pareto-optimal networks in the accuracy vs. energy or latency space are pursued for three popular dataset/DNN pairs, and deployed on the DIANA heterogeneous ultra-low power edge AI SoC. We show that ODiMO reduces energy/latency by up to 33%/31% with limited accuracy drop (-0.53%/-0.32%) compared to manual heuristic mappings.
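To make the abstract's central idea concrete, the sketch below illustrates splitting a single layer's output channels across two accelerators that operate at different quantization precisions. It is a hypothetical reconstruction in PyTorch, not ODiMO's actual code: the names SplitConv2d and fake_quant, the 8-bit/2-bit precision pair, and the fixed split ratio are all illustrative assumptions; the real tool derives such mappings automatically, whereas here the ratio is set by hand.

    # Minimal, hypothetical sketch (not the authors' ODiMO implementation):
    # one convolutional layer's output channels are split between two
    # "accelerators" with different weight precisions, assumed here to be
    # an 8-bit and a 2-bit datapath. The slices run independently and
    # their partial results are concatenated channel-wise.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
        """Symmetric per-tensor fake quantization to `bits` bits (illustrative)."""
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    class SplitConv2d(nn.Module):
        """Conv layer whose output channels are partitioned between two
        'accelerators'; `split_ratio` is the fraction mapped to the first."""
        def __init__(self, in_ch, out_ch, k, split_ratio=0.5, bits=(8, 2)):
            super().__init__()
            out_a = max(1, int(out_ch * split_ratio))  # slice for accelerator A
            out_b = out_ch - out_a                     # slice for accelerator B
            self.pad = k // 2
            self.conv_a = nn.Conv2d(in_ch, out_a, k, padding=self.pad)
            self.conv_b = nn.Conv2d(in_ch, out_b, k, padding=self.pad)
            self.bits = bits

        def forward(self, x):
            # Each slice sees the full input but uses the precision of the
            # accelerator it is mapped to; on real hardware the two
            # convolutions would execute in parallel.
            y_a = F.conv2d(x, fake_quant(self.conv_a.weight, self.bits[0]),
                           self.conv_a.bias, padding=self.pad)
            y_b = F.conv2d(x, fake_quant(self.conv_b.weight, self.bits[1]),
                           self.conv_b.bias, padding=self.pad)
            return torch.cat([y_a, y_b], dim=1)  # merge partial results

    # Usage: 75% of the 32 output channels on the 8-bit datapath, 25% on 2-bit.
    layer = SplitConv2d(16, 32, 3, split_ratio=0.75)
    out = layer(torch.randn(1, 16, 8, 8))
    assert out.shape == (1, 32, 8, 8)

Concatenating along the channel dimension keeps the split transparent to the rest of the network, which is what allows the two slices to execute in parallel on different accelerators without changing the model's interface.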
ISBN: 979-8-3503-1175-4
Files for this item:

ISLPED23___ODiMO__arXiv_.pdf
Access: Not available (copy available on request)
Type: 1. Preprint / submitted version (pre-review)
License: Non-public - Private/restricted access
Size: 826.1 kB
Format: Adobe PDF

135_Final_Manuscript (1).pdf
Access: Open access
Type: 2. Post-print / Author's Accepted Manuscript
License: Public - All rights reserved
Size: 812.33 kB
Format: Adobe PDF

Precision-aware_Latency_and_Energy_Balancing_on_Multi-Accelerator_Platforms_for_DNN_Inference.pdf
Access: Not available (copy available on request)
Type: 2a. Post-print editorial version / Version of Record
License: Non-public - Private/restricted access
Size: 863.91 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2979263