
Risso, M.; Burrello, A.; Benini, L.; Macii, E.; Poncino, M.; Jahier Pagliari, D. (2022). Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes. In: Proceedings of the 13th IEEE International Green and Sustainable Computing Conference (IGSC 2022), Pittsburgh, PA, USA, pp. 1-6. DOI: 10.1109/IGSC55832.2022.9969373

Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes

Risso M.; Burrello A.; Macii E.; Poncino M.; Jahier Pagliari D.
2022

Abstract

Quantization is widely employed in both cloud and edge systems to reduce the memory occupation, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains with limited accuracy drops, especially with optimized bit-width assignments determined by automated Neural Architecture Search (NAS) tools. State-of-the-art mixed-precision quantization works layer-wise, i.e., it uses different bit-widths for the weight and activation tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs. model size and accuracy vs. energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce the memory and energy for inference by up to 63% and 27%, respectively, compared to a layer-wise approach at the same accuracy.
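
To make the channel-wise search described in the abstract concrete, below is a minimal, illustrative sketch in PyTorch of a DNAS-style bit-width search: each output channel of a convolution owns trainable logits over candidate bit-widths, weights are fake-quantized per channel with a straight-through estimator, and a differentiable model-size proxy can be added to the task loss. This is not the paper's actual code or API; all names here (MixPrecConv2d, fake_quantize, size_penalty, CANDIDATE_BITS) are hypothetical, and the candidate precision set is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

CANDIDATE_BITS = (2, 4, 8)  # candidate precisions per weight channel (assumption)

def fake_quantize(w, bits):
    # Per-output-channel symmetric fake quantization, with a straight-through
    # estimator (STE) so gradients flow through the rounding step.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=(1, 2, 3), keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    q = (q - w / scale).detach() + w / scale  # STE: identity gradient
    return q * scale

class MixPrecConv2d(nn.Module):
    # Conv2d whose weight bit-width is searched independently per output channel.
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        # Trainable architecture logits: one row per channel, one column per bit-width.
        self.alpha = nn.Parameter(torch.zeros(out_ch, len(CANDIDATE_BITS)))

    def forward(self, x):
        probs = F.softmax(self.alpha, dim=1)               # (out_ch, n_bits)
        w = self.conv.weight                               # (out_ch, in_ch, k, k)
        # Softmax-weighted sum of each channel quantized at every candidate precision.
        w_q = sum(probs[:, i].view(-1, 1, 1, 1) * fake_quantize(w, b)
                  for i, b in enumerate(CANDIDATE_BITS))
        return F.conv2d(x, w_q, self.conv.bias, padding=self.conv.padding)

    def size_penalty(self):
        # Differentiable proxy of this layer's weight memory, in bits:
        # expected bit-width per channel times parameters per channel.
        probs = F.softmax(self.alpha, dim=1)
        bits = torch.tensor(CANDIDATE_BITS, dtype=probs.dtype, device=probs.device)
        return (probs @ bits).sum() * self.conv.weight[0].numel()

In a search of this kind, the task loss would be augmented with a regularizer such as lam * sum(m.size_penalty() for m in model.modules() if isinstance(m, MixPrecConv2d)), with lam trading accuracy against size; after convergence, each channel keeps its argmax bit-width. The paper's actual method additionally targets MLPerf Tiny models and deployment on the MPIC processor, which this toy sketch does not capture.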
ISBN: 978-1-6654-6550-2
Files in this record:

Multi_precision_NAS__Camera_Ready_.pdf
Access: open access
Type: 2. Post-print / Author's Accepted Manuscript
License: Public - All rights reserved
Size: 1.07 MB, Adobe PDF

Risso_et_al_2022_Channel-wise_Mixed-precision_Assignment_for_DNN_Inference_on_Constrained_Edge.pdf
Access: not available (copy available on request)
Type: 2a. Post-print editorial version / Version of Record
License: Non-public - Private/restricted access
Size: 1.13 MB, Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2974504