Across different Deep Learning (DL) applications, or within the same application in different phases, the bit-width precision of activations and weights may vary. Moreover, the energy and latency of multiply-accumulate (MAC) units must be minimized, especially at the edge. Hence, various precision-scalable MAC units optimized for DL have recently emerged. Our contribution is a new precision-configurable multiplier/dot-product unit based on a modified Radix-4 Booth signed multiplier with Sum-Together (ST) mode. Besides 16-bit full-precision multiplications, it can be reconfigured to perform dot products among two 8-bit or four 4-bit sub-words of the input operands without requiring an external adder, thus reducing the number of cycles of MAC operations. Synthesis results for performance, power, and area in a 28-nm technology show that our unit (1) is superior to other state-of-the-art ST multipliers in area (≈35% less) in the clock frequency range between 100 and 1000 MHz and (2) reduces latency by up to 4× when used to compute a convolutional layer, at the cost of limited overheads in area (+10%) and power (+13%) compared to a conventional 16-bit Booth multiplier. This unit can play an important role in designing variable-precision MAC units or DL accelerators for edge devices.
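The Sum-Together behavior described in the abstract can be illustrated with a small functional model. The sketch below is only a bit-level reference for what the unit computes, not the modified Radix-4 Booth architecture itself. The function names (`to_signed`, `split_subwords`, `pack`, `st_multiply`, `mac_cycles`) and the sub-word packing order (least-significant field first) are assumptions made for illustration, and the cycle count is an idealized estimate that assumes one ST operation per cycle.

```python
from typing import List

WORD_BITS = 16  # operand width stated in the abstract


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def split_subwords(word: int, width: int) -> List[int]:
    """Split a 16-bit word into signed sub-words (assumed order: LSB field first)."""
    return [to_signed(word >> shift, width) for shift in range(0, WORD_BITS, width)]


def pack(values: List[int], width: int) -> int:
    """Pack signed sub-words into one 16-bit word (hypothetical helper)."""
    if len(values) * width != WORD_BITS:
        raise ValueError("sub-words must fill the 16-bit word exactly")
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << width) - 1)) << (i * width)
    return word


def st_multiply(x: int, w: int, width: int) -> int:
    """Functional model of ST mode: sum of products of matching sub-words.

    width=16 gives a plain signed multiplication; width=8 or width=4 gives
    a two-term or four-term dot product, with no external adder needed.
    """
    if width not in (16, 8, 4):
        raise ValueError("supported precisions are 16, 8 and 4 bits")
    pairs = zip(split_subwords(x, width), split_subwords(w, width))
    return sum(a * b for a, b in pairs)


def mac_cycles(n_products: int, width: int) -> int:
    """Idealized cycle count, assuming one ST operation per cycle."""
    products_per_cycle = WORD_BITS // width
    return -(-n_products // products_per_cycle)  # ceiling division


# Four 4-bit products summed in one operation: 3*1 + (-2)*5 + 7*(-4) + (-8)*2
print(st_multiply(pack([3, -2, 7, -8], 4), pack([1, 5, -4, 2], 4), 4))  # -51
# Idealized cycles for 36 products at 16-, 8- and 4-bit precision
print([mac_cycles(36, b) for b in (16, 8, 4)])  # [36, 18, 9]
```

Under these assumptions, the 4-bit mode needs a quarter of the cycles of the 16-bit mode for the same number of products, which is consistent with the "up to 4×" latency reduction reported for a convolutional layer.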
A Reconfigurable Multiplier/Dot-Product Unit for Precision-Scalable Deep Learning Applications / Urbinati, Luca; Casu, Mario R. - ELECTRONIC. - 1005:(2023), pp. 9-14. (Paper presented at the 53rd Annual Meeting of the Italian Electronics Society, held in Pizzo (VV), Italy, September 7-9, 2022) [10.1007/978-3-031-26066-7_2].
A Reconfigurable Multiplier/Dot-Product Unit for Precision-Scalable Deep Learning Applications
Luca Urbinati; Mario R. Casu
2023
File | Description | Type | License | Access | Size | Format |
---|---|---|---|---|---|---|
sie22_urbinati_7699_preprint.pdf | Pre-print version | 2. Post-print / Author's Accepted Manuscript | Public - All rights reserved | Open access since 29/02/2024 | 1.1 MB | Adobe PDF |
Urbinati_and_Casu_2023_A_Reconfigurable_MultiplierDot-Product_Unit_for_Precision-scalable_Deep_Learning_Applications.pdf | Post-print version | 2a. Post-print, publisher's version / Version of Record | Non-public - Private/restricted access | Restricted access | 339.29 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2977769