With the rise of embodied foundation models (EFMs), most notably small language models (SLMs), adapting Transformers for the edge applications has become a very active field of research. However, achieving the end-to-end deployment of SLMs on the microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this article, we demonstrate high efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multidimensional memory versus computation tradeoffs involved in the aggressive SLM deployment on the heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel deep neural network (DNN) compiler, which generates highly optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates the end-to-end code for executing SLMs, fully exploiting the RV32 cores' instruction extensions and the NPU. We achieve leading-edge energy and throughput of 490 μ J per token, at 340 token per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without the external memory.
Deeploy: Enabling Energy-Efficient Deployment of Small Language Models on Heterogeneous Microcontrollers / Scherer, Moritz; Macan, Luka; Jung, Victor J. B.; Wiese, Philip; Bompani, Luca; Burrello, Alessio; Conti, Francesco; Benini, Luca. - In: IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS. - ISSN 0278-0070. - 43:11(2024), pp. 4009-4020. [10.1109/tcad.2024.3443718]
Deeploy: Enabling Energy-Efficient Deployment of Small Language Models on Heterogeneous Microcontrollers
Burrello, Alessio;
2024
Abstract
With the rise of embodied foundation models (EFMs), most notably small language models (SLMs), adapting Transformers for the edge applications has become a very active field of research. However, achieving the end-to-end deployment of SLMs on the microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this article, we demonstrate high efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multidimensional memory versus computation tradeoffs involved in the aggressive SLM deployment on the heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel deep neural network (DNN) compiler, which generates highly optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates the end-to-end code for executing SLMs, fully exploiting the RV32 cores' instruction extensions and the NPU. We achieve leading-edge energy and throughput of 490 μ J per token, at 340 token per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without the external memory.File | Dimensione | Formato | |
---|---|---|---|
2408.04413v1.pdf
accesso aperto
Tipologia:
2. Post-print / Author's Accepted Manuscript
Licenza:
Pubblico - Tutti i diritti riservati
Dimensione
983.84 kB
Formato
Adobe PDF
|
983.84 kB | Adobe PDF | Visualizza/Apri |
Deeploy_Enabling_Energy-Efficient_Deployment_of_Small_Language_Models_on_Heterogeneous_Microcontrollers.pdf
accesso riservato
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Non Pubblico - Accesso privato/ristretto
Dimensione
1.83 MB
Formato
Adobe PDF
|
1.83 MB | Adobe PDF | Visualizza/Apri Richiedi una copia |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2996574