PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-core MCUs Through Performance-Driven Autotuning / Nadalini, Davide; Rusci, Manuele; Tagliavini, Giuseppe; Ravaglia, Leonardo; Benini, Luca; Conti, Francesco. - Print. - (2022), pp. 200-216. (Paper presented at the 22nd International Conference, SAMOS 2022, held in Samos, Greece, July 3–7, 2022) [10.1007/978-3-031-15074-6_13].

PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-core MCUs Through Performance-Driven Autotuning

Nadalini, Davide; Benini, Luca
2022

Abstract

An open challenge in making Internet-of-Things sensor nodes "smart" and self-adaptive is to enable on-chip Deep Neural Network (DNN) training on Ultra-Low-Power (ULP) microcontroller units (MCUs). To this end, we present a framework, based on PULP-TrainLib, to deploy DNN training tasks on RISC-V-based Parallel-ULP (PULP) MCUs. PULP-TrainLib is a library of parallel software DNN primitives that enables the execution of forward and backward steps on PULP MCUs. To optimize PULP-TrainLib's kernels, we propose a strategy to automatically select and configure (autotune) the fastest among a set of tiling options and optimized floating-point matrix multiplication kernels, according to the tensor shapes of every DNN layer. Results on an 8-core RISC-V MCU show that our auto-tuned primitives improve MAC/clk by up to 2.4x compared to "one-size-fits-all" matrix multiplication, achieving up to 4.39 MAC/clk, 36.6x better than a commercial STM32L4 MCU executing the same DNN layer training workload. Furthermore, our strategy proves to be 30.7x faster than AIfES, a state-of-the-art training library for MCUs, while training a complete TinyML model.
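
The shape-driven selection described in the abstract can be illustrated with a minimal Python sketch. This is not PULP-TrainLib's autotuner: the kernel names, the tile sizes, the 64 KiB L1 budget, and the toy cycle model are all assumptions made for illustration. A real tuner would presumably replace `toy_cycles` with cycle counts measured on the target MCU or a simulator.

```python
import math
from itertools import product

# Hypothetical kernel variants: name -> (row unroll, column unroll).
KERNELS = {"mm_naive": (1, 1), "mm_unroll_2x1": (2, 1), "mm_unroll_2x4": (2, 4)}
TILE_SIZES = (16, 32, 64)   # hypothetical tile lengths along the inner dimension
L1_BYTES = 64 * 1024        # assumed on-chip memory budget for operand tiles


def fits_l1(tile_k, shape):
    """Check that float32 tiles of both operands fit in the assumed L1 budget."""
    m, _, n = shape
    return 4 * tile_k * (m + n) <= L1_BYTES


def toy_cycles(kernel, tile_k, shape, cores=8):
    """Toy cost model standing in for a real cycle measurement (assumption)."""
    m, k, n = shape
    ur, uc = KERNELS[kernel]
    # Output rows are split across cores; each share is padded to the unroll factor.
    rows = math.ceil(math.ceil(m / cores) / ur) * ur
    cols = math.ceil(n / uc) * uc
    # One unrolled step loads ur + uc operands and performs ur * uc MACs.
    steps = (rows // ur) * (cols // uc) * k
    compute = steps * (ur + uc + ur * uc)
    # Each tile along the inner dimension pays a fixed setup cost (assumed value).
    transfer = math.ceil(k / tile_k) * 100
    return compute + transfer


def autotune(shape, measure=toy_cycles):
    """Return (cycles, kernel, tile_k) for the fastest candidate that fits in L1."""
    candidates = [
        (kern, t) for kern, t in product(KERNELS, TILE_SIZES) if fits_l1(t, shape)
    ]
    return min((measure(kern, t, shape), kern, t) for kern, t in candidates)


def mac_per_clk(shape, cycles):
    """Throughput metric used in the abstract: multiply-accumulates per clock cycle."""
    m, k, n = shape
    return m * k * n / cycles


if __name__ == "__main__":
    # Shapes are (M, K, N) for an M x K by K x N matrix multiplication.
    for shape in [(64, 64, 64), (8, 64, 1), (512, 64, 512)]:
        cycles, kernel, tile_k = autotune(shape)
        print(shape, kernel, tile_k, round(mac_per_clk(shape, cycles), 2))
```

Under this toy model, a square layer favors the most aggressively unrolled kernel, a layer with a single output column favors the plain kernel because padding to the unroll factors wastes work, and a large layer is restricted to the smallest tile by the memory budget. That is the kind of shape dependence that makes a single fixed kernel suboptimal.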
978-3-031-15073-9
Files in this item:

PULP-TrainLib - Springer Version.pdf
Open access from 15/08/2023
Type: 2. Post-print / Author's Accepted Manuscript
License: Public - All rights reserved
Size: 3.69 MB
Format: Adobe PDF

PULP-TrainLib-Editoriale-Compressed.pdf
Not available
Type: 2a. Post-print, publisher's version / Version of Record
License: Non-public - Private/restricted access
Size: 2.2 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2971649