Compressing and Fine-tuning DNNs for Efficient Inference in Mobile Device-Edge Continuum

Carla Fabiana Chiasserini
2024

Abstract

Pruning deep neural networks (DNNs) is a well-known technique that allows for a significant reduction in inference cost. However, it may severely degrade the accuracy achieved by the model unless the model is properly fine-tuned, which may, in turn, increase computational cost and latency. Thus, when deploying a DNN in resource-constrained edge environments, it is critical to find the best trade-off between accuracy (hence, model complexity) on the one hand, and latency and energy consumption on the other. In this work, we explore the different options for deploying a machine learning pipeline, encompassing pruning, fine-tuning, and inference, across a mobile device requesting inference tasks and an edge server, while accounting for privacy constraints on the data used for fine-tuning. Our experimental analysis provides insights into how to efficiently allocate the pipeline tasks between the network edge and the mobile device in terms of energy and network costs, as the target inference latency and accuracy vary. In particular, our results highlight that the higher the edge server load and the number of inference requests, the more convenient it becomes to deploy the entire pipeline at the mobile device using a pruned model, with a cost reduction of up to a factor of two compared to deploying the whole pipeline at the edge.
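
Since the abstract describes a pipeline of pruning, fine-tuning, and inference, the following is a minimal PyTorch sketch of those three stages. It is not the paper's code: the backbone (ResNet-18), the pruning amount, and the training hyper-parameters are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision import models

# Assumed example backbone; the model studied in the paper may differ.
model = models.resnet18(weights="IMAGENET1K_V1")

# Stage 1 - Pruning: remove 30% of the output channels of each conv layer
# (L2-norm structured pruning), which reduces inference cost.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)

# Stage 2 - Fine-tuning: a few epochs on (possibly privacy-constrained,
# on-device) data to recover the accuracy lost through pruning.
def fine_tune(model, loader, epochs=2, lr=1e-4):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

# fine_tune(model, train_loader)  # train_loader: application-specific, assumed

# Make the pruning masks permanent before deployment.
for module in model.modules():
    if isinstance(module, nn.Conv2d) and prune.is_pruned(module):
        prune.remove(module, "weight")

# Stage 3 - Inference: run on the mobile device or the edge server,
# depending on how the pipeline tasks are allocated.
def infer(model, x):
    model.eval()
    with torch.no_grad():
        return model(x).argmax(dim=1)

Depending on the deployment option, the pruning and fine-tuning stages can run at the edge server and only inference on the device, or the entire pipeline can run on the mobile device, which is the configuration the abstract reports as most cost-effective under high edge load.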
Files in this record:
ADROIT6G-2.pdf — Adobe PDF, 814.43 kB, open access
Type: Preprint / submitted version (pre-review)
License: Public - All rights reserved

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2988279