Voice user interfaces rely on keyword spotting (KWS) to detect wake-word commands, enabling low-power devices to switch from drowsy to active states and initiate more complex tasks. In embedded systems, KWS combines handcrafted acoustic feature extraction with lightweight neural network classifiers to achieve accurate detection within strict resource constraints. Adapting KWS to time-varying energy budgets requires optimization strategies that operate at runtime. Most existing approaches adjust the complexity of the neural model but overlook that a substantial amount of latency, and thus energy consumption, is due to feature extraction, which remains unaffected by model scaling. This work introduces Runtime Feature Compression (RFC), a dynamic rescaling strategy that modulates the workload of the entire KWS pipeline. RFC promotes the hop-length parameter of the Short-Time Fourier Transform as a runtime control knob to adjust the number of time frames in speech features, allowing a single model to operate across multiple latency modes. To support this flexibility, we introduce two training-time techniques: HopAugment, a data augmentation scheme that exposes the model to variable hop lengths during training, and Masked Layers, which preserve consistent activation statistics during training and inference under compressed feature settings. Evaluations on four KWS datasets using the TC-ResNet model family show that RFC outperforms model scaling techniques, offering a wider range of latency-accuracy trade-offs. RFC achieves up to 31.8% lower latency without accuracy degradation, or up to 0.30% higher accuracy within equivalent latency bounds. These results demonstrate that RFC improves adaptability in energy-constrained IoT speech interfaces.
A set of ablation studies further demonstrates the robustness of RFC by evaluating the role of its training components and batching strategies, its ability to preserve accuracy with a shared weight set, its scalability across operating modes, and its applicability to different model architectures.
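As a concrete illustration of the control knob described above (not the authors' code): in a framed STFT front end, the hop length directly sets how many time frames a fixed-duration utterance produces, and hence the workload of both feature extraction and the classifier. The sketch below uses the standard no-padding frame-count formula with a typical KWS configuration (1 s of 16 kHz audio, 25 ms analysis window); the specific hop values are illustrative assumptions, not the paper's operating modes.

```python
# Minimal sketch: the STFT hop length controls the number of time frames
# in the feature map, and therefore the compute of the whole KWS pipeline.

def num_frames(n_samples: int, win_length: int, hop_length: int) -> int:
    """Frame count for a framed STFT without edge padding."""
    return 1 + (n_samples - win_length) // hop_length

# 1 s of audio at 16 kHz with a 25 ms (400-sample) window -- a common
# KWS front-end setup (assumed here for illustration).
n_samples, win = 16_000, 400
for hop in (160, 240, 320):  # 10 ms, 15 ms, 20 ms hops (hypothetical modes)
    print(f"hop={hop:3d} -> {num_frames(n_samples, win, hop)} frames")
```

Doubling the hop roughly halves the frame count (98 frames at a 10 ms hop versus 49 at 20 ms in this setup), which is why promoting it to a runtime parameter rescales the latency of feature extraction and inference together.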
Peluso, Valentino; Calimera, Andrea; Macii, Enrico; Montuschi, Paolo. Runtime Feature Compression for Adaptive Keyword Spotting on Embedded Systems. IEEE Internet of Things Journal, ISSN 2327-4662, 2026, pp. 1-1. DOI: 10.1109/jiot.2026.3676917
Runtime Feature Compression for Adaptive Keyword Spotting on Embedded Systems
Peluso, Valentino; Calimera, Andrea; Macii, Enrico; Montuschi, Paolo
2026
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3010606
Notice: the data displayed have not been validated by the university.
