Ditching the Queue: Optimizing Coprocessor Utilization with Out-of-Order CPUs on compact Systems on Chip / Caon, Michele; Masera, Guido; Martina, Maurizio. - In: ELECTRONICS. - ISSN 2079-9292. - ELECTRONIC. - 13:15(2024). [10.3390/electronics13153018]

Ditching the Queue: Optimizing Coprocessor Utilization with Out-of-Order CPUs on compact Systems on Chip

Caon, Michele; Masera, Guido; Martina, Maurizio
2024

Abstract

The growing demand for high-performance and energy-efficient processing in edge-oriented Systems-on-Chip is driving the adoption of dedicated integrated circuits that accelerate computationally intensive workloads. To minimize area and performance overhead, low-power, general-purpose CPUs are often tightly coupled with domain-specific coprocessors implementing custom instructions, thereby delivering higher throughput and reduced memory traffic. However, commonly used in-order CPUs are not optimized for instruction-level parallelism, leading to stalls in the instruction stream while waiting for long-latency coprocessor operations, and under-utilizing the coprocessor while executing other instructions. This work investigates the benefits of replacing simple in-order cores with a more complex out-of-order architecture to dynamically schedule instructions for the main core and coprocessor, optimizing resource utilization and reducing execution time. To ensure generality, an in-depth analysis was carried out by offloading instructions to a custom dummy coprocessor capable of emulating iterative and pipelined operations with arbitrary latency. Various workloads simulating real-world applications were executed on two variants of an open-source microcontroller, equipped with a recent out-of-order core and the state-of-the-art CV32E40X in-order core, respectively. Results from Register Transfer Level simulations show that the former configuration executes up to 60% more instructions per cycle, with a modest 12% system area overhead on a 65 nm CMOS technology node.
Files in this item:
File: Caon-Ditching.pdf
Access: open access
Type: 2a Post-print, publisher's version / Version of Record
License: Creative Commons
Size: 1.26 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2991126