The growing demand for high-performance and energy-efficient processing in edge-oriented Systems-on-Chip is driving the adoption of dedicated integrated circuits that accelerate computationally intensive workloads. To minimize area and performance overhead, low-power, general-purpose CPUs are often tightly coupled with domain-specific coprocessors implementing custom instructions, thereby delivering higher throughput and reduced memory traffic. However, commonly used in-order CPUs are not optimized for instruction-level parallelism, leading to stalls in the instruction stream while waiting for long-latency coprocessor operations, and under-utilizing the coprocessor while executing other instructions. This work investigates the benefits of replacing simple in-order cores with a more complex out-of-order architecture to dynamically schedule instructions for the main core and coprocessor, optimizing resource utilization and reducing execution time. To ensure generality, an in-depth analysis was carried out by offloading instructions to a custom dummy coprocessor capable of emulating iterative and pipelined operations with arbitrary latency. Various workloads simulating real-world applications were executed on two variants of an open-source microcontroller, equipped with a recent out-of-order core and the state-of-the-art CV32E40X in-order core, respectively. Results from Register Transfer Level simulations show that the former configuration executes up to 60% more instructions per cycle, with a modest 12% system area overhead on a 65 nm CMOS technology node.
Ditching the Queue: Optimizing Coprocessor Utilization with Out-of-Order CPUs on compact Systems on Chip / Caon, Michele; Masera, Guido; Martina, Maurizio. - In: ELECTRONICS. - ISSN 2079-9292. - ELETTRONICO. - 13:15(2024). [10.3390/electronics13153018]
Ditching the Queue: Optimizing Coprocessor Utilization with Out-of-Order CPUs on compact Systems on Chip
Caon, Michele;Masera, Guido;Martina, Maurizio
2024
Abstract
The growing demand for high-performance and energy-efficient processing in edge-oriented Systems-on-Chip is driving the adoption of dedicated integrated circuits that accelerate computationally intensive workloads. To minimize area and performance overhead, low-power, general-purpose CPUs are often tightly coupled with domain-specific coprocessors implementing custom instructions, thereby delivering higher throughput and reduced memory traffic. However, commonly used in-order CPUs are not optimized for instruction-level parallelism, leading to stalls in the instruction stream while waiting for long-latency coprocessor operations, and under-utilizing the coprocessor while executing other instructions. This work investigates the benefits of replacing simple in-order cores with a more complex out-of-order architecture to dynamically schedule instructions for the main core and coprocessor, optimizing resource utilization and reducing execution time. To ensure generality, an in-depth analysis was carried out by offloading instructions to a custom dummy coprocessor capable of emulating iterative and pipelined operations with arbitrary latency. Various workloads simulating real-world applications were executed on two variants of an open-source microcontroller, equipped with a recent out-of-order core and the state-of-the-art CV32E40X in-order core, respectively. Results from Register Transfer Level simulations show that the former configuration executes up to 60% more instructions per cycle, with a modest 12% system area overhead on a 65 nm CMOS technology node.File | Dimensione | Formato | |
---|---|---|---|
Caon-Ditching.pdf
accesso aperto
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Creative commons
Dimensione
1.26 MB
Formato
Adobe PDF
|
1.26 MB | Adobe PDF | Visualizza/Apri |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2991126