Hardware blocks are used to execute very different Digital Signal Processing tasks, ranging from simple audio applications to state-of-the-art LTE mobile communication and high-definition video processing. Users expect continuous and very rapid innovation, and expect each new product to be more energy efficient and more powerful in terms of functionality than the previous generation. These user-driven market requirements force companies to continuously design new devices with novel features and improved performance. The large number of semiconductor companies (especially fabless ones) also results in strong competition, especially in terms of time to market. In summary, two key factors that determine the success of a new product are its differentiating features and its design time. These requirements ultimately translate into increased pressure on hardware and software designers to come up with new and innovative designs in short time spans, which can only be achieved with novel design tools and flows. For hardware design, the standard RTL-based flow employed in industry consists of using pre-built and pre-verified hardware IPs plus some customized blocks that are specially designed for new devices. These customized hardware blocks are one of the main ingredients that enable product differentiation in the market. They are normally designed using a procedure that consists of two major design and modeling steps. The first step models the algorithm in a C-like language that allows algorithmic analysis and verification. The second step manually implements a hardware architecture coded in a hardware description language. The manual translation involved in the second step takes a long time, and in theory it should be iterated over many possible hardware architectures in order to find the best one. This hardware Design Space Exploration is especially lengthy because verification must be repeated for each new RTL implementation.
The best way to cope with these issues is to use hardware synthesis tools that allow modeling in a C-like language at the algorithmic level, and that can directly and quickly perform design space exploration with as little manual effort as possible. This observation is the basic motivation for the methods and results described in this thesis. The model-based design paradigm has been studied extensively by the research community as a way to raise the level of abstraction for design, verification and synthesis of hardware and software. The scope of the term is very broad, and it cannot be captured easily in a few lines. However, in the context of hardware and software design for embedded systems, model-based design can be defined as a design paradigm that allows one to rapidly design, verify, validate and implement hardware and software systems starting from pre-verified abstract component models. Model-based design usually starts from abstract graphical drag-and-drop models, which are described either in a proprietary way (as in the case of Simulink) or in a loosely standardized way (as in the case of UML). The components used for modeling range from simple atomic blocks, like arithmetic operations, to complex blocks like a Fast Fourier Transform, a Finite Impulse Response filter, a Viterbi decoder or a Discrete Cosine Transform. The entire algorithm is modeled in this way, and then automatically synthesized to an implementation-dependent model, in order to enhance re-use under different application scenarios. Model-based design is normally carried out using frameworks and design environments, such as Simulink or LabVIEW, that are equipped with powerful model-to-model translation tools. These tools can be used to derive different low level models for software and hardware simulation and implementation from a single "golden" verified algorithmic model.
Most of the model-based design tools that are available in industry and academia today allow automatic software code generation from graphical models. The software code generators can generate optimized code that comes close to hand-optimized code in terms of size and performance, because they exploit information about the target processor architecture and the associated memory hierarchy. Model-based design tools also offer the capability to generate hardware from graphical models. However, they either rely on parameterized hand-optimized implementations of the larger macro blocks, or require the designer to use smaller blocks at an abstraction level that is very close to RTL. In other words, these tools either rely on pre-modeled intellectual property blocks (IPs) with a fixed hardware architecture, or tend to decrease the abstraction level in a proprietary manner. The reliance on a single architectural template can result in a sub-optimal implementation if this architecture is not suitable for a particular application, while at a lower abstraction level design space exploration again becomes very expensive, essentially defeating one of the main purposes of model-based design. Simulink (from MathWorks) is a very powerful modeling, simulation and analysis tool that is used in this thesis as the starting point for model-based design. It has a huge set of libraries and allows modeling systems and algorithms from very different application domains, ranging from embedded control to video and communication systems. It includes many parameterized built-in components that can be used for modeling, as well as components that allow easy mathematical and graphical analysis of data in the time and frequency domains, thus making debugging and verification very easy and intuitive. Simulink also has tools that perform fixed point analysis and ease the transition from floating point to fixed point models.
Simulink comes bundled with many code generators that can generate C/C++ software code from models for a variety of embedded processor architectures. These code generators are called Target Language Compilers (TLC) in Simulink terminology. Simulink also supports hardware generation from graphical models but, as explained above, for complex algorithms it either relies on parameterizable fixed architectural templates, which are not flexible enough to target different application domains with different requirements, or requires modeling at a low abstraction level, resulting in longer design times. The research presented in this thesis uses Simulink as a graphical modeling front-end for hardware synthesis and hardware/software trade-off analysis. It proposes a novel design flow, which incorporates a more flexible and powerful approach for hardware synthesis and design space exploration than the industrial and research state of the art. Several high level synthesis (HLS) tools are available to synthesize RTL hardware from abstract C/C++/SystemC specifications. Normally these high level synthesis tools also require a set of constraints, which drive them to synthesize a specific macro- and micro-architecture from the C-like specification. Hence Design Space Exploration can be performed just by changing these constraints, while preserving the same pre-verified C code. Using Simulink as a modeling front-end to high level hardware synthesis results in a very powerful flow that allows flexible hardware design and synthesis. Since Simulink supports code generation for software implementation, hardware/software trade-off analysis can also be performed easily starting from Simulink models. During the course of this thesis, Simulink was evaluated as a graphical modeling front-end for high level hardware synthesis and hardware/software trade-off analysis.
The tool that was used for high level hardware synthesis is Cadence C-to-Silicon, or CtoS. In its input, C++ constructs are used to specify the algorithm, while SystemC constructs are used for more hardware-oriented artifacts, such as bit-true IOs and timing. Hence our goal is to define a methodology to convert the C output from Simulink and efficiently synthesize it with CtoS. In particular, by exploring the variety of TLCs available from Simulink, we found that the Embedded Real-Time coder (ERT), which was specifically designed to produce readable and optimized code for embedded processors, can generate C code that is well suited for high level hardware synthesis. ERT has many optimizations that can be selectively enabled or disabled to generate code that is suitable for a particular application. We first analyzed them to understand which ones can be used to generate code that is best suited for high level hardware synthesis. Different graphical modeling options were also analyzed to enable hierarchical and modular code generation, which is especially useful when model partitioning is required in order to obtain a more concurrent hardware implementation. Simulink allows one to easily control the level of granularity, by grouping sub-designs into aggregate blocks that can become single function calls in the generated C code. In order to illustrate how one can perform hardware/software trade-off analysis, we used a case study from the domain of wireless video surveillance sensor networks, where cameras are triggered to capture video whenever an interesting audio event is detected by an audio detector. The platform that was used as the implementation target consisted of an ARM Cortex-M3 processor. The audio algorithm was modeled in Simulink, and the model was then grouped into aggregate Simulink blocks to generate modular code through ERT.
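As a rough illustration of how aggregate blocks lead to a set of hardware/software partitions, the following minimal Python sketch enumerates every assignment of blocks to either side. The block names are hypothetical placeholders and are not taken from the thesis. The sketch only lists candidate partitions and does not estimate power, area or throughput.

```python
from itertools import product

# Hypothetical aggregate block names (assumed for illustration only).
blocks = ["pre_filter", "feature_extraction", "event_classifier"]

def enumerate_partitions(blocks):
    """Yield every assignment of each block to software ('sw') or hardware ('hw')."""
    for choice in product(("sw", "hw"), repeat=len(blocks)):
        yield dict(zip(blocks, choice))

partitions = list(enumerate_partitions(blocks))
print(len(partitions))  # 2**3 = 8 candidate partitions
```

Each candidate would then be evaluated separately, for example by compiling the "sw" functions for the processor and synthesizing the "hw" functions.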
This modular code was then used in the form of different C-code partitions to perform extensive hardware/software trade-off analysis, where final implementation decisions were made based on very accurate low level power/area/throughput estimates. An accurate picture of power, area and throughput was easily generated because most of the design space exploration was performed using CtoS with minor or no changes to the C-model. Very simple and abstract architectural constraints were used to automatically generate the low level synthesis scripts for both CtoS and the downstream tools (e.g. RTL Compiler). Once a C-model is generated from Simulink through ERT, it can be synthesized using CtoS by providing architectural constraints. However, in order to perform design space exploration efficiently, one must gain a high level understanding of the computationally most expensive loops and code segments. This may again become a complex and tedious task for automatically generated code. Hence a very important part of our research was to devise and implement a mechanism that shields the designer from having to understand the automatically generated code during high level synthesis. We implemented an automated Tcl script to perform automatic design space exploration. This tool takes as input a set of simple high-level directives that limit the design space to be explored, and then produces many points in the design space that can be used for further low level throughput, power and area analysis. The directives specify, for example, how many resources such as multipliers or adders must be considered, what type of memories must be used, and how many operations should be scheduled in a single cycle. In the second part of this thesis, we also explored in detail how to implement algorithms, like an FFT, that are used in a multitude of different applications and may require inherently different macro/micro-architectures to satisfy very different performance requirements.
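The directive-driven exploration described above can be pictured with a minimal Python sketch. The thesis tool is a Tcl script and its directive syntax is not given here, so the directive names and values below are assumptions chosen for illustration. The sketch simply expands the directives into one candidate design point per combination.

```python
from itertools import product

# Hypothetical directives bounding the design space (names and values assumed).
directives = {
    "multipliers": [1, 2, 4],
    "adders": [2, 4],
    "memory_type": ["register_file", "single_port_sram"],
    "max_ops_per_cycle": [1, 2],
}

def enumerate_design_points(directives):
    """Return one dict per combination of directive values (Cartesian product)."""
    names = list(directives)
    value_lists = (directives[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]

points = enumerate_design_points(directives)
print(len(points))  # 3 * 2 * 2 * 2 = 24 candidate design points
```

In a real flow, each design point would be turned into a set of synthesis constraints and passed to the synthesis tool. That step is tool-specific and is not shown.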
The fully sequential C-code that is generated by ERT for such complex blocks limits the design space that can be explored. In particular, more concurrent implementations are very difficult to derive from such code, because it is inherently sequential and not well suited to reaching the level of concurrency that can be essential for some hardware implementations. We thus proposed and experimented with a technique that allowed us to represent complex blocks as proprietary blocks in Simulink. These blocks can be considered as IPs that are defined at a higher abstraction level to enable efficient hardware synthesis and extended hardware design space exploration, still starting from a single C model. The modeling strategy that we used for defining these high level synthesis IPs (HLS-IPs) relies on the S-function modeling mechanism. This is a mechanism provided by Simulink to extend its component library. It allows modeling of algorithms in a well-defined manner to enable integration into the simulation environment and C-code generation. We used plain C for modeling S-functions because C-like specifications are also well suited for high level hardware synthesis, which is the target of our proposed HLS-IP scheme. Complex algorithms, like an FFT, have many different signal flow graph representations, each suited for a different purpose. Some representations may be best suited for software implementation, while others are more interesting for hardware implementation. The choice of signal flow graph and template architecture in our case was based on flexibility and on the opportunity to map to different macro/micro-architectures, to enable more extended hardware design space exploration.
This mostly involves a change of the level of concurrency, because state-of-the-art high level synthesis tools are hardly able to increase the level of concurrency starting from sequential C-like specifications, which essentially limits a very important dimension of the hardware design space that can be explored. We experimented extensively with an FFT test case for designing high level flexible IPs that can be used for modeling, simulation and verification in the Simulink environment and then for efficient hardware synthesis and design space exploration. In the end we were able to formulate and write an FFT IP-generator that generates flexible HLS-IPs following all the guidelines described above, including all the wrappers that are required for integration into Simulink and the high level synthesis scripts. The HLS-IPs are generated in a way that separates pure functionality and behavioral constructs from the SystemC interfaces and threads. The functional part of the generated HLS-IP remains the same, so that verification needs to be performed only once, and only the level of parallelism and the datapath bit-width are changed to reflect the application requirements.
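The idea of keeping the functional part fixed while varying only the level of parallelism can be illustrated with a small Python sketch of a radix-2 decimation-in-time FFT. This is not the thesis IP-generator, which produces C/SystemC. The `parallelism` parameter and the idealized cycle count (one cycle per group of butterflies processed together) are assumptions made for illustration.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft(x, parallelism=1):
    """Radix-2 DIT FFT; returns (spectrum, idealized cycle count)."""
    n = len(x)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two >= 2")
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    bits = n.bit_length() - 1
    data = [0j] * n
    for i, v in enumerate(x):
        data[bit_reverse(i, bits)] = complex(v)
    cycles = 0
    half = 1
    while half < n:
        # All butterflies of one stage touch disjoint index pairs,
        # so they may be executed in any grouping.
        stage = [(s + k, s + k + half, cmath.exp(-2j * cmath.pi * k / (2 * half)))
                 for s in range(0, n, 2 * half) for k in range(half)]
        for g in range(0, len(stage), parallelism):
            for top, bot, w in stage[g:g + parallelism]:
                t = w * data[bot]
                data[top], data[bot] = data[top] + t, data[top] - t
            cycles += 1  # one idealized cycle per group of butterflies
        half *= 2
    return data, cycles

samples = [1, 2, 3, 4, 0, 0, 0, 0]
serial, serial_cycles = fft(samples, parallelism=1)      # 12 idealized cycles
parallel, parallel_cycles = fft(samples, parallelism=4)  # 3 idealized cycles
```

Both calls return the same spectrum, because the butterfly arithmetic is untouched and only the grouping changes. This mirrors, in a very simplified way, verifying the functionality once and then exploring different levels of concurrency.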

Model-based High Level Synthesis and Design Space Exploration / Butt, Shahzad Ahmad. - Print. - (2013).

Model-based High Level Synthesis and Design Space Exploration

BUTT, SHAHZAD AHMAD
2013

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2507516