We now describe the realization of the design flow in terms of the tools, languages, and other components applied in our approach.
3.2.1. Application Development
Applications that process a continuous stream of data and have high performance requirements, such as video coding/decoding, radar signal processing, and wireless communication, can be implemented efficiently using a dataflow programming model [
63]. In this model, an application consists of a number of tasks, connected to one another to form the execution flow. The tasks can execute concurrently, exposing the application's parallelism. We adopted such a dataflow programming model and developed our applications as streaming applications. An example streaming application can be seen in
Figure 2, in which the raw input data flow through different tasks and finally produce the output data. We chose
the CAL actor language [
26] as the application development language because of its concurrent nature, suitability for streaming applications, and ease of use.
CAL [
26] is a high-level dataflow programming language [
64] that has been adopted by the MPEG Reconfigurable Video Coding (RVC) [
65] working group as a part of their standardization efforts. A
CAL program consists of actors, which are stateful operators. An actor contains code blocks called actions, which transform input data streams into output data streams; the state of the actor usually changes while performing the transformation. When there is more than one actor in a program, the actors are connected through channels/ports to form a network. Each actor consumes tokens on its input ports and produces new tokens on its output ports.
The execution order of the actions can be defined with an explicit schedule in the form of a state machine together with priority definitions for the actions. Additionally, CAL provides a network language to instantiate actors, define channels, and connect the actors via these channels.
Another advantage of using CAL is the opportunity to integrate application analysis and code generation (both software and hardware) in our Cal2Many framework.
3.2.2. Analysis
The analysis method and the tools employed depend on the programming language used for application development. For instance, for applications written in
C, one can use Valgrind [
67] and the GNU profiler (gprof) [
68] to obtain the total number of instruction fetches per function, the number of function calls, the total execution time of the application, and the amount of time spent in each function. We used these tools for one of our Autofocus criterion calculation implementations, which is hand-written in
C.
For the applications developed in
CAL, we used TURNUS [
69], which is a framework for profiling dataflow applications running on heterogeneous parallel systems. This framework implements a
CAL dataflow profiler on top of the Open RVC-CAL Compiler (Orcc) [
70]. TURNUS provides high-level application and architecture modeling as well as tools for application profiling, system-level performance estimation, and optimization.
TURNUS provides both static and dynamic analysis information relevant for both the software and hardware aspects; however, our focus is on the software aspects. Key analysis information provided by this tool includes the firing rates of the actions, the number of executed operations, the input/output tokens consumed and produced, and the communication buffer utilization. To obtain dynamic analysis data, such as the firing rates of the actions, the application was executed at least once. For each executed action, both the computational load and the data-transfer and storage loads were evaluated. The computational load, given as weights in
Figure 3, was measured in terms of executed operators and control statements (i.e., comparison, logical, arithmetic, and data movement instructions). This information was used for hot-spot identification, while the communication information was used for configuring the network-on-chip.
This tool can provide performance estimation based on an architecture model, which is an undirected graph where each node represents an operator (a processing element such as a CPU or an FPGA in the terminology of [
71]) or a communication element (e.g., buses and memories). However, performance estimation is not necessary for identifying the hot-spots of the applications. Moreover, we do not want to limit our hot-spot identification to any particular platform. Therefore, we do not focus on performance estimation.
There are other efforts that profile applications at different levels, such as the basic-block level in static single assignment (SSA) form, to find the most frequently executed instruction sequences [
72,
73]. However, we kept the analysis at a higher level of abstraction to keep it simple and architecture-agnostic.
CAL actions are meant to be small code blocks, yet they can be significantly compute-intensive, for instance by including heavy floating-point operations such as multiplication, division, and square root. Therefore, analyzing the application at the action level is sufficient to identify hot-spots.
The case studies implemented in
CAL were analyzed statically and dynamically via TURNUS. The firing rates of the actions, together with the numbers of executed operations, are sufficient to identify the hot-spots. The tool provides a weight, which can be interpreted as a compute density, for each action based on its firing rate, the number of operations, and the number of operands used in the operations. This information can be used directly for hot-spot identification.
Figure 3 shows an overview of the actions in the QRD case study and their weights in an actor. One can see that
calculate_boundary action is the hot-spot of the given actor despite not being the most frequently executed action. This is due to the larger number of operations in this action.
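For illustration, the following Scala sketch ranks actions by such a weight. The action names and numbers are illustrative (only calculate_boundary comes from Figure 3), and the weight formula (firings multiplied by operations per firing) is a simplified stand-in for the metric TURNUS computes, which also accounts for the operand counts.

case class ActionProfile(name: String, firings: Long, opsPerFiring: Long)

object HotSpotSketch {
  // Simplified weight: firing count times operations per firing.
  def weight(a: ActionProfile): Long = a.firings * a.opsPerFiring

  def main(args: Array[String]): Unit = {
    val actions = List(
      ActionProfile("calculate", firings = 4096, opsPerFiring = 10),
      ActionProfile("calculate_boundary", firings = 64, opsPerFiring = 2000)
    )
    // The hot-spot is the heaviest action, not the most frequently fired one.
    val hotSpot = actions.maxBy(weight)
    println(s"hot-spot: ${hotSpot.name} (weight ${weight(hotSpot)})")
  }
}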
Some actors can have dynamic execution paths (different execution orders and frequencies for the actions) depending on the input tokens or the state of the actor. In such cases, the application can be executed a number of times to obtain average values for the analysis data.
Gathering the analysis data and identifying the hot-spots are performed automatically by tools. Once the hot-spot actions are identified, they are tagged manually for the code generation.
3.2.3. Code Generation
To automate the code generation, we used our in-house developed Cal2Many framework, which is a source-to-source compiler [
20].
Figure 4 gives an overview of the framework. It takes a
CAL application as input and transforms it into two consecutive intermediate representations, namely Actor Machine (AM) [
74] and Action Execution Intermediate Representation (AEIR) [
20]. After the transformations and the utilization of the necessary libraries, it finally generates native code in various target-specific languages.
In CAL, actors execute by firing actions; an action may fire only when conditions such as the state of the actor, the availability of input tokens, and the token values are satisfied. While generating the code, the framework has to take the ordering of these condition tests into account, together with the number of tests performed before an action fires. We adapted a simple actor model, called the actor machine, to schedule the testing of the firing conditions of an action. To get closer to the target imperative languages, we used the second intermediate representation (AEIR), which is a data structure that allows generating code in imperative languages such as C, C++, and Java. It has constructs for function declarations, expressions, statements, and function calls. The translation of the AM to AEIR consisted of two main passes. The first pass dealt with the translation of CAL constructs to imperative constructs, including functions, statements, expressions, actions, and variable declarations; it also handled the translation of the AM into a sequential action scheduler.
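To illustrate what the AM is lowered to, the following self-contained Scala sketch mimics the shape of such a sequential action scheduler: each step tests the firing conditions (actor state and token availability) in order and fires the first action that passes. The actor and action names are invented for illustration; the actual generated scheduler is C code.

object SchedulerSketch {
  sealed trait State
  case object S0 extends State
  case object S1 extends State

  var state: State = S0
  val input = scala.collection.mutable.Queue(1, 2, 3)
  val output = scala.collection.mutable.Queue.empty[Int]

  // One scheduler step: test the conditions in order, fire the first action that passes.
  def step(): Unit = state match {
    case S0 if input.nonEmpty => // firing conditions of a hypothetical "scale" action
      output.enqueue(input.dequeue() * 2) // consume a token, produce a token
      state = S1 // state transition of the schedule FSM
    case S1 => // a hypothetical "advance" action with no token requirement
      state = S0
    case _ => () // no action is fireable
  }

  def main(args: Array[String]): Unit = {
    for (_ <- 0 until 6) step()
    println(output.mkString(", ")) // prints: 2, 4, 6
  }
}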
We previously developed four different back-ends, i.e., sequential
C for uni-processors,
aJava and
aStruct for the Ambric architecture, native-
C for the Epiphany architecture, and a target-specific language (a Scala subset) for the ePuma [
75] and EIT [
76] SIMD architectures, as shown in
Figure 4. The
C and
Chisel hybrid back-end was developed as part of this work.
We used the
C back-end to generate the software code to be executed on the base RISC-V processor core. We modified the parallel
C back-end to generate RISC-V compilable
C code, including embedded custom instructions. Code generation in our design flow consists of two passes. The first pass generates the architecture-specific software code, which is
C code with embedded custom instructions provided by the RISC-V ISA. The second pass generates the hardware blocks in
Chisel, corresponding to the hot-spots in the
CAL application. To generate the hardware blocks, we developed a new back-end that generates a combination of hardware and software code, using the
Chisel and
C languages, respectively. The actions tagged as hot-spots during the analysis step are converted into hardware blocks acting as accelerators in
Chisel, whereas the rest of the application is converted into target-specific software, which at present is in
C. The code generator makes use of a library of hardware blocks consisting of efficient implementations for floating-point division [
77] and square root operations.
The bodies of the tagged actions are replaced with custom instruction calls to trigger the hardware accelerators. The necessary input data are forwarded to the accelerators and the results are read through these instructions. The bit fields of the instruction can be seen in
Figure 5, where xs1 and xs2 indicate the usage of the source registers, and xd indicates whether the accelerator will write a result to the destination register. If the xd bit is set, the core will wait until the accelerator finishes the computation. The generated hardware is later integrated into a rocket core through the rocket chip generator.
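For reference, these bit fields follow the R-type layout of the RISC-V custom instructions; in the rocket chip sources, they are described by a Chisel bundle along the following lines (a sketch based on rocket-chip, whose details may differ between versions):

import chisel3._

class RoCCInstruction extends Bundle {
  val funct = Bits(7.W) // selects the operation/input group in the accelerator
  val rs2 = Bits(5.W) // second source register number
  val rs1 = Bits(5.W) // first source register number
  val xd = Bool() // set if the accelerator writes a result to rd
  val xs1 = Bool() // set if rs1 carries an operand
  val xs2 = Bool() // set if rs2 carries an operand
  val rd = Bits(5.W) // destination register number
  val opcode = Bits(7.W) // one of the four custom opcodes
}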
Cal2Chisel Back-end
The common language used in the RISC-V environment is
Chisel [
21], which is an open-source hardware description language developed at UC Berkeley. It is embedded in the Scala language [
78]. Some features of the
Chisel language are parameterized generators, algebraic construction, abstract data types, interfaces, bulk connections, hierarchical, object-oriented, and functional construction, and multiple clock domains. The
Chisel compiler, which is provided by UC Berkeley, can generate low-level synthesizable
Verilog code.
The new back-end makes use of the intermediate representations (AM and AEIR) provided by the Cal2Many framework to generate a combination of C (software) and Chisel (hardware) code. The C code for the untagged (non-hot-spot) actions is generated in a single pass in which the AEIR structure is visited once. However, when the tagged (hot-spot) actions are visited in the AEIR structure, the Chisel code and the custom instruction calls in C are generated in two phases. In the first phase, the entire action structure is scanned to generate a static single assignment (SSA) form by creating a new version of a variable for each assignment and placing phi functions where necessary. Wires are defined for the first set of variables and the corresponding Chisel code is generated. Then, information about the usage of actor-wide global variables and input and output ports is gathered to identify the inputs and the outputs of the accelerator. An instance of the Chisel code for inputs and outputs can be seen as follows:
val io = IO(new Bundle{
val r_in = Input(UInt(width = 32.W))
val x_in_in = Input(UInt(width = 32.W))
val r_out = Output(UInt(width = 32.W))
val c_out = Output(UInt(width = 32.W))
val s_out = Output(UInt(width = 32.W))
})
In the second phase, the custom instructions in C/assembly and the rest of the Chisel code are generated. The custom instructions are inlined as assembly code via macros defined in a header file. The inputs and outputs are identified, i.e., the global variables to be read are identified as inputs and the ones to be modified are identified as both inputs and outputs. All variables are defined (except the first set, which is already defined in the first phase). While going through the statements, the phi functions are replaced by multiplexers, and common operations such as floating-point and complex-number operations are replaced with hardware blocks from a manually developed library. Finally, the outputs are connected to the final versions of the corresponding variables. In the current implementation of the back-end, loops and arrays are not supported. If only one element of a global array is used in the tagged action, the indexing is performed outside the hardware block and the accessed value becomes an input to the hardware.
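The following minimal Chisel sketch (our illustration, not actual generator output) shows this lowering: two SSA versions of a variable, assigned on different branches, are merged by a phi function that becomes a Mux driven by the branch condition.

import chisel3._

class PhiAsMux extends Module {
  val io = IO(new Bundle {
    val cond = Input(Bool())
    val a = Input(UInt(32.W))
    val b = Input(UInt(32.W))
    val out = Output(UInt(32.W))
  })
  val x0 = io.a + 1.U // version of x assigned on the then-branch
  val x1 = io.b // version of x assigned on the else-branch
  val x2 = Mux(io.cond, x0, x1) // phi(x0, x1) lowered to a multiplexer
  io.out := x2
}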
We have observed that the critical path of the generated hardware in many cases becomes quite long, which significantly lowers the maximum clock frequency. To alleviate this, it is necessary to introduce pipelining in the accelerators. Therefore, a pipelining feature was added to the back-end by using pipelined hardware blocks for the arithmetic operations. Delay registers are inserted into the data paths where they need to be synchronized.
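The sketch below illustrates the principle (the three-cycle latency is only an example): when one operand passes through a pipelined arithmetic block, the other data path receives the same number of delay registers so that both operands reach the downstream operation in the same cycle.

import chisel3._
import chisel3.util.ShiftRegister

class BalancedPath extends Module {
  val io = IO(new Bundle {
    val a = Input(UInt(32.W))
    val b = Input(UInt(32.W))
    val c = Input(UInt(32.W))
    val out = Output(UInt(32.W))
  })
  // Stand-in for a pipelined multiplier block with a latency of three cycles.
  val prod = ShiftRegister((io.a * io.b)(31, 0), 3)
  // Delay registers synchronize the bypassing operand with the product.
  val cDelayed = ShiftRegister(io.c, 3)
  io.out := prod + cDelayed
}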
Figure 6 shows the generated
C and
Chisel code (without delay registers) for a tagged
CAL action. In the
CAL code, indices are used for accessing the array elements. However, to avoid transfer of the whole array to the accelerator, these indices are moved to the
C code and only single array elements are transferred. The custom instruction is called twice with different
funct fields (refer to
Figure 5) to provide all the inputs to the accelerator. The
funct field is used to let the interface know when all the inputs are provided so that they can be fed to the accelerator. The floating-point operations are performed by instantiating the pre-defined hardware blocks such as
FPAdd,
FPMult and
fpSqrt, as seen in the generated
Chisel code (
Figure 6). The code also shows the connection between the modules and different variables.
3.2.4. Accelerator Integration
We have used the RISC-V environment for our integration step. RISC-V is an open instruction set architecture based on reduced instruction set computing (RISC) principles. It originated at the University of California (UC), Berkeley. Several processing cores implementing this instruction set have been developed, such as the Berkeley Out-of-Order Machine (BOOM) [
79], rocket core [
27], Sodor CPU [
80], picoRV32 [
81], and SCR1 [
82]. We chose the rocket core [
27] for our case studies, as it has an interface for connecting custom hardware blocks, which enables us to create cores with application-specific augmentations. Additionally, the rocket chip generator produces an emulator and synthesizable
Verilog code for this core and its surrounding components.
Rocket core [
27] is an in-order scalar processor based on the RISC-V ISA. It features a five-stage pipeline and has an integer ALU, an optional FPU, and L1 data and instruction caches. This core supports up to four accelerators via the rocket custom co-processor (RoCC) interface [
83], as shown in
Figure 7. The figure shows a tile consisting of the rocket core with its L1 cache and an accelerator connected to the core and the cache via the RoCC interface. Custom instructions of the RISC-V ISA can be forwarded to the accelerator through this interface. Depending on a bit field in the custom instruction (the xd bit; see Figure 5), the core might halt until it receives a response from the accelerator.
A rocket core can be generated via the rocket chip generator using a Scala program that invokes the
Chisel compiler to produce RTL describing a complete system-on-chip (SoC). This generator allows developers to configure the generated core by changing parameters such as cache sizes, FPU usage, the number of cores, and accelerator usage. The rocket chip environment also includes the cycle-accurate Verilator [
84] for simulations and a cycle-accurate C++ emulator.
The accelerator was integrated into the rocket core using the RoCC interface in several steps. First, the Chisel code for the accelerator was copied into the Chisel source folder of the rocket chip generator. Then, a Chisel class (which becomes a module in Verilog) was added to the rocket core source; this class extends the RoCC interface, instantiates the accelerator, and connects the accelerator I/O wires to the RoCC interface I/O wires. This class needs to be instantiated in a configuration class within the configuration file of the rocket core, where the core components can be added, removed, or modified. Custom instructions are bound to the accelerators in this configuration class, and each custom instruction can be bound to a different accelerator. Because the RISC-V ISA supports four custom instructions, the number of accelerators that can be connected to the same core is limited to four. However, this number can be increased by using the funct field of the custom instruction as an identifier that determines which accelerator to use when connecting the RoCC interface I/O wires to the accelerator I/O wires. The RISC-V ISA supports two 64-bit source registers in the custom instructions. If the accelerator requires more input data at once, the extended RoCC interface is used to store the input data until the last inputs arrive, and then all the data are fed to the accelerator at once. The accelerator returns the result through the same interface.
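A hand-written sketch of such a wrapper follows. The class and signal names are ours, the placeholder accelerator stands in for the generated Chisel module, and the LazyRoCC API shown follows recent rocket chip versions; package names and the set of RoCCIO ports that must be tied off vary between versions.

import chisel3._
import freechips.rocketchip.config.Parameters
import freechips.rocketchip.tile._

// Placeholder for the generated accelerator (hypothetical interface, 64-bit core assumed).
class QrdAccel extends Module {
  val io = IO(new Bundle {
    val in1 = Input(UInt(64.W))
    val in2 = Input(UInt(64.W))
    val out = Output(UInt(64.W))
  })
  io.out := io.in1 + io.in2 // stand-in computation
}

class QrdRoCC(opcodes: OpcodeSet)(implicit p: Parameters) extends LazyRoCC(opcodes) {
  override lazy val module = new QrdRoCCImp(this)
}

class QrdRoCCImp(outer: QrdRoCC) extends LazyRoCCModuleImp(outer) {
  val accel = Module(new QrdAccel)
  val busy = RegInit(false.B)
  val rd = Reg(UInt(5.W)) // destination register of the pending instruction
  val rs1 = Reg(UInt(64.W))
  val rs2 = Reg(UInt(64.W))

  io.cmd.ready := !busy
  when (io.cmd.valid && io.cmd.ready) { // accept a custom instruction
    rs1 := io.cmd.bits.rs1
    rs2 := io.cmd.bits.rs2
    rd := io.cmd.bits.inst.rd
    busy := true.B
  }

  accel.io.in1 := rs1
  accel.io.in2 := rs2

  io.resp.valid := busy // respond one cycle after accepting the command
  io.resp.bits.rd := rd
  io.resp.bits.data := accel.io.out
  when (io.resp.valid && io.resp.ready) { busy := false.B }

  io.busy := busy
  io.interrupt := false.B
  io.mem.req.valid := false.B // this accelerator does not access memory directly
  // (Depending on the rocket chip version, remaining RoCCIO ports, e.g., the
  // memory request payload and FPU/PTW channels, may also need to be tied off.)
}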
The rocket chip configuration file includes many different configurations to determine the components and their settings to be used in the generated architectures. A new configuration class is needed to instantiate the new rocket core integrated with the accelerator. Once the new configurations are added, the generator can produce a cycle-accurate emulator or synthesizable Verilog code for the new tile consisting of the base core, memory, and the accelerator.
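The added configuration fragment follows the pattern of the RoCC example configurations shipped with the rocket chip generator; a sketch using our hypothetical QrdRoCC wrapper from the previous listing might look as follows:

import freechips.rocketchip.config.{Config, Parameters}
import freechips.rocketchip.diplomacy.LazyModule
import freechips.rocketchip.system.DefaultConfig
import freechips.rocketchip.tile.{BuildRoCC, OpcodeSet}

// Bind the accelerator wrapper to the custom0 opcode via the BuildRoCC key.
class WithQrdAccel extends Config((site, here, up) => {
  case BuildRoCC => Seq((p: Parameters) => LazyModule(new QrdRoCC(OpcodeSet.custom0)(p)))
})

// Compose the fragment with a default configuration to obtain the new core.
class QrdRocketConfig extends Config(new WithQrdAccel ++ new DefaultConfig)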
This step of the design flow produces tiles containing different components, including a processor, an accelerator, memory, and the connections between these components. Theoretically, the tile types can be:
Processor Tile, consisting of a processing core, local memory and optionally an accelerator
Memory Tile, consisting of only memory
Accelerator Tile, consisting of only an accelerator
The tile types are illustrated in
Figure 8. Thus far, the rocket chip generator allows us to generate processor tiles with and without accelerators. However, we plan to extend the tile generation to cover all of these tile types.