Designing Domain-Specific Heterogeneous Architectures from Dataflow Programs

The last ten years have seen performance and power requirements push computer architectures from single-core designs towards so-called manycore systems with hundreds of cores on a single chip. To further increase performance and energy efficiency, we are now seeing the development of heterogeneous architectures with specialized and accelerated cores. However, designing these heterogeneous systems is a challenging task due to their inherent complexity. We proposed an approach for designing domain-specific heterogeneous architectures based on instruction augmentation through the integration of hardware accelerators into simple cores. These hardware accelerators were determined based on their common use among applications within a certain domain. The objective was to generate heterogeneous architectures by integrating many of these accelerated cores and connecting them with a network-on-chip. The proposed approach aimed to ease the design of heterogeneous manycore architectures, and consequently the exploration of the design space, by automating the design steps. To evaluate our approach, we enhanced our software tool chain with a tool that can generate accelerated cores from dataflow programs. This new tool chain was evaluated with the aid of two use cases: radar signal processing and mobile baseband processing. We achieved an approximately 4× improvement in performance while executing complete applications on the augmented cores, with a small impact (2.5–13%) on area usage. The generated accelerators are competitive, achieving more than 90% of the performance of hand-written implementations.


Introduction
Several highly demanding applications are currently in the process of being introduced, such as autonomous vehicles, 5G communication, and video surveillance and analytics. Many of these also include artificial intelligence and machine learning algorithms, which add significantly to their computational demands. To support these applications on vehicles and other mobile devices, there is a need for embedded high-performance computing architectures to perform streaming computations in real time.
The present generation of multi/manycore architectures, along with general purpose graphics processing units (GPGPU), aims to address these computational demands by duplicating identical processing units or cores. Some companies are already pushing the number of cores on a chip as high as a thousand [1]. The first multi/manycores were produced by duplicating the processing cores, resulting in homogeneous architectures having several identical cores on the same die [2,3]. However, the streaming applications mentioned above comprise a large variety of tasks which are not necessarily identical. For example, a typical massive MIMO [4,5] application, which is the core of 5G communication technology, consists of a chain of tasks, including channel encoding/decoding, precoding, OFDM modulation, channel estimation, and MIMO detection, each of which performs different computations and requires different hardware resources for efficient performance and power consumption. Some tasks do not even perform any computation but consist of only memory operations such as shuffling or transposing a matrix. Techniques such as virtualization [6] and containerization [7] aim to execute heterogeneous tasks efficiently on processing systems by encapsulating them and minimizing runtime requirements. Additionally, technologies such as hyper-threading [8] provide simultaneous execution of threads on the same processor. However, the efficiency that can be achieved is still limited by the efficiency of the underlying hardware. To achieve the highest efficiency while executing the tasks on a manycore, it is necessary to optimize individual cores to the task at hand, thus introducing heterogeneity [9][10][11][12]. One core may be optimized for efficient fast Fourier transforms (FFT) used in OFDM modulation, whereas another core can be optimized for efficient QR decomposition for matrix inversion during precoding.
However, designing heterogeneous architectures is a challenging task due to the complexity of these architectures. There can be many forms of heterogeneity based on the components and how they are inter-connected and used [12]. Therefore, finding the most suitable architecture for a target application requires design space exploration. This represents a further challenge, because of the diversity in manycore design. In performing design space exploration, manycores are often simulated partially, or as a full system. Simulations allow hiding unnecessary architectural details when a certain component is tested and require no hardware development, which reduces costs. However, simulators are usually quite slow [13].
There are software tools to simulate manycore architectures, such as Gem5 [14], ZSim [15], Graphite [16], Sniper [17], and PriME [18]; however, most of them do not support the kind of complete configurable system that allows configuration of parameters such as the types and numbers of processing units, memory units, custom hardware blocks, or the network-on-chip structure. There are full system simulators such as Gem5 and SimFlex [19], which provide detailed data on different components. However, the simulation time increases beyond a feasible limit as the system grows. Additionally, the simulators do not provide timing and area information, which can be obtained by taking a design to RTL or FPGA implementation. As a result, exploring the design space of heterogeneous manycore architectures through simulations becomes increasingly challenging. Further discussion on manycore simulation can be found in [13].
In this paper, we propose a design approach for high-performance, domain-specific heterogeneous architectures based on simple cores with application-specific extensions in the form of custom hardware blocks. With this approach, instead of finding an efficient architecture to run a certain application, we aim to build the architecture automatically with a tool-chain starting from dataflow programs. The design approach can be summarized as identifying the extensions within an application domain, integrating these extensions into simple cores and, as a future goal, generating heterogeneous manycore architectures by connecting these cores with a network-on-chip (NoC). The paper covers identification and integration of the extensions with case studies and provides insight into connecting the extended cores with a NoC. However, it does not cover the steps needed to generate a manycore architecture. When all the design steps are automated, our approach can be used for exploring the design space of heterogeneous manycore architectures. We define the extensions (custom hardware blocks) as the compute-intensive parts (hot-spots) of the applications within a domain such as radar, baseband, or audio/video signal processing. These blocks are integrated into simple cores by extending the instruction set. The cores are tasked to execute the control flow of the application and delegate the compute-intensive parts to these blocks, using them as accelerators. The cores with accelerator extensions will be referred to as "tiles" in the rest of the paper.
As the first step towards automating the design approach, we developed a software tool to automatically generate custom hardware blocks from dataflow applications. We extended our code generation framework, Cal2Many [20], which takes dataflow applications and generates target specific code for different architectures, by adding a back-end to generate a combination of C and Chisel [21] code. We considered two case studies to evaluate the performance and area usage of the generated code. The first case study was QR decomposition (QRD), implemented using the Givens rotations method [22]. The QRD operation is used in numerous signal processing applications including massive MIMO [23]. The second case study was an autofocus criterion calculation application, which is a key component of synthetic aperture radar systems [24]. The chosen method performs cubic interpolation based on Neville's algorithm [25] to calculate the autofocus criterion. The case studies are implemented in the CAL dataflow programming language [26], which is a concurrent language with support for expressing parallelism. The compute-intensive parts of the case studies were identified through profiling and automatically generated as custom hardware (accelerators). The accelerators are integrated into a rocket core [27] that is based on the RISC-V open-source instruction set architecture (ISA) [28]. Synthesizable designs for the integration of the rocket cores and the accelerators are generated via the rocket chip system-on-chip generator [29]. The generated implementations were evaluated in terms of performance and area usage against hand-optimized implementations. The contributions made by this study can be summarized as follows:

• A generic method to design domain-specific heterogeneous manycore architectures with an emphasis on custom-designed tiles was proposed.

• An approach to design augmented cores (tiles) via instruction extension and hardware acceleration was realized, including development of a code generation tool to automate hardware accelerator generation directly from a dataflow application. This tool performs hardware/software codesign and generates C and Chisel code from CAL applications.

• The design method was evaluated using two case studies from baseband processing and radar signal processing. For these case studies, hand-written and automatically generated accelerators are used. The accelerators are integrated into a RISC-V core.
The remainder of the paper is structured as follows: Section 2 surveys the literature on related work. Section 3 describes the generic and realized versions of the proposed design approach. Section 4 introduces the case studies and provides details of how they are implemented. Section 5 presents and discusses the results of the case studies. Section 6 contains concluding remarks and some discussion of possible future work.

Related Works
In accordance with the focus of our work on generating heterogeneous manycore architectures, we provide here a review of related work on manycore design and custom hardware generation from high-level languages.
The first set of related works is on manycore generation. There are FPGA based manycore architectures developed by Sano et al. [30], Tanabe et al. [31] and Schurz et al. [32]. However, these studies develop a single architecture and do not propose any generic method for doing so. In contrast, Celerity [33], an accelerator-centric system-on-chip (SoC) design based on a combination of three tiers, does closely resemble our work. The first tier consists of five rocket cores capable of running Linux, whereas 506 smaller RISC-V cores reside in tier 2, with an accelerator in tier 3. The tiers are connected to each other with parallel links. The accelerator is generated using SystemC and high-level synthesis tools. In their design, the accelerator and the cores are placed in different tiers and all cores share the accelerator tier. In our design, by comparison, each core can have its own tightly-coupled accelerator, making the accelerator an instruction extension to the simple core. Additionally, our design starts from application development and uses application requirements to configure the architecture in terms of the number of cores, memory size, accelerator types, etc.
There are several tools available that generate hardware description from high-level languages. These tools support a variety of languages [34] including Lava [35] and Clash [36], which are functional languages similar to Chisel [21]. Clash and Lava are based on Haskell and they have compilers and support simulations. Additionally, Clash supports generating Verilog, VHDL and SystemVerilog code [37]. Chisel is based on Scala and has a compiler for simulation and generating Verilog code. The RISC-V tools that are used in this study require Chisel code for integrating accelerators into the rocket cores. Therefore, we have chosen Chisel as our high-level hardware description language. More commonly, tools take C and C-like languages as input to generate a hardware description. Some of these tools are Xilinx Vivado [38], Calypto's Catapult C [39], CoDeveloper from Impulse [40], eXCite from Y Explorations [41], Stratus from Cadence [42], and Symphony C from Synopsys [43]. In these tools, the developer is usually required to change their C-like code by adding pragmas and rewriting some code snippets to change the structure of the generated RTL design. Both Catapult [44] and CyberWorkBench [45] deal with generating controller and data paths. They require modifications to the C code and the data paths depend on the controller generation. In our design method, we do not generate any controller. The accelerators we generate consist of only data paths and control is implemented on the processing core. This difference can also be seen when comparing our work with that of Trajkovic et al. [46], who automatically generate processor cores from C code, including separate generation of the data path and the controller.
The Tensilica XPRES (Xtensa PRocessor Extension Synthesis) Compiler [47] automatically generates extensions, which are combinations of existing instructions. It uses a C/C++ compiler to analyze the application and find the candidate instructions. The instructions can be in the form of VLIW, vector, fused operations, or combinations of these. Therefore, the generated extensions are within the bounds of combinations of the existing instructions. Instruction extensions require modifications to the compiler and the decoder. The compiler is generated automatically. However, having a combination of instructions, including VLIW-style instructions, requires multiple parallel instruction decoders [47]. This increases the hardware cost, which may affect the clock frequency, and also limits the number of instructions that may potentially be executed in parallel [46].
Clark et al. [48] automated custom instruction generation using dataflow graphs. They discovered custom instruction candidates by exploring the dataflow graphs. They used a hardware library to estimate timing and area for the combined primitive operations (subgraphs). A few subgraphs that can be executed on the same hardware are combined to form a set of custom function units (CFUs). Performance and area are estimated for the set members and fed to a selection mechanism to choose the custom instruction. They also generalized the custom instruction to be used by a set of, or all, applications within the same domain. The compiler needs to create a dataflow graph of the application, perform pattern matching to find all the subgraphs matching the CFUs, and replace all the CFUs with the custom instructions.
Koeplinger et al. [49] automatically generated accelerators for FPGAs from applications described in a high-level language using patterns, such as map, reduce, filter, and groupBy. Their framework performs high-level optimizations and describes hardware using their own intermediate representation, which is based on a set of architectural templates. Their method requires the parallel patterns and pragmas to be present in the input code, which in turn requires changes to the compiler/simulator. With our method, however, the only change in the code is applied to the name of the action to be accelerated.
The SDSoC environment from Xilinx [50] only provides automatic generation of accelerators running on their Zynq and UltraScale+ platforms, which significantly limits portability across different target platforms.
The CAL2HDL tool, developed by Janneck et al. [51], was the first attempt to generate hardware descriptions from the CAL actor language. This tool transforms a CAL application into a language-independent XML model before the OpenForge synthesizer is used to translate the application to a hardware description in Verilog. CAL2HDL supports a subset of CAL and the time for Verilog design generation increases dramatically with the design complexity, which limits its wider applicability. It generates a platform-dependent design that can be used only on Xilinx FPGAs. Another hardware description generator for the CAL actor language was developed by Siret et al. [52]. Their tool generates VHDL; however, it lacks loop support. Bezati et al. [53] developed Xronos, aiming to support the ISO subset of the CAL actor language to generate RTL code. Xronos uses OpenForge and generates Verilog, similar to CAL2HDL; however, the authors claimed that Xronos operates faster and that the generated hardware uses fewer resources because of changes in the transformations applied to the IR that is used as the input to OpenForge. In our work, we did not generate hardware to cover the whole application but only the tagged (hot-spot) actions. Therefore, our back-end does not perform all of the transformations that are covered by the other CAL tools. Additionally, we used different IRs, and generated hybrid code (C+Chisel) that embeds custom instructions (in assembly) into the C code to communicate with the generated hardware blocks.
The generated heterogeneous architecture does not require an explicit mapping/resource discovery process to map the application onto corresponding components (core or accelerator), as our code generation framework already takes care of this by generating the code to fire the accelerator wherever necessary. However, if many of these extended cores (core + accelerator) are to be used in the same architecture, there is a need for a mapping approach such as HARD (Hybrid adaptive resource discovery) [54]. That approach may offer more functionality than is needed here, because the only significant information for a mapping approach would be the capability of the accelerator and the hot-spots of the applications. Hence, a simpler approach can be applied to map the hot-spots to the accelerators.
To summarize, we present a generic methodology to design domain-specific heterogeneous manycore architectures based on accelerators integrated into simple cores. We reveal the steps undertaken to integrate the accelerators into an open source core and use them via custom instructions. We automate custom hardware generation to facilitate design space exploration of heterogeneous architectures.

Design Approach
Our approach for designing domain-specific heterogeneous architectures was based on instruction augmentation through the integration of hardware accelerators into simple cores. We based these accelerators on an analysis of the application in which hot-spots suitable for hardware acceleration are identified. The overall design flow consisted of the following steps:
• Application development
• Analysis and code generation
• Accelerator integration
• System integration
In the following, we first describe a generic design flow and then describe its realization, starting from CAL dataflow applications, in Section 3.2. The generic design flow and its specific realization are illustrated in Figure 1. On the left hand side of the figure, where the generic flow is shown, one can see that the application is fed to the code generation and analysis tools. The analysis tool provides hot-spot information to the code generation tool, which sends feedback data to the developer. The generated hardware code is passed down to an accelerator integration tool to be integrated into a core and form a tile. These tiles are fed to a system integration tool to be connected together with a NoC to form a manycore. The generated software code is passed down to a native compiler. Together with a mapping tool, the compiler maps executables onto the generated architecture. The configuration parameters are used in different tools to determine the features of the hardware components. The right hand side of the figure shows the tools, languages and hardware components used in the realization of the design flow. Note that the system integration step is grayed out and the tools are not yet specified, because this step of the design flow is ongoing work and is not covered in this study.

Generic Design Flow
The generic design flow consists of the following steps: (1) application development in a suitable programming language; (2) analysis and code generation to generate hardware and low-level software for the intended architecture; (3) accelerator integration where the hardware accelerator is integrated with the basic core; and (4) system integration where all the accelerated cores are integrated and connected through a NoC. This generic description is independent of the programming language, programming model, tool and hardware.

Application Development
The design method aims to build an efficient architecture for a specific domain of applications by integrating task-specific custom hardware into simple cores. The architecture configurations are based on the requirements of the applications. Therefore, the design method starts from the application description.
From an application programming perspective, programming languages in which concurrency is inherent in the language are attracting increased attention in mainstream parallel computing compared to sequential languages. These concurrent languages make explicit the inherent parallelism of the applications. In the proposed method, to target a manycore architecture, the application development should be performed in a programming language that can express different levels of parallelism, such as instruction-level, task-level, and data-level parallelism. This facilitates generation of the task-specific cores, which are the fundamental components of the target architectures.
A developer can use the feedback from the analysis and code generation tools to improve the application. Additionally, the feedback can be used to adjust the application to make it more suitable for use in architecture development, especially in cases where acceleration of the application is desired, but the implementation lacks a distinguishable hot-spot.

Analysis
The most compute-intensive parts of the application (hot-spots) are implemented as custom hardware to be integrated into the base core. To identify the compute-intensive parts of the application, analysis of the application is required (often referred to as profiling). Analysis methods are usually divided into static and dynamic analysis methods [55][56][57]. Static analysis can be used to highlight possible coding errors, to mathematically prove properties of a program via formal methods, or to count the number of operations and operands. It can also be used to estimate execution times for applications which have constant or static behavior. However, dynamic analysis is required to obtain execution measures for applications with dynamic behavior, and to determine the most frequently executed, computationally-intensive code blocks. Analysis information useful for hot-spot identification includes the execution rate, the number of operations and operands, the complexity of the operations, and the execution time.
During the analysis, architectural features can also be generated, including memory requirements, communication characteristics, and the required number of cores. These features can be adjusted during code generation for optimization purposes.
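As a small illustration of the static side of this analysis, the sketch below counts operators and operand reads in a code fragment without executing it. It is only a stand-in: it parses Python source with the standard `ast` module rather than a dataflow program, and the function name `count_operations` and the sample kernel are assumptions made for this example.

```python
import ast
from collections import Counter


def count_operations(source: str):
    """Statically count binary arithmetic operators and operand reads."""
    operators = Counter()
    operand_reads = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp):
            # Record the operator kind, e.g. "Mult", "Add", "Div".
            operators[type(node.op).__name__] += 1
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            # A variable that is read (not assigned) counts as an operand.
            operand_reads += 1
    return operators, operand_reads


# Hypothetical kernel, used only to exercise the counter.
kernel = "r = (a * a + b * b) / (c - d)"
operators, operand_reads = count_operations(kernel)
print(dict(operators), operand_reads)
```

Counts of this kind say nothing about how often the fragment runs, which is why they need to be combined with execution rates from dynamic analysis before a hot-spot can be named.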

Code Generation
Once identified, hot-spots are candidates for implementation as accelerators. They are implemented as custom hardware, either by describing them in a high-level language such as C and then using HLS tools to generate the RTL design, or directly in a hardware description language. The code generation step consists of both software and hardware generation. Software code generation results in native code with embedded instructions for the accelerator. This native code can then be translated to the target machine code using proprietary tools. Hardware generation involves generating hardware descriptions for the custom hardware blocks.

Accelerator Integration
The custom hardware block, which was generated in the previous step, needs to be integrated into the base core so that the hot-spot can be delegated to the accelerator while the rest of the application is executed on the core. The accelerator can be connected to the data bus and be memory mapped, it can be connected through custom interfaces, or it can act as an instruction extension. However, instruction extensions need changes in the compiler unless the instructions are already in the instruction set and supported by the compiler. During integration, the custom hardware can be interfaced to memory to provide direct memory access. Moreover, the custom hardware can be connected to a network-on-chip through a network interface, enabling the other cores on the network to make use of the custom hardware. In short, the integration method can vary based on different aspects, such as the features of the base core, architectural requirements, and application requirements.
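To make the instruction-extension option more concrete, the following sketch packs a 32-bit instruction word in the RISC-V custom-0 opcode space. It assumes the R-type-like field layout used by the Rocket custom coprocessor (RoCC) interface (funct7, rs2, rs1, xd, xs1, xs2, rd, opcode); the generic flow described above does not prescribe any particular encoding, and the function name and argument values are hypothetical.

```python
CUSTOM0_OPCODE = 0b0001011  # RISC-V custom-0 major opcode


def encode_custom0(funct7, rd, rs1, rs2, xd=1, xs1=1, xs2=1):
    """Pack one custom-0 instruction word (assumed RoCC-style layout)."""
    fields = (("funct7", funct7, 7), ("rs2", rs2, 5), ("rs1", rs1, 5),
              ("xd", xd, 1), ("xs1", xs1, 1), ("xs2", xs2, 1), ("rd", rd, 5))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (xd << 14)
            | (xs1 << 13) | (xs2 << 12) | (rd << 7) | CUSTOM0_OPCODE)


# Example: function code 0, result in x10, sources in x11 and x12.
word = encode_custom0(funct7=0, rd=10, rs1=11, rs2=12)
print(f".word 0x{word:08X}")
```

Emitting the word through an assembler directive such as `.word` is one way to use an instruction that the compiler does not know about, which sidesteps the compiler changes mentioned above at the cost of fixed register choices.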

System Integration
The prior steps of the design flow produce tiles for individual tasks in the applications. To execute the entire application, or a set of applications, the tiles need to be connected to each other to form a manycore architecture. In particular, for dataflow applications that have a significant amount of core-to-core (or tile-to-tile) communication, the tiles must be connected with a proper infrastructure that supports tile-to-tile communication. The architectural information gathered during the analysis and code generation stages is used in this step when configuring the network-on-chip (NoC) and when determining whether to add tiles containing only memory to provide more on-chip memory.
In this step, an important decision is the choice of NoC topology. The efficiency of the topology might change based on the domain of the applications. However, for dataflow applications, we suggest a 2D mesh structure based on our experience during previous work [22,58] with the Epiphany architecture [59]. This structure provides efficient core-to-core communication for dataflow applications in terms of bandwidth and latency. With a proper routing algorithm, such as X-Y routing, it becomes deadlock free. In addition, it is scalable, which enables different sized manycore architectures to be built. There are several other NoC topologies, such as torus, ring, star, tree, crossbar, and hypercube [60][61][62]. A torus has wrap-around connections between the opposite edges of the mesh. However, it is preferable to reserve the edge connections for extensions to the NoC and external connections. The crossbar topology establishes all-to-all connections; however, this becomes quite expensive in terms of area usage when the size of the network increases. In a star topology, all messages pass through a central router. This router is a potential congestion point, especially for applications with intensive communication needs. In tree topologies, the root node and the nodes close to it become a bottleneck. In the hypercube topology, the number of neighbours of each node is equal to the degree of the topology. To increase the number of nodes in such topologies, one needs to increase the degree of the topology, and hence increase the number of connections to each node. From a scalability point of view, this is an obstacle, as the structure of the NoC needs to be changed with each dimension increase.
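The routing and scalability arguments above can be sketched in a few lines. The example below is a minimal model, not a NoC implementation: `xy_route` follows dimension-ordered (X first, then Y) routing on a 2D mesh, and the two degree helpers contrast the bounded number of links per mesh node with the growing number per hypercube node. All function names and the coordinate convention are assumptions made for this illustration.

```python
def xy_route(src, dst):
    """Return the (x, y) nodes visited from src to dst, X dimension first."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:          # travel along X until the column matches
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:          # then travel along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def mesh_degree(x, y, width, height):
    """Number of neighbours of node (x, y) in a width x height mesh."""
    candidates = ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
    return sum(0 <= nx < width and 0 <= ny < height for nx, ny in candidates)


def hypercube_degree(num_nodes):
    """Links per node in a hypercube; num_nodes must be a power of two."""
    if num_nodes < 1 or num_nodes & (num_nodes - 1):
        raise ValueError("a hypercube needs 2**d nodes")
    return num_nodes.bit_length() - 1


print(xy_route((0, 0), (2, 1)))                    # path on the mesh
print(mesh_degree(1, 1, 4, 4), hypercube_degree(16), hypercube_degree(64))
```

Because every packet finishes its X movement before turning into Y, no packet ever turns from Y back into X, which removes the cyclic channel dependencies that would otherwise allow deadlock. A mesh node never needs more than four links regardless of network size, whereas each doubling of a hypercube adds one link to every node.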

Realization of the Design Flow
We now describe the realization of the design flow in terms of the tools, languages, and other components that have been applied in our proposed approach.

Application Development
Applications that process a continuous stream of data and have high performance requirements, such as video coding/decoding, radar signal processing, and wireless communication, can be implemented efficiently using a dataflow programming model [63]. In this model, applications consist of a number of tasks, which are connected to one another to form the execution flow. They can execute concurrently and express parallelism. We adopted such a dataflow programming model, and developed our applications as streaming applications. An example streaming application can be seen in Figure 2, in which the raw input data flow through different tasks and finally produce the output data. We chose the CAL actor language [26] as the application development language because of its concurrent nature, suitability for streaming applications, and ease of use.
CAL [26] is a high-level dataflow programming language [64] that has been adopted by the MPEG Reconfigurable Video Coding (RVC) [65] working group as a part of their standardization efforts. A CAL program consists of actors, which are stateful operators. Actors consist of code blocks called actions that can transform input data streams into output data streams. The state of the actor usually changes while performing the transformation. When there is more than one actor in a program, these actors are connected with channels/ports to form a network. Each actor consumes the tokens on its input ports and produces new tokens on its output ports.
The execution order of the actions can be defined with an explicit schedule in the form of a state machine together with priority definitions for the actions. Additionally, CAL provides a network language to instantiate actors, define channels, and connect the actors via these channels.
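The actor model described here can be mimicked in a few lines of ordinary code. The sketch below is written in Python rather than CAL and is only meant to show the concepts: `Accumulator` is a stateful actor, `Scale` is a stateless one, and FIFO queues play the role of channels. The class names, the single `fire` method per actor, and the simple scheduler loop are all assumptions made for this illustration.

```python
from collections import deque


class Accumulator:
    """Stateful actor: consumes one token and emits the running sum."""

    def __init__(self):
        self.total = 0

    def fire(self, inp, out):
        if not inp:                 # no token available, cannot fire
            return False
        self.total += inp.popleft()
        out.append(self.total)
        return True


class Scale:
    """Stateless actor: multiplies each incoming token by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def fire(self, inp, out):
        if not inp:
            return False
        out.append(inp.popleft() * self.factor)
        return True


def run_network(tokens):
    """Connect Accumulator -> Scale with a channel and run until idle."""
    source, channel, sink = deque(tokens), deque(), deque()
    acc, scale = Accumulator(), Scale(2)
    progressed = True
    while progressed:
        fired_acc = acc.fire(source, channel)
        fired_scale = scale.fire(channel, sink)
        progressed = fired_acc or fired_scale
    return list(sink)


print(run_network([1, 2, 3]))
```

The actors never call each other directly; they only read from and write to queues, which is what allows them to be placed on separate cores and executed concurrently.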
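A schedule with priorities can be illustrated with a small interpreter-style sketch, again in Python rather than CAL. The hypothetical `BlockSum` actor reads a block length, sums that many tokens, and emits the sum. `SCHEDULE` plays the role of the state machine, and `PRIORITY` resolves the case where both `done` and `step` are enabled; every name here is an assumption made for the example.

```python
from collections import deque

# State machine: current state -> {action name: next state}.
SCHEDULE = {
    "init": {"load": "run"},
    "run": {"done": "init", "step": "run"},
}
# Earlier entries win when several actions are enabled at once.
PRIORITY = ["done", "step", "load"]


class BlockSum:
    def __init__(self):
        self.state, self.remaining, self.total = "init", 0, 0

    # Guards: conditions under which each action may fire.
    def can_load(self, inp):
        return len(inp) > 0

    def can_step(self, inp):
        return len(inp) > 0

    def can_done(self, inp):
        return self.remaining == 0

    # Actions: consume tokens, update state, produce tokens.
    def load(self, inp, out):
        self.remaining, self.total = inp.popleft(), 0

    def step(self, inp, out):
        self.total += inp.popleft()
        self.remaining -= 1

    def done(self, inp, out):
        out.append(self.total)

    def fire(self, inp, out):
        """Fire the highest-priority enabled action allowed in this state."""
        allowed = SCHEDULE[self.state]
        for name in PRIORITY:
            if name in allowed and getattr(self, "can_" + name)(inp):
                getattr(self, name)(inp, out)
                self.state = allowed[name]
                return name
        return None


def run(tokens):
    actor, inp, out, trace = BlockSum(), deque(tokens), [], []
    while (fired := actor.fire(inp, out)) is not None:
        trace.append(fired)
    return out, trace


print(run([2, 10, 20, 1, 5]))
```

Without the priority of `done` over `step`, the actor could keep consuming tokens that belong to the next block once the current block is complete, so the priority is part of the actor's correctness rather than an optimization.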
Another advantage of using CAL is the opportunity to integrate application analysis and code generation (both software and hardware) in our Cal2Many framework.

Analysis
The analysis method and the tools employed depend on the programming language used for application development.For instance, for applications written in C, one can use Valgrind [67] and gnu profiler (gprof) [68] to obtain the total number of instruction fetches per function, the number of function calls, the total execution time of the application and the amount of time spent in each function.We used these tools for one of our Autofocus criterion calculation implementations, which is hand-written in C.
For the applications developed in CAL, we used TURNUS [69], which is a framework for profiling dataflow applications running on heterogeneous parallel systems.This framework implements a CAL dataflow profiler on top of the Open RVC-CAL Compiler (Orcc) [70].TURNUS provides high level application and architecture modeling as well as tools for application profiling, system level performance estimation, and optimization.
TURNUS provides both static and dynamic analysis information relevant for both the software and hardware aspects, however our focus is on the software aspects.Some of the key analysis information provided by this tool are the firing rates of the actions, the number of executed operations, the input/output tokens consumed and produced, and communication buffer utilization.To obtain dynamic analysis data such as firing rates of the actions, the application was executed at least once.For each executed action, both the computational load and the data-transfers and storage load were evaluated.The computational load, given as weights in Figure 3, was measured in terms of executed operators and control statements (i.e., comparison, logical, arithmetic and data movement instructions).This information was used for hot-spot identification, while the communication information was used for configuring the network-on-chip.
This tool can provide performance estimation based on an architecture model that is an undirected graph where each node represents an operator (a processing element such as a CPU or an FPGA in the terms of [71]) or a communication element (e.g., bus and memories).However, the performance estimation is not critically necessary to identify the hot-spots of the applications.Moreover, we do not want to limit our hot-spot identification to any platform.Therefore, we do not focus on performance estimation.
Other efforts profile applications at different levels, such as the basic-block level in static single assignment (SSA) form, to find the most frequently executed instruction sequences [72,73]. However, we maintained the analysis at a higher level of abstraction to keep it simple and architecture agnostic. CAL actions are meant to be small code blocks, and yet they can be significantly compute-intensive by including heavy floating-point operations such as multiplication, division, and square root. Therefore, analyzing the application at the action level is sufficient to identify hot-spots.
The case studies implemented in CAL were analyzed statically and dynamically via TURNUS. The firing rates of the actions together with the number of executed operations are sufficient to identify the hot-spots. The tool provides weights, which can be interpreted as a compute density, for each action based on its firing rate, its number of operations and the number of operands used in the operations. This information can be used directly for hot-spot identification. Figure 3 shows an overview of the actions in the QRD case study and their weights in an actor. One can see that the calculate_boundary action is the hot-spot of the given actor despite not being the most frequently executed action. This is due to the larger number of operations in this action. Some actors could have dynamic execution paths (different execution order and frequency for the actions) depending on the input tokens or the state of the actor. In such cases, the application can be executed a number of times to obtain an average value for the analysis data.
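The idea that an action can dominate the workload without being the most frequently fired one can be illustrated with a small sketch. It is not the TURNUS weight formula, which is not given here; it simply assumes that the weight of an action is its share of the total executed operations. The action name read_input and all numbers are invented for illustration.

```python
# Hypothetical profiling data: action -> (number of firings, operations per firing).
# Only the two calculate_* names come from the case study; the numbers are made up.
profile = {
    "calculate_boundary": (16, 60),
    "calculate_inner": (120, 6),
    "read_input": (256, 2),
}


def action_weights(profile):
    """Return each action's share of the total executed operations."""
    totals = {name: firings * ops for name, (firings, ops) in profile.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


def hot_spot(profile):
    """Return the action with the largest weight."""
    weights = action_weights(profile)
    return max(weights, key=weights.get)


if __name__ == "__main__":
    for name, weight in action_weights(profile).items():
        print(f"{name}: {weight:.2f}")
    print("hot-spot:", hot_spot(profile))
```

With these invented numbers, calculate_boundary is fired least often but still carries the largest share of operations, which mirrors the situation described for Figure 3.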
Gathering the analysis data and identifying the hot-spots are performed automatically by tools. Once the hot-spot actions are identified, they are tagged manually for the code generation.

Code Generation
To automate the code generation, we used our in-house developed Cal2Many framework, which is a source-to-source compiler [20]. Figure 4 gives an overview of the framework. It takes a CAL application as input and transforms it into two consecutive intermediate representations, namely Actor Machine (AM) [74] and Action Execution Intermediate Representation (AEIR) [20]. After the transformations and utilization of the necessary libraries, it finally generates native code in various target-specific languages.
In CAL, actors execute by firing actions that satisfy conditions such as the state of the actor, the availability of input tokens, and their values. While generating the code, the framework has to take the ordering of these conditions into account together with the number of tests performed before firing an action. We adapted a simple actor model called the actor machine to schedule the testing of the firing conditions of an action. To get closer to the target imperative languages, we used the second intermediate representation (AEIR), which is a data structure that allows generating code in imperative languages such as C, C++ and Java. It has constructs for function declarations, expressions, statements, and function calls. Translation of AM to AEIR consisted of two main passes. The first pass dealt with the translation of CAL constructs to imperative constructs including functions, statements, expressions, actions, and variable declarations. The second pass dealt with the translation of the AM to a sequential action scheduler.
We previously developed four different back-ends, i.e., sequential C for uni-processors, aJava and aStruct for the Ambric architecture, native-C for the Epiphany architecture, and a target-specific language (Scala subset) for the ePuma [75] and EIT [76] SIMD architectures, as shown in Figure 4. The C and Chisel hybrid back-end was developed as part of this work. We used the C back-end to generate the software code to be executed on the base RISC-V processor core. We modified the parallel C back-end to generate RISC-V compilable C code, including embedded custom instructions. Code generation in our design flow consists of two passes. The first pass generates the architecture-specific software code, which is C code with embedded custom instructions provided by the RISC-V ISA. The second pass generates the hardware blocks in Chisel, corresponding to the hot-spots in the CAL application. To generate the hardware blocks, we developed a new back-end that generates a combination of hardware and software code, using the Chisel and C languages, respectively. The actions that were tagged as hot-spots during the analysis step were converted into hardware blocks in Chisel acting as accelerators, whereas the rest of the application was converted into target-specific software, which at present is in C. The code generator makes use of a library of hardware blocks consisting of efficient implementations of floating-point division [77] and square root operations.
The bodies of the tagged actions are replaced with custom instruction calls to trigger the hardware accelerators. The necessary input data are forwarded to the accelerators and the results are read through these instructions. The bit fields of the instruction can be seen in Figure 5, where xs1 and xs2 indicate the usage of the source registers, and xd indicates whether the accelerator will write a result to the destination register. If the xd bit is set, the core will wait until the accelerator finishes the computation. The generated hardware is later integrated into a rocket core through the rocket chip generator.
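To make the role of the bit fields concrete, the following sketch packs and unpacks a 32-bit custom instruction word. The field positions (funct in bits 31–25, rs2 in 24–20, rs1 in 19–15, xd, xs1 and xs2 in bits 14–12, rd in 11–7, opcode in 6–0) and the custom-0 to custom-3 opcode values follow the RoCC convention as we understand it; Figure 5 remains the authoritative layout. The function names are hypothetical.

```python
# Major opcodes reserved for custom-0 .. custom-3 (assumed from the RISC-V opcode map).
CUSTOM_OPCODES = {0: 0x0B, 1: 0x2B, 2: 0x5B, 3: 0x7B}


def encode_rocc(funct, rd, rs1, rs2, xd, xs1, xs2, custom=0):
    """Pack the fields of a RoCC-style custom instruction into a 32-bit word."""
    assert 0 <= funct < 128
    assert all(0 <= reg < 32 for reg in (rd, rs1, rs2))
    assert all(bit in (0, 1) for bit in (xd, xs1, xs2))
    return (funct << 25 | rs2 << 20 | rs1 << 15 | xd << 14
            | xs1 << 13 | xs2 << 12 | rd << 7 | CUSTOM_OPCODES[custom])


def decode_rocc(word):
    """Split a 32-bit instruction word back into its fields."""
    return {
        "funct": (word >> 25) & 0x7F,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "xd": (word >> 14) & 1,
        "xs1": (word >> 13) & 1,
        "xs2": (word >> 12) & 1,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


if __name__ == "__main__":
    # Both source registers are used and a result is expected (xd = 1),
    # so the core would wait for the accelerator's response.
    word = encode_rocc(funct=0, rd=10, rs1=11, rs2=12, xd=1, xs1=1, xs2=1)
    print(hex(word), decode_rocc(word))
```

In the generated C code such a word would typically be emitted through an inline-assembly macro; the sketch only shows how the fields combine.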

Cal2Chisel Back-end
The common language used in the RISC-V environment is Chisel [21], which is an open-source hardware description language developed at UC Berkeley. It is embedded in the Scala language [78]. Some features of the Chisel language are parameterized generators, algebraic construction, abstract data types, interfaces, bulk connections, hierarchical, object-oriented and functional constructions, and multiple clock domains. The Chisel compiler, which is provided by UC Berkeley, can generate low-level synthesizable Verilog code.
The new back-end makes use of the intermediate representations (AM and AEIR) provided by the Cal2Many framework to generate a combination of C (software) and Chisel (hardware) code. The C code for the untagged (non-hot-spot) actions is generated in a single pass in which the AEIR structure is visited once. However, when the tagged (hot-spot) actions are visited in the AEIR structure, the Chisel code and the custom instruction calls in C are generated in two phases. In the first phase, the entire action structure is scanned to generate a static single assignment (SSA) form by generating new versions of variables for each assignment and placing phi functions where necessary. Wires are defined for the first set of variables and the corresponding Chisel code is generated. Then, information about the usage of actor-wide global variables and input and output ports is gathered to identify the inputs and the outputs of the accelerator. An instance of the Chisel code for inputs and outputs can be seen as follows:

In the second phase, the custom instructions in C/assembly and the rest of the Chisel code are generated. The custom instructions are inlined as assembly code via macros defined in a header file. The inputs and outputs are identified, i.e., the global variables to be read are identified as inputs and the ones to be modified are identified as both inputs and outputs. All variables are defined (except the first set, which is already defined in the first phase). While going through the statements, the phi functions are replaced by multiplexers, and common operations such as floating-point operations and complex number operations are replaced with hardware blocks from a manually developed library. Finally, the outputs are connected to the final versions of the corresponding variables. In the current implementation of the back-end, loops and arrays are not supported. If only one element of a global array is used in the tagged action, the index is used outside the hardware block to access the array, and the value becomes an input to the hardware.
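The two phases described above, SSA renaming followed by the replacement of phi functions with multiplexers, can be sketched on a toy statement list. This is a simplified illustration and not the Cal2Chisel implementation: the statement encoding, the function name to_ssa and the textual Mux(...) output are assumptions, and only straight-line code with non-nested if/else is handled.

```python
def to_ssa(statements, inputs):
    """Rename assignments so that every variable version is assigned exactly once.

    statements: list of ("assign", target, tokens) or
                ("if", condition_tokens, then_assigns, else_assigns).
    inputs:     names that are live on entry (they become version 0).
    Assumes a variable assigned in only one branch was already defined before.
    """
    counter = {name: 0 for name in inputs}            # highest version issued
    current = {name: f"{name}_0" for name in inputs}  # version visible right now
    lines = []

    def emit_assign(target, tokens, env):
        rhs = " ".join(env.get(tok, tok) for tok in tokens)
        counter[target] = counter.get(target, -1) + 1
        env[target] = f"{target}_{counter[target]}"
        lines.append(f"{env[target]} = {rhs}")

    for stmt in statements:
        if stmt[0] == "assign":
            _, target, tokens = stmt
            emit_assign(target, tokens, current)
            continue
        _, cond_tokens, then_branch, else_branch = stmt
        cond = " ".join(current.get(tok, tok) for tok in cond_tokens)
        then_env, else_env = dict(current), dict(current)
        for _, target, tokens in then_branch:
            emit_assign(target, tokens, then_env)
        for _, target, tokens in else_branch:
            emit_assign(target, tokens, else_env)
        # A phi function merges the two versions; in hardware it is a multiplexer.
        for name in sorted(set(then_env) | set(else_env)):
            a, b = then_env.get(name), else_env.get(name)
            if a is None or b is None or a == b:
                continue
            counter[name] += 1
            current[name] = f"{name}_{counter[name]}"
            lines.append(f"{current[name]} = Mux({cond}, {a}, {b})")
    return lines, current


if __name__ == "__main__":
    action = [
        ("assign", "t", ["a", "*", "b"]),
        ("if", ["a", ">", "b"],
         [("assign", "t", ["t", "+", "a"])],
         [("assign", "t", ["t", "-", "b"])]),
        ("assign", "out", ["t", "*", "t"]),
    ]
    for line in to_ssa(action, inputs=["a", "b"])[0]:
        print(line)
```

Each emitted name corresponds to one wire, and the final version of each output variable is what would be connected to the accelerator output.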
We observed that the critical path of the generated hardware in many cases becomes quite long and significantly lowers the maximum clock frequency. To alleviate this, it is necessary to introduce pipelining in the accelerators. Therefore, a pipelining feature was added to the back-end by using pipelined hardware blocks for arithmetic operations. Delay registers were inserted into the data paths where they need to be synchronized.
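Why delay registers are needed can be seen from a small sketch that balances a dataflow graph of pipelined blocks: whenever two operands of a block become valid in different cycles, the earlier one must be delayed. The graph encoding, the function name balance_delays and the block latencies are hypothetical and only illustrate the principle.

```python
def balance_delays(nodes, latency):
    """Compute when each value is valid and how many delay registers each edge needs.

    nodes:   name -> list of operand names, listed in topological order
             (primary inputs have an empty operand list).
    latency: name -> pipeline depth of the block producing that value (inputs: 0).
    """
    ready = {}   # cycle in which each value becomes valid
    delays = {}  # (producer, consumer) -> number of delay registers on that edge
    for name, operands in nodes.items():
        start = max((ready[op] for op in operands), default=0)
        for op in operands:
            delays[(op, name)] = start - ready[op]
        ready[name] = start + latency.get(name, 0)
    return ready, delays


if __name__ == "__main__":
    # root = sqrt(a * b + c) with invented latencies for the arithmetic blocks.
    nodes = {"a": [], "b": [], "c": [],
             "prod": ["a", "b"], "sum": ["prod", "c"], "root": ["sum"]}
    latency = {"prod": 3, "sum": 2, "root": 5}
    ready, delays = balance_delays(nodes, latency)
    print("total latency:", ready["root"])
    print({edge: d for edge, d in delays.items() if d > 0})
```

In this example the input c must be delayed by three cycles so that it reaches the adder together with the product.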
Figure 6 shows the generated C and Chisel code (without delay registers) for a tagged CAL action. In the CAL code, indices are used for accessing the array elements. However, to avoid transferring the whole array to the accelerator, these indices are moved to the C code and only single array elements are transferred. The custom instruction is called twice with different funct fields (refer to Figure 5) to provide all the inputs to the accelerator. The funct field is used to let the interface know when all the inputs are provided so that they can be fed to the accelerator. The floating-point operations are performed by instantiating pre-defined hardware blocks such as FPAdd, FPMult and fpSqrt, as seen in the generated Chisel code (Figure 6). The code also shows the connections between the modules and the different variables.

Accelerator Integration
We have used the RISC-V environment for our integration step. RISC-V is an open instruction set architecture based on reduced instruction set computing (RISC) principles. It originated at the University of California (UC), Berkeley. Several processing cores have been developed that implement this instruction set, such as the Berkeley Out-of-Order Machine (BOOM) [79], rocket core [27], Sodor CPU [80], picoRV32 [81] and scr1 [82]. We chose the rocket core [27] for our case studies as it has an interface to connect custom hardware blocks that enables us to create cores with application-specific augmentations. Additionally, the rocket chip generator produces an emulator and synthesizable Verilog code for this core and its surrounding components.
Rocket core [27] is an in-order scalar processor based on the RISC-V ISA. It features a five-stage pipeline and has an integer ALU, an optional FPU and L1 data and instruction caches. This core supports up to four accelerators via an interface called rocket custom co-processor (RoCC) [83], as shown in Figure 7. We can see a tile consisting of the rocket core with L1 cache and an accelerator connected to the core and the cache via the RoCC interface. Custom instructions of the RISC-V ISA can be forwarded to the accelerator through this interface. Depending on a bit field in the custom instruction, the core might halt until it receives a response from the accelerator.
A rocket core can be generated via the rocket chip generator using a Scala program that invokes the Chisel compiler to produce RTL describing a complete system-on-chip (SoC). This generator allows the developers to configure the generated core by changing parameters such as cache sizes, FPU usage, number of cores, and accelerator usage. The rocket chip environment also includes the cycle-accurate Verilator [84] for simulations and a cycle-accurate C++ emulator. The accelerator was integrated into the rocket core using the RoCC interface in several steps. First, the Chisel code for the accelerator was copied into the Chisel source folder of the rocket chip generator. Then, a Chisel class (which becomes a module in Verilog) was added to the rocket core source. This class extends the RoCC interface, instantiates the accelerator, and connects the accelerator I/O wires to the RoCC interface I/O wires. It needs to be instantiated in a configuration class within the configuration file of the rocket core, where the core components can be added, removed or modified. Custom instructions are bound to the accelerators in this configuration class. Each custom instruction can be bound to a different accelerator. Because four custom instructions are supported by the RISC-V ISA, the number of accelerators that can be connected to the same core is limited to four. However, this number can be increased by using the funct field of the custom instruction as an identifier that selects the accelerator when connecting the RoCC interface I/O wires to the accelerator I/O wires. The RISC-V ISA supports two 64-bit source registers in the custom instructions. If the accelerator requires more input data at once, the extended RoCC interface stores the input data until the last inputs arrive and then feeds all the data to the accelerator at once. The accelerator returns the result through the same interface.
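The buffering behaviour of the extended RoCC interface can be mimicked with a short behavioural model. This is a software sketch under stated assumptions and not the Chisel implementation: the class name BufferedRoccInterface, the use of one specific funct value to mark the last operand pair, and the toy accelerator are all hypothetical.

```python
class BufferedRoccInterface:
    """Collect operands over several custom-instruction calls and fire the
    accelerator once the call marked as the last one arrives."""

    def __init__(self, accelerator, last_funct):
        self.accelerator = accelerator  # any callable taking the buffered operands
        self.last_funct = last_funct    # funct value that marks the final operand pair
        self.buffer = []

    def issue(self, funct, rs1_value, rs2_value):
        """Model one custom instruction carrying two source register values."""
        self.buffer.extend([rs1_value, rs2_value])
        if funct != self.last_funct:
            return None                 # operands are stored, nothing is fired yet
        operands, self.buffer = self.buffer, []
        return self.accelerator(*operands)


if __name__ == "__main__":
    # Toy accelerator with four inputs, so two instruction calls are needed.
    iface = BufferedRoccInterface(lambda a, b, c, d: a * b + c * d, last_funct=1)
    print(iface.issue(0, 2, 3))  # first pair is buffered
    print(iface.issue(1, 4, 5))  # second pair arrives, accelerator fires
```

The same funct field could also serve as a selector among several accelerators; the model above only covers the input synchronization.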
The rocket chip configuration file includes many different configurations to determine the components and their settings to be used in the generated architectures. A new configuration class is needed to instantiate the new rocket core integrated with the accelerator. Once the new configurations are added, the generator can produce a cycle-accurate emulator or synthesizable Verilog code for the new tile consisting of the base core, memory, and the accelerator.
This step of the design flow produces tiles containing different components including processor, accelerator, memory and the connections between these components. Theoretically, the types of the tiles can be:
• Processor Tile, consisting of a processing core, local memory and optionally an accelerator
• Memory Tile, consisting of only memory
• Accelerator Tile, consisting of only an accelerator
The tile types are illustrated in Figure 8. Thus far, the rocket chip generator allows us to generate processor tiles with and without the accelerators. However, we plan to extend the tile generation to cover all types of tiles.

[Figure 8: the three tile types (Processor Tile, Memory Tile, Accelerator Tile), composed of core, memory, NoC interface and accelerator blocks.]

System Integration
The last step of the design flow is system-level integration of the tiles. This step was not realized in this study; hence, we do not have a generated manycore architecture. The rocket chip generator has configurations to generate architectures with more than one core. However, it uses the cache to connect the cores. Because this is not as efficient as having a network-on-chip (to handle cases with intense core-to-core communication), we do not generate manycore architectures via the rocket chip generator. This step of the design method is being pursued in our ongoing work with the aim of adding support for a network-on-chip to the rocket chip generator.

Case Studies and Implementations
This section provides the details of the implementation of the case studies Autofocus criterion calculation and QR decomposition.
The case studies are considered as a proof-of-concept for the design method, and are used to evaluate the hardware back-end of the Cal2Many framework. The first case study is QR decomposition, which is implemented in the CAL actor language. The second case study is an Autofocus criterion calculation using cubic interpolation. Different versions of this application are implemented in sequential and dataflow fashions, in the C and CAL languages, respectively. The C version is implemented as a proof-of-concept, whereas the CAL version is implemented to evaluate the automatic code (C and Chisel) generation with a more complicated accelerator. The Cal2Many framework is used for automatically generating C and Chisel code for the CAL implementations of the case studies. Single tiles are produced at the end of each case study.
In the case studies where the accelerators are automatically generated, hand-written accelerators are also developed and used as reference implementations for evaluation. The accelerators are integrated into a rocket core through the RoCC interface. Verilog code for the accelerators is generated and synthesized using Xilinx tools. Additionally, rocket cores with and without the accelerators are synthesized individually. The accelerators are integrated into the rocket core in three steps:
• Connect the RoCC interface to the I/O of the accelerator.
• Adjust the core configuration and bind the accelerator to a custom instruction.
• Adjust the platform configuration to use the new core configuration, which includes the accelerator.
The following sections provide details of the case studies and their corresponding implementations.

QR Decomposition
QR decomposition (QRD) or QR factorization is the decomposition of a matrix into an orthogonal matrix Q and an upper triangular matrix R [85]. The decomposition equation for a square matrix A is simply A = QR. The matrix A does not necessarily need to be square. The equation for an m × n matrix, where m ≥ n, is as follows:

$$A = QR = Q \begin{bmatrix} R_1 \\ 0 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix} = Q_1 R_1,$$

where $R_1$ is an n × n upper triangular matrix, $Q_1$ is m × n, and $Q_2$ is m × (m − n).

QRD is used in numerous applications for replacing matrix inversions to avoid precision loss and reduce the number of operations. It is also a part of the solution to the linear least squares problem and the basis of an eigenvalue algorithm (the QR algorithm). There are several different methods to perform QRD, such as the Givens Rotations, Householder and Gram-Schmidt methods [22].
The method employed in this work for decomposition is Givens Rotations with a systolic array implementation [86]. The structure of the parallel implementation is given in Figure 9. There are 10 actors of two different types, namely boundary and inner actors. These actors can be mapped onto separate cores. However, in this case study, they are manually converted into actions that are combined in a single actor and executed on a single core. When these actors are combined, a few further control actions are required to help with communication and scheduling. All of the actions can be seen in Figure 3.
The following computations are performed in the calculate_boundary and calculate_inner actions. According to the analysis tool, the calculate_boundary action is the hot-spot of the application, as shown in Figure 3. Therefore, Chisel code is generated for this action while C code is generated for the remainder of the application. The hardware for the action is also implemented manually as a baseline for comparison.
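For orientation, the textbook real-valued form of the two cell types in a Givens-rotation systolic array is sketched below. It is an assumption that the calculate_boundary and calculate_inner actions follow this standard form; the function names boundary_cell and inner_cell are hypothetical. The boundary cell takes two values and returns three (the rotation parameters c and s and the updated element), which matches the two inputs and three outputs of the hand-written QRD accelerator.

```python
import math


def boundary_cell(r, x):
    """Return (c, s, r_new) such that the rotation maps the pair (r, x) to (r_new, 0)."""
    r_new = math.hypot(r, x)          # sqrt(r*r + x*x)
    if r_new == 0.0:
        return 1.0, 0.0, 0.0          # nothing to rotate
    return r / r_new, x / r_new, r_new


def inner_cell(c, s, r, x):
    """Apply the rotation (c, s) to a stored element r and an incoming element x.

    Returns the updated stored element and the value passed on to the next row.
    """
    return c * r + s * x, -s * r + c * x


if __name__ == "__main__":
    c, s, r_new = boundary_cell(3.0, 4.0)
    print(c, s, r_new)
    print(inner_cell(c, s, 3.0, 4.0))  # same pair: second value is (close to) zero
```

The division and square root in the boundary cell are the heavy floating-point operations that make it the natural candidate for acceleration.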
RISC-V instructions support one destination register. However, the QRD accelerator returns three floating-point numbers, which require at least two 64-bit registers. Therefore, the custom instruction is called twice for firing the hand-written accelerator: once for sending the input data to the accelerator and reading the first two results, and once for reading the last result. These operations are distinguished via the funct bits of the custom instruction within the extended RoCC interface, which stores the last result until the second instruction call. The generated C + Chisel accelerator requires three custom instruction calls because it has five floating-point inputs and three outputs. The custom instruction is called twice to send the inputs, and a further time to read the last output.

Autofocus Criterion Calculation in Synthetic Aperture Radar Systems
Synthetic-Aperture Radar (SAR) systems produce high-resolution images of the ground. The back-projection integration technique has been applied in SAR systems, enabling processing of the image in the time domain, which makes it possible to compensate for non-linear flight tracks. However, the cost is typically a high computational burden. In reality, the flight path is not perfectly linear. This can, however, be compensated for in the processing. The compensations are typically based on positioning information from GPS. If this information is insufficient, or even missing, autofocus can be used. The autofocus calculations use the image data itself, and are performed before each subaperture merge. One autofocus method, which assumes a merge base of two, relies on finding the flight path compensation that results in the best possible match between the images of the contributing subapertures in a merge. Several flight path compensations are thus tested before a merge. The image match is checked according to a selected focus criterion. The criterion assumed in this study is maximization of the correlation of image data [24]. As the criterion calculations are carried out many times for each merge, it is important that these are done efficiently. The calculations include interpolations and correlation calculations.
The autofocus implementation takes 6 × 6 blocks of image pixels as input and applies complex cubic interpolations to these blocks. The hot-spot of the application is identified as the computation of the cubic interpolation. This computation is a combination of complex multiplication, complex subtraction and complex division. Each of these operations is implemented as a building block. Figure 10 presents the structure of the complex division block, consisting of floating-point operations. This block is used as one of the building blocks while developing the accelerator, and it is represented as CDiv in Figure 11, which shows the fully flattened version of the accelerator. In the mathematical expressions of the operations performed during the cubic interpolation, $x_{int}$ is a constant, $x$ and $y$ are the positions and values of the input pixels, and $p_{03}$ is the computed result. The red highlighted box in Figure 11 performs the operations to calculate $p_{01}$, and the small red boxes represent delay registers. We implemented the autofocus criterion calculation in the CAL actor language and generated the software and hardware through our tool-chain. Furthermore, we manually implemented the software and the hardware of the application in native C and Chisel, respectively. These implementations will be referred to as Generated and Hand-written in the rest of the paper.
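A cubic interpolation built from six identical steps, each consisting of complex multiplications, subtractions and one division, has the shape of Neville's algorithm. The sketch below uses that algorithm as an assumption about the structure of the computation; the exact expressions of the paper are not reproduced here, and the function names are hypothetical. The intermediate names p01 to p03 follow the naming used in the text.

```python
def neville_step(x_int, x_i, x_j, p_i, p_j):
    """Combine two lower-order estimates into one higher-order estimate.

    Uses two complex multiplications, complex subtractions and one complex division.
    """
    return ((x_int - x_j) * p_i - (x_int - x_i) * p_j) / (x_i - x_j)


def cubic_interpolate(x_int, x, y):
    """Interpolate at x_int through four points (x[k], y[k]) in six identical steps."""
    p01 = neville_step(x_int, x[0], x[1], y[0], y[1])
    p12 = neville_step(x_int, x[1], x[2], y[1], y[2])
    p23 = neville_step(x_int, x[2], x[3], y[2], y[3])
    p02 = neville_step(x_int, x[0], x[2], p01, p12)
    p13 = neville_step(x_int, x[1], x[3], p12, p23)
    p03 = neville_step(x_int, x[0], x[3], p02, p13)
    return p03


if __name__ == "__main__":
    # Sample a known cubic at four complex positions and interpolate in between.
    f = lambda t: t ** 3 - 2 * t + 1
    xs = [complex(k, 0) for k in range(4)]
    ys = [f(t) for t in xs]
    print(cubic_interpolate(complex(1.5, 0), xs, ys))  # close to f(1.5) = 1.375
```

In this formulation the four differences between x_int and the positions are shared by all steps, which would correspond to separate subtraction blocks outside the six identical structures.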
The Generated and Hand-written implementations have different designs. Figure 12 presents the dataflow diagram of the CAL (Generated) implementation. One can see that the main function (cubic) is implemented as an actor and instantiated several times to exploit data-level and task-level parallelism. However, since our tool-chain supports composition of actors [87] and we intend to generate code for single-core execution, all the actors of the autofocus criterion calculation (seen in Figure 12) are composed into a single actor to be mapped onto a single core. This operation results in a more complex scheduler in the composition, and requires handling actor-to-actor communication. To analyze the CAL actor, the TURNUS framework is used, and cubic interpolation is identified as the hot-spot. The cubic interpolation of the autofocus criterion calculation is implemented in CAL as an action that performs only a single step of the interpolation (calculation of a single p value). This helps to keep the action simple and decreases the number of global variables that provide the input values to the interpolation kernel. The action is executed six times in a row to complete the interpolation of 4 pixels and produce a single pixel result.
The Generated accelerator makes use of complex number arithmetic blocks implemented manually with pipelines. The numbers of complex and floating-point blocks used in the accelerators are given in Tables 1 and 2. The results for the generated accelerator are equal to the results given for the Folded Accelerator in the tables. To be fired, the generated accelerator requires 4 complex numbers (64 bits each) as inputs. Each complex number has two floating-point values representing the real part and the imaginary part. The total number of inputs to the accelerator is thus 8 floating-point numbers, represented in IEEE-754 single precision (binary32) format [88]. The RoCC interface supports the transfer of two 64-bit values from the core to the accelerator with each custom instruction. Therefore, we combine the real and imaginary values of each complex number in a 64-bit register and send 4 floating-point numbers to the accelerator with each instruction. Hence, two custom instruction calls are required to transfer four complex numbers to the accelerator. The accelerator is pipelined and has 20 stages. We have extended the RoCC interface to handle synchronization of the input data between the instruction calls for firing. The extended interface stores the input data until it receives the instruction with the last data required. Then, all of the input data are fed to the accelerator together with an enable signal that fires the accelerator.
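The packing of one complex number into a 64-bit register can be sketched as follows. Placing the real part in the lower 32 bits and the imaginary part in the upper 32 bits is an assumption made for this example, as are the function names; the text only states that both binary32 values share one 64-bit register.

```python
import struct


def pack_complex(value):
    """Pack the real and imaginary parts as two binary32 values into one 64-bit word."""
    real_bits, imag_bits = struct.unpack(
        "<II", struct.pack("<ff", value.real, value.imag))
    return (imag_bits << 32) | real_bits


def unpack_complex(word):
    """Inverse of pack_complex."""
    real, imag = struct.unpack(
        "<ff", struct.pack("<II", word & 0xFFFFFFFF, word >> 32))
    return complex(real, imag)


def instruction_calls(values):
    """Group packed complex numbers into (rs1, rs2) pairs, one pair per custom instruction."""
    words = [pack_complex(v) for v in values]
    return [tuple(words[i:i + 2]) for i in range(0, len(words), 2)]


if __name__ == "__main__":
    inputs = [1 + 2j, 0.5 - 1j, 3 + 0j, -2.25 + 4j]
    calls = instruction_calls(inputs)
    print(len(calls), "custom instruction calls")
    print([hex(word) for call in calls for word in call])
```

Four complex inputs therefore map onto two instruction calls with two 64-bit source registers each, in line with the description above.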
Two different versions of the Hand-written accelerator were developed. The first version of the cubic interpolation accelerator, which is flat and presented in Figure 11, consists of six identical structures, one of which is highlighted in Figure 11. This hardware accelerator is identical to the cubic interpolation function of the hand-written C implementation, and requires 4 pixels as input, with each pixel consisting of two complex numbers (value and position). These inputs are named x and y in Figure 11. It takes four instructions to send all the input data.
The second version of the hand-written accelerator is optimized (folded) for resource usage. It uses only one of these six identical structures in a software-controlled loop, together with the four individual CSub blocks, to perform the cubic interpolation, as shown on the right-hand side of Figure 11. Similar to the generated accelerator, this accelerator is fired six times by the software to calculate one result. To be fired, this accelerator requires four complex numbers, which takes two instructions.
The flat accelerator is designed with a fully pipelined structure and can produce one result (a complex number) per cycle. However, due to the limitations in the data transfer, it can be fired at every fourth cycle (ignoring the cache misses). The folded accelerator also has a pipelined structure and can be fired at every cycle. However, input limitations mean that it can only be fired every other cycle. Moreover, with the folded accelerator, the first three iterations of the loop for calculating one final result can be executed two cycles apart (without waiting for the result of previous iterations). However, synchronization problems may occur with the results returned from the accelerator if the core does not halt until the arrival of the result.
Table 1 presents the number of complex operation blocks used in the accelerators, whereas Table 2 shows the number of floating-point operations performed by each block and by the accelerators. The generated accelerator and the folded hand-written accelerator have the same results, which are presented in the Folded Accelerator column. In total, 140 floating-point operations are performed for each set of inputs (4 pixels). Floating-point multiplication and division operations employ integer multiplications. To perform these integer multiplications on-chip, DSP blocks are utilized.

Results
This section provides the performance, resource utilization and timing results for the case studies considered in this work. The performance results are obtained through executing the case studies on the cycle-accurate emulators generated with the rocket chip generator. Hardware counters are used to obtain the cycle counts. Resource usage results are provided by Xilinx synthesis tools. The target platform is the Xilinx VCU108 evaluation kit, including the Virtex UltraScale XCVU095 FPGA.

QR Decomposition
The QRD case study takes a matrix of 16 × 16 elements as input data. Table 3 presents the performance results of different implementations of the QRD in terms of achieved clock frequency and cycle count. The first row of results is for software generated by the Cal2Many framework and executed on the rocket core without using any custom hardware. The second row of results corresponds to the rocket core and accelerator where both the software for the core and the hardware for the accelerator are generated by the Cal2Many tool-chain. Finally, the last row of results is for the generated software code executing on the rocket core along with a hand-written custom hardware implementation. It is apparent from the results that we could achieve similar clock frequencies for all three cases, primarily because of the pipelining incorporated in the Cal2Chisel back-end, which results in pipelined accelerators. The combination of accelerator and rocket core outperforms the single-core execution by a factor of 4. However, there is a small drop of 4% in the performance of the generated accelerator with respect to the hand-written accelerator. The main reason for this difference is the different number of inputs. During code generation, use of the global array indices is moved out of the accelerator. However, they are still passed to the accelerator as inputs. The hand-written accelerator takes in two floating-point inputs, requiring a single custom instruction call, whereas the generated accelerator takes in five floating-point inputs, requiring at least two custom instruction calls. Copying the input variables to the source registers adds extra cost. This difference can be reduced further by incorporating optimizations in the hardware code generation.
The resource usage and clock frequency results for the rocket core and the accelerators can be seen in Table 4. The main reason for the difference in look-up tables and flip-flops in the two accelerator implementations is the lack of support for a mixture of integer/floating-point operations. The code generation tool treats integer operations similarly to floating-point operations, and utilizes floating-point operation blocks. Specifically, there is an integer addition operation that is converted into a floating-point operation in the generated accelerator, which causes an increase in resource usage. Thus, there is potential for further improvement in the resource usage of the generated hardware. The maximum clock frequency of the generated accelerator is 20 MHz without the pipelining feature. However, by inserting the pipeline stages, the maximum clock frequency increases to 104 MHz, equivalent to the clock frequency of the hand-written accelerator. The accelerators use the same hardware library for the arithmetic operations, which plays a significant role in determining the clock frequency. In addition, the accelerator is fully pipelined and parallelized, meaning it can produce one set of results per cycle. However, due to limitations in the data movement between the core and the accelerator, it is not possible to make use of this feature. The standalone latency of the accelerator is 14 cycles and the total latency (including data movement) is approximately 27 cycles.
To summarize, we are able to achieve an improvement of 4× in performance with respect to rocket core software execution, and the overhead of generated accelerator vs hand-written accelerator is minimal.

Autofocus Criterion Calculation
The autofocus criterion calculation application, with integrated cubic interpolation accelerators, is tested using 12 kernels, each consisting of 6 × 6 image pixels. Cubic interpolation is applied 324 times within the autofocus criterion calculation application.
Table 5 shows the performance results for the autofocus criterion calculation in terms of cycle count and achieved clock frequencies. All of these implementations use hand-optimized software. There are two implementations of the hand-written accelerators, namely flat and folded. The flat accelerator is fully parallelized and has the best performance in terms of cycle count. On the other hand, the folded hand-written implementation, which applies the same cubic implementation block in a loop, has a similar performance to the generated accelerator. All of the implementations with integrated accelerators outperform the software execution on the rocket core by a factor of 2.8–4.8. The hand-written flat accelerator implementing cubic interpolation produces a final result in 52 cycles. Due to the pipelined structure, the accelerator can produce one result every clock cycle, provided all the required inputs are loaded. Since the RoCC interface can provide at most four floating-point inputs per cycle, the accelerator can produce one result every four cycles. Furthermore, the rocket core waits until the accelerator produces its result. Hence, the accelerator can compute one result in 52 cycles. When the accelerator is integrated into the core, it takes eight cycles to copy the data into the source registers and fire the accelerator. Additionally, it takes two cycles to read the result. Therefore, in total, it takes 62 cycles to compute a result. While measuring these results, the caches are pre-filled; hence, there are no cache misses.
The hand-written folded accelerator and the generated accelerator, both of which perform cubic interpolation, are similar in their design. They both compute one of the six identical functions shown in Figure 11 and produce an intermediate result in 20 cycles. If the data movement overhead is ignored, this accelerator can calculate the final result in six iterations, taking 120 cycles. However, for each firing of the accelerator, data movement costs six additional cycles (four for filling the source register and two for reading the destination register). Thus, each iteration takes 26 cycles, meaning that the whole autofocus criterion computation takes 156 cycles (ignoring any cache misses).
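The cycle counts quoted above for the flat and folded accelerators follow from a simple additive model: each firing costs the time to load the source registers, the compute latency, and the time to read the result back, and the core stalls for the whole duration. The following is a minimal sketch of that arithmetic; the constants come from the text, while the function and variable names are hypothetical and introduced only for illustration.

```python
# Additive cycle model for one cubic interpolation result.
# Constants are taken from the text; all names here are illustrative assumptions.

def cycles_per_result(compute_cycles, load_cycles, read_cycles, firings=1):
    """Total cycles when the core stalls for every firing of the accelerator."""
    return firings * (load_cycles + compute_cycles + read_cycles)

# Flat accelerator: one firing, 52-cycle latency,
# 8 cycles to fill the source registers and fire, 2 cycles to read the result.
flat_total = cycles_per_result(compute_cycles=52, load_cycles=8, read_cycles=2)

# Folded and generated accelerators: six firings of a 20-cycle block,
# 4 cycles to fill the source registers, 2 cycles to read the destination register.
folded_total = cycles_per_result(
    compute_cycles=20, load_cycles=4, read_cycles=2, firings=6
)

print(flat_total)    # 62
print(folded_total)  # 156
```

The model ignores cache misses, as the measurements in the text do, and it assumes that firings of the folded block do not overlap.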
Table 6 provides synthesis results for the individual rocket core, the individual accelerator implementations, and the integrated (core + accelerator) designs. The results for the integrated designs include the resource usage of the extended RoCC interfaces. The hand-written folded and the generated accelerators use only one of the six main blocks, highlighted with dashed lines in Figure 11. Therefore, one would expect a 6× reduction in resource usage compared to the hand-written flat implementation. However, the CSub components, which produce the d values, are used in the folded version as well. Hence, LUT and FF usage do not decrease by exactly a factor of six. The DSP and BRAM usages, in contrast, do drop to 1/6, as they are only used in the main blocks. The increase in the size of the integrated design lengthens the critical path of the processor only slightly. When integrated, the flat accelerator causes the maximum clock frequency of the processor to decrease from 58 MHz to 56 MHz, whereas the folded and generated accelerators do not result in any reduction of the clock frequency. When synthesized individually, the hand-written flat, hand-written folded, and generated accelerators achieve clock frequencies of 126 MHz, 131 MHz, and 117 MHz, respectively.
We have also automatically generated software code from the CAL program and executed it separately on the rocket core and on the integrated design (rocket core + accelerator). Software execution on the rocket core takes 607 k cycles. The use of the generated accelerator reduces the execution time to 196 k cycles. The main reason for the gap between the generated and the hand-written software is that the hand-written (C) implementation fires the accelerator six times in a row (in the same function), whereas the generated (CAL) implementation fires the accelerator once with each function call and then returns to the scheduler, which is a state machine implemented as if-else statements. In the generated implementation, the variables copied into the source registers are global variables, and it takes four instructions to move them to the source registers. In the hand-written implementation, these variables are passed to the function as arguments, and it takes two instructions to move them to the source registers. The CAL implementation consists of many actors and would benefit from being mapped onto different cores. The automated composition of the actors causes a significant performance reduction; reducing it is our next goal for improving software generation.
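As a quick check on the figures for the generated software, the speedup obtained by adding the generated accelerator can be computed directly from the two cycle counts. This is a minimal sketch; the variable names are hypothetical, and since the counts are quoted in rounded thousands of cycles the result is approximate.

```python
# Speedup of the generated (CAL) software when the generated accelerator is used.
# Cycle counts are the rounded values quoted in the text; names are illustrative.

def speedup(baseline_cycles, accelerated_cycles):
    """Ratio of baseline execution time to accelerated execution time."""
    return baseline_cycles / accelerated_cycles

generated_sw_cycles = 607_000     # generated code on the rocket core alone
generated_accel_cycles = 196_000  # generated code with the generated accelerator

generated_speedup = speedup(generated_sw_cycles, generated_accel_cycles)
print(f"{generated_speedup:.2f}x")  # 3.10x
```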
To summarize, the hand-written flat accelerator increases the performance of the rocket core by a factor of 4.8 at the cost of a significantly larger resource usage. However, the hand-written folded and the generated accelerators increase the performance by 2.8× with a very modest increase in the resource usage.

Conclusions
Manycore architectures are emerging to meet the performance demands of today's complex applications. These architectures usually consist of identical or generic cores. However, applications usually consist of a variety of tasks. To achieve the highest efficiency, there is a need to specialize the cores for individual tasks and introduce heterogeneity. However, designing such an architecture is a challenging task.
In this paper, we propose an approach for designing domain-specific heterogeneous tiles based on instruction augmentation through integrating custom hardware into simple cores. The main purpose of generating these tiles is to use them to build future manycore architectures. The design approach aims to build the architecture based on the requirements of the applications within a domain. We developed a tool to automatically generate the custom hardware blocks from CAL applications. The generated custom hardware is then integrated into a core that executes the RISC-V instruction set. We evaluated our approach by implementing the QR decomposition and autofocus criterion calculation case studies in the CAL actor language. The case studies revealed that RISC-V custom instructions can be used for integrating specialized custom hardware to boost performance. The processing cores with custom hardware outperform the cores without custom hardware by a factor of 2.8–4.8. Additionally, it is shown that automated hardware generation increases the area of the hardware by approximately 0–12% and that the performance drop (in terms of cycle count) is 0–9%, which can be decreased even further by optimizing the code generation tool. The pipelining stages implemented during hardware generation play a significant role in improving the maximum clock frequency of the accelerators. However, when the accelerators are integrated with the rocket core, the frequency is limited by the maximum clock frequency of the rocket core, i.e., 56 or 58 MHz.
The significant advantage of our design approach is in terms of productivity, as the resulting accelerator can be produced from the same dataflow program without any extra effort spent on developing the hardware. Manycore architectures are slowly evolving in the direction of specialized cores, and we believe that the proposed approach will facilitate the design of new architectures based on specialized cores.
In the future, we plan to generate a manycore architecture by integrating many tiles with a network-on-chip. While doing this, we will use several case studies to generate the tiles to support a set of applications. Later, we aim to extend the hardware block library and fully automate every design step, including system integration.

Figure 1. Illustration of the design flow and its realization. The grayed out components are not addressed by this study.

Figure 3. An overview of firing rates and weights of actions in an actor performing QR decomposition.

Figure 4. An overview of Cal2Many framework including the back-end for generating custom hardware blocks.

Figure 5. Custom instruction provided by the RISC-V ISA to call the custom hardware blocks.

Figure 6. A tagged CAL action and the corresponding generated C and Chisel code.

Figure 7. A simplified view of the RoCC interface.

Figure 9. Systolic Array implementation of Givens Rotations (QRD). The arrows show the direction of the data movement.

Figure 10. Building block that performs division on complex numbers.

Figure 11. Structure of the flat cubic interpolation accelerator.

Table 1. Number of complex operations in the accelerators.

Table 2. Number of floating-point operations in blocks and the accelerators.

Table 3. Performance results for QRD in terms of cycle count and clock frequency.

Table 4. Resource usage and timing results of the rocket core and QRD accelerators.

Table 5. Performance results for the autofocus criterion calculation in terms of cycle counts and achieved clock frequencies.

Table 6. Resource usage and timing results for the rocket core and the integrated designs including hand-written and generated accelerators.