A Reconfigurable Convolutional Neural Network-Accelerated Coprocessor Based on RISC-V Instruction Set

Abstract: As a typical artificial intelligence algorithm, the convolutional neural network (CNN) is widely used in Internet of Things (IoT) systems. In order to improve the computing ability of an IoT CPU, this paper designs a reconfigurable CNN-accelerated coprocessor based on the RISC-V instruction set. The interconnection structure of the acceleration chain designed in previous work is optimized, and the accelerator is connected to the RISC-V CPU core as a coprocessor. The corresponding coprocessor instructions are designed and an instruction compiling environment is established. The coprocessor instructions are called through inline assembly in the C language, coprocessor acceleration library functions are established, and common algorithms in IoT systems are implemented on the coprocessor. Finally, resource consumption evaluation and performance analysis of the coprocessor are completed on a Xilinx FPGA. The evaluation results show that the coprocessor consumes only 8534 LUTs, accounting for 47.6% of the entire SoC system. The acceleration performance is evaluated by comparing the number of running cycles needed to implement four basic algorithms (convolution, pooling, ReLU, and matrix addition) with the RISC-V standard instruction set and with the custom coprocessor instruction set. The results show that the coprocessor instruction set yields a significant speedup on all four algorithms, with convolution accelerated to 6.27 times the speed achieved by the standard instruction set.


Introduction
With the rapid development of artificial intelligence (AI) technology, more and more AI applications are being developed and deployed on Internet of Things (IoT) node devices. The intelligent Internet of Things (AI + IoT, or AIoT), which combines the advantages of AI and IoT technology, has gradually become a research hotspot in IoT-related fields [1]. Traditional cloud computing is not suitable for AIoT because of its high latency and poor mobility [2]. For this reason, a new computing paradigm, edge computing, has been proposed: to reduce computing delay and network congestion, computation is migrated from the cloud server to the device [2]. Edge computing brings innovation to IoT systems, but it also challenges the AI computing performance of IoT node processors, which must be improved while still meeting the power consumption and area limitations of IoT node devices [3]. To improve the AI computing power of IoT node processors, some IoT chip manufacturers provide artificial intelligence acceleration libraries for their processors, but these only optimize and tailor the algorithms at the software level, which is a stopgap. It is necessary to design AI accelerators suitable for IoT node processors at the hardware level.

RISC-V Instruction Set
The RISC-V instruction set is a new instruction set architecture, born in 2010. It draws lessons from the advantages and disadvantages of traditional instruction sets and avoids their unreasonable design decisions, and its open-source nature allows the existing design to be adjusted at any time, which guarantees the simplicity and efficiency of the instructions [17]. RISC-V is a modular instruction set comprising three basic instruction sets and six extended instruction sets. This modularity enables the designer to combine one basic instruction set with several extended instruction sets according to the actual design requirements, so that flexible processor functions can be realized. More information on RISC-V can be found in [17].
Instruction extensibility is a significant advantage of the RISC-V architecture. It reserves instruction encoding space for domain-specific architectures, and users can easily extend their own instruction subsets. The RISC-V architecture reserves four groups of custom instruction types in the 32-bit instruction space, as shown in Figure 1. According to the RISC-V architecture [17,18], the custom-0 and custom-1 instruction spaces are reserved for user-defined extension instructions and will not be used by future standard extensions. The instruction spaces marked custom-2/rv128 and custom-3/rv128 are reserved for a future rv128 base; they will likewise not be used by standard extensions, so they can also be used for custom extension instructions.

E203 CPU
In order to realize a reconfigurable CNN-accelerated coprocessor based on the RISC-V instruction set, we need to choose a suitable RISC-V processor as the research object. The E203 core is a 32-bit RISC-V processor core with a two-stage pipeline [19]. On the basis of the rv32i instruction set, it adds integer multiplication and division instructions, atomic operation instructions, and 16-bit compressed instructions. In its design goals, the E203 core is closest to the Arm Cortex-M0+ (Arm, Cambridge, UK). Its structure is shown in Figure 2.
The E203 core adopts a single-issue, in-order execution architecture. The first pipeline stage fetches instructions, performs simple decoding, and carries out branch prediction for branch instructions. The second pipeline stage completes decoding, execution, memory access, and write-back.
The first stage of the pipeline mainly includes a simple decoding unit, a branch prediction unit, and a PC generation unit. The simple decoding unit (label 1 in Figure 2) performs simple decoding on the fetched instruction to obtain its type and whether it is a branch or jump instruction. For branch and jump instructions, the branch prediction unit (label 2 in Figure 2) performs branch prediction to obtain the predicted jump address. The PC generation unit (label 3 in Figure 2) generates the next PC value to be fetched and accesses the instruction tightly-coupled memory (ITCM) or the bus interface unit (BIU) to fetch instructions. The PC value and instruction value are placed in the PC register and IR register and passed to the next stage of the pipeline. The second stage of the pipeline mainly includes a decode-and-dispatch unit (label 4 in Figure 2), a branch prediction analysis unit (label 5), an arithmetic logic unit (label 6), a multi-cycle multiplier and divider (label 7), a memory access unit (label 8), and an extension accelerator interface (EAI) (label 9). The decode-and-dispatch unit decodes instructions and dispatches them to different execution units according to their specific types. Ordinary arithmetic and logic instructions, such as logic operations, addition, subtraction, and shifts, are dispatched to the arithmetic logic unit. Branch and jump instructions are dispatched to the branch prediction analysis unit for branch prediction analysis; for mispredicted instructions, the pipeline must be flushed and instructions refetched. Multiplication and division instructions are dispatched to the multi-cycle multiplier and divider, which executes the operation over multiple cycles. Load and store instructions read and write memory through the memory access unit. Coprocessor instructions are dispatched to the EAI and executed by the coprocessor. The reconfigurable CNN acceleration coprocessor designed in this paper is connected to the E203 processor through the EAI.
The EAI has four channels: the request channel, the response channel, the memory request channel, and the memory response channel. The request channel is used by the main processor to send custom extended instructions to the coprocessor, including the source operands, the dispatch tag, and other information. The response channel is used by the coprocessor to feed back the execution of custom extended instructions; the feedback includes the calculation results, the dispatch tag, etc. The memory request channel is used by the coprocessor for memory access, reading and writing memory through the main processor. The memory response channel is used by the main processor to return memory read and write results to the coprocessor, including the data read from memory. Please refer to [19] for a detailed introduction to the EAI.
The coprocessor extended instructions of the E203 processor follow the RISC-V instruction extension rules, and their encoding format is shown in Figure 3. Bits 25 to 31 of the instruction form the funct7 field, which serves as a sub-encoding space used to parse instructions in the decoding stage. funct7 is 7 bits wide and can encode 128 instructions. rs1, rs2, and rd are the register indexes of source operand 1, source operand 2, and the destination operand, respectively. xs1, xs2, and xd indicate whether the source registers must be read and whether the destination register must be written. Bits 0 to 6 of the instruction form the opcode field, which can take the opcode value of custom-0, custom-1, custom-2, or custom-3. Given the width of funct7, each group of custom instructions can encode 128 instructions, so the four groups together can encode 512 coprocessor instructions.
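As an illustration of this encoding, the GNU assembler's generic .insn directive can emit an R-type instruction in the custom-0 opcode space (opcode 0001011, i.e., 0x0B) without modifying the toolchain. The funct7 value below is an arbitrary example, and funct3 = 0x7 assumes the {xd, xs1, xs2} = {1, 1, 1} flag convention; neither is an instruction defined in this paper.

```c
/* Sketch: issuing one R-type custom-0 instruction via .insn.
 * Directive format: .insn r opcode, funct3, funct7, rd, rs1, rs2 */
static inline unsigned int custom0_example(unsigned int a, unsigned int b)
{
    unsigned int r;
    __asm__ volatile(".insn r 0x0B, 0x7, 0x01, %0, %1, %2"
                     : "=r"(r)
                     : "r"(a), "r"(b));
    return r;
}
```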

Hardware Design of CNN-Accelerated Coprocessor
The reconfigurable CNN acceleration coprocessor designed in this paper is a further extension of the compact CNN accelerator designed in Reference [16]. It mainly optimizes the acceleration chain structure and adds the coprocessor-related modules. Other basic components are consistent with those in Reference [16].

Accelerator Structure Optimization
In the compact CNN accelerator designed in Reference [16], the operation units in the acceleration chain are connected in a serial structure with a single data flow direction. For some algorithms, the data must be moved between the memory and the accelerator many times, which reduces calculation efficiency and increases the power consumption of the processor. In this paper, the four basic operation modules (convolution, pooling, ReLU, and matrix addition) are interconnected by a crossbar, and a reconfigurable CNN accelerator is designed. The accelerator mainly comprises four reconfigurable computing acceleration processing elements (PEs); by configuring the PE units with different parameters, hardware acceleration of various algorithms can be realized. The structure is shown in Figure 4. The accelerator includes a source convolution kernel cache module (COE RAM), a reconfigurable circuit controller (Reconfigure Controller), two ping-pong buffer blocks (BUF RAM BANK), and four PE units. Each PE unit contains four basic computing components and a configurable crossbar (Crossbar). The configuration of the crossbar allows data to flow to different computing components, which can speed up different algorithms. The entire accelerator can be parameterized and its data accessed through an external bus.

Accelerator
The key to accelerating various algorithms with the four PE units is the design of the crossbar circuit. The configuration of the crossbar can make the input data flow pass through any one or more calculation modules. Its structure is shown in Figure 5.
The crossbar mainly consists of an input buffer (FIFO), a configuration register group (Cfg Regs), and five multiplexers (MUXes). The five MUXes select which data paths to open according to the configuration information in the Cfg Regs; thus, the data stream passes through different calculation modules in different orders. For example, the edge detection operation in image processing usually downsamples the input image and then convolves it with the edge detection operator, which requires configuring the MUXes as the path shown in red in Figure 5. In a convolutional neural network algorithm, it is usually necessary to perform convolution, ReLU, and pooling operations on the source matrix, which requires configuring the MUXes as the path shown in blue in Figure 5. The crossbar circuit and each calculation module use the ICB bus for data transmission. The ICB bus is a bus protocol defined by the open-source CPU E203; it combines the advantages of the AXI bus and the AHB bus. For more information about the ICB bus, please refer to [19].
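To make the configuration concrete, the sketch below models the Cfg Regs as five MUX select fields. This layout and its field names are purely illustrative assumptions; the text does not specify the actual register format.

```c
/* Illustrative model of the crossbar configuration (not the actual
 * register layout): each of the five MUXes selects which module's
 * output feeds its downstream port. */
enum acc_module { MOD_NONE, MOD_CONV, MOD_POOL, MOD_RELU, MOD_MATADD };

struct crossbar_cfg {
    enum acc_module mux_sel[5];   /* one select field per MUX in Figure 5 */
};

/* CNN path (blue path in Figure 5): convolution -> ReLU -> pooling. */
static const struct crossbar_cfg cnn_path = {
    .mux_sel = { MOD_CONV, MOD_RELU, MOD_POOL, MOD_NONE, MOD_NONE }
};
```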

Coprocessor Design
In addition to optimizing the accelerator chain module designed in Reference [16], this paper adds an EAI controller, a decoder, and a data fetcher, completing the hardware design of the reconfigurable CNN-accelerated coprocessor, whose structure is shown in Figure 6.
The EAI controller handles the timing of the EAI and hands the instruction information and source operands obtained from the EAI request channel to the decoder for decoding. The memory data obtained from the EAI memory response channel are handed to the data fetcher for allocation to the corresponding cache. The decoder decodes the custom extended instructions: configuration instructions are handed to the reconfiguration controller to configure each functional module, while memory access instructions are handed to the data fetcher, which carries out the memory access. The data fetcher reads data from external memory into the corresponding cache through the memory response channel of the EAI, loading the convolution kernel coefficients into the COE CACHE unit and the source matrix into CACHE BANK1 or CACHE BANK2. The calculation results are written back to external memory through the memory request channel of the EAI.
When the main processor executes an instruction, its decoding unit first determines from the opcode whether the instruction belongs to a custom instruction group. For such instructions, the xs1 and xs2 bits determine whether source operands must be read; if so, they are read from the register file according to rs1 and rs2. The main processor must also maintain the data dependences of instructions: if an instruction has a data dependence, the processor stalls the pipeline until the dependence is resolved. Meanwhile, if the instruction must write back to the destination register, the destination register rd is also used as one of the bases for judging the data dependences of subsequent instructions. The instruction is then sent to the coprocessor through the EAI request channel for processing. After the coprocessor receives the instruction, it further decodes it and dispatches it to different units for execution according to its type. The coprocessor executes instructions in a blocking manner: only when one instruction has finished executing can the main processor dispatch the next. Finally, after the instruction has executed, the result is returned to the main processor through the response channel; for an instruction that writes back, the result must be written back to the destination register.
The designed reconfigurable CNN acceleration coprocessor and the compact CNN accelerator designed in Reference [16] are both used to process multimedia data such as voice and images. The data flows are shown in Figure 7. Figure 7a shows the data flow of the compact CNN accelerator designed in Reference [16]. It is an external device mounted on the SoC bus, which requires the CPU core to handle data transfer: the multimedia data must first be moved from the external interface to the data memory through the CPU core, and then to the accelerator for processing. After processing, the CPU core must move the calculation results from the accelerator to the data memory and then to the external interface to be sent to the network interface. Data must therefore be moved between the accelerator and the data memory twice.
Figure 7b shows the data flow of the reconfigurable CNN-accelerated coprocessor designed in this paper. Compared to the accelerator designed in Reference [16], the coprocessor can read multimedia data directly from the external interface and, after processing, send the results directly to the network interface through the external interface, eliminating the data movement between the coprocessor and the data memory.
The coprocessor approach reduces data movement, further speeding up algorithm processing and saving power. In addition, the coprocessor provides coprocessor instruction support, with higher code density and simpler programming.

Instruction Design of Coprocessor
The coprocessor instructions designed for the reconfigurable CNN-accelerated coprocessor are shown in Table 1.
There are 15 coprocessor instructions in total, divided into two categories: coprocessor configuration instructions and coprocessor data loading instructions. The configuration instructions configure each function module of the coprocessor, such as the PE working mode and the crossbar data flow direction. The data loading instructions load the convolution coefficients and the source matrix into the respective coprocessor caches. The assembly format of the coprocessor instructions is shown in Figure 8.
The use of the designed coprocessor instructions generally involves 5 steps (a usage sketch in C follows the list):

1. Reset the coprocessor. Before using the coprocessor, the acc.rest instruction must be used to reset it.

2. Load the convolution coefficients. If the convolution module of the coprocessor is to be used, the acc.load.cd instruction loads the convolution coefficients into the COE cache.

3. Load the source matrix. The coprocessor supports reading the source matrix directly from memory for calculation, and also supports moving the source matrix to a CACHE BANK and reading it from there. The instruction for this is acc.load.bd.

4. Configure the coprocessor parameters.

(1) Working mode. Configure the working mode of each PE, that is, configure the crossbar in each PE, select the acceleration units that participate in the calculation, and set the data flow direction according to the algorithm. The instruction is acc.cfg.wm.

(2) Location of the calculation data source. The coprocessor can load data directly from memory or from a CACHE BANK. Before starting the coprocessor, configure the data source according to the storage location of the source matrix; the instruction is acc.cfg.ldl. Then configure the offset, relative to the base address of the data source, of the calculation data required by each PE: the data required by the four PEs are stored contiguously in the data source, so each PE's relative offset must be specified. The instruction is acc.cfg.ldo.

(3) Storage location of the calculation results. Configuring the result storage location works in the same way as configuring the data source: results can be saved to memory or to a CACHE BANK with acc.cfg.sdl, and the relative offset of each PE's result within the storage location is configured with acc.cfg.sdo.

(4) Width and height of the source matrix. Configure the width and height of the source matrix in each PE with acc.cfg.sms.

(5) Convolution kernel coefficient parameters. Configure the offset address of the convolution kernel in the COE CACHE and the size of the convolution kernel with acc.cfg.cp, and configure each convolution kernel bias with acc.cfg.cb.

(6) Pooling parameters. Configure the height and width of the maximum pooling window with acc.cfg.pw.

(7) Matrix addition parameters. Configure the loading location of the second input matrix participating in the matrix addition, which may be loaded from memory or from a CACHE BANK, with acc.cfg.adl, and configure the relative offset of the second input matrix in each PE with acc.cfg.ado.

5. Start the calculation. The acc.sc instruction makes the coprocessor start calculating according to the configured parameters.
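As a usage sketch, the sequence below strings the five steps together in C. It assumes the modified Binutils toolset of Section 4.2 so that the acc.* mnemonics are accepted in inline assembly; the operand lists and the packing of width/height pairs are illustrative assumptions, since the exact assembly formats (Figure 8) are not reproduced here.

```c
/* A minimal sketch of the five-step flow (operand forms are assumed,
 * not taken from Figure 8). */
void acc_conv_pool(const int *coe, const int *src)
{
    __asm__ volatile("acc.rest");                               /* 1: reset the coprocessor    */
    __asm__ volatile("acc.load.cd %0" :: "r"(coe));             /* 2: load convolution kernel  */
    __asm__ volatile("acc.load.bd %0" :: "r"(src));             /* 3: load source matrix       */
    __asm__ volatile("acc.cfg.wm  %0" :: "r"(0x1));             /* 4(1): PE working mode       */
    __asm__ volatile("acc.cfg.sms %0" :: "r"((32 << 16) | 32)); /* 4(4): 32 x 32 source matrix */
    __asm__ volatile("acc.cfg.pw  %0" :: "r"((2 << 16) | 2));   /* 4(6): 2 x 2 max pooling     */
    __asm__ volatile("acc.sc");                                 /* 5: start the calculation    */
}
```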

Establishment of Instruction Compiling Environment for Coprocessor
The RISC-V foundation open-source compilation tool chain includes GCC and LLVM. The most commonly used compilation tool chain is GCC. By modifying the Binutils toolset in the GCC compilation tool chain, a compilation environment for coprocessor instructions is established.
The RISC-V foundation's open-source GCC compilation tool chain mainly includes riscv32 and riscv64 versions. The E203 core only supports the rv32IMAC subset of RISC-V, so the riscv32 version of the GCC tool chain is used. The GCC compilation tool chain mainly includes the GCC compiler, the Binutils binary toolset, the GDB debugging tools, and the C runtime libraries. The Binutils binary toolset is a set of tools for processing binary programs. Some commonly used tools in the 32-bit Binutils toolset and their functions are shown in Table 2.

Table 2. Binutils toolset list.

Tool Name | Function
riscv32-unknown-elf-as | Assembler; converts assembly code to an executable ELF file
riscv32-unknown-elf-ld | Linker; links multiple object and library files into an executable
riscv32-unknown-elf-objcopy | Converts ELF format files to bin format files
riscv32-unknown-elf-objdump | Disassembler; converts binary files into assembly code

By calling the Binutils tools in the table, assembly code can be assembled and linked into an object file. To enable the Binutils toolset to support the custom coprocessor instructions of Section 4.1, the Binutils source code must be modified. After adding all custom coprocessor instructions to the Binutils source code, it is compiled to generate the toolset in Table 2. To test the functionality of the toolset, the test assembly program test.s shown in Figure 9 was written.
The riscv32-unknown-elf-as assembler is used to assemble the program into the test.out binary object file, and riscv32-unknown-elf-objdump is used to disassemble test.out, yielding the disassembly shown in Figure 10.
The first column is the binary encoding of each instruction, and the second column is the assembly code obtained by disassembling it. The binary encodings are consistent with the design in Section 4.1, and the disassembled code matches the test assembly code (x5, x1, x2 and t0, ra, sp refer to the same registers). This test indicates that the designed custom coprocessor instructions have been successfully added to the Binutils toolset.

Coprocessor-Accelerated Library Function Design
After completing the design of the custom coprocessor instructions and the establishment of the instruction compilation environment, the coprocessor can be used by writing assembly code. However, writing assembly code directly is inefficient for development. To make the coprocessor easier to use, its common functions are packaged as C function interfaces, forming the coprocessor acceleration library functions.
The designed library function interface is shown in Table 3. The library functions use C inline assembly to call the coprocessor instructions. Taking the CfgPoolWidth function interface as an example, the specific implementation is shown in Figure 11.
The CfgPoolWidth function calls the acc.cfg.pw instruction via inline assembly and passes in the values of the width and height variables, implementing the height and width configuration of the pooling operation. A sketch of such a wrapper follows.
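Since Figure 11 is not reproduced here, the following is only a plausible shape for such a wrapper, assuming the modified assembler of Section 4.2; the operand order and register constraints are assumptions.

```c
/* Sketch of CfgPoolWidth: a C wrapper that passes the pooling window
 * width and height to the acc.cfg.pw instruction via inline assembly. */
static inline void CfgPoolWidth(unsigned int width, unsigned int height)
{
    __asm__ volatile("acc.cfg.pw %0, %1"
                     :                        /* no outputs */
                     : "r"(width), "r"(height));
}
```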

Implementation of Common Algorithms on Coprocessor
The reconfigurable CNN-accelerated coprocessor designed in this paper can accelerate not only the CNN algorithm but also some other commonly used algorithms in the IoT system. This article describes the implementation of the LeNet-5 network, Sobel edge detection, and FIR filtering algorithms on the coprocessor.

LeNet-5 Network Implementation
In order to verify the coprocessor's acceleration of the CNN algorithm, the classical LeNet-5 network is used in this paper. The structure of the LeNet-5 network is shown in Figure 12.
The LeNet-5 network mainly comprises six hidden layers [20]: (1) Convolution layer C1: six 5 × 5 convolution kernels are convolved with the 32 × 32 original image to generate six 28 × 28 feature maps, which are activated with the ReLU function. (2) Pooling layer S2: a 2 × 2 pooling filter performs maximum pooling on the output of C1 to obtain six 14 × 14 feature maps. (3) Partially connected layer C3: 16 5 × 5 convolution kernels are partially connected with the six feature maps output by S2, producing 16 10 × 10 feature maps. The partial connection relationship and calculation process are shown in Figure 13. Taking the 0th output feature map as an example: first, the 0th convolution kernel is convolved with feature maps 0, 1, and 2 output by the S2 layer; the three convolution results are then added together with a bias, and the sum is activated to obtain the 0th feature map of the C3 layer. (4) Pooling layer S4: a 2 × 2 pooling filter pools the output of C3 into 16 5 × 5 feature maps. (5) Expand layer S5: the 16 5 × 5 feature maps output by S4 are combined into a one-dimensional matrix of size 400. (6) Fully connected layer S6: the one-dimensional matrix output by S5 is fully connected with 10 convolution operators, yielding 10 classification results as the recognition result of the input image.
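The C3 arithmetic described above can be captured in a few lines of plain C. The sketch below is a scalar reference for one output map only, following the description in the text (one 5 × 5 kernel convolved with the three connected S2 maps, then summed, biased, and activated); it makes no assumption about the hardware.

```c
#define S2_SIZE 14   /* S2 feature map size */
#define C3_SIZE 10   /* C3 feature map size */
#define K_SIZE   5   /* convolution kernel size */

/* Reference computation of the 0th C3 feature map: convolve one kernel
 * with the three connected S2 maps, sum the results, add the bias,
 * then apply ReLU. */
void c3_one_map(const int s2[3][S2_SIZE][S2_SIZE],
                const int k[K_SIZE][K_SIZE], int bias,
                int out[C3_SIZE][C3_SIZE])
{
    for (int y = 0; y < C3_SIZE; y++)
        for (int x = 0; x < C3_SIZE; x++) {
            int acc = bias;
            for (int m = 0; m < 3; m++)            /* the three input maps */
                for (int i = 0; i < K_SIZE; i++)
                    for (int j = 0; j < K_SIZE; j++)
                        acc += k[i][j] * s2[m][y + i][x + j];
            out[y][x] = acc > 0 ? acc : 0;         /* ReLU activation */
        }
}
```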
From a comprehensive analysis of the operating characteristics of each layer of the network, the calculation of each layer of LeNet-5 is summarized in Table 4.

Table 4. Calculation steps of each layer of LeNet-5.

Layer | Calculation Formula | Explanation
C1 | – | Convolution and ReLU
S2 | – | Pooling
C3 | – | Convolution, matrix addition, and ReLU
S4 | [S4^0 - S4^15] | Pooling
S5 | – | Expand
S6 | [S6^0 - S6^9] = S5 × [K6^0 - K6^9] | Fully connected
As can be seen from the above table, the implementation of the LeNet-5 network on the coprocessor can be performed in four steps:
(1) Map the C1 and S2 layers. Configure the coprocessor to use the convolution, ReLU, and pooling modules of the PE unit, and configure the parameters of these three modules. Configure the crossbar so that the data flow follows the sequence shown in Figure 14a, start the coprocessor, and calculate the C1 and S2 layers.
(2) Map the C3 and S4 layers. Configure the coprocessor to use the convolution, matrix addition, ReLU, and pooling modules of the PE unit, and configure the parameters of these four modules. Configure the crossbar so that the data flow follows the order shown in Figure 14b, start the coprocessor, calculate the C3 and S4 layers, and cache the calculation result in BUF RAM BANK1.
(3) Calculate the S5 layer on the CPU. A software program expands the calculation result of (2) into a 1 × 400 one-dimensional matrix, which is buffered in BUF RAM BANK2.
(4) Map the S6 layer. Configure the coprocessor to use the convolution module of the PE unit so that the convolution module performs one-dimensional convolution. Configure the crossbar so that the data flow follows the sequence in Figure 14c and only the convolution module is used. Configure the data source as BUF RAM BANK2, start the coprocessor, and calculate the S6 layer.

Sobel Edge Detection and FIR Algorithm Implementation
In image processing algorithms, edge detection is often required, and edge detection based on the Sobel operators is a commonly used method. The Sobel operators are first-order gradient operators that can effectively filter out noise interference and extract accurate edge information [21]. Sobel edge detection convolves two 3 × 3 matrix operators with the input image to obtain the gray values of the horizontal and vertical edges, respectively. If A is the original image and Gx and Gy are the horizontal and vertical edge images, respectively, the calculation formulas of Gx and Gy are shown in Figure 15. The complete edge gray value of the image can be approximated by Equation (1):

|G| = |Gx| + |Gy| (1)

Before edge detection, the image is usually downsampled to reduce its size and the amount of data to be processed.
In summary, the implementation of Sobel edge detection on the accelerator can be performed in three steps: (1) Calculate Gx and Gy. Configure the coprocessor to use two PE units; each PE unit uses the pooling and convolution modules, and the convolution modules in the two PEs use the two convolution kernels of the Sobel operator. Configure the crossbar so that the data flow follows the sequence in Figure 16a. Start the coprocessor, calculate Gx and Gy, and cache the results in BUF RAM BANK1. (2) Use the CPU to calculate |Gx| and |Gy|. A software program takes the absolute value of each element of the cached result of (1) to obtain |Gx| and |Gy|, and caches the results in BUF RAM BANK2. (3) Calculate |G|. Configure the coprocessor to use the matrix addition module of the PE unit and configure the crossbar so that the data flow follows the order in Figure 16b and only the matrix addition module is used. Configure the data source as BUF RAM BANK2, start the coprocessor, and calculate |G|. A plain-C reference for steps (2) and (3) follows.
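The following fragment shows the arithmetic of steps (2) and (3), i.e., Equation (1) applied element-wise; in hardware, the final addition is what the matrix addition module performs.

```c
/* Steps (2) and (3) in software terms: absolute values of Gx and Gy,
 * then |G| = |Gx| + |Gy| element-wise over the n-pixel edge images. */
void sobel_combine(const int *gx, const int *gy, int *g, int n)
{
    for (int i = 0; i < n; i++) {
        int ax = gx[i] < 0 ? -gx[i] : gx[i];   /* |Gx|, CPU step (2) */
        int ay = gy[i] < 0 ? -gy[i] : gy[i];   /* |Gy|, CPU step (2) */
        g[i] = ax + ay;                        /* |G|, matrix addition step (3) */
    }
}
```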
In speech signal processing, FIR filters are often used for denoising. FIR filtering is a one-dimensional convolution operation, so only the convolution module of a PE needs to be used. Configure the crossbar so that the data flow follows the sequence shown in Figure 17, configure the convolution module for one-dimensional convolution, load the convolution kernel into COE RAM, start the coprocessor, and calculate the FIR filtering result.

Experiment and Resource Analysis
This section presents the resource analysis and performance analysis of the designed reconfigurable CNN acceleration coprocessor on a Xilinx FPGA. The FPGA model is the Xilinx xc7a100tftg256-1 (Xilinx, San Jose, CA, USA), and the synthesis tool is Vivado 18.1 (Xilinx, San Jose, CA, USA). FPGA synthesis is performed on the E203 SoC with the coprocessor attached. Table 5 shows the resource consumption of each main functional unit. As shown in Table 5, the E203 core and the coprocessor account for most of the resource consumption in the E203 SoC: the E203 core accounts for 24.2% of the total LUT consumption, while the designed coprocessor accounts for 47.6% of the LUT resources in the SoC.
To evaluate the performance of the coprocessor, the four basic algorithms of convolution, pooling, ReLU, and matrix addition are implemented in two ways: one with the I and M subsets of the RISC-V standard instruction set, the other with the coprocessor instructions designed in this paper, and the number of execution cycles of each algorithm is compared between the two implementations. A testbench loads the compiled binary file as the input stimulus for the entire E203 SoC, which is simulated with ModelSim to count the execution cycles of each algorithm under both implementations. The experimental results are shown in Table 6. As can be seen from Table 6, the coprocessor accelerates all four algorithms; the acceleration ratio of each algorithm is shown in Figure 18. The coprocessor has the most obvious acceleration effect on the convolution algorithm, which runs 6.27 times faster than with the standard instruction set. This is because, on the one hand, the coprocessor performs the convolution calculation in a dedicated hardware unit, while the RISC-V main processor performs it in software; on the other hand, the coprocessor architecture reduces data movement, further speeding up algorithm processing.


Conclusions
In this paper, we optimize the compact CNN accelerator designed in Reference [16] by interconnecting the operation units in the acceleration chain with a crossbar, realizing a reconfigurable CNN accelerator. Furthermore, the accelerator is connected to the E203 core as a coprocessor, yielding a reconfigurable CNN acceleration coprocessor. Based on the EAI interface, the hardware design of the reconfigurable CNN-accelerated coprocessor is completed. The designed custom coprocessor instructions are introduced, and the open-source GCC compilation tool chain is modified to complete the compilation environment. The coprocessor instructions are packaged into C function interfaces using inline assembly, establishing the coprocessor acceleration library functions, and the implementation of common algorithms on the coprocessor is described. Finally, the resource evaluation of the coprocessor is completed on a Xilinx FPGA. The evaluation results show that the designed coprocessor consumes only 8534 LUTs, accounting for 47.6% of the entire SoC system. The acceleration performance is evaluated by comparing the number of running cycles needed to implement the four basic algorithms (convolution, pooling, ReLU, and matrix addition) with the RISC-V standard instruction set and with the custom coprocessor instruction set. The results show that the coprocessor instruction set yields a significant speedup on all four algorithms, with convolution accelerated to 6.27 times the speed achieved by the standard instruction set.