Low-Power RTL Code Generation for Advanced CNN Algorithms toward Object Detection in Autonomous Vehicles

In the implementation of a convolutional neural network (CNN)-based object detection system, the primary issues are power dissipation and limited throughput. Even when ultra-low-power devices are used, the dynamic power dissipation issue remains difficult to resolve. During the operation of the CNN algorithm, several factors contribute: the heat generated by the massive computational complexity, the bottleneck caused by data transformation and limited bandwidth, and the power dissipated by redundant data accesses. This article proposes low-power techniques, applies them to a CNN accelerator in the FPGA and ASIC design flows, and evaluates them on the Xilinx ZCU-102 FPGA SoC hardware platform and on a 45 nm technology for ASIC, respectively. The proposed low-power techniques are applied at the register-transfer level (RT-level), targeting both FPGA and ASIC. We achieve up to a 53.21% power reduction in the ASIC implementation and save 32.72% of the dynamic power dissipation in the FPGA implementation. This shows that our RT-level low-power schemes offer strong potential for dynamic power reduction when applied to the FPGA and ASIC design flows for the implementation of a CNN-based object detection system.


Introduction
Among machine learning algorithms, the convolutional neural network (CNN) is currently one of the most popular architectures. Technology trends indicate that deep learning algorithms are becoming an essential feature of Internet of Things (IoT) devices, including autonomous vehicles. Some of these devices have limited power resources, especially electric vehicles; thus, an energy-aware system is indispensable due to the limited battery capacity. CNNs belong to a family of neural network models that also includes recurrent neural networks (RNN) and deep neural networks (DNN). These models have been used for recognition systems in the computer vision, video processing, and image processing research areas [1][2][3][4][5][6]. Operating similarly to the human brain in perception and recognition, neural network algorithms can process given visual information to recognize a target object or to predict its next movement [7][8][9][10]. The common feature of these neural network models is that they require abundant computational resources. To increase the computational capability, we need to extend several hardware resources, including memory capacity, storage size, and microcontroller capability, so that larger amounts of data can be processed simultaneously or in parallel. This increased hardware capability inevitably causes a large amount of power consumption. Conversely, reducing the power consumption of the system generally degrades the computational ability of the entire system [11]. As deep learning has become widely applied, the field-programmable gate array (FPGA) has emerged as a solution for power-efficient hardware platforms. The FPGA has been gaining attention as an extensible computational device, and it is known to consume less power than the graphics processing unit (GPU) [7][8][9][10][12].
With the growing usage of CNN-based Internet of Things (IoT) products, including autonomous vehicles, companies are developing and releasing customized chips of various sizes to support the massive amount of CNN computation, such as the tensor processing unit (TPU), deep learning processing unit (DPU), holographic processing unit (HPU), image processing unit (IPU), neural network processing unit (NPU), and vision processing unit (VPU) [13]. In the ASIC design flow, there are opportunities to minimize and optimize these chips for high performance and for the reduction of redundant power. Compact size, high-speed processing, energy-efficient design, and powerful parallel computing are the primary features of an artificial intelligence (AI) accelerator chip. However, in a compact chip design, excessive power dissipation leads to increased board temperature and cooling cost, as well as potential reliability problems [14].
The implementation process comprises the CNN architecture, the CNN accelerator, an efficient coding technique, and the hardware used to implement the power-efficient deep learning algorithm. Power dissipation can be approached primarily from two aspects: the hardware aspect and the software architecture/algorithm model aspect. For the algorithm model aspect, for instance, most researchers and developers have proposed novel CNN architectures and efficient memory structures that decrease processing time and improve parallel computing performance, thereby increasing power efficiency [15][16][17][18].
In terms of hardware, the major elements that can cause large power consumption are the memory size, the memory processing architecture, and the degree of parallel processing. For example, if the memory size increases, the circuit area allotted to the memory inevitably increases as well. Consequently, longer processing times, higher performance, and higher integration density generate more heat and consume more power. Given the limitations of technology evolution in terms of semiconductor feature size, production cost, market needs, and development issues, we focus on other elements of the hardware design flow. Traditionally, register-transfer-level (RT-level) optimization is an effective methodology for reducing the dynamic power consumption of a system. Therefore, in this article, we describe the RT-level low-power techniques demonstrated in our previous papers [19][20][21] and show how to apply them to a real industrial model used for real-time object detection. We evaluate the power consumption in a reliable experimental environment comprising an FPGA platform and an ASIC design flow.

Background
For edge computing, to overcome the lack of computational resources, researchers have analyzed the CNN computation process and developed various methodologies. These optimization and acceleration methodologies focus on performance metrics such as high accuracy, reduced processing time, maximized throughput, and improved power efficiency. This background section shows that our experiment is a previously untried methodology applied to industrial CNN code. Our target CNN architecture is a basic one, consisting of multiple interconnected convolution layers organized as shown in Figure 1.
Between convolution layers, the actual computing process is affected by loads and stores to the internal/external data storage units. The work in [22] proposed an energy-aware accelerator design to achieve high performance without degradation. According to that article, the crucial power issues were caused by the data transformation process; in particular, a large share of the computing resources was consumed by the convolution operation. Figure 2 shows the basic data processing flow of the convolutional neural network. Typically, it is composed of hierarchically interconnected multiplication and adder units. From the perspective of software developers and algorithm researchers, the multiplication and adder units degrade the processing speed, even after drastic cuts to the convolution size and the number of layers, unless the data transformation process is reduced or optimized. Thus, they began studying methodologies for optimizing the data access process, extending the buffer size, and developing efficient floating-point units. Y. Ma et al. proposed a minimized data flow to enhance throughput [23]. D. Nguyen et al. proposed a fully pipelined convolution layer to achieve enhanced hardware utilization [24]. This methodology minimized external DRAM accesses so that the DRAM power dissipation decreased. In addition, there are various methodologies for improving throughput, processing time, and power efficiency [7][8][9][10][25]. These articles introduced methods involving optimized kernel sizes, fixed-point arithmetic, vectorized convolution operations, batch normalization, data scaling, etc. [16,26,27]. Recently, implementing CNN-based object detection systems on FPGA SoC boards has become the most attractive approach in both industry and academia due to the flexibility of implementation and evaluation [25].
Most FPGA products provide a high-level synthesis (HLS) design tool, which helps create the RT-level description from a behavioral description of the hardware using well-known programming languages such as C. HLS requires a high-level, functional description of a design so that the RTL implementation can be generated and compiled automatically [7][8][9][10]. FPGA vendors provide such design tools: Xilinx provides the Vivado HLS tool, and Intel provides the OpenCL Board Support Package [28,29]. At the RT-level, the developer can additionally rearrange the processing schedule, reconfigure the circuit integration, and restructure the circuit logic elements for power reduction. A representative low-power technique is clock gating [20]. Y. Chen et al. analyzed the power breakdown of a chip running each convolution layer and reported that the clock network accounted for more than 32.9% of the total power consumption. In previous work, it was shown that the conventional clock gating technique could be applied to different hardware platforms, the convolution computation, and virtual memory reconfiguration [30]. The techniques are introduced in detail in Sections 2.1-2.6. This article shows the implementation of each technique on the FPGA hardware platform designed for the CNN-based accelerator.

Clock Gating
Clock gating is a technique that reduces dynamic power by eliminating unnecessary clock activity. For example, clock gating can be achieved by selectively stopping the clock when the calculation to be performed in the next clock cycle would be a duplicate. That is, the clock signal is either enabled or disabled depending on the idle condition of the logic network. Many recent CNN circuits have standby states, a characteristic of CNN operation, and therefore a large number of unnecessary clock cycles occur. Avoiding these unnecessary clock cycles with this technique can prevent significant power consumption. Figure 3 shows two types of registers, with and without an enable signal. Figure 3b represents a local explicit clock enable (LECE) [19][20][21]. The output Q is updated on the rising edge of the clock only when the ENABLE signal is high. Typically, logic synthesis implements the circuit shown in Figure 3b with a 2:1 multiplexer, or with a multiplexed D flip-flop if one is available in the target technology. If the enable signal is low for a significant fraction of the circuit's operation, and if input D and output Q are multi-bit buses, then a substantial amount of the power dissipated by the clock driver is wasted: driving the clock to this register while ENABLE is low does not change the circuit behavior. An alternative implementation is to gate the clock (CLK) itself. However, simply replacing the flip-flop's clock input with an AND of CLK and ENABLE is not recommended: if ENABLE transitions while CLK is high, glitch edges appear on the register clock input, causing incorrect circuit behavior. To avoid these edges, a latch must also be inserted so that, while CLK is high, no activity on ENABLE is transferred to the clock input.
Figure 4 illustrates the additionally required circuitry.
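The latch-plus-AND structure described above can be sketched in RTL as follows. This is a minimal illustrative sketch, not the circuitry of Figure 4 itself; the module and signal names are ours.

```verilog
// Latch-based clock gating cell: a transparent-low latch holds the
// enable stable while CLK is high, so glitches on ENABLE cannot
// propagate to the gated clock output.
module clock_gate (
    input  wire clk,
    input  wire enable,
    output wire gclk
);
    reg enable_latched;

    // Level-sensitive latch: enable is sampled only while clk is low
    always @(clk or enable)
        if (!clk)
            enable_latched <= enable;

    // The AND gate stops the clock whenever the latched enable is low
    assign gclk = clk & enable_latched;
endmodule
```

In practice, a standard cell library usually provides an integrated clock gating (ICG) cell with exactly this behavior, and clock-tree tools expect it to be instantiated rather than inferred.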

Local Explicit Clock Gating
The schematic shown in Figure 4 describes the local explicit clock gating (LECG) technique [19][20][21]. The output Q is updated on the rising edge of CLK, but only when ENABLE is high. Depending on the clock insertion flow, there are two ways to implement the LECG technique. In most cases, the physical clock insertion software limits the choice of gates that can be used for the logic implementation: a special clock gating cell may be needed, or the tool may require the insertion of a gate with the correct drive strength, in which case the RTL design must be modified to instantiate the required gates. In other cases, it may be possible to simply write the required functionality in RTL and let the synthesizer select the needed gates.
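As a hedged sketch of the second option, the LECG register can be written directly in RTL and left to the synthesizer; the parameter, module, and signal names below are illustrative, not the original design.

```verilog
// LECG sketch: a W-bit register driven by a gated clock instead of a
// multiplexer feedback path, so the clock driver is idle while the
// register holds its value.
module lecg_reg #(parameter W = 8) (
    input  wire         clk,
    input  wire         enable,
    input  wire [W-1:0] d,
    output reg  [W-1:0] q
);
    reg  enable_latched;
    wire gclk;

    // Transparent-low latch avoids glitches on the gated clock
    always @(clk or enable)
        if (!clk)
            enable_latched <= enable;

    assign gclk = clk & enable_latched;

    // Q updates only on rising edges of the gated clock
    always @(posedge gclk)
        q <= d;
endmodule
```

The wider the bus W, the more clock-driver power one gated clock saves compared with W enabled flip-flops.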

Bus-Specific Clock Gating
Based on a comparison of the I/O signals, the bus-specific clock gating (BSCG) technique [19][20][21] builds on clock gating by adjusting the EN signal, as shown in Figure 5a. The previously introduced LECE technique can deactivate blocks that are unnecessarily active, allowing each block to be controlled in a low-power scenario; it also filters floating signals to provide a stable signal for switching each block. When a large number of bits is processed at once, all bit lines on the data bus change state and consume redundant power. The BSCG technique therefore identifies the active bits and reduces the power delivered to the data bus by clocking only the active portion of the bus.
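One way to realize this per-portion gating is to split the bus into lanes, each with its own comparison-driven clock gate. The following is an illustrative sketch under our own naming and a two-lane split, not the exact circuit of Figure 5a.

```verilog
// BSCG sketch: a 16-bit bus split into two byte lanes. A lane's clock
// is gated off whenever its incoming byte equals the stored byte, so
// an update that touches only one lane does not clock the other.
module bscg_bus (
    input  wire        clk,
    input  wire [15:0] d,
    output wire [15:0] q
);
    genvar i;
    generate
        for (i = 0; i < 2; i = i + 1) begin : lane
            reg  [7:0] q_r;
            // Enable only when this lane's data would actually change
            wire       en = (d[8*i +: 8] != q_r);
            reg        en_lat;
            wire       gclk;

            // Latch-based gate, as in the previous sections
            always @(clk or en)
                if (!clk) en_lat <= en;
            assign gclk = clk & en_lat;

            always @(posedge gclk)
                q_r <= d[8*i +: 8];

            assign q[8*i +: 8] = q_r;
        end
    endgenerate
endmodule
```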

Enhanced Clock Gating
The clock power of sequential logic circuits accounts for a significant portion of the total power consumption. In addition, gate-level power analysis shows that XOR gates consume significantly less power than AND/OR gates [19,21]. Therefore, it is effective to use XOR gates in clock gating, taking multi-bit I/O data into account, as shown in Figure 5b. With this technique, the deeper the pipeline or the wider the I/O bus, the greater the power reduction.
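A hedged sketch of the XOR-based enable generation follows; the names and width are illustrative and the structure is our reading of Figure 5b, not a verbatim reproduction.

```verilog
// Enhanced clock gating sketch: a per-bit XOR compares the incoming
// data with the stored data; the reduction-OR of the XOR vector forms
// the enable, so the register is clocked only when some bit changes.
module ecg_reg #(parameter W = 16) (
    input  wire         clk,
    input  wire [W-1:0] d,
    output reg  [W-1:0] q
);
    // XOR comparison: high if any bit of d differs from q
    wire en = |(d ^ q);
    reg  en_lat;
    wire gclk;

    always @(clk or en)
        if (!clk) en_lat <= en;
    assign gclk = clk & en_lat;

    always @(posedge gclk)
        q <= d;
endmodule
```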

Memory Split
This technique reduces power by dividing a single large memory into several smaller memories combined through an output multiplexer [21]. As an example, the circuit diagram in Figure 6 shows how power reduction can be achieved by splitting an existing 512-byte memory into four 128-byte blocks and controlling each 128-byte memory block through a decoder and a multiplexer. In most memory structures, a half-size or smaller memory block uses less power than the full-size memory for read and write operations. However, to achieve this small memory block size, preliminary investigation is required to determine the optimal number of memory blocks and the optimal block size. In our proposed memory structure, even with an identical number of reads and writes, each operation is performed in only one of the small memory blocks instead of the full-size memory, which reduces the power.
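The decoder-plus-multiplexer arrangement can be sketched as follows. This is a minimal illustrative model under our own names and an assumed 8-bit data width; it mirrors the 512-byte/four-bank split of Figure 6 but is not the original circuit.

```verilog
// Memory-split sketch: a 512-entry memory built from four 128-entry
// banks. The upper two address bits act as the decoder selecting one
// bank; only that bank is accessed, and an output multiplexer merges
// the read data.
module split_mem (
    input  wire       clk,
    input  wire       we,
    input  wire [8:0] addr,   // 9-bit address covers 512 entries
    input  wire [7:0] din,
    output wire [7:0] dout
);
    reg [7:0] bank0 [0:127], bank1 [0:127], bank2 [0:127], bank3 [0:127];
    reg [7:0] r0, r1, r2, r3;
    reg [1:0] sel_r;

    wire [1:0] sel = addr[8:7];   // decoder input: bank select
    wire [6:0] a   = addr[6:0];   // offset within the selected bank

    always @(posedge clk) begin
        sel_r <= sel;             // remember which bank was read
        case (sel)                // only the selected bank is touched
            2'd0: if (we) bank0[a] <= din; else r0 <= bank0[a];
            2'd1: if (we) bank1[a] <= din; else r1 <= bank1[a];
            2'd2: if (we) bank2[a] <= din; else r2 <= bank2[a];
            2'd3: if (we) bank3[a] <= din; else r3 <= bank3[a];
        endcase
    end

    // Output multiplexer: forward the registered data of the read bank
    assign dout = (sel_r == 2'd0) ? r0 :
                  (sel_r == 2'd1) ? r1 :
                  (sel_r == 2'd2) ? r2 : r3;
endmodule
```

In a real flow, each bank would map to a separate SRAM macro or block RAM with its own enable, so the three unselected banks draw no read/write power.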

Proposed CNN Accelerator
Our target was to achieve maximum power reduction by applying our proposed low-power techniques without significantly changing the conventional layer structure of the CNN accelerator. The CNN accelerator was the most frequently used part of the object detection system and thus consumed the most power during the detection process. In this CNN accelerator, as shown in Figure 7, we reduced power consumption by restructuring the convolution layer part, which occupied most of the CNN accelerator. As shown in Figure 7, we applied the LECG technique described in the previous section to control the multiplier and adder blocks needed for convolution, and we maximized the power savings by controlling the data and weight inputs through the memory split technique.

Practical Application of the Industrial CNN Accelerator
Adder and multiplier blocks were the crucial components of the CNN accelerator, including its core processing element (PE) blocks, and they were also the dominant blocks in terms of power consumption. The convolution was performed using multiplications and additions.
The CNN accelerator that we used as our original model utilized 32-bit and 36-bit adders. The design used 16 multipliers, each with 16-bit inputs and a 32-bit output; each multiplier thus used a 32 × 16-bit bank of D flip-flops to store its output results. We employed clock gating to reduce the total dynamic power of these registers, as shown in Figure 8. The basic adder block also used many D flip-flops, and we reduced its dynamic power by clock gating, as shown in Figure 9. Figures 9b and 10 illustrate the top adder module, consisting of four adder blocks; each adder block operated in a pipeline and could therefore achieve power reduction. The last adder block in the pipeline should be activated only after the fourth clock cycle; thus, we could apply the clock gating technique, with the enable driven by the previous adder block. We used the same technique for the other adders in the pipeline. In the top module, all multipliers were enabled by the en input and executed in parallel; thus, they were gated by a single clock gating cell. Figure 11 shows the overall configuration of the CNN model to which we applied the low-power techniques. Eighteen adder blocks operated in the pipeline, with the clock gating technique applied to all of them.
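The cascaded-enable idea for the pipelined adders can be sketched as follows. This is a simplified illustrative model with our own names and a small adder tree, not the original 18-adder design; writing the enables as `if (en)` conditions lets the synthesizer infer the clock gates, as discussed in the LECG section.

```verilog
// Sketch of cascaded enables in a gated adder pipeline: each later
// stage is enabled one cycle after the previous one, so a stage's
// registers only start toggling once valid data reach it.
module gated_adder_pipe (
    input  wire        clk,
    input  wire        en,        // enable from the control logic
    input  wire [31:0] a, b, c, d,
    output reg  [31:0] sum
);
    reg        en1;               // enable delayed for the final stage
    reg [31:0] s0, s1;

    always @(posedge clk)
        en1 <= en;

    // First pipeline stage: two parallel adders share one enable
    always @(posedge clk)
        if (en) begin
            s0 <= a + b;
            s1 <= c + d;
        end

    // Final stage: enabled by the delayed signal from the stage above
    always @(posedge clk)
        if (en1)
            sum <= s0 + s1;
endmodule
```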

Experiment Results
In this article, we presented ultra-low-power techniques applied at the RT-level, rather than complex hardware design changes, to achieve low power. Specifically, we focused on reducing power consumption through clock gating optimizations. We demonstrated the proposed techniques on the CNN accelerator, which consumed the most power in the CNN architecture.

Testing Environment
To verify our design of the CNN accelerator for the FPGA, we used Xilinx Vivado for the HLS design and simulation, as shown in Figure 12.
Given all the parameters of the targeted board specification, the tool provided objective simulation results for the different power dissipation types. Our target board was the Xilinx ZYNQ Ultrascale+ ZCU-102 FPGA SoC board with a quad-core ARM Cortex-A53, a dual-core Cortex-R5F processor, and a GPU, based on Xilinx's 16 nm FinFET+ programmable fabric. It contained 600 K system logic cells, 32.1 Mb of memory, and 2520 DSP slices. Additionally, for the ASIC verification, we used the FreePDK45 45 nm technology library [31] and performed logic synthesis using the Synopsys Design Compiler. Power analysis was performed using the power results from synthesis; physical design was not performed for the power measurement.

FPGA Implementation Result
As described in the previous section, we applied three techniques, LECG, memory splitting, and multi-voltage scaling, on the Xilinx Ultrascale+ ZCU102 FPGA SoC board. As shown in Table 1 and Figure 13, the static power was reduced by 4.22% compared with the original Verilog RTL code. For the dynamic power dissipation, we achieved a 32.72% reduction, and the total power consumption decreased by 28.01%. As shown in Table 1, even though the absolute reduction was about 1 W, the more-than-30% power reduction showed that the suggested methods can be directly effective for the FPGA implementation of the CNN accelerator. Table 2 compares the power consumption with state-of-the-art works. Since we mainly suppressed unnecessary dynamic operation through clock gating, our techniques were more effective in terms of dynamic power.

ASIC Implementation Result
Along with the FPGA verification, we applied the LECG and memory splitting techniques to the ASIC flow. Due to academic research lab limitations, we were not able to proceed with the physical design when measuring the power consumption of the ASIC design; therefore, the multi-voltage scaling technique applied to the FPGA was not applied here. However, since we are currently working with industry, we plan to proceed with the physical design in the near future. We used the 45 nm FreePDK library for the ASIC implementation and performed logic synthesis with the Synopsys Design Compiler. We analyzed the power results obtained from logic synthesis, which are shown in Table 3. As described in Table 3 and Figure 14, static power was reduced by 34.9% and, above all, dynamic power was reduced by 53.21%. Overall, the proposed CNN accelerator consumed 52.68% less total power than the original design.

Conclusions
This article presented low-power dissipation achieved through a CNN-based accelerator implementation on an FPGA SoC board using representative, effective RT-level techniques. Based on the results, we ascertained that applying the clock gating and memory splitting techniques at the RT-level can reduce power dissipation without affecting the major functionality of the computational processing. In the FPGA hardware implementation, applying the clock gating and memory splitting methods to the synthesized Verilog code, we achieved up to a 32.72% reduction of the dynamic power consumption of the convolution operation on the Xilinx ZYNQ Ultrascale+ ZCU-102. This showed that low-power techniques at different levels of the design process enable additional power dissipation reduction. Furthermore, in the ASIC implementation, the power dissipation was reduced by 52.68% using the same techniques applied to the FPGA implementation, showing that these low-power techniques can achieve additional power reduction on a different target device. Our experimental results therefore greatly facilitate the optimization of the HLS design process when a small FPGA platform is used as a component of autonomous and electric vehicles. As future work, we will integrate these techniques into a CNN accelerator design for ASIC fabrication to achieve a low-power design solution that improves the power efficiency of autonomous vehicles.