A-DSCNN: Depthwise Separable Convolutional Neural Network Inference Chip Design Using an Approximate Multiplier

: For Convolutional Neural Networks (CNNs), Depthwise Separable CNN (DSCNN) is the preferred architecture for Application Speciﬁc Integrated Circuit (ASIC) implementation on edge devices. It beneﬁts from a multi-mode approximate multiplier proposed in this work. The proposed approximate multiplier uses two 4-bit multiplication operations to implement a 12-bit multiplication operation by reusing the same multiplier array. With this approximate multiplier, sequential multiplication operations are pipelined in a modiﬁed DSCNN to fully utilize the Processing Element (PE) array in the convolutional layer. Two versions of Approximate-DSCNN (A-DSCNN) accelerators were implemented on TSMC 40 nm CMOS process with a supply voltage of 0.9 V. At a clock frequency of 200 MHz, the designs achieve 4.78 GOPs/mW and 4.89 GOP/mW power efﬁciency while occupying 1.16 mm 2 and 0.398 mm 2 area, respectively.


Introduction
In today's age of technological development, Artificial Intelligence (AI) has become seamlessly integrated into society.In the past, AI computing relied on high-performance cloud computing, in which data is transmitted to a high-performance server for remote processing before being returned to their origin.Multiple penalties are incurred with this process; latency is a major concern and affects the ability of the local device to make timely decisions.
Therefore, the current development of the IC industry in AI is gradually moving toward the direction in which edge devices can operate independently.The design development of such devices is moving towards lower power consumption, higher throughput, and a smaller area.Amongst various reasons for this development are portable devices; higher power dissipation impacts the longevity of battery-operated devices, further limiting practical applications.This is why low-power chips are significant.
To address these issues, the improvement of the design of the PE as one of the basic elements in a CNN accelerator needs to be considered.This centers around the utilization of a configurable systolic array [1].Although dataflow efficiency is enhanced through this approach, the fundamental computations within the PE remain unchanged.The dominant aspect in the PE's basic computations is the multiplication operation, affecting area, latency, and power.This is where the incorporation of approximate computation methods using approximate multipliers becomes crucial for enhancing the PE design.However, the utilization of approximate multipliers may introduce other drawbacks to the overall network, including accuracy degradation and increased delay.This is where the contributions of this work come into play.A novel A-DSCNN architecture is proposed, integrating a new approximate multiplier, which seamlessly integrates into a modified DSCNN.In the implemented A-DSCNN, accuracy is maintained while efficiency is improved.This paper is mainly motivated by the above and implements a new hardware architecture dedicated to the CNN model.The method of further optimizing the circuit in this paper is to reduce the circuit's power consumption while limiting precision loss by simplifying and improving the architecture using approximate calculation methods.
Section 2 summarizes the background of CNN and DSCNN.Section 3 discusses the proposed DSCNN.The main contribution of this work is to propose a DSCNN employing a multi-mode approximate multiplier to reduce the number of computations and parameters.Then, through data scheduling and optimizing computations, the timing of the proposed A-DSCNN is further improved.Section 4 introduces this paper's experimental method and performance results.Upon completion of the hardware implementation, it is benchmarked against other CNN hardware designs.Analysis of the results is detailed towards the end of this Section.Finally, Section 5 reviews the architecture and ideas proposed in this paper.

Convolutional Neural Network (CNN)
A CNN consists of convolution layers, pooling layers, fully connected layers, and flattened layers.Compared with multi-layers neural networks, also known as multilayer perceptrons, CNN has fewer parameters.Another prominent feature of a CNN is that it can retain the location information of the image.In general, an image will have a certain degree of correlation between adjacent pixels.

Approximate Neural Networks
In this section, the focus is on introducing approximate arithmetic circuits employed in neural networks.The main objective is to explore the application of various approximate circuits or algorithms within neural network models.By utilizing these approximate circuits, it is anticipated that overall performance can be optimized while minimizing hardware requirements.This research direction holds significant growth potential, as it aims to reduce hardware complexity and associated costs while ensuring the integrity of the Neural Network remains intact.
Karnaugh map simplification is utilized by [2][3][4] for the implementation of digital circuitry that performs approximate arithmetic within a tolerable range.Using newer logic units or adders with improved architectures is proposed by [5,6].The former proposes a Lower-Part-OR Adder architecture, while the latter reorganizes different adders through re-grouping modules coupled with design changes to optimize dataflow.Approximate arithmetic operators for Multiply-Accumulate (MAC) operators, which encompass the critical path of CNN models, are introduced by [7].The high speed of N-way paths is leveraged by the proposed accelerator architecture.An approximation is implemented through hardware architecture improvements by [8,9].
The data flow of CNN processing and the energy efficiency achieved using a 2D systolic array architecture were discussed in [10].An accelerator named Eyeriss for CNN was proposed in this paper, which optimized the data flow between each PE.However, the data flow of Eyeriss restricted the PE of diagonal input data feature sharing and the lateral way of weight kernel sharing.
Additionally, the Run Length Compression proposed in [10] also resulted in additional power consumption.Next, a priority-driven CNN accelerator optimized for a 3 × 3 weight kernel was proposed in [11], where the modules were designed to work in parallel, fully utilizing the kernel and input image data.In this work, there was no need to design additional specific circuits, since weight sharing was unnecessary.As a result, power consumption and the chip area were reduced.
In [12], a unique approximation operation method was proposed.The approximate operation was initially performed through sampling to determine the location of the value that the pooling layer may retain.Selective convolution was performed by merging the pooling and convolutional layers, reducing redundant convolution operations.However, since this method required pre-processing the data, it was necessary to consider the penalties incurred from the additional encoder.

Depthwise Separable Convolution (DSC)
Another type of architecture that improves upon a CNN is Depthwise Separable Convolution (DSC).To reduce the high computational complexity required by CNN, Google utilizes a novel computational structure in MobileNet, aiming to decrease the computational cost of a CNN model [13].
MobileNet is a faster CNN architecture and a smaller model incorporating a new convolutional layer known as DSC.Due to their compact size, these models are particularly suitable for implementation in mobile and embedded devices.This approach primarily divides the original convolution into two separate parts: Depthwise Convolution (DWC) and Pointwise Convolution (PWC), enabling a reduction in computational requirements without compromising the underlying structure.
The difference between DSCNN and conventional CNN lies in the approach of the DWC.In DWC, the convolution kernel is split into a single-channel form, as illustrated in Figure 1.For each channel of the input data, a filter of size k is established, and separate convolution operations are performed on each channel without altering the depth of the input feature image.However, this process imposes a limitation on the expansion of the feature map's size, thereby restricting the dimensionality of the feature map.Consequently, using PWC becomes necessary to combine the outputs of the DWC and generate a new feature map.PWC essentially represents a 1 × 1 convolution, as illustrated in Figure 2, and its operational method is similar to traditional convolution.The kernel size in PWC is 1 × 1 × M, where the parameter M corresponds to the number of channels in the previous layer of the DWC.
Subsequently, PWC combines the feature maps from the previous step along the depth dimension, generating new feature maps based on the number of kernels.The output of both operations is equivalent to that of a conventional convolutional layer.
PWC serves two primary purposes: firstly, it enables DSCNN to adjust the number of output channels, and secondly, it consolidates the output feature maps from DWC.The preference for DSCNN over conventional CNN arises from its significant reduction in computational requirements and parameters, resulting in improved computational efficiency.While a slight decrease in accuracy can be expected with DSCNN, the deviation remains within an acceptable margin.As shown in Figure 3, the number of parameters and calculations of conventional CNN is Dk × Dk × M × N and Dk × Dk × M × N × DF × DF, respectively.Meanwhile, the number of computations and parameters in a DSCNN, which comprises both depthwise convolution and pointwise convolution, is Dk × Dk × M + M × N and Dk × Dk × M × DF× DF + M × N × DF × DF.By simplifying the aforementioned equations, the ratio of param- eters and computations in a DSCNN compared to a conventional CNN can be estimated in Equation (1).
The main objective of this work is to have conventional CNN replaced by DSCNN as the primary architecture to reduce the number of computations and parameters, thereby significantly decreasing power consumption.
Subsequently, introductions to some related papers are provided.In [14], an accelerator design for reconfigurable DSCNN is proposed.However, it is plagued by low hardware utilization rates and timing delays encountered during each feature map reconstruction.
In [15], an adaptive row-based dataflow of a unified reconfigurable engine for DSC is proposed.However, due to its reconfigurable nature, it suffers from low hardware utilization and energy efficiency.
An architecture for energy-efficient convolutional units for DSC is presented in [16].This architecture implements array-based PE to optimize the workloads of PWC and DWC.The paper also introduces a reconfigurable multiplier array with a 2 n coefficient to optimize data reads.Nevertheless, challenges are encountered regarding low hardware utilization and energy efficiency due to the proposed configurable multiplier array.Furthermore, the paper primarily focuses on optimizing PWC, resulting in a lower utilization rate of the PE during DWC and leading to timing and hardware reuse issues.
The framework known as DSCNN is utilized in most of the aforementioned reviewed papers, aiming to enhance overall performance compared to conventional CNN.However, in some papers, the original characteristics of DSCNN are disregarded due to the complex design of layer optimization in neural networks.As a result, a new DSCNN architecture is proposed in this work for neural network image recognition.It involves the combination of DSCNN with the approximate operation method and implementing a complete hardware architecture on the ASIC.

Proposed Method
The proposed method, A-DSCNN, will be discussed in this section.

Multi-Mode Approximate Multiplier
The multi-mode approximate multiplier proposed in this paper is depicted in Figure 4.The original multiplier circuit is divided into two blocks using a control signal alternating between two operation modes.This paper's image and weight input data are partitioned into two parts: the MSB part and the LSB part.Mode-0 is employed to compute the LSB, while Mode-1 is utilized for the MSB.
The MSB portion takes precedence in this work, as it plays a more critical role in the computation.Conventional multiplication is employed for this part, ensuring that no subsequent errors are introduced.On the other hand, the LSB portion utilizes an encoder to calculate the LSBs and divides the value into two-digit groups.Considering only the larger even-numbered bits as the new value reduces the computational complexity from 8-bit to 4-bit.Since the proposed approximate multiplier matches this 4-bit computation, the existing hardware can be repurposed by adding control signals.This results in a reduction in chip area and power dissipation, thereby achieving lower power consumption.The operation of the approximate multiplier involves dividing it into the MSB and LSB parts using a control signal.For the LSB part, the control signal is set to 0, activating Mode-0 of the multiplier for calculations.Conversely, for the MSB part, the control signal is set to 1, initiating Mode-1 for calculations.Since the same internal circuitry is utilized, the same multiplier hardware array can be reused.A shift register is employed to output the bits correctly to perform two consecutive 4-bit × 4-bit multiplications.Consequently, compared to a conventional multiplier, the proposed multiplier offers reduced power consumption and occupies less area.
To ensure the functionality and identify potential errors in the multipliers, the designs of both the approximate and standard multipliers undergo synthesis and mapping to the specific process technology, namely TSMC CMOS.This allows for gate-level simulations to be conducted.
To verify the performance of the multipliers, random numbers are generated and used to test both the approximate and standard multipliers.The results of these multiplications are compared to the ideal multiplication results, as expressed in Equation ( 2) [17].This comparison helps evaluate the accuracy and reliability of the multipliers under consideration.
where k random numbers are used, with m multiplicands and n multipliers generating P(actual) products through the approximate or standard multiplier circuits, compared to the ideal product P(ideal).The multiplication operations are carried out using signed integers 12 bits to match the truncation performed by the approximate multiplier.For this specific evaluation, k is set to 10,000, and it has been observed that the errors do not significantly increase for higher values of k, but a large computation resource is required for the circuit simulation.Table 1 summarizes the comparison between the approximate multiplier and the standard multiplier.Under the random number test, the multiplication error for the approximate multiplier is found to be 1.2% higher than that of the standard multiplier.These results confirm that the approximate multiplier can be successfully integrated into the CNN architecture.

DSCNN with Multi-Mode Approximate Multiplier
The proposed approach involves scheduling Mode-1 and Mode-0 to operate sequentially, utilizing the same multiplier, thereby creating a multi-mode approximate multiplier.This multiplier is then integrated into a DSCNN, leading to novel hardware architecture known as A-DSCNN.The implemented hardware incorporates the approximate computations mentioned earlier and employs a pipelined scheduling strategy, which is illustrated in Figure 5 for a typical convolutional layer implemented in A-DSCNN.The operation of the convolution core is given as follows: 1.
Initially, the image input and weight inputs are loaded from an off-chip memory by the control unit and stored in the input buffer and weight buffer, respectively.2.
The encoder control determines whether the input buffer and weight buffer data should undergo encoding, and accordingly, the reformatted data are obtained.

3.
The reformatted data are then supplied to the convolution core in the A-DSCNN PE array for computation.

4.
The convolution control unit within the A-DSCNN PE array decides whether to perform depthwise convolution and pointwise convolution, generating a new job.

5.
The newly created job comprises a set of instructions pipelined for processing.6.
After scheduling, individual instructions are sent to the multi-mode approximate multiplier for computation.Control signals determine if the computed data need to be shifted.7.
Once the computations are complete, the computed results are accumulated and sent back to the output buffer to finalize the convolution operation.
Timing is often not considered in conventional DSCNN, as observed in [16].As a result, the operation process follows the same approach as conventional CNN, where the system waits for the input to be read before proceeding to the next convolution step.To address this timing issue, the computational steps are rescheduled to align with the new pipelined strategy, in addition to splitting the original convolution operation into the DWC and the PWC.
The timing sequence of the convolution operation is demonstrated with an example in Figure 6, considering one convolution operation.In this example, since the kernel size is 3 × 3, it is necessary to read in nine values for each convolution operation.In reference to [16], the MAC operation is not performed until the input is loaded onto the buffer.This approach leads to a sequential execution, where each step waits for the completion of the previous step before proceeding to the subsequent MAC step.Consequently, the overall runtime performance is adversely affected.To overcome this limitation, jobs are scheduled to pipeline instructions in the proposed A-DSCNN approach, with the same MAC operation.With this method, computations can commence as soon as the first input is read, reducing the number of clock cycles required to complete the operation.As depicted in Figure 6, the number of cycles is reduced from 18 to 11 in comparison to conventional DSCNN [16].Additionally, reusing hardware logic units further reduces the area required for implementation.
Before discussing the performance in the next section (Section 4), it is important to mention the design trade-off associated with the proposed A-DSCNN.Implementing the multi-mode approximate multiplier introduces overheads in the A-DSCNN, such as including control blocks ('Control' in Figure 4) and dedicated scheduling.Therefore, it is crucial to consider the timing of the A-DSCNN, as illustrated in Figure 6, when designing the scheduling scheme.This is followed by determining the operation of control blocks.The two-mode approximate multiplier offers a good compromise between accuracy, area and timing, although it is also possible to have a three-mode approximate multiplier.

Performance Results
Considering the relatively large size of the VGG16 model [18], a modified version of VGG (referred to as "VGGnet") is employed as the initial hardware implementation for performance analysis.The image classification task will be performed using the CIFAR-10 dataset [19].This modification is required to relax the requirement to perform the simulations, especially the circuit-level simulations.
The hardware architecture is designed using structured Verilog HDL.The operating frequency is set to 200 MHz during timing simulation, and the design is implemented during the TSMC 40-nm CMOS process.The operating frequency and technology are chosen to perform a compatible comparison with the existing work for benchmarking, Table 2, although it can be implemented at a lower technology node and higher operating frequency if desired.The table presents data on power consumption, area, and energy efficiency from various reference papers.According to the summarized information in Table 2, the proposed Approximate-DSCNN accelerators achieve approximately 20% higher power efficiency compared to the works recently reported in papers [20,21].Additionally, the proposed accelerators occupy only 13% of the area of the design presented in [20].It is worth mentioning that the design described in [21] also utilizes the VGG16 model.The entire hardware architecture is illustrated in the architecture diagram shown in Figure 7. Table 3 provides additional details about the modified VGGnet model for the hardware implementation.With 7997 parameters, it is a smaller network with a slight accuracy penalty that is used primarily for verification.With its parameters quantized to 12-Bit, the total memory footprint of the network is 11.7 kB.Additionally, the VGG16 model presented here is the same model found in [18] with the conventional convolutional layers replaced with depthwise separable convolutional layers.This reduces the number of parameters to 20,595,585.The total memory footprint of the network is 29.46 MBs.
In CNN architectures, the primary components of the PE are the multipliers and adders.This work focuses on key design considerations, including the proposed multiplier, latency, hardware reuse, and reduction of redundant computations.To facilitate low latency dataflow, both internal and external buffers must be employed.Figure 7 highlights the necessary on/off chip buffers included in the design.When taking into consideration the size of the input/output feature maps, the primary constraint on the off-chip memory are the network parameters, as shown in Table 4.And as such, the size of the network employed would be the key consideration.The different on-chip buffers serve to reduce latency amongst the various cores.Table 4 contains a breakdown of the buffers used by the different cores, which is dictated by the largest layers within the network that employ the specific cores.
By addressing these design considerations, the proposed A-DSCNN (VGGnet) achieves significant savings in hardware resources, with a 53% reduction compared to conventional DSCNN [16].The area reported in [16] is only for its convolutional layer, Table 2. Furthermore, the energy efficiency is improved by 1.56 times, resulting in an accelerator design with a smaller area and higher performance.
For the physical implementation of the chip, the Innovus software from Cadence [22] is utilized to perform the Place and Route (PnR) process, generating the layout file of the circuit.Other Electronic Design Automation (EDA) tools, such as VCS from Synopsys [23], are employed for chip simulation and functional verification.
Figure 8 showcases the completed A-DSCNN (VGG16) accelerator layout, while Table 5 outlines the specifications.With an operating frequency of 200 MHz, the accelerator core area is 1.16 mm 2 .It operates at a 0.9 V supply voltage, resulting in power consumption of 486.81 mW.The main components (convolutional, pooling, and dense layers) account for approximately 78% of the total area, while other components (registers and control units) make up the remaining 22%.The main components contribute to around 90% of the total power consumption, whereas other components (registers and control units) are responsible for the remaining 10%.The breakdowns are also summarized in Table 5.

Figure 3 .
Figure 3.The parameter and computation in (a) Conventional Convolutional Neural Network, and (b) Depthwise Separable CNN.

Figure 5 .
Figure 5. Hardware structure of a convolutional layer in A-DSCNN (PE: Processing Element).

Figure 7 .
Figure 7.The overall hardware architecture of A-DSCNN.

Table 1 .
Comparison of Approximate Multiplier and Standard Multiplier.
RMSE: Root Mean Square Error.

Table 2 .
Comparison of A-DSCNN with Recent Accelerators.

Table 3 .
Model Architecture of VGG16 and Modified VGGnet.are presented in the following format; Kernel, Filters, Activation Function multiplied by the number of convolutional layers.

Table 4 .
Size of Registers required for Internal Buffers.