Article

ECHO: Energy-Efficient Computation Harnessing Online Arithmetic—An MSDF-Based Accelerator for DNN Inference

Department of Computer Engineering, Chosun University, Gwangju 61452, Republic of Korea
*
Authors to whom correspondence should be addressed.
Electronics 2024, 13(10), 1893; https://doi.org/10.3390/electronics13101893
Submission received: 8 April 2024 / Revised: 1 May 2024 / Accepted: 10 May 2024 / Published: 11 May 2024
(This article belongs to the Section Circuit and Signal Processing)

Abstract

Deep neural network (DNN) inference demands substantial computing power, resulting in significant energy consumption. A large number of negative output activations in convolution layers are rendered zero due to the invocation of the ReLU activation function. This results in a substantial number of unnecessary computations that consume significant amounts of energy. This paper presents ECHO, an accelerator for DNN inference designed for computation pruning, utilizing an unconventional arithmetic paradigm known as online/most significant digit first (MSDF) arithmetic, which performs computations in a digit-serial manner. The MSDF digit-serial computation of online arithmetic enables overlapped computation of successive operations, leading to substantial performance improvements. The online arithmetic, coupled with a negative output detection scheme, facilitates early and precise recognition of negative outputs. This, in turn, allows for the timely termination of unnecessary computations, resulting in a reduction in energy consumption. The implemented design has been realized on the Xilinx Virtex-7 VU3P FPGA and subjected to a comprehensive evaluation through a rigorous comparative analysis involving widely used performance metrics. The experimental results demonstrate promising power and performance improvements compared to contemporary methods. In particular, the proposed design achieved average improvements in power consumption of up to 81%, 82.9%, and 40.6% for the VGG-16, ResNet-18, and ResNet-50 workloads, respectively, compared to the conventional bit-serial design. Furthermore, significant average speedups of 2.39×, 2.6×, and 2.42× were observed when comparing the proposed design to conventional bit-serial designs for the VGG-16, ResNet-18, and ResNet-50 models, respectively.

1. Introduction

Since their inception, deep neural networks (DNNs) have demonstrated remarkable performance and have been recognized as state-of-the-art methods, achieving near-human performance in various applications, e.g., image processing [1], bioinformatics [2], and natural language processing [3]. The efficacy of DNNs becomes particularly evident in scenarios involving vast amounts of data with complex features that may not be easily discernible by humans. This positions DNNs as invaluable tools to address evolving data processing needs. It is widely observed that the number and organization of layers significantly influence the performance of the network [4]. DNNs consisting of more layers often offer enhanced feature extraction capabilities. However, it is essential to acknowledge that deeper networks typically demand a larger number of parameters, consequently requiring more extensive computational resources and memory capacity for effective training and inference.
DNN training and inference demand significant computing power, leading to the consumption of considerable energy resources. A study presented in [5] estimated that the dynamic power consumed in training a 176-billion-parameter language model produces 24.7 metric tonnes of CO2. This environmental impact underscores the urgency of optimizing DNN implementations, a concern that has gained widespread attention from the research community [6,7]. Addressing these challenges is necessary to maintain a better trade-off between the benefits of deploying DNNs for various applications and their power and environmental costs.
In their fundamental structure, DNNs rely on basic mathematical operations such as multiplication and addition, culminating in the multiply–accumulate (MAC) operation. MAC operations can constitute approximately 99% of all processing in convolutional neural networks (CNNs) [8]. The configuration and arrangement of MAC blocks depend on the depth and structure of the DNN. For instance, the ResNet model with 152 layers, the pioneering DNN that surpassed human-level accuracy in the ImageNet challenge, requires 11.3 GMAC operations and 60 million weights [9,10]. Typically, processing a single input sample in a DNN demands approximately one billion MAC operations [11]. This highlights the potential for substantial reductions in computational demands by enhancing the efficiency of MAC operations. In this context, researchers have suggested reducing the computation by replacing floating-point numbers with fixed-point representations, with minimal or no loss in accuracy [12,13]. Another plausible approach involves varying the bit precision used during MAC operations [14], an avenue extensively explored within the area of approximate computing [15]. Research indicates that implementing approximate computing techniques in DNNs can result in power savings for training and inference of up to 88% [16]. However, many of these approximate computing methods compromise accuracy, which may not be acceptable for some critical applications. Consequently, devising methodologies that enhance DNN computation efficiency without compromising output accuracy becomes crucial.
A typical CNN comprises several convolution and fully connected layers, such as VGG-16, shown in Figure 1. In such CNN models, the convolution layers typically serve the purpose of complex feature extraction, while the fully connected layers perform the classification task using the features extracted by the convolution layers. During inference, these layers execute MAC operations on the input, utilizing trained weights to produce a distinct feature representation as the output [3]. The layered architecture, configured in a specific arrangement, can effectively approximate a desired function. Although convolutional and fully connected layers excel at representing linear functions, they are inadequate for applications necessitating nonlinear representations. To introduce nonlinearity in the DNN, the outputs of these layers are processed through a nonlinear operator, referred to as an activation function [17]. As each output value must traverse an activation function, selecting the appropriate one has a significant impact on the performance of DNNs [18].
The rectified linear unit (ReLU) is among the most widely employed activation functions in modern DNNs [19]. The simplicity and piece-wise linearity of ReLU contribute to faster learning and stability in values when utilizing gradient descent methods [17]. The ReLU function, as shown in Figure 2, acts as a filter that passes positive input values unchanged and outputs zero for negative inputs. This implies that precision in the output is crucial only when the input is a positive value. The input to a ReLU function typically originates from the output of a fully connected or convolution layer in the DNN, involving a substantial number of MAC operations [11]. Various researchers have indicated that a significant proportion, ranging from 50% to 95%, of ReLU inputs in DNNs are negative [20,21,22,23]. Consequently, a considerable amount of high-precision computation in DNNs is discarded, as output elements are reduced to zero after the ReLU function. The early detection of these negative values has the potential to curtail the energy expended on high-precision MAC operations, ultimately leading to a more efficient DNN implementation.
To this end, we propose ECHO: Energy-Efficient Computation Harnessing Online Arithmetic—an MSDF-based accelerator for DNN inference aimed at computation pruning at the arithmetic level. ECHO leverages an unconventional arithmetic paradigm, i.e., online or MSDF arithmetic, which works in a digit-serial manner, taking inputs and producing outputs from the most significant to the least significant side. The MSDF nature of online-arithmetic-based computation units, combined with a negative output detection scheme, can aid in the accurate detection of negative outputs at an early stage, which results in promising computation and energy savings. The experimental results demonstrate that ECHO achieves promising power and performance improvements compared to contemporary methods.
The rest of the paper is organized as follows. Section 2 presents a comprehensive review of the relevant literature. Section 3 presents an overview of online-arithmetic-based computation units and the details of the proposed design. The evaluation and results of the proposed methodology are presented in Section 4, followed by the conclusions in Section 5.

2. Related Works

Over the past decade, researchers have extensively addressed challenges in DNN acceleration and proposed solutions such as DaDianNao [24], Stripes [25], Bit-Pragmatic [26], Cnvlutin [27], and Tetris [28]. Bit-Pragmatic [26] is designed to leverage the bit information content of activation values. The architecture employs bit-serial computation of activation values, using sparse encoding to skip computations for zero bits. By avoiding unnecessary computations for zero bits in activation values, Bit-Pragmatic achieves performance and energy improvements. On the other hand, Cnvlutin [27] aims to eliminate unnecessary multiplications when the input is zero, leading to improved performance and energy efficiency. The architecture of Cnvlutin incorporates hierarchical data-parallel compute units and dispatchers that dynamically organize input neurons, skipping these unnecessary multiplications to keep the computing units busy and achieve superior resource utilization.
Problems such as unnecessary computations and the varying precision requirements across different layers of CNNs have been thoroughly discussed in the literature [29,30]. These computations can contribute to increased energy consumption and resource demands in accelerator designs. To tackle these challenges, researchers have investigated application-specific architectures tailored to accelerate the matrix–matrix multiplication in the convolution layers of DNNs [31,32]. To reduce the number of MAC operations in DNNs, the work in [33] noted that neighboring elements within the output feature map often display similarity. By leveraging this insight, the authors achieved a 30.4% reduction in MAC operations with a 1.7% loss in accuracy. In [34], the introduction of processing units equipped with an approximate processing mode resulted in substantial improvements in energy efficiency, ranging from 34.11% to 51.47%. However, these gains were achieved at the cost of a 5% drop in accuracy. This trade-off highlights the potential benefits of approximate processing modes in achieving energy savings but underscores the need to carefully balance these gains with the desired level of accuracy in DNN computations.
In recent years, there has been a noticeable trend towards the adoption of bit-serial arithmetic circuits for accelerating DNNs on hardware platforms [25,35,36,37,38]. This transition is motivated by several factors: (1) the aim to reduce computational complexity and required communication bandwidth; (2) the capability to accommodate varying data precision requirements across different deep learning networks and within network layers; (3) the flexibility to adjust compute precision by manipulating the number of compute cycles in DNN model evaluations using bit-serial designs; and (4) the improvement in energy and resource utilization through the early detection of negative results, leading to the termination of ineffective computations yielding negative results.
A significant contribution in this context is Stripes [25], recognized for pioneering the use of bit-serial multipliers instead of conventional parallel multipliers in its accelerator architecture to address challenges related to power consumption and throughput. In a similar direction, UNPU [35] extends the Stripes architecture by incorporating look-up tables (LUTs) to store inputs for reuse multiple times during the computation of an input feature map. These advancements mark significant progress towards more efficient and effective CNN accelerator designs.
These accelerators are designed to enhance the performance of DNN computations through dedicated hardware implementations. However, it is important to note that none of these hardware accelerators have explicitly focused on the potential for computation pruning through early detection of negative outputs for ReLU activation functions. The aspect of efficiently handling and optimizing ReLU computations, especially in terms of early detection and pruning of negative values, remains an area where further exploration and development could lead to improvements in efficiency and resource utilization.
Most modern CNNs commonly employ ReLU as an activation function, which filters out negative results from convolutions and outputs zeros instead of the negative values. Traditional CNN acceleration designs typically perform the ReLU activation separately after completing the convolution operations. Existing solutions involve special digit encoding schemes, like those in [22,23], or complex circuit designs [21,39] to determine whether the result is negative. SnaPEA [21] aims to reduce the computation of a convolutional layer followed by a ReLU activation layer by identifying negative outputs early in the process. However, it is important to note that SnaPEA introduces some complexities. It requires reordering parameters and maintaining indices to match input activations with the reordered parameters correctly. Additionally, in predictive mode, SnaPEA necessitates several profiling and optimization passes to determine speculative parameters, adding an extra layer of complexity to the implementation. MSDF arithmetic operations have also emerged as a valuable approach for early detection of negative activations [39,40]. Shuvo et al. [39] introduced a novel circuit implementation for convolution where negative results can be detected early, allowing subsequent termination of the corresponding operations. Similarly, USPE [41] and PredictiveNet [42] propose statistically splitting values into most significant bits and least significant bits for early negative detection. Other works in this avenue include [43,44,45]. However, existing methods for early detection of negative activations often rely on digit encoding schemes, prediction using a threshold, or additional complex circuitry, which may introduce errors or increase overhead.

3. Materials and Methods

A convolution layer in a neural network processes an input image by applying a set of M 3D kernels in a sliding window fashion. A standard CNN-based classifier model is generally composed of two main modules: the feature extraction module and the classification module. The feature extraction module typically comprises stacks of convolution, activation, and pooling layers, while the classification module contains stacks of fully connected layers. Activation functions, which introduce nonlinearity, are placed either after the convolution layer or after the pooling layer to enhance the representation power of the model and enable it to capture complex patterns and features in the data. In the conventional convolution layers of a CNN, the computation comprises a sequence of MAC operations to derive the output feature maps. Each MAC operation involves the multiplication of corresponding elements from the kernel and input feature maps, followed by the summation of the results. The convolution operation performed within a CNN layer can be expressed by a basic weighted sum or sum-of-products (SOP) equation, depicted as follows:
y_{ij} = \sum_{a=0}^{K-1} \sum_{b=0}^{K-1} w_{ab} \, x_{(i+a)(j+b)}   (1)
In the given equation, one output pixel at position (i, j) in layer l is represented by y_{ij}. x is one of the inputs from the input feature map that is convolved with the weight kernel w. As can be observed from the equation, in order to produce the output at position (i, j), the value of the kernel w remains the same. In fact, the kernel of dimension k × k slides over the input feature map, so only the input x changes. To this end, a weight-stationary dataflow is often adopted to perform the convolution operation.
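To make the weighted-sum formulation concrete, the following minimal Python sketch computes the output feature map of Equation (1) with a weight-stationary loop nest. The array shapes and the 3 × 3 kernel are illustrative assumptions, not values taken from the paper.

import numpy as np

def conv2d_sop(x, w):
    """Direct sum-of-products convolution of Equation (1) (valid padding).
    The kernel w stays fixed (weight-stationary) while the input window slides."""
    K = w.shape[0]
    H, W = x.shape
    y = np.zeros((H - K + 1, W - K + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            # SOP for one output pixel: sum_{a,b} w[a,b] * x[i+a, j+b]
            y[i, j] = sum(w[a, b] * x[i + a, j + b]
                          for a in range(K) for b in range(K))
    return y

# Illustrative example with an assumed 5x5 input and 3x3 averaging kernel
x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0
print(conv2d_sop(x, w))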

3.1. Online Arithmetic

Online arithmetic is a digit-serial arithmetic in which the inputs are fed in and the output is generated digit by digit from left to right (LR), following the most-significant-digit-first mode of computation [46]. The algorithms are designed such that, to produce the j-th digit of the result, j + δ digits of the corresponding input operands are required, as illustrated in Figure 3. Here, δ is referred to as the online delay, which varies for different arithmetic operations and is typically a small integer ranging from 1 to 4. To generate the most significant output first, online arithmetic leverages a redundant number system, making the cycle time independent of the working precision. One notable advantage of online arithmetic is the ability to pipeline the execution of dependent operations. As soon as the most significant digit (MSD) of the predecessor unit is generated, the successive unit can initiate computation [47]. This stands in contrast to conventional digit-serial arithmetic, where all digits must be known to start computation. While it is possible to have overlapped/pipelined computation with conventional digit-serial arithmetic in either MSDF or least significant digit first (LSDF) manner, it becomes challenging when arithmetic operations, such as division (MSDF), are followed by multiplication (LSDF). Since online arithmetic always computes in the MSDF manner, overlapped computation of dependent operations is consistently feasible. This property enhances the efficiency and parallelization capabilities of online arithmetic.
The property of generating outputs based on partial input information in online computation negates the need to store intermediate results. Instead, these results can be immediately consumed in subsequent computations. Consequently, this leads to a reduction in the number of read/write operations from/to memory, resulting in lower memory traffic and subsequent energy savings [48].
To facilitate generating outputs based on partial input information, online computation relies on flexibility in computing digits. This is achieved by employing a redundant digit number system. In this study, a signed-digit (SD) redundant number system is utilized, where numbers are represented in a radix-r format that offers more than r values for each digit position. Specifically, a symmetric radix-2 redundant digit set of {−1, 0, 1} is used. To ensure compatibility, the online modules work with fractional numbers, simplifying operand alignment. In this system, the first digit of the operand carries a weight of r^{-1}, and at a given iteration j, the digit x_j is represented by two bits, namely, x^+ and x^-. The numerical value of the digit x_j is computed by subtracting (SUB) the x^- bit from the x^+ bit, as shown in relation (2).
x_j = \mathrm{SUB}(x^+, x^-) = x^+ - x^-   (2)
The input and output at iteration j are given by (3) and (4), respectively.
x[j] = \sum_{i=1}^{j+\delta} x_i r^{-i}   (3)
z[j] = \sum_{i=1}^{j} z_i r^{-i},   (4)
where j represents the iteration index and the subscript i denotes the digit index. A given online algorithm executes over q + δ cycles. The algorithm processes a single input digit in each of the q iterations. Starting δ cycles after the first input digit is provided, it generates a single output digit in each subsequent iteration. This approach emphasizes digit-level processing and output generation, contributing to the overall efficiency of online computation.
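As an illustration of relations (2)–(4), the short Python sketch below converts a fraction into radix-2 signed digits {−1, 0, 1} in MSDF order and evaluates the partial value. The digit-selection thresholds of ±1/2 are a common choice and an assumption here, since the paper does not spell out a conversion routine; note that signed-digit representations are not unique.

def sd_digits(value, n):
    """Convert a fraction in (-1, 1) into n radix-2 signed digits {-1, 0, 1}, MSDF.
    Each digit x_j could equivalently be stored as an (x+, x-) bit pair with
    x_j = SUB(x+, x-) = x+ - x-, as in relation (2)."""
    digits, w = [], value
    for _ in range(n):
        w *= 2
        d = 1 if w >= 0.5 else (-1 if w <= -0.5 else 0)  # assumed selection thresholds
        digits.append(d)
        w -= d
    return digits

def partial_value(digits, j):
    """Evaluate the partial value sum_{i=1..j} x_i * 2^-i, in the spirit of (3)."""
    return sum(d * 2 ** -(i + 1) for i, d in enumerate(digits[:j]))

digits = sd_digits(0.171875, 8)          # e.g., 44/256, one normalized operand
assert partial_value(digits, 8) == 0.171875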

3.1.1. Online Multiplier (OLM)

In most CNN architectures, the convolution operation during inference entails the multiplication of a fixed-weight kernel with the input image in a sliding window manner. This characteristic necessitates the repeated use of the kernel matrix for the convolution operation. In such scenarios, an online multiplier with one operand processed in parallel and the other in a serial manner can offer advantages, enabling parallel utilization of the weight kernel while the input is sequentially fed. In this study, a non-pipelined serial–parallel multiplier, as described in [49] and illustrated in Figure 4a, is employed. The multiplier produces its output in MSDF manner after an online delay of δ = 2 cycles. The serial input and output in each cycle are denoted by (3) and (4), respectively, while the constant weight is represented as:
Y[j] = Y = y_0 \cdot r^0 + \sum_{i=1}^{n} y_i r^{-i}   (5)
The pseudocode of the non-pipelined radix-2 serial–parallel online multiplier is shown in Algorithm 1. Additional details related to the recurrence stage and selection function of the serial–parallel online multiplier are presented in [49].
Algorithm 1 Serial–parallel online multiplication
1: Initialize:
   x[-2] = w[-2] = 0
2: for j = -2, -1 do
3:    v[j] = 2w[j] + (x_{j+2} · Y) 2^{-2}
4:    w[j+1] ← v[j]
5: end for
6: Recurrence:
7: for j = 0 to n + δ do
8:    v[j] = 2w[j] + (x_{j+2} · Y) 2^{-2}
9:    z_{j+1} = SELM(v̂[j])
10:   w[j+1] ← v[j] - z_{j+1}
11:   Z_out ← z_{j+1}
12: end for
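A behavioral Python model following the structure of Algorithm 1 is sketched below; it reuses sd_digits from the earlier sketch in this section. The selection thresholds of ±1/2, the residual bound, and the exact digit indexing are assumptions chosen so the recurrence converges, and may differ in minor ways from the hardware selection function SELM described in [49].

def online_serial_parallel_multiply(x_digits, Y, delta=2, out_digits=None):
    """Behavioral model of the serial-parallel online multiplier (Algorithm 1).
    x arrives digit-serially (MSDF signed digits), Y is available in parallel.
    A bounded residual w is kept, and one product digit is emitted per cycle."""
    n = len(x_digits)
    out_digits = 2 * n if out_digits is None else out_digits
    xs = x_digits + [0] * (out_digits + delta)   # digits beyond the input are 0
    w, z = 0.0, []
    for j in range(delta):                       # online delay: no output yet
        w = 2 * w + xs[j] * Y * 2 ** -delta
    for j in range(out_digits):                  # recurrence: one output digit/cycle
        v = 2 * w + xs[j + delta] * Y * 2 ** -delta
        d = 1 if v >= 0.5 else (-1 if v <= -0.5 else 0)   # assumed SELM thresholds
        z.append(d)
        w = v - d
    return z

# Check against the first product of the worked example in Section 3.1.2
A, B = 44 / 256, 35 / 256                        # normalized operands
z = online_serial_parallel_multiply(sd_digits(A, 8), B)
assert sum(d * 2 ** -(i + 1) for i, d in enumerate(z)) == A * B   # 1540 / 65536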

3.1.2. Online Adder (OLA)

To sum the products generated by the online multipliers (OLMs) and obtain the SOP, an online adder (OLA) is employed. The online adder follows the same mode of operation, i.e., MSDF; therefore, as soon as the MSD from the multiplier is produced, it is fed to the OLA. This property of online arithmetic allows us to pipeline the subsequent operations at the digit level, which further contributes to early detection and subsequent termination of negative activations. The online adder is developed by serializing redundant adders, such as signed-digit adders. As observed in Figure 4b, the online adder follows a two-step addition. To avoid carry propagation, relevant logic and registers are inserted to generate appropriate intermediate results. The online delay of the OLA is δ = 2. Further details and relevant derivations can be found in [50].
We present an example of the bit-level computation of the online multiplier and online adder, as shown in Figure 5. We consider a simple SOP example represented as A × B + C × D, where A, B, C, and D represent n-bit vectors. OLM1 and OLM2 represent the online multipliers and the online adder is denoted as OLA, while X, Y, and Z denote the outputs of OLM1, OLM2, and the final SOP result, respectively. δ_× and δ_+ denote the online delays of the multipliers and adder, respectively. Consider the values of the operands as A = 44, B = 35, C = -62, and D = 87; therefore, the SOP result should be Z = -3854. The input operands are first normalized by dividing each operand by 2^n, i.e., shifting the operands n positions towards the right, where n = 8. The normalized values of the operands are A = 0.171875, B = 0.13671875, C = -0.2421875, and D = 0.33984375. These numbers are then converted to their redundant representations, resulting in the vectors A = [0.0101̄01̄00], B = [0.00100101̄], C = [0.01̄000010], and D = [0.101̄01̄001̄]. Since the operands in their redundant digit representation are fractional numbers, the weight of the MSD position is 2^{-1} and that of the LSD position is 2^{-8}. The MSDs and LSDs of the operands are denoted as A[7], B[7], …, D[7] and A[0], B[0], …, D[0], respectively.
The timing diagram of the bit-level computation of the SOP presented in the example is depicted in Figure 6. The MSDs of the outputs X and Y are generated after the online delay of δ_× compute cycles with a weight of 2^{-1} and are denoted as X[15] and Y[15], respectively. The output Z of the online adder has a bit width of 2n + 1 and the weight of its MSD is 2^0, labeled as Z[16] in the figure. The timing diagram presents the results of the first 15 cycles. However, the 2n-bit results of OLM1 and OLM2 are obtained as X = [0.0000101̄000000100] and Y = [0.0001̄01̄01̄0001̄001̄0], which are fractional numbers. After shifting left by 2n positions and conversion, we obtain the results of the multipliers as X = A × B = 1540 and Y = C × D = -5394, which match the expected results of the multiplications. After the computation of the SOP up to the least significant digit, the result is obtained as Z = [0.0001̄00010001̄0010], which after a 2n-bit left shift yields Z = -3854, the expected SOP output.
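The worked example can be checked numerically with the multiplier model sketched earlier; the digit-serial online adder itself is not modeled here, so the two product digit streams are simply evaluated and summed. This snippet (reusing sd_digits and online_serial_parallel_multiply from the earlier sketches) only confirms the expected values.

A, B, C, D = 44 / 256, 35 / 256, -62 / 256, 87 / 256   # normalized operands

def digits_to_value(digits):
    return sum(d * 2 ** -(i + 1) for i, d in enumerate(digits))

X = digits_to_value(online_serial_parallel_multiply(sd_digits(A, 8), B))
Y = digits_to_value(online_serial_parallel_multiply(sd_digits(C, 8), D))
Z = X + Y                                    # OLA not modeled; plain addition instead

print(round(X * 2 ** 16), round(Y * 2 ** 16), round(Z * 2 ** 16))   # 1540 -5394 -3854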

3.2. Proposed Design

This section provides a detailed overview of the architecture of ECHO, which is built upon online computation units and decision units featuring early detection and termination capabilities for negative outputs. The arrangement of computation units within the processing engine (PE) of the proposed design, along with the techniques for identifying and terminating ineffective convolutions (resulting in negative outputs), is thoroughly discussed.

3.2.1. Processing Engine Architecture

Each processing engine (PE), as illustrated in Figure 7, is equipped with k × k online serial–parallel multipliers, succeeded by a reduction tree comprising online adders. This configuration is designed to compute a k × k convolution, as outlined in Equation (1), on an input channel or feature map. The input pixel is serially fed in a digit-by-digit manner, while the kernel pixel is concurrently fed in parallel, as indicated by the different arrows in Figure 7.
Each multiplier within the PE is tasked with multiplying one pixel in the convolution window with the corresponding pixel in the same feature map of the convolution kernel. Consequently, all (k × k) pixels are processed concurrently. The number of cycles required for a PE to generate its output can be computed as follows:
PE_{Cycles} = \delta_{\times} + \delta_{+} \times \log_2(k \times k) + p_{out_{PE}}   (6)
where δ_× and δ_+ are the online delays of the online multiplier and adder, respectively, log_2(k × k) is the number of reduction tree stages required to generate the SOP of the k × k multipliers, and p_{out_PE} is the precision of the SOP result generated by a PE. p_{out_PE} is calculated as follows:
p_{out_{PE}} = p_{out_{Mult}} + \log_2(k \times k)   (7)
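As a quick illustration of Equations (6) and (7), the snippet below evaluates the PE cycle count for an assumed configuration (k = 3, δ_× = δ_+ = 2, and a 16-digit multiplier output for 8-bit operands). These parameter values, and the use of the ceiling of the logarithm for the reduction-tree depth, are assumptions for illustration, not figures reported in the paper.

from math import ceil, log2

def pe_cycles(k, delta_mul=2, delta_add=2, p_out_mult=16):
    """Equations (6) and (7): cycles for one PE to emit its full SOP result.
    ceil() of log2 is assumed for the reduction-tree depth over k*k products."""
    stages = ceil(log2(k * k))
    p_out_pe = p_out_mult + stages                        # Equation (7)
    return delta_mul + delta_add * stages + p_out_pe      # Equation (6)

print(pe_cycles(k=3))   # 2 + 2*4 + (16 + 4) = 30 cycles for the assumed configuration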

3.2.2. Early Detection and Termination of Negative Computations

The convolution operation in CNN models can be referred to as a multiple-channel multiple-kernel (MCMK) convolution, where the term channel corresponds to the input feature maps and kernel corresponds to the convolution kernels. In practice, a multiple-channel single-kernel (MCSK) convolution is carried out repeatedly using several kernels to generate the output of a convolution layer. A basic MCSK convolution in a CNN can be described by the following equation:
Y(i, j) = \sum_{c=1}^{N} \sum_{q=1}^{k} \sum_{r=1}^{k} X(q+i-1,\, r+j-1,\, c) \cdot W(q, r, c) + b   (8)
where Y(i, j) represents the output at the (i, j) position in the output feature map, X(·) denotes the input, and W(·) denotes the weights. The innermost double summation performs the k × k convolution for the c-th input feature map, N denotes the number of input feature maps, and b denotes the bias.
In most CNN accelerator designs, the focus is often on achieving faster and more efficient generation of the SOP. However, only a limited number of works explore the potential for early assessment of negative values in the activation layer, particularly for ReLU. Early detection, and the subsequent termination of the computation, of negative activations poses a significant challenge in accelerators based on conventional arithmetic. For instance, in conventional bit-serial multipliers, the multiplicand undergoes parallel processing, whereas the multiplier is processed serially. During each iteration, a partial product is produced and stored in a register. Subsequently, this partial product is shifted into the appropriate position before being added to other partial products to yield the final product. Typically, either an accumulator or a sequence of adders, such as carry-save adders or ripple-carry adders, is utilized for this reduction process. In the context of convolution, an additional level of reduction is necessary to compute the sum of k × k multiplications (the innermost double summation in (8)) in order to derive the output pixel. Furthermore, if there are multiple input feature maps, an extra level of reduction is required to add the k × k SOPs across the N input feature maps, as depicted by the outermost summation in (8). In conventional bit-serial multipliers, determining the most significant bit and discerning the polarity of the result entails waiting until all partial products have been generated and added to the previous partial sums. Among the limited studies addressing the early detection of negative activations, some utilize either a digit encoding scheme or an estimation technique for early negative detection [23], while others rely on complex circuit designs such as the carry propagation shift register (CPSR) demonstrated in [39].
Addressing the challenge of early detection and termination of negative activations can be effectively accomplished by utilizing the intrinsic capability of online arithmetic to produce output digits in an MSDF manner. ECHO enables the detection and subsequent halting of negative activation computation within p cycles, where p < N , with N denoting the number of cycles required for complete result computation. This is realized through the monitoring and comparison of output digits. The process of identifying negative activations and subsequently stopping the relevant computation is outlined in Algorithm 2.
Algorithm 2 Early detection and termination of negative activations in ECHO
1: z_j^+, z_j^- bits
2: for j = 1 to NumCycles do
3:    z^+[j] ← Concat(z^+[j], z_j^+)
4:    z^-[j] ← Concat(z^-[j], z_j^-)
5:    if z^+[j] < z^-[j] then
6:       Terminate
7:    else
8:       Continue
9:    end if
10: end for
The decision unit incorporates registers to retain the output z^+[j] and z^-[j] bits, signifying the positive and negative output digits of the convolution in redundant number representation. In each iteration, the newly computed digits are appended to their corresponding previous digits, as outlined in Algorithm 2. When the value of z^+[j] drops below z^-[j], indicating a negative output, the control unit generates a termination signal, halting the computation of the respective SOP. This straightforward process, leveraging the inherent MSDF nature of online arithmetic, enables early detection and termination of negative outputs, thereby conserving a significant number of the computation cycles otherwise spent on convolutions that result in negative values and leading to substantial energy savings.
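A compact Python rendering of Algorithm 2 and the decision unit is given below. It consumes the SOP digits MSDF and stops as soon as the accumulated z^+ bits fall below the z^- bits: at that point the partial value is at most -2^{-j} while the remaining digits can contribute less than 2^{-j} in total, so the final activation cannot be non-negative. The digit-stream interface is a modeling assumption.

def detect_negative(digit_stream):
    """Behavioral model of Algorithm 2. digit_stream yields signed digits {-1, 0, 1}
    of one output activation, MSDF. Returns (is_negative, cycles_consumed)."""
    z_pos = z_neg = 0                  # accumulated z+ / z- bit strings as integers
    cycles = 0
    for d in digit_stream:
        cycles += 1
        z_pos = (z_pos << 1) | (d == 1)
        z_neg = (z_neg << 1) | (d == -1)
        if z_pos < z_neg:
            # partial value <= -2^-j while the remaining digits sum to < 2^-j,
            # so the final activation is certainly negative: terminate early
            return True, cycles
    return False, cycles

assert detect_negative([0, -1, 1, 1, 0]) == (True, 2)    # detected after 2 cycles
assert detect_negative([0, 1, -1, -1, 0]) == (False, 5)  # positive, runs to completion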

3.2.3. Accelerator Design

The overall architecture of ECHO is depicted in Figure 8. It consists of multiple PEs arranged in a 2D array, where each column in the array contains N PEs. The convolution kernels and input activations stored in the DRAM are pre-loaded into the weight buffers (WBs) and activation buffers (ABs) using the activation and weight interconnect. The activations and weights are then forwarded via the same interconnect to the PE array for computation. The results of the computations are written back to the DRAM using the memory manager in the central controller. The output of each column is fed serially to a corresponding decision unit in the central controller, where the polarity of the output activation is determined dynamically. If the output activation is determined to be negative, a termination signal is generated and propagated to the corresponding column, resulting in the subsequent termination of the computation in that column. The various signals and data lines are color-coded in Figure 8, and the description of the color coding can be found in Figure 9. By design, the proposed accelerator architecture supports input tiling through the invocation of N PEs in each column, while the number of columns in the accelerator array can be determined on the basis of the architecture of the CNN and the target hardware. The number of cycles required to generate the result of a convolution can be calculated as
SOP_{Cycles} = \delta_{\times} + \delta_{+} \times \log_2(k \times k) + \delta_{Add} \times \log_2(N) + p_{out}   (9)
where log_2(N) is the number of reduction tree stages required to add the SOP results of the N input feature maps to generate the final output of the convolution, and p_{out} = p_{out_PE} + log_2(N) is the precision of the final SOP result of the convolution.
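Extending the PE-cycle sketch, Equation (9) can be evaluated for an assumed layer with k = 3 and N = 64 input feature maps. The online delays, precisions, equal adder delays for δ_+ and δ_Add, and the use of the ceiling of the logarithms are again illustrative assumptions.

from math import ceil, log2

def sop_cycles(k, N, delta_mul=2, delta_add=2, p_out_mult=16):
    """Equation (9): cycles to produce one complete convolution output over N maps."""
    pe_stages = ceil(log2(k * k))
    map_stages = ceil(log2(N))
    p_out = (p_out_mult + pe_stages) + map_stages          # p_out = p_outPE + log2(N)
    return delta_mul + delta_add * pe_stages + delta_add * map_stages + p_out

print(sop_cycles(k=3, N=64))   # 2 + 8 + 12 + 26 = 48 cycles for the assumed layer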
The central controller, as shown in Figure 9, consists of the memory manager block and several decision units, one dedicated to each column in the accelerator array. At the end of each convolution SOP computation, the central controller generates signals (see the DRAM read signal in Figure 9) for new input activations and/or weights to be loaded into the activation and weight buffers, respectively. The decision unit blocks in the central controller receive the respective column output digits serially (indicated by black arrows in Figure 9) and generate a column termination signal (indicated by the red arrow in Figure 9) when the column output is detected to be negative.
The design of the decision unit is presented in Figure 9. It consists of a comparator unit, where a comparison between the z^+ and z^- redundant digits is carried out. Moreover, it also contains registers to store the redundant output digits of the respective column. In the case of a convolution resulting in a positive output, these registers are populated for SOP_Cycles cycles and then forwarded to the memory manager block to be written back to the DRAM. The counter block keeps count of the SOP_Cycles so that the final positive result is forwarded to the DRAM. This is achieved by the invocation of the multiplexers for each register, i.e., when the SOP_Cycles have elapsed, the counter triggers the selection signal of the multiplexer to route the value stored in the registers to the memory manager. However, in the event of a convolution resulting in a negative value, the comparator generates a column termination signal, as indicated by the red arrow in Figure 9, which subsequently terminates the computation of the respective column. Moreover, the column termination signal is also used as the reset signal for the registers and counter in the decision unit, thereby selecting the multiplexer input connected to 0 and subsequently forwarding a 0 value to the memory manager.
The following section (Section 4) details the experiments conducted to assess the performance of the proposed design on modern CNNs and compares the proposed design with contemporary designs.

4. Experimental Results

To evaluate and compare the performance of the proposed design, we conduct an experimental evaluation using two baseline designs: (1) Baseline-1: a conventional bit-serial accelerator design based on the UNPU accelerator [35]; (2) Baseline-2: an online-arithmetic-based design without the capability for early detection and termination of the ineffective negative computations. The peak throughput of the accelerator, independent of the workload, can be calculated as 2 × Frequency × N_Multipliers [51,52]. Here, the frequency (f = 100 MHz = 0.1 GHz) is expressed in GHz and the number of multipliers used in the proposed design is N_Multipliers = 9216; therefore, the peak throughput of the ECHO accelerator is calculated to be 1843.2 GOPS.
For a fair comparison, both baseline designs use the same accelerator array layout as ECHO. We evaluate ECHO on VGG-16 [53] and ResNet [10] workloads. The layer-wise architecture of the CNN models is presented in Table 1. As shown in the table, each of the CNNs contains several convolution layer blocks, where the convolution layers within a block have the same number of convolution kernels (M) and output feature map dimensions (R × C). It is also worth noting that while VGG-16 contains a maxpooling layer after every convolution block, the ResNet-18 CNN performs down-sampling by a strided (S = 2) convolution in the first convolution layer of each block of layers, except for the maxpooling layer after C1.
We use pre-trained VGG-16, ResNet-18, and ResNet-50 models obtained from torchvision [54] for our experiments. For the evaluation, we use 1000 images from the validation set of the ImageNet database [9]. The RTL of the proposed and baseline accelerators is designed, functionally verified, and simulated using Xilinx Vivado 2023.2. The implementation is carried out on the Xilinx Virtex-7 VU3P FPGA, and the power is estimated using the implemented design in Vivado 2023.2. The FPGA device used in the experiments contains 862 K logic cells, 394 K CLB LUTs, 25.3 Mb of block RAM, and 2280 DSP slices.
Table 2 and Table 3 present the layer-wise comparative results for inference time, power consumption, and speedup for the VGG-16 and ResNet-18 networks, respectively. It is worth noting that for the VGG-16 workload, the online-arithmetic-based design without the early negative detection capability (Baseline-2) outperforms the conventional bit-serial design (Baseline-1). The Baseline-2 design, on average, achieves an 18.66% improvement in inference time compared to the Baseline-1 design. It also consumes 2.7× less power and achieves a 1.23× speedup in computation compared to the Baseline-1 design, which underscores the superior capability of online-arithmetic-based DNN accelerator designs. Similarly, ECHO achieves 58.16% and 48.55% improvements in inference time, consumes 5.3× and 1.9× less power, and also achieves 2.39× and 1.94× speedups compared to the Baseline-1 and Baseline-2 designs, respectively, for the VGG-16 network.
Similarly, for the ResNet-18 model, as indicated in Table 3, ECHO achieves 61.5% and 50% improvements in inference time, consumes 6.2× and 1.94× less power, achieves 2.52× and 2.0× higher performance, and also achieves 2.62× and 2.0× speedups compared to the Baseline-1 and Baseline-2 designs, respectively.
Figure 10 presents the runtime of ECHO compared to the baseline methods for the VGG-16 and ResNet-18 workloads. From the figures, it can be noted that the accelerator design based on online arithmetic (Baseline-2) outperforms the conventional bit-serial design (Baseline-1) even without the capability of early detection of negative outputs, which emphasizes the superiority of online-arithmetic-based designs over conventional bit-serial designs. As shown in Figure 10a, ECHO shows mean improvements of 58.16% and 48.55% in runtime compared to the Baseline-1 and Baseline-2 designs, respectively, for the VGG-16 workloads. Similarly, as shown in Figure 10b, ECHO achieves 61.8% and 50.0% improvements in runtime on ResNet-18 workloads compared to the Baseline-1 and Baseline-2 designs, respectively. These average runtime improvements lead to average speedups of 2.39× and 2.6× for the VGG-16 and ResNet-18 CNN models, respectively, compared to the conventional bit-serial design. Similarly, compared to the online-arithmetic-based design without the capability of early detection of negative outputs (Baseline-2), ECHO achieves average speedups of 1.9× and 2.0× for the VGG-16 and ResNet-18 workloads, respectively. The faster runtimes of ECHO are due not only to the efficient design of the online-arithmetic-based processing elements, but also to the large number of negative output activations, which helps in terminating the ineffective computations, resulting in substantial power and energy savings.
The power consumption is related to the execution time as well as the utilization of resources in the accelerator. The proposed early detection and termination of negative output activations can result in substantial improvements in power consumption. Figure 11 shows the layer-wise power consumption of ECHO compared to the baseline designs. For the VGG-16 model, as shown in Figure 11a, ECHO achieves 81% and 48.72% improvements in power consumption compared to the Baseline-1 and Baseline-2 designs, respectively. Similarly, as depicted in Figure 11b, ECHO shows significant improvements of 82.9% and 50.22% in power consumption for ResNet-18 workloads compared to the Baseline-1 and Baseline-2 designs, respectively.
A comparison of the FPGA implementation of ECHO with the conventional bit-serial design (Baseline-1) and the online arithmetic design without the early negative detection capability (Baseline-2) is presented in Table 4. All the designs in these experiments were evaluated at a frequency of 100 MHz. It can be observed from the comparative results that the logic resource and BRAM utilization of ECHO are marginally higher than those of the Baseline-1 design. The accelerator performance is calculated using the relation Performance = Ops/t, where Ops is the total number of MAC operations performed and t denotes the duration it takes to perform the said number of operations for a particular workload [55]. ECHO achieves 13.8×, 2.6×, and 3.3× higher performance in terms of GOPS compared to the Baseline-1 design for the VGG-16, ResNet-18, and ResNet-50 CNN workloads, respectively. Similarly, ECHO achieves 2.39×, 2.6×, and 1.59× improvements in latency/image for the VGG-16, ResNet-18, and ResNet-50 workloads, respectively, compared to the Baseline-1 design. We also implemented ECHO without the early detection and termination capability (Baseline-2) on the FPGA, against which ECHO achieves performance improvements of 10.67×, 2.0×, and 2.1× for the VGG-16, ResNet-18, and ResNet-50 workloads, respectively. Moreover, ECHO also achieves 1.94×, 2.01×, and 2.01× improvements in latency for the VGG-16, ResNet-18, and ResNet-50 models compared to the Baseline-2 design. ECHO exhibits improved power consumption due to the use of the proposed method of early detection of negative results. In particular, for the VGG-16 model, ECHO achieved improvements of 81.25% and 48.72% compared to the Baseline-1 and Baseline-2 designs, respectively. The proposed design shows similar improvements of 82.9% and 50.22% compared to the Baseline-1 and Baseline-2 designs, respectively, for the ResNet-18 workload. ECHO achieved 40.6% and 50.3% less power consumption for the ResNet-50 model compared to the Baseline-1 and Baseline-2 designs, respectively.
Similar to the comparison with the baseline designs, a comparison of the proposed design with contemporary methods is also conducted for the VGG-16, ResNet-18, and ResNet-50 workloads. The comparative results are presented in Table 5. For the VGG-16 workload, the proposed design achieves 3.88×, 1.75×, 1.65×, and 1.03× higher performance compared to NEURAghe [56], the designs in [57,58], and Caffeine [59], respectively. In terms of energy efficiency, ECHO achieves 2.86× and 2.01× improvements compared to NEURAghe [56] and OPU [58], respectively. Similarly, the proposed design outperforms the designs in [51,56] by 2.12× and 1.38× in terms of performance in GOPS for the ResNet-18 workload. ECHO also achieves a superior energy efficiency of 21.58 GOPS/W compared to 5.8 GOPS/W for [56] and 19.41 GOPS/W for [51]. For the ResNet-50 network, the proposed design achieves 1.43× and 14.8% superior results in terms of performance and energy efficiency, respectively. However, despite the promising performance indicated by these comparative results, the proposed design utilizes a slightly larger number of logic resources compared to its contemporary counterparts. Another area for future research in this avenue lies in the lightweight design of the online-arithmetic-based compute units, where a compact design may help in reducing the area/logic resource overhead experienced in the present design.

5. Conclusions

This research introduces ECHO, a DNN hardware accelerator focused on computation pruning through the use of online arithmetic. ECHO effectively addresses the challenge of early detection of negative activations in ReLU-based DNNs, showcasing substantial improvements in power efficiency and performance for VGG-16, ResNet-18, and ResNet-50 workloads. Compared to existing methods, the proposed design demonstrates superior performance, with mean performance improvements of 39.81%, 42.97%, and 29.92% for VGG-16, ResNet-18, and ResNet-50, respectively. With regard to energy efficiency, ECHO achieved superior energy efficiencies of 48.41, 21.58, and 22.51 GOPS/W for the VGG-16, ResNet-18, and ResNet-50 workloads, respectively. Moreover, ECHO achieved average improvements in power consumption of up to 81%, 82.90%, and 40.60% for the VGG-16, ResNet-18, and ResNet-50 workloads, respectively, compared to the conventional bit-serial design. Furthermore, significant average speedups of 2.39×, 2.60×, and 2.42× were observed when comparing the proposed design to conventional bit-serial designs for the VGG-16, ResNet-18, and ResNet-50 models, respectively. Additionally, ECHO outperforms online-arithmetic-based designs without early detection, achieving average speedups of 1.9×, 2.0×, and 2.0× for the VGG-16, ResNet-18, and ResNet-50 workloads, respectively. These findings underscore the potential of the proposed hardware accelerator in enhancing the efficiency of convolution computation during DNN inference.

Author Contributions

Conceptualization, M.S.I. and M.U.; methodology, M.S.I.; software, M.S.I. and M.U.; validation, J.-A.L. and M.U.; formal analysis, M.S.I.; investigation, M.S.I. and M.U.; writing—original draft preparation, M.S.I.; writing—review and editing, M.S.I., M.U. and J.-A.L.; visualization, M.S.I.; supervision, J.-A.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Basic Science Research Program funded by the Ministry of Education through the National Research Foundation of Korea (NRF-2020R1I1A3063857).

Data Availability Statement

Data are contained within the article.

Acknowledgments

The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gao, G.; Xu, Z.; Li, J.; Yang, J.; Zeng, T.; Qi, G.J. CTCNet: A CNN-transformer cooperation network for face image super-resolution. IEEE Trans. Image Process. 2023, 32, 1978–1991. [Google Scholar] [CrossRef] [PubMed]
  2. Usman, M.; Khan, S.; Park, S.; Lee, J.A. AoP-LSE: Antioxidant Proteins Classification Using Deep Latent Space Encoding of Sequence Features. Curr. Issues Mol. Biol. 2021, 43, 1489–1501. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26. [Google Scholar] [CrossRef]
  4. Kwon, H.; Chatarasi, P.; Pellauer, M.; Parashar, A.; Sarkar, V.; Krishna, T. Understanding reuse, performance, and hardware cost of dnn dataflow: A data-centric approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; pp. 754–768. [Google Scholar]
  5. Gupta, U.; Jiang, D.; Balandat, M.; Wu, C.J. Towards green, accurate, and efficient ai models through multi-objective optimization. In Workshop Paper at Tackling Climate Change with Machine Learning; ICLR: Vienna, Austria, 2023. [Google Scholar]
  6. Narayanan, D.; Harlap, A.; Phanishayee, A.; Seshadri, V.; Devanur, N.R.; Ganger, G.R.; Gibbons, P.B.; Zaharia, M. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, Huntsville, ON, Canada, 27–30 October 2019; pp. 1–15. [Google Scholar]
  7. Deng, C.; Liao, S.; Xie, Y.; Parhi, K.K.; Qian, X.; Yuan, B. PermDNN: Efficient compressed DNN architecture with permuted diagonal matrices. In Proceedings of the 2018 51st Annual IEEE/ACM international symposium on microarchitecture (MICRO), Fukuoka, Japan, 20–24 October 2018; pp. 189–202. [Google Scholar]
  8. Jain, S.; Venkataramani, S.; Srinivasan, V.; Choi, J.; Chuang, P.; Chang, L. Compensated-DNN: Energy efficient low-precision deep neural networks by compensating quantization errors. In Proceedings of the 55th Annual Design Automation Conference, San Francisco, CA, USA, 24–29 June 2018; pp. 1–6. [Google Scholar]
  9. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, US, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  11. Hanif, M.A.; Javed, M.U.; Hafiz, R.; Rehman, S.; Shafique, M. Hardware–Software Approximations for Deep Neural Networks. Approx. Circuits Methodol. CAD 2019, 269–288. [Google Scholar]
  12. Zhang, S.; Mao, W.; Wang, Z. An efficient accelerator based on lightweight deformable 3D-CNN for video super-resolution. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 2384–2397. [Google Scholar] [CrossRef]
  13. Lo, C.Y.; Sham, C.W. Energy efficient fixed-point inference system of convolutional neural network. In Proceedings of the 2020 IEEE 63rd International Midwest Symposium on Circuits and Systems (MWSCAS), Springfield, MA, USA, 9–12 August 2020; pp. 403–406. [Google Scholar]
  14. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 525–542. [Google Scholar]
  15. Agrawal, A.; Choi, J.; Gopalakrishnan, K.; Gupta, S.; Nair, R.; Oh, J.; Prener, D.A.; Shukla, S.; Srinivasan, V.; Sura, Z. Approximate computing: Challenges and opportunities. In Proceedings of the 2016 IEEE International Conference on Rebooting Computing (ICRC), San Diego, CA, USA, 17–19 October 2016; pp. 1–8. [Google Scholar]
  16. Liu, B.; Wang, Z.; Guo, S.; Yu, H.; Gong, Y.; Yang, J.; Shi, L. An energy-efficient voice activity detector using deep neural networks and approximate computing. Microelectron. J. 2019, 87, 12–21. [Google Scholar] [CrossRef]
  17. Szandała, T. Review and comparison of commonly used activation functions for deep neural networks. Bio-Inspired Neurocomput. 2021, 203–224. [Google Scholar]
  18. Dubey, S.R.; Singh, S.K.; Chaudhuri, B.B. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing 2022, 503, 92–108. [Google Scholar] [CrossRef]
  19. Cao, J.; Pang, Y.; Li, X.; Liang, J. Randomly translational activation inspired by the input distributions of ReLU. Neurocomputing 2018, 275, 859–868. [Google Scholar] [CrossRef]
  20. Shi, S.; Chu, X. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv 2017, arXiv:1704.07724. [Google Scholar]
  21. Akhlaghi, V.; Yazdanbakhsh, A.; Samadi, K.; Gupta, R.K.; Esmaeilzadeh, H. Snapea: Predictive early activation for reducing computation in deep convolutional neural networks. In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, USA, 1–6 June 2018; pp. 662–673. [Google Scholar]
  22. Lee, D.; Kang, S.; Choi, K. ComPEND: Computation Pruning through Early Negative Detection for ReLU in a deep neural network accelerator. In Proceedings of the 2018 International Conference on Supercomputing, Beijing, China, 12–15 June 2018; pp. 139–148. [Google Scholar]
  23. Kim, N.; Park, H.; Lee, D.; Kang, S.; Lee, J.; Choi, K. ComPreEND: Computation Pruning through Predictive Early Negative Detection for ReLU in a Deep Neural Network Accelerator. IEEE Trans. Comput. 2021, 71, 1537–1550. [Google Scholar] [CrossRef]
  24. Luo, T.; Liu, S.; Li, L.; Wang, Y.; Zhang, S.; Chen, T.; Xu, Z.; Temam, O.; Chen, Y. DaDianNao: A neural network supercomputer. IEEE Trans. Comput. 2016, 66, 73–88. [Google Scholar] [CrossRef]
  25. Judd, P.; Albericio, J.; Hetherington, T.; Aamodt, T.M.; Moshovos, A. Stripes: Bit-serial deep neural network computing. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 15–19 October 2016; pp. 1–12. [Google Scholar]
  26. Albericio, J.; Delmás, A.; Judd, P.; Sharify, S.; O’Leary, G.; Genov, R.; Moshovos, A. Bit-Pragmatic Deep Neural Network Computing. In Proceedings of the 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Boston, MA, USA, 14–17 October 2017; pp. 382–394. [Google Scholar]
  27. Albericio, J.; Judd, P.; Hetherington, T.; Aamodt, T.; Jerger, N.E.; Moshovos, A. Cnvlutin: Ineffectual-neuron-free deep neural network computing. ACM SIGARCH Comput. Archit. News 2016, 44, 1–13. [Google Scholar] [CrossRef]
  28. Gao, M.; Pu, J.; Yang, X.; Horowitz, M.; Kozyrakis, C. Tetris: Scalable and efficient neural network acceleration with 3d memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, Xi’an, China, 8–12 April 2017; pp. 751–764. [Google Scholar]
  29. Judd, P.; Albericio, J.; Hetherington, T.; Aamodt, T.; Jerger, N.E.; Urtasun, R.; Moshovos, A. Proteus: Exploiting precision variability in deep neural networks. Parallel Comput. 2018, 73, 40–51. [Google Scholar] [CrossRef]
  30. Shin, S.; Boo, Y.; Sung, W. Fixed-point optimization of deep neural networks with adaptive step size retraining. In Proceedings of the 2017 IEEE International conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 1203–1207. [Google Scholar]
  31. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D. A domain-specific architecture for deep neural networks. Commun. ACM 2018, 61, 50–59. [Google Scholar] [CrossRef]
  32. Juracy, L.R.; Garibotti, R.; Moraes, F.G. From CNN to DNN Hardware Accelerators: A Survey on Design, Exploration, Simulation, and Frameworks. Found. Trends® Electron. Des. Autom. 2023, 13, 270–344. [Google Scholar] [CrossRef]
  33. Shomron, G.; Weiser, U. Spatial correlation and value prediction in convolutional neural networks. IEEE Comput. Archit. Lett. 2018, 18, 10–13. [Google Scholar] [CrossRef]
  34. Zhang, Q.; Wang, T.; Tian, Y.; Yuan, F.; Xu, Q. ApproxANN: An approximate computing framework for artificial neural network. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2015; pp. 701–706. [Google Scholar]
  35. Lee, J.; Kim, C.; Kang, S.; Shin, D.; Kim, S.; Yoo, H.J. UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision. IEEE J. Solid-State Circuits 2018, 54, 173–185.
  36. Hsu, L.C.; Chiu, C.T.; Lin, K.T.; Chou, H.H.; Pu, Y.Y. ESSA: An energy-aware bit-serial streaming deep convolutional neural network accelerator. J. Syst. Archit. 2020, 111, 101831.
  37. Isobe, S.; Tomioka, Y. Low-bit Quantized CNN Acceleration based on Bit-serial Dot Product Unit with Zero-bit Skip. In Proceedings of the 2020 Eighth International Symposium on Computing and Networking (CANDAR), Naha, Japan, 24–27 November 2020; pp. 141–145.
  38. Li, A.; Mo, H.; Zhu, W.; Li, Q.; Yin, S.; Wei, S.; Liu, L. BitCluster: Fine-Grained Weight Quantization for Load-Balanced Bit-Serial Neural Network Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 4747–4757.
  39. Shuvo, M.K.; Thompson, D.E.; Wang, H. MSB-First Distributed Arithmetic Circuit for Convolution Neural Network Computation. In Proceedings of the 2020 IEEE 63rd International Midwest Symposium on Circuits and Systems (MWSCAS), Springfield, MA, USA, 9–12 August 2020; pp. 399–402.
  40. Karadeniz, M.B.; Altun, M. TALIPOT: Energy-Efficient DNN Booster Employing Hybrid Bit Parallel-Serial Processing in MSB-First Fashion. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2021, 41, 2714–2727.
  41. Song, M.; Zhao, J.; Hu, Y.; Zhang, J.; Li, T. Prediction based execution on deep neural networks. In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, USA, 1–6 June 2018; pp. 752–763.
  42. Lin, Y.; Sakr, C.; Kim, Y.; Shanbhag, N. PredictiveNet: An energy-efficient convolutional neural network via zero prediction. In Proceedings of the 2017 IEEE International Symposium on Circuits and Systems (ISCAS), Baltimore, MD, USA, 28–31 May 2017; pp. 1–4.
  43. Asadikouhanjani, M.; Ko, S.B. A novel architecture for early detection of negative output features in deep neural network accelerators. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 3332–3336.
  44. Suresh, B.; Pillai, K.; Kalsi, G.S.; Abuhatzera, A.; Subramoney, S. Early Prediction of DNN Activation Using Hierarchical Computations. Mathematics 2021, 9, 3130.
  45. Pan, Y.; Yu, J.; Lukefahr, A.; Das, R.; Mahlke, S. BitSET: Bit-Serial Early Termination for Computation Reduction in Convolutional Neural Networks. ACM Trans. Embed. Comput. Syst. 2023, 22, 1–24.
  46. Ercegovac, M.D. On-Line Arithmetic: An Overview. In Proceedings of the Real-Time Signal Processing VII; Bromley, K., Ed.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 1984; Volume 0495, pp. 86–93.
  47. Usman, M.; Lee, J.A.; Ercegovac, M.D. Multiplier with reduced activities and minimized interconnect for inner product arrays. In Proceedings of the 2021 55th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 31 October–3 November 2021; pp. 1–5.
  48. Ibrahim, M.S.; Usman, M.; Nisar, M.Z.; Lee, J.A. DSLOT-NN: Digit-Serial Left-to-Right Neural Network Accelerator. In Proceedings of the 2023 26th Euromicro Conference on Digital System Design (DSD), Durres, Albania, 6–8 September 2023; pp. 686–692.
  49. Usman, M.; Ercegovac, M.D.; Lee, J.A. Low-Latency Online Multiplier with Reduced Activities and Minimized Interconnect for Inner Product Arrays. J. Signal Process. Syst. 2023, 95, 777–796.
  50. Ercegovac, M.D.; Lang, T. Digital Arithmetic; Elsevier: Amsterdam, The Netherlands, 2004.
  51. Xie, X.; Lin, J.; Wang, Z.; Wei, J. An efficient and flexible accelerator design for sparse convolutional neural networks. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 2936–2949.
  52. Wei, X.; Liang, Y.; Li, X.; Yu, C.H.; Zhang, P.; Cong, J. TGPA: Tile-grained pipeline architecture for low latency CNN inference. In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, CA, USA, 5–8 November 2018; pp. 1–8.
  53. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  54. Marcel, S.; Rodriguez, Y. Torchvision: The machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 29 October 2010; pp. 1485–1488.
  55. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170.
  56. Meloni, P.; Capotondi, A.; Deriu, G.; Brian, M.; Conti, F.; Rossi, D.; Raffo, L.; Benini, L. NEURAghe: Exploiting CPU-FPGA synergies for efficient and flexible CNN inference acceleration on Zynq SoCs. ACM Trans. Reconfigurable Technol. Syst. (TRETS) 2018, 11, 1–24.
  57. Li, G.; Liu, Z.; Li, F.; Cheng, J. Block convolution: Toward memory-efficient inference of large-scale CNNs on FPGA. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2021, 41, 1436–1447.
  58. Yu, Y.; Wu, C.; Zhao, T.; Wang, K.; He, L. OPU: An FPGA-based overlay processor for convolutional neural networks. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 28, 35–47.
  59. Zhang, C.; Sun, G.; Fang, Z.; Zhou, P.; Pan, P.; Cong, J. Caffeine: Toward Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 2072–2085.
Figure 1. A typical convolutional neural network: VGG-16. The convolution layers with a 3 × 3 kernel are shown in yellow, the maxpooling layer is represented in orange, and the fully connected layers are presented in purple.
Figure 2. The rectified linear unit (ReLU) activation function.
Figure 3. Timing characteristics of online operation with δ = 3 .
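As a point of reference for the timing sketched in Figure 3 (a rough relation based on standard online-arithmetic timing [46,50], not a result specific to this design): an online operator with online delay δ produces its j-th output digit once the first j + δ input digits are available, so a chain of m overlapped online operators producing p-digit results finishes in roughly

\[
T \;\approx\; p + \sum_{i=1}^{m} \delta_i \quad \text{cycles},
\]

instead of roughly m·p cycles when the same digit-serial stages run back to back without overlap.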
Figure 4. Basic components: (a) online serial–parallel multiplier [49], where x is the serial input and Y is the parallel input; (b) online adder [50].
Figure 5. Circuit block diagram for an example SOP computation.
Figure 6. Bit-level computation pattern of the SOP in the example Z = ( A × B + C × D ). Here, the outputs of OLM1 and OLM2 are X = A × B and Y = C × D, respectively.
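To make the early-termination idea behind this SOP example concrete, the following minimal Python sketch mimics the decision that the hardware makes: it consumes the result digits most significant first and stops as soon as the digits still to come can no longer make the ReLU input non-negative. The helper names (msdf_digits, relu_sop_early_exit), the greedy signed-digit conversion that stands in for the online multiplier/adder tree, and the exact termination bound are illustrative assumptions, not the paper's RTL.

```python
# A minimal software sketch (not the paper's RTL) of MSDF early negative
# detection for a sum of products Z = A*B + C*D feeding a ReLU. The
# signed-digit stream below stands in for the digits that the online
# multipliers (OLM1, OLM2) and the online adder would emit MSDF.

def msdf_digits(v, n):
    """Greedy radix-2 signed-digit (MSDF) expansion of v, assumed in [-1, 1]."""
    w = v
    for _ in range(n):
        w *= 2.0
        if w >= 0.5:
            d = 1
        elif w <= -0.5:
            d = -1
        else:
            d = 0
        w -= d
        yield d

def relu_sop_early_exit(a, b, c, d, n=16):
    """Return (ReLU(a*b + c*d), number of result digits actually consumed)."""
    z = (a * b + c * d) / 2.0        # scale into [-1, 1]: two products are summed
    partial, weight = 0.0, 0.5       # value of the digits seen so far; weight = 2**-j
    for j, dig in enumerate(msdf_digits(z, n), start=1):
        partial += dig * weight
        # The digits still to come can raise the value by at most `weight`,
        # so once partial + weight <= 0 the result cannot be positive.
        if partial + weight <= 0.0:
            return 0.0, j            # ReLU output is 0; terminate the computation early
        weight /= 2.0
    return max(2.0 * partial, 0.0), n

print(relu_sop_early_exit(0.5, -0.75, 0.25, -0.5))  # negative SOP: sign settled after 1 digit
print(relu_sop_early_exit(0.5,  0.75, 0.25,  0.5))  # positive SOP: all n digits are consumed
```

In the first call the sum of products is negative, so its sign is settled after the very first digit and the remaining digit slots are never computed; this is the behavior that the early negative detection scheme exploits to save energy on activations that ReLU would zero anyway.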
Figure 7. Processing engine architecture of ECHO. Each PE contains k × k multipliers, where each multiplier accepts a bit-serial (input feature) and a parallel (kernel pixel) input.
Figure 8. Architecture of ECHO for a convolution layer. Each column is equipped with N PEs to facilitate the input channels, while each column of PEs is followed by an online-arithmetic-based reduction tree for the generation of the final SOP. The central controller block generates the termination signals and also controls the dataflow to and from the weight buffers (WBs) and activation buffers (ABs).
Figure 9. Central controller and decision unit in ECHO.
Figure 10. Runtimes of the convolution layers for the proposed method and the baseline designs. The proposed design achieved mean runtime improvements of 58.16% and 61.6% over the conventional bit-serial design (Baseline-1) for the VGG-16 and ResNet-18 workloads, respectively.
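These percentages follow directly from the mean per-layer runtimes in Tables 2 and 3, up to rounding of the tabulated values:

\[
\frac{54.65 - 22.87}{54.65} \times 100 \approx 58.2\%, \qquad \frac{5.06 - 1.94}{5.06} \times 100 \approx 61.7\%.
\]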
Figure 11. Power consumption of the proposed design and the baseline designs. The proposed design achieved mean reductions in power consumption of 81% and 82.9% compared to the conventional bit-serial design (Baseline-1) for the VGG-16 and ResNet-18 workloads, respectively.
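Likewise, the power reductions can be cross-checked against the mean power columns of Tables 2 and 3 (again up to rounding):

\[
\frac{72.16 - 13.53}{72.16} \times 100 \approx 81.2\%, \qquad \frac{6.49 - 1.11}{6.49} \times 100 \approx 82.9\%.
\]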
Table 1. Convolution layer architecture of the VGG-16 and ResNet networks. M denotes the number of kernels (i.e., the number of output feature maps), and R × C denotes the dimensions of the output feature maps.

Network | Layer | Kernel Size | M | R × C
VGG-16 | C1–C2 | 3 × 3 | 64 | 224 × 224
VGG-16 | C3–C4 | 3 × 3 | 128 | 112 × 112
VGG-16 | C5–C7 | 3 × 3 | 256 | 56 × 56
VGG-16 | C8–C10 | 3 × 3 | 512 | 28 × 28
VGG-16 | C11–C13 | 3 × 3 | 512 | 14 × 14
ResNet-18 | C1 | 7 × 7 | 64 | 112 × 112
ResNet-18 | C2–C5 | 3 × 3 | 64 | 56 × 56
ResNet-18 | C6–C9 | 3 × 3 | 128 | 28 × 28
ResNet-18 | C10–C13 | 3 × 3 | 256 | 14 × 14
ResNet-18 | C14–C17 | 3 × 3 | 512 | 7 × 7
ResNet-50 | C1 | 7 × 7 | 64 | 112 × 112
ResNet-50 | C2-x | 1 × 1, 3 × 3, 1 × 1 | 64, 256 | 56 × 56
ResNet-50 | C3-x | 1 × 1, 3 × 3, 1 × 1 | 128, 512 | 28 × 28
ResNet-50 | C4-x | 1 × 1, 3 × 3, 1 × 1 | 256, 1024 | 14 × 14
ResNet-50 | C5-x | 1 × 1, 3 × 3, 1 × 1 | 512, 2048 | 7 × 7
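As a rough cross-check of the workload sizes implied by Table 1, the snippet below totals the convolution MACs of VGG-16 from the kernel size, kernel count M, and output resolution R × C. The input-channel counts are not part of the table; they are assumed here from the standard VGG-16 definition (3 for C1, otherwise the M of the preceding group).

```python
# Rough cross-check of the VGG-16 convolution workload implied by Table 1.
# Input-channel counts (cin) are assumed from the standard VGG-16 definition:
# 3 for C1, otherwise the M of the preceding layer/group.
layers = [  # (repeats, kernel k, cin, M, output size R = C)
    (1, 3,   3,  64, 224), (1, 3,  64,  64, 224),   # C1-C2
    (1, 3,  64, 128, 112), (1, 3, 128, 128, 112),   # C3-C4
    (1, 3, 128, 256,  56), (2, 3, 256, 256,  56),   # C5-C7
    (1, 3, 256, 512,  28), (2, 3, 512, 512,  28),   # C8-C10
    (3, 3, 512, 512,  14),                          # C11-C13
]
macs = sum(rep * k * k * cin * m * r * r for rep, k, cin, m, r in layers)
print(f"{macs / 1e9:.1f} GMAC per image")  # ~15.3 GMAC of convolution work
```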
Table 2. Comparison of per-layer inference time, power consumption, and speedup for the convolution layers of the VGG-16 network (B1 = Baseline-1, B2 = Baseline-2).

Layer | Time B1 (ms) | Time B2 (ms) | Time ECHO (ms) | Power B1 (W) | Power B2 (W) | Power ECHO (W) | Speedup B1 | Speedup B2 | Speedup ECHO
C1 | 35.12 | 22.8 | 12.29 | 4.55 | 1.44 | 0.81 | 1 | 1.54 | 2.86
C2 | 78.27 | 60.21 | 33.95 | 108.2 | 36.99 | 20.86 | 1 | 1.3 | 2.31
C3 | 39.13 | 30.10 | 16.99 | 54.10 | 18.5 | 10.44 | 1 | 1.3 | 2.3
C4 | 80.28 | 64.22 | 29.14 | 110.98 | 39.46 | 17.90 | 1 | 1.25 | 2.75
C5 | 40.14 | 32.11 | 19.98 | 55.49 | 19.73 | 12.27 | 1 | 1.25 | 2.01
C6 | 82.28 | 68.24 | 34.42 | 113.75 | 41.92 | 21.14 | 1 | 1.21 | 2.39
C7 | 82.28 | 68.24 | 38.51 | 113.75 | 41.92 | 23.66 | 1 | 1.21 | 2.14
C8 | 41.14 | 34.12 | 15.49 | 56.87 | 20.96 | 9.51 | 1 | 1.21 | 2.67
C9 | 84.29 | 72.25 | 46.86 | 116.53 | 44.39 | 28.79 | 1 | 1.16 | 1.79
C10 | 84.29 | 72.25 | 15.30 | 116.53 | 44.39 | 9.40 | 1 | 1.16 | 5.51
C11 | 21.07 | 18.06 | 15.86 | 29.13 | 11.09 | 9.74 | 1 | 1.16 | 1.33
C12 | 21.07 | 18.06 | 1.41 | 29.13 | 11.09 | 0.86 | 1 | 1.16 | 14.92
C13 | 21.07 | 18.06 | 17.04 | 29.13 | 11.09 | 10.47 | 1 | 1.16 | 1.24
Mean | 54.65 | 44.46 | 22.87 | 72.16 | 26.38 | 13.53 | 1 | 1.23 | 2.39
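The Speedup columns are simply the ratio of Baseline-1 inference time to ECHO (or Baseline-2) inference time; a quick check against a few rows of Table 2, with small differences attributable to rounding of the tabulated times:

```python
# Speedup = Baseline-1 inference time / ECHO inference time (values in ms, from Table 2).
for layer, t_baseline1, t_echo in [("C10", 84.29, 15.30), ("C12", 21.07, 1.41), ("Mean", 54.65, 22.87)]:
    print(layer, round(t_baseline1 / t_echo, 2))  # 5.51, 14.94, 2.39 (table: 5.51, 14.92, 2.39)
```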
Table 3. Comparison of per-layer inference time, power consumption, and speedup for the convolution layers of the ResNet-18 model (B1 = Baseline-1, B2 = Baseline-2).

Layer | Time B1 (ms) | Time B2 (ms) | Time ECHO (ms) | Power B1 (W) | Power B2 (W) | Power ECHO (W) | Speedup B1 | Speedup B2 | Speedup ECHO
C1 | 12.79 | 6.52 | 3.25 | 9.02 | 1.16 | 0.58 | 1 | 1.96 | 3.93
C2 | 4.89 | 3.76 | 1.9 | 6.76 | 2.31 | 1.16 | 1 | 1.3 | 2.57
C3 | 4.89 | 3.76 | 1.88 | 6.76 | 2.31 | 1.15 | 1 | 1.3 | 2.6
C4 | 4.89 | 3.76 | 1.91 | 6.76 | 2.31 | 1.17 | 1 | 1.3 | 2.56
C5 | 4.89 | 3.76 | 1.87 | 6.76 | 2.31 | 1.15 | 1 | 1.3 | 2.61
C6 | 2.44 | 1.88 | 0.96 | 3.38 | 1.15 | 0.59 | 1 | 1.29 | 2.54
C7 | 5.01 | 4.01 | 1.98 | 6.93 | 2.46 | 1.21 | 1 | 1.24 | 2.53
C8 | 5.01 | 4.01 | 2.05 | 6.93 | 2.46 | 1.26 | 1 | 1.24 | 2.44
C9 | 5.01 | 4.01 | 2.04 | 6.93 | 2.46 | 1.25 | 1 | 1.24 | 2.45
C10 | 2.5 | 2.04 | 1.06 | 3.46 | 1.25 | 0.65 | 1 | 1.22 | 2.35
C11 | 5.14 | 4.26 | 2.1 | 7.1 | 2.62 | 1.29 | 1 | 1.2 | 2.44
C12 | 5.14 | 4.26 | 2.14 | 7.1 | 2.62 | 1.31 | 1 | 1.2 | 2.4
C13 | 5.14 | 4.26 | 2.07 | 7.1 | 2.62 | 1.27 | 1 | 1.2 | 2.48
C14 | 2.57 | 2.26 | 1.08 | 3.55 | 1.39 | 0.66 | 1 | 1.13 | 2.37
C15 | 5.26 | 4.6 | 2.47 | 7.28 | 2.83 | 1.52 | 1 | 1.14 | 2.12
C16 | 5.26 | 4.6 | 1.92 | 7.28 | 2.83 | 1.18 | 1 | 1.14 | 2.73
C17 | 5.26 | 4.6 | 2.35 | 7.28 | 2.83 | 1.44 | 1 | 1.14 | 2.23
Mean | 5.06 | 3.9 | 1.94 | 6.49 | 2.23 | 1.11 | 1 | 1.29 | 2.6
Table 4. Comparison of the FPGA implementation of the proposed design with the conventional bit-serial design (Baseline-1) and the online-arithmetic design without the early negative detection capability (Baseline-2). The FPGA device used for this experiment is the Xilinx Virtex UltraScale+ VU3P. Performance and latency per image are reported in GOPS and ms, respectively.

Frequency (MHz): 100 for all designs and workloads.
Logic Utilization: 238 K (27.6%) for Baseline-1; 315 K (36.54%) for Baseline-2 and ECHO.
BRAM Utilization: 83 (11.4%) for Baseline-1; 84 (11.54%) for Baseline-2 and ECHO.

Model | Design | Mean Power (W) | Performance (GOPS) | Latency per Image (ms) | Average Speedup (×)
VGG-16 | Baseline-1 | 72.16 | 47.3 | 710.5 | 1
VGG-16 | Baseline-2 | 26.38 | 61.4 | 578.03 | 1.23
VGG-16 | ECHO | 13.53 | 655 | 297.3 | 2.39
ResNet-18 | Baseline-1 | 6.49 | 47.3 | 86.2 | 1
ResNet-18 | Baseline-2 | 2.23 | 61.4 | 66.4 | 1.29
ResNet-18 | ECHO | 1.11 | 123 | 33.1 | 2.6
ResNet-50 | Baseline-1 | 9.63 | 39.01 | 366.7 | 1
ResNet-50 | Baseline-2 | 11.5 | 61.4 | 464.01 | 1.21
ResNet-50 | ECHO | 5.72 | 128.7 | 231.03 | 2.42
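For the VGG-16 workload, the energy-efficiency figure quoted for ECHO in Table 5 can be recovered directly from Table 4's performance and mean power:

\[
\frac{655\ \text{GOPS}}{13.53\ \text{W}} \approx 48.4\ \text{GOPS/W}.
\]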
Table 5. Comparison with previous works.

Model | Design | Device | Frequency (MHz) | Logic Utilization | BRAM Utilization | Performance (GOPS) | Energy Efficiency (GOPS/W)
VGG-16 | NEURAghe [56] | Zynq Z7045 | 140 | 100 K | 320 | 169 | 16.9
VGG-16 | [57] | Zynq ZC706 | 150 | - | 1090 | 374.98 | -
VGG-16 | OPU [58] | Zynq XC7Z100 | 200 | - | 1510 | 397 | 24.06
VGG-16 | Caffeine [59] | VX690t | 150 | - | 2940 | 636 | -
VGG-16 | ECHO | VU3P | 100 | 315 K | 84 | 655 | 48.41
ResNet-18 | NEURAghe [56] | Zynq Z7045 | 140 | 100 K | 320 | 58 | 5.8
ResNet-18 | [51] | Arria10 SX660 | 170 | 102.6 K | 465 | 89.29 | 19.41
ResNet-18 | ECHO | VU3P | 100 | 315 K | 84 | 123 | 21.58
ResNet-50 | [51] | Arria10 SX660 | 170 | 102.6 K | 465 | 90.19 | 19.61
ResNet-50 | ECHO | VU3P | 100 | 315 K | 84 | 128.7 | 22.51