Carry-Propagation-Adder-Factored Gemmini Systolic Array for Machine Learning Acceleration

Systolic arrays are a core component of modern deep learning accelerators and are widely used in real-life applications such as self-driving cars. This paper presents a novel factored systolic array, in which the carry propagation adder for accumulation and the rounding logic are extracted from each processing element, reducing the area, power and delay of the processing elements substantially. The factoring is performed in a column-wise manner, and the cost of the factored logic, placed at each column output, is amortized over the processing elements in a column. We demonstrate the proposed factoring in an open source systolic array, Gemmini. The factoring technique does not change the functionality of the base design and is transparent to applications. We show that the proposed technique leads to substantial reductions in area and delay of up to 45.3% and 23.7%, respectively, compared to the Gemmini baseline.


Introduction
Recently, machine learning (ML) algorithms have attracted considerable attention after deep learning (DL) demonstrated breakthroughs in various complex tasks such as the ImageNet challenge. The powerful ability of DL to solve complex tasks is not limited to image recognition but also extends to object detection, speech recognition, natural language processing, etc. [1][2][3]. However, deep learning models require massive amounts of computation and large memory footprints, and recent research has focused on DL accelerators [4]. Matrix multiplication is the key primitive in the computation of ML models, and systolic arrays (SAs) for matrix multiplication have been adopted widely [5,6]. Systolic arrays, proposed in 1979, are two-dimensional meshes of processing elements (PEs) organized in the form of a grid [7,8]. Due to their data reusability, concurrency and simple architectural characteristics, many industry giants such as Google [9], Nvidia [10], Intel [11] and Samsung [12] have utilized systolic arrays for general matrix multiplication (GEMM). With the increasing interest in accelerators, many studies have been proposed using systolic arrays [9][10][11][12][13][14][15][16], but to the best of our knowledge, all of them focus on dataflows to increase memory bandwidth efficiency, maximize data reuse, etc.; none of them deal with the logic-level design of the systolic arrays. In this paper, we present a novel factored systolic array and demonstrate it using an open-source (https://github.com/ucb-bar/gemmini) systolic array, Gemmini (the Gemmini system on chip (SoC) RTL can be generated by following the lab EE-290-2, Hardware for Machine Learning, Lab-2) [17]. The main contributions of this paper are outlined below: • We present a novel factored systolic array, referred to as the carry-propagate-adder (CPA)-factored systolic array. 
• Using the practical systolic array baseline, we demonstrate that significant improvements in key design metrics are possible without modifying the functionality of the systolic array.
The rest of this paper is organized as follows. Related work is given in Section 2. We present the proposed design and the baseline in Section 3. In Section 4, we detail the evaluation analysis. Section 5 gives the discussion and Section 6 concludes the paper.

Fixed ML Accelerator Designs
Due to the incredible amount of interest in machine learning accelerators, the architecture community has focused on designing efficient dataflows to maximize operand reuse and minimize unnecessary data transfer in [22,23] for Convolutional Neural Networks (CNNs). In [22], the authors implemented an index-based Sparse CNN (SCNN) accelerator architecture to improve energy efficiency. However, index-based approaches have significant overhead costs for storing and computing on the indexes. In [23], Liu et al. introduced the density bound block (DBB), which bounds the number of non-zero elements in each block to deal with sparse data; the sparsity is fixed at design time in their scheme. Unfortunately, with fixed sparsity, any model that does not achieve or exceed this threshold must fall back to dense operation with no benefit.

Flexible ML Accelerator Designs
To support a variety of workloads, flexible mapping through support for multiple dataflows has been proposed in ML accelerators [24][25][26]. Among these studies, refs. [24,25] are flexible accelerators natively designed for convolution to support data reusability. In contrast, ref. [26] introduces FlexSA, a flexible systolic array architecture for GEMM operations, which dynamically re-configures the systolic structure. Indeed, flexibility is beneficial for pruned or sparse CNN accelerators, but it increases implementation cost due to increased data traffic in the accelerator and extra control logic. This is acceptable for small convolution/matrix computations but severely increases the cost for large GEMMs.

Logic Level ML Accelerator Designs
Since the processing element and the systolic array are the main components of ML accelerators, some recent works have proposed re-architecting these components at the logic level [27,28]. In [27], the Tetris accelerator was proposed, which deals not only with sparsity but also with zero bits in non-zero values through a split-and-accumulate (SAC) unit in the PEs to increase accelerator efficiency. Tetris is good for small matrix tiles but does not have enough computation power to work on larger networks without multiple costly passes, due to increasingly complex control logic. In [28], Ullah et al. proposed a factored radix-8 systolic array, in which differently sized SAs were implemented, extracting the radix-8 multiplier Booth encoding and the hard multiple (3Y) computation of the multiplicand Y as pre-processing at the input of the systolic array. It demonstrated substantial improvements for 16 bit or higher systolic arrays, but showed smaller improvements for 8 bit or lower SAs, which are typically used for inference acceleration in edge devices.
In the previous work discussed above, much of the focus in ML accelerator design has been on optimizing core dataflows to improve local reuse of data, reduce expensive data movement between processing elements, increase memory bandwidth efficiency, etc. Critically, the logic-level design of datapath components needs more attention.
We target the systolic array accelerator at the logic level, without adding control logic complexity, and propose a novel CPA-factored Gemmini systolic array that provides the same functionality as the conventional systolic array while achieving significant improvements in area and delay.

CPA-Factored Systolic Array
We consider a 2-dimensional (2-D) systolic array where PEs are organized in the form of a mesh grid. Each PE receives two inputs A and B and performs a multiply-accumulate (MAC) operation on every clock cycle. Let X = A × B. Also, let C_FB denote the partial sum stored in the accumulator register. The multiplier in a PE is usually treated as a black-box primitive.
However, we consider the logic-level design of the multiplier here. First, the multiplier performs partial product (PP) generation and partial product reduction in the reduction tree. Then, it performs the final addition using a carry propagation adder (CPA), and the multiplier output X is added to C_FB using another CPA. To avoid accuracy degradation, systolic arrays for machine learning usually deploy a CPA with a high bit-width (e.g., 32 bit instead of 8 bit or 16 bit), which causes significant delay and area overhead in the MAC computation. In systolic arrays, we can replace these two CPAs with two Carry Save Adders (CSAs) for the accumulation and place a single CPA at each column output of the array for the final addition.
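To make the PP generation step concrete, the following Python sketch is a bit-level functional model (ours, not the RTL of [31]) of radix-4 (modified Booth) partial product generation, the recoding style used by the radix-4 multiplier adopted later in this paper:

```python
def booth_radix4_pps(a, b, width=8):
    """Radix-4 (modified Booth) partial product generation: scan the
    two's-complement multiplier b two bits at a time (with a one-bit
    overlap) and select each partial product from {0, +-a, +-2a}.
    This halves the number of partial products versus radix-2."""
    digit_of = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    ext = (b & ((1 << width) - 1)) << 1      # append implicit 0 below the LSB
    pps = []
    for i in range(0, width, 2):             # width/2 recoded digits
        digit = digit_of[(ext >> i) & 0b111]
        pps.append((digit * a) << i)         # partial product, pre-shifted
    return pps
```

The reduction tree then compresses these partial products; their sum equals a × b for any signed 8 bit operands.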
A CSA is a 3:2 compressor: it reduces three inputs to two without propagating carries; instead, it keeps the carry (shift carry) in the next significant bit position alongside the partial sum (also known as the pseudo sum). Together, these two values form a redundant binary representation [29,30]. From a delay and area perspective, the delay of a CSA is constant with respect to the word-length and its area is linear. Thus, the PE delay, which is often the critical path delay of the whole SA, and the PE area can be reduced significantly. In addition, the area cost of the factored CPA is amortized over the PEs in a column and becomes marginal as the size of the array increases. We refer to this structure as the CPA-factored systolic array. However, this factoring doubles the sequential area in each PE because we now need to store two values (the sum and carry vectors) in two accumulator registers (Acc_s, Acc_c) instead of one, as shown in Figure 1b. Moreover, every PE propagates these two values downwards, so the pipeline register cost for output migration in the systolic array is also doubled. However, this increase in sequential cost can be compensated by simplifying the logic in the PEs.
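A bit-level Python sketch of this carry-save accumulation (illustrative only, assuming 32 bit modular arithmetic; the names are ours, not Gemmini's) shows why no carry chain is needed until the single factored CPA:

```python
MASK32 = (1 << 32) - 1

def csa_3to2(a, b, c):
    """3:2 compressor: reduce three addends to a (sum, carry) pair.
    Each output bit depends on only three input bits, so the delay is
    constant in the word-length -- no carry ripples across positions."""
    s = (a ^ b ^ c) & MASK32                            # bitwise sum
    cy = (((a & b) | (b & c) | (a & c)) << 1) & MASK32  # carries, shifted
    return s, cy

def accumulate_carry_save(products):
    """Fold each product (itself a (sum, carry) pair out of the
    multiplier's reduction tree) into the accumulator with two CSAs
    per cycle. The single '+' at the end models the factored-out CPA."""
    acc_s = acc_c = 0
    for xs, xc in products:
        s1, c1 = csa_3to2(acc_s, acc_c, xs & MASK32)
        acc_s, acc_c = csa_3to2(s1, c1, xc & MASK32)
    return (acc_s + acc_c) & MASK32                     # factored CPA
```

The invariant is that acc_s + acc_c always equals the running sum modulo 2^32, so resolving the redundant pair once per column reproduces the conventional accumulator result exactly.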

Gemmini Systolic Array Architecture
Gemmini [17] is an open source generator of systolic array accelerators that supports multiple dataflows for application-specific integrated circuit (ASIC) and field programmable gate array (FPGA) implementation. We considered a Gemmini systolic array with output stationary dataflow, as shown in Figure 2, that performs all computation on 8 bit signed inputs. This Gemmini SA consists of a set of PEs interconnected as a 2-D array. Pipeline registers are placed at the inputs and outputs of the PEs such that all PEs communicate only with adjacent PEs, giving nominal data migration and high computational parallelism in a wave-front flow. Input matrices A, B and D are provided at the left edge and the top edge of the systolic array to perform the GEMM through the PEs, as represented by the equation C = A × B + D, where A and B are the multiplied matrices, C is the result and D is the accumulator preload (bias matrix) in the output stationary dataflow. This architecture is a practical version of the previously discussed systolic array shown in Figure 1a. The Gemmini systolic array MAC contains an 8 bit signed multiplier for multiplication and a 32 bit adder (CPA) for accumulation. Accumulation is performed with 32 bits to avoid accuracy degradation in machine learning. Some components (the double buffer and the peripheral logic (PL)) differ from Figure 1a.
Figure 2. Gemmini systolic array architecture with output stationary dataflow.
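Functionally, the output stationary dataflow computes the following (a minimal Python model that ignores the cycle-by-cycle skewing of operands; the function name is ours, not a Gemmini API):

```python
def gemm_output_stationary(A, B, D):
    """Model of C = A x B + D under the output stationary dataflow:
    each PE (i, j) is preloaded with the bias D[i][j], accumulates one
    A[i][k] * B[k][j] product per step, and holds its C[i][j] in place
    until the whole GEMM completes."""
    n, depth, m = len(A), len(B), len(B[0])
    C = [row[:] for row in D]            # preload accumulators with D
    for k in range(depth):               # one wave of operands per cycle
        for i in range(n):
            for j in range(m):           # every PE fires in parallel in HW
                C[i][j] += A[i][k] * B[k][j]
    return C
```

In hardware, the two inner loops run concurrently across the PE grid; only the loop over k corresponds to elapsed cycles.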
To reduce the critical path delay, the Gemmini PE has two accumulation registers and two dedicated datapaths, one for each register. Thus, each register has a separate accumulator (CPA) and peripheral logic for accumulation, as shown in Figure 2. To select the inputs and outputs for computation and propagation, the Gemmini systolic array double buffer has two multiplexers (2:1 MUXes) at the input and one wide 2:1 multiplexer (MUX) at the output; these multiplexers are driven by a one-bit propagation (PROP) select line.

CPA-Factored Gemmini Systolic Array Architecture
The architecture of the proposed CPA-factored Gemmini systolic array with output stationary dataflow, shown in Figure 3, also performs all computation on 8 bit signed inputs. In this architecture, the MAC utilizes a radix-4 signed multiplier (we adopted the multiplier design from [31]); we represent the partial sum in a PE in carry save representation and use two carry save adders for accumulation. Thus, we can remove all the CPAs from the Gemmini PE. As mentioned before, the CSAs produce a redundant binary representation consisting of two values (partial sum and shift carry); therefore, in the proposed Gemmini systolic array architecture, the double buffer has four registers to accommodate four values (two for computation and two for propagation). To obtain the final value, we place a CPA column-wise at the output of the array to add the partial sum and shift carry values. The PL circuitry rounds the final dot-product from a high bit-width down to a lower bit-width.
However, because we factor the final addition out of the PEs column-wise, it is not reasonable to keep this PL circuitry inside the PE. Thus, we factor the PL circuitry out of all PEs and place it next to the CPA in each column, at the small cost of an additional 7 bit register for the control line CTL (five SHIFT bits, one ENABLE and one PROP) in each PE and a 7 bit increase in the pipeline registers of the systolic array. However, this overall increase in sequential cost is further offset by removing redundant hardware, as explained in the following section.
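Putting the factored column output together, the stage can be sketched in Python as follows. The rounding mode shown (round-half-up with signed saturation) is an assumption for illustration; the exact rounding behavior of Gemmini's PL may differ:

```python
def column_output(acc_s, acc_c, shift, acc_wl=32, out_wl=8):
    """Factored column-output stage: one CPA resolves the PE's redundant
    (sum, carry) pair, then the peripheral logic scales the 32 bit result
    down to 8 bits using the SHIFT amount carried by the forwarded CTL
    bits. Rounding mode here is illustrative (round-half-up, saturating)."""
    mask = (1 << acc_wl) - 1
    total = (acc_s + acc_c) & mask          # the single factored CPA
    if total >= 1 << (acc_wl - 1):          # reinterpret as signed
        total -= 1 << acc_wl
    if shift > 0:
        total = (total + (1 << (shift - 1))) >> shift   # rounding shift
    hi, lo = (1 << (out_wl - 1)) - 1, -(1 << (out_wl - 1))
    return max(lo, min(hi, total))          # saturate to the output width
```

Because this logic appears once per column rather than once per PE, its cost is amortized over every PE in that column.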

Double Buffer Complexities in Systolic Arrays
As mentioned earlier, the Gemmini SA double buffer has two datapaths, primarily to remove the multiplexer delay from the critical path of the accumulator; however, this also creates redundancy in the hardware, and this redundant hardware is replicated across all PEs in the systolic array, increasing the overall area of the systolic array significantly. In contrast, the proposed CPA-factored Gemmini systolic array exploits a double buffer with two dedicated outputs (one for accumulator computation and one for output propagation) and a common accumulator (CSAs) inside the PE. This helps to remove the redundant adder and to factor out a common PL circuitry. In this way, the CPA-factored Gemmini systolic array offsets the aforementioned growth of sequential area in the total area. The proposed SA double buffer has two multiplexers at the input and two multiplexers at the output, adding a multiplexer delay to the MAC critical path, but our MAC design can accommodate this since we have already eliminated the CPA delay from the MAC critical path.

Evaluation Setup and Baseline
In this work, our baseline is the Gemmini-generated 8 × 8 output stationary SA architecture with a 32 bit accumulator word-length (acc. WL). We compared the proposed PE and SA designs with the corresponding Gemmini designs, and also analyzed them against the Gemmini PE and SA designs at different accumulator word-lengths (16, 32 and 64 acc. WL). All the designs were implemented in Verilog and verified using Synopsys VCS. For verification, we built the test binaries using the bare-metal software tests given in the Gemmini open source repository and checked the correctness of both designs in a bare-metal environment. Moreover, an industrial 32 nm standard cell library was used to map the designs, which were synthesized with the Synopsys Design Compiler. For the power measurements, PrimePower was used. First, a switching activity interchange format (SAIF) file was generated by post-processing the gate-level simulation using random input vectors in VCS; then, power dissipation was acquired by annotating the SAIF file onto the netlist. All experiments were performed on a Linux machine. The comparison of PE area, delay, power, power delay product (PDP) and area delay product (ADP) is shown in Table 1. Even though the redundant binary representation and the forwarded control signals add register cost to the proposed PE compared to the baseline, the total area of the baseline Gemmini systolic array PE is still higher than that of the proposed PE. Similarly, since the proposed CPA-factored Gemmini SA uses two CSAs instead of two CPAs and removes the CPA completely from the PE, the total delay is also improved on average. However, since the sequential area more than doubles in the proposed design due to the redundant representation and the control signals, the power degrades slightly compared to the Gemmini designs. 
Overall, the proposed PE shows improvement in all the key metrics over the baseline except power; the metrics have been normalized with respect to the baseline, as shown in Figure 4 (left). It can be seen that the area and delay of the proposed PE are 44.1% and 16.9% less than those of the Gemmini systolic array PE, respectively. The PDP and ADP are also improved by 6.1% (the improvement in PDP is not significant due to the power degradation) and 53.5% compared to the Gemmini PE, respectively.

Systolic Arrays
The comparison of the CPA-factored Gemmini SA with the baseline Gemmini SA in terms of area, delay, power, PDP and ADP is shown in Table 2. The additional bits stored in each PE incur a cost across the systolic array as well, because pipeline registers at the input and output of each PE keep the data migration in a wave-front flow. However, at the same time, the factoring of the CPAs and peripheral circuitry, together with the modified double buffer, not only balances out the aforementioned sequential area cost by removing combinational area, but also significantly reduces the total area of the proposed systolic array compared to the baseline Gemmini systolic array. The area and delay of the CPA-factored Gemmini SA are 33.3% and 14.9% less than those of the Gemmini SA, respectively (the proposed SA's metrics, normalized to those of the corresponding baseline, are shown in Figure 4 (right) for comparison). Moreover, the PDP and ADP of the proposed SA are 4.7% and 43.1% better than those of the baseline, respectively.

Comparison of Different Acc. WL Systolic Arrays
We also compared the proposed PE and SA designs with different acc. WLs according to all key metrics (area, delay, etc.) against the Gemmini designs in Table 3. In the PEs, due to the proposed CPA factoring, the combinational area improvement increases as the acc. WL increases, but the delay improvement remains stable because of the CSAs (there is no carry propagation in the PE design, so this improvement is independent of the accumulator length). For the SAs, Figure 5 breaks down the combinational and sequential area for a detailed comparison with the delay. It can be seen that the reductions in SA combinational area and delay (26.9-68% and 14.7-23.7%, respectively) increase significantly as the acc. WL increases, while the corresponding degradation in sequential area is not significant. Thus, the total area of the proposed SAs is also reduced (by 7.1-45.3%) as the acc. WL increases. Moreover, the PDP and ADP of the proposed SAs are up to 31.9% and 58.2% better than those of the baseline, respectively.

Discussion
The CPA-factored Gemmini SA focuses the whole systolic array design, at the logic level, on improving the overall area and delay of machine learning accelerators, which is very different from the direction taken in many previous state-of-the-art works.
Broadly speaking, the architecture community has focused on exploring possible dataflows by exploiting model sparsity and model pruning, etc., and has proposed fixed or flexible machine learning accelerators. All of these works are highly relevant to this field but have some limitations. First, most of these ML accelerators target convolutional neural networks (CNNs), and few mention other neural networks (NNs). Unfortunately, according to Google, CNN utilization in data center NNs is barely 5%; CNNs are mostly utilized in edge devices [9]. Second, both fixed and flexible ML accelerators work well for small-scale convolution/matrix computation, but for large GEMM computation, design complexities increase, such as threshold matching, huge data communication and additional control logic.
For large GEMMs, Google introduced the first tensor processing unit (TPUv1) in [9], which used an 8-bit integer systolic array to accelerate inference and to replace general purpose computing units such as GPUs/CPUs in data centers. In TPUv1, the main architectural feature was the systolic array, which reduces the area and power of the large matrix multiplication unit. However, the MAC of TPUv1 (and of many other ML accelerators, including Gemmini) utilizes conventional components (e.g., multipliers and adders). Recently, ref. [28] proposed the factored systolic array (FSA) using the radix-8 multiplier, in which the authors adopted a non-conventional approach by considering the systolic array and multiplier together and factored out the Booth encoding and hard multiple (3Y) as a pre-processing unit. However, it is worth noting that the delay and area complexity of a systolic array (with either a radix-4 or radix-8 multiplier) mainly lies in the MAC accumulation feedback. The carry propagate adder is the main bottleneck in this feedback, as it adds a large delay due to carry propagation, and this delay increases as the accumulator word-length increases.
Therefore, in this paper we suggested considering the systolic array and its computing components, the multipliers and adders, all together (unlike [28], which considered only the SA and multiplier). We demonstrated the idea of performing accumulation inside the PE using CSAs in redundant binary form, factoring the CPA out of the PE and placing it column-wise in the SA for the final addition that produces the final output. We have seen that this simple technique leads to a substantial reduction in area and delay.
Since the proposed design achieves significant improvements in area and delay, it can be useful in data centers or the cloud, for either training or inference, as area and delay constraints apply to accelerators in both cases.

Conclusions
This work presented a novel systolic array based on CPA and rounding logic factoring, and demonstrated that the proposed CPA-factored SA can substantially improve area and delay. This factoring comes at the cost of an increased number of registers in the processing elements and systolic arrays, which causes power degradation. However, the growth in sequential area is compensated for by reducing the double buffer complexity and removing redundant hardware. Compared with the baseline Gemmini SA design, the CPA-factored Gemmini SA achieved significant improvements in area and delay. Moreover, we have also shown that, for high-precision cases where the acc. WL increases, the reduction in area and delay also increases substantially compared to the baseline.
Consequently, this paper provides substantial evidence of the critical importance of reconsidering the design of arithmetic components for machine learning accelerators. For future research, we believe more exploration is required along this path to enable such designs to work on low-powered edge devices, which may be our future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: