Hardware Implementation of a Softmax-Like Function for Deep Learning

Abstract: In this paper, a simplified hardware implementation of a CNN softmax-like layer is proposed. Initially, the softmax activation function is analyzed in terms of the required numerical accuracy, and certain optimizations are proposed. The proposed adaptable hardware architecture is evaluated in terms of the error introduced by the softmax-like function. The architecture can be adapted to the accuracy required by the application by retaining or eliminating certain terms of the approximation, thus allowing accuracy-complexity trade-offs to be explored. Furthermore, the proposed circuits are synthesized in a 90 nm 1.0 V CMOS standard-cell library using Synopsys Design Compiler. Comparisons reveal that, for certain cases, significant reductions are achieved in the area × delay and power × delay products over prior art. Area and power savings are achieved for the targeted performance and accuracy.


Introduction
Deep neural networks (DNNs) have emerged as a means to tackle complex problems such as image classification and speech recognition. The success of DNNs is attributed to the availability of big data, the easy access to enormous computational power, and the introduction of novel algorithms that have substantially improved the effectiveness of training and inference [1]. A DNN is defined as a neural network (NN) which contains more than one hidden layer. In the literature, a graph is used to represent a DNN, with a set of nodes in each layer, as shown in Figure 1. The nodes at each layer are connected to the nodes of the subsequent layer. Each node performs processing, including the computation of an activation function [2]. The extremely large number of nodes at each layer makes the training procedure require extensive computational resources.
A class of DNNs are the convolutional neural networks (CNNs) [2]. CNNs offer high accuracy in computer-vision problems such as face recognition and video processing [3] and have been adopted in many modern applications. A typical CNN consists of several layers, each of which can be convolutional, pooling, or normalization, with the last one being a non-linear activation function. A common choice for the normalization layer is the softmax function, as shown in Figure 1. To cope with the increased computational load, several FPGA accelerators have been proposed and have demonstrated that convolutional layers exhibit the largest hardware complexity in a CNN [4][5][6][7][8][9][10][11][12][13][14][15]. In addition to CNNs, hardware accelerators for RNNs and LSTMs have also been investigated [16][17][18]. In order to implement a CNN in hardware, the softmax layer should also be implemented with low complexity. Furthermore, the hidden layers of a DNN can use the softmax function when the model is designed to choose one among several different options for some internal variable [2]. In particular, neural Turing machines (NTMs) [19] and the differentiable neural computer (DNC) [20] use softmax layers within the neural network. Moreover, softmax is incorporated in attention mechanisms, an application of which is machine translation [21]. Furthermore, both hardware [22][23][24][25][26][27][28] and memory-optimized software [29,30] implementations of the softmax function have been proposed. This paper, extending previous work published in MOCAST 2019 [31], proposes a simplified architecture for a softmax-like function, the hardware implementation of which is based on a proposed approximation that exploits the statistical structure of the vectors processed by the softmax layers in various CNNs. Compared to the previous work [31], this paper uses a large set of known CNNs and performs extensive and fair experiments to study the impact of the applied optimizations in terms of the achieved accuracy. Moreover, the architecture in [31] is further elaborated and generalized by taking into account the requirements of the targeted application. Finally, the proposed architecture is compared with various softmax hardware implementations. In order for the softmax-like function to be implemented efficiently in hardware, the approximation requirements are relaxed.
The remainder of the paper is organized as follows. Section 2 revisits the softmax activation function. Section 3 describes the proposed algorithm and Section 4 offers a quantitative analysis of the proposed architecture. Section 5 discusses the hardware complexity of the proposed scheme based on synthesis results. Finally, conclusions are summarized in Section 6.

Softmax Layer Review
CNNs consist of a number of stages, each of which contains several layers. The final layer is usually fully connected, uses ReLU as an activation function, and drives a softmax layer before the final output of the CNN. The classification performed by the CNN is accomplished at the final layer of the network. In particular, for a CNN which consists of i + 1 layers, the softmax function is used to transform the real values generated by the ith CNN layer into probabilities, according to

$$f_j(\mathbf{z}) = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}}, \qquad (1)$$

where z is an arbitrary vector with real components z_j, j = 1, ..., n, generated at the ith layer of the CNN, and n is the size of the vector. The (i + 1)st layer is called the softmax layer. By applying the logarithm function to both sides of (1), it follows that

$$\log f_j(\mathbf{z}) = z_j - \log \sum_{k=1}^{n} e^{z_k}. \qquad (4)$$

In (4), the term $\log \sum_{k=1}^{n} e^{z_k}$ is computed as

$$\log \sum_{k=1}^{n} e^{z_k} = \log\left(\frac{e^m}{e^m} \sum_{k=1}^{n} e^{z_k}\right) = \log\left(e^m \sum_{k=1}^{n} \frac{1}{e^m}\, e^{z_k}\right) = \log e^m + \log \sum_{k=1}^{n} e^{z_k - m} = m + \log \sum_{k=1}^{n} e^{z_k - m}, \qquad (5)\text{--}(8)$$

where $m = \max_k(z_k)$. From (4) and (8), it follows that

$$f_j(\mathbf{z}) = e^{\,z_j - m - \log \sum_{k=1}^{n} e^{z_k - m}}. \qquad (11)$$

Due to the definition of m, it holds that $0 < e^{z_k - m} \le 1$; hence, the quantity $Q = \sum_{k=1,\,k \ne q}^{n} e^{z_k - m}$, where $q = \operatorname{argmax}_k z_k$, satisfies $0 < Q \le n - 1$. Let

$$Q' = \frac{Q}{n-1}, \qquad (15)$$

so that $0 < Q' \le 1$. Expressing Q in terms of Q', (11) becomes

$$f_j(\mathbf{z}) = e^{\,z_j - m - \log\left((n-1)Q' + 1\right)}. \qquad (16)$$

The next section presents the proposed simplifications of (16) and the resulting architecture for the softmax-like hardware implementation.
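As a quick numerical check of the derivation above (NumPy assumed; function names are illustrative), the max-shifted form (11) can be verified against the direct softmax of (1):

```python
import numpy as np

def softmax(z):
    """Reference softmax, Equation (1)."""
    e = np.exp(z)
    return e / e.sum()

def softmax_maxshift(z):
    """Equivalent form of (11): subtract m = max(z) before exponentiation."""
    m = z.max()
    return np.exp(z - m - np.log(np.exp(z - m).sum()))

z = np.array([3.0, 6.0, 4.0, 2.0, 0.5])
print(np.allclose(softmax(z), softmax_maxshift(z)))  # True: (1) and (11) agree
```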

Proposed Softmax Architecture
Equation (15) involves the distance of the maximum component from the remainder of the components of a vector. As Q' ≈ 0, the differences between the z_i's increase and z_j ≪ m for the non-maximum components. On the contrary, as Q' → 1, the differences between the z_i's vanish. Based on this observation, a simplifying approximation can be obtained as follows. The third term in the exponent on the right-hand side of (16), log((n − 1)Q' + 1), can be roughly approximated by 0. Hence, using the differences

$$x_j = z_j - m \le 0, \qquad (17)$$

(16) is approximated by

$$\tilde{f}_j(\mathbf{z}) = e^{z_j - m}. \qquad (18)$$

Furthermore, this simplification substantially reduces the hardware complexity, as described below.
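A minimal sketch of the approximation (18), with an illustrative test vector, shows that it preserves the position of the maximum while its outputs no longer sum to one:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # numerically stable reference softmax, (1)
    return e / e.sum()

def softmax_like(z):
    """Proposed approximation (18): the log((n-1)Q' + 1) term is dropped."""
    return np.exp(z - z.max())  # outputs lie in (0, 1]; the maximum maps to 1

z = np.array([3.0, 6.0, 4.0, 2.0, 0.5])
f, f_tilde = softmax(z), softmax_like(z)
print(f.argmax() == f_tilde.argmax())  # True: the decision is preserved
print(f_tilde.sum())                   # > 1: the approximation is not a pdf
```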
Theorem 1. The softmax function f_j(z) of (1) and the proposed softmax-like function f̃_j(z) of (18) always yield the same classification decision, i.e., argmax_j f_j(z) = argmax_j f̃_j(z).

Proof. Due to the softmax definition, it holds that q = argmax_j f_j(z) = argmax_j z_j, since the exponential function is strictly increasing and the denominator of (1) is common to all components. For the case of the proposed function (18), it likewise holds that r = argmax_j f̃_j(z) = argmax_j z_j. From the two equalities, it is derived that z_q = z_r; hence, argmax_j f_j(z) = argmax_j f̃_j(z). □
Theorem 1 states that the proposed softmax-like function and the actual softmax function always derive the same decisions. The proposed softmax-like approximation is based on the idea that the softmax function is used during training to target an output y by using maximum log-likelihood [2]. Hence, if the correct answer already has the maximum input value to the softmax function, then approximating log((n − 1)Q' + 1) by 0 will not alter the output decision, due to the exponential function used in the term Q'. In general, Σ_j f̃_j(z) > 1, so the sequence f̃_j(z) cannot be regarded as a probability density function (pdf). For models where the normalization function is required to be a pdf, a modified approach can be followed, as detailed below. According to this second approach, let m_1 ≥ m_2 ≥ ⋯ ≥ m_p denote the p largest components of z, so that m_1 = m. Starting from (11) and retaining only the p largest terms of the sum in the exponent, a second approximation is obtained as

$$\tilde{f}_j(\mathbf{z}, p) = e^{\,z_j - m - \log \sum_{k=1}^{p} e^{m_k - m}}. \qquad (30)$$

Equation (30) uses the parameter p, which defines the number of additional terms used. By properly selecting p, it holds that Σ_j f̃_j(z, p) ≈ 1, and (30) approximates a pdf better than (18). This derives from the fact that, in a real-life CNN, the p maximum values are those that contribute to the computation of the softmax, since all the remaining values are close to zero.
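As a sketch of how (30) behaves (NumPy assumed; the function name and test vector are illustrative), the following code normalizes by the p largest exponentials and shows the approximation error shrinking as p grows:

```python
import numpy as np

def softmax_like_p(z, p):
    """Sketch of (30): normalize by the p largest exponentials only.

    m_1 >= ... >= m_p are the p largest components of z (m_1 = m), so
    f~_j(z, p) = exp(z_j - m) / sum_{k=1..p} exp(m_k - m).
    p = 1 reduces to (18); p = n recovers the exact softmax.
    """
    m = z.max()
    m_top = np.sort(z)[-p:]                      # p largest components
    return np.exp(z - m) / np.exp(m_top - m).sum()

z = np.random.randn(100)
exact = softmax_like_p(z, z.size)                # p = n: exact softmax
for p in (1, 2, 5, 20):
    print(p, np.mean((softmax_like_p(z, p) - exact) ** 2))  # MSE drops as p grows
```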
Proof. By definition, when p = 1, it holds that m_1 = m, since the single largest value m_1 is identified as the maximum m. Hence, by substituting p = 1 in (30), it derives that f̃_j(z, 1) = e^{z_j − m} = f̃_j(z), i.e., (30) reduces to (18). □

From a hardware perspective, (18) and (30) can be computed by the same circuit which implements the exponential function. The contributions of the paper are as follows. Firstly, the quantity log((n − 1)Q' + 1) is eliminated from (16), under the assumption that the target application requires decision making. Secondly, further mathematical manipulations are applied to (30) in order to approximate the outputs as a pdf, i.e., probabilities that sum to one. Thirdly, the circuit for the evaluation of e^x is simplified, since x = z_j − m ≤ 0 and, therefore,

$$0 < e^{z_j - m} \le 1. \qquad (33)$$

Figure 2 depicts the various building blocks of the proposed architecture. More specifically, the proposed architecture comprises the block which computes the maximum m, i.e., m = max_k(z_k). This computation is performed by a tree which generates the maximum by comparing the elements pairwise, as shown in Figure 3. The depicted tree structure generates m = max_k(z_k), k = 0, 1, ..., 7. The notation z_ij denotes the maximum of z_i and z_j, while z_ijkl denotes the maximum of z_ij and z_kl. The same architecture is used to compute the top p maximum values of the z_i's. For example, z_01, z_23, z_45 and z_67 are the top four maximum values and m = max_k(z_k) is the overall maximum.
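The comparison tree of Figure 3 can be modeled behaviorally as follows; a minimal Python sketch, assuming n is a power of two (the function name is illustrative):

```python
def max_tree(values):
    """Behavioral model of the comparison tree of Figure 3: the maximum is
    obtained by comparing elements pairwise, level by level, in log2(n)
    comparator levels (n assumed to be a power of two here)."""
    level = list(values)
    while len(level) > 1:
        level = [max(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

print(max_tree([3, 7, 1, 5, 2, 8, 6, 4]))  # 8
```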
Subsequently, m is subtracted from all the component values z_k, as dictated by (17). The subtraction is performed through the adders shown in Figure 2, using two's-complement representation for the negated input −m. The obtained differences, also represented in two's complement, are used as inputs to a look-up table (LUT), which performs the proposed simplified e^x operation of (18) to compute the final vector f̃_j(z), as shown in Figure 2a. Additionally, p terms are added and, subsequently, each output f̃_j(z, p) is generated through (30) as the final value of the softmax-like layer output, as shown in Figure 2b. For the hardware implementation of the e^x function, a LUT is adopted, the input of which is x = z_j − m. The LUT size grows with the range of e^x that must be covered. The proposed hardware implementation is simpler than other exponential implementations, which employ CORDIC transformations [32], floating-point representation [33], or LUTs [34]. Due to (33), the e^x values are restricted to the range (0, 1]; hence, the derived LUT size diminishes significantly, leading to a simplified hardware implementation. Furthermore, no conversion from the logarithmic to the linear domain is required, since f̃_j(z) represents the final classification layer.
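A minimal model of the LUT-based exponential is sketched below; the input range X_MIN and the 5-bit fractional quantization are assumptions chosen for illustration, not the parameters of the synthesized design:

```python
import numpy as np

FRAC_BITS = 5       # fractional bits, matching the (5, 5) format used later
X_MIN = -16.0       # assumed input range; x = z_j - m is always <= 0

# Precomputed table: one e^x entry per representable fixed-point input.
lut_x = np.arange(X_MIN, 2**-FRAC_BITS, 2**-FRAC_BITS)
lut_exp = np.exp(lut_x)              # every entry lies in (0, 1], per (33)

def exp_lut(x):
    """LUT-based e^x for x <= 0: quantize the input and index the table."""
    idx = int(np.clip(round((x - X_MIN) * 2**FRAC_BITS), 0, lut_exp.size - 1))
    return lut_exp[idx]

print(exp_lut(-2.0), np.exp(-2.0))   # table value vs. exact value
```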
The next section quantitatively investigates the validity and usefulness of employing f̃_j(z), in terms of the approximation error.
Figure 2. Proposed softmax-like layer architecture. The circuit max(z_k), k = 0, 1, ..., n, computes the maximum value m of the input vector z = [z_1 ⋯ z_n]^T; next, m is subtracted from each z_k, as described in (17). (a) Proposed softmax-like architecture with p = 1; each output f̃_j(z) is generated through (18) as the final value of the softmax-like layer output. (b) Proposed softmax-like architecture with additional p terms; the notation $\overline{\bullet}$ denotes negation, and each output f̃_j(z, p) is generated through (30) as the final value of the softmax-like layer output.

Quantitative Analysis of Introduced Error
This section quantitatively verifies the applicability of the approximation introduced in Section 3, for certain applications, by means of a series of examples.
In order to quantify the error introduced by the proposed architecture, the mean square error (MSE) is evaluated as

$$\mathrm{MSE} = \frac{1}{n} \sum_{j=1}^{n} \left( f_j(\mathbf{z}) - \tilde{f}_j(\mathbf{z}) \right)^2,$$

where f_j(z) and f̃_j(z) are the expected and the actually evaluated softmax outputs, respectively.

As an illustrative example, denoted as Example 1, Figure 4 depicts the histograms of the component values in test vectors used as inputs to the proposed architecture, selected to have the specific properties detailed below. The corresponding parameters are evaluated by using the proposed architecture for the case of a 10-bit fixed-point representation, where 5 bits are used for the integer part and 5 bits are allocated to the fractional part. More specifically, the statistical structure of the vector in Figure 4a is characterized by the quantity Q' = 0.9946 of (15). The estimated MSE = 0.8502 dictates that this particular vector is not suitable as an input to the softmax alternative in terms of CNN performance, i.e., the obtained classification is performed with low confidence.

A second vector, denoted as Example 2, has the histogram shown in Figure 4b. In this case, the statistical structure of the vector demonstrates Q' = 0.0449 and MSE = 0.0018. The feature of vector z in Example 2 is that it contains three large component values close to each other, namely z_1 = 3, z_2 = 6, z_3 = 4, z_4 = 2, while all other components are smaller than 1. The softmax output (1) for this particular z, given in (41), assigns probability 0.7675 to component z_2, while the proposed approximation (18), given in (42), assigns it the value 1, since z_2 is the maximum. Equations (41) and (42) show that the proposed architecture selects component z_2 with value 1, while the actual probability is 0.7675. This means that the introduced error of MSE = 0.0018 can be negligible, depending on the application, as dictated by Q' = 0.0449 ≪ 1.

In the following, tests using vectors obtained from real CNN applications are considered. More specifically, in the example shown in Figure 5, the vectors are obtained from image and digit classification applications. In particular, Figure 5a,b depict the values used as inputs to the final softmax layer, generated during a single inference for a VGG-16 ImageNet image-classification network with 1000 classes and for a custom net for MNIST digit classification with 10 classes, respectively. The quantity Q' can be used to determine whether the proposed architecture is appropriate for application on a vector z, before evaluating the MSE. It is noted that the MSE is of the order of 10^{−13} for the example of Figure 5a and of the order of 10^{−5} for the example of Figure 5b.

Subsequently, the proposed method is applied to the ResNet-50 [35], VGG-16, VGG-19 [36], InceptionV3 [37] and MobileNetV2 [38] CNNs, for 1000 classes with 10,000 inferences of a custom image data set; for the case of ResNet-50, the obtained results are depicted in Figure 6a. In general, in all cases, identical output decisions are obtained from the actual softmax and the softmax-like output layer for each one of the CNNs.

Considering the impact of the data-wordlength representation, let (l, k) denote the fixed-point representation of a number with l integral and k fractional bits. Figure 12a,b depict histograms of the MSE values obtained for the case of 1000 inferences by the VGG-16 CNN. It is shown that the case w = (6, 2) demonstrates the smallest MSE values. The reason for this is that the maximum value of the inputs to the softmax layer is 56 for all 1000 inferences; hence, a value of 6 bits for the integral part is sufficient.
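The qualitative behavior of Examples 1 and 2 can be reproduced with a short sketch (NumPy assumed; the test vectors are illustrative, not the ones of Figure 4):

```python
import numpy as np

def mse_vs_softmax(z):
    """MSE between the exact softmax f and the approximation f~ of (18)."""
    f = np.exp(z - z.max()); f /= f.sum()   # exact softmax output
    f_tilde = np.exp(z - z.max())           # proposed approximation (18)
    return np.mean((f - f_tilde) ** 2)

# One dominant component (Q' near 0) gives a negligible MSE, while nearly
# equal components (Q' near 1) give a large MSE, as in Example 1.
peaked = np.array([12.0] + [0.0] * 9)
flat = np.full(10, 0.5)
print(mse_vs_softmax(peaked), mse_vs_softmax(flat))
```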
Summarizing, it is shown that the proposed architecture is well suited for the final stage of a CNN as an alternative to the softmax layer, since the MSE is negligible. Next, the proposed architecture is implemented in hardware and compared with published counterparts.

Hardware Implementation Results
This section describes the implementation results obtained by synthesizing the proposed architecture outlined in Figure 2. Among the several authors reporting results on CNN accelerators, [22][23][24] have recently published works focusing on hardware implementations of the softmax function. In particular, in [23], a study based on stochastic computation is presented. Geng et al. provide a framework for the design and optimization of softmax implementations in hardware [26]. They also discuss operand bit-width minimization, taking into account application accuracy constraints. Du et al. propose a hardware architecture that derives the softmax function without a divider [25]. The approach relies on an equivalent softmax expression which requires natural logarithms and exponentials. They provide a detailed evaluation of the impact of the particular implementation on several benchmarks. Li et al. describe a 16-bit fixed-point hardware implementation of the softmax function [27]. They use a combination of look-up tables and multi-segment linear approximations for the approximation of the exponentials, as well as a radix-4 Booth-Wallace-based six-stage pipelined multiplier and a modified shift-compare divider.
In [24], the architecture demonstrates LUT-based computations that add complexity and exhibits 444,858 µm² area complexity using a 65 nm standard-cell library. For the same library, the architecture in [25] reports 640,000 µm² area complexity with 0.8 µW power consumption at a 500 MHz clock frequency. The architecture in [28] reports 104,526 µm² area complexity with 4.24 µW power consumption at a 1 GHz clock frequency. The architecture proposed in [26] demonstrates power consumption and area complexity of 1.8 µW and 3000 µm², respectively, at a 500 MHz clock frequency with a UMC 65 nm standard-cell library. In [27], a 3.3 GHz clock frequency and 34,348 µm² area complexity are reported at the 45 nm technology node. Yuan [22] presented an architecture for implementing the softmax layer; nevertheless, there is no discussion of the implementation of the LUTs and there are no synthesis results. The proposed softmax-like function differs from the actual softmax function due to the approximation of the quantity log((n − 1)Q' + 1), as discussed in Section 3. In particular, (18) approximates the softmax output for decision-making applications and not as a pdf. The proposed softmax-like function in (30) approximates the outputs as a pdf, depending on the number p of terms used; as p → n, (30) approaches the actual softmax function. The hardware-complexity reduction derives from the fact that a limited number, p, of the z_i's contribute to the computation of the softmax function. Summarizing, we compare both architectures depicted in Figure 2a,b with [22] in order to quantify the impact of p on the hardware complexity. Section 4 shows that the softmax-like function is well suited to a CNN. For a fair comparison, we have implemented and synthesized both our proposed architecture and that of [22] using a 90 nm 1.0 V CMOS standard-cell library with Synopsys Design Compiler [39]. Figure 13 depicts the architecture obtained from synthesis, where the various building blocks, namely the maximum evaluation, the subtractors, and the simplified exponential LUTs operating in parallel, are shown. Furthermore, registers have been added at the circuit inputs and outputs for applying the delay constraints. Detailed results are given in Table 2a-c for the proposed softmax-like layer of Figure 2a, the architecture of [22], and the proposed softmax-like layer of Figure 2b, respectively, for a layer of size 10. Furthermore, the results are plotted graphically in Figure 14a,b, where area vs. delay and power vs. delay are depicted, respectively. The results demonstrate that substantial area savings are achieved with no delay penalty. More specifically, for a 4 ns delay constraint, the area complexity is 25,597 µm² and 43,576 µm² for the architectures of Figure 2b and [22], respectively. For the case where a pdf output is not required, the area complexity can be further reduced, to 17,293 µm², with the architecture of Figure 2a. Summarizing, depending on the application and the design constraints, there is a trade-off involving the number of additional terms p used for the evaluation of the softmax output. As the value of the parameter p increases, the actual softmax value is better approximated, while the hardware complexity increases. When p = 1, the hardware complexity is minimized, while the approximation of the softmax output is coarsest.


Conclusions
This paper proposes hardware architectures for implementing the softmax layer in a CNN with substantial reductions in the area × delay and power × delay products, respectively, for certain cases. A family of architectures that approximate the softmax function has been introduced and evaluated, each member of which is obtained through a design parameter p, which controls the number of terms employed in the approximation. It is found that a very simple approximation, using p = 1, suffices to deliver accurate results in certain cases, even though the derived approximation is not a pdf. Furthermore, it has been demonstrated that for image and digit classification applications the proposed architecture is ideally suited, as it achieves MSEs of the order of 10^{−13} and 10^{−5}, respectively, which are considered low.