An Accuracy-Improved Fixed-Width Booth Multiplier Enabling Bit-Width Adaptive Truncation Error Compensation

Abstract: Fixed-width Booth multipliers (FWBMs) generate a product with the same bit width as the operand and have been extensively employed in many digital systems. Various truncation error compensation (TEC) schemes have been presented for FWBM designs, aiming to reduce hardware costs while preserving operation accuracy. In general, the existing TEC methods function adequately for one exact bit width of the operand but fail to consider the TEC effect for FWBM inputs with various bit-width levels. To address this issue, we propose a bit-width adaptive TEC (BWATEC) scheme that provides high-accuracy TEC functions adaptive to the multiple L′-bit numerical ranges of input data for an L-bit FWBM (L′ ≤ L). We also present an adjustable architecture for a 16-bit FWBM to enable the proposed BWATEC scheme and evaluate the hardware performance using the TSMC 40 nm standard cell library. Relative to the compared 16-bit FWBM approaches that use state-of-the-art TEC methods, the proposed BWATEC-enabled FWBM design achieves reductions in the area-delay-error product of 7.9–50.9%, 17.1–69.5%, 29.9–82.2%, and 100% for the 14-bit, 12-bit, 10-bit, and 8-bit inputs, respectively. Moreover, the resultant 16-bit FWBM with BWATEC was verified on a field-programmable gate array for convolutional neural network acceleration.


Introduction
Multipliers are widely used in digital systems. To limit bit-width growth in data paths, fixed-width multipliers are employed as arithmetic modules for digital signal processing, communication baseband operations, and neural network acceleration [1–4]. An L-bit fixed-width multiplier generates an L-bit output from L-bit operands; the Baugh-Wooley (array) multiplier and the Booth multiplier are two of the most popular types. Two straightforward approaches to fixed-width Baugh-Wooley or Booth multipliers are post-truncation (PT) and direct-truncation (DT). The PT method calculates all partial products and rounds the 2L-bit full-width product to the L most significant bits (MSBs); it achieves high accuracy, but the hardware costs are high. The DT method truncates the partial products related to the least significant bits (LSBs) of the 2L-bit full-width product; it reduces the hardware costs, but the accuracy is very low.
In conventional TEC methods for FWBMs, the TEC functions are generally designed for one particular bit width of the FWBM operand. However, in practical applications, an L-bit FWBM might need to process input patterns with various L′-bit widths (L′ ≤ L; L and L′ are generally even). For example, a 16-bit FWBM might be employed to operate with 16-bit, 14-bit, 12-bit, 10-bit, or 8-bit numerical input patterns, as specified by different situations. Such operation arises in several applications. Taking the convolutional neural network (CNN) as an example, an accelerator may employ an FWBM to process input data whose settable bit widths differ across CNN models or layers. Moreover, FWBMs used in a shared digital filter might operate with input data whose levels vary across multiple analog modules. To the best of our knowledge, no previous study has developed a TEC scheme for such an FWBM design to offer adaptive TEC biasing for various bit widths of input data. In this study, we propose a bit-width adaptive TEC (BWATEC) scheme that provides an adjustable TEC bias for the diverse bit widths of input patterns. For an L-bit FWBM, the proposed BWATEC method enables a tailored and high-accuracy TEC function for each case of the L′-bit input pattern (where L′ ≤ L). In addition, an FWBM design for enabling the BWATEC is proposed based on a reconfigurable bias circuit with high hardware efficiency.
The remainder of this paper is organized as follows. Section 2 briefly introduces the background of FWBMs and the conventional probability-based TEC schemes for FWBMs. Section 3 outlines the proposed BWATEC scheme and its operations. In Section 4, the architecture of a 16-bit FWBM enabling the proposed BWATEC scheme is described. Section 5 evaluates the accuracy and hardware performances of our design and reports the experiment results, using a system-on-chip (SoC) field-programmable gate array (FPGA) platform. Finally, the conclusions are highlighted in Section 6.

Preliminaries and Design Issues
Some abbreviations and acronyms frequently used in this study are tabulated in Table 1 for convenient reference. Let A and B be two L-bit 2's complement operands, represented by "a_{L−1}, a_{L−2}, ..., a_1, a_0" and "b_{L−1}, b_{L−2}, ..., b_1, b_0", respectively, with the values shown below.

$$A = -a_{L-1}\,2^{L-1} + \sum_{i=0}^{L-2} a_i\,2^{i}, \qquad B = -b_{L-1}\,2^{L-1} + \sum_{i=0}^{L-2} b_i\,2^{i} \tag{1}$$

The Booth encoding maps three consecutive terms, b_{2j+1}, b_{2j}, and b_{2j−1}, into d_j, as tabulated in Table 2. The d_j value can be associated with the (b_{2j+1}, b_{2j}, b_{2j−1}) terms as expressed in Equation (2), where Q = L/2 and b_{−1} = 0. As a result, a 2L-bit full-width product (FP) for A × B can be obtained as shown in Equation (3).

$$d_j = -2\,b_{2j+1} + b_{2j} + b_{2j-1}, \qquad j = 0, 1, \ldots, Q-1 \tag{2}$$

$$FP = A \times B = \sum_{j=0}^{Q-1} d_j \cdot A \cdot 2^{2j} \tag{3}$$

Table 2. Booth encoding (d_j) and the associated P.P. terms (p_{i,j} and n_j); "~" denotes the bitwise complement.

(b_{2j+1} b_{2j} b_{2j−1}) | d_j | p_{L,j} | p_{L−1,j} | p_{L−2,j} | ··· | p_{2,j} | p_{1,j} | p_{0,j} | n_j
(0 0 0)/(1 1 1) | 0 | 0 | 0 | 0 | ··· | 0 | 0 | 0 | 0
(0 0 1)/(0 1 0) | 1 | a_{L−1} | a_{L−1} | a_{L−2} | ··· | a_2 | a_1 | a_0 | 0
(1 0 1)/(1 1 0) | −1 | ~a_{L−1} | ~a_{L−1} | ~a_{L−2} | ··· | ~a_2 | ~a_1 | ~a_0 | 1
(0 1 1) | 2 | a_{L−1} | a_{L−2} | a_{L−3} | ··· | a_1 | a_0 | 0 | 0
(1 0 0) | −2 | ~a_{L−1} | ~a_{L−2} | ~a_{L−3} | ··· | ~a_1 | ~a_0 | 1 | 1

Using binary arithmetic for A × B, the partial products (P.P.) for each d_j can be derived in terms of a_i (i from 0 to L − 1), 0, or 1, as shown by the values of p_{i,j} and n_j in Table 2. Based on the P.P. terms in Table 2, Figure 1 depicts the structure of the P.P. array for an example of a 16-bit (L = 16) A × B full-width Booth multiplier. As shown in Figure 1, all P.P. terms can be divided into two groups: the main part (MP) and the truncation part (TP). The P.P. in the MP are calculated to generate the product, whereas the TP includes the P.P. for computing the rounded L LSBs of the full-width product. The TP can be further divided into the TP_major and TP_minor subgroups. As indicated in Figure 1, TP_major contains the P.P. in the most significant column (MSC) of the TP, which dominates the accuracy of the carry from the TP toward the MP. In general, the accuracy can be improved by increasing the column range of TP_major [17–20]. However, a TEC function based on a one-MSC TP_major usually offers adequate accuracy in many applications, and the one-MSC TP_major serves as a sufficient baseline for evaluating the performance of different TEC schemes [12,14,16,22]. Thus, this study adopted the one-MSC TP_major to develop and evaluate our BWATEC scheme and FWBM design.
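The Booth encoding and the row-wise accumulation of the P.P. described above can be checked numerically. The following Python sketch is illustrative only: the function names (`booth_digits`, `pp_row`, `booth_product`) are hypothetical, b_{−1} is assumed to be 0, and each row is modeled as an (L+1)-bit pattern plus the correction bit n_j, following the usual radix-4 Booth convention.

```python
def booth_digits(b, L):
    """Radix-4 Booth digits d_0..d_{Q-1} of an L-bit two's complement integer b."""
    # Python's >> is an arithmetic shift, so this also extracts bits of negative b.
    bits = [(b >> i) & 1 for i in range(L)]
    prev = 0  # assumed b_{-1} = 0
    digits = []
    for j in range(L // 2):
        digits.append(-2 * bits[2 * j + 1] + bits[2 * j] + prev)
        prev = bits[2 * j + 1]
    return digits


def pp_row(a, d, L):
    """Return (p, n): the (L+1)-bit pattern p_{L,j}..p_{0,j} as an integer, and n_j."""
    mask = (1 << (L + 1)) - 1
    if d >= 0:
        return (d * a) & mask, 0   # 0, A, or 2A, sign-extended to L+1 bits
    return ~(-d * a) & mask, 1     # bitwise complement of |d|*A, with n_j = 1


def booth_product(a, b, L):
    """Accumulate all P.P. rows; row j carries the weight 2^(2j)."""
    total = 0
    for j, d in enumerate(booth_digits(b, L)):
        p, n = pp_row(a, d, L)
        # Interpret p as an (L+1)-bit two's complement number.
        signed = p - (1 << (L + 1)) if p >> L else p
        total += (signed + n) << (2 * j)
    return total
```

For instance, `booth_digits(-1, 8)` yields `[-1, 0, 0, 0]`, and `booth_product(a, b, 8)` agrees with `a * b` for 8-bit two's complement operands.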
In an FWBM design with TEC, TP_major is reserved for calculation, whereas TP_minor is truncated, and an estimated bias is adopted to compensate for the truncation error [9–22]. Therefore, an L-bit FWBM with TEC produces an L-bit quantized result, FP_q, as expressed in Equation (4), where B_TEC indicates the estimated bias value for TEC, TP_major is mapped to the 2^−1 digit, and R{.} is the rounding operation.
With regard to an L-bit FWBM whose operands can be assigned input data of multiple prespecified L′-bit widths (L′ ≤ L), the L′-bit input patterns are left-shifted by (L − L′) bits and padded with zeros (i.e., Zero-Padding bits) to form the L-bit operand. This processing for L = 16 is also described in Figure 1 for input patterns with 14-bit, 12-bit, 10-bit, or 8-bit (i.e., L′-bit) widths.
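The left-shift and zero-padding step can be illustrated with a few lines of Python. This is a minimal sketch under the assumption that inputs are signed integers in 2's complement; the helper names `pad_operand` and `to_bits` are hypothetical and not part of the original design.

```python
def pad_operand(x, l_prime, L=16):
    """Place an L'-bit two's complement value in the MSBs of an L-bit operand."""
    if not -(1 << (l_prime - 1)) <= x < (1 << (l_prime - 1)):
        raise ValueError("x does not fit in L' bits")
    return x << (L - l_prime)  # the (L - L') LSBs become Zero-Padding bits


def to_bits(value, L=16):
    """Two's complement bit string of an L-bit value, MSB first."""
    return format(value & ((1 << L) - 1), f"0{L}b")


# A 10-bit input of -3 occupies the upper 10 bits; the lower 6 bits are zero.
print(to_bits(pad_operand(-3, 10)))
```

Because both operands are shifted by (L − L′) bits, the full-width product of the padded operands equals the product of the original inputs shifted left by 2(L − L′) bits.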

Figure 1. Partial product (P.P.) array structure for a 16-bit full-width Booth multiplier with multiple 16-to-8-bit numerical ranges of input patterns.

Probability-Based TEC Schemes for FWBMs
Several FWBM designs with probability-based TEC have been presented [14–22]. The authors of Reference [14] presented a probability-based scheme together with simulation-based work. Similarly, the work in Reference [15] used the expected value of the P.P. to derive bias values. Furthermore, the probabilistic analysis methods [16–22] derived closed-form TEC functions based on the expected value or the conditional probability of the P.P. terms. In Reference [16], the expected values for two groups of TP_minor (i.e., the n_j terms in Table 2 equal to 0 or 1) were derived separately to obtain the probabilistic estimation bias when a one-MSC TP_major is specified. In addition, a generalized probabilistic estimation bias (GPEB) method [17] further extended the work in Reference [16] to the cases of TP_major containing more P.P. columns. Using the GPEB methods [16,17], a simple TEC function of a 1-bit or 2-bit constant value was derived. The work in Reference [18] presented a TEC scheme based on the conditional probability depending on non-zero Booth encoder outputs (i.e., d_j ≠ 0 in Table 2) for each row of TP_minor. In Reference [19], a more complex method based on [18] was presented by using a conditional probability model for multiple TP_minor rows. Such a design [19] slightly improved accuracy but increased hardware overheads. The authors of Reference [20] considered both expected values and conditional probability to develop a bias function that improves accuracy and area based on probability and computer simulation (PACS). In Reference [21], the concept of data scaling was presented and applied to conventional TEC-adapted FWBM designs for improving accuracy. A Booth-encoded sign-digit-based conditional probability (BSCP) method was presented in Reference [22] for the case of a one-MSC TP_major. The work in Reference [22] further took advantage of the sign of non-zero Booth encoder results to generate a TEC function achieving relatively high accuracy.
Considering a 16-bit FWBM design with TEC, the aforementioned conventional TEC schemes can be directly applied to the design example shown in Figure 1. However, such approaches cannot achieve optimized accuracy for input patterns with 14/12/10/8-bit widths, as the applied TEC functions are derived for 16-bit operands; thus, imprecise biasing might be introduced to 14-bit to 8-bit FWBM operations. Accordingly, an enhanced and tailored TEC scheme (e.g., the proposed BWATEC method) that is adaptive to input patterns with values in multiple bit-width levels is useful and practical for the TEC-enabled FWBM design.

Proposed Bit-Width Adaptive TEC (BWATEC) Scheme
In Sections 3.1 and 3.2, we use the 16-bit FWBM as an example for explaining the probability-based bias estimation and TEC operations for the proposed BWATEC scheme.

Derivation of Probabilistic Estimation for BWATEC
Referring to Figure 1, there are eight rows of TP_minor (incl. n_j) in the P.P. array for the 16-bit A × B Booth multiplication. We denote the row index by j, from 0 to 7 (the top row is the 0th row). The contents of the TP vary with the number of Zero-Padding (ZP) bits for different L′-bit input patterns of the operands A and B. Based on the mapping results from Table 2, Figure 2a–c illustrates the contents of TP_minor for L′ = 14, 12, and 10, respectively. As described in Figure 2a–c, TP_minor is classified into three regions: the zero region (R_Z), hybrid region (R_H), and deterministic-only region (R_D). The R_Z region only has zero-valued P.P. related to the Booth-encoded result of the ZP bits of the B operand, and thus can be trivially truncated. In Figure 2, the R_H region includes P.P. with hybrid deterministic and probabilistic values. For the jth row of TP_minor in R_H, the s_j terms are the P.P. associated with the ZP bits of the operand A. Both n_j and s_j can be exactly determined to be "0" or "1" (i.e., deterministic values) depending on d_j, based on the contents in Table 2. The e_j in R_H (Figure 2) is the P.P. value of p_{r,j} (wherein r = L − L′). From Table 2, it can be observed that the e_j value can be equal to "0" or "1" (d_j = ±2) or can be identified by the LSB of the original L′-bit input data (d_j = ±1). In the R_H region, the e_j terms in the case of d_j = ±1 and all other P.P. terms, excluding n_j, s_j, and e_j, can be estimated by using an expected value of 1/2 (i.e., probabilistic values) [16–20]. In contrast to R_H, all P.P. in the R_D region are s_j and n_j terms only, which are deterministic values. In addition to the cases of L′ = 14, 12, and 10 (shown in Figure 2a–c), the TP_minor contents for the two contrasting cases of L′ = 16 and L′ = 8 are illustrated in Figure 3a,b. For L′ = 16, TP_minor only has the R_H region, whereas for L′ = 8, only the R_Z and R_D regions are included.
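The region classification described above can be sketched by counting, for each row, how many P.P. fall inside TP_minor and comparing that count with the number of ZP bits. The following Python sketch is an interpretation rather than the published procedure: the function name `classify_rows` is hypothetical, and the array geometry (p_{i,j} in column i + 2j, TP_minor spanning columns 0 to L − 2) is an assumption drawn from the description of Figure 1.

```python
def classify_rows(l_prime, L=16):
    """Map each TP_minor row index j to 'RZ', 'RH', or 'RD' for L'-bit inputs."""
    r = L - l_prime  # number of Zero-Padding (ZP) bits in each operand
    regions = {}
    for j in range(L // 2):
        n_pp = L - 1 - 2 * j  # p_{0,j} .. p_{L-2-2j,j} lie inside TP_minor
        if 2 * j + 1 < r:
            # b_{2j+1}, b_{2j}, b_{2j-1} are all ZP bits, so d_j = 0.
            regions[j] = "RZ"
        elif n_pp < r:
            # Fewer P.P. than ZP bits: only s_j and n_j terms remain.
            regions[j] = "RD"
        else:
            regions[j] = "RH"
    return regions


for lp in (16, 14, 12, 10, 8):
    print(lp, [classify_rows(lp)[j] for j in range(8)])
```

Under these assumptions, L′ = 16 yields only R_H rows and L′ = 8 yields only R_Z and R_D rows, which agrees with the cases described for Figure 3.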
Electronics 2021, 10, x FOR PEER REVIEW
By mapping TP_major to the 2^−1 digit (i.e., the MSB of TP_minor is at 2^−2), the expected value of all P.P. for the jth row of TP_minor in R_H, E[TP^(H)_minor,j], can be calculated as in Equation (5), where n_s is the number of s_j terms (refer to Figure 2). Based on Equation (5) and the mapping contents in Table 2, the values of E[TP^(H)_minor,j] and (n_j, s_j, e_j) according to d_j are listed in Table 3. The values in Table 3 can be summarized by Equation (6), where the variable δ_j is defined by δ_j = 1 for d_j ≠ 0; otherwise, δ_j = 0.
Table 3. Values of (n_j, s_j, e_j) and E[TP^(H)_minor,j] according to d_j.
From observing the R_H region for L′ = 14, 12, and 10 in Figure 2, it can be found that R_H includes the mth to the kth rows of TP_minor, in which m = n_s/2 and k = 7 − (n_s/2). By summing E[TP^(H)_minor,j] over all rows in R_H, an overall E[TP^(H)_minor] can be obtained, as given in Equation (7). For an FWBM with TEC, the result of Equation (7) can be viewed as an ideal bias for the truncated TP_minor in R_H; however, the calculation of part (a) in Equation (7) is complex. The bottom row in R_H (i.e., the kth row; j = k) dominates the final calculation result. Moreover, the result of (a) in Equation (7) can be rounded to the 2^−2 digit so that it can be arithmetically added to δ_j. Therefore, Equation (7) can be approximated by Equation (8) by simplifying the (a) part to a σ·2^−2 term, where R_−2{.} represents rounding a value to the 2^−2 digit.
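The structure of this derivation can be sketched as a worked calculation. The block below is a reconstruction under stated assumptions, not a copy of Equations (5)–(8): column c of the P.P. array is assumed to carry the weight 2^{c−L}, p_{i,j} is assumed to lie in column i + 2j with n_j in column 2j, and e_j stands for its expected value (1/2) when d_j = ±1. The grouping labeled (a) is assumed to correspond to the part (a) referred to in the text.

```latex
% Assumptions (not from the source): weights 2^{c-L} per column c; an R_H row j
% holds n_s terms s_j, one term e_j = p_{n_s,j}, and probabilistic terms of mean 1/2.
\begin{align*}
E\bigl[TP^{(H)}_{minor,j}\bigr]
  &= \sum_{i=n_s+1}^{L-2-2j} \tfrac{1}{2}\,2^{\,i+2j-L}
     + e_j\,2^{\,n_s+2j-L}
     + s_j \sum_{i=0}^{n_s-1} 2^{\,i+2j-L}
     + n_j\,2^{\,2j-L} \\
  &= 2^{-2} + (e_j - 1)\,2^{\,n_s+2j-L}
     + \bigl[s_j\,(2^{n_s}-1) + n_j\bigr]\,2^{\,2j-L}
     \qquad (d_j \neq 0) \\
  &= \delta_j\,2^{-2} - \tfrac{d_j}{2}\,2^{\,n_s+2j-L}, \\[4pt]
E\bigl[TP^{(H)}_{minor}\bigr]
  &= \sum_{j=m}^{k} E\bigl[TP^{(H)}_{minor,j}\bigr]
   = 2^{-2}\sum_{j=m}^{k}\delta_j
     \;\underbrace{-\sum_{j=m}^{k} d_j\,2^{\,n_s+2j-L-1}}_{(a)} \\
  &\approx 2^{-2}\Bigl(\sum_{j=m}^{k}\delta_j + \sigma\Bigr),
  \qquad \sigma = -\operatorname{sgn}(d_k).
\end{align*}
```

With k = L/2 − 1 − n_s/2, the exponent n_s + 2k − L equals −2, so the kth-row term of (a) is −d_k·2^−3, that is, ±2^−3 or ±2^−2. This is why the bottom row of R_H dominates (a) and why rounding (a) to the 2^−2 digit leaves a single σ·2^−2 term.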
However, the subtraction arithmetic at the 2^−2 digit (i.e., σ = −1 in Equation (8)) is also an issue in a P.P. array. This issue can be resolved by taking advantage of the following operational features. When d_k is negative, both δ_k and σ are equal to 1; thus, a carry of "1" can be added to the 2^−1 digit. If d_k is positive, δ_k is 1, whereas σ is −1. Thus, δ_k is eliminated at the 2^−2 digit, owing to the offset by σ. As a result, Equation (8) can be further calculated by using Equation (9), where a variable, γ, is operated at the 2^−1 digit with an addition only.
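The replacement of the (δ_k + σ) pair by a single carry γ can be verified by enumerating the five possible Booth digits. This is a small sketch; the helper names are hypothetical, and taking σ = 0 when d_k = 0 is an assumption, since the text only states the negative and positive cases.

```python
from fractions import Fraction


def delta(d):
    """delta = 1 for a non-zero Booth digit, else 0."""
    return 1 if d != 0 else 0


def sigma(d):
    """+1 for negative d_k, -1 for positive d_k; 0 for d_k = 0 is assumed."""
    return (d < 0) - (d > 0)


def gamma(d):
    """Carry added at the 2^-1 digit: 1 only when d_k is negative."""
    return 1 if d < 0 else 0


for d_k in (-2, -1, 0, 1, 2):
    at_quarter_digit = Fraction(1, 4) * (delta(d_k) + sigma(d_k))
    at_half_digit = Fraction(1, 2) * gamma(d_k)
    print(d_k, at_quarter_digit, at_half_digit)
```

In every case the two columns agree, so the kth row contributes only a non-negative carry at the 2^−1 digit and no subtraction is needed in the array.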
As shown in Figure 2a–c, the R_D region only comprises P.P. in terms of s_j and n_j, as the number of P.P. within a row in R_D is less than the number of ZP bits (Figure 1). Similar to s_j in R_H, the s_j terms in R_D are also P.P. obtained from the ZP bits of the A operand and are equal to the d_j-dependent deterministic "1" or "0" values. Setting s_j and n_j to "1" (for d_j < 0) or "0" (for d_j ≥ 0), the actual value of all P.P. for the jth row of TP_minor in R_D, E[TP^(D)_minor,j], can be obtained by the derivation in Equation (10). An accumulated result from the jth row in R_D is introduced to the 2^−1 digit for negative d_j values.
For the design examples illustrated in Figure 2a–c, the R_D region includes rows with indexes from k + 1 to Q − 1, where Q equals 8 for the case of L = 16. The variable Q is defined as Q = L/2, which is the number of rows in a P.P. array. Thus, a global E[TP^(D)_minor] can be derived, as shown in Equation (11), in which the variable λ_j is defined by λ_j = 1 for d_j < 0 and λ_j = 0 for d_j ≥ 0, corresponding to the execution results of Equation (10) for each row.
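The statement that each R_D row with a negative d_j contributes exactly one unit at the 2^−1 digit can be checked by summing the column weights of that row. The sketch below assumes the same array geometry as Figure 1 is described to have (p_{i,j} in column i + 2j, n_j in column 2j, column c weighted 2^{c−L}); the function name `rd_row_value` is hypothetical.

```python
from fractions import Fraction


def rd_row_value(j, negative, L=16):
    """Exact value of TP_minor row j when every P.P. in it is an s_j or n_j term."""
    bit = 1 if negative else 0  # s_j = n_j = 1 for d_j < 0, else 0
    # s_j terms occupy columns 2j .. L-2; n_j shares the column of p_{0,j}.
    cols = [i + 2 * j for i in range(L - 1 - 2 * j)]
    cols.append(2 * j)
    return sum(bit * Fraction(1, 2 ** (L - c)) for c in cols)


# Rows 4..7 form R_D for 8-bit inputs on a 16-bit multiplier.
for j in range(4, 8):
    print(j, rd_row_value(j, negative=True), rd_row_value(j, negative=False))
```

The geometric series of s_j weights falls short of 2^−1 by exactly the weight of n_j, so the row sums to 2^−1 for negative d_j and to 0 otherwise, which is consistent with accumulating λ_j at the 2^−1 digit.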

BWATEC Synthesis and Operations
For FWBMs with TEC, the E[TP_minor] values obtained by using Equations (9) and (11) can be employed as the TEC bias to compensate for the truncated P.P. of TP_minor in R_H and R_D, respectively. Moreover, the operations of Equations (9) and (11) vary with the input bit width (L′), as do the contents and range of the R_H and R_D regions (Figure 2). In addition to the cases of L′ = 14, 12, and 10 shown in Figure 2a–c, the proposed schemes based on Equations (9) and (11) can also be applied to the conditions of L′ = 16 and 8 shown in Figure 3. For L′ = 16, TP_minor only has the R_H region, and we can use Equation (9) to generate the TEC bias. Alternatively, when L′ = 8, Equation (11) is used, as only the R_D region is calculated. In practice, the TEC function for L′ = 16 can be further improved. From Figure 1, we can use the deterministic p_{0,7} and n_7 to operate with δ_j at the 2^−2 digit; thus, a more precise carry can be added at the 2^−1 digit, instead of adding γ in Equation (9). The efficient use of Equations (9) and (11), as associated with multiple combinations of deterministic and probabilistic data, achieves the aims of the proposed BWATEC scheme. Considering a 16-bit FWBM, Figure 4 illustrates the TEC operations using the proposed BWATEC scheme for various L′-bit inputs.
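To make the combined use of the R_H and R_D biases concrete, the following Python sketch models the bias as 2^−2·Σδ_j over rows m to k − 1, plus γ·2^−1 for row k, plus 2^−1·Σλ_j over the R_D rows, and compares it with the exact value of the truncated TP_minor. It is an interpretation under stated assumptions, not the authors' implementation: the function names are hypothetical, the summation ranges are inferred from the text, the array geometry follows the description of Figure 1, and the refinement for L′ = 16 based on p_{0,7} and n_7 is not modeled.

```python
from fractions import Fraction


def booth_digits(b, L):
    """Radix-4 Booth digits of an L-bit two's complement integer (b_{-1} = 0)."""
    bits = [(b >> i) & 1 for i in range(L)]
    prev, digits = 0, []
    for j in range(L // 2):
        digits.append(-2 * bits[2 * j + 1] + bits[2 * j] + prev)
        prev = bits[2 * j + 1]
    return digits


def tp_minor_exact(a, b, L=16):
    """Exact value of the truncated TP_minor, with TP_major at the 2^-1 digit."""
    mask = (1 << (L + 1)) - 1
    total = Fraction(0)
    for j, d in enumerate(booth_digits(b, L)):
        p = (d * a) & mask if d >= 0 else ~(-d * a) & mask
        n = 1 if d < 0 else 0
        for i in range(L - 1 - 2 * j):  # p_{i,j} in columns 2j .. L-2
            total += ((p >> i) & 1) * Fraction(1, 2 ** (L - i - 2 * j))
        total += n * Fraction(1, 2 ** (L - 2 * j))  # n_j in column 2j
    return total


def bwatec_bias(b, l_prime, L=16):
    """Assumed bias: R_H part (delta_j and gamma) plus R_D part (lambda_j)."""
    Q, r = L // 2, L - l_prime
    m, k = r // 2, Q - 1 - r // 2
    d = booth_digits(b, L)
    bias = Fraction(0)
    if m <= k:  # an R_H region exists
        bias += Fraction(1, 4) * sum(1 for j in range(m, k) if d[j] != 0)
        bias += Fraction(1, 2) * (1 if d[k] < 0 else 0)  # gamma for row k
    # R_D rows; max() keeps R_Z rows out when L' is small.
    bias += Fraction(1, 2) * sum(1 for j in range(max(m, k + 1), Q) if d[j] < 0)
    return bias


# 8-bit inputs on the 16-bit multiplier: only R_Z and R_D rows are present.
a, b = -77 << 8, 45 << 8
print(tp_minor_exact(a, b), bwatec_bias(b, 8))
```

In this model, the bias for 8-bit inputs coincides with the exact TP_minor value, because every truncated P.P. is deterministic. For wider inputs the R_H part is a probabilistic estimate, and the residual error depends on the input distribution.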

Design Scalability
Taking the 16-bit FWBM example as a base, the deduced processing can also be applied to general L-bit FWBM designs (i.e., L is a scalable number other than 16). In general, an L-bit Booth multiplier operates with an even L. Considering the scalability of the proposed design, the targeted L-bit FWBMs can be categorized into two kinds of specifications. One is L = 2n where n is an even integer; thus, the number of P.P. rows (i.e., the Q value) is even based on Q = L/2. The other is L = 2n where n is an odd integer; thus, the number of P.P. rows is odd. Referring to the contents associated with Figures 2 and 3, the TP_minor P.P. corresponding to the R_Z/R_H/R_D regions are illustrated for the design case of an L-bit Booth multiplier (L = 16) with various L′-bit inputs (L′ = 16, 14, 12, 10, and 8), which has an even number (i.e., 8) of P.P. rows. Moreover, the contents related to Figure 4 illustrate the proposed BWATEC operation for a 16-bit FWBM. As an extension of the illustration for the case of L = 16, Figure 5 depicts the R_Z/R_H/R_D distribution of TP_minor rows of a general L-bit Booth multiplier (i.e., L is scalable) for different L′-bit inputs, and Figure 5a,b illustrates the specifications of "L = 2n (n and Q are even; even rows)" and "L = 2n (n and Q are odd; odd rows)", respectively. In Figure 5a,b, the value shown inside each R_Z/R_H/R_D block represents the number of rows in that region, and ⌈·⌉ refers to the ceiling operator.

As indicated in the previous section, the proposed BWATEC operations for the case of a 16-bit FWBM (i.e., Figure 4) can be synthesized based on the contents in Figures 2 and 3 together with Equations (9) and (11). By analogy with the derivation for the contents in Figure 4, the proposed BWATEC operations for various L′-bit input patterns of a general L-bit FWBM can be similarly synthesized based on Figure 5 and Equations (9) and (11), as described in Figure 6, where the two specifications of L = 2n (n is even or odd) are illustrated separately.
The contents in Figures 5 and 6 further address the R_Z/R_H/R_D distribution and BWATEC operations of a general L-bit FWBM operated with L′-bit inputs of small L′ values. For the cases of "L′ ≤ L/2 − 2 (L = 2n; n is even)" or "L′ ≤ L/2 − 1 (L = 2n; n is odd)", only the R_Z and R_D regions are included, and the range of R_D is reduced with smaller L′ values. Such conditions allow only the R_D P.P. to be calculated to obtain the TEC bias, as expressed in Equation (12), which is an extended form based on Equation (11).
Figure 5. Schematic for the R_Z/R_H/R_D distribution of TP_minor rows of a general L-bit Booth multiplier for various L′-bit inputs: (a) L = 2n, and n is even; (b) L = 2n, and n is odd.
Figure 6. Schematic of BWATEC operations for various L′-bit input patterns of a general L-bit FWBM: (a) L = 2n, and n is even; (b) L = 2n, and n is odd.

Proposed BWATEC-Enabled FWBM Architecture
There is also a need for an FWBM design with an efficient architecture for enabling the proposed BWATEC scheme. Figure 7 describes the hardware architecture of a 16-bit FWBM example enabling the BWATEC functions. As shown in Figure 7, the P.P. values are first produced by the Booth Encoder and the P.P. Generator, which operate on the two operands A and B; these operands are already padded with ZP bits according to the prespecified bit width L′ of the input patterns (Figure 1). Depending on L′, the BWATEC-associated δj, γ, and λj terms are also set and sent to the P.P. array, along with the P.P. terms. The carry-save adder/carry-propagation adder (CSA/CPA) unit performs the array operations for the MP, TPmajor, and BWATEC biasing. Right shifting of bits can be optionally executed at the CPA output, depending on the practical system design. As detailed in Figure 7, we used four groups of multiplexers (i.e., M1, M2, M3, and M4), controlled by the setting of L′, to select among the carry of the δj accumulation, γ, λj, and "0" for the BWATEC operations described in Figure 4. A switch is also employed for selecting either γ or an extra carry contributed by the addition of p0,7 and n7 for L′ = 16. Based on the configuration shown in Figure 7, similar approaches can be used to deduce the FWBM design for other operand bit widths. As a result, the BWATEC-enabled FWBM can be realized by using the originally required P.P. elements with additional multiplexers (including a switch) and control logic for adjustable TEC operations.
Considering a general L-bit FWBM (L is scalable) enabling the proposed BWATEC scheme, its hardware configuration for TEC biasing can also be developed based on the BWATEC operation shown in Figure 6, following the approach used for the 16-bit FWBM example (refer to Figures 4 and 7). The hardware structure for BWATEC biasing of a general L-bit FWBM is described in Figure 8a,b for the two specifications of "L = 2n (n is even)" and "L = 2n (n is odd)", respectively. As indicated in Figure 8, the addition of the biasing elements is performed by using full adders (FAs) or half adders (HAs). The mandatory multiplexers (i.e., MUX1 in Figure 8) are employed to select the carry of the δ accumulation or the γ and λ terms based on the BWATEC operations in Figure 6, and the optional multiplexers (i.e., MUX2) can force unadded δ terms to "0" for energy efficiency. If the devised FWBM is specified to process L′-bit inputs with small L′ values, corresponding levels of multiplexers (i.e., MUX3) might be employed to mask the uncalculated γ or λ terms (refer to Figure 6), as shown in Figure 8. Based on the configuration of the P.P. array and the BWATEC biasing (refer to Figure 8), the hardware (HW) resource usage, in terms of the number of FAs, HAs, and multiplexers (i.e., MUX1, 2, and 3), of a general L-bit FWBM using the proposed BWATEC scheme for various L′-bit inputs is listed in Table 4.
Table 4. HW resource usage of a general L-bit FWBM for various L′-bit inputs using the BWATEC scheme (Q = L/2).

Evaluations and Experiments
Considering 16-bit FWBM designs with TEC based on one-MSC TPmajor, this section evaluates the accuracy and hardware performances for the proposed design and several representative works in previous studies. Moreover, the 16-bit FWBM with BWATEC was verified through the SoC-FPGA implementation for CNN inference operations.

Evaluations of Accuracy and Hardware Performances
For the accuracy performance, the signal-to-noise ratio (SNR) is the most important parameter and is defined as in Equation (13), where FP (refer to Equation (3)) is the product of the full-width Booth multiplier, and FPq (refer to Equation (4)) is the product of the FWBM with TEC, DT, or PT. Equation (13) also defines the mean square error (MSE).
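Because Equation (13) is not reproduced in this text, the following sketch assumes a commonly used form, SNR = 10·log10(E[FP²]/MSE) with MSE = E[(FP − FPq)²]; this form is an assumption rather than the paper's confirmed definition. The two quantizers below act on the finished 2L-bit product (plain truncation and rounding), so they only stand in for DT/TEC and PT and do not model the Booth P.P. array.

```python
import math
import random

L = 16  # operand width in bits


def snr_db(full, approx):
    """Return (SNR in dB, MSE) for exact products vs. approximated products.

    Assumed definition: SNR = 10*log10(mean(FP^2) / mean((FP - FPq)^2)).
    """
    mse = sum((f - q) ** 2 for f, q in zip(full, approx)) / len(full)
    signal = sum(f * f for f in full) / len(full)
    return 10 * math.log10(signal / mse), mse


rng = random.Random(0)  # fixed seed so the run is repeatable
lo, hi = -(1 << (L - 1)), (1 << (L - 1)) - 1
# 30 K uniformly distributed, uncorrelated signed 16-bit operand pairs
full = [rng.randint(lo, hi) * rng.randint(lo, hi) for _ in range(30_000)]

# Both quantizers keep the L MSBs; results are rescaled by 2**L so that
# they can be compared directly against the 2L-bit product.
truncated = [(p >> L) << L for p in full]                   # drop the L LSBs
rounded = [((p + (1 << (L - 1))) >> L) << L for p in full]  # round, PT-like

snr_trunc, mse_trunc = snr_db(full, truncated)
snr_round, mse_round = snr_db(full, rounded)
print(f"truncation: SNR = {snr_trunc:.2f} dB")
print(f"rounding:   SNR = {snr_round:.2f} dB")
```

Rounding removes the one-sided bias that truncation introduces, so it gives the smaller MSE and the higher SNR in this simplified model.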

For comparison, we select state-of-the-art TEC schemes whose functions have a closed form, i.e., the generalized probabilistic estimation bias (GPEB) [16,17], probability estimation and computer simulation (PACS) [20], Booth-encoded sign-digit-based conditional probability (BSCP) [22], and SC-generator-based (SCG) [12] methods, as well as the DT and PT approaches. Table 5 presents the accuracy (i.e., the SNR) and hardware performance (area, critical-path delay, and power consumption) of a 16-bit FWBM using each of the aforementioned TEC schemes and the proposed BWATEC scheme. In Table 5, the SNR results were obtained for operations on 16-bit data (i.e., L′ = 16), based on 30 K sets of 16-bit A × B Booth multiplications. Both the A and B operands were uncorrelated random 16-bit numbers with a uniform distribution. The hardware parameters were obtained with the Synopsys Design Compiler through logic synthesis with the TSMC 40 nm typical standard cell library, for FWBM designs with no optional bit-shift processing at the output. Following References [12,22], we used a general sorting circuit based on Reference [12] for the BSCP and SCG designs in order to avoid the addition of negative digit values. In Table 5, the BSCP method achieves a better SNR than all other TEC-enabled designs. However, this result was obtained by using a complex TEC formula (i.e., Equation (19) in Reference [22]), which is difficult to apply directly to a practical biasing circuit. The GPEB scheme outperforms the other TEC-based works in hardware cost owing to its simple 1-bit or 2-bit constant TEC bias; however, its accuracy is comparatively lower. Among the hardware parameters listed in Table 5, the "area" and "power" items basically exhibit the same trend.
To benchmark both the area efficiency and the accuracy, a design metric of area-delay-error product (ADEP), defined as "ADEP = Area × Delay × MSE", can be adopted to evaluate the overall design efficiency. As there is no TEC function involved in either DT or PT and the MSE magnitudes obtained by DT and PT are too extreme for the ADEP evaluation, these two schemes are excluded from the ADEP evaluation [22]. Table 6 lists the ADEP results in percentage values (normalized to that of the GPEB case) for TEC-enabled designs. As indicated in Table 6, the proposed design outperforms all listed schemes (i.e., it has a relatively small ADEP value) except the PACS method. This is because additional data-selection multiplexers, a switch, and control logic are required in our design to enable the adjustment of the TEC function for multiple L′ levels (refer to Figure 7). Such processing increases the hardware costs, and especially the critical-path delay, of our FWBM relative to other TEC designs, as shown in Table 5; however, the accuracy for L′ = 16 in our case is comparatively improved by using the deterministic p0,7 and n7 (refer to Section 3.2, Figure 4).
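The ADEP metric and its normalization to a baseline can be written out in a few lines. The numbers below are placeholders chosen only to make the arithmetic easy to follow; they are not values from Table 5 or Table 6, and the function names are illustrative.

```python
def adep(area: float, delay: float, mse: float) -> float:
    """Area-delay-error product: ADEP = Area x Delay x MSE."""
    return area * delay * mse


def normalized_adep_percent(design, baseline) -> float:
    """ADEP of `design` as a percentage of the ADEP of `baseline`.

    Each argument is an (area, delay, mse) tuple in consistent units.
    """
    return 100.0 * adep(*design) / adep(*baseline)


# Placeholder (area, delay, MSE) tuples: NOT measurements from the paper.
baseline = (1000.0, 2.0, 4.0)    # stands in for the normalization reference
candidate = (1100.0, 2.2, 2.0)   # larger and slower, but half the MSE

pct = normalized_adep_percent(candidate, baseline)
reduction = 100.0 - pct
print(f"normalized ADEP = {pct:.1f}%, reduction = {reduction:.1f}%")
```

Because the metric is a plain product, a design whose MSE is zero has an ADEP of zero whatever its area and delay, which corresponds to a 100% reduction against any nonzero baseline.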
Nevertheless, the actual accuracy performance and hardware efficiency of the proposed design are manifested in the accuracy improvement for operations on L′-bit input patterns with L′ < 16. Table 7 reports the SNR results for the TEC-enabled 16-bit FWBMs (i.e., the works in Table 6) for operations of L′ = 14, 12, 10, and 8. In Table 7, the SNR values were obtained based on the 16-bit product of the FWBMs relative to the PT outcomes. As indicated in Table 7, our design achieves the highest SNR among the TEC-based designs for all listed L′ cases because the proposed BWATEC scheme provides more precise TEC biasing for various L′-bit inputs. In addition, the proposed design achieves higher SNR results with smaller L′ values, owing to the larger number of deterministic RZ/RD elements. In practical designs, even a slight improvement in the SNR can yield a meaningful enhancement in the system operation accuracy [21]. Considering the overall design efficiency, Figure 9 illustrates the ADEP results of the TEC-based 16-bit FWBMs, based on the MSE value relative to the PT products, for operations of L′ = 14, 12, 10, and 8 with ZP bits added to the input operands. In Figure 9, the ADEP values are normalized to the GPEB results, and the annotated percentage values represent the reduction of the ADEP achieved by the proposed design relative to all other listed methods. Figure 9 demonstrates that our scheme outperforms its contenders in terms of the ADEP values, achieving reductions of 7.9–50.9%, 17.1–69.5%, 29.9–82.2%, and 100% for input patterns with 14-bit, 12-bit, 10-bit, and 8-bit widths, respectively. Figure 9 also shows that our design achieves a more significant TEC effect with smaller specified L′ values, as a larger proportion of deterministic values is used in association with the RH and RD regions when using the proposed BWATEC scheme. For the case of L′ = 8, our approach equivalently counts all P.P. terms to obtain a full-width result that is the same as the PT outcome; the MSE relative to PT is therefore zero, and a 100% ADEP reduction is achieved.
Electronics 2021, 10, x FOR PEER REVIEW 13 of 19
As discussed in Sections 3.3 and 4, two specifications of "L = 2n (n is even)" and "L = 2n (n is odd)" are considered for the design scalability of an L-bit FWBM using the proposed BWATEC scheme. Therefore, in addition to the case of L = 16 (for even n), another case of L = 14 (for odd n) was also evaluated for the ADEP performance in this section. Figure 10 illustrates the ADEP results of the TEC-enabled 14-bit FWBMs for operations of L′ = 12, 10, 8, and 6, based on the same processing as that for the 16-bit FWBM evaluation. Figure 10 indicates that our design outperforms all other listed methods, achieving significant ADEP reductions for inputs with 12-bit, 10-bit, 8-bit, and 6-bit widths. Compared to the ADEP results for the 16-bit FWBMs (Figure 9), the ADEP drops of all GPEB-excluded designs relative to the GPEB base are reduced in Figure 10; however, our design exhibits the same trend of ADEP reduction with respect to the other TEC-based works as a function of the relative values of L and L′.
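The L′ = 8 observation for the 16-bit FWBM can be illustrated numerically. The sketch below assumes that the ZP bits occupy the LSB positions, so that an L′-bit value sits in the MSBs of the L-bit operand; Figure 1 is not reproduced here, so this placement is an assumption, although it is consistent with the statement that the L′ = 8 result equals the PT outcome. The function names are hypothetical.

```python
def pad_operand(value: int, L: int, L_prime: int) -> int:
    """Place a signed L'-bit value in the MSBs of an L-bit word.

    The (L - L') zero-padding bits are appended on the LSB side
    (assumed placement; see the lead-in above).
    """
    lo, hi = -(1 << (L_prime - 1)), (1 << (L_prime - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError("value does not fit in L' bits")
    return value << (L - L_prime)


def truncated_part(a: int, b: int, L: int, L_prime: int) -> int:
    """Low L bits of the 2L-bit product, i.e., what a fixed-width output drops."""
    product = pad_operand(a, L, L_prime) * pad_operand(b, L, L_prime)
    return product & ((1 << L) - 1)


# L = 16, L' = 8: each operand carries 8 zero LSBs, so the product has
# 16 zero LSBs and discarding them loses no information.
samples = [(a, b) for a in range(-128, 128, 17) for b in range(-128, 128, 13)]
print(all(truncated_part(a, b, 16, 8) == 0 for a, b in samples))

# L = 16, L' = 12: only 8 product LSBs are guaranteed to be zero, so the
# discarded part can be nonzero and compensation is still needed.
print(truncated_part(3, 5, 16, 12))
```

Under this assumption the discarded half of the product is identically zero whenever L′ ≤ L/2, which is why no compensation error remains in that case.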

CNN Acceleration Application
To verify an FWBM enabling the proposed BWATEC scheme, we implemented our design on a SoC-FPGA platform and demonstrated the hardware acceleration for CNN inference operations. In a typical CNN accelerator, fixed-point operations are usually considered, and a suitable bit width can be determined based on the CNN inference accuracy requirement [25,26]. Several studies have shown that a small bit width (e.g., 8 bits or fewer) is sufficient for the model coefficients and operation precision while preserving the inference accuracy [27][28][29]. However, a sufficiently high bit width (e.g., the common 16 bits) is considered in several CNN accelerator approaches to ensure the precision required by various applications [30][31][32]. Moreover, different bit widths can be specified for different CNN layers (e.g., the intermediate layers) to adjust the CNN performance [33,34]. Accordingly, several works have proposed CNN processing units that support operations with variable bit widths (e.g., 4/8/16-bit or 1-bit to 16-bit) [28,[35][36][37]. In this study, an L-bit FWBM (e.g., our design example of a 16-bit FWBM) capable of processing input patterns with multiple L′-bit (L′ ≤ L) widths lends support to the aforementioned practical approaches.

SoC-FPGA Implementation
The employed SoC-FPGA-based platform uses a Xilinx Zynq-7000 SoC-FPGA device, which integrates an ARM central processing unit (CPU) with the user-developed hardware side. Such a SoC-FPGA approach supports the CNN inference operations through a software (SW)-hardware (HW) co-design scheme [38][39][40], in which the SW-HW work division is appropriately evaluated. For example, computation-expensive two-dimensional (2D) convolution is often accelerated at the HW side, while other low-effort CNN operations, such as maximum pooling, fully-connected (FC) layer execution, and system controls, are processed at the SW end [39,40]. Figure 11 shows the setup of our implementation, which uses a SoC-FPGA approach based on the SW-HW co-design for CNN acceleration.
Figure 11. Schematic of the setup of our design implementation that uses a SoC-FPGA approach.
Referring to Figure 11, the division of HW and SW responsibilities in our CNN setup was as follows. The HW side was responsible for 2D (5 × 5) convolution, addition of an offset, the activation function (i.e., rectified linear unit; ReLU), and maximum pooling, while the SW side performed the remaining low-complexity operations (e.g., FC execution), HW operation mode setting, and system control. As depicted in Figure 11, the ARM CPU executes SW commands and communicates with the HW side through the AXI bus, and the data transfer between the external memory and the FPGA HW-side memory is executed through direct memory access (DMA). When the HW acceleration of each CNN layer was activated, the feature map data, kernel weights, and control parameters were fetched from the external memory (e.g., DRAM) to the HW side via DMA transmission and stored in the block RAMs, data registers, and control registers, respectively. The 2D convolution accelerator then accessed those stored values for the convolution operations and sent the calculated result to the next module for the offset-addition, ReLU, and pooling operations. The final data produced by the HW acceleration of each CNN layer were stored in the block RAMs and sent to the external memory through DMA for the follow-up SW processing.
In our design, the 2D (5 × 5) convolution accelerator employs twenty-five 16-bit FWBMs with BWATEC, which can operate on 16-bit, 14-bit, 12-bit, 10-bit, or 8-bit numerical input data. Depending on the tiling for each CNN layer, the block RAMs can be configured to store the data of input and output feature maps (images) with sizes from 32 × 32 to 128 × 128. Table 8 lists the main HW resource usage on a Xilinx Zynq-7000 SoC-FPGA device for our FPGA design; the items include the lookup-table (LUT), flip-flop (FF), LUTRAM, and block RAM (BRAM) utilization. Table 8 also lists the HW performance in giga operations per second (GOPS), which is obtained at a 50 MHz clock rate with values converted from the giga multiplication and addition operations [40].
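The HW-side dataflow described above (5 × 5 convolution, offset addition, ReLU, maximum pooling) can be modeled functionally in software. The sketch below is a behavioral reference in plain integer arithmetic, not a model of the FWBM or of the FPGA design; the "valid" border handling, the 2 × 2 pooling window, and all function names are assumptions, since the text does not specify them.

```python
def conv2d_valid(image, kernel):
    """2D convolution as used in CNNs (no kernel flip), without padding."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def add_offset_relu(fmap, offset):
    """Add a per-map offset, then apply ReLU."""
    return [[max(0, v + offset) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping maximum pooling (window size is an assumption)."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


def hw_layer(image, kernel, offset):
    """Chain the four HW-side stages for one CNN layer."""
    return max_pool(add_offset_relu(conv2d_valid(image, kernel), offset))


image = [[1] * 8 for _ in range(8)]    # 8 x 8 toy feature map
kernel = [[1] * 5 for _ in range(5)]   # 5 x 5 kernel: 25 products per output
print(hw_layer(image, kernel, offset=-20))
```

Each output pixel of `conv2d_valid` needs 25 multiplications, which matches the use of 25 FWBMs in the 5 × 5 convolution accelerator.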

Electrocardiogram Classification Experiment
Based on our SoC-FPGA implementation and SW-HW co-design setup, an experiment was performed to demonstrate electrocardiogram (ECG) classification. In this work, we used the standard MIT-BIH arrhythmia dataset [41] for the CNN model training and inference. To operate on ECG data with a 2D CNN model [42,43], we transformed the one-dimensional MIT-BIH ECG signals into (128 × 128) 2D ECG images by using the signal preprocessing technique presented in Reference [42]. Rather than the clinical ECG classification into seven or five arrhythmia classes [42,44], our experiment merely classified ECG images into "normal" and "abnormal" heartbeats for wearable ECG monitor applications.
The experimental network for the targeted ECG classification was built by using a simplified LeNet-5 CNN model [45]. As summarized in Table 9, the CNN model includes two convolution and maximum pooling layers, followed by the FC layers. Our CNN model was first determined by a training process performed on a high-end computer in floating-point operations. For the CNN inference using an SW-HW co-design approach, the executions accelerated at the HW side were performed in fixed-point operations to achieve an acceptable overall accuracy (Table 9).
To verify the proposed design, we implemented a 2D convolution unit consisting of twenty-five 16-bit FWBMs with the proposed BWATEC function on the SoC-FPGA device. In our experiment, the same 2D convolution unit with one set of twenty-five 16-bit FWBMs was operated to perform the computation of two CNN layers. To demonstrate the BWATEC operations of our design, two phases of L′ setting for the same set of FWBMs (i.e., L = 16 with L′-bit inputs) were adopted to execute the two layers of CNN convolution. In phase 1, all 16-bit FWBMs in the 2D convolution unit were set to operate with 12-bit input data (i.e., L′ = 12) for the CNN layer-1 operations, consistent with the numerical level of the inputs. In phase 2, the same 16-bit FWBMs were set to process 16-bit input data (i.e., L′ = 16) for the CNN layer-2 operations to preserve the computation precision.
For evaluation, we also implemented contrast 2D convolution units composed of 16-bit FWBMs using the BSCP, PACS, GPEB, and SCG TEC schemes on the same SoC-FPGA device. Table 10 lists the FPGA LUT resource utilization of a 2D convolution unit for the various TEC-based FWBM designs, using the aforementioned methods and our scheme. As shown in Table 10, our design achieves a medium level of area efficiency on the FPGA, which is basically consistent with the trend of the area parameters listed in Table 5. However, the capability of our design to operate with multiple L′-bit settings lends support to system development requiring flexible word lengths or improved accuracy. As a case study in addition to CNN acceleration, the devised 2D convolution unit can be restructured to realize a 25-tap finite impulse response (FIR) filter by inserting several multiplexers in the data paths of the 25 FWBMs. We also developed such an FIR filter with a slight HW overhead via FPGA implementation, so that our design can be used for digital signal processing applications.
The ECG classification was checked by using a modified CNN inference model with the 2D convolution performed in our two-phase fixed-point operations. Moreover, the experimental CNN operations (Table 9) were performed on the SoC-FPGA based on our SW-HW co-design approach to obtain the inference results. For the bit-width setting (i.e., L′) of the two phases, the SW side prepared the FWBM operands appropriately padded with ZP bits and set the BWATEC control for each round (i.e., layer 1 or 2) of CNN HW acceleration. The inference outcomes generated via the SoC-FPGA were further compared with the results generated by the aforementioned fixed-point CNN inference model for verification. After inference checking, the confusion matrix and performance of our ECG classification experiment are listed in Table 11. The performance results reported in Table 11 were obtained based on the accuracy (Acc.), sensitivity (Sen.), and specificity (Spc.) statistical metrics extracted from the confusion matrix [42,44]. The terms TP, TN, FP, and FN denote true positive as "abnormal" (arrhythmia), true negative as "normal", false positive as "abnormal", and false negative as "normal" in the binary classification, respectively. The associated formulas are defined as follows:

$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sen.} = \frac{TP}{TP + FN}, \qquad \mathrm{Spc.} = \frac{TN}{TN + FP} \tag{14}$$
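The three metrics in Equation (14) follow directly from the four confusion-matrix counts. The sketch below uses made-up counts purely for illustration; they are not the values reported in Table 11, and the function name is hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity, and specificity for a binary classifier.

    Positive = "abnormal" (arrhythmia), negative = "normal".
    """
    total = tp + tn + fp + fn
    return {
        "acc": (tp + tn) / total,   # fraction of all beats classified correctly
        "sen": tp / (tp + fn),      # fraction of abnormal beats detected
        "spc": tn / (tn + fp),      # fraction of normal beats kept as normal
    }


# Illustrative counts only: NOT the confusion matrix from Table 11.
m = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(m)
```

Sensitivity and specificity are reported alongside accuracy because accuracy alone can look favorable when one class dominates the dataset.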

Conclusions
In this paper, we presented a BWATEC scheme capable of providing an adjusted TEC function adaptive to various L′-bit input patterns of an L-bit FWBM, in which L′ ≤ L. Using different combinations of hybrid deterministic/probabilistic values associated with the RH and RD regions, the proposed BWATEC scheme can generate a tailored high-accuracy TEC bias for an L-bit FWBM, depending on the setting of L′ (L and L′ are scalable). An FWBM enabling the proposed BWATEC scheme can be realized by using a reconfigurable bias circuit in a P.P. array with design scalability.
Taking a 16-bit FWBM as an example, we found that the approach using our BWATEC scheme exhibited design efficiency and different degrees of ADEP reduction for operations with 14-bit to 8-bit inputs, as compared to FWBM designs that used state-of-the-art TEC methods.
Moreover, the resultant 16-bit FWBM with BWATEC was verified by using the Xilinx Zynq-7000 SoC-FPGA based on the SW-HW co-design approach. The SoC-FPGA-based verification demonstrated the experimental CNN model for ECG classification.
Informed Consent Statement: C-language simulation and Verilog modeling supports. We also thank Hong-Yu Ke and Jia-Nan Zhong for their work on the CNN model development for ECG classification.