Approximate Floating-Point Multiplier based on Static Segmentation

In this paper a novel low-power approximate floating-point multiplier is presented. Since the mantissa computation is responsible for the largest part of the power consumption, we apply a novel approximation technique to mantissa multiplication, based on static segmentation. In our approach, the inputs of the mantissa multiplier are properly segmented so that a small inner multiplier can be used to calculate the output, with beneficial impact on power and area. To further improve performance, we introduce a novel segmentation-and-truncation approach which allows us to eliminate the shifter normally present at the output of the segmented multiplier. In addition, a simple compensation term for reducing the approximation error is employed. The accuracy of the circuit can be tailored at design time by acting on a single parameter. The proposed approximate floating-point multiplier is compared with the state of the art, showing good performance in terms of both precision and hardware saving. For the single-precision floating-point format, the obtained NMED is in the range $10^{-5}$ to $7 \times 10^{-7}$, while MRED is in the range $3 \times 10^{-3}$ to $1.7 \times 10^{-4}$. Synthesis results in 28 nm CMOS show area and power savings of up to 82% and 85%, respectively, compared to the exact floating-point multiplier. Image processing applications confirm the expectations, with results very close to the exact case.


Introduction
Multipliers are the most-used arithmetic blocks in many digital signal processing applications, being the basic elements for operations such as filtering, correlation, de-noising, and domain transformation. Thanks to its favorable hardware features, fixed-point implementation is extensively exploited in a wide range of electronic systems, including transceivers [1,2], FPGA accelerators [3,4], digital phase-locked loops and spread-spectrum clock generators [5][6][7]. Fixed-point arithmetic employs a fixed number of bits for representing the integer and the fractional parts of the signals. The bit-length of the integer part is related to the range of representable numbers, while the bit-length of the fractional part affects the accuracy of the operations. Therefore, the designer must properly choose the signal bit-widths and resolutions to manage numerical range and precision.
Floating-point (FP) arithmetic, while complex from the point of view of hardware implementation, offers a flexible way of performing numerical computations, providing at the same time a large range of representable values and high precision. It is therefore routinely used in applications such as scientific computing, digital signal processing, and computer graphics. The standardized IEEE 754 format [8] is ubiquitous in most computing platforms, from CPUs to GPUs and microcontrollers. According to this standard, a FP number consists of sign S, exponent E and mantissa M. The value encoded in the FP format is given by $A = (-1)^{S} \cdot (1 + M) \cdot 2^{E - \mathrm{bias}}$, with the mantissa M in the range [0, 1).
A FP multiplication involves a fixed-point adder to sum up the exponents, a fixed-point multiplier for the mantissa processing, and a normalization logic for the result.

Single Precision Floating-Point Multiplier
In the following we consider the IEEE 754 single precision standard. According to this standard, a FP number consists of sign S, exponent E and mantissa M.
The value encoded in the FP format, in the case of normalized representations, is given by $X = (-1)^{S} \cdot (1 + M) \cdot 2^{E - \mathrm{bias}}$ (for the sake of simplicity, we do not consider de-normalized representations).
The mantissa M is in the range [0, 1). The '1' bit added to M is the so-called implicit bit. The mantissa M is represented with 23 bits (with MSB and LSB of weights $2^{-1}$ and $2^{-23}$, respectively). The exponent E is an 8-bit integer, while the exponent bias is 127. Thus, the exponent value E − bias is in the range [−127, 128].
Figure 1 shows an example, where the decimal number −13.140625 is represented according to the IEEE 754 single precision standard. This number can be written as $-2^{3} \times (1 + 1/2 + 1/8 + 1/64 + 1/512)$. Thus: the sign bit is S = 1; the exponent value is E = 3 + bias = 130, corresponding to binary 10000010; the mantissa is M = 1/2 + 1/8 + 1/64 + 1/512, corresponding to binary 10100100100000000000000.
As detailed above, the input a is expressed on 32 bits, with $n_e$ = 8 bits dedicated to the exponent and $n_m$ = 23 bits dedicated to the mantissa. More precisely, the most significant bit (MSB) of a, a[31], is the sign Sa, while the portions Ea = a[30:23] and Ma = a[22:0] are the exponent and the mantissa, respectively. The same scheme applies also to the input b, with sign Sb = b[31], exponent Eb = b[30:23], and mantissa Mb = b[22:0]. Therefore, the signals a and b are represented as follows:
$$a = (-1)^{Sa} \cdot (1 + Ma) \cdot 2^{Ea - \mathrm{bias}}, \qquad b = (-1)^{Sb} \cdot (1 + Mb) \cdot 2^{Eb - \mathrm{bias}} \tag{1}$$
Similarly, the product c = a·b is expressed in the form:
$$c = (-1)^{Sc} \cdot (1 + Mc) \cdot 2^{Ec - \mathrm{bias}} \tag{2}$$
where Sc, Ec, and Mc are the sign, exponent and mantissa of c, respectively. The mantissas Ma, Mb, and Mc are in the range [0, 1) with MSB and LSB of weights $2^{-1}$ and $2^{-23}$, respectively. Therefore, they constitute the fractional part of the quantities (1 + Ma), (1 + Mb), and (1 + Mc).
The arithmetic stage of Figure 2 computes Sc, Ec, and the mantissa product P. As shown, the XOR between Sa and Sb gives the sign Sc. The sum of Ea and Eb computes the exponent Ec, while the subtraction of 127 accounts for the exponent bias. In the mantissa multiplier, a bit '1' is explicitly concatenated to Ma and Mb at the most significant position (see (1.Ma) = (1 + Ma) and (1.Mb) = (1 + Mb) in the figure), to compute:
$$P = (1 + Ma) \cdot (1 + Mb) \tag{3}$$
It is worth noting that (1 + Ma) lies in the range [1, 2). As a consequence, P is in the range [1, 4), which means that two MSBs are needed to represent its integer part (namely p[47] and p[46]). We also underline that P is expressed on 48 bits.
The normalization logic extracts the mantissa Mc from P and adjusts the exponent Ec accordingly. If P < 2 (i.e., if p[47] is low), the product P is in the form (1.Mc). Therefore, the extraction of Mc simply requires selecting the fractional part of P, that is p[45:0]. Actually, only 24 bits of P (that is p[45:22]) are sent to the next rounding stage.
In the case P ≥ 2 (i.e., if p[47] is high), we need to right-shift P by one position in order to express the product in the form (1.Mc) before applying the rounding. Therefore, the segment p[46:23] is sent to the rounding stage. In this case, moreover, the exponent is incremented by one to compensate for the shift on P, as shown in the exponent update block in Figure 2.
Finally, the mantissa rounding block rounds the mantissa to 23 bits, as required by the standard, with the help of a 24-bit adder.
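The data path described in this section can be summarized with a small bit-level model. The following Python sketch is only illustrative and is not the authors' implementation: the helper names (`fields`, `pack`, `fp_mul`) are hypothetical, special values (zero, infinity, NaN, de-normalized numbers) and exponent overflow are ignored, and the rounding is assumed to add one at the guard-bit position of the 24-bit segment, a detail the text does not specify.

```python
import struct

BIAS = 127  # exponent bias of the single precision format


def fields(x):
    """Split a float into (sign, 8-bit exponent, 23-bit mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def pack(sign, exp, man):
    """Rebuild a float from its three fields."""
    bits = (sign << 31) | (exp << 23) | man
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fp_mul(a, b):
    """Exact-style multiplier for normalized operands (illustrative only)."""
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    sc = sa ^ sb                               # sign: XOR of the input signs
    ec = ea + eb - BIAS                        # exponent: sum minus the bias
    p = ((1 << 23) | ma) * ((1 << 23) | mb)    # (1.Ma)*(1.Mb) on 48 bits
    if p >> 47:                                # P >= 2: shift right by one
        ec += 1
        seg = (p >> 23) & 0xFFFFFF             # p[46:23]
    else:                                      # P < 2
        seg = (p >> 22) & 0xFFFFFF             # p[45:22]
    rounded = seg + 1                          # assumed rounding: +1 on guard bit
    ec += rounded >> 24                        # rounding overflow (1.11..1 -> 2.0)
    mc = (rounded >> 1) & 0x7FFFFF             # keep the 23 mantissa bits
    return pack(sc, ec, mc)
```

With this model, `fields(-13.140625)` returns the sign, exponent and mantissa of the example of Figure 1, and `fp_mul` can serve as the exact reference against which an approximate mantissa stage is compared.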

Proposed Optimization for the Mantissa Computation
In the proposed approach, instead of computing P as in (3), we compute:
$$P = 1 + P' \tag{4}$$
with P' obtained as follows:
$$P' = Ma \cdot Mb + Ma + Mb \tag{5}$$
Please note that (3) requires a 24 bit × 24 bit multiplier, due to the bit '1' explicitly concatenated to Ma and Mb. On the other hand, the calculation of P' in (5) requires a smaller 23 bit × 23 bit multiplier. Equation (5) opens the way to a static segmentation of the multiplication: in the multiplication described by (3) the bit '1' does not allow the multiplier inputs to be segmented, whereas the multiplicands in (5) do not include any bit stuck at '1'. Figure 3a shows the structure of the proposed floating-point multiplier and highlights, in the dashed red rectangle, the multiply-and-add unit (MAA) that implements Equation (5).
As shown in the figure, since the LSBs of the mantissas and of the product Ma·Mb have weights $2^{-23}$ and $2^{-46}$, respectively, both Ma and Mb are added with a hard-wired left-shift of 23 positions, to properly align their LSBs to that of Ma·Mb.
In this implementation, the two MSBs of P' are exploited to manage the normalization process. Before proceeding with the discussion, let us observe that the maximum value of P', named $P'_{max}$, can easily be calculated, since the maximum value of Ma and Mb is $(1 - 2^{-23})$. We have $P'_{max} = (1 - 2^{-23})^{2} + 2 \cdot (1 - 2^{-23})$, therefore $P'_{max}$ is slightly less than 3. Thus, 0 ≤ P' < 3 and, in binary representation, the two MSBs of P', which constitute its integer part, can only be '00', '01', or '10'. This observation helps the calculation of P = P' + 1. In fact, the fractional bits of P coincide with the fractional bits of P', and only the two MSBs of P (the integer part) have to be computed from the two MSBs of P'. To this purpose, let us consider the cases of the truth table: when p'[47:46] is '00', '01', or '10', p[47:46] is '01', '10', or '11', respectively. Hence P ≥ 2 whenever at least one of the two MSBs of P' is high, so that we can define the signal sel as the OR of p'[47] and p'[46], and normalize the mantissa when sel = 1 (see also Figure 3a).
The truth table also shows that p[46] is inverted with respect to p'[46]. Therefore, the output of the mantissa multiplier is given by {~p'[46], p[45:0]}, where '~' is used to represent the complement operation.
After the mantissa multiplier, mantissa normalization and rounding follow, as in Figure 2. The main difference is the use of the signal sel (instead of p[47]) to perform the normalization. Note also that the exponent update is implemented without a multiplexer, by using sel to increment the exponent.
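The relation between P' and P can be checked numerically. The Python sketch below is an illustration under stated assumptions, not the authors' circuit: the function name `mantissa_product` is hypothetical, the mantissas are modeled as 23-bit integers (so that P' has LSB of weight 2^-46), and sel is taken as the OR of the two MSBs of P', as follows from 0 ≤ P' < 3.

```python
NM = 23  # number of mantissa bits (single precision)


def mantissa_product(ma, mb):
    """Return (sel, P) with P rebuilt from P' = Ma*Mb + Ma + Mb.

    ma, mb are 23-bit integers; P' and P are 48-bit integers whose
    LSB has weight 2^-46 and whose two MSBs are the integer part.
    """
    # Ma and Mb are added with a hard-wired left shift of 23 positions.
    p_prime = ma * mb + ((ma + mb) << NM)
    b47 = (p_prime >> 47) & 1
    b46 = (p_prime >> 46) & 1
    sel = b47 | b46                             # high when P >= 2
    frac = p_prime & ((1 << 46) - 1)            # p[45:0] equals p'[45:0]
    p = (sel << 47) | ((b46 ^ 1) << 46) | frac  # {sel, ~p'[46], p[45:0]}
    return sel, p
```

For every pair of mantissas, the value of P returned here coincides with the 48-bit product (1.Ma)·(1.Mb) of the exact multiplier, and sel coincides with its bit p[47], although only a 23 bit × 23 bit product is computed.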

Static Segmentation Method
Static segmentation [40,41] reduces the size of the multiplier by segmenting the multiplicands before the product. Each segment comprises m bits, with $n_m/2 < m < n_m$. Therefore, an m × m multiplier is employed for computing the result instead of an $n_m \times n_m$ multiplier. Synthesizable HDL descriptions of fixed-point static-segmented multipliers are available in [45]. Figure 4 shows the segmentation for the mantissa Ma, where, for the sake of simplicity, we assume $n_m$ = 8 bits with m = 5. The mantissa Ma is divided into a lower portion (LPa), given by its m LSBs, and an upper portion (UPa), given by its m MSBs. The $(n_m - m)$ MSBs constitute the control segment CSa, used to decide between LPa and UPa. When all the bits of CSa are low, the segment LPa is selected for the multiplication. Conversely, if at least one of the bits of CSa is high, the segment UPa is chosen. A similar mechanism is applied also to the input Mb. Therefore, defining the selection flags $\alpha_{Ma}$ and $\alpha_{Mb}$ as the OR of the bits of the control segments CSa and CSb, respectively, and naming $Ma_{ssm}$, $Mb_{ssm}$ the segmented signals, the following relations hold:
$$Ma_{ssm} = \begin{cases} LPa & \text{if } \alpha_{Ma} = 0 \\ UPa & \text{if } \alpha_{Ma} = 1 \end{cases} \tag{6}$$
$$Mb_{ssm} = \begin{cases} LPb & \text{if } \alpha_{Mb} = 0 \\ UPb & \text{if } \alpha_{Mb} = 1 \end{cases} \tag{7}$$
Please note that an approximation error is introduced when the upper portion of the mantissa is multiplied (i.e., when the selection flag is high), since the less significant part is discarded (namely $\varepsilon_{Ma}$ in the figure). On the contrary, no approximation is introduced if the lower portion is selected.
After the multiplication, a left-shift is required to extend the result from 2·m bits to $2 \cdot n_m$ bits. Then, the approximate product $K_{ssm}$ is computed as follows:
$$K_{ssm} = (Ma_{ssm} \cdot Mb_{ssm}) \cdot 2^{LSH} \tag{8}$$
with LSHa and LSHb defined as
$$LSHa = (n_m - m) \cdot \alpha_{Ma}, \qquad LSHb = (n_m - m) \cdot \alpha_{Mb} \tag{9}$$
The term LSH in (8) is given by LSH = LSHa + LSHb and is the number of positions for the overall left-shift:
$$LSH = \begin{cases} 0 & \text{if } \alpha_{Ma} = 0,\ \alpha_{Mb} = 0 \\ n_m - m & \text{if } \alpha_{Ma} \neq \alpha_{Mb} \\ 2 \cdot (n_m - m) & \text{if } \alpha_{Ma} = 1,\ \alpha_{Mb} = 1 \end{cases} \tag{10}$$
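As an illustration of the mechanism just described, the following Python sketch models a static-segmented product of two unsigned n-bit operands. The function names `segment` and `ssm_product` are hypothetical, and the code is a behavioral model written under the assumptions of this section (control segment given by the n − m MSBs, left-shift of n − m positions for each selected upper portion), not the HDL of [45].

```python
def segment(x, n, m):
    """Static segmentation of an n-bit unsigned value x.

    Returns (segment, alpha): alpha is the OR of the n-m control bits;
    the m MSBs are taken when alpha = 1, the m LSBs otherwise.
    """
    alpha = int((x >> m) != 0)           # OR of the control segment bits
    if alpha:
        return x >> (n - m), alpha       # upper portion (LSBs discarded)
    return x & ((1 << m) - 1), alpha     # lower portion (no error)


def ssm_product(a, b, n, m):
    """Approximate a*b with an m x m product followed by a left shift."""
    sa, alpha_a = segment(a, n, m)
    sb, alpha_b = segment(b, n, m)
    lsh = (n - m) * (alpha_a + alpha_b)  # overall left shift LSH
    return (sa * sb) << lsh
```

For example, with n = 8 and m = 5, `ssm_product(22, 182, 8, 5)` gives 3872 against an exact product of 4004: the upper portion of 182 is selected and its three LSBs are discarded, whereas 22 fits in the lower portion and is multiplied without error.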

Static Segmentation Applied to the Mantissa Product
In this paragraph, we apply the segmentation to the inputs of the MAA unit to employ an m × m multiplier and an m-bit adder for the mantissa computation. By assuming that $Ma_{ssm}$ and $Mb_{ssm}$ have LSB of weight $2^{0}$, we write the approximate product $P'_{apprx}$ as follows:
$$P'_{apprx} = Ma_{ssm} \cdot Mb_{ssm} \cdot 2^{-2 n_m + LSH} + Ma_{ssm} \cdot 2^{-n_m + LSHa} + Mb_{ssm} \cdot 2^{-n_m + LSHb} \tag{11}$$
Figure 5 depicts the circuit that implements Equation (11). The multiplexer mux LSH, used for the left-shift of the multiplier output, does not allow the multiplier and the adder to be merged in a fused partial product matrix (PPM), and leads to the usage of two cascaded carry-propagate adders (CPAs), one for computing the product $Ma_{ssm} \cdot Mb_{ssm} \cdot 2^{-2 n_m + LSH}$ and one for computing $P'_{apprx}$. Furthermore, $Ma_{ssm} \cdot 2^{-n_m + LSHa}$ and $Mb_{ssm} \cdot 2^{-n_m + LSHb}$ involve up to $2 \cdot n_m$ bits when $\alpha_{Ma} = 1$ and $\alpha_{Mb} = 1$, thus degrading the performance of the adder.
In order to optimize the MAA unit, we analyze Equation (11) considering all the possible combinations of $\alpha_{Ma}$ and $\alpha_{Mb}$. Figure 6 shows the alignments of the signals, reporting the exact MAA for reference (Figure 6a) and the alignments with segmentation in the case $n_m$ = 8 and m = 6 (Figure 6b-d).
(a) If $\alpha_{Ma} = 0$ and $\alpha_{Mb} = 0$, the shifts LSH, LSHa and LSHb are zero, and Equation (11) becomes:
$$P'_{apprx} = Ma_{ssm} \cdot Mb_{ssm} \cdot 2^{-2 n_m} + (Ma_{ssm} + Mb_{ssm}) \cdot 2^{-n_m} \tag{12}$$
As shown in Figure 6b, the product is on 2·m bits, whereas the shifted mantissas imply the usage of an adder on $(n_m + m)$ bits. To employ an m-bit adder, we truncate the product $Ma_{ssm} \cdot Mb_{ssm} \cdot 2^{-n_m}$, discarding the gray LSBs in the figure. Therefore, we approximate Equation (12) as follows:
$$P'_{apprx} \approx \left( \lfloor Ma_{ssm} \cdot Mb_{ssm} \cdot 2^{-n_m} \rfloor + Ma_{ssm} + Mb_{ssm} \right) \cdot 2^{-n_m} \tag{13}$$
where $\lfloor \cdot \rfloor$ is used to represent the floor operator.
In order to implement the calculations in (13) with a fused PPM and a single CPA, we rearrange (13) as follows:
$$P'_{apprx} \approx \left( Ma_{ssm,mult} \cdot Mb_{ssm,mult} + Ma_{ssm,add} + Mb_{ssm,add} \right) \cdot 2^{-n_m} \tag{14}$$
with
$$Ma_{ssm,mult} = \lfloor Ma_{ssm} \cdot 2^{-\lceil n_m/2 \rceil} \rfloor, \quad Mb_{ssm,mult} = \lfloor Mb_{ssm} \cdot 2^{-\lfloor n_m/2 \rfloor} \rfloor, \quad Ma_{ssm,add} = Ma_{ssm}, \quad Mb_{ssm,add} = Mb_{ssm} \tag{15}$$
and $\lceil \cdot \rceil$ is the ceiling operator.
For the sake of simplicity, let us consider the case in which $n_m$ is even. The final floor operation on $Ma_{ssm,mult}$ and $Mb_{ssm,mult}$ of (15) results in truncating $n_m/2$ bits from m-bit operands and, therefore, the computation of (14) requires an $(m - n_m/2) \times (m - n_m/2)$ multiplier.
As can be observed from the formula, we can first compute $Ma_{ssm,mult}$ and $Mb_{ssm,mult}$, which are truncated versions of $Ma_{ssm}$ and $Mb_{ssm}$, and then execute the multiply-and-add operation. In this way, we remove the shift between the multiplier and the adder and design the MAA unit with a fused PPM and a single CPA.
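(b) When $\alpha_{Ma} = 1$ and $\alpha_{Mb} = 0$, we have LSH = LSHa = $n_m - m$ (refer also to (9), (10)). Therefore, (11) becomes
$$P'_{apprx} = \left( Ma_{ssm} \cdot Mb_{ssm} \cdot 2^{-n_m} + Ma_{ssm} + Mb_{ssm} \cdot 2^{-LSH} \right) \cdot 2^{-n_m + LSH} \tag{16}$$
Applying the same reasoning, we need to discard the gray LSBs of $Ma_{ssm} \cdot Mb_{ssm} \cdot 2^{-n_m}$ and of $Mb_{ssm} \cdot 2^{-LSH}$ (see Figure 6c) in order to involve again an m-bit addition. Furthermore, we also need to remove the shift at the output of the multiplier.
The above considerations lead to the following approximation for (16):
$$P'_{apprx} \approx \left( Ma_{ssm,mult} \cdot Mb_{ssm,mult} + Ma_{ssm,add} + Mb_{ssm,add} \right) \cdot 2^{-n_m + LSH} \tag{17}$$
with $Ma_{ssm,mult}$, $Mb_{ssm,mult}$, and $Ma_{ssm,add}$ defined as in (15), and $Mb_{ssm,add}$ given by
$$Mb_{ssm,add} = \lfloor Mb_{ssm} \cdot 2^{-LSH} \rfloor \tag{18}$$
Here, the addend $Mb_{ssm}$ is also truncated, along with the multiplier inputs. Since the expressions of $Ma_{ssm,mult}$ and $Mb_{ssm,mult}$ remain the same, also in this case the computation of (17) requires an $(m - n_m/2) \times (m - n_m/2)$ multiplier.
(c) A similar reasoning applies also to the case $\alpha_{Ma} = 0$ and $\alpha_{Mb} = 1$, with $Ma_{ssm,mult}$, $Mb_{ssm,mult}$, and $Mb_{ssm,add}$ defined as in (15) and $Ma_{ssm,add} = \lfloor Ma_{ssm} \cdot 2^{-LSH} \rfloor$.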
(d) Finally, when $\alpha_{Ma} = 1$ and $\alpha_{Mb} = 1$, we have LSH = $2 \cdot (n_m - m)$ and LSHa = LSHb = $n_m - m$. Therefore, (11) becomes
$$P'_{apprx} = \left( Ma_{ssm} \cdot Mb_{ssm} \cdot 2^{-n_m + LSHa} + Ma_{ssm} + Mb_{ssm} \right) \cdot 2^{-n_m + LSHa} \tag{19}$$
As shown in Figure 6d, we need to truncate $(Ma_{ssm} \cdot Mb_{ssm}) \cdot 2^{-n_m + LSHa}$ for employing an m-bit adder, and to shift the inputs of the multiplier for employing a single CPA. Therefore, (19) is approximated as follows
$$P'_{apprx} \approx \left( Ma_{ssm,mult} \cdot Mb_{ssm,mult} + Ma_{ssm,add} + Mb_{ssm,add} \right) \cdot 2^{-n_m + LSHa} \tag{20}$$
with
$$Ma_{ssm,mult} = \lfloor Ma_{ssm} \cdot 2^{-\lceil (n_m - LSHa)/2 \rceil} \rfloor, \quad Mb_{ssm,mult} = \lfloor Mb_{ssm} \cdot 2^{-\lfloor (n_m - LSHa)/2 \rfloor} \rfloor \tag{21}$$
and $Ma_{ssm,add}$, $Mb_{ssm,add}$ defined as in (15). If, for the sake of simplicity, we consider the case in which $n_m$ and m are even, the final floor operation on $Ma_{ssm,mult}$ and $Mb_{ssm,mult}$ of (21) results in truncating $(n_m - LSHa)/2 = m/2$ bits from m-bit operands; therefore, the computation of (20) requires an (m/2) × (m/2) multiplier. Since $m - n_m/2 < m/2$, overall, the computation of $P'_{apprx}$ requires an (m/2) × (m/2) multiplier.
Figure 7a shows the circuit that implements the static segmented multiply-and-add unit (SSMAA), whereas Table 1 collects the segments of $Ma_{ssm}$, $Mb_{ssm}$ obtained with the segment-and-truncate approach. The multiplexers on Ma and Mb perform the segmentation of Table 1 at the input of both the multiplier and the adder. Then, a further multiplexer applies the final shift on $P'_{ssm}$ in order to express the result $P'_{apprx}$ on $(2 \cdot n_m + 2)$ = 48 bits.
It is worth mentioning that $P'_{apprx}$ is subsequently quantized in the normalization process. It follows that the rounding allows us to reduce the size of the final multiplexer, since the result can be expressed on $(2 \cdot n_m + 2 - n_q)$ bits, $n_q$ being the number of discarded LSBs ($n_q$ = 22 for the single precision FPM).
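The four cases of the segment-and-truncate approach can be gathered in a compact behavioral model. The Python sketch below is a hedged illustration, not the circuit of Figure 7a: the function name `ssmaa` is hypothetical, both $n_m$ and m are assumed even (so that the truncations stated in the text, $n_m/2$ or m/2 bits on the multiplier inputs, apply equally to both operands), the toy sizes $n_m$ = 8 and m = 6 of Figure 6 are used as defaults, and the final quantization by $n_q$ bits is omitted.

```python
def ssmaa(ma, mb, nm=8, m=6):
    """Approximate P' = Ma*Mb + Ma + Mb with segment-and-truncate.

    ma, mb are nm-bit integers; the result is an integer whose LSB has
    weight 2^(-2*nm), like the exact Ma*Mb + ((Ma + Mb) << nm).
    """
    assert nm % 2 == 0 and m % 2 == 0 and nm // 2 < m < nm

    def seg(x):
        alpha = int((x >> m) != 0)        # OR of the control segment
        return (x >> (nm - m) if alpha else x & ((1 << m) - 1)), alpha

    sa, alpha_a = seg(ma)
    sb, alpha_b = seg(mb)
    sh = nm - m                           # shift of one upper portion
    if alpha_a and alpha_b:
        t = m // 2                        # truncation of multiplier inputs
        add_a, add_b = sa, sb             # addends are not truncated
        out_shift = nm + sh
    else:
        t = nm // 2
        # With mixed flags, the addend taken from the lower portion
        # is truncated by sh bits to keep an m-bit addition.
        add_a = sa >> sh if (alpha_b and not alpha_a) else sa
        add_b = sb >> sh if (alpha_a and not alpha_b) else sb
        out_shift = nm + (sh if (alpha_a or alpha_b) else 0)
    # Single multiply-and-add, followed by one final (hard-wired) shift.
    return ((sa >> t) * (sb >> t) + add_a + add_b) << out_shift
```

Since every step only discards bits, the value returned by this model never exceeds the exact P'. Note that the inner product involves operands of at most m/2 bits, in agreement with the multiplier size derived above.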

Error Analysis and Correction
The approximation errors that affect the proposed SSMAA are due to the discarding of the least significant parts of Ma and Mb (when the segmentation selects the upper segments), and due to the truncations used to employ the m-bit adder and a unique CPA. It follows that the largest error arises when α Ma = 1, α Mb = 1.
Estimating the error $E = P' - P'_{apprx}$ can help in improving the accuracy of the SSMAA unit. The idea is to compute the SSMAA result as $P'_{apprx,c} = P'_{apprx} + E^{*}$, where $E^{*}$ is a term which approximates E.
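To study the error, we focus on the case $\alpha_{Ma} = 1$, $\alpha_{Mb} = 1$ by considering
$$P'_{apprx} = UPa_{mult} \cdot UPb_{mult} \cdot 2^{-2m'} + (UPa_{add} + UPb_{add}) \cdot 2^{-m} \tag{22}$$
where $UPa_{mult}$, $UPb_{mult}$, $UPa_{add}$, and $UPb_{add}$ are the quantities selected for the SSMAA, and m' = m − m/2 = m/2. In this discussion we consider an even value of m for the sake of simplicity, in order to have the same truncation for both $Ma_{ssm,mult}$ and $Mb_{ssm,mult}$.
Additionally, the quantities pruned for the segmentation are:
$$\varepsilon_{Ma,mult} = Ma - UPa_{mult} \cdot 2^{-m'}, \qquad \varepsilon_{Ma,add} = Ma - UPa_{add} \cdot 2^{-m} \tag{23}$$
with analogous definitions for $\varepsilon_{Mb,mult}$ and $\varepsilon_{Mb,add}$. By writing Ma and Mb as follows for the product:
$$Ma = UPa_{mult} \cdot 2^{-m'} + \varepsilon_{Ma,mult}, \qquad Mb = UPb_{mult} \cdot 2^{-m'} + \varepsilon_{Mb,mult} \tag{24}$$
and as follows for the addition:
$$Ma = UPa_{add} \cdot 2^{-m} + \varepsilon_{Ma,add}, \qquad Mb = UPb_{add} \cdot 2^{-m} + \varepsilon_{Mb,add} \tag{25}$$
we obtain the exact result of the MAA:
$$P' = (UPa_{mult} \cdot 2^{-m'} + \varepsilon_{Ma,mult}) \cdot (UPb_{mult} \cdot 2^{-m'} + \varepsilon_{Mb,mult}) + (UPa_{add} + UPb_{add}) \cdot 2^{-m} + \varepsilon_{Ma,add} + \varepsilon_{Mb,add} \tag{26}$$
Therefore, being $P'_{apprx}$ given by:
$$P'_{apprx} = UPa_{mult} \cdot UPb_{mult} \cdot 2^{-2m'} + (UPa_{add} + UPb_{add}) \cdot 2^{-m} \tag{27}$$
the error E is:
$$E = (UPa_{mult} \cdot \varepsilon_{Mb,mult} + UPb_{mult} \cdot \varepsilon_{Ma,mult}) \cdot 2^{-m'} + \varepsilon_{Ma,mult} \cdot \varepsilon_{Mb,mult} + \varepsilon_{Ma,add} + \varepsilon_{Mb,add} \tag{28}$$
In order to simplify the discussion, let us focus on the most significant term $E^{*}$:
$$E^{*} = (UPa_{mult} \cdot \varepsilon_{Mb,mult} + UPb_{mult} \cdot \varepsilon_{Ma,mult}) \cdot 2^{-m'} \tag{29}$$
Following the approach of [41], we approximate $\varepsilon_{Ma,mult}$ with:
$$\varepsilon_{Ma,mult} \approx ma_{n_m - m' - 1} \cdot 2^{-m'-1} + 2^{-m'-2} \tag{30}$$
and write $UPa_{mult}$ as:
$$UPa_{mult} = \sum_{k=0}^{m'-1} ma_{k + n_m - m'} \cdot 2^{k} \tag{31}$$
A similar expression holds for $\varepsilon_{Mb,mult}$ and for $UPb_{mult}$. Then, substituting (30) and (31) in (29), the error becomes:
$$E^{*} \approx \sum_{k=0}^{m'-1} e_{k + n_m - m'} \cdot 2^{k - 2m' - 2} \tag{32}$$
with:
$$e_{k + n_m - m'} = 2 \cdot (ma_{k + n_m - m'} \cdot mb_{n_m - m' - 1} + mb_{k + n_m - m'} \cdot ma_{n_m - m' - 1}) + ma_{k + n_m - m'} + mb_{k + n_m - m'} \tag{33}$$
Approximating $e_{k + n_m - m'}$ with:
$$e^{*}_{k + n_m - m'} = 4 \cdot (ma_{k + n_m - m'} \cdot mb_{n_m - m' - 1} \;\mathrm{OR}\; mb_{k + n_m - m'} \cdot ma_{n_m - m' - 1}) \tag{34}$$
for further simplification, we obtain:
$$E^{*} \approx \sum_{k=0}^{m'-1} e^{*}_{k + n_m - m'} \cdot 2^{k - 2m' - 2} \tag{35}$$
In the case of m odd, a different approximate expression is considered for $e^{*}_{k + n_m - m'}$. Approximating the summation in (35) with two or three terms allows us to sufficiently reduce the approximation error in the SSMAA. Figure 7b depicts the scheme of the proposed corrected SSMAA (cSSMAA in the following). The correction term $E^{*}$, highlighted in red, can be directly fused in the PPM, thus implying a minimum impact on the hardware performance.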

Error Metrics Analysis
We exploit some of the common error metrics to verify the performance of the proposed SSFPM. Let us define the exact and the approximate result of the floating-point multiplier as C and $C_{apprx}$, respectively. The approximation error is given by $E_{FMP} = C - C_{apprx}$, while the error distance is $ED = |E_{FMP}|$. The normalized mean error and the normalized mean error distance are $NM = \mathrm{mean}(E_{FMP})/C_{max}$ and $NMED = \mathrm{mean}(ED)/C_{max}$, respectively, where mean(·) is the average operator and $C_{max} = 2^{128}$ is the maximum value of C. The mean relative error distance is $MRED = \mathrm{mean}(ED/C)$, whereas the normalized maximum error distance is defined as $NmaxED = ED_{max}/C_{max}$, $ED_{max}$ being the maximum value of ED.
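The metrics defined above can be evaluated with a few lines of code. The Python sketch below is only an illustration: the function name `error_metrics` and the dictionary keys are hypothetical, and the relative error is assumed to be computed with respect to |C| so that negative products give a positive distance (the text writes ED/C).

```python
def error_metrics(exact, approx, c_max=2.0 ** 128):
    """Compute NM, NMED, MRED and NmaxED for two equal-length sequences.

    exact  : exact products C (all assumed non-zero)
    approx : approximate products C_apprx
    c_max  : normalization constant (2^128 for single precision)
    """
    n = len(exact)
    errors = [c - ca for c, ca in zip(exact, approx)]   # E = C - C_apprx
    ed = [abs(e) for e in errors]                       # error distance
    return {
        "NM": sum(errors) / n / c_max,
        "NMED": sum(ed) / n / c_max,
        # Assumption: relative error taken with respect to |C|.
        "MRED": sum(d / abs(c) for d, c in zip(ed, exact)) / n,
        "NmaxED": max(ed) / c_max,
    }
```

In a simulation campaign, `exact` and `approx` would hold the outputs of the exact and of the approximate multiplier for the same random input pairs.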
We compute the error metrics by simulating the SSFPM with $10^{7}$ pairs of random inputs lying in the whole range of representation (that is, about $[-2^{128}, 2^{128}]$). Figure 8 represents the behavior of NMED and MRED as a function of the parameter m, used to define the accuracy of the segmentation. In the corrected version of the SSFPM, two terms are used to approximate (35). As expected, increasing m allows us to improve both NMED and MRED, which decrease from $1.32 \times 10^{-5}$ to $3.16 \times 10^{-7}$ and from $3.41 \times 10^{-3}$ to $7.96 \times 10^{-5}$, respectively. The figure also highlights the beneficial effects of the correction technique, since NMED and MRED are about halved in the corrected case. Table 2 collects the error metrics of the proposed SSFPM and cSSFPM in the cases m = 12, 14, 16, and 18. For the sake of comparison, we also show the accuracy of the approximate FPMs obtained by exploiting the techniques of [22,42,43] for the computation of the product Ma·Mb in (5), and by implementing the proposal of [17].
In [42] (referred to as TOSAM in the table), the authors devise the usage of a dynamic segmentation to perform multiplication with good precision and optimized power and area. Here, the multiplication is revisited as a multiply-and-add operation, with the multiplicands truncated to h bits after the leading one. The addends are also truncated, since h + 4 LSBs are discarded.
The work [43] (referred to as DRUM in the table) explores dynamic segmentation, selecting a segment of k bits (comprising the leading one bit and a correction bit at the least significant position) for the multiplication. Then, a barrel shifter extends the result to the desired number of bits. The HDL description of the DRUM multiplier [43] is available in [46]. The technique of [22] (referred to as DATE17 in the table) organizes the PPM in groups of L rows. Then, the rows of each group are compressed by means of L-input OR gates. In [17] (referred to as AFMB in the table), a modified version of the Mitchell algorithm is employed to compute the product of (3), with the input signals truncated to t bits. Here, since the leading one bit always corresponds to the implicit bit, no leading one detectors and barrel shifters are used, with beneficial effects on power and area.
It is also worth noting that the correction technique allows us to reduce the NmaxED since it lowers the maximum error of the segmented multiplier.

Electrical Performances
We synthesize the proposed segmented floating-point multipliers and the state-of-the-art ones with a physical flow in TSMC 28 nm CMOS technology using Cadence Genus, with a clock period $T_{clk}$ of 500 ps and using a standard threshold voltage cell library. Furthermore, the exact FPM described in Section 2.2. is implemented for reference. In the implementation of FPMs, pipeline levels are often inserted to shorten the critical path. In our case, we employ a single pipeline level between the arithmetic stage and the normalization logic in all the investigated FPMs.
The power and area are obtained from post-synthesis analyses, with power consumption computed by simulating the synthesized netlist at 1 GHz. To this aim, SDF and TCF format files are employed to annotate the path delays and the toggle activity of the signals. In addition, we also study the minimum clock period employable for each FPM, corresponding to the minimum clock period that ensures positive slack.
As shown in Table 3, the proposed SSFPM and cSSFPM are also competitive from a hardware point of view, with area and power reductions of up to 82.3% and 85.5%, obtained with SSFPM and m = 12. The corrected circuits exhibit only a slight worsening of the performance with respect to the uncorrected ones, with degradations of area and power in the range of 1-3%. The minimum $T_{clk}$ improves with respect to the exact implementation, with the multiplier with m = 12 exhibiting the best speed.
The FPMs with [22,42,43] show poorer performance, with area improvement limited to 47.4%, 50.2%, and 60%, respectively, and power saving up to 55.1%, 50.9% and 65%. In particular, [22] with L = 2, which offers the best accuracy, exhibits limited improvements, with area and power reductions of 35.2% and 47.5%. The minimum $T_{clk}$ worsens with [42,43], increasing up to 25% and 35%, respectively. On the contrary, the minimum $T_{clk}$ improves up to 30% in the case of [22]. The work [17] shows the best hardware performance, with area reduction around 94% and power saving up to 95.7%. Moreover, its minimum $T_{clk}$ exhibits the best improvement (up to 70% with t = 18). This performance is due to the realization of the multiplication by means of a truncated adder in the logarithmic number system. However, looking at the data of Table 2, we can conclude that these electrical features are achieved at the price of an accuracy loss.

Image Filtering
We verify the validity of our proposal by exploring the performance of the segmented FPMs in an image filtering application. The image filtering performs the following operation on the input image I:
$$I_f(x, y) = \sum_{i=-d}^{d} \sum_{j=-d}^{d} h(i, j) \cdot I(x + i, y + j) \tag{37}$$
where h is the kernel matrix of the filter, with size (2d + 1) × (2d + 1).
In this example, we consider a smoothing application with gaussian kernel of size 5 × 5 and standard deviation 2, whose coefficients are floating-point numbers with values normalized to 1. The Matlab command fspecial('gaussian', 5, 2) allows us to obtain the filter mask. Then, the products in (37) are realized by means of our approximate FPMs and the state-of-the-art.
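The smoothing experiment can be mimicked in software by routing every kernel product through a replaceable multiplier. The Python sketch below is an illustration under stated assumptions: the function names `gaussian_kernel` and `filter_image` and the `mul` argument are hypothetical, the kernel is built as a sampled Gaussian normalized to unit sum (which is intended to mirror the Matlab command quoted above, to the best of our understanding), and border pixels are simply left at zero.

```python
import math


def gaussian_kernel(size=5, sigma=2.0):
    """Sampled Gaussian of the given size, normalized to unit sum."""
    d = size // 2
    w = [[math.exp(-(i * i + j * j) / (2.0 * sigma * sigma))
          for j in range(-d, d + 1)] for i in range(-d, d + 1)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def filter_image(img, h, mul=lambda a, b: a * b):
    """Filter a 2-D list of pixels with kernel h.

    mul is the multiplier under test: pass an approximate FP multiplier
    model here to evaluate its impact (exact product by default).
    Border pixels, where the kernel does not fit, are left at 0.0.
    """
    d = len(h) // 2
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for x in range(d, rows - d):
        for y in range(d, cols - d):
            out[x][y] = sum(mul(h[i + d][j + d], img[x + i][y + j])
                            for i in range(-d, d + 1)
                            for j in range(-d, d + 1))
    return out
```

Comparing the output obtained with an approximate `mul` against the output of the default exact product yields the data needed for quality indices such as PSNR and SSIM.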
As a second example, we analyze the performance in edge detection. In this case, the gradient G of the original image is computed to highlight its edges. To this aim, the Sobel kernel $h_{Sobel}$ and its transpose $h^{T}_{Sobel}$ are used to compute the x and the y components of G (named $G_x$ and $G_y$, respectively). Then, the gradient G is computed from $G_x$ and $G_y$, and the investigated FPMs are used to implement the multiplications in (38), (39). Table 4 collects the results in terms of structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR, in dB), obtained by filtering three images (Lena, Cameraman, and Lady). For each example, we average the SSIM and the PSNR obtained by processing the test images, and we report the overall average SSIM and PSNR as a synthetic parameter in the last column of the table.
As shown in Table 4, the segmented FPMs achieve SSIM very close to 1 and PSNR up to 70 dB with gaussian filtering and produce the exact result with m > 12 in the edge detection. The correction technique allows us to improve the PSNR, with a maximum increment of 5.4 dB in the case m = 14, smoothing application.
Moreover, the implementations with [22,42,43] L = 4, 6 achieve good results, with an SSIM very close to 1 and PSNR values up to 63.1 dB in the Gaussian case. Similarly, the PSNR exceeds 60 dB with [42,43].
Figure 9 shows the results of the edge detection using our SSFPMs with m = 12, 18 and the implementations from [17]. We report the negative of G to better highlight the computed edges. The images obtained with the proposed multipliers are practically unchanged with respect to the exact one (as also expected from the high values of SSIM and PSNR). The results from [17] depend on the truncation t, with a noticeable degradation of the detection in the background with t = 18 (again confirmed by the lower values of SSIM and PSNR).

Figure 9. Results of the edge detection for the Cameraman image.

JPEG Compression
JPEG compression leverages the limits of human perception to reduce the bit-volume of images. The compression algorithm exploits the discrete cosine transform (DCT), applied to disjoint blocks of 8 × 8 pixels, and quantizes the transformed image with a variable resolution. The lower-frequency components, more visible to the human eye, are approximated with a finer quantization step, whereas the high-frequency components, less noticeable to the human eye, are approximated with a coarser quantization step.
In addition, a quality factor Q, defined in the range [0, 100], allows us to further modify the quantization accuracy and, as a consequence, the compression, with Q = 0 implying the heaviest compression and Q = 100 the lightest. Then, the quantized transformed image is brought back to the original domain by means of the inverse discrete cosine transform (iDCT). In the algorithm, the DCT and the iDCT require multiplications between real numbers and are therefore suitable to verify the validity of our proposal in a concrete scenario.
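The transform, quantization, and inverse transform of one 8 × 8 block can be sketched as below. This is an illustrative model under stated assumptions: it uses the orthonormal DCT-II computed as two matrix products, a caller-supplied quantization table (the mapping from Q to the table is not reproduced here), and a hypothetical `mul` hook where an approximate multiplier model could be inserted.

```python
import math

N = 8  # JPEG block size

# Orthonormal DCT-II basis: C[u][x] = a(u) * cos((2x + 1) * u * pi / (2N))
C = [[(math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N))
      for x in range(N)]
     for u in range(N)]

def exact_mul(a, b):
    return a * b

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b, mul):
    """N x N matrix product in which every scalar product goes through mul."""
    return [[sum(mul(a[i][k], b[k][j]) for k in range(N)) for j in range(N)]
            for i in range(N)]

def dct2(block, mul=exact_mul):
    """2-D DCT of an 8x8 block: C * block * C^T."""
    return matmul(matmul(C, block, mul), transpose(C), mul)

def idct2(coeffs, mul=exact_mul):
    """2-D inverse DCT: C^T * coeffs * C."""
    return matmul(matmul(transpose(C), coeffs, mul), C, mul)

def quantize(coeffs, table):
    """Round each coefficient to the nearest multiple of its table entry."""
    return [[round(c / q) * q for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, table)]
```

For a constant block of value 128 the only nonzero coefficient is the DC term, equal to 8 × 128 = 1024, which gives a quick sanity check of the basis normalization.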
For the performance assessment, we approximate the DCT and the iDCT by using the proposed segmented FPMs and the state-of-the-art ones, considering the cases Q = 40, Q = 70, and Q = 100. Table 5 collects the results, again expressed in terms of SSIM and PSNR, obtained by compressing three grayscale images: Lena, Cameraman, and Peppers (SSIM and PSNR are computed relative to the image compression performed with the exact multiplier). In this case, we also report the mean SSIM and the mean PSNR for each Q, obtained by processing the three images, and indicate the overall average SSIM and PSNR in the last column of the table. Figure 10 reports the compressed Peppers images obtained with the segmented multipliers for Q = 40.
As can be observed, our segmented multipliers again ensure an SSIM very close to 1 in all cases, and a PSNR that ranges between 47 dB and 63 dB on average. Increasing m improves the quality of the results, while the correction technique leads to a remarkable PSNR increment, especially for small values of m (up to +8 dB in the case m = 12, Q = 100). In addition, the performance is almost constant with respect to the quality factor Q. The designs with [22,42,43] L = 4, 6 exhibit lower accuracy, with the average PSNR limited to 53 dB, whereas [22] with L = 2 gives the most accurate compression, with a PSNR of 71.4 dB on average.

Tone Mapping of HDR Images
As a last example, we employ the investigated multipliers for a tone mapping application on HDR images. An HDR image exploits floating-point pixels to represent a high dynamic range of luminance. A mapping operation is needed to adapt the high dynamic range of luminance to a lower range of values, whenever required by the application. The algorithm devised in [44] exploits the following formula to perform tone mapping: The L(i,j) is a pixel of the luminance matrix of the image, obtained by executing the following steps: where R, G, B are the three channels of the input HDR image, N is the number of pixels in a channel, L m is the geometric mean of L tmp , and β is a value in the range [0, 1]. Applying (40) allows us to properly scale the luminance, since large values of L(i,j) are normalized to 1, whereas small values of L(i,j) are practically unmodified. Then, the channels of the original image are weighted by L(i,j) as follows: and quantized to integer values in the range [0, 255].
As in the previous demonstrations, the approximate tone mapping is obtained using the approximate FPMs in the multiplications in (41) and (42). In our trials, we set β = 0.5. Table 6 collects the results of the comparison between the exact and the approximate algorithm, again expressed in terms of SSIM and PSNR; the processed HDR images are Oxford Church, Office and Bottles_small. Figure 11 depicts the result for the Bottles_small image.
Figure 11. Example of tone mapping with the proposed segmented multipliers and the state-of-the-art.
In this case, the SSIM achieved with the segmented multipliers is also very close to 1, and the PSNR ranges between 46 dB and 64 dB on average. The results are comparable with [42,43], whereas the implementations [17,22] L = 6 exhibit lower performance (with PSNR up to 32.6 dB on average).

Discussion
The static segmentation applied to the MAA operation of (5) allows us to reduce power, delay, and area of the FPM while preserving remarkable accuracy performances. This is mainly due to the reduction in the input bit-width in the MAA unit. In addition, the proposed shift-and-truncate technique allows us to realize a fused PPM for the MAA unit with a unique CPA, with beneficial effects on the hardware performances.
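The multiply-and-add view of the mantissa product rests on the identity (1 + Ma)(1 + Mb) = 1 + Ma + Mb + Ma·Mb, which keeps the implicit bit out of the multiplier. The sketch below checks this on integer fraction fields; the 23-bit width corresponds to single precision, while the function names and the `mul` hook (where an approximate Ma·Mb could be substituted) are assumptions made for illustration.

```python
F = 23  # fraction bits of the single-precision significand

def mantissa_product_exact(fa, fb):
    """Product of the full significands (implicit bit included), scaled by 2**(2F)."""
    return ((1 << F) + fa) * ((1 << F) + fb)

def mantissa_product_maa(fa, fb, mul=lambda a, b: a * b):
    """Same product as 1 + Ma + Mb + Ma*Mb.

    Only the F-bit fraction fields enter the multiplier `mul`; the implicit
    bits are recovered by the additions.
    """
    return (1 << (2 * F)) + ((fa + fb) << F) + mul(fa, fb)
```

With the exact `mul`, both functions return the same integer for every pair of fraction fields, so any approximation error comes only from the Ma·Mb term.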
At the same time, the SSFPM exhibits very good accuracy since (i) the approximation is applied only to the mantissa computation and (ii) the employed approximation, based on static segmentation of the fixed-point multiplier needed in mantissa computation, provides a small relative error. Indeed, the SSM approach introduces an error only when large values are represented, whereas small values are not approximated. This leads to good error performances that are suitable for the implementation of a floating-point multiplier.
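The error behaviour described above can be illustrated with a generic static segmented multiplier. This sketch is the textbook scheme (select either the m least significant or the m most significant bits of each n-bit operand, then shift the small product back), not the shifter-free segment-and-truncate variant proposed in the paper; the names and the choice n = 23, m = 12 are assumptions.

```python
def segment(x, n, m):
    """Return (m-bit segment, shift) for an n-bit unsigned operand.

    Assumes m >= n / 2. Small operands (upper n - m bits all zero) are kept
    exactly; larger ones lose their n - m least significant bits.
    """
    if x >> m == 0:
        return x, 0
    return x >> (n - m), n - m

def ssm(a, b, n=23, m=12):
    """Static segmented product: an m x m multiplication plus a final shift."""
    seg_a, shift_a = segment(a, n, m)
    seg_b, shift_b = segment(b, n, m)
    return (seg_a * seg_b) << (shift_a + shift_b)
```

In this scheme each truncated operand loses less than 2^(n−m), so the absolute error of the product stays below 2^(n−m)·(a + b). In a floating-point multiplier this error is small relative to the full mantissa result, which is at least 1.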
The multipliers that exploit [42,43] benefit from dynamic segmentation to approximate the product Ma·Mb in (5). These solutions achieve satisfactory error performance, as also demonstrated in the image processing applications: the error metrics are comparable with those of the SSFPM, and the SSIM and PSNR values confirm the quality of these solutions. On the other hand, shifters are required between the multiplier and the adder of the MAA, thus implying the usage of two CPAs. This leads to a worsening of the hardware performance with respect to the SSFPM, as demonstrated by the lower power and area savings, limited to 55% and 50%, respectively. Furthermore, the minimum T clk is larger due to the shifters placed between the multiplier and the adder of the MAA unit.
The FPMs with the approximation of [22] exhibit performance that strongly depends on L, with the accuracy worsening as L increases. Area and power reductions are limited in the case L = 2 (35.2% and 47.5%, respectively), whereas an improvement is registered with L = 4, 6 (up to 59.7% and 65.0%, respectively) at the cost of precision.
The proposal of [17] exhibits the best hardware performance, with area and power savings that both exceed 90%. These improvements are due to the usage of an adder in place of a multiplier for the realization of the mantissa product. In addition, leading-one detectors and barrel shifters are not used in this case, since the position of the implicit bit is always known. However, approximating the product with a logarithmic sum leads to a larger error, due to the logarithm approximation. In addition, the adder is also truncated for further optimization, with a consequent accuracy loss.
As part of a joint assessment of hardware performance and accuracy, we show the power and the area saving versus the NMED and MRED of each FPM in Figure 12. The black dotted line indicates the Pareto front. For NMED < 10^−5 the proposed SSFPMs outperform [22,42,43] with L = 4, 6, exhibiting a power saving greater than 70% and an area improvement larger than 60%. The cSSFPMs also define the Pareto front in that region of the graph. Similar observations also apply to the case MRED < 10^−2. The figures show again that [17] performs better from a hardware point of view, but at the cost of a loss of accuracy (NMED > 8 × 10^−5 and MRED > 2 × 10^−2). Similarly, [22] with L = 2 has the best accuracy, but at the cost of a degradation of power and area.
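The two error metrics used in this comparison can be estimated with a short Monte Carlo sketch. It assumes the definitions commonly used in the approximate-computing literature (NMED as the mean error distance divided by the largest exact output, MRED as the mean of the relative error distances). The `truncated_mul` function is a simple stand-in that truncates each significand to m fractional bits; it is not a model of the SSFPM, and all names are hypothetical.

```python
import math
import random

def truncated_mul(a, b, m=12):
    """Stand-in approximate multiplier: keep m fractional bits of each significand."""
    def trunc(x):
        if x == 0.0:
            return 0.0
        frac, exp = math.frexp(x)        # x = frac * 2**exp, 0.5 <= |frac| < 1
        scale = 1 << (m + 1)             # frac carries the leading 1 plus m bits
        return math.ldexp(math.trunc(frac * scale) / scale, exp)
    return trunc(a) * trunc(b)

def sample_pairs(count, seed=0):
    """Reproducible operand pairs in [1, 2)."""
    rng = random.Random(seed)
    return [(rng.uniform(1.0, 2.0), rng.uniform(1.0, 2.0)) for _ in range(count)]

def error_metrics(mul, pairs):
    """Return (NMED, MRED) of `mul` against the exact product on the given pairs."""
    exact = [a * b for a, b in pairs]
    dist = [abs(mul(a, b) - e) for (a, b), e in zip(pairs, exact)]
    nmed = (sum(dist) / len(dist)) / max(abs(e) for e in exact)
    mred = sum(d / abs(e) for d, e in zip(dist, exact) if e != 0) / len(dist)
    return nmed, mred
```

For this stand-in, each operand has a relative truncation error below 2^−m, so the MRED stays below 2^(1−m). Plotting such metrics against power and area figures gives the kind of trade-off view summarized by the Pareto front.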
Therefore, we can conclude that our proposal offers the best trade-off between hardware improvement and accuracy of results.

Conclusions
In this paper we propose a novel low-power approximate floating-point multiplier, based on static segmentation. To optimize hardware performances, the mantissa product is first revisited as a multiply-and-add operation. In this way the implicit bit is excluded from the computation to reduce the complexity of the multiplier, and additive logic is introduced to recover the exact result.
Then, a segmentation scheme is applied to the mantissas, to reduce the size of the multiplier. The proposed technique leverages a segment-and-truncate approach to eliminate the left-shift operation at the output of the multiplier. In this way, we can realize the mantissa multiplier by means of a fused partial product matrix and a single carry-propagate adder, with beneficial effects on the hardware performance. In addition, a correction term is introduced to reduce the approximation error due to the segmentation. The accuracy of the circuit can be tailored at design time by acting on a single parameter, m.

The analysis of the error metrics shows that the proposed floating-point multiplier is competitive with the state-of-the-art for values of m in the range 12 to 18 (in the considered case of the single-precision floating-point format). Syntheses in 28 nm CMOS reveal a remarkable reduction in power consumption and area, with the best results achieved with m = 12 (up to 85% power saving and up to 82% area reduction compared to the exact floating-point multiplier).
Implementations of several image processing algorithms (JPEG compression, image filtering, tone mapping of HDR images) show the effectiveness of the proposed architecture in real applications.
From a joint analysis of electrical performance and error metrics, the proposed approximate floating-point multiplier outperforms the state-of-the-art, exhibiting the best trade-off between hardware improvements and quality of results.