Article

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

by
E. George Walters III
Department of Electrical and Computer Engineering, Penn State Erie, The Behrend College, 5101 Jordan Road, Erie, PA 16563, USA
Electronics 2017, 6(4), 101; https://doi.org/10.3390/electronics6040101
Submission received: 31 August 2017 / Revised: 30 October 2017 / Accepted: 17 November 2017 / Published: 22 November 2017

Abstract

Multiplication by a constant is a common operation for many signal, image, and video processing applications that are implemented in field-programmable gate arrays (FPGAs). Constant-coefficient multipliers (KCMs) are often implemented in the logic fabric using lookup tables (LUTs), reserving embedded hard multipliers for general-purpose multiplication. This paper describes a two-operand addition circuit from previous work and shows how it can be used to generate and add pre-computed partial products to implement KCMs. A novel method for pre-computing partial products for KCMs with a negative constant is also presented. These KCMs are then extended to have two to eight coefficients that may be selected by a control signal at runtime to implement time-multiplexed multiple-constant multiplication. Synthesis results show that proposed pipelined KCMs use 27.4% fewer LUTs on average and have a median LUT-delay product that is 12% lower than comparable LogiCORE IP KCMs. Proposed pipelined KCMs with two to eight selectable coefficients use 46% to 70% fewer LUTs than the best LogiCORE IP based alternative and most are faster than using a LogiCORE IP multiplier with a coefficient lookup function. They also outperform the state-of-the-art in the literature, using 22% to 57% fewer slices than the smallest pipelined adder graph (PAG) fusion designs and operate 7% to 30% faster than the fastest PAG fusion designs for the same operand size and number of selectable coefficients. For KCMs and KCMs with selectable coefficients of a given operand size, the placement and routing of LUTs remains the same for all positive and negative constant values, which is advantageous for runtime partial reconfiguration.

1. Introduction

Field-programmable gate arrays (FPGAs) are often used for computationally intensive applications such as digital-signal processing (DSP), video and image processing, and artificial neural network (ANN) based applications such as machine learning and artificial intelligence. For these applications and others, multiplication is the dominant operation in terms of required resources, delay and power consumption. In many cases, one of the operands is a constant and the multiplier is called a constant-coefficient multiplier (KCM). Most contemporary FPGAs have embedded hard multipliers distributed throughout the fabric due to the importance of multiplication. Even so, soft KCMs based on lookup tables (LUTs) in the configurable logic fabric are often used for high-performance designs for several reasons:
  • Embedded multiplier operands are fixed in size and type, such as 25 × 18 two’s complement, while LUT-based KCM operands can be any size or type;
  • The number and location of embedded multipliers are fixed, while LUT-based KCMs can be placed anywhere and the number is limited only by the size of the reconfigurable fabric;
  • Embedded multipliers cannot be modified, while LUT-based KCMs can use techniques such as merged arithmetic [1] and approximate arithmetic [2] to optimize the overall system.
One approach to designing a KCM is to build lookup tables of pre-computed partial products, indexed by one or more bits of the variable operand, and sum them to produce the product. Chapman’s KCM algorithm uses LUT-based lookup tables to generate radix-16 partial products, specifically targeting Xilinx FPGAs with 4-input LUTs [3,4]. Wirthlin generalizes this approach and presents a method to merge the lookup with addition logic that is also specific to Xilinx FPGAs with 4-input LUTs [5]. Hormigo et al. extend Wirthlin’s work to include runtime self-reconfiguration [6]. These approaches target FPGA implementations.
Another approach to designing a KCM is to sum shifted copies of the variable operand that correspond to non-zero digits of the constant. Canonical signed digit (CSD) recoding gives a structure that requires at most $m/2$ and on average $m/3$ add/subtract operations, where $m$ is the number of bits in the constant [7]. Sub-expressions can be shared to further reduce the number of add/subtract operations [8,9]. Turner and Woods present a technique to design reduced coefficient multipliers (RCMs) that operate on a limited set of coefficients [10], exploiting the observation that LUTs used to implement add/subtract operations have unused inputs. This is also known as time-multiplexed multiple-constant multiplication, where a variable input is multiplied by one of several constants selected by a control input to produce a single output. Tummeltshammer et al. present an algorithm for time-multiplexed multiple-constant multiplication that fuses the directed acyclic graph (DAG) solutions for multiplication by each constant into a single time-multiplexed DAG [11]; it is useful for finite-impulse response (FIR) filters and other sum-of-product computations. Their work is optimized for application-specific integrated circuit (ASIC) implementations. Kumm et al. present a heuristic they call reduced pipelined adder graph (RPAG) that includes provisions for pipelining, which is especially important for FPGA implementations [12]. Möller et al. extend the RPAG heuristic by applying the fusion concept of Tummeltshammer et al., which they call pipelined adder graph (PAG) fusion [13]. PAG fusion is a heuristic that specifically targets FPGAs and is able to search for opportunities to use three-input (ternary) adders, which are available on recent Xilinx and Altera FPGAs and use roughly the same resources as two-input adders. The work of Möller et al. also incorporates low-level optimizations using primitives for Xilinx FPGAs that use fewer resources than allowing the tools to interpret hardware description language (HDL) models that do not specify primitives.
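The shift-and-add approach can be illustrated with a short Python sketch of CSD recoding. This is a generic textbook recoding (non-adjacent form), not code from any of the cited works; the function names `csd_digits` and `add_sub_count` are made up for this illustration.

```python
def csd_digits(value: int) -> list[int]:
    """Recode a non-negative constant into signed digits in {-1, 0, +1}.

    Digits are returned least-significant first and no two adjacent
    digits are both non-zero (the canonical signed digit property).
    """
    if value < 0:
        raise ValueError("this sketch handles non-negative constants only")
    digits = []
    while value:
        if value & 1:
            # Low bits 01 give +1, low bits 11 give -1 (which carries upward).
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def add_sub_count(value: int) -> int:
    """Add/subtract operations needed to sum the shifted copies of the operand."""
    nonzero = sum(1 for digit in csd_digits(value) if digit)
    return max(nonzero - 1, 0)
```

For example, the constant 7 recodes to $8 - 1$, so multiplying by 7 needs one subtraction of shifted operands instead of two additions.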
This paper describes an approach that uses a novel two-operand addition circuit [14,15,16] that combines generation of a pre-computed partial product with addition of another value, similar to Wirthlin’s work but optimized for Xilinx FPGAs with 6-input LUTs. A novel approach is used for the case where the constant is negative. A design variation for KCMs with two, four or eight selectable coefficients is also presented. The discussion and results focus on the Xilinx 7 Series FPGAs, but the technique is applicable to the Spartan-6, Virtex-5, Virtex-6, UltraScale and newer Xilinx FPGAs that use 6-input LUTs.
The paper is organized as follows. Section 2 discusses relevant FPGA architecture and the two-operand adder used to make the proposed KCMs. Section 3 describes the proposed LUT-based constant-coefficient multipliers. Section 4 extends proposed designs to handle two, four or eight selectable coefficients. Synthesis results are discussed in Section 5 and conclusions are given in Section 6.

2. Background

This section describes details of the Xilinx logic fabric and the proposed two-operand adder.

2.1. FPGA Logic Fabric

The main logic resource for implementing combinational and sequential circuits in a Xilinx FPGA is the configurable logic block (CLB). Each CLB has two slices. Figure 1 is a partial diagram of a 7 Series FPGA slice. Each slice has four 6-input lookup tables (LUT6s) designated A, B, C, and D. Each LUT6 is composed of two 5-input lookup tables (LUT5s) and a 2-to-1 multiplexer. The two LUT5s are 32 × 1 memories that share five inputs designated I5:I1. The memory values are designated M[63:32] in one LUT5 and M[31:0] in the other LUT5. The output of the M[31:0] LUT5 is designated O5. The sixth input, I6, is input to a multiplexer that selects one of the LUT5 outputs. The selected output is designated O6. The LUT6 is normally configured as either two LUT5s with five shared inputs and two outputs by connecting I6 to logic ‘1’, or as one LUT6 with six inputs and one output by connecting the sixth input to I6 [17,18].
A multiplexer (MUXCY) and an XOR gate (XORCY) are associated with each LUT6. Inputs to the MUXCY associated with the A LUT6 are a select signal, $\mathit{prop}_i$, a first data input, $\mathit{gen}_i$, and a second data input, $c_i$. The output of the MUXCY, $c_{i+1}$, is connected to the MUXCY associated with the B LUT6. These connections continue through the C and D LUT6s to form a fast carry chain within the slice. The $c_{i+4}$ output of the slice, COUT, can be connected to the $c_i$ input of the next slice, CIN, to form longer carry chains. The $\mathit{prop}$ signal is driven by the O6 output of the corresponding LUT6. The $\mathit{gen}$ signal is selected by a configuration multiplexer and is either the O5 output of the corresponding LUT6 or the bypass input, which is designated AX, BX, CX, or DX.
Two flip-flops are associated with each LUT6. One flip-flop can be used to register O5 or the bypass input. The other flip-flop can be used to register O5, O6, the bypass input, the MUXCY output, or the XORCY output.

2.2. Proposed Two-Operand Adder

Suppose $X$ and $Y$ are to be added using the Xilinx fast carry logic. For the $i$th column of the adder, $x_i$ and $y_i$ are the bits of $X$ and $Y$, respectively, $c_i$ is the carry-in bit, $c_{i+1}$ is the carry-out bit and $s_i$ is the sum bit. To add $x_i$ and $y_i$, the $\mathit{prop}_i$ signal must be set to $x_i \oplus y_i$ and the $\mathit{gen}_i$ signal can be set to either $x_i$ or $y_i$ [14,16]. If $x_i$ and $y_i$ together are a function of five or fewer inputs, then the LUT6 can be configured as two LUT5s, generating either $x_i$ or $y_i$ at O5 and routing it to $\mathit{gen}_i$, and generating $x_i \oplus y_i$ at O6 to drive $\mathit{prop}_i$. If $x_i$ and $y_i$ together are a function of six inputs, then the LUT6 can be configured to generate $x_i \oplus y_i$ at O6 to drive $\mathit{prop}_i$, and $x_i$ or $y_i$ can be applied to the bypass input and configured to drive the $\mathit{gen}_i$ input. A disadvantage of this configuration is that the bypass flip-flop cannot be used.
Normally, a LUT6 can be used to either generate a function of six inputs at O6 or to generate two functions of five inputs at O5 and O6 [17,18]. However, in some cases, one function of six variables can be output at O6 and a separate function of five shared variables can be output at O5. Suppose $x_i$ is a function of one variable connected to I6 and $y_i$ is a function of five variables connected to I5:I1. The function $y_i$ is stored in M[31:0], so $y_i$ is output at O5. If $x_i$ is ‘0’, $y_i$ is also output at O6. If $x_i$ is ‘1’, the function stored in M[63:32] is output at O6. If $\overline{y_i}$ is stored in M[63:32], then $x_i \oplus y_i$ is generated at O6 and $y_i$ is generated at O5. This can be used to add $x_i$ and $y_i$ without using the bypass input when $x_i$ is a function of one variable and $y_i$ is a function of up to five variables. Figure 2 shows the connections for this configuration. This frees the bypass input to be connected to the bypass flip-flop to implement additional registers. Input I6 has the shortest delay path and I1 has the longest [17], so this method also allows faster inputs to be used. The carry into the proposed adder, $c_0$, can be used to implement subtraction or to add an extra bit to the least significant column.
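The LUT6 configuration described above can be checked with a small behavioural model in Python. This is only a sketch under stated assumptions: the LUT6 is modelled as a 64-entry list, the generated operand is assumed to be `constant * v` for a 5-bit input `v` (an arbitrary choice for illustration), and all function names are hypothetical rather than Xilinx primitives.

```python
def make_lut6_memory(y_truth_table):
    """M[31:0] holds y as a function of I5:I1; M[63:32] holds its complement."""
    low = list(y_truth_table)            # 32 entries, output at O5
    high = [1 - value for value in low]  # selected at O6 when I6 = 1
    return low + high


def lut6(memory, i6, i5_to_i1):
    """Return (O5, O6) for a LUT6 with I6 and a 5-bit index on I5:I1."""
    o5 = memory[i5_to_i1]
    o6 = memory[32 * i6 + i5_to_i1]
    return o5, o6


def add_generated_operand(x, v, constant, width, carry_in=0):
    """Compute x + constant * v with one modelled LUT6 per column.

    Bit i of x drives I6; the 5-bit value v drives I5:I1 and selects the
    pre-computed bit y_i. O5 acts as gen_i and O6 acts as prop_i.
    """
    carry = carry_in
    total = 0
    for i in range(width):
        truth = [((constant * index) >> i) & 1 for index in range(32)]
        gen, prop = lut6(make_lut6_memory(truth), (x >> i) & 1, v)
        total |= (prop ^ carry) << i       # XORCY produces the sum bit
        carry = carry if prop else gen     # MUXCY produces the next carry
    return total | (carry << width)
```

With `width=8` and `constant=5`, the model returns `x + 5 * v` for every 8-bit `x` and 5-bit `v`, which confirms that storing $\overline{y_i}$ in the upper half of the memory yields the required propagate signal.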

3. Proposed Constant-Coefficient Multipliers

This section describes how the proposed constant-coefficient multipliers (KCMs) are implemented and pipelined.

3.1. Radix-2 Multiplication by a Constant

Suppose $A$ is an $m$-bit constant, $B$ is an $n$-bit variable and $P = A \cdot B$ is to be computed. If $A$ and $B$ are unsigned integers, then
$$A = \sum_{i=0}^{m-1} a_i \cdot 2^{i}, \tag{1}$$
$$B = \sum_{j=0}^{n-1} b_j \cdot 2^{j}, \tag{2}$$
and the product is
$$P = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} a_i b_j \cdot 2^{i+j}. \tag{3}$$
If $A$ is positive and $B$ is signed, then
$$A = \sum_{i=0}^{m-1} a_i \cdot 2^{i}, \tag{4}$$
$$B = -b_{n-1} \cdot 2^{n-1} + \sum_{j=0}^{n-2} b_j \cdot 2^{j}, \tag{5}$$
and the product can be computed using Baugh and Wooley’s approach [19] as
$$P = \sum_{i=0}^{m-1} \sum_{j=0}^{n-2} a_i b_j \cdot 2^{i+j} + \sum_{i=0}^{m-1} \overline{a_i b_{n-1}} \cdot 2^{i+n-1} + 2^{m+n-1} + 2^{n-1}. \tag{6}$$
Figure 3 shows a $(6 \times 6)$-bit KCM, where $A$ is a positive constant and $B$ is a two’s-complement variable as described by Equation (6). The least-significant column has a weight of $2^{0}$ to simplify equations and column references, but the results in this work are applicable to fixed-point multipliers by applying appropriate shifts and placement of the binary point.
If $A$ is negative, it could be coded in two’s complement form and Baugh and Wooley’s approach could be used to develop an equation for the product. $A$ would have $m - 1$ bits of useful precision instead of $m$ bits because the most-significant bit (MSB) would always be ‘1’. In the proposed designs, the magnitude of $A$ is used with an implicit negative sign bit and Equation (3) is used if $B$ is unsigned or Equation (6) is used if $B$ is signed. The product is then negated by negating each row of partial products. Each bit, including implicit leading ‘0’s, is complemented and ‘1’ is added to the least-significant bit (LSB) in each row. The constants are then pre-added to simplify the matrix. If $A$ is negative and $B$ is unsigned, then
$$P = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \overline{a_i b_j} \cdot 2^{i+j} + 2^{m} + 2^{n} - 1. \tag{7}$$
The product is $m + n + 1$ bits to accommodate the sign bit. The product is always negative so the MSB is always ‘1’ and does not require any logic. If $A$ is negative and $B$ is signed, then
$$P = \sum_{i=0}^{m-1} \sum_{j=0}^{n-2} \overline{a_i b_j} \cdot 2^{i+j} + \sum_{i=0}^{m-1} a_i b_{n-1} \cdot 2^{i+n-1} + 2^{m+n-1} + 2^{m} + 2^{n-1} - 1. \tag{8}$$
The product is $m + n$ bits assuming $|A| \le 2^{m} - 1$. If $|A| = 2^{m}$, a hard-wired shift and negation of the product would be used instead of a KCM.
Figure 4 shows a $(6 \times 6)$-bit KCM, where $A$ is the magnitude of a negative constant and $B$ is a two’s-complement variable as described by Equation (8).
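Equations (6) and (8) can be checked numerically. The Python sketch below evaluates both sums bit by bit and keeps the low $m + n$ bits, on the assumption that both equations are meant modulo $2^{m+n}$ (so the constant $2^{m+n-1}$ plays the role of the sign-extension correction). The function names are hypothetical and not taken from the paper.

```python
def bit(value, index):
    """Return bit `index` of a non-negative integer."""
    return (value >> index) & 1


def to_signed(raw, width):
    """Interpret a raw bit pattern as a two's-complement value."""
    return raw - (1 << width) if bit(raw, width - 1) else raw


def product_positive_constant(a, b_raw, m, n):
    """Equation (6): a is a positive m-bit constant, b_raw is an n-bit signed pattern."""
    total = 0
    for i in range(m):
        for j in range(n - 1):
            total += (bit(a, i) & bit(b_raw, j)) << (i + j)
        # Sign row: the partial-product bits are complemented.
        total += (1 - (bit(a, i) & bit(b_raw, n - 1))) << (i + n - 1)
    total += (1 << (m + n - 1)) + (1 << (n - 1))
    return total % (1 << (m + n))


def product_negative_constant(magnitude, b_raw, m, n):
    """Equation (8): the constant is -magnitude, b_raw is an n-bit signed pattern."""
    total = 0
    for i in range(m):
        for j in range(n - 1):
            # Magnitude rows are complemented to negate them.
            total += (1 - (bit(magnitude, i) & bit(b_raw, j))) << (i + j)
        # Sign row: left uncomplemented, because it is negated twice.
        total += (bit(magnitude, i) & bit(b_raw, n - 1)) << (i + n - 1)
    total += (1 << (m + n - 1)) + (1 << m) + (1 << (n - 1)) - 1
    return total % (1 << (m + n))
```

For the $(6 \times 6)$-bit case of Figures 3 and 4, interpreting the 12-bit result with `to_signed` reproduces $A \cdot B$ and $-|A| \cdot B$ for every constant and every value of the variable.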

3.2. Design of Proposed Constant-Coefficient Multiplier

Figure 5 shows a dot diagram of a proposed $(12 \times 12)$-bit KCM, where $A$ is a negative constant and $B$ is a two’s complement variable. Each dot is a partial-product bit that corresponds to a bit in Equation (8). Each row $j$ of partial-product bits is a function of only one variable bit, $b_j$. The rows of partial-product bits are divided into groups, each of which is summed to produce a partial product, $P_\rho$. Each partial product $P_\rho$ is the sum of $j_\rho$ rows of partial-product bits.
In the example of Figure 5, the first five rows of partial-product bits are grouped and their sum is $P_0$. $P_0$ is a function of the constant $A$, the constant $2^{12} + 2^{5} - 2^{0}$ and a 5-bit sub-vector of the variable $B$, B[4:0]. The $2^{5}$ possible values of $P_0$ are pre-computed and generated using LUT6s. Each LUT6 generates two bits of $P_0$, $p_{0,i+1}$ and $p_{0,i}$. The next five rows of partial-product bits are grouped and their sum is $P_1$, which is a function of $A$, $2^{10} - 2^{5}$ and B[9:5]. The $2^{5}$ possible values of $P_1$ are pre-computed and generated by a proposed two-operand adder, which adds the generated value to $P_0$ and produces an accumulated sum, $X_1$. The final two rows of partial-product bits are grouped and their sum is $P_2$, which is a function of $A$, $2^{23} + 2^{10}$ and B[11:10]. The $2^{2}$ possible values of $P_2$ are pre-computed, generated by another proposed two-operand adder, and added to $X_1$ to produce an accumulated sum $X_2$. The five least-significant bits of the final product, P[4:0], are the five LSBs of $P_0$. The next five LSBs of the product, P[9:5], are the five LSBs of the accumulated sum $X_1$. The remaining bits of the product, P[23:10], are the accumulated sum, $X_2$.
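The grouping in this example can be sketched as three pre-computed tables in Python. The sketch is a value-level model only: it uses the constants quoted above for $P_0$, $P_1$ and $P_2$, sums full-width integers, and keeps the low 24 bits, so it does not model LUT contents, bit widths of the individual adders or the carry chain. The names `build_tables` and `kcm_negative` are hypothetical.

```python
M, N = 12, 12
MASK = (1 << (M + N)) - 1


def bit(value, index):
    return (value >> index) & 1


def to_signed(raw, width):
    return raw - (1 << width) if bit(raw, width - 1) else raw


def row(magnitude, b_bit, j, complemented):
    """One row of partial-product bits of Equation (8), weighted by 2**j."""
    total = 0
    for i in range(M):
        pp = bit(magnitude, i) & b_bit
        total += ((1 - pp) if complemented else pp) << (i + j)
    return total


def build_tables(magnitude):
    """Pre-compute P0 (rows 0-4), P1 (rows 5-9) and P2 (rows 10-11)."""
    p0 = [sum(row(magnitude, bit(v, k), k, True) for k in range(5))
          + (1 << 12) + (1 << 5) - 1 for v in range(32)]
    p1 = [sum(row(magnitude, bit(v, k), 5 + k, True) for k in range(5))
          + (1 << 10) - (1 << 5) for v in range(32)]
    # Row 11 is the sign row of B and is not complemented.
    p2 = [row(magnitude, bit(v, 0), 10, True) + row(magnitude, bit(v, 1), 11, False)
          + (1 << 23) + (1 << 10) for v in range(4)]
    return p0, p1, p2


def kcm_negative(tables, b_raw):
    """Return the 24-bit pattern of -magnitude * B for a 12-bit pattern b_raw."""
    p0, p1, p2 = tables
    x1 = p0[b_raw & 0x1F] + p1[(b_raw >> 5) & 0x1F]   # first two-operand adder
    x2 = x1 + p2[(b_raw >> 10) & 0x3]                 # second two-operand adder
    return x2 & MASK
```

A usage example: `to_signed(kcm_negative(build_tables(2741), 100), 24)` evaluates to `-274100`. The three group constants add up to $2^{23} + 2^{12} + 2^{11} - 1$, which is the constant term of Equation (8) for $m = n = 12$.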
In a proposed $(m \times n)$-bit KCM, all of the partial-product bits are grouped into $\lceil (n-1)/5 \rceil$ partial products. Each partial product, $P_\rho$, is the sum of $j_\rho$ rows of partial-product bits. When $n - 1$ is an exact multiple of five, such as when $n = 16$, $P_0$ is the sum of six rows and each of the other partial products is the sum of five rows. When $n - 1$ is not an exact multiple of five, each partial product is the sum of five rows except possibly the last, which is the sum of the remaining rows.
$P_0$ is the sum of the first $j_0$ rows of partial-product bits and is generated using LUT6s. When $P_0$ is the sum of six rows, each bit $p_{0,i}$ is a function of six variables, B[5:0], so each LUT6 generates one output bit. When $P_0$ is the sum of five rows, each pair of bits $p_{0,i+1}$ and $p_{0,i}$ are functions of the same five variables, B[4:0], so each LUT6 generates two output bits. $P_0$ is $m + j_0$ bits long, so $m + j_0$ LUT6s are required if $j_0 = 6$ and $\lceil (m + j_0)/2 \rceil$ LUT6s are required if $j_0 \le 5$.
The remaining partial products, $P_\rho$ where $\rho \ge 1$, are each generated using a proposed two-operand adder. The proposed two-operand adder generates a function of up to five variables, so it is most efficient when $P_\rho$ is the sum of five rows of partial-product bits. $P_\rho$ is $m + j_\rho$ bits long, so $m + j_\rho$ LUT6s are required for each two-operand adder.
Constant ‘1’s can be grouped with any partial product and are simply included in each pre-computed value. In practice, groups are selected so that constant ‘1’s do not increase the length of the partial product.
When $n - 1$ is an exact multiple of five, each partial product requires $m + j_\rho$ LUT6s. There are $\lceil (n-1)/5 \rceil$ partial products, and $\sum_{\rho=0}^{\lceil (n-1)/5 \rceil - 1} j_\rho = n$, so the maximum number of required LUT6s is
$$\#\mathrm{LUT6s} \le m \left\lceil \frac{n-1}{5} \right\rceil + n. \tag{9}$$
When $n - 1$ is not an exact multiple of five, there are $\lceil (m + j_0)/2 \rceil$ LUT6s instead of $m + j_0$ LUT6s in the first row, so the maximum number of required LUT6s is
$$\#\mathrm{LUT6s} \le m \left\lceil \frac{n-1}{5} \right\rceil + n - \left\lfloor \frac{m+5}{2} \right\rfloor. \tag{10}$$
Some LUTs may be optimized away during synthesis, so these equations give the maximum number of required LUT6s.
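As a cross-check of Equations (9) and (10), the following Python sketch counts LUT6s row by row from the grouping rules of this section and compares the total with the closed-form bound. It assumes $n \ge 6$, so that the first partial product has five or six rows; the helper names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def group_sizes(n):
    """Rows of partial-product bits per partial product, first group first."""
    count = ceil_div(n - 1, 5)
    if (n - 1) % 5 == 0:
        return [6] + [5] * (count - 1)
    return [5] * (count - 1) + [n - 5 * (count - 1)]


def lut6_by_rows(m, n):
    """Sum the LUT6s of the first row and of each two-operand adder."""
    sizes = group_sizes(n)
    first = m + sizes[0] if sizes[0] == 6 else ceil_div(m + sizes[0], 2)
    return first + sum(m + j for j in sizes[1:])


def lut6_bound(m, n):
    """Closed-form bound from Equations (9) and (10)."""
    k = ceil_div(n - 1, 5)
    if (n - 1) % 5 == 0:
        return m * k + n
    return m * k + n - (m + 5) // 2
```

For the $(12 \times 12)$-bit example, `group_sizes(12)` gives `[5, 5, 2]` and both functions return 40, that is, 9 LUT6s for $P_0$, 17 for the first adder and 14 for the second.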

3.3. Array Structure and Pipelining

Figure 6 shows the structure of the proposed $(12 \times 12)$-bit KCM from the example of Figure 5. The top row of LUT6s generates the first five rows of partial-product bits and outputs the sum, $P_0$. The second row of LUT6s implements a proposed two-operand adder that generates the sum of the next five rows of partial-product bits, $P_1$, and adds it to $P_0$ to produce an accumulated sum, $X_1$. The third row of LUT6s implements another two-operand adder that generates the sum of the last two rows of partial-product bits, $P_2$, and adds it to $X_1$ to produce an accumulated sum, $X_2$. The KCM output, $P$, is composed of the five LSBs of $P_0$, the five LSBs of $X_1$ and all of $X_2$.
The proposed KCM can be pipelined by placing registers after each row of LUT6s. The first stage registers $m + j_0$ bits of the final product $P$ and $n - j_0$ bits of $B$, which requires $m + n$ flip-flops. Subsequent stages register $m + j_\rho + 1$ bits of $X_\rho$, $j_\rho$ additional bits of $P$ and $j_\rho$ fewer bits of $B$, which requires $m + n + 1$ flip-flops. The last stage registers the output $P$, which requires $m + n$ flip-flops. There are $\lceil (n-1)/5 \rceil$ stages, and each stage registers $m + n + 1$ bits except the first and last stages, which register $m + n$ bits each, so the maximum number of flip-flops required is
$$\#\mathrm{FFs} \le \left\lceil \frac{n-1}{5} \right\rceil (m + n + 1) - 2. \tag{11}$$
Each LUT6 used in the KCM has two available flip-flops so there are more than enough flip-flops available within the footprint of the multiplier to implement pipeline registers. The structure is very regular and easy to place in the logic fabric so that routing paths are short and fast.
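The flip-flop bound of Equation (11) follows from the per-stage counts given above, which the short Python sketch below makes explicit. It assumes at least two pipeline stages ($n \ge 7$), and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def ff_by_stage(m, n):
    """Flip-flops registered by each pipeline stage, first stage first."""
    stages = ceil_div(n - 1, 5)
    if stages < 2:
        raise ValueError("this sketch assumes at least two pipeline stages")
    # First and last stages register m + n bits; the others register m + n + 1.
    return [m + n] + [m + n + 1] * (stages - 2) + [m + n]


def ff_bound(m, n):
    """Closed-form bound from Equation (11)."""
    return ceil_div(n - 1, 5) * (m + n + 1) - 2
```

For the $(12 \times 12)$-bit example the stages register 24, 25 and 24 bits, a total of 73 flip-flops, which is below the 80 flip-flops that accompany the 40 LUT6s given by Equation (10).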

3.4. Discussion

When $n = 10$, the first row of the KCM computes the sum of five partial products using LUT6s. Each LUT6 computes two bits of the sum, except for one LUT6 that computes only one bit if $m$ is even. The second row of the KCM also computes the sum of five partial products and adds them to the sum from the first row. This is very efficient because both rows are computing the maximum number of partial-product bits per LUT6. When $n$ is increased to $n = 11$, the second row still computes the sum of five partial products, but the first row now computes the sum of six partial products, so each LUT6 only computes one bit of the sum. This causes a jump in the number of LUTs required to implement the KCM. When $n$ is increased to $n = 12$, the first row computes the sum of five partial products, which reduces the number of LUT6s in that row compared to $n = 11$. The second row still computes the sum of five partial products. However, a third row is now required, which causes another jump in the number of LUTs required to implement the KCM. When $n$ is increased to $n = 13$, the first and second rows still compute the sum of five partial products each. The third row computes the sum of three partial products, compared to two for $n = 12$, which only requires one additional LUT6 plus an additional LUT6 per bit that $m$ increases, so the increase in the number of LUTs required to implement the KCM is not as large as the increase from $n = 10$ to $n = 11$ or from $n = 11$ to $n = 12$. The situation is similar when $n$ is increased to $n = 14$ and again when $n$ is increased to $n = 15$. When $n$ is increased to $n = 16$, the first row computes the sum of six partial products, which causes a jump in the number of required LUTs as it does when $n$ increases from $n = 10$ to $n = 11$. This cycle repeats itself as $n$ is increased. The significance of this is that for a given value of $m$, KCMs with $n \in \{10, 15, 20, 25, \ldots\}$ are generally the most efficient in terms of required LUTs, while KCMs with $n \in \{12, 17, 22, 27, \ldots\}$ are generally the least efficient.
The value of $m$ does not affect the number of rows in the KCM, so there are no jumps in the required number of LUTs as $m$ is increased. If $n - 1$ is an exact multiple of five, there are $(n-1)/5$ rows in the KCM and the first row requires one LUT6 per bit of the sum. As $m$ is increased, each row of the KCM requires one additional LUT6 per bit that $m$ increases, so a total of $(n-1)/5$ additional LUT6s are required per bit that $m$ increases. If $n - 1$ is not an exact multiple of five, there are $\lceil (n-1)/5 \rceil$ rows in the KCM and the first row requires approximately one half of an LUT6 per bit of the sum. As $m$ is increased, the KCM requires approximately one half of an additional LUT6 for the first row and one additional LUT6 for each of the other rows per bit that $m$ increases, so a total of $\lceil (n-1)/5 \rceil - \tfrac{1}{2}$ additional LUT6s are required per bit that $m$ increases. The significance of this is that, for a given value of $n$, the increase in the number of LUTs required to implement the KCM as $m$ increases is approximately linear, and the value of $m$ has a much lower impact than $n$ on the efficiency of the implementation in terms of required LUTs.
Figure 7 shows the number of LUT6s required for KCMs as $m$ and $n$ are varied, based on Equations (9) and (10). These functions are discrete and the points are connected by lines for readability only, not to imply continuity. The middle set of points is the case where $m = n$. The total number of LUTs required for the KCM increases as $m = n$ increases, with jumps from $n = 10$ to $n = 11$, from $n = 11$ to $n = 12$, etc., due to $n$ increasing as discussed earlier. The other sets of points are cases where $m \in \{1.5n, 1.25n, 0.75n, 0.5n\}$. This results in $m$ having a fractional value for many points, which is not possible. However, those fractional values are used to compute the points because the intent of the graph is to show how the number of LUTs scales with $m$, not to show an exact number of LUTs. The graph shows that for a given value of $n$, the change in the number of LUTs required is roughly proportional to $m$.
Figure 8 shows the number of partial product bits that are computed and summed, $m \cdot n$, per LUT6 required for implementation as $m$ and $n$ are varied. This provides a measure of efficiency of the implementation in terms of LUTs required. The middle set of points is the case where $m = n$. The graph shows that KCMs with $n \in \{10, 15, 20, 25, \ldots\}$ generally have a local maximum value and are the most efficient, while KCMs with $n \in \{12, 17, 22, 27, \ldots\}$ generally have a local minimum value and are the least efficient. For a given value of $n$, efficiency increases somewhat as $m$ increases and decreases as $m$ is decreased.
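The efficiency measure of Figure 8 can be recomputed from Equations (9) and (10). The Python sketch below does this for the $m = n$ case; it is an illustration based on the closed-form bounds, not on synthesis results, and the function names are hypothetical.

```python
def lut6_bound(m, n):
    """Maximum number of LUT6s from Equations (9) and (10)."""
    k = -(-(n - 1) // 5)  # ceil((n - 1) / 5)
    if (n - 1) % 5 == 0:
        return m * k + n
    return m * k + n - (m + 5) // 2


def efficiency(m, n):
    """Partial-product bits computed and summed per LUT6."""
    return m * n / lut6_bound(m, n)


if __name__ == "__main__":
    # Print the m = n curve for a range of operand sizes.
    for n in range(8, 21):
        print(n, lut6_bound(n, n), round(efficiency(n, n), 2))
```

With these bounds, a $(10 \times 10)$-bit KCM needs 23 LUT6s (about 4.35 bits per LUT6), while a $(12 \times 12)$-bit KCM needs 40 LUT6s (3.6 bits per LUT6), which matches the local maximum and local minimum described above.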

4. Proposed KCMs with Selectable Coefficients

Turner and Woods present a reduced-coefficient multiplier (RCM) that can operate on a limited set of coefficients, selectable at run-time [10]. Their multipliers use canonical signed digit (CSD) recoding and sub-expression elimination to reduce the number of add/subtract operations. This section discusses how the proposed KCMs can be modified to incorporate the idea to operate on a set of two, four or eight coefficients, selectable at run-time.
In the following sections, the selectable coefficients are designated $A[k]$ and the resulting products are designated $P[k]$. The coefficients for a KCM with selectable coefficients do not need to have the same sign. The variable is designated $B[k]$ because it can be treated as signed for one coefficient and unsigned for another. For example, a KCM with two selectable coefficients could treat $A[0]$ as negative and $B[0]$ as unsigned, and treat $A[1]$ as positive and $B[1]$ as signed without any special considerations.

4.1. Proposed KCMs with Two Selectable Coefficients

A KCM with two selectable coefficients requires one input to select the coefficient. Partial products for both coefficients are pre-computed and generated using LUT6s for each $P[k]_0$, and generated using proposed two-operand adders for the rest of the partial products, $P[k]_i$.
One input to each LUT6 used to generate $P[k]_0$ is needed to select the coefficient, so only five inputs are left to select the pre-computed value of $P[k]_0$ if each LUT6 generates one bit, $p[k]_{0,i}$, and only four inputs are left if each LUT6 generates two bits, $p[k]_{0,i+1}$ and $p[k]_{0,i}$. One of the $y_i$ inputs to each LUT6 in each of the adders is needed to select the coefficient, so only four inputs are left to select the pre-computed value of $P[k]_i$. Therefore, all of the partial-product bits in a KCM with two selectable coefficients are grouped into $\lceil (n-1)/4 \rceil$ partial products.
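The effect of spending table inputs on coefficient selection can be sketched with a value-level Python model. It is a simplification under stated assumptions: each table entry stores the signed product of a coefficient and one weighted chunk of $B$ rather than the complemented-bit encoding of Equation (8), $B$ is treated as two's complement for every coefficient, and the grouping rule is this sketch's reading of the text (groups of $5 - s$ rows for $s$ select bits, with one extra row in the first group when $n - 1$ is an exact multiple). All names are hypothetical.

```python
def to_signed(raw, width):
    """Interpret a raw bit pattern as a two's-complement value."""
    return raw - (1 << width) if (raw >> (width - 1)) & 1 else raw


def group_sizes(n, select_bits):
    """Rows per partial product when select_bits table inputs pick the coefficient."""
    g = 5 - select_bits
    count = -(-(n - 1) // g)  # ceil((n - 1) / g)
    if (n - 1) % g == 0:
        return [g + 1] + [g] * (count - 1)
    return [g] * (count - 1) + [n - g * (count - 1)]


def build_tables(coefficients, n):
    """tables[k][rho][bits] is coefficient k times the weighted chunk rho of B."""
    select_bits = (len(coefficients) - 1).bit_length()
    sizes = group_sizes(n, select_bits)
    tables = []
    for coeff in coefficients:
        low, per_coeff = 0, []
        for index, size in enumerate(sizes):
            top = index == len(sizes) - 1  # the top chunk carries the sign bit of B
            per_coeff.append([
                coeff * (to_signed(bits, size) if top else bits) * (1 << low)
                for bits in range(1 << size)
            ])
            low += size
        tables.append(per_coeff)
    return sizes, tables


def multiply(sizes, tables, select, b_raw):
    """Look up and accumulate the partial products for the selected coefficient."""
    total, low = 0, 0
    for size, table in zip(sizes, tables[select]):
        total += table[(b_raw >> low) & ((1 << size) - 1)]
        low += size
    return total
```

For two coefficients and $n = 12$ the model uses three groups of four rows, the same number of partial products as the single-coefficient $(12 \times 12)$-bit KCM.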
When $n - 1$ is an exact multiple of four, each partial product requires $m + j[k]_\rho$ LUT6s. There are $\lceil (n-1)/4 \rceil$ partial products, and $\sum_{\rho=0}^{\lceil (n-1)/4 \rceil - 1} j[k]_\rho = n$, so the maximum number of required LUT6s is
$$\#\mathrm{LUT6s} \le m \left\lceil \frac{n-1}{4} \right\rceil + n. \tag{12}$$
When $n - 1$ is not an exact multiple of four, there are $\lceil (m + j[k]_0)/2 \rceil$ LUT6s instead of $m + j[k]_0$ LUT6s in the first row, so the maximum number of required LUT6s is
$$\#\mathrm{LUT6s} \le m \left\lceil \frac{n-1}{4} \right\rceil + n - \left\lfloor \frac{m+4}{2} \right\rfloor. \tag{13}$$
Some LUTs may be optimized away during synthesis, so these equations give the maximum number of required LUT6s.
Figure 9 shows a dot diagram of a proposed $(12 \times 12)$-bit KCM with two selectable coefficients, where $A[k]$ is a negative constant and $B[k]$ is a two’s complement variable (cf. Figure 5). In this example, no additional adders are needed and the unit has a very similar footprint to the single-coefficient KCM. Other size operands usually require one or more additional adders.

4.2. Proposed KCMs with Four Selectable Coefficients

A KCM with four selectable coefficients requires two inputs to select the coefficient. Partial products for each coefficient are pre-computed and generated using LUT6s for each $P[k]_0$ and the proposed two-operand adders generate and add the rest of the partial products.
Two inputs to each LUT6 used to generate $P[k]_0$ are needed to select the coefficient, so only four inputs are left to select the pre-computed value of $P[k]_0$ if each LUT6 generates one bit and only three inputs are left if each LUT6 generates two bits. Two of the $y_i$ inputs to each LUT6 in each of the adders are needed to select the coefficient, so only three inputs are left to select the pre-computed value of $P[k]_i$. Therefore, all of the partial-product bits in a KCM with four selectable coefficients are grouped into $\lceil (n-1)/3 \rceil$ partial products.
When $n - 1$ is an exact multiple of three, the maximum number of required LUT6s is
$$\#\mathrm{LUT6s} \le m \left\lceil \frac{n-1}{3} \right\rceil + n. \tag{14}$$
When $n - 1$ is not an exact multiple of three, the maximum number of required LUT6s is
$$\#\mathrm{LUT6s} \le m \left\lceil \frac{n-1}{3} \right\rceil + n - \left\lfloor \frac{m+3}{2} \right\rfloor. \tag{15}$$
Figure 10 shows a dot diagram of a proposed $(12 \times 12)$-bit KCM with four selectable coefficients, where $A[k]$ is a negative constant and $B[k]$ is a two’s complement variable (cf. Figure 5). In this example, one additional adder is needed compared to the single-coefficient KCM.

4.3. Proposed KCMs with Eight Selectable Coefficients

A KCM with eight selectable coefficients requires three inputs to select the coefficient. Partial products for each coefficient are pre-computed and generated using LUT6s for each $P[k]_0$ and the proposed two-operand adders generate and add the rest of the partial products.
Three inputs to each LUT6 used to generate $P[k]_0$ are needed to select the coefficient, so only three inputs are left to select the pre-computed value of $P[k]_0$ if each LUT6 generates one bit and only two inputs are left if each LUT6 generates two bits. Three of the $y_i$ inputs to each LUT6 in each of the adders are needed to select the coefficient, so only two inputs are left to select the pre-computed value of $P[k]_i$. Therefore, all of the partial-product bits in a KCM with eight selectable coefficients are grouped into $\lceil (n-1)/2 \rceil$ partial products.
When $n - 1$ is an exact multiple of two, the maximum number of required LUT6s is
$$\#\mathrm{LUT6s} \le m \left\lceil \frac{n-1}{2} \right\rceil + n. \tag{16}$$
When $n - 1$ is not an exact multiple of two, the maximum number of required LUT6s is
$$\#\mathrm{LUT6s} \le m \left\lceil \frac{n-1}{2} \right\rceil + n - \left\lfloor \frac{m+2}{2} \right\rfloor. \tag{17}$$
Figure 11 shows a dot diagram of a proposed $(12 \times 12)$-bit KCM with eight selectable coefficients, where $A[k]$ is a negative constant and $B[k]$ is a two’s complement variable (cf. Figure 5). In this example, three additional adders are needed compared to the single-coefficient KCM.

4.4. Discussion

Table 1 compares proposed KCMs that have a single coefficient to the proposed KCMs with two, four and eight selectable coefficients. The number of partial products and the number of LUTs used by each version are given, based on Equations (12) through (17). The percentage increase in the number of LUTs for two, four and eight-coefficient versions versus single-coefficient versions is also given. One or more of the LUTs used to generate the least-significant bits in the first row can often be optimized away so the number of LUTs in an actual implementation may be a little lower. For the operand sizes in the table, KCMs with two selectable coefficients use an average of 19% more LUTs, KCMs with four selectable coefficients use an average of 55% more LUTs and KCMs with eight selectable coefficients use an average of 117% more LUTs than single-coefficient KCMs. In designs where a KCM with selectable coefficients can replace two or more single-coefficient KCMs, the increase is more than offset by the reduced number of KCMs required.
KCMs with selectable coefficients usually have more partial products than single-coefficient KCMs. This means more adder stages are required, which translates into additional delay in single-cycle units. In pipelined versions, this results in longer latencies. However, cycle times are comparable because the adders are the same width or a little shorter.
Figure 12 shows the number of LUT6s required for KCMs with one, two, four and eight selectable coefficients. These functions are discrete and the points are connected by lines for readability only, not to imply continuity. The lower set of points is for single-coefficient KCMs and is the same as the middle set of points in Figure 7. As discussed in Section 3.4, there are jumps at every fifth value of n, starting with n = 11 , because the first row requires twice as many LUT6s every fifth value of n starting at n = 11 and the number of rows increases every fifth value of n starting at n = 12 . KCMs with two selectable coefficients have jumps for the same reasons, except they occur every fourth value of n, KCMs with four selectable coefficients have jumps every third value of n and KCMs with eight selectable coefficients have jumps every second value of n.
Figure 13 shows the number of partial product bits that are computed and summed for a single output per LUT6 for KCMs with one, two, four and eight selectable coefficients. The upper set of points is for single-coefficient KCMs and is the same as the middle set of points in Figure 8. As discussed in Section 3.4, there are local maximums every fifth value of n starting at n = 10 and local minimums every fifth value of n starting at n = 12 , indicating most efficient and least efficient units, respectively. KCMs with two selectable coefficients have a similar cycle every fourth value of n. They can be implemented using the same number of LUTs as single-coefficient KCMs for n = 8 and n = 12 because of the different period of each cycle. The cycle for KCMs with four selectable coefficients is every third value of n and the cycle for KCMs with eight selectable coefficients is every second value of n. KCMs with selectable coefficients are less efficient than single-coefficient KCMs by this measure because they require more LUTs to produce a single product in a clock cycle. However, they are more efficient in a design that performs time-multiplexed multiplication because additional single-coefficient KCMs or a general-purpose multiplier would be required to provide the same functionality.

5. Results

The proposed KCMs are compared to Xilinx LogiCORE IP v12.0 (rev. 12) (Xilinx Inc., San Jose, CA, USA) constant-coefficient multipliers [20] for ( n × n ) -bit units. Proposed KCMs with two, four and eight selectable coefficients are compared to units composed of a LogiCORE IP general-purpose multiplier and a lookup function to select the coefficient. Proposed KCMs with two and four selectable coefficients are also compared to units composed of two or four LogiCORE IP KCMs and a multiplexer to select the output. Results for 8, 10, 12, 14, 16, 20 and 24-bit operands are given for single-cycle and pipelined units. KCMs are synthesized with a positive constant and again with a negative constant. KCMs with selectable coefficients are synthesized with half of the constants being positive and the other half negative.
Arbitrary values ±π/4, ±3π/4, ±5π/4 and ±7π/4 are used for constants. This paper presents operands as integers, so π/4 is multiplied by 2^n, 3π/4 and 5π/4 are multiplied by 2^{n-2}, and 7π/4 is multiplied by 2^{n-3}. They are rounded to the nearest odd integer to ensure that the least-significant bit (LSB) is ‘1’ to avoid obvious optimizations. Table 2 gives the magnitudes of the constants used in synthesized units. Examination of the bit patterns shows that they are typical of many constants, with some runs of ‘1’s and ‘0’s and some isolated ‘1’s and ‘0’s.
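The scaling and rounding of the constants can be sketched as follows. This is an illustration only: it assumes the scale factors are 2^n, 2^(n-2), 2^(n-2) and 2^(n-3), and it assumes that ties in "nearest odd" are broken upward. The function names are hypothetical, and Table 2 remains the authoritative list of the values actually synthesized.

```python
import math


def nearest_odd(x):
    """Return the odd integer nearest to x (ties broken upward, by assumption)."""
    return 2 * math.floor(x / 2) + 1


def constant_magnitudes(n):
    """Magnitudes |A| for an n-bit unit: k*pi/4 scaled to an integer, then made odd."""
    # (multiplier k, assumed exponent of the power-of-two scale factor)
    scaled = [(1, n), (3, n - 2), (5, n - 2), (7, n - 3)]
    return [nearest_odd(k * math.pi / 4 * 2 ** e) for k, e in scaled]


if __name__ == "__main__":
    for n in (8, 10, 12, 14, 16, 20, 24):
        print(n, [format(a, "b") for a in constant_magnitudes(n)])
```

Printing the values in binary makes it easy to inspect the runs of ‘1’s and ‘0’s mentioned above.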

5.1. Methodology

Version 2016.3 of the Xilinx Vivado Design Suite (Vivado) was used. Designs were synthesized with the strategy set to ‘Flow_PerfOptimized_high’ and implemented with the strategy set to ‘Performance_Retiming’. Designs were synthesized for the Xilinx Virtex-7 XC7VX330T-FFG1157 (-3 speed grade) device with a timing constraint of 1 ns on the inner clock. All results are post place-and-route.
LogiCORE IP constant-coefficient multipliers and general-purpose multipliers were created using the IP Catalog in Vivado. Structural models of the proposed multipliers were implemented in Verilog-2001 (IEEE Standard 1364-2001, IEEE, Piscataway, NJ, USA). Pipelined versions were created for LogiCORE IP multipliers using the optimal number of stages specified in the IP customization dialog. Input and output (I/O) ports were double registered to reduce dependence on I/O placement [21]. A separate clock on the inner level was used to measure the delay through each multiplier.

5.2. Synthesis Results

Synthesis results for proposed KCMs are given in Section 5.2.1. Synthesis results for proposed KCMs with two, four and eight selectable coefficients are given in Section 5.2.2, Section 5.2.3 and Section 5.2.4, respectively.

5.2.1. Proposed Constant-Coefficient Multipliers

Synthesis results for single-cycle constant-coefficient multipliers are given in Table 3 and Table 4. The total number of LUTs used and the delay are given. The LUT-delay product (LDP), computed by multiplying the number of LUTs by the delay, is also given. LDP is analogous to the area-delay product of a very-large-scale integration (VLSI) design. The reciprocal of LDP gives a metric to compare maximum throughput. The total number of LUTs, delay and LDP are normalized to LogiCORE IP KCMs.
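The LDP metric and its normalization can be illustrated with a short sketch. The numbers below are hypothetical and are not taken from Tables 3 through 6; the function names are likewise illustrative.

```python
def lut_delay_product(luts, delay_ns):
    """LUT-delay product (LDP): number of LUTs multiplied by the delay."""
    return luts * delay_ns


def normalize(proposed, baseline):
    """Express a proposed unit's LUTs, delay and LDP relative to a baseline unit.

    Each argument is a (luts, delay_ns) pair. Values below 1.0 favor the
    proposed unit; values above 1.0 favor the baseline.
    """
    p_luts, p_delay = proposed
    b_luts, b_delay = baseline
    return {
        "luts": p_luts / b_luts,
        "delay": p_delay / b_delay,
        "ldp": lut_delay_product(p_luts, p_delay) / lut_delay_product(b_luts, b_delay),
    }


# Hypothetical unit: 20% fewer LUTs but 25% more delay than the baseline.
example = normalize(proposed=(80, 2.5), baseline=(100, 2.0))

if __name__ == "__main__":
    print(example)
```

In this hypothetical case the LUT savings and the added delay cancel, so the normalized LDP is 1.0 and neither unit has a throughput advantage by this metric.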
Table 3 gives results for single-cycle KCMs, where the constant A is positive and the variable B is signed. For these units, proposed designs are 10% to 31% smaller than comparable LogiCORE IP KCMs, except for 12-bit units which are 14% larger. This anomaly occurs because proposed KCMs are less efficient for n = 12 as discussed in Section 3.4 and LogiCORE IP KCMs with positive coefficients are more efficient for n = 12 as shown in Figure 15. Proposed designs have a 23% to 108% increase in delay, so there is a trade-off of fewer LUTs for increased cycle time. Table 4 gives results for single-cycle KCMs where the constant A is negative and the variable B is signed. For these KCMs, LogiCORE IP units increase in size while proposed units remain roughly the same. This reduces the relative size, so proposed designs with a negative constant are 17% to 35% smaller than LogiCORE IP units. Normalized delay is similar to proposed KCMs with a positive constant.
For most proposed single-cycle units, the normalized LDP is greater than 1.0. This suggests that single-cycle LogiCORE IP units usually offer higher throughput in designs where the KCMs are on the critical path and determine the clock period. However, when the KCMs are not on the critical path and proposed designs meet timing requirements, proposed designs for most operand sizes will improve the system by reducing the number of LUTs required.
Synthesis results for pipelined constant-coefficient multipliers are given in Table 5 and Table 6. The number of pipeline stages is reported, as well as the total number of LUTs used, the delay and the LUT-delay product. The number of pipeline stages determines the latency in clock cycles. The reported delay is for one clock cycle. The total number of LUTs, delay and LDP are normalized to LogiCORE IP units.
Table 5 gives results for pipelined KCMs, where the constant A is positive and the variable B is signed. For these units, proposed designs are 15% to 36% smaller than comparable LogiCORE IP units, except for 12-bit units which are 5% larger. Proposed designs have a 23% to 38% increase in delay so there is still a trade-off of LUTs for cycle time. However, the extreme cases are significantly reduced and normalized delay is fairly constant as operand size is scaled.
Table 6 gives results for pipelined KCMs, where the constant A is negative and the variable B is signed. As with single-cycle KCMs, negative constant LogiCORE IP KCMs increase in size, while proposed units remain roughly the same. This again reduces the relative size, so proposed designs with a negative constant are 20% to 42% smaller than LogiCORE IP units. Even 12-bit units are 20% smaller. Normalized delay is similar to that of proposed KCMs with a positive constant, as was the case for single-cycle KCMs.
The average normalized LDP for proposed pipelined KCMs is 1.025 for units with a positive constant and 0.855 for units with a negative constant. The overall average LDP is 0.940 and the overall median LDP is 0.881. This suggests that, for many operand sizes, proposed pipelined KCMs offer higher throughput in designs where they are on the critical path and determine the clock period. When they are not on the critical path and meet timing requirements, the throughput advantage of proposed units increases because they use 27% fewer LUTs on average than comparable LogiCORE IP units. Proposed KCMs have more pipeline stages than some LogiCORE IP KCMs, especially as n gets larger, because the proposed method uses an array structure to add partial products while LogiCORE IP units appear to use a tree structure. This may be a problem for systems where latency requirements are difficult to meet. However, for systems that can tolerate the increased latency this is less of an issue.
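The effect of an array structure versus a tree structure on the number of adder levels can be sketched with a deliberately simplified model. It assumes one adder level per partial product row after the first for an array, and a balanced binary tree of two-operand adders for the tree; neither assumption describes the internal structure of LogiCORE IP units, which is not documented here, and the function names are illustrative.

```python
def array_adder_levels(partial_products):
    """Adder levels when partial products are accumulated row by row."""
    return max(partial_products - 1, 0)


def tree_adder_levels(partial_products):
    """Adder levels for a balanced binary tree of two-operand adders."""
    if partial_products <= 1:
        return 0
    return (partial_products - 1).bit_length()  # ceil(log2(partial_products))


if __name__ == "__main__":
    for pps in (2, 3, 4, 5, 6, 8, 12):
        print(pps, array_adder_levels(pps), tree_adder_levels(pps))
```

In this model, going from three to four partial products adds one level to the array, and going from three to six adds three, which is consistent with the extra pipeline stages quoted for the 12-bit units in Figures 10 and 11. The tree grows only logarithmically, which is why the latency gap widens as n gets larger.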
Figure 14 combines the graph of Figure 12 with actual values for LogiCORE IP KCMs obtained by synthesis. The graph shows that, for many operand sizes, the proposed KCMs with two selectable coefficients use fewer LUTs than LogiCORE IP KCMs that only handle a single coefficient.
Figure 15 combines the graph of Figure 13 with actual values for LogiCORE IP KCMs obtained by synthesis. The graph shows that proposed KCMs are more efficient in terms of required LUTs than LogiCORE IP KCMs except for 12-bit units with a positive constant.
Figure 15. Number of partial product bits computed and summed per LUT6 for KCMs. Values for proposed KCMs are computed maximums, values for LogiCORE are from synthesized results.

5.2.2. Proposed KCMs with Two Selectable Coefficients

Synthesis results for single-cycle KCMs with two selectable coefficients are given in Table 7. Results for units composed of a LogiCORE IP general-purpose multiplier and a lookup function to select the coefficient are given. Results for units composed of two LogiCORE IP KCMs with a multiplexer to select the output are also given. Results are normalized to LogiCORE IP KCM units because they use 27% fewer LUTs and are 2.08 times faster on average than LogiCORE IP multiplier units.
Proposed KCMs with two selectable coefficients use only 20% more LUTs on average than proposed KCMs with a single coefficient, while units based on LogiCORE IP KCMs use more than twice as many LUTs because they cannot be combined and require a multiplexer to select the product. Delay for proposed 8- and 12-bit units is about the same but increases for other sizes because an additional row is required to compute the product. Delay for all LogiCORE IP KCM-based units increases due to the multiplexer and because the variable operand must be routed to two KCMs, which doubles the fanout. Proposed units use 57% to 70% fewer LUTs compared to LogiCORE IP KCM-based units at the expense of a 13% to 74% increase in delay. The LDP for proposed units is 28% to 67% lower, indicating that significantly higher throughput can be achieved compared to LogiCORE IP KCM-based units. LogiCORE IP multiplier-based units are not competitive for two selectable coefficients.
Table 8 gives synthesis results for pipelined KCMs with two selectable coefficients. Proposed units use the same number of LUTs as single-cycle versions, except for 20 and 24-bit units, which use some additional LUTs as shift registers (SRLs) to replace flip-flops. This optimization can be avoided if desired using the -shreg_min_size setting in synthesis options. Similar to single-cycle units, proposed designs use 60% to 70% fewer LUTs than LogiCORE IP KCM-based units. However, proposed units benefit relatively more from pipelining than LogiCORE IP and are only 3% to 37% slower, and the relative delay tends to improve as n increases. Proposed units have 45% to 63% lower LDP, which is consistently lower for all operand sizes. The LDP suggests that proposed units offer more than double the throughput versus LogiCORE IP KCM-based units for most operand sizes. LogiCORE IP multiplier-based units are still not competitive.

5.2.3. Proposed KCMs with Four Selectable Coefficients

Table 9 gives synthesis results for single-cycle KCMs with four selectable coefficients. Proposed KCMs with four selectable coefficients use 31% more LUTs on average than proposed KCMs with two selectable coefficients. LogiCORE IP KCM-based units with four coefficients use 82% more LUTs while LogiCORE IP multiplier-based units use only 1% more LUTs on average than their two coefficient versions. Delay increases for proposed units and LogiCORE IP KCM-based units and remains about the same for LogiCORE IP multiplier-based units. Results are normalized to LogiCORE IP multiplier-based units.
Proposed single-cycle units use 61% to 67% fewer LUTs than LogiCORE IP multiplier-based units and 69% to 74% fewer LUTs than LogiCORE IP KCM-based units. Proposed units are faster than some LogiCORE IP multiplier-based units and slower than some, but are slower than LogiCORE IP KCM-based units for all sizes. Proposed units have a 58% to 81% lower LDP than LogiCORE IP multiplier-based units and a 44% to 67% lower LDP than LogiCORE IP KCM-based units.
Table 10 gives synthesis results for pipelined KCMs with four selectable coefficients. Proposed units benefit relatively more from pipelining than LogiCORE IP KCM-based and multiplier-based units with regard to delay. They are faster than most LogiCORE IP multiplier-based units and slower than LogiCORE IP KCM-based units, but the gap is smaller than it was for single-cycle units. Proposed pipelined units use 61% to 66% fewer LUTs than LogiCORE IP multiplier-based units and 72% to 76% fewer LUTs than LogiCORE IP KCM-based units. They have a 63% to 72% lower LDP than LogiCORE IP multiplier-based units and a 63% to 76% lower LDP than LogiCORE IP KCM-based units.

5.2.4. Proposed KCMs with Eight Selectable Coefficients

Table 11 gives synthesis results for single-cycle KCMs with eight selectable coefficients and Table 12 gives synthesis results for pipelined KCMs with eight selectable coefficients. Results for LogiCORE IP KCM-based units are not given because they would require eight KCMs and do not scale well as the number of coefficients increases. LogiCORE IP multiplier-based units only require a small amount of additional logic for the lookup function so they scale very well.
Proposed units use 51% to 52% fewer LUTs than LogiCORE IP for single-cycle units. They are slower than LogiCORE IP for most units, and the relative delay generally increases as n increases. The LDP for proposed single-cycle units is 13% to 53% lower than LogiCORE IP, with better results for smaller operand sizes.
Proposed pipelined units use 46% to 52% fewer LUTs and are faster, having 10% to 16% lower delay than LogiCORE IP. The LDP for proposed units is 51% to 59% lower than LogiCORE IP and performance is consistently better for all operand sizes.

5.3. Comparison to Möller et al.

Möller et al. [13] present synthesis results for (16 × 16)-bit constant-coefficient multipliers with two to fourteen selectable coefficients. They compare units generated using their proposed PAG fusion heuristic to units based on DAG fusion [11], using a Xilinx CoreGen multiplier with a distributed RAM to store coefficients as a baseline for comparison. Results for pipelined PAG fusion with ternary adders and PAG fusion with only two-operand adders are given. Results for single-cycle DAG fusion are given, as well as pipelined DAG fusion with registers after each adder, subtractor, adder/subtractor and multiplexer, plus additional registers as needed for pipeline balancing. The Xilinx CoreGen multiplier-based unit is pipelined to the same depth as pipelined PAG fusion units. The number of slices required for implementation is shown on one graph and the maximum clock frequency for each method is shown on another graph in their paper. Numerical values are estimated from these graphs and tabulated in Table 13 and Table 14 for units with two to eight selectable coefficients.
Results presented by Möller et al. were obtained using Xilinx ISE v13.4, targeting a Virtex 6 FPGA (xc6vlx75t-2ff484-2) [13]. Slices are used as the metric for resource usage, and a Xilinx CoreGen based unit is used as a baseline for comparison. This paper presents results obtained using Xilinx Vivado 2016.3, targeting a Virtex 7 FPGA, and uses LUTs as the metric for resource usage. In order to compare results, a Xilinx LogiCORE IP multiplier with a function to look up coefficients is used in this work as a baseline. The LogiCORE IP multiplier is pipelined to the optimal depth as given in the IP customization dialog and a pipeline register is inserted between the coefficient lookup and the multiplier. Results given by Möller et al. are normalized to their CoreGen based unit and results in this work are normalized to the LogiCORE IP based units to account for differences in the synthesis tools, target device and IP implementation. Table 13 and Table 14 summarize these results.
Figure 16 and Figure 17 compare results from Möller et al. to this work by plotting normalized values. With the proposed approach, a KCM with three selectable coefficients would have the same structure as the KCM with four selectable coefficients described in Section 4.2, except the table of precomputed partial products would use zeros or don’t cares for the unused coefficient. This may allow some LUTs in the first row to be optimized away, but the LUTs in the other rows would still be required so the resources consumed by the unit would be identical or slightly less than a KCM with four selectable coefficients. For this reason, proposed KCMs with three selectable coefficients are graphed using the same values as KCMs with four selectable coefficients. Likewise, proposed KCMs with five, six or seven selectable coefficients are graphed using the same values as KCMs with eight selectable coefficients.
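The mapping from an arbitrary number of coefficients to the structure that implements it can be sketched as follows. This is a minimal illustration of the padding rule described above, using zeros for unused entries; the function names are hypothetical and the sketch is limited to the one to eight coefficients covered in this paper.

```python
def structure_size(num_coefficients):
    """Number of selectable coefficients supported by the structure used.

    One coefficient uses the single-coefficient KCM; otherwise the count is
    rounded up to 2, 4 or 8.
    """
    if not 1 <= num_coefficients <= 8:
        raise ValueError("this sketch covers one to eight coefficients")
    return 1 << (num_coefficients - 1).bit_length()


def padded_coefficients(coefficients):
    """Pad the coefficient list with zeros up to the structure size."""
    coefficients = list(coefficients)
    size = structure_size(len(coefficients))
    return coefficients + [0] * (size - len(coefficients))


if __name__ == "__main__":
    print([structure_size(k) for k in range(1, 9)])
    print(padded_coefficients([201, -151, 251]))
```

Three coefficients are therefore graphed with the four-coefficient values, and five, six or seven coefficients with the eight-coefficient values.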
PAG fusion units with two-operand adders use fewer slices than CoreGen for two to four selectable coefficients and PAG fusion units with ternary adders use fewer slices than CoreGen for two to six selectable coefficients. All PAG fusion units use fewer slices than pipelined DAG fusion units. All PAG fusion units with two-operand adders have a maximum frequency comparable to CoreGen units, ranging from 6% slower to 4% faster. PAG fusion units with ternary adders are 22% to 31% slower than CoreGen, mainly because ternary adders are slower than two-operand adders. However, for many applications, they would not be on the critical path and would be better than CoreGen because they use 6% to 60% fewer slices for units with two to six selectable coefficients.
Proposed KCMs with selectable coefficients use significantly fewer slices than LogiCORE IP and pipelined versions are faster than LogiCORE IP for units with two to eight selectable coefficients. PAG fusion units outperform CoreGen units in most cases, so it is important to compare proposed units to PAG fusion units. Table 15 compares required slices and Table 16 compares maximum operating frequency for PAG fusion units, normalized to CoreGen, with proposed units, normalized to LogiCORE IP. These values are then normalized to PAG fusion with two-operand adders and PAG fusion with ternary adders. Comparing normalized values, proposed KCMs with selectable coefficients use 47% to 65% fewer slices than PAG fusion with two-operand adders and 22% to 57% fewer slices than PAG fusion with ternary adders. Proposed KCMs with selectable coefficients can operate 7% to 30% faster than PAG fusion with two-operand adders and 28% to 52% faster than PAG fusion with ternary adders.
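The second normalization step used for Table 15 and Table 16 can be sketched as a ratio of two values that are each already normalized to their own baseline. The inputs below are hypothetical and are not values from those tables; the function names are illustrative.

```python
def percent_fewer(proposed_norm, reference_norm):
    """Percentage reduction in resources relative to a reference design.

    Both arguments are already normalized to their own baselines
    (LogiCORE IP for proposed units, CoreGen for PAG fusion units).
    """
    return 100.0 * (1.0 - proposed_norm / reference_norm)


def percent_faster(proposed_norm, reference_norm):
    """Percentage increase in maximum frequency relative to a reference design."""
    return 100.0 * (proposed_norm / reference_norm - 1.0)


if __name__ == "__main__":
    # Hypothetical normalized slice counts and maximum frequencies.
    print(percent_fewer(0.4, 0.8))
    print(percent_faster(1.2, 1.0))
```

This comparison is only as good as the assumption that the two baselines are equivalent, which is why both sets of results are normalized to a multiplier-based unit with a coefficient lookup.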

6. Conclusions

This paper presents constant-coefficient multipliers (KCMs) for Xilinx FPGAs with 6-input LUTs, and extends them to have two to eight coefficients that may be selected by a control signal at runtime to implement time-multiplexed multiple-constant multiplication. Synthesis results show that proposed KCMs use 20% fewer LUTs for single-cycle designs and 27% fewer LUTs for pipelined designs on average compared to LogiCORE IP KCMs at the expense of increased delay. Proposed KCMs with two to four selectable coefficients use 63% fewer LUTs on average and proposed KCMs with eight selectable coefficients use 49% fewer LUTs on average compared to the smallest LogiCORE IP based alternative. Proposed KCMs with selectable coefficients also outperform state-of-the-art reconfigurable multipliers that are based on shift-and-add methods, using 22% to 57% fewer slices than the smallest designs and operating 7% to 30% faster than the fastest designs. For a given operand size and number of constants, proposed designs have the same placement and routing of LUTs, regardless of the sign or magnitude of the constant. The only thing that changes is the content of the LUT RAMs, which makes proposed KCMs an attractive candidate for runtime partial reconfiguration. LogiCORE IP KCMs are larger for negative constants, and the size of KCMs based on shift-and-add methods varies with the constant, making runtime partial reconfiguration more difficult.

Acknowledgments

The Xilinx Vivado Design Suite was obtained through the Xilinx University Program (XUP). The author wishes to thank the reviewers for their thorough reviews and comments which helped to greatly improve the contributions of this paper.

Author Contributions

EGW designed the proposed units; EGW conceived, designed and performed the experiments; EGW analyzed the data and wrote the paper.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BRAM: Block RAM (Random-Access Memory)
CLB: Configurable Logic Block
CSD: Canonical Signed Digit
DAG: Directed Acyclic Graph
DSP: Digital-Signal Processing
FPGA: Field-Programmable Gate Array
KCM: Constant-Coefficient Multiplier
LDP: LUT-Delay Product
LSB: Least-Significant Bit
LUT: Lookup Table
LUT5: 5-Input Lookup Table
LUT6: 6-Input Lookup Table
MSB: Most-Significant Bit
PAG: Pipelined Adder Graph
RCM: Reduced Coefficient Multiplier
RPAG: Reduced Pipelined Adder Graph

References

  1. Swartzlander, E.E., Jr. Merged Arithmetic. IEEE Trans. Comput. 1980, C-29, 946–950.
  2. Ercegovac, M. On Approximate Arithmetic. In Proceedings of the 47th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 3–6 November 2013; pp. 126–130.
  3. Chapman, K.D. Fast Integer Multipliers Fit in FPGAs; EDN Magazine: AspenCore Media, San Francisco, CA, USA, 1994.
  4. Chapman, K. Constant Coefficient Multipliers for the XC4000E; Version 1.1; Application Note XAPP 054; Xilinx: San Jose, CA, USA, 1996.
  5. Wirthlin, M.J. Constant Coefficient Multiplication Using Look-Up Tables. J. VLSI Signal Process. Syst. Signal Image Video Technol. 2004, 36, 7–15.
  6. Hormigo, J.; Caffarena, G.; Oliver, J.P.; Boemo, E. Self-Reconfigurable Constant Multiplier for FPGA. ACM Trans. Reconfig. Technol. Syst. 2013, 6, 14:1–14:17.
  7. Ercegovac, M.; Lang, T. Digital Arithmetic; Morgan Kaufmann: San Francisco, CA, USA, 2004.
  8. Brisebarre, N.; de Dinechin, F.; Muller, J.M. Integer and Floating-Point Constant Multipliers for FPGAs. In Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Leuven, Belgium, 2–4 July 2008; pp. 239–244.
  9. Gustafsson, O.; Dempster, A.G.; Johansson, K.; Macleod, M.D.; Wanhammar, L. Simplified Design of Constant Coefficient Multipliers. Circuits Syst. Signal Process. 2006, 25, 225–251.
  10. Turner, R.H.; Woods, R.F. Highly Efficient, Limited Range Multipliers for LUT-Based FPGA Architectures. IEEE Trans. VLSI Syst. 2004, 12, 1113–1117.
  11. Tummeltshammer, P.; Hoe, J.C.; Püschel, M. Time-Multiplexed Multiple-Constant Multiplication. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2007, 26, 1551–1563.
  12. Kumm, M.; Zipf, P.; Faust, M.; Chang, C.H. Pipelined Adder Graph Optimization for high speed multiple constant multiplication. In Proceedings of the 2012 IEEE International Symposium on Circuits and Systems (ISCAS), Seoul, Korea, 20–23 May 2012; pp. 49–52.
  13. Möller, K.; Kumm, M.; Kleinlein, M.; Zipf, P. Reconfigurable Constant Multiplication for FPGAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2017, 36, 927–937.
  14. Walters, E.G., III. Partial-Product Generation and Addition for Multiplication in FPGAs With 6-Input LUTs. In Proceedings of the 48th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2–5 November 2014; pp. 1247–1251.
  15. Walters, E.G., III. Techniques and Devices for Performing Arithmetic. US Patent Application 15/025,770, 29 March 2016.
  16. Walters, E.G., III. Array Multipliers for FPGAs With 6-Input LUTs. Computers 2016, 5, 20:1–20:25.
  17. Young, S.P.; Bauer, T.J. Programmable Integrated Circuit Providing Efficient Implementations of Arithmetic Functions. US Patent 7 218 139, 15 May 2007.
  18. Xilinx. 7 Series FPGAs Configurable Logic Block User Guide; UG474 (v1.4); Xilinx: San Jose, CA, USA, 2012.
  19. Baugh, C.R.; Wooley, B.A. A Two’s Complement Parallel Array Multiplication Algorithm. IEEE Trans. Comput. 1973, C-22, 1045–1047.
  20. Xilinx. Multiplier v12.0 LogiCORE IP Product Guide; PG108; Xilinx: San Jose, CA, USA, 2015.
  21. Xilinx. LogiCORE IP Multiplier v11.2 Product Specification; DS255; Xilinx: San Jose, CA, USA, 2011.
Figure 1. Partial diagram of a Xilinx 7 Series configurable logic block (CLB) slice.
Figure 2. Proposed two-operand adder, computes SUM = X + Y.
Figure 3. Proposed ( m × n )-bit constant-coefficient multiplier (KCM), where m = n = 6 , A is a positive constant and B is a signed variable.
Figure 4. Proposed ( m × n )-bit KCM, where m = n = 6 , A is a negative constant and B is a signed variable. This method allows a KCM with a negative constant to be implemented using the same resources as a KCM with a positive constant at the same precision.
Figure 5. Proposed ( m × n )-bit KCM, where m = n = 12 , A is a negative constant and B is a signed variable. When A is positive, the pre-computed values are different but the grouping of partial products and required resources are the same.
Figure 6. Structure of proposed ( m × n )-bit KCM, where m = n = 12 , A is a negative constant and B is a signed variable. The structure is easy to place in the logic fabric and facilitates short routing connections.
Figure 7. Number of 6-input lookup tables (LUT6s) required for KCMs as m and n are varied. For a given value of n, the change in the number of LUTs is roughly proportional to m.
Figure 8. Number of partial product bits computed and summed per LUT6 as m and n are varied. KCMs with n ∈ {10, 15, 20, 25, …} are generally the most efficient, KCMs with n ∈ {12, 17, 22, 27, …} generally are the least efficient. For a given value of n, efficiency increases as m increases and decreases as m decreases.
Figure 9. Proposed ( m × n )-bit KCM with two selectable coefficients, where m = n = 12 , A [ k ] is a negative constant and B [ k ] is a signed variable. This example does not require any additional resources compared to a ( 12 × 12 )-bit KCM with a single coefficient.
Figure 10. Proposed (m × n)-bit KCM with four selectable coefficients, where m = n = 12, A[k] is a negative constant and B[k] is a signed variable. This example requires approximately 33% more LUTs and one extra stage if pipelined compared to a (12 × 12)-bit KCM with a single coefficient.
Figure 11. Proposed (m × n)-bit KCM with eight selectable coefficients, where m = n = 12, A[k] is a negative constant and B[k] is a signed variable. This example requires approximately 93% more LUTs and three extra stages if pipelined compared to a (12 × 12)-bit KCM with a single coefficient.
Figure 12. LUT6s required for KCMs with one, two, four and eight coefficients, m = n .
Figure 13. Number of partial product bits computed and summed per LUT6 for KCMs with one, two, four and eight coefficients, m = n .
Figure 14. LUT6s required for KCMs, m = n . Values for proposed KCMs are computed maximums, and values for LogiCORE are from synthesized results.
Figure 16. Slice utilization for directed acyclic graph (DAG) fusion [11,13], pipelined adder graph (PAG) fusion [13] and proposed ( 16 × 16 )-bit KCMs with two to eight selectable coefficients. One slice contains four LUT6s and eight flip-flops (see Figure 1). DAG fusion and PAG fusion units are normalized to CoreGen as presented in Möller et al. [13], proposed units are normalized to a LogiCORE IP multiplier-based unit.
Figure 17. Maximum operating frequency for DAG fusion [11,13], PAG fusion [13] and proposed ($16 \times 16$)-bit KCMs with two to eight selectable coefficients. DAG fusion and PAG fusion units are normalized to CoreGen as presented in Möller et al. [13]; proposed units are normalized to a LogiCORE IP multiplier-based unit.
Table 1. Comparison of one, two, four and eight-coefficient ($m \times n$)-bit constant-coefficient multipliers (KCMs), where $m = n$. The increase in LUTs for multiple-coefficient KCMs is less than using separate KCMs with a multiplexer or a general-purpose multiplier with multiplexed constants. Increases are relative to the one-coefficient KCM of the same size.

| n | One Coefficient PPs | One Coefficient LUTs | Two Coefficient PPs | Two Coefficient LUTs | Two Coefficient Increase | Four Coefficient PPs | Four Coefficient LUTs | Four Coefficient Increase | Eight Coefficient PPs | Eight Coefficient LUTs | Eight Coefficient Increase |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 2 | 18 | 2 | 18 | 0% | 3 | 27 | 50% | 4 | 35 | 94% |
| 10 | 2 | 23 | 3 | 33 | 43% | 3 | 40 | 74% | 5 | 54 | 135% |
| 12 | 3 | 40 | 3 | 40 | 0% | 4 | 53 | 33% | 6 | 77 | 93% |
| 14 | 3 | 47 | 4 | 61 | 30% | 5 | 76 | 62% | 7 | 104 | 121% |
| 16 | 3 | 64 | 4 | 70 | 9% | 5 | 96 | 50% | 8 | 135 | 111% |
| 18 | 4 | 79 | 5 | 97 | 23% | 6 | 116 | 47% | 9 | 170 | 115% |
| 20 | 4 | 88 | 5 | 108 | 23% | 7 | 149 | 69% | 10 | 209 | 138% |
| 22 | 5 | 119 | 6 | 141 | 18% | 7 | 176 | 48% | 11 | 252 | 112% |
| 24 | 5 | 130 | 6 | 154 | 18% | 8 | 203 | 56% | 12 | 299 | 130% |

Acronyms: partial products (PPs), lookup tables (LUTs).
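The "Increase" columns of Table 1 follow directly from the LUT counts. As a minimal sketch, the Python snippet below recomputes them for the $n = 16$ row and compares each multiple-coefficient KCM with a naive bound of $k$ separate single-coefficient KCMs. The helper names `pct_increase` and `separate_kcms_lower_bound` are illustrative, and the bound deliberately ignores the output multiplexer, which is an assumption made here for simplicity.

```python
# LUT counts for n = 16 taken from Table 1, keyed by number of coefficients.
LUTS_N16 = {1: 64, 2: 70, 4: 96, 8: 135}


def pct_increase(base_luts: int, luts: int) -> int:
    """Percentage increase in LUTs relative to the single-coefficient KCM."""
    return round(100 * (luts - base_luts) / base_luts)


def separate_kcms_lower_bound(base_luts: int, k: int) -> int:
    """LUTs for k independent KCMs, ignoring the multiplexer (assumption)."""
    return k * base_luts


for k in (2, 4, 8):
    inc = pct_increase(LUTS_N16[1], LUTS_N16[k])
    bound = separate_kcms_lower_bound(LUTS_N16[1], k)
    print(f"{k} coefficients: {LUTS_N16[k]} LUTs (+{inc}%), "
          f"separate KCMs need at least {bound} LUTs")
```

Even without counting the multiplexer, the separate-KCM bound grows linearly with the number of coefficients, whereas the tabulated LUT counts grow much more slowly.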
Table 2. Values of $|A|$ used for synthesis: $\pi/4 \cdot 2^{n}$, $3\pi/4 \cdot 2^{n-2}$, $5\pi/4 \cdot 2^{n-2}$ and $7\pi/4 \cdot 2^{n-3}$, rounded to nearest odd.

| n | $\pi/4 \cdot 2^{n}$ Integer | $\pi/4 \cdot 2^{n}$ Binary | $3\pi/4 \cdot 2^{n-2}$ Integer | $3\pi/4 \cdot 2^{n-2}$ Binary |
|---|---|---|---|---|
| 8 | 201 | 11001001 | 151 | 10010111 |
| 10 | 805 | 1100100101 | 603 | 1001011011 |
| 12 | 3,217 | 110010010001 | 2,413 | 100101101101 |
| 14 | 12,867 | 11001001000011 | 9,651 | 10010110110011 |
| 16 | 51,471 | 1100100100001111 | 38,603 | 1001011011001011 |
| 20 | 823,549 | 11001001000011111101 | 617,663 | 10010110110010111111 |
| 24 | 13,176,795 | 110010010000111111011011 | 9,882,595 | 100101101100101111100011 |

| n | $5\pi/4 \cdot 2^{n-2}$ Integer | $5\pi/4 \cdot 2^{n-2}$ Binary | $7\pi/4 \cdot 2^{n-3}$ Integer | $7\pi/4 \cdot 2^{n-3}$ Binary |
|---|---|---|---|---|
| 8 | 251 | 11111011 | 175 | 10101111 |
| 10 | 1,005 | 1111101101 | 703 | 1010111111 |
| 12 | 4,021 | 111110110101 | 2,815 | 101011111111 |
| 14 | 16,085 | 11111011010101 | 11,259 | 10101111111011 |
| 16 | 64,339 | 1111101101010011 | 45,037 | 1010111111101101 |
| 20 | 1,029,437 | 11111011010100111101 | 720,605 | 10101111111011011101 |
| 24 | 16,470,993 | 111110110101001111010001 | 11,529,695 | 101011111110110111011111 |
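The constants in Table 2 can be regenerated from their definitions. The sketch below assumes that "rounded to nearest odd" means mapping a positive real $x$ to $2\lfloor x/2 \rfloor + 1$; the function names `nearest_odd` and `table2_magnitudes` are illustrative and do not come from the paper.

```python
import math


def nearest_odd(x: float) -> int:
    """Round a positive real to the nearest odd integer (assumed rounding rule)."""
    return 2 * math.floor(x / 2) + 1


def table2_magnitudes(n: int) -> list:
    """Return |A| for the four synthesis constants at operand size n."""
    scaled = [
        math.pi / 4 * 2 ** n,
        3 * math.pi / 4 * 2 ** (n - 2),
        5 * math.pi / 4 * 2 ** (n - 2),
        7 * math.pi / 4 * 2 ** (n - 3),
    ]
    return [nearest_odd(v) for v in scaled]


for n in (8, 16):
    for a in table2_magnitudes(n):
        # Print the integer and its n-bit binary representation.
        print(n, a, format(a, f"0{n}b"))
```

Because every constant is forced to be odd, its least significant bit is always 1, so each constant occupies the full $n$ bits.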
Table 3. Synthesis results for single-cycle ($m \times n$)-bit KCMs, where $m = n$, $A = \pi/4 \cdot 2^{n}$ and $B$ is a signed variable. Proposed KCMs use 15% fewer LUTs on average compared to LogiCORE IP KCMs at the expense of increased delay.

| Type | n | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 19 | 0.972 | 18.5 | 1.000 | 1.000 | 1.000 |
| $A = \pi/4 \cdot 2^{n}$ | 10 | 26 | 1.083 | 28.2 | 1.000 | 1.000 | 1.000 |
| $B$ is signed | 12 | 35 | 1.112 | 38.9 | 1.000 | 1.000 | 1.000 |
| | 14 | 61 | 1.786 | 108.9 | 1.000 | 1.000 | 1.000 |
| | 16 | 71 | 1.869 | 132.7 | 1.000 | 1.000 | 1.000 |
| | 20 | 127 | 2.033 | 258.2 | 1.000 | 1.000 | 1.000 |
| | 24 | 171 | 2.044 | 349.5 | 1.000 | 1.000 | 1.000 |
| Proposed KCM | 8 | 17 | 1.463 | 24.9 | 0.895 | 1.505 | 1.347 |
| $A = \pi/4 \cdot 2^{n}$ | 10 | 22 | 1.475 | 32.5 | 0.846 | 1.362 | 1.152 |
| $B$ is signed | 12 | 40 | 2.308 | 92.3 | 1.143 | 2.076 | 2.372 |
| | 14 | 47 | 2.454 | 115.3 | 0.770 | 1.374 | 1.059 |
| | 16 | 63 | 2.306 | 145.3 | 0.887 | 1.234 | 1.095 |
| | 20 | 87 | 3.149 | 274.0 | 0.685 | 1.549 | 1.061 |
| | 24 | 129 | 4.025 | 519.2 | 0.754 | 1.969 | 1.486 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
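The derived columns of the synthesis tables follow from the LUT count and delay of each unit: the LDP is their product, and the normalized columns divide each proposed value by the LogiCORE IP value at the same operand size. The sketch below recomputes them for Table 3. The names `ldp` and `normalize` are illustrative, and the average is assumed to be a plain arithmetic mean of the normalized LUT counts.

```python
# (n, LUTs, delay in ns) rows copied from Table 3.
LOGICORE = [(8, 19, 0.972), (10, 26, 1.083), (12, 35, 1.112), (14, 61, 1.786),
            (16, 71, 1.869), (20, 127, 2.033), (24, 171, 2.044)]
PROPOSED = [(8, 17, 1.463), (10, 22, 1.475), (12, 40, 2.308), (14, 47, 2.454),
            (16, 63, 2.306), (20, 87, 3.149), (24, 129, 4.025)]


def ldp(luts: int, delay_ns: float) -> float:
    """LUT-delay product."""
    return luts * delay_ns


def normalize(proposed, baseline):
    """Return (n, LUT ratio, delay ratio, LDP ratio) against the baseline."""
    rows = []
    for (n, luts, delay), (n_ref, luts_ref, delay_ref) in zip(proposed, baseline):
        assert n == n_ref, "rows must be aligned by operand size"
        rows.append((n, luts / luts_ref, delay / delay_ref,
                     ldp(luts, delay) / ldp(luts_ref, delay_ref)))
    return rows


rows = normalize(PROPOSED, LOGICORE)
mean_luts = sum(r[1] for r in rows) / len(rows)  # assumed arithmetic mean
for n, r_luts, r_delay, r_ldp in rows:
    print(f"n={n}: LUTs {r_luts:.3f}, delay {r_delay:.3f}, LDP {r_ldp:.3f}")
print(f"mean LUT reduction: {100 * (1 - mean_luts):.1f}%")
```

Under that averaging assumption, the mean normalized LUT count is consistent with the 15% reduction quoted in the caption.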
Table 4. Synthesis results for single-cycle ($m \times n$)-bit KCMs, where $m = n$, $A = -\pi/4 \cdot 2^{n}$ and $B$ is a signed variable. Proposed KCMs use 26% fewer LUTs on average compared to LogiCORE IP KCMs, a significant improvement compared to KCMs with a positive constant.

| Type | n | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 22 | 0.958 | 21.1 | 1.000 | 1.000 | 1.000 |
| $A = -\pi/4 \cdot 2^{n}$ | 10 | 28 | 1.117 | 31.3 | 1.000 | 1.000 | 1.000 |
| $B$ is signed | 12 | 47 | 1.115 | 52.4 | 1.000 | 1.000 | 1.000 |
| | 14 | 71 | 1.821 | 129.3 | 1.000 | 1.000 | 1.000 |
| | 16 | 77 | 1.860 | 143.2 | 1.000 | 1.000 | 1.000 |
| | 20 | 134 | 1.980 | 265.3 | 1.000 | 1.000 | 1.000 |
| | 24 | 183 | 2.092 | 382.8 | 1.000 | 1.000 | 1.000 |
| Proposed KCM | 8 | 17 | 1.500 | 25.5 | 0.773 | 1.566 | 1.210 |
| $A = -\pi/4 \cdot 2^{n}$ | 10 | 22 | 1.595 | 35.1 | 0.786 | 1.428 | 1.122 |
| $B$ is signed | 12 | 39 | 2.366 | 92.3 | 0.830 | 2.122 | 1.761 |
| | 14 | 46 | 2.405 | 110.6 | 0.648 | 1.321 | 0.856 |
| | 16 | 60 | 2.272 | 136.3 | 0.779 | 1.222 | 0.952 |
| | 20 | 87 | 3.190 | 277.5 | 0.649 | 1.611 | 1.046 |
| | 24 | 129 | 3.969 | 512.0 | 0.705 | 1.897 | 1.337 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Table 5. Synthesis results for pipelined ($m \times n$)-bit KCMs, where $m = n$, $A = \pi/4 \cdot 2^{n}$ and $B$ is a signed variable. Proposed KCMs use 22% fewer LUTs on average compared to LogiCORE IP KCMs and the increase in delay is less significant than it is for single-cycle units.

| Type | n | Pipeline Stages | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 2 | 20 | 0.914 | 18.3 | 1.000 | 1.000 | 1.000 |
| $A = \pi/4 \cdot 2^{n}$ | 10 | 2 | 28 | 1.000 | 28.0 | 1.000 | 1.000 | 1.000 |
| $B$ is signed | 12 | 2 | 37 | 1.066 | 39.4 | 1.000 | 1.000 | 1.000 |
| | 14 | 3 | 71 | 1.143 | 81.2 | 1.000 | 1.000 | 1.000 |
| | 16 | 3 | 84 | 1.186 | 99.6 | 1.000 | 1.000 | 1.000 |
| | 20 | 3 | 135 | 1.270 | 171.5 | 1.000 | 1.000 | 1.000 |
| | 24 | 3 | 184 | 1.334 | 245.5 | 1.000 | 1.000 | 1.000 |
| Proposed KCM | 8 | 2 | 17 | 1.122 | 19.1 | 0.850 | 1.228 | 1.043 |
| $A = \pi/4 \cdot 2^{n}$ | 10 | 2 | 22 | 1.284 | 28.2 | 0.786 | 1.284 | 1.009 |
| $B$ is signed | 12 | 3 | 39 | 1.478 | 57.6 | 1.054 | 1.386 | 1.461 |
| | 14 | 3 | 46 | 1.500 | 69.0 | 0.648 | 1.312 | 0.850 |
| | 16 | 3 | 63 | 1.538 | 96.9 | 0.750 | 1.297 | 0.973 |
| | 20 | 4 | 87 | 1.628 | 141.6 | 0.644 | 1.282 | 0.826 |
| | 24 | 5 | 138 | 1.802 | 248.7 | 0.750 | 1.351 | 1.013 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Table 6. Synthesis results for pipelined ($m \times n$)-bit KCMs, where $m = n$, $A = -\pi/4 \cdot 2^{n}$ and $B$ is a signed variable. Proposed KCMs use 33% fewer LUTs and have a 14% lower LDP on average compared to LogiCORE IP KCMs.

| Type | n | Pipeline Stages | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 2 | 24 | 0.962 | 23.1 | 1.000 | 1.000 | 1.000 |
| $A = -\pi/4 \cdot 2^{n}$ | 10 | 2 | 34 | 0.963 | 32.7 | 1.000 | 1.000 | 1.000 |
| $B$ is signed | 12 | 2 | 49 | 1.071 | 52.5 | 1.000 | 1.000 | 1.000 |
| | 14 | 3 | 80 | 1.174 | 93.9 | 1.000 | 1.000 | 1.000 |
| | 16 | 3 | 90 | 1.188 | 106.9 | 1.000 | 1.000 | 1.000 |
| | 20 | 3 | 146 | 1.279 | 186.7 | 1.000 | 1.000 | 1.000 |
| | 24 | 3 | 198 | 1.360 | 269.3 | 1.000 | 1.000 | 1.000 |
| Proposed KCM | 8 | 2 | 17 | 1.184 | 20.1 | 0.708 | 1.231 | 0.872 |
| $A = -\pi/4 \cdot 2^{n}$ | 10 | 2 | 22 | 1.227 | 27.0 | 0.647 | 1.274 | 0.824 |
| $B$ is signed | 12 | 3 | 39 | 1.450 | 56.6 | 0.796 | 1.354 | 1.078 |
| | 14 | 3 | 46 | 1.480 | 68.1 | 0.575 | 1.261 | 0.725 |
| | 16 | 3 | 60 | 1.505 | 90.3 | 0.667 | 1.267 | 0.845 |
| | 20 | 4 | 87 | 1.615 | 140.5 | 0.596 | 1.263 | 0.752 |
| | 24 | 5 | 138 | 1.739 | 240.0 | 0.697 | 1.279 | 0.891 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Table 7. Synthesis results for single-cycle ($m \times n$)-bit KCMs with two selectable coefficients, where $m = n$, $A[k] = \pm\pi/4 \cdot 2^{n}$ and both $B[k]$ are signed variables. Proposed units use 62% fewer LUTs on average compared to LogiCORE IP KCM-based units.

| Type | n | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 67 | 2.861 | 191.7 | 1.136 | 2.199 | 2.497 |
| Lookup + Multiplier | 10 | 104 | 3.676 | 382.3 | 1.368 | 2.749 | 3.762 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 12 | 151 | 3.935 | 594.2 | 1.438 | 2.752 | 3.957 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 14 | 205 | 4.088 | 838.0 | 1.289 | 1.829 | 2.358 |
| $B[k]$ is signed | 16 | 270 | 3.980 | 1074.6 | 1.508 | 1.446 | 2.181 |
| | 20 | 420 | 5.013 | 2105.5 | 1.405 | 1.843 | 2.589 |
| | 24 | 603 | 4.982 | 3004.1 | 1.508 | 1.727 | 2.603 |
| LogiCORE IP | 8 | 59 | 1.301 | 76.8 | 1.000 | 1.000 | 1.000 |
| KCMs + MUX | 10 | 76 | 1.337 | 101.6 | 1.000 | 1.000 | 1.000 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 12 | 105 | 1.430 | 150.2 | 1.000 | 1.000 | 1.000 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 14 | 159 | 2.235 | 355.4 | 1.000 | 1.000 | 1.000 |
| $B[k]$ is signed | 16 | 179 | 2.753 | 492.8 | 1.000 | 1.000 | 1.000 |
| | 20 | 299 | 2.720 | 813.3 | 1.000 | 1.000 | 1.000 |
| | 24 | 400 | 2.885 | 1154.0 | 1.000 | 1.000 | 1.000 |
| Proposed KCM | 8 | 18 | 1.468 | 26.4 | 0.305 | 1.128 | 0.344 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 10 | 33 | 2.211 | 73.0 | 0.434 | 1.654 | 0.718 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 12 | 40 | 2.305 | 92.2 | 0.381 | 1.612 | 0.614 |
| $B[k]$ is signed | 14 | 61 | 3.146 | 191.9 | 0.384 | 1.408 | 0.540 |
| | 16 | 70 | 3.210 | 224.7 | 0.391 | 1.166 | 0.456 |
| | 20 | 108 | 4.012 | 433.3 | 0.361 | 1.475 | 0.533 |
| | 24 | 154 | 5.009 | 771.4 | 0.385 | 1.736 | 0.668 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Table 8. Synthesis results for pipelined ($m \times n$)-bit KCMs with two selectable coefficients, where $m = n$, $A[k] = \pm\pi/4 \cdot 2^{n}$ and both $B[k]$ are signed variables. Proposed units use 64% fewer LUTs and have a 58% lower LDP on average compared to LogiCORE IP KCM-based units.

| Type | n | Pipeline Stages | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 4 | 67 | 1.510 | 101.2 | 1.117 | 1.593 | 1.779 |
| Lookup + Multiplier | 10 | 5 | 106 | 1.476 | 156.5 | 1.268 | 1.440 | 1.861 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 12 | 5 | 153 | 1.604 | 245.4 | 1.385 | 1.489 | 2.091 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 14 | 5 | 207 | 1.590 | 329.1 | 1.152 | 1.355 | 1.576 |
| $B[k]$ is signed | 16 | 5 | 272 | 1.825 | 496.4 | 1.364 | 1.451 | 1.916 |
| | 20 | 6 | 426 | 1.917 | 816.6 | 1.350 | 1.425 | 1.903 |
| | 24 | 6 | 609 | 1.932 | 1176.6 | 1.436 | 1.442 | 2.052 |
| LogiCORE IP | 8 | 3 | 60 | 0.948 | 56.9 | 1.000 | 1.000 | 1.000 |
| KCMs + MUX | 10 | 3 | 82 | 1.025 | 84.1 | 1.000 | 1.000 | 1.000 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 12 | 3 | 109 | 1.077 | 117.4 | 1.000 | 1.000 | 1.000 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 14 | 4 | 178 | 1.173 | 208.8 | 1.000 | 1.000 | 1.000 |
| $B[k]$ is signed | 16 | 4 | 206 | 1.258 | 259.1 | 1.000 | 1.000 | 1.000 |
| | 20 | 4 | 319 | 1.345 | 429.1 | 1.000 | 1.000 | 1.000 |
| | 24 | 4 | 428 | 1.340 | 573.5 | 1.000 | 1.000 | 1.000 |
| Proposed KCM | 8 | 2 | 18 | 1.211 | 21.8 | 0.300 | 1.277 | 0.383 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 10 | 3 | 33 | 1.404 | 46.3 | 0.402 | 1.370 | 0.551 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 12 | 3 | 40 | 1.256 | 50.2 | 0.367 | 1.166 | 0.428 |
| $B[k]$ is signed | 14 | 4 | 61 | 1.429 | 87.2 | 0.343 | 1.218 | 0.417 |
| | 16 | 4 | 70 | 1.407 | 98.5 | 0.340 | 1.118 | 0.380 |
| | 20 | 5 | 116 | 1.379 | 160.0 | 0.364 | 1.025 | 0.373 |
| | 24 | 6 | 170 | 1.451 | 246.7 | 0.397 | 1.083 | 0.430 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Table 9. Synthesis results for single-cycle ($m \times n$)-bit KCMs with four selectable coefficients, where $m = n$, $A[k] = \pm\pi/4 \cdot 2^{n}$ or $\pm 3\pi/4 \cdot 2^{n-2}$ and all $B[k]$ are signed variables. Proposed units use 64% fewer LUTs on average compared to LogiCORE IP multiplier-based units and 73% fewer LUTs on average compared to LogiCORE IP KCM-based units.

| Type | n | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 67 | 2.903 | 194.5 | 1.000 | 1.000 | 1.000 |
| Lookup + Multiplier | 10 | 106 | 3.740 | 396.4 | 1.000 | 1.000 | 1.000 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 12 | 153 | 3.922 | 600.1 | 1.000 | 1.000 | 1.000 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 14 | 207 | 3.811 | 788.9 | 1.000 | 1.000 | 1.000 |
| $A[2] = 3\pi/4 \cdot 2^{n-2}$ | 16 | 272 | 4.068 | 1106.5 | 1.000 | 1.000 | 1.000 |
| $A[3] = -3\pi/4 \cdot 2^{n-2}$ | 20 | 422 | 4.885 | 2061.5 | 1.000 | 1.000 | 1.000 |
| $B[k]$ is signed | 24 | 605 | 5.128 | 3102.4 | 1.000 | 1.000 | 1.000 |
| LogiCORE IP | 8 | 97 | 1.500 | 145.5 | 1.448 | 0.517 | 0.748 |
| KCMs + MUX | 10 | 127 | 1.528 | 194.1 | 1.198 | 0.409 | 0.489 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 12 | 191 | 1.591 | 303.9 | 1.248 | 0.406 | 0.506 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 14 | 291 | 2.633 | 766.2 | 1.406 | 0.691 | 0.971 |
| $A[2] = 3\pi/4 \cdot 2^{n-2}$ | 16 | 331 | 3.102 | 1026.8 | 1.217 | 0.763 | 0.928 |
| $A[3] = -3\pi/4 \cdot 2^{n-2}$ | 20 | 554 | 2.872 | 1591.1 | 1.313 | 0.588 | 0.772 |
| $B[k]$ is signed | 24 | 774 | 3.019 | 2336.7 | 1.279 | 0.589 | 0.753 |
| Proposed KCM | 8 | 26 | 2.190 | 56.9 | 0.388 | 0.754 | 0.293 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 10 | 39 | 1.894 | 73.9 | 0.368 | 0.506 | 0.186 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 12 | 52 | 3.178 | 165.3 | 0.340 | 0.810 | 0.275 |
| $A[2] = 3\pi/4 \cdot 2^{n-2}$ | 14 | 75 | 3.855 | 289.1 | 0.362 | 1.012 | 0.367 |
| $A[3] = -3\pi/4 \cdot 2^{n-2}$ | 16 | 95 | 3.604 | 342.4 | 0.349 | 0.886 | 0.309 |
| $B[k]$ is signed | 20 | 148 | 5.464 | 808.7 | 0.351 | 1.119 | 0.392 |
| | 24 | 202 | 6.453 | 1303.5 | 0.334 | 1.258 | 0.420 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Table 10. Synthesis results for pipelined ($m \times n$)-bit KCMs with four selectable coefficients, where $m = n$, $A[k] = \pm\pi/4 \cdot 2^{n}$ or $\pm 3\pi/4 \cdot 2^{n-2}$ and all $B[k]$ are signed variables. Proposed units use 63% fewer LUTs on average compared to LogiCORE IP multiplier-based units and 74% fewer LUTs on average compared to LogiCORE IP KCM-based units.

| Type | n | Pipeline Stages | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 4 | 69 | 1.635 | 112.8 | 1.000 | 1.000 | 1.000 |
| Lookup + Multiplier | 10 | 5 | 108 | 1.466 | 158.3 | 1.000 | 1.000 | 1.000 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 12 | 5 | 155 | 1.671 | 259.0 | 1.000 | 1.000 | 1.000 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 14 | 5 | 209 | 1.640 | 342.8 | 1.000 | 1.000 | 1.000 |
| $A[2] = 3\pi/4 \cdot 2^{n-2}$ | 16 | 5 | 274 | 1.731 | 474.3 | 1.000 | 1.000 | 1.000 |
| $A[3] = -3\pi/4 \cdot 2^{n-2}$ | 20 | 6 | 428 | 1.817 | 777.7 | 1.000 | 1.000 | 1.000 |
| $B[k]$ is signed | 24 | 6 | 611 | 1.820 | 1112.0 | 1.000 | 1.000 | 1.000 |
| LogiCORE IP | 8 | 3 | 103 | 1.030 | 106.1 | 1.493 | 0.630 | 0.940 |
| KCMs + MUX | 10 | 3 | 145 | 1.066 | 154.6 | 1.343 | 0.727 | 0.976 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 12 | 3 | 200 | 1.137 | 227.4 | 1.290 | 0.680 | 0.878 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 14 | 4 | 327 | 1.624 | 531.0 | 1.565 | 0.990 | 1.549 |
| $A[2] = 3\pi/4 \cdot 2^{n-2}$ | 16 | 4 | 385 | 1.403 | 540.2 | 1.405 | 0.811 | 1.139 |
| $A[3] = -3\pi/4 \cdot 2^{n-2}$ | 20 | 4 | 593 | 1.515 | 898.4 | 1.386 | 0.834 | 1.155 |
| $B[k]$ is signed | 24 | 4 | 823 | 1.532 | 1260.8 | 1.347 | 0.842 | 1.134 |
| Proposed KCM | 8 | 3 | 26 | 1.365 | 35.5 | 0.377 | 0.835 | 0.315 |
| $A[0] = \pi/4 \cdot 2^{n}$ | 10 | 3 | 39 | 1.482 | 57.8 | 0.361 | 1.011 | 0.365 |
| $A[1] = -\pi/4 \cdot 2^{n}$ | 12 | 4 | 52 | 1.404 | 73.0 | 0.335 | 0.840 | 0.282 |
| $A[2] = 3\pi/4 \cdot 2^{n-2}$ | 14 | 5 | 80 | 1.561 | 124.9 | 0.383 | 0.952 | 0.364 |
| $A[3] = -3\pi/4 \cdot 2^{n-2}$ | 16 | 5 | 102 | 1.520 | 155.0 | 0.372 | 0.878 | 0.327 |
| $B[k]$ is signed | 20 | 7 | 165 | 1.628 | 268.6 | 0.386 | 0.896 | 0.345 |
| | 24 | 8 | 226 | 1.712 | 386.9 | 0.370 | 0.941 | 0.348 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Table 11. Synthesis results for single-cycle ($m \times n$)-bit KCMs with eight selectable coefficients, where $m = n$, $A[k] = \pm\pi/4 \cdot 2^{n}$, $\pm 3\pi/4 \cdot 2^{n-2}$, $\pm 5\pi/4 \cdot 2^{n-2}$ or $\pm 7\pi/4 \cdot 2^{n-3}$ and all $B[k]$ are signed variables. Proposed units use 51% fewer LUTs on average compared to LogiCORE IP multiplier-based units.

| Type | n | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 73 | 2.830 | 206.6 | 1.000 | 1.000 | 1.000 |
| Lookup + Multiplier | 10 | 110 | 3.838 | 422.2 | 1.000 | 1.000 | 1.000 |
| $A[0], A[1] = \pm\pi/4 \cdot 2^{n}$ | 12 | 157 | 3.892 | 611.0 | 1.000 | 1.000 | 1.000 |
| $A[2], A[3] = \pm 3\pi/4 \cdot 2^{n-2}$ | 14 | 212 | 3.892 | 825.1 | 1.000 | 1.000 | 1.000 |
| $A[4], A[5] = \pm 5\pi/4 \cdot 2^{n-2}$ | 16 | 278 | 4.188 | 1164.3 | 1.000 | 1.000 | 1.000 |
| $A[6], A[7] = \pm 7\pi/4 \cdot 2^{n-3}$ | 20 | 428 | 4.968 | 2126.3 | 1.000 | 1.000 | 1.000 |
| $B[k]$ is signed | 24 | 611 | 5.113 | 3124.0 | 1.000 | 1.000 | 1.000 |
| Proposed KCM | 8 | 35 | 3.018 | 105.6 | 0.479 | 1.066 | 0.511 |
| $A[0], A[1] = \pm\pi/4 \cdot 2^{n}$ | 10 | 54 | 3.642 | 196.7 | 0.491 | 0.949 | 0.466 |
| $A[2], A[3] = \pm 3\pi/4 \cdot 2^{n-2}$ | 12 | 77 | 4.428 | 341.0 | 0.490 | 1.138 | 0.558 |
| $A[4], A[5] = \pm 5\pi/4 \cdot 2^{n-2}$ | 14 | 104 | 5.209 | 541.7 | 0.491 | 1.338 | 0.657 |
| $A[6], A[7] = \pm 7\pi/4 \cdot 2^{n-3}$ | 16 | 135 | 6.014 | 811.9 | 0.486 | 1.436 | 0.697 |
| $B[k]$ is signed | 20 | 209 | 7.749 | 1619.5 | 0.488 | 1.560 | 0.762 |
| | 24 | 299 | 9.067 | 2711.0 | 0.489 | 1.773 | 0.868 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Table 12. Synthesis results for pipelined ($m \times n$)-bit KCMs with eight selectable coefficients, where $m = n$, $A[k] = \pm\pi/4 \cdot 2^{n}$, $\pm 3\pi/4 \cdot 2^{n-2}$, $\pm 5\pi/4 \cdot 2^{n-2}$ or $\pm 7\pi/4 \cdot 2^{n-3}$ and all $B[k]$ are signed variables. Proposed units use 47% fewer LUTs and have 15% less delay on average compared to LogiCORE IP multiplier-based units.

| Type | n | Pipeline Stages | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized Delay | Normalized LDP |
|---|---|---|---|---|---|---|---|---|
| LogiCORE IP | 8 | 4 | 73 | 1.626 | 118.7 | 1.000 | 1.000 | 1.000 |
| Lookup + Multiplier | 10 | 5 | 112 | 1.460 | 163.5 | 1.000 | 1.000 | 1.000 |
| $A[0], A[1] = \pm\pi/4 \cdot 2^{n}$ | 12 | 5 | 159 | 1.690 | 268.7 | 1.000 | 1.000 | 1.000 |
| $A[2], A[3] = \pm 3\pi/4 \cdot 2^{n-2}$ | 14 | 5 | 214 | 1.648 | 352.7 | 1.000 | 1.000 | 1.000 |
| $A[4], A[5] = \pm 5\pi/4 \cdot 2^{n-2}$ | 16 | 5 | 280 | 1.721 | 481.9 | 1.000 | 1.000 | 1.000 |
| $A[6], A[7] = \pm 7\pi/4 \cdot 2^{n-3}$ | 20 | 6 | 434 | 1.894 | 822.0 | 1.000 | 1.000 | 1.000 |
| $B[k]$ is signed | 24 | 6 | 617 | 1.988 | 1226.6 | 1.000 | 1.000 | 1.000 |
| Proposed KCM | 8 | 4 | 35 | 1.404 | 49.1 | 0.479 | 0.863 | 0.414 |
| $A[0], A[1] = \pm\pi/4 \cdot 2^{n}$ | 10 | 5 | 58 | 1.234 | 71.6 | 0.518 | 0.845 | 0.438 |
| $A[2], A[3] = \pm 3\pi/4 \cdot 2^{n-2}$ | 12 | 6 | 85 | 1.433 | 121.8 | 0.535 | 0.848 | 0.453 |
| $A[4], A[5] = \pm 5\pi/4 \cdot 2^{n-2}$ | 14 | 7 | 116 | 1.364 | 158.2 | 0.542 | 0.828 | 0.449 |
| $A[6], A[7] = \pm 7\pi/4 \cdot 2^{n-3}$ | 16 | 8 | 151 | 1.549 | 233.9 | 0.539 | 0.900 | 0.485 |
| $B[k]$ is signed | 20 | 10 | 233 | 1.583 | 368.8 | 0.537 | 0.836 | 0.449 |
| | 24 | 12 | 331 | 1.669 | 552.4 | 0.536 | 0.840 | 0.450 |

Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Table 13. Slice utilization for directed acyclic graph (DAG) fusion [11,13], pipelined adder graph (PAG) fusion [13] and proposed ($16 \times 16$)-bit KCMs with two to eight selectable coefficients. DAG fusion and PAG fusion units are normalized to CoreGen as presented in Möller et al. [13]; proposed units are normalized to a LogiCORE IP multiplier-based unit. Column headings give the number of selectable coefficients.

| Type | Metric | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| CoreGen, pipelined [13] | Slices | 107 | 107 | 107 | 107 | 107 | 107 | 107 |
| | Normalized | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| DAG Fusion, single-cycle [11,13] | Slices | 35 | 64 | 76 | 100 | 112 | 127 | 138 |
| | Normalized | 0.327 | 0.598 | 0.710 | 0.935 | 1.047 | 1.187 | 1.290 |
| DAG Fusion, pipelined [11,13] | Slices | 67 | 93 | 105 | 131 | 146 | 163 | 175 |
| | Normalized | 0.626 | 0.869 | 0.981 | 1.224 | 1.364 | 1.523 | 1.636 |
| PAG Fusion, pipelined [13] | Slices | 63 | 82 | 96 | 113 | 122 | 140 | 157 |
| | Normalized | 0.589 | 0.766 | 0.897 | 1.056 | 1.140 | 1.308 | 1.467 |
| PAG Fusion Ternary, pipelined [13] | Slices | 43 | 58 | 71 | 84 | 100 | 120 | 130 |
| | Normalized | 0.402 | 0.542 | 0.664 | 0.785 | 0.935 | 1.121 | 1.215 |
| LogiCORE IP, pipelined | Slices | 96 | | 101 | | | | 95 |
| | Normalized | 1.000 | | 1.000 | | | | 1.000 |
| Proposed, single-cycle | Slices | 21 | | 28 | | | | 40 |
| | Normalized | 0.219 | | 0.277 | | | | 0.421 |
| Proposed, pipelined | Slices | 30 | | 32 | | | | 49 |
| | Normalized | 0.313 | | 0.317 | | | | 0.516 |

One slice contains four LUT6s and eight flip-flops (see Figure 1).
Table 14. Maximum operating frequency for DAG fusion [11,13], PAG fusion [13] and proposed ($16 \times 16$)-bit KCMs with two to eight selectable coefficients. DAG fusion and PAG fusion units are normalized to CoreGen as presented in Möller et al. [13]; proposed units are normalized to a LogiCORE IP multiplier-based unit. Column headings give the number of selectable coefficients.

| Type | Metric | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| CoreGen, pipelined [13] | Freq (MHz) | 443 | 443 | 443 | 443 | 443 | 443 | 443 |
| | Normalized | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| DAG Fusion, single-cycle [11,13] | Freq (MHz) | 206 | 177 | 160 | 135 | 126 | 116 | 109 |
| | Normalized | 0.465 | 0.399 | 0.362 | 0.304 | 0.284 | 0.261 | 0.247 |
| DAG Fusion, pipelined [11,13] | Freq (MHz) | 474 | 478 | 475 | 462 | 450 | 450 | 454 |
| | Normalized | 1.072 | 1.080 | 1.074 | 1.043 | 1.018 | 1.018 | 1.026 |
| PAG Fusion, pipelined [13] | Freq (MHz) | 442 | 417 | 437 | 451 | 460 | 453 | 460 |
| | Normalized | 0.998 | 0.942 | 0.987 | 1.019 | 1.038 | 1.024 | 1.040 |
| PAG Fusion Ternary, pipelined [13] | Freq (MHz) | 346 | 324 | 322 | 315 | 312 | 306 | 304 |
| | Normalized | 0.782 | 0.732 | 0.728 | 0.712 | 0.705 | 0.692 | 0.686 |
| LogiCORE IP, pipelined | Freq (MHz) | 548 | | 577 | | | | 581 |
| | Normalized | 1.000 | | 1.000 | | | | 1.000 |
| Proposed, single-cycle | Freq (MHz) | 312 | | 277 | | | | 166 |
| | Normalized | 0.569 | | 0.480 | | | | 0.280 |
| Proposed, pipelined | Freq (MHz) | 711 | | 658 | | | | 646 |
| | Normalized | 1.297 | | 1.138 | | | | 1.111 |
Table 15. Comparison of normalized slice utilization for PAG fusion [13] and proposed ($16 \times 16$)-bit KCMs with two to eight selectable coefficients. Proposed units use 57% fewer slices on average compared to PAG fusion units with two-operand adders and 44% fewer slices on average compared to PAG fusion units with ternary adders. Column headings give the number of selectable coefficients.

| Type | Metric | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| (a) PAG Fusion, pipelined [13] | Normalized Slices | 0.589 | 0.766 | 0.897 | 1.056 | 1.140 | 1.308 | 1.467 |
| | Normalized to (a) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | Normalized to (b) | 1.465 | 1.414 | 1.352 | 1.345 | 1.220 | 1.167 | 1.208 |
| (b) PAG Fusion Ternary, pipelined [13] | Normalized Slices | 0.402 | 0.542 | 0.664 | 0.785 | 0.935 | 1.121 | 1.215 |
| | Normalized to (a) | 0.683 | 0.707 | 0.740 | 0.743 | 0.820 | 0.857 | 0.828 |
| | Normalized to (b) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Proposed, pipelined | Normalized Slices | 0.313 | 0.317 | 0.317 | 0.516 | 0.516 | 0.516 | 0.516 |
| | Normalized to (a) | 0.531 | 0.413 | 0.353 | 0.488 | 0.452 | 0.394 | 0.352 |
| | Normalized to (b) | 0.778 | 0.584 | 0.477 | 0.657 | 0.552 | 0.460 | 0.425 |
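The average savings quoted in the caption of Table 15 can be checked from the "Normalized Slices" rows. The sketch below assumes a plain arithmetic mean over the seven coefficient counts and works from the rounded values in the table, so individual ratios may differ from the tabulated ones in the last digit. The names `relative` and `mean_saving_pct` are illustrative. The repeated proposed values suggest that the four-coefficient unit is used for three coefficients and the eight-coefficient unit for five to seven.

```python
# "Normalized Slices" rows from Table 15, for 2 to 8 selectable coefficients.
PAG = [0.589, 0.766, 0.897, 1.056, 1.140, 1.308, 1.467]
PAG_TERNARY = [0.402, 0.542, 0.664, 0.785, 0.935, 1.121, 1.215]
PROPOSED = [0.313, 0.317, 0.317, 0.516, 0.516, 0.516, 0.516]


def relative(values, reference):
    """Element-wise ratio of one row to a reference row."""
    return [v / r for v, r in zip(values, reference)]


def mean_saving_pct(ratios):
    """Average saving in percent, assuming an arithmetic mean of the ratios."""
    return 100 * (1 - sum(ratios) / len(ratios))


vs_pag = relative(PROPOSED, PAG)
vs_ternary = relative(PROPOSED, PAG_TERNARY)
print(f"vs. PAG fusion: {mean_saving_pct(vs_pag):.1f}% fewer slices")
print(f"vs. PAG fusion ternary: {mean_saving_pct(vs_ternary):.1f}% fewer slices")
```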
Table 16. Comparison of normalized maximum operating frequency for PAG fusion [13] and proposed ($16 \times 16$)-bit KCMs with two to eight selectable coefficients. Proposed units can operate 14% faster on average compared to PAG fusion units with two-operand adders and 40% faster on average compared to PAG fusion units with ternary adders. Column headings give the number of selectable coefficients.

| Type | Metric | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| (a) PAG Fusion, pipelined [13] | Normalized Freq (MHz) | 0.998 | 0.942 | 0.987 | 1.019 | 1.038 | 1.024 | 1.040 |
| | Normalized to (a) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | Normalized to (b) | 1.277 | 1.287 | 1.357 | 1.432 | 1.473 | 1.479 | 1.516 |
| (b) PAG Fusion Ternary, pipelined [13] | Normalized Freq (MHz) | 0.782 | 0.732 | 0.728 | 0.712 | 0.705 | 0.692 | 0.686 |
| | Normalized to (a) | 0.783 | 0.777 | 0.737 | 0.698 | 0.679 | 0.676 | 0.659 |
| | Normalized to (b) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Proposed, pipelined | Normalized Freq (MHz) | 1.297 | 1.138 | 1.138 | 1.111 | 1.111 | 1.111 | 1.111 |
| | Normalized to (a) | 1.299 | 1.209 | 1.154 | 1.090 | 1.070 | 1.085 | 1.068 |
| | Normalized to (b) | 1.277 | 1.287 | 1.357 | 1.432 | 1.473 | 1.479 | 1.516 |

Walters, E.G., III. Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs. Electronics 2017, 6, 101. https://doi.org/10.3390/electronics6040101
