Array Multipliers for High Throughput in Xilinx FPGAs with 6-Input LUTs †

† This paper is an extended version of our paper published in Walters III, E.G. Partial-Product Generation and Addition for Multiplication in FPGAs With 6-Input LUTs. In Proceedings of the 48th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2–5 November 2014; pp. 1247–1251.


Introduction
Field-programmable gate arrays (FPGAs) are often used in signal processing systems for many applications, such as digital signal processing (DSP), video processing and image processing. For these applications and others, computation of a sum of products is very common. As a result, multiplication is often the focus of efforts to reduce required resources, delay and power. For this reason, most contemporary FPGAs have embedded hard multipliers distributed throughout the fabric. Even so, soft multipliers using lookup tables (LUTs) in the configurable logic fabric remain important for high-performance designs for several reasons:
• Flexible size and type: Embedded multiplier operands are fixed in size and type, e.g., 25 × 18 two's complement, while LUT-based multiplier operands can be any size or type.
• Flexible placement: The number and location of embedded multipliers are fixed, while LUT-based multipliers can be placed anywhere, and the number is limited only by the size of the reconfigurable fabric.
• Configurable: Embedded multipliers cannot be modified, while LUT-based multipliers can use techniques, such as merged arithmetic [1] and truncated-matrix arithmetic [2][3][4][5][6], to optimize the overall system.
• Hybrids: LUT-based multipliers are often combined with embedded multipliers to make larger multipliers.
The new optimizations further reduce the number of required LUTs by approximately 10% and provide approximately a 1.5-times speedup compared to [22]. The proposed multipliers are believed to be the only designs to date that produce better results than LogiCORE IP LUT-based multipliers.
The paper is organized as follows. Section 2 gives background information. Section 3 discusses related work using GPCs. Section 4 describes the proposed two-operand adder, and Section 5 describes the proposed LUT-based array multipliers. Synthesis results are discussed in Section 6, and conclusions are given in Section 7.

Background
This section describes the details of the Xilinx logic fabric, two-operand addition in the Xilinx logic fabric, Altera logic fabric and radix-4-modified Booth multiplication.

Xilinx Logic Fabric
The main logic resource for implementing combinational and sequential circuits in a Xilinx FPGA is the configurable logic block (CLB). Each CLB has two slices. Figure 1 is a partial diagram of a 7-Series FPGA slice. Each slice has four 6-input lookup tables (LUT6s) designated A, B, C and D. Each LUT6 is composed of two 5-input lookup tables (LUT5s) and a two-to-one multiplexer. The two LUT5s are 32 × 1 memories that share five inputs designated I5:I1. The memory values are designated M[63:32] in one LUT5 and M[31:0] in the other LUT5. The output of the M[31:0] LUT5 is designated O5. The sixth input, I6, is input to a multiplexer that selects one of the LUT5 outputs. The selected output is designated O6. The LUT6 is normally configured as either two LUT5s with five shared inputs and two outputs by connecting I6 to logic "1", or as one LUT6 with six inputs and one output by connecting I6 to the sixth input [25,26].
A multiplexer and an XOR gate, indicated in Figure 1 as MUXCY and XORCY respectively, are associated with each LUT6. Inputs to the MUXCY associated with the A LUT6 are a select signal, $prop_i$, a first data input, $gen_i$, and a second data input, $c_i$. The output of the MUXCY, $c_{i+1}$, is connected to the MUXCY associated with the B LUT6. These connections continue through the C and D LUT6s to form a fast carry chain within the slice. The $c_{i+4}$ output of the slice, COUT, can be connected to the $c_i$ input of the next slice, CIN, to form longer carry chains. The prop signal is driven by the O6 output of the corresponding LUT6. The gen signal is selected by a configuration multiplexer and is either the O5 output of the corresponding LUT6 or the bypass input, which is designated AX, BX, CX or DX. The fast carry logic in a slice, which includes four MUXCYs, four XORCYs and the fast carry chain, is called a CARRY4 [26].
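The LUT6 and carry-cell behavior described above can be sketched as a small bit-level model. This is an illustrative Python sketch, not vendor code: the function names `lut6` and `carry_cell` are hypothetical, and the address ordering (I1 as the least-significant address bit) is an assumption made for the example.

```python
def lut6(init, i6, i5, i4, i3, i2, i1):
    """Return (O5, O6) for a LUT6 whose memory bits M[63:0] are packed in `init`.

    Assumption: I5:I1 form a 5-bit address with I1 as the least-significant bit.
    """
    addr = (i5 << 4) | (i4 << 3) | (i3 << 2) | (i2 << 1) | i1
    o5 = (init >> addr) & 1             # output of the M[31:0] LUT5
    upper = (init >> (32 + addr)) & 1   # output of the M[63:32] LUT5
    o6 = upper if i6 else o5            # I6 selects which LUT5 drives O6
    return o5, o6


def carry_cell(prop, gen, c_in):
    """Return (sum, carry_out) for one MUXCY/XORCY pair of the carry chain."""
    s = prop ^ c_in                     # XORCY
    c_out = c_in if prop else gen       # MUXCY: prop = 1 passes the carry-in
    return s, c_out
```

With I6 tied to "1", the two return values behave as two independent functions of the five shared inputs; with I6 used as a data input, O6 is a single function of six inputs.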
Two flip-flops are associated with each LUT6. One flip-flop can be used to register O5 or the bypass input. The other flip-flop can be used to register O5, O6, the bypass input, the MUXCY output or the XORCY output.
The Spartan-6, Virtex-5, Virtex-6 and UltraScale families are similar to the 7-Series. One notable difference is that the Spartan-6 family does not have fast carry chains in every column of slices.

Two-Operand Addition
Suppose X and Y are to be added using the Xilinx fast carry logic. For the i-th column of the adder, $x_i$ and $y_i$ are the bits of X and Y, respectively; $c_i$ is the carry-in bit; $c_{i+1}$ is the carry-out bit; and $s_i$ is the sum bit. A truth table can be made for the adder; then, required values for $prop_i$ and $gen_i$ can be derived from the table. Figure 1 shows that $s_i = prop_i \oplus c_i$, so $prop_i$ must have the same value as $s_i \oplus c_i$ to produce the correct value for the sum bit. When $prop_i = 0$, the generate signal becomes the carry out, so $gen_i$ must have the same value as the expected value of $c_{i+1}$. When $prop_i = 1$, the generate signal is not used, so it is a don't-care. These values are given in Table 1. Next, $prop_i$ and $gen_i$ are expressed as functions of $x_i$ and $y_i$. Inspection of the truth table shows that $prop_i = x_i \oplus y_i$ and that the generate signal can be either $gen_i = x_i$ or $gen_i = y_i$.
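The derivation above can be checked with a short behavioral model of the carry chain. The following Python sketch is illustrative only; the function name `carry_chain_add` and its parameters are hypothetical, and it arbitrarily chooses $gen_i = x_i$ (choosing $y_i$ gives the same sums).

```python
def carry_chain_add(x, y, width, c0=0):
    """Add two unsigned `width`-bit values with prop/gen carry-chain logic."""
    carry = c0
    total = 0
    for i in range(width):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        prop = xi ^ yi                      # prop_i = x_i XOR y_i
        gen = xi                            # gen_i = x_i (y_i works as well)
        s = prop ^ carry                    # XORCY: s_i = prop_i XOR c_i
        carry = carry if prop else gen      # MUXCY: c_{i+1}
        total |= s << i
    return total | (carry << width)         # final carry becomes the MSB
```

For example, `carry_chain_add(9, 7, 4)` models a 4-bit adder and returns the 5-bit sum of the two operands.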

Altera Logic Fabric
The multipliers proposed in this paper are specific to the Xilinx LUT6 architecture and are not applicable to Altera FPGAs. The Altera logic fabric is briefly described here to give context to related work on generalized parallel counters (GPCs).
The main logic resource for implementing combinational and sequential circuits in an Altera Stratix V FPGA is the logic array block (LAB) [27]. Each LAB in the Stratix V has ten adaptive logic modules (ALMs). The ALM has evolved, but the general functionality described in this section applies to the older Stratix II family [28] through the latest family, the Stratix 10 [29].
The capabilities of an ALM can be compared to a Xilinx LUT6 and its associated MUXCY, XORCY and flip-flops. An ALM can be configured to implement two functions of six inputs, provided that four of the inputs are common, as shown in Figure 2. By comparison, a Xilinx LUT6 can implement one function of six inputs or two functions of five shared inputs. Each ALM includes two full adders and dedicated carry connections to implement fast addition. Figure 3 shows an Altera ALM in arithmetic mode.

Radix-4-Modified Booth Multipliers
Suppose A and B are to be multiplied. If the multiplicand, A, is an m-bit two's-complement integer and the multiplier, B, is an n-bit two's-complement integer, then:
$$A = -a_{m-1}2^{m-1} + \sum_{i=0}^{m-2} a_i 2^i, \qquad B = -b_{n-1}2^{n-1} + \sum_{j=0}^{n-2} b_j 2^j.$$
MacSorley's modified Booth recoding algorithm works for both unsigned and two's-complement multipliers [30]. First, $b_{-1}$ is concatenated to the right of B and set to "0". For two's-complement multipliers, n must be even. If it is not, B is sign extended by one bit to make n even. For unsigned multipliers with odd values of n, B is zero-extended with one "0" to make n even. If n is already even, an unsigned B is zero-extended with two "0"s.
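As an illustration of the recoding step, the sketch below uses the standard radix-4 modified Booth digit, $d_\rho = b_{2\rho-1} + b_{2\rho} - 2b_{2\rho+1}$, for a two's-complement multiplier with even n. The function name `booth_radix4_digits` is hypothetical, and the code is a minimal model rather than the paper's implementation.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit two's-complement integer b (n even) into n/2 digits.

    Each digit is in {-2, -1, 0, 1, 2}; digit rho has weight 4**rho.
    """
    assert n % 2 == 0, "sign-extend B by one bit first if n is odd"
    digits = []
    prev = 0                                # b_{-1} is set to 0
    for rho in range(n // 2):
        b_mid = (b >> (2 * rho)) & 1        # b_{2 rho}
        b_hi = (b >> (2 * rho + 1)) & 1     # b_{2 rho + 1}
        digits.append(prev + b_mid - 2 * b_hi)
        prev = b_hi                         # becomes b_{2 rho - 1} of the next digit
    return digits
```

Summing `d * 4**rho` over the returned digits recovers the signed value of B, which is why n/2 partial products are sufficient.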
If a partial product is +A, then the multiplicand, A, is selected. If a partial product is +2A, then the multiplicand is shifted left one bit before selection. If a partial product is −A or −2A, then A or 2A is subtracted by complementing each bit and adding "1" to the least significant bit (LSB). Table 3 summarizes partial-product generation for each selection. There are m + 1 bits in the partial product to provide for a left shift of A, with sign extension if A is not shifted. The operation bit, $op_\rho$, is set to "0" for addition or "1" for subtraction and is added to the LSB column of the partial product.
Table 3. Radix-4-modified Booth partial-product generation.
Each partial product is sign extended to the width of the multiplier in order to provide for correct addition and subtraction. Sign extension can be accomplished by complementing the sign bit, adding a "1" in the same column, and extending with constant "1"s. The constants are pre-added to reduce the number of "1"s in the matrix. Figure 4 shows the simplified partial-product matrix for a 6 × 6 multiplier [31][32][33].
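The sign-extension identity described above can be checked numerically. The sketch below is a minimal Python model under stated assumptions: the function name `sign_extend_via_constants` is hypothetical, the partial product is an (m + 1)-bit two's-complement pattern with its sign bit in column m, and the constants are not yet pre-added across rows.

```python
def sign_extend_via_constants(p, m, width):
    """Sign extend an (m+1)-bit two's-complement pattern p to `width` bits.

    Complements the sign bit, adds a "1" in the same column, and extends
    with constant "1"s in every column above the sign bit.
    """
    low = p & ((1 << m) - 1)                           # bits below the sign bit
    sign = (p >> m) & 1
    ones = ((1 << width) - 1) & ~((1 << (m + 1)) - 1)  # constant 1s above column m
    total = low + ((sign ^ 1) << m) + (1 << m) + ones
    return total & ((1 << width) - 1)                  # discard carries past the width
```

The result matches ordinary two's-complement sign extension, which is what allows the constant "1"s of all rows to be summed ahead of time.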

Related Work: Generalized Parallel Counters
The well-known Wallace tree [34] and Dadda [35] multipliers use full adders and half adders to reduce the partial-product matrix to two rows, which are then added using a final carry-propagate adder (CPA). A full adder is sometimes called a (3;2) counter, because it adds three bits in the same column and outputs a two-bit result equal to the sum of the three bits. A GPC adds bits in one or more columns and produces an n-bit result equal to the sum of the bits, taking into account the weight of the columns [36]. For example, a (5,5;4) counter adds five bits in the $2^{i+1}$ column and five bits in the $2^i$ column and outputs a four-bit result equal to the weighted sum of the ten input bits. Figure 5 shows how several (5,5;4) counters could be used to reduce five rows of bits to two rows.
Parandeh-Afshar et al. are believed to be the first to look at using GPCs implemented using LUTs to build compressor trees for multi-operand addition in FPGAs [8][9][10]. They note that modern FPGAs, such as Altera Stratix II and newer and Xilinx Virtex-5 and newer, have 6-input LUTs. Therefore, they focus on GPCs that have up to six total inputs for efficient usage of the LUTs and show that (6;3), (1,5;3), (2,3;3) and (3,3;4) counters each map to two ALMs in modern Altera FPGAs. They use a heuristic to implement multi-operand adder compressor trees with GPCs in [8], use integer linear programming (ILP) to improve the results in [9] and improve the GPCs themselves by using the ALM fast addition resources in [10]. They note that both Altera and Xilinx have efficient ternary adders, so they use GPCs to reduce the matrix to three rows. Other work on GPCs that is based on work by Parandeh-Afshar et al. presents incremental improvements or additional applications for GPCs [13][14][15][17,20]. Kumm and Zipf present two novel GPCs, (6,0,6;5) and (1,3,2,5;5), that are specific to and optimized for Xilinx FPGAs [19].
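A GPC is easy to state as a behavioral function. The following Python sketch is only an illustration of the counter notation; the function name `gpc` is hypothetical, and columns are listed most-significant first to mirror the (5,5;4) style of notation.

```python
def gpc(columns):
    """Return the weighted sum computed by a generalized parallel counter.

    `columns` is a list of bit lists, most-significant column first, so
    gpc([hi_bits, lo_bits]) models a two-column counter such as (5,5;4).
    """
    total = 0
    for bits in columns:
        total = 2 * total + sum(bits)   # each column to the left has twice the weight
    return total
```

A (5,5;4) counter with all ten inputs set returns 15, the largest value that fits in its four output bits, and a (6;3) counter with all six inputs set returns 6.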

Proposed Two-Operand Adder
Suppose X and Y are to be added using the Xilinx fast carry logic. For the i-th column of the adder, $x_i$ and $y_i$ are the bits of X and Y, respectively; $c_i$ is the carry-in bit; $c_{i+1}$ is the carry-out bit; and $s_i$ is the sum bit. The $prop_i$ signal must be set to $x_i \oplus y_i$, and the $gen_i$ signal can be set to either $x_i$ or $y_i$ to add $x_i$ and $y_i$ [22]. If $x_i$ and $y_i$ together are a function of five or fewer inputs, then the LUT6 can be configured as two LUT5s, generating either $x_i$ or $y_i$ at O5, routing it to $gen_i$ and generating $x_i \oplus y_i$ at O6 to drive $prop_i$. If $x_i$ and $y_i$ together are a function of six inputs, then the LUT6 can be configured to generate $x_i \oplus y_i$ at O6 to drive $prop_i$, and $x_i$ or $y_i$ can be applied to the bypass input and configured to drive the $gen_i$ input. A disadvantage of this configuration is that the bypass flip-flop cannot be used.
Normally, a LUT6 can be used to either generate a function of six inputs at O6 or to generate two functions of five inputs at O5 and O6 [25,26]. However, there are several useful cases where one function of six variables can be output at O6 and a separate function of five shared variables can be output at O5. Suppose $x_i$ is a function of one variable connected to I6 and $y_i$ is a function of five variables connected to I5:I1. The function $y_i$ is stored in M[31:0], so $y_i$ is output at O5. If $x_i$ is "0", $y_i$ is also output at O6. If $x_i$ is "1", the function stored in M[63:32] is output at O6. If the complement, $\overline{y_i}$, is stored in M[63:32], then $x_i \oplus y_i$ is generated at O6 and $y_i$ is generated at O5. This can be used to add $x_i$ and $y_i$ without using the bypass input when $x_i$ is a function of one variable and $y_i$ is a function of up to five variables. Figure 6 shows the connections for this configuration. This frees the bypass input to be connected to the bypass flip-flop to implement additional registers. Input I6 has the shortest delay path, and I1 has the longest [25], so this method also allows faster inputs to be used if $y_i$ is a function of fewer than five variables. The carry into the proposed adder, $c_0$, can be used to implement subtraction or to add an extra bit to the least significant column.
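The configuration described above can be sketched as follows. This Python model is illustrative only: `lut6` and `adder_bit_init` are hypothetical helper names, and the five shared inputs are represented as a single 5-bit address for brevity.

```python
def lut6(init, i6, addr):
    """Return (O5, O6) for memory contents `init` and a 5-bit address I5:I1."""
    o5 = (init >> addr) & 1                           # M[31:0] LUT5
    o6 = (init >> (32 + addr)) & 1 if i6 else o5      # I6 selects M[63:32] or M[31:0]
    return o5, o6


def adder_bit_init(y_func):
    """Build M[63:0] so that O5 = y and O6 = x XOR y when x drives I6.

    y_func maps a 5-bit address (0..31) to 0 or 1.
    """
    lower = 0
    for addr in range(32):
        lower |= (y_func(addr) & 1) << addr           # y stored in M[31:0]
    upper = lower ^ 0xFFFFFFFF                        # complement of y in M[63:32]
    return (upper << 32) | lower
```

Driving I6 with $x_i$ then yields $gen_i$ at O5 and $prop_i$ at O6 from a single LUT6, which is what leaves the bypass input and its flip-flop free.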

Proposed Multipliers
This section describes how the proposed array multipliers are implemented and pipelined.

Combined Partial-Product Generation and Addition
Partial-product generation and addition of a second value are combined into a generate-add unit, which is the main building block of the proposed array multipliers. The arithmetic operation is shown in Figure 7. Each unit generates one radix-4 partial product, $P_\rho$, with a leading "1" and the most-significant bit (MSB) complemented to implement sign extension. The operation bit, $op_\rho$, and the (m + 1) MSBs of the output from the previous generate-add unit, $X_{\rho-1}$, are added to produce an accumulated sum, $X_\rho$. The two LSBs of $X_\rho$ are bits $p_{2\rho+1}$ and $p_{2\rho}$ of the final product, so they are not added in the next unit.
The generate-add unit is shown in Figure 8. It is implemented using an (m + 2)-bit proposed two-operand adder as described in Section 4, with $X_{\rho-1}$ and $P_\rho$ as the X and Y addends, respectively. Bit i of partial product $P_\rho$, $p_{\rho,i}$, is a function of five inputs:
$$p_{\rho,i} = f(a_i, a_{i-1}, b_{2\rho+1}, b_{2\rho}, b_{2\rho-1}).$$
The inputs for each bit, $p_{\rho,i}$, are connected to the I5:I1 inputs of a LUT6. $x_{\rho-1,i+2}$ is connected to I6 of the same LUT6. The M[31:0] LUT5 is configured to generate $p_{\rho,i}$, and the M[63:32] LUT5 is configured to generate $\overline{p_{\rho,i}}$. O6 then generates $x_{\rho-1,i+2} \oplus p_{\rho,i}$ and drives $prop_i$. O5 generates $p_{\rho,i}$ and is selected to drive $gen_i$. This is done for all of the partial-product bits except the MSB, $p_{\rho,m}$. The MSB is complemented for sign extension by generating $\overline{p_{\rho,m}}$ in the M[31:0] LUT5 and $p_{\rho,m}$ in the M[63:32] LUT5. O6 then generates $x_{\rho-1,m+2} \oplus \overline{p_{\rho,m}}$ and drives $prop_m$. O5 generates $\overline{p_{\rho,m}}$ and is selected to drive $gen_m$. The leading "1", $2^{2\rho+m+1}$, is added by configuring the M[31:0] LUT5 to generate "1", configuring the M[63:32] LUT5 to generate "0" and wiring "0" to I6, so that $gen_{m+1} = 1$ and $prop_{m+1} = 0 \oplus 1 = 1$. To summarize, the M[31:0] LUT5s generate the bits of $P_\rho$; the M[63:32] LUT5s generate the complement of those bits; and the bits of $X_{\rho-1}$ to be added are wired to the I6 inputs. The operation bit, $op_\rho$, is added by wiring $b_{2\rho+1}$ to $c_0$.
The sum produced at the XORCY output is $x_{\rho,i}$, which is added to $p_{\rho+1,i-2}$ in the next generate-add unit.
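The arithmetic performed by a chain of generate-add units can be modeled at the integer level. The sketch below is a behavioral Python model under stated assumptions, not the structural Verilog used in this work: the function name `multiply_generate_add` is hypothetical, A and B are signed Python integers, n is even, and the extra constant $2^m$ of the first row is modeled as the initial accumulated value.

```python
def multiply_generate_add(a, b, m, n):
    """Multiply m-bit a by n-bit b (two's complement, n even) row by row.

    Returns the (m+n)-bit two's-complement product pattern.
    """
    assert n % 2 == 0
    mask = (1 << (m + 1)) - 1
    x_prev = 1 << (m + 2)            # after the 2-bit shift this adds 2**m in row 0
    product = 0
    for rho in range(n // 2):
        b_lo = (b >> (2 * rho - 1)) & 1 if rho > 0 else 0   # b_{-1} = 0
        b_mid = (b >> (2 * rho)) & 1
        b_hi = (b >> (2 * rho + 1)) & 1
        magnitude = abs(b_lo + b_mid - 2 * b_hi)            # select 0, A or 2A
        p = (magnitude * a) & mask                          # (m+1)-bit pattern
        if b_hi:
            p ^= mask                                       # complement for subtraction
        p ^= 1 << m                                         # complement the MSB
        p |= 1 << (m + 1)                                   # leading "1"
        x = (x_prev >> 2) + p + b_hi                        # op = b_{2 rho + 1} enters at c_0
        product |= (x & 3) << (2 * rho)                     # two LSBs are product bits
        x_prev = x
    product |= (x_prev >> 2) << n                           # remaining bits of the last row
    return product & ((1 << (m + n)) - 1)
```

For instance, `multiply_generate_add(-3, 5, 4, 4)` returns the 8-bit two's-complement pattern of −15. The model adds whole rows with integer addition, so it captures the values of $X_\rho$ but not the LUT-level mapping.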

Optimizations for the Generate-Add Unit
The most-significant LUT, shown in Figure 8, can be simplified and eliminated. Inspection of the circuit shows that the $prop_{m+1}$ input to the MUXCY is always "1". This means that the $gen_{m+1}$ input to the MUXCY is never used, so it is a don't-care. This could be implemented by storing all "1"s in the M[63:32] LUT5 and wiring "1" to the I6 input, which frees the M[31:0] LUT5 to be used for another purpose. When this is done, the Xilinx tools optimize the entire LUT6 away. The Verilog models used in this work simply assign "1" to the $prop_{m+1}$ input of the CARRY4 primitive.
Pipelined array multipliers reported in previous work [22] had an interesting result for delay. The 10 × 10 multipliers were slower than the 12 × 12 multipliers (2.402 ns vs. 2.144 ns), and the 14 × 14 multipliers were slower than the 16 × 16 multipliers (2.471 ns vs. 2.160 ns). These multipliers were implemented using the generate-add structure shown in Figure 8, which requires m + 2 LUT6s. When m + 2 is a multiple of four, (m + 2)/4 slices are fully utilized. Inspection of Figure 1 shows that the XORCY output and the MUXCY output are registered using the same flip-flop. When m + 2 is a multiple of four, such as for 10 × 10 and 14 × 14 multipliers, the $x_{\rho,m+2}$ output from the MUXCY cannot be registered within the same slice because the $x_{\rho,m+1}$ output and the other XORCY outputs use all of the available flip-flops. This forces $x_{\rho,m+2}$ to be routed outside of the slice to an available flip-flop, causing additional delay due to longer and slower interconnect.
This problem is avoided by noting that $x_{\rho,m+2} = \overline{x_{\rho,m+1}}$: because $prop_{m+1}$ is always "1", the carry out of column m + 1 equals the carry in, while the sum bit is the complement of the carry in. The $x_{\rho,m+1}$ output is used in the next row instead of $x_{\rho,m+2}$ so that the MUXCY output does not need to be registered. Figure 9 shows the arithmetic that is performed (cf. Figure 7). The optimized generate-add unit generates $P_\rho$ with a leading "1" and the MSB complemented to implement sign extension as in the original generate-add unit. The operation bit, $op_\rho$, and the (m + 1) MSBs of $X_{\rho-1}$, using $\overline{x_{\rho-1,m+1}}$ instead of $x_{\rho-1,m+2}$, are added to produce an accumulated sum, $X_\rho$. The MSB of the output, $x_{\rho,m+2}$, is not needed in the next row, so it is not produced.
The most-significant LUT6 of the optimized generate-add unit is configured differently than the other LUT6s. The MSB from the previous unit, $x_{\rho-1,m+1}$, is connected to one of the shared I5:I1 inputs, and "1" is input to I6. The M[31:0] LUT5 is configured to produce $\overline{p_{\rho,m}}$ at O5 to drive the $gen_m$ signal. The M[63:32] LUT5 is configured to produce the function $f = x_{\rho-1,m+2} \oplus \overline{p_{\rho,m}}$ at O6 to drive the $prop_m$ signal. Since $x_{\rho-1,m+2} = \overline{x_{\rho-1,m+1}}$, this is $f = \overline{x_{\rho-1,m+1}} \oplus \overline{p_{\rho,m}} = x_{\rho-1,m+1} \oplus p_{\rho,m}$. Figure 10 shows the optimized generate-add unit. The optimized generate-add unit uses only m + 1 LUT6s and avoids the delay of routing a MUXCY output out of a slice to be registered.
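The relationship between the two most-significant output bits can be confirmed exhaustively for a small operand width. The Python sketch below is illustrative; `generate_add_row` and `msb_pair` are hypothetical names, and the row is modeled with arbitrary (m + 1)-bit inputs rather than actual Booth partial products, so the check covers every input combination the unit could see.

```python
def generate_add_row(x_in, p_low, op, m):
    """Return X for one row: (m+1)-bit x_in plus P with a leading 1 at column m+1."""
    assert 0 <= x_in < (1 << (m + 1)) and 0 <= p_low < (1 << (m + 1))
    return x_in + (p_low | (1 << (m + 1))) + op


def msb_pair(x, m):
    """Return (x_{m+2}, x_{m+1}) of a row output."""
    return (x >> (m + 2)) & 1, (x >> (m + 1)) & 1


def msb_is_complement(m):
    """Check that x_{m+2} is the complement of x_{m+1} for every input."""
    for x_in in range(1 << (m + 1)):
        for p_low in range(1 << (m + 1)):
            for op in (0, 1):
                hi, lo = msb_pair(generate_add_row(x_in, p_low, op, m), m)
                if hi != lo ^ 1:
                    return False
    return True
```

Because the identity holds for all inputs, the next row can take the registered XORCY output and complement it inside its own LUT, at no extra cost.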

Array Structure and Pipelining
An array of n/2 optimized generate-add units is used to implement an m × n multiplier. Optimized generate-add units are connected in an array structure as shown in Figure 11. Each generate-add unit requires m + 1 LUT6s, so the number of LUT6s required to implement an m × n array multiplier is:
$$\#LUT6s = \frac{n}{2}(m + 1). \qquad (6)$$
Figure 11. Array structure of proposed multiplier, m = n = 6.
The multiplier can be pipelined to reduce cycle time and increase throughput for applications that can tolerate increased latency. Figure 12 shows the connections for optimized generate-add units in a pipelined m × n array multiplier with n/4 stages. The multiplier can be pipelined by placing a register after every two generate-add units to increase the maximum clock frequency with a modest increase in latency. All m bits of operand A and m + 2 bits output from the second generate-add unit are registered at the end of the first stage. The three LSBs of operand B are not needed after the first stage, so only n − 3 bits are registered. The two LSBs from the output of the first generate-add unit are also registered for a total of 2m + n + 1 bits registered at the end of the first stage. In each subsequent stage, four fewer bits of B are registered while four additional LSBs from generate-add units are registered, so 2m + n + 1 flip-flops are used to implement pipeline registers in each stage. There are n/4 − 1 pipeline registers, and m + n flip-flops are needed to register the output, so the number of flip-flops required for an n/4-stage pipelined array multiplier is:
$$\#FFs_{n/4} = \left(\frac{n}{4} - 1\right)(2m + n + 1) + m + n = \frac{n}{4}(2m + n + 1) - m - 1.$$
Each of the LUT6s used to implement the array multiplier has two flip-flops, so there are $\frac{n}{2}(2m + 2)$ flip-flops available within the footprint of the multiplier. If m ≥ n, there are enough flip-flops to implement an n/4-stage pipeline with a significant number left over for other uses. This does not imply that all flip-flops used to implement the pipeline must be placed within the footprint of the multiplier. It does imply that a large number of multipliers can be densely placed on the FPGA fabric, and there will be enough flip-flops available within the logic of the multipliers for pipelining. Other designs that use the bypass input only have one flip-flop available per LUT6 and would not have enough flip-flops available for deep pipelining.
If the product is truncated or rounded, the LSBs of the generate-add units do not need to be registered, and additional flip-flops are available for other uses. The proposed array multipliers can also be pipelined with n/2 stages to further increase the maximum clock frequency. This is accomplished by placing pipeline registers after every generate-add unit. As with the n/4-stage pipeline, this requires 2m + n + 1 bits to be registered in each stage plus m + n bits for the output register, so the number of flip-flops required for an n/2-stage pipelined array multiplier is:
$$\#FFs_{n/2} = \frac{n}{2}(2m + n + 1) - m - 1.$$
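The resource counts in this section can be tabulated with a few helper functions. This Python sketch is illustrative only: the function names are hypothetical, n is assumed to be a multiple of four so that both pipeline depths divide evenly, and the flip-flop count is derived from the description above (one register of 2m + n + 1 bits between consecutive stages plus an (m + n)-bit output register).

```python
def lut6_count(m, n):
    """LUT6s for an m x n array multiplier: n/2 rows of m + 1 LUT6s."""
    return (n // 2) * (m + 1)


def ff_available(m, n):
    """Flip-flops inside the multiplier footprint: two per LUT6."""
    return 2 * lut6_count(m, n)


def ff_required(m, n, stages):
    """Flip-flops for a pipeline with `stages` stages (n/4 or n/2 in the text)."""
    per_register = 2 * m + n + 1            # bits registered between stages
    return (stages - 1) * per_register + (m + n)
```

For a 16 × 16 multiplier these functions give 136 LUT6s and 272 available flip-flops; the four-stage pipeline needs 179 flip-flops and fits, while the eight-stage pipeline needs 375 and does not fit without SRLs or flip-flops from nearby logic.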
There are not enough flip-flops available within the footprint of the multiplier to implement an n/2-stage pipeline. Unused flip-flops in nearby logic can be used to make up the difference if available. The number of required flip-flops can be reduced by using shift-register LUTs (SRLs). A single SRL can be used to replace a number of flip-flops connected as a shift register, such as the least-significant bits of the product that are shifted through each stage. The two flip-flops associated with the SRL are available for use, so using SRLs increases the number of flip-flops available in the multiplier footprint while reducing the number that is required. When SRLs are used to replace chains of three or more flip-flops, the Vivado synthesis default, there are more than enough flip-flops within the multiplier footprint to implement the n/2-stage pipeline. As noted earlier, this does not imply that pipeline flip-flops must be placed within the footprint. Routing into or out of an SRL may be longer than the longest route between two flip-flops in a chain that it replaces, so it may be on the critical path and increase the delay of the multiplier.
The proposed array structure is easy to lay out. LUT6s are placed in the fabric much like a mirror image of how they are shown in the schematic of Figure 11, which simplifies routing as well. Deeper pipelining, i.e., using n/2 instead of n/4 stages, reduces delay significantly.

Row 0 Generate-Add Estimation Unit
The generate-add unit in the first row, ρ = 0, does not have an input value $X_{-1}$ to add. The unit only needs to generate $P_0$ and add $op_0$ and $2^m$ to produce $X_0$, the input to the next generate-add unit. Figure 13 shows the arithmetic for the Row 0 generate-add unit. If a maximum absolute error of one unit in the last place (ulp) can be tolerated, the generate-add unit in the first row can be replaced with an estimation unit that uses only (m + 1)/2 LUT6s instead of m + 1. Figure 14 shows the Row 0 generate-add estimation unit, which produces an estimate, $\hat{X}_0$, instead of $X_0$.
For any adjacent pair of bits in $P_0$, each bit is a function of four variables because $b_{-1} = 0$:
$$p_{0,i} = f(a_i, a_{i-1}, b_1, b_0), \qquad p_{0,i+1} = f(a_{i+1}, a_i, b_1, b_0).$$
Together, $p_{0,i+1}$ and $p_{0,i}$ are a function of five variables, $a_{i+1}$, $a_i$, $a_{i-1}$, $b_1$ and $b_0$. The two bits can be computed using two LUT5s in the same LUT6, generating $p_{0,i+1}$ at O6 and $p_{0,i}$ at O5. This allows $P_0$ to be generated using only (m + 1)/2 LUT6s instead of the m + 1 LUT6s required for a generate-add unit, but does not allow the fast carry chain to be used. Table 8 gives the truth table for a LUT6 that generates adjacent partial products $p_{0,i+1}$ and $p_{0,i}$ in the top row, Row 0.
Table 8. Truth table to generate $p_{0,i+1}$ and $p_{0,i}$ in Row 0.
The least-significant LUT6 can generate $p_{0,1}$ and $p_{0,0}$, but cannot properly add $op_0$ because there cannot be a carry-out to the next LUT6. One option is to discard $op_0$ and simply output $\hat{x}_{0,1} = p_{0,1}$ and $\hat{x}_{0,0} = p_{0,0}$. Another option is to generate $p_{0,1}$ and $p_{0,0}$, add $op_0$ and output $\hat{x}_{0,1} = x_{0,1}$ and $\hat{x}_{0,0} = x_{0,0}$ if there is no carry out or $\hat{x}_{0,1} = 1$ and $\hat{x}_{0,0} = 1$ if there is a carry out. Another option is to output a function of $p_{0,1}$, $p_{0,0}$ and $op_0$ that has a desired statistical result, such as an average error of zero.
The LUT5s that output $\hat{x}_{0,i}$ for m − 1 ≥ i ≥ 2 generate $\hat{x}_{0,i} = p_{0,i}$. The sum of the complemented MSB, $\overline{p_{0,m}}$, and the two constant "1"s is the three-bit value $\overline{p_{0,m}}, p_{0,m}, p_{0,m}$ in columns m + 2 through m. The LUT5s that output $\hat{x}_{0,m+1}$ and $\hat{x}_{0,m}$ generate $\hat{x}_{0,m+1} = p_{0,m}$ and $\hat{x}_{0,m} = p_{0,m}$. As described in Section 5.3, the generate-add unit in the second row uses $\hat{x}_{0,m+1}$ for $x_{0,m+2}$ and complements it internally, so $x_{0,m+2}$ does not need to be generated. The only error introduced into $\hat{X}_0$ is the error from the least-significant LUT6, so the maximum absolute error is easily constrained to 1 ulp. Although not shown in Figure 14, $p_{0,m}$ could be generated using a single LUT5 and used for $\hat{x}_{0,m+1}$ and $\hat{x}_{0,m}$ (and, complemented, for $x_{0,m+2}$).
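The error bound can be illustrated with a behavioral comparison of the exact Row 0 result and the simplest estimate, the option that discards $op_0$. The Python sketch below is a minimal model under stated assumptions: all function names are hypothetical, A is a signed m-bit integer, and bit m + 2 is included in the estimate only so that the two values can be compared directly (the hardware does not need to generate it).

```python
def row0_partial_product(a, m, b1, b0):
    """(m+1)-bit Row 0 partial-product pattern; b_{-1} = 0, so only b1 and b0 matter."""
    mask = (1 << (m + 1)) - 1
    magnitude = abs(b0 - 2 * b1)            # Booth digit magnitude: 0, 1 or 2
    p = (magnitude * a) & mask
    return p ^ mask if b1 else p            # complement when subtracting


def row0_exact(a, m, b1, b0):
    """Exact X_0: P_0 with complemented MSB and leading 1, plus 2**m and op_0 = b1."""
    p = row0_partial_product(a, m, b1, b0)
    p_prime = (p ^ (1 << m)) | (1 << (m + 1))
    return p_prime + (1 << m) + b1


def row0_estimate(a, m, b1, b0):
    """Estimate of X_0 that discards op_0 and uses no carry chain."""
    p = row0_partial_product(a, m, b1, b0)
    low = p & ((1 << m) - 1)                # estimated bits below column m equal p_{0,i}
    msb = (p >> m) & 1                      # p_{0,m}
    return low | (msb << m) | (msb << (m + 1)) | ((msb ^ 1) << (m + 2))
```

Under these assumptions the estimate differs from the exact value by exactly $op_0$, so the error is either 0 or 1 ulp.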

Methodology
Version 2014.4 of the Xilinx Vivado Design Suite was used. Designs were synthesized with the strategy set to "Vivado Synthesis Defaults" and implemented with the strategy set to "Performance_Retiming". The -shreg_min_size parameter was set to the default value of three to synthesize pipelined versions of the proposed multipliers using SRLs and set to 99 to synthesize versions using flip-flops only. Designs were synthesized for the Virtex-7 XC7VX330T-FFG1157 (-3 speed grade) device with a timing constraint of 1 ns on the inner clock. All results are post place-and-route.
LogiCORE IP multipliers were created using the IP Catalog in Vivado. Area-optimized and delay-optimized units were synthesized for each operand size. Structural models of the proposed multipliers were implemented in Verilog. Single-cycle versions for each multiplier were created. Pipelined versions were created for LogiCORE multipliers using the optimal number of stages specified in the IP customization dialog. Pipelined versions of the proposed designs were created using n/4 and n/2 stages. n/4 -stage versions were synthesized using flip-flops only (no SRLs). Flip-flop-only designs and designs using SRLs were synthesized for n/2 -stage versions. Input and output ports were double registered to reduce dependence on I/O placement [38]. CARRY4 primitives were placed manually using the RLOC constraint, which specifies the relative location of primitives in the FPGA fabric. Placement of LUTs was done by the tools with no constraints. Placement of flip-flops was also done by the tools, and they were not constrained to the footprint of the multiplier. A separate clock on the inner level was used to measure the delay through each multiplier.

Single-Cycle Multipliers
Tables 9 and 10 show synthesis results for single-cycle multipliers. The total number of LUTs used and the delay in nanoseconds of each multiplier are reported. The LUT-delay product (LDP) is computed as the total number of LUTs multiplied by the delay. This is analogous to the area-delay product of a VLSI design and gives a metric for comparing overall design efficiency, with lower values indicating higher efficiency. The reciprocal of LDP gives a metric for comparing throughput. The area optimization for LogiCORE IP multipliers is most effective when both operands are unsigned [38]. Signed area-optimized LogiCORE multipliers were found to use more LUTs and to have a higher LUT-delay product than delay-optimized units for each of the operand sizes tested, so delay-optimized multipliers are used as the baseline for comparison. The total number of LUTs, maximum delay and LUT-delay product for each design are normalized to the delay-optimized LogiCORE multiplier of the same size.
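The metrics defined above are simple to compute. The Python sketch below is illustrative only; the function names are hypothetical, and the numbers in the usage note are made-up placeholders, not results from this work.

```python
def lut_delay_product(luts, delay_ns):
    """LUT-delay product (LDP): total LUTs multiplied by delay in nanoseconds."""
    return luts * delay_ns


def normalized(value, baseline):
    """Normalize a metric to the baseline design of the same operand size."""
    return value / baseline


def relative_throughput(ldp, baseline_ldp):
    """Ratio of reciprocal LDPs: throughput per LUT relative to the baseline."""
    return baseline_ldp / ldp
```

As a purely hypothetical example, a design with 50 LUTs and a 3.0 ns delay compared against a baseline with 100 LUTs and a 2.0 ns delay has a normalized LDP of 0.75, i.e., a 25% lower LUT-delay product despite being slower.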
The proposed single-cycle designs use 47%-51% fewer LUTs than the baseline LogiCORE multipliers, which allows approximately twice as many to be implemented in the same logic fabric. They are slower than baseline multipliers, and the normalized delay generally increases as n increases. For n ≤ 20, the decrease in LUTs is more significant than the increase in delay, so those units have a 12%-46% lower LUT-delay product than baseline multipliers.  Tables 11-15 show synthesis results for pipelined multipliers. The number of pipeline stages and the number of flip-flops (FFs) are reported. The number of flip-flops includes pipeline registers and one output register, but does not include the input registers or the second set of registers used to reduce dependence on I/O placement. Values are normalized to Xilinx LogiCORE IP multipliers reported in Table 11.  Table 12 shows proposed multipliers using an n/4 -stage pipeline and no SRLs. These versions use 47%-52% fewer LUTs than the baseline LogiCORE multipliers, which allows 1.90-2.10-times as many to be implemented in the same logic fabric. These versions use fewer flip-flops than LogiCORE multipliers. The LUTs used to implement each proposed multiplier have more associated flip-flops available for use than are used in the design because the bypass inputs are not used. These versions are generally slower than LogiCORE multipliers. Table 13 shows proposed multipliers using an n/2 -stage pipeline and no SRLs. These versions use 47%-52% fewer LUTs and are 0%-23% faster than the baseline LogiCORE multipliers. These versions use more flip-flops than LogiCORE multipliers, and more flip-flops than are available from the associated LUTs used in the designs. If extra flip-flops are available from nearby logic, these versions offer LUT-delay products that are 52%-61% lower than baseline LogiCORE multipliers.  Table 14 shows proposed multipliers using an n/2 -stage pipeline and SRLs to save flip-flops. 
These versions use 42%-49% fewer LUTs and are 1%-22% faster than the baseline LogiCORE multipliers. They use fewer flip-flops than LogiCORE multipliers, and enough flip-flops are available from the associated LUTs. They have a 46%-55% lower LUT-delay product than baseline multipliers, indicating a potential 1.86-2.21-times increase in throughput for a fixed number of LUTs.
Table 15 shows proposed multipliers using an n/2-stage pipeline, SRLs and a Row 0 estimation unit instead of a generate-add unit. These versions use 45%-49% fewer LUTs and are 4%-19% faster than the baseline LogiCORE multipliers. They use fewer LUTs and flip-flops than versions that use a generate-add unit, but may have slightly longer delay. They have a 49%-57% lower LUT-delay product than baseline multipliers, indicating a potential 1.97-2.33-times increase in throughput for a given number of LUTs.
Figure 15 shows a screen capture of the implementation of a proposed 6 × 6 single-cycle array multiplier (cf. the mirror image of Figure 11). Nine slices are shown in the screen capture, and primitives in the lower-right slice are annotated (cf. Figure 1). The four MUXCYs, four XORCYs and the fast carry chain for the slice are instantiated as a single Xilinx primitive called a CARRY4. Primitives that are used are indicated by a cyan background color. Note that in Figure 11, carries propagate from the right side of the figure to the left side, so that the most-significant bit of the product is on the left side and the least-significant bit is on the right side. In the screen capture, carries propagate from the bottom of the image to the top. The two slices in the left column correspond to the generate-add unit that generates P_0 and outputs X_0. The two slices in the middle column correspond to the generate-add unit that generates P_1 and adds it to X_0 to output X_1. The two slices in the right column correspond to the generate-add unit that generates P_2 and adds it to X_1 to output X_2. The flip-flops that are indicated as used are part of the registers used for the input and output ports to reduce dependence on I/O placement, as noted in Section 6.1.
Figure 16 shows a screen capture of the implementation of a proposed 16 × 16 pipelined multiplier, with an eight-stage pipeline using SRLs. The image on the left shows wiring from the I/O pads used for the bits of operand A to the first register for operand A. The flip-flops for the first register are generally located near the corresponding I/O pad. The wiring from the first register for operand A to the second register for operand A is not shown. Most of the flip-flops used for the second register for operand A are located near the multiplier logic at the top of the image. The image on the right shows the wiring from the second register for the output P to the I/O pads used for the bits of P. The flip-flops for the first output register for P are generally located close to the logic for the multiplier at the top of the image. This figure shows the importance of double-registering the input and output ports. If they were not double-registered, delays from long routing lines from I/O pads to the multiplier would give misleading results for the speed of a multiplier when used as part of a larger unit, such as a finite impulse response (FIR) filter that is not connected directly to I/O pads.
Figure 17 shows a close-up view of the multiplier logic shown at the top of the images in Figure 16. The eight generate-add units of the multiplier each occupy five slices in a column. The SRLs used in the multiplier are implemented using LUT6s in nearby slices. Most of the flip-flops in the slices are used for generate-add units, showing that the bypass inputs are indeed available and the bypass flip-flops can be used. The implementation was not constrained to use only those flip-flops. The tools implemented most of the pipeline registers using them, but left some of them unused and available while using some flip-flops in nearby slices.
Figure 18 shows a screen capture of the same multiplier shown in Figure 17, plus the wiring from the pipeline register between the third and fourth pipeline stages to the inputs of the generate-add unit in the fourth stage. It can be seen that many of the flip-flops in the pipeline register are bypass flip-flops. However, some are not, and some are not in slices occupied by generate-add units. If many of the proposed n/2-stage multipliers using SRLs were located next to each other, there would be enough flip-flops associated with the generate-add units and SRLs to implement all of the pipeline registers and an output register for each multiplier. However, without constraining the placement of those flip-flops, the tools would likely place some of the flip-flops for one multiplier in slices occupied by another multiplier. Further research is needed to determine whether constraining the flip-flops for a multiplier to the logic used to implement the same multiplier would yield any improvement in delay.
Figure 18. Implementation of the proposed 16 × 16 pipelined multiplier (with an eight-stage pipeline using SRLs) showing wiring from the third pipeline register to the fourth-stage generate-add unit.
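The chain of generate-add units described for the 6 × 6 multiplier (a row P_j is generated and added to X_(j-1) to give X_j) can be illustrated with a small behavioral model. The sketch below is arithmetic only and says nothing about the LUT or CARRY4 mapping. It assumes standard radix-4 modified-Booth recoding for the row digits, which is an assumption of this sketch rather than something stated in this section, and the names booth_radix4_digits and array_multiply are hypothetical.

```python
from typing import List, Tuple


def booth_radix4_digits(b: int, n: int) -> List[int]:
    """Recode an n-bit two's-complement value b into n/2 digits in {-2,...,2}.

    Assumption: standard radix-4 modified-Booth recoding, used here only
    for illustration.
    """
    if n % 2 != 0:
        raise ValueError("n must be even in this sketch")
    if not -(1 << (n - 1)) <= b < (1 << (n - 1)):
        raise ValueError("b does not fit in n bits")
    # Python's >> is an arithmetic shift, so this yields two's-complement bits.
    bits = [(b >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # implicit bit to the right of the least-significant bit
    for j in range(n // 2):
        low, high = bits[2 * j], bits[2 * j + 1]
        digits.append(-2 * high + low + prev)
        prev = high
    return digits


def array_multiply(a: int, b: int, n: int) -> Tuple[int, List[int]]:
    """Model n/2 chained generate-add units: X_j = X_(j-1) + P_j."""
    x = 0
    partial_sums = []
    for j, digit in enumerate(booth_radix4_digits(b, n)):
        p_j = digit * a * 4 ** j  # row P_j, weighted by its radix-4 position
        x += p_j                  # output X_j of the j-th generate-add unit
        partial_sums.append(x)
    return x, partial_sums


# A 6 x 6 multiplication passes through three units, giving X_0, X_1 and X_2.
product, sums = array_multiply(-7, 13, 6)
print(product, sums)
```

In this model an n × n multiplier has n/2 rows, which matches the three generate-add units described for the 6 × 6 case and the eight units described for the 16 × 16 case.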

GPC-Based Tree Multipliers
Brunie et al. [18] present a data structure called a bit heap, which is similar to a BitMatrix object [4,39,40]. Bit heaps and BitMatrix objects treat a set of operands to be summed as a collection of individual weighted bits instead of a collection of operand vectors. The FloPoCo [41] arithmetic generator operates on bit heaps, applying embedded multipliers, GPCs and 3 × 3 multipliers [42] to compute the sum. FloPoCo targets Altera and Xilinx FPGAs. Kumm and Zipf present two novel GPCs specific to Xilinx FPGAs that exploit the slice structure and are more efficient than previous work in terms of the ratio of the number of bits removed from the bit heap to the number of required LUTs. They then use ILP to select GPCs to reduce a bit heap to two rows and report improvements over the previous FloPoCo heuristic [19]. Mhaidat and Hamzah [20] present results for a Xilinx Spartan-6 FPGA, which uses a 6-input LUT architecture. They report that their 32 × 32 multiplier uses 1133 LUTs, which is 2.15-times the number used by the proposed multipliers that do not use SRLs and 1.95-times the number used by the proposed n/2-stage pipelined multipliers that use SRLs. They do not compare their results to LogiCORE IP, so normalized results for LUTs or delay are not available for comparison to the proposed multipliers.
Two Altera ALMs can be used to implement a (6;3), a (1,5;3), a (2,3;3) or a (3,3;4) counter. (6;3) and (1,5;3) counters are favored because they eliminate three partial-product bits per counter, compared to (2,3;3) and (3,3;4) counters, which only eliminate two bits per counter. In Xilinx, three LUT6s would be required to implement a (6;3) or a (1,5;3) counter, because only five inputs can be shared between the LUT5s. A (2,3;3) counter could be implemented using two LUT6s, because there are only five inputs, so LUT5s can be used. A (3,3;4) counter would require four LUT6s. (6;3) and (1,5;3) counters can be used in Altera to eliminate 1.5 bits per ALM, but they would only eliminate one bit per LUT6 in Xilinx. The differences between Xilinx and Altera are too great to assume that results for GPC-based multipliers on Xilinx FPGAs would be comparable to results for Altera FPGAs presented in other work.
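The counter arithmetic above can be checked with a short script. The following is a minimal sketch: the GPC class and the bits_per_unit helper are hypothetical names introduced for illustration, and the ALM and LUT6 counts are simply those stated in the preceding paragraph, not values derived by the code.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Tuple


@dataclass(frozen=True)
class GPC:
    """A generalized parallel counter written (k_(m-1), ..., k_0; outputs)."""
    columns: Tuple[int, ...]  # input bits per column, most-significant first
    outputs: int              # number of output bits

    @property
    def inputs(self) -> int:
        return sum(self.columns)

    @property
    def max_sum(self) -> int:
        # Largest value the inputs can represent when every input bit is 1.
        top = len(self.columns) - 1
        return sum(k << (top - i) for i, k in enumerate(self.columns))

    def is_valid(self) -> bool:
        # The outputs must be wide enough to hold the largest possible sum.
        return self.max_sum < (1 << self.outputs)

    @property
    def bits_eliminated(self) -> int:
        return self.inputs - self.outputs


def bits_per_unit(gpc: GPC, units: int) -> Fraction:
    """Bits removed from the partial-product matrix per ALM or per LUT6."""
    return Fraction(gpc.bits_eliminated, units)


# name: (counter, ALMs on Altera, LUT6s on Xilinx); counts taken from the text
COUNTERS = {
    "(6;3)": (GPC((6,), 3), 2, 3),
    "(1,5;3)": (GPC((1, 5), 3), 2, 3),
    "(2,3;3)": (GPC((2, 3), 3), 2, 2),
    "(3,3;4)": (GPC((3, 3), 4), 2, 4),
}

for name, (gpc, alms, lut6s) in COUNTERS.items():
    assert gpc.is_valid()
    print(name, gpc.bits_eliminated,
          float(bits_per_unit(gpc, alms)), float(bits_per_unit(gpc, lut6s)))
```

Under these counts, no counter in the list removes more than one bit per LUT6, whereas the (6;3) and (1,5;3) counters remove 1.5 bits per ALM.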
Parandeh-Afshar et al. [7] compare LUT-based multipliers using GPCs to MegaWizard multipliers in Altera FPGAs and present the results graphically. Numerical results are estimated from their graphs and tabulated in Table 16. Their radix-4 Booth multipliers have the best overall results. They are faster than MegaWizard multipliers at the expense of additional LUTs for most operand sizes. The normalized LUT-delay product ranges from 0.67 to 1.08. By contrast, the proposed multipliers are significantly smaller and have a much lower LUT-delay product than Xilinx LogiCORE IP multipliers when pipelined with n/2 stages. This indicates that the proposed method yields a larger improvement on Xilinx FPGAs than the method of [7] yields on Altera FPGAs.
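The relationship between a normalized LUT-delay product and the potential throughput for a fixed number of LUTs can be made explicit with a short calculation. The sketch below assumes that throughput per LUT scales as the reciprocal of the LUT-delay product, which is the interpretation used when the potential throughput increases are reported above; the function names are hypothetical. Because the inputs here are the rounded percentages quoted in the text, the endpoints can differ in the last digit from the reported 1.86-2.21 and 1.97-2.33 ranges.

```python
def throughput_gain(normalized_ldp: float) -> float:
    """Potential throughput gain for a fixed LUT budget.

    normalized_ldp is (LUTs x delay) divided by the baseline (LUTs x delay).
    Assumption: throughput per LUT is proportional to 1 / (LUTs x delay).
    """
    if normalized_ldp <= 0:
        raise ValueError("normalized LUT-delay product must be positive")
    return 1.0 / normalized_ldp


def gain_from_reduction(percent_lower: float) -> float:
    """Gain implied by an LUT-delay product that is percent_lower % smaller."""
    return throughput_gain(1.0 - percent_lower / 100.0)


# Rounded reductions quoted for the proposed multipliers.
for low, high in ((46, 55), (49, 57)):
    print(f"{low}%-{high}% lower: "
          f"{gain_from_reduction(low):.2f}x to {gain_from_reduction(high):.2f}x")

# Normalized LUT-delay products estimated for the radix-4 Booth multipliers of [7].
for ldp in (0.67, 1.08):
    print(f"normalized product {ldp}: {throughput_gain(ldp):.2f}x")
```

A normalized product above 1, such as 1.08, corresponds to a gain below 1, that is, lower potential throughput per LUT than the baseline.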

Conclusions
This paper presents a novel two-operand adder that combines radix-4 partial-product generation and addition and shows how it can be used in FPGAs based on 6-input LUTs to implement two's-complement array multipliers. Synthesis results are compared to Xilinx LogiCORE IP multipliers. The proposed array multipliers use approximately one-half of the LUTs needed by comparable LogiCORE IP multipliers, which allows approximately twice as many to be implemented in the same logic fabric. When deeply pipelined, the proposed multipliers are also faster than LogiCORE IP multipliers in most cases. SRLs can be used so that there are more flip-flops associated with the logic of the multiplier than required for pipelining, which allows a large number of deeply pipelined multipliers to be densely placed in the FPGA fabric. If a maximum absolute error of 1 ulp is tolerable, the number of required LUTs can be reduced further. The proposed multipliers are well suited for multiply-intensive applications, such as digital-signal processing, image processing and video processing, where they can be modified further using techniques, such as merged arithmetic and truncated-matrix arithmetic, to optimize the overall system.