Asynchronous Floating-Point Adders and Communication Protocols: A Survey

Abstract: Addition is a key operation in digital systems, and the floating-point adder (FPA) is frequently used for real-number addition because floating-point representation provides a large dynamic range. Most existing FPA designs are synchronous, and their activities are coordinated by clock signal(s). However, technology scaling has imposed several challenges, such as clock skew and clock distribution, on synchronous design due to the presence of clock signal(s). Asynchronous design is an alternative approach that eliminates these clock-related challenges, as it replaces the global clock with handshaking signals and utilizes a communication protocol to indicate the completion of activities. Bundled data and dual-rail coding are the most common communication protocols used in asynchronous design. All existing asynchronous floating-point adder (AFPA) designs utilize dual-rail coding for completion detection, as it allows the circuit to acknowledge as soon as the computation is done, whereas bundled-data and synchronous designs using single-rail encoding must wait for the worst-case delay irrespective of the actual completion time. This paper reviews the existing AFPA designs and examines the effect of the selected communication protocol on performance. It also discusses the probable outcome of an AFPA designed using protocols other than dual-rail coding.


Introduction
The computational complexity of scientific and engineering applications has increased in recent years, and it is difficult to envision a modern scientific infrastructure without numerical computing. Many scientific and engineering applications use floating-point representation for computation with real numbers, as it provides a large dynamic range. The use of floating-point in computing goes all the way back to the world's first operational computing machine, the Z3, designed by Konrad Zuse, which used binary floating-point numbers for computation [1].
Goldberg [2] demonstrated that negligence in floating-point design can result in erroneous outcomes; hence, floating-point arithmetic demands thorough research. One example of a design failure due to inaccurate floating-point calculation is the Intel Pentium processor failure [3]. Professor Thomas R. Nicely was working on the sum of the reciprocals of twin primes. Being a mathematician, he developed several algorithms and evaluated them on different types of processors. In 1994, when he included a machine based on the Intel Pentium processor, he noticed that the processor could produce an incorrect floating-point result for the division algorithm. At first, Intel denied any such possibility, but several other researchers reported similar problems in different applications. Later, engineer Tim Coe proved that in a few cases, double-precision floating-point computations produced an error larger than their single-precision equivalents. The worst-case error occurred when double-precision floating-point representation was used to calculate the ratio of the two numbers 4,195,835 and 3,145,727. The correct result is 1.33382044..., while the result computed on the Pentium was 1.33373906, accurate only to about 14 bits, and the error produced in this case was larger than its single-precision equivalent.
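The faulty quotient above is easy to check against a correctly functioning FPU; a minimal Python sketch (the reference values are the ones quoted in the account above):

```python
# The famous FDIV test case: a correct FPU computes this quotient
# accurately, while a flawed Pentium returned approximately 1.33373906.
numerator, denominator = 4195835, 3145727

quotient = numerator / denominator      # IEEE 754 double-precision division
error_vs_pentium = abs(quotient - 1.33373906)

print(f"{quotient:.8f}")                # 1.33382045 on a correct FPU
print(f"{error_vs_pentium:.2e}")        # roughly 8e-5, a very large error
                                        # for double precision
```

The gap of about 8 × 10⁻⁵ is enormous by double-precision standards, where rounding errors are normally on the order of 10⁻¹⁶.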
In February 1991 [3], the inaccurate representation of a floating-point number may have contributed to an accident that killed 28 soldiers: an American Patriot missile battery miscalculated and failed to predict the range of an Iraqi Scud missile. Time was counted by the system's internal clock in tenths of a second, but 1/10 is a non-terminating fraction in binary representation. All computations were done in a 24-bit fixed-point register, so the bits beyond the register width were truncated, introducing a seemingly negligible truncation error per tick. This small error became significant because the battery had been operating continuously for more than 100 hours; the accumulation of many small errors produced a large one.
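The per-tick truncation error and its growth over a long up-time can be reproduced in a few lines. This sketch assumes 23 fractional bits are kept for the 1/10-second tick, which matches the commonly reported per-tick error of about 9.5 × 10⁻⁸ s; the exact register layout is not given in the text above:

```python
from fractions import Fraction

# 1/10 is non-terminating in binary; the Patriot's fixed-point register
# kept only a truncated expansion (assumed here: 23 fractional bits).
FRACTION_BITS = 23
exact_tick = Fraction(1, 10)
stored_tick = Fraction(int(exact_tick * 2**FRACTION_BITS), 2**FRACTION_BITS)

per_tick_error = float(exact_tick - stored_tick)   # ~9.5e-8 seconds per tick

ticks = 100 * 3600 * 10         # one tick every 0.1 s, for 100 hours
drift = per_tick_error * ticks  # accumulated clock error after 100 hours
print(f"{drift:.2f} s")         # ~0.34 s of accumulated clock error
```

A timing error of roughly a third of a second is more than enough to misplace the predicted position of a missile travelling at over 1,600 m/s.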
Another similar incident happened in June 1996 [3], when a crewless Ariane rocket exploded 30 seconds after lift-off due to an invalid floating-point conversion. The project, including the rocket and its cargo, cost approximately $500 million, but fortunately no lives were lost this time. The computation system was unable to convert the rocket velocity, represented as a 64-bit floating-point number, into a 16-bit signed integer, as the value was greater than 32,767, the largest number representable in a 16-bit signed integer.
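The failing conversion pattern is straightforward to illustrate; the sketch below is a hedged Python analogy (the function name and values are illustrative, and the flight software was written in Ada, not Python):

```python
INT16_MIN, INT16_MAX = -32768, 32767   # range of a 16-bit signed integer

def to_int16_checked(x: float) -> int:
    """Convert a float to a 16-bit signed integer, raising an explicit
    error on overflow instead of producing an undefined result."""
    value = int(x)
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a 16-bit signed integer")
    return value

print(to_int16_checked(20000.0))   # fine: prints 20000
try:
    to_int16_checked(64000.0)      # exceeds 32767, as on Ariane 5
except OverflowError as e:
    print(e)
```

Guarding every narrowing conversion in this way is exactly the kind of check whose omission turned a representable measurement into a fatal exception on board.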
The process completion time between two synchronous logic blocks is evaluated based on their critical paths, and the clock period for the design must be longer than the worst of these critical paths, limiting the scope for speed improvement. Clock skew, and hence clock delay balancing, is difficult to manage under technology scaling, as the clock signal needs to arrive at all storage elements at the same time [46][47][48][49]. Moreover, synchronous circuits spend 40% or more of their power on clock distribution [50,51], and as designs grow in complexity, additional delay units are required to tune the delay from the clock source to the flip-flops/latches to overcome clock skew [52][53][54]. This implies that the presence of a global clock signal leads to more latency and power consumption. Asynchronous circuits provide an alternative solution to these clock-related challenges [49,55], as the clock signal is replaced by handshaking (request REQ and acknowledge ACK) signals. The datapath becomes active upon reception of a request signal REQ, and it returns to the inactive state after it has completed its operation and issued an acknowledge signal ACK.
The asynchronous design approach is not new; in fact, circuits built right after the invention of the transistor were asynchronous. The ORDVAC at the University of Illinois and the IAS machine at Princeton [56] are examples of these earlier designs. Later, it was realized that the presence of a global clock would help to build smaller and faster circuits, and today synchronous design is the preferred method, supported by a vast Electronic Design Automation (EDA) industry. The involvement of technology in our day-to-day life, however, creates the need for ever faster and more compact electronic devices with lower power consumption. Technology scaling imposes various limitations on synchronous circuits [49,[57][58][59], and the asynchronous design approach is being reconsidered by various researchers to overcome the limitations due to the presence of the clock signal. The demand for portable electronic devices with minimal power consumption [60] without compromising processing speed and silicon area is another major concern [61,62]. Various asynchronous digital circuits have been designed and commercialized by leading companies such as IBM, Intel, Philips Semiconductors, Sun Microsystems (now Oracle), etc. [63][64][65][66][67], over the last two decades with considerable cost benefits. Several successful industrial experiments have also supported asynchronous circuit design, such as Intel RAPPID [68], the IBM FIR filter [69,70], optimizing continuous-time digital signal processors [71,72], developing ultra-low-energy devices [73][74][75][76], system design to handle extreme temperatures [77] and, finally, developing alternative computing paradigms [78][79][80]; however, these experiments were not commercialized. One of the primary reasons for the absence of commercial asynchronous circuits is the lack of sufficiently mature asynchronous EDA tools [51,81].
Fortunately, several languages and design tools are being developed for the asynchronous approach, such as UNCLE (Unified NULL Convention Logic Environment) [82], Tangram [65,[83][84][85][86][87][88], CHP (communicating hardware processes) [89][90][91][92][93][94][95], BALSA [96], asynchronous circuit compilers [97][98][99][100], Petrify [101][102][103] and various other tools [104][105][106][107][108][109][110][111][112][113], as asynchronous circuit design promises to overcome the limitations posed by synchronous logic under technology scaling [114].
An FPA requires various operations of variable latencies, and the asynchronous design approach can exploit this feature to optimize speed and energy consumption. The asynchronous approach can indicate process completion as soon as the computation is done; therefore, an AFPA does not have to wait for the worst-case delay if the current computation finishes earlier, and the circuit stays in an idle state when there is no ongoing activity. However, limited research is available on implementing an FPA using the asynchronous approach, and all of it has utilized dual-rail coding for completion detection. This paper reviews the existing designs of the asynchronous floating-point adder (AFPA). The rest of the manuscript is organized as follows: Section 2 discusses the basic algorithm of floating-point adders. The communication protocols used to control the datapath of asynchronous circuits are discussed in Section 3. Section 4 reviews the existing asynchronous floating-point adder designs. A summarized comparison of the various AFPA designs is provided in Section 5, along with a discussion of the scope for implementing an AFPA using the bundled data protocol, followed by the conclusion in Section 6.

Basic Operation of Floating-Point Adder
The implementation of a floating-point adder is more complicated than that of an integer adder, as floating-point addition/subtraction cannot be performed without a few essential preliminary steps and consists of at least six operations with different computation times [115]. Floating-point numbers are represented using the IEEE754 format developed by the Technical Committees of the IEEE Societies [116,117], which is widely accepted by academia and industry.
A standard floating-point representation is shown in Figure 2 with sign bit S, characteristic E, and significand. The mantissa M is represented as 'h.significand', where 'h' is the hidden bit; it is not stored in memory but is used for computation, and its value is always 1. There are two primary floating-point representations defined in the IEEE754 format, single precision and double precision, as shown in Figure 3. A single-precision floating-point number is 32 bits long: the MSB is the sign bit, the next eight bits are the exponent, and the final 23 bits are the significand. A double-precision floating-point number is 64 bits long: the MSB is the sign bit, followed by an 11-bit exponent and a 52-bit significand. The hidden bit '1' is virtually prepended to the significand in both representations to form the mantissa. The basic addition algorithm proceeds as follows:
Step 1: Calculate the exponent difference d = |E A − E B |.
Step 2: Alignment: Monitor the carry-out C out of the exponent subtraction to identify the smaller operand. The mantissa of the smaller operand needs to be shifted by the amount d before the addition, while the larger operand is fed directly to the adder.
Step 3: The shifter output is fed to XOR gates along with a Control signal. If Control = 0, the effective operation is addition: the shifted mantissa from Step 2 is transferred unchanged to the adder with C in = 0, since XOR with a zero bit leaves a variable unchanged, i.e., x ⊕ 0 = x. If Control = 1, the effective operation is subtraction: the 2's complement of the shifted mantissa from Step 2 is transferred to the adder, since XOR with a one bit complements a variable, i.e., x ⊕ 1 = x̄, and C in = 1 provides the additional 1 for the 2's complement. The sign of the result is calculated by the sign computation block.
Step 4: Add the shifted mantissa to the mantissa of the other operand. The sum is sent to the normalization unit, which shifts the result and adjusts the exponent accordingly to provide the final outcome in IEEE format. The basic algorithm has been modified by several researchers to improve FPA performance, for example, by introducing the FAR/CLOSE algorithm to use two paths for two different cases of subtraction [34,39,42], or by replacing the LOD with an LZA [34][35][36][37][38], to mention a few. It is evident from the basic steps of the floating-point addition algorithm that an FPA requires various operations with variable computation times in order to provide the final output. A synchronous FPA, however, cannot take advantage of this variability, as the clock period is fixed and chosen according to the critical-path, or worst-case, delay. It cannot start the next task until the next clock pulse arrives, even if the ongoing computation finished earlier than the critical-path delay. On the other hand, some asynchronous implementations of FPAs can indicate process completion as soon as the computation is done, and therefore do not have to wait for the worst-case delay if the current computation finishes earlier. However, the asynchronous design approach requires a communication protocol to indicate that the process is done before sending the acknowledge signal, as there is no global clock to synchronize the timing. The next section discusses the communication protocols used in asynchronous circuits and compares the effect of different communication protocols on asynchronous circuit design.
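The basic steps above can be sketched in software. The following is a simplified model, not a bit-accurate IEEE 754 implementation (rounding, exceptions, and biased exponents are omitted); operands are (sign, exponent, mantissa-with-hidden-bit) triples:

```python
def fp_add(sign_a, exp_a, man_a, sign_b, exp_b, man_b, man_bits=24):
    """Simplified floating-point addition following the basic algorithm.
    Mantissas are integers with the hidden bit at position man_bits-1."""
    # Step 1/2: make operand A the larger one, then align B by the
    # exponent difference d (right shift of the smaller mantissa).
    if (exp_a, man_a) < (exp_b, man_b):
        sign_a, exp_a, man_a, sign_b, exp_b, man_b = \
            sign_b, exp_b, man_b, sign_a, exp_a, man_a
    d = exp_a - exp_b
    man_b >>= d

    # Step 3: Control = XOR of the sign bits selects add or subtract.
    control = sign_a ^ sign_b
    # Step 4: add, or subtract via 2's complement when control = 1.
    total = man_a + (-man_b if control else man_b)

    # Normalization: restore the hidden bit to position man_bits-1.
    sign, exp = sign_a, exp_a            # result takes the larger operand's sign
    if total == 0:
        return 0, 0, 0
    while total >= (1 << man_bits):      # carry out: shift right, bump exponent
        total >>= 1
        exp += 1
    while total < (1 << (man_bits - 1)):  # cancellation: shift left
        total <<= 1
        exp -= 1
    return sign, exp, total
```

For example, adding 1.5 (mantissa 0xC00000, exponent 0) and 1.25 (0xA00000, 0) yields (0, 1, 0xB00000), i.e., 2.75, while subtracting them yields (0, -2, 0x800000), i.e., 0.25.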

Communication Protocols for Asynchronous Designs
The absence of a global clock in asynchronous circuits has the potential to overcome the clock-distribution and clock-skew challenges, but it creates the need for a communication protocol to detect process completion and data validity [48,118,119]. As asynchronous circuits are event-driven, a circuit becomes active only after receiving a request signal and goes back to an inactive state once the process is complete and the acknowledge signal has been sent. This process is controlled by the communication protocol with the help of data encoding, and the most widely accepted encoding protocols are:
1. Bundled Data Scheme: In the bundled data scheme, a single bit is encoded on one wire [81], similar to a synchronous circuit [120]; hence, it is also known as the Single-Rail Protocol. The circuit starts its process after receiving a request signal REQ and sends the acknowledgment ACK after the worst-case delay of the circuit's critical path, to ensure process completion and data validity, as shown in Figure 5.
2. Dual-Rail Protocol: In the dual-rail protocol, a single bit is encoded on two wires. To encode n-bit data 'd', 2n wires are required (d.t and d.f for the true and false values of each bit, respectively), and the request signal is encoded with the data [121], as shown in Figure 6. Once the data is available at the receiver, a completion detection circuit determines the value of 'd' by observing the d.t and d.f signals. This implies that dual-rail coding requires more implementation area than bundled data or synchronous design, but it can indicate data validity as soon as the computation is done. However, detecting valid data from the d.t and d.f signals requires a completion detection circuit, which adds its own delay and may offset part of the anticipated benefit.
The other protocols used to encode data are special cases of M-of-N encoding, in which log2 N bits can be represented using N wires, and one extra wire is used to send the acknowledgment [122]. Circuits with such encoding are known as quasi-delay-insensitive (QDI) circuits, as they assume arbitrary but finite gate and wire delays [123][124][125]. Dual-rail coding is also a special case of M-of-N encoding, with M = 1 and N = 2 [126].
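The completion-detection idea behind dual-rail coding can be captured in a few lines. This is an illustrative software model of the encoding rules, not a circuit netlist:

```python
def encode_dual_rail(bits):
    """Encode each bit on two rails: 1 -> (d.t, d.f) = (1, 0), 0 -> (0, 1)."""
    return [(b, 1 - b) for b in bits]

def null_word(n):
    """Spacer (NULL): both rails low on every bit, meaning 'no data yet'."""
    return [(0, 0)] * n

def data_valid(rails):
    """Completion detection: the word is valid once every bit slice has
    exactly one rail asserted; (1, 1) would be an illegal code."""
    return all(t + f == 1 for t, f in rails)

def decode_dual_rail(rails):
    assert data_valid(rails)
    return [t for t, _ in rails]

word = encode_dual_rail([1, 0, 1, 1])
print(data_valid(word))          # True -> the receiver may acknowledge
print(data_valid(null_word(4)))  # False -> still in the spacer phase
```

The area cost discussed above is visible here: every bit needs two wires, and `data_valid` must inspect every bit slice before the acknowledge can be raised.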
Most asynchronous circuit designs utilize either bundled data or dual-rail coding to indicate process completion. However, all existing AFPA designs utilize dual-rail coding, as it encodes the data-validity information with the data itself and allows the circuit to acknowledge as soon as the computation is done. The asynchronous design of the FPA is rarely discussed, and the next section reviews all the existing AFPA designs in detail.

Asynchronous Floating-Point Adders
The asynchronous implementation of floating-point adders has not been explored much; most researchers have focused on asynchronous floating-point multiplication and division operations [127][128][129][130][131], perhaps because of the complexity involved in FPA implementation. The floating-point addition operation consists of various sub-operations of variable latency: mantissa shifting for exponent matching, addition of the aligned mantissas, rounding, and normalization of the computed output. A synchronous FPA uses the worst-case delay to determine the clock frequency, so there is no need to worry about process completion, as every operation completes within the worst-case delay. However, 60-90% of addition operations can benefit from early completion [132]. Implementing an asynchronous floating-point adder by replacing the clock pulse with REQ and ACK signals can exploit early completion detection and reduce the processing time from the worst-case delay to the average-case delay. The following sections discuss the existing AFPA architectures proposed by Noche and Jose [133], Sheikh and Manohar [134][135][136], and Xu and Wang [115]; these are the only designs available in the literature that discuss the implementation of an AFPA architecture. The MTNCL approach discussed in [137] does not provide any details on implementing an AFPA; however, it is included in this review because it provides a performance comparison of floating-point addition/subtraction for both synchronous and asynchronous floating-point co-processors.

Single-Precision AFPA
This section discusses a single-precision asynchronous floating-point unit (AFPU) implemented using a variable-latency algorithm proposed by Noche and Jose [133]. This design is the first asynchronous implementation of a single-precision floating-point adder [136], along with other arithmetic operations, while all previous implementations of floating-point units had focused on multiplication or division. The design uses dual-rail differential cascode voltage switch (DCVS) logic for the datapath and complementary metal-oxide-semiconductor (CMOS) logic for the control path. The AFPU is designed at the transistor level with a 3.3 V supply voltage in a 0.35 µm process, using Cadence software to design and test the arithmetic unit. This paper discusses the performance of the AFPU for the addition operation only.
Registers and adders are the two key components of the datapath for the addition operation. Bidirectional shift registers implement the shifter circuit required for exponent matching and normalization, with provision for a rounding bit [138]. The two adders required for the exponent difference and the mantissa addition are implemented as 9-bit and 25-bit carry-lookahead adders (CLA), respectively [139]. DCVS multiplexers select the inputs for the registers and adders; these can be replaced by OR gates to optimize the design when the dual-rail inputs are never active at the same time. The control circuitry of the AFPU includes logic gates, SR latches, and C-elements [140]. The asynchronous floating-point addition operation is event-driven, and it utilizes the dual-rail protocol with four-phase signaling to detect process completion.
The process completion time reported by Noche and Jose for the single-precision AFPA (SPAFPA) includes, among other components, the time t al required to provide the final outcome; t al is small if the result is zero and large for a negative result. However, t al is small compared to the shift operation, so its average value is used in the computation.
The completion time t is calculated from these component delays. For a single-precision operand, the value of d ranges from 0 to 254, giving an addition completion time ranging from 59 ns to 5850.2 ns, as reported by Noche and Jose. Simulation was performed using 248 test vectors at 25 °C with a 5 fF (femtofarad) capacitance connected at the output. The average addition completion time reported is 127.4 ns for ten applications from the SPECfp92 benchmark suite. Noche and Jose plotted the addition completion time against the exponent difference |E A − E B | for ordinary cases (no exceptions), as shown in Figure 7. The computation time along the critical path of the SPAFPA can exceed the worst-case delay of its synchronous equivalent, but the computation time is much shorter when the exponent difference d is small. Cases with a large shift amount are uncommon, with 45% of cases having a shift amount of either 0 or 1 [129,134]; as such, the speed of the AFPA is much improved over its synchronous counterpart. A single test case adding the two numbers 4,195,835 and 3,145,727 was also considered by Noche and Jose to evaluate the SPAFPA performance: the computation completed in 79.0 ns with 0.32 nJ energy and 4.08 mW power consumption.
The SPAFPA design focuses on reducing the processing time from the worst-case delay to the average-case delay. A serial architecture is used to achieve an area-efficient design, and the AFPU requires only 17,085 transistors. However, it uses shift registers to align the data for exponent matching, so the shifting time t s is directly proportional to the exponent difference d, which slows the SPAFPA when d is large. A logarithmic (barrel) shifter would be a better choice, as an N-bit barrel shifter requires only log2 N stages, thereby reducing the processing time of the shift operation [141][142][143]. Furthermore, the CLA can be replaced by a faster average-case adder such as a parallel prefix adder, carry-select adder, or ripple-carry adder [144], or any advanced adder design, as discussed in the next section. The shifter and adder are the two basic modules of an AFPA, and improving their processing times would optimize the processing time of the AFPA as a whole.
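The log2 N staging that makes a barrel shifter's latency independent of the shift amount is easy to model. The sketch below mirrors the generic mux-stage structure, not any specific published design:

```python
def barrel_shift_right(x, amount, width=32):
    """Logarithmic (barrel) right shifter: log2(width) mux stages, each
    conditionally shifting by a power of two, so the number of stages
    traversed is fixed regardless of the shift amount (unlike a serial
    shift register, which needs 'amount' steps)."""
    stage_shift = 1
    for bit in range(width.bit_length() - 1):   # log2(width) stages
        if (amount >> bit) & 1:                 # stage enabled by one bit
            x >>= stage_shift                   # of the binary shift amount
        stage_shift <<= 1
    return x & ((1 << width) - 1)

print(barrel_shift_right(0b10110000, 5))   # 5 (= 0b101)
```

A 32-bit shift thus always passes through five stages (shifts of 1, 2, 4, 8, 16), whereas the serial shift register in the SPAFPA needs up to d clockings.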

Operand-Optimized Double-Precision AFPA
Noche and Jose claim to have reduced the process completion time from worst-case to average-case for a single-precision AFPA [133], but the reported processing time does not include the time to compute the rounding logic. Moreover, the design is completely non-pipelined and does not use any other energy optimization technique. Pipelining is a technique in which multiple tasks execute in parallel on different data values, improving throughput, and several asynchronous pipelining techniques have been developed for this purpose [105,[145][146][147][148][149][150][151][152][153]. Sheikh and Manohar [136] designed an operand-optimized double-precision AFPA (DPAFPA) implementing all four rounding modes, as discussed in this section. The performance of the DPAFPA was compared with a baseline high-performance AFPA under the following operating conditions: 25 °C with a 1 V supply voltage, in a 65 nm bulk CMOS process at the typical-typical (TT) corner. The baseline AFPA uses a 56-bit Hybrid Kogge-Stone Carry-Select Adder (HKSCSA) to add the mantissas. The adder provides two speculative sum outputs for the two possible values of carry-in, and the final output is selected at the last stage according to the actual carry-in. The dual-rail protocol is used with 1-of-4 encoding and radix-4 arithmetic to optimize energy and speed.
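The speculative-sum principle behind a carry-select adder can be modeled briefly. This sketch shows only the generic carry-select idea (two precomputed sums per block, one picked by the incoming carry), not the hybrid Kogge-Stone structure of the HKSCSA itself:

```python
def carry_select_add(a, b, width=56, block=8):
    """Carry-select addition: each block computes two speculative sums,
    one assuming carry-in 0 and one assuming carry-in 1, in parallel;
    the real carry then merely selects one of them instead of rippling
    through the whole block."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for i in range(0, width, block):
        blk_a, blk_b = (a >> i) & mask, (b >> i) & mask
        sum0 = blk_a + blk_b          # speculative sum for carry-in = 0
        sum1 = blk_a + blk_b + 1      # speculative sum for carry-in = 1
        chosen = sum1 if carry else sum0
        result |= (chosen & mask) << i
        carry = chosen >> block       # carry-out selects the next block
    return result, carry

print(carry_select_add(0x00FFFFFF, 1))   # (16777216, 0), i.e., 0x01000000
```

In hardware, the selection is a single multiplexer delay per block, which is what makes the speculative scheme faster than plain rippling at the cost of duplicated adders.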
Normalization is done in parallel with the addition operation by using the Leading One Predictor (LOP) technique. The shift amount is speculated by the LOP, and the final outcome has to be shifted by one bit if the estimated shift amount is wrong [43]. The datapath is divided into two separate pipelines (left and right) to normalize the summation output: the left pipeline handles the massive left shift required after subtraction, and all other cases are managed by the right pipeline. In total, 30 pipeline stages are used in the datapaths with minimal increase in latency. The design uses the pre-charge enable half-buffer (PCEHB) pipeline for all data computation [154], which is faster and more energy-efficient than the original pre-charge half-buffer (PCHB) pipeline [155]. Moreover, it uses the weak-condition half-buffer (WCHB) instead of the PCEHB for simple buffers and tokens, as it is more energy-efficient. A detailed power breakdown of the FPA datapath is shown in Figure 8, which indicates that addition is the highest power-consuming operation, followed by the right-shift operation. The DPAFPA design proposed by Sheikh improves on the power consumption of the baseline AFPA through the following changes:
• The HKSCSA is replaced by an interleaved asynchronous adder, which utilizes two radix-4 ripple-carry adders. The two ripple-carry adders operate in parallel on different input operands: one adds even operand pairs and the other adds odd operand pairs. The maximum carry-chain length is seven in radix-4 arithmetic for approximately 90% of cases, and the interleaved adder requires 2.9 pJ/op for carry lengths below 15 with a throughput of 2.2 GHz. In contrast, the 56-bit HKSCSA used by the baseline FPA needs 13.6 pJ/op at a throughput of 2.17 GHz. The interleaved adder therefore reduces power consumption by more than four times compared to the HKSCSA, and it requires 35% fewer transistors than the 56-bit adder.
• The right shifter is designed using three pipeline stages: Stage 1 shifts the mantissa by 0 to 3 bits, Stage 2 by 0, 4, 8, or 12 bits, and Stage 3 by 0, 16, 32, or 48 bits. In the baseline AFPA, the time to shift the mantissa by anywhere from 0 to 55 bits is fixed. Sheikh's design splits the shifter into a long path and a short path, allowing the shifter to select a path according to the shift amount and bypass the other. The shifter design is data-driven and thus optimizes power consumption.
• The LOP scheme is modified so that only one pipeline (either left or right) is used for normalization, with the selection made before activating the LOP stage. The left and right pipelines provide up to 13% and 18% power savings, respectively, compared to the baseline AFPA.
• The post-add right pipeline manages the 1-bit left/right shifter, the 53-bit mantissa incrementor, the rounding operation, and the calculation of the final exponent value. The DPAFPA uses an interleaved incrementor, analogous to the interleaved adder, instead of the carry-select incrementor used by the baseline AFPA, making the DPAFPA more energy-efficient.
• The design detects zero input operands. If one or both operands are zero, the final outcome can be produced without activating the power-hungry blocks of the AFPA.
Overall, the DPAFPA requires 30.2 pJ/op, compared to 69.3 pJ/op for the baseline AFPA, a 56.7% reduction in energy consumption [136]. The performance of the DPAFPA was also compared with a synchronous FPA proposed by Quinnell [156], one of the rare fully implemented FPA designs, which provided a good baseline for the analysis.
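The three-stage decomposition of the right shifter described above corresponds to the radix-4 digits of the shift amount, and can be sketched directly (software model only; the stage widths follow the description in the text):

```python
def three_stage_right_shift(x, amount):
    """Right shifter decomposed into three pipeline stages, as in the
    DPAFPA description: stage 1 shifts 0-3 bits, stage 2 shifts
    0/4/8/12 bits, stage 3 shifts 0/16/32/48 bits, i.e., the three
    radix-4 digits of the shift amount, covering any shift 0..63."""
    assert 0 <= amount < 64
    x >>= amount & 0x3                # stage 1: low radix-4 digit (0..3)
    x >>= ((amount >> 2) & 0x3) * 4   # stage 2: 0, 4, 8, or 12
    x >>= ((amount >> 4) & 0x3) * 16  # stage 3: 0, 16, 32, or 48
    return x

print(three_stage_right_shift(1 << 55, 55))   # 1
```

Because each stage handles one radix-4 digit, the three partial shifts always compose to the full shift amount, while short shifts can bypass the later stages in the data-driven hardware version.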
The synchronous FPA is designed using a standard-cell library in a 65 nm SOI (silicon-on-insulator) process. A performance comparison of the DPAFPA, the baseline AFPA, and the synchronous FPA considered by Sheikh and Manohar is given in Table 1 [136]. GFLOPS (giga-FLOPS, where FLOPS denotes floating-point operations per second) is used to measure the performance of a floating-point unit. The high GFLOPS/W of the proposed DPAFPA [136] makes the asynchronous design scheme attractive for optimizing circuit performance. The input set for both the baseline AFPA and the DPAFPA covers right-shift amounts ranging from 0 to 3 and considers only non-zero operands.
The use of interleaved adders and shifters helped reduce the power consumption of the circuit. The shifter architecture is also split and pipelined, reducing both the processing time and the power consumption. The gate-level simulation tool PRISM was used to design and test the AFPA, using ten billion random input operands and one billion stored inputs from actual application benchmarks. The design is also tested for exceptions (NaN, zero, infinity, denormal numbers). This DPAFPA implementation focuses mainly on reducing energy per operation and power consumption through pipelining, with minimal increase in processing time.

Double-Precision AFPA with Operand-Dependent Delay Elements
The desynchronization technique provides better performance than synchronous design [140,157], and it can also be used to implement an AFPA. However, it cannot take advantage of the event-driven nature of asynchronous circuits, as the clock signal is replaced by worst-case delay models during desynchronization. Xu and Wang [115] examined the speed of the various sub-operations required to perform floating-point addition and proposed an AFPA design with operand-dependent delay elements, as discussed in this section. A synchronous FPA [158] with the FAR/CLOSE path architecture, a balanced 56-bit shifter with LOP, and rounding by injection is used as the baseline. Xu and Wang redesigned this synchronous FPA using asynchronous logic with variable-length delay elements to exploit its event-driven property. The various sub-operations of the AFPA, each with a different computation time, must be identified in order to select the delay models; at least six operations processing at different speeds were identified [115]. Although this suggests at least six delay elements with variable latencies, Xu and Wang used only three variable-length delay elements with six multiplexers to reduce the area overhead. A two-phase MOUSETRAP pipeline [159] is used instead of master-slave latches, and the design utilizes the dual-rail protocol for completion detection to generate control information. A total of 10,000 random inputs taken from six benchmarks were used for simulation. The design is claimed to improve the speed of the proposed AFPA by 33% and reduce energy consumption by 12%, while increasing area by 5% compared to its synchronous counterpart.
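The idea of matching the delay element to the operands, rather than always waiting for the worst case, can be caricatured in software. Everything below is illustrative (the three delay values and the classification rule are assumptions), since the paper's exact multiplexing of its three delay elements is not reproduced in the text above:

```python
# Hypothetical model of operand-dependent delay selection: instead of one
# worst-case delay line, the controller routes the request through one of
# three matched delay elements chosen by inspecting the operands.
DELAY_SHORT, DELAY_MEDIUM, DELAY_LONG = 0.8, 1.6, 2.9   # ns, illustrative

def select_delay_ns(exp_diff, effective_subtract):
    """Pick a delay class from operand properties (illustrative rule)."""
    if exp_diff == 0 and not effective_subtract:
        return DELAY_SHORT    # no alignment shift needed
    if effective_subtract:
        return DELAY_LONG     # CLOSE path: possible long normalization shift
    return DELAY_MEDIUM       # FAR path: alignment shift dominates

print(select_delay_ns(0, False))   # 0.8
print(select_delay_ns(7, True))    # 2.9
```

Because most real workloads fall into the cheaper classes, the average matched delay sits well below the single worst-case delay a bundled-data or desynchronized design would have to use.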

Multi-Threshold NULL Convention Logic (MTNCL)
Liang et al. [137] proposed Multi-Threshold NULL Convention Logic (MTNCL), also called Sleep Convention Logic (SCL), which combines Multi-Threshold CMOS (MTCMOS) with NULL Convention Logic (NCL). MTCMOS uses transistors with different threshold voltages (V t ): low-V t transistors (high leakage current, fast) and high-V t transistors (lower leakage current, slower). The two are combined in MTCMOS to preserve performance with less leakage. MTCMOS has a sleep mode that maintains minimal power dissipation when the circuit is not active. However, generating the sleep signal requires complex logic since it is critical to the timing constraints, and transistor sizing and logic-block partitioning are difficult in synchronous circuits. NCL, on the other hand, utilizes asynchronous dual-rail design, which requires two wires per bit along with a spacer or NULL phase, as shown in Figure 6. Combining MTCMOS with NCL in MTNCL allows the circuit to use sleep mode during the NULL phase without dealing with clock-related challenges. A modification of the MTNCL architecture places the power-gating high-V t transistor in the pull-down network. This design, known as the Static MTNCL threshold gate structure (SMTNCL), eliminates two bypass transistors and removes the output wake-up glitch.
Liang et al. have provided a comparison between a synchronous MTCMOS design and several variations of NCL designs for single-precision floating-point co-processors. The performance of the co-processors is reported for both addition/subtraction and multiplication; however, this paper discusses the performance of addition/subtraction only. An average cycle time T_DD, covering both the DATA and NULL phases, is used for MTNCL circuits and is comparable to the synchronous clock period. The multi-threshold designs do not provide a specific architecture for the AFPA, but they are included in this survey because of the limited literature available on AFPA.
The comparison is given only for the basic NCL designs (low and high V_t), the best MTNCL design, and the synchronous MTCMOS design, as provided in Table 2 [137]. Some basic understanding of the available SMTNCL variants is needed to understand their reportedly best MTNCL architecture (SMTNCL with SECRII w/o nsleep) [137]. The basic MTNCL design uses the Early Completion Input-Incomplete (ECII) feature, which puts a stage to sleep only when all of its inputs are NULL. A variation known as SECII puts the combinational logic of the NCL circuit to sleep during the NULL cycle to reduce power dissipation. A further variation, SECRII, puts the completion and registration logic to sleep along with the combinational logic when the circuit is not active. The signal sleep and its complement nsleep are used to put the circuit into sleep mode. However, when the SMTNCL circuit is combined with bitwise MTNCL, the nsleep signal becomes unnecessary, yielding the SMTNCL with SECRII w/o nsleep architecture. Liang et al. [137] report this as the best design when simulated on 25 sets of randomly selected floating-point numbers, as it requires 86% less energy, three orders of magnitude less idle power, and 14% less area; however, it is slower (by no less than 2×) than the synchronous MTCMOS design.

Discussion
Asynchronous circuit design has the potential to improve the performance of digital circuits, especially in terms of speed and power consumption. It can also eliminate the limitations that clock signals impose on synchronous circuits under technology scaling. The performance of the existing AFPA designs is analyzed in Section 4, and their comparison is given in Table 3. Table 3 indicates that most asynchronous implementations of floating-point adders outperform their synchronous counterparts. However, all existing AFPA designs have been implemented using the dual-rail protocol, which requires two wires per data bit. Moreover, determining data validity in dual-rail circuits requires a large number of gates, as every bit in the datapath must be examined. This validity-detection logic may introduce considerable delay and power overhead in some applications and therefore may not deliver the anticipated benefit [160].
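The cost of examining every datapath bit can be made concrete with a small model. This is a rough sketch under assumed gate counts (one 2-input OR per bit, a balanced 2-input AND tree in place of a C-element tree); the figures are illustrative, not taken from [160].

```python
# Why dual-rail completion detection is expensive: per-bit validity is the OR
# of the two rails, and the word-level completion signal combines all bits,
# so the detector grows linearly in gates and logarithmically in depth with
# the datapath width. Gate model below is a rough assumption.

import math

def per_bit_valid(rail1: int, rail0: int) -> int:
    return rail1 | rail0  # 1 when the pair carries DATA rather than NULL

def completion(pairs) -> int:
    """All-bits-valid signal for one dual-rail word."""
    v = 1
    for r1, r0 in pairs:
        v &= per_bit_valid(r1, r0)
    return v

def detector_cost(width: int):
    """Rough 2-input gate count and tree depth for a width-bit detector."""
    or_gates = width        # one OR per bit
    and_gates = width - 1   # balanced AND tree over the validity bits
    depth = 1 + math.ceil(math.log2(width))
    return or_gates + and_gates, depth
```

For a 64-bit datapath this model already gives 127 two-input gates and a 7-gate-deep tree on every completion check, which is the overhead the bundled-data alternatives in the next paragraph try to avoid.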
A different approach has been used to implement a few asynchronous circuits using the bundled data scheme with completion detection techniques [112,[161][162][163][164][165], which can indicate data validity as soon as the process is complete. A speculative completion detection scheme has been designed for asynchronous fixed-point adders [161,162] and for barrel shifters [163], where the datapath channel is implemented with multiple delay models, including the worst-case delay. These bundled data adder and shifter implementations with speculative completion detection provided better performance than synchronous designs without the significant silicon-area increase of dual-rail circuits. However, no AFPA design using the bundled data protocol is available. Shift and add are two fundamental operations of an AFPA, and since bundled data implementations of an asynchronous shifter and a fixed-point adder with speculative completion detection already exist, a bundled data AFPA with speculative completion detection appears feasible. Moreover, the speculative bundled data fixed-point adder could be replaced by the deterministic completion detection adder proposed by Lai [165,166] to further improve AFPA performance.
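The speculative completion idea can be illustrated with a simplified model in the spirit of [161,162]: acknowledge through a short delay line when a fast predictor rules out a long carry chain, and abort to the worst-case delay otherwise. The predictor, threshold, and delay values here are assumptions for illustration, not the actual late-enable logic of those designs.

```python
# Hedged sketch of speculative completion for a bundled-data fixed-point
# adder: the handshake normally fires after a short matched delay, but an
# abort detector forces the worst-case delay when the operands could excite
# a long carry chain. All numeric parameters are illustrative.

FAST_DELAY_NS = 0.8    # matched delay for the speculative (common) case
WORST_DELAY_NS = 2.0   # worst-case matched delay
THRESHOLD = 8          # longest carry chain the fast path can absorb

def longest_carry_chain(a: int, b: int, width: int = 32) -> int:
    """Length of the longest run of bit positions that propagate a carry."""
    propagate = a ^ b
    best = run = 0
    for i in range(width):
        run = run + 1 if (propagate >> i) & 1 else 0
        best = max(best, run)
    return best

def ack_delay(a: int, b: int) -> float:
    """Speculate on the short delay unless the abort detector fires."""
    if longest_carry_chain(a, b) < THRESHOLD:
        return FAST_DELAY_NS
    return WORST_DELAY_NS
```

Because long carry chains are rare for random operands, most additions complete after the short delay, which is how such adders beat the fixed worst-case latency of a synchronous or plain bundled data design.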

Conclusions
Floating-point addition and subtraction are the most frequent arithmetic operations in typical scientific applications, yet very few research articles address the design of asynchronous floating-point adders. The existing AFPA designs use dual-rail coding, which requires a large implementation area, but their speed and power consumption have improved compared to their baseline FPAs. This survey discussed all four existing AFPA designs, comparing their performance features against their respective baseline FPAs. An absolute comparison of the designs is not possible, as they report different performance features; however, they are the only AFPA implementations available in the literature, and they show that asynchronous design techniques have the potential to improve AFPA performance. The paper also discussed the probable outcome of an AFPA designed using the bundled data protocol with a completion detection technique.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: