Directed Acyclic Graph-Based Datapath Synthesis Using Graph Isomorphism and Gate Reconfiguration

Abstract: Datapath synthesis is a crucial step in the synthesis flow and aims at globally minimizing area by identifying shareable logic structures. This paper introduces a novel Directed Acyclic Graph (DAG)-based datapath synthesis method based on graph isomorphism and gate reconfiguration. Unlike algorithms that identify common specification logic, our approach simplifies the problem by searching for common topology. Leveraging the concept of gate reconfiguration, our algorithm extends the applicability of DAG-based datapath synthesis by transforming a topology-equivalent network into a specification-equivalent network. Experimental results demonstrate up to 23.6% area improvement when optimizing the adder–subtractor circuit, a scenario not addressed by existing DAG-based datapath synthesis algorithms.


Introduction
Due to a surge in computing-intensive applications, computational hardware faces significant challenges arising from the increased complexity of hardware systems in design, verification, and synthesis [1,2]. As the performance gains from technology scaling become more challenging, electronic design automation (EDA) tools play an increasingly vital role in enhancing hardware performance in terms of delay, energy, and area efficiency, as well as reducing labor costs [1,3].
A traditional design flow typically includes several synthesis steps. Beginning with a behavioral definition in a high-level programming language, a fully optimized netlist is generated through high-level synthesis (HLS), logic synthesis, and technology mapping. HLS is primarily responsible for extracting arithmetic operations such as addition, multiplication, shifting, and comparison from the hardware description. The synthesizer models these operations as block modules [4,5] and processes them through scheduling, allocation, and binding phases, ultimately producing a netlist using standard libraries for subsequent logic synthesis and technology mapping [6][7][8].
Datapath synthesis is fundamentally considered a subset of HLS as it is intricately linked with HLS techniques such as behavioral transformations [6] and resource binding [9]. Unlike subsequent circuit optimization stages in logic synthesis, datapath synthesis operates at a higher level of abstraction, allowing for global optimization before delving into low-level details. However, datapath synthesis performed at lower abstraction levels, such as the logic circuit level, can also uncover beneficial optimization opportunities.
For instance, bit-level optimization, a form of datapath synthesis, can be applied to HLS-generated logic circuits to identify resource-sharing opportunities [10]. As depicted in Figure 1, this optimization step occurs post-HLS and pre-logic synthesis to identify common structures. This method is based on logic circuits, where the structure or topology can be represented as a Directed Acyclic Graph (DAG). Techniques such as [10,11] are implemented on logic circuit networks represented as And-Inverter Graphs (AIGs) [12] and on networks mapped onto standard libraries.
Chips 2024, 3, FOR PEER REVIEW
DAG-based methods aim to discover graph isomorphism, i.e., equivalent topology, to identify common specification modules and optimize datapaths [10,11]. Despite the successful reduction in circuit area achieved by DAG-based methods, they are limited in their applicability. The effectiveness of the datapath rearrangement is primarily observed in specification-equivalent modules. In essence, searching for equivalent graphs (structures) is equivalent to searching for equivalent specifications. Unfortunately, modern HLS tools offer robust support for optimizing datapaths for fully equivalent modules, diminishing the practicality of DAG-based datapath synthesis methods.
Reconfigurable logic gates represent a class of emerging devices capable of implementing multiple logic operators within a single cell [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28]. Drawing inspiration from various physical mechanisms, they have the potential to enhance the area efficiency of logic circuits by enabling a greater number of functions to be performed using a fixed number of cells. In addition to simply leveraging reconfigurable logic gates as compact logic models within libraries, several methodologies have been proposed to uncover circuit-level resource sharing through gate-level reconfiguration. For instance, ref. [29] introduces a satisfiability (SAT)-based exact synthesizer that utilizes reconfigurable logic gates to realize two arbitrary functions within a single network. While this approach is area-optimal, it is primarily applicable to small-scale circuits. By combining SAT-based synthesis methods with heuristic logic synthesis algorithms, ref. [30] demonstrates the creation of reconfigurable circuits that partially share logic resources. This logic synthesis technique aims to optimize a given circuit without actively identifying large-scale sharable resources, such as arithmetic operations in modules. This distinction arises from the fundamentally different objectives of datapath synthesis and logic synthesis. However, gate-level reconfigurability still presents an opportunity to enhance the tolerance of DAG-based datapath synthesis to specification inequivalence.
In this work, we develop new algorithms to overcome the limitation of DAG-based datapath synthesis by using reconfigurable gates. The major contributions of this work are highlighted as follows:
• Instead of conducting DAG-based datapath synthesis on a standard netlist, we introduce the concept of emerging reconfigurable logic gates to loosen the constraint of specification equivalence. This novel approach enables automatic transformation from common topology into common specification.
• We propose a novel algorithm for identifying common topology while considering the utilization of reconfigurable logic gates. The problems of logic loops and area-efficient inverter/wire removal are investigated.
• A synthesis flow is constructed encompassing high-level synthesis, logic synthesis, and technology mapping. An experimental analysis demonstrates significant area improvements of 23.6% and 26.7% when optimizing the adder-subtractor circuit and the parity-or circuit, respectively.
Section 2 introduces the background of conventional DAG-based datapath synthesis algorithms and the necessary knowledge. Section 3 explains the proposed synthesis scheme utilizing gate reconfiguration. Section 4 provides the implementation of the proposed datapath synthesis method. Section 5 describes the experiment and synthesis flow setup and provides the results demonstrating the achieved area improvement compared to a conventional scheme. Section 6 provides a summary of the achieved results.

Background

Network Representation
The representation of a logic network is termed a Boolean network, characterized by its topology expressed as a DAG with nodes symbolizing logic gates and directed edges representing wires connecting the gates [31]. However, most real-world circuits with memory components are sequential and do not adhere to the acyclic nature of a DAG. Logic synthesis facilitates optimization for these circuits by employing techniques such as retiming and extracting combinational logic partitions. This is achieved by considering the memory components as primary inputs (PI) and primary outputs (PO) for the circuit partitioning process.
The representation format of a Boolean network significantly influences logic optimization strategies and outcomes. Different formats, such as DAGs, lead to varied optimization approaches. Among DAGs, the most prominent is the AIG. In an AIG (illustrated in Figure 1a), all nodes represent AND gates, excluding PI and PO. Additionally, edges in an AIG are weighted, allowing for optional inversions (depicted by dotted lines), thereby representing either wires or inverters. AIGs are widely favored for logic optimization due to their excellent compatibility with logic gate technologies and fine-grained logical expression. The earliest DAG-based datapath synthesis methods were based on AIG networks. To enhance runtime efficiency in identifying common specification logic, ref. [10] employs a mapped netlist as the DAG for datapath synthesis. This approach no longer requires identical edges as a criterion for determining whether a node is common in two datapaths. It is believed that using a mapped netlist enhances the likelihood of successfully identifying common specifications, as transformations from a mapped network to AIGs may alter the original structure. In this context, we adopt a netlist mapped with a library containing arbitrary two-input gates as the circuit representation. This choice offers runtime advantages, as illustrated in Figure 1b, where the inclusion of an XOR gate reduces the number of gates.

Specification Equivalence
Specification equivalence refers to two combinational circuits having exactly the same function. The sharable circuit partition is known as common specification logic [32]. In a previous study, datapath synthesis aimed to identify common specification logic within the context of the following task: given the output boundaries of two logic cones, determine the input boundaries that maximize the alignment of PI signals. The number of aligned/merged PIs between the original datapaths indicates the final area improvement after the logic synthesis and technology mapping that follow datapath synthesis. In the practice of datapath synthesis, this purpose is represented as finding as many common logic nodes as possible. Typical techniques for checking whether two designs conform to common specification logic are based on combinational equivalence checking (CEC), such as existing methods based on BDDs [33], SAT [34,35], and AIGs [36]. However, the CEC methods are not universal for this task because of the unknown input boundaries of the designs, the unknown relationships (Boolean matching) of those inputs, and/or an explosive runtime in large-scale arithmetic networks.
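To make the notion of specification equivalence concrete, the following minimal sketch compares two small functions exhaustively. It assumes the input boundaries and the input correspondence are already known, which is exactly what is not available in the datapath synthesis task described above. The helper `equivalent` and the two example functions are hypothetical names used only for illustration; practical CEC relies on the BDD-, SAT-, and AIG-based techniques cited above.

```python
from itertools import product

def equivalent(f, g, num_inputs):
    """Return True if f and g agree on every input assignment (2**n checks)."""
    return all(f(*bits) == g(*bits)
               for bits in product((0, 1), repeat=num_inputs))

# Two structurally different implementations of the same function (De Morgan).
nand_form = lambda a, b: 1 - (a & b)
or_form = lambda a, b: (1 - a) | (1 - b)

print(equivalent(nand_form, or_form, 2))
```

The exhaustive loop grows as 2^n in the number of inputs, which illustrates why such checks do not scale to large arithmetic networks.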

Graph Isomorphism
In the graph context, an isomorphism between graphs G and H denotes a bijection between the vertex sets V(G) and V(H), f: V(G) → V(H), such that any two vertices u and v of G are adjacent in G if and only if f(u) and f(v) are adjacent in H [37]. Although graph isomorphism is a well-known NP problem, it can be solved in linear time when the graph is a planar Boolean network, i.e., an acyclic planar graph, for which isomorphism can be found efficiently via heuristic algorithms [10]. This process can be further accelerated by employing fanin and fanout information to prune part of the search space [10].
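The definition above can be checked directly for a given candidate mapping. The sketch below is an illustration under simplifying assumptions: graphs are given as lists of directed edges, vertices are inferred from the edges (so isolated vertices are not handled), and the function name `is_isomorphism` and the node names are hypothetical. It verifies a proposed bijection; it does not search for one.

```python
def is_isomorphism(edges_g, edges_h, mapping):
    """Check that `mapping` (vertex of G -> vertex of H) is a bijection
    that maps the directed edge set of G exactly onto that of H."""
    verts_g = {v for e in edges_g for v in e}
    verts_h = {v for e in edges_h for v in e}
    if set(mapping) != verts_g or set(mapping.values()) != verts_h:
        return False
    if len(set(mapping.values())) != len(mapping):
        return False  # two vertices of G map to the same vertex of H
    mapped = {(mapping[u], mapping[v]) for u, v in edges_g}
    return mapped == set(edges_h)

# Two small cones with the same shape: (a, b) -> n1, (n1, c) -> n2.
G = [("a", "n1"), ("b", "n1"), ("n1", "n2"), ("c", "n2")]
H = [("x", "m1"), ("y", "m1"), ("m1", "m2"), ("z", "m2")]
good = {"a": "x", "b": "y", "c": "z", "n1": "m1", "n2": "m2"}
bad = {"a": "x", "b": "y", "c": "z", "n1": "m2", "n2": "m1"}
print(is_isomorphism(G, H, good), is_isomorphism(G, H, bad))
```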

Reconfigurable Gate
The reconfigurable gate refers to a type of emerging device technology that can be switched between multiple operators. Thanks to advances in the field of polymorphic electronics, especially the arrival of unconventional, post-silicon, or beyond-CMOS materials, many of these devices incur lower area, delay, and power overhead in realizing multiple Boolean operators in one cell compared to conventional CMOS schemes [13][14][15][16][17][18][19][20][21]. In particular, some works have achieved a complete reconfigurable gate set: they provide an efficient implementation of any pair of two-input Boolean functions. For example, ref. [38] proposes eight sets of efficient bi-functional two-input reconfigurable gates based on emerging double-gate ambipolar transistors, which switch between N and P polarities according to a control signal. With a slightly increased overhead, refs. [29,39] provide logic gate modules that reconfigure between arbitrary two-input Boolean functions by manipulating the valley pseudospin degree of freedom. Figure 1c demonstrates a brief example of how reconfigurable gates can simplify circuits. Reconfigurable logic gates not only stimulate research in reconfigurable circuits, but the concept can also be leveraged to increase the tolerance of specification differences in datapath synthesis.
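At the behavioral level, a bi-functional reconfigurable gate can be viewed as a two-input cell whose operator is selected by a control signal. The sketch below is only a functional model under that assumption; it says nothing about the device physics or cost of the cited technologies, and the names `OPS` and `reconfigurable_gate` are hypothetical.

```python
# Two-input operators available to the behavioral model (illustrative subset).
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

def reconfigurable_gate(op0, op1):
    """Return a gate computing op0 when ctrl == 0 and op1 when ctrl == 1."""
    def gate(a, b, ctrl):
        return OPS[op1](a, b) if ctrl else OPS[op0](a, b)
    return gate

xor_and = reconfigurable_gate("XOR", "AND")
print(xor_and(1, 1, 0), xor_and(1, 1, 1))  # XOR mode, then AND mode
```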

Proposed Methodology
The overall methodology of our scheme consists of two steps. First, the synthesizer generates and locates all the 2-to-1 multiplexers (MUXs), identifying the optimizable datapath by locating those root multiplexers and treating them as the starting points of two datapaths. In this paper, the synthesizer takes a Boolean network as the input, which is technology-mapped with a library including 2-to-1 multiplexers, arbitrary two-input logic gates, and inverters. The multiplexers in the network are collected, excluding those with any of their data inputs having a fanout. Each multiplexer has two data inputs, indicating the starting points of two datapaths, which are also represented as two combinational logic cones.
In the second step, a graph isomorphism searching algorithm is applied to each root multiplexer to identify the common topology. Once the boundary of the common topology is determined, a plan for multiplexer relocation is generated, moving the multiplexers from the root of the logic cone to the input of the common topology. An evaluation script is then executed to determine if the plan results in an area improvement. If the plan yields a positive gain, the multiplexer relocation is deployed to realize the datapath adjustment, and reconfigurable logic gates are assigned to the required positions to avoid changing the network's functionality. After traversing all the multiplexers, the modified netlist undergoes the remaining logic synthesis steps to further enhance the area improvement of the datapath optimization. Detailed algorithms and examples are described below.
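The first step, collecting root multiplexers, might look like the following sketch. It assumes a toy netlist format (a dictionary from node name to gate type and fanin names, with multiplexer fanins ordered as select, data 0, data 1) and interprets "having a fanout" as a data input that drives anything besides the multiplexer itself. Both the format and that interpretation are assumptions made for illustration, and `root_multiplexers` is a hypothetical name.

```python
from collections import Counter

def root_multiplexers(netlist):
    """netlist: name -> (gate_type, fanin_names); MUX fanins are (sel, d0, d1).
    Keep only multiplexers whose two data inputs feed nothing else."""
    fanout = Counter(src for _, fanins in netlist.values() for src in fanins)
    roots = []
    for name, (gate_type, fanins) in netlist.items():
        if gate_type != "MUX":
            continue
        _, d0, d1 = fanins
        if fanout[d0] == 1 and fanout[d1] == 1:
            roots.append(name)
    return roots

netlist = {
    "n1": ("AND", ("a", "b")),
    "n2": ("OR", ("a", "c")),
    "n3": ("XOR", ("b", "c")),
    "m1": ("MUX", ("s", "n1", "n2")),  # both data inputs feed only m1
    "m2": ("MUX", ("s", "n3", "c")),   # c also feeds n2 and n3, so m2 is skipped
    "out": ("AND", ("m1", "m2")),
}
print(root_multiplexers(netlist))
```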

Basic Searching Method for Common Specification
One of the fundamental functions of datapath synthesis is to explore common specifications among datapaths delineated by multiplexers. As illustrated in Figure 2, the two data inputs of the multiplexer each correspond to a logic cone with unknown boundaries and Boolean matching relationships. Although existing DAG-based datapath synthesis algorithms term this operation 'searching graph isomorphism', they effectively assign it the task of searching for specification equivalence. The search process entails comparing nodes layer by layer to determine if they realize exactly identical functions. It terminates upon reaching a layer devoid of pairable nodes (e.g., all nodes in the layer are primary inputs) or encountering any nodes that are not pairable. As a result, modules rotated during the previous synthesis can be identified and captured as shareable logical resources. In Figure 2b, the realization of resource sharing is depicted by relocating the multiplexer backward to the boundary of the circuit partition with common specifications. Nodes identified as having the same specification are termed 'paired' nodes. After the subsequent resource sharing and multiplexer relocation, only one of the paired nodes remains. For instance, nodes n2 in cone 1 and n3 in cone 2 are paired during the search process, with only n2 persisting in Figure 2b.

An important challenge in maximizing common logic is determining the optimal node pairings when a layer offers multiple choices. For instance, in Figure 2a, layer 2 presents the following two viable options: {n1-n3, n2-n4} and {n1-n4, n2-n3}. It is evident that {n1-n3, n2-n4} offers a superior opportunity for pairing nodes in layer 3. To tackle this challenge, the algorithm employs a look-ahead heuristic approach. It explores three levels deeper into the search space, selecting pairings that maximize the shared logic gates.
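The choice between straight and crossed pairings can be sketched as a bounded recursive count of pairable gates. The code below is a simplified illustration, not the algorithm of [10]: it assumes every gate has exactly two fanins, pairs nodes only when their operators are identical, treats any name missing from the cone dictionary as a primary input, and uses an invented example that merely reuses the names n1 to n4. The functions `shared_count` and `choose_pairing` are hypothetical.

```python
def shared_count(c0, c1, n0, n1, depth):
    """Gate pairs with identical operators reachable from (n0, n1)
    within `depth` levels, taking the better child matching at each level."""
    if depth == 0 or n0 not in c0 or n1 not in c1:
        return 0  # look-ahead budget spent, or a primary input was reached
    (op0, kids0), (op1, kids1) = c0[n0], c1[n1]
    if op0 != op1:
        return 0
    (a0, b0), (a1, b1) = kids0, kids1
    straight = (shared_count(c0, c1, a0, a1, depth - 1)
                + shared_count(c0, c1, b0, b1, depth - 1))
    crossed = (shared_count(c0, c1, a0, b1, depth - 1)
               + shared_count(c0, c1, b0, a1, depth - 1))
    return 1 + max(straight, crossed)

def choose_pairing(c0, c1, kids0, kids1, lookahead=3):
    """Pick the pairing of two children per cone that shares more gates."""
    (a0, b0), (a1, b1) = kids0, kids1
    straight = (shared_count(c0, c1, a0, a1, lookahead)
                + shared_count(c0, c1, b0, b1, lookahead))
    crossed = (shared_count(c0, c1, a0, b1, lookahead)
               + shared_count(c0, c1, b0, a1, lookahead))
    return [(a0, a1), (b0, b1)] if straight >= crossed else [(a0, b1), (b0, a1)]

cone0 = {"n1": ("AND", ("p0", "a")), "n2": ("AND", ("q0", "b")),
         "p0": ("XOR", ("c", "d")), "q0": ("OR", ("e", "f"))}
cone1 = {"n3": ("AND", ("p1", "g")), "n4": ("AND", ("q1", "h")),
         "p1": ("XOR", ("i", "j")), "q1": ("OR", ("k", "l"))}
print(choose_pairing(cone0, cone1, ("n1", "n2"), ("n3", "n4")))
```

In this toy example all four nodes are AND gates, so both pairings are valid at the current layer; only the look-ahead into the next layer (XOR with XOR, OR with OR) separates them.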

Approximate Pairing Using Gate Reconfiguration
Section 3.1 shows that the basic method searches for a common specification instead of a common topology, and the common specification requires an exactly equivalent function. This raises a demand for adding tolerance to specification differences when finding sharable resources between logic cones. By relaxing the requirement of exactly equivalent specifications, common resources can be discovered between cones realizing different functions. In existing studies, an approach of approximate pairing has been proposed to try to handle this problem [10]. Specifically, this approach tolerates the different usage of inverters between logic cones.
As shown in Figure 3a, the circuits in the two logic cones are highly similar except for an extra inverter in the left cone. The approach enables the pairing to take place by replacing the inverter/wire with an XOR gate that takes one of its inputs from the control signal of the multiplexer, as shown in Figure 3b. However, this approximate pairing only slightly improves the tolerance of specification difference. One observation is that it excels in reusing precisely equivalent functions but exhibits reduced efficiency when dealing with functions that exhibit variation, even if they are very similar. For example, a 38% improvement was achieved in optimizing 64-bit A+B:A+C, while the improvement reduced to 0.5% when the algorithm was performed on 64-bit A+B:A−C. This is because of several limitations of the current scheme:
1. This scheme does not apply to the specification difference in two-input nodes, which is its major disadvantage. For a netlist mapped with a standard library, the difference is more likely to appear in the two-input logic gates than in the inverters. For example, the two cones in Figure 4a implement abc and a′b′ + c′ in the same DAG. The current scheme judges that the pairing fails at the first layer because the gates AND and OR have different operators. Applying De Morgan's law, the implementation can be easily transformed into a′b′c′, which can be recognized as the common specification in the current scheme. However, such a transformation is not provided in DAG-based synthesis, and the case in Figure 4 widely exists in high-level synthesis-generated networks that have different descriptions.
2. The tolerance of the specification difference is realized with a significant overhead. In the example in Figure 4b, the resource sharing is achieved at the cost of inserting three XOR gates, and XOR gates are usually much larger than AND or OR gates in the context of conventional CMOS technology. Although some of those XOR gates can be optimized in subsequent logic synthesis, the overhead is not negligible.
3. Once the approximate method is applied to the network, i.e., XOR gates are inserted into a logic cone, the topology of this cone (module) is permanently changed. Since the datapath synthesis is performed multiplexer by multiplexer in order, common logic can only be shared once before its topology changes. For example, when optimizing A+B:A−C:A+D, the topology is changed because of the insertion of XOR gates during the generation of A±(B:C). Then, the network of A±(B:C) has a different topology from A+D and cannot pair with it.
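The inverter-tolerant pairing discussed above can be illustrated with a small functional check. The sketch below assumes a toy pair of cones, a & b and (NOT c) & d, selected by a multiplexer with select signal s; the inverter/wire difference is absorbed by an XOR gate driven by s, and the multiplexers move to the cone inputs. The function names and the example cones are hypothetical and are not taken from Figure 3.

```python
from itertools import product

def mux(sel, d0, d1):
    return d1 if sel else d0

def original(s, a, b, c, d):
    # Root multiplexer selecting between two cones that differ by one inverter.
    return mux(s, a & b, (1 - c) & d)

def shared(s, a, b, c, d):
    # Multiplexers relocated to the inputs; XOR with s acts as a wire when
    # s == 0 and as an inverter when s == 1.
    x, y = mux(s, a, c), mux(s, b, d)
    return (x ^ s) & y

all_match = all(original(*v) == shared(*v) for v in product((0, 1), repeat=5))
print(all_match)
```

The AND gate is now shared, but one XOR gate has been added, which is the overhead referred to in limitation 2.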
In this work, we propose a novel approximate pairing scheme based on gate-level reconfiguration to overcome the aforementioned issues that restrict searching for sharable resources. As shown in Figure 5, the new scheme aims for generic graph isomorphism searching, and the paired DAG nodes with different logic operators are transformed into sharable logic gates by assigning reconfigurable logic gates. This approach prevents the layer-wise node pairing from ceasing early, as happens in common specification searching. In the previous scheme, one different logic gate can stop the current attempt at datapath optimization even though massive approximate logic is buried in deeper layers. In the example in Figure 5, the search for equivalent specifications stops at the first layer due to the logic difference between XOR and AND. However, the two logic cones are fully graph-isomorphic and can be leveraged by searching for equivalent topology. Nevertheless, we keep the approach of replacing the inverter/wire because of its efficiency in some scenarios. This issue is further discussed in Section 4.4.
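The effect of assigning a reconfigurable gate to a topology-paired node can be checked functionally on a small example. The sketch below assumes two hypothetical cones with identical topology, (a | b) ^ c and (d | e) & f, whose root operators differ (XOR versus AND, echoing the operators mentioned for Figure 5 but not reproducing that figure). The merged version shares the OR gate directly and models the root as a reconfigurable XOR/AND gate controlled by the multiplexer select signal s.

```python
from itertools import product

def original(s, a, b, c, d, e, f):
    # Root multiplexer choosing between two topology-equivalent cones.
    return ((d | e) & f) if s else ((a | b) ^ c)

def merged(s, a, b, c, d, e, f):
    # Multiplexers relocated to the boundary of the common topology.
    x, y, z = (d, e, f) if s else (a, b, c)
    inner = x | y                             # identical operator: plain shared gate
    return (inner & z) if s else (inner ^ z)  # reconfigurable XOR/AND gate driven by s

same_function = all(original(*v) == merged(*v)
                    for v in product((0, 1), repeat=7))
print(same_function)
```

Unlike the XOR-insertion scheme, no extra gate is added to the cone here, so its topology is unchanged after sharing.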

Implementation
There are several challenges to be solved in implementing reconfiguration-based approximate isomorphism searching. In this section, the examples, detailed solutions, and algorithms are described. The pseudocode in Algorithm 1 outlines the proposed method.

Priority of Pairing
The critical step of this method is to identify the maximum common specification logic between neighboring datapaths. This step is described in the function CommonLogicBoundary in Algorithm 1. The function returns the pairings of the boundary signals B that maintain the isomorphism class, which are used for relocating the root multiplexers to the boundary of the common logic. It also returns all the positions of conditional pairs (R0, R1) that need to be assigned a reconfigurable gate. The topology-based common specification determination suffers from an explosion of the search space. Since the pairing is performed in the context of topology, any nodes that share paired parent nodes, have the same number of fanins, and are of the same node type (e.g., logic node or PI) can form a pair and operate as a common node. In order to reduce runtime, the layer-wise node pairing is performed in the following order of priority:
• All nodes to be paired have paired parent nodes to ensure functional continuity between layers.
• Nodes that are PIs are paired in advance. Note that identical PIs are paired with priority since they do not need a multiplexer for datapath adjustment, for example, the pairs {F0-F0} in Figure 5.
• The nodes that have the identical operator.
• The nodes that have identical fanins.
• The nodes that have the same number of fanouts [10].
• If several candidates for pairing have equal priority, they are evaluated and sorted according to their potential to maximize the logic sharing. This potential evaluation is a three-layer look-ahead heuristic algorithm adopted from [10].
Note that the inverter is not considered as a gate or a node in the DAG. Additionally, only the nodes that have an identical number of fanins can be paired.
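The priority order above can be read as a lexicographic sort key. The following Python sketch shows one such reading under stated assumptions: the `Cand` record, `compatible`, `pair_priority`, and `rank_pairs` are hypothetical names, candidates are assumed to already have paired parent nodes, "identical fanins" is interpreted as the same set of fanin signals, and the three-layer look-ahead tie-breaker from [10] is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Cand:
    name: str
    is_pi: bool
    op: str
    fanins: Tuple[str, ...]
    n_fanouts: int


def compatible(u: Cand, v: Cand) -> bool:
    # Hard constraints: same node type and same number of fanins.
    return u.is_pi == v.is_pi and len(u.fanins) == len(v.fanins)


def pair_priority(u: Cand, v: Cand) -> Tuple[int, ...]:
    # Smaller tuples sort first, i.e., have higher pairing priority.
    both_pi = u.is_pi and v.is_pi
    return (
        0 if both_pi and u.name == v.name else 1,    # identical PIs: no multiplexer needed
        0 if both_pi else 1,                         # PIs are paired in advance
        0 if u.op == v.op else 1,                    # identical operator
        0 if set(u.fanins) == set(v.fanins) else 1,  # identical fanins
        0 if u.n_fanouts == v.n_fanouts else 1,      # same number of fanouts
    )


def rank_pairs(u: Cand, candidates: List[Cand]) -> List[Cand]:
    legal = [v for v in candidates if compatible(u, v)]
    return sorted(legal, key=lambda v: pair_priority(u, v))


# Hypothetical example: one node of cone 0 and four candidates from cone 1.
u = Cand("u", False, "AND", ("x", "y"), 2)
v1 = Cand("v1", False, "OR", ("p", "q"), 2)
v2 = Cand("v2", False, "AND", ("p", "q"), 1)
v3 = Cand("v3", False, "AND", ("x", "y"), 3)
v4 = Cand("v4", True, "PI", (), 1)
ranked = rank_pairs(u, [v1, v2, v3, v4])
```

Because Python's sort is stable, candidates with equal keys keep their input order, which is where a look-ahead heuristic could be plugged in as an additional key component.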

Elimination of Logic Loop
The utilization of datapath synthesis introduces the potential risk of generating a logic loop within the network. An illustrative example of the creation of a logic loop is shown in Figure 6, where node 2 resides in the logic cones of both data input 0 and data input 1. In such a scenario, if achieving common logic between data 0 and data 1 necessitates pairing node 7 with node 2 and assigning them as a reconfigurable node, a loop forms, as indicated by the red circle in the figure. However, due to the absence of memory elements, loops are prohibited in combinational logic networks. This challenge can be addressed by decoupling the two cones undergoing datapath optimization. Specifically, we temporarily duplicate the overlapping nodes in the two logic cones and assign independent indices to the duplicated nodes. While this temporary measure may momentarily increase the total area, it is applied only to the subcircuits and is not written back to the network if the datapath adjustment is evaluated as unworthy.
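The decoupling step can be sketched in Python as follows. This is an illustrative sketch, not the paper's implementation: the network is assumed to be a dictionary mapping each node index to its fanin list (PIs map to an empty list), `cone` and `decouple` are hypothetical helpers, the second root is assumed not to lie inside the first cone, and the small example network is made up rather than taken from Figure 6.

```python
from typing import Dict, List, Set, Tuple

Net = Dict[int, List[int]]  # node index -> fanin indices; PIs map to []


def cone(net: Net, root: int) -> Set[int]:
    """Transitive fanin of root, including root and the PIs it reaches."""
    seen, stack = set(), [root]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(net[n])
    return seen


def decouple(net: Net, root0: int, root1: int) -> Tuple[Net, Dict[int, int]]:
    """Duplicate logic nodes shared by both cones so that cone 1 owns copies."""
    c0, c1 = cone(net, root0), cone(net, root1)
    shared = {n for n in c0 & c1 if net[n]}      # PIs are left untouched
    next_id = max(net) + 1
    copy_of = {n: next_id + i for i, n in enumerate(sorted(shared))}
    new_net = {n: list(f) for n, f in net.items()}
    for n, c in copy_of.items():                 # copies read copies (or PIs)
        new_net[c] = [copy_of.get(f, f) for f in net[n]]
    for n in c1 - shared:                        # cone-1-only nodes are rewired
        new_net[n] = [copy_of.get(f, f) for f in net[n]]
    return new_net, copy_of


# Hypothetical network: node 2 feeds both root 5 (cone 0) and root 7 (cone 1).
net = {10: [], 11: [], 12: [], 2: [10, 11], 5: [2, 12], 7: [2, 10]}
new_net, copy_of = decouple(net, 5, 7)
logic_overlap = {n for n in cone(new_net, 5) & cone(new_net, 7) if new_net[n]}
```

Since the function returns a new dictionary and leaves `net` unchanged, discarding the result corresponds to not updating the network when the adjustment is evaluated as unworthy.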

Evaluation Function
After the boundary of the common specification logic has been found, the algorithm needs to evaluate whether such resource sharing is worthwhile in terms of area. The evaluation function first calculates and records the circuit area within the datapaths under optimization. Then, it generates the post-adjustment area by accounting for the following terms: A_Removed is the area saved by removing the network on one of the logic cones, whose function is undertaken by the other logic cone; A_Rec is the area overhead of the assigned reconfigurable gates, which comes from the slightly larger size of a reconfigurable logic gate compared to a two-input logic gate; A_MuxRelocate is the cost of relocating the root 2-to-1 multiplexer to the boundaries, and this term usually dominates the overhead; and A_Xor is the area consumed by adding XOR gates to replace the inverter/wire.
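One way to combine these terms is sketched below in Python. The signs are inferred from the term descriptions (the removed area is subtracted, the three overheads are added); the function names, the per-gate overhead model, and all numbers in the example are hypothetical.

```python
def post_adjust_area(a_before: float, a_removed: float, a_rec: float,
                     a_mux_relocate: float, a_xor: float) -> float:
    """Area after the datapath adjustment, under the assumed sign convention."""
    return a_before - a_removed + a_rec + a_mux_relocate + a_xor


def is_worthwhile(a_before: float, a_after: float) -> bool:
    # Accept the adjustment only if it strictly reduces the area.
    return a_after < a_before


# Hypothetical numbers: unit-area two-input gates, 1.5x reconfigurable gates.
n_reconfig = 4
a_rec = n_reconfig * (1.5 - 1.0)   # extra area relative to standard gates
a_after = post_adjust_area(a_before=40.0, a_removed=12.0, a_rec=a_rec,
                           a_mux_relocate=6.0, a_xor=1.0)
accept = is_worthwhile(40.0, a_after)
```

With these made-up values, the multiplexer relocation is the largest overhead term, which matches the remark that it usually dominates.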

Remove Inverter/Wire
During the search for common logic, inverters are disregarded, yet some of them introduce specification differences in the form of inverter/wire combinations. The algorithm records the positions of inverters/wires and handles them in two ways. First, it attempts to combine the inverter/wire with the nearest reconfigurable gate. As depicted in Figure 7, an inverter resides only in the left cone, and the driven AND gate is converted into a reconfigurable gate following datapath adjustments. Consequently, the inverter/wire can be merged with the AND/OR gate by inverting one of the inputs of the AND operator.
If the inverter is not connected to a reconfigurable gate, we transform the conventional gate driven by the inverter into a reconfigurable gate that dynamically switches the input complementation status based on the control signal. This approach offers area advantages compared to inserting an XOR gate, especially considering that input/output negation has a lower area cost in certain technologies, such as Valley Spin devices [29]. However, if the inverter drives multiple gates, merging the inverter/wire with the driven gates may not be area-efficient. Moreover, if the gate g driving the inverter also controls a gate x outside of the logic cone, this method becomes infeasible because gate x could require a statically inverted signal g, regardless of the control signal selecting this logic cone. Therefore, without losing the main advantage of leveraging gate-level reconfiguration, in handleInvDiff of Algorithm 1, we still handle the inverter/wire by inserting an XOR gate when the inverter drives or is driven by multiple gates.
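The merging step can be illustrated with a small Python model in which a reconfigurable gate is a pair of two-input truth tables selected by the control signal. This is a sketch under assumptions: `truth_table`, `reconfigurable`, and `choose_inverter_fix` are hypothetical names, the example pairing (an AND with one inverted input against an OR) is made up for illustration, and the ordering of the checks in `choose_inverter_fix` is one interpretation of the text, with "driven by multiple gates" read as the driving gate having more than one fanout.

```python
from itertools import product
from typing import Callable, Tuple

TT = Tuple[int, int, int, int]  # outputs for (a, b) = 00, 01, 10, 11


def truth_table(fn: Callable[[int, int], int]) -> TT:
    return tuple(fn(a, b) for a, b in product((0, 1), repeat=2))


def reconfigurable(tt0: TT, tt1: TT) -> Callable[[int, int, int], int]:
    """Gate that behaves as tt0 when ctrl == 0 and as tt1 when ctrl == 1."""
    def gate(ctrl: int, a: int, b: int) -> int:
        return (tt1 if ctrl else tt0)[2 * a + b]
    return gate


# Cone 0 needs AND with an inverted first input; cone 1 needs a plain OR.
and_inv_a = truth_table(lambda a, b: (1 - a) & b)
plain_or = truth_table(lambda a, b: a | b)
gate = reconfigurable(and_inv_a, plain_or)   # the inverter is absorbed, no XOR


def choose_inverter_fix(drives_reconfigurable: bool,
                        inverter_fanout: int, driver_fanout: int) -> str:
    """Assumed decision order for handling one inverter/wire difference."""
    if inverter_fanout > 1 or driver_fanout > 1:
        return "insert_xor"
    if drives_reconfigurable:
        return "merge_into_reconfigurable_gate"
    return "convert_driven_gate_to_reconfigurable"
```

Because an AND with one complemented input is itself one of the sixteen two-input functions, a gate that can switch between arbitrary two-input operators covers it without any additional XOR.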

Experiment Setup and Results
We implement the proposed algorithms outlined in Section 4 in C and integrate them with the logic synthesis tool ABC [36]. The benchmark circuit chosen is A+B:A−C, which previously exhibited almost no area optimization using the original datapath synthesis scheme [10]. We use open-source synthesis tools, such as Yosys [40], to perform the high-level synthesis.
First, the A+B:A−C netlist is implemented using the Verilog code 'if O=A+B; else O=A−C;' with an assigned operand bit-width. Subsequently, the RTL description is translated into a gate-level combinational netlist using the command 'read; hierarchy -top; proc; techmap;' in the synthesis tool Yosys [40]. This netlist is then mapped into a standard library and serves as the initial network for subsequent processes; it is also used as the baseline scheme in this work.
Our algorithm, logic synthesis, and technology mapping are then applied to the netlist. In addition to the baseline, four other schemes are implemented, and the results are shown in Table 1. First of all, a library is built with standard gates as well as a set of reconfigurable gates capable of switching between arbitrary two-input operators. We set the delay and area of standard two-input gates to 1 and assume that the reconfigurable gates have the same delay and 1.5× the area of standard two-input gates. By mapping the baseline network with the library that includes reconfigurable gates, the scheme 'Rec-library' is deployed to assess the impact of adding reconfigurable gates. In the scheme 'ABC flow', the baseline network undergoes the traditional logic synthesis and technology mapping in ABC by executing the commands 'strash; dch -v; map -v' with the new library. This scheme aims to investigate the performance of synthesis that does not leverage datapath synthesis. The 'original datapath' scheme first processes the initial network with the existing datapath synthesis algorithm and then uses 'strash; dch -v; map -v' to fully explore and leverage the benefits of structural optimization. Based on the same process, the proposed scheme replaces the original datapath synthesis with the proposed datapath synthesis, which can tolerate functional differences. The experiments are performed on bit-widths ranging from 8 to 64 to investigate the impact of scale.
As the experimental results summarized in Table 1 show, 'Rec-library', which maps the network with advanced reconfigurable devices, yields limited improvement in area. Applying the ABC optimization flow improves the area only to a very small extent. The original datapath scheme can change the network and occasionally improves the area compared to the previous schemes. This is because several topology-identical nodes can still be found in this case. As the partial structure of the benchmark circuit illustrated in Figure 8 shows, the small logic cones of O0 and O1 are optimizable because they have identical topology and gate operators except for the inverter/wire, which can be tolerated. However, the red components represent specification differences, and almost none of them can be tolerated by existing algorithms. In contrast, the proposed scheme leverages the common topology by transforming it into a common specification through the replacement of the red components with reconfigurable gates. The results demonstrate the effectiveness of the proposed algorithm, with an area improvement of up to 22.7% compared to the original datapath scheme.
In terms of delay, as shown in Table 2, the two datapath synthesis schemes have the same performance, and both of them increase the delay by one compared to the schemes without datapath synthesis. Moreover, since the proposed algorithm builds on the work in [10] and retains its main framework, their runtimes are comparable.
In the experiments above, we assume the area overhead of gate-level reconfiguration to be 1.5×. However, the cost of realizing reconfigurable gates varies significantly depending on the device technology. Here, a simulation is performed on the 64-bit case to reveal how the area overhead of reconfigurable gates affects the advantage of the proposed scheme. As shown in Figure 9, the proposed scheme becomes more advantageous with lower reconfiguration costs. The highest area improvement, 23.6%, is achieved when no overhead is applied. Meanwhile, it can be observed that the other four schemes are not sensitive to the area of reconfigurable gates. This observation provides further evidence that these existing approaches cannot effectively leverage reconfigurable gates, which have a stronger representative ability, even when the area overhead is removed.
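The sensitivity to the reconfiguration overhead follows from the unit-area library model and can be sketched in Python. The `mapped_area` helper and the gate counts below are hypothetical and are not taken from Table 1 or Figure 9; they only show why a mapping with no reconfigurable gates is flat in the overhead while a mapping that uses them scales linearly.

```python
def mapped_area(n_standard: int, n_reconfig: int,
                rec_rel_area: float, std_area: float = 1.0) -> float:
    """Total area with unit-area standard gates and scaled reconfigurable gates."""
    return n_standard * std_area + n_reconfig * rec_rel_area * std_area


# Made-up gate counts for two mappings of the same circuit.
without_rec = (100, 0)   # uses no reconfigurable gates
with_rec = (60, 20)      # shares logic through 20 reconfigurable gates

sweep = {
    rel: (mapped_area(*without_rec, rel), mapped_area(*with_rec, rel))
    for rel in (1.0, 1.25, 1.5, 1.75, 2.0)
}
```

In this toy model, the first mapping stays at the same area for every relative cost, while the area of the second grows by 20 × 0.25 = 5 units per sweep step.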

Conclusions
This paper introduces an advanced DAG-based datapath synthesis method that utilizes the unique reconfigurable feature of emerging devices to minimize the overall circuit-level area. It achieves logic-level resource sharing with greatly enhanced tolerance for specification differences, thereby enabling the effective utilization of identical topology. A range of solutions is proposed to address issues such as cone pairing priority, loop prevention, and inverter/wire removal. The proposed approach demonstrates its ability to identify more opportunities in datapath optimization and effectively tackle circuits that were previously challenging to address. This leads to a significant reduction in area within a design flow.

Figure 1. Three Boolean network representations for $f = (a + b) \times c + (a \oplus b) \times \bar{c}$: (a) an AIG network, in which every node is an AND gate and the dotted line represents an inverter, (b) a network mapped with standard two-input logic gates, and (c) a network mapped with reconfigurable gates.

Figure 2. (a) The original circuit with two logic cones, i.e., datapaths that are connected to the same MUX, and (b) the circuit that has been processed by the original datapath synthesis. This is the basic method of using graph isomorphism, and it requires specification equivalence between logic cone 1 and logic cone 2.

Figure 3. The existing approach of approximate pairing. The XOR gate in (b) is added to compensate for the inverter that only exists in the left cone in (a).

Figure 4. An example showing the limits of the existing approximate pairing method. (a) A case in which the specification difference lies in a two-input logic gate, (b) a transformed implementation from (a), and (c) the implementation generated by datapath synthesis using existing approximate pairing.

Figure 5. The novel approximate pairing scheme for leveraging gate-level reconfiguration. The colored gates mark the gates that are in the same topological position but have different operators.

Figure 6. An example of creating a loop using reconfiguration-based approximate pairing. Assume all nodes in (a,b) are AND gates except node 7, which is an OR gate; then, the common specification is generated as (c). The red circle marks the created loop.

Figure 7. An example of removing an inverter/wire combination using reconfiguration-based approximate pairing. (a) A circuit with an inverter difference between two datapaths, and (b) the solution that addresses the inverter difference by employing a reconfigurable gate.

Figure 8. The partial structure of the benchmark circuit. Red components indicate the specification difference.

Figure 9. Area of 64-bit 'A+B:A−C' generated by the five schemes with different area overheads of gate-level reconfiguration.

In addition, another example, 'parity([A[0:x/2-1] B[0:x/2-1]]):A[0:x/2-1]|C[0:x/2-1]' with the bit-width x = 2^n, is provided to demonstrate the effectiveness of the proposed scheme. The rest of the experiment setup is the same as in the previous experiment. The delay results of the proposed scheme and the other schemes are n and n + 1, respectively. The delays of the proposed scheme and the original datapath synthesis scheme are no longer equal since the latter cannot perform any optimization on the circuit. The area results are shown in Figure 10. It can be observed that area improvements of up to 26.7% are achieved with negligible delay overhead.

Figure 10. Area of 'parity([A B]):A|C' generated by the five schemes.

Table 1. Experimental area results for five schemes of implementing A+B:A−C with different bit-widths.

Table 2. Experimental delay results for five schemes of implementing A+B:A−C with different bit-widths.