Structural Decomposition in FSM Design: Roots, Evolution, Current State—A Review

Abstract: This review is devoted to methods of structural decomposition that are used for optimizing the characteristics of circuits of finite state machines (FSMs). These methods increase the number of logic levels in the resulting FSM circuits. They can be viewed as an alternative to methods of functional decomposition. The roots of these methods are analysed. It is shown that the first methods of structural decomposition appeared in the 1950s together with microprogram control units. The basic methods of structural decomposition are analysed: the replacement of FSM inputs, the encoding of collections of FSM outputs, and the encoding of terms. It is shown that these methods can be used for any element basis. Additionally, the joint application of different methods is shown. The changes in these methods related to the evolution of logic elements are analysed. The application of these methods for optimizing FPGA-based FSMs is shown. Such new methods as twofold state assignment and mixed encoding of outputs are analysed. Some methods are illustrated with examples of FSM synthesis. Additionally, some experimental results are presented. These results show that the methods of structural decomposition improve the characteristics of FSM circuits.


Introduction
The development of information technologies has led to the widespread use of various digital systems in different areas of mankind's activity [1][2][3][4][5][6][7][8][9]. It is known that digital systems consist of various combinational and sequential blocks [10,11]. As a rule, the circuits of combinational blocks are regular [12]. A designer can use standard library elements of computer-aided design (CAD) systems to implement such circuits [11]. For example, a multi-bit adder can be represented as a composition of standard single-bit adders. The sequential blocks could be very complex (for example, control units of computers) or rather simple (such as binary counters). It is known that the circuits of complex sequential blocks are irregular [10,12]. As a rule, there are no standard library solutions for such blocks. It means that each sequential block is synthesised anew. To synthesise the logic circuit of a sequential block, some tools are used to present the law of its behaviour.
Very often, the behaviour of sequential blocks is represented using the model of a finite state machine (FSM) [10,13,14]. Three characteristics of an FSM circuit significantly influence the characteristics of a digital system. These characteristics are the hardware amount, the operating frequency (the performance), and the power consumption. To design an FSM circuit, a state transition graph (STG) should be transformed into the corresponding state transition table (STT). An STT includes the following columns [10,13]: s C is a current state; s T is a state of transition; I h is an input signal determining a transition from s C to s T ; and O h is a subset of the set of outputs generated during the transition from the current state s C to the state of transition s T . We name this subset a collection of outputs. The numbers of the transitions (h ∈ {1, . . . , H}) are shown in the last column of the STT.
In this article, we mostly use STTs for initial representation of FSMs. For example, the FSM A 1 is represented by the STT (Table 1). The connection between the STG (Figure 1) and the STT (Table 1) is transparent.
There are two main types of FSM, namely, Mealy [31] and Moore [32] FSMs. The first of them was proposed in 1955 by G. Mealy; the second was proposed in 1956 by E. Moore. In both cases, the function δ determines the states of transition as functions depending on the current states and inputs. So, it is the following function:
$$s_T = \delta(s_C, I_h). \quad (1)$$
For Mealy FSMs, the function λ determines the outputs as functions depending on the current states and inputs. It gives the following function:
$$O_h = \lambda(s_C, I_h). \quad (2)$$
For Moore FSMs, the function λ determines the outputs as functions depending only on the current states. So, it is the following function:
$$O_h = \lambda(s_C). \quad (3)$$
The difference between (2) and (3) leads to a difference in the synthesis methods of Mealy and Moore FSMs. We now explain the stages of Mealy FSM synthesis starting from Table 1.
In 1965, Viktor Glushkov proved a theorem of the structural completeness [33]. According to this theorem, an FSM circuit is represented as a composition of the combinational part and the memory. The memory is necessary for keeping the history of the FSM's operation. The history is represented by FSM internal states. This fundamental approach is still widely used for the synthesis of FSM circuits [34][35][36][37][38].
An FSM logic circuit is represented by some systems of Boolean functions (SBFs) [10,13]. To find these SBFs for Mealy FSMs, it is necessary to [13]: (1) encode states s m ∈ S by binary codes K(s m ); (2) construct sets of state variables T = {T 1 , . . . , T R } and input memory functions (IMFs) D = {D 1 , . . . , D R }; and, (3) transform an initial STT into a direct structure table (DST). The states s m ∈ S are encoded during the step of state assignment [10].
The minimum possible number of state variables R S is determined as
$$R_S = \lceil \log_2 M \rceil. \quad (4)$$
The approach based on (4) defines so-called maximum binary codes [10]. This method is used, for example, in the well-known academic system SIS [39]. However, the number of state variables can be different from (4). For example, the one-hot state codes with R = M are used in the academic system ABC [30,40] of Berkeley. The maximum binary codes and one-hot codes define the extreme points of the encoding space. There are other approaches for state assignment where the following relation holds: ⌈log 2 M⌉ ≤ R S ≤ M.
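The two extreme points of the encoding space can be computed directly. The following is a minimal sketch, assuming the standard ceiling-of-logarithm form of (4); the function names are hypothetical and are introduced only for illustration.

```python
def min_state_bits(num_states: int) -> int:
    """R_S = ceil(log2(M)), computed exactly with integer arithmetic."""
    if num_states < 1:
        raise ValueError("an FSM needs at least one state")
    # (M - 1).bit_length() equals ceil(log2(M)) for every M >= 1.
    return (num_states - 1).bit_length()


def one_hot_state_bits(num_states: int) -> int:
    """One-hot assignment uses one state variable per state (R = M)."""
    return num_states


M = 4  # number of states of the FSM A1 discussed in the text
print(min_state_bits(M), one_hot_state_bits(M))
```

Any other state assignment uses a number of state variables between these two values.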
A state register (RG) keeps the state codes. The register includes R memory elements (flip-flops) having shared inputs of synchronization (Clock) and reset (Start). Very often, master-slave D flip-flops are used to organize state registers [41,42]. The pulse Clock allows the functions D r ∈ D to change the RG content.
After the execution of the state assignment, we should create a direct structure table. A DST includes all of the columns of an STT and three additional columns. These columns include the current state codes K(s C ) and the codes K(s T ) of the states of transitions. Finally, a column Φ h includes the symbols D r ∈ D corresponding to 1's in the code K(s T ) from the row h of a DST (h ∈ {1, . . . , H}). A DST is a base to construct the following SBFs:
$$D = D(T, I); \quad (5)$$
$$Y = Y(T, I). \quad (6)$$
The systems (5)-(6) determine a structural diagram of P Mealy FSM ( Figure 2) [42]. The block of input memory functions generates the functions (5). The block of outputs generates the system (6). The pulse Start loads the code of the initial state to RG. The pulse of synchronization Clock allows information to be written to the register.
A DST of Moore FSM is a base for deriving the systems (5) and
$$Y = Y(T). \quad (7)$$
A P Moore FSM is represented by a structural diagram that is similar to the one shown in Figure 2. However, as follows from SBF (7), there is no connection between the inputs i l ∈ I and the block of outputs.
We now discuss how to obtain systems (5)-(6) for P Mealy FSM A 1 . There is M = 4. Using (4) gives the value of R S = 2. This determines the sets T = {T 1 , T 2 } and D = {D 1 , D 2 }. Let us encode states in the trivial way: K(s 1 ) = 00, . . . , K(s 4 ) = 11. Having state codes allows transforming Table 1 (the initial STT) to Table 2. Table 2 is the DST of P FSM A 1 .
To fill the column D h , we should take into account that the value of D r ∈ D is equal to the value of the r-th bit of code K(s T ) [13]. Systems (5)-(6) are represented as sums-of-products (SOPs) [10,43]. These SOPs include product terms F h ∈ F corresponding to rows of a DST. The elements of the set of terms F are determined as
$$F_h = S_C I_h \quad (h \in \{1, \ldots, H\}). \quad (8)$$
In (8), the first member S C is a conjunction of state variables corresponding to a code of the current state K(s C ) from the h-th row of DST. There are the following conjunctions S C in the discussed case: $S_1 = \bar{T}_1 \bar{T}_2$, . . . , $S_4 = T_1 T_2$.
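The steps of trivial state assignment, forming the state conjunction, and filling the column of input memory functions can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, state codes are handled as bit strings, and `~` is used here as a plain-text stand-in for negation.

```python
def trivial_codes(num_states: int, width: int) -> dict:
    """Trivial assignment: K(s1) = 00...0, K(s2) = 00...1, and so on."""
    return {f"s{m}": format(m - 1, f"0{width}b") for m in range(1, num_states + 1)}


def state_conjunction(code: str) -> str:
    """Conjunction of state variables: T_r for a 1 bit, ~T_r for a 0 bit."""
    return " ".join(
        f"T{r}" if bit == "1" else f"~T{r}" for r, bit in enumerate(code, start=1)
    )


def excited_imfs(next_code: str) -> list:
    """D_r is listed exactly when the r-th bit of K(s_T) equals 1."""
    return [f"D{r}" for r, bit in enumerate(next_code, start=1) if bit == "1"]


codes = trivial_codes(num_states=4, width=2)
print(codes)
print(state_conjunction(codes["s1"]), "|", state_conjunction(codes["s4"]))
print(excited_imfs("10"))
```

A row of a DST whose state of transition has the code 10 therefore contributes only D1 to its column of input memory functions.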
Using Table 2, we can obtain the following SBFs: The SBF (9) determines the circuit of block of outputs and the SBF (10) determines the circuit of block of input memory functions.
The hardware amount in an FSM circuit depends on the combination of SBF characteristics (the numbers of literals, functions, and product terms of SOPs) and specifics of the used logic elements (the number of inputs, outputs and product terms). Denote, by NA(f i , F h ), the number of literals in a term F h of the SOP of a function f i , and, by NT(f i ), the number of terms in a SOP of this function. Obviously, the following conditions are true for a SOP of any function f i ∈ D ∪ O:
Consider the SOP of function D 1 from SBF (10). Each term of this SOP includes NA(D 1 , F h ) = 3 literals. There are NT(D 1 ) = 3 terms in this SOP. If NAND gates having NI NAND = 3 inputs are used for implementing a logic circuit corresponding to D 1 , then there are four gates and two levels of gates in the circuit. This is the best solution, because the circuit includes the minimum possible number of gates (the minimum hardware amount), their levels (the maximum operating frequency), and interconnections.
However, if there is N I N AND = 2, then the SOP should be transformed. After the transformation, the SOP is represented by the following formula: Twelve gates are necessary for implementing the function (13). The resulting circuit has six levels of gates. Thus, an imbalance between the characteristics of the function and logic elements leads to an increase in the number of gates and levels of logic in the resulting logic circuit.
This situation can occur for any logical elements (logic gates, ROMs, PROMs, PLAs, PALs, CPLDs, FPGAs, and so on). In this case, it is necessary to optimize the characteristics of a resulting logic circuit. The structural decomposition is one of the ways for such an optimization [21].

Roots of Structural Decomposition
The control units' circuits of the first computers were characterized by an irregular structure [44][45][46][47][48] with all the ensuing consequences. In 1951, M. Wilkes, a professor at Cambridge, proposed a principle of microprogram control [25,26]. According to this principle, each computer instruction is represented as a microprogram kept in a special control memory (CM). A microprogram consists of microinstructions. Each microinstruction has an operational part with control outputs (microoperations) and an address part having data used for generating an address of transition (the address of the next microinstruction to be executed). A special register is used to keep the microinstruction address. This approach allows for obtaining a microprogram control unit (MCU) with a regular circuit, which is quite simple to implement and test. A trivial structural diagram of the MCU is shown in Figure 3.
The MCU (Figure 3) uses the microinstruction address from the register and logical conditions (inputs) i l ∈ I to generate outputs o n ∈ O and the next address represented by variables T r ∈ T. A comparison of Figures 2 and 3 shows that the MCU is a finite state machine in which blocks of input memory functions and outputs are replaced by the control memory. At the same time, microinstructions correspond to FSM states; microinstruction addresses correspond to state codes. This connection between FSMs and MCUs was first noted in [49].
The circuit of control memory was implemented using ROM [50][51][52]. For MCU (Figure 3), the required volume of such a ROM, V ROM , is determined as
$$V_{ROM} = 2^{L + R_S} (N + R_S). \quad (14)$$
For average FSMs [13], there is N = 50, R S = 8, L = 30. If such a control unit is implemented as MCU (Figure 3), then about 10^13 bits of control memory are required. In the 1950s, the use of such a big control memory would have led to a significant increase in the cost of a computer. Because of this, Figure 3 shows the idea of an MCU rather than a practical way of implementing it.
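The order of magnitude quoted above can be checked with a short calculation. This is a sketch that assumes the control memory is addressed by all L inputs and R_S address bits and that each word stores N outputs and R_S next-address bits; the function name is hypothetical.

```python
def rom_bits_trivial_mcu(L: int, R_S: int, N: int) -> int:
    """Control memory volume: 2^(L + R_S) words of (N + R_S) bits each."""
    return 2 ** (L + R_S) * (N + R_S)


bits = rom_bits_trivial_mcu(L=30, R_S=8, N=50)
print(f"{bits:.2e} bits")
```

For the average FSM the result is of the order of 10^13 bits, which agrees with the estimate in the text.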
In order to diminish the required value of V ROM , two approaches have been proposed by M. Wilkes. The first of them is the selection of an input that should be used for generation of the transition address. As a rule, only a single logic condition is selected in each cycle of MCU operation. This allows for reducing the length of the address part of microinstruction. This approach leads to a two-level MCU shown in Figure 4.
The second approach is an encoding of collections of microoperations Y q ⊆ O by maximum binary codes K(Y q ) having R Q bits. In practical cases [53], there is R Q ≤ 8. This allows reducing the length of the operational part of microinstruction down to
$$R_Q = \lceil \log_2 Q \rceil. \quad (15)$$
In (15), we use Q to denote the number of different collections of outputs (COs) for a particular STT. If an MCU is implemented starting from STT (Table 1), then the following COs can be found:
Accordingly, there is Q = 4. Using (15) gives R Q = 2. Let us use elements of the set Z = {z 1 , . . . , z RQ } for encoding of the COs. It gives the set Z = {z 1 , z 2 }. Figure 5 shows one of the possible outcomes of encoding. As follows from Figure 5, there is K(Y 1 ) = 00, ..., K(Y 4 ) = 11. The system of outputs is represented by the following SOP:
To implement the system (16), we should include in the MCU a block of outputs. This block consists of a decoder (DC) and a coder. Hence, this block has two levels of logic. The decoder transforms codes of COs into one-hot codes corresponding to COs. The coder transforms these one-hot codes into outputs. In the general case, the outputs are represented by the system
If both of the approaches are used simultaneously, then there are three levels of logic blocks in the MCU. Figure 6 shows the structural diagram of three-level MCU with the compulsory addressing of microinstructions [50,54].
In the case of the compulsory addressing of microinstructions, the microinstruction format includes the operational part having R Q bits and the address part having R L = ⌈log 2 L⌉ bits with a code of logical condition to be checked and two address fields. The first address field includes an address of transition, if a logical condition to be checked is equal to 0 (or an address of unconditional transition). The second address field includes an address of transition if a logical condition to be checked is equal to 1. If a microprogram includes M microinstructions, then the number of address bits is determined by (4). Accordingly, each microinstruction has R Q + R L + 2 × R S bits. If R Q = R S = 8 and R L = 6, then the value of V ROM is equal to 30 × 256 = 7680 bits.
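The arithmetic of the compulsory addressing format can be reproduced with a few lines. The sketch below assumes, as the example in the text does, that the control memory holds 2^R_S microinstructions; the function names are hypothetical.

```python
def microinstruction_bits(R_Q: int, R_L: int, R_S: int) -> int:
    """Operational part, condition code, and two address fields."""
    return R_Q + R_L + 2 * R_S


def control_memory_bits(R_Q: int, R_L: int, R_S: int) -> int:
    """Microinstruction width times the number of addressable words."""
    return microinstruction_bits(R_Q, R_L, R_S) * 2 ** R_S


print(microinstruction_bits(8, 6, 8), control_memory_bits(8, 6, 8))
```

With R_Q = R_S = 8 and R_L = 6 this gives 30 bits per microinstruction and 7680 bits in total.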
The block of addressing generates address variables D r ∈ D. These variables depend on the inputs and microinstruction address part. This block is implemented on multiplexers (MXs) [52]. The block of outputs generates outputs as functions of the MCU operational part. Hence, an MCU is a Moore FSM.
The MCU with the block of addressing became the prototype of the FSMs with replacement of inputs. In literature [41] such FSMs are called MP FSMs, where "M" means "multiplexer". The MCU with the block of outputs became the prototype of PY FSMs with encoding of collections of outputs. The three-level MCU ( Figure 6) corresponds to MPY FSM. This means that various methods of structural decomposition can be used together.
The method of encoding of fields of compatible outputs (FCOs) was proposed to eliminate the coder from the block of outputs [55]. The outputs are compatible if they are not written in the same rows of STT. The set O is divided into I classes of compatible outputs:
Outputs o n ∈ O i are encoded by maximum binary codes K i (o n ). There are R i bits in the code K i (o n ):
$$R_i = \lceil \log_2 (N_i + 1) \rceil. \quad (19)$$
In (19), we use the symbol N i to denote the number of outputs in the class O i . The one is added to N i to take the relation o n ∉ O i into account. The outputs o n ∈ O i are encoded using variables z r ∈ Z i . The total number of operational part bits, R FCO , is determined by summation of the values of (19). The structural diagram of the MCU based on this principle is the same as the one shown in Figure 6. However, the block of outputs consists of I decoders DC i . A decoder DC i generates outputs from the field FCO i .
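The notion of compatible outputs and the resulting field widths can be illustrated with a small sketch. It is not the method of [55]: the partition is found by a simple greedy pass, the rows are hypothetical data, and all function names are introduced here only for illustration.

```python
from itertools import combinations


def ceil_log2(x: int) -> int:
    """Exact ceil(log2(x)) for x >= 1."""
    return (x - 1).bit_length()


def partition_compatible(rows: list) -> list:
    """Greedily split outputs into classes whose members never share a row."""
    conflicts = set()
    for row in rows:
        for a, b in combinations(sorted(row), 2):
            conflicts.add((a, b))  # a and b appear in the same row
    classes = []
    for out in sorted(set().union(*rows)):
        for cls in classes:
            if all((min(out, m), max(out, m)) not in conflicts for m in cls):
                cls.append(out)
                break
        else:
            classes.append([out])
    return classes


def fco_bits(classes: list) -> int:
    """R_FCO: sum of ceil(log2(N_i + 1)) over all classes."""
    return sum(ceil_log2(len(cls) + 1) for cls in classes)


# Hypothetical collections of outputs, one set per STT row.
rows = [{"o1", "o2"}, {"o3"}, {"o1", "o4"}]
classes = partition_compatible(rows)
print(classes, fco_bits(classes))
```

A greedy pass does not guarantee the minimum number of classes; it only shows how the compatibility relation and formula (19) fit together.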
This approach was used in optimizing control units of IBM/360 [56]. Additionally, it became the prototype of PD FSMs [41].
There are three possible organizations of the block of outputs that are shown in Figure 7. As follows from Figure 7, the one-hot organization (Figure 7a) leads to the fastest MCUs having the longest operational part. The block of outputs is absent. The maximum encoding of collections of outputs (Figure 7b) results in the two-level block of outputs. This is the slowest solution, but it provides the shortest operational part. As follows from Figure 7c, the encoding of FCOs results in a single-level block of outputs. This approach provides a compromise solution with the average delay and hardware amount.
The value of V ROM can be reduced by using the nanomemory [44,54]. We now explain the idea of this approach (Figure 8). It means that this approach allows reducing the volume of control memory by 2.46 times when compared to the MCU (Figure 6). This approach is a prototype of PH FSMs with encoding of product terms [41].
One fundamental law follows from the analysis of different methods of minimizing the value of V ROM . This is the following: reducing the hardware amount leads to an increase in the delay time of the resulting circuit (due to an increase in the number of logic levels). This law holds for all methods of structural decomposition.

Structural Decomposition in Matrix-Based FSMs
If an FSM is a part of an application-specific integrated circuit (ASIC) [57], then its circuit can be implemented using custom matrices [13,58]. These matrices are used as either AND-planes or OR-planes [59]. Each plane is a system of wires connected by CMOS transistors. Two wires (direct and complement values of corresponding arguments) represent each literal of a SOP. Each term of a SOP corresponds to a wire.
To implement a matrix circuit of Mealy FSM, it is enough to use a single AND-matrix M 1 and a single OR-matrix M 2 . This is a trivial matrix implementation of P Mealy FSM ( Figure 9).
The trivial matrix circuit ( Figure 9) represents a P Mealy FSM [41]. This is the fastest matrix solution. However, such a solution is very redundant.
The hardware amount of matrix circuits is defined in conventional units of area (CUA) of matrices [13]. These areas are determined as the following:
$$S(M_1) = 2(L + R_S) H; \quad S(M_2) = H (N + R_S). \quad (20)$$
In (20), the symbol S(M i ) denotes the area of the matrix M i .
Two methods of structural decomposition were used to reduce the chip area that is occupied by an FSM circuit, namely [58]:
1. The replacement of inputs (MP FSM).
2. The encoding of collections of outputs (PY FSM).
To design an MP FSM, it is necessary to replace the set I by some set P = {p 1 , . . . , p G }. This makes sense if G ≪ L.
The value of G is determined by the maximum number of inputs causing transitions from states s m ∈ S [58]. Consider the DST of Mealy FSM A 2 (Table 3).
In the case of A 2 , we have G = 2. Accordingly, there is a set P = {p 1 , p 2 }.
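The rule for choosing G can be written down in a few lines. The sketch below uses hypothetical data (it does not reproduce Table 3) and a hypothetical function name; it only shows that G is the largest number of inputs that any single state examines.

```python
def replacement_variable_count(inputs_per_state: dict) -> int:
    """G: the maximum number of inputs causing transitions from one state."""
    return max((len(inputs) for inputs in inputs_per_state.values()), default=0)


# Hypothetical FSM: each state maps to the inputs tested in its transitions.
inputs_per_state = {
    "s1": {"i1", "i2"},
    "s2": {"i3"},
    "s3": {"i4", "i5"},
    "s4": set(),  # unconditional transition
}
G = replacement_variable_count(inputs_per_state)
print(G)
```

In this hypothetical case five inputs are replaced by two additional variables, so the replacement pays off.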
To replace inputs, it is necessary to create the following SBF: This SBF is constructed using a table of replacement. In the discussed case, Table 4 presents the table of replacement.
Using Table 4 gives the following SBF: The SOP for p 1 includes terms v 1 -v 4 ; the SOP for p 2 includes terms v 5 -v 8 . Hence, there is NT(P) = 8.
We should construct a table of MP FSM to design the circuit of MP FSM. It can be done by a transformation of the DST of P FSM. The transformation is reduced to the replacement of the column I h by the column P h [58]. In the discussed case, this leads to Table 5.
From Table 5, we can find that the IMFs and outputs of MP FSM are represented by the following SBFs:
Systems (23)-(25) determine a matrix circuit of MP FSM that is shown in Figure 10. In the MP Mealy FSM (Figure 10), the matrix M 3 implements terms of SBF (23). The matrix M 4 transforms terms v r ∈ V into functions (23). The matrix M 1 implements terms F h ∈ F. These terms correspond to the rows of DST. The matrix M 2 generates functions (25)-(26). These matrices have the following areas:
To optimize the matrix M 2 , the method of encoding of COs can be used [58]. As in the case of the MCU, Q COs are encoded by binary codes K(Y q ). These codes have R Q bits, where the expression (15) determines the value of R Q .
For FSM A 2 , the following COs can be found:
To minimize the number of literals in (17), it is necessary to encode COs Y q ⊆ O using the approach [60]. In the discussed case, Figure 11 shows the outcome of encoding.
Figure 11. Optimal codes of collections of outputs.
Using codes (Figure 11), we can get the following SBF: There are N = 5 terms in (28). In the general case, there are NT(O) terms in (17). They form a set W.
To implement a PY FSM circuit, it is necessary to create a DST of PY FSM. For the FSM A 2 , it is Table 6.
The DST is a base for deriving SBFs (5) and
$$Z = Z(T, I). \quad (29)$$
The SBFs (5), (17), and (29) determine a PY Mealy FSM whose structural diagram is shown in Figure 12.
In PY FSM, the matrix M 2 implements functions D r ∈ D and variables z r ∈ Z. The matrix M 5 transforms z r ∈ Z into terms of SBF (17). The matrix M 6 generates outputs o n ∈ O. These matrices have the following areas:
These approaches can be used simultaneously [58]. This leads to MPY Mealy FSM (Figure 13).
There are two levels of logic in the matrix circuit of P Mealy FSM (Figure 9). The circuit of MPY FSM (Figure 13) has six levels of logic. Obviously, the P FSM is three times faster than an equivalent MPY FSM (Figure 13). Let us compare areas of equivalent FSMs.
As shown in [58], the average FSMs have the following characteristics: L = 30, N = 50, R S = 8, H = 2000, G = 4, NT(P) = 50, R Q = 6, NT(O) = 80. Substituting these values into the area formulas gives the total area of the MPY FSM circuit: 83,460 CUA. The area of P Mealy FSM (Figure 9) is 268,000 CUA. This gives around 69% of economy. Accordingly, an increase in the number of levels of a matrix circuit leads to an average reduction in area by about 3.2 times. Of course, the FSM performance practically decreases to the same extent.
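The comparison above can be checked numerically. The sketch below rests on an assumption that is not stated explicitly in the surviving text: the AND-plane area is taken as 2(L + R_S)H (two wires per literal, one wire per term) and the OR-plane area as H(N + R_S). With the average parameters these assumed formulas reproduce the quoted 268,000 CUA. The MPY total is taken from the text as reported, not recomputed, and the function name is hypothetical.

```python
def p_fsm_area(L: int, N: int, R_S: int, H: int) -> int:
    """Area of the trivial two-matrix P Mealy FSM in CUA (assumed formulas)."""
    and_plane = 2 * (L + R_S) * H  # direct and complemented literal wires
    or_plane = H * (N + R_S)       # one column per output and per IMF
    return and_plane + or_plane


area_p = p_fsm_area(L=30, N=50, R_S=8, H=2000)
area_mpy = 83_460  # total MPY FSM area as reported in the text

saving = 1 - area_mpy / area_p
ratio = area_p / area_mpy
print(area_p, f"{saving:.1%}", f"{ratio:.2f}")
```

The saving rounds to the 69% quoted in the text, and the corresponding area ratio is a little above 3.2.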
Accordingly, the methods of structural decomposition can be used for optimizing matrix circuits of Mealy FSMs. The same is true for Moore FSMs [61]. For further reducing the area, it is necessary to apply various methods of joint minimization of SBFs [43].
There is one common feature of simple programmable logic devices (SPLDs). Namely, they can be viewed as a composition of AND and OR arrays [62][63][64][65][70]. A typical SPLD structure is exactly the same as the one shown in Figure 9. Accordingly, SPLDs can implement SOPs representing the systems of Boolean functions.
In the case of PROM, the AND-array is fixed. It creates an address decoder. The OR-array is programmable. A PROM is the best tool for implementing SBFs that are represented by truth tables [10]. The number of address inputs of a PROM was rather small. Accordingly, PROMs were used for implementing only parts of FSM circuits [71].
The joint use of PROMs and multiplexers (MXs) leads to MP FSMs. The MXs implement the replacement of inputs that are represented by (23). The PROMs implement systems (25)-(26). To keep state codes, the register RG is used (Figure 14a). The joint use of PROMs, decoders (DCs), and MXs leads to MPD FSMs (Figure 14b). To implement MPY FSMs, it is necessary to use MXs and PROMs (Figure 14c). As follows from Figure 14, different logic elements implement different parts of FSM circuits. This approach is a heterogeneous implementation of FSM circuit [71]. Of course, it is enough to use only memory blocks for implementing an FSM circuit [72].
As a rule, FSM circuits were represented by networks of PLAs [13,75]. To optimize the number of chips in a circuit, the methods of structural decomposition were used. Additionally, the principle of heterogeneous implementation was used. For example, MPY FSMs could be implemented using MXs, PLAs, and PROMs (Figure 15). Different approaches were used for optimizing characteristics of PLA-based FSMs [76][77][78][79][80][81][82]. One of the new approaches was an encoding of FSM terms [78], leading to PH FSMs. In this case, terms F h ∈ F, corresponding to rows of STT, were encoded by binary codes K(F h ) having R H bits:
$$R_H = \lceil \log_2 H \rceil.$$

To encode terms, variables z r ∈ Z were used, where |Z| = R H . The following SBFs represent PH FSMs: These SBFs were implemented using PLAs (for Z) and PROMs (for D, O). Such a composition of PLAs and PROMs leads to PH FSM ( Figure 16).
Obviously, PH FSM ( Figure 16) can be transformed into MPH, MPHY, MPHD, PHY, and PHD FSMs. To optimize circuits with decoders, the method [50] can be used.
To optimize hardware of PLA-based FSMs, it is possible to use the methods that are based on transformation of objects [27,83,84]. The following objects are characteristic of Mealy FSMs [71]: states, outputs, and collections of outputs. The main idea of this approach is a representation of some objects as functions of other objects and additional variables.
The transformation of states into outputs leads to P S FSMs (Figure 17a). The transformation of states into COs leads to P S Y FSMs (Figure 17b). The transformation of COs into states leads to P O Y FSMs (Figure 17c). As follows from Figure 17a, additional variables v r ∈ V replace inputs i e ∈ I in the SBF of outputs:
If |V| ≪ L, then the SOPs of (34) are much simpler than SOPs of (6). In P S Y FSMs (Figure 17b), the following SBFs are generated:
In the case of P O Y FSMs (Figure 17c), the following new SBF is implemented:
As follows from [83], the transformation of objects improves performance as compared with MPY FSMs. Because of it, they are used in FPGA-based design [21].
The PAL chips have the following specifics [64,85]: the AND-array is programmable and the OR-array is fixed. The terms of PAL are assigned to macrocells [23,74]. The evolution of this conception led to complex programmable logic devices (CPLDs) [15,69,86]. There are a huge number of publications related to PAL- and CPLD-based synthesis [64,73,85][87][88][89][90][91]. We do not discuss these methods in this survey. However, we note that the structural decomposition is used in CPLD-based FSMs [23].

Basic Methods of Structural Decomposition in Design with LUTs and EMBs
Field-programmable gate arrays are widely used for implementing circuits of various digital systems [12,15,69,92]. To implement an FSM circuit, the following internal resources of FPGA chip can be used: look-up table (LUT) elements, embedded memory blocks (EMBs), programmable flip-flops, programmable interconnections, input-output blocks, and block of synchronization. LUTs and flip-flops form configurable logic blocks (CLBs). The "island-style" architecture is used in the majority of FPGAs [17,93,94].
A LUT is a block having S L inputs and a single output [95][96][97][98]. If a Boolean function depends on up to S L arguments [67], then the corresponding circuit only includes a single LUT. However, the number of LUT inputs is very limited [95][96][97]. Due to it, the methods of functional decomposition are used to implement the FPGA-based FSM circuits [99][100][101][102][103]. As a result, the FSM circuits have a lot of logic levels and complex systems of interconnections [29]. Such circuits resemble programs that are based on intensive use of "go-to" operators [104]. Using terminology from programming, we can say that the functional decomposition produces the "spaghetti-type" LUT-based FSM circuits.
Modern FPGAs include a lot of configurable embedded memory blocks [95,96]. These blocks allow for implementing systems of regular functions [28]. If at least a part of the FSM circuit is implemented using EMBs, then the characteristics of this circuit can be significantly improved [16]. Because of it, there are a lot of design methods targeting EMB-based FSMs [16][105][106][107][108][109][110][111][112][113][114][115]. In [28], there is a survey of various methods of EMB-based FSM design. However, very often, practically all available EMBs are used for implementing the operational blocks of digital systems. Accordingly, the EMB-based FSM design methods can only be applied if a designer has some "free" EMBs.
An EMB can be characterized by a pair ⟨S A , t F ⟩, where S A is the number of address inputs and t F is the number of memory cell outputs. A single EMB can keep a truth table of an SBF including up to t F Boolean functions depending on up to S A arguments [116]. A pair ⟨S A , t F ⟩ defines a configuration of an EMB with the constant total number of bits (size of EMB):
$$V_0 = 2^{S_A} \times t_F.$$
The parameters S A and t F could be defined by a designer [66]. It means that EMBs are configurable memory blocks [67]. The following configurations exist for modern EMBs [95,96]: ⟨15, 1⟩, ⟨14, 2⟩, . . . , ⟨9, 64⟩. Accordingly, modern EMBs are very flexible and can be tuned to meet characteristics of a particular FSM. This explains the existence of a wide spectrum of EMB-based design methods [16][105][106][107][108][109][110][111][112][113][114][115].
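The list of configurations follows from the constant size of the block. The sketch below assumes a block of 32,768 bits, which is the size implied by the listed pairs (for example, 2^15 × 1), and enumerates the pairs of address width and word width; the function name is hypothetical.

```python
def emb_configurations(total_bits: int, max_address_bits: int, min_address_bits: int) -> list:
    """All (S_A, t_F) pairs with 2^S_A * t_F equal to the block size."""
    configs = []
    for s_a in range(max_address_bits, min_address_bits - 1, -1):
        words = 2 ** s_a
        if total_bits % words == 0:
            configs.append((s_a, total_bits // words))
    return configs


for s_a, t_f in emb_configurations(32_768, max_address_bits=15, min_address_bits=9):
    print(f"<{s_a}, {t_f}>")
```

Each step that removes one address input doubles the number of cell outputs, so the product stays constant.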
If the condition
$$2^{L + R_S} (N + R_S) \le V_0 \quad (39)$$
holds, where V 0 is the size of an EMB, then a single EMB implements an FSM circuit [28]. If (39) is violated, then an FSM circuit could be implemented as: (1) a homogeneous network of EMBs or (2) a heterogeneous network where LUTs and EMBs are used together [16,114]. There are three approaches for implementing combinational parts of CLB-based FSMs. They are the following: (1) using only LUTs; (2) using only EMBs; and, (3) using the heterogeneous approach, when both LUTs and EMBs are applied [28].
One of the most crucial steps in the CLB-based design flow is the technology mapping [29,117,118]. The outcome of the technology mapping is a network of interconnected CLBs representing an FSM circuit. This step largely determines the resulting characteristics of an FSM circuit. These characteristics are strongly interrelated.
A chip area occupied by a CLB-based FSM circuit is mostly determined by the number of CLBs and the system of their interconnections. Obviously, to reduce the area, it is necessary to reduce the CLB count in an FSM circuit. As follows from [119], the more LUTs are included into an FSM circuit, the more power it consumes. Now, "process technology has scaled considerably . . . with current design activity at 14 and 7 nm. Due to it, interconnection delay now dominates logic delay" [18]. As noted in [120], the interconnections are responsible for up to 70% of the power consumption. Accordingly, it is very important to reduce the amount of interconnections to improve the characteristics of FSM circuits. All of this can be done using methods of structural decomposition.
As follows from (39), an FSM circuit can be implemented by a single EMB if the following conditions hold for a configuration ⟨S A , t F ⟩:
$$L + R_S \le S_A; \quad (40)$$
$$R_S + N \le t_F. \quad (41)$$
As a rule, the modern EMBs are synchronous blocks. Hence, there is no need for an additional register to keep FSM state codes [28]. Figure 18 shows a trivial EMB-based circuit of Mealy FSM. To design such a circuit, it is necessary to [28]: (1) execute the state assignment; (2) construct a DST on the base of an STT; and, (3) create the truth table corresponding to the DST. This truth table has L + R S columns containing an address of a particular cell. Each cell has R S + N bits. Transitions from any state s m ∈ S are represented by H(s m ) rows of the truth table [28]:
$$H(s_m) = 2^L. \quad (42)$$
The following parameters can be found for A 1 (Table 2): the number of inputs L = 3, and the number of state variables R S = 2. Accordingly, using (42) gives H(s m ) = 8. If an input i e ∈ I is insignificant for transitions from a state s m ∈ S, then there are the same values of IMFs and outputs for cells with addresses having either i e = 0 or i e = 1. This rule is illustrated by Table 9 with the transitions from state s 2 from Table 2.
In Table 9, the number of a cell is shown in the column q. The column h is added to compare Tables 2 and 9. The even rows of Table 9 correspond to i 3 = 1, and the odd rows correspond to i 3 = 0.
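The single-EMB conditions and the expansion of insignificant inputs can be sketched as follows. The assumptions are labelled in the comments: the fit test uses the address width L + R_S and the cell width R_S + N described in the text, the value N = 4 in the usage lines is hypothetical, and `-` marks an insignificant input in a hypothetical pattern notation.

```python
from itertools import product


def rows_per_state(L: int) -> int:
    """Each state occupies 2^L rows of the truth table."""
    return 2 ** L


def fits_single_emb(L: int, R_S: int, N: int, S_A: int, t_F: int) -> bool:
    """Address needs L + R_S bits; each cell needs R_S + N bits."""
    return L + R_S <= S_A and R_S + N <= t_F


def expand_dont_cares(pattern: str) -> list:
    """Expand an input pattern such as '1-0' into all concrete input values.

    Every resulting address receives the same IMFs and outputs.
    """
    choices = [("0", "1") if ch == "-" else (ch,) for ch in pattern]
    return ["".join(bits) for bits in product(*choices)]


print(rows_per_state(3))
print(fits_single_emb(L=3, R_S=2, N=4, S_A=9, t_F=64))  # N = 4 is hypothetical
print(expand_dont_cares("1-0"))
```

A transition that ignores one input therefore fills two cells of the truth table with identical contents.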

The transition from LUTs to EMBs is similar to the transition from gates to large scale integration circuits. This transition improves all the characteristics of an FSM circuit, namely, the chip area occupied by the FSM circuit, the FSM performance, and the power consumption. If conditions (40)-(41) are violated, then methods of structural decomposition can be used [21]. In this case, an FSM circuit is represented as a network of EMBs and LUTs.
An analysis of the literature shows that the following methods of structural decomposition are used in EMB-based FSM design:
1.
2. The maximum encoding of collections of outputs leading to PY FSMs [28].
4. The encoding of product terms leading to PH FSMs [122].
Following the notation of [21], we denote as LUTer a block consisting of LUTs and as EMBer a block consisting of EMBs. The structural diagram of the MP Mealy FSM is shown in Figure 19. In the MP FSM, the LUTerP implements SBF (23), and the EMBer contains a truth table of SBFs (25)-(26). As follows from Figures 18 and 19, the outputs $o_n \in O$ are synchronized. This is necessary to stabilize FSM outputs [42]. The MP Mealy FSM can be used if the following condition holds:

Clearly, the MP FSM (Figure 19) uses the idea of the two-level MCU (Figure 4) in an FPGA environment. The state variables create the address part of microinstructions. The number of EMBs in the EMBer is determined as

To diminish the value of $n_{EMB}$, the maximum encoding of COs $Y_q \subseteq O$ can be used [21]. The replacement of inputs can be used together with this approach. This results in the MPY Mealy FSM (Figure 20). In the MPY FSM, the EMBer implements SBFs (23) and (31). The LUTerO transforms codes $K(O_q)$ into outputs $o_n \in O$. To do it, SBF (17) is implemented by the LUTerO. Now, the number of EMBs in the EMBer is determined as

The value of $R_Q$ is determined by (15). If the condition

$$R_Q \le S_L \quad (46)$$

holds, then a single-level circuit of LUTerO includes up to $N$ LUTs. If (46) is violated, then a mixed encoding of outputs [121] can be used. The idea of this approach is the following. Let $Q = 17$, $R_Q = 5$, and $S_L = 4$. For these values, condition (46) is violated. Let the set of COs include COs

In [121], a method is proposed that creates such a partition of the set $O$. It eliminates the minimum possible number of elements of $O$ to create the set $O_E$.
This approach can be used to diminish the number of CLBs in the circuit of LUTerO. For example, $S_L = 6$ for LUTs of Virtex 7 [96]. If $R_Q = 6$, then the number of LUTs in the circuit of LUTerO is equal to $N$. However, the CLB can be organized as two LUTs having five shared inputs. If the mixed encoding of outputs gives the set $O_L$ with $R_Q = 5$, then the number of LUTs in LUTerO is determined as $\lceil |O_L|/2 \rceil$. The closer the values of $N$ and $|O_L|$ are, the greater the saving in the number of CLBs.
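The arithmetic behind this saving can be sketched as follows. The function name `lutero_lut_count` and the example values are hypothetical. The sketch only restates the two cases named above (a CO code that needs all six LUT inputs, and a CO code of at most five bits, where two outputs share one LUT) and treats a code longer than $S_L$ as out of scope.

```python
from math import ceil


def lutero_lut_count(n_outputs, r_q, s_l=6, shared_inputs=5):
    """Estimate the number of LUTs in a single-level LUTerO.

    n_outputs     -- number of outputs generated from the CO codes
    r_q           -- number of bits in a CO code
    s_l           -- number of LUT inputs
    shared_inputs -- inputs shared by the two functions of one LUT
    """
    if r_q > s_l:
        raise ValueError("R_Q > S_L: the circuit is not single-level")
    if r_q <= shared_inputs:
        # Two outputs fit into one LUT with shared inputs.
        return ceil(n_outputs / 2)
    # The code needs all LUT inputs: one LUT per output.
    return n_outputs


# Hypothetical values: N = 10 outputs with R_Q = 6,
# or |O_L| = 8 outputs left after a mixed encoding with R_Q = 5.
full_encoding = lutero_lut_count(n_outputs=10, r_q=6)
mixed_encoding = lutero_lut_count(n_outputs=8, r_q=5)
```

In this hypothetical case, the outputs removed from the encoded set are assumed to be produced elsewhere, so only the remaining eight outputs are counted for the LUTerO.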
Two approaches are possible for implementing EMB-based Mealy FSMs [122]. In both cases, the binary codes $K(F_h)$ encode the terms $F_h \in F$. These codes have $R_H$ bits. The variables $z_r \in Z$ are used for the encoding of terms, where $|Z| = R_H$. The value of $R_H$ is determined by (32). The system $Z = Z(T, I)$ represents the block of terms [122]. This system can be implemented as either a network of LUTs (Figure 22a) or a network of EMBs (Figure 22b).
Both approaches should be tried, and the one leading to the minimum hardware should be selected [122].

Structural Decomposition in LUT-Based Design
As mentioned in [12], EMBs are widely used for implementing various blocks of digital systems. Accordingly, it is quite possible that only LUTs are available for implementing FSM circuits. The methods of structural decomposition may be used in LUT-based FSMs [21]. They are used to improve LUT counts (and other characteristics) of LUT-based P Mealy FSMs (Figure 23). In P FSMs, the LUTerD implements SBF (5) and the LUTerO implements SBF (6). Each function $f_i \in D \cup O$ is represented by a SOP having $NA(f_i)$ literals. In the best case, there are $R_S$ LUTs in the circuit of LUTerD and $N$ LUTs in the circuit of LUTerO. The following relation determines this case:

$$NA(f_i) \le S_L, \quad f_i \in D \cup O. \quad (47)$$

If (47) is violated, then a P FSM is represented by a multi-level circuit. To improve the LUT count of such circuits, the model of MPY FSM can be used.
This approach is proposed in [123]. It leads to a three-level circuit that is shown in Figure 24. In the MPY FSM, the LUTerP implements system (23). It generates additional variables $p_g \in P$ replacing inputs $i_e \in I$. The LUTerD generates input memory functions that are represented by (25). The LUTerZ generates variables $z_r \in Z$ used for the encoding of collections of outputs. This block implements SBF (31). The LUTerO implements outputs $o_n \in O$ that are represented by SBF (17).
The method of synthesis of a LUT-based MPY FSM includes the following steps [123]:
1. Executing the replacement of inputs.
3. Deriving collections of outputs from the STT.
4. Executing the encoding of COs.
Implementing the FSM circuit using particular LUTs.
In [123], the results of experiments conducted to compare the characteristics of various models of LUT-based FSMs are shown. The standard benchmarks [124] were used for investigation. These benchmarks are Mealy FSMs; they are represented in KISS2 format. Table 10 contains the characteristics of these benchmark FSMs.
Four other methods were compared with MPY FSMs: Auto of Vivado, one-hot of Vivado, JEDI [39,127], and DEMAIN [128]. The benchmarks were divided into five categories using the values of $R_S + L$ and $S_L = 6$. If $R_S + L \le 6$, then a benchmark belongs to category 0; if $6 < R_S + L \le 12$, to category 1; if $12 < R_S + L \le 18$, to category 2; if $18 < R_S + L \le 24$, to category 3; finally, the relation $R_S + L > 24$ determines category 4. Table 11 (the LUT counts) and Table 12 (the maximum operating frequency) present the results of the investigations [123]. As follows from Table 11, MPY-based FSMs have the minimum number of LUTs. As follows from Table 12, MPY-based FSMs are the slowest. However, this disadvantage diminishes as the category number increases.
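The category rule can be written as a short function. This is a minimal sketch: the name `benchmark_category` is hypothetical, and expressing the thresholds 6, 12, 18, and 24 as multiples of $S_L$ is an assumption that coincides with the rule above for $S_L = 6$.

```python
def benchmark_category(r_s, n_inputs, s_l=6):
    """Return the category (0..4) of a benchmark FSM.

    r_s      -- number of state variables R_S
    n_inputs -- number of FSM inputs L
    s_l      -- number of LUT inputs S_L
    """
    total = r_s + n_inputs
    if total <= s_l:
        return 0
    if total <= 2 * s_l:
        return 1
    if total <= 3 * s_l:
        return 2
    if total <= 4 * s_l:
        return 3
    return 4


# An FSM with R_S = 2 and L = 3 fits into a single LUT level.
small_fsm_category = benchmark_category(r_s=2, n_inputs=3)
```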

New Methods of Structural Decomposition
In all the discussed methods, only maximum state codes are used, where the value of $R_S$ is determined by (4). In [129][130][131], a method of twofold state assignment is proposed. In this case, any state $s_m \in S$ has two codes. The code $K(s_m)$ determines the state as an element of the set $S$. The code $C(s_m)$ defines the state as an element of some partition class.
To use the method [129,130], it is necessary to construct a partition $\Pi_S = \{S^1, \ldots, S^K\}$ of the set of states $S$. For each class $S^k \in \Pi_S$, the following condition holds:

$$R_k + L_k \le S_L. \quad (48)$$

In (48), the symbol $R_k$ denotes the length (the number of bits) of a code $C(s_m)$ for states $s_m \in S^k$; the symbol $L_k$ defines the number of inputs $i_e \in I$ determining the transitions from states $s_m \in S^k$.
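A possible way to build such a partition is sketched below in Python. It reads condition (48) as $R_k + L_k \le S_L$ and uses a simple first-fit strategy; this is not the partitioning algorithm of [129,130]. The code length is taken as $\lceil \log_2(|S^k| + 1) \rceil$, which reserves one code for "the state is not in this class"; that reservation, the function names, and the example FSM are all assumptions of the sketch.

```python
def code_length(n_states):
    """Bits needed for the codes C(s_m) of a class with n_states states.

    Assumption: one extra code is reserved, so the length is
    ceil(log2(n_states + 1)), which equals n_states.bit_length().
    """
    return max(1, n_states.bit_length())


def satisfies_condition_48(states, inputs_of, s_l):
    """Check R_k + L_k <= S_L for one candidate class."""
    used_inputs = set().union(*(inputs_of[s] for s in states))
    return code_length(len(states)) + len(used_inputs) <= s_l


def greedy_partition(state_order, inputs_of, s_l):
    """First-fit partition of states into classes satisfying (48)."""
    classes = []
    for state in state_order:
        for cls in classes:
            if satisfies_condition_48(cls + [state], inputs_of, s_l):
                cls.append(state)
                break
        else:
            if not satisfies_condition_48([state], inputs_of, s_l):
                raise ValueError(f"state {state} does not fit into one LUT")
            classes.append([state])
    return classes


# Hypothetical FSM: inputs that determine transitions from each state.
inputs_of = {
    "s1": {"i1", "i2"},
    "s2": {"i2", "i3"},
    "s3": {"i4"},
    "s4": {"i1"},
    "s5": {"i5", "i6"},
}
classes = greedy_partition(["s1", "s2", "s3", "s4", "s5"], inputs_of, s_l=5)
```

For this example and $S_L = 5$, the sketch returns two classes, `["s1", "s2", "s4"]` and `["s3", "s5"]`. A first-fit pass does not guarantee the minimum number of classes; the result depends on the order in which the states are visited.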
Each class $S^k \in \Pi_S$ determines a table DST$_k$ with transitions from states $s_m \in S^k$. This table includes inputs from the set $I^k \subseteq I$, outputs from the set $O^k \subseteq O$, and IMFs that are equal to 1 for transitions from states $s_m \in S^k$. These IMFs form a set $D^k \subseteq D$. A DST$_k$ determines the SBFs

$$D^k = D^k(\tau^k, I^k); \quad (49)$$

$$O^k = O^k(\tau^k, I^k). \quad (50)$$

The variables $\tau_r \in \tau^k$ encode states as elements of the set $S^k \subseteq S$. This approach determines $P_T$ Mealy FSMs. The logic circuits of $P_T$ FSMs include three levels of logic blocks. Figure 25 shows the structural diagram of a $P_T$ FSM. In a $P_T$ Mealy FSM, the LUTer$k$ ($k \in \{1, \ldots, K\}$) implements SBFs (49)-(50). The LUTerTO implements the following SBFs:

The LUTer$\tau$ transforms state codes $K(s_m)$ into state codes $C(s_m)$. To do it, the following SBF is implemented:

The structural diagram (Figure 25) determines the case of one-hot encoding of outputs [130]. In [129], a method was proposed that combines the twofold state assignment with the maximum encoding of COs. This leads to the $P_TY$ Mealy FSM, as shown in Figure 26.
In the $P_TY$ FSM, the SBFs (50) and (52) are replaced by the SBFs:

Because of (48), each function from (49), (50), and (54) is implemented as a single-level circuit; moreover, each function is implemented by a circuit having exactly one LUT. If condition (56) holds, then a single LUT is enough to implement the circuit of each function determined by (52) and (54). If condition (57) holds, then the circuit of the LUTer$\tau$ is a single-level one. If the condition (46) holds, then there are up to $N$ LUTs in the circuit of LUTerO.
In the best case, the conditions (46), (48), (56), and (57) are true. This best case determines the three-level LUT-based circuits of both $P_T$ and $P_TY$ Mealy FSMs. Logic circuits of $P_TY$ FSMs consume fewer LUTs than equivalent PY FSMs, as shown in [129]. The experimental results [130] show that the logic circuits of $P_T$ FSMs consume fewer LUTs than those of the equivalent P Mealy FSMs.
Using the twofold state assignment improves the characteristics of EMB-based FSMs, as shown in [122]. In [122], this method is used to improve the LUT count in PH Mealy FSMs (Figure 22b). The method is based on finding a partition $\Pi_F = \{F^1, \ldots, F^K\}$ of the set of terms $F$. For each class of this partition, the following condition holds:

The value of $R_k$ can be found as $\lceil \log_2 H_k \rceil$, where $H_k$ is the number of elements in the set $F^k$.
The binary codes $K(F^k)$ encode the classes $F^k \in \Pi_F$. These codes have $R_c$ bits, where

$$R_c = \lceil \log_2 K \rceil.$$

The code of a term $F_h \in F$ is represented as

$$K(F_h) = K(F^k) * C(F_h). \quad (60)$$

In (60), $C(F_h)$ is the code of a term as an element of the set $F^k \subseteq F$, and $*$ is the sign of concatenation. To encode terms, the variables $z_r \in Z$ are used. To use free outputs of the EMB, the set $D$ is represented as $D_E \cup D_L$ and the set $O$ is represented as $O_E \cup O_L$. The classes of $\Pi_F$ are encoded using variables $v_r \in V$. Now, the PH FSM is represented as shown in Figure 27.
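The concatenated code (60) can be illustrated with a short Python sketch. It assumes class codes of $\lceil \log_2 K \rceil$ bits and in-class codes of $\lceil \log_2 H_k \rceil$ bits, assigned in enumeration order; the names `bits_needed`, `to_bits`, and `encode_terms`, the assignment order, and the example partition are hypothetical and are not taken from [122].

```python
def bits_needed(n_items):
    """Return ceil(log2(n_items)) for n_items >= 1."""
    return (n_items - 1).bit_length()


def to_bits(value, width):
    """Binary representation of value, zero-padded to width bits."""
    return format(value, "b").zfill(width) if width > 0 else ""


def encode_terms(partition):
    """Map each term to K(F_h) = K(F^k) * C(F_h).

    partition is a list of classes; each class is a list of term names.
    """
    r_c = bits_needed(len(partition))        # bits of a class code
    codes = {}
    for k, terms in enumerate(partition):
        r_k = bits_needed(len(terms))        # bits of an in-class code
        class_code = to_bits(k, r_c)
        for j, term in enumerate(terms):
            # Concatenation of the class code and the in-class code.
            codes[term] = class_code + to_bits(j, r_k)
    return codes


# Hypothetical partition of nine terms into K = 3 classes.
partition = [["F1", "F2", "F3"], ["F4", "F5"], ["F6", "F7", "F8", "F9"]]
codes = encode_terms(partition)
```

In this sketch, the in-class part of a code may have a different length in different classes, because $R_k$ depends on the size of the class.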
In [122], the results of experiments are shown. The following models were compared: P FSMs (Figure 23), MP FSMs (Figure 19), PH FSMs (Figure 22b), and the proposed approach (Figure 27). Table 13 (LUT counts), Table 14 (the maximum operating frequency), and Table 15 (the consumed power) show the results of experiments for some benchmarks [124].
The experiments have been conducted for the benchmarks [124] using the evaluation board with the chip XC7VX690TFFG1761-2 [126] and the CAD tool Vivado [125]. As shown in [122], a single EMB of Virtex 7 is enough to implement the logic circuit of any of 33 benchmarks [124]. A network of LUTs and EMBs is used to implement circuits for the other benchmarks.
It is possible to improve the characteristics of LUT-based FSM circuits using the transformation of objects [21]. For example, the structural diagram of a $P_oY$ Mealy FSM is shown in Figure 28 [132]. In the $P_oY$ FSM, the LUTerZV implements SBFs (35) and

The LUTerT generates the functions from the SBF (37), and the LUTerO implements SBF (17). This approach is used to: (1) improve the operating frequency of multi-level MPY FSMs and (2) reduce the LUT count as compared with P FSMs if the condition (47) is violated.
If condition (47) is violated for functions $f_i \in V \cup Z$, then the LUTerZV is represented by a multi-level circuit. To improve the characteristics of $P_oY$ FSMs, the following approach is proposed in [132].
The set $S$ is divided into classes $S^k \in \Pi_S$, such that the condition (48) holds for each class of $\Pi_S$. Next, states $s_m \in S^k$ are encoded by codes $C(s_m)$ having the minimum possible number of bits. The following SBFs should be implemented [132]: (54), (55), (17), (37), and

This approach leads to $P_{oT}Y$ FSMs. The circuit of a $P_{oT}Y$ FSM includes three levels of LUTs (Figure 29).

The experimental results in [132] were obtained using the CAD tool Vivado [125] and the evaluation board with a Virtex 7 FPGA chip [126]. The following characteristics have been compared: the LUT counts (Table 16), the maximum operating frequency (Table 17), and the area-time products (Table 18).
As follows from Table 16, the $P_oY$ FSMs require fewer LUTs than the other investigated methods. The $P_{oT}Y$ FSMs consume more LUTs (by 8.84%) than the $P_oY$ FSMs. However, the other FSMs are based on functional decomposition, and their circuits require more LUTs than the $P_{oT}Y$ FSMs. The gain increases along with the growth of the category number.
As follows from Table 17, the $P_{oT}Y$-based FSMs have the highest operating frequency among the investigated methods. The following can be found from Table 18: the $P_{oT}Y$-based FSMs produce circuits with better area-time products than the other investigated methods. Starting from average FSMs, $P_{oT}Y$-based circuits have better area-time products. Hence, using the methods of structural decomposition allows for improving the characteristics of FPGA-based FSMs. Three-level circuits improve the LUT count and two-level circuits improve the performance. These methods can be applied together with other optimization methods used in FSM design [21].

Conclusions
Since the 1950s, digital systems have increasingly influenced different areas of our lives. Control units and other sequential blocks are very important parts of digital systems. Very often, the behaviour of sequential blocks is represented using the model of a finite state machine. During these 70 years, several generations of logic elements used to implement FSM circuits have changed. However, one thing has remained unchanged: regardless of the generation of logic elements, there is always the problem of reducing their number in the FSM circuit. This problem arises if a single-level FSM circuit with the minimum possible number of elements cannot be implemented. One way of reducing the required hardware is to apply various methods of structural decomposition.
These approaches have roots in various methods that are used for optimizing the size of the control memory of microprogram control units. The following basic methods of structural decomposition are known: the replacement of FSM inputs, encoding of the collections of outputs, encoding of product terms corresponding to interstate transitions, and transformation of objects. Using these methods requires taking the peculiarities of logic elements into account. Recently, two new methods of structural decomposition have appeared. These new methods are: (1) the twofold state assignment and (2) the mixed encoding of FSM outputs. These methods are focused on FPGA-based FSMs.
This orientation is related to the fact that FPGA devices are very often used for implementing digital systems. These chips include a lot of LUT elements and embedded memory blocks, which allows implementing very complex digital systems. Embedded memory blocks are effective tools for implementing FSM circuits. However, it is quite possible that all available EMBs are used for implementing various blocks of a digital system. In this case, an FSM circuit is implemented as a network of LUTs. The main peculiarity of LUTs is a very small number of inputs (for the vast majority of FPGAs, the value of $S_L$ is less than 7). This feature makes it necessary to use the methods of functional decomposition in the FPGA-based design. As a rule, this leads to multi-level FSM circuits that are characterized by very complex systems of "spaghetti-type" interconnections.
The chip area occupied by a LUT-based FSM circuit can be optimized by applying various methods of structural decomposition. Numerous studies show that structural decomposition produces FSM circuits having better characteristics than their counterparts based on functional decomposition. The FSM circuits that are based on structural decomposition are characterized by a regular system of interconnections and a predictable number of logic levels. The same is true for the heterogeneous implementation of FSM circuits, when LUTs and EMBs are used simultaneously.
In this review, we have shown the roots of structural decomposition methods and their development starting from the 1950s. Our research shows that these methods can be used for optimizing FSM circuits implemented with any logic elements (PROMs, PLAs, PALs, CPLDs, FPGAs, and custom matrices of ASIC). Now, the majority of digital systems are implemented using FPGAs and ASICs. It is difficult to imagine what elements will replace them in the future. However, one thing remains clear: these elements will also have limits on the number of inputs, outputs, and terms. The results of the research presented in this article allow us to conclude that the methods of structural decomposition will be used with the future generations of logic elements implementing FSM circuits.
Funding: This research received no external funding.

Data Availability Statement: The data presented in this study are available in the article.

Conflicts of Interest: The authors declare no conflict of interest.