Improving Hardware in LUT-Based Mealy FSMs

Abstract: The main contribution of this paper is a novel design method that reduces the number of look-up table (LUT) elements in the circuits of three-block Mealy finite-state machines (FSMs). The proposed method uses codes of collections of outputs (COs) to represent both FSM state variables and outputs. The interstate transitions are represented by the output collections generated during two adjacent cycles of FSM operation. To avoid doubling the number of variables encoding the COs, two registers are used. The first register keeps the code of the CO produced in the current cycle of operation; the second register keeps the code of the CO produced in the previous cycle. A synthesis example applying the proposed method is given, and the results of the research are presented. The research was conducted using the CAD tool Vivado by Xilinx. The experiments show that the proposed approach reduces the hardware compared with such known methods as auto and one-hot of Vivado, and JEDI. Additionally, the proposed approach gives better results than a method based on the simultaneous replacement of inputs and encoding of COs. Compared to circuits of the three-block FSMs, the LUT counts are reduced by an average of 7.21% without a significant reduction in performance. Our approach loses in terms of power consumption (on average 9.62%) and power–time products (on average 10.44%). The gain in LUT counts and area–time products grows as the numbers of FSM states and inputs increase.


Introduction
Nowadays, numerous digital systems are widely used in daily life [1,2]. Among other digital equipment, contemporary systems include many different sequential devices [3]. The law of operation of a sequential device can be described by the model of the Mealy finite state machine (FSM) [4]. This model is used, for example, to specify the behavior of (1) control devices [5,6]; (2) serial communication and display protocols [7]; (3) various software tools of embedded systems [8]; (4) control-dominated systems [9]; (5) different systems in robotics [10] and so on. This wide range of applications led to the choice of the Mealy FSM model in our recent research.
The process of FSM-based design is connected with raising some optimization problems [5,7]. As a rule, the following characteristics of FSM circuits should be improved: the occupied chip area, the time of cycle (the maximum operating frequency) and the consumed power. The approaches used for reducing these values depend strongly on the peculiarities of logic elements used for implementing the FSM circuits. Changing the type of logic elements leads to the necessity for changing the optimization approaches. This is the reason for the continuous interest in developing new design methods aiming at the optimization of FSM circuits. These characteristics are interrelated. For example, the area reduction leads to reducing the power consumption [11,12]. Due to the great importance of the area reduction, we devote our article to this problem.
The area of LUT-based FSM circuits may be reduced using the methods of structural decomposition (SD) [13]. In this case, an FSM circuit is represented as a composition of two to four large logic blocks. Each block has its own systems of input variables and output functions, which distinguish it from the other blocks of the circuit [14]. In this article, we propose an alternative to the method discussed in [15]. The original method [15] jointly applies the replacement of inputs and the encoding of collections of outputs (COs) [5]. Applying these methods requires generating two additional systems of functions, and implementing the circuits for these additional systems consumes some chip resources. In this article, we propose to use the same variables both for encoding the COs and for replacing the FSM inputs. This eliminates the block generating the additional variables that replace the FSM inputs. As a result, the number of LUTs is reduced compared to the equivalent circuit based on the approach of [15].
The main contribution of this paper is a novel design method which reduces the LUT count (the number of LUTs) in the circuits of three-level Mealy FSMs based on the joint use of two methods of structural decomposition. The proposed method uses the same additional variables as inputs of the logic blocks generating both input memory functions (IMFs) and FSM outputs. Due to this, the block replacing FSM inputs, which is inherent in the method [15], is eliminated. The main purpose of the proposed method is to reduce the LUT count in the FSM circuit without significantly impairing the FSM performance.
The further text of the paper is organized in the following order. The second section contains basic information related to LUT-based Mealy FSMs. The third section discusses the necessary elements of the state of the art. In this section, we provide a critical analysis of existing synthesis methods and show the need for their improvement. The fourth section highlights the main idea of the proposed method. An example of synthesis is presented and analyzed in the fifth section. The sixth section includes the results of experiments and their analysis. A brief conclusion ends the paper.

Basics of LUT-Based Mealy FSM Design
In this section, we show the basics of designing FSM circuits using internal resources of FPGAs. Here, we introduce the main notation used in the rest of the text and show the features of the logic elements used. Finally, we introduce the simplest structural diagram of a Mealy FSM circuit implemented with LUTs and programmable flip-flops.
To make the material easier to follow, we introduce Table 1. This table shows the main sets of variables and the notation adopted in our article.
To start the designing process, it is necessary to set the law of the FSM behavior. For this, various mathematical apparatuses can be used [1]. Two methods are most commonly used for this purpose: (1) a state transition graph (STG) and (2) a state transition table (STT) [16]. We use both forms in this article. These forms are used to derive systems of Boolean functions (SBFs) defining dependencies between FSM outputs and input memory functions on the one hand, and FSM inputs and state variables on the other hand. These SBFs are used to design FSM logic circuits [16].
The FSM inputs form a set $X = \{x_1, \ldots, x_L\}$, and the FSM outputs form a set $Y = \{y_1, \ldots, y_N\}$. The inputs determine transitions between FSM states combined into a set $A = \{a_1, \ldots, a_M\}$. To synthesize an FSM circuit, the states $a_m \in A$ are represented by binary codes $K(a_m)$ having $R$ bits. The states are encoded using state variables from the set $T = \{T_1, \ldots, T_R\}$. The $r$-th bit of a state code is represented by an internal state variable $T_r \in T$. The minimum value of $R$ is calculated using the following formula:

$$R = \lceil \log_2 M \rceil. \tag{1}$$

Formula (1) determines so-called maximum state codes [17]. A special register RG is entered into the FSM circuit as a memory of the state codes [5]. In the case of FPGA-based FSMs, the RG is implemented using D flip-flops [17,18]. The content of RG is determined by the input memory functions combined into a set $\Phi = \{D_1, \ldots, D_R\}$. The IMFs are inputs of RG showing a direction of a particular interstate transition.
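As a minimal sketch of formula (1), the following Python fragment computes the minimum number of state variables R for M states and builds a trivial state assignment in which the states receive consecutive binary codes starting from zero. The function names are illustrative and are not taken from the paper.

```python
def min_code_bits(m: int) -> int:
    """Minimum number of code bits R = ceil(log2(M)), with at least one bit."""
    # (m - 1).bit_length() equals ceil(log2(m)) for m >= 2 and avoids floats.
    return max(1, (m - 1).bit_length())


def trivial_state_codes(m: int) -> dict:
    """Assign K(a_1) = 0...00, K(a_2) = 0...01, and so on."""
    r = min_code_bits(m)
    return {f"a{i + 1}": format(i, f"0{r}b") for i in range(m)}


codes = trivial_state_codes(7)
print(min_code_bits(7), codes["a3"], codes["a5"])
```

For M = 7 this gives R = 3, K(a_3) = 010 and K(a_5) = 100.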

Table 1. The main sets of variables and the notation.
$K(a_m)$: the binary code of state $a_m \in A$.
$K(Y_q)$: the binary code of the collection of outputs $Y_q \subseteq Y$ with $R_Q = \lceil \log_2 Q \rceil$ bits.
$NA(\varphi_k)$: the number of literals in a function $\varphi_k \in \Phi \cup Y$.

The SBFs that make it possible to synthesize an FSM circuit can be formed using a direct structure table (DST) [5]. A DST is constructed on the basis of either the initial STT or STG. An STT includes the following columns [16]: a current state $a_m$ (a state for the current instant); a state of transition $a_T$ (a state for the next instant); an input signal $X_h$ determining the transition from $a_m$ into $a_T$ (it is a certain conjunction of inputs); a collection of outputs (CO) $Y_h$ formed during the transition $\langle a_m, a_T \rangle$; and the number of the transition, shown in the column $h$. There are $H$ lines in an STT. A DST includes all these columns and three additional columns. These additional columns are [5] the current state code $K(a_m)$, the next state code $K(a_T)$, and the IMFs $\Phi_h \subseteq \Phi$, which allow loading the next state code into RG.
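The relation between an STT and a DST can be sketched in Python as follows. This is only an illustration under the assumption that RG uses D flip-flops, so that an IMF D_r is written into a row exactly when the r-th bit of the next state code equals 1. The dictionary keys and the single sample row are hypothetical.

```python
def stt_to_dst(stt_rows, state_codes):
    """Extend each STT row with K(a_m), K(a_T) and the IMFs Phi_h."""
    dst_rows = []
    for row in stt_rows:
        code_next = state_codes[row["a_T"]]
        # D_r is included when the r-th bit of the next state code is 1.
        imfs = [f"D{r}" for r, bit in enumerate(code_next, start=1) if bit == "1"]
        dst_rows.append({
            **row,
            "K(a_m)": state_codes[row["a_m"]],
            "K(a_T)": code_next,
            "Phi_h": imfs,
        })
    return dst_rows


# Hypothetical codes and one sample row (current state, next state, input
# conjunction, collection of outputs, row number).
state_codes = {"a3": "010", "a5": "100"}
stt = [{"h": 6, "a_m": "a3", "a_T": "a5", "X_h": "x1 x2", "Y_h": ["y2", "y4"]}]
dst = stt_to_dst(stt, state_codes)
print(dst[0]["K(a_m)"], dst[0]["K(a_T)"], dst[0]["Phi_h"])
```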
Using a DST, the following SBFs are constructed:

$$\Phi = \Phi(T, X); \tag{2}$$

$$Y = Y(T, X). \tag{3}$$

The SBF (2) corresponds to the FSM transition function [5] that specifies the dependence of the states of transition on the current states and input variables. The SBF (2) represents the rules of generating the IMFs necessary to load a next state code into the RG. The SBF (3) corresponds to the FSM output function [5] that specifies the dependence of the FSM outputs on the current states and input variables. The SBF (3) represents the rules of generating FSM outputs during each interstate transition. The SBFs (2) and (3) are the basis for the synthesis of FSM $U_1$, whose structural diagram is shown in Figure 1. In Figure 1, the BlockYΦ implements the SBFs (2) and (3). The RG includes $R$ flip-flops. The pulse Res loads the code of the initial state $a_1 \in A$ into RG. Very often, there are only zeros in the code $K(a_1)$ [18]. The synchronization pulse Clk allows loading state codes into RG.
Consider a transition between the states a 3 and a 5 of some Mealy FSM. Let it be the transition with h = 6. The transition is represented using fragments of three equivalent forms: an STG, an STT and a DST ( Figure 2).
As follows from Figure 2a, the transition $\langle a_3, a_5 \rangle$ is caused by the input signal $X_6 = x_1 x_2$. The transition is accompanied by producing a CO $Y_6 = \{y_2, y_4\}$. Row 6 of the STT (Figure 2b) is a sequence of characters corresponding to the fragment of the STG (Figure 2a). If, for example, $M = 7$, then using (1) gives $R = 3$ and two sets: $T = \{T_1, T_2, T_3\}$ and $\Phi = \{D_1, D_2, D_3\}$. For a trivial state assignment [5], there are the codes $K(a_3) = 010$ and $K(a_5) = 100$. These codes and the IMF $D_1$ are written in the sixth line of the DST (Figure 2c). This line determines a product term $F_6 = \bar{T}_1 T_2 \bar{T}_3 x_1 x_2$, where the conjunction $\bar{T}_1 T_2 \bar{T}_3$ corresponds to the code $K(a_3) = 010$. This term is a part of the sums of products (SOPs) of the Boolean functions $D_1 \in \Phi$ and $y_2, y_4 \in Y$. All other terms of the SOPs for (2) and (3) are obtained in the same way [5].
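The construction of a product term from a DST row can be sketched as follows. The helper names are hypothetical, the symbol `~` denotes negation, and the input conjunction is passed in exactly as it is written in the row.

```python
def state_conjunction(code, prefix="T"):
    """Literals of the conjunction A_m for a state code; '~' marks negation."""
    return [f"{prefix}{r}" if bit == "1" else f"~{prefix}{r}"
            for r, bit in enumerate(code, start=1)]


def product_term(state_code, input_literals):
    """Join the state conjunction with the literals of the input signal X_h."""
    return " ".join(state_conjunction(state_code) + list(input_literals))


# Row h = 6: current state code K(a_3) = 010, input signal X_6 = x1 x2.
term = product_term("010", ["x1", "x2"])
print(term)  # ~T1 T2 ~T3 x1 x2
```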
In this paper, we discuss a case of implementing SBFs (2) and (3) using configurable logic blocks (CLBs) and other internal resources of FPGA chips [19]. To form an FSM circuit, the CLBs are connected using a programmable routing matrix [17,20]. In this paper, we consider CLBs, including LUTs, multiplexers and programmable flip-flops. Similar to the notation used in the paper [21], we use a symbol I-LUT to denote a single-output LUT having I inputs. Such a LUT can implement an arbitrary Boolean function having up to I arguments. The analysis of the FPGA market shows that AMD Xilinx dominates this market [19]. Due to it, we focus our current research on the solutions of Xilinx. These solutions are very popular at present for the implementation of various projects. This fact is confirmed by the analysis of the literature [22][23][24][25][26][27][28].
If the number of arguments of a Boolean function is greater than I, then the corresponding circuit can be implemented with the help of the functional decomposition (FD) [29][30][31][32]. In this case, the resulting circuits are, as a rule, multi-level. Additionally, they are characterized by very complex systems of "spaghetti-type" interconnections [13].
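To illustrate why exceeding I arguments leads to multi-level circuits, the sketch below estimates the cost of the simplest possible case: a single conjunction of n arguments implemented as a tree of I-LUTs. This is not one of the FD algorithms from [29][30][31][32]; it only gives the minimum LUT count and the minimum number of levels for such a tree, and the function name is an assumption.

```python
def and_tree_cost(n_args: int, lut_inputs: int):
    """Return (LUT count, number of levels) for an n-input AND built from I-LUTs."""
    if lut_inputs < 2:
        raise ValueError("a LUT needs at least two inputs to reduce arguments")
    if n_args <= lut_inputs:
        return 1, 1
    # Each LUT replaces up to I signals by one, removing up to I - 1 signals.
    luts = -(-(n_args - 1) // (lut_inputs - 1))  # ceiling division
    # A tree with d levels can cover at most I**d arguments.
    levels, reach = 0, 1
    while reach < n_args:
        reach *= lut_inputs
        levels += 1
    return luts, levels


print(and_tree_cost(5, 6), and_tree_cost(7, 6), and_tree_cost(37, 6))
```

A conjunction of five arguments fits into a single 6-LUT, whereas seven arguments already require two LUTs connected in two levels.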
In LUT-based FSMs, the RG is hidden and distributed among CLBs generating IMFs. Due to it, the logic circuit of LUT-based FSM U 1 consists of two logic blocks ( Figure 3). In Mealy FSM U 1 , the block LSV consists of CLBs generating SBF (2). The state code is kept in the hidden register RG. Due to it, the pulses Res and Clk enter the block LSV. The outputs y n ∈ Y are generated by the block LY. This block does not include flip-flops; it implements SBF (3).

Related Work
This section provides a brief analysis of basic methods used for reducing the number of LUTs in FSM circuits. We show that this problem can be solved using either a certain state assignment or various methods of functional and structural decomposition. We show the disadvantages inherent in the methods from these three groups. The method proposed in this paper belongs to the group of structural decomposition methods.
Under certain conditions, there is only one level of LUTs in the circuit of U 1 . To implement a single-level circuit, each function φ k ∈ Φ ∪ Y should depend on no more than I arguments. However, there are up to six address inputs in the present-day LUTs [19,33,34]. To balance the area-spatial-power characteristics of a LUT, it is necessary that the number of inputs does not exceed six [35]. Nevertheless, the total number of inputs and state variables of an FSM can significantly exceed the value of I. This leads to an imbalance between a very large number of FSM inputs, outputs and states, on the one hand, and a very small number of LUT inputs, on the other hand. To reduce the negative impact of this imbalance, it is necessary to improve the design methods of FPGA-based FSMs.
The required chip area can be reduced due to the optimizing of the system of interconnections for a particular circuit. Improving interconnections can reduce the power consumption because more than 70% of the power consumption is due to the interconnections [36]. Additionally, the interconnections are responsible for the value of maximum operating frequency of a resulting FSM circuit. As it is shown in [36], the complexity of the interconnection system is beginning to have an increasing negative impact on the propagation time of signals in the FSM circuits. As follows from [15], the regularization of interconnections results in reducing both the time and power consumption. To regularize the interconnection system, it is necessary to use the structural decomposition methods [13,37].
If the condition

$$NA(\varphi_k) \le I \tag{4}$$

holds for each function $\varphi_k \in \Phi \cup Y$, then each of the $N$ outputs and $R$ IMFs is implemented by a single LUT, so there are $N + R$ I-LUTs in a single-level circuit of the corresponding FSM $U_1$. However, if the condition (4) does not hold for some functions $\varphi_k \in \Phi \cup Y$, then it becomes impossible to represent such an FSM with a single level of LUTs. To improve the characteristics of multi-level circuits, various methods can be applied. A significant number of optimization methods aimed at FPGA-based FSMs can be found in the literature [13,17,18,21,32,[38][39][40][41]. As a rule, these methods can improve the value of one of the characteristics of the FSM circuit [39,40]. Additionally, there are methods which simultaneously reduce the values of two characteristics (area and power consumption, or area and performance). In our current paper, a method is proposed which aims at reducing the LUT count of three-block circuits of Mealy FSMs [15].
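Checking condition (4) can be sketched as follows: every function from Φ ∪ Y is described by the set of its arguments, and a single-level circuit with one LUT per function exists only if no set is larger than I. The function name and the sample argument sets are hypothetical.

```python
def single_level_lut_count(supports, lut_inputs):
    """Return the LUT count of a single-level circuit, or None if (4) is violated.

    `supports` maps each function name to the set of its arguments.
    """
    if all(len(args) <= lut_inputs for args in supports.values()):
        return len(supports)  # one LUT per function
    return None


# Hypothetical FSM with two IMFs and two outputs, mapped to 4-LUTs.
supports = {
    "D1": {"T1", "T2", "x1"},
    "D2": {"T1", "T2", "x1", "x2"},
    "y1": {"T1", "x2"},
    "y2": {"T1", "T2", "x1", "x2", "x3"},  # five arguments
}
print(single_level_lut_count(supports, 4), single_level_lut_count(supports, 6))
```

With 4-LUTs the function y2 violates the condition, so a multi-level circuit is needed; with 6-LUTs a single level of four LUTs is enough.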
The values of $NA(\varphi_k)$ can be reduced with the help of a proper state assignment [41][42][43]. The number of FSM state memory elements is in the range from $R = \lceil \log_2 M \rceil$ to $R = M$. The upper limit of this range ($R = M$) corresponds to a one-hot state assignment. Both of these extreme approaches can be found in many CAD tools, such as SIS [44], ABC [32,45] or Sinthagate [46]. The manufacturers of FPGA chips also have their own tools for implementing the technology mapping of LUT-based circuits. Examples of such systems are Vivado [47], Vitis [48], and Quartus [49]. The first two CAD systems were developed by AMD Xilinx, and the third one is a product of Intel (Altera).
It is impossible to specify the approach that is optimal for any FSM. For example, in [50], there is given the comparison of the synthesis results for FSM circuits based on state codes with R = log 2 M and one-hot state codes. Note that both of these approaches are widely used in most modern CAD tools. As follows from the comparison, the one-hot codes are the best choice for FSMs with more than 16 states. However, in addition to the value of R, the number of input variables also has a very strong influence on the characteristics of LUT-based FSM circuits. For example, the experiments [51] definitely show the following: if the number of FSM inputs exceeds 10, then it is better to use the codes with a minimum number of bits.
As follows from this analysis, it is necessary to check which method leads to the best results for a specific combination of characteristics of a particular FSM. In this paper, we compared the results produced by our new approach with the characteristics of FSM circuits produced using the algorithm JEDI [44], and the methods auto (R = log 2 M ) and one-hot (R = M) of Vivado [47] by Xilinx [19]. Our choice of JEDI is due to the fact that it is considered one of the best deterministic methods of the state encoding [44].
If condition (4) is violated, then various methods of functional decomposition should be applied to implement an FSM circuit [29,39]. All these methods are based on splitting the original SOP into sub-SOPs for which the number of arguments does not exceed the number of LUT inputs. Each sub-SOP corresponds to a partial function which differs from the initial function φ k ∈ Φ ∪ Y [39]. This splitting should be executed in a way that increases the number of logic levels of the final FSM circuit as little as possible [29]. Practically, the methods of FD are included in each academic and industrial CAD tool dealing with the LUT-based design. The main disadvantage of FD-based methods: they produce the FSM circuits with spaghetti-type interconnections [13]. It is known that such circuits lose in all three main characteristics to their counterparts with a regular interconnection system [52].
The methods of structural decomposition [13] are an alternative to the methods of FD. The main idea of these methods is the elimination of the direct connection between FSM inputs and state variables, on the one hand, and FSM outputs and IMFs , on the other hand. In the case of SD, an FSM circuit is represented as a composition of unique logic blocks. This leads to an increase in the number of implemented functions, but these partial functions are much simpler than functions (2) and (3). The analysis of these methods can be found, for example, in [13].
The first known methods of SD are the replacement of inputs (RI) and the encoding of the collections of outputs (ECO). They were proposed in the mid-twentieth century by M. Wilkes for the optimization of microprogram control units [53]. In [15], we proposed the joint use of these methods for the optimization of LUT-based Mealy FSMs' circuits. Let us briefly describe these two methods.
In the case of RI, the set $X = \{x_1, \ldots, x_L\}$ is replaced by a set $B$ of additional variables $b_j \in B$. The replaced inputs are represented by an SBF

$$B = B(T, X). \tag{5}$$

Each function of (5) represents a multiplexor. Its control inputs are connected with the state variables, and the data inputs are connected with the replaced inputs. In the case of CLB-based solutions, these multiplexors are implemented using LUTs and dedicated multiplexors [54].
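The multiplexor behavior of the replacement of inputs can be sketched as follows. For every state, a table says which FSM input each additional variable b_j stands for in this state. The table contents, the number of additional variables (two here) and the function name are assumptions made only for this illustration.

```python
def replace_inputs(state, inputs, mux_table):
    """Return the values of the additional variables (b_1, b_2, ...) for a state.

    mux_table[state][j] names the input selected for b_(j+1) in this state,
    or None if the variable is not used (it is then fixed to 0).
    """
    return tuple(inputs[name] if name is not None else 0
                 for name in mux_table[state])


# Hypothetical FSM: five inputs, but at most two are tested in any state.
mux_table = {
    "a1": ("x1", "x2"),
    "a2": ("x3", None),
    "a3": ("x4", "x5"),
}
inputs = {"x1": 1, "x2": 0, "x3": 1, "x4": 0, "x5": 1}
print(replace_inputs("a1", inputs, mux_table),
      replace_inputs("a2", inputs, mux_table),
      replace_inputs("a3", inputs, mux_table))
```

The state acts as the control input of the multiplexors, and the FSM inputs are their data inputs, as in (5).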
There are $Q$ different COs. Each collection $Y_q \subseteq Y$ includes the FSM outputs generated during a particular interstate transition. As a rule, the condition $Q < H$ holds, where $H$ is the number of interstate transitions. The COs are encoded by binary codes $K(Y_q)$. The bits of $K(Y_q)$ are represented by elements of an additional set $Z = \{z_1, \ldots, z_{R_Q}\}$. The cardinality of the set $Z$ is determined as

$$R_Q = \lceil \log_2 Q \rceil. \tag{6}$$

To encode COs, two additional SBFs should be constructed:

$$Z = Z(T, X); \tag{7}$$

$$Y = Y(Z). \tag{8}$$

The SBFs (7) and (8) are implemented using LUTs. Obviously, the system (8) is represented by $R_Q$ decoders.
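The encoding of collections of outputs can be sketched as follows. The fragment finds the Q different COs among the H transitions, computes the code length by (6), and assigns consecutive codes. The consecutive assignment is a simplification made for this illustration; it does not try to minimize the number of literals in (8). The sample data are hypothetical.

```python
def encode_collections(transition_outputs):
    """Return (R_Q, codes) where codes maps each distinct CO to a binary code."""
    distinct = []
    for outputs in transition_outputs:
        co = frozenset(outputs)
        if co not in distinct:          # keep the order of first appearance
            distinct.append(co)
    q = len(distinct)
    r_q = max(1, (q - 1).bit_length())  # ceil(log2(Q)), at least one bit
    codes = {co: format(i, f"0{r_q}b") for i, co in enumerate(distinct)}
    return r_q, codes


# Outputs produced on H = 6 hypothetical transitions.
rows = [["y1", "y2"], ["y3"], ["y1", "y2"], [], ["y3"], ["y2", "y4"]]
r_q, codes = encode_collections(rows)
print(len(rows), len(codes), r_q)
```

Here H = 6 transitions produce only Q = 4 different COs, so two variables z_1, z_2 are enough to encode them.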
Combining the methods of RI and ECO leads to the replacement of both SBFs (2) and (7). Now, the following SBFs should be constructed:

$$\Phi = \Phi(T, B); \tag{9}$$

$$Z = Z(T, B). \tag{10}$$

The SBFs (5), (8)-(10) determine a structural diagram of FSM $U_2$ (Figure 4). In FSM $U_2$, a BlockB implements SBF (5). The variables $b_j \in B$ enter a BlockZΦ implementing SBFs (9) and (10). The IMFs $D_r \in \Phi$ enter the state code register RG. The variables $z_r \in Z$ are transformed into the FSM outputs $y_n \in Y$ by a BlockY.
In LUT-based FSMs, these blocks are implemented using the internal resources of CLBs, inter-slice interconnections, programmable input-outputs and synchronization tree buffers [54]. In [15], we compared the characteristics of U 1 -and U 2 -based FSMs. The research results obtained in [15] show that the joint use of RI and ECO allows to significantly reduce the LUT counts in FSM circuits.
To optimize an FSM circuit, we propose using the variables z r ∈ Z for generating both FSM outputs and IMFs. To make it possible, we propose to use codes of COs generated in two neighboring instances of the FSM discrete time.

Main Idea of the Proposed Method
The analysis of FSM $U_2$ (Figure 4) reveals its shortcomings. The main drawback of $U_2$ is the need to form two systems of additional variables. One of them serves to replace the inputs $x_l \in X$, and the second system is used to encode the collections of outputs. These systems are represented by SBFs (5) and (10), respectively. To implement these systems, it is necessary to use some internal resources of the FPGA chip. The amount of resources used can be reduced by using the same additional variables to implement both input memory functions and FSM outputs. Such an approach is proposed in our article. Our analysis of the literature has not revealed such a method so far, which supports the novelty of the proposed method.
Our method is based on using the codes of collections of FSM outputs for generating IMFs D r ∈ Φ. Consider Figure 5 where this idea is illustrated.
Figure 5. Illustration of the main idea of the proposed method.
A subgraph of some STG is shown in Figure 5. The generator of pulses Clk sets the course of discrete time t(t = 0, 1, 2, . . .). Three instances of time are shown in Figure 5. In the instant of time t, the FSM is in the state a(t) = a 4 . The transition from a 3 into a 4 is accompanied by producing a CO Y 5 . So, the following relation takes place: Y q (t) = Y 5 . From STG ( Figure 5), we can find that a(t + 1) = a 5 and Y q (t + 1) = Y 3 . So, the transition < a 4 , a 5 > corresponds to a pair of COs < Y 5 , Y 3 >. This transition is caused by an input x 2 ∈ X. So, the pair < a 4 , x 2 > also corresponds to a pair of COs < Y 5 , Y 3 >. This means that IMFs can be represented using only codes of COs.
In FSM $U_2$, the SOPs of functions $D_r \in \Phi$ include product terms $F_h$ determined as

$$F_h = A_m B_h \quad (h \in \{1, \ldots, H\}). \tag{11}$$

In (11), the symbol $A_m$ stands for a conjunction of the state variables corresponding to the code of a current state $a_m$ written in the $h$-th row of the DST; the symbol $B_h$ stands for a conjunction of additional variables replacing the input signal $X_h$ written in the $h$-th row of the DST. If a pair $\langle a_m, X_h \rangle$ determines the $h$-th transition of an FSM, then we propose to replace it by a pair of COs (as follows from Figure 5). So, we propose to construct the SOPs of functions $D_r \in \Phi$ using product terms formed by conjunctions corresponding to the codes of the COs replacing a pair $\langle a_m, X_h \rangle$.
To do this, we should use different sets of variables to encode the COs $Y(t)$ and $Y(t+1)$. For example, we use the elements of the set $Z = \{z_1, \ldots, z_{R_Q}\}$ to encode a CO $Y(t+1)$ and the elements of a set $V = \{v_1, \ldots, v_{R_Q}\}$ to encode a CO $Y(t)$. Obviously, this doubles the number of variables encoding the collections of outputs compared to (6). To avoid doubling the resources used for the encoding, we propose using two interconnected registers for storing the codes of COs. This approach results in FSM $U_3$ (Figure 6).

In FSM $U_3$, a block LZ implements SBF (7). There is a distributed register RZ inside the block LZ. The register keeps the codes of COs $Y(t+1)$. This explains the presence of the pulses Clk and Res entering LZ. The variables $z_r \in Z$ are inputs of both a block LY and a register RV. The block LY implements SBF (8). The register RV de facto transforms the variables $z_r \in Z$ into the variables $v_r \in V$ representing the codes of COs $Y(t)$. As follows from Figure 6, the same pulses Clk and Res are used by both registers. A block LT generates the state variables $T_r \in T$ represented by an SBF

$$T = T(Z, V). \tag{12}$$

The SOPs of the SBF (12) include the following product terms:

$$F_h = Z_h V_h. \tag{13}$$

In (13), the symbols $Z_h$ and $V_h$ stand for conjunctions of the variables $z_r \in Z$ and $v_r \in V$, respectively. As we show a bit later, the following condition can take place:

In this paper, we propose a synthesis method for $U_3$-based Mealy FSMs. We assume that the FSM to be synthesized is represented by its STG. The proposed method includes the following steps:

1. Constructing the STT corresponding to an initial STG.
2. Executing the state assignment using maximum binary codes $K(a_m)$.
3. Encoding of collections of outputs $Y_q \subseteq Y$ by binary codes $K(Y_q)$.
4. Deriving the SBF $Y = Y(Z)$ representing the block LY.
5. Creating the modified DST of FSM $U_1$.
6. Creating a table of pairs.
7. Creating a table representing the block LZ and SBF $Z = Z(T, X)$.
8. Creating a table representing the block LT and SBF $T = T(Z, V)$.
9. Implementing the LUT-based circuit of Mealy FSM $U_3$ using internal resources of a particular FPGA chip.
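The role of the two registers described before the list of steps can be sketched as follows. The class below models only the registers RZ and RV: on every synchronization pulse the code kept in RZ is copied into RV, and RZ receives the code of the new CO. The class name and the sequence of codes are illustrative; the blocks LZ, LY and LT are not modeled.

```python
class CoRegisters:
    """Behavioral model of the registers RZ (codes of Y(t+1)) and RV (codes of Y(t))."""

    def __init__(self, width: int):
        self.width = width
        self.reset()

    def reset(self):
        """Pulse Res: both registers are cleared."""
        self.z = "0" * self.width
        self.v = "0" * self.width

    def clock(self, next_code: str):
        """Pulse Clk: RV takes the old content of RZ, RZ takes the new CO code."""
        if len(next_code) != self.width:
            raise ValueError("code width mismatch")
        self.v, self.z = self.z, next_code


regs = CoRegisters(width=3)
history = []
for code in ["100", "011", "101"]:   # hypothetical sequence of CO codes
    regs.clock(code)
    history.append((regs.v, regs.z))
print(history)
```

After every pulse, the pair (v, z) holds the codes of the COs from two adjacent cycles, which is the information used by the block LT.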
Let us analyze the complexity of the proposed method. Because each FSM transition should be transformed into a pair of COs, the time of synthesis depends on the number of FSM transitions. The synthesis algorithm does not include iterations. The pairs of COs are formed strictly sequentially: at each moment of time, the next in line transition is transformed into a pair of COs. In this regard, the algorithm has a linear character.

Example of Synthesis
We use the symbol U i (S a ) to show that the model U i (i ∈ {1, 2, 3}) of Mealy FSM is used to implement the circuit of an FSM S a . Let us consider an example of the synthesis of Mealy FSM U 3 (S 1 ) shown in Figure 7. We use 4-LUTs to implement the circuit.
Using an STG, we can find the sets of states, inputs and outputs, as well as the number of interstate transitions. Using Figure 7, we can find the sets $A = \{a_1, \ldots, a_5\}$, $X = \{x_1, x_2, x_3\}$ and $Y = \{y_1, \ldots, y_6\}$. This gives the following values: $M = 5$, $L = 3$, and $N = 6$. The analysis of Figure 7 shows that there are $H = 9$ transitions between the states of FSM $S_1$. Naturally, the state $a_1 \in A$ is the initial state.
Step 1. The transformation of an STG into an equivalent STT is executed in the trivial way [16]. As follows from Figure 2, each arc of the STG is transformed in a row of the corresponding STT. In our case, Table 2 is an STT of Mealy FSM S 1 corresponding to the STG shown in Figure 7.
In the column Y h of Table 2, we show the collections of outputs Y q ⊆ Y. As a rule, such information is not given in the classical STT [5].
Step 2. For FSM S 1 , there is M = 5. Using (1) gives R = 3. This determines the set of state variables T = {T 1 , T 2 , T 3 }. It is possible to encode the states in a way optimizing the system (7). For example, this can be done using the algorithm JEDI [44]. In our simple example, we use the trivial way of state assignment [5] with the following state codes: K(a 1 ) = 000, K(a 2 ) = 001,. . . , K(a 5 ) = 100.
Step 3. Using Table 2, we can find the following collections of outputs: As shown in [13], it is necessary to encode the collections in a way that minimizes the number of literals in functions from (8). If the condition holds, then such an approach could minimize the LUT count for the block LY [13,15,37].
To encode the COs, we use the approach proposed in [55]. The outcome of encoding is shown in Figure 8.
Step 4. Using the distribution of FSM outputs by COs and codes (Figure 8), we obtain the SBF (15). The analysis of (15) shows that there are 11 literals in this system. So, there are 11 interconnections between the blocks LZ and LY. As shown in [13], in the common case, there are $N R_Q = 21$ interconnections between these blocks. Therefore, using the approach [55] allows reducing the number of interconnections by 1.91 times.
Step 5. The columns of a DST are shown in Figure 2c. We have modified the traditional DST. The column Y h is replaced by a column Z h (Table 3).
Step 6. The first five steps of this example are performed using known techniques [13,15]. Starting from the sixth step, the features of our method appear. Since we propose to represent the terms of IMFs in the form of conjunctions corresponding to the codes of COs at adjacent operation cycles, it is necessary to find these pairs of COs. For these purposes, a table of pairs should be built.
A table of pairs $P_g = \langle Y_i, Y_j \rangle$ shows a correspondence between these pairs and the pairs $\langle a_m, X_h \rangle$. There are six columns in this table: $a_m$ (a current state); $a_T$ (a transition state); $Y_m$ (a CO produced during the transition into the state $a_m$); $Y_T$ (a CO produced during the interstate transition $\langle a_m, a_T \rangle$); $P_g$ (a pair $\langle Y_m, Y_T \rangle$); and $g$ (the number of a table row, $g \in \{1, \ldots, G\}$). The following condition holds:

$$G \ge H. \tag{16}$$

For example, in the discussed case, there is $G = 12$ (Table 4). Let us explain why the relation (16) takes place. For example, there is a single transition $\langle a_3, a_4 \rangle$ in Table 2. This transition is accompanied by the CO $Y_5$. At the same time, there are two rows in Table 4 representing this transition. This is explained by the fact that two different COs are produced during the transitions into $a_3 \in A$. As follows from either the STG (Figure 7) or the STT (Table 2), the transition $\langle a_1, a_3 \rangle$ is accompanied by generating the CO $Y_2$ (row 2 of Table 2), and the transition $\langle a_2, a_3 \rangle$ is accompanied by generating the CO $Y_3$ (row 3 of Table 2). Due to this, the transition $\langle a_3, a_4 \rangle$ is represented by the pairs $\langle Y_2, Y_5 \rangle$ and $\langle Y_3, Y_5 \rangle$. A similar analysis allows filling all rows of Table 4. Each transition from the states $a_1, a_2, a_5 \in A$ is represented by a single pair. However, the two transitions from $a_4 \in A$ are represented by four pairs $P_8 - P_{11}$ (Table 4).
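The construction of the table of pairs can be sketched as follows. For each state, the fragment collects the COs produced on the transitions entering it and then expands every transition into one pair per such CO. The sample STT contains only the four transitions mentioned in the text, so it is a fragment and not the complete Table 2. Treating the initial state as entered with the CO Y_0 is an assumption based on the first row of Table 4.

```python
def build_pairs(stt, initial_state, initial_co):
    """Return rows (a_m, a_T, Y_m, Y_T) of the table of pairs.

    `stt` is a list of transitions (a_m, a_T, Y_T).
    """
    incoming = {initial_state: [initial_co]}
    for _, a_t, y_t in stt:
        cos = incoming.setdefault(a_t, [])
        if y_t not in cos:
            cos.append(y_t)
    pairs = []
    for a_m, a_t, y_t in stt:
        # One row per CO that can be kept for the current state a_m.
        for y_m in incoming.get(a_m, []):
            pairs.append((a_m, a_t, y_m, y_t))
    return pairs


stt = [("a1", "a2", "Y1"), ("a1", "a3", "Y2"),
       ("a2", "a3", "Y3"), ("a3", "a4", "Y5")]
pairs = build_pairs(stt, "a1", "Y0")
for g, row in enumerate(pairs, start=1):
    print(g, row)
```

In this fragment, H = 4 transitions give G = 5 pairs, because the state a_3 is entered with two different COs.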
Step 7. The table of block LZ (Table 5) is created using the modified DST (Table 3). This table includes only a part of the DST columns: a m , K(a m ), X h ; Z h and h.
Obviously, the SOPs of SBF (7) include the product terms $F_h = A_m X_h$ ($h \in \{1, \ldots, H\}$). Using Table 5 gives the following minimized SOPs:

Step 8. This step is present only in our proposed method. As follows from (12), the state variables $T_r \in T$ depend on the variables encoding COs. So, it is necessary to construct a table reflecting this dependence. To do this, each transition from the initial state transition table (Table 2) must be represented as a transition between the COs from adjacent cycles of operation. This dependence is shown in the table of LT.
The table of LT includes seven columns. They are the following: Y m ,K(Y m ), Y T , K(Y T ), a T , T g , and g. This table is constructed using the columns Y m , Y T , a T of the table of pairs (Table 4), the codes of COs ( Figure 8) and state codes K(a T ). In the discussed case, there are G = 12 rows in this table (Table 6).
For example, the following relations hold for the first row of Table 4: $Y_m = Y_0$, $Y_T = Y_1$ and $a_T = a_2$. As follows from Figure 8, there are the codes $K(Y_0) = 000$ and $K(Y_1) = 100$. As follows, for example, from Table 5, there is the state code $K(a_2) = 001$. So, the column $T_g$ of Table 6 contains the symbol $T_3$ for the row $g = 1$. All other rows are filled in the same manner.
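Filling one row of the table of LT can be sketched as follows. The code of Y_m is expressed through the variables v_r, the code of Y_T through the variables z_r, and the column T_g lists the state variables equal to 1 in the code of the transition state. The function names are hypothetical, and `~` denotes negation.

```python
def literals(code, prefix):
    """Conjunction literals for a binary code; '~' marks a negated variable."""
    return [f"{prefix}{r}" if bit == "1" else f"~{prefix}{r}"
            for r, bit in enumerate(code, start=1)]


def lt_row(code_y_m, code_y_t, code_a_t):
    """Return (product term literals, T_g) for one row of the table of LT."""
    term = literals(code_y_m, "v") + literals(code_y_t, "z")
    t_g = [f"T{r}" for r, bit in enumerate(code_a_t, start=1) if bit == "1"]
    return term, t_g


# Row g = 1: K(Y_0) = 000, K(Y_1) = 100, K(a_2) = 001.
term, t_g = lt_row("000", "100", "001")
print(" ".join(term), t_g)
```

For the first row, the term is the conjunction of the negated variables v_1, v_2, v_3 with z_1 and the negated z_2, z_3, and it enters the SOP of T_3 only.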
Using the table of LT, the SBF (12) is derived. The SOPs of the corresponding functions include the terms (13). In the discussed case, this is the following SBF:

Step 9. To obtain the LUT-based circuit of Mealy FSM $U_3(S_1)$, the step of technology mapping should be executed [31]. This can be done only with the help of an industrial CAD tool. In the case of Virtex-7-based circuits, the industrial package Vivado [47] should be used. This CAD tool executes the process of technology mapping. As a result, we can extract the real characteristics of an FSM circuit (such as the LUT count, number of slices, number of flip-flops, maximum operating frequency, and power consumption) from the Vivado reports.
This CAD tool can be used starting from the FPGAs of the Virtex-7 family. So, it is impossible to use Vivado for implementing the circuit of FSM U 3 (S 1 ) using LUTs with four inputs. In the next section, there are shown the results of experiments conducted with the help of the industrial CAD package Vivado and the library of standard benchmark FSMs [56].

Experimental Results
This section presents the results of experiments conducted to compare the characteristics of U 3 -based Mealy FSMs with the characteristics of FSM circuits based on some other models. The benchmark FSMs from the library [56] are used for these experiments. This library includes 48 benchmarks taken from the practice of logic design and represented in the KISS2 format. Although the library dates back to the 1990s, it has been used by various authors for 30 years to compare new and existing methods of implementing FSM circuits. Examples of works in which the library [56] is used in experimental research include articles [31,35,52,57-59] and monographs [6,39]. The basic characteristics of the benchmarks are shown in Table 7. To conduct the research, we use a personal computer with the following characteristics: CPU, Intel Core i7 6700 K 4.2@4.4 GHz and memory, 16 GB RAM 2400 MHz CL15. As a platform for implementing FSM circuits, the Virtex-7 VC709 Evaluation Platform (xc7vx690tffg1761-2) [60] is used. As a CAD tool, we use the package Vivado v2019.1 (64-bit) of Xilinx [47]. The circuits are implemented using CLBs from the slices SLICEL. They include LUTs having six inputs. To create the tables with research results, the reports of Vivado are used. To link the initial KISS2-based files with Vivado, we create VHDL-based descriptions of these models. To do so, the CAD tool K2F [40] is used.
Using the Vivado reports, we compare some parameters of the produced FSM circuits.
As in our previous research [15], the benchmark FSMs are divided into five groups. The groups are determined by the value of a parameter D(R, L, I). This parameter is calculated as
D(R, L, I) = L + R − I. (19)
Using (19), we create the following groups. The relation D(R, L, I) ≤ 0 determines the group of trivial FSMs (the group G0). The relation 0 < D(R, L, I) ≤ 6 determines the group of simple FSMs (G1). The relation 6 < D(R, L, I) ≤ 12 determines the group of average FSMs (G2). The relation 12 < D(R, L, I) ≤ 18 determines the group of big FSMs (G3). The relation D(R, L, I) > 18 determines the group of very big FSMs (G4). As research [15] shows, the larger the group number, the greater the gain from the use of methods of structural decomposition.
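The grouping rule can be written as a short Python sketch. The function name `fsm_group` and the argument values in the usage lines are hypothetical; the thresholds are those listed above, and the default I = 6 corresponds to the six-input LUTs used in the experiments.

```python
def fsm_group(num_inputs, num_state_vars, lut_inputs=6):
    """Classify an FSM by D(R, L, I) = L + R - I (formula (19))."""
    d = num_inputs + num_state_vars - lut_inputs
    if d <= 0:
        return "G0"  # trivial FSMs
    if d <= 6:
        return "G1"  # simple FSMs
    if d <= 12:
        return "G2"  # average FSMs
    if d <= 18:
        return "G3"  # big FSMs
    return "G4"      # very big FSMs


# Hypothetical (L, R) pairs, chosen only to hit each boundary.
print(fsm_group(4, 2))   # D = 0
print(fsm_group(7, 5))   # D = 6
print(fsm_group(8, 5))   # D = 7
```

The boundary values D = 0, 6, 12 and 18 belong to the lower group because the relations above use non-strict upper bounds.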
The results of the experiments are shown in Tables 8-20. We have organized these tables in the following way. The table columns are marked by the names of the methods used. The table rows are marked by the names of benchmarks. At the intersection of a column with a method and a row with a benchmark, we show the result of a specific experiment obtained from the Vivado report. Inside each table, the benchmarks are sorted by ascending group number and listed in alphabetical order within a group. The row "Total" contains the sum of the values in each column. The row "Percentage" shows the total for each of the other methods as a percentage of the total for U 3 -based FSMs. We use the model of Mealy FSM U 1 for the methods auto, one-hot, and JEDI.
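As an illustration of how the rows "Total" and "Percentage" can be obtained, consider the following sketch. The benchmark names and LUT counts are invented placeholders and are not taken from Tables 8-20. We assume that the percentage is the column total divided by the total of the U 3 column, so the baseline column equals 100%.

```python
def summarize(results, baseline="U3"):
    """Return the per-method totals and their percentage of the baseline total."""
    methods = list(next(iter(results.values())))
    totals = {m: sum(row[m] for row in results.values()) for m in methods}
    percentage = {
        m: round(100.0 * totals[m] / totals[baseline], 2) for m in methods
    }
    return totals, percentage


# Hypothetical LUT counts; they are NOT values from the paper.
results = {
    "bench_a": {"auto": 20, "one-hot": 30, "U3": 10},
    "bench_b": {"auto": 10, "one-hot": 20, "U3": 10},
}

totals, percentage = summarize(results)
print(totals)
print(percentage)
```

With these placeholder numbers, a percentage above 100 for a method means that its circuits need more LUTs in total than the U 3 -based ones.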
These tables include the following information: (1) the LUT counts for all benchmarks (Table 8); (2) the LUT counts for benchmarks of the group G0 (Table 9); (3) the LUT counts for benchmarks of the group G1 (Table 10); (4) the LUT counts for benchmarks of groups G2, G3 and G4 (Table 11); (5) the maximum operating frequency for all benchmarks (Table 12); (6) the maximum operating frequency for benchmarks of the group G0 (Table 13); (7) the maximum operating frequency for benchmarks of the group G1 (Table 14); and (8) the maximum operating frequency for benchmarks of the groups G2-G4 (Table 15).
To fill in the tables with the research results, we use data from our previous articles. Basically, all numbers are taken from papers [13,61]. However, information about the number of flip-flops is mentioned only in paper [14]. Therefore, we used Ref. [14] to fill in Table 16. The necessary information regarding the proposed method is taken from the Vivado reports.
The following conclusions can be made from the analysis of Tables 8-15.
As follows from Table 8, the U 3 -based FSMs require fewer LUTs than the FSMs produced by the other investigated methods. Our approach produces circuits with 46.52% fewer 6-LUTs than equivalent auto-based FSMs; 70.50% fewer 6-LUTs than equivalent one-hot-based FSMs; and 20.66% fewer 6-LUTs than equivalent JEDI-based FSMs. Additionally, our approach provides a gain of 7.21% relative to equivalent U 2 -based FSMs. However, the amount of gain (or loss) depends on the group that a particular benchmark belongs to.
As follows from Table 9, our approach loses compared to all other investigated methods. There are the following losses: 30.11% relative to auto-based FSMs; 4.44% relative to one-hot-based FSMs; 32.22% relative to JEDI-based FSMs (7.58% loss); and 1.11% relative to U 2 -based FSMs. So, it does not make sense to use the U 3 -based FSMs to implement the circuits for FSMs of the group G0.
Let us explain the reasons for these losses. Comparing the results for group G0 shows that both multilevel approaches (U 2 and U 3 ) lose out to the other methods. For FSM U 2 , the loss is 30% compared to auto-based FSMs, 3.43% compared to one-hot-based FSMs, and 31.11% compared to JEDI-based FSMs. We explain this by the fact that condition (4) holds for benchmarks of G0. In this case, only a single LUT is needed to implement any function from SBFs (2) and (3). So, there is no need for the encoding of COs. However, as follows from Figures 4 and 6, this encoding is always used in both multilevel FSMs U 2 and U 3 . Because of this, for the group G0, the multilevel FSMs have higher LUT counts than the FSMs produced by the other investigated design methods.
However, our approach gives a gain starting from group G1. As follows from Tables 10 and 11, using the model U 3 gives a gain in LUT counts for groups G1-G4. Compared with auto-based FSMs, the gain is 29.3% (G1) or 61.45% (groups G2-G4). Compared with one-hot-based FSMs, the gain is 67.2% (G1) or 79.88% (groups G2-G4). Compared with JEDI-based FSMs, the gain is 7.32% (G1) or 31.45% (groups G2-G4). Compared with U 2 -based FSMs, the gain is 2.55% (G1) or 9.88% (groups G2-G4). So, the gain from applying the proposed approach increases with the growth of the number of FSM inputs and state variables.
Let us explain the nature of this situation. Starting from G1, the condition (4) is violated. This means that the methods of functional decomposition should be applied for FSMs based on auto, one-hot and JEDI. However, both FSMs U 2 and U 3 are based on the methods of structural decomposition. As follows from [13], using the SD-based methods allows improving LUT counts compared with the FD-based methods. A similar phenomenon also occurs in our case. There is only one set of additional variables in FSMs U 3 . However, FSMs U 2 have two such sets. As follows from the research results, the implementation of systems (5) and (10) requires more internal resources than the implementation of the system (7). This advantage of FSMs U 3 in relation to FSMs U 2 explains the gain in LUTs that the method proposed in this article gives.
From the analysis of Table 10, it follows that the following phenomenon takes place for group G1. In some cases, the circuits of FSMs U 2 require fewer LUTs than the equivalent FSMs U 3 . This situation takes place for the benchmarks bbara, dk15, ex2, ex7, keyb, s27, and s8. However, for the other benchmarks of G1, the circuits of FSMs U 3 have better LUT counts than the equivalent FSMs U 2 . Let us explain this phenomenon.
In LUT-based FSMs, the LUT counts depend on the relation among N A(φ k ) and I. Both FSMs U 2 and U 3 include logic blocks generating outputs y n ∈ Y. Obviously, these blocks consume the same amount of LUTs. So, the difference in LUTs depends on LUT counts for other blocks of these FSMs. For FSMs U 2 , the number of LUTs depends on the distribution of FSM inputs among the functions belonging to SBFs (5), (9) and (10). For FSMs U 3 , the LUT count depends on relation among the value of 2R Q and the number of LUT inputs, I. If the condition (4) holds but the condition 2R Q ≤ I is violated, then there are fewer LUTs in the circuits of FSMs U 2 compared to the circuits of equivalent FSMs U 3 . We think that such situation takes place for the benchmarks bbara, dk15, ex2, ex7, keyb, s27, and s8. For other benchmarks of G1, the following situation takes place: the condition (4) is violated but the condition 2R Q ≤ I holds. As a result, for these benchmarks, there are fewer LUTs in the circuits of FSMs U 3 compared to the circuits of equivalent FSMs U 2 . It seems that this situation takes place for all benchmarks from the groups G2-G4. As a result, our approach allows obtaining better LUT counts for all benchmarks from these groups.
As follows from Table 12, our approach produces slightly faster LUT-based FSM circuits compared to three of the other investigated methods. The average gain is equal to (1) 6.14% (compared with auto-based FSMs); (2) 6.9% (relative to one-hot-based FSMs); and (3) 1.42% (compared with U 2 -based FSMs). The gain relative to U 2 -based FSMs is especially important. It shows that our method not only improves the LUT counts, but also does not degrade the performance compared to three-block FSMs U 2 . Note that our approach loses in the performance of the obtained FSM circuits relative to JEDI-based FSMs (only 0.5%).
For the group G0 (Table 13), our approach provides a gain relative to U 2 -based FSMs (7.19%). However, the other investigated methods win in the values of maximum operating frequency. The auto-based state encoding provides a gain of 2.96%. The JEDI-based state encoding provides a gain of 3.89%. It means that our approach should not be applied if the number of LUT inputs is not less than the total number of FSM inputs (L) and state variables (R).
So, for the group G0, there is the performance loss of SD-based FSMs in comparison with FD-based FSMs. This loss can be explained in the following way. Because the condition (4) holds, there is only a single logic level in the circuits of FD-based FSMs (auto, one-hot, JEDI). However, as follows from Figures 4 and 6, there are three logic levels in the circuits of U 2 -based FSMs and two logic levels in the circuits of U 3 -based FSMs. Therefore, the SD-based FSMs produce slower circuits compared to their FD-based counterparts.
As follows from Table 14, for the group G1, our approach produces faster circuits than both auto- and one-hot-based FSMs. Our gain is equal to 3.92% and 4.04%, respectively. However, the FSM circuits produced by the two other methods are slightly faster than U 3 -based circuits. The JEDI-based FSMs win 2.9%. The U 2 -based FSMs win 0.93%. Thus, the number of logic levels in the FD-based FSMs has increased, but still remains less than this number in the equivalent SD-based FSMs.
The analysis of Table 15 shows that only U 2 -based FSM circuits are a bit faster than the equivalent circuits based on our approach. This win is equal to 0.09%. However, our approach produces faster circuits than auto (15.38%), one-hot (15.6%) and JEDI (4.78%).
Note that to compare different FPGA-based circuits of equivalent devices, such estimates as the number of flip-flops in the circuit, its power consumption, the product of the number of LUTs and the cycle time (the area-time characteristic), the product of the power consumption and the cycle time (the power-time characteristic) can be used. We also compared these characteristics of FSM circuits for the models used in the research. The numbers of flip-flops used in FSM circuits are shown in Tables 16 and 17. Table 18 contains information about the power consumption. The area-time characteristics are shown in Table 19. The power-time characteristics are shown in Table 20.
As follows from Table 16, our method significantly loses in the number of flip-flops to all other methods (except for the one-hot approach). The reason is that, for the methods auto, JEDI and U 2 , the number of flip-flops is the same as the number of bits in the state codes K(a m ). For the proposed FSM U 3 , the number of flip-flops is equal to twice the number of bits in the codes of COs. Because of this, our method loses an average of 39.37% to FSMs based on the methods auto, JEDI and U 2 .
However, this is not entirely true if we consider an FSM as a block of some digital system. It is known that the outputs of the Mealy FSM are not stable. They can change when the input signals change. The FSM inputs are the outputs of the remaining system blocks. This phenomenon can lead to malfunctions in the functioning of the digital system. To eliminate possible failures, an intermediate register is introduced into the system. The FSM outputs are recorded in this register after the end of transient processes in the remaining blocks of the system. So, to find the required number of flip-flops, it is necessary to add a value of N (the number of FSM outputs) to the value obtained from the Vivado reports. For example, there are 7 flip-flops in FSM s1494 for the model U 2 and 16 flip-flops for the model U 3 (Table 16). As follows from Table 7, there is N = 19 for FSM s1494. So, as a block interacting with other blocks of a digital system, this U 2 -based FSM s1494 requires 26 flip-flops. Using the same approach, we can create Table 17.
The proposed method does not require such an additional output register. This is due to the fact that the codes of COs are written to the registers. Therefore, for the model U 3 , the FSM outputs are stable after being written to the registers. So, when choosing an FSM model, the designer must add the number of outputs to all numbers from the Table 16 except for the numbers obtained for U 3 -based FSMs. This fact explains the coincidence of information in columns "Our approach" of Tables 16 and 17. As follows from Table 17, our method allows the use of fewer flip-flops compared to other methods studied. The gain is 41.06% compared to methods auto, JEDI and U 2 and 166.18% compared to the FSMs based on the one-hot approach.
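The accounting rule behind Table 17 can be sketched as follows. The function name `system_flip_flops` is hypothetical; the numbers in the usage lines are the ones quoted above for the benchmark s1494 (7 flip-flops for U 2 , 16 for U 3 , N = 19).

```python
def system_flip_flops(ff_from_report, num_outputs, model):
    """Flip-flops needed when the FSM is a block of a larger digital system.

    For every model except U3, an output register with N flip-flops is
    added to the count taken from the Vivado report. U3 already keeps
    the codes of COs in registers, so nothing is added.
    """
    if model == "U3":
        return ff_from_report
    return ff_from_report + num_outputs


print(system_flip_flops(7, 19, "U2"))   # s1494, model U2
print(system_flip_flops(16, 19, "U3"))  # s1494, model U3
```

For s1494 this gives 26 flip-flops for the U 2 -based circuit against 16 for the U 3 -based one, which is the comparison made in the text.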
To estimate the power consumption, we also used Vivado. Vivado takes the value of the maximum operating frequency achieved for each benchmark and calculates the power consumption based on this frequency. To conduct the research, the core voltage (VCCINT) was set to 1.0 V. The data in Table 18 are taken from the Vivado Power Reports.
As follows from Table 18, the U 3 -based FSMs consume more power than equivalent U 2 -based FSMs (the loss is on average 9.62%). We think this is because (1) U 3 -based FSMs have more flip-flops compared to the equivalent U 2 -based FSMs and (2) the switching activity of flip-flops from U 3 is significantly higher than it is for equivalent U 2 -based FSMs. However, the application of our method allows reducing the power consumption compared to the FSM circuits based on auto (26.03%), one-hot (36.45%) and JEDI (6.26%).
Let us point out that Table 18 shows the power consumption characteristics for FSMs as stand-alone units. If we consider an FSM as some part of a digital system, then the situation can change significantly in favor of our method. This conclusion can be made from the analysis of Table 17.
So far, we have only discussed estimates for one of the FSM circuit characteristics. However, the quality of FSM circuits is often evaluated by integral estimates. One such assessment is that which shows how much chip area is used to achieve a certain cycle time.
In the case of LUT-based FSM circuits, the required FPGA chip area is usually estimated by the number of LUTs used [12]. This approach is adopted in our article, and the results are shown in Table 19.
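A minimal sketch of the area-time estimate is given below. We assume that the cycle time is the reciprocal of the maximum operating frequency reported by Vivado; the function names and the numerical values are hypothetical and are not taken from Table 19.

```python
def cycle_time_ns(f_max_mhz):
    """Cycle time in nanoseconds for a maximum frequency given in MHz."""
    return 1000.0 / f_max_mhz


def area_time(lut_count, f_max_mhz):
    """Area-time product: number of LUTs multiplied by the cycle time (LUT x ns)."""
    return lut_count * cycle_time_ns(f_max_mhz)


# Hypothetical circuit: 50 LUTs at 200 MHz, i.e., a 5 ns cycle.
print(area_time(50, 200.0))
```

A smaller product means that less chip area is spent to achieve a given cycle time.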
As follows from Table 19, our approach provides an average gain of 7.17% compared to the equivalent U 2 -based FSMs. The gain compared to other methods is even more significant: (1) 73.82% compared to auto; (2) 101.39% compared to one-hot and (3) 25.94% compared to JEDI. We do not provide here tables for each of the FSM groups. However, we conducted such a study, and its results showed the following. Our approach gives the worst results for the group G0, where we lose (1) 32.45% compared to auto; (2) 3.79% compared to one-hot; (3) 33.96% compared to JEDI and (4) 2.04% compared to U 2 -based FSMs. The gain starts with group G1. In this group, our method wins (1) 36.84% with respect to auto; (2) 75.98% with respect to one-hot; (3) 4.08% with respect to JEDI; and (4) 1.57% compared to the equivalent U 2 -based FSMs. The greatest gain is observed for the most complex FSMs belonging to the groups G2-G4. For these groups, our method wins (1) 97.68% with respect to auto; (2) 120.88% with respect to one-hot; (3) 39.76% with respect to JEDI; and (4) 10.13% compared to the equivalent U 2 -based FSMs. So, the gain from the application of our method increases as the FSM complexity increases.
The power-time (power-delay) product shows how much energy is spent on the execution of one cycle of operation [62]. In case of discussed benchmarks, the cycle time is measured in nanoseconds. Since the power is measured in Watts, the resulting power-time products are presented in nanojoules (nJ). These results are shown in Table 20.
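The unit bookkeeping can be written out explicitly. The symbols E (energy per cycle), P (power), t_c (cycle time) and f_max (maximum operating frequency) are our notation for this note, and we assume that the cycle time is the reciprocal of the maximum operating frequency:

```latex
E = P \cdot t_c, \qquad t_c = \frac{1}{f_{\max}}, \qquad
1\,\mathrm{W} \times 1\,\mathrm{ns} = 10^{-9}\,\mathrm{J} = 1\,\mathrm{nJ}.
```

So, with P in watts and t_c in nanoseconds, the product is obtained directly in nanojoules; for f_max given in MHz, t_c in nanoseconds equals 1000 / f_max.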
As follows from Table 20, the U 3 -based FSMs have higher energy values than the equivalent U 2 -based FSMs (the loss is on average 10.44%). We think this is because (1) U 3 -based FSMs have more flip-flops compared to the equivalent U 2 -based FSMs, and (2) the switching activity of flip-flops from U 3 is significantly higher than it is for the equivalent U 2 -based FSMs. However, U 3 -based FSMs require less energy compared to FSM circuits based on auto (43.64%), one-hot (56.70%) and JEDI (8.29%).
For a better understanding of the experimental results, we created Table 21. The first column of this table contains the total values for each of the studied characteristics. The remaining columns contain the values of these characteristics for each of the studied methods. The best values for each of the characteristics are shown in bold. The goal of our method is to reduce the number of LUTs without a significant decrease in frequency in relation to three-level U 2 -based FSMs. For this reason, in the "Gain" column, we show the gain or loss (negative gain) of our method with respect to U 2 -based FSMs.
As follows from Table 21, our method allows reducing the LUT counts (the chip area occupied by the FSM circuit) compared to equivalent U 2 -based FSMs having three logic blocks. The results of experiments show that there is no degradation in FSM performance. On the contrary, there is a slight gain in this characteristic (1.42%). So, the results of our experiments show that the proposed approach can be used instead of other models starting from the simple FSMs (the group G1). However, the proposed method should not be used if the dominant factor determining the FSM circuit optimality is its power consumption. We think that the proposed model can be used in CAD systems targeting LUT-based Mealy FSMs if the dominant factor determining the FSM circuit optimality is either the number of LUTs or the area-time product.

Conclusions
Nowadays, the majority of digital systems are implemented using FPGAs. So, FPGAs are used for implementing circuits of FSMs representing various sequential blocks. As the complexity of the FSMs (the numbers of inputs, outputs and states) increases, the contradiction between this significant complexity and a very small number of LUT inputs increases, too. Modern LUTs have around six inputs. This value is still rather small compared with numbers of literals in SBFs representing FSM circuits. This leads to using various methods of functional decomposition in the LUT-based FSM design. It is known [39] that the functional decomposition leads to multi-level LUT-based FSM circuits having spaghetti-type interconnections.
In many cases, the characteristics of FPGA-based FSM circuits can be improved by applying the methods of structural decomposition instead of the methods of functional decomposition [13]. Our research [15] shows that three-block circuits of LUT-based Mealy FSMs require fewer LUTs than some of their counterparts. But this gain is connected with the introduction of some additional functions. This requires using additional chip internal resources to generate these functions. This is the main disadvantage of the three-block FSM circuits.
In this article, we propose to use the codes of collections of outputs to represent both the outputs and state variables of Mealy FSMs. This is connected with using two registers keeping codes of COs. Using this approach, it is possible to generate FSM outputs and codes of the transition states in parallel. This leads to Mealy FSM circuits having two levels of LUTs. These circuits require fewer LUTs than the equivalent three-block FSM circuits. The experiments show that the proposed approach allows reducing hardware compared with such known methods as auto and one-hot of Vivado, and JEDI. Additionally, the proposed approach gives better results than a method based on the simultaneous replacement of inputs and encoding of COs.
Compared to circuits of the three-block FSMs, the LUT counts are reduced by an average of 7.21% without a significant reduction in the performance. The gain in LUT counts and area-time products increases with the increase in the numbers of FSM states and inputs. Our approach loses in terms of power consumption (on average 9.62%) and power-time products (on average 10.44%). As the experiments show, the proposed two-block FSMs have practically the same cycle times (maximum operating frequencies) as their three-block counterparts. This analysis allows us to conclude that the proposed method can be used for improving the LUT counts of various FPGA-based sequential devices.

Data Availability Statement:
The data presented in this study are available in the article.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: