Improving the Characteristics of Multi-Level LUT-Based Mealy FSMs

Abstract: Contemporary digital systems include many varying sequential blocks. In the article, we discuss a case when Mealy finite state machines (FSMs) describe the behavior of sequential blocks. In many cases, the performance is the most important characteristic of an FSM circuit. In the article, we propose a method which allows increasing the operating frequency of multi-level look-up table (LUT)-based Mealy FSMs. The main idea of the proposed approach is to use two methods of structural decomposition together: (1) the known method of transformation of codes of collections of outputs into FSM state codes and (2) a new method of extension of state codes. The proposed approach allows producing FPGA-based FSMs having three levels of logic combined through a system of regular interconnections. Each function for every level of logic is implemented using a single LUT. An example of the synthesis of a Mealy FSM with the proposed architecture is shown. The effectiveness of the proposed method was confirmed by the results of experimental studies based on standard benchmark FSMs. The research results show that FSM circuits based on the proposed approach have a higher operating frequency than can be obtained using the other investigated methods. The maximum operating frequency is improved by an average of 3.18 to 12.57 percent. These improvements are accompanied by a small growth of the LUT count.


Introduction
Digital systems are widely used in our daily life [1]. They can be viewed as combinations of various sequential and combinational blocks [2,3]. To implement the circuit of a sequential block, it is necessary to formally describe its behavior. Very often, models of finite state machines (FSMs) [4,5] are used for this purpose. The quality of an FSM circuit is determined by a combination of such characteristics as: a chip area occupied by the circuit, maximum operating frequency and consumption of power. As follows from [6], there is a direct relationship between these circuit characteristics. To reduce the occupied chip area, various methods of structural decomposition can be applied [7]. These methods produce circuits with multiple levels of logic, which are significantly slower than their single-level counterparts.
The methods of structural decomposition [7] are designed to reduce the numbers of LUTs in FSM circuits. As a rule, FSM circuits with three levels of logic blocks require the smallest numbers of LUTs. However, three-level FSMs have a much lower operating frequency compared to their single-level counterparts. FSM circuits with two levels of logic blocks represent a compromise between the number of LUTs and the operating frequency. The main contribution of this paper is a novel design method aimed at increasing the operating frequency of two-level LUT-based Mealy FSMs. The main idea of the proposed approach is to use two methods of structural decomposition together: (1) the known method of transformation of codes of collections of outputs into FSM state codes and (2) a new method of extension of state codes. As a result, there are exactly three levels of LUTs in the part of the FSM circuit implementing the system of outputs. Additionally, the method produces FSM circuits having a regular system of interconnections, where each level of logic has its unique systems of inputs and outputs. The proposed method allows obtaining FSM circuits that have slightly more LUTs and a higher operating frequency than their three-level counterparts [30]. The experimental results presented in the article show that the advantage of the proposed approach increases as the number of FSM inputs increases.
The rest of the article includes five sections. Section 2 presents the background of single-level LUT-based Mealy FSMs. Section 3 discusses the methods currently used in the design of FPGA-based FSMs. The main idea of our method is considered in Section 4. In Section 5, we discuss an example of synthesis and the main ways of improving the characteristics of the resulting FSM circuit. In Section 6, we present the results of research on the effectiveness of the proposed method for benchmark FSMs from [31]. The article ends with a brief summary.

Single-Level LUT-Based Mealy FSMs
As follows from [13], FPGAs manufactured by Xilinx are based on an "island-style" architecture [19,20]. The configurable logic blocks (CLBs) are "islands" surrounded by a "sea" of programmable interconnections that form a general routing matrix [13]. In this paper, we discuss a case of CLBs including LUTs and programmable flip-flops. The flip-flops are used to organize hidden distributed registers keeping FSM state codes [2]. A LUT-based CLB includes a LUT, a flip-flop and a multiplexer (Figure 1). An FSM circuit is represented by some system of Boolean functions (SBF). For practical digital systems, an SBF can include around 50-70 literals [3,4]. However, a LUT has no more than six inputs. This limitation makes it necessary to transform SBFs representing FSM circuits. The transformation is executed using different methods of functional decomposition (FD) [32]. The FD-based transformation leads to FSM circuits with many levels of LUT-based CLBs and systems of unordered (irregular), "spaghetti-type" interconnections [33].
A Mealy FSM is represented as a six-component vector $S = \langle X, Y, A, \delta, \lambda, a_1 \rangle$ [34]. The vector $S$ includes a set of inputs $X = \{x_1, \ldots, x_L\}$, a set of outputs $Y = \{y_1, \ldots, y_N\}$, a set of internal states $A = \{a_1, \ldots, a_M\}$, a function of transitions $\delta$, a function of output $\lambda$ and an initial state $a_1 \in A$. Various tools can be applied to represent the vector $S$. The most commonly used tools are: graph-schemes of algorithms [3,34], binary decision diagrams [35,36], state transition graphs [4] and inverter graphs [37]. In this article, we use state transition tables (STTs) to represent Mealy FSMs.
An STT includes the following columns [4]: a current state $a_m$; a state of transition (a next state) $a_s$; an input signal $X_h$ (it determines a transition from $a_m$ to $a_s$); a collection of outputs $Y_h$ (it is generated during the transition from the current state into the next state). The column $h$ includes the numbers of transitions ($h \in \{1, \ldots, H\}$). For example, a Mealy FSM $S_0$ is represented by the STT (Table 1).
[Table 1. State transition table of Mealy FSM $S_0$. Only the last row is recoverable: $a_m = a_3$, $a_s = a_1$, $X_h = 1$, $Y_h = -$, $h = 5$.]
As follows from Table 1, the FSM $S_0$ has two inputs, four outputs, three states and five transitions. From Table 1 we can find, for example, that $\delta(a_1, x_1) = a_2$ and $\lambda(a_1, x_1) = y_1$ (these formulae follow from the first row of Table 1).
The following steps should be executed to construct SBFs describing logic circuits of FSMs [3,34]: (1) encoding the FSM states $a_m \in A$ by binary codes $K(a_m)$; (2) constructing the sets of state variables $T = \{T_1, \ldots, T_R\}$ and input memory functions (IMFs) $\Phi = \{D_1, \ldots, D_R\}$; and (3) constructing a direct structure table (DST). To encode the states $a_m \in A$, the step of state assignment should be executed [2].
In this paper, we use the style of binary state assignment where the number of state variables ($R$) is determined as
$$R = \lceil \log_2 M \rceil. \tag{1}$$
The binary state assignment is used, for example, in the system SIS [38]. The number of bits of the state code can vary from the minimum value determined by (1) to the number of states, $M$. If $R = M$, then the corresponding state codes are one-hot codes. This style is used, for example, by the academic system ABC [37] of Berkeley.
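The two extremes of the code width can be illustrated by a minimal Python sketch. The function names and the plain sequential numbering of states are assumptions made for illustration only; real tools such as SIS or JEDI choose the codes to optimize the resulting SBFs.

```python
def binary_code_width(num_states: int) -> int:
    """Minimum number of state variables, ceil(log2(M)), computed with integers."""
    return max(1, (num_states - 1).bit_length())


def binary_codes(states):
    """Assign sequential binary codes K(a_m); the order is an arbitrary assumption."""
    width = binary_code_width(len(states))
    return {state: format(index, f"0{width}b") for index, state in enumerate(states)}


def one_hot_codes(states):
    """Assign one-hot codes, where the number of state variables equals M."""
    count = len(states)
    return {state: format(1 << (count - 1 - index), f"0{count}b")
            for index, state in enumerate(states)}


if __name__ == "__main__":
    example_states = ["a1", "a2", "a3"]  # three states, as in FSM S0
    print(binary_codes(example_states))
    print(one_hot_codes(example_states))
```

For three states the binary style needs two state variables, while the one-hot style needs three flip-flops but produces simpler input memory functions.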
A special state register (RG) keeps FSM state codes. It is controlled by two internal pulses. The pulse start causes the loading of the initial state code into the RG. The pulse clock sets the time when the RG can be changed. For CLB-based FSMs, state registers are constructed on the basis of D flip-flops [2]. In this article, we also use state registers based on D flip-flops. The pulse clock allows the functions D r ∈ Φ to change the RG content.
After the state assignment, each state $a_m \in A$ is represented by its code $K(a_m)$. The Boolean systems representing an FSM circuit can be derived from a DST. Compared to the initial STT, a DST includes three additional columns: $K(a_m)$, $K(a_s)$ and $\Phi_h$. The column $\Phi_h$ includes the symbols $D_r \in \Phi$ corresponding to 1s in the code of the state $a_s$ from the row $h$ of a DST. A DST is a base for finding the following SBFs:
$$\Phi = \Phi(T, X); \tag{2}$$
$$Y = Y(T, X). \tag{3}$$
The architecture of a Mealy FSM $U_1$ is defined by these systems of Boolean functions (SBFs). It is shown in Figure 2.
Let us analyze this architecture. The SBF (2) is implemented by Blockδ. This block includes the distributed register. The RG is controlled by IMFs (2) and shared pulses of synchronization and reset. The SBF (3) is implemented using Blockλ. Both blocks are implemented with CLBs (Figure 1). Analysis of systems $\Phi$ and $Y$ shows that they depend on the same variables. This is the main peculiarity of Mealy FSMs. Many design methods [7,39] use this property to reduce the numbers of LUTs in circuits represented by SBFs (2) and (3).

State-Of-The-Art
As a rule, the process of designing digital systems involves solving some optimization problems [2,4]. In the case of FPGA-based sequential blocks, these problems are the following [2,24]: (1) reducing the chip resources required to implement a LUT-based circuit; (2) decreasing the propagation time (increasing the maximum operating frequency); and (3) reducing the power consumption. Our current article is devoted to improving the maximum operating frequency of LUT-based Mealy FSMs.
We call state codes optimal if they allow reducing the numbers of arguments in SBFs (2) and (3). For example, the numbers of arguments are significantly reduced by the algorithm JEDI [38]. It is one of the best state assignment algorithms [2]. For this reason, we chose JEDI-based FSMs to compare with FSMs based on our proposed approach.
Modern industrial CAD tools include various state assignment strategies. For example, the following state assignment methods are used in the Xilinx design tool Vivado [40]: automatic state assignment (auto); sequential encoding; the one-hot; Gray encoding and Johnson codes. The same methods can be found in the package XST by Xilinx [51].
The one-hot state assignment is very popular in LUT-based design [41], because FPGAs include many programmable flip-flops. The one-hot state assignment leads to increasing the number of input memory functions compared with (1). However, these IMFs are much simpler than in the case of binary state assignment [2]. As follows from [41], it is better to use the one-hot codes if an FSM has more than 16 states. However, the characteristics of LUT-based FSM circuits significantly depend on the number of inputs [2]. As follows from [42], the binary state encoding allows producing better FSM circuits if L ≥ 10. Since each approach is good under certain conditions, we compare both of these encoding styles with our proposed method. The method of binary state assignment auto of Vivado is used as a baseline for comparison with the proposed method.
To reduce the power consumption, it is very important to diminish the number of interconnections inside an FSM circuit. Therefore, to diminish the number of interconnections, it is necessary to minimize the numbers of arguments in SBFs (2) and (3) [2]. Thus, it is always useful to apply the optimal state assignment to improve the characteristics of FSM circuits.
The second approach to optimizing CLB-based FSMs is related to using embedded memory blocks (EMBs) instead of LUTs [47]. There are many design methods targeting EMB-based FSMs [47][48][49][52][53][54][55][56][57]. A survey of different methods of EMB-based design can be found in [47]. In the best case, only a single EMB is necessary to implement an FSM circuit [49]. However, if the number of arguments in systems (2) and (3) exceeds the maximum possible number of EMB address inputs, then an FSM is represented by a network of EMBs.
To diminish the number of EMBs in such a network, it is necessary to implement some functions using LUTs [2,49].
Thus, an FSM circuit can be implemented as either a network of EMBs, or a network of LUTs, or a joint network of LUTs and EMBs. In this article, we discuss the second case, when FSM circuits are implemented using LUT-based CLBs. This approach makes sense if: (1) all EMBs are used to implement other parts of a digital system or (2) the number of arguments in SBFs (2) and (3) exceeds 15 (this is the maximum possible number of address inputs of modern EMBs [11][12][13]).
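Let a LUT have $S_L$ inputs, and let $NL(f_i)$ denote the number of arguments of a function $f_i \in \Phi \cup Y$. If the condition
$$NL(f_i) \le S_L \tag{4}$$
takes place, then a logic circuit for any function $f_i \in \Phi \cup Y$ is represented by exactly one LUT.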
If NL( f i ) > S L , then the corresponding logic circuit can be obtained using various methods of FD [21,23,27,35,36,48,58,59]. The FD can be viewed as a process during which decomposed functions are broken down into smaller and smaller components. If any component depends on no more than S L arguments, the process of FD for a given function is completed. Of course, this results in multi-level LUT-based circuits. For these circuits, it is typical that the same inputs x l ∈ X or state variables appear on several logic levels. It significantly complicates the system of interconnection between LUTs of FD-based FSM circuits (with all the ensuing consequences).
In the best case, each function is implemented by a single LUT, and the LUT count of an FSM circuit is equal to $N + R$, the total number of outputs and input memory functions. However, if the condition (4) is violated, the LUT count increases by the value of $|\Psi|$, where $\Psi$ is a set of additional functions different from (2) and (3). These additional functions are components of functions (2) and (3) produced during the process of FD. We do not discuss these methods in our article.
A reduction of LUT counts in circuits of Mealy FSMs can be achieved using various methods of structural decomposition [7,39]. These methods eliminate the direct dependence of functions $y_n \in Y$ and $D_r \in \Phi$ on inputs $x_l \in X$. The methods of structural decomposition are also connected with introducing new functions $f_i \in \Psi$. Functions $f_i \in \Psi$ depend on variables $x_l \in X$ and $T_r \in T$, and they are arguments of functions (2) and (3). The structural decomposition allows reducing LUT counts if the condition (5) takes place. These new functions are divided into subsystems having unique input and output variables. Each subsystem determines a separate LUT-based block of logic. When the condition (5) takes place, the total LUT count for a decomposed FSM is significantly less than it is for the equivalent FSM $U_1$. If the condition (6) takes place, then the total LUT count of a decomposed FSM circuit is significantly less than it is for an equivalent multi-level circuit.
A survey of different methods of structural decomposition is presented in [7]. In this article, we discuss three known methods of structural decomposition [7,34]: replacement of inputs, encoding of outputs and transformation of codes of collections of outputs into state codes. Consider these approaches.
To reduce the LUT count, the inputs $x_l \in X$ can be replaced by additional variables $p_g \in P = \{p_1, \ldots, p_G\}$, where $G \ll L$ [34]. As a rule, the value of $G$ is determined by the expression (7) from [34]. The system of additional variables $p_g \in P$ is represented by the SBF
$$P = P(T, X). \tag{8}$$
The functions $f_i \in \Phi \cup Y$ are represented by the following SBFs:
$$\Phi = \Phi(T, P); \tag{9}$$
$$Y = Y(T, P). \tag{10}$$
Collections of outputs (COs) $Y_q \subseteq Y$ ($q \in \{1, \ldots, Q\}$) include functions $y_n \in Y$ generated simultaneously. To synthesize an FSM circuit, it is necessary to represent each CO $Y_q \subseteq Y$ by a binary code $K(Y_q)$. As a rule, the number of bits in these codes is determined as
$$R_Q = \lceil \log_2 Q \rceil. \tag{11}$$
To create codes $K(Y_q)$, it is necessary to use additional variables $z_r \in Z = \{z_1, \ldots, z_{R_Q}\}$. This allows representing the outputs of an FSM as follows:
$$Y = Y(Z). \tag{12}$$
The additional variables $z_r \in Z$ are represented by the following system:
$$Z = Z(T, X). \tag{13}$$
To generate functions (13), an additional block of logic should be used.
In the work [30], two known methods of structural decomposition are used for reducing the LUT count for FPGA-based Mealy FSMs. It results in the Mealy FSM $U_2$ shown in Figure 3. The logic circuit of Mealy FSM $U_2$ has three logic levels. The BlockP executes the replacement of inputs $x_l \in X$ by additional variables $p_g \in P = \{p_1, \ldots, p_G\}$ and implements the SBF (8). The Blockδ generates input memory functions (9) and additional variables $z_r \in Z$ used for encoding of collections of outputs $Y_q \subseteq Y$ ($q \in \{1, \ldots, Q\}$). This block includes a distributed register keeping state codes. To generate variables $z_r \in Z$, it is necessary to implement the system
$$Z = Z(T, P). \tag{14}$$
Blockλ implements the system (12) dependent on additional variables $z_r \in Z$. As our investigations [30] show, this approach allows significantly reducing the LUT count as compared to the equivalent FSM $U_1$. However, this solution has a serious drawback: the performance of FSM $U_2$ is always lower than that of an equivalent Mealy FSM $U_1$.
In [36], different models of Mealy FSMs based on transformation of object codes are discussed. One of the typical methods from this group is a transformation of codes K(Y q ) into state codes K(a m ).
The main idea of this approach is the following. For example, some CO $Y_3$ is generated during transitions into states $a_4$ and $a_6$. Using CO $Y_3$, it is possible to determine these states. To do it, it is necessary to use identifiers $I_1$ and $I_2$. Using two pairs $\langle$collection of outputs, identifier$\rangle$ allows the following representation of these states of transition: $a_4 \to \langle Y_3, I_1 \rangle$ and $a_6 \to \langle Y_3, I_2 \rangle$. Thus, each state $a_m \in A$ can be represented by one or more pairs $\langle Y_q, I_{np} \rangle$. To create the set of identifiers $SI = \{I_1, \ldots, I_{NP}\}$, it is necessary to find the maximum number of pairs ($NP$) including the same CO $Y_q \subseteq Y$.
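The construction of pairs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_pairs`, the input format (a list of next-state and CO names taken from the STT rows) and the numbering of identifiers in order of first appearance are all hypothetical choices, not the authors' tool.

```python
from collections import defaultdict


def build_pairs(transitions):
    """Map each next state to pairs (CO, identifier) and return NP as well.

    transitions: iterable of (next_state, collection_of_outputs) taken row by
    row from an STT. Identifiers I1, I2, ... are numbered per CO in order of
    first appearance (an assumption made for this sketch).
    """
    states_per_co = defaultdict(list)
    for next_state, co in transitions:
        if next_state not in states_per_co[co]:
            states_per_co[co].append(next_state)

    pairs = defaultdict(list)
    for co, states in states_per_co.items():
        for number, state in enumerate(states, start=1):
            pairs[state].append((co, f"I{number}"))

    # NP is the largest number of different states sharing the same CO.
    np_max = max(len(states) for states in states_per_co.values())
    return dict(pairs), np_max


if __name__ == "__main__":
    # CO Y3 is generated during transitions into a4 and a6 (rows are illustrative).
    rows = [("a4", "Y3"), ("a6", "Y3"), ("a4", "Y3")]
    print(build_pairs(rows))
```

For the rows above, the sketch yields the pairs for $a_4$ and $a_6$ described in the text and the value $NP = 2$.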
Each identifier $I_{np} \in SI$ is represented by a binary code $K(I_{np})$ having $R_I$ bits, where
$$R_I = \lceil \log_2 NP \rceil. \tag{15}$$
To encode identifiers, the elements of the set $V = \{v_1, \ldots, v_{R_I}\}$ are used. It allows representing the IMFs by the following system:
$$\Phi = \Phi(Z, V). \tag{16}$$
The variables $v_r \in V$ are represented by the following system:
$$V = V(T, X). \tag{17}$$
Thus, an FSM based on this principle implements systems (12), (13), (16) and (17). It is the FSM $U_3$ shown in Figure 4.
In FSM $U_3$, the BlockZV implements systems (13) and (17); the Blockδ implements input memory functions represented as (16); the Blockλ implements the system (12). Thus, there are only two levels of logic between inputs and outputs in the case of FSM $U_3$. As follows from Figure 3, there are three levels of logic between inputs and outputs in the case of FSM $U_2$.
This property of FSM U 3 can be used for acceleration of a digital system. As is known [2], outputs (3) of Mealy FSM are not stable. If inputs are changing during a clock cycle, the outputs (3) may also change. This may cause the digital system as a whole to crash. To prevent failures, it is necessary to prohibit the access of incorrect outputs (3) to a digital system. To do it, a special register SRG is introduced ( Figure 5). If all transients in the FSM circuit are completed and the values of outputs are stable, then a pulse of synchronization C1 is generated. It allows loading outputs y n ∈ Y into SRG. Next, the registered outputs y n ∈ Y R enter the digital system. The system executes the corresponding operations and generates the values of inputs x l ∈ X. Such an interaction should be organized for any model of Mealy FSM.
Thus, in the case of FSM U 3 , the pulse C1 may be generated when the correct values are set for the outputs of two blocks (BlockZV and Blockλ). In the case of FSM U 2 , the correct outputs are set after all three blocks are triggered sequentially. Thus, the model U 3 can provide better performance than the model U 2 .
There is one very serious disadvantage of FSM $U_3$ compared to the equivalent FSM $U_2$. If the relation (18) is true, then the number of LUTs (and maybe their levels) in BlockZV is significantly greater than in BlockP of the equivalent FSM $U_2$. In this article, we propose a method which allows reducing the number of LUTs in FSM $U_3$.

Main Idea of the Proposed Method
In this article, we discuss a case when the condition (4) is violated for some functions $f_i \in Z \cup V$. It leads to a multi-level circuit of BlockZV with an irregular system of interconnections. Obviously, it degrades the performance of FSM $U_3$. To diminish the number of levels of LUTs in the circuit of BlockZV, we propose the following approach.
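As it is in the case of two-fold state assignment [7,60], we propose to construct a partition $\Pi_A = \{A_1, \ldots, A_J\}$ of the set $A$ such that the following condition takes place:
$$R_j + L_j \le S_L \quad (j \in \{1, \ldots, J\}). \tag{19}$$
In (19), $L_j$ is the number of inputs $x_l \in X$ determining transitions from the states $a_m \in A_j$, and $R_j$ is the number of state variables encoding these states. Using methods [7,60] allows creating the required partition $\Pi_A$ having the minimum possible number of classes, $J$.
If a class $A_j \in \Pi_A$ includes $M_j$ states $a_m \in A$, then
$$R_j = \lceil \log_2 (M_j + 1) \rceil \tag{20}$$
state variables are enough to encode the states $a_m \in A_j$. To do it, the state variables $T_r \in T^j \subseteq T$ are used. There are $R_o$ elements in the sets $T$ and $\Phi$:
$$R_o = \sum_{j=1}^{J} R_j. \tag{21}$$
If $a_m \notin A_j$, then $T_r = 0$ for $T_r \in T^j$. It explains the presence of 1 in (20). Now, we can encode each state $a_m \in A_j$ by a code $C(a_m)$ having $R_o$ bits. In this code, $R_o - R_j$ variables are equal to zero. Only variables $T_r \in T^j$ identify a state $a_m \in A$ as an element of $A_j \in \Pi_A$.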
As R o > R, the codes C(a m ) are extended state codes [7]. However, only R j < R state variables are used to represent functions dependent on states a m ∈ A j .
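The extension of state codes can be illustrated with a small Python sketch. It assumes that (20) reads $R_j = \lceil \log_2 (M_j + 1) \rceil$, that the fields of the classes are concatenated in class order, and that states inside a class are numbered from 1 in the listed order; the function name and these ordering choices are hypothetical.

```python
def extended_state_codes(partition):
    """Build extended codes C(a_m) for a partition given as a list of classes.

    Each class A_j receives its own field of R_j = ceil(log2(M_j + 1)) bits.
    The all-zero field value is reserved for "state is not in this class",
    so states inside a class are numbered starting from 1.
    """
    # For M_j >= 1, ceil(log2(M_j + 1)) equals the bit length of M_j.
    widths = [len(cls).bit_length() for cls in partition]
    total_width = sum(widths)  # R_o

    codes = {}
    offset = 0
    for cls, width in zip(partition, widths):
        for number, state in enumerate(cls, start=1):
            field = format(number, f"0{width}b")
            codes[state] = "0" * offset + field + "0" * (total_width - offset - width)
        offset += width
    return codes, total_width


if __name__ == "__main__":
    # Partition used in the synthesis example of this article.
    classes = [["a1", "a3", "a6"], ["a2", "a4", "a5"]]
    print(extended_state_codes(classes))
```

With six states, a plain binary assignment needs $R = 3$ variables, whereas the extended codes use $R_o = 4$; in return, each first-level block sees only the two variables of its own field.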
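To find SBFs (13) and (17), it is necessary to construct a table of BlockZV (TZV). It includes the columns $a_m$, $C(a_m)$, $a_s$, $Y_q$, $I_{np}$, $K(Y_q)$, $K(I_{np})$, $X_h$, $Z_h$, $V_h$ and $h$. The rows of TZV for the states $a_m \in A_j$ form a table TZV$_j$. Each table TZV$_j$ determines the sets $X^j \subseteq X$, $Z^j \subseteq Z$ and $V^j \subseteq V$. These variables are written in the columns $X_h$, $Z_h$ and $V_h$ of TZV$_j$, respectively. Additionally, a table TZV$_j$ determines the SBFs
$$Z^j = Z^j(T^j, X^j); \tag{22}$$
$$V^j = V^j(T^j, X^j). \tag{23}$$
Using this preliminary information, we propose an architecture of Mealy FSM $U_4$ (Figure 6). In FSM $U_4$, the Blockj implements functions (22) and (23). Due to (19), each Blockj has only a single level of LUTs.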
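BlockOR implements functions $z_r \in Z$ and $v_r \in V$ as disjunctions:
$$z_r = \bigvee_{j=1}^{J} z_r^j \quad (r \in \{1, \ldots, R_Q\}); \tag{24}$$
$$v_r = \bigvee_{j=1}^{J} v_r^j \quad (r \in \{1, \ldots, R_I\}). \tag{25}$$
In (24) and (25), the superscript $j$ means that the corresponding function is generated by the Blockj.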
If J ≤ S L , then there is only a single level of LUTs in the circuit of BlockOR. Otherwise, it is a multi-level block.
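A rough estimate of the depth of BlockOR can be sketched as follows. The sketch assumes that a disjunction of $J$ signals is built as a balanced tree of LUTs with $S_L$ inputs each; this tree structure and the function names are assumptions for illustration, since the actual mapping is decided by the synthesis tool.

```python
def or_tree_levels(num_blocks: int, lut_inputs: int) -> int:
    """Levels of LUTs needed to OR num_blocks signals with lut_inputs-input LUTs."""
    if lut_inputs < 2:
        raise ValueError("a LUT needs at least two inputs to reduce signals")
    levels = 0
    signals = num_blocks
    while signals > 1:
        signals = -(-signals // lut_inputs)  # ceiling division
        levels += 1
    return levels


def single_level_or_luts(code_bits_co: int, code_bits_id: int) -> int:
    """LUT count of BlockOR when J <= S_L: one LUT per function z_r and v_r."""
    return code_bits_co + code_bits_id


if __name__ == "__main__":
    print(or_tree_levels(2, 5))        # two first-level blocks, five-input LUTs
    print(or_tree_levels(7, 6))        # more blocks than LUT inputs
    print(single_level_or_luts(3, 1))  # R_Q = 3, R_I = 1
```

Under this assumption, BlockOR stays single-level as long as the partition has no more classes than a LUT has inputs, which is why a partition with the minimum number of classes is preferred.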
Blockλ and Blockδ execute the same functions as these blocks in FSM U 3 . The Blockλ generates functions (12), the Blockδ the functions (16). If R Q ≤ S L , then Blockλ includes only a single level of LUTs.
Thus, in the best case, there are three levels of LUTs between inputs x l ∈ X and outputs y n ∈ Y. If the condition (4) is violated for equivalent FSM U 3 , then the FSM U 4 provides higher operating frequency.
Comparison of Figure 4 and Figure 6 shows that: (1) BlockZV of U 3 is replaced by Block1, ..., BlockJ, BlockOR and (2) Blockδ of U 4 has R o > R outputs. These two issues are the main specifics of FSM U 4 .
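In this paper, we propose a method of synthesis of the finite state machine $U_4$. If an FSM is represented by an STT, then the method includes the following steps:
1. Representing states $a_m \in A$ by pairs $P(m, q)$.
2. Encoding of collections of outputs and identifiers. Constructing SBF (12) representing Blockλ.
3. Constructing the partition $\Pi_A$ of the set $A$.
4. Creating tables of Block1-BlockJ.
5. Constructing systems representing blocks of the first level.
6. Constructing SBFs (24) and (25) representing BlockOR.
7. Constructing SBF (16) representing Blockδ.
8. Implementing the logic circuit of FSM $U_4$.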
The first step is executed using an initial STT. If a CO $Y_q \subseteq Y$ is generated during transitions into $m_q$ different states $a_s \in A$, then there are $m_q$ identifiers. Each identifier determines a unique state represented by $Y_q \subseteq Y$. The cardinality of the set $SI$ is determined as
$$NP = \max(m_1, \ldots, m_Q). \tag{26}$$
Step 2 is executed on the basis of the STT. The COs should be encoded in a way optimizing the number of literals in SBF (12). Identifiers can be encoded in a trivial way.
The partition Π A is constructed using methods from [7,43]. After finding classes A j ∈ Π A , we can encode the states a m ∈ A j . It gives sets T j ⊆ T = {T 1 , ..., T R 0 } and Φ = {D 1 , ..., D R 0 }.
A table of Blockj (TZV$_j$) includes the rows of TZV for the states $a_m \in A_j$. The states $a_m \in A_j$ are written in the column $a_m$. As $T_r = 0$ if $T_r \notin T^j$, we can write only the parts of $C(a_m)$ created from state variables $T_r \in T^j$. A table TZV$_j$ is a base to derive the SBFs (22) and (23). The terms of the corresponding SOPs are conjunctions $A_m \cdot X_h$, where $A_m$ is a conjunction of variables $T_r \in T^j$. All other state variables are treated as insignificant. The SBFs (22) and (23) are used to implement circuits of Block1-BlockJ.
The step 6 is executed in the trivial way. If J ≤ S L , then there is a single level of LUTs in BlockOR. In this case, its circuit includes exactly R Q + R I LUTs.
To find the SBF (16), it is necessary to construct a table of Blockδ. This table includes the following columns: $Y_q$, $K(Y_q)$, $I_{np}$, $K(I_{np})$, $a_s$, $C(a_s)$, $\Phi_h$, $h$. Each row of this table corresponds to a pair $\langle Y_q, I_{np} \rangle$ determining the state $a_s \in A$. The terms of SOPs (16) are conjunctions of variables $z_r \in Z$ and $v_r \in V$. The corresponding literals are determined by codes $K(Y_q)$ and $K(I_{np})$.
The last step is executed using standard CAD tools. It is based on program tools translating the initial STT into the required SBFs. These SBFs are used in VHDL models of FSMs.
Now, we would like to show the difference between the two-fold state assignment [60] and the proposed method. In the first case, there are two sets of state variables. The set $T = \{T_1, \ldots, T_R\}$ is used to encode states $a_m \in A$ as elements of the set $A$. The set $\tau = \{\tau_1, \ldots, \tau_{R_o}\}$ is used to encode states $a_m \in A_j$ as elements of the sets $A_j$ ($j = 1, \ldots, J$). Because of this, there are two levels of logic creating the inputs of Block1-BlockJ. In the proposed approach, the inputs of these blocks are generated by Blockδ. Thus, the proposed approach leads to faster FSMs than the two-fold state assignment.

Example of Synthesis
In this article, we use a symbol U i (S j ) to show that an FSM model U i is used to synthesize an FSM S j . An example of synthesis of Mealy FSM U 4 (S 1 ) is shown in this section. A Mealy FSM S 1 is represented by Table 2.
The following characteristics of $S_1$ follow from Table 2: the number of states $M = 6$, the number of transitions $H = 15$, the number of inputs $L = 6$ and the number of outputs $N = 8$. Additionally, the collections of outputs $Y_q \subseteq Y$ can be found from Table 2.
[Table 2. State transition table of Mealy FSM $S_1$ (columns $a_m$, $a_s$, $X_h$, $Y_h$, $h$; 15 rows). The table body is not recoverable.]
1. Representing states by pairs $P(m, q)$.
Using the STT (Table 2), it is possible to find pairs $\langle Y_q, I_{np} \rangle$ representing the states $a_m \in A$. For example, the CO $Y_2$ is written in the rows 1, 9, 12 and 13. Additionally, these rows include the states of transitions $a_2$ (rows 1 and 12) and $a_6$ (rows 9 and 13). Thus, two identifiers ($I_1$, $I_2$) are necessary to distinguish these states: $a_2 \to \langle Y_2, I_1 \rangle$, $a_6 \to \langle Y_2, I_2 \rangle$.
Using the same approach, we can find all pairs < Y q , I np > for the given example. The process is shown in Figure 7. Using (26) gives NP = 2 and I = {I 1 , I 2 }. In the discussed case, there is H P = 12, where H P is a number of pairs P(m, q). Thus, the Blockδ will be represented by the table having 12 rows.
In the discussed case, $R_Q + R_I = 4 < S_L$. Therefore, each equation from SBF (16) is implemented using only a single look-up table. Thus, there is no need to encode COs in a way optimizing (16). Let us encode COs $Y_q \subseteq Y$ in a way optimizing the SBF (12).
Using the contents of COs, the SBF (27) can be obtained. To diminish the number of interconnections between BlockOR and Blockλ, it is necessary to reduce the number of literals in functions (12). It can be done using the approach [61]. One of the possible solutions is shown in Figure 8. Using codes from Figure 8 and rules of minimization [4], we can transform the SBF (27) into the following system (28): y 1 = z 1 z 2 z 3 ; y 2 = z 2 z 3 ; y 3 = z 1 z 3 ; y 4 = z 1 z 2 ; y 5 = z 1 z 2 ; y 6 = z 1 z 2 z 3 ; y 7 = z 1 z 3 ; y 8 = z 2 z 3 .
The system (28) represents Blockλ of $U_4(S_1)$. This block has 18 interconnections with BlockOR. In the general case, there are $N \cdot R_Q = 8 \times 3 = 24$ literals (and 24 interconnections). Thus, the number of interconnections is reduced by a factor of 1.33 thanks to the encoding of COs shown in Figure 8.
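The count of interconnections can be checked with a short Python sketch. Only the argument lists of the functions in (28) matter for this count, so the polarity of literals is ignored; the dictionary layout and function names are assumptions made for illustration.

```python
# Arguments of each output function in system (28); literal polarity is ignored
# because an interconnection is needed for a variable in either polarity.
SYSTEM_28 = {
    "y1": ["z1", "z2", "z3"],
    "y2": ["z2", "z3"],
    "y3": ["z1", "z3"],
    "y4": ["z1", "z2"],
    "y5": ["z1", "z2"],
    "y6": ["z1", "z2", "z3"],
    "y7": ["z1", "z3"],
    "y8": ["z2", "z3"],
}


def interconnections(system) -> int:
    """Number of wires from BlockOR to the LUTs of the output block."""
    return sum(len(set(arguments)) for arguments in system.values())


def worst_case_interconnections(system, code_bits: int) -> int:
    """Upper bound N * R_Q, reached when every output depends on every z_r."""
    return len(system) * code_bits


USED = interconnections(SYSTEM_28)
WORST = worst_case_interconnections(SYSTEM_28, code_bits=3)

if __name__ == "__main__":
    print(USED, WORST, round(WORST / USED, 2))
```

The sketch reproduces the values 18 and 24 quoted above and their ratio of about 1.33.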
The identifiers can be encoded in a trivial way: $K(I_1) = 0$ and $K(I_2) = 1$. Now, the identifier $I_1$ is determined by $\bar{v}_1$, and $I_2$ by $v_1$.
3. Constructing the partition of the set $A$.
There is $S_L = 5$ in the discussed example. It means that each class $A_j \in \Pi_A$ should satisfy the condition $L_j + R_j \le 5$.
This step is very important because it significantly determines the characteristics of FSM $U_4$ [60]. We do not discuss this step in detail. Instead, we use the approach [60] to create the partition $\Pi_A = \{A_1, A_2\}$ with classes $A_1 = \{a_1, a_3, a_6\}$ and $A_2 = \{a_2, a_4, a_5\}$. Using Table 2 gives the sets of inputs $X^1$ and $X^2$ determining transitions from the states of these classes. Using (20) gives $R_1 = R_2 = \lceil \log_2 (3 + 1) \rceil = 2$. Thus, the found partition satisfies the condition (19).
Due to it, state codes C(a m ) do not affect the number of look-up tables in circuits of Block1 and Block2. We can encode them in the following way: C(a 1 ) = 0100, C(a 2 ) = 0001, C(a 3 ) = 1000, C(a 4 ) = 0010, C(a 5 ) = 0011 and C(a 6 ) = 1100.
4. Creating tables of Block1 and Block2.
To do it, we should construct a table of BlockZV of the equivalent FSM $U_3(S_1)$. Next, this table is divided into two tables using classes $A_j \in \Pi_A$ and codes $C(a_m)$.
The table of BlockZV is constructed using an initial STT. To do it, the states of transitions are replaced by the corresponding pairs $P(m, q)$. Additionally, the codes $K(Y_q)$, $K(I_{np})$ and columns $Z_h$, $V_h$ are introduced instead of the column $Y_h$ of the STT. In the discussed example, the BlockZV is represented by Table 3.
In Table 3, we used codes K(Y q ) from Figure 8. The pairs <Y q , I np > were taken from Figure 7. To design circuits of Block1-BlockJ, Table 3 should be transformed into a set of tables representing blocks of the first level of logic.
Consider the row $h = 1$ of Table 3. It corresponds to the pair $P(2, 2)$. Thus, the column $Y_q$ includes $Y_2$ and the column $I_{np}$ includes $I_1$. The column $K(Y_q)$ includes $K(Y_2) = 010$, and the column $K(I_{np})$ includes the code $K(I_1) = 0$. It explains the contents of columns $Z_h$ and $V_h$ of the row 1. The column $X_h$ is the same as for the initial STT (Table 2). All other rows are filled in the same way.
To create tables of a Blockj, we should: (1) choose state a m ∈ A j and (2) take rows of table of BlockZV for these states. In this case, the Block1 is represented by Table 4 and the Block2 by Table 5. In Tables 4 and 5 the superscripts 1 and 2 mean that corresponding functions are implemented by Block1 or Block2, respectively.
5. Constructing systems representing blocks of the first level.
These systems are constructed using Tables 4 and 5. Each system includes $R_Q + R_I = 4$ equations. The Block1 is represented by the SBF (29), and the Block2 by the SBF (30).
6. Constructing the system for BlockOR.
This system is constructed in a trivial way. Each function $f_i \in Z \cup V$ is represented by a disjunction of functions of the same name with different upper indexes. It is the following SBF in the discussed case:
$$z_r = z_r^1 \lor z_r^2 \quad (r \in \{1, 2, 3\}); \qquad v_1 = v_1^1 \lor v_1^2. \tag{31}$$
7. Constructing the system for Blockδ.
To find the system (16), it is necessary to create a table of Blockδ. It is constructed using pairs $P(m, q)$ and codes $K(Y_q)$, $K(I_{np})$ and $C(a_s)$. In the discussed case, this is Table 6. The table uses data from Figures 7 and 8. The SBF (32) is derived from Table 6.
Now, we have systems for each block of FSM $U_4(S_1)$. The next step is the implementation of the logic circuit.
8. Implementing the logic circuit of FSM $U_4(S_1)$.
This step is executed using special synthesis tools, e.g., Quartus Prime [50] or Vivado by Xilinx [40]. During this step, each LUT is represented by its truth table, and such complicated tasks as mapping, placement and routing are executed [6]. We focus only on finding the number of LUTs in the circuit and do not discuss this step for our example.
The Block1 is represented by the SBF (29). The corresponding circuit includes four LUTs. The Block2 is represented by the SBF (30). Its circuit also includes four LUTs. Thus, the first level of logic includes eight LUTs having S L = 5.
The BlockOR is represented by the SBF (31). To implement its circuit, four LUTs are enough. Blockλ is represented by the SBF (28). Its circuit consists of eight LUTs. Finally, the system (32) represents Blockδ. Its circuit has four LUTs.
Thus, the circuit of FSM U 4 (S 1 ) includes 24 LUTs. There are three levels of LUTs between inputs x l ∈ X and outputs y n ∈ Y. The same is true for inputs and input memory functions D r ∈ Φ.
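The tally above can be checked with a short sketch. The LUT counts per block are taken from the text; the connections between blocks (Block1 and Block2 feeding BlockOR, which feeds Blockλ and Blockδ) are an assumption inferred from the description of the three logic levels, and the dictionary and function names are illustrative.

```python
# LUT counts per block, as stated in the text for FSM U4(S1).
LUT_COUNTS = {
    "Block1": 4,
    "Block2": 4,
    "BlockOR": 4,
    "BlockLambda": 8,
    "BlockDelta": 4,
}

# Assumed block-level connections: which blocks drive each block.
DRIVERS = {
    "Block1": [],
    "Block2": [],
    "BlockOR": ["Block1", "Block2"],
    "BlockLambda": ["BlockOR"],
    "BlockDelta": ["BlockOR"],
}


def logic_level(block: str) -> int:
    """Number of LUT levels from the FSM inputs up to and including `block`."""
    return 1 + max((logic_level(d) for d in DRIVERS[block]), default=0)


total_luts = sum(LUT_COUNTS.values())
first_level_luts = LUT_COUNTS["Block1"] + LUT_COUNTS["Block2"]
```

With these assumptions, both the output block and the block of input memory functions sit at the third level, in agreement with the text.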
This example is very simple. We show it to explain all steps of the proposed method. The next section shows the results of experiments with more complex FSMs.

Experimental Results
In this section we show the results of experiments based on benchmark FSMs from the library [31]. There are 48 benchmarks in the library. They are very often used to compare outcomes of different design methods. The benchmark Mealy FSMs are represented in the format KISS2. We do not show the characteristics of these benchmarks in this article. They can be found, for example, in [30].
To implement FPGA-based FSMs, we used VHDL-based FSM models. Our CAD tool K2F [2] translated the benchmarks into VHDL-based FSM models. The synthesis and simulation of FSMs were executed by the Active-HDL environment. As a target platform, we used the Xilinx VC709 Evaluation Board (Virtex 7, XC7VX690T-2FFG1761C) [62]. This chip includes LUTs having S L = 6. To execute the technology mapping and produce reports with the characteristics of the resulting FSM circuits, we used the Xilinx CAD tool Vivado (version 2019.1) [40].
When we investigated FSM U 2 [30], we found that this model allows producing circuits with less area and power consumption if R + L > S L . In [30], we divided the benchmarks into five groups using the values of L + R and S L . If L + R ≤ 6, then benchmarks belong to group 0 (trivial FSMs); if L + R ≤ 12, then to group 1 (simple FSMs); if L + R ≤ 18, then to group 2 (average FSMs); if L + R ≤ 24, then to group 3 (big FSMs); otherwise, they belong to group 4 (very big FSMs). As our research [30] shows, the larger the group number, the bigger the gain from using our method. We use the same division of benchmarks in this article too.
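The division into groups can be written as a small function. This is a sketch only: the function and parameter names are illustrative, and it assumes that L is the number of FSM inputs and R is the number of state variables, as defined earlier in the article.

```python
def fsm_group(num_inputs: int, num_state_vars: int) -> int:
    """Return the benchmark group (0..4) from the value of L + R.

    Thresholds follow the text: <=6 trivial, <=12 simple, <=18 average,
    <=24 big, otherwise very big.
    """
    total = num_inputs + num_state_vars
    for group, upper_bound in enumerate((6, 12, 18, 24)):
        if total <= upper_bound:
            return group
    return 4


GROUP_NAMES = ("trivial", "simple", "average", "big", "very big")
```

For example, an FSM with L = 7 and R = 5 gives L + R = 12 and falls into group 1 (simple FSMs).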
In the section State-of-the-art, we have justified the choice of three methods for comparison with our approach. We chose the method auto of Vivado as a method based on binary state codes. Additionally, we used the method one-hot of Vivado. Due to its high reputation, we chose JEDI-based FSMs as a basis for comparison too. Our approach is a competitor to the method from work [30]. Thus, we chose U 2 -based FSMs with three levels of logic blocks as the fourth method used in experiments. The results of experiments are shown in Table 7 (the number of LUTs) and Table 8 (the maximum operating frequency). These results were taken from reports generated by Vivado.
We use the same organization of Tables 7 and 8. Their rows are marked by the names of benchmarks and their columns by the investigated design methods. The row "Total" includes the results of summation for the corresponding values. The summarized characteristics of our approach (U 4 -based FSMs) were taken as 100%. The row "Percentage" shows the summarized characteristics of FSM circuits implemented by the other methods as percentages of the corresponding totals for our approach. Let us point out that the model U 1 was used for designs with auto, one-hot, and JEDI.
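The "Total" and "Percentage" rows can be reproduced with a short sketch. The per-benchmark numbers below are made up purely for illustration and are not taken from Tables 7 or 8; the function names and the method key "U4" are likewise assumptions.

```python
from typing import Dict, List


def total_row(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Sum each method's column over all benchmarks (the "Total" row)."""
    return {method: sum(values) for method, values in table.items()}


def percentage_row(totals: Dict[str, float], reference: str = "U4") -> Dict[str, float]:
    """Express every total as a percentage of the reference method's total."""
    ref_total = totals[reference]
    return {m: round(100.0 * t / ref_total, 2) for m, t in totals.items()}


# Hypothetical LUT counts for three benchmarks (illustration only).
example_table = {
    "auto": [50, 40, 60],
    "U4": [40, 30, 50],
}
example_totals = total_row(example_table)
example_percentages = percentage_row(example_totals)
```

In this illustration, the totals are 150 and 120 LUTs, so the "Percentage" entry for auto is 125%, meaning 25% more LUTs than the reference.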
As follows from Table 7, the U 2 -based FSMs require fewer LUTs than the other investigated methods. Our approach produces circuits having 8.84% more LUTs than equivalent U 2 -based FSMs. However, our approach requires fewer LUTs than auto (24.86% of gain), one-hot (45.3% of gain) and JEDI-based FSMs (2.83% of gain). The higher the group, the greater the gain in LUTs relative to auto, one-hot and JEDI-based FSMs. We show these results in Figure 9.
Table 8 shows that the U 4 -based FSMs have the highest operating frequency of the investigated methods. Our method gives us a 9.85% advantage over auto. The one-hot of Vivado loses 10.48% to our approach. The U 4 -based FSMs provide a 3.18% gain compared to JEDI-based FSMs. Finally, the average frequency of the U 2 -based FSMs is 12.57% lower than that of FSMs based on our approach. These results are shown in Figure 10.
To clarify how the gain in LUTs depends on the FSM group, we have created Table 9 (gain in LUTs for group 0), Table 10 (gain in LUTs for group 1) and Table 11 (gain in LUTs for groups 2-4). Additionally, we present these results as graphs in Figures 11-13, respectively. To clarify how the gain in frequency depends on the FSM group, we have created Table 12 (gain in frequency for group 0), Table 13 (gain in frequency for group 1) and Table 14 (gain in frequency for groups 2-4). Additionally, we present these results as graphs in Figures 14-16, respectively.
Table 9. Gain in LUTs for group 0 (LUT count). Columns: Benchmark, Auto, One-Hot, JEDI, U 2 , Our Approach, Group.
Analysis of Table 9 and Figure 11 shows that the U 4 -based FSMs use more LUTs than the other investigated methods. Our method has the following loss: 44.14% compared to auto, 22.52% compared to one-hot, 45.05% compared to JEDI-based FSMs and 19.82% compared to U 2 -based FSMs. Thus, this method is not suitable for small FSMs.
As follows from Table 10 and Figure 12, the U 4 -based FSMs of group 1 require fewer LUTs than FSMs based on auto (11.54% of gain) and one-hot (44.23% of gain). However, we still lose to the JEDI-based FSMs (7.42% of loss) and U 2 -based FSMs (12.36% of loss). Note that the loss decreased in comparison with the group 0.
As follows from Table 11 and Figure 13, the U 4 -based FSMs of groups 2-4 require fewer LUTs than FSMs based on auto (37.72% of gain), one-hot (53.44% of gain) and JEDI-based FSMs (12.13% of gain). Only U 2 -based FSMs have better results, and our approach has 6.27% of loss. Note that the loss decreased in comparison with the group 1. Thus, starting from average FSMs, our approach loses only to the U 2 -based FSMs.
As follows from Table 12 and Figure 14, the U 4 -based FSMs of group 0 are faster than U 2 -based FSMs (5.38% of gain). In this group, the best results belong to JEDI-based FSMs. They have the following gains: (1) 0.9% regarding auto; (2) 3.57% regarding one-hot; (3) 12.73% regarding U 2 -based FSMs; (4) 7.35% regarding our approach. Thus, for the group 0, there is no sense in applying our approach. However, starting from the group 1, our method allows producing faster circuits than the other investigated methods.
As can be seen from Table 7, the U 2 -based FSMs require fewer LUTs compared to other methods. Analysis of Table 8 shows that U 4 -based FSMs are the ones with the highest maximum operating frequency compared to other methods. The overall design quality can be estimated by the product of used resources [63] (for example, the chip area occupied by a circuit) and the latency time. As in [63], we use the number of LUTs to compare the areas required for FSM circuits based on different models (auto, one-hot, JEDI, U 2 and U 4 ). As a rule, an FSM is only a part of a digital system. We do not know how many cycles a system needs to perform a required task. Thus, we cannot find absolute values of latency times. However, for a relative evaluation of different models, it is sufficient to know only the cycle time.
In this article, we have performed a generalized comparison of the models used in experiments. As a generalized assessment, we used the result of multiplying the number of LUTs in an FSM circuit by the cycle time. The numbers of LUTs are taken from Table 7. To calculate the cycle times in nanoseconds, we used the operating frequencies from Table 8. The area-time products measured in LUTs × ns are shown in Table 16.
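The area-time product described above can be sketched as follows. The sketch assumes that the operating frequencies are given in MHz, so that the cycle time in nanoseconds is 1000 divided by the frequency; the function names and the sample numbers are illustrative and are not taken from Tables 7, 8 or 16.

```python
def cycle_time_ns(freq_mhz: float) -> float:
    """Cycle time in nanoseconds for a clock frequency given in MHz."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / freq_mhz


def area_time_product(num_luts: int, freq_mhz: float) -> float:
    """Area-time product in LUTs x ns: LUT count multiplied by cycle time."""
    return num_luts * cycle_time_ns(freq_mhz)


# Hypothetical circuit: 24 LUTs at 200 MHz (5 ns cycle time).
example_product = area_time_product(24, 200.0)
```

A smaller product is better: a circuit with slightly more LUTs can still win if its cycle time is sufficiently shorter.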
To better evaluate the chip resources used by FSM circuits, we have created Table 15. It contains the numbers of flip-flops required for implementing the state registers. As follows from Table 15, the registers of FSMs obtained using auto, JEDI and the model U 2 contain the same number of flip-flops. They use the minimum number of flip-flops, determined as R = ⌈log₂ M⌉. The largest number of flip-flops is consumed by FSMs based on the one-hot state assignment (eight times more than, for example, U 2 -based FSMs and 4.97 times more than U 4 -based FSMs). Our approach gives a gain of 397% compared to one-hot-based FSMs, but loses 37% to the other investigated methods. If we find the difference between, for example, the number of flip-flops in registers of U 2 - and U 4 -based FSMs, we can see that the difference decreases as the group number decreases.
The results of our experiments show that the proposed approach can be used instead of other models starting from simple FSMs. The U 2 -based FSMs have fewer LUTs than other models. However, starting from average FSMs, our approach allows producing circuits having slightly larger numbers of LUTs with significantly higher maximum operating frequencies. Additionally, our approach provides better area-time products starting from average FSMs. It has rather good potential and can be used in targeting FPGA-based Mealy FSMs.

Conclusions
Modern FPGA chips have reached such a level that quite complex systems can be implemented using only a single chip. At the same time, significant parts of the digital systems are implemented using LUTs having rather small numbers of inputs. The value S L = 6 is considered as optimal [19,20], but it is too small compared to the number of inputs and outputs of FSMs from modern digital systems. To design these complex FSMs with the use of such simple elements, it is necessary to apply the methods of functional decomposition. As a rule, the functional decomposition results in LUT-based FSM circuits having many logic levels and very complicated systems of interconnections.
Different methods of structural decomposition can be used to optimize the characteristics of FPGA-based FSM circuits. Our research [30,60] shows that the FSM circuits based on structural decomposition possess significantly better characteristics (fewer LUTs, higher maximum operating frequency, lower power consumption) than their counterparts based on functional decomposition. It is very important that the FSM circuits based on structural decomposition have regular systems of interconnections and predicted numbers of levels of logic. In the best case, each logic block of an FSM circuit has only a single level of LUTs.
In this paper, we propose a novel approach aimed at the optimization of LUT-based Mealy FSMs. The proposed method leads to Mealy FSM U 4 . Two methods of structural decomposition are the cornerstones of our approach. They are: (1) the transformation of codes of collections of outputs into state codes and (2) the extension of state codes. The second method is new and is proposed in this paper. To increase the maximum operating frequency, we encode the FSM states using more than the minimum number of state variables determined by (1). Our approach leads to Mealy FSM circuits with three levels of LUTs and regular systems of interconnections. As in single-level FSMs U 1 , FSM outputs are generated simultaneously with input memory functions. As a result, our approach provides an increase in maximum operating frequency, accompanied by a small increase in the number of LUTs compared to equivalent three-level FSMs.
The results of our experiments clearly show that the proposed approach can be used instead of other models starting from simple FSMs. The U 2 -based FSMs have fewer LUTs than other models. However, starting from average FSMs, our approach allows producing circuits having slightly larger numbers of LUTs with significantly higher maximum operating frequency. Additionally, our approach provides better area-time products starting from average FSMs. Thus, our approach can be used if either the performance or the area-time product is the dominant characteristic of a digital system.
We are currently considering several areas of research. We intend to explore the possibility of applying the proposed approach to FPGA chips of Intel (Altera). We will also try to adapt this approach for optimizing characteristics of Moore finite state machines.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: