Improving Characteristics of LUT-Based Mealy FSMs with Twofold State Assignment

: Practically, any digital system includes sequential blocks. This article is devoted to a case when sequential blocks are represented by models of Mealy ﬁnite state machines (FSMs). The performance (maximum operating frequency) is one of the most important characteristics of an FSM circuit. In this article, a method is proposed which aims at increasing the operating frequency of LUT-based Mealy FSMs with twofold state assignment. This is done using only extended state codes. Such an approach allows excluding a block of transformation of binary state codes into extended state codes. The proposed approach leads to LUT-based Mealy FSM circuits having two levels of logic blocks. Each function for any logic level is represented by a circuit including a single LUT. The proposed method is illustrated by an example of synthesis. The results of experiments conducted with standard benchmarks show that the proposed approach produces LUT-based circuits with signiﬁcantly higher operating frequency than it is for circuits produced by other investigated methods (Auto and One-hot of Vivado, JEDI, twofold state assignment). The performance is increased by an average of 15.9 to 25.49 percent. These improvements are accompanied by a small growth of the numbers of LUTs compared with circuits based on twofold state assignment. Our approach provides the best area-time products compared with other investigated methods. The advantages of the proposed approach increase as the number of FSM inputs and states increases. of experiments show that the performance gain increases as the complexity of an FSM (the number of FSM and state increases. At the same time, the increase in the FSM complexity leads to a decrease in the loss in the number of LUTs. Furthermore, our approach provides better area-time products starting from FSMs for which the total number of inputs and state variables exceeds twice the


Introduction
Practically, any digital system includes various sequential blocks [1][2][3]. To specify the behaviour of a sequential block, it is necessary to use some formal model. In many cases, the behaviour of sequential blocks is specified by Mealy finite state machines (FSMs) [4][5][6][7]. Three main characteristics determine the quality of an FSM circuit: the chip area occupied by the circuit, performance (either minimum propagation time or maximum operating frequency), and the consumption of power. These characteristics are strongly interconnected [8]. As a rule, the occupied chip area significantly affects other characteristics of an FSM circuit [9,10]. Various methods of structural decomposition can be used for optimizing the size of the occupied chip area [11]. These methods have one serious drawback: they lead to multi-level FSM circuits with a significant decrease in performance compared with equivalent single-level circuits.
If a multi-level FSM circuit does not provide the required operating frequency, then it is necessary to reduce the number of levels. This reducing must be performed in a way that increases as little as possible the occupied chip area. One of these methods is proposed in [12]. This is a method of twofold state encoding. The method is aimed at Mealy FSMs implemented with field programmable gate arrays (FPGAs) [13][14][15]. Now, FPGAs are widely used for implementing various digital systems [16][17][18]. Due to this, we chose the FPGA-based Mealy FSMs as a research object. Our current article considers Mealy FSM circuits implemented using three main components of FPGA chips. These components are look-up table (LUT) elements, programmable flip-flops and interconnections of FPGAs. We focus our current research on solutions of Xilinx which is the largest manufacturer of FPGA chips [19,20]. So, we consider ways of increasing the performance of LUT-based Mealy FSMs.
A LUT is a logic block having N I LUT inputs and a single output [14,21]. A single LUT may implement an arbitrary Boolean function having up to N I LUT arguments. Unfortunately, the value of N I LUT is very small [20]. If a Boolean function depends on more than N I LUT variables, then it is necessary to apply methods of functional decomposition of this function [19,22]. It is known, the applying functional decomposition leads to multi-level FSM circuits with "spaghetti-type" interconnections [23,24]. FSM circuits are represented by systems of Boolean functions (SBFs). To implement an LUT-based FSM circuit, it is necessary to transform an initial SBF into a network of LUTs of a particular FPGA chip. This is a step of technology mapping [25,26]. The outcome of technology mapping tremendously affects all main characteristics of FSM circuits [20,27,28]. To implement an LUT-based FSM circuit, it is necessary to use LUTs, programmable flipflops, programmable interconnections, circuits of synchronization, and input-output blocks.
The article [26] notes that time delays of the interconnection system are starting to play a major role in comparison with LUT delays. Furthermore, more than 70% of the power dissipation is due to the interconnections [29]. So, the optimization of interconnections leads to increasing the performance and reducing the power consumption of LUT-based FSM circuits. This can be done, for example, using the twofold state assignment [12].
Obviously, increasing the number of LUT inputs leads to a decrease in both the number of LUTs and their levels in FSM circuits. However, as shown in [30,31], for the foreseeable future, it is very difficult to expect an increase in the number of LUT inputs. Basically, modern LUTs have no more than 6 inputs [14,21]. An increase in the number of inputs leads to an imbalance for the main characteristics of an LUT circuit. However, this state of affairs leads to an imbalance between an increasing number of arguments in SBFs representing FSM circuits and a fairly small number of LUT inputs. However, the increasing complexity of modern digital systems is accompanied by an increase in the number of arguments in functions representing FSM circuits. This imbalance is the source of the need to improve synthesis methods of LUT-based FSMs.
Our current article is devoted to synthesis of multi-level LUT-based circuits of Mealy FSMs obtained using the method of twofold state assignment [12]. The twofold state assignment allows improving characteristics of LUT-based Mealy FSMs compared with their counterparts based on functional decomposition [32]. However, applying twofold state assignment leads to FSM circuits with three levels of LUTs. These circuits are significantly slower than their counterparts with fewer logic levels. If performance is the dominant quality factor, then the number of levels in FSM circuit should be reduced. It is extremely important that the increase in operating frequency is accompanied by as small an increase in the number of LUTs as possible (compared with the original three-level FSM circuit).
Recently, we have published works [11,12,32] where a method of twofold state assignment have been proposed. In these works, each FSM state has two codes. The first of them represents a state as an element of the set of FSM states. The second code is an extended state code (ESC). The ESC represents a state as an element of some class of partition of the set of states. This approach requires using an additional block generating ESCs. This block introduces an additional level of logic to the FSM circuit. In our current article, we propose a method which allows eliminating this block. The elimination of this additional block is the main feature that distinguishes our current work from works [11,12,32]. This approach leads to an increase in the value of FSM operating frequency compared with FSMs based on methods [11,12,32].
The main contribution of this paper is a novel design method aimed at increasing the operating frequency of LUT-based Mealy FSMs with twofold state assignment. The main idea of the proposed approach is to use only extended state codes. This approach leads to a completely new structural diagram of the LUT-based Mealy FSM. This diagram does not include a block transforming binary state codes into extended state codes. Our current research shows that this approach leads to FSM circuits having practically the same amount of LUTs as FSM circuits based on the twofold state assignment. The experimental results show that this method allows increasing the maximum operating frequency of LUT-based FSMs compared with equivalent FSMs obtained using some known methods of FSM design.
The further text of the article includes six sections. The second section shows the background of single-level LUT-based Mealy FSMs. The third section discusses the state-ofthe-art in the synthesis of LUT-based FSMs. The fourth section represents the main idea of the proposed method. The fifth section includes an example of FSM synthesis using the extended state codes. The results of experiments are shown and discussed in the sixth section of the article. A short conclusion ends the article.

Single-Level LUT-Based Mealy FSMs
FPGAs manufactured by Xilinx include a lot of configurable logic blocks (CLB) [1,15], such as embedded memory blocks, digital signal processors, even microprocessors. The CLBs are connected using a programmable routing matrix [33]. In this paper, we consider CLBs including LUTs, multiplexers and programmable flip-flops. A LUT has N I LUT inputs and a single output. Networks of LUTs implement systems of Boolean functions representing an FSM circuit.
A LUT can implement an arbitrary Boolean function which depends on up to N I LUT arguments. A combinational output of LUT may be connected with a flip-flop. Usually, D flip-flops are used to organize an FSM state register (SRG) [34,35]. The value of a function may be loaded into a flip-flop using the pulse of synchronization Clk. It is possible to make output of flip-flop equal to zero using the pulse Reset. To select the type of a CLB output (combinational or registered), a programmable multiplexer is used.
In practical digital systems, an SBF representing an FSM circuit may depend on up to 50-70 literals [4,33]. At the same time, modern LUTs have no more than 6 inputs. So, there is a contradiction between a large number of arguments and a very small number of LUT inputs. This leads to the need for functional decomposition (FD) of SBFs representing an FSM circuit [36,37]. However, FD-based circuits have a lot of logic levels and complex systems of "spaghetti-type" interconnections [38,39].
Very often, STGs represent FSMs. An STG is a directed graph which vertices correspond to FSM states and edges correspond to interstate transitions [4]. Each edge is directed from the current state s m ∈ S to the state of transition s T ∈ S . There are H edges in an STG. The h-th edge is marked by the pair < I h , O h >, where I h is a conjunction of inputs causing a transition < s m , s T >, O h is a collection of outputs generated during the transition < s m , s T >. Consider a Mealy FSM A 0 represented by the STG shown in Figure 1. An STG can be transformed into an STT. Each edge of an STG corresponds to a single row of an STT. For example, FSM A 0 can be represented by its STT (Table 1) obtained from the STG (Figure 1).
There are five columns in an STT [4]. The first four of them are the following: a current state s m ; a state of transition s T ; a conjunction of inputs I h ; a collection of outputs O h . The column h includes the numbers of interstate transitions (h ∈ {1, . . . , H}). For example, the following functions are determined by the fourth row of Table 1: δ(s 2 , i 4 ) = s 3 and λ(s 2 , i 4 ) = o 3 . For a row with the unconditional transition, there is I h = 1. In Table 1, this is the row 12.
Let us encode FSM states s m ∈ S by binary codes K(s m ) having R s bits where To encode states, we use state variables from the set T = {T 1 , . . . , T R s }. These variables are kept into the SRG. To change the contents of SRG, the special input memory functions (IMFs) are used. They form a set D = {D 1 , . . . , D R s }. Now, we can transform an STT into a direct structure table (DST). A DST is an extension of the STT with the three following columns: K(s m ), K(s T ), and D h [1]. The first of them contains a code of a current state, the second includes a code of a state of transition, and there is a collection of IMFs in the third of them. This collection includes IMFs equal to 1 to load the code K(s T ) into SRG.
Using a DST, we can derive two following SBFs representing an FSM circuit: The SBF (2) represents the function of transitions, the SBF (3) represents the function of outputs. Using the terminology from [42], we can state that SBFs (2)  In P Mealy FSMs, the block of transition logic is determined by SBF (2), the block of output logic is specified by SBF (3). The inputs of register SRG are connected with outputs of the block of transition logic. In each cycle of FSM operation, the SRG contains a current state code. The pulse of synchronization Clk allows changing the contents of SRG. To load the zero code of the initial state s 1 ∈ S into SRG, it is necessary to generate the pulse Reset. We discuss a case when both blocks are implemented using LUT-based CLBs. In this case, the flip-flops of SRG are distributed among LUTs of the block of transition logic.
The analysis of systems (2) and (3) shows that both input memory functions and FSM outputs depend on state variables and FSM inputs. This peculiarity of Mealy FSMs is used for optimizing LUT-based FSM circuits [11].

Optimizing Circuits of FPGA-Based Mealy FSMs
There are a significant number of methods for synthesis of LUT-based FSMs [11,12,23,24,26,32,[36][37][38]41,43,44]. It is quite possible that the quality of FSM circuits obtained by different synthesis methods will differ significantly. As a rule, the quality of an FSM circuit is determined by a combination of three main characteristics. These characteristics are the following: (1) chip resources used for implementing the circuit; (2) the maximum performance and (3) the power consumption [45,46]. In the case of LUT-based FSMs, the following chip resources are necessary: (1) LUTs; (2) programmable flip-flops; (3) programmable interconnections; (4) the circuit of synchronization and (5) the programmable input-outputs of a chip. Obviously, the best FSM circuit consumes the minimum amount of chip resources, has the maximum possible operating frequency, and consumes the minimum power.
To improve the quality of an FSM circuit, it is necessary to solve some optimization problems [47,48]. In this article, we propose a method for improving the performance (the maximum operating frequency) of LUT-based Mealy FSMs.
The functions (2) and (3) are represented as sum-of-products (SOPs) [4]. These SOPs include product terms F h corresponding to rows of a DST. A term F h is determined as the following conjunction: In (4), the symbol S m stands for a conjunction of state variables corresponding to the code of a current state s m ∈ S from the h-th row of the DST.
Each function f j ∈ D ∪ O depends on NL( f j ) literals, where a literal is either direct or compliment value of a Boolean variable [4]. Consider the following condition If condition (5) takes place, then there is exactly a single LUT in the circuit implementing the function f j ∈ D ∪ O. If condition (5) is violated, then the corresponding LUT-based circuit is a multi-level one.
To design multi-level LUT-based FSM circuits, various methods of functional decomposition (FD) may be used [22]. During the process of FD, an initial SOP is broken down by partial SOPs corresponding to some additional functions. This process is terminated when each partial SOP includes no more than N I LUT literals. Different partial SOPs of a function f j ∈ D ∪ O may include the same inputs i l ∈ I or/and the same state variables T r ∈ T. Thus, there is a duplication of literals in different partial SOPs of the original SOP. This phenomenon leads to a significant complication of the interconnection system. In turn, this not only complicates the placement and tracing process, but also reduces performance and increases power consumption compared with an equivalent single-level circuit [11,29].
If condition (5) is true for all functions representing an FSM circuit, then the number of LUTs in the circuit is equal to N + R S . This is the best possible LUT count. If condition (5) is violated, then the LUT count is equal to R S + N(F), where N(F) is the number of additional functions different from f j ∈ D ∪ O.
To optimize FSM circuits, it is necessary to reduce the value of N(F). The importance of this problem has led to the development of a significant number of methods of FD [22]. Various algorithms of FD are included into CAD tools aimed in implementation of FPGAbased digital systems.
The values of NL( f j ) could be reduced due to proper state assignment [4]. For example, it is possible to represent a state using only a single state variable. This is achieved by the one-hot state assignment, when R S = M [49]. The one-hot state assignment requires using the SRG with M bits. However, this is not a problem because modern FPGAs include a lot of programmable flip-flops. Due to this, this approach is very popular in FPGA-based design. For example, this method is used in the academic CAD system ABC by Berkeley [50,51]. It is also used in industrial CAD packages such as, for example, Vivado of Xilinx [52] and Quartus of Intel (Altera) [53].
The Equation (1) determines so called maximum binary state codes [4]. The negative effect of one-hot state assignment is an increase in the number of IMFs compared with their minimum possible number (1). However, these IMFs are much simpler than in the case of maximum binary state assignment [1]. These approaches have been compared, for example, in [54]. The research [54] shows that using one-hot codes improves FSM characteristics if there is M > 16. However, the number of state variables is not the only factor influencing the circuit characteristics. The limited number of LUT inputs increases the effect of the number of FSM inputs on the characteristics of LUT-based FSM circuits [1]. It is shown in [35] that the one-hot state assignment produces worse FSM circuits if the number of FSM inputs exceeds 10.
So, in one case the best results are produced by the method of maximum binary state assignment, and in the other case it is better to use the one-hot state codes. Thus, it is necessary to check which method will give the best results for a particular FSM. Due to this, we have compared the FSM circuits produced by our proposed approach with FSM circuits produced by three other methods of state assignment. As a base for comparison, we chose the algorithm JEDI [9,55], the methods of binary state assignment Auto and the One-hot state assignment of Vivado [52] by Xilinx [15]. We chose Vivado because it aims in Xilinx FPGA chips. We chose JEDI because it is one of the best maximum binary state assignment methods [5].
It is possible to encode states in a way reducing the power consumption [56]. The majority of such methods are based on reducing the switch activity of an FSM circuit [48,57]. To do it, it is necessary to minimize the Hamming distance for codes of states with the maximum amount of transitions [57]. However, our research [12,38,39] shows that the power consumption can be reduced due to the reducing the number of interconnections inside an FSM circuit. To reduce the number of interconnections, it is necessary to minimize the numbers of arguments in SBFs (2) and (3) [4]. This can be done using various methods of state assignment.
The structural decomposition is an efficient way of reducing LUT counts in Mealy FSMs logic circuits [11]. The main idea of these methods is the elimination of direct connection between FSM inputs i l ∈ I and state variables T r ∈ T , on the one hand, and outputs o n ∈ O and IMFs D r ∈ D , on the other hand. This is connected with introducing additional functions forming a set F having N(F) elements. The functions f j ∈ F depend on FSM inputs and state variables. In turn, FSM outputs and IMFs use these additional functions as arguments creating literals of corresponding SOPs. To optimize LUT count due to applying the methods of structural decomposition (SD), the following conditions should take places [11]: All known methods of SD are based on conditions (6) and (7). These methods are analysed, for example, in [11]. If condition (5) is violated for some functions f j ∈ F, then joint application of FD-and SD-based decomposition methods is necessary [12].
One of the SD-based methods is a method of twofold state assignment [12,32]. Let us analyse this method. The method is based on construction a partition π S = {S 1 , . . . , S K } of the set of Mealy FSM states. Each class S k ∈ π S includes M k states. The maximum binary codes C(s m ) are used for encoding states as elements of some class S k ∈ π S . There are R(S k ) bits in codes C(s m ) of states s m ∈ S k , where To encode states s m ∈ S k , the variables τ r ∈ τ k are used. The sets τ 1 , . . . , τ K form a set τ having R 0 elements: Each class S k ∈ π S determines three sets. The set I k ⊆ I includes inputs causing transitions from states s m ∈ S k . There are L k elements in the set I k ⊆ I . The set O k ⊆ O includes outputs generating during the transitions from states s m ∈ S k . The set D k ⊆ D includes IMFs equal to 1 during the transitions from states s m ∈ S k .
This method can be used if the following condition takes place: In this case, it is possible to use the model of P T Mealy FSM ( Figure 3).
A class S k ∈ τ S corresponds to a BlockSk implementing SBFs The circuit of each BlockSk is implemented with LUTs having N I LUT inputs. The functions (11) and (12) represent the partial SOPs of FSM input memory functions and outputs. The BlockTO creates final values of functions from the set D ∪ O. To do it, each LUT of this block implements disjunctions having up to K inputs: The outputs of LUTs producing functions (13) are connected with inputs of D flip-flops. This explains the existence of pulses Reset and Clk as inputs of BlockTO. So, this block produces outputs o n ∈ O and state variables T r ∈ T.
To create SBFs (11) and (12), it is necessary to have state variables τ r ∈ τ. These variables are generated by the Blockτ. This block transforms state variables T r ∈ T and generates the following SBF: As follows from [12], the circuits of P T FSMs require fewer LUTs than the circuits of equivalent P Mealy FSMs. If condition (10) takes place, circuits of P T FSMs also have fewer levels of logic. Due to this, they are faster than the circuits of equivalent P Mealy FSMs.
As follows from Figure 3, both inputs i l ∈ I and state variables τ r ∈ τ enter only LUTs of the first level of FSM circuit. The partial functions (11) and (12) enter only LUTs of the second level of FSM circuit. At last, the state variables T r ∈ T enter only LUTs of the Blockτ creating the third level of logic. Due to this, the circuits of P T FSMs have regular systems of interconnections. This distinguishes them from the circuits of FD-based P FSMs having complex systems of "spaghetti-type" interconnections. Our research [11,32] shows that the circuits of P T FSMs consume less power than the circuits of equivalent P Mealy FSMs.
The analysis of Figure 3 shows that P T FSMs have a drawback. Namely, they include Blockτ in the path leading from inputs i l ∈ I to state variables τ r ∈ τ. The conversion (15) takes some time, which is added to the FSM cycle time. In this article, we propose a way of eliminating the Blockτ from FSM circuit. It allows reducing the cycle time and, therefore, increasing the value of maximum operating frequency of a resulting FSM circuit.

Main Idea of the Proposed Method
Assume that we have constructed a partition π S for some Mealy FSM. To eliminate the converter of state codes K(s m ) into state codes C(s m ), we propose to use extended state codes EC(s m ). For a state s m (m ∈ {1, . . . , M}), the extended state code determines the state as an element of both sets S and S k ∈ π S . The number of bits in EC(s m ) is equal to R 0 determined by (9). To encode states by ESCs, we use state variables τ r ∈ τ , where |τ| = R 0 . If s m ∈ S k , then only variables τ r ∈ τ k can differ from zero.
Using only ESCs leads to P E Mealy FSMs. There is a structural diagram of P E Mealy FSM shown in Figure 4.
In P E Mealy FSMs, the blocks of the first level implement partial functions (11) and (12). These functions enter inputs of LUTs from BlockτO. As in the case of P T FSMs, the functions (12) are transformed into FSM outputs. However, the functions (11) are transformed directly into state variables τ r ∈ τ . Due to this, there is no need in the Blockτ used in the case of P T Mealy FSM ( Figure 3).
By removing the Blockτ, the three-level FSM circuit turns into a two-level one. So, we can expect that P E Mealy FSMs have better performance than equivalent P T Mealy FSMs. Results of our experimental studies, shown in Section 6, have confirmed this assumption.
In this article, we propose a synthesis method for P E Mealy FSMs. The method aims in LUT-based FSMs where an LUT has N I LUT inputs. We assume that an FSM is represented by its state transition table. Furthermore, we assume that there exists a partition π S satisfying the condition (10). There are the following steps in the proposed synthesis method:

1.
Constructing the partition π S with the minimum cardinality number K.

2.
Encoding of FSM states by extended state codes EC(s m ).

3.
Creating DST of P E Mealy FSM.
Implementing P E Mealy FSM circuit with particular LUTs and other resources of FPGA chip.
To create the partition π S , we can use the methods from [11,32]. These methods try to minimize the number of LUTs in the resulting FSM circuit. Firstly, the number of shared outputs in the sets S k ∈ π S should be minimum one. This reduces the number of LUTs implementing partial outputs in the circuit of BlockSk. Furthermore, this allows to reduce the number of interconnections among BlockSk and LUTs of BlockτO. The circuit of block BlockτO is guaranteed to have a single level of logic if the following condition is true: If condition (16) is violated, then the circuit of BlockτO is still a single-level one, if each partial output is generated by no more than N I LUT blocks of the first level of logic. Due to this, it makes sense to minimize the appearance of shared FSM outputs in different sets S k ∈ π S . Each class S k ∈ π S is characterised by a set δ(S k ) including states of transitions from the states s m ∈ S k . The methods [11,32] minimize the appearance of shared states of transition in different classes of the partition π S . This allows minimizing the number of partial input memory functions generated by a particular block of logic. In turn, this minimizes the numbers of LUTs in circuits of all blocks of P T FSM circuit. In the case of P E FSM, we still use this method.
The difference in the organization of P T and P E FSMs leads to a change in the method of state assignment. In [32], state are encoded in a way minimizing the number of LUTs in the circuit of LUTerτ. There is no LUTerτ in P E FSMs. We propose to encode states in a way minimizing the number of LUTs generating functions (11). To do it, we encode the states s m ∈ δ(S k ) in a way maximizing the number of zeros in the same bits of codes for different states of transition within each set δ(S k ).
Consider the following example. There is a set δ(S 1 ) = {s 3 , s 5 }. If these states have the codes EC(s 3 ) = 01101, EC(s 5 ) = 10011, then all five partial input memory functions are generated by BlockS1. To generate them, it is necessary to use five LUTs. If there are codes EC(s 3 ) = 00001 and EC(s 5 ) = 00011, then only two partial input memory functions should be generated. To do it, only 2 LUTs are necessary. We use this approach to encode states of P E Mealy FSMs.

Example of Synthesis
If a model of P E FSM is used to implement the circuit of some FSM A b , then we denote this as P E (A b ). In this section, we discuss an example of synthesis of P E Mealy FSM starting from STT (Table 1). To implement the circuit of Mealy FSM P E (A 0 ), we use LUTs having N I LUT = 5.
Step 1. Using the approach in [32] gives the partition π S = {S 1 , S 2 } of the set of states. It includes the classes S 1 = {s 1 , s 3 , s 4 } and S 2 = {s 2 , s 5 , s 6 }. There is M 1 = M 2 = 3 in the discussed case. Using (8) gives R(S 1 ) = R(S 2 ) = 2. As follows from the analysis of Table 1, the following sets exist for each class S k ∈ π S : So, the following relation takes place: L 1 = L 2 = 2. Because of N I LUT = 5, the condition (10) is true for each class of the partition π S = {S 1 , S 2 }.
As we can see, there are no identical elements in the sets I 1 and I 2 (I 1 ∩ I 2 = ∅). The fewer identical elements in different sets I k ⊆ I, the fewer connections between the sources of FSM inputs and LUTs of the first level of logic. In our particular case, the absolute minimum of zero is reached.
There is only a single common output o 1 in sets O 1 and O 2 . It means that only a single LUT of BlockτO is necessary to generate FSM outputs. All other outputs are generated by LUTs from corresponding blocks of the first level of logic circuit. In general, the fewer identical elements in different sets O k ⊆ O, the smaller the number of LUTs in the second level of logic.
Step 2. There is R(S 1 ) = R(S 2 ) = 2. From (9), we can get R 0 = 4. This gives the set τ = {τ 1 , . . . , τ 4 }. We can create the following sets for the state encoding: τ 1 = {τ 1 , τ 2 } and τ 2 = {τ 3 , τ 4 }. If s m ∈ S 1 , then τ 3 = τ 4 = 0 in the code EC(s m ). There is τ 1 = τ 2 = 0 in the code EC(s m ) for the states s m ∈ S 2 . Furthermore, we should maximize the number of zeros in the same bits of codes for different states of transition within each set δ(S k ). In our example, it is possible to encode only states s m ∈ δ(S 2 ) in such a way. The outcome of state assignment is shown in Figure 5. As follows from the Karnaugh map ( Figure 5), there is τ 3 = 0 in extended codes of states s m ∈ δ(S 2 ). It means that only three LUTs are used for generating input memory functions in BlockS2. Furthermore, we can see that variables τ 3 and τ 4 are insignificant for conjunctions S m for the codes of states s m ∈ S 1 . The same is true for the variables τ 1 and τ 2 and the codes of states s m ∈ S 2 . This allows diminishing the number of literals in terms (4) up to R(S k ). For example, there is S 3 = τ 1 ∧ τ 2 . These minimized conjunctions are used in SOPs of functions (11) and (12).
Step 3. The transition from an STT to a DST is executed in the trivial way. The columns s m , s T , I h , O h , h contain the same symbols in both tables. The state codes are taken from a corresponding Karnaugh map. In the discussed example, the codes are taken from Figure 5. There is a symbol D r in the row h of the column D h , if there is 1 in the code of state of transition s T ∈ S in this row. In the discussed case, the DST of FSM P E (A 0 ) is represented by Table 2. This table includes h = 15 rows.
Step 4. To construct a table of BlockSk, it is necessary to select the rows of a DST with transitions from the states s m ∈ S k . Obviously, this is executed in a trivial way. Applying this approach to Table 2 gives two tables. Table 3 represents the BlockS1; Table 4 represents the BlockS2. There are 9 rows in Table 3 and 6 rows in Table 4. So, together, these tables include all 15 rows of the original DST (Table 2).
s 4 1100 Step 5. The table of BlockτO includes two columns. The column Function includes all functions from sets of input memory functions and FSM outputs. The column Block is divided by K sub-columns (S1, . . . , SK). The sub-column Sk corresponds to BlockSk. If a function f i ∈ D ∪ O is generated by BlockSk, then there is 1 at the intersection of the row f i and the column Sk.
In the discussed case, there are R 0 + N = 11 rows in the table of BlockτO. This block is represented by Table 5.

Function
Block Function Block S1 S2 S1 S2 There is a transparent connection between Tables 3 and 4, on the one hand, and Table 5, on the other hand. For example, there is D 3 = 0 in Table 4. So, there is 0 at the intersection of the row D 3 and the column S2 of Table 5. Next, there is o 4 = 1 in Table 3. So, there is 1 at the intersection of the row o 4 and the column S1 of Table 5. All other rows of Table 5 are filled on the base of a similar analysis.
The following systems are derived from Table 4: The following SBF is derived from Table 5: Systems (17)-(21) represent logic circuit of FSM P E (A 0 ). Let us analyse the LUT counts for each level of logic. To do it, we should analyse systems (17)- (21).
As follows from (17) and (18), there are 9 LUTs in the circuit of BlockS1. The LUT count is determined by the number of equations in (17) and (18). From SBFs (19) and (20) follows that there are 6 LUTs in the circuit of BlockS2.
If some equation of SBF (21) has more than a single product term, then the corresponding LUT is included into BlockτO. Otherwise, a corresponding function is generated by some of block of the first level of logic. So, there are 4 LUTs in the circuit of BlockτO.
So, there are 9 + 6 + 4 = 19 LUTs in the circuit of FSM P E (A 0 ). The pulses Clk and Reset enter CLBs generating functions D r ∈ D. The circuit is shown in Figure 6.  Step 7. To implement the circuit of FSM P E (A 0 ), it is necessary to use very complicated methods of technology mapping [26]. This can be done using, for example, the CAD tool Vivado by Xilinx [52]. This package solves the problems of mapping, placement, routing, testing, finding characteristics of an FSM circuit (such as the LUT count, number of CLBs, number of flip-flops, maximum operating frequency, power consumption). We do not show the results of implementation for this particular example. In the next section, we use Vivado to investigate the efficiency of our method compared with some other design methods.

Experimental Results
The results of experiments are shown in this section. To conduct experiments, we use benchmark FSMs from the library [58]. The library includes 48 benchmarks represented in the format KISS2. These benchmarks have a wide range of basic characteristics (numbers of states, inputs, and outputs). The characteristics of these benchmarks can be found in many articles and books, for example, in [11,27,37]. These benchmarks are used very often by different researchers to compare area and time characteristics of FSMs obtained using different design methods. The characteristics of the benchmarks are shown in Table 6.  To conduct the experiments, we used a personal computer with the following characteristics: CPU: Intel Core i7 6700 K 4.2@4.4 GHz, Memory: 16 GB RAM 2400 MHz CL15. As a platform for implementing FSM circuits we used the Virtex-7 VC709 Evaluation Platform (xc7vx690tffg1761-2) [59]. The FPGA chip of this platform includes LUTs with 6 inputs. We used CAD tool Vivado v2019.1 (64-bit) [52] to execute the technology mapping. The results of experiments are taken from reports produced by Vivado. As the source information for the CAD tool, we used VHDL-based FSM models obtained by the transformation of files in KISS2 format into VHDL codes. The transformation is executed by the CAD tool K2F [11].
We compared area (the LUT count) and time (the maximum operating frequency) characteristics of FSMs based on five different approaches. Three of them are P Mealy FSMs based on: (1) Auto of Vivado (it uses binary state codes); (2) One-hot of Vivado; (3) JEDI. The fourth objects for comparison are P T -based FSMs [12,32]. We compared these four FSMs with our approach.
It is known [11] that area and time characteristics of LUT-based FSM circuits depend strongly on the relation between numbers of inputs (L) and state variables (R S ), on the one hand, and the number of LUT inputs N I LUT , on the other hand. Due to this, we have divided the benchmarks into five following classes. The benchmarks belong to class of trivial FSMs (class 0) if R S + L ≤ 6. The benchmarks belong to class of simple FSMs (class 1) if R S + L ≤ 12. The benchmarks belong to class of average FSMs (class 2) if R S + L ≤ 18. The benchmarks belong to class of big FSMs (class 3) if R S + L ≤ 24. The benchmarks belong to class of very big FSMs (class 4) if R S + L > 24. As research [39] shows, the larger the class number, the bigger the gain from using methods of structural decomposition.
The results of experiments are shown in Tables 7-15. There is the same organization of Tables 7,9,[11][12][13][14][15]. We marked the table columns by the names of investigated methods. The table rows contain the names of benchmarks. Within each table, the results for the same class are shown in adjacent rows. There are results of summation of values from columns in the row "Total". The row "Percentage" includes the percentage of summarized characteristics of FSM circuits produced by other methods, respectively, to P E -based FSMs. To point that the model of P FSM is used for methods Auto, One-hot, and JEDI, we name these methods as P-Auto, P-One-hot, and P-JEDI. Tables 8, 10 and 12 show summarized experimental results for different classes of benchmark FSMs. Let us analyse the experimental results taken from reports produced by Vivado.  As follows from Table 7, the P T -based FSMs consume the minimum amount of LUTs compared with other investigated approaches. The P E -based FSMs require 7.7% more LUTs than equivalent P T -based FSMs. However, all other FSMs require more LUTs than it is for our approach. Our approach consumes fewer LUTs than it is for P-Auto (25.47% of gain), P-One-hot (46.1% of gain) and P-JEDI-based FSMs (6.67% of gain).
To show a dependence of the gain in LUTs on the class of benchmarks, we have created Table 8. It shows the gain for classes 0, 1 and 2-4.   As follows from Table 8, the P E -based FSMs have worse results than equivalent P-Auto FSMs (12.68% of loss on the number of LUTs), P-JEDI FSMs (14.08% of loss on the number of LUTs) and P T -based FSMs (12.68% of loss on the number of LUTs). So, for the class 0, the P-JEDI based FSMs have the minimum amount of LUTs. For the class 1, our approach loses out relative to the other two approaches (11.01% to P E -based FSMs and 1.46% to P-JEDI-based FSMs). However, our approach produces more economical circuits than it is for P-Auto (18.71% of gain) and P-One-hot (53.51% of gain). The P E -based FSMs only lose to the P T -based FSMs (6.23%). So, for the classes 2-4, our approach defeats P-Auto FSMs (30.35% of gain), P-One-hot FSMs (45.23% of gain), and P-JEDI FSMs (6.13% of gain). It means that our approach allows obtaining FSM circuits with a number of LUTs comparable to this number for equivalent P-JEDI-and P T -based FSMs. At the same time, the loss of our approach decreases as the complexity of FSMs increases: the larger the class number, the smaller our loss relative to the best solutions (P-JEDI-and P T -based FSMs).
We thought that our approach would allow us to obtain FSM circuits with higher performance than it is for circuits based on models of either P or P T FSMs. To test this assumption, we have created Table 9. It includes values of maximum operating frequency measured in megahertz.
As follows from Table 9, the P E -based FSMs have the higher values of maximum operating frequency compared with other investigated FSMs. Our approach provides the following gain: (1) 25.49% compared with P-Auto-based FSMs; (2) 26.09% compared with P-One-hot-based FSMs; (3) 10.06% compared with P-JEDI-based FSMs; (4) 15.9% compared with P T -based FSMs. Our research has shown that the frequency gain depends on the class to which an FSM belongs. This conclusion is supported by data from Table 10. As follows from Table 10, P E -based FSMs have the same operating frequency as equivalent P T FSMs from the class 0. We can explain this phenomenon by the fact that there is no code transformer al P T FSMs. Furthermore, P-JEDI-based FSMs have slightly higher frequency (0.84%). However, our approach gives a slight advantage over P-Auto-based FSMs (0.07%) and P-One-hot-based FSMs (2.51%). It follows from Table 10 that for trivial automata, the method of organizing the FSM circuit is practically irrelevant. The difference in frequency depends mainly on the state encoding method.
Starting from simple FSMs (the class 1), our approach allows producing the fastest circuits. There is the following gain in maximum operating frequency: (1) 24.68% compared with P-Auto-based FSMs; (2) 24.77% compared with P-One-hot-based FSMs; (3) 19.33% compared with P-JEDI-based FSMs; (4) 15.68% compared with P T -based FSMs. The gain is even greater for FSMs of classes 2-4. This is the following: (1) 39.94% compared with P-Auto-based FSMs; (2) 40.1% compared with P-One-hot-based FSMs; (3) 32.02% compared with P-JEDI-based FSMs; (4) 24.64% compared with P T -based FSMs. As we can see, starting from simple FSMs, the difference in frequency depends mainly on the architecture of FSM.
When comparing different variants of the FSM circuit implementation, an integral estimate is often used, which is equal to the product of the chip area occupied by a circuit by the performance [45]. In the case of LUT-based FSMs, the circuit quality is estimated by the product of the LUT count by the minimum cycle time [45]. Obviously, the time of cycle is inversely to the operating frequency. The lower the value of this product, the better is the quality of the corresponding FSM circuit.  The results of the comparison relative to this estimate are shown in Table 11. The time of cycle is represented in nanoseconds. So, the data inside Table 11 are represented as "number of LUTs × nsecs".
As follows from Table 11, our approach leads to FSM circuits with better integral estimates than it is for other investigated methods. There is the following gain in the integral estimates: (1) 101.63% compared with P-Auto-based FSMs; (2) 133.62% compared with P-One-hot-based FSMs; (3) 45.54% compared with P-JEDI-based FSMs; (4) 16.89% compared with P T -based FSMs. This is connected with the fact that our approach produces two-level FSM circuits. At the same time, a circuit for each function for any level of logic requires only a single LUT.
To check the dependence of the integral estimate on the value of R S + L, we split Table 11 and built Table 12, which shows integral estimates for FSMs from classes 0, 1 and 2-4. In the case of class 0 (Table 12), the circuits of P E -based FSMs have the worst values of integral estimates compared with the circuits produced by P-Auto, P-JEDI, and P T FSMs. There is the following loss: (1) 10.78% compared with P-Auto; (2) 12.78% compared with P-JEDI and (3) 12.41% compared with P T FSMs. So, the loss is approximately the same relative to these three models. However, P E -based FSMs have a gain of 27.07% compared with LUT-based P-One-hot FSMs.  Starting from simple FSMs (the class 1), our approach allows producing the circuits with significantly better integral estimates as their counterparts. There is the following gain in values of area-time products: (1) 63.38% compared with P-Auto-based FSMs; (2) 110.11% compared with P-One-hot-based FSMs; (3) 24.27% compared with P-JEDIbased FSMs; (4) 5.01% compared with P T -based FSMs. The gain is even greater for FSMs of classes 2-4. This is the following: (1) 124.33% compared with P-Auto-based FSMs; (2) 150.66% compared with P-One-hot-based FSMs; (3) 58.24% compared with P-JEDIbased FSMs; (4) 23.48% compared with P T -based FSMs. As we can see, starting from simple FSMs, the difference in the values of area-time products depends mainly on the architecture of FSM. Due to this, the circuits based on P T FSMs also benefit in comparison to P-Auto, P-One-hot, and P-JEDI FSMs.
As follows from results of experiments, using the model of P E FSMs allows obtaining FSM circuits with higher operating frequency and less values of integral estimates than they are for other investigated models. Winning starts already from simple FSMs for whom the following relation takes place: (L + R S ) − N I LUT > 0. The gain from using our method increases as the difference between the number of FSM inputs and state variables, on one side, and the number of inputs of LUT, on the other side, increases.
In our previous papers [12,32,38,39], we have proposed various methods for improving characteristics of LUT-based Mealy FSMs. All these methods lead to three-level FSM circuits. In [12], there is proposed a synthesis method based on the twofold state assignment and one-hot encoding of outputs. In [32], there is proposed a synthesis method based on the twofold state assignment and encoding of collections of outputs. These methods are also shown in our book [11]. In [38], there is proposed a synthesis method based on the replacement of FSM inputs and encoding of collections of outputs. In [39], there is proposed a synthesis method based on the transformation of codes of collections of outputs into state codes.
To compare our new results with results [12,32,38,39], we have created three additional tables. Table 13 includes the experimental results for the number of LUTs. Table 14 shows the results for the maximum operating frequency. Table 15 contains the products of LUT counts by propagation times.    We have proposed the method based on extended state codes to improve the time characteristics of P T FSMs. Note that the gain in frequency is accompanied by a slight increase in the number of LUTs compared with equivalent P T FSMs. We think that our approach can be used instead of P T FSMs if the performance is the main criterion for the optimality of LUT-based FSM circuits.

Conclusions
Modern FPGAs have up to 7 billion transistors [13]. It means that a very complex digital system may be implemented using a single FPGA chip. The complexity of the implemented systems is constantly increasing, but the number of LUT inputs remains very small. As research [30,31] states, there is no sense in having LUTs with more than 6 inputs. If an FSM circuit is represented by functions for which the condition (5) is violated, then the technology mapping is based on applying various methods of functional decomposition. In turn, this leads to multi-level LUT-based FSM circuits having complicated systems of interconnections.
The characteristics of LUT-based FSM circuits may be improved using various methods of structural decomposition [11]. Very often, FSM circuits based on the structural decomposition have much better characteristics compared with their counterparts based on the functional decomposition [11,12,38,39]. Our research [12] shows that LUT-based Mealy FSM circuits with the twofold state assignment have better characteristics (fewer LUTs and lower power consumption) than their counterparts based on functional decomposition. However, to apply this approach, it is necessary to create the extended state codes. It leads to using a block of state code transformer adding some delay in the cycle time.
In our current article, we propose to use only extended state codes for the state assignment. As a result, we propose a structural diagram and the design method of P E Mealy FSMs. The elimination of code transformer allows increasing the maximum operating frequency in comparison with P T -based FSMs. In P E Mealy FSMs, outputs o n ∈ O are produced simultaneously with functions D r ∈ D. As a result, we achieved an increase in operating frequency (up to 23.48%) accompanied by a small increase (up to 12.68%) in the FPGA resources used.
The results of experiments show that the performance gain increases as the complexity of an FSM (the number of FSM inputs and state variables) increases. At the same time, the increase in the FSM complexity leads to a decrease in the loss in the number of LUTs. Furthermore, our approach provides better area-time products starting from FSMs for which the total number of inputs and state variables exceeds twice the number of inputs N I LUT .
Funding: This research received no external funding.

Data Availability Statement:
The data presented in this study are available in the article.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: class number k of partition π S = {S 1 , ..., S K } C(s m ) code of state as an element of a class S k R(S k ) number of bits necessary to encode states from a class S k L k number of inputs determining transitions from states of class S k τ r ∈ τ k state variables encoding states s m ∈ S k R 0 total number of state variables encoding states as elem. of partition π S = {S 1 , ..., S K } I k ⊆ I set of inputs causing transitions from states s m ∈ S k O k ⊆ O set of outputs generating during transitions from states s m ∈ S k D k ⊆ D set of input memory functions generating during transitions from states s m ∈ S k EC(s m ) extended state code of state s m K number of classes of partition π S = {S 1 , ..., S K } δ(S k ) set of states of transitions from the states s m ∈ S k