Using a Double-Core Structure to Reduce the LUT Count in FPGA-Based Mealy FSMs

Abstract: A method is proposed for reducing the number of look-up table (LUT) elements in logic circuits of Mealy finite state machines (FSMs). FSMs with twofold state assignment are discussed. The reduction is achieved by using two cores of LUTs for generating partial Boolean functions. One core is based on maximum binary state codes; the second core uses extended state codes. This approach reduces the number of LUTs in the block that transforms state codes. The proposed approach leads to LUT-based Mealy FSM circuits having three levels of logic blocks. Each partial function of either core is represented by a single-LUT circuit. A formal method is proposed for redistributing states between these cores. An example of synthesis explains the peculiarities of the proposed method, and an example of state redistribution is given. The results of experiments conducted with standard benchmarks show that the double-core approach produces LUT-based FSM circuits with better area-temporal characteristics than circuits produced by the other investigated methods (Auto and One-hot of Vivado, JEDI, and twofold state assignment). Both the LUT counts and the maximum operating frequencies are improved. The gain in LUT counts varies from 5.74% to 36.92%, and the gain in frequency varies from 5.42% to 12.4%. These improvements come with a very small growth of the power consumption (less than 1%). The advantages of the proposed approach increase as the numbers of FSM inputs and states increase.


Introduction
Our time is characterized by the widespread penetration of embedded systems into all spheres of human activity [1-3]. Sequential devices are an integral part of almost every embedded system [4,5]. Very often, the behaviour of a sequential device is represented using the model of a Mealy finite state machine (FSM) [6,7]. In the FSM design process, designers often strive to balance the three main characteristics of the resulting circuit [8,9]: the occupied chip area, the maximum operating frequency, and the power consumption. The values of these characteristics are closely related [10]. As a rule, the occupied chip area has the greatest influence on the values of the other characteristics [11]. The occupied chip area can be reduced using methods of structural decomposition [11]. One of these methods is the method of twofold state assignment (TSA), which leads to three-level FSM circuits [12]. The TSA is aimed at Mealy FSMs implemented with field-programmable gate arrays (FPGAs) [13-17].
We chose FPGAs as the basis for implementing FSM circuits, since they are widely used for designing various digital systems [18]. We discuss FSM circuits based on configurable logic blocks (CLBs) consisting of look-up table (LUT) elements and programmable flip-flops. At present, the largest manufacturer of FPGA chips is AMD Xilinx [19]. For this reason, we focus this paper on FPGAs of AMD Xilinx. We propose a method of reducing the numbers of LUTs (LUT counts) in FPGA-based circuits of Mealy FSMs.
The main disadvantage of twofold FSMs is the need to convert all maximum binary state codes (MBCs) into so-called extended state codes (ESCs) [12]. For this purpose, an additional block transforms the maximum binary state codes into the extended state codes. This block consumes some of the internal resources of the FPGA chip (LUTs and programmable interconnections). In this paper, we propose a method which reduces the overhead connected with the transformation of state codes.
The main contribution of this paper is a novel design method aimed at reducing the LUT counts in the circuits of FPGA-based Mealy FSMs with twofold state assignment. We propose to represent an FSM circuit as a double-core structure. The first core uses maximum binary state codes for generating partial Boolean functions (PBFs). The PBFs of the second core are based on the extended state codes. The proposed approach leads to a LUT-based Mealy FSM in which only a part of the maximum binary state codes is transformed into extended state codes. Our current research shows that this approach leads to FSM circuits having fewer LUTs than FSM circuits based on the twofold state assignment. The experimental results show that FSMs based on our method have practically the same maximum operating frequencies as equivalent FSMs with TSA.
The rest of the article is organized as follows. The second section gives the background of LUT-based Mealy FSM design. The third section discusses the related work. The main idea of the proposed method is shown in the fourth section. The fifth section includes an example of FSM synthesis using our approach. An algorithm of state redistribution is shown in the sixth section. The seventh section is devoted to the results of experiments. The article also includes a short conclusion.

Background of LUT-Based Mealy FSMs
A Mealy FSM is characterized by sets of states A, inputs X, outputs Y, state variables T, and input memory functions (IMFs) D [6]. These sets are the following: A = {a_1, ..., a_M}, X = {x_1, ..., x_L}, Y = {y_1, ..., y_N}, T = {T_1, ..., T_R}, and D = {D_1, ..., D_R}. So, a Mealy FSM has M states, L inputs, N outputs, R state variables and R input memory functions. The values of the first three parameters do not depend on the FSM circuit designer, whereas the value of R can be chosen by the designer. The minimum value of R is determined as
R_{MB} = \lceil \log_2 M \rceil. (1)
Formula (1) determines the so-called maximum binary state assignment. The maximum value of R corresponds to the so-called one-hot state assignment: R_OH = M [20].
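The two bounds on R can be checked with a short calculation. The sketch below is only an illustration: it assumes that Formula (1) is the usual ceiling of the binary logarithm of M, and the function names are ours rather than part of any design tool.

```python
def max_binary_bits(num_states: int) -> int:
    """Smallest R with 2**R >= num_states, i.e. ceil(log2(M))."""
    if num_states < 1:
        raise ValueError("an FSM needs at least one state")
    # Integer arithmetic avoids floating-point rounding near powers of two.
    return (num_states - 1).bit_length()


def one_hot_bits(num_states: int) -> int:
    """One-hot assignment uses one state variable per state: R_OH = M."""
    return num_states


# M = 9 is the number of states of the example FSM discussed later in the text.
r_mb = max_binary_bits(9)
r_oh = one_hot_bits(9)
print(r_mb, r_oh)
```

For M = 9 the first function returns the value R_MB = 4 that is used in the synthesis example.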
The state variables T_r ∈ T are used for creating state codes K(a_m). An input memory function D_r ∈ D sets the binary value of the r-th bit of the code K(a_m). To keep state codes, a special register RG is used. The RG consists of R flip-flops controlled by two pulses, Start and Clock [21]. The pulse Start loads the code K(a_1) of the initial state a_1 ∈ A into RG. The synchronization pulse Clock allows loading a state code into RG. This code is determined by the values of the IMFs. We discuss the case when the RG consists of flip-flops with informational inputs of D type. This is the most popular type of flip-flop used in FPGA-based design [18].
In this article, we discuss the case when the internal resources of an FPGA chip are used for implementing FSM circuits. These resources include LUTs, flip-flops, programmable interconnections, a synchronization tree, and programmable input-outputs [22,23]. The LUTs and flip-flops are combined into CLBs.
A LUT is a block having S_L inputs and a single output [20,24]. A LUT may implement an arbitrary Boolean function of no more than S_L arguments. The value of S_L is rather small [22]. If the number of arguments of a Boolean function exceeds S_L, then it is necessary to combine several LUTs. It is quite possible that a function is represented by a multi-CLB circuit. In this case, it is necessary to diminish the number of LUTs and their levels in the corresponding circuit [25,26]. In this article, we use the symbol LUTer to show that the corresponding logic block includes LUTs, flip-flops and interconnections.
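The statement that a LUT implements an arbitrary function of up to S_L arguments can be illustrated by treating the LUT as a stored truth table addressed by its inputs. The following sketch is a software model only; the helper names and the majority function are our own illustrative choices.

```python
from itertools import product


def make_lut(func, num_inputs):
    """Tabulate func over all input combinations; the table is the LUT content."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=num_inputs)]


def read_lut(table, bits):
    """Use the input bits as an address (the first input is the most significant bit)."""
    address = 0
    for bit in bits:
        address = (address << 1) | bit
    return table[address]


# A 3-argument majority function fits into any LUT with S_L >= 3.
majority = make_lut(lambda a, b, c: a + b + c >= 2, 3)
print(len(majority), read_lut(majority, (1, 0, 1)))
```

A table for S_L arguments has 2^S_L entries, which is why functions with more arguments have to be split over several LUTs.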
An FSM logic circuit is represented by the following systems of Boolean functions (SBFs) [9]:
D = D(T, X); (2)
Y = Y(T, X). (3)
The SBF (2) represents the function of transitions, and the SBF (3) represents the function of outputs [6]. The SBFs (2) and (3) determine the structural diagram of the P Mealy FSM (Figure 1) [6]. Obviously, Functions (2) and (3) depend on the state variables T_r ∈ T and the FSM inputs x_l ∈ X.

Figure 1. Structural diagram of P Mealy FSM.
If the number of arguments of a function f_j ∈ D ∪ Y does not exceed S_L (condition (4)), then the corresponding logic circuit consists of a single LUT. If condition (4) holds for each function f_j (j ∈ {1, ..., R + N}), then the FSM circuit includes exactly R + N LUTs. Such a circuit is single-level. This is the best possible solution, providing minimum values of the required chip area, power consumption and cycle time (in other words, the maximum value of the operating frequency). However, FSMs can have up to 10 state variables and 30 inputs [6]. At the same time, modern LUTs have S_L = 6 inputs. So, it is quite possible that condition (4) is violated for at least one function f_j ∈ D ∪ Y. In this case, it is necessary to use various optimization strategies to improve the characteristics of an FSM circuit. Our current paper deals with the area reduction problem. Let us analyze some approaches used to solve this problem.
In the case of decomposition, Functions (2) and (3) are represented by systems of partial functions [29,35]. Each partial Boolean function has no more than S_L arguments. Therefore, each PBF is represented by a single-LUT circuit. Both functional decomposition (FD) and structural decomposition (SD) lead to multilevel FSM circuits. However, these circuits differ in the nature of their interconnections [11]. In the case of FD, the resulting circuit has an irregular interconnect structure in which the same variables x_l ∈ X and T_r ∈ T appear at different logical levels of the circuit. In the case of SD, an FSM circuit includes from two to four large logic blocks [30]. These blocks have unique systems of inputs and outputs. Therefore, SD-based FSM circuits have regular systems of interconnections. As shown in [11], SD-based circuits have better characteristics than equivalent FD-based circuits. In this article, we discuss a way of improving one of the SD-based methods.
In the case of LUT-based FSMs, a state assignment is optimal if it excludes the maximum possible number of literals from the sum-of-products of Functions (2) and (3) [36]. For a single-level implementation of an FSM circuit to be possible, it is necessary to exclude so many literals that condition (4) is satisfied for each function f_j ∈ D ∪ Y. However, this result is possible only for sufficiently simple FSMs [34]. Therefore, in most cases, state encoding methods have an auxiliary nature. If condition (4) is not satisfied after the state assignment, then it is necessary to use other optimization methods.
Very often, the methods of SD are based on finding a partition of the state set A into classes of compatible states. One of these methods is the method of twofold state assignment (TSA) [12,37]. The method is based on constructing a partition π_A = {A_1, ..., A_I} of the set A. Each class A_i ∈ π_A determines sets X_i, Y_i, D_i. The set X_i ⊆ X includes L_i FSM inputs causing transitions from states a_m ∈ A_i. The set Y_i ⊆ Y consists of FSM outputs produced during the transitions from states a_m ∈ A_i. The set D_i ⊆ D includes input memory functions determining MBCs of transition states.
There are M_i states in each class A_i ∈ π_A. Inside each class, these states are encoded by partial maximum binary codes C(a_m) having R_i bits:
R_i = \lceil \log_2 (M_i + 1) \rceil. (5)
The additional code marks the states that do not belong to the class. To encode states a_m ∈ A_i, the variables v_r ∈ V_i are used. The sets V_1, ..., V_I form a set V having R_A elements:
R_A = R_1 + \ldots + R_I. (6)
A state a_m ∈ A is compatible with states a_s ∈ A_i if including this state into A_i does not violate the following condition:
R_i + L_i \le S_L. (7)
To optimize the FSM logic circuit, it is necessary to minimize the value of I. This approach leads to the so-called P_T Mealy FSM (Figure 2).
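The relations between class sizes, partial code widths and the compatibility test can be tried out numerically. This is a minimal sketch under two assumptions of ours: that the code width is the ceiling of log2(M_i + 1), which agrees with the values R_1 = R_2 = 2 reported for the example in this article, and that compatibility means R_i + L_i does not exceed S_L. The function names are hypothetical.

```python
def partial_code_bits(class_size: int) -> int:
    """ceil(log2(M_i + 1)): the all-zero code is kept for 'state not in this class'."""
    return class_size.bit_length()


def total_partial_bits(class_sizes) -> int:
    """R_A: total number of variables v_r over all classes of the partition."""
    return sum(partial_code_bits(size) for size in class_sizes)


def is_compatible(size_after: int, inputs_after: int, s_l: int) -> bool:
    """Check R_i + L_i <= S_L for the class as it would look after adding a state."""
    return partial_code_bits(size_after) + inputs_after <= s_l


# Two classes with three states each, as in the final partition of the example.
print(partial_code_bits(3), total_partial_bits([3, 3]))
# Growing a class to four states needs a third code bit, which may break the test.
print(is_compatible(3, 3, 5), is_compatible(4, 3, 5))
```

The second line shows why class sizes of the form 2^R - 1 are natural limits: one more state costs an extra LUT input.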

Figure 2. Structural diagram of P_T Mealy FSM.
In P_T Mealy FSMs, each state a_m ∈ A has two codes: (1) the maximum binary state code K(a_m) and (2) the partial state code C(a_m) determining a particular state as an element of a particular class. A block LUTeri corresponds to the class A_i ∈ π_A. This block generates the following systems of PBFs:
D^i = D^i(V_i, X_i); (8)
Y^i = Y^i(V_i, X_i). (9)
The LUTerTY creates the resulting values of functions f_j ∈ D ∪ Y. Each element of LUTerTY implements the following SBFs:
D_r = \bigvee_{i=1}^{I} D_r^i \quad (r \in \{1, \ldots, R\}); (10)
y_n = \bigvee_{i=1}^{I} y_n^i \quad (n \in \{1, \ldots, N\}). (11)
The block LUTerTY contains the flip-flops of RG. The pulses Start and Clock enter this block to control the operation of RG.
As follows from (8) and (9), the partial functions depend on the state variables v_r ∈ V_i. These state variables are produced by transforming the state variables T_r ∈ T. To transform the codes K(a_m), the block LUTerV generates the following SBF:
V = V(T). (12)
As follows from [37], the circuits of P_T FSMs require fewer LUTs than the circuits of equivalent P Mealy FSMs. If the condition
I \le S_L (13)
holds, then the circuits of P_T FSMs have exactly three levels of LUTs. As a rule [37], they have higher maximum operating frequencies than the circuits of equivalent P Mealy FSMs.
We call an FSM core a block generating partial functions that depend on state variables. In P_T FSMs, there is the CoreV consisting of blocks LUTer1-LUTerI. All other functions are generated by a function assembly block (FAB). In P_T FSMs, the FAB consists of blocks LUTerTY and LUTerV. Using this terminology, we can represent the structural diagram of the P_T FSM in its generalized form (Figure 3). As follows from Figure 3, all PBFs depend on both inputs x_l ∈ X and state variables v_r ∈ V. So, the transformation of K(a_m) into C(a_m) is executed for all states a_m ∈ A. However, if condition (4) is satisfied for some state a_m ∈ A, then there is no need for this code transformation. If we reduce the number of states whose codes are transformed, then it is possible to reduce both the number of classes (I) and the value of the parameter R_A. This is the approach proposed in our current paper.

Main Idea of the Proposed Method
The transitions from a state a_m ∈ A depend on FSM inputs from a set X(a_m) ⊆ X. This set includes L(a_m) ≤ L elements. Let the following condition hold:
L(a_m) + R_{MB} \le S_L. (14)
If condition (14) holds, then each PBF generated during the transitions from a_m ∈ A is represented by a single-LUT circuit. So, there is no need for partial codes for such states a_m ∈ A. The partial codes C(a_m) should be generated only for the states for which condition (14) is violated. This conclusion is the basis for the method proposed in this article.
We propose to divide the set A into sets A_MB and A_PC. If condition (14) holds for a state a_m ∈ A, then this state is included into the set A_MB. Otherwise, this state is included into the set A_PC. The states a_m ∈ A_MB form a core denoted as CoreT, whereas the states a_m ∈ A_PC form a core denoted as CoreV. The transformation of state codes is executed only for the states a_m ∈ A_PC.
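The division of A into the two sets can be sketched in a few lines. The sketch assumes that condition (14) reads L(a_m) + R_MB ≤ S_L; the values of L(a_m), R_MB = 4 and S_L = 5 are those quoted for FSM S_1 in the synthesis example of this article, while the function name is ours.

```python
def split_states(num_inputs, r_mb, s_l):
    """Return (A_MB, A_PC): states that satisfy the single-LUT test and the rest."""
    a_mb = [state for state, count in num_inputs.items() if count + r_mb <= s_l]
    a_pc = [state for state in num_inputs if state not in a_mb]
    return a_mb, a_pc


# L(a_m) for FSM S_1 as listed in Step 2 of the example.
L_S1 = {"a1": 1, "a2": 2, "a3": 1, "a4": 0, "a5": 2,
        "a6": 1, "a7": 1, "a8": 2, "a9": 2}

a_mb, a_pc = split_states(L_S1, r_mb=4, s_l=5)
print(a_mb)
print(a_pc)
```

Under these assumptions the result coincides with the preliminary sets A_MB = {a_1, a_3, a_4, a_6, a_7} and A_PC = {a_2, a_5, a_8, a_9} of the example.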
The CoreT determines the sets X_T ⊆ X, Y_T ∪ Y^0 ⊆ Y, and D^0 ⊆ D. The first set includes FSM inputs determining the transitions from the states a_m ∈ A_MB. The second set consists of FSM outputs produced during the transitions from these states. The outputs from the set Y_T are produced only during transitions from the states of the CoreT. The outputs from the set Y^0 are shared between both cores. The third set includes IMFs generated during the transitions from the states a_m ∈ A_MB. The following SBFs (15)-(17) determine the CoreT:
D^0 = D^0(T, X_T); \quad Y_T = Y_T(T, X_T); \quad Y^0 = Y^0(T, X_T).
The CoreV determines the sets X_V ⊆ X and Y_V ⊆ Y. The first set includes FSM inputs determining the transitions from the states a_m ∈ A_PC. The second set consists of FSM outputs produced during the transitions from these states. The SBFs of the CoreV depend on the state variables v_r ∈ V and the inputs x_l ∈ X_V.
The CoreV is based on the partition π_V = {A_1, ..., A_K} of the set A_PC. This partition is constructed in the same way as the partition π_A. Each class of the partition π_V determines the sets X^k_V, Y^k_V, V^k and D^k_V. These sets are similar to the corresponding sets considered for the partition π_A. The circuit of CoreV is determined by SBFs similar to SBFs (8) and (9):
D^k_V = D^k_V(V^k, X^k_V); \quad Y^k_V = Y^k_V(V^k, X^k_V).
To generate the outputs y_n ∈ Y_V and the state variables, it is necessary to use the FAB. We propose to combine the blocks FAB, CoreV, and CoreT. The proposed connection of blocks leads to a double-core FSM P_2C. Its generalized structural diagram is shown in Figure 4.
There are K classes in the partition π_V. The following condition holds:
K \le I. (22)
Replacing the subscript i by the subscript k turns Formula (5) into a formula determining the number of state variables in the codes C(a_m) for states a_m ∈ A_k. Having these values allows obtaining the total number of variables v_r ∈ V:
R_V = R_1 + \ldots + R_K. (23)
Obviously, the following condition takes place:
R_V \le R_A. (24)
Due to the validity of condition (22), the following is true: (1) the circuit of CoreV for FSM P_2C includes no more LUTs than this circuit for the equivalent FSM P_T and (2) the circuit of FSM P_2C includes no more levels of logic than the circuit of the equivalent FSM P_T. Both P_T and P_2C FSMs incorporate the block LUTerV executing the transformation of state codes. Obviously, the fewer LUTs are included in the circuit of this block, the less power it consumes. As follows from the validity of condition (24), the circuit of LUTerV for FSM P_2C includes no more LUTs than this circuit for the equivalent FSM P_T. Therefore, the block LUTerV of a P_2C FSM has no greater static power consumption than this block of the equivalent FSM P_T. Since some PBFs are generated by the block CoreT, in some cycles of FSM operation the LUTs of the block LUTerV do not change their states. So, in these cycles, the block LUTerV has a dynamic power consumption close to zero. This analysis suggests that the block LUTerV of a P_2C FSM consumes less power than that block of an equivalent FSM P_T.
So, we assume that the circuits of Mealy FSMs P_2C have fewer LUTs and almost the same or even better performance compared to the circuits of equivalent FSMs P_T. We can also argue that P_2C FSMs require less energy for the code transformation than equivalent FSMs P_T. However, only experimental studies can show the real energy budgets of equivalent P_T and P_2C FSMs.
Using the above information, we propose a method for the synthesis of LUT-based P_2C Mealy FSMs. As the initial form of FSM representation we use state transition graphs (STGs) [9]. Next, we transform this STG into an equivalent state transition table (STT) [9]. To implement an FSM circuit, we use LUTs having S_L inputs. The proposed method includes the following steps:
1. Transforming the initial STG into the STT of the P Mealy FSM.
2. Preliminary constructing the sets A_MB and A_PC.
3. Preliminary constructing the partition π_V of the set A_PC.
4. Redistribution of states between the sets A_MB, A_PC and the classes of π_V.
5. Encoding of FSM states by maximum binary codes K(a_m).
6. Constructing the table of CoreT and deriving its SBFs.
7. Encoding states a_m ∈ A_k by partial state codes C(a_m).
8. Constructing the tables of the blocks LUTer1-LUTerK of CoreV and deriving their SBFs.
9. Constructing the table of LUTerTY.
10. Constructing the table of LUTerV.
11. Implementing the FSM circuit using the internal resources of a particular FPGA chip.
We use the symbol P_2C(S) to show that the model of the P_2C FSM is used to implement the logic circuit of some FSM S. In the next section, we discuss an example of synthesis of a P_2C Mealy FSM, where we explain how each step is executed.

Example of Synthesis
We discuss the synthesis of the P_2C(S_1) FSM using LUTs with S_L = 5. The FSM S_1 is represented by the STG shown in Figure 5. Each node of an STG corresponds to an FSM state. Each arc of an STG corresponds to an interstate transition [9]. There are H arcs in an STG. The h-th arc is marked by a pair <input signal X_h, collection of outputs Y_h>. An input signal X_h is a conjunction of FSM inputs x_l ∈ X determining the h-th interstate transition. A collection of outputs Y_h ⊆ Y includes the FSM outputs y_n ∈ Y generated during the h-th interstate transition.
Step 1. This step is executed in the trivial way [6]. Each arc of the STG corresponds to a single line of the corresponding STT. So, this table has the columns a_m, a_s, X_h, Y_h, h. The state a_m corresponds to the vertex from which the h-th arc comes out (this is the current state); the state a_s corresponds to the vertex into which this arc enters (this is the state of transition). The column X_h includes the input signal written above the h-th arc. The column Y_h includes the collection of outputs written above the h-th arc. This approach transforms the STG (Figure 5) into the equivalent STT (Table 1).
Step 2. To divide the set A into sets A_MB and A_PC, it is necessary to find the values of L(a_m) for states a_m ∈ A. The following values can be found from Table 1: L(a_4) = 0; L(a_m) = 1 for states a_1, a_3, a_6, a_7; L(a_m) = 2 for states a_2, a_5, a_8, a_9. There is S_L = 5. As follows from (14), there are the sets A_MB = {a_1, a_3, a_4, a_6, a_7} and A_PC = {a_2, a_5, a_8, a_9}. As we show in the next section, some elements of the set A_MB can be transferred to the set A_PC. Thus, these sets do not yet have a final form. Now, we can find the sets X_T and X_V. The set X_T includes inputs determining transitions from states a_m ∈ A_MB; the set X_V includes inputs determining transitions from states a_m ∈ A_PC. In the discussed case, there are the following sets:
Step 3. Using the approach [12] gives the partition π_V = {A_1, A_2} of the set A_PC. The classes of this partition are the following: A_1 = {a_2, a_5} and A_2 = {a_8, a_9}. This gives the following values of M_k: M_1 = M_2 = 2. Since the set A_PC can be changed, the partition π_V is also preliminary.
Step 4. We discuss this step in Section 6. Now, we only show its outcome: A_MB = {a_1, a_3, a_4} and A_PC = {a_2, a_5, a_6, a_7, a_8, a_9}. Now, the classes of π_V = {A_1, A_2} are the following: A_1 = {a_2, a_5, a_7} and A_2 = {a_6, a_8, a_9}. This gives the following values of M_k: M_1 = M_2 = 3. Using (5) gives R_1 = R_2 = 2 and R_V = 4. So, the total number of state variables v_r ∈ V is the same before and after refining the sets A_MB and A_PC, and there is the set V = {v_1, ..., v_4}. However, now there are fewer states in the set A_MB. This means that the number of LUTs in the circuit of CoreT should be reduced compared to the number corresponding to the set A_MB obtained during Step 2.
Step 5. There is M = 9. Using (1) gives R_MB = 4. So, there are the following sets: T = {T_1, ..., T_4} and D = {D_1, ..., D_4}. To minimize the sum-of-products (SOPs) of functions (12), it is necessary to place the states from the same class into the minimum possible number of generalized cubes of the R_MB-dimensional Boolean space [9]. Let us encode the states in the way shown in Figure 6. As follows from Figure 6, the states a_m ∈ A_MB are placed into the cube 00xx. This allows optimizing the SOPs of functions (15)-(17). The states a_m ∈ A_1 are placed in the cube x100, the states a_m ∈ A_2 are placed in the cube 1x00. This gives the opportunity to optimize the SOPs of functions (12).
Step 6. The table of CoreT is constructed using the lines 1-2 and 6-8 of Table 1. Three more columns are added to this table: K(a_m), K(a_s) and D^0_h. The first and second additional columns include the codes of the current and next states, respectively. The column D^0_h includes the IMFs equal to 1 to load the code K(a_s) into the RG. We changed the names of the columns X_h and Y_h compared to Table 1: now we use the notation X^0_h and Y^0_h. The CoreT is represented by Table 2. Using Table 2 gives the following SBFs:
This system is used to create the circuit of CoreT. Let us point out that the function y_4 is generated only by a LUT of CoreT. This gives Y_T = {y_4}. Furthermore, the following sets can be derived from Table 2:
Step 7. To encode the states a_m ∈ A_1, the variables v_1, v_2 ∈ V are used. To encode the states a_m ∈ A_2, the variables v_3, v_4 ∈ V are used. We use the code 00xx to show that a particular state does not belong to the class A_1. The code xx00 shows that a particular state does not belong to the class A_2. The outcome of the state assignment is shown in Figure 7. The following partial codes can be found from the Karnaugh map (Figure 7): C(a_2) = C(a_6) = 01, C(a_5) = C(a_8) = 10, and C(a_7) = C(a_9) = 11. These codes are used in the LUTs of CoreV.
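The partial codes of Step 7 can be combined into the four-bit vectors over v_1, ..., v_4 that the CoreV blocks receive. The sketch below uses the classes and partial codes quoted in the text; the function name and the idea of returning the vector as a string are our own illustrative choices, and reading this vector as the extended state code is our interpretation.

```python
# Classes after redistribution and the partial codes read from Figure 7.
CLASSES = [["a2", "a5", "a7"], ["a6", "a8", "a9"]]
WIDTHS = [2, 2]  # R_1 = R_2 = 2
PARTIAL_CODES = {"a2": "01", "a5": "10", "a7": "11",
                 "a6": "01", "a8": "10", "a9": "11"}


def v_vector(state: str) -> str:
    """Concatenate one field per class; an all-zero field means 'not in this class'."""
    fields = []
    for members, width in zip(CLASSES, WIDTHS):
        fields.append(PARTIAL_CODES[state] if state in members else "0" * width)
    return "".join(fields)


for state in ("a2", "a9", "a1"):
    print(state, v_vector(state))
```

A state of A_MB, such as a_1, yields the all-zero vector, so neither LUTer1 nor LUTer2 is activated for it.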
Step 8. There are two blocks of LUTs in the CoreV. The block LUTer1 implements SBFs for the class A_1; the block LUTer2 implements SBFs for the class A_2. The table of LUTer1 (Table 3) is constructed using the lines 3-5, 9-11 and 14-15 of Table 1. The table of LUTer2 (Table 4) is constructed using the lines 12-13 and 16-21 of Table 1. Both tables use partial state codes C(a_m) for current states and the MBCs K(a_s) for states of transition. The following sets can be found from Table 3: X_1 = {x_2, x_5, x_6}, Y_1 = {y_1, y_2, y_5, y_6, y_8} and D_1 = D. The following sets can be found from Table 4:
The SBFs (18) and (19) are constructed in the same way as SBFs (15)-(17). For example, the following SOPs can be obtained for functions D^1_1 (Table 3) and D^2_1 (Table 4):
Step 9. There are the following columns in the table of LUTerTY: f_j (a function generated by LUTerTY), CoreT, CoreV. If a function f_j ∈ D ∪ Y is generated by a LUT of CoreT, then there is 1 in the intersection of the line with this function and the column of the corresponding core. Otherwise, this intersection is marked by 0. There are K sub-columns in the column CoreV. If a function f_j ∈ D ∪ Y is generated by LUTerk of CoreV, then there is 1 in the intersection of the line with this function and the sub-column k. In the discussed case, the block LUTerTY is represented by Table 5.
To fill the column CoreT, the data from Table 2 are used. To fill the sub-column 1, we use Table 3. Table 4 is the base for filling the sub-column 2. The connection between Tables 2-5 is thus straightforward.
Using Table 5, we can construct the following SBFs:
y_2 = y^1_2; y_3 = y^2_3; y_4 = y^0_4; y_5 = y^0_5 ∨ y^1_5 ∨ y^2_5; y_6 = y^1_6;
Each function f_j ∈ D ∪ Y is represented by a disjunction of its partial components. The principle of constructing each function of (27) is clear from the comparison of these functions with the contents of Table 5.
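The assembly of a resulting function from its partial components can be modelled as a table-driven OR. In the sketch below the block indices follow the superscripts of the functions quoted above (0 for CoreT, 1 for LUTer1, 2 for LUTer2); only the five outputs that appear in the text are listed, and the snapshot of partial values is hypothetical.

```python
# Blocks that produce a partial component of each output, as read from the text.
SOURCES = {"y2": [1], "y3": [2], "y4": [0], "y5": [0, 1, 2], "y6": [1]}


def assemble(name, partial):
    """OR together the partial components partial[(name, block)] of one function."""
    return int(any(partial.get((name, block), 0) for block in SOURCES[name]))


# Hypothetical snapshot of one clock cycle: only LUTer1 asserts its part of y5.
partial_values = {("y5", 1): 1}
y5 = assemble("y5", partial_values)
y4 = assemble("y4", partial_values)
print(y5, y4)
```

A function with a single source, such as y_4, passes through unchanged, so in hardware it does not need an OR of several partial components.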
Step 10. To create the table of LUTerV, we use the full codes K(a_m) and the partial state codes C(a_m). So, there are the following columns in this table: a_m, K(a_m), C(a_m), V_m. Inside this table, we use only the states a_m ∈ A_PC. In the discussed case, there are six lines in the table of LUTerV (Table 6). To fill the column K(a_m), we use the state codes from Figure 6. The column C(a_m) is filled using the partial state codes from Figure 7.
To optimize the SBF (12), we represent its functions by the Karnaugh map (Figure 8). In this map, we treat the codes of states a_m ∈ A_MB as "don't care" input assignments.
Using the Karnaugh map (Figure 8) gives the following SBF:
In the worst case, each function v_r ∈ V is represented by a SOP having R_MB literals. So, the maximum number of literals is the product of R_V and R_MB. In the discussed case, this number is equal to 4 × 4 = 16. If we analyze the SBF (28), we find that it includes 10 literals. So, our approach reduces the number of literals by a factor of 16/10 = 1.6. Each literal corresponds to an interconnection between the outputs of RG and the inputs of the LUTs of the circuit of LUTerV. It is known that minimizing the number of interconnections reduces the power consumption [26,38].
Step 11. To implement the circuit of a P_2C Mealy FSM, it is necessary to use a CAD tool, for example, Vivado by Xilinx [39]. This package solves all problems connected with the step of technology mapping [40,41]. In Section 7, we use Vivado to compare the proposed method with some known FSM design methods.

Algorithm of State Redistribution
If a class A_k ∈ π_V includes M_k states, then R_k state variables are necessary to encode the states a_m ∈ A_k by the partial state codes C(a_m). The value of R_k is determined by (5). We denote as MNP_k the maximum possible number of states in a class A_k ∈ π_V. This value is determined as
MNP_k = 2^{R_k} - 1. (29)
Our research shows that it is quite possible that some class A_k ∈ π_V includes fewer states than MNP_k. For example, we have the following classes for FSM S_1: A_1 = {a_2, a_5} and A_2 = {a_8, a_9}. Using (5) gives R_1 = R_2 = 2. Using (29) gives MNP_1 = MNP_2 = 3. So, both classes might be supplemented by states from the set A_MB = {a_1, a_3, a_4, a_6, a_7}. One state can be added to each of the classes. So, it is quite possible that we need to redistribute states between the sets A_MB and A_PC. Obviously, these new elements of A_PC should be added into some classes A_k ∈ π_V. It is expedient to transfer states in such a way as to reduce the number of states in the set A_MB as much as possible.
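The number of states that can still be added to each class follows directly from MNP_k. The sketch below assumes that MNP_k equals 2^R_k - 1, which agrees with MNP_1 = MNP_2 = 3 for R_1 = R_2 = 2; the function names are ours.

```python
def max_class_size(code_bits: int) -> int:
    """MNP_k: all codes of R_k bits except the all-zero 'not in class' code."""
    return 2 ** code_bits - 1


def free_slots(members, code_bits: int) -> int:
    """How many states can still join the class without widening its code."""
    return max_class_size(code_bits) - len(members)


# Preliminary classes of FSM S_1, both encoded with R_k = 2 bits.
preliminary = {"A1": ["a2", "a5"], "A2": ["a8", "a9"]}
slots = {name: free_slots(members, 2) for name, members in preliminary.items()}
print(slots)
```

Each class has exactly one free slot, which matches the remark that one state can be added to each of the classes.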
We propose to use an estimate I(a_m), which we call the influence of the state a_m ∈ A_MB on the sets X_T and X_V. In the discussed case, these sets are the following:
The best candidate for transfer to the class A_k ∈ π_V is the state a_m ∈ A_MB that removes the most inputs from the set X_T and adds the fewest inputs to the set X_k. The influence of a state a_m ∈ A_MB on the set X_T is determined as
The influence of a state a_m ∈ A_MB on the set X_k is determined as
So, the overall influence of the state a_m ∈ A_MB is defined as
Obviously, it is necessary to transfer the states with the greatest influence. This is the basis of our proposed redistribution algorithm (Figure 9).
During the redistribution, a queue γ_k is formed from the states a_m ∈ A_MB. This queue is based on the following rule: the states are placed in order of decreasing I(a_m). If the influence is the same for states a_m, a_s ∈ A_MB (I(a_m) = I(a_s)), then the state with the lower subscript precedes the state with the higher subscript in the queue. A state can be included into a class A_k ∈ π_V if its inclusion does not violate condition (4). In our algorithm, we use the abbreviation CBI (can be included). For each class A_k ∈ π_V, the queue γ_k includes J_k elements. This preliminary information is sufficient to proceed to the description of the proposed algorithm.
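The ordering rule for the queue can be expressed as a sort with a two-part key. In the sketch below the influence values are hypothetical: they are not the values of Table 7 and were chosen only so that the resulting order coincides with the queue γ_1 = <a_7, a_4, a_6, a_1, a_3> quoted later in the text. The helper names are ours.

```python
def subscript(state: str) -> int:
    """Numeric subscript of a state name such as 'a7'."""
    return int(state[1:])


def build_queue(influence):
    """Order states by decreasing influence; ties go to the lower subscript."""
    return sorted(influence, key=lambda s: (-influence[s], subscript(s)))


# Hypothetical influence values for the states of A_MB (not taken from Table 7).
hypothetical_influence = {"a1": 0, "a3": 0, "a4": 1, "a6": 1, "a7": 2}
queue = build_queue(hypothetical_influence)
print(queue)
```

Negating the influence in the key gives the decreasing order while the subscript still sorts upwards, so no second pass is needed for ties.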
We start the redistribution from the testing the set A MB (Block 1).If this set is empty (output 1), then the redistribution cannot be executed.If there are some states in the set A MB (output 0), then the redistribution process begins.The analysis starts with class A 1 ∈ π V (Block 2).If the analyzed class includes the maximum number of states (output 1 from Block 3), then it is necessary to proceed to the analysis of the next class (go to Block 15).The algorithm is terminated when all classes are analyzed (output 1 of Block 16).Otherwise, the next class is analyzed (go to from Block 16 to Block 3).
If an additional state can be included in the class A k ∈ π V (output 0 from Block 3), then there is created a queue γ k having J k elements (Block 4).Next, the sequential analysis of the states from the queue γ k is performed.The analysis starts from the first element of the queue (Block 5).
The j-th element is taken from the queue (Block 6).If it cannot be included into the class A k ∈ π V (output 0 from Block 7), then the next element of the queue should be analyzed (go to Block 13).If all elements are analyzed (output 1 of Block 14), then it is necessary to analyze the class A k+1 ∈ π V (go to Block 15).Otherwise (output 0 of Block 14), the next element of the queue is analyzed (go to Block 6).
If the j-th element can be included into the class A k ∈ π V (output 1 from Block 7), then the following actions are executed (Block 8): (1) the state a j ∈ A MB is included into the set A k ∈ π V ; (2) the state a j ∈ A MB is excluded from the set A MB .If now (after excluding state a j ∈ A MB ) the set A MB becomes empty (output 1 of Block 9), the redistribution process is terminated (go to End).Otherwise (output 0 of Block 9), the next element of queue should be analyzed (go to Block 10).If all elements are already analyzed (output 1 of Block 11), then it is necessary to analyze the class A k+1 ∈ π V (go to Block 15).Otherwise (output 0 of Block 11), the next element of queue should be analyzed.This can be done if the class A k ∈ π V does not contain the maximum possible number of elements.This is checked in the Block 12.If the class is full (output 1 of Block 12), then it is necessary to analyze the class A k+1 ∈ π V (go to Block 15).Otherwise (output 0 of Block 12), the next element of the queue is analyzed (go to Block 6).
There are two conditions to terminate this redistribution process. First, there are no elements left in the set A MB (outputs 1 from Blocks 1 and 9). Second, all classes A k ∈ π V have been tested and, where possible, supplemented by states a m ∈ A MB (output 1 from Block 16).
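The flow of Blocks 1-16 can be summarized by a short Python sketch. It is only an illustration under stated assumptions: the name `redistribute` and its parameters are hypothetical, the queue γ k is supplied by a caller-provided function `make_queue` (the ordering by I(a m ) is not reproduced here), and the admissibility test of Block 7 is passed in as the predicate `can_include`. The example data are the sets and queues quoted in the text for FSM S 1 .

```python
from typing import Callable, List, Sequence, Set, Tuple


def redistribute(
    classes: Sequence[Set[str]],
    a_mb: Set[str],
    max_sizes: Sequence[int],
    make_queue: Callable[[int, Set[str]], List[str]],
    can_include: Callable[[str, int, Set[str]], bool],
) -> Tuple[List[Set[str]], Set[str]]:
    """Move states from A_MB into the classes of pi_V (hypothetical sketch)."""
    new_classes = [set(c) for c in classes]   # work on copies
    remaining = set(a_mb)
    if not remaining:                         # Block 1: nothing to redistribute
        return new_classes, remaining
    for k, cls in enumerate(new_classes):     # Blocks 2, 15, 16: walk over classes
        if len(cls) >= max_sizes[k]:          # Block 3: class is already full
            continue
        for state in make_queue(k, remaining):    # Blocks 4-6, 10, 11, 13, 14
            if not can_include(state, k, cls):    # Block 7: state does not fit
                continue
            cls.add(state)                    # Block 8: move the state
            remaining.discard(state)
            if not remaining:                 # Block 9: A_MB is empty, stop
                return new_classes, remaining
            if len(cls) >= max_sizes[k]:      # Block 12: class became full
                break
    return new_classes, remaining


# Example with the sets quoted in the text for FSM S1 (queues taken as given).
queues = {0: ["a7", "a4", "a6", "a1", "a3"], 1: ["a6", "a4", "a1", "a3"]}
classes_out, a_mb_out = redistribute(
    classes=[{"a2", "a5"}, {"a8", "a9"}],
    a_mb={"a1", "a3", "a4", "a6", "a7"},
    max_sizes=[3, 3],
    make_queue=lambda k, rest: [s for s in queues[k] if s in rest],
    can_include=lambda state, k, cls: True,  # assumption: Block 7 always passes
)
```

With these inputs the sketch moves a 7 into A 1 and a 6 into A 2 and leaves a 1 , a 3 , a 4 in A MB , which agrees with the outcome described for Table 7.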
So, the k-th step of the redistribution process starts from creating the current sets A MB and X 0 . Next, it is necessary to find the values of I(a m ) for states a m ∈ A MB and create the current queue γ k . So, there are K columns corresponding to classes A k ∈ π V in the table of redistribution. Each column is divided into the following sub-columns: A MB , I(a m ), γ k , j = 1, j = 2, . . ., j = J k . In this table, the line a m includes states a m ∈ A MB transferred into the particular class A k ∈ π V . The lines for these states are marked by ⊕. If a state cannot be included into the class A k ∈ π V , the corresponding line includes the sign "−". The last line of the table contains the classes A k ∈ π V . Table 7 shows the redistribution process for FSM S 1 .
So, for k = 1, the column A MB contains the states a 1 , a 3 , a 4 , a 6 , a 7 . For the state a 1 ∈ A MB , we can find the set X(a 1 ) = {x 1 }. Let us find the value of I(a 1 ). Using (30) gives the following: This value is written at the intersection of the line a 1 and sub-column I(a m ) for k = 1. In the same way, the values of I(a m ) for all other states a m ∈ A MB are calculated.
Using the values of I(a m ), we can get the queue γ 1 = <a 7 , a 4 , a 6 , a 1 , a 3 >. The position of each state in this queue is written at the intersection of the line a m and the sub-column γ 1 . So, we should check the possibility of redistribution starting from the state a 7 . If we place the state a 7 into the class A 1 , then the values of L 1 and R 1 do not change. So, the state is included into A 1 and excluded from A MB . Now, M 1 = MPN 1 = 3. So, during the step j = 2 no state can be added into the class A 1 . Now, there are the following modified sets: A 1 = {a 2 , a 5 , a 7 }, A MB = {a 1 , a 3 , a 4 , a 6 } and X T = {x 1 , x 3 }. Using the modified sets A MB and X T , we can start the next step of redistribution (k = 2).
The values of I(a m ) are shown in the corresponding sub-column of the column k = 2. Using them gives the queue γ 2 = <a 6 , a 4 , a 1 , a 3 >. If we place the state a 6 into the class A 2 , then the values of L 2 and R 2 do not change. So, the state a 6 is included into A 2 and excluded from A MB . Now, M 2 = MPN 2 = 3. So, during the step j = 2 no state can be added into the class A 2 . So, the class A 2 is ready. Now, there are the following modified sets: A 1 = {a 2 , a 5 , a 7 }, A 2 = {a 6 , a 8 , a 9 }, A MB = {a 1 , a 3 , a 4 } and X T = {x 1 }. Obviously, these sets are the same as those used as the outcome of Step 4 in our example.

Experimental Results
In this section, the results of experiments conducted with the benchmarks [42] are shown. The library [42] consists of 48 benchmarks. The benchmark FSMs are represented by their STTs. To represent the STTs, the format KISS2 is used. These benchmarks have a wide range of basic characteristics (numbers of states, inputs, and outputs). Different researchers use these benchmarks to compare various characteristics of FSM circuits [28,29,32]. The characteristics of benchmarks are shown in Table 8.
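The basic characteristics mentioned above can be read directly from a KISS2 description. The following Python sketch is a minimal illustration and is not the tool K2F: the function name `kiss2_characteristics` and the dictionary keys are hypothetical, lines starting with `#` are assumed to be comments, and the number of bits of a maximum binary state code is assumed to be the smallest integer not less than the binary logarithm of the number of states.

```python
from typing import Dict, List, Tuple


def kiss2_characteristics(text: str) -> Dict[str, int]:
    """Extract basic FSM characteristics from KISS2 text (hypothetical helper)."""
    header: Dict[str, str] = {}
    rows: List[Tuple[str, str, str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):      # assumption: '#' starts a comment
            continue
        if line.startswith("."):                  # directives such as .i, .o, .s, .p, .r
            parts = line.split()
            if parts[0] in (".e", ".end"):
                break
            header[parts[0]] = parts[1] if len(parts) > 1 else ""
            continue
        fields = line.split()                     # inputs, current state, next state, outputs
        if len(fields) != 4:
            raise ValueError(f"unexpected transition line: {line!r}")
        rows.append((fields[0], fields[1], fields[2], fields[3]))
    states: List[str] = []
    for _, current, nxt, _ in rows:
        for name in (current, nxt):
            if name != "*" and name not in states:   # '*' marks an unspecified state
                states.append(name)
    if not states:
        raise ValueError("no transitions found")
    return {
        "inputs": int(header[".i"]),
        "outputs": int(header[".o"]),
        "states": len(states),
        "transitions": len(rows),
        # Assumption: R_MB = ceil(log2(number of states)), at least one bit.
        "r_mb": max(1, (len(states) - 1).bit_length()),
    }
```

Counting the states from the transition rows rather than from the `.s` directive is a design choice of this sketch; it keeps the result meaningful for files in which the directive is missing.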
Our current research is connected with Mealy FSMs which are parts of digital systems. It is known that Mealy FSMs are not stable [6]: fluctuations at the inputs lead to fluctuations at the outputs. This can lead to errors in the operation of the digital system as a whole. To avoid these errors, the FSM inputs should be stabilized. The stabilization presumes using an additional input register (AIR) [30]. When input values stabilize, they are loaded into the AIR. Now, fluctuations at the inputs (which are the outputs of some system's blocks) do not lead to fluctuations at the FSM outputs. However, the AIR consumes some resources of a chip: (1) it requires L additional LUTs and flip-flops and (2) it is synchronized and, due to this, uses some resources of the synchronization tree. So, this register consumes additional LUTs, flip-flops, power and time (it adds some delay to the whole synchronization cycle time). Such an approach allows taking into account the overhead connected with the stabilization of FSM operation.
The experiments are conducted using a personal computer with the following characteristics: CPU: Intel Core i5-11300H, Memory: 16GB RAM LPDDR4X. To get the FSM circuits, we use the Virtex-7 VC709 Evaluation Platform (xc7vx690tffg1761-2) [43] by AMD Xilinx. The LUTs used in this platform have S L = 6 inputs. The CAD tool Vivado v2019.1 (64-bit) [39] executes the technology mapping. The results of experiments are taken from reports produced by Vivado. To connect the library with Vivado, we use VHDL-based FSM models. These models are obtained by a transformation of the files in KISS2 format into VHDL codes. The transformation is executed by the CAD tool K2F [30].
We have found three main characteristics of P 2C Mealy FSMs. They are: the occupied chip area (the LUT count), performance (both the values of cycle time and maximum operating frequency), and power consumption. We compared the obtained values with the corresponding values for four different FSMs. Three of them are P Mealy FSMs based on: (1) Auto of Vivado (it uses MBCs); (2) One-hot of Vivado; (3) JEDI (it uses MBCs, too). Moreover, for the comparison, we use P T -based FSMs [12] whose circuits we try to improve.
As shown in [30], all main characteristics of LUT-based FSM circuits depend on the relation between the value of L + R MB , on the one hand, and the value of S L , on the other hand. Analysis of Table 8 allows dividing the benchmarks into five sets according to the parameter n of this relation. The benchmarks belong to the set of trivial FSMs (set 0), if n = 0 (it gives R MB + L ≤ 6). The benchmarks belong to the set of simple FSMs (set 1), if n = 1 (it gives R MB + L ≤ 12). The benchmarks belong to the set of average FSMs (set 2), if n = 2 (it gives R MB + L ≤ 18). The benchmarks belong to the set of big FSMs (set 3), if n = 3 (it gives R MB + L ≤ 24). The benchmarks belong to the set of very big FSMs (set 4), if n = 4 (it gives the relation R MB + L > 24). As research [37] shows, the larger the set number, the bigger the gain from using methods of twofold state assignment.
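The division into sets can be expressed as a small Python helper. This is a sketch based only on the thresholds listed above (6, 12, 18 and 24 for S L = 6); the function name `benchmark_set` and the generalization to other values of S L are assumptions made for illustration.

```python
def benchmark_set(num_inputs: int, r_mb: int, s_l: int = 6) -> int:
    """Return the set number (0..4) of a benchmark (hypothetical helper).

    num_inputs is L, r_mb is R_MB, s_l is the number of LUT inputs S_L.
    """
    total = num_inputs + r_mb
    # For s_l = 6: 1..6 -> 0, 7..12 -> 1, 13..18 -> 2, 19..24 -> 3, above 24 -> 4.
    n = (total - 1) // s_l
    return max(0, min(n, 4))
```

For example, a benchmark with L = 7 and R MB = 6 has L + R MB = 13 and therefore falls into set 2 (average FSMs).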
The results of experiments are shown in Tables 9-11. These tables are organized in the same manner. The table columns are marked by the names of investigated methods. The last column includes the number of the benchmark set to which the particular benchmark belongs. The table rows are marked by the names of benchmarks. The row "Total" contains the sums of the values from the columns. The row "Percentage" includes the percentage of summarized characteristics of FSM circuits produced by other methods relative to P 2C -based FSMs. We start the analysis of experiments from Table 9. This table contains the values of LUT counts for each benchmark used in the experiments.
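The rows "Total" and "Percentage" can be illustrated by the following Python sketch. It reflects one reading of the description above, namely that each column total is divided by the total of the P 2C column and multiplied by 100; the function name `percentage_row` and its arguments are hypothetical.

```python
from typing import Dict, Mapping, Sequence


def percentage_row(
    columns: Mapping[str, Sequence[float]], reference: str
) -> Dict[str, float]:
    """Column totals expressed as percentages of the reference column total."""
    totals = {name: sum(values) for name, values in columns.items()}  # row "Total"
    ref_total = totals[reference]
    if ref_total == 0:
        raise ValueError("reference total must be non-zero")
    # Row "Percentage": the reference method itself gets exactly 100.
    return {name: 100.0 * total / ref_total for name, total in totals.items()}
```

Under this reading, a value above 100 for a LUT-count column means that the corresponding method needs more LUTs in total than the P 2C -based circuits. Interpreting a reported gain as the distance of such a value from 100 is an assumption and is not stated explicitly in the text.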
As follows from Table 9, the circuits of P 2C -based FSMs use the fewest LUTs among all investigated methods. There is the following gain: (1) 36.92% compared to Auto-based FSMs; (2) 56.23% compared to One-hot-based FSMs; (3) 16.11% compared to JEDI-based FSMs; (4) 5.74% compared to P T -based FSMs. In our opinion, this gain is associated with a decrease in the number of variables used in partial state codes (compared to equivalent P T -based FSMs). The second source of a decrease in the LUT counts can be a decrease in the number of partition classes. If the relation (K + 1) < I takes place, then there is a decrease in the required number of LUT inputs for elements of LUTerTY. If the condition (13) is violated but the condition (K + 1) < S L holds, then the circuit of LUTerTY is multi-level for a P T -based FSM as opposed to the single-level block circuit of an equivalent P 2C -based FSM.
Careful analysis of Table 9 reveals the following feature of the proposed method: the LUT counts are the same for equivalent P T - and P 2C -based FSMs from Set 0. This can be explained as follows. For this set, the condition (14) holds. This means that no function f j ∈ D ∪ Y needs to be decomposed: a single LUT is enough to implement a logic circuit for any function f j ∈ D ∪ Y. In this case, there is the same single class in both partitions, π A and π V . Due to this, the block FAB is absent. This means that both P T and P 2C FSMs turn into P FSMs. So, there are the same circuits for P T and P 2C FSMs. Obviously, these circuits have the same values of LUT counts. The same should also hold for the other characteristics of these two models.
Furthermore, Table 9 shows that the LUT counts are the same for some equivalent P T and P 2C FSMs that do not belong to Set 0. This phenomenon occurs for the following benchmarks: dk16, ex1, planet, planet1, s1488, s1494, s1a, s420, s510, s810, s832, sand and styr. Analysis of Table 8 reveals the nature of this phenomenon: there are more than S L = 6 bits in state codes for these FSMs. This means that the condition (34) holds. In this case, the condition (14) is violated. This leads to the empty set A MB . In turn, this makes the following relations correct: A PC = A and π A = π V . So, if the condition (34) holds, then P 2C FSMs turn into P T FSMs. Obviously, there are the same LUT counts for such equivalent P 2C and P T FSMs.
As follows from Table 10, the circuits of P 2C -based FSMs are the fastest compared to the circuits produced by other investigated methods. There is the following gain: (1) 14.60% compared to Auto-based FSMs; (2) 14.89% compared to One-hot-based FSMs; (3) 8.88% compared to JEDI-based FSMs; (4) 5.46% compared to P T -based FSMs. We think that this gain is due to the fact that in some cases the circuits of P 2C -based FSMs have fewer levels of LUTs than the circuits of P T -based FSMs. We discussed the reasons for this phenomenon in the analysis of Table 9. It is interesting to note that the average gain in the cycle time almost coincides with the average gain in the LUT counts (for P T - and P 2C -based FSMs). As follows from Table 10, for Set 0, the cycle times are the same for equivalent benchmarks using models of single-core and dual-core FSMs. The explanation is the same as for the equality of LUT counts. Moreover, Table 10 shows that the temporal characteristics are the same for the following benchmarks: dk16, ex1, planet, planet1, s1488, s1494, s1a, s420, s510, s810, s832, sand and styr. The reasons for this phenomenon have also been analyzed in the previous paragraphs.
Using values of cycle times, we can trivially compute the values of maximum operating frequencies. These values are shown in Table 11.
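The conversion mentioned here is the reciprocal relation between cycle time and frequency. The sketch below assumes that the cycle times are expressed in nanoseconds, which is not stated in this section; Table 11 reports frequencies in MHz. The function name `max_frequency_mhz` is hypothetical.

```python
def max_frequency_mhz(cycle_time_ns: float) -> float:
    """Maximum operating frequency in MHz for a cycle time given in ns."""
    if cycle_time_ns <= 0:
        raise ValueError("cycle time must be positive")
    # f = 1 / T; with T in nanoseconds, 1 / T is in GHz, so multiply by 1000.
    return 1000.0 / cycle_time_ns
```

Because the relation is nonlinear, a percentage computed from summed frequencies need not be the exact reciprocal of a percentage computed from summed cycle times.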
As follows from Table 11, the circuits of P 2C -based FSMs have the highest values of maximum operating frequencies compared to the circuits based on other investigated methods. There is the following gain: (1) 12.26% compared to Auto-based FSMs; (2) 12.40% compared to One-hot-based FSMs; (3) 7.09% compared to JEDI-based FSMs; (4) 5.42% compared to equivalent P T -based FSMs. Obviously, the gain in frequency is related to the gain in cycle time. We discussed all the reasons for this phenomenon above.
The value of power consumption is one of the most important characteristics of FSM circuits [44]. Very often, the gain in area-temporal characteristics is accompanied by an increase in the power consumption [27]. Using Vivado reports allows constructing Table 12 with values of consumed power.
The main goal of the proposed method is to obtain FSM circuits with fewer LUTs than the circuits of equivalent P T -based FSMs. Of course, this improvement can lead to an increase in power consumption. As follows from Table 12, this increase is extremely small. Compared to P T -based FSMs, the circuits of equivalent P 2C -based FSMs consume less than one percent more power (0.76%). If we compare P 2C -based FSMs with the other investigated methods, then there is the following gain: (1) 16.38% compared to Auto-based FSMs; (2) 24.02% compared to One-hot-based FSMs; (3) 1.90% compared to JEDI-based FSMs.
We associate the loss relative to P T -based FSMs with the following. In P T -based FSMs, the state variables T r ∈ T are connected only with the block LUTerV. However, in P 2C -based FSMs, these variables are connected with LUTs of both LUTerV and CoreT. This increase in the number of connections leads to an increase in the value of parasitic capacitance in an FSM circuit [26]. Due to this, P 2C -based FSMs consume more power than equivalent P T -based FSMs. Obviously, this phenomenon does not occur for FSMs from Set 0. Moreover, for the benchmarks dk16, ex1, planet, planet1, s1488, s1494, s1a, s420, s510, s810, s832, sand and styr both P T - and P 2C -based FSMs consume equal values of power.
So, the proposed method allows obtaining circuits whose area-temporal characteristics are either better than or the same as those of equivalent P T -based FSMs. Our main purpose is to get FSM circuits with fewer LUTs than equivalent P T -based FSMs have. As follows from the conducted experiments, this goal has been achieved. Furthermore, the proposed method has an additional positive effect: it allows getting faster FSM circuits than the circuits of equivalent P T -based FSMs. Our method loses slightly in terms of the amount of power consumed. However, this loss does not exceed 1% on average. We think that our approach can be used instead of P T FSMs if area-temporal characteristics determine the optimality of the resulting FSM circuits.

Conclusions
Modern FPGAs are very powerful design tools [45]. Nowadays, a single FPGA chip may implement a very complicated digital system. The main drawback of FPGAs is a very small number of LUT inputs [19,46]. This complicates the problem of optimizing the FSM circuits representing sequential blocks of digital systems. Very often, the process of technology mapping for such FSMs is connected with applying various functional decomposition methods. In this case, the resulting LUT-based FSM circuits are multi-level.
The technology mapping can be based on applying various methods of structural decomposition [30]. The research results shown in [11] prove that, very often, the SD leads to FSM circuits with significantly better characteristics compared to their counterparts based on the FD. Our research [12] shows that single-core circuits with the twofold state assignment have better characteristics compared to their FD-based counterparts. However, this approach is connected with using a special transformer creating the extended state codes. This transformer consumes some resources of the FPGA chip used.
In our current article, we propose to use two cores generating systems of partial Boolean functions. This leads to P 2C Mealy FSMs where different systems of state variables are used in different cores. Our approach allows reducing LUT counts and improving temporal characteristics in comparison with P T -based FSMs. Note that this gain is associated with a very slight increase in the power consumption (up to 1% on average).
In our future research, we will try to use this approach to optimize Mealy FSM circuits based on various structural decomposition methods. We will also check the possibility of using the double-core approach for optimizing the circuits of LUT-based Moore FSMs. We hope these methods can be used for implementing sequential devices of modern embedded systems.

Figure 1. Structural diagram of P Mealy FSM. In P FSMs, the block LUTerT is a block of IMFs. This block implements the SBF (2) and loads the next state code into RG. The register RG is distributed among the LUTs included into CLBs of LUTerT. The flip-flops of RG are controlled by pulses Start and Clock. The block LUTerY is a block of output logic implementing the SBF (3). Obviously, the functions (2) and (3) depend on state variables T r ∈ T and FSM inputs x l ∈ X. Let a function f j ∈ D ∪ Y depend on R j ≤ R state variables and L j ≤ L inputs. If the condition R j + L j ≤ S L (4)

Figure 3. Generalized diagram of P T Mealy FSM.

Figure 6. Outcome of state assignment for Mealy FSM S 1 .

Table 1. State transition table of Mealy FSM S 1 .

Table 2. Table of CoreT for Mealy FSM S 1 .

Table 7. Redistribution process for FSM S 1 .

Table 11. Experimental results (the maximum operating frequency, MHz).