Adaptive Sparse Representation of Continuous Input for Tsetlin Machines Based on Stochastic Searching on the Line

Abstract: This paper introduces a novel approach to representing continuous inputs in Tsetlin Machines (TMs). Instead of using one Tsetlin Automaton (TA) for every unique threshold found when Booleanizing continuous input, we employ two Stochastic Searching on the Line (SSL) automata to learn discriminative lower and upper bounds. The two resulting Boolean features are adapted to the rest of the clause by equipping each clause with its own team of SSLs, which update the bounds during the learning process. Two standard TAs finally decide whether to include the resulting features as part of the clause. In this way, only four automata altogether represent one continuous feature (instead of potentially hundreds of them). We evaluate the performance of the new scheme empirically using five datasets, along with a study of interpretability. On average, TMs with SSL feature representation use 4.3 times fewer literals than the TM with static threshold-based features. Furthermore, in terms of average memory usage and F1-Score, our approach outperforms simple Multi-Layered Artificial Neural Networks, Decision Trees, Support Vector Machines, K-Nearest Neighbor, Random Forest, Gradient Boosted Trees (XGBoost), and Explainable Boosting Machines (EBMs), as well as the standard and real-value weighted TMs. Our approach further outperforms Neural Additive Models on Fraud Detection and StructureBoost on CA-58 in terms of the Area Under Curve while performing competitively on COMPAS.

Author Contributions: Conceptualization, K.D.A. and O.-C.G.; software, K.D.A. and O.-C.G.; validation, K.D.A. and O.-C.G.; analysis, K.D.A. and O.-C.G.; investigation, K.D.A., O.-C.G. and M.G.; writing—original preparation, K.D.A.; writing—review and editing, K.D.A., O.-C.G. and M.G.; supervision, O.-C.G.


Introduction
Deep learning (DL) has significantly advanced state-of-the-art models in machine learning (ML) over the last decade, attaining remarkable accuracy in many ML application domains. One of the issues with DL, however, is that DL inference cannot easily be interpreted [1]. This limits the applicability of DL in high-stakes domains such as medicine [2,3], credit-scoring [4,5], churn prediction [6,7], bioinformatics [8,9], crisis analysis [10], and criminal justice [11]. In this regard, the simpler and more interpretable ML algorithms, such as Decision Trees, Logistic Regression, Linear Regression, and Decision Rules, can be particularly suitable. However, they are all hampered by low accuracy when facing complex problems [12]. This limitation has urged researchers to develop machine learning algorithms that are capable of achieving a better trade-off between interpretability and accuracy.
While some researchers focus on developing entirely new machine learning algorithms, as discussed above, other researchers try to render DL interpretable. A recent attempt to make DL interpretable is the work of Agarwal et al. [11]. They introduce Neural Additive Models (NAMs), which treat each feature independently. The assumption of independence makes NAMs interpretable but impedes accuracy compared with regular DL [11]. Another approach is to try to explain DL inference with surrogate models. Here, one strives to attain local interpretability; i.e., explaining individual predictions [13].

Recent theoretical work on the TM proves convergence to the correct operator for "identity" and "not". It is further shown that arbitrarily rare patterns can be recognized, using a quasi-stationary Markov chain-based analysis. The work finally proves that when two patterns are incompatible, the most accurate pattern is selected [43]. Convergence for the "XOR" operator has also recently been proven by Jiao et al. [44].
The approach proposed in [37] is currently the most effective way of representing continuous features through Booleanization. However, the approach requires a large number of TAs to represent the Booleanized continuous features. Indeed, one TA is needed per unique continuous value. Consequently, this increases the training time of the TM, as it needs to update all the TAs in all of the clauses for each training iteration. Further, this adds more post-processing work for generating interpretable rules out of TM outputs. To overcome this challenge in TMs, we propose a novel approach to represent continuous features in the TM, encompassing the following contributions:
• Instead of representing each unique threshold found in the Booleanization process by a TA, we use a Stochastic Searching on the Line (SSL) automaton [45] to learn the lower and upper limits of the continuous feature values. These limits decide the Boolean representation of the continuous value inside the clause. Only two TAs then decide whether to include these bounds in the clause or to exclude them from the clause. In this way, one continuous feature can be represented by only four automata, instead of by hundreds of TAs (decided by the number of unique values within the feature).
• A new approach to calculating the clause output is introduced to match the above Booleanization scheme.
• We update the learning procedure of the TM accordingly, building upon Type I and Type II feedback to learn the lower and upper bounds of the continuous input.
• Empirically, we evaluate our new scheme using eight datasets: Bankruptcy, Balance Scale, Breast Cancer, Liver Disorders, Heart Disease, Fraud Detection, COMPAS, and CA-58. With the first five datasets, we show how our novel approach affects memory consumption, training time, and the number of literals included in clauses, in comparison with the threshold-based scheme [46].
Furthermore, performances on all these datasets are compared against recent state-of-the-art machine learning models.
This paper is organized as follows. In Section 2, we present the learning automata foundation we build upon and discuss the SSL automaton in more detail. Then, in Section 3, we introduce the TM and how it traditionally has dealt with continuous features. We then propose our new SSL-based scheme in Section 4. We evaluate the performance of our new scheme empirically using five datasets in Section 5. In this section, we use the Bankruptcy dataset to demonstrate how rules are extracted from TM clause outputs. The prediction accuracy of the TM SSL-based continuous feature representation is then compared against several competing techniques, including ANNs, SVMs, DTs, RF, KNN, EBMs (the current state-of-the-art of Generalized Additive Models (GAMs) [22,23]), Gradient Boosted Trees (XGBoost), and the TM with regular continuous feature representation. Further, we contrast the performance of the TM against reported results for recent state-of-the-art machine learning models, namely NAMs [11] and StructureBoost [47]. Finally, we conclude our paper in Section 6.

Learning Automata and the Stochastic Searching on the Line Automaton
The origins of Learning Automata (LA) [48] can be traced back to the work of M. L. Tsetlin in the early 1960s [31]. In a stochastic environment, an automaton is capable of learning the optimum action that has the lowest penalty probability through trial and error. There are different types of automata; the choice of a specific type is decided by the nature of the application [49].
Initially, the LA randomly performs an action from an available set of actions. This action is then evaluated by its attached environment. The environment randomly produces feedback, i.e., a reward or a penalty, as a response to the action selected by the LA. Depending on the feedback, the state of the LA is adjusted. If the feedback is a reward, the state changes towards the end state of the selected action, reinforcing the action. When the feedback is a penalty, the state changes towards the center state of the selected action, weakening the action and eventually switching the action. The next action of the automaton is then decided by the new state. In this manner, an LA interacts with its environment iteratively. With a sufficiently large number of states and a reasonably large number of interactions with the environment, an LA learns to choose the optimum action with a probability arbitrarily close to 1.0 [48].
During LA learning, the automaton can make deterministic or stochastic jumps as a response to the environment feedback. LA make stochastic jumps by randomly changing states according to a given probability. If this probability is 1.0, the state jumps are deterministic, and automata of this kind are called deterministic automata. If the transition graph of the automaton is kept static, we refer to it as a fixed-structure automaton. The TM employs TAs to decide which literals to include in the clauses. A TA is deterministic and has a fixed structure, formulated as a finite-state automaton [31]. A TA with 2N states is depicted in Figure 1. States 1 to N map to Action 1 and states N + 1 to 2N map to Action 2.

The Stochastic Searching on the Line (SSL) automaton pioneered by Oommen [45] is somewhat different from regular automata. The SSL automaton is an optimization scheme designed to find an unknown optimum location λ* on a line, seeking a value in the unit interval [0, 1].
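To make the state dynamics concrete, the following minimal sketch implements a two-action TA of the kind depicted in Figure 1 (the class and method names are our own, not taken from any particular TM implementation):

```python
class TsetlinAutomaton:
    """Two-action Tsetlin Automaton with 2N states.

    States 1..N select Action 1 (e.g., "exclude"); states N+1..2N
    select Action 2 (e.g., "include")."""

    def __init__(self, N):
        self.N = N
        self.state = N  # start at the boundary, on the Action 1 side

    def action(self):
        return 1 if self.state <= self.N else 2

    def reward(self):
        # Reinforce the current action by moving towards its end state.
        if self.action() == 1:
            self.state = max(1, self.state - 1)
        else:
            self.state = min(2 * self.N, self.state + 1)

    def penalty(self):
        # Weaken the current action by moving towards the center,
        # eventually crossing over and switching action.
        if self.action() == 1:
            self.state += 1
        else:
            self.state -= 1
```

Repeated penalties make the automaton cross the center and switch action, while repeated rewards push the state deeper into the current action's half, locking the action in.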
In SSL learning, λ* is restricted to a discrete search space: the line is divided into the points {0, 1/N, 2/N, . . . , (N − 1)/N, 1}, with N being the discretization resolution. Depending on the possibly faulty feedback from the attached environment (E), λ moves towards the left or the right from its current position in this discrete search space. We consider the environment feedback 1, E = 1, as an indication to move towards the right (or to increase the value of λ) by one step. The environment feedback 0, E = 0, on the other hand, is considered as an indication to move towards the left (or to decrease the value of λ) by one step. The next location of λ, λ(n + 1), can thus be expressed as follows:

λ(n + 1) = λ(n) + 1/N, if E = 1 and 0 ≤ λ(n) < 1, (1)
λ(n + 1) = λ(n) − 1/N, if E = 0 and 0 < λ(n) ≤ 1. (2)

Asymptotically, the learning mechanism is able to find a value arbitrarily close to λ* when N → ∞ and n → ∞.
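The update rule above can be sketched as follows (a minimal illustration; the function name and the noisy-oracle environment are our own assumptions, not part of the SSL formulation):

```python
import random

def ssl_step(lam, E, N):
    """One SSL update on the discretized line {0, 1/N, ..., 1}.

    E = 1 moves one step to the right, E = 0 one step to the left;
    lam stays clipped to [0, 1]."""
    if E == 1:
        return min(1.0, lam + 1.0 / N)
    return max(0.0, lam - 1.0 / N)

# Demo: a noisy environment points towards lambda* = 0.75 with
# probability 0.8 (an assumed informativeness, for illustration only).
random.seed(0)
lam, N, target = 0.0, 20, 0.75
for _ in range(2000):
    correct = 1 if lam < target else 0
    E = correct if random.random() < 0.8 else 1 - correct
    lam = ssl_step(lam, E, N)
# lam now hovers close to 0.75
```

Even though individual feedbacks can be wrong, the bias towards the correct direction keeps λ concentrated around λ*, which is the property the TM exploits below.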

Tsetlin Machine (TM) for Continuous Features
As seen in Figure 2, conceptually, the TM decomposes into five layers to recognize sub-patterns in the data and categorize them into classes. In this section, we explain the role of each of these layers in the pattern recognition and learning phases of the TM. The parameters and symbols used in this section are explained and summarized in Table 1.

Layer 1: the input. In the input layer, the TM receives a vector of o propositional variables, X = [x_1, x_2, . . . , x_o], with x_k ∈ {0, 1}. The objective of the TM is to classify this feature vector into one of two classes, y ∈ {0, 1}. However, as shown in Figure 2, the input layer also includes the negations of the original features, ¬x_k, in the feature set to capture more sophisticated patterns. Collectively, the elements in the augmented feature set are called literals: L = [x_1, x_2, . . . , x_o, ¬x_1, ¬x_2, . . . , ¬x_o] = [l_1, l_2, . . . , l_2o].
Layer 2: clause construction. The sub-patterns associated with class 1 and class 0 are captured by m conjunctive clauses. The value m is set by the user, where more complex problems might demand a larger m. All clauses receive the same augmented feature set formulated at the input layer, L. However, to perform the conjunction, only a fraction of the literals is utilized. The TM employs the two-action TAs in Figure 1 to decide which literals are included in which clauses. Since there are 2o literals in L, the same number of TAs, one per literal k, is needed by a clause to decide which literals to include. When the index set of the included literals in clause j is given by I_j, the conjunction of the clause can be performed as follows:

c_j = ∧_{k ∈ I_j} l_k.

Notice how the composition of a clause varies from clause to clause depending on the indexes of the included literals in the set I_j ⊆ {1, . . . , 2o}. For the special case of I_j = ∅, i.e., an empty clause, we have c_j = 1 during learning and c_j = 0 during classification. That is, during learning, empty clauses output 1, and during classification, they output 0.

Layer 3: storing the states of the TAs of the clauses in memory. The TA states on the left-hand side of the automaton (states 1 to N) ask to exclude the corresponding literal from the clause, while the states on the right-hand side (states N + 1 to 2N) ask to include the literal in the clause. The systematic storage of the TA states in the matrix A, A = (a_j,k) ∈ {1, . . . , 2N}^{m×2o}, with j referring to the clause and k to the literal, allows us to find the index set of the included literals in clause j as I_j = {k | a_j,k > N, 1 ≤ k ≤ 2o}.

Layer 4: clause output. Once the TA decisions are available, the clause output can easily be computed. Since the clauses are conjunctive, a single included literal of value 0 is enough to turn the clause output to 0.
To ease understanding, we introduce the set I_X^1, which contains the indexes of the literals of value 1. Then, the output of clause j can be expressed as c_j = 1 if I_j ⊆ I_X^1, and c_j = 0 otherwise. The clause outputs, computed as above, are stored in the vector C; i.e., C = (c_j) ∈ {0, 1}^m.
Layer 5: classification. The TM structure given in Figure 2 is used to classify data into two classes. Hence, the sub-patterns associated with each class have to be learned separately. For this purpose, the clauses are divided into two groups, where one group learns the sub-patterns of class 1 while the other learns the sub-patterns of class 0. For simplicity, clauses with an odd index are assigned positive polarity (c_j^+), and they are used to capture sub-patterns of output y = 1. Clauses with an even index, on the other hand, are assigned negative polarity (c_j^−), and they seek the sub-patterns of output y = 0. The clauses that recognize their sub-patterns output 1. This makes the classification process simple: we sum the clause outputs of each class and assign the sample to the class with the highest sum. A higher sum means that more sub-patterns of the designated class have been identified, and hence that there is a higher chance of the sample belonging to that class. With v being the difference in clause outputs, v = ∑_j c_j^+ − ∑_j c_j^−, the output of the TM is decided as follows:

ŷ = 1 if v ≥ 0, and ŷ = 0 otherwise. (6)

A TM learns online, updating its internal parameters according to one training sample (X, y) at a time. As discussed, a TA team decides the clause output, and collectively, the outputs of all the clauses decide the TM's output. Hence, to maximize the accuracy of the TM's output, it is important to sensibly guide the individual TAs in the clauses. We achieve this with two kinds of reinforcement: Type I and Type II feedback. Type I and Type II feedback decide whether the TAs in the clauses receive a reward, a penalty, or inaction feedback, depending on the context of their actions. How the type of feedback is decided, and how the TAs are updated according to the selected feedback type, is discussed below in more detail.
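The clause conjunction and the majority vote can be sketched as follows (index sets are given as plain Python lists, and the function names are ours):

```python
def clause_output(literals, include, learning=True):
    """Conjunction of the included literals.

    literals: 0/1 values [x_1..x_o, not x_1..not x_o];
    include:  index set I_j of the literals chosen by the TAs.
    Empty clauses output 1 during learning and 0 during classification."""
    if not include:
        return 1 if learning else 0
    return int(all(literals[k] for k in include))

def classify(literals, positive, negative):
    """Vote difference v = sum of positive minus sum of negative clause
    outputs; predict class 1 when v >= 0."""
    v = sum(clause_output(literals, I, learning=False) for I in positive)
    v -= sum(clause_output(literals, I, learning=False) for I in negative)
    return 1 if v >= 0 else 0
```

For example, with X = [1, 0] the literal vector is [1, 0, 0, 1] (x_1, x_2, ¬x_1, ¬x_2), and a positive clause that includes x_1 and ¬x_2 (indexes 0 and 3) outputs 1.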
Type I feedback: Type I feedback has been designed to reinforce the true positive outputs of the clauses and to combat the false negative outputs of the clauses. To reinforce the true positive output of a clause (the clause output is 1 when it has to be 1), the include actions of TAs whose corresponding literal value is 1 are strengthened. However, more fine-tuned patterns can be identified by also strengthening the exclude actions of TAs in the same clause whose corresponding literal value is 0. To combat the false negative outputs of the clauses (the clause output is 0 when it has to be 1), we erase the pattern identified by the clause and make the clause available for a new pattern. To do so, the exclude actions of the TAs, regardless of their corresponding literal values, are strengthened. We now sub-divide Type I feedback into Type Ia and Type Ib, where Type Ia handles the reinforcing of include actions while Type Ib works on reinforcing exclude actions. Together, Type Ia and Type Ib feedback force clauses to output 1. Hence, clauses with positive polarity need Type I feedback when y = 1 and clauses with negative polarity need Type I feedback when y = 0. To diversify the clauses, they are targeted for Type I feedback stochastically as follows:

p_j = 1 with probability (T − max(−T, min(T, v)))/(2T), and 0 otherwise. (7)

All clauses in each class should not learn the same sub-pattern, nor only a few sub-patterns. Hence, clauses should be allocated sensibly among the sub-patterns. The user-set target T in (7) does this while deciding the probability of receiving Type I feedback; i.e., T clauses are made available to learn each sub-pattern in each class. A higher T increases the robustness of learning by allocating more clauses to learn each sub-pattern. Now, T together with v decides the probability of clause j receiving Type I feedback, and accordingly, the decision p_j is made. The decisions for the complete set of clauses to receive Type I feedback are organized in the vector P = (p_j) ∈ {0, 1}^m.
Once the clauses to receive Type I feedback are singled out as per (7), the probability of updating the individual TAs in the selected clauses is calculated using the user-set parameter s (s ≥ 1), separately for Type Ia and Type Ib. The decision whether the k-th TA of the j-th clause is to receive Type Ia feedback, r_j,k, or Type Ib feedback, q_j,k, is made stochastically as follows:

r_j,k = 1 with probability (s − 1)/s, and 0 otherwise;
q_j,k = 1 with probability 1/s, and 0 otherwise.
The above decisions are stored in the two matrices R and Q, respectively; i.e., R = (r_j,k) ∈ {0, 1}^{m×2o} and Q = (q_j,k) ∈ {0, 1}^{m×2o}. Using the complete set of conditions, the TA indexes selected for Type Ia feedback are I_Ia = {(j, k) | l_k = 1 ∧ c_j = 1 ∧ p_j = 1 ∧ r_j,k = 1}, while those selected for Type Ib feedback are I_Ib = {(j, k) | (l_k = 0 ∨ c_j = 0) ∧ p_j = 1 ∧ q_j,k = 1}.

The states of the identified TAs are now ready to be updated. Since Type Ia strengthens the include action of TAs, the current state should move towards the include action direction. We denote this by ⊕, where ⊕ adds 1 to the current state value of the TA. Type Ib feedback, on the other hand, moves the state of the selected TA towards the exclude action direction to strengthen the exclude action. We denote this by ⊖, where ⊖ subtracts 1 from the current state value of the TA. Accordingly, the states of the TAs in A are updated as A ← A ⊕ I_Ia ⊖ I_Ib.

Type II feedback: Type II feedback has been designed to combat the false positive outputs of clauses (the clause output is 1 when it has to be 0). To turn such a clause output from 1 to 0, a literal of value 0 can simply be included in the clause. Clauses with positive polarity need Type II feedback when y = 0 and clauses with negative polarity need it when y = 1, as they do not want to vote for the opposite class. Again, using the user-set target T, the decision for the j-th clause is made as follows:

p_j = 1 with probability (T + max(−T, min(T, v)))/(2T), and 0 otherwise. (10)

The states of the TAs whose corresponding literals have value 0 in the clauses selected according to (10) are now moved towards the include action direction with probability 1. Hence, the index set of this kind can be identified as I_II = {(j, k) | l_k = 0 ∧ c_j = 1 ∧ p_j = 1}. When training has been completed, the final decisions of the TAs are recorded, and the resulting clauses can be deployed for operation.
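The stochastic selection steps can be sketched as follows (helper names are ours; the clamped-vote probabilities follow the standard TM scheme summarized above):

```python
import random

def clamp(v, T):
    """Clamp the vote difference v to [-T, T]."""
    return max(-T, min(T, v))

def type_i_selected(v, T):
    """A clause eligible for Type I feedback receives it with
    probability (T - clamp(v, T)) / (2T)."""
    return random.random() < (T - clamp(v, T)) / (2 * T)

def type_ii_selected(v, T):
    """A clause eligible for Type II feedback receives it with
    probability (T + clamp(v, T)) / (2T)."""
    return random.random() < (T + clamp(v, T)) / (2 * T)

def type_i_subtype(literal, clause_out, s):
    """Within a selected clause: Type Ia (strengthen include) targets TAs
    whose literal is 1 in a clause that outputs 1, with probability
    (s - 1) / s; Type Ib (strengthen exclude) targets the remaining TAs
    with probability 1 / s."""
    if clause_out == 1 and literal == 1:
        return "Ia" if random.random() < (s - 1) / s else None
    return "Ib" if random.random() < 1.0 / s else None
```

Note how the two selection probabilities are complementary: when the vote sum already saturates at T, Type I feedback vanishes and Type II feedback becomes certain, and vice versa at −T, which is what caps the number of clauses allocated per sub-pattern.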

Booleanization of Continuous Features
In the TM discussed so far, the input layer accepted only Boolean features; i.e., X = [x_1, x_2, x_3, . . . , x_o] with x_k, k = 1, 2, . . . , o, being 0 or 1. These features and their negations were directly fed into the clauses without any further modifications. However, continuous features in machine learning applications are more common than values of simply 1 or 0. In one of our previous papers [37], we presented a systematic procedure for transforming continuous features into Boolean features while maintaining the ranking relationships among the continuous feature values.
We here summarize the previous Booleanization scheme using the example presented in [27]. As seen in Table 2, we Booleanize the two continuous features listed in table column 1 and column 2.

1. First, for each feature, the unique values are identified;
2. The unique values are then sorted from smallest to largest;
3. The sorted unique values are considered as thresholds. In the table, these values can be seen in the "Thresholds" row;
4. Each original feature value is then compared with the thresholds identified for its own feature. If the feature value is greater than the threshold, the corresponding Boolean variable is set to 0; otherwise, it is set to 1;
5. The above steps are repeated until all the features are converted into Boolean form.
The first feature in the first column of the table contains three unique values: 5.779, 10.008, and 3.834 (step 1). Once they are sorted as required in step 2, we obtain 3.834, 5.779, and 10.008. Now, we consider them as thresholds: ≤3.834, ≤5.779, and ≤10.008 (step 3). Here, we find that each original feature value in column 1 is going to be represented by three bit values. According to step 4, we now compare the original values of the first feature against its thresholds. The first feature value of 5.779 is greater than 3.834 (resulting in 0), equal to 5.779 (resulting in 1), and less than 10.008 (resulting in 1). Hence, we replace 5.779 with 011. Similarly, 10.008 and 3.834 can be replaced with 001 and 111, respectively.
The conversion of the feature values for the second feature starts once all the feature values in the first feature are completed. This procedure is iterated until all the continuous values of all the continuous features have been converted to Boolean form (step (v)). This Boolean representation of continuous features is particularly powerful as it allows the TM to reason about the ordering of the values, forming conjunctive clauses that specify rules based on thresholds, and with negated features, also rules based on intervals. This can be explained again with the following example.
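The five-step procedure can be sketched as follows (the function name is ours):

```python
def booleanize(column):
    """Threshold-based Booleanization of one continuous feature.

    Steps 1-3: the sorted unique values become the thresholds.
    Step 4: each value maps to one bit per threshold; the bit is 1
    when value <= threshold and 0 otherwise."""
    thresholds = sorted(set(column))                   # steps 1 and 2
    return [[1 if v <= t else 0 for t in thresholds]   # steps 3 and 4
            for v in column]
```

For the first feature of Table 2, `booleanize([5.779, 10.008, 3.834])` yields the bit patterns 011, 001, and 111, matching the worked example above.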
The threshold ≤3.834 in the "Thresholds" row stands for the continuous value 3.834 of the first feature. Similarly, the thresholds ≤5.779 and ≤10.008 represent the continuous values 5.779 and 10.008, respectively. Consider a clause where ≤5.779 is the only threshold included. Then, for any input value from that feature less than or equal to 5.779, the clause outputs 1. Now, consider the case of having two thresholds, ≤5.779 and ≤10.008, included in the clause. The threshold ≤5.779 still decides the clause output, due to the fact that the AND of the ≤5.779 and ≤10.008 threshold columns in Table 2 yields the ≤5.779 threshold column. When a clause includes a negated threshold, the threshold condition is reversed. Consider a clause that only includes the negation of the threshold ≤3.834. The clause then outputs 1 for all values of that feature greater than 3.834, as the NOT of ≤3.834 is equivalent to >3.834.
The above explanation of the threshold selection reveals that the lowest original threshold included in the clause and the highest negated threshold included in the clause decide the upper and lower boundary of the feature values, and these thresholds are the only important thresholds for calculating the clause output. Hence, this motivates us to represent the continuous features in clauses in a new way and train the clauses accordingly as follows.

Sparse Representation of Continuous Features
The above representation of continuous values in clauses is, however, costly, as it needs two times the total number of unique values of TAs per clause. This is more severe when the dataset is large and when there is a large number of input features to be considered. Hence, we introduce SSL automata to represent the upper and lower limits of the continuous features. With the new representation, a continuous feature can then be represented by only two automata, instead of by two times the number of unique values in the considered continuous feature. The new parameters and symbols used in this section are explained and summarized in Table 3, covering:
• r^l_j,k / r^u_j,k: the decision whether the TA of the lower/upper limit of the k-th feature in the j-th clause is to receive Type Ia feedback;
• q^l_j,k / q^u_j,k: the decision whether the TA of the lower/upper limit of the k-th feature in the j-th clause is to receive Type Ib feedback;
• l_j,k: the computed literal value for the k-th feature in clause j.

Input features. As discussed earlier, the TM takes o propositional variables as input. In the new scheme, the input is instead a vector of o continuous features, X^c = [x^c_1, . . . , x^c_o]. Here, SSL^l_k and SSL^u_k are the lower and upper limit values of the k-th continuous feature, respectively. However, the step size within an SSL is in this case not constant: when E in (1) and (2) is 1, the SSL state moves to the next higher neighboring unique value of the continuous feature attached to the SSL. Similarly, when E is 0, the SSL state moves to the next lower neighboring unique value of the considered continuous feature.
Clauses. Each conjunctive clause in the TM receives X^c as input. The inclusion and exclusion decisions for the corresponding upper and lower bounds of x^c_k in the clause are made by TAs. Hence, each clause now needs 2o TAs, where half of them make the decisions related to the lower bounds of the continuous features, while the other half make the decisions related to the upper bounds. The matrix A therefore still contains m × 2o elements: A = (a_j,k) ∈ {1, . . . , 2N}^{m×2o}.
In the phase of calculating clause outputs, both the limit values given by the SSLs and the decisions of the TAs on their corresponding SSLs are considered. The value of the k-th literal, l_j,k, which represents the k-th continuous feature inside clause j when performing the conjunction, is evaluated as follows:
• Condition 1: Both TA^l_j,k and TA^u_j,k, which respectively make the decisions on SSL^l_j,k and SSL^u_j,k, decide to include them in the clause. Then, l_j,k = 1 if SSL^l_j,k < x^c_k ≤ SSL^u_j,k, and l_j,k = 0 otherwise.
• Condition 2: TA^l_j,k decides to include SSL^l_j,k in the clause, and TA^u_j,k decides to exclude SSL^u_j,k from the clause. Then, l_j,k = 1 if SSL^l_j,k < x^c_k, and l_j,k = 0 otherwise.
• Condition 3: TA^u_j,k decides to include SSL^u_j,k in the clause, and TA^l_j,k decides to exclude SSL^l_j,k from the clause. Then, l_j,k = 1 if x^c_k ≤ SSL^u_j,k, and l_j,k = 0 otherwise.
• Condition 4: Both TA^l_j,k and TA^u_j,k decide to exclude their corresponding SSLs from the clause, which effectively takes the lower limit to the lowest possible value and the upper limit to the highest possible value. Hence, l_j,k always becomes 1, so the feature can be excluded when the conjunction is performed.
Hence, when at least one of the TAs that represent the lower and upper limits decides to include its corresponding limit in the j-th clause, the index of the feature is included in I_j, I_j ⊆ {1, . . . , o}. Then, with the literal values given by the above conditions, the clause output is computed as c_j = ∧_{k ∈ I_j} l_j,k, j = 1, . . . , m.
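The four conditions and the resulting clause conjunction can be sketched as follows (function names and the tuple-based encoding are our own):

```python
def literal_value(x, lower, upper, inc_lower, inc_upper):
    """Literal value l_{j,k} for continuous input x, given the SSL
    bounds and the TA include decisions."""
    if inc_lower and inc_upper:   # Condition 1: x inside (lower, upper]
        return int(lower < x <= upper)
    if inc_lower:                 # Condition 2: only the lower bound is active
        return int(lower < x)
    if inc_upper:                 # Condition 3: only the upper bound is active
        return int(x <= upper)
    return 1                      # Condition 4: both excluded, literal is 1

def clause_output_ssl(xs, bounds, includes):
    """Conjunction over the features with at least one included bound."""
    out = 1
    for x, (lo, hi), (inc_l, inc_u) in zip(xs, bounds, includes):
        if inc_l or inc_u:
            out &= literal_value(x, lo, hi, inc_l, inc_u)
    return out
```

Note that a fully excluded feature (Condition 4) simply drops out of the conjunction, which is what lets a clause focus on only a few discriminative features.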
Classification. Similar to the standard TM, the vote difference v is computed as v = ∑_j c_j^+ − ∑_j c_j^−. Once the vote difference is known, the output class is decided using (6).
Learning. In this new setup, the clauses still receive Type I and Type II feedback. However, both TAs and SSLs have to be updated as feedback is received. In other words, Type I and Type II feedback should be able to guide SSLs to learn the optimal lower and upper limits of the continuous features in each clause and lead TAs to correctly decide which limits should be included or excluded in individual clauses.
As discussed earlier, Type Ia feedback reinforces the true positive outputs of clauses by rewarding the include actions of TAs when the literal value is 1. In the new setting, Type Ia feedback updates both SSLs and TAs when x^c_k is within the lower and upper boundaries, SSL^l_j,k < x^c_k ≤ SSL^u_j,k, and when the clause output is 1 when it has to be 1 (positive clauses when y = 1 and negative clauses when y = 0). Under these conditions, the decisions regarding whether the TAs of the lower and upper bounds of the k-th feature in the j-th clause are to receive Type Ia feedback, r^l_j,k and r^u_j,k, are stochastically made as follows:

r^l_j,k = 1 with probability (s − 1)/s, and 0 otherwise;
r^u_j,k = 1 with probability (s − 1)/s, and 0 otherwise.
The environment feedbacks E^l_j,k to update SSL^l_j,k and E^u_j,k to update SSL^u_j,k are 1 and 0, respectively. By doing so, we force the SSLs to tighten the boundary of the continuous feature k, and include the bounds in clause j by reinforcing the include actions of the TAs. Notice that the above updates are made only if the condition in (7) is satisfied.
Type Ib feedback activates if x^c_k is outside either the lower or the upper boundary, x^c_k ≤ SSL^l_j,k ∨ SSL^u_j,k < x^c_k, or if the clause output is 0. For the case where x^c_k ≤ SSL^l_j,k, or where the clause output is 0 when it has to be 1, the decision on whether the TA of the lower bound of the k-th feature in the j-th clause should receive Type Ib feedback, q^l_j,k, is stochastically made as follows:

q^l_j,k = 1 with probability 1/s, and 0 otherwise.

Similarly, when the upper bound requirement is violated, i.e., SSL^u_j,k < x^c_k, or when the clause output is 0 when it has to be 1, the decision for the TA which represents the upper bound is made as follows:

q^u_j,k = 1 with probability 1/s, and 0 otherwise.

The environment feedbacks for the SSLs when Type Ib feedback is applied are 0 and 1 for E^l_j,k and E^u_j,k, respectively. In this way, the SSLs are forced to expand the boundary, and the TAs are discouraged from including their respective SSLs in the clause.
Once the eligible clauses (positive clauses when y = 0 and negative clauses when y = 1) to receive Type II feedback are stochastically selected using (10), the states of the individual SSLs and TAs in them are updated. The original idea of Type II feedback is to combat the false positive outputs of clauses. In the new updating scheme, this is achieved by expanding the boundaries of the k-th feature if x^c_k is outside of the boundary, and including them in the clause, which eventually turns the clause output to 0. Hence, if x^c_k ≤ SSL^l_j,k, the environment feedback on SSL^l_j,k, E^l_j,k, becomes 0, and the state of the TA attached to SSL^l_j,k increases by one step with probability 1. Likewise, if SSL^u_j,k < x^c_k, the environment feedback on SSL^u_j,k, E^u_j,k, becomes 1, and the state of the TA attached to SSL^u_j,k increases by one step with probability 1. The above decisions on receiving Type Ia, Type Ib, and Type II feedback are stored in I_Ia, I_Ib, and I_II, respectively. The processing of the training example in the new scheme ends with the state matrix A of the TAs being updated as A ← A ⊕ I_Ia ⊖ I_Ib ⊕ I_II, and the states of the SSLs being updated according to (1) and (2) with the identified environment feedback E of the individual SSLs.
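The SSL bound updates under the feedback types can be sketched as follows (a simplification in which each feedback is applied deterministically; function names are ours). Tightening mirrors Type Ia on an in-range value, while widening mirrors Type Ib and Type II, one neighboring unique value at a time:

```python
def ssl_move(value, E, uniques):
    """Move an SSL state to a neighboring unique feature value:
    E = 1 moves to the next higher value, E = 0 to the next lower,
    clipped at the ends of the sorted unique-value list."""
    i = uniques.index(value)
    if E == 1:
        return uniques[min(i + 1, len(uniques) - 1)]
    return uniques[max(i - 1, 0)]

def update_bounds(lower, upper, uniques, feedback):
    """Apply one feedback step to the (lower, upper) bound pair.
    'tighten' uses E_l = 1, E_u = 0; 'widen' uses E_l = 0, E_u = 1."""
    if feedback == "tighten":
        return ssl_move(lower, 1, uniques), ssl_move(upper, 0, uniques)
    return ssl_move(lower, 0, uniques), ssl_move(upper, 1, uniques)
```

Repeated tightening shrinks the interval towards the in-range training values a clause sees, while widening (driven by out-of-range values and false positives) relaxes the interval again, so each clause's bounds settle on a discriminative range.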

Empirical Evaluation
In this section, the impact of the new continuous input feature representation on the TM is empirically evaluated using five real-world datasets (the implementation of the Tsetlin Machine, with versions of the relevant software libraries and frameworks, can be found at: https://github.com/cair/TsetlinMachine, accessed on 24 August 2021). The Liver Disorders, Breast Cancer, and Heart Disease datasets are from the health sector, while the Balance Scale and Corporate Bankruptcy datasets are the two remaining datasets. The Liver Disorders, Breast Cancer, Heart Disease, and Corporate Bankruptcy datasets were selected as these applications in the health and finance sectors demand both interpretability and accuracy in predictions. The Balance Scale dataset was added to diversify the selected applications. As an example, we use the Corporate Bankruptcy dataset to examine the interpretability of the TM using both the previous continuous feature representation and the proposed representation. A summary of these datasets is presented in Table 4. The performance of the TM is also contrasted against several other standard machine learning algorithms, namely Artificial Neural Networks (ANNs), Decision Trees (DTs), Support Vector Machines (SVMs), Random Forest (RF), K-Nearest Neighbor (KNN), Explainable Boosting Machines (EBMs) [22], and Gradient Boosted Trees (XGBoost) [50], along with two recent state-of-the-art machine learning approaches: StructureBoost [47] and Neural Additive Models [11]. For comprehensiveness, three ANN architectures are used: ANN-1, with one hidden layer of 5 neurons; ANN-2, with two hidden layers of 20 and 50 neurons; and ANN-3, with three hidden layers of 20, 150, and 100 neurons. The other hyperparameters of each of these models are decided using trial and error. The reported results in this section are the average measures over 50 independent experiment trials.
The data are randomly divided into training (80%) and testing (20%) sets for each experiment.
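The evaluation protocol described above can be sketched as follows. This is an illustrative stand-in, not the paper's actual pipeline: `model_fn` is a hypothetical callable that fits on the training split and returns test-set predictions, and the F1-Score is computed directly from counts.

```python
import numpy as np

def evaluate(model_fn, X, y, trials=50, test_frac=0.2, seed=0):
    """Average F1-Score over `trials` independent runs, each with a fresh
    random 80/20 train/test split, as in the experiments above."""
    rng = np.random.default_rng(seed)
    f1s = []
    for _ in range(trials):
        idx = rng.permutation(len(X))
        cut = int(len(X) * (1 - test_frac))
        tr, te = idx[:cut], idx[cut:]
        pred = model_fn(X[tr], y[tr], X[te])
        tp = np.sum((pred == 1) & (y[te] == 1))
        fp = np.sum((pred == 1) & (y[te] == 0))
        fn = np.sum((pred == 0) & (y[te] == 1))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```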

Bankruptcy
In finance, interpretable machine learning algorithms are preferred over black-box methods to predict bankruptcy as bankruptcy-related decisions are sensitive. However, at the same time, the accuracy of the predictions is also important to mitigate financial losses [51].
The output "bankruptcy" is considered as class 0, while "non-bankruptcy" is class 1. The features, however, are ternary. Thus, either the TM must be used with the proposed SSL scheme to represent the categorical features directly in clauses, or the features must be Booleanized using the Booleanization scheme before being fed into the TM. If the features are Booleanized beforehand, each feature value can be represented by three Boolean features, as shown in Table 5; thus, the complete Booleanized dataset contains 18 Boolean features. First, the behavior of the TM with 10 clauses is studied. The literals included in these 10 clauses at the end of training are summarized in Table 6. In the TM with Booleanized features, the TAs in clause 1 decide to include only the negation of feature 11, ¬x 11, which corresponds to negative credibility after Booleanizing all features. The TAs in clauses 2, 4, 6, 8, and 10 decide to include average competitiveness and negative competitiveness, which appear non-negated in the clauses. The TAs in clauses 3, 5, and 9, on the other hand, decide to include negated negative competitiveness. Clause 7 is "empty", as its TAs decide not to include any literal. Table 6 also contains the clauses learnt by the TM when the SSL continuous feature approach is used. Clauses 2, 4, 6, 8, and 10, which vote for bankruptcy, activate negative competitiveness. Clause 3, on the other hand, recognizes the sub-patterns of class 1, outputting 1 for positive competitiveness. There are four free votes for class 1 from the "empty" clauses 1, 5, 7, and 9, which are again ignored during classification. Note also that, without loss of accuracy, the TM with the SSL approach simplifies the set of rules by not including negative credibility in any of the clauses.
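The Booleanization of the ternary features can be sketched as follows; the concrete one-hot mapping is an assumption for illustration, since the exact encoding is given in Table 5.

```python
def booleanize_ternary(value):
    """Map one ternary feature value (0, 1, or 2) to three Boolean
    features, one-hot style (the exact mapping of Table 5 is assumed)."""
    mapping = {0: [1, 0, 0], 1: [0, 1, 0], 2: [0, 0, 1]}
    return mapping[value]


def booleanize_sample(sample):
    """Six ternary features -> 18 Boolean features, as in the
    Bankruptcy dataset."""
    bits = []
    for v in sample:
        bits.extend(booleanize_ternary(v))
    return bits
```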
With the identified thresholds for the continuous values (categorical in this application), the TM with the SSL approach ends up with the simple classification rule given in Equation (18), which predicts non-bankruptcy otherwise. By asking the TMs to utilize only two clauses, we can obtain the above rule more directly, as shown in Table 7. As seen, again, the TM with both feature representations achieves similar accuracy. The previous accuracy results represent the majority of experiment trials; some experiments fail to reach this optimum accuracy. Instead of conducting the experiments multiple times to find the optimal clause configuration, the number of clauses can be increased to find more robust configurations. Even though this provides stably higher accuracy for almost all trials, a large number of clauses affects interpretability. Here, a balance between accuracy and interpretability must be struck. For the Bankruptcy dataset, Tables 8 and 9 show how robustness increases with the number of clauses, summarizing the average performance (precision, recall, F1-Score, accuracy, specificity) of the TM with the two feature arrangements, respectively. Table 8 reports the results of the TM with the regular Booleanization scheme. The goal here is to maximize the F1-Score, since accuracy can be misleading for imbalanced datasets. As can be seen, the F1-Score increases with the number of clauses and peaks at m = 2000. To obtain this performance with the Booleanized features, the TM classifier uses 3622 literals (include actions; rounded to the nearest integer).
Despite the slight reduction in the F1-Score, the TM with the proposed continuous feature representation reaches its best F1-Score with only 398 literals in the classifier, at m = 500 clauses. This is a more than nine-fold reduction in literals, which far outweighs the accompanying reduction in the F1-Score (0.001). The performance of the TM with both continuous feature arrangements is compared against multiple standard machine learning models: ANN, KNN, XGBoost, DT, RF, SVM, and EBM. The performances of these techniques, along with the best performance of the TM setups, are summarized in Table 10. The best F1-Score is obtained by the TM with regular Booleanized features; the second best belongs to ANN-3 and the TM with the SSL scheme. Memory-wise, the TM with both input feature representations, together with DT, needs close to zero memory at both training and testing, while ANN-3 requires a training memory of 28,862.65 KB and a testing memory of 1297.12 KB. More importantly, the training time per epoch and the number of literals in clauses are reduced with the SSL scheme compared to the Booleanization approach.
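Literal counts like the 3622 and 398 above can be obtained directly from the TA state matrix. The sketch below assumes the common TM convention that a TA in the upper half of its state space takes the "include" action; the threshold convention is an assumption, not necessarily the implementation's.

```python
import numpy as np

def count_included_literals(A, n_states=200):
    """Count the literals with an 'include' action across all clauses.
    A is the TA state matrix (clauses x literals); a state above
    n_states // 2 is assumed to mean 'include'."""
    return int(np.sum(A > n_states // 2))
```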

Balance Scale
We then move to the Balance Scale dataset (available from http://archive.ics.uci.edu/ml/datasets/balance+scale, accessed on 24 August 2021). The Balance Scale dataset consists of three classes: a balance scale that tips to the left, tips to the right, or is in balance. The class is decided collectively by four features: (1) the size of the weight on the left-hand side, (2) the distance from the center to the weight on the left, (3) the size of the weight on the right-hand side, and (4) the distance from the center to the weight on the right. However, we remove the third class, i.e., "balanced", and contract the output to Boolean form. The resulting dataset contains 576 data samples. Tables 11 and 12 contain the results of the TM with the two continuous feature representations, with varying m. The F1-Score reaches its maximum of 0.945 at m = 100 for the TM with the Boolean feature arrangement; the average number of literals used to achieve this performance is 790. The TM with the SSL scheme reaches its maximum performance at m = 500, using 668 literals. The variation of the number of literals over different numbers of clauses for the two continuous feature arrangements is graphed in Figure 4. The TM with the SSL scheme uses fewer literals for all the considered numbers of clauses, with the difference increasing with the number of clauses. For the Balance Scale dataset, the performances of the other machine learning algorithms are also obtained. Along with the TM performance, the prediction accuracies of the other models are presented in Table 13. The highest F1-Score of all the considered models is procured by EBM. Of the two TM approaches, the TM with the SSL scheme shows the best performance in terms of the F1-Score, while using less training time and training memory.
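The class removal and output contraction described above can be sketched as follows. The row format and the label mapping (L -> 0, R -> 1) are assumptions for illustration; the paper does not specify which of the two remaining classes becomes class 1.

```python
def prepare_balance_scale(rows):
    """Drop the 'B' (balanced) class and map the remaining labels to a
    Boolean output, keeping the four integer features. Row format is
    assumed to be (label, lw, ld, rw, rd), as in the UCI data file."""
    data = []
    for label, *features in rows:
        if label == 'B':
            continue                          # discard the third class
        data.append((list(features), 1 if label == 'R' else 0))
    return data
```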

Breast Cancer
The Breast Cancer dataset (available from: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer, accessed on 24 August 2021) contains nine features for predicting the recurrence of breast cancer: age, menopause, tumor size, inverse nodes, node caps, degree of malignancy, side (left or right), position of the breast, and irradiation status. The numbers of samples in the "non-recurrence" and "recurrence" classes are 201 and 85, respectively. However, some of these samples are removed because they contain missing feature values.
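The removal of samples with missing values can be sketched as follows; the `'?'` marker is the one used in the UCI data file, and the row format is assumed to be a flat list of feature strings.

```python
def drop_missing(samples, missing='?'):
    """Remove samples that contain missing feature values (marked '?'
    in the UCI Breast Cancer file)."""
    return [s for s in samples if missing not in s]
```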
The performances of the TMs with the two feature arrangements, and the number of literals included in their clauses to achieve this performance, are summarized in Tables 14 and 15 for the Booleanized scheme and the SSL scheme, respectively. In contrast to the previous two datasets, the F1-Score for the TMs with both feature arrangements peaks at m = 2 and then decreases as m increases. At this point, the TMs with Booleanized and SSL feature arrangements use 21 and 4 literals, respectively. Overall, the TM with the SSL scheme requires the smallest number of literals, as can be seen in Figure 5. All the other algorithms obtain an F1-Score of less than 0.5; as summarized in Table 16, DT, RF, SVM, XGBoost, and EBM perform worst of all the models. The best F1-Score is obtained by the TM with the SSL feature representation, while the TM with Booleanized features obtains the second best. The increase of the F1-Score from 0.531 to 0.555 also comes with the advantage of 19 fewer literals in the clauses for the SSL approach. The training and testing memory usage of the TM with both feature arrangements is negligible. The TM also has the lowest training time of all the algorithms, amounting to 0.001 s.

Liver Disorders
The Liver Disorders dataset (available from: https://archive.ics.uci.edu/ml/datasets/Liver+Disorders, accessed on 24 August 2021) was created in the 1980s by BUPA Medical Research and Development Ltd. (hereafter "BMRDL") as part of a larger health-screening database. The dataset contains seven columns: mean corpuscular volume, alkaline phosphatase, alanine aminotransferase, aspartate aminotransferase, gamma-glutamyl transpeptidase, number of half-pint equivalents of alcoholic beverages drunk per day, and a selector attribute. Some researchers have used this dataset incorrectly by taking the selector attribute as the class label [52]. In our experiments, however, the "number of half-pint equivalents of alcoholic beverages" is used as the dependent variable, Booleanized using the threshold ≥ 3. Furthermore, only the results of the various blood tests are used as feature attributes; i.e., the selector attribute is discarded.
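The preprocessing just described can be sketched as follows; the column order is the one documented for the UCI file, and the function itself is an illustrative helper, not the paper's code.

```python
def prepare_liver_disorders(rows):
    """Use the five blood-test results as features, Booleanize 'drinks'
    (half-pint equivalents per day) with the threshold >= 3 as the label,
    and discard the 'selector' column. Assumed column order (UCI file):
    mcv, alkphos, sgpt, sgot, gammagt, drinks, selector."""
    X, y = [], []
    for mcv, alkphos, sgpt, sgot, gammagt, drinks, _selector in rows:
        X.append([mcv, alkphos, sgpt, sgot, gammagt])
        y.append(1 if drinks >= 3 else 0)
    return X, y
```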
Tables 17 and 18 summarize the performance of the TM with the two feature arrangements. As can be seen, the F1-Score of the TM with Booleanized continuous features peaks at m = 2, while that of the TM with the SSL scheme peaks at m = 10. With 10 clauses, the TM with the SSL scheme uses merely 9 literals in its clauses to acquire a better F1-Score. The growth in the number of literals included in the TM clauses as the number of clauses increases can be seen in Figure 6; again, this confirms that the TM with the SSL scheme uses considerably fewer literals overall. From the performance of the other machine learning models, summarized in Table 19, we can observe that the highest F1-Score (0.729) is produced by RF. The performance of DT in terms of the F1-Score is comparable to that of RF. However, DT requires a training memory of 49.15 KB, while RF uses negligibly small memory for both training and testing on the Liver Disorders dataset. The TM, on the other hand, performs better with the SSL continuous feature representation than with the Booleanized continuous features; this performance is the fourth best among all the models. For training and testing, the TMs with both feature representation approaches require an insignificantly small amount of memory. However, the TM with SSL feature representation requires less training time.

Heart Disease
The last dataset we use is the Heart Disease dataset (available from: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29, accessed on 24 August 2021). The goal is to predict future heart disease risk based on historical data. The complete dataset consists of 75 features. In this experiment, however, the updated version of the dataset is used, containing 13 features: one ordered, six real-valued, three nominal, and three Boolean.
Tables 20 and 21 summarize the performance of the TM with the two feature arrangement schemes. For the TM with Boolean features, the best F1-Score occurs at m = 10, achieved using 346 literals on average. The F1-Score of the TM with SSL continuous features also peaks at m = 10, with only 42 literals. Even though the TM with Boolean features performs better in terms of accuracy, the TM with SSL feature representation outperforms the Boolean representation of continuous features by obtaining a higher F1-Score. Considering the number of literals used with an increasing number of clauses (Figure 7), the two approaches behave almost identically until m = 500, after which the TM with Booleanized features includes more literals in its clauses than the proposed approach. Of the considered machine learning models, as summarized in Table 22, EBM obtains the best F1-Score. However, EBM needs the longest training time and uses the second largest amount of training memory, while both TMs use negligible memory during training and testing and consume much less training time than EBM.

Summary of Empirical Evaluation
To compare the overall performance of the various techniques, we calculate the average F1-Scores across the datasets. Furthermore, to evaluate the overall interpretability of TMs, we also report the average number of literals used overall.
Indeed, the average numbers of literals employed are 961 for the TM with Booleanized continuous features and 224 for the TM with the SSL feature scheme. That is, the TM with SSL feature representation uses 4.3 times fewer literals than the TM with Booleanized continuous features.
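The 4.3-fold reduction follows directly from these averages:

```python
avg_literals_boolean = 961   # TM with Booleanized continuous features
avg_literals_ssl = 224       # TM with the SSL feature scheme
reduction = avg_literals_boolean / avg_literals_ssl
print(round(reduction, 1))   # -> 4.3
```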
It should also be noted that increasing the number of clauses stabilizes the precision, recall, F1-Score, accuracy, and specificity measures: variance becomes negligible for all the datasets and feature representations.

Comparison against Recent State-of-the-Art Machine Learning Models
In this section, we compare the accuracy of the TM with reported results for recent state-of-the-art machine learning models. First, we perform experiments on the Fraud Detection and COMPAS (Risk Prediction in Criminal Justice) datasets to study the performance of the TM in comparison with Neural Additive Models (NAMs) [11]. A Neural Additive Model is a novel member of the family of Generalized Additive Models. In NAMs, the contribution of each input feature to the output is learned by a dedicated neural network, and during training, the complete set of networks is trained jointly to learn complex relationships between inputs and outputs.
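The additive structure of a NAM can be sketched in a minimal forward pass: each feature is processed by its own small network, and the per-feature contributions are summed to give the logit. The network shape (one ReLU hidden layer per feature) and all weights here are hypothetical simplifications of the architecture in [11].

```python
import numpy as np

def nam_forward(x, feature_nets):
    """Minimal NAM forward pass sketch. `feature_nets` is a list of
    (W1, b1, w2, b2) tuples, one tiny network per input feature; the
    per-feature outputs are summed into a single logit."""
    logit = 0.0
    for xi, (W1, b1, w2, b2) in zip(x, feature_nets):
        h = np.maximum(0.0, W1 * xi + b1)   # one-feature hidden layer, ReLU
        logit += float(w2 @ h + b2)         # per-feature contribution
    return logit
```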
To compare the performance against StructureBoost [47], we use the CA weather dataset [53]. For simplicity, we use only the CA-58 subset of the dataset in this study. StructureBoost is based on gradient boosting and is capable of exploiting the structure of categorical variables. StructureBoost outperforms established models such as CatBoost and LightBoost on multiple classification tasks [47].
Since the performance of both of the above techniques has been measured in terms of the Area Under the ROC Curve (AUC), we here use a soft TM output layer [54] to calculate the AUC. The performance characteristics are summarized in Table 23, which shows that, for the Fraud Detection dataset, the TM with the SSL continuous feature representation performs on par with XGBoost and outperforms NAMs and all the other techniques mentioned in [11]. For the COMPAS dataset, the TM with the SSL feature arrangement exhibits competitive performance compared to NAMs, EBM, XGBoost, and DNNs, while showing superior performance compared to Logistic Regression and DT. The performance of the TM on CA-58 is better than that of the StructureBoost, LightBoost, and CatBoost models, as reported in [47].
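Given soft (continuous) TM outputs, the AUC can be computed via the rank-based Mann-Whitney formulation, a standard equivalence to the area under the ROC curve; this sketch is not the evaluation code used in the paper.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC from continuous scores via average ranks:
    AUC = (sum of positive-class ranks - n_pos*(n_pos+1)/2) / (n_pos*n_neg)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks over tied scores
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```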

Conclusions
In this paper, we proposed a novel continuous feature representation approach for Tsetlin Machines (TMs) based on Stochastic Searching on the Line (SSL) automata. The SSLs learn the lower and upper bounds of the continuous feature values inside clauses, and these bounds determine the Boolean representation of the continuous values inside the clauses. We have provided empirical evidence showing that this way of representing continuous features reduces the number of literals included in the learned TM clauses by a factor of 4.3 compared to the Booleanization scheme, without loss of performance. Furthermore, the new representation decreases the total training time from 0.177 s to 0.126 s per epoch and reduces the combined total memory usage from 16.35 KB to 9.45 KB, while exhibiting on-par or better performance. In terms of the average F1-Score, the TM with the proposed feature representation also outperforms several state-of-the-art machine learning algorithms.
In our future work, we intend to investigate the possibility of applying a similar feature representation for multi-class and regression versions of the TM.