Explainable AI using expressive Boolean formulas

,


I. INTRODUCTION
Most of today's machine learning (ML) models are complex, and their inner workings (sometimes with billions of parameters) are difficult to understand and interpret.Yet, in many applications, explainability is desirable or even mandatory due to industry regulations, especially in high-stakes situations (for example, in finance or health care).In situations like these, explainable models can help through increased transparency, with the following additional benefits: 1) explainable models may expose biases, important in the context of ethical and responsible usage of ML, and 2) may be easier to maintain and improve.
While there exist techniques that attempt to explain black-box ML model decisions [1], they can be problematic due to their ambiguity, imperfect fidelity, lack of robustness to adversarial attacks, and likely drift from the ground truth [2,3].Other approaches to explainability focus on constructing interpretable ML models, at times at the price of lower performance [4][5][6][7][8].
In this work we propose an interpretable ML classification model based on expressive Boolean formulas.The Boolean formula defines a rule (with tunable complexity and interpretability) according to Fidelity Public Information which input data are classified.Such a formula can include any operator that can be applied to one or more Boolean variables, such as And and AtLeast.This flexibility provides higher expressivity compared to rigid rule-based approaches, potentially resulting in improved performance and interpretability.
Quantum computers might offer speedups for solving hard optimization problems in the fullness of time [9,10].In the near term, specialized, classical hardware has been developed for speeding up ML [11,12] and optimization workloads [13,14].Such devices, classical or quantum, tend to have a limited scope of applicability (i.e., areas of potential advantage).Moreover, native optimization-i.e., the solving of optimization problems in their natural representation, promises to be more efficient [15], but often requires a custom optimizer that cannot easily take advantage of specialized hardware.For the explainability problem, we develop a native optimization algorithm that utilizes specialized hardware to efficiently solve subproblems, thereby combining the advantages of both techniques-specialized hardware and native optimization.
The main contributions of this paper are: • Improved and expanded Integer Linear Programming (ILP) formulations and respective Quadratic Unconstrained Binary Optimization (QUBO) formulations for finding depth-one rules.
• A native local solver for determining expressive Boolean formulas.
The main findings are: • Expressive Boolean formulas provide a more compact representation than decision trees and conjunctive normal form (CNF) rules for various examples.
• Parameterized operators such as AtLeast are more expressive than non-parametrized operators such as Or.
• The native local rule classifier is competitive with well-known alternatives considered in this work.
• The addition of non-local moves achieves similar results with fewer iterations, and therefore using specialized or quantum hardware could lead to a speedup by fast proposal of non-local moves.
This paper is structured as follows.In Section II, we review related work, and in Section III we provide the problem definition, introduce expressive Boolean formulas, and motivate their usage for explainable artificial intelligence (XAI).In Section IV, we lay out the training of the classifier using a native local optimizer, including the idea and implementation of non-local moves.In Section V we formulate the problem of finding optimal depth-one rules as ILP and QUBO problems.In Section VI, we present and discuss our results, and in Section VII we present our conclusions and discuss future directions.

II. RELATED WORK
Broadly speaking, there are two prevalent approaches to XAI which we briefly (briefly reviewed below): Post-hoc explanation of black-box models (Explainable ML) -Many state of the art ML models, particularly in Deep Learning (DL), are hugeconsisting of a large number of weights and biases, recently surpassing a trillion parameters [16].These DL models are, by nature, difficult to decipher.The most common XAI approaches for these models provide posthoc explanation of black-box model decisions.These approaches are typically model agnostic and can be applied to arbitrarily complex models, such as the ones commonly used in DL.
Local explanation methods apply local approximations to groups of instances, such as LIME [17] and SHAP [18], or to different feature sets, such as MUSE [19].Such methods often suffer from a lack of robustness and can typically be fooled easily by adversarial attacks [2,3].
Global explanation methods aim to mimic black-box models using interpretable models [20,21].These methods benefit from the easy availability of additional data, by querying the black-box model.However, there is an issue of ambiguity-different interpretable models might yield the same high fidelity to the black-box model.Furthermore, since it is generally impossible to achieve perfect fidelity, the resulting model can drift even farther from the ground truth than the original black-box model.
Training interpretable models (Interpretable ML) -The definition of what constitutes an interpretable model is domain-specific and potentially user specific.Nevertheless, many interpretable models have been studied.As a few selected examples, see rule lists [4], falling-rule lists [5], decision sets [6], scoring systems [7], and the more well-known decision trees and linear regression.There is an existing body of work on using specific Boolean formulas as ML models.For example, learning conjunctive/disjunctive normal form (CNF/DNF) rules using a MaxSAT (Maximum Satisfiability) solver [22,23], an ILP solver [24][25][26], or via LP-relaxation [27].
There is a common belief that interpretable models yield less accurate results than more complex blackbox models.However, in numerous cases, interpretable models have been shown to yield comparable results to black-box models [28], which is often the case for structured data with naturally meaningful features.Nonetheless, there may be cases in which interpretable models in fact do yield worse results.In those cases, a loss of accuracy might be a worthwhile concession in exchange for the additional trust that comes with an interpretable model.
Interpretable classifiers, including the classifiers described in this work can be used as stand-alone interpretable models, or they can be used to explain black-box models.The main difference is in the origin of the labels-in the former they would be the ground truth, whereas in the latter they would be the output of the black-box model.

III. PRELIMINARIES
In this section we introduce our problem definition, the objective functions, expressive Boolean formulas and motivate their usage for XAI.

A. Problem definition
We start by stating our problem definition in words:

Definition 1 (Rule Optimization Problem (ROP))
Given a binary feature matrix X, and a binary label vector y, the goal of the Rule Optimization Problem (ROP) is to find the optimum rule R * , that balances the score S of the rule R on classifying the data, and the complexity of the rule C, which is given by the total number of features and operators in R. The complexity might in addition be bounded by a parameter C .
Mathematically, our optimization problem can be stated at a high-level as: where R is any valid rule, R * is the optimized rule, S is the score, C is the complexity, X is a binary matrix containing the input data, y is a binary vector containing the corresponding labels, and λ is a real number that controls the tradeoff between the score and the complexity.The solution of this problem is a single rule.Note that our problem definition is flexible to accommodate different design decisions which are described in more detail below.

B. Objectives
In this problem we have two competing objectives: maximizing the performance, and minimizing the complexity.The performance is measured by a given metric that yields a score S.
This problem could be solved by a multi-objective optimization solver, but we leave that for future work.Instead, we adopt two common practices in optimization: • Combining multiple objectives into oneintroducing a new parameter λ ≥ 0 that controls the relative importance of the complexity (and, therefore, interpretability).The parameter λ quantifies the drop in the score we are willing to accept, in order to decrease the complexity by one.Higher values of λ generally result in less complex models.We then solve a single-objective optimization problem with the objective function S − λC, which combines both objectives into one hybrid objective controlled by the (use-specific) parameter λ.
• Constraining one of the objectives -introducing the maximum allowed complexity, C (also referred to as max_complexity), and then varying C to achieve the desired result.
Normally, formulations include only one of these methods.However, we choose to include both in our formulation, for the following reasoning.The former method does not provide guidance on how to set λ, while the latter method provides tight control over the complexity of the rule, at the price of including an additional constraint.In principle, we prefer the tight control provided by setting C , since adding just one constraint is not a prohibitive price to pay.However, note that if λ = 0, solutions that have an equal score but different complexity have the same objective function value.In reality, in most use cases we expect that the lower complexity solution would be preferred.In order to indicate this preference to the solver, we recommend setting λ to a small, nonzero value.Regardless, in our implementation users can set C , λ, both, or neither of them.
Without loss of generality, in this work we mainly use balanced accuracy as the performance metric.Here, S is equal to the mean of the accuracy of predicting each of the two classes: where T P is the number of true positives, F N is the number of false negatives, and similarly for F P and T N .Generalizations to alternative metrics are straightforward.For balanced datasets, balanced accuracy reduces to regular accuracy.The motivation for using this metric is that many datasets of interest are not well balanced.
A common use case is to explore the score vs. complexity tradeoff by varying C or λ over a range of values, producing a series of rules in the score-complexity space, as close as possible to the Pareto frontier.

C. Rules as expressive Boolean formulas
The ROP (see Definition 1) can be solved over different rule definitions.In this work, we define the the rules R to be Boolean formulas, and in particular expressive Boolean formulas.An expressive Boolean formula (henceforth, a "formula"), as we define it here, consists of literals and operators.Literals are variables f i or negated variables ∼f i .Operators are operations that are performed on two or more literals, such as And(f 0 , f 1 , ∼f 2 ) and Or(∼f 0 , f 1 ).Some operators are parameterized, for example AtLeast2(f 0 , f 1 , f 2 ), which would return true only if at least two of the literals are true.
Operators can optionally be negated as well.For simplicity, we consider negation to be a property of the respective literal or operator rather than being represented by a Not operator.See Fig. 2 for the operators we have included in this work.The definitions and implementation we have used are flexible and modular-additional operators could be added (such as AllEqual or Xor), or some could be removed.
The inclusion of operators like AtLeast is motivated by the idea of (highly interpretable) checklists, such as a list of medical symptoms that signify a particular condition.It is conceivable that a decision would be made using a checklist of symptoms, of which a minimum number would have to be present for a positive diagnosis.Another example is a bank trying to decide whether or not to provide credit to a customer.
It is convenient to visualize formulas as directed graphs (see Fig. 1).The leaves in the graph are the literals which are connected with directed edges to the operator operating on them.To improve readability, we avoid crossovers by including a separate node for each literal, even if that literal appears in multiple places in the formula.Formally, this graph is a directed rooted tree.Evaluating a formula on given values of the variables can be done by starting at the leaves, substituting the variable values, and then applying the operators until one reaches the top of the tree, referred to as the root.
We define the depth of a formula as the longest path from the root to any leaf (literal).For example, the formula in Fig. 1 has a depth of two.We define the complexity as the total number of nodes in the tree, i.e., the total number of literals and operators in the formula.That same formula has a complexity of eight.This definition is motivated by the intuitive idea that adding literals or operators generally makes a given formula less interpretable.In this study, we are concerned with maximizing interpretability, which we shall do by minimizing complexity.

D. Motivation
In this section we motivate our work by illustrating with a few simple examples how rigid-rule based classifiers and decision trees can require unreasonably complex models for simple rules.
Shallow decision trees are generally considered highly interpretable and can be trained fairly efficiently.However, it is easy to construct simple datasets that require very deep decision trees to achieve high accuracy.For example, consider a dataset with five binary features in which data rows are labelled as true only if at least three of the features are true.This is a simple rule that can be stated as AtLeast3(f 0 , . . ., f 4 ).However, training a decision tree on this dataset results in a large tree with 19 split nodes (see Fig. 4).Despite encoding a simple rule, this decision tree is deep and difficult to interpret.
The prevalence of methods of finding optimal CNF (or equivalently, DNF) rules using MaxSAT solvers [22,23] or ILP solvers [24][25][26] suggests that one might use such a formula as the rule for the classifier.However, in this case as well, it is easy to construct simple datasets that require complicated rules.Consider the example abovethe rule AtLeast3(f 0 , . . ., f 4 ) requires a CNF rule with 11 literals, 13 clauses, and a rule length of 29 (number of literals in all clauses).Fig. 3 shows several examples in which decision trees and CNF rules require a complicated representation for simple rules.The complexity C of a decision tree is defined here as the number of decision nodes.The complexity of a CNF formula is defined (conservatively) as the total number of literals that appear in the CNF formula (including repetitions).For CNF rules, we also tried other encodings besides the one indicated in the table (sorting networks [32], cardinality networks [33], totalizer [34], modulo totalizer [35], and modulo totalizer for k-cardinality [36]), all of which produced more complex formulas for this data.One can see that the decision tree encodings ("DT") are the least efficient, followed by the CNF encodings ("CNF"), and finally expressive Boolean formulas ("Rule") are by far the most efficient at encoding these rules.

IV. THE COMPONENTS OF THE SOLVER
In this work we propose to determine optimized expressive Boolean formulas via native local and nonlocal optimization.Here, "native" refers to optimization in the natural search space for the problem.That is in contrast to reformulating the problem in a fixed-format such as MaxSAT, ILP, or QUBO, which would be difficult (if not impossible), and often requires searching a much larger space.A natural search space for this problem is the space of valid expressive Boolean formulas.Next, "local" refers to the exploration of the search space via stochastic local search, i.e., by performing a series of moves that make relatively small changes to the current configuration [37]."Non-local" optimization refers to exploration of the search space via moves that change a larger part of the solution, a form of large neighborhood search [38].Below we describe the native local optimizer in detail, and then we explain the idea of non-local moves and our specific implementation thereof.
In order to define a native local solver, we must define several components: the constraints and search space, how to generate initial rules to start the optimization, the allowed local moves, how to evaluate proposed rules, and the way the solver works.Below we provide information on our choice for each of these components.However, other choices are possible for each of the components and might provide better results.[29], as implemented in pysat [30]."DT" is a decision tree, as implemented in scikit-learn [31]."Rule" is an expressive Boolean formula as defined in this paper.The complexity of a decision tree is defined as the number of decision nodes.The complexity of a CNF formula is defined as the total number of literals that appear in the CNF formula (including repetitions).The complexity of expressive Boolean formulas is defined as the total number of operators and literals (including repetitions), so in this case it is equal to the number of literals plus one.

A. Constraints and search space
This problem can be posed as an unconstrained optimization problem.In order to do so, we must define the search space such that all candidate solutions are feasible.This search space is infinite if C is not set.However, in this case the solver focuses only on a small fraction of the search space due to the regularizing effect of λ.To see this, note that 0 ≤ S ≤ 1, and C ≥ 1.Thus, for a given value of λ, rules with C > 1/λ yield a negative objective function value, which is surely exceeded by at least one rule.Therefore, these rules would be avoided by the solver, or at least de-emphasized.
In principle, rules could involve no literals ("trivial rules"), a single literal, an operator that operates on two or more literals, and any nested combination of operators and literals.In practice, trivial rules and single-literal rules can be quickly checked exhaustively, and for any reasonably complicated dataset would not provide high enough accuracy to be of interest.Therefore, we simplify our solver by excluding such rules.In order to check our assumptions (and as a useful baseline), our experiments include the results provided by the optimal single literal (feature) rule, as well as the optimal trivial rule (always one or always zero).
For parametrized operators, we constrain the search space such that it only includes sensible choices of the parameters.Namely, for AtMost, AtLeast, and Choose, we require that k is non-negative, and is no larger than the number of literals under the operator.These constraints are fulfilled by construction, by picking initial solutions and proposing moves that take them into account.FIG.4: Simple example rule that yields a complex decision tree.The optimal rule for this dataset can be stated as AtLeast3(f 0 , . . ., f 4 ), yet the decision tree trained on this dataset has 19 split nodes and is not easily interpretable.

B. Generating initial rules
Initial rules are constructed by choosing between two and C − 1 literals randomly (without replacement).Literals are chosen for negation via a coin flip.Once the literals have been chosen, an operator and valid parameter (if the operator chosen is parametrized) are selected randomly.These generated rules are of depthone-additional depth is explored by subsequent moves.

C. Generating local moves
Feasible local moves are found by rejection sampling as follows.A node (literal or operator) in the current rule is chosen at random, and a move type is chosen by cycling over the respective move types for the chosen literal/operator.Next, a random move of the chosen type is drawn.If the random move is invalid, the process is restarted, until a valid move is found (see Algorithm 1).Typically only a few iterations are needed, at most.
The literal move types we have implemented are [see Fig. 5(a)-(c)]: • Remove literal -removes the chosen literal but only if the parent operator would not end up with fewer than two subrules.If the parent is a parametrized operator, adjusts the parameter down (if needed) such that it remains valid after the removal of the chosen literal.
• Expand literal to operator -expands a chosen literal to an operator, moving a randomly chosen sibling literal to that new operator.Proceeds only if the parent operator includes at least one more literal.If the parent is a parametrized operator, adjusts the parameter down (if needed) such that it remains valid after the removal of the chosen literal and the sibling literal.
• Swap literal -replaces the chosen literal with a random literal that is either the negation of the current literal, or is a (possibly negated) literal that is not already included under the parent operator.
The operator move types we have implemented are [see Fig. 5

(d)-(f)]:
• Remove operator -removes an operator and any operators and literals under it.Only proceeds if the operator has a parent (i.e., it is not the root), and if the parent operator has at least three subrules, such that the rule is still valid after the move has been applied.
• Add literal to operator -adds a random literal (possibly negated) to a given operator, but only if that variable is not already included in the parent operator.
• Swap operator -swaps an operator for a randomly-selected operator and a randomlyselected parameter (if the new operator is parametrized).Proceeds only if the new operator type is different, or if the parameter is different.

D. Evaluation of rules
The "goodness" of a rule is defined by the objective function in Eq. (1), i.e., as S − λC.In order to calculate the objective function, we must evaluate the rule with respect to the given data rows.The evaluation yields a Boolean prediction for each data row.Evaluating the metric on the predictions and respective labels results in a score S. The complexity of a rule C is a function return proposed_move Algorithm 1: The function that proposes local moves-propose_local_move().This function selects a node (an operator or a literal) randomly, while cycling through the move types, and then attempts to find a valid local move for that node and move type.The process is re-started if needed until a valid local move is found, a process which is referred to as "rejection sampling."only of the rule's structure, and does not depend on the inputs.Therefore, the computational complexity of calculating the objective function is dominated by the rule evaluation, which is linear in both the number of samples and the rule complexity.
The evaluation of literals is immediate, because the result is simply equal to the respective feature (data column), or the complement of that feature.The evaluation of operators is done by evaluating all the subrules, and then applying the operator to the result.Evaluation is usually done on multiple rows at once.For this reason, it is far more efficient to implement evaluation using vectorization, as we have done.In fact, for large datasets, it might be beneficial to parallelize this calculation, because it is trivially parallelizable (over the data rows).In addition, evaluation in the context of a local solver could likely be made far more efficient by memoization-i.e., storing the already evaluated subrules so that they can be looked-up rather than re-evaluated.

E. The native local solver
The solver starts from a new random rule num_starts times, which diversifies the search and is embarrassingly parallelizable.
Each start begins by generating an initial random rule, and continues with the proposal of num_iterations local moves.Moves are accepted based on a Metropolis criterion, such that the acceptance probability of a proposed move depends only on the objective function change and temperature.Initially, the temperature is high, leading to most moves being accepted (exploration).The temperature is decreased on a geometrical schedule such that, in the latter stages of each start, the solver accepts only descending or "sideways" moves-moves that do not change the objective function (exploitation).See Fig. 6 for an example run.

F. Non-local moves
The local native optimizer described above searches the formula space by making small local moves.It is also possible to perform larger, non-local moves.Such moves are more expensive computationally, but they have the potential to improve the objective function more drastically by extending the range of the otherwise local search, as well as allowing the solver to escape local minima.In this section, we describe ways of proposing good non-local moves that are compatible with, and inspired by, quantum algorithms for optimization that might show a speedup over classical algorithms, namely algorithms for solving QUBO and ILP problems [41][42][43][44].
The basic idea behind the proposed non-local moves is   FIG.6: An example native local classifier run on the Breast Cancer dataset [40].The settings for the solver are num_starts = 20, num_iterations = 2000, and the temperatures follow a geometric schedule from 0.2 to 10 −6 .The acceptance probability, which is the probability of accepting a proposed move, is averaged across the starts and a window of length 20.
to make a move that optimizes an entire subtree of the current formula.Given a randomly-selected target node (an operator or literal) and a new operator, we assign the new operator to the target node and optimize the subtree beneath it.
The optimized subtree could be of depth-one (i.e., an operator with literals under it), or of depth-two (i.e., an operator with other operators and literals under it, with literals under those secondary operators).The optimization of the subtree could be limited to a particular subset of the possible subtrees, which can occur when using fixed format optimizers.An example of a depth-two tree of a fixed format is a DNF rule, i.e., an Or of Ands.We refer to the subtree optimization move as swap node with subtree of depth d [see Fig. 7 In order to perform the optimization over the subtree T , we need to determine the effective input data and labels for the subtree optimization problem, X and y , respectively.We now use the notation R(T (x i ), x i ) in order to highlight that the evaluation of the rule can be thought of as being the result of evaluating the subtree, and then evaluating the rest of the rule given the result of the subtree evaluation.For each data row (sample) x i we evaluate the rule R(T (x i ), x i ) twice, by first substituting T (x i ) = 0, and then T (x i ) = 1.If the result is the same in both cases, then it is predetermined, and hence the sample x i does not contribute to the score of any proposed subtree T .If the result is different, then we set y i to the value that causes the classification to be correct, i.e., such that R(x i ) = y i , and add the corresponding sample to X .Note that it is guaranteed in this case that one of these options classifies the sample correctly, because in this case we have two different outputs, and those cover all of the possible outputs (for binary classification).The subtree score is then calculated over only the labels that are not predetermined, i.e., the effective labels.The complexity of the subtree C(T ) is calculated in the same way as it is calculated for the complete rule.
To determine the non-local move, we solve an optimization problem that is determined by fixing R/T 0 (i.e., everything that is not the target node or a descendant of it) in Eq. ( 1) and discarding constants, where T 0 is the target subtree (prior to optimization).In particular, we have where X and y are the input data and effective subtree labels for the non-predetermined data rows, respectively, T is any valid subtree, and T * is the optimized subtree (i.e., the proposed non-local move).
In practice, we must also constrain the complexity of the subtree from below, because otherwise the optimized subtree could cause the rule to be invalid.For this reason, if the target is the root, we enforce min_num_literals = 2, because the root operator must have two literals or more.If the target is not the root, we enforce min_num_literals = 1, to allow the replacement of the subtree with a single literal (if beneficial).The ILP and QUBO formulations we present in Section V do not include this lower bound on the number of literals for simplicity, but its addition is trivial.
Yet another practical consideration is that we we want to propose non-local moves fast, to keep the total solving time within the allotted limit.One way of trying to achieve this is to set a short timeout.However, the time required to construct and solve the problems is dependent on the number of non-determined samples.If the number of samples is large and the timeout is short, the best solution found might be of poor quality.With this in mind, we define a parameter named max_samples which controls the maximum number of samples that can be included in this optimization problem.This parameter controls the tradeoff between the effort required to find a good non-local move to propose, and the speed with which such a move can be constructed.Even with this parameter in place, the time required to find the optimal solution can be significant (for example minutes or more), so we utilize a short timeout.

G. Putting it all together
Now that we have described the idea of the nonlocal moves at a high-level, we describe the way in which we have incorporated them into the native local solver in more detail.Algorithm 2 presents pseudocode for the simulated annealing native local solver with non-local moves.Non-local moves are relatively expensive to calculate, and thus should be used sparingly.Moreover, non-local moves are fairly exploitative (as opposed to exploratory), so in the context of simulated annealing, proposing such moves at the beginning of the optimization would not help convergence, and therefore would be a waste of computational resources.With these points in mind, we define a burn-in period num_iterations_burn_in for each start, in which no non-local moves are proposed.After the burn-in period, non-local moves are proposed only if a certain number of iterations (referred to as the patience) has gone by with no improvements due to proposing local moves.A short example run is presented in Fig. 8, showing the evolution of the objective function over multiple starts, the way in which the non-local moves kick in after an initial burnin period, and that they are indeed able to improve the objective function measurably (but also sometimes fail to do so).

V. DEPTH-ONE ILP AND QUBO FORMULATIONS
In this section we formulate the problem of finding optimal depth-one rules as an ILP or QUBO problem, with various operators at the root.We use these formulations in two ways: 1) to find optimized depthone rules that form the basis of standalone classifiers, and 2) to find good non-local moves in the context of a native local solver, which periodically proposes non-local moves.
We start by following Ref.[27], which explains how to formulate the search for optimal Or, And, and AtLeast rules as ILP problems.Our contributions here are: 1.The addition of negated features.We then explain how to translate these formulations into QUBO formulations.Finally, we describe how the constraints can be softened, thereby reducing the search space significantly.

A. Formulating the Or rule as an ILP
We start by observing that the rule y = Or(f 0 , f 1 ) can be expressed equivalently as: We can then define an optimization problem to find the smallest subset of features f i to include in the Or rule, in order to achieve perfect accuracy: where b is a vector of indicator variables, indicating for each feature whether it should be included in the rule (i.e., b i = 1 if feature f i is included and b i = 0 otherwise), X P is a matrix containing only the rows labelled as "positive" (y = 1), X N is a matrix containing only the rows labelled as "negative" (y = 0), and 0 and 1 are vectors containing only zeros and ones, respectively.We extend this formulation by adding the possibility to include negated features in the rule.While this could be accomplished simply by adding the negated features to the input data X, thereby doubling the size of this matrix, it may be more efficient to include the negation in the formulation.To accomplish this, we add an additional vector of indicator variables b, indicating for each negated feature whether it should be included in the rule.We then replace Xb → Xb + Xb , noting that X = 1 − X (where 1 is now a matrix of ones) since a binary variable v ∈ {0, 1} is negated by 1 − v. Therefore, we have: where || b|| 0 is the sum over the entries of b.By substituting the above into Eq.( 5) and adding a corresponding term for b to the objective function, we find: In practice, we typically do not expect to be able to achieve perfect accuracy.With this in mind, we introduce a vector of "error" indicator variables η, indicating for each data row whether it is misclassified.When the error variable corresponding to a particular sample is 1, the corresponding constraint is always true by construction, effectively deactivating that constraint.Accordingly, we change our objective function such that it minimizes the number of errors.To control the complexity of the rule, we add a regularization term, as well as an explicit constraint on the number of literals.Finally, in order to deal with unbalanced datasets, we allow the positive and negative error terms to be weighted differently: where m (also referred to as max_num_literals) is the maximum number of literals allowed, m ≤ m, where m is the number of features, and w P and w N are the positive and negative class weights, respectively.The The solver executes num_starts starts, each with num_iterations iterations.In each start, a random initial rule is constructed, and then a series of local (see Algorithm 1) and non-local moves (see Section IV F) are proposed and accepted based on the Metropolis criterion.Non-local moves are only proposed after an initial num_iterations_burn_in iterations and if there has been no improvement over patience iterations.The initial rule and the proposed moves are constructed such that the current rule is always feasible, and in particular has a complexity no higher than max_complexity.Non-local moves replace an existing literal or operator with a subtree, optimized over a randomly selected subset of the data of size max_samples.The solver returns the best rule found.Some details omitted due to lack of space.default choice for the class weights is to make them inversely proportional to the fraction of samples of each respective class.In this work, we set w P = n/(2n P ) and similarly w N = n/(2n N ), where n P is the number of positive samples, n N is the number of negative samples, and n is the total number of samples (n = n P + n N ).
Inspecting Eq. ( 7), one can see that when a positive data row is misclassified (η = 1), then the first constraint is always fulfilled.This is because the left hand side of that constraint is then the sum of one plus a nonnegative number, which is always larger than or equal to one.Similarly, when a negative data row is misclassified, then the second constraint is always satisfied.This is because the left hand side of that constraint consists of a number that is bound from above by m from which m is subtracted, which is surely negative or zero.Notice that this second constraint, for the negative samples, was previously an equality constraint [in Eq. ( 6)].It was changed to an inequality constraint in order to make the "deactivation" (for η = 1) work.

B. Formulating the And rule as an ILP
In the previous section we described the ILP formulation for the Or operator.In this section we show how this formulation can be extended to the And operator.It is instructive to first consider why it is useful to go beyond the Or rules at all, and what is the exact nature of the relationship between Or and And rules in this context.
We note that De Morgan's laws (which we shall use below) guarantee that every Or rule has an equivalent ∼And rule (and similarly for And and ∼Or).The relation between the distribution of the different rules is apparent from Fig. 9 where we plot the score landscape for both single feature and double feature rules.In particular, the distribution of Or clearly matches the distribution of ∼And (and similarly for And and ∼Or), as expected.More importantly for us, the distribution of Or is clearly very different from the distribution of And, motivating the introduction of the latter.Similarly, the distribution of non-negated single feature rules is clearly very different from the distribution of negated single feature rules, again motivating the inclusion of negated features.As an aside, note the reflection symmetry in both plots, which is due to the identity S(R) = 1 − S(∼R).
The best (optimal) rules for each distribution are given in Table I.The best And rule is significantly better than the best Or rule in this case, clearly motivating its inclusion.Furthermore, the reason for including negated features is clear from comparing the best single feature rule scores.
As a final motivational observation, we note that as max_num_literals is increased, Or rules generally tend to produce false positives (since additional features tend to push more outputs to one), whereas And rules generally tend to produce false negatives (since additional features tend to push more outputs to zero).Different use cases might lend themselves to either of these operators.However, both of these operators are both fairly rigid, a point which is mitigated by the introduction of parametrized operators (see Section V C), and the introduction of higher depth rules, as found by the native local solver (see Section IV).
To formulate the And operator, we use De Morgan's laws.Namely, starting from Eq. ( 7) we swap b ↔ b, η N ↔ η P , and X N ↔ X P , to find: min (w

C. Extending the ILP formulation to parametrized operators
So far we have shown how the problem of finding an optimal depth-one rule with Or and And operators in the root can be formulated as an ILP.Here we extend these formulations to the parameterized operators AtLeast, AtMost, and Choose.This is motivated by the hypothesis that the additional parameter in these operators could provide additional expressivity, which could translate into better results for particular datasets, at least for some values of max_num_literals.
We probe this idea in an experiment, in which we vary max_num_literals over the full possible range for the Breast Cancer dataset in Fig. 10.In this experiment, we fix max_num_literals=min_num_literals to expose the true objective value over the full range.One can clearly see that the additional expressivity of the parameterized operators allows them to maintain a higher accuracy over most of the possible range, for both the train and test samples.For the unparameterized operators, we observe significantly reduced performance at high max_num_literals, explained by the fact that eventually the solver runs out of features that can be added and still improve the score.For the parameterized operators, this limitation is mitigated by adjusting the corresponding parameter.
To formulate the AtLeast operator, we modify Eq. ( 4) such that the rule y = AtLeastk(f 0 , f 1 ) can be expressed equivalently as Accordingly, we modify Eq. ( 7) to obtain the ILP formulation for AtLeast: noting that k is a decision variable that is optimized over by the solver, rather than being chosen in advance.
Similarly, we can formulate AtMost as: min (w Note that any AtLeastk rule has an equal-complexity equivalent AtMost rule that can be obtained by taking k → F − k and f i → ∼f i for each feature f i in the original rule, where F is the number of features in the original rule.For example, AtLeastk(f 0 , f 1 ) is equivalent to AtMost[2-k](∼f 0 , ∼f 1 ).For this reason, we expect similar numerical results for AtLeast and AtMost.However, the actual rules differ, and a user might prefer one over the other, for example due to a difference in effective interpretability.TABLE I: The best (optimal) rules for single and double feature rules for the full Breast Cancer dataset."Type" is the type of rule, "Rules" is the number of rules of that rule type, "Accuracy" is regular accuracy (the metric used for this table), and "Rule" is one of the optimal rules for that respective rule type.Negations of features are post-processed out for readability by reversing the relationship in the respective feature name (e.g., ∼a > 3 → a ≤ 3).Finally, formulating Choose is more complicated, because we need to formulate a not-equal constraint.Let us first write the equivalent of the perfect Or classifier of Eq. ( 6) (not in ILP form yet): min (w For the first constraint, we note that any equality constraint a = b can be split equivalently into two inequalities, a ≤ b and a ≥ b.Then, we already know how to add error variables to inequalities, as we did for the other formulations.Similarly, we can split the not-equal constraint into two inequalities, because any not-equal constraint a = b over integers can be split into two disjunctive inequalities, a ≥ b + 1 and a ≤ b − 1.Once again, we already know how to add error variables to inequality constraints.However, in this case the constraints are mutually exclusive, and adding both of them causes our model to be infeasible.There is a well-known trick for modelling either-or constraints that can be applied here [45].We add a vector of indicator variables q that chooses which constraint of the two to apply, for each negative sample.We end up with: min (w One can think of Choose as being equivalent to a combination of AtLeast and AtMost.Accordingly, one can readily see that Eq. ( 13) is equivalent to a combination of Eq. ( 10) (if q = 0) and Eq. ( 11) (if q = 1).The points correspond to the mean of the balanced accuracy and complexity over those splits.The error bars are given by the standard deviation over the balanced accuracy and complexity on the respective 32 splits for each point.

D. Converting the ILPs to QUBO problems
We seek to convert the above ILPs to corresponding QUBO problems, motivated by the fact that many quantum algorithms and devices are tailored towards QUBO problems [42,[46][47][48][49].We start by describing the standard method [50], argue for a subtle change, and then lay out clearly a general recipe for converting an ILP to a QUBO problem.
A QUBO problem is commonly defined as where x is a vector of binary variables, and Q is a real matrix [50].The standard method of including equality constraints such as a T x = b in a QUBO is to add a squared penalty term P (a T x − b) 2 , where a is a real vector, b is a real number, and P is a positive penalty coefficient.
We now describe how to include inequality constraints such as a T x ≤ b (without loss of generality) in a QUBO.We first note that a T x is bound from above by the sum of positive entries in a (denoted by a + ) and from below by the sum of negative entries in a (denote by a − .)For the constraint a T x ≤ b, the assumption is that b < a + , i.e., that b provides a tighter bound (or else this constraint is superfluous).Therefore, we can write any inequality constraint in the form l ≤ a T x ≤ u, where l and u are the lower and upper bounds respectively, which is the form we use for the rest of this section.
We can convert this inequality constraint into an equality constraint by introducing an integer slack variable 0 ≤ s ≤ u − l, which can be represented via a binary encoding.We can then use the above squaring trick to find the corresponding penalty term.It is important to note that there are two ways to do so: a T x = u − s (from above).
These equality constraints could then be included by adding the respective penalty term P (a T x − l − s) 2 (from below) ( 16) This brings us to a subtle point that is not commonly discussed.For an exact solver, it should not matter, in principle, which of the two forms of the penalty terms we choose to add.However, when using many heuristic solvers, it is desirable to reduce the magnitude of the coefficients in the problem.In the case of quantum annealers, the available coefficient range is limited, and larger coefficients require a larger scaling factor to reduce the coefficients down to a fixed range [51].
The larger the scaling factor, the more likely it is that some scaled coefficients will be within the noise threshold [52].In addition, for many Monte Carlo algorithms, such as simulated annealing, larger coefficients require higher temperatures to overcome, which could lead to inefficiencies.
In order to reduce the size of coefficients, we note that the above formulations are almost the same; they differ only in the sign in front of the slack variable s, which does not matter for this discussion, and the inclusion of either l or u in the equation.This motivates us to recommend choosing the penalty term that contains the bound (l or u) that has a smaller absolute magnitude, because this yields smaller coefficients when the square is expanded.
Based on the above, we now provide a compact recipe for converting ILP problems to QUBO problems: 1. Assume we have a problem in canonical ILP form: 2. Convert the inequality constraints to equivalent equality constraints by the addition of slack variables: where we have adopted, without loss of generality, the "from above" formulation.
3. Convert to a QUBO: where we have dropped a constant, and diag(c) is a square matrix with c on the diagonal.
Using this recipe, it is possible to translate each of the five ILP formulations to corresponding QUBO formulations.As a representative example, we show how to do so for the Or formulation.We start from Eq. ( 7), which is reproduced here for easier reference: min (w Upon applying the conversion recipe, we find the following QUBO: min (w where the curly brackets should be interpreted as a sum over the rows of the vector expression within the brackets, s and r are vectors of slack variables, t is a slack variable, and L 1 and L 2 are positive penalty coefficients.The strength of the maximum number of literals constraint should be much larger than the strength of the soft constraints, to ensure it is enforced, i.e., L 2 L 1 .We do not write out the matrices in the QUBO formulation Q cost and Q penalty explicitly here, but they can readily be identified from Eq. ( 21).

E. Reducing the number of variables
In this section we describe a way of reducing the number of variables in the QUBO formulation by eliminating the error variables.
Recall that we introduced the misclassification indicator variables η in order to soften the constraints.This inclusion made sense for the ILP formulation in which the constraints would otherwise be applied as hard constraints.
In QUBO formulations, the difference between soft and hard constraints is just in the magnitude of the penalty coefficients.Therefore, we can, in principle, eliminate the error variables from the "imperfect" formulations and then add the constraints as soft constraints by construction, setting the penalty coefficients for those constraints to be relatively small.
For example, for the Or operator, we take Eq. ( 20), delete the error variables from the objective function and constraints, and label those constraints as "soft": We then apply the QUBO to ILP recipe described above to each constraint, and add the class weights to find: where we use the same curly bracket notation used in Eq. ( 21), s is a vector of slack variables, t is a slack variable, and L 1 and L 2 are positive penalty coefficients.As before, the strength of the maximum number of literals constraint should be much larger than the strength of the soft constraints, to ensure it is enforced, i.e., L 2 L 1 .The reduction in problem size due to softening the constraints is generally significant (see Table II).For TABLE II: Number of variables required for various QUBO and ILP formulations."Operator" is the operator at the root of the depth-one rule, "with η" refers to the QUBO formulation in which misclassifications are allowed via additional error variables, "without η" refers to the QUBO formulation in which misclassifications are allowed by soft constraints.The numbers quoted are for the complete Breast Cancer dataset which has 63% negative labels, with max_num_literals = 4. example, for the Or formulation, we save the addition of the η variables, one per sample.However, the dominant savings are in the slack variables for the negative data rows, because those constraints become equality constraints, thus avoiding the need for slack variables.In addition, there is a reduction in the range of the slack variables for the positive data rows, which can result in an additional reduction.The number of variables for each formulation is given by Or with η: (
We hypothesize that the reduced problem size likely leads to a reduced time to solution (TTS).The TTS is commonly calculated as the product of the average time taken for a single start τ , and the number of starts required to find an optimum with a certain confidence level, usually 99%, referred to as R 99 , i.e., TTS = τ R 99 [13].The reduction in problem size should yield a shorter time per iteration, resulting in a smaller τ .In addition, one might expect the problem difficulty to be reduced, resulting in a smaller R 99 , but this is not guaranteed, as smaller problems are generally easier, but not always.See Fig. 11 for the results of an experiment comparing the two formulations (with/without η) for the Or and And operators.We observe, anecdotally, that the "without η" formulation leads to a similar or better accuracycomplexity curve, and even when both formulations have similar performance, the runtime for the formulation without η is much shorter (potentially by orders of magnitude).This is in line with the above theoretical arguments, however we leave a detailed comparison of the TTS for the two formulations for future work.
Finally, it is worth noting that the elimination of error variables causes the solution space to be biased.The objective function in this case is a sum of squares of violations (residuals) of the sample constraints.Therefore, given two solutions that have the same score S (i.e., degenerate solutions), the solution with the lower sum of squares of violations is preferred.In some sense, one might argue that this bias is reasonable, because the solution that is preferred is "wrong by less" than the solution with the larger sum of squares of violations.

F. Number of variables and the nature of the search space
When calculating the Pareto frontier, we plan to solve a series of optimization problems starting from a small complexity bound m (i.e., max_num_literals) and gradually increasing it.For this reason, it is interesting to consider the dependence of the number of variables and the search space size on m .For simplicity, we limit this discussion to non-parametrized operators.
From Eq. ( 24), we can see that for a fixed dataset of dimensions n × m, the number of variables in the QUBO formulations increases in a step-wise fashion as m increases (see Fig. 12a).The search space size for each of the QUBO formulations is given by 2 num_vars where num_vars is the number of variables in the respective formulation.The size of the search space for the ILP formulation, which is exactly equal to the feasible space, is much smaller and is given by For a visual comparison of the sizes of the infeasible space (QUBO) and the feasible space (ILP) see Fig. 12b.It is clear that the space searched by the QUBO formulations is far larger than the feasible space, handicapping the QUBO solver.For example, for max_num_literals ≈ 10 we find that the QUBO search space surpasses the ILP search space by more than 400 orders of magnitude.This motivates, in part, the usage of the QUBO solver as a subproblem solver, as described later in this paper, instead of for solving the whole problem.When solving smaller subproblems, the size gap between the feasible space and the infeasible space is relatively smaller and might be surmountable by a fast QUBO solver.Furthermore, by inspecting the constraints, we can see that moving from a feasible solution to another feasible solution requires flipping more than a single bit.This means that the search space is composed of single feasible solutions, each surrounded by many infeasible solutions with higher objective function values, like islands in an ocean.This situation is typical for constrained QUBO problems.Under these conditions, we expect that a single bit-flip optimizer, such as vanilla simulated annealing, would be at a distinct disadvantage.In contrast, this situation should, in principle, be good for quantum optimization algorithms (including quantum annealing), because these barriers between the feasible

Fit time [sec]
And (with η) And (without η) Or (with η) Or (without η) FIG. 11: Test and timing results the QUBO depth-one classifier with Or and And, for the two formulations (with/without η) for the Direct Marketing dataset [53].Each classifier is trained and tested on 32 stratified shuffled splits with a 70/30 split (for cross-validation) over the in-sample data (80/20 stratified split).The points correspond to the mean of the balanced accuracy and complexity over those splits.The error bars are given by the standard deviation over the balanced accuracy and complexity on the respective 32 splits for each point.
solutions are narrow and, therefore, should be relatively easy to tunnel through [54].

VI. BENCHMARKING METHODOLOGY AND RESULTS
In this section we state our research questions, the setup for our numerical experiments, and present and discuss our results.

A. Research questions
We consider the following research questions (RQ) to demonstrate the effectiveness of our approach: RQ1.What is the performance of each solution approach with respect to the Pareto frontier, i.e., score vs. complexity?
RQ2.Does sampling help to scale to large datasets?
RQ3. Are non-local moves advantageous vs. using just local moves, and under what conditions?

B. Benchmarking methodology
Datasets -See Table III for the binary classification datasets included in our experiments.We have selected datasets with varied characteristics.In particular, the number of samples ranges from 195 (very small) to ∼1.3 • 10 5 (very large), and the number of binarized features ranges from 67 to 300.
Several datasets included in our study contain samples with missing data.Those samples were removed prior to using the data.We removed 393 samples from the Airline Customer Satisfaction dataset, 11 samples from the Customer Churn dataset, and 2596 samples from the Home Equity Default dataset.The first two are negligible, but the latter comprise of 44% of the data.
Binarization -In order to form Boolean rules, all the input features must be binary.For this reason, we binarize any features that are not already binary.Features that are already binary do not require special treatment.Features that are categorical, or numerical with few unique values, are binarized using a one-hot encoding.For features that are numerical with many unique values, many binarization methods are available, including splitting them into equal-count bins, into equalwidth bins, and into bins that maximize the information gain [62].We hypothesize that the choice of method of binarization can have a strong effect on downstream ML models, though we leave this for future work.
In this paper, the binarization is carried out by binning the features followed by encoding the features using a one-hot encoding.The binning is carried out by calculating num_bins quantiles for each feature.For each quantile, a single "up" bin is defined, extending from that quantile value to infinity.Because our classifiers all include the negated features, the corresponding "down" bins are already included by way of those negated features.In all experiments we set num_bins = 10.
Classifiers -We include several classifiers in our experiments.First, the baseline classifiers: • Most frequent -A naive classifier that always outputs the label that is most frequent in the training data.This classifier always gives exactly  24).The step-wise form is a result of the binary encoding of the slack in the inequality constraints.The size of the feasible space, which is equal to the size of the search space for the ILP solver, as well as the size of the much larger (feasible and infeasible) space searched by the QUBO formulation (without η) is plotted in (b).The inset shows a zoomed in version of the former.TABLE III: Datasets included in our experiments."Rows" is the number of data samples with no missing values, "Features" is the number of features provided, "Binarized" is the number of features after binarization, "Majority" is the fraction of data rows belonging to the larger of the two classes, and "Ref." is a reference for each dataset.0.5 for balanced accuracy.We exclude this classifier from the figures to reduce clutter and because we consider it as a lower baseline.This classifier is easily outperformed by all the other classifiers.
• Single feature -A classifier that consists of a simple rule, containing just a single feature.The rule is determined in training by exhaustively checking all possible rules consisting of a single feature or a single negated feature.
• Decision tree -A decision tree classifier, as implemented in scikit-learn [31] with class_weight = "balanced".Note that decision trees are able to take non-binary inputs, so we also include results obtained by training a decision tree on the raw data, with no binarization.In order to control the complexity of the decision tree, we vary max_depth which sets the maximum depth of the trained decision tree.The complexity is given by the number of split nodes (as described in Section III D).
Then, the depth-one QUBO and ILP classifiers: • ILP rule -A classifier that solves the ILP formulations described in Sections V A to V C for a depth-one rule with a given operator at the root, utilizing FICO Xpress (version 9.0.1) with a timeout of one hour.In order to limit the size of the problems, a maximum of 3,000 samples is used for each cross-validation split.
• QUBO rule -A classifier that solves the QUBO formulations described in Section V D for a depthone rule with a given operator at the root, utilizing simulated annealing as implemented in dwaveneal with num_reads = 100, and num_sweeps = 2000.In order to limit the size of the problems, a maximum of 3,000 samples is used for each crossvalidation split.The results were generally worse than the ILP classifier results (and are guaranteed not to be better in score), so to reduce clutter we do not include them below (but see some QUBO results in Fig. 11).A likely explanation for the underwhelming results is explained in Section V F -this is a single bit-flip optimizer, but going from feasible solution to feasible solution in our QUBO formulations requires flipping more than one variable at a time.
Finally, the native local solvers: • SA native local rule -The simulated annealing native local rule classifier described in Section IV E, with num_starts = 20, and num_iterations = 2000.The temperatures follow a geometric schedule from 0.2 to 10 −6 .
• SA native non-local rule -The simulated annealing native local rule classifier, with additional ILP-powered non-local moves as described in Section IV F. Uses the same parameter values as the native local rule classifier, and in addition the burn-in period consists of the first third of the steps, patience = 10, max_samples = 100, and the timeout for the ILP solver was set to one second.
Cross-validation -Each dataset is split into insample data and out-of-sample data using an 80/20 stratified split.The in-sample data are then shuffled and split 32 times (unless indicated otherwise in each experiment) into train/test data with a 70/30 stratified split.All benchmarking runs are done on Amazon EC2 utilizing c5a.16xlarge instances, which have 64 vCPUs and 128 GiB of RAM.Cross-validation for each classifier and dataset is generally performed on 32 splits in parallel in separate processes, on the same instance.
Hyperparameter optimization -Very minimal hyperparameter optimization is performed-it is assumed that the results could be improved by parameter tuning, which is not the focus of this work.In addition, it is possible that using advanced techniques for solving ILPs, such as column generation, would improve the results measurably.Note that λ = 0 is used in all experiments to simplify the analysis.

C. Results and Discussion
Pareto frontier (RQ1) -On each dataset, for each classifier, we report the mean and standard deviation of the balanced accuracy and of the complexity over those splits, for the train/test portions of the data.Some classifiers are unparameterized, so we report a single point for each dataset (single feature classifier and most frequent classifier).The decision tree classifier results are obtained by changing the max_depth parameter over a range of values.For the depth-one QUBO and ILP classifiers, the results are obtained by changing the max_num_literals parameter over a range of values.Finally, for the native local optimizers the results are obtained by changing the max_complexity parameter over a range of values.The decision tree is also run on the raw data, without a binarizer, which is indicated by the suffix "(NB)".
For all datasets except the three smallest ones (up to 1,000 samples), the train and test results are virtually indistinguishable.For this reason, we present train and test results for the smaller datasets, but only test results for the larger datasets.The near identity of the train and test results is likely explained by the fact that our classifiers are (by design) highly regularized by virtue of the binarization and their low complexity.Therefore, they do not seem to be able to overfit the data, as long as the data size is not very small.See Fig. 13 for train and test results for the three smallest datasets, and Fig. 14 for test results for the six larger datasets.Finally, example rules found by the native local optimizer for each of the datasets are presented in Table IV.
The native local rule classifier is found to be generally competitive with the other classifiers; it achieves similar scores to the decision tree, despite the decision tree involving a more flexible optimization procedure.The ILP classifiers are shallower, but still achieve similar scores to the native local solver for most datasets.However, the ILP classifiers require significantly longer run times.This suggests that the additional depth does not always provide an advantage.However, for some of the datasets, such as Direct Marketing and Customer Churn, there is a slight advantage to the native local solver, and hence an apparent advantage to deeper rules.However, note that deeper rules could be considered less interpretable, and so less desirable for the XAI use case.
Of note, the single feature rule classifier baseline beats or matches all other classifiers on the Credit Card Default, Credit Risk, and Online Shoppers Intention datasets, suggesting that for those datasets a more complex model than those included in this work is required in order to achieve higher scores.The best score achieved varies significantly across the datasets studied, suggesting that they feature a range of hardnesses.
For all datasets except Parkinsons, the great majority of the decision tree classifier's results have such a high complexity that they are outside of the plot.This is another example of the issue pointed out in Section III D-decision trees tend to yield high complexity trees/rules even for single-digit values of max_depth.
Comparing the different ILP operators, the results suggest that the parametrized operators AtMost and AtLeast offer an advantage over the unparameterized operators, which could be explained by their additional expressivity.We note that performance of AtMost and AtLeast is largely identical, in line with the similarity of these operators.The results for the third parametrized operator, Choose, are poor for some of the datasets, TABLE IV: Example rules obtained by the native local solver for each dataset."Dataset" is the name of the dataset, and "Rule" is the best rule found by the first of the cross-validation splits, for the case max_complexity = 4.The variable names in the rules are as obtained from the original datasets.Negations of features are post-processed out by reversing the relationship in the respective feature name (e.g., ∼a = 3 → a = 3).which is likely due to the larger problem size in this formulation, thereby resulting in the ILP solver timing out before finding a good solution.In fact, many of the other ILP runs timed out despite sampling down the larger datasets to 3,000 samples (for training), which suggests that trying to prove optimality for large datasets might not be scalable.Furthermore, as anticipated in Section V B, for some of the datasets the results for the And operator are better than the results for the Or operator (for example, for the Breast Cancer dataset).Sampling (RQ2) -The main objective of this experiment is to assess whether large datasets can be tackled via sampling, i.e., by selecting a subsample of the data to be used for training the classifier (see Fig. 15).This experiment is run only on the three largest datasets to allow for enough room to change the sample size.
We observe that the training score goes down with train-sample size until saturation, presumably since it is harder to fit a larger dataset.At the same time, the test score goes up with train-sample size, presumably since the larger sample provides better representation and, therefore, the ability to generalize.The score distribution on the train and test sets generally narrows down with increasing train-sample size, presumably converging on the population's score.We observe that the train and test scores are initially very different for small trainsample sizes but converge to similar values for large trainsample sizes.These results suggest that several thousand samples might be enough in order to reach a stable score for the ILP and native local solvers, in comparison to Deep Learning techniques that often require far more data (for example [63]).We comment also that the sampling procedure is attractive due to its simplicity and due to it being classifier agnostic.
Native non-local solver (RQ3) -The main objective of this experiment is to assess whether the non-local moves provide an advantage and under which conditions.With this in mind we vary both max_complexity and num_iterations in Fig. 16.
In the train results, it is clear that the non-local moves provide a meaningful improvement (several percent in Balanced Accuracy) for two out of the three datasets (and a marginal improvement for the third), but only at the higher complexity, suggesting that one can use the non-local moves to fit the data with a smaller number of iterations.If a solver is available that can solve the optimization problem to find non-local moves fast, then using such a solver might lead to faster training.At lower complexity, there is not enough "room" for the non-local moves to operate, so they do not provide an advantage.
In the test results, the error bars are overlapping, so it is not possible to make a strong statement.However, we note that the point estimates for the mean are slightly improved over the whole range of complexities for all three datasets.
Note that the very short timeout for the ILP solver to find each non-local move likely curtailed the solver's ability to find good moves.In addition, ILP solvers typically have many parameters that control their operation, which were not adjusted in this case.It is likely that asking the solver to focus on quickly finding good solutions would improve the results.We leave this direction for future research.), the single feature classifier ("Single"), the decision tree ("DT"), and the decision tree with no binarizer ("DT (NB)"), for the three smallest datasets.Each classifier is trained and tested on 32 stratified shuffled splits with a 70/30 split (for cross-validation) over the in-sample data (80/20 stratified split).The points correspond to the mean of the balanced accuracy and complexity over those splits.The error bars are given by the standard deviation over the balanced accuracy and complexity on the respective 32 splits for each point.Continued in Fig. 14.

VII. CONCLUSIONS AND OUTLOOK
We have proposed a class of interpretable ML classification models based on expressive Boolean formulas, and discussed in detail the underlying training procedure based on native local optimization techniques, with the optional addition of non-local moves.These non-local moves are proposed by solving an optimization problem that can be formulated as an ILP or a QUBO problem.
As such, we foresee an opportunity to gain potential speedups by using hardware accelerators, including future quantum computers.
Solving the XAI problem to optimality is not strictly required in most practical scenarios, thus potentially lowering the requirements on noise for quantum devices.The formulations we have introduced are usable also as standalone classifiers in which the Boolean rule forming the basis for the classifier is very shallow (depth-one).Future work may center on applying these classifiers to other datasets, introducing new operators, or applying these concepts to other uses cases.

FIG. 1 :
FIG.1:A simple expressive Boolean formula.This formula contains six literals and two operators, has a depth of two, and a complexity of eight.It can be stated also as And(Choose2(a, b, c, d), ∼e, f ).
l i t e r a l _ m o v e _ t y p e s = cycle ({ " rem ove_liter al " , " e x p a n d _ l i t e r a l _ t o _ o p e r a t o r " , " swap_literal " }) o p e r a t o r _ m o v e _ t y p e s = cycle ({ " r e mo ve _o p er at or " , " add_literal " , " swap_operator " }) def p r o p o s e _ l o c a l _ m o v e ( current_rule ) : a l l _ o p e r a t o r s _ a n d _ l i t e r a l s = current_rule .flatten () proposed_move = None while proposed_move is None : target = random .choice ( a l l _ o p e r a t o r s _ a n d _ l i t e r a l s ) if i s i n s t a n c e ( target , Literal ) : move_type = next ( l i t e r a l _ m o v e _ t y p e s ) else : move_type = next ( o p e r a t o r _ m o v e _ t y p e s ) proposed_move = ge t _r an do m _m ov e ( move_type , target ) FIG. 5: Local move types.Moves on literals are shown in (a)-(c) and moves on operators in (d)-(f).All moves are relative to the initial rule shown in Fig. 1.
Evolution of the objective function.
Evolution of the acceptance probability.
(a)-(b)], and in this work we focus on the d = 1 case.
FIG.7: Non-local moves-swap node with subtree of depth d.These panels show examples of moves that replace an operator (or a literal, not shown) with a chosen operator (in this case Or) and an optimized depth-one (a) and depth-two (b) subtree.The latter shows an example disjunctive normal form move (Or of Ands), but other structures are possible.Both moves are relative to the initial rule shown in Fig.1.The dashed rectangle shows the subtree to be optimized.New literals are represented schematically, using dots; their actual number could vary.

2 .
Design of even-handed formulations (unbiased towards positive/negative samples), and the addition of class weights.3.Correction of the original formulation for AtLeast,and generalizations to the AtMost and Choose operators.

FIG. 8 : 4 .
FIG.8: Evolution of the objective function (in this case, the accuracy) in a short example run of the native local solver with non-local moves on the Breast Cancer dataset with max_complexity = 10.The settings for the solver are num_starts = 3, num_iterations = 100, and the temperatures follow a geometric schedule from 0.2 to 10 −4 .The first iteration of each start is indicated by a vertical solid red line.The first num_iterations_burn_in = 50 iterations of each start are defined as the burn-in period (shaded in blue), in which no non-local moves are proposed.After that, non-local moves are proposed when there is no improvement to the accuracy in patience = 10 iterations.The proposals of non-local moves use a subset of the samples of size max_samples = 100, and are indicated by vertical dashed black lines.
def solve (X , y , max_complexity , num_starts , num_iterations , num_iterations_burn_in , patience ) : best_score = -inf best_rule = None for start in range ( num_starts ) : i s _ p a t i e n c e _ e x c e e d e d = False current_rule = g e n e r a t e _ i n i t i a l _ r u l e (X , max _complex ity ) current_score = score ( current_rule , X , y ) for iteration in range ( nu m_iterat ions ) : T = u p d a t e _ t e m p e r a t u r e () if iteration > n u m _ i t e r a t i o n s _ b u r n _ i n and i s _ p a t i e n c e _ e x c e e d e d : proposed_move = p r o p o s e _ n o n _ l o c a l _ m o v e ( current_rule , max_samples ) else : proposed_move = p r o p o s e _ l o c a l _ m o v e ( current_rule ) p r o p o s e d _ m o v e _ s c o r e = score ( proposed_move , X , y ) dE = p r o p o s e d _ m o v e _ s c o r e -current_score accept = dE >= 0 or random .random () < exp ( dE / T ) if accept : current_score = p r o p o s e d _ m o v e _ s c o r e current_rule = proposed_move if current_score > best_score : best_score = current_score best_rule = deepcopy ( current_rule ) i s _ p a t i e n c e _ e x c e e d e d = u p d a t e _ p a t i e n c e _ e x c e e d e d ( patience ) return best_rule Algorithm 2: The pseudo-code for our native local solver with non-local moves.

FIG. 9 :
FIG.9: Accuracy score landscape for single and double feature rules for the full Breast Cancer dataset.Rules and respective scores are obtained by enumerating the full search space.

FIG. 10 :
FIG.10:A comparison of the train (left) and test (right) expressivity of different operators in depth-one rules, solved via the ILP formulations.Each classifier is trained and tested on 32 stratified shuffled splits with a 70/30 split (for cross-validation) over the in-sample data (80/20 stratified split).The points correspond to the mean of the balanced accuracy and complexity over those splits.The error bars are given by the standard deviation over the balanced accuracy and complexity on the respective 32 splits for each point.

FIG. 12 :
FIG.12: Number of variables and size of search space as a function of the maximum number of literals for the Breast Cancer dataset.The number of variables for an Or rule in the two QUBO formulations, as a function of m (max_num_literals) is plotted in (a)-see Eq. (24).The step-wise form is a result of the binary encoding of the slack in the inequality constraints.The size of the feasible space, which is equal to the size of the search space for the ILP solver, as well as the size of the much larger (feasible and infeasible) space searched by the QUBO formulation (without η) is plotted in (b).The inset shows a zoomed in version of the former.

FIG. 13 :
FIG.13: Train (left) and test (right) results for the native local solver ("Local") vs. the depth-one ILP classifiers (indicated by the name of the respective operator, e.g., Or), the single feature classifier ("Single"), the decision tree ("DT"), and the decision tree with no binarizer ("DT (NB)"), for the three smallest datasets.Each classifier is trained and tested on 32 stratified shuffled splits with a 70/30 split (for cross-validation) over the in-sample data (80/20 stratified split).The points correspond to the mean of the balanced accuracy and complexity over those splits.The error bars are given by the standard deviation over the balanced accuracy and complexity on the respective 32 splits for each point.Continued in Fig.14.

FIG. 14 :
FIG.14: Continued from Fig.13.Test results for the six largest datasets.For these larger datasets the train and test results are virtually indistinguishable, likely due to the highly regularized models used.For this reason only the test results are presented.