A Meta Algorithm for Interpretable Ensemble Learning: The League of Experts

Background. The importance of explainable artificial intelligence and machine learning (XAI/XML) is increasingly being recognized, aiming to understand how information contributes to decisions, a method's bias, or its sensitivity to data pathologies. Efforts are often directed to post hoc explanations of black box models. These approaches add additional sources of error without resolving the models' shortcomings. Less effort is directed into the design of intrinsically interpretable approaches. Methods. We introduce an intrinsically interpretable methodology motivated by ensemble learning: the League of Experts (LoE) model. We first establish the theoretical framework and then deduce a modular meta algorithm. Our description focuses primarily on classification problems; however, LoE applies equally to regression problems. As a particular instance for classification problems, we employ classical decision trees as the classifier ensemble. This choice facilitates the derivation of human-understandable decision rules for the underlying classification problem, resulting in a derived rule learning system denoted as RuleLoE. Results. In addition to 12 KEEL classification datasets, we employ two standard datasets from particularly relevant domains, medicine and finance, to illustrate the LoE algorithm. The performance of LoE with respect to accuracy and rule coverage is comparable to common state-of-the-art classification methods. Moreover, LoE delivers a clearly understandable set of decision rules with adjustable complexity, describing the classification problem. Conclusions. LoE is a reliable method for classification and regression problems whose accuracy is appropriate for situations in which the underlying causalities are the center of interest, rather than merely accurate predictions or classifications.


Introduction and Motivation
Machine learning (ML) is an integral part of many products and activities in our everyday life, including developments in autonomous driving [1], health care [2], and law enforcement [3]. Regulations regarding automated algorithmic decision making, such as "the right for explanation[s]" [4], and other judicial reasons [5] emphasize the need for general accountability [6] of ML-based systems, particularly of black box models. Black box models are optimized for performance by adding complexity to an extent that renders them infeasible to interpret. Current attempts try to overcome the lack of explainability by adding external components that distill human-understandable information within additional layers of complexity (post hoc methods). These additional layers, however, can only approximate the actual decision-making process, at the risk of not being faithful [7]. Explainable artificial intelligence and machine learning (XAI/XML) are increasingly being recognized [8][9][10][11]. While different application areas highlight the emerging trends in XAI and XML [12][13][14], performance-explainability trade-offs are also within the focus of current research [15,16].

Contribution of This Work
In this contribution, the League of Experts (LoE), a novel and transparent machine learning model and meta algorithm based on the idea of ensemble learning [17] (p. 605), is introduced. LoE is a variant of dynamic classifier selection (DCS) methods [18] (see Section 2.7). In contrast to other DCS methods, which often overproduce a rather large set of classifiers using static methods such as bagging or boosting [19], LoE trains its ensemble as part of the involved selection and assignment method. To our knowledge, this is so far unexplored in the context of DCS [18]. LoE is a construction framework, allowing for modifications with the goal of leveraging explainability. It enables the design of user interfaces to directly interact with the model and its components. This work will not dive into concrete user interactions but will instead mention possible interactions where appropriate, through which LoE allows users to manually adjust the selection process and the ensemble members. LoE thereby allows users to interactively explore trade-offs between model performance and complexity. In this contribution, we exemplify and evaluate LoE using two datasets. We particularly focus on controlling the amount of complexity involved in explaining the learned model instances (experiments in Section 4.1). Furthermore, we introduce a specific solution to instantiate LoE, which enables the transformation of LoE into an almost equivalent rule set learner (Section 3.5). In order to reduce the complexity while jointly improving the approximation to LoE, we further explore ways to adapt the training procedure.

Section Overview
The following sections introduce our methodology in Section 2 and LoE's implementation in Section 3. The main idea of LoE is motivated in Section 3, followed by a concise definition and description of its basic training and inference procedures. Section 3.2 shows how to explain a trained LoE model using decision rules. The proposed methods also allow shortening these decision rules, which reduces their complexity and in turn facilitates interpretation by human users (Section 3.3.1). Subsequently, we present a method to extract an analogous rule learner, called RuleLoE, from the derived explanations in Section 3.5. Finally, Section 4 provides an evaluation regarding feature space reduction and rule set generation on two prominent datasets from medicine and finance.

Materials and Methods
In the following, we introduce the League of Experts (LoE) algorithm together with its motivation in terms of explainability and accuracy. We first provide background information and a taxonomy of current and previous approaches within XAI and XML. Sections 2.1-2.7 provide insights into black and glass box models (Sections 2.1 and 2.2), the principles of rule learning and explanations (Sections 2.3 and 2.4), decision trees and their use as surrogate models (Sections 2.5 and 2.6), and the fundamentals of ensemble learning and dynamic selection (Section 2.7).

Black Box Models
Computational advances facilitate more powerful but also more complex models, such as black box models [20][21][22][23]. Feed-forward artificial neural networks (ANNs) [24], e.g., for image classification, possess about 5 to 155 × 10^6 trainable parameters while performing up to 8 × 10^10 computational operations for a single prediction [25]. Since an ANN consists of a graph of operations, it is in theory possible to follow the performed calculations in a stepwise fashion. In practice, however, such enormous complexity is not cognitively comprehensible by a human. Even other popular methods involving far fewer parameters, e.g., support vector machines (SVMs) [26] or ensemble methods such as boosting (e.g., AdaBoost [27]) and random forests [28], cannot be easily interpreted. These approaches are inappropriate for applications requiring a good understanding of the decision-making process. For ensemble learning, this holds true even if the ensemble's members themselves are easy to interpret, e.g., decision trees [29]: the decision boundary of the whole ensemble remains highly complex, as the final prediction is a fusion of all members.

Glass Box Models
In contrast, glass box models [22,30,31], also known as white box models, are characterized by transparent states, allowing explanations regarding their decision boundaries to be derived. Such models include decision trees [17] (p. 305), (logistic) regression [17] (p. 119), and neighborhood models describing decisions by the similarity of related data points, e.g., k-nearest neighbors (k-NN) [32]. However, adding too much complexity jeopardizes practical interpretability, e.g., too many levels in decision trees or too many dependencies in Bayesian networks [33]. Importantly, this is due to the cognitive limits of the users and not to the decision-making process itself.

Rule Learning
A particular example of glass box models is decision rules, which conveniently present knowledge to users as logical patterns. They are realized either unordered (i.e., independent) as rule sets or ordered as rule lists. The latter are more complicated to interpret [34], as each rule's applicability depends on the negations of its predecessors (i.e., the previous rules must not apply). Decision trees have representations as rule sets, where each path in the tree becomes a single rule. For example, the rules shown in Figure 1 are as follows: if (sex is female ∧ age > 40 ∧ weight > 80) ∨ (sex is female ∧ age ≤ 40), predict low-risk group; otherwise, predict high-risk group. Unlike decision trees, most rule learners do not produce rule sets with full coverage and without sample space overlap: several rules, or none at all, may be applicable to a data point (or query). The former is resolved by determining a resolution order, the latter by adding a fallback rule. Both cases must be addressed when transforming LoE into a rule learner (see Section 3.5). Rule learning systems, including CN2 [35], RIPPER [36], Bayesian rule sets [37], Boolean decision rules via column generation (BRCG) [38], and interpretable decision sets [34], derive rules by directly optimizing a target function, e.g., by using integer programming to optimize specific criteria such as classification performance, coverage, or simplicity. The optimal solution is often intractable due to the exponential number of possible clauses, resulting in the NP-complete set cover problem.
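Since each root-to-leaf path of a decision tree forms one rule, the linearization described above can be sketched as follows (a minimal illustration using scikit-learn's tree internals; the function name `tree_to_rules` is ours, not part of any library):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_to_rules(tree, feature_names, class_names):
    """Enumerate every root-to-leaf path of a fitted decision tree as one rule:
    a list of threshold conditions plus the predicted class of the leaf."""
    t = tree.tree_
    rules = []

    def walk(node, conds):
        if t.children_left[node] == -1:  # -1 marks a leaf in sklearn's tree arrays
            label = class_names[int(np.argmax(t.value[node]))]
            rules.append((conds, label))
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], conds + [f"{name} <= {thr:.2f}"])
        walk(t.children_right[node], conds + [f"{name} > {thr:.2f}"])

    walk(0, [])
    return rules
```

By construction, the rules of a tree cover the whole sample space without overlap, which is exactly the property most standalone rule learners lack.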

Explanations
In the context of XAI, "explanation" refers to human-understandable information in a local or global sense, justifying (i) the process generating a query-dependent output of the method (e.g., the classification result) or (ii) the state of the model as a whole. Local explanations can be misleading about a model's global state, as query-dependent properties might not apply in the global context, i.e., for arbitrary queries.
Post hoc explainers are popular methods. They add methods and visual components aiming to explain the decision process of an already-trained model. Post hoc methods can be (i) model agnostic (not tied to a specific model class), e.g., LIME [39], Anchors [40], Shapley values [41], or visualizations such as partial dependence plots (PDPs) and individual conditional expectation (ICE) plots [42] (Chapters 5.1 and 5.2), or (ii) model specific, e.g., decision-tree-based SHAP variants (TreeSHAP) [43] or explainers specific to neural networks [44].
However, the drawbacks of post hoc methods are that (i) external components add complexity to ML pipelines, and (ii) users must have confidence in both the underlying ML model and the explanation method. Both can be wrong, inaccurate, or only applicable in certain unobserved circumstances; importantly, users cannot notice the latter. Glass box models, which are algorithmically transparent, do not require post hoc approaches; neither do, more generally, globally simulatable models, which are characterized as being simple enough to be comprehended in reasonable time [45].
Simulatable models include small decision trees, small rule sets [46], (generalized) linear models with a small number of parameters (after applying shrinkage methods, e.g., the Lasso), or instance-based learners with explainable features (e.g., k-NN [32]). Additionally, the features describing entities, such as data points, must be reasonably comprehensible. Models satisfying those requirements are strongly limited, typically weak in performance, and limited in their generalization properties [47]. Notably, the definition of simulatable is somewhat fuzzy, as it depends on individual users' abilities.

Decision Trees
Decision trees [48] directly display their reasoning processes due to their graphical structure (Figure 1). They typically focus only on a subset of relevant attributes or features [49], rendering them cognitively less demanding. The graphical structure of a decision tree allows the user to investigate the details of a given reasoning process, including alternative paths, i.e., by assuming that an attribute has a different value and observing how this would influence the decision-making process. This allows the fairness and relevance of the decisions made to be judged.

Surrogate Models
Surrogate (or proxy) models [50] attempt to explain the decision-making process of a complex black box model f_c(·) by imitating the black box model's output profile on a dataset L using a simpler glass box model f_g(·) [42] (Chapter 5.6), e.g., using a decision tree to approximate an ANN. The approximation f_c ∼ f_g can hold globally (global surrogate), i.e., on L, or locally (local surrogate), i.e., on a subset of L. For instance, the LIME method [39] is a local surrogate.
Generally, it is unclear whether the surrogate f_g captures the same causalities as the explained model f_c. Several alternative surrogates might perform equally well, which can be explained by the Rashomon effect [51], describing circumstances under which multiple alternative explanations yield the same result. This is less problematic when the complex model is replaced by a sufficiently accurate global surrogate. For local surrogates, however, the problems amplify, particularly if users are inclined to induce global decision making from local explanations. Although the mentioned problems do not vanish, they can be controlled by directing the training procedure (see Section 3.3).

Ensemble Learning and Dynamic Selection
Ensemble learning methods [52,53] train multiple models as an ensemble for classification or regression tasks, whose outputs are used in specific combinations, also called aggregation or fusion. With our approach, LoE also follows this idiom. More specifically, it is a dynamic selection (DS) method [18], which, in contrast with static methods (e.g., AdaBoost [27] and random forests [28]), does not use the whole ensemble to form a decision. Instead, it uses a separate selection method to select ensemble subsets for individual queries. The rationale of DS is to dynamically find a subset of models that are likely to be knowledgeable about a given query presented to the system; to this end, a filter method is employed that decides which members of the ensemble may be knowledgeable in predicting the query. This filter is often denoted as a selection or assignment function. DS methods fall into the class of either dynamic classifier selection (DCS) or dynamic ensemble selection (DES) methods. While DCS methods choose exactly one model from the pool when presented with a query, DES methods may choose multiple models. LoE is categorized as a DCS method, as it selects exactly one model per query to retain explainability.

The League of Experts Classifier
For the realization of the League of Experts as a classifier, we adapt a particular case of ensemble learning in which each ensemble member is associated with a specific part of the feature space. More specifically, we introduce an algorithm that adaptively learns a partition of the feature space in order to assign one specialized ensemble member, the expert, to each subset of the partition. Since the partition divides the feature space into disjoint sets, each expert is trained on a different subset of the training data. This diversifies the learning process, which implies that each expert performs particularly well only on its assigned subset but not necessarily on the whole feature space. Since each expert is not required to model the whole feature space, the complexity of the experts can be limited without sacrificing overall performance.
The resulting ensemble of experts is named after the way in which inferences are made. Once an LoE classifier is trained, inference follows a two-step approach: (i) a query x_q is assigned to the correct expert g, and (ii) the query output g(x_q) is produced by the assigned expert (see also Figure 2). This process mimics human teamwork, with a number of experts specialized in specific domains and a coordinator who assigns tasks to the most appropriate expert. Allegorically, imagine a query being a patient visiting a primary physician (the coordinator) who performs an initial examination and refers the patient to the specialist.
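The two-step inference described above can be sketched as follows (a minimal, hypothetical illustration; `loe_predict`, the anchor list, and the expert callables are our own naming, not the paper's reference implementation):

```python
import numpy as np

def loe_predict(x_q, anchors, experts, metric):
    """LoE two-step inference: (i) assign the query x_q to the expert whose
    anchor point is closest under the given metric, (ii) let that expert predict."""
    k = min(range(len(anchors)), key=lambda i: metric(x_q, anchors[i]))
    return experts[k](x_q)
```

Here, `anchors` plays the role of the coordinator: the distance comparison alone decides which expert sees the query.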
The following sections introduce our definitions and model formulations (Section 3.1), our approach to deriving explanations (Section 3.2), the reduction of the complexity of explanations (Section 3.3), the analysis of LoE's runtime complexity (Section 3.4), and LoE's rule set learner, also called RuleLoE (Section 3.5).
(Figure 2: Step 1, the record-to-model assignment A(x_q), maps a query x_q to the selected model; experts are illustrated as gear wheels.)

Basic Definitions and Model Formulation
Consider a typical supervised learning problem based on a labeled dataset L ⊆ X × C. Each labeled data point (x, y) ∈ L is generated by a probability distribution and consists of m real-valued inputs from a domain X ⊆ R^m and a categorical (or, in the case of regression problems, metrical) label within a set C (for metric outputs, C ⊆ R^l). Notably, categorical inputs are also permitted, in which case they are assumed to be one-hot encoded. For a point (x, y) ∈ L, we seek to maximize the probability of the output Y conditioned on the input X based on the model P(Y = y | X = x) = g_s(x, y), where g_s : X × C → [0, 1] is a (soft) classifier belonging to a family G_s.
The corresponding derived hard classifier is defined as

g(x) = argmax_{y ∈ C} g_s(x, y),

where g : X → C. The resulting family of hard classifiers derived from G_s is denoted by G.
For regression rather than classification problems, the conditional expectation of the output Y given the input X is modeled rather than the conditional probability. More precisely, the underlying regression model is E[Y | X = x] = g(x).

An assignment function A : X → {1, . . ., n} of rank n partitions the feature space X via

X_k^(A) = {x ∈ X : A(x) = k},  k = 1, . . ., n,

where X_k^(A) is called the territory of the expert g_k. An LoE (G, A), consisting of an ensemble G = (g_1, . . ., g_n) and an assignment function A, defines a single new classifier by the map

f_LoE(x) = g_{A(x)}(x),

which generally is not an element of G. The set of all admissible assignment functions A (or, equivalently, the set of all corresponding partitions A_part) and classifiers G defines the set of all possible LoE classifiers, denoted by G. At this point, it is also noted that an LoE is conceptually similar to DCS algorithms (Section 2.7), with the assignment taking the role of the selection function. In the following, because an assignment function is equivalent to the partition it defines, we identify an assignment function with its equivalent partition.
The quality of an LoE classifier is measured by a performance function p : C × C → R. A typical choice is the Kronecker delta

p(ŷ, y) = δ_{ŷ, y} = 1 if ŷ = y and 0 otherwise.  (1)

For a metrical output (regression task), the performance function typically follows from a distance measure between ŷ and y, e.g., the negative mean-squared distance.
An optimal LoE classifier maximizes the average performance over the whole input and output domain X × C; i.e., it is defined by

(G*_{A′}, A′) ∈ argmax_{(G_A, A)} ∫ p(g_{A(x)}(x), y) dF(x, y),  (2)

where F(x, y) is the joint probability distribution of the inputs X and outputs Y.
The maximization in (2) is performed in two steps: (i) for each assignment function A ∈ A (or its equivalent partition), an optimal ensemble of experts

G*_A ∈ argmax_{G ∈ G^n} ∫ p(g_{A(x)}(x), y) dF(x, y)

is found, and (ii) these optimal LoEs (G*_A, A) are maximized over all assignment functions A ∈ A (or, equivalently, all partitions). This yields an optimal LoE classifier (G*_{A′}, A′), i.e.,

A′ ∈ argmax_{A ∈ A} ∫ p(g*_{A(x)}(x), y) dF(x, y).

Notably, an optimal LoE classifier is not necessarily unique. Furthermore, it is infeasible to find one, as the distribution F(x, y) is unknown and the number of possible partitions is exhaustive. In practice, a pseudo-optimal LoE is empirically determined by replacing F(x, y) with the empirical distribution function F̂(x, y) obtained from the labeled dataset L. This yields (Ĝ*_{A′}, A′) with

A′ ∈ argmax_{A ∈ A} (1/|L|) ∑_{(x,y) ∈ L} p(ĝ*_{A(x)}(x), y),  (6b)

where, for each A ∈ A,

Ĝ*_A ∈ argmax_{G ∈ G^n} (1/|L|) ∑_{(x,y) ∈ L} p(g_{A(x)}(x), y).  (6c)

However, this optimization is also infeasible: in general, an uncountable number of tuples G* ∈ G^{|A|} (6c) and assignment functions A ∈ A (6b) exist. It can, however, be heuristically approximated for a specific class of assignment functions. We provide an algorithm yielding an LoE (Ĝ(A*), A*) that performs almost as well as an optimal LoE.

Class of Assignment Functions
The set of assignment functions A_n of degree n is restricted to the class of functions defined by n pairwise different anchor points θ_1, . . ., θ_n ∈ R^m via

A(x) = min argmin_{k = 1, . . ., n} d(x, θ_k),  (7)

where d(·, ·) is a given metric on X. Without taking the minimum, A would not be well defined in the cases d(x, θ_k) = d(x, θ_l) for k ≠ l; ties are therefore resolved by the smallest index k. Here, A_n depends on the choice of the metric d. It is advisable to use a metric that accounts for the variation of the input features. We employ the weighted L1 metric d(x, y) = ∑_{i=1}^m |x_i − y_i| / σ_i, with σ_i being the standard deviation of the i-th input. In practice, σ_i is replaced by its estimate from the labeled dataset L. An alternative would be to employ a Mahalanobis distance.
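The weighted L1 metric can be implemented directly; the sketch below (with hypothetical helper names `weighted_l1` and `fit_sigma`) estimates the per-feature standard deviations from the data, as described:

```python
import numpy as np

def fit_sigma(X):
    """Per-feature empirical standard deviation, estimated from the labeled dataset
    (rows = data points)."""
    return np.asarray(X).std(axis=0, ddof=1)

def weighted_l1(x, y, sigma):
    """Weighted L1 metric d(x, y) = sum_i |x_i - y_i| / sigma_i, which makes the
    distance invariant to the scale of the individual input features."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) / sigma))
```

Dividing by sigma ensures that a feature measured in, say, milligrams does not dominate one measured in kilograms.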
As there is a one-to-one correspondence between an assignment function A ∈ A_n and its n-tuple of anchor points (θ_1, . . ., θ_n), the two are identified in the following. For an LoE classifier, the anchor points correspond to the center points of the experts' territories and are referred to as such.

Algorithm Training
An assignment function A partitions not only the feature space X but also the labeled dataset L, namely via

L_k^(A) = {(x, y) ∈ L : A(x) = k},  k = 1, . . ., n.

We approximate the optimization (6) for the class of assignment functions A_n with the following heuristic algorithm, illustrated in Figure 3. The algorithm starts with n initial anchor points θ_1, . . ., θ_n and a corresponding assignment function A. Next, (6b) is approximated for the assignment function A by training one classifier g_k ∈ G for each part L_k^(A) of the partition of the labeled dataset L. This yields an ensemble Ĝ*_A and, hence, an LoE (Ĝ*_A, A) that is "quasi-optimal for A". Next, the anchor points θ_k are updated: we shift the center points of the experts' territories, whereby θ_k is replaced by

θ_k + η ( ∑_{(x,y) ∈ L} p(g_k(x), y) x / ∑_{(x,y) ∈ L} p(g_k(x), y) − θ_k ),  (9)

where η is the update's step size. Here, each expert's performance is evaluated over the whole training data L. The centers θ_k are thus shifted in the direction of points for which the expert g_k performs well, as measured by the performance function p(g_k(·), ·). The new center points define an assignment function A′ and a new partition of the feature space, for which a new LoE is trained. This step is repeated with potentially decreasing step sizes.
Finally, the LoE classifier with the overall best performance is chosen, which is referred to as a "quasi-optimal LoE". This procedure is described as pseudocode in Algorithm 1.
The updating step depends on the choice of the performance function. For simplicity, the Kronecker delta (1) is used for the examples presented here.
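With decision trees as experts and the Kronecker-delta performance, the training loop might look as follows. This is our assumed reading of the procedure; the initialization, the expert hyperparameters, and the anchor update toward the mean of each expert's correctly predicted points are illustrative choices, not the authors' exact implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_loe(X, y, n_experts, metric, T=20, eta=0.5, seed=0):
    """Heuristic LoE training sketch: alternate between fitting one expert per
    territory and shifting each anchor toward the points its expert gets right.
    X, y are NumPy arrays; metric is a distance on the feature space."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), n_experts, replace=False)].astype(float)
    best = None
    for _ in range(T):
        # assign each training point to its closest anchor (territories)
        assign = np.array([min(range(n_experts),
                               key=lambda k: metric(x, anchors[k])) for x in X])
        experts = []
        for k in range(n_experts):
            idx = assign == k
            clf = DecisionTreeClassifier(max_depth=3, random_state=seed)
            # empty territory: fall back to fitting on the full data
            clf.fit(X[idx] if idx.any() else X, y[idx] if idx.any() else y)
            experts.append(clf)
        # overall LoE accuracy with the current assignment
        pred = np.array([experts[assign[i]].predict(X[i:i + 1])[0]
                         for i in range(len(X))])
        acc = float((pred == y).mean())
        if best is None or acc > best[0]:
            best = (acc, list(experts), anchors.copy())
        # Kronecker-delta performance: shift anchors toward correctly predicted points
        for k in range(n_experts):
            ok = experts[k].predict(X) == y
            if ok.any():
                anchors[k] += eta * (X[ok].mean(axis=0) - anchors[k])
    return best  # (accuracy, experts, anchors) of the best iteration
```

Returning the best iteration rather than the last one mirrors the choice of the overall best-performing LoE as the "quasi-optimal" model.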

(Figure 3 legend: good vs. bad predictions. Algorithm 1, sketch: while t < T, train the experts on their territories, perform inference using the current experts and anchor positions, and update the anchors.)

Let (Ĝ(A*), A*) be a quasi-optimal LoE, with A* characterized by the anchor points θ*_1, . . ., θ*_n. The tuple (Ĝ(A*), A*) is a substitute for f_LoE in (3). For a query x_q, the distances d(x_q, θ*_k) are calculated for k = 1, . . ., n. Then, x_q is assigned to the closest expert, say g*_i in (7), which predicts the outcome ŷ = g*_i(x_q) (Figure 2).

Deriving Explanations
In the previous sections, we described a functional and modular machine learning algorithm. Instantiated with glass box models as experts, the model itself can already be employed to train a set of diverse (i.e., trained on disjoint data) and interplaying classifiers (i.e., solving a common task). While the resulting experts can unveil information about the learning problem on their own by directly representing part of the underlying decision process, the assignment function is not yet directly visible. The following sections therefore address this missing part by further utilizing the properties of the model in order to derive explanations.
In our context, explainability involves two aspects: (i) understanding the assignment process, i.e., why a query is assigned to a specific expert, and (ii) understanding the experts themselves, i.e., why an expert predicts a specific outcome inside its territory. The latter implies either that the experts themselves are glass box models, as assumed here, or that post hoc methods are applied otherwise (see also Section 2.1). To explain the assignment process, the decision boundaries of the assignment function A must be understood, which might be nontrivial. Their explanation should be accurate, comprehensible, and comprehensive, requirements that naturally lead to a trade-off, as "One Explanation Does Not Fit All [Alike]" [54]. While there is a multitude of options to explain an assignment function A (or the implied partitions), we focus on revealing a functional relationship between the relevant attributes and their interactions. To achieve this, we transform the assignment process into an equivalent classification problem that is solved by employing a glass box model acting as a surrogate for A. Although employing surrogates to model a process imposes risks, as mentioned in Section 2.6, LoE exhibits properties that allow these risks to be mitigated. In particular, part of the model's decision process remains intact, as the experts themselves are not approximated by the surrogate. By steering the concrete complexity of the surrogate models, by employing different modeling strategies (Section 3.5.1), and by applying feature space reduction techniques induced by a slightly modified training routine (Section 3.3.1), the faithfulness of the surrogate can be further optimized. Although any model can be employed for this purpose, this work focuses on decision trees, as they allow a direct transformation of the learned LoE model into the rule set learner called RuleLoE.

Making the Assignment Function Explainable
To make the assignment function explainable, a glass box model is trained as a surrogate model for the assignment function A. Let S denote the family of such surrogates, which require a training dataset. The inputs of the assignment function are elements of the feature space X, while the labels are elements of the set {1, . . ., |A(X)|}. For a subset X_s ⊆ X of the feature space, the graph of the assignment function A restricted to X_s is

L_{A, X_s} = {(x, A(x)) : x ∈ X_s}.

Surrogates are assignment functions, but in general, their corresponding partitions are not members of A. The set L_{A, X_s} can be used as labeled data to train a surrogate h ∈ S and is called the assignment dataset. In the following, A will be the assignment function of a given LoE (G*, A). We model A with a decision tree trained on L_{A, X_s} (i.e., the surrogate learns how to map a data point to a specific expert). The whole decision process for classifying a data point x_q is expressed as a concatenation of the decision paths of the two models, the surrogate of A and g_k with k = A(x_q), whereas the final prediction is ŷ = g_{A(x_q)}(x_q). This process is illustrated in Figure 4.
(Figure 4: an LoE (g_1, g_2) with A replaced by a single decision tree (surrogate). The orange line denotes the decision path of a query x_q (where x_q = (a_1, a_2, a_3) satisfies a_1 = v_2 ∧ a_3 < 5 ∧ a_2 < 3) through both the surrogate and one expert. Note that only the expert g_1 is relevant for the query x_q.)
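Building the assignment dataset {(x, A(x)) : x ∈ X_s} and fitting a decision tree surrogate can be sketched as follows (the function name and the depth limit are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_assignment_surrogate(A, X_s, max_depth=3):
    """Train a decision tree surrogate h for the assignment function A on the
    assignment dataset {(x, A(x)) : x in X_s}; also report how well h agrees
    with A on X_s (the assignment accuracy)."""
    labels = np.array([A(x) for x in X_s])
    h = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X_s, labels)
    agreement = float((h.predict(X_s) == labels).mean())
    return h, agreement
```

The returned agreement is a simple empirical instance of the faithfulness measure discussed in the next section.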

The Surrogate's Faithfulness
The surrogate's performance in mimicking the original assignment function A (i.e., the assignment accuracy) must be evaluated over a sample X_s ⊆ X. The level of agreement between the surrogate h and A is measured by p_ass(h, A; X_s), with p_ass : S × A × (P(X) \ ∅) → R_{≥0}, where P denotes the power set and high values indicate good agreement. If surrogates have unsatisfactory agreement, the class S might not capture the same mechanisms as the elements in A, or it might cover the correct mechanisms but be too constrained. In these cases, the class S can be extended (i.e., relaxing regularizations, e.g., allowing larger decision trees). Alternatively, the number of input features considered by the surrogate can be reduced, e.g., by reducing the feature space. Appropriate projections can be identified during LoE's training procedure (Section 3.3.1).
As the ultimate goal is to obtain comprehensible explanations of the whole decision process, the surrogate's comprehensibility has to be ensured; hence, surrogates must be understandable. As the employed decision trees can be of arbitrary complexity, we have to introduce a measure thereof. To do so, we first introduce basic notation.

Decision-Tree-Related Notation
For a decision tree t, its root node, the set of leaves, and the set of ancestors from a node v to the root are denoted by head(t), leafs(t), and trace(t, v), respectively. The decision tree assigns each data point a leaf; the path of this leaf consists of the nodes from the root to the leaf. For a subset X_s ⊆ X of the data evaluated over the tree t, let |v|_(X_s, t) be the number of data points in X_s whose paths pass through node v. As the root is contained in every path, and exactly one path leads to each leaf, |head(t)|_(X_s, t) = |X_s| (i.e., all data points start at the root node).

Complexity of a Decision Tree
Although decision trees can always be visualized as a graphical representation or linearized as rules, this can be impractical due to their size, i.e., depth, width, or structure, which measure the tree's complexity. This argument, however, ignores the distribution of queries: if the majority of queries utilize only a simple subtree, a complex tree remains practical for most instances. The distribution of queries is taken into account when measuring the decision tree's complexity by the average trace length to the leaves over a representative sample X_1 ⊆ X, i.e.,

(1/|X_1|) ∑_{x ∈ X_1} |trace(t, leaf_t(x))|,

where leaf_t(x) denotes the leaf assigned to x. If the choice of X_1 does not follow naturally, the measure can instead be computed for a uniform sample distribution over the leaves, i.e.,

E[|trace(t, ·)|] = (1/|leafs(t)|) ∑_{v ∈ leafs(t)} |trace(t, v)|.
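With scikit-learn trees, the sample-weighted average trace length can be computed from the decision-path indicator matrix (a sketch; `avg_trace_length` is our naming):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def avg_trace_length(tree, X_1):
    """Average root-to-leaf path length over a representative sample X_1:
    (1/|X_1|) * sum over x of the number of nodes on x's decision path."""
    paths = tree.decision_path(X_1)   # sparse indicator matrix, one row per point
    return float(paths.sum() / len(X_1))  # mean number of visited nodes
```

A tree whose frequently used paths are short scores low under this measure even if it contains a deep, rarely visited subtree.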

Reducing the Complexity of Explanations
Comprehensiveness depends on the cognitive load confronting users. We introduce a concept to reduce cognitive load by effectively restricting the LoE model to a lower-dimensional feature space. The process impacts the training process of LoE while preserving its predictive power. It also affects the assignment process and is inherited by the surrogates. Although surrogates can approximate complex nonlinear decision boundaries, most practically relevant datasets are aligned along a lower-dimensional manifold [55] that is well approximated by projections. A high-dimensional feature space, i.e., a large number of attributes, typically compromises the generalization properties of ML models due to overfitting. With a growing number of attributes, the number of candidate surrogates in S increases exponentially, such that several surrogates that do not faithfully explain A will still have good performance; they justify the assignment but do not reflect the true causality (see Section 2.6). This is known as the Rashomon set [54].

Feature Space Reduction
We mitigate these high-dimensional drawbacks by projecting the feature space onto a lower-dimensional subspace. Let π : R^m → R^r, defined by π(x) = (x_{i_1}, . . ., x_{i_r}), be the projection of x = (x_1, . . ., x_m) onto the r components i_1, . . ., i_r (1 ≤ i_1 < i_2 < . . . < i_r ≤ m). The metric d is replaced by the (pseudo) metric d_π(x, y) := d(π(x), π(y)), which is the restriction of d to the projection space defined by π. Thus, in (7), d is replaced by d_π. The updating step of the center points of the experts' territories is restricted to the lower-dimensional projection space while leaving the remaining components unchanged; i.e., in (9), only the components i_1, . . ., i_r of θ_k are updated, while the performance is still evaluated over the original unprojected sample L. The surrogates, however, are trained on the projected feature space. This reduces complexity and the Rashomon effect (Section 2.6), thereby increasing the faithfulness of the surrogate's explanations. Moreover, users' understanding may improve, as they are confronted with fewer attributes to interpret.
If decision trees are built with fewer attributes, the same attribute is more likely to appear multiple times along a decision path. Such repeated tests can be merged into a single rule, facilitating explainability. For instance, if age < 40 and age < 30 occur along a path, the two tests reduce to (age < 30). Similarly, (age < 40 and age > 30) reduces to (age in range 30 to 40).
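The merging of repeated tests along a decision path can be sketched as follows; the helper is an illustration of our own, not part of RuleLoE:

```python
import math

def merge_path_conditions(conditions):
    """Collapse repeated tests on the same attribute into one interval.

    conditions: iterable of (attribute, op, threshold) with op in {"<", ">"}.
    Returns a dict mapping each attribute to a (lower, upper) interval.
    """
    intervals = {}
    for attr, op, thr in conditions:
        lo, hi = intervals.get(attr, (-math.inf, math.inf))
        if op == "<":
            hi = min(hi, thr)   # tighten the upper bound
        else:                   # op == ">": tighten the lower bound
            lo = max(lo, thr)
        intervals[attr] = (lo, hi)
    return intervals

# (age < 40) and (age < 30)  ->  age < 30
merge_path_conditions([("age", "<", 40), ("age", "<", 30)])
# (age < 40) and (age > 30)  ->  age in range 30 to 40
merge_path_conditions([("age", "<", 40), ("age", ">", 30)])
```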

Projections Obtained from Leveraging LoE's Properties
A projection π is based on a selection of attributes. Ideally, "important" attributes are retained. For instance, attributes that are strongly correlated, have low predictive power, or appear noisy can be removed. Advanced approaches make use of permutation feature importances [42] (Chapter 5.5), SHAP feature importances [42] (Chapter 5.10) based on Shapley values [41], or individual experts' assessments of the importances of the attributes.
Concerning LoE (and other DCS methods), attributes contributing most to the assignment process also determine their importance. Each expert in LoE can individually calculate the importances of the attributes, and these can be combined into an importance score. Approximately, the importance scores can be determined based on a surrogate h rather than on the original assignment function A (see also Section 3.2.1). A requirement is that the chosen surrogate is capable of reporting feature importances. One possibility is to choose a second surrogate family S for this purpose, which is not necessarily explainable but able to report feature importances. A possible choice is extra trees (extremely randomized trees) [56], which are ensembles of decision trees using a randomized collection of attributes and cut points. They do not optimize a split criterion and, hence, can be generated quickly while yielding good approximations of the feature importances given a sufficiently large ensemble. Moreover, they are not biased by an optimization procedure (e.g., optimizing for the Gini index) during the construction step.
The feature importance within LoE is partly determined during its iterative procedure. Importance scores are obtained in every step. These can be combined into rolling means or other time-dependent quantities indicative of the attributes' importances.

Concrete Calculation of Feature Importances
We present a concrete implementation for calculating feature importances, utilizing decision trees as a primary example. Although decision trees are a natural choice for this purpose, other models capable of reporting feature importances can be used as alternatives. In the construction of decision trees, the importance of attributes is naturally incorporated: an attribute's importance is identified with its performance, readily calculated as the relative change of a measure such as the Gini index during the decision tree's construction [57]. For the extra trees described above, this measure is not already used when generating the ensemble, so the resulting importances are not biased by the construction procedure.
For an LoE (G*, A) with expert ensemble {g_1, ..., g_n} (here, decision trees) and data L, let L_k be the subsample assigned to expert g_k (corresponding to X_k). An importance function ϕ assigns every expert the relative importance of each attribute; i.e., it is a map from the LoE ensemble to the (m − 1)-dimensional simplex (ϕ : G* → S^{m−1}) defined by ϕ : g_k ↦ (ϕ_1(g_k), ..., ϕ_m(g_k)). The ensemble's feature importance of the attributes is the weighted mean of the relative importances across all experts, weighted by the subsample sizes |L_k|. For the assignment process, the importance of the features is calculated separately: from the second surrogate family S, here an extra tree ensemble of 1000 decision trees, a surrogate is trained to approximate the assignment function A, and an importance function φ analogous to ϕ is evaluated at the trained surrogate h to yield each attribute's importance. The total importance averages the above two quantities with weight α. The total feature importance is updated in every step of the LoE algorithm. Rather than using these importances directly, a time average is calculated: the time-averaged total feature importance in step t is obtained with geometric decay β from the previous time average and the current total importance. The projection π used to reduce the dimension of the feature space is constructed by projecting onto the r features with the highest importance.
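The combination steps can be sketched as follows. Since the explicit formulas are not reproduced in this excerpt, the sketch assumes weights proportional to the subsample sizes |L_k| for the ensemble importance and an exponential moving average for the geometric decay; all function names are ours:

```python
import numpy as np

def ensemble_importance(phi_experts, subsample_sizes):
    """Weighted mean of the per-expert importances phi(g_k);
    assumes weights proportional to the subsample sizes |L_k|."""
    w = np.asarray(subsample_sizes, dtype=float)
    w /= w.sum()
    return w @ np.asarray(phi_experts)          # shape (m,)

def total_importance(phi_ensemble, phi_assignment, alpha):
    """Blend the experts' and the assignment surrogate's importances with weight alpha."""
    return alpha * np.asarray(phi_ensemble) + (1.0 - alpha) * np.asarray(phi_assignment)

def decayed_importance(prev_avg, current, beta):
    """Time average with geometric decay beta (an exponential moving average)."""
    return beta * np.asarray(prev_avg) + (1.0 - beta) * np.asarray(current)

def top_r_projection(importance, r):
    """Indices of the r most important features, defining the projection pi."""
    return np.sort(np.argsort(importance)[-r:])
```

The projection for the next iteration would then be built from top_r_projection applied to the decayed total importance of the current step.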

Runtime Analysis
To determine the runtime complexity of LoE's training process, the following considerations are taken into account. First, the determinative factors of LoE's runtime complexity are reiterated. Second, LoE's training complexity is derived.
• T(g, (x, y) ⊆ L) = training complexity for model g for data (x, y)
• E(g, (x, y) ⊆ L) = evaluation complexity for model g for data (x, y)

One-versus-Rest Surrogate
The rules derived from a decision tree used as a surrogate have no rule overlap; i.e., no pair of rules is true at the same time. While this is a desirable property as it avoids ambiguities, it also has a potential drawback: no matter which query is presented, there will always be some partial overlap, as every decision path starts at the root of the tree and hence shares at least one common test with every other path. This reduces the flexibility and versatility of the resulting set of rules, as partially overlapping rules become too similar. In practice, it is worth sacrificing the non-overlap property to obtain more diverse rules with minimal partial overlap. This is achieved by training one surrogate for each expert as follows. The k-th surrogate learns to model the assignment process relative to the k-th expert in a binary fashion (one-versus-rest strategy, OvR). Effectively, each of the n surrogates is presented with a binary-labeled dataset where the value 1 represents A(x) = k (and 0 otherwise). It learns to decide whether the k-th model should be used in predicting a given query. If so, the k-th expert is assigned (i.e., its decision path is followed); otherwise, no assignment is made (i.e., the expert is not taken into account). A visual example of a scenario using two surrogates is shown in Figure 5. Note that using OvR in a scenario with two experts is equivalent to using exactly one surrogate, as the labels presented to the two surrogates are the inverse of each other. With more than two experts, the labels differ for each expert, so each surrogate is trained on a different dataset. Hence, each of them learns different structures (in particular because decision trees are sensitive to the input), ultimately resulting in more diverse and potentially simpler rules. When surrogates are used instead of the original assignment function, the process may not always be captured perfectly, since the assignment function itself may be arbitrarily complex, whereas the surrogate's complexity may be limited or unable to fully model the original function. Therefore, it can happen that none or several of the n surrogates claim a given query.
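The OvR surrogate construction can be sketched with scikit-learn decision trees. This is an illustrative sketch on a toy one-dimensional assignment; the actual LoE implementation may differ:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_ovr_surrogates(X, assignments, n_experts, max_depth=2):
    """Train one binary surrogate per expert: label 1 iff A(x) = k."""
    surrogates = []
    for k in range(n_experts):
        y_k = (np.asarray(assignments) == k).astype(int)
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        surrogates.append(tree.fit(X, y_k))
    return surrogates

def claiming_experts(surrogates, x):
    """Experts whose surrogate claims the query; may be empty or contain several."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    return [k for k, s in enumerate(surrogates) if s.predict(x)[0] == 1]

# Toy assignment function on one attribute: three experts partition the line.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
assignments = np.array([0, 0, 1, 1, 2, 2])
surrogates = fit_ovr_surrogates(X, assignments, n_experts=3)
claiming_experts(surrogates, [2.5])   # expert 1 claims the query
```

On less separable data, the list returned by claiming_experts may be empty or contain several experts, which is exactly the ambiguity the query strategy below has to resolve.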

Query Strategy
The problem of overlapping rules and uncovered queries, potentially occurring in decision sets created by OvR, remains to be resolved. This is done by mimicking human reasoning: among competing rules, the rule giving the best overall result is chosen; if no rule applies, the best-fitting one is chosen. Formally, let L_train be the training dataset and r_i a decision rule (e.g., r_i ← age > 20 ∧ size < 180 cm ⇒ positive class). If N_i is the number of data points to which r_i is applicable, the rule's coverage is defined as c_i = N_i/|L|. With N_i^+ denoting the number of covered points correctly classified by r_i, the rule's precision is p_i = N_i^+/N_i. Rules are assigned to a query x_q as follows:
1. Rank the rules by descending precision (breaking ties by coverage).
2. Assign x_q to the covering rule with the lowest rank.
3. If no rule covers x_q, return the rule with the highest fraction of clauses evaluating to "true" (partial rule coverage), with preference given to the rule with higher precision on ties.
This approach, particularly step 3, is possible only because RuleLoE is a multiclass rule learner. It is not applicable to approaches learning only one positive class, such as RIPPER, CN2, or BRCG. Consequently, the latter might produce fewer rules but also provide less insight into the data.
In conclusion, the proposed League of Experts algorithm and its framework reach out to an audience seeking to apply, develop, and evaluate explainable artificial intelligence by providing the necessary tools. These are available at the project page of the League of Experts (LoE) framework via its repositories: https://github.com/Mereep/loe, https://github.com/Mereep/rule_loe, and https://github.com/Mereep/HDTree, accessed on 2 April 2024.

Test Results, Evaluation, and Discussion
In the following sections, we exemplify the utilization of the League of Experts algorithm and its methodology on two datasets from medicine and finance, the UCI breast cancer dataset and the HELOC dataset. Medicine and finance are two critical domains for the application of explainable models, as the decision process may have strong implications. In medicine, for example, a patient's treatment can be evaluated, adjusted, or justified. In finance, aspects of accountability and discrimination are to be considered when, for example, credits are to be approved. Therefore, two comprehensive experiments are conducted and analyzed as follows: First, LoE is applied and the effect of feature space reduction is studied (Section 3.3.1). Second, LoE is transformed into RuleLoE, whose explainability and performance are compared with alternative rule learners, black box models (Section 4.2), and commonly known ensemble learners. Subsequently, an in-depth analysis of decision rules is performed on the HELOC dataset. Additionally, we conducted a series of experiments on a collection of datasets from the KEEL dataset repository using a static model setting to allow for a performance-level overview. In the following, we provide an overview of the datasets.

Example dataset 1: UCI breast cancer
The UCI breast cancer dataset (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic), accessed on 2 April 2024) [58] contains 569 data points with 32 real-valued attributes describing the characteristics of cell nuclei found in digitized images of a fine-needle aspirate of a breast mass, including information on, e.g., mean radius, worst perimeter, or smoothness of the contour. Data points are classified as malignant or benign.

Example dataset 2: HELOC
The HELOC dataset, used in the FICO Explainable Machine Learning Challenge (https://community.fico.com/community/xml, accessed on 2 April 2024) and in different applications of explainable AI [38,59,60], describes the credit history of 10,549 individuals (9872 after removing identical entries), each characterized by 23 metric features. The risk of default is categorized as bad if the individual had overdue payments for at least 90 days during a 24-month period and as good otherwise.

Example dataset 3: KEEL
The KEEL (Knowledge Extraction based on Evolutionary Learning) dataset repository "aims at providing to the machine learning researchers a set of benchmarks to analyze the behavior of (...) learning methods" [61]. KEEL contains various datasets with different properties. For this work, we chose a subset of the repository consisting of medium-sized datasets with between 5000 and 10,000 samples (Table 1).

Feature Space Reduction
The number of features affects the assignment of the surrogate model h and the updates of the expert territories' center points. In this case, the surrogate model is a decision tree. Intuitively, the surrogate's complexity decreases with fewer features, while it becomes more accurate in the sense of approximating LoE's assignment function A.
For our experiments, the datasets are split into 2/3 training data and 1/3 test data. The family of surrogates consists of decision trees of maximum depth 6. LoE's structure is fixed to an ensemble of n = 3 experts from the class G of decision trees with maximum depth 3. The fraction of features retained by the projection π varies between 10% and 100% in 5% increments (r = m × p with p = 0.1, 0.15, ..., 1, where m is the number of attributes). For each r, LoE is trained and evaluated over 20 runs. For these runs, different training and test data partitions are employed, which are reused for every value of r. The test and training accuracies, the average path length of the surrogates' decision trees h, and the assignment performance θ_ass (i.e., the level of the surrogate's agreement with LoE's assignment function) are recorded.

Test Results
For the UCI breast cancer dataset, LoE with three experts slightly overfits the training data, as seen by comparing the training with the test accuracy (Figure 6, comparison of C and D). The former reaches almost 100% accuracy, while the latter is close to 94%. We surmise that it would be advisable to remove one of the experts or to constrain them. Feature reduction strongly influences the complexity and quality of the surrogate model. As shown in Figure 6B,A, the average rule length (average weighted path length) has a significant positive correlation with the fraction of retained features (Pearson correlation: r = 0.48, p < 0.01), while showing a significant negative correlation with the assignment accuracy (the surrogate's level of agreement with A; r = −0.62, p < 0.01).
Similar results are obtained for the FICO data (Figure 7). The accuracy is lower on both the training and the test dataset; however, the effect of overfitting is similar to the UCI dataset (Figure 7C,D). Feature reduction has an even stronger effect: the rule length has a stronger positive correlation with the fraction of retained features (r = 0.57, p < 0.01), and a stronger negative correlation with the assignment accuracy is observable (r = −0.94, p < 0.01). The strong correlations underline the importance of exploring the optimal dimensionality for feature space reduction. When retaining more than 35% of the attributes, the average rule length quickly approaches the maximum admissible length of 6 imposed by the surrogate family. This trend suggests that the derived decision rules would become more complex if allowed to; i.e., the surrogate would tend to overfit A.

Rule Set Generation with RuleLoE
The properties of RuleLoE (Section 3.5) concerning (i) explainability (average rule length and number of rules), (ii) faithfulness (rule coverage), and (iii) accuracy are compared with alternative rule set learners, in particular with (1) RIPPER [36] (despite its age still state of the art; see [46] (p. 52) and [38]), (2) Boolean decision rules via column generation (BRCG) [38], and (3) classical CART decision trees. In terms of accuracy, RuleLoE is compared with two black box models, namely, SVC and MLP, using, if not noted otherwise, standard configurations.
As LoE is a form of ensemble learning, we furthermore compared LoE with a set of well-known ensemble learners, namely, random forest, gradient boosting, extra trees, and AdaBoost, to allow for a more comparative analysis. As tree-based ensemble methods possess notions of ensemble size and maximum tree depth analogous to LoE, we trained two versions of each: default and adjusted. Default refers to the algorithm having default hyperparameter settings; the adjusted versions diverge from the default by setting the ensemble size and the maximum tree depth to match LoE's parameterization. Note that, technically, default AdaBoost only learns decision stumps, i.e., trees with a maximum depth of one. Although the employed ensemble learners use decision trees, the ensemble methods cannot easily be used for explanations, as their results must be blended in the final stage. Especially with default hyperparameters, the models produce very large ensembles, which essentially renders them black box models.

UCI Breast Cancer Dataset
An LoE ensemble of n = 9 decision trees with a maximum depth of 1 and surrogate trees of a maximum depth of 2 with a feature reduction to r = 11 attributes was initially used. After training, one expert with a small territory was manually removed.
The complexity of the rules differs substantially between the algorithms, with no clear pattern emerging. RIPPER and BRCG focus on a single class to determine rules (positive or negative), whereas LoE and CART incorporate both classes, influencing complexity and performance. Consequently, coverage is lower for RIPPER and BRCG, as they assign fewer data points to a rule. Decision trees (CART) show 100% coverage, as every query has a valid rule down to the leaves. Here, RuleLoE's performance is competitive with the other models, including the black box models; furthermore, it outperformed the glass box models in terms of test accuracy (Table 2). None of the evaluated algorithms outperforms the others in all categories. In this case, there is no advantage in preferring black over glass box models.

FICO Dataset
An LoE ensemble with n = 4 experts with a maximum depth of 1 and surrogate decision trees with a maximum depth of 2 retaining r = 8 features was initially used. One expert was manually augmented to two levels, as one of its leaves was assigned too many data points with very low purity (i.e., a skewed class distribution).
RIPPER and CART learn a large number of rules without achieving a higher accuracy (Table 3), with RIPPER showing slightly worse results. CART also learned complex rules. Interestingly, BRCG (positive class bad) learns exactly one short rule (Predict y=bad if: ExternalRiskEstimate < 73), which performs well (Table 3). Naturally, ExternalRiskEstimate is a relevant predictor for the credit rating, as it is a composition of risk markers scoring the risk by an (undisclosed) functional relationship. Hence, BRCG's simple rule for determining the credit score is based on a hard threshold that is intransparent to outsiders. This is different for RuleLoE, which leads to the following rules (Figure 8).

Predict y=good if one of the rules shown in Figure 8 applies. Some of these rules could be further merged, e.g., rules (3) and (5). However, the decisions naturally also involve the same nontransparent attribute ExternalRiskEstimate. Unlike BRCG, however, we gather some further information. If the attribute ExternalRiskEstimate gives no clear indication of a good or a bad credit score (≥ 76 or < 70, respectively), the attribute is disputable and the decision is based on two transparent attributes: individuals are considered to have a good credit score depending on how long they appear in files on average and on the time since their most recent credit inquiry. Concretely, a credit score is good if AverageMInFile ≥ 54 ∧ MSinceMostRecentInqexcl7days < 1. Otherwise, it is considered bad.
These more insightful rules of RuleLoE are less complex than the alternatives, with the single exception of BRCG with the positive class bad, which can achieve a slightly better performance (Table 3). Consequently, black box models are, in general, not superior to glass box models in terms of accuracy in our evaluated test cases; the SVC shows only a slightly higher accuracy than the other models. Considering all aspects, RuleLoE outperformed the other methods, as it contributes to knowledge discovery by providing simple and interpretable decision rules, thereby leveraging explainability.

The KEEL Dataset
For all KEEL-based experiments, we used an LoE ensemble of three experts with a maximum depth of 2. This restricted configuration was selected to prioritize the generation of succinct and interpretable rule systems, aligning with this research's emphasis on explanation-focused outcomes. The data were divided using a 2/3 training and 1/3 test split. For a comparative analysis, a random forest classifier was again evaluated, adhering to identical ensemble size and tree depth constraints (see Table 4). Although no dataset-specific fine-tuning was conducted, LoE demonstrates a reasonably good performance. Some of our experiments show a low assignment accuracy (e.g., for the data subsets marketing and texture) and, consequently, yield a RuleLoE with significantly lower performance. This effect might be attributed to an insufficient depth of the related assignment trees or an insufficient feature reduction. The data subsets marketing, optdigits, and texture, which show an overall suboptimal performance, feature many classes (Table 1). Those cannot be fully expressed by trees having two levels of depth, naturally leading to misclassifications. On average, LoE and RuleLoE surpass the random forest classifier in accuracy, underscoring the possible benefits of LoE in producing interpretable and efficient rule-based systems within the constraints of the experimental setup. Future investigations should therefore address these observed distinctions in more detail, for which further models should be investigated.

Conclusions and Outlook
In this contribution, we introduced the League of Experts (LoE) framework within the context of explainable artificial intelligence and machine learning (XAI/XML). LoE is a particular instance of an ensemble learner that combines surrogate models in order to leverage explainability and possibilities for human interaction. Moreover, LoE is accompanied by RuleLoE, a derived rule set learner. By choosing the ensemble members, the experts, from a class of glass box models, LoE itself becomes a glass box model, which, as demonstrated, is competitive in performance with existing glass and black box models. Importantly, feature space reductions can be incorporated into the training process of LoE, reducing the complexity of derived explanations. This is contrary to existing methods that navigate through a high-dimensional feature space. In future work, we would like to address hyperparameter optimization, in terms of both sound default values and the automatic tuning of ensemble members based on the provided dataset and training performance.
Consequently, many possibilities to improve LoE and RuleLoE have yet to be explored in regard to usability, performance, and presentation. Within our experiments, LoE was based on decision trees; however, LoE is not limited to them. Different models may be able to capture different attribute distributions, and a combination of different models might capture more complex distributions while retaining explainability. To guarantee the applicability of these methods, possibilities for user interaction (e.g., cutting nodes from decision trees, white- or blacklisting specific attributes for specific experts, or removing individual rules from RuleLoE) have to be integrated into accessible user interfaces, for which user studies will have to be conducted. Such an interface ideally facilitates the integration of intuitive human understanding of a concrete application into the corresponding machine learning pipeline. Proper user interactivity potentially reduces hidden biases and improves causal decision mining. This introduction to the LoE framework is hence a starting point for future developments.

Figure 1. (Partial) decision tree example with (A-C) being the root, inner nodes, and leaves, respectively. An example query with the corresponding trace is highlighted in orange.

Figure 3. Training process of LoE with two experts, separated into three steps: (1) data-to-model assignment (A), (2) evaluation phase, and (3) optimization phase (movement). Experts are illustrated as gear wheels. The training data are shown using purple and yellow circles, where each color depicts a distinct class of the learning problem. In (2), a model's expected prediction performance on different partitions of the feature space is colored in green and red, relating to good and bad performance, respectively.

Figure 4. Exemplified LoE instance containing two experts (g_1 and g_2) with A being replaced by a single decision tree (surrogate). The orange line denotes the decision path of a query x_q (where x_q = (a_1, a_2, a_3) satisfies a_1 = v_2 ∧ a_3 < 5 ∧ a_2 < 3) through both the surrogate and one expert. Note that only the expert g_1 is relevant for the query x_q.

Figure 6. UCI breast cancer dataset results. Shown are paired box plots over the proportions of retained features over 20 runs for (A) assignment accuracy (p_ass), measured on test data, (B) average rule/path length, (C) accuracy on the test data, and (D) accuracy on the training data.

Figure 7. FICO dataset results. Shown are paired box plots over the proportions of retained features over 20 runs for (A) assignment accuracy (p_ass), measured on test data, (B) average rule/path length, (C) accuracy on the test data, and (D) accuracy on the training data.

Figure 8. FICO dataset results. Shown are the rules obtained via RuleLoE.

Figure 5. Exemplified LoE instance containing two experts (g_1 and g_2) with A being replaced by two decision trees (surrogates). Note that this example is only illustrative, as the surrogates would be inverse instances of each other in a 2-expert scenario.

Table 1. Overview of our selected subset of the KEEL dataset repository with sample sizes between 5000 and 10,000.

Table 4. Test results for the KEEL datasets, comparing the accuracies of LoE, the derived RuleLoE, and a random forest classifier as pairs of train/test accuracies. Additionally, the number of rules and their average length are reported. Best average results are highlighted in bold.