A Model-Agnostic Algorithm for Bayes Error Determination in Binary Classification

This paper presents the intrinsic limit determination algorithm (ILD algorithm), a novel technique to determine the best possible performance, measured in terms of the AUC (area under the ROC curve) and accuracy, that can be obtained from a specific dataset in a binary classification problem with categorical features, {\sl regardless} of the model used. This limit, namely the Bayes error, is completely independent of any model and describes an intrinsic property of the dataset. The ILD algorithm thus provides important information regarding the prediction limits of any binary classification algorithm applied to the considered dataset. In this paper the algorithm is described in detail, its entire mathematical framework is presented, and pseudocode is given to facilitate its implementation. Finally, an example with a real dataset is given.


Introduction
The majority of machine learning projects tend to follow the same pattern. Many different model types (such as decision trees, logistic regression, random forests, or neural networks) are first trained on the data to predict a specific outcome, and then tested and compared to find the one that gives the best prediction performance on validation data. Many techniques to compare models have been developed and are commonly used in several settings [1,2]. For some model types, such as neural networks, it is difficult to know when to stop the search [3]: there is always the hope that a different set of hyperparameters, such as the number of layers, or a better optimizer will yield better performance [3,4]. This makes model comparison laborious and time-consuming.
Many factors may lead to poor accuracy, the most important being overlapping class densities [5,6], noise affecting the data [7,8,9], deficiencies in the classifier, and limitations in the training data [10]. Classifier deficiencies can be addressed by building better models, of course, but errors linked to the data themselves (for example mislabeled patterns or missing relevant features) lead to an error that cannot be reduced by any optimisation of the model training, regardless of the effort invested. This error is known in the literature as the Bayes error (BE) [3,11]. The BE could, in theory, be obtained from Bayes' theorem if all the probability densities were known exactly. This is impossible in real-life scenarios, and thus the BE cannot be computed directly from the data in any nontrivial case. The Naïve Bayes classifier [11] addresses this problem approximately, but it rests on the assumption of conditional independence of the features, which is rarely satisfied in practice [12,10,11]. The methods to estimate the BE developed in the past decade tend to follow the same strategy: reduce the error linked to the classifier as much as possible, so that only the BE remains. Ensemble strategies and meta-learners [12,10,13] have been widely used for this purpose; the idea is to exploit multiple predictions to provide an indication of the performance limits for a given dataset [10]. This approach has been used especially with neural networks, given their nature as universal approximators [14,15].
In any supervised learning task, knowing the BE associated with a given dataset would be of extreme importance. Such a value would help practitioners decide whether it is worthwhile to spend time and computing resources on improving the developed classifiers or on acquiring additional training data. Even more importantly, knowing the BE would let practitioners assess whether the available features in a dataset are useful for a specific goal. Suppose, for example, that a set of clinical exams is available for a large number of patients. If this feature set gives a BE of 30% (hence a maximum accuracy of 70%) in predicting an outcome, and a BE smaller than 20% is the desired target, it is useless to spend time developing models; the time would be better spent acquiring additional features. This work addresses and solves, from a theoretical point of view, the problem of determining the BE intrinsic to a given dataset.
The contribution of this paper is twofold. Firstly, a new algorithm, called the intrinsic limit determination algorithm (ILD algorithm), is presented. The ILD algorithm computes the maximum performance achievable in a binary classification problem, expressed both as the largest area under the ROC curve (AUC) and as the largest accuracy, for any given dataset with categorical features. This is the main contribution of this paper, also with respect to previous methods, since the ILD algorithm for the first time allows evaluating the BE of a given dataset exactly. This paper demonstrates that the BE does not depend on any chosen model but is an inherent property of the dataset itself. Thus, the ILD algorithm gives the upper limit of the prediction capability of any possible model applied to a given dataset, with the only restrictions that the features must be categorical and the target variable binary. Secondly, the mathematical framework on which the ILD algorithm is based is discussed and a mathematical proof of the algorithm's validity is given. The algorithm's computational complexity is also discussed.
The paper is organized as follows. The necessary notation and the dataset restructuring required by the ILD algorithm are discussed in Section 2. In Section 3 the complete mathematical formalism is presented in detail. Section 4 describes the ILD algorithm itself, states the fundamental ILD theorem, and discusses its proof. The application of the ILD algorithm to a real dataset is presented in Section 5. Finally, conclusions and promising research directions are discussed in Section 6.

Mathematical Notation and Dataset Aggregation
Let us consider a dataset with N categorical features, i.e., features that can only assume a finite set of values. Suppose that the $i$-th feature, denoted as $F_i$, takes $n_i$ possible values. For notational simplicity, it is assumed that each categorical feature is encoded so that its possible values are the integers from 1 to $n_i$, with $i = 1, \ldots, N$ (note that the $n_i$ can assume different integer values). Each possible combination of feature values is called here a bucket: the idea is that the observations will be aggregated in buckets according to their feature values. The number of observations in the dataset is indicated with M. All observations with the same feature values are said to be in the same bucket.
The problem is one of binary classification, aiming at predicting an event that can have only two possible outcomes, indicated here with class 0 and class 1. In general, the $j$-th bucket will contain a certain number of observations of class 0, indicated with $m_0^{[j]}$, and a certain number of observations of class 1, indicated with $m_1^{[j]}$. The feature vector of each observation, denoted as $x_k$ (with $k = 1, \ldots, M$), is thus an N-dimensional vector $(F_{1,k}, F_{2,k}, \ldots, F_{N,k}) \in \mathbb{N}^N$, where $F_{i,k}$ denotes the value of the $i$-th feature of the $k$-th sample.
The following useful definitions are now introduced.
Definition 2.1 (Feature bucket). A feature bucket $B^{[j]}$ is a possible combination of the values of the N features, i.e., $B^{[j]} = (F_1^{[j]}, F_2^{[j]}, \ldots, F_N^{[j]})$. In the rest of the paper, the feature bucket is written with the feature values characterizing $B^{[j]}$ made explicit. The total number $N_B$ of feature buckets is thus equal to $N_B = \prod_{i=1}^{N} n_i$. As an example, in the case of two binary features $F_1$ and $F_2$, four possible feature buckets can be constructed, namely $B^{[1]} = (0, 0)$, $B^{[2]} = (0, 1)$, $B^{[3]} = (1, 0)$ and $B^{[4]} = (1, 1)$.
Let $M^{[j]}$ denote the set of observations belonging to the feature bucket $B^{[j]}$; its cardinality is denoted as $|M^{[j]}|$. In a binary classification problem, the bucket $B^{[j]}$ will contain $m_0^{[j]} \in \mathbb{N}_0$ observations with a target variable equal to 0 and $m_1^{[j]} \in \mathbb{N}_0$ observations with a target variable equal to 1. Note that by definition $|M^{[j]}| = m_0^{[j]} + m_1^{[j]}$ and $\sum_{j=1}^{N_B} |M^{[j]}| = M$. Based on the definitions above, the original dataset can be equivalently represented by a new dataset of buckets, an aggregated dataset B, each bucket of which contains $|M^{[j]}|$ samples: $m_0^{[j]}$ of class 0 and $m_1^{[j]}$ of class 1. In the previous example of a dataset with only two binary features, B would look like the one in Table 1. In this simple example, a dataset with any number of observations M is reduced to one with only 4 records, i.e., the number of possible buckets.
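The aggregation step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name `aggregate_into_buckets` and the dict-based `{bucket: [m0, m1]}` representation are our own choices.

```python
from collections import defaultdict

def aggregate_into_buckets(X, y):
    """Aggregate observations into feature buckets.

    X: iterable of categorical feature tuples; y: iterable of 0/1 labels.
    Returns a dict mapping each feature combination to [m0, m1], the
    counts of class-0 and class-1 observations in that bucket.
    """
    buckets = defaultdict(lambda: [0, 0])
    for features, label in zip(X, y):
        buckets[tuple(features)][label] += 1
    return dict(buckets)

# Two binary features, as in the example above:
X = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
y = [0, 1, 0, 1, 1, 1]
B = aggregate_into_buckets(X, y)
# B[(0, 0)] == [1, 1]: one class-0 and one class-1 observation
```

However large M is, the aggregated dataset has at most $N_B$ records.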
With this new representation of the dataset, generated by aggregating into buckets all observations sharing the same feature values, the proposed ILD algorithm can compute the best possible ROC curve over all possible predictions.

ILD Algorithm Mathematical Framework
Since the output of any machine learning model is a function of the feature values, and since a bucket is a collection of observations all with the same feature values, any deterministic model will associate to the $j$-th feature bucket only one possible class prediction $p^{[j]}$, which can be either 0 or 1. More generally, to each model can be associated a prediction vector $p = (p^{[1]}, p^{[2]}, \ldots, p^{[N_B]})$. In the next sections, important quantities (such as the TPR and FPR) are evaluated for the aggregated dataset B as functions of $m_0^{[j]}$, $m_1^{[j]}$ and $p^{[j]}$. In fact, if the model predicts $p^{[j]} = 1$, the bucket contributes $m_1^{[j]}$ true positives and $m_0^{[j]}$ false positives; if $p^{[j]} = 0$, it contributes $m_0^{[j]}$ true negatives and $m_1^{[j]}$ false negatives. Considering the entire dataset, the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are given by
$$TP = \sum_{j=1}^{N_B} p^{[j]} m_1^{[j]}, \quad TN = \sum_{j=1}^{N_B} (1 - p^{[j]}) m_0^{[j]}, \quad FP = \sum_{j=1}^{N_B} p^{[j]} m_0^{[j]}, \quad FN = \sum_{j=1}^{N_B} (1 - p^{[j]}) m_1^{[j]},$$
where the sums are performed over all the $N_B$ buckets.

Accuracy
In a binary classification problem, the accuracy a is defined as the fraction of correct predictions. Using the expressions for TP and TN above, the accuracy can be rewritten as
$$a = \frac{TP + TN}{M} = \frac{1}{M} \sum_{j=1}^{N_B} \left[ p^{[j]} m_1^{[j]} + (1 - p^{[j]}) m_0^{[j]} \right].$$
Each bucket therefore contributes $m_1^{[j]}$ correct predictions if $p^{[j]} = 1$ and $m_0^{[j]}$ if $p^{[j]} = 0$; the maximum accuracy is obtained if the model predicts the majority class of each bucket. This can be stated as follows.
Theorem 3.1. The accuracy for an aggregated categorical dataset B is maximised by choosing $p^{[j]} = 1$ if $m_1^{[j]} \geq m_0^{[j]}$ and $p^{[j]} = 0$ otherwise, giving the maximum accuracy
$$a_{\max} = \frac{1}{M} \sum_{j=1}^{N_B} \max\left(m_0^{[j]}, m_1^{[j]}\right).$$
Proof. The proof is given by considering each bucket separately. Consider a bucket j. There are two possibilities: if $p^{[j]} = 1$ the bucket contributes $m_1^{[j]}/M$ to the accuracy, while if $p^{[j]} = 0$ it contributes $m_0^{[j]}/M$. The $j$-th contribution to the accuracy is therefore maximised by choosing the larger of the two, i.e., $\max(m_0^{[j]}, m_1^{[j]})/M$, which proves the claim.
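Theorem 3.1 translates directly into code. The following sketch is our own illustration (the name `max_accuracy` and the `{bucket: (m0, m1)}` representation are assumed conventions, not from the paper): the maximum achievable accuracy, i.e., one minus the BE, is the normalized sum of the majority counts.

```python
def max_accuracy(buckets):
    """Maximum achievable accuracy for an aggregated dataset:
    predicting the majority class of each bucket makes every bucket
    contribute max(m0, m1) correct predictions out of M observations."""
    M = sum(m0 + m1 for m0, m1 in buckets.values())
    return sum(max(m0, m1) for m0, m1 in buckets.values()) / M

buckets = {(0, 0): (1, 1), (0, 1): (3, 1), (1, 0): (0, 2), (1, 1): (1, 4)}
# correct predictions: 1 + 3 + 2 + 4 = 10 out of M = 13
```

The Bayes error of this toy dataset is then `1 - max_accuracy(buckets)`.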

Sensitivity and specificity
The sensitivity, or true positive rate (TPR), measures the ability to correctly predict the positive cases. Considering the entire dataset, the TPR can be expressed using the definitions of TP and FN above as
$$TPR = \frac{TP}{TP + FN} = \frac{1}{M_1} \sum_{j=1}^{N_B} p^{[j]} m_1^{[j]},$$
where $M_1 = \sum_{j=1}^{N_B} m_1^{[j]}$ is the total number of positive observations. Analogously, the specificity, or true negative rate (TNR), which measures the ability to correctly reject the negative cases, can be written as
$$TNR = \frac{TN}{TN + FP} = \frac{1}{M_0} \sum_{j=1}^{N_B} (1 - p^{[j]}) m_0^{[j]},$$
where $M_0 = \sum_{j=1}^{N_B} m_0^{[j]}$ is the total number of negative observations.
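These two rates depend only on the bucket counts and the prediction vector. A minimal sketch, assuming the same `{bucket: (m0, m1)}` representation as before and representing the prediction vector as a dict (both our own conventions):

```python
def rates(buckets, p):
    """TPR and TNR for a prediction vector p, given as a dict mapping
    each bucket (feature combination) to its predicted class, 0 or 1."""
    M0 = sum(m0 for m0, m1 in buckets.values())
    M1 = sum(m1 for m0, m1 in buckets.values())
    TP = sum(m1 for b, (m0, m1) in buckets.items() if p[b] == 1)
    TN = sum(m0 for b, (m0, m1) in buckets.items() if p[b] == 0)
    return TP / M1, TN / M0

buckets = {(0, 0): (1, 1), (0, 1): (3, 1), (1, 0): (0, 2), (1, 1): (1, 4)}
p = {b: 1 for b in buckets}  # predicting 1 everywhere
# rates(buckets, p) == (1.0, 0.0)
```

Predicting 1 everywhere gives TPR = 1 and TNR = 0, the starting point of the curve construction described in Section 4.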

ROC Curve
The receiver operating characteristic (ROC) curve is built by plotting the true positive rate (TPR) on the y-axis and the false positive rate (FPR) on the x-axis. For completeness, the FPR is
$$FPR = 1 - TNR = \frac{1}{M_0} \sum_{j=1}^{N_B} p^{[j]} m_0^{[j]}.$$

Perfect Bucket and Perfect Dataset
Sometimes a bucket may contain only observations that are all of class 0 or all of class 1. Such a bucket is called in this paper a perfect bucket and is defined as follows.
Definition 3.1 (Perfect bucket). A feature bucket $B^{[j]}$ is called perfect if either $m_0^{[j]} = 0$ or $m_1^{[j]} = 0$.
It is also useful to define the set of all perfect buckets P .
Definition 3.2. The set of all perfect buckets, indicated with P, is defined by
$$P = \left\{ B^{[j]} : m_0^{[j]} = 0, \ m_1^{[j]} > 0 \right\} \cup \left\{ B^{[j]} : m_1^{[j]} = 0, \ m_0^{[j]} > 0 \right\}.$$
Note that by definition the set B \ P contains only imperfect buckets, namely buckets where both $m_0^{[j]} > 0$ and $m_1^{[j]} > 0$. An important special case is that of a perfect dataset, one in which all buckets are perfect: indicating with B the set containing all buckets, we have B = P. It is easy to see that in this case a prediction vector that predicts all cases correctly can be created by simply choosing, for each feature bucket j, $p^{[j]} = 1$ if $m_0^{[j]} = 0$ and $p^{[j]} = 0$ if $m_1^{[j]} = 0$ (remember that in a perfect dataset all feature buckets are perfect, so exactly one of the two counts is zero).

The Intrinsic Limit Determination Algorithm
Let us introduce the prediction vector $p_S \equiv (1, 1, \ldots, 1)$, for which it clearly holds that
$$TPR|_{p_S} = 1, \qquad FPR|_{p_S} = 1,$$
where $TPR|_{p_S}$ and $FPR|_{p_S}$ indicate the TPR and FPR evaluated for $p_S$, respectively.
Let us indicate as a flip the change of a component of a prediction vector from the value 1 to the value 0. Any possible prediction vector can thus be obtained by a finite series of flips starting from $p_S$, where a flip is done only on components with a value equal to 1. Let us denote with $p_1$ the prediction vector obtained after the first flip, $p_2$ the one obtained after the second, and so on. After $N_B$ flips, the prediction vector will be $p_{N_B} \equiv (0, 0, \ldots, 0)$. The TPR and FPR evaluated for a prediction vector $p_i$ (with $i = 1, \ldots, N_B$) are indicated as $TPR|_{p_i}$ and $FPR|_{p_i}$. The set of tuples of point coordinates is indicated with
$$\mathcal{P} = \left\{ \left(FPR|_{p_i}, TPR|_{p_i}\right) : i = S, 1, \ldots, N_B \right\}.$$
A curve can be constructed by joining the points contained in $\mathcal{P}$ with ordered segments, meaning that the point $(FPR|_{p_S}, TPR|_{p_S})$ is connected to $(FPR|_{p_1}, TPR|_{p_1})$, the point $(FPR|_{p_1}, TPR|_{p_1})$ to $(FPR|_{p_2}, TPR|_{p_2})$, and so on. The segment that joins $(FPR|_{p_S}, TPR|_{p_S})$ and $(FPR|_{p_1}, TPR|_{p_1})$ is denoted as $s^{[0]}$; the one that joins $(FPR|_{p_1}, TPR|_{p_1})$ and $(FPR|_{p_2}, TPR|_{p_2})$ is denoted as $s^{[1]}$, and so on. In Figure 2, the curve obtained by joining the tuples given by the prediction vectors of three flips is visualised to give an intuitive understanding of the process.
The ILD algorithm provides a constructive process to select the order of the components to be flipped so as to obtain the curve $\hat{C}$ characterized by the theoretical maximum AUC that can be obtained from the considered dataset, regardless of the predictive model used.
To be able to describe the ILD algorithm effectively and prove its validity, some additional concepts are needed and described in the following paragraphs.

Effect of one single flip
Let us consider what happens if one single component, say the $j$-th one, of $p_S$ is changed from 1 to 0. The TPR and FPR values clearly change. Denoting with $p_1$ the prediction vector in which the $j$-th component has been flipped, the following equations hold:
$$TPR|_{p_1} = 1 - \frac{m_1^{[j]}}{M_1}, \qquad FPR|_{p_1} = 1 - \frac{m_0^{[j]}}{M_0}.$$
Therefore, the TPR and FPR are reduced by amounts equal to the ratios $m_1^{[j]}/M_1$ and $m_0^{[j]}/M_0$, respectively. As an example, the effect of a series of single flips on the TPR and FPR is illustrated in Figure 3, which shows the ROC curves resulting from flipping components in a random order, starting from $p_S$, for a real-life dataset, namely the Framingham dataset [16] (see Section 5). As expected, flipping components randomly results in a curve that lies close to the diagonal. Since the diagonal corresponds to randomly assigning classes to the observations, random flipping does not lead to the best prediction possible with the given dataset.
By ordering the flips in ascending order of the ratio between the TPR decrease and the FPR decrease, $\left(m_1^{[j]}/M_1\right) / \left(m_0^{[j]}/M_0\right)$, a new set of points $\hat{\mathcal{P}}$ is constructed. It can happen that in a given dataset multiple buckets have $m_0^{[j]} = 0$; in this case the ratio cannot be calculated, and all the corresponding points are placed at the end of the list. The order among those points is irrelevant. It is interesting to note that these are exactly the flips of perfect buckets with $m_0^{[j]} = 0$, for which the FPR does not change.
With the ordered set $\hat{\mathcal{P}}$, a curve $\hat{C}$ can be constructed by joining the points of $\hat{\mathcal{P}}$ as described in the previous paragraph. Note that the relative order of all flips that leave the FPR unchanged is also irrelevant, in the sense that this order does not affect the AUC of $\hat{C}$.
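The construction of $\hat{C}$ can be sketched end to end. This is an illustration under the assumed `{bucket: (m0, m1)}` representation (our convention, not the paper's code): flips are sorted in ascending order of the slope ratio, with the zero-FPR-change flips last, and the AUC is accumulated as trapezoids while walking from (1, 1) down to (0, 0).

```python
def ild_roc(buckets):
    """Build the maximal ROC curve and its AUC for an aggregated dataset.

    Starting from p_S at (FPR, TPR) = (1, 1), each bucket j is flipped
    exactly once, moving the point by (-m0_j/M0, -m1_j/M1).  Flipping in
    ascending order of the slope (m1_j/M1)/(m0_j/M0), with the flips
    that leave the FPR unchanged (m0_j == 0) last, yields the convex
    curve with the largest AUC.
    """
    M0 = sum(m0 for m0, m1 in buckets.values())
    M1 = sum(m1 for m0, m1 in buckets.values())
    flips = sorted(
        buckets.values(),
        key=lambda c: (c[1] * M0) / (c[0] * M1) if c[0] > 0 else float("inf"),
    )
    fpr, tpr, auc = 1.0, 1.0, 0.0
    points = [(fpr, tpr)]
    for m0, m1 in flips:
        nfpr, ntpr = fpr - m0 / M0, tpr - m1 / M1
        auc += (fpr - nfpr) * (tpr + ntpr) / 2.0  # trapezoid under segment
        fpr, tpr = nfpr, ntpr
        points.append((fpr, tpr))
    return points, auc

# Two buckets with (m0, m1) = (3, 1) and (1, 3): the best curve passes
# through (0.25, 0.75) and has AUC = 0.75.
```

The sort is the whole algorithm: once the flips are ordered, the maximal curve and the BE (via Theorem 3.1) follow mechanically.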

ILD Theorem
The ILD theorem can now be formulated.
Theorem 4.1 (ILD Theorem). Among all possible curves that can be constructed by generating a set of points $\mathcal{P}$ by flipping all components of $p_S$, one at a time and in any order, the curve $\hat{C}$ has the maximum AUC.
Proof. The theorem is proven by giving a construction algorithm. It starts with a set of points $\mathcal{P}_0$ generated by flipping the components of $p_S$ in a random order. Consider two adjacent segments generated with $\mathcal{P}_0$, $s^{[j]}$ and $s^{[j+1]}$. In Figure 4, panel (A), the two segments are plotted in the case where the angle between them satisfies $\beta^{[j,j+1]} < \pi$; $\alpha^{[j]}$ and $\alpha^{[j+1]}$ indicate the angles of the segments with the horizontal direction, and $\beta^{[j,j+1]}$ the angle between the two segments. The area under the two segments (above any horizontal line that lies below them) can be increased by simply switching the order of the two flips, as depicted in Figure 4, panel (B); switching the order simply means performing the $(j+1)$-th flip before the $j$-th one.
It is important to note that in Figure 4, panel (A), $\beta^{[j,j+1]} < \pi$, while in panel (B) $\beta^{[j+1,j]} > \pi$. It is evident that the area under the two segments in panel (B) is greater than in panel (A); the parallelogram in Figure 4, panel (C), depicts the area difference.
The proof is completed by repeating the previous step until all angles satisfy $\beta^{[j,j+1]} > \pi$. This procedure is described in pseudo-code in Algorithm 1. The area under the curve obtained with Algorithm 1 cannot be made larger by any further switch of points in $\mathcal{P}$.
Note that Algorithm 1 ends after a finite number of steps. This can be shown by noting that the described procedure is nothing else than the bubble sort algorithm [17] applied to the set of angles $\alpha^{[1]}, \alpha^{[2]}, \ldots, \alpha^{[N_B]}$.
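The switching procedure can be sketched as an explicit bubble sort over the flips (an illustration under the same assumed `(m0, m1)` bucket representation; the pseudocode of Algorithm 1 is not reproduced here). Comparing slopes by cross-multiplication keeps the vertical segments of perfect buckets ($m_0^{[j]} = 0$) well defined.

```python
def ild_by_switching(flips):
    """Reorder the flips by repeatedly switching adjacent pairs whenever
    segment j is steeper than segment j+1, i.e. whenever the angle
    between them is smaller than pi.  This is the bubble sort that the
    proof of Theorem 4.1 refers to; it ends after finitely many passes."""
    flips = list(flips)
    changed = True
    while changed:
        changed = False
        for j in range(len(flips) - 1):
            (a0, a1), (b0, b1) = flips[j], flips[j + 1]
            # slope_j > slope_{j+1}  <=>  a1/a0 > b1/b0, compared by
            # cross-multiplication so that m0 == 0 (a vertical segment)
            # needs no special case
            if a1 * b0 > b1 * a0:
                flips[j], flips[j + 1] = flips[j + 1], flips[j]
                changed = True
    return flips

# Random order -> ascending slope, with the vertical (m0 == 0) flip last:
# ild_by_switching([(1, 3), (3, 1), (0, 2)]) == [(3, 1), (1, 3), (0, 2)]
```

Direct sorting by slope produces the same order in $O(N_B \log N_B)$ instead of $O(N_B^2)$; the bubble-sort form is shown only because it mirrors the proof.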

Handling missing values
Missing values can be handled by imputing them with a value that does not appear among the real values of any feature. All observations with a missing value in a specific feature will then be assigned to the same bucket and considered similar, since there is no way of knowing better.
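A minimal sketch of this imputation (a hypothetical helper: here `None` marks a missing value and the sentinel `-1` is assumed not to occur as a real category):

```python
def fill_missing(rows, sentinel=-1):
    """Replace missing feature values with a sentinel category so that
    all observations missing the same feature fall into the same bucket."""
    return [tuple(sentinel if v is None else v for v in row) for row in rows]

rows = [(0, None, 2), (0, None, 2), (1, 1, None)]
# -> [(0, -1, 2), (0, -1, 2), (1, 1, -1)]; the first two rows share a bucket
```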

Application of the ILD algorithm to the Framingham Heart Study Dataset
The power of the ILD algorithm is best demonstrated by applying it to a real dataset, here the medical dataset named Framingham [16], publicly available on the Kaggle website [18]. This dataset comes from an ongoing cardiovascular risk study on residents of the town of Framingham (Massachusetts, US). Different versions of the cardiovascular risk score have been developed over the years [19], the most recent of which is the 2008 study by D'Agostino et al. [20], to which the ILD algorithm results are compared.
The classification goal is to predict, given a series of risk factors, a patient's 10-year risk of future coronary heart disease. This is a high-impact task, since 17.9 million deaths occur worldwide every year due to heart disease [21], and an early prognosis may be of crucial importance for a correct and successful treatment. The dataset used in our study contains 4238 patients and 7 features: gender (0: female, 1: male); smoker (0: no, 1: yes); diabetes (0: no, 1: yes); hypertension treatment (0: no, 1: yes); age; total cholesterol; and systolic blood pressure (SBP). The last three features are continuous variables and are discretized as follows:
• age: 0 if age < 40 years, 1 if 40 years ≤ age < 60 years, 2 if age ≥ 60 years;
• total cholesterol: 0 if total cholesterol < 200 mg/dL, 1 if 200 mg/dL ≤ total cholesterol < 240 mg/dL, 2 if total cholesterol ≥ 240 mg/dL;
• SBP: 0 if SBP < 120 mmHg, 1 if SBP ≥ 120 mmHg.
The outcome variable is binary (0: no coronary disease, 1: coronary disease).
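The discretization above can be written compactly (an illustrative helper with the thresholds as listed, not the study's actual code):

```python
def discretize(age, total_chol, sbp):
    """Map the three continuous Framingham features to the categories
    defined by the thresholds listed above."""
    age_cat = 0 if age < 40 else (1 if age < 60 else 2)
    chol_cat = 0 if total_chol < 200 else (1 if total_chol < 240 else 2)
    sbp_cat = 0 if sbp < 120 else 1
    return age_cat, chol_cat, sbp_cat

# discretize(52, 245, 118) == (1, 2, 0)
```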
To correctly interpret the comparison with existing results, it is important to remark that the dataset used in our study lacks the high-density lipoprotein (HDL) cholesterol variable present in the original Framingham dataset employed in [20]. Finally, to create the buckets, all missing values are substituted by a feature value of -1.
The application of the algorithm starts with the population of the buckets as described in the previous sections. A total number of 177 buckets are generated. The dataset is split into two parts: a training set S T (80% of the data) and a validation one S V (20% of the data). For comparison, the Naïve Bayes classifier is also trained and validated on S T and S V . The performance of the ILD algorithm and the Naïve Bayes classifier are shown through the ROC curve in Figure 5. The AUC for the ILD algorithm is 0.78, clearly higher than that for the Naïve Bayes classifier, namely, 0.68.
To further test the performance, the splitting and training are repeated for 100 different dataset splits. Each time, both the ILD algorithm and the Naïve Bayes classifier are applied to the validation set S V, and the resulting AUCs are plotted in Figure 6 (top panel). For clarity, the difference between the AUCs provided by the two algorithms is shown in Figure 6 (bottom panel).
This example shows how the ILD algorithm allows comparing the prediction performance of a model, here the Naïve Bayes classifier, with the maximum obtainable from a given dataset. The maximum accuracy over the validation set S V is 85% for the Naïve Bayes classifier (calculated at the specific threshold, 61%, which optimizes the accuracy over the training set S T) and 86% for the ILD algorithm (calculated applying Theorem 3.1). The reported accuracies refer to the ROC curves shown in Figure 5. The two values are similar and are reported for completeness, even though this result may be misleading. The reason lies in the strong dataset imbalance, since only 15% of the patients experienced a cardiovascular event. In particular, both the Naïve Bayes classifier and the ILD algorithm obtain a true positive rate and a false positive rate near zero at the point of the ROC curve which maximises the accuracy over the validation set, with a high misclassification rate among positive patients, which however cannot be noticed from the accuracy alone. As is well known, the accuracy is not a good metric for imbalanced datasets, and the AUC is a much more widely used metric that does not suffer from the problem described above.

Conclusions
This work presents a new algorithm, the ILD algorithm, which determines the best possible ROC curve that can be obtained from a dataset with categorical features and a binary outcome, regardless of the predictive model.
The ILD algorithm is of fundamental importance to practitioners because it allows:
• determining the prediction power (namely, the BE) of a specific set of categorical features;
• deciding when to stop searching for better models;
• deciding whether it is necessary to enrich the dataset.
The ILD algorithm thus has the potential to revolutionize how binary prediction problems are solved, allowing practitioners to save an enormous amount of effort, time, and money (considering, for example, that computing time is expensive, especially in cloud environments). The major limitations of the ILD algorithm are, firstly, the requirement for the features to be categorical. The generalization of this approach to continuous features is the natural next step and will open new ways of understanding datasets with continuous features. Secondly, the ILD algorithm works well only when the buckets are populated with enough observations: it would not give any useful information on a dataset with just one observation in each bucket, since such a dataset would be perfect. Consider the example of grayscale images. Even if pixel values could be considered categorical (the gray level of a pixel is an integer between 0 and 255), two major problems would arise if the ILD algorithm were applied to such a case: the number of buckets would be extremely large, and each bucket would contain only one image, making the ILD algorithm completely useless, as only perfect buckets would be constructed.
An important further research direction is the extension of the ILD algorithm to detect the best performing models that do not overfit the data. In the example of images, it is clear that, the dataset being perfect, one could theoretically construct a perfect predictor, therefore reaching a maximum accuracy of 1. The interesting question is how to determine the maximum accuracy or the best AUC only in cases in which no overfitting occurs. This is a nontrivial problem that is currently under investigation by the authors. To address this problem, at least partially, the authors have defined a perfection index $I_P$ that can help in this regard; it is discussed in Appendix A.
To conclude, although more research is needed to generalize the ILD algorithm, it is, to the best knowledge of the authors, the first algorithm able to determine the exact BE of a generic dataset with categorical features, regardless of the predictive model.
Appendix A. Perfection Index $I_P$
It is very useful to measure with a single number how close to perfect a dataset is. To this end, a perfection index $I_P$ can be defined.
Definition Appendix A.1. The perfection index $I_P$ is defined as
$$I_P = \frac{1}{M} \sum_{j=1}^{N_B} \left| m_0^{[j]} - m_1^{[j]} \right|.$$
Note that if a bucket j is perfect then either $m_0^{[j]} = 0$ or $m_1^{[j]} = 0$, and therefore $|m_0^{[j]} - m_1^{[j]}| = |M^{[j]}|$; hence for a perfect dataset $I_P = 1$. For an imperfect dataset, $I_P$ is less than 1: every bucket with both $m_0^{[j]} > 0$ and $m_1^{[j]} > 0$ contributes $|m_0^{[j]} - m_1^{[j]}| < |M^{[j]}|$, so the sum cannot reach M as long as the set of imperfect buckets is not empty. A special interesting case is that of a completely imperfect dataset, one in which $m_0^{[j]} = m_1^{[j]}$ for every bucket $j = 1, \ldots, N_B$. In this case, regardless of the predictions a model may make, the accuracy will always be 0.5, and $I_P = 0$. This index is particularly useful because the following theorem can be proved.
Theorem Appendix A.1. The perfection index satisfies the relationship
$$I_P = \max_{p \in S_P} a(p) - \min_{p \in S_P} a(p),$$
where $S_P$ is the set of all possible prediction vectors for the bucket set B. The perfection index thus measures the range of values that the accuracy a can take.
Proof. To start the proof, the formulas for the maximum and minimum possible accuracy must be derived. Starting from the expression for the accuracy and choosing $p^{[j]} = 1$ when $m_1^{[j]} \geq m_0^{[j]}$ and $p^{[j]} = 0$ otherwise (to get the maximum accuracy), one obtains
$$\max_{p \in S_P} a(p) = \frac{1}{M} \sum_{j=1}^{N_B} \max\left(m_0^{[j]}, m_1^{[j]}\right).$$
Analogously, choosing the opposite predictions gives
$$\min_{p \in S_P} a(p) = \frac{1}{M} \sum_{j=1}^{N_B} \min\left(m_0^{[j]}, m_1^{[j]}\right).$$
Subtracting the two expressions and noting that $\max(m_0^{[j]}, m_1^{[j]}) - \min(m_0^{[j]}, m_1^{[j]}) = |m_0^{[j]} - m_1^{[j]}|$ yields $I_P$. This concludes the proof.
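As an illustration, the index can be computed directly from the bucket counts. This sketch uses our own `(m0, m1)` bucket convention and the form $I_P = \frac{1}{M}\sum_j |m_0^{[j]} - m_1^{[j]}|$, i.e., the accuracy spread of Theorem A.1:

```python
def perfection_index(buckets):
    """I_P = (1/M) * sum_j |m0_j - m1_j|: the spread between the maximum
    and minimum accuracy achievable over all prediction vectors."""
    M = sum(m0 + m1 for m0, m1 in buckets.values())
    return sum(abs(m0 - m1) for m0, m1 in buckets.values()) / M

# Perfect dataset (every bucket pure):
# perfection_index({(0,): (3, 0), (1,): (0, 5)}) == 1.0
# Completely imperfect dataset (m0 == m1 everywhere):
# perfection_index({(0,): (2, 2), (1,): (4, 4)}) == 0.0
```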
Appendix A.1. Interpretation of the perfection index $I_P$
In the two extreme cases the interpretation is clear. If $I_P$ is small, the ILD algorithm gives very useful information: the dataset is "imperfect" enough that the analysis described here is valuable. The closer the value of $I_P$ gets to one, the less helpful the analysis as formulated here becomes, since for a perfect dataset a perfect prediction model can easily be built and the question of what is the best possible model loses significance. Note that a big assumption is being made, namely that a feature bucket with few observations carries the same information as one with a thousand observations in it. In real life, a feature bucket with just one (or a few) observations is probably due to the lack of collected observations with that combination of features, and its importance should therefore be doubted.