Redundancy Is Not Necessarily Detrimental in Classification Problems

In feature selection, redundancy is one of the major concerns since the removal of redundancy in data is connected with dimensionality reduction. Despite the evidence of such a connection, few works present theoretical studies regarding redundancy. In this work, we analyze the effect of redundant features on the performance of classification models. We can summarize the contribution of this work as follows: (i) develop a theoretical framework to analyze feature construction and selection, (ii) show that certain properly defined features are redundant but make the data linearly separable, and (iii) propose a formal criterion to validate feature construction methods. The results of experiments suggest that a large number of redundant features can reduce the classification error. The results imply that it is not enough to analyze features solely using criteria that measure the amount of information provided by such features.


Introduction
In classification, the quality of the information in the features is essential for building a high-quality predictive model. Furthermore, the rapid advances in data acquisition and storage technologies have created high-dimensional data. However, noise, non-informative features, and redundancy, among other issues, make the classification task challenging [1]. Therefore, selecting suitable features is an important preliminary step for building highly predictive classifiers [2].
To reduce dimensionality, there are two main approaches: feature selection and feature construction. Feature selection selects a subset of features from the input to reduce the effects of noise or irrelevant features, while still providing good prediction results [2]. In contrast, feature construction refers to the task of transforming a given set of input features to generate a new set of more predictive features [3].
According to [2], feature selection can be divided into three major categories depending on the evaluation criteria: filter, wrapper, and embedded. Filter methods use intrinsic properties of the data to select a subset of features and are applied as a preprocessing task [4]. Wrappers, in contrast, use learning to guide the search. The learning bias is included in the search and, therefore, they achieve better results; however, they are computationally more expensive. The remainder of this paper is organized as follows: Section 2 presents a mathematical model for feature selection and construction, Section 3 compares feature selection with feature construction, Section 4 presents a theoretical analysis of feature construction, Section 5 shows the experimental results, and finally, Section 6 presents a discussion of all results obtained.

A Mathematical Model for Feature Selection and Construction
In this section, we present a formal framework for the mathematical analysis of feature selection and construction. Let {A_i} be a finite sequence of finite sets in R and let C be another finite set, where each A_i is denoted as feature i and C is the set of possible classes. Taking A = A_1 × A_2 × ... × A_n, we consider a probability distribution P over A × C, and we denote by P[.] and P[.|.] the probability and conditional probability determined by P, respectively. Notice that we may generate a dataset using distribution P, where each record is an element of A × C; we call P a dataset distribution. Denote by {Â_i} the sequence such that Â_i = A_i for i ≤ n and Â_{n+1} = C. Let {S_i} be a subsequence of {Â_i}; we denote (i) S = S_1 × S_2 × ... × S_m, (ii) an element s ∈ S as a pattern of S, and (iii) E^P_S(x) as the event where we sample an instance whose values on the features of S form the pattern x, according to distribution P. We say that s is a not-null pattern of S if P[E^P_S(s)] > 0. Notice that our definition of dataset distribution is general enough to cover both a dataset and its real distribution. For example, given the dataset distribution P in Table 1, we can take A_1 = {1, 2, 3}, A_2 = {1, 2}, A_3 = {0, 1, 2, 3}, and C = {0, 1}. As S = S_1 × S_3 represents all possible values taken by the first and third features, if s = (1, 1) is a pattern of S, then E^P_S(s) = {(1, 1, 1, 0), (1, 2, 1, 0), (1, 1, 1, 1), (1, 2, 1, 1)} is the event where the first and third features have value one. Notice that s is a not-null pattern because P[E^P_S(s)] = 2/5. The following definition formalizes the notion of patterns that do not contradict each other.

Definition 1. Let B and D be subsequences of {Â_i}. A pattern b of B and a pattern d of D are congruent if they take the same value on every feature shared by B and D.

For example, take the dataset distribution P of Table 1 with B = A_1 × A_2 and D = A_2 × A_3. We have that b = (1, 2) ∈ B and d = (2, 1) ∈ D are congruent patterns, because they have the same value in their single shared feature. However, if d̂ = (1, 2) ∈ D, then b and d̂ are not congruent patterns, because they assign different values to the second feature of the dataset.
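To make the framework concrete, the following sketch computes the probability of an event E^P_S(s) and performs a congruence check. Since Table 1 is not reproduced here, the records and weights below are a hypothetical stand-in chosen to be consistent with the examples in the text.

```python
# Hypothetical stand-in for Table 1: records over A1 x A2 x A3 x C, each with the same weight.
records = [(1, 1, 0, 0), (1, 2, 1, 0), (1, 2, 1, 1), (2, 1, 2, 1), (3, 1, 3, 0)]
P = {r: 1 / len(records) for r in records}        # a dataset distribution P over A x C

def event_prob(idxs, pattern):
    """P[E^P_S(s)]: probability that the features with indices in S take the values in s."""
    return sum(p for r, p in P.items()
               if all(r[i] == v for i, v in zip(idxs, pattern)))

def congruent(idxs_b, b, idxs_d, d):
    """Definition 1: two patterns are congruent if they agree on every shared feature."""
    shared = set(idxs_b) & set(idxs_d)
    return all(b[idxs_b.index(i)] == d[idxs_d.index(i)] for i in shared)

# S = A1 x A3 and s = (1, 1): the first and third features both take the value 1.
print(event_prob((0, 2), (1, 1)))                  # 0.4, so s is a not-null pattern
# b = (1, 2) over A1 x A2 and d = (2, 1) over A2 x A3 share only feature 2 (value 2).
print(congruent((0, 1), (1, 2), (1, 2), (2, 1)))   # True
```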
As a dataset distribution P may be inconsistent, we define a function f_P that assigns to every not-null pattern a ∈ A a class that maximizes P[E^P_C(c) | E^P_A(a)]. Notice that an inconsistent dataset distribution always has classification error, because a classifier does not have enough features to distinguish instances of different classes; f_P then gives the category that minimizes the error for any configuration of features. If we consider the dataset distribution of Table 1, we must define f_P such that f_P(1, 1, 0) = 0, f_P(2, 1, 2) = 1, and f_P(3, 1, 3) = 0; however, for any other pattern a ∈ A we can take f_P(a) to be either 0 or 1.
Definition 2. The subsequence B of features is complete for P if, for every class c and all congruent not-null patterns a, b of A, B, respectively, we have:

P[E^P_C(c) | E^P_A(a)] = P[E^P_C(c) | E^P_B(b)].

Definition 2 formalizes the notion of a subset of features carrying the same amount of information as all features as a whole. This notion of information considers that the subset of features is sufficient to estimate the class with the same probability as the original set of features.

Definition 3.
Maintaining the same terms of Definition 2, let B̂^k be the subsequence of B without the term A_k and B̂^k = B̂_1 × B̂_2 × ... × B̂_q. The subsequence B of features is non-redundant for P if, for every k, there are some class c and some not-null congruent patterns b, b̂ of B, B̂^k, respectively, such that:

P[E^P_C(c) | E^P_B(b)] ≠ P[E^P_C(c) | E^P_{B̂^k}(b̂)].

Definition 3 formalizes the notion of a subset of features in which each feature provides information that does not exist in the other features of the subset. This notion of information considers that if we eliminate a feature from the subset, we will not obtain the same probability of obtaining a class. Under this definition of a non-redundant subset of features, we can say that the other features of the dataset are redundant, because they can be eliminated without losing information in the dataset. We formulate Definition 4 for redundant features.

Definition 4.
Maintaining the same terms of Definition 2, let Â = {Â_i} be the subsequence obtained by eliminating the features of a subsequence B from A, and denote Â = Â_1 × Â_2 × ... × Â_p. The subsequence B of features is redundant for P if, for every class c and all not-null congruent patterns a, â of A, Â, respectively, we have:

P[E^P_C(c) | E^P_A(a)] = P[E^P_C(c) | E^P_Â(â)].

Taking the dataset distribution P of Table 1, the subsequence formed by feature 3 is redundant for P, because eliminating feature 3 does not change the probability of any class.

Definition 5. Let P be a dataset distribution over A × C and let Q be a dataset distribution over Â × C with Â = A × B. We say that Q is an extension of P if (i) P and Q determine the same probabilities over the original features and the class, that is, Q[E^Q_S(s)] = P[E^P_S(s)] for every pattern s of S = Â_1 × ... × Â_{n+1}, and (ii) for every not-null pattern (a, b) of Â, Q[E^Q_B(b) | E^Q_A(a)] = 1.

Definition 5 formalizes the notion of feature construction. It consists of a new dataset distribution whose set of features contains the set of features of the original dataset with the same distribution, according to the first property. However, according to the second property, the new distribution also contains new features whose values are entirely determined by the shared features.
Following the dataset distribution P of Table 1, we denote by P′ the dataset distribution obtained from Table 1 by eliminating the feature A_3. Notice that P is an extension of P′, because (i) as P′ is P without one feature, they determine the same probabilities for the common features, and (ii) if we know the values of features 1 and 2, then we know the value of feature 3 with probability 1 for any not-null pattern.
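The following sketch checks Definitions 2 and 4 on the hypothetical stand-in table introduced above (again, the records are illustrative, not the actual Table 1): features 1 and 2 form a complete subsequence, so feature 3 is redundant, while feature 1 alone is not complete.

```python
# Hypothetical stand-in for Table 1, as in the previous sketch.
records = [(1, 1, 0, 0), (1, 2, 1, 0), (1, 2, 1, 1), (2, 1, 2, 1), (3, 1, 3, 0)]
P = {r: 1 / len(records) for r in records}

def event_prob(idxs, pattern):
    return sum(p for r, p in P.items()
               if all(r[i] == v for i, v in zip(idxs, pattern)))

def class_prob(idxs, pattern, c):
    """P[class = c | E^P_S(s)] for a not-null pattern s of the subsequence S."""
    joint = sum(p for r, p in P.items()
                if r[-1] == c and all(r[i] == v for i, v in zip(idxs, pattern)))
    return joint / event_prob(idxs, pattern)

def is_complete(idxs, classes=(0, 1)):
    """Definition 2: the subsequence yields the same class probabilities as all features."""
    all_idxs = tuple(range(len(records[0]) - 1))
    return all(abs(class_prob(all_idxs, r[:-1], c)
                   - class_prob(idxs, tuple(r[i] for i in idxs), c)) < 1e-12
               for r in records for c in classes)

# Features 1 and 2 are complete, so feature 3 is redundant (Definition 4);
# feature 1 alone loses information, so it is not complete.
print(is_complete((0, 1)), is_complete((0,)))   # True False
```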

Features: Selection vs. Construction
In this section, we use the mathematical notions defined above to compare feature selection with feature construction. In this setting, feature selection amounts to eliminating features, while feature construction amounts to incorporating new features.
Feature selection methods that do not involve the classifier in the selection are called filter methods. These methods apply some measure that seeks to obtain a subset of features containing the same amount of information as the original set but without any redundancy. The literature reports several such methods; however, they assume that a non-redundant set of features should be as small as possible. This condition can be described mathematically as obtaining a complete and non-redundant subsequence of features B for P that minimizes |B|.
Mathematically, we can define feature construction from a dataset distribution P as any extension of P. Feature construction consists of computing new features from the original features. If the result has more features than the original set, the method runs contrary to the minimization criterion used by filter-based feature selection.
One of the principles of feature selection by filter methods is that redundancy in features is detrimental. We refer to a redundant feature in the sense that all the information existing in the feature can be obtained from a subset of features that does not contain the feature itself. In that sense, the construction of features without a subsequent selection of features only produces redundant features. Formally, we are saying that if Q is an extension of P as constructed in Definition 5, then

Q[E^Q_C(c) | E^Q_Â(â)] = P[E^P_C(c) | E^P_A(a)]

for all c ∈ C and all not-null congruent patterns a, â of A, Â, respectively. In other words, although pattern â has extra features compared to a, that does not modify the probabilities of obtaining any class c; therefore, the extra features do not provide information.
The notion of feature construction introduced by Definition 5 does not add more information because the original features define the new features entirely. Therefore, we are interested in knowing what else can be provided by new features in case these features do not have more information than what already exists.
We analyze a simple example of a classification algorithm interacting with a constructed feature before presenting theorems with more general results. First, we consider the distribution of Table 2, where we assume that the original features are 1 and 2. For each pattern a ∈ A, feature 3 is defined as a_3 = (a_1)^2. Second, we consider a classifier based on the logistic model. If we denote L(x) = 1/(1 + e^{-x}) and the internal parameters or weights v_0, v_1, v_2 ∈ R, the logistic model applied to the original features of a pattern a ∈ A outputs 1 if L(v_0 + a_1 v_1 + a_2 v_2) > 1/2 and 0 otherwise. Denoting another parameter v_3 ∈ R, the logistic model applied to all features of a pattern a outputs 1 if L(v_0 + a_1 v_1 + a_2 v_2 + a_3 v_3) > 1/2 and 0 otherwise. Notice that in Figure 1, if we apply the logistic model to the original features, we obtain a linear classifier on the plane of features 1-2 that cannot give the correct class to all instances. Therefore, this first model has under-fitting problems. However, if we take the second logistic model with the constructed feature 3, we obtain a non-linear model over the plane of features 1-2, with the region between Att. 1 = 1.5 and Att. 1 = 2.5 assigned to class 0 and the rest of the plane assigned to class 1. This second logistic model is equivalent to a third logistic model applied to the original features of a pattern a, which outputs 1 if L(v_0 + a_1 v_1 + a_2 v_2 + (a_1)^2 v_3) > 1/2 and 0 otherwise. We say that both logistic models are equivalent because they partition the plane of features 1-2 exactly as Figure 1 shows. In both the second and third models there is an extra parameter v_3 that modulates the non-linearity in the plane of features 1-2. The second model is a linear model over the space produced by features 1-3 and behaves non-linearly in the plane of features 1-2 due to feature 3. Instead, the third model is an inherently non-linear model for features 1-2 whenever v_3 is distinct from 0. Therefore, the construction of features can increase the representation capacity of the model and solve under-fitting problems like the one we observed with the first model.
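As Table 2 and Figure 1 are not reproduced here, the following sketch uses illustrative data with the same structure (class 0 inside a band of feature 1) to show the effect described above: a logistic model on features 1 and 2 under-fits, while adding the constructed feature a_3 = (a_1)^2 yields the non-linear partition of the feature 1-2 plane.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: class 0 occupies the band 1.5 < a1 < 2.5, class 1 the rest of the plane.
rng = np.random.default_rng(0)
a1 = rng.uniform(0, 4, 300)
a2 = rng.uniform(0, 4, 300)
y = ((a1 < 1.5) | (a1 > 2.5)).astype(int)

X_orig = np.column_stack([a1, a2])            # the original features 1 and 2
X_ext = np.column_stack([a1, a2, a1 ** 2])    # plus the constructed feature a3 = a1^2

for name, X in [("features 1-2", X_orig), ("features 1-3", X_ext)]:
    model = LogisticRegression(C=1e3, max_iter=5000).fit(X, y)
    print(f"{name}: training accuracy {model.score(X, y):.2f}")
# The purely linear model under-fits the band; the model with a1^2 recovers it.
```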

A Theoretical Analysis of Feature Construction
In this section, we present results that generalize what was stated in Section 3. The following theorem refutes the idea that the fewer features we use without losing information, the better for the classification problem.

Theorem 1. Let P be a non-constant dataset distribution over A × {0, 1} whose set of features {A_i} is non-redundant in P. Take p as the total number of not-null patterns of P whose value under f_P is the minority class between zero and one. For all integer m in [p, n + 1] there are a set of m features {B_i} and an extension Q of P such that (i) Q is a distribution over Â × {0, 1} where Â = A × B, (ii) {B_i} is a non-redundant set of features in Q, and (iii) there is a linear classifier that computes f_P(a) from b whenever b̂ = (a, b) is a not-null pattern of Â.
Proof. Let N be the set of not-null patterns of A according to P. We take a partition {N_i} of N of size m, where each N_i contains a pattern with value one and a pattern with value zero according to f_P. We also take B_i = {0, 1} for all i. Then we construct Q as follows: for each a ∈ N we have (a, b) ∈ Â such that, if a ∈ N_k, then b_k = f_P(a) and b_i = 0 for all i ≠ k. As f_P(a) is fully determined by a, and b is fully determined by f_P(a), then b is fully determined by a and Q is an extension of P.
For the second property, for each k we take the pattern b of B with all terms zero and the pattern b̂ of B̂^k with all terms zero. Notice that b̂ is congruent to b, both are not-null patterns, and the corresponding events have non-zero probability. Comparing the probability of class one conditioned on E^Q_B(b) with the probability of class one conditioned on E^Q_{B̂^k}(b̂), and using that x/y < z/w implies x/y < (x + z)/(y + w) for positive real numbers x, y, z, w, we obtain that these two probabilities differ. Thus, eliminating the feature B_k changes the probability of a class, and {B_i} is non-redundant in Q. For the last property, we construct a logistic model that outputs one if L(∑_i b_i) > 1/2 and zero otherwise. Notice that this linear classifier computes f_P(a) for every not-null pattern b̂ = (a, b) of Q.
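A small sketch of the construction in the proof, using a hypothetical f_P (the parity of three binary features, which no linear classifier on the original features can compute) and an arbitrary partition {N_i}: the constructed features make the labels computable by the linear rule sum_i b_i > 0.

```python
from itertools import product

# Hypothetical f_P: the parity of three binary features (hard for a linear classifier on A).
A = list(product([0, 1], repeat=3))
f_P = {a: sum(a) % 2 for a in A}

# Partition N into m blocks, each containing patterns of both classes (here m = 2).
m = 2
ones = [a for a in A if f_P[a] == 1]
zeros = [a for a in A if f_P[a] == 0]
blocks = [ones[i::m] + zeros[i::m] for i in range(m)]

# Constructed features: b_k = f_P(a) if a lies in block k, and 0 otherwise.
def construct_b(a):
    return tuple(f_P[a] if a in blocks[k] else 0 for k in range(m))

# The linear rule "output 1 iff sum_i b_i > 0" computes f_P exactly.
assert all((sum(construct_b(a)) > 0) == (f_P[a] == 1) for a in A)
print("the linear rule on the constructed features reproduces f_P for every pattern")
```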
Notice that {B_i} can be much bigger than {A_i}. However, inferring the category labels from {A_i} can be as complex as we want, while, if we select the bigger set {B_i} instead, we obtain a problem that is solved by a linear classifier. Nevertheless, a feature selection method would choose {A_i} over {B_i} under the criterion of minimizing the number of features.
Although we refer to complexity, there is no single measure of complexity for classification problems [40]. However, we observe that classifiers that use a single variable, artificial neural networks of a single neuron, and the simplest SVM models are all linear classifiers. Additionally, such linear classifiers have a VC dimension of only two [41]. Therefore, for our purposes, we consider linearly separable sets as those with the least complexity.
Theorem 1 shows an extreme case where feature construction breaks a standard criterion of feature selection methods. However, the proof of the theorem does not provide a practical method for feature construction, because we can only build the features on the training set. We note that to construct the features {B_i}, we must know in advance the most probable class for each pattern in A; that is to say, we must first solve the classification problem using only the features {A_i}, which does not make sense. Therefore, we now study a standard method for constructing features.
The following definition generalizes the construction of features using monomials, which was used as an example in Section 3. The idea is that there is a feature equivalent to each monomial of degree less than or equal to k in the original features.

Definition 6. Maintaining the previous terms, we denote P_k as a k-monomial extension of P and A(k) as the product of features of P_k if (i) for each i there is a monomial function f : A → R of degree equal to or less than k such that â_i = f(â_1, â_2, ..., â_n) for each not-null pattern â ∈ A(k), and (ii) for each monomial function f : A → R of degree equal to or less than k there is some i such that â_i = f(â_1, â_2, ..., â_n) for each not-null pattern â ∈ A(k).
For example, suppose that the dataset distribution P has three features, and denote (a_1, a_2, a_3) as a pattern for those features. Then, a pattern from P_2 could be of the form (a_1, a_2, a_3, a_1^2, a_2^2, a_3^2, a_1 a_2, a_1 a_3, a_2 a_3), and a pattern from P_3 could be of the form (a_1, a_2, a_3, a_1^2, a_2^2, a_3^2, a_1 a_2, a_1 a_3, a_2 a_3, a_1^3, a_2^3, a_3^3, a_1^2 a_2, a_1^2 a_3, a_2^2 a_3, a_1 a_2^2, a_1 a_3^2, a_2 a_3^2). Notice that Definition 6 does not give an explicit order for the new features; Definition 5 only guarantees that the first n features of P_k are the original features of P. The features i of P_k with i > n are then functions of the first n features of P_k (which are also the features of P).
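In practice, a k-monomial extension can be generated with standard tooling; the sketch below assumes scikit-learn, whose PolynomialFeatures produces all monomials of degree at most k (including the original degree-1 features, in its own column order).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0, 5.0]])                # one pattern (a1, a2, a3)

poly2 = PolynomialFeatures(degree=2, include_bias=False)
X2 = poly2.fit_transform(X)                    # the 2-monomial extension of the pattern
print(poly2.get_feature_names_out(["a1", "a2", "a3"]))
# ['a1' 'a2' 'a3' 'a1^2' 'a1 a2' 'a1 a3' 'a2^2' 'a2 a3' 'a3^2']
print(X2)                                      # [[ 2.  3.  5.  4.  6. 10.  9. 15. 25.]]
```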
The following theorem describes how the feature construction method described in Definition 6 can reduce the complexity of the classification problem.

Theorem 2.
For every dataset distribution P over A × {0, 1}, there is some k such that some linear classifier computes f_{P_k} from the not-null patterns of P_k.
Proof. Let P be a dataset distribution over A × C whose features A_i each have more than one possible value, without loss of generality. We denote (i) the minimum absolute difference between values of the feature A_i as β_i, (ii) the difference between the maximum and minimum values of the feature A_i as δ_i, and (iii) the maximum of δ_i/β_i as D. Then, from a ∈ A we define the function g(a) = ∑_i a_i (3D)^{i−1}, which is a polynomial of degree 1 in the terms of a. Notice that g is an injective function if we take A as its domain. We denote by h the Lagrange polynomial such that h(g(a)) = f_P(a). Let k be the maximum degree of a monomial of h(g(a)) in the variables {a_i}. Then h(g(a)) = h′(â) for some polynomial h′ of degree 1 and â ∈ A(k). Although h′ is a regression model, it takes only the values zero or one on the patterns â ∈ A(k) and therefore can be taken as a linear classification model.
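A numerical sketch of the idea behind the proof, with a hypothetical two-feature distribution and assuming scipy is available: the injective encoding g collapses each pattern to a single number, and the Lagrange polynomial through the points (g(a), f_P(a)) reproduces the labels, so its monomials of degree at most k give the linear model on A(k).

```python
import numpy as np
from scipy.interpolate import lagrange

# Toy distribution: two features with values in {1, 2}, labeled by an XOR-like f_P.
A = np.array([[1, 1], [1, 2], [2, 1], [2, 2]], dtype=float)
f_P = np.array([0, 1, 1, 0], dtype=float)

# Here beta_i = 1 and delta_i = 1 for both features, so D = 1 and the base is 3D = 3.
D = 1.0
g = A @ (3 * D) ** np.arange(A.shape[1])   # g(a) = sum_i a_i (3D)^(i-1); injective on A

# The Lagrange polynomial through the points (g(a), f_P(a)) reproduces the labels exactly.
h = lagrange(g, f_P)
print(np.round(h(g)).astype(int))          # [0 1 1 0]
```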
We present an example with Table 3, whose first two columns together with the class correspond to a dataset distribution P for an XOR function, which is not linearly separable. However, the 2-monomial extension P_2 is a linearly separable dataset distribution.

Definition 7. A sequence of dataset distributions {P_i} is progressive if, for every i, P_{i+1} is an extension of P_i.

This definition seeks to formalize the notion of a feature construction method that is applied iteratively, producing an unbounded quantity of new features. For example, if we construct each k-monomial extension of P such that the features of P_k keep the same indices in P_{k+1}, then {P_i} is a progressive sequence of dataset distributions.

Finally, we present a desirable property for any feature construction method.

Definition 8. We say that a feature construction method is linearly asymptotic if, from every dataset distribution P over A × {0, 1}, the feature construction method produces a progressive sequence of dataset distributions {P_i} such that there are some k and a linear classifier that can compute f_{P_k} from P_k.

This property is equivalent to a feature construction method never getting stuck in patterns that are not linearly separable. Proving that a feature construction method is linearly asymptotic represents a formal validation of the method. For example, by Theorem 2, we conclude that the k-monomial construction method is linearly asymptotic.
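A sketch of the XOR example above (assuming binary 0/1 features, since Table 3 is not reproduced here): a perceptron cannot fit XOR on the original features, but it can on the 2-monomial extension.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR

X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
# columns: a1, a2, a1^2, a1*a2, a2^2 -> the cross term a1*a2 makes the classes separable

for name, data in [("original", X), ("2-monomial", X2)]:
    acc = Perceptron(max_iter=1000, tol=None).fit(data, y).score(data, y)
    print(f"{name:10s} training accuracy: {acc:.2f}")   # the 2-monomial extension reaches 1.00
```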
Note that this desired property is similar to the kernel trick exploited by SVM models, where the data are mapped to a higher-dimensional space so that a low-capacity classifier can separate the classes [42].
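For comparison, a sketch of the implicit route via a polynomial kernel against the explicit 2-monomial construction, again on the XOR toy data; both are illustrations of the analogy, not the paper's experiments.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Implicit mapping: a degree-2 polynomial kernel inside the SVM.
implicit = SVC(kernel="poly", degree=2, coef0=1, C=10).fit(X, y)

# Explicit mapping: construct the degree-2 monomials, then use a linear classifier.
X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
explicit = LinearSVC(C=10, max_iter=10000).fit(X2, y)

print(implicit.score(X, y), explicit.score(X2, y))   # both classify XOR correctly
```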

Experimental Results
In this section, we present the experimental results. We analyze the accuracy of classification algorithms applied to pre-processed real and artificial datasets and to their k-monomial extensions. The classification algorithms used are naive Bayes, logistic regression, KNN, PART, JRip, J48, and random forest. The classifiers mentioned were executed using the Waikato Environment for Knowledge Analysis (Weka) software [43].

Datasets from Real Classification Problems
The real data correspond to the Speaker Accent Recognition dataset [44], Algerian Forest Fires dataset [45], Banknote Authentication dataset [46], User Knowledge Modeling dataset [47], Glass Identification dataset [48], Wine Quality dataset [49], Somerville Happiness Survey dataset [50], Melanoma dataset, and Pima Indians Diabetes dataset [51]. As the experimental analysis is limited to binary classification problems, we took only the instances that belong to one of the two majority classes in the case of the Speaker Accent Recognition dataset, User Knowledge Modeling dataset, Glass Identification dataset, and Wine Quality dataset.
Before the analysis, we applied the k-monomial extension for k = 2 and 3 to the datasets, obtaining two new datasets per original dataset. Finally, we applied a normalization to all datasets and features A_i. Table A1 shows more details about the datasets and their k-monomial extensions.
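A sketch of this preprocessing step, assuming scikit-learn; the exact normalization used is not specified here, so min-max scaling to [0, 1] is assumed purely for illustration, and the input matrix is a stand-in for a real dataset.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler

def k_monomial_dataset(X, k):
    """Apply the k-monomial extension, then normalize every feature (min-max assumed here)."""
    X_ext = PolynomialFeatures(degree=k, include_bias=False).fit_transform(X)
    return MinMaxScaler().fit_transform(X_ext)

X = np.random.default_rng(1).uniform(0, 10, size=(150, 4))   # stand-in for a real dataset
print(k_monomial_dataset(X, 2).shape, k_monomial_dataset(X, 3).shape)   # (150, 14) (150, 34)
```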

Datasets from Artificial Classification Problems
The synthetic datasets are generated according to five rules that organize the datasets into five corresponding families. We first generate n features with r possible values for each dataset. The value of each feature in an instance is obtained by applying the ceiling function to a value x drawn with uniform distribution from the interval [0, r]. For each rule, four datasets are generated. The rules are as follows:
• The first rule assigns the category TRUE if the function Υ^r_n, defined over {1, 2, ..., r}^n, is greater than zero, and otherwise assigns the category FALSE.
• The second rule assigns the category TRUE if the function Φ^r_n, defined over {1, 2, ..., r}^n, is greater than zero, and otherwise assigns the category FALSE.
• The third rule assigns the category TRUE if the function Ψ^r_n, defined over {1, 2, ..., r}^n, is greater than zero, and otherwise assigns the category FALSE.
• The fourth rule assigns the category TRUE if the function Ω^r_n, defined over {1, 2, ..., r}^n, is greater than zero, and otherwise assigns the category FALSE.
• The fifth rule assigns the category TRUE if the function Γ^r_n, defined over {1, 2, ..., r}^n, is greater than zero, and otherwise assigns the category FALSE.
Before the analysis, we applied the k-monomial extension for k = 2, 3, 4, and 5 to the datasets, obtaining four new datasets per original dataset. Finally, we applied the normalization to all datasets and features. Table A6 shows more details about the datasets and their k-monomial extensions.
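A sketch of the generation procedure described above. The formulas of the family functions Υ, Φ, Ψ, Ω, and Γ are not reproduced here, so the rule below is a hypothetical placeholder that only illustrates the interface.

```python
import numpy as np

def generate_family(rule, n, r, size, seed=0):
    """Generate one synthetic dataset: n features with r possible values, labeled by `rule`.

    Each feature value is ceil(x) for x uniform in [0, r], as described above. `rule`
    stands in for one of the family functions; it maps a feature vector to a real number,
    and the class is TRUE exactly when that number is greater than zero.
    """
    rng = np.random.default_rng(seed)
    X = np.ceil(rng.uniform(0.0, r, size=(size, n))).astype(int)
    y = np.array([rule(row) > 0 for row in X])
    return X, y

# Hypothetical example rule, used only to show the interface.
example_rule = lambda a: np.sum(a) - 1.5 * len(a)
X, y = generate_family(example_rule, n=5, r=3, size=1000)
print(X.shape, y.mean())
```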

Analysis from the Real Datasets
In this subsection, we present the results corresponding to the real datasets. For the real datasets, we have graphics like Figure 2 for the Speaker Accent Recognition dataset, which show the true positives, true negatives, false positives, and false negatives of the classification algorithms on each dataset and its k-monomial extensions (Figures A1-A8, corresponding to the rest of the datasets, are in the appendix). The values are calculated using 10-fold cross-validation. For each algorithm, three adjacent bars are presented, showing the configuration of the confusion matrix. From left to right, the first bar corresponds to the original dataset, the second corresponds to the 2-monomial extension, and the last one corresponds to the 3-monomial extension. We represent the confusion matrix to show that the criteria for evaluating improvements in classification are adequate for these examples. We can see that, most of the time, there is little difference between the values of the original dataset and those of the k-monomial extensions. There are a few cases where the original dataset presents a significantly better accuracy, such as the naive Bayes classifier in Figure A1 and the J48 classifier in Figure A2. However, there are also some cases where a k-monomial extension presents an accuracy slightly higher than that of the original dataset.

Analysis from the Artificial Datasets
In this subsection, we present the results corresponding to the artificial datasets. For the synthetic datasets, we present results like Table 4 (for the first family of datasets), which shows the accuracy of the classification algorithms on each dataset and its k-monomial extensions (Tables A2-A5, corresponding to the rest of the families of datasets, are in Appendix A). The values are calculated using 10-fold cross-validation. Each dataset has a column indexed by "n-r", where n is the number of features and r is the cardinality of the features. For each dataset and algorithm, the original accuracy corresponds to the accuracy on the original dataset, the best accuracy corresponds to the highest accuracy among the k-monomial extensions, and the grade corresponds to the k for which the k-monomial extensions reach the highest accuracy. In all families of datasets, we can see that the k-monomial extensions tend to have better accuracy than the original datasets. There are cases where the original dataset has higher accuracy, but the difference never exceeds 5%. We can also observe that the 5-monomial extension is commonly the one with the greatest accuracy. Notice that the 5-monomial extension is the dataset with the largest subset of redundant features.
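The evaluation protocol can be outlined as follows; this is a sketch in scikit-learn rather than Weka, on a hypothetical dataset and labeling rule, reporting 10-fold cross-validated accuracy for the original features (k = 1) and two extensions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = np.ceil(rng.uniform(0, 3, size=(500, 4)))             # hypothetical artificial dataset
y = (X[:, 0] * X[:, 1] > X[:, 2] + X[:, 3]).astype(int)   # hypothetical labeling rule

classifiers = {"NaiveBayes": GaussianNB(), "J48-like tree": DecisionTreeClassifier()}
for k in (1, 2, 3):                                        # k = 1 is the original dataset
    Xk = PolynomialFeatures(degree=k, include_bias=False).fit_transform(X)
    Xk = MinMaxScaler().fit_transform(Xk)
    for name, clf in classifiers.items():
        acc = cross_val_score(clf, Xk, y, cv=10).mean()
        print(f"k={k} {name:14s} accuracy={acc:.3f}")
```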

Discussion
This is not the first work that relates features to data complexity. The quotient between the number of instances and the number of features (known as the T2 measure) has been studied as a measure of data complexity [40]. However, T2 is independent of the notion of complexity used in this work, since we can define linearly separable datasets in all ranges of T2. There are also applications of complexity measures to the feature selection problem, but they rely on a mainly experimental analysis [52][53][54][55].
The concept of a redundant set of features is based on the relevant feature definition of John et al. [56]. There are several other definitions for redundancy or redundant features. However, these definitions are more oriented to applications than a theoretical analysis of redundancy and its effects [57][58][59][60][61][62][63][64].
Our theoretical results show that many redundant features can reduce the complexity of the data. This result can be interpreted as a feature providing representativeness without providing extra information, as seen in the example of Section 3. It can also be interpreted as redundant features being capable of increasing the capacity of the model.
Our experimental results reinforce the evidence that redundancy itself is not necessarily detrimental. The real and synthetic datasets showed that extended datasets with many redundant features constructed as monomials could achieve higher accuracy. However, the improvement in accuracy was more pronounced in the synthetic datasets. The synthetic datasets had no noise and few dimensions, which are the main differences from the real datasets studied.
Usually, redundant features that are present before preprocessing entail a greater complexity for the algorithm that induces the classifier. The reason is that the classifier cannot find the optimal (global) rule, because the search space increases exponentially, and it therefore returns a local optimum. Due to this increased search space, as we add features the problem becomes more difficult and the resulting classifiers tend to perform worse. However, this happens because those initial features do not add enough expressiveness. Therefore, features obtained from suitable construction methods should not be treated in the same way as the initial features.
Finally, an increase or decrease in the number of features implies an increase or decrease in the number of parameters of the model, respectively. Therefore, the choice of features can induce overfitting or underfitting. However, these learning problems are not commonly studied in the development of feature selection methods. The criteria for selecting features should thus consider both the information provided by each feature and the representativeness provided by the features. Furthermore, in the same way that there are regularization methods to avoid overfitting caused by the internal parameters of the model, regularization methods could be developed against an excess of features.

Conclusions
The main finding of this work is that attributes that are redundant from an information viewpoint can indeed reduce under-fitting. Theoretical and experimental evidence is provided for this finding. However, these results are limited to binary classification problems with numerical attributes; continuations of this work can therefore extend it beyond these settings.

Appendix A

Table A1. Basic information about the real datasets. The column "Instances" denotes the number of entries in the dataset. The column "Original" denotes the number of features in the original dataset. The columns "2-Mon. Ext." and "3-Mon. Ext." denote the number of features in the 2-monomial extension and 3-monomial extension, respectively.

Table A6. Basic information about the artificial datasets. The column "Family" denotes the corresponding family function. The column "Indices" denotes the number of features and their cardinality.