Outlier Detection Based Feature Selection Exploiting Bio-Inspired Optimization Algorithms

: The curse of dimensionality problem occurs when the data are high-dimensional. It af-fects the learning process and reduces the accuracy. Feature selection is one of the dimensionality reduction approaches that mainly contribute to solving the curse of the dimensionality problem by selecting the relevant features. Irrelevant features are the dependent and redundant features that cause noise in the data and then reduce its quality. The main well-known feature-selection methods are wrapper and ﬁlter techniques. However, wrapper feature selection techniques are computationally expensive, whereas ﬁlter feature selection methods suffer from multicollinearity. In this research study, four new feature selection methods based on outlier detection using the Projection Pursuit method are proposed. Outlier detection involves identifying abnormal data (irrelevant features of the transpose matrix obtained from the original dataset matrix). The concept of outlier detection using projection pursuit has proved its efﬁciency in many applications but has not yet been used as a feature selection approach. To the author’s knowledge, this study is the ﬁrst of its kind. Experimental results on nineteen real datasets using three classiﬁers (k-NN, SVM, and Random Forest) indicated that the suggested methods enhanced the classiﬁcation accuracy rate by an average of 6.64% when compared to the classiﬁcation accuracy without applying feature selection. It also outperformed the state-of-the-art methods on most of the used datasets with an improvement rate ranging between 0.76% and 30.64%. Statistical analysis showed that the results of the proposed methods are statistically signiﬁcant.


Introduction
The data is continuously evolving, leading to a big challenge in several domains. It is generally high-dimensional, which causes the curse of the dimensionality problem. In this study, we investigated this problem with a high number of attributes (features). Reducing the space of features is performed through feature extraction or feature selection (FS). Feature extraction transforms the initial feature space into a novel low-dimensional feature space, for example, Kernel Dimensionality Reduction [1], Principal Component Analysis (PCA) [2], and Kernel PCA [3]. While FS does not produce new features but employs the original features to select the best ones, it consists of choosing a subgroup of significant features that mainly contribute to enhancing the output. FS is an essential preprocessing step in clustering and classification techniques to improve data quality. The effectiveness of FS method is one of the crucial aspects that affect the accuracy outcomes. The goal of the algorithm is to minimize the cardinality of the features set. Finding the most relevant features is not an easy task. For a dataset with a dimension d, there are 2 d − 1 potential subsets of features (excluding the empty subset) [4]. For even a reasonable value of d, it is of interest to choose a less computationally expensive method. FS is required in numerous domains and applications, for example, in bio-informatics [5,6], where the data can hold a large number of variables, most of wihich might be extremely well correlated with other variables. These variables are called dependent variables and do not deliver further information in the classification process, but they are considered as a noise for the classifier, which requires removing them to enhance the data quality. Another example in the biomedical domain wherein protein interaction discovery is vitally needed [7]. In biometric identification, FS is requested in facial recognition [8] to diminish the data dimensionality and then increase the classification accuracy. In addition, text mining suffers from the curse of dimensionality, and some FS methods were addressed in text categorization [9] and text clustering [10].
FS methods are classified into four categories, as displayed in Figure 1, called wrapper, filter, embedded, and outlier detection methods. Wrapper methods consist of evaluating features based on their predictive power using a classifier and usually a cross-validation technique. Several wrapper techniques have been proposed to solve the curse of dimensionality problem. They are based on a specific search method [11], such as the sequential selection [12], meta-heuristic (for example [13,14]), and local search [15]. The filter method is based on evaluating features using a criterion function that is relatively independent of classification such as distance [16], information [17], and dependency [18]. It could be based on information gain, Fisher score, variance threshold, chi-square test, correlation coefficient, and others [11]. Most (if not all) of the filter FS methods can be found in several statistical and ML software such that Weka, R, and SPSS. The embedded methods consist of applying FS in the learning process. The data are not divided into training and testing sets because they do not optimize an objective function or performance of a machine learning algorithm or a model like wrapper methods do. However, they focus on the use of a basic model constructing metrics during the learning. These methods are mainly based on the Least Absolute Shrinkage and Selection Operator (Lasso) technique combined with the Linear Regression method (for example, [19,20]). The other approaches can comprise outlier detection, which consist of detecting and removing abnormal features in order to reduce the dimensionality and enhance the quality.
The aforementioned FS methods have some limitations. The wrapper FS are efficient but time-consuming [21,22]. The filter FS techniques are efficient and not computationally expensive. However, they are affected by the multicollinearity [23,24]. The embedded FS methods are not well investigated compared to wrapper and filter FS methods due to their complex structure. The limitations of FS methods are deeply illustrated in Section 3.
To overcome these limitations, this article proposes novel FS techniques stimulated by outlier detection where the outliers cover a small part of the dataset.
Outlier detection involves searching for data objects that are different from standard data [25]. It is mainly required in numerous applications, including fraud detection, image processing, intrusion detection, public safety and security, and industry damage detection.
In this article, the irrelevant features represent the outliers, and the dataset is represented as a matrix. The matrix is then transposed so that the rows represent the features (the attributes) and the columns stand for the samples. Thus, the outliers will be found in the samples of the transposed dataset, which are the features set of the original dataset.
Outlier detection is performed using the Projection Pursuit (PP). The PP approach comprises looking for low dimensional linear projections that reveal potential outlying observations. Finding interesting projections is performed through the optimization of a specific function called a projection index (or PP index). The concept of this proposal is defined in Section 4. PP has successfully been applied to the search for outliers [26,27], which is performed by assuming data normality and that the data not following the model are outliers.
The present study focuses on developing FS based on outlier detection using PP called Projection Pursuit based Feature Selection (PPFS). To the author's knowledge, only one study used outlier detection for FS [28] but without PP. It involved feature ranking and isolation forest, which is different from this proposal. Therefore, this proposal is a new of its kind. The contribution of PPFS is described in the following section along with the research objectives.
To validate this model, nineteen real-word datasets obtained from repositories are used along with three classifiers, including the k-Nearest Neighbors, Support Vector Machine, and Random Forest. Four classification measures are utilized, namely the accuracy, precision, F1 measure, and recall to confirm the results and compare them with those obtained with the state-of-the-art techniques.
The article is structured as follows. Section 2 illustrates the contributions and the research objectives of this study. Section 3 surveys the existing works related to the four FS categories and addresses their limitations. Section 4 describes PP and the PP indices used in this study. Section 5 discusses the optimization methods applied to the PP indices. Section 6 presents the methodology followed in this study. Section 7 investigates the experimentation results, including the comparison study along with a deep discussion illustrating the results, the advantages and the limitations of this study, in addition to the research impacts. Section 8 concludes this work and presents some perspectives and future work.

Contributions and Research Objectives
The contribution is fourfold: Suggesting a new FS method based on PP, a well-known outlier detection technique.

2.
Providing a one-stage approach based on detecting and removing outliers that correspond to the irrelevant features.

3.
Discovering relevant features using optimization methods based on common bioinspired algorithms, algorithms (GA) and a variant of PSO (Tribes).

4.
Providing a software-based outlier-detection method to determine and remove the non-relevant features.
This study aims at achieving the following objectives. Suggest the use of EPP-Lab to select the non-relevant features.

5.
Investigate a statistical analysis to show that the four PPFS methods are equivalent and their results are statistically significant compared to the state-of-the-art methods. 6.
Perform a comparison study with the state-of-the-art methods to confirm the efficiency of the proposed model.

Review of Related Literature
This section describes the recent related work regarding the four FS categories. Four subsections were presented, Wrapper FS methods, Filter FS, Embedded FS, and outlier FS methods.

Wrapper Feature Selection Methods
In the following, the most recent wrapper FS are discussed. For the wrapper FS based meta-heuristics, the authors in [13] developed a new FS based on a Quantum Whale Optimization Algorithm. Four classifiers were used: Decision Tree (C4.5), Support Vector Machine with a linear kernel, k-Nearest Neighbors (k-NN), and Linear discriminant classifier with 13 datasets. The results revealed that the proposed method performed better than the state-of-the art methods, with an average accuracy rate achieving 88.57%. While in [14], the authors proposed a new FS using Whale optimization algorithm, the k-NN classifier was utilized to assess the effectiveness of the suggested method using different UCI datasets. A comparison study was addressed using GA, PSO, and Ant Lion optimizer wrapper FS methods along with five standard filter FS techniques. The results indicated that the mutation enhanced the results for most datasets.
In sequential search, the authors of [12] proposed sequential forward search (SRS) and sequential backward search (SBS) approaches combined with two optimization methods called AdaBoost and Bagging algorithms. The Decision Tree and the Naïve Bayes classifiers along with thirteen UCI datasets were used to assess the model using ten-fold cross validation. The results showed that the SFS combined with the Bagging technique achieved a higher accuracy of 89.60%.
For the Wrapper FS based local search, the authors of [15] proposed an FS method based on Greedy Search (IG) for Sentiment analysis. The Multi-nomial Naïve Bayes classifier was applied to four Amazon reviews and nine public datasets. The classification accuracy rate reached an average of 90.74% and 96.45% for each type of datasets, respectively.

Filter Feature Selection Methods
In the following, the most recent filter FS studies from the three categories (distance, information, and dependency) are presented.
The authors of [16] proposed a new filter FS technique based on the dependency concept using Discriminative Correlation Filter (DCF) for visual object tracking. Many benchmark datasets were used to measure the effectiveness of the proposed method, along with the Deep Neural Network. The comparison results confirmed the dominance of the proposed FS over the existing techniques. The study [18] used the concept of distance. The authors suggested a new filter FS using an artificial immune system. They minimized the intra-class distance and maximized the inter-class distance in the defined regions. Then, a comparison study was successfully performed with a global and local FS algorithms using UCI and synthetic datasets. The obtained results were encouraging. Another study mainly based on the information concept [17]. It consists of analyzing several filter FS methods. Twenty-two filter FS were presented such as oneR, Variance, permutation, Info gain, Gain ratio, and others. The authors compared their performance in terms of predictive accuracy and run time. They applied the k-NN, the Logistic Ridge Regression, and SVM classifiers to determine the accuracy of each filter FS method. They concluded that there is no best filter FS method for all the datasets.

Embedded Feature Selection Methods
In the following, the most recent embedded FS studies are presented. The authors in [19] suggested a new embedded FS method related to within-class and the least-squares loss function. This study aimed at recommending a treatment to avoid or decelerate the progress of dementia disease. An SVM classifier based on the one-versus-one class was used. In addition, ten-fold cross-validation was reiterated for each classification. The Medical Reasoning Image (MRI) dataset was utilized. The proposed embedded method reduced the number of features to 24 over 90 reaching a classification accuracy of 81.9%. In another study [20], the authors developed a new embedded FS method to transform the multi-label FS problem into several single-label ones considering the correlation among labels as an evaluation measure. The authors used five bioinformatics datasets. Three classifiers based on multi-label were utilized, including Multi-Label k-NN, Binary Relevance, and Classifier Chain. The proposed methods were compared to X2based maximal, Multi-label embedded FS, and filter multi-label FS. The proposed method surpassed the compared methods.

Outlier Feature Selection Methods
To the author's knowledge, only one research study has been found that conducted on reducing the dimensionality by eliminating the outliers. In other words, they performed the outliers' detection and elimination in the feature set [28]. The concept was based on feature ranking and isolation forest. The one-class SVM, the local outlier factor and IFOR were employed. In addition, several synthetic and real-world datasets were utilized to validate the efficiency of the proposed FS. A comparison study was introduced using several FS and outlier detection methods such as Laplacian, variance, and kurtosis. The results showed that the proposed FS was similar to kurtosis but better than Laplacian and the variance. Many other studies combined FS to outlier detection to strengthen the FS approach. For example, the authors in [29] suggested a novel FS method combined with an outlier detection technique and classification to identify the attacks. Other studies used FS methods to detect outliers. For example, the authors of [30] aimed to develop an outlier detection approach based on FS and association rules. In [31], the authors proposed a new FS method for unsupervised outlier detection. The authors used the inter and intra distance along with recursive backward elimination to extract the outliers.

Discussion
Many wrapper, filter, and embedded FS were proposed. However, several studies addressed their limitations: • The wrapper FS methods are time consuming [21,22] because the learning steps and the cross-validation are repeated many times. • The filter FS methods suffer from the multicollinearity; only one feature is randomly selected from a set of highly correlated features. This selection might negatively affect the result, for example, in gene expression [23,24]. • The embedded FS methods are not well investigated due to their complex structure. • The hybrid wrapper-filter FS techniques [12,32] are two-stage approaches and thus require more knowledge in addition to their time-consuming nature. • The only work presenting the outlier-based FS method [28] used feature ranking and isolation forest. It is a random-based approach that makes the attributes selection and the location of the best split for building the isolation forest unclear [33]. • The combination of wrapper FS with outlier detection [21,29] is a two-stage approach that works by reducing the features and then removing the outliers. This approach is time consuming due to the limitation of wrapper FS. In addition, the knowledge of wrapper FS and outlier detection are both required. • Some studies [30,31,34] performed the outlier detection using FS methods, which is the opposite of this proposed study.
To show the significance of this study, the difference between the PPFS methods and the related works is described in Table 1.

Projection Pursuit
PP belongs to the same family of the Principal Component Analysis (PCA) method. The central target of PP is to look for linear projections (combinations of variables) that reveal interesting structures in multidimensional and large data [35,36]. These projections are in one, two, or three dimensions. The interesting projections are found through the optimization of the projection index using a robust optimization algorithm capable of determining the optimum as well as the local optima that can also reveal hidden interesting projections [27]. The PP indices are categorized according to the search purpose, for example, detecting outliers or finding clusters. In this research study, the interesting projections are those that detect outliers. Huber [35] and Jones and Sibson [36] showed that the Gaussian distribution is the least interesting one, while the most interesting structures are those that bear a minimum resemblance to the Gaussian distribution. Sometimes the outliers can form a small cluster, PP can also identify unbalanced clusters, and it considers the smallest one as outliers.
Before going through the details of PP, it is necessary to introduce the principal idea of this study. The concept of this proposal is to detect the outliers in the feature set instead of the data objects. To do this, the dataset matrix must be transposed. Given the mathematical definition, let us denote by X an (P × N) dataset matrix, and X an (N × P) transposed dataset matrix, where the rows represent the features (N features) and the columns stand for the observations or the samples (P observations). It is worth knowing that the transposed matrix is used in PP (instead of the original dataset matrix) to find out the optimal projection that reveal the outliers (i.e., the irrelevant features). This research study considers a one-dimensional projection that can be defined as P-dimensional vector called a. The coordinates of the projected data are presented as an N-dimensional vector w such that where X is the transposed dataset matrix and a is the projection vector. Therefore, finding the projection vector a is just optimizing the PP index I(a). For more details about PP, one can refer to [27,37].
In this research work, two well-known PP indices dedicated to the search of outliers [27,38] are proposed, including the kurtosis index [39] and Friedman index [40]. The detection of the outliers is univariate.

Friedman Index
This index pertains to the polynomial based family indices. It measures the difference between the Gaussian distribution and the distribution of the projected data using expansions based on orthogonal polynomials introduced by Friedman [40]. The maximization of this index leads to finding the optimal projection. It is expressed as follows: where the recursive definition of Legendre polynomials L j is: . . The number of terms m is set to 3 according to the recommendations in [40,41]. φ be the cumulative distribution function of an N(0, 1) random variable.
X is the transposed dataset matrix and N is the number of samples. X i is a Pdimensional vector representing the sample i of the transposed dataset matrix X , i = 1, . . ., N and P is the number of features.

Kurtosis Index
This index is the fourth moment of the projected data proposed by Peña and Prieto [39]. The authors stressed that maximizing the kurtosis index of the projected data leads to identifying outliers. This index is based on the bimodality that split the data into two clusters. Maximizing this index means getting unbalanced clusters, the smallest one is considered as outliers. It is defined as: where X i is a P-dimensional vector representing the sample i of the transposed dataset matrix X , i = 1, . . ., N, N is the number of samples and P is the number of features. a is the projection vector.

Optimization Methods
To optimize the PP indices defined above, bio-inspired algorithms are used because of their success in finding different optimal values of the indices. GA and Tribes are investigated. These techniques are very suitable in the resolution of PP [27], and they are implemented in EPP-Lab [38]. The investigated methods are described below. For more details, one can refer to [38].

Genetic Algorithms
Genetic Algorithm (GA) is an evolutionary algorithm inspired by the natural evolution developed by Holland [42]. GA relies on three bio-inspired operators, including selection, crossover, and mutation. It involves a population of candidate solutions called individuals. An individual (or a solution) is a set of genes. In this study, it is characterized by an array of real numbers. More precisely, it represents the projection vector. A population is a set of individuals. The individuals are randomly initialized, searching for better solutions (close to the optimum) in the search space. At each iteration, every individual is assessed based on an objective function (PP index in this study). Then, the most suitable individuals are designated from the existing population using the tournament selection with three individuals. After that, the individuals are modified by applying the recombination and mutation operators to produce a novel population. The recombination consists of merging the information of two existing individuals (parents) to produce new individuals (offspring). In this study, two-point crossover is operated in the whole population with a probability equal to 0.65. It involves randomly selecting one part (of equal-size) from two parents to generate two offsprings. The two offsprings will have the same genes (real numbers) as the two parents, except for those of the selected part. Later, all the individuals are mutated with a probability equal to 0.05 by randomly selecting one gene and substituting it with a random real number. This population is involved in the following iteration of the algorithm. In this study, the algorithm terminates after reaching a maximum number of iterations. The algorithm that summarizes the GA mentioned above can be found in [27,38].

Particle Swarm Optimization
PSO is a population-based stochastic optimization technique, developed by Kennedy and Eberhart [43]. It is inspired by the behavior of a bird. In PSO, the potential solutions are called particles and form the swarm. The initial particles are randomly generated. They are assessed using the fitness function (or the objective function). The swarm direction is dominated by the movement of the best particle, i.e., the particle with the best performance. The position of the particle is calculated according to its best position achieved so far (called pbest) and the velocity. This calculation also includes the best position of the best particle of the swarm (called global best or gbest) or the best position of the neighbor particle (called the local best or lbest) if the neighborhood notion is applied. The notion of the neighborhood is related to the search purpose. Moreover, the movement of the particles depends on some parameters such as the maximum velocity (Vmax) or the inertia parameter. Both parameters are set to prevent a quick displacement of the particles and maintain the exploration of the search space. More details about PSO can be found in [38,44].

Tribes
Tribes is a PSO version suggested by Clerc [45]. It is a free parameter method and differs from the original PSO. A particle (or a solution) consists of an array of real numbers which represents the projection vector in this study. The swarm is a set of particles. Unlike PSO, the swarm is divided into subsets of particles called Tribes. The particles and Tribes are linked to ensure the recognition of both the best particle in each tribe and the best particle of the swarm. Contrary to PSO, this method does not require the computation of the velocity: the displacement of the particles in the problem domain uses specific movement strategies related to independent Gaussian or hyperspherical probability distributions with or without noise. It can been noted that this method is highly recommended for the search of local optima [45]. This is due to its structure, allowing simultaneous exploration of numerous potential regions around local optima before meeting the global optimum. The Tribes method is described in [44,45]. This method does not involve parameters. The fitness function is the PP index to be optimized. The algorithm of this technique can be found in [38,44].

Methodology
The present study is based on four main stages described as follows.

Outlier Detection
The detection of the outliers is univariate. First, let us represent the dataset as a matrix. The dataset matrix is transposed so that the rows represent the features (the attributes) and the columns are the samples. This is because the outliers can be found in the feature set of the original dataset (samples of the transposed dataset). Figure 2 shows an example of PPFS processing. After transposing the dataset matrix, the PP index is chosen (either Friedman or Kurtosis) as well as the optimization method (either GA or Tribes). Then, the number of iterations and the number of runs are set (defined in the next section). Later, the optimization is applied to find the best projection a (see Section 4) associated with the best value of the PP index I(a). The next step is to determine the outliers. For that, purpose the coordinate of the projected data are calculated according to Equation (1). In order to point out an observation (in the transposed dataset matrix) as an outlier, the rule based on k-sigma principle was applied. In fact, an observation is considered as an outlier when its distance to the mean is larger than 3 times the standard deviation of the projected data. Thus, the parameter k is set to 3 based on the recommendation of [37,38].
The optimization methods are performed several times (number of runs) to ensure that the obtained result is the best and cannot be changed. When selecting the best projection and determining the outliers, a threshold is set to confirm the outlying status of the recognized outliers. This threshold is set to half the number of runs. In fact, the proposed framework checks how many times these observations were detected as outliers throughout the runs. Thus, each outlier is associated with the frequency number showing how many times it was detected as outlier. In case of the previous example displayed in Figure 2, the last observation of the transposed matrix can be considered as an outlier because it is far from data points as displayed in Figure 3. By applying PPFS, suppose that the number of runs is set to 100, and the two samples number 8 and 9, from the previous transposed dataset matrix example (see Figure 2) are detected as outliers. Therefore, the proposed technique will detect these two outliers with the number of times they were recognized as outliers after performing 100 runs. The sample number 8 will not be detected as an outlier, or it will be detected but with a frequency number less than the threshold (=50), while sample 9 will be detected with a frequency greater than 50 as it is considered as outlier. This frequency number ensures the status (outlier or not) of each observation. In this study, an outlier is confirmed if it is detected more than the fixed threshold (half the number of runs). The list of confirmed outliers corresponds to the non-relevant features in the original dataset.
After the transposed dataset matrix is uploaded in EPP-Lab, the PP index is chosen (either Friedman or Kurtosis) as well as the optimization method (either GA or Tribes) from the EPP-Lab software. Then, the number of iterations and the number of runs are set (defined in the next section). Figure 4 shows the EPP-Lab interface used for this purpose.
After running, the best projection, the best values of the PP index obtained throughout the runs, and the projected data are provided. EPP-Lab allows the user to set the k parameter of the sigma rule. The software provides a list of detected outliers and the number of times (the frequency) they are detected as outliers. The functioning of EPP-Lab is described in [38].

Feature Selection
This stage involves removing the outliers obtained from the original dataset and keeping only the relevant features. Considering the previous example, feature 6 will be removed from the feature set of the original dataset, as shown in Figure 2. Therefore, the new dataset (after removing the irrelevant features) will be used in the classification phase.

Classification
This stage involves performing the classification of the dataset before and after removing the outliers (reducing the data dimentionality) using three classifiers, k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), and Random Forest (RF) with ten-fold cross-validation. The reason behind the choice of these classifiers is that k-NN is the simplest classifier that is recommended for small datasets [46], while SVM and RN are well known for their efficiency in tackling large datasets [46,47].

Validation
The last step consists of validating the classification model using four classification measures including the accuracy, precision, recall, and F1 measure. Figure 5 summarizes the four steps that define the proposed approach including transposing the dataset, detecting the outliers using EPP-Lab, FS, classification and evaluation.

Experimentation Results
The proposed outlier detection-based FS techniques (PPFS) were tested using a Dell XPS 9343 with an Intel core TM i7-5500U CPU 2.40 GHz and 8 GB RAM. In the following experimentation, 19 datasets (described in Table 2) were used to validate the effectiveness of this study. The first nine datasets are from the UCI repository (https://archive.ics. uci.edu/ml/datasets.php (accessed on 14 January 2021)). The next eight datasets are from (https://www.openml.org/search?q=australian&type=data (accessed on 14 January 2021)). Finally, August_senate_polls (August S.P) dataset is taken from https://data.world/ fivethirtyeight/august-senate-polls (accessed on 14 January 2021), and Fifa2019 is from https://www.kaggle.com/karangadiya/fifa19 (accessed on 14 January 2021). August S.P contains polls from 1990 to 2016, the results of the election related to each poll, and the average error of each poll. This dataset is composed of three classes. The Fifa2019 dataset is composed of more than 13,000 instances and 80 attributes. An extensive preprocessing operation was applied to this dataset. Then, two classes were created in the body_type attribute, namely Lean and Normal, using R programming language.
As EPP-Lab software is used for outlier detection, only the number of iterations and the number of runs are set. All the parameters related to GA (set in EPP-Lab [38]) were not changed. This is why no validation phase for setting the parameters of GA is needed. In fact, for GA, the population is randomly generated with 50 individuals. The tournament selection was used with three individuals. The crossover was employed with two points, and the probability equals 0.65. The mutation is applied to all individuals with a probability 0.05. As stated before, the Tribes method does not require parameters. For the initialization phase, the EPP Lab randomly generates an initial solution (population) for each method. The numbers of iterations and runs are set to 100 according to the recommendations of [38].
The number of runs guarantees that the obtained results (the removed outliers) cannot be improved. It also allows determining the number of times a feature is detected as an outlier. The threshold that confirms the outlying status is then set to 50 (see Section 6.1). For more details, one can refer to [38,44].  Table 3 shows the parameters' values. In the classification stage, three classifiers are applied: Support Vector Machine (SVM) with a Linear kernel, the k Nearest Neighbors (k-NN), and the Random Forest (RF). R (version 3.5.1) is used for the classification step. The datasets are split into training (70%) and testing (30%). The split was done randomly by R using the "createDataPartition" function from the caret library. To validate the parameter setting of the classifiers, ten-fold Cross-Validation, repeated ten times, is applied to the training set. The accuracy with the largest value is used to select the optimal training model. The optimal model with its best parameters' values is then used in the testing stage. The training and validation accuracy values are omitted to save space. The training set is centered and scaled.
For the k-NN classifier, several values of k were tested (between 5 and 200) for each dataset. For the RF classifier, the number of attributes used for splitting at each node in the tree was studied. The optimal training model was selected using the values (0.1, 0.8, 2, 6, 10). For the SVM classifier, the cost parameter was investigated, which is considered as the essential parameter for building the best useful model. After several experiments, the values (0.5, 1, 1.25, and 2) were used to select the optimal training model. Table 4 shows the obtained accuracy values for SVM using the four aforementioned values of the cost for the Breast Cancer dataset after applying PPFS-GAF. The final value used for the model was cost = 0.5. Note that this parameter setting approach was applied to the three classifiers before and after applying the four PPFS methods for all the datasets. Therefore, to save space, only Table 4 is shown. The four PPFS methods include PP based on Friedman index and the GA optimization method (called GAF), PP based on Friedman index and the Tribes optimization method (called TF), PP based on Kurtosis index and the GA optimization method (GAK), and PP based on Kurtosis index and the Tribes optimization method (called TK). After detecting the outliers using each PPFS method, they were removed, and then the three classifiers were applied to determine the accuracy, precision, recall, and F1 measures for all the datasets, as explained in the Methodology section. Prior to starting the experiments, the features were numbered, starting from 1, according to their appearance in the dataset. For example, if a dataset had four features, then the features were numbered from 1 to 4 starting with the first feature in the dataset until the last one. Table 5 displays the number of outliers detected by each approach (column 4) for each dataset. In addition, the features detected as outliers and the number of times (over 100 runs) they were recognized as such (frequency) are shown in column 3. The last column yields the running time of PPFS to find out the outliers.
Most of the outliers are commonly revealed by the four approaches as shown in Column 3. Moreover, their frequency values are greater than the threshold (half of the number of runs = 50). For example, for the Diabetes dataset, the features 1, 2, 5, and 7 were detected as outliers by both PPFS-GAF and PPFS-TF. However, the features 2 and 7 were detected by the four PPFS methods. In addition, these features were recognized as outliers more than 68 times over 100 runs. Hence, the four features can be considered non-relevant and were removed from the dataset to perform the classification. The positive point in this result is that the four PPFS methods provide almost similar outputs. This fact confirms the consistency of the proposed study. The small difference can be explained by the behavior of the projection indices as well as the effect of the optimization methods. The Kurtosis index split the data into two unbalanced clusters, and the smallest one is considered to comprise outliers. The Friedman index detects the outliers by weighting distances in the center of the Normal distribution rather than in the tails [40]. For the optimization methods, GA is well-known in finding the best solution (the optimum or close to the optimum), whereas the Tribes is dedicated to finding local optima as well as the optimum. Therefore, combining these two optimization methods along with both PP indices does not yield exactly the same result; however, the accuracy and the different measures (precision, recall, and F1) provided by each classifier for the four approaches are slightly different for the used datasets. For the experiment time, EPP-Lab is fast when combining Tribes and the Kurtosis index [39]. The shortest average time noticed with the smallest dataset (Wine, 178 samples) was 79 ms using TK and 162 ms using TF. With the largest dataset, Fifa2019, there were 13,000 samples and 80 features, while Colon had 62 samples and 2000 features. TK for Fifa2019 spent an average of 12 s 580 ms, whereas TF took an average of 44 s 143 ms. TK for Colon was 1 s 654 ms, and TF was 6 s 443 ms. Again, the running time does not show a major difference (see Table 5).      Table 6 presents the accuracy, precision, recall, and the F1 values of each classifier before and after applying PPFS methods (columns 3-6 for k-NN, columns 7-10 for SVM, and columns 11-14 for RF). The PPFS methods have globally improved the classification accuracy compared to the accuracy obtained without applying FS. The accuracy obtained using k-NN (resp. SVM and RF) indicates an improvement rate ranging between 0.08 and 6.79 for 9 datasets, while the accuracy for SVM and RF were 0.36-12.9 and 0.01-12.7, respectively, for 16 datasets each. This result is explained by the fact that k-NN is not always efficient with large and noisy datasets. The classification accuracy of the datasets that is not enhanced with k-NN achieved a significant improvement rate with the other classifiers. This is the case for Wine, Spam, August S.P, Diabetes, Australian, Glass, and Ionosphere datasets.  The PPFS methods also managed well for the Fifa2019 dataset even though the accuracy remains below 70%. This dataset is large and required massive preprocessing efforts. However, PPFS methods did not enhance the accuracy of the Waveform and Segment datasets using the three classifiers because Waveform contains 40 features, all of which include noise, especially the last 19, which are all noise, and Segment contains many instances of noise. PPFS did not remove all these noise features.
Moreover, considering the results of the three classifiers, the classification accuracy of some datasets was enhanced, but they remained below or around 70%. This is the case of Liver D., Vehicle (enhanced only once reaching 83.33%), Diabetes (enhanced only once reaching 81.30%), Glass, and Fifa2019. To improve the results, it recommended to preprocess the datasets and/or apply centering and sphering procedure [37] before peforming PPFS.
The four proposed PPFS methods provided approaching accuracy, precision, recall, and F1 values for each dataset. This confirms the consistency of PPFS again. Figure 6 shows the number of times each PPFS method enhances the classification accuracy (compared to that obtained without applying any FS method) using the three classifiers for all datasets. As displayed in this figure, GAF and GAK performed well at 36 and 37 times, respectively, using the three classifiers. In fact, GAF enhanced the accuracy 10 times with k-NN, 12 times with SVM, and 14 times with RF. TF and TK yielded similar results (34 times). On the other hand, the SVM classifier performed well with all the PPFS methods for all the datasets. It enhanced the results 58 times. The RF is a comparative classifier as it enhanced the accuracy results 48 times using the PPFS for all the datasets. The k-NN does not perform well with large datasets. It improved the accuracy 35 times using all the PPFS methods for all the datasets. Consequently, SVM can be considered as the classifier providing the best results.

What Is the Best PPFS Technique?
It is necessary to show that the results of the PPFS approaches are equivalent. To do this, the non-parametric statistical test "Friedman" was employed with a significance level of 0.01. This test is conducted with the best classification accuracy obtained by SVM for all the datasets after performing PPFS. The test does not adopt a normal distribution assumption [48].
It is worth knowing that the Friedman test [48] is different from the Friedman index used in PP [40]. The Friedman index is defined in Section 4, and Friedman test shows the existence of a significant statistical difference between the methods.
The Friedman test provides two main results, the p-value and the effect size. If the obtained p-value is less than the significance level (0.01), then the statistical hypothesis (that the PPFS methods are equivalent) is rejected. The effect size has a value ranging between 0 (corresponding to no relationship) and 1 (stating an ideal relationship). The Friedman test resulted in χ 2 (2) = 6.68 and p_value = 0.0204 > 0.01, with an effect size W = 0.172. According to the p_value, the hypothesis is not rejected. Moreover, the effect size is small. Consequently, there is no statistical difference between the PPFS methods. The obtained PPFS results are equivalent. Figure 7 displays the boxplot of the accuracy obtained by SVM after using the PPFS methods for all the datasets. The y-axis represents the minimum, first quartile, median, third quartile, and maximum value of the accuracy values. As indicated, the four PPFS have a similar median (the line inside the box) equal to 83.3 for GAF, TF, and TK, 83.5 for GAK. As the p_value = 0.0204 > 0.01, the dissimilarity between the median values are not statistically significant. This confirms that the PPFS techniques are equivalent and that their results are statistically significant. Furthermore, the mean values of the four PPFS do not show a significant difference. The lowest value was provided by GAF (82.0), whereas the highest mean was obtained by GAK (84.2). TF yielded 82.6 and TK 83.3.

Comparison Study
The comparison study was performed based on three wrapper and five filter-based FS methods. Unfortunately, the existing embedded FS methods did not use the available datasets or apply clustering techniques (instead of classification techniques). Furthermore, the related work applying outlier detection to select the relevant features did not use the available data, and their evaluation measures are different from those used in this article.
The first comparison study consisted of applying the Correlation (CFS), Principal Component Analysis (PCA), Gain ratio, the Mutual Information, and Gini index FS techniques. The reason behind this choice is that PCA belongs to the same family of the PP and the other FS are well-known filter FS methods. The implementation of the filter FS techniques was performed using both FSelector and FSinR R packages [11]. The Gain ratio, Mutual Information, and Gini index were applied using the Sequential Forward Selection (FSF) search method as it provided better results than Sequential Backward Selection (SBS) and Breadth First Search (BFS). In addition, the FSF search method was the fastest. Table 7 displays the classification accuracy obtained by SVM (as it is the best classifier in this study) after applying the four PPFS methods along with the five filter FS for all the datasets. As shown in this table, the four PPFS outperformed all the Filter FS for 15 datasets (over 19 datasets). The accuracy improvement rate ranges between 0.76% and 30.64%. However, PCA managed well in eliminating the noisy features of Waveform dataset. In addition, it enhanced the accuracy of Sonar dataset despite its correlated features. This is due to its process in projecting the feature space into a new low-dimensional feature space without being affected by the noise or high correlation. Furthermore, CFS also improved the accuracy of Sonar dataset. This is done thanks to the concept of removing the correlated features. Besides, Mutual Information and Gini Index outperformed the PPFS methods when tackling the Spam dataset. Moreover, CFS, Gain ratio, Mutual Information, and Gini index (resp. CFS) yielded similar results to the four PPFS methods when handling the Climate (resp. Heart) dataset. For the Colon dataset, the last three FS methods did not manage to find the relevant features. For the running time, the five filter FS are similar to PPFS in terms of speed, except when handling Spam, Colon, and Fifa2019. For these datasets, the filter FS took an average of 12 min and 10 s compared to PPFS, which took an average of 7 s.   To show the statistical significance of the obtained results, the same experiment discussed in Section 7.2 was performed. It aimed to show the existence of a significant statistical difference between the methods. As the PPFS methods are equivalent, only GAK was used. Table 8 displays the mean and median of the accuracy values obtained by SVM classifier after applying GAK and the five filter FS methods. These results are visually shown in Figure 8. As indicated, the median and the mean of GAK are greater than those of the four filter methods, but slightly different from those of PCA. Moreover, the Friedman test was applied with a confidence level of 0.01. The p_value achieved 0.00994 < 0.01, χ 2 (2) = 15.1, with an effect size W = 0.168 (small effect), which means that there is no relationship. Thus, the statistical hypothesis (the methods are not statistically different) is rejected since the p_value is less than the significant level and the effect size is small. This result indicates that the methods are statistically different. Now, to demonstrate the differences between each pair of methods, the Friedman test was followed by the pairwise Wilcoxon signed-rank test. Table 8 (see the first line) pointed out the p_value calculated between GAK and the filter methods. As displayed, GAK, CFS, and PCA are competitive methods. This is because the hypothesis is not rejected (GAK is statistically different from CFS and PCA) due to the p_value not being less than the significance level 0.01. However, GAK is better than Gain Ratio, Mutual Information, and Gini index FS methods, since their p-value is less than 0.01. Consequently, GAK outperforms Gain Ratio, Mutual Information, and Gini index, but it is competitive with PCA and CFS. This result is again confirmed with the medians and the means computed in Table 8.  The second comparison study involved comparing PPFS approaches with three wrapper-based FS methods published in [14], named Enhanced Whale Optimization Algorithm (WOA-CM), GA, and PSO using k-NN. Note that GA is also employed in GAF and GAK, and Tribes is one variant of PSO used in TF and TK. Table 9 and Figure 9 show the classification accuracy obtained by k-NN after applying PPFS techniques along with the three wrapper methods using eight UCI datasets. The PPFS approaches outperformed the wrapper FS methods on five datasets. However, only WOA outperformed the four PPFS methods with an improvement of 5.06% for Ionosphere, 13.12% for Colon, and 16.09% for the Sonar datasets. In addition, the three wrapper FS outperformed the four PPFS methods when handling the Colon dataset. The reason behind this success is that WOA is mainly based on the exploration and exploitation concepts leading to the global search. This aspect allowed WOA to find out the optimal solutions for the Waveform, Colon, and Sonar datasets in revealing the important features despite the noise and high correlation containing in these datasets.
Since the number of datasets used in this comparison study is small (eight datasets), the result of the statistical tests (Friedman or Wilcoxon) might be inaccurate [48]. Therefore, their results were omitted. However, the median and the mean of the accuracy values obtained by k-NN after applying GAF, WOA, PSO, and GA were calculated and displayed in Table 10. In addition, the Boxplot of these four methods was presented in Figure 10. As displayed, the mean and the median of the accuracy values obtained by GAF are higher than those obtained by PSO and GA, but somewhat similar to WOA. Therefore, it is worth mentioning that GAF yielded results better than or comparable to the aforementioned FS. Therefore, PPFS methods are competitive methods. Table 9. The classification accuracy provided by the k-NN classifier after applying the four PPFS, WOA, PSO, and GA feature selection.

Discussion
Many experiments were performed in this paper to fulfill the four reseach objectives. Firstly, the related works were deeply investigated to bring out their limitations. Consequently, the first objective was achieved.
To prove the efficacy of these four PPFS, nineteen datasets and three classifiers were utilized. The first experiment consists of showing the detected outliers for the nineteen datasets using the four FS methods. It was shown that the PPFS methods provide equivalent results in terms of the detecting outliers. In addition, the detection frequency of most of the outliers reached 80% using the four PPFS methods. This confirms that they are outliers. To show that the detected outliers correspond to non-relevant features, classification was performed on the datasets before and after applying the four PPFS methods. The results showed that the classification performance along with the precision, the recall, and the F1 measure were superior when applying PPFS. The k-NN achieved an improvement, reaching 6.79 for 11 datasets. The SVM and RN enhanced the results of 16 datasets, with an improvement reaching 12.91 and 12.4, respectively. Hence, SVM was selected as the best classifier for this study. Although the proposed PPFS did not improve the accuracy of the Waveform and Segment datasets, they enhanced the classification accuracy for most of the datasets. This allowed the second and third objectives to be realized.
On the other hand, the success of this study could not be realized without the utilization of EPP-Lab software. Therefore, EPP-Lab is efficient in detecting the non-important features. This software is open-source and can be used as a desktop application in any environment. This is the fourth objective of the present study.
One of the main experiments performed in this article was to show that the four presented PPFS methods provide equivalent results. This was carried out through the Friedman test. The test's result indicated that the p-value is greater than the significace level and that the effect size is small. Consequently, the hypothesis is not rejected, and hence the difference between the PPFS results is not statistically significant. This experiment realized the fifth objective.
To demonstrate that the four PPFS are competitive FS methods, two comparison studies were performed. The first comparison included applying five filter FS methods (CFS, PCA, Gain Ratio, Mutual Information, Gini Index) using all the datasets. The results showed that CFS was superior to the four PPFS methods when testing only the Sonar dataset. PCA was also superior to the four PPFS methods when testing both the Waveform and Sonar datasets. Again, Mutual Information and Gini Index outperformed the PPFS for the Spam dataset. As stated previously, these three datasets have noise and high correlation, which negatively affect PPFS in enhancing the classification accuracy. These limitations deserve further investigation. The classification accuracy results of PPFS-GAK and the five Filter FS methods were used to compute the Friedman and pairwise Wilcoxon tests. The statistical results show that GAK is superior to Mutual Information, Gain Ratio, and Gini Index but equivalent to PCA and CFS.
The second comparison involved comparing PPFS with three wrapper FS methods, called WOA, PSO, and GA, using eight datasets. The four PPFS methods performed better than the three wrapper FS using five datasets. However, WOA surpassed all the FS methods using Sonar, Waveform, and Colon datasets. Again, the PPFS methods failed in enhancing the accuracy of these three datasets. Like the first comparison study, the statistical analysis validated the significant results of the PPFS methods. Finally, the last objective was achieved by demonstrating the efficiency of the PPFS methods. To sum up, the aforementioned experiments successfully carried out the research objectives of this study.
In the following, the research impacts, the advantages, and the limitations of PPFS methods are discussed.

Research Impacts
The research impacts of PPFS methods are important and deserve to be addressed: In spite of the previous advantages, PPFS have two limitations.
• PPFS failed to find the non-significant features when most of the features are highly correlated or include noise. • PPFS methods do not work well when the dataset is binary (the values of features' vectors are either 0 or 1)

Conclusions
The present study proposed four FS methods based on outliers detection using the Project Pursuit. Two PP indices along with two AI optimization methods were used to find the optimal projections that reveal the outliers. The proposed PPFS methods were validated using three classifiers and 19 datasets. They yielded promising results.Moreover, these methods were successfully compared with three Wrapper FS methods on eight datasets. They also yielded the best or equivalent accuracy against five Filter FS techniques on 15 datasets (among 19 datasets). Statistical analysis was performed and showed that PPFS are equivalent, and their results are statistically significant. It is also demonstrated that PPFS are comparative with PCA, CFS, and WOA. Therefore, our assumption to consider the outlier detection methods for revealing the irrelevant features from a dataset to increase the classification accuracy was validated. For the computation time, PPFS was faster than the five Filter FS methods when handling large datasets (Colon, Spam, and Fifa2019).
However, no computation time was reported for the three wrapper FS methods. Furthermore, PPFS yielded promising results in handling a large dataset reaching 13,000 instances (Fifa2019 dataset) and 2000 features (Colon dataset). The only limitation of this study is not getting the best result when dealing with the Waveform and Segment dataset. Hence, some future research directions are suggested: • Make the dataset spherical before seeking the outliers [27]

Data Availability Statement:
The links to datasets used in this study are provided in Section 7.

Acknowledgments:
The author would like to acknowledge the support of Prince Sultan University for paying the Article Processing Charges (APC) of this publication. This work was also supported by Artificial Intelligence and Data Analytics Lab (AIDA), Prince Sultan University, Riyadh, Saudi Arabia.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
All the abbreviations used in this paper are defined below.