The primary goal of data analysis in biomarker discovery studies is to identify features that can correctly classify samples into two or more groups, e.g., healthy vs. diseased or different disease states. Feature selection methods are applied not only to retrieve biologically meaningful biomarkers but also to reduce the number of features required to discriminate between sample groups [45]. This dimensionality reduction is an important step in the data analysis process because proteomics datasets typically suffer from the small-n-large-p problem: the number of features is far greater than the number of samples. Reducing the number of features lowers the risk of overfitting, thereby improving classification accuracy, lowering the computational costs and maximizing the chance of subsequent biomarker validation.

Feature reduction is typically performed in the data analysis step, but dimensionality reduction can already be accomplished during sample preparation/data acquisition or in the data pre-processing procedure. Alternatively, prior knowledge or pilot experiments can be used to define a list of putative biomarker candidates that can be studied in a targeted fashion. This removes the need to perform a holistic study that would only increase the number of non-relevant features and thereby increase dimensionality, which influences the prediction accuracy. As discussed in the pre-processing section, removal of features with missing values or applying a group count missing value approach during data pre-processing already lowers the number of features and reduces dimensionality prior to data analysis.

Feature selection methods reduce the number of features by eliminating features that present redundant information or selecting relevant features. The feature reduction methods can be divided by how they are coupled to the classification or learning algorithms, depicted in Figure 3 [46]. A filter method reduces the number of features independently of the classification model. Wrapper methods wrap the feature selection around the classification model and use the prediction accuracy of the model to iteratively select or eliminate a set of features. In embedded methods the feature selection process is an integral part of the classification model. Before detailed discussion of the different feature selection methods, a selection of the most common classification and learning algorithms will be reviewed.

#### 4.3. Parameter Selection

Both the classifier and the feature selection methods require parameter values to be selected, and these have a significant impact on the final outcome of the analysis. Examples are the number of principal components for PCA, the number of latent variables for PLS-DA, the kernel method for SVM, and the number of trees for RF. For every classification model these parameters can be optimized using a double cross-validation (2CV) procedure, depicted in Figure 4. As described for wrapper methods, the samples are first split into a training set and a test set to construct and evaluate the model based on prediction accuracy. This cross-validation is called the outer loop. In the double cross-validation scheme the samples in the training set are again split into a training and a validation set to select the optimal value of the parameter, which is called the inner loop. This double cross-validation ensures that there is no dependency between the samples used for parameter optimization and those used for prediction error calculation. Westerhuis et al. [43] provide detailed information and an example using PLS-DA on how to construct a double cross-validation procedure.
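The outer and inner loops described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of Westerhuis et al.: a toy k-nearest-neighbour classifier stands in for PLS-DA or SVM, the tuned parameter is the number of neighbours, and all function names (`k_folds`, `knn_accuracy`, `double_cross_validation`) and the simulated data are hypothetical.

```python
import random
from statistics import mean


def k_folds(indices, k, rng):
    """Shuffle the given sample indices and split them into k folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_accuracy(X, y, fit_idx, eval_idx, k):
    """Accuracy of a k-nearest-neighbour classifier (toy stand-in model)."""
    correct = 0
    for i in eval_idx:
        # Rank the fitting samples by squared Euclidean distance to sample i.
        nearest = sorted(
            fit_idx,
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )[:k]
        votes = [y[j] for j in nearest]
        predicted = max(sorted(set(votes)), key=votes.count)
        correct += predicted == y[i]
    return correct / len(eval_idx)


def double_cross_validation(X, y, param_grid, n_outer=5, n_inner=4, seed=0):
    """Return the outer-loop accuracy and the parameter chosen per outer fold."""
    rng = random.Random(seed)
    outer_folds = k_folds(range(len(y)), n_outer, rng)
    outer_scores, chosen_params = [], []
    for test_idx in outer_folds:
        # Outer loop: the test fold is set aside and never used for tuning.
        train_idx = [i for fold in outer_folds if fold is not test_idx for i in fold]
        inner_folds = k_folds(train_idx, n_inner, rng)

        def inner_score(param):
            # Inner loop: mean validation accuracy for one candidate value.
            return mean(
                knn_accuracy(
                    X, y,
                    [i for fold in inner_folds if fold is not val_idx for i in fold],
                    val_idx, param,
                )
                for val_idx in inner_folds
            )

        best = max(param_grid, key=inner_score)
        chosen_params.append(best)
        # The prediction error is computed on samples unseen during tuning.
        outer_scores.append(knn_accuracy(X, y, train_idx, test_idx, best))
    return mean(outer_scores), chosen_params


# Simulated two-group data: 20 samples per group, 2 features (an assumption).
toy_rng = random.Random(1)
X = [[toy_rng.gauss(c, 1.0), toy_rng.gauss(c, 1.0)] for c in (0.0, 4.0) for _ in range(20)]
y = [0] * 20 + [1] * 20
estimate, chosen = double_cross_validation(X, y, param_grid=[1, 3, 5])
print(f"outer-loop accuracy: {estimate:.2f}, chosen k per outer fold: {chosen}")
```

Because the parameter is re-selected inside every outer fold, the chosen value may differ between folds; the outer-loop accuracy therefore estimates the performance of the whole tuning procedure rather than of one fixed parameter value.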

In addition to parameter optimization for the classification algorithm, the wrapper methods require the selection of the number of features for the feature subsets. For the wrapper methods there is no rule of thumb for selecting the number of features; the available computational resources are the biggest determining factor. If the number of features in a subset is small, more combinations of feature subsets are possible, which increases the number of classification models that need to be built. Every classification model that is built requires computation time, so the smaller the feature subset, the larger the computational cost.

The filter and embedded methods require a parameter that specifies the cut-off value of the scores calculated for the features. The selection of the cut-off value depends on the algorithm for which the scores are calculated. Some methods have a common cut-off value, such as the 0.05 cut-off point for the univariate t-statistic and ANOVA methods. For other techniques the cut-off value depends on the constructed classification model and ranges around a preferred cut-off value, for example the value of 1 for the PLS-DA VIP score [62]. Not only the cut-off value but also the number and type of features that are selected are important, and they depend on the type of research and the result that is required. When the final set of selected features is to be validated by high-throughput follow-up experiments there is no need to be conservative. In such cases, it might be more important to avoid false negatives than false positives. On the other hand, when only a limited number of biomarker candidates can be validated in follow-up experiments, it is important to avoid false positives at the expense of false negative results.

#### 4.4. Evaluation and Validation

The performance of a feature selection method is evaluated and validated based on the prediction accuracy of the classifier, and on the statistical significance and stability of the selected features. Because univariate methods are not based on classification algorithms, their performance is determined differently from that of multivariate methods.

A univariate test is deemed significant if the calculated p-value is lower than the α-level, the significance level, which is often set to 0.05. However, using univariate methods for feature selection in proteomics data inherently leads to the so-called multiple testing problem [63]. For feature selection numerous univariate tests are performed in a single experiment, which increases the chance of finding false positives. The solution to the multiple testing problem is to adjust the α-level to maintain an acceptable false-discovery rate (FDR), the expected proportion of false positives among the tests that are declared significant. Two common methods for controlling the number of false positives when performing multiple tests are the Bonferroni correction and the Benjamini-Hochberg correction [64]. The Bonferroni correction [65] changes the α-level at which a test, and therefore a feature, is declared significant. If m tests are performed, the level at which a test/feature is presumed to be significant becomes α/m, i.e., 0.05/m for α = 0.05. This correction, however, is known to be conservative, especially in proteomics studies where the number of features is high. The α-level becomes so small that only a handful of features are deemed significant and the number of false negatives increases. A less conservative method is the Benjamini-Hochberg correction [66]. The p-values are ranked from low to high and each p-value is compared with its own threshold α · (i/m), with i representing the rank position. The largest rank at which the p-value does not exceed its threshold is located, and all tests/features up to and including that rank are declared significant. The choice of the preferred method depends on the FDR that is accepted in the biomarker validation process, as discussed in Section 4.3. The statistical significance of a p-value can additionally be determined by resampling techniques, which are discussed at the end of this section.
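The difference between the two corrections can be made concrete with a short sketch. The function names and the ten p-values below are hypothetical and chosen only for illustration; the Benjamini-Hochberg function follows the usual step-up formulation, in which every test up to the largest rank that satisfies its threshold is declared significant.

```python
def bonferroni(p_values, alpha=0.05):
    """Declare a test significant if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: find the largest rank i with p_(i) <= alpha * i / m."""
    m = len(p_values)
    # Indices of the tests, ordered from the lowest to the highest p-value.
    order = sorted(range(m), key=lambda j: p_values[j])
    cutoff_rank = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= alpha * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    # Every test ranked at or below the cutoff rank is declared significant.
    for j in order[:cutoff_rank]:
        significant[j] = True
    return significant


# Hypothetical p-values for m = 10 features.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni:        ", sum(bonferroni(p_values)), "significant")
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)), "significant")
```

For these values the Bonferroni threshold is 0.05/10 = 0.005, so only the first feature passes, whereas the Benjamini-Hochberg procedure also accepts the second feature because 0.008 lies below its rank-specific threshold of 0.05 · 2/10 = 0.01.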

Multivariate classification methods are evaluated with a performance measure and a corresponding significance value that are determined independently. The performance measures are based on how well the classification model is able to correctly classify a sample from the test set to its respective class. The sample can then be categorized as a true positive, true negative, false positive or false negative and stored in a confusion matrix. For binary classification the confusion matrix is illustrated in Table 1; for multiclass classification a confusion matrix is derived for every combination of classes.

The most common performance measures, listed in Table 2, can be derived from the confusion matrix. For multiclass cases the performance measures can be macro-averaged, where the overall performance measure is the average of the performance measure for every class combination, or micro-averaged, where the overall performance measure is calculated from an overall confusion matrix that is the sum of the confusion matrices for every class combination [67].
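As a small illustration of how such measures follow from the confusion matrix, and of how macro- and micro-averaging can differ, consider the sketch below. It assumes the standard textbook definitions of accuracy, sensitivity and specificity and takes NMC to be the number of misclassifications; the function names and the two confusion matrices are hypothetical. The AUC is left out because it requires classifier scores rather than a single confusion matrix.

```python
from statistics import mean


def measures(tp, fp, fn, tn):
    """Common performance measures from one binary confusion matrix."""
    return {
        "NMC": fp + fn,  # number of misclassifications
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),  # fraction of positives found
        "specificity": tn / (tn + fp),  # fraction of negatives found
    }


def macro_average(matrices, name):
    """Average the measure over the individual confusion matrices."""
    return mean(measures(*m)[name] for m in matrices)


def micro_average(matrices, name):
    """Compute the measure on the cell-wise sum of all confusion matrices."""
    summed = [sum(cells) for cells in zip(*matrices)]
    return measures(*summed)[name]


# Two hypothetical confusion matrices, each given as (TP, FP, FN, TN).
matrices = [(8, 2, 2, 8), (1, 0, 4, 15)]
print("macro sensitivity:", round(macro_average(matrices, "sensitivity"), 3))
print("micro sensitivity:", round(micro_average(matrices, "sensitivity"), 3))
```

The macro-average weights every class combination equally, while the micro-average weights every sample equally, so the two diverge when the class combinations differ in size or in how well they are separated.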

Every performance measure has a different focus: NMC focuses on misclassifications, accuracy on the overall effectiveness of the classifier, sensitivity and specificity on correctly classifying positives and negatives respectively, and the AUC on the ability to avoid false classification. These differences make it difficult to compare performance measures between different classification methods as different performance measures could advocate distinct methods. It is, therefore, advised to not only report the final performance measure but also document the confusion matrices to improve transparency of results.

To determine the significance of the performance measures, and therefore the stability of the classification model, resampling techniques can be used. Common resampling techniques are bootstrapping, jackknifing, and permutation tests, of which the latter is typically used. Permutation tests evaluate whether the performance measure is significantly better than that of any random classification [68]. First the class/group labels are randomly permuted over the samples. The feature selection procedure that was performed on the original data is then performed again on the permuted data with random class/group labels. This procedure is repeated multiple times, forming a distribution of performance measures for the random data, which are not expected to be significant: an H0 distribution. The performance measure is said to be significant if the value obtained for the original (not permuted) data lies outside the 95% or 99% confidence interval of the H0 distribution.
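The permutation scheme described above can be sketched as follows. This is an illustrative outline rather than a reference implementation: the leave-one-out nearest-centroid accuracy is a hypothetical stand-in for whichever feature selection and classification procedure is actually used, the data are simulated, and the empirical p-value is one common way of expressing where the original result falls relative to the H0 distribution.

```python
import random


def loo_nearest_centroid_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in score)."""
    correct = 0
    for i in range(len(y)):
        centroids = {}
        for label in sorted(set(y)):
            rows = [X[j] for j in range(len(y)) if j != i and y[j] == label]
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        predicted = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(y)


def permutation_test(X, y, score_fn, n_permutations=199, seed=0):
    """Compare the observed score with scores obtained on permuted labels."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    labels = list(y)
    null_scores = []
    for _ in range(n_permutations):
        rng.shuffle(labels)  # randomly permute the class labels over the samples
        null_scores.append(score_fn(X, labels))  # rerun the same procedure
    # Empirical p-value: share of permutations scoring at least as well.
    n_as_good = sum(score >= observed for score in null_scores)
    p_value = (n_as_good + 1) / (n_permutations + 1)
    return observed, null_scores, p_value


# Simulated two-group data: 10 samples per group, 2 features (an assumption).
toy_rng = random.Random(3)
X = [[toy_rng.gauss(c, 1.0), toy_rng.gauss(c, 1.0)] for c in (0.0, 4.0) for _ in range(10)]
y = [0] * 10 + [1] * 10
observed, null_scores, p_value = permutation_test(X, y, loo_nearest_centroid_accuracy)
print(f"observed accuracy: {observed:.2f}, empirical p-value: {p_value:.3f}")
```

The complete procedure, including any feature selection and parameter optimization, has to be repeated for every permutation; reusing features selected on the original labels would make the H0 distribution too optimistic.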

#### 4.5. Which Method to Choose?

Although most commonly used, the feature selection and classifier methods mentioned in the previous sections are only a few of the many algorithms available. Even though multiple studies have evaluated these feature selection methods, no single method outperforms all other methods in these studies [44,69,70]. The selection of the most suitable method is determined by the properties of the dataset, the computational resources, the type of biomarker that is searched for and the validation process available after feature selection.

The number of sample groups in the dataset already gives a preference to certain classifiers. When one disease group is compared to healthy controls the classification problem is called binary, for which all univariate and multivariate methods can be used. The number of applicable classifier algorithms, however, decreases when three or more groups are compared, typically referred to as multiclass classification. The basic PLS-DA and SVM algorithms do not support multiclass classification. Extensions have been proposed for these methods but require additional parameters that increase model complexity [49,71]. RF and ANN, on the other hand, are intrinsically capable of classifying both binary and multiclass problems.

The computation time needed to perform a feature selection procedure is an important decision factor that depends on the method of choice. Filter methods are fast and scalable, whereas wrapper methods have high computational costs. Additionally, the type of classifier and how the classifier is used influence the computation time. RF is a fast algorithm when applied exclusively for classification purposes but demands high computational power if used for feature importance calculations. In addition, the number of parameters that need to be optimized significantly increases computation time, because an increasing number of pilot calculations on subsets of the data need to be performed to determine the optimal parameter settings.

The choice of univariate or multivariate methods depends on the type of biomarker that is searched for. If the biomarkers of interest are single markers that by themselves can be used to classify samples from each group, univariate methods are the method of choice. Multivariate methods are preferred if the sample classification is expected to be defined by a set of biomarkers that are interrelated. If this is not known a priori, it is advised to apply both univariate and multivariate methods as they are able to extract complementary information [72].

The experimental validation stage for biomarker candidates that follows feature selection needs to be taken into account when deciding how to execute the preferred methods, in particular with respect to the false positive and false negative rates that can be tolerated, as discussed throughout this review.