The Role of Mutual Information Estimator Choice in Feature Selection: An Empirical Study on mRMR

Papaioannou, Nikolaos; Myllis, Georgios; Tsimpiris, Alkiviadis; Vrana, Vasiliki

doi:10.3390/info16090724

Open AccessArticle

The Role of Mutual Information Estimator Choice in Feature Selection: An Empirical Study on mRMR

¹

Department of Computer Informatics and Telecommunications Engineering, International Hellenic University, 624 24 Serres, Greece

²

Department of Business Administration, International Hellenic University, 624 24 Serres, Greece

^*

Authors to whom correspondence should be addressed.

Information 2025, 16(9), 724; https://doi.org/10.3390/info16090724

Submission received: 10 July 2025 / Revised: 11 August 2025 / Accepted: 22 August 2025 / Published: 25 August 2025

(This article belongs to the Section Information and Communications Technology)

Download

Browse Figures

Versions Notes

Abstract

Maximum Relevance Minimum Redundancy (mRMR) is a widely used feature selection method that is applied in a wide range of applications in various fields. mRMR adds to the optimal subset the features that have high relevance to the target variable while having minimum redundancy with each other. Mutual information is a key component of mRMR as it measures the degree of dependence between two variables. However, the real value of mutual information is not known and needs to be estimated. The aim of this study is to examine whether the choice of mutual information estimator affects the performance of mRMR. To this end, three variations of mRMR are compared. The first one uses Parzen window estimation to assess mutual information between continuous variables. The second is based on equidistant partitioning using the cells method, while the third incorporates a bias-corrected version of the same estimator. All methods are tested with and without a regularization term in the mRMR denominator, introduced to improve numerical stability. The evaluation is conducted on synthetic datasets where the target variable is defined as a combination of continuous features, simulating both linear and nonlinear dependencies. To demonstrate the applicability of the proposed methods, we also include a case study in real-world classification tasks. The study carried out showed that the choice of mutual information estimator can affect the performance of mRMR and it must be carefully selected depending on the dataset and the parameters of the examined problem. The application of the corrected mutual information estimator improves the performance of mRMR in the examined setup.

Keywords:

feature selection; mutual information; mRMR; bias correction; discretization; machine learning; classification; relevance; redundancy

1. Introduction

Nowadays, there has been an explosive increase in the amount of data. The widespread use of digital technologies and the internet, the growth of social media, mobile devices, and the Internet of Things (IoT) have all contributed to the explosion of the available data [1]. The availability of these data has made possible the development and the improvement of machine learning algorithms [2]. Large datasets are necessary for machine learning algorithms, which can learn from the available data and consequently train effectively [3]. In addition, with big data, deep learning models, which are a subset of machine learning models, are becoming capable of recognizing and learning patterns in images, speech, text and other types of unstructured data [4,5,6,7]. Therefore, it becomes easier to create and develop reliable algorithms for handling diverse and complex problems and datasets [8,9,10].

Nevertheless, on several occasions, having an excessive amount of data might be detrimental and it could reduce the efficiency and therefore the reliability of the algorithms [11]. Specifically, this may occur when the dataset contains noise or irrelevant and redundant features. This can result in overfitting and decrease the model’s performance in terms of accuracy and generalization [12]. Additionally, it could also lead to computational challenges and scalability issues, especially when the analysis concerns deep learning algorithms, which demand significant computational resources to execute [13]. In order to avoid such issues, various techniques have been developed and used. One such technique is feature selection [14].

Feature selection plays an especially critical role in domains such as natural language processing (NLP), where datasets often involve high-dimensional, sparse representations like term frequency-inverse document frequency (TF-IDF), n-grams or word embeddings [15]. In such settings, a large proportion of the features may be irrelevant or redundant, contributing little to the prediction task and increasing the risk of overfitting [16]. Filter-based feature selection methods, such as mRMR [17], are well suited for these contexts because they evaluate features based on intrinsic statistical properties rather than relying on specific classifiers [18]. This allows for dimensionality reduction before model training, improving computational efficiency and enabling better generalization, especially when only a limited number of labeled instances are available [19,20].

Compared to wrapper and embedded methods, which can be computationally expensive or tightly coupled with specific models, mRMR offers a lightweight and model-agnostic solution [21]. While modern deep learning models often include built-in mechanisms for feature weighting or selection, such approaches require large amounts of data and resources and may still suffer from interpretability challenges [22]. Furthermore, post hoc explainability techniques such as SHAP or LRP are useful for model interpretation, but they cannot replace the role of robust feature filtering before training [23]. The mRMR method remains valuable in this preprocessing stage [24], particularly when enhanced with effective estimators of mutual information that better capture complex, potentially nonlinear dependencies between variables [25,26].

The process of feature selection is frequently employed as a preparatory stage or in conjunction with machine learning [27]. The aim of feature selection algorithms is to construct a subset of features from the initial one, by selecting the most relevant features to the class variable and by removing irrelevant and redundant features. This procedure may produce a substantially smaller dataset that only includes the most relevant features [28]. Feature selection methods are classified into three main categories. Various methods have been proposed and used in several research for both wrapper methods [29,30,31,32], filter methods [33,34], and embedded methods [35,36,37]. In this paper, a filter method for feature selection known as mRMR will be studied [17].

The minimum Redundancy Maximum Relevance (mRMR) feature selection method aims to create a subset that contains features that have maximum relevance to the target variable, while having minimum redundancy with each other [17]. Until a predefined number of features is reached, mRMR iteratively adds features to the optimal subset of features, selecting the features with the highest dependency with the class variable and the lowest dependency between the feature and the previously selected features. In order to quantify the dependency between the variables and consequently relevance and redundancy, various statistical measures can be used. A widespread statistical measure that is also an important component of mRMR is mutual information. Mutual information measures the amount of information that one feature contains about another specific feature and it can be applied to quantify relevance and redundancy [38]. This measure has the advantage of being able to detect both linear and nonlinear dependencies, contrary to other statistical measures such as Pearson’s correlation.

Nevertheless, despite the advantages of mutual information, its reliable estimation is considered challenging for real-world datasets, as most of them involve continuous variables. In such cases, the joint probability distribution is not directly known and must be approximated from finite samples, which often introduces bias or instability in the estimates. To confront with this problem and to achieve a reliable assessment of mutual information, in many research various methods have been proposed, some non-parametric ones for density estimation such as k-nearest neighbors [39] and B-splines [40], but also discretization techniques such as equidistant binning, equiprobable binning and adaptive partitioning [41], that generally give worse assessment of mutual information compared to the non-parametric ones [42,43].

Generally, numerous recent studies have focused on improving feature selection techniques, addressing different aspects of the process. One line of work has proposed approaches that enhance mutual information–based feature selection by incorporating unique relevance criteria [44]. Other contributions have introduced conditional mutual information strategies for dynamic feature selection [45], as well as ensemble evaluation functions that combine linear measures, such as the Pearson correlation coefficient, with nonlinear measures like normalized mutual information, in order to capture complex feature–class relationships while maintaining low redundancy [46]. Further research has emphasized the benchmarking and evaluation of mutual information estimators in high-dimensional and complex settings [47]. Additionally, methods have been proposed to address the numerical instability of k-nearest-neighbor-based normalized mutual information estimators in high-dimensional spaces [48]. Feature reduction has also been investigated in the context of IoT intrusion detection, where efficient selection methods are essential for real-time attack classification [49].

Although these approaches highlight significant advances in the broader field of feature selection, it remains the case that the choice of mutual information estimator, despite being a central component of many methods, often receives comparatively less attention. As a result, distinguishing the impact of the estimator from that of the selection strategy itself continues to be a challenge in the literature.

In fact, numerous well-known and widely applied mutual information estimators exist, yet in many studies the emphasis is placed primarily on the choice of the feature selection method itself, while the choice of the mutual information estimator, despite being universally acknowledged as a critical component, often receives comparatively less attention [50,51,52,53,54]. It is not uncommon to find comparisons between feature selection methods that rely on different mutual information estimators, which makes it difficult to disentangle the effect of the selection strategy from that of the estimator [33].

The goal of the present work is not to propose the most advanced or bias-free estimator, since more sophisticated and reliable methods already exist, but rather to demonstrate how even a very small correction to a widely used discretization-based estimator, under otherwise identical conditions, can lead to noticeably different outcomes. This correction is derived directly from the same estimator itself, by subtracting the average mutual information obtained from surrogate samples from the original estimate, which ensures that the estimator is not fundamentally altered but is adjusted in a simple and controlled way. By doing so, we aim to draw further attention within the research community to the importance of carefully considering the choice of mutual information estimator, in addition to the choice of the feature selection method.

To accomplish this, we evaluate the impact of different mutual information estimators on the performance of the mRMR feature selection method. To this end, three variations of mRMR are compared. The first uses the Parzen window estimation of mutual information, the second is based on equidistant binning using the cells method, and the third incorporates a bias-corrected version of the same discretization-based estimator. A regularization term is optionally added to the mRMR denominator to enhance numerical stability. The comparison is performed through an extensive simulation study involving synthetic datasets with both linear and nonlinear dependencies. In addition, a case study using real-world datasets is included to assess the applicability of the methods in practice. The rest of this paper is organized as follows. In Section 2, the methodology and the materials used are presented. In Section 3, a simulation study is carried out in different systems and the results are presented. Additionally, in Section 4, a case study using real-world datasets is included to evaluate the practical applicability of the proposed methods. In Section 5 the results are discussed, and in Section 6 the conclusions are drawn.

2. Materials and Methods

In this section, the tools and methods that were used are described. Initially, some theoretical background on mutual information and surrogates is described. Next, the mutual information estimators and the mRMR feature selection method are presented.

2.1. Mutual Information

Mutual information is an entropy-based measure that detects linear and nonlinear dependencies. It measures the amount of information that two random variables share and, specifically, how much uncertainty about a variable is reduced when the other variable is known. In terms of Shannon entropies, it is calculated as follows:

I (X; Y) = H (X) + H (Y) - H (X, Y)

where

H (X)

and

H (Y)

are the marginal entropies of the random variables X and Y, respectively, and

H (X, Y)

is their joint one.

Mutual information is a measure of the interdependence between two random variables X and Y. It is a bidirectional relationship, meaning that it measures how much information X provides about Y, and vice versa. For discrete variables it can be measured as follows:

I (X; Y) = \sum_{y \in Y} \sum_{x \in X} p_{X, Y} (x, y) l o g (\frac{p_{X, Y} (x, y)}{p_{X} (x) p_{Y} (y)})

where

p_{X, Y} (x, y)

is the joint probability mass function of the variables X, Y and

p_{X} (x)

,

p_{Y} (y)

are the probability mass functions of the variables X, Y, respectively. The domain of mutual information is defined as

0 ⩽ I (X; Y) ⩽ m i n [H (X), H (Y)]

2.2. Surrogates

Shifted Surrogates

The time-shifted technique for creating surrogate samples, is a method for eradicating the relationship between two time series while preserving their original properties. Let us consider two random variables X, Y and their corresponding time series

{x_{1}, \dots, x_{n}}

and

{y_{1}, \dots, y_{n}}

. In this method a random integer d is chosen and then the values of one of the time series, e.g.,

{x_{1}, \dots, x_{n}}

are shifted by d time steps. In particular, the original time series is replaced with the time series

x_{t} = {x_{d + 1}, \dots, x_{n}, x_{1}, \dots, x_{d}}

, which is created by shifting the first d values to the end. Although the time-shifted surrogate method is typically used in time series analysis to destroy temporal correlations, in this study, we draw inspiration from this approach and apply a similar idea to multivariate data, aiming to disrupt dependencies between feature variables. Specifically, we perform a shift on individual feature vectors in order to disrupt their dependency with other variables, while preserving their marginal distributions. This approach is used to generate surrogate datasets that allow for the estimation of the bias introduced by the mutual information estimator. By comparing the mutual information computed on the original data to that obtained from the surrogates, we obtain a reference value that reflects the mutual information expected when no statistical dependency is present.

2.3. Mutual Information Estimators

In this subsection, the three mutual information estimators that were employed in this study are presented. Each estimator follows a different strategy for estimating mutual information between variables. Specifically, the estimators include a binning-based approach, a bias-corrected version of it and a kernel-based estimator that utilizes Parzen windows. Their structure and computational process are described in the following.

2.3.1. Original Estimator of Mutual Information

As aforementioned, the mutual information estimator that has been chosen in this study (from now on original estimator), uses an equidistant binning technique for discretization and specifically the cells method. This estimator is applied to continuous variables, which are discretized beforehand in order to allow the computation of mutual information using frequency-based probability estimates. The process for estimating mutual information using the original estimator is presented below in algorithmic form.

Consider two continuous variables of sample size n, denoted as $x_{1} = {x_{1, i}}_{i = 1}^{n}$ and $x_{2} = {x_{2, i}}_{i = 1}^{n}$ .
Set the number of bins as $m = round (\sqrt{n / 5})$ , which defines the resolution of the discretization and the size of the cell grids $C_{1, m}$ and $C_{2, m}$ .
Normalize each variable to the interval $[0, 1]$ using min–max scaling:

$x_{k, i}^{norm} = \frac{x_{k, i} - min (x_{k})}{max (x_{k}) - min (x_{k})} for k = 1, 2$
For each $i = 1, \dots, n$ , compute:

$c_{k, i} = x_{k, i}^{norm} \cdot m and update the corresponding bin : C_{k, ⌊ c_{k, i} ⌋} = C_{k, ⌊ c_{k, i} ⌋} + 1$

where $⌊ c_{k, i} ⌋$ is the integer part of $c_{k, i}$ .
Based on the bin counts $C_{1, m}$ and $C_{2, m}$ , estimate the joint and marginal probability distributions for $x_{1}$ and $x_{2}$ .
Finally, compute mutual information using the standard definition:

$I (x_{1}; x_{2}) = \sum_{j} \sum_{i} p_{x_{1}, x_{2}} (i, j) log (\frac{p_{x_{1}, x_{2}} (i, j)}{p_{x_{1}} (i) \cdot p_{x_{2}} (j)})$

2.3.2. The Proposed Correction

The second mutual information estimator employed in this study is a correction of the original one, based on the methodology proposed by [55]. This correction addresses the bias that is often introduced in mutual information estimation when discretization is applied to continuous variables.

Generally, the corrected mutual information estimator does not differ significantly to the original one. The idea is to divide the data into progressively smaller segments, compute mutual information for each and use linear regression to extrapolate the mutual information corresponding to a larger target sample size. Additionally, surrogate data are used to remove spurious effects. Specifically, for each input variable, mutual information is computed for different subdivisions of the data and then linearly regressed against the logarithm of the corresponding sample sizes. This process is applied both to the original data and to randomized surrogates.

The corrected mutual information is obtained as the difference between the extrapolated mutual information of the real data and the average extrapolated mutual information of the surrogates. The algorithmic procedure is summarized as follows:

Let $x = {x_{i}}_{i = 1}^{n}$ and $y = {y_{i}}_{i = 1}^{n}$ be two variables with n samples.
Choose a number of subdivisions d. For each $r = 1, \dots, d$ , set $p = 2^{r - 1}$ .
For each p, divide both x and y into p equal-length parts. Estimate the mutual information in each part and compute the average:

$I_{\frac{n}{p}} (x; y) = \frac{1}{p} \sum_{j = 1}^{p} I (x^{[j]}; y^{[j]})$

where $x^{[j]}$ and $y^{[j]}$ denote the j-th segment.
Fit a linear model $I = α ln (n) + β$ based on the values $I_{\frac{n}{p}}$ and $ln (n / p)$ to estimate the mutual information for a larger target sample size $n_{target}$ . In this study, the original sample size was $n = 1024$ and the target size was set to $n_{target} = 2048$ .
Repeat the above steps using M surrogate versions of x (via circular time shifting) and compute the average surrogate mutual information ${\hat{I}}_{surrogate}$ .
The final corrected mutual information is:

$I_{corrected} = {\hat{I}}_{original} - {\hat{I}}_{surrogate}$

For clarity, the bias correction procedure is also illustrated in a flowchart in Figure 1.

The initial idea for the proposed correction arose from the observation that the estimated mutual information values obtained with the original discretization-based estimator tend to vary with the length of the dataset [55]. In particular, for sufficiently large sample sizes the estimates appear to converge to a stable value, yet still show a systematic bias when compared with the theoretical expectation. It was further noted that subtracting the mutual information computed on surrogate datasets from the original estimates consistently reduced this bias, yielding values closer to the theoretical reference. This motivated the design of the correction procedure, which combines extrapolation based on subdivisions with surrogate adjustment, while keeping the underlying estimator essentially unchanged.

Following the rationale proposed in [55], the number of subdivisions d was chosen according to the size of the examined dataset. Specifically, when the sample size was

n = 1024

, we set

d = 5

, which yielded partitions of 64, 128, 256, 512 and 1024 samples, with the regression extrapolated to 2048. For

n = 512

, we used

d = 4

, resulting in partitions of 64, 128, 256 and 512, again with extrapolation to 2048. Similarly, for

n = 256

,

d = 4

was selected, giving partitions of 32, 64, 128 and 256, with regression to 1024. For larger sample sizes such as those equal to or greater than 2048, we observed that convergence of the estimates had already occurred, so extrapolation was no longer necessary and only the surrogate correction (subtraction step) was applied.

2.3.3. Parzen-Window-Based Estimation for Continuous Variables

This estimator corresponds to the mutual information computation method used for continuous variables in the original mRMR feature selection framework, as introduced by [17]. It is based on the Parzen window (kernel density estimation, KDE) technique and is particularly suitable when at least one of the variables involved is continuous. In the original work, this estimator was adopted as a practical alternative to discretization, especially in cases involving continuous or mixed-type variables. The method avoids discretization by directly estimating the probability density functions using Gaussian kernels centered at the observed data points.

Given two real-valued random variables X and Y with n observations

{(x_{1}, y_{1}), \dots, (x_{n}, y_{n})}

, the procedure is as follows:

Estimate the marginal densities $\hat{p} (x_{i})$ and $\hat{p} (y_{i})$ using Parzen window estimation:

$\hat{p} (x_{i}) = \frac{1}{n} \sum_{j = 1}^{n} φ (x_{i} - x_{j}; h), \hat{p} (y_{i}) = \frac{1}{n} \sum_{j = 1}^{n} φ (y_{i} - y_{j}; h)$

where $φ (\cdot; h)$ denotes the Gaussian kernel with bandwidth h.
Estimate the joint density $\hat{p} (x_{i}, y_{i})$ using bivariate kernel density estimation with full covariance matrix $Ω$ :

$\hat{p} (x_{i}, y_{i}) = \frac{1}{n} \sum_{j = 1}^{n} φ ([\begin{matrix} x_{i} \\ y_{i} \end{matrix}] - [\begin{matrix} x_{j} \\ y_{j} \end{matrix}]; h, Ω)$
Approximate the mutual information as:

$I (X; Y) \approx \frac{1}{n} \sum_{i = 1}^{n} log (\frac{\hat{p} (x_{i}, y_{i})}{\hat{p} (x_{i}) \hat{p} (y_{i})})$
For discrete target variables, the conditional entropy $H (X ∣ Y)$ is estimated separately for each class. The mutual information is then computed using the following identity:

$I (X; Y) = H (X) - H (X ∣ Y)$

2.4. mRMR

mRMR is a filter method for feature selection that uses mutual information to measure the dependencies between the features and the class variable, but also between the features themselves in order to find the features that are most relevant to the class variable and less redundant with each other, ultimately to create an optimal subset of features, which will eventually include as much information as possible about the initial dataset by including only a few features of it. The optimal feature subset S is filled in progressively, starting with just one feature that has the maximum mutual information with the class variable C

{max}_{f_{i} \in F} I (f_{i}, C)

. Until a predefined number of features is reached, mRMR iteratively adds features to the optimal subset. Particularly, in each search cycle, mRMR adds to the optimal subset the feature that satisfies the following equation:

max_{f_{i} \in F - S} [\frac{I (f_{i}; C)}{\frac{1}{| S |} \sum_{f_{j} \in S} I (f_{i}; f_{j})}]

(1)

It has to be noted that in this study, instead of using the classical mRMR as seen in Equation (1), we use a variant of it, as presented in Equation (2). The only difference between the two, is that we added one unit to the denominator, because during the study, we observed that for features that had minimal correlation or theoretically zero correlation, the estimation of mutual information through the proposed correction gave values slightly smaller than zero. This is partly logical as the estimation of mutual information through the proposed correction involves subtraction, as we previously saw. In addition to this, other variations were examined, such as setting the negative values to zero, but this often led the fraction to tend to infinity. Another idea was to convert the denominator to an absolute intersection, but then the property of redundancy was completely lost, as we could equate a pair of features that had correlation with one that had none. However, because it gave a negative value, its absolute value made it appear as if there was a dependency. We believe that the chosen method affects the whole process of mRMR less, because although it makes mRMR more sensitive to relevance, the redundancy still contributes to the final outcome, albeit to a lesser extent. We decided to add 1 to the denominator instead of, e.g., 0,

0.05

,

0.1

,

0.2

,

0.3

,

0.5

,

0.7

and

0.8

, since after various attempts in different systems, it appeared to perform better compared to the other values tested (see Figure 2 and Figure 3 for details). For the original estimator, we also used Equation (2) although it was not necessary, so that the comparison could be made on equal terms, as the primary goal was to evaluate how the choice of mutual information estimator affects the reliability and efficiency of mRMR and in order to examine this, all the other parameters have to be the same.

max_{f_{i} \in F - S} [\frac{I (f_{i}; C)}{1 + \frac{1}{| S |} \sum_{f_{j} \in S} I (f_{i}; f_{j})}]

(2)

3. Simulation Study

In this section, a simulation study is conducted to evaluate the performance of the mRMR feature selection algorithm. Simulated datasets are constructed such that the features relevant to the class variable are known in advance. This allows for a direct assessment of the algorithm’s ability to identify informative features. The systems have been selected progressively from very simple ones to more complex, in order to examine the performance of mRMR in different scenarios. In the following examples, the variable y gives the classes after discretization and the features

f_{i}

are random normal variables, some of which are predictors of the class variable y, while some of them are functionally related to the predictors. Beyond the features that are related to the class variable either directly or indirectly, we also added random variables, until each dataset consists of a total of thirty features.

For discretization, the number of classes k is chosen to satisfy the relation

k = round (\sqrt{n / 5})

, where n is the number of samples. This setting follows the same rationale as in the study where the corrected mutual information estimator was originally introduced. Each system was simulated by generating 20 datasets of 1024 samples. However, for the corrected mutual information estimator, which requires extrapolation to a larger sample size, the estimation was performed by projecting to a length of 2048, following the procedure described in Section 2.3.2. The results reported in Section 3.2 correspond to the average performance across these 20 datasets.

To assess the effect of the mutual information estimator and the mRMR configuration, five variations of the mRMR algorithm were compared:

Original-mRMR (s = 0): Uses the original mutual information estimator and applies Equation (1), without modification to the denominator.
Original-mRMR (s = 1): Same estimator, but with one unit added to the denominator (Equation (2)).
Corr-mRMR (s = 1): Uses the proposed correction of mutual information with one unit added to the denominator.
Parzen-mRMR (s = 0): Uses the Parzen (KDE-based) estimator without adjustment in the denominator.
Parzen-mRMR (s = 1): Uses the Parzen estimator with one unit added to the denominator.

In addition to the original and corrected mutual information estimators, we also included the original Parzen-based estimator, as proposed by [17], in order to gain a more complete picture of the mRMR framework’s performance under different estimation strategies. This setup allows us to systematically compare how the choice of mutual information estimator, as well as the presence or absence of a stabilizing constant in the denominator, affects the reliability and effectiveness of mRMR across diverse systems.

3.1. Systems

$y = 0.5 f_{1} + 0.5 f_{2} + e$ ,
$y = 0.5 f_{1} + 0.5 f_{2} + e$ , with additional features correlated to $f_{1}$ and $f_{2}$ ,
$y = 0.5 f_{1} + 0.5 f_{2} + 0.5 f_{3} + e$ , where $f_{3} = f_{1} \cdot f_{2}$ ,
$y = 0.5 f_{1} + 0.5 f_{2} + 0.5 f_{3} + e$ , where $f_{3} = f_{1} \cdot f_{2} \cdot z_{3}$ ,
$y = f_{1} + f_{2} + 0.2 f_{3} + 0.3 f_{4} + e$ , where $f_{3} = 0.2 f_{1} + 0.3 f_{2} + 2 z_{3}$ and $f_{4} = 0.1 f_{1}^{2} + 0.1 f_{2}^{2}$ ,
$y = \sum_{i = 1}^{6} β_{i} f_{i} + e$ , where $f_{1} = z_{1}$ , $f_{2} = z_{2}$ , $f_{3} = z_{1} \cdot z_{2}$ , $f_{4} = z_{3}$ , $f_{5} = z_{4}^{2}$ , and $f_{6} = z_{1} \cdot z_{5} \cdot z_{6}$ .

System 1 is the simplest of all, as there are only two functionally related features to the class variable y, the features

f_{1}

and

f_{2}

, which are moderately correlated with the class variable (both coefficients = 0.5). e is a standard normal random variable. As aforementioned, for all cases the dataset consists of a total of 30 features, of which the remaining 28 are irrelevant features generated from a standard normal distribution.

System 2 is identical to System 1 in terms of the functionally related features to the class variable. In addition to these two features, there are two groups of two features each. The features in the first group are strongly correlated with

f_{1}

(

c o e f f i c i e n t = 0.8

), while the features in the second group are weakly correlated with

f_{2}

(

c o e f f i c i e n t = 0.4

). Similar to System 1, the remaining 24 features are irrelevant features generated from a standard normal distribution.

In System 3 there are three functionally related features to the class variable y, particularly the features

f_{1}

,

f_{2}

and

f_{3}

. In this case, feature 3 is correlated with the first two selected features. The remaining 27 features are irrelevant. Here, the optimal subset should contain any two of the three functionally related features. In mRMR, the number of features that the optimal subset should contain before the algorithm stops is arbitrarily defined. In this case, we set the algorithm to stop when the optimal subset contains 3 features, in order to examine whether the feature selection algorithm selects the features,

f_{1}

,

f_{2}

and

f_{3}

, or prefers to add in an irrelevant feature to the optimal subset, when it has already selected two of the functionally related features.

System 4 also has three functionally related features to the class variable, specifically the features,

f_{1}

,

f_{2}

and

f_{3}

. However, unlike in System 3, feature

f_{3}

in System 4 contains information about another random variable,

z_{3}

, in addition to being correlated with

f_{1}

and

f_{2}

. The remaining 27 features are irrelevant random variables. In this case, the optimal subset should contain all three functionally related features and the objective is to investigate whether the feature selection algorithm chooses them or another irrelevant feature, given that

f_{3}

contains information about the predictors of y, namely

f_{1}

and

f_{2}

.

System 5 has four functionally related features to the class variable, namely

f_{1}

,

f_{2}

,

f_{3}

and

f_{4}

. In this case the functionally related features do not equally affect the class variable, since their coefficients differ. Specifically,

f_{1}

and

f_{2}

are strongly correlated with the class variable y (coefficients = 1), while

f_{3}

and

f_{4}

are weakly correlated, since their coefficients are equal to 0.2 and 0.3, respectively. In this case, the number of features contained in the optimal subset is set as four, because we consider it more appropriate for the feature selection algorithm to select a feature that contains the same information as other selected predictors e.g.,

f_{4}

, rather than select an irrelevant feature (standard random variable).

In System 6, there is a group of six predictors of the class variable y, namely

f_{1}

,…,

f_{6}

, that are functionally related to the random independent standard variables

z_{i}

,

i = 1, \dots, 6

. The

β_{i}

are set as

\frac{1}{σ_{i}}

, i = 1,…,6, where

σ_{i}

is the standard deviation of

f_{i}

, so that all features contribute the same to y. The number of the features that contained in the optimal subset was set as six, for the same reason aforementioned in the description of System 5. In addition to these six features, there are also two groups of six features each. The features of one of these groups are strongly correlated to the corresponding features of the first group (

c o e f f i c i e n t = 0.8

), while the six features of the other group are weakly correlated to the corresponding features of the first group (

c o e f f i c i e n t = 0.4

). The remaining 12 features are irrelevant.

To provide a concise overview of the experimental setup, we summarize the main characteristics of all six systems in Table 1. This table highlights the functional relationships, the relevant and redundant features, as well as the number of irrelevant features in each case, allowing the reader to more easily interpret and compare the different scenarios.

3.2. Results

In this section, the results are presented. We set the feature selection algorithm to stop when the optimal subset contains the same number of features with the functionally related features, as described in Section 3.1. If the optimal subset contains all of the features that are functionally related to the class variable, then it is considered as correct (from now on success rate). The results refer to the average accuracy obtained by the 20 generated datasets of each system.

First, the Figure 2 and Figure 3 for Systems 5 and 6, respectively, are presented. These figures depict the success rate of mRMR using two different mutual information estimators across various values of s, where s is the number added to the denominator of mRMR to prevent negative values. The algorithm that uses the original estimator is referred to as mRMR, while the one that uses the proposed mutual information correction is referred to as Corr-mRMR.

Figure 2 shows that both of the examined feature selection methods achieve the highest success rate when the number 1 is added to the denominator of mRMR. Corr-mRMR achieves a success rate of above 80% when s is equal to 0.3 or higher, while simple mRMR only does so when s is equal to 1. Additionally, it is observed that for this system, the success rate of the Corr-mRMR method is always equal to or better than that of simple mRMR, regardless of the choice of s.

Figure 3 differs slightly from Figure 2. When the values of s are equal to 0, 0.05 and 0.1, the Original-mRMR method appears to be more efficient than the corrected one. However, when the value of s is equal to 0.2, the success rate of Corr-mRMR increases rapidly to 65%, while the corresponding success rate of the Original-mRMR is only 40%. The same trend seems to apply for values of s greater than 0.2, where Corr-mRMR performs better. For the selected value of s, which in this study is

s = 1

, both of the examined methods appear to have a success rate smaller than the maximum achievable. Nevertheless, the differences between these success rates and the optimal ones are small compared to other systems, where choosing

s = 1

was found to be the optimal choice.

The results presented in Table 2 indicate that in general, mRMR performs better when the corrected mutual information estimator is used.

In System 1, both of the predictors are moderately correlated to the class variable y (coefficients = 0.5). For this system, all the variations of mRMR performed equally well without making a mistake, since all included in the optimal subset the features

f_{1}

and

f_{2}

, which are the only features that are functionally related to the class variable, while the remaining 28features are irrelevant.

In System 2, while both Parzen variations and the corrected mRMR method achieve 100% success, the original estimator gives only 80% for

s = 0

and 90% for

s = 1

. On the other hand, the corrected and KDE-based estimators demonstrate better resilience to redundancy. The performance of Corr-mRMR was slightly better, since its success rate is 100%, while the Original-mRMR (

s = 1

), using the original estimator of mutual information, achieved a success rate of 90%. Systems 1 and 2 are similar, except that beyond the two functionally related features to the class variable, System 2 has four additional features that are correlated with the predictors of y, while in System 1, the remaining 28 features are irrelevant. It is observed that Corr-mRMR does not confront difficulties in identifying the actual relevant features, while Original-mRMR (

s = 0

) and Original-mRMR (

s = 1

) make a few mistakes.

System 3 introduces nonlinearity through the feature

f_{3} = f_{1} \cdot f_{2}

, creating a higher-order dependency. In this case, the Parzen-based methods again achieve perfect performance (100%), while both the Original-mRMR (

s = 0

) and Corr-mRMR (

s = 1

) fail to fully recover the true subset. On the contrary, the Original-mRMR (

s = 0

) method without correction completely fails (0%). It is worth noting, however, that the 0% success rate of the Original-mRMR with

s = 0

reflects only the strict definition of success—namely, recovering all three relevant features. If we were to restrict the optimal subset to only the two strongest predictors (

f_{1}

and

f_{2}

), its performance would be nearly equivalent to that of the

s = 1

variant. The failure occurs in the final selection step: instead of selecting the third relevant (though redundant) feature

f_{3}

, Original-mRMR (

s = 0

) tends to include a completely unrelated feature, ignoring the partial redundancy advantage of

f_{3}

over irrelevant variables. The corrected mRMR (Corr-mRMR), which uses the adjusted mutual information estimator and

s = 1

, performs similarly to the Original-mRMR (

s = 1

) in this example, also achieving 95%.

System 4 is similar to System 3, except that in System 3, the three features were functions of only two independent predictors, contrary to the System 4, where

f_{1}

,

f_{2}

and

f_{3}

are functions of three independent predictors, since

f_{3} = f_{1} * f_{2} * z_{3}

. In this system the optimal subset should contain all three features (

f_{1}

,

f_{2}

and

f_{3}

). The increased complexity affects the classical estimators more severely. The Original-mRMR with

s = 0

fails completely (0%) as it strongly penalizes redundancy and fails to recognize the added informational contribution of

f_{3}

. Increasing the denominator to

s = 1

improves the performance to 80%, indicating a more balanced treatment between relevance and redundancy. However, some misranking still occurs. The corrected version (Corr-mRMR), which incorporates the proposed mutual information adjustment and also uses

s = 1

, achieves a success rate of 95%. This performance comes very close to the Parzen-based methods (both at 100%), highlighting that the correction significantly improves reliability in more complex scenarios where redundancy and relevance coexist.

In System 5 there are four functionally related features to the class variable but only three independent predictors. The Parzen-mRMR methods still succeed in identifying all relevant features. In contrast, the Original-mRMR fails completely for

s = 0

(0%) and performs better with

s = 1

(85%), while the corrected estimator improves further to 95%. The difference lies in the selection of

f_{4}

, since for all the generated datasets, both of them selected correctly the features

f_{1}

,

f_{2}

and

f_{3}

, but the simple mRMR in a couple of times added an irrelevant feature to the optimal subset, instead of adding

f_{4}

, which, however, is a function of the features

f_{1}

and

f_{2}

. Contrary to System 4, for this system the complete failure of mRMR for

s = 0

stems from the addition of a completely irrelevant feature instead of the weakly informative

f_{4}

, which is a nonlinear function of

f_{1}

and

f_{2}

. If the selected subset had been limited to three features—namely

f_{1}

,

f_{2}

and

f_{3}

—the performance of Original-mRMR with

s = 0

would have been comparable to that with

s = 1

. Both

s = 1

and Corr-mRMR handle the balance between relevance and redundancy more effectively, with the latter approaching the Parzen benchmark.

System 6 is considered the most complex of all the examined systems. In this system there are six functionally related features to the class variable, five of which are independent predictors, while the sixth (

f_{3}

) is the product of

f_{1}

and

f_{2}

. Once again, the Original-mRMR fails when

s = 0

(0%) and performs modestly when

s = 1

(55%). The Corr-mRMR performs better (70%), but only the Parzen-based methods were able to identify all the relevant features (100%). Notably, even when the stopping criterion is set to five features instead of six, Corr-mRMR fails to recover the complete set of relevant features in only 5% of the cases, whereas the Original-mRMR does so in 15%.

4. Case Study

To assess the practical applicability of the examined mRMR variations, we conducted a case study in real-world classification tasks. Specifically, we employed a diverse set of benchmark datasets widely used in feature selection and classification research. The collection includes financial datasets (bankrupt, credit), bioinformatics datasets (qsar), software defect prediction datasets (kc1, pc1) and text or signal classification datasets (spambase, magic).

To ensure consistency with the experimental design used in the simulation study, we randomly selected a subset of 1140 instances from the full datasets. This choice allows training to be conducted on 1026 observations, aligning with the sample size used for the corrected mutual information estimation in previous sections.

Each method was configured to select the top 5 features. Using these subsets, we then trained and evaluated four standard classifiers: Logistic Regression, Decision Tree, Random Forest and K-Nearest Neighbors (KNN). Classification performance was assessed using 10-fold cross-validation and we report the mean accuracy and standard deviation across folds.

The following tables (Table 3, Table 4, Table 5 and Table 6) summarize the classification accuracies for each classifier across all datasets. For each dataset, the best-performing method is highlighted in bold to emphasize its relative performance.

The results presented in Table 3 demonstrate how different mutual information estimators and regularization settings influence the performance of the mRMR feature selection algorithm when combined with Logistic Regression. Across the seven benchmark datasets, Corr-mRMR achieved the highest accuracy in four cases, while the Original-mRMR (s = 1), the Original-mRMR (s = 0) and Parzen-mRMR (s = 0) methods performed best in one dataset each. These results indicate that Corr-mRMR provided a measurable advantage over both the uncorrected mRMR variants and the Parzen-based estimators in the majority of cases.

In several datasets, such as bankrupt, magic, kc1, pc1 and credit, the differences between Corr-mRMR and Original-mRMR (s = 1) were minimal, with accuracies differing by less than 0.01. For instance, in bankrupt, Corr-mRMR achieved 0.9649, compared with 0.9588 obtained by Original-mRMR (s = 1). In kc1, the corresponding values were 0.8518 and 0.8500, while in pc1, all three methods, including the Parzen variants, produced nearly identical results in the range 0.8851–0.8860. Similarly, in credit, the performance of the examined methods was nearly identical, with only marginal differences between them.

By contrast, notable performance differences were observed in the qsar and spambase datasets. In qsar, Corr-mRMR achieved an accuracy of 0.8010, clearly higher than Original-mRMR (s = 1) (0.7469) and the best Parzen-based result (0.7631). An even larger improvement was observed in spambase, where Corr-mRMR reached 0.8482, substantially outperforming Original-mRMR (s = 1) (0.6833) as well as the Parzen methods, the best of which achieved 0.7588. It is worth noting that in these two datasets, the direct comparison between Corr-mRMR and Original-mRMR (s = 1) (which differ only in the use of the bias-corrected mutual information estimator while all other parameters remain identical) shows that even a seemingly minor adjustment, which leads to the corrected estimator of mutual information, can result in a substantial performance gain.

Overall, the choice of mutual information estimator appears to have a substantial impact on the performance of feature selection methods in certain cases. Among the examined approaches, the mRMR variant that employs the corrected mutual information estimator (Corr-mRMR) exhibited the most stable performance across all datasets, achieving results that were either nearly identical to or better than those of the alternative methods.

The results presented in Table 4 show the performance of the examined mRMR variations combined with Decision Tree classifiers across seven benchmark datasets. In this setting, Corr-mRMR achieved the highest accuracy in four datasets (qsar, spambase, kc1 and pc1), while Original-mRMR (s = 0) performed best in bankrupt and magic and Parzen-mRMR (s = 0) slightly outperformed the others in credit.

In most datasets, the performance differences between methods were modest, typically within a range of 0.01–0.02. For example, in pc1, Corr-mRMR achieved 0.8518, closely followed by Original-mRMR (s = 0) with 0.8456 and Parzen-mRMR (s = 0) with 0.8465. Similarly, in credit, Parzen (s = 0) achieved the highest accuracy (0.7702), differing by less than 0.01 from Corr-mRMR (0.7684) and Original-mRMR (s = 1) (0.7658).

However, clearer advantages for Corr-mRMR were observed in the qsar and spambase datasets, where it outperformed the other examined methods by approximately 0.025 and 0.021, respectively.

Notably, when compared directly with the mRMR variant employing the Parzen estimator (Parzen-mRMR (s = 0)), which is the configuration used in the study where mRMR was originally introduced, the Corr-mRMR showed clear advantages in most of the datasets. In qsar, Corr-mRMR achieved 0.7514, compared with 0.7120 of Parzen-mRMR (s = 0), while in spambase the respective values were 0.7671 and 0.7167. Similarly, in magic, the corrected variant reached 0.7904, outperforming Parzen-mRMR (s = 0) at 0.7719 and in kc1 it obtained 0.8228 compared with 0.7991.

The results in Table 5 summarize the performance of the examined mRMR variants with Random Forest classifiers. Overall, the differences between methods are relatively modest, with accuracies generally within a 0.01–0.03 range. Corr-mRMR achieved the best performance in qsar, spambase and pc1, reaching 0.8351, 0.8395 and 0.8904, respectively, each representing an improvement of about 0.02 compared with the best performing of the other examined methods. In contrast, Original-mRMR (s = 0) was slightly superior in bankrupt (0.9596) and magic (0.8447), while Parzen-mRMR (s = 0) was the most accurate in kc1 (0.8561) and credit (0.8088) datasets.

These findings suggest that no single method dominates across all datasets, though Corr-mRMR provided some measurable gains, particularly in qsar and spambase.

It is worth noting that in this setting, all the examined methods performed particularly well, as nearly all achieved accuracies equal to or above 0.80 across the benchmark datasets, with only minimal deviations observed for Original-mRMR (s = 0).

The results presented in Table 6 summarize the performance of the examined mRMR variations combined with K-Nearest Neighbors classifiers. Corr-mRMR attained the highest accuracy in four datasets (qsar, spambase, kc1 and pc1), while Original-mRMR (s = 0) performed best in bankrupt and magic, and Parzen-mRMR (s = 0) achieved the highest accuracy in credit.

Across most datasets, the observed differences were modest, typically below 0.01–0.015 in absolute accuracy. For example, in pc1, Corr-mRMR reached 0.8728, only slightly higher than Original-mRMR (s = 0) (0.8675), Parzen (s = 1) (0.8684) and Original-mRMR (s = 1), which with 0.8649 yielded the lowest accuracy among the examined methods. Similarly, in kc1, Corr-mRMR obtained 0.8421 compared with 0.8342 for Parzen-mRMR (s = 1).

More pronounced differences appeared in qsar and spambase. Specifically, Corr-mRMR achieved 0.7816 in qsar, compared with 0.7544 for Parzen-mRMR (s = 0) and 0.7930 in spambase, compared with 0.7667. These results highlight that adopting the corrected mutual information estimator can yield measurable gains even when the improvements appear numerically moderate.

Overall, these findings confirm that incorporating the corrected mutual information estimator into the mRMR framework is effective not only in controlled synthetic settings, but also across a broad range of real-world applications. The results demonstrate that the observed benefits are generalizable and not restricted to a specific domain or dataset.

4.1. Comparison with State-of-the-Art Feature Selection Methods

To further evaluate the Corr-MRMR method, we compared it against two well-established mutual information-based feature selection algorithms, CONMI_FS [46], and MIFS-ND [51].

All three methods were applied to the same benchmark datasets described in Section 4. In this case, the number of selected features was not fixed a priori. Instead, we adopted the intrinsic stopping criterion of CONMI_FS, as defined in its original description. Because CONMI_FS is the only method among the three that specifies a stopping criterion, we used its selected cardinality

k^{★}

on each dataset as the target subset size. MIFS-ND and Corr-mRMR were then constrained to return exactly

k^{★}

features. This approach ensured that the comparison was performed using feature subsets of equal cardinality across all methods.

For CONMI_FS and MIFS-ND, algorithm parameters were set according to the values reported in their respective original studies, corresponding to configurations that achieved the best empirical performance. In particular, for CONMI_FS, the hyperparameter

λ

was fixed at

0.9

, as this value yielded the best classification results in preliminary testing.

Classification performance was assessed using four classifiers: Logistic Regression (LR), Decision Tree (DT), Random Forest (RF) and K-Nearest Neighbors (KNN), each evaluated with 10-fold cross-validation.

The classification results obtained for all datasets and classifiers are summarized in Table 7, Table 8, Table 9 and Table 10. For each dataset, the mean accuracy and standard deviation are reported, with the highest value in each row highlighted in bold.

Only four out of the seven datasets examined in the case study are reported here. In the remaining three datasets (kc1, pc1 and credit), the stopping criterion of CONMI_FS halted the feature selection process when the subset contained a single feature. In these cases, all three methods selected the same feature, resulting in identical classification performance. Therefore, the corresponding results are omitted for brevity.

The results in Table 7 present the classification performance of CONMI_FS, MIFS-ND, and Corr-mRMR when combined with Logistic Regression. Among the four datasets considered, Corr-mRMR achieved the highest accuracy in two cases (bankrupt and magic), CONMI_FS in two cases (qsar and spambase), while MIFS-ND did not achieve the highest accuracy in any dataset.

In bankrupt, Corr-mRMR obtained an accuracy of 0.9640, slightly higher than MIFS-ND (0.9588) and CONMI_FS (0.9509). In magic, Corr-mRMR again outperformed the other two methods, achieving 0.7991 compared with 0.7658 for both CONMI_FS and MIFS-ND. In qsar, CONMI_FS led with 0.8113, marginally surpassing MIFS-ND (0.8076) and showing a clearer advantage over Corr-mRMR (0.7754). For spambase, all three methods achieved relatively high performance, with CONMI_FS obtaining the highest accuracy (0.9123), exceeding MIFS-ND (0.9088) by 0.0035 and Corr-mRMR (0.8807) by 0.0316.

The results in Table 8 show the classification performance of CONMI_FS, MIFS-ND and Corr-mRMR when combined with a Decision Tree classifier. Compared with the Logistic Regression results, the differences between methods are generally smaller, with relatively close performance across all datasets.

CONMI_FS achieved the highest accuracy in three out of the four datasets (bankrupt, qsar and spambase), while Corr-mRMR led in one case (magic). In bankrupt, CONMI_FS obtained 0.9456, exceeding MIFS-ND (0.9421) by 0.0035 and Corr-mRMR (0.9316) by 0.0140. In qsar, CONMI_FS reached 0.7735, which was 0.0161 higher than MIFS-ND and 0.0427 higher than Corr-mRMR. In spambase, CONMI_FS scored 0.8904, ahead of MIFS-ND (0.8798) and Corr-mRMR (0.8561). The magic dataset was the only case where Corr-mRMR achieved the highest performance (0.7553), outperforming CONMI_FS (0.7246) by 0.0307 and MIFS-ND (0.7298) by 0.0255.

The results in Table 9 show the classification performance of CONMI_FS, MIFS-ND and Corr-mRMR when combined with a Random Forest classifier. In bankrupt, all three methods achieved very similar performance, with MIFS-ND obtaining the highest accuracy (0.9570), marginally surpassing CONMI_FS (0.9561) by 0.0009 and Corr-mRMR (0.9491) by 0.0079.A similar pattern was observed in spambase, where Corr-mRMR reached 0.9053, compared with 0.9193 for MIFS-ND and 0.9246 for CONMI_FS. The maximum difference between methods in this dataset was 0.0193.

In contrast, more pronounced differences were observed in qsar and magic. In qsar, MIFS-ND achieved 0.8179, outperforming CONMI_FS (0.7838) by 0.0341 and Corr-mRMR (0.7536) by 0.0643. In magic, Corr-mRMR obtained the highest accuracy (0.8070), exceeding CONMI_FS (0.7570) by 0.0500 and MIFS-ND (0.7509) by 0.0561.

The results in Table 10 indicate that, for the K-nearest neighbors classifier, Corr-mRMR achieved the highest accuracy in two out of the four examined datasets (bankrupt and magic), while CONMI_FS led in the remaining two (qsar and spambase). In bankrupt, the differences among the three methods were minimal, with all accuracies within 0.005 of each other. Similarly, in spambase, CONMI_FS outperformed the other two methods by small margins, with deviations of less than 0.021. More pronounced differences occurred in qsar and magic. In qsar, CONMI_FS reached 0.8218, outperforming MIFS-ND and Corr-mRMR by 0.0105 and 0.0606, respectively. The magic dataset presented the most substantial difference observed across all experiments in this study, with Corr-mRMR achieving 0.8500 and surpassing both CONMI_FS and MIFS-ND (0.7667) by 0.0833.

Overall, no single feature selection method managed to achieve the highest accuracy across all classifiers and datasets. In some cases, the differences were minimal, such as in bankrupt, where depending on the classifier, each of the examined methods achieved the best performance. A similar pattern was observed in spambase, where CONMI_FS consistently achieved the highest accuracy, but with only small margins over the other methods. For qsar, the best-performing method varied between MIFS-ND and CONMI_FS depending on the classifier, while in magic, Corr-mRMR consistently produced the highest accuracy, with the largest observed margins in the study. These findings indicate that the proposed correction to the mutual information estimator enables Corr-MRMR to match or exceed the performance of other strong, well-established feature selection approaches.

4.2. Impact of the Corrected Mutual Information Estimator on CONMI_FS

In order to further evaluate the impact of the mutual information estimator on feature selection performance, we extended our analysis to the CONMI_FS algorithm. Given that CONMI_FS employs the same mutual information estimator as the one originally used in mRMR, this provided an opportunity to examine whether substituting it with the bias-corrected version would yield any notable changes in results. In CONMI_FS, the data discretization process is performed prior to feature selection because the normalized mutual information in the integration requires a discrete data format. Specifically, the Equal-Width discretization method is used to divide each continuous feature into equal intervals, as described in the original study.

Although CONMI_FS integrates Pearson’s correlation coefficient with mutual information in its criterion, which may limit the influence of the mutual information estimator, we considered it relevant to assess whether the corrected estimator could still have a measurable effect. For this comparison, we evaluated the original CONMI_FS alongside a variant (CONMI_Cormut) in which the corrected estimator of mutual information was applied, using the same datasets as in the previous experiments. In this case, however, both methods were allowed to operate with their intrinsic stopping criteria, determining the number of selected features independently. Since the CONMI_FS scoring formula subtracts redundancy from relevance and relies on normalized rather than unnormalized mutual information, the expected impact of estimation bias is somewhat smaller than in the mRMR setting.

Table 11 summarises, for each dataset, the number of features selected by CONMI_FS and CONMI_Cormut, together with the corresponding classification accuracy obtained using the Random Forest classifier.

The results in Table 11 indicate that the bias-corrected variant of CONMI_FS achieved higher accuracy in bankrupt, magic and kc1, with the largest improvement observed for kc1 despite both methods selecting a single feature. In contrast, the original method performed better in qsar, spambase, pc1 and credit, with the largest difference occurring in credit, where CONMI_Cormut selected six features compared to a single feature for CONMI_FS.

Differences in the number of selected features varied considerably across datasets. For example, in bankrupt and spambase the CONMI_Cormut selected more than twice as many features as the CONMI_FS, without a proportional accuracy gain, whereas in magic and kc1 the subset size was identical between methods, suggesting that the observed accuracy differences in these cases were most likely attributable to the estimator change.

As noted earlier, the smaller influence of mutual information in the CONMI_FS formulation resulting from the use of normalized mutual information, its combination with Pearson correlation and the subtraction of redundancy from relevance, reduces the expected effect of replacing the estimator. The present results are consistent with this expectation, while also illustrating that such modifications can still yield measurable changes in certain scenarios.

Since CONMI_FS also relies primarily on the same mutual information estimator used in our mRMR variants, we found it worthwhile to investigate whether the bias-corrected version could lead to different outcomes in this context. This additional analysis was intended not only to assess its effect here, but also to encourage further examination of such adjustments in other feature selection methods, where the impact might be more pronounced.

5. Discussion

The primary objective of this study was not to identify the most accurate version of the mRMR algorithm, but rather to investigate how the choice of mutual information estimator affects the performance of this widely used feature selection method. To this end, we evaluated five mRMR variations, based on two different estimators (a discretization-based estimator and a Parzen KDE-based estimator), with or without a denominator correction (

s = 0

or

s = 1

), as well as a corrected mutual information version (Corr-mRMR), across both synthetic and real data.

Although the original mutual information estimator used in this study is not considered optimal compared to certain non-parametric alternatives such as k-nearest neighbors or B-spline-based approaches (see Introduction, Section 1), it was chosen due to observed biases in prior work and the existence of a proposed correction. The corrected estimator was integrated into the mRMR framework to evaluate whether mitigating this bias improves selection outcomes.

Overall, the results from the simulation study indicate that the correction has a clear positive effect, especially in more complex systems involving nonlinear dependencies or redundant features. In simpler systems, where the functionally related features were strongly or moderately correlated with the class variable, all methods achieved near-perfect performance. These results were also supported by the use of a slightly modified mRMR formulation (Equation (2)) that places more emphasis on relevance than redundancy (see Section 2.4).

For systems such as System 3, we deliberately set the size of the optimal feature subset to include all functionally related features—despite the fact that only two were sufficient to capture the underlying information—so as to test whether the methods would prioritize redundant-but-relevant features over completely irrelevant ones. This decision aligns with practical use cases, where the user predefines the number of features, and thus selection behavior, in such important contexts.

An interesting observation is that the Parzen-based variants performed consistently well across all systems, even in the presence of redundancy and nonlinearity. This suggests that density-based mutual information estimation may offer improved robustness over discretization-based techniques, particularly in more realistic data scenarios. Corr-mRMR, which employs the corrected estimator and

s = 1

in the denominator, approached the performance of the Parzen-based methods in most cases—often surpassing both the uncorrected versions (

s = 0

and

s = 1

) and offering a strong compromise between computational simplicity and selection accuracy.

It also should be noted that during simulations with the aforementioned setup, for systems where some of the predictors were weakly correlated to the class variable e.g.,

y = 0.8 f_{1} + 0.3 f_{2}

, but also when the coefficients of the functionally related features did not all have the same signs and their sum tended to zero, the simple mRMR appeared to be more accurate compared to the corrected one.

In addition to these observations, it should also be noted that the corrected estimator is substantially slower than the original discretization-based version, since it requires computing the mutual information once on the original data and repeatedly (in our case 100 times) on surrogate datasets. This leads to a significantly higher runtime compared to the original method. However, the main aim of this study was not to propose the fastest estimator, but rather to highlight how even a small correction to the mutual information estimation process can lead to markedly different outcomes in mRMR. Therefore, the computational overhead of the corrected estimator should be viewed as secondary to the conceptual point being made.

In addition to the effect of sample size, which was discussed in Section 2.3.2, dimensionality is another important factor that may influence the performance of mutual information estimators. In very high-dimensional feature spaces, sparsity can reduce the reliability of probability estimates, potentially affecting the stability of mRMR outcomes. A more systematic investigation of this aspect could be pursued in future work.

Finally, the results from the case study further support the simulation findings, showing that the mRMR variant employing the corrected mutual information estimator consistently provided competitive and often superior classification performance across diverse real-world datasets. While Parzen-based and Original-mRMR methods also performed strongly, the corrected-estimator variant (Corr-mRMR) frequently achieved the highest accuracies, particularly in datasets such as qsar and spambase, where the gains were substantial. These results reinforce the conclusion that the choice of mutual information estimator, and even relatively small adjustments to its formulation, can have a measurable impact on feature selection outcomes, especially in scenarios involving complex feature dependencies.

From a practical perspective, the findings of this study highlight that the choice of mutual information estimator can substantially influence the outcome of mRMR-based feature selection, even when all other parameters are kept constant. It is therefore important to consider not only the feature selection framework itself but also the underlying estimator used to quantify dependencies.

Overall, the results suggest that aligning the choice of estimator with the characteristics of the dataset and the resource constraints of the task can lead to more effective and reliable feature selection in practice.

6. Conclusions

This study investigated the impact of a mutual information estimator choice on the performance of the mRMR feature selection algorithm. Specifically, we evaluated five variants: two using a discretization-based mutual information estimator (with and without a regularization constant in the denominator), two based on a Parzen window approach and one employing a bias-corrected discretization-based mutual information estimator.

The evaluation included both synthetic systems, ranging from simple linear structures to more complex nonlinear and redundant systems and real-world classification tasks. The results revealed that the choice of mutual information estimator can substantially affect feature selection outcomes, particularly in settings with nonlinear dependencies or feature redundancy. In such cases, the examined Parzen-based and bias-corrected estimators demonstrated clear advantages over the original discretization-based version.

The results also highlight the importance of carefully tuning the redundancy term in mRMR’s denominator. Although switching from

s = 0

to

s = 1

often led to more stable outcomes, the effect was not uniform across all scenarios.

It is also emphasized how crucial it is to select the appropriate settings based on the requirements of each problem, because different setups yield dramatically different results.

Further research would involve a comparison between state-of-the-art mutual information estimators in several systems and setups in order to clarify which one is considered as the most appropriate in different problems.

Author Contributions

Conceptualization, N.P. and A.T.; methodology, N.P., G.M. and A.T.; software, N.P.; validation, N.P., G.M. and A.T.; formal analysis, N.P.; investigation, N.P.; resources, N.P. and G.M.; data curation, N.P.; writing—original draft preparation, N.P.; writing—review and editing, N.P., G.M., A.T. and V.V.; visualization, N.P., A.T. and V.V.; supervision, A.T. and V.V.; project administration, A.T. and V.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The benchmark datasets used in this study are publicly available via the OpenML platform (https://www.openml.org, accessed on 27 June 2025). These include the *spambase*, *qsar*, *magic*, *bankrupt*, *credit*, *kc1* and *pc1* datasets; dataset identifiers and metadata are available through OpenML.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chen, M.; Mao, S.; Liu, Y. Big Data: A Survey. Mob. Netw. Appl. 2014, 19, 171–209. [Google Scholar] [CrossRef]
Manley, K.; Nyelele, C.; Egoh, B.N. A review of machine learning and big data applications in addressing ecosystem service research gaps. Ecosyst. Serv. 2022, 57, 101478. [Google Scholar] [CrossRef]
Morgenstern, J.; Rosella, L.; Costa, A.; de Souza, R.; Anderson, L. Perspective: Big Data and Machine Learning Could Help Advance Nutritional Epidemiology. Adv. Nutr. 2021, 12, 621–631. [Google Scholar] [CrossRef] [PubMed]
Strikas, K.; Papaioannou, N.; Stamatopoulos, I.; Angeioplastis, A.; Tsimpiris, A.; Varsamis, D.; Giagazoglou, P. State-of-the-art CNN Architectures for Assessing Fine Motor Skills: A Comparative Study. Wseas Trans. Adv. Eng. Educ. 2023, 20, 44–51. [Google Scholar] [CrossRef]
Praseetha, V.M.; Vadivel, S. Deep Learning Models for Speech Emotion Recognition. J. Comput. Sci. 2018, 14, 1577–1587. [Google Scholar] [CrossRef]
Goossens, A.; De Smedt, J.; Vanthienen, J. Extracting Decision Model and Notation models from text using deep learning techniques. Expert Syst. Appl. 2023, 211, 118667. [Google Scholar] [CrossRef]
Myllis, G.; Tsimpiris, A.; Aggelopoulos, S.; Vrana, V. High-Performance Computing and Parallel Algorithms for Urban Water Demand Forecasting. Algorithms 2025, 18, 182. [Google Scholar] [CrossRef]
Li, Y.; Nie, J.; Chao, X. Do we really need deep CNN for plant diseases identification? Comput. Electron. Agric. 2020, 178, 105803. [Google Scholar] [CrossRef]
Yang, J.; Zhang, L.; Tang, X.; Han, M. CodnNet: A lightweight CNN architecture for detection of COVID-19 infection. Appl. Soft Comput. 2022, 130, 109656. [Google Scholar] [CrossRef]
Gupta, A.; Anpalagan, A.; Guan, L.; Khwaja, A.S. Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues. Array 2021, 10, 100057. [Google Scholar] [CrossRef]
Dinsdale, N.K.; Bluemke, E.; Sundaresan, V.; Jenkinson, M.; Smith, S.M.; Namburete, A.I. Challenges for machine learning in clinical translation of big data imaging studies. Neuron 2022, 110, 3866–3881. [Google Scholar] [CrossRef] [PubMed]
Süpürtülü, M.; Hatipoğlu, A.; Yılmaz, E. An Analytical Benchmark of Feature Selection Techniques for Industrial Fault Classification Leveraging Time-Domain Features. Appl. Sci. 2025, 15, 1457. [Google Scholar] [CrossRef]
Guo, J.; Yu, H.; Xing, S.; Huan, T. Addressing Big Data Challenges in Mass Spectrometry-Based Metabolomics. Chem. Commun. 2022, 58, 9979–9990. [Google Scholar] [CrossRef] [PubMed]
Kamalov, F.; Sulieman, H.; Alzaatreh, A.; Emarly, M.; Chamlal, H.; Safaraliev, M. Mathematical Methods in Feature Selection: A Review. Mathematics 2025, 13, 996. [Google Scholar] [CrossRef]
El-Hajj, W.; Hajj, H. An optimal approach for text feature selection. Comput. Speech Lang. 2022, 74, 101364. [Google Scholar] [CrossRef]
Montalvo-Lezama, R.; Montalvo-Lezama, B.; Fuentes-Pineda, G. Improving Transfer Learning for Movie Trailer Genre Classification using a Dual Image and Video Transformer. Inf. Process. Manag. 2023, 60, 103343. [Google Scholar] [CrossRef]
Ding, C.; Peng, H. Minimum Redundancy Feature Selection from Microarray Gene Expression Data. J. Bioinform. Comput. Biol. 2005, 3, 185–205. [Google Scholar] [CrossRef]
Lyu, Y.; Feng, Y.; Sakurai, K. A Survey on Feature Selection Techniques Based on Filtering Methods for Cyber Attack Detection. Information 2023, 14, 191. [Google Scholar] [CrossRef]
Wang, W.; Lu, L.; Wei, W. A Novel Supervised Filter Feature Selection Method Based on Gaussian Probability Density for Fault Diagnosis of Permanent Magnet DC Motors. Sensors 2022, 22, 7121. [Google Scholar] [CrossRef]
Papaioannou, N.; Myllis, G.; Tsimpiris, A.; Aggelopoulos, S.; Vrana, V. PCMINN: A GPU-Accelerated Conditional Mutual Information-Based Feature Selection Method. Information 2025, 16, 445. [Google Scholar] [CrossRef]
Cherrington, M.; Thabtah, F.; Lu, J.; Xu, Q. Feature Selection: Filter Methods Performance Challenges. In Proceedings of the 2019 International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 3–4 April 2019; pp. 1–4. [Google Scholar] [CrossRef]
Taye, M.M. Understanding of Machine Learning with Deep Learning: Architectures, Workflow, Applications and Future Directions. Computers 2023, 12, 91. [Google Scholar] [CrossRef]
Cesarini, M.; Malandri, L.; Pallucchini, F.; Seveso, A.; Xing, F. Explainable AI for Text Classification: Lessons from a Comprehensive Evaluation of Post Hoc Methods. Cogn. Comput. 2024, 16, 3077–3095. [Google Scholar] [CrossRef]
Zhang, J.; Tan, J.; Song, Q.; Du, T.; Hauch, J.; Brabec, C. Feature Selection for Machine Learning-Driven Accelerated Discovery and Optimization in Emerging Photovoltaics: A Review. Adv. Intell. Discov. 2025, 202500022. [Google Scholar] [CrossRef]
Passemiers, A.; Folco, P.; Raimondi, D.; Birolo, G.; Moreau, Y.; Fariselli, P. A quantitative benchmark of neural network feature selection methods for detecting nonlinear signals. Sci. Rep. 2024, 14, 31180. [Google Scholar] [CrossRef] [PubMed]
Peng, H.; Long, F.; Ding, C. Feature Selection Based On Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef]
Guyon, I.; Elisseeff, A. An Introduction to Feature Extraction; Springer: Berlin/Heidelberg, Germany, 2008; Volume 207, pp. 1–25. [Google Scholar] [CrossRef]
Sánchez-Maroño, N.; Alonso-Betanzos, A.; Tombilla-Sanromán, M. Filter Methods for Feature Selection—A Comparative Study. In Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning, Birmingham, UK, 16–19 December 2007; pp. 178–187. [Google Scholar] [CrossRef]
Tsimpiris, A.; Kugiumtzis, D. Feature Selection for Classification of Oscillating Time Series. Expert Syst. 2012, 29, 456–477. [Google Scholar] [CrossRef]
Papaioannou, N.; Tsimpiris, A.; Talagozis, C.; Fragidis, L.; Angeioplastis, A.; Tsakiridis, S.; Varsamis, D. Parallel Feature Subset Selection Wrappers Using k-means Classifier. Wseas Trans. Inf. Sci. Appl. 2023, 20, 76–86. [Google Scholar] [CrossRef]
Maldonado, J.; Riff, M.; Neveu, B. A review of recent approaches on wrapper feature selection for intrusion detection. Expert Syst. Appl. 2022, 198, 116822. [Google Scholar] [CrossRef]
Bouzoubaa, K.; Taher, Y.; Nsiri, B. Predicting DOS-DDOS Attacks: Review and Evaluation Study of Feature Selection Methods based on Wrapper Process. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 132–145. [Google Scholar] [CrossRef]
Tsimpiris, A.; Vlachos, I.; Kugiumtzis, D. Nearest neighbor estimate of conditional mutual information in feature selection. Expert Syst. Appl. 2012, 39, 12697–12708. [Google Scholar] [CrossRef]
Gamage, H.N.; Chetty, M.; Shatte, A.; Hallinan, J. Filter feature selection based Boolean Modelling for Genetic Network Inference. Biosystems 2022, 221, 104757. [Google Scholar] [CrossRef]
Zhang, X.; Wu, G.; Dong, Z.; Crawford, C. Embedded feature-selection support vector machine for driving pattern recognition. J. Frankl. Inst. 2015, 352, 669–685. [Google Scholar] [CrossRef]
Zhu, M.; Song, J. An Embedded Backward Feature Selection Method for MCLP Classification Algorithm. Procedia Comput. Sci. 2013, 17, 1047–1054. [Google Scholar] [CrossRef]
Mahendran, N.; Vincent, P. A deep learning framework with an embedded-based feature selection approach for the early detection of the Alzheimer’s disease. Comput. Biol. Med. 2022, 141, 105056. [Google Scholar] [CrossRef] [PubMed]
Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
Shachaf, L.; Roberts, E.; Cahan, P.; Xiao, J. Gene regulation network inference using k-nearest neighbor-based mutual information estimation: Revisiting an old DREAM. BMC Bioinform. 2023, 24, 84. [Google Scholar] [CrossRef] [PubMed]
Qian, X. Topology optimization in B-spline space. Comput. Methods Appl. Mech. Eng. 2013, 265, 15–35. [Google Scholar] [CrossRef]
Darbellay, G.; Vajda, I.; Member, S. Estimation of the Information by an Adaptive Partitioning of the Observation Space. IEEE Trans. Inf. Theory 2000, 45, 1315–1321. [Google Scholar] [CrossRef]
Hu, Q.; Zhang, L.; Wei, P.; An, S.; Pedrycz, W. Measuring relevance between discrete and continuous features based on neighborhood mutual information. Expert Syst. Appl. 2011, 38, 10737–10750. [Google Scholar] [CrossRef]
Papana, A.; Kugiumtzis, D. Evaluation of Mutual Information Estimators for Time Series. Int. J. Bifurc. Chaos 2009, 19, 4197–4215. [Google Scholar] [CrossRef]
Liu, S.; Motani, M. Improving Mutual Information Based Feature Selection by Boosting Unique Relevance. J. Artif. Int. Res. 2025, 82, 1267–1292. [Google Scholar] [CrossRef]
Gadgil, S.; Covert, I.C.; Lee, S.I. Estimating Conditional Mutual Information for Dynamic Feature Selection. arXiv 2023, arXiv:2306.03301. [Google Scholar] [CrossRef]
Gong, H.; Li, Y.; Zhang, J.; Zhang, B.; Wang, X. A new filter feature selection algorithm for classification task by ensembling pearson correlation coefficient and mutual information. Eng. Appl. Artif. Intell. 2024, 131, 107865. [Google Scholar] [CrossRef]
Czyż, P.; Grabowski, F.; Vogt, J.E.; Beerenwinkel, N.; Marx, A. Beyond Normal: On the Evaluation of Mutual Information Estimators. In Proceedings of the Proceedings of the 37th International Conference on Neural Information Processing System, New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 16957–16990.
Tuononen, M.; Hautamäki, V. Improving Numerical Stability of Normalized Mutual Information Estimator on High Dimensions. IEEE Signal Process. Lett. 2025, 32, 2783–2787. [Google Scholar] [CrossRef]
Li, J.; Othman, M.; Chen, H.; Mi Yusuf, L. Optimizing IoT intrusion detection system: Feature selection versus feature extraction in machine learning. J. Big Data 2024, 11, 36. [Google Scholar] [CrossRef]
Ramírez-Gallego, S.; Lastra, I.; Martinez, D.; Bolón-Canedo, V.; Benítez, J.; Herrera, F.; Alonso-Betanzos, A. Fast-mRMR: Fast Minimum Redundancy Maximum Relevance Algorithm for High-Dimensional Big Data. Int. J. Intell. Syst. 2016, 32, 134–152. [Google Scholar] [CrossRef]
Hoque, N.; Bhattacharyya, D.; Kalita, J. MIFS-ND: A mutual information-based feature selection method. Expert Syst. Appl. 2014, 41, 6371–6385. [Google Scholar] [CrossRef]
Abdo, A.; Mostafa, R.; Abdel-Hamid, L. An Optimized Hybrid Approach for Feature Selection Based on Chi-Square and Particle Swarm Optimization Algorithms. Data 2024, 9, 20. [Google Scholar] [CrossRef]
Morán-Fernández, L.; Blanco-Mallo, E.; Sechidis, K.; Bolón-Canedo, V. Breaking boundaries: Low-precision conditional mutual information for efficient feature selection. Pattern Recognit. 2025, 162, 111375. [Google Scholar] [CrossRef]
Sun, L.; Xu, F.; Ding, W.; Xu, J. AFIFC: Adaptive fuzzy neighborhood mutual information-based feature selection via label correlation. Pattern Recognit. 2025, 164, 111577. [Google Scholar] [CrossRef]
Papaioannou, N.; Kugiumtzis, D. Correction of bias in the assessment of mutual information and corresponding correlation networks from multivariate time series. Hell. Acad. Libr. Link AUTH 2021. [Google Scholar] [CrossRef]

Figure 1. Flowchart of the bias correction procedure. The steps illustrate how the mutual information is estimated at different subdivisions, extrapolated and corrected using surrogate data.

Figure 2. The success rates of mRMR for the two variations being examined depending on the value of s (x-axis: s, regularization constant; y-axis: success rate (%)). Optimal performance was observed at

s = 1

. This analysis is for System 5.

Figure 2. The success rates of mRMR for the two variations being examined depending on the value of s (x-axis: s, regularization constant; y-axis: success rate (%)). Optimal performance was observed at

s = 1

. This analysis is for System 5.

Figure 3. The success rates of mRMR for the two variations being examined depending on the value of s (x-axis: s, regularization constant; y-axis: success rate (%)). It is observed that the corrected mRMR consistently outperforms the original at higher values of s. This analysis is for System 6.

Table 1. Summary of the systems used in the simulation study.

System	Functional Relationship (y)	Relevant/Redundant Features	Irrelevant Features
$S_{1}$	$0.5 f_{1} + 0.5 f_{2} + e$	$f_{1}, f_{2}$	28
$S_{2}$	$0.5 f_{1} + 0.5 f_{2} + e$	$f_{1}, f_{2}$ + 4 correlated features	24
$S_{3}$	$0.5 f_{1} + 0.5 f_{2} + 0.5 f_{3} + e$ ; $f_{3} = f_{1} f_{2}$	$f_{1}, f_{2}, f_{3}$ (redundant)	27
$S_{4}$	$0.5 f_{1} + 0.5 f_{2} + 0.5 f_{3} + e$ ; $f_{3} = f_{1} f_{2} z_{3}$	$f_{1}, f_{2}, f_{3}$	27
$S_{5}$	$f_{1} + f_{2} + 0.2 f_{3} + 0.3 f_{4} + e$ ; $f_{3}, f_{4}$ functions of $f_{1}, f_{2}$	$f_{1}, f_{2}, f_{3}, f_{4}$	26
$S_{6}$	$\sum_{i = 1}^{6} β_{i} f_{i} + e$ ; $f_{1}, \dots, f_{6}$ depend on $z_{1}, \dots, z_{6}$	6 relevant ( $f_{1}, \dots, f_{6}$ ), + 12 correlated (0.8/0.4)	12

Table 2. Success rate (%) of the examined mRMR variations across different systems. The success rate refers to the correct identification of all functionally related features. Reported values represent averages over 20 generated datasets.

System	Original (s = 0)	Original (s = 1)	Corr (s = 1)	Parzen (s = 0)	Parzen (s = 1)
$S_{1}$	100	100	100	100	100
$S_{2}$	80	90	100	100	100
$S_{3}$	0	95	95	100	100
$S_{4}$	0	80	95	100	100
$S_{5}$	0	85	95	100	100
$S_{6}$	0	55	70	100	100

Table 3. Classification accuracies (mean ± std) using logistic regression across seven benchmark datasets. Columns correspond to different mRMR variations: Original ( $s = 0$ ) uses the original mutual information estimator without modification to the denominator; Original ( $s = 1$ ) adds a regularization term of one unit to the denominator; Corr ( $s = 1$ ) employs the bias-corrected mutual information estimator with the same regularization; Parzen ( $s = 0$ ) and Parzen ( $s = 1$ ) rely on Parzen window (KDE-based) estimation without and with the

s = 1

regularization, respectively. The parameter s denotes the regularization constant added to the denominator of the mRMR criterion. The best-performing method for each dataset is highlighted in bold.

Table 3. Classification accuracies (mean ± std) using logistic regression across seven benchmark datasets. Columns correspond to different mRMR variations: Original ( $s = 0$ ) uses the original mutual information estimator without modification to the denominator; Original ( $s = 1$ ) adds a regularization term of one unit to the denominator; Corr ( $s = 1$ ) employs the bias-corrected mutual information estimator with the same regularization; Parzen ( $s = 0$ ) and Parzen ( $s = 1$ ) rely on Parzen window (KDE-based) estimation without and with the

s = 1

regularization, respectively. The parameter s denotes the regularization constant added to the denominator of the mRMR criterion. The best-performing method for each dataset is highlighted in bold.

Dataset	Original (s = 1)	Original (s = 0)	Corr (s = 1)	Parzen (s = 1)	Parzen (s = 0)
bankrupt	0.9588 ± 0.0124	0.9570 ± 0.0083	0.9649 ± 0.0147	0.9544 ± 0.0053	0.9491 ± 0.0066
qsar	0.7469 ± 0.0332	0.7726 ± 0.0329	0.8010 ± 0.0200	0.7631 ± 0.0199	0.7270 ± 0.0223
spambase	0.6833 ± 0.0138	0.8000 ± 0.0265	0.8482 ± 0.0398	0.7588 ± 0.0375	0.6667 ± 0.0245
magic	0.8044 ± 0.0340	0.8044 ± 0.0331	0.8009 ± 0.0344	0.7702 ± 0.0355	0.7702 ± 0.0355
kc1	0.8500 ± 0.0249	0.8482 ± 0.0245	0.8518 ± 0.0262	0.8509 ± 0.0294	0.8456 ± 0.0212
pc1	0.8851 ± 0.0073	0.8860 ± 0.0118	0.8851 ± 0.0114	0.8842 ± 0.0110	0.8860 ± 0.0124
credit	0.8000 ± 0.0225	0.7991 ± 0.0194	0.8018 ± 0.0229	0.8000 ± 0.0225	0.8026 ± 0.0177

Table 4. Classification accuracies (mean ± std) using Decision Tree across seven benchmark datasets. Column descriptions as in Table 3. The best-performing method for each dataset is highlighted in bold.

Dataset	Original (s = 1)	Original (s = 0)	Corr (s = 1)	Parzen (s = 1)	Parzen (s = 0)
bankrupt	0.9307 ± 0.0159	0.9412 ± 0.0104	0.9325 ± 0.0239	0.9061 ± 0.0222	0.9386 ± 0.0277
qsar	0.7242 ± 0.0342	0.7262 ± 0.0319	0.7514 ± 0.0242	0.7210 ± 0.0218	0.7120 ± 0.0256
spambase	0.7211 ± 0.0188	0.7460 ± 0.0256	0.7671 ± 0.0321	0.7254 ± 0.0216	0.7167 ± 0.0278
magic	0.7825 ± 0.0310	0.7939 ± 0.0328	0.7904 ± 0.0297	0.7693 ± 0.0285	0.7719 ± 0.0301
kc1	0.8105 ± 0.0275	0.8123 ± 0.0260	0.8228 ± 0.0302	0.8053 ± 0.0291	0.7991 ± 0.0272
pc1	0.8421 ± 0.0178	0.8456 ± 0.0210	0.8518 ± 0.0204	0.8412 ± 0.0197	0.8465 ± 0.0213
credit	0.7658 ± 0.0202	0.7623 ± 0.0199	0.7684 ± 0.0221	0.7614 ± 0.0187	0.7702 ± 0.0194

Table 5. Classification accuracies (mean ± std) using Random Forest across seven benchmark datasets. Column descriptions as in Table 3. The best-performing method for each dataset is highlighted in bold.

Dataset	Original (s = 1)	Original (s = 0)	Corr (s = 1)	Parzen (s = 1)	Parzen (s = 0)
bankrupt	0.9544 ± 0.0116	0.9596 ± 0.0105	0.9526 ± 0.0153	0.9070 ± 0.0208	0.9588 ± 0.0152
qsar	0.8158 ± 0.0289	0.8104 ± 0.0314	0.8351 ± 0.0261	0.8184 ± 0.0256	0.8096 ± 0.0298
spambase	0.8254 ± 0.0302	0.7991 ± 0.0198	0.8395 ± 0.0487	0.8237 ± 0.0279	0.8158 ± 0.0332
magic	0.8302 ± 0.0341	0.8447 ± 0.0347	0.8298 ± 0.0389	0.8193 ± 0.0328	0.8272 ± 0.0356
kc1	0.8465 ± 0.0299	0.8412 ± 0.0310	0.8333 ± 0.0301	0.8526 ± 0.0280	0.8561 ± 0.0277
pc1	0.8719 ± 0.0192	0.8746 ± 0.0196	0.8904 ± 0.0212	0.8789 ± 0.0189	0.8728 ± 0.0203
credit	0.8026 ± 0.0214	0.7974 ± 0.0198	0.8018 ± 0.0209	0.8061 ± 0.0287	0.8088 ± 0.0220

Table 6. Classification accuracies (mean ± std) using K-Nearest Neighbors across seven benchmark datasets. Column descriptions as in Table 3. The best-performing method for each dataset is highlighted in bold.

Dataset	Original (s = 1)	Original (s = 0)	Corr (s = 1)	Parzen (s = 1)	Parzen (s = 0)
bankrupt	0.9500 ± 0.0196	0.9570 ± 0.0149	0.9500 ± 0.0196	0.9482 ± 0.0073	0.9544 ± 0.0140
qsar	0.7588 ± 0.0291	0.7658 ± 0.0288	0.7816 ± 0.0267	0.7623 ± 0.0275	0.7544 ± 0.0299
spambase	0.7702 ± 0.0310	0.7825 ± 0.0294	0.7930 ± 0.0345	0.7719 ± 0.0302	0.7667 ± 0.0328
magic	0.8026 ± 0.0335	0.8167 ± 0.0327	0.8096 ± 0.0352	0.8018 ± 0.0340	0.8070 ± 0.0331
kc1	0.8351 ± 0.0286	0.8386 ± 0.0271	0.8421 ± 0.0294	0.8342 ± 0.0279	0.8368 ± 0.0263
pc1	0.8649 ± 0.0208	0.8675 ± 0.0211	0.8728 ± 0.0224	0.8684 ± 0.0207	0.8667 ± 0.0218
credit	0.7912 ± 0.0197	0.7939 ± 0.0189	0.7956 ± 0.0202	0.7921 ± 0.0190	0.7982 ± 0.0187

Table 7. Mean classification accuracy (± standard deviation) with Logistic Regression after feature selection using CONMI_FS, MIFS-ND and Corr-mRMR. The “# Features” column indicates the subset size

k^{★}