Filter-Type Variable Selection Based on Information Measures for Regression Tasks

This paper presents a supervised variable selection method applied to regression problems. The method selects variables by applying a hierarchical clustering strategy based on information measures. The proposed technique can be applied to single-output regression datasets, and it is extendable to multi-output datasets. For single-output datasets, the method is compared against three other variable selection methods for regression on four datasets. In the multi-output case, it is compared against another state-of-the-art method and tested using two regression datasets. Two different figures of merit are used (for the single and multi-output cases) in order to analyze and compare the performance of the proposed method.


Introduction
Variable selection aims at reducing the dimensionality of data. It consists of selecting the most relevant variables (attributes) among the set of original ones [1]. This step is crucial for the design of regression and classification systems. In this framework, the term relevant is related to the impact of the variables on the prediction error of the variable to be regressed (target variable).
The relevance criterion can be based on the performance of a specific predictor (wrapper method), or on some general relevance measure of the variables for the prediction (filter method). Wrapper methods may have two drawbacks [2]: (a) they can be computationally very intensive; (b) their results may vary according to initial conditions or other chosen parameters. In the case of variable selection for regression, several studies have applied different regression algorithms attempting to minimize the cost of the search in the variable space [3,4]. Other studies have assessed a noise variance estimator known as the Delta test, which considers the differences in the outputs of the relevant variable associated with neighboring points [5]. This estimation has been applied to obtain the relevance of input variables.
Filter methods allow sorting variables independently of the regressor [6,7]. Finally, embedded methods try to include the variable selection as a part of the training process. Such strategies have been used in classification problems. In order to tackle the combinatorial search problem of finding an optimal subset of variables, the most popular variable selection methods avoid an exhaustive search by applying forward, backward or floating sequential schemes [8,9].
Research work has mainly focused on single-output (SO) regression datasets. However, multi-output (MO) regression is becoming more and more important in areas such as biomedical data analysis, where experiments are conducted on several individuals belonging to a specific population. In this case, the individual responses share some common variables, whereby data from one subject may help assess the responses associated with other patients. MO regression is based on the assumption that several tasks (outputs) share certain structures, and therefore tasks can mutually benefit from these shared structures. In fact, some works [10] suggest that most single classification and regression real-world problems should reasonably be treated as multi-output by nature, and that the assessment would improve due to the improvement in the generalization performance of the learning strategy.
This paper focuses on filter strategies by proposing a measure based on information theory as a criterion to determine the relevance of variables. A variable clustering-based method aimed at finding a subset of variables that minimizes the regression error is proposed. The conditional mutual information is estimated to define a distance criterion between variables. This distance has already been used in [11] for variable selection in classification tasks. The contributions of this paper are two-fold: (a) to establish a methodology to properly solve the estimation of this distance for regression problems where the relevant variable is continuous, through the assessment of the conditional mutual information between input and output variables; and (b) to show the extension of this methodology to multi-output regression datasets. Some preliminary results were presented in [12], where the method was applied only to single-output regression datasets. In addition, the work presented here introduces an information theoretic framework for the distance used in the clustering-based feature selection process in regression tasks when using continuous variables, and includes extensive experimentation to validate the proposed approach.
The rest of this paper is organized as follows: Section 2 describes the theoretical foundations rooted in information theory for variable selection when using continuous relevant variables, and proposes a methodology to estimate the conditional mutual information. Section 3 describes the experiments carried out, including the datasets used and the variable selection methods for regression used in the comparison. Section 4 presents and discusses the regression results obtained. Finally, some concluding remarks are drawn in Section 5.

Variable Selection for Single and Multi-Output Continuous Variables
The approach presented here is based on a previous work in which information theory was used to propose a filter variable selection method for classification tasks [11]. In this section, this approach is adapted and extended for single and multi-output regression tasks. In order to achieve this, two main issues must be solved: to assess the possibility of applying the same information theoretic criteria when using continuous relevant variables, and to establish a way to estimate the conditional mutual information for continuous variables.
Section 2.1 analyzes whether, under certain conditions, an upper bound of the regression error can be justified through an information theory expression. This is necessary if a variable selection algorithm for regression is going to be applied in the case of continuous relevant variables. As will be seen in Section 2.1, the conclusions drawn are valid when the relevant output variable is countably infinite. In addition, Section 2.1 uses concepts such as entropy, conditional entropy and mutual information. These concepts are defined, for a training set used in the learning process, at the beginning of Section 2.2.
In Section 2.2, a method to estimate the probability density function for continuous relevant variables is introduced, for the single-output and multi-output cases. In addition, the optimization strategy used to obtain the method parameters is also explained.

Variable Selection Criterion for Regression
Let a dataset be represented in a variable space by a (usually multivariate) random variable $X = (X_1, \ldots, X_d)$, where $d$ is the dimension of this variable space, and let $Y$ be the continuous variable that we want to predict. In terms of information theory, let us suppose that $Y$ represents the random variable of messages sent through a noisy communication channel denoted by $(\mathcal{Y}, p(x|y), \mathcal{X})$, where $X$ is the random variable representing the values at the receiver. We denote by $p(x|y)$ the conditional probability of observing the output $x \in \mathcal{X}$ when sending $y \in \mathcal{Y}$. In this framework, the goal is to decode the received value $X$ and recover the correct $Y$. That is, we perform a decoding operation $\hat{Y} = f(X)$, treated as an estimation problem using a regressor function $f(\cdot)$. Therefore, in regression tasks, $Y$ is the original (unknown) relevant variable and $f$ is the predictor function that estimates the values of the relevant variable. For regression problems the variable $Y$ is usually a real continuous variable and therefore takes uncountably many values. To approximate this variable through a probability distribution, and to apply some of the concepts of information theory, let us consider $Y$ as a countably infinite variable.
Given a variable $Y$ taking values on a possibly countably infinite alphabet $\mathcal{Y}$, a random variable $X$, and a function $f(\cdot)$ to predict the values $\hat{Y} = f(X)$, an upper bound of the error can be obtained in terms of the conditional entropy $H(Y|X)$ [13]:

$$\varepsilon \le \frac{1}{2} H(Y|X) \qquad (1)$$

where $\varepsilon = \min_{f:\mathcal{X}\to\mathcal{Y}} P[Y \ne f(X)]$ is the minimum error probability when estimating $Y$ given $X$. In the variable selection context, we define $I(X;Y)$ as the mutual information between $X$ and $Y$, i.e., a quantity that measures the knowledge that two random variables share [14]. If we have a subset of variables $\tilde{X} \subset X$, where $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_m)$ represents a subset of variables from the original representation with size $m < d$, the following inequality holds:

$$I(\tilde{X};Y) \le I(X;Y)$$

Using the above relationship and Equation (1), the identity $H(Y|X) = H(Y) - I(X;Y)$, and taking into account that the mutual information between two variables is always non-negative, the following upper bound for $\varepsilon$ is obtained:

$$\varepsilon \le \frac{1}{2}\left(H(Y) - I(\tilde{X};Y)\right) \qquad (2)$$

where $H(Y)$ is the entropy of $Y$. Note that the higher the value of $I(\tilde{X};Y)$, the lower the bound on the error, which also leads to the subset of selected variables that best represents the original set with respect to the target variable $Y$. This is the underlying principle in Equation (2) that has motivated different approaches in supervised variable selection for classification problems under the so-called Max-Dependency Criterion [15,16].
Since the relationship in Equation (2) still holds for countably infinite target variables, we may approximate it for continuous variables and use it in regression tasks. To do so, we apply the same metric as in [11] within a clustering algorithm based on Ward's linkage method [17], extrapolating this distance to continuous target variables in regression problems. Ward's linkage method has the property of generating minimum variance partitions between variables. The algorithm begins with $n$ initial clusters and, at each step, merges the two most similar groups into a new cluster. The number of clusters decreases at each iteration until $m$ clusters remain.
For practical purposes, it can be shown that the expectation of the Hamming distortion measure is equal to the generic probability of error [18]. Therefore, in the experiments, the error is approximated using the Root Mean Squared Error (RMSE) to validate the different subsets of variables selected for the regressor, since the squared RMSE is the empirical expectation of the square-error distortion measure, $E[d(Y, \hat{Y})]$.
The square-error distortion measure can be considered as an approximation of the Hamming distortion measure when estimating the true value of the variable $Y$ by the value $\hat{Y}$ using a square error distance $d(y, \hat{y}) = (\hat{y} - y)^2$. This type of distortion is an example of a normal distortion measure and allows $Y$ to be reproduced with zero distortion, that is, the probability of error is zero when the true $Y$ and the estimated $\hat{Y}$ values are the same.

Estimation of the Conditional Mutual Information for Continuous Regression Variables
Given a set of $N$ samples $(x_k, y_k)$, $k = 1, \ldots, N$, of a dataset in a $d$-dimensional variable space defined by a multivariate random variable $X = (X_1, \ldots, X_d)$, where a specific regressor $\hat{y}_k = f(x_k)$ can be applied, the conditional differential entropy $H(Y|X)$ can be written as [14]:

$$H(Y|X) = -\int\!\!\int p(x,y)\,\log p(y|x)\,dx\,dy \qquad (3)$$

Analogously, the entropy $H(Y)$ and the mutual information $I(X;Y)$ are defined as:

$$H(Y) = -\int p(y)\,\log p(y)\,dy \qquad (4)$$

and

$$I(X;Y) = \int\!\!\int p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy \qquad (5)$$

Let us consider that the joint probability distribution $p(x,y)$ can be approximated by the empirical distribution as [19]:

$$\hat{p}(x,y) = \frac{1}{N}\sum_{k=1}^{N}\delta(x - x_k,\, y - y_k) \qquad (6)$$

where $\delta(x - x_k, y - y_k)$ is the Dirac delta function. Considering the following property of the Dirac delta function:

$$\int f(x)\,\delta(x - x_k)\,dx = f(x_k)$$

valid for any continuous compactly supported function $f$, and substituting $\hat{p}(x,y)$ into Equation (3), we obtain:

$$\hat{H}(Y|X) = -\frac{1}{N}\sum_{k=1}^{N}\log p(y_k|x_k) \qquad (7)$$

From Equation (7), we can estimate the conditional entropies for each single variable $X_i$ and for all pairs of variables $X_i, X_j$. According to [11], given two variables $X_i$ and $X_j$, the following metric distance can be defined:

$$D_{CMI}(X_i,X_j) = I(X_i;Y|X_j) + I(X_j;Y|X_i) = H(Y|X_i) + H(Y|X_j) - 2H(Y|X_i,X_j) \qquad (8)$$

The conditional mutual information terms $I(X_i;Y|X_j)$ and $I(X_j;Y|X_i)$ represent how much information variable $X_i$ can predict about the regression variable $Y$ that variable $X_j$ cannot, and vice versa, respectively. Substituting Equation (7) into Equation (8), a dissimilarity matrix of distances $D_{CMI}(X_i,X_j)$ can be built.
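As a minimal sketch of how the sample-based entropy estimate and the distance between two variables could be computed, the following Python fragment assumes that the conditional density values $p(y_k|x_k)$ have already been estimated by some other routine. The function names are illustrative, and the distance is written through the standard identity $I(X_i;Y|X_j) = H(Y|X_j) - H(Y|X_i,X_j)$.

```python
import math

def conditional_entropy(cond_probs):
    """Sample estimate of H(Y|X): minus the mean of log p(y_k | x_k).

    cond_probs holds one estimated density value p(y_k | x_k) per sample.
    For continuous Y these are densities, so the result may be negative.
    """
    n = len(cond_probs)
    return -sum(math.log(p) for p in cond_probs) / n

def cmi_distance(h_y_given_i, h_y_given_j, h_y_given_ij):
    """D_CMI(X_i, X_j) = I(X_i;Y|X_j) + I(X_j;Y|X_i), via conditional entropies.

    I(X_i;Y|X_j) = H(Y|X_j) - H(Y|X_i,X_j), and symmetrically for the other term.
    """
    return h_y_given_i + h_y_given_j - 2.0 * h_y_given_ij
```

Filling a symmetric matrix with `cmi_distance` for every pair of variables would give the dissimilarity matrix used by the clustering step.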

Single-Output Regression
The assessment of $p(y|x)$ in Equation (7) is usually called Kernel Conditional Density Estimation (KCDE). This is a relatively recent active area of research that basically started with the works by Fan et al. [20] and Hyndman et al. [21]. One way to obtain $p(y|x)$ is to use a (training) dataset $(x_k, y_k)$ and a Nadaraya-Watson type kernel function estimator, as in [22], considering only the $y_k$ training values that are paired with values $x_k$:

$$\hat{p}(y|x) = \frac{\sum_{k=1}^{N} K_{h_1}(y - y_k)\,K_{h_2}(x - x_k)}{\sum_{k=1}^{N} K_{h_2}(x - x_k)} \qquad (9)$$

where $K_h$ is a compact symmetric probability distribution function, for instance, a Gaussian kernel. Note that there are two bandwidths: $h_1$ for the $K_{h_1}$ kernel and $h_2$ for the $K_{h_2}$ kernel. The Nadaraya-Watson estimator is consistent provided $h_1 \to 0$, $h_2 \to 0$, and $N h_1 h_2 \to \infty$ as $N \to \infty$ [21].
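In this work, the Parzen window function is used, where $h$ is the window width and $\Sigma$ is a covariance matrix of a $d$-dimensional vector $x$:

$$K_h(x - x_k) = \frac{1}{(2\pi)^{d/2}\,h^d\,|\Sigma|^{1/2}} \exp\left(-\frac{(x - x_k)^T \Sigma^{-1} (x - x_k)}{2h^2}\right) \qquad (10)$$

The performance in the estimation of the conditional density functions depends on a suitable choice of the bandwidths ($h_1$ and $h_2$). A data-driven bandwidth score previously used in the KCDE literature is the Mean Integrated Square Error (MISE) in the following form [23]:

$$MISE(h_1,h_2) = \int\left[\int \left(\hat{p}(y|x) - p(y|x)\right)^2 dy\right] p(x)\,dx \qquad (11)$$

However, the cross-validated log-likelihood defined in [22] will be used here because of its lower computational requirements:

$$L(h_1,h_2) = \frac{1}{N}\sum_{k=1}^{N}\log\left(\hat{p}^{(-k)}(y_k|x_k)\,\hat{p}^{(-k)}(x_k)\right) \qquad (12)$$

where $\hat{p}^{(-k)}$ means $\hat{p}$ evaluated with $(x_k, y_k)$ left out, and $\hat{p}(x)$ is the standard kernel density estimate over $x$ using the bandwidth $h_2$ in Equation (9). Maximizing the KCDE likelihood is equivalent to minimizing the MISE criterion. When substituting the Nadaraya-Watson type kernels into $L(h_1,h_2)$, the following result follows [22]:

$$L(h_1,h_2) = \frac{1}{N}\sum_{k=1}^{N}\log\left(\frac{\sum_{j \ne k} K_{h_1}(y_k - y_j)\,K_{h_2}(x_k - x_j)}{N - 1}\right) \qquad (13)$$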
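The estimator and its bandwidth score can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes one-dimensional Gaussian kernels, a product kernel over the input coordinates (that is, an identity covariance matrix), and a scalar output, and all function names are hypothetical. The leave-one-out score uses the fact that the product of the left-out conditional and marginal estimates reduces to a single sum of kernel products divided by $N - 1$.

```python
import math

def gauss(u, h):
    """One-dimensional Gaussian kernel with bandwidth h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def kx(xa, xb, h2):
    """Product Gaussian kernel over input coordinates (identity covariance assumed)."""
    prod = 1.0
    for a, b in zip(xa, xb):
        prod *= gauss(a - b, h2)
    return prod

def cond_density(y, x, data, h1, h2, skip=None):
    """Nadaraya-Watson estimate of p(y|x) from data = [(x_k, y_k), ...].

    If skip is an index, that sample is left out of both sums.
    """
    num = den = 0.0
    for k, (xk, yk) in enumerate(data):
        if k == skip:
            continue
        w = kx(x, xk, h2)
        num += gauss(y - yk, h1) * w
        den += w
    return num / den

def loo_log_likelihood(data, h1, h2):
    """Leave-one-out log-likelihood score L(h1, h2) of the bandwidth pair."""
    n = len(data)
    total = 0.0
    for k, (xk, yk) in enumerate(data):
        joint = sum(gauss(yk - yj, h1) * kx(xk, xj, h2)
                    for j, (xj, yj) in enumerate(data) if j != k)
        total += math.log(joint / (n - 1))
    return total / n
```

With very small bandwidths and distant samples the kernel sums can underflow to zero, so a practical version would work with log-sum-exp; that refinement is omitted here for brevity.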

Multi-Output Regression
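This strategy can be applied to regression datasets with more than one output. We can consider the relevant variable $\mathbf{Y} = (Y_1, \ldots, Y_l)$ as a multivariate variable where each instance $x_k$ of the training set has $l$ outputs $\mathbf{y}_k = (y_{k,1}, \ldots, y_{k,l})$. In this way, we can calculate the conditional entropies for a single variable $X_i$, for all pairs of variables $X_i, X_j$, and for the multivariate output variable $\mathbf{Y}$. The conditional probability, the conditional entropy and the $L(h_1,h_2)$ function are given by the following formulae, respectively:

$$\hat{p}(\mathbf{y}|x) = \frac{\sum_{k=1}^{N} K_{h_1}(\mathbf{y} - \mathbf{y}_k)\,K_{h_2}(x - x_k)}{\sum_{k=1}^{N} K_{h_2}(x - x_k)} \qquad (14)$$

$$\hat{H}(\mathbf{Y}|X) = -\frac{1}{N}\sum_{k=1}^{N}\log \hat{p}(\mathbf{y}_k|x_k) \qquad (15)$$

and

$$L(h_1,h_2) = \frac{1}{N}\sum_{k=1}^{N}\log\left(\frac{\sum_{j \ne k} K_{h_1}(\mathbf{y}_k - \mathbf{y}_j)\,K_{h_2}(x_k - x_j)}{N - 1}\right) \qquad (16)$$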

Optimization Strategy
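In order to obtain the maximum of $L(h_1,h_2)$, a constrained optimization method is applied. This method starts with an approximation of the Hessian of the Lagrangian function of the minimization problem, built with a quasi-Newton updating method. The Lagrangian equation is translated into a Karush-Kuhn-Tucker (KKT) formulation. Constrained quasi-Newton methods guarantee their convergence by accumulating second-order information regarding the KKT equations using a quasi-Newton updating procedure. These methods are commonly referred to as Sequential Quadratic Programming (SQP) methods, since a QP subproblem is solved at each major iteration. The $(h_1,h_2)$ pair could be obtained taking into account that:

$$\max_{h_1,h_2} L(h_1,h_2) = -\min_{h_1,h_2}\left(-L(h_1,h_2)\right) \qquad (17)$$

Therefore, writing the objective to be minimized simply as $L(h_1,h_2)$ (that is, after the sign change above), assume the general minimization problem $\min_{h_1,h_2} L(h_1,h_2)$, subject to:

$$g_i(h_1,h_2) = 0,\; i = 1,\ldots,m_e; \qquad g_i(h_1,h_2) \le 0,\; i = m_e+1,\ldots,m \qquad (18)$$

where $m_e$ is the number of equality constraints, and $m$ is the total number of equality and inequality constraints. Using the following auxiliary Lagrangian function:

$$C(h_1,h_2,\lambda) = L(h_1,h_2) + \sum_{i=1}^{m}\lambda_i\,g_i(h_1,h_2) \qquad (19)$$

the KKT conditions can be written as:

$$\nabla L(h_1,h_2) + \sum_{i=1}^{m}\lambda_i \nabla g_i(h_1,h_2) = 0; \qquad \lambda_i\,g_i(h_1,h_2) = 0,\; i = 1,\ldots,m; \qquad \lambda_i \ge 0,\; i = m_e+1,\ldots,m \qquad (20)$$

A Quadratic Programming (QP) iterative subproblem can be defined as:

$$\min_{d}\; \frac{1}{2} d^T H d + \nabla L(h_1,h_2)^T d \qquad (21)$$

subject to:

$$\nabla g_i(h_1,h_2)^T d + g_i(h_1,h_2) = 0,\; i = 1,\ldots,m_e; \qquad \nabla g_i(h_1,h_2)^T d + g_i(h_1,h_2) \le 0,\; i = m_e+1,\ldots,m \qquad (22)$$

where $d$ would be the new direction to be accumulated to the solution, and $H$ the Hessian of $C(h_1,h_2,\lambda)$.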
In order to solve this Quadratic Programming (QP) problem, a projection method [24] is adopted. The iterative rule would be expressed as:

$$(h_1,h_2)_{k+1} = (h_1,h_2)_k + \alpha_k d_k \qquad (23)$$

where $\alpha_k$ is a constant.
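To give a feel for a bounded bandwidth search, the following Python sketch uses a much simpler technique than the SQP solver described above: projected gradient ascent with finite-difference gradients and box constraints. It is only an illustration under stated assumptions; the objective passed in stands in for $L(h_1,h_2)$, and the step size, iteration count and function names are arbitrary choices, not values from the paper.

```python
def project(h, lower, upper):
    """Clip each coordinate of h to its [lower, upper] interval."""
    return [min(max(v, lo), up) for v, lo, up in zip(h, lower, upper)]

def numeric_gradient(f, h, eps=1e-6):
    """Central finite-difference gradient of f at h."""
    g = []
    for i in range(len(h)):
        hp = list(h)
        hm = list(h)
        hp[i] += eps
        hm[i] -= eps
        g.append((f(hp) - f(hm)) / (2.0 * eps))
    return g

def projected_ascent(f, h0, lower, upper, alpha=0.1, iters=300):
    """Maximize f over a box: take a gradient step, then project back into the box."""
    h = project(h0, lower, upper)
    for _ in range(iters):
        g = numeric_gradient(f, h)
        h = project([v + alpha * gi for v, gi in zip(h, g)], lower, upper)
    return h
```

For a toy concave objective whose unconstrained maximum lies outside the box in one coordinate, the iterate settles on the corresponding bound, which is the behaviour expected of any bound-constrained bandwidth search.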

Summary of the Methodology and Algorithmic Structure
The method proposed in this paper applies a hierarchical clustering strategy based on Ward's linkage method [17] to find clusters of variables using the metric distance between pairs of variables, $D_{CMI}(X_i,X_j)$ [Equation (8)]. In order to use this distance, the conditional entropies $H(\mathbf{Y}|X_i)$ and $H(\mathbf{Y}|X_i,X_j)$ have to be assessed. The conditional entropies are estimated using Equation (15), where $p(\mathbf{y}_k|x_k)$ is obtained using a Nadaraya-Watson type kernel function estimator, as can be seen in Equation (14). Each one of these kernels is defined by a bandwidth: $h_1$ for the multi-output continuous relevant variable $\mathbf{Y}$, and $h_2$ for the multivariate random variable $X$. The best $(h_1,h_2)$ pairs are obtained by maximizing Equation (16). An algorithmic structure of the methodology presented in this paper for the multi-output case follows (this structure is identical for the single-output case):
(1) Kernel width estimation. Obtain, for each $(\mathbf{Y}; X_i)$ and $(\mathbf{Y}; X_i, X_j)$ tuple, the pair of parameters $(h_1,h_2)$ that maximizes $L(h_1,h_2)$ [Equation (16)].
(2) Kernel density estimation. Obtain the Nadaraya-Watson type kernel density estimators $K_{h_1}(\mathbf{y} - \mathbf{y}_k)$ and $K_{h_2}(x - x_k)$ by applying Equation (10).
(3) Assessment of the posterior probabilities. Estimate $p(\mathbf{y}|x)$ using Equation (14).
(4) Estimation of the conditional entropies. Obtain, for each variable $X_i$ and every possible combination $(X_i, X_j)$, the conditional entropies using Equation (15).
(5) Dissimilarity matrix construction. Assess the distance $D_{CMI}(X_i,X_j)$ for the multi-output relevant variable $\mathbf{Y}$.
(6) Clustering. Apply a hierarchical clustering strategy based on Ward's linkage method to find clusters using $D_{CMI}(X_i,X_j)$. The number of clusters is determined by the number of variables to be selected.
(7) Representative selection. For each cluster $C_i$, select the variable $\tilde{X}_i \in C_i$ so that:

$$\tilde{X}_i = \arg\max_{X_j \in C_i} I(X_j;\mathbf{Y}) \qquad (24)$$

that is, the variable with the highest mutual information with respect to $\mathbf{Y}$.
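The last two steps (clustering and representative selection) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: it assumes the dissimilarity matrix and the per-variable mutual information values are already available, and it implements Ward's criterion through the Lance-Williams update on squared dissimilarities, which is a common way to apply Ward's linkage to a precomputed matrix. Function names are hypothetical.

```python
def ward_clusters(dist, m):
    """Agglomerative clustering with Ward's update until m clusters remain.

    dist is a symmetric matrix of pairwise dissimilarities (list of lists).
    Returns a list of clusters, each a list of variable indices.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    # Squared dissimilarities between active clusters, keyed by sorted id pairs.
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > m:
        a, b = min(d2, key=d2.get)          # closest pair of clusters
        na, nb = len(clusters[a]), len(clusters[b])
        dab = d2[(a, b)]
        new = {}
        for c in clusters:
            if c in (a, b):
                continue
            nc = len(clusters[c])
            dac = d2[(min(a, c), max(a, c))]
            dbc = d2[(min(b, c), max(b, c))]
            # Lance-Williams update for Ward's minimum-variance criterion.
            new[(c, next_id)] = ((na + nc) * dac + (nb + nc) * dbc - nc * dab) / (na + nb + nc)
        d2 = {k: v for k, v in d2.items() if a not in k and b not in k}
        d2.update(new)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return list(clusters.values())

def select_representatives(clusters, mutual_info):
    """Pick, in each cluster, the variable with the highest mutual information with Y."""
    return sorted(max(c, key=lambda i: mutual_info[i]) for c in clusters)
```

For example, with four variables where variables 0 and 1 are close to each other and variables 2 and 3 are close to each other, asking for two clusters groups them accordingly, and the representative of each group is the member with the larger mutual information value.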

Experimental Validation
The proposed method, hereafter called CMI_Dist, has been compared against other state-of-the-art single-output and multi-output methods, as described below.

Methods for Single-Output Datasets
Three methods were considered in this case. Among the many single-output variable selection methods for regression available in the literature, FSR and EN are the most commonly used and are a standard reference in the field. However, both FSR and EN assume a linear regression model, which may be a disadvantage in a general scenario. The third method (PS-FS) is particularly useful when the dimensionality of the input space is high, and, like the method proposed herein, it does not assume any particular regression model for the selection process.
• The Monteiro et al. method [25], based on a Particle Swarm Optimization (PSO) strategy [26] (Particle-Swarms Variable Selection, PS-FS). It is a wrapper-type method that performs variable selection using an adaptation of an evolutionary computation technique developed by Kennedy and Eberhart [26]. For further details, see [25].

• Forward Stepwise Regression (FSR). It assumes a linear regression model. The significance of each variable is determined from its t-statistic, with the null hypothesis that the correlation between $Y$ and $X_i$ is 0. The variables are ranked using the p-values of the t-statistics, and a series of reduced linear models is built following this order.
• Elastic Net (EN). It is a sparsity-based regularization scheme that simultaneously performs regression and variable selection. It proposes the use of a penalty which is a weighted sum of the $l_1$-norm and the square of the $l_2$-norm of the coefficient vector formed by the weights of each variable. For further details, see [27].
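As a small illustration of the Elastic Net penalty described in the list above, the following Python function computes a weighted sum of the $l_1$-norm and the squared $l_2$-norm of a coefficient vector. The two weights, here called `lam1` and `lam2`, are hypothetical names; implementations differ in how they parametrize this trade-off.

```python
def elastic_net_penalty(weights, lam1, lam2):
    """Weighted sum of the l1-norm and the squared l2-norm of the coefficients."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam1 * l1 + lam2 * l2_squared
```

The $l_1$ term is what drives some coefficients exactly to zero, which is why the scheme performs variable selection as a by-product of the fit.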

Methods for Multi-Output Datasets
The method proposed by Mladen Kolar and Eric P. Xing [28] was used for comparison in the case of multi-output regression datasets. This method is based on the Simultaneous Orthogonal Matching Pursuit (S-OMP) procedure for sparsistent variable selection in ultra-high dimensional multi-task regression problems. We will call this method MO-FSR hereafter. Although there are some multi-output selection methods for regression, most of them provide a fixed number of selected variables, with limited possibility of extracting a ranking or variable subsets of different sizes [29]. The method proposed in this paper allows this possibility. This property can be useful when obtaining (or analyzing the effect of) a particular degree of dimensionality reduction. The method proposed in [28] also allows a predefined number of variables to be selected.

Dataset Description
Six datasets were used to test the variable selection methods, four of them with only one output and two of them of a multi-output nature. Two of the single-output datasets are of hyperspectral nature, corresponding to a remote sensing campaign (SEN2FLEX, [30]).

Single-Output Datasets
• CASI-THERM. It consists of the reflectance values of image pixels that were taken by the Compact Airborne Spectrographic Imager (CASI) sensor [30]. Corresponding thermal measurements for these pixels were also made. The training set is formed by 402 data points. The testing set is formed by 390 data points. The CASI sensor reflectance curves are formed by 144 bands between 370 and 1049 nm.
• CASI-AHS-CHLOR. It consists of the reflectance values of image pixels that were taken by the CASI and the Airborne Hyper-spectral Scanner (AHS) [30] sensors. Corresponding chlorophyll measurements for these pixels were also performed. The training set is formed by 2205 data points. The testing set is formed by 2139 data points. AHS images consist of 63 bands between 455 and 2492 nm. Therefore, the input dimensionality of this set is 207 (the sum of the bands corresponding to the CASI and AHS sensors).
• Bank32NH. It consists of 8192 cases, 4500 for training and 3692 for testing, with 32 continuous variables, corresponding to a simulation of how bank customers choose their banks. It can be found in the DELVE Data Repository [31].
• Boston Housing. Dataset created by D. Harrison et al. [32]. It is related to the task of predicting housing values in different areas of Boston. The whole dataset consists of 506 cases and 13 continuous variables. It can be found in the UCI Machine Learning Repository [33].
The total number of training and testing samples as well as the input number of variables are given in Table 1.

Results and Discussion
In order to validate the subsets of variables selected by the different methods considered, the ε-Support Vector Regression (ε-SVR) regressor was used with a radial basis function kernel, because it has already been developed for single-output [34] as well as for multi-output [35] datasets. In order to estimate the best values for the parameters of the regressor, an exhaustive grid search using equally spaced steps in the logarithmic space of the tuning parameters was performed.
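A grid search with equally spaced steps in logarithmic space can be sketched as follows. The parameter names used in the usage note (`C`, `gamma`) are assumptions for illustration, since the text does not list the tuned parameters or their ranges, and `score` stands for any validation error returned by a regressor.

```python
import itertools

def log_grid(low_exp, high_exp, steps, base=10.0):
    """Return `steps` values (steps >= 2) equally spaced in log space."""
    width = (high_exp - low_exp) / (steps - 1)
    return [base ** (low_exp + i * width) for i in range(steps)]

def grid_search(score, grids):
    """Evaluate `score` (lower is better) on every combination of grid values.

    grids maps a parameter name to its list of candidate values.
    Returns (best_error, best_params).
    """
    names = sorted(grids)
    best = None
    for values in itertools.product(*(grids[n] for n in names)):
        params = dict(zip(names, values))
        err = score(params)
        if best is None or err < best[0]:
            best = (err, params)
    return best
```

A call such as `grid_search(validation_error, {"C": log_grid(-2, 2, 5), "gamma": log_grid(-3, 1, 5)})` would evaluate 25 combinations, where `validation_error` is a user-supplied function.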

Single-Output Regression Datasets
In order to assess the kernel bandwidths $(h_1,h_2)$ for the estimation of the conditional probabilities, the optimization strategy explained in Section 2.2.3 was applied. The starting values were fixed at $h_{1,0} = h_{2,0} = \frac{1}{2\log(N)}$, as in [36], and lower and upper bounds were imposed on both bandwidths. For the assessment of $p(y|x)$, the covariance matrix considered was diagonal: $\Sigma = \mathrm{diag}(\sigma_i^2, \sigma_j^2)$, where $\sigma_i^2$ and $\sigma_j^2$ are the variances of variables $i$ and $j$, respectively, for the training set. For the CASI-THERM, CASI-AHS-CHLOR and Bank32NH datasets, there was no further partition because the training and testing sets were already given. For the Boston Housing dataset, a 10-fold cross-validation strategy was used to obtain the Root Mean Squared Error (RMSE).
Figure 1 represents the dissimilarity matrix $D_{CMI}$ as a gray level image for the CASI-AHS-CHLOR hyperspectral dataset, with 207 bands. Figure 1 shows the existence of intervals with similar values of the dissimilarity measure that determine families of variables in different regions of the electromagnetic spectrum. Figures 2 and 3 show the RMSE given by the ε-SVR method for the four single-output datasets and the first 20 variables selected (13 variables in the case of Boston Housing) by each of the variable selection methods tested. A maximum of $K = 20$ variables was chosen for the plots because beyond this number the error decreases very slowly. The x-axis represents the subset of variables selected, whereas the y-axis plots the RMSE for each selection method. In order to analyze the statistical significance of the results from all the methods used in the comparison, Friedman and Quade tests [37] were applied to the results with a significance level of $p = 0.005$. These techniques measure the significance of the statistical difference between several algorithms that provide results on the same problem, using rankings of the results obtained by the algorithms to be compared. For each subset of variables, the different errors are ranked from one to the number of methods. In this case the comparison is made over four methods. The approach with the lowest error has rank 1, while the worst approach has rank 4. When two or more methods have the same value, the average of their ranks is assigned to them.
The Quade test conducts a weighted ranking analysis of the results [37]. Both statistical methods use the Fisher distribution to discern the statistical significance of results. The Fisher distribution critical value was estimated for the four methods and over the first $K = 5$, $K = 10$, $K = 15$ and $K = 20$ variables. The Fisher distribution has $(N_M - 1)$ and $(N_M - 1) \cdot (N_B - 1)$ degrees of freedom, where $N_M$ is the number of methods and $N_B$ the number of variable subsets on which the ranking is applied. Therefore, for the different rows in the table ($K = 5$, $K = 10$, $K = 15$ and $K = 20$), we obtain the values $F(3, 12) = 7.20$, $F(3, 27) = 5.36$, $F(3, 42) = 4.92$, and $F(3, 57) = 5.06$. The table shows the statistical significance as positive (+) when the value of the test is greater than the critical value of the Fisher distribution, and negative (−) otherwise.
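The ranking procedure with averaged ties, together with an F-distributed version of the Friedman statistic, can be sketched in Python as follows. The sketch assumes the Iman-Davenport form of the statistic, which has exactly the $(N_M - 1)$ and $(N_M - 1)(N_B - 1)$ degrees of freedom quoted above; whether this is the precise variant used in [37] is an assumption. The Quade test, which additionally weights each row, is not shown.

```python
def average_ranks(errors):
    """Rank one row of errors (lowest error gets rank 1), averaging ranks over ties."""
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    ranks = [0.0] * len(errors)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and errors[order[end + 1]] == errors[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0       # mean of the 1-based positions pos..end
        for t in range(pos, end + 1):
            ranks[order[t]] = avg
        pos = end + 1
    return ranks

def friedman_f(table):
    """F statistic for table[b][m] = error of method m on variable subset b.

    Returns (F, df1, df2). Undefined (division by zero) when every row
    ranks the methods identically.
    """
    nb, nm = len(table), len(table[0])
    rows = [average_ranks(r) for r in table]
    mean_rank = [sum(r[m] for r in rows) / nb for m in range(nm)]
    chi2 = 12.0 * nb / (nm * (nm + 1)) * (
        sum(r * r for r in mean_rank) - nm * (nm + 1) ** 2 / 4.0)
    f = (nb - 1) * chi2 / (nb * (nm - 1) - chi2)
    return f, nm - 1, (nm - 1) * (nb - 1)
```

With four methods and five variable subsets the function returns degrees of freedom (3, 12), matching the first case listed above; the resulting F value would then be compared with the critical value at the chosen significance level.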
From the rest of the results in the experiments, other interesting points deserve our attention:
• In Table 3 we see that the proposed method CMI_Dist obtains better performance than the rest of the methods in all cases (5, 10, 15 and 20 variables) for the CASI-AHS-CHLOR and CASI-THERM datasets, and in two out of the three cases (10 and 13 variables) for the Boston Housing dataset.
• In CMI_Dist the clustering process plays an important role, which can be interpreted as a global strategy to obtain subsets of variables with high relevance in the estimation of the relevant variable $Y$ obtained by the ε-SVR algorithm. The dissimilarity space built from the conditional mutual information distances makes it possible to find relationships between variables.
• The PS-FS method is the second best one in most cases, followed by the EN method. PS-FS is a wrapper-type method based on a Neural Network regressor that performs an optimal search where the error of the regressor acts as the search criterion.
• FSR is the worst method in all cases, with the exception of $K = 10$ and $K = 13$ for Boston Housing. For the Bank32NH dataset, all methods provide similar results.
• In two out of the four single-output regression datasets that appear in Table 3, when the dimension of the input variable space increases, the performance of the regression methods nevertheless decreases. One reason could be the Hughes phenomenon (the curse of dimensionality). Noise and variables considered as noisy may also degrade the quality of the regression.
• Table 4 shows the average RMSE over the four datasets for different sizes of the variable subsets selected. CMI_Dist provides the best results in all cases, while the PS-FS method is the second best method. The differences in RMSE ranked for the four methods are not significant for the first 5 variables, but they are significant when selecting 10 to 15 variables. Thus, the difference between the methods increases with the number of selected variables, although in the case of the CASI-THERM dataset the statistical tests suggest that there are no significant differences among the methods.
The selection of the first variable and the first two variables is better when using CMI_Dist than with the rest of the methods. In this case, the clustering strategy plays an important role in the formation of different groups of variables, obtaining better results than a greedy selection algorithm such as FSR. Compared with the PS-FS and EN methods, the advantage of the CMI_Dist method lies in a proper adjustment of the parameters of the Nadaraya-Watson function estimator and its use through a distance metric in the variable space that takes into account the internal relationships between variables.

Multi-Output Regression Datasets
Sánchez et al. showed in [35] that SVR can be generalized to solve the problem of regression estimation for multiple (output) variables (hereafter called MO-SVR). In fact, the use of a multidimensional regression tool helps in exploiting the dependencies between variables and makes the retrieval of each output variable less vulnerable to noise and measurement errors. Treating all the variables together may allow each of them to be estimated accurately even when data are scarce. The minimization of Equation (25) (Root-Mean Sum of Squares of the Diagonal, RMSSD) was used as a criterion to select the parameters of the MO-SVR regressor [38]:

$$RMSSD = \sqrt{\frac{1}{N}\,\mathrm{trace}\left[(Y - Y_p)^T (Y - Y_p)\right]} \qquad (25)$$

where $Y_p$ is the predicted output matrix in the multi-output case, $Y$ the corresponding original output matrix, trace is the trace of the matrix and $N$ is the number of data points.
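A direct way to compute this figure of merit is sketched below in Python, assuming RMSSD is the square root of $\mathrm{trace}[(Y - Y_p)^T (Y - Y_p)]/N$, which is how the description above reads. The trace of that product is simply the sum of all squared residuals over samples and outputs, so no matrix product is needed. The function name is illustrative.

```python
import math

def rmssd(y_true, y_pred):
    """Root-Mean Sum of Squares of the Diagonal for N x l output matrices.

    y_true and y_pred are lists of rows, one row of l outputs per data point.
    """
    n = len(y_true)
    # trace((Y - Yp)^T (Y - Yp)) equals the sum of all squared residuals.
    total = sum((a - b) ** 2
                for row_t, row_p in zip(y_true, y_pred)
                for a, b in zip(row_t, row_p))
    return math.sqrt(total / n)
```

With a single output column the expression reduces to the usual RMSE, which makes the single-output and multi-output figures of merit directly comparable in form.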
Figure 4 shows the RMSSD error of the proposed method against the method by Kolar and Xing [28]. Comparison results for different subsets of variables can be seen in Table 5. In this case, the Fisher distribution takes the values $F(1, 4) = 31.32$, $F(1, 9) = 13.61$, $F(1, 14) = 11.06$, and $F(1, 19) = 10.07$. The table shows the statistical significance as positive (+) when the value of the test is greater than the critical value of the Fisher distribution, and negative (−) otherwise. From Table 5, we can also see that:
• The CMI_Dist method outperforms MO-FSR for the Parkinson dataset, while MO-FSR outperforms CMI_Dist for the Tecator dataset. These experiments show that our method is comparable to MO-FSR.
• The CMI_Dist method does not assume that the input and output data are linearly related, whereas MO-FSR does. Therefore, the performance of the selector may depend on the relationship between the input and output values for the datasets.
• The CMI_Dist method outperforms MO-FSR for the first three selected variables in the Parkinson dataset, whereas there is no difference for the rest of the selected variables, as can be seen in the Friedman and Quade tests in Table 5. However, in the case of the Tecator dataset (see Figure 4b), MO-FSR outperforms CMI_Dist up to variable 10, and the two methods tend to become equal afterwards.

Conclusions
This paper presents a filter-type variable selection technique for single and multi-output regression datasets, using a distance measure based on information theory. The main contributions of the paper are: (a) the variable selection method proposed in [11] for classification has been extended to single-output and multi-output regression problems involving selection of variables; (b) information theoretic criteria have been applied to extend the variable selection methodology to continuous variables; (c) a method to estimate the conditional entropy for single and multi-output continuous variables has been defined.
The proposed method outperforms the other methods used in the comparison in the case of single-output regression datasets, and it is also competitive in the case of the multi-output datasets considered. The method generalizes naturally to more than one output variable, because the conditional entropy can be directly extended to multivariate output datasets.
Variable selection in multi-output regression is a novel area of research that requires a deeper understanding in terms of the development of new selection techniques as well as in terms of the analysis of the inner structure of the datasets involved.

Figure 1. Dissimilarity matrix of a hyperspectral dataset with 207 input bands.

Table 1. Number of training and testing samples, and number of input variables for the single-output regression datasets.

• Parkinson. The objective is to predict two Parkinson disease symptom scores (motor UPDRS and total UPDRS) for patients, based on 19 bio-medical variables, one of them being the label associated with the patient number.
• Tecator. The data consists of 215 near-infrared absorbance spectra of meat samples, recorded on a Tecator Infratec Food Analyzer. Each observation consists of a 100-channel absorbance spectrum in the wavelength range [850, 1050] nm, and the content of water, fat and protein. The absorbance is equal to the $-\log_{10}$ of the transmittance measured by the spectrometer. The three (output) contents, measured in percentage, are determined by analytic chemistry.
The total number of training and testing samples as well as the input and output number of variables are provided in Table 2.

Table 2. Number of training and testing samples, and number of input and output variables for the multi-output regression datasets.

Table 4. Average RMSE over the four datasets and the first 5, 10, 15 and 20 variables. Boston Housing has 13 variables and is not considered for the case of K = 20.