1. Introduction
The problem of feature selection is equivalent to the problem of nonparametric estimation of a discrete joint distribution $\mathbb{P}(\mathit{X},Y)$ from a sample of n pairs $\{({\mathit{x}}_{1},{\mathit{y}}_{1}),\cdots ,({\mathit{x}}_{n},{\mathit{y}}_{n})\}$, in which $\mathit{X}$ is an m-dimensional real or integer vector of features, i.e., measures of phenomenon characteristics, and Y is a natural number that represents the class of $\mathit{X}$. The problem is to find a subspace of characteristics $\mathit{\chi}$ with dimension $k<m$ that permits proper estimation of $\mathbb{P}(\mathit{\chi},Y)$, emphasizing one of its important characteristics: for example, that $\mathit{\chi}$ is a good predictor of Y, i.e., for all $\mathit{x}$, the conditional distribution $\mathbb{P}(Y\mid \mathit{\chi}=\mathit{x})$ has its mass concentrated around some value $Y={y}_{\mathit{x}}$.
Formally, let $\mathit{X}$ be an m-dimensional feature vector and Y a single variable. Let $\mathit{\chi}$ be a feature vector, whose features are also in $\mathit{X}$, and denote $\mathcal{P}(\mathit{X})$ as the set of all feature vectors whose features are also in $\mathit{X}$. In this scenario, we define the classical approach to feature selection, in which the search space is the Boolean lattice of features sets (BLFS), as follows.
Definition 1. Given a variable Y, a feature vector $\mathit{X}$ and a cost function ${C}_{Y}:\mathcal{P}(\mathit{X})\to {\mathbb{R}}^{+}$ calculated from the estimated joint distribution of $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ and Y, the classical approach to feature selection consists in finding a subset $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ of features such that ${C}_{Y}(\mathit{\chi})$ is minimum.
In light of Definition 1, we note that some families of feature selection algorithms may be considered classical approaches. In fact, according to the taxonomy of feature selection presented in [1], for example, feature selection algorithms may be divided into three families: filters, wrappers and embedded methods, the last two being classical approaches to feature selection. Indeed, in wrapper methods, the feature selection algorithm exists as a wrapper around a learning machine (or induction algorithm), so that a subset of features is chosen by evaluating its performance on the machine [2]. Furthermore, in embedded methods, a subset of features is also chosen based on its performance on a learning machine, although the feature selection and the learning machine cannot be separated [3]. Therefore, both wrapper and embedded methods satisfy Definition 1, as the performance on the learning machine may be established by a cost function, so that these methods are special cases of the classical approach to feature selection. For more details about these methods, see [1,2,3,4,5,6,7].
Under this approach to feature selection, a classical choice for the cost function is the estimated mean conditional entropy [8], which measures the mean mass concentration of the conditional distribution. Great mass concentration indicates that the chosen features define equivalence classes with almost homogeneous classifications; hence it is a good choice to represent the complete set of features. For a joint distribution estimated from a sample of size n, the curve formed by this cost function applied along a chain of the BLFS has a U shape and is called a U-curve. For a small set of features, on the left side of the U-curve, the cost is high because the small number of features generates large equivalence classes that mix labels, which leads to high entropy. For a large set of features, on the right side of the U-curve, the cost is also high, because severe conditional distribution estimation error leads to high entropy. Therefore, the ideal set of features in this chain is at the U-curve minimum, which is achieved by a set that contains the maximum number of features whose corresponding distribution estimation is not seriously affected by estimation error. Choosing the best set of features consists of comparing the minima of all lattice chains. There are some algorithms that find the absolute minimum of this NP-hard problem [9,10]. There are also some heuristics that give approximate solutions, such as Sequential Forward Selection (SFS) [11], which adds features progressively until it finds a local minimum, and Sequential Forward Floating Selection (SFFS) [11], which at first adds features, but afterwards takes some of them out and adds others, trying to improve the first local minimum found.
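As an illustration, the greedy forward step of SFS can be sketched as follows; `toy_cost` and the feature names are hypothetical stand-ins for the estimated mean conditional entropy evaluated on real data, not the paper's implementation.

```python
# Sketch of Sequential Forward Selection (SFS) with a generic cost
# function; `cost` is a hypothetical stand-in for the estimated
# mean conditional entropy used as cost in the classical approach.
def sfs(features, cost):
    """Greedily add the feature that most reduces the cost until no
    single addition improves it (a local minimum on a lattice chain)."""
    selected = []
    best = cost(selected)
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            return selected, best
        scored = [(cost(selected + [f]), f) for f in candidates]
        c, f = min(scored)
        if c >= best:  # no improvement: local minimum reached
            return selected, best
        selected.append(f)
        best = c

# Toy cost table: features 'a' and 'b' jointly minimize the cost, and
# adding 'c' increases it again (roughly U-shaped along a chain).
toy_cost = {(): 1.0, ('a',): 0.6, ('b',): 0.7, ('c',): 0.9,
            ('a', 'b'): 0.4, ('a', 'c'): 0.65, ('b', 'c'): 0.75,
            ('a', 'b', 'c'): 0.5}
sel, val = sfs(['a', 'b', 'c'], lambda s: toy_cost[tuple(sorted(s))])
```

With this toy cost, the search stops at {a, b}, since adding 'c' would raise the cost above the current minimum.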
The main goal of the classical approach is to select the features that are most related to Y according to a metric defined by a cost function. Although useful in many scenarios, this approach may not be suitable in some applications in which it is of interest to select not only the features that are most related to Y, but also the features' values that most influence Y, or that are most prone to have a specific value y of Y. Therefore, it is relevant to extend the search space of the classical approach to an extended space that also contemplates the range of the features, so that we may select features and subsets of their range. This extended space is a collection of Boolean lattices of ordered pairs (features, associated values) (CBLOP) indexed by the elements of the BLFS. In other words, for each $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ we have the Boolean lattice that represents the powerset of its range ${R}_{\mathit{\chi}}$, denoted by $\mathcal{P}({R}_{\mathit{\chi}})$, and the CBLOP is the collection of these Boolean lattices, i.e., $\{\mathcal{P}({R}_{\mathit{\chi}}):\mathit{\chi}\in \mathcal{P}(\mathit{X})\}$. If $\mathit{X}=({X}_{1},{X}_{2})$ are Boolean features, then its CBLOP is as the one in Figure 1. Note that the circle nodes and solid lines form a BLFS, that around each circle node there is an associated Boolean lattice representing the powerset of ${R}_{\mathit{\chi}}$, for a $\mathit{\chi}\in \mathcal{P}(\mathit{X})$, and that the whole tree is a CBLOP. (Note that if the features $\mathit{X}$ are not Boolean, we still have that $\mathcal{P}({R}_{\mathit{\chi}})$ is a Boolean lattice for all $\mathit{\chi}\in \mathcal{P}(\mathit{X})$, so that the search space of the algorithm is always a CBLOP in this framework, regardless of the features' range.)
A downside of this extension is that the sample size needed to perform feature selection on the extended space is greater than that needed on the associated BLFS, which demands more refined optimal and sub-optimal algorithms in order to select features and subsets of their range. Nevertheless, the extended space brings advances to the state of the art in feature selection, as it extends the method to a new variety of applications. As an example of such an application, we may cite market segmentation. Suppose it is of interest to segment a market according to the products that each market share is most prone to buy. Denote by Y a discrete variable that represents the products sold by a company, i.e., $\mathbb{P}(Y=y)$ is the probability of an individual of the market buying the product $y\in \{1,\cdots ,p\}$ sold by the company, and by $\mathit{X}$ the socio-economic and demographic characteristics of the people that compose the market. In this framework, it is not enough to select the characteristics (features) that are most related to Y: we need to select, for each product (value of Y), the characteristics $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ and their values $W\in \mathcal{P}({R}_{\mathit{\chi}})$ (the profile of the people who are prone to buy the given product), so that feature selection must be performed on a CBLOP instead of a BLFS.
We call the approach to feature selection in which the search space is a CBLOP multi-resolution, for we may choose the features based on a global cost function calculated for each $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ (low resolution); or choose the features and a subset of their range based on a local cost function calculated for each $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ and $W\in \mathcal{P}({R}_{\mathit{\chi}})$ (medium resolution); or choose the features and a point of their range based on a local cost function calculated for each $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ and $\mathit{x}\in {R}_{\mathit{\chi}}$ (high resolution). Formally, the multi-resolution approach to feature selection may be defined as follows.
Definition 2. Given a variable Y, a feature vector $\mathit{X}$ and cost functions ${C}_{Y}^{k}:\mathcal{P}(\mathit{X})\times {\mathbb{R}}^{k}\to {\mathbb{R}}^{+},k\in \{1,\cdots ,m\}$, calculated from the estimated joint distribution of $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ and Y, the multi-resolution approach to feature selection consists in finding a subset $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ of $k\in \{1,\cdots ,m\}$ features and a $W\in \mathcal{P}({R}_{\mathit{\chi}})$ such that ${C}_{Y}^{k}(\mathit{\chi},W)$ is minimum.
The cost functions ${C}_{Y}^{k}(\mathit{\chi},W)$ considered in this paper measure the local dependence between $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ of length $k\in \{1,\cdots ,m\}$ and Y restricted to the subset $W\in \mathcal{P}({R}_{\mathit{\chi}})$, i.e., for $\mathit{\chi}\in W$. More specifically, our cost functions are based on the Local Lift Dependence Scale, a scale for measuring variable dependence in multiple resolutions. In this scale we may measure variable dependence both globally and locally. On the one hand, global dependence is measured by a coefficient that summarizes it. On the other hand, local dependence is measured for each subset $W\in \mathcal{P}({R}_{\mathit{\chi}})$, again by a coefficient. Therefore, if the cardinality of ${R}_{\mathit{\chi}}$ is N, we have ${2}^{N}-1$ dependence coefficients: one global and ${2}^{N}-2$ local, each one measuring the influence of $\mathit{\chi}$ on Y restricted to a subset of ${R}_{\mathit{\chi}}$. Furthermore, the Local Lift Dependence Scale also provides a propensity measure for each point of the joint range of $\mathit{\chi}$ and Y. Note that the dependence is indeed measured in multiple resolutions: globally, for each subset of ${R}_{\mathit{\chi}}$, and pointwise.
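To make the multi-resolution search space concrete, a brute-force search over a CBLOP can be sketched as below. The feature names, ranges and cost function are hypothetical, and the enumeration is exponential in both the number of features and the range size, so this is viable only for tiny examples.

```python
from itertools import combinations, product

# Illustrative sketch of the multi-resolution search (Definition 2):
# enumerate every feature subset chi in P(X) and every nonempty window W
# in P(R_chi), keeping the pair that minimizes a (hypothetical) cost.
def cblop_search(features, ranges, cost):
    best = None
    for k in range(1, len(features) + 1):
        for chi in combinations(features, k):
            # R_chi is the product of the individual feature ranges
            r_chi = list(product(*(ranges[f] for f in chi)))
            for w_size in range(1, len(r_chi) + 1):
                for w in combinations(r_chi, w_size):
                    c = cost(chi, w)
                    if best is None or c < best[0]:
                        best = (c, chi, w)
    return best

# Toy example with two Boolean features; the cost (arbitrarily) rewards
# selecting feature 'a' restricted to the window {(1,)}.
ranges = {'a': [0, 1], 'b': [0, 1]}
cost = lambda chi, w: 0.0 if chi == ('a',) and w == ((1,),) else 1.0
best = cblop_search(['a', 'b'], ranges, cost)
```

The exponential size of this search space is precisely why the paper develops more refined optimal and sub-optimal algorithms.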
Thus, in this paper, we extend the classical approach to feature selection in order to select not only the features, but also the values of the features that are most related to Y in some sense. In order to do so, we extend the search space of the feature selection algorithm from the BLFS to the CBLOP and use cost functions based on the Local Lift Dependence Scale, which is an extension of Shannon's mutual information. The feature selection algorithms proposed in this paper are applied to a dataset consisting of student performances on a university's entrance exam and on undergraduate courses, in order to select the exam subjects, and the performances on them, that are most related to undergraduate courses, considering student performance on both. The method is also applied to two datasets publicly available at the UCI Machine Learning Repository [12], namely the Congressional Voting Records and Covertype datasets. We first present the main concepts related to the Local Lift Dependence Scale. Then, we propose feature selection algorithms based on it and apply them to solve real problems.
2. Local Lift Dependence Scale
The Local Lift Dependence Scale (LLDS) is a scale for measuring the dependence between a random variable Y and a random vector $\mathit{X}$ (also called a feature vector) in multiple resolutions. Although it consists of well known mathematical objects, there does not seem to exist any literature that thoroughly defines and studies the properties of the LLDS, even though it is widely used in marketing [13] and data mining [14] (Chapter 10), for example. Therefore, we present an unprecedented characterization of the LLDS, despite the fact that much of it is known in the theory.
The LLDS analyses the raw dependence between the variables, as it does not make any assumption about its kind, nor does it restrict itself to the study of a specific kind of dependence, e.g., linear dependence. Among the LLDS dependence coefficients there are three measures of dependence, one global and two local with different resolutions, which assess variable dependence on multiple levels. The global measure and one of the local ones are based on well known dependence measures, namely the Mutual Information and the Kullback-Leibler Divergence. In the following paragraphs we present the main concepts of the LLDS and discuss how they can be applied to the classical and multi-resolution approaches to feature selection. The main concepts are presented for discrete random variables $\mathit{X}=({X}_{1},\cdots ,{X}_{m})$ and Y defined on $(\mathsf{\Omega},\mathbb{F},\mathbb{P})$, with range ${R}_{\mathit{X},Y}={R}_{\mathit{X}}\times {R}_{Y}$, although, with simple adaptations, i.e., by interchanging probability functions with probability density functions, the continuous case follows from it.
The Mutual Information (MI), proposed by [15], is a classical dependence quantifier that measures the mass concentration of a joint probability function. The more concentrated the joint probability function is, the more dependent the random variables are and the greater is their MI. In fact, the MI is a numerical index defined as

$$I(\mathit{X},Y):={\sum}_{(\mathit{x},y)\in {R}_{\mathit{X},Y}}f(\mathit{x},y)\log \frac{f(\mathit{x},y)}{g(\mathit{x})h(y)},$$

in which $f(\mathit{x},y):=\mathbb{P}(\mathit{X}=\mathit{x},Y=y)$, $g(\mathit{x}):=\mathbb{P}(\mathit{X}=\mathit{x})$ and $h(y):=\mathbb{P}(Y=y)$ for all $(\mathit{x},y)\in {R}_{\mathit{X},Y}$. A useful property of the MI is that it may be expressed as

$$I(\mathit{X},Y)=H(Y)-H(Y|\mathit{X}), \qquad (1)$$

in which $H(Y)$ is the Entropy of Y and $H(Y|\mathit{X})$ is the Conditional Entropy (CE) of Y given $\mathit{X}$. The form of the MI in (1) is useful because, if we fix Y and consider features $({X}_{1},\cdots ,{X}_{m})$, we may determine which one of them is the most dependent with Y by observing only the CE of Y given each one, as the feature that maximizes the MI is the one that minimizes the CE. In this paper, we consider the normalized MI, given by

$${\eta}_{\mathit{X}}(Y|{R}_{\mathit{X}}):=\frac{I(\mathit{X},Y)}{H(Y)}=\frac{H(Y)-H(Y|\mathit{X})}{H(Y)}, \qquad (2)$$

when $H(Y)\ne 0$. We have that $0\le {\eta}_{\mathit{X}}(Y|{R}_{\mathit{X}})\le \frac{\min \{H(Y),H(\mathit{X})\}}{H(Y)}\le 1$; that ${\eta}_{\mathit{X}}(Y|{R}_{\mathit{X}})=0$ if, and only if, $\mathit{X}$ and Y are independent; and that ${\eta}_{\mathit{X}}(Y|{R}_{\mathit{X}})=1$ if, and only if, there exists a function $Q:{R}_{\mathit{X}}\to {R}_{Y}$ such that $\mathbb{P}(Y=Q(\mathit{X}))=1$, i.e., Y is a function of $\mathit{X}$. The $\eta$, MI and CE are equivalent global and general measures of dependence, which summarize into an index a variety of dependence kinds that are expressed by mass concentration.
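As a minimal numerical sketch (not the paper's implementation), the MI and the normalized coefficient ${\eta}_{\mathit{X}}(Y|{R}_{\mathit{X}})$ can be estimated from a joint probability function stored as a dictionary; the two toy distributions below are illustrative.

```python
import math

# Sketch: mutual information I(X;Y) and the normalized coefficient
# eta_X(Y | R_X) = I(X;Y) / H(Y), from a joint pmf given as a dict
# mapping (x, y) -> P(X=x, Y=y).
def mutual_information(joint):
    g, h = {}, {}  # marginals of X and Y
    for (x, y), p in joint.items():
        g[x] = g.get(x, 0.0) + p
        h[y] = h.get(y, 0.0) + p
    mi = sum(p * math.log(p / (g[x] * h[y]))
             for (x, y), p in joint.items() if p > 0)
    hy = -sum(p * math.log(p) for p in h.values() if p > 0)
    return mi, mi / hy if hy else float('nan')

# Y is a deterministic function of X -> eta equals 1.
det = {(0, 0): 0.5, (1, 1): 0.5}
mi, eta = mutual_information(det)
# X and Y independent -> MI and eta equal 0.
ind = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
mi2, eta2 = mutual_information(ind)
```

The two extreme cases reproduce the properties stated above: independence gives $\eta=0$, and a functional relation gives $\eta=1$.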
On the other hand, we may define an LLDS local and general measure of dependence that expands the global dependence measured by the MI into local indexes, enabling a local interpretation of variable dependence. As the MI is an index that measures variable dependence by measuring the mass concentration incurred in one variable by the observation of another, it may only give evidence about the existence of a dependence, but cannot assert what kind of dependence is being observed. Therefore, it is relevant to break down the MI by region, so that it can be interpreted in a useful manner and the kind of dependence outlined by it may be identified. The Lift Function (LF) is responsible for this break down, as it may be expressed as

$${L}_{(\mathit{X},Y)}(\mathit{x},y):=\frac{f(\mathit{x},y)}{g(\mathit{x})h(y)}=\frac{f(y|\mathit{x})}{h(y)},\qquad (\mathit{x},y)\in {R}_{\mathit{X},Y},$$

in which $f(y|\mathit{x}):=\mathbb{P}(Y=y|\mathit{X}=\mathit{x})$. When there is no doubt about which variables the LF refers to, it is denoted simply by $L(\mathit{x},y)$. Note that the LF is the exponential of the pointwise mutual information (see [16,17], for example, for more details).
The MI is the expectation on $(\mathit{X},Y)$ of the logarithm of the LF, so that the LF presents locally the mass concentration measured by the MI. As the LF may be written as the ratio between the conditional probability of Y given $\mathit{X}$ and the marginal probability of Y, the main interest in its behaviour lies in determining for which points $(\mathit{x},y)\in {R}_{\mathit{X},Y}$ we have $L(\mathit{x},y)>1$ and for which $L(\mathit{x},y)<1$. If $L(\mathit{x},y)>1$, then the fact of $\mathit{X}$ being equal to $\mathit{x}$ increases the probability of Y being equal to y, as the conditional probability is greater than the marginal one. Therefore, we say that the event $\{\mathit{X}=\mathit{x}\}$ lifts the event $\{Y=y\}$, or that instances with profile $\mathit{x}$ are prone to be of class y. In the same way, if $L(\mathit{x},y)<1$, we say that the event $\{\mathit{X}=\mathit{x}\}$ inhibits the event $\{Y=y\}$, for $f(y|\mathit{x})<h(y)$. If $L(\mathit{x},y)=1,\forall (\mathit{x},y)\in {R}_{\mathit{X},Y}$, then the random variables are independent. Note that the LF is symmetric: $\{\mathit{X}=\mathit{x}\}$ lifts $\{Y=y\}$ if, and only if, $\{Y=y\}$ lifts $\{\mathit{X}=\mathit{x}\}$. Therefore, the LF may be interpreted as $\mathit{X}$ lifting Y or as Y lifting $\mathit{X}$. From now on, we interpret it as $\mathit{X}$ lifting Y, even though it could be the other way around.
An important property of the LF is that it cannot be greater than one, nor less than one, for all points $(\mathit{x},y)\in {R}_{\mathit{X},Y}$. Indeed, if $L(\mathit{x},y)>1,\forall (\mathit{x},y)\in {R}_{\mathit{X},Y}$, then $f(y\mid \mathit{x})>h(y),\forall (\mathit{x},y)\in {R}_{\mathit{X},Y}$, which implies the contradiction $1={\sum}_{y\in {R}_{Y}}f(y\mid \mathit{x})>{\sum}_{y\in {R}_{Y}}h(y)=1$ for $\mathit{x}\in {R}_{\mathit{X}}$. By an analogous argument, $L(\mathit{x},y)$ cannot be less than one for all $(\mathit{x},y)\in {R}_{\mathit{X},Y}$. Therefore, if there are LF values greater than one, then there must be values less than one, which makes it clear that the values of the LF are dependent and that the lift is a pointwise characteristic of the joint probability function, not a global property of it. Thus, the study of the LF behaviour gives the full view of the dependence between the variables, without restricting it to a specific kind nor making assumptions about it.
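The behaviour described above can be checked numerically. A small sketch, with an illustrative joint probability function, shows lift and inhibition coexisting, as they must:

```python
# Sketch of the Lift Function L(x, y) = f(x, y) / (g(x) h(y))
# = f(y | x) / h(y), from a joint pmf given as a dict.
def lift(joint):
    g, h = {}, {}  # marginals of X and Y
    for (x, y), p in joint.items():
        g[x] = g.get(x, 0.0) + p
        h[y] = h.get(y, 0.0) + p
    return {(x, y): p / (g[x] * h[y])
            for (x, y), p in joint.items() if p > 0}

# In this toy pmf, observing X=0 makes Y=0 more likely than marginally,
# so {X=0} lifts {Y=0} (L > 1) and inhibits {Y=1} (L < 1).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
L = lift(joint)
```

Consistently with the property above, the values of L in this example are greater than one on the diagonal and less than one off it.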
Although the LF presents a wide picture of variable dependence, it may present it in too high a resolution, making it complex to interpret. Therefore, instead of measuring dependence for each point in the range ${R}_{\mathit{X},Y}$, we may measure it for a window $W\in \mathcal{P}({R}_{\mathit{X}})$. The dependence between $\mathit{X}$ and Y in the window W, i.e., for $\mathit{X}\in W$, may be measured by the $\eta$ coefficient defined as

$${\eta}_{\mathit{X}}(Y|W):=\frac{E\{{D}_{KL}([Y|\mathit{X}]\,||\,Y)\mid \mathit{X}\in W\}}{E\{H([Y|\mathit{X}],Y)\mid \mathit{X}\in W\}}, \qquad (3)$$

when $E\{H([Y|\mathit{X}],Y)\mid \mathit{X}\in W\}\ne 0$ and $\mathbb{P}(\mathit{X}\in W)>0$, in which ${D}_{KL}(\cdot ||\cdot)$ is the Kullback-Leibler divergence [18], $H(\cdot ,\cdot)$ is the cross-entropy [19] and $H([Y|\mathit{X}],Y)$ means the cross-entropy between the conditional distribution of Y given $\mathit{X}$ and the marginal distribution of Y. The $\eta$ coefficient (3) compares the conditional probability of Y given $\mathit{x}$, $\forall \mathit{x}\in W$, with the marginal probability of Y, so that the greater the coefficient, the more distant the conditional probability is from the marginal one and, therefore, the greater is the influence of the event $\{\mathit{X}\in W\}$ on Y. Note that, analogously to the MI, we may write

$${\eta}_{\mathit{X}}(Y|W)=1-\frac{E\{H([Y|\mathit{X}])\mid \mathit{X}\in W\}}{E\{H([Y|\mathit{X}],Y)\mid \mathit{X}\in W\}},$$

in which $H([Y|\mathit{X}])$ means the Entropy of the conditional distribution of Y given $\mathit{X}$, and we have that $0\le {\eta}_{\mathit{X}}(Y|W)\le 1$; that ${\eta}_{\mathit{X}}(Y|W)=0$ if, and only if, $h(y)\equiv f(y|\mathit{x}),\forall \mathit{x}\in W$; and that ${\eta}_{\mathit{X}}(Y|W)=1$ if, and only if, there exists a function $Q:W\to {R}_{Y}$ such that $\mathbb{P}(Y=Q(\mathit{X})|\mathit{X}\in W)=1$. Observe that the $\eta$ coefficient of a window is also a local dependence quantifier, although its resolution is lower than that of the LF if the cardinality of W is greater than one. Also note that the $\eta$ coefficient (3) is a generalization of (2) to all subsets (windows) of ${R}_{\mathit{X}}$, as $W={R}_{\mathit{X}}$ is a window, and that the numerator of ${\eta}_{\mathit{X}}(Y|W)$ equals $E\{\log L(\mathit{X},Y)\mid (\mathit{X},Y)\in W\times {R}_{Y}\}$. It is important to outline that we may complete the LLDS with coefficients given by $E\{\log L(\mathit{X},Y)\mid (\mathit{X},Y)\in {W}_{\mathit{X}}\times {W}_{Y}\}$, normalized, with ${W}_{\mathit{X}}\in \mathcal{P}({R}_{\mathit{X}})$ and ${W}_{Y}\in \mathcal{P}({R}_{Y})$, which measure the dependence between $\mathit{X}$ and Y for $(\mathit{X},Y)\in {W}_{\mathit{X}}\times {W}_{Y}$. However, we do not use coefficients of this type in this paper, for in our case the variable Y and its range are always fixed. Note that if the cardinality of ${W}_{\mathit{X}}\times {W}_{Y}$ is one, say ${W}_{\mathit{X}}\times {W}_{Y}=\{(\mathit{x},y)\}$, then $E\{\log L(\mathit{X},Y)\mid (\mathit{X},Y)\in {W}_{\mathit{X}}\times {W}_{Y}\}=\log L(\mathit{x},y)$, so that the scale is indeed complete.
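A small numerical sketch of the windowed coefficient: the expected Kullback-Leibler divergence between $[Y|\mathit{X}]$ and Y over the window, normalized by the expected cross-entropy, as in the definition of ${\eta}_{\mathit{X}}(Y|W)$. The joint probability function below is illustrative.

```python
import math

# Sketch of eta_X(Y | W): expected KL divergence of [Y | X = x] from Y,
# over x in W, normalized by the expected cross-entropy H([Y | X], Y).
def eta_window(joint, W):
    g, h = {}, {}  # marginals of X and Y
    for (x, y), p in joint.items():
        g[x] = g.get(x, 0.0) + p
        h[y] = h.get(y, 0.0) + p
    pw = sum(g[x] for x in W)  # P(X in W)
    num = den = 0.0
    for x in W:
        wx = g[x] / pw  # P(X = x | X in W)
        for y in h:
            fyx = joint.get((x, y), 0.0) / g[x]  # f(y | x)
            if fyx > 0:
                num += wx * fyx * math.log(fyx / h[y])  # KL term
                den += -wx * fyx * math.log(h[y])       # cross-entropy term
    return num / den if den else 0.0

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
e_window = eta_window(joint, [0])  # dependence restricted to {X = 0}
```

For $W={R}_{\mathit{X}}$ the denominator reduces to $H(Y)$, so the coefficient coincides with the normalized MI (2), as stated above.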
The three dependence coefficients presented, when analysed collectively, measure variable dependence at all resolutions: from the low resolution of the MI, through the middle resolutions of the windows W, to the high resolution of the LF. Indeed, the $\eta$ coefficients and the LF define a dependence scale on ${R}_{\mathit{X}}$, which we call LLDS, that gives a dependence measure for each subset $W\in \mathcal{P}({R}_{\mathit{X}})$. This scale may be useful for various purposes, and we outline some of them in the following paragraphs.
Potential applications of the Local Lift Dependence Scale
The LLDS, more specifically the LF, is relevant in frameworks in which we want to choose a set of elements, e.g., people, in order to apply some kind of treatment to them, obtaining some kind of response Y, and are interested in maximizing the number of elements with a given response $y\in {R}_{Y}$. In this scenario, given the features $\mathit{X}$, the LF provides the set of elements that must be chosen, which is the set whose elements have profile $\mathit{x}\in {R}_{\mathit{X}}$ such that $L(\mathit{x},y)$ is greatest. Formally, we must choose elements whose profile is

$${\mathit{x}}_{opt}(y):=\underset{\mathit{x}\in {R}_{\mathit{X}}}{\arg \max}\;L(\mathit{x},y).$$

Indeed, if we choose n elements randomly from our population, we expect that $n\times \mathbb{P}(Y=y)$ of them will have the desired response. However, if we choose n elements from the population of all elements with profile ${\mathit{x}}_{opt}(y)\in {R}_{\mathit{X}}$, then we expect that $n\times \mathbb{P}(Y=y|\mathit{X}={\mathit{x}}_{opt}(y))$ of them will have the desired response, which is $[L({\mathit{x}}_{opt}(y),y)-1]\times n\times \mathbb{P}(Y=y)$ more elements than in the whole-population sampling framework. Observe that this framework is the exact opposite of the classification problem. In the classification problem, we want to classify an instance with profile $\mathit{x}\in {R}_{\mathit{X}}$ into a class $y\in {R}_{Y}$, which may be, for example, the class y such that $f(y|\mathit{x})$ is maximum. On the other hand, in this framework we are interested in, given a $y\in {R}_{Y}$, finding the profile ${\mathit{x}}_{opt}(y)\in {R}_{\mathit{X}}$ such that $f(y|{\mathit{x}}_{opt}(y))$ is maximum. In the applications section we further discuss the differences between this framework and the classification problem, and how the LLDS may be applied to both.
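This selection rule can be sketched directly; the joint probability function below is illustrative. Since $h(y)$ is fixed for a given target y, maximizing $L(\mathit{x},y)$ is the same as maximizing $f(y|\mathit{x})$.

```python
# Sketch: given a target class y, pick the profile x_opt(y) maximizing
# the lift L(x, y) = f(x, y) / (g(x) h(y)), from a joint pmf dict.
def x_opt(joint, y):
    g = {}  # marginal of X
    for (x, _), p in joint.items():
        g[x] = g.get(x, 0.0) + p
    h_y = sum(p for (_, yy), p in joint.items() if yy == y)
    lifts = {x: joint.get((x, y), 0.0) / (g[x] * h_y) for x in g}
    return max(lifts, key=lifts.get)

# Profile x=0 concentrates class y=0, so it is the optimal profile
# when we want to maximize responses of class 0.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
best = x_opt(joint, 0)
```

Sampling only elements with profile `best` then yields the expected gain over whole-population sampling discussed above.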
Furthermore, the $\eta$ coefficient is relevant in scenarios in which we want to understand the influence of $\mathit{X}$ on Y by region, i.e., for each subset of ${R}_{\mathit{X}}$. As an example of such a framework, consider an image in grayscale, in which $\mathit{X}=({X}_{1},{X}_{2})$ represents the pixels of the image and Y is the random variable whose distribution is the distribution of the colors in the picture, i.e., $\mathbb{P}(Y=y)=\frac{{n}_{y}}{n}$, in which ${n}_{y}$ is the number of pixels whose color is $y\in \{1,\cdots ,255\}$ and n is the total number of pixels in the image. If we define the distribution of $Y|\mathit{X}=({x}_{1},{x}_{2})$ properly for all $({x}_{1},{x}_{2})\in {R}_{\mathit{X}}$, we may calculate ${\eta}_{\mathit{X}}(Y|W),W\in \mathcal{P}({R}_{\mathit{X}})$, in order to determine the regions that are a representation of the whole picture, i.e., whose color distribution is the same as that of the whole image, and the regions W whose color distribution differs from that of the whole image. The $\eta$ coefficient may be useful for identifying textures and recognizing patterns in images.
Lastly, the LLDS may be used for feature selection, when we are not only interested in selecting the features $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ that are most related to Y, but also want to determine the features $\mathit{\chi}\in \mathcal{P}(\mathit{X})$ whose levels $W\in \mathcal{P}({R}_{\mathit{\chi}})$ most influence Y. In the same manner, we may want to select the features $\mathit{\chi}$ whose level ${\mathit{x}}_{opt}(y)\in {R}_{\mathit{\chi}}$ maximizes ${L}_{(\mathit{\chi},Y)}({\mathit{x}}_{opt}(y),y)$, for a given $y\in {R}_{Y}$, so that we may sample from the population of elements with profile ${\mathit{x}}_{opt}(y)\in {R}_{\mathit{\chi}}$ in order to maximize the number of elements of class y. Feature selection based on the LLDS is a special case of the classical and multi-resolution approaches to feature selection, as presented next.
4. Applications
The multi-resolution approach proposed in the previous sections is now applied to three different datasets. First, we apply it to the performances dataset, which consists of student performances on entrance exams and undergraduate courses. Then, we apply the algorithms to two UCI Machine Learning Repository datasets: the Congressional Voting Records and Covertype datasets [12].
4.1. Performances dataset
A recurrent issue in universities all over the world is the framework of their recruitment process, i.e., the manner of selecting their undergraduate students. In Brazilian universities, for example, the recruitment of undergraduate students is solely based on their performance on exams that cover high school subjects, called vestibulares, so that knowing which subjects are most related to the performance on undergraduate courses is a matter of great importance to universities' admission offices, as it is important to optimize the recruitment process in order to select the students that are most likely to succeed. Therefore, in this scenario, the algorithm presented in the previous sections may be a useful tool in determining the entrance exam subjects, and the performances on them, that are most related to the performance on undergraduate courses, so that students may be selected based on their performance on these subjects.
The recruitment of students to the University of São Paulo is based on an entrance exam that consists of an essay and questions on eight subjects: Mathematics, Physics, Chemistry, Biology, History, Geography, English and Portuguese. The selection of students is entirely based on this exam, although the weights of the subjects differ from one course to another. In the exact sciences courses, such as Mathematics, Statistics, Physics, Computer Science and Engineering, for example, the subjects with greater weights are Portuguese, Mathematics and Physics, as those are the subjects that are qualitatively most related to what is taught in these courses. Although weights are given to each subject in a systematic manner, it is not known which subjects are indeed most related to the performance on undergraduate courses. Therefore, it would be of interest to measure the relation between the performance on exam subjects and undergraduate courses and, in order to do so, we apply the algorithms proposed in the previous sections.
The dataset to be considered consists of 8,353 students who enrolled in 28 courses of the University of São Paulo between 2011 and 2016. The courses are those of its Institute of Mathematics and Statistics, Institute of Physics and Polytechnic School, and are in general Mathematics, Computer Science, Statistics, Physics and Engineering courses. The variable of interest (Y) is the weighted mean grade of the students on the courses they attended in their first year at the university (the weights being the courses' credits), and is a number between zero and ten. The features, denoted $\mathit{X}=({X}_{1},{X}_{2},{X}_{3},{X}_{4},{X}_{5},{X}_{6},{X}_{7},{X}_{8},{X}_{9})$, are the performances on each one of the eight entrance exam subjects, which are numbers between zero and one, and the performance on the essay, which is a number between zero and one hundred.
In order to apply the proposed algorithm to the data at hand, it is necessary to conveniently discretize the variables and, to do so, we take into account an important characteristic of the data: the scale of the performances. The scale of the performances, both on the entrance exam and on the undergraduate courses, depends on the course and the year. Indeed, the performance on the entrance exam of students of competitive courses is better, as only the students with high performance are able to enrol in these courses. In the same way, the performances differ from one year to another, as the entrance exam is not the same every year, and the teachers of the first year courses also change from one year to another, which causes the scale of the grades to change. Therefore, we discretize all variables by tertiles within each year and course, i.e., we take the tertiles considering only the students of a given course and year. Furthermore, we do not discretize each variable by itself, but rather discretize the variables jointly, by a method based on distance tertiles, as follows.
Suppose that at a step of the algorithm we want to measure the relation between Y and the features $\mathit{\chi}\in \mathcal{P}(\mathit{X})$. In order to do so, we discretize Y by its tertiles within each course and year, e.g., a student is in the third tertile if he is in the top third of his class according to the weighted mean grade, and discretize the performance on $\mathit{\chi}$ jointly, i.e., by discretizing the distance between the performance of each student on these subjects and zero by its tertiles. Indeed, students whose performance is close to zero have low joint performance on the subjects $\mathit{\chi}$, while those whose performance is far from zero have high joint performance on them. Therefore, we take the distance between each performance and zero, and then discretize it within each course and year, e.g., a student is in the first tertile if he is in the bottom third of his class according to his joint performance on the subjects $\mathit{\chi}$. The Mahalanobis distance [33] is used, as it takes into account the variance and covariance of the performance on the subjects $\mathit{\chi}$.
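A minimal sketch of this joint discretization for two subjects, assuming illustrative performance scores; rank-based tertiles stand in for the within-course-and-year tertiles, and the 2-D Mahalanobis distance from zero is computed explicitly (the paper's setting is analogous for more subjects).

```python
from statistics import mean

# Sketch: squared Mahalanobis distance of each two-feature performance
# vector from the origin, using the sample covariance of the group.
def mahalanobis_sq(points):
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # apply the inverse of the 2x2 covariance to each point (distance to 0)
    return [(x * (syy * x - sxy * y) + y * (sxx * y - sxy * x)) / det
            for x, y in points]

def tertiles(values):
    """Rank-based tertile label (1, 2, 3) for each value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = 1 + (3 * rank) // len(values)
    return labels

# Hypothetical (Mathematics, Physics) scores for six students of one
# course and year; each student gets a joint-performance tertile.
scores = [(0.1, 0.2), (0.3, 0.1), (0.5, 0.6),
          (0.7, 0.8), (0.9, 0.7), (0.2, 0.9)]
t = tertiles(mahalanobis_sq(scores))
```

In the paper this is done separately within each course and year, so that each group has its own covariance and its own tertile cutoffs.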
As an example, suppose that we want to measure the relation between the performances on Mathematics and Physics and the weighted mean grade of students that enrolled in the Statistics undergraduate course in 2011 and 2012. In order to do so, we discretize the weighted mean grade by year and the performance on Mathematics and Physics by the Mahalanobis distance between it and zero, also by year, as is displayed in
Figure 2. Observe that each year has its own ellipses that partition the performance on Mathematics and Physics in three and the tertile of a student depends on the ellipses of his year. The process used in
Figure 2 is extended to the case in which there are more than two subjects and one course. When there is only one subject, the performance is discretized in the usual manner inside each course and year. The LF between the weighted mean grade and the joint performance on Mathematics and Physics is presented in
Table 1. From this table we may search for the maximum lift or calculate the
$\eta$ coefficient for its windows. In this example, we have $\eta_{(\mathrm{M},\mathrm{P})}(Y\mid {R}_{(\mathrm{M},\mathrm{P})})=0.0387$, in which $(\mathrm{M},\mathrm{P})=(\mathrm{Mathematics},\mathrm{Physics})$.
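The LF itself is estimated from the contingency table of the discretized variables. A minimal sketch, under our reading of the lift as $L(x,y)=\mathbb{P}(x,y)/(\mathbb{P}(x)\mathbb{P}(y))=\mathbb{P}(y\mid x)/\mathbb{P}(y)$; the function name is ours:

```python
import numpy as np

def lift_table(counts):
    """Estimate the Lift Function (LF) from a contingency table.

    counts[i, j] = number of instances in window i of the features (rows)
    and class j of the variable of interest (columns). The lift is read as
    L(x, y) = P(x, y) / (P(x) P(y)) = P(y | x) / P(y).
    """
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)  # marginal of the feature window
    py = p.sum(axis=0, keepdims=True)  # marginal of the class
    return p / (px * py)
```

Under independence every entry equals 1; entries above 1 mark the windows that lift the probability of the corresponding class.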
The proposed algorithm is applied to the discretized variables using three cost functions. First, we use the $\eta$ coefficient on the window that represents the whole range of the features in order to determine which subjects (features) are most related to the weighted mean grade, i.e., the features (4). Then, we apply the algorithm using as cost function the $\eta$ coefficient for all windows in order to determine which subject performances (features and window) are most related to the weighted mean grade, i.e., the subjects and performances (5). Finally, we determine which subjects and performances most lift the third tertile of the weighted mean grade, i.e., the subjects and performances (6) with $y=\mathrm{Tertile}\phantom{\rule{4.pt}{0ex}}3$.
The subjects that are most related to the weighted mean grade, according to the proposed discretization process and the $\eta$ coefficient (2), are $\mathit{\chi}=(\mathrm{M},\mathrm{P},\mathrm{C},\mathrm{B},\mathrm{Po})$, with $\eta_{\mathit{\chi}}(Y\mid {R}_{\mathit{\chi}})=0.0354$, in which $(\mathrm{M},\mathrm{P},\mathrm{C},\mathrm{B},\mathrm{Po})=(\mathrm{Mathematics},\mathrm{Physics},\mathrm{Chemistry},\mathrm{Biology},\mathrm{Portuguese})$. The LF between the weighted mean grade and
$\mathit{\chi}$ is presented in
Table 2. The features
$\mathit{\chi}$ are the ones that are in general most related to the weighted mean grade, i.e., are the output of the classical feature selection algorithm that employs the inverse of the global
$\eta $ coefficient as cost function (Algorithm 1). Therefore, the recruitment of students could be optimized by taking into account only the subjects
$\mathit{\chi}$.
Applying Algorithms 2 and 3 we obtain the same result: the performance, i.e., window, that is most related to the weighted mean grade and that most lifts its third tertile is the third tertile in Mathematics, for which $\eta_{\mathrm{M}}(Y\mid \{\mathrm{Tertile}\phantom{\rule{4.pt}{0ex}}3\})=0.0575$ and $\mathrm{L}_{(\mathrm{M},Y)}(\mathrm{Tertile}\phantom{\rule{4.pt}{0ex}}3,\mathrm{Tertile}\phantom{\rule{4.pt}{0ex}}3)=1.51$, in which $\mathrm{M}=\mathrm{Mathematics}$. The LF between the weighted mean grade and the performance on Mathematics is presented in
Table 3.
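The search for the maximum lift mentioned above amounts to an argmax over the LF table. A minimal sketch with hypothetical values (only the 1.51 maximum mirrors the Mathematics result quoted above; the other entries are made up for illustration):

```python
import numpy as np

# Hypothetical 3x3 LF (rows: Mathematics tertiles, columns: weighted
# mean grade tertiles). Illustrative values, not those of Table 3.
lift = np.array([
    [1.40, 0.90, 0.70],
    [0.85, 1.20, 0.95],
    [0.70, 0.90, 1.51],
])
# The pair of windows with the maximum lift
i, j = np.unravel_index(np.argmax(lift), lift.shape)
print(f"Tertile {i + 1} lifts Tertile {j + 1} by {lift[i, j]}")
# → Tertile 3 lifts Tertile 3 by 1.51
```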
The output of the algorithms provides relevant information to the admission office of the University. Indeed, it is now known that the subjects that are most related to the performance on the undergraduate courses are Mathematics, Physics, Chemistry, Biology and Portuguese. Furthermore, in order to maximize the number of students that will succeed in the undergraduate courses, the office must select those that have high performance on Mathematics, as it lifts by more than 50% the probability of the student also having a high performance on the undergraduate course, i.e., students with high performance on Mathematics are prone to also have high performance on the undergraduate course. Although the subjects that are most related to the performance on the courses are obtained from the classical feature selection algorithm, only the LLDS identifies the performance on the entrance exam that is most related to success on the undergraduate course, namely, high performance on Mathematics. Therefore, feature selection algorithms based on the LLDS provide more information than the classical feature selection algorithm, as they have a greater resolution and take into account the local relation between the variables.
4.2. Congressional Voting Records dataset
The Congressional Voting Records dataset consists of
435 instances of
16 Boolean features and a Boolean variable that indicates the party of the instance (democrat or republican). The features indicate how the instance voted (yes or no) in the year of 1984 on each one of 16 matters, which are displayed in
Table 4. Algorithm 3 is applied to this dataset in order to determine which voting profiles are most prone to be that of a republican and that of a democrat.
As the number of instances is relatively small, we perform Algorithm 3 under a restriction that avoids overfitting. Indeed, if we apply the algorithm without the restriction, then the chosen profiles are those in which all the instances are of the same party. If there are only a couple of instances with some profile, and all of them are of the same party, then this profile is chosen as a prone one for the party. However, we do not know if the profile is really prone, i.e., if everybody with it is in fact of the same party, or if the fact that everybody with this profile is of the same party is just a sampling deviation. In other words, without the restriction, the estimation error of the LF is too great, as some profiles have low frequency in the sample, and the feature selection algorithm overfits.
Therefore, we restrict the search space to the profiles with a relative frequency in the sample of at least
$0.15$. In other words, for $y\in \{\mathrm{democrat},\mathrm{republican}\}$ we select the profiles that maximize the LF subject to this restriction, in which $\mathbb{P}({\mathit{\chi}}^{*}={\mathit{x}}^{*}),{\mathit{\chi}}^{*}\in \mathcal{P}(\mathit{X}),{\mathit{x}}^{*}\in {R}_{{\mathit{\chi}}^{*}}$, is estimated by the relative frequency of the profile. The selected profiles, their LF value and the sample size considered are presented in
Table 5. At each iteration of the algorithm, only the instances that have no missing data in the features being considered are taken into account when calculating the LF, so that the sample size of each iteration is not the same.
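The restriction above can be implemented by pruning rare profiles before the LF is maximized. A minimal sketch (the function name and the data layout, tuples of votes with missing-data instances already dropped, are ours):

```python
from collections import Counter

def frequent_profiles(rows, min_rel_freq=0.15):
    """Keep only the voting profiles whose relative frequency in the
    sample is at least `min_rel_freq`, so the LF is estimated on enough
    instances and the search does not overfit rare profiles.

    rows: list of tuples, each tuple one voting profile, e.g. ('y', 'n'),
    restricted to the features considered at the current iteration.
    """
    n = len(rows)
    counts = Counter(rows)
    # Map each sufficiently frequent profile to its relative frequency
    return {prof: c / n for prof, c in counts.items() if c / n >= min_rel_freq}
```

The LF is then maximized only over the profiles this filter returns.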
The profiles with maximum LF lift by
94% the probability of democrat and by around
165% the probability of republican. This difference in the
lift is due to the fact that there are more democrats than republicans, so that the probability of democrat is greater and, therefore, cannot be lifted as much as the probability of republican can. The profiles in
Table 5 present a wide view of the voting profile of democrats and republicans, which allows an understanding of what differentiates a democrat from a republican regarding their vote.
This application to the Congressional Voting Records dataset sheds light on two interesting properties of the LLDS approach to feature selection at its higher resolution. First, this approach is indeed local, as we are not interested in selecting the features that best classify the representatives according to their party, but rather the voting profiles that are most prone to be that of a democrat or a republican. Secondly, the problem treated here is the opposite of the classification problem. Indeed, in the classification problem we are interested in classifying a representative according to his party, given his voting profile, while here, given a party, we want to know the profiles of the representatives that are most prone to be of that party. In other words, in the classification problem we want to determine the party given the voting profile, while in the LLDS problem we want to determine the voting profile given the party.
4.3. Covertype dataset
The Covertype dataset consists of
581,012 instances (terrains) of
54 features (10 continuous and 44 discrete) and a variable that indicates the cover type of the terrain (7 types). We apply Algorithms 1, 2 and 3 to select features among the continuous ones that are displayed in
Table 6. The features are discretized in the same way they were in the performances dataset: by taking sample quantiles of the Mahalanobis distance between the features and zero. However, we now consider the quantiles
$0.2,0.4,0.6$ and
$0.8$ as cutting points, i.e.,
quintiles, instead of tertiles.
Applying Algorithm 1 we select the features
$\mathit{\chi}=(\mathrm{E},\mathrm{HH},\mathrm{HF})$, with a coefficient $\eta_{\mathit{\chi}}(Y\mid {R}_{\mathit{\chi}})=0.307$ and the LF in
Table 7. We see that being in the first quintile of the selected features lifts classes 3, 4, 5 and 6; being in the second quintile lifts classes 2 and 5; being in the third quintile lifts class 2; being in the fourth quintile lifts class 1; and being in the fifth quintile lifts classes 1 and 7. From
Table 7 we may interpret the relation between the selected features and the cover type. For example, we see that terrains with cover types 3, 4, 5 and 6 tend to have low joint values in the selected features, while terrains with cover type 7 tend to have high joint values in them. This example shows how the proposed approach allows us not only to select the features, but also to understand why these features were selected, i.e., what the relation between them and the cover type is, by analysing the local dependence between the variables.
Applying Algorithm 2 to this dataset we obtain the windows displayed in
Table 8. We see that the window that seems to most influence the cover type is the first and fifth quintile of the features Elevation and Horizontal distance to hydrology. Indeed, all the top ten windows contain those two features, and either their first or fifth quintile. As we can see in
Table 7, the influence of the fifth quintile of
$\mathit{\chi}=(\mathrm{E},\mathrm{HH},\mathrm{HF})$, the top window, is given by the fact that no terrain of the types 3, 4, 5 and 6 is in this quintile. Note that, again, our approach allows a better interpretation of the selected features by the analysis of the local dependence between the features and the cover type.
Finally, applying Algorithm 3 we choose the profiles displayed in
Table 9 for
$y\in \{1,2,3,4,5,6,7\}$. We see, for example, that the profile most prone to be of type 1 is
$(\mathrm{E},\mathrm{HH},\mathrm{HF})=\mathrm{Quintile}\phantom{\rule{4.pt}{0ex}}5$ and of type 3 is
$(\mathrm{E},\mathrm{HH},\mathrm{HR},\mathrm{HF})=\mathrm{Quintile}\phantom{\rule{4.pt}{0ex}}1$. Note that this does not mean that most of the terrains with these profiles are of types 1 and 3, but rather that the probability of a terrain with these profiles being of types 1 and 3, respectively, is
87% and
396% greater than the probability of a terrain for which we do not know the profile. Therefore, we see again the difference between the LLDS approach and the classification problem. In the LLDS approach, given a profile, we are interested in determining the types whose conditional probability given the profile is greater than their marginal probability, while in the classification problem, given a profile, we are interested in determining the type whose conditional probability given the profile is the greatest.
As an example, consider the joint distribution that generated the LF of
Table 7 and the profile Quintile 1. We have that the maximum conditional probability given this profile is the probability of type 2 (
$54,473/116,203=0.47$), while the maximum lift is that of type 4, although its conditional probability is only
$2,747/116,203=0.02$. However, the conditional probability of type 4 given the profile, even though absolutely small, is relatively great: it is 5 times the marginal probability
$0.004$. Therefore, on the one hand, if there is a new terrain whose profile is
$(\mathrm{E},\mathrm{HH},\mathrm{HF})=\mathrm{Quintile}\phantom{\rule{4.pt}{0ex}}1$, we classify it as being of type 2. On the other hand, if we want to sample terrains from a population and are interested in maximizing the number of terrains of type 4, we may sample from the population with profile
$(\mathrm{E},\mathrm{HH},\mathrm{HF})=\mathrm{Quintile}\phantom{\rule{4.pt}{0ex}}1$ instead of the whole population, expecting to sample around five times as many terrains of type 4.
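The contrast can be checked directly with the counts quoted above; this worked recomputation assumes (as the quoted lift of about 5 implies) that all 2,747 type-4 terrains fall in this profile:

```python
# Counts quoted in the text for the Covertype profile
# (E, HH, HF) = Quintile 1. Assumption for this illustration: 2,747 is
# both the type-4 count inside the profile and its dataset total.
n_profile = 116_203               # terrains with the Quintile 1 profile
n_total = 581_012                 # terrains in the whole dataset

cond_type2 = 54_473 / n_profile   # P(type 2 | profile) ≈ 0.47
cond_type4 = 2_747 / n_profile    # P(type 4 | profile) ≈ 0.02
marg_type4 = 2_747 / n_total      # P(type 4) ≈ 0.0047

# Classification: pick the type with the largest conditional probability.
best_class = 2 if cond_type2 > cond_type4 else 4

# LLDS: pick the type whose probability the profile lifts the most.
lift_type4 = cond_type4 / marg_type4
print(best_class, round(lift_type4, 1))   # → 2 5.0
```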
5. Final Remarks
The feature selection algorithms based on the LLDS extend the classical approach to feature selection to a higher resolution one, as they take into account the local dependence between the features and the variable of interest. Indeed, classical feature selection may be performed by walking through a tree in which each node is a vector of features, i.e., a BLFS, while feature selection based on the LLDS is established by walking through an extended tree, i.e., a CBLOP, in which inside each node there is another tree, that represents the windows of the features, as displayed in the example in
Figure 1. Therefore, feature selection based on the LLDS increases the reach of feature selection algorithms to a new variety of applications.
The LLDS may treat a problem that is the opposite of that of classification, i.e., when we are interested in, given a class y, finding the profile $\mathit{x}$ whose population we may sample from in order to maximize the number of instances of class y. Indeed, in the classification problem we want to do the exact opposite: classify an instance with known profile $\mathit{x}$ into a class of Y. Therefore, although LLDS tools may also be applied to the classification problem (as they are in the literature), they are of great importance in problems that we may call the reverse engineering of the classification one. Thus, our approach broadens the application of feature selection algorithms to a new set of problems by the extension of their search spaces from BLFSs to CBLOPs.
The algorithms proposed in this paper may be optimized in order to not walk through the entire CBLOP, as its size increases exponentially with the number of features, so that the algorithm may not be computable for a great number of features. Moreover, the algorithms may be subjected to
overfitting if the sample size is relatively small, so that their search space may be restricted. The methods of [1, 2, 3, 4, 5, 6, 7, 9, 10, 22, 23, 24, 25, 26, 27, 28, 29], for example, may be adapted to the multi-resolution algorithms in order to optimize them. Furthermore, the properties of the
$\eta$ coefficients and the LF must be studied in a theoretical framework, in order to establish their variances and sample distributions, and to develop statistical methods to estimate and test hypotheses about them.
The LLDS adapts classical measures, such as the MI and the Kullback-Leibler divergence, into coherent dependence coefficients that assess the dependence between random variables at multiple resolutions, presenting a wide view of it. As it does not make any assumption about the kind of dependence, the LLDS measures the raw dependence between the variables and, therefore, may be relevant for numerous purposes, feature selection being just one of them. We believe that the algorithms proposed in this paper, and the LLDS in general, bring advances to the state of the art in dependence measuring and feature selection, and may be useful in various frameworks.