Abstract
The suboptimal procedure under consideration, based on the MDR-EFE algorithm, provides sequential selection of factors that are relevant (in a specified sense) for the studied response, which is in general non-binary. The model is not assumed to be linear, and the joint distribution of the factor vector and the response is unknown. The set of relevant factors has specified cardinality. It is proved that under certain conditions the mentioned forward selection procedure gives a random set of factors that asymptotically (with probability tending to one as the number of observations grows to infinity) coincides with the “oracle” one. The latter means that the random set obtained with this algorithm approximates the collection of features that would be identified if the joint distribution of the feature vector and the response were known. For this purpose, statistical estimators of the prediction error functional of the studied response are proposed. They involve a new version of regularization. This permits us not only to guarantee the central limit theorem for normalized estimators, but also to find the convergence rate of their first two moments to the corresponding moments of the limiting Gaussian variable.
Keywords:
feature selection; relevant factors; MDR-EFE method; forward selection; suboptimal procedures; statistical estimators of the error functional (of a response); regularized estimators; CLT; convergence of estimators' moments

MSC:
62G20; 62H12; 62J02; 62L12
1. Introduction
This paper is dedicated to the eminent scientist Professor A.S. Holevo, academician of the Russian Academy of Sciences, on the occasion of his remarkable birthday.
The classical problem of regression analysis consists in the search for a deterministic function f which, in a certain sense, “well” approximates the observed random variable (response) Y by the value f(X), where X = (X_1, …, X_p) is a vector of factors influencing the behavior of Y. This approach was initiated by the works of A.-M. Legendre and C. F. Gauss. At that time it found application in the processing of astronomical observations. Nowadays one widely uses methods involving the appropriate choice of unknown real coefficients β_0, β_1, …, β_p for a linear model of the form Y = β_0 + β_1 X_1 + … + β_p X_p + ε, where ε describes a random error. Clearly, the intercept β_0 can be included in the collection of factors by setting X_0 ≡ 1; then Y = Σ_{i=0}^{p} β_i X_i + ε. For example, books [1,2] are devoted to regression. Closely related tasks also arise in the classification of observations, see, e.g., [3].
Since the end of the 20th century, stochastic models have been studied where the random response Y depends only on some subset of the factors X_1, …, X_p. Thus, in article [4], the LASSO method (Least Absolute Shrinkage and Selection Operator) was introduced, using the idea of regularization (going back to A. N. Tikhonov), which makes it possible to find the factors entering with non-zero coefficients in a “sparse” linear model. Somewhat earlier, this approach was used by several authors for the treatment of geophysical data. Generalizations of the mentioned method are considered in monograph [5]. We emphasize that the idea of identifying the factors having a principal (in a certain sense) impact on a response is also intensely developing within the framework of nonlinear models. This direction of modern mathematical statistics is called Feature Selection (FS), i.e., the choice of features (variables, factors). In this regard, we refer, e.g., to monographs [6,7,8,9] and also to reviews [10,11,12,13,14]. In [10] the authors consider filter, wrapper and embedded methods of FS. They concentrate on feature elimination and also demonstrate the application of FS techniques to standard datasets. In [11] the modern mainstream dimensionality reduction methods are analyzed, including ones for small samples and those based on deep learning. In [12] FS machinery based on filtering methods is considered for the detection of cyber attacks. Survey [13] is devoted to FS methods in machine learning (the structured information is contained in 20 tables). The authors of [14] concentrate on applications of FS to stock market prediction; applications of FS in the analysis of credit risks are considered, e.g., in [15]. Beyond financial mathematics, the choice of relevant factors is very important in medicine and biology. For instance, in the field of genetic data analysis there is an extensive research area called GWAS (Genome-Wide Association Studies) aimed at studying the relationships between phenotypes and genotypes, see, e.g., [16,17]. The authors of [18] provide a survey of starting (initialization) methods used by genetic algorithms. Review [19] is devoted to FS methods for predicting the risk of diseases. Thus, research in the field of FS is not only of theoretical interest, but also admits various applications.
Note that there are a number of complementary methods for identifying relevant factors. Much attention is paid to those employing the basic concepts of information theory, such as entropy, mutual information, conditional mutual information, interaction information, various divergences, etc. Here the statistical estimation of information characteristics plays an important role; one can mention, e.g., works [20,21]. In this article, the emphasis is placed on identifying a set of relevant factors in the framework of a certain stochastic model, when the quality of the response approximation is evaluated by means of some metric.
Recall that J.B. Herrick described sickle cell anemia (HbS) in 1910. Later it was discovered that all clinical manifestations of the presence of HbS are consequences of a single change in the β-globin gene. This famous example shows that even the search for a single feature having impact on a disease is reasonable. Nowadays researchers concentrate on complex diseases provoked by several disorders of the human genome. Even the identification of two SNPs (single nucleotide polymorphisms) having impact on a certain disease is of interest, see, e.g., [22].
Now we turn to the description of the studied mathematical model. All the considered random variables are defined on a probability space (Ω, F, P). Let a random variable Y map Ω to some finite set 𝕐. We assume that, for each i belonging to T := {1, …, p}, a random variable X_i takes values in an arbitrary finite set 𝕏_i. Then the vector X = (X_1, …, X_p) takes values in 𝕏 := 𝕏_1 × ⋯ × 𝕏_p. For a set U = {i_1, …, i_m} ⊂ T, where i_1 < … < i_m, we put x_U := (x_{i_1}, …, x_{i_m}). Similarly, X_U denotes a vector (X_{i_1}, …, X_{i_m}). A collection of indices S ⊂ T (the symbol ⊂ is everywhere understood as a non-strict inclusion) is called relevant if the following relation holds for any y ∈ 𝕐 and x ∈ 𝕏:

P(Y = y | X = x) = P(Y = y | X_S = x_S)   (1)

whenever P(X = x) > 0. In this case, the set of factors {X_i, i ∈ S} is called relevant as well. If (1) takes place for some S then it will be obviously valid for any set containing this S. Therefore, the natural desire is to identify a set S that satisfies (1) and has minimal cardinality (if such a set other than T exists). Note that there are different definitions of the relevant factors collection, see, e.g., [23,24] and the references therein.
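To illustrate definition (1), consider a toy epistatic example (our own illustrative construction, not taken from the cited works). Let p = 3, let X_1, X_2, X_3 be i.i.d. with P(X_i = 0) = P(X_i = 1) = 1/2, and let Y = X_1 ⊕ X_2 (addition modulo 2). Then, for S = {1, 2} and every admissible x and y,

P(Y = y | X = x) = 1{y = x_1 ⊕ x_2} = P(Y = y | X_S = x_S),

so S is relevant, whereas P(Y = y | X_1 = x_1) = 1/2 for each y; hence neither {1} nor {2} alone determines the conditional law of Y.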
It is assumed that a collection of relevant factors has r elements (r < p); however, the set S itself, which appears in (1), is unknown and should be identified. We label this assumption as (A). There is no restriction that an S satisfying (1) and containing r elements is unique. Usually the joint distribution of (X, Y) is also unknown. Therefore, a statistical estimator of S is constructed based on the first N observations of a sequence (X^{(1)}, Y_1), (X^{(2)}, Y_2), …, consisting of i.i.d. random vectors, where, for each j, (X^{(j)}, Y_j) has the same distribution as the vector (X, Y).
In 2001, the authors of [25] proposed a method for identifying relevant factors, called MDR (Multifactor Dimensionality Reduction). According to article [26], more than 800 publications were devoted to the development of this method and its applications in the period from 2001 to 2014. Research in this direction has continued over the last decade, see, e.g., [27,28,29]. In [30], for the binary response Y, a modification of the MDR method was introduced, namely, MDR-EFE (Error Function Estimation), based on statistical estimates of the error functional of the response prediction using the K-fold cross-validation procedure, see also [31]. Later this method was extended in [32] to study the non-binary response.
Recall how the MDR-EFE method is employed. Let a non-random function f: 𝕏 → 𝕐 be used to predict the response Y by the values of the factor vector X. Further, we exclude from consideration the trivial case when Y takes a single value with probability one (hence, X and Y are independent). The prediction quality is determined by applying the following error functional
where ψ: 𝕐 → R_+ is a penalty function. The functional takes finite values for the discrete X and Y under consideration. The function ψ allows one to take into account the importance of approximating a particular value of Y using f(X).
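In view of this description (and of the indicator averages used for the statistical analogue of the functional in Section 2), the presumable form of (2) is a ψ-weighted misclassification probability:

Err_ψ(f) = E[ψ(Y) 1{f(X) ≠ Y}] = Σ_{y ∈ 𝕐} ψ(y) P(f(X) ≠ y, Y = y);

we use this reading only as a guide; e.g., taking ψ(y) larger for a distinguished value y makes errors on that value costlier.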
In biomedical research, one often considers a binary response Y characterizing the patient’s state of health; say, the value 1 corresponds to illness, and −1 means that the patient is healthy. In many situations the detection of the disease is more important, so the value 1 is attributed more weight. Of interest is also the situation when 𝕐 = {−1, 0, 1}. Then the value 0 describes some intermediate state of uncertainty (“gray zone”). Following [32], we will consider a more general scheme when the set 𝕐 consists of M elements for some M ≥ 2. Lemma 1 in [32] describes for such a model all optimal functions that deliver a minimum to the error functional (2). Note that we can suppose that the set of values of Y is strictly contained in 𝕐, i.e., some values are accepted with zero probability. For such y, we assume that ψ(y) = 0. Thus, it is possible to study Y taking values in an arbitrary finite subset of 𝕐. In order to simplify the notation, we further consider P(Y = y) > 0 for all y ∈ 𝕐.
It is proved that in the framework of model (1) the relation is valid, where, for and , and a function f is constructed in a due way. At the same time, for any such that (♯ denotes the cardinality of a finite set) and S appearing in (1), the following inequality is true:
For U ⊂ T, a function f_U is introduced further. It depends on the joint distribution of (X_U, Y), which is usually unknown. Thus we use observations to construct statistical estimates of the functional applied to f_U, where ♯U = r, and then select as an estimator of S the set U on which the minimum of the corresponding statistical estimate is attained. This approach is described in the next section of the article.
We underline that consideration of all subsets of T having cardinality r in the mentioned comparison procedure (involving regularized estimators, as explained in Section 2) for statistical estimates of the error functional is practically infeasible when p is large and r is moderately large: one would have to compare all C(p, r) = p!/(r!(p − r)!) such subsets. Therefore, a number of suboptimal methods of sequential feature selection have emerged. Such methods are used in various approaches to identify sets of relevant factors.
Mainly, one aims either to sequentially add indexes at each step of the algorithm for constructing a statistical estimator of a set S appearing in (1), or to sequentially exclude features from the general set T. In [33], algorithms of forward selection, i.e., sequential addition of indexes to the initial set, based on information theory, are considered. The authors of [33] show that the various algorithms employed can be interpreted as procedures based on proper approximations of a certain objective function. In [34] the principal attention is paid to simple models describing the phenomenon of epistasis observed in genetics, when individual factors do not affect the response while some combinations of them lead to essential effects (in statistics one speaks of “synergy interaction” of factors). Besides, we also demonstrated that a number of well-known algorithms, for instance, mRMR (Minimum Redundancy Maximum Relevance), using mutual information and/or interaction information with a sequential procedure for selecting relevant factors, can lead to the identification of the desired set with a probability which is negligibly small. In [35] a variant is proposed for sequential (forward) application of the MDR-EFE method within the binary response model involving the naive Bayesian classifier scheme. The latter means that, for any y ∈ 𝕐 and all x ∈ 𝕏, the following relation holds:

P(X = x | Y = y) = ∏_{i=1}^{p} P(X_i = x_i | Y = y).   (4)
In other words, the factors are conditionally independent for a given response Y. In [35] the joint distribution of X and Y was assumed known.
The principal goal of our work is to prove that, for a random response which is in general non-binary, the sequential selection of features based on the (forward) application of the MDR-EFE method, without assuming the validity of (4), leads with high probability to identifying the suboptimal set that would be constructed by means of the same method if the joint distribution of the response and the vector of factors were known.
This result builds on the central limit theorem (CLT) for statistical estimates of the prediction error functional for a possibly non-binary response, proved in [32], which extends the CLT for the binary response model studied by the author previously. In addition, for the purposes of this work, we find the convergence rate of the first two moments of the considered statistics to the corresponding moments of the limiting Gaussian variable as the number of observations tends to infinity.
The article has the following structure. Section 2 describes statistical estimates of the error functional (for a response prediction) based on the MDR-EFE method. We also introduce the regularized versions of these estimators. In Section 3, the convergence rate of the first two moments of the regularized estimators of the error functional to the corresponding moments of the limiting Gaussian variable is established. Section 4 contains the main result related to the forward selection of relevant factors. The concluding remarks are given in Section 5. The proof of elementary Lemma 2 is provided in Appendix A for completeness of exposition.
2. Error Functional Estimators
Consider, in general, a non-binary response, i.e., let ♯𝕐 = M for some M ≥ 2. In the framework of the introduced discrete model, Lemma 1 of [32] gives a complete description of the class of optimal functions providing the minimum of the error determined by (2) in the class of all functions f: 𝕏 → 𝕐. To define such a function (included in the optimal class) at a point x ∈ 𝕏, we deal with a vector having components
It can be easily seen that
where , is a column of matrix Q having elements (the element is located in the upper left corner of the matrix Q), ⊤ stands for the transposition of column vectors. In other words, one employs in (5) the scalar product of the vectors and . Thus, search for an optimal function means finding the partition of into such sets , , that provide the minimum value of the right-hand side of (5). Note also that, according to Formula (13) of [32], the error of response prediction can be written as follows:
Let, for , the vector have the first components equal to 1, and the remaining components equal to . For any , we introduce a vector with components having the form
According to formula (11) of [32] one infers that
The joint distribution of is, in general, unknown. Therefore, the optimal function cannot be found in practice, so an algorithm is used to predict it, i.e., to approximate by means of specified statistical estimators. The response prediction algorithm is defined as a function given for and a set of observations
The function takes values in the set . It is assumed that the value of becomes close, in a certain sense, to for x in a specified subset of the set when W is sufficiently “massive”. More precisely, we consider a family of functions that depend on sets of different cardinalities, but we will not complicate the notation. Consider . For , and , introduce a vector with components
Set
For , let be defined by means of a counterpart of formula (8), where is now written instead of . Then, according to Section 5 of [32] (the notation is used there instead of U), in the framework of model (1), the optimal function , where S appears in (1) and . Therefore relation (3) is valid for corresponding to any with (the assumption (A) holds).
To introduce an algorithm for predicting the function , we employ statistical estimators of the penalty function , as well as the values , where , , . Consider
In the case of a binary response, such a choice of the penalty function was proposed in [36]; the justification for this choice is given in [31], see also Section 4 in [32]. For the specified function and observations (X^{(j)}, Y_j), j ∈ W, where W is a finite subset of N, we use
where the frequency estimator of a probability has the form
It is not difficult to see that the strong law of large numbers for arrays of random variables (see, e.g., [37]) entails for finite sets , such that , the relation
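Concerning the choice (11): in the binary case, the balanced accuracy criterion of [36] corresponds to the penalty ψ(y) = 1/(2 P(Y = y)), and its natural M-valued analogue, presumably the content of (11), is ψ(y) = 1/(M P(Y = y)), y ∈ 𝕐, so that each response value is weighted inversely proportionally to its probability, and the estimator of ψ replaces P(Y = y) by a relative frequency. A minimal sketch of such frequency estimation follows; the function names and data layout are illustrative assumptions rather than notation of [32].

```python
from collections import Counter

def freq_estimates(xs, ys, U):
    """Frequency estimators of P(Y = y) and P(X_U = x_U, Y = y)
    from observations (xs[j], ys[j]), j = 0, ..., N-1."""
    N = len(ys)
    cnt_y = Counter(ys)
    cnt_xy = Counter((tuple(x[i] for i in U), y) for x, y in zip(xs, ys))
    p_y = {y: c / N for y, c in cnt_y.items()}
    p_xy = {k: c / N for k, c in cnt_xy.items()}
    return p_y, p_xy

def psi_hat_factory(p_y):
    """Penalty estimator of the (assumed) form 1 / (M * P(Y = y))."""
    M = len(p_y)  # number of observed response values
    return lambda y: 1.0 / (M * p_y[y])

# toy data: p = 3 binary factors, ternary response
xs = [(0, 1, 0), (1, 1, 0), (0, 0, 1), (1, 0, 1), (0, 1, 1)]
ys = [1, -1, 0, 1, -1]
p_y, p_xy = freq_estimates(xs, ys, U=(0, 2))
psi_hat = psi_hat_factory(p_y)  # e.g. psi_hat(0) = 1 / (3 * 0.2)
```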
Let the prediction algorithm for a function be constructed by means of an analogue of formula (8), where, for the corresponding arguments, one now uses the statistical estimators of the functions introduced in (10). Namely, let us define the following random variables:
where is an estimator of appearing in (12). For , , , set
Replace the value in (8) by . Then one can claim that
For , , we take a partition of a set into subsets
here K ∈ N, ⌊·⌋ denotes the integer part of a number, and 1_A is the indicator of a set A. These sets are applied in the K-fold cross-validation procedure, which increases the stability of statistical inference (cross-validation procedures are studied, e.g., in [38]). Following [32], the estimator of the functional, i.e., a statistical estimator of the prediction error functional for a function and observations, involving the K-fold cross-validation procedure, is given by the formula:
where the quantities involved are evaluated according to (12) for the corresponding indices. The estimator (17) is a natural statistical analogue of the error functional (2) written in the form (6) when one employs the K-fold cross-validation procedure. Namely, instead of ψ we apply its statistical estimator of the type (12), and instead of f we use its approximation by means of the prediction algorithm based on a part of the observations. To obtain statistical estimators of the probability appearing in Formula (6), we take the corresponding averages of indicator functions. One also employs averaging over different parts of the observations.
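To fix ideas, here is a schematic (and deliberately simplified) version of such a K-fold cross-validated error estimate; fit_predictor stands for the prediction algorithm of this section and psi_hat for a penalty estimator of the type (12), both being hypothetical placeholders, and the consecutive-block splitting merely imitates the partition described above.

```python
def kfold_error_estimate(xs, ys, U, K, psi_hat, fit_predictor):
    """Average, over K held-out blocks, of the psi-weighted
    misclassification frequency of a predictor fitted on the rest."""
    N = len(ys)
    block = N // K
    total = 0.0
    for k in range(K):
        test = list(range(k * block, (k + 1) * block if k < K - 1 else N))
        test_set = set(test)
        train = [j for j in range(N) if j not in test_set]
        # fit the prediction algorithm on the training part only
        f = fit_predictor([xs[j] for j in train], [ys[j] for j in train], U)
        # psi-weighted misclassification frequency on the held-out block
        total += sum(psi_hat(ys[j]) * (f(xs[j]) != ys[j]) for j in test) / len(test)
    return total / K
```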
By Theorem 2 of [32], if S is a set of relevant factors, i.e., (1) holds, then, for each and any set , the following inequality takes place almost surely for all N large enough:
Thus, it is natural to consider all subsets of the required cardinality and choose, as a statistical estimator of a relevant collection of indices, a set U on which the minimum of the estimates is attained. Here we also note that, for the study of asymptotic properties of the error functional estimators, the regularization of the prediction algorithm by means of a sequence of positive numbers (ε_N) such that ε_N → 0, as N → ∞, plays an important role. Namely, we define
As in article [32], we assume that
Now we introduce a statistical estimator using an analogue of Formula (17), where one employs the regularized quantities instead of the original ones. For the regularized statistical estimators, as mentioned in [32], the analogue of Formula (18) holds. In [32], the CLT is established for estimators constructed when condition (20) is met. In the next section we apply a slightly different regularization for the error functional estimates, which will permit us to specify the convergence rate of the first two moments of these estimators to the corresponding moments of the limiting Gaussian variable. This result is not only of independent interest, but is also applied in Section 4.
3. Asymptotic Behavior of the First Two Moments of Statistical Estimators of the Error Functional
As noted in Section 2, we will use the penalty function (11). Therefore, for , as a strongly consistent estimator of we will employ the variable appearing in (12), denoted below as , where , , . Recall that the estimator is defined by formula (2). If the regularized version is substituted into this estimator instead of , where and , then the notation is used. We will apply the following Corollary 3 of [32] established in the framework of a model satisfying (1).
Theorem 1 ([32]).
It is known that the convergence in distribution of random variables does not, in general, ensure the convergence of their moments, even when these moments exist. We manage to establish the convergence rate of the first two moments of the error functional statistical estimators to the corresponding moments of the limit random variable. For this purpose we slightly strengthen the regularization condition on the estimates. We require that a sequence (ε_N) satisfies the following condition:
Lemma 1.
Proof of Lemma 1.
Let us fix an arbitrary set . For each one has
where
are defined by means of (12) for , , . The proof is divided into several steps.
Step 1. At first we consider
To simplify the notation, we do not write that also depends on K, and . Our aim is to show that if (23) holds then
Taking into account (29), by Theorem 5.4 of [39], relation (28) holds if (and only if) the sequence is uniformly integrable. Due to the de la Vallée Poussin theorem (see, e.g., Theorem 1.3.4 of [40]), it is sufficient to verify that
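Recall the de la Vallée Poussin criterion in the form needed here: a family of random variables {η_α} is uniformly integrable provided that

sup_α E G(|η_α|) < ∞ for some non-decreasing G: R_+ → R_+ with G(t)/t → ∞ as t → ∞;

in particular, it suffices that sup_α E|η_α|^{1+δ} < ∞ for some δ > 0 (take G(t) = t^{1+δ}).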
For , , , and we introduce the following random variables:
where, for , is defined by Formula (9). Write , here
Now note that, for any real numbers a_1, …, a_n, every n ∈ N and an arbitrary t > 1, the Hölder inequality implies that

|a_1 + … + a_n|^t ≤ n^{t−1} (|a_1|^t + … + |a_n|^t).   (32)

Evidently, (32) is true for t = 1 as well. Consequently, we get
Clearly, for all , , and , one has
where the functions appearing in (34) were introduced in Section 2. For any and , the inequalities , are satisfied if and only if, for arbitrary such that , as , and all sufficiently large , the following inequalities are valid: , (the analogous statement is true for inequalities corresponding to coordinates and in Formula (19)). Obviously,
where is defined in (12). One has
For , , and , consider the following event
where (we assumed that for ). More precisely one can write . We will not include a set U in the list of arguments since this set is fixed. Then, for , in view of (35), we get
Then by virtue of (37), for any and all N large enough, i.e., for , one has
and hence the following relation holds
Taking into account that the sets and have finite cardinalities, we ascertain that, for any , and all N large enough, for , one has
Consequently, for any , , , , where , , for all N large enough (i.e., ), the following inequality holds:
Applying (32) we come to the inequality
Let denote the summation over all for . For one has
here we employ (40) and take into account that . We see that
For , and , introduce the functions
It is known (see, e.g., formula (15) in Chap. VI of [41]) that if g is a bounded Borel function and ξ, η are independent random vectors taking values in R^m and R^n, respectively, then a.s.

E(g(ξ, η) | η) = G(η), where G(y) := E g(ξ, y), y ∈ R^n.
Due to independence of , , we can apply the lemma on grouping random vectors (see, e.g., [42], p. 28) to get the relation
By the Rosenthal inequality (see, e.g., Theorem 2.9 of [43]), for independent centered random variables ξ_1, …, ξ_v having E|ξ_j|^t < ∞ for some t ≥ 2 and each j = 1, …, v, one has

E|ξ_1 + … + ξ_v|^t ≤ C(t) ( Σ_{j=1}^{v} E|ξ_j|^t + ( Σ_{j=1}^{v} E ξ_j² )^{t/2} ),   (42)

where C(t) depends on t but does not depend on v and the distributions of the variables ξ_1, …, ξ_v.
Set , . Note that for all . Then according to (42) we come to the inequality
where and . Hence, applying (32) for and , one has
Evidently, we can write
Let , where , . Set , where , . Clearly, depends on , y and U. Random variables are identically distributed for . Therefore , but does not depend on q. If , then the variables are a.s. equal to some constant. According to (36), the occurrence of the corresponding event would mean that a variable which equals zero a.s. exceeds a positive level. Therefore, in the degenerate case one has
and for all . Consider now the case when . Then we get
where appeared in (36).
Now we employ the Berry–Esseen estimate of the convergence rate in the CLT for i.i.d. random variables. Let ξ_1, …, ξ_v be i.i.d. random variables such that Eξ_1 = μ, var ξ_1 = σ² ∈ (0, ∞) and E|ξ_1 − μ|³ < ∞. We write F for the distribution function of ξ_1, and F_v stands for the distribution function of (ξ_1 + … + ξ_v − vμ)/(σ√v). Then (see, e.g., Theorem 5.4 of [43]), for any x ∈ R,

|F_v(x) − Φ(x)| ≤ C₀ E|ξ_1 − μ|³ / (σ³ √v),   (43)

where Φ is the distribution function of a standard normal random variable and C₀ is a positive constant (C₀ depends neither on the distribution of ξ_1 nor on v). According to [44], one can take C₀ = 0.4693. Consequently, we have
since for , where .
It is well-known (see, e.g., formula (29) of Chap. II of [41]), that, for , the following inequality is true:
Therefore, by virtue of the bound 1/4 for the variance of an indicator and as
we can write under condition (23) that
and does not depend on N.
Introduce
where one considers only strictly positive . Then obviously , as there exists only a finite collection of different variants. Thus in view of (44), for all x, y and U under consideration, one has
where appeared in (43) and does not depend on N.
Therefore, if condition (23) is satisfied then, for all , , and , the following inequality holds:
where does not depend on , k and N. Hence, in view of (44) we come to the relation
where does not depend on x, y, k and N. Thus according to (41), for all N large enough, we have proved the inequality
where does not depend on N.
In a similar way (taking into account (42) and (45)), for , , , and all N large enough, we get
where is introduced in (31), and does not depend on N.
We will employ an elementary result for the Bernoulli scheme. Let ζ_1, ζ_2, … be a sequence of i.i.d. random variables such that P(ζ_1 = 1) = p and P(ζ_1 = 0) = 1 − p, where p ∈ (0, 1]. Consider the following frequency estimator of the probability p:

p̂_N := (ζ_1 + … + ζ_N)/N, N ∈ N.   (48)
Define
Lemma 2.
For the Bernoulli scheme introduced above and the estimators provided by formula (48), for each , the following relation holds:
More precisely, the absolute value of the function in the left-hand side of (49), for all , admits a bound where for and .
For the sake of completeness the proof of this result is given in Appendix A.
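Although the displays of Lemma 2 are not reproduced above, its role is to control moments of the reciprocal frequency estimator. The following Monte-Carlo snippet illustrates one plausible reading (our guess at the flavor of (49), not the lemma's exact statement): E[(p/p̂_N)^s 1{p̂_N > 0}] → 1 as N → ∞.

```python
import random

def mc_reciprocal_moment(p, N, s, trials=20000, seed=0):
    """Monte-Carlo estimate of E[(p / p_hat_N)^s ; p_hat_N > 0] for the
    frequency estimator p_hat_N of a Bernoulli(p) probability."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        k = sum(rng.random() < p for _ in range(N))  # successes among N trials
        if k > 0:
            acc += (p * N / k) ** s                  # (p / p_hat_N)^s
    return acc / trials

for N in (50, 200, 800):
    print(N, round(mc_reciprocal_moment(0.3, N, s=2), 4))  # approaches 1 as N grows
```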
Now we continue the proof corresponding to Step 1. For all considered k, i, y and any , the Cauchy–Bunyakovsky–Schwarz inequality yields
Due to Lemma 2 one has , . Employing the Minkowski inequality (to take into account the summation over i, y, k), for all , we come to the bound
where does not depend on N.
Consequently, by virtue of (33), (46) and (50) the uniform integrability of a sequence is established. Thus (28) is verified.
Step 2. Now we study the asymptotic behavior of the variables , as , where and are given by Formulas (26) and (27), respectively. For , , , we set . One has
where
The purpose of the second step is to prove that
For , and introduce
The Cauchy-Bunyakovsky-Schwarz inequality yields
For each considered N, y, i and k, the variables are independent and , so by virtue of the Rosenthal inequality (42) we obtain
Taking into account Lemma 2 for and in view of (44), for each , we get the relation
Therefore, the goal of the second step has been achieved.
Step 3. The implementation of steps 1 and 2 permits to reduce the study of the asymptotic behavior (as ) of given by Formula (25) to the study of variables
where is defined by Formula (51).
The aim of the third step is to prove that , as , where is the variance of the random variable appearing in Formula (22).
Along the way, we will show that the sum of a certain part of the terms in a specified representation of the variables does not affect (in the L² sense) the limit behavior of these variables for growing N. For and , where , we introduce the event
where is defined according to (13). Then, in view of the independence of observations we have
If then . Set
where and an event is introduced by Formula (53). Then
since for , and because all for each , stands for an integer part of a number.
We verify that for large N is approximated in the space L² by the random variable
where and was introduced by (13) for and . Evidently, for all k, i, y and N under consideration. Consequently, it follows that
where
For any considered k, i, y and N, the Cauchy–Bunyakovsky–Schwarz inequality implies that
The Rosenthal inequality (42) yields that . By means of Lemma 2 (for and multipliers with ), for all considered i, y, k and any we come to the bound
Therefore, as .
Let us define the variable by a formula similar to the previous one, but without the multiplier . In view of (44) it is easily seen that
Thus as , where
The variables are centered, i.i.d. and uniformly bounded for all j (clearly, ). For each , the distributions of and coincide, where is introduced in (22). Thus, one has
According to the lemma on grouping independent random variables, for each , the variables , , are independent. Since as , for , we come to the relation
as . Hence , . The goal of the third step has been achieved.
In view of the above approximations (in L²) of the initial random variables introduced by (25), we conclude that , as . Namely, we apply the following elementary statement: if and then , as . Therefore, (24) is established. The proof of Lemma 1 is complete. □
Further we will also employ a result that immediately follows from Theorem 1.
Corollary 1.
Let the conditions of Lemma 1 be satisfied. Then the following relations hold:
where is the variance of the random variable introduced in (22).
Proof.
Note that (59) can be obtained directly under conditions of Lemma 1. For each and any , according to Lindeberg’s theorem applied to arrays of centered i.i.d. uniformly bounded summands, where a sequence is introduced in (55), taking into account (56) one has
4. Forward Selection of Relevant Factors
Now we can turn to the sequential selection of factors based on the MDR-EFE method. At the first step one searches for a point where the function attains its minimum over all admissible arguments. If there are several such points, then we take, e.g., the one with the smallest index value. Recall that, according to (17) (more precisely, after regularization), the random variable is in fact a function of the observations, which provides a forecast of the function . Then this procedure is repeated: namely, if at the (k−1)-th step the set has been constructed, then the next index is selected at step k in such a way that the function under consideration takes the minimum value. It is convenient to assume that an empty set is taken at the zero step. Then at each next step one new element is added to the previously constructed set. If at some step there are several minimum points of the considered function, then we take only one of them, e.g., with the minimal index.
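Schematically, the procedure looks as follows; this is a sketch reusing the kfold_error_estimate stub from Section 2 (with the same hypothetical placeholders), and ties are broken by the smallest index, as stipulated above.

```python
def forward_selection(xs, ys, r, K, psi_hat, fit_predictor, p):
    """Greedy forward selection of r factor indices: at each step add the
    candidate index minimizing the cross-validated error estimate."""
    selected = []
    for _ in range(r):
        candidates = [i for i in range(p) if i not in selected]
        # min() returns the first minimizer, i.e. the smallest such index
        best = min(candidates,
                   key=lambda i: kfold_error_estimate(
                       xs, ys, tuple(selected + [i]), K, psi_hat, fit_predictor))
        selected.append(best)
    return selected
```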
Thus, for each the random sets arise, where and , . By construction one can write
where and . In other words the choice at step k means that, for ,
moreover, , . If the joint distribution of X and Y is known, then instead of the described scheme for constructing random sets, we turn to considering the non-random “oracle” sets , where ,
, and the functional is introduced by formula (2). If there are several points satisfying (63), we take among them the one with the minimal index value.
For and introduce
By construction of the sets we have , where and . We call a model satisfying condition (1) regular whenever the following relation is true:
In other words, for each , a point in (63) is determined uniquely. Further we employ the penalty function introduced in (11). We also use its strongly consistent estimate of type (48) with
and as .
Theorem 2.
Let the considered model (1), with a collection of relevant factors having cardinality r, be regular, i.e., let (64) take place. Then, for the random sets introduced above, the following relation is valid:
where is defined by means of (63) for . In other words, with probability close to one, the described procedure of forward selection based on statistical estimates of the error functional leads to the “oracle” collection , when N is large enough.
Proof.
For a random set , where is an element taken at k-th step, one has
Note that
where
. Thus, we obtain:
where, as usual, for . Then, for , and , we get
For , set
For any , and each in light of formula (57) of Corollary 1, for all N large enough () it holds
where are introduced in (66), is defined by (68).
Applying the Bienaymé–Chebyshev inequality and taking into account Formula (58) of Corollary 1, for each and any , we come, for a centered random variable , to the relation
where is determined by Formula (22). According to (64), for and , one has . Therefore, for all N large enough (), the following inequality takes place:
For a fixed , one can change the summation order over i and y to write Formula (22) as follows:
where
Thus, for any , one has
Consequently, we come to the inequality
where . We see that for all , and . For each , any , and all N large enough, we get the following bound:
Hence, for each and all N large enough, by virtue of (67) the following inequality holds:
where according to (64). Thus relation (72) implies the validity of (66). □
Now note that according to (69) the following relation is true:
The question arises whether this probability decreases like , where C is a positive constant, or more rapidly. The answer depends on the variance of the random variable given by Formula (22). In view of (70) we will determine when this variable is degenerate, i.e., equal to a constant a.s. This is also of independent interest for the CLT established in Section 6 of [32] and given above as Theorem 1. The following result provides a simple characterization of the degeneracy.
Lemma 3.
For an arbitrary set , the variance of the random variable , appearing in Formula (22), is zero if and only if, for every , there is such that
Thus, for each , on the set the random variable does not necessarily take a constant value. Moreover, the values of need not coincide for different y.
Proof.
For and a random variable , introduced by Formula (71), one can write
In a similar way we consider . Thus, for all , one gets
Recall that for all . If, for some , , we have
then on the events and the variable takes different values. Therefore, takes different values on these events. Hence , if (74) is not valid. Thus (74) is a necessary condition to guarantee that . Suppose now that (74) holds. In this case we get
Clearly, depends on U as well. We see that on each set takes (up to the set of measure zero) the value , . Therefore, . Note that need not coincide for different . The proof is complete. □
5. Concluding Remarks
The established asymptotic result (Theorem 2) is rather qualitative in nature, since relation (66) assumes increasing values of N. Relation (72) is more precise. However, (72) demonstrates that, loosely speaking, one has to employ a rather large number of observations. As previously, we assume that assumption (A), introduced in Section 1, is valid. Evidently, the sequential choice of relevant variables based on statistical estimators of the error functional (of the response approximation) is attractive for implementation, although suboptimal. In this regard, Theorem 2 shows that, under certain conditions, forward (random) selection with high probability leads to the same collection of factors as is provided by the sequential procedure with known joint distribution of the vector of factors X and the response Y. In future work, it would be reasonable to supplement the theoretical results with computer simulations (see, e.g., [45]).
Consideration of the proximity of the results of optimal and suboptimal procedures requires a separate study. In addition, we note that, within the framework of linear models, estimates of the probability of correct identification of relevant factors are considered, e.g., in [46,47]. Theorem 2 does not assume the linearity of the stochastic model. Presumably for the first time, our work treats forward selection of relevant factors affecting a non-binary random response on the basis of the MDR-EFE method. It would be interesting to extend the conditions allowing one to establish relation (66). Moreover, stability problems of FS deserve special attention, see, e.g., [48,49,50]. The stability of algorithms for classification problems in the framework of random trees is treated in [51].
Finally, we emphasize that the problem of statistical estimation of the cardinality of a set of relevant factors appearing in definition (1) is very important and complex. Along with dealing with a deterministic number of selected factors, there is a research approach based on developing stopping rules for the procedures used to identify the relevant set. In this regard, we indicate, e.g., article [52], dedicated to information-theoretic methods for selecting relevant factors. The study of non-discrete stochastic models is also of undoubted interest, see, e.g., [53].
Further, it would be interesting to study functionals other than (2) for measuring the quality of a response approximation by means of functions defined on various collections of factors. One can also consider a random number of observations. In this regard we refer, e.g., to [27,54].
Funding
This research received no external funding.
Data Availability Statement
Data are contained within the article.
Acknowledgments
The author is very grateful to the Reviewers for careful reading of the manuscript and for valuable remarks and suggestions. He would also like to thank Alexander Tikhomirov for the invitation to present the manuscript for this issue.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Proof of Lemma 2
Proof.
For any and , one has
where , as , and . We do not use the explicit formulas . Note that
where and, for , one has
For each , introduce
Obviously, one can write , as
for all , , and since
Consequently, for any , we get
where , as . Evidently, for . For each , set . Thus, for , one has
because
and
The proof of Lemma 2 is complete. □
References
- Seber, G.A.F.; Lee, A.J. Linear Regression Analysis, 2nd ed.; John Wiley and Sons: Hoboken, NJ, USA, 2003.
- Györfi, L.; Kohler, M.; Krzyżak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: New York, NY, USA, 2002.
- Matloff, N. Statistical Regression and Classification. From Linear Models to Machine Learning; CRC Press: Boca Raton, FL, USA, 2017.
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288.
- Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity. The Lasso and Generalizations; CRC Press: Boca Raton, FL, USA, 2015.
- Bolón-Canedo, V.; Alonso-Betanzos, A. Recent Advances in Ensembles for Feature Selection; Springer: Cham, Switzerland, 2018.
- Giraud, C. Introduction to High-Dimensional Statistics; CRC Press: Boca Raton, FL, USA, 2015.
- Stańczyk, U.; Zielosko, B.; Jain, L.C. (Eds.) Advances in Feature Selection for Data and Pattern Recognition; Springer International Publishing AG: Cham, Switzerland, 2018.
- Kuhn, M.; Johnson, K. Feature Engineering and Selection. A Practical Approach for Predictive Models; CRC Press: Boca Raton, FL, USA, 2020.
- Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28.
- Jia, W.; Sun, M.; Lian, J.; Hou, S. Feature dimensionality reduction: A review. Complex Intell. Syst. 2022, 8, 2663–2693.
- Lyu, Y.; Feng, Y.; Sakurai, K. A survey on feature selection techniques based on filtering methods for cyber attack detection. Information 2023, 14, 191.
- Dhal, P.; Azad, C. A comprehensive survey on feature selection in the various fields of machine learning. Appl. Intell. 2022, 52, 4543–4581.
- Htun, H.H.; Biehl, M.; Petkov, N. Survey of feature selection and extraction techniques for stock market prediction. Financ. Innov. 2023, 9, 26.
- Laborda, J.; Ryoo, S. Feature Selection in a Credit Scoring Model. Mathematics 2021, 9, 746.
- Emily, M. A survey of statistical methods for gene-gene interaction in case-control genome-wide association studies. J. Soc. Fr. Stat. 2018, 159, 27–67.
- Tsunoda, T.; Tanaka, T.; Nakamura, Y. (Eds.) Genome-Wide Association Studies; Springer: Singapore, 2019.
- Luque-Rodriguez, M.; Molina-Baena, J.; Jimenez-Vilchez, A.; Arauzo-Azofra, A. Initialization of feature selection search for classification. J. Artif. Intell. Res. 2022, 75, 953–998.
- Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.W.; O’Sullivan, J.M. A review of feature selection methods for machine learning-based disease risk prediction. Front. Bioinform. 2022, 2, 927312.
- Coelho, F.; Braga, A.P.; Verleysen, M. A mutual information estimator for continuous and discrete variables applied to feature selection and classification problems. Int. J. Comput. Intell. Syst. 2016, 9, 726–733.
- Kozhevin, A.A. Feature selection based on statistical estimation of mutual information. Sib. Elektron. Mat. Izv. 2021, 18, 720–728.
- Latt, K.Z.; Honda, K.; Thiri, M.; Hitomi, Y.; Omae, Y.; Sawai, H.; Kawai, Y.; Teraguchi, S.; Ueno, K.; Nagasaki, M.; et al. Identification of a two-SNP PLA2R1 haplotype and HLA-DRB1 alleles as primary risk associations in idiopathic membranous nephropathy. Sci. Rep. 2018, 8, 15576.
- Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186.
- AlNuaimi, N.; Masud, M.M.; Serhani, M.A.; Zaki, N. Streaming feature selection algorithms for big data: A survey. Appl. Comput. Inform. 2022, 18, 113–135.
- Ritchie, M.D.; Hahn, L.W.; Roodi, N.; Bailey, L.R.; Dupont, W.D.; Parl, F.F.; Moore, J.H. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet. 2001, 69, 138–147.
- Gola, D.; John, J.M.M.; van Steen, K.; König, I.R. A roadmap to multifactor dimensionality reduction methods. Briefings Bioinform. 2016, 17, 293–308.
- Bulinski, A.; Kozhevin, A. New version of the MDR method for stratified samples. Stat. Optim. Inf. Comput. 2017, 5, 1–18.
- Abegaz, F.; van Lishout, F.; Mahachie, J.J.M.; Chiachoompu, K.; Bhardwaj, A.; Duroux, D.; Gusareva, R.S.; Wei, Z.; Hakonarson, H.; Van Steen, K. Performance of model-based multifactor dimensionality reduction methods for epistasis detection by controlling population structure. BioData Min. 2021, 14, 16.
- Yang, C.H.; Hou, M.F.; Chuang, L.Y.; Yang, C.S.; Lin, Y.D. Dimensionality reduction approach for many-objective epistasis analysis. Briefings Bioinform. 2023, 24, bbac512.
- Bulinski, A.; Butkovsky, O.; Sadovnichy, V.; Shashkin, A.; Yaskov, P.; Balatskiy, A.; Samokhodskaya, L.; Tkachuk, V. Statistical Methods of SNP Data Analysis and Applications. Open J. Stat. 2012, 2, 73–87.
- Bulinski, A. On foundation of the dimensionality reduction method for explanatory variables. J. Math. Sci. 2014, 199, 113–122.
- Bulinski, A.V.; Rakitko, A.S. MDR method for nonbinary response variable. J. Multivar. Anal. 2015, 135, 25–42.
- Macedo, F.; Oliveira, M.R.; Pacheco, A.; Valadas, R. Theoretical Foundations of Forward Feature Selection Methods based on Mutual Information. Neurocomputing 2019, 325, 67–89.
- Bulinski, A.V. On relevant feature selection based on information theory. Theory Probab. Its Appl. 2023, 68, 392–410.
- Rakitko, A. MDR-EFE method with forward selection. In Proceedings of the 5th International Conference on Stochastic Methods (ICSM-5), Moscow, Russia, 23–27 November 2020.
- Velez, D.R.; White, B.C.; Motsinger, A.A.; Bush, W.S.; Ritchie, M.D.; Williams, S.M.; Moore, J.H. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet. Epidemiol. 2007, 31, 306–315.
- Hu, T.-C.; Móricz, F.; Taylor, R. Strong laws of large numbers for arrays of rowwise independent random variables. Acta Math. Hung. 1989, 54, 153–162.
- Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79.
- Billingsley, P. Convergence of Probability Measures; John Wiley and Sons: New York, NY, USA, 1968.
- Borkar, V.S. Probability Theory: An Advanced Course; Springer: New York, NY, USA, 1995.
- Bulinski, A.V.; Shiryaev, A.N. Theory of Stochastic Processes, 2nd ed.; Fizmatlit: Moscow, Russia, 2005. (In Russian)
- Kallenberg, O. Foundations of Modern Probability; Springer: New York, NY, USA, 1997.
- Petrov, V.V. Limit Theorems of Probability Theory: Sequences of Independent Random Variables; Clarendon Press: Oxford, UK, 1995.
- Shevtsova, I.G. On absolute constants in the Berry-Esseen inequality and its structural and non-uniform refinements. Informatics Its Appl. 2013, 7, 124–125.
- Bulinski, A.V.; Rakitko, A.S. Simulation and analytical approach to the identification of significant factors. Commun. Stat.-Simul. Comput. 2016, 45, 1430–1450.
- Shah, R.D.; Samworth, R.J. Variable selection with error control: Another look at stability selection. J. R. Stat. Soc. B 2012, 74, 1–26.
- Beinrucker, A.; Dogan, U.; Blanchard, G. Extensions of stability selection using subsamples of observations and covariates. Stat. Comput. 2016, 26, 1059–1077.
- Nogueira, S.; Sechidis, K.; Brown, G. On the stability of feature selection algorithms. J. Mach. Learn. Res. 2018, 18, 1–54.
- Khaire, U.M.; Dhanalakshmi, R. Stability of feature selection algorithm: A review. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 1060–1073.
- Bulinski, A. Stability properties of feature selection measures. Theory Probab. Appl. 2024, 69, 3–15.
- Bénard, C.; Biau, G.; Da Veiga, S.; Scornet, E. SIRUS: Stable and Interpretable RUle Set for classification. Electron. J. Stat. 2021, 15, 427–505.
- Mielniczuk, J. Information theoretic methods for variable selection—A review. Entropy 2022, 24, 1079.
- Linke, Y.; Borisov, I.; Ruzankin, P.; Kutsenko, V.; Yarovaya, E.; Shalnova, S. Universal Local Linear Kernel Estimators in Nonparametric Regression. Mathematics 2022, 10, 2693.
- Rachev, S.T.; Klebanov, L.B.; Stoyanov, S.V.; Fabozzi, F.J. The Methods of Distances in the Theory of Probability and Statistics; Springer: New York, NY, USA, 2013.