Abstract
Feature screening procedures for ultra-high dimensional longitudinal data have been widely studied, but most require model assumptions, and their screening performance may deteriorate when the model is misspecified. To resolve this problem, we introduce a new model-free method that performs feature screening through sample splitting and data aggregation. Distance correlation is used to measure the association at each time point separately, while longitudinal correlation is modeled by a specific cumulative distribution function to achieve efficiency. In addition, we extend this new method to handle situations where the predictors are correlated. Both methods possess excellent asymptotic properties and can handle longitudinal data with unequal numbers of repeated measurements and unequal intervals between measurement time points. Compared with other model-free methods, the two new methods are relatively insensitive to within-subject correlation and reduce the computational burden when applied to longitudinal data. Finally, we use simulated and empirical examples to show that both new methods achieve better screening performance.
1. Introduction
Longitudinal data, characterized by repeated measurements of the same subjects over time, are common in many scientific fields, including public health, psychology, and sociology. Traditional methods, such as linear mixed models (LMMs), are widely used to analyze longitudinal supervised problems. However, as the number of predictors, p, grows, these methods are limited by “the curse of dimensionality”. Even when the sample size, n, appears large, it may still be far smaller than p. This is commonly called the ultra-high-dimensional setting, defined by Fan and Lv [] as $\log p = O(n^a)$ for some $a \in (0, 1/2)$. Meanwhile, technological developments have made it possible to obtain high-throughput data through repeated measurements of subjects. It is therefore essential to develop new methods capable of handling such data.
Since only a few predictors are associated with the response in the ultra-high-dimensional setting (i.e., sparsity), feature screening is a common way to handle these problems. For ultra-high-dimensional longitudinal data, sure independence screening was first proposed for varying-coefficient models [], and subsequent studies extended it to partially linear models [], additive models [], and rank regression models []. However, these methods rely on parametric or semi-parametric models, which may be insufficient for accurate model specification in the big data era. To address this, we agree with Zhu et al. [] that model-free methods are more appealing, as they rely only on a general model framework.
Currently, research on model-free methods for cross-sectional data has made great progress. The primary approach of these methods is to measure marginal dependence, for example via distance correlation (DC) [], martingale difference correlation [], conditional distance correlation (CDC) [], and the covariate information number []. Recently, some studies have also extended these methods to survival data and other data types by improving procedures originally designed for cross-sectional data [,,]. However, studies applying model-free feature screening procedures to longitudinal data remain rare, the main challenge being that repeated measurements within subjects are correlated.
Since Székely et al. [] proposed DC and Wang et al. [] proposed CDC, both measures have been widely applied in feature screening; e.g., the studies by Li et al. [] and Chen [] are based on DC, while those by Wen et al. [] and Lu and Lin [] are based on CDC. Consequently, the aim of this study is to improve DC and CDC for better performance on longitudinal data. Recently, sample-splitting and data-aggregation methods have been used frequently in feature screening, such as in Zhong et al. [] and Dai et al. []. These studies suggest a way to handle ultra-high-dimensional longitudinal data.
We first introduce a new method, called longitudinal distance correlation sure independence screening (LDCS), in this study. The new method uses DC to measure marginal dependence at each time point separately and then aggregates the results using a cumulative distribution function (CDF), as in Zhong et al. []. We also make a simple extension of LDCS to handle situations where the predictors are correlated; since this extension is based on CDC, we call it longitudinal conditional distance correlation sure independence screening (LCDCS). The two new methods have some distinct advantages. For example, compared with existing longitudinal data feature screening procedures, they are model-free, which enables them to perform better in some complex situations. Additionally, compared with existing model-free feature screening procedures, LDCS and LCDCS use sample splitting and data aggregation, making them relatively insensitive to within-subject correlation. Moreover, their computational burden is lower than that of methods that directly apply dependence measures to longitudinal data.
In Section 2, we propose our methods and provide their asymptotic properties. Section 3 presents several simulated and empirical examples to compare the performance of our methods with that of some other feature screening procedures. Section 4 provides a brief discussion of the article. Appendix A contains the technical proofs of all theorems, and Appendix B presents a simple sensitivity analysis.
2. Materials and Methods
2.1. Model Setup
We consider longitudinal data of the form $\{(Y_{im}, X_{im}, Z_{im}) : i = 1, \dots, n,\ m = 1, \dots, m_i\}$, where $(Y_{im}, X_{im}, Z_{im})$ denotes the measurement at the mth measurement time point $t_m$ for the ith subject, and $Y_{im}$ is the response vector correlated with the predictor vector $X_{im}$ and the conditional predictor vector $Z_{im}$. Here, n represents the number of subjects, and $m_i$ denotes the number of repeated measurements for the ith subject. We allow the measurement intervals at different time points to be unequal and the number of repeated measurements to vary across subjects. Furthermore, we define the total number of measurements as $N = \sum_{i=1}^{n} m_i$ and the maximum of $m_1, \dots, m_n$ as M.
2.2. Marginal Screening Procedure
We assume that n is much less than p and that only a few predictors are associated with $Y$. Let $F(y \mid X)$ denote the conditional distribution function of $Y$, given $X$. The index set of important predictors can then be defined as
$$D = \{k : F(y \mid X) \text{ functionally depends on } X_k \text{ for some } y\}$$
without assuming a specific regression model. We define the number of important predictors as $d$, the cardinality of D, and denote $D^c = \{1, \dots, p\} \setminus D$ as the index set of unimportant predictors. Similar to most screening procedures, our goal is to estimate D conservatively so that all important predictors are retained.
DC was proposed to quantify the association between two random vectors []. Specifically, for random vectors $u \in \mathbb{R}^{d_u}$ and $v \in \mathbb{R}^{d_v}$, they defined the DC between $u$ and $v$ as the square root of
$$\operatorname{dcorr}^2(u, v) = \frac{\operatorname{dcov}^2(u, v)}{\sqrt{\operatorname{dcov}^2(u, u)\,\operatorname{dcov}^2(v, v)}},$$
where the distance covariance is defined as the square root of
$$\operatorname{dcov}^2(u, v) = \int_{\mathbb{R}^{d_u + d_v}} \big\| f_{u,v}(t, s) - f_u(t) f_v(s) \big\|^2 w(t, s)\, dt\, ds,$$
where $w(t, s)$ is a known weight function, and $f$ denotes the characteristic function.
DC is an excellent dependence measure for detecting relationships between predictors and the response in cross-sectional data. However, its screening performance for longitudinal data is less satisfactory, as can be seen in the simulation study. In this article, we adopt the sample splitting and data aggregation approach of Zhong et al. [] to improve the application of DC to longitudinal data feature screening.
First, for ease of presentation, we assume that every subject is observed at the common time points $t_1 < t_2 < \dots < t_M$. We split the longitudinal data into M parts according to the measurement time points, let $(X_k^{(m)}, Y^{(m)})$ denote the observations of $(X_k, Y)$ at the mth time point, and denote the DC between $X_k$ and $Y$ at the mth time point by $\operatorname{dcorr}(X_k^{(m)}, Y^{(m)})$. We construct the following aggregated statistic:
$$\omega_k = \sum_{m=1}^{M} \{G(t_m) - G(t_{m-1})\}\, \operatorname{dcorr}(X_k^{(m)}, Y^{(m)}), \quad k = 1, \dots, p, \tag{1}$$
where $G(\cdot)$ is a specific CDF. For instance, taking $G$ to be the CDF of the true time variable T with equal mass $1/M$ at each observed time point, the aggregated statistic becomes
$$\omega_k = \frac{1}{M}\sum_{m=1}^{M} \operatorname{dcorr}(X_k^{(m)}, Y^{(m)}),$$
which is the mean of the time-point-wise DCs. In this article, we specifically define $G$ as the CDF that assigns each time point its normalized measurement interval, i.e.,
$$G(t_m) - G(t_{m-1}) = \frac{t_m - t_{m-1}}{t_M - t_1}.$$
Using this specific CDF $G$, for each predictor $X_k$, we define $\omega_k$ in (1) with these weights as our dependence measure at the population level. The aggregated statistic defined in (1) can then be naturally interpreted as the area under the curve (AUC) of the time-point-wise DCs after normalizing the measurement times by their maximum. Figure 1 provides a geometric view of the statistic $\omega_k$. A larger $\omega_k$ indicates a stronger association between $X_k$ and $Y$.
Figure 1.
The geometric illustration of the LDC statistic. $t_m - t_{m-1}$ represents the measurement interval between time points $t_{m-1}$ and $t_m$; $\{G(t_m) - G(t_{m-1})\}\operatorname{dcorr}(X_k^{(m)}, Y^{(m)})$ represents the area of the histogram between time points $t_{m-1}$ and $t_m$.
Next, we estimate $\omega_k$ in (1). Following Székely et al. [], given the random sample defined in Section 2.1, we can obtain the estimator of $\operatorname{dcov}^2(X_k^{(m)}, Y^{(m)})$ through moment estimation:
$$\widehat{\operatorname{dcov}}^2(X_k^{(m)}, Y^{(m)}) = \hat S_1 + \hat S_2 - 2\hat S_3,$$
where, writing $a_{ij} = \|X_{ik}^{(m)} - X_{jk}^{(m)}\|$ and $b_{ij} = \|Y_i^{(m)} - Y_j^{(m)}\|$ for the $n_m$ subjects observed at the mth time point,
$$\hat S_1 = \frac{1}{n_m^2}\sum_{i=1}^{n_m}\sum_{j=1}^{n_m} a_{ij} b_{ij}, \quad \hat S_2 = \frac{1}{n_m^2}\sum_{i=1}^{n_m}\sum_{j=1}^{n_m} a_{ij} \cdot \frac{1}{n_m^2}\sum_{i=1}^{n_m}\sum_{j=1}^{n_m} b_{ij}, \quad \hat S_3 = \frac{1}{n_m^3}\sum_{i=1}^{n_m}\sum_{j=1}^{n_m}\sum_{l=1}^{n_m} a_{il} b_{jl}.$$
Similarly, $\widehat{\operatorname{dcov}}^2(X_k^{(m)}, X_k^{(m)})$ and $\widehat{\operatorname{dcov}}^2(Y^{(m)}, Y^{(m)})$ can be defined in the same way, yielding the plug-in estimator $\widehat{\operatorname{dcorr}}(X_k^{(m)}, Y^{(m)})$. Then, we can obtain the aggregated estimate of $\omega_k$ by
$$\hat\omega_k = \sum_{m=1}^{M} \{G(t_m) - G(t_{m-1})\}\, \widehat{\operatorname{dcorr}}(X_k^{(m)}, Y^{(m)}).$$
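For concreteness, the moment estimator above can be implemented in a few lines. The following Python sketch is our own illustration of the standard estimator of Székely et al. []; all function and variable names are ours.

```python
import numpy as np

def dcov2(a, b):
    # Moment estimate of the squared distance covariance from pairwise
    # distance matrices a and b: S1_hat + S2_hat - 2 * S3_hat.
    s1 = (a * b).mean()
    s2 = a.mean() * b.mean()
    s3 = (a.mean(axis=1) * b.mean(axis=1)).mean()
    return s1 + s2 - 2.0 * s3

def dcorr(u, v):
    # Sample distance correlation between samples u (n x d_u) and v (n x d_v).
    u = np.asarray(u, float).reshape(len(u), -1)
    v = np.asarray(v, float).reshape(len(v), -1)
    a = np.linalg.norm(u[:, None, :] - u[None, :, :], axis=2)  # pairwise distances
    b = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=2)
    num = max(dcov2(a, b), 0.0)          # guard against tiny negative values
    den = np.sqrt(dcov2(a, a) * dcov2(b, b))
    return float(np.sqrt(num / den)) if den > 0 else 0.0
```

For instance, dcorr(X_m[:, [k]], Y_m) returns $\widehat{\operatorname{dcorr}}(X_k^{(m)}, Y^{(m)})$ for the subjects observed at the mth time point.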
We refer to the new longitudinal data feature screening procedure based on $\hat\omega_k$ as LDCS, which is summarized in Algorithm 1. Since we use the sample splitting and data aggregation method, the cost of the pairwise-distance computations in LDCS is reduced from $O(N^2)$ to $O(\sum_{m=1}^{M} n_m^2)$ per predictor, where $n_m$ is the number of subjects observed at the mth time point. Finally, we sort the $\hat\omega_k$ in decreasing order and denote the estimated set of important predictors as
$$\hat D = \{k : \hat\omega_k \text{ ranks among the top } d_n \text{ of all } \hat\omega_1, \dots, \hat\omega_p\}, \tag{2}$$
by specifying a positive integer $d_n$, such as $d_n = [n/\log n]$ [].
| Algorithm 1 Longitudinal distance correlation screening (LDCS). |
| 1. Split the longitudinal data into M parts according to the measurement time points. 2. For each predictor $X_k$ and each time point $t_m$, compute the sample DC $\widehat{\operatorname{dcorr}}(X_k^{(m)}, Y^{(m)})$. 3. Aggregate the M statistics into $\hat\omega_k$ using the CDF weights $G(t_m) - G(t_{m-1})$. 4. Sort the $\hat\omega_k$ in decreasing order and retain the top $d_n$ predictors as $\hat D$. |
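A minimal sketch of Algorithm 1 follows, reusing the dcorr helper above. The uniform-interval CDF weighting in Step 3 is our reading of the normalized-interval construction behind Figure 1 and should be treated as an assumption.

```python
import numpy as np  # dcorr is the helper defined in the previous sketch

def ldcs(X_parts, Y_parts, t, d_n):
    # X_parts[m]: n_m x p predictors observed at time t[m];
    # Y_parts[m]: n_m x q responses (data already split by time point).
    t = np.asarray(t, float)
    M, p = len(t), X_parts[0].shape[1]
    # Step 2: distance correlation of each predictor at each time point.
    dc = np.array([[dcorr(X_parts[m][:, [k]], Y_parts[m]) for m in range(M)]
                   for k in range(p)])                  # p x M
    # Step 3: aggregate with CDF weights G(t_m) - G(t_{m-1}), taken here
    # as normalized interval widths (an assumption, see the lead-in).
    w = np.diff(t) / (t[-1] - t[0])                     # length M - 1
    omega_hat = dc[:, 1:] @ w                           # AUC of each step curve
    # Step 4: rank and keep the top d_n predictors.
    return np.argsort(omega_hat)[::-1][:d_n]
```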
Next, we analyze the properties of LDCS, for which we require the following distribution assumption.
- (C.1):
- Both the q-dimensional vector $Y$ and the p-dimensional vector $X$ have a sub-exponential tail probability uniformly in p, meaning that there exists a constant $s_0 > 0$, such that, for all $0 < s \le 2s_0$,
$$\sup_{p}\max_{1\le k\le p} E\{\exp(s\|X_k\|^2)\} < \infty \quad \text{and} \quad E\{\exp(s\|Y\|^2)\} < \infty.$$
Remark 1.
Assumption (C.1) is frequently used in technical proofs for ultra-high-dimensional problems, such as those in Liu [] and Li et al. [], as it constrains the moments of the response and predictors. Any bounded or normally distributed variable satisfies this assumption.
We present the asymptotic properties of LDCS as follows.
Theorem 1
(Sure screening property). Under Assumption (C.1), for any $0 < \gamma < 1/2 - \kappa$ and $c > 0$, there exist constants $c_1 > 0$, $c_2 > 0$, and $c_3 > 0$, such that
$$P\Big(\max_{1 \le k \le p}\big|\hat\omega_k - \omega_k\big| \ge c n^{-\kappa}\Big) \le O\Big(p\big[\exp\{-c_1 n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2 n^{\gamma})\big]\Big). \tag{3}$$
Furthermore, with $\min_{k \in D} \omega_k \ge 2c n^{-\kappa}$ being supposed, then
$$P(D \subseteq \hat D) \ge 1 - O\Big(d\big[\exp\{-c_1 n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2 n^{\gamma})\big]\Big),$$
where d is the cardinality of D.
Theorem 1 shows that LDCS can effectively identify important predictors for the response. In addition, if we select the optimal order $\gamma = (1 - 2\kappa)/3$, then Equation (3) becomes
$$P\Big(\max_{1 \le k \le p}\big|\hat\omega_k - \omega_k\big| \ge c n^{-\kappa}\Big) \le O\big(p\exp\{-c_3 n^{(1-2\kappa)/3}\}\big).$$
Theorem 2
(Rank consistency property). Under Assumption (C.1), supposing $\min_{k \in D}\omega_k - \max_{k \in D^c}\omega_k \ge 2\delta$ for some constant $\delta > 0$ and $0 < \gamma < 1/2$, there exist constants $c_4$ and $c_5$, such that
$$P\Big(\min_{k \in D}\hat\omega_k - \max_{k \in D^c}\hat\omega_k \le \delta\Big) \le O\Big(p\big[\exp\{-c_4 n^{1-2\gamma}\} + n\exp(-c_5 n^{\gamma})\big]\Big).$$
Furthermore, as discussed in Theorem 1, if $\sum_{n=1}^{\infty} p\big[\exp\{-c_4 n^{1-2\gamma}\} + n\exp(-c_5 n^{\gamma})\big] < \infty$, then
$$\liminf_{n \to \infty}\Big\{\min_{k \in D}\hat\omega_k - \max_{k \in D^c}\hat\omega_k\Big\} > 0 \quad a.s.$$
Appendix A gives the technical proofs of Theorems 1 and 2.
Remark 2.
Theorem 1 is essential for all ultra-high-dimensional data feature screening procedures, as it guarantees that $\hat D$ contains all predictors in D with high probability. Theorem 2 provides a stronger theoretical result: important predictors are ranked above unimportant ones with high probability. In other words, under the separation condition of Theorem 2, with high probability there exists a δ that effectively separates the important and unimportant predictors.
2.3. Conditional Screening Procedure
When there exists a high correlation between predictors, LDCS may incorrectly exclude some truly informative predictors. Wang et al. [] applied the idea of conditional screening to deal with this problem and obtained excellent theoretical properties; we therefore adopt the same idea here. We denote
$$D^* = \{k : F(y \mid X, Z) \text{ functionally depends on } X_k \text{ for some } y\},$$
where $F(y \mid X, Z)$ denotes the conditional distribution function of $Y$, given $X$ and $Z$.
Wang et al. [] proposed CDC, based on Székely et al. [], to address the weakness of DC, which may make serious errors when predictors are highly correlated. Specifically, given the conditional predictor vector $z$, they defined the CDC between $u$ and $v$ as the square root of
$$\operatorname{cdcorr}^2(u, v \mid z) = \frac{\operatorname{cdcov}^2(u, v \mid z)}{\sqrt{\operatorname{cdcov}^2(u, u \mid z)\,\operatorname{cdcov}^2(v, v \mid z)}},$$
where the conditional distance covariance is defined as the square root of
$$\operatorname{cdcov}^2(u, v \mid z) = \int_{\mathbb{R}^{d_u + d_v}} \big\| f_{u,v \mid z}(t, s) - f_{u \mid z}(t) f_{v \mid z}(s) \big\|^2 w(t, s)\, dt\, ds.$$
Similar to Section 2.2, we split the longitudinal data into M parts according to the measurement time points and denote the conditional distance correlation between $X_k$ and $Y$, given $Z$, at the mth time point by $\operatorname{cdcorr}(X_k^{(m)}, Y^{(m)} \mid Z^{(m)})$. We define
$$\omega_k^* = \sum_{m=1}^{M}\{G(t_m) - G(t_{m-1})\}\, \operatorname{cdcorr}(X_k^{(m)}, Y^{(m)} \mid Z^{(m)})$$
as our dependence measure given some important predictors $Z$. We denote $a_{ij} = \|X_{ik}^{(m)} - X_{jk}^{(m)}\|$ as the Euclidean distance between $X_{ik}^{(m)}$ and $X_{jk}^{(m)}$ at the mth measurement; $b_{ij} = \|Y_i^{(m)} - Y_j^{(m)}\|$ is defined in the same way. Let $K_H(z) = |H|^{-1/2} K(H^{-1/2} z)$, where H is a diagonal matrix determined by the bandwidth h, selected using the plug-in method, and $K(\cdot)$ is a kernel function, set as the Gaussian kernel function in this paper, i.e.,
$$K(u) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{u^2}{2}\Big).$$
Given $z$, the sample conditional distance covariance is
$$\widehat{\operatorname{cdcov}}^2(X_k^{(m)}, Y^{(m)} \mid z) = \hat S_1^* + \hat S_2^* - 2\hat S_3^*,$$
where, with kernel weights $w_i = K_H(Z_i^{(m)} - z)\big/\sum_{j=1}^{n_m} K_H(Z_j^{(m)} - z)$,
$$\hat S_1^* = \sum_{i,j} w_i w_j a_{ij} b_{ij}, \quad \hat S_2^* = \Big(\sum_{i,j} w_i w_j a_{ij}\Big)\Big(\sum_{i,j} w_i w_j b_{ij}\Big), \quad \hat S_3^* = \sum_{l} w_l\Big(\sum_{i} w_i a_{il}\Big)\Big(\sum_{j} w_j b_{jl}\Big).$$
The sample conditional distance covariances $\widehat{\operatorname{cdcov}}^2(X_k^{(m)}, X_k^{(m)} \mid z)$ and $\widehat{\operatorname{cdcov}}^2(Y^{(m)}, Y^{(m)} \mid z)$ can be estimated similarly. Thus, the plug-in estimate of the conditional distance correlation is
$$\widehat{\operatorname{cdcorr}}(X_k^{(m)}, Y^{(m)} \mid z) = \sqrt{\frac{\widehat{\operatorname{cdcov}}^2(X_k^{(m)}, Y^{(m)} \mid z)}{\sqrt{\widehat{\operatorname{cdcov}}^2(X_k^{(m)}, X_k^{(m)} \mid z)\,\widehat{\operatorname{cdcov}}^2(Y^{(m)}, Y^{(m)} \mid z)}}},$$
and the aggregated estimate of $\omega_k^*$ is
$$\hat\omega_k^* = \sum_{m=1}^{M}\{G(t_m) - G(t_{m-1})\}\,\widehat{\operatorname{cdcorr}}(X_k^{(m)}, Y^{(m)} \mid Z^{(m)}).$$
Based on $\hat\omega_k^*$, we propose a conditional screening procedure specifically designed to handle highly correlated predictors:
$$\hat D^* = \{k : \hat\omega_k^* \text{ ranks among the top } d_n \text{ of all } \hat\omega_1^*, \dots, \hat\omega_p^*\}.$$
We refer to this new method as LCDCS. Similar to LDCS, sample splitting reduces the cost of the pairwise-distance computations from $O(N^2)$ to $O(\sum_{m=1}^{M} n_m^2)$ per predictor.
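To fix ideas, here is a hedged sketch of the kernel-weighted estimation behind LCDCS. The weighted V-statistic form of the sample conditional distance covariance and the convention of averaging the statistic over the observed conditioning values are our assumptions, modeled on Wang et al. [], not reproduced from the paper.

```python
import numpy as np

def cdcov2(a, b, w):
    # Kernel-weighted squared conditional distance covariance, computed
    # from pairwise distance matrices a, b and kernel weights w (sum to 1).
    s1 = w @ (a * b) @ w
    s2 = (w @ a @ w) * (w @ b @ w)
    s3 = np.sum(w * (a @ w) * (b @ w))
    return s1 + s2 - 2.0 * s3

def cdcorr(x, y, z, h):
    # CDC of x and y given z, averaged over the observed z_i (assumed
    # convention); h is the (plug-in) bandwidth of the Gaussian kernel.
    n = len(x)
    x = np.asarray(x, float).reshape(n, -1)
    y = np.asarray(y, float).reshape(n, -1)
    z = np.asarray(z, float).reshape(n, -1)
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)
    vals = []
    for i in range(n):
        k = np.exp(-0.5 * np.sum(((z - z[i]) / h) ** 2, axis=1))  # Gaussian weights
        w = k / k.sum()
        num = max(cdcov2(a, b, w), 0.0)
        den = np.sqrt(max(cdcov2(a, a, w), 0.0) * max(cdcov2(b, b, w), 0.0))
        vals.append(np.sqrt(num / den) if den > 0 else 0.0)
    return float(np.mean(vals))
```

Aggregating across the M time points then proceeds exactly as in the LDCS sketch, with cdcorr in place of dcorr.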
Before proving the asymptotic properties of LCDCS, we list the other technical assumptions:
- (C.2):
- The kernel function $K(\cdot)$ is uniformly bounded and satisfies the following conditions: $\int K(u)\,du = 1$, $\int u K(u)\,du = 0$, $\int u^2 K(u)\,du < \infty$, and $\int K^2(u)\,du < \infty$.
- (C.3):
- If are independent copies of Z, and then, for , there exists constant , such that
We present the asymptotic properties of LCDCS as follows.
Theorem 3
(Sure screening property). Under Assumptions (C.1)–(C.3), if the bandwidth for the kernel estimation of $Z$ satisfies $h \to 0$ and $nh^r \to \infty$, where r is the dimension of $Z$, then for any $0 < \gamma < 1/2 - \kappa$ and $c > 0$, there exist constants $c_1^* > 0$, $c_2^* > 0$, and $c_3^* > 0$, such that
$$P\Big(\max_{1 \le k \le p}\big|\hat\omega_k^* - \omega_k^*\big| \ge c n^{-\kappa}\Big) \le O\Big(p\big[\exp\{-c_1^* n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2^* n^{\gamma})\big]\Big).$$
Furthermore, supposing $\min_{k \in D^*} \omega_k^* \ge 2c n^{-\kappa}$, then
$$P(D^* \subseteq \hat D^*) \ge 1 - O\Big(d^*\big[\exp\{-c_1^* n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2^* n^{\gamma})\big]\Big),$$
where $d^*$ is the cardinality of $D^*$. If we set $\gamma = (1 - 2\kappa)/3$, then the first part of Theorem 3 becomes
$$P\Big(\max_{1 \le k \le p}\big|\hat\omega_k^* - \omega_k^*\big| \ge c n^{-\kappa}\Big) \le O\big(p\exp\{-c_3^* n^{(1-2\kappa)/3}\}\big).$$
Theorem 4
(Rank consistency property). Under Assumptions (C.1)–(C.3), supposing $\min_{k \in D^*}\omega_k^* - \max_{k \in D^{*c}}\omega_k^* \ge 2\delta$ for some constant $\delta > 0$ and $0 < \gamma < 1/2$, there exist constants $c_4^*$ and $c_5^*$, such that
$$P\Big(\min_{k \in D^*}\hat\omega_k^* - \max_{k \in D^{*c}}\hat\omega_k^* \le \delta\Big) \le O\Big(p\big[\exp\{-c_4^* n^{1-2\gamma}\} + n\exp(-c_5^* n^{\gamma})\big]\Big).$$
Furthermore, as discussed in Theorem 3, if $\sum_{n=1}^{\infty} p\big[\exp\{-c_4^* n^{1-2\gamma}\} + n\exp(-c_5^* n^{\gamma})\big] < \infty$, then
$$\liminf_{n \to \infty}\Big\{\min_{k \in D^*}\hat\omega_k^* - \max_{k \in D^{*c}}\hat\omega_k^*\Big\} > 0 \quad a.s.$$
Appendix A gives the technical proofs of Theorems 3 and 4.
3. Results
3.1. Simulation Study
We conduct several simulated examples to evaluate the screening performance of LDCS and LCDCS, and compare the results with those of some existing methods, including partial residual sure independence screening (PRSIS) [], time-varying coefficient models’ sure independence screening (TVCM-SIS) [], distance correlation sure independence screening (DC-SIS) [], conditional distance correlation sure independence screening (CDC-SIS) [], and covariate information number sure independence screening (CIS) []. PRSIS and TVCM-SIS are longitudinal data feature screening procedures where PRSIS is based on partially linear models and TVCM-SIS is based on time-varying coefficient models. DC-SIS, CDC-SIS, and CIS are all model-free methods for cross-sectional data.
We repeat each simulation example 500 times. In all examples, the number of predictors and the sample size are held fixed, yielding a common submodel size $d_n$. The predictors and the random errors are generated independently from multivariate normal distributions whose parameters differ across examples. The time points are sampled from a standard uniform distribution and are held fixed within each simulation. We use the following four criteria (computed as in the sketch after this list) to assess the screening performance:
- MMS: The minimum submodel size containing all important predictors. The median and robust standard deviation (RSD = IQR/1.34) of the MMS are reported, where IQR denotes the interquartile range.
- The proportion of replications in which each individual important predictor is selected in the submodel.
- The proportion of replications in which all important predictors are selected in the submodel.
- The average number of important predictors selected in the submodel.
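The following small Python helper (our own illustration; rank_lists, true_idx, and d_n are hypothetical names) computes these criteria from the per-replication predictor rankings.

```python
import numpy as np

def screening_criteria(rank_lists, true_idx, d_n):
    # rank_lists: one array per replication, predictor indices sorted by
    # decreasing screening statistic; true_idx: truly important predictors.
    true_set = set(true_idx)
    mms, n_hit = [], []
    for ranks in rank_lists:
        pos = {k: r + 1 for r, k in enumerate(ranks)}   # 1-based ranks
        mms.append(max(pos[k] for k in true_set))       # minimum model size
        n_hit.append(len(true_set.intersection(ranks[:d_n])))
    mms, n_hit = np.asarray(mms), np.asarray(n_hit)
    q75, q25 = np.percentile(mms, [75, 25])
    return {"MMS_median": float(np.median(mms)),
            "MMS_RSD": float((q75 - q25) / 1.34),       # RSD = IQR / 1.34
            "prop_all_selected": float(np.mean(n_hit == len(true_set))),
            "avg_num_selected": float(n_hit.mean())}
```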
3.1.1. Example 1: Partially Linear Models
We generate the response from the following two models:
Model (1.a):
Model (1.b):
Model (1.a) considers categorical predictor variables, while Model (1.b) incorporates interaction effects. These two models are based on Models (1.a) and (1.b) in Li et al. [], with the difference that we introduce longitudinal measurements and nonlinear time variables. We set , and in Model (1.a), and , , and in Model (1.b). Furthermore, we set the number of repeated measurements and adopt a first-order autoregressive (AR(1)) within-subject correlation structure, illustrated in the sketch below. The within-subject correlations are set to and for the two models, respectively.
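As an illustration of the AR(1) within-subject structure used here (the parameter values below are placeholders, since the exact settings are not reproduced in the text above):

```python
import numpy as np

def ar1_errors(n_subjects, n_times, rho, sigma2=1.0, seed=0):
    # Within-subject errors with AR(1) correlation: corr(e_s, e_t) = rho**|s-t|.
    rng = np.random.default_rng(seed)
    lags = np.abs(np.subtract.outer(np.arange(n_times), np.arange(n_times)))
    cov = sigma2 * rho ** lags                  # AR(1) covariance matrix
    return rng.multivariate_normal(np.zeros(n_times), cov, size=n_subjects)
```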
The detailed results are reported in Table 1. In all four situations, our proposed method LDCS achieves the smallest MMS (in both median and RSD), along with the largest selection proportions. When a strong nonlinear time effect is present, the model-free feature screening procedures DC-SIS and CIS perform less effectively. Furthermore, we observe that TVCM-SIS and PRSIS are less effective at identifying interaction effects. Although CIS identifies interaction effects relatively successfully, it tends not to retain important predictors with and without interaction effects simultaneously. Example 1 demonstrates that our proposed method LDCS is unaffected by strong time effects and handles interaction effects in partially linear models better.
Table 1.
Screening results for Example 1 under the four criteria of Section 3.1.
3.1.2. Example 2: Time-Varying Coefficient Models
In this example, we generate the response from the following two models:
Model (2.a):
Model (2.b):
Model (2.a) is similar to Example I in Chu et al. [], but we use a smaller sample size and fewer measurements. Model (2.b) considers a count response with a Poisson distribution, based on Example II in Chu et al. [], but we use a time-varying coefficient model rather than the generalized varying-coefficient mixed-effect model. In Model (2.a), we set , , , and . In Model (2.b), we set , , , , and . Furthermore, we set the within-subject correlation structure to a compound symmetry (CS) structure. The within-subject correlations are set to and for the two models, respectively.
Table 2 presents the detailed simulation results for Example 2. The results are similar to those of Example 1: our proposed method LDCS achieves the smallest MMS (in both median and RSD), along with the largest selection proportions. Specifically, for Model (2.a), although TVCM-SIS is based on the assumption of varying-coefficient models, its performance is not ideal when the categorical predictors have a stronger impact on the response. For Model (2.b), the screening performance of TVCM-SIS, PRSIS, DC-SIS, and CIS is uniformly unsatisfactory, indicating that these methods are not well suited to varying-coefficient models with count responses. These examples further demonstrate that, for some special varying-coefficient models, our proposed method LDCS may achieve better screening performance.
Table 2.
Screening results for Example 2 under the four criteria of Section 3.1.
3.1.3. Example 3: Partially Linear Single-Index Models
We assess the screening performance of LCDCS when the predictors are correlated with each other. The response is generated from Model (3):
Model (3):
We set , , , the number of repeated measurements to 3, and the within-subject correlation structure to AR(1) with within-subject correlation . The correlation structure between predictors is set to a CS structure (generated as in the sketch below), with two scenarios considered: and .
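A companion sketch for the CS correlation structure between predictors (parameter values again illustrative):

```python
import numpy as np

def cs_predictors(n, p, rho_x, seed=0):
    # Predictors with compound-symmetry correlation: corr(X_j, X_k) = rho_x
    # for all j != k.
    rng = np.random.default_rng(seed)
    cov = rho_x * np.ones((p, p)) + (1.0 - rho_x) * np.eye(p)
    return rng.multivariate_normal(np.zeros(p), cov, size=n)
```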
We first use LDCS and DC-SIS to screen the most relevant predictor, which serves as the conditional predictor for LCDCS and CDC-SIS, respectively. Table 3 presents the detailed simulation results for Example 3. It can be seen that our proposed method, LCDCS, also achieves the smallest MMS (in both median and RSD), along with the largest selection proportions. Since the number of repeated measurements is only 3, the improvement in screening performance of LCDCS over CDC-SIS is not particularly pronounced. Additionally, the screening performance of TVCM-SIS and PRSIS is very poor, indicating that methods relying on specific model assumptions may not perform well under some complex model structures. CIS is also not well suited to correlated predictors. This example demonstrates that our proposed method, LCDCS, is more suitable for longitudinal data with correlated predictors under some special model structures.
Table 3.
Screening results for Example 3 under the four criteria of Section 3.1.
3.2. Application to Gut Microbiota Data
We analyze real-world data to demonstrate the empirical performance of our methods in this section. The data are gut microbiota data from Bangladeshi children, as reported by Subramanian et al. []. The longitudinal cohort study aimed to analyze the effects of therapeutic foods on children. Specifically, the investigators monitored the growth and development of infants over a two-year period after birth, collecting fecal samples throughout the follow-up period. Each sample contained 79,597 operational taxonomic units sharing at least 97% nucleotide sequence identity (97%-identity OTUs). Following Subramanian et al. [], we obtain 1222 97%-identity OTUs that have at least two fecal samples with an average relative abundance greater than 0.1%. In this article, we consider height-for-age Z-scores (HAZ) as the response; HAZ reflects how a child’s height compares to the average height of the same age and gender group, indicating whether growth and development are within the normal range. The HAZ of each child was measured between 6 and 22 times, as illustrated in Figure 2. As a further preprocessing step, we retain only the months in which at least 24 children had their HAZ measured. Thus, we eventually work with a dataset comprising 1222 97%-identity OTUs (the predictors) and HAZ (the response), measured across 13 months, for a total of 433 measurements. On this dataset, we apply our LDCS and LCDCS, along with TVCM-SIS, PRSIS, DC-SIS, CDC-SIS, and CIS, for feature screening, and we compare the results with those of Subramanian et al. [].
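A hedged sketch of the month-filtering step described above; the column names ("child", "month", "HAZ") are ours, not taken from the paper's data files.

```python
import pandas as pd

def filter_months(df, min_children=24):
    # Keep only the months in which at least `min_children` children
    # had their HAZ measured.
    counts = df.groupby("month")["child"].nunique()
    keep = counts[counts >= min_children].index
    return df[df["month"].isin(keep)]
```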
Figure 2.
Trajectories of HAZ in the first 2 years after birth for 50 children. Each boxplot illustrates the variability in the HAZ of children for the given month.
First, we compare the top-ranked 97%-identity OTUs selected by each method, as summarized in Table 4. Our methods, LDCS and LCDCS, show moderate overlap in selected 97%-identity OTUs with PRSIS, DC-SIS, and CDC-SIS, while the selections of TVCM-SIS and CIS differ substantially from those of the other methods. Next, we compare the top-ranked 97%-identity OTUs with the 220 gut microbiota that Subramanian et al. [] identified as significantly different between severe acute malnutrition (SAM) and healthy children. Among all screening procedures, LDCS and LCDCS select the largest number of 97%-identity OTUs that align with those of Subramanian et al. [].
Table 4.
Number of overlapping 97%-identity OTUs among the top (above diagonal) and the top (below diagonal) selected via different methods.
Furthermore, we evaluate the screening performance through regression analysis. Following Chu et al. [], we fit time-varying coefficient models with different numbers of 97%-identity OTUs and use the total heritability, a measure commonly used in genetic analysis, to assess the goodness of fit. The total heritability of all selected 97%-identity OTUs is calculated from the residual sum of squares (RSS) of the fitted models.
Here, fetal indicates whether the child is a singleton birth (a covariate included in the fitted models), and RSS is the residual sum of squares, defined as
$$\mathrm{RSS} = \sum_{i}\sum_{m}\big(y_{im} - \hat y_{im}\big)^2,$$
where $y_{im}$ denotes the actual value, and $\hat y_{im}$ denotes the predicted value from the time-varying coefficient models (see the hedged sketch below).
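Since the exact total-heritability formula is not reproduced in the extracted text, the following sketch assumes the common RSS-ratio form; both the form and the baseline model are our assumptions.

```python
import numpy as np

def total_heritability(y, y_hat_full, y_hat_base):
    # Assumed form: h2 = 1 - RSS(full model) / RSS(baseline model), where
    # the baseline contains only the non-OTU covariates (e.g., fetal).
    rss_full = np.sum((y - y_hat_full) ** 2)
    rss_base = np.sum((y - y_hat_base) ** 2)
    return 1.0 - rss_full / rss_base
```

With the observed and fitted values stacked over children and months, this returns the proportion of baseline residual variation explained by the selected 97%-identity OTUs under this assumed form.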
We calculate the total heritability for the top-ranked 97%-identity OTUs, up to the top 49, and remove more irrelevant 97%-identity OTUs using forward regression to achieve a better regression fit. Our proposed method, LCDCS, attains the highest total heritability at 64.76%, followed by LDCS at 60.24%. The final number of selected 97%-identity OTUs and the total heritability for each method are shown in Table 5.
Table 5.
Number of 97%-identity OTUs selected and total heritability for each method.
We also plot the curves of total heritability for the different screening procedures, as shown in Figure 3. Overall, when the number of selected 97%-identity OTUs is between 4 and 23, the total heritability of LCDCS is higher than that of the other screening procedures, and when it is between 24 and 33, LDCS shows higher total heritability. Therefore, we recommend LCDCS when the number of selected 97%-identity OTUs is moderate and LDCS when it is larger.
Figure 3.
Curves of total heritability against the number of 97%-identity OTUs selected via different methods.
We also calculate the heritability of each individual 97%-identity OTU, which depends on its order of selection in the forward regression; the heritability of the kth 97%-identity OTU is calculated from the incremental reduction in RSS at the kth step.
Table 6 presents the IDs and taxonomic annotations of the top 10 97%-identity OTUs by heritability, as selected via LCDCS. The taxonomic annotations are given by Subramanian et al. [] and represent the genus of the gut microbiota. Among these 10 gut microbiota, 4 belong to the genus Bifidobacterium. Bifidobacterium is a common group of probiotics found primarily in the human gut, particularly in infants, and has been shown to be closely associated with children’s nutrient absorption, growth, and development. Furthermore, 5 of the 97%-identity OTUs differ from the results of Subramanian et al. [], which may represent new discoveries. We also perform a sensitivity analysis to evaluate how the threshold parameter influences the empirical results, with details provided in Appendix B.
Table 6.
IDs and taxonomic annotations of the top 10 97%-identity OTUs selected via LCDCS based on heritability.
4. Discussion
In this article, we improve DC by applying a sample splitting and data aggregation method to achieve better screening performance for longitudinal data. We also make a simple extension of LDCS, namely LCDCS, to deal with situations where the predictors are correlated. Both methods can handle longitudinal data with unequal numbers of repeated measurements and unequal intervals between measurement time points. Simulation studies indicate that, in some special situations, such as partially linear models with strong time effects or interaction effects and varying-coefficient models with count responses, our proposed method LDCS demonstrates better screening performance. Furthermore, in situations where predictors are correlated, our proposed method, LCDCS, also achieves better screening performance for some complex structures. Finally, the results of the application show that LDCS and LCDCS achieve better outcomes at different selection scales.
This work inevitably has some limitations. For instance, we did not consider the treatment of missing data or time-varying confounding. Additionally, both LDCS and LCDCS rely on sub-exponential tail probability assumptions, which may not always hold in practice. The screening performance of LCDCS is sensitive to bandwidth selection, and, owing to the kernel estimation, its computational burden is higher than that of some other model-free methods.
We plan to address the selection of the threshold $d_n$ in LDCS and LCDCS in a future study. Recently, Liu et al. [] proposed handling this issue from the perspective of false discovery rate (FDR) control by using Model-X knockoff features, and Chi et al. [] applied Model-X knockoff features to the FDR control problem for time series data by using an e-value aggregation method. Similarly, we could construct Model-X knockoff features and corresponding knockoff statistics for each measurement time point in longitudinal data and then aggregate these statistics using the e-value aggregation method. We expect this approach to help handle the threshold selection problem in LDCS and LCDCS from the perspective of FDR control.
Author Contributions
Conceptualization, J.C.; Methodology, J.C.; Software, J.C.; Validation, J.C. and Y.L.; Formal analysis, J.C. and Y.L.; Investigation, X.Y. and Y.L.; Resources, X.Y.; Writing – original draft, J.C.; Writing – review & editing, X.Y., J.D. and Y.L.; Visualization, J.C. and J.D.; Supervision, X.Y., J.D. and Y.L.; Project administration, J.D. and Y.L.; Funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research is funded by Sichuan Province Administration of Traditional Chinese Medicine of China (grant numbers 25MSZX477 and 25MSZX495).
Data Availability Statement
The data presented in this study are openly available from Subramanian et al. [] at https://doi.org/10.1038/nature13421.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| $t_m$ | mth measurement time point |
| $Y$ | Response vector |
| $X$ | Predictor vector |
| $Z$ | Conditional predictor vector |
| p, q, r | Dimensions of $X$, $Y$, $Z$ |
| n | Number of subjects |
| $m_i$ | Number of repeated measurements for the ith subject |
| N | Total number of measurements |
| M | Maximum of $m_1, \dots, m_n$ |
| $G(\cdot)$ | Cumulative distribution function |
| D, $D^*$ | Index sets of important predictors |
| $D^c$, $D^{*c}$ | Index sets of unimportant predictors |
| d, $d^*$ | Cardinalities of D, $D^*$ |
| $\operatorname{dcorr}(u, v)$ | Distance correlation of u and v |
| $\omega_k$, $\omega_k^*$ | Dependence measures at the population level |
| $\hat\omega_k$, $\hat\omega_k^*$ | Dependence measures at the sample level |
| $d_n$ | Selection number threshold |
| $\operatorname{cdcorr}(u, v \mid z)$ | Conditional distance correlation of u and v given z |
| $K(\cdot)$ | Kernel function |
| h | Bandwidth |
| $N(\mu, \Sigma)$ | Multivariate normal distribution |
| $I(\cdot)$ | Indicator function |
| $\rho$ | Within-subject correlation |
| $\rho_x$ | Correlation between predictors |
| $h^2$ | Heritability |
| RSS | Residual sum of squares |
| $\mathrm{OTU}_k$ | kth 97%-identity OTU |
Appendix A. Theoretical Proofs
Before proving Theorem 1, we give the following lemma from Li et al. [].
Lemma A1.
For any random vectors $u$ and $v$, under Assumption (C.1), for any $0 < \gamma < 1/2 - \kappa$ and $c > 0$, there exist constants $c_1$ and $c_2$, such that
$$P\big(|\widehat{\operatorname{dcorr}}(u, v) - \operatorname{dcorr}(u, v)| \ge c n^{-\kappa}\big) \le O\big(\exp\{-c_1 n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2 n^{\gamma})\big).$$
Proof of Theorem 1.
First, we show that, for any k,
This is because
where , and . For any ,
Next, we have
The last inequality holds because .
Next, we bound the remaining terms. By using Lemma A1, for any $0 < \gamma < 1/2 - \kappa$ and $c > 0$, it follows from Assumption (C.1) that
for constants and . (A4) entails that
Let and , where satisfies . If M is finite, then we have
and
The proof of the first part of Theorem 1 is completed.
Next, we begin the proof of the second part of Theorem 1. If , then there must exist some , such that . entails that for some , indicating that , and then . Consequently, when n is sufficiently large,
where is the cardinality of D. (A7) implies that , as . □
Proof of Theorem 2.
Let and . For any , we have
where , , are positive constants. The last equation can be derived from Theorem 1. Thus, we have
If with , then we have
We know for a large n; therefore, there must exist a large n, such that
Then, we have
where . Therefore, according to the Borel–Cantelli Lemma, we obtain
and then
The proof of Theorem 2 is complete. □
Before proving Theorem 3, we give the following lemma from Wen et al. [].
Lemma A2.
For any random vectors $u$, $v$, and $z$, with Assumptions (C.1)–(C.3) being supposed to hold and the bandwidth for the kernel estimation of $z$ satisfying the condition of Theorem 3, for any $0 < \gamma < 1/2 - \kappa$ and $c > 0$, there exist constants $c_1^*$ and $c_2^*$, such that
$$P\big(|\widehat{\operatorname{cdcorr}}(u, v \mid z) - \operatorname{cdcorr}(u, v \mid z)| \ge c n^{-\kappa}\big) \le O\big(\exp\{-c_1^* n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2^* n^{\gamma})\big).$$
Proof of Theorem 3.
Similar to the proof of Theorem 1, we can get
By using Lemma A2, for any $0 < \gamma < 1/2 - \kappa$ and $c > 0$, it follows from Assumptions (C.1)–(C.3) that
for constants and . (A9) entails that
Then, let and , where satisfies , we have
and
The proof of the first part of Theorem 3 is completed.
Next, we begin the proof of the second part of Theorem 3. If , then there must exist some , such that . entails that for some , indicating that , and then . Consequently, when n is sufficiently large,
where is the cardinality of . (A10) implies that , as . □
Proof of Theorem 4.
Similar to the proof of Theorem 2, for any , we have
where , are positive constants. Thus, we have
If with , then we have
We know for a large n; therefore, there must exist a large n, such that
Then, we have
where . Therefore, according to the Borel–Cantelli Lemma, we obtain
and then
which completes the proof of Theorem 4. □
Appendix B. Sensitivity Analysis in Application
To demonstrate the impact of the choice of the threshold $d_n$ on LCDCS, we conduct a sensitivity analysis on the application. In addition to the threshold considered in Section 3.2, we also consider several alternative values, including 12. Table A1 provides the IDs and taxonomic annotations of the 97%-identity OTUs, ranked by heritability after forward regression.
We can see that, in each setting, the heritability of New.0.CleanUp.ReferenceOTU89679, 561636, and New.0.ReferenceOTU340 consistently ranks in the top 3. However, there are some differences in the rankings from 4th to 10th, indicating that selecting an appropriate threshold is important for future studies.
Table A1.
IDs and taxonomic annotations of the top 10 97%-identity OTUs selected by LCDCS based on heritability at different .
| Order | ID | Taxonomic Annotation |
|---|---|---|
| 1 | New.0.CleanUp.ReferenceOTU89679 * | Collinsella |
| 2 | 561636 | Streptococcus |
| 3 | New.0.ReferenceOTU340 * | Bifidobacterium |
| 4 | 326977 * | Bifidobacterium |
| 5 | New.1.ReferenceOTU284 | NA † |
| 6 | 533785 * | Bifidobacterium |
| 7 | 561483 * | Bifidobacterium |
| 8 | 305760 * | Escherichia/Shigella |
| 1 | New.0.CleanUp.ReferenceOTU89679 * | Collinsella |
| 2 | 561636 | Streptococcus |
| 3 | New.0.ReferenceOTU340 * | Bifidobacterium |
| 4 | 326977 * | Bifidobacterium |
| 5 | 469868 | Bifidobacterium |
| 6 | 554755 | Enterococcaceae |
| 7 | 72820 * | Bifidobacterium |
| 8 | New.1.ReferenceOTU284 | NA † |
| 9 | 533785 * | Bifidobacterium |
| 10 | 469852 * | Bifidobacterium |
| 1 | New.0.CleanUp.ReferenceOTU89679 * | Collinsella |
| 2 | 561636 | Streptococcus |
| 3 | New.0.ReferenceOTU340 * | Bifidobacterium |
| 4 | 326977 * | Bifidobacterium |
| 5 | 469868 | Bifidobacterium |
| 6 | New.0.ReferenceOTU339 | NA † |
| 7 | 130663 * | Bacteroides |
| 8 | 554755 | Enterococcaceae |
| 9 | 210269 * | NA † |
| 10 | 301004 * | Olsenella |
| 1 | New.0.CleanUp.ReferenceOTU89679 * | Collinsella |
| 2 | 561636 | Streptococcus |
| 3 | New.0.ReferenceOTU340 * | Bifidobacterium |
| 4 | 470527 | Lactobacillus |
| 5 | 326977 * | Bifidobacterium |
| 6 | 471180 * | Bifidobacterium |
| 7 | 469868 | Bifidobacterium |
| 8 | 345575 | Lactococcus |
| 9 | 316587 * | Streptococcus |
| 10 | New.0.ReferenceOTU339 | NA † |
Note: When the threshold is set to 12, only 8 97%-identity OTUs remain after forward regression. * 97%-identity OTUs discovered in Subramanian et al. []. † The taxonomic annotation is not provided by Subramanian et al. [].
References
- Fan, J.Q.; Lv, J.C. Sure Independence Screening for Ultrahigh Dimensional Feature Space. J. R. Stat. Soc. B. 2008, 70, 849–911. [Google Scholar] [CrossRef]
- Song, R.; Yi, F.; Zou, H. On Varying-Coefficient Independence Screening for High-Dimensional Varying-Coefficient Models. Stat. Sinica. 2014, 24, 1735–1752. [Google Scholar] [CrossRef]
- Liu, J.Y. Feature Screening and Variable Selection for Partially Linear Models with Ultrahigh-Dimensional Longitudinal Data. Neurocomputing 2016, 195, 202–210. [Google Scholar] [CrossRef]
- Lai, P.; Liang, W.J.; Wang, F.J.; Zhang, Q.Z. Feature Screening of Quadratic Inference Functions for Ultrahigh Dimensional Longitudinal Data. J. Stat. Comput. Sim. 2020, 90, 2614–2630. [Google Scholar] [CrossRef]
- Jiang, B.Y.; Lv, J.; Li, J.L.; Cheng, M.Y. Robust Model Averaging Prediction of Longitudinal Response with Ultrahigh-Dimensional Covariates. J. R. Stat. Soc. Ser. B Stat. Methodol. 2024, 87, 337–361. [Google Scholar] [CrossRef]
- Zhu, L.P.; Li, L.X.; Li, R.Z.; Zhu, L.X. Model-Free Feature Screening for Ultrahigh-Dimensional Data. J. Am. Stat. Assoc. 2011, 106, 1464–1475. [Google Scholar] [CrossRef] [PubMed]
- Li, R.Z.; Zhong, W.; Zhu, L.P. Feature Screening via Distance Correlation Learning. J. Am. Stat. Assoc. 2012, 107, 1129–1139. [Google Scholar] [CrossRef]
- Shao, X.F.; Zhang, J.S. Martingale Difference Correlation and Its Use in High-Dimensional Variable Screening. J. Am. Stat. Assoc. 2014, 109, 1302–1318. [Google Scholar] [CrossRef]
- Wen, C.H.; Pan, W.L.; Huang, M.M.; Wang, X.Q. Sure Independence Screening Adjusted for Confounding Covariates with Ultrahigh Dimensional Data. Stat. Sinica. 2018, 28, 293–317. [Google Scholar] [CrossRef]
- Nandy, D.; Chiaromonte, F.; Li, R.Z. Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems. J. Am. Stat. Assoc. 2022, 117, 1516–1529. [Google Scholar] [CrossRef]
- Zhang, J.; Liu, Y.Y.; Cui, H.J. Model-Free Feature Screening via Distance Correlation for Ultrahigh Dimensional Survival Data. Stat. Papers. 2021, 62, 2711–2738. [Google Scholar] [CrossRef]
- Zhong, W.; Qian, C.; Liu, W.J.; Zhu, L.P.; Li, R.Z. Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills. J. Am. Stat. Assoc. 2023, 118, 805–817. [Google Scholar] [CrossRef]
- Zhou, T.Y.; Zhu, L.P. Model-Free Feature Screening for Ultrahigh Dimensional Censored Regression. Stat. Comput. 2017, 27, 947–961. [Google Scholar] [CrossRef]
- Székely, G.J.; Rizzo, M.L.; Bakirov, N.K. Measuring and Testing Dependence by Correlation of Distances. Ann. Stat. 2007, 35, 2769–2794. [Google Scholar] [CrossRef]
- Wang, X.Q.; Pan, W.L.; Hu, W.H.; Tian, Y.; Zhang, H.P. Conditional Distance Correlation. J. Am. Stat. Assoc. 2015, 110, 1726–1734. [Google Scholar] [CrossRef]
- Chen, L.P. Feature Screening Based on Distance Correlation for Ultrahigh-Dimensional Censored Data with Covariate Measurement Error. Comput. Stat. 2021, 36, 857–884. [Google Scholar] [CrossRef]
- Lu, J.; Lin, L. Model-Free Conditional Screening via Conditional Distance Correlation. Stat. Papers. 2020, 61, 225–244. [Google Scholar] [CrossRef]
- Dai, C.G.; Lin, B.Y.; Xing, X.; Liu, J.S. False Discovery Rate Control via Data Splitting. J. Am. Stat. Assoc. 2023, 118, 2503–2520. [Google Scholar] [CrossRef]
- Chu, W.H.; Li, R.Z.; Reimherr, M. Feature Screening for Time-Varying Coefficient Models with Ultrahigh-Dimensional Longitudinal Data. Ann. Appl. Stat. 2016, 10, 596–617. [Google Scholar] [CrossRef]
- Chu, W.H.; Li, R.Z.; Liu, J.Y.; Reimherr, M. Feature Selection for Generalized Varying Coefficient Mixed-Effect Models with Application to Obesity GWAS. Ann. Appl. Stat. 2020, 14, 276–298. [Google Scholar] [CrossRef] [PubMed]
- Subramanian, S.; Huq, S.; Yatsunenko, T.; Haque, R.; Mahfuz, M.; Alam, M.A.; Benezra, A.; DeStefano, J.; Meier, M.F.; Muegge, B.D.; et al. Persistent Gut Microbiota Immaturity in Malnourished Bangladeshi Children. Nature 2014, 510, 417–421. [Google Scholar] [CrossRef] [PubMed]
- Liu, W.J.; Ke, Y.; Liu, J.Y.; Li, R.Z. Model-Free Feature Screening and FDR Control with Knockoff Features. J. Am. Stat. Assoc. 2022, 117, 428–443. [Google Scholar] [CrossRef]
- Chi, C.M.; Fan, Y.Y.; Ing, C.K.; Lv, J.C. High-Dimensional Knockoffs Inference for Time Series Data. J. Am. Stat. Assoc. 2025, 1–24. [Google Scholar] [CrossRef] [PubMed]