Article

Model-Free Feature Screening Based on Data Aggregation for Ultra-High-Dimensional Longitudinal Data

Junfeng Chen, Xiaoguang Yang, Jing Dai and Yunming Li
1 School of Mathematics, Southwest Jiaotong University, Chengdu 611756, China
2 Office of Medical Information and Data, Medical Support Center, The General Hospital of Western Theater Command, PLA, Chengdu 610083, China
3 Department of Information, Medical Support Center, The General Hospital of Western Theater Command, PLA, Chengdu 610083, China
* Author to whom correspondence should be addressed.
Stats 2025, 8(4), 99; https://doi.org/10.3390/stats8040099
Submission received: 3 September 2025 / Revised: 6 October 2025 / Accepted: 14 October 2025 / Published: 16 October 2025

Abstract

Feature screening procedures for ultra-high-dimensional longitudinal data have been widely studied, but most require model assumptions, and their screening performance can deteriorate when the model is misspecified. To resolve this problem, we introduce a new model-free method that performs feature screening through sample splitting and data aggregation: distance correlation is used to measure the association at each time point separately, while the longitudinal correlation is modeled by a specific cumulative distribution function to achieve efficiency. In addition, we extend this new method to handle situations where the predictors are correlated. Both methods possess excellent asymptotic properties and are capable of handling longitudinal data with unequal numbers of repeated measurements and unequal intervals between measurement time points. Compared with other model-free methods, the two new methods are relatively insensitive to within-subject correlation, and they help reduce the computational burden when applied to longitudinal data. Finally, we use simulated and empirical examples to show that both new methods deliver better screening performance.

1. Introduction

Longitudinal data, characterized by repeated measurements of the same subjects over time, are common in many scientific fields, including public health, psychology, and sociology. Traditional methods, such as linear mixed models (LMMs), are widely used to analyze supervised longitudinal problems. However, as the number of predictors, p, grows, these methods are limited by the "curse of dimensionality": even a seemingly large sample size, n, may be far smaller than p. This is commonly called the ultra-high-dimensional setting, defined by Fan and Lv [1] as $\log p = O(n^{\alpha})$ for some $\alpha \in (0, 1/2)$. On the other hand, technological developments have made it possible to obtain high-throughput data through repeated measurements of subjects. Therefore, it is essential to develop new methods capable of handling such data.
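As a quick arithmetic check (our illustration, using the dimensions of the simulations in Section 3), a design with $n = 80$ and $p = 2000$ is ultra-high-dimensional in this sense:
$$\alpha = \frac{\log(\log p)}{\log n} = \frac{\log(\log 2000)}{\log 80} \approx \frac{\log 7.60}{\log 80} \approx \frac{2.03}{4.38} \approx 0.46 \in (0, 1/2).$$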
Since only a few predictors are associated with the response in the ultra-high-dimensional setting (i.e., sparsity), feature screening is a common way to deal with these problems. For early work on ultra-high-dimensional longitudinal data, sure independence screening for varying-coefficient models was proposed first [2], and subsequent studies extended the idea to partially linear models [3], additive models [4], and rank regression models [5]. However, these methods rely on parametric or semi-parametric models, which may be insufficient for accurate model specification in the big data era. To address this, we agree with Zhu et al. [6] that model-free methods are more appealing, as they rely only on a general model framework.
Currently, research on model-free methods for cross-sectional data has made great progress. The primary approach of these methods is to measure marginal dependence, using measures such as distance correlation (DC) [7], martingale difference correlation [8], conditional distance correlation (CDC) [9], and the covariate information number [10]. Recently, some studies have also adapted methods originally designed for cross-sectional data to survival data and other types of data [11,12,13]. However, up to now, studies applying model-free feature screening procedures to longitudinal data remain rare, the main challenge being that repeated measurements within subjects are correlated.
Since Székely et al. [14] proposed DC and Wang et al. [15] proposed CDC, the two measures have been widely applied in feature screening; e.g., the studies by Li et al. [7] and Chen [16] are based on DC, while those by Wen et al. [9] and Lu and Lin [17] are based on CDC. Consequently, the aim of this study is to improve DC and CDC for better performance on longitudinal data. Recently, sample-splitting and data-aggregation methods have been used frequently in feature screening, for example, by Zhong et al. [12] and Dai et al. [18]. These studies suggest a way to handle ultra-high-dimensional longitudinal data.
We first introduce a new method called longitudinal distance correlation sure independence screening (LDCS) in this study. The new method uses DC to measure marginal dependence at each time point separately and then aggregates the results using a cumulative distribution function (CDF) used in Zhong et al. [12]. We also make a simple extension of LDCS to handle situations where predictors are correlated. The method is based on CDC, so we call it longitudinal conditional distance correlation sure independence screening (LCDCS). The two new methods have some distinct advantages. For example, compared to existing longitudinal data feature screening procedures, the two methods are model-free, which enables them to perform better in some complex situations. Additionally, compared to existing model-free feature screening procedures, LDCS and LCDCS use the sample-splitting and data aggregation methods, making them relatively insensitive to within-subject correlation. Moreover, the computational burden is lower than that of methods that directly apply dependence measures to longitudinal data.
In Section 2, we propose our methods and provide their asymptotic properties. Section 3 presents several simulated and empirical examples to compare the performance of our methods with that of some other feature screening procedures. Section 4 provides a brief discussion of the article. Appendix A contains the technical proofs of all theorems, and Appendix B presents a simple sensitivity analysis.

2. Materials and Methods

2.1. Model Setup

We consider longitudinal data $\{(\mathbf{y}_{im}, \mathbf{X}_{im}, \mathbf{Z}_{im}),\ 1 \le i \le n,\ 1 \le m \le M_i\}$, where $(\mathbf{y}_{im}, \mathbf{X}_{im}, \mathbf{Z}_{im})$ denotes the measurement at the mth measurement time point $t_m$ for the ith subject, and $\mathbf{y}_{im} \in \mathbb{R}^q$ is the response vector correlated with the predictor vector $\mathbf{X}_{im} \in \mathbb{R}^p$ and the conditional predictor vector $\mathbf{Z}_{im} \in \mathbb{R}^r$. Here, n represents the number of subjects, and $M_i$ denotes the number of repeated measurements for the ith subject. We allow the intervals between measurement time points to be unequal and the number of repeated measurements to vary across subjects. Furthermore, we define the total number of measurements as $N = \sum_{i=1}^{n} M_i$ and the maximum of $M_i$ as M.

2.2. Marginal Screening Procedure

We assume that n is much smaller than p and that only a few predictors are associated with $\mathbf{y}$. Let $F(\mathbf{y}|\mathbf{X})$ denote the conditional distribution function of $\mathbf{y}$, given $\mathbf{X}$. The index set of important predictors can be defined as
$$\mathcal{D} = \{k : F(\mathbf{y}|\mathbf{X}) \text{ functionally depends on } X_k,\ k = 1, \ldots, p\},$$
without assuming a specific regression model. We define the number of important predictors as $L_0 = |\mathcal{D}|$, the cardinality of $\mathcal{D}$, and denote by $\mathcal{D}^c$ the index set of unimportant predictors. As with most screening procedures, our goal is to estimate $\mathcal{D}$ conservatively so that all important predictors are retained.
DC was proposed to quantify the association between two random vectors [14]. For details, the DC between $\mathbf{y}$ and $\mathbf{X}$ is defined as the square root of
$$\mathrm{dCor}^2(\mathbf{X}, \mathbf{y}) = \frac{\mathrm{dCov}^2(\mathbf{X}, \mathbf{y})}{\sqrt{\mathrm{dCov}^2(\mathbf{X}, \mathbf{X})\,\mathrm{dCov}^2(\mathbf{y}, \mathbf{y})}},$$
where the distance covariance is defined as the square root of
$$\mathrm{dCov}^2(\mathbf{X}, \mathbf{y}) = \int_{\mathbb{R}^{p+q}}\left\|\phi_{\mathbf{X},\mathbf{y}}(s,t) - \phi_{\mathbf{X}}(s)\phi_{\mathbf{y}}(t)\right\|^2 p(s,t)\,ds\,dt,$$
where $p(s,t)$ is a known weight function and $\phi(\cdot)$ denotes the characteristic function.
DC is an excellent dependence measure capable of detecting the relationship between predictors and response for cross-sectional data. However, the screening performance of DC in longitudinal data is not as satisfactory, as can be seen in the simulation study. In this article, we consider a sample splitting and data aggregation method used by Zhong et al. [12] to improve the application of DC in longitudinal data feature screening.
First, we assume that $t_1 < t_2 < \cdots < t_M$. We split the longitudinal data into M parts according to the distinct time points $t_m$, and we denote the DC between $X_{km}$ $(k = 1, \ldots, p)$ and $\mathbf{y}_m$ at the mth time point $t_m$ by $DC_k(t_m) = \mathrm{dCor}^2(X_{km}, \mathbf{y}_m)$. We construct the following aggregated statistic:
$$\omega_k = \int_{u \in \mathbb{R}} DC_k(u)\,dF(u),$$
where $F(u)$ is a specific CDF. For instance, taking $F(\cdot)$ as the CDF of the true time variable T, the aggregated statistic becomes
$$\omega_k = \int_{t \in \mathbb{R}} DC_k(t)\,dF(t) = E\left(DC_k(T)\right),$$
which is the mean of $DC_k(T)$. In this article, we specifically define $F(\cdot)$ as the discrete CDF given by
$$p_m = \Pr(U = t_m) = \frac{t_m - t_{m-1}}{t_M - t_1} = \frac{\Delta t_m}{t_M - t_1}, \quad m = 1, 2, \ldots, M,$$
where $\Delta t_1 = 0$ and $\Delta t_m = t_m - t_{m-1}$ for $m = 2, \ldots, M$. Using this specific CDF $F(u)$, for each predictor $X_k$ $(k = 1, \ldots, p)$, we define
$$\omega_k = \sum_{m=1}^{M} DC_k(t_m)\,\frac{\Delta t_m}{t_M - t_1} \tag{1}$$
as our dependence measure at the population level. The aggregated statistic $\omega_k$ defined in (1) can be naturally interpreted as the area under the curve (AUC) of the normalized $DC_k(\cdot)$, and it is bounded above by the maximum of $\{DC_k(t_m), m = 1, 2, \ldots, M\}$. Figure 1 provides a geometric view of the statistic $\omega_k$. A larger $\omega_k$ indicates a stronger association between $X_k$ and $\mathbf{y}$.
Next, we estimate $\omega_k$ in (1). Following Székely et al. [14], given a random sample as defined in Section 2.1, we can obtain the estimator of $\mathrm{dCov}^2(X_{km}, \mathbf{y}_m)$ through moment estimation:
$$\widehat{\mathrm{dCov}}^2(X_{km}, \mathbf{y}_m) = \hat{S}_1(X_{km}, \mathbf{y}_m) + \hat{S}_2(X_{km}, \mathbf{y}_m) - 2\hat{S}_3(X_{km}, \mathbf{y}_m),$$
where
$$\hat{S}_1(X_{km}, \mathbf{y}_m) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|X_{kim} - X_{kjm}\|_1\,\|\mathbf{y}_{im} - \mathbf{y}_{jm}\|_q,$$
$$\hat{S}_2(X_{km}, \mathbf{y}_m) = \frac{1}{n^4}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}\|X_{kim} - X_{kjm}\|_1\right)\left(\sum_{l=1}^{n}\sum_{f=1}^{n}\|\mathbf{y}_{lm} - \mathbf{y}_{fm}\|_q\right),$$
$$\hat{S}_3(X_{km}, \mathbf{y}_m) = \frac{1}{n^3}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{l=1}^{n}\|X_{kim} - X_{klm}\|_1\,\|\mathbf{y}_{jm} - \mathbf{y}_{lm}\|_q.$$
The estimators $\widehat{\mathrm{dCov}}^2(X_{km}, X_{km})$ and $\widehat{\mathrm{dCov}}^2(\mathbf{y}_m, \mathbf{y}_m)$ are defined in the same way. Then, the aggregated estimate of $\omega_k$ is
$$\hat{\omega}_k = \sum_{m=1}^{M}\widehat{DC}_k(t_m)\,\frac{\Delta t_m}{t_M - t_1} = \sum_{m=1}^{M}\frac{\widehat{\mathrm{dCov}}^2(X_{km}, \mathbf{y}_m)}{\sqrt{\widehat{\mathrm{dCov}}^2(X_{km}, X_{km})\,\widehat{\mathrm{dCov}}^2(\mathbf{y}_m, \mathbf{y}_m)}}\,\frac{\Delta t_m}{t_M - t_1}. \tag{2}$$
We refer to the new longitudinal data feature screening procedure based on $\hat{\omega}_k$ as LDCS, which is summarized in Algorithm 1. Because we use sample splitting and data aggregation, the computational complexity of LDCS is reduced from $O(n^2M^2)$ to $O(n^2M)$: applying DC directly to the pooled data of $N = O(nM)$ observations costs $O(N^2) = O(n^2M^2)$ per predictor, whereas splitting yields M separate computations of $O(n^2)$ each. Finally, we sort the $\hat{\omega}_k$ in decreasing order and denote the estimated set of important predictors as
$$\hat{\mathcal{D}} = \{1 \le k \le p : \hat{\omega}_k \text{ is among the top } d_0\}$$
for a prespecified positive integer $d_0$. Following [1], we can choose $d_0 = a[n/\log(n)]$ with a positive integer $a = 1, 2, 3$, or $d_0 = n - 1$.
Algorithm 1 Longitudinal distance correlation screening (LDCS).
Input: source data $\{\mathbf{X}_{im}, \mathbf{y}_{im}, t_m\}_{m=1}^{M}$.
Output: $\hat{\mathcal{D}} = \{1 \le k \le p : \hat{\omega}_k \text{ is among the top } d_0\}$.
  1: Split the sample into M parts according to the distinct time points $t_m$.
  2: for $m = 1, \ldots, M$ do
  3:     for $k = 1, \ldots, p$ do
  4:         Estimate $\hat{S}_1(X_{km}, \mathbf{y}_m)$, $\hat{S}_2(X_{km}, \mathbf{y}_m)$, and $\hat{S}_3(X_{km}, \mathbf{y}_m)$ by moment estimation.
  5:         Estimate the distance covariances $\widehat{\mathrm{dCov}}^2(X_{km}, \mathbf{y}_m)$, $\widehat{\mathrm{dCov}}^2(X_{km}, X_{km})$, and $\widehat{\mathrm{dCov}}^2(\mathbf{y}_m, \mathbf{y}_m)$.
  6:         Estimate the distance correlation $\widehat{DC}_k(t_m) = \widehat{\mathrm{dCov}}^2(X_{km}, \mathbf{y}_m)\big/\sqrt{\widehat{\mathrm{dCov}}^2(X_{km}, X_{km})\,\widehat{\mathrm{dCov}}^2(\mathbf{y}_m, \mathbf{y}_m)}$.
  7:     end for
  8: end for
  9: for $k = 1, \ldots, p$ do
10:     Estimate the longitudinal distance correlation $\hat{\omega}_k = \sum_{m=1}^{M}\widehat{DC}_k(t_m)\,\Delta t_m/(t_M - t_1)$.
11: end for
12: Sort $\{\hat{\omega}_k, 1 \le k \le p\}$ in descending order.
13: Obtain the estimated set of important predictors $\hat{\mathcal{D}}$.
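To make Algorithm 1 concrete, the following is a minimal NumPy sketch of LDCS under simplifying assumptions: a balanced design in which every subject is observed at all M common time points, and a univariate response. The function and array names (dcov2, dcor2, ldcs, X, y, tpoints) are our own, not from the authors' code.

```python
import numpy as np

def dcov2(a, b):
    """Squared distance covariance of two 1-D samples: S1 + S2 - 2*S3."""
    A = np.abs(a[:, None] - a[None, :])            # pairwise |a_i - a_j|
    B = np.abs(b[:, None] - b[None, :])            # pairwise |b_i - b_j|
    S1 = (A * B).mean()
    S2 = A.mean() * B.mean()
    S3 = (A.mean(axis=0) * B.mean(axis=0)).mean()
    return S1 + S2 - 2.0 * S3

def dcor2(a, b):
    """Squared distance correlation dCor^2(a, b)."""
    denom = np.sqrt(dcov2(a, a) * dcov2(b, b))
    return dcov2(a, b) / denom if denom > 0 else 0.0

def ldcs(X, y, tpoints, d0):
    """X: (n, M, p) predictors; y: (n, M) responses; tpoints: (M,) sorted times.
    Returns the indices of the top-d0 predictors by the aggregated statistic."""
    n, M, p = X.shape
    dt = np.diff(tpoints, prepend=tpoints[0])      # Delta t_1 = 0
    w = dt / (tpoints[-1] - tpoints[0])            # CDF weights p_m, summing to 1
    omega = np.zeros(p)
    for m in range(M):                             # sample splitting by time point
        for k in range(p):
            omega[k] += w[m] * dcor2(X[:, m, k], y[:, m])
    return np.argsort(omega)[::-1][:d0]            # top d0 in decreasing order
```

For n = 80, one would take, e.g., d0 = int(80 / np.log(80)) = 18, matching the submodel size used in Section 3.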
Next, we analyze the theoretical properties of LDCS. We first make the following distribution assumption.
(C.1):
Both the q-dimensional vector $\mathbf{y}$ and the p-dimensional vector $\mathbf{X}$ have sub-exponential tail probabilities uniformly in p; that is, there exists a constant $b_0 > 0$ such that, for all $0 < b \le b_0$,
$$\sup_t E\left\{\exp\left(2b\|\mathbf{y}\|_q^2\right) \mid t\right\} < \infty, \quad \text{and} \quad \sup_t\max_{1\le k\le p} E\left\{\exp\left(2b\|X_k\|_1^2\right) \mid t\right\} < \infty.$$
Remark 1. 
Assumption (C.1) is frequently utilized in technical proofs for ultra-high-dimensional problems, such as those in Liu [3] and Li et al. [7], as it constrains the moments of the response and predictors. Any bounded or normally distributed random variable satisfies this assumption.
We present the asymptotic properties of LDCS as follows.
Theorem 1 (Sure screening property). Under Assumption (C.1), for any $0 \le \tau < 1/2$ and $0 < \kappa < 1/2 - \tau$, there exist constants $c_1 > 0$, $c_2 > 0$, and $c_3 > 0$, such that
$$\Pr\left(\max_{1\le k\le p}|\hat{\omega}_k - \omega_k| \ge c_1 n^{-\tau}\right) \le O\left(p\left[\exp\left(-c_2 n^{1-2(\tau+\kappa)}\right) + n\exp\left(-c_3 n^{\kappa}\right)\right]\right). \tag{3}$$
Furthermore, supposing $\min_{k\in\mathcal{D}}\omega_k \ge 2c_1 n^{-\tau}$, then
$$\Pr\left(\mathcal{D} \subseteq \hat{\mathcal{D}}\right) \ge 1 - O\left(L_0\left[\exp\left(-c_2 n^{1-2(\tau+\kappa)}\right) + n\exp\left(-c_3 n^{\kappa}\right)\right]\right),$$
where $L_0$ is the cardinality of $\mathcal{D}$.
Theorem 1 shows that LDCS can effectively identify important predictors for the response. In addition, if we select the optimal order $\kappa = (1-2\tau)/3$, then Equation (3) becomes
$$\Pr\left(\max_{1\le k\le p}|\hat{\omega}_k - \omega_k| \ge c_1 n^{-\tau}\right) \le O\left(p\exp\left(-c_2 n^{(1-2\tau)/3}\right)\right).$$
Theorem 2 (Rank consistency property). Under Assumption (C.1), supposing $\min_{k\in\mathcal{D}}\omega_k - \max_{k\in\mathcal{D}^c}\omega_k \ge 2c_4 n^{-\tau}$ for a constant $c_4 > 0$ and $0 \le \tau < 1/2$, $0 < \kappa < 1/2 - \tau$, there exist constants $c_5 > 0$ and $c_6 > 0$, such that
$$\Pr\left(\min_{k\in\mathcal{D}}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\hat{\omega}_k > 0\right) > 1 - O\left(p\left[\exp\left(-c_5 n^{1-2(\tau+\kappa)}\right) + n\exp\left(-c_6 n^{\kappa}\right)\right]\right).$$
Furthermore, as discussed in Theorem 1, if $\log p = o\left(n^{(1-2\tau)/3}\right)$, then
$$\liminf_{n\to\infty}\left\{\min_{k\in\mathcal{D}}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\hat{\omega}_k\right\} > 0, \quad \text{almost surely}.$$
Appendix A gives the technical proofs of Theorems 1 and 2.
Remark 2. 
Theorem 1 is essential for all ultra-high-dimensional feature screening procedures, as it guarantees that $\hat{\mathcal{D}}$ contains all predictors in $\mathcal{D}$ with high probability. Theorem 2 provides a stronger theoretical result: important predictors are ranked above unimportant ones with high probability. In other words, setting $\delta = c_4 n^{-\tau}$, there exists, with high probability, a threshold $\delta$ that effectively separates the important and unimportant predictors.

2.3. Conditional Screening Procedure

When there exists high correlation among predictors, LDCS may incorrectly exclude some truly informative predictors. Wang et al. [15] applied the idea of conditional screening to deal with this problem and obtained excellent theoretical properties. Therefore, we apply this idea here. We denote
$$\mathcal{A} = \{k : F(\mathbf{y}|\mathbf{X}, \mathbf{Z}) \text{ functionally depends on } X_k,\ k = 1, \ldots, p\},$$
where $F(\mathbf{y}|\mathbf{X}, \mathbf{Z})$ denotes the conditional distribution function of $\mathbf{y}$, given $\mathbf{X}$ and $\mathbf{Z}$.
Wang et al. [15] proposed CDC, based on Székely et al. [14], to address the issue that DC may make serious errors when predictors are highly correlated. For details, given the conditional predictor vector $\mathbf{Z}$, they defined the CDC between $\mathbf{y}$ and $\mathbf{X}$ as the square root of
$$\mathrm{cdCor}^2(\mathbf{X}, \mathbf{y}|\mathbf{Z}) = \frac{\mathrm{cdCov}^2(\mathbf{X}, \mathbf{y}|\mathbf{Z})}{\sqrt{\mathrm{cdCov}^2(\mathbf{X}, \mathbf{X}|\mathbf{Z})\,\mathrm{cdCov}^2(\mathbf{y}, \mathbf{y}|\mathbf{Z})}},$$
where the conditional distance covariance is defined as the square root of
$$\mathrm{cdCov}^2(\mathbf{X}, \mathbf{y}|\mathbf{Z}) = \int_{\mathbb{R}^{p+q}}\left\|\phi_{\mathbf{X},\mathbf{y}|\mathbf{Z}}(s,t) - \phi_{\mathbf{X}|\mathbf{Z}}(s)\phi_{\mathbf{y}|\mathbf{Z}}(t)\right\|^2 p(s,t)\,ds\,dt.$$
Similar to Section 2.2, we split the longitudinal data into M parts according to the distinct time points $t_m$, and denote the conditional distance correlation between $X_{km}$ and $\mathbf{y}_m$, given $\mathbf{Z}_m$ at the mth time point $t_m$, by $CDC_k(t_m) = \mathrm{cdCor}^2(X_{km}, \mathbf{y}_m|\mathbf{Z}_m)$. We define
$$\rho_k = \sum_{m=1}^{M} CDC_k(t_m)\,\frac{\Delta t_m}{t_M - t_1}$$
as our dependence measure given some important predictors $\mathbf{Z}$. We denote by $d^X_{ij,k}(t_m) = d(X_{kim}, X_{kjm})$ the Euclidean distance between $X_{kim}$ and $X_{kjm}$, where $i, j = 1, \ldots, n$, at the mth measurement; $d^{\mathbf{y}}_{ij,k}(t_m)$ is denoted in the same way. Then, the distance function can be defined as
$$\bar{d}_{ijlf,k}(t_m) = \frac{d_{ijlf,k}(t_m) + d_{ijfl,k}(t_m) + d_{iflj,k}(t_m)}{3},$$
where $d_{ijlf,k}(t_m) = \left(d^X_{ij,k}(t_m) + d^X_{lf,k}(t_m) - d^X_{il,k}(t_m) - d^X_{jf,k}(t_m)\right)\left(d^{\mathbf{y}}_{ij,k}(t_m) + d^{\mathbf{y}}_{lf,k}(t_m) - d^{\mathbf{y}}_{il,k}(t_m) - d^{\mathbf{y}}_{jf,k}(t_m)\right)$. Note that $d_{ijfl,k}(t_m)$ and $d_{iflj,k}(t_m)$ are defined similarly. Let $\omega_{im}(\mathbf{Z}_m) = K\left((\mathbf{Z}_{im} - \mathbf{Z}_m)/H\right)$ and $\omega_m(\mathbf{Z}_m) = \sum_{i=1}^{n}\omega_{im}(\mathbf{Z}_m)$, where H is a diagonal bandwidth matrix determined by a bandwidth h selected using the plug-in method, and $K(\cdot)$ is a kernel function, set to the Gaussian kernel in this paper, i.e.,
$$K_H(\mathbf{Z}) = |H|^{-1}K\left(H^{-1}\mathbf{Z}\right) = (2\pi)^{-r/2}|H|^{-1}\exp\left(-\tfrac{1}{2}\mathbf{Z}^{\top}H^{-2}\mathbf{Z}\right).$$
Given $\mathbf{Z}_m$, the sample conditional distance covariance is
$$\widehat{\mathrm{cdCov}}^2(X_{km}, \mathbf{y}_m|\mathbf{Z}_m) = \frac{1}{n^4}\sum_{i,j,l,f}\psi_n(W_{im}, W_{jm}, W_{lm}, W_{fm}; \mathbf{Z}_m),$$
where
$$\psi_n(W_{im}, W_{jm}, W_{lm}, W_{fm}; \mathbf{Z}_m) = n^4\,\frac{\omega_{im}(\mathbf{Z}_m)\,\omega_{jm}(\mathbf{Z}_m)\,\omega_{lm}(\mathbf{Z}_m)\,\omega_{fm}(\mathbf{Z}_m)}{4\,\omega_m^4(\mathbf{Z}_m)}\,\bar{d}_{ijlf,k}(t_m).$$
The sample conditional distance covariances $\widehat{\mathrm{cdCov}}^2(X_{km}, X_{km}|\mathbf{Z}_m)$ and $\widehat{\mathrm{cdCov}}^2(\mathbf{y}_m, \mathbf{y}_m|\mathbf{Z}_m)$ can be estimated similarly. Thus, the plug-in estimate of $CDC_k(t_m)$ is
$$\widehat{CDC}_k(t_m) = \frac{1}{n}\sum_{i=1}^{n}\frac{\widehat{\mathrm{cdCov}}^2(X_{kim}, \mathbf{y}_{im}|\mathbf{Z}_{im})}{\sqrt{\widehat{\mathrm{cdCov}}^2(X_{kim}, X_{kim}|\mathbf{Z}_{im})\,\widehat{\mathrm{cdCov}}^2(\mathbf{y}_{im}, \mathbf{y}_{im}|\mathbf{Z}_{im})}},$$
and the aggregated estimate of $\rho_k$ is
$$\hat{\rho}_k = \sum_{m=1}^{M}\widehat{CDC}_k(t_m)\,\frac{\Delta t_m}{t_M - t_1}.$$
Based on $\hat{\rho}_k$, we propose a conditional screening procedure, specifically designed to handle highly correlated predictors:
$$\hat{\mathcal{A}} = \{1 \le k \le p : \hat{\rho}_k \text{ is among the top } d_0\}.$$
We refer to this new method as LCDCS. Similar to LDCS, sample splitting reduces the computational complexity of LCDCS from $O(n^3M^3)$ to $O(n^3M)$.
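To fix ideas about the kernel-weighted estimator, the following is a naive sketch of $\widehat{\mathrm{cdCov}}^2$ at a single time point for univariate x, y, and z with a scalar bandwidth h; these simplifications, and all names, are ours (the paper uses a diagonal bandwidth matrix H chosen by the plug-in method). It follows the $\psi_n$ form displayed above; the $O(n^4)$ quadruple loop is written for clarity, not speed.

```python
import numpy as np

def cdcov2_at(x, y, z, z0, h):
    """Sketch of cdCov^2(x, y | z = z0) with Gaussian kernel weights."""
    n = len(x)
    w = np.exp(-0.5 * ((z - z0) / h) ** 2)     # omega_i(z0), up to a constant
    wn = w / w.sum()                           # normalization absorbs omega_m^4(z0)
    dx = np.abs(x[:, None] - x[None, :])       # d^X_{ij}
    dy = np.abs(y[:, None] - y[None, :])       # d^y_{ij}

    def d4(a, b, c, d):                        # d_{abcd} as defined in the text
        return ((dx[a, b] + dx[c, d] - dx[a, c] - dx[b, d]) *
                (dy[a, b] + dy[c, d] - dy[a, c] - dy[b, d]))

    total = 0.0
    for i in range(n):
        for j in range(n):
            for l in range(n):
                for f in range(n):
                    dbar = (d4(i, j, l, f) + d4(i, j, f, l) + d4(i, f, l, j)) / 3.0
                    total += wn[i] * wn[j] * wn[l] * wn[f] * dbar / 4.0
    return total
```

In the $\psi_n$ notation, the factor $n^4$ cancels against the $1/n^4$ outside the sum, and the normalized weights wn carry the $\omega_m^4(\mathbf{Z}_m)$ denominator, so the function returns the V-statistic form above.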
Before establishing the asymptotic properties of LCDCS, we list the remaining technical assumptions:
(C.2):
The kernel function $K(\cdot)$ is uniformly bounded and satisfies $K(u) \ge 0$, $\int K(u)\,du = 1$, $\int uK(u)\,du = 0$, and $\int u^2K(u)\,du < \infty$.
(C.3):
Let $\mathbf{Z}_1, \mathbf{Z}_2, \mathbf{Z}_3, \mathbf{Z}_4$ be independent copies of $\mathbf{Z}$, and let $\mathbf{Z}_1'$ be an independent copy of $\mathbf{Z}_1$. For $1 \le k \le p$, there exists a constant $C > 0$ such that
$$\sup_k\left|E\left\{\bar{d}_{1234,k} \mid \mathbf{Z}_1, \mathbf{Z}_2, \mathbf{Z}_3, \mathbf{Z}_4\right\} - E\left\{\bar{d}_{1234,k} \mid \mathbf{Z}_1', \mathbf{Z}_2, \mathbf{Z}_3, \mathbf{Z}_4\right\}\right| \le C\left|\mathbf{Z}_1 - \mathbf{Z}_1'\right|.$$
We present the asymptotic properties of LCDCS as follows.
Theorem 3 (Sure screening property). Under Assumptions (C.1)–(C.3), if the bandwidth for the kernel estimation of $\mathbf{Z}$ satisfies $h = O\{n^{-\tilde{\tau}/(2r)}\}$, where r is the dimension of $\mathbf{Z}$, then for any $0 \le \tilde{\tau} < 1/2$ and $0 < \tilde{\kappa} < 1/2 - \tilde{\tau}$, there exist constants $c_7 > 0$, $c_8 > 0$, and $c_9 > 0$, such that
$$\Pr\left(\max_{1\le k\le p}|\hat{\rho}_k - \rho_k| \ge c_7 n^{-\tilde{\tau}}\right) \le O\left(p\left[\exp\left(-c_8 n^{1-2(\tilde{\tau}+\tilde{\kappa})}\right) + n^4\exp\left(-c_9 n^{\tilde{\kappa}}\right)\right]\right).$$
Furthermore, supposing $\min_{k\in\mathcal{A}}\rho_k \ge 2c_7 n^{-\tilde{\tau}}$, then
$$\Pr\left(\mathcal{A} \subseteq \hat{\mathcal{A}}\right) \ge 1 - O\left(L_1\left[\exp\left(-c_8 n^{1-2(\tilde{\tau}+\tilde{\kappa})}\right) + n^4\exp\left(-c_9 n^{\tilde{\kappa}}\right)\right]\right),$$
where $L_1$ is the cardinality of $\mathcal{A}$. If we set $\tilde{\kappa} = (1-2\tilde{\tau})/3$, then the first part of Theorem 3 becomes
$$\Pr\left(\max_{1\le k\le p}|\hat{\rho}_k - \rho_k| \ge c_7 n^{-\tilde{\tau}}\right) \le O\left(p\exp\left(-c_8 n^{(1-2\tilde{\tau})/3}\right)\right).$$
Theorem 4 (Rank consistency property). Under Assumptions (C.1)–(C.3), supposing $\min_{k\in\mathcal{A}}\rho_k - \max_{k\in\mathcal{A}^c}\rho_k \ge 2c_{10} n^{-\tilde{\tau}}$ for a constant $c_{10} > 0$ and $0 \le \tilde{\tau} < 1/2$, $0 < \tilde{\kappa} < 1/2 - \tilde{\tau}$, there exist constants $c_{11} > 0$ and $c_{12} > 0$, such that
$$\Pr\left(\min_{k\in\mathcal{A}}\hat{\rho}_k - \max_{k\in\mathcal{A}^c}\hat{\rho}_k > 0\right) > 1 - O\left(p\left[\exp\left(-c_{11} n^{1-2(\tilde{\tau}+\tilde{\kappa})}\right) + n^4\exp\left(-c_{12} n^{\tilde{\kappa}}\right)\right]\right).$$
Furthermore, as discussed in Theorem 3, if $\log p = o\left(n^{(1-2\tilde{\tau})/3}\right)$, then
$$\liminf_{n\to\infty}\left\{\min_{k\in\mathcal{A}}\hat{\rho}_k - \max_{k\in\mathcal{A}^c}\hat{\rho}_k\right\} > 0, \quad \text{almost surely}.$$
Appendix A gives the technical proofs of Theorems 3 and 4.

3. Results

3.1. Simulation Study

We conduct several simulated examples to evaluate the screening performance of LDCS and LCDCS, and compare the results with those of some existing methods, including partial residual sure independence screening (PRSIS) [3], time-varying coefficient models’ sure independence screening (TVCM-SIS) [19], distance correlation sure independence screening (DC-SIS) [7], conditional distance correlation sure independence screening (CDC-SIS) [9], and covariate information number sure independence screening (CIS) [10]. PRSIS and TVCM-SIS are longitudinal data feature screening procedures where PRSIS is based on partially linear models and TVCM-SIS is based on time-varying coefficient models. DC-SIS, CDC-SIS, and CIS are all model-free methods for cross-sectional data.
We repeat each simulation example 500 times. In all examples, the number of predictors is $p = 2000$ and the sample size is $n = 80$, yielding a submodel size of $d_0 = [n/\log(n)] = 18$. The predictors $X_{ki}$ $(k = 1, \ldots, p)$ and the random errors $\epsilon_i$ are generated independently from $\mathrm{MVN}(\mathbf{0}, \sigma_0^2\Sigma)$, where $\sigma_0$ and $\Sigma$ differ across examples. The time points $t_1, \ldots, t_M$ are sampled from a standard uniform distribution and remain fixed within a simulation. We use the following four criteria to assess screening performance:
  • MMS: the minimum submodel size containing all important predictors. The median and robust standard deviation (RSD = IQR/1.34) of MMS are reported, where IQR denotes the interquartile range.
  • $P_k$: the proportion of replications in which the important predictor $X_k$ is selected in the submodel.
  • $P_a$: the proportion of replications in which all important predictors are selected in the submodel.
  • $d_c$: the average number of important predictors selected in the submodel.

3.1.1. Example 1: Partially Linear Models

We generate the response from the following two models:
Model (1.a): $y_{im} = \beta_1 X_{1im} + \beta_2 X_{2im} + \beta_3 X_{3im} + \beta_4 I(X_{4im} > 0) + f(t_m) + \epsilon_{im}$,
Model (1.b): $y_{im} = \beta_1\beta_2 X_{1im}X_{2im} + \beta_3 X_{3im} + \beta_4 X_{4im} + f(t_m) + \epsilon_{im}$.
Model (1.a) considers a categorical (binarized) predictor, while Model (1.b) incorporates an interaction effect. These two models are based on Models (1.a) and (1.b) in Li et al. [7]; the difference is that we introduce longitudinal measurements and nonlinear time effects. We set $(\beta_1, \beta_2, \beta_3, \beta_4)^{\top} = (1, 0.75, 1, 4)^{\top}$, $f(t) = 36\sin(4\pi t)$, and $\sigma_0 = 2$ in Model (1.a), and $(\beta_1, \beta_2, \beta_3, \beta_4)^{\top} = (2.25, 2, 1, 1)^{\top}$, $f(t) = 4\sin(2\pi t)$, and $\sigma_0 = 1$ in Model (1.b). Furthermore, we set the number of repeated measurements to $M_i \equiv 5$ and the within-subject correlation structure to a first-order autoregressive (AR(1)) structure, with within-subject correlation $\rho_1 = 0.3$ and $0.6$ for each model, respectively.
The detailed results are reported in Table 1. In all four settings, our proposed method LDCS achieves the smallest MMS (in both median and RSD), along with the largest $P_a$ and $d_c$. When a strong nonlinear time effect is present, the model-free procedures DC-SIS and CIS perform less effectively. Furthermore, TVCM-SIS and PRSIS are not as effective at identifying interaction effects. While CIS identifies interaction effects relatively successfully, it tends not to retain the interacting and non-interacting important predictors simultaneously. Example 1 demonstrates that LDCS is unaffected by strong time effects and better handles interaction effects in partially linear models.

3.1.2. Example 2: Time-Varying Coefficient Models

In this example, we generate the response from the following two models:
Model (2.a): $y_{im} = \beta_1(t_m)X_{1im} + \beta_2(t_m)X_{2im} + \beta_3(t_m)X_{3im} + \beta_4(t_m)X_{4im} + \epsilon_{im}$,
Model (2.b): $\log(\mu_i(t_m)) = \beta_1(t_m)X_{1im} + \beta_2(t_m)X_{2im} + \beta_3(t_m)X_{3im} + \beta_4(t_m)X_{4im}$.
Model (2.a) is similar to Example I in Chu et al. [19], but with a smaller sample size and fewer measurements. Model (2.b) considers a count response with a Poisson distribution, based on Example II in Chu et al. [20], but we use a time-varying coefficient model rather than the generalized varying coefficient mixed-effect model. In Model (2.a), we set $\beta_1(t) = 4\cos(4\pi t)I(t < 0.5)$, $\beta_2(t) = 4.25\sin(4\pi t)I(t < 0.5)$, $\beta_3(t) = \sin(4\pi t)$, $\beta_4(t) = 1.25(1.2 - t)$, and $\sigma_0 = 2$. In Model (2.b), we set $\beta_1(t) = \sin(2\pi t)$, $\beta_2(t) = 1.25\sin(2\pi t)$, $\beta_3(t) = \sin(2\pi t)$, $\beta_4(t) = 1.25\sin(2\pi t)$, and $\sigma_0 = 1$. Furthermore, we set the within-subject correlation structure to a compound symmetry (CS) structure, with within-subject correlation $\rho_1 = 0.3$ and $0.6$ for each model, respectively.
Table 2 presents the detailed simulation results for Example 2. The results are similar to those of Example 1: LDCS achieves the smallest MMS (in both median and RSD), along with the largest $P_a$ and $d_c$. Specifically, for Model (2.a), although TVCM-SIS is based on the varying-coefficient model assumption, its performance is not ideal when the categorical predictors have a stronger impact on the response. For Model (2.b), the screening performance of TVCM-SIS, PRSIS, DC-SIS, and CIS is uniformly unsatisfactory, indicating that these methods are not well suited to varying coefficient models with count responses. These examples further demonstrate that, for some special varying coefficient models, LDCS may achieve better screening performance.

3.1.3. Example 3: Partially Linear Single-Index Models

We assess the screening performance of LCDCS when the predictors are correlated with each other. The response is generated from Model (3):
Model (3): $y_{im} = \left(\beta_1 X_{1im} + \beta_2 X_{2im} + \beta_3 X_{3im} + \beta_4 X_{4im}\right)^2 + f(t_m) + \epsilon_{im}$.
We set $(\beta_1, \beta_2, \beta_3, \beta_4)^{\top} = (3, 2.5, 2.5, 3)^{\top}$, $f(t) = 36t/(1-t)$, $\sigma_0 = 1$, the number of repeated measurements $M_i \equiv 3$, and the within-subject correlation structure as an AR(1) structure with within-subject correlation $\rho_1 = 0.5$. The correlation structure between predictors is set as a CS structure, with two scenarios considered: $\rho_2 = 0.5$ and $\rho_2 = 0.8$.
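For reference, the CS correlation matrix used for the predictors here has an explicit one-line construction (a minimal sketch; the function name is ours):

```python
import numpy as np

def cs_corr(p, rho):
    """Compound-symmetry correlation: 1 on the diagonal, rho elsewhere."""
    return rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)
```

Correlated predictors at a given time point can then be drawn from $\mathrm{MVN}(\mathbf{0}, \sigma_0^2\,\mathrm{cs\_corr}(p, \rho_2))$.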
We first use LDCS and DC-SIS to screen the most relevant predictor as a conditional predictor for LCDCS and CDC-SIS, respectively. Table 3 presents the detailed simulation results for Example 3. It can be seen that our proposed method, LCDCS, also achieves the smallest MMS (including median and RSD), along with the largest P a and d c . Since the number of repeated measurements is only 3, the improvement in screening performance of LCDCS over CDC-SIS is not particularly obvious. Additionally, the screening performance of TVCM-SIS and PRSIS is very poor, indicating that these methods, which rely on specific model assumptions, may not perform well in some complex model structures. CIS is also not well suited when the predictors are correlated. This example demonstrates that our proposed method, LCDCS, is more suitable for longitudinal data with correlated predictors in some special model structures.

3.2. Application to Gut Microbiota Data

We analyze real-world data to demonstrate the empirical performance of our methods. The data are gut microbiota measurements from Bangladeshi children, as reported by Subramanian et al. [21]. The longitudinal cohort study aimed to analyze the effects of therapeutic foods on children: the authors monitored the growth and development of n = 50 infants over a two-year period after birth, ultimately collecting a total of N = 996 fecal samples. Each sample contained 79,597 operational taxonomic units sharing at least 97% nucleotide sequence identity (97%-identity OTUs). Following Subramanian et al. [21], we retain the 1222 97%-identity OTUs that have at least two fecal samples with an average relative abundance greater than 0.1%. In this article, we consider height-for-age Z-scores (HAZ) as the response; HAZ reflects how a child's height compares to the average height of the same age and gender group, indicating whether growth and development are within the normal range. The HAZ of each child was measured between 6 and 22 times, as illustrated in Figure 2. As a further preprocessing step, we retain only the months in which at least 24 children had their HAZ measured. Thus, we eventually work with a dataset comprising 1222 97%-identity OTUs (the predictors) and HAZ (the response), measured on n = 50 children over 13 months, for a total of 433 measurements. On this dataset, we apply LDCS and LCDCS, along with TVCM-SIS, PRSIS, DC-SIS, CDC-SIS, and CIS, for feature screening, and we compare the results with those of Subramanian et al. [21].
First, we compare the top $d_0 = 10$ 97%-identity OTUs selected by each method, as summarized in Table 4. Our methods, LDCS and LCDCS, show a moderate overlap in selected 97%-identity OTUs with PRSIS, DC-SIS, and CDC-SIS, while the selections of TVCM-SIS and CIS differ markedly from those of the other methods. Next, we compare the top $d_0 = 220$ 97%-identity OTUs, as Subramanian et al. [21] identified 220 gut microbiota that differed significantly between children with severe acute malnutrition (SAM) and healthy children. Among all screening procedures, LDCS and LCDCS share the largest number of selected 97%-identity OTUs with those of Subramanian et al. [21].
Furthermore, we evaluate the screening performance through regression analysis. Following Chu et al. [19], we fit time-varying coefficient models with different numbers of 97%-identity OTUs and use the total heritability, commonly used in genetic analysis, to assess the goodness of fit. The total heritability of all 97%-identity OTUs is calculated as
$$H(\mathrm{HAZ}) = \frac{\mathrm{RSS}(\mathrm{HAZ} \mid \mathrm{Fetal}) - \mathrm{RSS}(\mathrm{HAZ} \mid \mathrm{Fetal}, \mathrm{OTU}_1, \ldots, \mathrm{OTU}_p)}{\mathrm{RSS}(\mathrm{HAZ} \mid \mathrm{Fetal})}.$$
Here, Fetal indicates whether the child is a singleton birth, and RSS is the residual sum of squares, defined as
$$\mathrm{RSS} = \sum_{i=1}^{n}\sum_{m=1}^{M_i}\left(y_i(t_{im}) - \hat{y}_i(t_{im})\right)^2,$$
where $y_i(t_{im})$ denotes the actual value and $\hat{y}_i(t_{im})$ denotes the predicted value from the time-varying coefficient model.
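As a small illustration of the formula, the helper below computes the total heritability from two precomputed residual sums of squares; the numbers in the example are hypothetical, chosen only to reproduce the 64.76% reported below for LCDCS.

```python
def total_heritability(rss_base, rss_full):
    """H = (RSS(HAZ | Fetal) - RSS(HAZ | Fetal, OTUs)) / RSS(HAZ | Fetal)."""
    return (rss_base - rss_full) / rss_base

# Hypothetical RSS values: a full model that cuts the baseline RSS to 35.24%
print(f"{total_heritability(100.0, 35.24):.2%}")  # -> 64.76%
```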
We calculate the total heritability of the top $d_0 = 1$ to $49$ 97%-identity OTUs and remove irrelevant 97%-identity OTUs using forward regression to achieve a better regression fit. Our proposed method LCDCS attains the highest total heritability at 64.76%, followed by LDCS with 60.24%. The final number of 97%-identity OTUs selected and the total heritability of each method are shown in Table 5.
We also plot the curves of total heritability for the different screening procedures in Figure 3. Overall, when the number of selected 97%-identity OTUs is between 4 and 23, the total heritability of LCDCS is higher than that of the other screening procedures, while when it is between 24 and 33, LDCS shows higher total heritability. Therefore, when the number of selected 97%-identity OTUs is moderate, we recommend using LCDCS, while for a larger number, we recommend LDCS.
We also calculate the heritability of each 97%-identity OTU, which depends on the sequence of selection in the forward regression. The heritability of the kth selected 97%-identity OTU is calculated as
$$H(\mathrm{OTU}_k) = \frac{\mathrm{RSS}(\mathrm{HAZ} \mid \mathrm{Fetal}, \mathrm{OTU}_{(1)}, \ldots, \mathrm{OTU}_{(k-1)})}{\mathrm{RSS}(\mathrm{HAZ} \mid \mathrm{Fetal})} - \frac{\mathrm{RSS}(\mathrm{HAZ} \mid \mathrm{Fetal}, \mathrm{OTU}_{(1)}, \ldots, \mathrm{OTU}_{(k-1)}, \mathrm{OTU}_{(k)})}{\mathrm{RSS}(\mathrm{HAZ} \mid \mathrm{Fetal})}.$$
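A companion sketch for the per-OTU heritability: given the baseline RSS and the sequence of RSS values obtained after adding $\mathrm{OTU}_{(1)}, \ldots, \mathrm{OTU}_{(K)}$ in the forward regression (hypothetical inputs), each output entry is one $H(\mathrm{OTU}_{(k)})$.

```python
def otu_heritabilities(rss_base, rss_seq):
    """H(OTU_(k)) = (RSS after k-1 OTUs - RSS after k OTUs) / baseline RSS."""
    prev = [rss_base] + list(rss_seq[:-1])
    return [(a - b) / rss_base for a, b in zip(prev, rss_seq)]
```

By construction, these per-OTU heritabilities sum to the total heritability of the full sequence.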
Table 6 presents the IDs and taxonomic annotations of the top 10 97%-identity OTUs, ranked by heritability, as selected via LCDCS. The taxonomic annotations are given by Subramanian et al. [21] and represent the genus of the gut microbiota. Among the 10 gut microbiota, 4 belong to the genus Bifidobacterium, a common group of probiotics found primarily in the human gut, particularly in infants, that has been shown to be closely associated with children's nutrient absorption, growth, and development. Furthermore, 5 of the 97%-identity OTUs do not appear in the results of Subramanian et al. [21] and may represent new discoveries. We also perform a sensitivity analysis to evaluate how the parameter $d_0$ influences the empirical results, with details provided in Appendix B.

4. Discussion

In this article, we improve DC by applying a sample splitting and data aggregation method to achieve better screening performance with longitudinal data. We also make a simple extension of LDCS to deal with the situation where the predictors are correlated. The two methods are capable of handling longitudinal data with unequal numbers of repeated measurements and unequal intervals between repeated measurement time points. Simulation studies indicate that, in some special situations, such as partially linear models with strong time effects or interaction effects, and varying coefficient models with count responses, our proposed method LDCS demonstrates better screening performance. Furthermore, in the situations where predictors are correlated, our proposed method, LCDCS, also achieves better screening performance for some complex structures. Finally, the results of the application show that LDCS and LCDCS achieve better outcomes at different selection scales.
This work also has some limitations. For instance, we did not consider the treatment of missing data or time-varying confounding. Additionally, both LDCS and LCDCS rely on sub-exponential tail probability assumptions, which may not always hold in practice. The screening performance of LCDCS is sensitive to bandwidth selection, and, owing to the kernel estimation, its computational burden is higher than that of some other model-free methods.
We plan to deal with the selection of the threshold d 0 in LDCS and LCDCS in a future study. Recently, Liu et al. [22] proposed handling this issue from the perspective of false discovery rate (FDR) control by using Model-X knockoff features. Chi et al. [23] applied Model-X knockoff features to the FDR control problem for time series data by using an e-value aggregation method. Therefore, similarly, we could also consider constructing Model-X knockoff features and corresponding knockoff statistics for each measurement time point in longitudinal data, and then aggregate these statistics by using the e-value aggregation method. We expect that this approach will help handle the selection problem of the threshold d 0 in LDCS and LCDCS from the perspective of FDR control.

Author Contributions

Conceptualization, J.C.; Methodology, J.C.; Software, J.C.; Validation, J.C. and Y.L.; Formal analysis, J.C. and Y.L.; Investigation, X.Y. and Y.L.; Resources, X.Y.; Writing – original draft, J.C.; Writing – review & editing, X.Y., J.D. and Y.L.; Visualization, J.C. and J.D.; Supervision, X.Y., J.D. and Y.L.; Project administration, J.D. and Y.L.; Funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by Sichuan Province Administration of Traditional Chinese Medicine of China (grant numbers 25MSZX477 and 25MSZX495).

Data Availability Statement

The data presented in this study are openly available from Subramanian et al. [21] at https://doi.org/10.1038/nature13421.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations and notation are used in this manuscript:
$t_m$: mth measurement time point
$\mathbf{y}$: Response vector
$\mathbf{X}$: Predictor vector
$\mathbf{Z}$: Conditional predictor vector
$p$, $q$, $r$: Dimensions of $\mathbf{X}$, $\mathbf{y}$, $\mathbf{Z}$
$n$: Number of subjects
$M_i$: Number of repeated measurements for the ith subject
$N$: Total number of measurements
$M$: Maximum of $M_i$
$F(\cdot)$: Cumulative distribution function
$\mathcal{D}$, $\mathcal{A}$: Index sets of important predictors
$\mathcal{D}^c$, $\mathcal{A}^c$: Index sets of unimportant predictors
$L_0$, $L_1$: Cardinalities of $\mathcal{D}$, $\mathcal{A}$
$DC_k(t_m)$: Distance correlation of $X_{km}$ and $\mathbf{y}_m$
$\omega_k$, $\rho_k$: Dependence measures at the population level
$\hat{\omega}_k$, $\hat{\rho}_k$: Dependence measures at the sample level
$d_0$: Selection number threshold
$CDC_k(t_m)$: Conditional distance correlation of $X_{km}$ and $\mathbf{y}_m$
$K(\cdot)$: Kernel function
$h$: Bandwidth
$\mathrm{MVN}(\cdot)$: Multivariate normal distribution
$I(\cdot)$: Indicator function
$\rho_1$: Within-subject correlation
$\rho_2$: Correlation between predictors
$H(\cdot)$: Heritability
$\mathrm{RSS}$: Residual sum of squares
$\mathrm{OTU}_k$: kth 97%-identity OTU

Appendix A. Theoretical Proofs

Before proving Theorem 1, we state the following lemma from Li et al. [7].
Lemma A1. 
For any random vectors $\mathbf{X} \in \mathbb{R}^p$ and $\mathbf{y} \in \mathbb{R}^q$, under Assumption (C.1), for any $0 < \kappa < 1/2 - \tau$ and $\epsilon > 0$, there exist constants $c_2 > 0$ and $c_3 > 0$, such that
$$\Pr\left(\left|\widehat{\mathrm{dCor}}^2(\mathbf{X}, \mathbf{y}) - \mathrm{dCor}^2(\mathbf{X}, \mathbf{y})\right| \ge \epsilon\right) \le O\left(\exp\left(-c_2\epsilon^2 n^{1-2\kappa}\right) + n\exp\left(-c_3 n^{\kappa}\right)\right).$$
Proof of Theorem 1. 
First, we show that, for any k,
$$\Pr\left(|\hat{\omega}_k - \omega_k| \ge c_1 n^{-\tau}\right) \le O\left(\exp\left(-c_2 n^{1-2(\tau+\kappa)}\right) + n\exp\left(-c_3 n^{\kappa}\right)\right). \tag{A1}$$
This is because
$$|\hat{\omega}_k - \omega_k| = \left|\sum_{m=1}^{M}\widehat{DC}_k(t_m)\frac{\Delta t_m}{t_M - t_1} - \sum_{m=1}^{M}DC_k(t_m)\frac{\Delta t_m}{t_M - t_1}\right| \le \frac{G}{t_M - t_1}\sum_{m=1}^{M}\left|\widehat{DC}_k(t_m) - DC_k(t_m)\right|, \tag{A2}$$
where $G = \max_{1\le m\le M}\Delta t_m$, $\Delta t_m = t_m - t_{m-1}$, and $\Delta t_1 = 0$. For any $\epsilon > 0$,
$$\Pr\left(|\hat{\omega}_k - \omega_k| \ge \epsilon\right) \le \Pr\left(\frac{G}{t_M - t_1}\sum_{m=1}^{M}\left|\widehat{DC}_k(t_m) - DC_k(t_m)\right| \ge \epsilon\right).$$
Next, we have
$$\Pr\left(\frac{G}{t_M - t_1}\sum_{m=1}^{M}\left|\widehat{DC}_k(t_m) - DC_k(t_m)\right| \ge \epsilon\right) \le M\max_{1\le m\le M}\Pr\left(\left|\widehat{DC}_k(t_m) - DC_k(t_m)\right| \ge \frac{(t_M - t_1)\,\epsilon}{MG}\right) \le M\max_{1\le m\le M}\Pr\left(\left|\widehat{DC}_k(t_m) - DC_k(t_m)\right| \ge \frac{\epsilon}{M}\right). \tag{A3}$$
The last inequality holds because $t_M - t_1 \ge G = \max_{1\le m\le M}\Delta t_m$.
Next, we deal with $\Pr\left(\left|\widehat{DC}_k(t_m) - DC_k(t_m)\right| \ge \epsilon/M\right)$. By Lemma A1, for any $\epsilon > 0$ and $0 < \kappa < 1/2 - \tau$, it follows from Assumption (C.1) that
$$\Pr\left(\left|\widehat{DC}_k(t_m) - DC_k(t_m)\right| \ge \epsilon\right) \le O\left(\exp\left(-c_2\epsilon^2 n^{1-2\kappa}\right) + n\exp\left(-c_3 n^{\kappa}\right)\right) \tag{A4}$$
for constants $c_2 > 0$ and $c_3 > 0$. (A4) entails that
$$\Pr\left(\left|\widehat{DC}_k(t_m) - DC_k(t_m)\right| \ge \frac{\epsilon}{M}\right) \le O\left(\exp\left(-c_2\left(\frac{\epsilon}{M}\right)^2 n^{1-2\kappa}\right) + n\exp\left(-c_3 n^{\kappa}\right)\right). \tag{A5}$$
Let $\epsilon/M = c_0 n^{-\tau}$ and $c_1 = c_0 M$, where $\tau$ satisfies $0 < \tau + \kappa < 1/2$. If M is finite, then we have
$$\Pr\left(|\hat{\omega}_k - \omega_k| \ge c_1 n^{-\tau}\right) \le O\left(\exp\left(-c_2 n^{1-2(\kappa+\tau)}\right) + n\exp\left(-c_3 n^{\kappa}\right)\right), \tag{A6}$$
and
$$\Pr\left(\max_{1\le k\le p}|\hat{\omega}_k - \omega_k| \ge c_1 n^{-\tau}\right) \le O\left(p\left[\exp\left(-c_2 n^{1-2(\tau+\kappa)}\right) + n\exp\left(-c_3 n^{\kappa}\right)\right]\right).$$
The proof of the first part of Theorem 1 is complete.
Next, we prove the second part of Theorem 1. If $\mathcal{D} \not\subseteq \hat{\mathcal{D}}$, then there must exist some $k \in \mathcal{D}$ such that $\hat{\omega}_k < c_1 n^{-\tau}$. Since $\min_{k\in\mathcal{D}}\omega_k \ge 2c_1 n^{-\tau}$, this entails $|\hat{\omega}_k - \omega_k| > c_1 n^{-\tau}$ for some $k \in \mathcal{D}$, indicating that $\{\mathcal{D} \not\subseteq \hat{\mathcal{D}}\} \subseteq \{|\hat{\omega}_k - \omega_k| > c_1 n^{-\tau}, \text{ for some } k \in \mathcal{D}\}$, and then $\mathcal{D}_n = \{\max_{k\in\mathcal{D}}|\hat{\omega}_k - \omega_k| \le c_1 n^{-\tau}\} \subseteq \{\mathcal{D} \subseteq \hat{\mathcal{D}}\}$. Consequently, when n is sufficiently large,
$$\Pr\left(\mathcal{D} \subseteq \hat{\mathcal{D}}\right) \ge \Pr\left(\mathcal{D}_n\right) = 1 - \Pr\left(\mathcal{D}_n^c\right) = 1 - \Pr\left(\max_{k\in\mathcal{D}}|\hat{\omega}_k - \omega_k| > c_1 n^{-\tau}\right) \ge 1 - L_0\max_{k\in\mathcal{D}}\Pr\left(|\hat{\omega}_k - \omega_k| > c_1 n^{-\tau}\right) \ge 1 - O\left(L_0\left[\exp\left(-c_2 n^{1-2(\kappa+\tau)}\right) + n\exp\left(-c_3 n^{\kappa}\right)\right]\right), \tag{A7}$$
where $L_0$ is the cardinality of $\mathcal{D}$. (A7) implies that $\Pr\left(\mathcal{D} \subseteq \hat{\mathcal{D}}\right) \to 1$ as $n \to \infty$. □
Proof of Theorem 2. 
Let $k_1 = \arg\min_{k\in\mathcal{D}}\hat{\omega}_k$ and $k_2 = \arg\max_{k\in\mathcal{D}^c}\hat{\omega}_k$. For any $0 < \kappa < 1/2 - \tau$, we have
$$\begin{aligned}
\Pr\left(\min_{k\in\mathcal{D}}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\hat{\omega}_k \le 0\right) &\le \Pr\left(\min_{k\in\mathcal{D}}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\hat{\omega}_k - \left(\min_{k\in\mathcal{D}}\omega_k - \max_{k\in\mathcal{D}^c}\omega_k\right) \le -2c_4 n^{-\tau}\right)\\
&= \Pr\left(\min_{k\in\mathcal{D}}\omega_k - \min_{k\in\mathcal{D}}\hat{\omega}_k + \max_{k\in\mathcal{D}^c}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\omega_k \ge 2c_4 n^{-\tau}\right)\\
&\le \Pr\left(\left|\omega_{k_1} - \hat{\omega}_{k_1}\right| + \left|\hat{\omega}_{k_2} - \omega_{k_2}\right| \ge 2c_4 n^{-\tau}\right)\\
&\le \Pr\left(\left|\hat{\omega}_{k_1} - \omega_{k_1}\right| \ge c_4 n^{-\tau}\right) + \Pr\left(\left|\hat{\omega}_{k_2} - \omega_{k_2}\right| \ge c_4 n^{-\tau}\right)\\
&\le 2\Pr\left(\max_{1\le k\le p}\left|\hat{\omega}_k - \omega_k\right| \ge c_4 n^{-\tau}\right)\\
&= O\left(p\left[\exp\left(-c_5 n^{1-2(\tau+\kappa)}\right) + n\exp\left(-c_6 n^{\kappa}\right)\right]\right),
\end{aligned}$$
where $c_4$, $c_5$, $c_6$ are positive constants. The last equality follows from Theorem 1. Thus, we have
$$\Pr\left(\min_{k\in\mathcal{D}}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\hat{\omega}_k > 0\right) \ge 1 - O\left(p\left[\exp\left(-c_5 n^{1-2(\tau+\kappa)}\right) + n\exp\left(-c_6 n^{\kappa}\right)\right]\right).$$
If $\log p = o\left(n^{(1-2\tau)/3}\right)$ with $\kappa = (1-2\tau)/3$, then we have
$$\Pr\left(\min_{k\in\mathcal{D}}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\hat{\omega}_k > 0\right) \ge 1 - O\left(p\exp\left(-c_5 n^{(1-2\tau)/3}\right)\right).$$
We know $p < \exp\left(c_5 n^{(1-2\tau)/3}/2\right)$ for large n; therefore, there must exist a large n, such that
$$p\exp\left(-c_5 n^{(1-2\tau)/3}\right) \le \exp\left(-c_5 n^{(1-2\tau)/3}/2\right) \le \exp\left(-2\log n\right) = n^{-2}.$$
Then, we have
$$\sum_{n=n_0}^{\infty} p\exp\left(-c_5 n^{(1-2\tau)/3}\right) \le \sum_{n=n_0}^{\infty} n^{-2} < \infty,$$
where $n_0 > 0$. Therefore, according to the Borel–Cantelli Lemma, we obtain
$$\Pr\left(\limsup_{n\to\infty}\left\{\min_{k\in\mathcal{D}}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\hat{\omega}_k \le 0\right\}\right) = 0,$$
and then
$$\Pr\left(\liminf_{n\to\infty}\left\{\min_{k\in\mathcal{D}}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\hat{\omega}_k > 0\right\}\right) = 1 - \Pr\left(\limsup_{n\to\infty}\left\{\min_{k\in\mathcal{D}}\hat{\omega}_k - \max_{k\in\mathcal{D}^c}\hat{\omega}_k \le 0\right\}\right) = 1.$$
The proof of Theorem 2 is complete. □
Before proving Theorem 3, we state the following lemma from Wen et al. [9].
Lemma A2. 
For any random vectors $\mathbf{X} \in \mathbb{R}^p$, $\mathbf{y} \in \mathbb{R}^q$, and $\mathbf{Z} \in \mathbb{R}^r$, suppose Assumptions (C.1)–(C.3) hold and the bandwidth for the kernel estimation of $\mathbf{Z}$ satisfies $h = O\left(n^{-\tilde{\kappa}/(2r)}\right)$. Then, for any $0 < \tilde{\kappa} < 1/2 - \tilde{\tau}$ and $\epsilon > 0$, there exist constants $c_8 > 0$ and $c_9 > 0$, such that
$$\Pr\left(\left|\widehat{\mathrm{cdCor}}^2(\mathbf{X}, \mathbf{y}|\mathbf{Z}) - \mathrm{cdCor}^2(\mathbf{X}, \mathbf{y}|\mathbf{Z})\right| \ge \epsilon\right) \le O\left(\exp\left(-c_8\epsilon^2 n^{1-2\tilde{\kappa}}\right) + n^4\exp\left(-c_9 n^{\tilde{\kappa}}\right)\right).$$
Proof of Theorem 3. 
Similar to the proof of Theorem 1, we can get
$$\Pr\left(|\hat{\rho}_k - \rho_k| \ge \epsilon\right) \le M\max_{1\le m\le M}\Pr\left(\left|\widehat{CDC}_k(t_m) - CDC_k(t_m)\right| \ge \frac{\epsilon}{M}\right). \tag{A8}$$
By Lemma A2, for any $\epsilon > 0$ and $0 < \tilde{\kappa} < 1/2 - \tilde{\tau}$, it follows from Assumptions (C.1)–(C.3) that
$$\Pr\left(\left|\widehat{CDC}_k(t_m) - CDC_k(t_m)\right| \ge \epsilon\right) \le O\left(\exp\left(-c_8\epsilon^2 n^{1-2\tilde{\kappa}}\right) + n^4\exp\left(-c_9 n^{\tilde{\kappa}}\right)\right) \tag{A9}$$
for constants $c_8 > 0$ and $c_9 > 0$. (A9) entails that
$$\Pr\left(\left|\widehat{CDC}_k(t_m) - CDC_k(t_m)\right| \ge \frac{\epsilon}{M}\right) \le O\left(\exp\left(-c_8\left(\frac{\epsilon}{M}\right)^2 n^{1-2\tilde{\kappa}}\right) + n^4\exp\left(-c_9 n^{\tilde{\kappa}}\right)\right).$$
Then, letting $\epsilon/M = c_0 n^{-\tilde{\tau}}$ and $c_7 = c_0 M$, where $\tilde{\tau}$ satisfies $0 < \tilde{\tau} + \tilde{\kappa} < 1/2$, we have
$$\Pr\left(|\hat{\rho}_k - \rho_k| \ge c_7 n^{-\tilde{\tau}}\right) \le O\left(\exp\left(-c_8 n^{1-2(\tilde{\kappa}+\tilde{\tau})}\right) + n^4\exp\left(-c_9 n^{\tilde{\kappa}}\right)\right),$$
and
$$\Pr\left(\max_{1\le k\le p}|\hat{\rho}_k - \rho_k| \ge c_7 n^{-\tilde{\tau}}\right) \le O\left(p\left[\exp\left(-c_8 n^{1-2(\tilde{\tau}+\tilde{\kappa})}\right) + n^4\exp\left(-c_9 n^{\tilde{\kappa}}\right)\right]\right).$$
The proof of the first part of Theorem 3 is complete.
Next, we prove the second part of Theorem 3. If $\mathcal{A} \not\subseteq \hat{\mathcal{A}}$, then there must exist some $k \in \mathcal{A}$ such that $\hat{\rho}_k < c_7 n^{-\tilde{\tau}}$. Since $\min_{k\in\mathcal{A}}\rho_k \ge 2c_7 n^{-\tilde{\tau}}$, this entails $|\hat{\rho}_k - \rho_k| > c_7 n^{-\tilde{\tau}}$ for some $k \in \mathcal{A}$, indicating that $\{\mathcal{A} \not\subseteq \hat{\mathcal{A}}\} \subseteq \{|\hat{\rho}_k - \rho_k| > c_7 n^{-\tilde{\tau}}, \text{ for some } k \in \mathcal{A}\}$, and then $\mathcal{A}_n = \{\max_{k\in\mathcal{A}}|\hat{\rho}_k - \rho_k| \le c_7 n^{-\tilde{\tau}}\} \subseteq \{\mathcal{A} \subseteq \hat{\mathcal{A}}\}$. Consequently, when n is sufficiently large,
$$\Pr\left(\mathcal{A} \subseteq \hat{\mathcal{A}}\right) \ge \Pr\left(\mathcal{A}_n\right) = 1 - \Pr\left(\mathcal{A}_n^c\right) = 1 - \Pr\left(\max_{k\in\mathcal{A}}|\hat{\rho}_k - \rho_k| > c_7 n^{-\tilde{\tau}}\right) \ge 1 - O\left(L_1\left[\exp\left(-c_8 n^{1-2(\tilde{\kappa}+\tilde{\tau})}\right) + n^4\exp\left(-c_9 n^{\tilde{\kappa}}\right)\right]\right), \tag{A10}$$
where $L_1$ is the cardinality of $\mathcal{A}$. (A10) implies that $\Pr\left(\mathcal{A} \subseteq \hat{\mathcal{A}}\right) \to 1$ as $n \to \infty$. □
Proof of Theorem 4. 
Similar to the proof of Theorem 2, for any $0 < \tilde{\kappa} < 1/2 - \tilde{\tau}$, we have
$$\Pr\left(\min_{k\in\mathcal{A}}\hat{\rho}_k - \max_{k\in\mathcal{A}^c}\hat{\rho}_k \le 0\right) \le O\left(p\left[\exp\left(-c_{11} n^{1-2(\tilde{\tau}+\tilde{\kappa})}\right) + n^4\exp\left(-c_{12} n^{\tilde{\kappa}}\right)\right]\right),$$
where $c_{11}$, $c_{12}$ are positive constants. Thus, we have
$$\Pr\left(\min_{k\in\mathcal{A}}\hat{\rho}_k - \max_{k\in\mathcal{A}^c}\hat{\rho}_k > 0\right) \ge 1 - O\left(p\left[\exp\left(-c_{11} n^{1-2(\tilde{\tau}+\tilde{\kappa})}\right) + n^4\exp\left(-c_{12} n^{\tilde{\kappa}}\right)\right]\right).$$
If $\log p = o\left(n^{(1-2\tilde{\tau})/3}\right)$ with $\tilde{\kappa} = (1-2\tilde{\tau})/3$, then we have
$$\Pr\left(\min_{k\in\mathcal{A}}\hat{\rho}_k - \max_{k\in\mathcal{A}^c}\hat{\rho}_k > 0\right) \ge 1 - O\left(p\exp\left(-c_{11} n^{(1-2\tilde{\tau})/3}\right)\right).$$
We know $p < \exp\left(c_{11} n^{(1-2\tilde{\tau})/3}/2\right)$ for large n; therefore, there must exist a large n, such that
$$p\exp\left(-c_{11} n^{(1-2\tilde{\tau})/3}\right) \le \exp\left(-c_{11} n^{(1-2\tilde{\tau})/3}/2\right) \le \exp\left(-2\log n\right) = n^{-2}.$$
Then, we have
$$\sum_{n=n_0}^{\infty} p\exp\left(-c_{11} n^{(1-2\tilde{\tau})/3}\right) \le \sum_{n=n_0}^{\infty} n^{-2} < \infty,$$
where $n_0 > 0$. Therefore, according to the Borel–Cantelli Lemma, we obtain
$$\Pr\left(\limsup_{n\to\infty}\left\{\min_{k\in\mathcal{A}}\hat{\rho}_k - \max_{k\in\mathcal{A}^c}\hat{\rho}_k \le 0\right\}\right) = 0,$$
and then
$$\Pr\left(\liminf_{n\to\infty}\left\{\min_{k\in\mathcal{A}}\hat{\rho}_k - \max_{k\in\mathcal{A}^c}\hat{\rho}_k > 0\right\}\right) = 1 - \Pr\left(\limsup_{n\to\infty}\left\{\min_{k\in\mathcal{A}}\hat{\rho}_k - \max_{k\in\mathcal{A}^c}\hat{\rho}_k \le 0\right\}\right) = 1,$$
which completes the proof of Theorem 4. □

Appendix B. Sensitivity Analysis in Application

To demonstrate the impact of the choice of $d_0$ on LCDCS, we conduct a sensitivity analysis on the application. In addition to $d_0 = n - 1$ considered in Section 3.2, we also consider $d_0 = a[n/\log(n)]$, where $a = 1, 2, 3$. Table A1 provides the IDs and taxonomic annotations of the selected 97%-identity OTUs, ranked by heritability after forward regression.
We can see that, in each situation, the heritabilities of New.0.CleanUp.ReferenceOTU89679, 561636, and New.0.ReferenceOTU340 consistently rank in the top 3. However, there are some differences in the rankings from 4th to 10th, indicating that selecting an appropriate threshold $d_0$ is important and merits future study.
Table A1. IDs and taxonomic annotations of the top 10 97%-identity OTUs selected by LCDCS based on heritability at different $d_0$.

Order | ID | Taxonomic Annotation

$d_0 = [n/\log(n)] = 12$
1 | New.0.CleanUp.ReferenceOTU89679 * | Collinsella
2 | 561636 | Streptococcus
3 | New.0.ReferenceOTU340 * | Bifidobacterium
4 | 326977 * | Bifidobacterium
5 | New.1.ReferenceOTU284 | NA †
6 | 533785 * | Bifidobacterium
7 | 561483 * | Bifidobacterium
8 | 305760 * | Escherichia/Shigella

$d_0 = 2[n/\log(n)] = 25$
1 | New.0.CleanUp.ReferenceOTU89679 * | Collinsella
2 | 561636 | Streptococcus
3 | New.0.ReferenceOTU340 * | Bifidobacterium
4 | 326977 * | Bifidobacterium
5 | 469868 | Bifidobacterium
6 | 554755 | Enterococcaceae
7 | 72820 * | Bifidobacterium
8 | New.1.ReferenceOTU284 | NA †
9 | 533785 * | Bifidobacterium
10 | 469852 * | Bifidobacterium

$d_0 = 3[n/\log(n)] = 38$
1 | New.0.CleanUp.ReferenceOTU89679 * | Collinsella
2 | 561636 | Streptococcus
3 | New.0.ReferenceOTU340 * | Bifidobacterium
4 | 326977 * | Bifidobacterium
5 | 469868 | Bifidobacterium
6 | New.0.ReferenceOTU339 | NA †
7 | 130663 * | Bacteroides
8 | 554755 | Enterococcaceae
9 | 210269 * | NA †
10 | 301004 * | Olsenella

$d_0 = n - 1 = 49$
1 | New.0.CleanUp.ReferenceOTU89679 * | Collinsella
2 | 561636 | Streptococcus
3 | New.0.ReferenceOTU340 * | Bifidobacterium
4 | 470527 | Lactobacillus
5 | 326977 * | Bifidobacterium
6 | 471180 * | Bifidobacterium
7 | 469868 | Bifidobacterium
8 | 345575 | Lactococcus
9 | 316587 * | Streptococcus
10 | New.0.ReferenceOTU339 | NA †

Note: when the threshold $d_0$ is set to 12, only 8 97%-identity OTUs remain after forward regression. * 97%-identity OTUs discovered in Subramanian et al. [21]. † The taxonomic annotation is not provided by Subramanian et al. [21].

References

  1. Fan, J.Q.; Lv, J.C. Sure Independence Screening for Ultrahigh Dimensional Feature Space. J. R. Stat. Soc. B 2008, 70, 849–911.
  2. Song, R.; Yi, F.; Zou, H. On Varying-Coefficient Independence Screening for High-Dimensional Varying-Coefficient Models. Stat. Sinica 2014, 24, 1735–1752.
  3. Liu, J.Y. Feature Screening and Variable Selection for Partially Linear Models with Ultrahigh-Dimensional Longitudinal Data. Neurocomputing 2016, 195, 202–210.
  4. Lai, P.; Liang, W.J.; Wang, F.J.; Zhang, Q.Z. Feature Screening of Quadratic Inference Functions for Ultrahigh Dimensional Longitudinal Data. J. Stat. Comput. Sim. 2020, 90, 2614–2630.
  5. Jiang, B.Y.; Lv, J.; Li, J.L.; Cheng, M.Y. Robust Model Averaging Prediction of Longitudinal Response with Ultrahigh-Dimensional Covariates. J. R. Stat. Soc. Ser. B Stat. Methodol. 2024, 87, 337–361.
  6. Zhu, L.P.; Li, L.X.; Li, R.Z.; Zhu, L.X. Model-Free Feature Screening for Ultrahigh-Dimensional Data. J. Am. Stat. Assoc. 2011, 106, 1464–1475.
  7. Li, R.Z.; Zhong, W.; Zhu, L.P. Feature Screening via Distance Correlation Learning. J. Am. Stat. Assoc. 2012, 107, 1129–1139.
  8. Shao, X.F.; Zhang, J.S. Martingale Difference Correlation and Its Use in High-Dimensional Variable Screening. J. Am. Stat. Assoc. 2014, 109, 1302–1318.
  9. Wen, C.H.; Pan, W.L.; Huang, M.M.; Wang, X.Q. Sure Independence Screening Adjusted for Confounding Covariates with Ultrahigh Dimensional Data. Stat. Sinica 2018, 28, 293–317.
  10. Nandy, D.; Chiaromonte, F.; Li, R.Z. Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems. J. Am. Stat. Assoc. 2022, 117, 1516–1529.
  11. Zhang, J.; Liu, Y.Y.; Cui, H.J. Model-Free Feature Screening via Distance Correlation for Ultrahigh Dimensional Survival Data. Stat. Papers 2021, 62, 2711–2738.
  12. Zhong, W.; Qian, C.; Liu, W.J.; Zhu, L.P.; Li, R.Z. Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills. J. Am. Stat. Assoc. 2023, 118, 805–817.
  13. Zhou, T.Y.; Zhu, L.P. Model-Free Feature Screening for Ultrahigh Dimensional Censored Regression. Stat. Comput. 2017, 27, 947–961.
  14. Székely, G.J.; Rizzo, M.L.; Bakirov, N.K. Measuring and Testing Dependence by Correlation of Distances. Ann. Stat. 2007, 35, 2769–2794.
  15. Wang, X.Q.; Pan, W.L.; Hu, W.H.; Tian, Y.; Zhang, H.P. Conditional Distance Correlation. J. Am. Stat. Assoc. 2015, 110, 1726–1734.
  16. Chen, L.P. Feature Screening Based on Distance Correlation for Ultrahigh-Dimensional Censored Data with Covariate Measurement Error. Comput. Stat. 2021, 36, 857–884.
  17. Lu, J.; Lin, L. Model-Free Conditional Screening via Conditional Distance Correlation. Stat. Papers 2020, 61, 225–244.
  18. Dai, C.G.; Lin, B.Y.; Xing, X.; Liu, J.S. False Discovery Rate Control via Data Splitting. J. Am. Stat. Assoc. 2023, 118, 2503–2520.
  19. Chu, W.H.; Li, R.Z.; Reimherr, M. Feature Screening for Time-Varying Coefficient Models with Ultrahigh-Dimensional Longitudinal Data. Ann. Appl. Stat. 2016, 10, 596–617.
  20. Chu, W.H.; Li, R.Z.; Liu, J.Y.; Reimherr, M. Feature Selection for Generalized Varying Coefficient Mixed-Effect Models with Application to Obesity GWAS. Ann. Appl. Stat. 2020, 14, 276–298.
  21. Subramanian, S.; Huq, S.; Yatsunenko, T.; Haque, R.; Mahfuz, M.; Alam, M.A.; Benezra, A.; DeStefano, J.; Meier, M.F.; Muegge, B.D.; et al. Persistent Gut Microbiota Immaturity in Malnourished Bangladeshi Children. Nature 2014, 510, 417–421.
  22. Liu, W.J.; Ke, Y.; Liu, J.Y.; Li, R.Z. Model-Free Feature Screening and FDR Control with Knockoff Features. J. Am. Stat. Assoc. 2022, 117, 428–443.
  23. Chi, C.M.; Fan, Y.Y.; Ing, C.K.; Lv, J.C. High-Dimensional Knockoffs Inference for Time Series Data. J. Am. Stat. Assoc. 2025, 1–24.
Figure 1. The geometric illustration of the LDC statistic. $\Delta t_4$ represents the measurement interval between time points $t_3$ and $t_4$; $DC(t_4)\,\Delta t_4$ represents the area of the histogram bar between time points $t_3$ and $t_4$.
Figure 2. Trajectories of HAZ in the first 2 years after birth for 50 children. Each boxplot illustrates the variability in the HAZ of the children for the given month.
Figure 3. Curves of total heritability against the number of 97%-identity OTUs selected through different methods.
Table 1. MMS, $P_k$, $P_a$, and $d_c$ for Example 1.

Method | MMS Median | MMS RSD | P1 | P2 | P3 | P4 | Pa | dc

Model (1.a), $\rho_1 = 0.3$
LDCS | 4.00 | 0.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 3.99
TVCM-SIS | 4.00 | 3.73 | 0.99 | 0.86 | 0.99 | 0.98 | 0.82 | 3.82
PRSIS | 4.00 | 82.28 | 0.84 | 0.74 | 0.84 | 0.83 | 0.68 | 3.25
DC-SIS | 370.00 | 356.90 | 0.31 | 0.17 | 0.33 | 0.27 | 0.04 | 1.08
CIS | 609.50 | 726.49 | 0.34 | 0.13 | 0.38 | 1.00 | 0.04 | 1.85

Model (1.a), $\rho_1 = 0.6$
LDCS | 4.00 | 0.00 | 1.00 | 0.96 | 1.00 | 1.00 | 0.96 | 3.96
TVCM-SIS | 4.00 | 4.48 | 0.99 | 0.84 | 0.98 | 0.99 | 0.81 | 3.80
PRSIS | 4.00 | 23.51 | 0.89 | 0.78 | 0.90 | 0.90 | 0.71 | 3.47
DC-SIS | 302.50 | 349.63 | 0.43 | 0.17 | 0.46 | 0.39 | 0.08 | 1.45
CIS | 667.00 | 711.38 | 0.35 | 0.13 | 0.33 | 1.00 | 0.04 | 1.82

Model (1.b), $\rho_1 = 0.3$
LDCS | 13.00 | 22.57 | 0.90 | 0.90 | 0.83 | 0.82 | 0.59 | 3.45
TVCM-SIS | 651.00 | 626.12 | 0.26 | 0.21 | 0.81 | 0.81 | 0.03 | 2.08
PRSIS | 1022.50 | 773.51 | 0.14 | 0.11 | 0.93 | 0.90 | 0.01 | 2.08
DC-SIS | 19.00 | 33.58 | 0.75 | 0.75 | 0.87 | 0.86 | 0.49 | 3.23
CIS | 490.50 | 596.08 | 0.87 | 0.89 | 0.18 | 0.17 | 0.02 | 2.10

Model (1.b), $\rho_1 = 0.6$
LDCS | 21.00 | 38.81 | 0.87 | 0.84 | 0.77 | 0.79 | 0.46 | 3.26
TVCM-SIS | 587.50 | 672.57 | 0.22 | 0.23 | 0.72 | 0.74 | 0.02 | 1.91
PRSIS | 957.00 | 787.13 | 0.15 | 0.12 | 0.80 | 0.84 | 0.00 | 1.91
DC-SIS | 30.00 | 47.95 | 0.64 | 0.64 | 0.83 | 0.83 | 0.36 | 2.93
CIS | 523.00 | 608.02 | 0.89 | 0.89 | 0.17 | 0.16 | 0.03 | 2.10
Table 2. MMS, $P_k$, $P_a$, and $d_c$ for Example 2.

Method | MMS Median | MMS RSD | P1 | P2 | P3 | P4 | Pa | dc

Model (2.a), $\rho_1 = 0.3$
LDCS | 12.00 | 93.47 | 0.84 | 0.85 | 0.82 | 0.94 | 0.54 | 3.45
TVCM-SIS | 187.00 | 785.82 | 0.56 | 0.57 | 0.84 | 0.89 | 0.36 | 2.86
PRSIS | 740.50 | 958.58 | 0.65 | 0.67 | 0.23 | 0.99 | 0.10 | 2.54
DC-SIS | 335.50 | 682.46 | 0.72 | 0.71 | 0.32 | 0.99 | 0.15 | 2.74
CIS | 948.00 | 755.78 | 0.47 | 0.46 | 0.20 | 0.55 | 0.01 | 1.68

Model (2.a), $\rho_1 = 0.6$
LDCS | 14.00 | 105.22 | 0.84 | 0.84 | 0.82 | 0.95 | 0.53 | 3.45
TVCM-SIS | 157.50 | 750.37 | 0.57 | 0.57 | 0.84 | 0.90 | 0.38 | 2.88
PRSIS | 753.50 | 861.19 | 0.64 | 0.64 | 0.19 | 0.97 | 0.07 | 2.44
DC-SIS | 366.50 | 601.87 | 0.67 | 0.68 | 0.27 | 0.98 | 0.08 | 2.61
CIS | 1059.00 | 695.52 | 0.47 | 0.46 | 0.16 | 0.56 | 0.01 | 1.66

Model (2.b), $\rho_1 = 0.3$
LDCS | 16.00 | 45.52 | 0.76 | 0.93 | 0.75 | 0.95 | 0.52 | 3.39
TVCM-SIS | 261.50 | 638.06 | 0.34 | 0.52 | 0.38 | 0.53 | 0.14 | 1.77
PRSIS | 845.00 | 972.01 | 0.21 | 0.34 | 0.22 | 0.33 | 0.05 | 1.10
DC-SIS | 430.00 | 715.86 | 0.34 | 0.50 | 0.35 | 0.53 | 0.18 | 1.71
CIS | 1402.50 | 537.31 | 0.05 | 0.13 | 0.07 | 0.14 | 0.01 | 0.40

Model (2.b), $\rho_1 = 0.6$
LDCS | 24.50 | 60.63 | 0.71 | 0.91 | 0.73 | 0.92 | 0.45 | 3.26
TVCM-SIS | 395.50 | 735.63 | 0.33 | 0.50 | 0.35 | 0.52 | 0.11 | 1.70
PRSIS | 916.00 | 921.83 | 0.18 | 0.27 | 0.18 | 0.31 | 0.03 | 0.94
DC-SIS | 557.00 | 816.23 | 0.30 | 0.47 | 0.28 | 0.50 | 0.11 | 1.55
CIS | 1423.50 | 558.58 | 0.06 | 0.08 | 0.05 | 0.11 | 0.00 | 0.30
Table 3. MMS, $P_k$, $P_a$, and $d_c$ for Example 3.

Method | MMS Median | MMS RSD | P1 | P2 | P3 | P4 | Pa | dc

Model (3), $\rho_2 = 0.5$
LCDCS | 6.00 | 8.21 | 0.97 | 0.89 | 0.91 | 0.96 | 0.78 | 3.73
TVCM-SIS | 1307.00 | 508.40 | 0.05 | 0.04 | 0.07 | 0.06 | 0.00 | 0.22
PRSIS | 1557.50 | 456.16 | 0.02 | 0.01 | 0.02 | 0.02 | 0.00 | 0.07
CDC-SIS | 9.00 | 59.51 | 0.87 | 0.76 | 0.75 | 0.86 | 0.59 | 3.24
CIS | 263.50 | 354.10 | 0.55 | 0.38 | 0.43 | 0.57 | 0.07 | 1.93

Model (3), $\rho_2 = 0.8$
LCDCS | 9.00 | 16.60 | 0.96 | 0.83 | 0.88 | 0.97 | 0.68 | 3.64
TVCM-SIS | 1479.00 | 419.40 | 0.02 | 0.02 | 0.02 | 0.03 | 0.00 | 0.10
PRSIS | 1579.50 | 403.92 | 0.02 | 0.01 | 0.02 | 0.01 | 0.00 | 0.05
CDC-SIS | 16.50 | 90.49 | 0.85 | 0.72 | 0.72 | 0.86 | 0.51 | 3.15
CIS | 439.50 | 481.34 | 0.38 | 0.27 | 0.30 | 0.37 | 0.01 | 1.32
Table 4. Number of overlapping 97%-identity OTUs among the top $d_0 = 10$ (above the diagonal) and the top $d_0 = 220$ (below the diagonal) selected via different methods; the ANOVA row refers to the top $d_0 = 220$ selections.

 | LDCS | LCDCS | TVCM-SIS | PRSIS | DC-SIS | CDC-SIS | CIS
LDCS | - | 5 | 0 | 6 | 5 | 7 | 0
LCDCS | 186 | - | 0 | 4 | 2 | 5 | 0
TVCM-SIS | 94 | 103 | - | 0 | 0 | 0 | 1
PRSIS | 114 | 97 | 70 | - | 8 | 8 | 0
DC-SIS | 115 | 100 | 69 | 181 | - | 6 | 0
CDC-SIS | 118 | 110 | 64 | 160 | 152 | - | 0
CIS | 35 | 38 | 62 | 52 | 47 | 35 | -
ANOVA * | 103 | 108 | 51 | 47 | 49 | 74 | 45

* Subramanian et al. [21] used an analysis of variance with LMMs to identify 97%-identity OTUs that were significantly different between SAM and healthy children.
Table 5. Number of 97%-identity OTUs selected and total heritability for each method.

Method | Number of 97%-Identity OTUs | Total Heritability
LDCS | 36 | 60.24%
LCDCS | 39 | 64.76%
TVCM-SIS | 39 | 49.56%
PRSIS | 40 | 52.77%
DC-SIS | 34 | 51.46%
CDC-SIS | 41 | 55.85%
CIS | 41 | 37.02%
Table 6. IDs and taxonomic annotations of the top 10 97%-identity OTUs selected via LCDCS based on heritability.

Order | ID | Taxonomic Annotation | Heritability
1 | New.0.CleanUp.ReferenceOTU89679 * | Collinsella | 9.29%
2 | 561636 | Streptococcus | 7.54%
3 | New.0.ReferenceOTU340 * | Bifidobacterium | 4.87%
4 | 470527 | Lactobacillus | 4.43%
5 | 326977 * | Bifidobacterium | 4.22%
6 | 471180 * | Bifidobacterium | 3.31%
7 | 469868 | Bifidobacterium | 2.76%
8 | 345575 | Lactococcus | 2.59%
9 | 316587 * | Streptococcus | 2.23%
10 | New.0.ReferenceOTU339 | NA † | 2.05%

* 97%-identity OTUs discovered in Subramanian et al. [21]. † The taxonomic annotation is not provided by Subramanian et al. [21].
