Model-Free Feature Screening Based on Data Aggregation for Ultra-High-Dimensional Longitudinal Data

Chen, Junfeng; Yang, Xiaoguang; Dai, Jing; Li, Yunming

doi:10.3390/stats8040099

Open AccessArticle

Model-Free Feature Screening Based on Data Aggregation for Ultra-High-Dimensional Longitudinal Data

by

Junfeng Chen

^1,2,

Xiaoguang Yang

²,

Jing Dai

³ and

Yunming Li

^1,2,*

¹

School of Mathematics, Southwest Jiaotong University, Chengdu 611756, China

²

Office of Medical Information and Data, Medical Support Center, The General Hospital of Western Theater Command, PLA, Chengdu 610083, China

³

Department of Information, Medical Support Center, The General Hospital of Western Theater Command, PLA, Chengdu 610083, China

^*

Author to whom correspondence should be addressed.

Stats 2025, 8(4), 99; https://doi.org/10.3390/stats8040099

Submission received: 3 September 2025 / Revised: 6 October 2025 / Accepted: 14 October 2025 / Published: 16 October 2025

Download

Browse Figures

Versions Notes

Abstract

Ultra-high dimensional longitudinal data feature screening procedures are widely studied, but most require model assumptions. The screening performance of these methods may not be excellent if we specify an incorrect model. To resolve the above problem, a new model-free method is introduced where feature screening is performed by sample splitting and data aggregation. Distance correlation is used to measure the association at each time point separately, while longitudinal correlation is modeled by a specific cumulative distribution function to achieve efficiency. In addition, we extend this new method to handle situations where the predictors are correlated. Both methods possess excellent asymptotic properties and are capable of handling longitudinal data with unequal numbers of repeated measurements and unequal intervals between repeated measurement time points. Compared to other model-free methods, the two new methods are relatively insensitive to within-subject correlation, and they can help reduce the computational burden when applied to longitudinal data. Finally, we use some simulated and empirical examples to show that both new methods have better screening performance.

Keywords:

model-free; longitudinal data; ultra-high-dimensional; feature screening; data aggregation

1. Introduction

Longitudinal data, characterized by repeated measurements of the same subjects over some time, are common in many scientific fields, including public health, psychology, and sociology. Traditional methods, such as linear mixed models (LMMs), are widely used to analyze longitudinal supervised problems. However, as the number of predictors, p, grows, these methods are limited by “the curse of dimensionality”. Even if these problems have a seemingly large sample size, n, it may actually be far smaller than p. This is more commonly called the ultra-high-dimensional setting, defined by Fan and Lv [1] as

log p = O (n^{α})

for some

α \in (0, 1 / 2)

. On the other hand, technological developments have made it possible to obtain high-throughput data through repeated measurements of subjects. Therefore, it is essential to develop some new methods capable of handling the above data.

Since only a few predictors are associated with the response in the ultra-high-dimensional setting (i.e., sparsity), feature screening is a common method to deal with these problems. For early work on ultra-high-dimensional longitudinal data, sure independence screening for varying-coefficient models was firstly proposed [2], while subsequent studies developed further, including partially linear models [3], additive models [4], and rank regression models [5]. However, these methods rely on parametric or semi-parametric models, which may be insufficient for accurate model specification in the big data era. To address this, we agree with Zhu et al. [6] that model-free methods are more appealing, as they rely only on a general model framework.

Currently, research on model-free methods for cross-sectional data has made great progress. The primary approach of these methods is to measure the marginal dependence, such as distance correlation (DC) [7], martingale difference correlation [8], conditional distance correlation (CDC) [9], and covariate information number [10]. Recently, some studies have also applied these methods to survival data and other types of data by improving the methods originally designed for cross-sectional data [11,12,13]. However, up to now, studies applying model-free feature screening procedures to longitudinal data are still rare, with the main challenge being that repeated measurements between subjects are correlated.

Since Székely et al. [14] proposed DC and Wang et al. [15] proposed CDC, the two methods have been widely applied in feature screening, e.g., the studies by Li et al. [7] and Chen [16] are based on DC, while the studies by Wen et al. [9] and Lu and Lin [17] are based on CDC. Consequently, the aim of this study is to improve DC and CDC for better performance in longitudinal data. Recently, the sample splitting and data aggregation methods are frequently used in feature screening, such as Zhong et al. [12] and Dai et al. [18]. These studies provide us with an idea of handling ultra-high-dimensional longitudinal data.

We first introduce a new method called longitudinal distance correlation sure independence screening (LDCS) in this study. The new method uses DC to measure marginal dependence at each time point separately and then aggregates the results using a cumulative distribution function (CDF) used in Zhong et al. [12]. We also make a simple extension of LDCS to handle situations where predictors are correlated. The method is based on CDC, so we call it longitudinal conditional distance correlation sure independence screening (LCDCS). The two new methods have some distinct advantages. For example, compared to existing longitudinal data feature screening procedures, the two methods are model-free, which enables them to perform better in some complex situations. Additionally, compared to existing model-free feature screening procedures, LDCS and LCDCS use the sample-splitting and data aggregation methods, making them relatively insensitive to within-subject correlation. Moreover, the computational burden is lower than that of methods that directly apply dependence measures to longitudinal data.

In Section 2, we propose our methods and provide their asymptotic properties. Section 3 presents several simulated and empirical examples to compare the performance of our methods with that of some other feature screening procedures. Section 4 provides a brief discussion of the article. Appendix A contains the technical proofs of all theorems, and Appendix B presents a simple sensitivity analysis.

2. Materials and Methods

2.1. Model Setup

We consider the following longitudinal data,

{(y_{i m}, X_{i m}, Z_{i m}), 1 \leq i \leq n, 1 \leq m \leq M_{i}}

denotes the measurement at the mth measurement time point

t_{m}

for the ith subject, where

y_{i m} \in R^{q}

is the response vector correlated with the predictor vector

X_{i m} \in R^{p}

and the conditional predictor vector

Z_{i m} \in R^{r}

. Here, n represents the number of subjects, and

M_{i}

denotes the repeated measurements for the ith subject. We allow the measurement intervals at different time points to be unequal and repeated measurements for each subject to vary. Furthermore, we define the total measurements as

N = \sum_{i = 1}^{n} M_{i}

and the maximum of

M_{i}

as M.

2.2. Marginal Screening Procedure

We assume that n is much less than p, and only a few predictors are associated with

y

. Let

F (y | X)

denote the conditional distribution function of

y

, given

X

. The index set of important predictors can be defined as

D = {k : F (y | X) functionally depends on X_{k}, k = 1, \dots, p},

without assuming a specific regression model. We define the number of important predictors as

L_{0} = | D |

, the cardinality of D, and denote

D^{c}

as the index set of unimportant predictors. Similar to most screening procedures, our screening goal is to estimate D conservatively so that all important predictors can be retained.

DC was proposed to quantify the association between two random vectors [14]. For details, they defined DC between

y

and

X

as the square root of

d C o r^{2} (X, y) = \frac{d C o v^{2} (X, y)}{\sqrt{d C o v^{2} (X, X) d C o v^{2} (y, y)}},

where distance covariance is defined as the square root of

d C o v^{2} (X, y) = \int_{R^{p + q}} {∥ ϕ_{X, y} (s, t) - ϕ_{X} (s) ϕ_{y} (t) ∥}^{2} p (s, t) d s d t,

where

p (s, t)

is a known function, and

ϕ (\cdot)

denotes the characteristic function.

DC is an excellent dependence measure capable of detecting the relationship between predictors and response for cross-sectional data. However, the screening performance of DC in longitudinal data is not as satisfactory, as can be seen in the simulation study. In this article, we consider a sample splitting and data aggregation method used by Zhong et al. [12] to improve the application of DC in longitudinal data feature screening.

First, we assume that

t_{1} < t_{2} < \dots < t_{M}

. We split the longitudinal data into M parts based on the difference of

t_{m}

, and we denote DC between

X_{k m} (k = 1, \dots, p)

and

y_{m}

at the mth time point

t_{m}

by

D C_{k} (t_{m}) = d C o r^{2} (X_{k m}, y_{m})

. We construct the following aggregated statistic:

ω_{k} = \int_{u \in R} D C_{k} (u) d F (u),

where

F (u)

is a specific CDF. For instance, taking

F (\cdot)

as the CDF of the true time variable T, and then the aggregated statistic becomes

ω_{k} = \int_{t \in R} D C_{k} (t) d F (t) = E (D C_{k} (T)),

which is the mean of

D C_{k} (T)

. In this article, we specifically define

F (\cdot)

as the CDF given by

p_{m} = \Pr (U = t_{m}) = \frac{t_{m} - t_{m - 1}}{t_{M} - t_{1}} = \frac{Δ t_{m}}{t_{M} - t_{1}}, m = 1, 2, \dots, M,

where

Δ t_{1} = 0

and

Δ t_{m} = t_{m} - t_{m - 1}

for

m = 2, \dots, M

. Using this specific CDF

F (u)

, for each predictor

X_{k} (k = 1, \dots, p)

, we define

ω_{k} = \sum_{m = 1}^{M} D C_{k} (t_{m}) \frac{Δ t_{m}}{t_{M} - t_{1}}

(1)

as our dependence measure at the population level. The aggregated statistic

ω_{k}

defined in (1) can be naturally interpreted as the area under curve (AUC) after the normalization of

D C_{k} (\cdot)

, and up to the maximum of

{D C_{k} (t_{m}), m = 1, 2, \dots, M}

. Figure 1 provides a geometric view of the statistic

ω_{k}

. A larger

ω_{k}

means a stronger association between

X_{k}

and

y

.

Next, we estimate

ω_{k}

in (1). Following Székely et al. [14], given a random sample defined in Section 2.1, we can obtain the estimator of

d C o v^{2} (X_{k m}, y_{m})

through moment estimation

{\hat{d C o v}}^{2} (X_{k m}, y_{m}) = {\hat{S}}_{1} (X_{k m}, y_{m}) + {\hat{S}}_{2} (X_{k m}, y_{m}) - 2 {\hat{S}}_{3} (X_{k m}, y_{m}),

where

{\hat{S}}_{1} (X_{k m}, y_{m}) = \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} ∥ X_{k i m} - X_{k j m} ∥_{1} {∥ y_{i m} - y_{j m} ∥}_{q}

,

{\hat{S}}_{2} (X_{k m}, y_{m}) = \frac{1}{n^{4}} (\sum_{i = 1}^{n} \sum_{j = 1}^{n} ∥ X_{k i m} - X_{k j m} ∥_{1}) (\sum_{l = 1}^{n} \sum_{f = 1}^{n} ∥ y_{l m} - y_{f m} ∥_{q})

,

{\hat{S}}_{3} (X_{k m}, y_{m}) = \frac{1}{n^{3}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} \sum_{l = 1}^{n} ∥ X_{k i m} - X_{k l m} ∥_{1} {∥ y_{j m} - y_{l m} ∥}_{q}

. Similarly,

{\hat{d C o v}}^{2} (X_{k m}, X_{k m})

and

{\hat{d C o v}}^{2} (y_{m}, y_{m})

can be defined in the same way. Then, we can get the aggregated estimate of

ω_{k}

by

{\hat{ω}}_{k} = \sum_{m = 1}^{M} {\hat{D C}}_{k} (t_{m}) \frac{Δ t_{m}}{t_{M} - t_{1}} = \sum_{m = 1}^{M} \frac{{\hat{d C o v}}^{2} (X_{k m}, y_{m})}{\sqrt{{\hat{d C o v}}^{2} (X_{k m}, X_{k m}) {\hat{d C o v}}^{2} (y_{m}, y_{m})}} \frac{Δ t_{m}}{t_{M} - t_{1}} .

(2)

We refer to the new longitudinal data feature screening procedure based on

{\hat{ω}}_{k}

as LDCS, which is provided in Algorithm 1. Since we use the sample splitting and data aggregation method, the computational complexity of LDCS can be reduced from

O (n^{2} M^{2})

to

O (n^{2} M)

. Finally, we sort

{\hat{ω}}_{k}

in decreasing order and denote the estimated important predictors set as

\hat{D} = {1 \leq k \leq p : {\hat{ω}}_{k} is among the top d_{0}}

by specifying a positive integer,

d_{0}

. Similarly, we can choose

d_{0} = a [n / log (n)]

with the positive integer

a = 1, 2, 3

or

d_{0} = n - 1

[1].

Algorithm 1 Longitudinal distance correlation screening (LDCS).

Input:: source data ${X_{i m}, y_{i m}, t_{m}}_{m = 1}^{M}$ .
Output:: $\hat{D} = {1 \leq k \leq p : {\hat{ω}}_{k} is among the top d_{0}}$ .
1:: Split the sample into M parts based on the difference of $t_{m}$ .
2:: for $m = 1, \dots, M$ do
3:: for $k = 1, \dots, p$ do
4:: Estimate ${\hat{S}}_{1} (X_{k m}, y_{m})$ , ${\hat{S}}_{2} (X_{k m}, y_{m})$ , ${\hat{S}}_{3} (X_{k m}, y_{m})$ by moment estimation.
5:: Estimate distance covariance: ${\hat{d C o v}}^{2} (X_{k m}, y_{m})$ , ${\hat{d C o v}}^{2} (X_{k m}, X_{k m})$ , and ${\hat{d C o v}}^{2} (y_{m}, y_{m})$ .
6:: Estimate distance correlation: ${\hat{D C}}_{k} (t_{m}) = \frac{{\hat{d C o v}}^{2} (X_{k m}, y_{m})}{\sqrt{{\hat{d C o v}}^{2} (X_{k m}, X_{k m}) {\hat{d C o v}}^{2} (y_{m}, y_{m})}}$ .
7:: end for
8:: end for
9:: for $k = 1, \dots, p$ do
Estimate the longitudinal distance correlation: ${\hat{ω}}_{k} = \sum_{m = 1}^{M} {\hat{D C}}_{k} (t_{m}) \frac{Δ t_{m}}{t_{M} - t_{1}}$ .
10:: end for
11:: Sort ${{\hat{ω}}_{k}, 1 \leq k \leq p}$ in descending order.
12:: Obtain the estimated important predictors set $\hat{D}$ .

Next, we analyze the properties of LDCS. We should make the following distribution assumption.

(C.1):: Both q-dimensional vector $y$ and p-dimensional vector $X$ have a sub-exponential tail probability uniformly in p, meaning that there exists constant $b_{0} > 0$ , such that, for all $0 < b \leq b_{0}$ ,

$sup_{t} {E {exp (2 b ∥ y ∥}_{q}^{2} | t} < \infty, and sup_{t} max_{1 \leq k \leq p} E {exp (2 b ∥ X_{k} ∥_{1}^{2} | t} < \infty .$

Remark 1.

Assumption (C.1) is frequently utilized in technical proofs for ultra-high-dimensional problems, such as those in Liu [3] and Li et al. [7], as it concerns the moments of the response and predictors. Any bounded or normal distribution would satisfy this assumption.

We present the asymptotic properties of LDCS as follows.

Theorem 1

(Sure screening property). Under Assumption (C.1), for any

0 \leq τ < 1 / 2

and

0 < κ < \frac{1}{2} - τ

, there exist constants

c_{1} > 0

,

c_{2} > 0

, and

c_{3} > 0

, such that

Pr \{max_{1 \leq k \leq p} | {\hat{ω}}_{k} - ω_{k} | \geq c_{1} n^{- τ}\} \leq O \{p [exp (- c_{2} n^{1 - 2 (τ + κ)}) + n exp (- c_{3} n^{κ})]\} .

(3)

Furthermore, with

{min}_{k \in D} ω_{k} \geq 2 c_{1} n^{- τ}

being supposed, then

Pr \{D \subseteq \hat{D}\} \geq 1 - O \{L_{0} [exp (- c_{2} n^{1 - 2 (τ + κ)}) + n exp (- c_{3} n^{κ})]\},

(4)

where

L_{0}

is the cardinality of D.

Theorem 1 shows that LDCS can effectively identify important predictors for the response. In addition, if we select the optimal order

κ = (1 - 2 τ) / 3

, then Equation (3) becomes

Pr \{max_{1 \leq k \leq p} |{\hat{ω}}_{k} - ω_{k}| \geq c_{1} n^{- τ}\} \leq O \{p exp (- c_{2} n^{(1 - 2 τ) / 3})\} .

Theorem 2

(Rank consistency property). Under Assumption (C.1), supposing

{min}_{k \in D} ω_{k} - {max}_{k \in D^{c}} ω_{k} \geq 2 c_{4} n^{- τ}

for constant

c_{4} > 0

and

0 \leq τ < 1 / 2

,

0 < κ < \frac{1}{2} - τ

, there exist constants

c_{5} > 0

and

c_{6} > 0

, such that

Pr \{min_{k \in D} {\hat{ω}}_{k} - max_{k \in D^{c}} {\hat{ω}}_{k} > 0\} > 1 - O \{p [exp (- c_{5} n^{1 - 2 (τ + κ)}) + n exp (- c_{6} n^{κ})]\} .

(5)

Furthermore, as discussed in Theorem 1, if

log p = o \{n^{(1 - 2 τ) / 3}\}

, then

{lim inf}_{n \to \infty} (min_{k \in D} {\hat{ω}}_{k} - max_{k \in D^{c}} {\hat{ω}}_{k}) > 0, almost surely .

Appendix A gives the technical proofs of Theorems 1 and 2.

Remark 2.

Theorem 1 is essential for all ultra-high-dimensional data feature screening procedures, as it guarantees that

\hat{D}

contains all predictors in D with high probability. Theorem 2 provides a stronger theoretical result, as important predictors are more likely to be ranked higher than unimportant ones. In other words, if we assume

δ = c_{4} n^{- τ}

, then with a high probability, there exists a δ that can effectively separate the important and unimportant predictors.

2.3. Conditional Screening Procedure

When there exists a high correlation between predictors, LDCS may incorrectly exclude some truly informative predictors. Wang et al. [15] applied the idea of conditional screening to deal with the problem and obtained excellent theoretical properties. Therefore, we applied this method to deal with the above problem. We denote

A = {k : F (y | X, Z) functionally depends on X_{k}, k = 1, \dots, p} .

where

F (y | X, Z)

denotes the conditional distribution function of

y

, given

X

and

Z

.

Wang et al. [15] proposed CDC based on Székely et al. [14] to deal with the issue of DC, which may make serious errors when predictors are highly correlated. For details, given the conditional predictor vector

Z

, they defined CDC between

y

and

X

as the square root of

c d C o r^{2} (X, y | Z) = \frac{c d C o v^{2} (X, y | Z)}{\sqrt{c d C o v^{2} (X, X | Z) c d C o v^{2} (y, y | Z)}},

where conditional distance covariance is defined as the square root of

c d C o v^{2} (X, y | Z) = \int_{R^{p + q}} {∥ ϕ_{X, y | Z} (s, t) - ϕ_{X | Z} (s) ϕ_{y | Z} (t) ∥}^{2} p (s, t) d s d t .

Similar to Section 2.2, we split the longitudinal data into s parts based on the difference of

t_{m}

, and denote the conditional distance correlation between

X_{k m}

and

y_{m}

, given

Z_{m}

at the mth time point

t_{m}

, by

C D C_{k} (t_{m}) = c d C o r^{2} (X_{k m}, y_{m} | Z_{m})

. We define

ρ_{k} = \sum_{m = 1}^{M} C D C_{k} (t_{m}) \frac{Δ t_{m}}{t_{M} - t_{1}}

(6)

as our dependence measure given some important predictors

Z

. We denote

d_{i j, k}^{X} (t_{m}) = d (X_{k i m}, X_{k j m})

as the Euclidean distance between

X_{k i m}

and

X_{k j m}

, where

i, j = 1, . . ., n

at the mth measurement. Similarly,

d_{i j, k}^{y} (t_{m})

can be denoted in the same way. Then, the distance function can be defined as

d_{i j l f, k}^{'} (t_{m}) = \frac{d_{i j l f, k} (t_{m}) + d_{i j f l, k} (t_{m}) + d_{i f l j, k} (t_{m})}{3},

where

d_{i j l f, k} (t_{m}) = (d_{i j, k}^{X} (t_{m}) + d_{l f, k}^{X} (t_{m}) - d_{i l, k}^{X} (t_{m}) - d_{j f, k}^{X} (t_{m})) (d_{i j, k}^{y} (t_{m}) + d_{l f, k}^{y} (t_{m}) - d_{i l, k}^{y} (t_{m}) - d_{j f, k}^{y} (t_{m}))

. Note that

d_{i j f l, k} (t_{m})

and

d_{i f l j, k} (t_{m})

are defined similarly. Let

ω_{i m} (Z_{m}) = K ((Z_{i m} - Z_{m}) / H)

and

ω_{m} (Z_{m}) = \sum_{i}^{n} ω_{i m} (Z_{m})

, where H is a diagonal matrix determined by bandwidth h, selected using the plug-in method.

K (\cdot)

is a kernel function, which is set as the Gaussian kernel function in this paper, i.e.,

K_{H} (Z) = {| H |}^{- 1} K (H^{- 1} Z) = {(2 π)}^{- r / 2} {| H |}^{- 1} exp (- \frac{1}{2} Z^{'} H^{- 2} Z) .

Given

Z_{m}

, the sample conditional distance covariance

{\hat{c d C o v}}^{2} (X_{k m}, y_{m} | Z_{m}) = \frac{1}{n^{4}} \sum_{i, j, l, f} ψ_{n} (W_{i m}, W_{j m}, W_{l m}, W_{f m}; Z_{m})),

where

ψ_{n} (W_{i m}, W_{j m}, W_{l m}, W_{f m}; Z_{m}) = \frac{n^{4} ω_{i m} (Z_{m}) ω_{j m} (Z_{m}) ω_{l m} (Z_{m}) ω_{f m} (Z_{m})}{4 ω^{4} (Z_{m})} d_{i j l f, k}^{'} (t_{m}) .

The sample conditional distance covariances

{\hat{c d C o v}}^{2} (X_{k m}, X_{k m} | Z_{m})

and

{\hat{c d C o v}}^{2} (y_{m}, y_{m} | Z_{m})

can be estimated similarly. Thus, the plug-in estimate of

C D C_{k} (t_{m})

is

{\hat{C D C}}_{k}^{*} (t_{m}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{{\hat{c d C o v}}^{2} (X_{k i m}, y_{i m} | Z_{i m})}{\sqrt{{\hat{c d C o v}}^{2} (X_{k i m}, X_{k i m} | Z_{i m}) {\hat{c d C o v}}^{2} (y_{i m}, y_{i m} | Z_{i m})}},

and the aggregated estimate of

ρ_{k}

is

{\hat{ρ}}_{k}^{*} = \sum_{m = 1}^{M} {\hat{C D C}}_{k}^{*} (t_{m}) \frac{Δ t_{m}}{t_{M} - t_{1}} .

(7)

Based on

{\hat{ρ}}_{k}^{*}

, we propose a conditional screening procedure, specifically designed to handle highly correlated predictors:

\hat{A} = {1 \leq k \leq p : {\hat{ρ}}_{k}^{*} is among the top d_{0}} .

We refer to the new method as LCDCS. Similar to LDCS, the computational complexity of LCDCS can be reduced from

O (n^{3} M^{3})

to

O (n^{3} M)

.

Before proving the asymptotic properties of LCDCS, we list the other technical assumptions:

(C.2):: The kernel function $K (\cdot)$ is uniformly bounded and satisfies the following conditions: $K (u) \geq 0$ , $\int K (u) d u = 1$ , $\int u K (u) d u = 0$ , and ${\int ∥ u ∥}^{2} K (u) d u < \infty$ .
(C.3):: If $Z_{1}, Z_{2}, Z_{3}, Z_{4}$ are independent copies of Z, and then, for $1 \leq k \leq p$ , there exists constant $C > 0$ , such that

$sup_{k} | E {d_{1234, k} | Z_{1}, Z_{2}, Z_{3}, Z_{4}} - E {d_{1234, k} | Z_{1}^{'}, Z_{2}, Z_{3}, Z_{4}} | \leq C | Z_{1} - Z_{1}^{'} | .$

We present the asymptotic properties of LCDCS as follows.

Theorem 3

(Sure screening property). Under Assumption (C.1)–(C.3), if the bandwidth for kernel estimation of

Z

satisfies

h = O {n^{- \tilde{τ} / (2 r)}}

, where r is the dimension of

Z

, then for any

0 \leq \tilde{τ} < 1 / 2

and

0 < \tilde{κ} < 1 / 2 - \tilde{τ}

, there exist constants

c_{7} > 0

,

c_{8} > 0

and

c_{9} > 0

, such that

Pr \{max_{1 \leq k \leq p} | {\hat{ρ}}_{k}^{*} - ρ_{k} | \geq c_{7} n^{- \tilde{τ}}\} \leq O \{p [exp (- c_{8} n^{1 - 2 (\tilde{τ} + \tilde{κ})}) + n^{4} exp (- c_{9} n^{\tilde{κ}})]\} .

(8)

Furthermore, supposing

{min}_{k \in A} ρ_{k} \geq 2 c_{7} n^{- \tilde{τ}}

, then

Pr \{A \subseteq \hat{A}\} \geq 1 - O \{L_{1} [exp (- c_{8} n^{1 - 2 (\tilde{τ} + \tilde{κ})}) + n^{4} exp (- c_{9} n^{\tilde{κ}})]\},

(9)

where

L_{1}

is the cardinality of

A

. If we set

\tilde{κ} = (1 - 2 \tilde{τ}) / 3

, then the first part of Theorem 3 becomes

Pr \{max_{1 \leq k \leq p} |{\hat{ρ}}_{k}^{*} - ρ_{k}| \geq c_{7} n^{- \tilde{τ}}\} \leq O \{p exp (- c_{8} n^{(1 - 2 \tilde{τ}) / 3})\} .

Theorem 4

(Rank consistency property). Under Assumption (C.1)–(C.3), supposing

{min}_{k \in D} ρ_{k} - {max}_{k \in D^{c}} ρ_{k} \geq 2 c_{10} n^{- \tilde{τ}}

for constant

c_{10} > 0

and

0 \leq \tilde{τ} < 1 / 2

,

0 < \tilde{κ} < \frac{1}{2} - \tilde{τ}

, there exist constants

c_{11} > 0

and

c_{12} > 0

, such that

Pr \{min_{k \in D} {\hat{ρ}}_{k}^{*} - max_{k \in D^{c}} {\hat{ρ}}_{k}^{*} > 0\} > 1 - O \{p [exp (- c_{11} n^{1 - 2 (\tilde{τ} + \tilde{κ})}) + n^{4} exp (- c_{12} n^{\tilde{κ}})]\} .

(10)

Furthermore, as discussed in Theorem 3, if

log p = o \{n^{(1 - 2 \tilde{τ}) / 3}\}

, then

{lim inf}_{n \to \infty} (min_{k \in A} {\hat{ρ}}_{k}^{*} - max_{k \in A^{c}} {\hat{ρ}}_{k}^{*}) > 0, almost surely .

Appendix A gives the technical proofs of Theorems 3 and 4.

3. Results

3.1. Simulation Study

We conduct several simulated examples to evaluate the screening performance of LDCS and LCDCS, and compare the results with those of some existing methods, including partial residual sure independence screening (PRSIS) [3], time-varying coefficient models’ sure independence screening (TVCM-SIS) [19], distance correlation sure independence screening (DC-SIS) [7], conditional distance correlation sure independence screening (CDC-SIS) [9], and covariate information number sure independence screening (CIS) [10]. PRSIS and TVCM-SIS are longitudinal data feature screening procedures where PRSIS is based on partially linear models and TVCM-SIS is based on time-varying coefficient models. DC-SIS, CDC-SIS, and CIS are all model-free methods for cross-sectional data.

We repeat each simulation example 500 times. In all examples, the number of predictors is

p = 2000

, and the sample size is

n = 80

, yielding a submodel size of

d_{0} = [n / log (n)] = 18

. The predictor

X_{k i} (k = 1, \dots, p)

and the random error

ϵ_{i}

are generated independently from

MVN (0, σ_{0}^{2} Σ)

, where

σ_{0}

and

Σ

are different in each example. The time points

t_{1}, . . ., t_{M}

are sampled from a standard uniform distribution and do not change in a simulation. We use the following four criteria to assess the screening performance:

$M M S$ : The minimum submodel size containing all important predictors. The median and robust standard deviation (RSD = IQR/1.34) of $M M S$ are reported, where IQR denotes interquartile range.
$P_{k}$ : The proportion of the important predictor $X_{k}$ selected in the submodel.
$P_{a}$ : The proportion of all important predictors selected in the submodel.
$d_{c}$ : The average number of important predictors selected in the submodel.

3.1.1. Example 1: Partially Linear Models

We generate the response from the following two models:

Model (1.a):

y_{i m} = β_{1} X_{1 i m} + β_{2} X_{2 i m} + β_{3} X_{3 i m} + β_{4} I (X_{4 i m} > 0) + f (t_{m}) + ϵ_{i m},

Model (1.b):

y_{i m} = β_{1} β_{2} X_{1 i m} X_{2 i m} + β_{3} X_{3 i m} + β_{4} X_{4 i m} + f (t_{m}) + ϵ_{i m} .

Model (1.a) considers categorical predictor variables, while Model (1.b) incorporates interaction effects. These two models are based on Model (1.a) and Model (1.b) in Li et al. [7], but the difference is that we introduced longitudinal measurements and nonlinear time variables. We set

{(β_{1}, β_{2}, β_{3}, β_{4})}^{T} = {(- 1, - 0.75, 1, 4)}^{T}

,

f (t) = 36 sin (4 π t)

and

σ_{0} = 2

in Model (1.a), and

{(β_{1}, β_{2}, β_{3}, β_{4})}^{T} = {(- 2.25, - 2, - 1, 1)}^{T}

,

f (t) = 4 sin (2 π t)

, and

σ_{0} = 1

in Model (1.b). Furthermore, we set the repeated measurements

M_{i} \equiv 5

and the within-subject correlation structure as first-order autoregressive (AR(1)) structure. The within-subject correlations are set to

ρ = 0.3

and

0.6

for each model, respectively.

The detailed results are reported in Table 1. For the four different situations, our proposed method LDCS achieves the smallest MMS (including median and RSD), along with the largest

P_{a}

and

d_{c}

. When a strong nonlinear time effect is present, the model-free feature screening procedures DC-SIS and CIS perform less effectively. Furthermore, we observe that TVCM-SIS and PRSIS are not as effective in identifying interaction effects. While CIS relatively successfully identifies interaction effects, it tends not to retain important predictors with and without interaction effects simultaneously. Example 1 demonstrates that our proposed method LDCS is unaffected by strong time effects and performs better in handling situations with interaction effects in partially linear models.

3.1.2. Example 2: Time-Varying Coefficient Models

In this example, we generate the response from the following two models:

Model (2.a):

y_{i m} = β_{1} (t_{m}) X_{1 i m} + β_{2} (t_{m}) X_{2 i m} + β_{3} (t_{m}) X_{3 i m} + β_{4} (t_{m}) X_{4 i m} + ϵ_{i m},

Model (2.b):

log (μ_{i} (t_{m})) = β_{1} (t_{m}) X_{1 i m} + β_{2} (t_{m}) X_{2 i m} + β_{3} (t_{m}) X_{3 i m} + β_{4} (t_{m}) X_{4 i m} .

Model (2.a) is similar to Example I in Chu et al. [19], but we set a smaller sample size and measurements. Model (2.b) considers a count response with Poisson distribution, which is based on Example II in Chu et al. [20], but we use time-varying coefficient model, rather than the generalized varying coefficient mixed-effect model. In Model (2.a), we set

β_{1} (t) = - 4 cos (4 π t) I (t < 0.5)

,

β_{2} (t) = - 4.25 sin (4 π t) I (t < 0.5)

,

β_{3} (t) = sin (4 π t)

,

β_{4} (t) = 1.25 (1.2 - t)

and

σ_{0} = 2

. In Model (2.b), we set

β_{1} (t) = - sin (2 π t)

,

β_{2} (t) = - 1.25 sin (2 π t)

,

β_{3} (t) = sin (2 π t)

,

β_{4} (t) = 1.25 sin (2 π t)

, and

σ_{0} = 1

. Furthermore, we set the within-subject correlation structure as a compound symmetry (CS) structure. The within-subject correlations are set to

ρ = 0.3

and

0.6

for each model, respectively.

Table 2 presents the detailed simulation results for Example 2. It can be seen that the results of Example 2 are similar to those of Example 1, i.e., our proposed method LDCS achieves the smallest MMS (including median and RSD), along with the largest

P_{a}

and

d_{c}

. Specifically, for Model (2.a), although TVCM-SIS is based on the assumption of varying coefficient models, its performance is not ideal when the categorical predictors have a stronger impact on the response. For Model (2.b), the screening performance of TVCM-SIS, PRSIS, DC-SIS, and CIS is all unsatisfactory, indicating that these methods are not well suited for varying coefficient models with count responses. These examples further demonstrate that, for some special varying coefficient models, our proposed method LDCS may achieve better screening performance.

3.1.3. Example 3: Partially Linear Single-Index Models

We assess the screening performance of LCDCS when the predictors are correlated with each other. The response is generated from Model (3):

Model (3):

y_{i m} = {(β_{1} X_{1 i m} + β_{2} X_{2 i m} + β_{3} X_{3 i m} + β_{4} X_{4 i m})}^{2} + f (t_{m}) + ϵ_{i m},

We set

{(β_{1}, β_{2}, β_{3}, β_{4})}^{T} = {(3, 2.5, 2.5, 3)}^{T}

,

f (t) = 36 t / (1 - t)

,

σ_{0} = 1

, the repeated measurements

M_{i} \equiv 3

and the within-subject correlation structure as AR(1) structure with within-subject correlation as

ρ_{1} = 0.5

. The correlation structure between predictors is set as CS structure, with two scenarios considered:

ρ_{2} = 0.5

and

ρ_{2} = 0.8

.

We first use LDCS and DC-SIS to screen the most relevant predictor as a conditional predictor for LCDCS and CDC-SIS, respectively. Table 3 presents the detailed simulation results for Example 3. It can be seen that our proposed method, LCDCS, also achieves the smallest MMS (including median and RSD), along with the largest

P_{a}

and

d_{c}

. Since the number of repeated measurements is only 3, the improvement in screening performance of LCDCS over CDC-SIS is not particularly obvious. Additionally, the screening performance of TVCM-SIS and PRSIS is very poor, indicating that these methods, which rely on specific model assumptions, may not perform well in some complex model structures. CIS is also not well suited when the predictors are correlated. This example demonstrates that our proposed method, LCDCS, is more suitable for longitudinal data with correlated predictors in some special model structures.

3.2. Application to Gut Microbiota Data

We analyze real-world data to demonstrate the empirical performance of our methods in this section. This is a gut microbiota data from Bangladeshi children, as reported by Subramanian et al. [21]. The longitudinal cohort study aimed at analyzing the effects of therapeutic foods on children. For details, they monitored the growth and development of

n = 50

infants over a two-year period after birth, ultimately collecting a total of

N = 996

fecal samples. Each sample contained 79,597 operational taxonomic units sharing at least

97 %

nucleotide sequence identity (97%-identity OTUs). Following Subramanian et al. [21], we obtain 1222 97%-identity OTUs that have at least two fecal samples with an average relative abundance greater than 0.1%. In this article, we consider height-for-age Z-scores (HAZ) as the response, which reflects how a child’s height compares to the average height of the same age and gender group, indicating whether growth and development are within the normal range. The HAZ of each child was measured between 6 and 22 times, as illustrated in Figure 2. As a further preprocessing step, we retain only the months in which at least 24 children had their HAZ measured. Thus, we eventually work with a dataset comprising 1222 97%-identity OTUs (the predictors) and HAZ (the response), measured on

n = 50

children over 13 months, for a total of 433 measurements. On this dataset, we apply our LDCS and LCDCS, along with TVCM-SIS, PRSIS, DC-SIS, CDC-SIS, and CIS for feature screening, and we compare the results with those of Subramanian et al. [21].

First, we compare the top

d_{0} = 10

97%-identity OTUs selected by each method, as summarized in Table 4. Our methods, LDCS and LCDCS, demonstrate a moderate overlap in the number of selected 97%-identity OTUs with PRSIS, DC-SIS, and CDC-SIS, while the selection results of TVCM-SIS and CIS differ significantly from those of the other methods. Next, we compare the top

d_{0} = 220

97%-identity OTUs, as Subramanian et al. [21] identified 220 gut microbiota that were significantly different between severe acute malnutrition (SAM) and healthy children. It is obvious that LDCS and LCDCS have the maximum number of 97%-identity OTUs, which align with those of Subramanian et al. [21] among all screening procedures.

Furthermore, we evaluate the screening performance through regression analysis. Following Chu et al. [19], we establish time-varying coefficient models with different numbers of 97%-identity OTUs and use the total heritability commonly used in genetic analysis to assess the goodness of fit. The total heritability of all 97%-identity OTUs is calculated through

H (HAZ) = \frac{RSS (HAZ | Fetal)}{RSS (HAZ | Fetal)} - \frac{RSS (HAZ | Fetal, {OTU}_{1}, \dots, {OTU}_{p})}{RSS (HAZ | Fetal)} .

Here, fetal means whether the child is a singleton birth and RSS is the residual sum of squares defined as

R S S = \sum_{i = 1}^{n} \sum_{m = 1}^{M_{i}} {(y_{i} (t_{i m}) - {\hat{y}}_{i} (t_{i m}))}^{2},

where

y_{i} (t_{i m})

denotes the actual value, and

{\hat{y}}_{i} (t_{i m})

denotes the predicted value from the time-varying coefficient models.

We calculate the total heritability of the top

d_{0} = 1

to 49 97%-identity OTUs and remove more irrelevant 97%-identity OTUs using forward regression to achieve better regression fitting results. Our proposed method, LCDCS, has the highest total heritability at 64.76%, followed by LDCS with 60.24%. The final number of 97%-identity OTUs selected and the total heritability for each method are shown in Table 5.

We also plot the curves of total heritability for the different screening procedures, as shown in Figure 3. We note that, overall, when the number of 97%-identity OTUs selected is between 4 and 23, the total heritability of LCDCS is higher than that of other screening procedures. Additionally, when the number of 97%-identity OTUs selected is between 24 and 33, the LDCS method shows a higher total heritability. Therefore, when the number of selected 97%-identity OTUs is moderate, we recommend using LCDCS, while for a larger number of selected 97%-identity OTUs, we recommend using LDCS.

We also calculate the heritability of each 97%-identity OTU, which depends on the sequence of selection in the forward regression. The heritability of the kth 97%-identity OTU is calculated through

\begin{matrix} H ({OTU}_{k}) = & \frac{R S S (HAZ | Fetal, {OTU}_{(1)}, \dots, {OTU}_{(k - 1)})}{R S S (HAZ | Fetal)} \\ - \frac{R S S (HAZ | Fetal, {OTU}_{(1)}, \dots, {OTU}_{(k - 1)}, {OTU}_{(k)})}{R S S (HAZ | Fetal)} . \end{matrix}

Table 6 presents the IDs and taxonomic annotations of the top 10 97%-identity OTUs based on the heritability, as selected via LCDCS. The taxonomic annotations are given by Subramanian et al. [21], representing the genus of the gut microbiota. Among the 10 gut microbiota, 4 gut microbiota belong to the Bifidobacterium genus. Bifidobacterium is a common group of probiotics found primarily in the human gut, particularly in infants, and it has been shown to be closely associated with children’s nutrient absorption and growth development. Furthermore, there are 5 97%-identity OTUs that differ from the results of Subramanian et al. [21], which may represent a new discovery. We also perform a sensitivity analysis to evaluate how the parameter

d_{0}

influences the empirical results, with detailed information provided in Appendix B.

4. Discussion

In this article, we improve DC by applying a sample splitting and data aggregation method to achieve better screening performance with longitudinal data. We also make a simple extension of LDCS to deal with the situation where the predictors are correlated. The two methods are capable of handling longitudinal data with unequal numbers of repeated measurements and unequal intervals between repeated measurement time points. Simulation studies indicate that, in some special situations, such as partially linear models with strong time effects or interaction effects, and varying coefficient models with count responses, our proposed method LDCS demonstrates better screening performance. Furthermore, in the situations where predictors are correlated, our proposed method, LCDCS, also achieves better screening performance for some complex structures. Finally, the results of the application show that LDCS and LCDCS achieve better outcomes at different selection scales.

Inevitably, this work encountered some limitations. For instance, we did not consider the treatment of missing data or time-varying confounding. Additionally, both LDCS and LCDCS rely on sub-exponential tail probability assumptions, which may not always hold in practice. The screening performance of LCDCS is sensitive to bandwidth selection, and due to the kernel estimation, its computational burden is higher than that of some other model-free methods.

We plan to deal with the selection of the threshold

d_{0}

in LDCS and LCDCS in a future study. Recently, Liu et al. [22] proposed handling this issue from the perspective of false discovery rate (FDR) control by using Model-X knockoff features. Chi et al. [23] applied Model-X knockoff features to the FDR control problem for time series data by using an e-value aggregation method. Therefore, similarly, we could also consider constructing Model-X knockoff features and corresponding knockoff statistics for each measurement time point in longitudinal data, and then aggregate these statistics by using the e-value aggregation method. We expect that this approach will help handle the selection problem of the threshold

d_{0}

in LDCS and LCDCS from the perspective of FDR control.

Author Contributions

Conceptualization, J.C.; Methodology, J.C.; Software, J.C.; Validation, J.C. and Y.L.; Formal analysis, J.C. and Y.L.; Investigation, X.Y. and Y.L.; Resources, X.Y.; Writing – original draft, J.C.; Writing – review & editing, X.Y., J.D. and Y.L.; Visualization, J.C. and J.D.; Supervision, X.Y., J.D. and Y.L.; Project administration, J.D. and Y.L.; Funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by Sichuan Province Administration of Traditional Chinese Medicine of China (grant numbers 25MSZX477 and 25MSZX495).

Data Availability Statement

The data presented in this study are openly available in [repository name, e.g., FigShare] at [doi], reference number [reference number]. [Subramanian Sathish] [10.1038/nature13421].

Conflicts of Interest

The authors state no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

$t_{m}$	mth measurement time point
$y$	Response vector
$X$	Predictor vector
$Z$	Conditional predictor vector
p, q, r	Dimensional of $X$ , $y$ , $Z$
n	Number of subjects
$M_{i}$	Repeated measurements for the i subject
N	Total measurements
M	Maximum of $M_{i}$
$F (\cdot)$	Cumulative distribution function
D, $A$	Index set of important predictors
$D^{c}$ , $A^{c}$	Index set of unimportant predictors
$L_{0}$ , $L_{1}$	Cardinality of D, $A$
$D C_{k} (t_{m})$	Distance correlation of $X_{k m}$ and $y_{m}$
$ω_{k}$ , $ρ_{k}$	Dependence measure at the population-level
${\hat{ω}}_{k}$ , ${\hat{ρ}}_{k}$	Dependence measure at the sample-level
$d_{0}$	Selection number threshold
$C D C_{k} (t_{m})$	Conditional distance correlation of $X_{k m}$ and $y_{m}$
$K (\cdot)$	Kernel function
h	Bandwidth
$M V N (\cdot)$	Multivariate normal distribution
$I (\cdot)$	Indicator function
$ρ_{1}$	Within-subject correlation
$ρ_{2}$	Correlation between predictors
$H (\cdot)$	Heritability
$R S S$	Residual sum of squares
$O T U_{k}$	kth 97%-identity OTU

Appendix A. Theoretical Proofs

Before proving Theorem 1, we give the following lemma from Li et al. [7].

Lemma A1.

For any random vectors,

X \in R^{p}

and

y \in R^{q}

, under Assumption (C.1), for any

0 < κ < \frac{1}{2} - τ

and

ϵ > 0

, there exist constants

c_{2}^{'} > 0

and

c_{3} > 0

, such that

Pr \{|{\hat{d C o r}}^{2} (X, y) - d C o r^{2} (X, y)| \geq ϵ\} \leq O \{exp (- c_{2}^{'} ϵ^{2} n^{1 - 2 κ}) + n exp (- c_{3} n^{κ})\}

Proof of Theorem 1.

First, we show that, for any k,

Pr \{|{\hat{ω}}_{k} - ω_{k}| \geq c_{1} n^{- τ}\} \leq O \{exp (- c_{2} n^{1 - 2 (τ + κ)}) + n exp (- c_{3} n^{κ})\} .

(A1)

This is because

\begin{matrix} |{\hat{ω}}_{k} - ω_{k}| & = |\sum_{m = 1}^{M} {\hat{D C}}_{k} (t_{m}) \frac{Δ t_{m}}{t_{M} - t_{1}} - \sum_{m = 1}^{M} {D C}_{k} (t_{m}) \frac{Δ t_{m}}{t_{M} - t_{1}}| \\ \leq \frac{G}{t_{M} - t_{1}} \sum_{m = 1}^{M} |{\hat{D C}}_{k} (t_{m}) - {D C}_{k} (t_{m})|, \end{matrix}

where

G = \max_{1 \leq m \leq M} Δ t_{m}

,

Δ t_{m} = t_{m} - t_{m - 1}

and

Δ t_{1} = 0

. For any

ϵ > 0

,

Pr \{|{\hat{ω}}_{k} - ω_{k}| \geq ϵ\} \leq Pr \{\frac{G}{t_{M} - t_{1}} \sum_{m = 1}^{M} |{\hat{D C}}_{k} (t_{m}) - {D C}_{k} (t_{m})| \geq ϵ\} .

(A2)

Next, we have

\begin{matrix} Pr \{\frac{G}{t_{M} - t_{1}} \sum_{m = 1}^{M} |{\hat{D C}}_{k} (t_{m}) - {D C}_{k} (t_{m})| \geq ϵ\} \\ \leq & Pr \{⋃_{m = 1}^{M} (|{\hat{D C}}_{k} (t_{m}) - {D C}_{k} (t_{m})| \geq \frac{ϵ}{M} \frac{t_{M} - t_{1}}{G})\} \\ \leq & M max_{1 \leq m \leq M} Pr \{|{\hat{D C}}_{k} (t_{m}) - {D C}_{k} (t_{m})| \geq \frac{(t_{M} - t_{1}) ϵ}{M G}\} \\ \leq & M max_{1 \leq m \leq M} Pr \{|{\hat{D C}}_{k} (t_{m}) - {D C}_{k} (t_{m})| \geq \frac{ϵ}{M}\} . \end{matrix}

(A3)

The last inequality holds because

t_{M} - t_{1} \geq G = {m a x}_{1 \leq m \leq M} Δ t_{m}

.

Next, we deal with

Pr \{|{\hat{D C}}_{k} (t_{m}) - {D C}_{k} (t_{m})| \geq \frac{ϵ}{M}\}

. By using Lemma A1, for any

ϵ > 0

and

0 < κ < \frac{1}{2} - τ

, it follows from Assumption (C.1) that

Pr \{|{\hat{D C}}_{k} (t_{m}) - D C_{k} (t_{m})| \geq ϵ\} \leq O \{exp (- c_{2}^{'} ϵ^{2} n^{1 - 2 κ}) + n exp (- c_{3} n^{κ})\}

(A4)

for constants

c_{2}^{'} > 0

and

c_{3} > 0

. (A4) entails that

Pr \{|{\hat{D C}}_{k} (t_{m}) - {D C}_{k} (t_{m})| \geq \frac{ϵ}{M}\} \leq O \{exp (- c_{2}^{'} {(\frac{ϵ}{M})}^{2} n^{1 - 2 κ}) + n exp (- c_{3} n^{κ})\} .

(A5)

Let

ϵ / M = c_{0} n^{- τ}

and

c_{1} = c_{0} M

, where

τ

satisfies

0 < τ + κ < 1 / 2

. If M is finite, then we have

Pr \{|{\hat{ω}}_{k} - ω_{k}| \geq c_{1} n^{- τ}\} \leq O \{exp (- c_{2} n^{1 - 2 (κ + τ)}) + n exp (- c_{3} n^{κ})\},

(A6)

and

Pr \{max_{1 \leq k \leq p} | {\hat{ω}}_{k} - ω_{k} | \geq c_{1} n^{- τ}\} \leq O \{p [exp (- c_{2} n^{1 - 2 (τ + κ)}) + n exp (- c_{3} n^{κ})]\} .

The proof of the first part of Theorem 1 is completed.

Next, we begin the proof of the second part of Theorem 1. If

D ⊈ \hat{D}

, then there must exist some

k \in D

, such that

{\hat{ω}}_{k} < c_{1} n^{- τ}

.

{min}_{k \in D} ω_{k} \geq 2 c_{1} n^{- τ}

entails that

|{\hat{ω}}_{k} - ω_{k}| > c_{1} n^{- τ}

for some

k \in D

, indicating that

\{D ⊈ \hat{D}\} \subseteq {|{\hat{ω}}_{k} - ω_{k}| > c_{1} n^{- τ}, for some k \in D}

, and then

D_{n} = \{{max}_{k \in D} |{\hat{ω}}_{k} - ω_{k}| \leq c_{1} n^{- τ}\} \subseteq \{D \subseteq \hat{D}\}

. Consequently, when n is sufficiently large,

\begin{matrix} Pr \{D \subseteq \hat{D}\} & \geq Pr \{D_{n}\} \\ = 1 - Pr \{D_{n}^{c}\} \\ = 1 - Pr \{min_{k \in D} |{\hat{ω}}_{k} - ω_{k}| \geq c_{1} n^{- τ}\} \\ = 1 - s_{n} Pr \{|{\hat{ω}}_{k} - ω_{k}| \geq c_{1} n^{- τ}\} \\ \geq 1 - O \{L_{0} [exp (- c_{2} n^{1 - 2 (κ + τ)}) + n exp (- c_{3} n^{κ})]\}, \end{matrix}

(A7)

where

L_{0}

is the cardinality of D. (A7) implies that

Pr \{D \subseteq \hat{D}\} \to 1

, as

n \to \infty

. □

Proof of Theorem 2.

Let

k_{1} = arg {min}_{k \in D} {\hat{ω}}_{k}

and

k_{2} = arg {max}_{k \in D^{c}} {\hat{ω}}_{k}

. For any

0 < κ < \frac{1}{2} - τ

, we have

\begin{matrix} Pr \{min_{k \in D} {\hat{ω}}_{k} - max_{k \in D^{c}} {\hat{ω}}_{k} \leq 0\} \\ \leq & Pr \{min_{k \in D} {\hat{ω}}_{k} - max_{k \in D^{c}} {\hat{ω}}_{k} \leq min_{k \in D} ω_{k} - max_{k \in D^{c}} ω_{k} - 2 c_{4} n^{- τ}\} \\ = & Pr \{(min_{k \in D} ω_{k} - min_{k \in D} {\hat{ω}}_{k}) + (max_{k \in D^{c}} {\hat{ω}}_{k} - max_{k \in D^{c}} ω_{k}) \geq 2 c_{4} n^{- τ}\} \\ \leq & Pr \{(ω_{k_{1}} - {\hat{ω}}_{k_{1}}) + ({\hat{ω}}_{k_{2}} - ω_{k_{2}}) \geq 2 c_{4} n^{- τ}\} \\ \leq & Pr \{|{\hat{ω}}_{k_{1}} - ω_{k_{1}}| \geq c_{4} n^{- τ}\} + Pr \{|{\hat{ω}}_{k_{2}} - ω_{k_{2}}| \geq c_{4} n^{- τ}\} \\ \leq & 2 Pr \{max_{1 \leq k \leq p} |{\hat{ω}}_{k} - ω_{k}| \geq c_{4} n^{- τ}\} \\ = & O \{p [exp (- c_{5} n^{1 - 2 (τ + κ)}) + n exp (- c_{6} n^{κ})]\}, \end{matrix}

where

c_{4}

,

c_{5}

,

c_{6}

are positive constants. The last equation can be derived from Theorem 1. Thus, we have

Pr \{min_{k \in D} {\hat{ω}}_{k} - max_{k \in D^{c}} {\hat{ω}}_{k} > 0\} \geq 1 - O \{p [exp (- c_{5} n^{1 - 2 (τ + κ)}) + n exp (- c_{6} n^{κ})]\} .

If

log p = o \{n^{(1 - 2 τ) / 3}\}

with

κ = (1 - 2 τ) / 3

, then we have

Pr \{min_{k \in D} {\hat{ω}}_{k} - max_{k \in D^{c}} {\hat{ω}}_{k} > 0\} \geq 1 - O \{p exp (- c_{5} n^{(1 - 2 τ) / 3})\} .

(A8)

We know

p < exp (c_{5} n^{(1 - 2 τ) / 3} / 2)

for a large n; therefore, there must exist a large n, such that

p exp (- c_{5} n^{(1 - 2 τ) / 3}) \leq exp (- c_{5} n^{(1 - 2 τ) / 3} / 2) \leq exp (- 2 log n) \leq n^{- 2} .

Then, we have

\sum_{n = n_{0}}^{\infty} c_{5}^{'} p exp (- c_{5} n ϵ^{2}) \leq c_{5}^{'} \sum_{n = n_{0}}^{\infty} n^{- 2} \leq \infty,

where

c_{5}^{'}, n_{0} > 0

. Therefore, according to the Borel–Cantelli Lemma, we obtain

Pr \{lim {sup}_{n \to \infty} (min_{k \in D} {\hat{ω}}_{k} - max_{k \in D^{c}} {\hat{ω}}_{k} \leq 0)\} = 0,

and then

\begin{matrix} Pr \{lim {inf}_{n \to \infty} (min_{k \in D} {\hat{ω}}_{k} - max_{k \in D^{c}} {\hat{ω}}_{k} > 0)\} \\ = & Pr \{{[{lim inf}_{n \to \infty} (min_{k \in D} {\hat{ω}}_{k} - max_{k \in D^{c}} {\hat{ω}}_{k} > 0)]}^{c}\} = 1 . \end{matrix}

The proof of Theorem 2 is complete. □

Before proving Theorem 3, we give the following lemma from Wen et al. [9].

Lemma A2.

For any random vectors,

X \in R^{p}

,

y \in R^{q}

and

Z \in R^{r}

, with Assumption (C.1)–(C.3) being supposed to hold and the bandwidth for kernel estimation of Z satisfying

h = O (n^{κ / 2 r})

, for any

0 < \tilde{κ} < \frac{1}{2} - \tilde{τ}

and

ϵ > 0

, there exist constants

c_{8}^{'} > 0

and

c_{9} > 0

, such that

Pr \{|{\hat{c d C o r}}^{2} (X, y | Z) - c d C o r^{2} (X, y | Z)| \geq ϵ\} \leq O \{exp (- c_{8}^{'} ϵ^{2} n^{1 - 2 \tilde{κ}}) + n^{4} exp (- c_{9} n^{\tilde{κ}})\}

Proof of Theorem 3.

Similar to the proof of Theorem 1, we can get

Pr \{|{\hat{ρ}}_{k}^{*} - ρ_{k}| \geq ϵ\} \leq M max_{1 \leq m \leq M} Pr \{|{\hat{C D C}}_{k} (t_{m}) - {C D C}_{k} (t_{m})| \geq \frac{ϵ}{M}\} .

By using Lemma A2, for any

ϵ > 0

and

0 < \tilde{κ} < \frac{1}{2} - \tilde{τ}

, it follows from Assumption (C.1)–(C.3) that

Pr \{|{\hat{C D C}}_{k} (t_{m}) - C D C_{k} (t_{m})| \geq ϵ\} \leq O \{exp (- c_{8}^{'} ϵ^{2} n^{1 - 2 \tilde{κ}}) + n^{4} exp (- c_{9} n^{\tilde{κ}})\}

(A9)

for constants

c_{8}^{'} > 0

and

c_{9} > 0

. (A9) entails that

Pr \{|{\hat{C D C}}_{k} (t_{m}) - {C D C}_{k} (t_{m})| \geq \frac{ϵ}{M}\} \leq O \{exp (- c_{8}^{'} {(\frac{ϵ}{M})}^{2} n^{1 - 2 \tilde{κ}}) + n^{4} exp (- c_{9} n^{\tilde{κ}})\} .

Then, let

ϵ / M = c_{0}^{'} n^{- \tilde{τ}}

and

c_{7} = c_{0}^{'} M

, where

\tilde{τ}

satisfies

0 < \tilde{τ} + \tilde{κ} < 1 / 2

, we have

Pr \{|{\hat{ρ}}_{k}^{*} - ρ_{k}| \geq c_{7} n^{- \tilde{τ}}\} \leq O \{exp (- c_{8} n^{1 - 2 (\tilde{κ} + \tilde{τ})}) + n exp (- c_{9} n^{\tilde{κ}})\},

and

Pr \{max_{1 \leq k \leq p} | {\hat{ρ}}_{k}^{*} - ρ_{k} | \geq c_{7} n^{- \tilde{τ}}\} \leq O \{p [exp (- c_{8} n^{1 - 2 (\tilde{τ} + \tilde{κ})}) + n exp (- c_{9} n^{\tilde{κ}})]\} .

The proof of the first part of Theorem 3 is completed.

Next, we begin the proof of the second part of Theorem 3. If

A ⊈ \hat{A}

, then there must exist some

k \in A

, such that

{\hat{ρ}}_{k}^{*} < c_{7} n^{- \tilde{τ}}

.

{min}_{k \in A} ρ_{k} \geq 2 c_{7} n^{- \tilde{τ}}

entails that

|{\hat{ρ}}_{k}^{*} - ρ_{k}| > c_{7} n^{- \tilde{τ}}

for some

k \in A

, indicating that

\{A ⊈ \hat{A}\} \subseteq {|{\hat{ρ}}_{k}^{*} - ρ_{k}| > c_{7} n^{- \tilde{τ}}, for some k \in A}

, and then

A_{n} = \{{max}_{k \in A} |{\hat{ρ}}_{k}^{*} - ρ_{k}| \leq c_{7} n^{- \tilde{τ}}\} \subseteq \{A \subseteq \hat{A}\}

. Consequently, when n is sufficiently large,

\begin{matrix} Pr \{A \subseteq \hat{A}\} & \geq Pr \{A_{n}\} \\ = 1 - Pr \{A_{n}^{c}\} \\ = 1 - Pr \{min_{k \in A} |{\hat{ρ}}_{k}^{*} - ρ_{k}| \geq c_{7} n^{- \tilde{τ}}\} \\ = 1 - s_{n} Pr \{|{\hat{ρ}}_{k}^{*} - ρ_{k}| \geq c_{7} n^{- \tilde{τ}}\} \\ \geq 1 - O \{L_{1} [exp (- c_{8} n^{1 - 2 (\tilde{κ} + \tilde{τ})}) + n^{4} exp (- c_{9} n^{\tilde{κ}})]\}, \end{matrix}

(A10)

where

L_{1}

is the cardinality of

A

. (A10) implies that

Pr \{A \subseteq \hat{A}\} \to 1

, as

n \to \infty

. □

Proof of Theorem 4.

Similar to the proof of Theorem 2, for any

0 < \tilde{κ} < \frac{1}{2} - \tilde{τ}

, we have

\begin{matrix} Pr \{min_{k \in A} {\hat{ρ}}_{k}^{*} - max_{k \in A^{c}} {\hat{ρ}}_{k}^{*} \leq 0\} \leq O \{p [exp (- c_{11} n^{1 - 2 (\tilde{τ} + \tilde{κ})}) + n^{4} exp (- c_{12} n^{\tilde{κ}})]\}, \end{matrix}

where

c_{11}

,

c_{12}

are positive constants. Thus, we have

Pr \{min_{k \in A} {\hat{ρ}}_{k}^{*} - max_{k \in A^{c}} {\hat{ρ}}_{k}^{*} > 0\} \geq 1 - O \{p [exp (- c_{11} n^{1 - 2 (\tilde{τ} + \tilde{κ})}) + n^{4} exp (- c_{12} n^{\tilde{κ}})]\} .

If

log p = o \{n^{(1 - 2 \tilde{τ}) / 3}\}

with

\tilde{κ} = (1 - 2 \tilde{τ}) / 3

, then we have

Pr \{min_{k \in A} {\hat{ρ}}_{k}^{*} - max_{k \in A^{c}} {\hat{ρ}}_{k}^{*} > 0\} \geq 1 - O \{p exp (- c_{11} n^{(1 - 2 \tilde{τ}) / 3})\} .

We know

p < exp (c_{11} n^{(1 - 2 \tilde{τ}) / 3} / 2)

for a large n; therefore, there must exist a large n, such that

p exp (- c_{11} n^{(1 - 2 \tilde{τ}) / 3}) \leq exp (- c_{11} n^{(1 - 2 \tilde{τ}) / 3} / 2) \leq exp (- 2 log n) \leq n^{- 2} .

Then, we have

\sum_{n = n_{0}}^{\infty} c_{11}^{'} p exp (- c_{11} n ϵ^{2}) \leq c_{11}^{'} \sum_{n = n_{0}}^{\infty} n^{- 2} \leq \infty,

where

c_{11}^{'}, n_{0} > 0

. Therefore, according to the Borel–Cantelli Lemma, we obtain

Pr \{lim {sup}_{n \to \infty} (min_{k \in A} {\hat{ρ}}_{k}^{*} - max_{k \in A^{c}} {\hat{ρ}}_{k}^{*} \leq 0)\} = 0,

and then

\begin{matrix} Pr \{lim {inf}_{n \to \infty} (min_{k \in A} {\hat{ρ}}_{k}^{*} - max_{k \in A^{c}} {\hat{ρ}}_{k}^{*} > 0)\} \\ = & Pr \{{[{lim inf}_{n \to \infty} (min_{k \in A} {\hat{ρ}}_{k}^{*} - max_{k \in A^{c}} {\hat{ρ}}_{k}^{*} > 0)]}^{c}\} = 1, \end{matrix}

which completes the proof of Theorem 4. □

Appendix B. Sensitivity Analysis in Application

To demonstrate the impact of the choice of

d_{0}

on LCDCS, we conduct a sensitivity analysis on the application. In addition to

d_{0} = n - 1

considered in Section 3.2, we also considered

d_{0} = a [n / log (n)]

, where

a = 1, 2, 3

. Table A1 provides the IDs and taxonomic annotations of the 97%-identity OTUs, ranked by heritability after forward regression.

We can see that, for each situation, the heritability of New.0.CleanUp.ReferenceOTU89679, 561636, and New.0.ReferenceOTU340 consistently ranks in the top 3. However, there are some differences in the rankings from 4th to 10th, indicating that selecting the appropriate threshold

d_{0}

is important for future studies.

Table A1. IDs and taxonomic annotations of the top 10 97%-identity OTUs selected by LCDCS based on heritability at different

d_{0}

.

Table A1. IDs and taxonomic annotations of the top 10 97%-identity OTUs selected by LCDCS based on heritability at different

d_{0}

.

Order	ID	Taxonomic Annotation
$d_{0} = [n / log (n)] = 12$
1	New.0.CleanUp.ReferenceOTU89679 *	Collinsella
2	561636	Streptococcus
3	New.0.ReferenceOTU340 *	Bifidobacterium
4	326977 *	Bifidobacterium
5	New.1.ReferenceOTU284	NA ^†
6	533785 *	Bifidobacterium
7	561483 *	Bifidobacterium
8	305760 *	Escherichia/Shigella
$d_{0} = 2 [n / log (n)] = 25$
1	New.0.CleanUp.ReferenceOTU89679 *	Collinsella
2	561636	Streptococcus
3	New.0.ReferenceOTU340 *	Bifidobacterium
4	326977 *	Bifidobacterium
5	469868	Bifidobacterium
6	554755	Enterococcaceae
7	72820 *	Bifidobacterium
8	New.1.ReferenceOTU284	NA ^†
9	533785 *	Bifidobacterium
10	469852 *	Bifidobacterium
$d_{0} = 3 [n / log (n)] = 38$
1	New.0.CleanUp.ReferenceOTU89679 *	Collinsella
2	561636	Streptococcus
3	New.0.ReferenceOTU340 *	Bifidobacterium
4	326977 *	Bifidobacterium
5	469868	Bifidobacterium
6	New.0.ReferenceOTU339	NA ^†
7	130663 *	Bacteroides
8	554755	Enterococcaceae
9	210269 *	NA ^†
10	301004 *	Olsenella
$d_{0} = n - 1 = 49$
1	New.0.CleanUp.ReferenceOTU89679 *	Collinsella
2	561636	Streptococcus
3	New.0.ReferenceOTU340 *	Bifidobacterium
4	470527	Lactobacillus
5	326977 *	Bifidobacterium
6	471180 *	Bifidobacterium
7	469868	Bifidobacterium
8	345575	Lactococcus
9	316587 *	Streptococcus
10	New.0.ReferenceOTU339	NA ^†

Note: When the threshold

d_{0}

is set to 12, only 8 97%-identity OTUs remain after forward regression. * 97%-identity OTUs discovered in Subramanian et al. [21]. ^† The taxonomic annotation is not provided by Subramanian et al. [21].

References

Fan, J.Q.; Lv, J.C. Sure Independence Screening for Ultrahigh Dimensional Feature Space. J. R. Stat. Soc. B. 2008, 70, 849–911. [Google Scholar] [CrossRef]
Song, R.; Yi, F.; Zou, H. On Varying-Coefficient Independence Screening for High-Dimensional Varying-Coefficient Models. Stat. Sinica. 2014, 24, 1735–1752. [Google Scholar] [CrossRef]
Liu, J.Y. Feature Screening and Variable Selection for Partially Linear Models with Ultrahigh-Dimensional Longitudinal Data. Neurocomputing 2016, 195, 202–210. [Google Scholar] [CrossRef]
Lai, P.; Liang, W.J.; Wang, F.J.; Zhang, Q.Z. Feature Screening of Quadratic Inference Functions for Ultrahigh Dimensional Longitudinal Data. J. Stat. Comput. Sim. 2020, 90, 2614–2630. [Google Scholar] [CrossRef]
Jiang, B.Y.; Lv, J.; Li, J.L.; Cheng, M.Y. Robust Model Averaging Prediction of Longitudinal Response with Ultrahigh-Dimensional Covariates. J. R. Stat. Soc. Ser. B Stat. Methodol. 2024, 87, 337–361. [Google Scholar] [CrossRef]
Zhu, L.P.; Li, L.X.; Li, R.Z.; Zhu, L.X. Model-Free Feature Screening for Ultrahigh-Dimensional Data. J. Am. Stat. Assoc. 2011, 106, 1464–1475. [Google Scholar] [CrossRef] [PubMed]
Li, R.Z.; Zhong, W.; Zhu, L.P. Feature Screening via Distance Correlation Learning. J. Am. Stat. Assoc. 2012, 107, 1129–1139. [Google Scholar] [CrossRef]
Shao, X.F.; Zhang, J.S. Martingale Difference Correlation and Its Use in High-Dimensional Variable Screening. J. Am. Stat. Assoc. 2014, 109, 1302–1318. [Google Scholar] [CrossRef]
Wen, C.H.; Pan, W.L.; Huang, M.M.; Wang, X.Q. Sure Independence Screening Adjusted for Confounding Covariates with Ultrahigh Dimensional Data. Stat. Sinica. 2018, 28, 293–317. [Google Scholar] [CrossRef]
Nandy, D.; Chiaromonte, F.; Li, R.Z. Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems. J. Am. Stat. Assoc. 2022, 117, 1516–1529. [Google Scholar] [CrossRef]
Zhang, J.; Liu, Y.Y.; Cui, H.J. Model-Free Feature Screening via Distance Correlation for Ultrahigh Dimensional Survival Data. Stat. Papers. 2021, 62, 2711–2738. [Google Scholar] [CrossRef]
Zhong, W.; Qian, C.; Liu, W.J.; Zhu, L.P.; Li, R.Z. Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills. J. Am. Stat. Assoc. 2023, 118, 805–817. [Google Scholar] [CrossRef]
Zhou, T.Y.; Zhu, L.P. Model-Free Feature Screening for Ultrahigh Dimensional Censored Regression. Stat. Comput. 2017, 27, 947–961. [Google Scholar] [CrossRef]
Székely, G.J.; Rizzo, M.L.; Bakirov, N.K. Measuring and Testing Dependence by Correlation of Distances. Ann. Stat. 2007, 35, 2769–2794. [Google Scholar] [CrossRef]
Wang, X.Q.; Pan, W.L.; Hu, W.H.; Tian, Y.; Zhang, H.P. Conditional Distance Correlation. J. Am. Stat. Assoc. 2015, 110, 1726–1734. [Google Scholar] [CrossRef]
Chen, L.P. Feature Screening Based on Distance Correlation for Ultrahigh-Dimensional Censored Data with Covariate Measurement Error. Comput. Stat. 2021, 36, 857–884. [Google Scholar] [CrossRef]
Lu, J.; Lin, L. Model-Free Conditional Screening via Conditional Distance Correlation. Stat. Papers. 2020, 61, 225–244. [Google Scholar] [CrossRef]
Dai, C.G.; Lin, B.Y.; Xing, X.; Liu, J.S. False Discovery Rate Control via Data Splitting. J. Am. Stat. Assoc. 2023, 118, 2503–2520. [Google Scholar] [CrossRef]
Chu, W.H.; Li, R.Z.; Reimherr, M. Feature Screening for Time-Varying Coefficient Models with Ultrahigh-Dimensional Longitudinal Data. Ann. Appl. Stat. 2016, 10, 596–617. [Google Scholar] [CrossRef]
Chu, W.H.; Li, R.Z.; Liu, J.Y.; Reimherr, M. Feature Selection for Generalized Varying Coefficient Mixed-Effect Models with Application to Obesity GWAS. Ann. Appl. Stat. 2020, 14, 276–298. [Google Scholar] [CrossRef] [PubMed]
Subramanian, S.; Huq, S.; Yatsunenko, T.; Haque, R.; Mahfuz, M.; Alam, M.A.; Benezra, A.; DeStefano, J.; Meier, M.F.; Muegge, B.D.; et al. Persistent Gut Microbiota Immaturity in Malnourished Bangladeshi Children. Nature 2014, 510, 417–421. [Google Scholar] [CrossRef] [PubMed]
Liu, W.J.; Ke, Y.; Liu, J.Y.; Li, R.Z. Model-Free Feature Screening and FDR Control with Knockoff Features. J. Am. Stat. Assoc. 2022, 117, 428–443. [Google Scholar] [CrossRef]
Chi, C.M.; Fan, Y.Y.; Ing, C.K.; Lv, J.C. High-Dimensional Knockoffs Inference for Time Series Data. J. Am. Stat. Assoc. 2025, 1–24. [Google Scholar] [CrossRef] [PubMed]

Figure 1. The geometric illustration of LDC statistics.

Δ t_{4}

represents the measurement interval between time points

t_{3}

and

t_{4}

.

D C (t_{4}) Δ t_{4}

represents the area of the histogram between time points

t_{3}

and

t_{4}

.

Figure 1. The geometric illustration of LDC statistics.

Δ t_{4}

represents the measurement interval between time points

t_{3}

and

t_{4}

.

D C (t_{4}) Δ t_{4}

represents the area of the histogram between time points

t_{3}

and

t_{4}

.

Figure 2. Trajectories of HAZ in the first 2 years after birth for 50 children. Each boxplot illustrates the variability in the HAZ of children for the given month.

Figure 3. Curves of total heritability with number of 97%-identity OTUs selected through different methods.

Table 1.

M M S

,

P_{k}

,

P_{a}

,

d_{c}

for Example 1.

Table 1.

M M S

,

P_{k}

,

P_{a}

,

d_{c}

for Example 1.

Method	MMS		$P_{k}$				$P_{a}$	$d_{c}$
Method	Median	RSD	$P_{1}$	$P_{2}$	$P_{3}$	$P_{4}$	$P_{a}$	$d_{c}$
	Model (1.a), $ρ_{1} = 0.3$
LDCS	4.00	0.00	1.00	0.99	1.00	1.00	0.99	3.99
TVCM-SIS	4.00	3.73	0.99	0.86	0.99	0.98	0.82	3.82
PRSIS	4.00	82.28	0.84	0.74	0.84	0.83	0.68	3.25
DC-SIS	370.00	356.90	0.31	0.17	0.33	0.27	0.04	1.08
CIS	609.50	726.49	0.34	0.13	0.38	1.00	0.04	1.85
	Model (1.a), $ρ_{1} = 0.6$
LDCS	4.00	0.00	1.00	0.96	1.00	1.00	0.96	3.96
TVCM-SIS	4.00	4.48	0.99	0.84	0.98	0.99	0.81	3.80
PRSIS	4.00	23.51	0.89	0.78	0.90	0.90	0.71	3.47
DC-SIS	302.50	349.63	0.43	0.17	0.46	0.39	0.08	1.45
CIS	667.00	711.38	0.35	0.13	0.33	1.00	0.04	1.82
	Model (1.b), $ρ_{1} = 0.3$
LDCS	13.00	22.57	0.90	0.90	0.83	0.82	0.59	3.45
TVCM-SIS	651.00	626.12	0.26	0.21	0.81	0.81	0.03	2.08
PRSIS	1022.50	773.51	0.14	0.11	0.93	0.90	0.01	2.08
DC-SIS	19.00	33.58	0.75	0.75	0.87	0.86	0.49	3.23
CIS	490.50	596.08	0.87	0.89	0.18	0.17	0.02	2.10
	Model (1.b), $ρ_{1} = 0.6$
LDCS	21.00	38.81	0.87	0.84	0.77	0.79	0.46	3.26
TVCM-SIS	587.50	672.57	0.22	0.23	0.72	0.74	0.02	1.91
PRSIS	957.00	787.13	0.15	0.12	0.80	0.84	0.00	1.91
DC-SIS	30.00	47.95	0.64	0.64	0.83	0.83	0.36	2.93
CIS	523.00	608.02	0.89	0.89	0.17	0.16	0.03	2.10

Table 2.

M M S

,

P_{k}

,

P_{a}

,

d_{c}

for Example 2.

Table 2.

M M S

,

P_{k}

,

P_{a}

,

d_{c}

for Example 2.

Method	MMS		$P_{k}$				$P_{a}$	$d_{c}$
Method	Median	RSD	$P_{1}$	$P_{2}$	$P_{3}$	$P_{4}$
	Model (2.a), $ρ_{1} = 0.3$
LDCS	12.00	93.47	0.84	0.85	0.82	0.94	0.54	3.45
TVCM-SIS	187.00	785.82	0.56	0.57	0.84	0.89	0.36	2.86
PRSIS	740.50	958.58	0.65	0.67	0.23	0.99	0.10	2.54
DC-SIS	335.50	682.46	0.72	0.71	0.32	0.99	0.15	2.74
CIS	948.00	755.78	0.47	0.46	0.20	0.55	0.01	1.68
	Model (2.a), $ρ_{1} = 0.6$
LDCS	14.00	105.22	0.84	0.84	0.82	0.95	0.53	3.45
TVCM-SIS	157.50	750.37	0.57	0.57	0.84	0.90	0.38	2.88
PRSIS	753.50	861.19	0.64	0.64	0.19	0.97	0.07	2.44
DC-SIS	366.50	601.87	0.67	0.68	0.27	0.98	0.08	2.61
CIS	1059.00	695.52	0.47	0.46	0.16	0.56	0.01	1.66
	Model (2.b), $ρ_{1} = 0.3$
LDCS	16.00	45.52	0.76	0.93	0.75	0.95	0.52	3.39
TVCM-SIS	261.50	638.06	0.34	0.52	0.38	0.53	0.14	1.77
PRSIS	845.00	972.01	0.21	0.34	0.22	0.33	0.05	1.10
DC-SIS	430.00	715.86	0.34	0.50	0.35	0.53	0.18	1.71
CIS	1402.50	537.31	0.05	0.13	0.07	0.14	0.01	0.40
	Model (2.b), $ρ_{1} = 0.6$
LDCS	24.50	60.63	0.71	0.91	0.73	0.92	0.45	3.26
TVCM-SIS	395.50	735.63	0.33	0.50	0.35	0.52	0.11	1.70
PRSIS	916.00	921.83	0.18	0.27	0.18	0.31	0.03	0.94
DC-SIS	557.00	816.23	0.30	0.47	0.28	0.50	0.11	1.55
CIS	1423.50	558.58	0.06	0.08	0.05	0.11	0.00	0.30

Table 3.

M M S

,

P_{k}

,

P_{a}

,

d_{c}

for Example 3.

Table 3.

M M S

,

P_{k}

,

P_{a}

,

d_{c}

for Example 3.

Method	MMS		$P_{k}$				$P_{a}$	$d_{c}$
Method	Median	RSD	$P_{1}$	$P_{2}$	$P_{3}$	$P_{4}$
	Model (3), $ρ_{2}$ = 0.5
LCDCS	6.00	8.21	0.97	0.89	0.91	0.96	0.78	3.73
TVCM-SIS	1307.00	508.40	0.05	0.04	0.07	0.06	0.00	0.22
PRSIS	1557.50	456.16	0.02	0.01	0.02	0.02	0.00	0.07
CDC-SIS	9.00	59.51	0.87	0.76	0.75	0.86	0.59	3.24
CIS	263.50	354.10	0.55	0.38	0.43	0.57	0.07	1.93
	Model (3), $ρ_{2}$ = 0.8
LCDCS	9.00	16.60	0.96	0.83	0.88	0.97	0.68	3.64
TVCM-SIS	1479.00	419.40	0.02	0.02	0.02	0.03	0.00	0.10
PRSIS	1579.50	403.92	0.02	0.01	0.02	0.01	0.00	0.05
CDC-SIS	16.50	90.49	0.85	0.72	0.72	0.86	0.51	3.15
CIS	439.50	481.34	0.38	0.27	0.30	0.37	0.01	1.32

Table 4. Number of overlapping 97%-identity OTUs among the top

d_{0} = 10

(above diagonal) and the top

d_{0} = 220

(below diagonal) selected via different methods.

Table 4. Number of overlapping 97%-identity OTUs among the top

d_{0} = 10

(above diagonal) and the top

d_{0} = 220

(below diagonal) selected via different methods.

		LDCS	LCDCS	TVCM-SIS	PRSIS	DC-SIS	CDC-SIS	CIS
		$d_{0} = 10$
LDCS	$d_{0} = 220$	-	5	0	6	5	7	0
LCDCS		186	-	0	4	2	5	0
TVCM-SIS		94	103	-	0	0	0	1
PRSIS		114	97	70	-	8	8	0
DC-SIS		115	100	69	181	-	6	0
CDC-SIS		118	110	64	160	152	-	0
CIS		35	38	62	52	47	35	-
ANOVA *		103	108	51	47	49	74	45

* Subramanian et al. [21] used an analysis of variance with LMM to identify 97%-identity OTUs that were significantly different between SAM and healthy children.

Table 5. Quantity of 97%-identity OTUs selected and total heritability for each method.

Method	Number of 97%-Identity OTUs	Total Heritability
LDCS	36	60.24%
LCDCS	39	64.76%
TVCM-SIS	39	49.56%
PRSIS	40	52.77%
DC-SIS	34	51.46%
CDC-SIS	41	55.85%
CIS	41	37.02%

Table 6. IDs and taxonomic annotations of the top 10 97%-identity OTUs selected via LCDCS based on heritability.

Order	ID	Taxonomic Annotation	Heritability
1	New.0.CleanUp.ReferenceOTU89679 *	Collinsella	9.29%
2	561636	Streptococcus	7.54%
3	New.0.ReferenceOTU340 *	Bifidobacterium	4.87%
4	470527	Lactobacillus	4.43%
5	326977 *	Bifidobacterium	4.22%
6	471180 *	Bifidobacterium	3.31%
7	469868	Bifidobacterium	2.76%
8	345575	Lactococcus	2.59%
9	316587 *	Streptococcus	2.23%
10	New.0.ReferenceOTU339	NA ^†	2.05%

* 97%-identity OTUs discovered in Subramanian et al. [21]. ^† The taxonomic annotation is not provided by Subramanian et al. [21].

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, J.; Yang, X.; Dai, J.; Li, Y. Model-Free Feature Screening Based on Data Aggregation for Ultra-High-Dimensional Longitudinal Data. Stats 2025, 8, 99. https://doi.org/10.3390/stats8040099

AMA Style

Chen J, Yang X, Dai J, Li Y. Model-Free Feature Screening Based on Data Aggregation for Ultra-High-Dimensional Longitudinal Data. Stats. 2025; 8(4):99. https://doi.org/10.3390/stats8040099

Chicago/Turabian Style

Chen, Junfeng, Xiaoguang Yang, Jing Dai, and Yunming Li. 2025. "Model-Free Feature Screening Based on Data Aggregation for Ultra-High-Dimensional Longitudinal Data" Stats 8, no. 4: 99. https://doi.org/10.3390/stats8040099

APA Style

Chen, J., Yang, X., Dai, J., & Li, Y. (2025). Model-Free Feature Screening Based on Data Aggregation for Ultra-High-Dimensional Longitudinal Data. Stats, 8(4), 99. https://doi.org/10.3390/stats8040099

Article Menu

Model-Free Feature Screening Based on Data Aggregation for Ultra-High-Dimensional Longitudinal Data

Abstract

1. Introduction

2. Materials and Methods

2.1. Model Setup

2.2. Marginal Screening Procedure

2.3. Conditional Screening Procedure

3. Results

3.1. Simulation Study

3.1.1. Example 1: Partially Linear Models

3.1.2. Example 2: Time-Varying Coefficient Models

3.1.3. Example 3: Partially Linear Single-Index Models

3.2. Application to Gut Microbiota Data

4. Discussion

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A. Theoretical Proofs

Appendix B. Sensitivity Analysis in Application

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI