Imputation-Based Variable Selection Method for Block-Wise Missing Data When Integrating Multiple Longitudinal Studies †

When integrating data from multiple sources, a common challenge is block-wise missingness. Most existing methods address this issue only in cross-sectional studies. In this paper, we propose a method for variable selection when combining datasets from multiple sources in longitudinal studies. To account for block-wise missingness in covariates, we impute the missing values multiple times based on combinations of samples from different missing patterns and predictors from different data sources. We then use these imputed data to construct estimating equations and aggregate the information across subjects and sources with the generalized method of moments. We employ the smoothly clipped absolute deviation penalty for variable selection and use the extended Bayesian Information Criterion for tuning parameter selection. We establish the asymptotic properties of the proposed estimator and demonstrate the superior performance of the proposed method through numerical experiments. Furthermore, we apply the proposed method to the Alzheimer's Disease Neuroimaging Initiative study to identify sensitive early-stage biomarkers of Alzheimer's Disease, which is crucial for early disease detection and personalized treatment.


Introduction
Multi-source data are attracting increasing attention in scientific research. A practical problem with multi-source data is block-wise missingness. Our work is motivated by the block-wise missingness in the Alzheimer's Disease Neuroimaging Initiative (ADNI) data when investigating the biomarkers that are associated with Alzheimer's Disease (AD). In the ADNI study, healthy elderly subjects, as well as subjects with normal cognition (NC), mild cognitive impairment (MCI), or AD, were recruited to identify neuroimaging measures, cognitive measures, and biomarkers that can effectively and timely detect cognitive and functional changes [1]. The ADNI data exhibit a block-wise missing structure owing to the long duration of the study, the high cost of certain measurements, and other factors. Besides the ADNI data, datasets with a block-wise missing structure also arise in many other fields, including environmental science, sociology, and economics. For example, a block-wise missing structure appears in human mortality data integrated from Italy and Switzerland [2] and in credit data collected from various institutions (Lan and Jiang [3]; Li et al. [4]).
Statistical analysis with missing covariates has been widely studied due to the prevalence of missing values in many datasets. Common methods for dealing with missing data include complete case analysis, maximum likelihood, inverse probability weighting, and imputation. While complete case analysis is the easiest approach to implement, it has several drawbacks, such as potential bias in certain cases and a significant loss of information when the proportion of missingness is high. The maximum likelihood approach (e.g., Sabbe et al. [5]; Bondarenko and Raghunathan [6]; Audigier et al. [7]; von Hippel and Bartlett [8]) requires a specification of the distribution of the variables, though this is unknown and unverifiable in practice. Inverse probability weighting (e.g., Chen et al. [9]; Creemers et al. [10]; Zubizarreta [11]; Hughes et al. [12]) relies heavily on the information from complete cases, which can be problematic when the fraction of completely observed subjects is small. Two big challenges with the ADNI data are the high proportion of missingness and the large number of covariates, which make complete case analysis and the maximum likelihood approach inefficient. In addition to these two challenges, weighting methods cannot handle the problem in the presence of multiple missing patterns. Compared to these methods with notable limitations, imputation methods are more appropriate for the ADNI data. Recently, multi-source data with block-wise missingness, exemplified by the ADNI data, have drawn extensive attention in statistical research. Ref. [13] developed a classification framework, which was accomplished in three steps: feature selection, sample selection, and matrix completion. Ref. [2] proposed a dimension reduction method called generalized integrative principal component analysis (GIPCA). Under the assumption of an identical type of distribution in the exponential family within each data source, GIPCA decomposed the overall effect into joint and individual effects across data sources. Ref. [14] imputed the missing data using a factor structure model, which considers the correlation between predictors and does not depend on the missing mechanism. Ref. [15] developed a multiple block-wise imputation (MBI) approach by constructing estimating functions based on both complete and incomplete observations. Other related works include those of [4,16,17].
However, these methods are not applicable to longitudinal studies. Applied to the ADNI data, they use only the baseline measurement for each patient and simply discard the follow-up measurements. Thus, these methods are inefficient for the ADNI data since they fail to take account of within-subject correlations. In this paper, we aim to develop a method for variable selection when integrating longitudinal data from multiple sources in the presence of a block-wise missing structure. We impute the block-wise missing data multiple times by using the information from both subjects with complete observations and subjects with missing values. We construct estimating equations based on the imputed data and incorporate working correlation matrices to account for within-cluster correlation. With the generalized method of moments, we are able to integrate data from multiple sources and identify the relevant variables by introducing a penalty term.
This paper is organized as follows. Section 2 describes the setup and formalizes the proposed method. In Section 3, we study the asymptotic properties of the proposed estimator. In Section 4, we develop an algorithm to implement the proposed method, followed by a simulation study conducted in Section 5 to evaluate the performance of the proposed method. In Section 6, we apply the proposed method to the ADNI study. Section 7 provides further discussion.

Setup
Suppose the dataset consists of n independent and identically distributed (i.i.d.) samples drawn from independent sources with disjoint covariates. Without loss of generality, we assume that the data are already sorted by missing patterns, and the total number of missing patterns is K with n_k samples in each pattern, where ∑_{k=1}^K n_k = n and k = 1, . . ., K. Within each missing pattern, all subjects have the same missing structure, and the covariates from any specific source are either fully observed or fully missing. Let Y_{k,i} = (Y_{k,i1}, . . ., Y_{k,im_i})^T be the response vector for the ith subject in the kth pattern with m_i measurements. For ease of presentation, we assume that each sample has the same number of measurements m. Furthermore, let X_{k,i} = (X_{k,i1}, . . ., X_{k,ip}) be the corresponding covariate matrix for the ith subject in the kth pattern across all the measurements, where p is the number of covariates. We assume the underlying population-level model is E(Y_{k,ij} | X_{k,ij}) = µ(X_{k,ij}^T β), where µ(•) is a known monotonic link function and β is a p-dimensional vector in the parameter space. Let O(k) and M(k) denote the indices of the observed covariates and missing covariates in the kth pattern, respectively. Define R_i = 1 if X_{k,i} is fully observed and R_i = 0 otherwise. We assume the missing mechanism of X_{k,i} is missing completely at random [18]. Figure 1 is an example illustrating what block-wise missing data look like, consisting of three sources with three missing patterns. Note that the covariates in source 1 are completely observed in all three patterns, while the covariates in source 2 are only observed in patterns 1 and 2, and the covariates in source 3 are only observed in patterns 1 and 3.
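As a concrete illustration, the observed/missing layout of Figure 1 can be encoded as an indicator matrix. This is a hypothetical sketch: the choice of two covariates per source and the helper `pattern_mask` are ours, not the paper's.

```python
import numpy as np

# Hypothetical encoding of the Figure 1 layout: three sources, three missing
# patterns; True = observed, False = missing.  Two covariates per source is
# our own illustrative choice.
def pattern_mask(pattern, p_src=(2, 2, 2)):
    """Observed-covariate indicator for one subject in the given pattern."""
    observed = {
        1: (True, True, True),    # pattern 1: all three sources observed
        2: (True, True, False),   # pattern 2: source 3 missing
        3: (True, False, True),   # pattern 3: source 2 missing
    }[pattern]
    return np.concatenate([np.full(p, o) for p, o in zip(p_src, observed)])

# One row per pattern; within a pattern every subject shares the same mask.
mask = np.vstack([pattern_mask(k) for k in (1, 2, 3)])
```

Within each pattern, covariates of a given source are either all observed or all missing, which is exactly the block-wise structure described above.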
A similar structure also exists in the ADNI data. For example, variables in cerebrospinal fluid (CSF) are only measured on a subsample since CSF collection was mainly performed in phase II. Although complete case analysis is feasible for the ADNI data, it is inefficient, especially when the number of subjects with complete observations is limited. Thus, it is essential to leverage information from incomplete observations.

Proposed Method
One approach to utilizing incomplete data is to impute the missing values and perform statistical analysis on the imputed dataset. Traditional methods impute missing values using information solely from complete cases. However, in scenarios involving block-wise missing data, the proportion of complete cases can be relatively small, resulting in unstable imputed values. To illustrate how to incorporate information from subjects with partially observed values when imputing missing values, we continue to use the example given in Figure 1. Let X_{k,i(r)} be the rth imputed covariate vector for the ith subject from pattern k, r = 1, . . ., R_k. For instance, the missing values of X_{2,i}, i.e., the covariates of source 3 in pattern 2, can be imputed using the information from all sources in pattern 1, which we denote as X_{2,i(1)}. Additionally, they can also be imputed based on the covariates in sources 1 and 3 for subjects from patterns 1 and 3, which we denote as X_{2,i(2)}. Figure 2 illustrates how the above procedures work. When all the covariates are observed, X_{k,i(r)} = X_{k,i}. Similarly, we can define µ_{k,i(r)}(β) as the corresponding imputed conditional mean.
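The two imputations X_{2,i(1)} and X_{2,i(2)} can be sketched with simple least-squares regression imputation. This is a toy illustration with made-up dimensions; the paper's actual imputation models may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layout mirroring Figure 2: columns 0-1 are source 1, columns 2-3 are
# source 2, columns 4-5 are source 3.  All sizes are made up for illustration.
X1 = rng.normal(size=(50, 6))       # pattern 1: fully observed
X2_obs = rng.normal(size=(30, 4))   # pattern 2: sources 1-2 observed, source 3 missing
X3_obs = rng.normal(size=(40, 4))   # pattern 3: source 1 (cols 0-1) and source 3 (cols 2-3)

def regression_impute(X_train, Z_train, X_new):
    """Fit Z ~ X by least squares on training subjects; predict for X_new."""
    Xd = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(Xd, Z_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

# Imputation X_{2,i(1)}: train on pattern 1, predictors = sources 1 and 2.
imp1 = regression_impute(X1[:, :4], X1[:, 4:], X2_obs)

# Imputation X_{2,i(2)}: train on patterns 1 and 3, predictor = source 1 only.
X_train = np.vstack([X1[:, :2], X3_obs[:, :2]])
Z_train = np.vstack([X1[:, 4:], X3_obs[:, 2:]])
imp2 = regression_impute(X_train, Z_train, X2_obs[:, :2])
```

The second imputation uses 90 training subjects instead of 50, which is the sense in which incomplete cases contribute information.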
The intuition behind the proposed method stems from generalized estimating equations (GEE) and quadratic inference functions (QIF). Suppose V_{k,i} is the unknown true covariance matrix of Y_{k,i}. Ref. [19] proposed that V_{k,i} can be estimated by A_{k,i}^{1/2} R_{k,i} A_{k,i}^{1/2}, where A_{k,i} is the diagonal matrix of the conditional variances of Y_{k,i} and R_{k,i} is a working correlation matrix fully specified by a vector of parameters α. Ref. [20] proposed the QIF using the fact that the inverse of the correlation matrix, R_{k,i}^{-1}, can be approximated by ∑_{j=1}^J a_{k,j} M_j, where M_1, . . ., M_J are some basis matrices. For example, if we assume the within-cluster correlation structure is exchangeable, R_{k,i}^{-1} can be approximated by a_{k,1} M_1 + a_{k,2} M_2, where M_1 is the identity matrix and M_2 is a matrix with elements on the diagonal equal to 0 and elements off the diagonal equal to 1. The estimation of the inverse of a correlation matrix using a linear combination has been intensively studied by [21]. The advantage of this linear approximation is that the parameter α can be treated as a nuisance parameter, leading to some improvement in computational efficiency. The estimating function for subject i in the kth pattern with the rth imputation is then defined through the components (∂µ_{k,i(r)}(β)/∂β_{O(k)})^T A_{k,i}^{-1/2} M_j A_{k,i}^{-1/2} (Y_{k,i} − µ_{k,i(r)}(β)), j = 1, . . ., J. Here, we only take the derivative with respect to β_{O(k)} to enhance numerical stability. Recall that a_{k,j} is a linear coefficient used to approximate the inverse of the correlation matrix, and it is thus a nuisance parameter. To avoid estimating these nuisance parameters, we define the extended score vector by stacking the J components above. Similarly, we obtain extended score vectors for all imputed covariate vectors and subjects. To integrate all score vectors, we aggregate the information by stacking them into a long vector.
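For the exchangeable structure, the representation of R^{-1} in the basis {M_1, M_2} is in fact exact, which a quick numerical check confirms. The coefficient formulas below are standard linear-algebra identities for exchangeable matrices, stated here as a sketch rather than as the paper's notation.

```python
import numpy as np

m, rho = 3, 0.5                      # cluster size and exchangeable correlation
R = (1 - rho) * np.eye(m) + rho * np.ones((m, m))

M1 = np.eye(m)                       # basis matrix 1: identity
M2 = np.ones((m, m)) - np.eye(m)     # basis matrix 2: 0 on-diagonal, 1 off-diagonal

# For the exchangeable structure, R^{-1} = a1*M1 + a2*M2 holds exactly with:
a1 = (1 + (m - 2) * rho) / ((1 - rho) * (1 + (m - 1) * rho))
a2 = -rho / ((1 - rho) * (1 + (m - 1) * rho))
R_inv = a1 * M1 + a2 * M2
```

Because the coefficients a1 and a2 absorb all dependence on the correlation parameter, α never needs to be estimated directly, which is the computational advantage noted above.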
Note that this may give an overdetermined system because the number of equations can exceed the number of parameters. To overcome this difficulty, we adopt the generalized method of moments [22] and add a penalty term, so the objective function takes the penalized GMM form S(β) = n g^T(β) C^{-1}(β) g(β) + n ∑_{j=1}^p p_{λ_n}(|β_j|), where C(β) is a block-diagonal matrix under the assumption of independence among samples from different missing patterns and p_{λ_n}(•) is an arbitrary, investigator-chosen penalty function with a tuning parameter λ. Among many candidate penalty functions, we adopt the nonconvex smoothly clipped absolute deviation (SCAD) penalty [23], defined through its first derivative p'_λ(θ) = λ{ I(θ ≤ λ) + (aλ − θ)_+ / ((a − 1)λ) I(θ > λ) } for some a > 2, which possesses the desirable oracle property.
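For reference, the SCAD penalty itself, obtained by integrating the derivative above in the standard form of Fan and Li, can be sketched as follows (a = 3.7 is the value they recommend).

```python
import numpy as np

def scad(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|): linear near zero, quadratic in the
    middle, and constant beyond a*lam (so large coefficients are not shrunk)."""
    t = np.abs(np.asarray(theta, dtype=float))
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, (a + 1) * lam**2 / 2))
```

The penalty agrees with the lasso near zero but levels off, which is what yields nearly unbiased estimates of large coefficients and, ultimately, the oracle property.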

Asymptotic Properties
In this section, we investigate the asymptotic properties of the proposed estimator. In Section 3.1, we assume the sample size n is increasing while the number of parameters p is fixed, and demonstrate that the proposed estimator is √n-consistent and asymptotically normal. As the sample size goes to infinity, the proposed method is capable of selecting the relevant variables with probability tending to 1. We also show that the proposed estimator is asymptotically more efficient than the single imputation method because it incorporates the information of samples with missing values. In Section 3.2, we suppose both the sample size n and the number of parameters p are increasing, but n increases faster than p. We show that consistency and sparsity still hold with diverging p. Without loss of generality, we assume β can be partitioned into two parts, i.e., β = (β_A^T, β_N^T)^T, where β_A corresponds to the relevant variables with non-zero true values, while β_N consists of the coefficients of the irrelevant variables with zero true values. For any function g(β), we use ġ(β) to denote the first derivative of g(•) evaluated at β. We use similar notation for higher-order derivatives.

Fixed Number of Parameters
To establish the asymptotic properties of the proposed estimator in the setting of an increasing sample size and a fixed number of parameters, we require regularity conditions C.1-C.6. In particular, C.5 requires the penalty function to satisfy standard SCAD-type conditions, and Ω_k is defined as a diagonal matrix of dimension n_k with each element equal to lim_{n→∞} n/n_k.
C.1-C.3 are conditions that require the existence of moments, which are easily satisfied. C.4 requires that the imputed conditional mean converges to the true conditional mean in probability, which is satisfied as long as the imputation model is correctly specified and the covariates are missing completely at random. C.5 is a standard condition for the SCAD penalty, commonly used in variable selection methods (Gao et al. [24]; Cho and Qu [25]; Tian et al. [26]). More specifically, (a) ensures that the sparsity property is satisfied, (b) and (c) ensure that the consistency property is satisfied, and (c) also guarantees that the objective function (1) is dominated by the first term. C.6 is used to establish the asymptotic normality of the proposed estimator.
Theorem 1 states the existence of a minimizer of the objective function and that the minimizer converges to the true coefficients at a rate of √n as the sample size increases. Next, we demonstrate that this estimator possesses the sparsity property and that the estimator of the non-zero coefficients is asymptotically normal, as outlined in the following theorem.
The sparsity of the proposed estimator guarantees that the probability of selecting the true model approaches 1. We also obtain in Theorem 2 the asymptotic normality of β̂_A, the estimator of the coefficients of the relevant variables, which allows us to estimate its variance if H and Σ are known. However, in practice, these are unknown. We can obtain the empirical variance-covariance matrix of β̂_A by replacing H with Ĥ(β̂) = ∂g^T(β̂)/∂β_A and replacing Σ with C(β̂), i.e., V̂ = (ĤC^{-1}Ω^{-1}Ĥ^T)^{-1}. Next, we compare the empirical variance of the proposed estimator with the empirical variance of the single imputation approach. Theorem 3. If a single imputation based on complete cases is used and the asymptotic covariance matrix of β̂_A is denoted by Ṽ, then under the conditions of Theorem 2, Ṽ − V is positive semi-definite.
Theorem 3 shows that the proposed estimator is asymptotically more efficient than the single imputation approach, as it incorporates information from incomplete cases during imputation. This result is intuitive because the proposed method incorporates more samples into the imputation process.

Diverging Number of Parameters
In this subsection, we consider the setting where the sample size n and the number of coefficients p increase simultaneously. For certain properties to remain true, we require that n increase faster than p. We replace the notation p by p_n to indicate that the number of parameters also increases, and we make assumptions D.1-D.3. From the result of Theorem 4, we find that consistency still holds for the proposed estimator, even with a diverging number of parameters. Not surprisingly, the convergence rate is no longer √n but √(n/p_n). We also require that p_n not increase faster than n^{1/4} to ensure that the model remains sparse, that is, that the majority of the coefficients are zero.
Theorem 5 states the sparsity of the proposed estimator with a diverging number of parameters.This property guarantees that the proposed method can still select the true model with a probability approaching 1, even when the number of parameters is diverging.

Implementation
Since directly minimizing the objective function is difficult, we adopt an iterative procedure inspired by the implementation in [27], which combined the minorization-maximization algorithm [28] with the Newton-Raphson algorithm. Given the current estimate β^(t) and tuning parameter λ_n, the objective function S(β) can be locally approximated (up to a constant term) by a quadratic function (2), where ϵ is a sufficiently small number (e.g., ϵ = 10^{-6}). Thus, the search for an estimator minimizing the objective function is equivalent to finding an estimator that minimizes (2). Notice that both Q̇(β^(t)) and Q̈(β^(t)) are unknown. Fortunately, from Lemma S2 in the Supplementary Materials, Q̇(β^(t)) can be approximated by (3) and Q̈(β^(t)) can be approximated by (4). Plugging (3) and (4) into (2) and applying the Newton-Raphson algorithm, we obtain a closed-form formula to update β^(t+1). We repeat the above procedure until ∥β^(t+1) − β^(t)∥ is smaller than a pre-specified threshold or a pre-specified maximum number of iterations is reached.
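The iteration can be sketched for a generic smooth loss Q(β). This is a minimal illustration under our own simplifying choices: the SCAD derivative in `scad_deriv` is the standard Fan-Li form, the penalty is replaced by a local quadratic with curvature p'(|β_j|)/(|β_j| + ϵ), and `grad_fn`/`hess_fn` stand in for the approximations (3) and (4); the paper's Q is the GMM objective, so this is not its exact algorithm.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """First derivative of the SCAD penalty on |t| (Fan-Li form)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def lqa_newton(beta0, grad_fn, hess_fn, lam, n, eps=1e-6, tol=1e-8, max_iter=200):
    """MM + Newton-Raphson iteration for a SCAD-penalized objective.
    grad_fn/hess_fn return derivatives of the unpenalized loss; the penalty
    is replaced by a local quadratic with curvature p'(|b_j|)/(|b_j| + eps)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        E = np.diag(scad_deriv(beta, lam) / (np.abs(beta) + eps))
        step = np.linalg.solve(hess_fn(beta) + n * E,
                               grad_fn(beta) + n * E @ beta)
        beta_new = beta - step
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On a simple quadratic loss, the iteration leaves a large coefficient (in the flat SCAD region) untouched while driving a small coefficient toward zero, mimicking the selection behavior described above.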
It is known that the sampling covariance matrix C(β) may be singular in some cases [29].To overcome the difficulty in computing the inverse of C(β), we adopt the Moore-Penrose generalized inverse, which exists and is unique for any matrix.
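In NumPy, for instance, the Moore-Penrose generalized inverse handles a singular C(β) where the ordinary inverse would fail (the matrix below is a made-up rank-deficient example).

```python
import numpy as np

# A rank-deficient "sample covariance"-type matrix: the ordinary inverse
# does not exist, but the Moore-Penrose generalized inverse always does.
C = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # rank 1, singular
C_pinv = np.linalg.pinv(C)
```

The pseudoinverse satisfies the Penrose conditions (e.g., C C⁺ C = C) and is unique, so the objective function remains well defined even when C(β) is singular.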
In the implementation of the proposed method, we select the tuning parameter λ_n with the extended Bayesian Information Criterion (EBIC) proposed by [30], where df_{λ_n} is the number of parameters of the model with tuning parameter λ_n and RSS = ∑_{k=1}^K RSS_k is the residual sum of squares aggregated over all missing patterns.
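As a sketch of the selection step, one common form of the EBIC for a Gaussian working model is shown below; the paper's exact expression could not be fully recovered here and may differ in constants, so this is an assumption, not the authors' formula.

```python
import numpy as np

def ebic(rss, n, df, p, gamma=0.5):
    """One common form of the EBIC for a Gaussian working model (assumed):
    n*log(RSS/n) + df*(log n + 2*gamma*log p)."""
    return n * np.log(rss / n) + df * (np.log(n) + 2 * gamma * np.log(p))
```

The extra 2γ log p term penalizes model size more heavily than the ordinary BIC when p is large, favoring sparser models among candidate values of λ_n.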

Simulation
In this section, we conduct a simulation study to compare the performance of the proposed method in variable selection against complete case analysis (CC) with the SCAD penalty, single imputation (SI) with the SCAD penalty, and the penalized generalized estimating equation (PGEE). We use the same data structure as shown in Figure 1, with three missing patterns and three sources. The number of measurements is set to three throughout this section. We replicate the simulation 100 times and use the false positive rate (FPR) and false negative rate (FNR) to evaluate the performance of each method; these reflect the proportion of covariates that are irrelevant but falsely selected and the proportion of covariates that are relevant but fail to be selected, respectively. In the tuning parameter selection procedure, the parameter γ in EBIC was set to 0.5. At the end of the iterative algorithm in Section 4, an estimated coefficient is considered zero if its absolute value is smaller than 0.01.
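The two evaluation metrics, with the 0.01 truncation rule above, can be computed with a small helper of our own:

```python
import numpy as np

def fpr_fnr(beta_hat, beta_true, tol=0.01):
    """Selection error rates, treating |coefficient| <= tol as not selected."""
    sel = np.abs(beta_hat) > tol         # variables the method selects
    rel = np.abs(beta_true) > 0          # truly relevant variables
    fpr = sel[~rel].mean() if (~rel).any() else 0.0
    fnr = (~sel)[rel].mean() if rel.any() else 0.0
    return fpr, fnr
```

A low FPR with a low FNR means the method keeps most relevant variables while excluding irrelevant ones, which is the comparison reported in Tables 1 and 2.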
In the first setting, we simulate a dataset with a small proportion of complete cases, where n_1 = 40, n_2 = 120, n_3 = 120, and the missing rate is around 87%. The data with a continuous outcome are generated from a linear model, where j = 1, . . ., 3, X_ij = (x_ij,1, . . ., x_ij,30)^T is a vector consisting of 30 covariates, and β = (1, 2, 0, . . ., 0, 1, 2, 0, . . ., 0, 1, 2, 0, . . ., 0)^T. Here, each source consists of 10 covariates, with the first two covariates in each source having non-zero coefficients. x_ij,1 is a time-fixed covariate generated from the standard normal distribution, whereas the other covariates are time-varying and follow a multivariate normal distribution with mean zero and an exchangeable covariance matrix with marginal variance 1 and correlation coefficient 0.5. We generate the random error ε_i from the multivariate normal distribution with mean 0 and an exchangeable covariance matrix with marginal variance 1 and correlation coefficient ρ. We always assume the true within-cluster correlation structure is known and consider ρ equal to 0.3, 0.5, and 0.7 in each setting, corresponding to mild, moderate, and strong within-cluster correlation. Each subject is assigned a weight ϕ_i that increases with the baseline covariate values from source 1. Then, n_1, n_2, and n_3 samples are sequentially drawn with probability proportional to ϕ_i and assigned to pattern 1, pattern 2, and pattern 3, respectively. Thus, subjects with higher covariate values from source 1 at baseline are more likely to be assigned to pattern 1, followed by pattern 2 and then pattern 3. This data generating process implies a MAR mechanism for the missing covariates. Table 1 summarizes the performance of each method for the three values of ρ. All of these methods effectively control the FNR. However, the FPR of the proposed method is lower than that of the other three methods. In other words, the proposed method is able to select most of the relevant variables while controlling the error of selecting irrelevant variables. In addition, we notice that the proposed method is more capable of utilizing within-cluster correlation than PGEE, since the proposed method performs better as the within-cluster correlation becomes stronger. This result demonstrates the superiority of the proposed method when the percentage of complete cases in the block-wise missing data is small.
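The within-cluster error structure used throughout the simulations can be generated as follows; the sample size and seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2024)    # arbitrary seed

def exchangeable_cov(dim, rho, var=1.0):
    """Exchangeable covariance: marginal variance `var`, correlation `rho`."""
    return var * ((1 - rho) * np.eye(dim) + rho * np.ones((dim, dim)))

n, m, rho = 500, 3, 0.5              # subjects, measurements, correlation
# Correlated errors across the m repeated measurements of each subject.
eps = rng.multivariate_normal(np.zeros(m), exchangeable_cov(m, rho), size=n)
```

Every pair of measurements within a subject shares the same correlation ρ, which is exactly the exchangeable structure assumed for ε_i in the settings above.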
In the second setting, we continue to investigate the proposed method's performance with a continuous outcome, but we proportionally increase the sample size in each missing pattern to demonstrate the proposed method's effectiveness in larger samples, with n_1 = 120, n_2 = 300, n_3 = 300. The results are described in Table 2. Unsurprisingly, the FPR and FNR of all the methods decrease compared with the first setting. We observe that the performance of PGEE is very close to that of the single imputation method, while the proposed method has a much lower FPR. Meanwhile, complete case analysis is still the worst option, since its improvement is minor as the sample size increases and even negligible when the within-cluster correlation is strong. Therefore, the proposed method is still able to maintain an appealing performance at a large sample size. The results from this setting further verify the efficiency gain of the proposed method, which incorporates more information from the missing data than the single imputation.
In the third setting, we consider a correlated binary outcome with n_1 = 120, n_2 = 300, and n_3 = 300. The data are generated from the model, where j = 1, . . ., m, X_ij = (x_ij,1, . . ., x_ij,15)^T is a vector consisting of 15 covariates, and β = (1, 0, . . ., 0, −0.7, 0, . . ., 0, 0.5, 0, . . ., 0)^T. Here, each source consists of five covariates, with the first covariate in each having a non-zero coefficient. x_ij,1 is a time-fixed covariate generated from the standard normal distribution, whereas the other covariates are time-varying.
The random split process was replicated 30 times. A variable is marked as a potential predictor of AD if its absolute coefficient value is greater than 0.01. Table 5 summarizes the average number of biomarkers selected by each method, along with the most frequently selected biomarkers. We also report the post-model-selection p-value. Our method successfully identifies biomarkers that align with findings reported in the existing Alzheimer's Disease research literature. In contrast to PGEE, the other three methods consistently select amyloid-β as a biomarker of AD; its accumulation in cells is an early event of AD [34]. Phosphorylated tau, another widely accepted biomarker, has been validated by multiple large-scale, multi-center studies [35]. Studies have found that neurons in AD patients are more likely to be lost in the superior temporal sulcus [36]. Two distinct normalization methods of summary measures for the standardized uptake value ratio (SUVR) of the florbetapir tracer, in the composite reference region and the whole cerebellum reference region, may potentially serve as AD biomarkers [37]. Besides these biomarkers, the proposed method additionally identifies several well-established and potential biomarkers. The size of the region of interest (ROI) in the left and right hemisphere precuneus area of the cortex, as well as the cortical volume of the left precuneus, summarize the health status of the precuneus, which may atrophy in the early stage of AD. The size and volume of the ROI in the left and right inferior lateral ventricle reflect disease progression (Bartos et al. [38]; Song et al. [39]). Although white matter changes in cerebral or subcortical areas can appear in other neurological conditions and in normal aging, their connections with AD potentially make them useful biomarkers for distinguishing AD from normality, especially when considered along with other biomarkers in future investigations. While the surface area of the left caudal middle frontal cortex and the cortical volume of the right caudal anterior cingulate are both associated with AD, more research is required to further explore these associations.

Figure 1 .
Figure 1. Example of block-wise missing data in longitudinal studies.

Figure 2 .
Figure 2. Two imputation approaches for missing covariates of source 3 in pattern 2. In the left figure, samples from pattern 1 and covariates in source 1 and source 2 are used to train the model, which is subsequently used to predict the missing covariates in pattern 2. Similarly, in the right figure, samples from pattern 1 and pattern 3 and covariates in source 1 are used to train the model.

Theorem 4.
D.1 and D.2 are analogous to C.1-C.4, and D.3 is the modification of C.5 for a diverging number of parameters. Under D.1-D.3, if p_n = o(n^{1/4}), there exists a local minimizer β̂ of S(β) such that ∥β̂ − β∥ = O_p(√(p_n/n)). (In conditions C.1-C.2, the imputation index satisfies 1 ≤ r ≤ R_k, the inner expectation is taken with respect to the imputed values, and all the variance matrices A_{k,i} are required to exist.)

Theorem 2 .
Under C.1-C.5, if λ_n → 0 and there exists a sequence a_n such that λ_n √n / a_n → ∞ as n → ∞, where a_n = o_p( ).

Table 4 .
Data composition and missing pattern for the subset of ADNI data; "O" denotes the observed data and "-" denotes the missing data.

Table 5 .
Comparison of the mean number of selected biomarkers (MNSB), whose absolute coefficient value is greater than 0.01, based on 30 replications in the application to the ADNI data. Time is the computation time in seconds.