Confounder Adjustment in Shape-on-Scalar Regression Model: Corpus Callosum Shape Alterations in Alzheimer’s Disease

: Large-scale imaging studies often face challenges stemming from heterogeneity arising from differences in geographic location, instrumental setups, image acquisition protocols, study design, and latent variables that remain undisclosed. While numerous regression models have been developed to elucidate the interplay between imaging responses and relevant covariates, limited attention has been devoted to cases where the imaging responses pertain to the domain of shape. This adds complexity to the problem of imaging heterogeneity, primarily due to the unique properties inherent to shape representations, including nonlinearity, high-dimensionality


Introduction
Multi-site neuroimaging data integrative analysis is becoming of great interest so that establishing the relationship between imaging responses and covariates of interest from large-scale imaging studies, such as the Alzheimer's Disease Initiative (ADNI) study [1], can identify significant biomarkers for major neurological diseases, irrespective of any technical barriers.The need for imaging integration strategies arises as the differences in image acquisition protocols, experimental designs, and other unknown hidden factors could lead to invalid associations between biological variables of interest and false conclusions.Some statistical integration techniques have been developed and applied to neuroimaging data harmonization.ComBat was first developed to remove unwanted variations caused by batches (with small samples) in gene expression data [2].Recently, ComBat was applied to neuroimage data by shrinking the batch effect to the overall mean batch effect across voxels using the empirical Bayes framework.ComBat-GAM is an extension of Com-Bat where nonlinear trends of demographic features like age and sex are accommodated in the location-scale adjustment model by using a generalized additive model [3].Surrogate variable analysis is another widely used harmonization technique to estimate unknown hidden factors in gene expression studies [4].It was also applied to neuroimaging data that successfully identified brain disorder identification and predicted disease progression after removal of inter-site heterogeneity [5][6][7][8][9].Although these strategies have been developed to harmonize multi-site effect and increase power and reproducibility of statistical tests of various imaging data modalities, including diffusion tensor images [10], cortical thickness measurements [11,12], and functional magnetic resonance imaging (MRI) data [13,14], limited work has been conducted for imaging responses derived from the shape space.The shape is broadly defined to be a characteristic that is left after certain nuisance or shapepreserving transformations, such as rotations, translations, and scale, are removed [15][16][17], with the result that shape representation spaces are nonlinear, high-dimensional, and have quotient space geometry [18].Therefore, the conventional normality assumptions often applied to imaging responses in existing image-on-scalar regression models cannot be directly extended to shape data.Furthermore, most existing harmonization approaches fail to provide inference tools to investigate the model uncertainty [2,3].
This paper aims to introduce a shape-on-scalar regression model that incorporates confounder adjustment.In particular, we leverage the square root velocity function to extract elastic shape representations, which are embedded within the linear Hilbert space of square integrable functions.Subsequently, we introduce a shape regression model aimed at characterizing the intricate relationship between elastic shapes and covariates of interest, all while effectively managing the challenges posed by imaging heterogeneity.We develop comprehensive procedures for estimating and making inferences about the unknown model parameters.To demonstrate our method, we consider the corpus callosum contour (CC) shape data from ADNI study and investigate its relationship with some covariates of interest.CC is the largest white matter tract containing millions of nerve fibers that connect the left cerebral hemisphere to the right hemisphere.It facilitates cognitive functions like attention, memory, learning, reasoning, planning, and problem solving.Studies on Alzheimer's disease (AD) consistently show that the CC undergoes notable changes in the early stages of the disease.These alterations in both anterior and posterior CC sections correlate with cognitive decline.Specific MRI measurements of the CC are associated with this cognitive deterioration.Overall, patterns of callosal deformation are emerging as significant biomarkers for diagnosing and tracking AD progression [19][20][21][22][23].
The paper is organized as follows.Section 2 introduces a shape-on-scalar regression model with confounder adjustment and outlines the estimation and inference procedures for the model parameters.In Section 3, we apply our method to the CC shape dataset in the ADNI study.

Preliminaries and Notations
We suppose that we observe both imaging data and covariates of interest, including clinical variables and demographic information, from n unrelated subjects.Instead of the whole brain image, we are interested in the contour of the planar CC.We let L i be a m × 2 matrix with m landmarks representing the contour of the ROI in R 2 .In addition, we let X be an n × p full column rank matrix of observed covariates, including the intercept.We let e ⊗2 = ee for any vector e, A ⊗ B be the Kronecker product of two matrices A and B, and diag(•) represent a diagonal matrix with all the elements on the diagonal.

Shape-on-Scalar Regression Model
Given the landmarks L i from the contour of planar CC, we first derive the coordinate functions, f i (t) .= ( f i,1 (t), f i,2 (t)) with f i,1 (t) and f i,2 (t) in the x-axis and the y-axis, respectively, in Kendall's shape space, where shapes are invariant to shape-preserving transformations, e.g., rotation, translation, and scaling.Specifically, we remove these nuisance transformations from the landmarks L i via applying some preprocessing steps (details can be found in [24]).After that, as illustrated in [25], we derive the square root velocity function (SRVF) representations for the ith subject: According to the results in [25], if the functions in f i (t) are absolutely continuous, then the corresponding SRVF representations are square integrable, i.e., g i (t) ∈ L 2 ([0, 1], R 2 ), i = 1, . . ., n.Under this representation, we can further determine the optimal registration (or re-parameterization) group action for each SRVF representation, which corresponds to the following optimization problem: where µ(t) is a template, such as the mean of {g i (t . Then, we can obtain the aligned SRVF representations as follows: To investigate the relationship between shape representations and some covariates of interest while handling the heterogeneity introduced by unobserved confounders, we consider the following shape-on-scalar regression model: where Ψ(t) is a n × 2 matrix, including the shape representations, with the jth column In practice, we assume that the shape responses are observed at n v grid points, denoted as t 1 , . . ., t n v .For the functional coefficients, ) is a p × 2 matrix that represents the primary effect associated with X.For the unobserved terms, that represents the functional hidden confounders related to X, η(t) = (η .1 (t), η .2(t)) is a n × 2 matrix, independent of X, representing both subject-specific and location-specific spatial variability in the shape representations, and (t) = ( .1 (t), .2(t)) is a n × 2 matrix including the measurement errors.It is assumed that each row in η(t) and (t) is a mutually independent and identical copy of SP(0, Σ η ) and SP(0, Σ ), respectively.Here, SP(µ, Σ) denotes a stochastic process with mean function µ(t) and covariance function Σ(s, s ).
Moreover, Σ (t, t ) is in the form of Ω (t)1(t = t ), where Ω (t) is a diagonal matrix and 1(•) is an indicator function.

Estimation Procedure
To estimate the coefficient functions B(t) and functional hidden confounders Λ(t) in (4), we follow three main steps: (i) orthogonal decomposition of functional hidden confounders; (ii) rank-q representation of functional hidden confounders; and (iii) bias correction of the estimated coefficient functions.
(i) Orthogonal decomposition of functional hidden confounders.We apply the orthogonal decomposition of Λ .j(t) onto the columns of X and reparametrize (4) as where , and P X = X(X X) −1 X .Then, we can obtain the weighted least squares estimate of β * j (t) by using the local linear kernel smoothing method in [26,27].We let , where K(•) is the kernel function and h β is the bandwidth matrix.Then, the WLS estimator of β * j (t) can be written as where The local linear estimator we obtain in ( 6) is a biased estimator.We correct this bias using the preasymptotic substitution method in [27].We find the bias term by fitting a local cubic with a pilot bandwidth selected in (6).Moreover, we find the estimate of Λ .j(t).Finally, we can subtract the estimate (X X) −1 X Λ .j(t) from β * j (t) to achieve the unbiased estimator β j (t).
(ii) Rank-q representation of functional hidden confounders.In order to estimate Λ(t), we consider a rank-q representation of Λ * (t), i.e., ZA(t), where Z is a n × q loading matrix and A(t) = (α 1 (t), α 2 (t)) is a q × 2 matrix representing the corresponding effect.To estimate the loading matrix Z, we first compute the residuals from the first step as R .j(t) = Ψ .j(t) − X β j (t) where β j (t) is the bias-corrected version of β * j (t).Then, we write the extended residual matrix as a n × 2n v matrix denoted by R = (R .1 . Given X and Λ(t), we can write the conditional expectation of R as [28] E where Ā = (α 1 (t 1 ), . . . ,α 1 (t n v ), α 2 (t 1 ), . . ., α 2 (t n v )).Therefore, the loading matrix Z can be estimated through the singular value decomposition (SVD) on the extended residual matrix R, i.e., R = U∆V .U and V represent the left and right singular vectors and ∆ represents the diagonal matrix containing ordered singular values of R. The first q columns of U represented as U 1:q are the estimators of linear combinations of columns of Z [28].Then, there exists a q × q orthonormal matrix Q such that U 1:q = ZQ + o p (1) and Zα j (t) = U 1:q δ j (t), where δ j (t) = Q α j (t), j = 1, 2.
(iii) Bias correction of the estimated coefficient functions.We treat the residual terms in Step (ii) as functional responses and derive the estimate of δ j (t) through the new constructed varying coefficient model, i.e., R .j(t) = U 1:q δ j (t) + η.j (t) + ˜ .j(t), where η.j and ˜ .jare defined the same way as η .jand .j .For a fixed bandwidth h δ , we can derive the estimator of δ j (t), as U 1:q ∑ n v k=1 a k (h δ , t)R .j(t k ), and the bias-corrected version, denoted as δ j (t).Then, we can construct the estimating equation: where B(t) = ( β1 (t), β2 (t)) and D(t) = ( δ 1 (t), δ 2 (t)).In addition, assuming that the row vectors of B(t) and the row vectors of Λ(t) are orthogonal after mean centering, we can derive the estimator of G as G = U 1:q + X 1 0 B(t)P D (t)dtΩ −1 , where Ω = 1 0 D(t)P D (t)dt, P = I 2 − 1 2 (1 2 1 2 ) −1 1 2 , and 1 2 = (1, 1) .Finally, the bias-corrected estimator of B(t) is given by Remark 1. First, to derive the estimated covariance function of η(s), we smooth the individual functions η(s) via applying the method in [26] on the residuals.Second, to select the optimal bandwidth in B(s) and D(s), we use the leave-one-curve-out cross-validation, whereas for the optimal bandwidth in η(s), we use the generalized cross-validation score method [29,30].Third, the rank q in Step (ii) is unknown in practice.According to the simulation and empirical studies in [28], we consider the eigenvalue difference method [31] in our real-data analysis.

Hypothesis Testing
We consider the hypothesis for coefficients B(t) of Model (4) as where C is a r × 2p matrix of coefficients with rank r and vec(•) represents the vectorization function.The Wald global test statistic is then calculated as follows: where ζ(t) = Cvec( B(t)), M = (I p , 0 q×q )( W W) −1 W , and W = (X, G).Instead of the complicated asymptotic distribution under H 1 , we consider a wild bootstrap method discussed in [28] to obtain the null distribution of global test statistic T n , which consists of four steps: 1.
Fit Model (4) on X and Ψ(t k ) n v k=1 under H 0 and compute all the coefficients B(t), G, D(t), η(t), (t), and the global test statistic T n .2.
Generate independent random vectors τ (m) and τ (m) (t k ) from standard normal distribution N(0, I n ) for k = 1, . . ., n v and generate Remark 2. Given the elastic shape representations, the asymptotic properties of the estimated functions, including B(t) and Λ(t), the asymptotic distribution of the global test statistic T n (11) under the null hypothesis, and its asymptotic power under local alternative hypotheses have been systematically investigated in our recent work [28].Due to space limitations, we do not claim these theoretical results here.Readers who are interested in them can find the assumptions used to facilitate the technical details and the detailed proof in the Supplementary Material of [28].

Data Description and Processing
Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu(accessed on 27 September 2023)).The ADNI was launched in 2003 by the National Institute on Aging, National Institute of Biomedical Imaging and Bioengineering, Food and Drug Administration, private pharmaceutical companies and non-profit organizations as a USD 60 million, 5-year public-private partnership.The primary goal of ADNI is to test whether serial magnetic resonance imaging (MRI), positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians in developing new treatments and monitoring their effectiveness, as well as lessening the time and cost of clinical trials.The principal investigator of this initiative is Michael W. Weiner, MD, at the VA Medical Center and University of California, San Francisco.ADNI is the result of efforts of many coinvestigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada.The goal was to recruit 800 subjects, but the initial study (ADNI-1) is followed by ADNI-GO and ADNI-2.To date, these three protocols have recruited over 1500 adults, ages 55 to 90, to participate in the research, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD.The follow-up duration of each group is specified in the protocols for ADNI-1, ADNI-2 and ADNI-GO.Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2.For up-to-date information, see www.adni-info.org(accessed on 27 September 2023).
We consider n = 707 MRI scans from both normal controls and individuals with mild cognitive impairment (MCI) or AD in the ADNI-1 study.The scans, which were performed on a variety of 1.5 Tesla MRI scanners with protocols individualized for each scanner, include standard T1-weighted images obtained using volumetric three-dimensional sagittal MPRAGE or equivalent protocols with varying resolutions.To obtain the contour of planar CC, we use FreeSurfer [32] to process each T1-weighted MRI, including motion correction, non-parametric non-uniform intensity normalization, affine transform to the MNI305 atlas, intensity normalization, skull-stripping, and automatic subcortical segmentation.Some quality control procedures are performed on each output image data.Then, through package CCSeg [33], each T1-weighted MRI image and tissue segmentation result is used to extract the planar CC contour data on the midsagittal slice, which contains 100 landmarks (Figure 1a-b).Given the coordinate functions of landmarks (Figure 1c), we extract the aligned SRVF shape representation.The resulting aligned SRVFs ψ i (t) is shown in Figure 1d.

Data Analysis
Research indicates that factors such as age, sex, handedness, and educational level can influence the structure of CC, which is linked to cognitive development [34][35][36][37].Additionally, the APOE4 gene is not only linked to brain structural and functional variations across a wide age range, but is also recognized as a hereditary risk for declining cognitive ability and Alzheimer's disease [38,39].Given the CC elastic shape responses, we are interested in their relationship with the demographic and clinical variables such as age, gender, handiness, education level, APOE-4, and one SNP variant, rs11719939 from chromosome 3, which is close to the AD-related high-risk gene, ATP2B2 [40,41].The variable APOE-4 is coded as the number of alleles (zero, one, and two: zero and two represent the homozygous genotype, and one represents the heterozygous genotype).In addition, the population stratification is addressed by including the top two principal components (PCs) computed from the whole SNP data.Table 1 summarizes the demographic and genetic information of all the subjects.This study aimed to determine the effects of different covariates of interest on the CC shape alterations in the presence of unobserved confounding factors.We fitted our model with the real dataset, and estimated the regression coefficients and hidden factors.In particular, we used an eigenvalue difference method [31] to determine the number of hidden factors, q, in our model.We employed the wild bootstrap method and generated 500 bootstrap samples to derive the empirical null distribution of Wald's global test statistic and calculated the p-value for each regression coefficient function.For comparison, we considered two competing methods, i.e., the multivariate varying coefficient model (MVCM) developed in [30] and ComBat [2], which uses empirical Bayes estimates to remove the site effect.
According to the testing results in Table 2, our method detected significant effects related to all the covariates of interest, including age, gender, handiness, APOE-4, education level, two genetic PCs, and the causal SNP rs11719939, on the CC shapes.In particular, our model detected not only significant effects related to age, gender, and SNP, which are in agreement with the other two methods, but also the effects related to handiness, APOE-4, genetic PCs, and education, which were not fully detected at all by MVCM and ComBat.In Table 3, we observed that gender, age, APOE-4, and SNP are significantly correlated with the first hidden factor, which indicates that the confounding factors are correlated with primary variables.This could be a potential cause of additional heterogeneity in elastic shape data.Therefore, this correlation must be accounted for while estimating the varying coefficient functions of the model.In the second part of our analysis, we considered the fivefold cross-validation strategy and computed the estimation error for the CC shape data using all three methods (Table 4).Our method outperformed both MVCM and ComBat in terms of both root-mean-squared error (RMSE) and mean absolute error (MAE), which indicates that our method successfully detected the potential hidden factors and captured the relationship between the elastic shape responses and covariates of interest better than the other two methods.

Discussion
In this paper, we proposed a shape-on-scalar regression model that incorporates confounder adjustment.In particular, we leveraged the square root velocity function to extract elastic shape representations, which are embedded within the linear Hilbert space of square integrable functions.Subsequently, we introduced a shape regression model aimed at characterizing the intricate relationship between elastic shapes and covariates of interest, all while effectively managing the challenges posed by imaging heterogeneity.We developed comprehensive procedures for estimating and making inferences about the unknown model parameters.Through the real-data analysis, our method demonstrated its superiority in terms of estimation accuracy when compared to existing approaches.
In our case study on ADNI CC shape data, compared to other competing methods, i.e., MVCM and ComBat, our method identified a potential significant effect related to the education level, which is consistent with the existing literature [42], indicating that having ongoing learning experiences could be a important prevention strategy for cognitive impairment diseases such as AD.Our method also showed significant effects related to age, gender, and SNP on CC shape alterations.
Although our method demonstrated its success in the case study on ADNI CC shape data, there are couple of issues to be addressed.First, as discussed in [28], the key assumption of our method requires the row vectors of B(t) and the row vectors of A(t) to be orthogonal with respect to the underlying density function p(t) after mean centering.Actually, this assumption is reasonable but difficult to examine in practice.Therefore, it is of great importance to improve the performance of our method even if this assumption does not hold.Second, this paper only investigated the linear relationship between the shape responses and other observed and/or hidden covariates.However, this assumption is not reliable when the nonlinear relationship exists in real applications.Therefore, we will find a scope to explore the nonlinear relationship between the covariates and the elastic shape response while adjusting for hidden factors.

Figure 1 .
Figure 1.Preprocessing procedures for shape mediator: (a) raw MRI brain images with 2D CC segmented at the middle sagittal slice; (b) 50 landmarks sampled on the 2D contour of CC with L i,x and L i,y representing the x and y coordinate of CC respectively; (c) coordinate functions of landmarks with rotation, translation, and scaling removed; and (d) aligned SRVF representation of shape mediators.

Table 2 .
Hypothesis testing of estimated varying coefficients functions β(s).

Table 3 .
Correlation between hidden factors and primary variables.

Table 4 .
Estimation error calculated for the three competing methods.