Abstract
Analyzing treatment or exposure effects is a major research theme in scientific studies. In the current big-data era, where multiple sources of data are available, it is of interest to perform a synthesized analysis of treatment effects by integrating information from different data sources or studies. However, studies may contain heterogeneous and incomplete covariate sets, and individual data therein may not be accessible. We apply and extend the generalized meta-analysis method to integrate summary results (e.g., regression coefficients) of outcome and treatment (propensity score, PS) regression analyses across different datasets that may contain heterogeneous covariate sets. The proposed integrated analysis utilizes a reference dataset, which contains data on the complete set of covariates. The asymptotic distribution of the proposed integrated estimator is established. Simulations show that the proposed estimator is essentially unbiased and that its confidence intervals attain close to nominal coverage. We apply the proposed method to estimate the causal effect of waist circumference on hypertension by integrating two existing outcome and PS regression analyses with different sets of covariates.
MSC: 62P10; 62J12
1. Introduction
The impact of a treatment or exposure on an outcome is a significant focus in many scientific fields. These include clinical trials [], education [], economics [], and medical [] fields. In observational studies, treatment assignment often correlates with subject characteristics, potentially leading to systematic differences between treated and control groups. This can result in biased conclusions when directly comparing these groups.
To perform causal inferences in observational studies, Rubin [] proposed the counterfactual or potential outcome framework. Suppose the treatment assignment Z is binary with $Z = 1$ denoting the treated group and $Z = 0$ the control group, and $Y(1)$ and $Y(0)$ are the potential outcome values a subject would have under treatment assignment $Z = 1$ and $Z = 0$, respectively. The outcome Y observed for a subject with treatment Z is then given by
$$
Y = Z\,Y(1) + (1 - Z)\,Y(0).
$$
A causal treatment effect for a subject is obtained as the difference of the two potential outcomes, $Y(1) - Y(0)$. Usually, only one potential outcome can be observed for a subject, so the subject-level causal effect is unobserved. However, under suitable assumptions, the average treatment effect (ATE)
$$
\mathrm{ATE} = E\{Y(1) - Y(0)\}
$$
is estimable from observational studies and has been a popular target of causal inferences. For a binary outcome, the ATE amounts to the marginal risk difference $P\{Y(1) = 1\} - P\{Y(0) = 1\}$. A key assumption for estimating the ATE with data from observational studies is that the treatment assignment is strongly ignorable []:
$$
\{Y(0), Y(1)\} \perp Z \mid X, \tag{1}
$$
where “⊥” stands for statistical independence; that is, the potential outcomes $Y(0)$, $Y(1)$ and the treatment assignment Z are independent conditional on the covariate set $X$. In other words, the treatment assignment Z is irrelevant to the values of the potential outcomes $Y(0)$ and $Y(1)$ once the covariates $X$ are controlled. Another key assumption is the stable unit-treatment value assumption (SUTVA), which ensures that $Y(z)$ is the unique outcome associated with $Z = z$ []. Under (1) and SUTVA, consistent estimation of the ATE can then be achieved by techniques such as matching, stratification, covariate adjustment (CA), or inverse probability of treatment weighting (IPTW) [,,,,].
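The role of the ignorability assumption can be illustrated with a small simulation. The sketch below is not taken from the paper: the data-generating probabilities, the single binary confounder `x`, and all function names are hypothetical choices made for illustration. It builds both potential outcomes, forms the observed outcome from them, and compares the naive group difference with a stratified (covariate-adjusted) estimate.

```python
import random

def simulate(n=20000, seed=1):
    """Simulate (x, z, y, y0, y1) with one binary confounder x (hypothetical setup)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        # Treatment is more likely when x = 1, so x confounds the comparison.
        z = 1 if rng.random() < (0.8 if x else 0.2) else 0
        # Potential outcomes depend on x only, hence are independent of z given x.
        y0 = 1 if rng.random() < 0.2 + 0.4 * x else 0
        y1 = 1 if rng.random() < 0.3 + 0.4 * x else 0
        y = z * y1 + (1 - z) * y0  # only one potential outcome is observed
        rows.append((x, z, y, y0, y1))
    return rows

def mean(values):
    return sum(values) / len(values)

rows = simulate()

# Sample ATE: computable here only because the simulation keeps both potential outcomes.
true_ate = mean([y1 - y0 for (_, _, _, y0, y1) in rows])

# Naive contrast of observed outcomes between treated and control subjects.
naive = (mean([y for (_, z, y, _, _) in rows if z == 1])
         - mean([y for (_, z, y, _, _) in rows if z == 0]))

# Covariate adjustment by stratification: within-stratum differences,
# weighted by the empirical distribution of x.
adjusted = 0.0
for level in (0, 1):
    stratum = [r for r in rows if r[0] == level]
    diff = (mean([r[2] for r in stratum if r[1] == 1])
            - mean([r[2] for r in stratum if r[1] == 0]))
    adjusted += diff * len(stratum) / len(rows)

print(f"true ATE {true_ate:.3f}, naive {naive:.3f}, adjusted {adjusted:.3f}")
```

In this toy setting the naive contrast overstates the effect because treated subjects are more often in the high-risk stratum, while the stratified estimate recovers a value close to the sample ATE.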
Rosenbaum and Rubin [] further proposed using the propensity score (PS) for conducting causal inferences. A PS is the probability $P(Z = 1 \mid X)$ of a subject being assigned to the treated group conditional on his/her observed covariates, and it is the coarsest balancing score: the covariate distributions are the same between treatment groups once the score is matched. Accordingly, a PS can replace the full covariate set $X$ used for matching, stratification, CA, or IPTW to conduct causal inferences on the treatment effect when assumption (1) holds (Rosenbaum and Rubin []). Since the PS is a univariate minimal sufficient statistic, the PS-based method can also be viewed as an effective dimension-reduction technique: matching, stratification, CA, or IPTW can be performed on a scalar PS rather than on a multi-dimensional covariate vector [,,,,]. Since its introduction, the PS method has gained tremendous popularity in observational studies; see Figure 1 of Simoneau et al. [].
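A minimal sketch of PS-based weighting may help fix ideas. It assumes a single discrete covariate so that the PS can be estimated by the treated fraction within each covariate level; the probabilities, the seed, and the variable names are hypothetical and not from the paper. The weighted difference uses normalized (Hájek-type) inverse probability weights, and the last lines check the balancing property of the estimated score.

```python
import random

rng = random.Random(7)
n = 20000
data = []
for _ in range(n):
    x = rng.choice([0, 1, 2, 3])          # one discrete covariate (hypothetical)
    ps_true = 0.2 + 0.2 * x               # P(Z = 1 | X = x), between 0.2 and 0.8
    z = 1 if rng.random() < ps_true else 0
    p_y = 0.1 + 0.1 * x + 0.15 * z        # P(Y = 1 | Z, X); true risk difference is 0.15
    y = 1 if rng.random() < p_y else 0
    data.append((x, z, y))

# Estimate the PS nonparametrically: treated fraction within each level of x.
ps_hat = {}
for level in range(4):
    zs = [z for (x, z, _) in data if x == level]
    ps_hat[level] = sum(zs) / len(zs)

def weighted_mean(pairs):
    """Mean of values v with weights w, for an iterable of (w, v) pairs."""
    pairs = list(pairs)
    return sum(w * v for w, v in pairs) / sum(w for w, _ in pairs)

# IPTW: weight treated subjects by 1/PS and controls by 1/(1 - PS).
treated = [(1.0 / ps_hat[x], y) for (x, z, y) in data if z == 1]
control = [(1.0 / (1.0 - ps_hat[x]), y) for (x, z, y) in data if z == 0]
ate_iptw = weighted_mean(treated) - weighted_mean(control)

# Balance check: after weighting, the mean of x agrees across treatment groups.
x_treated = weighted_mean((1.0 / ps_hat[x], x) for (x, z, _) in data if z == 1)
x_control = weighted_mean((1.0 / (1.0 - ps_hat[x]), x) for (x, z, _) in data if z == 0)
balance_gap = x_treated - x_control

print(f"IPTW ATE estimate {ate_iptw:.3f}, weighted covariate gap {balance_gap:.2e}")
```

With a continuous or multi-dimensional covariate set the PS would instead be estimated by a regression model such as logistic regression, which is the setting considered in the rest of the paper.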
In the big data era, there is growing interest in conducting integrated analyses of treatment effects by integrating information from databases across various observational studies [,]. However, combining data from different databases presents challenges due to variations in covariate variables, even if they share common outcome and treatment variables, and discrepancies in baseline data distributions. Moreover, stringent data privacy regulations such as the European Union’s General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) pose barriers to directly sharing, exchanging, and synthesizing individual-level data from disparate sources.
Because studies may contain heterogeneous and incomplete covariate sets, and individual-level data may not be accessible, we consider a framework for integrated regression analysis that utilizes summary results (e.g., regression coefficients) of outcome and treatment (PS) regression analyses across different datasets that may contain heterogeneous covariate sets.
This study is specifically motivated by two recent investigations into the relationship between waist circumference (WC) and hypertension (HT): Ren et al. [] and Hu et al. []. Each study explored how WC influences the risk of HT while accounting for a different set of covariates (detailed in Section 4). Our aim is to examine the impact of WC on HT while controlling for the comprehensive set of covariates obtained by combining the respective sets from both studies. To this end, we integrate the outcome regression results of the two studies on the relationship between WC and HT with the results of their propensity score analyses of WC as a treatment variable, and we use a dataset on the complete covariate set sourced from the Taiwan BioBank database. Methodologically, we apply and extend the generalized meta-analysis method developed by Kundu et al. [] to integrate the existing PS and outcome regression analyses with heterogeneous covariate variables using a dataset on the complete covariate set, which provides a reference to the covariate distribution and is termed the “reference dataset” by Kundu et al. []. We obtain consistent estimates for the parameters of the full PS and outcome regression models and for the ATE. Asymptotic theory for the estimators of the regression parameters and the ATE is also established. The proposed integrated analysis allows researchers to integrate summary results from existing outcome and PS regression analyses with heterogeneous covariate sets, and hence to obtain enhanced statistical efficiency and power from existing studies.
In Section 2, we describe the proposed estimation methods for the full PS and outcome regression models and the ATE. In Section 3, we report simulation results to show the satisfactory performance of the proposed method. Section 4 presents the empirical study on the effect of waist circumference on hypertension based on two existing studies and a reference sample of covariate data. Finally, concluding discussions are provided in Section 5.
2. The Proposed Method
In this section, we present an inferential framework that integrates results from existing studies and results from treatment (i.e., propensity score, PS) regression analyses, in which different and incomplete covariates may be present. We use the reference set on the full set of covariates to convert existing outcome and PS regression analyses into consistent estimates of the full outcome and PS regression model based on the full set of covariates, which in turn generate causal inferences about treatment effects.
2.1. The Existing Outcome and PS Regression Analyses
Suppose that there exist K independent studies, where the kth study contains independent data on  of size . Here, Y and Z are, respectively, the common outcome and binary treatment indicator of interest in the K studies, and  is the covariate set considered in study k and is some subset of , the complete covariate set satisfying the strong ignorability condition (1). Suppose that  and  are, respectively, the “reduced” outcome and PS regression models used in the kth study, where  and  are the respective vectors of regression coefficients in the reduced outcome and the PS models based on the reduced covariate set . Because the reduced PS and outcome models come from existing studies based on reduced sets of the covariate variables, we consider the general case where these models can be misspecified, so that the proposed method remains applicable even then. That is,  and  may not equal the true conditional distributions  and , respectively. Assume that the estimates of  and  are available for study . Individual data in these studies are not required in the proposed method.
Further, assume that the true distribution of Z given the complete covariate set  is given by the full PS model , where  is the vector of regression coefficients of , and hence the full PS is given by . Also, assume that the true distribution of Y given Z and  is given by the full outcome regression model , where  is the vector of regression coefficients of . To fix ideas, in the following we consider the underlying distributions  and  that follow the generalized linear models (GLMs) of Nelder and Wedderburn [], although the ideas extend readily to more general models. We assume that the underlying treatment and outcome distributions  and  are essentially the same across different studies, except that their baselines, namely the intercept parameters in  and , are allowed to vary across studies to accommodate possible differences in baseline data distributions among existing studies. To keep the notation simple, we do not make this explicit in the notation for the regression parameters.
2.2. Reference Data
In addition to the existing regression results, the proposed method needs a reference dataset with data on the complete covariate set.
Specifically, let  be the reference sample of n independent observations of . Since the intercept parameters in the treatment and outcome distributions  and  are already allowed to be different among the existing studies, the covariate distributions among the existing studies and the reference sample are assumed to be the same up to location shifts, namely these covariate distributions can have different means but become the same after mean removal. In fact, under the above assumptions, it is possible just to assume the covariate distributions among the studies and the reference dataset are the same, and to adjust the intercept parameters of the underlying treatment and outcome distributions  and  in each study, such that  equals the joint distribution of  in each study population. It is such adjusted full models  and  that are our estimation targets.
2.3. Estimation of the Full Propensity and Outcome Regression Models
The generalized meta-analysis method of Kundu et al. [] is a data integration method for combining information on parameters of various outcome regression models with disparate covariate sets. Assuming a common underlying data distribution, this method integrates the estimating equations for parameter estimates of various outcome regression models using the generalized method of moments approach [] and a reference dataset on the covariates to yield consistent estimation for the full outcome regression model.
We apply and extend the generalized meta-analysis method to estimate the parameters in both the full PS and outcome models  and  using the available estimates  and , from the existing K studies, as well as the reference dataset . Following the usual practice, for , we assume the estimates  and  are obtained by the maximum likelihood estimation (of the GLMs) based on the (reduced) models  and , , respectively. Let  and  be the score functions of the kth reduced PS and outcome models, respectively, and consider the expected scores
        
      
        
      
      
      
      
    
        Let
        
      
        
      
      
      
      
    
The estimator  for  is obtained by minimizing the objective function , or equivalently solving the estimating equation , where , . The matrix  is an arbitrary positive semi-definite weighting matrix, and the optimal choice of  (yielding the minimal asymptotic covariance of the resulting estimator) is [], where  is the covariance matrix of , and  arises from the covariances of , , which may be available from the existing studies or estimated using the reference dataset. The estimation of the matrices  and  is described in Appendix A.
Following arguments similar to those in Kundu et al. [],  is asymptotically normal with zero mean and covariance matrix given by , where  can be estimated by its sample analog in the reference dataset, and the matrices  and  are estimated as described in Appendix A using the final estimator  for .
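The estimating functions of this section can be sketched numerically for the PS model. The toy example below is an illustration under stated assumptions, not the authors' implementation: the full PS model is a logistic model in two covariates with hypothetical coefficients `GAMMA_TRUE`, each of two studies fits a reduced logistic model on a single covariate, the intercept is taken as common to both studies for simplicity, and the weighting matrix is the identity rather than the optimal choice. For a logistic reduced model, the expected score under the full model reduces to the reference-sample average of (full-model probability minus reduced-model probability) times the reduced design vector.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic_1d(xs, zs, iters=25):
    """Newton-Raphson MLE for the reduced model logit P(Z=1|x) = a + b*x."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, z in zip(xs, zs):
            p = expit(a + b * x)
            g0 += z - p                 # score for the intercept
            g1 += (z - p) * x           # score for the slope
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

GAMMA_TRUE = (-0.5, 1.0, 0.5)  # hypothetical full PS coefficients (intercept, x1, x2)
rng = random.Random(11)

def draw_study(n):
    """Individual-level data (x1, x2, z) generated from the full PS model."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        p = expit(GAMMA_TRUE[0] + GAMMA_TRUE[1] * x1 + GAMMA_TRUE[2] * x2)
        rows.append((x1, x2, 1 if rng.random() < p else 0))
    return rows

study1, study2 = draw_study(5000), draw_study(5000)
# Only these reduced-model coefficient estimates leave the studies.
reduced = [
    fit_logistic_1d([r[0] for r in study1], [r[2] for r in study1]),  # uses x1 only
    fit_logistic_1d([r[1] for r in study2], [r[2] for r in study2]),  # uses x2 only
]
# Reference sample: complete covariates, no treatment or outcome data.
ref = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]

def moments(gamma, ref, reduced):
    """Reference-sample average of the expected reduced-model scores under gamma."""
    g0, g1, g2 = gamma
    u = [0.0] * 4
    for x1, x2 in ref:
        p_full = expit(g0 + g1 * x1 + g2 * x2)
        for k, xk in enumerate((x1, x2)):
            a, b = reduced[k]
            resid = p_full - expit(a + b * xk)
            u[2 * k] += resid
            u[2 * k + 1] += resid * xk
    return [v / len(ref) for v in u]

def objective(gamma, ref, reduced):
    """Quadratic form of the stacked moments with an identity weighting matrix."""
    return sum(v * v for v in moments(gamma, ref, reduced))

q_true = objective(GAMMA_TRUE, ref, reduced)
q_wrong = objective((-0.5, 0.0, 0.5), ref, reduced)  # drops the x1 effect
print(f"objective at true value {q_true:.5f}, at perturbed value {q_wrong:.5f}")
```

The objective is close to zero at the data-generating coefficients and clearly larger at the perturbed value. In practice one would pass `objective` to a numerical optimizer and replace the identity weights with an estimate of the optimal weighting matrix.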
We conclude this section by noting that the covariance matrices of the regression parameter estimates from the existing studies are not necessary for implementing the proposed method. However, when the covariance matrix for the regression parameter estimates is unavailable and the outcome regression model in an existing study contains a dispersion parameter not fixed at 1 (e.g., the normal regression model), an estimate of that dispersion parameter is required so that the proposed method can estimate the covariance matrix of the regression parameter estimates (see Appendix A for details).
2.4. Estimation of the Average Treatment Effect (ATE)
Let . By the SUTVA and the strong ignorability assumption (1),  corresponds to the conditional average treatment effect (CATE) at , and  corresponds to the ATE.
Let . Given the estimate  for the full outcome regression parameter obtained in Section 2.3, we can estimate  by , the CATE  at  by , and the ATE by the reference sample data  via
        
      
        
      
      
      
      
    
The consistency of the proposed ATE estimator  follows directly from the consistency of . Also, using the delta method, we can obtain the asymptotic normal distribution of ; details are provided in Appendix B.
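As an illustration of the plug-in step, the sketch below averages the predicted outcome difference between treatment and control over a reference sample, assuming a logistic full outcome model with main effects only. The function name `ate_from_reference` and all coefficient values are hypothetical; in the proposed method the coefficients would come from the estimator of Section 2.3.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def ate_from_reference(beta0, beta_z, beta_x, ref):
    """Average, over reference covariates, of the predicted risk difference
    P(Y=1 | Z=1, x) - P(Y=1 | Z=0, x) under a logistic outcome model."""
    total = 0.0
    for x in ref:
        lin = beta0 + sum(b * v for b, v in zip(beta_x, x))
        total += expit(lin + beta_z) - expit(lin)  # conditional effect at x
    return total / len(ref)

rng = random.Random(3)
# Reference sample with two covariates; no outcome or treatment data are needed.
ref = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]

ate_null = ate_from_reference(-1.0, 0.0, (0.5, 0.5), ref)  # no treatment effect
ate_pos = ate_from_reference(-1.0, 0.7, (0.5, 0.5), ref)   # positive log odds ratio
print(f"ATE with zero coefficient {ate_null:.3f}, with coefficient 0.7: {ate_pos:.3f}")
```

Because the intercept is allowed to differ across studies, the same function would be called once per study with that study's intercept, which is why the application reports a separate ATE for each study population.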
3. The Simulation Studies
In this section, we conduct simulations to assess the performance of the proposed estimators for the regression coefficients of the underlying outcome and treatment distributions, as well as for the ATE. We report the bias, standard deviation (SD), mean of estimated standard errors (ESE), and coverage probabilities (CP1 and CP2) of the  and  Wald-type confidence intervals, respectively, across 2000 simulation replications.
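The summary measures above can be computed from replicated estimates as in the following sketch. The `summarize` helper is hypothetical, the 95% level is an assumed default because the confidence levels are not recoverable from the text, and the toy estimator (a sample mean) merely stands in for the proposed estimator.

```python
import random
import statistics

def summarize(estimates, std_errors, truth, z_crit=1.959964):
    """Bias, SD, mean estimated SE, and Wald-interval coverage over replications."""
    bias = statistics.fmean(estimates) - truth
    sd = statistics.stdev(estimates)            # simulation standard deviation
    ese = statistics.fmean(std_errors)          # mean of estimated standard errors
    covered = sum(abs(e - truth) <= z_crit * s
                  for e, s in zip(estimates, std_errors))
    return {"bias": bias, "sd": sd, "ese": ese, "cp": covered / len(estimates)}

# Toy stand-in: estimate a mean of 0.3 from n = 50 normal observations, 2000 times.
rng = random.Random(2024)
truth, n, reps = 0.3, 50, 2000
estimates, std_errors = [], []
for _ in range(reps):
    sample = [rng.gauss(truth, 1.0) for _ in range(n)]
    estimates.append(statistics.fmean(sample))
    std_errors.append(statistics.stdev(sample) / n ** 0.5)

summary = summarize(estimates, std_errors, truth)
print({k: round(v, 4) for k, v in summary.items()})
```

An estimator behaves well in this sense when the bias is near zero, the ESE is close to the SD, and the coverage is close to the nominal level.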
3.1. The Simulation Setting
We consider a simple simulation setting where there exist independent datasets  on  on , and  on , where Y is the common outcome of interest, Z is the common treatment indicator variable,  is the covariate observed in dataset  is the covariate observed in dataset , and  is the set of covariates in the underlying full regression models for the outcome and the treatment (i.e., the PS). We set the sample sizes of the datasets , , and  to , 100 or 200, respectively.
The covariates  in , and  are generated by independent standard normal random variables, and given , the treatment assignment Z in both  and  is generated using the linear logistic regression model:  with . The outcome Y in both  and  is binary and generated by the linear logistic regression model for given treatment assignment Z and covariates : . The coefficients for the covariates,  and , in the PS and outcome models are both fixed at , and the coefficient  for the treatment indicator is set to 0, , or  when generating both datasets  and . On the other hand, the intercept parameters  are set to  and  when generating datasets  and , respectively, to reflect different baselines in the underlying distributions of the two datasets. The reduced PS and outcome regression coefficient estimates are obtained from both  and ; in , the working PS and outcome regression models are, respectively, the linear logistic models for  and , while in  they are, respectively, the linear logistic models for  and . Such reduced-model regression coefficient estimates are used in the later estimation procedure while individual data in  and  are not. Individual data in  on the complete covariate set  are used as the reference data for the proposed estimation.
The method in Section 2.3 is applied to the regression coefficient estimates from  and  and the individual data from  to obtain the estimates of the parameters in the full PS and outcome regression models, which are correctly specified in the estimation as the data generating models mentioned above. Further, the method in Section 2.4 is applied to estimate the ATE of Z on Y.
We also perform an extended simulation with $K = 4$ existing datasets over 1000 simulation replications, where the distribution of the outcome Y is binomial, normal, or Poisson. The complete covariate set is , where  are generated from standard normal with a correlation coefficient of , and  are Bernoulli random variables with success probabilities expit and expit, where ,  are the covariates mentioned above, ,  are independent standard normal, and expit. Accordingly,  is correlated with , and  is correlated with  and hence also  since  and  are correlated. The four existing datasets contain data on the common outcome variable Y and treatment variable Z, and data on the different covariate sets , , , and , respectively. The treatment variable Z in the kth dataset is generated using the linear logistic regression model: , , and  for . The outcome Y in the kth dataset is generated using the generalized linear model:  with g the inverse canonical link function, that is, the expit, identity, or exponential function when the distribution of Y is binomial, normal, or Poisson, respectively. The true parameter values ,  for , and  or . The reduced propensity (PS) and outcome regression coefficient estimates are obtained from the four existing datasets, using models that have the same link functions as the data-generating models but only the reduced covariate sets observed in these datasets, as mentioned above. Such reduced regression coefficient estimates are used in the estimation procedure while individual data in the existing datasets are not. Individual data in the reference sample, which contain data on the complete covariate set , are also used in the estimation procedure as proposed in Section 2.3. The full PS and outcome regression models in the estimation are correctly specified as the data-generating models mentioned above. Further, the method in Section 2.4 is applied to estimate the average treatment effect (ATE) of Z on Y in the population of each dataset.
All four existing studies and the reference sample have the same sample size, set to 200, 500, or 1000.
3.2. The Simulation Results
Table 1 presents the simulation results for the proposed estimation of the regression coefficient for the treatment variable in the full outcome model and the ATEs for the populations of the two existing datasets. The findings in Table 1 suggest that the proposed estimation for the full outcome model parameter and the ATE is essentially unbiased, with the absolute bias of the proposed estimates for both parameters being less than 1%. Also, the estimated standard error (ESE) based on the asymptotic theory is close to the simulation standard deviation of the estimator with the absolute difference less than 1%, and the coverage probability of the  and  Wald-type confidence intervals based on the asymptotic normality of the estimator is close to the nominal levels  and , respectively. These results reveal that the proposed estimators perform well in finite samples.
       
    
Table 1. Simulation results (multiplied by 100) for the estimates of the coefficient  of the treatment variable in the full outcome model and the ATEs with true parameter values  in Study 1, and  in study , and different  and ATE values (in parentheses),  in Study 1, and  in Study 2.
  
Table 2 presents the simulation results for the proposed estimation of the regression coefficients in the full PS model under the same simulation settings as in Table 1. The findings demonstrate that our proposed method performs well in the estimation of the full PS model.
       
    
Table 2. Simulation results (multiplied by 100) for the estimates of the coefficients  of the covariate variables in the full propensity score model with true parameter values  in Study 1, and  in study , , and different  values,  in Study 1, and  in Study 2.
  
The extended simulations, performed under the settings with $K = 4$ studies, correlated covariates, and binomial, normal, and Poisson distributed outcome variables (see Section 3.1 for details), still show satisfactory performance of the proposed estimation method; the results are shown in Appendix C.
The performance of the proposed method in terms of computation time is summarized as follows. When , the computation time for a simulation case with  is  s on a desktop computer with an i7-9700 CPU, and the time increases to  s when  and . Essentially, the computation time increases linearly with K. Also, when , the computation time for running a case in the simulation study is  s when the size of the reference sample increases to .
4. A Real Data Application
In this section, we apply the proposed method to analyze the impact of waist circumference (WC) on hypertension (HT) risk, while controlling for covariates such as age, sex, body mass index (BMI), smoking status (SMK), drinking status (DRK), body fat percentage (BFP), heart rate (HR), and hip circumference (HC) among working-age individuals. The analysis leverages regression analyses from two existing studies on the relationship between WC and HT and a reference dataset encompassing the complete set of covariates.
4.1. Two Existing Studies on the Effect of Waist Circumference on Hypertension
WC reflects the size of the visceral fat depot and is an effective clinical tool for assessing the risk of diabetes and cardiovascular diseases []. Guagnano et al. [] indicated that WC seems to have a strong association with the risk of hypertension. In recent years, Ren et al. [] investigated the cut-off values for the obesity indices that represent the elevated incidence of hypertension in Chinese adults aged between 18 and 65. Hu et al. [] indicated that a combination of WC and BMI was superior to individual indices for identifying hypertension. Data from Ren et al. [] and Hu et al. [] are publicly available (https://dx.doi.org/10.6084/m9.figshare.2151271.v1 (accessed on 15 February 2016), https://doi.org/10.1371/journal.pone.0170238.s001 (accessed on 5 January 2017)). The study of Hu et al. [], termed Study 1, contains data on the covariate set  including age, sex, BMI, SMK, DRK, BFP, and HR, while the study of Ren et al. [], termed Study 2, contains data on the covariate set  including age, sex, BMI, SMK, DRK, and HC. We focus on working-age (20–65 years old) people, and the subsamples from the two studies meeting this criterion are of sizes  and , respectively (after removing missing observations).
4.2. The Reference Dataset with Complete Covariates
The Taiwan BioBank (TWB) database, created by Academia Sinica, comprises a community-based cohort of over 200,000 study participants. It includes comprehensive data on demographics, health behaviors, environmental factors, and biomarkers collected through meticulously conducted questionnaires and thorough examinations. Details about the TWB data can be found at https://www.twbiobank.org.tw/ (accessed on 1 June 2024). The reference dataset we employ in the current analysis is based on the released subsample of the TWB cohort consisting of 4575 randomly sampled study subjects aged 20–65 years. The reference dataset contains data on the complete covariate set  including age, sex, BMI, SMK, DRK, BFP, HR, and HC, but contains no data on either the treatment (WC) or the outcome (HT).
4.3. The Proposed Analysis
In the following analysis, both WC and HT are defined as binary variables, classified according to  and  mmHg or , where I(·) is the indicator function and SBP and DBP denote systolic and diastolic blood pressures, respectively; the classification rules follow those in Lean et al. (1998) []. The covariates age (years), BMI (kg/m²), BFP (%), HR (beats/min), and HC (cm) are continuous variables, while the covariates sex (female vs. male), SMK (yes vs. no), and DRK (yes vs. no) are binary.
We apply the proposed methods in Section 2.3 and Section 2.4 to assess the treatment effect of WC on the risk of HT controlling for the covariates age, sex, BMI, SMK, DRK, BFP, HR, and HC, which are regarded as the complete covariate set among working-age people. Specifically, the proposed analysis uses the results of the regression analyses from Study 1 (Hu et al. []) and Study 2 (Ren et al. []), as well as the reference dataset from the TWB database. The analyses in Studies 1 and 2 both employ linear logistic regressions to examine the association between WC and HT, adjusting for the covariate sets  and , respectively; see Section 4.1 for details. Also, the PS analyses in the two studies are both based on linear logistic regressions for  with the covariate sets  and , respectively.
In the proposed analysis, only the outcome and PS regression parameter estimates from Studies 1 and 2 are employed, while individual data are not. The full outcome (HT) model is specified as the linear logistic regression model for HT given the treatment (WC) and the complete covariate set , and the full PS model is specified as the linear logistic regression model for WC given ; in these logistic regression models only the main effects of the treatment and the covariate variables are considered. To account for possible differences between the baselines of Studies 1 and 2, the intercept parameters of the full models, including the outcome and the PS models, are allowed to differ across the studies.
4.4. The Analysis Results
The results for the proposed estimation of the logistic regression models for the PS (treatment, WC) and the outcome (HT) adjusting for the complete covariate set are provided in Table 3 and Table 4, respectively. We can see from Table 3 that subjects who are older, are male, or have a larger BMI, higher heart rate, or larger hip circumference tend to have a waist circumference greater than 80 cm (treatment group), and the estimation result summarizes, synthesizes, and complements the results from the two existing studies. Also, we see from Table 4 that, after adjusting for the covariates age, sex, body mass index (BMI), smoking (SMK), drinking (DRK), body fat percentage (BFP), heart rate (HR), and hip circumference (HC), the effect of waist circumference (WC) on the risk of hypertension (HT) is strongly significant; the odds of hypertension in working-age people with waist circumference greater than 80 cm are 1.4 (≈) times as high as the odds in those with waist circumference no greater than 80 cm (p-value ). In contrast, the effect of WC on the risk of HT obtained by adjusting for an incomplete covariate set can be somewhat higher (when adjusting only for age, sex, BMI, SMK, DRK, BFP, and HR in Study 1) or lower (when adjusting only for age, sex, BMI, SMK, DRK, and HC in Study 2). Since HR has its own significant effects on both WC and HT, the lower effect of WC on HT obtained in Study 2 without adjusting for HR is likely to be biased.
       
    
Table 3. Results of the real-data analysis. The propensity score (PS) for waist circumference (WC) with the covariates age, sex, body mass index (BMI), smoking status (SMK), drinking status (DRK), body fat percentage (BFP), heart rate (HR), and hip circumference (HC) based on results from two studies and a reference sample.
  
       
    
Table 4. Results of the real-data analysis. The risk of hypertension (HT) with treatment of waist circumference (WC) adjusting for age, sex, body mass index (BMI), smoking status (SMK), drinking status (DRK), body fat percentage (BFP), heart rate (HR), and hip circumference (HC) based on results from two studies and a reference sample.
  
The average treatment effect of WC on HT, averaged over the covariate distribution, is estimated as 0.044 with SE (standard error) , p-value , in the Study 1 population, and as 0.042 with SE , p-value , in the Study 2 population. That is, among working-age people, those with waist circumference larger than 80 cm are expected to have 44 (42) additional cases of HT per 1000 people in the Study 1 (2) population, 95% confidence interval 26–62 (24–60), compared to those with waist circumference no larger than 80 cm.
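The arithmetic linking an ATE estimate, its standard error, and the Wald-type interval per 1000 people can be sketched as follows. The helper names are hypothetical, and the standard error computed here is back-calculated from the rounded interval quoted above, so it is only an approximation and not a value reported by the study.

```python
from statistics import NormalDist

def wald_ci(estimate, se, level=0.95):
    """Two-sided Wald-type confidence interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se

def se_from_ci(lower, upper, level=0.95):
    """Standard error implied by a symmetric Wald-type interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)

# Study 1 population: ATE 0.044 with 95% interval 26 to 62 cases per 1000 people.
se1 = se_from_ci(0.026, 0.062)
lo1, hi1 = wald_ci(0.044, se1)

# Study 2 population: ATE 0.042 with 95% interval 24 to 60 cases per 1000 people.
se2 = se_from_ci(0.024, 0.060)
lo2, hi2 = wald_ci(0.042, se2)

print(f"Study 1: implied SE {se1:.4f}, interval {1000 * lo1:.0f} to {1000 * hi1:.0f} per 1000")
print(f"Study 2: implied SE {se2:.4f}, interval {1000 * lo2:.0f} to {1000 * hi2:.0f} per 1000")
```

Both intervals have the same half-width of 18 cases per 1000, which corresponds to an implied standard error of roughly 0.009 on the risk-difference scale.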
From the results mentioned above and shown in Table 3, we conclude that the proposed integrated analysis, using information from both studies and the reference data with complete covariates, can lead to less biased and possibly more efficient analysis results than those from the original individual studies.
5. Discussion and Conclusions
In this study, we propose a new inference framework that integrates the results of the outcome and the treatment (i.e., the PS) regression analyses across multiple existing databases. These databases may vary in their coverage of covariate variables and may contain incomplete data, potentially introducing bias in individual database analyses. Moreover, access to individual-level data from these databases may be restricted. Our proposal integrates the existing PS and outcome regression analyses through a reference sample, which contains only data on the complete covariate set. We obtain consistent estimates for the parameters of the full PS and outcome regression models and for the ATE. The new proposal extends the original generalized meta-analysis method of Kundu et al. [] by further considering the treatment (propensity score) regression in addition to the outcome regression. Also, the new proposal applies to a general outcome variable, such as one following a generalized linear model, and hence is more flexible than the work of Li et al. [], which considers a setting similar to ours but is restricted to a normally distributed outcome and a linear regression model.
Our approach requires a dataset with comprehensive covariate information, which acts as a benchmark for the underlying covariate distribution []. Such a reference dataset could be sourced from a large-scale database like the Taiwan BioBank, as in our application in Section 4. Alternatively, a reference sample might be gathered through a smaller validation study, a method commonly discussed in the epidemiological literature (e.g., Stürmer et al. []).
As in the existing methods, such as Kundu et al. [], for integrating common information from different studies, we require the underlying treatment and outcome distributions to be the same across various studies. When this assumption is not satisfied, we should interpret the parameter estimates from the proposed method with caution, since they no longer represent consistent estimates for some common parameters, but instead represent the estimates for some “average effects” over different studies.
In summary, our proposal is most applicable in the following two scenarios: (1) A multi-center study where individual data from the centers are not accessible, only derived summary statistics (e.g., regression coefficient estimates) are available, and an independent reference sample of complete covariate data is available. (2) A meta-analysis where results for both the outcome and the treatment (propensity score) regression analyses are available for various studies, together with a reference sample of complete covariate data. In both scenarios, our approach optimally integrates analysis results from diverse data sources to yield valid inferences on treatment effects using summarized and synthesized information.
Author Contributions
Conceptualization, Y.-H.C.; methodology, Y.-H.C. and S.-Y.H.; software, S.-Y.H., J.-H.W. and C.-C.S.; validation, S.-Y.H., J.-H.W. and C.-C.S.; formal analysis, S.-Y.H., J.-H.W. and C.-C.S.; investigation, S.-Y.H., J.-H.W. and C.-C.S.; resources, Y.-H.C. and J.-H.W.; data curation, S.-Y.H., J.-H.W. and C.-C.S.; writing—original draft preparation, Y.-H.C.; writing—review and editing, Y.-H.C. and J.-H.W.; visualization, Y.-H.C. and J.-H.W.; supervision, Y.-H.C.; project administration, Y.-H.C.; funding acquisition, Y.-H.C. and J.-H.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the grant NSTC 112-2118-M-194-003-MY2 from the National Science and Technology Council of the Republic of China (Taiwan).
Data Availability Statement
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
      
| ATE | Average treatment effect | 
| SUTVA | Stable unit-treatment value assumption | 
| CA | Covariate adjustment | 
| IPTW | Inverse probability of treatment weighting | 
| PS | Propensity score | 
| GDPR | General Data Protection Regulation | 
| HIPAA | Health Insurance Portability and Accountability Act | 
| WC | Waist circumference | 
| HT | Hypertension | 
| CATE | Conditional average treatment effect | 
| SD | Standard deviation | 
| ESE | Estimated standard errors | 
| CP | Coverage probability | 
| BMI | Body mass index | 
| SMK | Smoking status | 
| DRK | Drinking status | 
| BFP | Body fat percentage | 
| HR | Heart rate | 
| HC | Hip circumference | 
| TWB | Taiwan Biobank | 
| SBP | Systolic blood pressure | 
| DBP | Diastolic blood pressure | 
Appendix A. Estimation of the Matrices Δ and Λ
The matrices Δ and Λ are defined as , and

with , where

. In practice, we can first use the identity matrix for the weight , namely minimize the objective function  to obtain the initial estimator  for , and use the initial estimator to estimate the matrix  by

and the matrix  by

where  is the empirical measure with respect to the reference sample and is evaluated at . The optimal weight  is then estimated by .
Note that, when the outcome regression in study k is specified by a GLM with a dispersion parameter not fixed to 1 and estimated from study data, the calculation of  involves the estimated value of the dispersion parameter.
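The two-step procedure described in this appendix (identity weight first, then a weight estimated at the initial estimator) follows the usual generalized method of moments recipe. The following is a minimal Python sketch of that generic recipe on a toy over-identified problem, not the estimating functions of this paper. The moment functions, the function names, and the simulated data are all hypothetical, and the weight here is the inverse of a single empirical moment covariance, so it does not reproduce the contribution of both matrices Δ and Λ.

```python
import numpy as np
from scipy.optimize import minimize


def moments(theta, x):
    """Per-observation moment functions for a toy model (hypothetical):
    x has mean theta[0] and known variance 1, giving two moments for one parameter."""
    mu = theta[0]
    return np.column_stack([x - mu, x**2 - mu**2 - 1.0])


def gmm_objective(theta, x, weight):
    """Quadratic form in the sample-averaged moments: g_bar' W g_bar."""
    g_bar = moments(theta, x).mean(axis=0)
    return float(g_bar @ weight @ g_bar)


def two_step_gmm(x, theta_start):
    """Generic two-step GMM: identity weight, then estimated optimal weight."""
    n_moments = moments(theta_start, x).shape[1]
    # Step 1: the identity weight yields a consistent initial estimator.
    step1 = minimize(gmm_objective, theta_start,
                     args=(x, np.eye(n_moments)), method="Nelder-Mead")
    # Step 2: estimate the moment covariance at the initial estimator
    # and use its inverse as the weight.
    g = moments(step1.x, x)
    g_centered = g - g.mean(axis=0)
    moment_cov = g_centered.T @ g_centered / len(x)
    weight = np.linalg.inv(moment_cov)
    step2 = minimize(gmm_objective, step1.x,
                     args=(x, weight), method="Nelder-Mead")
    return step1.x, step2.x, weight


# Simulated toy data (assumption): normal with mean 2 and variance 1.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=5000)
theta_init, theta_opt, weight_hat = two_step_gmm(x, np.array([1.0]))
print("initial estimate:", theta_init, "two-step estimate:", theta_opt)
```

In this toy problem both steps should land close to the true mean; the second step mainly matters for efficiency, which is the role the estimated optimal weight plays in the integrated estimator.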
Appendix B. Large Sample Theory for the ATE Estimator
By the consistency of the estimator  and the continuous mapping theorem, in large samples, the ATE estimator  converges to  with probability one. Also, by the delta method,  converges in distribution to a zero-mean normal distribution with the variance given by  with  being the lower right  submatrix of , i.e., the asymptotic covariance of , the matrix  being the matrix formed by the lower p rows of , and , and  are given in Section 2.3 and Appendix A. The estimation of the asymptotic variance of  can be performed similarly to that of the asymptotic covariance of  as mentioned in Appendix A.
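The delta-method step can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the variance formula of this appendix: it takes a covariate-adjustment style ATE under a logistic outcome model, treats the reference-sample covariates as fixed, and uses a hypothetical parameter estimate and covariance matrix. The gradient is approximated by central differences, and all function names are illustrative.

```python
import numpy as np


def numerical_gradient(func, theta, eps=1e-6):
    """Central-difference gradient of a scalar function of theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (func(theta + step) - func(theta - step)) / (2 * eps)
    return grad


def delta_method_variance(func, theta_hat, cov_theta):
    """Approximate Var{func(theta_hat)} by grad' Cov(theta_hat) grad."""
    grad = numerical_gradient(func, theta_hat)
    return float(grad @ cov_theta @ grad)


def ate_logistic(theta, covariates):
    """ATE implied by a logistic outcome model, averaged over the covariates.
    theta = (intercept, treatment coefficient, covariate coefficients)."""
    intercept, effect, coefs = theta[0], theta[1], theta[2:]
    linear_part = intercept + covariates @ coefs
    expit = lambda u: 1.0 / (1.0 + np.exp(-u))
    return float(np.mean(expit(linear_part + effect) - expit(linear_part)))


# Hypothetical inputs: stand-in reference covariates, estimate, and covariance.
rng = np.random.default_rng(1)
covariates = rng.normal(size=(500, 2))
theta_hat = np.array([-0.5, 0.8, 0.3, -0.2])
cov_theta = np.diag([0.02, 0.03, 0.01, 0.01])

ate_hat = ate_logistic(theta_hat, covariates)
ate_var = delta_method_variance(lambda t: ate_logistic(t, covariates),
                                theta_hat, cov_theta)
ate_se = ate_var ** 0.5
print("ATE:", ate_hat, "delta-method SE:", ate_se)
```

Because the covariates are held fixed here, the sketch ignores the variability from averaging over the reference sample, which a full treatment would need to account for.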
Appendix C. Extended Simulations
The extended simulations, performed under the settings with  studies, correlated covariates, and binomial, normal, and Poisson distributed outcome variables (see Section 3.1 of the main text for details of the simulation settings), are reported in the following supplementary tables. These include results for both the small-sample cases (sample size of 200 or 500) and the large-sample case (sample size of 1000), and for both the outcome and the propensity score regressions. Tables A1–A4 are for the setting with , while Tables A5–A8 are for the setting with . As these tables show, the proposed estimation method continues to perform satisfactorily.
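For readers reproducing such summaries, the four reported quantities (Bias, SD, ESE, CP) can be computed from simulation replicates roughly as follows. This is a sketch under assumptions: ESE is taken to be the average of the replicate standard errors, the confidence intervals are Wald-type 95% intervals, and the replicates are generated from a toy sample-mean problem rather than from the simulation design of Section 3.1. The function name is hypothetical.

```python
import numpy as np


def summarize_replicates(estimates, std_errors, true_value, z=1.959963984540054):
    """Summarize simulation replicates of one estimator.

    Bias: mean estimate minus the true value.
    SD:   empirical standard deviation of the estimates.
    ESE:  average of the estimated standard errors (assumed definition).
    CP:   share of Wald 95% intervals that cover the true value.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= true_value) & (true_value <= upper)
    return {
        "Bias": float(estimates.mean() - true_value),
        "SD": float(estimates.std(ddof=1)),
        "ESE": float(std_errors.mean()),
        "CP": float(covered.mean()),
    }


# Toy replicates (assumption): estimate a normal mean from n = 200 observations.
rng = np.random.default_rng(2)
true_mean, n_obs, n_reps = 1.0, 200, 1000
samples = rng.normal(loc=true_mean, scale=1.0, size=(n_reps, n_obs))
estimates = samples.mean(axis=1)
std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n_obs)

summary = summarize_replicates(estimates, std_errors, true_mean)
# The tables report these quantities multiplied by 100.
print({name: round(100 * value, 2) for name, value in summary.items()})
```

A well-behaved estimator shows a Bias near zero, an ESE close to the SD, and a CP near the nominal 95%.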
       
    
Table A1.
    Simulation results (multiplied by 100) for the estimates of the coefficient  of the treatment variable in the full outcome model and the ATEs for the populations of the existing datasets under the binomial, normal, or Poisson outcome distribution; true value of  (small sample case with sample size of 200 or 500).
      | Bias | SD | ESE | CP | Bias | SD | ESE | CP | |
| Binomial | ||||||||
| (0) | ||||||||
| ATE1 | ||||||||
| ATE2 | ||||||||
| ATE3 | ||||||||
| ATE4 | ||||||||
| Normal | ||||||||
| (0) | ||||||||
| ATE1 | ||||||||
| ATE2 | ||||||||
| ATE3 | ||||||||
| ATE4 | ||||||||
| Poisson | ||||||||
| (0) | ||||||||
| ATE1 | ||||||||
| ATE2 | ||||||||
| ATE3 | ||||||||
| ATE4 | ||||||||
n, size of reference data; SD, standard deviation; ESE, estimated standard error; CP, coverage probability of 95% confidence intervals.
       
    
Table A2.
    Simulation results (multiplied by 100) for the estimates of the coefficients  of the covariate variables in the full propensity score model with true parameter values , under the binomial, normal, or Poisson outcome distribution; true value of  (small sample case with sample size of 200 or 500).
      | Bias | SD | ESE | CP | Bias | SD | ESE | CP | |
| Binomial | ||||||||
| Normal | ||||||||
| Poisson | ||||||||
n, size of reference data; SD, standard deviation; ESE, estimated standard error; CP, coverage probability of 95% confidence intervals.
       
    
Table A3.
    Simulation results (multiplied by 100) for the estimates of the coefficient  of the treatment variable in the full outcome model and the ATEs for the populations of the existing datasets under the binomial, normal, or Poisson outcome distribution; true value of  (large sample case with sample size of 1000).
      | Bias | SD | ESE | CP | |
| Binomial | ||||
| (0) | ||||
| ATE1 | ||||
| ATE2 | ||||
| ATE3 | ||||
| ATE4 | ||||
| Normal | ||||
| (0) | ||||
| ATE1 | ||||
| ATE2 | ||||
| ATE3 | ||||
| ATE4 | ||||
| Poisson | ||||
| (0) | ||||
| ATE1 | ||||
| ATE2 | ||||
| ATE3 | ||||
| ATE4 | ||||
n, size of reference data; SD, standard deviation; ESE, estimated standard error; CP, coverage probability of 95% confidence intervals.
       
    
Table A4.
    Simulation results (multiplied by 100) for the estimates of the coefficients  of the covariate variables in the full propensity score model with true parameter values , under the binomial, normal, or Poisson outcome distribution; true value of  (large sample case with sample size of 1000).
      | Bias | SD | ESE | CP | |
| Binomial | ||||
| Normal | ||||
| Poisson | ||||
n, size of reference data; SD, standard deviation; ESE, estimated standard error; CP, coverage probability of 95% confidence intervals.
       
    
Table A5.
    Simulation results (multiplied by 100) for the estimates of the coefficient  of the treatment variable in the full outcome model and the ATEs for the populations of the existing datasets under the binomial, normal, or Poisson outcome distribution; true value of  (small sample case with sample size of 200 or 500).
      | Bias | SD | ESE | CP | Bias | SD | ESE | CP | |
| Binomial | ||||||||
| ATE1 | ||||||||
| ATE2 | ||||||||
| ATE3 | ||||||||
| ATE4 | ||||||||
| Normal | ||||||||
| ATE1 | ||||||||
| ATE2 | ||||||||
| ATE3 | ||||||||
| ATE4 | ||||||||
| Poisson | ||||||||
| ATE1 | ||||||||
| ATE2 | ||||||||
| ATE3 | ||||||||
| ATE4 | ||||||||
n, size of reference data; SD, standard deviation; ESE, estimated standard error; CP, coverage probability of 95% confidence intervals.
       
    
Table A6.
    Simulation results (multiplied by 100) for the estimates of the coefficients  of the covariate variables in the full propensity score model with true parameter values , under the binomial, normal, or Poisson outcome distribution; true value of  (small sample case with sample size of 200 or 500).
      | Bias | SD | ESE | CP | Bias | SD | ESE | CP | |
| Binomial | ||||||||
| Normal | ||||||||
| Poisson | ||||||||
n, size of reference data; SD, standard deviation; ESE, estimated standard error; CP, coverage probability of 95% confidence intervals.
       
    
Table A7.
    Simulation results (multiplied by 100) for the estimates of the coefficient  of the treatment variable in the full outcome model and the ATEs for the populations of the existing datasets under the binomial, normal, or Poisson outcome distribution; true value of  (large sample case with sample size of 1000).
      | Bias | SD | ESE | CP | |
| Binomial | ||||
| ATE1 | ||||
| ATE2 | ||||
| ATE3 | ||||
| ATE4 | ||||
| Normal | ||||
| ATE1 | ||||
| ATE2 | ||||
| ATE3 | ||||
| ATE4 | ||||
| Poisson | ||||
| ATE1 | ||||
| ATE2 | ||||
| ATE3 | ||||
| ATE4 | ||||
n, size of reference data; SD, standard deviation; ESE, estimated standard error; CP, coverage probability of 95% confidence intervals.
       
    
Table A8.
    Simulation results (multiplied by 100) for the estimates of the coefficients  of the covariate variables in the full propensity score model with true parameter values , under the binomial, normal, or Poisson outcome distribution; true value of  (large sample case with sample size of 1000).
      | Bias | SD | ESE | CP | |
| Binomial | ||||
| Normal | ||||
| Poisson | ||||
n, size of reference data; SD, standard deviation; ESE, estimated standard error; CP, coverage probability of 95% confidence intervals.
References
- Wang, D.; Zheng, S.; Cui, Y.; He, N.; Chen, T.; Huang, B. Adjusted win ratio using the inverse probability of treatment weighting. J. Biopharm. Stat. 2023, 10, 1–16.
- Liang, J.; Liu, J. Evaluation of educational interventions based on average treatment effect: A case study. Mathematics 2022, 10, 4333.
- Hsu, Y.C.; Lai, T.C.; Lieli, R.P. Estimation and inference for distribution and quantile functions in endogenous treatment effect models. Econom. Rev. 2020, 41, 22–50.
- Yang, S.; Ding, P. Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores. Biometrika 2018, 105, 487–493.
- Rubin, D.B. Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 1974, 66, 688–701.
- Rosenbaum, P.R.; Rubin, D.B. The central role of the propensity score in observational studies for causal effects. Biometrika 1983, 70, 41–55.
- Austin, P.C. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivar. Behav. Res. 2011, 46, 399–422.
- D’Agostino, R.B. Tutorial in biostatistics propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat. Med. 1998, 17, 2265–2281.
- Lunceford, J.K.; Davidian, M. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Stat. Med. 2004, 23, 2937–2960.
- Athey, S.; Imbens, G.; Pham, T.; Wager, S. Estimating average treatment effects: Supplementary analyses and remaining challenges. Am. Econ. Rev. 2017, 107, 278–281.
- Simoneau, G.; Pellegrini, F.; Debray, T.P.; Rouette, J.; Muñoz, J.; Platt, R.W.; Petkau, J.; Bohn, J.; Shen, C.; De Moor, C.; et al. Recommendations for the use of propensity score methods in multiple sclerosis research. Mult. Scler. 2022, 28, 1467–1480.
- Taylor, S.A.; Phillips, K.J.; Gertzog, M.G. Use of synthesized analysis and informed treatment to promote school reintegration. Behav. Interv. 2018, 33, 364–379.
- Hamada, A. Using meta-analysis and propensity score methods to assess treatment effects toward evidence-based practice in extensive reading. Front. Psychol. 2020, 11, 617.
- Ren, Q.; Su, C.; Wang, H.; Wang, Z.; Du, W.; Zhang, B. Prospective study of optimal obesity index cut-off values for predicting incidence of hypertension in 18-65-year-old Chinese adults. PLoS ONE 2016, 11, e0148140.
- Hu, L.; Huang, X.; You, C.; Li, J.; Hong, K.; Li, P.; Wu, Y.; Wu, Q.; Bao, H.; Cheng, X. Prevalence and risk factors of prehypertension and hypertension in southern China. PLoS ONE 2017, 12, e0170238.
- Kundu, P.; Tang, R.; Chatterjee, N. Generalized meta-analysis for multiple regression models across studies with disparate covariate information. Biometrika 2019, 106, 567–585.
- Nelder, J.A.; Wedderburn, R.W.M. Generalized linear models. J. R. Stat. Soc. Ser. A Stat. Soc. 1972, 135, 370–384.
- Hansen, L.P. Large sample properties of generalized method of moments estimators. Econometrica 1982, 50, 1029–1054.
- Lean, J.; Han, T.S.; Seidell, J.C. Impairment of health and quality of life in people with large waist circumference. Lancet 1998, 351, 853–856.
- Guagnano, M.T.; Ballon, E.; Colagrande, V.; Della Vecchia, R.; Manigrasso, M.R.; Merlitti, D.; Riccioni, G.; Sensi, S. Large waist circumference and risk of hypertension. Int. J. Obes. 2001, 25, 1360–1364.
- Li, H.; Miao, W.; Cai, Z.; Liu, X.; Zhang, T.; Xue, F.; Geng, Z. Causal data fusion methods using summary-level statistics for a continuous outcome. Stat. Med. 2020, 39, 1054–1067.
- Stürmer, T.; Schneeweiss, S.; Avorn, J.; Glynn, R.J. Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. Am. J. Epidemiol. 2005, 162, 279–289.
 
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).