1. Introduction
Cohort studies are widely used in epidemiology to measure how exposure to certain factors influences the risk of a specific disease. The role of large cohort studies is increasing with the development of multi-omics approaches and with the search for methods to translate omics findings, especially those derived from genome-wide association studies (GWAS), into clinical settings [
1]. Many research efforts have been made to link vast amounts of phenotypic data across diverse centers. These efforts concern molecular information as well as data on environmental factors, such as those recorded in and obtained from health-care databases and epidemiological registers [
2]. Cohort studies can be prospective (forward-looking) or retrospective (backward-looking). Owing to the increasing need for data integration in genomic analysis, large-scale cohort initiatives have been launched worldwide, such as the NIH Roadmap Epigenomics Project [
3] and the 500 Functional Genomics cohort [
4]. An example of the former is the Nurses’ Health Study, a large prospective cohort study that revealed several significant associations between lifestyle choices and health by following hundreds of thousands of women in North America [
5]. Similarly, the National Health and Nutrition Examination Survey (
http://www.cdc.gov/nchs/nhanes.htm) revealed an association between cholesterol levels and heart disease. Furthermore, another large prospective cohort study, the Framingham Heart Study (
https://www.framinghamheartstudy.org), demonstrated the cardiovascular health risks of tobacco smoking. However, the integration of data from multiple cohort studies faces a variety of challenges [
6].
Database federation is a data-integration method that offers consistent access to a number of heterogeneous data sources through middleware, including interactive database management systems [
6]. Individual-level data comprise information about participants, either contributed by the participants themselves (e.g., in surveys) or collected from registers. Assembling individual-level data from large population-based studies across centers in international collaborations faces several difficulties [
7]. On the one hand, merging cohort datasets extends the capability of these studies by allowing research questions of mutual interest to be addressed, by enabling the harmonization of standard measures, and by permitting the investigation of a range of psychosocial, physical, and contextual factors over time. On the other hand, data are often collected in different locations and systems, and they are rarely aggregated into larger datasets, which limits their utility. Additionally, it is essential to carefully address the privacy, legal, and ethical issues associated with data usage [
8].
Meta-analysis is a common approach to pooling results from multiple cohorts, and it is routinely used, for example, in GWAS [
9], epidemiological studies [
10], case-control studies [
11] and randomized controlled trials [
12]. Several studies are pooled to increase the sample size and thereby the statistical power to detect true association signals (reducing type II error). Due to participant confidentiality restrictions and the refusal of authors to provide anonymized datasets, individual-level data often cannot be pooled among studies, so meta-analytical approaches are typically used to combine summary statistics across studies; these have been shown to be as powerful as the integration of individual datasets [
9].
In cohort studies, retaining participants across several waves of follow-up is challenging. These waves of data collection give researchers an opportunity to capture changes in participants’ exposure and outcome measures over time. The duration of follow-up can range from one to two years up to 20 to 30 years, or even longer post-baseline assessments. Missingness of the data can be related to study designs in which repeated measures of exposure and outcome over time are needed: participants might not be available or might be too ill to participate, they might refuse to respond to specific questions, or they might be deceased [
13]. Thus, researchers often face missing data, which might introduce bias in parameter estimation (for instance, in risk calculation) as well as a loss of statistical power and accuracy [
13]. Although further research is required to understand the impact of missing data and to develop effective methodologies for handling them, there are approaches (discussed below) that may decrease the detrimental effect of missing information.
There are several ways of storing and integrating data, such as warehouses, federations, and data hotels. For example, the Dutch Techcentre for Life Sciences (DTL) provides an innovative data-hotel solution. DTL actively supports FAIR (Findable, Accessible, Interoperable, and Reusable) data stewardship of life science data, within its partnership and in close collaboration with its international partners [
14]. The FAIR principles serve as an international guideline for high-quality data stewardship. A federated facility is a meta-database system that is distributed across multiple locations; it makes data from different sources comparable and usable for analyses [
15]. The main difference between a federated facility and registries or warehouses is that data management is carried out via remote distributed requests from one federated server (or database manager) to multiple sources.
A federated facility allows researchers to obtain analytical insight from the pooled information of diverse datasets without having to move all the data to a central location, thus limiting data movement to the distribution of intermediate results and maximizing the security of the local data at the distributed sources [
16,
17]. In a federated analytical model, most of the data are analyzed close to where they are produced. To enable collaboration at scale, federated analytics permits the integration of intermediate analytical results while the raw data remain at their secure local sites. When the integrated results are pooled and explored, a substantial amount of knowledge is acquired, and researchers managing a single-center database can compare their results with findings derived from analyses of the federated pooled data.
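To make this principle concrete, the sketch below (in R, with simulated data and hypothetical function names, not part of any specific system described here) shows how per-site summary statistics can be pooled into an overall mean and variance without any individual-level records leaving the sites:

```r
# Each site computes only summary statistics locally; no raw data are shared
# (simulated data, hypothetical function names).
site_summary <- function(x) {
  x <- x[!is.na(x)]
  list(n = length(x), mean = mean(x), ss = sum((x - mean(x))^2))
}

# The coordinating center pools only these intermediate results.
pool_summaries <- function(summaries) {
  n_total    <- sum(sapply(summaries, `[[`, "n"))
  grand_mean <- sum(sapply(summaries, function(s) s$n * s$mean)) / n_total
  # total sum of squares = within-site SS + between-site SS
  ss_total <- sum(sapply(summaries, function(s) s$ss + s$n * (s$mean - grand_mean)^2))
  list(n = n_total, mean = grand_mean, var = ss_total / (n_total - 1))
}

set.seed(1)
sites  <- list(rnorm(100, 5), rnorm(150, 5.2), rnorm(80, 4.9))  # three simulated sites
pooled <- pool_summaries(lapply(sites, site_summary))
```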
In this paper, we describe some methodological aspects related to federated facilities. Specifically, we review methods for the harmonization and analysis of multi-center datasets, focusing on the impact of missing data in cohorts (as well as on different approaches to deal with them). We first present a few examples of cohort studies and the data collection procedure of a cohort study, and then describe approaches to harmonization and integrative data analysis across cohorts. To this end, we present different methods for handling missing data, such as complete-case analysis and multiple imputation. Finally, we offer a perspective on the future directions of this research area.
2. Examples of Cohort Studies and Integration of Cohorts
Cohort studies allow one to answer different epidemiological questions regarding the association between an exposure factor and a disease, such as whether exposure to smoking is associated with the manifestation of lung cancer. The British Doctors Study, which started in 1951 (and continued until 2001), was a cohort study that comprised both smokers (the exposed group) and non-smokers (the unexposed group) [
18]. By 1956, the study had delivered substantial evidence of the association of smoking with the prevalence of lung cancer. In a cohort study, the groups are selected to be comparable in terms of many other variables (e.g., general health and economic status), such that the variable being evaluated, i.e., smoking (independent variable), is the only one that could account for an association with lung cancer (dependent variable). In this study, a statistically significant increase in the prevalence of lung cancer in the smoking group compared to the non-smoking group rejects the null hypothesis of no relationship between the risk factor and the outcome.
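The quantity underlying such a comparison is the relative risk derived from a 2 x 2 exposure-by-outcome table. The sketch below uses purely hypothetical counts (not data from the British Doctors Study) to illustrate the calculation and a chi-square test of the null hypothesis of no association:

```r
# Hypothetical counts: rows = exposure (smokers, non-smokers),
# columns = outcome (lung cancer yes/no). Not the actual study data.
tab <- matrix(c(120, 880,    # smokers:      cases, non-cases
                 15, 985),   # non-smokers:  cases, non-cases
              nrow = 2, byrow = TRUE,
              dimnames = list(c("smoker", "non-smoker"),
                              c("cancer", "no cancer")))

risk_exposed   <- tab["smoker", "cancer"]     / sum(tab["smoker", ])
risk_unexposed <- tab["non-smoker", "cancer"] / sum(tab["non-smoker", ])
relative_risk  <- risk_exposed / risk_unexposed

chisq.test(tab)  # tests the null hypothesis of no exposure-outcome association
```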
Another example is the Avon Longitudinal Study of Parents and Children (ALSPAC), a prospective observational study that examines influences on health and development across the life course [
19]. ALSPAC is renowned for investigating how genetic and environmental factors affect health and growth in parents and children [
20]. This study has examined multiple biological, (epi)genetic, psychological, social, and environmental factors that are associated with a range of health, social, and developmental outcomes. Enrollment targeted pregnant women in the Bristol area of the UK during 1990–92 and was later extended to include additional eligible children up to the age of 18 years. In 1990–92, children from 14,541 pregnancies were enrolled, and enrollment subsequently increased to 15,247 pregnancies by the time the children were 18 years of age. The follow-up comprised 59 questionnaires (four weeks–18 years of age) and nine clinical assessment visits (7–17 years of age) [
19]. Genetic (the DNA of 11,343 children, genome-wide data for 8365 children, complete genome sequencing for 2000 children) and epigenetic (methylation sampling of 1000 children) data were collected during this study [
19].
The federated model is most often used in multi-center studies, in large national biobanks such as the UK Biobank [
21], and in meta-analysis projects combining data from different registries or databases. It requires new methods and systems to handle large-scale data collection and storage.
One example is the Cross Study funded by the National Institutes of Health (NIH). In this project, data are combined from three ongoing longitudinal studies of adolescent development, with a specific emphasis on identifying developmental pathways that are prominent in substance use and substance use disorder [
22]. All three studies oversampled offspring who had at least one biological parent affected by alcohol use disorder and comprised a matched sample of healthy control offspring of unaffected parents. The Michigan Longitudinal Study [
23] is the first of these studies; it collected a comprehensive dataset in a large sample of 2–5-year-old children who were evaluated in four waves of surveys up to early adulthood. The Adolescent and Family Developmental Project [
24] is the second study; it recruited families of adolescents aged 11–15, with survey waves extending well into adulthood. The Alcohol, Health, and Behavior Project [
25] is the third study; it included intensive assessments of college freshmen, who participated in more than six waves of surveys up to their thirties. Collectively, these three studies span the first four decades of life, mapping the phases when early risk factors for later substance outcomes first emerge (childhood), when substance use initiation typically occurs (adolescence), when peak rates of substance use disorders are evident (young adulthood), and when deceleration in substance involvement occurs (adulthood). Conducting such integrative analyses can, however, be an extremely complex and challenging task: key practical issues associated with data acquisition and data management are often outweighed by the multitude of difficulties that arise from at times substantial study-to-study differences [
22].
Similarly, the European Union-funded ongoing project Childhood and Adolescence Psychopathology: unraveling the complex etiology by a large Interdisciplinary Collaboration in Europe (CAPICE—
https://www.capice-project.eu/) [
26] is currently working to create a facility for federated analyses, which requires the databases to have a common structure. CAPICE brings together data from eight population-based birth and childhood (twin) cohorts to focus on the causes of individual differences in childhood and adolescent psychopathology and its course. However, different cohorts use different measures to assess childhood and adolescent mental health. These instruments assess the same dimensions of child psychopathology, but they phrase questions in different ways and use different response categories. Comparing and combining results across cohorts is most efficient when a common unit of measurement is used.
Another project, the Biobank Standardisation and Harmonization for Research Excellence in the European Union (BioSHare) study, built a federated facility using the Mica-Opal federated framework, aiming to build a cooperative group of researchers and to develop tools for data harmonization, database integration, and federated data analyses [
7]. New database management systems and web-based networking technologies are at the forefront of providing solutions for federated facilities [
7]. Furthermore, GenomEUtwin is a large-scale biobank-based research project that integrates massive amounts of genotypic and phenotypic data from distinct data sources located in several European countries and Australia [
2]. The federated system is a network called TwinNET, which is used to exchange data and pool analyses. The system pools data from over 600,000 twin pairs, together with genotype information from a subset of them, with the goal of detecting genetic variants related to common diseases. The network architecture of TwinNET consists of a Hub (the integration node) and Spokes (data-providing centers, such as twin registers). Data providers initiate connections using virtual private network tunnels, which provide security. This approach also allows for the storage and combination of two databases, the genotypic and the phenotypic database, which are often stored in different locations [
27]. The development of the GenomEUtwin facility started from the integration of a limited number of variables that are simple and non-controversial, and it is intended to include additional variables that are standard for the worldwide twin community. Most European twin registries do not hold genotypic or phenotypic information from non-twin individuals, but some do, and GenomEUtwin intends to take advantage of those samples. The advantage of this structure is that completely new variables can be stored as soon as they emerge, without changing the database structure. Applying the same variable names and value formats to the variables common to all databases brings several further advantages [
27]. Here, we describe the process of building a federated facility, divided into separate steps (see
Table 1).
4. Data Integration
Built-in security features of database management systems can limit access to the whole dataset of a federated facility, and security can be further increased by using encryption. To facilitate the integration of different datasets among various cohorts, several solutions can be applied: establishing a common variable format and standard, creating a unique identifier for all individuals in the cohorts, implementing access controls and integrity constraints in the database management system, and building automated integration algorithms in the core module to synchronize or federate multiple heterogeneous data sources.
There are three steps in data integration: (1) extraction of data and harmonization into a common format at a data provider site; (2) transfer of harmonized data to a data-collecting center for checking; and, (3) data load into a common database [
2].
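A skeletal illustration of these three steps is sketched below in R; the column names, plausibility checks, and the loading step are placeholders rather than the procedure of any particular project:

```r
# Step 1: extract and harmonize into a common format at the data-provider site
# (the local column names and codings below are placeholders).
harmonize_local <- function(raw) {
  data.frame(
    id         = raw$participant_id,
    sex        = factor(raw$gender, levels = c(1, 2), labels = c("male", "female")),
    birth_year = raw$yob
  )
}

# Step 2: after transfer to the data-collecting center, run basic checks
check_harmonized <- function(d) {
  stopifnot(!any(duplicated(d$id)),
            all(d$birth_year >= 1900 & d$birth_year <= 2025, na.rm = TRUE))
  d
}

# Step 3: load into the common database (here simply appended to a central table)
load_into_common_db <- function(d, central) rbind(central, d)

# Hypothetical usage for one data provider
raw     <- data.frame(participant_id = 1:3, gender = c(1, 2, 2), yob = c(1990, 1985, 2001))
central <- data.frame(id = integer(), sex = factor(levels = c("male", "female")),
                      birth_year = numeric())
central <- load_into_common_db(check_harmonized(harmonize_local(raw)), central)
```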
Harmonization is a systematic approach that allows the integration of data collected in multiple studies or from multiple sources. Sample size can be increased by pooling data from different cohort studies. However, individual datasets may comprise variables that measure the same construct in different ways, which hinders the efficacy of pooled datasets [
29]. Variable harmonization can help to handle this problem.
A federated facility might also be created without the need for harmonization (e.g., for cohorts that hold some data (e.g., genotypes) in one place and other data (e.g., phenotypes) in another, or that have a connection to national registries). For example, in the Genome of the Netherlands Project (
http://www.nlgenome.nl/), nine large Dutch biobanks (~35,000 samples) were imputed with the population-specific reference panel, an approach that led to the identification of a variant within the ABCA6 gene that is associated with cholesterol levels [
30].
The potential for harmonization can be evaluated from the studies’ questionnaires, data dictionaries, and standard operating procedures. Harmonized datasets available on a server in each research center across Europe can be interconnected through a federated system to allow integration and statistical analysis [
7]. To co-analyze harmonized datasets, the Opal [
31], Mica software [
7], and the DataSHIELD package within the R environment are used to generate a federated infrastructure that allows investigators to jointly analyze harmonized data while retaining individual-level records within their respective host organizations [
7]. The idea is to generate harmonized datasets on local servers in each host organization, which can then be securely connected using encrypted remote connections. With strong collaboration among the contributing centers, this approach can enable seamless parallel analyses of globally harmonized research databases while permitting each study to maintain complete control over its individual-level data [
7].
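A minimal, hypothetical DataSHIELD-style workflow is sketched below; the server URLs, credentials, project/table names, and variables are placeholders, and the exact argument names may differ across DSI/DataSHIELD versions:

```r
# Hypothetical DataSHIELD workflow; URLs, credentials, project/table and variable
# names are placeholders, and argument names may vary across package versions.
library(DSI)
library(DSOpal)
library(dsBaseClient)

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohortA", url = "https://opal.cohort-a.example",
               user = "user", password = "password",
               table = "project.core", driver = "OpalDriver")
builder$append(server = "cohortB", url = "https://opal.cohort-b.example",
               user = "user", password = "password",
               table = "project.core", driver = "OpalDriver")

# Log in to all servers and assign the harmonized table to the symbol "D"
conns <- DSI::datashield.login(logins = builder$build(), assign = TRUE, symbol = "D")

# Only non-disclosive aggregates and model coefficients are returned to the analyst
ds.mean(x = "D$bmi", datasources = conns)
ds.glm(formula = "D$case ~ D$bmi + D$sex", family = "binomial", datasources = conns)

DSI::datashield.logout(conns)
```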
Data harmonization is implemented in light of several factors, for instance, exact or partial variable matching on the question asked/answered, the recorded answer (value definition, value level, data type), the frequency of measurement, the period of measurement, and missing values [
29].
For instance, in the context of the CAPICE project, the important variables are those concerning demographics (i.e., sex, family structure, parental educational attainment, parental employment, socio-economic status (SES), and the individual’s school achievements), mental health measures (for psychopathology as well as for wellbeing and quality of life) collected from various raters (mother, father, self-report, teacher), pregnancy/the perinatal period (i.e., alcohol and substance use during pregnancy, birth weight, parental mental health, breast feeding), general health (i.e., height, weight, physical conditions, medication, life events), family (i.e., divorce, family climate, parenting, parental mental health), and biomarkers (genomics, epigenetics, metabolomics, microbiome data, etc.). All of these data, gathered from children and parents, are harmonized using various procedures.
Variable manipulation is not necessary if the question asked/answered and the recorded answer are the same in both datasets [
29]. If the recorded response is not the same, it is re-categorized/re-organized to improve the comparability of records from the two datasets. Missing values are generated for each remaining unmatched variable and are replaced by multiple imputation if the same construct is measured in both datasets, even when different methods/scales are used; a scale that is applied in both datasets is taken as the reference standard [
29]. If variables are measured several times and/or in distinct periods, they are harmonized accordingly, for example by trimester of gestation. Lastly, the harmonized datasets are assembled into a single dataset.
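As a toy illustration of such re-categorization (the variable, scales, and category mappings are hypothetical), two cohorts recording parental education on different scales could be mapped onto a common three-level variable:

```r
# Cohort A records education in 5 categories, cohort B in 3; both are mapped
# onto a common variable with levels low / medium / high (hypothetical coding).
harmonize_education <- function(x, cohort) {
  if (cohort == "A") {
    cut(x, breaks = c(0, 2, 4, 5), labels = c("low", "medium", "high"))
  } else {
    factor(x, levels = 1:3, labels = c("low", "medium", "high"))
  }
}

edu_a <- harmonize_education(sample(1:5, 10, replace = TRUE), cohort = "A")
edu_b <- harmonize_education(sample(1:3, 10, replace = TRUE), cohort = "B")
```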
5. Meta-Analysis and Mega-Analysis
Researchers are currently analyzing large datasets to clarify the biological underpinnings of diseases that, particularly for complex disorders, remain obscure. However, due to privacy concerns and legal complexities, data hosted in different centers cannot always be shared directly. In practice, data sharing is also hindered by the administrative burden associated with transferring huge volumes of data. This situation has led researchers to look for analytical solutions within meta-analysis or federated learning paradigms. In the federated setting, a model is fitted without sharing individual data across centers, exchanging only model parameters. Meta-analysis instead performs statistical testing by combining results from several independent analyses, for instance by sharing p-values, effect sizes, and/or standard errors across centers [
32].
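For example, a fixed-effect, inverse-variance-weighted meta-analysis requires only per-study effect sizes and standard errors. The sketch below (plain R, with hypothetical inputs) shows the pooling step:

```r
# Fixed-effect inverse-variance meta-analysis from per-study summaries only.
# beta: per-study effect estimates; se: their standard errors (hypothetical values).
meta_fixed <- function(beta, se) {
  w         <- 1 / se^2                    # inverse-variance weights
  beta_pool <- sum(w * beta) / sum(w)
  se_pool   <- sqrt(1 / sum(w))
  z         <- beta_pool / se_pool
  list(beta = beta_pool, se = se_pool,
       p = 2 * pnorm(-abs(z)))             # two-sided p-value
}

meta_fixed(beta = c(0.12, 0.08, 0.15), se = c(0.05, 0.04, 0.07))
```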
Lu and co-authors have recently proposed two extensions of the data-splitting approach for meta-analysis: splitting within a cohort and splitting cohorts [
9]. In the first method, the data of each cohort are divided, followed by variable selection on one subset and calculation of p-values on the other subset, and then by meta-analysis across all cohorts [
9]. This is a straightforward extension of the data-splitting approach to multiple cohorts. The second method splits the cohorts instead: the cohorts are divided into two groups, one group being used for variable selection and the other for obtaining p-values and for the meta-analysis. This is a more practical method, since it reduces the analysis burden for each study and decreases the possibility of errors [
9].
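A schematic version of the second (cohort-splitting) strategy is sketched below with simulated data and an arbitrary screening threshold; it is a conceptual illustration, not the implementation used by Lu and co-authors:

```r
# Conceptual sketch of the "splitting cohorts" strategy: one group of cohorts
# is used for variable selection, the other for p-values and meta-analysis.
set.seed(42)
simulate_cohort <- function(n) {
  x <- matrix(rnorm(n * 10), n, 10, dimnames = list(NULL, paste0("x", 1:10)))
  data.frame(y = 0.3 * x[, 1] + rnorm(n), x)
}
cohorts   <- replicate(4, simulate_cohort(300), simplify = FALSE)
selection <- cohorts[1:2]   # cohorts used only for variable selection
inference <- cohorts[3:4]   # cohorts used for inference and meta-analysis

# Variable selection on the selection cohorts (arbitrary screening threshold)
screen_p <- sapply(paste0("x", 1:10), function(v) {
  min(sapply(selection, function(d)
    summary(lm(reformulate(v, "y"), data = d))$coefficients[2, 4]))
})
selected <- names(screen_p)[screen_p < 0.001]

# Per-cohort estimates of the selected variables in the held-out cohorts;
# these effect sizes and standard errors would then be meta-analyzed.
estimates <- lapply(inference, function(d)
  summary(lm(reformulate(selected, "y"), data = d))$coefficients[selected, 1:2, drop = FALSE])
```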
As the focus of a meta-analysis is on summary statistics obtained from several studies, this method is most useful when the original individual records used in the prior analyses are not accessible or are no longer available [
22]. The individual-level information can be pooled into a single harmonized dataset upon which mega-analyses are carried out [
33]. The increased flexibility in handling confounders at the individual patient level and in assessing the impact of missing data are substantial benefits of a mega-analytical method [
34]. Mega-analyses have also been advocated as a way to avoid the assumption of within-study normality and to account for within-study variances, which are particularly challenging with small samples. Despite these benefits, mega-analysis requires homogeneous datasets and the creation of a shared centralized database [
34].
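The contrast with meta-analysis can be made concrete: in a mega-analysis, the harmonized individual-level records are stacked into one dataset and a single model is fitted, typically adjusting for study membership. The sketch below uses simulated data and a simple fixed study effect; a random-effects specification is a common alternative:

```r
# Mega-analysis sketch: stack harmonized individual-level data from all studies
# and fit one model, adjusting for study membership (simulated data).
set.seed(7)
make_study <- function(id, n, intercept) {
  x <- rnorm(n)
  data.frame(study = id, x = x, y = intercept + 0.25 * x + rnorm(n))
}
pooled <- rbind(make_study("A", 300, 0.0),
                make_study("B", 200, 0.5),
                make_study("C", 250, -0.3))

# Single model on individual-level data with a fixed effect for study
fit <- lm(y ~ x + factor(study), data = pooled)
summary(fit)$coefficients["x", ]
```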
Meta-analysis has several disadvantages, including potentially high levels of heterogeneity [
35], unmeasured confounders [
36], and susceptibility to the ecological fallacy [
37]. In addition, most of the primary studies included in meta-analysis are conducted in developed or western countries [
38]. However, there are numerous benefits to fitting models directly to the original raw data instead of relying on summary statistics. Recent technological developments (such as a greater capacity for data sharing and wide opportunities for electronic data storage and retrieval) have increased the feasibility of retrieving original individual records for secondary analysis. This creates new opportunities for developing approaches that integrate results across studies using original individual records, thereby overcoming some of the inherent limitations of meta-analysis [
22].
Here, we focus on approaches to integrative data analysis within the psychological sciences, as approaches to data collection can differ across disciplines.
7. Different Approaches to Dealing with Missing Data in Cohorts
Different types of missing data can appear in the phenotypic and genotypic databases of cohorts. These include, but are not limited to: non-response, data from participants excluded from the study at follow-up, structurally missing data (when data are irrelevant in a given context), “do not know” responses, no answer to a specific item, and missing data due to error codes.
There are several approaches for dealing with missing data: complete-case analysis, the last observation carried forward (LOCF) method, the mean value substitution method, the missing indicator method, and multiple imputation (MI).
The complete-case analysis includes only participants with complete data in all waves of data collection, thus possibly decreasing the accuracy of the estimates of exposure-outcome relations [
13]. To be valid, complete-case analyses assume that participants with missing data can be considered a random sample of those intended to be observed (generally referred to as missing completely at random (MCAR)) [
13], or at least that the probability of data being missing does not depend on the observed values [
39]. Further, LOCF imputes missing data in longitudinal studies with the non-missing value from the previous completed time-point for the same individual, an approach that ignores any change in the subject’s measurements after that time-point. Mean value substitution replaces the missing value with the average of the values available from the individual’s other time-points in a longitudinal study [
40]. The missing indicator method creates an additional analysis category for participants with missing data [
41].
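The sketch below illustrates these simpler strategies on a toy longitudinal dataset (base R, hypothetical values); it is intended only to make the definitions concrete:

```r
# Toy longitudinal data: one row per participant, one column per wave (hypothetical)
d <- data.frame(wave1 = c(5, 7, 6, 8),
                wave2 = c(6, NA, 7, NA),
                wave3 = c(NA, NA, 8, 9))

# Complete-case analysis: keep only participants observed at every wave
complete_cases <- d[complete.cases(d), ]

# Last observation carried forward (LOCF), applied row-wise across waves
locf <- t(apply(d, 1, function(x) {
  for (j in seq_along(x)[-1]) if (is.na(x[j])) x[j] <- x[j - 1]
  x
}))

# Mean value substitution: replace missing values with the participant's own mean
mean_sub <- t(apply(d, 1, function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }))

# Missing indicator: flag participants with any missing wave as an extra category
d$any_missing <- as.integer(!complete.cases(d[, c("wave1", "wave2", "wave3")]))
```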
Missing data can be handled by means of multiple imputation [
42,
43,
44,
45]. MI methods are used to address missing data, and their assumptions are more flexible than those of complete-case analysis [
46]. The principle of MI is to substitute missing observations with plausible values multiple times and generate complete data sets [
47]. Each complete dataset is analyzed individually, and the results of the analyses are then combined. MI results are consistent when the missing-data mechanism satisfies the MCAR or missing at random (MAR) assumption [
48,
49]. Multiple imputation consists of three steps. The first step is (1) to determine which variables are used for imputation. The variables used for imputation should be selected such that the missing information can plausibly be treated as MAR [
43], that is, conditional on the observed data, whether a score is missing should not depend on the value that is missing [
42]. The variables that cause the missingness are unknown to the researcher unless missingness is, to some extent, expected. In practice, variables are selected that are expected to be good predictors of the variables containing missing values. One can choose how many and which variables to use, but there is no direct way to verify whether MAR holds; MAR remains an assumption. Binary or ordinal variables may be imputed under a normality assumption and then rounded off to discrete values. If a variable is right-skewed, it might be modeled on a logarithmic scale and then transformed back to the original scale after imputation [
42]. We should also impute variables that are functions of other (incomplete) variables. Many datasets contain transformed variables, sum scores, interaction variables, ratios, and so on, and it can be useful to integrate these transformed variables into the multiple imputation algorithm [
50]. The second step is (2) to generate the imputed data matrices. One of the tools that can be used is the R package MICE (Multivariate Imputation by Chained Equations) [
50], which uses an iterative procedure in which each variable is sequentially imputed conditional on the observed and imputed values of the other variables. The third step of the multiple imputation procedure is (3) to analyze each imputed dataset as desired and to pool the results [
51].
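A minimal version of these three steps with the mice package might look as follows (the variables are hypothetical, and settings such as m = 5 imputations and predictive mean matching are illustrative defaults):

```r
library(mice)

# Hypothetical dataset with missing values in the exposure and a covariate
set.seed(123)
n <- 500
dat <- data.frame(outcome  = rnorm(n),
                  exposure = rnorm(n),
                  covar    = rnorm(n))
dat$exposure[sample(n, 80)] <- NA
dat$covar[sample(n, 60)]    <- NA

# Step 1 is the choice of variables entering the imputation model (here: all three).
# Step 2: generate several imputed datasets by chained equations.
imp <- mice(dat, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Step 3: analyze each imputed dataset and pool the results (Rubin's rules).
fit    <- with(imp, lm(outcome ~ exposure + covar))
pooled <- pool(fit)
summary(pooled)
```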
8. Discussion and Future Perspective
The current narrative review focuses on the rationale for a federated facility, as well as on the challenges encountered and the solutions developed when attempting to maximize the advantages obtained from a federated facility of cohort studies. Assembling individual-level data can be useful, particularly when the results of interest are relevant across studies. There are several benefits of a federated facility and of harmonizing cohorts: integrating harmonized data increases sample sizes, improves the generalizability of results [
1,
7,
52], ensures the validity of comparative research, creates opportunities to compare different population groups by filling gaps in the distribution (different age groups, nationalities, ethnicities, etc.), facilitates more efficient secondary use of data, and offers opportunities for collaborative and consortium research.
Data pooling of different cohort studies faces many hurdles, including interoperability, shared access, and ethical issues, when cohorts working under different national regulations are integrated.
It is essential that strong collaboration among different parties exists to effectively implement database federation and data harmonization [
7]. A federated framework allows investigators to analyze data safely and remotely (e.g., producing summary statistics, contingency tables, and logistic regressions), facilitating data accessibility and reducing time constraints without the burden of filing several data access requests at various research centers, thereby saving principal investigators and study managers time and resources [
7].
An important aspect of our review is to provide insights into the large samples that result from merging datasets. Meta-analysis or mega-analysis of studies might lead to more robust estimates of the magnitude of associations, ultimately increasing the generalizability of findings [
33]. As more thorough computations can be accomplished in a mega-analysis, some researchers consider mega-analysis of individual-participant data to be more efficient than meta-analysis of aggregated data [
34]. The mega-analytical framework appears to be the more robust methodology, given the relatively high amount of variation detectable among cohorts in multi-center studies.
In cohort studies, several methods are used to deal with missing data in exposure and outcome analyses. The most common method is to perform a complete-case analysis, an approach that might generate biased results if the missing data do not satisfy the missing completely at random (MCAR) assumption. Complete-case analysis yields consistent results only when the probability of missingness does not depend on disease and exposure status simultaneously. Nowadays, researchers are using advanced statistical modeling procedures (for example,
MI and Bayesian methods) to handle missing data. Combining studies within a Bayesian framework enables researchers to quantify the relative evidence for multiple hypotheses using information from multiple cohorts [
44]. Missingness is a typical problem in cohort studies, and it is likely to introduce substantial bias into the results. We highlighted how inconsistent recording of missing data in cohort studies and the continued use of inappropriate approaches to handling missing data in the analysis can, if not dealt with properly, substantially affect study findings, leading to inaccurate estimates of associations [
13]. Increasing the quality of the study design and of the phenotyping should be a priority to decrease the amount and impact of missing data. Robust and adequate study designs minimize the additional demands on participants and clinicians beyond routine clinical care, an aspect that encourages the implementation of pragmatic trial designs [
53].
Thanks to data-harmonization techniques, the organization of databases can facilitate the use of innovative exploratory tools based on machine learning and data mining for data analysis. In this narrative review, we discussed several approaches to data integration across cohorts, meta-analysis and mega-analysis within a federated framework, and various methods for handling missing data. Further developments will extend the proposed analyses from a multi-center facility to large-scale cohort data, such as in the context of the CAPICE project.