Prediction of Influenza Complications: Development and Validation of a Machine Learning Prediction Model to Improve and Expand the Identification of Vaccine-Hesitant Patients at Risk of Severe Influenza Complications

Influenza vaccinations are recommended for high-risk individuals, but few population-based strategies exist to identify individual risks. Patient-level data from unvaccinated individuals, stratified into retrospective cases (n = 111,022) and controls (n = 2,207,714), informed a machine learning model designed to create an influenza risk score; the model was called the Geisinger Flu-Complications Flag (GFlu-CxFlag). The flag was created and validated on a cohort of 604,389 unique individuals. Risk scores were generated for influenza cases; the complication rate for individuals without influenza was estimated to adjust for unrelated complications. Shapley values were used to examine the model’s correctness and demonstrate its dependence on different features. Bias was assessed for race and sex. Inverse propensity weighting was used in the derivation stage to correct for biases. The GFlu-CxFlag model was compared to the pre-existing Medial EarlySign Flu Algomarker and existing risk guidelines that describe high-risk patients who would benefit from influenza vaccination. The GFlu-CxFlag outperformed other traditional risk-based models; the area under the curve (AUC) was 0.786 [0.783–0.789], compared with 0.694 [0.690–0.698] (p < 0.00001). The presence of acute and chronic respiratory diseases, age, and previous emergency department visits contributed most to the GFlu-CxFlag model’s prediction. When higher numerical scores were assigned to more severe complications, the GFlu-CxFlag AUC increased to 0.828 [0.823–0.833], with excellent discrimination in the final model used to perform the risk stratification of the population. The GFlu-CxFlag can better identify high-risk individuals than existing models based on vaccination guidelines, thus creating a population-based risk stratification for individual risk assessment and deployment in vaccine hesitancy reduction programs in our health system.


Introduction
The human influenza virus causes substantial morbidity and mortality, often reducing the quality of life [1][2][3][4]; outbreaks have attack rates of 10-20 percent, but rates can exceed
Identifying unvaccinated individuals most likely to experience severe complications, and following up with them personally to communicate their risks, is an effective approach to improving influenza vaccination in high-risk individuals.

Aim
This study aimed to develop and validate machine learning (ML) models to identify unvaccinated high-risk individuals, predicting the probability of acquiring influenza and developing influenza-related complications.

Population and Setting
This study was performed at Geisinger (a multi-hospital system in Central and Northeast PA, USA) in collaboration with Medial EarlySign (Hod Hasharon, Israel). The data originated from a de-identified data lake of >641,000 unique individuals who received Geisinger primary care services from 1 October 2008 to 31 January 2018 (i.e., the membership period), when vaccination coverage was 32.9-36.7%. After filtering individuals without longitudinal data, the final cohort consisted of unique unvaccinated individuals, representing 2,318,736 patient years, with influenza and one or more complication(s) within three months, or none for at least nine months after infection (n = 604,389).

Definitions and Registries
Supplementary Table S1 (SuppT1) lists the model and time-window features. An influenza season was defined to begin on 1 September and end on 1 May. The complication follow-up continued until 31 July (Figure 1). Influenza events defined the registries (Figure 2). Cohort membership was based on outpatient encounters. Exclusion criteria and the cohorts used for model testing were determined as shown in Figure 2A.

To mitigate diagnosis inaccuracy, two confidence levels defined two corresponding influenza registries within the cohort (Figure 2B). The Laboratory Test Registry (LabReg) used positive laboratory tests for influenza diagnosis, Supplementary Table S2 (SuppT2). The more broadly defined Phenomic Registry (PheReg) used influenza-like illness (ILI), defined by ICD codes or Tamiflu usage (SuppT2).

Data Pre-Processing
Geisinger stores ICD codes within internal (EDG) codes in Epic software (Madison, WI, USA). For the study, Geisinger EDG and ICD-9 codes were converted to ICD-10 codes (Supplementary Table S3).
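A code conversion of this kind can be sketched as a simple crosswalk lookup; the mapping entries below are hypothetical examples, not Geisinger's actual EDG/ICD-9-to-ICD-10 table (Supplementary Table S3):

```python
# Illustrative sketch of a code crosswalk: ICD-9 (or internal EDG) codes are
# mapped to ICD-10 codes, with a sentinel for unmapped codes so they can be
# reviewed manually. Sample entries are hypothetical.
ICD9_TO_ICD10 = {
    "487.0": "J11.00",  # influenza with pneumonia
    "486":   "J18.9",   # pneumonia, organism unspecified
}

def to_icd10(code, crosswalk=ICD9_TO_ICD10):
    """Return the ICD-10 code for an ICD-9/EDG code, or 'UNMAPPED'."""
    return crosswalk.get(code, "UNMAPPED")
```

Keeping unmapped codes visible, rather than silently dropping them, makes gaps in the crosswalk auditable during pre-processing.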

Severity Tiers
Once placed into a registry, influenza complications were categorized into three severity tiers: death, hospitalization (in-patient or ED visits), and severe illness (e.g., pneumonia) (Supplementary Table S4 and Figure 2).
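The tier assignment can be sketched as a lookup from complication events to severity tiers; the event names below are illustrative stand-ins for the definitions in Supplementary Table S4:

```python
# Sketch of the three-tier severity assignment described above. Tier 1 is
# most severe (death), tier 2 is hospitalization (in-patient or ED visit),
# and tier 3 is severe illness such as pneumonia. Event names are examples.
TIER_BY_EVENT = {
    "death": 1,
    "inpatient_admission": 2,
    "ed_visit": 2,
    "pneumonia": 3,
}

def severity_tier(events):
    """Return the most severe (lowest-numbered) tier among a patient's events."""
    tiers = [TIER_BY_EVENT[e] for e in events if e in TIER_BY_EVENT]
    return min(tiers) if tiers else None
```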

Probability Characterization and Performance Measure Calculation
Influenza cases with non-influenza-related comorbidities were identified to define post-influenza complications properly; the probability equations used to categorize individuals before model training and validation are listed in Figure 3. "True" cases were defined as complications with a preceding influenza event, Equation (1). "Observed" cases were defined as a complication after an influenza event, regardless of possible causation (either "true" cases or random temporal positioning of influenza and unrelated complications), Equation (2), and were estimated by the product of two equations: Equation (3), estimating the true-case probability from the observed cases, and Equation (4), counting the unrelated complications minus observed influenza cases followed by complications.
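The underlying idea can be illustrated as follows; the paper's actual Equations (1)-(4) appear only in Figure 3 and are not reproduced here, so this is a hedged sketch of subtracting the expected coincidental complications, not the authors' exact formulation:

```python
# Illustrative sketch only: some complications follow influenza purely by
# chance, at roughly the background complication rate seen in influenza-free
# individuals, so the observed post-influenza case count is deflated by the
# expected coincidental count. The paper's exact equations are in Figure 3.
def estimate_true_cases(observed_cases, flu_events, background_rate):
    """Estimate complications truly attributable to influenza.

    observed_cases  - complications seen within the window after influenza
    flu_events      - number of influenza events in the cohort
    background_rate - complication rate in individuals without influenza
    """
    expected_coincidental = flu_events * background_rate
    return max(observed_cases - expected_coincidental, 0.0)
```

A correction of this shape is consistent with the Results, where adjusting for non-influenza-related complications reduced the case count by roughly 22%.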

Model Training, Testing, and Validation
The GFlu-CxFlag model was trained on Geisinger's dataset; training and test samples were generated. Each individual was randomly assigned to an ML subset: 70% was assigned to the training subset, 20% to the test subset for model testing, and 10% was saved for model validation.
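A per-individual random split along these lines might look like the following sketch (the seed value is an arbitrary choice for reproducibility, not from the paper):

```python
# Sketch of the 70/20/10 per-individual split described above. Assigning
# whole individuals, rather than individual samples, keeps all of a
# patient's records in one subset and prevents leakage between subsets.
import random

def assign_subset(patient_ids, seed=42):
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    assignment = {}
    for pid in patient_ids:
        r = rng.random()
        if r < 0.70:
            assignment[pid] = "train"
        elif r < 0.90:
            assignment[pid] = "test"
        else:
            assignment[pid] = "validation"
    return assignment
```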

Feature Generation and Selection
A set of categorical features was generated for each sample (e.g., ICD-10 codes, anatomical therapeutic chemical (ATC) codes, hospital admissions and transfers, and current procedural terminology (CPT) codes). Multiple time-window-dependent features were generated for each category and several time windows to create intuitive and explainable features (e.g., pneumonia events over the last five years). The approach (SuppT1) resulted in an extensive matrix with 698,780 features. The ICD-10 features' hierarchies were examined using algorithmic logic and clinical intuition (Supplementary Table S5) to choose between descendants and ascendants when both showed significant dependence.
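Time-window features of this kind can be sketched as counts per (category, window) pair; the window lengths and category names here are illustrative, not the study's full feature set:

```python
# Sketch of time-window count features ("pneumonia events over the last
# five years"): each (category, window) pair yields one named feature.
from datetime import date

WINDOWS_YEARS = (1, 5)  # example windows

def window_counts(events, reference_date):
    """events: list of (category, date) pairs; returns {feature_name: count}."""
    features = {}
    for category in {c for c, _ in events}:
        for years in WINDOWS_YEARS:
            start = date(reference_date.year - years,
                         reference_date.month, reference_date.day)
            key = f"{category}_last_{years}y"
            features[key] = sum(1 for c, d in events
                                if c == category and start <= d <= reference_date)
    return features
```

Naming features this way ("pneumonia_last_5y") keeps them directly interpretable in downstream model explanations.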

Model Development
The classifier used was XGBoost [26], an algorithm from the Gradient Boosting Machines family; it performed better than logistic regression. Model development and tuning used 6-fold cross-validation, maximizing the AUC when testing on unvaccinated individuals to avoid overfitting. The optimization process tested XGBoost parameters with several training and weighting options on training samples, with and without vaccinated individuals (Supplementary Table S6). Blinded validation occurred with subjects randomly placed into ML subsets. For parameter tuning, 156 runs were performed within the MES ML software environment. Supplementary Table S7 lists the XGBoost parameter tests and results. A weighting process (Figure 3, Equation (5)) was used during model training to correct for unrelated complications.
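The fold partitioning and the case-weighting idea can be sketched as follows; the actual weighting rule is the paper's Equation (5) (Figure 3), so the simple down-weighting shown here is an assumption for illustration only:

```python
# Sketch of 6-fold index partitioning and per-sample training weights.
# Observed cases are down-weighted by the estimated probability that the
# complication was truly influenza-related; controls keep weight 1. This
# weighting rule is an illustrative stand-in for the paper's Equation (5).
import random

def six_fold_indices(n_samples, n_folds=6, seed=0):
    """Shuffle sample indices and deal them round-robin into n_folds folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def sample_weight(is_case, p_true_given_observed):
    """Weight observed cases by P(true case | observed case); controls get 1."""
    return p_true_given_observed if is_case else 1.0
```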
After pre-processing and data modeling, two models were selected for final development: the GFlu-CxFlag, a "full" model using 147 data features, including vital signs, laboratory results, and clinical procedures; and the MES Flu Algomarker, built from a smaller feature set obtained by iterative backward feature selection (Supplementary Table S8).

Model Evaluation
The final models were compared with the simpler CDC/WHO risk assessments converted into ML models. Bootstrapping was used to estimate confidence intervals and standard errors of performance measurements. Performance was also compared with XGBoost models trained on age and sex alone, and on age, sex, and comorbidities.
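Bootstrap confidence intervals for the AUC can be sketched as follows (the resample count and seed are illustrative choices, not the paper's):

```python
# Sketch of bootstrapped AUC confidence intervals: resample the evaluation
# set with replacement, recompute the AUC each time, and take percentiles.
import random

def auc(scores, labels):
    """Mann-Whitney AUC: probability a random case outranks a random control."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=200, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        s = [scores[i] for i in idx]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:                 # need both classes in the resample
            aucs.append(auc(s, y))
    aucs.sort()
    lo = aucs[int(alpha / 2 * len(aucs))]
    hi = aucs[int((1 - alpha / 2) * len(aucs)) - 1]
    return lo, hi
```

The bracketed intervals reported in the Results (e.g., 0.786 [0.783-0.789]) are of exactly this percentile form.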

Propensity Analysis for Predicting Potential Vaccination
Because the GFlu-CxFlag model was trained on unvaccinated individuals, inverse propensity weighting (IPW) in the MES environment was used to validate the model and adjust for population bias; it was not used in calculating risk scores. For IPW analysis, the model was trained to predict whether individuals would get vaccinated using historical patient communications (Table 1).
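Given propensity scores from such a model, the re-weighting step can be sketched as follows; the clipping bounds are an illustrative choice, not from the paper:

```python
# Sketch of inverse propensity weighting (IPW): a propensity model predicts
# each individual's probability of vaccination, and unvaccinated individuals
# are re-weighted by 1 / (1 - propensity) so the unvaccinated training
# cohort better resembles the full population. Clipping avoids extreme
# weights from propensities near 1; the bounds here are assumptions.
def ipw_weights(propensities, clip=(0.01, 0.99)):
    lo, hi = clip
    weights = []
    for p in propensities:
        p = min(max(p, lo), hi)
        weights.append(1.0 / (1.0 - p))
    return weights
```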

Bias Assessment
Model bias was evaluated with four sociodemographic characteristics: race, ethnicity, sex, and socioeconomic status (SES); Medicaid insurance was a surrogate for low SES. Sensitivity across different characteristic categories was compared; chi-squared tests determined statistical significance, with two-tailed p < 0.05 criteria defined to identify potential evidence of bias. A reference group, to which all other categories were compared in a pairwise fashion, was chosen for characteristics with more than two categories: White for race; Medicaid for SES.
To probe for possible sources of bias across groups exhibiting model biases, random sampling created sub-groups matched on dimensions for which model performance was expected to vary: age and amount of data (defined as visits over the last five years). Sensitivity was re-evaluated using these matched sub-groups. The same process was applied to a "model" that used a simple age cutoff, classifying individuals > 65 years of age as "high risk", to contextualize bias. Supplementary Table S9 depicts sensitivity for individuals categorized by each attribute of interest.
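A chi-squared comparison of sensitivities between two groups can be sketched as follows (Pearson statistic on a 2x2 table, one degree of freedom, no continuity correction; whether the study applied a correction is not stated, so this is an assumption):

```python
# Sketch of the sensitivity comparison: among true cases in two groups,
# counts of flagged vs. missed cases form a 2x2 table, and a Pearson
# chi-squared statistic tests whether sensitivity differs between groups.
def chi2_2x2(flagged_a, missed_a, flagged_b, missed_b):
    table = [[flagged_a, missed_a], [flagged_b, missed_b]]
    total = flagged_a + missed_a + flagged_b + missed_b
    row = [sum(r) for r in table]
    col = [flagged_a + flagged_b, missed_a + missed_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2
```

The statistic is then compared against the chi-squared distribution with one degree of freedom to obtain the two-tailed p-values reported below.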

Data Features
The data contained about 590,000 individuals per year. The case distribution per year exhibited high variability due to varying influenza severity. The monthly distribution fit seasonal patterns, peaking in January. The LabReg included 25,156 events over 10 years (0.5-1% each year). The PheReg contained 1,300,045 events over 10 years (12.1-17.6%). There were more events for young and elderly individuals each year.
Adjusting for non-influenza-related complications reduced the influenza complications' case count by approximately 22%, indicating that certain post-influenza complications occurred within three months even without a preceding influenza infection. After filtering and matching for the influenza season, the training set (>1.6 million data points) had 2371 features for the GFlu-CxFlag, 334 for the MES Flu Algomarker, and 15 for the CDC/WHO model. Most laboratory features did not contribute significantly to model performance and were eliminated from the MES Flu Algomarker. The addition of the lymphocyte percentage feature slightly improved the full model performance, as did respiratory rate and SpO2. The GFlu-CxFlag model significantly outperformed the CDC/WHO model (p < 0.00001), identifying unique features (Table 4), when a 5% false-positive rate was assigned as the cutoff; other respiratory diseases, age, and previous ED admission contributed most to prediction. Performance when training on both vaccinated and unvaccinated individuals was less robust, even when testing occurred in the cohort containing both unvaccinated and vaccinated individuals. The weighting method used in the training process slightly improved model performance in all analyses, even when measuring AUC without corrections or without IPW on unvaccinated individuals.
Table 4 notes: black shading = laboratory testing with RT-PCR, which was unique but not a model feature because it served as a classifier for the influenza diagnosis; its importance is underscored by the model's prediction when RT-PCR is used to define illness. # = longitudinal trends were used as a measure of the variable. & WHO unique features = lung, heart, kidney, neurologic, liver, and blood disease, plus immunocompromised status, stroke, pregnancy, and work in healthcare; CDC unique features = the same as the WHO features, plus aspirin therapy, long-term care, and race. CDC risks do not include healthcare workers.
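Fixing a 5% false-positive cutoff, as in the comparison above, can be sketched as thresholding at the 95th percentile of control scores; the percentile convention used here is a simplifying assumption:

```python
# Sketch of evaluating sensitivity at a fixed false-positive rate: the alert
# threshold is set so that ~5% of controls score at or above it, and
# sensitivity is the share of cases at or above that threshold.
def sensitivity_at_fpr(case_scores, control_scores, fpr=0.05):
    ranked = sorted(control_scores)
    cutoff = ranked[int((1.0 - fpr) * len(ranked))]   # ~95th percentile
    return sum(s >= cutoff for s in case_scores) / len(case_scores)
```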

GFlu-CxFlag, Comparisons When Substratified by Severe Complications
To support the claim that the GFlu-CxFlag ranks more severe influenza complications higher, the model's discrimination between influenza-complication cases was tested by severity tiers 1 and 2 versus tier 3. The cohort was restricted to individuals who experienced influenza complications (n = 22,116). When the least severe complications (tier 3) were labeled as controls and severity tiers 1 and 2 were labeled as cases, the AUC was 0.596 [0.586-0.606], confirming that the model ranked the more severe cases higher.
Figure 4 shows the feature contributions, ordered by mean absolute Shapley values. The top four contributing features linked the history of respiratory-related and general comorbidities. The most important category was ICD10:J00-J99, a superset of respiratory diseases, followed by years of data, complications, and influenza history. Other contributing clinical characteristics included influenza and vaccination history, smoking, hyperlipidemia, temperature, weight, psychoanaleptic drugs, and lipid-modifying agents. The temporal membership features documented data missingness, important for features that use time windows, and allowed for normalization of numerical features, such as the number of ED visits, substratified by the time period in which they were counted.
In Figure 5, the x-axis represents the feature value, the yellow lines represent the mean outcome over the training set conditioned on the feature value, and the blue line represents the feature's mean Shapley value. The average score, conditioned on feature value, was similar to the mean outcome (data not shown) in all cases. As depicted in Figure 5A, the U shape was expected for the contribution of age; very young and very old individuals have a higher risk of complications. Figure 5B shows that the complication risk increased with the number of respiratory diseases over the last five years, defined by the history of ICD10:J00-J99. The complication risk decreased as time since smoking cessation increased (Figure 5C). The increased risk in individuals who quit smoking long ago (a small set) is not reflected in the Shapley value, indicating that the model did not overfit; instead, the model attributed the higher risk to old age (e.g., 80 years since quitting means the individual is old). Figure 5D shows a U-shaped behavior in the mean outcome as a function of body mass index (BMI); a young age is a likely confounder associated with a lower BMI. A high BMI was an independent risk factor, reflected in the mean Shapley values, which remained low at a low BMI but increased monotonically with a higher BMI.
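The Shapley attribution itself can be illustrated on a toy model. The paper relies on the tree-based SHAP approximation for XGBoost; the exact enumeration below is only feasible for a handful of features, and the two-feature model is hypothetical:

```python
# Toy illustration of Shapley values for model explanation: a feature's
# Shapley value is its marginal contribution to the model output, averaged
# over all orders in which features can be revealed (exact enumeration,
# exponential in the number of features; real pipelines use tree SHAP).
from itertools import permutations

def shapley_values(model, x, baseline):
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        present = list(baseline)           # start from the baseline input
        for feat in order:
            before = model(present)
            present[feat] = x[feat]        # reveal this feature's true value
            phi[feat] += model(present) - before
    return [v / len(orders) for v in phi]
```

For an additive model the attributions recover the coefficients, and in general the values sum to the difference between the model output at x and at the baseline (the efficiency property behind plots like Figure 4).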

Post-Processing (GFlu-Cx Flag Bias Assessment)
Post hoc analysis is depicted in Table 5 and Supplementary Table S9. After age-matching, the difference in sensitivity between Black and White individuals was no longer significant (X2 = 1.83, p = 0.176), suggesting that age differences between groups may drive the observed bias. A simplistic "model" tagging anyone > 65 years old as high risk would produce a stronger White-favoring bias (X2 = 123.56, p < 0.001). There was a significant difference in model sensitivity between White and Asian individuals (X2 = 7.89, p = 0.005); this difference decreased but remained significant after age-matching. For sex, the model revealed greater sensitivity for female than for male individuals (X2 = 61.54, p < 0.001); this effect remained after age-matching.
Table 5 notes: * significantly different at p < 0.05 compared to the reference category, always listed first; CI = confidence interval; LA = Latin American. 1 Other race and insurer categories exist, but each composes less than 1% of the population. 2 Patients enrolled in Medicaid at any point in the last 11 years were placed in this category, even if they later shifted insurance (e.g., aged into Medicare).

Discussion
The human and healthcare burden of influenza remains high [18]; therefore, a process to improve risk stratification was created. The GFlu-CxFlag improved sensitivity for identifying unvaccinated individuals at the highest risk for influenza and complications by 86% compared with the CDC/WHO model when a 5% false-positive rate was the cutoff. The improvement will identify 33.1% of influenza complications, compared with 17.8% when the CDC/WHO model is used with Geisinger data. The GFlu-CxFlag is generalizable to other data-rich organizations; the MES Flu Algomarker and the CDC/WHO model could be implemented using most current electronic health software programs.
The bias analysis did not reveal any significant biases against Black, Hispanic, or Latin American individuals; Medicaid patients; or females, which could not be accounted for by differences in predictive features, such as age or number of visits. For Black individuals, subpopulation differences in age appear to account adequately for the lower observed sensitivity, suggesting that individuals of the same age as their White counterparts should be flagged as being at the same risk as identified by other predictors. GFlu-CxFlag use may be more limited for Black individuals when compared with White individuals; however, the model results in an almost threefold improvement in performance for this group when pragmatically compared with the typical age-based risk-stratification method. Similarly, insurance coverage disparities between Medicare and Medicaid are significantly reduced when accounting for age, suggesting the model may not be biased against poorer populations and favors these individuals in some cases. The bias evaluation indicates the model is appropriate and highlights steps to identify sources of bias and make future model adjustments.
Geisinger's data-rich environment is a study advantage due to population longevity and the low percentage of geographic movement. Limitations may include a high insurance coverage rate for individuals, including healthcare employees (commercial insurance coverage 48.5%, 36.1% Medicaid, and 14.5% Medicare).
Due to biases in the underlying data or the social processes that generate them, ML algorithms can propagate or exacerbate biases against under-represented groups traditionally facing discrimination. After accounting for age, bias remains against one group: Asian individuals (N = 87); results should be interpreted with caution.
The GFlu-CxFlag was impacted by inaccurate ILI documentation, since it encapsulates both general fragility risk and the probability of ILI, which is challenged by medical coding heterogeneity. The impact of accurate test results is difficult to disentangle. Due to the model's elimination process, many different solutions occur. "Richer", more common information sources, such as diagnosis codes and medications, are important for broad inclusion; therefore, more specific laboratory tests were saved until the end of the elimination process, the likely reason for their small, redundant impact. Future study of the variable elimination "order" could lead to a more comprehensive understanding of the model.
The RT-PCR impact cannot be discounted because the effect was absorbed into the diagnosis and complications of influenza, thereby "flowing" through other data sources. RT-PCR counts were lower in the early years, minimizing test impact by approximately 30%. Based on the higher AUC of the LabReg, an accurate identification of influenza could continue to improve model prediction in the future.
Despite the promising results, the model must perform well over time and in other organizations. Users who do not use the MES ML environment would need to recreate models with their data. Several population-based models, including Google's Flu Trends [19], attempted to describe the general severity of influenza seasons. Nonetheless, there is disagreement on how helpful predictive modeling is and what benefit it serves for a healthcare community (https://time.com/23782/google-flu-trends-big-data-problems, accessed on 1 July 2022). If GFlu-CxFlag was applied prospectively, seasonal variables would need to be estimated.
The Geisinger Flu-Complications Flag (GFlu-CxFlag), created in conjunction with Medial EarlySign (MES), uses many more conditions than other models. According to 2020 population data, the model covers nearly 641,000 unique individuals in the health system's entire primary care population, serving a catchment area of approximately three million people in a rural region of the United States. The 10% at highest risk for influenza complications were identified as high risk. Extrapolated to the US, recognizing the top 10% could flag over 33 million high-risk individuals, and globally 770 million. Healthcare systems could adapt the model to target vaccination outreach more effectively than using age, sex, and comorbidity cutoffs alone. Because different healthcare systems may not capture the same variables used in this study, the study can still help identify core model parameters in other centers. Finally, this work has implications for identifying risk factors for COVID-19 to advance the prediction of the first version of the MES COVID Complications AlgoMarker.

Conclusions
The GFlu-CxFlag is a significant new contribution to risk-stratification strategies, supporting more accurate risk calculation for influenza-related morbidity and mortality by identifying key factors contributing to severe complications in different sub-groups of individuals. Using a GFlu-CxFlag-like approach, healthcare organizations could combine their risk-stratification and vaccination efforts to advance vaccine uptake.
The findings add to the scientific literature that may help mitigate the impact of vaccine hesitancy. The World Health Organization (WHO), the US Centers for Disease Control and Prevention (CDC), and the Israeli Ministry of Health (MOH) recommend vaccination for the entire population at six months of age and older, with an emphasis on the importance of vaccination for people at a higher risk of severe influenza complications. According to the CDC, high-risk groups include individuals with long-term diseases, such as acquired or congenital cardiovascular disease, congestive heart failure, atherosclerosis, diabetes, and other chronic metabolic diseases; chronic lung diseases, including asthma; chronic liver disease; chronic kidney disease and urinary tract infections; neurological and hematological diseases; and diseases accompanied by immunosuppression, including AIDS and malignant diseases. Additional special high-risk populations are pregnant and post-partum women, children aged 6 months to 6 years (especially up to the age of 2 years), children aged 6 months up to 18 years who receive long-term aspirin therapy, and individuals 50 years old and above, especially 65 and above. The WHO further identifies pregnant women as the highest-risk priority. This study uses primary care data and machine learning modeling to improve on the CDC/WHO guidelines for predicting the risk of future morbidity and mortality from influenza infections by 86%.
Our machine learning (ML) approach to risk stratification provides an essential new contribution to the field by determining baseline rates of morbidity and mortality that reflect conditions other than age, sex, and limited comorbidities. The approach allows for a more accurate calculation of influenza-related morbidity and mortality, which could be generalizable to influenza vaccine campaigns and provide helpful information to policymakers. Future research can use these tools and strategies to understand vaccine campaigns for COVID-19. Adopting the GFlu-CxFlag could expand the identification of high-risk individuals, reducing influenza's human and organizational impact. If the GFlu-CxFlag were adopted for predicting influenza-associated complications, the results would translate to the identification of approximately 64,000 high-risk individuals in a Geisinger-like system serving a catchment area of roughly three million individuals. Extrapolated to the US, the prediction could reach 33 million individuals, and 770 million globally.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm11154342/s1. Table S1: Time windows and features; Table S2: Registries; Table S3: ICD EDG and CPT codes; Table S4: Tiers of confidence for influenza diagnosis and levels of severity for influenza complications; Table S5: Steps in deployment for the rule set that informed data filtering for the model(s); Table S6: Data filtering that occurred according to a rule set, ordered in a stepwise manner; Table S7: XGBoost parameter tests and results; Table S8: Attributes of the full GFlu-CxFlag and MES minimal models; Table S9: Bias analysis using a simple age cutoff to classify individuals > 65 years old as high risk, by each attribute of interest.

Funding: This research was funded by Medial EarlySign (MES), grant number 62405101. MES accessed a de-identified data lake and performed machine learning. Geisinger created the data lake and reviewed data summaries and data interpretation. Both organizations participated in data analysis, writing, and critical review of the data analytics and manuscript. Both teams had full data access and accepted the responsibility to submit the publication.

Institutional Review Board Statement:
The study was conducted per the Declaration of Helsinki and approved by the Institutional Review Board of Geisinger (IRB# 2020-0211, 5 November 2020) for human studies. Data from Geisinger's Phenomic Initiatives Database were used.

Data Availability Statement: Geisinger and its patients own the data used for the project; the data were collected from an existing data lake within the Geisinger data architecture, which contains individuals with a Geisinger PCP. The data can be shared with academic researchers with investigational support to fund the data transfer. Individual participant data and a data dictionary defining each field in the set will be available to others as follows: a de-identified copy of the data lake can be shared if the appropriate documentation and data-use agreement are on file on the publication date and for five years after. Contact the Geisinger Research Institute at irb@geisinger.edu to obtain a data-use agreement.