Estimating Disease-Free Life Expectancy based on Clinical Data from the French Hospital Discharge Database

The development of health indicators to measure healthy life expectancy (HLE) is an active field of research aimed at summarizing the health of a population. Although many health indicators have emerged in the literature as critical metrics in public health assessments, the methods and data to conduct this evaluation vary considerably in nature and quality. Traditionally, health data collection relies on population surveys. However, these studies, typically of limited size, encompass only a small yet representative segment of the population. This limitation can necessitate the separate estimation of incidence and mortality rates, significantly restricting the available analysis methods. In this article, we leverage an extract from the French National Hospital Discharge database to define health indicators. Our analysis focuses on the resulting Disease-Free Life Expectancy (Dis-FLE) indicator, which provides insights based on the hospital trajectory of each patient admitted to hospital in France during 2008-13. Through this research, we illustrate the advantages and disadvantages of employing large clinical datasets as the foundation for more robust health indicators. We shed light on the opportunities that such data offer for a more comprehensive understanding of the health status of a population. In particular, we estimate age-dependent hazard rates associated with sex, alcohol abuse, tobacco consumption, and obesity, as well as geographic location. Simultaneously, we delve into the challenges and limitations that arise when adopting such a data-driven approach.


Introduction
Over the last century, life expectancies have significantly increased. However, this increase has also been accompanied by a rise in the duration of life spent in a state of dependency (Fries 1980; Gruenberg 2005). This underscores the importance of health indicators, such as Healthy Life Expectancy (HLE), in monitoring the overall health of a population. HLE is an umbrella term for a family of health indicators that calculate the expected number of years lived in various health states.
HLE indicators are used at all levels of policymaking, from international to local. Organizations such as the World Health Organization (WHO) and the European Union (EU) incorporate health indicators, Healthy Life Expectancy (HALE) and Healthy Life Years (HLY) respectively, into their frameworks for assessing population health (World Health Organization 2023; Bogaert et al. 2018). Another example is Japan, which has prioritized health as a key policy objective in recent years (Abe 2013).
Despite the consensus on the importance of health indicators, no universally used definition of health has emerged (Chapter 1, Jagger et al. (2020)). The complexity of defining a useful health concept and the multiplicity of existing health concepts and methods to calculate them are well-documented (Kim et al. 2022). However, one aspect appears to remain invariant: the use of surveys.
Surveys are the main source of data on the health status of a population (Chapter 5, Jagger et al. (2020)). Unlike mortality data, which are already available from national statistics agencies that collect them for administrative purposes, health data are harder to come by, and surveys provide the most readily available means of obtaining them. The use of survey data necessarily imposes limits on the data collected. For one, the cost of surveys limits the sample sizes. Constructing survey instruments to be comparable over large areas is challenging (Robine 2003). Moreover, self-evaluation of health is influenced by various factors and can therefore be biased (Kempen et al. 1996; Krause and Jay 1994; Peersman et al. 2012). Finally, survey data do not provide reliable mortality data.
At the same time, the introduction of electronic health records (EHRs) and international diagnostic harmonization has enabled the collection of medical information across large populations, with datasets like the United States' National Hospital Care Survey, the United Kingdom's Hospital Episode Statistics database, and Denmark's National Patient Registry. In this paper, we focus on a subset of the French National Hospital Discharge database (Programme de Médicalisation des Systèmes d'Information). These data cover all hospital discharges from 2010 to 2013 for adults aged 50 and older, and cover, after all exclusions, 10 million unique patients. Each discharge contains the main discharge diagnosis, coded using ICD-10, a standardized international classification of diagnoses (World Health Organization 2015), as well as some demographic information on the patient.
We propose using such large clinical databases to construct health indicators. This proposed approach has many advantages for assessing the health status of a population. First, the use of standardized discharge diagnosis codes, like the ICD-10, simplifies and reinforces cross-regional and temporal comparisons. Second, the involvement of healthcare professionals in diagnosis minimizes biases associated with self-assessment. Third, as the entire population is included, the database can provide a longitudinal view over a lifetime of diagnoses to create a comprehensive health picture. Finally, individual-level data that contain information on both morbidity and mortality avoid the need for aggregation and allow for more nuanced analysis, promising a more profound view of health.
Nonetheless, the clinical view of health has inherent limitations. First, a clinical view of health corresponds necessarily to a negative concept of health, considered inadequate by some (Chapter 1, Jagger et al. (2020)). In clinical settings, the focus is diagnosis and treatment, not holistic health assessment. This divergence yields several notable consequences (Euro-REVES et al. 2000). For instance, preventive measures can avert certain conditions without the need for formal diagnosis. Another concern associated with the clinical perspective is its reliance on healthcare access levels (Sanders 1964). Moreover, the same diagnosis can have varying effects on different individuals. A disease may or may not lead to impairment or disability. For example, two people experiencing a stroke may face different outcomes, a subtlety that may not be accounted for by diagnoses alone. Finally, clinical data represent only a part of the population. Therefore, producing estimates representative of the general population is challenging and requires additional assumptions to correct the selection bias caused by hospital admission. Even with these limitations, we believe clinical data can provide a complementary view of population health.
In this paper, we develop a novel approach to constructing health indicators from the family of Disease-Free Life Expectancy-type indicators using clinical data. Most literature using Dis-FLE focuses on a family of

Description
This study uses a subset of the French National Hospital Discharge database, PMSI, that covers 2010 to 2013. These data cover all hospital visits in Metropolitan France during the observation period. Only hospital stays of people aged 50 and over are included in this subset. For this age category, over 75% of the general population appear in the database. These data were previously used in Schwarzinger et al. (2018) and Schwarzinger (2018). The first reference provides the ICD-10 (International Classification of Diseases, 10th Revision) codes that were used to identify conditions as well as some risk factors. The second reference, however, is in French; we therefore include a brief description here.
For each patient, the data include a series of discharge dates and the associated diagnosis. These enable us to track individual health trajectories over time. A severe condition in this study should be understood as a medical syndrome encompassing multiple diseases or evolving stages with a high risk of disability or death. A typical example of a severe condition is 'dementia,' which includes Alzheimer's disease and related conditions, i.e., all causes of cognitive loss of autonomy (Schwarzinger 2018). The notion of disease-free used in this paper is based on these severe conditions. Some exclusion criteria are applied to construct health indicators relative to a healthy population, in terms of the selected conditions. These criteria are adapted from Schwarzinger (2018). First, we exclude patients observed for any of the severe conditions used to define the healthy population during the period 2008-2010. Here, we assume that individuals within the general population who did not appear in a hospital during this 2-year period for any of the selected conditions, whether for an initial consultation or follow-up for a chronic disease, are in good health, i.e., they are not affected by the consequences of these conditions. Thus, this procedure allows us to obtain, as of January 1, 2010, a selected population without any history of severe conditions for over 2 years. Additionally, we exclude 914,595 individuals hospitalized from 2008 to 2013 for certain chronic conditions (e.g., birth defects, HIV infection, psychiatric disorders, etc.). In this regard, we observe that 375,579 (41%) of these individuals are already included in the first exclusion group. After exclusions, the data include almost 30 million hospital visits and over 10 million unique individuals over the observation period; see Table 9 for the details of the exclusion criteria.
Table 1 describes the information available for individual patients. Basic demographic information is available: year of birth, sex, and approximate place of residence, i.e., the département of residence among the 96 official French administrative departments over the period 2008-2013. To enable the estimation of regular survival functions, a fictitious birthdate is imputed for each patient. Three lifestyle risk factors are inferred from hospital data and prior diagnoses: active tobacco smoking, alcohol use disorders, and obesity (body-mass index ≥ 30 kg/m²). Each risk factor is classified into 3 categories: 0, 1, and 2, with 0 being the absence of risky behavior (Schwarzinger et al. 2018). It should be noted that alcohol or tobacco consumption is defined based on medical codes rather than on patients' self-reporting. Therefore, these variables capture a relatively severe exposure to these factors. Information on education levels and immigration status is a commune-level proxy.
We define disease-free as the absence of new events described in Table 2. This choice is motivated by previous research on this dataset (Guibert, Planchet, and Schwarzinger 2018a; Schwarzinger 2018). These previous works define disability much more narrowly, considering only two events, "Physical dependence" and "Severe dementia". Physical dependence is defined as a bedridden state. Schwarzinger (2018) established that this definition of disability aligns with severe disability as measured by activities of daily living (ADLs). In our study, we aim to broaden our definition of disability to encompass all identified severe events, bringing it closer to a less severe level of activity limitation, similar to the concept of the Global Activity Limitation Instrument (GALI), the measure of disability used for HLY.
The approach used to define disease-free in this study is distinctive, as it covers essentially all diseases that increase the risk of death. Indeed, this list almost exhaustively covers the various causes of death with 98% of the 1,774,703 deaths in the hospital from 2008 to 2013 (Schwarzinger 2018). Moreover, we believe that including such a wide range of diseases brings the resulting Dis-FLE closer to a general notion of population health.

A Preprint
It is worth noting that the list of events used to define disability in this study was not explicitly designed to mirror existing health measures, such as GALI. Instead, it represents, to our knowledge, the closest available approximation using these data. While this approach allows us to assess the merits of using clinical data, it is important to recognize that the indicator used may not capture the same aspects of health as existing health indicators.

Summary statistics
Table 3 gives summary statistics of the population under study. Women represent a larger proportion of the population, for two reasons. First, women tend to live longer, and second, a higher proportion of women has visited hospitals.
The exact age in years is used as the timescale for the analysis. The exact age is the number of years since birth, including the fractional part. Individuals are considered exposed from their 50th birthday to the first adverse event, within the observation period from 2010 to 2013.
For all three risk behaviors, over 85% of the population are in category 0, i.e., absence of the risk factor. This reflects the fact that risk factors represent relatively severe cases of each behavior. The immigration and education variables are grouped into quartiles.
Table 4 shows correlations between the presence of risk factors. Correlations for risk factors are calculated on the indicator variables for any category of risk factor, i.e., category 1 and 2 risk factors are grouped together. For Education and Immigration, the numeric 0-based quartile is taken. All correlations are highly significant (p < 0.001), but most are small. There is a correlation between alcohol consumption and smoking. The correlation between immigration and education is hard to interpret, as it is likely a reflection of postal codes rather than individuals.
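As an illustration of the grouping described above, the sketch below collapses risk categories into 0/1 indicators and computes a correlation between them. It assumes a Pearson correlation (the text does not name the coefficient used), and the function names and toy data are hypothetical.

```python
import math

def any_risk_indicator(categories):
    """Collapse risk categories 0/1/2 into a 0/1 indicator (1 = category 1 or 2)."""
    return [1 if c > 0 else 0 for c in categories]

def pearson_correlation(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical, not from the paper): categories for four patients.
alcohol = [0, 1, 2, 0]
tobacco = [0, 1, 0, 0]
r = pearson_correlation(any_risk_indicator(alcohol), any_risk_indicator(tobacco))
```

On two binary indicators, the Pearson correlation coincides with the phi coefficient, which is one reason it is a natural choice for this kind of table.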

Statistical tools
In our study, we employ two types of models: the Kaplan-Meier estimator for survival curves and the Cox proportional hazards model. See, for example, Klein et al. (2016) for general background on survival models. The Kaplan-Meier estimator stratifies the population and calculates survival curves separately for each stratum. In contrast, the Cox model takes into account all available data and covariates simultaneously. Furthermore, the Cox model offers a method for estimating a survival curve based on the covariates in question. Both methods rely on the assumption that the censoring time is independent of both the exit time and the covariables.
The Kaplan-Meier survival function estimator at time $t$ is given by:
$$\hat{S}(t) = \prod_{i : t_i \le t} \left(1 - \frac{d_i}{n_i}\right), \qquad (1)$$
where:
• $t_i$ is the observed event time for the $i$-th observation,
• $d_i$ is the number of non-censored events at $t_i$,
• $n_i$ is the number of individuals at risk just before $t_i$.
To obtain stratum-specific survival curves the estimator is calculated independently for each subset of data.
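The product-limit computation and its stratified use can be sketched as follows. This is a minimal illustration, not the implementation used for the results: the function names are hypothetical, and the optional entry times are our way of accommodating left truncation (an individual counts as at risk at $t_i$ only if already under observation).

```python
def kaplan_meier(exit_times, events, entry_times=None):
    """Return [(t_i, S(t_i))] at each distinct event time.

    exit_times  : time of event or censoring
    events      : 1 if the exit is an event, 0 if censored
    entry_times : optional delayed-entry (left-truncation) times
    """
    if entry_times is None:
        entry_times = [float("-inf")] * len(exit_times)
    records = list(zip(entry_times, exit_times, events))
    event_times = sorted({stop for _, stop, e in records if e == 1})
    surv, curve = 1.0, []
    for t in event_times:
        # n_i: under observation just before t; d_i: events exactly at t
        n_i = sum(1 for start, stop, _ in records if start < t <= stop)
        d_i = sum(1 for start, stop, e in records
                  if e == 1 and stop == t and start < t)
        if n_i == 0:
            continue  # no one at risk: skip malformed records
        surv *= 1.0 - d_i / n_i
        curve.append((t, surv))
    return curve

def kaplan_meier_by_stratum(exit_times, events, strata, entry_times=None):
    """Run the estimator independently on each stratum (e.g. sex)."""
    curves = {}
    for s in set(strata):
        idx = [i for i, v in enumerate(strata) if v == s]
        curves[s] = kaplan_meier(
            [exit_times[i] for i in idx],
            [events[i] for i in idx],
            None if entry_times is None else [entry_times[i] for i in idx],
        )
    return curves

# Toy data (hypothetical): five exits, two of them censored.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

The quadratic scan over records keeps the sketch short; on data of the size discussed here, a sorted single-pass implementation such as the one in a survival analysis library would be needed.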
The Cox model, in contrast to Kaplan-Meier, is a regression model, as it attempts to establish a link between covariables and the survival time. It does so by assuming that all observations share the same baseline hazard function, $\lambda_0(t)$, that is scaled by the covariables. The Cox model estimates the hazard function as:
$$\lambda(t, X) = \lambda_0(t)\, e^{X\beta}, \qquad (2)$$
where $X$ is the design matrix and $\beta$ are the Cox model coefficients. To obtain survival curve estimates, we also need to estimate the baseline hazard function $\lambda_0$ or, equivalently, its cumulative counterpart $\Lambda_0(t) = \int_0^t \lambda_0(u)\, du$. We use the Breslow estimate for the cumulative baseline hazard function:
$$\hat{\Lambda}_0(t) = \sum_{i : t_i \le t} \frac{\delta_i}{\sum_{j \in R_i} e^{X_j \hat{\beta}}}. \qquad (3)$$
Here, $\delta_i$ represents the event indicator (1 if the event occurred, 0 if censored). The summation is performed over all events $i$ where exit time $t_i \le t$. The denominator calculates the risk set contribution for observations still at risk at $t_i$, with $R_i = \{j : t_j \ge t_i\}$ and $\hat{\beta}$ the maximum partial likelihood estimator of the Cox model coefficients.
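A minimal sketch of the Breslow estimate for a fixed coefficient vector is given below. It assumes that the coefficients have already been estimated elsewhere and ignores left truncation for brevity; the function names and toy data are hypothetical. The second function uses the standard relation between the cumulative baseline hazard and the survival function under proportional hazards.

```python
import math

def breslow_cumulative_hazard(times, events, covariates, beta):
    """Breslow estimate of the cumulative baseline hazard at each event time.

    times      : exit times t_i
    events     : delta_i (1 = event, 0 = censored)
    covariates : one covariate row X_j per observation
    beta       : fitted Cox coefficients (assumed given)
    """
    risk = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in covariates]
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    cumulative, out = 0.0, []
    for t in event_times:
        # Risk set R_i = {j : t_j >= t_i}
        denom = sum(r for tj, r in zip(times, risk) if tj >= t)
        n_events = sum(1 for tj, d in zip(times, events) if d == 1 and tj == t)
        cumulative += n_events / denom
        out.append((t, cumulative))
    return out

def cox_survival(cumulative_baseline_hazard, x, beta):
    """S(t | x) = exp(-Lambda_0(t) * exp(x . beta))."""
    linear_predictor = sum(b * v for b, v in zip(beta, x))
    return math.exp(-cumulative_baseline_hazard * math.exp(linear_predictor))

# Toy data (hypothetical). With beta = 0 the estimate reduces to Nelson-Aalen.
toy_hazard = breslow_cumulative_hazard(
    times=[1, 2, 3], events=[1, 0, 1], covariates=[[0], [1], [0]], beta=[0.0]
)
```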
Overall, for the Cox model, the survival function is estimated using
$$\hat{S}(t \mid X) = \exp\left(-\hat{\Lambda}_0(t)\, e^{X\hat{\beta}}\right). \qquad (4)$$
This basic variant of the Cox model assumes that the conditional hazard functions are all proportional to a baseline hazard function. This assumption is not satisfied for these data. For this reason, we use a variant of the model that allows the hazard ratio to vary over time (in this case, age), thus reducing non-proportionality: $\lambda(t, X) = \lambda_0(t)\, e^{X\beta(t)}$ (Martinussen and Scheike 2006). This procedure requires duplicating each observation for every change in $\beta(t)$. For this reason, instead of using every event time, we choose a coarse grid of ages: steps of 2 years from 50 to 100. This results in a step-function estimate for coefficients with time-dependent effects. We use a natural spline basis to estimate $\beta(t)$. In the rest of the paper, we refer to these time-dependent coefficients as age-dependent, as age is the timescale used for this model.
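The duplication of observations over the age grid can be sketched as follows. This is an illustrative stand-in written in Python, in the spirit of episode splitting, and not the code used in the analysis; the function name and the example record are hypothetical.

```python
# Age grid described in the text: steps of 2 years from 50 to 100.
AGE_GRID = list(range(50, 101, 2))

def split_at_cuts(entry_age, exit_age, event, cuts=AGE_GRID):
    """Split one (entry, exit, event) record at the cut points inside it.

    Each piece covers an age interval on which beta(t) is held constant.
    Only the last piece can carry the event; earlier pieces are censored.
    """
    inner = [c for c in cuts if entry_age < c < exit_age]
    bounds = [entry_age] + inner + [exit_age]
    rows = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        is_last = stop == exit_age
        rows.append((start, stop, event if is_last else 0))
    return rows

# Hypothetical patient observed from age 63.4 to an event at age 66.9.
pieces = split_at_cuts(63.4, 66.9, 1)
```

A coarser grid means fewer rows per individual, which is the trade-off the text describes: the estimated effect can only change at grid points.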
Initial data wrangling is done in SAS. Further data treatment and analysis is done in R (R Core Team 2022).
The Kaplan-Meier survival curves and the Cox model were estimated using methods from the survival package (Therneau 2023). The function survSplit from the survival package is used to split observations over time, as required to estimate age-dependent effects. The splines are implemented using the nsk function from the same package.

Statistical modeling
We analyze health as a censored life duration without disease. Our estimation approach relies on the use of survival models. The observed individual disease-free life durations, denoted $T$, are subject to right censoring and left truncation linked to the observation period. The truncation and censoring dates are assumed to be independent of $T$. An important assumption that we make is that the conditions selected to define Dis-FLE are severe enough to require hospital care. Thus, we consider that the information loss related to patients with these conditions but not observed in hospital induces limited bias.
The duration studied is the disease-free survival, which we define as the time between the start of the observation (either 2010/01/01 or the 50th birthday, whichever comes later) and the end of observation (either 2013/12/31 or the date of the first adverse event, death, or censoring, whichever comes first). Censoring can be due to the end of the observation period on 2013/12/31 or due to being lost to follow-up. For Kaplan-Meier, only sex is used to stratify the population, whereas the Cox model uses many variables as covariables, as described at the end of this section. Both methods allow estimating survival curves.
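The observation window on the age timescale can be sketched as below. This is a simplified illustration under stated assumptions: ages are approximated with 365.25 days per year, loss to follow-up is not modeled, and the function names and example birthdates are hypothetical.

```python
from datetime import date

STUDY_START = date(2010, 1, 1)
STUDY_END = date(2013, 12, 31)
MIN_AGE = 50.0

def age_at(birth, when):
    """Approximate exact age in years, including the fractional part."""
    return (when - birth).days / 365.25

def observation_window(birth, event_date=None):
    """Return (entry_age, exit_age, event) or None if never under observation.

    entry_age : age at study start or at the 50th birthday, whichever is later
                (left truncation)
    exit_age  : age at the event, or at study end if no event was observed
                in the window (right censoring)
    """
    entry_age = max(age_at(birth, STUDY_START), MIN_AGE)
    observed = event_date is not None and event_date <= STUDY_END
    exit_age = age_at(birth, event_date if observed else STUDY_END)
    if exit_age <= entry_age:
        return None  # exits before becoming eligible
    return entry_age, exit_age, int(observed)

# Hypothetical patients.
with_event = observation_window(date(1950, 6, 15), date(2012, 6, 15))
turns_50_in_study = observation_window(date(1962, 1, 1))
too_young = observation_window(date(1970, 1, 1))
```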
We view Dis-FLE as the expected value of the disease-free survival distribution conditional on attaining a certain age. The disease-free survival distribution can either be estimated [...]. We assume that this grid is sufficiently small so that we have $t_{(i)} = t$ for the first $i$ such that $t_{(i)} \ge t$. The first value of the Dis-FLE curve, Dis-FLE(50), is the expected disease-free life duration at 50.
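One common way to turn a survival curve into such a conditional expectation is $\text{Dis-FLE}(t) = \frac{1}{S(t)} \int_t^{\infty} S(u)\, du$, evaluated numerically on a grid. The sketch below assumes this standard definition and a trapezoidal rule truncated at the last grid point; the exact discretization used for the results may differ, and the function name and toy curve are hypothetical.

```python
def dis_fle(grid, surv, t):
    """Expected remaining disease-free years at age t.

    grid : increasing ages t_(1) < t_(2) < ...
    surv : estimated survival S(t_(i)) at each grid age
    t    : age of interest; the first grid point >= t is used
    """
    start = next(i for i, g in enumerate(grid) if g >= t)
    if surv[start] <= 0:
        return 0.0
    # Trapezoidal approximation of the area under S from t to the grid's end.
    area = sum(
        0.5 * (surv[i] + surv[i + 1]) * (grid[i + 1] - grid[i])
        for i in range(start, len(grid) - 1)
    )
    return area / surv[start]

# Toy curve (hypothetical): survival drops linearly from 1 to 0 over 20 years.
toy_grid = [50, 60, 70]
toy_surv = [1.0, 0.5, 0.0]
```

Because the integral stops at the last grid age, the result is a restricted mean; this matters little when the survival function is close to zero at the upper end of the grid.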
We first calculate and present sex-specific survival curves estimated via Kaplan-Meier. We then calculate the corresponding Dis-FLE($t$) for all ages $t \ge 50$. The main part of the analysis is done using a Cox proportional hazards model.
The covariates used in the Cox model include sex, behavioral risk factors, and geographical information. All terms of the Cox model are described in Table 5. Sex and all risk factors have age-dependent coefficients.
Age-dependent coefficients are obtained by including in the model an interaction term between a natural spline as a function of age and the age-dependent effect. The main effects (i.e., without interaction with the age-dependent spline) are not included because they would be collinear with the interaction effect. The relationship with age is modeled using cubic natural splines with 8 degrees of freedom, and with knots at the edges of observed values to prevent linear extrapolation at the extremes. The interaction terms are modeled as a constant offset of the main age-dependent effect.
We usually avoid discussions of p-values, or significance tests, for two reasons. The first is practical: with such large data, almost all comparisons detect significant differences. The second is conceptual: the data analyzed exhaustively cover the studied population; therefore, estimates are not subject to sampling error.
A random 40% of observed individuals is reserved for model validation, which is shown in the Appendix. Indeed, the volume of data is more than sufficient to estimate the model described above, as can be seen from the small standard errors of the estimated coefficients.
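A random hold-out of this kind can be sketched as below. The split is made on individuals rather than on hospital stays, so that all records of a patient fall on the same side; the seed, function name, and toy identifiers are assumptions made for illustration.

```python
import random

def split_train_validation(patient_ids, validation_share=0.4, seed=0):
    """Randomly reserve a share of individuals for validation.

    Returns (training_ids, validation_ids). The seed is arbitrary and only
    makes the split reproducible.
    """
    shuffled = list(patient_ids)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(validation_share * len(shuffled))
    return shuffled[n_validation:], shuffled[:n_validation]

# Toy identifiers (hypothetical).
train_ids, validation_ids = split_train_validation(range(10))
```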

Whole population adjustment
The Metropolitan French PMSI dataset analyzed in this article is limited to individuals who have been hospitalized at some point, forming a non-random sample of the broader French population. Consequently, any calculation of Dis-FLE within this sub-population yields a biased estimate of the true general population indicator, rendering direct comparisons impractical. To make a meaningful comparison with HLY, we make the assumption that individuals not observed in PMSI are in good health and adjust the exposure accordingly.
This disparity is not surprising given the substantial number of individuals who have never been hospitalized. In 2010, France had 22.5 million individuals aged 50 and over (INSEE 2022), but only 10.5 million were observed in hospital and included in this study after various exclusions. Indeed, hospitalization introduces selection bias that needs to be corrected. There are two distinct and opposite sources of bias: (1) the population included in the PMSI is, on average, in worse health than the general population since they required hospitalization, and (2) exclusions applied to the original PMSI data should result in a study population that is healthier than the PMSI population.
Of these two effects the first one is stronger, and is the one we attempt to correct using this adjustment.
Let $l_{x,k}$ represent the population aged $x$ on January 1 of year $k$. It is important to note that the assumption on which this adjustment is made, namely that individuals not present in the PMSI database are alive and in good health, is not universally satisfied: (1) it disregards the subpopulation initially included in PMSI but later excluded for this study, and (2) it does not account for rare events missed by PMSI. The first point is handled by scaling the observed population size, $l^{\mathrm{PMSI}}_{2010-c,\,2010}$, to the pre-exclusion levels before calculating by how much the exposure needs to be increased to match the entire French population. This is done to avoid re-adding the excluded population as healthy observations. The scaling factor corresponds to a 40% increase and is simply the ratio between the population before exclusions and after: 18 440 022/13 170 355 ≈ 1.40; both values come from Table 9. The use of the scaling factor is a simplification, as it assumes that the exclusions had proportionally the same effect on all ages. The search for an adjustment to correct the selection bias caused by the use of clinical data is a delicate topic that is outside the scope of this paper. The second point cannot be handled easily.
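The arithmetic of the scaling step can be sketched as follows. The two population totals are those quoted from Table 9; the rule for the number of healthy individuals to add (general population minus rescaled PMSI population) is our reading of the description above, and the function name and cohort sizes in the example are hypothetical.

```python
# Totals quoted in the text (Table 9).
POPULATION_BEFORE_EXCLUSIONS = 18_440_022
POPULATION_AFTER_EXCLUSIONS = 13_170_355

SCALING_FACTOR = POPULATION_BEFORE_EXCLUSIONS / POPULATION_AFTER_EXCLUSIONS

def healthy_individuals_to_add(general_population, observed_after_exclusions,
                               factor=SCALING_FACTOR):
    """Number of unobserved individuals assumed alive and disease-free.

    The observed count is first scaled back to its pre-exclusion level so
    that excluded patients are not re-added as healthy observations.
    """
    rescaled = observed_after_exclusions * factor
    return max(general_population - rescaled, 0.0)

# Hypothetical cohort: 1,000 people in the general population, 500 observed.
example_added = healthy_individuals_to_add(1000, 500, factor=1.4)
```

In this made-up example, 500 observed individuals are rescaled to 700, so 300 healthy individuals are added rather than 500.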
It is essential to emphasize that this adjustment can only be applied when considering sex as the sole covariate. We cannot employ this adjustment for the Cox model since we lack individual-level information on covariates for the entire population. Therefore, the Cox model should be interpreted as estimating the risk relative to the hospitalized population.

Dis-FLE and comparison to Eurostat's HLY
We estimate Dis-FLE using Kaplan-Meier survival curve estimates on the data adjusted for the whole population. The data allow us to calculate the entire survival curve and Dis-FLE for each age. Figures 1 and 2 show the survival curves and Dis-FLE with the adjustment for the whole population. Life duration in good health is substantially longer with the whole-population adjustment than without it (see Figures 10 and 11 in the Appendix for the unadjusted curves).
Overall, Dis-FLE steadily decreases from 50 to about 80, before stabilizing from about 80 to 90, and continues to decrease thereafter. Seeing the entire curve reveals an interesting pattern: the sex gap in Dis-FLE starts at about 5 years at 50 and decreases to 0 at 80. Dis-FLE for men and women stays essentially the same thereafter. Dis-FLE without the whole-population adjustment (Figure 11 in the Appendix) does display a proportionally consistent sex gap; therefore, the closing of the sex gap observed in Figure 2 is due to the whole-population adjustment. There is a higher proportion of men than women who never enter a hospital. Therefore, the adjustment adds more healthy men than healthy women, thus having a favorable impact on Dis-FLE for men, relative to women. However, this observation is difficult to interpret and requires further investigation in future research. For this reason, we focus on the Cox model for the hospitalized population only.
To place the proposed Dis-FLE indicator in the context of existing health indicators, we compare it to the closest available indicator, Eurostat's Healthy Life Years (HLY) for France over the same period (Eurostat 2020). HLY's concept of health is based on a self-evaluation of long-term activity limitation, as measured by GALI in the EU-SILC survey.
HLY represents the expected life duration without long-term activity limitation. This indicator was deliberately chosen to reflect the overall level of perceived ability, without attempting to identify the source or type of limitations. This allows it to be simple and widely applied, thus increasing coverage and allowing for comparisons between countries and over time (Robine 2003). HLY is also the only comparable health indicator covering France in the observation period. Another candidate was the HALE indicator from the Global Burden of Disease study for France, but it is not directly comparable to Dis-FLE-type indicators, as HALE assigns weights to different health states. Finally, previous articles using these data (Guibert, Planchet, and Schwarzinger 2018a, 2018b) focused on a similar Dis-FLE-type indicator that took into consideration only a small number of severe diseases, resulting in significantly longer Dis-FLE.
Table 6 compares Dis-FLE adjusted for the whole French population with HLY at ages 50 and 65. In general, Dis-FLE and HLY follow expected patterns, decreasing from age 50 to 65 for both sexes. Women consistently exhibit higher Dis-FLE and HLY compared to men at both ages. However, at age 50, Dis-FLE is significantly lower than HLY for both sexes. Furthermore, the sex gap is more pronounced in Dis-FLE. At 50, the female-male gap for Dis-FLE is 2 years larger than for HLY. At 65, the difference between sex gaps is smaller but still present, at about 1 year.
Assuming that the Dis-FLE estimates are indeed representative of the general population, the difference may be explained by the gap between individuals' perceived activity limitation, as measured by GALI, and their clinical state, as well as by the exclusion of institutional households from the EU-SILC survey.

Cox model inferences
In this section, we analyze the data through the Cox model described in Section 3. This model allows us to identify factors influencing health. Through this analysis, we illustrate the advantages of using clinical data. Similar analysis would not be possible with other data sources, either because they lack the necessary information (covariables) or volume.
We present hazard ratios estimated for this Cox model, that is, $e^{\beta_j}$ for the $j$-th variable, rather than the model coefficient, $\beta_j$. For non-age-dependent effects, we give the numeric value of the ratio in a table. For terms with age-dependent effects, we show curves of ratios as a function of age.
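The conversion between coefficients, hazard ratios, and the "excess hazard" percentages quoted when discussing results is a one-line calculation, sketched below with hypothetical function names. For instance, a hazard ratio of 1.30 corresponds to a 30% excess hazard relative to the reference group.

```python
import math

def hazard_ratio(beta_j):
    """Hazard ratio exp(beta_j) for a Cox model coefficient beta_j."""
    return math.exp(beta_j)

def excess_hazard_percent(ratio):
    """Excess hazard, in percent, relative to the reference group."""
    return (ratio - 1.0) * 100.0

# Illustrative coefficient chosen so that the hazard ratio is 1.30.
example_ratio = hazard_ratio(math.log(1.30))
example_excess = excess_hazard_percent(example_ratio)
```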
Overall, the available covariables have a large impact on healthy life duration, with behavioral risk factors having the largest impact, although that impact decreases with age. Following risk factors in importance is sex, with men experiencing adverse events earlier than women, even after controlling for covariables. As with risk factors, the difference becomes smaller at later ages.
In the following sections we examine one-by-one the effects of the risk factors, but first we want to get an overall idea of just how much the risk factors influence Dis-FLE.
N.B.: the estimates of Dis-FLE and other quantities do not represent estimates for the general French population, as the adjustment described in Section 3.3 cannot be applied for the Cox model.

Risk profiles
Before delving into the individual impact of each variable, we first illustrate the collective discriminatory power of the model by examining survival curves and Dis-FLE for selected risk profiles. As will be seen later in this section, the presence of risk-increasing behaviors (smoking, obesity, and alcohol consumption) is the determining factor of Dis-FLE. Therefore, the risk profiles are defined simply by the number of risk-increasing behaviors present:
• The "Lowest" risk profile, representing individuals without any risk factors.
• The "Intermediate" risk profile, involving one risk-increasing behavior.
• The "Highest" risk profile, featuring two risk-increasing behaviors.
Figures 3 and 4 display survival curves and Dis-FLE($t$) estimated by the Cox model for these risk profiles. There are two curves for the "Lowest" risk profile, one for men and one for women, while the "Intermediate" and "Highest" profiles each include six curves, one for each combination of sex and one of the risk factors. Since these risk profiles are just groupings of covariables, they remain constant for each individual.
The impact of risk behaviors on disease-free life duration is evident, with a substantial 10-year range in Dis-FLE at age 50 between the lowest and highest risk profiles. Having at least one risk-increasing behavior appears to be a key factor, reducing Dis-FLE by approximately 5 years. In the absence of such behaviors, sex emerges as the determining factor for Dis-FLE: Dis-FLE is 2.5 years lower for men than for women. However, with the presence of at least one risk factor, this gap diminishes to less than a year. This indicates that while behavioral differences contribute to the Dis-FLE sex gap, they do not entirely explain it.
Figure 5: Estimated age-dependent hazard ratio for sex (x-axis: age; y-axis: hazard ratio for males). Values above 1 indicate increased hazard for males. Gray areas are 95% pointwise confidence intervals.

Sex
We now proceed to inspect the effect of each variable on the disease-free life duration one by one. We examine age-dependent hazard ratios. The first variable analyzed is the sex of the individual. To take into account the apparent non-proportionality of hazard functions, the estimated hazard ratio of sex is allowed to vary with age and is modeled by a step function. All else being equal, men have a larger hazard than women, even when controlling for other covariates, as seen in Figure 5. This difference is not constant over time: it starts off at about 30% excess hazard at 50 and rises steadily before attaining a maximum of almost 45% excess hazard at about 70 years of age. The difference then declines to 5% at 100 years.
Note that Figure 4 illustrates the impact of sex on Dis-FLE while keeping other variables constant. From it, we see that in the absence of risk factors, Dis-FLE is 2.5 years lower for men than for women. In the presence of at least one risk factor, the difference is less than a year.

Behavioral risk factors
We analyzed the effect of three risk factors: tobacco consumption, alcohol consumption, and obesity.
Each risk factor is grouped into three risk categories: 0, 1, and 2. Category 0 represents the absence of risk-increasing behavior and is taken as reference. Figure 6 shows the age-dependent effects for these risk factors. All risk factors appear to have a large negative impact on the outcome. Category 2 alcohol abuse has the largest impact on health (although it also affects the smallest population compared to other risk factors), followed by smoking and obesity. The hazard ratios for category 1 risk factors are substantially smaller. All hazard ratios decrease with age.

Multiple behavioral risk factors
In our analysis, we investigated the combined impact of multiple risk factors. Given the extensive range of possible combinations involving category 1 and 2 risk factors, we concentrated on the most prevalent interactions, namely those among category 2 risk factors.
We find that the presence of multiple risk factors increases the overall risk. However, the marginal increase in risk from an additional factor is smaller than the risk associated with that factor on its own. This suggests a compensatory effect when multiple behavioral risk factors coexist. Notably, the combination of alcohol and smoking exhibits the strongest compensatory effect, followed by obesity-alcohol and obesity-smoking.
Figure 7 visually represents the distinctions between:
• the main effects,
• the naive combined effect of two risk factors (calculated by multiplying the hazard ratios of the main effects without considering the interaction term),
• and the estimated effect that accounts for the interaction term.
For all three combinations of risk factors, the combined effect with interaction is lower than without it. These observations shed light on the nuanced interplay of risk factors and their collective influence on the overall hazard.
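To make the two combined effects concrete, the following minimal Python sketch computes the naive product of two main-effect hazard ratios and the product that also includes the interaction term. The numerical values are hypothetical and chosen only for illustration; they are not estimates from the model, and the function name is an assumption.

```python
def combined_hazard_ratio(hr_a, hr_b, hr_interaction=1.0):
    """Combined hazard ratio of two coexisting risk factors.

    With hr_interaction = 1.0 this is the naive combined effect (product of
    the main effects). An interaction hazard ratio below 1 corresponds to the
    compensatory effect described in the text.
    """
    return hr_a * hr_b * hr_interaction


if __name__ == "__main__":
    # Hypothetical hazard ratios at one given age, for illustration only.
    hr_alcohol, hr_smoking, hr_inter = 3.0, 2.0, 0.75

    naive = combined_hazard_ratio(hr_alcohol, hr_smoking)
    with_interaction = combined_hazard_ratio(hr_alcohol, hr_smoking, hr_inter)

    print(f"naive combined effect:      {naive:.2f}")
    print(f"combined with interaction:  {with_interaction:.2f}")
```

On the log scale used by the Cox model, the same computation is a sum of coefficients, since the log hazard ratios of the main effects and of the interaction add.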

Behavioral risk factors conditional on sex
We assess whether risk factors affect men and women differently. To keep the model simple, we model this difference as an offset for males. Table 7 gives the hazard ratios for the interaction terms between sex and behavioral risk factors. These ratios can be interpreted as the additional burden of these risk factors on men, relative to women.
Overall, men appear to be slightly less sensitive to the presence of behavioral risk factors. This partly explains the decrease in the Dis-FLE sex gap in the presence of risk factors, as seen in Figure 4.
We focus on category 2 behavioral risk factors because category 1 risk factors are rare or show no substantial male-female differences. Category 2 alcohol consumption has a substantially stronger impact on women, who incur about 12% additional hazard. Obesity also affects women more strongly, by about 7%, while men's health is slightly more sensitive to smoking.

Geographical
Figure 8 gives the hazard ratios relative to the Yvelines department (78). This reference was chosen because it lies in the Île-de-France region while not being Paris itself. Northern departments have a markedly higher hazard rate, even after controlling for other covariates. Southeastern and eastern departments, on the other hand, appear to have the opposite effect. Both these facts are in accord with previous literature. In the rest of the territory, the effects appear to be more local.
To put these results in context, Figure 9 provides a map of life expectancies at 60 by sex and by department (INSEE 2023). Overall, we observe similar trends. The similarity suggests that geographic location is an independent predictor of life expectancy and Dis-FLE.
In and of itself, this result is hard to interpret, as it may not necessarily reflect the impact of the local environment on health, but instead reflect the level of access to healthcare, as discussed when introducing this approach. Further work is necessary to explain these differences. A first step would be to include more information on the departments themselves, e.g., population, population density, GDP, median income, etc.
The variables "Education" and "Immigration" indicate the level of education and immigration in the commune of residence. Table 8 gives the obtained hazard ratios for these variables. Surprisingly, the level of education and immigration in the commune of residence appear to increase the hazard. The effect is minor compared to other risk factors, but nonetheless significant. This result is also hard to interpret on its own, as there is a level of indirection between the individual and the commune of residence.

Conclusion
We propose the use of clinical data to construct health indicators. The use of clinical data opens up a hitherto unused source of information and makes rich analyses possible, some of which we present in this paper.
This work provides a methodological blueprint for calculating health indicators based on clinical datasets. The implications of our research extend beyond the French context, with potential applications in other countries and healthcare systems. Specifically, our methodology is not confined to large clinical datasets and can be applied at smaller scales, such as hospital cohorts, in France or elsewhere. However, when considering entire populations, accessing national hospitalization datasets to calculate nationally representative health indicators can be exceedingly challenging. We hope that this work provides a precedent that will encourage and facilitate similar efforts in the future.
Although clinical data impose a diagnosis-centric view, rather than the outcome-based one that health-oriented survey instruments may provide, they do provide a clear outline of the health state over the lifetime of the patient. Combined with the large volume of data available, this yields pertinent indicators at the population level. Indeed, as the comparison with HLY shows, Dis-FLE with the adjustment for the whole population displays similar trends, although with a wider sex gap.
In the absence of a standardized practice for defining health from clinical data, it is difficult to construct comparable health indicators. We sidestep the issue by focusing on a simple definition of being disease-free. A more complex indicator would take into account the entire health trajectory but would be difficult to analyze; this could be addressed in further work. Instead, our focus on simple trajectories, combined with the large amount of data available, allowed us to exhaustively analyze the impact of the available covariables. In doing so, we illustrate the kind of analysis we believe can be made possible by using clinical data. We apply the proposed methods to the French PMSI database and analyze the health status of the population aged 50 and up from 2010 to 2013. We summarize the results of the analysis in terms of Dis-FLE based on 36 severe conditions and hazard ratios of the corresponding Cox model.
For the population studied, Dis-FLE at 50 years is 10 years for women and 7.5 years for men. Dis-FLE is strongly influenced by the available covariables: when conditioned on covariates, it ranges from 2.5 to 12.5 years for women and from 2.5 to 10 years for men.
The most important determinants of Dis-FLE are the behavioral factors, in order of importance: alcohol consumption, tobacco use, and obesity. Each of these has a hazard ratio exceeding 2 at all ages before 80. Alcohol consumption has a hazard ratio larger than 3 before 60 years. Interestingly, all age-dependent effects decrease with age after 60.
Sex also has a large influence, with a hazard ratio above 1.2 before 80 and as large as 1.4 at about 70. In addition, the effect of behavioral risk factors was found to differ by sex, with alcohol consumption and obesity having a stronger effect on women, and smoking having a stronger effect on men. Other factors influence Dis-FLE, but have a weaker effect.
The Cox model analyzed in this paper is the simplest model that still allows us to illustrate the richness of the underlying data. There are, however, many possible improvements to it. For example, the model does not take calendar time into account. This is in contrast to most indicators, for which the ability to follow them over time is vital. A natural extension of the model would be to take calendar time into account by including it as an age-dependent covariate. Other possible extensions include making effects depend not only on age but also on calendar time, thereby modeling possible improvements in the treatment of behavioral factors.
The trajectories analyzed are based on a specific definition of health, or more specifically of being disease-free. This definition is based on previous work using this dataset and is conceptually coherent with other indicators. However, it lacks direct comparables, which limits its usefulness as an indicator for now. Further work may help identify a definition of disability more closely aligned with other indicators, such as GALI.
More fundamentally, the concept of health used introduces an artificial dichotomy between good and ill health.
Using the same data, it should be possible to define more realistic individual trajectories, for example by assigning each disease a weight. Using this approach, we can define an individual-level health-weighted indicator, extending the flexible approach to other indicators such as HALE. In this context, the use of clinical data would also simplify the methodology, as many problems plaguing HALE estimates are resolved by these data, for example comorbidity and the nuance between incidence and prevalence.
Such an approach would make both the definition of health trajectories and their analysis significantly more complex. We believe, however, that it would be a natural next step in using clinical data as a data source for health indicators.
Beyond considerations of the health concept used, the use of clinical data requires additional assumptions and adjustment procedures to produce nationally representative indicators. A simple adjustment procedure was introduced and used to calculate Dis-FLE for the general French population. However, we believe that this procedure could be improved by using more granular data and, under additional assumptions, extended to the estimates provided by the Cox model.
Should our methodology and findings prove useful and robust, future work could develop a definition of health that is based on clinical data and explicitly targets GALI or other relevant health indicators, potentially drawing upon detailed assessments of activities of daily living (ADLs). Such an endeavor could enhance the accuracy and sensitivity of our understanding of disability and its implications for individual and population health.
The Cox model was the tool of choice for this analysis. However, the large volume of data, combined with the need to explicitly define the model matrix, required a large amount of computer memory for the necessary computations. The use of other machine learning algorithms may provide a more efficient means to analyze this dataset.

B Sex-specific survival curves without adjustment
Figure 10 shows the sex-specific survival curves without adjustment for the whole population. Unsurprisingly, women spend longer in a healthy state than men. The oscillations in the curves are due to rounding in anonymized dates. Figure 11 shows the corresponding Dis-FLE(t).

C Model diagnostics
The Cox model used is fitted on 60% of the available data. The remaining 40% is reserved for the model diagnostics presented in this section.
The C-statistic calculated on the test set is 59.91%.
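For readers unfamiliar with the C-statistic, the following minimal Python sketch computes a concordance index for right-censored data. The text does not specify which estimator was used; this sketch assumes Harrell's C with a single, constant risk score per individual, and the function and argument names are hypothetical. The quadratic loop is only meant for small illustrative inputs, not for a dataset of the size analyzed here.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored data (assumed variant).

    A pair (i, j) is comparable when individual i has an observed event and
    exits strictly before individual j. The pair is concordant when i has
    the higher risk score; ties in risk score count for one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored individuals cannot be the earlier member
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: exit times, event indicators (1 = event), risk scores.
    times = [1.0, 2.0, 3.0, 4.0]
    events = [1, 1, 0, 1]
    print(harrell_c(times, events, [4.0, 3.0, 2.0, 1.0]))  # perfectly ordered
    print(harrell_c(times, events, [1.0, 1.0, 1.0, 1.0]))  # uninformative
```

A value of 0.5 corresponds to an uninformative score and 1.0 to perfect ranking, which gives a scale on which to read the reported value.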
To evaluate the quality of the fit on the test dataset, we calculate the linear predictor, i.e., log(hazard ratio), for every individual. Individuals are then grouped into classes based on the calculated value. The distribution of linear predictors is clustered around a few values. This is because sex and the presence of risk factors essentially determine risk, with all other variables essentially only adding noise. The lowest interval, (−∞, 0.2], covers essentially only women without any risk factors; (0.2, 0.7] covers men without any risk factors; (0.7, 1.1] covers mostly persons with obesity; (1.1, 1.5] covers mostly smokers; and (1.5, ∞) covers persons with alcohol consumption or multiple risk factors. Finally, Figure 12 compares the observed survival curves for each of these classes with the predicted survival curve.
One problem with this approach is that the model includes age-dependent coefficients. This means that the risk score of each individual changes over time, making it impossible to attribute a constant score to each individual. However, since the observation period is four years and the time grid for age-dependent coefficients is two years, each individual has at most 3 unique risk values, and most have only 1. When multiple risk values are present, they are close to each other. For the calculation above, we use the average of the predicted linear predictor values.
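The averaging and classification steps described above can be sketched as follows. This is a minimal illustration, assuming the cut points 0.2, 0.7, 1.1 and 1.5 as read from the intervals in the text and right-closed intervals; the function name and class numbering are hypothetical.

```python
from bisect import bisect_left
from statistics import mean

# Cut points as read from the intervals in the text (assumed):
# (-inf, 0.2], (0.2, 0.7], (0.7, 1.1], (1.1, 1.5], (1.5, inf)
CUT_POINTS = [0.2, 0.7, 1.1, 1.5]


def risk_class(linear_predictors, cut_points=CUT_POINTS):
    """Assign one individual to a risk class (0 = lowest risk).

    `linear_predictors` holds the individual's linear predictor values over
    the observation period (up to 3 values); their average is used as a
    single score. bisect_left makes each interval closed on the right.
    """
    score = mean(linear_predictors)
    return bisect_left(cut_points, score)


if __name__ == "__main__":
    print(risk_class([0.0]))            # lowest class
    print(risk_class([0.5, 0.6]))       # averaged before classification
    print(risk_class([1.2, 1.3, 1.4]))  # three values over four years
    print(risk_class([2.0]))            # highest class
```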
The adjustment is made by introducing $l^{\mathrm{INSEE}}_{2010-c,\,2010} - l^{\mathrm{PMSI}}_{2010-c,\,2010}$ artificial data points without any disease, corresponding to individuals not observed in the PMSI on 2010/01/01, for each observed cohort $c$ (year of birth) and separately for each sex (the sex index is omitted from the notation). These individuals are then censored at the end of years 2010 through 2013 as needed to align the exposure with INSEE data.
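The counting step of this adjustment can be sketched as follows. This is a minimal illustration, assuming population counts are available as dictionaries keyed by (year of birth, sex); the function name, the data layout and the numbers are hypothetical and are not taken from INSEE or the PMSI. The subsequent censoring step is not shown.

```python
def artificial_counts(insee_counts, pmsi_counts):
    """Number of disease-free artificial individuals to add per stratum.

    Both arguments map (birth_year, sex) to the number of individuals alive
    on 2010/01/01 according to INSEE and observed in the PMSI respectively.
    The result is the difference INSEE - PMSI for each stratum.
    """
    result = {}
    for key, n_insee in insee_counts.items():
        n_pmsi = pmsi_counts.get(key, 0)
        if n_pmsi > n_insee:
            raise ValueError(f"more PMSI than INSEE individuals for {key}")
        result[key] = n_insee - n_pmsi
    return result


if __name__ == "__main__":
    # Hypothetical counts for two strata, for illustration only.
    insee = {(1950, "F"): 1000, (1950, "M"): 900}
    pmsi = {(1950, "F"): 400, (1950, "M"): 450}
    print(artificial_counts(insee, pmsi))
```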

Figure 1: Survival curves of being without disease for the general population aged 50 and up, by sex.

Figure 6: Estimated age-dependent hazard ratios for behavioral risk factors. Values above 1 increase hazard. Gray areas are 95% pointwise confidence intervals.

Figure 7: Estimated age-dependent hazard ratios for two-way category 2 risk factor combinations. Each panel shows the interplay between two risk factors. For each panel, the main age-dependent effect is shown for the risk factors, and the combined effects with and without interaction are displayed. The combined effect without interaction is simply the product of the hazard ratios of the main effects. The combined effect with interaction is the product of the main effects and the interaction term. Values above 1 increase hazard. Gray areas are 95% pointwise confidence intervals.

Figure 10: Survival curves of being in good health, by sex.

Table 1: Description of individual patient data.

Table 2: List of 36 severe conditions requiring hospital care and considered incompatible with good health, and number of times the event was observed during the 2010-2013 period.

Table 3: Descriptive statistics of information available for the analysis.

Table 4: Correlations between risk factors. Only the presence of each risk factor is considered, ignoring categories.

Here $t_{\max}$ is the maximum assumed age. We set $t_{\max}$ to 100, the largest age in the INSEE age pyramid used in the whole-population adjustment (see Section 3.3). Setting a maximal age is one way of dealing with the fact that the survival function does not reach 0 if the longest observation is censored.
using either the Kaplan-Meier or the Cox model. Formally, if $\hat{S}$ is the estimate of the survival curve of $T$, then the restricted conditional expectation is

$$\text{Dis-FLE}(t) = E(T - t \mid T > t), \quad t \ge 50,$$

and can be calculated by

$$\text{Dis-FLE}(t) = \frac{1}{\hat{S}(t)} \int_t^{t_{\max}} \hat{S}(u)\, du.$$

However, given that both survival function estimators are step functions, this formula reduces to a weighted sum. The formula used to calculate Dis-FLE is given by:

$$\text{Dis-FLE}(t) = \sum_{i:\, t_{(i)} \ge t} \frac{\hat{S}(t_{(i)})}{\hat{S}(t)} \left(t_{(i+1)} - t_{(i)}\right),$$

Table 5: Terms used in the Cox model.
where $t_{(i)}$ are the unique, ordered, non-censored exit times observed in the data.
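The weighted-sum formula for Dis-FLE can be sketched in Python as follows. This is a minimal illustration rather than the implementation used for the paper: it assumes that the survival estimate is supplied as its values at the ordered exit times, that the last interval is closed by the maximum age (set to 100 in the text), and that the estimate equals 1 before the first exit time. The function and argument names are hypothetical.

```python
import numpy as np


def dis_fle(t, exit_times, surv_at_exits, t_max=100.0):
    """Weighted-sum estimate of Dis-FLE(t) from a step survival function.

    exit_times    : ordered unique non-censored exit times t_(1) < ... < t_(n)
    surv_at_exits : estimated survival S(t_(i)) at each exit time
    t_max         : maximum assumed age, used as t_(n+1) (assumption)
    """
    exit_times = np.asarray(exit_times, dtype=float)
    surv_at_exits = np.asarray(surv_at_exits, dtype=float)

    # S(t): value of the right-continuous step function at t.
    idx = np.searchsorted(exit_times, t, side="right") - 1
    s_t = 1.0 if idx < 0 else surv_at_exits[idx]
    if s_t == 0.0:
        return 0.0

    # Interval widths t_(i+1) - t_(i), closing the last interval at t_max.
    widths = np.diff(np.append(exit_times, t_max))

    # Sum over the exit times at or after t, as in the formula.
    keep = exit_times >= t
    return float(np.sum(surv_at_exits[keep] * widths[keep]) / s_t)


if __name__ == "__main__":
    # Toy step function, for illustration only.
    times = [50.0, 60.0, 70.0]
    surv = [1.0, 0.5, 0.25]
    print(dis_fle(50.0, times, surv, t_max=80.0))
    print(dis_fle(60.0, times, surv, t_max=80.0))
```

As written, the sum starts at the first exit time at or after t, so the sketch matches the formula exactly when t is itself one of the exit times.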

Table 6: Comparison of Eurostat's HLY at 50 and 65 for France to the analogous Dis-FLE calculated with the proposed health definition and method. The HLY value corresponds to the average of HLY from 2010 to 2013.
It is worth noting that Dis-FLE curves may intersect for men and women in some risk profiles due to the age-dependent coefficients in the Cox model. Additionally, these figures allow us to isolate the sex gap when other factors are equal. For instance, in the absence of risk factors at age 50, the sex gap is approximately 2.5 years.

Table 7: Hazard ratios for additional risk for men from behavioral risk factors, with associated standard errors and p-values. Only category 2 risk factors are shown.

Table 8: Cox model coefficients for the education and immigration levels in the commune of residence.
Figure 8: Estimated hazard ratios for departments of residence. Values are binned. Values above 1 increase hazard relative to residents of department 78. Non-significant values (p-value > 0.05) are greyed out.

Table 9: Exclusion criteria applied to the dataset and their impact on the number of patients. Translated and adapted from Table 1 in Schwarzinger (2018).

population, aged 50 and up in good health on the 1st of January 2010 (selected after exclusion criteria 1 and 2) 2010-2013 13 170 355 71.4% Data preparation for analysis 2008,2009 2 559 726 13.9%