No Excess Mortality up to 10 Years in Early Stages of Breast Cancer in Women Adherent to Oral Endocrine Therapy: A Probabilistic Graphical Modeling Approach

Breast cancer (BC) is globally the most frequent cancer in women. Adherence to endocrine therapy (ET) in hormone-receptor-positive BC patients is active and voluntary for the first five years after diagnosis. This study examines the impact of adherence to ET on 10-year excess mortality (EM) in patients diagnosed with Stages I to III BC (N = 2297). Since sample size is an issue for estimating age- and stage-specific survival indicators, we developed a method, ComSynSurData, for generating a large synthetic dataset (SynD) through probabilistic graphical modeling of the original cohort. We derived population-based survival indicators using a Bayesian relative survival model fitted to the SynD. Our modeling showed that hormone-receptor-positive BC patients diagnosed beyond 49 years of age at Stage I or beyond 59 years at Stage II do not have 10-year EM if they follow the prescribed ET regimen. This result calls for developing interventions to promote adherence to ET in patients with hormone-receptor-positive BC, which in turn would improve cancer survival. The methodology presented here demonstrates the potential use of probabilistic graphical modeling for generating reliable synthetic datasets for validating population-based survival indicators when sample size is an issue.


Introduction
Breast cancer (BC) is the most common cancer and the leading cause of cancer death in European women [1]. A decrease in BC mortality is correlated with improvements in survival [2,3], an indicator of the success of cancer control efforts in a population-based setting. Conditional five-year survival is an outcome that measures the efficacy of cancer management, since it answers the question "once a patient survives for T years, what is the probability of surviving another five years?" [4]. Most population-based cancer survival indicators are derived from relative survival (RS), defined as the ratio between the overall survival (OS) and expected survival of the cohort with respect to the general population [5]. RS is an estimate of the patients' cancer-specific survival compared to the survival of the general population, and one can also assess the conditional RS (CRS) at five additional years after surviving T years [5]. On the basis of the CRS(T), one can determine the five-year excess mortality as EM(T) = 1 − CRS(T), which is used to assess whether patient mortality surpasses the mortality of the general population, that is, when EM(T) > 0 [6].
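As a small worked illustration of these indicators, the following sketch computes CRS(T) and EM(T) from a table of relative survival values. It assumes the usual definition of five-year conditional relative survival, CRS(T) = RS(T + 5)/RS(T); the function names and the RS values are hypothetical and are not taken from the study.

```python
def conditional_relative_survival(rs, t, horizon=5):
    """CRS(T) = RS(T + horizon) / RS(T), with rs a dict {years since diagnosis: RS}."""
    return rs[t + horizon] / rs[t]


def excess_mortality(rs, t, horizon=5):
    """Five-year excess mortality, EM(T) = 1 - CRS(T)."""
    return 1.0 - conditional_relative_survival(rs, t, horizon)


# Hypothetical relative survival values (illustrative only, not study results).
rs_example = {0: 1.00, 5: 0.95, 10: 0.931}

crs_5 = conditional_relative_survival(rs_example, 5)  # 0.931 / 0.95
em_5 = excess_mortality(rs_example, 5)                # 1 - CRS(5)
print(f"CRS(5) = {crs_5:.3f}, EM(5) = {em_5:.3f}")
```

With these illustrative values, patients who have already survived five years would have an EM(5) of about 2% relative to the general population.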
These conditional survival or mortality indicators provide relevant information on the prognosis of BC over time, as they are a starting point to identify prognostic factors related to long-term survival [6-9]. For instance, the BC cohort's mortality is not different from the general population's mortality when EM equals 0 beyond a certain time interval T [6]. Moreover, population-based cancer registries can define the time to cure of cancer as "the number of years after cancer diagnosis when the EM, expressed as a percentage, becomes negligible" [4,8]. That situation occurs when the EM remains clearly below 5% for more than 10 years, and CRS consequently surpasses 95% [8]. A recent study using European cancer registry data showed that an EM of 5% could persist for at least 15 years in BC patients [9]. However, the EM in that study was an overall indicator that could only be adjusted for age because other prognostic factors could not be retrieved from all participating cancer registries.
Stage, molecular subtype, and adherence to endocrine therapy (ET) are key predictors for providing population-based BC survival estimates [10]. Indeed, tamoxifen and aromatase inhibitors are pillars of adjuvant therapy for patients with hormone receptor positive (HR+) BC diagnosed at Stages I-III [11]. Randomized clinical trials showed that five years of adherence to ET positively impact BC survival [11]. In a previous study, we found that nonadherence to ET is significantly and independently associated with recurrence and all-cause mortality at Stages I-III of hormone receptor positive BC after adjusting for age [12]. A question arises regarding the impact of ET adherence on long-term survival and risk of death in patients with BC versus the general population [9].
The sample size of the cohort could be an issue when trying to estimate age-specific survival according to stage and molecular subtype; however, generating a large cohort of simulated survival data on the basis of observed cohort data could help overcome this limitation [13]. This simulation could be achieved in two ways: (1) only simulating survival times [14,15] or (2) generating a set of cohort covariates as a function of survival times [13,16]. For the latter, oversampling techniques such as SMOTE [17], Borderline SMOTE [18], and MWMOTE [19] can also be used to generate balanced subsets of data, although the efficiency of these methods in simulating new datasets must be assessed against the observed survival patterns of real data [19]. However, if we are interested in detecting new patterns of survival, the probabilistic dependencies between the variables of the observed data must be modeled explicitly, which requires estimating a joint probability distribution of the variables [13,20-22]. For that purpose, our research team developed Modelling Graphical Probabilistic Dependencies (ModGraProDep) and suggested that future work should be oriented toward selecting data subsets across several synthetic datasets (SynD) that better mimic the cohort's survival pattern [13].
In the present study, we developed a method to validate the survival estimates of the original cohort by using a synthetic cohort that combines the "best" subsets of simulated data derived from graphical models. Survival indicators are generated by fitting the cohort data and the simulated SynD to a Bayesian RS model developed for that purpose.

Data: BCStage Dataset
BC data were obtained from the population-based cancer registries of Girona and Tarragona (northeastern Spain) covering an average annual population of 560,120 women from 2005 to 2009 [23]. During this time period, 4053 women under the age of 75 years were diagnosed with invasive BC (code C50 of the 10th edition of the International Classification of Diseases, ICD-10). A total of 352 women (8.7%) were excluded from the analyses due to missing data on estrogen and progesterone status, and another 1215 (30.0%) were excluded due to missing data on stage, Stage IV at diagnosis, or diagnosis of HER2-enriched or triple-negative BC tumors. In addition, follow-up status (whether or not the patient had died by the end of follow-up) could not be retrieved for 189 women, leaving 2297 patients eligible for ET. Each woman with BC diagnosed from 2005 to 2009 was followed up to 31 December 2019; we considered a maximal follow-up of 10 years. Of the patients eligible for ET (N = 2297), information on adherence could only be retrieved for BC patients diagnosed from 2007 to 2009 who met the inclusion criteria: patients presenting positivity for estrogen and/or progesterone receptors diagnosed at Stages I, II, or III, who were eligible for ET (N = 1243). Survival times for patients not found to be dead at the end of follow-up were censored. Stage classification was based on the TNM classification system, as described in the 6th edition of the American Joint Committee on Cancer staging manual [24], classifying patients at Stage I, II, or III when TNM was available at the moment of diagnosis.
Adherence to ET for patients with HR+ BCs was tracked during the first five years after BC diagnosis. Any switch to tamoxifen or aromatase inhibitor was considered to be a continuation of treatment. Adherence was estimated as "the proportion of days covered by a filled drug prescription over the treatment period (up to five years from the date of first prescription)", deeming a cumulative adherence rate of 80% or more as satisfactory [12]. Data on ET prescription refills for BC were collected for the entire study period (2007-2015) from the community pharmacy database, which is mandatory for drug reimbursement in Catalonia.
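The adherence measure described above can be illustrated with a minimal sketch of a proportion-of-days-covered calculation. It assumes that each refill is recorded as a fill date plus a number of days supplied, and that overlapping refills are not carried forward; the function names, the record layout, and the example dates are hypothetical and may differ from how the pharmacy database was actually processed.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, start, end):
    """Share of days in [start, end] covered by at least one filled prescription.

    fills: list of (fill_date, days_supply) tuples (hypothetical record layout).
    Overlapping supplies are counted once (assumption: no carry-forward of surplus).
    """
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day <= end:
                covered.add(day)
    period_days = (end - start).days + 1
    return len(covered) / period_days


def is_adherent(pdc, threshold=0.80):
    """Cumulative adherence of 80% or more is deemed satisfactory."""
    return pdc >= threshold


# Hypothetical 100-day treatment window with three 30-day refills.
fills_example = [(date(2007, 1, 1), 30), (date(2007, 2, 5), 30), (date(2007, 3, 10), 30)]
pdc_example = proportion_of_days_covered(fills_example, date(2007, 1, 1), date(2007, 4, 10))
print(pdc_example, is_adherent(pdc_example))
```

In the study the treatment period runs up to five years from the first prescription; the short window here only keeps the arithmetic easy to follow.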

Fitting Graphical Models through ModGraProDep
Four synthetic datasets were simulated by modeling the probabilistic dependencies between variables using ModGraProDep [13]. In brief, let Γ be the set of cells in a contingency table, where $c_{ash}$ is a cell of the table with indices a (age), s (stage), and h (adherence). Let $p(c_{ash})$ be the cell probabilities of the contingency table Γ. Using a hierarchical expansion of $\log p(c_{ash})$, we considered a saturated log-linear model, that is, a model including the main effects and all interactions between them:

$$\log p(c_{ash}) = \alpha + \beta_a + \beta_s + \beta_h + \gamma^{I_2}_{as} + \gamma^{I_2}_{ah} + \gamma^{I_2}_{sh} + \gamma^{I_3}_{ash} \qquad (1)$$

where parameter α is an intercept, β refers to main effects, and $\gamma^{I_q}$ refers to the set of interaction parameters of order q, where q ∈ {2, 3}. We can also specify a model with fewer interaction terms by setting higher-order interactions to zero.
Assuming that there is a set of candidate models {M(j) | j ∈ {1, ..., J}}, ModGraProDep uses a heuristic search based on the penalized log-likelihood

$$H(j,k) = -2\log p(c_{ash}) + k \cdot z(j) \qquad (2)$$

where z(j) is the number of parameters of model M(j), and k is a penalty factor. Changing the value of k can result in several models, each obtained through backward stepwise elimination of graph arcs.
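To make the role of the penalty factor concrete, the sketch below evaluates the penalized score for two candidate models under k = 1, k = 2 (which corresponds to the Akaike information criterion), and k = log(N) (the Bayesian information criterion). The log-likelihoods, parameter counts, and model labels are hypothetical, and the sketch only scores given models; it does not reproduce the backward stepwise search of ModGraProDep.

```python
import math


def penalized_score(log_likelihood, n_params, k):
    """H = -2 * log-likelihood + k * (number of parameters); lower is better."""
    return -2.0 * log_likelihood + k * n_params


def best_model(candidates, k):
    """candidates: {label: (log_likelihood, n_params)}; returns the label minimizing H."""
    return min(candidates, key=lambda m: penalized_score(*candidates[m], k))


# Hypothetical candidates: a dense model that fits better and a sparse one that fits worse.
candidates = {"dense": (-1000.0, 20), "sparse": (-1010.0, 8)}
n_obs = 1243  # size of the cohort with observed adherence

penalties = {"k=1": 1.0, "AIC (k=2)": 2.0, "BIC (k=log N)": math.log(n_obs)}
choice = {name: best_model(candidates, k) for name, k in penalties.items()}
print(choice)
```

With these made-up numbers, the weakest penalty keeps the dense graph while the AIC and BIC penalties prefer the sparse one, which illustrates why varying k can produce several distinct graphical models.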
Starting from the saturated model, ModGraProDep fits four models: three by using the k penalty factor, namely GMK1 for k = 1, GMAIC for k = 2 (Akaike information criterion [25]), and GMBIC for k = log(N) (Bayesian information criterion [25]), and a fourth, GMTEST, by testing the conditional independence encoded by each arc. Once these four models had been fitted, we first imputed adherence in the BC cases with missing adherence, and then generated the synthetic datasets. We used the junction-tree simulation algorithm implemented in ModGraProDep to simulate four datasets of size N = 1,000,000, one from each of the four models, according to the probabilistic relationships between variables (see Vilardell et al. (2020) for technical details [13]). Figure 1 presents the scheme for generating a combined synthetic dataset that selects the subsets of data that best mimic the survival pattern of the cohort. The steps are summarized as follows:

ComSynSurData: Combining Synthetic Survival Datasets
Step 0. Use ModGraProDep for generating the four SynDs.
Step 1. Partition the cohort dataset into L subsets according to A age groups and S levels of a stratification variable, such as stage at diagnosis, so that L = A × S. For instance, with stage at diagnosis as the stratification variable (levels {I, II, III}) and three age groups, L = 3 × 3 = 9 subsets (one for each age group and stage combination). The same partition is made for each SynD.
Step 2. For each of the L subsets of the cohort data, find its "best" counterpart among the 4 × 9 = 36 subsets of the SynDs by comparing, through a scoring method, the survival estimates of the observed cohort with those derived from the SynDs.
Step 3. Once the L subsets of SynD are selected, one for each age-stage stratum, generate a combined synthetic cohort by merging these L subsets, from which Kaplan-Meier survival estimates according to stage and corresponding age groups can be derived.
Figure 1. Scheme of procedure for generating combined cohort by using best synthetic cohort for each of the considered L age-stratum groups. Synthetic cohorts generated according to ModGraProDep.
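Steps 2 and 3 can be sketched as a selection over a table of scores followed by a merge. The sketch assumes that a score has already been computed for every stratum and every synthetic dataset, with lower values indicating a closer match; all names, scores, and records below are hypothetical and do not come from the authors' R implementation.

```python
def select_best_subsets(scores):
    """scores: {(age_group, stage): {model_name: score}} -> {(age_group, stage): model_name}.

    For each stratum, keep the synthetic dataset with the smallest score.
    """
    return {stratum: min(by_model, key=by_model.get) for stratum, by_model in scores.items()}


def combine_subsets(selection, synthetic_subsets):
    """Merge the selected subsets into one combined synthetic cohort.

    synthetic_subsets: {(model_name, (age_group, stage)): list of records}.
    """
    combined = []
    for stratum, model in selection.items():
        combined.extend(synthetic_subsets[(model, stratum)])
    return combined


# Hypothetical scores for two strata and two synthetic datasets.
scores_example = {
    ("<=49", "I"): {"GMK1": 0.031, "GMAIC": 0.040},
    ("50-59", "II"): {"GMK1": 0.052, "GMAIC": 0.047},
}
# Hypothetical records: (follow-up in years, death indicator).
subsets_example = {
    ("GMK1", ("<=49", "I")): [(10.0, 0), (7.5, 1)],
    ("GMAIC", ("<=49", "I")): [(10.0, 0)],
    ("GMK1", ("50-59", "II")): [(3.0, 1)],
    ("GMAIC", ("50-59", "II")): [(10.0, 0), (9.0, 1), (10.0, 0)],
}

selection = select_best_subsets(scores_example)
combined_cohort = combine_subsets(selection, subsets_example)
print(selection, len(combined_cohort))
```

Selecting per stratum rather than per dataset lets the combined cohort borrow from whichever graphical model reproduces the observed survival best in each age-stage group.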

Scoring Method for Comparing Observed versus Predicted Survival in Step 2
ComSynSurData uses the integrated Brier score (IBS), a scoring method to detect inaccuracies in the prognostic classification scheme, that is, disagreement between the survival curves of cohort and simulated data at a certain time t [26]. Let $\hat{S}(t)$ be the predicted survival function, and $\hat{G}(t)$ the censoring distribution, both functions estimated using the Kaplan-Meier method and using the SynD. Here, we used the following definition of the Brier score at time t for censored data [27]:

$$BS(t) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{\hat{S}(t)^2\, I(t_i \le t)}{\hat{G}(t_i)} + \frac{\left(1-\hat{S}(t)\right)^2 I(t_i > t)}{\hat{G}(t)}\right] \qquad (3)$$

where N is the number of patients in the stratum, $t_i$ is the follow-up of the i-th patient in the cohort, and I(·) are indicator functions, such that $I(t_i \le t) = 1$ and $I(t_i > t) = 0$ if the i-th patient dies before t, and $I(t_i > t) = 1$ and $I(t_i \le t) = 0$ if the patient is still under follow-up at t. The IBS is an overall measure up to a certain time target t*, which uses weights defined as W(t) = t/t* [27]. Here, we used maximal follow-up t* = 10 years. The IBS was calculated as

$$IBS = \int_0^{t^*} BS(t)\, dW(t) = \frac{1}{t^*}\int_0^{t^*} BS(t)\, dt \qquad (4)$$

For each age and stage stratum, the selected subset of SynD would be that with the smallest IBS score, which could lie between 0 and 1, where IBS = 0 shows a perfect match between observed and predicted survival [26,27]. The Supplementary Material file includes the R code for running ComSynSurData.
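The scoring step can be illustrated with a compact sketch of the Brier score for censored data and its integral up to t* = 10 years. This is not the authors' R code: it is a simplified version that estimates the censoring distribution from the cohort stratum being scored (an assumption), approximates the integral with the trapezoidal rule on a regular grid, and uses small made-up datasets. All function and variable names are hypothetical.

```python
import numpy as np


def kaplan_meier(times, events):
    """Kaplan-Meier estimate returned as a right-continuous step function S(t)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for u in event_times:
        at_risk = np.sum(times >= u)
        n_events = np.sum((times == u) & (events == 1))
        s *= 1.0 - n_events / at_risk
        surv.append(s)
    surv = np.asarray(surv)

    def step(t):
        idx = np.searchsorted(event_times, t, side="right")
        return 1.0 if idx == 0 else float(surv[idx - 1])

    return step


def brier_score(t, pred_surv, times, events, censor_dist):
    """Brier score at time t with inverse-probability-of-censoring weights."""
    s_t = pred_surv(t)
    total = 0.0
    for t_i, d_i in zip(times, events):
        if t_i <= t and d_i == 1:      # died before t: target survival status is 0
            total += s_t ** 2 / censor_dist(t_i)
        elif t_i > t:                  # still under follow-up at t: target is 1
            total += (1.0 - s_t) ** 2 / censor_dist(t)
        # patients censored before t contribute 0
    return total / len(times)


def integrated_brier_score(pred_surv, times, events, t_star=10.0, n_grid=201):
    """IBS = (1 / t*) * integral of BS(t) over [0, t*], by the trapezoidal rule."""
    events = np.asarray(events, dtype=int)
    censor_dist = kaplan_meier(times, 1 - events)  # assumption: estimated from the cohort
    grid = np.linspace(0.0, t_star, n_grid)
    bs = np.array([brier_score(t, pred_surv, times, events, censor_dist) for t in grid])
    return float(np.sum((bs[1:] + bs[:-1]) / 2.0 * np.diff(grid)) / t_star)


# Hypothetical cohort stratum and two hypothetical synthetic subsets.
obs_times = [2, 4, 6, 8, 10, 10, 10, 10, 10, 10]
obs_events = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
synthetic = {
    "close": ([3, 5, 7, 9, 10, 10, 10, 10, 10, 10], [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]),
    "far": ([1, 1, 2, 2, 3, 3, 4, 10, 10, 10], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]),
}
ibs = {name: integrated_brier_score(kaplan_meier(t, e), obs_times, obs_events)
       for name, (t, e) in synthetic.items()}
best = min(ibs, key=ibs.get)
print(ibs, best)
```

The synthetic subset whose Kaplan-Meier curve tracks the cohort more closely receives the smaller IBS and would be the one retained for that stratum.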

Statistical Modeling of Excess Mortality
We used an RS model to derive the survival indicators. Let $\lambda_O(T)$ be the overall hazard of death in the cohort at a specific time T, and $\lambda_P(T)$ the expected hazard in the cohort using the general population mortality [28]. Applying additive modeling, the excess hazard of death in the cohort due to BC is $\lambda_X(T) = \lambda_O(T) - \lambda_P(T)$ [29], where $OS(T) = \exp\left(-\int_0^T \lambda_O(t)\,dt\right)$ is the observed survival in the cohort at time T, and $ES(T) = \exp\left(-\int_0^T \lambda_P(t)\,dt\right)$ is its expected survival. Relative survival (RS) at time T is calculated as [28]:

$$RS(T) = \frac{OS(T)}{ES(T)} \qquad (5)$$

RS(T) could reach (or even surpass) 1 when OS(T) is equal to (or higher than) the survival of the general population [28]. From RS(T), one can derive the five-year conditional relative survival at T years of follow-up as [5]

$$CRS(T) = \frac{RS(T+5)}{RS(T)} \qquad (6)$$

From this, the five-year conditional excess mortality (EM) at T years of follow-up is [5,6]

$$EM(T) = 1 - CRS(T) \qquad (7)$$
Using (7), one can assess temporal changes in the EM by monitoring this quantity during follow-up [5]. Moreover, it is of interest for both the patient and clinician to estimate the crude probability of death due to cancer in the presence of other causes at time T, PCa(T), and the crude probability of death due to other causes in the presence of cancer mortality at time T, POC(T) [6]. These quantities can be derived by using competing risks modeling as

$$PCa(T) = \int_0^T OS(t)\,\lambda_X(t)\,dt \qquad (8)$$

$$POC(T) = \int_0^T OS(t)\,\lambda_P(t)\,dt \qquad (9)$$

where the sum of these two probabilities gives the probability of death from any cause at time T [6]. Since all these indicators are related to $\lambda_X(t)$ and $\lambda_O(t)$, these last two risks can be estimated by

$$\hat{\lambda}_O(T) = \frac{O(T)}{Y(T)}, \qquad \hat{\lambda}_X(T) = \frac{O(T) - E(T)}{Y(T)} \qquad (10)$$

where O(T) is the observed number of deaths at T, E(T) is the expected number of deaths at T, which is calculated by applying the age-specific mortality rates of the general population to each of the individuals at risk within the T interval, and Y(T) is the number of individuals at risk in T.
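The following sketch shows how the observed, expected, and excess hazards could be obtained from interval counts, and how the crude probabilities of death follow from them. It assumes yearly intervals with hazards that are constant within each interval, which is a discretization chosen for illustration and not necessarily the one used in the Bayesian model; the counts and function names are hypothetical.

```python
import math


def interval_hazards(observed, expected, at_risk):
    """Per-interval hazards: overall O/Y, expected E/Y, and excess (O - E)/Y."""
    lam_o = [o / y for o, y in zip(observed, at_risk)]
    lam_p = [e / y for e, y in zip(expected, at_risk)]
    lam_x = [lo - lp for lo, lp in zip(lam_o, lam_p)]
    return lam_o, lam_p, lam_x


def crude_probabilities(lam_o, lam_p):
    """Crude probabilities of death from cancer and from other causes.

    Assumes unit-length intervals with piecewise-constant hazards: the probability
    of dying in an interval is split between causes in proportion to their hazards.
    Returns (p_cancer, p_other, overall survival at the end of follow-up).
    """
    os_prev, p_cancer, p_other = 1.0, 0.0, 0.0
    for lo, lp in zip(lam_o, lam_p):
        p_die = os_prev * (1.0 - math.exp(-lo))
        share_other = lp / lo if lo > 0 else 0.0
        p_other += p_die * share_other
        p_cancer += p_die * (1.0 - share_other)
        os_prev *= math.exp(-lo)
    return p_cancer, p_other, os_prev


# Hypothetical counts for three yearly intervals (not study data).
observed_deaths = [10, 8, 6]
expected_deaths = [4.0, 4.0, 4.0]
at_risk = [1000, 980, 960]

lam_o, lam_p, lam_x = interval_hazards(observed_deaths, expected_deaths, at_risk)
p_cancer, p_other, os_end = crude_probabilities(lam_o, lam_p)
print(lam_x, p_cancer, p_other, os_end)
```

By construction, the two crude probabilities add up to one minus the overall survival at the end of follow-up, matching the statement that their sum is the probability of death from any cause.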
Since O(T) is usually considered to be a Poisson-distributed random variable with mean $\mu_T$, that is, $O(T) \sim \mathrm{Poisson}(\mu_T)$, we used Bayesian autoregressive modeling of order 1 to estimate $\hat{\lambda}_O$, assuming a prior precision (inverse of variance) of 0.001 [30]. Posterior distributions and the corresponding 95% credible intervals of the aforementioned survival indicators (5)-(9) were calculated through posterior estimates of $\mu_T$ and the fixed quantities E(T) and Y(T). The model was implemented using WinBUGS [31] (see the program code in the supplementary material file), which was run within R (http://www.Rproject.org, accessed on 5 December 2021) through the R2WinBUGS library [32].

Analysis Scheme
First, the GM was fitted to the original dataset, and adherence was imputed in cases with missing information. Second, four SynDs were generated using ModGraProDep, and from these SynDs, ComSynSurData selected the best L age-stage subsets of synthetic data, which were used to generate the combined synthetic dataset. Survival indicators were derived by fitting the Bayesian relative survival model to this combined cohort, and these were validated against those obtained using the original cohort. Lastly, age-specific survival indicators for epidemiologic or clinical use were calculated.

Results
Table 1 presents the clinical and pathological characteristics of the observed cohort in Girona and Tarragona in 2005-2009, stratified according to HER2+/HER2− expression. The main differences were detected in the distribution of BC stage: Stages II and III were more frequent in patients with HER2+ than in those with HER2− tumors. Mean age at diagnosis was 55.3 years: 32.7% of the patients were diagnosed with BC before 50 years of age, 29.6% were diagnosed at age 50 to 59 years, and 37.7% were 60 years or older. Most patients were diagnosed at early stages, whereas only 17.5% were diagnosed at Stage III. Mean follow-up was 8.2 years, and 11.7% of patients died during that period. Of the patients in the cohort, information about adherence could be retrieved for those diagnosed from 2007 to 2009 (N = 1243), 75% of whom showed a cumulative adherence rate of 80% or higher during the first five years after the BC diagnosis. In cases with missing adherence data, a value for adherence was imputed using ModGraProDep. Table 1 also shows the distribution of the number of BC cases according to adherence and HER2 status after imputation with each of the four models.
We did not find any difference in the distribution of the percentages according to adherence status when comparing the observed frequencies in the cohort (the N = 1243 BC patients) with those obtained after using each of the four models implemented in ModGraProDep (see Table 1, Distribution of BC cases in the cohort after the imputation of adherence to ET when missing). Moreover, the distribution of adherence status in the cohort was identical when the GMAIC and GMBIC models were used, indicating that the probabilistic graphical pattern of the dependencies between variables in the observed data (N = 1243) was likely to be identical when fitting these two graphical models to the cohort data.
Figure 2 shows the graphical modeling of the data, which encodes a factorization of the joint probability distribution of the dataset. Three probabilistic schemes can be distinguished: one obtained using GMK1 (Figure 2a), another using GMTEST (Figure 2b), and another, as noted above, obtained through GMAIC and GMBIC (Figure 2c). Figure 2a shows that the GMK1 model considered all variables to be related (connected). The GMTEST model considers age as related to exitus, but this is conditional on adherence or stage at BC diagnosis, and HER2 as related to the other variables through stage. Lastly, the GMAIC and GMBIC models consider that age could be independent from the data structure, and all remaining variables are conditionally independent once exitus is known. Stage was related to the remaining variables, conditional on others, regardless of the model used.

Data Simulation
After the imputation of the missing data, ModGraProDep was used to simulate four SynDs, one from each of the four fitted graphical models, and ComSynSurData was then applied to these SynDs to generate the combined dataset. Table 2 shows the matrix of IBS scores generated internally by ComSynSurData, from which the L = 9 subsets were selected. Of these, seven data subsets were selected from the SynD derived from GMK1, two from the SynDs derived from GMAIC and GMBIC, and none from GMTEST.

Comparing Observed Survival in the Cohort with Survival in the Combined Cohort
To assess the reliability of these simulated datasets, OS in the cohort with real data (N = 1243) was compared with the estimated survival using the combined cohort (Figure 3). Using the posterior distribution of the survival derived from the combined cohort, its median survival fell within the 95% credible intervals of observed survival in the original cohort in almost all age groups. In some, however, the median survival of the combined cohort was slightly lower than the observed survival, but close to the lower bound of the 95% credible interval of the survival in the original cohort: the age group ≤ 49 years at Stages I and II, and the age group of 59-74 years at Stage III.
Table 2. Integrated Brier score at up to 10 years of follow-up by age and stage, comparing the cohort's absolute survival with the absolute survival estimated using each one of the synthetic datasets derived from the Graphical Models (in bold: minimal integrated Brier score for each age group according to stage of breast cancer at diagnosis).

Figure 4 compares the EM observed in the original cohort with that estimated using the combined dataset. Median EM between these datasets did not differ, since the 95% credible intervals derived from the observed cohort overlapped with the estimates derived from the combined cohort. Accordingly, the patients diagnosed in Stages I and II who were adherent to endocrine therapy did not show EM with respect to the general population. However, we found that patients diagnosed in these early stages who were not adherent to ET had an EM with a median ranging from 5% to 10%, which usually suggests a significant EM. For patients diagnosed at Stage III, the effect of nonadherence to ET might double the EM with respect to adherence.
Table 3 presents the age-specific epidemiological survival indicators derived from the combined cohort across age groups and stage at diagnosis. The adherent group showed higher OS (+6% at 5 years and +15.2% at 10 years) and lower 10-year PCa (−18.7%) and 5-year EM (−14.5%) compared to the nonadherent group. Table 3 also shows that, at Stage I, adherent patients diagnosed before 50 years of age may present a small but non-negligible 1.1% EM when compared to the general population. In contrast, no EM was detected in patients diagnosed beyond that age. Nonadherent patients present 4.6% to 9% higher EM, depending on the age group. In Stage II, adherent patients diagnosed beyond 59 years did not show EM during the follow-up. The largest differences in survival indicators between adherent and nonadherent patients were observed in the Stage III group, with better prospects of survival in adherent compared to nonadherent patients, independently of age at BC diagnosis.
Figure 5 shows the comparison of the three main population-based survival indicators across age groups and stratified by adherent and nonadherent patients: EM(5), PCa(10), and OS(10). In Stages I and II of BC, differences in EM(5) and PCa(10) between adherent and nonadherent patients were clearly marked and showed their maximum among BC patients diagnosed beyond 50 years. At Stage III, the age trend of these two indicators was similar, showing a marked rise beyond 59 years of age at BC diagnosis.
Lastly, OS(10) showed two patterns: (i) for adherent patients, survival was similar up to 59 years of age and decreased thereafter, independently of stage at diagnosis; (ii) for nonadherent patients, OS(10) decreased exponentially with age except in Stage III.

Discussion
This study provides estimates of the most common population-based statistical indicators in order to assess the impact of stage, age, and adherence to ET on survival in patients with estrogen- and/or progesterone-receptor-positive BC. We compared the estimates from the original cohort with those derived from synthetic datasets generated through graphical models fitted to the cancer registry cohort. Using the advantages of probabilistic graphical modeling, we first identified the probabilistic data structure, used it to impute the adherence status in patients with missing data for this variable, and simulated data for a large cohort to estimate age-specific survival indicators. We implemented the ComSynSurData method in order to select the best subsets of four synthetic datasets derived from ModGraProDep. To the best of our knowledge, this is the first study to show that adherence to ET greatly impacts BC survival among HR+ patients with early-stage breast cancer: no excess risk of death up to 10 years after BC in women diagnosed beyond 49 years of age. This result sheds light on the prospects of curing BC in this group of patients.
The assessment of treatment response is crucial for evaluating anticancer therapies, treatment planning, and outcomes, where patients' OS is the baseline measure [33]. However, that evaluation requires a large sample and long-term follow-up, which are usually not available in the same study. We used a method for generating a large sample of synthetic data on the basis of the original cohort in order to estimate the observed survival indicators using the original cohort data, which had the minimal required long-term follow-up of 10 years for assessing EM due to BC [8]. Using these indicators, healthcare policy planning should be informed by the estimated prevalence of cancer deaths at a population level, which can be calculated through RS [34]. These indicators are strongly related to the concept of a statistical assessment of the "cure" of BC [35], which entails: (I) long survival time beyond 10 years and equal life expectancy [9], and (II) no cancer relapses up to almost 10 years after BC diagnosis [35].
Our study has a strong limitation in assessing the statistical cure of BC: our follow-up cannot go beyond 10 years. Another limitation is that the simulated cohorts were based on the observed data provided by the original cohort. Therefore, survival indicators derived from these simulated cohorts can only internally validate the indicators estimated from the original data. The availability of external data provided by other cancer registries with similar information would be useful for an additional validation of the results and reproducibility. Information on long-term prognosis by stage, receptor status, and adherence to ET is not usually reported by population-based cancer registries [9]. However, recent studies suggest the need for using these variables in population-based studies in order to assess whether the influence of stage or BC subtype on survival lessens in the long term, which might lead to a consideration of cancer cure in early stages [36-38]. The impact of ET adherence on BC patient survival is significant [39], and our results, which show differences in EM when comparing the cohort's mortality with that of the general population, are relevant to this. Moreover, differences between adherent and nonadherent patients are significant across all age groups, but show a different impact depending on stage at diagnosis. This point must be accounted for and further investigated, since age, stage, and treatment play a crucial role in the clinical follow-up of BC patients. Further studies on this point are needed.
A small but significant level of EM was detected in the adherent group of younger BC patients (<50 years) diagnosed at Stage I. However, survival estimates for these women using the combined cohort could be slightly lower than the observed survival in the original cohort, and this could limit the use of this subset of data. On the other hand, a previous study carried out on a cohort with ductal carcinoma in situ diagnosed in Girona also detected statistically significant EM in patients diagnosed before 50 years of age [40]. Evidence suggests that differences in biological characteristics of breast tumors could impact patient survival [41]. Moreover, 5- and 10-year local recurrences at early stages [42] arise depending on age and molecular subtype. Although a high proportion of BCs are HR+ and HER2−, those diagnosed in young women are likely to be more aggressive [43,44], even in luminal-like early BC [45,46]. A study carried out using SEER data noted worse BC-specific survival for women in the oldest age groups for every BC subtype analyzed, with the exception of Stage IV triple-negative disease [10]. In that study and others, worse survival was observed in patients diagnosed before 35 years of age at Stages I-III [10,46]. Other studies showed that young age is also a predictor of decreased adherence to adjuvant ET, which in turn is associated with increased mortality [47]. Although ET is unquestionably a therapeutic tool for HR+ BC, these strategies are associated with potential side effects and toxicity, which may have a differential effect depending on age [48]. On the other hand, randomized trials showed that, in premenopausal women with BC, the addition of ovarian suppression to tamoxifen may increase 8-year rates of both disease-free and overall survival [49]. However, diagnoses in the cohort under study predate the results of these randomized trials, so women under 49 years of age in our study could not have had access to these improved treatments.
Studies on BC survival and late adverse events due to ET must be considered beyond 10 years of follow-up, since evidence suggests that distant recurrences may arise from 5 to 20 years after diagnosis [49].
Studies of EM derived from small cohorts of cancer patients must be further evaluated using larger cohorts [40]. Here, we present a procedure for simulating a large sample dataset by fitting graphical models to cohort data and coupling a log-linear model and a Bayesian network. Since our interest was in simulating the most reliable data, one aim was to assess the probabilistic dependencies between variables. ModGraProDep identifies a set of graphical models by using a heuristic search based on changing k, a penalty factor in the penalized likelihood (see Equation (2) above) [16-18,21]. Although specific values of k, namely k = 2 and k = log(N), lead to two known measures for model choice, AIC and BIC, ModGraProDep identifies two alternative models, one using k = 1 and another testing each arc's statistical significance at α = 0.05 [13,18]. Vilardell et al. showed that estimating survival from one of these four models could provide reliable survival indicators [13]. Here, we introduced a method for deriving a synthetic dataset that provides better survival indicators by combining the best subsets of data of several synthetic datasets. An interesting feature of ComSynSurData is that it could be adapted to use any set of simulated data, including data produced by oversampling techniques such as SMOTE [17], Borderline SMOTE [18], and MWMOTE [19]. However, synthetic datasets derived from ModGraProDep provide additional information about the data structure and the relationships between variables. The latter can also be useful for clinicians and epidemiologists in understanding the probabilistic patterns of the disease under study.

Conclusions
To sum up, coupling relative survival modeling with synthetic data simulation validated our main clinical result: patients with HR+ breast cancers diagnosed beyond 49 years of age at Stage I or beyond 59 years of age at Stage II do not have 10-year EM compared to the general population if they follow the prescribed regimen of ET. These results call for developing interventions that promote adjuvant ET adherence in eligible BC patients, given its potential benefits in improving cancer survival. The methodology presented here demonstrates the potential use of probabilistic graphical modeling in generating reliable synthetic datasets to be used for validating population-based survival indicators when sample size is an issue.