Can National Registries Contribute to Predict the Risk of Cancer? The Cancer Risk Assessment Model (CRAM)

Simple Summary
Early identification of individuals with an increased risk of cancer is an important challenge. Danish administrative registers may be useful in this respect because they cover the entire population and include comprehensive and consistently coded long-term data. We aimed to develop a predictive model based on Danish administrative registers to facilitate the automated identification of individuals at risk of any type of cancer. In addition to age, almost all the included factors contributed statistically significantly, but only marginally, to the prediction models, which suggests that we have not overlooked obvious information available in the registers. Future prediction studies should focus on specific cancer types where more precise risk estimations might be expected. It is our ultimate ambition that an effective model can be used at the point of care, integrated into electronic patient record systems to alert physicians to patients at a high risk of cancer.

Abstract
Purpose: To develop a predictive model based on Danish administrative registers to facilitate automated identification of individuals at risk of any type of cancer. Methods: A nationwide register-based cohort study covering all individuals in Denmark aged 20 years or above. The outcome was all-type cancer during 2017, excluding nonmelanoma skin cancer. Diagnoses, medication, and contact with general practitioners in the exposure period (2007–2016) were considered for the predictive model. We applied backward selection to all variables by logistic regression to develop a risk model for cancer. We applied the models to the validation cohort, calculated the receiver operating characteristic curves, and estimated the corresponding areas under the curve (AUC). Results: The study population consisted of 4.2 million persons; 32,447 (0.76%) were diagnosed with cancer in 2017. We identified 39 predictive risk factors in women and 42 in men, with age above 30 as the strongest predictor for cancer. Testing the model for cancer risk showed modest accuracy, with an AUC of 0.82 (95% CI 0.81–0.82) for men and 0.75 (95% CI 0.74–0.75) for women. Conclusion: We have developed and tested a model for identifying the individual risk of cancer through the use of administrative data. The models need to be further investigated before being applied to clinical practice.


Introduction
Early identification of individuals at a high risk of cancer is an important challenge for all healthcare systems. In recent years, the focus has been on using healthcare data for risk assessment models in cancer [1,2]. Denmark has a long tradition of collecting comprehensive healthcare data [3,4]. However, these data have yet to be used to build prediction tools for all-type cancer that can be applied in a clinical setting.

Study Design
This study was a nationwide register-based cohort study using data from the Danish national registers covering all individuals in Denmark aged 20 years or above in 2017 with a 10-year look-back period (2007 to 2016).

Data Sources
In Denmark, all inhabitants are provided with a unique civil registration number (CRN) issued at birth or when immigrating to Denmark, which is used as the key identifier in all health and social registers. The Danish healthcare system is tax-funded and provides equal access to universal healthcare services [13].
Statistics Denmark is the national organization responsible for collecting statistical information about Danish society. We used data on demographic factors, vital status, employment status, education, and personal income [14,15]. Data on marital status and ethnicity were extracted as of 1 January 2017, whereas income and occupational status data were extracted for the year 2016 to avoid any lay-year "illness effect". Details of the variables are described in Table S2.

Study Population
The Danish Civil Registration System (CRS) includes all persons living in Denmark [16] and was used to identify persons for inclusion in the study population. The study population included all individuals aged 20 years or above on 1 January 2017. Persons with a cancer diagnosis (ICD-10 code C0-C9, not counting C44) between 1 January 2007 and 31 December 2016 were excluded (Figure 1). Death or emigration in 2017 did not lead to exclusion.
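The inclusion and exclusion rules above can be expressed as a small eligibility check. The following is a minimal sketch, not the authors' implementation: the function names and the input layout (a birth date plus a list of dated ICD-10 codes per person) are hypothetical, and the sketch follows the 2007 to 2016 exclusion window stated in this section.

```python
from __future__ import annotations
from datetime import date

INDEX_DATE = date(2017, 1, 1)       # age is assessed on this date
LOOKBACK_START = date(2007, 1, 1)   # start of the exclusion window
LOOKBACK_END = date(2016, 12, 31)   # end of the exclusion window


def is_cancer_code(icd10: str) -> bool:
    """True for ICD-10 codes starting C0..C9, except nonmelanoma skin cancer (C44)."""
    code = icd10.strip().upper()
    if len(code) < 2 or code[0] != "C" or not code[1].isdigit():
        return False
    return not code.startswith("C44")


def age_on(birth: date, when: date) -> int:
    """Age in completed years on a given date."""
    before_birthday = (when.month, when.day) < (birth.month, birth.day)
    return when.year - birth.year - before_birthday


def is_eligible(birth: date, diagnoses: list[tuple[date, str]]) -> bool:
    """Aged 20 or above on the index date and no cancer diagnosis in the look-back window."""
    if age_on(birth, INDEX_DATE) < 20:
        return False
    return not any(
        LOOKBACK_START <= when <= LOOKBACK_END and is_cancer_code(code)
        for when, code in diagnoses
    )
```

A person with only a C44 diagnosis in the window stays eligible, whereas any other C0-C9 code in the window leads to exclusion.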

Outcome (Cancer)
The Danish Cancer Registry (DCR) contains the data of all cases of cancer in the Danish population, including date of diagnosis and tumor characteristics [17]. We used this register to identify all cases of cancer during 2017 (ICD-10 codes: C0-C9) (excluding nonmelanoma skin cancer C44) and any cases of cancer prior to 2017 (exclusion criterion).

Conditions of Interest (Exposure)
The Danish National Patient Register (NPR) [18], the Danish National Prescription Registry (DNPR) [19], and the Danish National Health Service Register (NHSR) [20] were used to retrieve information on exposure variables. The NPR includes all inpatient and outpatient hospital visits, including the main medical reason for diagnostic procedures or treatment. From the NPR, we retrieved information on all ICD-10 codes at Level 3 from somatic (1607 codes included), psychiatric (1272 codes included), and private hospital contacts (1358 codes included), given as primary or secondary diagnoses from 2007 to 2016. The DNPR contains individual data on all dispensed prescription pharmaceuticals sold in Danish community pharmacies. We used ATC codes from the DNPR at Level 3, recorded as binary variables (yes/no) in the exposure period (2007-2016). An ATC code had to be registered at least twice during the exposure period to be recorded as "yes" (89 ATC codes included).
The NHSR contains information about activities in primary healthcare, including all general practitioner (GP) contacts [20]. We obtained information on the number of contacts with GPs and selected practicing specialists, and the procedures and measurements issued by a GP from 2007 to 2016 (310 categories included) (Table S1).
Coding details on age, sex, marital status, country of origin, income, educational and occupational status, and comorbidity can be found in Table S2.
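The medication coding rule described above (ATC Level 3, "yes" only with at least two registrations in the exposure period) can be sketched as follows. This is an illustration under stated assumptions: the function names and the `(person_id, atc_code)` input layout are hypothetical, and the sketch assumes that the two registrations are counted within the same Level 3 group rather than per full ATC code, which the text does not specify. In the ATC system, Level 3 corresponds to the first four characters of a code.

```python
from collections import Counter


def atc_level3(code):
    """Truncate an ATC code to Level 3 (first four characters, e.g. 'A10BA02' -> 'A10B')."""
    return code.strip().upper()[:4]


def atc_exposure_flags(dispensings, min_count=2):
    """Map each person to the set of Level 3 ATC groups flagged as 'yes'.

    dispensings: iterable of (person_id, atc_code) pairs from the exposure period.
    A group is flagged only if it was registered at least `min_count` times
    (assumption: counting is done per Level 3 group).
    """
    counts = Counter((pid, atc_level3(code)) for pid, code in dispensings)
    flags = {}
    for (pid, group), n in counts.items():
        if n >= min_count:
            flags.setdefault(pid, set()).add(group)
    return flags


# Hypothetical dispensing records, for illustration only.
rows = [(1, "A10BA02"), (1, "A10BB01"), (1, "C07AB02"), (2, "N02BE01")]
flags = atc_exposure_flags(rows)
```

Here person 1 is flagged for group A10B (two registrations) but not for C07A (one registration), and person 2 has no flagged groups.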

Statistical Analysis
The study population characteristics are reported as numbers and frequencies for categorical variables and as means and standard deviations or medians and interquartile ranges for numerical variables. Age on 1 January 2017 was categorized into 5-year categories.
The study population was randomly split into 50% as a development cohort, 25% as a validation cohort, and 25% as a test cohort stratified by group (cancer versus control) and sex. To enable accurate comparisons with competing prediction models, we withheld the test dataset for future analysis. We applied a three-step variable selection procedure, stratified by sex, on the development datasets. In the first step, we excluded conditions and ATC codes that occurred in <0.1% of the development cohort during the exposure period. In the second step, we carried out a backward selection with a p-value cut-off of 0.05 on variables remaining after the first step by logistic regression for cancer in 2017 separately for hospital diagnoses, ATC codes, and number of contacts per year with GPs and specialists. In the third step, we carried out a similar backward selection with a p-value cut-off of 0.01 combining the selected conditions/ATC codes/contacts with GPs from the second step and age (in 5-year age groups) (Model A).
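The cohort split and the first selection step can be illustrated with a short sketch. It is not the authors' code: the function names, the seed, and the data layout are assumptions made for illustration. The split is done within each combination of sex and cancer status, and the prevalence filter keeps variables that occur in at least 0.1% of the development cohort.

```python
import random
from collections import defaultdict


def stratified_split(people, seed=0):
    """Assign each person to development (50%), validation (25%), or test (25%).

    people: list of (person_id, sex, has_cancer) tuples.
    The split is performed separately within each (sex, has_cancer) stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pid, sex, has_cancer in people:
        strata[(sex, has_cancer)].append(pid)

    assignment = {}
    for ids in strata.values():
        rng.shuffle(ids)
        n_dev = len(ids) // 2
        n_val = (len(ids) - n_dev) // 2
        for i, pid in enumerate(ids):
            if i < n_dev:
                assignment[pid] = "development"
            elif i < n_dev + n_val:
                assignment[pid] = "validation"
            else:
                assignment[pid] = "test"
    return assignment


def drop_rare(features, min_prevalence=0.001):
    """Step 1: keep binary variables present in at least 0.1% of the cohort.

    features: dict mapping variable name to a list of 0/1 flags, one per person.
    """
    return {
        name: flags
        for name, flags in features.items()
        if sum(flags) / len(flags) >= min_prevalence
    }
```

Splitting within strata keeps the proportion of cancer cases and the sex distribution the same in all three cohorts, which matters when the outcome is as rare as it is here.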
To investigate the impact of socioeconomic status (SES) on cancer risk, the third step was repeated, including civil status, income, education level, occupation, and country of origin (Model B). Moreover, we constructed a model including age groups only (Model Age), and a fourth model including SES variables only (Model SES) to determine the predictive power of these aspects on their own.
To evaluate the resulting models, we applied the models to the validation cohort and calculated a receiver operating characteristic (ROC) curve based on the predicted probabilities and estimated the corresponding area under the curve (AUC). For predicted risk strata (0-1%, 1-2%, 2-3%, 3-4%, 4-5%, >5%), we calculated the observed cancer frequencies. Furthermore, we evaluated the prediction models by determining the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for 1-year cancer risk cut-offs of 1% as well as 5%.
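The evaluation measures named above can be computed directly from predicted probabilities and observed outcomes. The sketch below is illustrative only: the function names are hypothetical, the AUC is computed through its rank-based (Mann-Whitney) formulation rather than by tracing the ROC curve, and treating a predicted risk equal to the cut-off as high risk is an assumption.

```python
def auc(scores, labels):
    """AUC as the probability that a random case is ranked above a random control.

    Uses average ranks so that tied scores count as one half.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def _ratio(numerator, denominator):
    """Safe division that returns NaN for an empty group."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(scores, labels, cutoff):
    """Sensitivity, specificity, PPV, and NPV at a predicted-risk cut-off."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }
```

With a 1% cut-off, the PPV is the observed cancer frequency among individuals predicted to be at high risk, and one minus the NPV is the observed frequency in the low-risk group.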

Results
Denmark had a total population of 5.7 million individuals in 2017. After excluding persons who did not meet the inclusion criteria, the study population consisted of 4.2 million persons (Figure 1). The characteristics of the study population are shown in Table 1.
Table 1. Characteristics of the study population stratified by sex, and development and validation cohorts.

Cancer Outcome
In total, 32,447 (0.76%) individuals in the study population were diagnosed with first-time cancer in 2017, of whom 648 (1.99%) individuals were diagnosed with more than one cancer type during 2017. Overall, the highest frequency for first-time cancer was seen for gastrointestinal cancer (23.9%). When stratified by sex, the most frequent type of cancer was breast cancer (28.4%) in women and cancer of the genital organs (26.3%) in men.

Conditions Related to Cancer and Development of the Predictive Model (CRAM)
We identified 39 predictive risk factors in women (Table 2) and 42 in men (Table 3). Age above 30 was a strong predictor for cancer in both sexes. In total, 11 of the 39 identified risk factors in women and 13 of the 42 in men were associated with a lower risk of cancer. Eight risk factors were consistent across the two sexes, whereas none of the GP services (by year) or practicing specialists were identical for men and women.
The models including SES differed slightly (Tables S3 and S4). Early retirement and retirement increased the risk of cancer, while being from a non-Western country was associated with a lower risk of cancer in both sexes. For men, being from a Western country (other than Denmark) or having an income in the middle or highest tertile was associated with a lower risk of cancer.

Validation of CRAM
Validating Model A on the corresponding validation cohort resulted in an AUC of 0.82 (95% CI 0.81-0.82) for men and 0.75 (95% CI 0.74-0.75) for women (Table 4 and Figure 2). When comparing the observed and predicted frequency of cancer cases at a predicted 1-year cancer risk cut-off of 1%, we found an observed cancer risk in the predicted low-risk group of 0.22% in men and 0.36% in women, compared with an observed risk (PPV) in the high-risk group of 2.14% in men and 1.64% in women (Table S5, Model A). Similarly, for a 5% risk cut-off, we observed a cancer risk of 0.76% in men and 0.75% in women in the low-risk group compared with 3.75% for men and 2.38% for women in the high-risk group (Table S5, Model A). Furthermore, stratifying individuals by predicted cancer risk showed good agreement between predicted and observed risk for predicted risks up to 5%, whereas the predicted risk overestimated the observed risk for predicted risks above 5% (Figure 2).
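The comparison of predicted and observed risk by stratum can be sketched as below. This is a minimal illustration, not the authors' code: the function name and output layout are hypothetical, the strata follow those listed in the Statistical Analysis section, and assigning a predicted risk that lies exactly on a boundary to the lower stratum is an assumption.

```python
import bisect

STRATA_EDGES = [0.01, 0.02, 0.03, 0.04, 0.05]
STRATA_LABELS = ["0-1%", "1-2%", "2-3%", "3-4%", "4-5%", ">5%"]


def observed_by_stratum(predicted, outcomes):
    """Summarize calibration per predicted-risk stratum.

    Returns {label: (n, mean predicted risk, observed cancer frequency)}
    for every stratum that contains at least one person.
    """
    n = [0] * len(STRATA_LABELS)
    risk_sum = [0.0] * len(STRATA_LABELS)
    events = [0] * len(STRATA_LABELS)
    for p, y in zip(predicted, outcomes):
        k = bisect.bisect_left(STRATA_EDGES, p)  # boundary values go to the lower stratum
        n[k] += 1
        risk_sum[k] += p
        events[k] += y
    return {
        STRATA_LABELS[k]: (n[k], risk_sum[k] / n[k], events[k] / n[k])
        for k in range(len(STRATA_LABELS))
        if n[k] > 0
    }
```

A well-calibrated model shows a mean predicted risk close to the observed frequency in each stratum; a mean predicted risk clearly above the observed frequency indicates overestimation in that stratum.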

The odds ratio for being diagnosed with any cancer in 2017 was 9.78 (9.05; 10.57) with a 1% cut-off and 4.96 (4.00; 6.10) with a 5% cut-off among men. For women, the model resulted in an odds ratio of 4.49 (4.20; 4.80) with a 1% cut-off and 3.34 (1.65; 6.05) with a 5% cut-off. The predictive performance of Model A appears in Table 5.
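As a rough plausibility check, an odds ratio can be derived from the observed risks in the high-risk and low-risk groups. The sketch below assumes that the reported odds ratios compare these two groups, which the text does not state explicitly; the function names are hypothetical. Because the inputs are the rounded percentages reported for Model A at the 1% cut-off, the results only approximate the reported odds ratios.

```python
def odds(risk):
    """Convert a risk (probability) into odds."""
    return risk / (1.0 - risk)


def odds_ratio(risk_high, risk_low):
    """Odds ratio of the high-risk group versus the low-risk group."""
    return odds(risk_high) / odds(risk_low)


# Observed risks at the 1% cut-off, taken from the rounded percentages in the text.
or_men = odds_ratio(0.0214, 0.0022)
or_women = odds_ratio(0.0164, 0.0036)
```

Both values fall inside the confidence intervals reported above, which is consistent with the assumption but does not confirm how the published estimates were computed.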
Including the socioeconomic details (Model B) only marginally improved the AUC. The model including age alone resulted in an AUC of 0.81 for men and 0.74 for women, while the model including only socioeconomic variables resulted in AUCs of 0.75 for men and 0.70 for women (Table 4).

Discussion
This study is the first to use a large volume of already available Danish register data with overall cancer as the outcome. In addition to age, almost all the included factors contributed statistically significantly, but only marginally, to the prediction models, which suggests that we have not overlooked obvious register-available information. Given the inclusion of overall cancer as the outcome and the large dataset, the identified predictive risk factors may, to some extent, represent random findings. Future prediction studies should therefore focus on specific cancer types, staging, and testing in clinical care.
The ROC curves showed that it is possible to make moderately precise models; however, 'all-type cancer' is a difficult case due to the heterogeneous outcome. Omitting SES from the models only weakened the model marginally. This is a sign of how the different factors in the registers are related. Our results show that age is a strong predictor of cancer: the model including age alone had a higher AUC than the model based on SES exclusively. The CRAM model predicts the 1-year cancer risk well up to 5% risk, which is a clinically relevant spectrum of risk, as the 1-year risk of cancer in the general population is below 1%. However, given a predicted risk above 5%, the model may overestimate the risk of cancer.

Strengths and Limitations
This is the first study developing and validating a risk prediction model for cancer in a Danish setting, following up on a similar prediction model developed for osteoporotic fractures [21]. A main strength is the nationwide design covering the entire Danish population. The study did not require patient recruitment, which ensured the inclusion of the whole population of interest and therefore avoided selection bias [13,22].
We extracted data from administrative registries and thereby reduced the likelihood of information bias. Due to the public national registries, the results are applicable to everyone with access to the healthcare system. The CRAM is consequently transparent, can be used on an individual level according to the information presented in this article, and can be tested in future cohorts.
A limitation of our study is that we were not able to include information about 'online' covariates such as presenting symptoms or lifestyle (for instance, drinking, smoking, and dietary habits), as this information is not available from the administrative registries. The results should therefore be perceived as a supplement to the information received during patient presentation in clinical practice.
Overall, the cancer group had a higher median age than the control group. This age difference was expected, as age is a known risk factor for cancer. Hence, matching the case and control group would have limited the generalizability of the results to the general population level.
We did not take the time to event during 2017 or the competing risk of death into account. Although this might have influenced the results slightly, we consider the effect of these factors to be limited, as our 1-year outcome period was short, implying a limited risk of death during the period, and time to cancer diagnosis is neither very informative nor clinically relevant.
We used a classical backward selection method instead of, e.g., machine learning algorithms. This relatively simple approach has the drawback that more complicated patterns of risk prediction could have been overlooked by the model. On the other hand, the advantage was a transparent methodology, resulting in a final model which can easily be reported, interpreted, and implemented without any privacy concerns with respect to the development data.
We excluded patients with previous cancer, as secondary cancer was considered to be clinically different from primary cancer and because we expect patients with earlier cancers to be followed closely in clinical practice, and hence they are not relevant for general screening. We included any type of cancer for this first study. However, risk factors may play different roles for different types of cancers, and a model should be investigated in relation to specific cancer types in future studies.

Comparison with the Existing Literature
The British QCancer prediction models are, to some extent, comparable with our study [23]; however, the QCancer algorithms included 11 types of cancer, whereas the outcome in our study was any type of cancer, improving the usefulness when the concern is cancer in general. Further, the QCancer studies included socioeconomic characteristics in terms of Townsend score, which is a four-variable population-based deprivation score for a geographical area, whereas we had individual data for developing the CRAM.
Our findings of AUC 0.82 for men and 0.75 for women are comparable with a review of studies regarding prediction models for lung cancer, where the included studies found an AUC between 0.57 and 0.879 [24].
In a prospective observational study among patients with hematuria from 110 hospitals across 26 countries referred to secondary care, a prediction model for urinary tract cancer showed an AUC of 0.86 (95% confidence interval: 0.85-0.87) [25].
The modest contribution of SES to our prediction model is somewhat in contradiction to the findings of other studies of prediction models for nonmalignant diseases, where SES in terms of education and income was found to improve the accuracy of predicting cardiovascular disease risk [26,27] and diabetes [28]. This might be explained by the circumstance that the large number of health predictors included in our study covered the risk information which otherwise could be obtained from the SES data.
A growing number of studies have aimed to improve the risk stratification of patients with cancer through multimodal data integration. There has been much recent interest in the use of machine learning (ML) and different artificial intelligence algorithms for cancer predictions. Studies comparing ML with classical statistical models for risk prediction have already been published [29-31], and although some of the studies have demonstrated a promising path toward improved risk stratification of patients with cancer, the relevance for clinical purposes remains to be proved.

Conclusions
We have verified that the Danish administrative registers are useful for developing a cancer risk prediction model (CRAM) for identifying individuals at risk of having any cancer. The CRAM showed moderate accuracy in the validation cohort and included 39 and 42 risk factors for cancer for women and men, respectively. In addition to age, almost all the included factors contributed statistically significantly, but also only marginally, to