A Multi-Center Competing Risks Model and Its Absolute Risk Calculation Approach

In the competing risks frame, the cause-specific hazard model (CSHM) can be used to test the effects of some covariates on one particular cause of failure. Sometimes, however, the observed covariates cannot explain the large proportion of variation in the time-to-event data coming from different areas such as in a multi-center clinical trial or a multi-center cohort study. In this study, a multi-center competing risks model (MCCRM) is proposed to deal with multi-center survival data, then this model is compared with the CSHM by simulation. A center parameter is set in the MCCRM to solve the spatial heterogeneity problem caused by the latent factors, hence eliminating the need to develop different models for each area. Additionally, the effects of the exposure factors in the MCCRM are kept consistent for each individual, regardless of the area they inhabit. Therefore, the coefficient of the MCCRM model can be easily explained using the scenario of each model for each area. Moreover, the calculating approach of the absolute risk is given. Based on a simulation study, we show that the estimate of coefficients of the MCCRM is unbiased and precise, and the area under the curve (AUC) is larger than that of the CSHM when the heterogeneity cannot be ignored. Furthermore, the disparity of the AUC increases progressively as the standard deviation of the center parameter (SDCP) rises. In order to test the calibration, the expected number (E) of strokes is calculated and then compared with the corresponding observed number (O). The result is promising, so the SDCP can be used to select the most appropriate model. When the SDCP is less than 0.1, the performance of the MCCRM and CSHM is analogous, but when the SDCP is equal to or greater than 0.1, the performance of the MCCRM is significantly superior to the CSHM. This suggests that the MCCRM should be selected as the appropriate model.


Introduction
Cardio-cerebrovascular diseases have become a severe public health issue. In particular, stroke is the primary cause of disability. With the aging population, the ante-displacement of the age of disease onset, and the improvement of the material standard of living, the prevention and treatment of stroke is still a great challenge worldwide [1,2]; therefore, the development of new theories and methods is of utmost importance [3,4]. Carrying out risk assessments on individuals prior to stroke can provide important information for medical research, hence reduce the social economic burden in the future. To this end, either the effects of multiple covariates for each individual must be determined, or the absolute risk of every person can be calculated through regression or other approaches. The most commonly used regression analysis for risk assessment is the Cox model [5]. The traditional hazard-based Cox model uses a semi-parametric setting with non-parametric baseline hazard, perfect link and exponential functions. However, although it has been widely used in medical studies, cox model ignores the existence of competing risks.
Medical practice produce a large amount of competing risks data, which is especially related to elderly people [6,7]. A common type of competing risks data is survival data with multiple causes of death. For example, in a clinical trial that compares different treatment therapies for breast cancer, interest may be focused on death from breast cancer, but a patient may die due to causes other than breast cancer, such as coronary heart disease, hospital infection, or a traffic accident [8]. The standard methods for analyzing competing risks data include cause-specific hazard functions [9,10], subdistribution hazard model [11], Framingham models [12][13][14] and fine adjustment Framingham models, depending on the population characteristics. However, the application and imitation of Framingham models have caused many problems, such as variables significant in clinical treatment becoming insignificant in models or coefficients being unexplainable [15][16][17][18]. Therefore, a more adequate modeling approach is needed for the stroke patients in China.
Sometimes, the observed covariates cannot explain the large proportion of variation in time-to-event data from different areas, for instance, the data of multi-center clinical trials or multi-center cohort studies [19][20][21][22][23]. China has a vast territory and many ethnic groups. The heterogeneity of the public in different areas caused by climate, economic level, living habits, and many other factors, is enormous. Therefore, the effects of these latent factors should not be ignored when studying the risk factors of diseases and calculating the absolute risk, even though some of these important factors are unavailable in some circumstances. Furthermore, the effect of a specified covariate should be consistent for individuals at different centers, according to the risk assessment in survival analysis. Therefore, in a multi-center competing risks scenario, with the presence of heterogeneity caused by some latent factors, it is inappropriate to establish different models for every center, even the sample size at each center is sufficient.
In this paper, we demonstrate that most of the predictors (covariates) are not effected by spatial heterogeneity. For example, for smokers with the same amount of cigarette consumption each day, smoking should have the same effect in different areas. That is, the effect of smoke is not related to the smoker's geographical location. However, it is not ideal if we established different models for different areas, because we may obtain different odds ratios about smoke. Therefore, for multi-center cohort data or multi-center randomized controlled trials (RCT), we established a uniform model for different areas based on the Frailty model [24][25][26][27], while also setting a center parameter to eliminate the problem detailed above. All the effects of latent factors were incorporated in the center parameter, which was helpful for obtaining accurate and consistent estimators of all of the predictors. Furthermore, based on the Gail model [28], we estimated the absolute risk of stroke for each person under the competing risks frame. These methods can be easily applied to other cardio-cerebrovascular diseases, and perhaps even to other diseases.
The rest of this paper is organized as follows. Section 2 reviews the cause-specific hazard model (CSHM), and then introduces the multi-center competing risks model (MCCRM) and the approach of calculating absolute risks. Section 3 presents the results of the simulation study, assesses the performance of the proposed model, and describes the calibration of the approach to calculate absolute risks. Section 4 compares the performance of the two models (CSHM and MCCRM) on a dataset from the Shandong Center for Disease Control and Prevention. Section 5 presents the discussion. Concluding remarks are given in Section 6.

The Cause-Specific Hazard Model
For the CSHM, without loss of generality, we assumed that there were only two causes of failure for the following description [9,10]. For example, when death caused by stroke was the event of interest, all other causes of death were treated as competing risks. The cause-specific hazard function is where Z(t) denotes the value of the regression vector at time t [5]. If cases with only two competing risks, the proportional cause-specific hazard model based on the Cox model can be expressed as α 02 (t|Z) = α 02;0 (t) exp(β T 02 Z).

The Multi-Center Competing Risks Model
This model deals with a multi-center scenario with the presence of heterogeneity. Z ki (k = 1, 2, · · · , K; i = 1, 2, · · · , n k ) denotes the covariates vector, where K is the number of centers, and n k is the number of individuals in the kth center. We add a center parameter η k to the CSHM, so the MCCRM can be given as α The center parameter η k incorporates all other effects of latent factors (covariates) such as economy, politics, climate, and living habits. α 01;0 (t) and α 02;0 (t) are the baseline hazard functions. The coefficients vector β 0 j is consistent for every center. This means that the effect of the specified covariate given to other covariates to every individual is consistent. For example, smoke has the same negative effect for each person regardless of the center or area they belong to.
Under the multi-center competing risks scenario, we can derive the following formulas according to reference [7].
The all causes hazard can be expressed as and the cumulative hazard for the kth is Therefore, the distribution function for the ith individual who belongs to the kth center is From here, we assumed that our data had been stratified by gender and age; in other words, the MCCRM was developed using data from individuals with the same gender and age period. In such a scenario, the baseline hazard α 01;0 (u) and α 02;0 (u) can be assumed to be constant, and Equation (5) can be simplified as follows: where λ = α 01;0 (u) exp(β T 01 Z ki + η k ) + α 02;0 (u) exp(β T 02 Z ki + η k ). As mentioned above, the distribution function of the event of interest for the ith individual who belongs to the kth center has a similar expression to Equation (6): where λ 01 = α 01;0 (u) exp(β T 01 Z ki + η k ). Therefore, At the baseline level, we assumed that all of the covariates and the center parameter equaled to zero; thus Equation (8) was simplified as where P is the overall incidence of all centers. P k was taken as the incidence of the kth center, so the center parameter η k could be calculated as follows: A detailed derivation of the formula can be found in Appendix A. Now, we give the absolute risk equation as in the Gail model [23]: where a is the age of the ith individual of the kth center and τ is a time interval. P{a, τ|Z ki } is the absolute risk that a person has a certain disease in the time interval [a, a + τ] with covariates Z ki . The relative risk r(t) is calculated as follows: In addition, in Equation (11), S 2 (t) = exp − t 0 l k α 02;0 (u)du . As our model was established after the data had been stratified by age and gender, the baseline hazard could be set as constant. Therefore, Equation (11) was simplified as follows: where ξ = l k α 01;0 (t)r(t) and λ = −ξ − l k α 02;0 (t).

Simulations
In this section, we show the performances of the CSHM and the MCCRM. First, we generated random data with the method introduced by Jan Beyersmann et al. [7,29] and then established the two models with random data. Next, we assessed the above-mentioned models through statistical parameters such as bias, standard deviation (SD), root mean square error (RMSE), and area under the curve (AUC). Finally, we calibrated the multi-center competing risks model (MCCRM) by calculating the ratio of the expected number (E) of strokes in the given time interval and compared it with the corresponding observed number (O), i.e., E/O.
In the simulation, we chose stroke as the dependent variable. Death from stroke was the outcome of interest, and death from other causes was the competing risk. Using the Framingham models [12,13,15,30], we chose five factors as covariates: Total cholesterol (TC), high density lipoprotein (HDL), systolic blood pressure (SBP), diabetes, and smoking. Age and gender were used as stratified variables.

The Generation of the Dataset
Firstly, we generated the random data of covariates according to the real data used in the study by Zhenxin Zhu et al. [31]. The real data came from a cohort of all participants who received routine health check-ups from 2005 to 2010 at the Center for Health Management of Shandong Provincial QianFoShan Hospital and the Health Examination Center of Shandong Provincial Hospital. For TC, HDL and SBP were continuous variables, and we calculated the mean vector and covariance matrix of the three covariates using the real data. Then, random data were generated with a multivariate normal distribution. Diabetes and smoke were variables with values of 0-1, which were generated by binomial distribution with the rate of real data from Shouguang City, Shandong Province, China. The center parameter was calculated using Equation (10).
Secondly, due to the existence of competing risks, we generated random data from the dependent variables, which were survival time and survival outcome. As we stratified the data by age and gender, the baseline hazards in Equation (6) could be set as constants, and the covariates could be treated as approximately time-independent; therefore, the distribution of survival time could be simplified as an exponential distribution for each person. The true values of coefficients of the selected five covariates (exposure factors) in Equation (6) were set as β 01 = (1, −3, 0.01, 1, 1) according to a rough estimate by the Cox model with real data. As the influence of covariates for other competing risks is always regarded as insignificant, the true values of coefficients of covariates for competing risks in Equation (6) were set as β 02 = (0.01, −0.01, 0.0001, 0.01, 0.01).
At this point, we had obtained the survival time T. Next, we generated the survival outcome X. There were three outcomes, which were indicated by 0, 1, and 2, where 0 referred to a censored outcome, 1 indicated death from stroke, and 2 indicated death from causes other than stroke. We used α 01 (t)/[α 01 (t) + α 02 (t)] as the parameter of binomial distribution to generate the outcome (1 or 2) for each sample. We generated random data C with a uniform distribution [0, b], and then specified 0 for the sample if C was smaller than T for each person, or 1 otherwise. The right endpoint b was used to control the censored ratio.

The Assessment of Models
Then, we were able to establish the CSHM and the MCCRM with the random data generated above. We used the packages survival, pROC, and MASS in R software [32] to conduct the analysis [33][34][35][36]. The total sample sizes of the simulated data were N = 1000 and 5000, with censored ratios of Q = 0.  Table 1; 5000, 0.2, and 1.0 for the values given in Table 2; and 5000, 0.2, and 2.0 for the values given in Table 3. Table 1 shows that there was no significant difference between the performance of the MCCRM and the CSHM when the SDCP was equal to 0.01. However, when the SDCP was equal to 1.0 or 2.0, the estimate of coefficients of the MCCRM were more precise than those of the CSHM (Tables 2 and 3). Additionally, the AUC of the MCCRM was significantly greater than that of the CSHM. According to the simulation results, the estimators of coefficients of MCCRM can be seen as unbiased compared with the true value β 01 = (1, −3, 0.01, 1, 1). Furthermore, through a large number of simulations, we found that when the SDCP was less than 0.1, there was no significant difference between the two models. When the SDCP was greater than 0.1, the AUC of the MCCRM was significantly greater than that of the CSHM, and the difference increased progressively with the increase in the SDCP (Figure 1). When the SDCP was greater than 0.1, the estimate of coefficients of the MCCRM was more precise than those of the CSHM, and the difference was highly significant.      No obvious difference in performance was shown between censor ratios of Q = 0.4 and Q = 0.2. Further the performance was similar with sample sizes of N = 1000 and N = 5000. The robustness of the estimators (bias, SD, RMSE, AUC, E/O, etc.) was not good when N was less than 1000, and the performance of the statistics was robust when N was sufficient. Table 4 presents the calibration of the MCCRM. The field name t-i expresses the time interval [0, i], i = 1, 2, · · · , 5. SDCP is the standard deviation of the center parameter, and only the simulation results of SDCPs equal to 0.5 or 1.0 are listed. The value of the first cell, 1.0510, represents the E/O at the given time interval [0,1]. E is the expected number, which is the sum of the absolute risk of every sample generated in Section 3.1. O is the observed number, which is the number of samples whose observed time was less than or equal to 1, and the observed cause was 1 (the outcome of interest).  Table 4 shows that the E/O was acceptable when the SDCP was equal to 0.5. However, when the SDCP was equal to 1.0 and the time length was greater than 3, the E/O was unsatisfactory. Through many simulations, we found that the E/O was acceptable when the SDCP was less than or equal to 0.5 and the time length was less than or equal to 5. Additionally, the precision of E/O decreased linearly with an increase in SDCP.

Illustration
We obtained data from the Shandong Center for Disease Control and Prevention study from patients with four diseases (stroke, coronary heart disease (CHD), lung cancer, and stomach cancer) from 17 cities in Shandong Province, China in 2015. For every disease, we had data on the incidence number and population size of the 17 cities, which were stratified by age (five years for each interval). Furthermore, lung cancer and stomach cancer were stratified by gender. We calculated the incidence for each city and then calculated the SDCP after the transformation of incidence using Equation (10). Table 5 shows SDCP for the four diseases of patients whose age was equal or greater than 40. According to the results in Section 3, when the SDCP is greater than 0.1, the heterogeneity across different centers cannot be ignored. From Table 5, we can see that all of the numbers were significantly greater than 0.1; thus, it is necessary to emphasize the importance of the MCCRM during the practical application of multi-center data. We chose stroke to illustrate the performance of the MCCRM. The results of other diseases were analogous. Table 6 presents the comparisons of the regression coefficients and the AUC of the two models with stroke. We used the stroke incidence of the 17 cities to generate the center parameter with Equation (10), and the age interval was equal or greater than 50 and less than 55. The covariates survival time and survival outcome were generated by the same method introduced in detail in Section 3. The sample size was 5000, and the censored ratio was 0.2. The SDCP 1.047 of ages 50 to 55, which according to Table 5, is obviously greater than 0.1. As the results of the simulation in Section 3, Table 6 shows that all of the estimators (bias, SD, RMSE) of the MCCRM were more precise and superior to the corresponding estimators of the CSHM. For example, the maximum of RMSE of the CSHM was 0.8580, while the RMSE of the same covariate (HDL) in the MCCRM was 0. 079. The AUC of the MCCRM was 0.7994, while the AUC of the CSHM was only 0.7205.

Discussion
A common question arising in multi-center random clinical trials and multi-center cohort studies where competing risks exist is whether any heterogeneity in outcomes exists, and whether the heterogeneity has an obvious influence on the research target. Therefore, it is necessary to choose the appropriate model and determine whether statistical adjustment is required while estimating the effect of risk factors or calculating the absolute risk of a certain disease. When analyzing multi-center survival data, frailty survival models have been shown being useful, notably with regard to the usual large number of centers and low number of patients in each center [37,38]. Nevertheless, frailty survival models do not provide any detailed differences of the CSHM and frailty models. Through the simulation in Section 3, we have provided a precise analysis of the CSHM and the MCCRM by changing the sample size, censored ratio, the standard deviation of the center parameter, and the number of centers, among other factors. Theoretical studies will be presented in follow-up work.
With existing competing risks, Bayesian statistics have been reported to be more useful and efficient for assessing prior information, variable selection, and absolute risk [39][40][41][42]. Moreover, Bayesian models are more flexible than empirical models. However, in this paper, in order to emphasize the importance of heterogeneity, we have only mentioned the multi-center data, and did not take prior information into account. The limitation of this paper is that only baseline data are used for prediction, and the situation of multiple follow-up observations is not fully considered. This will inevitably affect the accuracy of predicting absolute risk and the stability of the model. In our follow-up work, we will consider combining the multiple follow-up observations, the prior information and multi-center data under a competing risks scenario.

Conclusions
Through Equation (10), we calculated the SDCP, which helped us select the most appropriate model according to the physical truth. When the SDCP was less than 0.1, the MCCRM and CSHM performed analogously, so either could be selected randomly for the practical application. When the SDCP was equal to or greater than 0.1, the performance of the MCCRM was significantly superior to the CSHM according to estimators such as bias, SD, RMSE, AUC, and E/O. Furthermore, when the SDCP was too big, the CSHM became inefficient, then the MCCRM should be selected as the appropriate model. Therefore, MCCRM can help us make full use of multi-center data and give accurate estimates of covariate coefficients. Moreover, the covariate coefficients of the MCCRM are consistent for different centers or areas, so the explanation of covariate coefficients has become more simple and reasonable.
Using Equation (11) and the MCCRM, the absolute risk of stroke occurring for a certain person was calculated. The approach of calculating absolute risk was excellent when the time interval was not too large, according to the calibration in Section 3. That is, short time intervals were predicted more precisely than long time intervals. However, only having the baseline value of covariates in a cohort study may cause inaccuracy in long time interval prediction. Acknowledgments: We would like to thank the Shandong Center for Disease Control and Prevention for their permission to use their data.

Conflicts of Interest:
The authors declare no conflict of interest.