Data Analysis of the Risks of Type 2 Diabetes Mellitus Complications before Death Using a Data-Driven Modelling Approach: Methodologies and Challenges in Prolonged Diseases

(1) Background: A disease prediction model derived from real-world data is an important tool for managing type 2 diabetes mellitus (T2D). However, an appropriate prediction model for the Asian T2D population has not yet been developed. Hence, this study describes the construction of the T2D Holistic Care (T2DHoc) model, which estimates the probability of diabetes-related complications and the time to their occurrence from a population-based database. (2) Methods: The model was based on the database of a Taiwan pay-for-performance reimbursement scheme for T2D between November 2002 and July 2017. A nonhomogeneous Markov model was applied to simulate multistate (7 main complications and death) transition probabilities after accounting for the difficulties posed by sequential and repeated events. (3) Results: The Markov model was constructed based on clinical care information from 163,452 patients with T2D, with a mean follow-up time of 5.5 years. After simulating a cohort of 100,000 hypothetical patients over a 10-year time horizon based on selected patient characteristics at baseline, the predicted complication and mortality rates agreed well with those observed in the original cohort, with a small range of absolute error (0.3–3.2%). Better predictability than the UKPDS Outcomes model, and optimal predictability when the model was applied to other Asian populations, were further confirmed. (4) Contribution: The study provides well-elucidated evidence for applying real-world data to the estimation of the occurrence and time point of major diabetes-related complications over a patient's lifetime. Further applications in health decision science are encouraged.


Introduction
The incidence of type 2 diabetes (T2D) has reached epidemic proportions globally, with an average of one person dying of diabetes every 8 seconds worldwide [1]. In the South-East Asia region, T2D is projected to increase by 74% from 2019 to 2045, resulting in a great economic impact [1,2]. Inappropriate T2D care and management directly cause permanent morbidities, including cardiovascular disease (CVD), kidney [...] new models using a contemporary patient-level input dataset. The data-driven approach to the predictive model is presented through the management of prevalent short- and long-term complications in diabetes, where long-term is defined as at least 15 years of follow-up.
Diabetes is one of the top priorities in medical science and health care management, and an abundance of data and information is available on these patients. Whether the data stem from statistical models or complex pattern recognition models, they may be fused into predictive models that combine patient information and prognostic outcome results. Almost any statistical regression model can be used as a predictive model, but because of their transparent functionality, multiple logistic or similar linear regressions are often used for prediction model development [16]. However, a prediction model should take into account all important complication risks as a whole. Capturing the whole disease as a system has been discussed by Tappenden et al. [17] and Esensoy and Carter [18].
Such knowledge could be used in clinical decision support, disease surveillance, and public health management to improve patient care. Predictive models often include multiple predictors (covariates) to estimate the probability or risk of a certain outcome, or to classify whether a certain outcome is present or absent (diagnostic prediction model) or will happen within a specific timeframe (prognostic prediction model) in an individual. Although extensive effort has been put into building these prediction models, there is a remarkable scarcity of impact studies owing to data limitations. In Taiwan, the national diabetic care registry was established in 2001, so it is possible to collect enough information for long-term prediction with important covariates to appraise high-risk events. Regarding the risk of diabetic complications, the primary events of this study were arteriosclerotic heart disease (ASHD), ischaemic heart disease (IHD), chronic heart failure (CHF), ischaemic stroke (ISC), first-time renal failure (FESRD), retinopathy (EYE), amputation (Fin Foot), and death. The T2DHoc model was built upon the occurrence of complication risks after a diagnosis of diabetes. Figure 1 presents the logic structure with two layers, where the first layer after T2D represents the first occurrence of a complication and the second layer represents the second complication. Therefore, the study aims to describe the construction details of the T2DHoc model, which estimates the probability of diabetes-related complications and the time to their occurrence from a population-based database. In addition, we validate this model against the results of Japanese and Korean clinical studies and the UKPDS Outcomes Model.

Data Sources
The data for developing the Taiwan T2DHoc model were taken from the Diabetic Pay-for-Performance (P4P) Registry, a subset of the Taiwan National Health Insurance Research Database (NHIRD). The Taiwan National Health Insurance (NHI) programme, initiated in 1995, is a compulsory universal health insurance scheme that provides holistic healthcare to over 99% of Taiwan's inhabitants. All contracted healthcare providers are mandated to upload medical claims; the database contains comprehensive information on insured subjects, including dates of clinical visits, diagnostic codes, details of prescriptions, medical procedures, and expenditure [2,3]. Taiwan's NHI Administration regularly reviews the system to prevent waste and to safeguard the quality and safety of public healthcare [4]. Under governmental regulations, the NHIRD has been opened to researchers for generating real-world, population-based evidence to test clinical and epidemiological research hypotheses. It includes approximately 2.4 million laboratory records, 60 million outpatient records, and 554,000 inpatient records.
In 2001, the NHI Administration launched a pay-for-performance programme to encourage healthcare providers to deliver high-quality care to patients with T2D. Patients with T2D and healthcare providers can voluntarily enrol in the programme. During its early period, the programme cultivated and certified T2D care providers and encouraged them to increase patient monitoring and follow-up [5]. To verify care quality, the laboratory data of patient care were uploaded under the T2D P4P scheme. This study was approved by the Institutional Review Boards of National Chengchi University and National Health Research Institutes, Taiwan (NCCU-REC-201603-I005 and EC1050505-E). All research procedures followed the directives of the Declaration of Helsinki. All identifying personal information was removed from the data files prior to analysis, so the review boards waived the requirement for written informed consent.
In addition, the T2D cohort data were verified against the Kaohsiung Medical University Hospital Research Database (KMUHRD). Kaohsiung Medical University Hospital (KMUH) is a medical centre located in southern Taiwan, with around 1600 beds and 6000 clinical visits per day in 2015. The KMUHRD, which is managed by the KMUH Division of Medical Statistics and Bioinformatics, offers comprehensive data on approximately two million patients who attended KMUH from 2009 to 2015, covering ambulatory care, hospital admissions, dental services, drug-dispensing records, and biochemical test results. For confidentiality and in accordance with the Personal Information Protection Act, all personal identifiers were removed, and authorised researchers performed data linkage, processing, and statistical analyses only on specified computers in an independent room monitored 24 hours a day, using encrypted identifiers. In this study, all diagnoses were coded according to the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM). Subjects with type 2 DM (ICD-9-CM codes 250.1-250.9) who were prescribed hypoglycaemic agents and had an HbA1c level ≥6.5% (≥48 mmol/mol) were included. The clinical definitions of diabetic complications of T2D patients are summarised in Appendix A.

Data-Driven Approaches
The NHIRD stores a huge amount of patient-specific data, including demographic information, diagnoses, laboratory tests, prescriptions, radiological images, and clinical notes, from which information can be extracted via neural networks and deep learning models. Data-driven thinking and methods, which depart from traditional statistical tools or algorithms, use advanced computational and mathematical systems to cope with biomedical data and to discover underlying patterns, disease phenotypes, unanticipated effects, etc. Dynamic intervention is often involved in healthcare systems. Because population needs change over time and clinical entities interact with one another, it is difficult for decision-makers to plan healthcare services. Thus, with their capability of dealing with the variety and uncertainty in healthcare systems, computer-based methods provide a more practical way to model diabetes management systems without costly and time-consuming experimentation [19][20][21].
Generally, there are two types of models for building a prediction model: parametric and nonparametric. Parametric models make assumptions regarding the underlying data distribution, whereas nonparametric models (and semiparametric models) make fewer or no assumptions about the underlying distribution. The model developed in this study is a combination of both, with respective time distributions and transition rates among complications. A total of 62 (13 + 49) different combinations of complications (endpoints) were predicted using the T2DHoc risk engine through a discrete-event micro-simulation process under the corresponding time horizon of each possible diabetic complication.
Diabetes progression was characterised by changes in a patient's disease status in terms of the number of complications that they developed. The occurrence of one event may preclude the development of another; thus, the major diabetes progression statuses in this study include T2D-associated complications, all-cause death, and remaining in the same disease state, while considering competing risk events. Prognostic models with competing risks have been discussed in various studies [22][23][24][25]. To tackle the interaction among health states, we analysed the development of these competing risk events within a single mathematical analytic framework instead of treating each complication in isolation, which enables us to determine whether one event is more likely to occur than another. A tree-like structure of patients' disease progression was set up as shown in Figure 2, consisting of 62 risk equations representing important pathways after diabetes has been diagnosed. The baseline characteristics of each trial have been reported and were applied directly as the characteristics of the simulation samples. Conventionally, Weibull distributions are used to describe the survival time distribution, and this choice was adopted from the corresponding literature [26]. The simulation assumes that all transition times between health states are Weibull distributed, whereas normal distributions are assumed for the standard error of some risk factors used as simulation variates. HbA1c may change during the simulation, where the value function in Figure 3 is constructed based on clinical data. Table 1 gives the distribution of values of key risk factors, including low-density lipoprotein (LDL), systolic blood pressure (SBP), urine, etc., which were assumed to reach the corresponding treatment target in the first year and were used as input parameters for hypothetical patients. Some risk factors are assumed constant for each individual patient in the simulation.
The prediction model is based on computer simulation techniques, a method of modelling the progression of T2D and predicting long-term disease outcomes. The development of the simulation model involved (1) re-estimating, over a longer duration of follow-up, the 13 risk equations for the complications ASHD, ISC, CHF, Fin Foot, EYE, FESRD and their combinations, and (2) estimating the risks that follow the first complication, which gives sixty-two equations in total (13 + 49) and some follow-up events.

Building Risk Equations
In this analysis, a proportional hazards Weibull regression model was used to model diabetes-related complications with a baseline hazard:

$$h_0(t) = \lambda \kappa t^{\kappa - 1} \tag{1}$$

where $\kappa$ is a shape parameter and the scale parameter $\lambda = \exp(\beta_0)$ is expressed through the exponentiated intercept coefficient $\beta_0$. According to the proportional hazards assumption, the hazard of an event at time $t$ is:

$$h(t \mid X(t)) = h_0(t) \exp\left(X(t)^{\top} \beta\right) \tag{2}$$

where $X(t)$ is a vector of covariates and $\beta$ is a vector of associated coefficients. The unknown parameters requiring estimation are $\lambda$ (equivalently $\beta_0$), $\kappa$ and $\beta$. Some of these covariates (such as starting age and sex) remain constant as time elapses, while others potentially vary over time (such as HbA1c and SBP). For each risk factor $k$, a Bayesian regression learning algorithm finds the coefficient $\beta_k(i, j)$ associated with $X_k$, which represents the risk score borne during the time interval $(t_0, t_1)$ by a patient of age $a$ moving from health state $i$ to state $j$ (one complication). Here, a survival distribution was required for each pair $(i, j)$, assuming that no other event was possible. Then, a time was sampled for each event, and the earliest time determined which event happened. This is implemented by treating the other events as censored and discarding their sampled times. Namely, the risk score at time $t$, $t_0 < t \le t_1$, gives this patient a quantified value $R_a$ that indicates the impact of the specified complication under health conditions $X_k$, $k = 1, 2, \ldots, p$:

$$R_a = \sum_{k=1}^{p} \beta_k(i, j) X_k(a, t) \tag{3}$$

where $X_k(a, t)$ are the explanatory variables and $\beta_k(i, j)$ their coefficients ($k = 1, \ldots, p$), provided there are $p$ cofactors to consider in this model. The Cox proportional hazard regression expression can be written as:

$$h(t \mid X) = h_0(t) \exp(R_a) \tag{4}$$

According to the corresponding conditional assumptions, we suggest a risk equation expressed, for $t_0 < t \le t_1$, by:

$$P(t) = 1 - \exp\{-[H(t) - H(t_0)]\} \tag{5}$$

provided that the accumulated hazard at time $t$ is $H(t)$.
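The event-selection step described above (sample a time for each possible event, keep the earliest, discard the rest) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes time-constant covariates, so that the survival function is $S(t) = \exp(-\lambda t^{\kappa} e^{R})$ with $\lambda = \exp(\beta_0)$, and it uses inverse-transform sampling. The names `sample_weibull_ph_time`, `next_event` and the values in `ILLUSTRATIVE_PARAMS` are hypothetical and are not estimates from the study.

```python
import math
import random

# Hypothetical (beta0, kappa) pairs per event; NOT coefficients from the paper.
ILLUSTRATIVE_PARAMS = {
    "ASHD": (-4.0, 1.2),
    "CHF": (-5.0, 1.1),
    "death": (-4.5, 1.3),
}


def sample_weibull_ph_time(beta0, kappa, risk_score, rng):
    """Draw one event time from S(t) = exp(-exp(beta0 + risk_score) * t**kappa).

    Inverse transform: set S(t) = u with u uniform on (0, 1] and solve for t.
    """
    u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1]
    rate = math.exp(beta0 + risk_score)
    return (-math.log(u) / rate) ** (1.0 / kappa)


def next_event(params, risk_scores, rng):
    """Sample a time for every candidate event and return the earliest one.

    The later sampled times are discarded, which mirrors treating the other
    events as censored at the time of the first event.
    """
    times = {
        name: sample_weibull_ph_time(beta0, kappa, risk_scores[name], rng)
        for name, (beta0, kappa) in sorted(params.items())
    }
    first = min(times, key=times.get)
    return first, times[first]
```

For example, `next_event(ILLUSTRATIVE_PARAMS, {"ASHD": 0.3, "CHF": 0.0, "death": 0.1}, random.Random(42))` returns the name of the first event and its sampled time in years for one hypothetical patient.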
Model Equation (5) denotes the probability (risk) of a complication event during $(t_0, t_1)$ for a patient with newly diagnosed diabetes, in the absence of death from causes other than this specific complication. Information retrieved from the collected data provides the essential details of the mechanism governing these transitions and serves as the major modelling tool in this study. Predicted values of $R_a$ were used in conjunction with the event equations to complete the simulations. This means that $H(t)$ may be estimated for a specific diabetes complication, such as the CHF or ISC status equation, from the diagnosis of diabetes [27]. The detailed description and methodology used for the prediction model are given in Appendix B.
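To make the interval risk concrete, the sketch below evaluates the probability of an event during $(t_0, t_1)$ from a cumulative hazard. It is an illustration under stated assumptions, not the formulation in Appendix B: covariates are held fixed over the interval, the cumulative hazard takes the standard Weibull proportional hazards form $H(t) = \exp(\beta_0 + R_a)\, t^{\kappa}$, and the risk is conditional on being event-free at $t_0$. The function names and the numbers in the usage note are hypothetical.

```python
import math


def cumulative_hazard(t, beta0, kappa, risk_score):
    """Weibull PH cumulative hazard H(t) = exp(beta0 + risk_score) * t**kappa."""
    return math.exp(beta0 + risk_score) * t ** kappa


def interval_risk(t0, t1, beta0, kappa, risk_score):
    """Probability of the event in (t0, t1], given no event up to t0.

    Uses 1 - exp(-(H(t1) - H(t0))), i.e. the survival ratio S(t1) / S(t0).
    """
    delta_h = (cumulative_hazard(t1, beta0, kappa, risk_score)
               - cumulative_hazard(t0, beta0, kappa, risk_score))
    return 1.0 - math.exp(-delta_h)
```

With a hypothetical intercept of `math.log(0.01)`, shape 1.2 and risk score 0.4, `interval_risk(5.0, 6.0, math.log(0.01), 1.2, 0.4)` gives the one-year risk in the sixth year after diagnosis for a patient who reached year five without the event.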
Confidence intervals presented in this paper are based on a two-stage process of evaluation. In the first stage, the original data used to fit the risk equations were bootstrapped, the risk equations were refitted, and the coefficients were recorded. Repeating this process many times generated a vector of coefficients that represented the parameter uncertainty in those coefficients, also accounting for the covariance between risk equations. In the second stage, because the model was applied to predict lifetime outcomes of P4P patients, we treated the predictions as imputations of missing values, in that we were predicting values that were not observed. Standard methods for combining the results of multiple imputations [28] were then employed, including a bias correction to adjust for the relatively small number of multiple imputations performed.
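The pooling step can be illustrated with Rubin's rules, which are the most widely used method for combining multiple-imputation results; whether reference [28] applies exactly this form is an assumption here. The sketch takes one point estimate and one variance per imputation (or per bootstrap replicate) and returns the pooled estimate and total variance, where the factor $(1 + 1/m)$ inflates the between-imputation variance to account for a small number of imputations $m$. The function name `pool_rubin` is hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m imputation-specific results with Rubin's rules.

    estimates: point estimates, one per imputation
    variances: squared standard errors, one per imputation
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # sample variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between  # small-m correction on 'between'
    return pooled, total
```

A 95% interval can then be formed around the pooled estimate from the square root of the total variance, using a normal or t reference distribution depending on the number of imputations.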

The Simulation
Based on the T2DHoc model, the computer simulation was executed on a cohort profile of hypothetical patients, with the initial demographic characteristics and baseline risk factors given in Table 1. In the simulation of diabetes, hypothetical patients were created based on random baseline risk factors. Each patient has a set of attributes whose values are set at the start of the simulation and may be updated as events occur, for example as age increases, disease severity decreases, or the number of risk events is incremented. For example, HbA1c changes according to the function in Figure 3, and disease duration follows the time distribution given in Equation (2). The flowchart of the simulation procedure is shown in Figure 4. When a hypothetical patient is created, a future state is randomly chosen from a functional distribution whose parameters were best fitted to the data set. The next event was selected by sampling a time for each possible event and choosing the minimum, that is, by following whichever event was the first to happen. Accordingly, the duration of the disease was also determined for this patient. An explicit simulation clock keeps track of the passage of time. If the future state is death, the patient moves to the patient death box shown in Figure 4. Otherwise, the patient continues the course of the disease with advancing age, and a further future state is selected in the simulation.
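The procedure above, with its explicit simulation clock, can be sketched as a short loop for a single hypothetical patient. This is a simplified illustration and not the T2DHoc implementation: hazards are time-constant Weibull draws with made-up parameters in `HAZARDS`, risk factors such as HbA1c are not updated, each complication may occur at most once, and waiting times are redrawn from zero after every event. All names are hypothetical.

```python
import math
import random

# Hypothetical (beta0, kappa) per event; NOT estimates from the study.
HAZARDS = {"ASHD": (-4.0, 1.2), "ISC": (-5.5, 1.1), "death": (-4.5, 1.3)}


def draw_time(beta0, kappa, rng):
    """Inverse-transform draw from a Weibull with H(t) = exp(beta0) * t**kappa."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return (-math.log(u) / math.exp(beta0)) ** (1.0 / kappa)


def simulate_patient(start_age, horizon_years, rng):
    """Return the list of (event, years_since_start, age_at_event) for one patient."""
    clock = 0.0                 # explicit simulation clock, in years
    history = []
    at_risk = set(HAZARDS)      # events that can still happen
    while at_risk:
        # Sample a waiting time for every remaining event; the earliest wins.
        waits = {e: draw_time(HAZARDS[e][0], HAZARDS[e][1], rng)
                 for e in sorted(at_risk)}
        event = min(waits, key=waits.get)
        clock += waits[event]
        if clock > horizon_years:
            break               # simulation horizon reached first
        history.append((event, clock, start_age + clock))
        if event == "death":
            break               # absorbing state
        at_risk.discard(event)
    return history
```

Running `simulate_patient(54.0, 10.0, random.Random(1))` many times with different generators and counting events by year yields simulated cumulative incidences that can be compared with observed ones.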
In all subsequent cycles, risk factors generate random effects through $R_a$ and $H(t)$, such that transitions between pairs of health states $(i, j)$ take place at different times for each patient. Moving forward in time, the simulation mimics disease progression by creating diabetic complications one by one until death occurs or the simulation is terminated, whichever comes first. By this procedure, the simulation can model the disease progression pathway of an individual. The micro-simulation was conducted using a joint programme written in Python and Excel. Accordingly, each hypothetical patient generates their own trajectory over time by simulation. The simulation was conducted with 12,000 patients over the time horizon to produce cumulative incidences corresponding to the observations in the NHIRD. While running the simulation, the model was verified by comparing the simulation outcomes with the observations. The simulation was run until the outcomes converged; an ordinary least-squares model was then used to fit the T2DHoc predicted incidence rates to the observed incidence rates, and the slope, intercept, and $R^2$ were used to evaluate prediction accuracy. After the simulation experiments with the expected risk factors in representative populations generated hypothetically from the patient characteristics, a set of diabetes durations and a reasonable level of complications were collected and used to predict the diabetic risks of diabetic patients with the same conditions.
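The accuracy check described above (ordinary least squares of predicted on observed incidence rates, summarised by slope, intercept and $R^2$) can be sketched as follows. The helper `ols_fit` is a hypothetical name; the four rate pairs are the first-event annual rates for ASHD, stroke, CHF and renal failure reported in item (2) of the Internal Validation section and are used only to illustrate the calculation, not to reproduce the paper's fit.

```python
def ols_fit(observed, predicted):
    """Least-squares line predicted = intercept + slope * observed, plus R^2."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, predicted))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in predicted)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Annual first-event rates from item (2) of Internal Validation:
# ASHD, stroke, CHF, renal failure (observed vs simulated).
OBSERVED_RATES = [0.0114, 0.0031, 0.0047, 0.0127]
SIMULATED_RATES = [0.0128, 0.0033, 0.0044, 0.0128]

slope, intercept, r_squared = ols_fit(OBSERVED_RATES, SIMULATED_RATES)
```

A slope near 1, an intercept near 0 and an $R^2$ near 1 indicate that the predicted rates track the observed ones closely; with only four points, such a fit is indicative rather than conclusive.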

Internal and External Validation
According to the best practice recommendations of the International Society for Pharmacoeconomics and Outcomes Research and the Society for Medical Decision Making Task Force [29,30], model transparency and validation are of particular importance. T2DHoc has maintained transparency since the first day of its development. The public health data on which the model is based are available on request to anyone who holds permission issued by the NHIRD. In addition, the model is open to internal researchers who are investigators on NHIRD projects. The model validation involved face validity, verification, cross-validity, external validity, and predictive validity. For general face validation of this model, we consulted diabetologists and endocrinologists to ensure that the model was constructed, and would be used, in line with their expectations. Model verification was performed to ensure that there were no unintentional computational errors; each equation and the programme code were checked against the model implementation.
In general, internal validation is designed to assess whether the model output is internally consistent with the data sources used to model the disease progression. The model construction with an interaction framework was examined by endocrinologists, nephrologists, cardiologists, and experts in other related fields. Moreover, the model equations and their parameters were selected through statistical tests conducted in SAS software. The model framework and assumptions were examined, and the definition and use of the parameters embedded in each risk equation were reviewed. After face validation and functional specification as described above, model verification was executed and evaluated by comparison using Equation (5), where parameter calibration was applied for numerical stability. The cumulative incidence in the competing risk analysis was computed in Python.
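For the cumulative incidence under competing risks, a common nonparametric choice is the Aalen-Johansen estimator; the source does not say which estimator was used, so the sketch below is an assumption that only illustrates the calculation. At each event time, the increment for the cause of interest is the overall event-free survival just before that time multiplied by the cause-specific hazard. The function name and the toy data in the usage note are hypothetical; censored observations are marked with `None`.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Aalen-Johansen cumulative incidence for one cause.

    times:  follow-up time per patient
    causes: event label per patient, or None if censored
    Returns a list of (time, cumulative_incidence) at each distinct event time.
    """
    data = sorted(zip(times, causes), key=lambda pair: pair[0])
    n_at_risk = len(data)
    event_free = 1.0   # overall event-free survival just before current time
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        j = i
        d_all = 0      # events of any cause at time t
        d_cause = 0    # events of the cause of interest at time t
        while j < len(data) and data[j][0] == t:
            if data[j][1] is not None:
                d_all += 1
                if data[j][1] == cause_of_interest:
                    d_cause += 1
            j += 1
        if d_all > 0:
            cif += event_free * d_cause / n_at_risk
            event_free *= 1.0 - d_all / n_at_risk
            curve.append((t, cif))
        n_at_risk -= j - i   # events and censorings leave the risk set
        i = j
    return curve
```

For instance, `cumulative_incidence([1, 2, 3, 4], ["ASHD", "death", "ASHD", None], "ASHD")` tracks how the ASHD incidence accumulates while deaths remove patients from the risk set instead of being treated as censoring.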

Model and Patient Characteristics
The characteristics of the diabetic patients are shown in Table 1. In summary, 163,452 T2D patients were included; 44.40% were female. The mean age at diagnosis and duration of diabetes were 54.00 ± 11.86 (range: 18.51, 98.07) and 5.56 ± 6.28 (range: 0, 61.51) years, respectively. The mean baseline HbA1c level was 7.8 ± 2.10%; body mass index (26.481 ± 3.97 kg/m²), triglycerides (172.64 ± 135.51 mg/dL) and LDL (116.13 ± 35.28 mg/dL) were slightly higher than the normal range; systolic (130.65 ± 18.16 mmHg) and diastolic (79.77 ± 13.82 mmHg) blood pressures were within the normal ranges. Patients with a history of diabetic complications and risk factors were identified and collected. In the simulation, the factors and biomarkers were labelled as shown in Table 2. Patient disease pathways over ten years were collected as completely as possible to study the natural course of prolonged T2D complications: only patients who first enrolled in the P4P programme between 2002 and 2005 were traced, and inconsistent or fragmented data were excluded. In total, 12,242 patients with at least one complete pathway were identified. The data collection procedure is outlined in Figure 5.

Processed and Final Outcomes
Changes in HbA1c levels, or levels at different time points, can have different implications for the clinician and for diabetic complications [19]. As one of the prediction functions in our investigation, a function of the HbA1c level over time was built depending on patient age (Figure 3). Among the possible diabetic risks, the initial complications identified included 2015 cases of ASHD (16.5%), 827 cases of CHF (6.8%), 549 cases of ISC (4.5%), 2219 cases of FESRD (18.1%), 246 treatments for a detached retina or vitreous haemorrhage and 7241 all-causes, with recurrent events including 284 cases of ASHD (14.1%), 59 cases of CHF (7.1%), 22 cases of ISC (4.0%), 2924 cases of IHD, and 1158 cases of ESRD (Tables 3 and 4). When only the complication risks within 5 years are considered, the numbers of events occurring at the first and second layers are as shown in Tables 5 and 6.

Internal Validation
(1) Taking ASHD as an example, the analysis showed that the proportion of diabetic patients with ASHD at 55-60 years of age ranges from 0.05 to 0.1 as age increases. The actual incidence of ASHD and the simulated proportion were almost the same at 55-56 years old; at 57-58 years the actual incidence was slightly lower but still within the prediction interval, and at 59-60 years the actual incidence was higher than the simulated proportion.
(2) Comparison of observations with the simulation analysis. (a) For the first occurrence of ASHD, the observed and simulated numbers were 2015 and 2268 patients, with annual rates of 0.0114 and 0.0128, respectively; for relapses, they were 284 and 309 patients, with annual rates of 0.0016 and 0.0017, respectively. (b) For the first occurrence of stroke, the observed and simulated numbers were 549 and 594 patients, with annual rates of 0.0031 and 0.0033, respectively; for relapses, they were 22 and 29 patients, with annual rates of 0.0001 and 0.0002, respectively. (c) For the first occurrence of CHF, the observed and simulated numbers were 828 and 780 patients, with annual rates of 0.0047 and 0.0044, respectively. (d) For the first occurrence of renal failure, the observed and simulated numbers were 2250 and 2268 patients, with annual rates of 0.0127 and 0.0128, respectively.
(3) A comparison of the simulated 10-year incidence rates of the first complication and overall deaths with the observations is presented in Figure 6. Compared with the observational data, the prediction gap for FESRD was the smallest (underestimated by 0.3%) and the prediction gap for retinopathy was the largest (overestimated by 3.2%). Most of the predicted complication rates were higher than the actual rates, with gaps of 1.30% for death, 2.60% for ASHD, 3.20% for retinopathy, 2.90% for ASHD + CHF, 2.00% for infarct stroke, 1.80% for CHF, 1.70% for diabetic amputation, 0.00% for CHF + infarct stroke, and 0.10% for CHF + diabetic amputation. The difference for atherosclerotic heart disease + infarct stroke was 0.20%. A similar accuracy is shown when the model is further applied to predict recurrent complications (Figure 7).
(4) The distribution of prediction error by complication and death rate is shown in Figures 8 and 9. The errors between the overall simulation results and the actual observations were within 5% of the predicted complication and death rates. If the average absolute percentage error is used for evaluation and only 5 out of 10,000 events are included, ASHD, death, and ESRD are within the generally accepted 30% range. Stroke and CHF are predicted with moderate accuracy, whereas the incidence of foot lesions is highly overestimated.
For example, the model estimates the cardiovascular risk from the T2DHoc risk engine, reflecting the disease progression of T2D patients in the P4P programme. The internal validation of the current T2DHoc risk engine can be shown by checking the 5-year occurrences of each complication after the diagnosis of diabetes in Table 7. For further validation of long-term prediction, we examined the incidence rate and recurrence of risks within 10 years.
In the long-term prediction, the event rate for ASHD + CHF predicted by the risk engine was close to the observed rate, indicating that the model predicts accurately. The prediction confidence intervals may vary across complications depending on health conditions, but the model captures the trend of disease progression.
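As a worked illustration of the absolute percentage error mentioned in item (4), the sketch below applies it to the observed and simulated first-event counts listed in item (2). This is arithmetic on those four pairs only; it is not a reproduction of the errors shown in Figures 8 and 9, and the names `FIRST_EVENT_COUNTS` and `absolute_percentage_error` are hypothetical.

```python
# (observed, simulated) first-event counts from item (2) above.
FIRST_EVENT_COUNTS = {
    "ASHD": (2015, 2268),
    "stroke": (549, 594),
    "CHF": (828, 780),
    "renal failure": (2250, 2268),
}


def absolute_percentage_error(observed, simulated):
    """|simulated - observed| as a percentage of the observed count."""
    return 100.0 * abs(simulated - observed) / observed


ape = {name: absolute_percentage_error(obs, sim)
       for name, (obs, sim) in FIRST_EVENT_COUNTS.items()}
mean_ape = sum(ape.values()) / len(ape)
```

On these counts every error stays below the 30% threshold mentioned above, with renal failure showing the smallest relative error and ASHD the largest.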

External Validation
External validation was conducted by comparing the model with other models, such as UKPDS [14], the Osaka model [31], the Korean model, and the Hong Kong model, and by comparing the risk events predicted by the model with observed clinical outcomes from research studies that were not directly used to inform disease progression.
(1) The well-known Japanese diabetes study [31], a T2D clinical trial conducted from 1995 to 1996, involved a total of 2205 people aged 40-70 years with HbA1c >6.5% who were randomly assigned to a lifestyle intervention group or a conventional treatment group. Initial data were available for all patients in both groups. The two sets of initial data were used in the Taiwan diabetes model to simulate 5000 hypothetical patients and compare the complication rates. According to the literature, after 7.8 years of follow-up, the incidences of complications in the intervention group were coronary heart disease: 7%; stroke: 10%; nephropathy: 6.7%; eye lesions: 29%, and in the control group were arterial heart disease: 7%; stroke: 6.5%; nephropathy: 6.7%; eye disease: 35.7%. The simulation results for the intervention group were coronary ASHD: 10.3%; ISC: 2.6%; ESRD: 1.1%; eye disease: 3.6%, and those for the control group were coronary ASHD: 10.9%; ISC: 2.0%; ESRD: 0.8%; eye disease: 3.8%. In this cohort, nephropathy was defined as proteinuria (UACR >300 mg/g) and ocular lesions were defined by clinical tests (phases 1-4), which differs from the diagnosis codes and the definition of dialysis used in the Taiwan diabetes model, so the observed values were much higher than the predicted values.
(2) In South Korea, a cohort of 732 diabetic patients was collected from Boramae Hospital in 2006 and followed for 6 years [32]. It was observed that 43 (6.6%) patients developed coronary heart disease, and that using the UKPDS risk formula would overestimate the disease risk of these patients. The patients' initial data were used in the Taiwan diabetes model to simulate 5000 hypothetical patients and estimate coronary ASHD over 6 years; the result was 6.8%, consistent with the observed proportion.
(3) The Hong Kong diabetes cohort was established in 1995, with a total of 7534 patients with T2D collected (the average duration of diabetes was 7.1 years, and the prevalence of hypertension was 70%) [33]. The patients were followed for 5 years, targeting different common types of diabetes. According to the literature, the numbers of major complications in this cohort within 5 years were death: 763; coronary heart disease: 377; stroke: 362; diabetic nephropathy: 693; CVD: 1120; ESRD: 282, and the corresponding percentages were death: 10.13%; coronary heart disease: 5%; stroke: 4.8%; diabetic nephropathy: 9.2%; CVD: 14.87%; end-stage renal disease: 3.74%. These data were used in the Taiwan diabetes model for simulation, giving death: 1862; coronary ASHD: 1142; ISC: 446; ESRD: 445, with corresponding percentages of death: 17.2%; coronary ASHD: 11.4%; ISC: 4.46%; ESRD: 4.5%. The death rate by simulation was slightly higher than the observed value. The predicted rates of stroke, coronary ASHD, and ESRD were slightly higher than the observed rates.
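The observed percentages quoted for the Hong Kong cohort in item (3) follow directly from the event counts and the cohort size of 7534. The short sketch below recomputes them as a consistency check; the variable and function names are hypothetical, and only figures stated in item (3) are used.

```python
COHORT_SIZE = 7534  # Hong Kong T2D cohort size reported in item (3)

# Observed 5-year event counts and the percentages reported alongside them.
OBSERVED_COUNTS = {
    "death": 763,
    "coronary heart disease": 377,
    "stroke": 362,
    "diabetic nephropathy": 693,
    "CVD": 1120,
    "ESRD": 282,
}
REPORTED_PERCENT = {
    "death": 10.13,
    "coronary heart disease": 5.0,
    "stroke": 4.8,
    "diabetic nephropathy": 9.2,
    "CVD": 14.87,
    "ESRD": 3.74,
}


def percent_of_cohort(count, cohort_size=COHORT_SIZE):
    """Event count expressed as a percentage of the cohort."""
    return 100.0 * count / cohort_size


recomputed = {name: percent_of_cohort(n) for name, n in OBSERVED_COUNTS.items()}
largest_gap = max(abs(recomputed[name] - REPORTED_PERCENT[name])
                  for name in OBSERVED_COUNTS)
```

The recomputed values agree with the reported percentages to within rounding, which supports reading the observed percentages as shares of the full cohort.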
The external validation studies selected in this study cover a broad range, including observations of each complication risk with duration as well as data collected on disease progression. The specific external validation sources included in this analysis were the National Health dataset and clinical datasets. The external validation also compared T2DHoc with UKPDS using 3867 UK patients, as illustrated in Figure 10. The UKPDS results mentioned above also support the use of the current risk engine.
Data from the literature were compiled to compare the discrete event simulation across advanced Asian populations, showing that subjects in South Korea and Japan were older than subjects in Taiwan. The mean values of HbA1c, SBP, and HDL cholesterol were lowest in South Korea and highest in Japan. The mean value of triglycerides was lowest in Japan and highest in Taiwan. Information on LDL cholesterol was limited in Japan and South Korea, and information on serum creatinine and urine albumin was limited in Japan, South Korea, and Hong Kong. Where such data were missing, the simulation inputs for these three populations were assumed to be the same as the Taiwanese values. We further compared the main complications listed in this plan with the original definitions used in the other Asian studies. The Japanese study mainly uses clinical diagnostic criteria to define major complications, while Hong Kong, like Taiwan, defines complications based on diagnosis codes. South Korea's definition of complications is based on retrospective medical history. Based on the differences among the four definitions, the Taiwan model was found to be more comparable in terms of the definition of ASHD for the diabetes cohorts in the other advanced Asian populations. Hong Kong ranks above Japan in terms of advanced research on the development of diabetic complications, and Japan ranks above South Korea. For the definitions of stroke and nephropathy, Hong Kong is more comparable than Japan. It is worth noting that the model-predicted values for death and ASHD are much higher than the observed values, which may be related to the medical ecology; the actual cause remains to be clarified.

Discussion
The study successfully extracted information from a long-running programme, with more than 163 thousand participants followed up over 16 years. The predicted complication and mortality rates agreed well with the observed numbers of complications and deaths in the original cohort. In addition, better and optimal predictabilities were further confirmed when the model was compared with the UKPDS Outcomes model and applied to other Asian populations, respectively.
The T2DHoc is based on more comprehensive follow-up data and captures more outcomes than existing models, which gives it significant advantages in representing the progression of diabetes. It permits detailed and reliable lifetime simulations of key health outcomes in people with T2D, especially in Asian societies. The event rates for complications predicted by the T2DHoc model are slightly lower than those of most existing models, whose estimated rates are often too high. The results suggest that applying recently developed models to clinical practice requires attention to their intrinsic variations, such as data source, operational definitions, time span, complexity, and feasibility. Our model, derived from practical data, shows great promise for narrowing the abovementioned gaps. However, more research is needed on methods for selecting models by a clinical usefulness index.
Previous studies have pointed out that T2D models developed mainly using Western patients do not appropriately reflect the risks of CVD and renal disease in the Asian population [34,35]. It is reasonable to suspect that ethnicity, genes (ALDH2 deficiency), environmental factors (birth cohort effects: World Wars, PM2.5 pollutants, food habits), and healthcare systems strongly influence the accuracy of a prediction model [36]. Although the definitions of the selected complications differ, the T2DHoc model performs better than the UKPDS Outcomes model or the Framingham risk score when applied to the Asian T2D population. Further studies are required to compare the performance of current models developed in Western populations with that of ours in more Asian ethnicities, such as Australian T2D patients.
The T2DHoc model has several advantages. First, it simultaneously takes into account many major complications, the absorbing state (death), and their dynamic time relationships within a single analytic framework. Thus, information on complications is captured more comprehensively. Second, the dynamic relationships between age and HbA1c levels were established and incorporated into the model structure, allowing more flexibility for practical applications. Finally, the follow-up time in the current study is clearly longer than that of other models developed from data collected in clinical trials. The accuracy for complications that need more time to be observed, such as renal failure, should therefore be largely improved.
Some limitations of the study should be recognized. The model development mainly uses laboratory data at baseline rather than subsequent medication changes, so we cannot assess the influence of medication use on the probabilities of health state changes. For example, the use of cardiovascular prevention drugs (such as antiplatelet drugs, renin-angiotensin-aldosterone blockade, and beta-blockers) may change a patient's probability of experiencing the first ASHD (health state j) from baseline (health state i). Although adding information on medication usage might enhance the current model's performance, medication use changes over time and is modified depending on the patient's condition, which would make the model more difficult to understand and to apply in practice. In addition, because of the long observation time, we could not rule out or quantify the influence of improvements in T2D care (such as new drugs) over time on our results.

Conclusions
Model building is an iterative process, and models need to be updated as new information becomes available. The T2DHoc was based on P4P data collected from 2002 up to now. Additional information collected during the P4P 10-year period provided an opportunity to update the simulation model and to incorporate data on new risk factors and outcomes that were unavailable when other diabetes prediction models were constructed.
The progression of diabetes in Asian patients differs from that reported in Western countries. For Asian patients, ASHD, chronic kidney disease, and stroke are more likely to occur as first complications than CHF, ASHD, retinopathy, and amputation. The risk of developing a further complication varies according to the patient's existing complication profile. Patients with an existing cardiovascular complication or retinopathy have a higher risk of developing ASHD and chronic kidney disease. These results inform clinical decision-making on prioritising monitoring and interventions for diabetic patients in this region who are at high risk of developing severe complications.
The study demonstrates how a simplified whole disease model approach can be applied to rare diseases to provide the disease burden estimation, serving as an information tool for payers, for diseases with small patient volume and an unknown cost burden. Additionally, machine learning, data mining, and text mining can be applied to data contained in electronic health records to further research other diseases.
Though efficacy and safety were not taken into account in this model, we believe that it still provides a good understanding of resource utilisation and costs of conditions with poorly documented epidemiology and disease burden. The enormous economic burden of type 2 diabetes mellitus (T2D) can be reduced by implementing inexpensive, easy-to-use interventions, such as joining a diabetes prevention programme.
T2DHoc is a prediction model with a single analytic framework suitable for predicting T2D risks for both long-term and short-term complications. This pilot study demonstrated that nonhomogeneous Markov modelling is useful for the T2DHoc. Further research, including the most up-to-date treatments, should continue in order to complete the description of the T2D disease process under diabetes management.

Informed Consent Statement:
The ethical review boards waived patient consent because all data used in the study have been de-identified.

Data Availability Statement:
The data used in our study were limited to research purposes only and cannot be made publicly available under the regulation of the Personal Information Protection Act in Taiwan. The raw data were obtained from the National Health Insurance Administration, Ministry of Health and Welfare (https://www.nhi.gov.tw/content_list.aspx?N=2d2faf5214807829&topn=787128dad5f71b1a) (accessed on 21 May 2021) and can be made available to qualified researchers upon request.

Acknowledgments:
We would like to thank the team members, Ming-Huan Chan, Shyi-Jang Shin, Jer-Chia Tsai, Kun Der Lin, Yi-Ting Lin, Meng-Chuan Huang, Hui-Min Hsieh, Yi-Hsin Yang, Li-Jen Cheng, Sheng-Tai Huang, Teng-Hui Huang, and Shu-An Yang, for providing their valuable opinions and devoting their efforts to our model development.

Conflicts of Interest:
The authors declare no conflict of interest.

Clinical Definition of Diabetic Complications
• Retinopathy
  • Diabetic retinopathy is a primary cause of blindness worldwide, and this serious complication of diabetes is already present at the time of clinical diagnosis of type 2 diabetes in some patients.
  • In the Wisconsin Epidemiologic Study of Diabetic Retinopathy, 3.6% of patients with type 1 diabetes and 1.6% of patients with type 2 diabetes were blind.
  • It is recommended that patients with type 2 diabetes have an initial comprehensive eye examination by an ophthalmologist or optometrist shortly after being diagnosed with diabetes.
• Neuropathy
  • Diabetic peripheral neuropathy is frequent: 50% of people with type 2 diabetes have neuropathy and are therefore at risk of developing diabetic foot ulcers.
  • Diabetic neuropathy is defined by the American Diabetes Association as "the presence of symptoms and/or signs of peripheral nerve dysfunction in people with diabetes after the exclusion of other causes." A foot ulcer is one of the major complications in patients with diabetes, with a 15% lifetime risk of amputation.
• Nephropathy
  • Diabetic nephropathy is the leading cause of renal failure in the United States.
  • The kidneys begin to leak, and albumin passes into the urine. This can be preceded by lower degrees of proteinuria or microalbuminuria and can proceed to renal failure in the worst case.
  • Identification of people at high risk of rapid decline in renal function is important, and evidence-based interventions have been shown to prevent or slow the development toward advanced stages of nephropathy.

• Heart Disease
  • Diabetes is a well-known risk factor for coronary heart disease. Diabetes confers about a 2-fold risk for a wide range of vascular diseases, independently of other conventional risk factors.
  • Much research has been conducted to develop predictive models or risk scores for at-risk individuals from the general population. One of the best models is the Framingham score, which has been widely accepted and includes diabetes as a predictor.
• Hypoglycaemia
  • People with type 1 diabetes often experience episodes of hypoglycaemia because they need to reduce the level of blood sugar by using insulin. Additionally, patients with type 2 diabetes may experience episodes of hypoglycaemia because of the increasing use of insulin in this group.
  • The fear induced by hypoglycaemia is pronounced, and the clinical results of this condition are serious. The literature suggests that the incidence of hypoglycaemia requiring emergency assistance reaches 7.1% per year among patients with diabetes and that as many as 6% of all deaths in patients with type 1 diabetes are due to hypoglycaemia.

• Insulin-Associated Weight Gain
  • In most patients with type 2 diabetes, it will eventually be necessary to begin insulin treatment to achieve the therapeutic goal of HbA1c < 7% (53 mmol/mol). The problem of weight gain induced by insulin has long been documented as an issue in diabetes treatment.
  • In the Diabetes Control and Complications Trial (DCCT), the average weight gain of patients with type 1 diabetes undergoing intensive treatment was 5.1 kg compared with 2.4 kg in the standard treatment arm, and similar results are seen for type 2 diabetes.

Appendix B. An Explanation of Markovian Approach

The Basic Computational Model
In this appendix, the Markov modelling approach is applied to the progression of diabetes with complications. The primary purpose here is to introduce the mathematical notation and formulae used in the paper. For a detailed treatment of these topics, see [27].
Consider a sequence of random variables $Y = \{Y_n, n \in \mathbb{N} \cup \{0\}\}$ defined on a probability space $(\Omega, \mathcal{F}, P)$ and taking values in a finite state set $\{s_0\} \cup E$, with $E = \{s_1, s_2, \cdots, s_m\}$ for $m < \infty$, where $\mathbb{N}$ is the set of all positive integers. In our case, a unit cycle is defined as one month and $E$ = {ASHD, CHF, ISC, FESRD, EYE, FINFOOT, ASHD + CHF, ASHD + ISC, CHF + ISC, ASHD + CHF + ISC, ESRD, DEAD}. The state symbols are defined in Table A1. According to Grossetti et al. [27] and Equations (1)-(4) in our paper, the risk equation from state $i$ at age $a$ to state $j$ during time $t$ is defined in Equation (5). For example, in the risk equation from ASHD to $j$, $j \in E$, with $t \in (t_0, t_1)$, $\sigma(\mathrm{ASHD}, j)$ denotes the scale and $c(\mathrm{ASHD}, j)$ the shape of the Weibull distribution from ASHD to $j$. Table A2 lists the parameters of the Weibull distributions, excerpted from the NHIRD, for all possible states of the T2DHoc model. In addition, $C_a(t, \mathrm{ASHD}, j)$ is computed by $C_a(t, \mathrm{ASHD}, j) = \exp(R_a(t, \mathrm{ASHD}, j))$, where $R_a(t, \mathrm{ASHD}, j) = \sum_i \beta_i(j) x_i + \beta_{\mathrm{HbA1c}}(j) \times x_{\mathrm{HbA1c}}(a, t) + \beta_{\mathrm{age}}(a, t, j)$. Note that $\sum_i \beta_i(j) x_i$ is calculated with $\beta_i$ and $x_i$, where $\beta_i$ is taken from the corresponding values in Table A1 and $x_i$ from the corresponding value in Table 2. Similarly, $\beta_{\mathrm{HbA1c}}(j) \times x_{\mathrm{HbA1c}}(a, t) + \beta_{\mathrm{age}}(a, t, j)$ is computed with the associated glycated haemoglobin (HbA1c) function, whose parameters are shown in Table A2.
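As a minimal sketch of the multiplier $C_a(t, i, j) = \exp(R_a(t, i, j))$ described above, the following Python fragment evaluates the linear predictor and exponentiates it. The covariate names and every numerical value are hypothetical placeholders chosen for illustration; they are not the coefficients of Table A1 or Table A2, and the age- and time-dependent HbA1c function is replaced here by a fixed input value.

```python
import math


def linear_predictor(betas, covariates, beta_hba1c, x_hba1c, beta_age):
    """R = sum_k beta_k * x_k + beta_HbA1c * x_HbA1c + beta_age."""
    if betas.keys() != covariates.keys():
        raise ValueError("coefficients and covariates must share the same keys")
    baseline = sum(betas[name] * covariates[name] for name in betas)
    return baseline + beta_hba1c * x_hba1c + beta_age


def risk_multiplier(r):
    """C = exp(R)."""
    return math.exp(r)


# Hypothetical coefficients and baseline covariates (illustration only).
betas = {"female": -0.20, "smoker": 0.30}
covariates = {"female": 1.0, "smoker": 0.0}

r = linear_predictor(betas, covariates, beta_hba1c=0.10, x_hba1c=7.5, beta_age=0.05)
c = risk_multiplier(r)
print(f"R = {r:.3f}, C = {c:.3f}")
```

In this toy setting the multiplier scales the baseline risk for one target state $j$; a full implementation would evaluate it once per pair of states and per cycle.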
Since the unit-step transition probability matrix changes with age $a$, the proposed model clearly generates a nonhomogeneous Markovian process. Given a starting age and an initial health state, the probability associated with any possible realization consisting of chronologically experienced states can be computed. Now, define the transition function $p^{t,s}_{i,j} = P(Y_s = j \mid Y_t = i)$ and the transition probability matrix $P^{t,s} = (p^{t,s}_{i,j})$. Namely, $p^{t,s}_{i,j}$ is the probability that a patient in health state $i$ at the age corresponding to cycle time $t$ will be in health state $j$ at the age corresponding to cycle time $s$. Specifically, when $s = t + 1$, $p^{t,t+1}_{i,j}$ is the one-step transition probability from state $i$ to state $j$ with starting time $t$ and ending time $t + 1$. That is, for $i \neq j$ we have $p^{t,t+1}_{i,j} = r_t(t + 1, i, j)$, and for $i = j$ we have $p^{t,t+1}_{i,i} = 1 - \sum_{k \neq i} p^{t,t+1}_{i,k}$. Let $P(k)$ denote the unit-step transition matrix with starting cycle time $k$ and ending cycle time $k + 1$. Then, the transition probability matrix $P^{t,s}$ can be computed by
$$P^{t,s} = P(t)\,P(t+1) \cdots P(s-1).$$
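As a minimal sketch of how unit-step matrices compose in a nonhomogeneous chain, the Python fragment below fills each diagonal entry so that the row sums to one and then multiplies the unit-step matrices in cycle order to obtain $P^{t,s}$. The three-state layout (no complication, complication, dead) and the monthly risks in `toy_one_step` are hypothetical stand-ins, not the T2DHoc states or parameters.

```python
def one_step_matrix(off_diag):
    """Build P(k) from off-diagonal risks; the diagonal is 1 minus the rest of the row."""
    n = len(off_diag)
    matrix = [list(row) for row in off_diag]
    for i in range(n):
        stay = 1.0 - sum(matrix[i][k] for k in range(n) if k != i)
        if stay < 0.0:
            raise ValueError(f"off-diagonal risks in row {i} exceed one")
        matrix[i][i] = stay
    return matrix


def mat_mul(a, b):
    """Plain matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_matrix(one_step, t, s):
    """P^{t,s} = P(t) P(t+1) ... P(s-1), where one_step(k) returns P(k)."""
    n = len(one_step(t))
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(t, s):
        result = mat_mul(result, one_step(k))
    return result


def toy_one_step(k):
    """Hypothetical monthly risks that drift upward with cycle k (a stand-in for ageing)."""
    to_complication = 0.010 + 0.0001 * k
    to_dead = 0.002 + 0.00005 * k
    return one_step_matrix([
        [0.0, to_complication, to_dead],
        [0.0, 0.0, 3.0 * to_dead],
        [0.0, 0.0, 0.0],  # the dead state is absorbing
    ])


p_0_12 = transition_matrix(toy_one_step, 0, 12)
for row in p_0_12:
    print([round(value, 4) for value in row])
```

Because the unit-step matrix depends on the cycle index, the order of multiplication matters; splitting the horizon at any intermediate cycle and multiplying the two pieces gives the same matrix.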

Numerical Experiments
Based on data reported in the NHIRD, we compute the 5-year and 10-year complication risks of diabetes according to a nonhomogeneous Markovian process. The probabilities are generated through the formulae given above, assuming that the DM patient starts at 55 years old on average. The possible outcomes by cycle (month) are given in the Supplementary Material. Summarised results are depicted in Figures A1 and A2.
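To illustrate how 5-year and 10-year risks can be read off a monthly-cycle model, the Python sketch below propagates a state distribution for a patient starting at age 55 through 120 cycles and reports the probability of having left the complication-free state by months 60 and 120. The three states and the age-dependent monthly risks in `monthly_matrix` are hypothetical and chosen only to make the example run; they do not reproduce the values behind Figures A1 and A2.

```python
def monthly_matrix(age_years):
    """Hypothetical one-month matrix for (no complication, complication, dead)."""
    to_complication = 0.004 + 0.0002 * (age_years - 55.0)
    to_dead = 0.001 + 0.0001 * (age_years - 55.0)
    return [
        [1.0 - to_complication - to_dead, to_complication, to_dead],
        [0.0, 1.0 - 2.0 * to_dead, 2.0 * to_dead],
        [0.0, 0.0, 1.0],  # death is absorbing
    ]


def occupancy_by_cycle(start_age, n_cycles, start_state=0):
    """Return the state distribution after 0, 1, ..., n_cycles monthly cycles."""
    dist = [0.0, 0.0, 0.0]
    dist[start_state] = 1.0
    history = [dist]
    for cycle in range(n_cycles):
        p = monthly_matrix(start_age + cycle / 12.0)
        dist = [sum(dist[i] * p[i][j] for i in range(3)) for j in range(3)]
        history.append(dist)
    return history


history = occupancy_by_cycle(start_age=55.0, n_cycles=120)

# The complication-free state cannot be re-entered, so 1 minus its occupancy
# is the cumulative probability of a first complication or death.
risk_5y = 1.0 - history[60][0]
risk_10y = 1.0 - history[120][0]
print(f"5-year risk: {risk_5y:.3f}, 10-year risk: {risk_10y:.3f}")
```

Keeping the whole per-cycle history, rather than only the final distribution, mirrors the month-by-month outcomes that a supplementary table would report.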