Validation of Machine Learning-Based Individualized Treatment for Depressive Disorder Using Target Trial Emulation

This study aims to develop and validate the use of machine learning-based prediction models to select individualized pharmacological treatment for patients with depressive disorder. This study used data from Taiwan’s National Health Insurance Research Database. Patients with incident depressive disorders were included in this study. The study outcome was treatment failure, which was defined as psychiatric hospitalization, self-harm hospitalization, emergency visits, or treatment change. Prediction models based on the Super Learner ensemble were trained separately for the initial and the next-step treatments if the previous treatments failed. An individualized treatment strategy was developed for selecting the drug with the lowest probability of treatment failure for each patient as the model-selected regimen. We emulated clinical trials to estimate the effectiveness of individualized treatments. The area under the curve of the prediction model using Super Learner was 0.627 and 0.751 for the initial treatment and the next-step treatment, respectively. Model-selected regimens were associated with reduced treatment failure rates, with a 0.84-fold (95% confidence interval (CI) 0.82–0.86) decrease for the initial treatment and a 0.82-fold (95% CI 0.80–0.83) decrease for the next-step. In emulation of clinical trials, the model-selected regimen was associated with a reduced treatment failure rate.


Introduction
Depressive disorder is a common psychiatric disorder that imposes a heavy social and economic burden [1,2]. Antidepressant therapy is the standard treatment for depressive disorders [3]. However, the effects of antidepressant treatment are limited, with a response rate of approximately 50% or less [4]. Approximately 29% to 46% of patients fail antidepressant treatment and develop treatment-resistant depression [5]. Various antidepressants have been approved for the treatment of depression. Meta-analyses have demonstrated varying therapeutic effects of different antidepressants [6]. However, this information was obtained based on collective data from clinical trials. Treatment responses and adverse effects may widely vary with patient's characteristics, including age, sex, underlying conditions, and biological factors [7].
Developing individualized treatment can improve treatment effectiveness. Several studies have attempted to identify the predictors and moderators of pharmacological treatment outcomes of depressive disorders, such as pharmacogenetic [8,9] and clinical [7,10] factors. However, no individual predictor could be used to select individualized treatments in clinical practice [11]. Simultaneous consideration of numerous factors is a potential way to develop individualized treatments [12]. Machine learning has the advantage of being able to process high-dimensional information and complex interactions between variables. Machine learning has been used to predict the severity of depression [13]. In addition, individualized treatments based on machine-learning algorithms have been developed for the treatment of schizophrenia [14] and diabetes mellitus [15].

Aims of the Study
This study aims to develop prediction models for individualized pharmacological treatment of depressive disorders, including the initial and next-step treatments, if previous treatments failed. In addition, a clinical trial was conducted to estimate the effectiveness of individualized pharmacological treatment using a population-based database in Taiwan.

Data Source
Taiwan's National Health Insurance (NHI) program was launched in 1995. In 2009, 99.8% of the Taiwanese population was enrolled in the NHI program. The claims database derived from the NHI program, the National Health Insurance Research Database (NHIRD), includes information about the beneficiaries' demographic characteristics, medical contacts, ICD-9-CM diagnoses, and prescription records. The accuracy of clinical diagnosis for major depressive disorders and other psychiatric illnesses has been validated [16]. This study was approved by the Research Ethics Committee of the National Taiwan University Hospital.

Study Sample
Initially, we identified 1,853,382 patients with diagnoses of depressive disorders (ICD-9-CM codes 296.2, 296.3, 300.4, and 311) from the NHIRD between 2001 and 2013. Individuals aged below 20 or above 75 years and those diagnosed with schizophrenia, bipolar disorder, or dementia were excluded. A total of 745,356 patients were included in our analysis.
The unit of analysis in this study was a treatment episode defined as the start of administration of the first antidepressant agent for incident patients, or a change in treatment regimen when previous treatment was ineffective or intolerable. The index date of a treatment episode was the date of starting or changing the treatment regimen. Treatment episodes were divided into two scenarios: the initial treatment episodes for the incident patients and the next-step treatment episodes if the previous treatments failed, including the second treatment and all subsequent treatments. There were 715,246 first treatment episodes and 739,980 next-step treatment episodes. A flowchart for selection of the treatment episodes is shown in the Supplementary Methods and Supplementary Figure S1.

Data Training and Testing
We randomly divided the patients into two sets: 80% in training set and 20% in test set. Data management, the application of the machine learning algorithms of the prediction models, and statistical analyses were performed using SAS 9.4 software (SAS Institute, Inc., Cary, NC, USA) and R 3.5.3 software [17].

Study Outcome
The treatment was considered a failure in the following four conditions: (1) psychiatric hospitalizations; (2) self-harm hospitalizations (defined using ICD-9-CM codes E950-E959); (3) emergency department visits due to psychiatric problems (defined using ICD-9-CM codes 290-319); or (4) treatment regimen changes including switching to or adding another antidepressant, or augmenting with second-generation antipsychotics or mood stabilizers, which indicated that the current treatment regimen resulted in either inadequate treatment responses or intolerable adverse reactions. Treatment discontinuation or dose titration was not considered as treatment failure because dose titration is necessary for some antidepressants, and one of the common reasons for discontinuation is symptom improvement [18,19]. Each treatment episode duration was a one-year observation period or until treatment failure occurred, whichever occurred first.

Predictors
Potential predictors of antidepressant treatment responses included the patients' demographic variables, clinical characteristics of depression, and comorbid medical and psychiatric conditions. In Taiwan, patients can directly visit medical specialists without primary care physicians' referrals. Thus, the number of outpatient visits to the specialists was used to identify each patient's health-seeking behavior and underlying physical conditions. Patients' various medication history was also extracted. Overall, there were 125 variables regarding patient characteristics for the initial treatment and 153 variables for the next-step treatment episodes. The additional 28 variables of next-step treatment included the characteristics of the previous regimen, the interval between the index and the last treatment episode, and the number of failed treatments. The definitions and distributions of these variables are presented in Supplementary Table S1.

Model Development
The treatment episodes in the training set were divided into several subsets based on the treatments. For example, patients treated with fluoxetine as the initial treatment were used to develop fluoxetine-initiating prediction models. There were 16 antidepressantspecific prediction models for the initial treatment and 41 models for the next-step treatment (16 strategies for antidepressant switching, 16 strategies for antidepressant combinations, and 9 strategies for augmentations). We did not include all regimens as a variable in a single model because the interaction variables between regimen and predictors are too numerous to develop a model efficiently. Therefore, we developed treatment-specific prediction models for each regimen. The contribution of each predictor varied across the different treatment-specific prediction models.
We conducted a filter-based feature selection (Pearson's correlation coefficient p) to shorten training times and enhance generalization by reducing overfitting. We then used Super Learner to develop treatment-specific prediction models [20]. Super Learner is a supervised learning method that uses a stacking process to determine the optimal weighted combination of a collection of base machine learning algorithms [20]. Super Learner can include many diverse prediction models and perform equally to or better than the bestperforming base algorithm. In this study, we used 19 base algorithms and combined them using the Lawson-Hanson algorithm to generate a Super Learner (Table 1). This model was analyzed using the SuperLearner package (version 2.0-26) in R [21]. In addition, we established different prediction models using logistic regression, random forest, and support vector machine to compare with the models using the Super Learner ensemble.

Statistical Analysis for Evaluating the Prediction Models Prediction Performance
The outcomes predicted by combining all treatment-specific models were compared with the observed outcome in the test set. The performance of combining prediction models was evaluated using the area under the curve (AUC) of receiver operator characteristic analyses and the corresponding 95% confidence intervals (CIs) were estimated using the Delong method [22].

Evaluating the Effectiveness of the Machine-Selected Agents in the Test Set
The developed treatment-specific prediction models estimated the treatment failure probability across all potential antidepressant regimens for each treatment episode in the test dataset. The antidepressant with the lowest probability was defined as the model-selected agent. For the next-step treatment, there were three strategies: switching, combination, and augmentation. The three best regimens were identified, with each regimen having the lowest failure rate for each strategy. A randomized trial is the preferred design to evaluate the effect of the model-selected treatment. However, this is currently unavailable. Thus, the target trial was emulated [23,24] to estimate the effectiveness of the model-selected treatment. One target trial with four arms was considered, using two eligibility scenarios: (1) patients administered antidepressant treatment for depressive disorders for the first time or (2) patients just starting the next-step treatment if the previous ones failed. The four arms of trials for each scenario included the treatment group (treatment selected by model recommendation) and the three control groups (treatment as usual, treatment randomly selected based on the prescription proportions in the observed dataset, and treatment randomly selected based on the prescription proportion of the model recommendation). Four study samples were replicated to emulate the clinical trial. If the individual's treatment was the same as the assigned treatment, the subject then would be enrolled into the arms (see Figure 1). All patients were followed up from their baseline visits and assessed every month until the study outcome occurred or the one-year follow-up period ended. A summary of the target trial protocol is listed in Supplementary Table S2.
We used intention-to-treat analysis and as-treated analysis to assess the effects of the model-selected regimen for the initial treatment and the next-step treatment. The details are described in the Supplementary Methods.

Results
A total of 715,246 patients with incident depressive disorders were included in the analysis. Among them, 283,927 patients received a next-step treatment, which led to 739,980 treatment episodes. The definition and distribution of baseline characteristics of the treatment episodes in the training and test datasets are shown in Supplementary Table S1.
The distribution and treatment failure rates of each prescribed antidepressant regimen in the training dataset are shown in Table 2. Overall, treatment failure rates varied markedly across antidepressant regimens. Regarding the initial antidepressant treatments, the overall treatment failure rate was 26.6%, ranging from 19.9% for imipramine to 30.9% for fluvoxamine. Regarding the next-step treatment episodes, the overall treatment failure rate was 54.3%, ranging from 48.5% for switching to escitalopram and 71.0% for switching to trazodone.

Prediction Performance
The overall performance (quantified by AUC) of the combined initial treatment prediction model for treatment failure was 0.627 (95% CI: 0.623, 0.630), while the AUC of the overall next-step treatment prediction model was 0.751 (95% CI: 0.747, 0.756). The performance of the prediction models using the Super Learner ensemble was better than those using logistic regression, random forest, and support vector machine (Supplementary Table S3).  Table S4 shows the difference in the distributions of the observed and recommended treatments using the prediction model. The baseline characteristics of each treatment group for the initial treatment episode and the next-step treatment episodes in the emulating target trials are shown in Supplementary Tables S5 and S6, respectively.
Three pairwise comparisons of the treatment failure rates between the patients treated with the model-selected regimen and each of the three control groups are shown in Table 3. The model-selected regimens had the lowest failure rates across all treatment arms. In the ITT analyses of initial treatment episodes, the hazard ratios of the model-selected regimens showed a decrease in treatment failure by 0.84-fold (95% CI: 0.82, 0.86), 0.86-fold (95% CI: 0.82, 0.89), and 0.92-fold (95% CI: 0.89, 0.96) compared to treatment as usual, treatment randomly selected by the prescription proportion, and treatment randomly selected by the recommended proportion, respectively. Regarding the next-step treatment episodes, the hazard ratios of the model-selected treatment exhibited a decrease in treatment failure by 0.82-fold (95% CI: 0.80, 0.83), 0.85-fold (95% CI: 0.83, 0.88), and 0.87-fold (95% CI: 0.84, 0.90) compared with treatment as usual, treatment randomly selected by the prescription proportion, and treatment randomly selected by the recommended proportion, respectively. The results were similar to those of the as-treated analyses. Table 3. Treatment failure rates of the groups with treatment selected using model recommendation compared with the control groups.

Intention-to-Treat Analysis
As-Treated Analysis

Principal Results
This study used a machine learning approach to develop individualized pharmacological treatments for depression and estimated their effect by emulating clinical trials based on a large-scale database. The accuracy of the overall prediction models for treatment failure was 0.627 for the initial treatment episodes and 0.751 for the next-step treatment episodes. In addition, the model-selected regimens showed an 8-18% reduction in treatment failure rates compared with treatment as usual or when given randomly selected treatment in proportion to the actual or recommended treatments.
The model-selected treatment was associated with reduced treatment failure rates, with a moderate 8-18% decrease in treatment failure rates. However, individualized treatment strategies can generate large benefits across the population, as depression is a common mental illness. Furthermore, these prediction models were constructed based on the patients' demographic variables, medical and psychiatric comorbid conditions, concomitant use of medications, and history of antidepressant treatment, which were obtained from the claims database. Thus, these prediction models can be easily implemented in Taiwan's clinical electronic information system. Moreover, the prediction models can generate a list of antidepressant regimens according to their probabilities of treatment failure. Physicians then can prescribe agents with relatively low failure probabilities, if the first or previous agent was ineffective for the patient.
The distributions of the model-selected regimens were different from the observed prescription patterns. Although the model-selected treatment had a lower predicted treatment failure rate, this does not imply that the commonly recommended drugs are better than other drugs. The model recommendation was based on individual patients' personal characteristics. Thus, the best treatment choice varies across individual patients. Furthermore, the distributions might differ when different patient populations are considered.

Limitations
The findings of this study should be interpreted with caution. First, common measures of treatment response in patients, such as symptom severity, functional level, or quality of life, were not directly assessed. Treatment changes, emergency visits, and hospitalizations were assumed either as poor treatment effectiveness or intolerance to adverse effects. However, these indicators may be insensitive and can only identify the worst conditions. Second, treatment episodes were used as the basic units of analysis. Therefore, poor responders with multiple treatment episodes were over-represented in the analysis for the next-step treatment. However, there is limited evidence for the best strategies for managing treatment-resistant depression. The prediction models of this study might provide a novel approach for improving treatment outcomes in this patient population. Finally, the claims data did not include several important variables, such as the reason for the treatment changes, socioeconomic factors, patient preferences, medication adherence, pharmacogenomics, or other biological markers. In addition, we used filter-based feature selection by Pearson's correlation, which might be inadequate because some variables were not normally distributed. Therefore, unmeasured factors or inadequate variable selection can decrease the accuracy of prediction models. On the other hand, unmeasured factors would confound the estimation of the effects of the model-selected treatment. Our study only demonstrated associations, rather than causal effects. Thus, prospective clinical trials are warranted to validate the effectiveness of prediction models for individualized treatment.

Implications
The prediction models might be further optimized as clinical records accumulate and information on other relevant factors becomes available for model development. After a new antidepressant enters the market, data for its treatment responses could be rapidly collected in a population-based claims database, and its effectiveness can be assessed using the individualized treatment and prediction model strategy. Moreover, as largescale genomic data and brain imaging information become more widely available, a machine-learning algorithm could integrate these data to improve the prediction accuracy of treatment response.

Conclusions
The prediction models presented in this study used a machine learning approach based on data of patient characteristics and history of antidepressant treatments. The overall accuracy of the prediction model was moderate. However, the model-selected treatments demonstrated reduced treatment failure rates compared to the usual treatments. Future prospective clinical trials investigating the effectiveness of individualized pharmacological treatment strategies are warranted to validate our findings.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/jpm11121316/s1, Figure S1: Flowchart of study cohort formation, Table S1: Baseline characteristics for the initial and next-step treatment episodes, Table S2: Study design for target trials, Table S3: Comparisons of the performance of the prediction models using Super Learner ensemble, logistic regression, random forest, and support vector machine, Table S4: Distribution of observed and recommended treatment among test datasets, Table S5: The baseline characteristics of these four groups for the initial treatment, Table S6: The baseline characteristics of these four groups for the next-step treatment.  Informed Consent Statement: Patient consent was waived as all participants were de-identified.

Data Availability Statement:
Restrictions apply to the availability of some or all data generated or analyzed during this study to preserve patient confidentiality or because they were used under license. The corresponding author will request details of the restrictions and any conditions under which access to some data may be provided.

Conflicts of Interest:
The authors declare no conflict of interest.