Predicting Physician Consultations for Low Back Pain Using Claims Data and Population-Based Cohort Data—An Interpretable Machine Learning Approach

(1) Background: Predicting chronic low back pain (LBP) is of clinical and economic interest, as LBP leads to disability and health service utilization. This study aims to build a competitive and interpretable prediction model; (2) Methods: We used clinical and claims data of 3837 participants of a population-based cohort study to predict future LBP consultations (ICD-10: M40.XX-M54.XX). Best subset selection (BSS) was applied in repeated random samples of training data (75% of the data); scoring rules were used to identify the best subset of predictors. The prediction accuracy of BSS was compared to randomforest and support vector machines (SVM) in the validation data (25% of the data); (3) Results: The best subset comprised 16 out of 32 predictors. Previous occurrence of LBP increased the odds for future LBP consultations (odds ratio (OR): 6.91 [5.05; 9.45]), while concomitant diseases reduced the odds (1 vs. 0, OR: 0.74 [0.57; 0.98]; >1 vs. 0, OR: 0.37 [0.21; 0.67]). The area-under-curve (AUC) of BSS was acceptable (0.78 [0.74; 0.82]) and comparable with SVM (0.78 [0.74; 0.82]) and randomforest (0.79 [0.75; 0.83]); (4) Conclusions: Regarding prediction accuracy, BSS proved competitive with established machine-learning approaches. Nonetheless, considerable misclassification is inherent and further refinements are required to improve predictions.


Introduction
Non-specific low back pain (LBP) is one of the leading health problems worldwide and is associated with substantial individual burden, health service utilization, and indirect costs [1][2][3]. LBP has been portrayed as having a good prognosis, with a high proportion of patients recovering spontaneously within a few weeks with no or minimal interventions. Although this is true for most patients, a substantial proportion will develop chronic or recurrent back pain [4][5][6]. This has implications for the provision of health services.
One rationale is to focus service provision on subjects with subsequent high service utilization, reflecting relevant impairment and suffering. For health care providers and health insurances, the identification of such patients is, therefore, of great importance, since this is a prerequisite for the rational allocation of limited specialized services and for offering targeted interventions to those who are most affected. This would, for example, be of relevance to optimize disease management programs for chronic low back pain, which are mandated by the statutory health insurance in Germany.
Machine-learning approaches such as randomforest [7] have been shown to obtain high accuracy for the prediction of events and have frequently outperformed established models [8]. Nevertheless, the deficiencies of machine-learning approaches, including randomforest, comprise the following: (a) black-box-like behavior, (b) lacking interpretability of results, and (c) results that cannot be reused in new data sources, as the model is defined only implicitly [9,10].
In best subset selection (BSS) approaches, all possible combinations of candidate variables are explored in a regression model, i.e., 2^P − 1 possible subsets of P candidate variables must be examined. This approach was proposed decades ago and provides accurate and interpretable prediction models, as candidate variables that do not affect the prediction accuracy of the model are removed [11]. BSS is also known as L0-penalization and represents one among many penalized approaches, such as the lasso [12]. Due to its computational costs, BSS has rarely been applied in health research studies, although the availability of high-performance computing clusters and parallel computing now makes BSS more feasible.
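To make the exhaustive search concrete, the following minimal Python sketch enumerates all 2^P − 1 non-empty subsets. (The study itself used parallelized R code; the function names and the caller-supplied `fit_score` scorer, e.g. an AIC from a fitted regression, are illustrative assumptions.)

```python
from itertools import combinations

def best_subset(predictors, fit_score):
    """Exhaustively score every non-empty subset of predictors.

    fit_score(subset) is expected to fit a model on the given subset
    and return a criterion where lower is better (e.g. an AIC or a
    strictly proper score computed on held-out data).
    """
    best_score, best_combo = float("inf"), None
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):  # 2^P - 1 candidates in total
            score = fit_score(subset)
            if score < best_score:
                best_score, best_combo = score, subset
    return best_score, best_combo
```

With P = 32 candidate variables this already means roughly 4.3 × 10^9 model fits, which is why reducing the candidate set beforehand and using a high-performance cluster is essential.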
In this study, we follow the pipeline of machine-learning approaches (model training, model calibration, and model validation) to obtain the best subset of predictors from claims data and those of a population-based cohort to predict future consultations for low back pain. For benchmarking, BSS results will be compared with parameter-tuned randomforest and support vector machines.

Study of Health in Pomerania
The Study of Health in Pomerania (SHIP) [13] is a population-based project that comprises two distinct cohorts, SHIP-START and SHIP-TREND. This study is restricted to the baseline examinations conducted for the more recently started SHIP-TREND cohort (SHIP-TREND-0). Participants were sampled in Mecklenburg-Vorpommern, a region in northeast Germany, and followed up every 4-6 years. After the exclusion of deceased and relocated individuals, the net sample comprised 8826 individuals. In total, 4420 (2275 women) followed the invitation (response 50.1%) and were examined during the baseline visits between 2008 and 2012.
The SHIP-TREND-0 examination program comprised medical and dental examinations, laboratory measurements, an interview, and self-reported questionnaires, among others. In the latter, several standardized instruments were presented to participants, comprising the PHQ-9 [14], a diagnostic instrument for depressive symptoms, and a von Korff short version [15,16] to grade the severity of LBP.

Claims Data
In Germany, statutory health insurances host data of approximately 72.8 million insured individuals, i.e., 87.7% of the whole population [17]. For SHIP participants, the respective data were available between 2002 and 2018. We considered ICD-10 codes specific to diseases and symptoms of the lumbar spine (dorsopathies, M40.XX to M54.XX), excluding ICD-10 codes of thoracic and cervical diagnoses (e.g., M54.01 to M54.04). Excluded ICD-10 codes are listed in the supplement under the paragraph "List of ICD-10 codes" and in supplement Table S1. In addition, physicians' fee schedules were considered and grouped into those claimed for back-pain-associated treatments by general practitioners, orthopedists, neurologists, or radiologists, or for respective medical interventions. The list of included physicians' fee schedules is provided in the supplement (Table S2).

Record Linkage
Data from the statutory health insurance and from SHIP have no common identifier for record linkage. Therefore, we followed the approach of Vatsalan et al. [18] and linked the data sources based on surname, first name, date of birth, and sex of participants who had consented to linkage. Names were normalized to upper-case letters, and the indexing of candidate pairs was blocked via birth date. All exact agreements were considered matches; possible matches, i.e., agreements with a Levenshtein distance > 0, were reviewed manually.
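A minimal sketch of this linkage step in Python (the actual linkage was performed in SAS; the record layout, the review threshold of an edit distance ≤ 2, and all names below are our illustrative assumptions):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def link(records_a, records_b):
    """Block candidate pairs on birth date; exact name agreement is a
    match, a small edit distance goes to manual review."""
    matches, review = [], []
    by_dob = {}
    for r in records_b:
        by_dob.setdefault(r["dob"], []).append(r)
    for ra in records_a:
        for rb in by_dob.get(ra["dob"], []):     # blocking via birth date
            if ra["sex"] != rb["sex"]:
                continue
            d = levenshtein(ra["name"].upper(), rb["name"].upper())
            if d == 0:
                matches.append((ra, rb))
            elif d <= 2:  # assumed threshold for manual review
                review.append((ra, rb))
    return matches, review
```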

Methods
Data preparations and record linkage were conducted using SAS version 9.4 (SAS Institute Inc., Cary, NC, USA). R version 4.0.3 has been used for all analyses [19], including zero-inflated regression models using the R package pscl [20,21]. For the best subset selection, the R code was parallelized [22] and computed at the high-performance-computing center of the University of Greifswald [23].

Study Design
For each SHIP-TREND-0 participant, a longitudinal analysis period covering three years of observation was considered (Figure 1). The year antedating the SHIP examination served as the baseline period, which also covers physicians' fee schedules and ICD-10 codes from four quarters (Q−3 to Q0). This reflects the billing periods in ambulatory care. The year before the baseline period (Q−7 to Q−4) was used to identify any previous occurrences of ICD-10 codes related to unspecific low back pain. Claims data of quarters Q1 to Q4 represent follow-up data after participation in SHIP (Figure 1).


Primary Outcome
The primary outcome was the sum of ICD-10 codes specific to diseases and symptoms in the lumbar spine during the year after baseline (Q1 to Q4, Figure 1). Assigned ICD-10 codes are a valid surrogate measure for physician consultations, as they can only be assigned following a consultation. This outcome represents count data, which were examined for zero-inflation using a score test [24,25] (Supplement Figure S1).
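The intuition behind such a check can be illustrated with a rough comparison of the observed zero fraction against the zeros a plain Poisson model would predict (a simplified stand-in for the formal score test cited above; the function name is ours):

```python
import math

def zero_fractions(counts):
    """Observed fraction of zeros vs. the fraction expected under a
    Poisson distribution with the sample mean as its rate.
    A clearly higher observed fraction hints at zero-inflation."""
    n = len(counts)
    lam = sum(counts) / n                       # sample mean as Poisson rate
    observed = sum(c == 0 for c in counts) / n  # empirical zero fraction
    expected = math.exp(-lam)                   # P(Y = 0) under Poisson(lam)
    return observed, expected
```

For consultation counts such as these, where most participants have no LBP-related code at all, the observed zero fraction far exceeds the Poisson expectation, motivating zero-inflated or hurdle models.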

Candidate Variables
The following 12 candidate variables were considered for best subset selection from claims data: categorized age (<40; 40-69; >69), history of LBP (yes/no prior to baseline), imaging procedures related to back pain in the baseline period, medical interventions related to back pain in the baseline period, sum of back-pain-related ICD-10 codes (in the quarter of the SHIP examination only), sum of back-pain-related ICD-10 codes (in at least two quarters of the baseline period), visited physician specialties in the baseline period (none, general practitioner only, specialist only, general practitioner and specialist), use of opioids, use of benzodiazepines, occurrence of ICD-10 code M54.XX only (unspecific back pain), the Charlson comorbidity index [26], and the interaction between categorized age and unspecific M54.XX only (XX = any code after the decimal point is selected, except for those excluded and listed in the supplement).
All 32 candidate variables from SHIP are listed in the supplement (Table S4, Figure S4). This list was created based on the expert opinion of the investigators, and an informal literature review. Non-linear associations between candidate variables and the primary outcome were examined using generalized additive models (Supplement Figure S2). In addition, three interaction effects were assessed: depression x sex, depression x PHQ-9, and disc prolapse x radiating back pain. To reduce overall computational costs, we applied an exploratory variable selection using a boosting approach with stability selection [27] (please see supplement paragraph: "Exploratory variable selection"). This method has previously been applied to high-dimensional and collinear data [28]. In the case of strong non-linear associations (effective degrees of freedom > 1 and p < 0.05), the respective candidate variable was modelled as having a linear as well as a non-linear effect. With respect to the candidate variable age, the exploratory analyses favored a discretized form of age over a smoothed variant (B-spline) and a linear form. Therefore, discretized age was also used in the claims data. Overall, 19 candidate variables and one interaction effect were examined in the best-subset selection approach using the SHIP data.
The best subsets of claims data and SHIP data were joined afterwards and assessed in another best subset selection approach in the joined data.

Model Training
The whole model training process is illustrated and annotated in the supplement (Figure S4). In brief, we first split the SHIP-TREND-0 data set into training data and validation data at a ratio of 3:1 (training data: n = 2869, validation data: n = 968). From the training data, b = 500 repeated bootstrap samples were drawn to define inner training data (observations selected by the bootstrap) and inner validation data (observations not selected by the bootstrap). Then, hurdle models comprising a zero-part (logistic regression) and a count-part (either a Poisson or a negative-binomial distribution) were trained on the inner training data. The resulting model was used to predict the outcomes on the inner validation data. The model training process was iterated over two entities: (i) all possible combinations of covariates (pi) within bootstrap sample bj, and (ii) the b = 500 bootstrap samples of the training data. This approach is similar to the one proposed by Filzmoser et al. for parameter tuning using repeated double cross-validation [29]. All these steps were conducted separately for the claims data and the SHIP data; after the selection of the best subsets in each, the data were joined (claims + SHIP).
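The inner resampling step can be sketched as follows (illustrative Python; the study used R, and the function name is ours):

```python
import random

def bootstrap_splits(n, b, seed=1):
    """For b bootstrap draws of size n, yield (inner_train, inner_val):
    indices drawn by the bootstrap form the inner training data and the
    out-of-bag indices form the inner validation data."""
    rng = random.Random(seed)
    for _ in range(b):
        inner_train = [rng.randrange(n) for _ in range(n)]    # with replacement
        inner_val = sorted(set(range(n)) - set(inner_train))  # out-of-bag
        yield inner_train, inner_val
```

On average, roughly 36.8% of the observations end up out-of-bag in each draw, so every bootstrap iteration comes with its own inner validation set.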

Model Evaluation
In each iterative step (pi, bj), information criteria (AIC, [30]) of the Poisson and negative-binomial models were used to select the best distributional form. In addition, strictly proper scoring rules were applied to assess the agreement between the predicted and observed outcomes in the inner validation data [31]. The best model, i.e., the best subset of candidate variables, was considered to be the one with, on average, the best prediction accuracy over the 500 bootstrap samples. The selected model was then used to predict the outcomes in the outer validation data.
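Two strictly proper scoring rules of this kind are, for instance, the Brier score for the binary zero-part and the Dawid-Sebastiani score for a count prediction summarized by its mean and standard deviation (minimal sketches; the exact rules and implementations used in the study are described in its supplement):

```python
import math

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and
    binary outcomes; lower is better, and the rule is strictly proper."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def dawid_sebastiani_score(mu, sigma, y):
    """Scores a predictive distribution via its first two moments;
    lower is better, rewarding sharp and well-centred predictions."""
    return ((y - mu) / sigma) ** 2 + 2.0 * math.log(sigma)
```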
The goodness of fit of the optimal model was examined using rootograms, which illustrate overfitting and underfitting of count data regression models [32]. Furthermore, the estimated probability of having an ICD-10 code (yes or no, corresponding to the zero-part of the hurdle model) was used to predict and classify whether study participants would have an ICD-10 code after participation in SHIP. The area-under-curve (AUC) and receiver-operating characteristics (ROC) were generated for the best subsets of all three datasets [33] and for the sensitivity analyses.

Table 1 presents the proportion of missing values in all covariates originating from the SHIP study. First, we separately applied multiple imputation by chained equations (m = 20) using the R package mice [34] for the training data and the validation data. Second, the increase in variance in the training data (Supplement Table S6) was evaluated for each candidate variable with missing values. The highest increases in variance were found for household income (4.58%), the PHQ-9 (3.74%), and the SF-12 PCS sum score (1.24%). For all other candidate variables, the increase was <1%. Recommendations for the application of multiple imputation in the case of <5% missing data are contradictory [35,36]. Therefore, for model training, we used the same training data in best subset selection and randomforest, restricted to m = 1 imputation. However, to evaluate the best subset model in the validation data, we used m = 20 imputations.

Comparative Analyses
To compare the predictive accuracy of the best subset model with established machine-learning algorithms, we applied randomforest (RF) [7] and support vector machines (SVM) [37] using the combined data (claims and SHIP). As in most machine-learning approaches [38], randomforest and SVM have several tuning parameters to increase the predictive accuracy. For randomforest, we used the R package tuneRanger to define the optimal parameter setting [39]. For SVMs, we chose a radial basis function kernel due to non-linear associations and applied a two-step grid search for the best parameter setting [40]. The R package e1071 was used to apply SVMs [41]. More details regarding the parameter tuning are provided in the supplement under the paragraph "Comparative analyses".
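The two-step grid search over the SVM hyperparameters C and gamma can be sketched generically (illustrative Python; the study used the R package e1071, and the grids, refinement factors, and names below are our assumptions):

```python
from itertools import product

def two_step_grid_search(evaluate, coarse):
    """Step 1: scan a coarse grid of (C, gamma) pairs.
    Step 2: rescan a finer grid around the coarse optimum.
    evaluate(C, gamma) returns a validation score to maximise,
    e.g. a cross-validated AUC of the SVM."""
    c0, g0 = max(product(coarse, coarse), key=lambda cg: evaluate(*cg))
    fine_c = [c0 * f for f in (0.25, 0.5, 1, 2, 4)]   # refine around coarse optimum
    fine_g = [g0 * f for f in (0.25, 0.5, 1, 2, 4)]
    return max(product(fine_c, fine_g), key=lambda cg: evaluate(*cg))
```

Injecting the evaluation function keeps the search logic independent of the underlying SVM implementation.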

Subgroup Analyses
Subgroup analyses were applied to examine sources of misclassification in the predictive models. Therefore, estimated probabilities for a future episode of LBP were stratified in multiway tables by the applied method, the training and validation data, back pain severity, and the pattern of care-seeking.

Cohort Characteristics
The descriptive characteristics of the 4420 participants of the SHIP-TREND-0 cohort at baseline are summarized in Table 1. Of these, 341 (7.7%) stated that they were insured by a private or other health insurance organization and were excluded from the analyses. Due to a temporarily reduced examination program, the insurance type was not recorded for 57 out of 4420 (1.3%) participants; however, 51 of these consented to record linkage and were subsumed under the respective category. Out of the remaining 4079 participants, 3837 (94.1%) consented to record linkage of claims data and were included in the analyses. Participant characteristics differed slightly between those consenting vs. those not consenting to record linkage. For example, the mean age of consenting participants was 1.7 years higher compared to those not giving consent (Table 1). Larger differences, particularly regarding household income, were found for participants insured by private/other health insurances (Table 1).
In 30% of those consenting to record linkage, at least one ICD-10 code suggestive of LBP occurred within the baseline period. The distribution of the number of LBP-related ICD-10 codes was zero-inflated in both the baseline period and the follow-up period (Supplementary Figure S1).

Best Subset Models
The best subset of candidate variables in the joined data comprised eight predictors for the zero-part and eight partly different predictors for the count-part of the hurdle model (Table 2). According to the AIC, a Poisson distribution of the primary outcome was preferred over a negative-binomial distribution in all models and iterations of the model training. Overall, the fit of a hurdle model using the best subset of predictors with a Poisson distribution for the count-part was high in the training and validation data (please see the rootograms [32] in Supplement Figure S13).

Table 2. Results of best subset selection for the claims data, the SHIP data, and the joined data to predict the number of LBP-related ICD-10 codes during follow-up.

(Table 2 itself is not reproduced here; its columns report results for the claims data, the SHIP data, and the joined data.) Table notes: The table consists of two parts: the zero-part of the count data model, in which exponentiated coefficients correspond to odds ratios, and the count-part, in which exponentiated coefficients can be interpreted as incidence rate ratios. 1 SHIP interview item: disc prolapse ever before SHIP; 2 SHIP interview item: physician visits within the last 4 weeks; 3 No. of ICD-10 codes (2Q: codes must have been reported in two quarters of the 12-month baseline period; acute: any of the codes appeared within the quarter of the SHIP examination); 4 Claims data: any ICD-10 code suggestive of LBP prior to baseline (yes vs. no); 5 Claims data: based on physicians' fee schedules. # = the quantity of respective ICD-10 codes.
In the zero-part of the hurdle model using the claims and SHIP data, participants between 40 and 69 years of age had higher odds of future consultations compared to the reference group of individuals under 40 years of age, while participants older than 69 years had considerably lower odds (Table 2). The odds of consultations increased with a higher household income, a history of disc prolapse, and female sex. In contrast, the presence of chronic diseases (cardiovascular disease, cancer, diabetes) lowered the odds of future consultations. The strongest associations were observed for the number of ICD-10 codes in the baseline period and a history of back pain (prior to baseline) (Table 2). If ICD-10 codes suggestive of LBP were found in at least two quarters of the baseline period (2Q), the odds of a future event of LBP increased (odds ratio (OR): 2.43).

For the count-part of the hurdle model, different candidate variables were selected. The count of back-pain-related ICD-10 codes increased with the number of ICD-10 codes (2Q) documented during the baseline period. In addition, higher back pain intensity (numerical rating scale, NRS), radiating back pain, and rare physical activity (1-2 h/week vs. none and >2 h/week) increased the count. The use of benzodiazepines and being single (compared to married/widowed/separated) lowered the counts (Table 2). Participants' sex had no effect on the count.
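To make the interplay of the two model parts concrete, a prediction from fitted coefficients can be sketched as follows (illustrative Python with hypothetical coefficient vectors; the study fitted these models in R with the pscl package):

```python
import math

def predict_hurdle(x, beta_zero, beta_count):
    """Combine the two parts of a hurdle model for covariate vector x:
    zero-part:  logistic regression    -> P(at least one LBP code),
    count-part: zero-truncated Poisson -> expected codes given > 0."""
    eta = sum(b * v for b, v in zip(beta_zero, x))
    p_any = 1.0 / (1.0 + math.exp(-eta))          # P(Y > 0)
    lam = math.exp(sum(b * v for b, v in zip(beta_count, x)))
    mean_if_any = lam / (1.0 - math.exp(-lam))    # zero-truncated Poisson mean
    return p_any, p_any * mean_if_any             # P(Y > 0), E[Y]
```

Exponentiated zero-part coefficients correspond to the odds ratios and exponentiated count-part coefficients to the incidence rate ratios reported in Table 2.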
The predictive value of most candidate variables from the SHIP data was lower compared to that of candidate variables from the claims data (see the Brier scores and Dawid-Sebastiani scores in the supplement). For example, self-reported physician visits were a strong predictor within the best subset of the SHIP data; in the joined data, however, this predictor was not selected. In contrast, competing diseases, which were defined based on SHIP data, remained in the best subset of the joined data, as did the participant-reported outcomes of back pain severity (NRS), radiating back pain, and household income.

Predictive Accuracy and Comparative Analyses
The best subset model using only SHIP data performed worst in predicting future LBP-related consultations, with an area-under-curve (AUC) of 0.68 (95% CI [0.64; 0.72]) (Figure 2). This is considered non-acceptable discrimination [42]. The accuracy of the best subset model using claims data (AUC: 0.76 [0.72; 0.80]) was higher and was only marginally improved upon by the best subset of the joined data (AUC: 0.78 [0.74; 0.82]). The presence of missing data had no influence on the predictive accuracy of the best subset model: the receiver-operating characteristics (ROCs) for the different imputations of the validation data were almost identical (Supplement Figure S14).

For the comparative analyses using randomforest, the optimal parameter setting was estimated using tuneRanger [39], which resulted in mtry = 9 (number of variables used for splitting the trees), ntree = 10,000 (number of trees), nodesize = 60 (minimum terminal node size), and sampsize = 1674 (sample size of the learning data). The tuned randomforest model achieved an AUC of 0.79 (95% CI [0.75; 0.83]). According to a variable importance measure of randomforest, the number of ICD-10 codes documented in the baseline period, age, and a history of back pain (prior to baseline) had the strongest importance for the prediction of future episodes of LBP (Supplement Figure S5).

Subgroup Analyses
Study participants with incident consultations for LBP in the follow-up period only, i.e., those with no history of back pain and no ICD-10 codes suggestive of LBP during the baseline period, had, according to both model approaches, low probabilities for future LBP-related consultations. The estimated probabilities were very similar to those of participants with no ICD-10 codes suggestive of LBP (Figure 3). As shown in supplementary Table S7, these two strata of study participants share very similar characteristics, except for participants' sex (care-seeking: 0|0 = 48.8% females, 0|1 = 58.1% females, χ2 = 9.21, p < 0.01) and back pain in the last 3 months (μdiff = 0.49 points on the NRS, p < 0.01 (t-test)). Mean age was almost identical between these two strata. However, participants with health care utilization only after baseline (0|1) were more frequently aged 40 to 70 years (χ2 = 22.55, p < 0.01), a subgroup that was found to be more likely to seek care for LBP (Table 2).

In addition, participants with a history of back pain and/or ICD-10 codes suggestive of LBP in the baseline period but no ICD-10 codes in the follow-up period had probabilities comparable to those of participants continuously seeking physicians' care, i.e., having ICD-10 codes suggestive of LBP in all three analysis periods. The former participants were slightly older than those steadily seeking care (μdiff = 2.4 years, p < 0.01 (t-test)), more frequently had competing diseases (OR: 1.58 [1.23; 2.03]), and had a lower household income (median: 1096 vs. 1364, Wilcoxon rank sum test: p < 0.01). The abovementioned patterns were similar across both methodological approaches, across the training and validation data, and irrespective of back pain severity (Figure 3).

Discussion
We applied best subset selection (BSS) in a common machine-learning pipeline of model training in resampled data, model calibration and model averaging using strictly proper scoring rules, and internal validation of the predictions of future consultations for low back pain (LBP) in the general population. The predictive accuracy of BSS (AUC: 0.78) was comparable to that of the established machine-learning approaches randomforest (AUC: 0.79) and SVM (AUC: 0.78). The accuracy of all approaches is acceptable [42] and in line with values reported for other machine-learning-based prediction models [8,43] and for different back-pain-related outcomes [44]. In terms of computational time, BSS (21 h) was outperformed by randomforest (4:01 min) and SVM (24:46 min). From a clinical perspective, however, the benefit of interpretable model coefficients, as provided by BSS (Table 2), outweighs this computational drawback.
An acceptable accuracy (AUC in [0.7; 0.8], [42]), as in our study, is commonly reported for prediction models of health research studies [43]. Regarding the prediction of LBP, however, most previous studies lacked the validation of predictors as summarized by a systematic review [45]. Therefore, in this clinical domain, our study represents one among a few validated and clinically interpretable prediction models.
The best subset of predictors in the joined data comprised strong predictors from the claims data, particularly variables of previous consultations for LBP, and participant-reported outcomes collected in the SHIP study, such as back pain severity (NRS), radiating back pain, and household income. The latter outcomes were inferior in their predictive value compared to the predictors from the claims data (Supplement Figures S7-S12). Therefore, the AUC of the best subset in the joined data improved only marginally compared to the best subset using claims data only (Figure 2). The observed lack of improvement in predictions with participant-reported outcomes was also indicated by the low-to-moderate univariate associations of self-reported back pain outcomes and the frequency of LBP-related ICD-10 codes (Supplement Figure S6). Therefore, it seems that past administrative data are best at predicting future administrative data.
From a clinical perspective, this rather surprising finding reflects the fact that the claims data in this study represent data from patients who decided to seek medical care for LBP, whereas the clinical data from the SHIP study reflect the general population. The results of our study suggest that many participants with LBP were able to cope with LBP without seeking medical care, and thus did not become patients. As shown in other research, health-care-seeking behavior is partially detached from clinical symptoms [46]. Another important factor explaining differences in the predictive value of clinical data and claims data is comorbidity, or competing diseases. Participants with a higher Charlson comorbidity index (CCI) or competing diseases were less likely to reconsult for LBP (Table 2). This does not reflect lower pain severity, but a higher likelihood of seeking and obtaining care for a predominant health problem and a lower likelihood of LBP being coded in the claims data. This role of coping strategies and competing diseases was also affirmed by the systematic review by Ferreira et al. [47].
In the subgroup analyses of our study, we found that a considerable proportion of participants (n = 740, 19.3%) with severe LBP had previously consulted a physician for LBP but did not seek physicians' care again during follow-up. The respective participants were older, more frequently retired, had more competing diseases, and had a lower household income. The characteristics of this subgroup had strong implications for the predictive accuracy of all methodological approaches in this study. These participants had previously sought care for LBP, had a high intensity of back pain, and in 46.9% of them, back pain radiated to the lower extremities (Supplement Table S7). In consequence, very high probabilities of future episodes of LBP were estimated by all prediction approaches (Figure 3).
Irrespective of any cutoff used to balance sensitivity and specificity in classification [48], this subgroup of participants will be misclassified by present approaches (Figure 3).
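One common way to choose such a cutoff is to maximise Youden's J statistic over the estimated probabilities (a generic sketch under our own naming; the point above is precisely that no choice of cutoff rescues these subgroups):

```python
def youden_cutoff(probs, outcomes):
    """Return the cutoff on predicted probabilities that maximises
    Youden's J = sensitivity + specificity - 1, together with J."""
    pos = sum(outcomes)
    neg = len(outcomes) - pos
    best_cut, best_j = 0.5, -1.0
    for cut in sorted(set(probs)):  # candidate cutoffs at observed probabilities
        tp = sum(1 for p, y in zip(probs, outcomes) if p >= cut and y == 1)
        tn = sum(1 for p, y in zip(probs, outcomes) if p < cut and y == 0)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j
```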
In this study, claims data covering only two years were included in the prediction model ( Figure 1). Extending the pre-observation period might have allowed for the identification of patients who used extensive medical services for LBP in the past and, therefore, discontinued seeking care despite persisting pain, e.g., due to unsuccessful treatment attempts or exhaustion of treatment options. Identifying this longitudinal pattern in seeking care for LBP might improve prediction models, possibly under consideration of competing diseases such as cancer.
A second subgroup susceptible to misclassification were participants with incident LBP episodes in the follow-up only (Figure 3). Except for participants' sex and a slightly higher back pain intensity, these participants were almost indistinguishable from those not seeking care for LBP (Supplement Table S7). Recently, a similar difficulty in correctly predicting the new onset and recurrence of LBP was reported in a large claims-data-based study with over 400,000 participants [49].
Strengths and Limitations
Our analyses were based on a large general population sample with very high coverage of claims data. While we considered a wide range of predictors from SHIP and claims data, LBP and potential risk factors could have been characterized in more depth to further strengthen the pool of potential predictors, e.g., by covering psychosocial predictors such as fear-avoidance beliefs more widely [44,50]. In addition, further refinements of the study design must be sought to more appropriately select persons at risk of unfavorable LBP courses at different clinical states.
The best subset selection, as applied in our approach, is not a true machine-learning approach. Although we emphasize that there is no clear distinction between statistical modeling and machine learning [43], advanced regression techniques such as the lasso, which learns from the data and requires parameter tuning, are sometimes not considered machine learning [43], whereas the application of logistic regression for prediction is sometimes, debatably, considered a machine-learning approach [8]. We consider our approach an interpretable or hybrid form that mimics the machine-learning pipeline while maintaining interpretability: the approach was conducted to make robust predictions. From the data, we learned the optimal combination of candidate variables and the best-fitting distribution, and we gained insight into the clinical and socio-demographic roles of the candidate variables.
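As a schematic illustration of the idea (not the study's actual implementation, which used scoring rules in repeated random training samples), best subset selection exhaustively fits a model for every predictor combination and keeps the one optimizing an information criterion. A minimal Python sketch on simulated data, using BIC as the criterion:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(3)
n, k = 500, 6
X = rng.normal(size=(n, k))
# Only predictors 0 and 2 truly drive the outcome
y = (X[:, 0] - X[:, 2] + rng.normal(size=n) > 0).astype(int)

best_bic, best_subset = np.inf, None
for r in range(1, k + 1):
    for subset in combinations(range(k), r):
        Xs = X[:, subset]
        clf = LogisticRegression().fit(Xs, y)
        loglik = -log_loss(y, clf.predict_proba(Xs), normalize=False)
        bic = (len(subset) + 1) * np.log(n) - 2 * loglik  # +1 for the intercept
        if bic < best_bic:
            best_bic, best_subset = bic, subset
print("selected predictors:", best_subset)
```

Note that exhaustive enumeration over 32 candidates would require 2^32 fits and is infeasible, which is one reason the search must be structured; with 6 predictors here, only 63 fits are needed.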
We applied common statistical and epidemiological reasoning to reduce bias. For example, unlike many machine-learning applications, in which either single-value imputations are used [8] or missingness in data elements is not mentioned at all [51], we explicitly addressed the uncertainty of imputations using multiple imputation in the validation data, independently of the training data. The best subset model was evaluated in the validation data for 20 imputed versions of the data (Supplement Figure S14). In this study, the impact of missing data was negligible.
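The evaluation scheme described here (fitting once on training data, then assessing the fixed model on several independently imputed versions of the validation data) can be sketched as follows. This is an illustrative Python/scikit-learn analogue of the R-based analysis, using simulated data and 5 rather than 20 imputations:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, k = 600, 5
X = rng.normal(size=(n, k))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)
X_miss = X.copy()
X_miss[rng.random((n, k)) < 0.10] = np.nan  # ~10% of values missing at random

X_tr, y_tr, X_va, y_va = X_miss[:450], y[:450], X_miss[450:], y[450:]

# Train once on an imputed version of the training data
model = LogisticRegression().fit(
    IterativeImputer(random_state=0).fit_transform(X_tr), y_tr
)

# Evaluate the fixed model on several independently imputed validation datasets
aucs = []
for m in range(5):
    X_va_imp = IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X_va)
    aucs.append(roc_auc_score(y_va, model.predict_proba(X_va_imp)[:, 1]))
print(f"AUC across imputations: {np.mean(aucs):.2f} (sd {np.std(aucs):.2f})")
```

The spread of the AUC across imputations directly quantifies how much imputation uncertainty affects the reported accuracy; in the study it was negligible.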
In randomforest, multiple interactions between candidate variables are implicitly accounted for, which can be investigated, for example, using the R package randomForestExplainer [52]. This feature of randomforest might explain its slightly better performance compared to best subset selection. Randomforest, as well as SVM, was also superior in terms of computational efficiency. This computational drawback of BSS will be mitigated as soon as advances in mixed-integer optimization (MIO) are extended to use cases with data elements of mixed types (continuous and categorical). Bertsimas et al. [53] have shown the possibility of solving more complex settings of best subset selection in minutes.
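The advantage of implicitly modeled interactions can be made concrete with a small simulation: when the outcome depends on a product term, a main-effects logistic model plateaus, while a random forest picks the interaction up automatically. This is an illustrative Python/scikit-learn sketch on simulated data, not the study's tuned models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 4))
# Outcome driven partly by an interaction a main-effects logit cannot represent
eta = X[:, 0] + 1.5 * X[:, 1] * X[:, 2]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)
aucs = {}
for name, clf in [("logit", LogisticRegression()),
                  ("forest", RandomForestClassifier(n_estimators=300, random_state=0))]:
    clf.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
    print(f"{name}: AUC = {aucs[name]:.2f}")
```

Adding explicit interaction terms to a best subset search would narrow this gap, but at the cost of a much larger search space.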
The most important limitation of our study is that we were not able to validate the model with external data. Our approach comprised internal validation only, as all data were generated and collected under the same design.

Conclusions
Claims data alone could be used to predict patients with LBP who are likely to reconsult for LBP with acceptable accuracy. Adding participant-reported outcomes from a population-based cohort study to the prediction model yielded a minor improvement, but only in the current unstratified setting. Therefore, further refinements of patients at risk are required to obtain insights into the role of predictors at different patient states over the course of LBP.
The development of prediction models in health research should comprise epidemiological and statistical reasoning to minimize bias, maintain the interpretability of results, and pass through a pipeline of model training, calibration, and validation. These efforts can result in prediction models with performance comparable to machine-learning approaches. The latter, such as randomforest, should be applied more frequently to benchmark, and possibly improve on, the models proposed by other modelling techniques in case of major inferiority. Overall, given comparable performance, the interpretability of the study results outweighs computational efficiency in health research. In this regard, best subset selection appears a reasonable modeling approach.

Supplementary Materials: Table S1: List of ICD-10 codes associated with low back pain, Table S2: List of physicians' fee positions related to back pain, Table S3: Results of a generalized additive model for nonlinear associations, Table S4: Candidate variables and results of explorative variable selection, Table S5: Matrix of all possible variable combinations, Table S6: Imputation diagnostics, Table S7: Subgroup analyses through stratification for seeking care, Figure S1: Frequency of ICD-10 codes for back pain observed in the training data, Figure S2: Nonlinear associations of covariates with the presence of ICD-10 codes, Figure S3: Distributional plots of all continuous candidate variables, Figure S4: Scheme of the validation algorithm to assess model prediction accuracy, Figure S5: Variable importance resulting from a tuned random forest approach, Figure S6: Correlation plot of continuous or count data, Figure S7: Brier scores claims data, Figure S8: Brier scores SHIP data, Figure S9: Brier scores joined data, Figure S10: Dawid-Sebastiani scores claims data, Figure S11: Dawid-Sebastiani scores SHIP data, Figure S12: Dawid-Sebastiani scores joined data, Figure S13: Rootograms of the optimal model, Figure S14: ROC curves for the best subset model with multiple imputations.

Informed Consent Statement: All SHIP study participants (N = 4420) gave written informed consent for the examinations. Thereof, n = 3837 provided consent for a linkage with administrative claims records.

Data Availability Statement: Data of the SHIP studies are available and can be applied for under https://www.fvcm.med.uni-greifswald.de/ (last accessed on 15 November 2021). The claims data utilized in this study are not publicly available due to privacy restrictions for the protection of personal data of research participants.