Estimating Prevalence and Characteristics of Statin Intolerance among High and Very High Cardiovascular Risk Patients in Germany (2017 to 2020)

Statin intolerance (SI), whether partial or absolute, could lead to suboptimal lipid management. The lack of a widely accepted definition of SI results in a poor understanding of patient profiles and characteristics. This study aims to estimate SI prevalence and better understand patient characteristics, as reflected in clinical practice in Germany, using supervised machine learning (ML) techniques. This retrospective cohort study utilized patient records from an outpatient setting in Germany in the IQVIA™ Disease Analyzer. Patients with a high cardiovascular risk, atherosclerotic cardiovascular disease, or hypercholesterolemia, and those on lipid-lowering therapies between 2017 and 2020 were included and categorized as having “absolute” or “partial” SI. ML techniques were applied to calibrate prevalence estimates derived from different rules and levels of confidence (high and low). The study included 292,603 patients; 6.4% and 2.8% had absolute and partial SI with high confidence, respectively. After deploying ML, SI prevalence increased by approximately 27% and 57% (p < 0.00001) for absolute and partial SI, respectively, yielding a maximum estimate of 12.5% SI with high confidence. The use of advanced analytics to provide a complementary perspective to current prevalence estimates may inform the identification, optimal treatment, and pragmatic, patient-centered management of SI in Germany.


Introduction
The 2019 European Society of Cardiology (ESC)/European Atherosclerosis Society (EAS) guidelines on the management of dyslipidemia focus on risk-adapted low-density lipoprotein cholesterol (LDL-C) reduction as a primary strategy [1]. Despite treatment with oral lipid-lowering therapies (LLTs), up to 80% of patients with hypercholesterolemia and mixed dyslipidemia across Europe do not reach guideline-recommended LDL-C goals [1][2][3]. Germany has adapted lipid management goals as outlined in the ESC/EAS guidelines; however, real-world evidence from Germany (2017) indicates that only 36.3% of high and very high cardiovascular (CV) risk patients appear to be treated with LLTs [4].
Although statin therapy is the accepted primary pharmacological approach for LDL-C reduction, non-adherence to treatment and discontinuation of statin therapy are current problems worldwide [5]. A frequently noted complaint among statin users is statin-associated muscle symptoms (SAMS). Recent updates from the National Lipid Association

Study Design
This was a retrospective study based on a cohort of patients on LLT in the IQVIA™ Disease Analyzer from March 2017 to March 2020.
The IQVIA™ Disease Analyzer is based on a representative sample of more than 3300 resident physicians in Germany (as of February 2020) equipped with office-based IT systems [16,17]. The data source contains anonymized daily records of treatment-relevant data of statutory health insurance and private health insurance patients. This allows a detailed consideration of the therapeutic and diagnostic behavior of physicians (in retail), as well as a longitudinal view of anonymized patients with their diagnoses and treatments. Unlike in other data sources, outpatient diagnoses are recorded daily (by date of visit).
Patients were included in the study if they were aged at least 18 years on the index date and had at least one consultation during the selection period (March 2019 to March 2020). Moreover, patients were included if they were on LLTs from March 2017 to March 2020 and had been identified as having hypercholesterolemia, atherosclerotic cardiovascular disease (ASCVD), or high CV risk during the selection period or in the past.
Hypercholesterolemia, LLTs, ASCVD, and high CV risk patients were defined based on Anatomical Therapeutic Chemical (ATC) classes or International Classification of Diseases (ICD)-10 codes (see Table S1 for more details).
The index date was defined by the latest prescription of statins for patients actively on treatment medication during the study period. For patients who were on non-statin LLTs during the study period, the index date was defined based on the latest prescription of non-statin LLTs.
Baseline characteristics, comorbidities, concomitant medications, and SI events were assessed during a maximum two-year lookback period, with the earliest possible date being March 2017. The lookback period was based on the index date.
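The index date and lookback rules described above can be illustrated with a minimal sketch. It is not the study's implementation: the function names, the tuple layout of a prescription record, the choice that a statin prescription takes precedence over a non-statin LLT, and the use of 1 March 2017 as the earliest date (the text only says "March 2017") are all assumptions made for illustration.

```python
from datetime import date

# Assumption: the source gives only "March 2017"; the first of the month is used here.
EARLIEST_LOOKBACK = date(2017, 3, 1)


def index_date(prescriptions):
    """Return the index date for one patient.

    prescriptions: list of (prescription_date, is_statin) tuples (hypothetical layout).
    The latest statin prescription is used if the patient has any statin;
    otherwise the latest non-statin LLT prescription is used.
    """
    statin_dates = [d for d, is_statin in prescriptions if is_statin]
    other_dates = [d for d, is_statin in prescriptions if not is_statin]
    if statin_dates:
        return max(statin_dates)
    return max(other_dates) if other_dates else None


def lookback_start(index):
    """Start of the lookback window: two years before the index date,
    but never earlier than EARLIEST_LOOKBACK."""
    try:
        two_years_back = index.replace(year=index.year - 2)
    except ValueError:
        # The index date is 29 February and the target year is not a leap year.
        two_years_back = index.replace(year=index.year - 2, day=28)
    return max(two_years_back, EARLIEST_LOOKBACK)
```

Under these assumptions, a patient whose index date is 1 June 2018 would have a lookback window that starts in March 2017 rather than June 2016, which is how the "maximum two-year" wording is interpreted here.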

Statin Intolerance
To identify SI, a variety of events that can signal potential SI were analyzed. The events included statin down-titration (same and different molecule), statin switch/multiple statins (without up- or down-titration), statin discontinuation, intermittent dosing, low-dose statin use, documented SI/toxicity/allergy, and the presence of SAMS. These events were derived from clinical practice in Europe and reflect the definitions of various societies [18,19] (see Table S2 for more details). They were later used to create rules for categorizing patients as having SI or not. Final SI definitions were derived from different combinations of SI events/conditions (Table 1).
We applied two separate classifications of SI, as reported by others [12], in order to differentiate between patients who permanently discontinued statins and patients who exhibited certain SI characteristics but had not completely discontinued statins during the study period:

• Absolute SI: Patients with a history of SI events (
In addition, the two separate classes of SI were classified into two groups: high and low confidence levels, based on the level of confidence for the specific rules and driven by the stronger association with SI as a clinical syndrome (Table 1) (see Tables S5 and S6 for more details).

Supervised Machine Learning Prevalence Estimates
Supervised ML was used to refine the rule-based SI prevalence estimate derived from the electronic medical record (EMR) dataset and to find the most important features in predicting SI [13,20], as illustrated in Figure 1. An advantage of using ML over traditional regression-based approaches is that previously unknown complex variable interactions and non-linear relationships can also be considered. Two classification models were trained: the first on group 1, to differentiate "high confidence absolute intolerant patients versus tolerant patients", and the second on group 2, to differentiate "high confidence partial intolerant patients versus tolerant patients". The two trained models were then used to reclassify low confidence absolute intolerant patients (model 1) and low confidence partial intolerant patients (model 2), respectively, in order to revise the estimate for SI prevalence. SHapley Additive exPlanations (SHAP) feature importance analysis was used to quantify the contributions of the most important features the models use to predict SI in patients [21].
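SHAP attributions are built on Shapley values, which distribute the difference between a model's prediction for one patient and its prediction for a baseline across the input features. The sketch below computes exact Shapley values by enumerating all feature coalitions for a toy scoring function. It is only a conceptual illustration: the toy model, its three features, and the baseline are hypothetical, and practical SHAP tooling for tree ensembles relies on far more efficient algorithms than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for a single instance.

    predict:  callable taking a list of feature values and returning a score.
    x:        feature values of the instance being explained.
    baseline: reference values used for features that are "absent" from a coalition.
    """
    n = len(x)

    def value(coalition):
        # Features in the coalition take the instance value, all others the baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        contributions.append(total)
    return contributions


# Hypothetical toy score: myalgia flag, fibrate flag, number of physician visits.
def toy_score(z):
    return 0.5 * z[0] + 0.3 * z[1] + 0.1 * z[2]


phi = shapley_values(toy_score, x=[1, 1, 4], baseline=[0, 0, 2])
```

The contributions sum to the difference between the score for the patient and the score for the baseline, which is the additivity property that makes such attributions interpretable.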

Training Data Definitions
The models were trained using the same cohorts selected from the EMR IQVIA™ Disease Analyzer dataset. The absolute intolerant dataset consists of all high confidence absolute intolerant patients, along with a random sample of 50,000 tolerant patients. The partial intolerant dataset consists of all high confidence partial intolerant patients, along with a random sample of 50,000 tolerant patients.
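The construction of each training set can be sketched as follows. The function name, the use of patient identifiers as plain list elements, and the fixed random seed are assumptions for illustration; only the "all high confidence intolerant patients plus a random sample of 50,000 tolerant patients" rule comes from the text.

```python
import random


def build_training_set(high_conf_ids, tolerant_ids, n_tolerant=50_000, seed=0):
    """Return (patient_id, label) pairs: label 1 for every high confidence
    intolerant patient, label 0 for a random sample of tolerant patients."""
    rng = random.Random(seed)  # seeded for reproducibility (an assumption)
    sample_size = min(n_tolerant, len(tolerant_ids))
    sampled_tolerant = rng.sample(list(tolerant_ids), sample_size)
    positives = [(pid, 1) for pid in high_conf_ids]
    negatives = [(pid, 0) for pid in sampled_tolerant]
    return positives + negatives
```

The same function would be called twice, once with the high confidence absolute intolerant patients and once with the high confidence partial intolerant patients, to obtain the two datasets.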

Feature Engineering
Features were created from patient-level data based on the presence, frequency, and mean values of diagnosis ICD codes, ATC classification codes, and test results. Appropriate ICD codes were also grouped together to create less granular features. Demographic features included the gender and age of the patient. Other features included the number of physician visits and the number of unique prescriptions (molecules, ATC classes, and LLT molecules).
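A minimal sketch of this kind of feature construction for a single patient is shown below. The feature naming scheme, the input layout, and the grouping rule (truncating an ICD-10 code to its three-character category) are assumptions for illustration and are not taken from the study.

```python
from collections import Counter
from statistics import mean


def patient_features(codes, lab_results):
    """Build presence, frequency, and mean-value features for one patient.

    codes:       list of ICD-10 or ATC code strings recorded for the patient.
    lab_results: dict mapping a test name to a list of numeric results.
    """
    features = {}
    counts = Counter(codes)
    for code, n in counts.items():
        features[f"has_{code}"] = 1   # presence
        features[f"n_{code}"] = n     # frequency
    # Less granular grouping (assumption): ICD-10 codes with a decimal part
    # are also flagged at the level of their three-character category.
    for code in counts:
        if "." in code:
            features[f"has_group_{code.split('.')[0]}"] = 1
    for test, values in lab_results.items():
        if values:
            features[f"mean_{test}"] = mean(values)
    return features


example = patient_features(
    codes=["M79.1", "M79.1", "R25.2", "C10AA05"],   # myalgia x2, cramp and spasm, atorvastatin
    lab_results={"ldl_c": [130.0, 150.0]},
)
```

Per-patient dictionaries of this kind can then be stacked into a feature matrix, with absent keys filled with zero for presence and frequency features.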

Feature Selection
Features were selected by estimating the mutual information between each feature, taken independently, and the target variable (statin intolerant vs. tolerant). Mutual information between two random variables is a non-negative value that measures the dependency between the variables. It is equal to zero if and only if the two random variables are independent, and higher values indicate greater dependency. The estimation relies on nonparametric methods based on entropy estimation from k-nearest neighbor distances, as described in Kraskov A et al. [22] and Ross BC [23]. Both methods are based on the concept first proposed in Kozachenko LF et al. [24]. In our model, the 400 features with the highest mutual information scores were selected for modeling.
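The ranking step can be illustrated with a simplified sketch. Note the substitution: the study uses k-nearest neighbor entropy estimators [22,23], whereas the code below uses the plain plug-in (count-based) estimate of mutual information, which is only appropriate for discrete features such as presence flags. Function and feature names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) between two
    discrete sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_top_k(features, target, k):
    """Rank features (dict: name -> list of values) by mutual information
    with the target and return the names of the k highest-scoring ones."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A feature that is identical to a balanced binary target scores log 2 (about 0.693 nats), while a feature that is independent of the target scores zero. In the study's setting, `k` would be 400.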

Model Selection
Different models, such as logistic regression, LGBM, and XGBoost algorithms, were trained on the overall dataset [25,26], and the highest performing one was selected based on the F1 score (the harmonic mean of precision and recall). The dataset was randomly split into four equally sized groups. Each model was trained on three of the groups and tested on the remaining one, and this was repeated four times so that each group served once as the test set. Performance metrics were evaluated across splits to ensure consistent model performance. Model performance was assessed using the area under the receiver operating characteristic (ROC) curve, precision-recall curves (PRC), and the F1 score.
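The four-fold evaluation scheme and the F1 score can be sketched in a model-agnostic way. The `fit_predict` argument is a hypothetical callable standing in for any of the candidate classifiers; the splitting strategy (shuffle, then deal indices into four groups) and the seed are likewise assumptions.

```python
import random


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = intolerant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_splits(n, k=4, seed=0):
    """Yield (train_indices, test_indices) so that each of the k groups
    serves exactly once as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def cross_validated_f1(X, y, fit_predict, k=4, seed=0):
    """fit_predict(X_train, y_train, X_test) -> predicted labels (hypothetical interface)."""
    scores = []
    for train, test in k_fold_splits(len(y), k, seed):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train], [X[i] for i in test])
        scores.append(f1_score([y[i] for i in test], preds))
    return scores
```

Comparing the per-fold scores, and not only their average, gives a rough check that performance does not depend on one particular split.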

Prevalence Estimation
The prevalence estimate was updated by predicting, for each patient labeled as low confidence (absolute or partial intolerant by SI rules), whether the patient was intolerant or tolerant. The absolute intolerant model was applied to low confidence absolute intolerant patients and the partial intolerant model to low confidence partial intolerant patients. The prevalence was then updated by summing the new total of intolerant patients based on the individual patient-level predictions of the model. The decision threshold of each model was adjusted so that its precision was equal to its recall.
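One way to obtain a threshold at which precision equals recall is to note that both share the same numerator (true positives), so they coincide when the number of predicted positives equals the number of actual positives. The sketch below uses that observation; whether the study derived its thresholds this way is not stated, and the function names and inputs are hypothetical.

```python
def balanced_threshold(scores, labels):
    """Return the score threshold at which the number of predicted positives
    (score >= threshold) equals the number of actual positives, so that
    precision equals recall. Assumes at least one positive label; tied scores
    at the threshold can break the exact equality."""
    n_pos = sum(labels)
    ranked = sorted(scores, reverse=True)
    return ranked[n_pos - 1]


def updated_prevalence(n_high_conf, low_conf_scores, threshold, n_total):
    """High confidence prevalence after reclassifying low confidence patients
    whose predicted score reaches the threshold."""
    reclassified = sum(1 for s in low_conf_scores if s >= threshold)
    return (n_high_conf + reclassified) / n_total
```

For example, with validation scores `[0.9, 0.8, 0.4, 0.3, 0.1]` and labels `[1, 0, 1, 0, 0]`, the threshold is 0.8: two patients are predicted positive, one of them correctly, giving a precision and a recall of 0.5 each.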

Results
The patient selection criteria and attrition flow for the entire study population are summarized in Figure 2. The study included 292,603 patients. Of these, the majority (n = 221,442) were statin tolerant, while 71,161 patients were statin intolerant.

Key Patient and Treatment Characteristics
Key patient and treatment characteristics are summarized in Table 2. Overall, a higher prevalence of SI was observed in patients in the 70+ years age group. The prevalence of absolute SI was higher in females than in males. An ML SHAP analysis was performed to determine feature importance and the key characteristics that help to identify SI patients. For example, the presence of fibrates, use of other lipid regulators, presence of myalgia, cramps, and spasms, total number of prescriptions, and number of physician visits were significantly higher in 'absolute' SI patients than in tolerant patients. The ML model used these key predictors to identify low confidence SI patients that could be reclassified as high confidence SI patients. Most preidentified risk factors had a higher prevalence in SI populations than in tolerant patients. Among the more prevalent risk factors were obesity, hypothyroidism, vitamin D deficiency, and chronic kidney disease. These risk factors also had a higher presence across the different intolerant groups.
Simvastatin (47.2%; total) and atorvastatin (36.8%; total) were the most common statins used by both absolute and partial SI patients at all confidence levels. Among statin tolerant patients, simvastatin and atorvastatin were used by 49.6% and 36.9% of patients, respectively. Absolute intolerant patients with a high confidence level showed higher use of non-statin monotherapy (fenofibrate and bezafibrate) than partial SI patients with a high confidence level. Conversely, a higher proportion of partially intolerant patients with a high confidence level used ezetimibe as monotherapy or in combination with statins (54.0%) than absolute intolerant patients with a high confidence level (only 21.6%). SAMS were consistently observed to occur more frequently in SI groups; the underlying SAMS conditions observed most frequently across confidence levels were myalgia and cramps/spasms. Absolute SI patients exhibited a higher prevalence of known manifestations such as myalgia and cramps/spasms. Drugs that could interact with statins were used slightly more often by intolerant patients than by tolerant patients (Table 2).
Detailed patient and treatment characteristics are presented in Table S7.

Statin down-titration and switches for absolute intolerant and partial intolerant patients are presented in Figures 3 and 4, respectively. Atorvastatin 40 mg was the most frequently down-titrated statin, with patients typically shifting from a high- to a medium-intensity dose. Simvastatin to atorvastatin was the most common switch in SI patients, followed by atorvastatin to rosuvastatin. Simvastatin 20 mg to atorvastatin 20 mg represented the highest proportion of switches among intolerant patients, followed by simvastatin 40 mg to atorvastatin 20 mg.

Prevalence Results
Prevalence estimates for SI based on EMR data are presented in Table 3. Among the 292,603 (100%) total eligible patients, the prevalence of tolerant patients was 76.6% (n = 224,112), while the SI prevalence was estimated at a maximum of 24.3% (n = 71,161). The proportion of patients with SI was 24.3% among all ASCVD patients in the study, 28.2% among high CV risk patients, and 20.9% among hypercholesterolemia patients (Table S8). The absolute SI prevalence (high confidence) was 6.4%, based on SI rules. However, this increased to 8.1% when supervised ML was applied (~27% increase in patients classified with high confidence). Similarly, partial SI prevalence (high confidence) was 2.8%, based on SI rules, but increased to 4.4% using supervised ML (~57% increase in patients classified with high confidence). Through supervised ML, the high confidence prevalence increased by approximately 1.6 percentage points for both absolute and partial SI, resulting in an overall estimate for high confidence SI of 12.5%. The overall prevalence estimate remained the same, as the model was trained on high confidence SI patients and applied to low confidence SI patients. This increase potentially provides a more precise estimate of SI prevalence and helps to identify a pool of low confidence SI patients who, although they do not display specific signs of SI such as down-titration, have characteristics similar to those of high confidence SI patients. As detailed in the Methods (section Model Selection), model performance was assessed using the area under the ROC curve, the PRC, and the F1 score. The XGBoost classifier was the best performing model. For model 1 and model 2, the area under the ROC curve was 0.91 and 0.90, the area under the PRC was 0.61 and 0.53, and the F1 score was 0.69 and 0.61, respectively.
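The relationship between the reported prevalence figures and the relative increases can be checked with a few lines of arithmetic. The sketch below uses only the rounded percentages quoted in the text, so the results are approximate; the function and variable names are illustrative.

```python
def relative_increase(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Rounded high confidence prevalence values (percent) quoted in the text.
absolute_rules, absolute_ml = 6.4, 8.1
partial_rules, partial_ml = 2.8, 4.4

absolute_gain = relative_increase(absolute_rules, absolute_ml)  # roughly 27
partial_gain = relative_increase(partial_rules, partial_ml)     # roughly 57
combined_high_confidence = absolute_ml + partial_ml             # 12.5
```

This makes explicit that the ~27% and ~57% figures are relative increases in the number of patients classified with high confidence, whereas the change in prevalence itself is on the order of 1.6 to 1.7 percentage points per class.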

Discussion
The results of this study highlight the challenge of estimating and describing SI among patients with high and very high CV risk in routine clinical practice in Germany. The data were obtained in Germany, but the findings probably also hold true in other countries. The identification of pragmatic groups of patients with SI depends upon a range of clinical observations, events, and characteristics [27]. Statin down-titration, statin switch, statin discontinuation, intermittent dosing, low-dose statin use, documented SI/toxicity/allergy, and the presence of SAMS were the events used to define SI, also reflecting the definitions of various societies [18,19]. Agreement exists that these events may lead to non-adherence, treatment discontinuation, and, often, suboptimal lipid treatment.
The present analysis supports the distinction between partial (i.e., only some types of statins at some doses) and absolute (i.e., all statins at any dose) SI, as also published elsewhere. In a recent meta-analysis, the authors referred to "complete" SI, but this has different definitions in the ILEP and NLA guidelines and is not defined in the EAS guidelines they used [5]. Overall, they reported a 9.1% prevalence of SI worldwide; randomized controlled trials reported a prevalence below 5%, compared with 17% in cohort studies [5]. In our study, the absolute SI prevalence (high confidence) was 6.4%, based on SI rules. However, this prevalence increased to 8.1% when supervised ML was leveraged. These prevalence estimates are very similar to recently published figures on SI prevalence [5]. The contemporary observational cohort study INTERCATH in Germany used simulations to identify the appropriate target population for certain lipid treatments and the related costs generated [12]. In that study, the authors defined partial SI as the inability to tolerate a high-intensity statin and full SI as the inability to tolerate any statin at any dose, which is aligned with our definitions and with previous publications [28]. Additionally, SI was classified into high and low confidence levels in our study. This is a novel, constructed classification intended to assist SI classification and patient characterization. Validation of the results was performed with qualitative insights from primary market research conducted with consulting external experts (validation data are available upon request) [29].
Regardless of the confidence levels in SI classification, the nocebo effect could be a challenge. Patients with negative expectations about adverse events related to statin use are more likely to experience the nocebo effect [30]. A systematic review of open-label, blinded phase trials estimated that about 38% to 78% of SAMS-related SI could be attributed to expectation alone, i.e., the nocebo effect [27]. In the self-assessment method for statin side effects or nocebo (SAMSON) trial, 90% of statin complaints were related to the nocebo effect [31]. This finding highlights that, although patients on statin therapy do experience side effects, these can be related to the act of taking medication and not necessarily to the statin itself. Physicians should consider prevention and management approaches for the nocebo/drucebo effect at the point of therapy initiation [32]. These approaches are crucial to achieving optimal LLT treatment goals and to managing CV risk in a large number of patients [32].
There are various risk factors/conditions that may affect the risk of SI. In our study, intolerance was observed more often among women and the elderly. Obesity, hypothyroidism, vitamin D deficiency, and chronic kidney disease were the more prevalent risk factors for the manifestation of SI, aligned with published literature [5]. The current study also revealed antibiotics to be a factor associated with an increased risk of SI, due to a potential drug-drug interaction, as also reported by Fitchett DH et al. [28].
From the physician's viewpoint, there are several barriers associated with the prescription of and adherence to statin therapies, including partial knowledge, inconsistent clinical guidelines, and the lack of a system to identify the right patients for statin therapy. From the patient's viewpoint, fear of side effects and resistance to taking additional medications are key barriers [33,34]. Hence, there is an opportunity to educate physicians and patients during the treatment journey. A holistic, patient-centric approach is required for optimal CV risk reduction, continued therapy, or alternative drugs and management strategies [32]. The Personalized Lipid Intervention Plan could be utilized for this purpose [35]; together with ML techniques, it can enhance not only patient education but also the prediction of patient behavior.

Study Limitations
With the EMR database, the course of treatment can only be tracked longitudinally within the same physician's office (practice); any previous or subsequent diagnosis or treatment by another physician is not recorded. The EMR database only covers outpatient data from office-based practices; inpatient data are not covered, nor are data from hospital outpatient departments or medical care centers (Ambulanzen or MVZ). The results of the analysis were based on the records of the panel of general practitioners (GPs) and cardiologists at practice level. In the projection, data for patients presenting to cardiologists (current coverage ~6%) and GPs (current coverage ~3%) were extrapolated separately to the respective specialist populations in Germany. Referrals to specialists or hospitals were documented by physicians, but treatment in other medical practices or hospitals was not recorded. Another limitation is that data on exercise, ethnicity, and family history of SAMS were not included in the study, and these can influence the outcomes of statin therapy and, indirectly, the prevalence estimates of SI. The classification of SI into high and low confidence levels is an innovative approach but is not scientifically quantitative. Although the approach was validated qualitatively with primary market research, the lack of quantitative validation is considered a limitation of this study. The study estimates SI prevalence based on rules developed using real-world diagnosis and prescription characteristics, and thus cannot differentiate between true and self-perceived SI.

Conclusions
In this study we explored the population of SI patients in Germany using advanced analytics. The derived prevalence estimates and patient characteristics add a complemen-