Extracting New Temporal Features to Improve the Interpretability of Undiagnosed Type 2 Diabetes Mellitus Prediction Models

Type 2 diabetes mellitus (T2DM) often results in high morbidity and mortality. In addition, T2DM presents a substantial financial burden for individuals and their families, health systems, and societies. According to studies and reports, the incidence and prevalence of T2DM are increasing rapidly worldwide. Several models have been built to predict future T2DM onset or to detect undiagnosed T2DM in patients. In addition to the performance of such models, their interpretability is crucial for health experts, especially in personalized clinical prediction models. Data collected over 42 months from the health check-up examination and prescribed-drug data repositories of four primary healthcare providers were used in this study. We propose a framework consisting of LogicRegression-based feature extraction and Least Absolute Shrinkage and Selection Operator (LASSO)-based prediction modeling for undiagnosed T2DM prediction. The performance of the models was measured using the Area under the ROC curve (AUC) with corresponding confidence intervals. Results show that LogicRegression-based feature extraction produced simpler models, which are easier for healthcare experts to interpret, especially in cases with many binary features. Models developed using the proposed framework achieved an AUC of 0.818 (95% Confidence Interval (CI): 0.812−0.823), comparable to more complex models (i.e., models with a larger number of features) in which all features were included in prediction model development, with an AUC of 0.816 (95% CI: 0.810−0.822); the difference in the number of features used, however, was substantial. This study proposes a framework for building interpretable models in healthcare that can contribute to higher trust in prediction models from healthcare experts.


Introduction
Type 2 diabetes mellitus (T2DM) often results in high morbidity and mortality. In addition, T2DM presents a substantial financial drain for individuals and families, health systems, and societies. Globally, the incidence and prevalence of T2DM are increasing rapidly [1]. In 2017, an estimated 425 million people had some form of diabetes (approx. 5.5% of the worldwide population), of whom 90% had T2DM. According to projections, the prevalence is going to increase substantially in the coming years; by 2045, for example, a 48% increase in prevalence from the above numbers is expected, or, in absolute numbers, an estimated 629 million people (approx. 6.6% of the worldwide population) are expected to be suffering from some form of diabetes [2]. T2DM can also lead to a substantially increased risk of macrovascular and microvascular disease, especially in patients with inadequate glycemic control [3]. Impaired fasting glucose typically leads to slow progression of T2DM and, more importantly, its symptoms may remain undetected for many years. Electronic Health Records (EHR) enable researchers to perform predictive modeling by providing a large amount of data [4], and many links have been found between patient health, the environment, and clinical decisions [5]. Nowadays, data mining techniques are applied to various fields of science, including healthcare and medicine [6]; techniques such as pattern recognition, disease prediction, and classification are commonly used. Although multiple methods are available to build prediction models, prediction accuracy and data validity are often insufficient for model application in practice. Models usually perform well on the specific datasets used to build them but frequently do not generalize well to other datasets [7].
There is growing interest in clinical prediction, but models' interpretation is rarely based on end-user needs [8], and there is a lack of model interpretability techniques [9]. Interpretability of results based on predictive models is crucial in critical areas such as healthcare and is essential for adopting models. People often do not understand predictive models and therefore do not trust them [10]. LogicRegression can be used to improve the interpretability of predictive models.
LogicRegression is an adaptive classification and regression procedure that searches for Boolean (logic) combinations of binary variables that best explain the variability in the outcome [11,12]. By searching for such logical combinations of binary features, it can explain the variability of the outcome feature and thus reveal which features and interactions are related to the response and whether they have predictive capabilities [11].
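To make the idea concrete, the following minimal Python sketch shows how a single logic feature is evaluated from binary predictors. LogicRegression itself is typically run via the LogicReg package in R; the Boolean combination below is a hand-written illustration over FINDRISC-style variables from our data (High_BS, Medication, Active_30_min), not a combination learned by the procedure.

```python
# A "logic feature" is a Boolean combination of binary predictors.
# LogicRegression searches over such combinations automatically;
# here we only evaluate one hand-written example.

def logic_feature(high_bs: int, medication: int, active_30_min: int) -> int:
    """Illustrative logic feature: (High_BS AND Medication) OR NOT Active_30_min."""
    return int((high_bs and medication) or not active_30_min)

patients = [
    {"High_BS": 1, "Medication": 1, "Active_30_min": 1},
    {"High_BS": 0, "Medication": 1, "Active_30_min": 1},
    {"High_BS": 0, "Medication": 0, "Active_30_min": 0},
]
values = [logic_feature(p["High_BS"], p["Medication"], p["Active_30_min"])
          for p in patients]
print(values)  # -> [1, 0, 1]
```

Each logic feature therefore collapses into a single binary column that can enter a regression model in place of several original binary features.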
The purpose of this paper is to use LogicRegression to make the final models less complex (i.e., containing fewer features) and the features that appear in the interpretation of predictive models much more understandable. This is important for health professionals, who often lack the background needed to apply prediction models or interpret the results obtained. It is also important from the patient's point of view and for the provision of personalized healthcare: a simple interpretation makes it easier for the patient to understand how the predictive model works and what its outcome means. The paper presents an example of using features extracted with LogicRegression to improve the personalized interpretability of prediction models for end-users.

Data
EHR data consisted of health check-up and prescribed-drug data from four Slovenian primary healthcare providers, covering a period of approximately 3.5 years, from 12 December 2014 to 27 July 2018. After an on-site anonymization process, 21,138 medical records with 114 potentially useful features were exported from the healthcare information systems. Our first step was the removal of features with more than 20% missing data (73 potential features remained). Since our focus when building prediction models was on the fasting plasma glucose level (FPGL) measurement (mmol/L) and the Finnish Diabetes Risk Score (FINDRISC) features, namely Age, Gender, BMI, Waist circumference, Active_30_min, Medication, High_BS, Grocer, and Diab_fam, we selected cases with all of those values present (4086 such cases remained). We next removed (a) cases in which more than 50% of the features were not available (4067 cases left), (b) duplicate entries, keeping only the most recent visit for patients with multiple visits (3535 cases left), (c) cases with a previous diabetes diagnosis (3176 cases left), and (d) entries where FPGL was not reported, leaving a total of 3120 records of patient visits for the development of a prediction model to estimate the risk of undiagnosed T2DM. The data included demographics, questionnaire answers on lifestyle choices, physiological measurements, and prescribed medications for two time periods.
Binary features were created for prescribed drugs and questionnaire responses, resulting in nine numeric and 161 binary features. A specific drug-related feature was coded as positive in cases where the patient was prescribed that drug during the last 4 months prior to the visit. The target feature was binary, with positive cases defined as having an FPGL higher than 6.1 mmol/L, comprising 24.71% (n = 771) of patient visits.
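The two coding rules above can be sketched as follows. This is a hypothetical minimal example: the function names and the 122-day approximation of the 4-month window are our illustrative assumptions, not the exact implementation used in the study.

```python
from datetime import date, timedelta

FOUR_MONTHS = timedelta(days=122)  # rough approximation of a 4-month window

def drug_feature(visit_date, prescription_dates):
    """1 if the drug was prescribed within 4 months before the visit, else 0."""
    return int(any(visit_date - FOUR_MONTHS <= d <= visit_date
                   for d in prescription_dates))

def target(fpgl_mmol_l):
    """Positive target case: FPGL above 6.1 mmol/L."""
    return int(fpgl_mmol_l > 6.1)

visit = date(2018, 7, 1)
print(drug_feature(visit, [date(2018, 5, 10)]))  # 1: inside the window
print(drug_feature(visit, [date(2017, 1, 3)]))   # 0: prescription too old
print(target(6.5), target(5.4))                  # 1 0
```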
We imputed the remaining missing values using the MissForest-based approach [7]; the affected features had on average 12.25% missing values, since features with 20% or more missing values had already been removed. MissForest is designed to impute missing values particularly in the case of mixed-type data: it can impute continuous and/or categorical data, including complex interactions and nonlinear relations. The summary information on the basic predictive and target features can be seen in Table 1. Please see Table A1 for a list of all features used in the experiments.
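MissForest itself is an R package; a close Python analogue, shown here as a minimal sketch on synthetic data rather than our study data, is scikit-learn's IterativeImputer with a random-forest estimator, which likewise iteratively regresses each incomplete feature on the others and predicts its missing entries.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# MissForest-style imputation on synthetic data: fit a random forest on
# the observed part of the incomplete column, predict its missing cells,
# and repeat until the imputations stabilize.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # correlated column
X_missing = X.copy()
X_missing[rng.choice(100, size=15, replace=False), 1] = np.nan  # 15% missing

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5, random_state=0)
X_imputed = imputer.fit_transform(X_missing)
print(np.isnan(X_imputed).sum())  # all missing entries are filled in
```

Because the incomplete column is strongly correlated with another column, the forest can recover the missing values far better than a column-mean fill would.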

Experimental Setup
The data were split into 80% used to derive five extracted features using LogicRegression [13] and 20% used to build and evaluate the final prediction models.
Finally, we created three datasets with the following features: all numeric and binary (170), all numeric and logic (14), and all numeric, binary, and logic features (175). On each dataset, we built a predictive model separately using the same training data.
The Least Absolute Shrinkage and Selection Operator (LASSO) [13] was used to build the prediction models. We repeated 10-fold cross-validation ten times to estimate the variance in the Area under the ROC curve (AUC), which was used as our classification performance metric.
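The AUC has a simple probabilistic reading: it is the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case (the Mann-Whitney statistic, with ties counted as one half). A minimal stdlib sketch with illustrative scores:

```python
def auc(scores_pos, scores_neg):
    """AUC as P(score of a positive > score of a negative), ties = 1/2."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted risks for positive and negative patient visits.
pos = [0.9, 0.8, 0.6]
neg = [0.7, 0.4, 0.3, 0.2]
print(round(auc(pos, neg), 3))  # -> 0.917 (11 of 12 pairs ranked correctly)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why it is a convenient threshold-free summary for an imbalanced target such as ours.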

Results
We split the results in this section into two parts. First, we present selected logic attributes extracted from the dataset for the undiagnosed T2DM prediction use case. Next, we present the performance evaluation of the model.

Feature Extraction Using LogicRegression Approach
To demonstrate the practical example of using LogicRegression based extraction of new features to improve interpretability of the prediction models, we provide the results of the first cross-validation run.
The selected use case resulted in five logic features (Table 2) extracted from the complete set of features.
In Table 3, we list all features that were selected in at least 50% of runs in our experiments with LASSO on the dataset with numeric and logic features, while Table 4 lists the features for the dataset with numeric, binary, and logic features. Frequency (freq) shows in how many experiment runs each feature appeared in the final set of features. In the case of results from a much wider set of features (Table 4), we can see a higher variance in the selection by the final prediction models. Four logic features (L2, L3, L4, L5) can be found among the variables that were selected in at least 50% of evaluation runs.

Performance Evaluation
In Figure 1, we summarize AUC and a selected number of features for all three datasets: no_logic (numeric and binary features), all_logic (numeric, binary, and logic features), and num_logic (numeric and logic features).

We can observe a slowly increasing average AUC from 0.816 (Standard Deviation (SD) = 0.03) in no_logic to 0.819 (SD = 0.03) in all_logic and finally 0.829 (SD = 0.03) in the num_logic dataset. Looking at the average number of selected features and its variation, we can observe that it slowly increases from 21.7 (SD = 11.18) in no_logic to 23.7 (SD = 10.09) in all_logic, but it then almost halves to 13.35 (SD = 0.63) in the num_logic dataset. The SD increases over the first two cases but then drops sharply to below 1 (SD = 0.63), which means that in 92 of the 100 repetitions either 13 or 14 features were selected in the num_logic dataset. This indicates very stable final prediction models for num_logic-based solutions compared to no_logic or all_logic.

Discussion and Conclusions
In this paper, we compared three dimensionality reduction approaches to improve the interpretability of undiagnosed T2DM prediction models (note that the calibration of the prediction models was out of the scope of this paper and presents a limitation). A simple LASSO regression approach is compared to two variants in which a pre-selection of predictive features is conducted on the training set using LogicRegression to simplify the final set of features obtained by LASSO regression. In the first variant, we kept all original features plus the added logic features, while in the second variant we kept only the numeric and logic features.
Results showed that the logic features resulted in simpler models with a smaller number of features, which are potentially easier for healthcare experts to interpret. This is especially important in the field of personalized medicine. The measured AUC was similar to that of more complex models in which all features were included. It should be noted that although our method resulted in a smaller number of features, some of the logic features may not be straightforward to interpret (e.g., feature L3 in this paper). To address this issue, we plan to include an interactive system in future work, where the user would specify the maximum number of original features included in generated logic features in cases where the final model would otherwise include many complex logic features. For now, when some of the final features are hard to interpret, we recommend adjusting the LogicRegression settings that control the complexity of the final logic features until satisfactory results are achieved.
When healthcare professionals and patients know which features are important in producing the outcome of a prediction model and how they can be combined, it helps them understand, and increases their trust in, decision-making systems [10]. With greater model interpretability, we can better understand and explain the prediction to end-users and improve data-driven decision support for health professionals [14]. More complex models, such as deep neural networks [15], allow high accuracy but are difficult to explain, while simple models (e.g., decision trees) are less accurate but allow for more straightforward explanations [16]. Sophisticated machine learning models therefore usually offer better performance than traditional simple models but are difficult for health professionals to understand. However, in many cases simple models also provide good classification performance that is not significantly different from more complex models [17]. Our results confirm this hypothesis. Comprehensible models are known for their contribution to higher trust in prediction models from end-users in healthcare.
Interpretability techniques are often categorized according to the stage at which they are applied in developing the machine learning model [14]. Pre-model approaches are independent of the model and may be employed prior to choosing which model to use. The approach presented in this study belongs to this group of interpretability approaches, along with techniques such as Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and some clustering techniques. While Molnar [18] classifies PCA, t-SNE, and clustering methods as interpretable methods, it is worth noting that attributes transformed using PCA, embeddings, or clusters cannot provide a comprehensible medical interpretation; they can, however, be used to visualize the results and highlight patterns of interest from an interpretability standpoint. The proposed approach is much more interpretable, despite the possibly complex combinations of features that might occur as a result of LogicRegression.
During the experiments, we also observed unstable behavior of LogicRegression, where different logic features were selected with each run of the cross-validation. Although this did not influence the average number of selected features, it resulted in instability of the interpretability of the model. Another limitation is the combinations of features used in the extracted features. For example, the first extracted feature (L1) suggested that checking whether a person did not experience elevated blood sugar in the past should be accompanied by checking for sulfadiazine and trimethoprim use in the last 4 months; this extracted feature acts as a protective factor, as seen in Table 3. We see this as a disadvantage of LogicRegression, since different conclusions can be made based on the selected features. This could be resolved to some extent by using exhaustive search methods to extract logic features, but the resulting extremely long running times present another drawback, especially in cases where personalized models would be built. To personalize the solution even further, it would be worth exploring the development of a prediction model for each specific patient at the time of the examination, using a subset of the data in which patients similar to the examined patient would be assigned a higher weight than other patients (the boosting principle).
Although our work is in the field of healthcare, we believe that our results can also be applied in other emerging fields of applied prediction modeling where interpretability of results is important, such as security [19] or ecology [20]. In future work, we will explore the effectiveness of our methods in the broader field of security, specifically to help understand how misinformation (e.g., intentionally misleading information) is being spread.