Training of Machine Learning Models for Recurrence Prediction in Patients with Respiratory Pathologies †

Abstract: Information extracted from electronic health records (EHRs) is used for predictive tasks and clinical pattern recognition, and machine learning techniques allow knowledge to be extracted from EHRs. This study continues previous work in which EHRs were exploited to make predictions about patients with respiratory diseases. Here, we attempt to predict recurrence in patients with respiratory diseases using four different machine learning algorithms.


Introduction
The electronic health record (EHR) is an electronic version of a patient's medical history that includes demographic, clinical and administrative data [1,2]. The EHR was created to improve the efficiency of health systems; however, it also has several applications in clinical informatics and epidemiology. Specifically, EHRs have been used for patient clustering, disease prediction and pattern recognition [3].
The analysis of clinical data associated with EHRs is based on statistical and artificial intelligence (AI) procedures. Recently, machine learning and deep learning algorithms have been successfully used to extract informative and useful patterns from EHRs [4].
The present study is a continuation of previous work [5] in which EHRs were exploited to make predictions about patients with respiratory diseases. In this project, we propose the use of machine learning to predict recurrence in patients with respiratory diseases in less than 6, 12 or 18 months (depending on the diagnosis). For this task, four machine learning algorithms were used: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbors (kNN) and decision trees.

Data Set Description
Anonymized patient data were extracted from the San Rafael Hospital database. Records range from January 2000 to January 2020. The data set consisted of 996 records and 40 variables. A total of 47.19% of patients suffered a relapse in less than six months, whilst 52.81% did not relapse in that period.

Machine Learning Algorithms
Linear Discriminant Analysis
Linear discriminant analysis (LDA) is generally used to classify patterns between two classes [6]. LDA models the differences among samples assigned to certain groups by finding the projection that maximizes the ratio of the between-group variance to the within-group variance.
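The variance-ratio criterion can be illustrated with a short NumPy sketch for the two-class case. This is only an illustration under stated assumptions: the data are synthetic (not the hospital records), the function names `fisher_direction` and `fisher_ratio` are hypothetical, and the paper does not state which LDA implementation was used.

```python
import numpy as np

def fisher_direction(X, y):
    """Unit vector maximizing the between/within-group variance ratio (two classes)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-group scatter: sum of the per-class scatter matrices.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # Fisher's solution: w is proportional to Sw^{-1} (mu1 - mu0).
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

def fisher_ratio(X, y, w):
    """Between-group variance divided by within-group variance along direction w."""
    z0, z1 = X[y == 0] @ w, X[y == 1] @ w
    between = (z1.mean() - z0.mean()) ** 2
    within = z0.var() * len(z0) + z1.var() * len(z1)
    return between / within

rng = np.random.default_rng(0)
# Synthetic two-class data: the classes differ only along the first feature.
X = np.vstack([rng.normal([0, 0], [1.0, 3.0], size=(200, 2)),
               rng.normal([2, 0], [1.0, 3.0], size=(200, 2))])
y = np.repeat([0, 1], 200)

w = fisher_direction(X, y)
ratio_w = fisher_ratio(X, y, w)                           # ratio along the LDA direction
ratio_other = fisher_ratio(X, y, np.array([0.0, 1.0]))    # ratio along the second feature
print(w, ratio_w, ratio_other)
```

Projecting the samples onto `w` and thresholding the projected value yields a linear classifier; any other direction gives a ratio no larger than `ratio_w`.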

Quadratic Discriminant Analysis
Quadratic discriminant analysis (QDA) is used when the individual classes are known to have distinct covariances. In this method, a separate covariance matrix is estimated for each class of observations.
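As a minimal sketch of the per-class covariance idea, the following NumPy code fits a Gaussian QDA model and classifies two points. The data are synthetic, the helper names (`fit_qda`, `qda_score`, `predict_qda`) are hypothetical, and the standard Gaussian discriminant score is assumed; the study's actual implementation is not specified.

```python
import numpy as np

def fit_qda(X, y):
    """Estimate a prior, a mean and a covariance matrix separately for each class."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[int(k)] = {
            "prior": len(Xk) / len(X),
            "mean": Xk.mean(axis=0),
            "cov": np.cov(Xk, rowvar=False),  # class-specific covariance
        }
    return params

def qda_score(x, p):
    """Gaussian quadratic discriminant score of sample x for one class."""
    diff = x - p["mean"]
    _, logdet = np.linalg.slogdet(p["cov"])
    maha = diff @ np.linalg.solve(p["cov"], diff)  # squared Mahalanobis distance
    return -0.5 * logdet - 0.5 * maha + np.log(p["prior"])

def predict_qda(x, params):
    """Assign x to the class with the highest discriminant score."""
    return max(params, key=lambda k: qda_score(x, params[k]))

rng = np.random.default_rng(0)
# Synthetic data: both classes share a mean but have very different spreads,
# a case a linear boundary cannot separate.
X = np.vstack([rng.normal(0, 0.5, size=(300, 2)),
               rng.normal(0, 3.0, size=(300, 2))])
y = np.repeat([0, 1], 300)

params = fit_qda(X, y)
near = predict_qda(np.array([0.0, 0.0]), params)    # close to the shared mean
far = predict_qda(np.array([10.0, 10.0]), params)   # far from the shared mean
print(near, far)
```

Because each class keeps its own covariance matrix, the decision boundary is quadratic: points near the shared mean go to the tightly clustered class and distant points to the widely spread one.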

K-Nearest Neighbors
The k-nearest neighbor classifier (k-NNC) assumes that samples with similar features lie close together in feature space. To classify a test sample, the classifier finds its k nearest neighbors in the training data and assigns the class that is most common among them.
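The neighbor-voting idea can be sketched in a few lines of plain Python. The feature values and the choice of k = 3 below are hypothetical and serve only as an illustration; Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to x.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set (not taken from the study).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["no relapse", "no relapse", "no relapse",
           "relapse", "relapse", "relapse"]

label_a = knn_predict(train_X, train_y, (1.1, 1.0))
label_b = knn_predict(train_X, train_y, (3.9, 4.1))
print(label_a, label_b)
```

In practice the features would usually be scaled first, since distances are dominated by variables with large numeric ranges.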

Decision Trees
Decision trees (DTs) are used for classification and regression. A DT predicts the value of a target variable by learning simple decision rules inferred from the data features.

Results and Discussion
Figure 1 shows the results obtained for the four models. Accuracy is expressed as the ratio of correctly predicted observations to total observations; sensitivity as the ratio of true positives to actual positives; and specificity as the ratio of true negatives to actual negatives in the data. The overall accuracy for the four models is 60%; however, accuracy must exceed 80% to be considered good.
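The three metrics reported in Figure 1 can be computed from the counts of true and false positives and negatives. The following is a minimal sketch in plain Python; the label vectors are hypothetical and are not results from this study, and relapse is assumed to be the positive class (coded as 1).

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),   # correct predictions / all observations
        "sensitivity": tp / (tp + fn),        # true positives / actual positives
        "specificity": tn / (tn + fp),        # true negatives / actual negatives
    }

# Hypothetical labels: 1 = relapse, 0 = no relapse.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example the specificity is higher than the sensitivity, which is the pattern of a model that identifies non-relapses more reliably than relapses.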
The differences between sensitivity and specificity indicate that these models perform better at predicting non-relapses than relapses. As expected, the accuracies obtained in this study were lower than would be desirable. The data set used did not contain input and output parameters specific to the diagnosis of a particular disease. Clinical records from San Rafael included information about diagnoses, procedures and the health system, but did not include the parameters needed to diagnose a respiratory disease. To make better predictions, data sets need to include more useful information, such as whether the patient is a smoker, air quality or physical activity. The use of machine learning for health predictions is growing in popularity, although some challenges lie ahead.