Electronic Health Records Exploitation Using Artificial Intelligence Techniques

Abstract: The exploitation of electronic health records (EHRs) has multiple uses, from predictive tasks and clinical decision support to pattern recognition. Artificial Intelligence (AI) makes it possible to extract knowledge from EHR data in a practical way. In this study, we aim to build a Machine Learning model from EHR data to make predictions about patients. Specifically, we focus our analysis on patients suffering from respiratory problems and try to predict whether those patients will relapse in less than 6, 12 or 18 months. The main objective is to identify the characteristics that appear to increase the risk of relapse. In parallel, we propose an exploratory analysis in search of hidden patterns in the data. These patterns will help us classify patients according to their specific conditions for some clinical variables.


Introduction
The electronic health record (EHR) is the digital version of a patient's medical history. In an EHR system, data is stored in a collection of tables where each record corresponds to one healthcare episode of a patient. EHRs constitute a rich source of information, including demographic data (age, gender, address, etc.), administrative data and a wide range of clinical information (clinical notes, diagnoses, procedures and treatments, lab tests, medical imaging, etc.) [1][2][3]. The knowledge extracted from EHRs can be used in clinical decision support, epidemiological and predictive tasks, population care improvement and pattern recognition [2,4]. For this reason, the exploitation of EHRs has attracted growing interest from researchers in recent years [5,6]. Nevertheless, EHRs have some characteristics that make this goal hard to achieve. Heterogeneity, noise, incompleteness, redundancy and the inconsistent representation of data are some of the challenges to cope with. In this context, exploratory analysis and preprocessing steps play a fundamental role [7,8].
Artificial Intelligence (AI) has become a key tool for EHR exploitation. Machine Learning and Deep Learning have been successfully used to identify new risk factors, patterns and medical associations [6,9]. In addition, recent studies show that these techniques can outperform traditional methods in predictive tasks [9][10][11].
In this project, we propose the use of AI to exploit and extract value from EHR data. More concretely, we focus our study on the analysis of relapse rates in patients suffering from the most prevalent diagnoses in our data set. We define a relapse as the return of a disease some time after its apparent resolution. We will build a Machine Learning model to predict whether a patient will have a recurrence in less than 6, 12 or 18 months (depending on the diagnosis). This model will allow us to identify the characteristics that appear to increase the risk of relapse in those patients. In parallel, we will carry out an exploratory analysis in search of hidden patterns in the data. We expect the results to help us classify patients according to their specific conditions.
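The relapse target described above could be derived from episode records roughly as follows. This is only a minimal sketch under stated assumptions: the tuple layout (patient identifier, diagnosis code, episode date), the function names and the toy records are hypothetical, and a relapse is approximated as a later episode of the same patient with the same diagnosis code, which ignores the "apparent resolution" part of the definition.

```python
from datetime import date


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the last month is not yet complete
    return months


def label_relapses(episodes, window_months):
    """Label each episode with whether a relapse follows within the window.

    `episodes` is an iterable of (patient_id, diagnosis_code, episode_date)
    tuples; this layout is an assumption, not the hospital's actual schema.
    Returns a list of (patient_id, diagnosis_code, episode_date, relapsed).
    """
    # Group episode dates by patient and diagnosis.
    dates_by_key = {}
    for patient_id, code, day in episodes:
        dates_by_key.setdefault((patient_id, code), []).append(day)

    labelled = []
    for (patient_id, code), days in dates_by_key.items():
        days.sort()
        for current, following in zip(days, days[1:] + [None]):
            # Simplification: the last known episode is labelled False, although
            # it may really be censored by the end of the observation period.
            relapsed = (
                following is not None
                and months_between(current, following) < window_months
            )
            labelled.append((patient_id, code, current, relapsed))
    return labelled


# Toy records; "J44" is used only as an illustrative diagnosis code.
episodes = [
    ("p1", "J44", date(2015, 1, 10)),
    ("p1", "J44", date(2015, 4, 2)),
    ("p1", "J44", date(2017, 4, 2)),
    ("p2", "J44", date(2016, 3, 1)),
]
for row in label_relapses(episodes, window_months=6):
    print(row)
```

Calling the function once per window (6, 12 or 18 months) would give one binary target per horizon. How the window is assigned to each diagnosis is not specified in the text, so it is left as a parameter here.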

Data Set Description
Anonymized patient data were extracted from the San Rafael Hospital database. Records range from January 2000 to January 2020. Main diagnoses and procedures are encoded in both ICD-9 and ICD-10, so the data is divided into two coding sets. The ICD-9 set consists of 156,362 records and 89,211 patients. The ICD-10 set consists of 32,069 records and 25,013 patients. More information about the sets is given in Table 1. Demographic and clinical features act as predictive variables.

Table 1. Numeric description of ICD-9 and ICD-10 sets.

Present Work
Currently, the study is in the preprocessing phase. The most frequent diagnoses have been identified through a descriptive study of the data set. Table 2 shows these main diagnoses and their associated counts. Among the most prevalent diagnoses, we have selected those related to respiratory problems. We discarded the traumatology and varicose vein diagnoses because they were not considered relevant to this specific research.
After selecting the ICD-9 and ICD-10 codes of interest, the two sets will be unified in order to obtain a larger and more complete data set. Records with null or missing values will be removed to ensure data quality. In addition, the Machine Learning models will be defined. Once we obtain a clean data set, the next steps will allow us to identify the most explanatory predictive variables for the chosen diagnoses. For this task, we will apply Principal Component Analysis (PCA) [12].
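A descriptive count of this kind, followed by a respiratory filter, could look like the sketch below. It is illustrative only: the function names and toy codes are hypothetical, and selecting respiratory diagnoses by ICD chapter (codes 460 to 519 in ICD-9, codes starting with J in ICD-10) is an assumption about how the selection might be automated, not the procedure actually followed in the study.

```python
from collections import Counter


def is_respiratory(code: str, icd_version: int) -> bool:
    """True if `code` belongs to the respiratory chapter of ICD-9 or ICD-10."""
    code = code.strip().upper()
    if icd_version == 10:
        # ICD-10 groups diseases of the respiratory system under J00-J99.
        return code.startswith("J")
    if icd_version == 9:
        # ICD-9 groups them under 460-519; V and E codes are not numeric.
        head = code[:3]
        return head.isdigit() and 460 <= int(head) <= 519
    raise ValueError("icd_version must be 9 or 10")


def frequent_respiratory(codes, icd_version, top_n=10):
    """Keep the respiratory codes among the `top_n` most frequent diagnoses."""
    counts = Counter(code.strip().upper() for code in codes)
    return [
        (code, n)
        for code, n in counts.most_common(top_n)
        if is_respiratory(code, icd_version)
    ]


# Toy main-diagnosis column of an ICD-9 set (one code per record).
icd9_codes = ["486", "486", "486", "454.9", "454.9", "820.8", "491.21"]
print(frequent_respiratory(icd9_codes, icd_version=9, top_n=3))
```

In the toy data, the varicose vein code (454.9) and the fracture code (820.8) rank among the most frequent diagnoses but are dropped by the filter, which mirrors the selection described above.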
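To make the cleaning and PCA steps concrete, the following dependency-free sketch drops incomplete rows, standardizes the remaining features and extracts the first principal component by power iteration. The feature names and values are hypothetical, and the hand-written routine is only meant to expose the computation; in practice a library implementation such as scikit-learn's PCA would probably be preferred.

```python
import math


def complete_cases(rows):
    """Drop every row that contains a missing value (None)."""
    return [row for row in rows if all(value is not None for value in row)]


def standardize(rows):
    """Centre each column and scale it to unit variance.

    Assumes no column is constant, otherwise the division below fails.
    """
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Covariance matrix of rows that are already centred."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / n for j in range(d)]
        for i in range(d)
    ]


def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of `cov` by power iteration."""
    # Non-uniform start to reduce the chance of starting orthogonal
    # to the leading eigenvector.
    vector = [float(i + 1) for i in range(len(cov))]
    for _ in range(iterations):
        vector = matvec(cov, vector)
        norm = math.sqrt(sum(v * v for v in vector))
        vector = [v / norm for v in vector]
    eigenvalue = sum(v * w for v, w in zip(vector, matvec(cov, vector)))
    return vector, eigenvalue


# Hypothetical features: age, prior admissions, length of stay in days.
rows = [
    [70, 4, 9],
    [65, 3, 7],
    [None, 1, 2],  # incomplete record, removed before the analysis
    [50, 1, 3],
    [80, 5, 11],
    [45, 0, 2],
]
clean = complete_cases(rows)
cov = covariance(standardize(clean))
loadings, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print("loadings:", loadings)
print("explained variance ratio:", explained)
```

The loadings indicate how strongly each original variable contributes to the component, and the explained variance ratio indicates how much of the total variability that component captures. Further components could be obtained by deflating the covariance matrix and repeating the iteration.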

Conflicts of Interest:
The authors declare no conflict of interest.