Evaluation of Structured, Semi-Structured, and Free-Text Electronic Health Record Data to Classify Hepatitis C Virus (HCV) Infection

Abstract: Evaluation of the United States Centers for Disease Control and Prevention (CDC)-defined HCV-related risk factors is not consistently performed as part of routine care, rendering risk-based testing susceptible to clinician bias and missed diagnoses. This work uses natural language processing (NLP) and machine learning to identify patients who are at high risk for HCV infection. Models were developed and validated to predict patients with newly identified HCV infection (detectable RNA or reported HCV diagnosis). We evaluated models with three types of variables: structured (structured-based model), semi-structured and free-text notes (text-based model), and all variables (full-set model). We applied each model to three stratifications of data: patients with no history of HCV prior to 2020, patients with a history of HCV prior to 2020, and all patients. We used XGBoost and ten-fold cross-validation of the C-statistic to evaluate the generalizability of the models. There were 3564 unique patients, 487 with HCV infection. The average C-statistics on the structured-based, text-based, and full-set models for all the patients were 0.777 (95% CI: 0.744–0.810), 0.677 (95% CI: 0.631–0.723), and 0.774 (95% CI: 0.735–0.813), respectively. For patients with no history of HCV prior to 2020, the full-set model performed slightly better than the structured-based model and similarly to the text-based model, with average C-statistics of 0.780, 0.774, and 0.759, respectively. NLP was able to identify six more risk factors inconsistently coded in structured elements: incarceration, needlestick, substance use or abuse, sexually transmitted infections, piercings, and tattoos. The availability of model options (structured-based or text-based models) with similar performance can provide deployment flexibility in situations where data is limited.


Introduction
Identifying targeted, innovative, and effective methods to screen and treat people living with HCV will contribute towards diagnosing the millions who are not yet identified, and realizing the World Health Organization goal of eliminating HCV by 2030 [1]. The disparate resources and limited regional and community infrastructure, coupled with an often marginalized and socio-economically unstable patient population, result in the need to develop multi-faceted and tailored approaches to elimination.
The US rate of new HCV cases tripled from 2005 to 2015 [2], primarily due to the rise in opioid use, with resulting calls to broaden HCV testing and treatment [3]. The World Health Organization has called for the global elimination of HCV (WHO elimination) [1], though to meet this goal, persons at risk of infection need to get tested and, if diagnosed, treated. Traditional risk-based screening alone has failed to capture testing in important populations, making testing less effective [4,5]. The American Association for the Study of Liver Diseases and Infectious Diseases Society of America (AASLD/IDSA), the United States Preventive Services Task Force (USPSTF) and the Centers for Disease Control and Prevention (CDC) expanded their recommendations in 2020 to include the HCV screening of all pregnant women and the one-time universal HCV screening in all persons over 18 and repeat testing in those at-risk (current or historic injection drug users, intranasal drug use; HIV diagnosis; long-term hemodialysis; received clotting factor produced prior to 1987; abnormal alanine transaminase (ALT); solid organ donor; prior recipient of a transfusion or organ transplant; exposure to needle sticks, sharps; mucosal exposure to HCV+ blood; ever incarcerated; child born to HCV+ mother; and sexually active persons starting pre-exposure prophylaxis) [6,7]. Previously the recommendation included persons born between 1945 and 1965 and risk-based testing.
There is increasing research in mining electronic health record (EHR) data to support the identification and treatment of HCV patients. Some studies have demonstrated the utility of rule-based logic for querying EHR data to identify diagnosed but untreated HCV-RNA detectable patients; the output of these algorithms was used by patient navigators to contact potential candidates for care coordination programs [8,9]. Others have evaluated various machine learning (ML) algorithms and techniques to predict HCV from an Egyptian patient database using structured elements, i.e., those that are stored in the EHR as discrete elements that map to a unique set of variables [10].
Structured elements are limited to what gets coded; however, semi-structured and free-text elements have the advantages of potentially capturing more subtle concepts in the provider notes and narratives. Natural language processing (NLP) is the research and application of computational techniques to extract, model, and analyze written or spoken language [11,12]. NLP and text mining techniques leverage the power of computers to process and make sense of large amounts of text that may otherwise not be input as structured data. Researchers have applied NLP to the analysis of clinical documentation, such as discharge summaries and radiology interpretations, to improve care and workflow processes [13]. NLP can extract concepts as variables that are then used in the development of machine learning models [14,15]. With the increased use of electronic health records, there has been a growing corpus of text information in healthcare that is ripe for NLP and ML.
Despite efforts to improve targeted risk-based testing, evaluation of the CDC-defined HCV-related risk factors is not consistently performed as part of routine care, rendering risk-based testing susceptible to clinician bias and missed diagnoses. Although one-time universal testing has been recommended since 2020 to capture missed diagnoses [3], persons who are at high risk for HCV infection need repeat testing with continued risk, as well as re-testing after treatment, especially if we are to reach HCV elimination by 2030 [1]. One such approach should be micro-elimination strategies at larger health-care systems, which can then be widely disseminated to other organizations.
This work aims to develop an ML model to identify patients at high risk for newly identified HCV infections and those who are not easily identified and repeatedly tested by providers. In this study, we use both ML and NLP to predict patients with HCV infections using EHR data. The contributions of this study are two-fold. First, we evaluate and compare the utility of structured, semi-structured, and free-text EHR data to predict patients with HCV infections. Second, we compare prediction models for patients with and without a prior history of HCV.

Data Source and Study Design
Our study uses data from January 2020 to October 2021 from a ten-hospital academic health system in the mid-Atlantic region of the United States. Models were developed and validated to predict patients with HCV infection (defined as having a detectable RNA or reported HCV diagnosis) amongst patients who had either an antibody (ATB) or RNA test during the study dates. Variables extracted from patient records include year of birth, sex (female or male), race (Black, White, or Other), and ethnicity (Hispanic or Not Hispanic), as well as clinical notes, diagnoses, and medications (specifically, prescribed opioids). Patient records were excluded if they were missing sex, race, ethnicity, ICD10, or notes data. Patients were further stratified by whether they had a history of HCV prior to 2020, as defined by ICD10 codes (B17, B18, B19) or mentions in the notes (i.e., history of chronic hepatitis, established care for chronic hepatitis C). This study was approved by the MedStar Health Research Institute Institutional Review Board.
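The history stratification described above can be sketched as a small rule. This is a minimal illustration rather than the study's actual extraction logic: the function name, the phrase pattern, and the assumption that the caller passes only codes and notes dated before 2020 are all hypothetical; only the ICD10 categories (B17, B18, B19) and the two example phrases come from the text.

```python
import re

# ICD10 categories named in the text as indicating prior HCV history.
HCV_HISTORY_PREFIXES = ("B17", "B18", "B19")

# Hypothetical phrase pattern built from the two examples given in the text.
HISTORY_PHRASES = re.compile(
    r"history of chronic hepatitis|established care for chronic hepatitis c",
    re.IGNORECASE,
)


def has_prior_hcv_history(icd10_codes, notes):
    """Return True if any supplied code or note suggests prior HCV history.

    Assumes the caller has already restricted both inputs to records
    dated before 2020.
    """
    for code in icd10_codes:
        # Normalize "b18.2" -> "B182" so prefix matching ignores case and dots.
        if code.upper().replace(".", "").startswith(HCV_HISTORY_PREFIXES):
            return True
    return any(HISTORY_PHRASES.search(note) for note in notes)
```

A patient with code B18.2, or with a note mentioning established care for chronic hepatitis C, would fall in the "history of HCV prior to 2020" stratum under this sketch.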

Variables
We extracted data from 58 variables (including ICD10 codes) as potential predictors of patients with HCV infection (detectable RNA or reported HCV infection) from structured, semi-structured, and free-text EHR data elements. These variables were identified based on published literature and clinical expertise [6,7]; 46 variables were structured diagnosis codes, one variable was a structured medication code, four were demographic variables, and seven variables were extracted from the semi-structured and free-text data elements (Table 1). The variables determined by clinical experts [DF, AV, and BH] and chart reviews as likely to be captured in clinical notes were identified for NLP modeling. These semi-structured and free-text data elements captured a patient history of substance abuse or use, incarceration, piercings, tattoos, transfusions, needlestick injuries, or sexually transmitted infections [6,7]. The presence of a variable, excluding demographic variables, was coded as 1 and the absence of a variable as 0. Student's t-test and the Chi-square test for independence were used to evaluate the significance of continuous and categorical variables, respectively.
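The presence/absence coding can be illustrated with a short sketch. The feature names and the two-entry mapping below are hypothetical stand-ins (the study used 46 diagnosis variables whose full definitions are in Table 1 and Table S1); the two ICD10 codes shown, Z65.1 and W46, are the ones the Results section associates with incarceration and needlestick injury.

```python
# Hypothetical feature-to-prefix map; only two of the study's diagnosis
# variables are shown, using ICD10 codes mentioned elsewhere in the paper.
FEATURE_PREFIXES = {
    "ICD_Incarceration": ("Z651",),
    "ICD_Needlestick": ("W46",),
}


def encode_presence(patient_codes):
    """Code each variable as 1 if any matching ICD10 code is present, else 0."""
    # Normalize codes so "z65.1" and "Z651" are treated the same.
    normalized = [code.upper().replace(".", "") for code in patient_codes]
    return {
        name: int(any(code.startswith(prefixes) for code in normalized))
        for name, prefixes in FEATURE_PREFIXES.items()
    }
```

Each patient then contributes one row of 0/1 values, alongside the demographic variables, to the model's input matrix.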

Semi-structured and free-text data elements (7)
Variable: TEXT_Substance
Description: Patient with history of substance abuse or use, including methamphetamine, cocaine, heroin *look up slang*; excludes tobacco, nicotine, and marijuana
Included example: "female with history of COPD, depression, substance abuse"
Excluded example: "Substance Use: Denies"

Structured Data Elements
Structured data elements are stored in the EHR as discrete elements that map to a unique set of codes or specific quantity values. For this analysis, we used diagnosis codes and medication lists as structured elements. Race, sex, age, and ethnicity are also considered structured data elements.

Semi-Structured Data Elements
Semi-structured data elements are entered by the user through a checkbox or drop-down but are captured in the EHR as a string. At times, these fields are less strictly defined than structured elements and may also let the user enter additional information. Because these data are converted to strings and are inconsistently captured over time, text mining approaches are necessary to extract concepts from them. Semi-structured data elements used in this analysis include concepts extracted from personal histories, for example, drug use, tobacco use, and social history.

Free-Text Data Elements
Free-text data elements correspond to clinician and other provider narratives and notes. These include narratives associated with a patient's history, a provider's note assessment, and reasons for the visit that are not ICD10-coded. Natural language processing (NLP) algorithms are used to extract concepts from these elements, accounting for common spelling mistakes (e.g., heroine, opiod).
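As a rough illustration of this kind of concept extraction, the sketch below flags a substance-use mention while tolerating one misspelling and skipping simple negations. It is not the study's NLP pipeline: the term list, the negation cues, and the function name are assumptions chosen to reproduce the included and excluded examples listed for the TEXT_Substance variable.

```python
import re

# Illustrative term list; "heroine" is a misspelling mentioned in the text.
SUBSTANCE_TERMS = re.compile(
    r"\b(substance (?:abuse|use)|heroine?|cocaine|methamphetamine)\b",
    re.IGNORECASE,
)
# Assumed negation cues directly after the term, e.g. "Substance Use: Denies".
NEGATION_AFTER = re.compile(r"^\W*(denies|none|never|no)\b", re.IGNORECASE)
# Assumed negation cues directly before the term, e.g. "denies cocaine".
NEGATION_BEFORE = re.compile(
    r"\b(denies|no|without)\s+(?:any\s+)?$", re.IGNORECASE
)


def text_substance(note):
    """Return 1 if the note has a non-negated substance mention, else 0."""
    for match in SUBSTANCE_TERMS.finditer(note):
        before = note[:match.start()]
        after = note[match.end():]
        if NEGATION_AFTER.search(after) or NEGATION_BEFORE.search(before):
            continue  # negated mention; keep looking for another one
        return 1
    return 0
```

Tobacco, nicotine, and marijuana are excluded here simply by leaving them out of the term list; a production extractor would need a far richer vocabulary and negation handling.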

Model Development and Validation
We developed and tested models to predict patients with HCV infection. The models were developed and validated following the guidelines for Transparent Reporting of a multivariable Prediction model of Individual Prognosis or Diagnosis (TRIPOD) [16]. A similar approach was previously used to develop and validate an EHR-based HIV-risk prediction tool to identify potential PrEP candidates, from which we leveraged relevant ICD10 codes for high-risk sexual behaviors and substance use disorders as well as the modeling approach and reporting metrics in our study [17]. We evaluated models with three types of variables: models with structured variables (structured-based model), models with semi-structured and free-text notes variables (text-based model), and models that include all variables (full-set model) (Table 1). Each model included four demographic variables (age, sex, race, and ethnicity). As a result, the structured-based models included 51 variables, the text-based models had 11 variables, and the full-set models had 58 variables. We applied each model to three stratifications of data: patients with no history of HCV prior to 2020, patients with a history of treated HCV prior to 2020, and all patients. In total, we had nine test conditions.
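The variable counts and the nine test conditions follow directly from the numbers given in the Variables section. The short sketch below reproduces that arithmetic; the constant and dictionary names are illustrative only.

```python
from itertools import product

# Variable counts reported in the Variables section.
N_DIAGNOSIS, N_MEDICATION, N_DEMOGRAPHIC, N_TEXT = 46, 1, 4, 7

# Every model type includes the four demographic variables.
variable_sets = {
    "structured-based": N_DIAGNOSIS + N_MEDICATION + N_DEMOGRAPHIC,
    "text-based": N_TEXT + N_DEMOGRAPHIC,
    "full-set": N_DIAGNOSIS + N_MEDICATION + N_TEXT + N_DEMOGRAPHIC,
}

strata = [
    "no history of HCV prior to 2020",
    "history of HCV prior to 2020",
    "all patients",
]

# Three model types crossed with three patient stratifications.
conditions = list(product(variable_sets, strata))
```

Here `variable_sets` evaluates to 51, 11, and 58 variables, and `conditions` holds the nine model-by-stratum pairs.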
We used XGBoost, a decision tree-based boosting ensemble machine learning algorithm, to predict patients with newly identified HCV infection [18]. In a boosting algorithm, many weak learners are trained sequentially, each to correctly predict the observations incorrectly classified in previous training rounds. XGBoost uses a shallow tree as a weak learner and performs well on class-imbalanced classification data [19,20]. We evaluated the models using the C-statistic, which is the area under the receiver operating characteristic curve and represents the probability that a randomly drawn HCV-infected patient is ranked as higher risk by the model than a randomly drawn patient without HCV infection. Patients identified with HCV infections could have had either a history of treated HCV prior to 2020 or no history of HCV prior to 2020; these are two distinct groups of patient data. We consider these groupings in the three stratifications of modeling: patients with no history of HCV prior to 2020, patients with a history of treated HCV prior to 2020, and all patients. We used ten-fold cross-validation to evaluate the generalizability of the models and minimize overfitting. We present the mean and the 95% confidence interval (CI) of the C-statistic results from the cross-validation and use the Student's t-test to compare performance. Lastly, we calculated the importance of a variable using the average information gain of the variable across all model decision trees. Information gain is the difference in information entropy after a variable split. A higher gain implies more importance. This metric provides a way to look at the additional value that a variable adds to the overall model [21,22]. All analyses, including text extraction, XGBoost modeling, and statistical calculations, were carried out with Python 3.9.
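To make the evaluation metric concrete, the following sketch computes the C-statistic from its pairwise definition and summarizes per-fold values as a mean with a 95% CI. It uses only the Python standard library and does not reproduce the XGBoost training itself. The t-based interval, with a critical value of 2.262 for nine degrees of freedom, is an assumption; the paper does not state how its intervals were derived.

```python
import math
import statistics


def c_statistic(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half. This pairwise form equals the area under the
    ROC curve; it is O(n_pos * n_neg), which is fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_ci95(fold_values, t_crit=2.262):
    """Mean and t-based 95% CI; t_crit=2.262 assumes ten folds (9 d.f.)."""
    mean = statistics.mean(fold_values)
    half = t_crit * statistics.stdev(fold_values) / math.sqrt(len(fold_values))
    return mean, (mean - half, mean + half)
```

In a ten-fold setup, `c_statistic` would be applied to the held-out predictions of each fold and the ten resulting values passed to `mean_ci95`.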

Results
There were 3889 unique patients in the data set. Due to missing demographic information, 325 patients (8.4%) were excluded. The remaining 3564 unique patients were used in the analysis. There were 487 patients (13.7%) who had a newly identified HCV infection and 3077 patients (86.3%) who did not have HCV infection (Table 2). Male patients made up a larger proportion of the newly identified HCV infections than female patients (p < 0.001). In addition, Black patients made up a larger proportion of newly identified HCV infections compared with White and other race patients (p < 0.001). The mean age of patients with newly identified HCV infections was older than that of patients who did not have HCV infection (p < 0.001). Lastly, the majority of patients (90.2%) did not have any history of HCV infection prior to 2020.

Model Results
The average C-statistics on the structured-based, text-based, and full-set models for all the patients were 0.777 (95% CI: 0.744-0.810), 0.677 (95% CI: 0.631-0.723), and 0.774 (95% CI: 0.735-0.813), respectively. C-statistics are summarized by patient stratification and model variable types in Table 3 and Figure 1. There were no statistically significant differences between the structured-based models and the full-set models. For patients with no history of HCV prior to 2020, there were no significant differences between the three model variable types. For patients with a history of HCV prior to 2020, the text-based model performed worse than both the structured-based and the full-set models; there was no statistically significant difference between the structured-based and full-set models. Text-based models were statistically significantly (p-value < 0.001) better for patients with no history compared to patients with a history of HCV prior to 2020.

Prevalence of Substance and Opioid Use Amongst Predictors
The most important predictors, as defined by the average gain, in patients with no prior history of HCV infection (Table 4; n = 3215, middle columns) were opioid related disorders, end stage renal disease, psychoactive substance dependence, and cocaine related disorders. The most important predictors in patients with a previously noted history of HCV (Table 4, n = 349) were transplanted organ and tissue status, transfusions, liver disease, and end stage renal disease. The full list of predictors is summarized in Table S1. Oral opioid medication use was prevalent in patients with newly identified HCV infections. The ratio of newly identified HCV infections to no infections amongst patients on opioid medications (0.35) was double the ratio across all patients (0.16). While a ratio describes the general distribution of the data, it will not always correlate with XGBoost model importance. Prior diagnosis codes associated with renal and liver disease or conditions were more important predictors for patients with a prior history of HCV infection than for patients with no HCV infection history.
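The overall ratio quoted above can be checked from the cohort counts in the Results section (487 infected, 3077 not infected). The helper below is a minimal sketch with a hypothetical name; the opioid-subgroup counts are not given in the text, so only the all-patient ratio is reproduced.

```python
def infection_ratio(n_infected, n_not_infected):
    """Ratio of newly identified HCV infections to non-infections."""
    if n_not_infected == 0:
        raise ValueError("ratio is undefined without non-infected patients")
    return n_infected / n_not_infected


# Cohort counts reported in the Results section.
overall_ratio = infection_ratio(487, 3077)
print(round(overall_ratio, 2))  # 0.16
```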

Comparison of Structured and Unstructured Extraction
Text mining was able to identify six risk factors inconsistently coded in structured elements (Table 5). A history of incarceration was identified in the free text for 107 patients, none of whom had the associated ICD10 code Z65.1. Eleven patients were identified as having a needlestick injury using text mining, only one of whom had the ICD10 identifier W46. Piercings and tattoos were identified exclusively through text mining of the semi-structured and free-text data elements. Substance abuse was identified in 352 patients: 276 from free text only, 25 from ICD10 only, and 51 from both. Sexually transmitted infections were identified in 1422 patients: 967 from free text only, 272 from ICD10 only, and 183 from both.
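The text-only, ICD10-only, and both-source counts in this comparison amount to set differences and an intersection over patient identifiers. A minimal sketch, assuming each source yields a collection of patient IDs (the function and key names are illustrative):

```python
def compare_sources(text_ids, icd_ids):
    """Count patients flagged by text mining only, ICD10 only, or both."""
    text_ids, icd_ids = set(text_ids), set(icd_ids)
    return {
        "text_only": len(text_ids - icd_ids),
        "icd10_only": len(icd_ids - text_ids),
        "both": len(text_ids & icd_ids),
        "total": len(text_ids | icd_ids),
    }
```

For substance abuse, for example, the three reported groups (276, 25, and 51) sum to the 352 patients identified in total.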

Utility of Structured Fields
The full-set and structured-based models had comparable performance statistics and the best overall model performance. We suspect that the structured-based model outperformed the text-based model because the scope of data captured by the various ICD10 codes is enough to yield high performance in this data set from a large healthcare system. It is important to note that certain structured fields were included in the model, such as opioid medications and sexually transmitted infections, which are not listed in the CDC and AASLD/IDSA risk factor recommendations. Having a model that relies only on structured fields is very useful, as such models can be more easily shared between, and implemented by, different healthcare practices and systems. By leveraging platforms such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) [23], models based on only structured data elements can be more easily shared and tested for generalizability.

Free-Text Concepts
Using NLP features extracted from clinical documentation in a risk-prediction model offers a novel means to identify persons at high risk for newly identified HCV who need repeat HCV testing, and to target navigation and behavioral health interventions. There was no statistically significant difference between models for patients with no HCV infection history, demonstrating the utility of NLP for patients with no prior history of HCV. It is important to note the differences between text extraction and the ICD10 codes identified in Table 5, specifically for sexually transmitted infections and substance use/abuse: text mining casts a wider net over patients, which could explain the comparable performance even given a limited predictive variable set. Consistently mapping structured fields across resource-limited healthcare systems can be difficult, especially without resources such as OMOP CDM. However, having a text-based model with similar performance for patients with no HCV infection history provides an alternative opportunity to perform limited HCV infection prediction without reliance on provider-driven output of structured fields.

HCV and Opioid Use
Opioid prescribed medication and use predictors were consistently among the leading predictors in all patient groups and strongest in patients with no HCV infection history. Results from this study support many other studies that consider injection opioid use a concern for HCV transmission, which is why it is incorporated in the CDC and AASLD/IDSA risk factors [2,6,24,25]. Although rates of prescribing opioids have decreased in recent years in attempts to curb the opioid epidemic, prescription opioids remain a significant risk factor for the development of opioid use disorder and injection drug use (IDU); the latter is the single most significant risk factor for acquiring HCV [25,26]. Despite the established trajectory from oral prescription opioids to opioid use disorder, then to IDU, and from IDU to HCV, oral prescribed opioids are not yet included in the defined CDC and AASLD/IDSA risk factors. Previous work from our group identified oral prescribed opioids as a factor associated with HCV infection [27]. Our results add to the literature and make an additional case for including oral prescribed opioids in the risk factor guidelines.

Patient Care
While universal testing is recommended, patients might still 'fall through the cracks' or be missed, as it is not certain that such recommendations will be required or followed up. Integrating this model, which predicts patients at risk for HCV infection, into the clinical workflow may help providers prioritize patients who are at high risk and whom they should follow more closely. It may also make a compelling argument for systems to allocate increased resources for linkage into care. Identifying patients at higher risk is helpful for providers and healthcare facilities when prioritizing resources, especially in clinics that are more resource constrained. In our continuing work, the risk-prediction model will be integrated into a clinical decision support (CDS) tool for clinicians. This will test, validate, and operationalize the model, as well as provide mechanisms for the model to learn from user feedback.

Limitations
Data used in this study come from a single healthcare system, and the variable selections were based on chart reviews from that healthcare system. Although the healthcare system consists of a diverse collection of hospitals and ambulatory sites, evaluating the generalizability of these models at different healthcare and other systems will be important. Our model did not include pregnancy as a specific risk factor variable; however, it is recommended that women have an HCV test with each pregnancy, and this will be built into the CDS tool. These results rely on the use of large data and artificial intelligence, which does not allow for granular review; their use should never supersede the need for continued and sound clinical judgement. Lastly, understanding ethical impacts and considerations for ML and NLP models is critical for building reliable and equitable tools and techniques for all patients and care providers. This includes further investigating all dimensions of fairness, such as accurately obtaining patient race and ethnicity information, so as not to propagate potential systemic biases in how data were collected or modeled.

Conclusions
Integrating a structured-based or full-set risk-prediction model into the clinical workflow may help providers prioritize patients, and help systems provide resources for patients who are at high risk for HCV infection and re-infection. We demonstrated the comparable performance of text-based and structured-based models for patients with no HCV infection history. In addition, the models highlight the importance of oral prescribed opioids for predicting HCV infection, suggesting the utility of including oral prescribed opioids as a risk factor in guidelines. Using NLP features extracted from clinical documentation in a risk-prediction model, when available, offers a novel means to capture predictors often missed in structured data elements.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/gidisord5020012/s1. Table S1. The prevalence, ratios, importance (Imp.) and descriptions of all predictors amongst the different patient subsets. (Note: Importance of '-' indicates the variable was not used in the model splits).