Machine Learning Approaches for the Prediction of Hepatitis B and C Seropositivity

(1) Background: The identification of patients at risk for hepatitis B and C viral infection is a challenge for clinicians and public health specialists. The aim of this study was to evaluate and compare the predictive performance of four machine learning-based models for the prediction of HBV and HCV status. (2) Methods: This prospective cohort screening study evaluated adults from the North-Eastern and South-Eastern regions of Romania between January 2022 and November 2022 who underwent viral hepatitis screening in their family physician’s offices. The patients’ clinical characteristics were extracted from a structured survey and were included in four machine learning-based models: support vector machine (SVM), random forest (RF), naïve Bayes (NB), and K nearest neighbors (KNN), and their predictive performance was assessed. (3) Results: All evaluated models performed better when used to predict HCV status. The highest predictive performance was achieved by the KNN algorithm (accuracy: 98.1%), followed by SVM and RF with equal accuracies (97.6%) and NB (95.7%). The predictive performance of these models was modest for HBV status, with accuracies ranging from 78.2% to 97.6%. (4) Conclusions: Machine learning-based models could be useful tools for HCV infection prediction and for the risk stratification of adult patients who undergo a viral hepatitis screening program.


Introduction
Hepatitis B infection is caused by the hepatitis B virus (HBV), a deoxyribonucleic acid (DNA) virus belonging to the Hepadnaviridae family and the Orthohepadnavirus genus [1]. It is transmitted through contact with contaminated blood or bodily fluids, most frequently through intravenous drug use, sexual contact, or vertical transmission from mother to child [1]. HBV prevalence is decreasing in developed countries due to vaccination, but it remains high in endemic areas due to vertical transmission [2,3]. The primary determinant of the course of hepatitis is the age at HBV infection: the vast majority of perinatally infected individuals develop chronic hepatitis B [4].
According to a systematic analysis, in 2019, the estimated global, all-age prevalence of chronic HBV infection was 4.1%, corresponding to 316 million infected people [5]. Moreover, there was a 31.3% decline in all-age prevalence between 1990 and 2019, with a more marked decline of 76.8% in prevalence in children younger than 5 years. The World Health Assembly launched the WHO Global Health Sector Strategy on Viral Hepatitis (WHO-GHSS) in 2016, with the goal of eradicating viral hepatitis as a public health concern [6]. The WHO-GHSS proposed impact targets of 30% fewer new hepatitis B cases and 10% fewer HBV-related deaths by 2020, and 95% fewer new cases and 65% fewer deaths by 2030, compared to the baseline year of 2015.
The hepatitis C virus (HCV), a single-stranded RNA virus, causes hepatitis C infection. HCV belongs to the Flaviviridae family and the Hepacivirus genus, and it is primarily spread through direct bloodstream inoculation [7]. According to recent studies, there are approximately 71 million persons infected with HCV, which amounts to a global prevalence of 1.0% [8].
The WHO-GHSS aims to reduce HCV incidence by 80% and HCV-related mortality by 65% [6].
The identification of infected people is the first step in the sequence of care, and this can be achieved with a systematic screening program, especially for patients at risk. All persons who are seronegative should receive hepatitis B vaccine [9]. Persons who are at risk for HBV/HCV infection are represented by newborns from infected mothers, hemodialysis patients, individuals infected with human immunodeficiency virus (HIV), drug users, migrants from countries with high HBV/HCV prevalence rates, prisoners, people who have received blood products, war veterans, and people with risky sexual behavior [10][11][12].
Identifying the risk profile is essential for properly selecting the target population that will be further screened. Moreover, governmental agencies need to constantly adapt their strategies in order to offer the best screening opportunities that are also cost-effective [13]. Therefore, it is important to evaluate the performance of current screening programs and to constantly improve their quality.
Artificial intelligence and machine learning approaches have the ability to outperform traditional hepatitis screening strategies and to evaluate large datasets in order to provide a full picture of regional epidemiological profiles. The machine learning-based methods for disease prediction include random forest (RF), decision trees (DT), gradient boosting (GB), naïve Bayes (NB), and support vector machine (SVM) [14]. Until now, machine learning-based methods have been employed for the prediction of the type and duration of antiviral therapy [15,16], stage of HCV infection [17], the occurrence of hepatic fibrosis and hepatocellular carcinoma related to hepatitis infection [18,19], and for classification purposes [20].
A recent study used three neural networks for the prediction of HBV and HCV incidence in a cohort from China, using surveillance data from a 13-year time frame [21]. The results showed that the Long Short-Term Memory (LSTM) prediction model, the Recurrent Neural Network (RNN) model, and the Back Propagation Neural Network (BPNN) model had significant predictive performance for the early detection of disease incidence.
Although the literature supports the use of these approaches for disease prediction, data are heterogeneously reported, and few studies concentrate on the patient's epidemiological profile. The aim of this study was to evaluate and compare the predictive performances of four machine-learning based models for the prediction of HBV and HCV seropositivity.

Materials and Methods
We conducted a prospective cohort screening study of adult persons from the northeastern and southeastern regions of Romania, between January 2022 and November 2022 (LIVE(RO) 2-EST). Ethical approval for this study was obtained from the Institutional Ethics Committee of University of Medicine and Pharmacy 'Grigore T. Popa' (No. 151/13 January 2022). Informed consent was obtained from all participants included in the study. All methods were carried out in accordance with relevant guidelines and regulations.
We recruited participants at the time of the routine family physician evaluation. The inclusion criteria were age ≥18 years and a home address located in the northeastern or southeastern regions of Romania. The exclusion criteria were pregnancy, arrest, incomplete medical records, and inability to offer informed consent.
A structured questionnaire was administered to all participants by the family physician, and the following data were recorded for the purpose of this study: demographic data, age, level of education, ethnicity, employment status, vulnerability due to medium, ethnicity, work or other situations, HBV vaccination status, previous diagnosis and/or treatment for viral hepatitis, contact with hepatitis viruses in the family, through sexual contact, work, or other instances, previous blood transfusions, hemodialysis, surgery, hospitalization, oral procedures, work or house-related accidents that necessitated hospitalization, cuts/other injuries with sharp objects, incarceration, tattoos/piercings, use of intravenous drugs, one or more unprotected sexual intercourse(s) with one or multiple partners, and previous sexually transmitted infections.
All patients underwent rapid blood testing for HBs antigen and for HCV antibodies using immunochromatographic tests. After the results came back, the patients were asked whether or not they would like a referral to a diagnostic center in the country.
A total of 1359 patients were included in the preliminary analysis of this study and were divided into three groups: those who had HBV infection (38 patients, group 1), those who had HCV infection (33 patients, group 2), and a control group without infection (1288 patients, group 3). HBV-HCV coinfection was identified in two patients, who were excluded from the analysis due to the small sample size.
In the first stage of the statistical analysis, each variable was evaluated with chi-squared and Fisher's exact tests for categorical variables, which were presented as frequencies with corresponding percentages, and t-tests for continuous variables, which were presented as means and standard deviations (SD).
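As an illustration of the first-stage comparison for a categorical variable, the following minimal sketch computes a Pearson chi-squared test on a 2×2 table. It is written in Python rather than STATA, the counts are hypothetical and are not taken from the study data, and the function name `chi_squared_2x2` is introduced only for this example. When expected cell counts are small, Fisher's exact test would be used instead.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared upper tail equals the
    # two-sided tail of a standard normal variable.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts (NOT study data): exposed / unexposed in cases,
# then exposed / unexposed in controls.
statistic, p_value = chi_squared_2x2(20, 18, 300, 988)
print(f"chi2 = {statistic:.2f}, p = {p_value:.5f}")
```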
Multinomial logistic regression was used to determine whether or not there is a statistically significant difference between the groups regarding their clinical characteristics derived from the standardized questionnaire. The statistical analyses were performed using STATA SE (version 17, 2021, StataCorp LLC, College Station, TX, USA).
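For readers less familiar with multinomial logistic regression, the sketch below shows how such a model turns a participant's characteristics into one probability per group, with the control group as the reference category. The coefficients, the two binary predictors, and the function name `multinomial_probabilities` are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

def multinomial_probabilities(x, coefficients):
    """Class probabilities of a multinomial logit model.

    `coefficients` maps each class to (intercept, weights). The reference
    class is included with all coefficients set to zero.
    """
    scores = {
        label: intercept + sum(w * xi for w, xi in zip(weights, x))
        for label, (intercept, weights) in coefficients.items()
    }
    max_score = max(scores.values())  # subtracted for numerical stability
    exp_scores = {label: math.exp(s - max_score) for label, s in scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}

# Hypothetical coefficients (NOT fitted to study data).
# Predictors: [previous blood transfusion, family history of viral hepatitis]
coefficients = {
    "control": (0.0, [0.0, 0.0]),   # reference category
    "HBV": (-3.5, [1.2, 0.8]),
    "HCV": (-3.7, [0.9, 1.5]),
}

probabilities = multinomial_probabilities([1, 0], coefficients)
print(probabilities)
```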
In the second stage of the analysis, we evaluated the predictive performance of four machine learning-based models: support vector machine, naïve Bayes, random forest algorithm, and K nearest neighbors (KNN).
An SVM is a supervised learning algorithm used for classification and regression [22,23]. This algorithm is a relatively new method that has shown promising results in recent years for disease prediction. SVM classifiers build on linear classifiers and seek the separating boundary (hyperplane) with the largest margin, that is, the greatest distance to the nearest training points of each class.
NB is a classification technique based on Bayes' theorem [24]. This theorem gives the probability of an event conditional on prior knowledge of conditions related to that event. The classifier assumes that, within a class, each feature is independent of every other feature, even though the features may in fact be interdependent [25].
Random forests are ensemble classifiers that randomly learn multiple decision trees [26]. The random forest approach consists of a training stage in which many decision trees are built and a testing step in which an outcome variable is classified or predicted based on an input vector [25]. The different decision trees of an RF are trained using different parts of the training dataset. To classify or predict a new sample, the input vector of that sample is passed down each DT of the forest. Each DT then considers a different part of that input vector and offers a prediction outcome. The forest then selects the prediction with the greatest number of 'votes' (for discrete outcomes) or the average of all trees in the forest (for numeric outcomes).
KNN is a supervised machine-learning algorithm predominantly used for classification and prediction purposes [25]. It classifies a test query by taking into account the K training data points (neighbors) that are closest to that query [27]. The algorithm then applies a majority voting rule among these neighbors to assign the final class [28].
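The idea of preferring the boundary with the largest margin can be illustrated with a small sketch. It does not train an SVM. It only compares two candidate separating lines on hypothetical two-dimensional data and keeps the one whose closest point lies farthest away, which is the criterion an SVM optimizes. The data, the candidate lines, and the function name `geometric_margin` are assumptions made for this example.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    Labels are +1 or -1. A negative result means at least one point
    lies on the wrong side of the line.
    """
    norm = math.hypot(w[0], w[1])
    return min(
        y * (w[0] * x[0] + w[1] * x[1] + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical, linearly separable data (NOT study data).
points = [(1.0, 1.0), (2.0, 1.5), (4.0, 4.0), (5.0, 4.5)]
labels = [-1, -1, 1, 1]

# Two candidate boundaries (w, b); both classify every point correctly.
candidates = {
    "line_a": ((1.0, 0.0), -3.0),    # vertical line x = 3
    "line_b": ((1.0, 1.0), -5.75),   # diagonal line x + y = 5.75
}

margins = {
    name: geometric_margin(w, b, points, labels)
    for name, (w, b) in candidates.items()
}
best = max(margins, key=margins.get)
print(margins, "-> preferred boundary:", best)
```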
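A minimal sketch of this classifier for categorical questionnaire items is given below, assuming binary (yes/no) answers and Laplace smoothing. The records, the two items, and the function names are hypothetical. The sketch is not the Matlab implementation used in the study.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def predict_naive_bayes(model, row, n_values=2):
    """Return the class with the highest log-posterior.

    Features are treated as conditionally independent given the class.
    Laplace smoothing (+1) avoids zero probabilities; `n_values` is the
    number of distinct values each feature can take.
    """
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for j, value in enumerate(row):
            numerator = value_counts[(label, j)][value] + 1
            score += math.log(numerator / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical records (NOT study data): (transfusion, dental procedure).
rows = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
labels = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, (1, 1)), predict_naive_bayes(model, (0, 0)))
```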
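The two ingredients described above, training each tree on a different part of the data and combining the trees by voting, can be sketched with a toy ensemble. To keep the example short, each "tree" is reduced to a one-feature stump, so this is a simplified bagging ensemble and not a full random forest. The data and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """One-feature 'tree': majority label observed for each value of `feature`."""
    votes = {}
    for row, label in zip(rows, labels):
        votes.setdefault(row[feature], Counter())[label] += 1
    default = Counter(labels).most_common(1)[0][0]  # used for unseen values
    table = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
    return feature, table, default

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        feature = rng.randrange(n_features)
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest

def predict_forest(forest, row):
    """Majority vote across all stumps (discrete outcome)."""
    votes = Counter(
        table.get(row[feature], default) for feature, table, default in forest
    )
    return votes.most_common(1)[0][0]

# Hypothetical, perfectly separable records (NOT study data).
rows = [(1, 1, 1)] * 5 + [(0, 0, 0)] * 5
labels = ["pos"] * 5 + ["neg"] * 5

forest = fit_forest(rows, labels)
print(predict_forest(forest, (1, 1, 1)), predict_forest(forest, (0, 0, 0)))
```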
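The neighbor search and majority vote can be sketched as follows, assuming binary questionnaire answers compared with the Hamming distance (the number of items on which two participants differ). The distance measure, the value K = 3, the records, and the function names are assumptions made for this example. The source does not state which distance or K was used.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two answer vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # sorted() is stable, so equally distant points keep their original order.
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: hamming(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical records (NOT study data).
train_rows = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (0, 0, 0), (0, 0, 1), (0, 1, 0)]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

print(knn_predict(train_rows, train_labels, (1, 1, 1)))
print(knn_predict(train_rows, train_labels, (0, 0, 0)))
```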
The data were split into a training set (70%) and a testing set (30%). To limit overfitting, all models underwent 5-fold cross-validation. Their true positive rates (TPR), false negative rates (FNR), positive predictive values (PPV), false discovery rates (FDR), accuracies, areas under the curve (AUC), precision, recall, and F1 scores were calculated and compared between the HBV and HCV seropositivity models. The models were constructed and analyzed using Matlab (version R2021b, The MathWorks, Inc., Natick, MA, USA).
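The relationship between these performance measures can be made explicit with a short sketch that derives them from the four cells of a confusion matrix. Precision is the same quantity as PPV, and recall is the same quantity as TPR. The counts are hypothetical, the helper names are introduced only for this example, and returning 0.0 for an undefined ratio is a convention assumed here.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the ratio is undefined."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, fn, tn):
    """Performance measures derived from a binary confusion matrix."""
    tpr = safe_ratio(tp, tp + fn)   # true positive rate = sensitivity = recall
    fnr = safe_ratio(fn, tp + fn)   # false negative rate = 1 - TPR
    ppv = safe_ratio(tp, tp + fp)   # positive predictive value = precision
    fdr = safe_ratio(fp, tp + fp)   # false discovery rate = 1 - PPV
    accuracy = safe_ratio(tp + tn, tp + fp + fn + tn)
    f1 = safe_ratio(2 * ppv * tpr, ppv + tpr)  # harmonic mean of PPV and TPR
    return {
        "TPR": tpr, "FNR": fnr, "PPV": ppv, "FDR": fdr,
        "accuracy": accuracy, "F1": f1,
    }

# Hypothetical confusion matrix (NOT study data).
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=388)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```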

Results
A total of 1359 participants were evaluated in our prospective study. Their demographic characteristics are presented in Table 1, segregated into the following groups: patients with HBV (group 1, 38 patients), patients with HCV (group 2, 33 patients), and controls (group 3, 1288 patients). Compared with controls, the first group included significantly more widowed (p < 0.001), employed (p = 0.01), and agricultural workers (p < 0.001), while the second group comprised persons who were older (p < 0.001), more often female (p = 0.01), more often widowed (p < 0.001), and with an educational background predominantly in the ISCED 1-3 interval (p = 0.02).
The questionnaire results for the main groups are presented in Table 2. Both HBV and HCV patients reported a significantly higher personal or family history of viral hepatitis compared with controls (p < 0.001). Moreover, both groups had significantly higher proportions of risky professions, hospitalizations, hemodialysis, surgeries, and dental procedures (p < 0.05). In addition, the first group had significantly more severe accidents (p = 0.002) and blood transfusions (p < 0.001) than the control group.
In the second stage of the analysis, we incorporated the patients' clinical characteristics into four machine learning-based models and calculated their predictive performance (Table 3). KNN achieved the highest accuracy when predicting HCV status (98.1%), with an AUC value of 0.67. The SVM had equal accuracies (97.6%) for the prediction of both HBV and HCV status, but the AUC value was higher for HCV classification (0.89 versus 0.80). Both RF and NB performed best when used to predict HCV status (RF: accuracy 97.6%, AUC 0.79; NB: accuracy 95.7%, AUC 0.85). Sensitivity was higher for the algorithms that predicted HCV status, with KNN having the highest sensitivity (100%).
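Because the seropositive groups are much smaller than the control group, accuracy and AUC can diverge, as in the KNN result above. The sketch below offers one hedged reference point, computed from the group sizes reported in this section: the accuracy of a rule that always predicts the larger class. It assumes that each model separates one seropositive group from the controls and that the evaluation data keep the same class proportions; neither assumption is confirmed by the text. The AUC function uses the pairwise-ranking definition and is shown on hypothetical scores.

```python
def majority_baseline_accuracy(n_positive, n_negative):
    """Accuracy obtained by always predicting the larger class."""
    return max(n_positive, n_negative) / (n_positive + n_negative)

def auc_from_scores(pos_scores, neg_scores):
    """Probability that a random positive case receives a higher score
    than a random negative case (ties count as one half)."""
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Group sizes reported in this section: 38 HBV, 33 HCV, 1288 controls.
hbv_baseline = majority_baseline_accuracy(38, 1288)
hcv_baseline = majority_baseline_accuracy(33, 1288)
print(f"HBV vs control baseline accuracy: {hbv_baseline:.3f}")
print(f"HCV vs control baseline accuracy: {hcv_baseline:.3f}")

# Hypothetical model scores (NOT study data) to illustrate the AUC definition.
example_auc = auc_from_scores([0.9, 0.4, 0.7], [0.1, 0.5, 0.2, 0.3])
print(f"example AUC: {example_auc:.3f}")
```

Under these assumptions, AUC and sensitivity are less affected by class proportions than accuracy, which is why they are reported alongside it.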

Discussion
To our knowledge, this is the first prospective study in the literature that trained four machine learning-based models (SVM, RF, NB, and KNN) for the prediction of hepatitis B and C seropositivity in a cohort of adult patients from Romania using clinical parameters determined during family physicians' visits.
Our results showed that all evaluated models performed better when used to predict HCV status. The highest predictive performance was achieved by the KNN algorithm (accuracy: 98.1%), followed by SVM and RF with equal accuracies (97.6%) and NB (95.7%). Sensitivity was modest for HBV status prediction, with only one model (SVM) achieving a sensitivity of 40%.
SVM increases class separation and reduces the expected prediction error, and it is applicable to the analysis of high-dimensional data with a small sample size [29][30][31]. The bagging algorithm serves as the foundation for RF, which employs ensemble learning [32]. RF builds many trees on subsets of the data and combines the output of all the trees. In doing so, it lessens the issue of overfitting in decision trees, lowers variance, and raises accuracy. On the other hand, NB is suitable for solving multi-class prediction problems, especially when using small datasets, and has much lower costs than RF [33]. Finally, one of the biggest advantages of the KNN model is that it can be used for both classification and regression problems, although it does not perform well on imbalanced data [34].
A recent case-control study by Majzoobi et al. evaluated the predictive performance of four ensemble learning methods (bagging, AdaBoost, RF, and logistic regression) for the prediction of HBV and HCV infection [35]. The authors demonstrated superior predictive performance of RF when used to predict both HBV (accuracy: 66%) and HCV infection (accuracy: 77%) compared to the other models. Although we obtained better results with RF in terms of accuracy for HBV (accuracy: 78.5%) and HCV infection (accuracy: 97.6%), in our study this model was outperformed by KNN.
Another recent study by Zhou et al. used three machine learning methods (RF, KNN, and SVM) for analyzing correlations among chronic hepatitis B inflammation grades, gene expressions, and clinical parameters (serum alanine aminotransferase, aspartate aminotransferase, and HBV-DNA), and for predicting inflammation grades by using clinical parameters and/or gene expressions. The authors showed that KNN had the highest accuracy (76.6%) compared to SVM (accuracy: 65.4%) and RF (accuracy: 72.8%) when using all the evaluated types of data [36].
Chen et al. evaluated the predictive performance of four classifiers (SVM, NB, RF, and KNN) in order to build a decision-support system that would improve the hepatitis B staging using real-time elastography data. The results indicated that RF had the highest accuracy for the prediction of stage 0 (82.8%), 1 (81.1%), 2 (88%), and 3 (91.2%) of liver fibrosis [37].
These studies outline the high predictive performance of machine learning-based models in various settings and with multiple types of data. However, the costs of performing such analyses are high, especially for gene sequencing. To provide the best screening strategy, it is important to identify the subjects with the highest risk based on data that can be obtained at minimal cost. This constitutes an advantage of our study, which indicates a good predictive performance of machine learning models using data that can be easily obtained.
Our study has several limitations, including a small cohort of patients and a small number of predictors. At the same time, the trained models have the advantage of easier implementation by physicians. All chosen machine learning-based models have the ability to handle small sample data [38]. Moreover, the algorithms used have shown superior predictive performance when applied to datasets based mainly on categorical predictors in comparison with other models such as gradient boosting [39], artificial neural networks [40], support vector machines, extreme gradient boosting, multilayer perceptron [41], or linear discriminant analysis [42].
We hypothesize that the model accuracies could be improved by adding repeated serum measurements and liver elastography parameters which have been proven to be useful biomarkers for viral hepatitis prediction [43][44][45][46].
Further studies on larger cohorts of patients could evaluate the predictive performance of these ML-based models in different settings and populations. The results could aid clinicians in the risk stratification process of adult patients who undergo a screening program, and could help optimize the costs of the screening programs.

Conclusions
The machine learning-based models could be useful tools for HCV infection prediction and for the risk stratification process of adult patients who undergo a viral hepatitis screening program.
The results for HBV prediction using only clinical characteristics are modest in terms of predictive performance.
These findings are important for clinicians and public health specialists because they can be further validated and incorporated into national screening programs in order to optimize them and to reduce their costs.