Hepatitis B virus (HBV) infection remains a major public health concern worldwide, with an average prevalence of 3.61% [1]. In 2015, over 300 million people were reported to have viral hepatitis globally, of whom approximately 257 million were infected with HBV. Additionally, approximately 0.65 million deaths per year were due to HBV infection [2]. In China, the disease burden of HBV is a serious concern. An estimated 90 million people in China, approximately 7% of the national population, are chronically infected with HBV, and 0.33 million people die annually from HBV-related cancers [3]. According to the Chinese Center for Disease Control and Prevention (China CDC), the total cost of treating HBV-related diseases was estimated at 80–120 billion RMB (i.e., Chinese Yuan, CNY) in 2015 [4].
In 2016, the World Health Assembly published the Global Health Sector Strategy, calling for the elimination of hepatitis as a threat to human health by 2030 by reducing the number of new viral infections by 95% and the number of hepatitis deaths by 65%. Hepatitis B is a major contributor to the epidemic of viral hepatitis and the primary target for prevention and control [5]. Hepatitis B is an infectious disease with a potentially severe prognosis, and there is no effective way to eliminate the virus from an infected individual. Moreover, if HBV infection becomes chronic, the associated harm may be lifelong. Most HBV carriers have no symptoms in the early stages and are often diagnosed only during medical examinations, by which time the ideal treatment period has passed [6]. There are also hidden hepatitis B carriers who do not know their infection status. Therefore, the early identification of high-risk groups and timely intervention are effective ways to control HBV infection [7].
China’s government has taken major steps towards protecting the enrollment and employment rights of hepatitis B patients. Laws including the Infectious Diseases Prevention Law prohibit educational institutions and employers from screening for HBV as a condition of school admission or employment. As a result, only limited information is available for estimating the current prevalence of hepatitis B and controlling HBV transmission [8]. Additionally, general HBV screening is neither cost-effective nor practical [9]. Assessing the risk of HBV infection is important for helping health care providers identify patients appropriate for antigen testing. Previous studies aimed to prevent and control HBV infection by identifying risk factors, including lifestyle, vaccination history, and infection history [10]. However, the risk factors may not have been fully identified, and these studies did not predict the risk of HBV infection. Predictive models are widely used in the medical field to quantify population risks of certain diseases [12]. If individuals at risk of HBV infection could be identified using a prediction model, targeted intervention could be performed efficiently. However, there remains a knowledge gap regarding early warning models for HBV infection based on large health screening datasets.
Machine learning is an important branch of artificial intelligence and is widely used for analyzing medical data. Machine learning can automatically discover and exploit interactions and nonlinear relationships between variables, which can improve the accuracy of disease prediction [13]. A study by Weng et al. reported that machine learning improved the accuracy of cardiovascular risk prediction and increased the number of patients identified who could benefit from preventive treatment [14]. The purpose of the present study was to develop and evaluate models for identifying people who require screening for hepatitis B surface antigen (HBsAg). We applied machine learning methods to select high-risk groups more efficiently. We believe that the development and application of predictive models will provide important information that helps lawmakers allocate limited medical resources more efficiently and effectively.
2. Materials and Methods
2.1. Data Collection
In the present study, data were obtained from a community-based cross-sectional study that enrolled 97,173 residents from Guangzhou city and Zhongshan city in Guangdong, China. Stratified cluster random sampling was used to recruit residents from the targeted regions between January 2014 and December 2015. The first level of stratification comprised Guangzhou city and Zhongshan city. At the second level, the Yuexiu district in Guangzhou and the Xiaolan district in Zhongshan were chosen. The third level was the random selection of communities. The survey comprised the collection of demographic information, a physical health examination, and the collection of a blood sample. The blood sample was used for routine blood tests and liver function tests. The study obtained ethics approval from the Human Ethics Committee at Sun Yat-sen University (L2017030). All research participants signed informed consent. For participants who agreed to the study and provided informed consent, doctors in community health centers (CHCs) collected venous blood aseptically for HBsAg screening and biochemical tests, which were performed using enzyme-linked immunosorbent assay and the velocity method, respectively. The serum was separated from the blood by centrifugation and transported to the laboratory at Da An Gene Company in small vials in an ice-packed box that maintained the temperature at 0–4 °C.
A total of 33 indicators potentially associated with HBV were included in the analysis, covering demographic information, routine blood indicators, and liver function. HBsAg status (positive or negative) served as an indicator of HBV exposure and was the primary outcome. We randomly selected 80% of the data for training and used the remaining 20% for testing. Four models were trained, as described in the classification model sections.
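The random 80/20 partition can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the study's R code; the function name `train_test_split`, the fixed seed, and the toy records are assumptions made for the example.

```python
import random

def train_test_split(records, train_fraction=0.8, seed=0):
    """Randomly partition records into a training list and a test list."""
    rng = random.Random(seed)              # fixed seed keeps the split reproducible
    indices = list(range(len(records)))
    rng.shuffle(indices)
    cut = int(round(train_fraction * len(records)))
    train = [records[i] for i in indices[:cut]]
    test = [records[i] for i in indices[cut:]]
    return train, test

# Toy stand-in for the cohort: (participant_id, hbsag_positive) pairs.
participants = [(i, i % 12 == 0) for i in range(1000)]
train_set, test_set = train_test_split(participants)
```

Any rebalancing of the classes would then be applied to the training portion only, so that the test portion keeps the original class distribution.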
2.2. Data Preprocessing
A total of 8034 participants were HBsAg-positive, accounting for 8.27% of all participants. Such data are considered unbalanced, because the HBsAg-negative population is much larger than the HBsAg-positive population. The synthetic minority oversampling technique (SMOTE), an oversampling technique proposed by Chawla et al., is widely used to address class imbalance by creating synthetic minority class samples [15]. Borderline-SMOTE combines the original SMOTE with boundary information and oversamples only the minority examples near the borderline [16]. The borderline minority examples are first identified in the original dataset and then used to generate new minority examples, which are added back to the original dataset to achieve class balance. This study used Borderline-SMOTE to overcome the class imbalance problem and reconstruct the training set, and then trained the classifiers on the rebalanced training set.
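The two steps described above (find the borderline minority examples, then interpolate new ones from them) can be sketched in plain Python. This is an illustrative sketch of the Borderline-SMOTE idea under simplifying assumptions, not the smotefamily implementation used in the study: the function names, the parameters `m`, `k`, and `n_new_per_seed`, the Euclidean distance, and the toy two-dimensional data are all assumptions. A minority point is treated as borderline when at least half, but not all, of its `m` nearest neighbours belong to the majority class.

```python
import math
import random

def nearest(point, candidates, k):
    """Return the k candidates whose features are closest to point (Euclidean)."""
    return sorted(candidates, key=lambda c: math.dist(point, c[0]))[:k]

def borderline_smote(samples, m=5, k=5, n_new_per_seed=1, seed=0):
    """samples: list of (features, label); label 1 = minority, 0 = majority.
    Returns synthetic minority samples generated from borderline seeds only."""
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == 1]
    synthetic = []
    for p in minority:
        neighbours = nearest(p[0], [s for s in samples if s is not p], m)
        n_majority = sum(1 for s in neighbours if s[1] == 0)
        # Borderline ("danger") seed: at least half, but not all, neighbours are majority.
        if not (len(neighbours) / 2 <= n_majority < len(neighbours)):
            continue
        minority_neighbours = nearest(p[0], [s for s in minority if s is not p], k)
        for _ in range(n_new_per_seed):
            q = rng.choice(minority_neighbours)
            gap = rng.random()  # random position on the segment from p to q
            new_x = tuple(a + gap * (b - a) for a, b in zip(p[0], q[0]))
            synthetic.append((new_x, 1))
    return synthetic

# Toy data: a 5x5 majority grid and a minority group that touches its edge.
majority_pts = [((float(x), float(y)), 0) for x in range(5) for y in range(5)]
minority_pts = [((4.5, 2.0), 1), ((4.6, 2.5), 1), ((5.5, 2.0), 1),
                ((6.0, 2.0), 1), ((6.5, 2.5), 1), ((7.0, 2.0), 1)]
synthetic = borderline_smote(majority_pts + minority_pts, n_new_per_seed=3)
balanced_training_set = majority_pts + minority_pts + synthetic
```

Minority points that sit well inside their own group are skipped, so the synthetic examples concentrate where the two classes meet.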
2.3. Classification Models
We developed models for HBV prediction using four machine learning algorithms: logistic regression (LR), decision tree (DT), random forest (RF), and extreme gradient boosting (XGBoost).
LR is a generalized linear model that is commonly applied to binary or multi-class dependent variables, and it was chosen as the baseline for comparison. Its advantages are interpretable results, low computational cost, and the ability to directly derive the weight of each predictor [17]. Its disadvantages are that it is sensitive to multicollinearity among the independent variables, is poorly suited to imbalanced data (i.e., the positive rate of 8.27% in our study), and may underfit.
DT is a tree structure used for classification and regression. A DT classifies instances based on their features and can be regarded as a set of if-then rules or as a conditional probability distribution defined on the feature space and the class space [18]. The main merits of DT are intuitive results and fast computation. The model is built from the training data by minimizing a loss function during learning and is then applied to classify the test data. However, DT is prone to over-fitting and is biased on unbalanced data.
RF is an ensemble algorithm based on decision tree classifiers. The learning procedure combines bagging and random feature selection, which adds diversity to the decision tree models. RF applies a majority vote over all decision trees to output the final classification result. This can improve predictive accuracy without substantially increasing computational complexity, and RF can handle thousands of input variables [19]. RF is also insensitive to multicollinearity and provides robust results for missing or unbalanced data.
XGBoost is a distributed gradient boosting algorithm based on classification and regression trees. XGBoost is popular in the fields of machine learning and data mining and has shown excellent classification performance. The basic principle is to combine the weighted results of multiple decision trees (weak classifiers) into the final output (strong classifier) [20]. XGBoost controls model complexity by adding regularization terms to the objective function, which mitigates the collinearity problem between variables to a certain extent and helps prevent over-fitting. In the XGBoost model, a second-order Taylor expansion of the loss function is used, so that both the first and second derivatives contribute to the approximate optimization of the objective function, thereby improving predictive accuracy [21].
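For readers who want the second-order approximation written out, the standard form of the regularized objective at boosting round t is sketched below. The notation is an assumption introduced here and does not appear in the original text: l is the loss, f_t the tree added at round t, T its number of leaves, w_j its leaf weights, I_j the set of samples falling in leaf j, and γ and λ the regularization parameters.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2},

g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^{2}},
\qquad
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.
```

For a binary outcome with logistic loss and predicted probability p_i, the derivatives reduce to g_i = p_i − y_i and h_i = p_i(1 − p_i). The λ term in the denominator of the optimal leaf weight shows how the regularization shrinks each tree's contribution.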
2.4. Tuning of Parameters
The use of XGBoost, RF, and DT for prediction requires tuning several hyper-parameters. We tuned the hyper-parameters to maximize the mean area under the receiver operating characteristic (ROC) curve (AUC) computed from 5-fold cross-validation on the training data. In each round, the training data are randomly divided into five subsets of equal size; four subsets are used to train the model and the remaining subset is used for validation. After finding the optimal hyper-parameter values, the prediction models are trained using the entire training dataset, and their performance is evaluated using the test data. Table 1 presents the tuning parameters and the values used in the final models for predicting HBV infection.
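The tuning loop described above can be sketched as follows. This is a hedged illustration in standard-library Python rather than the study's R code. The helper names `k_fold_indices` and `tune` are assumptions, and `toy_fit_and_score` is a hypothetical placeholder for "train a model on the training folds and return its AUC on the validation fold"; the `max_depth` grid is likewise only an example, not the values in Table 1.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k random folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def tune(param_grid, fit_and_score, n_samples, k=5):
    """Return the setting with the highest mean validation score, and that score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        scores = [fit_and_score(params, tr, va)
                  for tr, va in k_fold_indices(n_samples, k)]
        if mean(scores) > best_score:
            best_params, best_score = params, mean(scores)
    return best_params, best_score

# Hypothetical stand-in: a real version would fit a model and compute its AUC.
def toy_fit_and_score(params, train_idx, val_idx):
    return 0.80 - abs(params["max_depth"] - 4) * 0.01

grid = [{"max_depth": d} for d in (2, 4, 6, 8)]
best_params, best_score = tune(grid, toy_fit_and_score, n_samples=100)
```

Once the best setting is chosen, the model would be refitted on the whole training set with that setting before being scored on the held-out test data.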
2.5. Evaluation Metric
In our study, we use accuracy, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC) as metrics to evaluate the performance of the prediction models [22]. The accuracy, sensitivity, and specificity were calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
where $TP$, $FP$, $TN$, and $FN$ denote true positives, false positives, true negatives, and false negatives, respectively.
Accuracy is the proportion of correctly predicted samples among all predicted samples. Sensitivity is the proportion of correctly predicted positive samples among all actual positive samples. Specificity is the proportion of correctly predicted negative samples among all actual negative samples. ROC curves describe the trade-off between the proportion of abnormal cases that are correctly classified (sensitivity) and the proportion of normal cases that are incorrectly classified as abnormal (1 − specificity) as the classification threshold varies. The AUC value is used to comprehensively evaluate the predictive ability of the model [23].
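As a small worked example of these metrics, the sketch below computes them for eight made-up cases. It is written in standard-library Python for illustration only; the function names, the 0.5 threshold, and the toy labels and scores are assumptions, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 positive and 5 negative cases with predicted risk scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)   # 6 / 8
sensitivity = tp / (tp + fn)                 # 2 / 3
specificity = tn / (tn + fp)                 # 4 / 5
auc_value = auc(y_true, scores)              # 13 / 15
```

Unlike the other three metrics, the AUC does not depend on the chosen threshold, which is why it is convenient for comparing models.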
2.6. Statistical Analysis
All statistical analyses were conducted using R software version 3.3.5 (R Core Team, Vienna, Austria). Normally distributed data were expressed as the mean and standard deviation, and differences between groups were compared using the t test. Categorical variables were expressed as frequency (percentage), and differences between groups were compared using Fisher’s exact test. The R packages and functions involved included xgboost, glm, rpart, randomForest, and smotefamily.
We developed HBV infection risk assessment models based on the health examination data of 97,173 community residents using machine learning methods, with the goal of determining the optimal model and improving the detection rate of positive HBsAg. Our findings revealed that the combined Borderline-SMOTE XGBoost model outperformed the other models and may help identify individuals in need of HBsAg testing. Preprocessing the samples with Borderline-SMOTE can mitigate the problem of data imbalance and improve the overall prediction performance of the model. A large proportion of people who are unaware of their HBV infection miss the ideal treatment time, resulting in treatment difficulties and poor prognosis. Assessment of patients at risk of HBV infection is lacking in clinical settings. Thus, it is necessary to improve the detection probability of HBV-infected patients [24]. The XGBoost model can therefore be applied to assess the prevalence of HBV in the general population, promote early diagnosis and timely treatment of high-risk groups, and improve the utilization of medical resources, particularly in low-resource countries [25].
The use of machine learning algorithms to predict disease risk has gained attention in the biomedical field [26]. In this study, we took advantage of a large-scale dataset to identify individuals at high risk of HBV infection by applying machine learning methods. Our findings have important implications for participants: early identification makes it possible to target effective interventions at high-risk groups. Additionally, treatment early in the disease process often means better efficacy. Negative results from the predictive model can eliminate the need for HBsAg testing in most of the general population [27]. Our predictive model can be used to improve the positive detection rate of HBV in areas with limited budgets and resources.
In addition, the predictive performance of the models built with machine learning methods differed significantly from that of commonly used traditional classification methods. The widely used ensemble model RF and the more recent boosting method XGBoost were applied in this study, with the traditional machine learning model DT and the traditional model LR serving as comparators. The top-performing algorithm, Borderline-SMOTE XGBoost, achieved an AUC of 0.782 (95% CI: 0.771, 0.793) and an overall accuracy of 70.2%; its AUC was nearly four percent higher than that of the traditional LR model. Our findings are consistent with results from a previous study [28]. XGBoost can mitigate the bias of traditional models against minority classes and shows strong classification performance on unbalanced data.
The variable importance plot of the XGBoost model showed that age was highly important for predicting HBV infection, which is consistent with a previous study [29]. In China, the majority of HBV infections are caused by perinatal vertical transmission and childhood infection. We could infer that older patients with hepatitis B, who may have been infected for longer, have more serious liver damage and greater susceptibility to adverse outcomes. To detect and treat hepatitis B patients early, it is important to carry out long-term follow-up and regular examinations. This study also suggests that alanine aminotransferase (ALT), platelet count (PLT), aspartate aminotransferase (AST), albumin (ALB), and plateletcrit (PCT) were important predictors of HBV infection [30]. Our results are consistent with findings from other studies. For instance, the levels of PCT and PLT not only reflect the number of platelets but also indirectly reveal the functional status of the liver [31]. Serum AST and ALT levels are important indicators in the examination of HBV infection, and elevated levels are closely related to liver disease [33]. Serum ALB levels reflect liver reserve, especially synthetic function, and correlate with the severity of liver disease. The above variables can provide clues and a reference for further studies exploring potential predictors of HBV infection.
Our findings also have societal benefits. Adopting risk assessment strategies can provide a greater understanding of HBV prevalence and identify the greatest number of patients for antigen testing [34]. Additionally, the risk of transmitting HBV to other individuals can be reduced by early diagnosis followed by lifestyle modifications. Moreover, treatment earlier in the course of the disease is associated with acceptable estimates of cost per quality-adjusted life year [35]. To a certain extent, our approach can be generalized to other conditions, such as diabetes [27] and cardiovascular risk [14], for which similar predictive models can be built using the same machine learning techniques.
Although our study provides new insight into predicting HBV infection using machine learning algorithms, several limitations must be mentioned. First, the features included in our model were limited to those available in the obtained datasets, so potentially relevant but unknown features might be missing. However, this study included 31 variables and considered as many factors of HBV infection as possible. Second, although our model was developed using a limited number of algorithms, these are reasonably representative: XGBoost represents a recent boosting method, RF represents the traditional ensemble model, and DT represents the traditional machine learning model. In future research, other machine learning algorithms will be considered to improve prediction accuracy. Third, we were unable to analyze hepatitis B vaccination in detail because of missing data on hepatitis B vaccination history. Finally, the data came from a community-based study in China, and data from outside the study area were not used for external validation. However, our data volume is large, so the results should retain some generalizability.