Despite a decline in the prevalence of hepatitis B in China, the disease burden remains high. Large populations unaware of infection risk often fail to meet the ideal treatment window, resulting in poor prognosis. The purpose of this study was to develop and evaluate models identifying high-risk populations who should be tested for hepatitis B surface antigen. Data came from a large community-based health screening, including 97,173 individuals, with an average age of 54.94. A total of 33 indicators were collected as model predictors, including demographic characteristics, routine blood indicators, and liver function. Borderline-Synthetic minority oversampling technique (SMOTE) was conducted to preprocess the data and then four predictive models, namely, the extreme gradient boosting (XGBoost), random forest (RF), decision tree (DT), and logistic regression (LR) algorithms, were developed. The positive rate of hepatitis B surface antigen (HBsAg) was 8.27%. The area under the receiver operating characteristic curves for XGBoost, RF, DT, and LR models were 0.779, 0.752, 0.619, and 0.742, respectively. The Borderline-SMOTE XGBoost combined model outperformed the other models, which correctly predicted 13,637/19,435 cases (sensitivity 70.8%, specificity 70.1%), and the variable importance plot of XGBoost model indicated that age was of high importance. The prediction model can be used to accurately identify populations at high risk of hepatitis B infection that should adopt timely appropriate medical treatment measures.
This is an open access article distributed under the Creative Commons Attribution License
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited