Prediction Model with HLA-A*33:03 Reveals Number of Days to Develop Liver Cancer from Blood Test

The development of liver cancer in patients with hepatitis B is a major problem, and several models have been reported to predict the development of liver cancer. However, no predictive model involving human genetic factors has been reported to date. For the items incorporated in the prediction model reported so far, we selected items that were significant in predicting liver carcinogenesis in Japanese patients with hepatitis B and constructed a prediction model of liver carcinogenesis by the Cox proportional hazard model with the addition of Human Leukocyte Antigen (HLA) genotypes. The model, which included four items—sex, age at the time of examination, alpha-fetoprotein level (log10AFP) and presence or absence of HLA-A*33:03—revealed an area under the receiver operating characteristic curve (AUROC) of 0.862 for HCC prediction within 1 year and an AUROC of 0.863 within 3 years. A 1000 repeated validation test resulted in a C-index of 0.75 or higher, or sensitivity of 0.70 or higher, indicating that this predictive model can distinguish those at high risk of developing liver cancer within a few years with high accuracy. The prediction model constructed in this study, which can distinguish between chronic hepatitis B patients who develop hepatocellular carcinoma (HCC) early and those who develop HCC late or not, is clinically meaningful.


Introduction
About 350 million people are estimated to be infected with the hepatitis B virus (HBV) worldwide, and about 1.3 million to 1.5 million people are infected with the virus in Japan. Although the Japanese government started a universal hepatitis B vaccination program from October 2016, it is estimated that there are currently about 10,000 new cases annually. In a multipurpose Japanese cohort study, it was reported that patients with HBV infection had a 16.1-fold increased risk of liver carcinogenesis [1], and there is a strong demand for accurate prediction of carcinogenesis in HBV-infected individuals.
Several clinical parameters have been reported as risk factors for liver carcinogenesis in HBV-infected patients, including HBV-DNA levels, presence of hepatitis B e-antigen (HBeAg), and presence of cirrhosis, in addition to basic information such as age and sex. REACH-B, constructed in a cohort study in Taiwan, is one of the most commonly used hepatocellular carcinoma (HCC) prediction models for untreated hepatitis B patients, with sex, age, serum alanine aminotransferase (ALT) level, HBeAg, and serum HBV-DNA levels used as variables [2]. REACH-B used Cox multivariate proportional hazards model to predict risk of HCC at 3, 5, and 10 years, with an area under the receiver operating characteristic curve (AUROC) of 0.8111 at 3 years, 0.796 at 5 years, and 0.769 at 10 years. The PAGE-B is another commonly used the HCC predictive model for treated hepatitis B patients in Caucasians, using age and sex and platelet as variables [3]. The 5-years cumulative HCC incidence rates predicted by the PAGE-B model showed 0.82 of c-index after bootstrap validation for 1325 adult Caucasians who received entecavir/tenofovir for 12 months or longer. Subsequently, a modified PAGE-B (mPAGE-B) model was constructed by adding serum albumin level to the PAGE-B model in 2001, treated Asian hepatitis B patients, and showed 0.82 of AUROC [4].
In addition to REACH-B, the HCC predictive model for chronic hepatitis B (CHB) patients who remained untreated or were treated with entecavir only or with entecavir and tenofovir disoproxil fumarate includes GAG-HCC, reported by a Hong Kong group [5]. A risk score using the GAG-HCC model, which was based on age, gender, HBV-DNA levels, HBV core promoter mutations, and liver cirrhosis (LC), can estimate the development of HCC in 5-and 10-years with AUROC of 0.88 at 5 years, and 0.89 at 10 years. On the other hand, in addition to PAGE-B, the HCC predictive model for treated hepatitis B patients includes APA-B, reported by a Taiwanese group [6]. APA-B was the HCC prediction model based on age, sex, platelet counts, and alpha-fetoprotein (AFP) levels, and the prediction accuracy of HCC risk after 2, 3, and 5 years was 0.877, 0.842, and 0.827, respectively.
Several models have been reported to predict the risk of HCC in CHB patients [7], but none has added host genetic factors. Genome-wide association studies (GWASs) identified the association of HLA class II genes with chronic hepatitis B infection in Asian populations [8,9]. As a result of GWAS and HLA association analysis focusing on HCC in Japanese CHB patients, HLA-DPB1*02:01, which belongs to HLA class II, and HLA-A*33:03 and HLA-A*31:01, which belongs to HLA class I, showed a significant association with HCC [10,11]. In this study, we constructed a model that added HLA genotypes to the prediction of HCC in hepatitis B patients and verified its predictive accuracy.

Clinical Characteristics of Studied Individuals and Study Design
Clinical information from 893 hepatitis B patients distributed by the BioBank Japan (BBJ) was used to select clinical items that were significantly associated with the prediction of HCC [12,13]. Nine test items (including age, sex, HBV-DNA, HBeAg, ALT, LC, PLT, Alb, and AFP) incorporated into five previously reported HCC prediction models (REACH-B, GAG-HCC, PAGE-B, mPAGE-B, and APA-B) were examined to be associated with the development of HCC using BBJ samples. A stepwise variable selection based on the Akaike information criterion (AIC) was performed for the Cox proportional-hazards regression model with the nine variables. As a result, four items were selected: presence or absence of LC, log 10 AFP, age, and sex. An independent set of 672 hepatitis B patients were collected from four hospitals and determined their HLA genotypes. In Table 1, clinical information  of 162 patients which were able to collect four clinical items without any defects was  summarized, along with the presence or absence of HLA-A*33:03, HLA-A*31:01,  For all 162 cases, the number of days between the date of the blood test and the date of the liver cancer diagnosis, or between the date of the blood test and the date of the investigation, is available ( Figure 1). Non-HCC subjects in which 5 years had not elapsed between the date of the blood test and the date of the investigation were excluded from the analysis.
(REACH-B, GAG-HCC, PAGE-B, mPAGE-B, and APA-B) were examined to be associated with the development of HCC using BBJ samples. A stepwise variable selection based on the Akaike information criterion (AIC) was performed for the Cox proportional-hazards regression model with the nine variables. As a result, four items were selected: presence or absence of LC, log10AFP, age, and sex. An independent set of 672 hepatitis B patients were collected from four hospitals and determined their HLA genotypes. In Table 1, clinical information of 162 patients which were able to collect four clinical items without any defects was summarized, along with the presence or absence of HLA-A*33:03, HLA-A*31:01, and HLA-DPB1*02:01. Clinical information of all 672 patients, including missing values, is summarized in Materials and Methods. For all 162 cases, the number of days between the date of the blood test and the date of the liver cancer diagnosis, or between the date of the blood test and the date of the investigation, is available ( Figure 1). Non-HCC subjects in which 5 years had not elapsed between the date of the blood test and the date of the investigation were excluded from the analysis.

The Cox Proportional-Hazards Regression Model to Predict the Early Onset of HCC
To predict which CHB patients would develop HCC after blood test, we used the Cox proportional-hazards regression model [14]. The results of multivariate Cox proportionalhazards regression analysis for seven variables (age, sex, LC, log 10 AFP, HLA-A*31:01, HLA-A*33:03, and HLA-DPB1*02:01) are described in Materials and Methods. After the stepwise variable selection using 162 cases with no lack of clinical information, four variables remained: age, sex, log 10 AFP, and HLA-A*33:03 ( Table 2). The hazard function is h(t) = h 0 (t)*exp(0.057*age + 0.601*sex (male = 1, female = 0) + 0.642*log 10 AFP + 0.714*A*33:03 (carrier = 1, non-carrier = 0)), where h 0 (t) indicates the baseline hazard function. As an example, the beta coefficient for age is 0.057, indicating that older age is associated with a higher risk of liver cancer. Table 2 shows beta coefficient, hazard ratio (HR), confidence intervals of the hazard ratios (HR lower and HR upper), and global statistical significance of the model (z and p.value). From a clinical standpoint, it is important to distinguish between CHB patients who develop HCC early and those who develop HCC late or not. The value of h 0 (t) varies with time but is common to all the subjects. Therefore, to facilitate the use of the results of this study, score noted by S was calculated for each subject according to the following formula: The distribution of S is shown in Figure 2. We then used receiver operating characteristic (ROC) analysis to determine whether S could be used to predict the early onset of HCC. The study design of ROC analysis for HCC onset within X years was summarized in Figure 3. The area under the curve (AUC) was more than 0.85 when discriminating between patients who would develop HCC within 1 year and those who would develop HCC later or not (Figure 4a). For the case of within 3 years, the sensitivity exceeded 0.8 ( Figure 4b). The results from ROC analyses at various numbers of years suggest that the performance is better at discriminating patients who would develop HCC within a few years (Figure 4c). The survival curves for subjects with S of more than 4 and with S of less than or equal to 4 are shown in Figure 5. S values more than 4 were associated with a higher probability of developing HCC within 100 days. From the present study, we can say that a score above 4 indicates a high likelihood of developing HCC within a few years.          Survival curves for subjects with S of more than 4 and with S of less than or equal to 4. p value in the plot was calculated by log-rank test.

Validation of the Model
Since the samples used to construct the Cox proportional-hazards regression model was also used in the ROC analysis, the performance mentioned above is overestimated. Figure 5. Survival curves for subjects with S of more than 4 and with S of less than or equal to 4. p value in the plot was calculated by log-rank test.

Validation of the Model
Since the samples used to construct the Cox proportional-hazards regression model was also used in the ROC analysis, the performance mentioned above is overestimated. Thus, to evaluate the performance of the Cox proportional-hazards regression model with four variables, computer simulations were further performed. In the simulation, 162 subjects were first randomly shuffled and then split into a pair of training and test sets without considering whether they were CHB or HCC. The results form training set were used to predict the outcome of each subject in test set. After 1000 repetitions of this operation, the distributions of Harrell's C-index [15], sensitivity, and specificity were obtained ( Figure 6). The model constructed in this study appears to perform well when detecting patients who would develop HCC within a year, since the mean of Harrell's C-index and the mean of sensitivity exceeded 0.75 and 0.7, respectively.

Discussion
In order to predict the risk of liver carcinogenesis in Japanese patients with hepatitis B, we constructed a prediction model by adding HLA to items that had been incorporated into previous HBV related HCC prediction models, including basic information (including sex, year of birth, and age at first year registration) and blood test results (including HBV-DNA levels, presence or absence of HBe antigen, serum ALT level, platelet counts, serum albumin level, and AFP level). Among these items, the BBJ sample was used to examine four items showing significant association with predicting liver carcinogenesis in Japanese patients with hepatitis B: age, sex, log10AFP, and presence of liver cirrhosis. Among 672 Japanese patients with hepatitis B who were independent of the BBJ sample, we constructed a prediction model for liver carcinogenesis by adding HLA genotypes in 162 cases without any deficiency in the four items. We used the Cox proportional-hazards regression model to predict which CHB patients would develop HCC after blood test, and, as a result, we completed the prediction model based on the four items of age, sex, log10AFP and HLA-A*33:03. Of the four items included in the HCC prediction model constructed in this study, age was included in all five previously reported prediction models.

Discussion
In order to predict the risk of liver carcinogenesis in Japanese patients with hepatitis B, we constructed a prediction model by adding HLA to items that had been incorporated into previous HBV related HCC prediction models, including basic information (including sex, year of birth, and age at first year registration) and blood test results (including HBV-DNA levels, presence or absence of HBe antigen, serum ALT level, platelet counts, serum albumin level, and AFP level). Among these items, the BBJ sample was used to examine four items showing significant association with predicting liver carcinogenesis in Japanese patients with hepatitis B: age, sex, log 10 AFP, and presence of liver cirrhosis. Among 672 Japanese patients with hepatitis B who were independent of the BBJ sample, we constructed a prediction model for liver carcinogenesis by adding HLA genotypes in 162 cases without any deficiency in the four items. We used the Cox proportional-hazards regression model to predict which CHB patients would develop HCC after blood test, and, as a result, we completed the prediction model based on the four items of age, sex, log 10 AFP and HLA-A*33:03. Of the four items included in the HCC prediction model constructed in this study, age was included in all five previously reported prediction models. Sex is also an item incorporated into four of the five prediction models (REACH-B, GAG-HCC, PAGE-B, and mPAGE-B). AFP is an item included only in APA-B among the five prediction models, but it is also included in other HCC prediction models, REAL-B [16] and RWS-HCC [17]. An HCC prediction model incorporating the HLA genotype has never been reported before, and the HCC prediction model constructed in this study is the first report. For AUROC for 3-year HCC prediction, the prediction model constructed in this study had AUROC = 0.86, which was second only to AUROC = 0.88 for GAG-HCC, which had the highest AUROC among the five previously reported prediction models.
To validate the performance of the HCC prediction model based on four variables, 162 cases were randomly shuffled 2:1 as a training set and a test set, and the accuracy was evaluated by predicting the disease state of the test set with the training set. After 1000 repeated evaluations, A C-index of 0.75 or higher and a sensitivity of 0.70 or higher indicated that the model can accurately detect patients at high risk of developing liver cancer. As a result of the AUC from the ROC analysis, the AUC was higher for years closer to the date of the blood test, indicating that the HCC prediction model can predict the development of early liver cancer with high accuracy. Although sensitivity at 1 year after blood test fell below 0.7, both sensitivity and specificity tended to decrease over time, as did AUC. We also found that patients with a predicted score "S" of 4 or higher had a significantly increased risk of developing liver cancer within 100 days. The prediction model constructed in this study, which can distinguish between CHB patients who develop HCC early and those who develop HCC late or not, is clinically meaningful.
Although it is not a prediction model that reflects the incidence of liver cancer in the natural population because the cases were collected retrospectively, it is possible to predict the period from the date of blood test to the time of liver cancer. Prediction of the time to liver cancer based on the condition of the blood test date is expected to provide clinically useful information because appropriate treatment and medication can be given to patients. However, because HLA is known to have very different frequencies of genotypes among populations, the limitation of this study is that the HCC prediction model including HLA genotypes constructed in this study may be less effective in non-Asians. In the future, it is necessary to validate the model with prospectively collected cases, and to test the efficacy in non-Asian populations.
The conclusion of this study is that the current model that can predict the development of liver cancer within a year with high accuracy in hepatitis B patients who have specific HLA genotype can be expected to be a clinically useful tool.

Study Population
A total of 672 Japanese patients aged 18 years and older with chronic HBV infection were recruited in four hospitals, including Hokkaido University Hospital, the University of Tokyo Hospital, Kyushu University Hospital, and Kurume University Hospital. Basic information (including age and sex), blood test results (including date of test, HBV-DNA levels, presence or absence of HBe antigen, ALT, PLT, Alb, and AFP), and date of diagnosis of HCC for each patient were collected from each hospital using the same inquiry form (Table 3). Written informed consent was obtained from all patients involved in the study.
In addition, an independent set of 893 Japanese HBV patients aged 18 years and older was distributed by the BBJ and was used to select the clinical test items to be used in the construction of the HCC prediction model. A total of more than 300 items of clinical information distributed by the BBJ include basic information (including sex, year of birth, and age at first year registration) and blood test results (including HBV-DNA levels, presence or absence of HBe antigen, serum ALT level, platelet counts, serum albumin level, and AFP level). The use of distributed data from the BBJ for this study has been approved by both the NCGM and BBJ ethics committees.

Data Curation of Clinical Items
The outlier test by Smirnov-Grubbs test was performed for the continuous clinical value. Although there were several cases of ALT exceeding 1000 U/L, we decided not to perform the outlier test because patients with acute hepatitis may have elevated ALT levels. There are cases in which AFP exceeds 1000 ng/mL, but we decided not to perform the outlier test because AFP is higher in patients who develop HCC. For PLT with a normal value in the range of 100,000/µL to 454,000/µL and Alb with a normal value of 3.5 g/dL or more, outliers were determined to be excluded by Smirnov-Grubbs test.

HLA Imputation Using Genome-Wide SNP Typing Data
For 511 of the total 672 cases, genome-wide SNP typing was performed on genomic DNA extracted from blood samples collected from patients, and HLA genotype were determined by HLA imputation. Genome-wide SNP typing was carried out using the AXIOM Genome-Wide ASI 1 array, according to the manufacturer's instructions. SNP data from an extended MHC region ranging from 25,759,242 to 33,534,827 bp based on the hg19 position was used to conduct 2-field HLA genotype imputation for six class I and class II HLA genes using the HIBAG R package the same way as in our previous report [18]. The remaining 161 cases were subjected to whole-genome sequencing using DNA extracted from non-cancerous portions of hepatocellular carcinoma tissue samples. Whole genome sequencing (WGS) was performed using the TruSeq DNA PCR Free Kit (Illumina) to create libraries and NovaSeq 6000 (Illumina) to acquire data at 150 bp pair-end, according to the manufacturer's instructions. HLA genotypes to WGS were determined by HLA calling using HLA-VBSeq pipeline [19,20]. Initially, a preprocessing pipeline aligned the human reference assembly GRCh38 without alt contigs and reference sequence of HBV genotypes A to D (GencodeID: AB246337, AB246339, AB246344 and AF043594) using bwa (ver. 0.7.17) [21]. Next a preprocessing step extracted the sequenced reads mapped to the HLA chromosomal regions and unmapped sequenced reads. Using these extracted reads and the HLA reference panel constructed from the IPD-IMGT3.45.0 reference sequences [22], the HLA-VBSeq pipeline [20] called the HLA genotype combinations.

Constructing the Cox Proportional-Hazards Regression Model to Predict the Onset of HCC
In this study, to construct the Cox proportional-hazards regression model for predicting the onset of HCC in CHB patients, we used 162 subjects (83 HCC and 79 CHB patients) with no missing values for the following seven variables: age, sex, LC, log 10 AFP, HLA-A*33:03, HLA-A*31:01, and HLA-DPB1*02:01. A stepwise variable selection based on the AIC was performed for the Cox proportional-hazards regression model with the seven initial variables ( Table 4). The survival time of each subject was defined as the time from blood test date to date of liver cancer onset (when CHB patient developed HCC) or to the date of censoring (when CHB patient did not develop HCC at the time of study). A receiver operating characteristic (ROC) analysis was conducted to examine if the hazard function obtained from the Cox proportional-hazards regression model with four variables-age, sex, log 10 AFP, and HLA-A*33:03-could be used to predict CHB patients who develop HCC within the specified number of years. In the ROC analysis, CHB patients who developed HCC later than the specified number of years were regarded as ones who did not develop HCC, and CHB patients who were censored within the specified number of years were excluded from the analysis. The optimal threshold was defined as the point closest to the top-left corner in the ROC plane. Sensitivity and specificity were calculated at the threshold. The R packages "survival" and "pROC" were used to carry out the Cox proportional-hazards regression analysis and the ROC analysis, respectively. The survival curve analysis was conducted using R package "survminer".

Validation of the HBV-HCC Model by Computer Simulation
To evaluate the performance of the Cox proportional-hazards regression model with four variables, computer simulations were performed. A total of 162 subjects were first randomly shuffled and then split into a pair of training and test sets without considering whether they were CHB or HCC (training and test sets were composed of 2/3 and 1/3 of the 162 subjects). The training set was used to construct a Cox proportional-hazards regression model, and the threshold in the ROC analysis for the specified number of years was obtained. The results of the training set were used to compute the hazard function for each subject in the test set, and Harrell's C-index, sensitivity, and specificity were obtained for the test set. This procedure was repeated 1000 times.