HBV Drug Resistance Substitutions Existed before the Clinical Approval of Nucleos(t)ide Analogues: A Bioinformatic Analysis by GenBank Data Mining

Naturally occurring nucleos(t)ide analogue resistance (NUCr) substitution frequencies in the reverse transcriptase (RT) of the hepatitis B virus (HBV) were studied extensively after the clinical approval of nucleos(t)ide analogues (NUCs; year of approval 1998). We aimed to study NUCr substitutions in HBV RT sequences obtained before 1998 and better understand the evolution of RT sequences without NUC pressures. Our strategy was to retrieve HBV sequences from GenBank deposited before 1998. The initial search used the keywords “hepatitis B virus” or “HBV” and 1139 sequences were found. Data analyses included information extraction: sequence quality control and amino acid substitution analysis on 8 primary NUCr and 3 secondary substitution codons. Three hundred and ninety-four RT-containing sequences of 8 genotypes from 25 countries in 4 continents were selected. Twenty-seven (6.9%) sequences were found to harbor substitutions at NUCr-related codons. Secondary substitutions (rtL80V and rtV173G/A/L) occurred more frequently than primary NUCr substitutions (rtI169L; rtA181G; T184A/S; rtS202T/R; rtM204L and rtM250K). Typical amino acid substitutions associated with NUCr were of rtL80V, rtV173L and rtT184A/S. We confirm the presence of naturally occurring typical HBV NUCr substitutions with very low frequencies, and secondary substitutions are more likely to occur than primary NUCr substitutions without the selective pressure of NUCs.

substitutions are naturally occurring. However, the serum samples collected before 1998 are difficult to obtain. Table 1. Summary of potential AA substitutions associated with NUCr in treated CHB patients.
In this study, we analyzed HBV RT sequences before the clinical approval of NUCs to characterize the pre-existing NUCr substitutions by using publically available GenBank data. Our study focuses on analyzing the classic NUCr substitutions recommended by several clinical practice guidelines. These results can provide critical information regarding pre-existing (largely due to naturally occurring) NUCr substitutions in HBV RT sequences and a better understanding of HBV evolution without selective pressure of NUCs.

Sequence Search and Qualification Strategy
We performed a GenBank database (https://www.ncbi.nlm.nih.gov/nuccore) search using keywords "hepatitis B virus" or "HBV" to retrieve sequences deposited before December 1998, the year just before the clinical approval of NUCs. Each sequence with a unique accession number was retained after manually removing the redundant sequences by checking duplicate sequence accession numbers. The following sequences were then excluded: (1) artificial synthetic constructs; (2) chromosomally integrated HBV sequences with human genes; (3) non-human HBV sequences; (4) other types of hepatitis virus sequences (hepatitis A virus, hepatitis C virus, hepatitis D virus or hepatitis E virus); (5) expression vector sequences; (6) sequences from patients co-infected with HBV/human immunodeficiency virus (HIV) receiving antiviral therapy against HIV; (7) sequences from CHB patients who participated in the clinical trials (phase I, II and III) for LAM; and (8) HBV sequences without the RT region. After exclusion, the rest of the sequences containing the HBV RT region were qualified for further analysis.

Data Extraction
The selection of each sequence and data extraction were reviewed and performed independently by two authors (X.X. and K.X.). For each qualified sequence, the following data were extracted: the GenBank accession number, related nucleotide sequence, sequence release year, sample origin, and sequence length. References related to specific sequences were thoroughly analyzed to collect the relevant information if the GenBank records did not provide it.

Systematic Analysis of Potential NUCr Substitutions
Potential AA substitutions in RT domain of HBV polymerase associated with NUCr were listed in Table 1. This table was modified from Menéndez-Arias et al. [4] and Liu et al. [11], and updated according to the latest literature. All these AA substitutions have been reported as being selected during NUCs therapy and verified by in vitro phenotypic data.

HBV Drug Resistance Substitution Analysis
HBV RT sequences were extracted from the qualified HBV sequences, which contained various lengths of the RT region. Multiple RT nucleotide sequences were aligned using BioEdit7.0 software (Ibis Biosciences, Carlsbad, CA, USA). All RT nucleotide sequences were translated in frame to AA sequences. Each genetic codon harboring degenerate base was manually translated to the corresponding AA. Classic substitutions at 11 codons that have been often suggested for monitoring NUCr in treated patients were analyzed in this study. The obtained AA sequences were compared with the consensus RT AA sequences to identify the NUCr substitutions at rtL80, rtI169, rtV173, rtL180, rtA181, rtT184, rtA194, rtS202, rtM204, rtN236 and rtM250 [2,[54][55][56]. The codons that were mixed with wild-type and substituted AAs were recorded as indicating the presence of a substitution, which might reflect the presence of HBV quasispecies [11,54]. All sequences with NUCr substitutions were further determined by an HBV drug resistance interpretation online tool (http://www.hiv-grade.de/HBV_grade/deployed/grade.pl?program=HBValg).

Statistical Analysis
Statistical analysis was performed using SPSS 20.0 software (IBM, Armonk, NY, USA). Chi-square tests were used for categorical data. p Values were calculated with the two-tailed method. p Value < 0.05 was considered statistically significant.

Selection for RT-Containing HBV Sequences
A total of 1139 sequences with unique accession number were found from the GenBank nucleotide database using the keywords "hepatitis B virus" or "HBV" before December 1998. The sequence exclusion and inclusion strategy and processing were illustrated by the flowchart in Figure 1. Seven hundred and eight HBV sequences were identified, in which 314 sequences were excluded because they did not harbor an RT region. Finally, 394 RT-containing (full-length: 134; partial-length: 260) sequences were included for further analysis. Background information (i.e., accession numbers, release time, origin, RT coverage and sequence length) for all qualified sequences is presented in Supplementary Materials (Table S1).

Quality Control for HBV RT-Containing Sequences
The selected RT sequences were further evaluated in terms of their coverage of the studied NUCr-related codons. One hundred and thirty-four sequences were full-length RT sequences, which contained all 11 codons of interest. The other 260 partial-length RT sequences ranged from 23 to 305 AAs in length with a median length of 226 AAs. Figure 2A illustrates the overlapping features of the recruited RT sequences. Our data suggested that most of previous studies appeared to primarily focus on HBV small surface proteins, especially on the 'a' determinant region, which is overlapped by the RT region. The RT sequences were further grouped according to their coverage, as shown in Figure 2B. In total, 90.1% (355/394) of the recruited RT sequences carried 5-11 analyzed NUCr substitution codons and only 28 sequences (7.1%) did not cover any studied NUCr codons.
Due to the diversity of the sequences in length and their coverage of the studied codons, the substitution analysis strategy was codon-based rather than sequence-based. The numbers of the available AAs at 11 codons in RT were summarized in Figure 2C. Thus, the substitution frequency was calculated as the percentage of the numbers of substituted AAs with respect to the available AAs at this codon.

Quality Control for HBV RT-Containing Sequences
The selected RT sequences were further evaluated in terms of their coverage of the studied NUCr-related codons. One hundred and thirty-four sequences were full-length RT sequences, which contained all 11 codons of interest. The other 260 partial-length RT sequences ranged from 23 to 305 AAs in length with a median length of 226 AAs. Figure 2A illustrates the overlapping features of the recruited RT sequences. Our data suggested that most of previous studies appeared to primarily focus on HBV small surface proteins, especially on the 'a' determinant region, which is overlapped by the RT region. The RT sequences were further grouped according to their coverage, as shown in Figure 2B. In total, 90.1% (355/394) of the recruited RT sequences carried 5-11 analyzed NUCr substitution codons and only 28 sequences (7.1%) did not cover any studied NUCr codons.
Due to the diversity of the sequences in length and their coverage of the studied codons, the substitution analysis strategy was codon-based rather than sequence-based. The numbers of the available AAs at 11 codons in RT were summarized in Figure 2C. Thus, the substitution frequency was calculated as the percentage of the numbers of substituted AAs with respect to the available AAs at this codon.

RT Sequence Origins and HBV Genotypes
The finally analyzed 394 HBV RT sequences were found from 25 countries within 4 continents. The sequences from Africa, America, Asia and Europe accounted for 4.8%, 23.1%, 41.9% and 30.2%, respectively. Eight HBV genotypes (A: 27.4%; B: 9.4%; C: 31.0%; D: 16.8% E: 2.3%; F: 10.7%; G: 0.5%; H: 0.5%) were found ( Figure S1). Moreover, four recombinant sequences were found, which did not affect the NUCr-related codons and could be used for substitution analysis ( Figure S2). The geographical distribution of HBV genotypes was consistent with previous reports [57]. Notably, although the dominant genotype in America was F (35.2%), these sequences were mainly from Central and South America (Argentina), not from North America.

NUCr Substitution Analysis
Among the 394 sequences, 27 (6.9%) were found to harbor substitutions at NUCr-related codons ( Table 2). Eleven sequences had substitutions at 6 codons associated with primary NUCr, and 16 carried secondary substitutions at 2 codons. However, no combination substitutions at multiple codons were found. No substitution at rtA194, rtN236, or rtL180 was detected.

RT Sequence Origins and HBV Genotypes
The finally analyzed 394 HBV RT sequences were found from 25 countries within 4 continents. The sequences from Africa, America, Asia and Europe accounted for 4.8%, 23.1%, 41.9% and 30.2%, respectively. Eight HBV genotypes (A: 27.4%; B: 9.4%; C: 31.0%; D: 16.8% E: 2.3%; F: 10.7%; G: 0.5%; H: 0.5%) were found ( Figure S1). Moreover, four recombinant sequences were found, which did not affect the NUCr-related codons and could be used for substitution analysis ( Figure S2). The geographical distribution of HBV genotypes was consistent with previous reports [57]. Notably, although the dominant genotype in America was F (35.2%), these sequences were mainly from Central and South America (Argentina), not from North America.

Discussion
To the best of our knowledge, this is the first study to analyze HBV NUCr-related substitutions from HBV RT sequences obtained before the clinical approval of NUCs. Our results demonstrate the pre-existence of HBV NUCr substitution, which may largely attribute to the true naturally occurring substitutions and reflect the natural evolution characteristics of HBV variants with NUCr-related substitutions.
The acquisition of high-quality data is critical to a superior GenBank-based data mining study. After a rigorous screening process, 394 HBV RT sequences belonging to genotype A to H from 25 countries in 4 continents were recruited for this study, which showed a wide range of data sources. Moreover, a codon-based rather than a sequence-based method was used to calculate the substitution frequency because qualified HBV RT sequences had different lengths. This strategy could ensure not only an accurate calculation of the substitution frequency at each studied codon, but also an effective use of the available information.
In this study, AA substitutions at NUCr-related codons of RT were found in 72.7% (8/11) of the studied codons with frequencies ranging from 0.3% to 2.5% (Table 2). Secondary substitutions (1.6%) occurred more frequently than primary NUCr substitutions (0.5%). Interestingly, 59.3% (16/27) had an atypical AA substitution type at these codons, meaning that their roles in NUCr and viral fitness are still unknown, as is the case for rtM204L. This phenomenon was more obvious at the primary NUCr substitution codons (81.8%, 9/11) than at the secondary substitution codons (43.8%, 7/16) (Table 2). Therefore, our findings highly suggested that HBV could naturally mutate at these codons free of NUCs, but those variants carrying typical primary NUCr substitution(s) hardly evolve, probably due to their reduced survival fitness [2].
Our results show that only 11 sequences (2.8%) had typical NUCr-related substitutions. Substitutions rtL80V (n = 7) or rtV173L (n = 2) are typical secondary substitutions that can restore functional defects of HBV polymerase caused by primary LAM or LdT resistance substitutions [2,4,11]. These naturally occurring secondary substitutions might reduce the resistance barrier to LAM or LdT [2]. Substitution rtT184A/S (n = 2) was the only naturally occurring substitution detected as belonging to a typical primary NUCr substitution. It has been related to ETV resistance in the presence of the rtM204V + rtL180M backbone [73]. However, rtT184A/S alone would not lead to clinical resistance to ETV [74,75].
In addition, we found that the detected atypical substitutions of rtI169L, rtA181G, rtS202T/R, rtM204L, and rtM250K were not reported in NUCs-treated patients by the Stanford HBV Drug Resistance Database (https://hivdb.stanford.edu/HBV/HBVseq/development/HBVseq.html; accessing date: 17 January 2017). This suggests that these atypical substitutions might not play important roles in the emergence of NUCr as compared to rtA181T/V, rtS202C/G/I, rtM204I/V and rtM250I/L/V [2,76,77]. Consistent with our findings, atypical substitutions of rtI169L, rtS202R, and rtM204L have been detected in six treatment-naïve CHB patients with low proportions using a high-throughput sequencing method [78]. Kim et al. found that three patients had an rtI169L substitution at a level of 1-8% using a clone sequencing method [79]. We speculate that the reasons for the presence of such atypical substitutions might be due to a quasispecies property of HBV or sequencing errors [80][81][82]. Moreover, some variants might not be necessarily viable or infectious. Current evidence suggests that these atypical pre-existing variants have a limited role in the development of NUCr. Tenney et al. reported that substitution at rtI169 could only modestly decrease the susceptibility to ETV in the recombinant viruses with LAM resistance and rtM250V substitutions, increasing the viral fitness for survival [74]. Furthermore, in vitro studies showed that substitutions rtS202R and rtM250K would severely impair the replication capability of HBV variants with LAM resistance substitutions (rtL180M + rtM204V), while substitution rtS202T played only a minor role in ETV resistance [77]. Substitution rtM204L was shown to be sensitive to LAM in a study [83]. No study has been performed to elucidate the relationship between substitution rtA181G and ADV resistance.
Except for substitutions detected at eight codons, our data showed no substitutions at rtA194, rtN236 and rtL180. Since there were only 155 AAs at rtN236 available for analysis, this data might not be as solid as for rtA194 and rtL180, each of which had 332 and 355 available AAs, respectively ( Figure 2). Moreover, by searching the Stanford HBV Drug Resistance Database, we found that rtA194, rtN236 and rtL180 in NUC treatment-naïve patients had very low substitution frequencies at 0.4% (14/3218), 0.03% (1/3218) and 0.2% (6/3218), respectively [84]. Considering that HBV RT sequences in this database were obtained both before and after 1998, our data together with their data suggested that the true naturally occurring substitutions at these codons should be very low.
We also found that genotype C sequences had the highest substitution frequency (16/122, 13.1%) at NUCr-related codons. However, we should notice that only 7 of 16 sequences harbored typical NUCr AA substitution types, i.e., 6 of rtL80V and 1 of rtT184A. Notably, 4 of 6 rtL80V-containing sequences were from different clones of the same patient. Thus, the naturally occurring NUCr substitutions in different HBV genotypes might not be as different as they seemed to be. This is in line with many previous drug resistance studies carried out after 1998 [2].
Although this study is novel in investigating the occurrence of NUCr substitutions before NUCs were approved, some limitations exist. Firstly, there was a limited number of sequences available in the GenBank database before 1998. Secondly, most sequences were from Asia, Europe and America (375/394, 95.2%) and data from Africa, a highly endemic area of HBV infection, is not adequate. Thirdly, some minor mutants in the quasispecies pool might not be detected due to the lack of a sensitive detection method. In fact, most substitutions (20/27, 74.1%) found in this study were determined by clone sequencing. Fourth, the possibility of sequencing errors or potential typos while submitting to GenBank could not be completely ruled out, especially for relatively rare AA substitutions, although the original studies submitting these sequences have been carefully reviewed.
In conclusion, we confirmed the presence of naturally occurring typical NUCr substitutions of rtL80V, rtV173L and rtT184A/S with low frequencies before the clinical approval of NUCs. Atypical substitutions of rtI169L, rtA181G, rtS202T/R, rtM204L, rtM250K were also detected. However, more clinical studies are still needed to explore the impacts of these baseline NUCr-related substitutions on antiviral responses.
Supplementary Materials: The following are available online at www.mdpi.com/1999-4915/9/8/199/s1, Figure S1: HBV genotypes identified in this study and their geographical distributions; Figure S2: The HBV sequences with recombinant genotypes found in this study; Figure S3: The distribution of HBV RT sequences with NUCr-related substitutions in different countries; Table S1: Summary of the relevant information of HBV sequences used in this study.