Whole-Genome Analysis of Human Papillomavirus Type 16 Prevalent in Japanese Women with or without Cervical Lesions

Recent large-scale genomics studies of human papillomaviruses (HPVs) have shown a high level of genomic variability of HPV16, the most prevalent genotype in HPV-associated malignancies, and provided new insights into the biological and clinical relevance of its genetic variations in cervical cancer development. Here, we performed deep sequencing analyses of the viral genome to explore genetic variations of HPV16 that are prevalent in Japan. A total of 100 complete genome sequences of HPV16 were determined from cervical specimens collected from Japanese women with cervical intraepithelial neoplasia and invasive cervical cancer, or without cervical malignancies. Phylogenetic analyses revealed the variant distribution in the Japanese HPV16 isolates; overall, lineage A was the most prevalent (94.0%), in which sublineage A4 was dominant (52.0%), followed by sublineage A1 (21.0%). The relative risk of sublineage A4 for cervical cancer development was significantly higher compared to sublineages A1/A2/A3 (odds ratio = 6.72, 95% confidence interval = 1.78–28.9). Interestingly, a novel cluster of variants that branched from A1/A2/A3 was observed for the Japanese HPV16 isolates, indicating that unique HPV16 variants are prevalent among Japanese women.


Viral Whole-Genome Amplification and Next Generation Sequencing
Based on the genotyping results, DNA samples positive for HPV16 were subjected to long-range PCR to amplify the whole-genome sequences of HPV16, as described previously [30]. Full-circle PCR or overlapping PCR was performed using PrimeSTAR ® GXL DNA polymerase (Takara, Ohtsu, Japan) with the following primers: for full-circle PCR, HPV16-1742F (5 -TGC TGT CTA AAC TAT TAT GTG TGT CTC-3 ) and HPV16-1873R (5 -GCG TGT CTC CAT ACA CTT CA-3 ); for overlapping PCR, HPV16-1744F (5 -TGT CTA AAC TAT TAT GTG TGT CTC CAA TG-3 ) and HPV16-5692R (5 -GAT ACT GGG ACA GGA GGC AAG TAG ACA GT-3 ); HPV16-5531F (5 -GGG TCT CCA CAA TAT ACA ATT ATT GCT G-3 ) and HPV16-1980R (5 -TAT CGT CTA CTA TGT CAT TAT CGT AGG CCC-3 ). Among 238 HPV16-positive samples subjected to PCR, 100 samples successfully generated PCR products covering the complete viral genome. The amplified DNA separated using agarose gel electrophoresis was then purified with the Wizard gel purification kit (Promega, Madison, WI, USA). The purified DNA was converted to a short-fragmented DNA library using the Nextera XT DNA sample prep kit (Illumina, San Diego, CA, USA), followed by size selection with SPRIselect (Beckman Coulter, Brea, CA, USA). The multiplexed libraries were analyzed on a MiSeq sequencer (Illumina) with the MiSeq reagent kit v3 (150 cycle). The complete genome sequences of HPV16 were de novo assembled from the total read sequences using the VirusTAP pipeline [31] (https://gph.niid.go.jp/cgi-bin/virustap/index.cgi). The average depth of reads covering the HPV16 genome was approximately 11,000 per sample, and the minimum depth of 500 was used to confirm viral SNPs. The accuracy of the reconstructed whole-genome sequences was verified via read mapping with Burrows-Wheeler Aligner v0.7.12 [4] and subsequent visual inspection using Integrative Genomics Viewer v2.3.90 [32]. All HPV16 genome sequences determined were deposited to the DNA Data Bank of Japan (DDBJ) (accession numbers are shown in Table S1), and the raw sequencing data are available from the DDBJ, Sequence Read Archive, under accession number DRA006584.

Sequencing of HPV16 E6/E7
The E6/E7 sequence was amplified using PCR from the total cellular DNA with primers (HPV16-32F,  5 -GTA ACC GAA ATC GGT TGA ACC GAA ACC-3 , and HPV16-998R, 5 -CCT GTA TCA CTG TCA  TTT TCG TTC TCG T-3 ) using PrimeSTAR ® GXL DNA polymerase. The PCR condition consisted of 38 cycles of 98 • C for 10 s, 64 • C for 15 s, and 68 • C for 1 min. Amplification was verified using agarose gel electrophoresis, and successfully amplified DNA (approximately 950-bp length) was purified with the Wizard gel purification kit, followed by direct sequencing with the above-described PCR primers on a 3730 xl sequencer (Applied Biosystems, Austin, TX, USA).

Generation of Human Cervical Keratinocytes Expressing E7
The E7 sequences of A1, A4, and A5 were amplified using PCR with primers (forward, 5 -TTG  CGG CCG CAC CAT GCA TGG AGA TAC ACC TAC ATT GC-3 ; and reverse, 5 -TTG CGG CCG  CTG GTT TCT GAG AAC AGA TGG GGC ACA C-3 ) from clinical specimens, and cloned into the NotI site of p3xFLAG-CMV14 (Sigma-Aldrich, St. Louis, MO, USA) to fuse the 3xFLAG sequence to the C-terminus of the E7 protein. Then, the E7-FLAG sequence was amplified using PCR, and cloned between the PacI and XhoI sites of a retroviral transfer plasmid, pMXs-puro (Cell Biolabs, San Diego, CA, USA). A retrovirus vector expressing E7-FLAG was prepared via transfection of the transfer plasmid into GP2-293 packaging cells together with an envelope expression plasmid, pE-ampho (Takara). Human cervical keratinocytes immortalized with telomerase reverse transcriptase [36] were infected with the recombinant retrovirus expressing E7-FLAG, and selected with 1 µg/mL puromycin for 72 h, followed by culture without the drug for 48 h. The surviving cells were pooled and harvested for western blot analyses.

Statistical Analysis
A generalized linear model with binomial distribution and log link was used to calculate the odds ratio of progression from CIN1 to CIN2/3 or CIN2/3 to squamous cell carcinoma (SCC) for variant A sublineages with 95% confidence intervals (CI). The odds ratio was adjusted for the women's age at the time of diagnosis. Fisher's exact test was performed to evaluate a difference in sublineage distribution between different histological categories. A p-value < 0.05 was regarded as statistically significant. All statistical analyses were performed using R version 3.4.0 (https://cran.r-project.org/).
We then performed next generation sequencing analyses of HPV16 DNA that had been amplified from the clinical samples. By using our established bioinformatics pipeline for reconstruction of HPV whole-genome sequences from short-read sequences [30], we successfully determined 100 complete HPV16 genome sequences (Table S1). The length of the viral genome sequences ranged from 7863 to 7909 bp, with all isolates retaining the authentic HPV genomic organization comprised of six early genes (E1, E2, E4, E5, E6, and E7), two late genes (L1 and L2), and two non-coding regions. Interestingly, nine isolates harbored insertion or deletion in E1 or E2/E4 as follows: five isolates with 60-, 63-, or 66-nucleotide insertions in E1 (five unique isolates); two isolates with 11-or 33-nucleotide deletions in E1; and two isolates with 21-or 42-nucleotide deletions in E2/E4. The recovery of HPV16 whole-genome sequences suggests the presence of viral episomes in the corresponding clinical specimens, but we were unable to exclude a possibility of viral integration because a head-to-tail concatemeric HPV16 genome integrated into the host genome can also yield the viral whole-genome by PCR.
Out of the 100 women enrolled, 90 (90.0%) showed unique HPV16 isolates differing by at least one nucleotide from the sequences isolated from any other woman. Among the remaining 10 women, two identical pairs of isolates were shared by two women each (2 isolates, 4 women) and two identical isolates were shared by three women each (2 isolates, 6 women). Consequently, we obtained a total of 94 different HPV16 whole genome sequences from these Japanese women.
As shown in the phylogenetic tree, the 94 isolates of lineage A were grouped into two major clades including European and Asian variants, respectively. Consistent with a previous study from Japan [23], the prevalence of these two variants within the Japanese HPV16 isolates was similar: European (42 of 94, 44.7%) and Asian (52 of 94, 55.3%). Intriguingly, a small distinct cluster that branched from sublineages A1/A2/A3 near the root of the European clade was observed for the Japanese isolates, including two Japanese isolates previously reported by us (accession numbers, AB818687 and AB818688) [30]. The isolates in this cluster were most closely related to an HPV16 isolate previously reported from Thailand (accession number, FJ610151) [33] (Figure 1), but a similar cluster was not found in HPV16 isolates from the Netherlands ( Figure S1) [37]. A more recent phylogenetic analysis of 211 complete HPV16 genome sequences also failed to detect the cluster [38]. These results suggest that this novel cluster exclusively contains Japanese HPV16 variants, which share a common ancestor with the Thai isolate. Since the complete genome sequences of these isolates and FJ610151 differ from the reference genome sequence of A1 (accession number, K02718) by an average 0.52% (±0.03% standard deviation), A2 (accession number, AF536179) by 0.49% (±0.02%), A3 (accession number, HQ644236) by 0.39% (±0.02%), and A4 (accession number, AF534061) by 0.70% (±0.02%), we hereafter refer to these variants as A5 variants.  Figure 1. Maximum likelihood phylogenetic tree of Japanese HPV16 isolates. The phylogenetic tree was constructed using whole-genome sequences of 100 Japanese isolates obtained in this study, together with those of 14 Japanese isolates previously determined by us, one Thai isolate, and 10 reference strains for HPV16 variant (sub) lineages. Bootstrap values larger than 70% are displayed. Scale bar, nucleotide substitutions per site. A1, A2, A3, and A4: sublineages; B, C, and D: lineages. Gray area indicates a cluster of A5 variants.  The relative risk of cervical cancer development associated with individual sublineages was assessed by restricting the analyses for HPV16 single-infection cases in NILM, CIN, and squamous cell carcinoma (SCC). As shown in Table 2, the prevalence of the A4 sublineage was significantly higher in SCC (24 of 32, 75.0%) than in NILM/CIN1 (8 of 22, 36.4%) (Fisher's exact test, p = 0.006) and CIN2/3 (7 of 22, 31.8%) (Fisher's exact test, p = 0.002). Consistent with these observations, a higher risk of progression from CIN2/3 to SCC was observed for A4 compared to A1/A2/A3 (odds ratio = 6.72, 95% CI = 1.78-28.9). Although the prevalence of A5 was lower in SCC (2 of 32, 6.3%) than in NILM/CIN1 (4 of 22, 18.2%) and CIN2/3 (4 of 22, 18.2%), the difference was not statistically significant (Fisher's exact test, p = 0.21). Accordingly, the relative risk of cervical lesion progression in individuals with A5 was not significantly different compared to the A1/A2/A3 sublineages ( Table 2).

Genetic Variability among Sublineage A4 Variants
As the majority of the Japanese HPV16 isolates (n = 52) were grouped into the sublineage A4, we investigated variations in the whole-genome sequences of the A4 isolates. When compared them to the A4 reference genome sequence (accession number, AF534061), a total of 142 SNPs were detected across the genome of the Japanese isolates. As shown in Table 3, the highest variability was observed for the non-coding region between E5 and L2 (7.46%), followed by E5 (3.57%) and the long control region (LCR) (2.64%), whereas the lowest variability was observed for E1 (0.87%), followed by L1 (1.25%) and E6 (1.26%). Among 117 SNPs detected in the coding regions, 57 were synonymous substitutions and 60 were non-synonymous substitutions. L2 harbored the largest number of non-synonymous substitutions (n = 21), followed by E2 (n = 13) and E5 (n = 8). In contrast, no non-synonymous substitutions were detected in the E7 region. Table 3. SNPs detected in Japanese A4 isolates (n = 52).

Region
Size (

Identification of SNPs Unique for A5 Variants
Multiple sequence alignment of the Japanese HPV16 isolates and reference HPV16 genomes revealed the presence of characteristic SNPs that distinguish A5 variants from other variant (sub) lineages. Table 4 shows these SNPs: C at nucleotide position 645, A at position 3068, C at position 4458, A at position 5042, and G at position 5836. Out of the five SNPs, three led to amino acid changes of the viral proteins as follows: Leu to Phe at amino acid position 28 in the E7 protein, Ala to Thr at position 105 in the E2 protein, and Ser/Pro/Ala to Asp/Asn at position 269 in the L2 protein.

HPV16 Variant Distribution in Cervical Adenocarcinoma
We were not able to amplify the viral whole-genome sequences from the majority of the HPV16-positive adenocarcinoma cases, possibly due to viral integration into the host genome. Thus, by utilizing viral SNPs in the E6/E7 region, we further explored the HPV16 variant distribution among Japanese women with cervical adenocarcinoma. The E6/E7 fragments were successfully amplified from 13 additional samples of adenocarcinoma with HPV16 single-infection and subjected to Sanger sequencing. Variant assignment was performed by reading the nucleotide of (sub) lineage-specific SNPs as follows: G at nucleotide position 145 for lineage A, G at position 647 for sublineage A4, and C at position 645 for A5 variants. Analysis of the 13 samples revealed that, as was the case in SCC, A4 variants were most prevalent in adenocarcinoma (9 of 13, 69.2%), followed by A1/A2/A3 variants (3 of 13, 23.1%) and A5 variants (1 of 13, 7.7%). As shown in Figure 3, when combined with four whole-genome isolates determined via next generation sequencing, the variant distribution in adenocarcinoma cases was as follows: A1/A2/A3 (3 of 17, 17.6%), A4 (11 of 17, 64.7%), A5 (2 of 17, 11.8%), and D2 (1 of 17, 5.9%). No significant difference was observed for the variant distribution between adenocarcinoma and SCC (Fisher's exact test, p = 0.85).

Biological Activity of E7 Variants of Lineage A
The SNPs in the E7 region specific for A4 and A5 variants were non-synonymous substitutions compared to A1 and lead to amino acid changes in the E7 protein as follows: N29S for A4, and L28F for A5. Since these amino acid residues were in close proximity to the conserved LXCXE motif required for binding to the retinoblastoma protein (pRb) (Figure 4A), we examined whether these E7 variations affected its ability to degrade pRb. To this end, human cervical keratinocytes immortalized with telomerase reverse transcriptase were stably transduced with a retrovirus vector encoding FLAG-tagged E7 proteins of A1, A4, and A5, or an empty vector, and the total cell lysates were analyzed using western blotting. As shown in Figure 4B, similar levels of expression were observed for all three E7 variants. However, while the A1 and A4 variants decreased pRb levels to a similar extent, the A5 variant showed a slightly attenuated ability to degrade pRb, although the difference was not statistically significant. In contrast, all E7 variants similarly reduced levels of PTPN14, another tumor suppressor protein that is targeted by E7 [39,40], indicating that these amino acid residues were not involved in the degradation of PTPN14.

Discussion
This study reports for the first time the distribution of HPV16 variant lineages/sublineages in Japan based on the viral complete genome sequences obtained from 100 Japanese women with or without cervical lesions. Although many studies have so far relied on sequences of E6/E7 and the LCR for assignment of HPV16 variant (sub) lineages, these regions lack appropriate diagnostic SNPs for some (sub) lineages [19]. Using the viral whole-genome sequence is thus the most reliable procedure to determine (sub) lineage classification for each HPV16 isolate.
The overall distribution of HPV16 variants in Japan was strongly biased toward lineage A (94.0%), within which the frequency of European and Asian variants was similar (42.0% and 52.0%, respectively). This pattern of variant distribution is consistent with a previous study analyzing cervical specimens from CIN/ICC patients in Japan [41], which demonstrated the variant distribution as 54.4% for European variants and 44.9% for Asian variants [38]. On the other hand, a much higher prevalence of Asian (80.6%) versus European (19.4%) variants was reported in another study [42], which collected cervical swab samples from female sex workers in Japan; this may be due to demographic differences in sample source compared to our study.
In agreement with a previous Chinese study including a meta-analysis of a relative risk of the A4 sublineage [24], we find a significantly higher risk of A4 for the development from high-grade CIN to ICC in Japanese women compared to A1/A2/A3. Intriguingly, a recent study in the United States demonstrated that a higher risk of A4 for CIN3 and cervical cancer was observed for Asian women compared to white/Hispanic women [19], which suggests that some matching of viral-host ethnicity plays an important role for this differential risk. Since HPV16 A4 variant genomes encode an E7 protein with a characteristic N29S substitution, we focused on whether this E7 substitution led to an enhancement of its major oncogenic activities, namely pRb-and PTPN14-degradation activities, thereby contributing to the higher risk of A4. However, there were no apparent differences in degradation activity between the E7 variants of A4 and A1, although the human cervical keratinocytes used in this study were derived from a Japanese woman and ethnically match the A4 sublineage. These results suggest that the viral gene(s) other than E7 and/or the different activities of the LCR might be responsible for the higher risk of cervical cancer that is associated with A4 in Asian women. Interestingly, one A4-specific SNP in the LCR (C at nucleotide position 7176, in AF534061) generates a consensus binding site for the AP-1 transcription factor (TGANTCA), which may contribute to higher expression of E6/E7 and thereby, higher oncogenic potential of A4.
Although the D2 sublineage reportedly imposes a higher risk for adenocarcinoma compared to SCC [19], we did not observe a higher prevalence of D2 in adenocarcinoma cases among our Japanese HPV16 isolates. This discrepancy may be due to a low prevalence of D2 across all cervical histological grades in Japanese women, which precludes a risk assessment of this sublineage for predisposing HPV16-infected women to adenocarcinoma. Because A4 variants were predominantly detected in both adenocarcinoma and SCC over other variants in this study, the A4 sublineage may facilitate the development of ICC regardless of the histological types in which it is detected.
The E7 L28F substitution, which is characteristic for A5 variants, was previously identified in Japanese sex workers [42], and was also reported from Thailand [33] and northeast China [43]. A more recent study also described the same E7 substitution in one oropharyngeal cancer case in the United States [44]. To date, however, the complete genome sequence of a virus harboring the E7 L28F substitution has only been reported for one isolate in Thailand [33]. This study thus expands the number of whole-genome sequences of A5 variants, enabling detailed analyses of the genomic characteristics of these unique variants.
Besides E7, the whole-genome sequences of A5 variants also contain particular non-synonymous substitutions in E2 and L2. The E2 substitution results in a change from alanine to threonine at position 105 of the E2 protein, which is located in its N-terminal transactivation domain [45]. We found that this substitution greatly reduced the intracellular level of the E2 protein in HeLa cells (data not shown), which is likely to have an impact on its biological functions. Furthermore, the substituted residue at position 269 of the L2 capsid protein lies in its chromatin-binding domain [46], thus potentially affecting the virus entry processes. We speculate that these amino acid changes may collectively lead to different biological behaviors of A5 variants compared to other lineage A variants.
It is interesting to find some degree of prevalence for A5 variants among Japanese women (16 of 100 isolates, 16%). Intriguingly, although not statistically significant, we observed a slightly reduced ability of the E7 protein of A5 to target pRb for degradation. From an evolutionary perspective, the E7 L28F substitution might be unfavorable because it potentially weakens E7's ability to promote the growth of HPV-infected cells through dysregulation of the pRb/E2F-pathways. In this regard, we note that this amino acid residue is located within one of the CD4+ T-cell epitopes of the HPV16 E7 protein [47]. Thus, the fitness cost inflicted by the E7 L28F substitution might be counterbalanced by immune evasion, such as an escape from recognition by helper T-cells, which may facilitate adaptation to the human host in order to maintain long-term persistent infection.
In conclusion, this study revealed the variant distribution of HPV16 in Japanese women, and reports that phylogenetically unique HPV16 variants are present in Japan. Since HPV16 genomic sequences exhibit much higher variability than previously thought [48], it is now critical to investigate the HPV16 genetic diversity in individual countries and regions, as was recently performed in Sweden [49]. Such viral genomics analyses will improve our understanding of the biological and epidemiological impact imposed by HPV genetic changes, and provide novel insight that can be leveraged in order to prevent or treat HPV-induced malignancies.