Direct Comparison of HPV16 Viral Genomic Integration, Copy Loss, and Structural Variants in Oropharyngeal and Uterine Cervical Cancers Reveal Distinct Relationships to E2 Disruption and Somatic Alteration

Simple Summary HPV16 causes approximately 60% of uterine cervical cancer and 95% of HPV-driven oropharynx cancers. Despite both being HPV-associated, these tumors are very different. We directly compare integration of the HPV16 genome into the human genome in these two diseases, finding that the viral gene E2 is frequently lost in cervical cancer, but usually maintained in oropharyngeal cancer. We also found that oropharyngeal cancers with integration have many more integration sites per tumor and these more frequently occur in genomic regions with a high density of genes. Abstract Squamous cell carcinoma of the oropharynx caused by HPV type 16 (HPV16+ OPSCC) is the most common HPV-associated malignancy in the USA and has many molecular differences from uterine cervical squamous cell carcinoma (UCSCC). Our understanding of HPV oncogenesis relied on studies of UCSCC revealing a consensus model reliant on HPV integration with a loss of E2. Here, we compare patterns of HPV integration in UCSCC and OPSCC by analysis of affinity capture sequencing of the HPV16 genome in 104 OPSCC and 44 UCSCC tumors. These cohorts were contemporaneously sequenced using an identical strategy. Integration was identified using discordant read pair clustering and assembly-based approaches. Viral integration sites, structural variants, and copy losses were examined. While large-scale deep losses of HPV16 genes were common in UCSCC and were associated with E2 loss, deep copy losses of the HPV16 genome were infrequent in HPV16+ OPSCC. Similarly, structural variants within HPV16 favored E2 loss in UCSCC but not OPSCC. HPV16 integration sites were non-random, with recurrent integration hot-spots identified. OPSCC tumors had many more integration sites per tumor when compared to UCSCC and had more integration sites in genomic regions with high gene density. These data show that viral integration and E2 disruption are distinct in UCSCC and OPSCC. Our findings also add to growing literature suggesting that HPV tumorigenesis in OPSCC does not follow the model developed based on UCSCC.


Introduction
Papillomaviruses are species-specific, non-enveloped, double-stranded DNA viruses with an about 8kb circular genome protected by a 55 nm icosahedral capsid [1][2][3][4]. Human Important differences between HPV driven oropharyngeal and uterine cervical carcinomas continue to emerge; however, no comprehensive study has compared HPV integration sites and HPV structural genomic aberrations present in these two cancers. To define similarities and differences, we used identical ultra-deep targeted sequencing of HPV16 from 104 oropharyngeal and 44 uterine cervical tumors. Here, we present an in-depth investigation that reveals drastic dissimilarities in HPV integrations and structural genomic aberrations in OPSCC and UCSCC. These findings emphasize their individual biology and suggest that HPV-driven carcinogenesis is different between these tumor types.

Materials and Methods
Human Subjects-DNA sequencing of data was performed as a part of the UNCseq tumor sequencing program. Tumor sample identifiers and genomic data were derived from the clinical trial LCCC1108: Development of a Tumor Molecular Analyses Program and Its Use to Support Treatment Decisions. This IRB-approved trial opened in 2011. All studies were done with the approval of our Institutional Review Board, patient participation required written informed consent, and all studies were conducted in accordance with recognized ethical guidelines as described in U.S Common Rule. By means of a chart review of electronic medical records, demographic information was obtained for each study subject, including age, gender, race, and smoking history. The clinical stage at presentation according to the AJCC staging system (AJCC 8th edition) was recorded considering that many patients did not receive pathological staging.
DNA Isolation, Library Preparation, and Sequencing-A pathologist examined H&Estained slides from each case to confirm the diagnosis of squamous cell carcinoma. Automated DNA extraction was from FFPE tissue sections using the Promega Maxwell MDx16 TM instruments (Promega) and then fragmented by sonication. Subsequent quality assessments were performed by ultraviolet absorbance and quantity assessments. During DNA isolation and library preparation, DNA concentration was measured by fluorometry and DNA quality was evaluated using the Agilent 2100 Bioanalyzer high sensitivity assay. DNA libraries were pooled for deep sequencing using an Illumina HiSeq2500 TM sequencer. The UNCseq targeted sequencing platform involves sequencing exons of a custom list of 650 human genes (covering 3.4 M bases) and 10 pathogen genome segments in fixed or frozen cancer tissue and matched germline DNA from consenting local patients. This custom sequencing platform provided deep coverage of all HPV16 open reading frames. For general metrics on human gene sequencing quality, read depth, and coverage statistics, please see our recent report which reviewed these features of the data set [37]. Excluding the hypervariable region 3150-3351, the median coverage of the HPV16 genome was 7904 with an IQR of 6867-7953. The hypervariable region 3150-3351 had a median coverage of 3307 with an IQR of 837-7377.
Inclusion Criteria-The UNCseq database was queried for p16+ tumors originating from the anatomic oropharynx (tonsil or tongue base) with available tumor sequencing data as well as data on stage, treatment strategy, clinical outcome, and histopathology available. HPV16 positivity was confirmed by DNA sequencing reads which mapped to the HPV16 genome (see above comments). Patients were excluded from clinical analyses if tumors were not p16+. The UNCseq database was also queried for squamous cell carcinomas of the uterine cervix; HPV16 positivity was confirmed by DNA sequencing reads which mapped to the HPV16 genome (see Viral Copy Number Analysis section below). For both tumors from the oropharynx and uterine cervix, tumors with non-HPV16 genotypes were excluded from the study.
Assigning HPV16 Positivity and Viral Copy Number Analysis-Raw reads were aligned to the human genome plus a comprehensive library of HPV virus sequences, using compiled reference sequences from the ViFi analysis pipeline [38]. An HPV16 viral copy number was estimated based on the ratio of reads mapping to the HPV16 (NC_001526.4) genome and HG19. Log 10 (Reads HPV16 /Reads HG19 ) > −4 was used as an empiric cut off for HPV positivity. This threshold was selected based on an obvious bimodal distribution in the data; see our prior publication justifying this criterion [21]. The Samtools Idxstats() function was used to quantify reads mapping to HPV16 and human chromosomes [39]. For HPV16positive tumors, Log 10 (Reads HPV16 /Reads HG19 ) calculated as above was considered as a relative metric of the HPV16 copy number.
Integration Site Calling-Raw reads were aligned to the human genome plus a comprehensive library of HPV virus sequences, using compiled reference sequences from the ViFi analysis pipeline [38]. The ViFi pipeline was used to identified discordant (humanviral) read pairs, which cluster to potential integration sites in the human and HPV genomes [38]. Breakpoints with more than 50 clustered discordant read pairs were classified as positive for integration. As a second method of detecting integration, an independent assembly structural variant caller, svABA, was utilized with default settings [40]. Alignment BAM files output from ViFi, which included hg19 as well as the HPV16 genome, were used as inputs for svABA. Contiguous sequences assembled by svABA were matched to discordant read pair clusters from ViFi. Discordant read pair clusters that were with fewer than 50 read pairs were also considered suggestive of integration if a continuous sequence supporting a HPV-human structural variant was also identified by svABA and passed the quality filter independently, or contained at least 5 reads spanning a HPV-human junction also identified by a ViFi cluster. Integration sites were identified from individual HPV-human breakpoints, looking for clusters of breakpoints within 1 Mb of each other using in-house scripts.
Assigning Gene Effects-The R SMITE [41] and GenomicRanges [42] packages were used to identify the nearest gene and minimum distance to it, for each high-confidence breakpoint. Based on prior reports demonstrating altered gene expression of genes as much as 500 Kb away from the gene of interest [7], we used ±500 kb as a threshold for genes whose functions may be affected by integration.
Structural Variant Calling and Analysis-We also considered structural variants which joined disparate aspects of the HPV16 genome. To distinguish these form HPV16human structural variants (i.e., integration breakpoints), we will refer to these events as HPV-HPV structural variants. HPV-HPV structural variants were identified from the svABA analysis as described above. HPV-HPV structural variants were only considered if they passed svABA quality filters. Additional filtering to assess for biologically relevant structural variants was performed based on variant allele frequencies (VAFs) estimated by svABA. The thresholds for VAF filters are described in the results section. Viral Deep Copy Loss Analysis-Considering that HPV+ cancer cells often contain multiple copies of the viral genome, the loss of genomic material in some copies of the virus might result in a decreased copy number (shallow loss) of some viral genes. However, loss of genomic material in all or the vast majority of copies, or deep copy loss, would be expected to result in the most prominent functional differences. In this manuscript, we focused on copy number analysis on regions of deep viral copy losses, which represent the complete or nearly complete loss of the genomic regions in question. To identify these events, nucleotide-specific estimates of read depth were acquired using the Samtools depth function [39]. Relative tumor-specific read depths were calculated as the ratio of read depth to the tumor's median read depth in the HPV16 regions corresponding to E6 and E7. Nucleotides with read depths <0.2% of median E6/E7 coverage were considered to have deep loss of the involved nucleotides. Nucleotides with <1% of median E6/E7 coverage that bordered on both side areas of <0.2% deep loss were also considered to be a component of the same deep loss event. Deep losses involving only nucleotides in the hypervariable regions 3150-3351 were excluded from analysis as this area had low coverage across the entire data set.

Integration Analyses
Viral integration sites were identified in approximately 75% of HPV16-associated oropharyngeal (OPSCC, n = 104) and uterine cervical (UCSCC, n = 44) cancers (see Figure 1A). Individual integration sites, defined as being within a mega-base region of a chromosome, had similar numbers of HPV-human breakpoint junctions per site in OPSCC and UCSCC, with some sites having 10 or more breakpoint junctions ( Figure 1B). The viral copy number was not related to the viral integration status of tumors in either OPSCC or UCSCC; however, the viral copy number was higher in OPSCC as compared to UCSCC regardless of integration ( Figure 1C). When analysis was restricted to OPSCC and UCSCC tumors with integration, the viral copy number correlated with the number of integration sites (Spearman's Rho = 0.49, p < 5 × 10 −8 ). Amongst tumors with HPV16 integration, OPSCC tumors had significantly more integration sites per tumor (see Figure 1D). Although not statistically significant, integration preferences by chromosome were dissimilar between UCSCC and OPSCC, with chromosome 7 being preferred in UCSCC and chromosome 19 preferred in OPSCC (see Figure 1E). The distribution of sites differed significantly from a random model in OPSCC ( Figure 1F), and based on chromosome size, integrations in chromosome 19 were independently enhanced above expectations (chi-squared test, p < 0.001). Integrations in areas of the genome with the highest gene density (top 5%) were also enhanced in OPSCC as compared to UCSCC ( Figure 1G), and the relative number of genes with integration sites within 500 kB was higher in OPSCC ( Figure 1H). Similarly, OPSCC tumors typically had more chromosomes affected by integration sites (see Figure 1I).
Individual sites were mapped to the closest gene, based on the median breakpoint position (see Figure 1J). In agreement with prior reports, recurrent integration sites were noted in both OPSCC and UCSCC. The most common genes associated with integration sites are displayed in Figure 1J or Figure 2. The genomic distribution of individual integration sites is displayed in Figure 2. A gross preference for peri-centromeric integrations were noted (see Figure 2) [43].

Deep Copy Loss Analyses
Deep copy-losses within the HPV genome were identified based on read depth as compared to E6 and E7 in the same tumor (see Methods). The regions of deep loss identified are displayed in Figure 3A. Deep losses were much more common and larger in UCSCC as compared to OPSCC ( Figure 3B,C). Only 2 out of 104 (2%) OPSCC tumors had large-scale losses of > 10% of the viral genome, whereas 15 of 44 (34%) of UCSCC had such losses ( Figure 3A). Tumors with deep losses of the HPV genome were usually integrated ( Figure 3D), with only one UCSCC tumor with deep loss lacking integration. Deep losses involving E1, E2, and E5 were more common in UCSCC as compared to OPSCC ( Figure 3E). E2 was the gene most commonly lost in UCSCC and the HPV gene most differentially lost as compared to OPSCC ( Figure 3E). More specifically, 27% of UCSCC tumors and 5% of OPSCC tumors harbored deep losses involving E2. As expected, the upstream regulatory region (URR) containing the major early promoters, as well as E6 and E7 genes, were universally spared from deep losses in both diseases ( Figure 3E).    losses ( Figure 3A). Tumors with deep losses of the HPV genome were usually integrated ( Figure 3D), with only one UCSCC tumor with deep loss lacking integration. Deep losses involving E1, E2, and E5 were more common in UCSCC as compared to OPSCC ( Figure  3E). E2 was the gene most commonly lost in UCSCC and the HPV gene most differentially lost as compared to OPSCC ( Figure 3E). More specifically, 27% of UCSCC tumors and 5% of OPSCC tumors harbored deep losses involving E2. As expected, the upstream regulatory region (URR) containing the major early promoters, as well as E6 and E7 genes, were universally spared from deep losses in both diseases ( Figure 3E).  NCR-hypervariable non-coding region with poor coverage (also orange in Panel A). URR-upstream regulatory region. * p-value < 0.05. ** p-value < 0.005. *** p-value < 0.0005.

HPV-HPV Structural Variant Analyses
We also investigated structural variants (SVs) within HPV16 itself (HPV-HPV SVs). These variants were identified by the svABA pipeline as passing the default quality filter, and were then further filtered by variant allele frequency to look for potentially biologically relevant variants ( Figure 4A). There was a non-significant trend towards increasing numbers of SVs in OPSCC as compared to UCSCC ( Figure 4B). Interestingly, HPV-HPV SVs of moderate VAF in UCSCC were mostly identified in tumors without integration, and this relationship was distinct from OPSCC ( Figure 4C). The reliability of the SV calls and their potential biological relevance was bolstered by relative copy number alterations in HPV that correlated with the identified HPV-HPV SVs. More specifically, all high VAF HPV-HPV SVs had obvious discontinuities in coverage at the identified breakpoints, which also correlated with the orientation of the break-end pair. Two examples of this are shown in Figure 4D-E, where an HPV-HPV SV is the origin of a relative amplification involving L1 in an OPSCC tumor (Case A) and E2 is partially lost in a UCSCC tumor (Case B). Locations of HPV-HPV SV breakpoints appeared to be non-random and grossly favored regions of L2 and the URR, Figure 4F. Examining HPV genomic regions based on their size also demonstrated that breakpoint locations were not random for OPSCC or UCSCC, with UCSCC favoring breakpoints that involved E2 ( Figure 4G). Interesting HPV-HPV SVs that caused a break in E2 were more common in UCSCC as compared to OPSCC ( Figure 4H). Considering that UCSCC had both an increased percentage of tumors with deep loss of E2 ( Figure 3) and a greater number of HPV-HPV SVs causing loss of all or part of E2, UCSCC has a much larger percentage of tumors with E2 alteration as compared to OPSCC ( Figure 4H).  Taken together, these results indicate that both integration and HPV-HPV SVs are selected for E2 loss in UCSCC, but not OPSCC. Conversely, the higher frequency of integration sites per tumor and preferential integration in a gene-dense area of the genome, suggests that integration sites in OPSCC are selected for the oncogenic properties resulting from effects on the host genome.

Discussion
Our direct comparison of HPV integration in oropharyngeal and cervical tumors revealed several important dissimilarities supporting the idea that HPV-driven carcinogenesis is different between these cancers. Below, we summarize major points highlighting their discrete biology.
The Different Role of E2 in HPV Carcinogenesis-Integration of the HPV genome into the human genome is an established hallmark of HPV-mediated carcinogenesis. Given that the HPV E2 protein negatively regulates expression of viral oncoproteins E6 and E7, another hallmark of this model is E2 loss. This HPV carcinogenesis model was developed from analysis of UCSCC, and our analyses of UCSCC are supportive since integration was detected in the vast majority and E2 was the HPV gene most commonly altered by deletion (30%) and SVs (40%). On the other hand, integration was less frequent in OPSCC, and even in integrated tumors, E2 loss by deep deletion or HPV-HPV SVs was significantly less frequent in OPSCC, where L2 was found to be the HPV gene most frequently altered by deletion and SVs ( Figure 3E or Figure 4G). Although more limited in number of tumors analyzed and without direct comparison to UCSCC, Huang et al. noted a similar lack of E2 loss related to viral integration in squamous cell carcinomas of the penis [44]. The indifference toward the loss of E2 and a higher percentage of non-integrated tumors in OPSCC contradict tenets of most accepted models of HPV carcinogenesis, but is not without precedent in the literature. Our group and others have reported that some HPV+ OPSCC tumors lack integrations, maintain all HPV genes, and express E6 and E7, albeit at lower levels [27]. As a refinement of these findings, here, we report that integration can be detected in 75% of HPV+ OPSCC tumors. This estimate is higher than many prior reports, perhaps reflecting the sensitivity of our assay to detect low frequency variants that could be sub-clonally present in cancer cells. However, even with a very sensitive assay, 25% of tumors lacked HPV integration. Groves et al. demonstrated constitutive expression of HPV oncogenes E6 and E7 without E2 loss [45]. Other groups have suggested that HPV integration, or lack thereof, was not correlated with differential expression of E6 and E7 [46,47]. Furthermore, maintained expression of E2, E4, and E5 have been purported to be critical for an alternate carcinogenic pathway in HPV+ OPSCC, which was associated with a lack of genomic integration [48]. The present work further highlights a distinct relationship of E2 to HPV+ OPSCC as compared to UCSCC, to the extent that a significant role for E2 disruption in HPV+ OPSCC should be questioned.
A Higher Number of HPV16 Integration Sites Per Tumor in OPSCC-Considering that E2 loss may be less important in providing selective advantages for cancer and premalignant cells in the oropharynx, other roles for viral integration should be considered, especially since we found HPV integration in 75% of OPSCC cases. Furthermore, we found a strikingly higher number of integration sites per tumor in HPV+ OPSCC as compared to UCSCC, suggesting that maintenance of these integrated HPV copies contributes to cancer development or maintenance. Accompanying higher integration sites per tumor, we also found that viral copy number was higher in OPSCC. Although integration itself was not associated with a higher or lower copy number in either disease, the number of integration sites in an individual tumor was correlated to viral copy number. We note that other groups have associated higher viral copy number in tumors harboring episomal HPV; however, our data do not support this association [49].
The ability of HPV integration to alter expression of nearby human genes (within 500 Kb) has been demonstrated in both OPSCC and UCSCC [7,12] with recurrent integration sites identified near genes implicated in cancer such as MYC [7,50], MYB [7], PVT1 [51][52][53], RAD51B [50,54], EMBP1 [55], CD274/PD-L1 [7,56], and SOX2 [7]. Our results agree with these prior reports confirming integration in or near these genes. Our study identified additional non-recurrent, cancer-associated genes in or near integration sites in HPV+ OPSCC including TGFBR2, FGFR2 [20], PD-L2 [57], TRAF3 [26], BRCA1, SPOP [58], and BCL-2 [59] Our results bolster prior reports that modulation of the function of genes near HPV integration sites is key in the oncogenesis if OPSCC [7]. As other studies investigating HPV integration in UCSCC have found, we noted HPV16 integration near NR4A2, MYC [60], LRP1B [61][62][63], and MACROD2 [54]. Unfortunately, additional tissue for expression analysis was not available for the vast majority of the analyzed cases; however, there is a strong precedent in the literature that HPV integration can modulate the expression of nearby cancer genes in both OPSCC [7,50,64] and UCSCC [50,60,65]. Our finding that areas of high gene density are more likely to harbor integration sites in OPSCC as compared to UCSCC, as well as the increased number of integration sites per tumor, are both consistent with the hypothesis that the modulation of human cancer genes may carry increased significance in OPSCC as compared to UCSCC. HPV+ OPSCC is strikingly more prevalent in men as compared to women [66]. In HPV-negative HNSCC, men typically have a worse prognosis, yet loss of the Y chromosome has been associated with poorer outcomes [67]. Our data identify an interesting integration hotspot in chromosome Y near/at NLGN4Y ( Figure 1F). Previous work has associated this gene with complex neurocognitive traits such as male homosexuality and disorders including autism [68,69]. A homologous gene on chromosome X, NLGN4X, has been shown to have non-identical signaling properties despite 97% sequence homology to NLGN4Y, which have been used to explain X-linked phenomenon related to this XYgene-pair [68]. However, we found only scant reports linking NLGN4Y to cancer [70,71]. The most frequent integration hot-spot we identified in OPSCC was near LINC00486, on chromosome 2; however, it was notably uncommon for these sites to be called by assemblybased reconstruction, so they should be interpreted with some caution. However, HPV16 integration sites in this area have been confirmed and extensively characterized in UCSCC cell lines [72]. In this study, the integration sites were near an enhancer element that is thought to be related to epithelial differentiation [72], and HPV integration has been demonstrated to be more likely to be found in or near enhancer elements [50].
Mechanistic Considerations Regarding HPV16 Integration-The strikingly higher number of integration sites in OPSCC as compared to UCSCC per integrated tumor ( Figure 1D, median shift of~5x), merits consideration from a mechanistic perspective. Integration, like all structural variants, necessitate DNA double-strand breaks as a com-ponent of the process [73,74]. HPV+ OPSCC has been noted to be particularly deficient in double-strand break repair [75][76][77][78] and this has been related to the relative radiosensitivity of these tumors [79]. Deficiency of homologous recombination [75] and replication stress [80] have been linked to HPV integration and pathogenicity. Work from our group and others have demonstrated that APOBEC plays a key role in HPV-mediated carcinogenesis in the oropharynx [81], and increased expression and activity of APOBEC3B and APOBEC3A, respectively, have been linked to direct actions of HPV oncoproteins E6 and E7 [82]. APOBEC3s have been associated with genomic instability across multiple cancer types [83]. Maybe not surprisingly, APOBEC3A (A3A) protein expression has been strongly correlated with increased double-strand breaks and HPV viral integration [84]. Others have shown that integration also directly correlates with genomic instability [85]. Whether differences in DNA damage repair and global genomic instability between UCSCC and OPSCC may explain the differential rates of integration is unknown but merits additional focused analysis.
Similarly, one could consider if the molecular process of integration is variable between the two diseases, and indeed, at least three structural classes of integration sites have been identified: single viral genomic insertions (Type I), insertions of tandem viral repeats (Type II), and tandem repeats of hybrid viral-human DNA (Type III) [72]. Unfortunately, without long reads or optical techniques [72,85], it is impossible to distinguish these types, as in our data. However, integrations were commonly found to be quite complex in both OPSCC and UCSCC, with four or more human-viral genomic junctions; these do not fit well into any of the described models in the literature. We note that experimental studies on LINC00486-related integration sites were Type III, and the complex/repetitive nature of the site might explain the failure of assembly-based methods in these cases [72].
Our study was limited in the ability to resolve some complexities of integration events, primarily due to the targeted affinity capture technique, which does not allow unbiased analysis of the copy number of small regions of non-target human genomic DNA, as is possible with whole genome sequencing. More specifically, because of this, we were not able to determine if the integration sites detected were related to chromosomal or extrachromosomal DNA, although other groups using whole genome sequencing have reported detecting human-HPV hybrid extrachromosomal DNA in OPSCC tumors [36].
Clinical Implications of HPV16 Integration in OPSCC-We did not find any prognostic relationship to HPV16 viral integration in HPV+ OPSCC in this study. However, this is a frequently discussed association in the literature, with most studies suggesting that integration is a poor prognostic marker [21,86], but conflicting reports exist [27,87]. We notice that other DNA-sequencing-based studies have failed to demonstrate prognostic relationship to integration [7,36]. Interestingly, the strongest evidence for HPV integration being a poor prognostic factor has been based on RNA sequencing [21,86], while conflicting studies as well as studies with no association between HPV integrations and survival were based on DNA technology [7,27,36,87]; therefore, we hypothesize that only transcriptionally active integration sites are prognostic.

Conclusions
We directly compared 44 cases of HPV16+ UCSCC and 104 cases of HPV16+ OPSCC. E2 loss was common in HPV+ UCSCC, mediated by frequent copy number losses or structural variant breakpoints. Conversely, E2 loss was not selected for in HPV+ OPSCC. Tumors with no evidence of integration represented~25% of both UCSCC and OPSCC cases. Amongst tumors with genomic integration, OPSCC tumors had many more integration sites per tumor, and were more likely to have integration sites in high gene density areas of the genome. These data highlight clear differences between HPV16-associated UCSCC and OPSCC related to the physical state of the HPV16 genome, and there may be implications for distinct HPV-associated carcinogenesis in the two diseases.

Conflicts of Interest:
The authors declare no conflict of interest.