HPV16-Genotyper: A Computational Tool for Risk-Assessment, Lineage Genotyping and Recombination Detection in HPV16 Sequences, Based on a Large-Scale Evolutionary Analysis

: Previous analyses have identiﬁed certain but limited evidence of recombination among HPV16 genomes, in accordance with a general perception that DNA viruses do not frequently recombine. In this evolutionary/bioinformatics study we have analyzed more than 3600 publicly available complete and partial HPV16 genomes. By studying the phylogenetic incongruence, similarity plots and the distribution patterns of lineage-speciﬁc SNPs, we identify several potential recombination events between the two major HPV16 evolutionary clades. These two clades comprise the (widely considered) phenotypically more benign (lower risk) lineage A and the (widely considered) phenotypically more aggressive (higher risk) non-European lineages B, C and D. We observe a frequency of potential recombinant sequences ranging between 0.3 and 1.2% which is low, but nevertheless considerable. Our ﬁndings have clinical implications and highlight that HPV16 genotyping and risk assessment based only on certain genomic regions and not the entire genome may provide a false genotype and, therefore, its associated risk estimate. Finally, based on this analysis, we have developed a bioinformatics tool that automates the entire process of HPV16 lineage genotyping, recombination detection and further identiﬁes, within the submitted sequences, SNPs that have been reported in the literature to increase the risk of cancer.


Introduction
Human Papillomaviruses (HPVs) are members of the Papillomaviridae family that consist of a diversified group of small, non-enveloped, dsDNA viruses [1,2]. Currently, over 200 HPV genotypes have been completely characterized, while phylogenetically they are classified into five genera, including Alphapapillomaviruses (alpha), Betapapillomaviruses (beta), Gammapapillomaviruses (gamma), Mupapillomaviruses (mu), and Nupapillomaviruses (nu) [3,4]. According to the Papillomavirus Nomenclature Committee, HPV genomes are further grouped into intratypic variants that have more than 98% sequence similarity with the reference sequence of the L1 gene [5]. Another method of classification, depending on the whole genome sequence, defines variant lineages and sub-lineages by a difference of 1% to 10%, and 0.5% to 1%, respectively [6,7]. The alpha-HPVs infect the mucosal epithelium and they are subdivided into high-risk (HR-HPV) and low-risk (LR-HPV) genotypes, considering their tumorigenic disposition [1,5,8,9]. HR-HPV infection is regarded as a major public health concern, since it is responsible for more than 99% of cervical cancer incidences while HPV16 DNA has been identified in more than 50% of cervical cancer cases worldwide [10,11].
The high tumorigenic capacity of alpha-PVs and particularly that of HPV16 promoted the evolutionary analysis of HPVs in order to shed light on the evolutionary mechanisms involved in the emergence and variance of viral carcinogenicity among the different HPV genotypes [19]. PVs have evolved closely with their hosts, but their diversity cannot be explained only by virus-host co-divergence, as papillomaviruses and their hosts did not follow a common evolutionary line [20].
Various evolutionary mechanisms seem to affect viral diversity, including cross-species infection, gene duplication and recombination events [21][22][23]. Although DNA viruses are considered to recombine with significantly lower frequency than RNA viruses [24], nevertheless, there have been reports of homologous [25,26] and even non-homologous [27] recombination among distant lineages of animal papillomaviruses. Recombination has also been reported among members of alpha-HPVs [28][29][30] and even among HPV16 lineages [31,32]. These reports indicated that HPV16 inter-lineage recombination may occur during infection, however the prevalence of recombinant HPV16 strains or the impact of these recombination events on viral pathogenicity remains unclear.
The first focus of this study is to apply well-established bioinformatics methods and analyze for the first time all the publicly available HPV16 genomes and large genome fragments in order to (i) estimate the frequency of inter-lineage recombination events, (ii) observe where the recombination crossover sites are localized and (iii) understand whether these recombination events lead to circulating variants or they are evolutionary dead-ends. Understanding all the above points has clinical implications as well, because if inter-lineage recombination does happen at a considerable frequency, then lineage genotyping based on certain genomic regions may actually provide a false estimate of risk. The second focus of this study is the development of a bioinformatics tool, called HPV16-genotyper, that automates the aforementioned analyses in order to determine the genotype and identify any potential recombinants. In addition, HPV16-genotyper identifies within the submitted sequences any of the nine well known SNPs that have been reported in the literature to increase the risk of cancer.

Materials and Methods
For the first part of our analyses, in order to retrieve HPV16 genomic sequences, the reference HPV16 genome with accession NC_001526 was used as query in the NCBI BlastN algorithm [33] against the NCBI nt database in February 2021. We applied an e-value cut-off of 1e-5, a nucleotide identity greater than 97%, and also retained sequences of at least 7500 nucleotides, where less than 1% of nucleotides were unknown. This search resulted in 1534 sequences. The gene boundaries of these 1534 sequences were determined with BlastN against the annotated CDS of the reference HPV16 genome NC_001526.
In order to remove redundant genomic sequences of high nucleotide identity, the 1534 genomes were clustered with the Uclust algorithm of Usearch [34] by applying a threshold of 99.6% identity over 99.6% query coverage. This analysis resulted in 193  /representative sequence was chosen for each cluster, while we also  retained each of the 16 sub-lineage reference genomes (A1-A4, B1-B4, C1-C4, D1-D4) described in [7]. Based on this non-redundant set of 193 genomes, each of the HPV-16 genes  (E6, E7, E1, E2, E4, E5, L2, L1) were extracted and aligned as codons using MUSCLE [35] which is embedded in the Seaview programme [36]. We also performed phylogenetic analysis of the LCR region. Manual inspection of the alignments revealed 13 sequences with many un-sequenced nucleotides localized in the E5 gene and consequently they were removed from the entire analysis, thus resulting in 180 total representative genome sequences/clusters. Complete genome alignments of the 180 representative genomes were performed with mafft v7.471 (parameters: G-INSI) [37].
PhyML trees were calculated for the multiple alignments of the complete genome and for each genomic region separately, with the GTR + I + G model, 4 gamma categories, aLRT and SPR tree search operation using SeaView. Model selection for the PhyML trees was performed with Jmodeltest 2 [38,39] based on AIC. The phylogenetic trees were visualized and annotated in Treedyn [40] and manually inspected for sequences that displayed clear incongruence at the lineage level.
The resulting 5 incongruent sequences were scanned for recombinations using T-RECs [41] with 1% nt difference cut-off and e-value cut-off 1e-10. Each recombination event was manually inspected with similarity plots of window size 300 and step 20. To further validate the five detected recombination events, each of them was also tested with RDP4 [42] and the GARD algorithm within the Datamonkey webserver (breakpoints with support over 0.75) [43]. Genome graph visualization was performed in Biopython [44] with the GenomeDiagramm module [45].
Sixty-seven lineage-specific nucleotide mutations/SNPs were identified from the 175 representative genomic sequences (after removing the 5 inter-lineage recombinant sequences) using Jalview [46] and custom python scripts. In order to classify a SNP as lineage-specific (A-specific, B-specific, C-specific, D-specific, BCD-specific), it should be present in more than 95% of the sequences of that given lineage and in less than 5% in each of the other lineages. Once the lineage-specific SNPs were identified, each of the five recombinant genome sequences were analyzed for their presence, in order to validate that these SNPs could be used as a quick and reliable method to identify potential inter-lineage recombination events.
The above-mentioned 67 lineage-specific SNPs were used to develop their corresponding 31nt probes, based on the HPV16 reference genome and by placing the lineage-specific SNP in the middle of the probe. Next, Python scripts were developed that used these 67 probes as subject sequences in a BlastN search, where the query sequences were obtained from NCBI nucleotide. By developing these SNP probes, it was feasible to rapidly scan a nucleotide sequence for the presence of any of the lineage-specific SNPs and thus identify any sequences (either entire genomes, genome fragments or even gene fragments) that had mixtures of different lineage-specific SNPs. This allowed us to rapidly analyze thousands of nucleotide sequences for any signs of potential inter-lineage recombination without the need to resolve to complex phylogenetic analyses. Based on the presence of lineage-specific SNPs, a sequence was assigned to a certain lineage and was considered as a potential recombinant if it included at least 3 SNPs that were from a different lineage. If three or more of these other-lineage SNPs were next to each other, that region was considered a higher-confidence potential recombination event. Each of the potential recombinants were also manually checked with similarity plots within T-RECs for the locations of the recombination crossover sites.
In the second part of our study, we developed a Bioinformatics tool in BioPython that automates all of the above analyses (see Figure 1). First, it scans the submitted sequences (in FASTA format) whether they belong to the HPV16 group or not. The submitted sequences have to have more than 90% nucleotide identity against the reference HPV16 sequence. Next, a blast search against the reference genes of each of the 16 HPV16 sub-lineages (A1-4, B1-4, C1-4, D1-4) identifies the gene boundaries of each submitted HPV16 sequence. Each of these genes undergoes a Neighbor Joining phylogenetic analysis (Kimura distance, 100 bootstraps) against the homologous reference genes of each of the 16 sub-lineages, using Seaview v4. The multiple alignments are generated by MUSCLE. The phylogenetic trees of each region separately can be viewed within the software via the ETE3 package [47]. Next, the software identifies and visualizes any of the 67 lineage-specific SNPs in the submitted sequence. Based on the gene-based blast results and the lineage-specific SNPs, the software determines whether a submitted sequence is a potential recombinant or not. Furthermore, the software scans the submitted sequence for each of the 9 SNPs that have reported in the literature to increase the risk of cancer. For each of these cancer-related SNPs, the software also provides a short description and the relevant citations. Finally, a help video guides the user (with example data) on how to use the software and interpret the results. The software and help video are available in http://bioinf.bio.uth.gr/hpv16-genotyper.html (accessed on 11 October 2021) for installation and use in both Windows 10 and Ubuntu Linux (version 20).  [47]. Next, the software identifies and visualizes any of the 67 lineage-specific SNPs in the submitted sequence. Based on the gene-based blast results and the lineage-specific SNPs, the software determines whether a submitted sequence is a potential recombinant or not. Furthermore, the software scans the submitted sequence for each of the 9 SNPs that have reported in the literature to increase the risk of cancer. For each of these cancer-related SNPs, the software also provides a short description and the relevant citations.

Phylogenetic Analyses
In the complete genome phylogenomic analysis of the 180 representative genomes ( Figure 2) we identified two major clades as reported previously in [7]. The first major clade is comprised of the A lineage (light green), while the second major clade encompasses the B (light blue), C (orange) and D (red) lineages and is referred to as BCD. The separation of the two major clades is strongly supported by an approximate likelihood ratio test value (aLRT) of 1.
passes the B (light blue), C (orange) and D (red) lineages and is referred to as BCD. The separation of the two major clades is strongly supported by an approximate likelihood ratio test value (aLRT) of 1.
With the exception of five identified inter-lineage recombinant sequences, in all phylogenetic trees of the individual genes (see Figure 2 and Supplementary Materials File S1) we see an overall clustering, similar to the complete genome tree, where lineage A forms a major group with high aLRT values (0.835-0.999) and the other three lineages (B, C, D) form a second major group. The B, C and D lineages appear as monophyletic only in certain genomic regions. More specifically, each of the B, C and D lineages is monophyletic in E2, L2 and LCR. The B lineage is also monophyletic in E5 and L1, while D is monophyletic in E1. In addition, the C and D lineages form one larger cluster in E1, E2 and L2 and are sister groups in E2 and L2. Interestingly, the C and D lineages are intermingled in E5, which indicates either the presence of many recombination events between these two lineages, or more probably that the phylogenetic signal is not strong enough to show them as monophyletic clades. Of note, these are very short regions of very high sequence identity.

Figure 2.
Phylogenetic analyses and identification of potential recombinant sequences. Complete genome alignment was performed with mafft, whereas individual gene alignments were performed with MUSCLE. PhyML trees were generated with GTR + I + G model. Lineage A is coloured light green, lineage B (Af-1) is coloured blue, lineage C (Af-2) is coloured orange, and lineage D is coloured red. The sequence KY549190 appears to be between the two lineages and is assumed "Ungrouped" in the phylogenomic tree. Above the phylogenetic trees, each recombinant sequence is depicted with coloured blocks corresponding to the major and minor parent lineages. Vertical lines above each recombinant sequence represent lineage-specific SNPs. Green lines correspond to A-lineage SNPs, yellow to BCD-clade SNPs, orange to C-lineage SNPs and red to D-lineage SNPs. Recombination crossover sites as detected by GARD are displayed with downward arrows above the genome architecture diagram.

Figure 2.
Phylogenetic analyses and identification of potential recombinant sequences. Complete genome alignment was performed with mafft, whereas individual gene alignments were performed with MUSCLE. PhyML trees were generated with GTR + I + G model. Lineage A is coloured light green, lineage B (Af-1) is coloured blue, lineage C (Af-2) is coloured orange, and lineage D is coloured red. The sequence KY549190 appears to be between the two lineages and is assumed "Ungrouped" in the phylogenomic tree. Above the phylogenetic trees, each recombinant sequence is depicted with coloured blocks corresponding to the major and minor parent lineages. Vertical lines above each recombinant sequence represent lineage-specific SNPs. Green lines correspond to A-lineage SNPs, yellow to BCD-clade SNPs, orange to C-lineage SNPs and red to D-lineage SNPs. Recombination crossover sites as detected by GARD are displayed with downward arrows above the genome architecture diagram.
With the exception of five identified inter-lineage recombinant sequences, in all phylogenetic trees of the individual genes (see Figure 2 and Supplementary Materials File S1) we see an overall clustering, similar to the complete genome tree, where lineage A forms a major group with high aLRT values (0.835-0.999) and the other three lineages (B, C, D) form a second major group. The B, C and D lineages appear as monophyletic only in certain genomic regions. More specifically, each of the B, C and D lineages is monophyletic in E2, L2 and LCR. The B lineage is also monophyletic in E5 and L1, while D is monophyletic in E1. In addition, the C and D lineages form one larger cluster in E1, E2 and L2 and are sister groups in E2 and L2. Interestingly, the C and D lineages are intermingled in E5, which indicates either the presence of many recombination events between these two lineages, or more probably that the phylogenetic signal is not strong enough to show them as monophyletic clades. Of note, these are very short regions of very high sequence identity.

Identification of Inter-Lineage Recombinant Sequences Based on Phylogenetic Incongruence
The phylogenetic analyses of the individual genes of the 180 representative genomes identified five phylogenetically incongruent sequences that have signs of recombination at the lineage level. These sequences were further analysed with T-RECs, RDP4 and GARD.
The first inter-lineage recombination event concerns the sequence with accession number MT316214, retrieved from cervical cancer samples from Guatemala as described in [48]. Sequencing in this study was performed with Ion Torrent. This recombinant sequence is symbolized with a reverse triangle in the phylogenetic trees of Figure 2.
The second potential recombinant sequence is KU641509 that was isolated from a cervical carcinoma biopsy in India [49]. It is symbolized with a circle in the phylogenetic trees of Figure 2. It is a member of the D1 sub-lineage that has obtained a small part of the E1 region from another unknown member of the wider BCD clade.
The third potential recombinant sequence is MT316301, obtained from cervical cancer samples [48]. It is a member of the D3 sub-lineage, that has obtained part of its E2 gene from the A lineage. The recombinant sequence is symbolized with a triangle in the phylogenetic trees of Figure 2.
Similarly, MT316253 is the fourth potential recombinant sequence, obtained by [48]. It is a member of the D lineage (D2 sub-lineage) that has obtained part of its E2 gene from the A lineage. The recombinant sequence is symbolized with a diamond in the phylogenetic trees of Figure 2.
The fifth potential recombinant sequence is KY549190, which was isolated in the Netherlands and sequenced with Sanger whole genome sequencing method [50]. The various analyses revealed that almost half of this sequence (from the second half of E1 until the first half of L1) belongs to the A lineage, whereas the other part belongs to the BCD major clade and more specifically to the C lineage. The potential recombinant sequence is symbolized with a star in the phylogenetic trees of Figure 2.
Overall, from the above analysis of five recombinant genomes, we identified at least four potential recombination crossover sites in E1, five in E2 and one in L1. It is conceivable that some of the intermingled sequences in the phylogenies of E6, E7 and E5 genes could be additional recombination events. However, these genomic regions have small sizes and the overall sequence identity among members of different HPV16 lineages is high. Thus, it is difficult to conclude whether these regions are also involved in recombination events or, more probably, their phylogenetic incongruence is due to the weak phylogenetic signal.
Next, we checked whether each of the five inter-lineage recombinant sequences had more highly similar genomes within their respective UCLUST clusters. Nevertheless, the five identified recombinant sequences are singleton clusters, thus they represent five potential inter-lineage recombination events in over 1500 analyzed complete genomes. Accordingly, they do not appear to be broadly circulating in the general population and it is conceivable that these potential recombinants are low-frequency dead-end by-products of the infection cycle of HPV16.

Identification of Lineage-Specific SNPs
After removing the five identified recombinant sequences, we analyzed the remaining 175 complete representative genomes for lineage-specific SNPs that characterize each of the four lineages (A-D). Such lineage-specific SNPs could be valuable for easily and rapidly identifying potential inter-lineage recombination events in the future, where thousands of new HPV16 genomes or even genome fragments are bound to be sequenced. In total, we identified 67 lineage-specific SNPs, that are frequently present (>95%) in a specific lineage and have very low presence (<5%) in each of the other three lineages. Four of them are positioned inside the E2-E4 shared/overlapping genomic region, thus the total number of lineage-specific SNPs shown in Table 1 is 71. Furthermore, 61 of the 67 are found inside a gene and about 40% (25/61) of them are non-synonymous. Forty three (43) lineage-specific SNPs can categorize a sequence as either belonging to the A-lineage/clade or the wider BCD-clade, whereas the B, C and D lineages have eight, eight and eight lineage-specific SNPs, respectively. Therefore, this significant imbalance strongly suggests that our approach is better tailored towards identifying recombinants between the major A and BDC clades, whereas identifying recombinants among the more closely related B, C and D lineages is more difficult, due to the low number of available lineage-specific SNPs. Table 1. Lineage-specific SNPs. The first column shows the genomic position of the specific SNP mapped to the A1 Reference genome NC_001526 and the corresponding reference nucleotide. The second column shows the genomic region which corresponds to this position. The third and fourth columns show the SNP position in the corresponding NC_001526 gene and protein sequences, respectively. The fifth to twelfth columns show the nucleotide of each group and the percentage it is found in the sequences. The last column shows the amino acid change effect of each mutation, with the first letter being the variant amino acid. Cell colour indicates the non-synonymous mutations and mutations that are not placed inside an ORF. Lineages A, B, C and D non-synonymous SNPs are denoted with green, blue, orange and red, respectively. Novel lineage-specific synonymous mutations identified in this study are shown in bold/underlined. Reassuringly, many of these lineage-specific SNPs have already been reported in the literature [17,[51][52][53][54][55]. However, five novel synonymous nucleotide changes that facilitate the specific characterization of HPV16 variant lineages were (to the best of our knowledge) found for the first time in the present study. In particular, one nucleotide variation was detected at position 3431 of the E2 gene, while one variation was found at position 3858, which is located in the overlapping fragment of the E2 and E5 ORFs. Moreover, two new lineage-specific SNPs were identified at positions 4145 and 4149, respectively, which are positioned in the region between E5 and L2 ORFs, whereas a novel sequence change was detected at position 5659 which is located in the overlapping fragment of L1 and L2 genes.
Next, each genomic region was tested for enrichment in group-specific SNPs. Only L2 was found to be significantly enriched, with an enrichment fold of 1.83 (Fisher's exact test p < 0.03).

Rapid Detection of Inter-Lineage Recombination Events, Based on the Presence of Lineage-Specific SNPs
Based on the 67 lineage-specific SNPs, we developed a virtual SNP array (with Python scripts) of 31nt probes (using the HPV16 reference genome), where we placed the lineagespecific SNP in the middle of the probe (position 16). Next, we downloaded 9993 HPV16 sequences from NCBI nucleotide (May 2021), that were either entire genomes, or large genome fragments or even gene fragments. Many of the large genome fragments contained un-sequenced nucleotides, designated with N. The python scripts that used the 67 probes checked each of the nucleotide sequences for the presence of any lineage-specific SNP. Thus, this approach allowed us to rapidly scan thousands of sequences and identify in each of them any mixture of lineage-specific SNPs, that was considered as a sign of inter-lineage recombination.
In order to identify putative inter-lineage recombination events, all sequences were assigned to a certain lineage (A, B, C, D) or clade (A, BCD). Next, the putative recombinant would need to have more than 3 signature SNPs (%5; 3/67) of another lineage or clade. The putative recombinants were further inspected manually with similarity plots, within T-RECs (see Supplementary Materials File S2). Recombinant regions with 3 or more consecutive SNPs belonging to another lineage/clade were considered as higher confidence, whereas if the recombination event was visible in similarity plot, but was only supported by two or less consecutive SNPs, it was considered as lower confidence.
Of the 9993 sequences downloaded, only 3657 had more than 6000 sequenced nucleotides each (accounting for more than 75% of the HPV16 genome). We focused on these 3657 sequences and identified 45 putative recombinants (Figure 3), which account for 1.2% of the analyzed sequences. Of note, these 3657 sequences also include the 1534 complete genomes of our first analysis. Based on these lineage-specific SNPs, it was possible to roughly estimate the region that underwent recombination. Next, these putative interlineage recombinants were further manually inspected with similarity plots within T-RECs, for validating them and for identifying more precisely their recombination crossover sites (Figure 4 and Supplementary Materials File S2). In the original higher quality genomic dataset (1534 sequences), we discovered only five recombinant sequences (recombination frequency of 0.3%), based on phylogenetic analyses. Thus, our second approach based on lineage-specific SNPs is more powerful and sensitive.
Based on this second analysis, we identified 95 higher confidence and 45 lower confidence recombination crossover sites across the entire genome (140 in total), which accounted to an average of 12-18 recombination sites per 1000 nt. However, the recombination sites were not evenly distributed across the HPV16 genome. We observed a statistically significant enrichment (using the hypergeometric test, p-value < 0.05,) for L2 (41 sites, 29 sites per 1000 nts, p-value = 7.6 × 10 −4 , enrichment fold 1.62) and a statistically significant depletion for LCR (six sites, seven sites per 1000 nts, p-value = 5.5 × 10 −3 , under-enrichment fold 2.48). 3657 sequences and identified 45 putative recombinants (Figure 3), which account for 1.2% of the analyzed sequences. Of note, these 3657 sequences also include the 1534 complete genomes of our first analysis. Based on these lineage-specific SNPs, it was possible to roughly estimate the region that underwent recombination. Next, these putative inter-lineage recombinants were further manually inspected with similarity plots within T-RECs, for validating them and for identifying more precisely their recombination crossover sites (Figure 4 and Supplementary Materials File S2). In the original higher quality genomic dataset (1534 sequences), we discovered only five recombinant sequences (recombination frequency of 0.3%), based on phylogenetic analyses. Thus, our second approach based on lineage-specific SNPs is more powerful and sensitive.   Higher confidence and lower confidence recombination sites are shown as red and blue vertical lines, respectively, above the genome diagram. The stacked histogram below the genome diagram shows the relative abundance of higher and lower confidence recombination sites per 100 nucleotides. L2 seems to be the most frequently recombined genomic region, followed by E1 and E2. The graphical representation was made using the reference sequence NC_001526 that is depicted as a linear chromosome (although it is actually circular) for visualization purposes.
Based on this second analysis, we identified 95 higher confidence and 45 lower confidence recombination crossover sites across the entire genome (140 in total), which accounted to an average of 12-18 recombination sites per 1000 nt. However, the recombination sites were not evenly distributed across the HPV16 genome. We observed a statistically significant enrichment (using the hypergeometric test, p-value < 0.05,) for L2 (41 sites, 29 sites per 1000 nts, p-value = 7.6 × 10 −4 , enrichment fold 1.62) and a statistically significant depletion for LCR (six sites, seven sites per 1000 nts, p-value = 5.5 × 10 −3 , under-enrichment fold 2.48).

Figure 4.
Recombination sites on the HPV16 genome. Higher confidence and lower confidence recombination sites are shown as red and blue vertical lines, respectively, above the genome diagram. The stacked histogram below the genome diagram shows the relative abundance of higher and lower confidence recombination sites per 100 nucleotides. L2 seems to be the most frequently recombined genomic region, followed by E1 and E2. The graphical representation was made using the reference sequence NC_001526 that is depicted as a linear chromosome (although it is actually circular) for visualization purposes.

Development of the HPV16-Genotyper Computational Tool
Based on the previously mentioned analyses, we developed a computational tool in Biopython that automates the entire process and performs genotyping, quality control of genome assembly, detection of recombination events and detection of cancer-related SNPs. The software was tested on the 180 above-mentioned representative genomes and was validated for proper performance. The entire analysis took 16 min on a personal Linux laptop with eight cores (2.4 GHz). In Figure 5, an example of phylogenetic and recombination analysis of the HPV16 strain with accession number MT316214.1 is described in detail using the newly designed HPV16-genotyper tool. The HPV16-genotyper confirmed the initial analysis that is summarized in Figure 2.
More specifically, the software accepts sequences in FASTA format and first tries to determine which of them are HPV16 sequences or not (based on the 90% nucleotide identity criterion). Next, the software scans each HPV16 sequence for the presence of any of the 67 lineage-specific SNPs (of Table 1). The results are depicted in sections A7 and A8 of Figure 5. From this analysis, the user can determine which clade/lineage the sequence belongs to. In addition, if the majority of SNPs belong to a certain clade/lineage, but there also exist a number of SNPs from another clade/lineage that also tend to cluster together; this is a strong indication of interlineage recombination. However, if the sequence shows a randomly mixed pattern of SNPs from different lineages, this may be a sign of genome mis-assembly, due to infection from more than one lineages. Thus, this analysis also functions as a quality control of the assembled sequence.
Afterwards, HPV16 sequences are analyzed with BlastN in order to determine the boundaries of each gene. Furthermore, BlastN determines in which of the 16 sub-lineages each gene belongs to. This result is depicted in section A10 of Figure 5 and can be used as a good indication in which lineage and sub-lineage the sequence belongs to and whether the sequence could be an interlineage recombinant or not.
Next, the genes of the analyzed HPV16 sequences undergo Neighbor Joining phylogenetic analysis together with the homologous genes from each of the 16 reference sub-lineages. The results of the phylogenetic analyses are visualized with the ETE3 package (see section B of Figure 5) by pressing the corresponding button (in section A11 in Figure 5). ETE3 allows the user to perform many editing functions upon the phylogenetic tree, such as rerooting, ladderizing, clade swapping, etc. Therefore, this phylogenetic analysis of each gene separately is probably the best method to determine the lineage and sub-lineage of a sequence and whether it's a recombinant (or even assembly artifact), based on any observed phylogenetic incongruence.
The software also decides whether a sequence is a recombinant or not based on both the BlastN and lineage-specific SNPs. More specifically, a sequence identified as a potential recombinant needs to have genes from a different lineage. Simultaneously, the sequence needs to have three or more consecutive SNPs from another clade/lineage. The potential recombinants can be displayed by pressing the corresponding button (in section A4 of Figure 5). In addition, a potential recombinant may undergo Similarity Plot analysis (see section C of Figure 5) by pressing the corresponding button (see section A12 of Figure 5).
Finally, the software scans each HPV16 sequence for the presence of any of the nine SNPs that are known to be associated with an increased risk of cancer. These results are depicted in the corresponding table (see section A5 of Figure 5) whereas a description/annotation of each cancer-related SNP together with its associated literature citation/s is depicted in the corresponding box (see section A6 of Figure 5). The nine SNPs are: (i) G145T [56,57], (ii) C335T [56,57], (iii) T350G [15,18,[56][57][58][59][60][61][62], (iv) A647G [63][64][65][66], (v) C749T [64,[66][67][68][69], (vi) C2860A [70], (vii) C3410T [51,[71][72][73], (viii) C3684A [70,73], (ix) A4042G/C/T [51,52]. is the status bar, where the user obtains information about the currently displayed page. A2 are the total pages available (Home and Results). The first frame of the results page contains A3, a list of all the analyzed sequences. By double clicking the name of a sequence, the page updates with the corresponding information. By checking the button A4, the list displays only the putative recombinants identified in the analysis. The next frame contains information about 9 SNPs associated with increased risk of cancer (A5) and clicking on any of the identified SNPs displays information (A6) about that specific SNP. A7 shows the lineage-specific SNPs identified in the selected sequence and this information is summarized in graph A8. BLAST results for each gene are shown in table A9 and summarized in graph A10. In the frame A11, the user has the option to view the phylogenetic tree for each gene. In case the selected sequence does not contain the selected gene, an error message will be displayed. The A12 frame gives the option to create the similarity plot of the selected sequence. A13 saves A8 and A10 on the output directory. Panel B shows an example interactive tree visualization, where B1 is the gene label, B2 shows the different reference sequences which are colored based on their lineage and B3 shows the selected sequence which will always be colored gray. Panel C shows an example similarity plot window. C1 is the plot description, C2 is the similarity plot, C3 is the plot legend and C4 is a button that can save the page in JPG format.
More specifically, the software accepts sequences in FASTA format and first tries to determine which of them are HPV16 sequences or not (based on the 90% nucleotide identity criterion). Next, the software scans each HPV16 sequence for the presence of any of Figure 5. The HPV16 genotyper tool has three GUI components (A-C). The first component is the main results page. A1 is the status bar, where the user obtains information about the currently displayed page. A2 are the total pages available (Home and Results). The first frame of the results page contains A3, a list of all the analyzed sequences. By double clicking the name of a sequence, the page updates with the corresponding information. By checking the button A4, the list displays only the putative recombinants identified in the analysis. The next frame contains information about 9 SNPs associated with increased risk of cancer (A5) and clicking on any of the identified SNPs displays information (A6) about that specific SNP. A7 shows the lineage-specific SNPs identified in the selected sequence and this information is summarized in graph A8. BLAST results for each gene are shown in table A9 and summarized in graph A10. In the frame A11, the user has the option to view the phylogenetic tree for each gene. In case the selected sequence does not contain the selected gene, an error message will be displayed. The A12 frame gives the option to create the similarity plot of the selected sequence. A13 saves A8 and A10 on the output directory. Panel B shows an example interactive tree visualization, where B1 is the gene label, B2 shows the different reference sequences which are colored based on their lineage and B3 shows the selected sequence which will always be colored gray. Panel C shows an example similarity plot window. C1 is the plot description, C2 is the similarity plot, C3 is the plot legend and C4 is a button that can save the page in JPG format.

Discussion
DNA viruses recombine with significantly lower frequency than RNA viruses [24]; nevertheless, there have been reports of homologous [25] and even non-homologous [27] recombination among distant lineages of animal papillomaviruses. In addition, there have also been reports of recombination events within alpha-HPVs, where recombination crossover sites were localized at the E6, E7, L2 and L1 genes, hence advocating that novel recombinant types could be formed upon a natural viral co-infection [29,30]. The first report of HPV16 recombination in clinical samples was provided by [31]. Inter-lineage recombination events were observed between the HPV16 A and HPV16 C (African type II) lineages. In addition, another study reported recombination events in clinical samples between European and African type II lineages and between D (Asian American) and B (African type I) lineages, respectively [32].
Our initial phylogenetic incongruence analyses on all the genomic regions of 180 complete HPV16 genomes revealed at least five potential recombinant sequences where the exchange occurred between the two major evolutionary clades; the clade that encompasses lineage A and the clade that encompasses lineages B, C, and D. Lineage A is considered to contain mostly low-risk viruses whereas lineages B, C, and D usually contain HPV16 viruses with more aggressive phenotypes [14][15][16][17][18]. More specifically, the largest HPV16 genome-scale study to date analyzed more than 7000 sequences of various sub-lineages from all areas of the world [74]. Their findings showed that the A lineage was the most prevalent worldwide as it accounted for more than 78% of the analyzed sequences, while the next most abundant lineage was D with a frequency of 9%. The A lineage is dominant in America, Europe, Asia and Oceania, the B and C lineages are mostly found in Africa, and the D lineage can be found in America, Asia and some parts of Africa. That analysis also revealed that A1 and A2 are the most globally widespread sub-lineages, constituting 63% (4474/7116, 707/7116) of the total examined cases. The A3 and A4 sub-lineages are more specific to Asia. The B1-3 and C2-3 sub-lineages are reported almost exclusively in sub-Saharan Africa and B-D4 are found mostly in North Africa. The D1 sub-lineage can be found in America, Europe and East Asia, while D2 is found almost exclusively in America. Lastly, D3 can be found in many regions around the world and is the most abundant D sub-lineage. These geographical dispersion patterns suggest that the old nomenclature (e.g., European for A1-3 variants) is now outdated.
The study of [14] based on 3200 women associated the A1/2 sub-lineages with greater CIN3+ risk in "white women" population. Additionally, [74] reported that the A3 sublineage demonstrates an increased cancer risk for the East Asian population, where it is most commonly found. The A4 sub-lineage is associated with greater cancer risk than the other A lineage members [14], which was further confirmed for the Asian population by [74]. The D lineage is associated with increased cancer risk compared to the rest of the variant lineages [12][13][14]74] and increased incidents of adenocarcinoma and adenocarcinoma in situ. One study found that the D sub-lineages had an increased LCR P97 oncogenic promoter activity compared to the A lineages, which might partially explain the difference in the oncogenic potential [75].
The 180 genomes of our first analysis constitute representative genomes of more than 1500 complete genomes. The five potential recombinant sequences were singleton clusters in the 1500 sequences, thus, a rough estimate of recombination frequency is 1 in 300 (0.3%).
The second recombination analysis utilized 3657 complete and partial genomes where more than 75% of their nucleotides had been sequenced. This analysis was based on 67 lineage-specific SNPs that are found in very high frequency (>95%) in a certain lineage and in very low frequency (<5%) in each of the other three lineages. In accordance with our phylogenetic analyses, 43 of the 67 lineage-specific SNPs can categorize a sequence as either belonging to the A-lineage or the wider BCD-clade. These SNPs are distributed across the entire genome, nevertheless, we observed a statistically significant enrichment for L2. Based on the distribution of lineage-specific SNPs, we identified 45 recombinant sequences whose recombination crossover sites were manually determined with similarity plots. The HPV16 genome is circular, thus every recombination event consists of double recombination crossover sites. In addition, we observed several sequences with more than one recombination events. Overall, recombination sites were observed in all genomic regions, however, we also observed a significantly higher frequency for the L2 region and a significantly lower frequency for LCR. Nevertheless, these hotspot/coldspot results should be treated with caution because they are strongly biased by the un-even distribution of lineage/clade specific SNPs that were used in the first place to identify recombination events. Therefore, future validation of these recombination hotspots and coldspots is needed, which will be based on more and fully sequenced genomes and preferably by phylogenetic incongruence and similarity plot methods. Our approach constitutes a crucial first step, because it needed to utilize a large number of incompletely sequenced genomes.
Based on the second recombination analysis, the recombination frequency among the two distinct clades is now elevated from 0.3% to at least 1.2%, which is not trivial. This finding also may have important clinical implications in the near future, because HPV16 genotyping and risk assessment is usually based on certain genomic regions and not the entire genome. Thus, it is conceivable that a recombinant sequence may give a false risk estimate because one part of it originates from a clade/lineage that is usually associated with more aggressive phenotypes and another part of it originates from a clade/lineage that is usually associated with more benign phenotypes. Furthermore, our study clearly demonstrates that phylogenomic analyses based on the entire genomes of inter-lineage recombinant sequences instead of phylogenetic analyses of each gene separately may also give a mistaken impression of the emergence of a new and distinct evolutionary lineage as is the case of the recombinant KY549190.
As a note of caution, it is conceivable that some of the above-mentioned recombination events are actually artifacts of the genome assembly process of next-generation sequencing data from tissues that have infections from more than one HPV16 lineages. However, if that is the case, then lineage-specific SNPs from another clade should be distributed across the genome in a mixed fashion and not appearing to cluster. Thus, we believe that our criterion of at least three consecutive other-lineage-specific SNPs should filter out many such false recombinants.
Finally, based on the above mentioned analyses and results, we developed a computational tool named HPV16 genotyper that performs automatically as a pipeline a series of complicated analyses and helps the wet-lab user to rapidly assess the genotype, to identify any potential recombinants and to assess the cancer risk of a certain set of sequences. This computational tool will streamline many complicated bioinformatics analyses for the wet-lab users in this new, exciting and challenging era of next-generation sequencing of thousands of HPV16 genomes.