Characterization and Diversity of 243 Complete Human Papillomavirus Genomes in Cervical Swabs Using Next Generation Sequencing

In recent years, next generation sequencing (NGS) technology has been widely used for the discovery of novel human papillomavirus (HPV) genotypes, variant characterization and genotyping. Here, we compared the analytical performance of NGS with a commercial PCR-based assay (Anyplex II HPV28) in cervical samples of 744 women. Overall, HPV positivity was 50.2% by the Anyplex and 45.5% by the NGS. With the NGS, we detected 25 genotypes covered by Anyplex and 41 additional genotypes. Agreement between the two methods for HPV positivity was 80.8% (kappa = 0.616) and 84.8% (kappa = 0.652) for 28 HPV genotypes and 14 high-risk genotypes, respectively. We recovered and characterized 243 complete HPV genomes from 153 samples spanning 40 different genotypes. According to phylogenetic analysis and pairwise distance, we identified novel lineages and sublineages of four high-risk and 16 low-risk genotypes. In total, 17 novel lineages and 14 novel sublineages were proposed, including novel lineages of HPV45, HPV52, HPV66 and a novel sublineage of HPV59. Our study provides important genomic insights on HPV types and lineages, where few complete genomes were publicly available.


Introduction
Human papillomavirus (HPV) is one of the most common sexually transmitted infections and the principal cause of cervical cancer [1]. HPVs belong to the Papillomaviridae family and are highly diverse. According to the International Committee on the Taxonomy of Viruses Study Group HPVs are classified in genera, species and types based on the L1 open reading frame (ORF), being the most conserved region of the HPV genome [2]. Different HPV types share less than 90% of the L1 gene nucleotide sequence [2]. HPV types with a difference in complete viral nucleotide sequence of 1% to 10% and of 0.5% to 1% have been further classified into lineages and sublineages denoted by letters and numbers, respectively [2][3][4]. Additionally, HPVs are classified according to their oncogenic potential into high-risk (hr) and low-risk (lr) genotypes [5,6].
Recent studies indicate that lineages of HPV16, 31 and 58 genotypes have different carcinogenic properties [3,[7][8][9][10][11]. For example, sublineage D2 of HPV16 and lineages A/B of HPV31 are associated with an increased risk of cervical precancer compared to lineages A and C, respectively [8,9,12]. While the scientific community is mainly focusing on the characterization of hr-HPV, only a limited number of lr-HPV have been classified and characterized into lineages [13][14][15].
Modern technological advances have resulted in the development of more than 250 HPV detection and genotyping methods [16]. The Anyplex II HPV28 (Seegene, Seoul, Korea) is a widely used commercial PCR-based assay for the simultaneous detection and genotyping of 14 hr and 14 lr HPV types. This assay with high analytical performance has been clinically validated for cervical cancer screening and used in HPV vaccine surveillance studies [17][18][19][20].
While PCR-based HPV assays detect by design only a limited number of targeted genotypes, next generation sequencing (NGS) may be an alternative approach for universal HPV detection and characterization [16,[21][22][23][24][25]. This technology facilitated the detection of novel HPV genotypes and variants [14,22,26] resulting in the identification of 222 HPV types and more than 100 lineages/ sublineages [27,28]. Rolling-circle amplification (RCA) has been developed for the random amplification of circular DNA genomes, including viruses from the Papillomaviridae family [29]. It was first used for HPV amplification and sequencing by Rector et al. before the availability of NGS [30], and since then few studies have used this approach for investigating HPV [14,31,32]. The RCA technology allows unbiased random amplification of HPV genome without prior knowledge of the sequence and therefore has a great potential for improving characterization of both known and novel HPV genotypes [26,32,33].
In this study, we aimed to investigate and characterize HPV diversity circulating in healthy young women attending vaccine surveillance study in Luxembourg using NGS-RCA technology. In addition, we compared the analytical performance of the Anyplex II HPV28 assay and NGS using RCA for HPV detection and genotyping. To the best of our knowledge, this is the first large scale study to present HPV diversity using RCA-NGS in healthy women and compare the analytical performance of a PCR based genotyping assay with NGS-RCA.

Materials and Methods
A total of 744 cervical samples of women (median age: 22, range  participating in an HPV vaccination surveillance project were included in the study. More details on the study population and recruitment process have been published elsewhere [20]. Briefly, cervical samples were collected with a cervical broom by gynecologists or other physicians and placed in a vial containing PreservCyt medium of ThinPrep (Hologic, Inc., Bedford, MA, USA) [20,34].

DNA Extraction
DNA extraction from ThinPrep samples for sequencing and genotyping by the Anyplex II HPV28 was performed using the QIAamp DNA mini kit according to manufacturer's instructions (Qiagen, Hilden, Germany) [35]. Prior to genotyping/sequencing DNA extracts were stored at −20 • C.
Real time PCR was performed in two multiplex reactions on the CFX96 real-time thermocycler (Bio-Rad, Hercules, CA, USA). The human beta-globin housekeeping gene was amplified along with L1 viral gene, as an internal positive control while water was used as a negative control. All reactions were performed according to the manufacturer's instructions and the Seegene viewer software was used for data recording and interpretation [19].

Library Preparation
Total DNA extracts were enriched using rolling-circle amplification (RCA) technology with the TempliPhi 100 kit according to manufacturer's instructions (GE Healthcare Life Sciences, New Jersey, NJ, USA) [30]. Libraries were prepared using Nextera XT DNA Library Prep Kits (Illumina Inc., San Diego, CA, USA) as recommended by manufacturer's instructions followed by sequencing on Illumina MiniSeq Platform (Illumina Inc., San Diego, CA, USA). Up to 96 samples were multiplexed using standard Nextera DNA CD Illumina indices in one run (2 × 150 bp paired-end).

Full Genome Assembly and Annotation
In order to assemble full HPV genome sequences, we followed two complementary approaches. For both approaches we performed an initial QC step, including visual inspection of FastQC report for each sample [37]. The strictness of applied quality criteria depended on the subsequent approach followed (Supplementary Figure S1A,B).
The first approach was to "blindly" assemble all reads from 744 samples (irrelevant of being HPV positive or negative) remaining after filtering against human, bacterial, plasmid and fungal reference sequences. The second approach was based on assemblies done on only dehumanized reads, which have been pre-mapped to all available HPV reference genome sequences. In both approaches mapping was performed with Bowtie2 [38] and de novo assembly with SPAdes (v. 3.13.0) [38,39]. Annotation of genomes was completed by means of an adapted version of VAPiD [40]. Detailed information on methods and annotation is provided in Supplementary Materials.
We obtained 75 identical genomes by both methods, 86 not fully identical, 42 only with blind assembly and 43 only with bowtie assembly. We selected 75 identical genomes, while not identical genomes were compared and further processed to allow assembly selection. All assemblies were checked by remapping fastq reads to assembled genomes and by aligning to the respective reference genomes. Of the total 246 genomes assessed, three (HPV42, HPV53, HPV56) were discarded from further analysis due to chimeric artefacts.

HPV Detection
For HPV detection and genotyping, quality controlled and dehumanized fastq-files from (trimmed and filtered against human genome as described above) were mapped with Bowtie2 to a reference set of 318 HPV sequences downloaded from PaVE (https://pave.niaid.nih.gov/, accessed in November 2018) and one novel genotype detected in our laboratory. Accession numbers of reference genomes are provided in Supplementary Table S1 [27].
Samples were considered positive if at least one concordantly mapping read pair was detected covering a minimum of 150 bp of the reference genome. To avoid possible mapping artefacts reads covering < 200 bp with the mutation rate > 0.03 (~4.5 variants on 150 bp read) were blasted against NCBI-based database (built on July 2019) and manually/visually checked. Artefacts were removed if the blast provided different results.

Phylogenetic Analysis and Lineage Classification
Complete HPV genomes obtained in this study (n = 243) were assessed to investigate HPV variant distribution in healthy women in Luxembourg. Complete genomes were linearized according to the respective reference sequence available in PaVE (https://pave.niaid.nih.gov/) [27] and aligned using MAFFT [41]. Each HPV isolate genome was classified into lineage or sublineage using the p-distance method and phylogenetic analysis with respective reference [3,4,13]. For the phylogenetic analysis the evolutionary history was inferred using RAxML method employing 1000 bootstrap values [42]. P-distance calculation was done in MEGA7 [43]. For HPV genomes with well-established lineages/sublineages we used the respective reference lineages and sublineages in PaVE and Chen et al. [13], whereas for HPVs with no established lineages/sublineages all complete genomes available in NCBI were downloaded for analysis (Supplementary Table S2). Phylogenetic trees were visualized in iTOL5.3 and Dendroscope 3 [44,45]. Plots were constructed in R (scripts available on request).

Statistical Analysis
Statistical analysis was performed using STATA 14 (College Station, TX, USA). We estimated HPV genotyping concordance by the percentage agreement and Cohen's kappa (k). Cohen's kappa values less than zero were classified as no agreement, 0.00-0.19 as poor, 0.20-0.39 as fair, 0.40-0.59 as moderate, 0.60-0.79 as good and 0.80-1.00 as excellent, as proposed in VALGENT [46,47]. Agreement, concordance and discordance were assessed restricting analysis to 28 genotypes included in the Anyplex II HPV28 assay and 729/744 samples with results available by both methods. Fully concordant samples were defined when all genotypes were detected by both Anyplex and NGS. Partially concordant samples were defined if at least one genotypes was shared by the two methods and discordant samples were defined as those where no genotypes were shared.

HPV Variant Distribution and Lineage Classification
We obtained 232 complete alpha HPV and 11 gamma HPV genomes from 153 samples ( Figure 2) spanning 40 different HPV types. The most-frequently obtained complete genomes were HPV51 (9.1%), HPV42 (8.6%), HPV53 (8.6%) and HPV59 (7.4%) (Tables 3 and 4 and Supplementary Table S3). Phylogenetic trees were created for each isolate and topologies were evaluated to classify our genomes to the existing or novel lineages/sublineages as described before [4]. Reference HPV genomes were considered as the prototype sequence, which were always assigned as lineage A or sublineage A1 (Supplementary Table S2).  [5]. 10 Analysis is restricted to 14 high-risk HPV, Anyplex positivity is restricted to viral load medium or high (++/+++).

HPV Variant Distribution and Lineage Classification
We obtained 232 complete alpha HPV and 11 gamma HPV genomes from 153 samples ( Figure  2) spanning 40 different HPV types. The most-frequently obtained complete genomes were HPV51 (9.1%), HPV42 (8.6%), HPV53 (8.6%) and HPV59 (7.4%) (Tables 3-4 and Supplementary Table S3). Phylogenetic trees were created for each isolate and topologies were evaluated to classify our genomes to the existing or novel lineages/sublineages as described before [4]. Reference HPV genomes were considered as the prototype sequence, which were always assigned as lineage A or sublineage A1 (Supplementary Table S2).   Figures S3 and S4).   Figure S3A-I). No complete genome of HPV18 could be obtained. We characterized novel lineages of HPV45, HPV52, HPV66 and a novel sublineage of HPV59 ( Figure 3A-D). The novel lineage C of HPV45 differed by 0.93-1.35% and 1.38-1.39% from lineages A and B, respectively. The difference between the novel lineage E of HPV52 and other lineages ranged from 0.54% to 1.56%. Our analysis suggests that lineage B of HPV59 can be subdivided into sublineages B1 and B2. The pairwise distance between sublineages B1 and B2 varied from 0.27% to 0.53%. The novel lineage C of HPV66 differed by 1.25-1.41% and 0.96-1.32% from lineages A and B, respectively (Supplementary Table S5 and Figure 3A-D).

High-Risk HPV
In total, 118 complete hr-HPV genomes were assembled, spanning 24 existing lineages and 31 sublineages of 13 hr-HPV types (Supplementary Figure S3A-I). No complete genome of HPV18 could be obtained. We characterized novel lineages of HPV45, HPV52, HPV66 and a novel sublineage of HPV59 ( Figure 3A-D). The novel lineage C of HPV45 differed by 0.93-1.35% and 1.38-1.39% from lineages A and B, respectively. The difference between the novel lineage E of HPV52 and other lineages ranged from 0.54% to 1.56%. Our analysis suggests that lineage B of HPV59 can be subdivided into sublineages B1 and B2. The pairwise distance between sublineages B1 and B2 varied from 0.27% to 0.53%. The novel lineage C of HPV66 differed by 1.25-1.41% and 0.96-1.32% from lineages A and B, respectively (Supplementary Table S5 and Figure 3A-D).
) indicates novel lineage/sublineage. The phylogenetic topology is shown on the left, pairwise differences for each isolate are shown on the right. Values for each isolate are connected by lines of different colors to distinguish each lineage and sublineage. Genomes from our collection are indicated with LNS identification number.

Gamma HPV
We obtained 10 gamma-6 and one gamma-13 complete HPV genomes. In total, seven novel variants were identified including lineages A, B, C of HPV101 and lineages A, B of HPV108 and HPV226 (Figure 4). The difference amongst HPV101 lineages varied from 1.30% to 2.51%. The distance between A and B lineages was 1.59-1.68% for HPV108 and 1.07% for HPV226 (Supplementary Table S6). Variant of HPV213 gamma-13 genotype differed by 0.77% from reference genome.

Gamma HPV
We obtained 10 gamma-6 and one gamma-13 complete HPV genomes. In total, seven novel variants were identified including lineages A, B, C of HPV101 and lineages A, B of HPV108 and HPV226 (Figure 4). The difference amongst HPV101 lineages varied from 1.30% to 2.51%. The distance between A and B lineages was 1.59-1.68% for HPV108 and 1.07% for HPV226 (Supplementary Table S6). Variant of HPV213 gamma-13 genotype differed by 0.77% from reference genome.

Discussion
Our study showed a high diversity of human papillomavirus in women attending cervical cancer screening. We characterized 243 complete HPV genomes of 40 different genotypes and proposed the classification of novel lineages/sublineages for 4 hr-HPV and 16 lr-HPV genotypes. To the best of our knowledge, this is the first large scale study to present HPV diversity using unbiased NGS-RCA in healthy young women and comparing the analytical performance of a PCR-based genotyping assay with NGS-RCA.
Our findings suggest that the PCR-based the Anyplex assay is superior to the NGS in terms of sensitivity since only 16% of genotypes with low viral load by the Anyplex were concordantly identified by NGS. This is not surprising considering that Anyplex detection is based on type-specific primers, whereas RCA is a random amplification of circular DNA. Nevertheless, NGS detected an additional 110 genotypes compared to the Anyplex PCR. Further, our study suggests that concordance between the two methods increases with higher viral loads. In a recent clinical evaluation of the Anyplex assay, the detection of cervical intraepithelial neoplasia of grade 2 or worse (CIN2+) was associated with medium and high viral loads of the 14 high-risk types only, but not with low viral loads or low-risk genotypes [48]. In our study, only 3.5% participants were diagnosed with abnormal cytology, while we recovered 40% of complete genomes from these samples (Supplementary Table S7). As such, it would be important to validate the NGS approach in a clinical context using established protocol such as VALGENT [46].

Discussion
Our study showed a high diversity of human papillomavirus in women attending cervical cancer screening. We characterized 243 complete HPV genomes of 40 different genotypes and proposed the classification of novel lineages/sublineages for 4 hr-HPV and 16 lr-HPV genotypes. To the best of our knowledge, this is the first large scale study to present HPV diversity using unbiased NGS-RCA in healthy young women and comparing the analytical performance of a PCR-based genotyping assay with NGS-RCA.
Our findings suggest that the PCR-based the Anyplex assay is superior to the NGS in terms of sensitivity since only 16% of genotypes with low viral load by the Anyplex were concordantly identified by NGS. This is not surprising considering that Anyplex detection is based on type-specific primers, whereas RCA is a random amplification of circular DNA. Nevertheless, NGS detected an additional 110 genotypes compared to the Anyplex PCR. Further, our study suggests that concordance between the two methods increases with higher viral loads. In a recent clinical evaluation of the Anyplex assay, the detection of cervical intraepithelial neoplasia of grade 2 or worse (CIN2+) was associated with medium and high viral loads of the 14 high-risk types only, but not with low viral loads or low-risk genotypes [48]. In our study, only 3.5% participants were diagnosed with abnormal cytology, while we recovered 40% of complete genomes from these samples (Supplementary Table S7). As such, it would be important to validate the NGS approach in a clinical context using established protocol such as VALGENT [46].
In total, we detected 66 different HPV genotypes including 90 lineages/sublineages, of which 27 were novel and not previously described. Although, the diversity of hr-HPV genotypes has been well studied [4,13], we were able to identify novel variants of HPV45, HPV52, HPV59 and HPV66. Further, we proposed the classification of 10 novel lineages and 13 sublineages of lr-HPV which has been less studied. We were not able to obtain any complete HPV18 genomes (only five positive samples by both methods) as~50% of the study population received HPV vaccine [20,49].
Several NGS-based approaches have been trialed, including RNA-baits, PCR-based and untargeted metagenomics [14,22,26,32,[50][51][52]. Due to the high variability of HPV genomes across genera, species and even genotypes, designing primers or baits with acceptable sensitivity in a cost-effective manner is a major challenge. In our study, we found the untargeted rolling-circle amplification approach, initially implemented for HPV sequencing by Meiring [53], to be a very useful tool for characterizing HPV variants and detection. Identification of new variants could be important for adopting PCR-based assays, as we observed substantial discordance in HPV type-specific genotyping of some hr and lr HPVs.
Previous studies applied an RCA-based approach in immune-suppressed individuals, where high viral load would facilitate HPV detection and complete genome recovery. Pastrana et al. reported the discovery of 83 novel HPV types using RCA-NGS with the prior physical enrichment of viral DNA from immunodeficient patients [32]. This method was successfully used for HPV detection and characterization by researchers in South Africa and Brazil in HIV positive populations [14,53]. Results of the study by Wang et al. suggest that RCA-based sequencing is more accurate than the long-PCR template cloning and deep sequencing of HPV58 [54].
In our study, we obtained 243 HPV complete genomes, which increases the number of publicly available complete HPV genomes on GenBank (~2283 complete genomes on April 2020) by approximately 10%. Our dataset provides important genomic insights on HPV variants where currently only a few complete genomes were available, such as HPV42, HPV59, HPV66, HPV67, HPV90, and HPV91.
Interestingly, we also obtained 11 complete HPV genomes of gamma species, with a population prevalence of 3.7% in our population. Our findings are in line with the previous research which found gamma-6 species in cervical samples confirming their mucosal tropism [26,[55][56][57][58]. Although gamma-6 pathogenicity is not well described, HPV108 could induce dysplasia in organotypic keratinocytes [57]. In our study, gamma-6 genotypes were present independently only in four samples, whereas other 23 samples were co-infected with other alpha HPVs. As gamma-6 HPV genotypes lack the E6 oncogene, they may have a lower oncogenic potential [56].
One of the limitations of our study was the absence of negative and positive control during the sequencing, in order to account for the cross contamination and index hopping. Therefore, our detection criteria might produce some false positive results. Nevertheless, only 3% of the samples were defined as HPV positive by NGS with at least one concordantly mapping read pair to the reference genome. Moreover, when considering more stringent detection criteria for HPV positivity by NGS (mapping coverage of 1000 bp), the agreement between methods did not change significantly (Supplementary Table S8). The NGS detection threshold of concordant read pairs might have a higher impact on genotyping results, rather than HPV positivity. We accounted for potential PCR or sequencing errors during the assembly process by stringent quality controls and we excluded three genomes from the analysis due to the presence of chimeric contigs.
The major advantage of the NGS method is that it requires no prior knowledge of which HPV genotypes are present in the samples. Overall, we were able to detect 51 different HPV genotypes including 26, which are not detectable by the Anyplex assay. As the majority of these HPV types belong to IARC Group 3 (not classifiable as to its carcinogenicity to humans) [5], our findings currently do not have direct clinical implications in the context of cervical cancer screening. Nevertheless, this approach is certainly useful from a population and evolutionary biology perspective and may have potential to address new disease etiology in the future.

Supplementary Materials:
The following are available online at http://www.mdpi.com/1999-4915/12/12/1437/s1, Figure S1A. Flow chart illustrating generation of blind assembly input data. Bar chart illustrating the generation of the blind assembly input data. The bar labels show the average single read count over 744 samples. For HG (Human Genome) we used genome version GRCh38. The average input size per sample for a blind genome assembly is 93,675 reads, which corresponds on average to~14% of the raw sample size. Figure S1B. Flow chart illustrating generation of bowtie assembly input data. Bar chart illustrating the generation of the bowtie assembly input data. The lower graph shows the generation of the genome assembly input data only considering HPV positive samples. Figure S2A- Table S5).
) indicates novel lineage/sublineage. The phylogenetic trees are shown on the left, pairwise differences for each isolate are shown on the right. Values for each isolate are connected by lines of different colors to distinguish each lineage and sublineage. Genomes from our collection are indicated with LNS identification number.