Analysis of HPV Integrations in Mexican Pre-Tumoral Cervical Lesions Reveal Centromere-Enriched Breakpoints and Abundant Unspecific HPV Regions

Human papillomavirus (HPV) DNA integration is a crucial event in cervical carcinogenesis. However, scarce studies have focused on studying HPV integration (HPVint) in early-stage cervical lesions. Using HPV capture followed by sequencing, we investigated HPVint in pre-tumor cervical lesions. Employing a novel pipeline, we analyzed reads containing direct evidence of the integration breakpoint. We observed multiple HPV infections in most of the samples (92%) with a median integration rate of 0.06% relative to HPV mapped reads corresponding to two or more sequence breakages. Unlike cancer studies, most integrations events were unique (supported by one read), consistent with the lack of clonal selection. Congruent to other studies, we found that breakpoints could occur, practically, in any part of the viral genome. We noted that L1 had a higher frequency of rupture integration (25%). Based on host genome integration frequencies, we found previously reported integration sites in cancer for genes like FHIT, CSMD1, and LRP1B and putatively many new ones such as those exemplified in CSMD3, ROBO2, and SETD3. Similar host integrations regions and genes were observed in diverse HPV types within many genes and even equivalent integration positions in different samples and HPV types. Interestingly, we noted an enrichment of integrations in most centromeres, suggesting a possible mechanism where HPV exploits this structural machinery to facilitate integration. Supported by previous findings, overall, our analysis provides novel information and insights about HPVint.


Introduction
Cervical cancer (CC) is the is the fourth most common malignancy and a leading cause of mortality in women worldwide [1]. In Mexico, CC is the second most common cancer in women [2]. High-risk (HR) human papilloma viruses (HPV) infection is the most important

Patient Samples
A total of 24 cervical cell samples were taken from January 2014 to July 2016, DNA was extracted and stored at −20 • C until use. Samples were grouped according to Pap smear result: Normal, atypical squamous cells of undetermined significance (ASCUS), low-grade cervical lesions (LSIL), high-grade cervical lesions (HSIL), and Unknown (when Pap smear result was unavailable).
The patients included in this study were referred to the colposcopy consultation of the Department of Gynecology and Obstetrics of the Hospital Universitario "Dr. Jose Eleuterio González" of the Universidad Autónoma de Nuevo León (HU-UANL) due to an altered Papanicolaou test, genital warts, or genital pathology detection. The median age was 36 years (SD = 13) and had a mean body mass index (BMI) of 26.3, where half of these patients had overweight. The patients selected for this study had a sexual debut at a median age of 20 years old with an average of 3 sexual partners. Fifteen patients had a previous sexually transmitted disease reported in their clinical history.
We observed 58% HPV-type agreement between both detection methods (Supplementary Table S1). The HPV types most frequently detected by qPCR were HPV 39 and 52, whereas by HIVID-NGS were 51, 52, and a possible variant of 45 (t45).

HIVID-NGS HPV Genotype Detection
We noted that the 24 samples obtained more than 140,000 total reads (Table 1), 67% showing length of 142 nt (85% >= 120 nt) with a median insert size of 171 nt (median absolute deviation of 55 nt). With these overall sequencing results (140K reads, 120 nt each), we could reach depths around 2100 per nt for a typical HPV genome (8000 nt), or at least depths around 630 considering 30% mapping reads, which seems reasonable to estimate HPV presence. Thus, to determine possible HPV types per sample by our capturesequencing strategy, we first mapped the reads to the hpvDB containing 451 representative HPV types. The results are shown in Table 1. On average, the mapped reads were 33%. A sample showed less than 500 total mapped reads (M-7449), suggesting that HPV typing is uncertain. The number of detected HPV types was not correlated to the number of reads per sample (R 2 = 0.00, p = 0.81), mapped reads (R 2 = 0.03, p = 0.43), or % of mapping (R 2 = 0.00, p = 0.81), suggesting that the number of types detected is not highly dependent on sequencing reads. Most samples showed multiple HPV infections (92%), similar to 93% found in liquid-based cytology specimens using HPV capture technology [21]. We noted 15 HPV types in two or more samples (Table 2). To determine whether HPV type detection in several samples is not the result of cross-contamination, we first observed that the pattern of HPV types in samples was not a subset of another sample. For example, after analysis of the five samples for HPV 51 (M-7454, M-7457, M-7460, M-7462, M-7472), all showed different patterns of detected HPV types, indicating that there was no contamination between them (see Table 1). Moreover, we compared the particular nucleotide variant sites observed in HPV 51, 11, 52, and 16 ( Figure 2). In all cases, each sample can be distinguished from the others by particular nucleotide variant sites. Thus, these results demonstrate that HPV typing by sequencing is not product cross-contamination. For t45 (MP134365 showing oligonucleotide frequencies similar to HPV45), we noted detections in 10 samples in more than 1% but at a low number of reads (Tables 1 and 2). Only two regions of the genome were mapped (around 1500 and 7400). Still, reads do not show evidence of cross-contamination (Supplementary Figure S1). The low number of reads and the high number of nucleotide variant sites suggest that a similar but distinctive type might be present instead of HPV45 or t45.    In samples with a high number of reads, we observed complete coverage of reads all along the HPV genomes ( Figure 2). Interestingly, we noted a highly conserved pattern of depth across samples within HPV types. Because our results show that cross-contamination is unlikely, the similarity in the depth pattern is more likely due to biases of the capturing, sequencing, mapping, and database used rather than to the relative presence of HPV genomic regions.

Identification of Viral Integrations
In our preliminary analyses, we tried common approaches that we have already used in cancer to estimate integrations [20]. Nevertheless, we observed almost no reported integrations even though we manually observed that unmapped and partially mapped reads contained HPV and human sequences. We reasoned that current pipelines for viral integrations methods are focused on cancer samples [20,[29][30][31] where cells have already passed through a heavy clonal selection process. Therefore, we did not follow these methods. Instead, we devised a novel approach considering only not fully mapped reads and whose one end matched to HPV and the opposite end matched to human (see Methods and Figure 1). Using this approach, we focus on reads that must contain evidence of an integration point within the read. Then, candidate reads were aligned (using blastn) to human and HPV to determine specific read fragments at nucleotide level corresponding to each genome. Finally, reads were filtered as described in methods to diminish false positives per sample and only possible integrations involving HPV types reported in Table 1. Figure 3A shows examples of detected integrations in two known recurrent cancer genes, RAD51B [20] and MACROD2 [32]. Figure 3B shows two reads of different lengths (no PCR duplicates) mapping to the same positions close to gene RAB32. Overall, the putative integrated reads across samples are summarized in Table 3. Relatively, the median percentage of integration was 0.06% of the mapped reads. The maximum was 1.71% observed in a carcinoma sample (M-7440), followed by 1.52% in a HSIL sample. The minimum was 0.02% observed in a LSIL sample (M-7445) followed by 0.03% in a PAP-normal sample (M-7471). We used the hits to chromosome Y (chrY) as an indicator and estimation of false-positive rate. The median of hits to chrY was 0.2% (0-0.7%), approximately to 1 per 500 hits, which is approximately 10 times less than expected by random chance (1.9% relative to the chrY size), suggesting that our pipeline is precise.
We observed that the percentage of integrated reads per type highly corresponded to the overall mapping shown in Table 1.
To determine whether there are preferences for integration sites in HPV, we compared the overall mapping to the putative integrated reads. We noted that peaks of putative integrated reads highly correspond to peaks of mapped reads across samples, HPV types, and viral load, as shown in representative comparisons in Figure 4 besides those shown in Figure 3. For example, the sample M-7447 shows clear peaks of integrations in HPV52 and HPV74 correlated to the overall sequencing maps' peaks. Moreover, we observed similar behavior in low load types (HPV87 for M-7447 sample in Figure 4). This result suggests that integration at this stage of the disease has not been highly selective for specific HPV genome regions. Arrows denote marked reads within detected tandem repeats regions. Nucleotides in lowercase denote mismatches relative to the target sequence; if underlined, refer to insertion relative to the target. The vertical axis at the left refers to integrations, whereas the right axis refers to overall HPV mapping.   Different from cancer samples [19], we rarely observed repeated reads for a specific integration. This result is consistent with the idea that lesions still have not been through clonal selection. Nevertheless, in manual revision, we noted two different detected reads of a putative integration close to the gene RAB32 ( Figure 3B). One of these reads was marked by our pipeline containing 35% of their sequence within a detected tandem repeat region (TG in Figure 3B). Therefore, we systematically estimate the integration points showing more than one read of evidence after removing reads in repeat regions. We found 372 regions in 13 samples close to regions of 354 human genes (Supplementary File S2). Only eight gene regions in 9 samples were found to have two or more reads in 2 or more samples ( Figure 5A). Because some reads do not seem to be PCR duplicates (such as shown in Figure 3B), these results suggest that low numbers of cells have been divided after the integration. Because some genes have shown a higher frequency of HPVint in cancer [10,20,[32][33][34], we wonder whether, in lesions, integrations can also be detected in high-frequent genes. To improve certainty, we only considered coding genes showing two or more reads in the same sample independently of the exact integration point. We found 2,616 genes in 13 samples (Supplementary File S3). To highlight the most frequent genes, we selected those genes present in 4 or more samples ( Figure 5B). We noted previously reported and novel genes. For example, FHIT, CSMD1, and LRP1B ( Figure 6) have been reported in cancer and lesions [10,35]. Nevertheless, Figure 5B shows more than 30 genes not previously reported found in 4 samples or more. Figure 6 shows examples of CSMD3, ROBO2, and SETD3. The integration point in SETD3 seems to be the same across samples and HPV types. Nevertheless, the read sequences show very similar hits in a close and apparently duplicated region at~10 kb distance. We also counted integrations in non-coding regions that showed two reads or more per sample in regions around 10 kb From the 5172 regions in 18 samples (Supplementary File S4), we highlighted those present in 3 or more samples ( Figure 5C). Interestingly, we noted that 20 of the 28 regions shown are located very close to chromosomal centromeres. An overall map of integrations confirms that centromere regions are rich in integrations ( Figure 7). This observation is consistent with previous reports linking HPV proteins with centromere and kinetochore [36,37].

HPV Genotyping
CC remains a public health problem in Mexico since it is the second cause of death from cancer in women [2,38]. Persistent infection with HR-HPV produces pre-cancerous lesions starting with low-grade squamous intraepithelial lesions (LSIL), progressing to high-grade squamous intraepithelial lesions (HSIL) until CC is generated [39,40]. Although infection with HR-HPV is necessary to develop the oncogenic process, its genome integration is considered one of the most important risk factors for cervical carcinoma development [10]. Therefore, in this study, we analyzed the HPVint in pre-tumorous lesions in Mexican women. It is important to mention that when we compared observed HPV with nucleotide variant sites in each sample (for example those shown in Figure 2), we distinguished them from each other by presenting particular variant sites. Thus, the HPV typing by sequencing and the subsequent HPVint are not a product of cross-contamination.
HPV genotyping by NGS and qPCR were in agreement in about 58% of patients. qPCR is limited by the primer sequences, whereas NGS is limited by the probe sets. Because the probe set is comprehensive among types and across the genome [10], NGS should be more accurate to detect HPV types. Nevertheless, it is also limited by the database used to map reads. In this context, we preliminarily compared the reads mapped into the hpvDB used containing 451 HPV sequences against a database of 4342 HPV sequences (10x more sequences), resulting in only 6% improvement in mapped reads for 20 of the 24 samples, suggesting that the database used and the HPV typing were adequate. The use of large collections of sequences, such as all 4,342 sequences, generates the problem that many reads are assigned to very similar sequences fragmenting the estimation of correct types. Even in hpvDB of 451 sequences, we observed fractions of reads assigned to t52 rather than to HPV52, presumably due to minor sequence changes. By NGS, the most frequent HPV genotypes found were HPV 51, 52 plus t52, and HPV 11. We observed t45 (MP134365) in less than 1% on 11 samples or a low number of reads in other samples. Only two regions of 300 nt were mapped. t45 is a sequence similar to HPV 45, which in turn is similar to HPV 18 and 59 [4]. We noted co-occurrence of t45, with HPV 18 and 59. Thus, it is likely that the t45 reads correspond to other genotypes putatively similar to HPV 18 and 59 or to a particular type common in our population. These subtle details need to be solved.
Another interesting finding in this study was that most all the patients had multiple HPV infections by the NGS method (92%). Our results are in good agreement with a previous study using NGS for HPV genotyping that also found that NGS had high sensitivity, in particular for multiple infections [43,44]. Other authors found multiple HPV infections in 98% of the liquid-based cytology samples with precancerous lesions [21]. The HIVID-NGS method we used can detect multiple infections and unknown or incompletely characterized types, for which sequence data are not available.

HPVint
HPV is frequently integrated into the human genome [17]. The integration of highrisk human papillomavirus (HR-HPV) into the host genome is seen in~85% of cervical squamous cell carcinomas. It is viewed as a critical driver of squamous carcinogenesis [45]. Therefore, the integration of HPV is considered an event that promotes cellular carcinogenesis [12,22,46]. Most of the HPVint studies are focused on carcinomas [20,[29][30][31] where clonal selection for tumor growth has already happened, and an integration event showed several non-PCR duplicates reads. In this study, nevertheless, we analyzed the viral integration in precancerous samples [16,21,47] where a new analysis strategy was designed that allows us to identify reads with evidence of the viral integration with presumably high sensitivity (Figure 1). It was found that virtually all samples analyzed presented viral integration (96%) (except one sample of scarce mapped reads). Moreover, several of the samples (63%) showed integrations of two or more viral types. Our data reveal that HPVint occurred in early stages before carcinogenesis, which agrees with other reports, where HPVint was observed in early stages and that the rate and number of integrations increases according to the progression of the disease [10,13,24,48]. We included three HPV positive samples with normal Pap, all of them showed integrations. Although we found a higher proportion of integration in the early stages than has been reported, this may be due to differences in the sample characteristics, the sensitivity of the assay method, the database, and the pipeline used. NGS combined with capture technology seems to be the most sensitive method to detect HPV viral integrations [21,[49][50][51][52]. It would be interesting to study whether specific integrations are related to HPV infection persistence or other clinical characteristics.

Centromere Integrations
We noted an enrichment of integrations in most centromeres (Figures 6 and 7). There is evidence linking HPV proteins to the centromere and kinetochore [36,37]. This fact suggests a possible mechanism where HPV utilizes some of the structural machinery to facilitate its integration. In this regard, the data obtained in our study seems to be scarce to answer further questions. Nonetheless, a specific experimental design where deeper sequencing is performed in wisely selected samples may be adequate. In our data, we noted an enrichment in most centromere chromosomes except in chr3, chr5, and chr13, and low in chr12 and chr20. It would be exciting whether this may be related to particular HPV types, stages of the disease, populations, chromosomes, or technical issues. In our data, the enrichment seems to be around 3 to 30 times higher in centromeres than in other parts of the human genome.

HPVint Ratio and Breakpoint
Other authors noted that the integrations' frequency increases with lesion progression [16,53,54]. Thus, an important issue in the analysis of HPVint is the fraction that may be present in integrated form compared to that in episomal. A recent analysis used the fraction of reads with integrations relative to those showing no integration in the same region as the episomal [21]. This estimation seems correct, assuming that many cells carry the same integration event, and the integration site is unique. These assumptions are acceptable for cancer or tumors where clonal selection has driven cellular expansions. However, for non-tumoral cells, applying the same criteria could, presumably, result in highly deviated estimations. In this context, we interestingly observed that depth of integrated reads along the HPV genomes correlates with overall depth. This observation suggests that choosing an integration point in the HPV genome is close to a uniform random process, which agrees with other studies [10]. Consequently, the expected fraction of integration reads in about 8 kb length is 0.0125% per breakage point (nucleotide). One breakage point would open its circular episomal form to generate a linear molecule that can be integrated into the host genome. Two or more breakage points would generate smaller fragments. We observed that the minimum integration read fraction was approximately 0.02% and 0.03% for an LSIL and Normal pap smear patients, respectively, which corresponds to about 2 breakage points. On the other hand, we also observed samples with 10 times more integrated read fractions of 0.12% for an LSIL patient and even 100 times more than one breakage point to about 1.52% for an HSIL patient. We noted that the two HSIL samples showed high integration fractions (0.08% and 1.52%), which can be achieved by raising the HPVint activity or by an infection that persists over time, aggregating integrations. In any case, the integration fraction could be used as a surrogate measure for future experiments or studies.

Comparison of Reported HPV Genes vs. Observed
Consensus reports reached that various portions of the HPV genome are deleted in the integrated HPV sequences [55]. Common disruption of the viral E2 gene has been demonstrated in different studies, resulting in functional inactivation. Loss of the E2 expression abrogates the E2-mediated repression of E6/E7 transcription from integrated HPV DNA and increases the expression of these oncoproteins that induces HPV-immortalization [3,55,56]. Globally, we found that breakpoint could occur in any part of the viral genome, which agrees with what was previously reported with the same methodology we used [10], with a higher frequency in the L1 region (25%), followed by L2 (17%), E7 (16%), and E1 (14%), and less frequency in E6 (4.6%), and E4 (4%). Other researchers developed the HIVID-NGS methodology that we used [10], finding a higher frequency of rupture in E1. Another study in cervical lesions found that the disruptions were more frequent in the L2 gene (67.7%), followed by the L1 gene (25.8%) and the E1 gene (22.6%) [16]. When we analyze the samples individually, we observe differences in integrations. In some samples, we observe a higher disruption in E2, L1, or L2. However, we also found samples in which E2 is not broken. The differences compared with our results may be due to the histological classification of the samples, viral genotype, database, and pipeline used. Our view is that there is variability between patients, viral strains, and other unknown factors such as lifestyle.

Comparison of Reported Human Genes
Viral genome integration into the host genome triggers various genetic alterations, such as oncogenes amplification, tumor suppressor gene inactivation, and inter-or intrachromosomal rearrangements, as well as genetic instability [14,33]. The integration event could lead to uncontrolled growth, which can eventually lead to cancer [15].
It has been suggested that HPV initially integrates into the human genome randomly, based on its accessibility to the genome [10,13]. However, cancer studies have shown that the integration of the viral genome into some loci or hot spots is recurrent and can confer a selective growth advantage [10,33].
In our study, we found that the integration of the viral genome was almost nonselective when analyzed by large segments, finding breakpoints throughout the entire human genome. However, when analyzed by gene, we also detected similar integrations in various samples. It was found that in 2,616 coding genes, 2 or more integration sites presented among samples (Supplementary File S3), like RAD51B, MACROD2, FHIT, CSMD1, LRP1B, and DLG2 ( Figures 5 and 6), which had already been previously reported as frequent sites of integration in squamous carcinoma samples [10,33,34]. The RAD51B gene is implicated in the DNA repair pathway; viral integrations in this gene cause loss of function and have been associated with other types of cancer, like breast, ovary, prostate, and colorectal [14]. Other authors also found integration in FHIT and LRP1B [10]. Intronic HPV breakpoints in FHIT and LRP1B have been related to decreased protein expression in carcinoma samples [8]. Also, it has been reported that HPVint in MACROD2 may cause gene loss of function and impact genome instability [33]. Non-coding and structural variations in the MACROD2 gene have been associated with cancer predisposition, especially colorectal cancer, reported to alter DNA repair [33,57]. Thus, our results are consistent with previous findings indicating that our data showing novel results are valuable.
Nevertheless, we found viral integration in more than 30 genes not previously reported, like CSMD3, ROBO2, and SETD3 ( Figure 5B). CSMD3 is involved in dendrite development [58] and germline variants in this gene have been linked to colorectal cancer [59]. ROBO2 participates as axon guidance, and cell migration [60] and low levels of mRNA expression are associated with poor survival in pancreatic ductal adenocarcinoma [61]. Another interesting finding was that the integration point in SETD3 was the same for all samples and HPV type, which could be a pattern of integration in precancerous samples. An increase in the number of samples is required to corroborate this finding. This gene has been identified as an actin-specific histidine N-methyltransferase, and its expression has been associated with oncogenesis, especially in breast cancer [62]. Even though these three genes are involved in carcinogenesis, the meaningful impact in cervical cancer is not clear.
The interpretation of the integrations for many other genes and regions observed in various samples could be limited for various reasons. First, we used reads that show at least 30 not-mapped nucleotides. Although we observed no apparent bias in mapping, we noted some reads that could match similar regions of the genome. In this context, larger reads may be helpful. Second, we only used reads having evidence of the integration because we would like to quantify specific positions while paired reads mapping to different genomes are not accurate and to provide confidence in the use of a novel pipeline in non-tumoral samples. Nevertheless, paired reads may also be informative. Third, there is a lack of models to estimate the frequency of integrations expected by chance correcting for gene size, perhaps sequence context, and sample reads (indeed, we observed a high variation in the number of integrations among samples, from a handful to thousands).

Study Population
We selected 24 patients with gynecological alterations referred to the colposcopy consultation at the Gynecology and Obstetrics Department HU-UANL in Monterrey, Nuevo León, México. All of them agreed to participate in this study by signing an informed consent, previously approved on August 10, 2011 by the Institutional Review Board of the Hospital Universitario "Dr. Jose Eleuterio González" of the Universidad Autónoma de Nuevo León (Project identification code BI11-002). The PAP smear result was re-interpreted from transcripts in all but two samples. All the patients included were previously detected as HPV-positive [63].

Sample Collection and DNA Extraction
Cervical samples were taken using a cytobrush (Colpoltre ® ), preserved in ThinPrep ® PreservCyt solution, and stored at −70 • C until DNA extraction. Samples were collected from January 2014 to July 2016. DNA was extracted from cervical cells using the PureLink Genomic DNA kit from Invitrogen ® (Life Technologies, Carlsbad, CA, USA), following the manufacturer's instructions. DNA quantity and quality were measured by spectrophotometry. DNA was stored at −20 • C until use. Samples were grouped according to Pap smear as Normal, ASCUS, LSIL, HSIL, and Unknown.

HPV Genotyping
HPV detection was performed using the consensus primers set PGMY 09/11, which amplifies a broad spectrum of HPV types. β-globin gene was used as an internal control. PCR products were analyzed by 2% agarose gel electrophoresis, stained with ethidium bromide, and visualized in a UVP Model 2UV High-Performance Transilluminator (Upland, CA, USA).
qPCR was performed to genotype and quantify HPV types in samples, using AmpliSens ® HPV HRC genotype-titer-FRT kit (Ecoli, Bratislava, Slovak Republic) according to the manufacturer's instructions in a 7500 Fast Real-Time PCR System (Thermo Scientific, Waltham, MA, USA).
This kit is based on simultaneous real-time amplification (multiplex PCR) of DNA fragments of 14 HR-HPV types (HPV 16,18,31,33,35,39,45,51,52,56,58,59,66, and 68) and a DNA fragment of the β-globin gene as an internal endogenous control, carried out in four separated reaction tubes. Each genotype is detected in a separate fluorescent channel (FAM, JOE, ROX, and Cy5) that allows its detection and quantification (a total of 16 qPCR probes are measured). Each genotype is detected in a separate fluorescent channel, making it possible to determine the genotype and viral load.

Capturing and Sequencing
HIVID-NGS was performed by MyGenostics (Beijing, China) Gene Technology Company, as reported previously [10]. A target HPV gene region capture, and sequencing were performed using the ViralCap_HPV kit for capturing HPV genomes of 32 types (6,11,16,18,31,33,35,39,45,52,56,58,59,66,68,69,82, and others) developed by MyGenostics (Beijing, China) Gene HIVID-GNS method, HPV-specific probes are used to capture virus and flanking sequences prior to unbiased PCR amplification and NGS. It produces high-quality genotype data. Increased chance of finding HPVint due to sequence capture [26]. Technology Co., Ltd. This kit allows capturing the viral sequences in simultaneous detection of all known virus subtypes and virus variants, as well as information on the integration of the viral genome into the host genome. The libraries were constructed following the manufacturer's instructions. DNA samples were sheared to 150-200 bp DNA fragments using a Covaris S2 system ultrasonicator (Covaris, Inc., Woburn, MA, USA). The fragments were purified, end blunted, A-tailed and adaptor-ligated. Libraries were hybridized with HPV probes, including 32 types of HPV, and then washed to remove uncaptured fragments. The eluted fragments were amplified by PCR to generate libraries for sequencing. Libraries were quantified then sequenced in the Illumina NextSeq500 high-throughput sequencer according to the manufacturer's instructions (Illumina Inc., San Diego, CA, USA). The length of 91% of reads was 142. The insert size median was 171, and the median absolute deviation of 55.

HPV Mapping and Typing from Sequencing
A set of most dissimilar HPV types genomes was generated from Papillomavirus Episeme (PaVE) [64], a curated viral database dedicated to HPV, and from the NCBI nucleotide database. A clustering approach from oligonucleotide frequencies was used to generate a database of 451 representative HPV genomes (hpvDB). The database is included in Supplementary File S1. The reads were mapped to hpvDB using bwa-mem. Overall estimations of viral presence were estimated mapping raw reads to the hpvDB. The most abundant HPV type was always considered. To include other types, we integrated those whose counts were at least 1% of the most abundant type and higher than 50 reads or that the read count was at least 5000.

Viral Integration Pipeline
Viral integrations pipeline methods are focused on cancer, which are based on multiple reads supporting the same integration point [20,[29][30][31]. Therefore, we designed a specific data analysis pipeline for pre-malignant cervical lesions for viral integration in capture-based sequencing. Here, we will briefly describe the pipeline schematized in Figure 1. First, because our methodology is based on capturing HPV sequences, we focused on reads not fully mapped to hpvDB, after removing adaptor sequences. This process was performed by analyzing the CIGAR field of SAM/BAM files (https://github.com/samtools/hts-specs, accessed on 4 January 2021) taking reads with 30 nt or more not matched. Second, we generated pseudo-reads of 30 nt from 5 and 3 ends from NGS reads that did not fully map to hpvDB. Third, we identified candidate integration reads whose one end mapped to human (hg19) and the other end mapped to hpvDB. Fourth, original NGS candidate reads were aligned to human and hpvDB using Blastn [65]. The procedure was performed independently on each of the two paired reads (Figure 1). Fifth, we then filtered those blast hits showing a misalignment size less than 20, showing mismatch ratio less than or equal to 1/15, showing duplicate start-end positions in the same target sequence (taking the most significant), or, in HPV, showing alternative alignments for same target sequence (taking the most significant). This filtering was independently performed to human and HPV blast hit lists and finally removing those reads not found in both. In case of multiple hits per read into either human or hpvDB, we analyzed the top 3 most significant hits for annotation. To determine the most sensible hit, we used the annotation of the paired reads to resolve the ambiguous hit within the 3 most significant when possible, or used the most significant from blast otherwise. Sixth, reads were marked when they have more than 33% of their sequence in repetitive regions, when involving chromosome Y, or where both paired reads show integration, and the direction of mapping does not correspond to the expected inverted direction. To determine repetitive regions, we used Phobos setting parameters for detecting regions of size 12 nt or more (-searchMode imperfect -minLength 12 -minScore 2 -minLength_b 2 -minScore_b 2 -recursion 5 -outputFormat 3 -printRepeatSeqMode 2 -reportUnit 0 -mismatchScore −3 -indelScore −4).

Conclusions
In this study, our results had revealed characteristics of HPVint in precancerous lesions. We observed that breakpoint in HPV can occur in any part of the viral genome, with a higher frequency in L1 gene. We rarely observed repeated reads for a specific integration site, contrary to what is observed in cancer samples. This observation is consistent with the idea that lesions still have not been through clonal selection. Based on host genome integration frequencies, we found previously reported integration sites like FHIT, CSMD1, and LRP1B and others such as CSMD3, ROBO2, and SETD3. We noted an enrichment of integrations in most centromeres, suggesting a possible mechanism where HPV utilizes centrosome or kinetochore machinery to facilitate integration. We observed that in precancerous cervical cells, there are already integrations in genes observed after clonal selection. The integration sites identified in the host genome could be used as possible biomarkers for early diagnosis in patients with cervical lesions.
This study provides a theoretical basis to understanding the mechanism of tumorigenesis from the perspective of HPVint and its association with cervical lesions.