BKTyper: Free Online Tool for Polyoma BK Virus VP1 and NCCR Typing

Human BK polyomavirus (BKPyV) prevalence has been increasing due to the introduction of more potent immunosuppressive agents in transplant recipients, and its clinical interest. BKPyV has been linked mostly to polyomavirus-associated hemorrhagic cystitis, in allogenic hematopoietic stem cell transplant, and polyomavirus-associated nephropathy in kidney transplant patients. BKPyV is a circular double-stranded DNA virus that encodes for seven proteins, of which Viral Protein 1 (VP1), the major structural protein, has been extensively used for genotyping. BKPyV also contains the noncoding control region (NCCR), configured by five repeat blocks (OPQRS) known to be highly repetitive and diverse, and linked to viral infectivity and replication. BKPyV genetic diversity has been mainly studied based on the NCCR and VP1, due to the high occurrence of BKPyV-associated diseases in transplant patients and their clinical implications. Here BKTyper is presented, a free online genotyper for BKPyV, based on a VP1 genotyping and a novel algorithm for NCCR block identification. VP1 genotyping is based on a modified implementation of the BK typing and grouping regions (BKTGR) algorithm, providing a maximum-likelihood phylogenetic tree using a custom internal BKPyV database. Novel NCCR block identification relies on a minimum of 12-bp motif recognition and a novel sorting algorithm. A graphical representation of the OPQRS block organization is provided.

BKPyV is a circular double-stranded DNA virus (cdsDNA) with an average genome size of 5100 bp and an average GC content of 40%. Its genome is structured in two sections, the early and late coding regions. The early region encodes the regulatory proteins, i.e., large tumor antigen (LTag), small tumor antigen (sTag), and the truncated tumor antigen (truncTag), all derived by alternative splicing of a single primary transcript. The late region encodes the structural proteins VP1, VP2, and VP3, as well as a small regulatory protein known as Agno protein. Additionally, the BKPyV genome contains a noncoding control region (NCCR), also known as the regulatory region or the transcript control region, separating the early and late regions. The NCCR directs early and late transcription and replication of the genome as it contains the origin of replication (ori). NCCR is a bidirectional promoter-enhancer region that includes binding sites for several transcription factors [4]. The NCCR is known to be formed by repeat-rich regions, being highly variable in sequence and length [5][6][7]. Strains with NCCR rearrangements (i.e., deletions, insertions, or duplications of complete or partial blocks) are found in patients suffering from BKPyV disease. Previous studies have suggested the possible role of NCCR rearrangements in viral replication, as rearranged NCCR BKPyV emergence in plasma has been linked to increased replication capacity and disease in kidney transplant recipients [8].
Subtyping of BKPyV has been based on VP1 genetic diversity. There are four major VP1 subtypes: I, II, III, and IV. Typing of these regions has been first performed by restriction endonuclease of a 327 bp variable region of VP1 [9]. Later, with the reduction of sequencing cost, routine typing of BKPyV has been conducted by Sanger sequencing and recently by Illumina sequencing [10]. Worldwide, the most abundant subtype is subtype I (80%), followed by subtype IV (15%) [2]. Contrary, the sister groups, i.e., subtypes II and III, are rarely detected. In its turn, subtype I can be subdivided in Ia, Ib-1, Ib-2, and Ic [11]; and subtype IV in IVa-1, IVa-2, IVb-1, IVb-2, IVc-1, and IVc-2 [12]. Epidemiology and geographical distribution of BKPyV have been previously studied [11], focusing on subgroups I and IV, which are known to have variant distributions between continents. BKPyV subtyping and subgrouping are conducted routinely in diagnostic assays and in epidemiological studies, albeit its prognostic value remains unclear. Recently, Morel et al. designed a strategy to subtype BKPyV, based on a 100 bp amplicon, called the BK typing and grouping region (BKTGR) [13]. In this same study, Morel et al. suggest a subtyping algorithm, validated through multiple sequence alignment and phylogeny.
The archetypical NCCR has been arbitrarily divided into diverse repeat-blocks as follows: O (142 bp), P (68 bp), Q (39 bp), R (63 bp) and S (63 bp). The O, P, Q, R, and S blocks, despite containing numerous transcriptional and binding sites, are not transcribed. BKPyV strains isolated and sequenced from urine in both immunocompromised and immunocompetent individuals mostly contain the archetypical NCCR architecture with minor variants, and it is thought to be the conformation found in transmissible virus [14,15].
It is worth noting that archetypical strains of BKPyV (being the WW strain the prototype strain), in contrast to rearranged types, replicate poorly in cell culture, indicating the role of NCCR rearrangements in viral growth in various cell types in vitro [16]. In addition, the rearranged structure of the NCCR are often found in kidney and other body compartments, linked to increased viremia, viruria [8,17], and frequently in association with diseases [18,19]. However, recent studies suggest that rearranged BKPyV is not required for developing PyVAN. Therefore, rearrangements appear not to correlate with viral reactivation disease and development, but with early and late gene expression and overall viral replication, coincidently with the high prevalence of transcriptional factor binding motifs [4,18,19]. It is hypothesized that rearranged viral strains indicate prolonged immunosuppression, favoring enhanced viral replication and therefore giving rise to NCCR rearrangements [4,[18][19][20].
The incidence of BKPyV reactivation is mostly observed in renal transplant recipients, being a significant risk factor for developing PyVAN that is associated with a high chance for graft loss. Therefore, understanding BKPyV genetic factors that can be associated with increased pathogenicity is crucial. Due to the incidence of BKPyV-associated diseases and their clinical implications for human health, a reliable, automatic, and free BKPyV typing tool would be of great interest.
In this manuscript, BKTyper is presented as a reliable, automatic, free-typing tool for BKPyV virus genotyping based on VP1 typing (implementing an adaptation of Morel et al. algorithm [13]) and a novel algorithm for reliable NCCR block identification. BKTyper can be found as an online free service (http://bktyper.zidu.be/) or on GitHub (https://github.com/joanmarticarreras/BKTyper) for local installation.

Results
BKTyper is the first free online tool that allows automatic and reproducible BKPyV typing. BKTyper conducts two independent, non-mutually exclusive, typings: (i) VP1 genotyping, based on single nucleotide polymorphisms at the BKTGR region and (ii) NCCR structure identification, based on newly canonized OPQRS sequences, and a novel algorithm designed to discriminate incomplete blocks. Additionally, it provides the phylogenetic context for VP1 BKTGR and graphical disposition for the NCCR structure.

VP1 Genotyping
The implementation of the VP1 typing is an adaptation of the BKTGR algorithm proposed by Morel et al. in 2017 [13]. Such an algorithm needs to account for two premises: (i) variable length and coordinates from query sequences, and (ii) VP1 gene or alignment can contain gaps.

VP1 BKTGR (BK Typing and Grouping Region)
BKTGR typing relies on specific single nucleotide polymorphisms (SNPs) at reference positions 1976, 1988, 1995, 2006, 2018, 2057, and 2075 (VP1 gene). Input sequences may have diverse lengths and may cover slightly different sections of the BKPyV genome. In order to generalize and automatize the typing process, BKTGR typing coordinates are first transferred to VP1-based coordinates (see Table 2). VP1 alignments can have internal gaps, which will alter the correspondence between specific coordinates and nucleotides. Therefore, specific conserved DNA motifs from the reference are designed as that the first nucleotide corresponds to the BKTGR typing positions (see Table 2). Looking for specific conserved DNA motifs in the reference allows searching the BKTGR typing positions regardless of fixed numerical positions, which may change depending on sequence input. Alignments are posteriorly trimmed at both ends until the first alignment column without gaps, helping to standardize inputs composed of sequences of diverse lengths and reduce memory space. The complete implementation of the algorithm can be found in Appendix A, and a graphical representation in Figure 1.  1976,1988,1995,2006,2018,2057, and 2075 (VP1 gene). Input sequences may have diverse lengths and may cover slightly different sections of the BKPyV genome. In order to generalize and automatize the typing process, BKTGR typing coordinates are first transferred to VP1-based coordinates (see Table 2). VP1 alignments can have internal gaps, which will alter the correspondence between specific coordinates and nucleotides. Therefore, specific conserved DNA motifs from the reference are designed as that the first nucleotide corresponds to the BKTGR typing positions (see Table 2). Looking for specific conserved DNA motifs in the reference allows searching the BKTGR typing positions regardless of fixed numerical positions, which may change depending on sequence input. Alignments are posteriorly trimmed at both ends until the first alignment column without gaps, helping to standardize inputs composed of sequences of diverse lengths and reduce memory space. The complete implementation of the algorithm can be found in Appendix A, and a graphical representation in Figure 1.  [13]. Blue boxes represent coordinates for specific positions from the polyomavirus BK Dunlop strain (V01108.1) reference and the nucleotide that needs to be present to stop the decision tree. If true, the subgroup corresponds to the contiguous colored box. If another nucleotide is found at the given position, the following colored box must be checked. (B), ML phylogenetic tree that corresponds to the typing of a Dunlop strain VP1 with the internal database of BKTyper. Colors represent the different subtypes (shared in panel A). Query sequence "dunlop" is highlighted in red. Table 2. BKTGR coordinates correspondence and its specific motifs. Genome coordinates are based on polyoma BK Dunlop strain (V01108.1) reference. VP1 coordinates correspond to nucleotide coordinates of the same reference using the first base of the gene as a starting position. The conserved motif corresponds to the motif used to transport coordinates from the reference to the query sequence (first nucleotide of the motif).

Genome Coordinates VP1 Coordinates
Conserved Motif 1989 426 AAAACCTAT , decision tree to genotype VP1 from Morel et al. [13]. Blue boxes represent coordinates for specific positions from the polyomavirus BK Dunlop strain (V01108.1) reference and the nucleotide that needs to be present to stop the decision tree. If true, the subgroup corresponds to the contiguous colored box. If another nucleotide is found at the given position, the following colored box must be checked. (B), ML phylogenetic tree that corresponds to the typing of a Dunlop strain VP1 with the internal database of BKTyper. Colors represent the different subtypes (shared in panel A). Query sequence "dunlop" is highlighted in red.  [25]. The alignment is posteriorly trimmed as explained earlier in the VP1 BKTGR section. This alignment, consisting of approximately 107 nucleotides, is modeled using ModelFinder [29], identifying Kimura 2-Parameter (K2P+G4) as the best performing ML model. The phylogenetic tree is built with IQTree [27] using 10,000 ultrafast bootstraps from IQTree module UFBoot2 [30] as faster and more accurate surrogate for classical bootstrapping. Tree is rooted by mid-point distance of the tree.

VP1 Genotyping Validation
BKTyper can correctly classify the 199 BK polyomavirus genomes used in Morel et al. [13], including recombinants VP1 JX195567 and JX195570, into Ia, Ib-1, Ib-2, Ic, II, III, IVa-1, IVa-2, IVb-1,2, IVc-1, and IVc-2 genotypes. A subset of those sequences is used as an internal database for the BKTyper ML tree. As can be observed in Figure 1B, the phylogenetic clustering of the different subtypes is conserved, validating the implementation of the algorithm.

NCCR Typing
Until now, NCCR typing has been mainly conducted by manual curation. This approach tends to be very tedious and prone to error. However, in order to optimize this process, two main aspects must be considered for its automatization: (i) canonization of the archetypical OPQRS blocks, and (ii) length and diversity of the blocks between isolates.

Defining a Canon for the Archetypical OPQRS Blocks
Over the years, several original works have tackled the content and diversity of the OPQRS block [5,7,31,32] Figure 2 of the manuscript, the P block seems to carry an identical copy of the Q block at the end. Finally, (iv), in Burger-Calderon et al., the R block is an exact copy of the Q block with an extra A at the end.  Figure 2 of the manuscript, the P block seems to carry an identical copy of the Q block at the end. Finally, (iv), in Burger-Calderon et al., the R block is an exact copy of the Q block with an extra A at the end.  Table 1

NCCR Typing Based on Local Alignment
Once the NCCR blocks sequence content is standardized, there are additional premises to consider for its implementation in an automatic tool: (i) blocks will generally follow an OPQRS structure. (ii) Blocks can be repeated in tandem (i.e., OPPQRS). (iii) Blocks can be non-tandem repeated (i.e., OPQPRS). (iv) Block duplication may be incomplete [i.e., P(20) vs. P(70)]. (v) NCCR blocks are not coding and are only functional in short transcriptional binding sites, therefore, high diversity between input sequences is expected. (vi) Terminal regions of a block with low similarity with the references may be excluded from the block. (vii) NCCR blocks may be missing. Subsequently, alignments must allow for zero or more hits of each NCCR fragment per query, accounting for diverse order and minimum alignment size and nucleotide identity. Minimum alignment size has been delimited to 12 exact nucleotides, meaning that at least 12 exact nucleotides are needed between the query sequence and the NCCR archive block sequences in order to progress the classification. This parameter has been manually fine-tuned in order to ensure a low false-positive rate and a high sensitivity for fragmented blocks. Minimum identity between a putative block in the query sequence and a given block from the NCCR archive sequences has been set to 75%, 5% more stringent than previous estimates [7], as an arbitrary but reasonable similarity threshold for noncoding region. Based on these considerations, NCCR typing is conducted by locally aligning the query nucleotide sequence to a custom NCCR archive (see Table 1). Blast results are filtered by a minimum e-value of 0.05. Results are ordered by start sequence position and for each alignment identifiable as a putative NCCR block, start and end alignment coordinates are stored. For each non-  Table 1) [7,33]. Editing errors from both Moens et al. and Burger-Calderón et al. are displayed in the alignment. This alignment was constructed by concatenating the different OPQRS NCCR blocks and posteriorly aligned with Mafft [25]. Diverse color underlying has been used to highlight the different blocks (blue for O, red for P, yellow for Q, green for R, and purple for S).

NCCR Typing Based on Local Alignment
Once the NCCR blocks sequence content is standardized, there are additional premises to consider for its implementation in an automatic tool: (i) blocks will generally follow an OPQRS structure. (ii) Blocks can be repeated in tandem (i.e., OPPQRS). (iii) Blocks can be non-tandem repeated (i.e., OPQPRS). (iv) Block duplication may be incomplete [i.e., P(20) vs. P(70)]. (v) NCCR blocks are not coding and are only functional in short transcriptional binding sites, therefore, high diversity between input sequences is expected. (vi) Terminal regions of a block with low similarity with the references may be excluded from the block. (vii) NCCR blocks may be missing. Subsequently, alignments must allow for zero or more hits of each NCCR fragment per query, accounting for diverse order and minimum alignment size and nucleotide identity. Minimum alignment size has been delimited to 12 exact nucleotides, meaning that at least 12 exact nucleotides are needed between the query sequence and the NCCR archive block sequences in order to progress the classification. This parameter has been manually fine-tuned in order to ensure a low false-positive rate and a high sensitivity for fragmented Viruses 2020, 12, 837 7 of 13 blocks. Minimum identity between a putative block in the query sequence and a given block from the NCCR archive sequences has been set to 75%, 5% more stringent than previous estimates [7], as an arbitrary but reasonable similarity threshold for noncoding region. Based on these considerations, NCCR typing is conducted by locally aligning the query nucleotide sequence to a custom NCCR archive (see Table 1). Blast results are filtered by a minimum e-value of 0.05. Results are ordered by start sequence position and for each alignment identifiable as a putative NCCR block, start and end alignment coordinates are stored. For each non-overlapping alignment, the longest alignment is classified as a NCCR block. The complete implementation can be found in Appendix B and a graphical representation on Figure 3.
Viruses 2020, 12, x FOR PEER REVIEW 7 of 13 overlapping alignment, the longest alignment is classified as a NCCR block. The complete implementation can be found in Appendix B and a graphical representation on Figure 3. BKTyper includes a visual inspection of NCCR block structure as well as a short list of conserved transcriptional binding sites that can be found as exact matches from Moens et al. (see Table 3) [33]. Table 3. Shortlist of transcriptional binding from Moens et al. [33] that are included in BKTyper NCCR graphical representation. The first column displays the name of the transcriptional sites, followed by its sequences, as shown in Moens et al. [33]. Validation of NCCR BKTyper has been carried out with the 3 most used reference genomes (Gardner ATCC VR-837, MM, and Dunlop strains) and the 10 available sequences reported by Carr et al. [5]. BKTyper identifies the NCCR structure for the Gardner strain (LC029411.1) as OPQPQRS, BKTyper includes a visual inspection of NCCR block structure as well as a short list of conserved transcriptional binding sites that can be found as exact matches from Moens et al. (see Table 3) [33]. Table 3. Shortlist of transcriptional binding from Moens et al. [33] that are included in BKTyper NCCR graphical representation. The first column displays the name of the transcriptional sites, followed by its sequences, as shown in Moens et al. [33].

Transcriptional Binding Sites Motif
Promoter

Figure 5.
Visual confirmation of the NCCR sequence alignment Carr et al. [5]. This alignment was constructed by concatenating the different OPQRS NCCR blocks and posteriorly aligned with Mafft [25]. Diverse color underlying has been used to highlight the different blocks (blue for O, red for P, yellow for Q, green for R, and purple for S).
BKTyper has been used to genotype all available full-length genomes for BKPyV listed as genome neighbors from NIH NCBI resource. Out of 318 genomes (accession numbers in Supplementary Information Table S1, March 2020), 20 genomes were not suited for BKTyper complete genotyping, as lack the origin of replication and part of the NCCR. Those were later resubmitted for VP1 typing alone. Complete results are available as Supplementary Information   Figure 5. Visual confirmation of the NCCR sequence alignment Carr et al. [5]. This alignment was constructed by concatenating the different OPQRS NCCR blocks and posteriorly aligned with Mafft [25]. Diverse color underlying has been used to highlight the different blocks (blue for O, red for P, yellow for Q, green for R, and purple for S). BKTyper provides a free, automated, and reproducible alternative to manual genotyping of VP1 and NCCR for BKPyV. It has shown to be sensible enough to detect inconsistencies in previous literature and perfect to summarize genotyping information in the range of hundreds of sequences in a few minutes. BKTyper will allow researchers and clinicians to obtain BKPyV typing in a fast and reliable manner, bolstering its research and clinical forecasting during routine screenings.