Rearrangement in the Hypervariable Region of JC Polyomavirus Genomes Isolated from Patient Samples and Impact on Transcription Factor-Binding Sites and Disease Outcomes

JC polyomavirus (JCPyV) is the causative agent of the fatal, incurable, neurological disease, progressive multifocal leukoencephalopathy (PML). The virus is present in most of the adult population as a persistent, asymptotic infection in the kidneys. During immunosuppression, JCPyV reactivates and invades the central nervous system. A main predictor of disease outcome is determined by mutations within the hypervariable region of the viral genome. In patients with PML, JCPyV undergoes genetic rearrangements in the noncoding control region (NCCR). The outcome of these rearrangements influences transcription factor binding to the NCCR, orchestrating viral gene transcription. This study examines 989 NCCR sequences from patient isolates deposited in GenBank to determine the frequency of mutations based on patient isolation site and disease status. The transcription factor binding sites (TFBS) were also analyzed to understand how these rearrangements could influence viral transcription. It was determined that the number of TFBS was significantly higher in PML samples compared to non-PML samples. Additionally, TFBS that could promote JCPyV infection were more prevalent in samples isolated from the cerebrospinal fluid compared to other locations. Collectively, this research describes the extent of mutations in the NCCR that alter TFBS and how they correlate with disease outcome.


Introduction
JC polyomavirus (JCPyV) is a human-specific virus that infects most of the population [1][2][3]. JCPyV was first identified upon isolation from the brain of a patient with Hodgkin's disease that had developed progressive multifocal leukoencephalopathy (PML) [4]. Since this discovery, JCPyV has been found to cause a persistent infection in the kidneys of healthy individuals, and, during immunosuppression, reactivate and spread in the central nervous system (CNS) causing the fatal, demyelinating disease, PML [1,[5][6][7][8].
There are currently no approved cures for PML, and when left untreated, the disease can be fatal within a few months [9,10]. Historically, individuals most at risk for disease were positive for HIV, representing up to 5% of all PML cases [3,9]. The use of highly active antiretroviral therapy (HAART) has significantly reduced the rate of PML among HIV individuals [11,12]. Unfortunately, new risk groups are emerging that encompass patients receiving immunomodulatory therapies for immune-mediated diseases [13]. This includes individuals with multiple sclerosis (MS) taking natalizumab and individuals receiving rituximab for treatment of systemic lupus erythematosus (SLE) [9,[13][14][15]. As there are no approved therapies for PML, current medical interventions address treatment of the "f" [13,24,43]. There are numerous JCPyV isolates, including other Mad-isolates that were derived from tissues of PML patients [44], with Mad-8, having similar blocks to Mad-1 but also having a portion of block "b" as well as insertions of single base pairs; Mad-8 is more typical of NCCR variants found in PML patients, compared to Mad-1 [13,44,45]. Additionally, beyond the consensus sequences of JCPyV isolates, deep sequencing analysis of the NCCR has demonstrated that JCPyV isolated from urine closely resembles the CY strain, while isolates from the CSF and plasma of PML patients have significantly rearranged NCCRs with a highly diverse viral population representing a quasispecies [42].
Pathogenic isolates share a 98-base pair direct tandem repeat, referred to as an enhancer element in the NCCR. These enhancer elements are composed of blocks, "a", "c", and "e" and therefore, contain duplicate TATA boxes, located in the "a" block, and have additional transcription factor binding sites (TFBS) [37,46,47]. Due to the prevalence of these tandem repeats in the NCCR sequences of viral isolates from PML patients, as well as the archetype rarely associated with PML tissue [13,24,47], it is suggestive that these enhancer elements and the addition of TFBS are critical for viral pathogenesis. Furthermore, the loss of the 23-base pair "b" block and the 66-base pair "d" block can also result in increased viral gene expression. These deletions of both regions in the NCCR allow for additional TFs such as YB-1/Purα and NF-1, to facilitate enhanced viral gene expression [38,48]. TFs, such as Spi-B transcription factor (Spi-B), nuclear factor of activated T cells 4 (NFAT4) and subtypes of the nuclear factor 1 (NF-1) family are also important in early gene transcription and may also play a role in cellular tropism [49][50][51][52]. Specifically, JCPyV can infect B cells, and as these cells mature, TFs such as NFI-X and Spi-B have been shown to be upregulated [51,53]. These changes can enhance viral transcription and most notably in individuals who are receiving natalizumab treatment, induce B cell differentiation, possibly inducing the development of PML [54].
Associations between the genetic mutations in the NCCR and PML have been described previously, and there is comprehensive evidence to highlight TFs that are important for JCPyV infection. However, metanalyses to observe the changes in the NCCR in the archetype and PML-type strains as well as tissue location of sequence isolates are very limited. A recent study by Nakamichi et al. described the curation of a database with TFBSs in NCCR sequences of archetype and PML-type JCPyV isolates identified through computer simulations [55]. Our current study provides an extensive bioinformatic approach to validate and uncover TFBS that are influenced by NCCR rearrangements which ultimately enhance viral transcription and cause disease. Using published sequences isolated from deidentified patient samples, this study characterizes the NCCRs of 989 nucleotide sequences, defining them by both their location of isolation and disease status. Most importantly, by using the largest open-access database of curated and non-redundant TFBS, known as JASPAR [56], this study validated and elucidated possible novel TFBS that influence JCPyV infection in each of the six blocks that arise from rearrangements in the NCCR. These data will provide additional rationale and framework to understand how this hypervariable region of the JCPyV genome can persist in almost 80% of the population, however mutations in the NCCR can enhance viral infection, expand tissue tropism, and ultimately, cause the fatal disease PML.

The Curation of 989 JCPyV NCCR Sequences from Patient Samples
A total of 989 unique NCCR sequences were identified from 5507 JCPyV sequences downloaded from GenBank on 6 December 2021. Of the 989, 579 were GenBank records containing only the NCCR region. In total, 410 additional NCCR sequences were extracted from GenBank records that contained JCPyV whole genome assemblies. The NCCR was taken as the interval from the origin of replication to the coding sequence of the agnoprotein. Tissue source and disease status was determined from the GenBank record (Table 1 and  Supplemental Table S1) [12,23]. There were 565 nonurine sequences, 217 from the cerebral spinal fluid (CSF), 111 from the plasma, serum, or peripheral blood mononuclear cells A cladogram of the 989 NCCR sequences was performed to cluster the sequences based on similarity. The sequences were aligned with Clustal Omega [58] and the neighbor joining algorithm was used to generate a cladogram that was then labeled by sample site and disease status ( Figure 1). The analysis revealed that most of the sequences (~70%) isolated from the urine clustered among each other compared to other samples, regardless of disease status ( Figure 1). Additionally, sequences were clustered together based on tissue location, whereas NCCR sequences isolated from the brain and CSF were the most dissimilar ( Figure 1). Samples isolated from the blood (~11%, defined in Table 1) clustered to one another and to sequences isolated from the urine, especially from patients diagnosed with HIV or autoimmune diseases such as MS, SLE, or RA, relative to other isolation sites ( Figure 1). The clustering among urine samples suggests that the NCCR does not vary greatly in kidneys of healthy individuals. The distant clustering observed in the CSF, brain, and plasma samples observed, reflects genetic rearrangements in the NCCR of diseased patients. Overall, this analysis revealed patterns of sequence similarity among NCCR sequences isolated from various tissues and fluids among PML patients and healthy individuals and the clustering of sequences isolated from the urine and other isolation sites.

The Heterogenicity of NCCR Blocks across Sample Isolation Sites
Previously published data were used to determine sequence motifs for each block ("a"-"f") [40]. Initially, the FASTA/Q toolkit, SeqKit, [59] was used to identify the blocks in 181 NCCR sequences from GenBank. Aligned block sequences were manually inspected using Jalview (Version 2.11.0). Then, MEME [60] motifs were generated for the blocks (see Methods) and FIMO [61] was used to map the blocks in all 989 sequences. Next, a Perl script was used to parse the FIMO output and annotate the blocks in the NCCR sequences. A subset of 100 NCCR sequences derived from CSF samples were randomly selected for manual inspection to validate the blocks annotations (see Methods). The total length of the portion of the 989 NCCR sequences covered by blocks had a mean length of 258 bp with a minimum length of 99 bp and maximum length of 523 bp.
To visualize differences in rearrangements of blocks within the 989 NCCR sequences, the occurrence of individual blocks ("a"-"f") were analyzed to generate NCCR block codes ( Table 2). We define NCCR block codes as the arrangement of the individual blocks. Nearly 60% (592) of the 989 NCCR sequences resembled the archetype strain with a block code of ABCDEF. The next most frequently observed NCCR block code, ABCECEF, had a duplication of blocks "c" and "e". The third most frequent NCCR block code, ABCDCDEF, had a duplication of blocks "c" and "d". Interestingly, 42% (82/195) of the NCCR sequences from CSF samples from PML patients lacked a block "d". The archetype NCCR block code was found in sequences from multiple tissues from PML patients. This suggests that the majority of JCPyV genomes may still represent the non-disease strains, and in actuality, a smaller proportion of JCPyV quasispecies, representing a population of duplicated enhancer elements drive replication, and ultimately, lead to disease.     # = number of sequences analyzed.

Sequences Isolated from Urine Samples Had a Higher Frequency of TFBS That Repress JCPyV Infection, While TFBS That Facilitate JCPyV Infection Were More Frequent in Sequences from Other Tissues
To understand how NCCR variation among isolated sequences may influence JCPyV infection, the frequency of known TFBS involved in JCPyV replication from the literature review were determined in each block of every sequence ( Figure 2). Sequences from unknown tissue sources and more than one tissue source were excluded in this analysis (n = 63). The R package, 'TFBStools' (Version 1.26.0) was used along with the 2020 JASPAR database to find the TFBS motifs of each NCCR sequence per block. Sequences isolated from the CSF had the highest number of known TFBS that influence JCPyV infection (Figure 2D). Blocks "c" and "d" had the highest variation in TFBS between sample source, specifically, CCAAT Enhancer Binding Protein β (CEBPβ) and Nuclear Factor I X (NFIX),  Figure 2. Transcription factors binding sites that are known to activate JCPyV infection are more prevalent in the NCCR of viral isolates from the brain, plasma, and especially, the CSF in patients that have PML or AIDS. The R package, 'TFBStools' was used to determine the TFBS for each block, using the 2020 JASPAR database. Known TFBS that have are associated with JCPyV transcription and replication were illustrated as a balloon plot using 'ggplot2'. TFBS are faceted by the 6 blocks (A-F), disease status, and by tissue source: urine (A), blood (B), brain (C), and CSF (D). The size of the shape represents the normalized frequency of the TFBS (i.e., the number of times the TFBS is present) [(# of TFBS/Block)/(# of sequences)] in each of the 6 groups/locations and the color represents the activity that correlates with JCPyV infection. Sequences isolated from the blood represent serum, plasma, and PBMC sequences. Unadjusted p values were determined by Fisher's exact test to compare the normalized frequency of TFBS (size of shapes) from the Urine (Healthy) to all other groups.

Highly Variable Number of TFBS in NCCR Viral
Isolates from the Brain, Plasma, and CSF, Specifically in Blocks "c" and "d" To determine how rearrangements in the NCCR alter TFBS, especially based on site of viral isolation and disease status, the total number of TFBS were tabulated by disease status and tissue source (Figure 3). A stacked bar graph illustrates the overall number of TFBS in each block of the JCPyV genome. As previously illustrated in Figure 2, the greatest variation in the number of TFBS by location was observed in block "c" and block "d", specifically in sequences isolated from the CSF ( Figure 2D). The frequency of TFBS in block "c" were higher in diseased patients, specifically comparing the number of TFBS from the urine of healthy individuals to the brain and urine of PML patients (adjusted p value < 0.01) ( Figure 3). Overall, the frequency of TFBS in each block isolated from the urine were less variable and equally distributed compared to nonurine sequences ( Figure 3). This is most likely attributed to the duplications of the 'c' block and deletions of the 'd' block (Table 2). Taken together, these data suggest that the number of TFBS is influenced by the genetic rearrangements in diseased patients, specifically attributed to the duplications and deletions of specific blocks in the NCCR.
using the 2020 JASPAR database. Known TFBS that have are associated with JCPyV transcription and replication were illustrated as a balloon plot using 'ggplot2'. TFBS are faceted by the 6 blocks (A-F), disease status, and by tissue source: urine (A), blood (B), brain (C), and CSF (D). The size of the shape represents the normalized frequency of the TFBS (i.e., the number of times the TFBS is present) [(# of TFBS/Block)/(# of sequences)] in each of the 6 groups/locations and the color represents the activity that correlates with JCPyV infection. Sequences isolated from the blood represent serum, plasma, and PBMC sequences. Unadjusted p values were determined by Fisher's exact test to compare the normalized frequency of TFBS (size of shapes) from the Urine (Healthy) to all other groups.

Highly Variable Number of TFBS in NCCR Viral Isolates from the Brain, Plasma, and CSF, Specifically in Blocks "c" and "d"
To determine how rearrangements in the NCCR alter TFBS, especially based on site of viral isolation and disease status, the total number of TFBS were tabulated by disease status and tissue source (Figure 3). A stacked bar graph illustrates the overall number of TFBS in each block of the JCPyV genome. As previously illustrated in Figure 2, the greatest variation in the number of TFBS by location was observed in block "c" and block "d," specifically in sequences isolated from the CSF ( Figure 2D). The frequency of TFBS in block "c" were higher in diseased patients, specifically comparing the number of TFBS from the urine of healthy individuals to the brain and urine of PML patients (adjusted p value < 0.01) ( Figure 3). Overall, the frequency of TFBS in each block isolated from the urine were less variable and equally distributed compared to nonurine sequences ( Figure  3). This is most likely attributed to the duplications of the 'c' block and deletions of the 'd' block (Table 2). Taken together, these data suggest that the number of TFBS is influenced by the genetic rearrangements in diseased patients, specifically attributed to the duplications and deletions of specific blocks in the NCCR. To understand additional TFBS that are altered in the NCCR sequences of nonurine locations and diseased patients,~60 TFBS/tissue source representing the top 10% of TFBS were determined ( Figure 4). All groups are represented, however not all blocks are represented, as the top 10% of the full sequence was considered. In all, additional TFBS occurred more frequently in PML patients, predominantly occurring in the "c" block ( Figure 4). Numerous homeobox protein TFBS were found including HOXA4, MEIS1, and HOXB3, as well as forkhead protein TFBS, such as FOXL1 and FOXP3 (Figure 4). These TFBS, FOXL1 and FOXP3 occurred less frequently in the "d" block in nonurine sequences, specifically within the CSF (p = 0.048 and p = 0.0062, respectively, data not shown) and more frequently in the "e" block of all nonurine sequences (p < 0.05, data not shown). Another TFBS that was found more frequently in PML samples was Nuclear Respiratory Factor 1 (NRF1), a protein-coding gene with DNA-binding transcription activator activity [62,63]. Specifically, the NRF1 binding site was significantly higher in the "c" block of sequences isolated from the CSF compared to sequences isolated from the urine (p = 0.0062, data not shown). Most importantly, this analysis also revealed a TFBS for oligodendrocyte transcription factor 3 (OLIG3). This TFBS, which occurred more frequently in the "c" block, may contribute to cellular tropism as it was only observed in sequences isolated from the CSF and brain ( Figure 4). However, OLIG3 was only significantly more prevalent in sequences isolated from the blood compared to sequences isolated from the urine in block "c" (p = 0.039, data not shown). There were not significant differences when comparing OLIG3 binding sites in the urine to the CSF and brain but it is important to note that another OLIG gene member, OLIG2 was also revealed in a similar study [55].
The TFBS that occurred less frequently in PML samples were found in block "d". Specifically, the MEIS1 TFBS had the largest decrease in the number of sites in block "d" in nonurine strains compared to NCCR sequences of urine strains, especially when comparing this TFBS in the urine to the CSF and brain (p = 0.0079, data not shown). This may suggest that MEIS1 is important in the development of PML as it was found more frequently in the enhancer element of the JCPyV NCCR. Collectively, this analysis reveals combinations of TFBS that may influence the development of PML, expand the cellular tropism of the virus, and validates TFBS that are known to influence JCPyV infection.

Discussion
The NCCR is a hypervariable region within the JCPyV genome that can enhance JCPyV infection, and rearrangement of the NCCR is associated with the fatal disease, PML [13]. However, our understanding of how the NCCR is implicated in disease is limited, due in part to low sample sizes [64], and few studies demonstrating the changes in TFBS from sequences isolated from the urine to nonurine locations, and asymptomatic to symptomatic individuals. This study addresses these gaps in knowledge through a metanalysis, compiling 989 NCCR sequences from GenBank to examine how the arrangement of blocks ("a"-"f") and TFBS found within the blocks were altered across various locations based on individual disease status. Classification of the NCCR sequences using block codes provided a useful representation of the modular duplication, deletion and rearrangement of blocks ("a"-"f"). Our results demonstrate the variability in the NCCR of PML patients and illustrate the differences in the archetype versus PML-type strains of JCPyV. Furthermore, this analysis validates NCCR TFBS previously associated with JCPyV infection and PML while also elucidating TFBS that have not been previously associated with infection and disease.
Sequence analysis revealed that most NCCR sequences isolated from the urine clustered together, however some sequences from diseased patients had nucleotide similarity to sequences isolated from non-urine sources from diseased patients (Figure 1). This was especially true from patients with autoimmune diseases and patients positive for HIV, suggesting that PML-like strains could also be found in the kidney and not restricted to the CNS (Table 2). These same conclusions were also deduced from Pfister et al. [57]. This was contradictory to earlier findings in that significant rearrangement did not occur in the kidney [65,66]. These previous studies could have indicated selection bias and further evidence indicated that mutations, specifically the duplication of the 98 base pair tandem repeat, can occur in the kidneys of PML patients with immunodeficiency disorders [64] ( Table 2). These data suggest that even though reactivation of JCPyV is poorly understood, immunosuppressed individuals may not have adequate immune surveillance, prompting enhanced JCPyV replication driven by the probability of promoter rearrangements within the NCCR [13]. However, Figure 1 and Table 2 also illustrate nucleotide similarity and block rearrangement in sequences isolated from the plasma of PML patients comparable to that of sequences isolated from the urine. Lymphocytes can act as reservoirs for JCPyV, and it has been hypothesized that rearrangement of the NCCR can occur here, as B cells have the machinery to facilitate genome rearrangement [24]. Sequences isolated from the plasma illustrated archetype-like structure of the NCCR (Table 2). Taken together, these data demonstrate the likelihood of multiple mechanisms of JCPyV reactivation when comparing sequence identity and block structure of the NCCR in sequences isolated from the urine of healthy individuals versus PML patients.
There have been numerous TFs reported to influence JCPyV infection [24]. However, the impact on the TFBS from rearrangements in the NCCR, in the context of isolation location and disease status, has not been well studied. NFIX is a well characterized TF that activates JCPyV transcription [50,67]. This analysis validates the importance of this TFBS, occurring more frequently in PML patients, specifically, in sequences isolated from the brain and CSF in block "c" of the NCCR (Figure 2B,D). Additionally, this supports the hypothesis that rearrangements in the NCCR increase tissue tropism as NFIX is highly expressed in the CNS, and the addition of these TFBS most likely aids in viral transcription, enhancing replication [50]. Interestingly though, the NFIX TFBS occurred more frequently in sequences isolated from the urine and plasma of PML patients compared to healthy individuals, which demonstrates the importance of NFIX enhancing JCPyV transcription, regardless of location, which may lead to PML.
NCCR mutations are complex and most likely the result of homologous recombination that leads to large deletions and tandem duplications in the NCCR [13] ( Table 2). The influence of these rearrangements is also observed in TFBS that are known to influence JCPyV infection (Figure 2). In many sequences isolated from individuals with PML, there are deletions of block "d" (Table 2), which is important in the success of viral transcription as there are TFBS for CEBPβ, a repressor of viral transcription (Figure 2). Deletion of portions or even the full "d" block leads to less TFBS for CEBPβ, which enhances viral transcription [68]. Interestingly however, there were TFBS for NFIX located on the "d" block that occurred more frequently compared to sequences isolated from the brain, plasma, and CSF ( Figure 2B-D). This highlights the complexity of not only NCCR rearrangement but also the regulation of TFs in the replication of JCPyV. Romagnoli et al. demonstrated the important inhibitory role of CEBPβ during JCPyV infection of human glioblastoma and transformed human glial cells but also revealed the activation of JCPyV transcription by NF-κB. They suggested that there was a unique interplay between both TFs that control the balance of JCPyV latency and reactivation during immunosuppression [68]. Furthermore, research has also demonstrated that TFs that repress viral transcription, specifically AP-1 TFs, block NF-1 binding sites, because of the proximity of these TFBS [69]. This study isolated human CNS progenitor cells and differentiated them into an astrocytic cell line to determine that AP-1 TFs, like c-Jun, bind to AP-1 binding sites thus blocking TFs that activate transcription as they can no longer bind to NF-1 sites [69]. This analysis revealed that even though there was an increase in NFIX sites in block "d" there was also an increase in TFBS that inhibit JCPyV transcription, like CEPBβ and JUN, which result in lower transcriptional levels that are observed in healthy individuals. These studies examining the importance of TFs in JCPyV infection mainly used in vitro techniques and were across various cell types, and thus it should be noted that cell-type differences can be observed. Overall, examining the known TFBS that influence JCPyV transcription of 989 sequences from healthy and diseased patients reinforced the intricate balance between viral transcription, expanding cellular tropism, and possibly leading to disease caused by enhanced JCPyV replication.
JASPAR is one of the largest databases of known TFBS, and this study examined known TFBS that influence JCPyV infection. In addition to this validation, we further examined other TFBS that have not been previously characterized in JCPyV transcription and cellular tropism, identifying novel TFBS that may correlate with the disease PML ( Figure 4). Differences in the number of TFBS were most notable in blocks "c", "d", and "f" ( Table 2 and Figure 3). This was most likely due to the extent of rearrangements, deletions, and duplications that these blocks are subject to within the NCCR (Table 2). Previous research has also demonstrated significant mutations in the "f" block that resulted in an increase in JCPyV late gene expression [70]. The authors hypothesized this was mostly due to the loss of the TFBS for NFIA and AP1 [70]. Our data support their hypothesis as the TFBS for NFIA was not present in the "f" block, and CEPBβ occurred less frequently, however this did not occur in all disease states ( Figure 2B-D). TFBS that occurred more frequently in PML patients, including sequences isolated from the urine, were TFBS associated with HOX gene families (Figure 4). Although HOX genes are important for their role in embryonic development, they have also been implicated in angiogenesis and tumor metastasis [71]. Recently, these genes have also been shown to be involved in hepatitis C virus (HCV) and human cytomegalovirus (HCMV) infection [72,73]. During HCMV infection, numerous HOX genes, including MEIS1, are differentially expressed [72]. A TFBS for MEIS1 was identified in the NCCR of JCPyV across all four isolation sites, yet was most prevalent in the "c" block of nonurine isolates (Figure 4). During HCMV infection, the authors postulated that as TFs, HOX genes regulate downstream gene expression involved in angiogenesis and cellular DNA repair, promoting cell proliferation, inhibiting apoptosis, and possibly being involved in vascular disease pathogenesis [72]. Additionally, these genes are associated with cell cycle-related genes [74]. JCPyV promotes cellular proliferation and inhibits cellular apoptosis through T Ag [26,27] and can influence cyclin expression during the infectious cycle [75]. Perhaps, as viral replication is enhanced by NCCR rearrangements, it may promote further induction of HOX-related genes, which further promote viral transcription; however, molecular studies would need to be performed to validate this hypothesis.
Other TFBS that varied between sequences were binding sites that were associated with FOX genes (Figure 4). The FOX TF family are critical in regulating numerous biological processes, including metabolism, differentiation, proliferation, and apoptosis [76]. Numerous FOX proteins have been demonstrated to be involved in viral infection as well [77]. For example, FOXP3 is upregulated upon human papillomavirus (HPV) infection and may accelerate the cancerous transformation of cervical epithelial cells [77][78][79]. JCPyV can activate cellular pathways, including the mitogen-activated protein kinase, extracellular signal-regulated kinase (MAPK/ERK) pathway [80][81][82][83] and the phosphoinositide 3-kinase/AKT signaling pathway [84,85] that can influence the expression of the FOX TF family, which in turn, may benefit viral transcription, as there are more TFBS in block "c" of the NCCR, isolated from PML patients (Figure 4). Overall, this analysis discovered novel TFBS, specifically related to HOX and FOX TF families that are possibly important in JCPyV replication and disease outcomes. Future studies should validate these findings using molecular and infectivity assays to determine if these TFBS require the binding of TFs to activate JCPyV transcription.
In addition to these findings specific for JCPyV, this study contributes to our understanding of polyomavirus NCCR rearrangements and the role of TFBS in polyomavirusinduced disease. The tripartite genome organization is shared among the 12 identified human polyomaviruses [86] and interestingly, rearrangements in the NCCR correlates with pathogenic disease progression in other polyomaviruses [87]. Notably, BK polyomavirus archetype strain (WW), which infects the kidney and is shed in the urine, is characterized by five sequence blocks: O, P, Q, R, and S [88,89]. The BKPyV NCCR is hypervariable, and detection of the rearranged (rr) strains (i.e., Dunlop), correlates with BKPyV diseases including polyomavirus-associated nephropathy [90][91][92][93]. Further, the BKPyV NCCR contains binding sites for TFs: AP-1, NF-1, NF-kB, Sp1, and Spi-B [89,[94][95][96], which are also found in the JCPyV NCCR. Additionally, Sp1 and Spi-B sites are present in all known human polyomaviruses [97]. Overall, this study complements our current knowledge of the similarities among PyV NCCRs and TFBS, which will lead to a better understanding of genetic regulation in polyomavirus reactivation and disease progression.

Determining and Visualizing the Blocks of Each NCCR Viral Isolate
Initially, only 181 sequences were used to obtain the blocks ("a"-"f") within the NCCR, using already published sequences for each [40]. First, the sequences were separated by isolate site and aligned using Clustal Omega. This step resulted in more accurate block sequences and faster quality control of each NCCR. The blocks were extracted using the command line and the toolkit, SeqKit, an ultrafast FASTA/Q Go programming language [59]. Specifically, the function "locate" was used in the seqkit executable file with varying max mismatch values when matching the known sequence for each block with the 181 sequences, based on the length of nucleotides for the individual blocks ( Table 3). The resulting maximum mismatch values provided in Table 3 are the most accurate representation of the NCCR blocks determined by manual inspection. Additionally, only the "+" strand was used. Manual inspection was performed using the program Jalview (Version 2.11.0), an open-source program developed for the editing, analyzing, and visualizing of multiple sequence alignments. This first workflow was later used to validate the Perl script described below to capture more NCCR sequences from GenBank, both, entire NCCR sequences and NCCR sequences extracted from the full JCPyV genome. Table 3. Block sequences and criteria used to locate them in each sequence.

Block Letter Nucleotide Sequence Maximum Mismatch Value
Block "a" CCTGTATATATAAAAAAAAGGGAAGG 9 Block "b" AGGGAGGAGCTGGCTAAAACTG 8 Block "c" GATGGCTGCCAGCCAAGCATGAGCTCAT ACCTAGGGAGCCAACCAGCTGACAGCC 27 Block "d" AGAGGGAGCCCTGGCTGCATGCCACTGGCAG TTATAGTGAAACCCCTCCCATAGTCCTTAATCACA 31 Block "e" AGTAAACAAAGCACAAGG 1 Block "f" GGAAGTGGAAAGCAGCCAAGGGAACATGTTTT GCGAGCCAGAGCTGTTTTGGCTTGTCACCAGCTGGCCAGT 31 Data from the initial smaller subset of NCCR sequences were visualized using the package, ggplot2 (Version 3.3.3) in R and the function geom_pointrange was used to illustrate the position of each block for all the NCCRs faceted by location and disease status. Additionally, the function, geom_histogram was used to determine the density distribution of each block, by length, also faceted by location and disease status (data not shown).

Comprehensive Search for JCPyV NCCR Sequences
A total of 5507 JCPyV nucleotide sequences were downloaded from the NCBI on 6 December 2021, by searching for all sequences with NCBI Taxonomy ID: 10632. All but one of these sequences were from GenBank. The additional sequence was the complete JCPyV genome sequence for the Mad-1 strain from the Reference Sequence database. Using a custom Perl script, 1049 sequences were identified as NCCR sequences that were less than 700 bp in length and had "control region" or "non-coding regulatory region" listed in the GenBank-formatted sequence records. Of the 1049 sequences, 578 were from one study where multiple sequences were generated from the same tissue source for 17 individuals [34]. The 578 sequences were collapsed to 108 unique sequences for the patient-tissue combinations. Thus, a total of 579 NCCR sequences from the 1049 sequences were included in our analysis. A 554 additional NCCR sequences were extracted from complete JCPyV genome sequence records from set of 5507 sequences using a custom Perl script. All of these genome-derived NCCR sequences began at the ORI site and continued up to the agnoprotein coding sequence. We also required that the ORI site had to be located at position 1 of the assemblies. Of the 554 JCPyV genomes sequences analyzed, 152 of them were from the same study described above [34] and were collapsed to 8 unique sequences based on patient-tissue combinations. Therefore, a total of 410 genome-derived NCCR sequences were included in our analysis. Combining the 579 NCCR sequences together with the 410 genome-derived NCCR sequences gave a total of 989 NCCR sequences that we analyzed.
Sample source information for the 989 NCCR sequences were identified using two steps. First, a Perl script was used to parse the GenBank-formatted sequence records to find source and disease status information in the "note" and "isolation_source" fields or elsewhere in the record and report any PubMed identifiers (IDs). Second, the automatically-parsed information was manually curated to assign a tissue and disease (primary and secondary) status for each sequence. For sequences with PubMed IDs, the publications were reviewed to assign tissue and disease information. For some sequences, publications were not reported in the GenBank-formatted sequence records, but we were able to find the missing publications based on the authors and description in the entries. The curated information for the 989 sequences are in Supplementary Table S1.

Analysis of Blocks among 989 NCCR Sequences
We analyzed the 989 NCCR sequences to find the block A-F motifs using the MEME suite version 5.4.1 [60]. First, MEME motifs were created for blocks A-F using sets of NCCR sequences that manually annotated from CSF samples. A total of 106 block "a", 156 block "b", 220 block "c", 100 block "d", 175 block "e" and 146 block "f" sequences were used to create one block "a" motif, three block "b" motifs, three block "c" motifs, five block "d" motifs, three block "e" motifs and three block "f" motifs. Locations of these motifs in the NCCR sequences were then mapped using FIMO version 5.4.1 [61]. Custom Perl scripts were then used to filter the output to find matches with q-values < 0.05 and report the matching sequences for the block motifs found in each of the NCCR sequences. A block code was reported as the order of block motifs (e.g., ABCDEF) found in each NCCR sequence.

Validating the Perl Script to Determine the Accuracy and Precision of Capturing the NCCR and the Individual Blocks
To validate the Perl script that was previously mentioned, 100 NCCR sequences isolated from the CSF in the initial analysis were compared with the same sequences captured from the Perl script. Sequences isolated from the CSF were used to validate the script as these sequences had the greatest variation in the NCCR and would serve as a meticulous reference point. The 100 sequences were compared at the block level and at the individual base pairs, the results are summarized in Table 4. Overall, the code was~90% accurate in representing 85% of the correct block code, determined by the initial analysis ( Table 4). The most frequent error occurred at the nucleotide level, yet the sequences were missing an average of 8 base pairs which only constituted of~3% of the overall NCCR sequence (Table 4).

Cladogram
The sequences were aligned using Clustal Omega in the command line, using the specific packages and programs, the European Molecular Bioinformatic Open Software Suite (EMBOSS) and ClustalW. Specifically, Clustal 2.1 Multiple Sequence Alignments was used to perform a multiple alignment using the slow and accurate method [58]. Briefly, all sequences were pairwise aligned and a dendrogram was constructed, describing the approximate groupings of the sequences by similarity [58]. Finally, the multiple alignment was carried out, using the dendrogram as a guide. To generate a cladogram, the alignment output was used to calculate the distances between all pairs of sequences. Then the neighbor joining method was used to create the distance matrix [58,98,99]. The .ph file generated from this output was imported into R using the function "read.tree" from the package "ape" (Version 5.4-1). The "phylo4d" function from the package, "phylobase" (Version 0.8.10) was used to merge the cladogram with sequence information, such as where the sequence was isolated from and disease status. Lastly, to visualize the circular cladogram ( Figure 1) the function "ggtree" was used from the package, "ggtree" (Version 2.2.4).

Mapping TFBS for Each NCCR Sequence Using the JASPAR Database
To curate the dataset of TFBS for the 1416 blocks, the R packages "TFBStools" (Version 1.26.0) and "Biostrings" (Version 2.56.0) were used to upload the sequences into the RStudio environment with the function "readDNAStringSet". The matrix for each human TFBS (species ID: 9606) from the 2020 JASPAR database was uploaded into R with the function "getMatrixSet". TFBS that mapped to the "+" strand of all 989 sequences were predicted using a minimum score of 70% based on the same approach that was used to extract the individual blocks. A minimum score of 70% was determined by analyzing output data with varying scores that resulted in TFBS, such as SPIB and SP1 that are known to influence JCPyV infection, being included in the dataset. To compare the frequency of TFBS with a varying number of sequences from each location, the normalized frequency was determined (Equation (1)).

Statistical Analysis of TFBS
The normalized frequency of TFBS between groups was statistically compared using Fisher's exact test, adjusting the p values using the Benjamini-Hochberg method [100]. For groups with more than 10 sites, statistical significance was determined using the chi-square test, adjuring the p values using the Bonferroni method. Lastly, to determine the distribution of the normalized frequency of TFBS in the various tissue sources, the pairwise Wilcoxon rank sum test, along with the Bonferroni adjusted was used to determine the pair of groups that were statistically significant. All analyses were conducted using R version 4.0.5.
R scripts are available upon request. Datasets are provided as supplemental tables, which includes the information for the 989 sequences (Supplemental Table S1) and the frequency of TFBS (Supplemental Table S2).

Conclusions
The NCCR of the JCPyV genome is a hypervariable region that is susceptible to complex mutations and rearrangements that are associated with the fatal disease PML. This is one of the first metanalyses that characterize JCPyV NCCR sequences, confirm previously reported TFBS, and identified potential novel TFBS. By curating a dataset of 989 sequences of the JCPyV NCCR, as well as characterizing the block orientation of each sequence, we have illustrated the complexity of this region based on site of viral isolation and disease status of the individual ( Figure 5). Additionally, we have validated TFBS that are important in activating or repressing JCPyV transcription by location, disease status, and by each block of the NCCR (Figure 3). Given the effectiveness of this approach, we have discovered possible novel TFBS that correlate with JCPyV infection based on location and disease status (Figure 4). This research contributes to a similar study in which computer simulations were utilized to identify TFBSs in the NCCR [55], leading to new resources for the field and possibly clinicians. Future studies should revisit the outcomes of these TFBS in JCPyV transcription and replication through molecular analyses. To conclude, this research validates and establishes the importance of the NCCR of the JCPyV genome, an understudied area, and in turn, further uncovers the mechanisms of reactivation and PML outcome, which are also poorly understood. an understudied area, and in turn, further uncovers the mechanisms of reactivation and PML outcome, which are also poorly understood. Figure 5. Rearrangements of the NCCR and changes in TFBS. The archetype strain is predominantly detected in the urine of healthy patients (blue/green virus) with numerous TFBS (blue) that inhibit replication. Rearrangements in diseased patients (red virus) causes an increase in TFBS (green) that enhance viral replication of PML-type strains. TFBS, like SPIB (gray) or possibly OLIG3 (yellow), enhance cellular tropism from the rearrangements in the NCCR. Additionally, TFBS, including MEIS1, FOX family transcription factors (FOX), and HOX family transcription factors (HOX) (yellow) can also be observed from NCCR rearrangements in disease patients compared to TFBS in the NCCR from healthy individuals. Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are publicly available and scripts are available on request from the corresponding authors.