Identifying New Clusterons: Application of TBEV Analyzer 3.0

Early knowledge about novel emerging viruses and rapid determination of their characteristics are crucial for public health. In this context, the development of theoretical approaches to model viral evolution is important. The clusteron approach is a recent bioinformatics tool which analyzes genetic patterns of a specific E protein fragment and provides a hierarchical network structure of the viral population at three levels: subtype, lineage, and clusteron. A clusteron is a group of strains with identical amino acid (E protein fragment) signatures; members are phylogenetically closely related and feature a particular territorial distribution. This paper announces TBEV Analyzer 3.0, an analytical platform for rapidly characterizing tick-borne encephalitis virus (TBEV) strains based on the clusteron approach, workflow optimizations, and simplified parameter settings. Compared with earlier versions of TBEV Analyzer, we provide theoretical and practical enhancements to the platform. On the theoretical side, the model of the clusteron structure, which is the core of platform analysis, has been updated by analyzing all suitable TBEV strains available in GenBank; the practical enhancements aim at improving the platform’s functionality. Here, in addition to expanding the strain sets of prior clusterons, we introduce eleven novel clusterons through our experimental results, predominantly of the European subtype. The obtained results suggest effective application of the proposed platform as an analytical and exploratory tool in TBEV surveillance.


Introduction
Tick-borne encephalitis virus (TBEV) is the causative agent of tick-borne encephalitis (TBE), which can lead to serious neurological outcomes, including fatal ones. TBEV is an arthropod-borne virus that belongs to the Flaviviridae family, genus Flavivirus. Since 2012, TBE has been one of the most notable human diseases in the European Union [1]. Annually, more than 500,000 tick bite incidents are registered in Russia, and 1500-2000 cases of TBE are reported [2]. The geographical range of TBEV forms a belt starting in East Asia and ending in central Europe. Recent animal surveillance activities indicate that the virus is spreading in northern Africa, including northwestern Tunisia [3], as well as in the east/south of England [4,5]. Several of these locations, however, have not yet registered human infections.
was to address this issue and to maintain reproducibility of the approach. In this way, researchers can focus on the final results without distraction by computational details.
Next, after presenting the platform's first version, we received requests for visualization of clusteron distributions on a map. Hence, we introduced TBEV Analyzer version 2.0 in 2020 [23]. In addition to expanding previous features, the second version gained geographical mapping, customization of the alignment table to study genetic variability, and integration with GenBank for fetching the query strain. Furthermore, the user interface underwent significant changes, and many technical issues were fixed. Generally speaking, the previous versions of the platform were devoted to implementation and customization of the CA, respectively.
In this paper, we take a step forward and announce TBEV Analyzer 3.0. The main goal of the third version is to update the phylogenetic model, referred to as the clusteron structure, that underlies the platform analysis. To this end, we implemented theoretical and practical improvements to the platform. Our contributions to the latest version are as follows:

• Theoretical improvement:
- Performing whole-GenBank analysis and introducing new clusterons, followed by updating the model of the clusteron structure obtained by the clusteron approach.
• Practical improvements leading to enhancement of platform functionality:
- Automatic monitoring of GenBank for emerging novel strains.
- Identification of the query's amino acid signature and its visualization on the E protein surface.
- Provision of high-quality visualization of clusteron spatial distributions and visualization of a query on a geographical map via its latitude and longitude.
- Interactive visualization of the clusteron structure.
- Equipping the platform with an Application Programming Interface (API).

The remaining parts of the paper are organized as follows: Section 2 presents more details about extra features included in the latest version of the platform; further, we discuss the results obtained from analyzing the GenBank database and report new clusterons in Sections 3 and 4. Finally, our conclusions are provided in Section 5.

Materials and Methods
As pointed out earlier, the clusteron approach is the core of our platform. The CA currently relies on performing phylogenetic analysis by constructing two phylogenetic networks [23] and having their results merged by a specialist to create a unified network called a clusteron structure (CS). At present, this procedure is carried out manually and requires several verification steps to obtain the final network of the CS. The CA can mainly be divided into two procedures:

• Constructing the CS via phylogenetic network analysis.
• Application of the CS to identify the hierarchical three-fold phylogenetic characteristics of a query strain.

The former procedure can be considered as the construction of an evolutionary model, whereas the latter is the application of the obtained model. Our project's primary goal is to facilitate application of the CS by automatically locating the query strain in the CS graph and inferring its characteristics. Developing a unified computational pipeline for CS construction from phylogenetic networks is being considered in our future plans.
It is a known fact that phylogenetic analysis of TBEV strains at the nucleotide level may yield different results compared to amino acid level analysis. Hence, the CA incorporates the complementary roles of genetic information at the nucleotide and amino acid levels by combining their results. According to our computational pipeline, assigning a clusteron to a query requires verification at the nucleotide and amino acid levels. The overall schema of our pipeline was described previously [19,23], and is illustrated (with adjustment) in Figure 1.

Figure 1. The overall schema of the computational pipeline for clusteron structure application. The pipeline generally includes alignment, identifying a target coding sequence of an E protein fragment, constructing the phylogenetic tree, and inferring/verifying the assigned clusteron to the query. Note that assigning a clusteron to a query requires verification at both the nucleotide and amino acid levels.

The computational pipeline comprises three procedures: preprocessing, phylogenetic analysis, and inferring the characteristics. The preprocessing procedure extracts a specific E protein coding sequence fragment containing sufficient genetic information for query characterization. The fragment has a fixed length of 454 bp (from nt 309 to 762 according to the sequence of the Vasilchenko strain, GenBank: M97369). Consequently, fragments with insertions, deletions, or ambiguous nucleotide characters are excluded from analysis. The second pipeline step consists of constructing a phylogenetic tree from the query and the clusteron coding sequences (called prototypes). We evaluated several tree construction algorithms and showed that this process requires knowledge about the tree on the coarse scale (i.e., clade or branch). Inferring the subtype and lineage depends on the position of the query taxon and its sister node's clade. The specifics of the algorithm have been described elsewhere [23].
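The preprocessing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's code: the function name `extract_fragment` is hypothetical, and the sketch assumes the query has already been aligned so that string index 0 corresponds to nt 1 of the reference numbering. It only shows the slicing and the character check; detecting insertions relative to the reference would need the alignment itself and is out of scope here.

```python
from typing import Optional

# Coordinates taken from the text: nt 309-762 (1-based, inclusive),
# numbered according to the Vasilchenko strain (GenBank: M97369).
FRAGMENT_START = 309
FRAGMENT_END = 762
FRAGMENT_LENGTH = FRAGMENT_END - FRAGMENT_START + 1  # 454 bp

UNAMBIGUOUS = set("ACGT")


def extract_fragment(aligned_query: str) -> Optional[str]:
    """Return the 454-nt target fragment, or None if the query is excluded.

    Assumption (not from the paper): index 0 of `aligned_query`
    corresponds to nt 1 of the reference numbering.
    """
    fragment = aligned_query[FRAGMENT_START - 1:FRAGMENT_END].upper()
    if len(fragment) != FRAGMENT_LENGTH:
        return None  # the query does not cover the whole region
    if set(fragment) - UNAMBIGUOUS:
        return None  # gap characters ('-') or ambiguity codes such as N, R, Y
    return fragment
```

A sequence that returns `None` here would simply be skipped, which mirrors the exclusion rule stated in the text.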
Next, after determining the subtype and lineage, the process moves to the final step of identifying the clusteron. The decision about assigning a clusteron is taken at the amino acid level. Therefore, we search for a match in the target fragment protein sequence between the query and the clusterons using the subtype and lineage determined from the previous step. The final output is the hierarchical three-fold phylogenetic characteristics of the query strain. According to the CA, certain clusterons have identical amino acid profiles while their lineages are entirely different. This is why the amino acid sequence alone is not enough to determine the clusteron to which the query strain belongs. A dashed line visually indicates such a relationship on the CS network (Figure 2).

If any of the three-fold characteristics cannot be identified, the strain is tagged as unique. Unique strains are viruses that do not meet the epidemiological threshold of clusteron development. In other words, in addition to having a unique and specific amino acid signature, a clusteron is formed only when there is enough evidence of its stability. A sole unique virus may be due to the stochastic nature of viral evolution, and its presence alone does not have epidemiological implications for the future. This is why, in TBEV Analyzer 3.0, we implemented automatic monitoring of GenBank resources to find clues about the development of unique strains. The emergence of additional unique strains with the same profile leads to the formation of a new clusteron. However, adding a clusteron to the CS requires biological justification and verification.
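The matching step above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the platform's implementation: the `Clusteron` record, the function `assign_clusteron`, and the toy identifiers and sequences are all hypothetical. The point it shows is why the lookup is keyed on subtype and lineage as well as the amino acid sequence: two toy clusterons share the same sequence but sit in different lineages.

```python
from typing import List, NamedTuple


class Clusteron(NamedTuple):
    cs_id: str        # identifier of the clusteron in the CS
    subtype: str
    lineage: str
    aa_fragment: str  # amino acid sequence of the E protein fragment


def assign_clusteron(subtype: str, lineage: str, query_aa: str,
                     clusterons: List[Clusteron]) -> str:
    """Return the CS ID of the matching clusteron, or 'unique' if none matches."""
    for c in clusterons:
        if (c.subtype, c.lineage, c.aa_fragment) == (subtype, lineage, query_aa):
            return c.cs_id
    return "unique"


# Toy data (hypothetical): 'toy-X' and 'toy-X2' are homoplastic, i.e. they
# have the same amino acid sequence but belong to different lineages.
TOY_CLUSTERONS = [
    Clusteron("toy-X", "Siberian", "Asian", "MKTAYIA"),
    Clusteron("toy-X2", "Siberian", "Baltic", "MKTAYIA"),
    Clusteron("toy-Y", "Siberian", "Asian", "MKTAYVA"),
]
```

With this toy table, `assign_clusteron("Siberian", "Baltic", "MKTAYIA", TOY_CLUSTERONS)` resolves to `toy-X2`, whereas a lookup on the amino acid sequence alone could not distinguish it from `toy-X`.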
The platform accepts queries containing multiple sequences, and analyzes each sequence individually to generate a report. The report consists of eight sections, among them the alignment tables (including the Protein Alignment Table) and the Amino Acid Signature. In the following, we briefly describe the content of each report section, including the related updates in TBEV Analyzer 3.0.

The general report lists three types of information: query information entered by a user, system-generated information during task processing, and finally the three-fold phylogenetic characteristics, including the subtype, lineage, and clusteron. This is the part of the report in which a user sees the query's primary features at a glance.
The "Clusteron Structure" section provides an overall picture of TBEV evolution. Such a visualization provides the opportunity to study the history of evolution at both the global and local scales. The CS assigns a unique identifier, referred to as the CS ID, to each clusteron. Typically, clusterons can be divided into two classes, the clusteron founder and the clusteron derivative [24]. The clusteron founder is the founder of a subtype/lineage, and contains the largest number of strains among the clusterons of that subtype/lineage. At present, there are seven founders: 1A, 4A, 3A 2 , 3A, 3J, 3D, and 2A. Clusteron derivatives are smaller, and stem from a clusteron founder. Derivatives vary based on their level (first, second, etc.). They differ from the founder by one, two, or more amino acid substitutions. The evolutionary paths between founders are decorated with 'transition points', i.e., amino acid sequences of the E protein fragment that have not been observed; as such, they were likely deleterious [6]. As mentioned earlier, there can be clusterons with different evolutionary paths even though their amino acid profiles are identical. Such clusterons are called homoplastic clusterons, and are connected by dotted lines (3F-3F 2 , 3A-3A 2 , 3C-3C 2 -3C 3 , 3L-3L 2 ) in Figure 2. It is worth mentioning that the prototype coding sequence of a clusteron within a phylogenetic lineage is unique, even for homoplastic clusterons.
In TBEV Analyzer 3.0, we increased informativeness by adding the ability to interact with the CS. By clicking on each clusteron, a popup window displays clusteron specifics, which consist of three types of information:

• Phylogenetic characteristics
• Specific amino acid signature
• E protein fragment coding sequence (prototype)

To better visualize clusteron location, when the platform assigns a query to a known clusteron, its position in the CS blinks. With version 3.0, TBEV Analyzer gains the ability to regularly check the GenBank database and report novel emerging TBEV strains not included in the current CS. The latest analysis carried out by the platform revealed the formation of new clusterons (explained in Section 3); these were verified and added to the CS. Hence, the current CS includes the latest updated version of TBEV evolutionary dynamics.
The next report section provides the phylogenetic tree. The tree is generated from the coding sequence of the specific E protein fragment. As mentioned, the tree is necessary for declaring the subtype and lineage by comparing the query taxon's location with clusteron taxa. More details about its algorithms and the method for determining the phylogenetic characteristics of the query have been described [23]. Note that the tree visualization is customized such that the query taxon is located as the topmost leaf in the tree. In addition, clades associated with subtypes and lineages are individually colored and the query-to-root path is highlighted in red.
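The idea of reading the subtype and lineage from the query's sister clade can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of [23]: the tree is a plain nested tuple rather than a parsed Newick file, the function names are hypothetical, and the majority vote over the sister clade's leaves is an assumption made only for this sketch.

```python
from collections import Counter


def leaves(node):
    """Collect leaf names of a tree given as nested tuples of strings."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node:
        out.extend(leaves(child))
    return out


def sister_leaves(node, query):
    """Return the leaves of the query's sister clade(s), or None if absent."""
    if isinstance(node, str):
        return None
    if query in node:  # the query is a direct child of this node
        return [leaf for child in node if child != query
                for leaf in leaves(child)]
    for child in node:
        found = sister_leaves(child, query)
        if found is not None:
            return found
    return None


def infer_subtype_lineage(tree, query, labels):
    """Majority (subtype, lineage) label among the sister clade's leaves.

    `labels` maps prototype names to (subtype, lineage) tuples.
    Returns None when the query or a labeled sister is missing.
    """
    sisters = sister_leaves(tree, query)
    votes = Counter(labels[s] for s in sisters or [] if s in labels)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For a toy tree such as `(("p1", "p2"), (("query", "p3"), "p4"))`, the sister clade of `query` is the single leaf `p3`, so the query inherits the labels attached to `p3`.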
Because the CA relies on both nucleotide and amino acid sequences, we provide two alignment tables equipped with several coloring schemes. The tables allow for the exploration of genetic variations among the clusterons and comparison of their signatures with the query's signature. It should be noted that a user can examine the genetic variability from various aspects, e.g., hydrophobicity, by changing the coloring scheme. Additional information is presented as well, including information about the target fragment of the genetic sequence, position in the E protein, conserved and variable sites, and similarity score between the query and clusterons.
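Two of the quantities listed above, the similarity score and the conserved/variable sites, can be illustrated with a short Python sketch. The platform's exact scoring formula is not given in the text, so the sketch assumes the simplest choice, the fraction of identical positions between two equal-length aligned sequences; the function names are hypothetical.

```python
from typing import List, Sequence


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences.

    Assumption: plain percent identity; the platform may use another score.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def variable_sites(seqs: Sequence[str]) -> List[int]:
    """1-based positions at which the aligned sequences are not all identical."""
    return [i + 1 for i, column in enumerate(zip(*seqs))
            if len(set(column)) > 1]
```

The same two functions apply to nucleotide and amino acid alignments alike, since both are treated as plain strings of equal length; positions not returned by `variable_sites` are the conserved ones.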
Clusterons are declared based on the E protein amino acid sequence. Thus, each clusteron has its own specific amino acid signature, except for the homoplastic clusterons 3C-3C 2 -3C 3 , 3A-3A 2 , 3L-3L 2 , and 3F-3F 2 , which share a signature but belong to phylogenetically different lineages. The signatures were collected from published data [6,20] and further updated in the platform. We expanded the report by adding a new section called "Amino Acid Signature". This section contains a table and a 3D visualization. The table lists signatures of all current clusterons (Table 1), with the signature of the query's assigned clusteron highlighted in red. Through the signatures, key sites that are responsible for phylogenetic characteristics can be determined. For example, the combination of positions 206/234 may serve as a sub-signature for subtype and lineage identification. The highlighted signature is shown on the surface of the E protein (PDB ID: 1SVB [25]) by PDBe Molstar [26][27][28] (Figure 3), a modern web-based toolkit for visualization and analysis of large-scale molecules. This type of interactive visualization permits examination of structural and functional characteristics of key protein positions.

In contrast to the second version of the platform, we now visualize the TBEV spatial distribution with Kepler.gl [29], a high-performance WebGL-powered web application. By default, the current map supports two visualizations: a scatter plot and a heatmap. Each can be customized through the interface (Figure 4). The map is equipped with three-level filters: subtype, lineage, and clusteron. Thus, a user can restrict the strains shown on the map at each level of the CS. As a novel feature, unlike the previous version, the query strain's location of isolation can be visualized on the map if the user provides its latitude and longitude.

The last section of the report contains supplementary files generated by the system during query analysis. They include an aligned FASTA file of coding sequences, the phylogenetic tree file in Newick format, a phylogenetic tree image, and similarity score files for both nucleotide and protein sequences.
With the rapid growth of genetic databases, the number of online bioinformatics platforms for processing and analyzing these data has increased. The performance of such platforms is enhanced by exchanging information between them. To enable platforms to communicate, they are often equipped with an API. The current version of TBEV Analyzer provides an API through which the characterization of a query strain can be requested.

Experiment and Results
As mentioned earlier, TBEV Analyzer 3.0 regularly monitors the GenBank database for newly emerging strains. The platform performs two sequential analyses:

• Determining the three-fold phylogenetic characteristics of all new strains
• Reconsidering the set of strains identified as unique in order to find any new clusterons

When a new strain does not belong to a known clusteron, it is tagged as a unique strain. After analyzing the GenBank database, the platform reconsiders the set of all unique strains. Suppose there are unique strains with an identical amino acid profile and their abundance meets the epidemiological threshold (e.g., a minimum of two strains). In such a case, they are reported and considered for further verification. When a new verified clusteron is introduced into the platform database, its related unique strains are automatically labeled with its CS ID.
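The second analysis above amounts to grouping unique strains by profile and applying a threshold, which can be sketched as follows. This is a minimal illustration, not the platform's code: the function name and record layout are hypothetical, the accession strings in the usage note are placeholders rather than GenBank records, and grouping by subtype and lineage in addition to the amino acid profile is an assumption motivated by the existence of homoplastic clusterons. The default threshold of two follows the example value given in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (accession, subtype, lineage, amino acid profile of the E protein fragment)
UniqueStrain = Tuple[str, str, str, str]


def candidate_clusterons(
    unique_strains: Iterable[UniqueStrain], threshold: int = 2
) -> Dict[Tuple[str, str, str], List[str]]:
    """Group unique strains by (subtype, lineage, profile) and keep groups
    whose size reaches the threshold.

    The returned groups are only candidates: per the text, adding a
    clusteron to the CS still requires verification by a specialist.
    """
    groups = defaultdict(list)
    for accession, subtype, lineage, profile in unique_strains:
        groups[(subtype, lineage, profile)].append(accession)
    return {key: accs for key, accs in groups.items() if len(accs) >= threshold}
```

For example, given placeholder records `("ACC1", "European", "European", "MKTA")`, `("ACC2", "European", "European", "MKTA")`, and `("ACC3", "Siberian", "Asian", "MKTV")`, only the first two form a candidate group; the third remains a unique strain under monitoring.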
To demonstrate the capabilities of the upgraded platform, we analyzed all TBEV-related records from GenBank. At the time of this analysis, 2923 registered strains were found with the term "TBEV". We filtered out strains with insertions, deletions, or ambiguous nucleotide characters in the region of interest. The remaining strains either had a known clusteron or were identified as unique strains. The platform analyzed 1763 strains overall, including 1419 strains with known clusterons, 46 unique strains introducing 11 novel clusterons, and 298 unique strains currently under consideration for further monitoring and analysis.
The platform found eleven new clusterons, which are marked in Table 1. Two clusterons ("1L", "1M") belong to the Far-Eastern subtype. Clusteron "1L" has three strains (KM019546, KJ914682, KJ739729) isolated from the Tomsk and Novosibirsk regions in Russia. Clusteron "1M" has three strains (KP869172, KF880804, KT001073) isolated from the Khabarovsk region. Seven clusterons are related to the European subtype: 2K, 2L, 2M, 2N, 2O, 2P, and 2Q. The two remaining clusterons, 3TN and 3PQ, are associated with the Siberian subtype, with different lineages (Asian and Baltic, respectively). Characteristics of the newly analyzed clusterons are presented in Table 2.

Table 2. Characteristics of newly analyzed clusterons. The majority of them belong to the European subtype. The associated accession numbers of each clusteron are presented. The clusteron region is determined through information available in GenBank.

Subtype Lineage Clusteron Strains Region
Far-Eastern

Discussion
The emergence of COVID-19 has reminded us that many aspects of viral dynamics remain unknown and that evolution is full of surprises. Along these lines, early knowledge about antigenic variants and emerging novel viruses is crucial for public health. In addition to laboratory and clinical research, surveillance platforms play a significant role in controlling pathogens.
Although humans are an accidental host for TBEV, there is no guarantee of the virus not becoming an adapted human pathogen. The history of TBEV evolution provides strong evidence about its capability for variation in pathogenicity. Such surprising viral evolution dynamics motivated us to develop the TBEV monitoring platform.
Beginning in 2019, when we proposed a TBEV-specific platform for the first time, we have received requests to add extra features and analyses. Our efforts have envisioned the implementation of a fully automated monitoring platform with human supervision. Currently, integration of the platform with external resources, e.g., GenBank, enables it to function as a high-performance analytical tool. Newly obtained results from analyzing the GenBank database reveal the formation of novel clusterons. Essential information about their characteristics, especially their spatial distribution dynamics, was uncovered.
In addition to phylogenetic analysis, the platform presents important epidemiological information. For example, clusteron 2K, which belongs to the European subtype, was isolated in Baden-Wuerttemberg (Germany), yet it was registered in the Altai region as well. Considering the long geographical distance between Germany and Altai, a question arises about the introduction of the virus into the Altai region. A similar situation exists with clusteron 3PQ, which is related to TBEV-Sib-Baltic. Two out of four strains were found in the Omsk and Sverdlovsk regions, while the two remaining strains were collected from the Republic of Karelia, near Finland.
Other interesting facts are the spreading of TBEV in the eastern UK and reconfirmation of its origin by our platform. There are two UK-related clusterons, 2O and 2Q. Currently, clusteron 2Q covers the Netherlands and UK-Hampshire, while 2O covers Denmark, UK, and Norway. Construction of these clusterons reveals the early stage of TBEV spread in UK territory. This agrees with recent studies [4,5] that suggest introduction of TBEV from the Netherlands and Norway into the UK.
In addition to various newly added capabilities, our platform has two notable advantages:

• Performing CA analysis
• Monitoring the geographical distribution of clusterons

Beyond known clusteron strains, more attention should be paid to unique strains. They can be divided into two classes: evolutionary dead-end strains and evolutionary developing strains. Due to the stochastic nature of viral evolution, certain replicants acquire deleterious mutations; although they can be encountered or isolated, no further trace of them is seen in the evolutionary timeline. Because we do not have access to all isolated samples, instead of performing laboratory experiments we set an epidemiological threshold to determine the ability of a virus to develop. In contrast to strains carrying such deleterious mutations, a unique strain may benefit from selective pressure. Such a strain may, after expansion, reach the threshold and be considered for identification as a new clusteron. In any case, all unique strains are in the spotlight of the platform, and are regularly investigated algorithmically.
While the current platform is reliable and promising, there are several ways that we plan to enhance it. Thanks to the CA, we are now able to determine TBEV strain characteristics in a few seconds. The CA relies on the results of two phylogenetic networks: the first network obtained with the Median-joining algorithm [30] from phylogenetic network software developed by Fluxus (www.fluxus-engineering.com) and the second one obtained from PHYLOViZ [31], while the CS network is manually generated via merging their information. Because placing new clusterons into the CS requires reconstructing phylogenetic networks, it requires a great deal of effort. Therefore, the next stage of platform development will involve automatic CS construction.
The CS is a network presented in two dimensions. Additional characteristics, such as the age of the clusteron, can improve its informativeness. Such features will be considered in the future by producing a customized color graph in 3D space, with the additional dimension enabling the presentation of additional clusteron features.
We plan to connect the table of amino acid-specific signatures to a 3D protein surface visualization to enable exploration of differences between amino acid signatures. Thus, users will be able to interact with the table to choose positions and clusterons, allowing them to better explore genetic variability. Furthermore, by providing the degree of amino acid similarity along with reduced amino acid alphabets, researchers will be able to investigate the nature of mutations. We believe this might aid understanding of the variability of pathogenicity within the clusterons of a lineage/subtype.
It is a known fact that there is a high degree of similarity between the E proteins of human pathogenic Flaviviruses. More interestingly, they have the same overall protein architecture [11]. As the CA employs a specific E protein fragment, it is possible to expand and generalize its application to other Flaviviruses. Furthermore, modern bioinformatics platforms include the ability to search for information within scientific papers. Therefore, we intend to develop an automated web crawler to find TBEV-related papers, look up metadata inside the web document, and fetch them.

Conclusions
In conclusion, genetic resources, especially GenBank, are accumulating more strain-specific genetic data all the time. Virologists, researchers, and other specialists need tools to bring existing voluminous data or new submissions into focus and place them in context. With newly emerging biothreats, the need to rapidly interpret data is becoming even more critical. In this paper, we propose an update to the hierarchical phylogenetic model of TBEV known as the clusteron structure, represented as a graph. Development of this version, TBEV Analyzer 3.0, includes a new comprehensive analysis of all appropriate GenBank entries for the virus. This yielded eleven new clusterons beyond those previously identified by the platform. In addition, the most recent analysis has automatically updated all clusteron strain sets, representing an important algorithmic feature of the newest version. This is an important result in that it shows that the Analyzer is capable of flagging new results while remaining consistent and reproducible with respect to previous findings. For a monitoring tool for emerging or known biological threats, this is a key requirement. In addition, other refinements were implemented for greater platform functionality. We hope that the application of the updated TBEV Analyzer can elucidate overall TBEV evolutionary dynamics or other hidden biological nuances by technological means. We foresee TBEV Analyzer and other approaches being used together to more effectively combat such viral infection and its associated health burdens.

Conflicts of Interest:
The authors declare that they have no competing interest.

Abbreviations
The following abbreviations are used in this manuscript: