Viruses
  • Systematic Review
  • Open Access

24 November 2025

Bioinformatics Tools and Approaches for Virus Discovery in Genomic Data: A Systematic Review

1 Research Institute for Systems Biology and Medicine (RISBM), Department of Mathematical Biology and Bioinformatics, Moscow 117246, Russia
2 Martsinovsky Institute of Medical Parasitology, Tropical and Vector Borne Diseases, Sechenov University, Moscow 119435, Russia
* Author to whom correspondence should be addressed.
This article belongs to the Section General Virology

Abstract

The exponential growth of viral metagenomic data has created an urgent need for accurate and scalable tools for virus discovery, yet the extreme diversity, rapid evolution, and limited reference databases for viruses pose unique computational challenges that traditional sequence comparison methods struggle to address. This systematic review, conducted in accordance with PRISMA 2020, examines current trends and methodological advances in virus discovery tools from 1990 to 2025. As virus discovery is a broad and multi-dimensional topic, this review focuses on the first-line tools used to analyze the results of high-throughput sequencing. The review was conducted using the PubMed database with a snowballing approach, with over 54 key studies selected for the analysis. These studies encompass the following approaches: alignment-based methods, rapid similarity estimation techniques, profile hidden Markov model methods, combination pipelines, k-mer-based approaches, and machine learning-based methods. The transition from alignment-based to machine learning methods has dramatically improved the detection of divergent viruses, yet challenges remain in interpreting model decisions and handling incomplete viral genomes. This review summarizes current knowledge and potential future directions for the development of virus detection capabilities.

1. Introduction

Viruses are the most abundant biological entities on Earth, with an estimated 10^31 particles globally, colonizing every environment where potential hosts exist []. Viral abundance varies by habitat: marine samples contain 8.5 × 10^5 to 2.2 × 10^7 virus-like particles (VLPs) per mL, whereas soil environments show densities of 10^7–10^9 VLPs/g and account for 90–95% of global viral biomass [,]. The variety of viral hosts is immense, spanning all domains of life: bacteria, archaea, protists, fungi, plants, and animals, including humans.
Viruses exhibit exceptional diversity in their genetic material organization. Viral genome sizes range from 1.8 kb to 2.5 Mb, reflecting a broad spectrum of genetic information encoding strategies []. Current classification of viruses is based on host cell type: viruses of eukaryotes, viruses of archaea (archaeal viruses), and viruses of bacteria (bacteriophages); nucleic acid type and organization (DNA-containing deoxyviruses and RNA-containing riboviruses, single- and double-stranded, linear and circular); and structural organization (enveloped and non-enveloped viruses). Viruses drive host evolution through horizontal gene transfer and dramatically alter cellular metabolism—cyanophages reprogram photosynthesis in Synechococcus—while auxiliary metabolic genes enhance host fitness under stress []. Many phages persist as integrated prophages, conferring capabilities from toxin production to antibiotic resistance [].
Despite their abundance and importance, virus classification faces fundamental challenges. Unlike cellular organisms with universal ribosomal RNA markers, viruses lack conserved genes across all lineages [,]. Combined with extreme mutation rates (10^-6 to 10^-4 substitutions per nucleotide per cell infection in RNA viruses), extensive horizontal gene transfer [], and polyphyletic origins [], viral genomes represent evolutionary mosaics that challenge traditional taxonomy. Nevertheless, viral taxonomy continues to expand rapidly: as of 2025, the International Committee on Taxonomy of Viruses (ICTV) has established 16,215 virus species organized into seven realms, 11 kingdoms, 22 phyla, and numerous lower taxonomic ranks, with recent additions including the new phylum Ambiviricota for viruses combining features of both typical RNA viruses and viroids [].
The study of viral diversity remains severely limited by methodological constraints. Traditional virology depends on cultivating viruses from cultured hosts, yet most bacteria, and thus their phages, cannot be grown in laboratory settings. While next-generation sequencing enables culture-independent discovery, known viruses remain underrepresented in metagenomic datasets, typically comprising less than 5% of sequencing reads due to their small genomes. Most significantly, the “viral dark matter” persists—40–90% of viral sequences in metagenomic studies show no homology to known viruses [,], reaching 95% in environmental samples. The rapid expansion of metagenomic sequencing has created an urgent need for accurate virus discovery tools, yet the extreme diversity and rapid evolution of viruses pose unique computational challenges.
The identification of viruses in metagenomic datasets has undergone a remarkable transformation over the past two decades, evolving from simple sequence alignment to sophisticated artificial intelligence approaches. Formally, we can highlight four generations of viral identification tools. The first generation emerged in the 1990s based on sequence alignment approaches, with BLAST as the ubiquitous tool that was used to compare sequences against reference databases but was limited to detecting related viruses [,]. The second generation emerged in the 2010s based on statistical models and hidden Markov models (HMMs), with VirSorter (2015) as a key tool that detected viral “hallmark” genes and analyzed k-mer frequencies beyond simple sequence similarity []. This probabilistic approach combined multiple genomic features like gene density and strand bias to distinguish viral from cellular DNA with improved sensitivity. The late 2010s marked the emergence of machine learning-based approaches for viral identification, with VirFinder (2017) using logistic regression to identify viral sequences based on k-mer frequencies []. This was followed by DeepVirFinder (2020), which pioneered deep learning using convolutional neural networks to recognize complex patterns without reference databases []. The current frontier encompasses two main approaches: hybrid methodologies that combine multiple detection strategies, exemplified by VIBRANT (2020) which integrates neural networks with HMM-based annotation to maximize sensitivity and specificity [], and emerging large language model (LLM) approaches like ViraLM (2024) [] that leverage pre-trained genome foundation models such as DNABERT-2 to achieve enhanced detection capabilities, particularly for short viral contigs that challenge traditional methods. 
These approaches were further advanced in the last few years by the development of integrated identification and assembly tools that use HMMs to capture conserved motifs of distantly related viruses in large sequencing datasets and then apply assembly to extend the resulting contigs into near-complete genomes. In this systematic review, we summarize tools and approaches for virus discovery, tracing their development from simple sequence-based methods to sophisticated artificial intelligence systems.

2. Materials and Methods

We conducted a comprehensive literature search using PubMed (National Center for Biotechnology Information, Bethesda, MD, USA) [] as the primary database in July 2025 to systematically identify computational tools developed for the discovery of viral sequences. We included tools that can at minimum distinguish viral from non-viral sequences, while also offering differing capabilities for taxonomic resolution where applicable. No restrictions were imposed on the year of publication, resulting in coverage of articles from 1990 through 2025. The search queries were designed to cover a wide range of virus detection tools and classification studies, using keywords such as “viral”, “metagenomics”, and “computational tool”.
In addition to a direct database search, a snowballing approach was implemented to broaden the scope. Relevant tools referenced in both reviews and primary research articles were incorporated, particularly when they were not indexed through our initial PubMed search. This strategy ensured a more exhaustive and representative selection of computational resources pertinent to viral taxonomy annotation.
Given the breadth and diversity of available tools, we focused our analysis on categorizing methods according to their underlying viral detection strategies. Recognizing that many tools implement hybrid approaches, we assigned each tool to the group corresponding to the methodological paradigm that is most prominent and foundational within its workflow. Consequently, we distinguished four principal approaches: alignment-based, profile hidden Markov model-based, machine learning-based, and k-mer-based. In addition, foldome-based strategies, which rely on protein structural folds to infer evolutionary relationships, have recently emerged. While these methods hold promise for viral identification, and especially for further taxonomic assignment, they fall outside the direct scope of this review and are therefore not described in detail here. Some tools discussed in this review incorporate elements of such structural approaches, and the underlying algorithms are typically trained on either reference genomes or amino acid sequences. In most cases, protein data are used for the training process.
Inclusion criteria:
  • Peer-reviewed original research articles describing the development of a computational tool or pipeline for the identification and taxonomic annotation of sequences of eukaryotic viruses, prokaryotic viruses, and proviruses.
  • Full-text access and openly available code.
  • Relevance of the tool to the classification of viral sequences in the context of this review.
The exclusion criteria are shown in Figure 1. All articles that met the inclusion criteria and did not conflict with the exclusion criteria were reviewed in full text.
The tools in this review are suitable for metagenomic and virome studies.
Registration number 10.17605/OSF.IO/MKPFT (Center for Open Science, Charlottesville, VA, USA) [].
Figure 1. PRISMA flowchart.

3. Results

Based on the literature, the tools were categorized into several groups, and their evolution is illustrated as a timeline in Figure 2.
Figure 2. Evolution of viral detection tools and methodological approaches. The timeline illustrates the development of key computational tools for viral taxonomic annotation from 1990 to 2025. Each tool name is colored according to its underlying methodological approach.

3.1. Alignment-Based Approaches for Virus Sequence Identification

Sequence alignment-based methods represent the classical computational approach for identifying viral sequences through systematic comparison with reference databases. These methods operate on the principle that evolutionarily related sequences retain detectable similarity despite accumulating mutations over time. The approach involves aligning unknown query sequences against known viral genomes or proteins to identify regions of similarity that indicate shared ancestry []. The core workflow consists of three stages: alignment computation [,], statistical evaluation, and taxonomic inference. Statistical frameworks are used to evaluate alignment significance using metrics like E-values, bit scores, and percentage identity. E-values (Expect values) represent the number of alignments with similar or better scores expected to occur by chance in a database search—lower E-values indicate more significant matches, with values below 0.001 typically considered biologically relevant []. Bit scores provide a normalized measure of alignment quality independent of database size, calculated from the raw alignment score using log-odds substitution matrices, where higher values indicate better alignments and scores above 50 generally suggest homology []. Percentage identity measures the proportion of exact matches between aligned positions, offering an intuitive metric of sequence similarity, though its interpretation varies by sequence length and type []. Finally, taxonomic assignment occurs through several approaches, most commonly using the closest matching reference sequences and genetic distances for direct classification within established viral taxonomy, or through the lowest common ancestor (LCA) method, which analyzes the top n hits to determine the deepest shared taxonomic level, providing more robust classification and reducing misclassification errors.
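The lowest common ancestor step described above can be sketched as follows. This is a minimal illustration, not the procedure of any specific tool: the reference identifiers, the hit tuples, the `top_n` and `max_evalue` parameters, and the simplified lineages are all assumptions introduced for the example.

```python
from typing import Dict, List, Sequence, Tuple

Hit = Tuple[str, float, float]  # (reference_id, e_value, bit_score)


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-leaf prefix shared by all lineages."""
    shared: List[str] = []
    if not lineages:
        return shared
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


def assign_by_lca(hits: Sequence[Hit], taxonomy: Dict[str, List[str]],
                  top_n: int = 3, max_evalue: float = 1e-3) -> List[str]:
    """Keep significant hits, take the top_n by bit score, return their LCA."""
    significant = [hit for hit in hits if hit[1] <= max_evalue]
    significant.sort(key=lambda hit: hit[2], reverse=True)
    return lowest_common_ancestor([taxonomy[hit[0]] for hit in significant[:top_n]])


# Hypothetical reference IDs with simplified lineages (illustration only).
PICORNAVIRALES = ["Riboviria", "Orthornavirae", "Pisuviricota",
                  "Pisoniviricetes", "Picornavirales"]
TAXONOMY = {
    "ref_A": PICORNAVIRALES + ["Picornaviridae", "Enterovirus"],
    "ref_B": PICORNAVIRALES + ["Picornaviridae", "Cardiovirus"],
    "ref_C": PICORNAVIRALES + ["Dicistroviridae", "Cripavirus"],
}
HITS = [("ref_A", 1e-50, 200.0), ("ref_B", 1e-40, 180.0),
        ("ref_C", 1e-5, 60.0), ("ref_D", 0.5, 30.0)]

print(assign_by_lca(HITS, TAXONOMY, top_n=2)[-1])  # Picornaviridae
print(assign_by_lca(HITS, TAXONOMY, top_n=3)[-1])  # Picornavirales
```

Widening the set of hits from two to three moves the assignment from family to order, which illustrates the trade-off between resolution and robustness that the LCA method makes.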

3.1.1. Pairwise Alignment Methods

Pairwise sequence alignment is a fundamental method in bioinformatics that compares two biological sequences (DNA, RNA, or protein) to identify regions of similarity. This process involves finding the optimal alignment that maximizes similarity by introducing gaps (insertions or deletions) where necessary. Pairwise alignments can be either local or global: local alignment identifies and aligns the most similar subsequences within two sequences, even when the overall sequences differ significantly, with the Smith–Waterman algorithm serving as a widely adopted dynamic programming approach for this purpose. In contrast, global alignment attempts to align entire sequences end-to-end, matching every character, often with gaps; the Needleman–Wunsch algorithm is the classical method employed in global alignment [].
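The difference between global and local alignment can be made concrete with a small dynamic programming sketch. The scoring values (match +1, mismatch −1, linear gap −2) are arbitrary assumptions for illustration; real tools use substitution matrices and affine gap penalties, and this sketch returns only the optimal score rather than the alignment itself.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, local: bool = False) -> int:
    """Optimal alignment score of a and b.

    local=False: Needleman-Wunsch (global, end-to-end).
    local=True:  Smith-Waterman (best-scoring pair of subsequences).
    """
    # First row: all-gap prefixes cost j * gap globally, 0 locally.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


print(alignment_score("ACGT", "ACG"))                        # 1
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))   # 3
```

In the second call the two sequences share only the core `ACG`; the local variant reports that core, whereas a global alignment of the same pair would be dominated by mismatches and gaps.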
Pairwise sequence comparison represents a fundamental pillar of viral taxonomic classification, as recognized by the ICTV [,]. These comparative methodologies analyze viral genomic sequences to quantify similarity thresholds, thereby defining the demarcation criteria that distinguish between different taxonomic ranks.
BLAST (Table 1) is a widely used local alignment search tool that identifies high-scoring local similarities between biological sequences []. Using weight matrices (such as PAM-120), it scores residue pairs, summing scores over aligned segments to find maximal segment pairs (MSPs). It efficiently searches large databases by indexing short k-mer words from the query and extending hits that exceed defined score thresholds, balancing sensitivity and speed. Tools implementing this approach can process nucleotide sequences (BLASTn), amino acid sequences (BLASTp), and translated nucleotide queries (BLASTx, tBLASTx), with protein-level comparisons offering enhanced sensitivity for divergent viruses.
Table 1. Alignment-based tools for virus detection.
The NCBI Virus portal [] provides a specialized BLAST interface optimized for viral sequences, integrating results with curated viral metadata. MegaBLAST is an optimized version of the BLAST algorithm designed for the rapid detection of exact matches in large genomic databases []. It employs a database index containing compressed sequences and positional information on k-mers. MetaPhinder builds upon BLAST by utilizing metrics such as average nucleotide identity (%ANI) and sequence coverage relative to the database, evaluating the overall similarity across multiple matches rather than relying solely on the highest-scoring alignment [].
PASC pioneered another approach at NCBI, comparing each new viral sequence against reference genomes using both local (BLAST) and global (Needleman–Wunsch) alignment algorithms. However, PASC shows limitations with highly diverse virus families or those exhibiting significant genome length variation []. SDT employs Needleman–Wunsch alignment algorithms, implemented through MUSCLE, ClustalW, or MAFFT, to perform pairwise sequence alignments. SDT can produce publication-quality pairwise identity plots and color-coded distance matrices to further aid the classification of sequences according to ICTV-approved taxonomic demarcation criteria []. VICTOR specializes in prokaryotic virus classification using genome BLAST distance phylogeny (GBDP) methodology, which employs modified BLAST comparisons across entire genomes rather than individual genes []. VIRIDIC has become a widely used tool for bacteriophage classification, calculating pairwise intergenomic similarities and performing hierarchical clustering with intuitive heatmap visualization []. The tool implements genus-level demarcation using 70% nucleotide similarity thresholds aligned with ICTV recommendations, though species-level boundaries vary by viral family (typically 95% for many bacteriophages). VIRIDIC processes multiple viral genomes simultaneously, generating color-coded matrices that reveal phylogenetic relationships and facilitate the identification of novel viral taxa. vConTACT [,,] versions 1 and 2 utilize the Markov cluster algorithm (MCL) methodology to generate protein clusters (PCs) based on BLASTP results. The similarity between genomes is estimated through the hypergeometric distribution, which calculates the probability that two genomes will randomly share n PCs, given the total number of PCs [].
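The hypergeometric reasoning behind protein-cluster sharing can be illustrated with a short calculation. This is a sketch of the underlying probability only, not vConTACT's implementation: the function name and the example numbers are assumptions, and any multiple-testing correction or score transformation applied by the tool is omitted.

```python
from math import comb


def prob_shared_at_least(c: int, a: int, b: int, n: int) -> float:
    """P(X >= c) for X ~ Hypergeometric(n, a, b).

    n: total number of protein clusters (PCs) in the dataset
    a: PCs present in genome A
    b: PCs present in genome B
    c: PCs observed in both genomes
    Small values mean the observed sharing is unlikely to arise by chance.
    """
    upper = min(a, b)
    favourable = sum(comb(a, k) * comb(n - a, b - k) for k in range(c, upper + 1))
    return favourable / comb(n, b)


# Hypothetical example: 1000 PCs overall, genomes with 50 and 60 PCs, 20 shared.
p = prob_shared_at_least(20, 50, 60, 1000)
print(f"{p:.3e}")
```

Two genomes that share many more PCs than expected by chance receive a very small probability, which can then be converted into an edge weight for network-based clustering.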

3.1.2. Multiple Sequence Alignment Methods

While pairwise alignments are often sufficient for initial sequence identification, comprehensive virus analysis relies upon a multiple sequence alignment (MSA) to elucidate the relationships between multiple viruses with varying relationship degrees. This approach is fundamental in bioinformatics applications, such as reconstructing evolutionary histories and identifying functionally important motifs. However, the complexity of aligning multiple sequences presents significant computational challenges. To address these, various algorithms have been developed, including progressive methods that build alignments stepwise from pairwise comparisons, as well as iterative refinement techniques and other advanced strategies that enhance alignment accuracy and scalability.
PSI-BLAST [] is an iterative amino acid sequence search algorithm that improves sensitivity by constructing and refining a position-specific scoring matrix (PSSM). Initially, it performs a standard BLAST search to identify significant matches, which are used to generate a multiple sequence alignment. From this alignment, a PSSM is derived, capturing position-specific amino acid conservation. The process iterates, updating the PSSM with new significant hits, improving detection of distant homologs. Iterations continue until no new significant sequences are found or a set maximum number of rounds is reached.
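How a position-specific scoring matrix is derived from an alignment can be sketched in a few lines. This is a simplified illustration under stated assumptions: a uniform background distribution and a fixed pseudocount are used, whereas PSI-BLAST itself applies sequence weighting and more elaborate pseudocount estimation. The function names and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: Sequence[str], alphabet: str = AMINO_ACIDS,
               pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one log-odds column per alignment position.

    Assumes equal-length rows and a uniform background frequency.
    Gap characters are ignored when counting.
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(ch for ch in column if ch in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            aa: log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm


def score(pssm: List[Dict[str, float]], seq: str) -> float:
    """Sum the position-specific scores of seq against the matrix."""
    return sum(column[ch] for column, ch in zip(pssm, seq))


toy_alignment = ["MKV", "MKI", "MRV"]  # hypothetical three-sequence alignment
matrix = build_pssm(toy_alignment)
print(score(matrix, "MKV") > score(matrix, "WWW"))  # True
```

In an iterative search, sequences scoring above a threshold against the matrix would be added to the alignment and the matrix rebuilt, which is the loop that gives PSI-BLAST its sensitivity to distant homologs.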
The standard tools for performing MSA are MAFFT [,], MUSCLE [], ClustalW [], ClustalX [], and Clustal Omega [,]. ClustalW (and ClustalX, its slightly improved variant with a graphical interface) is a multiple sequence alignment algorithm that performs progressive alignment, prioritizing sequences based on similarity scores and combining pairwise and global alignment approaches. Clustal Omega enhances scalability with the mBed embedding method for fast guide tree construction and aligns sequences using profile hidden Markov models combined with external profile alignment and iterative refinement for improved accuracy. MAFFT accelerates multiple sequence alignment by using the fast Fourier transform (FFT) to quickly identify homologous segments based on amino acid properties before applying dynamic programming. MUSCLE employs a three-stage progressive alignment process that starts with k-mer distance clustering, refines the guide tree with Kimura-corrected distances, and iteratively improves the alignment using a log-expectation scoring function. VIRULIGN [] builds codon-correct alignments by comparing each target sequence to a reference using Needleman–Wunsch alignment of amino acid translations. It detects and corrects frameshifts by adjusting problematic gaps, repeating until no frameshifts remain or a limit is reached. VIRULIGN is designed to handle large sequence datasets; its main focus is on generating codon-correct multiple sequence alignments that prevent frameshifts, making it ideal for coding region alignments. In contrast, ViralMSA [] is primarily developed for the rapid alignment of complete viral genomes in real time. ViralMSA is a flexible, cross-platform tool that rapidly aligns viral genomes by mapping sequences to a reference genome using Minimap2 by default, efficiently generating multiple sequence alignments while discarding insertions relative to the reference to focus on informative variations.
It supports various mappers such as STAR [], Bowtie 2 [], and HISAT2 []. MACSE extends the classical Needleman–Wunsch algorithm to handle protein-coding nucleotide sequences with frameshifts and stop codons by considering 15 possible moves during alignment []. The alignment cost includes penalties for amino acid substitution, opening and extending gaps, as well as special high penalties for frameshifts and stop codons, which help maintain the correct codon structure in the alignment. For multiple sequence alignment, MACSE employs a progressive strategy using nucleotide k-mer frequencies for similarity estimation, constructing a guide tree via UPGMA, and aligning sequences and profiles with a pessimistic gap counting approach to accurately handle insertions and deletions. TranslatorX performs multiple sequence alignment of nucleotide sequences by translating them into amino acid sequences, aligning these amino acid sequences using established algorithms, and then back-translating the alignment to nucleotides while preserving codon structure []. To enhance alignment quality, TranslatorX employs a cleaning procedure based on the analysis of amino acid alignments using the GBlocks tool. This approach enables the removal of ambiguous regions from the nucleotide alignment while retaining informative sites and maintaining positional homology.
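The translate, align, and back-translate strategy used by codon-aware aligners can be sketched as follows. This is a minimal illustration, not TranslatorX's code: the amino acid alignment is assumed to come from an external aligner and is hard-coded here, only the standard genetic code is handled, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1; unknown codons become 'X'."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def back_translate(aligned_protein: str, cds: str) -> str:
    """Thread the original codons onto a gapped protein alignment row."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons):
        raise ValueError("protein row and coding sequence disagree in length")
    out, idx = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")  # one protein gap spans a whole codon
        else:
            out.append(codons[idx])
            idx += 1
    return "".join(out)


seqs = {"s1": "ATGAAAGTT", "s2": "ATGGTT"}
proteins = {name: translate(s) for name, s in seqs.items()}  # MKV, MV
protein_alignment = {"s1": "MKV", "s2": "M-V"}  # assumed output of an external aligner
for name, row in protein_alignment.items():
    print(name, back_translate(row, seqs[name]))
```

Because gaps are inserted only in multiples of three nucleotides, the reading frame is preserved in the resulting nucleotide alignment.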
Virus taxonomic classification in GLUE is structured around an evolutionary hierarchy called an “alignment tree,” which organizes virus sequences into clades and clade categories reflecting their evolutionary relationships []. The alignment tree links parent and child clades through their reference sequences, ensuring coherent evolutionary representation. GLUE integrates MAFFT and BLAST+ for alignment. Sequence-to-clade assignment is performed using a Maximum Likelihood Clade Assignment (MLCA) algorithm, which places query sequences onto a fixed reference phylogenetic tree using the RAxML Evolutionary Placement Algorithm (EPA). Based on evolutionary distances to neighboring reference sequences already assigned to clades, MLCA calculates the likelihood of membership in each clade and assigns the query sequence to the most probable one above a confidence threshold.

3.1.3. Rapid Similarity Estimation Methods

Traditional alignment-based approaches are computationally intensive, particularly for large-scale genomic datasets. To address this limitation, novel algorithms have been developed to minimize computational overhead while maintaining accuracy. Among these, MashMap stands out as an efficient method for estimating sequence similarity between reads and reference genomes. MashMap [,] approximates the Jaccard coefficient using a combination of MinHash sketching and winnowing techniques, bypassing the need for exhaustive alignment. The algorithm operates by indexing representative minimizers (k-mers) from both query and reference sequences, enabling rapid candidate region identification through a two-stage filtering process: initial filtering based on shared minimizer density to exclude non-homologous regions and precise similarity calculation using ordered data structures for retained candidates. This strategy efficiently narrows down potential matches, significantly reducing runtime while preserving high accuracy compared to conventional alignment methods. As a result, MashMap enables scalable and sensitive mapping, making it particularly suitable for large genomic datasets. MashMap’s underlying principles have been leveraged in FastANI [], a tool designed for rapid Average Nucleotide Identity (ANI) computation. ANI serves as a key metric for quantifying nucleotide-level similarity between genomes, originally developed for bacterial and archaeal classification but now increasingly applied to viral genomes. FastANI exhibits high computational efficiency, processing thousands of genome pairs per minute, which makes it invaluable for large-scale genome clustering.
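The MinHash idea underlying these tools can be sketched with a bottom-s sketch estimator. This covers only the Jaccard approximation, not MashMap's winnowing or mapping stages; the k-mer size, sketch size, hash function, and function names are assumptions chosen for illustration.

```python
import hashlib
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def sketch(seq: str, k: int = 8, size: int = 64) -> List[int]:
    """Bottom-s sketch: the `size` smallest k-mer hashes."""
    return sorted({_hash(km) for km in kmers(seq, k)})[:size]


def minhash_jaccard(sketch_a: List[int], sketch_b: List[int], size: int) -> float:
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def exact_jaccard(a: str, b: str, k: int) -> float:
    """Exact Jaccard index of the two k-mer sets, for comparison."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0


a, b = "ACGTACGTTGCA", "ACGTACGTAGCA"
estimate = minhash_jaccard(sketch(a, k=4, size=1000), sketch(b, k=4, size=1000), 1000)
print(estimate, exact_jaccard(a, b, 4))
```

When the sketch is at least as large as the k-mer set, the estimate coincides with the exact Jaccard index; the practical benefit appears for genomes, where a few thousand hashes stand in for millions of k-mers.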
Vclust is a scalable viral genome analysis workflow that integrates k-mer-based similarity estimation (Kmer-db 2), precise pairwise sequence alignment using Lempel–Ziv parsing (LZ-ANI), and flexible clustering algorithms (Clusty) to efficiently process and classify millions of viral genomes [].
Alignment-based methods remain the gold standard in virus identification and are used at some stage of virtually every such task. They are attractive because they are straightforward and relatively easy to set up. However, their main limitations include high computational costs and low sensitivity to divergent viruses. Therefore, they may not be the best first-line solution for analyzing NGS data, and they are poorly suited for analyzing the “dark matter” of sequencing data, which might include genomes of highly divergent viruses. It should also be noted that, due to high genetic diversity and common indels (which often have the same position in the genome, but are not homologous), alignment of complete virus genomes at a family level cannot be fully automated. Key information, such as Methodology, Database source, Viral Specialization, Citation index (CI), and Limitations for alignment-based approaches, is presented in Table 1. While the citation index may indicate relative relevance of methods, it is important to consider that universal methods, such as BLAST, are used in many fields of biology.

3.2. Profile Hidden Markov Models Methods

Profile Hidden Markov Models (profile HMMs) are probabilistic frameworks that model the sequence variation within a family of related sequences (Table 2). Constructed from multiple sequence alignments (MSAs), profile HMMs capture patterns of conservation, insertions, and deletions characteristic of the sequence family. Compared to traditional similarity search methods such as BLAST, profile HMMs offer greater sensitivity, particularly in detecting distant homologs. As a distinct and foundational class of computational tools, profile HMMs have made invaluable contributions to the advancement of computational biology and molecular sequence analysis. HMMER is a widely adopted software implementation of profile HMMs for biological sequence analysis [].
Table 2. HMM-based tools for virus detection.
To build reliable profile HMMs, the input multiple sequence alignments (MSAs) must be carefully prepared. When training data includes many closely related sequences (e.g., many similar viruses from one species), this can bias the model by over-representing certain patterns. Sequence weighting schemes assign lower weights to redundant, closely related sequences and higher weights to more unique or divergent sequences. Dirichlet mixture priors incorporate knowledge about amino acid substitution patterns, improving the estimation of model parameters when training data is limited. The calibration process uses simulations to establish statistical parameters, enabling accurate E-value calculations essential for controlling false discovery rates in large-scale searches [,].
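Sequence weighting can be illustrated with the position-based scheme of Henikoff and Henikoff, one widely used approach. This is a simplified sketch, not the code of any profile HMM package: gap characters are treated like ordinary residues, and the toy alignment is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def position_based_weights(alignment: Sequence[str]) -> List[float]:
    """Position-based sequence weights, normalized to sum to 1.

    In each column, a sequence receives 1 / (k * n), where k is the number
    of distinct residues in the column and n is the number of sequences
    sharing this sequence's residue. Redundant sequences split the credit.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        distinct = len(counts)
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (distinct * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


# Three identical sequences and one divergent sequence (hypothetical).
toy = ["ACDE", "ACDE", "ACDE", "GHIK"]
print([round(w, 3) for w in position_based_weights(toy)])  # [0.167, 0.167, 0.167, 0.5]
```

The three redundant sequences together carry the same weight as the single divergent one, so the resulting model is not dominated by the over-sampled group.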
Modern viral identification tools often integrate profile HMM strategies within comprehensive bioinformatic pipelines to enhance sensitivity and specificity. VirSorter predicts viral sequences in complete or fragmented genome sequence data from bacteria and archaea by computing multiple statistically modeled metrics across sliding gene windows, including viral hallmark gene presence, viral-like gene enrichment, and various depletion or enrichment patterns in gene features []. It integrates hmmsearch with BLASTP searches to sensitively detect viral hallmark genes and protein domains by comparing predicted proteins against curated viral HMM profile databases, which enhances its ability to find distant homologs and improves annotation accuracy. Detected regions are classified into three confidence categories based on combined metric significance and further refined iteratively by incorporating newly identified viral genes into reference databases. ViralRecall utilizes a nonredundant database of nucleo-cytoplasmic large DNA virus (NCLDV) genomes and constructs HMM profiles from clustering viral orthologous groups to identify viral sequences []. It calculates normalized scores from HMMER3 searches against giant virus orthologous groups (GVOGs) and Pfam databases to distinguish NCLDV-specific signatures, reducing false positives from related viral groups. Cenote-Taker2 is a comprehensive virus discovery and annotation pipeline that integrates BLAST-based and HMM-based methodologies []. The pipeline systematically annotates candidate tRNA genes and infers viral taxonomy through BLASTX comparisons against a curated database of viral sequences. ORF prediction is dynamically tailored according to inferred taxonomy, employing PHANOTATE for putative bacteriophages and Prodigal for all other viruses.
Subsequent functional annotation of predicted ORFs is conducted via a rigorous, multi-tiered approach leveraging HMMER, RPS-BLAST, and HHblits to detect remote homologs against carefully curated protein domain repositories, ensuring precise and sensitive characterization of viral gene content. Phage_Finder combines protein homology, domain profiles, gene annotations, and integration site analysis to identify functional prophage regions in bacterial genomes []. Starting from windows exceeding a defined hit threshold, it expands candidate regions gene-by-gene based on HMM profile hits, BLASTP matches, presence of tRNA/tmRNA genes, and known phage annotations.
Phigaro [] is a computational tool for prophage detection that leverages gene prediction (Prodigal) and phage-specific domain annotation (HMMER3/pVOGs) to score each gene based on curated “white-list” (prophage-enriched) and “black-list” (non-prophage-associated) profile hidden Markov models, with positive, neutral, or penalized weights assigned accordingly. These gene-level scores are integrated across genomic neighborhoods using the middle of the sliding window to smooth prophage likelihood estimates, and further refined by incorporating local GC content deviation from the host genome to improve boundary resolution, except when operating in GC-independent mode. It also produces dynamic annotated ‘prophage genome maps’ and marks possible transposon insertion spots inside prophages.
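The general idea of smoothing per-gene scores over a genomic neighborhood and then calling contiguous high-scoring regions can be sketched as follows. This is an illustration of the technique only, not Phigaro's algorithm: the scores, the window width, and the threshold are hypothetical, and the GC-content refinement is omitted.

```python
from typing import List, Sequence, Tuple


def smooth_scores(scores: Sequence[float], window: int = 5) -> List[float]:
    """Centered moving average; windows are truncated at the contig ends."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def call_regions(smoothed: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Return inclusive (first_gene, last_gene) index pairs at or above threshold."""
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(smoothed) - 1))
    return regions


# Hypothetical per-gene scores: 1 = phage-like domain hit, 0 = no evidence.
gene_scores = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
smoothed = smooth_scores(gene_scores, window=3)
print(call_regions(smoothed, threshold=0.5))  # [(2, 5)]
```

Smoothing makes the call tolerant of isolated genes without a domain hit inside a prophage, at the cost of blurring the exact boundaries, which is why an additional signal such as local GC deviation is useful for boundary refinement.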
The viral RNA-dependent RNA polymerase (RdRp or replicase) is the most conserved protein in RNA viruses. Consequently, the high conservation of RdRp makes it a good phylogenetic marker for RNA viruses []. The palmdb is a curated database of viral polymerase palmprint sequences clustered into species-like OTUs at 90% amino acid identity, enabling standardized viral classification []. RdRp-scan constructs a comprehensive viral RdRp database by integrating sequences from palmdb and recent metagenomic studies []. Sequence redundancy is reduced via CD-HIT clustering. For each taxonomic cluster and unassigned group, multiple sequence alignments are generated using Clustal Omega, followed by manual curation. Hidden Markov Models are built from these alignments with HMMER3 under standard settings. Combining RdRp-specific HMMs and structural homology enables RdRp-scan to detect RdRp sequences sharing as little as 10% identity with known viruses. NeoRdRp2 [] is a tool that works similarly. Sequences are clustered using CD-HIT at a 99% identity threshold to remove redundancy. Clusters with more than three sequences are aligned using MAFFT, followed by gap-based splitting with a custom script to refine conserved regions. Comparisons of eight different RdRp search tools showed that NeoRdRp2 exhibited balanced RdRp and nonspecific detection power.
Overall, HMM-based tools excel at the identification of unknown viruses. Their limitations include reliance on pre-identified profiles or laborious setup when aiming to use the most up-to-date genomic data and targeting of only the most conserved genome regions. Key information, such as Methodology, Database source, Viral Specialization, Citation index (CI), and Limitations for HMM-based approaches, is presented in Table 2. It is noteworthy that many HMM-based tools are not highly cited and are limited to narrow applications.

3.3. Machine-Learning-Based Approach

One advantage of machine learning methods is their ability to identify patterns that are challenging for humans to detect. Consequently, these methods can identify viral sequences that may not be present in databases despite the existence of underlying patterns. In principle, an ML method consists of a sequence representation and an analysis algorithm, which can be used in many combinations (Figure 3).
Figure 3. Schematic representation of machine learning algorithms and data processing workflows for virus identification. Green and orange bars indicate the type of training data utilized (nucleotide and amino acid sequences, respectively). Each arrow originates from a machine learning method, traverses relevant data preprocessing steps, and points to the corresponding analytical tool, illustrating the flow from data preparation to application in virus identification.
Early approaches to detecting viral sequences in metagenomic data were based on traditional machine learning methods, which, despite their simplicity, laid the foundation for more advanced algorithms. One of the first such tools was VirFinder [], which uses logistic regression to classify sequences as viral or non-viral. VirFinder rests on two assumptions: that the k-mer (8-mer) distribution of a viral sequence correlates more strongly with that of its own host than with that of unrelated hosts, and that certain nucleotide combinations statistically differentiate viral from non-viral genome fragments. Beyond linear models, ensemble methods such as Random Forest (RF) have also been applied successfully to viral sequence detection, for example in MARVEL [], a tool developed specifically for identifying double-stranded DNA bacteriophages (dsDNA phages) of the order Caudovirales. RF was also selected as the primary classifier of ViraPipe [,], with codon usage (Relative Synonymous Codon Usage, RSCU) as the sequence feature. VirSorter2 constructs a comprehensive viral HMM database and uses it to derive 27 sequence-based features, which are then used to train random forest classifiers for identifying viruses across diverse taxonomic groups []. PhiSpy [] analyzes sliding windows of genes to compute features such as customized nucleotide skews, protein length differences, transcription strand consistency, and the presence of unique phage-specific sequences, and integrates homology information. Using a Random Forest classifier trained on related bacterial groups, it predicts prophage regions. VIBRANT [] leverages a rigorously curated, nonredundant dataset of genomic fragments encompassing bacteria, archaea, plasmids, and viruses. These fragments undergo uniform gene prediction and protein annotation via multiple hidden Markov model (HMM) profile databases, from which 27 informative annotation-derived metrics are extracted. Employing these features, VIBRANT uses a multilayer perceptron neural network to classify genomic fragments as viral or non-viral.
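As an illustration of one of the sequence features mentioned above, the following sketch computes Relative Synonymous Codon Usage (RSCU) for a coding sequence. It assumes the standard genetic code and the usual definition (observed codon count divided by the mean count of its synonymous family); the toy sequence is invented, and the sketch does not reproduce ViraPipe's feature pipeline.

```python
from collections import Counter, defaultdict
from itertools import product
from typing import Dict

# Standard genetic code in TCAG order (assumption: NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(cds: str) -> Dict[str, float]:
    """RSCU = observed codon count / mean count over its synonymous family."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # stop codons are excluded
            families[aa].append(codon)
    values: Dict[str, float] = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values

toy_cds = "GCTGCTGCTGCCAAAAAG"  # Ala x4 (GCT x3, GCC x1), Lys x2 (AAA, AAG)
values = rscu(toy_cds)
print(values["GCT"], values["GCC"], values["AAA"])
```

An RSCU of 1 means a codon is used as often as expected under uniform synonymous usage; values above or below 1 indicate preference or avoidance, and the resulting vector can serve as input to a classifier such as a random forest.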
Convolutional neural networks (CNNs) were originally developed for image processing, but have since found wide application in the analysis of text data, including nucleotide or amino acid sequences. VirFinder developers created DeepVirFinder [], a tool that uses a CNN trained on non-overlapping fragments of fixed lengths (150 bp, 300 bp, 500 bp, 1000 bp, and 3000 bp) of prokaryotic virus genomes and reference prokaryotic genomes. In comparative analyses, DeepVirFinder was shown to outperform VirFinder when identifying viral sequences in metagenomic data. PPR-Meta [] is based on a Bi-path convolutional neural network (BiPathCNN) architecture that processes two distinct input matrices: a BOH matrix (representing base information useful for non-coding regions) and a COH matrix (representing codon information useful for coding regions). The CHEER [] model is a hierarchical taxonomic classification framework designed for viral metagenomic reads, with a particular focus on RNA viruses. It organizes multiple CNN classifiers arranged in a tree structure from order to genus levels. Classification is performed top–down: first rejecting non-RNA viral reads, then classifying at the order level, followed by family-level classifiers within orders, and finally genus-level classifiers within families. Sequences are encoded via a k-mer embedding that captures k-mer co-occurrence and ordering information, improving performance. DeepMicroClass [] employs a CNN with two input paths: a base-path encoding one-hot nucleotide-level information (A, C, G, T), including reverse complements, and a codon-path encoding codon-level information. VirDetect-AI [] is a deep neural network based on the ResNet architecture combined with CNNs, designed for the classification of eukaryotic viral proteins. Amino acid sequences are preprocessed by fragmentation into overlapping k-mers and encoded using one-hot encoding into binary matrices.
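The nucleotide-level input encoding used by several of the CNN-based tools above can be sketched as follows. This is a generic illustration of one-hot encoding, reverse-complement handling, and non-overlapping fragmentation; the treatment of ambiguous bases (all-zero rows) and of the trailing remainder (dropped) are assumptions of this sketch rather than documented behavior of any particular tool.

```python
from typing import List

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement; characters outside ACGT (e.g. N) are kept as-is."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def one_hot(seq: str) -> List[List[int]]:
    """L x 4 binary matrix; ambiguous bases become all-zero rows."""
    return [[int(base == a) for a in ALPHABET] for base in seq.upper()]

def fragments(seq: str, length: int) -> List[str]:
    """Non-overlapping fixed-length fragments; a trailing remainder is dropped."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]

read = "ACGTNACGGT"
forward = one_hot(read)
backward = one_hot(reverse_complement(read))
print(reverse_complement(read), forward[0], forward[4])
```

Encoding both strands lets a network score a fragment independently of the orientation in which it was sequenced.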
Long short-term memory (LSTM) is a type of recurrent neural network (RNN) that can learn long-term dependencies thanks to the mechanism of “gates”. The main advantage of LSTM over other methods (such as CNN or Random Forest) is the ability to take into account the order of nucleotides/amino acids and identify complex, extended dependencies in the data. Seeker [] is an LSTM-based deep learning model trained for the identification of bacteriophage genomes. HVSeeker [] is a computational tool for sequence identification and classification that offers two LSTM-based models, DNA and protein-based. Similarly, Virtifier [] uses LSTM to predict viral sequences using the Seq2Vec encoding method to represent the data.
DETIRE [] employs a two-stage deep learning framework for virus identification from metagenomic data. Initially, a graph convolutional network (GCN) learns meaningful 30-dimensional embeddings of 3-mer nucleotide fragments. These embeddings serve as inputs to a hybrid classifier that integrates CNN to capture spatial features and bidirectional LSTM (BiLSTM) networks to model sequential dependencies. The combined features are jointly learned and classified via fully connected dense layers and a softmax output, enabling accurate discrimination between viral and non-viral sequences from fragmented metagenomic reads.
PhaGCN [] employs a semi-supervised graph convolutional network framework for bacteriophage identification.
The tool geNomad [] implements a hybrid approach for the identification of viruses, plasmids, and proviruses in metagenomic data by combining two classifiers: a sequence-based classifier and a gene-based classifier. The gene-based classifier relies on the XGBoost algorithm, where each amino acid sequence is represented by a vector of 25 features that serve as input to the classifier. The sequence-based classifier applies a two-step model. First, a neural network encoder based on the IGLOO architecture transforms nucleotide sequences into fixed-length vector representations []. The encoder is designed under the assumption that sequences belonging to the same class should have similar vector embeddings (i.e., cluster closely in feature space), while sequences from different classes should be well-separated. Second, the resulting vector is fed into a fully connected (dense) neural network that performs the classification. The outputs of the two independent classifiers (gene-based and sequence-based) are integrated using a feedforward neural network to produce the final decision regarding the class membership of the analyzed sequence.
VirNucPro [] employs dual embeddings: DNABERT_S for nucleotide sequences and ESM2-3B for amino acid sequences, producing contextual feature vectors.
GCNFrame [] employs a novel representation of genomic sequences as gapped pattern graphs (GPGs), which encode both contiguous and non-contiguous k-mer patterns to capture genetic variations, including single-nucleotide polymorphisms (SNPs) and insertions/deletions. These graph-structured data are processed using a graph convolutional network based on the GraphSAGE framework, which learns high-quality, low-dimensional embeddings that encapsulate complex sequence features and variation patterns. Subsequently, these embeddings serve as inputs to fully connected MLPs for accurate classification of genomic attributes such as bacteriophage identification, lifestyle characterization, and host range prediction.
VirRep [] is a hybrid language representation learning framework for identifying viruses in human gut metagenomic data. It combines a BERT-like semantic encoder that captures k-mer patterns with a BiLSTM alignment encoder that encodes sequence similarity to prokaryotic genomes. The tool processes sequences by segmenting them into 1 kb fragments, tokenizing into 7-mers, and generating probability scores through a siamese neural network analyzing both strands. VirRep uses a three-stage training approach: pre-training both encoders separately, fine-tuning each on specific tasks, then combining them with a binary classifier.
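The preprocessing described above (1 kb segmentation followed by 7-mer tokenization) can be sketched as below. The stride of 1 between k-mers, the handling of the shorter final fragment, and the on-the-fly vocabulary are assumptions of this illustration and may differ from VirRep's own code.

```python
from typing import Dict, List

def segment(seq: str, size: int = 1000) -> List[str]:
    """Split a contig into consecutive fragments; the last may be shorter."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def tokenize(fragment: str, k: int = 7) -> List[str]:
    """Overlapping k-mers with stride 1 (empty list if the fragment is shorter than k)."""
    return [fragment[i:i + k] for i in range(len(fragment) - k + 1)]

def to_ids(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to integer ids, growing the vocabulary on first sight."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]

contig = "ACGT" * 625                    # toy 2500 bp contig
frags = segment(contig)
tokens = tokenize(frags[0])
vocab: Dict[str, int] = {}
ids = to_ids(tokens, vocab)
print([len(f) for f in frags], len(tokens), ids[:5])
```

The integer ids are what a BERT-like encoder would consume through its embedding layer; a real model would use a fixed vocabulary of all possible 7-mers rather than one grown from the data.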
ViraLM [] implements a foundation model-based approach for identifying viral sequences in metagenomic data by leveraging the pre-trained genome foundation model DNABERT-2 with masked language modeling (MLM). The architecture consists of a transformer attention block initialized with DNABERT-2’s parameters, followed by a binary classification layer for virus identification.
All the algorithms described above were trained either on reference genomes or amino acid sequences. Typically, protein data are used to complement the training process, and only one tool performs complete training solely on proteins. Before training, a data preprocessing stage is applied. During this stage, sequences may be segmented into k-mers or fragmented using various approaches such as non-overlapping sequences of different lengths, padding, or sliding window techniques. These fragments can then undergo different encoding processes to convert them into numerical representations suitable for machine learning models. Training can be performed directly on raw reads, which may be assembled and aligned against reference databases to improve classification accuracy. The scheme of training and preprocessing of algorithms is presented in Figure 3.
Key information, such as Methodology, Database source, Viral Specialization, Citation index (CI), and Limitations for ML-based approaches, is presented in Table 3. It is noteworthy that most of the ML-based tools have been developed for the detection of dsDNA viruses, primarily phages. Their general utility for RNA virus identification may be questioned because RNA viruses exhibit remarkable variation in genome composition, even at the family level, whereas most ML-based methods were designed to handle nucleic acid sequences. Also, the complexity of ML-based virus identification tools is rapidly increasing; however, there is limited evidence that this leads to a significant prediction quality gain. In addition, citation values indicate that many tools were not integrated into routine virus discovery.
Table 3. ML-based tools for virus detection.

3.4. K-Mer-Based Approach

K-Mer-based approaches have gained significant popularity due to their high computational speed. One of the most well-known tools in this category is Kraken2 []. At the core of the original Kraken lies a database of records, each composed of a k-mer (31-mers by default) and the lowest common ancestor (LCA) of all organisms whose genomes include that k-mer; Kraken2 retains this k-mer-to-LCA design in a more memory-efficient form. This design introduces a key limitation: Kraken2 can only identify viruses present in its reference database. As expected, k-mer-based methods often exhibit lower sensitivity and specificity than full sequence alignment methods when identifying species in complex metagenomic samples, but they are substantially faster []. DisCVR [] uses a k-mer approach in which the sample reads are decomposed into k-mers and then matched against a database of virus k-mers. KAnalyze [] was chosen for integration into DisCVR because the k-mers it generates are sorted lexicographically, which makes the search for matches very efficient.
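The k-mer-to-LCA lookup described above can be illustrated with a small sketch. The taxonomy, genomes, and k = 5 are toy assumptions, and the classification rule (highest summed hit count along a root-to-taxon path, ties broken toward the deeper taxon) is a simplification of the Kraken scheme rather than its actual implementation.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child -> parent ("root" has no parent).
PARENT = {"root": None, "Viruses": "root", "FamilyA": "Viruses",
          "virus1": "FamilyA", "virus2": "FamilyA"}

def lineage(taxon: Optional[str]) -> List[str]:
    """Path from a taxon up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path

def lca(a: str, b: str) -> str:
    """Lowest common ancestor of two taxa in the toy tree."""
    ancestors = set(lineage(a))
    return next(t for t in lineage(b) if t in ancestors)

def kmers(seq: str, k: int) -> List[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_db(genomes: Dict[str, str], k: int) -> Dict[str, str]:
    """Map each k-mer to the LCA of all taxa whose genome contains it."""
    db: Dict[str, str] = {}
    for taxon, seq in genomes.items():
        for km in set(kmers(seq, k)):
            db[km] = lca(db[km], taxon) if km in db else taxon
    return db

def classify(read: str, db: Dict[str, str], k: int) -> Optional[str]:
    """Pick the hit taxon with the best root-to-taxon path score."""
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    if not hits:
        return None  # no k-mer in the database: read stays unclassified
    def path_score(t: str) -> int:
        return sum(hits[a] for a in lineage(t))
    return max(hits, key=lambda t: (path_score(t), len(lineage(t))))

genomes = {"virus1": "ACGTACGGTTCA", "virus2": "ACGTACCTTGAA"}
db = build_db(genomes, k=5)
print(classify("GTACGGTTC", db, 5), classify("ACGTAC", db, 5), classify("TTTTTTT", db, 5))
```

The third read shares no k-mer with the database and is left unclassified, which is the limitation noted above: a virus absent from the reference set cannot be identified by exact k-mer matching.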
Another widely used tool for taxonomic classification is CLARK []. Unlike standard approaches, CLARK constructs its database exclusively from discriminative k-mers unique to each taxon, improving classification accuracy. However, traditional frequency-based k-mer representations do not capture the order of k-mers, potentially losing important structural information about sequences. To address this, recent studies have adopted techniques from natural language processing, such as Skip–Gram models and other embedding methods, which consider not only k-mer composition but also contextual co-occurrence. These models learn which k-mers tend to appear together and map them into vector spaces such that k-mers occurring in similar contexts have similar embeddings, as demonstrated in tools like CHEER. K-mers are also used to speed up alignment, as implemented in MashMap. Thus, the k-mer-based approach may serve as a stand-alone tool or be integrated into other programs to optimize their operation.
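The training signal behind Skip-Gram-style k-mer embeddings can be sketched as the generation of (center, context) pairs from a tokenized sequence. The k-mer size, window size, and toy sequence are illustrative assumptions; the embedding model that would consume these pairs is not shown.

```python
from typing import List, Tuple

def kmer_tokens(seq: str, k: int) -> List[str]:
    """Overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """(center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = kmer_tokens("ACGTAC", k=3)
pairs = skipgram_pairs(tokens, window=1)
print(tokens)
print(pairs)
```

A Skip-Gram model is trained to predict the context k-mer from the center k-mer, so k-mers that occur in similar neighborhoods end up with similar vectors. This is the order information that plain frequency counts discard.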
In metagenomic datasets, viral reads often constitute only a small fraction of total sequenced data, making their detection and analysis methodologically challenging. Genome assembly, a computationally intensive step, can be partially circumvented by pre-filtering reads likely to be viral before assembly and alignment stages. This pre-filtering reduces computational load and improves downstream analysis []. Moreover, k-mer-based filtering can complement assembly-based and homology-based methods, enhancing viral sequence identification as demonstrated by the AliMarko pipeline [].
VISTA [] employs an integrative approach combining pairwise sequence comparisons, k-mer profiling, and machine learning to achieve robust viral identification. The method begins with the extraction of k-mer profiles that encode both sequence composition and positional information derived from physicochemical properties of protein translations. Feature selection via chi-squared statistics and extremely randomized trees (ERT) identifies the most discriminative k-mer signatures for taxonomic classification. The optimal subset of features extracted from the k-mer profiles is then used to calculate pairwise distances among all genome sequences. These distances are normalized and analyzed using Gaussian kernel density estimation to identify natural thresholds corresponding to taxonomic ranks (species, genus, family). Hierarchical clustering guided by these thresholds is evaluated against a known taxonomy to select optimal cutoffs. For an unknown viral genome, VISTA constructs its k-mer profile and computes distances to reference sequences. By comparing the minimum distance to the established taxonomic thresholds, the genome is assigned to an existing species if the distance is below the species cutoff, to a new species within a genus if it is below the genus cutoff, or to a different genus or family if it is above these thresholds.
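The final assignment step described above (minimum distance compared against rank-specific cutoffs) can be sketched as follows. This is a deliberately simplified stand-in: it uses plain nucleotide 3-mer counts and cosine distance instead of VISTA's selected protein-derived features, and the cutoff values and sequences are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Tuple

def profile(seq: str, k: int = 3) -> Counter:
    """k-mer count profile of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_distance(p: Counter, q: Counter) -> float:
    dot = sum(p[x] * q[x] for x in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0

def assign(query: str, references: Dict[str, str],
           species_cut: float, genus_cut: float) -> Tuple[str, float, str]:
    """Nearest reference, its distance, and the rank decision for the query."""
    qp = profile(query)
    name, dist = min(((n, cosine_distance(qp, profile(s))) for n, s in references.items()),
                     key=lambda item: item[1])
    if dist < species_cut:
        label = "existing species"
    elif dist < genus_cut:
        label = "new species within genus"
    else:
        label = "different genus or family"
    return name, dist, label

# Invented reference sequences and hypothetical cutoffs, for illustration only.
refs = {"ref_A": "ACGTACGTTGCAACGT", "ref_B": "AAAAAAAA"}
same = assign("ACGTACGTTGCAACGT", refs, species_cut=0.05, genus_cut=0.30)
far = assign("CCCCCCCC", {"ref_B": "AAAAAAAA"}, species_cut=0.05, genus_cut=0.30)
print(same[0], same[2])
print(far[0], far[2])
```

In the workflow described in the text, the cutoffs come from the density analysis of reference distances rather than being fixed by hand as they are here.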
Key information, such as Methodology, Database source, Viral specialization, Citation index (CI), and Limitations for k-mer-based approaches, is presented in Table 4.
Table 4. K-Mer-based tools for virus detection.
We summarized key information (n = 19 parameters) for all 54 viral detection tools based on their original publications in Table S1. The summary included details such as methodology, database source, input type, advantages, limitations, etc.

4. Discussion

Viruses comprise a highly diverse community. Many existing tools were trained on specific databases representing only particular groups, such as viruses infecting prokaryotes, eukaryotes, or RNA viruses. In this systematic review, we present a selection of tools that cover a broad spectrum of viruses, including bacteriophages, eukaryotic viruses, large DNA viruses, and others. Tools that focus exclusively on a single virus genus were excluded. When designing a study, it is crucial to carefully choose the appropriate tool based on the viral group under investigation.
Approaches for virus discovery vary widely, ranging from classical alignment-based methods to advanced alignment-free machine learning models. The complementary strengths of these methods suggest that optimal virus identification strategies should integrate multiple approaches. Alignment-based methods remain the gold standard for well-characterized viruses with close database homologs. HMM-based tools excel at detecting conserved protein domains across divergent sequences. K-Mer methods provide rapid initial screening and taxonomic classification of known viruses. Machine learning approaches may fill critical gaps by identifying novel viruses, handling fragmented sequences, and learning complex patterns invisible to traditional methods. As metagenomic datasets grow exponentially, hybrid pipelines combining k-mer-based pre-filtering with ML classification and targeted alignment verification will likely become standard practice for comprehensive virome characterization.
We also observe a growing trend toward hybrid methods that combine multiple strategies, with machine learning integrated with k-mer analysis becoming particularly prominent. These modern tools have evolved beyond simple frequency counting by incorporating sophisticated machine learning techniques to address the limitations of traditional k-mer approaches. While conventional methods like Kraken2 perform direct database matching of individual k-mers, newer hybrid tools such as CHEER and VirRep apply natural language processing techniques, including Skip–Gram models and BERT-based encoders, to learn k-mer co-occurrence patterns and contextual relationships. By treating DNA sequences analogously to text, where k-mers function as “words” with meaningful relationships, these methods capture not just k-mer presence but their typical co-occurrence patterns in viral versus non-viral sequences. Similarly, tools like geNomad combine machine learning with homology searches, integrating sequence-based neural network classifiers with gene-based XGBoost models to leverage both pattern recognition and similarity-based detection. This evolution toward hybrid approaches creates a new generation of tools that maintain the computational efficiency of traditional methods while incorporating sophisticated pattern recognition capabilities, effectively bridging the gap between fast but limited conventional approaches and computationally intensive pure deep learning models.
The Global Virome Project (GVP), which endeavors to characterize an estimated 1.67 million previously undescribed viral species with approximately 631,000–827,000 possessing zoonotic potential, has fundamentally underscored the imperative for sophisticated and reliable bioinformatics methodologies in viral taxonomic annotation []. Nevertheless, the computational landscape for viral classification is characterized by the continuous emergence of novel methodologies, complicating the determination of optimal approaches without standardized benchmarking protocols and evaluation frameworks []. This deficiency in systematic comparative assessment precludes definitive determination of tool performance characteristics and consequently impedes evidence-based methodology selection for viral detection applications.
While the present review does not aim to provide benchmarking assessments of the described tools, analysis of recent large-scale virome initiatives offers valuable insights into which computational approaches have demonstrated scalability and practical utility in extensive viral discovery efforts. However, choosing optimal pipelines depends on particular research tasks, as different studies require distinct analytical strategies. To illustrate this, we selected several representative studies that demonstrate remarkable diversity in analytical approaches. For instance, Zhang et al. (2025) developed a comprehensive workflow combining Kraken2 for host read filtering and Diamond [] BlastX/BlastN searches against nr/nt databases, followed by MMseqs2 [] clustering to identify viral operational taxonomic units from 1113 small mammals []. Similarly, Guo et al. (2022) constructed a specialized detection framework using custom databases of human, archaeal, bacterial, and vector sequences alongside NCBI viral references, employing Kraken2 for initial classification and BLAST for species-level annotation with stringent filtering criteria to analyze blood virome data from over 10,000 individuals []. Several large-scale gut virome studies have incorporated machine learning-based viral identification tools into their analytical pipelines. Nayfach et al. (2021) [] utilized VirFinder’s k-mer frequency machine learning models to identify 189,680 DNA viral genomes from human gut metagenomes, while Zeng et al. (2024) [] combined VirFinder with VIBRANT’s hybrid ML approach and VirSorter2’s automated classifiers to catalog 160,478 viral sequences from early-life gut samples. Similarly, Nishijima et al. (2022) [] employed DeepVirFinder for viral-specific k-mer pattern detection in their analysis of 4198 Japanese individuals, and recent studies like Yan et al. (2025) [] integrated DeepVirFinder with VIBRANT to construct the Chinese Gut Virus Catalog containing 426,496 viral sequences, while Galperina et al. (2025) [] used DeepVirFinder and VirSorter2 alongside geNomad and PhaGCN to create the Aggregated Gut Viral Catalog with over 1 million dereplicated viral sequences.
A challenge in the identification of virus sequences, especially those of RNA viruses, comes from high genetic variability. Most RNA viruses accumulate about 10⁻³ substitutions/site/year. This makes nucleotide sequence-based approaches poorly suited to the identification of even moderately related viruses. Moreover, above the family level, only a few proteins (usually polymerase and protease) may be recognized by the most sensitive methods (usually HMM-based), while most of the genome cannot be identified at all. This challenge calls for integrated identification-assembly methods that first identify anchor sequences by HMM screening and then work over raw sequencing data to specifically extend virus contigs []. This approach allowed the identification of multiple novel viruses in publicly available read archives [].
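A back-of-the-envelope calculation shows why such a rate quickly erodes nucleotide similarity. The sketch below assumes a constant rate of 10⁻³ substitutions/site/year, two lineages diverging independently from a common ancestor, and the Jukes-Cantor (JC69) model to convert expected substitutions into the observed fraction of differing sites. These are simplifying assumptions for illustration; real long-term rates vary among sites and over time.

```python
import math

def expected_distance(rate: float, years: float) -> float:
    """Expected substitutions per site separating two lineages that split `years` ago."""
    return 2.0 * rate * years  # both lineages accumulate changes

def jc69_observed_difference(distance: float) -> float:
    """Observed fraction of differing sites under JC69 (saturates at 0.75)."""
    return 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))

RATE = 1e-3  # substitutions/site/year, as quoted in the text
table = {t: jc69_observed_difference(expected_distance(RATE, t)) for t in (10, 100, 1000)}
for years, diff in table.items():
    print(f"{years:>5} years: about {diff:.1%} of sites differ")
```

Under these assumptions the observed difference approaches the 75% saturation level, at which two nucleotide sequences are no more similar than random ones, within roughly a thousand years of divergence. Protein-level and profile-based comparisons retain signal for longer.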
The field of viral taxonomy is rapidly evolving, driven by ongoing discoveries and the integration of novel genomic data. Maintaining up-to-date reference databases is critical for the development and accuracy of virus classification models. For instance, the ICTV reported a substantial increase in the number of recognized virus species, from 9110 species in 2020 to 16,215 species by 2024 []. Without regular updates to the underlying training databases, computational classification tools risk becoming outdated, potentially leading to misclassification or reduced sensitivity in recognizing emerging viral diversity. Therefore, ongoing curation and timely incorporation of the latest taxonomic releases are essential to sustain the performance and relevance of predictive models in viral genomics.
A significant limitation encountered in some studies is the lack of publicly accessible source code. Without open and reproducible code, the scientific community faces challenges in independently evaluating, validating, and benchmarking these computational methods, especially as they were trained at different times and on different datasets. This lack of transparency potentially undermines confidence in the tools’ predictions and hampers their integration into broader workflows. The issue reflects a wider reproducibility crisis in computational virology, where many published tools cannot be reliably replicated or compared due to unavailable code and undocumented parameters. Strengthening reproducibility is essential to ensure scientific credibility and long-term usability of these resources. To advance computational viral taxonomy and ensure community trust, future contributions must therefore prioritize transparency by providing fully accessible, well-documented code and a reproducible workflow that allows re-training. Such practices will facilitate rigorous peer assessment, foster methodological improvements, and enable the field to keep pace with the rapidly expanding virosphere.
The most popular database chosen for training and annotation is NCBI viral RefSeq, a comprehensive resource comprising over 6600 virus species with high-quality genome annotations as of July 2024 []. It integrates closely with the ICTV by adopting exemplar isolates designated as reference representatives for viral species, ensuring taxonomic consistency.
This work has systematically reviewed the most advanced tools in virus detection, highlighting the fundamental role that alignment, k-mer, profile HMMs, and machine learning approaches play in virology bioinformatics. Our systematic review has some limitations. The initial search was conducted only in PubMed. Although we applied the snowballing principle, certain studies may not have been captured. Furthermore, the search query itself may have limited the scope of the search. Finally, our review included only tools that had been published up to September 2025.

5. Summary

The choice of virus identification method depends on the viruses under study, because the level of genome conservation differs among virus families and across taxonomic levels. The goal also defines the means, be it fast screening for known viruses in surveillance applications or deep exploration of the “dark matter” of sequencing experiments.
Alignment-based methods are optimal for well-characterized viruses requiring high-confidence identification, clinical diagnostics, and precise strain-level taxonomic assignment in small-scale analyses. They excel at validating novel virus claims and distinguishing closely related strains, but may not be the primary choice for large metagenomic datasets, novel virus discovery, and highly divergent sequences. The fundamental limitation is their inability to detect viruses without similar references in the database, restricting their utility to confirmation rather than discovery.
Rapid similarity estimation tools enable high-throughput classification of millions of viral genomes for ANI-based taxonomic assignment, outbreak surveillance, and building viral operational taxonomic units following ICTV and MIUViG standards. While excelling at dereplication and rapid genotype assignment in time-critical scenarios, they should be avoided for novel virus discovery, sequences with less than 70% ANI to references, and protein-level analysis.
Profile HMM approaches are particularly valuable for novel virus discovery through conserved protein domain detection, identifying divergent viral sequences, and specialized tasks such as prophage detection or RNA virus identification. They are especially useful when whole-genome similarity is too low for alignment methods, but should be avoided for very short contigs with high false positive risk, extremely large datasets with limited computational resources, and when database coverage is sparse.
K-Mer-based methods provide ultra-fast preliminary screening of massive metagenomic datasets and abundance estimation of known viruses when speed is paramount and computational resources are limited. However, they should be avoided for novel virus discovery, precise strain-level differentiation, and divergent sequences, as they are restricted to exact or near-exact k-mer matches against database content.
Machine learning methods enable database-independent novel virus discovery by integrating multiple genomic features to identify intrinsic viral sequence patterns, making them ideal for fragmented metagenomic assemblies and under-sampled viral diversity. They should be avoided for taxonomic assignment and in contexts requiring interpretable results, as their black box nature limits explainability.
The most effective approach involves layering methods strategically: fast screening with k-mer or machine learning methods, domain validation with profile HMMs (possibly combined with re-assembly or contig extension), precise characterization with alignment-based or rapid similarity methods, and quality assessment with combination pipelines. Tool selection fundamentally depends on whether one is working with known viruses, which favors database-dependent methods, or pursuing novel discovery, which requires database-independent approaches.
Using the information from the original publications, we created a roadmap to guide tool selection according to research goals and virus types (Figure 4).
Figure 4. Schematic guide for selecting viral analysis tools according to specific research goals and virus types. Each tool name is colored according to its underlying methodological approach.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/v17121538/s1, Table S1: Tools for taxonomic virus annotation. A dash (-) indicates the absence of a reference database, while a question mark (?) denotes that the specific number or detail is not provided in the publication.

Author Contributions

J.G. and P.K. contributed equally to this work. They jointly conducted the literature search and wrote the manuscript. A.M. and A.L. participated in the review process. E.I. supervised the group and managed funding. All authors have read and agreed to the published version of the manuscript.

Funding

This work was carried out with financial support from the Russian Ministry of Education and Science. Agreement No. 075-15-2025-530.

Data Availability Statement

The data presented in this study is available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VLPs: Virus-Like Particles
ICTV: International Committee on Taxonomy of Viruses
HMMs: Hidden Markov Models
LLM: Large Language Model
LCA: Lowest Common Ancestor
MSPs: Maximal Segment Pairs
ORFs: Open Reading Frames
GBDP: Genome BLAST Distance Phylogeny
MCL: Markov Cluster Algorithm
MLCA: Maximum Likelihood Clade Assignment
PC: Protein Clusters
MSA: Multiple Sequence Alignment
PSSM: Position-Specific Scoring Matrix
FFT: Fast Fourier Transform
EPA: Evolutionary Placement Algorithm
ANI: Average Nucleotide Identity
NCLDVs: Nucleo-Cytoplasmic Large DNA Viruses
GVOGs: Giant Virus Orthologous Groups
RdRp: RNA-dependent RNA polymerase
dsDNA phages: Double-Stranded DNA bacteriophages
FFNN: Feed-Forward Neural Network
RSCU: Relative Synonymous Codon Usage
CNNs: Convolutional Neural Networks
BiPathCNN: Bi-Path Convolutional Neural Network
LSTM: Long Short-Term Memory
RNN: Recurrent Neural Network
GCN: Graph Convolutional Network
BiLSTM: Bidirectional LSTM
MLPs: Multilayer Perceptrons
GPGs: Gapped Pattern Graphs
SNPs: Single-Nucleotide Polymorphisms
MLM: Masked Language Modeling
ERT: Extremely Randomized Trees
GVP: Global Virome Project

References

  1. Mushegian, A.R. Are There 10³¹ Virus Particles on Earth, or More, or Fewer? J. Bacteriol. 2020, 202, e00052-20.
  2. Johnstone, C.; Salles, S.; Mercado, J.M.; Cortés, D.; Yebra, L.; Gómez-Jakobsen, F.; Sánchez, A.; Alonso, A.; Valcárcel-Pérez, N. Abundance of Virus-like Particles (VLPs) and Microbial Plankton Community Composition in a Mediterranean Sea Coastal Area. Aquat. Microb. Ecol. 2018, 81, 137–148.
  3. Cornell, C.R.; Zhang, Y.; Van Nostrand, J.D.; Wagle, P.; Xiao, X.; Zhou, J. Temporal Changes of Virus-Like Particle Abundance and Metagenomic Comparison of Viral Communities in Cropland and Prairie Soils. mSphere 2021, 6, e0116020.
  4. Takada, K.; Holmes, E.C. Genome Sizes of Animal RNA Viruses Reflect Phylogenetic Constraints. Virus Evol. 2025, 11, veaf005.
  5. Ain, Q.u.; Wu, K.; Wu, X.; Bai, Q.; Li, Q.; Zhou, C.-Z.; Wu, Q. Cyanophage-Encoded Auxiliary Metabolic Genes in Modulating Cyanobacterial Metabolism and Algal Bloom Dynamics. Front. Virol. 2024, 4, 1461375.
  6. Camargo, A.P.; Roux, S.; Schulz, F.; Babinski, M.; Xu, Y.; Hu, B.; Chain, P.S.G.; Nayfach, S.; Kyrpides, N.C. Identification of Mobile Genetic Elements with geNomad. Nat. Biotechnol. 2024, 42, 1303–1312.
  7. Sanjuán, R.; Nebot, M.R.; Chirico, N.; Mansky, L.M.; Belshaw, R. Viral Mutation Rates. J. Virol. 2010, 84, 9733–9748.
  8. Irwin, N.A.T.; Pittis, A.A.; Richards, T.A.; Keeling, P.J. Systematic Evaluation of Horizontal Gene Transfer between Eukaryotes and Viruses. Nat. Microbiol. 2022, 7, 327–336.
  9. Koonin, E.V.; Dolja, V.V.; Krupovic, M. Origins and Evolution of Viruses of Eukaryotes: The Ultimate Modularity. Virology 2015, 479–480, 2–25.
  10. ICTV. Available online: https://ictv.global/ (accessed on 19 November 2025).
  11. Santiago-Rodriguez, T.M.; Hollister, E.B. Unraveling the Viral Dark Matter through Viral Metagenomics. Front. Immunol. 2022, 13, 1005107.
  12. Krishnamurthy, S.R.; Wang, D. Origins and Challenges of Viral Dark Matter. Virus Res. 2017, 239, 136–142.
  13. Fouts, D.E. Phage_Finder: Automated Identification and Classification of Prophage Regions in Complete Bacterial Genome Sequences. Nucleic Acids Res. 2006, 34, 5839–5851.
  14. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic Local Alignment Search Tool. J. Mol. Biol. 1990, 215, 403–410. [Google Scholar] [CrossRef]
  15. Roux, S.; Enault, F.; Hurwitz, B.L.; Sullivan, M.B. VirSorter: Mining Viral Signal from Microbial Genomic Data. PeerJ 2015, 3, e985. [Google Scholar] [CrossRef]
  16. Ren, J.; Ahlgren, N.A.; Lu, Y.Y.; Fuhrman, J.A.; Sun, F. VirFinder: A Novel K-Mer Based Tool for Identifying Viral Sequences from Assembled Metagenomic Data. Microbiome 2017, 5, 69. [Google Scholar] [CrossRef]
  17. Ren, J.; Song, K.; Deng, C.; Ahlgren, N.A.; Fuhrman, J.A.; Li, Y.; Xie, X.; Poplin, R.; Sun, F. Identifying Viruses from Metagenomic Data Using Deep Learning. Quant. Biol. 2020, 8, 64–77. [Google Scholar] [CrossRef] [PubMed]
  18. Kieft, K.; Zhou, Z.; Anantharaman, K. VIBRANT: Automated Recovery, Annotation and Curation of Microbial Viruses, and Evaluation of Viral Community Function from Genomic Sequences. Microbiome 2020, 8, 90. [Google Scholar] [CrossRef]
  19. Peng, C.; Shang, J.; Guan, J.; Wang, D.; Sun, Y. ViraLM: Empowering Virus Discovery through the Genome Foundation Model. Bioinformatics 2024, 40, btae704. [Google Scholar] [CrossRef]
  20. Home. Available online: https://pubmed.ncbi.nlm.nih.gov (accessed on 21 November 2025).
  21. OSF. Available online: https://osf.io (accessed on 21 November 2025).
  22. Christensen, H.; Olsen, J.E. Pairwise Alignment, Multiple Alignment, and BLAST. In Introduction to Bioinformatics in Microbiology; Springer International Publishing: Cham, Switzerland, 2018; pp. 51–79. ISBN 9783319992792. [Google Scholar]
  23. Pearson, W.R. An Introduction to Sequence Similarity (“homology”) Searching. Curr. Protoc. Bioinform. 2013, 42, 3.1.1–3.1.8. [Google Scholar] [CrossRef]
  24. Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Res. 1997, 25, 3389–3402. [Google Scholar] [CrossRef] [PubMed]
  25. Zerbini, F.M.; Crane, A.; Kuhn, J.H.; Simmonds, P.; Lefkowitz, E.J.; ICTV Taxonomy Summary Consortium. Summary of Taxonomy Changes Ratified by the International Committee on Taxonomy of Viruses (ICTV)—General Taxonomy Proposals, 2025. J. Gen. Virol. 2025, 106, 002116. [Google Scholar] [CrossRef]
  26. Home. Available online: https://ictv.global/news/taxablast (accessed on 20 November 2025).
  27. NCBI Virus. Available online: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/ (accessed on 19 November 2025).
  28. Morgulis, A.; Coulouris, G.; Raytselis, Y.; Madden, T.L.; Agarwala, R.; Schäffer, A.A. Database Indexing for Production MegaBLAST Searches. Bioinformatics 2008, 24, 1757–1764. [Google Scholar] [CrossRef]
  29. Jurtz, V.I.; Villarroel, J.; Lund, O.; Voldby Larsen, M.; Nielsen, M. MetaPhinder-Identifying Bacteriophage Sequences in Metagenomic Data Sets. PLoS ONE 2016, 11, e0163111. [Google Scholar] [CrossRef]
  30. Bao, Y.; Chetvernin, V.; Tatusova, T. PAirwise Sequence Comparison (PASC) and Its Application in the Classification of Filoviruses. Viruses 2012, 4, 1318–1327. [Google Scholar] [CrossRef]
  31. Muhire, B.M.; Roumagnac, P.; Varsani, A.; Martin, D.P. Sequence Demarcation Tool (SDT), a Free User-Friendly Computer Program Using Pairwise Genetic Identity Calculations to Classify Nucleotide or Amino Acid Sequences. Methods Mol. Biol. 2025, 2912, 71–79. [Google Scholar] [CrossRef]
  32. Meier-Kolthoff, J.P.; Göker, M. VICTOR: Genome-Based Phylogeny and Classification of Prokaryotic Viruses. Bioinformatics 2017, 33, 3396–3404. [Google Scholar] [CrossRef]
  33. Moraru, C.; Varsani, A.; Kropinski, A.M. VIRIDIC-A Novel Tool to Calculate the Intergenomic Similarities of Prokaryote-Infecting Viruses. Viruses 2020, 12, 1268. [Google Scholar] [CrossRef]
  34. Bin Jang, H.; Bolduc, B.; Zablocki, O.; Kuhn, J.H.; Roux, S.; Adriaenssens, E.M.; Brister, J.R.; Kropinski, A.M.; Krupovic, M.; Lavigne, R.; et al. Taxonomic Assignment of Uncultivated Prokaryotic Virus Genomes Is Enabled by Gene-Sharing Networks. Nat. Biotechnol. 2019, 37, 632–639. [Google Scholar] [CrossRef] [PubMed]
  35. Bolduc, B.; Jang, H.B.; Doulcier, G.; You, Z.-Q.; Roux, S.; Sullivan, M.B. vConTACT: An iVirus Tool to Classify Double-Stranded DNA Viruses That Infect Archaea and Bacteria. PeerJ 2017, 5, e3243. [Google Scholar] [CrossRef] [PubMed]
  36. Katoh, K.; Misawa, K.; Kuma, K.-I.; Miyata, T. MAFFT: A Novel Method for Rapid Multiple Sequence Alignment Based on Fast Fourier Transform. Nucleic Acids Res. 2002, 30, 3059–3066. [Google Scholar] [CrossRef] [PubMed]
  37. Katoh, K.; Rozewicki, J.; Yamada, K.D. MAFFT Online Service: Multiple Sequence Alignment, Interactive Sequence Choice and Visualization. Brief. Bioinform. 2019, 20, 1160–1166. [Google Scholar] [CrossRef]
  38. Edgar, R.C. MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput. Nucleic Acids Res. 2004, 32, 1792–1797. [Google Scholar] [CrossRef]
  39. Larkin, M.A.; Blackshields, G.; Brown, N.P.; Chenna, R.; McGettigan, P.A.; McWilliam, H.; Valentin, F.; Wallace, I.M.; Wilm, A.; Lopez, R.; et al. Clustal W and Clustal X Version 2.0. Bioinformatics 2007, 23, 2947–2948. [Google Scholar] [CrossRef]
  40. Sievers, F.; Wilm, A.; Dineen, D.; Gibson, T.J.; Karplus, K.; Li, W.; Lopez, R.; McWilliam, H.; Remmert, M.; Söding, J.; et al. Fast, Scalable Generation of High-Quality Protein Multiple Sequence Alignments Using Clustal Omega. Mol. Syst. Biol. 2011, 7, 539. [Google Scholar] [CrossRef] [PubMed]
  41. Sievers, F.; Higgins, D.G. Clustal Omega for Making Accurate Alignments of Many Protein Sequences. Protein Sci. 2018, 27, 135–145. [Google Scholar] [CrossRef]
  42. Libin, P.J.K.; Deforche, K.; Abecasis, A.B.; Theys, K. VIRULIGN: Fast Codon-Correct Alignment and Annotation of Viral Genomes. Bioinformatics 2019, 35, 1763–1765. [Google Scholar] [CrossRef]
  43. Moshiri, N. ViralMSA: Massively Scalable Reference-Guided Multiple Sequence Alignment of Viral Genomes. Bioinformatics 2021, 37, 714–716. [Google Scholar] [CrossRef]
  44. Dobin, A.; Davis, C.A.; Schlesinger, F.; Drenkow, J.; Zaleski, C.; Jha, S.; Batut, P.; Chaisson, M.; Gingeras, T.R. STAR: Ultrafast Universal RNA-Seq Aligner. Bioinformatics 2013, 29, 15–21. [Google Scholar] [CrossRef] [PubMed]
  45. Langmead, B.; Salzberg, S.L. Fast Gapped-Read Alignment with Bowtie 2. Nat. Methods 2012, 9, 357–359. [Google Scholar] [CrossRef] [PubMed]
  46. Kim, D.; Paggi, J.M.; Park, C.; Bennett, C.; Salzberg, S.L. Graph-Based Genome Alignment and Genotyping with HISAT2 and HISAT-Genotype. Nat. Biotechnol. 2019, 37, 907–915. [Google Scholar] [CrossRef]
  47. Ranwez, V.; Harispe, S.; Delsuc, F.; Douzery, E.J.P. MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons. PLoS ONE 2011, 6, e22594. [Google Scholar] [CrossRef]
  48. Abascal, F.; Zardoya, R.; Telford, M.J. TranslatorX: Multiple Alignment of Nucleotide Sequences Guided by Amino Acid Translations. Nucleic Acids Res. 2010, 38, W7–W13. [Google Scholar] [CrossRef] [PubMed]
  49. Singer, J.B.; Thomson, E.C.; McLauchlan, J.; Hughes, J.; Gifford, R.J. GLUE: A Flexible Software System for Virus Sequence Data. BMC Bioinform. 2018, 19, 532. [Google Scholar] [CrossRef]
  50. Jain, C.; Dilthey, A.; Koren, S.; Aluru, S.; Phillippy, A.M. A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases. J. Comput. Biol. 2018, 25, 766–779. [Google Scholar] [CrossRef]
  51. Jain, C.; Rodriguez-R., L.M.; Phillippy, A.M.; Konstantinidis, K.T.; Aluru, S. High Throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries. Nat. Commun. 2018, 9, 5114. [Google Scholar] [CrossRef]
  52. Zielezinski, A.; Gudyś, A.; Barylski, J.; Siminski, K.; Rozwalak, P.; Dutilh, B.E.; Deorowicz, S. Ultrafast and Accurate Sequence Alignment and Clustering of Viral Genomes. Nat. Methods 2025, 22, 1191–1194. [Google Scholar] [CrossRef] [PubMed]
  53. Potter, S.C.; Luciani, A.; Eddy, S.R.; Park, Y.; Lopez, R.; Finn, R.D. HMMER Web Server: 2018 Update. Nucleic Acids Res. 2018, 46, W200–W204. [Google Scholar] [CrossRef]
  54. Nguyen, V.-A.; Boyd-Graber, J.; Altschul, S.F. Dirichlet Mixtures, the Dirichlet Process, and the Structure of Protein Space. J. Comput. Biol. 2013, 20, 1–18. [Google Scholar] [CrossRef]
  55. Eddy, S.R. A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Comput. Biol. 2008, 4, e1000069. [Google Scholar] [CrossRef]
  56. Aylward, F.O.; Moniruzzaman, M. ViralRecall-A Flexible Command-Line Tool for the Detection of Giant Virus Signatures in ’Omic Data. Viruses 2021, 13, 150. [Google Scholar] [CrossRef] [PubMed]
  57. Tisza, M.J.; Belford, A.K.; Domínguez-Huerta, G.; Bolduc, B.; Buck, C.B. Cenote-Taker 2 Democratizes Virus Discovery and Sequence Annotation. Virus Evol. 2021, 7, veaa100. [Google Scholar] [CrossRef]
  58. Starikova, E.V.; Tikhonova, P.O.; Prianichnikov, N.A.; Rands, C.M.; Zdobnov, E.M.; Ilina, E.N.; Govorun, V.M. Phigaro: High-Throughput Prophage Sequence Annotation. Bioinformatics 2020, 36, 3882–3884. [Google Scholar] [CrossRef] [PubMed]
  59. Koonin, E.V.; Dolja, V.V.; Krupovic, M.; Varsani, A.; Wolf, Y.I.; Yutin, N.; Zerbini, F.M.; Kuhn, J.H. Global Organization and Proposed Megataxonomy of the Virus World. Microbiol. Mol. Biol. Rev. 2020, 84, e00061-19. [Google Scholar] [CrossRef]
  60. Edgar, R.C.; Taylor, B.; Lin, V.; Altman, T.; Barbera, P.; Meleshko, D.; Lohr, D.; Novakovsky, G.; Buchfink, B.; Al-Shayeb, B.; et al. Petabase-Scale Sequence Alignment Catalyses Viral Discovery. Nature 2022, 602, 142–147. [Google Scholar] [CrossRef]
  61. Charon, J.; Buchmann, J.P.; Sadiq, S.; Holmes, E.C. RdRp-Scan: A Bioinformatic Resource to Identify and Annotate Divergent RNA Viruses in Metagenomic Sequence Data. Virus Evol. 2022, 8, veac082. [Google Scholar] [CrossRef]
  62. Sakaguchi, S.; Nakano, T.; Nakagawa, S. NeoRdRp2 with Improved Seed Data, Annotations, and Scoring. Front. Virol. 2024, 4, 1378695. [Google Scholar] [CrossRef]
  63. Amgarten, D.; Braga, L.P.P.; da Silva, A.M.; Setubal, J.C. MARVEL, a Tool for Prediction of Bacteriophage Sequences in Metagenomic Bins. Front. Genet. 2018, 9, 304. [Google Scholar] [CrossRef] [PubMed]
  64. Bzhalava, Z.; Tampuu, A.; Bała, P.; Vicente, R.; Dillner, J. Machine Learning for Detection of Viral Sequences in Human Metagenomic Datasets. BMC Bioinform. 2018, 19, 336. [Google Scholar] [CrossRef]
  65. Guo, J.; Bolduc, B.; Zayed, A.A.; Varsani, A.; Dominguez-Huerta, G.; Delmont, T.O.; Pratama, A.A.; Gazitúa, M.C.; Vik, D.; Sullivan, M.B.; et al. VirSorter2: A Multi-Classifier, Expert-Guided Approach to Detect Diverse DNA and RNA Viruses. Microbiome 2021, 9, 37. [Google Scholar] [CrossRef] [PubMed]
  66. Kang, H.S.; McNair, K.; Cuevas, D.A.; Bailey, B.A.; Segall, A.M.; Edwards, R.A. Prophage Genomics Reveals Patterns in Phage Genome Organization and Replication. bioRxiv 2017. [Google Scholar] [CrossRef]
  67. Shang, J.; Sun, Y. CHEER: HierarCHical Taxonomic Classification for Viral mEtagEnomic Data via Deep leaRning. Methods 2021, 189, 95–103. [Google Scholar] [CrossRef]
  68. Hou, S.; Tang, T.; Cheng, S.; Liu, Y.; Xia, T.; Chen, T.; Fuhrman, J.A.; Sun, F. DeepMicroClass Sorts Metagenomic Contigs into Prokaryotes, Eukaryotes and Viruses. NAR Genom. Bioinform. 2024, 6, lqae044. [Google Scholar] [CrossRef]
  69. Zárate, A.; Díaz-González, L.; Taboada, B. VirDetect-AI: A Residual and Convolutional Neural Network-Based Metagenomic Tool for Eukaryotic Viral Protein Identification. Brief. Bioinform. 2024, 26, bbaf001. [Google Scholar] [CrossRef]
  70. Auslander, N.; Gussow, A.B.; Benler, S.; Wolf, Y.I.; Koonin, E.V. Seeker: Alignment-Free Identification of Bacteriophage Genomes by Deep Learning. Nucleic Acids Res. 2020, 48, e121. [Google Scholar] [CrossRef]
  71. Al-Najim, A.; Hauns, S.; Tran, V.D.; Backofen, R.; Alkhnbashi, O.S. HVSeeker: A Deep-Learning-Based Method for Identification of Host and Viral DNA Sequences. Gigascience 2025, 14, giaf037. [Google Scholar] [CrossRef]
  72. Miao, Y.; Liu, F.; Hou, T.; Liu, Y. Virtifier: A Deep Learning-Based Identifier for Viral Sequences from Metagenomes. Bioinformatics 2022, 38, 1216–1222. [Google Scholar] [CrossRef] [PubMed]
  73. Miao, Y.; Bian, J.; Dong, G.; Dai, T. DETIRE: A Hybrid Deep Learning Model for Identifying Viral Sequences from Metagenomes. Front. Microbiol. 2023, 14, 1169791. [Google Scholar] [CrossRef]
  74. Shang, J.; Jiang, J.; Sun, Y. Bacteriophage Classification for Assembled Contigs Using Graph Convolutional Network. Bioinformatics 2021, 37, i25–i33. [Google Scholar] [CrossRef]
  75. Sourkov, V. IGLOO: Slicing the Features Space to Represent Sequences. arXiv 2018, arXiv:1807.03402. [Google Scholar]
  76. Li, J.; Mi, J.; Lin, W.; Tian, F.; Wan, J.; Gao, J.; Tong, Y. VirNucPro: An Identifier for the Identification of Viral Short Sequences Using Six-Frame Translation and Large Language Models. Brief. Bioinform. 2025, 26, bbaf224. [Google Scholar] [CrossRef]
  77. Wang, R.H.; Ng, Y.K.; Zhang, X.; Wang, J.; Li, S.C. Coding Genomes with Gapped Pattern Graph Convolutional Network. Bioinformatics 2024, 40, btae188. [Google Scholar] [CrossRef]
  78. Dong, Y.; Chen, W.-H.; Zhao, X.-M. VirRep: A Hybrid Language Representation Learning Framework for Identifying Viruses from Human Gut Metagenomes. Genome Biol. 2024, 25, 177. [Google Scholar] [CrossRef] [PubMed]
  79. Wood, D.E.; Lu, J.; Langmead, B. Improved Metagenomic Analysis with Kraken 2. Genome Biol. 2019, 20, 257. [Google Scholar] [CrossRef] [PubMed]
  80. MacDonald, M.L.; Polson, S.W.; Lee, K.H. k-mer-Based Metagenomics Tools Provide a Fast and Sensitive Approach for the Detection of Viral Contaminants in Biopharmaceutical and Vaccine Manufacturing Applications Using Next-Generation Sequencing. mSphere 2021, 6, e01336-20. [Google Scholar] [CrossRef]
  81. Maabar, M.; Davison, A.J.; Vučak, M.; Thorburn, F.; Murcia, P.R.; Gunson, R.; Palmarini, M.; Hughes, J. DisCVR: Rapid Viral Diagnosis from High-Throughput Sequencing Data. Virus Evol. 2019, 5, vez033. [Google Scholar] [CrossRef]
  82. Audano, P.; Vannberg, F. KAnalyze: A Fast Versatile Pipelined K-Mer Toolkit. Bioinformatics 2014, 30, 2070–2072. [Google Scholar] [CrossRef] [PubMed]
  83. Ounit, R.; Wanamaker, S.; Close, T.J.; Lonardi, S. CLARK: Fast and Accurate Classification of Metagenomic and Genomic Sequences Using Discriminative K-Mers. BMC Genom. 2015, 16, 236. [Google Scholar] [CrossRef]
  84. Popov, N.; Sonets, I.; Evdokimova, A.; Molchanova, M.; Panova, V.; Korneenko, E.; Manolov, A.; Ilina, E. AliMarko: A Pipeline for Virus Identification Using an Expert-Guided Approach. Viruses 2025, 17, 355. [Google Scholar] [CrossRef]
  85. Zhang, T.; Liu, Y.; Guo, X.; Zhang, X.; Zheng, X.; Zhang, M.; Bao, Y. VISTA: A Tool for Fast Taxonomic Assignment of Viral Genome Sequences. Genom. Proteom. Bioinform. 2025, 23, qzae082. [Google Scholar] [CrossRef]
  86. Carroll, D.; Daszak, P.; Wolfe, N.D.; Gao, G.F.; Morel, C.M.; Morzaria, S.; Pablos-Méndez, A.; Tomori, O.; Mazet, J.A.K. The Global Virome Project. Science 2018, 359, 872–874. [Google Scholar] [CrossRef]
  87. Wu, Y.; Peng, Y. Ten Computational Challenges in Human Virome Studies. Virol. Sin. 2024, 39, 845–850. [Google Scholar] [CrossRef]
  88. Buchfink, B.; Xie, C.; Huson, D.H. Fast and Sensitive Protein Alignment Using DIAMOND. Nat. Methods 2015, 12, 59–60. [Google Scholar] [CrossRef]
  89. Mirdita, M.; Steinegger, M.; Söding, J. MMseqs2 Desktop and Local Web Server App for Fast, Interactive Sequence Searches. Bioinformatics 2019, 35, 2856–2858. [Google Scholar] [CrossRef] [PubMed]
  90. Zhang, N.; Hu, B.; Zhang, L.; Gan, M.; Ding, Q.; Pan, K.; Wei, J.; Xu, W.; Chen, D.; Zheng, S.; et al. Virome Landscape of Wild Rodents and Shrews in Central China. Microbiome 2025, 13, 63. [Google Scholar] [CrossRef]
  91. Guo, J.; Huang, X.; Zhang, C.; Huang, P.; Li, Y.; Wen, F.; Wang, X.; Yang, N.; Xu, M.; Bi, Y.; et al. The Blood Virome of 10,585 Individuals from the ChinaMAP. Cell Discov. 2022, 8, 113. [Google Scholar] [CrossRef] [PubMed]
  92. Nayfach, S.; Páez-Espino, D.; Call, L.; Low, S.J.; Sberro, H.; Ivanova, N.N.; Proal, A.D.; Fischbach, M.A.; Bhatt, A.S.; Hugenholtz, P.; et al. Metagenomic Compendium of 189,680 DNA Viruses from the Human Gut Microbiome. Nat. Microbiol. 2021, 6, 960–970. [Google Scholar] [CrossRef]
  93. Zeng, S.; Almeida, A.; Li, S.; Ying, J.; Wang, H.; Qu, Y.; Paul Ross, R.; Stanton, C.; Zhou, Z.; Niu, X.; et al. A Metagenomic Catalog of the Early-Life Human Gut Virome. Nat. Commun. 2024, 15, 1864. [Google Scholar] [CrossRef] [PubMed]
  94. Nishijima, S.; Nagata, N.; Kiguchi, Y.; Kojima, Y.; Miyoshi-Akiyama, T.; Kimura, M.; Ohsugi, M.; Ueki, K.; Oka, S.; Mizokami, M.; et al. Extensive Gut Virome Variation and Its Associations with Host and Environmental Factors in a Population-Level Cohort. Nat. Commun. 2022, 13, 5252. [Google Scholar] [CrossRef]
  95. Yan, Q.; Huang, L.; Li, S.; Zhang, Y.; Guo, R.; Zhang, P.; Lei, Z.; Lv, Q.; Chen, F.; Li, Z.; et al. The Chinese Gut Virus Catalogue Reveals Gut Virome Diversity and Disease-Related Viral Signatures. Genome Med. 2025, 17, 30. [Google Scholar] [CrossRef]
  96. Galperina, A.; Lugli, G.A.; Milani, C.; De Vos, W.M.; Ventura, M.; Salonen, A.; Hurwitz, B.; Ponsero, A.J. The Aggregated Gut Viral Catalogue (AVrC): A Unified Resource for Exploring the Viral Diversity of the Human Gut. PLoS Comput. Biol. 2025, 21, e1012268. [Google Scholar] [CrossRef]
  97. Alves, J.M.P.; de Oliveira, A.L.; Sandberg, T.O.M.; Moreno-Gallego, J.L.; de Toledo, M.A.F.; de Moura, E.M.M.; Oliveira, L.S.; Durham, A.M.; Mehnert, D.U.; Zanotto, P.M.d.A.; et al. GenSeed-HMM: A Tool for Progressive Assembly Using Profile HMMs as Seeds and Its Application in Alpavirinae Viral Discovery from Metagenomic Data. Front. Microbiol. 2016, 7, 269. [Google Scholar] [CrossRef] [PubMed]
  98. Lauber, C.; Zhang, X.; Vaas, J.; Klingler, F.; Mutz, P.; Dubin, A.; Pietschmann, T.; Roth, O.; Neuman, B.W.; Gorbalenya, A.E.; et al. Deep Mining of the Sequence Read Archive Reveals Major Genetic Innovations in Coronaviruses and Other Nidoviruses of Aquatic Vertebrates. PLoS Pathog. 2024, 20, e1012163. [Google Scholar] [CrossRef] [PubMed]
  99. Taxonomy Release History. Available online: https://ictv.global/taxonomy/history (accessed on 19 November 2025).
  100. Goldfarb, T.; Kodali, V.K.; Pujar, S.; Brover, V.; Robbertse, B.; Farrell, C.M.; Oh, D.-H.; Astashyn, A.; Ermolaeva, O.; Haddad, D.; et al. NCBI RefSeq: Reference Sequence Standards through 25 Years of Curation and Annotation. Nucleic Acids Res. 2025, 53, D243–D257. [Google Scholar] [CrossRef] [PubMed]