Peer-Review Record

A Unified Framework to Prioritize RNA Virus Cross-Species Transmission Risk Across an Expansive Host Landscape

by Di Zhao 1,2, Yi-Fei Wang 1, Zu-Fei Yin 3, Ya-Fei Wu 4, Hui-Jun Yu 1,2, Luo-Yuan Xia 2, Xiao-He Liu 2, Xiao-Ming Cui 2, Xiao-Yu Shi 2, Dai-Yun Zhu 2, Na Jia 2, Jia-Fu Jiang 2, Wu-Chun Cao 1,2,* and Wenqiang Shi 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Viruses 2026, 18(2), 211; https://doi.org/10.3390/v18020211
Submission received: 6 January 2026 / Revised: 29 January 2026 / Accepted: 3 February 2026 / Published: 5 February 2026
(This article belongs to the Section General Virology)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors present a prediction framework for identifying RNA viruses with zoonotic potential. The method achieves a ~15% improvement over conventional BLASTP approaches. Overall, this is an intriguing paper with the potential for high impact. There are, however, a few areas that need clarification:

1) Please explain a little about how CheckV works and how it helps in selecting high-quality virus sequences.

2) In Line 328, it is not clear how BLASTP was used. Can the authors please clarify what was BLASTed against what? For example, if a reader wanted to replicate the BLASTP analysis, what BLASTP command line was used?

3) Line 490 states that the framework can be used for the early detection of high-risk emerging viruses. How can one use the code on GitHub to do this with their own data? Can the authors provide an example of this on GitHub?

4) I recommend making a webserver for this framework so researchers can run it on their own data. I don't mean for the current manuscript, but in the future, since the framework may not be easy for non-computer-savvy users to use. Are there any plans to do this?

5) Can the authors please explain the rationale for using the genomic language model LucaOne? Why was this language model chosen?

Author Response

Comments 1: Please explain a little about how CheckV works and how it helps in selecting high-quality virus sequences.

Response 1: Thank you for this helpful suggestion. We agree that a brief explanation of CheckV would improve clarity. Accordingly, we revised the Methods section to describe that CheckV (v1.0.3) evaluates viral contig quality by integrating protein-coding gene prediction and homology to reference viruses with genome-length/terminal-signature evidence to estimate genome completeness, and by identifying putative host-derived (cellular) regions to assess contamination and delineate proviral boundaries. We also clarified that we retained only sequences labeled “High-quality” or “Complete” in the checkv_quality output, thereby minimizing the inclusion of fragmented or host-contaminated viral sequences in downstream analyses. These changes can be found in the revised manuscript on Page 3, Paragraph 2, Lines 107–114.

“To rigorously ensure the quality of viral sequences, we used CheckV (v1.0.3) [16]. CheckV evaluates viral contig quality by integrating protein-coding gene prediction and homology searches against reference viruses. It estimates genome completeness using genome-length information and terminal-signature evidence. In addition, it identifies putative host-derived (cellular) regions to assess contamination and to delineate proviral boundaries. Only sequences labeled “High-quality” or “Complete” in the checkv_quality output were retained, thereby minimizing the inclusion of fragmented or host-contaminated viral sequences in downstream analyses.”
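To make this filtering step concrete, the following minimal Python sketch (illustrative only; the file and column names follow CheckV's standard quality_summary.tsv output) retains the identifiers of sequences that pass the quality filter:

import csv

# CheckV is run first, e.g.: checkv end_to_end contigs.fna checkv_out -t 16
kept = []
with open("checkv_out/quality_summary.tsv") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        if row["checkv_quality"] in {"High-quality", "Complete"}:
            kept.append(row["contig_id"])  # retained for downstream analyses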

Comments 2: In Line 328, it is not clear how BLASTP was used. Can the authors please clarify what was BLASTed against what? For example, if a reader wanted to replicate the BLASTP analysis, what BLASTP command line was used?

Response 2: Thank you for this helpful suggestion. We agree that the original description of the BLASTp baseline was not sufficiently clear. Accordingly, we revised the Methods section to explicitly clarify what sequences were BLASTed against. Specifically, predicted protein sequences from training viruses were used to construct a BLAST protein database, and predicted proteins from test viruses were used as queries. BLASTp hits were retained if they met E-value < 1e−5 and query coverage > 50% (aligned query residues divided by the full query protein length). To aggregate protein-level alignments to the virus level, for each (query virus, training virus) pair, we retained the single best protein alignment (highest bit score) across all protein–protein matches between their proteomes. The host range of a query virus was then inferred by transferring/aggregating host annotations from the matched training viruses. We further clarified that we report results under two aggregation settings—(i) retaining all qualified training-virus matches (all-cover) and (ii) retaining only the top-ranked matches—both used as baselines for comparison with our model. These changes can be found in the revised manuscript in the Methods section (Page 5, Paragraph 3, Lines 204–214).

“Predicted protein sequences from training viruses were used to construct a BLAST protein database, and predicted proteins from test viruses were used as queries. BLASTp hits were retained if they met E-value < 1e−5 and query coverage > 50%. To aggregate protein-level alignments to the virus level, for each (query virus, training virus) pair, we kept the single best protein alignment (highest bit score) across all protein–protein matches between their proteomes. The host range of a query virus was then inferred by transferring/aggregating host annotations from the matched training viruses. For evaluation, we report BLASTp results under two aggregation settings—(i) retaining all qualified training-virus matches (all-cover) and (ii) retaining only the top-ranked matches—both of which serve as baselines for comparison with our model.”

Example BLASTp command lines (training-set database, test-set queries):

makeblastdb -in train_proteins.faa -dbtype prot -out train_proteins_db

blastp \
  -query test_proteins.faa \
  -db train_proteins_db \
  -evalue 1e-5 \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen" \
  -num_threads 64 \
  > blastp_test_vs_train.tsv

The coverage filtering and virus-level aggregation (best alignment per query–train virus pair) were applied during post-processing, as described above.
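For completeness, a minimal Python sketch of this post-processing step is given below. It is not the authors' exact script, and it assumes protein identifiers encode their source virus as virusID|proteinID (an illustrative convention):

import csv
from collections import defaultdict

best = {}  # (query virus, training virus) -> best bit score
with open("blastp_test_vs_train.tsv") as fh:
    for row in csv.reader(fh, delimiter="\t"):
        qseqid, sseqid = row[0], row[1]
        qstart, qend = int(row[6]), int(row[7])
        bitscore, qlen = float(row[11]), int(row[12])
        # Query coverage: aligned query residues over the full query length
        if (qend - qstart + 1) / qlen <= 0.5:
            continue
        key = (qseqid.split("|")[0], sseqid.split("|")[0])
        if bitscore > best.get(key, 0.0):
            best[key] = bitscore  # single best protein alignment per pair

# Virus-level aggregation: matched training viruses (and scores) per query
# virus; host annotations of the matches are then transferred/aggregated
matches = defaultdict(list)
for (q_virus, t_virus), score in best.items():
    matches[q_virus].append((t_virus, score))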

Comments 3: Line 490 states that the framework can be used for the early detection of high-risk emerging viruses. How can one use the code on GitHub to do this with their own data? Can the authors provide an example of this on GitHub?

Response 3: Thank you for your suggestion. We have improved the documentation in the GitHub repository (https://github.com/Z-099/UniVH-model) by adding detailed, step-by-step instructions on how to run the framework with users’ own data, including clear input/output format requirements. We also provided a fully runnable example that covers data preparation and preprocessing, model inference, and an end-to-end demonstration of virus–host prediction, enabling users to reproduce the pipeline and apply the framework to their own datasets.

Comments 4: I recommend making a webserver for this framework so researchers can run it on their own data. I don't mean for the current manuscript, but in the future, since the framework may not be easy for non-computer-savvy users to use. Are there any plans to do this?

Response 4: Thank you very much for this valuable suggestion. We recognize that a webserver would greatly increase the accessibility and user-friendliness of our framework, particularly for researchers without a computational background. We fully agree with your recommendation, and we are already in the process of developing a web-based interface to allow users to conveniently analyze their own data. We hope to make this webserver publicly available in a future release.

Comments 5: Can the authors please explain the rationale for using the genomic language model LucaOne? Why was this language model chosen?

Response 5: Because our framework incorporates host proteins in addition to viral proteins, we require a pretrained model that yields consistent and transferable representations across heterogeneous proteins from different biological sources (virus vs. host). LucaOne is designed for general genomic/metagenomic sequence representation learning, which makes it well suited for encoding proteins from diverse taxa and supports cross-domain generalization in virus–host modeling. In addition, viruses and their hosts often exhibit substantial similarity in codon usage. Capturing such nucleotide-level signals requires a model that also handles nucleic-acid sequences, whereas protein-only language models such as ESM2 are less suitable for this purpose.
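As a toy illustration of this codon-usage signal (not part of the UniVH pipeline), relative codon frequencies of a viral and a host coding sequence can be compared directly:

from collections import Counter
from math import sqrt

CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]

def codon_freqs(cds):
    """Relative codon frequencies of an in-frame coding sequence."""
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in CODONS) or 1
    return [counts[c] / total for c in CODONS]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(x * x for x in v))
    return dot / norm if norm else 0.0

# Toy in-frame sequences; higher similarity suggests closer codon-usage adaptation
virus_cds = "ATGGCTGCTAAA"
host_cds = "ATGGCTGCAAAA"
similarity = cosine(codon_freqs(virus_cds), codon_freqs(host_cds))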

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

In this paper, the authors constructed a large-scale virus-host association dataset, spanning 90 RNA virus families and 240 host families (24,354 virus strain-host pairs, 525 host species). Some comments follow.

First, the "virus-host associations" used for training were derived solely from nucleic acid records in NCBI/VHDB, without distinguishing between infection, carriage, symbiosis, or incidental exposure, leading to potential false positives. Meanwhile, the negative set was constructed solely based on "absence from databases," which cannot exclude cases where the virus is actually capable of infecting the host but has not yet been sequenced or reported, resulting in label errors.

Second, all newly predicted high-risk virus-host pairs remain at the computational level without experimental validation through cell or animal infection assays.

Third, only 525 host species have complete genomes; a large number of wildlife, arthropods, and fungi lack high-quality sequences, limiting prediction accuracy for rare hosts. Viral sampling coordinates are often only available at the country level; GBIF host distribution records are severely missing in South America, Africa, and Southeast Asia, causing bias in geographic overlap features.

Moreover, the current approach only uses primary sequence embeddings and has not yet utilized structural models such as AlphaFold to analyze key receptor-binding interfaces or integrated protein structural information, potentially missing spatial conformational determinants.

Comments on the Quality of English Language

The English needs to be polished.

Author Response

Comments 1: First, the "virus-host associations" used for training were derived solely from nucleic acid records in NCBI/VHDB, without distinguishing between infection, carriage, symbiosis, or incidental exposure, leading to potential false positives. Meanwhile, the negative set was constructed solely based on "absence from databases," which cannot exclude cases where the virus is actually capable of infecting the host but has not yet been sequenced or reported, resulting in label errors.

Response 1: Thank you for highlighting the potential label noise in both positive and negative samples. We agree that virus–host links in NCBI/VHDB are primarily derived from nucleic-acid sequence submissions and typically do not distinguish among infection, carriage, symbiosis, or incidental exposure; therefore, the “positive associations” used for training may contain false positives. We also acknowledge that public repositories are incomplete and subject to sampling and reporting biases; consequently, “absence from databases” cannot be equated with confirmed non-infection, and the negative set may include false negatives in a supervised-learning setting.

We would like to clarify, however, that our negatives were not constructed by directly labeling all database-absent pairs as negative. Instead, we adopted a conservative, rule-based negative sampling strategy to reduce false negatives. Specifically, we first generated candidate negatives via random re-pairing of viruses and hosts, and then filtered them by excluding any virus taxid–host family combinations observed in a global association knowledge database. In addition, we incorporated viral sequence clustering and treated host families observed for any virus within the same cluster as an additional “cluster-aware blacklist,” thereby avoiding negatives from host families plausibly associated with closely related viruses. Moreover, when constructing validation/test negatives, we further excluded any exact virus–host pairs that appeared in the training split to prevent data leakage and duplicate pairs across splits. Taken together, our negatives should be interpreted as “unobserved” pairs under conservative exclusion constraints rather than confirmed non-host pairs.
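Schematically, these sampling rules can be summarized in the following simplified Python sketch (variable names are illustrative, not the authors' code):

import random

def sample_negatives(viruses, host_families, known_pairs, cluster_of, cluster_hosts, n):
    """Randomly re-pair viruses and host families under conservative exclusions.

    known_pairs   : set of (virus taxid, host family) combinations observed in
                    the global association knowledge database
    cluster_of    : virus taxid -> sequence-cluster ID
    cluster_hosts : cluster ID -> host families observed for ANY virus in the
                    cluster (the "cluster-aware blacklist")
    """
    negatives = set()
    while len(negatives) < n:  # assumes n is small relative to all valid pairs
        v = random.choice(viruses)
        h = random.choice(host_families)
        if (v, h) in known_pairs:
            continue  # observed association, so not a candidate negative
        if h in cluster_hosts.get(cluster_of.get(v), set()):
            continue  # host family plausibly linked to a closely related virus
        negatives.add((v, h))
    return negatives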

We agree with the reviewer that label uncertainty is a common limitation of large-scale virus–host datasets derived from public repositories. Accordingly, we position our framework as a scalable computational pre-screening and risk-ranking tool to prioritize candidate virus–host pairs for downstream experimental and epidemiological validation, rather than to assert definitive infection for each predicted pair. We appreciate the suggestion and will consider incorporating stricter evidence-level curation as well as positive-unlabeled (PU) learning and weak-supervision strategies in future work to further reduce label uncertainty and improve robustness.

Comments 2: Second, all newly predicted high-risk virus-host pairs remain at the computational level without experimental validation through cell or animal infection assays.

Response 2: We thank the reviewer for this important comment. We fully agree that experimental validation—such as in vitro infection/replication assays in permissive cell lines or ex vivo systems, and in vivo challenge studies in relevant animal models—represents the gold standard for confirming host susceptibility and productive infection. The aim of this study, however, is to develop a scalable computational framework for prioritizing candidate high-risk virus–host pairs across diverse RNA virus families and broad host taxa. Given the large number of predicted pairs, the breadth of host diversity, and practical constraints (e.g., availability of isolates or infectious clones, appropriate cell/animal models, biosafety requirements, and ethical/regulatory approvals), comprehensive experimental validation is beyond the scope of the current work.

Importantly, we position UniVH as a computational pre-screening and risk-ranking tool that narrows the hypothesis space and enables more focused downstream validation. To support follow-up studies, we provide the predicted high-risk pairs together with model-derived confidence scores (and, where applicable, interpretable feature attributions), enabling virologists to prioritize a manageable subset of top-ranked candidates for targeted entry/replication assays and, when feasible, animal challenge experiments. As additional experimentally confirmed virus–host interactions become available, the framework can be retrained and updated, allowing iterative refinement of predictive performance and biological inference.

Comments 3: Third, only 525 host species have complete genomes; a large number of wildlife, arthropods, and fungi lack high-quality sequences, limiting prediction accuracy for rare hosts. Viral sampling coordinates are often only available at the country level; GBIF host distribution records are severely missing in South America, Africa, and Southeast Asia, causing bias in geographic overlap features.

Response 3: We thank the reviewer for this important comment. We agree that host genomic coverage remains limited: our analysis includes only 525 host species with high-quality/complete genomes, while many wildlife taxa—particularly arthropods and fungi—are still underrepresented. This limitation can reduce predictive reliability for rare or poorly sampled hosts, because our feature construction depends on host gene repertoires and functional profiles that require sufficiently complete genome assemblies and reasonably consistent annotations. Accordingly, we anticipate that UniVH will become more accurate and taxonomically comprehensive as ongoing large-scale sequencing initiatives continue to expand the availability of high-quality host genomes.

We also acknowledge that current geographic metadata are imperfect. Viral sampling locations in public repositories are often recorded at coarse spatial resolution (frequently country-level), and GBIF host occurrence data are notably incomplete in several regions (e.g., South America, Africa, and Southeast Asia), which may introduce bias into geography-based overlap features. In our framework, geographic variables are used as auxiliary signals to support prioritization rather than as the primary determinant of host range. Therefore, predictions involving data-sparse regions should be interpreted cautiously. As georeferencing becomes more precise and biodiversity occurrence datasets improve in coverage and completeness, we expect the utility of geographic features—and overall model performance—to improve accordingly.

We appreciate the reviewer’s suggestion and will explore future extensions, including sampling-bias correction, uncertainty-aware geographic representations, and higher-resolution geocoding when suitable metadata become available.
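For illustration, a country-level overlap feature of the kind described above could be computed as a simple Jaccard index; this toy sketch is hypothetical and not the authors' implementation:

def geographic_overlap(virus_countries, host_countries):
    """Jaccard overlap between virus sampling and host occurrence countries."""
    if not virus_countries or not host_countries:
        return 0.0  # data-sparse regions yield uninformative features
    shared = virus_countries & host_countries
    return len(shared) / len(virus_countries | host_countries)

print(geographic_overlap({"Brazil", "Peru"}, {"Brazil", "Colombia"}))  # 0.33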

Comments 4: Moreover, the current approach only uses primary sequence embeddings and has not yet utilized structural models such as AlphaFold to analyze key receptor-binding interfaces or integrated protein structural information, potentially missing spatial conformational determinants.

Response 4: We thank the reviewer for this insightful comment. We agree that incorporating protein structural information—e.g., AlphaFold-predicted structures and interface-focused representations—could help capture spatial and conformational determinants of host specificity, particularly for receptor-binding and other host-interacting proteins. In the current study, however, we intentionally focus on a sequence-based representation (primary-sequence embeddings coupled with functional aggregation) because it is broadly applicable across diverse RNA virus families and host taxa, and it remains feasible at the scale of thousands of viral proteins and hundreds of host genomes.

We also note that, for many RNA viruses, the key entry receptors and the relevant interacting protein partners are still unknown, which makes it difficult to define biologically grounded virus–host structural interfaces for systematic modeling. Moreover, reliable structure-based modeling of host range would ideally require complex-level information (virus–host protein complexes and binding interfaces), which is not yet available at large scale and would introduce substantial computational cost and uncertainty.

Importantly, our Discussion already acknowledges this limitation: while the embedding-based model appears to capture higher-level signals beyond simple composition features, it does not currently resolve fine-grained protein–protein interactions. As stated, integrating tools such as AlphaFold to incorporate structural priors and interface-level features is a promising direction for future work. We therefore appreciate the reviewer’s suggestion and view it as a valuable avenue for extending UniVH when sufficiently curated interaction targets and scalable structural resources become available.

Response to Comments on the Quality of English Language

Point 1: The English needs to be polished.

Response 1: We have carefully revised and polished the manuscript to improve grammar, clarity, and overall readability. We also standardized punctuation and formatting to ensure consistency throughout the document. All revisions are highlighted in red in the revised manuscript.

Author Response File: Author Response.pdf
