Article
Peer-Review Record

Bio-Inspired Generative Network with Knowledge Integration

Appl. Sci. 2025, 15(24), 12918; https://doi.org/10.3390/app152412918
by Erdenebileg Batbaatar 1 and Keun Ho Ryu 2,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 25 October 2025 / Revised: 3 December 2025 / Accepted: 5 December 2025 / Published: 8 December 2025
(This article belongs to the Special Issue Application of Artificial Intelligence in Bioinformatics)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents BioGen-KI, a biologically inspired generative network that integrates biological knowledge to produce more realistic and biologically plausible synthetic gene expression data, enhancing the robustness and interpretability of biological analyses under data-scarce conditions.

Lines 80-93: It is not common to describe the sections in a linear fashion in a paper. This seems more fitting for a thesis. Please consider this point.

Section 2.3: The authors do not mention the numerous software tools that already simulate RNA-seq data, or how these methods attempt to replicate genetic variation, gene expression, and co-expression patterns in different cellular contexts. They should also mention the limitations of these approaches, such as the inability to preserve the biological complexity of gene interactions and the challenge of adapting the models to different types of data, such as single-cell RNA-seq or cancer genomics data. The authors mention AlphaFold, which is not related to genomics but rather to proteomics.

Section 2.1, lines 108-111: The authors do not provide appropriate references. Additionally, up until line 115, it is unclear, based on the literature, that there is a gap. The gap should be better described and supported by references.

Line 267: What is the early-stopping criterion referred to here?

Line 272: What is the criterion for identifying highly variable genes? This is unclear, especially since there was normalization to the [-1,1] range.

Section 3.9 (Line 279): The authors again mention "conventional generative models", a term used in several places in the text. However, they do not specify which models they are referring to, and in Section 3.9, a comparative table is not provided. The authors should include this comparative table and also mention the comparative methods.

Section 3.7: The explanation of the Two-Time-Scale Update Rule (TTUR) could be more detailed. Additionally, a better explanation of what mini-batches are would be helpful.

Lines 301 and 302: "Comparative baselines include a multivariate Copula model, a standard VAE, a cGAN, and a WGAN-GP." There are no citations for these methods. The authors should cite references for these methods, ideally in the Materials and Methods section as well. It would be useful to provide more context on why these baseline models were chosen.

Missing references: There are no references for the comparative methods used in this section. The authors need to add them.

Section 4.2: It would be helpful to test the model's robustness on smaller datasets or in data-scarce conditions (such as a limited number of samples or missing genes), since the main advantage of the model is its ability to handle data scarcity.

Figure 2: The figure can be removed since the similarity metric between generated (blue points) and real (gray) data is already measured by MMD or W1. Just including the figure without a similarity calculation doesn’t add much value.
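
As a point of reference for the similarity metrics named in this comment, MMD between real and generated samples can be estimated as in the minimal sketch below. It assumes an RBF kernel with a fixed bandwidth `gamma` and the biased estimator; this record does not state which kernel, bandwidth, or estimator the manuscript uses, so `rbf_kernel`, `mmd2_biased`, and `gamma` are illustrative names rather than the authors' implementation.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def mmd2_biased(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y.

    gamma is an assumed, fixed bandwidth; in practice it is often set
    from the median pairwise distance of the pooled samples.
    """
    kxx = rbf_kernel(x, x, gamma)
    kyy = rbf_kernel(y, y, gamma)
    kxy = rbf_kernel(x, y, gamma)
    return float(kxx.mean() + kyy.mean() - 2.0 * kxy.mean())

# Usage with toy data (samples x genes): smaller values mean closer distributions.
rng = np.random.default_rng(0)
real = rng.normal(size=(50, 5))
generated = rng.normal(size=(50, 5))
score = mmd2_biased(real, generated, gamma=0.1)
```

A scatter plot such as Figure 2 shows the same overlap qualitatively; a number of this kind is what makes the comparison between models quantitative.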

Sections 5.1 and 5.2: These should be merged. Combine the final ideas about the platform. Section 5.3 should be moved to the introduction. Section 5.4 should be placed in the methodology section of BioGen-KI. Section 5.5 should be in the limitations section, not the discussion section. Additionally, the authors should include references to support these points. Sections 5.6, 5.7, and 5.8 can be removed.

GitHub repository: The authors should create a GitHub repository or GitHub Docs to provide a well-organized manual for the software, ensuring reproducibility.

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This is an interesting work on real-world biological data synthesis. The authors present well the idea of building a bio-inspired generative network by integrating the larger context of system-level knowledge. They attempt to solve a long-standing real-world problem: generating gene expression data, which are shaped by an intricate network of interdependent biological factors (genetic, epigenetic, and environmental). Successful synthesis of high-fidelity artificial data reflects a better understanding of biological regulatory processes and helps develop better prediction models for real downstream processes.

Although previous models perform well, BioGen-KI's better performance can be attributed to its methodology, which incorporates embeddings from a biological knowledge graph (e.g., gene interactions, pathways) and contextual information (e.g., cell type, disease); this is commendable. The model performs better on all metrics and shows a superior ability to generate high-quality, realistic samples. It records the smallest MMD (0.058 ± 0.003) and W₁ (0.261 ± 0.009), suggesting that its generated data are closer to the real-world data than those of the other models. It also reports the lowest median KS score (0.138 ± 0.006), suggesting minimal statistical deviation from the real-world data. Additionally, the highest PRD-F1 score suggests an excellent balance between fidelity (realism of samples) and diversity (variety among generated samples). The manuscript also claims better performance on downstream tasks, such as differential-expression analysis and classification.

The authors have worked on a valuable unsolved question that can help address complicated biological questions. They have conducted a systematic scientific investigation that aligns well with the question and results. However, I have a few minor concerns:

  1. Although the presented model outperforms the others, the results are close in many instances. A detailed discussion of the fundamental differences, and of the features that best discriminate between the models, would help readers and support the development of better models.
  2. The research findings would have a larger impact on the reader community if the data and code were made available.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

  1) Summary and contribution: The manuscript proposes BioGen-KI, a knowledge-guided generative framework for synthetic gene-expression data. It integrates biological knowledge-graph embeddings (GO/KEGG/Reactome/STRING) with a conditional GAN; a biologically informed discriminator and co-expression / pathway losses aim to ensure both statistical realism and mechanistic plausibility. The paper evaluates on bulk RNA-seq, scRNA-seq and pan-cancer data, reporting lower MMD/Wasserstein distances and improved TF–target consistency / pathway coherence, plus better downstream clustering/DE/classification, with ~12–18% runtime overhead vs WGAN-GP (Table 7).
  2) Strengths:
  • Clear motivation & framing. The paper accurately identifies that vanilla GANs can match distributions yet violate biology (e.g., TF–target inversions), motivating knowledge-guided constraints. 
  • Methodological integration. Well-explained pipeline (KG encoder → contextual embeddings → knowledge-guided generator → biological discriminator) with explicit training objectives (WGAN-GP + biological regularizers). 
  • Multi-axis evaluation. Statistical fidelity (MMD/W1), biological plausibility (edge-sign consistency, TF–target corr., pathway coherence), and downstream tasks (clustering, DE, low-data classification). 
  • Interpretability & cases. Concrete case studies (e.g., Luminal A vs Basal-like markers; interferon response) show biologically sensible signals in synthetics. 
  3) Major issues:
  • Quantitative specificity & statistical testing: The results mention relative improvements (e.g., ~20% divergence reduction vs WGAN-GP; better TF–target/pathway metrics), but lack effect sizes with uncertainty (mean±SD over runs) and significance tests across datasets. Please provide run-to-run variance, CIs, and tests for MMD/W1 and biological scores; do the same for downstream metrics. 
  • Dataset transparency & splits: Section 4 describes several contexts, but the exact dataset names, accession IDs, preprocessing (gene sets), train/val/test splits, and sample sizes per context should be listed in a compact table to ensure reproducibility. (Normalization and KG sources are described well; list concrete datasets & accession links similarly.) 
  • Baselines & breadth: Baselines include VAE / WGAN-GP and a cGAN-like reference. Given the 2024–2025 landscape, add one recent knowledge-aware or diffusion-based baseline (or justify absence) and one scRNA-seq generator commonly used for benchmarking; otherwise clearly argue why WGAN-GP suffices as the main foil. 
  • Ablation reporting: The text notes that each knowledge-guided component helps, but the ablation evidence is scattered. Provide a single ablation table quantifying the contribution of: (i) KG embeddings, (ii) co-expression loss, (iii) pathway loss, (iv) contextual conditioning — each added stepwise, with metrics for fidelity, biology, and one downstream task. 
  • Privacy & leakage: The Discussion lists future work on membership/attribute inference and DP. Given the synthetic-omics context, add a brief empirical privacy check (e.g., nearest-neighbor distance distributions vs training samples) or clearly state the current privacy limitations to avoid over-claiming safe sharing. 
  • Novelty positioning: The Related Work is strong but slightly generic in places. Please sharpen how BioGen-KI differs from knowledge-guided ML and prior KG-enhanced generators in life sciences (what is new: joint adversarial + biological losses with KG embeddings; conditional control; explicit biological discriminator signals). Cite a couple of closest works and draw a short contrast table. 
  4) Minor issues:
  • Clarity & brevity. The manuscript is readable but could be trimmed by ~10–15% in Related Work and repeated summaries (Sections 2 & 5). 
  • Figure quality. Ensure axis labels, units, and legible fonts in training-dynamics and pathway-coherence plots; add captions that interpret why curves differ, not only that they differ. 
  • Implementation details. Hyperparameters are partly given (Adam, TTUR), but please tabulate core settings (batch size, latent dim, λcoexp, λpathway, early stopping) and release code or pseudo-code to lock in reproducibility. 
  • Limitations box. The paper already lists limitations (KG incompleteness, correlational supervision, pathway-set dependence, overhead). Consider a boxed “Limitations” paragraph for quick reference. 
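
The nearest-neighbor privacy check suggested under "Privacy & leakage" above could look roughly like the following minimal sketch. It assumes Euclidean distance on the normalized expression matrix and a 5th-percentile threshold for flagging near-copies; both choices, and the names `nn_distances` and `privacy_summary`, are illustrative assumptions, not something specified by the reviewer or the manuscript.

```python
import numpy as np

def nn_distances(queries, reference, exclude_self=False):
    """Distance from each query row to its nearest row in reference."""
    dists = np.linalg.norm(
        queries[:, None, :] - reference[None, :, :], axis=2
    )
    if exclude_self:
        # Only valid when queries and reference are the same matrix.
        np.fill_diagonal(dists, np.inf)
    return dists.min(axis=1)

def privacy_summary(train, synthetic, percentile=5.0):
    """Compare synthetic-to-train distances with train-to-train distances.

    The percentile threshold is an assumed heuristic: a synthetic sample
    closer to a training sample than most training samples are to each
    other is flagged as a possible near-copy.
    """
    syn_to_train = nn_distances(synthetic, train)
    train_to_train = nn_distances(train, train, exclude_self=True)
    threshold = np.percentile(train_to_train, percentile)
    return {
        "median_syn_to_train": float(np.median(syn_to_train)),
        "median_train_to_train": float(np.median(train_to_train)),
        "frac_near_copies": float(np.mean(syn_to_train < threshold)),
    }

# Usage with toy data (samples x genes).
rng = np.random.default_rng(0)
train = rng.normal(size=(40, 5))
synthetic = rng.normal(size=(40, 5))
report = privacy_summary(train, synthetic)
```

A high fraction of near-copies, or a synthetic-to-train median far below the train-to-train median, would suggest memorization. This is a heuristic screen only and does not replace membership-inference testing or formal differential-privacy guarantees.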

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
