Search Results (84)

Search Parameters:
Keywords = genomic language model

17 pages, 10327 KB  
Article
A Unified Framework to Prioritize RNA Virus Cross-Species Transmission Risk Across an Expansive Host Landscape
by Di Zhao, Yi-Fei Wang, Zu-Fei Yin, Ya-Fei Wu, Hui-Jun Yu, Luo-Yuan Xia, Xiao-He Liu, Xiao-Ming Cui, Xiao-Yu Shi, Dai-Yun Zhu, Na Jia, Jia-Fu Jiang, Wu-Chun Cao and Wenqiang Shi
Viruses 2026, 18(2), 211; https://doi.org/10.3390/v18020211 - 5 Feb 2026
Abstract
RNA viruses exhibit high mutation rates and strong host adaptive capacity, posing major public health challenges. Although meta-transcriptomic studies have uncovered vast numbers of novel RNA viral sequences, identifying those with spillover risks remains difficult. Current virus host-prediction methods can only predict a narrow set of host labels at coarse taxonomic levels (e.g., kingdom or order), which hampers precise evaluation of cross-species transmission risk and may overlook potential zoonotic hosts. To overcome these limitations, we developed UniVH, a unified virus–host association prediction framework trained on an exceptionally broad spectrum of 90 viral families and 240 host families, enabling robust prediction even for phylogenetically distant or data-scarce hosts. UniVH achieved a host prediction accuracy of 71.2% for novel viruses discovered after 2020, representing a 15.3% improvement over conventional BLASTp-based homology approaches. Feature interpretation revealed that viral structural genes and host immune- and metabolism-related genes contributed most significantly to predictive performance. Model predictions indicated widespread host-range expansion, with 20 mammalian virus families doubling their documented mammalian host ranges and several showing marked increases in viruses with human-infection potential. This unified, interpretable framework represents an important methodological advance for future RNA virus spillover-risk evaluation and emerging virus prioritization. Full article
(This article belongs to the Section General Virology)
16 pages, 1288 KB  
Article
Genome Mining of Acinetobacter nosocomialis J2 Using Artificial Intelligence Reveals a Highly Efficient Acid Phosphatase for Phosphate Solubilisation
by Kaixu Chen, Huiling Huang, Xiao Yu, Jing Zhang, Chunming Zhou, Zhong Yao, Zheng Xu, Yang Liu and Yang Sun
Fermentation 2026, 12(1), 64; https://doi.org/10.3390/fermentation12010064 - 21 Jan 2026
Viewed by 291
Abstract
Excessive application of chemical fertilisers has led to soil phosphorus immobilisation and aquatic eutrophication, making the development of highly efficient acid/neutral phosphatases crucial for sustainable phosphorus utilisation. In this study, we systematically investigated strain J2, which was isolated from phosphate-contaminated soil in Laoshan, Nanjing, China. 16S rRNA gene sequence analysis identified this strain as Acinetobacter nosocomialis J2, with 99.78% sequence similarity. Whole-genome sequencing generated a 3.83 Mb genome with a GC content of 38.59%, revealing multiple phospho-metabolism-related enzyme genes, including phospholipase C and α/β-hydrolases. A large language model–based protein representation learning strategy was employed to mine acid/neutral phosphatase genes from the genome, in which the model learned contextual and functional features from known phosphatase sequences and was used to identify semantically similar genes within the J2 genome. This approach predicted nine phosphatase candidate sequences, including AnACPase, a putative acid/neutral phosphatase. Biochemical characterisation showed that AnACPase exhibits optimal activity at pH 6.0 and 50 °C, with a Km value of 0.2454 mmol/L for the p-NPP substrate, indicating high substrate affinity. Mn²⁺ and Ni²⁺ significantly enhanced enzyme activity, whereas Cu²⁺ and Zn²⁺ strongly inhibited it. Soil remediation experiments further validated the application potential of AnACPase, which solubilised 171.56 mg/kg of phosphate within seven days. Overall, this study highlights the advantages of deep learning-assisted genome mining for functional enzyme discovery and provides a novel technological pathway for the bioremediation of phosphorus-polluted soils. Full article
29 pages, 3485 KB  
Systematic Review
Integrating Genomics, Radiomics, and Pathomics in Oncology: A Scoping Review and a Framework for AI-Enabled Surgomics
by Selma Mtoor, Niki Rashidian, Nouredin Messaoudi, Vincent Grasso, Floriane Noel, Michele Steindler, Derar Jaradat, Isabella Frigerio, Giovanni Butturini, Roland Croner, Karol Rawicz-Pruszynski, Giulia Capelli, Gaya Spolverato, Marc G. Besselink, Takeaki Ishizawa, Elie Chouillard, Mohammad Abu-Hilal, Ulf Kahlert, Ibrahim Dagher and Andrew A. Gumbs
Bioengineering 2026, 13(1), 117; https://doi.org/10.3390/bioengineering13010117 - 20 Jan 2026
Viewed by 267
Abstract
Background: Multimodal AI integration across genomics, radiomics, and pathomics is rapidly evolving in oncology, but evidence remains heterogeneous and unevenly distributed across modalities. Objective: To map empirical studies integrating two or more -omic modalities, summarize integration and validation approaches, and identify gaps informing future directions toward surgomics. Methods: We conducted a scoping review in accordance with PRISMA-ScR, searching PubMed, Ovid, Wiley Online Library, and Google Scholar for English-language studies published from January 2020 to 5 March 2025. We charted study characteristics, modalities combined, fusion strategies, AI model categories, validation approaches, and reported performance metrics as presented by the original studies. Results: From 184 records, 11 studies met inclusion criteria (n = 1078 total participants across reported studies), most focusing on radiomics–pathomics integration; fewer incorporated genomics, and tri-modal fusion was uncommon. Studies varied widely in clinical tasks, endpoints, preprocessing, and validation, limiting direct comparability. Conclusions: The mapped evidence indicates growing methodological activity in radiopathomics and cross-scale association modeling, while tri-modal pipelines and clinically deployable multimodal workflows remain underdeveloped. Surgomics is presented as a conceptual, staged roadmap informed by these gaps rather than a current clinical capability. Full article
(This article belongs to the Special Issue AI and Data Science in Bioengineering: Innovations and Applications)
29 pages, 1732 KB  
Systematic Review
Surveillance of Healthcare-Associated Infections in the WHO African Region: Systematic Review of Literature from 2011 to 2024
by Laetitia Gahimbare, Nathalie K. Guessennd, Claude Mambo Muvunyi, Walter Fuller, Sheick Oumar Coulibaly, Landry Cihambanya, Pierre Claver Kariyo, Olga Perovic, Ambele Judith Mwamelo, Diané Kouao Maxime, Valérie Gbonon, Konan Kouadio Fernique, Babacar Ndoye and Yahaya Ali Ahmed
Antibiotics 2025, 14(12), 1287; https://doi.org/10.3390/antibiotics14121287 - 18 Dec 2025
Viewed by 724
Abstract
Background: Evidence on HAIs in Africa is fairly common. Objectives: The main objective was to identify the surveillance tools used for healthcare-associated infections (HAIs) in countries in the WHO African Region. Secondary objectives focused on the organization of surveillance, the pathogens involved, and the frequency of multidrug-resistant species. Inclusion and exclusion criteria: Observational or interventional studies on healthcare-associated infections in humans, published between January 2011 and December 2024, in French or English, were included. The following publications were excluded: animal studies, infections not related to healthcare, literature reviews, studies outside the period or geographical area, and studies in languages other than French or English. Sources of information and search date: The databases consulted were PubMed, Web of Science, EMBASE, Cochrane, African Index Medicus, Google Scholar, and AJOL. The search was conducted between January and March 2025. Risk of bias assessment: The risk of bias was assessed using a specific grid (eleven criteria), scored from one (low) to three (high). The studies were classified into three levels of methodological quality. The bias assessment rated the methodological quality of 99.9% of the publications as strong or moderate. Methods of synthesizing results: Data were extracted using a standardized grid and synthesized narratively. No meta-analysis was performed. Number of studies and characteristics: Ninety-five studies were included, mostly cross-sectional studies (82.1%), followed by cohort studies (10.4%) and a few case reports. Most were from West Africa (60.0%); the most represented countries were Nigeria (16.8%) and South Africa (14.7%). Main results:
• Most common pathogens: Staphylococcus aureus (53.7%), Escherichia coli (43.2%), Klebsiella pneumoniae (32.6%).
• Resistance profile: ESBL (27.4%), MRSA (21.1%), multidrug resistance (13.7%).
• Sources of HAIs: mainly exogenous (83.2%).
• Laboratory methods: phenotypic (70.5%); genotypic or genomic methods were rare (3.1%).
• Scope of studies: local (96.8%), national (3.2%).
Limitations of evidence: Risk of bias due to underreporting of HAIs, methodological heterogeneity, predominance of cross-sectional studies, low use of molecular methods, lack of modeling, and uneven geographical coverage. Overall interpretation and implications: Surveillance of HAIs in Africa remains fragmented and poorly standardized. There is a need to strengthen national systems, integrate molecular methods, train professionals, and promote interventional research. The WHO GLASS program can serve as a framework for harmonizing surveillance. Full article
49 pages, 1617 KB  
Review
Harnessing Machine Learning Approaches for the Identification, Characterization, and Optimization of Novel Antimicrobial Peptides
by Naveed Saleem, Naresh Kumar, Emad El-Omar, Mark Willcox and Xiao-Tao Jiang
Antibiotics 2025, 14(12), 1263; https://doi.org/10.3390/antibiotics14121263 - 14 Dec 2025
Viewed by 1495
Abstract
Antimicrobial resistance (AMR) has become a major health crisis worldwide, and it is expected to surpass cancer as one of the leading causes of death by 2050. Conventional antibiotics are struggling to keep pace with the rapidly evolving resistance trends, underscoring the urgent need for novel antimicrobial therapeutic strategies. Antimicrobial peptides (AMPs) function through diverse, often membrane-disrupting mechanisms that can address the latest challenges to resistance. However, the identification, prediction, and optimization of novel AMPs can be impeded by several issues, including extensive sequence spaces, context-dependent activity, and the higher costs associated with wet laboratory screenings. Recent developments in artificial intelligence (AI) have enabled large-scale mining of genomes, metagenomes, and quantitative species-resolved activity prediction, i.e., MIC, and de novo AMPs designed with integrated stability and toxicity filters. The current review has synthesized and highlighted progress across different discriminative models, such as classical machine learning and deep learning models and transformer embeddings, alongside graphs and geometric encoders, structure-guided and multi-modal hybrid learning approaches, closed-loop generative methods, and large language models (LLMs) predicted frameworks. This review compares models’ benchmark performances, highlighting AI-predicted novel hybrid approaches for designing AMPs, validated by in vitro and in vivo methods against clinical and resistant pathogens to increase overall experimental hit rates. Based on observations, multimodal paradigm strategies are proposed, focusing on identification, prediction, and characterization, followed by design frameworks, linking active-learning lab cycles, mechanistic interpretability, curated data resources, and uncertainty estimation. 
Therefore, reproducible benchmarks, interoperable data, and collaborative computational and wet-lab experimental validation are needed to accelerate AI-driven novel AMP discovery to combat multidrug-resistant Gram-negative pathogens. Full article
(This article belongs to the Special Issue Novel Approaches to Prevent and Combat Antimicrobial Resistance)
23 pages, 3559 KB  
Article
From Static Prediction to Mindful Machines: A Paradigm Shift in Distributed AI Systems
by Rao Mikkilineni and W. Patrick Kelly
Computers 2025, 14(12), 541; https://doi.org/10.3390/computers14120541 - 10 Dec 2025
Viewed by 1190
Abstract
A special class of complex adaptive systems—biological and social—thrive not by passively accumulating patterns, but by engineering coherence, i.e., the deliberate alignment of prior knowledge, real-time updates, and teleonomic purposes. By contrast, today’s AI stacks—Large Language Models (LLMs) wrapped in agentic toolchains—remain rooted in a Turing-paradigm architecture: statistical world models (opaque weights) bolted onto brittle, imperative workflows. They excel at pattern completion, but they externalize governance, memory, and purpose, thereby accumulating coherence debt—a structural fragility manifested as hallucinations, shallow and siloed memory, ad hoc guardrails, and costly human oversight. The shortcoming of current AI relative to human-like intelligence is therefore less about raw performance or scaling, and more about an architectural limitation: knowledge is treated as an after-the-fact annotation on computation, rather than as an organizing substrate that shapes computation. This paper introduces Mindful Machines, a computational paradigm that operationalizes coherence as an architectural property rather than an emergent afterthought. A Mindful Machine is specified by a Digital Genome (encoding purposes, constraints, and knowledge structures) and orchestrated by an Autopoietic and Meta-Cognitive Operating System (AMOS) that runs a continuous Discover–Reflect–Apply–Share (D-R-A-S) loop. Instead of a static model embedded in a one-shot ML pipeline or deep learning neural network, the architecture separates (1) a structural knowledge layer (Digital Genome and knowledge graphs), (2) an autopoietic control plane (health checks, rollback, and self-repair), and (3) meta-cognitive governance (critique-then-commit gates, audit trails, and policy enforcement). 
We validate this approach on the classic Credit Default Prediction problem by comparing a traditional, static Logistic Regression pipeline (monolithic training, fixed features, external scripting for deployment) with a distributed Mindful Machine implementation whose components can reconfigure logic, update rules, and migrate workloads at runtime. The Mindful Machine not only matches the predictive task, but also achieves autopoiesis (self-healing services and live schema evolution), explainability (causal, event-driven audit trails), and dynamic adaptation (real-time logic and threshold switching driven by knowledge constraints), thereby reducing the coherence debt that characterizes contemporary ML- and LLM-centric AI architectures. The case study demonstrates “a hybrid, runtime-switchable combination of machine learning and rule-based simulation, orchestrated by AMOS under knowledge and policy constraints”. Full article
(This article belongs to the Special Issue Cloud Computing and Big Data Mining)
18 pages, 2321 KB  
Article
Two-Stage Probability-Enhanced Regression on Property Matrices and LLM Embeddings Enables State-of-the-Art Prediction of Gene Knockdown by Modified siRNAs
by Ivan Golovkin, Denis Shatkovskii and Nikita Serov
Int. J. Mol. Sci. 2025, 26(24), 11791; https://doi.org/10.3390/ijms262411791 - 5 Dec 2025
Viewed by 513
Abstract
Six small interfering RNAs (siRNAs) have been approved as therapeutics since 2018, making them promising nanosystems due to their selective gene knockdown activity. siRNA design is complex due to various factors, and chemical modifications are crucial to improve half-life and stability. Machine learning (ML) has enabled more efficient analysis of siRNA data, including prediction of efficacy and off-target effects. This work proposes a novel pipeline for predicting gene knockdown activity of chemically modified siRNAs across the whole range of activities, leveraging both descriptors of siRNA chemical composition-aware property matrices and large language model (LLM) embeddings for target gene encoding. Several general-purpose and domain-specific fine-tuned LLMs were benchmarked on the target task, where the Mistral 7B general-purpose model slightly outperformed even the models pre-trained on genomic data. The proposed two-stage probability-enhanced model successfully mitigates data imbalance towards moderate-to-high active constructs and achieves state-of-the-art (SOTA) quality with R² = 0.84 and RMSE = 12.27% on unseen data, where the probabilistic outputs of classifiers trained with F-scores up to 0.92 were used for regression model supervision. Moreover, leave-one-gene-out (LOGO) experiments show that the model is able to extrapolate to unseen genes, which further supports the representativeness of the siRNA features and gene embeddings. By filling the gap in the field of advanced chemical composition-aware siRNA design, our model aims to improve the efficacy of developed siRNA-based therapies. Full article
(This article belongs to the Section Molecular Genetics and Genomics)
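The leave-one-gene-out (LOGO) evaluation mentioned in the abstract above can be illustrated with a minimal sketch. The gene labels and the helper name `leave_one_gene_out` are hypothetical and chosen only for illustration; the abstract does not describe the authors' actual splitting code.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_gene_out(
    genes: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_gene, train_indices, test_indices) for each gene.

    Every sample whose target gene equals the held-out gene goes to the
    test fold, so the model is always evaluated on a gene it never saw.
    """
    for held_out in sorted(set(genes)):
        train = [i for i, g in enumerate(genes) if g != held_out]
        test = [i for i, g in enumerate(genes) if g == held_out]
        yield held_out, train, test


# Hypothetical target-gene label for each siRNA sample (illustrative only).
sample_genes = ["TTR", "TTR", "PCSK9", "HAO1", "PCSK9"]
folds = list(leave_one_gene_out(sample_genes))
for gene, train_idx, test_idx in folds:
    print(gene, train_idx, test_idx)
```

Grouping by gene rather than splitting samples at random prevents siRNAs that target the same gene from appearing on both sides of the split, which is what makes the test a check of extrapolation to unseen genes.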
12 pages, 699 KB  
Article
Reaping the Fruits of LLM Pruning: Towards Small Language Models for Efficient Non-Coding Variant Effect Prediction
by Megha Hegde, Jean-Christophe Nebel and Farzana Rahman
Genes 2025, 16(11), 1358; https://doi.org/10.3390/genes16111358 - 10 Nov 2025
Viewed by 994
Abstract
Background: Interpreting variant effects is essential for precision medicine. Large Transformer-based genomic language models (DNABERT-2, Nucleotide Transformer) capture patterns in coding DNA but scale poorly for non-coding variant prediction because attention complexity grows quadratically with sequence length. Evidence from natural language processing shows that pruning less informative layers can reduce model size and computational load without sacrificing accuracy. Methods: We systematically ablated each Transformer layer in DNABERT-2 and the Nucleotide Transformer to assess its contribution to variant prediction. By observing changes in performance, we built layer importance profiles and created pruned models by removing redundant layers. Pruned and full models were fine-tuned with identical hyperparameters using the Enformer eQTL causal variant dataset, a curated benchmark for non-coding variant effect prediction. Results: Layer ablation revealed that the importance of individual layers varies widely across models; some layers can be removed with little loss in performance while others are critical. After fine-tuning, pruned models achieved accuracy and area under the ROC curve comparable to full models. Additionally, pruned versions required substantially less training time and memory, reducing resource usage by a significant margin. Conclusions: Layer-wise pruning provides a principled strategy for developing compact genomic LLMs. By identifying and removing less critical layers, we produced leaner models that preserve predictive power while lowering computational demands. These efficient models demonstrate how insights from general LLM research can advance genomic variant interpretation and make large-scale non-coding analysis more accessible in research and clinical settings. This approach complements ongoing efforts to optimise Transformer architectures for genomic data. Full article
(This article belongs to the Section Technologies and Resources for Genetics)
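The layer-ablation idea in the abstract above (skip one layer at a time, record the performance drop, then remove layers whose drop is negligible) can be sketched on a toy model. Everything here is an assumption made for illustration: the "layers" are plain scalar functions, the score is negative mean absolute error, and `tolerance` is an arbitrary cut-off. It is not the DNABERT-2 or Nucleotide Transformer procedure.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Layer = Callable[[float], float]
Example = Tuple[float, float]


def run(layers: Sequence[Layer], x: float,
        skip: FrozenSet[int] = frozenset()) -> float:
    """Apply layers in order; a skipped layer acts as the identity."""
    for i, layer in enumerate(layers):
        if i not in skip:
            x = layer(x)
    return x


def score(layers: Sequence[Layer], data: Sequence[Example],
          skip: FrozenSet[int] = frozenset()) -> float:
    """Negative mean absolute error, so higher is better."""
    errors = [abs(run(layers, x, skip) - y) for x, y in data]
    return -sum(errors) / len(errors)


def importance_profile(layers: Sequence[Layer],
                       data: Sequence[Example]) -> Dict[int, float]:
    """Score drop caused by ablating each layer on its own."""
    base = score(layers, data)
    return {i: base - score(layers, data, frozenset({i}))
            for i in range(len(layers))}


def layers_to_keep(profile: Dict[int, float], tolerance: float) -> List[int]:
    """Keep only layers whose removal costs more than `tolerance`."""
    return [i for i in sorted(profile) if profile[i] > tolerance]


# Toy stack: layers 1 and 3 contribute almost nothing to the output.
toy_layers: List[Layer] = [
    lambda x: 2 * x,
    lambda x: x + 0.001,
    lambda x: x + 3,
    lambda x: x * 1.0,
]
toy_data = [(x, run(toy_layers, x)) for x in (0.0, 1.0, 2.0, 3.0)]
profile = importance_profile(toy_layers, toy_data)
kept = layers_to_keep(profile, tolerance=0.01)
print(profile, kept)
```

Single-layer ablation ignores interactions between layers, which is one reason a pruned model would normally be fine-tuned and re-evaluated afterwards, as the abstract describes.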
14 pages, 738 KB  
Opinion
Envisioning the Future of Machine Learning in the Early Detection of Neurodevelopmental and Neurodegenerative Disorders via Speech and Language Biomarkers
by Georgios P. Georgiou
Acoustics 2025, 7(4), 72; https://doi.org/10.3390/acoustics7040072 - 10 Nov 2025
Cited by 1 | Viewed by 1458
Abstract
Speech and language offer a rich, non-invasive window into brain health. Advances in machine learning (ML) have enabled increasingly accurate detection of neurodevelopmental and neurodegenerative disorders through these modalities. This paper envisions the future of ML in the early detection of neurodevelopmental disorders like autism spectrum disorder and attention-deficit/hyperactivity disorder, and neurodegenerative disorders, such as Parkinson’s disease and Alzheimer’s disease, through speech and language biomarkers. We explore the current landscape of ML techniques, including deep learning and multimodal approaches, and review their applications across various conditions, highlighting both successes and inherent limitations. Our core contribution lies in outlining future trends across several critical dimensions. These include the enhancement of data availability and quality, the evolution of models, the development of multilingual and cross-cultural models, the establishment of regulatory and clinical translation frameworks, and the creation of hybrid systems enabling human–artificial intelligence (AI) collaboration. Finally, we conclude with a vision for future directions, emphasizing the potential integration of ML-driven speech diagnostics into public health infrastructure, the development of patient-specific explainable AI, and its synergistic combination with genomics and brain imaging for holistic brain health assessment. Overcoming substantial hurdles in validation, generalization, and clinical adoption, the field is poised to shift toward ubiquitous, accessible, and highly personalized tools for early diagnosis. Full article
(This article belongs to the Special Issue Artificial Intelligence in Acoustic Phonetics)
20 pages, 1014 KB  
Article
Evaluating Retrieval-Augmented Generation Variants for Clinical Decision Support: Hallucination Mitigation and Secure On-Premises Deployment
by Krzysztof Wołk
Electronics 2025, 14(21), 4227; https://doi.org/10.3390/electronics14214227 - 29 Oct 2025
Cited by 1 | Viewed by 5511
Abstract
For clinical decision support to work, medical knowledge needs to be easy to find quickly and accurately. Retrieval-Augmented Generation (RAG) systems combine large language models with document retrieval to help with diagnostic reasoning, but they can hallucinate and must satisfy strict privacy rules in healthcare. We tested twelve RAG variants, including dense, sparse, hybrid, graph-based, multimodal, self-reflective, adaptive, and security-focused pipelines, on 250 de-identified patient vignettes. We used Precision@5, Mean Reciprocal Rank, nDCG@10, hallucination rate, and latency to evaluate system performance. The best retrieval accuracy (P@5 ≥ 0.68, nDCG@10 ≥ 0.67) was achieved by a Haystack pipeline (DPR + BM25 + cross-encoder) and hybrid fusion (RRF). Self-reflective RAG, on the other hand, lowered the hallucination rate to 5.8%. Sparse retrieval gave the fastest response (120 ms), but it was not as accurate. We also suggest a single framework for reducing hallucinations that includes retrieval confidence thresholds, chain-of-thought verification, and external fact-checking. Our findings emphasize pragmatic protocols for the secure implementation of RAG on premises, incorporating encryption, provenance tagging, and audit trails. Future directions encompass the incorporation of clinician feedback and the expansion of multimodal inputs to genomics and proteomics for precision medicine. Full article
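Two of the ingredients named in the abstract above, hybrid fusion via reciprocal rank fusion (RRF) and the Precision@k and reciprocal-rank metrics, can be sketched in a few lines. The document IDs, the relevance set, and the smoothing constant `k=60` (a commonly used default) are assumptions for illustration; the abstract does not state the settings used in the study.

```python
from typing import Dict, List, Sequence, Set


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def precision_at_k(ranked: Sequence[str], relevant: Set[str],
                   k: int = 5) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical outputs of a dense and a sparse retriever for one query.
dense_hits = ["d1", "d2", "d3", "d4"]
sparse_hits = ["d3", "d1", "d5", "d2"]
relevant_docs = {"d3", "d5"}

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
p_at_5 = precision_at_k(fused, relevant_docs, k=5)
rr = reciprocal_rank(fused, relevant_docs)
print(fused, p_at_5, rr)
```

Mean Reciprocal Rank is the average of `reciprocal_rank` over all queries. RRF needs only rank positions, so it can combine retrievers whose raw scores are on different scales.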
27 pages, 1960 KB  
Review
AI and Machine Learning in Biology: From Genes to Proteins
by Zaw Myo Hein, Dhanyashri Guruparan, Blaire Okunsai, Che Mohd Nasril Che Mohd Nassir, Muhammad Danial Che Ramli and Suresh Kumar
Biology 2025, 14(10), 1453; https://doi.org/10.3390/biology14101453 - 20 Oct 2025
Cited by 1 | Viewed by 4945
Abstract
Artificial intelligence (AI) and machine learning (ML), especially deep learning, have profoundly transformed biology by enabling precise interpretation of complex genomic and proteomic data. This review presents a comprehensive overview of cutting-edge AI methodologies spanning from foundational neural networks to advanced transformer architectures and large language models (LLMs). These tools have revolutionized our ability to predict gene function, identify genetic variants, and accurately determine protein structures and interactions, exemplified by landmark milestones such as AlphaFold and DeepBind. We elaborate on the synergistic integration of genomics and protein structure prediction through AI, highlighting recent breakthroughs in generative models capable of designing novel proteins and genomic sequences at unprecedented scale and accuracy. Furthermore, the fusion of multi-omics data using graph neural networks and hybrid AI frameworks has provided nuanced insights into cellular heterogeneity and disease mechanisms, propelling personalized medicine and drug discovery. This review also discusses ongoing challenges including data quality, model interpretability, ethical concerns, and computational demands. By synthesizing current progress and emerging frontiers, we provide insights to guide researchers in harnessing AI’s transformative power across the biological spectrum from genes to functional proteins. Full article
(This article belongs to the Special Issue Artificial Intelligence Research for Complex Biological Systems)
14 pages, 630 KB  
Article
Disease-Specific Prediction of Missense Variant Pathogenicity with DNA Language Models and Graph Neural Networks
by Mohamed Ghadie, Sameer Sardaar and Yannis Trakadis
Bioengineering 2025, 12(10), 1098; https://doi.org/10.3390/bioengineering12101098 - 13 Oct 2025
Viewed by 1706
Abstract
Accurate prediction of the impact of genetic variants on human health is of paramount importance to clinical genetics and precision medicine. Recent machine learning (ML) studies have tried to predict variant pathogenicity with different levels of success. However, most missense variants identified on a clinical basis are still classified as variants of uncertain significance (VUS). Our approach allows for the interpretation of a variant for a specific disease and, thus, for the integration of disease-specific domain knowledge. We utilize a comprehensive knowledge graph, with 11 types of interconnected biomedical entities at diverse biomolecular and clinical levels, to classify missense variants from ClinVar. We use BioBERT to generate embeddings of biomedical features for each node in the graph, as well as DNA language models to embed variant features directly from genomic sequence. Next, we train a two-stage architecture consisting of a graph convolutional neural network to encode biological relationships. A neural network is then used as the classifier to predict disease-specific pathogenicity of variants, essentially predicting edges between variant and disease nodes. We compare performance across different versions of our model, obtaining prediction-balanced accuracies as high as 85.6% (sensitivity: 90.5%; NPV: 89.8%) and discuss how our work can inform future studies in this area. Full article
(This article belongs to the Special Issue AI-Driven Approaches to Diseases Detection and Diagnosis)
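For readers unfamiliar with the metrics quoted in the abstract above (balanced accuracy, sensitivity, NPV), the following sketch shows how they follow from a binary confusion matrix. The counts are invented for illustration and are not the study's data; the function assumes each denominator is non-zero.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics from confusion-matrix counts.

    Assumes the denominators are non-zero (both classes are present and
    at least one negative prediction was made).
    """
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    npv = tn / (tn + fn)           # negative predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "npv": npv,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts: 100 pathogenic and 100 benign variants.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy averages the two per-class recall values, so it is less inflated than plain accuracy when pathogenic and benign labels are imbalanced.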

21 pages, 6390 KB  
Article
Machine Learning-Based Characterization of Bacillus anthracis Phenotypes from pXO1 Plasmid Proteins
by William Harrigan, Thi Hai Au La, Prashant Dahal, Mahdi Belcaid and Michael H. Norris
Pathogens 2025, 14(10), 1019; https://doi.org/10.3390/pathogens14101019 - 8 Oct 2025
Viewed by 982
Abstract
The Bacillus anthracis pXO1 plasmid, encoding ~143 proteins, presents a compact model for exploring protein function and evolutionary patterns using protein language models. Due to the organism’s slow evolutionary rate, its limited amino acid variation enhances detection of physiologically relevant patterns in plasmid protein composition. In this study, we applied embedding-based analyses and machine learning methods to characterize pXO1 protein modules across diverse B. anthracis lineages. We generated protein sequence embeddings, constructed phylogenies, and compared plasmid content with whole-genome variation. While whole-genome and plasmid-based phylogenies diverge, the composition of proteins encoded along the pXO1 plasmid revealed lineage-specific structure. Association rule mining combined with decision tree classification produced plasmid-encoded targets for assessing anthrax sublineage, which yielded functionally redundant protein modules that reflected geographic and phylogenetic patterns. A conserved DNA replication module exhibited both shared and B. anthracis lineage-specific features. These results show that pXO1 plasmid protein modules contain biologically meaningful and evolutionarily informative signatures, exemplifying their value in phylogeographic characterizations of bacterial pathogens. This framework can be extended to study additional virulence plasmids across Bacillus and other environmental pathogens using scalable protein language model tools.
(This article belongs to the Section Bacterial Pathogens)

19 pages, 914 KB  
Article
Large Language Model and Knowledge Graph-Driven AJCC Staging of Prostate Cancer Using Pathology Reports
by Eunbeen Jo, Tae Il Noh and Hyung Joon Joo
Diagnostics 2025, 15(19), 2474; https://doi.org/10.3390/diagnostics15192474 - 27 Sep 2025
Viewed by 1077
Abstract
Background/Objectives: To develop an automated American Joint Committee on Cancer (AJCC) staging system for radical prostatectomy pathology reports using large language model-based information extraction and knowledge graph validation. Methods: Pathology reports from 152 radical prostatectomy patients were used. Five additional parameters (Prostate-specific antigen (PSA) level, metastasis stage (M-stage), extraprostatic extension, seminal vesicle invasion, and perineural invasion) were extracted using GPT-4.1 with zero-shot prompting. A knowledge graph was constructed to model pathological relationships and implement rule-based AJCC staging with consistency validation. Information extraction performance was evaluated using a local open-source large language model (LLM) (Mistral-Small-3.2-24B-Instruct) across 16 parameters. The LLM-extracted information was integrated into the knowledge graph for automated AJCC staging classification and data consistency validation. The developed system was further validated using pathology reports from 88 radical prostatectomy patients in The Cancer Genome Atlas (TCGA) dataset. Results: Information extraction achieved an accuracy of 0.973 and an F1-score of 0.986 on the internal dataset, and 0.938 and 0.968, respectively, on external validation. AJCC staging classification showed macro-averaged F1-scores of 0.930 and 0.833 for the internal and external datasets, respectively. Knowledge graph-based validation detected data inconsistencies in 5 of 150 cases (3.3%). Conclusions: This study demonstrates the feasibility of automated AJCC staging through the integration of large language model information extraction and knowledge graph-based validation. The resulting system enables privacy-protected clinical decision support for cancer staging applications with extensibility to broader oncologic domains.
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

19 pages, 272 KB  
Review
Artificial Intelligence in the Diagnosis of Pediatric Rare Diseases: From Real-World Data Toward a Personalized Medicine Approach
by Nikola Ilić and Adrijan Sarajlija
J. Pers. Med. 2025, 15(9), 407; https://doi.org/10.3390/jpm15090407 - 1 Sep 2025
Cited by 2 | Viewed by 2972
Abstract
Background: Artificial intelligence (AI) is increasingly applied in the diagnosis of pediatric rare diseases, enhancing the speed, accuracy, and accessibility of genetic interpretation. These advances support the ongoing shift toward personalized medicine in clinical genetics. Objective: This review examines current applications of AI in pediatric rare disease diagnostics, with a particular focus on real-world data integration and implications for individualized care. Methods: A narrative review was conducted covering AI tools for variant prioritization, phenotype–genotype correlations, large language models (LLMs), and ethical considerations. The literature was identified through PubMed, Scopus, and Web of Science up to July 2025, with priority given to studies published in the last seven years. Results: AI platforms provide support for genomic interpretation, particularly within structured diagnostic workflows. Tools integrating Human Phenotype Ontology (HPO)-based inputs and LLMs facilitate phenotype matching and enable reverse phenotyping. The use of real-world data enhances the applicability of AI in complex and heterogeneous clinical scenarios. However, major challenges persist, including data standardization, model interpretability, workflow integration, and algorithmic bias. Conclusions: AI has the potential to advance earlier and more personalized diagnostics for children with rare diseases. Achieving this requires multidisciplinary collaboration and careful attention to clinical, technical, and ethical considerations.