Diffusion Models at the Drug Discovery Frontier: A Review on Generating Small Molecules Versus Therapeutic Peptides

Wang, Yiquan; Ma, Yahui; Chang, Yuhan; Yan, Jiayao; Zhang, Jialin; Cai, Minnuo; Wei, Kai

doi:10.3390/biology14121665

Open Access Review


by Yiquan Wang ^1,2,†, Yahui Ma ^1,†, Yuhan Chang ^2, Jiayao Yan ^2, Jialin Zhang ^2, Minnuo Cai ^1 and Kai Wei ^1,*

^1 Xinjiang Key Laboratory of Biological Resources and Genetic Engineering, College of Life Science and Technology, Xinjiang University, Urumqi 830049, China

^2 College of Mathematics and System Sciences, Xinjiang University, Urumqi 830046, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Biology 2025, 14(12), 1665; https://doi.org/10.3390/biology14121665

Submission received: 31 October 2025 / Revised: 15 November 2025 / Accepted: 18 November 2025 / Published: 24 November 2025

(This article belongs to the Section Medical Biology)


Simple Summary

Discovering new medicines is a slow, expensive, and often unsuccessful process. A new type of Artificial Intelligence (AI), known as diffusion models, shows great promise in changing this by designing entirely new drugs on a computer. This review examines how this technology is used to create two major types of medicines: small molecules, which are common in pills, and larger therapeutic peptides. For small molecules, the main challenge for the AI is to design drugs that can actually be created in a chemistry lab. For peptides, the focus is on designing molecules that are stable in the human body, fold into the correct shape to work properly, and do not cause an unwanted immune reaction. Both areas face common hurdles, such as the need for more real-world experimental data to train the AI and more reliable AI evaluation methods to predict a drug’s success. The ultimate goal is to connect these powerful AI design tools with automated robotic labs. This would create a rapid cycle of designing, building, and testing new medicines, transforming drug discovery from a process of slow exploration to one of creating novel, targeted therapies on demand.

Abstract

Diffusion models have emerged as a leading framework in generative modeling, poised to transform the traditionally slow and costly process of drug discovery. This review provides a systematic comparison of their application in designing two principal therapeutic modalities: small molecules and therapeutic peptides. We dissect how the unified framework of iterative denoising is adapted to the distinct molecular representations, chemical spaces, and design objectives of each modality. For small molecules, these models excel at structure-based design, generating novel, pocket-fitting ligands with desired physicochemical properties, yet face the critical hurdle of ensuring chemical synthesizability. Conversely, for therapeutic peptides, the focus shifts to generating functional sequences and designing de novo structures, where the primary challenges are achieving biological stability against proteolysis, ensuring proper folding, and minimizing immunogenicity. Despite these distinct challenges, both domains face shared hurdles: the scarcity of high-quality experimental data, the reliance on inaccurate scoring functions for validation, and the crucial need for experimental validation. We conclude that the full potential of diffusion models will be unlocked by bridging these modality-specific gaps and integrating them into automated, closed-loop Design-Build-Test-Learn (DBTL) platforms, thereby shifting the paradigm from mere chemical exploration to the on-demand engineering of novel therapeutics.

Keywords:

diffusion models; drug discovery; de novo design; small molecules; therapeutic peptides

1. Introduction

1.1. The Bottleneck of Drug Discovery and the Rise of Generative AI

Traditional drug discovery pipelines, reliant on high-throughput screening, which involves the automated testing of large numbers of compounds, and combinatorial chemistry, a method for rapidly creating vast libraries of molecules, are characterized by prolonged development timelines, high attrition rates, and enormous costs. The entire process from target identification to market approval typically spans 10–15 years [1], with the clinical development phase alone requiring a median of 8.3 years [2]. Despite decades of optimization, clinical success rates remain discouragingly low, with only approximately 7.9% of drug candidates entering Phase I trials ultimately receiving regulatory approval [3], though these rates vary significantly across therapeutic areas and have shown dynamic fluctuations throughout the 21st century [4,5,6]. Recent advances in cell and gene therapies have demonstrated distinct success rate profiles, offering new prospects for durable treatments [7]. The financial burden is staggering: while historical estimates reached $2.6 billion per approved drug [8], more recent analyses suggest mean development costs of approximately $879 million based on 2000–2018 data [9], though costs continue to escalate with increasingly complex trial designs and regulatory requirements as evidenced by record-breaking FDA approval trends [10].

The vast chemical space, estimated to contain $10^{60}$ drug-like molecules [11,12], remains largely unexplored through conventional screening approaches. This estimation, originally derived from molecules up to 30 heavy atoms constructed from organic elements [11], has been supported by systematic enumeration studies such as the GDB-17 database containing 166 billion molecules [13,14] and recent explorations of peptide/peptoid chemical space [12]. More conservative estimates suggest approximately $10^{33}$ molecules strictly adhering to Lipinski’s rule of five, a set of physicochemical guidelines used to predict a compound’s potential for oral bioavailability [15], yet even this reduced scope represents a vast and largely unsampled space. Generative Artificial Intelligence (AI) offers a paradigm shift, moving from merely screening existing compounds to creating entirely novel molecules tailored to specific needs. This promise is not merely theoretical; the broader field of generative AI has already begun to deliver tangible results, with dozens of AI-designed small molecules advancing into human clinical trials and demonstrating the potential to significantly shorten discovery timelines and improve success rates [16,17,18]. Early generative models like Variational Autoencoders (VAEs) [19], Generative Adversarial Networks (GANs) [20], and Flow-based models [21,22,23] laid the groundwork but faced limitations in generation quality, training stability, and mode collapse issues. VAEs often produced blurry outputs due to the trade-off between reconstruction and latent loss, while GANs were susceptible to training instability and mode collapse, challenges extensively reviewed in the literature focused on adversarial networks [24]. Flow-based models, in turn, encountered computational efficiency limitations. The distinct trade-offs in performance, stability, and computational cost across these generative families have been systematically compared in several surveys [25,26].

1.2. The Emergence of Diffusion Models

Diffusion models have recently emerged as a highly successful framework in generative modeling, demonstrating competitive and robust capabilities in generating high-quality, diverse samples compared to previous approaches [27]. Their core idea involves a two-step process: a forward diffusion process that incrementally adds Gaussian noise to data according to a predefined variance schedule until it becomes pure noise, and a reverse denoising process where a trained neural network learns to iteratively denoise samples, effectively generating new data from random noise [27,28].

The success of diffusion models extends far beyond a single domain. They have achieved revolutionary breakthroughs in fields like computer vision (e.g., DALL-E 2, Stable Diffusion), audio synthesis, and natural language processing [29,30,31,32], proving an exceptional ability to learn and generate high-quality samples from complex, high-dimensional data. This cross-domain success underscores the framework’s inherent flexibility, which makes it particularly attractive for molecular design, where data is inherently multimodal—encompassing continuous 3D coordinates, discrete atom types, graph structures, and sequential patterns. Moreover, key techniques pioneered for image generation, such as classifier-free guidance [33] for precise control and latent diffusion [30] for computational efficiency, have been successfully adapted to molecular design challenges [34,35]. This powerful combination of generative fidelity and adaptability provides a strong foundation for using diffusion models to create diverse, valid, and novel therapeutics with desired properties.

1.3. Scope and Structure of This Review

This review focuses specifically on the recent surge of diffusion models in drug discovery, primarily drawing from the rapidly evolving literature. For the first time, we systematically compare the application, challenges, and future prospects of this technology in designing two critical drug modalities: small molecules and therapeutic peptides. These modalities were chosen for their immense clinical and commercial importance and their complementary strengths and weaknesses, which create distinct design challenges perfect for a comparative analysis.

Small molecules constitute a substantial portion of approved drugs. Recent FDA approval data from 2023 to 2024 indicate that small molecule drugs (new molecular entities, NMEs) accounted for approximately 55–69% of novel therapeutic approvals [36,37,38,39]. In 2023, the FDA approved 55 new medications consisting of 17 biologics license applications and 38 NMEs, with small molecules representing approximately 55% of total approvals [36]. This approval trend continued in 2024, with 50 NMEs approved, further demonstrating the continued importance of small molecule drugs in modern therapeutics [37]. Small molecules typically have molecular weights below 900 Da, are orally bioavailable, can penetrate cells to target intracellular proteins, and are relatively cost-effective to manufacture. They have been successfully applied to a wide range of diseases, from infectious diseases (antibiotics, antivirals) to chronic conditions (cardiovascular drugs, diabetes medications) to cancer (kinase inhibitors, chemotherapeutics). However, small molecules face significant limitations in targeting certain “undruggable” proteins—targets historically considered intractable for small-molecule intervention due to features like lacking well-defined binding pockets or those involving protein-protein interactions with large, flat interfaces [40,41,42]. These challenging targets have spurred interest in alternative modalities and advanced drug design approaches.

Therapeutic peptides, by contrast, represent a rapidly growing class of drugs, with over 80 peptide drugs currently approved and more than 150 in clinical development [43]. The peptide therapeutics field has experienced remarkable growth, driven by advances in peptide chemistry, delivery technologies, and the clinical success of peptide-based therapeutics such as GLP-1 receptor agonists for diabetes and obesity [43]. Peptides offer several advantages: high specificity and potency (often binding targets with nanomolar to picomolar affinities), low toxicity (due to degradation into natural amino acids), and the ability to target protein-protein interactions and extracellular targets that are challenging for small molecules [41,43]. These characteristics make peptides particularly valuable for addressing targets previously considered "undruggable" by traditional small molecule approaches [40]. However, peptides face significant biological hurdles, such as poor metabolic stability and potential immunogenicity, which limit their therapeutic application and necessitate specialized design considerations [43,44,45,46,47]. These complementary strengths and weaknesses make small molecules and peptides ideal for comparative analysis in the context of AI-driven design.

This review is organized to first introduce the unified framework of diffusion models for molecular generation (Section 2). We then dedicate separate sections to their application in designing small molecules (Section 3) and therapeutic peptides (Section 4), highlighting representative models, performance benchmarks, and domain-specific challenges. Finally, drawing these threads together, we provide a comprehensive head-to-head comparison, discuss the shared hurdles that transcend modality, and outline future research directions toward a fully integrated, closed-loop discovery paradigm (Section 5).

2. The Core Engine: Diffusion Models for Molecular Generation

2.1. Representing Molecules for Diffusion

The choice of molecular representation is fundamental to the design of the diffusion process, as it dictates both the mathematical formulation of the noise process and the architecture of the denoising network [48,49]. For small molecules, representations primarily fall into two categories. One approach utilizes graph-based representations, where molecules are encoded as graphs with atoms as nodes and bonds as edges [50,51,52], allowing diffusion to operate on features like discrete atom types or continuous latent embeddings [49]. An alternative and increasingly prevalent approach employs 3D coordinate-based representations, treating molecules as point clouds of atomic positions in Euclidean space [53,54,55]. This latter representation is particularly suited for structure-based drug design, as it naturally captures spatial relationships critical for protein-ligand interactions. It also necessitates the use of E(3)-equivariant neural networks, architectures designed to respect the rotational and translational symmetries of 3D molecules [56,57,58,59,60].
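The symmetry argument for coordinate-based representations can be illustrated with a short sketch. It is not an equivariant network; it only shows, on toy coordinates chosen for illustration, that rigidly rotating and translating a point cloud leaves all interatomic distances unchanged, which is the property such architectures are built to exploit. The function names and the three-atom geometry are assumptions made for this example.

```python
import math

def rotate_z(points, angle):
    """Rotate 3D points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def translate(points, shift):
    """Shift every point by the same 3D vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in points]

def pairwise_distances(points):
    """All interatomic distances, in a fixed pair order."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

# Toy three-atom point cloud (illustrative coordinates in angstroms).
atoms = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
moved = translate(rotate_z(atoms, 0.7), (5.0, -2.0, 1.0))

original_distances = pairwise_distances(atoms)
moved_distances = pairwise_distances(moved)
```

The raw coordinates in `moved` differ from those in `atoms`, yet the two distance lists agree up to floating-point error, so a model that depended on absolute coordinates would treat one molecule as two different inputs.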

In contrast, the representation of peptides is shaped by their polymeric nature. The most straightforward method is sequence-based, encoding peptides as discrete sequences of amino acid tokens, which requires specialized discrete diffusion processes [61,62,63,64]. Complementing this, structure-based representations capture the peptide’s three-dimensional conformation through the coordinates of backbone and side-chain atoms [65,66], or alternatively, through internal coordinates like torsion angles that inherently respect geometric constraints [67]. These distinct representational paradigms for small molecules and peptides shape the subsequent design of the diffusion models and the type of conditioning information that can be effectively integrated [68,69].
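As a rough illustration of what a discrete noise process on amino acid tokens can look like, the sketch below applies one forward corruption step with a uniform transition kernel: each residue is kept with probability 1 − β and otherwise resampled uniformly from the 20-letter alphabet. The uniform kernel is an assumption chosen for simplicity (absorbing or mask-based kernels are another common option), and the example sequence is a toy string rather than a real peptide.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def corrupt_sequence(sequence, beta, rng):
    """One forward step of a uniform-kernel discrete diffusion (illustrative).

    Each token is resampled uniformly from the alphabet with probability
    `beta`; a resampled token may coincide with the original one.
    """
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < beta else residue
        for residue in sequence
    )

rng = random.Random(0)
peptide = "ACDEFGHIK"  # toy sequence for illustration
slightly_noisy = corrupt_sequence(peptide, beta=0.1, rng=rng)
fully_noisy = corrupt_sequence(peptide, beta=1.0, rng=rng)
```

Repeating the step with a schedule of β values drives any sequence toward the uniform distribution over tokens, which plays the role that pure Gaussian noise plays for continuous coordinates.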

2.2. The Mathematics of Diffusion: Forward and Reverse Processes

The diffusion process consists of two Markov chains [27,28]. The forward process gradually corrupts data $x_0$ by adding Gaussian noise over $T$ timesteps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\right),$$

where $\beta_t$ is a small constant from a predefined variance schedule that controls the amount of noise added at each step. A key property is that we can sample $x_t$ directly from $x_0$:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I\right),$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ [27]. By defining $\alpha_t = 1 - \beta_t$, the term $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ becomes a cumulative product that governs how much of the original signal $x_0$ is preserved at timestep $t$. As $t$ increases, $\bar{\alpha}_t$ decreases towards zero, signifying that the signal progressively fades into noise. Thus, $\bar{\alpha}_t$ can be intuitively understood as a measure of the signal-to-noise ratio at any given step; formally, that ratio is $\bar{\alpha}_t / (1 - \bar{\alpha}_t)$. The reverse process learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right).$$

The model is trained to predict either the noise $\epsilon$ added at each step or the denoised data $x_0$, by minimizing a variational bound on the negative log-likelihood [27,28]. For molecular generation, this framework is adapted to handle both continuous (coordinates) and discrete (atom/bond types, amino acid sequences) variables [64,70,71], often requiring specialized noise processes and network architectures.
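The closed-form forward sampling described in this section can be sketched in a few lines. The example below is a minimal illustration, not any particular model's implementation: the linear schedule and its endpoints (1e-4 to 0.02 over 1000 steps) are assumed values, and the three-number vector stands in for real molecular coordinates.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def sample_xt(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in a single step; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + noise * e for x, e in zip(x0, eps)]
    return xt, eps

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha_bar(betas)
rng = random.Random(0)
x0 = [1.5, -0.3, 0.8]  # toy "coordinates" for illustration
xt, eps = sample_xt(x0, 500, alpha_bars, rng)
```

Because `eps` is returned alongside `xt`, a training loop could use it directly as the regression target for a noise-prediction network, without ever simulating the intermediate steps.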

2.3. Conditional Generation: From Noise to Purpose

Unconditional generation has limited utility in drug design. The key is conditional generation, which steers the generative process toward specific objectives by injecting information—such as a target protein’s binding pocket geometry or desired physicochemical properties—into the denoising network at each timestep. Early approaches relied on classifier guidance, which uses a separately trained classifier to steer sampling by adding its gradient to the score function [72]. However, a more recent and popular strategy is classifier-free guidance, which elegantly avoids the need for a separate model by training a single conditional network that can operate both with and without conditioning information, allowing guidance strength to be tuned at inference time [33]. Another powerful technique, particularly for structure-based tasks, involves integrating conditioning information via cross-attention mechanisms within the denoising network, enabling the model to dynamically attend to relevant features of the conditioning input at each generation step [73]. These techniques provide precise control over the generation process, making them highly suitable for the multi-objective optimization challenges inherent in drug design [35,68].
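The mechanics of classifier-free guidance can be sketched as follows. The combination rule used here, (1 + w) times the conditional prediction minus w times the unconditional one, is one common parameterization of the guidance weight; the `toy_denoiser` is a hypothetical stand-in for a trained network, and the scalar condition is purely illustrative.

```python
import random

NULL_CONDITION = None  # stand-in for the "no conditioning" input

def maybe_drop_condition(condition, drop_prob, rng):
    """Training-time step: randomly hide the condition so that a single
    network learns both the conditional and unconditional predictions."""
    return NULL_CONDITION if rng.random() < drop_prob else condition

def toy_denoiser(x_t, condition):
    """Hypothetical stand-in for a trained noise-prediction network."""
    shift = 0.0 if condition is NULL_CONDITION else condition
    return [x - shift for x in x_t]

def guided_noise(eps_cond, eps_uncond, guidance_weight):
    """Inference-time step: extrapolate away from the unconditional
    prediction; a weight of 0 recovers the conditional prediction."""
    return [(1.0 + guidance_weight) * c - guidance_weight * u
            for c, u in zip(eps_cond, eps_uncond)]

x_t = [1.0, -2.0]
eps_cond = toy_denoiser(x_t, 0.5)
eps_uncond = toy_denoiser(x_t, NULL_CONDITION)
guided = guided_noise(eps_cond, eps_uncond, guidance_weight=2.0)
```

Raising `guidance_weight` pushes samples harder toward the condition, typically at some cost in diversity, which is why it is left as a knob to tune at inference time.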

2.4. Comparison with Other Generative Approaches

To appreciate the advantages of diffusion models, it is instructive to compare them with the generative paradigms that were foundational and previously represented the state-of-the-art in drug design, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Flow-based models, and Autoregressive models. This comparison primarily focuses on the overarching generative framework rather than the underlying neural network architecture, as implementations of these paradigms often share powerful backbones like Graph Neural Networks (GNNs) or Transformers. The key distinctions, therefore, lie in their training objectives and generation mechanisms. For example, VAEs like the Junction Tree VAE [74] learn a continuous latent space but often struggle with posterior collapse and may generate chemically invalid structures [75,76,77]. Similarly, GANs like MolGAN [78] can produce diverse molecules but are notoriously difficult to train [79], frequently suffering from mode collapse and instability [80,81]. Flow-based models such as MoFlow [82], constrained by their invertible architecture, can lack expressiveness for complex molecular graphs [83,84]. Autoregressive models like GraphAF [85] can be slow and suffer from error propagation, where an early mistake compromises the entire structure [35,86,87,88,89,90].

In contrast, diffusion models circumvent many of these issues, which explains their recent ascendancy. Their training is stable and guided by a well-defined denoising objective, avoiding the adversarial instabilities of GANs while consistently producing samples of high quality and diversity [35,91,92,93,94,95,96]. Their framework is remarkably flexible, accommodating both continuous data like 3D coordinates and discrete data like atom types through tailored noise processes [48,70,97]. This adaptability, combined with powerful conditioning techniques like classifier-free guidance [33,98,99], allows for precise control over the iterative refinement process, leading to better global coherence and making them uniquely suited for the multifaceted challenges of molecular design. This entire process, from the core diffusion engine to its specific applications in designing small molecules and therapeutic peptides, is conceptually illustrated in Figure 1.

These distinct generative mechanisms also lead to different trade-offs in computational cost, scalability, and interpretability. While frameworks like VAEs and GANs typically employ their backbone in a single-pass, feed-forward manner for generation, diffusion models operate iteratively, requiring hundreds or thousands of denoising steps. This iterative paradigm makes them more computationally intensive at inference time. Although the scalability of any single denoising step is governed by the underlying backbone (e.g., GNN or Transformer), the total generation cost is the per-step cost multiplied by the number of iterations. This computational overhead, however, is often offset by superior training stability, as diffusion models circumvent the notorious convergence issues and mode collapse that plague GANs. From an interpretability perspective, all these deep generative models face the ‘black box’ challenge. Nevertheless, the step-by-step refinement process of diffusion models may offer unique, albeit still nascent, opportunities for mechanistic insight by allowing observation of the generative trajectory, though deciphering the rationale at each step remains an active area of research [27,33,35,79].

3. Application I: De Novo Design of Small Molecules

3.1. Datasets and Benchmarks for Small Molecule Generation

The development and evaluation of diffusion models for small molecule design rely heavily on large-scale, high-quality datasets. The most widely used benchmark is CrossDocked2020 [100], a dataset containing approximately 22.5 million docked poses from over 100,000 protein-ligand complexes derived from the PDB (Protein Data Bank) through a systematic docking procedure [101]. Each complex includes the 3D coordinates of the protein binding pocket (typically defined as residues within 6–10 Å of the ligand) and the bound ligand, along with docking scores as a proxy for binding affinity. CrossDocked2020 has become the de facto standard for evaluating structure-based drug design models [48,102,103], enabling direct comparison across different approaches including diffusion-based methods [35] and other generative AI techniques [104]. However, it has several acknowledged limitations: the docking scores are computational estimates rather than experimental measurements, the dataset is biased toward certain protein families (kinases and proteases are over-represented), and the ligands are primarily known drugs or drug-like molecules, limiting chemical diversity. These limitations have motivated ongoing efforts to develop more diverse and experimentally validated benchmarks for the field.
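The distance-based pocket definition mentioned above can be sketched as a simple filter. This is an illustrative reading of "residues within 6–10 Å of the ligand", assuming a residue is kept if any of its atoms lies within the cutoff of any ligand atom; the residue labels and coordinates are hypothetical, and real pipelines would parse them from structure files.

```python
import math

def pocket_residues(residue_atoms, ligand_atoms, cutoff=8.0):
    """Return residue ids with at least one atom within `cutoff`
    angstroms of any ligand atom (assumed selection rule)."""
    selected = []
    for residue_id, atoms in residue_atoms.items():
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            selected.append(residue_id)
    return selected

# Hypothetical toy structure: residue id -> list of atom coordinates.
residue_atoms = {
    "ASP25": [(3.0, 4.0, 0.0)],    # 5 angstroms from the ligand atom
    "GLY48": [(0.0, 0.0, 9.0)],    # 9 angstroms away
    "LEU90": [(20.0, 0.0, 0.0)],   # far from the binding site
}
ligand_atoms = [(0.0, 0.0, 0.0)]

tight_pocket = pocket_residues(residue_atoms, ligand_atoms, cutoff=6.0)
wide_pocket = pocket_residues(residue_atoms, ligand_atoms, cutoff=10.0)
```

The choice of cutoff changes which residues the model is conditioned on, so results reported with different pocket definitions are not always directly comparable.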

For property-based generation and conformer generation tasks, the GEOM-Drugs dataset [105] is commonly used, containing approximately 430,000 drug-like molecules with pre-computed 3D conformers generated using RDKit [106] and optimized with semi-empirical quantum chemistry methods. This dataset enables training of models that learn the distribution of molecular geometries and can generate diverse, low-energy conformers [53,107,108]. The ZINC database [109], containing over 230 million purchasable compounds, is often used for pre-training or as a source of negative examples. The QM9 dataset [110], containing approximately 134,000 small organic molecules with quantum chemical properties computed at the DFT level, is used for benchmarking models on property prediction tasks, though its molecules are smaller and simpler than typical drug candidates.

A critical limitation across all datasets is the scarcity of experimentally validated binding affinity data [111,112,113]. While databases like BindingDB [114] and ChEMBL [115] contain millions of bioactivity measurements, only a small fraction include high-resolution 3D structures of protein-ligand complexes. Beyond scarcity, the quality and heterogeneity of this data present a fundamental challenge to model reproducibility. Bioactivity measurements (e.g., K_i, K_d, IC₅₀) are often aggregated from diverse assays with varying experimental conditions, introducing significant noise and inconsistencies. This lack of standardized data curation and reporting protocols directly undermines a model’s ability to learn robust structure-activity relationships, thereby compromising its generalization capabilities. Consequently, establishing rigorous data management standards is as critical as developing new algorithms for building truly predictive and reproducible generative models. This data challenge further motivates the development of transfer learning and semi-supervised approaches [116,117,118,119,120] that can leverage large unlabeled datasets while being robust to label noise.

3.2. Structure-Based Drug Design (SBDD)

The central task in SBDD is to generate molecules that are geometrically and chemically complementary to a given protein binding pocket, maximizing binding affinity while maintaining drug-like properties. Diffusion models have shown remarkable success in this domain by learning to generate molecules directly in the 3D space of the binding pocket.

Pocket2Mol [121], one of the pioneering works in 2022, employs a two-stage approach: first generating molecular scaffolds as a set of 3D points, then predicting atom and bond types for these points. The model is conditioned on pocket atom coordinates and features through a cross-attention mechanism, achieving 68.4% pose selection accuracy on the CrossDocked2020 benchmark. The model generates molecules with high validity (>95%) and uniqueness (>90%), demonstrating the capability of diffusion models to produce chemically valid structures.

DiffSBDD [103] introduces an SE(3)-equivariant graph neural network architecture that jointly diffuses over atomic coordinates and discrete atom types. By incorporating pocket information through a joint graph representation of the pocket and the growing molecule, DiffSBDD achieves superior performance in generating molecules with favorable predicted binding affinities. On the CrossDocked2020 dataset, DiffSBDD generates molecules with a median Vina score of −7.5 kcal/mol, outperforming previous autoregressive and VAE-based approaches. The Vina score is a computational estimate of binding affinity, where more negative values indicate a stronger predicted interaction. Importantly, the model demonstrates the ability to generate molecules that form key interactions (hydrogen bonds, hydrophobic contacts) with critical pocket residues, as validated through molecular dynamics simulations.

TargetDiff [48,59] further advances the field by introducing a target-aware diffusion process that explicitly models the protein-ligand interaction energy during generation. By incorporating a learned energy function that estimates binding affinity, TargetDiff demonstrates improved performance in generating high-affinity binders while maintaining molecular diversity across different regions of chemical space with strong pocket complementarity.

Building upon these foundational approaches, recent work has explored dual diffusion frameworks and pharmacophore-oriented generation. Huang et al. [73] introduced a dual diffusion model that enables both de novo 3D molecule generation and lead optimization, providing a unified framework for structure-based drug discovery. More recently, pharmacophore-oriented approaches [122] have emerged to incorporate explicit constraints on the pharmacophore—the essential three-dimensional arrangement of molecular features required for biological activity—during the diffusion process, enabling more efficient feature-customized drug discovery by directly controlling key molecular properties and interaction patterns.

A primary challenge that remains is the precise modeling of key molecular interactions, such as hydrogen bonds, salt bridges, and π-π stacking [123]. Furthermore, systematic benchmarks reveal persistent challenges in achieving accurate 3D spatial modeling, as many generated structures show significant deviations from energy-minimized references, especially for larger molecules [124]. While current models can generate molecules that occupy the binding pocket, ensuring that specific pharmacophoric features are correctly positioned to form critical interactions with the protein remains difficult. Additionally, the generated molecules often require post-processing steps, such as bond order correction and protonation state assignment, to ensure chemical validity [125].

3.3. Property-Based Ligand Design and Optimization

This area focuses on generating molecules that satisfy multiple objectives simultaneously, such as high binding affinity, favorable drug-likeness measured by metrics like the Quantitative Estimate of Drug-likeness (QED) [126], where values closer to 1 suggest a better drug-like profile, appropriate lipophilicity (logP), low toxicity, high membrane permeability, and synthetic accessibility (SA). To achieve this multi-objective optimization, several property-guided generation approaches have been developed. Conditional diffusion models, for example, are trained to generate molecules with specific property values by directly conditioning on target property vectors [48]. These models can produce molecules with specified molecular weight, logP, and hydrogen bond donor/acceptor counts with reasonable accuracy [48]. Alternatively, guidance-based methods employ pre-trained property predictors to steer the diffusion sampling process at inference time [35,127,128]. By computing the gradients of property predictors with respect to the molecular representation, these techniques can navigate the chemical space to optimize multiple properties simultaneously [53].
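Guidance with a property predictor can be sketched as a shift of the denoising mean along the predictor's gradient. The example below is a toy under stated assumptions: the "predictor" is a hand-written quadratic log-probability centered on a target vector rather than a trained model, the two-dimensional vector stands in for a molecular representation, and the update rule (mean plus scale times variance times gradient) follows the general form of classifier guidance.

```python
def toy_log_prob_gradient(x, target):
    """Gradient of a toy log p(y | x) = -0.5 * ||x - target||^2.
    A real system would differentiate a trained property predictor."""
    return [t - xi for xi, t in zip(x, target)]

def guided_mean(mean, variance, grad, scale):
    """Shift the reverse-step mean toward higher predicted property value."""
    return [m + scale * variance * g for m, g in zip(mean, grad)]

x_t = [0.0, 0.0]          # current noisy sample (toy representation)
target = [1.0, -1.0]      # hypothetical property optimum in that space
mean = [0.0, 0.0]         # unguided mean proposed by the denoiser
grad = toy_log_prob_gradient(x_t, target)
shifted = guided_mean(mean, variance=0.1, grad=grad, scale=2.0)
```

With several predictors, their gradients can be summed with separate scales, which is one simple way to trade off objectives at sampling time without retraining the generator.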

However, optimizing for multiple, often conflicting, objectives remains a significant challenge. For instance, increasing lipophilicity (logP) to improve membrane permeability may concurrently decrease aqueous solubility and heighten toxicity risk. To address this, recent work has explored more sophisticated frameworks. Some studies focus on generating diverse molecules along the Pareto front, providing a set of candidates that represent different trade-offs between objectives [129]. Other advanced strategies include using reinforcement learning to dynamically balance competing goals [130,131] and developing dual diffusion architectures for simultaneous optimization across multiple design criteria [99].
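The notion of a Pareto front over candidate molecules can be made concrete with a small filter. In this sketch every objective is oriented so that larger is better (for instance, a negated docking score and a drug-likeness score), and the candidate names and values are hypothetical numbers chosen only to illustrate dominance.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and
    strictly better on at least one (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Names of candidates that no other candidate dominates."""
    return [
        name for name, score in candidates.items()
        if not any(dominates(other, score)
                   for other_name, other in candidates.items()
                   if other_name != name)
    ]

# Hypothetical (negated docking score, drug-likeness) pairs.
candidates = {
    "mol_a": (7.5, 0.40),
    "mol_b": (6.8, 0.75),
    "mol_c": (6.5, 0.60),  # worse than mol_b on both objectives
    "mol_d": (8.1, 0.35),
}
front = pareto_front(candidates)
```

Here `mol_c` is dropped because `mol_b` beats it on both objectives, while the remaining three each represent a different trade-off that a chemist could choose between.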

Ensuring the synthesizability of the generated molecules remains a major and persistent challenge in the field. While diffusion models can generate chemically valid molecules (as determined by valence rules and RDKit sanitization), these molecules may be synthetically inaccessible or require prohibitively complex synthetic routes. Synthetic accessibility scores (SA scores [132]) provide a rough estimate [127,131], but they do not guarantee that a practical synthesis route exists. Recent efforts have focused on incorporating models for retrosynthesis, a computational technique for planning chemical synthesis by working backward from the target molecule, into the generation process. This is done either by using retrosynthesis feasibility as an additional objective [133] or by generating molecules in a retrosynthetically aware manner, building them from commercially available building blocks through known reaction templates [134,135]. Alternative approaches evaluate synthesizability by combining retrosynthetic planning with forward reaction prediction to verify route feasibility [136]. Methods that optimize molecular geometry and structural stability have also been proposed to improve the practical viability of generated candidates [53]. Despite these advances, the gap between computational generation and experimental synthesis remains a critical bottleneck [104,137], a synthesis barrier that has been identified as a major challenge limiting the real-world impact of generative AI in pharmaceutical development [138]. Bridging this gap by integrating generative models with retrosynthesis prediction and automated experimental validation remains a central goal for the field [139]. The challenge is shared across modalities: the synthetic accessibility hurdle for small molecules finds its critical counterpart in the biological stability and production challenges inherent to therapeutic peptides (Section 4).

4. Application II: Innovative Design of Therapeutic Peptides

4.1. Datasets and Benchmarks for Peptide Design

Peptide and protein design models rely on fundamentally different datasets compared to small molecule models, reflecting the distinct nature of biopolymers. The Protein Data Bank (PDB) [101], containing over 240,000 experimentally determined protein structures (as of 2024), serves as the primary source of structural data. For training diffusion models on protein backbones, curated high-quality subsets are typically used to ensure structural diversity and avoid redundancy: the CATH database [140,141] (containing 601,493 domains from over 150,000 PDB structures, classified by architecture and topology) and the SCOPe database [142,143] (classifying 344,851 domains from 106,976 PDB entries by structural and evolutionary relationships). These datasets enable models to learn the principles of protein folding, namely the allowed backbone geometries, secondary structure propensities, and tertiary packing arrangements.

For sequence-based models, much larger datasets are available. UniProt [144,145], containing over 246 million protein sequences, provides a vast resource for learning sequence patterns and evolutionary relationships. The UniRef50 and UniRef90 datasets [146] (clustered at 50% and 90% sequence identity, respectively) are commonly used for training, providing non-redundant reference clusters that enable models to learn amino acid co-evolution patterns, functional motifs, and sequence-structure relationships. The recent AlphaFold Database [147,148], containing predicted structures for over 214 million proteins, has dramatically expanded the available structure data, though the quality varies and experimental validation is limited.

For specific peptide design tasks, specialized datasets exist. The Antimicrobial Peptide Database (APD3) contains approximately 3000 experimentally validated antimicrobial peptides with activity data (MIC values, target organisms) [149]. The Database of Antimicrobial Activity and Structure of Peptides (DBAASP) contains over 15,000 entries with detailed activity annotations [150]. For cell-penetrating peptides, CPPsite contains approximately 1800 entries [151,152]. However, these specialized datasets are much smaller than those available for small molecules. Furthermore, they suffer from significant data heterogeneity, a challenge that directly impacts model reproducibility and the creation of reliable benchmarks. For instance, antimicrobial activity measured as Minimum Inhibitory Concentration (MIC) can vary dramatically depending on the bacterial strain and assay protocol, while cell-penetrating efficiency lacks a universally accepted standard metric. This inconsistency makes it difficult to harmonize data for training robust, generalizable predictive models and underscores the critical need for community-wide standards in peptide bioactivity reporting.
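
One small, mechanical part of this harmonization problem is unit heterogeneity: MIC values are reported in either mass or molar concentration. The sketch below converts µg/mL to µM using the standard relation µM = (µg/mL × 1000) / MW, with MW in g/mol. The record layout, field names, and example values are hypothetical, and the conversion does nothing to reconcile differences in bacterial strain or assay protocol.

```python
def mic_to_micromolar(value: float, unit: str, mol_weight_g_per_mol: float) -> float:
    """Express a MIC value in micromolar, given the peptide's molecular weight."""
    if mol_weight_g_per_mol <= 0:
        raise ValueError("molecular weight must be positive")
    if unit == "uM":
        return value
    if unit == "ug/mL":
        # 1 ug/mL = 1e-3 g/L; dividing by g/mol gives mol/L; times 1e6 gives uM.
        return value * 1000.0 / mol_weight_g_per_mol
    raise ValueError(f"unsupported unit: {unit}")


# Hypothetical records as they might appear after merging two databases.
records = [
    {"peptide": "pep_A", "mic": 4.0, "unit": "ug/mL", "mw": 2000.0},
    {"peptide": "pep_B", "mic": 1.5, "unit": "uM", "mw": 1500.0},
]

harmonized = {
    r["peptide"]: mic_to_micromolar(r["mic"], r["unit"], r["mw"]) for r in records
}
print(harmonized)
```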

A critical challenge is the scarcity of experimentally validated peptide-protein interaction data with structural information. While databases like PDBbind [153,154] contain thousands of protein-ligand complexes, only a small fraction involve peptide ligands. The lack of large-scale, high-quality training data for peptide binder design motivates the use of transfer learning from general protein structure prediction models (e.g., AlphaFold2 [155], RoseTTAFold [156]) and the development of physics-informed models that incorporate biophysical priors.

4.2. Generation of Functional Peptide Sequences

The goal here is to generate amino acid sequences with specific biological functions, such as antimicrobial peptides (AMPs), cell-penetrating peptides (CPPs), or peptides with specific binding properties. This task typically employs discrete diffusion models [61,157,158,159], which are adapted to handle the categorical nature of amino acid data. Pioneering work has demonstrated sequence-only generation without requiring structural information [61], with recent advances enabling multi-objective optimization for therapeutic properties [157], length-controlled peptide design [159], and applications in practical binder design [66,68].

Discrete diffusion models for peptide sequences operate by gradually corrupting amino acid sequences through a process of random token replacement or masking, then learning to reverse this process. The foundational work in this area proposed several noise processes, including uniform transition matrices, absorbing state models (in which residues are progressively replaced by a special MASK token that is never left once reached), and learned transition matrices that respect amino acid similarity [64]. The uniform transition approach, for instance, has been applied in subsequent peptide generation models [61]. The choice among these noise processes carries significant practical implications for peptide design. The uniform transition matrix, while the simplest to implement, disregards the inherent biochemical similarities between amino acids, treating a transition from alanine to valine (both hydrophobic) the same as one to lysine (hydrophilic). The absorbing state model is particularly well-suited for tasks like sequence inpainting or constrained generation, as the MASK token provides a clear demarcation between fixed regions and those to be generated. Finally, learned transition matrices offer the most sophisticated approach, allowing the model to incorporate prior knowledge, such as amino acid substitution matrices (e.g., BLOSUM), which can potentially improve learning efficiency and generate more biologically plausible intermediate sequences.
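
The difference between the uniform and absorbing noise processes can be made concrete with a small sketch. It builds the two single-step transition matrices in their usual form (stay with probability 1 − β, otherwise either resample uniformly or jump to the mask state) and applies one forward corruption step to a peptide. The `#` mask symbol, the value β = 0.3, and the example sequence are illustrative assumptions, and no learned reverse (denoising) model is included.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 canonical residues
MASK = "#"                              # hypothetical symbol for the absorbing state
ALPHABET_WITH_MASK = AMINO_ACIDS + MASK


def uniform_matrix(beta, k):
    """Q[i][j]: stay with prob. 1 - beta, otherwise resample uniformly over k states."""
    return [[(1.0 - beta) * (i == j) + beta / k for j in range(k)] for i in range(k)]


def absorbing_matrix(beta, k):
    """Q[i][j]: stay with prob. 1 - beta, otherwise jump to the mask state (index k - 1)."""
    mask = k - 1
    rows = []
    for i in range(k):
        row = [0.0] * k
        if i == mask:
            row[mask] = 1.0          # the mask state is never left once reached
        else:
            row[i] = 1.0 - beta
            row[mask] = beta
        rows.append(row)
    return rows


def corrupt(seq, q, alphabet, rng):
    """Apply one forward (noising) step to every position independently."""
    return "".join(
        rng.choices(alphabet, weights=q[alphabet.index(ch)], k=1)[0] for ch in seq
    )


rng = random.Random(0)
peptide = "GIGKFLHSAKKF"
noisy_uniform = corrupt(peptide, uniform_matrix(0.3, 20), AMINO_ACIDS, rng)
noisy_masked = corrupt(peptide, absorbing_matrix(0.3, 21), ALPHABET_WITH_MASK, rng)
print(noisy_uniform)
print(noisy_masked)
```

Under the uniform process a corrupted position can become any residue, whereas under the absorbing process it can only become the mask symbol, which is why the latter maps naturally onto inpainting. A similarity-aware matrix would replace the uniform off-diagonal entries with substitution-dependent weights.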

Recent studies have demonstrated that deep generative and foundation models can successfully design antimicrobial peptides (AMPs) with predicted and experimentally validated activity comparable to, or even exceeding, that of natural AMPs [160,161,162,163]. Models are typically trained on curated datasets of a few thousand sequences drawn from larger public databases such as APD3, DBAASP, or DRAMP, which contain up to 22,000 entries [164]. For instance, a recent generative model was trained on a specific set of 3280 MIC-labeled AMPs [162]. These approaches generate novel sequences with experimentally confirmed minimum inhibitory concentrations (MICs) in the low-micromolar range against common pathogens like E. coli and S. aureus; for example, validated MICs ranging from 0.20 to 15.18 μM have been reported [162], with other generative frameworks also confirming potent hits [160]. Importantly, these generated peptides often exhibit substantial sequence novelty, with one study reporting a median sequence identity of approximately 35% to any example in the training set, indicating true de novo design rather than memorization [162].
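
A novelty check of the kind described above can be sketched as follows: for each generated peptide, compute its highest identity to any training sequence, then summarize across the generated set. The identity measure used here (longest common subsequence divided by the longer length) is a simple stand-in chosen for self-containment and is not necessarily the alignment-based definition used in the cited study. All sequences are toy examples.

```python
from statistics import median


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Fraction of matched positions relative to the longer sequence (a proxy)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def max_identity_to_training(seq: str, training: list) -> float:
    """Identity to the closest training sequence; low values suggest novelty."""
    return max(identity(seq, t) for t in training)


# Toy data for illustration only.
training_set = ["KLAKLAKKLAKLAK", "GIGKFLHSAKKFGKAFVGEIMNS"]
generated = ["KLAKLAK", "WRWRWRWR", "GIGKFLHSAK"]

nearest = [max_identity_to_training(s, training_set) for s in generated]
median_nearest_identity = median(nearest)
print(nearest, median_nearest_identity)
```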

In peptide design, particularly for AMPs, diffusion models have been steered using strategies such as text guidance or post-generation property filtering (e.g., on net charge or hydrophobicity) [165,166]. The application of similar methods to CPPs, especially by explicit conditioning on predicted membrane permeability, is an emerging area that could leverage advances in CPP prediction models [167]. Some generated peptides have shown predicted (in silico) or measured (in vitro) cellular uptake efficiencies comparable to canonical CPPs such as TAT under specific assay conditions [168,169], showcasing the potential to explore novel sequence space. However, systematic experimental validation remains a significant bottleneck. Recent reviews emphasize the persistent gap between computational predictions and functional confirmation, a key challenge in translating in silico designs into effective therapeutics [170,171,172].
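
Post-generation property filtering of this kind can be sketched in a few lines. The example estimates net charge by counting Lys/Arg against Asp/Glu (ignoring histidine and the termini) and computes mean hydropathy on the Kyte-Doolittle scale. The thresholds and sequences are hypothetical and are not taken from the cited work.

```python
# Kyte-Doolittle hydropathy values for the 20 canonical amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def net_charge(seq: str) -> int:
    """Crude net charge near neutral pH: +1 per K/R, -1 per D/E."""
    return sum((ch in "KR") - (ch in "DE") for ch in seq)


def mean_hydropathy(seq: str) -> float:
    """Average Kyte-Doolittle hydropathy over the sequence."""
    return sum(KYTE_DOOLITTLE[ch] for ch in seq) / len(seq)


def passes_filter(seq: str, min_charge: int = 2, hydropathy_range=(-1.0, 1.0)) -> bool:
    """Keep cationic, moderately amphipathic candidates (thresholds are assumptions)."""
    low, high = hydropathy_range
    return net_charge(seq) >= min_charge and low <= mean_hydropathy(seq) <= high


generated = ["KLAKLAKKLAKLAK", "DDDEEE", "LLLLLLLL"]
kept = [s for s in generated if passes_filter(s)]
print(kept)
```

Filtering after generation is simple but wasteful, since rejected samples still cost compute; conditioning the model during generation avoids that at the price of a more complex training setup.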

A key advantage of diffusion models over previous generative approaches (such as RNNs or VAEs) is their ability to generate highly diverse sequences while maintaining high validity [61,173,174,175]. Recent studies report that sequence validity, defined as the generation of valid amino acid strings of a desired length, is consistently near-perfect, typically 98–100% [61,173,175]. Simultaneously, these models demonstrate substantially greater sequence diversity compared to VAE or language model baselines, producing broader and less redundant libraries that better span natural sequence and functional spaces [61,173,175]. While sequence-based generation is valuable for designing peptides with specific functional properties, many therapeutic applications require precise control over 3D structure and binding geometry. This motivates the development of structure-guided design approaches, which we explore next.
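
The validity and diversity metrics discussed above can be computed as in the sketch below. Validity follows the definition given in the text (canonical residues only, length within a desired range). Diversity is measured as the mean pairwise normalized edit distance, which is one common choice among several, so exact definitions may differ between the cited studies. The length bounds and sequences are illustrative.

```python
from itertools import combinations

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(seq: str, min_len: int = 5, max_len: int = 30) -> bool:
    """Valid if the length is in range and every character is a canonical residue."""
    return min_len <= len(seq) <= max_len and set(seq) <= AMINO_ACIDS


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def mean_pairwise_distance(seqs) -> float:
    """Average edit distance over all pairs, normalized by the longer sequence."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)


# Toy model output; "X" is not a canonical residue.
generated = ["KLAKLAK", "KLAXLAK", "GIGKFLHSAK", "KLAKLAK"]

valid = [s for s in generated if is_valid(s)]
validity = len(valid) / len(generated)
unique_valid = sorted(set(valid))
uniqueness = len(unique_valid) / len(valid)
diversity = mean_pairwise_distance(unique_valid)
print(validity, uniqueness, diversity)
```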

4.3. Structure-Guided De Novo Peptide Design

A more ambitious goal is to directly generate peptides that fold into specific 3D structures or bind to target protein surfaces with high affinity and specificity. This includes not only linear peptides but also larger, structurally defined mini-proteins that function as peptide mimetics. This task requires modeling both sequence and structure simultaneously, since the sequence must be compatible with the desired fold and the structure must be stable and functional [176,177]. Recent deep learning advances, particularly diffusion-based methods, have made significant progress toward achieving this goal [178,179].

RFdiffusion, a landmark model in this area, has significantly advanced structure-guided protein and peptide design [65]. Built upon the RoseTTAFold structure prediction network [156], RFdiffusion performs diffusion directly on protein backbone coordinates (represented as rigid body transformations of residue frames) while maintaining SE(3) equivariance [65]. The model can be conditioned on various structural constraints, including target protein surfaces for binder design, desired secondary structure motifs (helices, sheets), or functional site geometries [65].
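
To make the symmetry requirement concrete, the sketch below checks two properties on toy coordinates: inter-residue distances are invariant under a rigid motion, and the centroid is equivariant (transforming the input and then computing the centroid gives the same point as computing the centroid and then transforming it). This is only a numerical illustration of the definitions, restricted to a rotation about the z-axis plus a translation; it is not RFdiffusion's frame representation or network, and the coordinates are hypothetical.

```python
import math


def rigid_transform(coords, angle, translation):
    """Rotate points about the z-axis by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in coords]


def centroid(coords):
    """Mean position of a set of 3D points."""
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))


def pairwise_distances(coords):
    """All inter-point distances, in a fixed pair order."""
    return [math.dist(p, q) for i, p in enumerate(coords) for q in coords[i + 1:]]


# Hypothetical C-alpha coordinates (in angstroms) for a four-residue fragment.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0), (8.9, 4.1, 1.7)]
angle, shift = 0.7, (10.0, -2.0, 5.0)
moved = rigid_transform(ca, angle, shift)

# Invariance: distances are unchanged by the rigid motion.
distance_gap = max(
    abs(a - b) for a, b in zip(pairwise_distances(ca), pairwise_distances(moved))
)

# Equivariance: centroid(transform(x)) equals transform(centroid(x)).
centroid_gap = math.dist(
    centroid(moved), rigid_transform([centroid(ca)], angle, shift)[0]
)
print(distance_gap, centroid_gap)
```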

RFdiffusion has demonstrated remarkable success in designing mini-protein binders, a breakthrough that directly paves the way for creating structurally defined peptides with high efficacy [65]. When tasked with designing binders to challenging protein targets such as influenza hemagglutinin, for instance, RFdiffusion generates backbones that, after sequence design using ProteinMPNN [180], achieve experimental binding affinities in the nanomolar range (e.g., a K_D of 28 nM for an influenza binder) in approximately 19% of tested designs [65]. This success rate is substantially higher than previous computational design methods, which typically achieved success rates below 5% [181,182]. The designed binders often exhibit novel folds not present in natural proteins, demonstrating the model’s ability to explore diverse and novel structural topologies within the protein fold space [65]. Furthermore, the approach has been successfully extended to designing high-affinity binders for challenging helical peptide targets, yielding picomolar to sub-nanomolar affinities [183].

The typical workflow, largely established by the developers of RFdiffusion [65], is a hybrid approach that combines two distinct generative stages with two subsequent validation steps. First, RFdiffusion (a diffusion model) generates a peptide backbone (continuous coordinates) that is geometrically complementary to the target protein surface, with the diffusion process conditioned on the target structure and the desired binding interface residues. Second, a sequence design model such as ProteinMPNN [180] (a GNN-based, non-diffusion model) or ESM-IF [184] performs inverse folding, designing an amino acid sequence (discrete tokens) compatible with the generated backbone. Third, the resulting designs undergo computational validation using high-accuracy structure prediction models such as AlphaFold2 [155] or RoseTTAFold [156] to verify that the designed sequence folds into the intended structure and maintains the desired binding geometry. Finally, promising candidates proceed to experimental validation through protein expression, purification, and binding assays. The two-stage generative core is significant because it shows that structure-guided sequence design currently relies on pairing a powerful backbone diffusion model with a specialized, non-diffusion inverse folding tool. A pure diffusion solution capable of generating both structure and sequence simultaneously remains an active area of research.
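
The control flow of this workflow can be sketched as a short pipeline skeleton. The three callables stand in for the backbone generator, the inverse folding model, and the structure predictor; they are hypothetical placeholders and do not reflect the real interfaces of RFdiffusion, ProteinMPNN, or AlphaFold2. The confidence and self-consistency thresholds are likewise illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Design:
    backbone_id: str
    sequence: str
    plddt: float      # predicted confidence, 0-100
    sc_rmsd: float    # angstroms between designed and re-predicted backbone


def run_pipeline(
    n_backbones: int,
    generate_backbone: Callable[[int], str],
    design_sequence: Callable[[str], str],
    predict_structure: Callable[[str, str], Tuple[float, float]],
    plddt_min: float = 80.0,
    rmsd_max: float = 2.0,
) -> List[Design]:
    """Generate, sequence-design, and computationally filter candidate binders."""
    accepted = []
    for i in range(n_backbones):
        backbone = generate_backbone(i)                          # stage 1: backbone diffusion
        sequence = design_sequence(backbone)                     # stage 2: inverse folding
        plddt, sc_rmsd = predict_structure(sequence, backbone)   # in silico validation
        if plddt >= plddt_min and sc_rmsd <= rmsd_max:
            accepted.append(Design(backbone, sequence, plddt, sc_rmsd))
    return accepted  # these would proceed to experimental testing


# Toy stand-ins so the skeleton runs end to end.
def toy_backbone(i: int) -> str:
    return f"backbone_{i}"


def toy_sequence(backbone: str) -> str:
    return "ACDEFGHIKL"


def toy_predict(sequence: str, backbone: str) -> Tuple[float, float]:
    index = int(backbone.split("_")[1])
    return (90.0, 1.0) if index % 2 == 0 else (60.0, 4.0)


designs = run_pipeline(4, toy_backbone, toy_sequence, toy_predict)
print([d.backbone_id for d in designs])
```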

Despite these successes, significant challenges unique to peptide therapeutics remain. Generated peptides must be engineered for proteolytic stability to overcome their inherently short in vivo half-lives, a consideration often addressed by incorporating non-canonical amino acids or cyclization, which are not yet fully integrated into diffusion workflows [185]. Furthermore, minimizing potential immunogenicity by avoiding T-cell epitopes—specific peptide fragments recognized by the immune system that can trigger an unwanted response—is a critical design constraint that requires sophisticated predictive modeling [186]. Ultimately, ensuring that the designed sequence not only folds into the intended conformation but also remains stable and avoids aggregation is paramount, as current models may not fully capture the subtle side-chain interactions governing these properties [187]. Integrating these complex biological and biophysical constraints into the next generation of generative models represents a critical frontier for the field.

While the specific bottlenecks differ, the parallel evolution of diffusion models in these two domains invites a systematic comparison. Synthesizing these distinct modality-specific insights is essential to identify the shared fundamental challenges and to envision the future trajectory of the field toward a unified, automated discovery paradigm.

5. Comparison, Challenges, and Future Perspectives

5.1. A Head-to-Head Comparison: Small Molecules vs. Peptides

The fundamental differences in applying diffusion models to small molecules and peptides are visually contrasted in Figure 2 and further detailed in Table 1. This comparison highlights distinct challenges and opportunities in each domain, providing a clear framework for understanding the current landscape. As illustrated, the design of small molecules is fundamentally a challenge of navigating a vast, discrete chemical space to ensure chemical synthesizability, whereas peptide design is a problem of conquering a continuous conformational space to achieve biological stability. These core distinctions dictate everything from molecular representation to the primary validation hurdles, shaping two related yet distinct fields of AI-driven discovery. Beyond these qualitative differences, quantitative performance metrics reveal the maturity and capabilities of current diffusion-based approaches in each domain, as detailed in Table 2.

5.2. Shared Hurdles and Common Challenges

Despite their differences, the deployment of diffusion models in both small molecule and peptide design is hampered by several shared obstacles. Perhaps the most universal bottleneck is the reliance on imperfect scoring functions to evaluate generated candidates. Current approaches depend heavily on computational proxies like docking scores or predicted affinities, which often show poor correlation with experimental reality and lead to high false-positive rates in downstream validation [199,200,201,202,203,204,205,206,207]. This challenge is directly exacerbated by the scarcity of high-quality labeled data. While vast repositories exist [101,109,147,148,208,209], data that pairs molecular structures with experimentally validated, high-fidelity biological activity or binding affinity is a rare commodity, limiting the predictive power of supervised models [100,210]. Promising mitigation strategies include physics-informed modeling, active learning, and transfer learning, but fundamental limitations remain [176,177,211,212,213,214,215,216,217].
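
How well a computational proxy tracks experiment is often summarized with a rank correlation. The sketch below computes Spearman's coefficient (the Pearson correlation of the ranks, with tied values given average ranks) between docking scores and measured affinities. All numbers are invented for illustration and carry no experimental meaning.

```python
import math


def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: docking score (more negative = predicted tighter binding)
# and measured pKd (higher = tighter binding) for five compounds.
docking_scores = [-9.1, -8.5, -7.9, -7.2, -6.8]
measured_pkd = [6.0, 7.5, 5.2, 6.8, 5.0]

rho = spearman(docking_scores, measured_pkd)
print(rho)
```

For a perfect proxy under this sign convention the coefficient would be −1; the toy data give −0.5, the kind of moderate agreement that leaves many top-ranked candidates as false positives.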

Consequently, a critical imperative for the field is to “close the loop” by integrating generative models with automated experimental validation in a Design-Build-Test-Learn (DBTL) cycle, as illustrated in Figure 3 [218,219,220]. The implementation of such a cycle creates a direct pathway from in silico hypothesis generation to experimental validation and back, enabling a rapid, iterative flow where data from one round directly informs the next. Without such a framework, which is now becoming feasible through advances in laboratory automation [221,222,223,224], the design process remains a slow, sequential, and inefficient endeavor [225,226,227,228]. Finally, even with better data and validation, the issue of generalization persists. Like all machine learning models, diffusion models risk overfitting to their training distribution, potentially failing to generate effective and novel solutions for new biological targets or chemical spaces that lie outside their learned domain [127,229,230,231]. Overcoming these interconnected challenges is essential to translate the theoretical promise of diffusion models into tangible therapeutic breakthroughs [232,233].
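
The structure of such a closed loop can be illustrated with a toy simulation. In the sketch below, single-point mutation stands in for the generative model and a simple count of cationic residues stands in for the laboratory assay. Both are deliberately simplistic, hypothetical placeholders meant only to show how data from each round feed the next.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def design(parents, rng, n=8):
    """Design: propose single-point mutants of the parents (stand-in for a generator)."""
    proposals = []
    for _ in range(n):
        parent = rng.choice(parents)
        pos = rng.randrange(len(parent))
        proposals.append(parent[:pos] + rng.choice(AMINO_ACIDS) + parent[pos + 1:])
    return proposals


def build_and_test(seq):
    """Build and Test: stand-in assay that scores the number of K/R residues."""
    return sum(ch in "KR" for ch in seq)


def dbtl(seed_sequences, rounds, rng, top_k=3):
    """Run several Design-Build-Test-Learn rounds and track the best score."""
    data = {s: build_and_test(s) for s in seed_sequences}
    history = [max(data.values())]
    for _ in range(rounds):
        # Learn: reuse all accumulated measurements to pick the next parents.
        parents = sorted(data, key=data.get, reverse=True)[:top_k]
        for candidate in design(parents, rng):
            data[candidate] = build_and_test(candidate)
        history.append(max(data.values()))
    return data, history


data, history = dbtl(["GIGAFLHSAG", "ALAGLAGALA"], rounds=5, rng=random.Random(0))
print(history)
```

Because every measurement is retained, the best observed score can never decrease from one round to the next. In a real deployment the "Learn" step would retrain or fine-tune the generative and predictive models rather than simply reselecting parents.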

5.3. Future Outlook and Opportunities

A critical measure of success for any drug discovery technology is its path to clinical translation. While molecules designed specifically by diffusion models have not yet entered human clinical trials, the broader field of generative AI provides a strong and encouraging precedent. Companies such as Insilico Medicine, Exscientia, and Recursion Pharmaceuticals have successfully advanced AI-designed small molecules into various stages of clinical development, validating the principle that AI can indeed yield viable therapeutic candidates [234,235]. This industry progress sets the stage for diffusion models, which represent the state-of-the-art in generative capability. Several leading biotechnology companies are now deeply integrating these models into their R&D pipelines, and although much of this work remains proprietary, early reports indicate that diffusion-generated candidates are demonstrating excellent activity and favorable properties in preclinical studies. This marks a rapid transition of the technology from academic exploration to industrial application.

The field of diffusion models for drug discovery is rapidly evolving, with future work poised to address current limitations and unlock transformative capabilities. A key frontier is the development of unified frameworks—so-called “foundation models” for molecular science—that could seamlessly design not only small molecules and peptides but also complex hybrid therapeutics like peptide-drug conjugates (PDCs) and PROTACs from a single, powerful architecture. Enhancing model reliability is also paramount; this involves a shift from ‘black box’ generators to interpretable and controllable tools that empower expert-guided design, while integrating first-principles simulations from quantum chemistry and physics to ensure the physical realism of generated candidates. Ultimately, the successful translation of these technologies will hinge on fully realizing the automated Design-Build-Test-Learn (DBTL) paradigm, as illustrated in Figure 3, which promises to accelerate discovery cycles from months to days. This acceleration, however, must be navigated alongside the establishment of clear ethical and regulatory frameworks to guide AI-designed therapeutics safely from concept to clinic.

6. Conclusions

Diffusion models have emerged as a powerful, unified generative framework, demonstrating remarkable versatility in designing both small molecules and therapeutic peptides. While successful in generating novel candidates for both modalities, the path to clinical translation is defined by distinct, fundamental hurdles: for small molecules, the challenge lies in bridging the gap from computational validity to practical chemical synthesizability; for peptides, it is ensuring that de novo structural designs achieve in vivo biological stability and function. Crucially, the progress of AI-designed drugs now entering clinical trials provides a strong tailwind for the field, validating the potential of these advanced generative approaches. The full potential of this technology will be significantly accelerated by closing the Design-Build-Test-Learn loop through deep integration with laboratory automation, which will enable rapid, data-driven iteration. By overcoming these challenges, diffusion models hold the promise to catalyze a fundamental shift in drug discovery—moving from the passive exploration of existing chemical space to the active, purpose-driven creation of novel medicines.

Author Contributions

Y.W. and Y.M. contributed equally to this work. Conceptualization, K.W.; Investigation, Y.W. and Y.M.; Writing—Original Draft, Y.W., Y.M., Y.C., J.Y., J.Z., M.C. and K.W.; Visualization, Y.W., Y.C., J.Y., J.Z. and M.C.; Writing—Review & Editing, all authors; Supervision and Project Administration, K.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (Grant Number: 2024D01C216) and the “Tianchi Talents” introduction plan.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

Singh, N.; Vayer, P.; Tanwar, S.; Poyet, J.L.; Tsaioun, K.; Villoutreix, B.O. Drug discovery and development: Introduction to the general public and patient groups. Front. Drug Discov. 2023, 3, 1201419.
Brown, D.G.; Wobst, H.J.; Kapoor, A.; Kenna, L.A.; Southall, N. Clinical development times for innovative drugs. Nat. Rev. Drug Discov. 2022, 21, 793–794.
Kim, E.; Yang, J.; Park, S.; Shin, K. Factors affecting success of new drug clinical trials. Ther. Innov. Regul. Sci. 2023, 57, 737–750.
Zhou, Y.; Zhang, Y.; Chen, Z.; Huang, S.; Li, Y.; Fu, J.; Zhu, F. Dynamic clinical success rates for drugs in the 21st century. Nat. Commun. 2025, 16, 9537.
Smietana, K.; Siatkowski, M.; Møller, M. Trends in clinical success rates. Nat. Rev. Drug Discov. 2016, 15, 379–380.
Mullard, A. Parsing clinical success rates. Nat. Rev. Drug Discov. 2016, 15, 447–448.
Phares, S.; Phillip, K.; Trusheim, M. Clinical development success rates for durable cell and gene therapies. Nat. Rev. Drug Discov. 2025, 24, 329–330.
Mullard, A. New drugs cost US $2.6 billion to develop. Nat. Rev. Drug Discov. 2014, 13, 877.
Sertkaya, A.; Beleche, T.; Jessup, A.; Sommers, B.D. Costs of drug development and research and development intensity in the US, 2000–2018. JAMA Netw. Open 2024, 7, e2415445.
Senior, M. Fresh from the biotech pipeline: Record-breaking FDA approvals. Nat. Biotechnol. 2024, 42, 355–361.
Bohacek, R.S.; McMartin, C.; Guida, W.C. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 1996, 16, 3–50.
Orsi, M.; Reymond, J.L. Navigating a 1E+60 chemical space of peptide/peptoid oligomers. Mol. Inform. 2025, 44, e202400186.
Ruddigkeit, L.; Van Deursen, R.; Blum, L.C.; Reymond, J.L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864–2875.
Reymond, J.L.; Awale, M. Exploring chemical space for drug discovery using the chemical universe database. ACS Chem. Neurosci. 2012, 3, 649–657.
Polishchuk, P.G.; Madzhidov, T.I.; Varnek, A. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput.-Aided Mol. Des. 2013, 27, 675–679.
Jayatunga, M.K.; Ayers, M.; Bruens, L.; Jayanth, D.; Meier, C. How successful are AI-discovered drugs in clinical trials? A first analysis and emerging lessons. Drug Discov. Today 2024, 29, 104009.
Arnold, C. Inside the nascent industry of AI-designed drugs. Nat. Med. 2023, 29, 1292–1295.
Kanakia, A.; Sale, M.; Zhao, L.; Zhou, Z. AI in action: Redefining drug discovery and development. Clin. Transl. Sci. 2025, 18, e70149.
Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using real nvp. arXiv 2016, arXiv:1605.08803.
Kingma, D.P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; Welling, M. Improved variational inference with inverse autoregressive flow. In Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Volume 29.
Papamakarios, G.; Pavlakou, T.; Murray, I. Masked autoregressive flow for density estimation. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
Sharma, P.; Kumar, M.; Sharma, H.K.; Biju, S.M. Generative adversarial networks (GANs): Introduction, taxonomy, variants, limitations, and applications. Multimed. Tools Appl. 2024, 83, 88811–88858.
Bond-Taylor, S.; Leach, A.; Long, Y.; Willcocks, C.G. Deep generative modelling: A comparative review of vaes, gans, normalizing flows, energy-based and autoregressive models. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7327–7347.
Vivekananthan, S. Comparative analysis of generative models: Enhancing image synthesis with vaes, gans, and stable diffusion. arXiv 2024, arXiv:2408.08751.
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 2256–2265.
Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis. arXiv 2020, arXiv:2009.09761.
Li, X.; Thickstun, J.; Gulrajani, I.; Liang, P.S.; Hashimoto, T.B. Diffusion-lm improves controllable text generation. Adv. Neural Inf. Process. Syst. 2022, 35, 4328–4343.
Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598.
Weiss, T.; Mayo Yanes, E.; Chakraborty, S.; Cosmo, L.; Bronstein, A.M.; Gershoni-Poranne, R. Guided diffusion for inverse molecular design. Nat. Comput. Sci. 2023, 3, 873–882.
Alakhdar, A.; Poczos, B.; Washburn, N. Diffusion models in de novo drug design. J. Chem. Inf. Model. 2024, 64, 7238–7256.
Bai, Y.R.; Seng, D.J.; Xu, Y.; Zhang, Y.D.; Zhou, W.J.; Jia, Y.Y.; Song, J.; He, Z.X.; Liu, H.M.; Yuan, S. A comprehensive review of small molecule drugs approved by the FDA in 2023: Advances and prospects. Eur. J. Med. Chem. 2024, 276, 116706.
Wang, Z.; Sun, X.; Sun, M.; Wang, C.; Yang, L. Game Changers: Blockbuster Small-Molecule Drugs Approved by the FDA in 2024. Pharmaceuticals 2025, 18, 729.
Mullard, A. 2023 FDA approvals. Nat. Rev. Drug Discov. 2024, 23, 88–95.
Martins, A.C.; Oshiro, M.Y.; Albericio, F.; de la Torre, B.G. Food and Drug Administration (FDA) approvals of biological drugs in 2023. Biomedicines 2024, 12, 1992.
Xie, X.; Yu, T.; Li, X.; Zhang, N.; Foster, L.J.; Peng, C.; Huang, W.; He, G. Recent advances in targeting the “undruggable” proteins: From drug discovery to clinical trials. Signal Transduct. Target. Ther. 2023, 8, 335.
Nada, H.; Choi, Y.; Kim, S.; Jeong, K.S.; Meanwell, N.A.; Lee, K. New insights into protein–protein interaction modulators in drug discovery and therapeutic advance. Signal Transduct. Target. Ther. 2024, 9, 341.
Xu, W.; Kang, C. Fragment-based drug design: From then until now, and toward the future. J. Med. Chem. 2025, 68, 5000–5004.
Xiao, W.; Jiang, W.; Chen, Z.; Huang, Y.; Mao, J.; Zheng, W.; Hu, Y.; Shi, J. Advance in peptide-based drug development: Delivery platforms, therapeutics and vaccines. Signal Transduct. Target. Ther. 2025, 10, 74.
Baral, K.C.; Choi, K.Y. Barriers and strategies for oral peptide and protein therapeutics delivery: Update on clinical advances. Pharmaceutics 2025, 17, 397.
Mehrdadi, S. Lipid-based nanoparticles as oral drug delivery systems: Overcoming poor gastrointestinal absorption and enhancing bioavailability of peptide and protein therapeutics. Adv. Pharm. Bull. 2023, 14, 48.
Lamers, C. Overcoming the shortcomings of peptide-based therapeutics. Future Drug Discov. 2022, 4, FDD75.
Verma, S.; Goand, U.K.; Husain, A.; Katekar, R.A.; Garg, R.; Gayen, J.R. Challenges of peptide and protein drug delivery by oral route: Current strategies to improve the bioavailability. Drug Dev. Res. 2021, 82, 927–944.
Hu, Q.; Sun, C.; He, H.; Xu, J.; Liu, D.; Zhang, W.; Li, H. Target-aware 3D molecular generation based on guided equivariant diffusion. Nat. Commun. 2025, 16, 7928.
Chen, L.; Li, Y.; Ma, Y.; Gao, L.; Yu, L. Multiscale graph equivariant diffusion model for 3D molecule design. Sci. Adv. 2025, 11, eadv0778. [Google Scholar] [CrossRef]
Vignac, C.; Krawczuk, I.; Siraudin, A.; Wang, B.; Cevher, V.; Frossard, P. DiGress: Discrete Denoising diffusion for graph generation. In Proceedings of the The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Bian, T.; Niu, Y.; Chang, H.; Yan, D.; Huang, J.; Rong, Y.; Cheng, H. Hierarchical graph latent diffusion model for conditional molecule generation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 130–140. [Google Scholar]
Liu, G.; Chen, J.; Zhu, Y.; Sun, M.; Luo, T.; Chawla, N.V.; Jiang, M. Graph Diffusion Transformers are In-Context Molecular Designers. arXiv 2025, arXiv:2510.08744. [Google Scholar] [CrossRef]
Morehead, A.; Cheng, J. Geometry-complete diffusion for 3D molecule generation and optimization. Commun. Chem. 2024, 7, 150. [Google Scholar] [CrossRef] [PubMed]
Zhang, H.; Liu, Y.; Liu, X.; Wang, C.; Guo, M. Equivariant score-based generative diffusion framework for 3D molecules. BMC Bioinform. 2024, 25, 203. [Google Scholar] [CrossRef]
Liu, C.; Vadgama, S.; Ruhe, D.; Bekkers, E.; Forré, P. Clifford Group Equivariant Diffusion Models for 3D Molecular Generation. arXiv 2025, arXiv:2504.15773. [Google Scholar] [CrossRef]
Satorras, V.G.; Hoogeboom, E.; Welling, M. E(n) equivariant graph neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 9323–9332. [Google Scholar]
Wang, Y.; Wang, T.; Li, S.; He, X.; Li, M.; Wang, Z.; Liu, T.Y. Enhancing geometric representations for molecules with equivariant vector-scalar interactive message passing. Nat. Commun. 2024, 15, 313. [Google Scholar] [CrossRef] [PubMed]
Soleymani, F.; Paquet, E.; Viktor, H.L.; Michalowski, W. Structure-based protein and small molecule generation using EGNN and diffusion models: A comprehensive review. Comput. Struct. Biotechnol. J. 2024, 23, 2779–2797. [Google Scholar] [CrossRef]
Guan, J.; Qian, W.W.; Peng, X.; Su, Y.; Peng, J.; Ma, J. 3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Guo, M.; Liu, C.; Forré, P. Frame-based Equivariant Diffusion Models for 3D Molecular Generation. arXiv 2025, arXiv:2509.19506. [Google Scholar] [CrossRef]
Alamdari, S.; Thakkar, N.; Van Den Berg, R.; Tenenholtz, N.; Strome, R.; Moses, A.M.; Yang, K.K. Protein generation with evolutionary diffusion: Sequence is all you need. bioRxiv 2023, 2023-09. [Google Scholar] [CrossRef]
Lisanza, S.L.; Gershon, J.M.; Tipps, S.W.; Sims, J.N.; Arnoldt, L.; Hendel, S.J.; Baker, D. Multistate and functional protein design using RoseTTAFold sequence space diffusion. Nat. Biotechnol. 2025, 43, 1288–1298. [Google Scholar] [CrossRef]
Bai, P.; Miljković, F.; Liu, X.; De Maria, L.; Croasdale-Wood, R.; Rackham, O.; Lu, H. Mask-prior-guided denoising diffusion improves inverse protein folding. Nat. Mach. Intell. 2025, 7, 876–888. [Google Scholar] [CrossRef]
Austin, J.; Johnson, D.D.; Ho, J.; Tarlow, D.; Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Adv. Neural Inf. Process. Syst. 2021, 34, 17981–17993. [Google Scholar]
Watson, J.L.; Juergens, D.; Bennett, N.R.; Trippe, B.L.; Yim, J.; Eisenach, H.E.; Baker, D. De novo design of protein structure and function with RFdiffusion. Nature 2023, 620, 1089–1100. [Google Scholar] [CrossRef] [PubMed]
Chen, L.T.; Chatterjee, P. Peptide binders designed directly from protein sequences. Nat. Biotechnol. 2025. [Google Scholar] [CrossRef]
Wu, K.E.; Yang, K.K.; van den Berg, R.; Alamdari, S.; Zou, J.Y.; Lu, A.X.; Amini, A.P. Protein structure generation via folding diffusion. Nat. Commun. 2024, 15, 1059. [Google Scholar] [CrossRef]
Li, W.R.; Cadet, X.F.; Medina-Ortiz, D.; Davari, M.D.; Sowdhamini, R.; Damour, C.; Cadet, F. From thermodynamics to protein design: Diffusion models for biomolecule generation towards autonomous protein engineering. arXiv 2025, arXiv:2501.02680. [Google Scholar] [CrossRef]
Cremer, J.; Le, T.; Clevert, D.A.; Schütt, K.T. Latent-Conditioned Equivariant Diffusion for Structure-Based De Novo Ligand Generation. In Proceedings of the International Workshop on AI in Drug Discovery, Lugano, Switzerland, 19 September 2024; pp. 36–46. [Google Scholar]
Hoogeboom, E.; Satorras, V.G.; Vignac, C.; Welling, M. Equivariant diffusion for molecule generation in 3D. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 8867–8887. [Google Scholar]
Xu, M.; Yu, L.; Song, Y.; Shi, C.; Ermon, S.; Tang, J. GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022. [Google Scholar]
Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
Huang, L.; Xu, T.; Yu, Y.; Zhao, P.; Chen, X.; Han, J.; Zhang, H. A dual diffusion model enables 3D molecule generation and lead optimization based on target pockets. Nat. Commun. 2024, 15, 2657. [Google Scholar] [CrossRef]
Jin, W.; Barzilay, R.; Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 2323–2332. [Google Scholar]
Ochiai, T.; Inukai, T.; Akiyama, M.; Furui, K.; Ohue, M.; Matsumori, N.; Inuki, S.; Uesugi, M.; Sunazuka, T.; Kikuchi, K.; et al. Variational autoencoder-based chemical latent space for large molecular structures with 3D complexity. Commun. Chem. 2023, 6, 249. [Google Scholar] [CrossRef] [PubMed]
Tevosyan, A.; Khondkaryan, L.; Khachatrian, H.; Tadevosyan, G.; Apresyan, L.; Babayan, N.; Stopper, H.; Navoyan, Z. Improving VAE based molecular representations for compound property prediction. J. Cheminform. 2022, 14, 69. [Google Scholar] [CrossRef] [PubMed]
Praljak, N.; Lian, X.; Ranganathan, R.; Ferguson, A.L. ProtWave-VAE: Integrating autoregressive sampling with latent-based inference for data-driven protein design. ACS Synth. Biol. 2023, 12, 3544–3561. [Google Scholar] [CrossRef] [PubMed]
De Cao, N.; Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv 2018, arXiv:1805.11973. [Google Scholar]
Saad, M.M.; O’Reilly, R.; Rehmani, M.H. A survey on training challenges in generative adversarial networks for biomedical image analysis. Artif. Intell. Rev. 2024, 57, 19. [Google Scholar] [CrossRef]
Barsha, F.L.; Eberle, W. An in-depth review and analysis of mode collapse in generative adversarial networks. Mach. Learn. 2025, 114, 141. [Google Scholar] [CrossRef]
Wang, H.; Wang, J.; Wang, J.; Zhao, M.; Zhang, W.; Zhang, F.; Guo, M. GraphGAN: Graph representation learning with generative adversarial nets. Proc. AAAI Conf. Artif. Intell. 2018, 32. [Google Scholar] [CrossRef]
Zang, C.; Wang, F. MoFlow: An invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 23–27 August 2020; pp. 617–626. [Google Scholar]
Madhawa, K.; Ishiguro, K.; Nakago, K.; Abe, M. GraphNVP: An invertible flow model for generating molecular graphs. arXiv 2019, arXiv:1905.11600. [Google Scholar] [CrossRef]
Mercado, R.; Rastemo, T.; Lindelöf, E.; Klambauer, G.; Engkvist, O.; Chen, H.; Bjerrum, E.J. Practical notes on building molecular graph generative models. Appl. AI Lett. 2020, 1. [Google Scholar] [CrossRef]
Shi, C.; Xu, M.; Zhu, Z.; Zhang, W.; Zhang, M.; Tang, J. GraphAF: A flow-based autoregressive model for molecular graph generation. arXiv 2020, arXiv:2001.09382. [Google Scholar]
Segler, M.H.; Kogej, T.; Tyrchan, C.; Waller, M.P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2018, 4, 120–131. [Google Scholar] [CrossRef]
Gupta, A.; Müller, A.T.; Huisman, B.J.; Fuchs, J.A.; Schneider, P.; Schneider, G. Generative recurrent networks for de novo drug design. Mol. Inform. 2018, 37, 1700111. [Google Scholar] [CrossRef]
Wang, Z.; Shi, J.; Heess, N.; Gretton, A.; Titsias, M.K. Learning-Order Autoregressive Models with Application to Molecular Graph Generation. arXiv 2025, arXiv:2503.05979. [Google Scholar] [CrossRef]
He, T.; Zhang, J.; Zhou, Z.; Glass, J. Exposure bias versus self-recovery: Are distortions really incremental for autoregressive text generation? arXiv 2019, arXiv:1905.10617. [Google Scholar]
Wang, Y.; Che, T.; Li, B.; Song, K.; Pei, H.; Bengio, Y.; Li, D. Your autoregressive generative model can be better if you treat it as an energy-based one. arXiv 2022, arXiv:2206.12840. [Google Scholar] [CrossRef]
Zhang, P.; Baker, D.; Song, M.; Bi, J. Unraveling the potential of diffusion models in small-molecule generation. Drug Discov. Today 2025, 30, 104413. [Google Scholar] [CrossRef]
Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
Müller-Franzes, G.; Niehues, J.M.; Khader, F.; Arasteh, S.T.; Haarburger, C.; Kuhl, C.; Truhn, D. A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis. Sci. Rep. 2023, 13, 12098. [Google Scholar] [CrossRef]
Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Yang, M.H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
Wang, C.; Ong, H.H.; Chiba, S.; Rajapakse, J.C. GLDM: Hit molecule generation with constrained graph latent diffusion model. Briefings Bioinform. 2024, 25, bbae142. [Google Scholar] [CrossRef]
Brown, N.; Fiscato, M.; Segler, M.H.; Vaucher, A.C. GuacaMol: Benchmarking models for de novo molecular design. J. Chem. Inf. Model. 2019, 59, 1096–1108. [Google Scholar] [CrossRef]
Dunn, I.; Koes, D.R. FlowMol3: Flow Matching for 3D De Novo Small-Molecule Generation. arXiv 2025, arXiv:2508.12629. [Google Scholar]
Liu, H.; Zhang, W.; Xie, J.; Faccio, F.; Xu, M.; Xiang, T.; Shou, M.Z.; Schmidhuber, J. Faster diffusion through temporal attention decomposition. Trans. Mach. Learn. Res. 2025. [Google Scholar]
Yang, Y.; Gu, S.; Liu, B.; Gong, X.; Lu, R.; Qiu, J.; Liu, H. DiffMC-Gen: A Dual Denoising Diffusion Model for Multi-Conditional Molecular Generation. Adv. Sci. 2025, 12, 2417726. [Google Scholar] [CrossRef]
Francoeur, P.G.; Masuda, T.; Sunseri, J.; Jia, A.; Iovanisci, R.B.; Snyder, I.; Koes, D.R. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. J. Chem. Inf. Model. 2020, 60, 4200–4215. [Google Scholar] [CrossRef]
Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N. The protein data bank. Nucleic Acids Res. 2000, 28, 235–242. [Google Scholar] [CrossRef] [PubMed]
Corso, G.; Stärk, H.; Jing, B.; Barzilay, R.; Jaakkola, T. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. In Proceedings of the International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Schneuing, A.; Harris, C.; Du, Y.; Didi, K.; Jamasb, A.; Igashov, I.; Du, W.; Gomes, C.; Blundell, T.L.; Lio, P.; et al. Structure-based drug design with equivariant diffusion models. Nat. Comput. Sci. 2024, 4, 899–909. [Google Scholar] [CrossRef] [PubMed]
Das, U. Generative AI for drug discovery and protein design: The next frontier in AI-driven molecular science. Med. Drug Discov. 2025, 27, 100213. [Google Scholar] [CrossRef]
Axelrod, S.; Gomez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci. Data 2022, 9, 185. [Google Scholar] [CrossRef] [PubMed]
Landrum, G. RDKit documentation. Release 2013, 1, 4. [Google Scholar]
Zhu, J.; Xia, Y.; Liu, C.; Wu, L.; Xie, S.; Wang, Y.; Wang, T.; Qin, T.; Zhou, W.; Li, H.; et al. Direct molecular conformation generation. arXiv 2022, arXiv:2202.01356. [Google Scholar] [CrossRef]
McNutt, A.T.; Bisiriyu, F.; Song, S.; Vyas, A.; Hutchison, G.R.; Koes, D.R. Conformer generation for structure-based drug design: How many and how good? J. Chem. Inf. Model. 2023, 63, 6598–6607. [Google Scholar] [CrossRef]
Irwin, J.J.; Tang, K.G.; Young, J.; Dandarchuluun, C.; Wong, B.R.; Khurelbaatar, M.; Moroz, Y.S.; Mayfield, J.; Sayle, R.A. ZINC20—A free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 2020, 60, 6065–6073. [Google Scholar] [CrossRef]
Ramakrishnan, R.; Dral, P.O.; Rupp, M.; Von Lilienfeld, O.A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 2014, 1, 140022. [Google Scholar] [CrossRef]
Wei, J.; Zhang, Y.; Ramdhan, P.A.; Huang, Z.; Seabra, G.; Jiang, Z.; Li, Y. GatorAffinity: Boosting Protein-Ligand Binding Affinity Prediction with Large-Scale Synthetic Structural Data. bioRxiv 2025, 2025-09. [Google Scholar] [CrossRef]
Liu, H.; Chen, P.; Zhai, X.; Huo, K.G.; Zhou, S.; Han, L.; Fan, G. PPB-Affinity: Protein-Protein Binding Affinity dataset for AI-based protein drug discovery. Sci. Data 2024, 11, 1316. [Google Scholar] [PubMed]
Wang, H. Prediction of protein–ligand binding affinity via deep learning models. Briefings Bioinform. 2024, 25, bbae081. [Google Scholar]
Liu, T.; Hwang, L.; Burley, S.K.; Nitsche, C.I.; Southan, C.; Walters, W.P.; Gilson, M.K. BindingDB in 2024: A FAIR knowledgebase of protein-small molecule binding data. Nucleic Acids Res. 2025, 53, D1633–D1644. [Google Scholar] [CrossRef] [PubMed]
Gaulton, A.; Bellis, L.J.; Bento, A.P.; Chambers, J.; Davies, M.; Hersey, A.; Overington, J.P. ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40, D1100–D1107. [Google Scholar]
Krishnan, S.R.; Bung, N.; Bulusu, G.; Roy, A. Accelerating de novo drug design against novel proteins using deep learning. J. Chem. Inf. Model. 2021, 61, 621–630. [Google Scholar] [CrossRef]
Dalkıran, A.; Atakan, A.; Rifaioğlu, A.S.; Martin, M.J.; Atalay, R.; Acar, A.C.; Atalay, V. Transfer learning for drug–target interaction prediction. Bioinformatics 2023, 39, i103–i110. [Google Scholar] [CrossRef]
Buterez, D.; Janet, J.P.; Kiddle, S.J.; Oglic, D.; Lió, P. Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting. Nat. Commun. 2024, 15, 1517. [Google Scholar] [CrossRef]
Atz, K.; Cotos, L.; Isert, C.; Håkansson, M.; Focht, D.; Hilleke, M.; Schneider, G. Prospective de novo drug design with deep interactome learning. Nat. Commun. 2024, 15, 3408. [Google Scholar] [CrossRef]
Wang, J.; Dokholyan, N.V. Leveraging Transfer Learning for Predicting Protein–Small-Molecule Interaction Predictions. J. Chem. Inf. Model. 2025, 65, 3262–3269. [Google Scholar] [CrossRef] [PubMed]
Peng, X.; Luo, S.; Guan, J.; Xie, Q.; Peng, J.; Ma, J. Pocket2Mol: Efficient molecular sampling based on 3D protein pockets. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 17644–17655. [Google Scholar]
Peng, J.; Yu, J.L.; Yang, Z.B.; Chen, Y.T.; Wei, S.Q.; Meng, F.B.; Li, G.B. Pharmacophore-oriented 3D molecular generation toward efficient feature-customized drug discovery. Nat. Comput. Sci. 2025, 5, 898–914. [Google Scholar] [CrossRef]
Zhung, W.; Kim, H.; Kim, W.Y. 3D molecular generative framework for interaction-guided drug design. Nat. Commun. 2024, 15, 2688. [Google Scholar] [CrossRef] [PubMed]
Qin, Y.; Wei, X.; Xu, M.; Wu, J.; Tang, M.; Ran, T.; Chen, H. Comprehensive Benchmark Study of Diffusion-Based 3D Molecular Generation Models. ACS Omega 2025. [Google Scholar] [CrossRef] [PubMed]
Buttenschoen, M.; Morris, G.M.; Deane, C.M. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chem. Sci. 2024, 15, 3130–3139. [Google Scholar] [CrossRef]
Bickerton, G.R.; Paolini, G.V.; Besnard, J.; Muresan, S.; Hopkins, A.L. Quantifying the chemical beauty of drugs. Nat. Chem. 2012, 4, 90–98. [Google Scholar] [CrossRef]
Oestreich, M.; Merdivan, E.; Lee, M.; Schultze, J.L.; Piraud, M.; Becker, M. DrugDiff: Small molecule diffusion model with flexible guidance towards molecular properties. J. Cheminform. 2025, 17, 23. [Google Scholar] [CrossRef] [PubMed]
Han, X.; Shan, C.; Shen, Y.; Xu, C.; Yang, H.; Li, X.; Li, D. Training-free multi-objective diffusion model for 3D molecule generation. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Khodabandeh Yalabadi, A.; Yazdani-Jahromi, M.; Garibay, O.O. BoKDiff: Best-of-K diffusion alignment for target-specific 3D molecule generation. Bioinform. Adv. 2025, 5, vbaf137. [Google Scholar] [CrossRef]
Chen, L.; Kim, D.; Domaratzki, M.; Hu, P. Uncertainty-aware multi-objective reinforcement learning-guided diffusion models for 3D de novo molecular design. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, USA, 2–7 December 2025. [Google Scholar]
Yuan, Y.; Pan, X.; Li, X.; Zhang, R.; Su, W. A 3D generation framework using diffusion model and reinforcement learning to generate multi-target compounds with desired properties. J. Cheminform. 2025, 17, 93. [Google Scholar] [CrossRef]
Ertl, P.; Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 2009, 1, 8. [Google Scholar] [CrossRef]
Guo, J.; Schwaller, P. Directly optimizing for synthesizability in generative molecular design using retrosynthesis models. Chem. Sci. 2025, 16, 6943–6956. [Google Scholar] [CrossRef]
Seo, S.; Lim, J.; Kim, W.Y. Molecular generative model via retrosynthetically prepared chemical building block assembly. Adv. Sci. 2023, 10, 2206674. [Google Scholar] [CrossRef]
Gaiński, P.; Boussif, O.; Rekesh, A.; Shevchuk, D.; Parviz, A.; Tyers, M.; Batey, R.A.; Koziarski, M. Scalable and cost-efficient de novo template-based molecular generation. arXiv 2025, arXiv:2506.19865. [Google Scholar]
Liu, S.; Zhang, D.; Tu, Z.; Dai, H.; Liu, P. Evaluating Molecule Synthesizability via Retrosynthetic Planning and Reaction Prediction. arXiv 2024, arXiv:2411.08306. [Google Scholar]
Zeng, X.; Wang, F.; Luo, Y.; Kang, S.G.; Tang, J.; Lightstone, F.C.; Cheng, F. Deep generative molecular design reshapes drug discovery. Cell Rep. Med. 2022, 3, 100794. [Google Scholar] [CrossRef] [PubMed]
Fu, C.; Chen, Q. The future of pharmaceuticals: Artificial intelligence in drug discovery and development. J. Pharm. Anal. 2025, 15, 101248. [Google Scholar] [CrossRef] [PubMed]
Ramos, M.C.; Collison, C.J.; White, A.D. A review of large language models and autonomous agents in chemistry. Chem. Sci. 2025, 16, 2514–2572. [Google Scholar] [CrossRef]
Orengo, C.A.; Michie, A.D.; Jones, S.; Jones, D.T.; Swindells, M.B.; Thornton, J.M. CATH–a hierarchic classification of protein domain structures. Structure 1997, 5, 1093–1109. [Google Scholar] [CrossRef] [PubMed]
Sillitoe, I.; Lewis, T.E.; Cuff, A.; Das, S.; Ashford, P.; Dawson, N.L.; Furnham, N.; Laskowski, R.A.; Lee, D.; Lees, J.G.; et al. CATH: Comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 2015, 43, D376–D381. [Google Scholar] [CrossRef] [PubMed]
Fox, N.K.; Brenner, S.E.; Chandonia, J.M. SCOPe: Structural Classification of Proteins—Extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 2014, 42, D304–D309. [Google Scholar] [CrossRef]
Chandonia, J.M.; Fox, N.K.; Brenner, S.E. SCOPe: Classification of large macromolecular structures in the structural classification of proteins—Extended database. Nucleic Acids Res. 2019, 47, D475–D481. [Google Scholar] [CrossRef] [PubMed]
Apweiler, R.; Bairoch, A.; Wu, C.H.; Barker, W.C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; et al. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 2004, 32, D115–D119. [Google Scholar] [CrossRef]
The UniProt Consortium. UniProt: The universal protein knowledgebase in 2025. Nucleic Acids Res. 2025, 53, D609–D617. [CrossRef]
Suzek, B.E.; Huang, H.; McGarvey, P.; Mazumder, R.; Wu, C.H. UniRef: Comprehensive and non-redundant UniProt reference clusters. Bioinformatics 2007, 23, 1282–1288. [Google Scholar] [CrossRef]
Varadi, M.; Anyango, S.; Deshpande, M.; Nair, S.; Natassia, C.; Yordanova, G.; Yuan, D.; Stroe, O.; Wood, G.; Laydon, A.; et al. AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022, 50, D439–D444. [Google Scholar] [CrossRef]
Varadi, M.; Bertoni, D.; Magana, P.; Paramval, U.; Pidruchna, I.; Radhakrishnan, M.; Tucholska, A.; Yahiya, M.; Kleywegt, G.J.; Velankar, S. AlphaFold Protein Structure Database in 2024: Providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 2024, 52, D368–D375. [Google Scholar] [CrossRef]
Wang, G.; Li, X.; Wang, Z. APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 2016, 44, D1087–D1093. [Google Scholar] [CrossRef]
Pirtskhalava, M.; Amstrong, A.A.; Grigolava, M.; Chubinidze, M.; Alimbarashvili, E.; Vishnepolsky, B.; Tartakovsky, M. DBAASP v3: Database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Res. 2021, 49, D288–D297. [Google Scholar] [CrossRef]
Gautam, A.; Singh, H.; Tyagi, A.; Chaudhary, K.; Kumar, R.; Kapoor, P.; Raghava, G.P.S. CPPsite: A curated database of cell penetrating peptides. Database 2012, 2012, bas015. [Google Scholar] [CrossRef]
Agrawal, P.; Bhalla, S.; Usmani, S.S.; Singh, S.; Chaudhary, K.; Raghava, G.P.; Gautam, A. CPPsite 2.0: A repository of experimentally validated cell-penetrating peptides. Nucleic Acids Res. 2016, 44, D1098–D1103. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Li, Y.; Han, L.; Li, J.; Liu, J.; Zhao, Z.; Nie, W.; Liu, Y.; Wang, R. PDB-wide collection of binding data: Current status of the PDBbind database. Bioinformatics 2015, 31, 405–412. [Google Scholar] [CrossRef] [PubMed]
Wang, R.; Fang, X.; Lu, Y.; Yang, C.Y.; Wang, S. The PDBbind database: Methodologies and updates. J. Med. Chem. 2005, 48, 4111–4119. [Google Scholar] [CrossRef]
Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef]
Baek, M.; DiMaio, F.; Anishchenko, I.; Dauparas, J.; Ovchinnikov, S.; Lee, G.R.; Wang, J.; Cong, Q.; Kinch, L.N.; Schaeffer, R.D.; et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021, 373, 871–876. [Google Scholar] [CrossRef] [PubMed]
Tang, S.; Zhang, Y.; Chatterjee, P. PepTune: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion. arXiv 2025, arXiv:2412.17780. [Google Scholar]
Meshchaninov, V.; Strashnov, P.; Shevtsov, A.; Nikolaev, F.; Ivanisenko, N.; Kardymon, O.; Vetrov, D. Diffusion on language model encodings for protein sequence generation. arXiv 2024, arXiv:2403.03726. [Google Scholar]
Luo, Z.; Geng, A.; Wei, L.; Zou, Q.; Cui, F.; Zhang, Z. CPL-Diff: A Diffusion Model for De Novo Design of Functional Peptide Sequences with Fixed Length. Adv. Sci. 2025, 12, 2412926. [Google Scholar] [CrossRef]
Szymczak, P.; Możejko, M.; Grzegorzek, T.; Jurczak, R.; Bauer, M.; Neubauer, D.; Sikora, K.; Michalski, M.; Sroka, J.; Setny, P.; et al. Discovering highly potent antimicrobial peptides with deep generative model HydrAMP. Nat. Commun. 2023, 14, 1453. [Google Scholar] [CrossRef]
Li, T.; Ren, X.; Luo, X.; Wang, Z.; Li, Z.; Luo, X.; Shen, J.; Li, Y.; Yuan, D.; Nussinov, R.; et al. A foundation model identifies broad-spectrum antimicrobial peptides against drug-resistant bacterial infection. Nat. Commun. 2024, 15, 7538. [Google Scholar] [CrossRef]
Dong, R.; Liu, R.; Liu, Z.; Liu, Y.; Zhao, G.; Li, H.; Hou, S.; Ma, X.; Kang, H.; Liu, J.; et al. Exploring the repository of de novo-designed bifunctional antimicrobial peptides through deep learning. eLife 2025, 13, RP97330. [Google Scholar] [CrossRef]
Wang, J.; Feng, J.; Kang, Y.; Pan, P.; Ge, J.; Wang, Y.; Wang, M.; Wu, Z.; Zhang, X.; Yu, J.; et al. Discovery of antimicrobial peptides with notable antibacterial potency by an LLM-based foundation model. Sci. Adv. 2025, 11, eads8932. [Google Scholar] [CrossRef]
Brizuela, C.A.; Liu, G.; Stokes, J.M.; de la Fuente-Nunez, C. AI methods for antimicrobial peptides: Progress and challenges. Microb. Biotechnol. 2025, 18, e70072. [Google Scholar] [CrossRef] [PubMed]
Cao, J.; Zhang, J.; Yu, Q.; Ji, J.; Li, J.; He, S.; Zhu, Z. TG-CDDPM: Text-guided antimicrobial peptides generation based on conditional denoising diffusion probabilistic model. Briefings Bioinform. 2025, 26, bbae644. [Google Scholar] [CrossRef] [PubMed]
Jin, S.; Zeng, Z.; Xiong, X.; Huang, B.; Tang, L.; Wang, H.; Lin, F. AMPGen: An evolutionary information-reserved and diffusion-driven generative model for de novo design of antimicrobial peptides. Commun. Biol. 2025, 8, 839. [Google Scholar] [CrossRef]
Seixas Feio, J.A.; de Oliveira, E.C.L.; de Sales, C.D.S.; da Costa, K.S.; e Lima, A.H.L. Investigating molecular descriptors in cell-penetrating peptides prediction with deep learning: Employing N, O, and hydrophobicity according to the Eisenberg scale. PLoS ONE 2024, 19, e0305253. [Google Scholar] [CrossRef]
Tran, D.P.; Tada, S.; Yumoto, A.; Kitao, A.; Ito, Y.; Uzawa, T.; Tsuda, K. Using molecular dynamics simulations to prioritize and understand AI-generated cell penetrating peptides. Sci. Rep. 2021, 11, 10630. [Google Scholar] [CrossRef] [PubMed]
González, R.D.; Simões, S.; Ferreira, L.; Carvalho, A.T. Designing Cell Delivery Peptides and SARS-CoV-2-Targeting Small Interfering RNAs: A Comprehensive Bioinformatics Study with Generative Adversarial Network-Based Peptide Design and In Vitro Assays. Mol. Pharm. 2023, 20, 6079–6089. [Google Scholar] [CrossRef]
Ramelot, T.A.; Palmer, J.; Montelione, G.T.; Bhardwaj, G. Cell-permeable chameleonic peptides: Exploiting conformational dynamics in de novo cyclic peptide design. Curr. Opin. Struct. Biol. 2023, 80, 102603. [Google Scholar] [CrossRef]
Lai, L.; Liu, Y.; Song, B.; Li, K.; Zeng, X. Deep generative models for therapeutic peptide discovery: A comprehensive review. ACM Comput. Surv. 2025, 57, 1–29. [Google Scholar] [CrossRef]
Sutcliffe, R.; Doherty, C.P.; Morgan, H.P.; Dunne, N.J.; Mccarthy, H.O. Strategies for the design of biomimetic cell-penetrating peptides using AI-driven in silico tools for drug delivery. Biomater. Adv. 2024, 169, 214153. [Google Scholar] [CrossRef]
Zhang, S.; Jiang, Z.; Huang, R.; Mo, S.; Zhu, L.; Li, P.; Zhang, Z.; Pan, E.; Chen, X.; Long, Y.; et al. PRO-LDM: Protein sequence generation with a conditional latent diffusion model. bioRxiv 2023. [Google Scholar] [CrossRef]
Chen, T.; Vure, P.; Pulugurta, R.; Chatterjee, P. AMP-diffusion: Integrating latent diffusion with protein language models for antimicrobial peptide generation. bioRxiv 2024. [Google Scholar] [CrossRef]
Wang, Y.; Song, M.; Liu, F.; Liang, Z.; Hong, R.; Dong, Y.; Luan, H.; Fu, X.; Yuan, W.; Fang, W.; et al. Artificial intelligence using a latent diffusion model enables the generation of diverse and potent antimicrobial peptides. Sci. Adv. 2025, 11, eadp7171. [Google Scholar] [CrossRef]
Rezaee, K.; Eslami, H. Bridging machine learning and peptide design for cancer treatment: A comprehensive review. Artif. Intell. Rev. 2025, 58, 1–59. [Google Scholar] [CrossRef]
Wan, F.; Wong, F.; Collins, J.J.; de la Fuente-Nunez, C. Machine learning for antimicrobial peptide identification and design. Nat. Rev. Bioeng. 2024, 2, 392–407. [Google Scholar] [CrossRef] [PubMed]
Rettie, S.A.; Bhardwaj, G. Deep learning-enabled design of macrocyclic peptide binders. Nat. Chem. Biol. 2025. [Google Scholar] [CrossRef] [PubMed]
Rettie, S.A.; Juergens, D.; Adebomi, V.; Bueso, Y.F.; Zhao, Q.; Leveille, A.N.; Bhardwaj, G. Accurate de novo design of high-affinity protein-binding macrocycles using deep learning. Nat. Chem. Biol. 2025, 1–9. [Google Scholar] [CrossRef]
Dauparas, J.; Anishchenko, I.; Bennett, N.; Bai, H.; Ragotte, R.J.; Milles, L.F.; Baker, D. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022, 378, 49–56. [Google Scholar] [CrossRef]
Cao, L.; Coventry, B.; Goreshnik, I.; Huang, B.; Sheffler, W.; Park, J.S.; Jude, K.M.; Marković, I.; Kadam, R.U.; Verschueren, K.H.G.; et al. Design of protein-binding proteins from the target structure alone. Nature 2022, 605, 551–560. [Google Scholar] [CrossRef] [PubMed]
Bennett, N.R.; Coventry, B.; Goreshnik, I.; Huang, B.; Allen, A.; Vafeados, D.; Baker, D. Improving de novo protein binder design with deep learning. Nat. Commun. 2023, 14, 2625. [Google Scholar] [CrossRef]
Vázquez Torres, S.; Leung, P.J.; Venkatesh, P.; Lutz, I.D.; Hink, F.; Huynh, H.H.; Becker, J.; Yeh, A.H.; Juergens, D.; Bennett, N.R.; et al. De novo design of high-affinity binders of bioactive helical peptides. Nature 2024, 626, 435–442. [Google Scholar] [CrossRef]
Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130. [Google Scholar] [CrossRef]
Fetse, J.; Kandel, S.; Mamani, U.F.; Cheng, K. Recent advances in the development of therapeutic peptides. Trends Pharmacol. Sci. 2023, 44, 425–441. [Google Scholar] [CrossRef] [PubMed]
Achilleos, K.; Petrou, C.; Nicolaidou, V.; Sarigiannis, Y. Beyond Efficacy: Ensuring Safety in Peptide Therapeutics through Immunogenicity Assessment. J. Pept. Sci. 2025, 31, e70016. [Google Scholar] [CrossRef] [PubMed]
Rettie, S.A.; Campbell, K.V.; Bera, A.K.; Kang, A.; Kozlov, S.; Bueso, Y.F.; Bhardwaj, G. Cyclic peptide structure prediction and design using AlphaFold2. Nat. Commun. 2025, 16, 4730. [Google Scholar] [CrossRef]
Reymond, J.L. The chemical space project. Accounts Chem. Res. 2015, 48, 722–730. [Google Scholar] [CrossRef]
Doak, B.C.; Over, B.; Giordanetto, F.; Kihlberg, J. Oral druggable space beyond the rule of 5: Insights from drugs and clinical candidates. Chem. Biol. 2014, 21, 1115–1142. [Google Scholar] [CrossRef] [PubMed]
Fosgerau, K.; Hoffmann, T. Peptide therapeutics: Current status and future directions. Drug Discov. Today 2015, 20, 122–128. [Google Scholar] [CrossRef] [PubMed]
Kitchen, D.B.; Decornez, H.; Furr, J.R.; Bajorath, J. Docking and scoring in virtual screening for drug discovery: Methods and applications. Nat. Rev. Drug Discov. 2004, 3, 935–949. [Google Scholar] [CrossRef]
Daina, A.; Michielin, O.; Zoete, V. SwissADME: A free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci. Rep. 2017, 7, 42717. [Google Scholar] [CrossRef]
Rich, R.L.; Myszka, D.G. Advances in surface plasmon resonance biosensor analysis. Curr. Opin. Biotechnol. 2000, 11, 54–61.
Myszka, D.G.; Rich, R.L. Implementing surface plasmon resonance biosensors in drug discovery. Pharm. Sci. Technol. Today 2000, 3, 310–317.
Myszka, D.G. Kinetic analysis of macromolecular interactions using surface plasmon resonance biosensors. Curr. Opin. Biotechnol. 1997, 8, 50–57.
Polykovskiy, D.; Zhebrak, A.; Sanchez-Lengeling, B.; Golovanov, S.; Tatanov, O.; Belyaev, S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; et al. Molecular sets (MOSES): A benchmarking platform for molecular generation models. Front. Pharmacol. 2020, 11, 565644.
Igashov, I.; Stärk, H.; Vignac, C.; Schneuing, A.; Satorras, V.G.; Frossard, P.; Welling, M.; Bronstein, M.; Correia, B. Equivariant 3D-conditional diffusion model for molecular linker design. Nat. Mach. Intell. 2024, 6, 417–427.
Ingraham, J.B.; Baranov, M.; Costello, Z.; Barber, K.W.; Wang, W.; Ismail, A.; Frappier, V.; Lord, D.M.; Ng-Thow-Hing, C.; Van Vlack, E.R.; et al. Illuminating protein space with a programmable generative model. Nature 2023, 623, 1070–1078.
Trott, O.; Olson, A.J. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 2010, 31, 455–461.
Eberhardt, J.; Santos-Martins, D.; Tillack, A.F.; Forli, S. AutoDock Vina 1.2.0: New docking methods, expanded force field, and python bindings. J. Chem. Inf. Model. 2021, 61, 3891–3898.
Friesner, R.A.; Banks, J.L.; Murphy, R.B.; Halgren, T.A.; Klicic, J.J.; Mainz, D.T.; Shenkin, P.S. Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 2004, 47, 1739–1749.
Hollingsworth, S.A.; Dror, R.O. Molecular dynamics simulation for all. Neuron 2018, 99, 1129–1143.
Hospital, A.; Goñi, J.; Orozco, M.; Gelpí, J.L. Molecular dynamics simulations: Advances and applications. Adv. Appl. Bioinform. Chem. 2015, 15, 37–47.
Jiménez, J.; Skalic, M.; Martinez-Rosell, G.; De Fabritiis, G. KDEEP: Protein–ligand absolute binding affinity prediction via 3D-convolutional neural networks. J. Chem. Inf. Model. 2018, 58, 287–296.
Wang, L.; Wu, Y.; Deng, Y.; Kim, B.; Pierce, L.; Krilov, G.; Abel, R. Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field. J. Am. Chem. Soc. 2015, 137, 2695–2703.
Kwon, Y.; Shin, W.H.; Ko, J.; Lee, J. AK-score: Accurate protein-ligand binding affinity prediction using an ensemble of 3D-convolutional neural networks. Int. J. Mol. Sci. 2020, 21, 8424.
Lee, H.J.; Emani, P.S.; Gerstein, M.B. Improved Prediction of Ligand–Protein Binding Affinities by Meta-modeling. J. Chem. Inf. Model. 2024, 64, 8684–8704.
Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem 2023 update. Nucleic Acids Res. 2023, 51, D1373–D1380.
Kim, S.; Thiessen, P.A.; Bolton, E.E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B.A.; et al. PubChem substance and compound databases. Nucleic Acids Res. 2016, 44, D1202–D1213.
Wu, X.; Lin, H.; Bai, R.; Duan, H. Deep learning for advancing peptide drug development: Tools and methods in structure prediction and design. Eur. J. Med. Chem. 2024, 268, 116262.
Bhati, A.P.; Wan, S.; Alfè, D.; Clyde, A.R.; Bode, M.; Tan, L.; Titov, M.; Merzky, A.; Turilli, M.; Jha, S.; et al. Pandemic drugs at pandemic speed: Infrastructure for accelerating COVID-19 drug discovery with hybrid machine learning- and physics-based simulations on high-performance computers. Interface Focus 2021, 11, 20210018.
Filella-Merce, I.; Molina, A.; Díaz, L.; Orzechowski, M.; Berchiche, Y.A.; Zhu, Y.M.; Vilalta-Mor, J.; Malo, L.; Yekkirala, A.S.; Ray, S.; et al. Optimizing drug design by merging generative AI with a physics-based active learning framework. Commun. Chem. 2025, 8, 238.
Gorantla, R.; Kubincova, A.; Suutari, B.; Cossins, B.P.; Mey, A.S. Benchmarking active learning protocols for ligand-binding affinity prediction. J. Chem. Inf. Model. 2024, 64, 1955–1965.
Bailey, M.; Moayedpour, S.; Li, R.; Corrochano-Navarro, A.; Kötter, A.; Kogler-Anele, L.; Riahi, S.; Grebner, C.; Hessler, G.; Matter, H.; et al. Deep Batch Active Learning for Drug Discovery. eLife 2024, 12.
Loeffler, H.H.; Wan, S.; Klähn, M.; Bhati, A.P.; Coveney, P.V. Optimal molecular design: Generative active learning combining REINVENT with precise binding free energy ranking simulations. J. Chem. Theory Comput. 2024, 20, 8308–8328.
Goles, M.; Daza, A.; Cabas-Mora, G.; Sarmiento-Varón, L.; Sepúlveda-Yañez, J.; Anvari-Kazemabad, H.; Davari, M.D.; Uribe-Paredes, R.; Olivera-Nappa, Á.; Navarrete, M.A.; et al. Peptide-based drug discovery through artificial intelligence: Towards an autonomous design of therapeutic peptides. Briefings Bioinform. 2024, 25.
Al-Omari, A.M.; Akkam, Y.H.; Zyout, A.A.; Younis, S.A.; Tawalbeh, S.M.; Al-Sawalmeh, K.; Al Fahoum, A.; Arnold, J. Accelerating antimicrobial peptide design: Leveraging deep learning for rapid discovery. PLoS ONE 2024, 19, e0315477.
Matzko, R.; Konur, S. Technologies for design-build-test-learn automation and computational modelling across the synthetic biology workflow: A review. Netw. Model. Anal. Health Inform. Bioinform. 2024, 13, 22.
National Academies of Sciences, Engineering, and Medicine. The Age of AI in the Life Sciences: Benefits and Biosecurity Considerations; The National Academies Press: Washington, DC, USA, 2025.
Liao, X.; Ma, H.; Tang, Y.J. Artificial intelligence: A solution to involution of design–build–test–learn cycle. Curr. Opin. Biotechnol. 2022, 75, 102712.
Abolhasani, M.; Kumacheva, E. The rise of self-driving labs in chemical and materials sciences. Nat. Synth. 2023, 2, 483–492.
Dai, T.; Vijayakrishnan, S.; Szczypiński, F.T.; Ayme, J.F.; Simaei, E.; Fellowes, T.; Clowes, R.; Kotopanov, L.; Shields, C.E.; Zhou, Z.; et al. Autonomous mobile robots for exploratory synthetic chemistry. Nature 2024, 635, 890–897.
Tom, G.; Schmid, S.P.; Baird, S.G.; Cao, Y.; Darvish, K.; Hao, H.; Lo, S.; Pablo-García, S.; Rajaonson, E.M.; Skreta, M.; et al. Self-driving laboratories for chemistry and materials science. Chem. Rev. 2024, 124, 9633–9732.
Ha, T.; Lee, D.; Kwon, Y.; Park, M.S.; Lee, S.; Jang, J.; Choi, B.; Jeon, H.; Kim, J.; Choi, H.; et al. AI-driven robotic chemist for autonomous synthesis of organic molecules. Sci. Adv. 2023, 9, eadj0461.
Kusne, A.G.; Yu, H.; Wu, C.; Zhang, H.; Hattrick-Simpers, J.; DeCost, B.; Sarker, S.; Oses, C.; Toher, C.; Curtarolo, S.; et al. On-the-fly closed-loop materials discovery via Bayesian active learning. Nat. Commun. 2020, 11, 5966.
Ramos, M.C.; Michtavy, S.S.; Porosoff, M.D.; White, A.D. Bayesian optimization of catalysts with in-context learning. arXiv 2023, arXiv:2304.05341.
Xian, Y.; Ding, X.; Jiang, X.; Zhou, Y.; Sun, J.; Xue, D.; Lookman, T. Unlocking the black box beyond Bayesian global optimization for materials design using reinforcement learning. Npj Comput. Mater. 2025, 11, 143.
Wu, Y.; Walsh, A.; Ganose, A.M. Race to the bottom: Bayesian optimisation for chemical problems. Digit. Discov. 2024, 3, 1086–1100.
Klarner, L.; Rudner, T.G.; Morris, G.M.; Deane, C.M.; Teh, Y.W. Context-guided diffusion for out-of-distribution molecular and protein design. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 24770–24807.
Wacker, D.; Stevens, R.C.; Roth, B.L. How ligands illuminate GPCR molecular pharmacology. Cell 2017, 170, 414–427.
Chen, S.; Lin, T.; Basu, R.; Ritchey, J.; Wang, S.; Luo, Y.; Cheng, X. Design of target specific peptide inhibitors using generative deep learning and molecular dynamics simulations. Nat. Commun. 2024, 15, 1611.
Khoee, A.G.; Yu, Y.; Feldt, R. Domain generalization through meta-learning: A survey. Artif. Intell. Rev. 2024, 57, 285.
Xie, W.; Zhang, J.; Xie, Q.; Gong, C.; Ren, Y.; Xie, J.; Pei, J. Accelerating discovery of bioactive ligands with pharmacophore-informed generative models. Nat. Commun. 2025, 16, 2391.
Dharmasivam, M.; Kaya, B.; Akinware, A.; Azad, M.G.; Richardson, D.R. Leading AI-Driven Drug Discovery Platforms: 2025 Landscape and Global Outlook. Pharmacol. Rev. 2025, 100102.
Dermawan, D.; Alotaiq, N. From Lab to Clinic: How Artificial Intelligence (AI) Is Reshaping Drug Discovery Timelines and Industry Outcomes. Pharmaceuticals 2025, 18, 981.

Figure 1. A unified framework for de novo drug design using a conditional diffusion model. (a) The core engine is a conditional diffusion model, which comprises two processes. The noising process systematically corrupts a data structure, such as a protein ($X_{0}$), into Gaussian noise ($X_{T}$) over discrete timesteps. The generative process learns the reverse, creating novel structures by iteratively denoising from noise, guided by specific conditions. (b) For de novo small molecule design, the model generates molecular graphs or 3D coordinates conditioned on a target’s binding pocket and desired properties (e.g., high activity, low toxicity) to produce diverse, pocket-fitting ligands. (c) For de novo therapeutic peptide design, the model generates peptide sequences and their corresponding 3D structures, conditioned on a target protein’s surface, to design novel binders.
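
The noising process in Figure 1(a) can be illustrated with a minimal sketch. It assumes the common Gaussian formulation in which $X_{t}$ is drawn directly from $X_{0}$ as $\sqrt{\bar{\alpha}_{t}}\,X_{0} + \sqrt{1-\bar{\alpha}_{t}}\,\epsilon$; the linear variance schedule, the 1000 timesteps, the three-atom coordinates, and all function names are illustrative assumptions rather than settings taken from any model reviewed here.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances beta_1..beta_T (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_to_step(x0, t, alpha_bars, rng):
    """Sample X_t from X_0 in one shot (t is a 0-based index into alpha_bars)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [[signal * c + noise * rng.gauss(0.0, 1.0) for c in atom] for atom in x0]

# Hypothetical X_0: three atoms with 3D coordinates in arbitrary units.
x0 = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]]
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
x_early = noise_to_step(x0, 9, alpha_bars, rng)    # early step: still close to X_0
x_late = noise_to_step(x0, 999, alpha_bars, rng)   # final step: nearly pure Gaussian noise
```

Under this schedule the retained signal fraction $\bar{\alpha}_{t}$ falls monotonically toward zero, which is why the last step is practically indistinguishable from Gaussian noise. The generative process would train a network to invert these steps; that part is not sketched here.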

Figure 2. Contrasting Design Paradigms for Small Molecules and Therapeutic Peptides with Diffusion Models. The figure illustrates the distinct challenges and tailored AI-driven solutions for small molecules (left column, (a,c,e,g)) versus therapeutic peptides (right column, (b,d,f,h)). (a,b) The primary challenge for small molecules is navigating the vast, discrete chemical space, whereas for peptides, it is conquering the continuous conformational space to achieve a stable fold. (c,d) Consequently, diffusion models are employed for structure-based generation to fit small molecules into binding pockets, while for peptides, they perform structure-guided design by decorating a predefined scaffold. (e,f) Key downstream hurdles also differ: ensuring chemical synthesizability for small molecules versus achieving biological stability against degradation for peptides. (g,h) Finally, solutions are modality-specific: integrating chemical knowledge (e.g., reaction rules) to guide synthesis for small molecules, and engineering stability in peptides through modifications like cyclization or using non-canonical amino acids. Explanation of symbols: The red crosses (X) indicate synthetic infeasibility (e) or blocked enzymatic degradation (f,h). In (e), the colored spheres represent atoms within a complex molecular graph structure.

Figure 3. A Closed-Loop Paradigm for Drug Discovery Driven by AI and Automation. The figure depicts an autonomous Design-Build-Test-Learn (DBTL) cycle, representing a future paradigm for accelerated therapeutic discovery. This approach seamlessly integrates AI-powered design with automated laboratory execution to create a self-optimizing discovery engine. (a) Design: Generative AI models propose novel molecular candidates in silico. (b) Build: The most promising candidates are synthesized and purified using robotic platforms. (c) Test: The synthesized compounds are evaluated in high-throughput biological assays to generate activity data. (d) Learn: Experimental results are fed back into the AI model, which updates its knowledge and generates more informed hypotheses for the next cycle. This iterative process aims to dramatically shorten timelines and increase the success rate of finding novel medicines.

Table 1. A Head-to-Head Comparison: Diffusion Models for Small Molecules vs. Peptides.

Feature	Small Molecules	Therapeutic Peptides
Representation	Graphs: Atoms & bonds; 3D Point Clouds: Coordinates; Requires $E(3)$ equivariance [53,70,71]	Sequences: Discrete amino acids; 3D Backbones: Continuous coordinates; Often requires distinct models for sequence (discrete) and structure (continuous) generation
Chemical Space	Vast & Discontinuous (∼$10^{60}$) [11,12,15,188]; Learns implicit chemical rules (e.g., valence)	Combinatorial & Structured ($20^{n}$) [12]; Governed by protein folding principles
Typical Size	MW: 150–900 Da (oral drugs often 300–500 Da) [189]; Heavy Atoms: 10–50; Mostly rigid structures	MW: 500–5000 Da; Length: 5–50 amino acids [190]; Highly flexible, multiple conformations
Key Challenge	Synthesizability: Can it be made? [132]; Stereochemistry control	Biological Stability: Folding, proteolysis; Immunogenicity avoidance [190]
Validation	Computational: Docking, ADMET [191,192]; Experimental: Synthesis, binding assays (SPR, ITC) [193,194,195]	Computational: Structure prediction (AF2) [155]; Experimental: Expression, binding & stability assays
Conditioning	Protein pocket geometry [59,103,121]; Pharmacophores, desired properties (QED, logP) [126]	Target protein surface [65]; Structural motifs (helix), sequence patterns
Data & Cost	Data: PDBbind (∼20k complexes), CrossDocked (100k pairs); Cost: Varies widely by model and scale	Data: PDB (∼220k entries), AlphaFold DB (>200 M structures); Cost: Varies widely by model and scale
Success Metrics	Chemical: Validity, Uniqueness, Novelty [96,103,196]; Predicted Affinity: High-affinity rate	Structural: Designability (folds to target) [155]; Experimental Success: Varies, often a few to tens of percent [65]
Example Works	Pocket2Mol [121], DiffSBDD [103], TargetDiff [59], GeoDiff [71], DiffLinker [197]	RFdiffusion [65], ProteinMPNN [180] (seq. design), Chroma [198], EvoDiff [61], FoldingDiff [67]
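
To put the Chemical Space row in perspective, the short calculation below compares the $20^{n}$ sequence count with the ∼$10^{60}$ small-molecule estimate. It is a back-of-the-envelope sketch that assumes only the 20 canonical amino acids and treats $10^{60}$ as an exact threshold, which it is not; the function names are illustrative.

```python
SMALL_MOLECULE_SPACE = 10 ** 60   # order-of-magnitude estimate quoted in Table 1
AMINO_ACIDS = 20                  # canonical amino acids only (an assumption)

def sequence_count(length, alphabet=AMINO_ACIDS):
    """Number of distinct linear sequences of a given length."""
    return alphabet ** length

def shortest_length_exceeding(limit, alphabet=AMINO_ACIDS):
    """Smallest n for which alphabet**n is strictly greater than limit."""
    length = 1
    while sequence_count(length, alphabet) <= limit:
        length += 1
    return length

crossover = shortest_length_exceeding(SMALL_MOLECULE_SPACE)

for n in (5, 20, 50):
    digits = len(str(sequence_count(n)))
    print(f"n = {n:>2}: roughly 10^{digits - 1} sequences")
print(f"20^n first exceeds 10^60 at n = {crossover}")
```

The crossover falls at n = 47, inside the 5–50 residue range listed under Typical Size, so the longest peptides in that range already span a nominal sequence space comparable to the small-molecule estimate, before folding constraints are considered.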

Table 2. Performance Highlights of Representative Models in Molecular Generation.

Model	Modality/Role	Key Performance Metrics & Highlights
Small Molecule Generation (Diffusion Models)
Pocket2Mol [121]	Structure-based generation	Avg. Vina score: −7.29 kcal/mol; High-affinity rate: 54.2%; Good drug-likeness (QED: 0.56).
DiffSBDD [103]	Structure-based generation	High chemical validity (97.8%) and novelty (85.7%); Median Vina score: −7.50 kcal/mol.
TargetDiff [59]	Guided generation	State-of-the-art binding affinity (Avg. Vina: −7.80 kcal/mol); High-affinity rate: 58.1%.
GeoDiff [71]	Conformer generation	High-quality 3D conformer generation with low geometric error (MAT-R: 0.86 Å on Drugs dataset).
Peptide and Protein Design (Diffusion-Centric Workflows)
RFdiffusion [65]	Backbone generation (Diffusion)	High experimental success rate for binders (14–19%); Generated structures match Cryo-EM to 0.63 Å RMSD.
ProteinMPNN [180]	Sequence design (GNN, non-diffusion)	High native sequence recovery (52.4%); Essential downstream tool for designing sequences for generated backbones.
Chroma [198]	Protein/Complex generation (Diffusion)	Experimentally confirmed designs with crystal structures matching to ~1.0 Å RMSD; Generates diverse topologies.
EvoDiff [61]	Sequence generation (Discrete Diffusion)	High experimental success for functional proteins (65–75%); Generates evolutionarily plausible sequences.
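
The validity, uniqueness, and novelty figures reported for small-molecule generators can be computed along the following lines. This sketch assumes the MOSES-style convention (validity over all generated samples, uniqueness over valid samples, novelty over unique valid samples absent from the training set); individual papers may define the ratios differently. The `toy_is_valid` check is a hypothetical stand-in for a real cheminformatics parser, and the input strings are made up.

```python
def generation_metrics(generated, training_set, is_valid):
    """Return validity, uniqueness and novelty as fractions in [0, 1]."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

def toy_is_valid(smiles):
    """Hypothetical placeholder: only checks that parentheses are balanced.
    A real pipeline would parse and canonicalize each string with a chemistry toolkit."""
    return bool(smiles) and smiles.count("(") == smiles.count(")")

# Made-up example data, not taken from any benchmark.
generated = ["CCO", "CCO", "c1ccccc1", "CC(C", "CCN"]
training = ["CCO"]
metrics = generation_metrics(generated, training, toy_is_valid)
print(metrics)
```

Here four of the five strings pass the toy check (validity 0.8), three of those four are distinct (uniqueness 0.75), and two of the three distinct strings are absent from the training list (novelty of about 0.67). Comparing strings only makes sense after canonicalization, since one molecule can be written as many different SMILES.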

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
