Review

Deep Generative Modeling of Protein Conformations: A Comprehensive Review

Bellini College of AI, Cybersecurity and Computing, University of South Florida, 4202 E. Fowler Avenue, LIB 122, Tampa, FL 33620, USA
*
Author to whom correspondence should be addressed.
BioChem 2025, 5(3), 32; https://doi.org/10.3390/biochem5030032
Submission received: 1 July 2025 / Revised: 23 August 2025 / Accepted: 3 September 2025 / Published: 15 September 2025

Abstract

Proteins are dynamic macromolecules whose functions are intricately linked to their structural flexibility. Recent breakthroughs in deep learning have enabled accurate prediction of static protein structures. However, understanding protein function is more complex and often requires access to a diverse ensemble of conformations. Traditional sampling techniques, including molecular dynamics and Monte Carlo simulations, can explore conformational landscapes, but they are often limited by high computational cost and slow convergence. In response, deep generative models (DGMs) have emerged as a powerful alternative for efficient and scalable protein conformation sampling. Leveraging architectures such as variational autoencoders, normalizing flows, generative adversarial networks, and diffusion models, DGMs can learn complex, high-dimensional distributions over protein conformations directly from data. This survey provides a comprehensive overview of recent advances in generative models for protein conformation sampling. We categorize existing models based on generative architecture, structural representation, and target tasks. We also discuss key datasets, evaluation metrics, limitations, and opportunities for integrating physics-based knowledge with data-driven models. By bridging machine learning and structural biology, DGMs are poised to transform our ability to model, design, and understand dynamic protein behavior.

1. Introduction

Proteins are fundamental biomolecules that carry out essential functions in living organisms, from catalyzing metabolic reactions to transmitting signals and forming structural components. The function of a protein is intimately tied to its three-dimensional (3D) structure, which is determined by the sequence of amino acids through a complex folding process. Consequently, predicting a protein’s structure from its amino acid sequence, a problem known as protein structure prediction, has been a central challenge in computational biology for decades.
Recent breakthroughs, such as AlphaFold2 [1], have dramatically advanced static structure prediction, achieving near-experimental accuracy and reshaping the landscape of structural biology. These models offer valuable insights for drug discovery [2,3], enzyme engineering [4], protein-protein interaction prediction [5,6], construction of a comprehensive protein database [7], along with experimental validation of protein-protein complexes [8], and study of biologically relevant states [9]. However, they provide only a single, static snapshot of a protein’s structure. In reality, proteins are inherently dynamic and often adopt multiple conformations to perform their biological functions.
Accurate static structure prediction is a significant milestone that is likely to support a variety of structure-function studies by computational and molecular biologists alike, but it is still a first-order approximation to protein structure modeling. To fully understand and model protein function and characterize the various molecular interactions in which a protein molecule participates in a cell, including interactions of protein targets with therapeutic, small-molecule leads, we need to account for the intrinsic structural plasticity of protein molecules. Many proteins switch between different structures to regulate interactions with different molecular partners in the cell. Nowhere is this structural plasticity more evident than in the spike protein of the SARS-CoV-2 virus, where open-to-closed structural changes are key to the ability of the protein to evade our immune system just long enough to position itself and bind with the ACE2 receptors in the lungs [10].
Another relevant example is the calmodulin protein [11,12], which undergoes significant conformational changes upon calcium binding. Experiments show that its motifs transition from “closed” to “open” during binding with other molecular partners. This transition enables calmodulin to play a crucial role in regulating numerous cellular processes and maintaining overall cellular health. These examples further strengthen the argument that finding the conformations that a protein adopts to regulate interactions with molecular partners in the cell is an important open problem in molecular biology.
Therefore, obtaining a broad view of the protein conformation space is the main objective of the computational research surveyed here. This problem, however, poses an outstanding challenge and has been approached by computational biologists and computer scientists alike with a variety of computational methods over the decades. Most prominent among these are those based on optimization, which can be broadly categorized into three groups: those based on numerical optimization, stochastic optimization, and hybrid methods.
In numerical optimization methods, a physics-based objective function is defined and optimized through numerical simulation to navigate the conformational space and produce physically relevant conformations. These methods are known as molecular dynamics (MD) in the computational biology literature. One prominent way researchers augment potential energy functions during simulation is through the selection of collective variables (CVs) [13,14,15,16,17]. CVs represent a set of Cartesian coordinates through a collection of basis functions that act as a lower-dimensional representation of a complex system. This reduced representation lowers the complexity of the system, which is its main advantage over other approximations of the molecular dynamics algorithm. Other approaches include umbrella sampling [18], isolation of low-frequency vibrations [19], temperature-accelerated molecular dynamics (TAMD) [20], and adaptive sampling [21], among many others. Despite the vast body of research in this field, MD suffers from problems such as the inability to reach biologically relevant timescales [22], high computational cost [23], and force field limitations [24].
Another powerful yet less computationally expensive approach, based on stochastic optimization, is the family of Evolutionary Algorithms (EAs). EAs are biologically inspired algorithms, based on works such as [25,26], that mimic natural evolution and have been successfully applied to the protein structure prediction domain. Their stochastic nature is conducive to exploring a vast conformational space and identifying energetically favorable structures [27,28,29,30,31,32,33]. Despite its promise, this approach is susceptible to problems such as slow convergence [34]. Finding the global optimum also becomes challenging as the ruggedness of the energy landscape increases [35].
A recent line of work involving deep learning techniques for conformation sampling revolves around adapting a single-structure predictor, AlphaFold2 [1], or its successors such as AlphaFold3 [36] or ESMFold [37], to the task. Although AlphaFold is primarily designed for the prediction of structure from sequence, its adaptations and integration with generative models open new possibilities for sampling and exploring conformational diversity [38]. One notable work built on AlphaFold2 is MSA subsampling [39], which is used as a benchmark in much of the DGM literature covered in this survey.
Deep generative models (DGMs) offer a fundamentally different approach to sampling protein conformational space. Traditional physics-based simulations, such as molecular dynamics (MD), are limited by small time steps and high computational cost, making it difficult to access the full range of biologically relevant protein motions, especially for large systems or intrinsically disordered regions (IDRs).
In contrast, DGMs learn a parametric model of the equilibrium distribution of protein conformations directly from data, enabling rapid generation of diverse, independent structural samples. This allows scalable exploration of conformational landscapes that are otherwise prohibitively expensive to access with conventional simulations.
In this review, we focus on deep generative models with explicit probabilistic mechanisms, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, and diffusion models. This review offers a scope broader than that of Barethiya et al. [40], which focuses on benchmarking specific VAE and diffusion models, while this work provides a general overview of this nascent field. Our work encompasses a more diverse range of protein types compared to Erdhos et al. [41] and addresses a fundamentally different objective than previous studies [42,43], which focus specifically on deep learning models for protein design rather than the broader analytical framework we present here. Figure 1 presents a review of key studies on the integration of Multiple Sequence Alignment, Sequence, and Structural Data in Protein Conformational Analysis and Trajectory Sampling.
The paper is arranged as follows: Section 2 defines the problems of conformation sampling and trajectory generation; Section 3 discusses the basics of the generative models surveyed; Section 4 discusses the relevant papers and their applications in the context of protein conformation modeling; Section 5 discusses popular datasets used for conformation modeling; Section 6 discusses the metrics used to compare generated conformations with the reference dataset; and Section 7 discusses the challenges and opportunities that lie ahead in the broader context of modeling protein structural plasticity.

2. Problem Definition

In this review, we explore deep generative models that address two closely related but distinct challenges in structural biology: protein conformation sampling and protein trajectory generation.

2.1. Protein Conformation Sampling

Protein conformation sampling refers to the process of generating structural configurations that reflect the underlying equilibrium distribution of a protein. Ideally, a sampling method produces conformations $x$ such that:
$$p_S(x) \approx p_{\mathrm{eq}}(x)$$
where $p_S(x)$ is the sampling distribution induced by the method, and $p_{\mathrm{eq}}(x)$ is the true equilibrium distribution. Traditional approaches such as molecular dynamics (MD) and Monte Carlo simulations are commonly used for this task.
The goal is to generate a representative ensemble of conformations:
$$\{x_1, x_2, \dots, x_N\}$$
This ensemble captures the diversity and relative probabilities of protein structures under physiological conditions. Such ensembles are crucial for studying conformational variability, functional motions, and interaction propensities.
The equilibrium distribution that governs these conformations is given by Boltzmann statistics:
$$p_{\mathrm{eq}}(x) = \frac{1}{Z} e^{-\beta E(x)}$$
where $E(x)$ is the potential energy of conformation $x$, $\beta = 1/(k_B T)$ is the inverse thermal energy, and $Z$ is the partition function ensuring normalization. This formulation implies that lower-energy conformations are exponentially more probable than higher-energy ones.
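To make the Boltzmann weighting concrete, the following is a minimal sketch that evaluates $p_{\mathrm{eq}}$ over a small, finite set of conformations. The function name, the example energies, the temperature of 300 K, and the choice of kcal/mol units are illustrative assumptions rather than values taken from the surveyed works; for a real protein the partition function $Z$ runs over a continuous space and cannot be summed this way.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_probabilities(energies, temperature=300.0):
    """Return p_eq for a finite list of conformation energies (kcal/mol).

    Hypothetical helper for illustration: Z is computed as a plain sum,
    which is only possible for a small discrete set of states.
    """
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shift energies for numerical stability; cancels in p
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)  # partition function over the finite set
    return [w / z for w in weights]
```

Calling `boltzmann_probabilities([0.0, 0.5, 2.0])` returns three probabilities that sum to one and decrease with energy, with the ratio of any two equal to $e^{-\beta \Delta E}$.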

2.2. Protein Trajectory Generation

While conformation sampling captures thermodynamic diversity, protein trajectory generation focuses on the time-resolved evolution of protein structure. These trajectories are typically generated via MD simulations, which numerically integrate Newton’s equations of motion under a specified force field and thermodynamic ensemble [44,45].
For a protein consisting of $N$ atoms, a trajectory is defined as:
$$X(t) = \left( x_1(t), x_2(t), \dots, x_N(t) \right) \in \mathbb{R}^{3N}$$
where $x_i(t) \in \mathbb{R}^3$ represents the position vector of atom $i$ at time $t$, and $X(t)$ denotes the full protein conformation at that time.
Together, conformation sampling and trajectory generation provide complementary insights: the former captures the statistical ensemble of accessible structures, while the latter reveals the dynamical transitions between them.
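As a rough illustration of how a trajectory $X(t)$ arises from numerically integrating Newton's equations, the sketch below applies the velocity Verlet scheme to a single degree of freedom in a harmonic potential. The integrator choice, the toy potential, and all parameter values are assumptions made for illustration; production MD engines add force fields, thermostats, and constraints that are not shown here.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one degree of freedom; returns the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


def harmonic_force(x, k=1.0):
    """Force from the toy potential E(x) = 0.5 * k * x**2."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy for the toy system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x
```

With `velocity_verlet(1.0, 0.0, harmonic_force, 1.0, 0.01, 1000)`, the total energy stays close to its initial value of 0.5, which is the qualitative behavior expected of a symplectic integrator.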

3. Basics of Deep Generative Modeling

This section provides an overview of the four primary classes of deep generative models that have been applied to protein conformation sampling. Each architecture offers distinct advantages and trade-offs in terms of training stability, computational efficiency, and the ability to model complex conformational distributions. An overview of the architectures of the models discussed in this literature can be found in Figure 2. We begin with an examination of generative adversarial networks.

3.1. Generative Adversarial Networks

Generative Adversarial Networks (GANs) are generative models composed of two neural networks, a generator $G$ and a discriminator $D$, trained in a competitive learning process. The generator $G$ learns to map random noise $z \sim p_z(z)$ to synthetic samples $G(z)$, while the discriminator $D$ attempts to distinguish between real samples $x \sim p_{\mathrm{data}}(x)$ and generated ones. The objective is formulated as:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z}\left[ \log\left( 1 - D(G(z)) \right) \right]$$
Here, $\mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)]$ represents the discriminator’s confidence in correctly identifying real samples, while $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ penalizes it for mistakenly identifying generated samples as real.
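The two expectations in the objective can be estimated from batches of discriminator outputs. The sketch below is a minimal illustration in which the function name and the example inputs are hypothetical, and the discriminator outputs are assumed to already be probabilities strictly between 0 and 1.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term
```

When the discriminator cannot tell the two sets apart and outputs 0.5 everywhere, the estimate equals $-2 \log 2$; a discriminator that separates real from generated samples well drives the value toward zero, which is what the generator tries to prevent.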
In protein conformation sampling, GANs offer a powerful alternative to molecular dynamics by directly learning the distribution of conformational states from simulation data. Once trained, they can generate statistically independent structures that capture both local flexibility and large-scale conformational transitions, providing ensembles at a fraction of the computational cost. Conditional GANs, such as idpGAN, extend this capability to sequence-dependent sampling, enabling generalization to unseen proteins and enhancing transferability across systems [46]. Beyond ensemble generation, GANs also provide a framework for approximating protein trajectories by sampling continuous states from the learned distribution, thereby reconstructing dynamic pathways without the need for long MD simulations. This highlights their potential to accelerate exploration of conformational landscapes and to generate physically consistent trajectories that improve our understanding of biomolecular dynamics [47].

3.2. Variational Autoencoders

Variational Autoencoders (VAEs) [48,49] learn a probabilistic mapping between data and a latent space through variational inference. A VAE consists of two neural networks: an encoder $q_\phi(z \mid x)$, which maps an input $x$ to a distribution over latent variables $z$, and a decoder $p_\theta(x \mid z)$, which reconstructs data from latent samples. The training objective maximizes a variational lower bound, the evidence lower bound (ELBO), on the data log-likelihood:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right] - D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, p(z) \right)$$
Here, the first term encourages accurate reconstruction of the input, while the second term regularizes the latent space by minimizing the Kullback–Leibler divergence between the approximate posterior $q_\phi(z \mid x)$ and a prior $p(z)$, typically a standard Gaussian.
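The two ELBO terms can be written out explicitly under common modeling choices. The sketch below assumes a diagonal Gaussian encoder parameterized by a mean and a log-variance, a standard Gaussian prior, and an isotropic Gaussian decoder with fixed standard deviation; these choices and all function names are illustrative assumptions, and individual papers in this survey may use different likelihoods.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_recon, sigma=1.0):
    """log p(x | z) for an isotropic Gaussian decoder (sigma is an assumption)."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    normalizer = 0.5 * len(x) * math.log(2.0 * math.pi * sigma ** 2)
    return -0.5 * squared_error / sigma ** 2 - normalizer


def elbo(x, x_recon, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL term."""
    return gaussian_log_likelihood(x, x_recon) - kl_to_standard_normal(mu, log_var)
```

The KL term vanishes when the encoder output equals the prior (zero mean, zero log-variance) and grows as the posterior drifts away from it, which is the regularizing pressure described above.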
For application, VAEs have been shown to effectively capture the complex, high-dimensional conformational landscapes of both intrinsically disordered and ordered proteins [50]. By learning a smooth and structured latent representation, VAEs can generate novel conformations that extend beyond those sampled in short MD trajectories, thereby enriching conformational ensembles with states that would otherwise require prohibitively long simulations to observe. Moreover, points interpolated within the latent space correspond to physically plausible intermediate structures, enabling reconstruction of continuous transition pathways and offering a trajectory-like view of protein dynamics. Recent advances, such as latent space assisted adaptive sampling (LAST) [51], demonstrate that VAE-generated conformations can serve as efficient starting points for iterative MD simulations, accelerating the discovery of transition pathways and expanding conformational coverage. Thus, VAEs provide a powerful framework for augmenting molecular dynamics, enhancing conformational ensemble generation, and approximating protein trajectories in a computationally efficient manner.

3.3. Flow

Flow-based models, or normalizing flows [52,53], are generative models that learn an invertible transformation $f_\theta$ to map a simple base distribution $p_z(z)$ (such as a Gaussian) into a complex data distribution $p_x(x)$, such that $f_\theta(z) = x$ and $f_\theta^{-1}(x) = z$. This framework enables exact likelihood estimation through the change-of-variables formula:
$$\log p_x(x) = \log p_z\left( f_\theta^{-1}(x) \right) + \log \left| \det \frac{\partial f_\theta^{-1}(x)}{\partial x} \right|$$
Here, the determinant of the Jacobian matrix $\partial f_\theta^{-1}(x) / \partial x$ measures how the transformation $f_\theta^{-1}$ locally scales or distorts volume around a point $x$. This term is essential for adjusting the density under the transformation, allowing the model to compute the exact likelihood of $x$ based on how much space it occupies relative to the base distribution.
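A one-dimensional affine map is perhaps the simplest case in which the change-of-variables formula can be checked by hand. The sketch below assumes $f_\theta(z) = \text{scale} \cdot z + \text{shift}$ with a standard Gaussian base distribution; the function names are hypothetical and the example is far simpler than the flows used for protein structures.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of the base distribution p_z = N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def affine_flow_logpdf(x, scale, shift):
    """Exact log p_x(x) for the flow x = scale * z + shift (scale != 0)."""
    z = (x - shift) / scale            # inverse transformation f^{-1}(x)
    log_det = -math.log(abs(scale))    # log |d f^{-1}(x) / dx|
    return standard_normal_logpdf(z) + log_det
```

The result coincides with the log-density of a Gaussian with mean `shift` and standard deviation `|scale|`, so the Jacobian term is exactly what rescales the base density after stretching.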
In protein conformation sampling, flow models learn invertible mappings from a latent space to complex 3D structures. However, the requirement for invertibility and a tractable Jacobian determinant limits architectural flexibility. Flow Matching [54] addresses this by learning a time-dependent velocity field $v_\theta(x, t)$ that transports a sample $x(0)$ drawn from a base distribution to a target structure $x(1)$ using linear interpolation:
$$x(t) = (1 - t)\, x(0) + t\, x(1), \qquad \mathcal{L}(\theta) = \mathbb{E}_{x(0),\, x(1),\, t} \left\| v_\theta(x(t), t) - \left( x(1) - x(0) \right) \right\|^2$$
In this formulation, $x(t)$ lies along a straight path between the start and target conformations. The model is trained to match the constant velocity $x(1) - x(0)$ at each point along this path. This enables smooth transitions between structures without requiring an invertible transformation, making Flow Matching suitable for modeling the continuous and high-dimensional landscape of protein conformations.
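The interpolation and regression target can be sketched in a few lines. In the example below, `velocity_fn` stands in for the learned network $v_\theta$ and is a hypothetical callable; the uniform sampling of $t$ and the number of time samples per pair are assumptions made for illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path x(t) = (1 - t) * x(0) + t * x(1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, pairs, n_times=8, seed=0):
    """Monte Carlo estimate of E || v(x(t), t) - (x(1) - x(0)) ||^2.

    velocity_fn: callable (x_t, t) -> predicted velocity (stand-in for v_theta)
    pairs: list of (x0, x1) coordinate lists of equal length
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for x0, x1 in pairs:
        target = [b - a for a, b in zip(x0, x1)]  # constant target velocity
        for _ in range(n_times):
            t = rng.random()                      # t drawn uniformly from [0, 1)
            prediction = velocity_fn(interpolate(x0, x1, t), t)
            total += sum((p - u) ** 2 for p, u in zip(prediction, target))
            count += 1
    return total / count
```

A velocity function that always returns $x(1) - x(0)$ for the given pair attains zero loss, while one that always returns zero incurs a loss equal to the squared distance between the endpoints.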
In practice, flow-based generative models have shown strong potential for protein conformation sampling by directly learning distributions over structural ensembles rather than relying solely on costly molecular simulations. Recent advances, such as AlphaFlow and P2DFlow [55,56], demonstrate that flow matching can be integrated with structural priors (e.g., AlphaFold/ESMFold predictions or MD-derived ensembles) to generate physically plausible protein conformations. By leveraging SE(3)-equivariant architectures and energy-informed priors, these models can capture domain motions, intermediate states, and conformational transitions that underlie protein function. Importantly, flow-based approaches enable trajectory-like sampling of conformations, providing a data-driven alternative to molecular dynamics for exploring equilibrium distributions and transition pathways. This positions flow matching not only as a powerful tool for ensemble generation but also as a promising framework for efficient protein trajectory modeling, with implications for understanding allosteric regulation, stability, and dynamics in diverse biological contexts.

3.4. Diffusion

Diffusion-based generative models operate by learning to reverse a stochastic process that progressively adds noise to data, transforming it into a tractable prior distribution. In the forward process, a data sample $x_0$ is gradually perturbed through a sequence of latent variables $x_1, \dots, x_T$ using Gaussian noise, typically following a formulation that preserves variance:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left( x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I \right)$$
The model is trained to approximate the reverse transitions:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left( x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t) \right)$$
Here, $x_t$ represents the noisy version of the data at time step $t$, and $\beta_t$ is a predefined noise schedule that controls the variance of the added noise. In the forward process, noise is added step by step, causing $x_t$ to gradually lose information about the original data. The reverse process aims to reconstruct the data by predicting the mean $\mu_\theta(x_t, t)$ and variance $\Sigma_\theta(x_t, t)$ of the denoised sample at each step, where $\theta$ represents the parameters of a neural network.
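The forward process is simple enough to simulate directly. The sketch below implements one noising step and the equivalent closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$, which follows from composing the Gaussian steps. The linear schedule and its endpoint values are illustrative assumptions, as are the function names; the reverse network is not shown.

```python
import math
import random


def linear_beta_schedule(n_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (endpoint values are assumptions)."""
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]


def cumulative_alpha_bar(betas):
    """Running products of (1 - beta_t), one entry per time step."""
    products, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


def forward_step(x_prev, beta, rng):
    """One draw from q(x_t | x_{t-1}) = N(sqrt(1 - beta) * x_{t-1}, beta * I)."""
    return [math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for x in x_prev]


def forward_sample(x0, t, alpha_bars, rng):
    """Closed-form draw of x_t given x_0; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for x in x0]
```

Because the squared signal coefficient and the noise variance sum to one at every step, unit-variance data keep unit variance throughout, which is the sense in which this formulation preserves variance.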
In protein conformation sampling, diffusion models provide a scalable framework for exploring the high-dimensional energy landscape beyond molecular dynamics. By denoising random noise into physically plausible structures, they generate diverse conformations approximating the equilibrium ensemble. Advances with SE(3)-equivariant architectures and physics-guided terms enable sampling that respects structural symmetries and energy constraints [57,58,59]. This not only reveals hidden binding pockets and transient states crucial for allosteric regulation and ligand recognition, but also offers a surrogate for trajectory generation, where denoising implicitly traces pathways between metastable states. Thus, diffusion models combine equilibrium fidelity with kinetic relevance, making them a powerful tool for studying protein dynamics and guiding drug discovery.

4. Taxonomy

Having established the theoretical foundations of deep generative models, we now examine their practical applications to protein conformation sampling and trajectory generation. This section is organized by model architecture, discussing the specific implementations, datasets, and performance characteristics of each approach. We begin with GAN-based methods, which, while less prevalent in this domain, have shown promise in specific applications.

4.1. GAN

4.1.1. Application of Generative Adversarial Networks

The application of GANs to protein conformation sampling has been limited but diverse. Pang et al. [60] propose DeepPath, an active learning (AL) framework that models protein transition pathways by generating low-energy transitions. The framework eliminates the need for a large pre-existing training dataset and enforces physical feasibility by using all-atom molecular mechanics (MM) force fields as an oracle. These force fields are empirical potential energy functions that describe interatomic interactions through bonded (bonds, angles, dihedrals) and non-bonded (electrostatic, van der Waals) terms. Despite the lack of comparison with other deep learning-based methods, this approach opens up an interesting research avenue for flow and diffusion-based generative models.
To address the challenge of modeling intrinsically disordered proteins (IDPs) [61], Janson et al. developed idpGAN [46], a Transformer-based conditional GAN trained on coarse-grained MD simulation data to rapidly generate diverse, physically realistic conformations. The method uses multiple discriminators to ensure structural validity. The authors later retrained idpGAN on atomistic simulation data to show that the approach also extends to higher-resolution conformational ensemble generation.
Liu et al. [62] apply GANs to the problem of conformation clustering with a method referred to as AAE-GS, which demonstrates effective clustering performance on a small protein, Trp-cage (PDB ID: 2JOF). The authors also provide a model trajectory with a sufficiently long simulation time to address the absence of standardized datasets for assessing clustering quality.
Bouvier et al. [63] develop a generative adversarial network (GAN) to generate stable oligosaccharide conformations on the fly. The architecture includes a generator that maps latent space points to candidate structures and a discriminator that distinguishes these from true energy minima. Subsequently, the generator gets better at ‘fooling’ the discriminator by creating ever more realistic conformations in a zero-sum game.

4.1.2. Discussion

While GAN-based approaches are not yet the dominant paradigm in protein conformation modeling, they offer unique advantages, particularly in scenarios where traditional simulation methods struggle, such as modeling intrinsically disordered protein (IDP) conformations.
However, the limited adoption of GANs in this domain is understandable in retrospect. These models are known to suffer from fundamental challenges, including training instability, mode collapse, and convergence difficulties [64,65,66], which can hinder their reliability and generalization in sensitive biological applications.

4.2. VAE

Compared to generative adversarial networks (GANs), variational autoencoders (VAEs) [49] offer several advantages that make them particularly appealing for modeling protein conformational landscapes. First, VAEs are built on a probabilistic framework that explicitly models a continuous and smooth latent space, where each point corresponds to a plausible protein conformation. This latent space is trained to follow a prior distribution (commonly Gaussian) via the Kullback-Leibler (KL) divergence term in the objective. At the same time, a reconstruction loss ensures that sampled latent vectors decode into realistic conformations.
This structured latent space allows VAEs to act as a low-dimensional proxy for the protein energy landscape, where nearby points correspond to structurally similar states with gradual transitions between them. As a result, VAEs are especially suited to capture the continuum of conformations relevant for flexible proteins or proteins undergoing allosteric changes. Furthermore, the latent space can be systematically explored or interpolated to generate novel, previously unsampled conformations, including rare intermediate states that may be difficult to access via molecular dynamics simulations alone.

4.2.1. Protein Conformation Sampling

VAEs have been used to sample the protein conformation space by leveraging the trained latent space. The principle is to convert high-dimensional protein structural data into a continuous, low-dimensional representation using the VAE encoder, search this space guided by a structure quality metric, and then use the decoder to obtain a backbone-only or full-atom protein structure.
Mansoor et al. [67] leveraged RoseTTAFold [68] to retrieve the full-atom structure and generate an ensemble of the undruggable K-Ras protein [69,70]. Tian et al. [71] and Xiao et al. [72] show that the latent space of a VAE can be used to generate unsampled protein conformations when trained on relatively short molecular dynamics simulation data, followed by works from [73,74,75,76,77]. Among these, Degiacomi et al. [75] test the VAE in a complex protein-protein docking scenario (using the HIV-1 hexameric capsomer) to account for broad hinge motions taking place upon binding.
Ruzmetov et al. [73] propose the Internal Coordinate Net (ICoN), which generates novel synthetic conformations through a learned latent space. To facilitate full-atom generation, the authors use a bond-angle-torsion (BAT)-based vector representation for comprehensive sampling of the conformational landscape of the intrinsically disordered protein Aβ42, validated through metrics such as the number of conformations within an energy cutoff and the number of synthetic conformations.
Another notable work, by Zhu et al. [50], uses short MD simulations to enable enhanced sampling of IDP conformations by introducing a probabilistic latent space that captures the conformational landscape better than standard autoencoders, as measured by metrics such as the Pearson Correlation Coefficient (PCC) and Root Mean Squared Distance (RMSD) (details in Section 6).
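The encode, search, and decode principle described at the start of this subsection can be sketched generically. The example below assumes a simple random local search around the encoded seed and treats `encode`, `decode`, and `score` as hypothetical callables supplied by the user, with lower scores meaning better structures. It is not the search strategy of any specific paper cited here, each of which uses its own procedure.

```python
import random


def latent_search(encode, decode, score, structure,
                  n_candidates=64, step=0.5, seed=0):
    """Random local search in latent space (illustrative strategy only).

    encode: structure -> latent vector (list of floats)
    decode: latent vector -> structure
    score:  structure -> float, lower is better (e.g., an energy proxy)
    """
    rng = random.Random(seed)
    z_seed = encode(structure)
    best_structure = decode(z_seed)
    best_score = score(best_structure)
    for _ in range(n_candidates):
        # Perturb the seed latent vector with isotropic Gaussian noise.
        z = [zi + step * rng.gauss(0.0, 1.0) for zi in z_seed]
        candidate = decode(z)
        candidate_score = score(candidate)
        if candidate_score < best_score:
            best_structure, best_score = candidate, candidate_score
    return best_structure, best_score
```

In practice the encoder and decoder would be the trained VAE networks and the score a structure quality metric; replacing the random perturbation with gradient-based or grid-based exploration leaves the overall pattern unchanged.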

4.2.2. Protein Trajectory

Molecular dynamics (MD) simulations are a widely used tool for studying protein conformations and dynamics at atomic resolution [44,45,78]. Despite their accuracy, conventional MD simulations often suffer from limited sampling efficiency due to the tendency to become trapped in local energy minima, making it difficult to explore the full conformational landscape [79,80,81].
Recent approaches have sought to overcome this challenge by integrating MD with deep learning techniques. LAST [51] demonstrated large conformational changes in a shorter time compared to traditional MD and Stochastic Dynamics Simulation (SDS) [82] methods. After initial training with conformations obtained via short MD simulation, similar to [50], further seeds are selected from the learned latent space, similar to Gaussian accelerated MD simulation [83].
Similarly, Moritsugu et al. [84] introduced VAE-driven multiscale enhanced sampling (MSES) for modeling the transition between the open and closed states of an RNA-binding protein, again using a short MD simulation (100 ns long) as VAE training data. The authors use “Motion Tree” [85] to describe the dynamics of different regions of a given protein structure and thereby model multiple conformations.

4.2.3. Other Applications

Kleiman et al. [86] propose VAMPNet to assign probabilistic memberships to metastable states using softmax outputs, allowing each conformation to belong to multiple states with varying confidence. Bozkurt et al. [87] aim to provide interpretable embeddings, recognizing that the latent space can be multimodal Gaussian. Albu et al. [88] show that the VAE latent space can capture structural information characterizing a certain protein superfamily and thus, can characterize an unseen protein from the same superfamily, using the Structural Alphabet [89] and Angles representation.

4.2.4. Discussion

VAE-based approaches discussed above and summarized in Table 1 excel at sampling unseen conformations using extensive MD simulation as training data along with modeling complex protein-protein interaction in some cases, e.g., [90]. Although this is an exciting application, in general, the VAE architecture needs extensive hyperparameter tuning for balancing encoder-decoder performance. Moreover, a VAE trained on one protein system may not generalize well to others, and users must still determine suitable force field parameters for the specific system under study, essentially adding another variable into the mix to generate computationally expensive simulation data.

4.3. Flow

Flow-based models [48,52,53] present an emerging approach in the field of protein conformation sampling. Building on recent developments such as flow matching, which combines optimal transport, neural ordinary differential equations, and normalizing flows [54,92,93], flow models offer a principled way to model complex, high-dimensional distributions over protein structures while being conditioned on coevolutionary data and ensuring equivariance, as detailed in the sections below. Table 2 provides a comparative analysis of prominent flow-based deep learning models used for protein conformation sampling.

4.3.1. AlphaFlow and Its Variants

AlphaFlow and ESMFlow [56] are generative models that fine-tune AlphaFold 2 [1] and ESMFold [37] under a flow-matching [54] framework to sample diverse and accurate protein conformational ensembles from sequence input. The authors turn AlphaFold and ESMFold into denoising models within a custom flow matching framework trained on PDB structures and further fine-tuned on ATLAS [96], an all-atom MD ensemble dataset, to model conformational variability. Compared to MSA subsampling, AlphaFlow and ESMFlow produce more diverse and precise structures on PDB test sets. Li et al. [97], who introduced AlphaFlow-Lit, further improved this framework by freezing the AlphaFold 2 [1] embedder and Evoformer modules during training, achieving a 47-fold speedup while still performing on par with AlphaFlow [56] in terms of attributes such as protein dynamics and local arrangement.

4.3.2. Integration of Equivariance and Full Atom Modeling

Although AlphaFlow and its variants achieve state-of-the-art performance in protein ensemble generation, they do not account for rotational invariance or equivariance. Jin et al. [55] address this limitation with P2DFlow, which takes both the protein sequence and an approximate energy as inputs. This approximate energy is derived by projecting molecular dynamics (MD) simulation results onto a low-dimensional manifold. P2DFlow then leverages ESM-2 [99] to obtain sequence embeddings and employs ESMFold with structural perturbations to produce a noisy structure as a prior. An SE(3)-equivariant architecture integrating the AlphaFold IPA (Invariant Point Attention) module with the E(n)-equivariant EGNN [100] (Equivariant Graph Neural Network) model predicts the vector field for the flow process.
This approach is shown to outperform existing generative models such as AlphaFlow [56] and Str2Str [58] across multiple benchmarks, with the caveat that it generates backbone conformations only, not full-atom structures.
Another flow-based framework focused solely on backbone conformation generation is BBFlow, introduced by Wolf et al. [98]. A given “equilibrium” backbone structure conditions an SE(3) flow-matching model that generates a set of protein backbone conformations. An important aspect of this work is that generation is conditioned not on sequence information, such as an MSA, but on the equilibrium structure; this eliminates the need for evolutionary information and for pre-trained folding-model weights (as used, e.g., by AlphaFlow [56]). The approach is thus analogous to a molecular dynamics simulation, which also requires an initial protein backbone structure. Benchmarking shows that BBFlow performs better than AlphaFlow (with no templates) on de novo proteins, yielding lower median RMSF and pairwise RMSD values.
Expanding beyond modeling backbone-only atoms, a recent contribution by Mahmoud et al. [94] presents a deep learning framework that combines coarse-grained SIRAH [101] protein representations with normalizing flows. This framework captures multimodal conformational distributions and generates low-energy full-atom protein ensembles that align closely with molecular dynamics (MD) simulations.

4.3.3. Discussion

Arguably, the inclusion of equivariance into the generative model itself is a matter of debate, as AlphaFold 3 [36] omits computationally expensive [102] equivariant layers in its structure generation module. Instead, the model relies on extensive data augmentation to learn approximate equivariance behavior from the training data, a departure from the approach adopted by AlphaFold2 [1], highlighting an alternative strategy for incorporating symmetry awareness in deep protein structure prediction models [103,104,105].
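The data-augmentation strategy described above can be sketched as follows: each training structure is randomly rotated and translated so that a non-equivariant network sees many poses of the same conformation. This is a generic illustration under our own assumptions, not the augmentation pipeline of AlphaFold 3; the helper names are hypothetical. The final helper shows why internal features such as pairwise distances are unaffected by the augmentation.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random 3D rotation matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:         # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def augment(coords, rng):
    """Apply a random rotation and translation to an (N, 3) coordinate array."""
    return coords @ random_rotation(rng).T + rng.normal(size=(1, 3))

def pairwise_distances(coords):
    """(N, N) matrix of Euclidean distances; invariant to rotation and translation."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```

An equivariant architecture builds this symmetry into its layers, whereas augmentation asks the network to learn it approximately from the transformed examples.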
Flow-based generative models present a promising alternative to traditional learning approaches for modeling protein conformations, as they offer exact likelihood estimation and stable training behavior. Despite these advantages, current models are limited in their ability to generate full-atom protein structures that include both backbone and side chains. The studies examined in this context define the generative process on coordinate frames constructed using only backbone atoms, such as Cα, N, Cβ, and O. This constraint largely stems from the models’ focus on improving upon AlphaFlow [56], which itself produces denoised Cβ structures rather than full-atom models.
In addition, there are currently no flow-based generative models that explicitly model conformational ensembles of intrinsically disordered proteins, suggesting a valuable opportunity for future research.

4.4. Diffusion

Diffusion-based generative models have rapidly emerged as a powerful paradigm in various areas of protein engineering, including protein design [106,107,108], protein-ligand docking [109,110,111], and protein structure generation [112,113,114]. Their core strength lies in modeling the conformational landscapes of proteins and peptides by learning to reverse a progressive noise corruption process. This iterative denoising procedure enables the generation of diverse, accurate, and physically plausible molecular structures. Table 3 presents a comparative overview of leading diffusion models in protein conformational sampling.

4.4.1. Perturb and Anneal Based Diffusion

Simulated annealing emulates high temperature (early steps), allowing larger jumps, while at low temperature (later steps), it fine-tunes toward high-probability regions. In terms of protein conformation sampling, early noisy samples can explore global conformational space, while later steps focus on fine adjustments.
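The coarse-to-fine behavior described above can be illustrated with annealed Langevin dynamics on a one-dimensional toy target. This is a generic sketch, not the sampler of Str2Str, DiG, or BioEmu: the target is a unit-variance Gaussian whose noise-perturbed score is known in closed form, and the noise schedule, step size, and target mean are arbitrary assumptions.

```python
import numpy as np

def score(x, sigma, mu=2.0):
    """Score of N(mu, 1) convolved with N(0, sigma^2), i.e. of N(mu, 1 + sigma^2)."""
    return -(x - mu) / (1.0 + sigma ** 2)

def annealed_langevin(x, sigmas, steps_per_level=100, step=0.05, seed=0):
    """Run Langevin updates while lowering the noise level sigma."""
    rng = np.random.default_rng(seed)
    for sigma in sigmas:                       # high noise first, low noise last
        for _ in range(steps_per_level):
            noise = rng.normal(size=x.shape)
            x = x + step * score(x, sigma) + np.sqrt(2.0 * step) * noise
    return x

sigmas = np.geomspace(10.0, 0.01, num=10)      # assumed noise schedule
x_init = np.random.default_rng(1).normal(scale=10.0, size=2000)
samples = annealed_langevin(x_init, sigmas)    # ends close to the target N(2, 1)
```

At large sigma the score is nearly flat and samples roam widely; as sigma shrinks, the score sharpens and samples settle into the high-probability region. In a protein model the closed-form score is replaced by a learned, SE(3)-equivariant network.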
Str2Str [58] shows that effective conformational sampling can be achieved using only static crystal structures in a zero-shot manner, without MD-based training. It employs an SE(3)-equivariant diffusion framework, using a perturb-and-anneal approach for the forward and reverse stages, respectively. The generative process is guided by score-based annealing and draws inspiration from the equivariant Invariant Point Attention (IPA) module of AlphaFold2 [1]. Notably, this approach remains agnostic to the underlying energy landscape, allowing flexible modeling without relying on predefined potential functions. Full-atom conformations are then obtained using the FASPR [120] protein side-chain packing tool.
DiG, a subsequent work by Zheng et al. [119], leverages a perturb-and-anneal diffusion principle to obtain the equilibrium distribution of a protein conditioned on a general descriptor. In a novel contribution, the diffusion process is additionally supervised by an energy function, such as one from DFT (Density Functional Theory) [121], beyond standard data-based training. Unlike Str2Str, the model produces a coarse-grained representation of the protein molecule, comprising individual residue orientations and Cα atom coordinates.
BioEmu [57] extends the perturb-and-anneal diffusion framework by training on a diverse dataset that includes static protein structures from the Protein Data Bank (PDB) [95], molecular dynamics (MD) trajectories, and experimental data to learn coarse-grained representations, similar to DiG. Like AlphaFold2, it incorporates sequence information to derive single and pairwise representations, which are then used as inputs to the diffusion model. The resulting model effectively captures equilibrium distributions and demonstrates conformational diversity, such as the open and closed states of the ADK protein. In addition, it supports the prediction of protein stability, evaluated against the MEGAscale dataset of Tsuboyama et al. [122], and outperforms AlphaFlow [56] in conformation generation tasks.

4.4.2. Other Approaches

DeepConformer [123] builds on the protein design model RFDiffusion [124]. The authors argue that the protein energy landscape and conformational dynamics can be learned from experimental structures in the PDB and coevolution data, rather than from the often-used simulation data. The resulting model exhibits dynamic properties (measured as per-residue RMSF) similar to those of MD-sampled structures, as demonstrated on the KaiB fold-switching mechanism and catalytic conformational changes in the adenylate kinase protein.
Maddipatla et al. [125] proposed a diffusion-based framework that instead leverages experimental data such as NMR (Nuclear Magnetic Resonance) [126]. It generates conformational ensembles by treating models such as AlphaFold3 [36] as structural priors and performing posterior inference conditioned on the experimental data, and it is shown to violate fewer constraints than AlphaFold3 [36] when modeling NMR ensembles.
In a related method that also leverages experimental data, ExEnDiff [117] builds on the pre-trained protein ensemble sampler Str2Str [58], generating protein conformation ensembles that more closely approximate the true Boltzmann distribution than those produced by Str2Str alone. An advantage of this approach is that its performance improves in proportion to that of the underlying protein ensemble sampler.
A persistent challenge in protein conformation sampling lies in balancing the high computational cost of collecting molecular dynamics (MD) data informed by the underlying force field with the scalability of energy-agnostic generative models. Str2str-FT and Str2str-NE, proposed by Lu et al. [118], address this by introducing a few-shot learning framework that fuses neural network based sampling with energy landscape awareness.
Their two-step strategy first generates a diverse set of plausible conformations using a neural network, which are further complemented through short MD simulations. This hybrid approach demonstrates improved performance in terms of Jensen-Shannon (JS) divergence and conformation validity, outperforming methods such as idpGAN, the original Str2str, and MSA subsampling [39,46,58].
As a complementary approach to modeling the protein energy landscape, ConfDiff, a diffusion-based method to generate protein conformations, was introduced by Wang et al. (2024) [59]. The model is guided by the input sequence using classifier-free guidance operating on the SE(3) group.
Additionally, it incorporates physics-based rewards derived from energy and force information, helping the model generate more realistic structures. Unlike models such as DiG [119], this approach avoids reliance on computationally expensive MD training data, and it introduces intermediate force supervision to balance quality and diversity better than prior work such as Str2Str [58], leading to diverse samples that are closer to the underlying Boltzmann distribution.
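Classifier-free guidance, as used for sequence conditioning above, combines a conditional and an unconditional score estimate from the same network. The sketch below shows the commonly used linear combination on plain vectors; it does not reproduce the SE(3) formulation of ConfDiff, and the guidance weight `w` and the toy inputs are assumptions.

```python
import numpy as np

def guided_score(score_cond, score_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional score
    toward (and, for w > 0, beyond) the conditional score."""
    return (1.0 + w) * score_cond - w * score_uncond

# Toy inputs standing in for network outputs with and without the sequence condition.
s_cond = np.array([1.0, 2.0])
s_uncond = np.array([0.0, 1.0])
s_guided = guided_score(s_cond, s_uncond, w=1.0)
```

Setting w = 0 recovers the purely conditional score; larger w strengthens the influence of the condition, typically at some cost in sample diversity.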

4.4.3. Modeling Intrinsically Disordered Proteins

As discussed in Section 4.3, IDPs pose a unique challenge because they exist as an ensemble rather than a static structure. Inspired by Str2Str [58], IDP-Fold [127] captures the conformational diversity of IDPs using score-based diffusion. IDP-Fold achieves more detailed structural features when compared to idpGAN [46]. Additionally, this approach bypasses the need for Multiple Sequence Alignments (MSAs) or experimental data, supporting the idea that IDPs do not require or benefit much from MSAs [128,129].
Taneja et al. [130] employ a two-stage approach to model the conformational ensemble of IDPs. First, a pairwise distance vector is generated given the sequence s and its inverted-charge mutant. Pre-trained DDPMs (Denoising Diffusion Probabilistic Models) [131] then map this vector to Cα coordinates.
Improving upon idpGAN [46], Janson et al. proposed idpSAM [132], a latent diffusion model combining an SE(3)-invariant autoencoder and a conditional DDPM [131], enabling transferable and generalizable ensemble generation from limited simulation data. Training proceeds in two stages: a Euclidean-invariant autoencoder (AE) is trained to encode protein conformations represented by Cα coordinates, followed by a DDPM [131] to decode Cα coordinates for subsequent IDP conformations. The resulting model improves upon idpGAN [46] on metrics such as KL divergence and RMSE.

4.4.4. Discussion

Diffusion models employ a fundamentally different approach from auto-encoder and flow-based methodologies. Rather than relying on indirect representations, these models utilize score functions to characterize the energy landscape governing protein folding dynamics directly. This energy-informed framework provides substantial advantages for conformation sampling by incorporating thermodynamically relevant principles. The key strength lies in their ability to generate complete, full-atom protein structures. While typically requiring integration with specialized side-chain optimization algorithms, diffusion models enable a direct sequence-to-conformation modeling pipeline, offering comprehensive structural prediction with both completeness and physical relevance.

5. Dataset

The success of deep generative models for protein conformation sampling depends critically on the quality and diversity of training datasets. This section reviews the primary datasets used in the literature, ranging from experimental structures to molecular dynamics simulations. Each dataset presents unique characteristics in terms of size, structural diversity, and temporal resolution that influence model performance and applicability.

5.1. Protein Data Bank

The Protein Data Bank (PDB) derived from [95] serves as a comprehensive dataset for training diffusion models with experimentally determined protein structures, primarily obtained via X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM). Computational models frequently utilize curated subsets of these structures, filtered based on experimental resolution quality, sequence length criteria to ensure computational feasibility, and completeness of structural information (e.g., resolution better than 5 Å, sequence length typically between 10 and 512 residues). To prevent data leakage and improve model generalization, training protocols often remove structures sharing high sequence similarity between training, validation, and testing datasets. Additionally, clustering strategies based on sequence similarity (e.g., 30% sequence identity) may be applied to manage redundancy and maintain data diversity [58,59,119].
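The resolution and length filters described above amount to a simple predicate over structure metadata. The sketch below applies the example thresholds quoted in the text to a few made-up records; the `Entry` class and the identifiers are hypothetical, and real pipelines additionally apply sequence-identity clustering, which is not shown.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    pdb_id: str          # hypothetical identifiers, for illustration only
    resolution: float    # in angstroms; lower values mean better resolution
    length: int          # number of residues

def keep(entry, max_resolution=5.0, min_len=10, max_len=512):
    """True if the entry passes the resolution and sequence-length filters."""
    return entry.resolution < max_resolution and min_len <= entry.length <= max_len

entries = [
    Entry("demo_a", 1.8, 120),   # passes both filters
    Entry("demo_b", 6.5, 200),   # resolution too poor
    Entry("demo_c", 2.4, 700),   # too long
    Entry("demo_d", 3.0, 8),     # too short
]
kept = [e.pdb_id for e in entries if keep(e)]
```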

5.2. AlphaFold Protein Structure Database

The AlphaFold Protein Structure Database (AlphaFold DB) is an extensive repository of protein structures predicted by the AlphaFold 2 artificial intelligence system developed by Google DeepMind [7]. When preparing datasets for computational modeling applications, the AlphaFold DB typically undergoes systematic preprocessing steps, such as sequence clustering based on identity thresholds (e.g., approximately 80% sequence identity), subsequent structural clustering to capture diverse conformational states, and stringent filtering criteria based on structural confidence metrics (e.g., pLDDT scores). This rigorous curation ensures the dataset maintains significant structural variability, thereby providing a valuable resource for training advanced generative models designed to accurately represent protein conformational diversity [57].

5.3. Atlas of Protein Molecular Dynamics

The ATLAS dataset is a large-scale resource comprising all-atom, explicit solvent molecular dynamics (MD) simulations of 1390 structurally diverse, non-membrane proteins [96]. These proteins were selected to represent the full range of Evolutionary Classification of protein Domains (ECOD) structural classes. For each protein, the dataset provides three independent 100 ns MD trajectories, sampled at high temporal resolution (e.g., 10,001 frames per trajectory). In typical data preprocessing pipelines, the trajectories are subsampled at regular intervals (e.g., every 100 frames) to obtain manageable subsets for training, such as 300 frames per protein. Representative conformations can also be selected based on structural or energetic criteria, including approximate energy estimates [55,56,57,97,98].
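The subsampling arithmetic above can be checked directly. Assuming a stride of 100 starting at frame 0 (the starting offset is our assumption; individual studies may differ), each 10,001-frame trajectory yields 101 frames, and three replicas give 303 frames per protein, consistent with the figure of roughly 300 quoted above.

```python
frames_per_trajectory = 10_001   # frames in one 100 ns ATLAS trajectory
stride = 100                     # keep every 100th frame
replicas = 3                     # independent trajectories per protein

# Frame indices retained from a single trajectory (offset 0 is an assumption).
kept_indices = list(range(0, frames_per_trajectory, stride))
frames_per_protein = replicas * len(kept_indices)
print(len(kept_indices), frames_per_protein)
```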

5.4. Bovine Pancreatic Trypsin Inhibitor

The Bovine Pancreatic Trypsin Inhibitor (BPTI) [133] is a widely used benchmark dataset for evaluating generative models of protein conformation due to its compact size and well-characterized dynamic properties. It consists of 58 residues and exhibits transitions among multiple conformational states. Molecular dynamics simulations of BPTI are commonly sampled at high temporal resolution (e.g., 250 ps intervals), and analysis typically focuses on C α atom coordinates. These settings facilitate the evaluation of a model’s capacity to reproduce conformational diversity and dynamics [59,78].

5.5. MEGAScale

The MEGAscale dataset [122] is a large-scale experimental resource for protein stability, comprising approximately one million measurements of folding free energies obtained via cDNA display proteolysis. A subset of this dataset, including over 750,000 experimental measurements, has been used to train generative models for protein property prediction [57]. As the dataset does not include structural information, complementary molecular dynamics (MD) simulations spanning 25 ms across 271 wild-type proteins and over 22,000 mutants have been employed to model folding-unfolding dynamics. These simulations enable the alignment of generated conformational ensembles with experimentally determined stability landscapes.

5.6. CATH

The CATH (Class-Architecture-Topology-Homologous superfamily) dataset derived from [134] provides a hierarchical classification of protein domains based on their structural and evolutionary relationships, supporting large-scale annotations of protein function and structure. The latest release (v4.3) significantly expands both sequence and structural coverage, with over 500,000 fully classified structural domains and 151 million predicted sequence domains assigned to 5481 superfamilies. In generative modeling applications, curated subsets of CATH domains—typically consisting of 50–200 residues—are used to capture functionally and structurally diverse protein fragments [57]. These domains are simulated using adaptive sampling protocols and high-resolution molecular dynamics, resulting in datasets with cumulative simulation times exceeding 40 ms across more than 1000 systems. Selected test domains reach trajectory lengths of up to 100 μs each, enabling accurate modeling of long-timescale conformational dynamics.

6. Metrics

The evaluation metrics are summarized in Table 4; details of the individual metrics can be found in Appendix A.

7. Future Work

7.1. Challenges

7.1.1. Lack of Interpretability

The lack of interpretability in deep learning models for protein conformation sampling is a real challenge because it makes it hard to understand how or why a model generates certain structures. These models may produce realistic-looking conformations, but it is often unclear which parts of the input, e.g., sequence features or structural patterns, influence the output. As a result, we cannot easily tell whether the model is learning meaningful biology or merely picking up on patterns in the data. This makes the results harder to trust, especially when little experimental data is available for comparison, and leaves the model a black box.

7.1.2. Simulation of Full-Atom Protein Trajectory

Sampling protein conformations, and especially full structural trajectories, presents a unique challenge because it involves capturing how atomic coordinates change over time. This makes it a complex, multidimensional problem. Although some models have been proposed as data-driven alternatives to molecular dynamics simulations [135,139], accurately tracking the progression of structural changes, including both backbone and side chain atoms, remains a significant and open challenge.

7.1.3. Integration of Other Biomolecules

As noted by Lewis et al. [57], deep generative models driven by data can complement insights from molecular dynamics (MD) simulations by efficiently exploring the conformational landscape of proteins. However, integrating other biomolecules, such as DNA, RNA, and ions, into these models remains a significant challenge to account for the structural changes upon binding. With the advent of AlphaFold 3 [36], which enables static modeling of such complexes, there is now a potential path forward to capture the conformational changes induced by protein interactions with these cellular partners.

7.2. Opportunities (See Box 1)

7.2.1. Aligning with Physical Feedback

Physical alignment methods specifically designed for generative models, such as Lu et al. [140], demonstrate a promising path forward. The conformation sampling models discussed in this survey lack concrete mechanisms for aligning the generative process through explicit energy calculations of individual protein structures. This novel approach, termed Energy Based Alignment (EBA), can be employed to align pre-trained generative models toward specific structural classes without requiring computationally expensive retraining procedures.
Box 1. What is Missing?
  • Intrinsically Disordered Protein specific Flow models.
  • Harmonized benchmarks across datasets.
  • Interpretable Conformational Sampling focused Deep Learning models [141].
  • Integration of DNA, RNA, and ions as part of large-scale conformational landscape models, aided by AlphaFold 3 [36].

7.2.2. Textual Representation

Schwing et al. [142] explore the textual representation of protein conformations from MD trajectories using vectors of the Ramachandran basin, as opposed to all-atom or coarse-grained representations. This abstraction suggests a future direction in which text-to-conformation models could emerge, combining advances in the fields of natural language processing and protein structure modeling.

7.2.3. Structural Language Models for Conformational Plasticity

Recently, protein language models have gained significant attention for a variety of applications, including the prediction of protein structure, functional annotation, and prediction of intrinsically disordered regions (IDR). In line with these developments, Lu et al. [143] introduce Structural Language Models (SLMs) aimed at modeling protein structural plasticity, capturing the inherent flexibility of proteins under different functional and environmental conditions. This approach could inform the development of future protein language models that translate sequences into structures to serve as priors for full-fledged protein conformation modeling.

7.2.4. Full-Atom End-to-End Structure Generation

One of the central challenges in protein conformation modeling is the end-to-end generation of full-atom protein structures, including both backbone and side chain atoms. AlphaFolding [135] addresses this in a limited setting by incorporating 4D dynamics prediction for protein chains of up to 256 amino acids, learning from molecular dynamics data. For each residue, it incorporates a side chain torsion angle loss in which angles are represented as points on the unit circle, an approach inspired by AlphaFold [1].
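Representing a torsion angle as a point on the unit circle avoids the discontinuity at ±180°, since nearby angles map to nearby points. The sketch below shows a minimal squared-distance loss in that representation. It is a simplified illustration rather than the exact loss of AlphaFolding or AlphaFold, which include additional terms not reproduced here; the function names are hypothetical.

```python
import numpy as np

def angle_to_point(theta):
    """Map angles (radians) to (sin, cos) points on the unit circle."""
    return np.stack([np.sin(theta), np.cos(theta)], axis=-1)

def torsion_loss(pred_points, true_angles):
    """Mean squared distance between normalized predictions and target points."""
    pred = pred_points / np.linalg.norm(pred_points, axis=-1, keepdims=True)
    target = angle_to_point(true_angles)
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

angles = np.array([3.13, -3.13, 0.5])   # toy torsion angles in radians
# Angles of 3.13 and -3.13 rad are almost identical on the circle,
# even though their numerical difference is close to 2*pi.
near_wrap = torsion_loss(angle_to_point(np.array([3.13])), np.array([-3.13]))
```

The loss ranges from 0 for a perfect prediction to 4 for a prediction pointing in the opposite direction.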

7.2.5. Generative Modeling of Peptide Dynamics

Jing et al. [139] propose a generative model designed to replicate the molecular dynamics of peptides. This opens promising opportunities for scaling generative models toward full-scale emulation of molecular dynamics for protein systems.

8. Conclusions

Deep generative models are reshaping protein structure modeling by moving beyond static representations to capture the dynamic conformational ensembles that underlie biological function. Unlike traditional approaches that often focus on a single best structure, these models aim to represent the full landscape of protein conformations, offering a richer view of structural variability and dynamics. In this review, we highlight recent advances in applying GANs, VAEs, normalizing flows, and diffusion models to protein conformation sampling, showcasing how they provide scalable and efficient alternatives to physics-based simulations such as molecular dynamics.
Each class of generative model brings unique strengths to the table. GANs offer fast generation, but face challenges with stability and mode collapse. VAEs provide interpretable and continuous latent spaces, which are valuable for adaptive sampling and ensemble learning. Flow-based models combine efficiency with tractable likelihood estimation and are naturally compatible with physical constraints. Diffusion models, meanwhile, have emerged as a particularly powerful paradigm due to their flexibility, equivariance support, and capacity to model complex, multimodal distributions, including those relevant for intrinsically disordered proteins and coarse-grained simulations.

Author Contributions

Conceptualization, T.R. and T.M.D.; methodology, T.R.; investigation, T.R.; resources, T.M.D.; writing—original draft preparation, T.R. and T.M.D.; writing—review and editing, T.R.; visualization, T.M.D.; supervision, T.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The references used in this review can be found in this Open Science Framework repository: https://osf.io/698vj/ (accessed on 7 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MD       Molecular Dynamics
MSA      Multiple Sequence Alignment
IDP      Intrinsically Disordered Proteins
PDB      Protein Data Bank
ADK      Adenylate Kinase
VAE      Variational Autoencoder
GAN      Generative Adversarial Network
MAT-P    Matching Precision
MAT-R    Matching Recall
RMSD     Root-Mean-Square Deviation
RMSF     Root-Mean-Square Fluctuation
Rg       Radius of Gyration
SASA     Solvent Accessible Surface Area
JS-PwD   Jensen-Shannon divergence of Pairwise Distances
JS-TIC   Jensen-Shannon divergence of Time-lagged Independent Components
JS-Rg    Jensen-Shannon divergence of Radius of Gyration
TICA     Time-lagged Independent Component Analysis

Appendix A. Metrics

Evaluating the quality of generated protein conformations involves five interrelated aspects: validity, precision, diversity, distributional similarity, and structural dynamics. Each group of metrics plays a complementary role in assessing whether a generative model captures biologically relevant features of protein structure.

Appendix A.1. Validity

Validity metrics evaluate whether generated conformations adhere to fundamental physical and geometric constraints of real proteins. Several studies assess validity by checking for steric clashes and unrealistic bond lengths or angles. For instance, the Val-Clash and Val-Bond scores reflect how often generated structures are free from such violations [118]. Similar principles are quantified by the violation loss, which decomposes structural errors into bond length, bond angle, and steric terms [115]. Str2Str [58] further defines validity as the fraction of conformations without Cα-Cα bond breaks or clashes. More detailed geometric validity is captured using Ramachandran plot analysis, which measures the proportion of backbone torsion angles falling within energetically favorable regions [55,116]. In addition, contact-map frequency quantifies how often specific residue-residue distances are preserved over time, indicating consistent spatial organization [116]. Finally, P2DFlow [55] introduces a holistic “sanity-check” pass rate, combining these physical and geometric validations into a single metric.
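A validity check in the spirit of the Cα-based definition above can be sketched as follows. The distance thresholds are illustrative assumptions, not the values used by Str2Str or any other cited work, and the function names are hypothetical.

```python
import numpy as np

def is_valid(ca, max_bond=4.2, min_contact=3.0):
    """ca: (N, 3) C-alpha coordinates in angstroms.
    Thresholds are illustrative assumptions, not published values."""
    bonds = np.linalg.norm(ca[1:] - ca[:-1], axis=-1)
    if np.any(bonds > max_bond):
        return False                       # chain break between consecutive residues
    dist = np.linalg.norm(ca[:, None] - ca[None, :], axis=-1)
    i, j = np.triu_indices(len(ca), k=2)   # non-adjacent residue pairs only
    return not np.any(dist[i, j] < min_contact)   # False if any steric clash

def validity(ensemble):
    """Fraction of conformations in the ensemble that pass both checks."""
    return float(np.mean([is_valid(ca) for ca in ensemble]))

# Toy example: an extended chain with 3.8 A spacing, and a copy with a gap.
straight = np.array([[3.8 * k, 0.0, 0.0] for k in range(5)])
broken = straight.copy()
broken[3:] += np.array([5.0, 0.0, 0.0])
score = validity([straight, broken])
```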

Appendix A.2. Precision and Accuracy

While validity ensures physical plausibility, precision and accuracy metrics assess how closely generated conformations approximate known or reference structures. The TM-score is widely adopted for global fold similarity, reporting values between 0 and 1, with higher values indicating better alignment to a known structure [115,119]. RMSD and its variants, RMSE and ARMSE, provide atomic-level error measures, often computed over Cα atoms, to capture the distance between predicted and reference coordinates [135,136]. For models reconstructing structures from latent representations, the recTM score and recRMSD assess fidelity to input conformations [144]. Local alignment quality is evaluated using lDDT-Cα metrics, with precision and recall defined as the average lDDT score from predictions to the closest reference, and vice versa [56]. Furthermore, statistical agreement with physical ground truth is measured using the Pearson correlation [145], which compares predicted and reference per-residue fluctuations or diversity scores [56,137]. Collectively, these metrics evaluate not only how accurately the models recover known states but also how reliably they capture finer structural details.
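For reference, the TM-score sums a distance-dependent term over aligned residues, with a length-dependent scale d0 = 1.24 (L − 15)^(1/3) − 1.8. The sketch below evaluates that sum for two structures that are already superimposed with a one-to-one residue correspondence. The full TM-score additionally maximizes over superpositions, which is omitted here, and the lower bound of 0.5 on d0 for short chains is our assumption.

```python
import numpy as np

def tm_score_aligned(pred, ref):
    """TM-score sum for two already-superimposed (L, 3) C-alpha arrays.
    No search over superpositions is performed (a simplification)."""
    L = len(ref)
    # Length-dependent distance scale; the 0.5 floor is an assumption for short chains.
    d0 = max(1.24 * np.cbrt(L - 15) - 1.8, 0.5)
    d = np.linalg.norm(pred - ref, axis=-1)
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))
```

Identical structures give a score of 1, and the score decays smoothly toward 0 as per-residue deviations grow relative to d0.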

Appendix A.3. Diversity

Beyond accuracy, generative models should produce a structurally diverse set of conformations to reflect the intrinsic flexibility of proteins. Diversity is often quantified using pairwise RMSD between sampled conformations, where higher average RMSD indicates broader sampling [58,98]. A more refined analysis uses matchness metrics: MAT-P and MAT-R report the average RMSD between each predicted conformation and its nearest neighbor in the reference or generated set, respectively, thus offering a dual perspective on coverage and novelty [115].
Similarly, GeoDiff introduces COV-P and COV-R, which report the percentage of conformations of one set that are covered by another within a given RMSD threshold [138]. AlphaFold-based methods further measure diversity using (1–lDDT-Cα), reflecting the average dissimilarity of local structural features [56]. Another relevant metric is pwRMSD, the mean RMSD across all sample pairs without needing a reference structure [98]. Together, these metrics ensure the model’s capability to explore multiple distinct, yet valid, conformational states.
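A reference-free diversity measure such as pwRMSD can be sketched as the mean RMSD over all pairs of conformations after optimal rigid superposition. The example below uses the Kabsch solution via singular value decomposition; it is a minimal illustration assuming equal-length Cα arrays, and the function names are our own.

```python
import numpy as np

def kabsch_rmsd(a, b):
    """RMSD between (N, 3) arrays a and b after optimal rigid superposition."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    u, s, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(u @ vt))     # guard against improper rotations
    # Minimal summed squared deviation expressed through the singular values.
    e = (a ** 2).sum() + (b ** 2).sum() - 2.0 * (s[0] + s[1] + d * s[2])
    return float(np.sqrt(max(e, 0.0) / len(a)))

def pw_rmsd(ensemble):
    """Mean Kabsch RMSD over all unordered pairs of conformations."""
    n = len(ensemble)
    vals = [kabsch_rmsd(ensemble[i], ensemble[j])
            for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(vals))
```

Because each pair is superimposed first, rigid-body motion does not contribute to the score; only internal structural differences do.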

Appendix A.4. Distributional Similarity

While diversity reflects ensemble spread, distributional similarity metrics evaluate whether generated ensembles match the statistical properties of molecular dynamics (MD) simulations. One of the most widely used measures is the Jensen–Shannon (JS) divergence, which compares distributions over structural features such as pairwise atomic distances (JS-PwD), radius of gyration (JS-Rg), and time-lagged independent components (JS-TIC) [97,137]. The metric 1–JS-PwD is sometimes reported as a fidelity score to interpret divergence in an intuitive manner [137].
AlphaFlow additionally adopts the root-mean Wasserstein distance (RMWD), a generalization of RMSD that compares spatial Gaussian distributions of atomic positions across ensembles [56]. ExEnDiff proposes a composite sampler score combining multiple JS divergences, residue-level RMSF deviation, and differences in secondary structure content, offering both global and local fidelity assessments [117].
Lastly, metrics like coverage and k-recall quantify how well generated samples recover the conformations seen in MD ensembles, by computing the fraction or average distance to nearest reference structures [59]. These metrics collectively evaluate whether the sampled ensemble replicates the thermodynamic and kinetic properties of protein conformational space.
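As an illustration of the JS-based metrics above, the following sketch computes JS-Rg by histogramming the radius of gyration of two ensembles on a shared grid and comparing the histograms. The bin count, the equal-mass assumption, and the use of base-2 logarithms (which bounds the divergence between 0 and 1) are our choices; individual studies may differ.

```python
import numpy as np

def radius_of_gyration(coords):
    """Rg of an (N, 3) coordinate array, assuming equal masses."""
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two histograms."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                       # terms with a = 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def js_rg(ensemble_a, ensemble_b, bins=50):
    """JS divergence between the Rg distributions of two ensembles."""
    rg_a = [radius_of_gyration(c) for c in ensemble_a]
    rg_b = [radius_of_gyration(c) for c in ensemble_b]
    lo, hi = min(rg_a + rg_b), max(rg_a + rg_b)
    h_a, _ = np.histogram(rg_a, bins=bins, range=(lo, hi))
    h_b, _ = np.histogram(rg_b, bins=bins, range=(lo, hi))
    return js_divergence(h_a, h_b)
```

The same `js_divergence` helper applies to JS-PwD and JS-TIC once the corresponding features (pairwise distances or TICA projections) have been histogrammed.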

Appendix A.5. Structural Dynamics and Stability

Beyond static structural metrics, dynamic behaviors are essential for understanding protein function. Root-mean-square fluctuation (RMSF) remains the most common metric for measuring flexibility at the residue level, capturing how much each Cα atom deviates across an ensemble [97,98,116]. Time-resolved RMSD curves further describe how sampled conformations drift from a reference throughout simulation, indicating structural stability or transitions [116]. Similarly, the gyration radius plotted over time can reveal global compaction or unfolding events [116]. To capture transient interactions, P2DFlow introduces weak and transient contact metrics—identifying residue pairs that either lose or gain contacts during simulation beyond a 10% frequency threshold [55]. Additionally, principal component analysis (PCA)-based metrics evaluate how well the sampled conformations capture collective motions by projecting the ensemble onto dominant eigenmodes [98]. These dynamic measures complement static metrics by ensuring the sampled conformations are not only accurate but also dynamically realistic.
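Per-residue RMSF, as described above, is the root-mean-square deviation of each atom from its mean position across the ensemble. The sketch below assumes the conformations have already been superimposed onto a common reference, which real pipelines must do first; the function name is our own.

```python
import numpy as np

def rmsf(ensemble):
    """ensemble: (M, N, 3) array of M pre-aligned conformations.
    Returns an (N,) array with the RMSF of each residue."""
    ensemble = np.asarray(ensemble, dtype=float)
    mean_pos = ensemble.mean(axis=0)                       # (N, 3) average structure
    sq_dev = ((ensemble - mean_pos) ** 2).sum(axis=-1)     # (M, N) squared deviations
    return np.sqrt(sq_dev.mean(axis=0))

# Toy ensemble: residue 0 is fixed, residue 1 moves between x = +1 and x = -1.
toy = np.array([[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
                [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]])
profile = rmsf(toy)
```

Comparing such a profile from a generated ensemble with the one from an MD ensemble, for example through the Pearson correlation mentioned in Appendix A.2, indicates how well per-residue flexibility is reproduced.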

References

  1. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
  2. Schauperl, M.; Denny, R.A. AI-based protein structure prediction in drug discovery: Impacts and challenges. J. Chem. Inf. Model. 2022, 62, 3142–3156.
  3. Nero, T.L.; Parker, M.W.; Morton, C.J. Protein structure and computational drug discovery. Biochem. Soc. Trans. 2018, 46, 1367–1379.
  4. Kiss, G.; Çelebi-Ölçüm, N.; Moretti, R.; Baker, D.; Houk, K. Computational enzyme design. Angew. Chem. Int. Ed. 2013, 52, 5700–5725.
  5. Humphreys, I.R.; Pei, J.; Baek, M.; Krishnakumar, A.; Anishchenko, I.; Ovchinnikov, S.; Zhang, J.; Ness, T.J.; Banjade, S.; Bagde, S.R.; et al. Computed structures of core eukaryotic protein complexes. Science 2021, 374, eabm4805.
  6. Evans, R.; O’Neill, M.; Pritzel, A.; Antropova, N.; Senior, A.; Green, T.; Žídek, A.; Bates, R.; Blackwell, S.; Yim, J.; et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv 2021. bioRxiv:2021.10.04.463034.
  7. Varadi, M.; Bertoni, D.; Magana, P.; Paramval, U.; Pidruchna, I.; Radhakrishnan, M.; Tsenkov, M.; Nair, S.; Mirdita, M.; Yeo, J.; et al. AlphaFold Protein Structure Database in 2024: Providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 2024, 52, D368–D375.
  8. Bryant, P.; Pozzati, G.; Zhu, W.; Shenoy, A.; Kundrotas, P.; Elofsson, A. Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search. Nat. Commun. 2022, 13, 6028.
  9. Mazhibiyeva, A.; Pham, T.T.; Pats, K.; Lukac, M.; Molnár, F. Bridging prediction and reality: Comprehensive analysis of experimental and AlphaFold 2 full-length nuclear receptor structures. Comput. Struct. Biotechnol. J. 2025, 27, 1998–2013.
  10. Huang, Y.; Yang, C.; Xu, X.F.; Xu, W.; Liu, S.W. Structural and functional properties of SARS-CoV-2 spike protein: Potential antivirus drug development for COVID-19. Acta Pharmacol. Sin. 2020, 41, 1141–1149.
  11. Park, H.Y.; Kim, S.A.; Korlach, J.; Rhoades, E.; Kwok, L.W.; Zipfel, W.R.; Waxham, M.N.; Webb, W.W.; Pollack, L. Conformational changes of calmodulin upon Ca2+ binding studied with a microfluidic mixer. Proc. Natl. Acad. Sci. USA 2008, 105, 542–547.
  12. Kawasaki, H.; Soma, N.; Kretsinger, R.H. Molecular Dynamics Study of the Changes in Conformation of Calmodulin with Calcium Binding and/or Target Recognition. Sci. Rep. 2019, 9, 10688.
  13. Noé, F.; Clementi, C. Collective variables for the study of long-time kinetics from molecular trajectories: Theory and methods. Curr. Opin. Struct. Biol. 2017, 43, 141–147.
  14. Fiorin, G.; Klein, M.L.; Hénin, J. Using collective variables to drive molecular dynamics simulations. Mol. Phys. 2013, 111, 3345–3362.
  15. Hayward, S.; Go, N. Collective variable description of native protein dynamics. Annu. Rev. Phys. Chem. 1995, 46, 223–250.
  16. Sittel, F.; Stock, G. Perspective: Identification of collective variables and metastable states of protein dynamics. J. Chem. Phys. 2018, 149, 150901.
  17. McCarty, J.; Parrinello, M. A variational conformational dynamics approach to the selection of collective variables in metadynamics. J. Chem. Phys. 2017, 147, 204109.
  18. Zhu, F.; Hummer, G. Convergence and error estimation in free energy calculations using the weighted histogram analysis method. J. Comput. Chem. 2012, 33, 453–465.
  19. Sauer, M.A.; Mondal, S.; Neff, B.; Maiti, S.; Heyden, M. Fast Sampling of Protein Conformational Dynamics. arXiv 2024, arXiv:2411.08154.
  20. Abrams, C.F.; Vanden-Eijnden, E. Large-Scale Conformational Sampling of Proteins Using Temperature-Accelerated Molecular Dynamics. Biophys. J. 2010, 98, 26a.
  21. Kleiman, D.E.; Nadeem, H.; Shukla, D. Adaptive Sampling Methods for Molecular Dynamics in the Era of Machine Learning. arXiv 2023, arXiv:2307.09664.
  22. Zwier, M.C.; Chong, L.T. Reaching biological timescales with all-atom molecular dynamics simulations. Curr. Opin. Pharmacol. 2010, 10, 745–752.
  23. Nocito, D.; Beran, G.J.O. Reduced computational cost of polarizable force fields by a modification of the always stable predictor-corrector. J. Chem. Phys. 2019, 150, 151103.
  24. Barman, A.; Batiste, B.; Hamelberg, D. Pushing the Limits of a Molecular Mechanics Force Field To Probe Weak CH·π Interactions in Proteins. J. Chem. Theory Comput. 2015, 11, 1854–1863.
  25. Fogel, D.; Fogel, L.; Atmar, J. Meta-evolutionary programming. In Proceedings of the [1991] Conference Record of the Twenty-Fifth Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 4–6 November 1991; pp. 540–545.
  26. Holland, J.H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, 1st ed.; Complex Adaptive Systems; MIT Press: Cambridge, MA, USA, 1992.
  27. Boumedine, N.; Bouroubi, S. A new hybrid genetic algorithm for protein structure prediction on the 2D triangular lattice. arXiv 2019, arXiv:1907.04190.
  28. Geng, X.; Guan, J.; Dong, Q.; Zhou, S. An improved genetic algorithm for statistical potential function design and protein structure prediction. Int. J. Data Min. Bioinform. 2012, 6, 162.
  29. Supady, A.; Blum, V.; Baldauf, C. First-principles molecular structure search with a genetic algorithm. arXiv 2015, arXiv:1505.02521.
  30. Comte, P.; Vassiliev, S.; Houghten, S.; Bruce, D. Genetic algorithm with alternating selection pressure for protein side-chain packing and pK(a) prediction. Biosystems 2011, 105, 263–270.
  31. Yang, Y.; Liu, H. Genetic algorithms for protein conformation sampling and optimization in a discrete backbone dihedral angle space. J. Comput. Chem. 2006, 27, 1593–1602.
  32. Khimasia, M.M.; Coveney, P.V. Protein structure prediction as a hard optimization problem: The genetic algorithm approach. arXiv 1997, arXiv:PHYSICS/9708012.
  33. Zaman, A.B.; Inan, T.T.; De Jong, K.; Shehu, A. Adaptive Stochastic Optimization to Improve Protein Conformation Sampling. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 2759–2771.
  34. Chen, Y.; He, J. Average convergence rate of evolutionary algorithms in continuous optimization. Inf. Sci. 2021, 562, 200–219.
  35. Saleh, S.; Olson, B.; Shehu, A. A population-based evolutionary search approach to the multiple minima problem in de novo protein structure prediction. BMC Struct. Biol. 2013, 13, S4.
  36. Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A.J.; Bambrick, J.; et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024, 630, 493–500.
  37. Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130.
  38. Sala, D.; Engelberger, F.; Mchaourab, H.; Meiler, J. Modeling conformational states of proteins with AlphaFold. Curr. Opin. Struct. Biol. 2023, 81, 102645.
  39. Del Alamo, D.; Sala, D.; Mchaourab, H.S.; Meiler, J. Sampling alternative conformational states of transporters and receptors with AlphaFold2. eLife 2022, 11, e75751.
  40. Barethiya, S.; Huang, J.; Chen, J. Predicting protein conformational ensembles using deep generative models. Biophys. J. 2024, 123, 549a.
  41. Erdos, G.; Dosztanyi, Z. Deep learning for intrinsically disordered proteins: From improved predictions to deciphering conformational ensembles. Curr. Opin. Struct. Biol. 2024, 89, 102950.
  42. Ovchinnikov, S.; Huang, P.S. Structure-based protein design with deep learning. Curr. Opin. Chem. Biol. 2021, 65, 136–144.
  43. Notin, P.; Rollins, N.; Gal, Y.; Sander, C.; Marks, D. Machine learning for functional protein design. Nat. Biotechnol. 2024, 42, 216–228.
  44. Karplus, M.; McCammon, J.A. Molecular dynamics simulations of biomolecules. Nat. Struct. Biol. 2002, 9, 646–652.
  45. Dror, R.O.; Dirks, R.M.; Grossman, J.; Xu, H.; Shaw, D.E. Biomolecular Simulation: A Computational Microscope for Molecular Biology. Annu. Rev. Biophys. 2012, 41, 429–452.
  46. Janson, G.; Valdes-Garcia, G.; Heo, L.; Feig, M. Direct generation of protein conformational ensembles via machine learning. Nat. Commun. 2023, 14, 774.
  47. Xie, X.; Valiente, P.A.; Kim, P.M. HelixGAN a deep-learning methodology for conditional de novo design of α-helix structures. Bioinformatics 2023, 39, btad036.
  48. Kingma, D.P.; Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. arXiv 2018, arXiv:1807.03039.
  49. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
  50. Zhu, J.J.; Zhang, N.J.; Wei, T.; Chen, H.F. Enhancing Conformational Sampling for Intrinsically Disordered and Ordered Proteins by Variational Autoencoder. Int. J. Mol. Sci. 2023, 24, 6896.
  51. Tian, H.; Jiang, X.; Xiao, S.; La Force, H.; Larson, E.C.; Tao, P. LAST: Latent Space-Assisted Adaptive Sampling for Protein Trajectories. J. Chem. Inf. Model. 2023, 63, 67–75.
  52. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using Real NVP. arXiv 2017, arXiv:1605.08803.
  53. Dinh, L.; Krueger, D.; Bengio, Y. NICE: Non-linear Independent Components Estimation. arXiv 2015, arXiv:1410.8516.
  54. Lipman, Y.; Chen, R.T.Q.; Ben-Hamu, H.; Nickel, M.; Le, M. Flow Matching for Generative Modeling. arXiv 2023, arXiv:2210.02747.
  55. Jin, Y.; Huang, Q.; Song, Z.; Zheng, M.; Teng, D.; Shi, Q. P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching. J. Chem. Theory Comput. 2025, 21, 3288–3296.
  56. Jing, B.; Berger, B.; Jaakkola, T. AlphaFold Meets Flow Matching for Generating Protein Ensembles. arXiv 2024, arXiv:2402.04845.
  57. Lewis, S.; Hempel, T.; Jiménez Luna, J.; Gastegger, M.; Xie, Y.; Foong, A.Y.K.; García Satorras, V.; Abdin, O.; Veeling, B.S.; Zaporozhets, I.; et al. Scalable emulation of protein equilibrium ensembles with generative deep learning. bioRxiv 2024. bioRxiv 2024.12.05.626885.
  58. Lu, J.; Zhong, B.; Zhang, Z.; Tang, J. Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling. arXiv 2023, arXiv:2306.03117.
  59. Wang, Y.; Wang, L.; Shen, Y.; Wang, Y.; Yuan, H.; Wu, Y.; Gu, Q. Protein Conformation Generation via Force-Guided SE(3) Diffusion Models. arXiv 2024, arXiv:2403.14088.
  60. Pang, Y.T.; Yang, L.; Gumbart, J.C. From static to dynamic: Rapid mapping of protein conformational transitions using DeepPath. Biophys. J. 2024, 123, 45a.
  61. Xian, R.; Rauscher, S. Current Topics, Methods, and Challenges in the Modelling of Intrinsically Disordered Protein Dynamics. arXiv 2022, arXiv:2211.06020.
  62. Liu, Y.; Amzel, L.M. Conformation Clustering of Long MD Protein Dynamics with an Adversarial Autoencoder. arXiv 2018, arXiv:1805.12313.
  63. Bouvier, B. Substituted Oligosaccharides as Protein Mimics: Deep Learning Free Energy Landscapes. J. Chem. Inf. Model. 2024, 64, 2195–2204.
  64. Mescheder, L.; Geiger, A.; Nowozin, S. Which Training Methods for GANs do actually Converge? arXiv 2018, arXiv:1801.04406.
  65. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. arXiv 2016, arXiv:1606.03498.
  66. Durall, R.; Chatzimichailidis, A.; Labus, P.; Keuper, J. Combating Mode Collapse in GAN training: An Empirical Analysis using Hessian Eigenvalues. arXiv 2020, arXiv:2012.09673.
  67. Mansoor, S.; Baek, M.; Park, H.; Lee, G.R.; Baker, D. Protein Ensemble Generation Through Variational Autoencoder Latent Space Sampling. J. Chem. Theory Comput. 2024, 20, 2689–2695.
  68. Krishna, R.; Wang, J.; Ahern, W.; Sturmfels, P.; Venkatesh, P.; Kalvet, I.; Lee, G.R.; Morey-Burrows, F.S.; Anishchenko, I.; Humphreys, I.R.; et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 2024, 384, eadl2528.
  69. Zhu, C.; Guan, X.; Zhang, X.; Luan, X.; Song, Z.; Cheng, X.; Zhang, W.; Qin, J.J. Targeting KRAS mutant cancers: From druggable therapy to drug resistance. Mol. Cancer 2022, 21, 159.
  70. Huang, L.; Guo, Z.; Wang, F.; Fu, L. KRAS mutation: From undruggable to druggable in cancer. Signal Transduct. Target. Ther. 2021, 6, 386.
  71. Tian, H.; Jiang, X.; Trozzi, F.; Xiao, S.; Larson, E.C.; Tao, P. Explore Protein Conformational Space With Variational Autoencoder. Front. Mol. Biosci. 2021, 8, 781635.
  72. Xiao, S.; Song, Z.; Tian, H.; Tao, P. Assessments of Variational Autoencoder in Protein Conformation Exploration. J. Comput. Biophys. Chem. 2023, 22, 489–501.
  73. Ruzmetov, T.; Hung, T.I.; Jonnalagedda, S.P.; Chen, S.h.; Fasihianifard, P.; Guo, Z.; Bhanu, B.; Chang, C.e.A. Sampling Conformational Ensembles of Highly Dynamic Proteins via Generative Deep Learning. J. Chem. Inf. Model. 2025, 65, 2487–2502.
  74. Afrasiabi, F.; Dehghanpoor, R.; Haspel, N. Using Autoencoders to Explore the Conformational Space of the Cdc42 Protein. In Computational Structural Bioinformatics; Communications in Computer and Information Science; Haspel, N., Molloy, K., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2025; Volume 2396, pp. 45–57.
  75. Degiacomi, M.T. Coupling Molecular Dynamics and Deep Learning to Mine Protein Conformational Space. Structure 2019, 27, 1034–1040.e3.
  76. Jin, Y.; Johannissen, L.O.; Hay, S. Predicting new protein conformations from molecular dynamics simulation conformational landscapes and machine learning. Proteins: Struct. Funct. Bioinform. 2021, 89, 915–921.
  77. Gupta, A.; Dey, S.; Zhou, H.X. Artificial Intelligence Guided Conformational Mining of Intrinsically Disordered Proteins. bioRxiv 2021. bioRxiv 2021.11.21.469457.
  78. Lindorff-Larsen, K.; Piana, S.; Dror, R.O.; Shaw, D.E. How Fast-Folding Proteins Fold. Science 2011, 334, 517–520.
  79. Laio, A.; Parrinello, M. Escaping free-energy minima. Proc. Natl. Acad. Sci. USA 2002, 99, 12562–12566.
  80. Bernardi, R.C.; Melo, M.C.; Schulten, K. Enhanced sampling techniques in molecular dynamics simulations of biological systems. Biochim. Biophys. Acta (BBA)—Subj. 2015, 1850, 872–877.
  81. Chodera, J.D.; Noé, F. Markov state models of biomolecular conformational dynamics. Curr. Opin. Struct. Biol. 2014, 25, 135–144.
  82. Harada, R.; Shigeta, Y. Efficient Conformational Search Based on Structural Dissimilarity Sampling: Applications for Reproducing Structural Transitions of Proteins. J. Chem. Theory Comput. 2017, 13, 1411–1423.
  83. Hamelberg, D.; Mongan, J.; McCammon, J.A. Accelerated molecular dynamics: A promising and efficient simulation method for biomolecules. J. Chem. Phys. 2004, 120, 11919–11929.
  84. Moritsugu, K. Multiscale Enhanced Sampling Using Machine Learning. Life 2021, 11, 1076.
  85. Koike, R.; Ota, M.; Kidera, A. Hierarchical Description and Extensive Classification of Protein Structural Changes by Motion Tree. J. Mol. Biol. 2014, 426, 752–762.
  86. Kleiman, D.E.; Shukla, D. Active learning of Conformational ensemble of Proteins. J. Chem. Theory Comput. 2023, 19, 4377–4388.
  87. Bozkurt Varolgüneş, Y.; Bereau, T.; Rudzinski, J.F. Interpretable embeddings from molecular simulations using Gaussian mixture variational autoencoders. Mach. Learn. Sci. Technol. 2020, 1, 015012.
  88. Albu, A.I. Towards learning transferable embeddings for protein conformations using Variational Autoencoders. Procedia Comput. Sci. 2021, 192, 10–19.
  89. Pandini, A.; Fornili, A.; Kleinjung, J. Structural alphabets derived from attractors in conformational space. BMC Bioinform. 2010, 11, 97.
  90. Ward, M.D.; Zimmerman, M.I.; Meller, A.; Chung, M.; Swamidass, S.J.; Bowman, G.R. Deep learning the structural determinants of protein biochemical properties by comparing structural ensembles with DiffNets. Nat. Commun. 2021, 12, 3023.
  91. Bandyopadhyay, S.; Mondal, J. A deep autoencoder framework for discovery of metastable ensembles in biomacromolecules. J. Chem. Phys. 2021, 155, 114106.
  92. Macenski, S.; Singh, S.; Martin, F.; Gines, J. Regulated Pure Pursuit for Robot Path Tracking. arXiv 2023, arXiv:2305.20026.
  93. Datta, D.; Lee, E.S. Exploring Thermal Transport in Electrochemical Energy Storage Systems Utilizing Two-Dimensional Materials: Prospects and Hurdles. arXiv 2023, arXiv:2310.08592.
  94. Mahmoud, A.H.; Masters, M.; Lee, S.J.; Lill, M.A. Accurate Sampling of Macromolecular Conformations Using Adaptive Deep Learning and Coarse-Grained Representation. J. Chem. Inf. Model. 2022, 62, 1602–1617.
  95. Burley, S.K.; Berman, H.M.; Kleywegt, G.J.; Markley, J.L.; Nakamura, H.; Velankar, S. Protein Data Bank (PDB): The single global macromolecular structure archive. In Protein Crystallography: Methods and Protocols; Humana Press: New York, NY, USA, 2017; pp. 627–641.
  96. Vander Meersche, Y.; Cretin, G.; Gheeraert, A.; Gelly, J.C.; Galochkina, T. ATLAS: Protein flexibility description from atomistic molecular dynamics simulations. Nucleic Acids Res. 2024, 52, D384–D392.
  97. Li, S.; Li, M.; Wang, Y.; He, X.; Zheng, N.; Zhang, J.; Heng, P.A. Improving AlphaFlow for Efficient Protein Ensembles Generation. arXiv 2024, arXiv:2407.12053.
  98. Wolf, N.; Seute, L.; Viliuga, V.; Wagner, S.; Stühmer, J.; Gräter, F. Learning conformational ensembles of proteins based on backbone geometry. arXiv 2025, arXiv:2503.05738.
  99. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118.
  100. Satorras, V.G.; Hoogeboom, E.; Welling, M. E(n) Equivariant Graph Neural Networks. arXiv 2021, arXiv:2102.09844.
  101. Klein, F.; Soñora, M.; Helene Santos, L.; Nazareno Frigini, E.; Ballesteros-Casallas, A.; Rodrigo Machado, M.; Pantano, S. The SIRAH force field: A suite for simulations of complex biological systems at the coarse-grained and multiscale levels. J. Struct. Biol. 2023, 215, 107985.
  102. He, L.; Chen, Y.; Dong, Y.; Wang, Y.; Lin, Z. Efficient equivariant network. Adv. Neural Inf. Process. Syst. 2021, 34, 5290–5302.
  103. Brehmer, J.; Behrends, S.; de Haan, P.; Cohen, T. Does equivariance matter at scale? arXiv 2024, arXiv:2410.23179.
  104. Pozdnyakov, S.N.; Ceriotti, M. Smooth, exact rotational symmetrization for deep learning on point clouds. arXiv 2023, arXiv:2305.19302.
  105. Wang, Y.; Elhag, A.A.; Jaitly, N.; Susskind, J.M.; Bautista, M.A. Swallowing the Bitter Pill: Simplified Scalable Conformer Generation. arXiv 2024, arXiv:2311.17932.
  106. Gruver, N.; Stanton, S.; Frey, N.C.; Rudner, T.G.J.; Hotzel, I.; Lafrance-Vanasse, J.; Rajpal, A.; Cho, K.; Wilson, A.G. Protein Design with Guided Discrete Diffusion. arXiv 2023, arXiv:2305.20009.
  107. Li, W.r.; Cadet, X.F.; Medina-Ortiz, D.; Davari, M.D.; Sowdhamini, R.; Damour, C.; Li, Y.; Miranville, A.; Cadet, F. From thermodynamics to protein design: Diffusion models for biomolecule generation towards autonomous protein engineering. arXiv 2025, arXiv:2501.02680.
  108. Zhao, L.; He, Q.; Song, H.; Zhou, T.; Luo, A.; Wen, Z.; Wang, T.; Lin, X. Protein A-like Peptide Design Based on Diffusion and ESM2 Models. Molecules 2024, 29, 4965.
  109. Nakata, S.; Mori, Y.; Tanaka, S. End-to-end protein–ligand complex structure generation with diffusion-based generative models. BMC Bioinform. 2023, 24, 233.
  110. Cao, D.; Chen, M.; Zhang, R.; Wang, Z.; Huang, M.; Yu, J.; Jiang, X.; Fan, Z.; Zhang, W.; Zhou, H.; et al. SurfDock is a surface-informed diffusion generative model for reliable and accurate protein–ligand complex prediction. Nat. Methods 2025, 22, 310–322.
  111. Yim, J.; Stärk, H.; Corso, G.; Jing, B.; Barzilay, R.; Jaakkola, T.S. Diffusion models in protein structure and docking. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2024, 14, e1711.
  112. Wu, K.E.; Yang, K.K.; Van Den Berg, R.; Alamdari, S.; Zou, J.Y.; Lu, A.X.; Amini, A.P. Protein structure generation via folding diffusion. Nat. Commun. 2024, 15, 1059.
  113. Anand, N.; Achim, T. Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models. arXiv 2022, arXiv:2205.15019.
  114. Jing, B.; Erives, E.; Pao-Huang, P.; Corso, G.; Berger, B.; Jaakkola, T. EigenFold: Generative Protein Structure Prediction with Diffusion Models. arXiv 2023, arXiv:2304.02198.
  115. Fan, J.; Li, Z.; Alcaide, E.; Ke, G.; Huang, H.; E, W. Accurate Conformation Sampling via Protein Structural Diffusion. J. Chem. Inf. Model. 2024, 64, 8414–8426.
  116. Liu, C.; Wang, J.; Cai, Z.; Wang, Y.; Kuang, H.; Cheng, K.; Zhang, L.; Su, Q.; Tang, Y.; Cao, F.; et al. Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures. arXiv 2024, arXiv:2408.12413.
  117. Liu, Y.; Yu, Z.; Lindsay, R.J.; Lin, G.; Chen, M.; Sahoo, A.; Hanson, S.M. ExEnDiff: An Experiment-guided Diffusion model for protein conformational Ensemble generation. bioRxiv 2024. bioRxiv 2024.10.04.616517.
  118. Lu, J.; Zhang, Z.; Zhong, B.; Shi, C.; Tang, J. Fusing Neural and Physical: Augment Protein Conformation Sampling with Tractable Simulations. arXiv 2024, arXiv:2402.10433.
  119. Zheng, S.; He, J.; Liu, C.; Shi, Y.; Lu, Z.; Feng, W.; Ju, F.; Wang, J.; Zhu, J.; Min, Y.; et al. Predicting equilibrium distributions for molecular systems with deep learning. Nat. Mach. Intell. 2024, 6, 558–567.
  120. Huang, X.; Pearce, R.; Zhang, Y. FASPR: An open-source tool for fast and accurate protein side-chain packing. Bioinformatics 2020, 36, 3758–3765.
  121. Argaman, N.; Makov, G. Density functional theory: An introduction. Am. J. Phys. 2000, 68, 69–79.
  122. Tsuboyama, K.; Dauparas, J.; Chen, J.; Laine, E.; Mohseni Behbahani, Y.; Weinstein, J.J.; Mangan, N.M.; Ovchinnikov, S.; Rocklin, G.J. Mega-scale experimental analysis of protein folding stability in biology and design. Nature 2023, 620, 434–444.
  123. Tang, Y.; Yu, M.; Bai, G.; Li, X.; Xu, Y.; Ma, B. Deep learning of protein energy landscape and conformational dynamics from experimental structures in PDB. bioRxiv 2024. bioRxiv 2024.06.27.600251.
  124. Watson, J.L.; Juergens, D.; Bennett, N.R.; Trippe, B.L.; Yim, J.; Eisenach, H.E.; Ahern, W.; Borst, A.J.; Ragotte, R.J.; Milles, L.F.; et al. De novo design of protein structure and function with RFdiffusion. Nature 2023, 620, 1089–1100.
  125. Maddipatla, A.; Sellam, N.B.; Bojan, M.; Vedula, S.; Schanda, P.; Marx, A.; Bronstein, A.M. Inverse problems with experiment-guided AlphaFold. arXiv 2025, arXiv:2502.09372.
  126. Hu, Y.; Cheng, K.; He, L.; Zhang, X.; Jiang, B.; Jiang, L.; Li, C.; Wang, G.; Yang, Y.; Liu, M. NMR-Based Methods for Protein Analysis. Anal. Chem. 2021, 93, 1866–1879.
  127. Zhu, J.; Li, Z.; Zhang, B.; Zheng, Z.; Zhong, B.; Bai, J.; Hong, X.; Wang, T.; Wei, T.; Yang, J.; et al. Precise Generation of Conformational Ensembles for Intrinsically Disordered Proteins via Fine-tuned Diffusion Models. bioRxiv 2024. bioRxiv 2024.05.05.592611.
  128. Piovesan, D.; Monzon, A.M.; Tosatto, S.C.E. Intrinsic protein disorder and conditional folding in AlphaFoldDB. Protein Sci. 2022, 31, e4466.
  129. Ruff, K.M.; Pappu, R.V. AlphaFold and Implications for Intrinsically Disordered Proteins. J. Mol. Biol. 2021, 433, 167208.
  130. Taneja, I.; Lasker, K. Machine-learning-based methods to generate conformational ensembles of disordered proteins. Biophys. J. 2024, 123, 101–113.
  131. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
  132. Janson, G.; Feig, M. Transferable deep generative modeling of intrinsically disordered protein conformations. PLoS Comput. Biol. 2024, 20, e1012144.
  133. Hoffman, M. Straightening Out the Protein Folding Puzzle: A new paper suggests an old model is wrong—and the subject is too hot for protein chemists to touch. Science 1991, 253, 1357–1358.
  134. Sillitoe, I.; Bordin, N.; Dawson, N.; Waman, V.P.; Ashford, P.; Scholes, H.M.; Pang, C.S.M.; Woodridge, L.; Rauer, C.; Sen, N.; et al. CATH: Increased structural coverage of functional space. Nucleic Acids Res. 2021, 49, D266–D273.
  135. Cheng, K.; Liu, C.; Su, Q.; Wang, J.; Zhang, L.; Tang, Y.; Yao, Y.; Zhu, S.; Qi, Y. AlphaFolding: 4D Diffusion for Dynamic Protein Structure Prediction with Reference and Motion Guidance. arXiv 2024, arXiv:2408.12419.
  136. Wu, F.; Li, S.Z. DiffMD: A Geometric Diffusion Model for Molecular Dynamics Simulations. Proc. AAAI Conf. Artif. Intell. 2023, 37, 5321–5329.
  137. Wang, B.; Wang, C.; Chen, J.; Liu, D.; Sun, C.; Zhang, J.; Zhang, K.; Li, H. Conditional Diffusion with Locality-Aware Modal Alignment for Generating Diverse Protein Conformational Ensembles. bioRxiv 2025. bioRxiv 2025.02.21.639488.
  138. Xu, M.; Yu, L.; Song, Y.; Shi, C.; Ermon, S.; Tang, J. GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation. arXiv 2022, arXiv:2203.02923.
  139. Jing, B.; Stärk, H.; Jaakkola, T.; Berger, B. Generative Modeling of Molecular Dynamics Trajectories. arXiv 2024, arXiv:2409.17808.
  140. Lu, J.; Chen, X.; Lu, S.Z.; Lozano, A.; Chenthamarakshan, V.; Das, P.; Tang, J. Aligning Protein Conformation Ensemble Generation with Physical Feedback. arXiv 2025, arXiv:2505.24203.
  141. Simon, E.; Zou, J. InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders. bioRxiv 2024. bioRxiv 2024.11.14.623630.
  142. Schwing, G.; Palese, L.L.; Fernández, A.; Schwiebert, L.; Gatti, D.L. Molecular dynamics without molecules: Searching the conformational space of proteins with generative neural networks. arXiv 2022, arXiv:2206.04683.
  143. Lu, J.; Chen, X.; Lu, S.Z.; Shi, C.; Guo, H.; Bengio, Y.; Tang, J. Structure Language Models for Protein Conformation Generation. arXiv 2025, arXiv:2410.18403.
  144. Liu, Y.; Chen, L.; Liu, H. Diffusion in a quantized vector space generates non-idealized protein structures and predicts conformational distributions. bioRxiv 2023. bioRxiv 2023.11.18.567666.
  145. Schober, P.; Boer, C.; Schwarte, L.A. Correlation Coefficients: Appropriate Use and Interpretation. Anesth. Analg. 2018, 126, 1763–1768.
Figure 1. Summary of notable literature showing the interplay between Multiple Sequence Alignment, Sequence, and Structure data for protein conformation and protein trajectory sampling.
Figure 2. Schematic of the deep generative models discussed in this section.
Table 1. Compressed summary of variational autoencoders for protein conformation sampling. Columns: atomistic granularity (Gran.), Multiple Sequence Alignment conditioning (MSA), equivariance (Equiv.), and evaluation metrics (Metric). Dataset column is placed last.
Paper | Input | Output | Gran. | MSA | Equiv. | Metric | Dataset
Bandyopadhyay et al. [91] | Pairwise Cα distances | Latent space repr. | Residue-level (Cα–Cα) representation | No | No | VAMP-2 score, Chapman–Kolmogorov (CK) test | 200 MD traj (30–100 ns)
Gupta et al. [77] | Heavy-atom Cartesian coords | Reconstruct coords | Heavy-atom level | No | No | RMSD, Experimental data | Q15, Aβ40, ChiZ (1–3.5 μs)
Xiao et al. [72] | Normalize heavy-atom coords | Reconstruct + 3D confs | Heavy-atom level | No | No | Correlation coefficient, RMSD | Calmodulin, β-lactamase, Ubiquitin MD
Degiacomi et al. [75] | Heavy-atom Cartesian coords | Reconstruct coords | Backbone, Cβ atoms | No | No | RMSD | MurD open/closed MD
Ward et al. [90] | Heavy-atom Cartesian coords | Reconstruct coords | Heavy-atom level | No | No | RMSD, ROC–AUC, Correlation | TEM β-lactamase, HIV-1 capsomer MD
Tian et al. [51] | Normalize heavy-atom coords | 2D latent embeddings | Heavy-atom level | No | No | RMSD, Sampling efficiency | 100–200 ps MD (NVT/NPT)
Jin et al. [76] | Selected atom coords (snapshots) | New conformation coords | Heavy-atom level | No | No | RMSD | L-Ala, Calmodulin (20 ns MD)
Mansoor et al. [67] | 2D RoseTTAFold templates | 2D → 3D structures | Backbone, Cβ atoms | Yes | No | RMSD, CCE | K Ras (10 ns MD)
Ruzmetov et al. [73] | BAT vector representation | All-atom conformation | All-atom | No | Yes | RMSD, Rg | Amyloid-β (1 μs MD)
Table 2. Comparative analysis of flow-based protein conformation generative models. Columns include atomistic granularity (Gran.), MSA conditioning (MSA), equivariance (Equiv.), evaluation metrics, and the dataset used.
Paper | Input | Output | Gran. | MSA | Equiv. | Metric | Dataset
Mahmoud et al. [94] | Coarse-grained protein structures | Full atom protein conformations | All-atom | No | Yes | Dihedral distributions, TICA | MD of Bromodomain and Chignolin
Jing et al. [56] | Protein sequence | Protein backbone conformation | Backbone | Yes | No | lDDT-Cα | PDB [95], ATLAS [96]
Li et al. [97] | Protein sequence + MSA | Protein backbone conformation | All-atom | No | No | RMSD, RMSF, DCCM correlation | ATLAS [96]
Wolf et al. [98] | Equilibrium backbone structure | Protein backbone conformation | Backbone | No | Yes | RMSD, RMSF, PCA W2 | ATLAS [96]
Jin et al. [55] | Protein sequence + Approx. energy | Protein backbone conformation | Backbone | No | Yes | RMSF, RMWD, Rg, PWD | ATLAS [96]
Table 3. A comparison of diffusion models for protein conformation sampling. All listed models enforce equivariance in the generative model, while most do not use evolutionary MSA data.
Paper | Input | Output | Gran. | MSA | Equiv. | Metric | Dataset
Fan et al. [115] | Protein sequence and MSA | 3D protein conformation | All-atom | Yes | Yes | MAT-R, MAT-P, TM-score | All PDB structures before 30 April 2022
Liu et al. [116] | Protein sequence | MD trajectory | All-atom | No | Yes | RMSD, RMSF, MAE | Dynamic PDB
Liu et al. [117] | Protein sequence and Experimental data | Protein conformational ensemble | Backbone | No | Yes | Rg, SASA | Ground truth generated via MD
Lu et al. [118] | Protein sequence | Protein backbone conformations | Backbone | No | Yes | JS-PwD, JS-TIC, JS-Rg | PDB
Wang et al. [59] | Protein sequence | Backbone conformation ensemble | Backbone | No | Yes | RMSD, RMSF, Val-Cα | PDB
Lu et al. [58] | A starting structure of a target Protein sequence | All-atom protein conformations | All-atom | No | Yes | Val-Clash, RMSD, TICA, Rg | PDB
Zheng et al. [119] | CG protein representation | Protein conformation | All-atom | No | Yes | RMSD, TICA | MD data and force field
Lewis et al. [57] | Protein Sequence | Protein backbone conformations | Backbone | Yes | Yes | RMSD, Free-energy agreement | PDB + MD + NMR data
Table 4. Comprehensive evaluation metrics for assessing protein conformation generation models across five key dimensions.
Metric Category | Score Name | Definition and References
Validity | Val-Clash | Validation metric for bond lengths/angles [118]
Validity | Steric-clash | Fraction of structures free from Cα clashes [58]
Validity | Ramachandran Plot Score | Measures fraction of (ϕ, ψ) backbone angles within allowed regions [55,116]
Validity | Contact-map Frequency | Frequency of residue pairs maintained within contact threshold across conformations [116]
Validity | Sanity-check Pass Rate | Fraction of conformations passing comprehensive validation (steric, bonding, and dihedral angle checks) [55]
Precision & Accuracy | TM-score | Template-modeling score (0–1 scale) quantifying global fold similarity and structural alignment quality [115,119]
Precision & Accuracy | RMSD | Root-mean-square deviation measuring atomic position accuracy relative to reference structures [135,136]
Precision & Accuracy | Pearson Correlation | Correlation coefficient between predicted and reference per-residue measurements [56,137]
Diversity | MAT-P/MAT-R | Matching precision/recall: average RMSD to nearest reference/generated structure [115]
Diversity | COV-P/COV-R | Coverage precision/recall: percentage of structures covered within distance threshold [138]
Diversity | lDDT-Cα | Average local structural dissimilarity between pairs of sampled conformations [56]
Distributional Similarity | JS Divergence (PwD, Rg, TIC) | Jensen-Shannon divergence measuring similarity between generated and MD reference distributions [97,137]
Distributional Similarity | JS-PwD | Fidelity score derived from Jensen-Shannon divergence of pairwise distance distributions [137]
Distributional Similarity | Sampler Score | Composite metric combining global (JS divergence) and local (RMSF, secondary structure) similarity [117]
Distributional Similarity | RMWD | Root-mean Wasserstein distance across atom-wise positional probability distributions [56]
Distributional Similarity | Coverage/k-recall | Fraction of MD reference structures covered and average distance to k-nearest generated samples [59]
Structural Dynamics | RMSF | Root-mean-square fluctuation quantifying per-residue mobility across conformational ensembles [97,98,116]
Structural Dynamics | Radius of Gyration (Rg) | Time-dependent measure of overall protein compactness and conformational changes [116]
Structural Dynamics | Weak/Transient Contacts | Frequency of non-covalent contact formation relative to reference structure [55]
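Several of the metrics in Table 4 reduce to short array computations. The following is a minimal NumPy sketch of RMSD, RMSF, Rg, a histogram-based Jensen-Shannon divergence, and MAT-P/MAT-R, written only to illustrate the definitions in the table. It is not the evaluation code of any cited work. All function names are hypothetical, and the sketch assumes that conformations are already superimposed on a common frame (no alignment step is performed), that all atoms have equal mass, and that a fixed-bin histogram is an acceptable density estimate for the JS divergence.

```python
import numpy as np


def rmsd(a, b):
    """RMSD between two (N, 3) coordinate arrays (assumed pre-aligned)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))


def rmsf(ensemble):
    """Per-atom RMSF over an (M, N, 3) ensemble (assumed pre-aligned)."""
    ensemble = np.asarray(ensemble, float)
    mean_pos = ensemble.mean(axis=0)                     # (N, 3) average structure
    sq_dev = np.sum((ensemble - mean_pos) ** 2, axis=2)  # (M, N) squared deviations
    return np.sqrt(sq_dev.mean(axis=0))                  # (N,) one value per atom


def radius_of_gyration(coords):
    """Rg of one (N, 3) conformation, assuming equal atomic masses."""
    coords = np.asarray(coords, float)
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum(centered ** 2, axis=1))))


def js_divergence(x, y, bins=50):
    """Base-2 JS divergence between histograms of two 1D samples (0 to 1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(u, v):
        mask = u > 0  # terms with u == 0 contribute nothing
        return np.sum(u[mask] * np.log2(u[mask] / v[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


def matching_scores(generated, reference):
    """Return (MAT-P, MAT-R) for two lists of (N, 3) conformations."""
    d = np.array([[rmsd(g, r) for r in reference] for g in generated])
    mat_p = float(d.min(axis=1).mean())  # each generated -> nearest reference
    mat_r = float(d.min(axis=0).mean())  # each reference -> nearest generated
    return mat_p, mat_r
```

For example, a JS-Rg style score could be obtained under these assumptions by applying `radius_of_gyration` to every conformation in the generated and reference ensembles and passing the two resulting arrays to `js_divergence`. In practice, structures would first be superimposed (for instance with a Kabsch alignment) before `rmsd`, `rmsf`, or `matching_scores` are meaningful.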
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Dao, T.M.; Rahman, T. Deep Generative Modeling of Protein Conformations: A Comprehensive Review. BioChem 2025, 5, 32. https://doi.org/10.3390/biochem5030032
