Generative Protein Design: From Deep Learning Algorithms to Translational Applications

Luo, Shaotong; Zhou, Bo

doi:10.3390/ijms27093917

Open Access Review


by Shaotong Luo ¹ and Bo Zhou ²,*

¹ College of Aulin, Northeast Forestry University, Harbin 150040, China
² College of Life Science, Northeast Forestry University, Harbin 150040, China
* Author to whom correspondence should be addressed.

Int. J. Mol. Sci. 2026, 27(9), 3917; https://doi.org/10.3390/ijms27093917

Submission received: 23 March 2026 / Revised: 18 April 2026 / Accepted: 24 April 2026 / Published: 28 April 2026

(This article belongs to the Section Molecular Biology)


Abstract

Deep learning has transformed protein design from a field long dominated by explicit energy-function optimization into one dominated by probabilistic generative modeling. In this review, we summarize the representational and algorithmic basis for this transition, from sequence-centered encodings to geometric graph representations and, more recently, SE(3)-equivariant structural manifolds that directly respect three-dimensional symmetry. We classify current approaches into three methodological paradigms according to how sequence and structure are related during design: sequence–structure decoupled design, hybrid approaches, and sequence–structure co-design. For decoupled workflows, we discuss hallucination, backbone generation, and backbone-conditioned sequence design. For hybrid approaches, we examine integrated two-stage architectures and predictor-driven iterative co-refinement. For co-design, we review explicit joint generative formulations in which sequence and structure are treated as a coupled design state throughout generation. Additionally, we summarize evaluation principles for assessing design results, such as physical validity, folding consistency, and design coverage, and then introduce important applications in several fields. Taken together, these developments indicate that generative protein design is progressing from structure generation toward the programmable engineering of complex biological function.

Keywords: deep learning; protein design; SE(3)-equivariant; decoupled design; hybrid approaches; co-design

1. Introduction

Protein design aims to discover amino acid sequences that not only adopt a desired three-dimensional fold, but also satisfy predefined functional requirements. This goal requires grasping a variety of complex rules, including thermodynamics, geometry and chemistry.

For many years, the discipline addressed this challenge mainly through a physics-centered framework. Early de novo studies frequently employed simplified model systems to better elucidate the physical principles governing protein folding stability, including hydrophobic burial, secondary-structure propensity, and molecular packing regularity [1,2]. These principles were gradually refined into empirical scoring functions, thereby rendering computational protein design tractable. Within this framework, the design problem was formulated as an optimization task over residue identities and rotamer configurations, generally relying on a fixed or weakly flexible protein backbone [3,4]. Rosetta achieved a breakthrough in this era. By integrating fragment-based conformational sampling with Monte Carlo simulated annealing [5], Rosetta demonstrated that effective exploration of rugged energy landscapes was achievable, enabling the generation of novel topologies not previously observed in nature [6]. Nevertheless, it remained constrained by the inherent limitations of explicit energy functions and discrete optimization [7].

The situation changed with the rise of deep learning and the rapid expansion of protein sequence and structure data. Models such as AlphaFold2 and RoseTTAFold showed that neural networks could infer structural organization by extracting high-order evolutionary and geometric patterns directly from data, rather than relying exclusively on manually assembled physical terms [8,9].

The success of these predictive models suggested that protein design could be reformulated as a probabilistic learning problem, in which models learn the statistical regularities of natural proteins and use them to generate new candidates under specified constraints. Several recent reviews have already provided broad overviews of this rapidly evolving field [10,11,12,13]. Rather than revisiting that broader landscape, this work focuses on a more specific methodological question: how do current generative methods organize the relationship between sequence and structure during design? Viewed through this lens, existing approaches can be grouped into three main paradigms. In the first, sequence and structure are handled in a staged, factorized manner, with backbone generation and sequence design treated as sequential subproblems. In the second, models learn a fully joint generative distribution and treat sequence and structure as a coupled design state throughout the generation process. Between these two sits a third, hybrid paradigm. Hybrid methods neither learn a fully joint generative distribution from the outset nor keep sequence and structure completely separated. Instead, they coordinate the two during inference through shared intermediate representations, iterative refinement, or predictor-guided optimization.

This three-part framework provides a clearer view of the current methodological landscape than a simple binary division and serves as the main organizational axis of this review. Furthermore, we emphasize breakthrough developments over the past two years, highlighting the representative architectures that define the current frontier of the field.

2. Generative Protein Design Algorithms

2.1. Protein Representation and Geometric Deep Learning

Before comparing these design paradigms, it is useful to establish the representational and mathematical machinery on which they depend. Regardless of the downstream objective, AI-driven protein design requires translating a complex biological macromolecule into a form that a neural network can process without losing the structural regularities that matter most. Progress in the field has therefore been tightly linked to progress in representation learning. Over time, protein representations have evolved from one-dimensional sequence embeddings to three-dimensional topological graphs and then to explicitly geometric formalisms designed to respect the symmetries of Euclidean space. This progression reflects a growing recognition that proteins are sparse, structured objects whose geometry cannot be treated as generic data.

2.1.1. 1D Sequence Semantics

Sequence is the most immediate representation because it is the linear carrier of evolutionary and biochemical information. In principle, the residue string specifies the energetic tendencies that shape the folding landscape and constrain which structures can be realized in a given environment. Protein language models adopt the self-supervised training strategy of natural language processing [14]. By masking residues in large evolutionary corpora and asking the model to recover them, these systems embed discrete amino acids into dense continuous representations. Their practical value comes from the fact that they capture nonlocal dependencies along the chain, including co-evolutionary signals between distant positions [15,16]. Even in the absence of explicit geometric coordinates, attention maps in deeper network layers still contain structural information that can be used to infer residue–residue contact maps. This demonstrates that pure sequence models can distill structural principles directly from evolutionary data [17]. Sequence semantics can therefore contribute directly to generative tasks and can partly compensate for the lack of explicit physical constraints, especially when homologous templates are unavailable [18,19,20].
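The masking step behind this self-supervised objective can be sketched in a few lines. This is a minimal illustration rather than the procedure of any specific protein language model: the mask symbol `#`, the 15% masking rate, and the function name `mask_sequence` are assumptions made for the example.

```python
import random

MASK = "#"  # placeholder mask symbol (an assumption of this sketch)


def mask_sequence(seq, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues; return the masked string and the hidden positions."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_fraction * len(seq)))
    # Sample distinct positions so every masked site has exactly one target residue.
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    chars = list(seq)
    for i in positions:
        chars[i] = MASK
    return "".join(chars), positions


sequence = "MKTAYIAKQRQISFVKSHFSRQ"
masked, positions = mask_sequence(sequence)
# The training targets are the original residues at the masked positions.
targets = [sequence[i] for i in positions]
```

A model trained on such pairs receives `masked` as input and is scored only on how well it recovers `targets`.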

2.1.2. 3D Topological Graphs

When the design task requires generating a 3D conformation, the representation must describe Euclidean space explicitly. An early solution, inspired by computer vision, discretized local protein environments into regular volumetric grids and then applied 3D convolutional networks to those voxels [21,22]. However, this method has clear shortcomings. First, the grid is dense whereas the physical structure of a protein is sparse, so much of the computation is redundant. Second, the model lacks rotational invariance and is sensitive to input orientation, which makes data augmentation necessary (Figure 1a). Consequently, graph-based representations have become more common. They abstract proteins into topological graphs in which atoms are nodes and spatial proximity defines edges [23,24,25,26,27,28]. This representation matches the sparsity of biomolecules and supports efficient message passing, which helps capture long-range geometric interactions (Figure 1b).
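As a rough illustration of the graph abstraction, the sketch below builds a k-nearest-neighbour graph from coordinates. It assumes one node per residue placed at its Cα atom (a common coarse-graining, chosen here for brevity) and a toy straight chain; the function name `knn_graph` and the value of `k` are illustrative only.

```python
import numpy as np


def knn_graph(coords, k=2):
    """Return directed edges (2, N*k) from each node to its k nearest neighbours, plus edge lengths."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude self-edges
    neighbours = np.argsort(dist, axis=1)[:, :k]
    src = np.repeat(np.arange(len(coords)), k)
    dst = neighbours.reshape(-1)
    return np.stack([src, dst]), dist[src, dst]


# Toy input: five C-alpha atoms on a straight line, 3.8 angstroms apart.
ca_coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
edges, lengths = knn_graph(ca_coords, k=2)
```

Only `N*k` edges are stored, in contrast to a voxel grid whose size grows with the volume of the bounding box regardless of how many atoms it contains.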

2.1.3. SE(3) Geometric Equivariance

Graphs solve the sparsity problem, but scalar graph features alone do not adequately encode directionality. For tasks that demand high spatial fidelity, such as chain packing and atomic-resolution generation, models must do more than remain invariant to rotations and translations. They must also be equivariant: when the input coordinate frame changes, the output should transform in a predictable covariant way rather than remaining unchanged or becoming inconsistent. This requirement has driven the development of SE(3)-equivariant architectures [29,30].
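The distinction between invariance and equivariance can be checked numerically. In the sketch below, pairwise distances serve as an invariant feature and centroid-relative offsets as an equivariant one; both feature choices, and the helper names, are assumptions made for illustration rather than components of any cited architecture.

```python
import numpy as np


def random_rotation(seed=0):
    """Draw a proper rotation matrix (determinant +1) from a QR decomposition."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs so the factorisation is unique
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis to turn a reflection into a rotation
    return q


def pairwise_distances(x):
    """Scalar feature: unchanged by any rotation or translation of x."""
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)


def centroid_offsets(x):
    """Vector feature: rotates together with x and ignores translations."""
    return x - x.mean(axis=0, keepdims=True)


x = np.random.default_rng(1).normal(size=(6, 3))
R, t = random_rotation(), np.array([1.0, -2.0, 0.5])
x_moved = x @ R.T + t  # apply the rigid motion to every point

invariant_ok = np.allclose(pairwise_distances(x_moved), pairwise_distances(x))
equivariant_ok = np.allclose(centroid_offsets(x_moved), centroid_offsets(x) @ R.T)
```

The invariant feature discards orientation entirely, whereas the equivariant feature keeps it and transforms with the same rotation as the input.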

Two broad design philosophies are common. One is rooted in group representation theory. In that line of work, geometric features are expressed as irreducible representations of SO(3), decomposed into tensors of different angular degrees, and coupled through spherical harmonics, tensor products, and Clebsch–Gordan coefficients [31,32]. These constructions are mathematically elegant and highly expressive, especially for capturing fine angular detail, but they can be computationally demanding [33,34]. The other philosophy pursues a more pragmatic balance between rigor and efficiency. Rather than carrying out full tensor algebra at every stage, these models separate scalar and vector channels or use simplified geometric interactions that preserve essential equivariance while remaining more computationally scalable [24,35].

2.2. Sequence–Structure Decoupled Design

Early generative protein-design methods were largely built on a decoupled strategy, in which sequence and structure are handled as ordered subproblems rather than generated jointly. In practice, this framework developed along two main routes. One uses a structure predictor to guide sequence optimization directly, with the predictor serving as an external scoring function. The other follows a structure-first logic, in which a backbone is generated first and a compatible sequence is assigned afterward. Although later developments have made these pipelines more integrated, the core idea remains the same: sequence and structure are not generated as a single coupled state, but are instead linked through a staged design process.

2.2.1. Predictor-Driven Hallucination

One prominent decoupled strategy is commonly referred to as hallucination. Its basic idea is to use a strong structure predictor as the scoring function that guides sequence search, effectively replacing classical biophysical energy terms [36,37,38,39,40,41]. In practice, optimization is performed directly in sequence space by gradient descent or by stochastic procedures such as Markov Chain Monte Carlo (MCMC), with the objective of maximizing predictor-derived confidence signals. This can yield sequences predicted to adopt novel folds. However, the method has an important weakness: optimization may exploit imperfections in the predictor itself, producing adversarial sequences that score well computationally but do not reliably fold under real experimental conditions.

2.2.2. Backbone Coordinate Generation

To gain more explicit geometric control and reduce the risk of adversarial optimization, decoupled design also developed along a structure-first route. As summarized in Figure 2a, this strategy mathematically factorizes the high-dimensional joint generation problem into the product of a structural prior distribution p(Structure) and a conditional sequence distribution p(Sequence | Structure) [42,43]. The first stage in this decomposition is backbone generation. The central challenge lies in parameterizing the continuous distribution of macromolecules in 3D Euclidean space while rigorously accounting for inherent roto-translational invariances and residue rigid-body constraints.

At the implementation level, three design choices are especially important. First, because protein backbone geometry lives on SE(3) rather than in a flat Euclidean space, the noising process is usually defined separately over translational and rotational degrees of freedom: positions are perturbed in Euclidean space, whereas orientations are diffused on SO(3) using geometry-aware stochastic processes [43,44,45]. This separation is necessary because rotational variables cannot be treated as ordinary vectors without violating the underlying geometry. Second, this noising scheme is closely related to coordinate parameterization. Instead of operating directly on all-atom coordinates, many backbone generators represent each residue as a local rigid frame constructed from the three non-collinear backbone atoms N, Cα, and C. This representation preserves local backbone geometry, reduces unnecessary degrees of freedom, and remains naturally compatible with SE(3)-equivariant learning. Third, the training objective is not only to recover a clean structure from a noisy input, but also to maintain global frame consistency during generation. For this reason, some influential implementations replace the frame-aligned loss commonly used in structure prediction, such as FAPE, with unaligned mean squared error (MSE), which is more suitable when generation does not assume a unique global reference frame.
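The residue-frame parameterization can be illustrated with a Gram–Schmidt construction from the N, Cα, and C positions. The axis convention used here (first axis along Cα→C, origin at Cα) is one common choice and is an assumption of this sketch; individual backbone generators may define their frames differently.

```python
import numpy as np


def backbone_frame(n, ca, c):
    """Build a right-handed orthonormal frame (R, t) for one residue from N, CA, C."""
    n, ca, c = (np.asarray(v, dtype=float) for v in (n, ca, c))
    v1 = c - ca
    v2 = n - ca
    e1 = v1 / np.linalg.norm(v1)
    u2 = v2 - np.dot(v2, e1) * e1  # remove the component of v2 along e1
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)  # completes a right-handed basis
    rotation = np.stack([e1, e2, e3], axis=1)  # basis vectors as columns
    return rotation, ca


# Toy coordinates (hypothetical, not taken from a real structure).
R, t = backbone_frame(n=[-0.5, 1.4, 0.0], ca=[0.0, 0.0, 0.0], c=[1.5, 0.0, 0.0])
# Any global point p is expressed in the residue's local frame as R.T @ (p - t).
local_c = R.T @ (np.array([1.5, 0.0, 0.0]) - t)
```

Each residue then contributes one rotation and one translation, which is the per-residue SE(3) element that the translational and rotational noising processes act on.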

A particularly influential design pattern in diffusion-based backbone generation is to initialize the denoising network from a pre-trained structure predictor that already contains strong geometric inductive biases and spatial reasoning capacity, and then fine-tune that model as a conditional generative module [43].

Beyond conventional stochastic diffusion, flow matching has emerged as an alternative formulation for structural generation. Instead of learning a reverse-time denoising process, it learns a deterministic ordinary differential equation (ODE) vector field that transports probability mass along smooth trajectories. This deterministic transport can improve numerical stability in sampling and appears particularly advantageous for scaling generation to very large proteins and multicomponent assemblies [46].
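The deterministic transport idea can be sketched for the translational part alone. The example below assumes a straight-line interpolation path between noise and data, for which the regression target of the vector field is simply the displacement, and it substitutes an oracle field for a trained network. Rotational degrees of freedom are ignored, and all function names are illustrative.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight path; this is what a network would be trained to regress."""
    return x1 - x0


def euler_sample(vector_field, x0, n_steps=50):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        x = x + dt * vector_field(x, step * dt)
    return x


rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 3))  # noise sample
x1 = np.array([[3.8 * i, 0.0, 0.0] for i in range(4)])  # toy "data" coordinates


def oracle_field(x, t):
    """Stand-in for a trained network: returns the exact target for this (x0, x1) pair."""
    return target_velocity(x0, x1)


x_mid = interpolate(x0, x1, 0.5)
x_final = euler_sample(oracle_field, x0)
```

Because sampling is an ODE solve with no injected noise, the same starting point always maps to the same output, and the step count trades accuracy against cost.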

2.2.3. Backbone-Conditional Sequence Design

After a plausible backbone has been obtained, the next task is to assign a compatible sequence, in other words, to solve inverse folding. This mechanism aims to accurately model the conditional probability p(Sequence | Structure) [47]. Early neural inverse-folding methods framed this as a structure-constrained assignment problem on protein graphs. This shift allowed the field to move from hand-crafted energy functions to residue preferences learned directly from 3D geometry [23,25]. Later models improved structural encoding using geometry-aware message passing. These advancements led to highly effective inverse-folding architectures that estimate sequence likelihood directly from backbone coordinates, performing well in both monomeric and multichain design [26].

A major methodological improvement in this area is the shift away from rigid residue-generation orders. Instead of decoding strictly from the N-terminus to the C-terminus, many recent models use order-agnostic autoregression or masked reconstruction. These flexible decoding schemes are valuable because they capture long-range epistatic interactions more effectively. They also simplify practical design tasks, including multichain sequence design, partial sequence inpainting, and symmetry-constrained assignment [26,27,48,49].
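Order-agnostic decoding can be sketched as sampling positions in a random permutation while conditioning on whatever has already been placed. The `toy_logits` function below is a hypothetical stand-in for a backbone-conditioned network, and the whole example is an illustration under that assumption rather than the decoder of any cited model.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_logits(partial, position):
    """Hypothetical scorer: penalise copying an already-decoded neighbouring residue."""
    logits = [0.0] * len(AMINO_ACIDS)
    for nb in (position - 1, position + 1):
        if 0 <= nb < len(partial) and partial[nb] is not None:
            logits[AMINO_ACIDS.index(partial[nb])] -= 5.0
    return logits


def decode_any_order(length, logits_fn, fixed=None, seed=0):
    """Fill free positions in a random order; `fixed` maps position -> residue to keep."""
    rng = random.Random(seed)
    seq = [None] * length
    for pos, aa in (fixed or {}).items():
        seq[pos] = aa
    order = [i for i in range(length) if seq[i] is None]
    rng.shuffle(order)  # a fresh random decoding order for this sample
    for pos in order:
        logits = logits_fn(seq, pos)
        top = max(logits)
        weights = [math.exp(x - top) for x in logits]  # unnormalised softmax
        seq[pos] = rng.choices(AMINO_ACIDS, weights=weights, k=1)[0]
    return "".join(seq), order


fixed = {0: "M", 5: "G"}
designed, order = decode_any_order(12, toy_logits, fixed=fixed)
```

Keeping some positions fixed and decoding the rest is what makes partial sequence inpainting straightforward in this scheme.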

Another important development has been the use of predicted structures to augment training data. Because experimental structures are scarce, high-confidence predicted structures can enlarge the effective training set and expose inverse folding models to a broader range of geometries. This augmentation has been especially helpful for improving zero-shot generalization to novel or de novo topologies [28].

Parallel to direct sequence probability prediction, an alternative technical route involves learning energy-based models (EBMs) over the sequence space. Rather than sampling residues directly from a decoder, these methods define an energy landscape over candidate sequences, with the landscape implicitly capturing both pairwise and higher-order interactions [50]. The decoupling of scoring and generation allows the vast sequence space to be explored through heuristic search algorithms or Markov chain Monte Carlo (MCMC) sampling. This formulation is especially attractive when the design target places unusual emphasis on fine-grained thermodynamic control or binding affinity [51].
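The separation of scoring from generation can be illustrated with a Metropolis sampler over sequences. The energy function below is a deliberately simple hypothetical (it rewards hydrophobic residues at buried positions and polar residues at exposed ones) and stands in for a learned energy-based model; the hydrophobic set, temperature, and step count are assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")  # illustrative grouping, an assumption of this sketch


def toy_energy(seq, buried):
    """Hypothetical energy: -1 for every position whose residue type matches its burial state."""
    return -sum((aa in HYDROPHOBIC) == is_buried for aa, is_buried in zip(seq, buried))


def metropolis(seq, energy_fn, n_steps=2000, temperature=0.5, seed=0):
    """Single-site Metropolis sampling; returns the lowest-energy sequence visited."""
    rng = random.Random(seed)
    current = list(seq)
    e_cur = energy_fn(current)
    best, e_best = current[:], e_cur
    for _ in range(n_steps):
        proposal = current[:]
        proposal[rng.randrange(len(proposal))] = rng.choice(AMINO_ACIDS)
        e_new = energy_fn(proposal)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) / temperature):
            current, e_cur = proposal, e_new
            if e_cur < e_best:
                best, e_best = current[:], e_cur
    return "".join(best), e_best


buried = [i % 2 == 0 for i in range(10)]
start = "GGGGGGGGGG"
e_start = toy_energy(start, buried)
best, e_best = metropolis(start, lambda s: toy_energy(s, buried))
```

Swapping `toy_energy` for a different scoring function changes the design objective without touching the sampler, which is the practical appeal of this formulation.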

2.3. Hybrid Approaches

Hybrid approaches occupy the methodological middle ground between staged decoupled design and fully joint co-design. Instead of keeping sequence and structure completely isolated or modeling them as a unified state from the start, these methods introduce partial coupling through shared internal representations or iterative inference.

2.3.1. Integrated Two-Stage Design

Classic decoupled pipelines treat backbone generation and sequence assignment as separate modules. This modularity is convenient, but it introduces a distributional mismatch. The backbones produced by a generative model may lie outside the training distribution of the inverse folding model that follows, making sequence assignment less reliable.

To address this mismatch, recent studies have introduced integrated two-stage design frameworks [42,52]. In these systems, backbone generation and sequence assignment remain distinct stages, but the boundary between them is softened by shared internal representations of spatial and sequence information. Methodologically, the key change is that sequence design no longer starts from a completed backbone alone, but from an intermediate structural state that already carries sequence-relevant information. This reduces interface mismatch while preserving the controllability of scaffold-first design. Recent models illustrate this shift through distinct architectural choices. While Chroma follows a backbone-first route in practice, it integrates backbone generation with a dedicated design network that conditions sequence and side-chain generation directly on the sampled backbones, thereby tightening the connection between the two stages [38]. ODesign similarly preserves an explicit two-stage strategy, yet strengthens coordination through shared multimodal representations and conditional interaction modeling across its generative modules [44]. These systems therefore remain operationally staged, but are no longer fully decoupled in the classical sense.

2.3.2. Predictor-Driven Iterative Co-Refinement

A second hybrid strategy also avoids learning a full joint generative model. Instead, it treats a strong structure predictor as an implicit likelihood model and performs design by searching the induced landscape [8,53,54,55,56]. The predictor acts as an oracle for structural plausibility and constraint satisfaction, while the design algorithm updates the sequence in response to that oracle. Here, the coupling is introduced not through shared staged representations, but through repeated feedback during inference.

One route performs gradient-based optimization by backpropagating through the predictor to update sequence representations directly. Methods in this family differ mainly in how they relax the discreteness of amino acids. Relaxed Sequence Optimization (RSO), for example, keeps the sequence entirely in a continuous distributional form during optimization. The model therefore moves smoothly across a relaxed landscape until it converges on a favorable target backbone, after which an inverse-folding model is used to recover a physically discrete sequence for that optimized structure [57]. In contrast, BindCraft follows a different strategy. Instead of separating backbone optimization from sequence realization, it uses a staged optimization, which begins in continuous logit space, then gradually anneals toward one-hot sequence assignments [58]. With straight-through estimators and a final stage of explicit mutational sampling, BindCraft can extract discrete sequences directly from AlphaFold gradients. In both cases, complex multi-objective functions such as structural confidence (pLDDT), geometric consistency, and interface contact scores are rendered differentiable.
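The move from a relaxed sequence toward discrete assignments can be pictured as a temperature-annealed softmax over per-position logits. The sketch below shows only that sharpening step; it omits the predictor, the gradients, and the straight-through estimator, and the temperature values are assumptions rather than the schedule of any cited method.

```python
import numpy as np


def softmax(logits, temperature):
    """Per-position residue distribution; lower temperature gives sharper distributions."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row maximum for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)


def anneal(logits, temperatures=(1.0, 0.5, 0.1, 0.01)):
    """Relaxed sequence at each stage of a decreasing temperature schedule."""
    return [softmax(logits, temp) for temp in temperatures]


rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 20))  # 5 positions x 20 amino acids
stages = anneal(logits)
one_hot = np.eye(20)[logits.argmax(axis=-1)]  # the discrete limit of the schedule
```

At high temperature every position is a soft mixture of residues, which keeps the objective smooth; as the temperature falls, each row concentrates on the residue selected in `one_hot`.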

Another methodology avoids backpropagation altogether and instead performs iterative closed-loop design using only forward passes. In each iteration, the pipeline alternates between structure prediction and sequence design. The structure predictor utilizes its inherent “hallucination” capabilities to generate or refine a structure, which is then passed to an inverse folding model to redesign a highly compatible sequence. Recent frameworks like HalluDesign and Protein Hunter exemplify this approach. By alternating forward passes of structure refinement and inverse folding, they simultaneously enhance foldability and designability without relying on gradient descent [59,60]. While this block-coordinate style is easy to implement and often effective in practice, its fundamental limitation is that sequence–structure compatibility relies on repeated external predictor calls rather than being natively encoded within a joint generative prior.

2.4. Sequence–Structure Co-Design

In real proteins, sequence and structure are intrinsically coupled at every scale, as even minor sequence perturbations can reshape conformational preferences, while backbone geometry simultaneously restricts which residue patterns remain viable. Therefore, co-design aims to model sequence and structure jointly from the outset, as summarized in Figure 2b.

Explicit joint generative models aim to learn the joint distribution p(Structure, Sequence) so that covariation between sequence and geometry is represented natively rather than imposed only during downstream refinement. The central challenge is that the co-design model must accommodate discrete amino-acid identities, continuous geometry, and side chains with variable topology while preserving stereochemical realism and SE(3) equivariance. Different architectures implement this idea in different spaces.

In explicit all-atom approaches, the model operates directly in coordinate space, using atomic positions and residue rigid frames as the joint state to update sequence-related or structure-related variables along the shared generative trajectory [61,62,63,64]. To remain independent of the chosen coordinate frame, these models typically maintain SE(3) equivariance throughout. Their main technical obstacle is the variable dimensionality of amino-acid side chains. Because residues do not all possess the same number and arrangement of atoms, variable-topology chemistry must be embedded into a common computational representation if synchronous generation is to remain tractable. The primary strength of this approach lies in its structural fidelity: it gives direct access to fine atomic interactions, including packing geometry and hydrogen-bond organization, and because the model remains entirely within atomic space, local packing, side-chain placement, and interface chemistry can be optimized concurrently with backbone geometry. However, this high resolution comes at the cost of computational complexity. Explicitly coupling sequence identity with all-atom coordinates substantially complicates training and sampling, particularly for targets involving ligands, modified residues, or multicomponent assemblies.

Semi-latent approaches keep backbone geometry explicit but compress residue identity and side-chain state into a fixed-dimensional continuous latent variable, establishing a unified generation process over the joint space [65,66]. By translating discrete sequence-related information into a continuous representation, this method allows noise injection and reverse updates to be performed stably in Euclidean space, thereby offering better training stability and computational efficiency than explicit all-atom paths. These models therefore offer a practical compromise between physical detail and computational ease. The premise is that the latent space must retain enough chemical information to accurately reconstruct actual amino acids and side chains during decoding.

Fully latent approaches push the abstraction further. The entire sequence–structure state is encoded into a high-dimensional latent manifold, joint generation is learned in that latent space, and pretrained decoders are later used to reconstruct sequence and coordinates [67,68]. When the latent code is inherited from large pretrained structural models [16], it can also import substantial prior knowledge about sequence–structure coupling. Moving the generation process into a fully continuous latent space makes these models highly scalable, as it avoids the mathematical difficulty of mixing discrete sequences with continuous geometry. The unavoidable cost, however, is a loss of transparency and control. Because the representation is so abstract, it becomes nearly impossible to tell whether the model genuinely learned the physical relationship between sequence and structure, or simply offloaded that burden to the final decoder.

A more rigorous compromise between discrete sequence and continuous geometry is offered by multimodal flow-matching schemes. In these systems, discrete dynamics govern sequence evolution while continuous SE(3)-equivariant transport drives structural updates. By synchronizing these distinct streams through shared neural components, the framework successfully preserves the native dynamics of both modalities [69].

2.5. Evaluation Metrics

Evaluating generative protein design aims to determine whether generated candidates are physically valid, computationally consistent, and sufficiently diverse. Furthermore, different design paradigms demand tailored validation strategies. Decoupled pipelines must independently assess the structural prior and the conditional sequence model. Co-design architectures must verify the mutual consistency of the jointly generated state. Meanwhile, hybrid frameworks must ensure that predictor-guided coupling has not produced adversarial artifacts. Importantly, these in silico metrics do not correlate equally with experimental success. While some primarily serve to filter out clearly nonphysical outputs, others actively enrich for candidates more likely to fold or function, yet none can ultimately replace experimental validation.

2.5.1. Physical Validity

Validity is the first screening layer. Its purpose is to determine whether generated outputs lie inside the manifold of physically possible folded proteins. For models that produce backbones or residue frames, this assessment focuses on geometric realizability, including chain continuity, the absence of severe steric self-intersections, and reasonable Ramachandran statistics. Global shape descriptors can also help identify clearly pathological topologies. For all-atom outputs, the criteria become stricter and include side-chain rotamer legality, bond-length and bond-angle deviations, and interatomic clash scores. Validity checks function as primary filters to discard biophysically impossible conformations, rather than as direct proofs of “designability”.
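Two of the backbone-level checks, chain continuity and steric self-intersection, can be sketched on Cα coordinates alone. The thresholds below (an ideal consecutive Cα distance of 3.8 Å with a 0.4 Å tolerance, and a 3.0 Å clash cutoff) are illustrative assumptions, not values prescribed by the methods reviewed here.

```python
import numpy as np


def chain_breaks(ca, ideal=3.8, tol=0.4):
    """Indices i where the CA(i)-CA(i+1) distance deviates from the ideal by more than tol."""
    ca = np.asarray(ca, dtype=float)
    d = np.linalg.norm(ca[1:] - ca[:-1], axis=1)
    return np.flatnonzero(np.abs(d - ideal) > tol).tolist()


def clash_pairs(ca, min_dist=3.0, min_separation=2):
    """Residue pairs at least min_separation apart in sequence but closer than min_dist in space."""
    ca = np.asarray(ca, dtype=float)
    dist = np.linalg.norm(ca[:, None] - ca[None, :], axis=-1)
    i, j = np.triu_indices(len(ca), k=min_separation)
    close = dist[i, j] < min_dist
    return list(zip(i[close].tolist(), j[close].tolist()))


# Toy backbones: a straight chain, and a copy whose last residue folds back onto the first.
good = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
bad = good.copy()
bad[4] = [0.5, 0.5, 0.0]

good_report = (chain_breaks(good), clash_pairs(good))
bad_report = (chain_breaks(bad), clash_pairs(bad))
```

In the distorted copy, moving one residue produces both a chain break after residue 3 and a clash with residue 0, which is the kind of output such filters are meant to discard early.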

2.5.2. Folding Consistency

Consistency provides the core computational evidence that a design actually satisfies the assumptions of the method that produced it. In inverse-folding studies, native sequence recovery (NSR) remains a widely used metric for quantifying residue-level agreement between designed and natural sequences [26,28]. However, NSR mostly measures how well a model recovers one plausible sequence under a conditional distribution, and its value depends strongly on local backbone quality [70]. High local recovery does not necessarily correspond to global stability, so NSR should not be interpreted as direct proof of experimental success [71].
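As a concrete reading of the metric, NSR can be computed as the fraction of positions at which the designed residue equals the native one. The sketch assumes the two sequences are already the same length and position-matched, which holds for fixed-backbone inverse folding; the example sequences are made up.

```python
def native_sequence_recovery(designed, native):
    """Fraction of positions where the designed residue matches the native residue."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Hypothetical eight-residue example with two substitutions (T->S and I->L).
nsr = native_sequence_recovery("MKSAYLAK", "MKTAYIAK")
```

Both substitutions here are conservative, yet they lower the score just as much as disruptive ones would, which illustrates why NSR alone says little about stability.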

A stronger closed-loop criterion is self-consistency, often treated as an in silico foldability test. In a typical protocol, the designed sequence is submitted to an independent structure predictor, and the resulting structure is compared against the intended target using measures such as scRMSD or scTM-score together with confidence indicators like pLDDT [26]. These metrics usually correlate more strongly with experimental tractability than sequence recovery alone, because they test whether the designed sequence returns to the intended structural basin. Even so, they remain enrichment criteria rather than direct surrogates for expression, stability, affinity, or function.
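The structural comparison at the heart of scRMSD is a root-mean-square deviation after optimal rigid superposition. A minimal sketch using the Kabsch algorithm is given below; it assumes both structures are supplied as matched Cα coordinate arrays, and the toy coordinates stand in for a designed backbone and its refolded prediction.

```python
import numpy as np


def kabsch_rmsd(p, q):
    """RMSD between matched point sets p and q after optimal rotation and translation of p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p_c = p - p.mean(axis=0)  # remove translation by centring both sets
    q_c = q - q.mean(axis=0)
    h = p_c.T @ q_c  # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against an improper rotation (reflection)
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rot = p_c @ rot.T
    return float(np.sqrt(np.mean(np.sum((p_rot - q_c) ** 2, axis=1))))


rng = np.random.default_rng(0)
target = rng.normal(scale=5.0, size=(20, 3))  # stand-in for the designed backbone

# A rigidly moved copy: rotation by 40 degrees about z plus a translation.
theta = np.deg2rad(40.0)
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
refolded = target @ rot_z.T + np.array([10.0, -3.0, 2.0])

rmsd_rigid = kabsch_rmsd(refolded, target)
perturbed = refolded + rng.normal(scale=1.0, size=refolded.shape)
rmsd_perturbed = kabsch_rmsd(perturbed, target)
```

A rigidly moved copy scores essentially zero, so any nonzero value reflects a genuine difference in shape rather than a difference in coordinate frame.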

For co-design models, the notion of consistency extends naturally to cross-consistency. Here the sequence and structure generated together are re-evaluated as a pair: the sequence is refolded and the predicted structure is aligned back to the model’s own generated structure. This tests whether the output truly forms a coupled solution rather than an accidental combination of individually plausible components.

For hybrid approaches, especially predictor-guided iterative refinement frameworks, predictor confidence alone is an insufficient safeguard. Such pipelines are especially vulnerable to adversarial solutions that satisfy one oracle numerically while remaining physically suspect. Robust consistency evaluation therefore needs to incorporate sequence-prior plausibility and, ideally, orthogonal predictors during validation so that model-specific bias is reduced [19].

Accordingly, the relationship between in silico consistency metrics and experimental success should be understood as probabilistic rather than deterministic: better computational scores generally improve hit rates, but they do not guarantee experimental success.

2.5.3. Design Coverage

Once validity and consistency have been established, evaluation shifts toward how broadly and how creatively a model explores the design space. Coverage is usually discussed in terms of diversity and novelty. Diversity reflects the range of solutions that can be produced under a fixed set of constraints and can be quantified by sequence clustering, pairwise identity, or structural fold-space coverage. Sampling hyperparameters are important here because they govern the practical trade-off between confidence and variety. Novelty, in turn, measures the distance between generated designs and known database entries, helping to distinguish genuine generalization from simple nearest-neighbor memorization.
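Sequence-level diversity and novelty can be illustrated with pairwise identity. The sketch assumes equal-length sequences compared position by position, without alignment, and defines diversity as one minus the mean pairwise identity and novelty as one minus the identity to the nearest reference; these definitions and the toy sequences are assumptions, and structural analogues would replace identity with a structural similarity score.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences (no alignment)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def diversity(designs):
    """One minus the mean pairwise identity within a set of designs."""
    pairs = list(combinations(designs, 2))
    return 1.0 - sum(identity(a, b) for a, b in pairs) / len(pairs)


def novelty(design, references):
    """One minus the identity to the most similar reference sequence."""
    return 1.0 - max(identity(design, ref) for ref in references)


designs = ["AAAA", "AAAV", "VVVV"]
set_diversity = diversity(designs)
design_novelty = novelty("AAAV", ["AAAA"])
```

Reporting both numbers matters because a model can score high on diversity while every sample remains close to some known database entry.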

3. Applications of Protein Design

3.1. Synthetic Biological Tools

Generative protein design has produced a range of molecular tools and biosensors that support synthetic circuit engineering, biological analysis, and related fields.

For highly dynamic targets, generative models can design binders that recognize an ensemble of target conformations, which makes it possible to bind intrinsically disordered proteins (IDPs). These binders can be used to analyze dynamic targets [72]. In intracellular engineering, miniaturized binder modules have also been designed to occupy transient interfaces in DNA mismatch repair complexes. Acting as modular regulators, these binders improve prime editing efficiency and illustrate the value of de novo proteins as genetic engineering modules [73].

Generative design can also produce dynamic cellular sensors and programmable switches. pH-sensitive binders have been created computationally, either by engineering electrostatic repulsion at the interface or by constructing buried histidine networks, so that the complex dissociates in acidic environments and binding can be controlled reversibly [74]. Computational prediction of suitable domain-insertion sites allows a receptor domain to be integrated into an effector protein, yielding allosteric switches with strong responses [75]. Beyond single-chain allostery, de novo design has yielded modular protein oligomers that assemble only in the presence of specific small-molecule drugs. Such ligand-induced assembly of multiple domains enables precise control over complex biological processes, including reversible condensate formation and spatial localization [76].

For membrane-embedded designs, synthetic proteins must combine hydrophobic compatibility with the lipid bilayer and effective signal transduction. De novo designed voltage-gated anion channels respond to the transmembrane electric field and can suppress neuronal firing [77]. Similarly, transmembrane fluorescence-activating proteins convert ligand binding into high-signal-to-noise optical readouts, thereby enabling distinct membrane-embedded biosensors [78].
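The role of histidine in pH sensing can be illustrated with the Henderson-Hasselbalch relation, which gives the fraction of a titratable group that is protonated at a given pH. The sketch below assumes a side-chain pKa of 6.0, a typical textbook value for histidine; the effective pKa of a buried histidine in a designed protein may differ substantially, and this calculation is not claimed to be part of the cited design method.

```python
def fraction_protonated(ph: float, pka: float = 6.0) -> float:
    """Henderson-Hasselbalch fraction of a basic group in its protonated form.

    The default pKa of 6.0 is an assumed, textbook-style value for the
    histidine side chain; protein environments can shift it.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Compare a near-neutral extracellular pH with an acidic endosome-like pH.
for ph in (7.4, 5.5):
    print(ph, round(fraction_protonated(ph), 3))
```

Under this assumption, histidine is mostly neutral near pH 7.4 and mostly charged near pH 5.5, which is the kind of charge switch that a pH-dependent interface can exploit.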

3.2. Therapeutic Applications

Beyond tool development, generative protein design has also been extended to therapeutic intervention.

In cancer immunotherapy, a central challenge lies in designing binders capable of discriminating between a specific disease-associated peptide–major histocompatibility complex (pMHC) and a vast repertoire of closely related complexes presented on the same HLA background. Because the underlying MHC scaffold is highly conserved, achieving exquisite specificity necessitates preferential recognition of the outward-facing residues of the presented peptide, rather than relying on extensive contacts with the MHC itself. To address this, recent computational strategies have engineered “peptide-centric” binding interfaces and enforced T-cell receptor (TCR)-like docking geometries. These approaches maximize peptide-focused recognition while minimizing off-target MHC cross-reactivity [79,80]. When integrated into chimeric antigen receptors (CARs) or other T-cell-engaging formats, these de novo interfaces mediate peptide-selective T-cell activation and robust cytotoxicity against target cancer cells, effectively translating precise molecular recognition into potent therapeutic function [81], as illustrated in Figure 3a.

Generative design has also enabled therapeutic control over immune signaling and receptor pharmacology. Soluble Notch agonists use programmed assembly geometries to induce non-natural receptor clustering, promoting T-cell development while reducing dependence on solid-phase presentation [82]. In cytokine therapy, engineered ultra-fast dissociation kinetics provide temporal control over signaling windows and help mitigate toxicities associated with sustained immune activation [83].

Related principles have also been extended to metabolic and membrane receptors, in which designed proteins modulate signaling by stabilizing specific functional states. Designed agonists can bias insulin receptor signaling by stabilizing specific allosteric states [84]. For G protein-coupled receptors (GPCRs), designed exoframe modulators target peripheral transmembrane interfaces and offer a state-selective approach for pharmacologically challenging receptors [85]. Similarly, de novo peptide modulators can target pathological states of endogenous sodium channels and restore inactivation gating, offering a potential therapeutic strategy for arrhythmias and epilepsy-associated electrophysiological dysfunction [86]. Designed proteins can also serve as neutralizing agents: de novo binders that remain stable in complex biological fluids can neutralize highly variable snake venom toxins (Figure 3b) [87].

3.3. Enzyme Design and Catalysis

Enzyme design represents a further escalation in difficulty because the objective is no longer merely binding or structural compatibility, but control over chemical transformation. In this context, foldability is necessary but far from sufficient. Catalytic success depends on precise active-site geometry, stabilization of transition states and intermediates, control over protonation networks, and favorable local electrostatics. As illustrated in Figure 4a, the central design problem is not simply to create a folded scaffold, but to position catalytic groups around the substrate with the spatial precision required for the target reaction.
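The geometric requirement described above can be sketched as a check of measured interatomic distances against target values. The atom labels, coordinates, target distances, and tolerance below are hypothetical placeholders chosen for illustration; they do not correspond to any specific enzyme or to the constraint format of any particular design tool.

```python
import math

Coord = tuple[float, float, float]


def check_geometry(coords: dict[str, Coord],
                   constraints: list[tuple[str, str, float]],
                   tolerance: float = 0.3) -> list[tuple[str, str, float, float]]:
    """Return the constraints whose measured distance deviates from the
    target by more than `tolerance` (all distances in angstroms)."""
    violations = []
    for atom_a, atom_b, target in constraints:
        measured = math.dist(coords[atom_a], coords[atom_b])
        if abs(measured - target) > tolerance:
            violations.append((atom_a, atom_b, round(measured, 2), target))
    return violations


# Illustrative coordinates for two catalytic groups and two substrate atoms.
coords = {
    "base_O": (0.0, 0.0, 0.0),
    "substrate_H": (0.0, 0.0, 2.0),
    "donor_N": (3.0, 0.0, 0.0),
    "substrate_O": (3.0, 0.0, 4.5),
}
constraints = [
    ("base_O", "substrate_H", 2.0),   # assumed target distance
    ("donor_N", "substrate_O", 2.9),  # assumed target distance
]
print(check_geometry(coords, constraints))
```

A folded model that passes structural validity checks can still fail a test of this kind, which is why foldability alone is not a sufficient criterion for catalytic designs.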

Model reactions such as Kemp elimination have therefore played an outsized role as benchmarks [88]. They offer a chemically clear setting in which one can test whether a designed scaffold can place catalytic groups with the required accuracy. Computational strategies that tightly constrain transition-state geometry and electrostatic arrangement have produced folded proteins with meaningful catalytic activity, sometimes approaching the lower edge of natural-enzyme performance in carefully defined cases. Harder targets, including serine hydrolases and metallohydrolases, demand additional control over catalytic triads, metal coordination, solvent structure, and multistep reaction pathways [89]. These studies show that modern design models can encode sophisticated chemical constraints, but they also make clear that catalysis places more severe demands on the design system than binding alone.

The incorporation of artificial cofactors extends this framework toward new-to-nature chemistry. By embedding synthetic cofactors such as porphyrins into designed backbones, researchers can bypass the chemical limits of natural side chains and enable efficient, stereoselective transformations of non-natural substrates [90]. In the cellular environment, combining computational design with high-throughput directed evolution has addressed substrate toxicity, cofactor assembly, and cytoplasmic folding, enabling efficient artificial olefin metathesis [91].

3.4. Protein Materials

At a higher level of organization, generative protein design has opened new opportunities in protein materials, where higher-order function depends primarily on the precise design of assembly interfaces rather than on the complexity of individual monomers.

Compared with single-chain design, this area places greater emphasis on interface rigidity, geometric programmability, and the reusable combination of modular building blocks. One representative strategy is bond-centric design, which reduces complex three-dimensional assembly to the geometric programming of inter-component connection angles. Using a restricted library of structural modules, this framework enables access to a broad topological space spanning two-dimensional arrays, polyhedral nanocages, and three-dimensional lattices [92]. As summarized in Figure 4b, by controlling symmetry relationships and inter-subunit arrangements, design rules at the local interface level can be translated into well-defined higher-order architectures. To move beyond the constraints of point-group symmetry, more advanced strategies introduce pseudo-symmetry and conformational symmetry breaking, generating bifaceted nanomaterials with built-in spatial heterogeneity [93].
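To make the link between local geometry and global symmetry concrete, the sketch below places copies of a single reference point around the z-axis under cyclic (Cn) symmetry. It is a toy illustration of how one rotation angle determines a closed ring, not an implementation of the bond-centric method in [92]; the function name and the use of a single point in place of a full subunit are simplifying assumptions.

```python
import math

Point = tuple[float, float, float]


def cyclic_copies(point: Point, n: int) -> list[Point]:
    """Generate n copies of a 3D point related by Cn symmetry about the z-axis."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    x, y, z = point
    copies = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n  # rotation angle for the k-th subunit
        c, s = math.cos(theta), math.sin(theta)
        copies.append((c * x - s * y, s * x + c * y, z))
    return copies


# A C3 ring built from one reference point on the x-axis.
ring = cyclic_copies((1.0, 0.0, 0.0), 3)
neighbor_distances = [math.dist(ring[i], ring[(i + 1) % 3]) for i in range(3)]
print(neighbor_distances)
```

Because every neighbor distance in the ring is identical, a single designed interface repeated n times is enough to close the assembly, which is why interface rigidity and angular precision matter so much in this setting.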

Such topological foundations are naturally complementary to molecular recognition modules; by arranging binding domains with high positional precision and repeatability, designed protein materials support applications such as immunogen display, multi-target delivery systems, and nanoreactors.

4. Outlook

Generative protein design has progressed rapidly to a stage at which de novo proteins can be engineered for binding, signaling, catalysis, and ordered assembly. Yet the central challenge has shifted. It is no longer enough to generate something that appears foldable in silico. What now matters is whether designed proteins can sustain robust function across the fluctuating, heterogeneous conditions of real biological environments.

One major limitation is that present-day workflows still depend heavily on structure predictors as practical scoring oracles for folding consistency. This strategy has obvious utility, but it also leaves design pipelines exposed to adversarial artifacts and model-specific biases. Future progress will therefore require complementary evaluation schemes that probe thermodynamic stability and biochemical affinity more directly. At the same time, most current methods remain anchored to static structural snapshots, whereas natural protein function often emerges from conformational fluctuation, state transitions, and environment-dependent interactions. A decisive next step will be to move from static-structure design toward the generative modeling of dynamic conformational ensembles, which is essential for allosteric proteins, molecular switches, and other adaptive systems.

Finally, the future design space is almost certainly larger than the present one. Incorporating non-canonical amino acids, nucleic acids, cofactors, and small-molecule ligands directly into generative models would expand both chemistry and function. As computational frameworks continue to absorb richer physical, chemical, and biological priors, generative protein design is likely to advance from simple protein-oriented construction toward the broader engineering of sophisticated biomolecules.

Author Contributions

S.L. collected the literature and wrote the manuscript; B.Z. designed the overall framework and amended the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. DeGrado, W.F.; Wasserman, Z.R.; Lear, J.D. Protein Design, a Minimalist Approach. Science 1989, 243, 622–628.
2. Hecht, M.H.; Richardson, J.S.; Richardson, D.C.; Ogden, R.C. De Novo Design, Expression, and Characterization of Felix: A Four-Helix Bundle Protein of Native-Like Sequence. Science 1990, 249, 884–891.
3. Dahiyat, B.I.; Mayo, S.L. De Novo Protein Design: Fully Automated Sequence Selection. Science 1997, 278, 82–87.
4. Ponder, J.W.; Richards, F.M. Tertiary Templates for Proteins: Use of Packing Criteria in the Enumeration of Allowed Sequences for Different Structural Classes. J. Mol. Biol. 1987, 193, 775–791.
5. Simons, K.T.; Kooperberg, C.; Huang, E.; Baker, D. Assembly of Protein Tertiary Structures From Fragments with Similar Local Sequences Using Simulated Annealing and Bayesian Scoring Functions. J. Mol. Biol. 1997, 268, 209–225.
6. Kuhlman, B.; Dantas, G.; Ireton, G.C.; Varani, G.; Stoddard, B.L.; Baker, D. Design of a Novel Globular Protein Fold with Atomic-Level Accuracy. Science 2003, 302, 1364–1368.
7. Huang, P.-S.; Boyken, S.E.; Baker, D. The Coming of Age of de novo Protein Design. Nature 2016, 537, 320–327.
8. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Zídek, A.; Potapenko, A.; et al. Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021, 596, 583–589.
9. Baek, M.; DiMaio, F.; Anishchenko, I.; Dauparas, J.; Ovchinnikov, S.; Lee, G.R.; Wang, J.; Cong, Q.; Kinch, L.N.; Schaeffer, R.D.; et al. Accurate Prediction of Protein Structures and Interactions using a Three-track Neural Network. Science 2021, 373, 871–876.
10. Khakzad, H.; Igashov, I.; Schneuing, A.; Goverde, C.; Bronstein, M.; Correia, B. A New Age in Protein Design Empowered by Deep Learning. Cell Syst. 2023, 14, 925–939.
11. Koh, H.Y.; Zheng, Y.; Yang, M.; Arora, R.; Webb, G.I.; Pan, S.; Li, L.; Church, G.M. AI-driven Protein Design. Nat. Rev. Bioeng. 2025, 3, 1034–1056.
12. Notin, P.; Rollins, N.; Gal, Y.; Sander, C.; Marks, D. Machine Learning for Functional Protein Design. Nat. Biotechnol. 2024, 42, 216–228.
13. Zhang, Z.; Ou, C.; Cho, Y.; Akiyama, Y.; Ovchinnikov, S. Artificial Intelligence Methods for Protein Folding and Design. Curr. Opin. Struct. Biol. 2025, 93, 103066.
14. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118.
15. Rao, R.; Liu, J.; Verkuil, R.; Meier, J.; Canny, J.F.; Abbeel, P.; Sercu, T.; Rives, A. MSA Transformer. bioRxiv 2021.
16. Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; et al. Evolutionary-scale Prediction of Atomic-level Protein Structure with a Language Model. Science 2023, 379, 1123–1130.
17. Vig, J.; Madani, A.; Varshney, L.R.; Xiong, C.; Socher, R.; Fatema Rajani, N. BERTology Meets Biology: Interpreting Attention in Protein Language Models. arXiv 2020, arXiv:2006.15222.
18. Madani, A.; Krause, B.; Greene, E.R.; Subramanian, S.; Mohr, B.P.; Holton, J.M.; Olmos, J.L.; Xiong, C.; Sun, Z.Z.; Socher, R.; et al. Large Language Models Generate Functional Protein Sequences Across Diverse Families. Nat. Biotechnol. 2023, 41, 1099–1106.
19. Verkuil, R.; Kabeli, O.; Du, Y.; Wicky, B.I.M.; Milles, L.F.; Dauparas, J.; Baker, D.; Ovchinnikov, S.; Sercu, T.; Rives, A. Language Models Generalize Beyond Natural Proteins. bioRxiv 2022.
20. Ferruz, N.; Schmidt, S.; Höcker, B. ProtGPT2 is a Deep Unsupervised Language Model for Protein Design. Nat. Commun. 2022, 13, 4348.
21. Qi, Y.; Zhang, J.Z.H. DenseCPD: Improving the Accuracy of Neural-Network-Based Computational Protein Sequence Design with DenseNet. J. Chem. Inf. Model. 2020, 60, 1245–1252.
22. Anand, N.; Achim, T. Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models. arXiv 2022, arXiv:2205.15019.
23. Ingraham, J.; Garg, V.; Barzilay, R.; Jaakkola, T. Generative Models for Graph-Based Protein Design. In Proceedings of the 33rd International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 15820–15831.
24. Jing, B.; Eismann, S.; Suriana, P.; Townshend, R.J.L.; Dror, R. Learning from Protein Structure with Geometric Vector Perceptrons. arXiv 2020, arXiv:2009.01411.
25. Strokach, A.; Becerra, D.; Corbi-Verge, C.; Perez-Riba, A.; Kim, P.M. Fast and Flexible Protein Design Using Deep Graph Neural Networks. Cell Syst. 2020, 11, 402–411.e4.
26. Dauparas, J.; Anishchenko, I.; Bennett, N.; Bai, H.; Ragotte, R.J.; Milles, L.F.; Wicky, B.I.M.; Courbet, A.; de Haas, R.J.; Bethel, N.; et al. Robust Deep Learning–based Protein Sequence Design using ProteinMPNN. Science 2022, 378, 49–56.
27. Dauparas, J.; Lee, G.R.; Pecoraro, R.; An, L.; Anishchenko, I.; Glasscock, C.; Baker, D. Atomic Context-conditioned Protein Sequence Design using LigandMPNN. Nat. Methods 2025, 22, 717–723.
28. Hsu, C.; Verkuil, R.; Liu, J.; Lin, Z.; Hie, B.; Sercu, T.; Lerer, A.; Rives, A. Learning Inverse Folding from Millions of Predicted Structures. bioRxiv 2022.
29. Bronstein, M.M.; Bruna, J.; Cohen, T.; Veličković, P. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv 2021, arXiv:2104.13478.
30. Cohen, T.S.; Welling, M. Group Equivariant Convolutional Networks. arXiv 2016, arXiv:1602.07576.
31. Thomas, N.; Smidt, T.; Kearnes, S.; Yang, L.; Li, L.; Kohlhoff, K.; Riley, P. Tensor Field Networks: Rotation- and Translation-Equivariant Neural Networks for 3D Point Clouds. arXiv 2018, arXiv:1802.08219.
32. Weiler, M.; Geiger, M.; Welling, M.; Boomsma, W.; Cohen, T. 3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data. arXiv 2018, arXiv:1807.02547.
33. Fuchs, F.B.; Worrall, D.E.; Fischer, V.; Welling, M. SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks. arXiv 2020, arXiv:2006.10503.
34. Garcia Satorras, V.; Hoogeboom, E.; Welling, M. E(n) Equivariant Graph Neural Networks. arXiv 2021, arXiv:2102.09844.
35. Deng, C.; Litany, O.; Duan, Y.; Poulenard, A.; Tagliasacchi, A.; Guibas, L. Vector Neurons: A General Framework for SO(3)-Equivariant Networks. arXiv 2021, arXiv:2104.12229.
36. Anishchenko, I.; Pellock, S.J.; Chidyausiku, T.M.; Ramelot, T.A.; Ovchinnikov, S.; Hao, J.; Bafna, K.; Norn, C.; Kang, A.; Bera, A.K.; et al. De novo Protein Design by Deep Network Hallucination. Nature 2021, 600, 547–552.
37. Jendrusch, M.A.; Yang, A.L.J.; Cacace, E.; Bobonis, J.; Voogdt, C.G.P.; Kaspar, S.; Schweimer, K.; Perez-Borrajero, C.; Lapouge, K.; Scheurich, J.; et al. AlphaDesign: A de novo Protein Design Framework Based on AlphaFold. Mol. Syst. Biol. 2025, 21, 1166–1189.
38. Moffat, L.; Greener, J.G.; Jones, D.T. Using AlphaFold for Rapid and Accurate Fixed Backbone Protein Design. bioRxiv 2021.
39. Moffat, L.; Kandathil, S.M.; Jones, D.T. Design in the DARK: Learning Deep Generative Models for De Novo Protein Design. bioRxiv 2022.
40. Wang, J.; Lisanza, S.; Juergens, D.; Tischer, D.; Watson, J.L.; Castro, K.M.; Ragotte, R.; Saragovi, A.; Milles, L.F.; Baek, M.; et al. Scaffolding Protein Functional Sites using Deep Learning. Science 2022, 377, 387–394.
41. Wicky, B.I.M.; Milles, L.F.; Courbet, A.; Ragotte, R.J.; Dauparas, J.; Kinfu, E.; Tipps, S.; Kibler, R.D.; Baek, M.; DiMaio, F.; et al. Hallucinating Symmetric Protein Assemblies. Science 2022, 378, 56–61.
42. Ingraham, J.B.; Baranov, M.; Costello, Z.; Barber, K.W.; Wang, W.; Ismail, A.; Frappier, V.; Lord, D.M.; Ng-Thow-Hing, C.; Van Vlack, E.R.; et al. Illuminating Protein Space with a Programmable Generative Model. Nature 2023, 623, 1070–1078.
43. Watson, J.L.; Juergens, D.; Bennett, N.R.; Trippe, B.L.; Yim, J.; Eisenach, H.E.; Ahern, W.; Borst, A.J.; Ragotte, R.J.; Milles, L.F.; et al. De Novo Design of Protein Structure and Function with RFdiffusion. Nature 2023, 620, 1089–1100.
44. Yim, J.; Trippe, B.L.; De Bortoli, V.; Mathieu, E.; Doucet, A.; Barzilay, R.; Jaakkola, T. SE(3) Diffusion Model with Application to Protein Backbone Generation. arXiv 2023, arXiv:2302.02277.
45. De Bortoli, V.; Mathieu, E.; Hutchinson, M.; Thornton, J.; Whye Teh, Y.; Doucet, A. Riemannian Score-Based Generative Modelling. arXiv 2022, arXiv:2202.02763.
46. Geffner, T.; Didi, K.; Zhang, Z.; Reidenbach, D.; Cao, Z.; Yim, J.; Geiger, M.; Dallago, C.; Kucukbenli, E.; Vahdat, A.; et al. Proteina: Scaling Flow-based Protein Structure Generative Models. arXiv 2025, arXiv:2503.00710.
47. Dauparas, J. Backbone Conditional Protein Sequence Design. Cold Spring Harb. Perspect. Biol. 2026, 18, a041517.
48. Yang, K.K.; Zanichelli, N.; Yeh, H. Masked Inverse Folding with Sequence Transfer for Protein Representation Learning. Protein Eng. Des. Sel. 2022, 36, gzad015.
49. Gao, Z.; Tan, C.; Chacón, P.; Li, S.Z. PiFold: Toward Effective and Efficient Protein Inverse Folding. arXiv 2022, arXiv:2209.12643.
50. Anand, N.; Eguchi, R.; Mathews, I.I.; Perez, C.P.; Derry, A.; Altman, R.B.; Huang, P.-S. Protein Sequence Design with a Learned Potential. Nat. Commun. 2022, 13, 746.
51. Li, A.J.; Lu, M.; Desta, I.; Sundar, V.; Grigoryan, G.; Keating, A.E. Neural Network-derived Potts Models for Structure-Based Protein Design using Backbone Atomic Coordinates and Tertiary Motifs. Protein Sci. 2023, 32, e4554.
52. Zhang, O.; Zhang, X.; Lin, H.; Tan, C.; Wang, Q.; Mo, Y.; Feng, Q.; Du, G.; Yu, Y.; Jin, Z.; et al. ODesign: A World Model for Biomolecular Interaction Design. arXiv 2025, arXiv:2510.22304.
53. Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A.J.; Bambrick, J.; et al. Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature 2024, 630, 493–500.
54. Wohlwend, J.; Corso, G.; Passaro, S.; Getz, N.; Reveiz, M.; Leidal, K.; Swiderski, W.; Atkinson, L.; Portnoi, T.; Chinn, I.; et al. Boltz-1 Democratizing Biomolecular Interaction Modeling. bioRxiv 2025.
55. Passaro, S.; Corso, G.; Wohlwend, J.; Reveiz, M.; Thaler, S.; Somnath, V.R.; Getz, N.; Portnoi, T.; Roy, J.; Stark, H.; et al. Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. bioRxiv 2025.
56. Chai Discovery; Boitreaud, J.; Dent, J.; McPartlon, M.; Meier, J.; Reis, V.; Rogozhnikov, A.; Wu, K. Chai-1: Decoding the Molecular Interactions of Life. bioRxiv 2024.
57. Frank, C.; Khoshouei, A.; Fuß, L.; Schiwietz, D.; Putz, D.; Weber, L.; Zhao, Z.; Hattori, M.; Feng, S.; de Stigter, Y.; et al. Scalable Protein Design Using Optimization in a Relaxed Sequence Space. Science 2024, 386, 439–445.
58. Pacesa, M.; Nickel, L.; Schellhaas, C.; Schmidt, J.; Pyatova, E.; Kissling, L.; Barendse, P.; Choudhury, J.; Kapoor, S.; Alcaraz-Serna, A.; et al. BindCraft: One-shot Design of Functional Protein Binders. bioRxiv 2025.
59. Fang, M.; Wang, C.; Shi, J.; Lian, F.; Jin, Q.; Wang, Z.; Zhang, Y.; Cui, Z.; Wang, Y.; Ke, Y.; et al. HalluDesign: Protein Optimization and de novo Design via Iterative Structure Hallucination and Sequence Design. bioRxiv 2025.
60. Cho, Y.; Rangel, G.; Bhardwaj, G.; Ovchinnikov, S. Protein Hunter: Exploiting Structure Hallucination within Diffusion for Protein Design. bioRxiv 2025.
61. Stark, H.; Faltings, F.; Choi, M.; Xie, Y.; Hur, E.; O’Donnell, T.; Bushuiev, A.; Uçar, T.; Passaro, S.; Mao, W.; et al. BoltzGen: Toward Universal Binder Design. bioRxiv 2025.
62. Butcher, J.; Krishna, R.; Mitra, R.; Brent, R.I.; Li, Y.; Corley, N.; Kim, P.T.; Funk, J.; Mathis, S.; Salike, S.; et al. De novo Design of All-atom Biomolecular Interactions with RFdiffusion3. bioRxiv 2025.
63. Qu, W.; Guan, J.; Ma, R.; Zhai, K.; Wu, W.; Wang, H. P(all-atom) Is Unlocking New Path For Protein Design. bioRxiv 2025.
64. Ren, M.; Zhu, T.; Zhang, H. CarbonNovo: Joint Design of Protein Structure and Sequence Using a Unified Energy-based Model. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024; JMLR Inc.: New York, NY, USA, 2024.
65. Geffner, T.; Didi, K.; Cao, Z.; Reidenbach, D.; Zhang, Z.; Dallago, C.; Kucukbenli, E.; Kreis, K.; Vahdat, A. La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching. arXiv 2025, arXiv:2507.09466.
66. Didi, K.; Zhang, Z.; Zhou, G.; Reidenbach, D.; Cao, Z.; Cha, S.; Geffner, T.; Dallago, C.; Tang, J.; Bronstein, M.M.; et al. Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute. arXiv 2026, arXiv:2603.27950.
67. Latent Labs Team; Bridgland, A.; Crabbé, J.; Kenlay, H.; Pretorius, D.; Schmon, S.M.; Hilmkil, A.; Bartke-Croughan, R.; Rombach, R.; Flashman, M.; et al. Latent-X: An Atom-level Frontier Model for De Novo Protein Binder Design. arXiv 2025, arXiv:2507.19375.
68. Lu, A.X.; Yan, W.; Robinson, S.A.; Kelow, S.; Yang, K.K.; Gligorijevic, V.; Cho, K.; Bonneau, R.; Abbeel, P.; Frey, N.C. Controllable All-Atom Protein Generation with Latent Diffusion. bioRxiv 2025.
69. Campbell, A.; Yim, J.; Barzilay, R.; Rainforth, T.; Jaakkola, T. Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design. arXiv 2024, arXiv:2402.04997.
70. Castorina, L.V.; Petrenas, R.; Subr, K.; Wood, C.W. PDBench: Evaluating Computational Methods for Protein Sequence Design. arXiv 2021, arXiv:2109.07925.
71. Petukh, M.; Kucukkal, T.G.; Alexov, E. On Human Disease-Causing Amino Acid Variants: Statistical Study of Sequence and Structural Patterns. Hum. Mutat. 2015, 36, 524–534.
72. Liu, C.; Wu, K.; Choi, H.; Han, H.L.; Zhang, X.; Watson, J.L.; Ahn, G.; Zhang, J.Z.; Shijo, S.; Good, L.L.; et al. Diffusing Protein Binders to Intrinsically Disordered Proteins. Nature 2025, 644, 809–817.
73. Park, J.-C.; Uhm, H.; Kim, Y.-W.; Oh, Y.E.; Lee, J.H.; Yang, J.; Kim, K.; Bae, S. AI-generated MLH1 Small Binder Improves Prime Editing Efficiency. Cell 2025, 188, 5831–5846.e21.
74. Ahn, G.; Coventry, B.; Haefner, E.; Sadre, S.; Hu, J.; Van, M.; Huang, B.; Sappington, I.; Broerman, A.J.; Lichtenstein, M.A.; et al. Computational Design of pH-sensitive Binders. bioRxiv 2025.
75. Wolf, B.; Shehu, P.; Brenker, L.; von Bachmann, A.-L.; Kroell, A.-S.; Southern, N.; Holderbach, S.; Eigenmann, J.; Aschenbrenner, S.; Mathony, J.; et al. Rational Engineering of Allosteric Protein Switches by in Silico Prediction of Domain Insertion Sites. Nat. Methods 2025, 22, 1698–1706.
76. Jin, Q.; Wang, Y.; Chen, D.; Liao, J.; Cui, Z.; Fan, Y.; Zeng, A.; Xie, M.; Cao, L. De novo Design of Small Molecule–regulated Protein Oligomers. Science 2026, 391, eady6017.
77. Zhou, C.; Li, H.; Wang, J.; Qian, C.; Xiong, H.; Chu, Z.; Shao, Q.; Li, X.; Sun, S.; Sun, K.; et al. De novo Designed Voltage-gated Anion Channels Suppress Neuron Firing. Cell 2025, 188, 7495–7511.e21.
78. Zhu, J.; Liang, M.; Sun, K.; Wei, Y.; Guo, R.; Zhang, L.; Shi, J.; Ma, D.; Hu, Q.; Huang, G.; et al. De novo Design of Transmembrane Fluorescence-activating Proteins. Nature 2025, 640, 249–257.
79. Liu, B.; Greenwood, N.F.; Bonzanini, J.E.; Motmaen, A.; Meyerberg, J.; Dao, T.; Xiang, X.; Ault, R.; Sharp, J.; Wang, C.; et al. Design of High-Specificity Binders for Peptide–MHC-I Complexes. Science 2025, 389, 386–391.
80. Householder, K.D.; Xiang, X.; Jude, K.M.; Deng, A.; Obenaus, M.; Zhao, Y.; Wilson, S.C.; Chen, X.; Wang, N.; Garcia, K.C. De novo Design and Structure of a Peptide–centric TCR Mimic Binding Module. Science 2025, 389, 375–379.
81. Johansen, K.H.; Wolff, D.S.; Scapolo, B.; Fernández-Quintero, M.L.; Risager Christensen, C.; Loeffler, J.R.; Rivera-de-Torre, E.; Overath, M.D.; Kjærgaard Munk, K.; Morell, O.; et al. De novo-designed pMHC Binders Facilitate T cell–Mediated Cytotoxicity toward Cancer Cells. Science 2025, 389, 380–385.
82. Mout, R.; Jing, R.; Tanaka-Yano, M.; Egan, E.D.; Eisenach, H.; Kononov, M.A.; Windisch, R.; Najia, M.A.T.; Tompkins, A.; Hensch, L.; et al. Design of Soluble Notch Agonists that Drive T cell Development and Boost Immunity. Cell 2025, 188, 5980–5994.e28.
83. Broerman, A.J.; Pollmann, C.; Zhao, Y.; Lichtenstein, M.A.; Jackson, M.D.; Tessmer, M.H.; Ryu, W.H.; Ogishi, M.; Abedi, M.H.; Sahtoe, D.D.; et al. Design of Facilitated Dissociation Enables Timing of Cytokine Signalling. Nature 2025, 647, 528–535.
84. Wang, X.; Cardoso, S.; Cai, K.; Venkatesh, P.; Hung, A.; Ng, M.; Hall, C.; Coventry, B.; Lee, D.S.; Chowhan, R.; et al. Tuning Insulin Receptor Signaling using de novo-designed Agonists. Mol. Cell 2025, 85, 4064–4081.e9.
85. Cheng, S.; Guo, J.; Zhou, Y.-l.; Luo, X.; Zhang, G.; Zhang, Y.-z.; Yang, Y.; Xie, J.; Xu, P.; Shen, D.-d.; et al. De novo Design of GPCR Exoframe Modulators. Nature 2026, 651, 242–250.
86. Mahling, R.; Hegyi, B.; Cullen, E.R.; Cho, T.M.; Rodriques, A.R.; Fossier, L.; Yehya, M.; Yang, L.; Chen, B.-X.; Katchman, A.N.; et al. De novo Design of a Peptide Modulator to Reverse Sodium Channel Dysfunction Linked to Cardiac Arrhythmias and Epilepsy. Cell 2025, 188, 6170–6185.e19.
87. Vázquez Torres, S.; Benard Valle, M.; Mackessy, S.P.; Menzies, S.K.; Casewell, N.R.; Ahmadi, S.; Burlet, N.J.; Muratspahić, E.; Sappington, I.; Overath, M.D.; et al. De novo Designed Proteins Neutralize Lethal Snake Venom Toxins. Nature 2025, 639, 225–231.
88. Listov, D.; Vos, E.; Hoffka, G.; Hoch, S.Y.; Berg, A.; Hamer-Rogotner, S.; Dym, O.; Kamerlin, S.C.L.; Fleishman, S.J. Complete Computational Design of High-efficiency Kemp Elimination Enzymes. Nature 2025, 643, 1421–1427.
89. Lauko, A.; Pellock, S.J.; Sumida, K.H.; Anishchenko, I.; Juergens, D.; Ahern, W.; Jeung, J.; Shida, A.F.; Hunt, A.; Kalvet, I.; et al. Computational Design of Serine Hydrolases. Science 2025, 388, eadu2454.
90. Hou, K.; Huang, W.; Qi, M.; Tugwell, T.H.; Alturaifi, T.M.; Chen, Y.; Zhang, X.; Lu, L.; Mann, S.I.; Liu, P.; et al. De novo Design of Porphyrin-containing Proteins as Efficient and Stereoselective Catalysts. Science 2025, 388, 665–670.
91. Zou, Z.; Kalvet, I.; Lozhkin, B.; Morris, E.; Zhang, K.; Chen, D.; Ernst, M.L.; Zhang, X.; Baker, D.; Ward, T.R. De novo Design and Evolution of an Artificial Metathase for Cytoplasmic Olefin Metathesis. Nat. Catal. 2025, 8, 1208–1219.
92. Wang, S.; Favor, A.; Kibler, R.D.; Lubner, J.M.; Borst, A.J.; Coudray, N.; Redler, R.L.; Chiang, H.T.; Sheffler, W.; Hsia, Y.; et al. Bond-centric Modular Design of Protein Assemblies. Nat. Mater. 2025, 24, 1644–1652.
93. Rankovic, S.; Carr, K.D.; Decarreau, J.; Skotheim, R.; Kibler, R.D.; Ols, S.; Lee, S.; Chun, J.-H.; Tooley, M.R.; Dauparas, J.; et al. Computational Design of Bifaceted Protein Nanomaterials. Nat. Mater. 2025, 24, 1635–1643.

Figure 1. Structural representation strategies for 3D protein geometry. (a) 3D voxel grid representation and convolution. The 3D space occupied by the protein is divided into a regular voxel grid, and a 3D-CNN is used to extract features. (b) Graph neural networks (GNNs). The protein is treated as a graph in which nodes represent biological entities and edges encode spatial relationships, and the graph is processed by message passing.

Figure 2. Mainstream paradigms in generative protein design. (a) Sequence–structure decoupled design. The task is decomposed into two ordered stages: backbone generation followed by backbone-conditioned sequence design (inverse folding). (b) Sequence–structure co-design. Sequence and structure are treated as a coupled design state and optimized within a shared or iterative framework so that mutually compatible solutions can emerge during inference. Hybrid approaches lie conceptually between these two strategies, but are not shown as a separate unified panel because they do not follow a common implementation principle.

Figure 3. Therapeutic applications of de novo protein design. (a) Targeted immunotherapy using a CAR-binder chimera engineered to recognize peptide–major histocompatibility complex (pMHC) on the cell surface with high specificity. (b) Venom neutralization mediated by high-affinity de novo binders that inhibit lethal snake venom toxins.

Figure 4. Design principles for catalytic proteins and protein materials. (a) Schematic illustration of enzyme design, where a substrate is positioned within an enzyme scaffold and surrounded by catalytic residues arranged with the high geometric precision required to drive the intended reaction. (b) Modular assembly of protein materials. By precisely programming interfacial geometries, basic subunits can be directed to form diverse symmetrical architectures, including cyclic (C2–C6), dihedral (D2–D6), tetrahedral (T), and octahedral (O) complexes.
