Article

Direct Phasing of Protein Crystals with Continuous Iterative Projection Algorithms and Refined Envelope Reconstruction

1
Department of Physics, School of Physical Science and Technology, Ningbo University, Ningbo 315211, China
2
Department of Physics and Texas Center for Superconductivity, University of Houston, Houston, TX 77204, USA
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Biomolecules 2026, 16(2), 227; https://doi.org/10.3390/biom16020227
Submission received: 24 December 2025 / Revised: 21 January 2026 / Accepted: 25 January 2026 / Published: 2 February 2026
(This article belongs to the Special Issue State-of-the-Art Protein X-Ray Crystallography)

Abstract

Direct methods provide a model-free approach to solving the crystallographic phase problem and deliver unbiased atomic structures. However, conventional iterative projection algorithms such as Hybrid Input–Output (HIO) face two critical challenges: discontinuous density modification at the protein-solvent boundary and inaccurate molecular envelope reconstruction that fails to account for trapped solvent, particularly in crystals with solvent content approaching the lower limits of direct phasing applicability. We introduced four continuous iterative projection algorithms, including our improved continuous version, which implements smooth density modification at protein-solvent interfaces. To address envelope inaccuracy, we developed a two-step refined reconstruction scheme using sequential large-radius and small-radius Gaussian filters to identify trapped solvent molecules within surface cavities and internal channels. This scheme enhances the performance of both continuous and classical algorithms, including HIO, the difference map, and our improved versions. Benchmarking on 28 protein structures (solvent contents 55–78%, resolutions 1.46–3.2 Å, reported R-factor less than 0.22) showed that the refined envelope scheme increased average success rates of continuous algorithms by 45.7% and classical algorithms by 60.5%. The performance of continuous algorithms and improved classical algorithms proved comparable to the well-established HIO algorithm, forming a top-tier group that exceeded other classical algorithms. Integrating a genetic algorithm co-evolution strategy further enhanced average success rates by approximately 2.5-fold and accelerated convergence through population-wide information sharing. Although the success rate correlates with solvent content, our strategy improved success probability at any given solvent level, extending the practical boundaries of direct methods. 
The high success rate enabled averaging of multiple independent solutions, which reduced mean phase error by approximately 6.83° and yielded atomic models with backbone root-mean-square deviation (RMSD) typically below 0.5 Å relative to structures reported in the Protein Data Bank (PDB). This work introduces novel algorithms, a refined envelope reconstruction methodology, and an effective optimization strategy with genetic algorithm evolution. The complete framework enhances the capability and reliability of direct methods for phasing protein crystals with limited solvent content and provides a toolkit for addressing challenging cases in structural biology.

1. Introduction

Protein crystallography remains the primary method for determining three-dimensional structures of biological macromolecules through X-ray diffraction. While diffraction experiments record structure factor amplitudes, the phase information is inherently lost, creating the phase problem. Conventional structure determination approaches, such as molecular replacement, depend on AlphaFold-predicted models [1,2,3] or homologous structures, which may introduce model bias, while experimental phasing methods require heavy atom derivatization and considerable experimental investment. In contrast, direct methods operate independently of prior structural information, offering an unbiased route to structure determination. This makes them valuable for validating predicted structures and for determining novel protein folds through ab initio approaches. Direct methods iteratively enforce physical constraints in real space, such as uniform bulk solvent density and characteristic protein density distribution, while simultaneously satisfying experimental diffraction data in reciprocal space until convergence to the true electron density is achieved.
Direct phasing methods have evolved from small-molecule techniques to macromolecular approaches. Early methods for small molecules exploited statistical relationships in diffraction data, including Sayre’s equation [4], triplet phase relationships [5], tangent formula [6,7], and minimal principle [8], implemented in software such as SHELX [9]. However, these classical techniques require atomic-resolution data better than 1.2 Å and are limited to structures containing fewer than 1000 non-hydrogen atoms [10], rendering them unsuitable for most protein crystals that diffract to resolutions around 2 Å and contain thousands of non-hydrogen atoms. Traditional methods could only provide low-resolution structural information for macromolecules [11]. The breakthrough came with recognizing that additional physical constraints must be maximally exploited, the most powerful being the nearly constant electron density of bulk solvent regions in protein crystals. This led to the development of iterative projection algorithms (IPAs) designed to leverage this constraint.
IPAs are primary computational tools for the direct solution of macromolecular structures by utilizing constraints in both real and reciprocal space. Seminal contributions include the Hybrid Input–Output (HIO) algorithm by Fienup [12] and the Difference Map (DM) algorithm by Elser [13,14], both applied to ab initio protein structure determination. Millane et al. demonstrated using HIO for phasing periodic structures [15,16,17]. Miao et al. and Marchesini et al. demonstrated using HIO for phasing non-periodic structures [18,19]. Lunin et al. explored the use of density connectivity for searching low-resolution structures [20]. Liu et al. have shown that given a reasonable molecular envelope, HIO can recover atomic-resolution protein structures [21]. Further developments include automated envelope generation for HIO by He and Su [22], the application of DM to proteins and viruses by Lo and Millane [23,24,25], workflows incorporating envelope clustering by Kingston et al. [26,27], and partial structure completion with machine-learning models by Pan et al. [28]. Our previous research introduced ab initio non-crystallographic symmetry (NCS) averaging [29], the transition-region-based THIO algorithm [30], and advanced strategies including resolution-weighted phasing [31] and genetic algorithm co-evolution [32]. Despite these advances, existing IPA-based methods remain largely restricted to structures with solvent content exceeding 65% [32]. However, statistical analysis of the Protein Data Bank (PDB) reveals that only 9.6% of deposited structures exceed this threshold (see Appendix A, Figure A1 for the distribution of solvent content). We categorize crystals as low solvent content (<45%), medium solvent content (45–55%), or high solvent content (>55%), with each category comprising approximately one-third of PDB-reported crystal structures. 
This distribution highlights the need for methodological improvements to extend applicability toward lower solvent content within the high-solvent range. Two fundamental limitations of current IPAs restrict their broader application to lower-solvent-content and structurally complex systems.
The first limitation arises from discontinuous density modification at the protein-solvent boundary. Traditional IPAs apply different update rules to protein regions, where density is preserved or adjusted through histogram matching [33], versus bulk solvent regions, where strong negative feedback forces density toward zero. This binary treatment imposes an abrupt mathematical discontinuity at the molecular boundary that contradicts the continuous nature of electron density, propagating errors throughout iteration and hindering convergence. The second limitation concerns inaccurately reconstructed molecular envelopes. Starting from random phases, precisely delineating the true protein boundary during early iterations is difficult. Consequently, researchers employ loose, oversized envelopes to ensure complete protein enclosure. However, this approach encompasses substantial volumes of discrete solvent molecules residing within surface cavities and internal channels. Conventional algorithms mistakenly treat these trapped solvent regions as protein density, failing to apply the powerful constant-density constraint these regions would otherwise provide, representing a waste of valuable prior information. When the overall solvent content is already limited, this inefficiency becomes detrimental to successful phase recovery.
To address these challenges, we developed a framework that enhances phase recovery capability and reliability through innovations at three levels. At the algorithmic level, we introduce continuous IPAs, including established algorithms Continuous Hybrid Input–Output (CHIO) [34], Hybrid Projection Reflection (HPR) [35], and Transition Hybrid Input–Output (THIO) [30], along with our improved variant, Modified CHIO (MCHIO). These algorithms establish transition zones between protein and solvent regions, enabling smooth density modification. At the methodological level, we propose a two-step refined envelope reconstruction scheme. This scheme employs a coarse-to-fine strategy using sequential large-radius and small-radius Gaussian filters to identify and recover solvent erroneously included within coarse envelopes. By incorporating recovered solvent into constrained regions, the scheme maximizes utilization of available information. This methodology enhances performance for both continuous and classical IPAs, including HIO, DM, and our improved versions, Modified Difference Map (MDM) and Modified Relaxed Averaged Alternating Reflections (MRAAR). At the strategic level, we performed optimization comparing three phasing strategies: conventional full-resolution, resolution-weighted progressive [31], and genetic algorithm (GA) co-evolution [32]. The GA strategy establishes information-sharing and inheritance mechanisms among multiple independent reconstruction processes, achieving enhancement in global search capability that elevates success rates.
This work makes several contributions. We provide the first comparison and validation of multiple continuous IPAs for protein direct phasing, establishing their advantages and practical performance. We introduce the two-step refined envelope reconstruction scheme as a method of elevating baseline performance across various algorithms. We develop improved iterative algorithms, including MDM, MCHIO, and MRAAR. Through comprehensive testing on 28 protein structures with diverse space groups, solvent contents ranging from 55% to 78%, resolutions spanning 1.46 to 3.2 Å, and PDB-reported R-work below 0.22, we quantify performance gains under different strategies. For structures with favorable conditions, accurate diffraction data, and suitable molecular packing, our methods successfully phase structures and extend applicability to the lower boundary of the high-solvent-content range. Linear regression analysis suggests the potential applicability boundary reaches approximately 55% solvent content, compared with traditional approaches that typically require solvent contents above 65%. This extends the accessible pool from 9.6% (structures > 65%) to potentially 32.3% (structures > 55%) of PDB structures (Appendix A, Figure A1) when data quality and structural characteristics permit, though success rates decrease substantially as solvent content approaches this lower limit.
While the methods presented extend the practical boundaries of direct phasing, we acknowledge their scope and limitations. Success remains dependent on adequate solvent content and sufficient data quality. Although the GA strategy enhances success rates, it requires greater computational resources, and its effectiveness depends on at least one individual achieving convergence. This does not eliminate the dependence of phase recovery on solvent content constraints. For crystals with solvent content approaching or below 55%, integration with higher-quality diffraction data, complementary experimental information, or additional physical constraints will likely remain necessary. Nevertheless, this work provides more powerful computational tools and mechanistic understanding for addressing challenging crystallographic problems, advancing model-free phase retrieval methods toward broader practical applications in structural biology.

2. Materials and Methods

2.1. Overall Workflow and Key Operations in Direct Phasing

Direct phasing in protein crystallography cyclically enforces constraints in both real space and reciprocal space through iterative alternation between these two domains. Figure 1 illustrates the workflow of this process. The algorithm begins with structure factor amplitudes $|F_{\mathrm{obs}}(\mathbf{h})|$ obtained from experimentally measured diffraction intensities, where $\mathbf{h}$ denotes the reciprocal lattice indices. Since phase information is unavailable from the experiment, an initial set of random phases $\phi_{\mathrm{rand}}$ is generated and combined with the observed amplitudes to construct the initial complex structure factors: $|F_{\mathrm{obs}}(\mathbf{h})| \exp[i\phi_{\mathrm{rand}}(\mathbf{h})]$.
The algorithm then enters an iterative loop where constraints are applied to improve the phase estimates. At the k-th iteration, the current electron density $\rho_k(\mathbf{r})$ is first subjected to reciprocal space constraints. This begins with a Fourier transformation converting the real-space density into reciprocal space:
$$F_{\mathrm{cal}}(\mathbf{h}) = \int \rho(\mathbf{r}) \exp[2\pi i\, \mathbf{h}\cdot\mathbf{r}]\, d\mathbf{r}. \tag{1}$$
The experimental amplitude constraint is enforced by replacing the calculated amplitude $|F_{\mathrm{cal}}(\mathbf{h})|$ with the observed amplitude $|F_{\mathrm{obs}}(\mathbf{h})|$ while preserving the calculated phase $\phi_{\mathrm{cal}}(\mathbf{h})$, yielding an updated structure factor:
$$F_{\mathrm{cal}}(\mathbf{h}) = |F_{\mathrm{obs}}(\mathbf{h})| \exp[i\phi_{\mathrm{cal}}(\mathbf{h})]. \tag{2}$$
This amplitude replacement ensures that the electron density satisfies the experimental diffraction data at each iteration. The modified structure factor is then transformed back to real space through inverse Fourier transformation, producing an updated density:
$$\rho(\mathbf{r}) = V^{-1} \sum_{\mathbf{h}} F_{\mathrm{cal}}(\mathbf{h}) \exp[-2\pi i\, \mathbf{h}\cdot\mathbf{r}], \tag{3}$$
where $V$ represents the unit cell volume, and $\mathbf{r}$ denotes the real-space coordinate vector.
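The reciprocal-space step above (Fourier transform, amplitude replacement, inverse transform) can be sketched with NumPy's FFT. This is a minimal illustration under stated assumptions, not the authors' implementation: the density lives on a small P1 grid, the function name `project_amplitudes` is hypothetical, and NumPy's FFT uses the opposite sign convention and a 1/N normalization instead of 1/V, neither of which changes the amplitude-replacement logic.

```python
import numpy as np

def project_amplitudes(rho, f_obs):
    """Keep the calculated phases, impose the observed amplitudes, return to real space."""
    f_cal = np.fft.fftn(rho)                  # density -> calculated structure factors
    phases = np.angle(f_cal)                  # calculated phases phi_cal(h)
    f_new = f_obs * np.exp(1j * phases)       # amplitude replacement
    return np.fft.ifftn(f_new).real           # back to real space

# Toy data (assumption): a random "true" density supplies the observed amplitudes.
rng = np.random.default_rng(0)
rho_true = rng.random((8, 8, 8))
f_obs = np.abs(np.fft.fftn(rho_true))

# Random starting phases taken from a random real map, so Friedel symmetry holds.
phi_rand = np.angle(np.fft.fftn(rng.standard_normal((8, 8, 8))))
rho0 = np.fft.ifftn(f_obs * np.exp(1j * phi_rand)).real

# Any real-space change breaks the amplitude constraint; the projection restores it.
rho_mod = np.clip(rho0, 0.0, None)
rho1 = project_amplitudes(rho_mod, f_obs)
```

After the projection, the Fourier amplitudes of `rho1` agree with `f_obs` to rounding error, while its phases are those of the modified map.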
Following reciprocal space operations, constraints are applied in real space through a procedure comprising two components: molecular envelope reconstruction and density modification based on the reconstructed envelope. The envelope reconstruction step aims to identify the approximate spatial region occupied by the protein molecule, the envelope domain $S$, from the current electron density $\rho(\mathbf{r})$, which may still contain inaccuracies. This is achieved by computing a weighted average density $w(\mathbf{r})$ through Gaussian low-pass filtering of $\rho(\mathbf{r})$:
$$w(\mathbf{r}) = \int \rho(\mathbf{r}') \, G(|\mathbf{r} - \mathbf{r}'|; \sigma)\, d\mathbf{r}'. \tag{4}$$
In this expression, $G(r; \sigma)$ denotes a Gaussian kernel with standard deviation $\sigma$ that smooths local density fluctuations and emphasizes the contiguous regions of macromolecular structure. Using the estimated solvent content $x_{\mathrm{solv}}$ derived from the Matthews coefficient [36] based on unit cell parameters and protein molecular weight, a threshold value $w_{\mathrm{cutoff}}$ is determined from the distribution of $w(\mathbf{r})$ values throughout the unit cell. Regions satisfying the condition $w(\mathbf{r}) > w_{\mathrm{cutoff}}$ are then assigned to the protein region $S$, although this initial envelope remains coarse. Depending on whether a classical or continuous iterative projection algorithm is employed, a transition region $T$ may be defined at the protein-solvent interface. Once the envelope domain $S$ for the current iteration has been established, which may include the transition region $T$ for continuous algorithms, the density modification rules of the chosen algorithm are applied to $\rho(\mathbf{r})$ to generate the updated density $\rho_{k+1}(\mathbf{r})$ for the subsequent iteration. This density modification step is the core mathematical operation of the algorithm and enforces physical constraints: the protein region density should conform to expected statistical distributions, achieved through histogram matching [33], while the solvent region density should approach a constant value [37].
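The envelope step described above can be illustrated as follows. The sketch assumes a cubic grid with isotropic voxels, applies the Gaussian filter as a periodic convolution in Fourier space, and picks the cutoff as the quantile of $w$ that leaves a fraction $1 - x_{\mathrm{solv}}$ of grid points above it. The function names and the toy density are hypothetical, and the quantile rule is one plausible reading of "determined from the distribution of $w$ values".

```python
import numpy as np

def gaussian_smooth_periodic(rho, sigma_vox):
    """Periodic Gaussian low-pass filter; sigma is given in voxels."""
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in rho.shape], indexing="ij")
    k2 = sum(f ** 2 for f in freqs)                    # squared frequency, cycles/voxel
    kernel_ft = np.exp(-2.0 * (np.pi * sigma_vox) ** 2 * k2)
    return np.fft.ifftn(np.fft.fftn(rho) * kernel_ft).real

def envelope_mask(rho, sigma_vox, solvent_fraction):
    """Return (mask, w): mask is True where w exceeds the solvent-content cutoff."""
    w = gaussian_smooth_periodic(rho, sigma_vox)
    w_cutoff = np.quantile(w, solvent_fraction)        # lowest x_solv of points = solvent
    return w > w_cutoff, w

# Toy map (assumption): a dense 8x8x8 block in a 16x16x16 cell plus weak noise.
rng = np.random.default_rng(0)
rho = 0.05 * rng.standard_normal((16, 16, 16))
rho[4:12, 4:12, 4:12] += 1.0
solvent_fraction = 1.0 - 8 ** 3 / 16 ** 3
mask, w = envelope_mask(rho, sigma_vox=2.5, solvent_fraction=solvent_fraction)
```

Because the cutoff is a quantile, the envelope volume tracks the estimated solvent content regardless of the absolute density scale.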
The iterative cycle repeats for several thousand iterations until convergence is achieved. Throughout this process, multiple quality metrics are computed to monitor progress and evaluate convergence. These include the working R-factor $R_{\mathrm{work}}$, the free R-factor $R_{\mathrm{free}}$, the density deviations $\Delta\rho$ in protein and bulk solvent regions, and, when a reference structure is available, the mean phase error $\Delta\phi$ and the intersection-over-union ratio of the envelope. Successful convergence is characterized by the reduction of both $R_{\mathrm{work}}$ and $R_{\mathrm{free}}$ to stable low values, accompanied by improvements in the other monitored metrics. The result of this iterative refinement is an electron density map representing the protein structure. This density map can be utilized by automated model-building software packages such as ARP/wARP version 8.0 [38,39], Buccaneer [40] from the CCP4 software suite version 9 [41], or Phenix (version 1.16-3549) AutoBuild [42,43] to construct and refine the atomic model, producing a three-dimensional protein structure.
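This section does not spell out the residual formula, so the sketch below uses the common crystallographic definition $R = \sum \big| |F_{\mathrm{obs}}| - |F_{\mathrm{cal}}| \big| / \sum |F_{\mathrm{obs}}|$, evaluated over the working and free reflection sets. The 5% free-set size, the absence of a scale factor, and the function name `r_factor` are assumptions made for illustration only.

```python
import numpy as np

def r_factor(f_obs, f_cal_amp, select):
    """R = sum| |F_obs| - |F_cal| | / sum|F_obs| over the selected reflections."""
    diff = np.abs(f_obs[select] - f_cal_amp[select])
    return diff.sum() / f_obs[select].sum()

# Toy amplitudes (assumption): the model underestimates every amplitude by 10%.
rng = np.random.default_rng(1)
f_obs = rng.random(1000) + 0.1
f_model = 0.9 * f_obs

free = rng.random(1000) < 0.05       # hypothetical free set, about 5% of reflections
r_work = r_factor(f_obs, f_model, ~free)
r_free = r_factor(f_obs, f_model, free)
```

In this toy case both residuals equal 0.1 because the error is uniform. In a real run the free reflections are excluded from the amplitude constraint, so $R_{\mathrm{free}}$ is the more honest indicator of convergence.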

2.2. Mathematical Framework and Continuous Modification of Iterative Projection Algorithms

The mathematical foundation of iterative projection algorithms centers on identifying a solution that satisfies two constraint sets. In crystallographic direct phasing, these are the real-space constraints $A$, which encompass the statistical characteristics of protein density distribution and the uniform nature of solvent regions, and the reciprocal-space constraints $B$, which represent the experimentally measured diffraction amplitudes. The objective is to refine the electron density $\rho$ through repeated application of projection operators $P_A$ and $P_B$ until it converges to a state satisfying both constraint sets.

2.2.1. Partitioned Update Form of Classical Iterative Projection Algorithms

Classical iterative projection algorithms such as the Hybrid Input–Output (HIO) algorithm are implemented using a partitioned update strategy adapted to protein crystals. This approach relies on the premise that at the k-th iteration, a molecular envelope domain $S_k$ can be estimated from the current electron density $\rho_k(\mathbf{r})$. The projection operator $P_A$ incorporates the following: within this envelope domain $S_k$, constraints appropriate to protein density are enforced through histogram matching procedures, while in the region outside $S_k$ corresponding to the bulk solvent, a constant-density constraint is applied through solvent flattening operations.
The HIO algorithm [12] exemplifies this partitioned approach, with its discretized form for protein crystallography expressed as
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} P_A P_B \rho_k(\mathbf{r}), & \mathbf{r} \in S_k, \\ \rho_k(\mathbf{r}) - \beta P_B \rho_k(\mathbf{r}), & \mathbf{r} \notin S_k, \end{cases} \tag{5}$$
where $\beta$ represents the negative feedback factor applied in the solvent region, typically assigned values between 0.7 and 0.9, with $\beta = 0.75$ used throughout this study. In this formulation, $P_B$ denotes the reciprocal-space projection corresponding to the amplitude replacement operation defined in Equation (2), while $P_A$ represents the real-space constraint operations that include histogram matching within the protein region and solvent flattening in the solvent region. We have inserted $P_A$ into the partitioned form of HIO. This formulation reveals the origin of the discontinuity problem: at the envelope boundary, the density update rule abruptly transitions from $P_A P_B \rho_k$ inside the protein region to $\rho_k - \beta P_B \rho_k$ in the solvent region, potentially introducing a discontinuous step change in the updated density $\rho_{k+1}$ across this boundary.
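To make the step change concrete, the partitioned HIO rule can be applied to a flat 1D profile that crosses the envelope boundary. This is a sketch only: `hio_update` is a hypothetical helper, and histogram matching is not implemented, so $P_A P_B \rho_k$ is passed in by the caller and here simply set equal to $P_B \rho_k$.

```python
import numpy as np

def hio_update(rho_k, pb_rho, pa_pb_rho, protein_mask, beta=0.75):
    """Partitioned HIO: P_A P_B rho_k inside S_k, rho_k - beta * P_B rho_k outside."""
    return np.where(protein_mask, pa_pb_rho, rho_k - beta * pb_rho)

# 1D toy profile; the envelope boundary lies between index 2 and index 3.
rho_k = np.full(6, 1.0)
pb_rho = np.full(6, 0.8)
protein_mask = np.array([True, True, True, False, False, False])
pa_pb_rho = pb_rho.copy()   # placeholder (assumption): histogram matching omitted

rho_next = hio_update(rho_k, pb_rho, pa_pb_rho, protein_mask)
```

Although both inputs are constant across the boundary, `rho_next` jumps from 0.8 inside the envelope to 0.4 outside it (1.0 - 0.75 × 0.8), which is the discontinuity discussed above.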

2.2.2. Continuous Iterative Projection Algorithms: Introduction of a Transition Region

To address the discontinuity problem in classical algorithms, continuous iterative projection algorithms introduce a transition region $T_k$ at the k-th iteration, positioned at the interface between the protein region $S_k$ and the bulk solvent region. This partitioning divides real space into three zones: the protein core region $S_k$, the interfacial transition region $T_k$, and the bulk solvent region. The objective of these continuous algorithms is to implement a smooth update strategy within the transition region $T_k$ that provides gradual and continuous modulation between the density modification rules applied in the protein and solvent regions. We implemented and evaluated four continuous algorithms that achieve this objective through different mathematical approaches. We have updated the original forms of those algorithms to make them suitable for protein crystallography.
The Continuous Hybrid Input–Output (CHIO) algorithm [34] extends the framework of HIO by introducing intermediate feedback behavior in the transition region. Its update rule is formulated as
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} P_A P_B \rho_k(\mathbf{r}), & \mathbf{r} \in S_k, \\ \rho_k(\mathbf{r}) - \gamma P_B \rho_k(\mathbf{r}), & \mathbf{r} \in T_k, \\ \rho_k(\mathbf{r}) - \beta P_B \rho_k(\mathbf{r}), & \mathbf{r} \notin S_k \cup T_k, \end{cases}$$
where the transition feedback parameter $\gamma$ is defined as $\gamma = (1 - \alpha)/\alpha$. In the original CHIO formulation [34], the parameter $\alpha$ is typically set to 0.4, which yields $\gamma = 1.5$ and results in negative feedback that is stronger in the transition region than in the bulk solvent region where $\beta = 0.75$.
The Hybrid Projection Reflection (HPR) algorithm [35] employs a reflection-projection framework and can be expressed in partitioned form as
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} P_A P_B \rho_k(\mathbf{r}), & \mathbf{r} \in S_k, \\ \rho_k(\mathbf{r}) - \beta_{\mathrm{HPR}} P_B \rho_k(\mathbf{r}), & \mathbf{r} \notin S_k, \end{cases}$$
where $\beta_{\mathrm{HPR}}$ is an algorithm-specific parameter set to 0.588 in the original formulation [35]. In typical partitioned implementations, the HPR algorithm applies the same feedback factor uniformly across both the transition region $T_k$ and the bulk solvent region, with its continuity manifesting through consistent treatment of the entire region outside the protein core $S_k$.
Based on our analysis of CHIO and HPR algorithms, we developed the modified CHIO (MCHIO) algorithm to address a limitation. We observed that the relatively strong feedback in CHIO’s transition region, with $\gamma = 1.5$, might be excessive when $T_k$ contains portions of protein side-chain density, potentially leading to inappropriate suppression of structural features. To address this, we propose a modified update rule:
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} P_A P_B \rho_k(\mathbf{r}), & \mathbf{r} \in S_k, \\ \rho_k(\mathbf{r}) - \gamma_M P_B \rho_k(\mathbf{r}), & \mathbf{r} \in T_k, \\ \rho_k(\mathbf{r}) - \beta P_B \rho_k(\mathbf{r}), & \mathbf{r} \notin S_k \cup T_k, \end{cases}$$
where $\gamma_M$ represents the adjusted feedback factor for the transition region. To achieve more appropriate behavior that is gentler than bulk solvent treatment while effectively leveraging the solvent constraint, we set $\gamma_M = 0.5$ throughout this study. This value is chosen to be lower than both $\beta = 0.75$ and $\beta_{\mathrm{HPR}} = 0.588$, thereby implementing a gentler, more conservative density modification in the transition region compared with the bulk solvent. This adjustment aims to avoid excessive suppression of potential protein density features that might be incorrectly assigned to the transition region $T_k$, while still effectively applying the solvent-like constraint.
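CHIO and MCHIO share the same three-zone structure and differ only in the transition feedback factor, which the following sketch makes explicit. The helper names are hypothetical, histogram matching is again replaced by a placeholder, and the 1D profile is a toy input chosen only to compare $\gamma = 1.5$ with $\gamma_M = 0.5$.

```python
import numpy as np

def chio_gamma(alpha):
    """Transition feedback of the original CHIO: gamma = (1 - alpha) / alpha."""
    return (1.0 - alpha) / alpha

def three_zone_update(rho_k, pb_rho, pa_pb_rho, core, transition, beta=0.75, gamma=0.5):
    """Core: P_A P_B rho_k; transition: rho_k - gamma * P_B rho_k; bulk solvent: rho_k - beta * P_B rho_k."""
    in_transition = rho_k - gamma * pb_rho
    in_solvent = rho_k - beta * pb_rho
    return np.where(core, pa_pb_rho, np.where(transition, in_transition, in_solvent))

# 1D toy profile: two core points, two transition points, two bulk-solvent points.
rho_k = np.full(6, 1.0)
pb_rho = np.full(6, 0.8)
pa_pb_rho = pb_rho.copy()   # placeholder (assumption): histogram matching omitted
core = np.array([True, True, False, False, False, False])
transition = np.array([False, False, True, True, False, False])

rho_chio = three_zone_update(rho_k, pb_rho, pa_pb_rho, core, transition, gamma=chio_gamma(0.4))
rho_mchio = three_zone_update(rho_k, pb_rho, pa_pb_rho, core, transition, gamma=0.5)
```

On this profile the MCHIO output steps down monotonically from core to transition to solvent (0.8, 0.6, 0.4), whereas the CHIO setting drives the transition points below the bulk-solvent value (to about -0.2), illustrating why a smaller $\gamma_M$ is gentler on density near the boundary.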
The Transition Hybrid Input–Output (THIO) algorithm, introduced in our previous research [30], implements a density-weighted assignment approach within the transition region. Its updated formulation is
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} P_A P_B \rho_k(\mathbf{r}), & \mathbf{r} \in S_k, \\ \omega_k(\mathbf{r})\, P_A P_B \rho_k(\mathbf{r}) + [1 - \omega_k(\mathbf{r})]\,[\rho_k(\mathbf{r}) - \beta P_B \rho_k(\mathbf{r})], & \mathbf{r} \in T_k, \\ \rho_k(\mathbf{r}) - \beta P_B \rho_k(\mathbf{r}), & \mathbf{r} \notin S_k \cup T_k, \end{cases}$$
where the weight function $\omega_k(\mathbf{r})$ takes values in $[0, 1]$ and is computed based on the weighted average density at each grid point $\mathbf{r}$ using a small Gaussian kernel radius of approximately $\sigma = 1.5$ Å. This weight represents the local probability that a given position belongs to the protein rather than solvent. Through this linear interpolation scheme, THIO achieves continuous and physically motivated density modulation throughout the transition region, with the update rule smoothly varying based on local density characteristics.
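The THIO interpolation can be sketched in the same style. The mapping from smoothed density to the weight $\omega_k$ is not given in this section, so the weights are supplied directly as an input here (an assumption), and `thio_update` is a hypothetical helper rather than the published code.

```python
import numpy as np

def thio_update(rho_k, pb_rho, pa_pb_rho, core, transition, omega, beta=0.75):
    """Blend the protein rule and the solvent rule in T_k with weight omega in [0, 1]."""
    in_solvent = rho_k - beta * pb_rho
    blended = omega * pa_pb_rho + (1.0 - omega) * in_solvent
    return np.where(core, pa_pb_rho, np.where(transition, blended, in_solvent))

# 1D toy profile with weights that fall from protein-like to solvent-like.
rho_k = np.full(6, 1.0)
pb_rho = np.full(6, 0.8)
pa_pb_rho = pb_rho.copy()   # placeholder (assumption): histogram matching omitted
core = np.array([True, True, False, False, False, False])
transition = np.array([False, False, True, True, False, False])
omega = np.array([1.0, 1.0, 0.75, 0.25, 0.0, 0.0])   # illustrative weights only

rho_thio = thio_update(rho_k, pb_rho, pa_pb_rho, core, transition, omega)
```

With these weights the update moves through 0.8, 0.7, 0.5, 0.4 across the boundary, so the single step of the classical rule is replaced by a graded ramp.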
The fundamental concept unifying all these continuous algorithms is the mathematical refinement of electron density update behavior across the molecular boundary to achieve smoother transitions. This enhancement is designed to improve numerical stability and increase the likelihood of convergence when dealing with ambiguous boundaries and limited solvent content. Figure 2 provides a schematic comparison of the update rules employed by these algorithms, illustrating their distinctive characteristics and the progressive refinement of boundary treatment strategies.

2.2.3. Classical Iterative Projection Algorithms and Improved Variants

The two-step refined envelope reconstruction scheme demonstrated in the subsequent section exhibits universal applicability, providing performance enhancement for both continuous and classical iterative projection algorithms. To establish a comprehensive algorithmic framework, we evaluated classical iterative projection algorithms, including HIO, DM, ASR, and RAAR, and developed improved variants, MDM and MRAAR, optimized for protein crystallography. The HIO algorithm has been presented in the previous section, following the update rule in Equation (5).
The Difference Map (DM) algorithm [13] generates two candidate solutions at the k-th iteration:
$$\rho_{k,A} = P_A\left[(1 + \beta_{\mathrm{DM}}^{-1})\, P_B \rho_k - \beta_{\mathrm{DM}}^{-1} \rho_k\right],$$
$$\rho_{k,B} = P_B\left[(1 - \beta_{\mathrm{DM}}^{-1})\, P_A \rho_k + \beta_{\mathrm{DM}}^{-1} \rho_k\right].$$
These solutions satisfy real-space constraint set $A$ and reciprocal-space constraint set $B$, respectively. The DM iteration formula is
$$\rho_{k+1} = \rho_k + \beta_{\mathrm{DM}}\,(\rho_{k,A} - \rho_{k,B}),$$
where $\beta_{\mathrm{DM}}$ is the relaxation parameter (typically 0.5 to 1.0, set to 0.75 in this study) controlling convergence speed and search capability. Convergence is achieved when the two candidate solutions become equal. Considering partitioned updates for protein region $S_k$ and solvent region, the DM formula can be expressed as
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} \rho_k(\mathbf{r}) + \beta_{\mathrm{DM}}\,[\rho_{k,A}(\mathbf{r}) - \rho_{k,B}(\mathbf{r})], & \mathbf{r} \in S_k, \\ \rho_k(\mathbf{r}) - \beta_{\mathrm{DM}}\, \rho_{k,B}(\mathbf{r}), & \mathbf{r} \notin S_k. \end{cases}$$
Experimental results revealed that DM exhibits low success rates for protein phase retrieval, primarily due to insufficient constraint enforcement within the protein region during iteration. To address this, we propose an improved variant that explicitly applies constraint operators $P_A$ and $P_B$ to the protein region:
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} P_A P_B \rho_k(\mathbf{r}) + \beta_{\mathrm{DM}}\,[\rho_{k,A}(\mathbf{r}) - \rho_{k,B}(\mathbf{r})], & \mathbf{r} \in S_k, \\ \rho_k(\mathbf{r}) - \beta_{\mathrm{DM}}\, \rho_{k,B}(\mathbf{r}), & \mathbf{r} \notin S_k. \end{cases}$$
This improved formulation, designated Modified DM (MDM), strengthens constraint application in the protein region and enhances phase retrieval performance.
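The DM and MDM rules can be exercised on a small 1D toy problem. The sketch below is illustrative only: `P_A` is a stand-in (positivity inside a fixed envelope, zero outside) rather than histogram matching, `P_B` is amplitude replacement with a 1D FFT, and all function names are hypothetical. It checks one property that follows directly from the formulas: a density satisfying both constraints is left unchanged by either update.

```python
import numpy as np

def dm_candidates(rho, P_A, P_B, beta=0.75):
    """Return the two DM candidate solutions rho_{k,A} and rho_{k,B}."""
    rho_A = P_A((1 + 1 / beta) * P_B(rho) - rho / beta)
    rho_B = P_B((1 - 1 / beta) * P_A(rho) + rho / beta)
    return rho_A, rho_B

def dm_update(rho, rho_A, rho_B, mask, beta=0.75):
    """Partitioned DM update: difference step inside S_k, feedback on rho_B outside."""
    return np.where(mask, rho + beta * (rho_A - rho_B), rho - beta * rho_B)

def mdm_update(rho, pa_pb_rho, rho_A, rho_B, mask, beta=0.75):
    """Partitioned MDM update: P_A P_B rho_k replaces rho_k in the protein branch."""
    return np.where(mask, pa_pb_rho + beta * (rho_A - rho_B), rho - beta * rho_B)

# Toy 1D "crystal" (assumption): three occupied grid points inside a known envelope.
rho_true = np.array([0.0, 0.0, 1.0, 2.0, 1.5, 0.0, 0.0, 0.0])
mask = rho_true > 0
f_obs = np.abs(np.fft.fft(rho_true))

def P_A(x):
    # Stand-in real-space constraint: non-negative inside the envelope, zero outside.
    return np.where(mask, np.clip(x, 0.0, None), 0.0)

def P_B(x):
    # Amplitude replacement with the calculated phases retained.
    f = np.fft.fft(x)
    return np.fft.ifft(f_obs * np.exp(1j * np.angle(f))).real

rho_A, rho_B = dm_candidates(rho_true, P_A, P_B)
rho_dm = dm_update(rho_true, rho_A, rho_B, mask)
rho_mdm = mdm_update(rho_true, P_A(P_B(rho_true)), rho_A, rho_B, mask)
```

At the true solution both candidates coincide with `rho_true`, so the difference term vanishes and both updates return the input. Away from the solution, the extra $P_A P_B$ in the MDM protein branch is what enforces the constraints more strongly.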
The Averaged Successive Reflections (ASR) algorithm [44] employs reflection operators $R_A = (2P_A - I)$ and $R_B = (2P_B - I)$, where $I$ is the identity operator. The ASR update rule is
$$\rho_{k+1}(\mathbf{r}) = \tfrac{1}{2}\,(R_A R_B + I)\, \rho_k(\mathbf{r}).$$
Substituting the reflection operators and considering partitioned updates yields
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} \rho_k(\mathbf{r}) + (2 P_A P_B - P_A - P_B)\, \rho_k(\mathbf{r}), & \mathbf{r} \in S_k, \\ \rho_k(\mathbf{r}) - P_B \rho_k(\mathbf{r}), & \mathbf{r} \notin S_k. \end{cases}$$
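The intermediate algebra behind the partitioned ASR form can be written out as below. This is a sketch that expands the operator products formally, as if $P_A$ and $P_B$ distributed over sums (an assumption, since the projections are nonlinear in general), and that takes $P_A$ to return zero density outside $S_k$, which is what the solvent branch of the partitioned form implies.

```latex
\begin{aligned}
\tfrac{1}{2}\,(R_A R_B + I)
  &= \tfrac{1}{2}\,\bigl[(2P_A - I)(2P_B - I) + I\bigr] \\
  &= \tfrac{1}{2}\,\bigl[4 P_A P_B - 2 P_A - 2 P_B + 2 I\bigr] \\
  &= I + 2 P_A P_B - P_A - P_B , \\[4pt]
\text{and for } \mathbf{r} \notin S_k \ (P_A \to 0): \quad
\tfrac{1}{2}\,(R_A R_B + I) &\to I - P_B .
\end{aligned}
```

Applying the last two lines to $\rho_k$ gives the protein and solvent branches of the partitioned update, respectively.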
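The Relaxed Averaged Alternating Reflections (RAAR) algorithm [45] introduces a relaxation parameter $\beta_{\mathrm{RAAR}}$ (typically 0.95) with the update rule:
$$\rho_{k+1}(\mathbf{r}) = \left[\tfrac{1}{2}\,\beta_{\mathrm{RAAR}}\,(R_A R_B + I) + (1 - \beta_{\mathrm{RAAR}})\, P_B\right] \rho_k(\mathbf{r}).$$
In partitioned form, this becomes
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} \beta_{\mathrm{RAAR}}\, \rho_k(\mathbf{r}) + (2\beta_{\mathrm{RAAR}} P_A P_B - \beta_{\mathrm{RAAR}} P_A - 2\beta_{\mathrm{RAAR}} P_B + P_B)\, \rho_k(\mathbf{r}), & \mathbf{r} \in S_k, \\ \beta_{\mathrm{RAAR}}\, \rho_k(\mathbf{r}) - (2\beta_{\mathrm{RAAR}} - 1)\, P_B \rho_k(\mathbf{r}), & \mathbf{r} \notin S_k. \end{cases}$$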
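The RAAR partitioned form follows from the same formal expansion used for ASR, again treating the operator products as if they distributed over sums and taking $P_A \to 0$ outside $S_k$ (both assumptions inferred from the partitioned form, not statements from the original derivation). Writing $\beta$ for $\beta_{\mathrm{RAAR}}$:

```latex
\begin{aligned}
\tfrac{1}{2}\,\beta\,(R_A R_B + I) + (1 - \beta)\,P_B
  &= \beta\,\bigl(I + 2 P_A P_B - P_A - P_B\bigr) + (1 - \beta)\,P_B \\
  &= \beta I + 2\beta P_A P_B - \beta P_A - 2\beta P_B + P_B , \\[4pt]
\text{and for } \mathbf{r} \notin S_k \ (P_A \to 0): \quad
  &\phantom{=}\ \beta I - (2\beta - 1)\,P_B .
\end{aligned}
```

Setting $\beta = 1$ in either line recovers the corresponding ASR branch, $I + 2P_A P_B - P_A - P_B$ inside $S_k$ and $I - P_B$ outside it.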
Both ASR and RAAR exhibited low success rates in our experiments, attributable to insufficient constraint enforcement in the protein region. We developed an improved variant by setting $\beta_{\mathrm{RAAR}} = 1$, which reduces RAAR to the ASR formulation, then explicitly applying operators $P_A P_B$ to the protein region:
$$\rho_{k+1}(\mathbf{r}) = \begin{cases} P_A P_B \rho_k(\mathbf{r}) + (2 P_A P_B - P_A - P_B)\, \rho_k(\mathbf{r}), & \mathbf{r} \in S_k, \\ \rho_k(\mathbf{r}) - P_B \rho_k(\mathbf{r}), & \mathbf{r} \notin S_k. \end{cases}$$
This improved algorithm, designated Modified RAAR (MRAAR) or Modified ASR (MASR), enhances performance. This formulation is mathematically equivalent to the Hybrid Difference Map - formula 1 (HDM-f1) algorithm developed in our concurrent work [46], which was derived from a different starting point through modification of the DM framework, representing a convergence of approaches toward optimal algorithm design. The improved classical algorithms, when combined with the two-step refined envelope reconstruction scheme, expand the available toolkit for direct phasing of protein crystals.

2.3. Envelope Reconstruction Strategies: From One-Step Coarse Design to Two-Step Refined Design

The effectiveness of iterative projection algorithms, in particular their capacity to enforce constraints within solvent regions, depends on the accuracy of the reconstructed protein molecular envelope. Traditional envelope reconstruction employs a single-step approach in which the k-th iteration’s electron density $\rho_k(\mathbf{r})$ is convolved with a Gaussian kernel, usually featuring a large radius such as $\sigma_1 \approx 2.5$–$4.0$ Å, to produce a smoothed weighted average density map $w_k(\mathbf{r})$. Based on the estimated solvent content $x_{\mathrm{solv}}$ derived from the Matthews coefficient [36], a threshold value $w_{\mathrm{cutoff}}$ is determined from the statistical distribution of $w_k(\mathbf{r})$ values. Grid points satisfying the condition $w_k(\mathbf{r}) > w_{\mathrm{cutoff}}$ are classified as belonging to the protein region, while the remaining volume is the solvent region. This method offers computational simplicity and can rapidly delineate the general molecular shape during early iteration stages. However, it suffers from limitations that become problematic for challenging structures. The large-scale smoothing operation erases fine structural details at the molecular surface, resulting in boundaries that are rough and imprecise. The resulting coarse envelope tends to be over-expanded to ensure complete enclosure of the protein molecule, and therefore incorporates volumes of solvent molecules that are situated within surface crevices, pockets, and internal channels. In classical iterative algorithms, these incorrectly assigned solvent regions are treated as protein density throughout the modification process, failing to exploit the constant-density constraint that these regions would otherwise provide. This inefficient use of available information becomes detrimental when the overall solvent content is limited.
To overcome these deficiencies and maximize the utilization of all available solvent constraints within the crystal unit cell, we developed a two-step refined envelope reconstruction scheme. The concept underlying this scheme is a strategy that proceeds from coarse localization to fine-scale local refinement. This approach provides a more physically rational foundation for defining the transition region in continuous algorithms and functions as a universal technique capable of enhancing the performance of all iterative projection algorithms.
The first step focuses on coarse envelope generation following an approach similar to the traditional single-step method. A Gaussian kernel with a large radius, set to $\sigma_1 = 2.5$ Å in this study, is applied to smooth the k-th iteration’s electron density $\rho_k(\mathbf{r})$, yielding a globally smoothed density distribution $w_k^{(1)}(\mathbf{r})$. Using the estimated solvent content $x_{\mathrm{solv}}$, a threshold value $w_{\mathrm{cutoff}}^{(1)}$ is determined to define an initial coarse envelope $M_{\mathrm{coarse}}$ that serves as the protein candidate region:
$$ M_{\mathrm{coarse}} = \{ \mathbf{r} \mid w_k^{(1)}(\mathbf{r}) > w_{\mathrm{cutoff}}^{(1)} \}. $$
The objective of this initial step is to capture the protein’s center of mass and overall molecular shape while avoiding artificial fragmentation of the envelope that could arise from localized density fluctuations. By employing a large smoothing kernel, this step ensures robust identification of the main protein volume even when the current density estimate contains noise or systematic errors.
The second step is the refinement stage that distinguishes our scheme from traditional approaches. Here, a second smoothing operation is performed on the same electron density $\rho_k(\mathbf{r})$ using a Gaussian kernel with a smaller radius, set to $\sigma_2 = 1.5$ Å in this work, which produces a locally refined density map $w_k^{(2)}(\mathbf{r})$ that preserves fine-scale density variations. All subsequent operations in this step are confined to the interior of the coarse envelope $M_{\mathrm{coarse}}$ established in the first step. Within this domain, the values of $w_k^{(2)}(\mathbf{r})$ at all grid points are ranked by magnitude, and grid points in the lowest approximately 5% (of the asymmetric unit) are identified as the transition region $T_k$, the points most likely to represent trapped solvent rather than protein density. This 5% threshold was determined empirically from our benchmark set and proved robust across diverse protein structures. For truly novel structures without homologous or predicted references, this value can be adjusted within a range of 3–7% through systematic trials, as the method tolerates moderate variations in this parameter. The identified low-density regions are then removed from the protein candidate volume to yield refined spatial assignments. We define the refined protein core region as $S_k = M_{\mathrm{coarse}} \setminus T_k$, representing the volume that remains after removing the identified solvent-like regions from the coarse envelope. The refined transition region is defined as
$$ T_k = \{ \mathbf{r} \in M_{\mathrm{coarse}} \mid w_k^{(2)}(\mathbf{r}) \text{ is within the lowest 5\% percentile} \}. $$
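The two steps described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' Clipper/MPI implementation: it assumes an orthogonal unit cell sampled on a regular grid, treats the whole array as the asymmetric unit (no symmetry handling), and performs the Gaussian smoothing as a periodic FFT convolution. The function names `gaussian_smooth` and `two_step_envelope` are hypothetical.

```python
import numpy as np

def gaussian_smooth(rho, sigma, spacing=1.0):
    """Periodic Gaussian smoothing via FFT (assumes an orthogonal cell).

    sigma and spacing are in Angstrom. The Fourier transform of a unit-area
    Gaussian of width sigma is exp(-2 * (pi * sigma * s)^2).
    """
    freqs = [np.fft.fftfreq(n, d=spacing) for n in rho.shape]
    sx, sy, sz = np.meshgrid(*freqs, indexing="ij")
    s2 = sx**2 + sy**2 + sz**2
    kernel_ft = np.exp(-2.0 * (np.pi * sigma) ** 2 * s2)
    return np.fft.ifftn(np.fft.fftn(rho) * kernel_ft).real

def two_step_envelope(rho, x_solv, sigma1=2.5, sigma2=1.5,
                      strip_frac=0.05, spacing=1.0):
    """Return boolean masks (coarse, transition, core) for a density grid."""
    # Step 1: coarse envelope. The cutoff is chosen so that a fraction
    # (1 - x_solv) of grid points lies above it.
    w1 = gaussian_smooth(rho, sigma1, spacing)
    coarse = w1 > np.quantile(w1, x_solv)

    # Step 2: inside the coarse envelope, rank the finely smoothed density
    # and strip the lowest points (strip_frac of ALL grid points, which here
    # stands in for the asymmetric unit).
    w2 = gaussian_smooth(rho, sigma2, spacing)
    n_strip = min(int(round(strip_frac * rho.size)), int(coarse.sum()))
    transition = np.zeros(rho.shape, dtype=bool)
    if n_strip > 0:
        inside = np.flatnonzero(coarse)
        order = np.argsort(w2.ravel()[inside])
        transition.flat[inside[order[:n_strip]]] = True

    core = coarse & ~transition  # S_k = M_coarse minus T_k
    return coarse, transition, core
```

In this sketch the bulk solvent mask is simply `~core`, that is, everything outside the coarse envelope plus the stripped transition points.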
The bulk solvent region comprises both the external volume lying outside M coarse and the internal stripped region T k , reclassified based on low local density. The transition region identifies trapped solvent molecules within surface cavities and internal channels, enabling the application of solvent-like constraints during iteration. This enhances convergence for crystals with limited solvent content by maximizing utilization of available constraint information. However, although physically consisting of trapped solvent, the transition region exhibits slightly non-uniform density due to proximity to the protein surface. Negative feedback applied during iteration drives density toward uniform values, deviating from actual surface solvent density. This slightly non-uniform density is incompatible with the strict uniform-density requirements of comprehensive solvent flattening [37] in final-stage refinement. Therefore, in the last 2000 iterations, the transition region is linearly reduced from 5% (of the asymmetric unit) to zero over 1500 iterations, followed by 500 iterations of solvent flattening applied uniformly to all trials. Removing the transition region eliminates density constraint errors from surface-proximal solvent and enables optimal uniform-density enforcement in true bulk solvent regions.
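The end-of-run schedule described above (a constant 5% transition region, a linear reduction to zero over 1500 iterations, then 500 iterations of pure solvent flattening) can be written as a small helper. This is a sketch under the assumption of a fixed run length of 10,000 iterations with no early termination; the function name and argument names are hypothetical.

```python
def transition_fraction(iteration, total=10_000, shrink_iters=1500,
                        flatten_iters=500, initial=0.05):
    """Fraction of the asymmetric unit assigned to the transition region.

    Constant at `initial`, then linearly reduced to zero over `shrink_iters`
    iterations, then zero for the final `flatten_iters` iterations.
    """
    shrink_start = total - shrink_iters - flatten_iters
    flatten_start = total - flatten_iters
    if iteration < shrink_start:
        return initial
    if iteration >= flatten_start:
        return 0.0
    progress = (iteration - shrink_start) / shrink_iters
    return initial * (1.0 - progress)
```

With the default arguments, the fraction stays at 5% up to iteration 8000, is halved by iteration 8750, and reaches zero at iteration 9500.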
This two-step scheme offers three advantages over traditional single-step methods. First, the small-radius filter is sensitive to local low-density features corresponding to surface cavities and internal channels, enabling accurate identification of trapped solvent. Second, the scheme maximizes constraint utilization by identifying trapped solvent within the coarse envelope. Although the transition region $T_k$ is spatially adjacent to protein, it exhibits the low-density characteristics of solvent. During iteration, this region is treated with solvent constraints, such as negative feedback in continuous algorithms, increasing the effective constraint volume. As discussed above, the transition region should be removed in the final refinement stage to enable strict solvent flattening. Third, for continuous algorithms, the transition region $T_k$ delineated through refined local density analysis is less likely to contain protein side-chain density than regions defined by traditional single-step approaches, improving the purity and effectiveness of the applied constraints.
Section 3.3 provides a direct visual comparison using protein structures 3rd5 [47] and 2fg0 [48] as examples, demonstrating that envelopes generated by our two-step method reconstruct finer and more accurate molecular surface details compared with those produced by the traditional one-step approach. Although multi-step refinement with progressively decreasing kernel sizes was tested, we found that the two-step approach represents an optimal balance between envelope accuracy and computational efficiency. Additional refinement steps beyond two did not improve phasing success rates, as both one-step and two-step envelopes represent approximations to the true molecular boundary, and the two-step scheme already provides sufficient accuracy for iterative projection algorithms to converge to correct solutions.

2.4. Phase Retrieval Strategies: From Full-Resolution to Genetic Co-Evolution

We implemented and compared three phasing strategies to search the solution space for correct phases. These strategies optimize data utilization and search organization to enhance success rate and computational efficiency.

2.4.1. Full-Resolution Phasing Strategy

The full-resolution approach represents the fundamental strategy. All available experimental diffraction data, excluding only a reserved free set for cross-validation, participate equally in the reciprocal-space amplitude constraint at every iteration. This strategy applies no preprocessing or weighting to diffraction data, requiring the algorithm to simultaneously satisfy constraints spanning the entire resolution range. The advantages are simplicity and straightforward implementation. However, for high-dimensional optimization problems, requiring immediate fitting of structural information across all spatial scales can create convergence difficulties and entrapment in local minima for structures with limited solvent content.

2.4.2. Resolution-Weighted Progressive Phasing Strategy

The resolution-weighted progressive strategy implements a hierarchical data utilization scheme prioritizing low-resolution diffraction information during initial iterations, then introducing high-resolution data as convergence progresses. This is implemented through a time-dependent weighting function applied to observed structure factor amplitudes:
$$ |F_{\mathrm{obs},w}(\mathbf{h})| = |F_{\mathrm{obs}}(\mathbf{h})| \cdot \exp\left[-2\left(\pi \sigma_w S(\mathbf{h})\right)^2\right], $$
where $S(\mathbf{h}) = 1/d(\mathbf{h})$ is the reciprocal of the resolution spacing, and $\sigma_w$ is a time-varying parameter controlling the filter bandwidth. Initially, $\sigma_w$ is set to 0.8–1.0 Å, attenuating high-resolution contributions. As iterations progress, $\sigma_w$ is reduced toward zero following a predetermined annealing schedule [31]. This progressive expansion from low to high spatial frequencies follows the logic of first establishing the global protein fold before resolving atomic positions, which helps establish correct low-resolution phase relationships that provide a stable foundation for convergence.
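As an illustration of the weighting above, the following sketch computes the Gaussian resolution weight, assuming the exponent carries a negative sign so that high-resolution terms are attenuated as the text states. The annealing schedule of ref. [31] is not reproduced here; `sigma_w_schedule` is a hypothetical linear ramp used only to make the example complete.

```python
import math

def resolution_weight(d_spacing, sigma_w):
    """Weight exp[-2 (pi * sigma_w * S)^2] with S = 1/d (d, sigma_w in Angstrom)."""
    s = 1.0 / d_spacing
    return math.exp(-2.0 * (math.pi * sigma_w * s) ** 2)

def sigma_w_schedule(iteration, n_anneal, sigma_start=1.0):
    """Hypothetical linear annealing of sigma_w from sigma_start to zero."""
    if iteration >= n_anneal:
        return 0.0
    return sigma_start * (1.0 - iteration / n_anneal)

def weighted_amplitudes(f_obs, d_spacings, sigma_w):
    """Apply the resolution weight to a list of observed amplitudes."""
    return [f * resolution_weight(d, sigma_w)
            for f, d in zip(f_obs, d_spacings)]
```

With $\sigma_w = 1.0$ Å, a reflection at 2 Å is weighted far more weakly than one at 10 Å, and with $\sigma_w = 0$ all amplitudes pass through unchanged, which recovers the full-resolution strategy.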

2.4.3. Genetic Algorithm-Enhanced Co-Evolution Phasing Strategy

To overcome limitations of independent random-start searches, we implemented a genetic algorithm co-evolution strategy using population-based intelligence. This strategy treats multiple parallel phase retrieval processes, typically 100 individuals, as an evolutionary population, with each individual’s electron density map representing an organism. By simulating biological evolution through selection, crossover, and mutation, this approach establishes information exchange mechanisms enabling high-quality density features to propagate throughout the population, enhancing global search capability.
The workflow proceeds through interconnected stages. Population initialization generates N individuals, each with random phases to ensure diversity. During independent evolution, all individuals execute a predetermined number of iterations in parallel using standard iterative projection algorithms, employing the envelope reconstruction schemes and resolution-weighted strategies described above. This stage is a local search within each individual’s neighborhood of the solution space. Periodically, typically every 100 iterations, the algorithm performs evaluation and selection. Individual fitness $f_i$ is quantified through
$$ f_i = \max\left(0, \frac{R_{\mathrm{thres}} - R_{\mathrm{work},i}}{R_{\mathrm{thres}} - R_{\min}}\right), $$
$$ R_{\mathrm{work}} = \frac{\sum_{\mathbf{h} \in \mathrm{work}} \left| |F_{\mathrm{obs}}(\mathbf{h})| - \lambda |F_{\mathrm{cal}}(\mathbf{h})| \right|}{\sum_{\mathbf{h} \in \mathrm{work}} |F_{\mathrm{obs}}(\mathbf{h})|}, $$
where $R_{\min}$ and $R_{\mathrm{avg}}$ represent the best and the average $R_{\mathrm{work}}$ within the population, and $R_{\mathrm{thres}} = R_{\mathrm{avg}} + (R_{\mathrm{avg}} - R_{\min})$. Individuals with higher fitness have a greater probability of being selected for genetic operations.
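The fitness evaluation can be sketched as follows, assuming $R_{\mathrm{thres}} = R_{\mathrm{avg}} + (R_{\mathrm{avg}} - R_{\min})$ and a fitness that falls linearly from 1 at $R_{\min}$ to 0 at $R_{\mathrm{thres}}$. The default scale factor in `r_factor` (ratio of amplitude sums) is an assumption made for the example, since the text does not state how $\lambda$ is obtained, and the guard for a population with identical R-factors is likewise not taken from the source.

```python
def r_factor(f_obs, f_cal, scale=None):
    """R = sum | |F_obs| - scale*|F_cal| | / sum |F_obs| over one reflection set."""
    if scale is None:
        scale = sum(f_obs) / sum(f_cal)  # assumed choice of global scale
    numerator = sum(abs(fo - scale * fc) for fo, fc in zip(f_obs, f_cal))
    return numerator / sum(f_obs)

def fitness(r_work_values):
    """Map each individual's R_work to a fitness in [0, 1]."""
    r_min = min(r_work_values)
    r_avg = sum(r_work_values) / len(r_work_values)
    r_thres = r_avg + (r_avg - r_min)
    span = r_thres - r_min
    if span == 0.0:
        # All individuals are equally good; treat them as equally fit.
        return [1.0] * len(r_work_values)
    return [max(0.0, (r_thres - r) / span) for r in r_work_values]
```

For a hypothetical population with $R_{\mathrm{work}}$ values 0.30, 0.50, 0.55, and 0.65, the average is 0.50 and the threshold is 0.70, so the fitness values are 1.0, 0.5, 0.375, and 0.125. Any individual with $R_{\mathrm{work}}$ above the threshold receives zero fitness.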
Genetic operations implement two mechanisms. During crossover, two parents are selected according to fitness-weighted probabilities. The asymmetric unit is divided into spatial blocks, and a chosen subset undergoes density value exchange between parents to generate offspring. Prior to crossover, all density maps undergo rotational and translational alignment to eliminate crystallographic origin and enantiomorphic ambiguities. Mutation introduces stochasticity by randomly selecting approximately 1% of grid points in offspring density maps and assigning them new random values, which adds variation and prevents premature convergence. The resulting offspring typically exhibit improved fitness relative to the parent population. Population update then replaces low-fitness individuals with these offspring, forming the next generation. The algorithm then returns to independent evolution for another round of local search, followed by global information exchange. Further details, such as elite inheritance and the similarity penalty that prevents premature convergence, can be found in our previous paper [32].
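A minimal sketch of the two genetic operators on density grids is given below. It assumes the parent maps are already aligned to a common origin and hand (the alignment step is omitted), uses cubic blocks of a hypothetical edge length, and draws mutated values uniformly between the minimum and maximum of the offspring map. The block size, the per-block swap probability, and the mutation distribution are illustrative choices, not values taken from the paper.

```python
import numpy as np

def block_crossover(parent_a, parent_b, block=4, swap_prob=0.5, rng=None):
    """Build an offspring map by taking randomly chosen blocks from parent_b."""
    rng = np.random.default_rng() if rng is None else rng
    # One random decision per block, then expand it to voxel resolution.
    n_blocks = [-(-n // block) for n in parent_a.shape]  # ceiling division
    take_b = rng.random(n_blocks) < swap_prob
    for axis in range(parent_a.ndim):
        take_b = np.repeat(take_b, block, axis=axis)
    take_b = take_b[tuple(slice(0, n) for n in parent_a.shape)]
    return np.where(take_b, parent_b, parent_a)

def mutate(child, rate=0.01, rng=None):
    """Assign new random values to a fraction `rate` of grid points."""
    rng = np.random.default_rng() if rng is None else rng
    out = child.copy()
    n_mut = int(round(rate * out.size))
    idx = rng.choice(out.size, size=n_mut, replace=False)
    out.flat[idx] = rng.uniform(child.min(), child.max(), size=n_mut)
    return out
```

Every voxel of the crossover output comes from one of the two parents, and mutation touches at most `rate` times the number of grid points.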
The effectiveness of this strategy derives from synergistic effects between local refinement and global information sharing. Once any individual approaches the correct solution, manifested as a sharp R-factor decrease, high-quality density features disseminate throughout the population via crossover. This collective guidance enables population convergence toward the global optimum at rates exceeding independent random searches. The GA strategy upgrades traditional multi-start independent searching to intelligent multi-start cooperative searching with active information sharing, representing an advancement for enhancing the reliability of direct-method phase retrieval in challenging applications.

2.5. Error Metrics, Missing Reflections, and Model Building

We employed a validation pipeline with quality metrics to monitor convergence, strategies to handle incomplete data, and automated model building from recovered density maps.

2.5.1. Error Metrics

Assessment metrics are organized into two categories: reference-dependent metrics for validation during method development and internal consistency metrics applicable to de novo structure determination. Reference-dependent metrics provide validation when PDB coordinates are available. The mean phase error quantifies the angular deviation between retrieved and true phases:
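$$ \Delta\phi = \left\langle \arccos\left( \cos\left( \phi_{\mathrm{cal}}(\mathbf{h}) - \phi_{\mathrm{true}}(\mathbf{h}) \right) \right) \right\rangle, $$
where $\langle \cdot \rangle$ denotes averaging over unique reflections in the working set. The envelope intersection-over-union metric assesses the spatial accuracy of the reconstructed molecular boundary:
$$ \mathrm{IoU} = \frac{|S_{\mathrm{cal}} \cap S_{\mathrm{true}}|}{|S_{\mathrm{cal}} \cup S_{\mathrm{true}}|}, $$
quantifying the overlap ratio between the reconstructed envelope domain $S_{\mathrm{cal}}$ and the true envelope domain $S_{\mathrm{true}}$.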
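Both reference-dependent metrics are straightforward to compute. The sketch below assumes that the average in the phase error is taken over the wrapped per-reflection differences (the arccos of the cosine folds each difference into the range 0° to 180°), that phases are given in degrees, and that both maps share the same origin and hand. The function names are hypothetical.

```python
import numpy as np

def mean_phase_error(phi_cal_deg, phi_true_deg):
    """Mean of arccos(cos(phi_cal - phi_true)) over reflections, in degrees."""
    diff = np.radians(np.asarray(phi_cal_deg, dtype=float)
                      - np.asarray(phi_true_deg, dtype=float))
    wrapped = np.arccos(np.clip(np.cos(diff), -1.0, 1.0))
    return float(np.degrees(wrapped).mean())

def envelope_iou(mask_cal, mask_true):
    """Intersection-over-union of two boolean envelope masks."""
    mask_cal = np.asarray(mask_cal, dtype=bool)
    mask_true = np.asarray(mask_true, dtype=bool)
    union = np.logical_or(mask_cal, mask_true).sum()
    if union == 0:
        return 1.0  # two empty envelopes are treated as identical
    return float(np.logical_and(mask_cal, mask_true).sum() / union)
```

Under this definition, uniformly random phases for acentric reflections give a mean phase error near 90°, which matches the pre-convergence level reported in the results.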
Internal consistency metrics serve as criteria for evaluating convergence when reference coordinates are unavailable. The working R-factor $R_{\mathrm{work}}$, defined in Equation (24), and the free R-factor $R_{\mathrm{free}}$ measure agreement between calculated and experimental amplitudes:
$$ R_{\mathrm{free}} = \frac{\sum_{\mathbf{h} \in \mathrm{free}} \left| |F_{\mathrm{obs}}(\mathbf{h})| - \lambda |F_{\mathrm{cal}}(\mathbf{h})| \right|}{\sum_{\mathbf{h} \in \mathrm{free}} |F_{\mathrm{obs}}(\mathbf{h})|}, $$
where $R_{\mathrm{free}}$ is computed using a randomly selected subset (1% of reflections in this study), excluded from all iterative processes as an independent validation dataset. Successful convergence is characterized by reduction of both $R_{\mathrm{work}}$ and $R_{\mathrm{free}}$ to stable low values. Density convergence indicators monitor the magnitude of real-space density modifications. For HIO-type algorithms, we compute $\langle |P_B \rho(\mathbf{r})| \rangle$ for grid points $\mathbf{r} \notin S$. For MRAAR-type algorithms, we compute
$$ \Delta\rho = \begin{cases} \left\langle \left| (2 P_A P_B - P_A - P_B)\, \rho(\mathbf{r}) \right| \right\rangle, & \mathbf{r} \in S \\ \left\langle \left| P_B\, \rho(\mathbf{r}) \right| \right\rangle, & \mathbf{r} \notin S \end{cases} $$
where $\langle \cdot \rangle$ denotes averaging over grid points in each region. As convergence proceeds, these update magnitudes diminish toward zero, indicating a stable density that satisfies the constraints.

2.5.2. Handling of Missing and Weak Diffraction Data

Diffraction datasets contain missing reflections due to various factors: reflections in the free set, low-resolution reflections (below 15 Å), and weak reflections failing signal-to-noise criteria ($|F_{\mathrm{obs}}(\mathbf{h})| < 2\sigma(|F_{\mathrm{obs}}(\mathbf{h})|)$). To handle such incomplete data, we implemented a dynamic mixing strategy. For missing or weak reflections, the calculated amplitude $|F_{\mathrm{cal}}(\mathbf{h})|$ is blended with experimental values using a weight factor $\alpha_{\mathrm{miss}}$:
$$ |F_{\mathrm{filled}}(\mathbf{h})| = \alpha_{\mathrm{miss}} \cdot \lambda \cdot |F_{\mathrm{cal}}(\mathbf{h})| + (1 - \alpha_{\mathrm{miss}}) \cdot |F_{\mathrm{obs}}(\mathbf{h})|, $$
where $\lambda$ is a global scale factor. For missing data, $\alpha_{\mathrm{miss}} = 1.0$; for weak data, $\alpha_{\mathrm{miss}} = 0.5$. This strategy ensures reciprocal-space data completeness, preventing information loss and projection artifacts from data gaps.
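The mixing rule can be sketched per reflection as below. Treating a reflection without an observed amplitude as "missing" and keeping well-measured reflections at their observed value (effectively $\alpha_{\mathrm{miss}} = 0$) are assumptions made for this example; the function and argument names are hypothetical.

```python
def fill_amplitude(f_cal, scale, f_obs=None, sigma_obs=None):
    """Return the amplitude used in the reciprocal-space constraint.

    f_obs is None      -> missing reflection, alpha_miss = 1.0
    f_obs < 2 * sigma  -> weak reflection,    alpha_miss = 0.5
    otherwise          -> observed amplitude is used unchanged (assumed)
    """
    if f_obs is None:
        alpha_miss = 1.0
        f_obs = 0.0  # multiplied by (1 - alpha_miss) = 0, so the value is irrelevant
    elif sigma_obs is not None and f_obs < 2.0 * sigma_obs:
        alpha_miss = 0.5
    else:
        alpha_miss = 0.0
    return alpha_miss * scale * f_cal + (1.0 - alpha_miss) * f_obs
```

A missing reflection is therefore replaced by the scaled calculated amplitude, while a weak one is set to the average of the scaled calculated and observed amplitudes.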

2.5.3. Electron Density Map Post-Processing and Automated Model Building

After convergence, we applied post-processing to enhance density map quality. When multiple solutions converged successfully through GA strategy, we performed spatial alignment and density averaging to reduce noise. The resulting electron density map was used for automated model building, which identifies secondary structure elements, constructs polypeptide backbones, and positions side-chain atoms based on density features and stereochemical constraints. The initial model exhibits over 80% residue completeness and is refined through several iterations using phenix.refine [49], optimizing atomic coordinates and displacement parameters against experimental data to yield the final structural model.

2.6. Test Datasets, Computational Implementation, and Parameter Settings

To ensure robustness, we assembled a benchmark dataset of 28 protein crystal structures with diversity in space group symmetry, solvent content, and diffraction resolution. All diffraction data ($|F_{\mathrm{obs}}|$) and reference coordinates were obtained from the Protein Data Bank. The structures comprise proteins containing 826 to 7475 non-hydrogen atoms with 0 to 635 bound water molecules. The dataset includes resolutions from 1.46 Å to 3.2 Å, solvent contents from 55% to 78%, PDB-reported R-work values below 0.22 (ranging from 0.133 to 0.215), and diverse space groups including $C121$, $C222_1$, $P2_12_12_1$, $P3_121$, $P4_132$, $P4_12_12$, $P4_32_12$, $P6_5$, $P6_122$, $H32$, $I222$, $I2_13$, and $I4_122$, ensuring evaluation under varied crystallographic conditions. Diffraction datasets contain 4918 to 76,249 observed reflections, with missing low-resolution reflections ranging from 3 to 236. Detailed descriptions of all test structures are provided in Appendix A, Table A1. Solvent content for each structure was estimated using the matthews_coef program from the CCP4 suite [41], based on the Matthews coefficient [36]. The calculations incorporated unit cell parameters, space group symmetry, and protein molecular weight derived from the amino acid sequence. This estimated solvent content, along with space group information and unit cell parameters, constitutes essential prior knowledge used as constraint information during iterative phasing.
The computational framework was implemented using the Clipper C++ libraries for crystallographic computing [50] combined with MPI (Message Passing Interface) for parallel processing. Key parameters were standardized: electron density maps were discretized on 3D grids with approximately 1.0 Å spacing; Gaussian filter radii were $\sigma_1 = 2.5$ Å (large-radius) and $\sigma_2 = 1.5$ Å (small-radius); the refined scheme stripped 5% of grid points in the asymmetric unit from the coarse envelope; the solvent region negative feedback factor was $\beta = 0.75$; continuous algorithm parameters were $\alpha = 0.4$ for CHIO, $\beta_{\mathrm{HPR}} = 0.588$ for HPR, and $\gamma_M = 0.5$ for MCHIO; GA population size was 100 individuals with genetic operations every 100 iterations, crossover probability 0.5, and mutation probability 0.01.
Real-space histogram matching requires a reference electron density distribution. When available, a homologous structure or AlphaFold-predicted model can serve as a reference; structure factors are computed via phenix.fmodel with bulk solvent correction and resolution-appropriate temperature factors. In computing structure factors from model coordinates, the constant term F(0,0,0) (F000) represents the mean electron density of the unit cell. We adjusted F000 such that bulk solvent regions exhibit zero mean electron density. During density modification iterations, solvent flattening constraints drive bulk solvent densities toward zero through negative feedback, and pre-setting the reference histogram with zero-centered solvent regions ensures consistency between the target distribution and algorithmic behavior.
For optimal histogram generation from simulated diffraction data, temperature factors (B-factors) must be assigned to the atomic model. However, there is no exact theoretical method to determine optimal B-factors for histogram matching applications. Several empirical approaches exist: atomic B-factors can be estimated from the Wilson B-factor [51] calculated from the experimental diffraction data of the unknown structure; alternatively, statistical analysis of deposited PDB structures can provide estimates for both atomic and bulk solvent B-factors [52]. In this work, we employ empirical relationships based on the high-resolution limit $d_{\mathrm{high}}$ (Å). For bulk solvent regions, the B-factor is approximated as $B_{\mathrm{sol}} \approx 25\, d_{\mathrm{high}} + 25$ Å$^2$. For homologous or AI-predicted protein models, isotropic atomic B-factors are set to $B_{\mathrm{atom}} \approx 15\, d_{\mathrm{high}} + 5$ Å$^2$. These empirical relationships are applicable for typical macromolecular crystallography resolution ranges ($1.0 \le d_{\mathrm{high}} \le 3.5$ Å).
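The two empirical relationships are simple enough to encode directly. The helper below is only a sketch: the function name is hypothetical, and rejecting values outside the stated 1.0–3.5 Å range is a conservative choice made for the example.

```python
def empirical_b_factors(d_high):
    """Return (B_sol, B_atom) in square Angstrom for a resolution limit d_high (Angstrom)."""
    if not 1.0 <= d_high <= 3.5:
        raise ValueError("empirical relations are stated for 1.0 <= d_high <= 3.5 A")
    b_sol = 25.0 * d_high + 25.0   # bulk solvent B-factor
    b_atom = 15.0 * d_high + 5.0   # isotropic atomic B-factor
    return b_sol, b_atom
```

For example, a data set extending to 2.0 Å gives $B_{\mathrm{sol}} \approx 75$ Å$^2$ and $B_{\mathrm{atom}} \approx 35$ Å$^2$.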
With appropriately configured temperature factors and F000 value, electron density histograms generated from homologous or AI-predicted models closely resemble those from experimentally determined structures. Our tests demonstrate that at identical resolution and with appropriately set parameters, electron density distribution histograms remain remarkably similar even across different protein structures, validating their use as reliable constraint conditions for protein region density in unknown structures. This observation is consistent with the conclusions of Zhang and Main’s seminal work on histogram matching [33]. For benchmarking purposes in this study, we generated reference histograms from deposited PDB coordinates using this temperature factor protocol, which does not affect our conclusions, as the histograms provide statistically representative protein density distributions.
For each test structure, envelope scheme (one-step coarse or two-step refined), phasing strategy (full-resolution, resolution-weighted, or GA-enhanced), and iterative algorithm (10 total), we executed 100 independent phase retrieval attempts from different random phase seeds. Maximum iterations per attempt were 10,000, with early termination upon convergence detection (reduction of $R_{\mathrm{work}}$, $R_{\mathrm{free}}$, and $\Delta\rho$ to stable plateaus). Statistics collected from all 100 attempts included success rate, minimum iteration count for first successful convergence, and median iteration count across successful cases. Calculations were performed on a Dell R740 server with 52 cores (104 threads) at 2.1 GHz. A complete evaluation for a single structure with 100 independent trials required approximately 3 h.
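The three statistics collected per configuration can be computed from a list of per-trial outcomes. In the sketch below, each entry is the iteration at which a trial converged, or `None` for a failed trial. This encoding and the function name are assumptions made for the example.

```python
from statistics import median

def summarize_trials(converged_at):
    """Success rate plus minimum and median iteration counts of successful trials."""
    successes = [n for n in converged_at if n is not None]
    return {
        "success_rate": len(successes) / len(converged_at),
        "min_iterations": min(successes) if successes else None,
        "median_iterations": median(successes) if successes else None,
    }
```

For a hypothetical run in which 4 of 100 trials converge at iterations 3000, 5000, 7000, and 9000, the summary reports a success rate of 0.04, a minimum of 3000, and a median of 6000 iterations.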

3. Results

3.1. Constraint Framework for Ab Initio Phasing

All direct phasing tests in this study operate within a defined constraint framework that combines experimental measurements with general physical and statistical priors, without requiring knowledge of a specific homologous atomic structure. The reciprocal-space constraint is provided by the experimentally measured structure factor amplitudes $|F_{\mathrm{obs}}(\mathbf{h})|$, while crystallographic symmetry (space group and unit cell parameters) is obtained directly from the diffraction experiment. The solvent content $x_{\mathrm{solv}}$, estimated via the Matthews coefficient from the unit cell volume and protein molecular weight (derived from sequence information), serves as standard prior knowledge for envelope reconstruction. In real space, two physical constraints are enforced: solvent flattening imposes the near-constant electron density characteristic of bulk solvent regions, and histogram matching guides the protein-region density toward a statistically representative distribution. For genuinely novel structures, this reference histogram can be obtained without model bias either from known proteins [33] or from high-confidence AlphaFold predictions, which provide plausible density distributions without precise atomic coordinates. (In this benchmarking study, reference histograms were generated from deposited PDB coordinates using the empirical temperature factor protocol described in Section 2.6.) The molecular envelope itself is reconstructed iteratively from the evolving density, guided by the estimated $x_{\mathrm{solv}}$ and local density continuity. This constraint set leverages measurable experimental data and fundamental physical principles, distinguishing our approach from model-dependent methods like molecular replacement that require specific atomic templates. All subsequent results are presented within this framework.

3.2. Validation Case Study: Direct Phasing of Structure 3rd5 with Continuous Algorithms

To validate the effectiveness of continuous iterative projection algorithms in addressing discontinuous density modification at molecular boundaries, we conducted a case study using protein crystal structure 3rd5, a putative uncharacterized protein from Mycobacterium paratuberculosis [47]. With an estimated solvent content of 59.52%, this structure lies at the lower end of the high-solvent-content range, below the 65% threshold above which traditional iterative projection algorithms typically converge, making it a suitable test of continuous algorithms under challenging conditions.
From the continuous iterative projection algorithms, we took HPR as an example and applied it with a full-resolution phasing strategy and two-step refined envelope reconstruction. The experiment consisted of 100 independent phase retrieval attempts, each initialized with random phases and allowed to iterate for up to 10,000 iterations. After 8000 iterations, the transition region was linearly shrunk to zero over 1500 iterations, followed by 500 iterations of solvent flattening to finalize convergence. Figure 3 presents monitoring results across multiple convergence metrics. Each trajectory in Figure 3a–e represents the temporal evolution of key quality indicators for a single independent attempt, including mean phase error, envelope intersection-over-union ratio, working R-factor, free R-factor, and solvent region density deviation. Across 96 of 100 attempts, these metrics fluctuated within high-error regimes without sustained improvement, reflecting the dimensionality and rugged topography of the solution space.
Figure 3 reveals successful convergence events, demonstrating the capability of continuous algorithms via HPR. In four attempts, distinguished from others by bold lines, all five monitored metrics underwent a dramatic transition. Mean phase error plummeted from near 90° to values indicating correct phase determination (Figure 3a). Simultaneously, envelope intersection-over-union jumped from around 0.8 to exceeding 0.9 (Figure 3b). Both working and free R-factors dropped from approximately 0.55 to lower stable values (Figure 3c,d), while solvent region density deviation decreased sharply (Figure 3e). This coordinated improvement across independent quality metrics is the signature of successful convergence to the global optimum. The scatter plot in Figure 3f provides visualization by displaying final R-factor pairs for all 100 attempts. The 96 unsuccessful attempts cluster in the high R-factor region, while the four successful cases appear as outliers in the low R-factor region.
The electron density map from these successful convergences exhibited quality for automated atomic model building. After refinement, the backbone root-mean-square deviation between the resulting model and PDB reference coordinates was only 0.10 Å, confirming high phase accuracy. This case study demonstrates that continuous iterative projection algorithms can recover atomic-resolution structural information for challenging protein crystals with limited solvent content from random phase initialization, despite the 4% success rate. The mechanistic basis lies in the transition region concept, which provides smoother and more physically realistic density modification pathways at the protein-solvent interface. By avoiding abrupt discontinuities of classical algorithms, continuous methods enable more effective navigation through multidimensional solution landscapes.

3.3. Enhancement of Phase Retrieval Performance by the Refined Envelope Scheme

The case study in Section 3.2 demonstrates that continuous iterative projection algorithms can successfully recover phases for challenging structures with limited solvent content when combined with refined envelope reconstruction. To systematically evaluate the impact of envelope quality on phase retrieval performance, we compared the traditional one-step coarse envelope versus our proposed two-step refined envelope reconstruction scheme. We conducted this evaluation using structures 3rd5 [47] and 2fg0 [48] (63.86% solvent), employing CHIO under a full-resolution phasing strategy.
Figure 4a,b provides a visual comparison of transition regions and molecular boundaries generated by the two envelope-reconstruction approaches at equivalent iteration stages for 3rd5 and 2fg0. The traditional one-step method (left panels) produces boundaries that appear smooth but are crude and oversimplified. The transition region resembles a uniform thick shell enveloping the molecular core without discrimination, overlaying portions of protein surface side chains and causing inappropriate application of solvent constraints to protein density features. In contrast, molecular envelopes from the two-step method (right panels) exhibit richer surface detail, conforming closely to true molecular geometry and capturing surface crevices, pockets, and protrusions. The transition region delineated by the refined scheme no longer manifests as a uniform shell but concentrates in spatial regions exhibiting low electron density situated on or near the molecular surface. This selective localization demonstrates identification of physical solvent spaces erroneously included within traditional coarse envelopes.
The envelope precision improvement translates into enhanced convergence behavior. Figure 4c illustrates transition region evolution for structure 3rd5 using CHIO with refined envelope scheme, comparing pre-convergence and post-convergence states. While the pre-convergence transition region remains diffuse, upon successful convergence, the transition region becomes aligned with the final molecular boundary, reflecting confidence in boundary determination. Figure 4d illustrates the progressive refinement of the protein envelope, showing the transformation from an incomplete envelope that fails to fully encompass the protein structure (with numerous atoms exposed outside the boundary) to a well-converged envelope that accurately conforms to the protein surface.
The advantage of enhanced envelope accuracy is the maximization of available constraint information within the unit cell volume. By re-identifying approximately 5% of the asymmetric unit within traditional coarse envelopes as solvent or transition region based on local density characteristics, the refined scheme provides iterative algorithms with additional reliable prior knowledge guiding convergence. For challenging crystals like 3rd5, where total bulk solvent volume is limited, this reclaimed solvent represents valuable information, strengthening the constraint framework. These enhancements demonstrate that refined envelope reconstruction, by optimizing molecular boundary determination, elevates phase retrieval capability. This represents progress toward solving challenges of phasing structures with limited solvent content, historically resistant to direct methods. As demonstrated through benchmarking in subsequent sections, performance gains prove universal, enhancing both continuous and classical iterative projection algorithms across diverse structural types and crystallographic conditions.

3.4. Systematic Benchmarking: Synergistic Effects of Algorithms, Envelopes, and Strategies

We conducted systematic benchmarking across twenty-eight diverse protein structures, comparing ten iterative projection algorithms (six classical or improved: ASR, RAAR, DM, MDM, MRAAR, HIO; four continuous: CHIO, HPR, MCHIO, THIO) under two envelope schemes (one-step coarse and two-step refined) and three phasing strategies (conventional full-resolution, resolution-weighted, genetic algorithm-enhanced). For each configuration, 100 independent phase retrieval attempts from random phase seeds were executed, recording success rate, minimum iteration count for the first convergence, and median iteration count for all successful runs. Detailed numerical results are provided in Appendix A, Table A2, Table A3, Table A4, Table A5, Table A6 and Table A7.

3.4.1. Universal Performance Enhancement from Refined Envelope Reconstruction

To contextualize the averaged performance comparisons, we first examine the success rate of two representative algorithms, the classical HIO and the continuous HPR, under the conventional full-resolution phasing strategy with one-step coarse and two-step refined envelope reconstruction schemes across all 28 test structures (Figure 5a). The results, ordered by solvent content, reveal a general upward trend but with significant variation. Structures like 1xhd [53] achieve near-perfect success, while others, such as 4qb6 [54], 3rd5 [47], 4q82 [55], 2fg0 [48], 2bke [56], and 4tpl [57], show very low or zero success rates. This preliminary view highlights the inherent difficulty of certain structures and motivates the need for both improved algorithms and enhanced strategies, the effects of which are analyzed in the following averaged results.
Averaged over the 28 structures, Figure 5b–d demonstrates that refined envelope reconstruction delivers universal enhancement. As shown in Figure 5c, continuous algorithms improved from a 14.3% to a 20.9% success rate (45.7% relative gain), with reduced minimum and median iteration counts. For classical algorithms (excluding ASR and RAAR), the improvement was more striking: from 9.9% to 15.9% (60.5% relative gain). In Figure 5b, the proposed improved algorithms MDM, MRAAR, and MCHIO achieved 17.1%, 20.4%, and 20.5% success rates under refined envelopes, exceeding their original versions (DM, RAAR, and CHIO) by significant margins. These results establish refined envelope reconstruction as a universal performance enhancer for algorithms relying on solvent flatness constraints, while demonstrating that the proposed improved algorithms, when combined with a refined envelope, expand the available toolkit for direct phasing.

3.4.2. Algorithm Performance Differences and Selection Strategy

Figure 5b shows that continuous algorithms (CHIO, HPR, MCHIO, THIO) and improved classical variants (MDM, MRAAR) perform comparably to the established HIO algorithm when used with the two-step refined envelope scheme. Together, these algorithms form a high-performing group that surpasses classical DM, while ASR and RAAR consistently yield lower success rates.
However, examining success rates per structure in Appendix A Table A2 and Table A5 reveals that no single algorithm excels in all cases. For a given structure, different algorithms can produce widely varying success rates. If only HIO is used and its success rate is near zero for a particular structure (e.g., 3rd5), thousands of random trials may be needed to obtain a solution—a computationally demanding process. With multiple algorithms available, a more efficient strategy emerges: by testing several algorithms with a limited number of trials each (e.g., 100 random seeds per algorithm), one may identify an algorithm that performs substantially better than HIO for that structure, enabling phase determination within hundreds rather than thousands of attempts.
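The trade-off between per-trial success rate and the number of random trials can be made concrete with elementary probability. The sketch below assumes that trials are independent and share a fixed success probability `p`; the function names and the 95% confidence level are choices made for illustration, not values from this study.

```python
import math


def prob_at_least_one_success(p: float, n_trials: int) -> float:
    """Probability that at least one of n independent trials succeeds."""
    return 1.0 - (1.0 - p) ** n_trials


def trials_needed(p: float, confidence: float = 0.95) -> int:
    """Smallest n for which P(at least one success) >= confidence."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # Solve 1 - (1 - p)**n >= confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

Under these assumptions, an algorithm with a 0.1% per-trial success rate needs on the order of 3000 trials for a 95% chance of at least one solution, whereas one with a 5% rate needs about 60, which is why screening several algorithms with a modest trial budget can pay off.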
For practical applications to unknown structures, we recommend first testing multiple algorithms from the top-performing group—including continuous algorithms (CHIO, HPR, MCHIO, THIO), improved classical variants (MDM, MRAAR), and HIO—in combination with the two-step refined envelope scheme. If computational resources allow, the genetic algorithm co-evolution strategy should also be employed, as it enhances search efficiency through population-wide information sharing. This integrated approach exploits the benefits of improved envelope accuracy, algorithmic diversity, and collaborative optimization. If any combination succeeds, the structure is considered solved. If all tested combinations show very low success rates, increasing the number of random trials per algorithm may be attempted, though such cases likely belong to a challenging category where additional constraints or experimental data may be needed.

3.4.3. Structural Validation: From Electron Density to Atomic Models

To validate practical utility, we selected seven representative phased structures (3rd5 [47], 2fg0 [48], 1hp4 [58], 1nh6 [59], 2bke [56], 2boz [60], 4gtf [61]) for automated model building. As shown in Figure 6, rebuilt models exhibit excellent spatial alignment with PDB structures, with backbone RMSD below 0.5 Å. Local electron density maps for biologically significant regions, including bound ligands, coordinated ions, and secondary structure motifs, display clarity and spatial continuity sufficient to determine side-chain orientations and resolve structural features. This density quality validates that our direct phasing methods achieve sufficient accuracy to support atomic model construction and enable biological interpretation.

3.5. Evolution of Phasing Strategies: From Full-Resolution to Genetic Co-Evolution

We evaluated three phasing strategies: full-resolution baseline, resolution-weighted progressive, and genetic algorithm-enhanced co-evolution. Figure 7 illustrates the GA-enhanced strategy (incorporating resolution weighting) using structure 3rd5, with results compared against Figure 3. Six panels track the temporal evolution of five quality metrics (mean phase error, envelope intersection-over-union, working R-factor, free R-factor, and solvent density deviation) across 100 individuals, each starting from an independent random seed. The figure shows successful convergence unfolding in two phases: after approximately 5100 iterations, a single individual achieved convergence manifested through abrupt coordinated improvements across all metrics. Subsequently, high-quality structural information from this pioneer propagated throughout the population via genetic operations, guiding the majority of other individuals to converge within a few hundred additional iterations. Upon achieving near 100% population convergence at iteration 5500, early stopping was triggered. The transition region was then linearly reduced to zero over 1500 iterations, followed by 500 iterations of solvent flattening, with the process terminating at iteration 7500.
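The post-convergence schedule described above (linear shrinkage of the transition region over 1500 iterations starting at iteration 5500, then 500 iterations of solvent flattening) can be sketched as a simple function of the iteration number. The function name and the unit `initial_width` are hypothetical; only the iteration counts come from the text.

```python
def transition_width(iteration: int,
                     start_iter: int = 5500,
                     ramp_iters: int = 1500,
                     initial_width: float = 1.0) -> float:
    """Width of the transition region at a given iteration.

    Before early stopping is triggered the width is unchanged; it then
    decreases linearly to zero and stays at zero, which corresponds to
    plain solvent flattening. Width units are arbitrary in this sketch.
    """
    if iteration <= start_iter:
        return initial_width
    if iteration >= start_iter + ramp_iters:
        return 0.0
    fraction_left = 1.0 - (iteration - start_iter) / ramp_iters
    return initial_width * fraction_left


# Total run length for the 3rd5 example: ramp plus flattening stage.
final_iteration = 5500 + 1500 + 500
```

With these numbers the width reaches zero at iteration 7000 and the run ends at iteration 7500, matching the termination point quoted for Figure 7.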

3.5.1. Resolution-Weighted Strategy: Modest Gains with Computational Trade-Offs

The resolution-weighted strategy prioritizes low-resolution data initially, then progressively introduces high-resolution data. Analysis of Figure 8g shows modest success rate increases of approximately 2 percentage points compared with the full-resolution baseline, regardless of envelope scheme. However, Figure 8h,i reveals increased minimum and median iteration counts. This slowdown occurs because suppressing high-resolution data during early stages, while beneficial for establishing stable low-resolution phases and accurate envelope determination, delays structural detail reconstruction.
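One way to picture a weighting that favors low-resolution data early and releases high-resolution data later is a temperature-factor-style damping whose strength decays with iteration. The sketch below is purely illustrative: the functional form, the parameter names `initial_b` and `ramp_iters`, and their default values are assumptions and do not reproduce the weighting actually used in this work.

```python
import math


def resolution_weight(d_spacing: float,
                      iteration: int,
                      ramp_iters: int = 2000,
                      initial_b: float = 200.0) -> float:
    """Hypothetical weight for a reflection with resolution d (in Angstrom).

    The damping factor b shrinks linearly from initial_b to zero over
    ramp_iters iterations, so high-resolution reflections (small d) are
    suppressed at first and fully restored once the ramp has finished.
    """
    progress = min(max(iteration / ramp_iters, 0.0), 1.0)
    b = initial_b * (1.0 - progress)
    s_squared = 1.0 / d_spacing ** 2  # squared reciprocal resolution
    return math.exp(-b * s_squared / 4.0)
```

In such a scheme a 10 Å reflection keeps most of its weight from the start, while a 2 Å reflection contributes almost nothing until late in the ramp, which is consistent with the delayed reconstruction of structural detail noted above.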

3.5.2. Genetic Algorithm Strategy: Breakthrough Through Population Intelligence

The GA co-evolution strategy, introduced in our previous work [32], delivered breakthrough performance by transforming phase retrieval from multi-start independent search into population-based collaborative learning. While our previous study demonstrated GA effectiveness using only the HIO algorithm with traditional one-step envelope reconstruction, the current work extends this evaluation to 10 iterative algorithms, including four continuous IPAs, and examines the synergistic effects when GA is combined with the two-step refined envelope scheme. Statistical analysis (Figure 8a,g) demonstrates that the GA-enhanced strategy significantly improved average success rates compared with the conventional strategy. For coarse envelope design, success rates increased from 12.1% to 43.3%, while for refined envelope design, rates improved from 18.4% to 63.4%, representing an approximately 2.5-fold increase in success rates across both envelope designs. Figure 8d reveals that the GA-enhanced strategy universally improves success rates across all 10 algorithms. Notably, Figure 8l shows that when combining refined envelope reconstruction with the GA phasing strategy, continuous IPAs (CHIO, HPR, MCHIO, THIO) and improved classical algorithms (MDM, MRAAR) achieve success rates comparable to HIO and substantially exceeding DM. In contrast, ASR and RAAR consistently yield the lowest success rates.
Regarding convergence speed, the GA-enhanced strategy shows distinct effects on different convergence metrics (Figure 8e,f,h,i, and Appendix A, Figure A3). Minimum iteration counts remain comparable across all three strategies for both coarse and refined envelope schemes. However, median iteration counts decrease substantially under the GA-enhanced strategy, demonstrating improved overall convergence consistency across both envelope designs. Specifically, with the refined envelope design, median iterations dropped from 3681 (conventional) and 4398 (resolution-weighted) to 2112 (GA-enhanced), representing approximately a 43–52% reduction. This improvement in median convergence reflects GA’s collective optimization mechanism illustrated in Figure 7: successful pioneer individuals emerge and rapidly propagate their high-quality density features throughout the population via genetic operations. The synergy between refined envelope reconstruction and GA strategy enhances convergence reliability—refined envelopes provide accurate spatial constraints that increase pioneer emergence probability, while GA efficiently disseminates validated solutions through population-wide information sharing, thereby elevating the performance of the entire population rather than just isolated optimal cases.
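For reference, the quoted 43–52% range follows directly from the median iteration counts given above:

```latex
\[
1 - \frac{2112}{3681} \approx 0.426, \qquad
1 - \frac{2112}{4398} \approx 0.520
\]
```

The first value is the reduction relative to the conventional strategy and the second relative to the resolution-weighted strategy.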

3.6. Correlation Analysis: Solvent Content Dependence and Extended Applicability

Solvent content is a parameter governing phase retrieval difficulty [63,64]. We conducted correlation analysis examining relationships between solvent content and success rate across 28 structures. Figure 9 presents average success rates (across eight algorithms, excluding ASR and RAAR) versus solvent content under three phasing strategies and two envelope schemes. The 28 structures are arranged by increasing solvent content, with each represented by paired bars for coarse (blue) and refined (orange) envelopes. Results reveal that the success rate exhibits an overall increasing trend with rising solvent content, albeit with considerable variation. Within every solvent content interval, the refined envelope (orange bars) exceeds the coarse envelope (blue bars).
Figure 9 shows that several structures (4qb6 [54], 3rd5 [47], 4iqk [65], 4q82 [55], 2fg0 [48], 1nh6 [59], 2bke [56], 4tpl [57]) consistently exhibit below-average success rates regardless of strategy or envelope scheme. Detailed analysis identifies three contributing factors, illustrated for representative cases in Appendix A, Figure A2. First, extensive surface-bound water molecules (3rd5: 413 waters, 16% of non-hydrogen atoms; 4q82: 472, 11%; 2fg0: 428, 11%; 1nh6: 635, 13%) reduce the bulk solvent available for constraints by requiring loose envelopes for encapsulation. Second, space group symmetry combined with molecular packing produces multiple equivalent origin choices with nearly identical envelopes (two proximate choices in 3rd5, 2fg0, and 4tpl; four converging choices in 4qb6), making it difficult for iterative reconstruction to select a definitive convergence pathway among these equivalent configurations. Third, severe protein-solvent interdigitation creates highly irregular envelope topologies with fragmented bulk solvent in 4iqk, 2bke, and 4tpl, which hinder accurate ab initio envelope reconstruction. Individual structures may exhibit a single factor or a combination; for instance, 3rd5 combines surface waters with origin ambiguity, 4qb6 combines four-choice origin ambiguity with surface waters (approximately 100 water molecules, 8% of atoms), and 4tpl exhibits both interdigitation and origin ambiguity.
Linear regression analysis (Figure 10) of success rates versus solvent content across six methodological combinations (three phasing strategies, two envelope designs) reveals positive correlations (coefficients ∼0.5), confirming the role of solvent constraints. Success rates decline sharply as solvent content decreases, with extrapolated trends suggesting near-zero success below 55% solvent content. Refined envelope regression lines consistently lie above coarse envelope lines across all strategies. These results demonstrate that refined envelope reconstruction combined with a GA strategy extends the applicable range to 55% solvent content by more efficiently exploiting structural constraints rather than overcoming the fundamental physical dependence. Below 55% solvent content, direct phasing remains challenging and requires integration with more constraints or complementary techniques. Analysis of convergence speed (Appendix A Figure A4 and Figure A5) reveals weaker correlations with solvent content compared with success rates. All three strategies show a slight trend toward faster convergence at higher solvent content, with refined envelopes requiring fewer iterations than coarse envelopes at equivalent solvent content.
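The regression quantities discussed above (slope, correlation coefficient, and the solvent content at which the fitted line reaches zero success) can be obtained with an ordinary least-squares fit. The sketch below is a generic implementation with hypothetical function names; it contains no data from this study.

```python
import math
from typing import Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares line y = slope * x + intercept.

    Returns (slope, intercept, pearson_r).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r


def zero_crossing(slope: float, intercept: float) -> float:
    """x value at which the fitted line predicts y = 0."""
    return -intercept / slope
```

Applied to per-structure (solvent content, success rate) pairs, `zero_crossing` gives the extrapolated solvent content at which the fitted success rate vanishes; as with any extrapolation, the result is only as reliable as the linear trend itself.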

3.7. Multi-Solution Averaging: Enhanced Precision Through Population Convergence

High GA success rates approaching 100% enable phase accuracy enhancement through multi-solution averaging. When multiple independent runs converge through distinct stochastic pathways to fixed points clustered around the true solution, their residual random errors are uncorrelated. Spatial alignment and averaging of multiple converged solutions suppress random noise, yielding a consensus solution.
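The noise-suppression argument can be illustrated on a single reflection. The sketch below averages phase estimates as unit vectors (a circular mean), assuming the solutions have already been aligned to a common origin and hand; it is a simplification that ignores amplitudes and is not the authors' averaging procedure. All function names are hypothetical.

```python
import math
from typing import Sequence


def circular_mean_deg(phases_deg: Sequence[float]) -> float:
    """Mean of angles in degrees, computed as the direction of the
    summed unit vectors so that wrap-around at 360 is handled."""
    s = sum(math.sin(math.radians(p)) for p in phases_deg)
    c = sum(math.cos(math.radians(p)) for p in phases_deg)
    return math.degrees(math.atan2(s, c)) % 360.0


def phase_error_deg(a: float, b: float) -> float:
    """Smallest absolute angular difference between two phases."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


# Toy example: five estimates scattered around a true phase of 40 degrees.
true_phase = 40.0
estimates = [10.0, 70.0, 50.0, 35.0, 45.0]
mean_individual_error = sum(phase_error_deg(p, true_phase) for p in estimates) / len(estimates)
consensus_error = phase_error_deg(circular_mean_deg(estimates), true_phase)
```

In this toy case the consensus phase lies closer to the true value than the estimates do on average, because errors of opposite sign partially cancel; systematic errors shared by all solutions would not be reduced in this way.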
Figure 11a compares mean phase error before (red bars) and after (green bars) averaging for 27 successfully solved structures (excluding 4qb6). Green bars are lower, demonstrating a reduction across diverse structures. Figure 11b quantifies improvements: averaging reduced phase error by 6.83° on average, lowering the mean from 39.06° to 32.23°. This enhancement yields electron density maps with improved signal-to-noise ratios, facilitating automated model building and identification of structural details.
Figure 11c reveals no correlation between the magnitude of phase error reduction and solvent content, indicating that the benefit does not depend on structural difficulty. This is because averaging suppresses random noise arising from the stochastic optimization rather than systematic biases, so it also helps challenging crystals with limited solvent content. Analysis shows diminishing returns beyond approximately 20 averaged solutions (Figure 7f), indicating an optimal cost-benefit balance. Multi-solution averaging imposes minimal computational overhead while delivering a 6.83° average phase error reduction, and is a natural byproduct of the GA strategy.

4. Discussion

4.1. Theoretical and Practical Value of Continuous Density Modulation

Continuous iterative projection algorithms provide a more physically motivated mathematical description of density modification. Traditional algorithms impose binary update strategies at protein-solvent interfaces, creating mathematical discontinuities that are inconsistent with the continuous nature of electron density and that propagate errors, hindering convergence. By introducing transition regions, continuous algorithms achieve smooth density modification, resulting in smoother optimization paths that reduce entrapment in local minima. Results across 28 structures demonstrate that continuous algorithms (CHIO, HPR, MCHIO, THIO) achieve performance levels comparable to the well-established HIO algorithm under both envelope schemes (Figure 5b). The smooth density transitions at protein-solvent interfaces can provide stable convergence pathways in certain cases. In this sense, continuous algorithms refine the physical model by eliminating artificial discontinuities. It is important to note that algorithm performance varies across individual structures, and no single algorithm universally excels for all crystallographic scenarios.
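The contrast between a binary and a continuous update can be sketched for a single grid point. The first function is the textbook Fienup-style HIO update; the second is a simple linear blend of its two branches, shown only to illustrate what a transition region does. The blend is an assumption made for this example and is not the CHIO, HPR, MCHIO, or THIO update rule defined in this work.

```python
def hio_update(prev: float, projected: float, in_protein: bool,
               beta: float = 0.9) -> float:
    """Binary update: accept the projected density inside the envelope,
    apply negative feedback outside it."""
    return projected if in_protein else prev - beta * projected


def blended_update(prev: float, projected: float, weight: float,
                   beta: float = 0.9) -> float:
    """Illustrative continuous update.

    weight is 1 deep inside the protein region, 0 deep inside the
    solvent, and takes intermediate values across the transition region,
    so the update varies smoothly instead of switching abruptly.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    inside_branch = projected
    outside_branch = prev - beta * projected
    return weight * inside_branch + (1.0 - weight) * outside_branch
```

At `weight` equal to 1 or 0 the blend reduces to the two binary branches, so the only difference lies in how grid points near the boundary are treated.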

4.2. Mechanism and Universal Significance of Refined Envelope Reconstruction

The two-step refined envelope reconstruction (Section 2.3) improved average success rates by approximately 50% (Figure 8a), demonstrating that optimizing foundational constraints can be as impactful as algorithmic innovations. Traditional one-step methods must balance complete protein enclosure against solvent misclassification. Our two-step approach uses coarse-scale smoothing to capture overall molecular shape, then fine-scale analysis to identify trapped solvent. This recovered solvent provides additional constraints for iterative refinement and benefits all algorithm types (Figure 5b). For crystals with limited solvent content where constraint information is already scarce, the ability to reclaim even 5% additional solvent volume provides crucial marginal gains that can determine success or failure.
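The coarse-then-fine idea can be illustrated on a one-dimensional periodic density. The sketch below is a toy under stated assumptions: the thresholding by volume fraction, the parameter `trapped_fraction`, and all function names are hypothetical, and the actual criteria used in Section 2.3 may differ.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_smooth_periodic(values: Sequence[float], sigma: float) -> List[float]:
    """Smooth a 1D periodic map with a Gaussian kernel truncated at 3 sigma."""
    n = len(values)
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    norm = sum(kernel)
    return [
        sum(w * values[(i + k) % n] for w, k in zip(kernel, offsets)) / norm
        for i in range(n)
    ]


def two_step_envelope(density: Sequence[float], protein_fraction: float,
                      sigma_large: float, sigma_small: float,
                      trapped_fraction: float) -> Tuple[List[bool], List[bool]]:
    """Return (coarse_mask, refined_mask); True marks protein."""
    n = len(density)
    # Step 1: large-radius smoothing, keep the highest-valued fraction.
    coarse_map = gaussian_smooth_periodic(density, sigma_large)
    ranked = sorted(range(n), key=lambda i: coarse_map[i], reverse=True)
    coarse = [False] * n
    for i in ranked[:round(protein_fraction * n)]:
        coarse[i] = True
    # Step 2: small-radius smoothing; inside the coarse envelope, the
    # lowest-valued points are reclassified as trapped solvent.
    fine_map = gaussian_smooth_periodic(density, sigma_small)
    inside = [i for i in range(n) if coarse[i]]
    n_trapped = round(trapped_fraction * len(inside))
    refined = list(coarse)
    for i in sorted(inside, key=lambda i: fine_map[i])[:n_trapped]:
        refined[i] = False
    return coarse, refined


# Toy map: a 20-point "protein" block containing a 2-point cavity.
toy = [1.0 if 5 <= i <= 24 and i not in (14, 15) else 0.0 for i in range(40)]
coarse_mask, refined_mask = two_step_envelope(toy, 0.5, 3.0, 0.7, 0.1)
```

In the toy map the large-radius filter bridges the cavity so the coarse envelope encloses it, while the small-radius filter exposes it again and returns it to the solvent region, which is the mechanism by which additional solvent constraints are recovered.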

4.3. Genetic Algorithm Strategy: Mechanism and Boundaries

The GA strategy was introduced in our previous work [32], using the HIO algorithm with traditional one-step envelope reconstruction. In the current study, we demonstrate that this enhancement extends to all 10 tested algorithms, with the combination of GA and a two-step refined envelope achieving approximately a 2.5-fold improvement (Figure 8g). While the GA mechanism itself remains unchanged, the current work reveals important synergistic effects: the refined envelope scheme increases the probability of pioneer emergence by providing more accurate constraints, enabling GA to achieve higher absolute success rates and faster convergence than reported previously. However, important boundaries exist. Regression analysis (Figure 10c,f,i) reveals that when solvent content falls below approximately 55%, even GA success rates approach zero, confirming that GA uncovers existing success probability within solution space but does not create new constraints. For constraint-insufficient problems, any search strategy fails. The primary trade-off of the GA strategy is increased computational cost. However, when GA succeeds, it typically achieves convergence faster (about a 40% reduction in median iterations, Figure 8i) compared with a conventional phasing strategy, partially offsetting the computational overhead. GA should thus be viewed as the optimal tool for excavating maximum success potential under given constraints, particularly valuable when data quality permits but traditional methods show marginal success rates.

4.4. Factors Influencing Phase Recovery Success

Our previous testing experience and the systematic analysis of 28 diverse protein crystals in this work reveal that the success rate is influenced by several primary factors including data quality and structural distribution characteristics. Building on the algorithmic selection guidelines provided in Section 3.4.2, we now analyze these underlying factors and discuss comprehensive strategies for challenging cases.

4.4.1. Data Quality

The quality of diffraction data, encompassing both measurement accuracy and completeness, fundamentally influences phase recovery success. Measurement accuracy, reflected in the PDB-reported R-work values, appears more critical than completeness for the methods presented here. Structures in our test set with R-work below 0.22 generally yielded favorable phasing outcomes, whereas those with substantially higher R-values presented greater challenges due to reduced signal-to-noise ratio, making ab initio envelope reconstruction and phase solution increasingly difficult; addressing such high-error cases remains a direction for future development.
Regarding data completeness, particularly at low resolution, missing reflections (detailed in Appendix A, Table A1) can affect the initial envelope reconstruction. While extensive missing low-resolution data (e.g., 236 reflections for 2boz) could hinder this initial stage, our dynamic data filling strategy (Section 2.5.2) enables the iterative algorithm to compensate progressively using the observed medium- and high-resolution data. Crucially, the algorithm’s ability to reconstruct accurate phases does not solely depend on low-resolution completeness, as evidenced by structures like 4qb6, which has very few missing low-resolution reflections (only 4) yet remains challenging to phase. This indicates that for such cases, the primary obstacles arise from solvent content and structural distribution factors rather than data incompleteness.

4.4.2. Structural Distribution Within the Unit Cell

Beyond molecular shape, the distribution pattern of molecules within the crystallographic unit cell directly determines envelope reconstruction difficulty and consequently affects success rates. When protein surfaces exhibit complex interdigitation with bulk solvent regions (Appendix A Figure A2a, structure 2bke), ab initio envelope reconstruction struggles to delineate clear boundaries between protein and solvent domains. The irregular topology creates ambiguity in envelope determination, hindering convergence.
Crystallographic space groups permit multiple equivalent origin choices related by allowed origin translations. Non-centrosymmetric space groups introduce additional ambiguity through enantiomorphic structures, spatially inverted configurations that produce identical diffraction amplitudes. For instance, space group P2₁2₁2₁ presents 8 equivalent origin choices from crystallographic symmetry plus 8 from spatial inversion, totaling 16 equivalent origin selections. Iterative algorithms and physical constraints cannot distinguish among these mathematically equivalent origins, so reconstructed envelopes select among them randomly with equal probability. Whether different origin choices yield similar envelopes depends not only on space group symmetry but also on the specific molecular packing within the unit cell. When they do, for example when envelopes corresponding to two different origin choices lie in close spatial proximity (Appendix A Figure A2b–d, structure 3rd5) or when four origin choices produce very similar envelopes (Appendix A Figure A2f–j, structure 4qb6), the reconstruction process oscillates among equivalent configurations and fails to converge to a definitive solution.
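The count of 16 for P2₁2₁2₁ can be reproduced by enumeration: the permitted origin shifts are 0 or 1/2 along each cell axis, and each can be combined with either hand. The variable names below are illustrative.

```python
from itertools import product

# Permitted origin shifts in P2(1)2(1)2(1): 0 or 1/2 along a, b, and c,
# expressed in fractional coordinates.
origin_shifts = list(product((0.0, 0.5), repeat=3))

# Each shift can be paired with the original or the inverted hand.
hands = ("original", "inverted")
equivalent_choices = [(shift, hand) for shift in origin_shifts for hand in hands]

n_shifts = len(origin_shifts)
n_choices = len(equivalent_choices)
```

All of these choices reproduce the observed amplitudes equally well, so the data alone cannot favor one over another.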
Structures containing numerous fixed water molecules integrated into the molecular surface (Appendix A Figure A2e, structure 3rd5) require loose envelopes to encompass these structured solvent shells. This necessity encroaches upon limited bulk solvent volumes, reducing the effectiveness of solvent flatness constraints. Moreover, a subtle pathology occasionally emerges where envelope reconstruction succeeds, but protein region electron density adopts inverted signs, phases differ by 180°, and diffraction amplitudes remain unchanged. Although histogram matching constraints with positive skewness typically correct this inversion, convergence sometimes stalls at intermediate error states between correct and inverted solutions. The genetic algorithm strategy, with its population-based search and information sharing, is particularly effective at escaping such metastable states and driving the system toward the correct global minimum. The most challenging structures typically combine limited solvent content, measurement errors (high R-factors), and one or more geometric complications, creating compounding difficulties that substantially reduce success rates.
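The sign-inversion ambiguity mentioned above follows from the linearity of the Fourier transform: negating the density negates every structure factor, which leaves amplitudes unchanged and shifts all phases by 180°. A minimal one-dimensional check, using an arbitrary toy map and a naive discrete Fourier transform, is sketched below.

```python
import cmath
import math
from typing import List, Sequence


def dft(values: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform of a 1D periodic map."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * h * x / n) for x, v in enumerate(values))
        for h in range(n)
    ]


# Arbitrary toy density (not from any real structure) and its negative.
density = [0.2, 1.0, 0.7, -0.1, -0.4, 0.3]
inverted = [-v for v in density]

f_original = dft(density)
f_inverted = dft(inverted)

# Amplitudes agree; each inverted coefficient is the negative of the
# original one, i.e. its phase differs by 180 degrees.
amplitude_mismatch = max(abs(abs(a) - abs(b)) for a, b in zip(f_original, f_inverted))
sum_residual = max(abs(a + b) for a, b in zip(f_original, f_inverted))
```

Because amplitude agreement cannot separate the two cases, an additional constraint such as the positive skewness imposed by histogram matching is needed to select the correct sign.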
In summary, phase recovery success is governed by the interplay of data quality (measurement accuracy and completeness) and structural characteristics (solvent content, molecular packing complexity, and origin ambiguity). The two-step refined envelope scheme helps mitigate boundary-related issues, while the GA strategy is particularly effective at escaping metastable states arising from these structural complications.

4.5. Practical Considerations and Future Directions

The performance of the improved algorithms (MCHIO, MDM, MRAAR) highlights the importance of domain-specific adaptation. MCHIO corrects the excessive transition-region feedback of CHIO; MDM and MRAAR explicitly incorporate P_A P_B operators to address insufficient protein-region constraints, elevating performance to levels comparable with HIO. Different algorithms perform optimally under different conditions (Figure 8j–l), indicating that no single universal algorithm exists. Our multi-level synergistic framework therefore becomes necessary: refined envelope reconstruction provides high-quality constraints; a diverse algorithm toolbox broadens algorithm selection; resolution weighting and GA offer strategic optimization. This transforms direct methods from an experience-dependent art into a systematic, reproducible methodological pipeline.
For particularly challenging cases, integrated strategies may be needed. When available, homologous structures or AlphaFold predictions can guide envelope definition before applying direct methods for unbiased phase retrieval. For crystals with medium and low solvent content or complex packing, experimental phasing via anomalous diffraction can provide complementary phase information. When success rates are low but non-zero, increasing computational resources through more trials or larger genetic algorithm populations may eventually yield solutions.
Looking forward, integration of AI-predicted structural models offers promising avenues for further constraint enhancement. Beyond reference histograms, high-confidence prediction regions (pLDDT scores > 90) can serve as partial structural constraints, reducing the phase problem to determining only the uncertain fragments. This hybrid approach may reduce solvent content requirements, extending applicability into the medium- and low-solvent-content range. Combined with the multi-algorithm framework established in this work, such AI-augmented strategies represent a natural evolution toward widely applicable structure determination.

5. Conclusions

This study addresses two challenges in direct-method phase retrieval for protein crystals, discontinuous density modification and crude molecular envelope reconstruction, through a systematic approach. We introduced continuous iterative projection algorithms into this domain, validating their value in enhancing convergence stability by implementing smooth density transitions at the molecular interface. We developed improved algorithm variants, including MCHIO, MDM, and MRAAR, that optimize constraint enforcement, achieving performance levels comparable to or exceeding established methods (Figure 5b). Moreover, the proposed two-step refined envelope reconstruction scheme serves as a universal enhancement, elevating average phase retrieval success rates (Figure 8a). This demonstrates that optimizing foundational constraints can be as impactful as algorithmic innovations. While continuous algorithms demonstrate competitive performance comparable to HIO, algorithm effectiveness varies across individual structures, and no single method universally excels for all cases. For practical structure determination, we recommend testing multiple algorithms from both continuous (CHIO, HPR, MCHIO, THIO) and classical (HIO, MDM, MRAAR) categories. At the strategy level, the resolution-weighted strategy provided limited improvement, while the genetic algorithm co-evolution strategy delivered breakthrough performance that enabled multi-solution averaging, reducing mean phase error by approximately 6.83° (Figure 11a,b). This precision gain was achieved universally across different solvent content levels, consistently improving model quality.
Our methods shift the success rate versus solvent content curve upward, extending the applicability of direct methods within the high-solvent-content range to its lower boundary around 55% (Figure 10). Method efficacy remains constrained by physical limits, including low solvent contents (approaching or below 55%), insufficient data quality (large R-work values due to measurement errors or poor crystal quality), and complex structural arrangements within the unit cell, so continued development is needed. Nevertheless, the framework constructed in this study provides a powerful and systematic solution for the unbiased structure determination of challenging systems in structural biology. This framework synergistically combines refined envelope reconstruction, continuous and improved iterative projection algorithms whose performance is comparable to or exceeds established methods, and genetic algorithm strategies. The compiled algorithms developed in this work are accessible on GitHub [66].

Author Contributions

H.H.: conceptualization, methodology, software, formal analysis, data curation, writing—original draft preparation, and visualization; Y.L. and R.F. assisted with minor parameter testing; W.-P.S.: writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The diffraction data were downloaded from the Protein Data Bank at https://www.rcsb.org (accessed on 20 December 2025).

Acknowledgments

We acknowledge the computational resources provided by the Department of Physics, Ningbo University.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
ASR      Averaged Successive Reflections
CHIO     Continuous Hybrid Input–Output
DM       Difference Map
GA       Genetic Algorithm
HIO      Hybrid Input–Output
HPR      Hybrid Projection Reflection
IPA      Iterative Projection Algorithms
MASR     Modified Averaged Successive Reflections
MCHIO    Modified Continuous Hybrid Input–Output
MPI      Message Passing Interface
MRAAR    Modified Relaxed Averaged Alternating Reflections
PDB      Protein Data Bank
RAAR     Relaxed Averaged Alternating Reflections
THIO     Transition Hybrid Input–Output

Appendix A. Supplementary Figures, Structure Information, and Numerical Results

Figure A1. Distribution of solvent content in protein crystal structures retrieved from the Protein Data Bank. The histogram (blue bars, left y-axis) shows the number of structures in each 5% solvent content interval, while the curve (dark red line, right y-axis) represents the cumulative percentage calculated from high to low solvent content. The analysis encompasses approximately 199,083 protein crystal structures. Orange and blue dashed lines mark the 55% and 65% solvent content thresholds, corresponding to cumulative percentages of 32.3% and 9.6%, respectively.
Figure A2. Challenging structural distribution patterns within crystallographic unit cells that impede envelope reconstruction and reduce direct phasing success rates. All molecular surfaces are rendered from PDB-deposited coordinates. (a) Structure 2bke illustrating severe interdigitation between protein surfaces (gray) and bulk solvent regions (white), creating highly irregular envelope topologies with fragmented bulk solvent that hinder accurate ab initio envelope reconstruction. (b,c) Structure 3rd5 demonstrating two equivalent origin choices in space group P2₁2₁2₁, showing highly similar envelope configurations that cause oscillation between equivalent solutions during iterative reconstruction. (d) Superposition of panels (b,c) highlighting the spatial proximity of envelopes from different origin choices. (e) Structure 3rd5 showing numerous ordered water molecules (red spheres) bound to the protein surface, which must be encompassed by the reconstructed envelope, thereby reducing available bulk solvent volume. (f–i) Structure 4qb6 displaying four equivalent origin choices, all exhibiting quite similar envelope topologies. (j) Superposition of panels (f–i) demonstrating the convergence challenge when multiple equivalent origin selections produce indistinguishable envelope geometries. Structures with limited solvent content combined with one or more of these geometric complications present substantial challenges for direct phasing methods. Visualization using PyMOL 3.1 [62].
Figure A2. Challenging structural distribution patterns within crystallographic unit cells that impede envelope reconstruction and reduce direct phasing success rates. All molecular surfaces are rendered from PDB-deposited coordinates. (a) Structure 2bke illustrating severe interdigitation between protein surfaces (gray) and bulk solvent regions (white), creating highly irregular envelope topologies with fragmented bulk solvent that hinder accurate ab initio envelope reconstruction. (b,c) Structure 3rd5 demonstrating two equivalent origin choices in space group P 2 1 2 1 2 1 , showing highly similar envelope configurations that cause oscillation between equivalent solutions during iterative reconstruction. (d) Superposition of panels (b,c) highlighting the spatial proximity of envelopes from different origin choices. (e) Structure 3rd5 showing numerous ordered water molecules (red spheres) bound to the protein surface, which must be encompassed by the reconstructed envelope, thereby reducing available bulk solvent volume. (fi) Structure 4qb6 displaying four equivalent origin choices, all exhibiting quite similar envelope topologies. (j) Superposition of panels (fi) demonstrating the convergence challenge when multiple equivalent origin selections produce indistinguishable envelope geometries. Structures with limited solvent content combined with one or more of these geometric complications present substantial challenges for direct phasing methods. Visualization using PyMOL 3.1 [62].
Figure A3. Comprehensive algorithm-by-algorithm comparison under three phasing strategies. Success rates (a–c), minimum convergence iterations (d–f; lower is better), and median convergence iterations (g–i; lower is better) are shown for the conventional (first column), resolution-weighted (second column), and GA-enhanced (third column) phasing strategies. Each panel directly compares the performance of all 10 iterative projection algorithms using the one-step coarse envelope (blue bars) versus the two-step refined envelope (orange bars). This detailed breakdown allows for the evaluation of individual algorithm sensitivity to envelope design and phasing strategy. Key observations include the universal enhancement provided by the refined envelope reconstruction scheme, the breakthrough performance of the GA strategy, and the formation of a top-tier performance group comprising continuous algorithms (CHIO, HPR, MCHIO, THIO), improved classical variants (MDM, MRAAR), and HIO.
Figure A4. Correlation between minimum convergence iterations and solvent content for conventional (first column), resolution-weighted (second column), and GA-enhanced (third column) phasing strategies. (a–c) Coarse envelope design results; (d–f) improved envelope design results; (g–i) overlaid comparison. Linear regression analysis reveals negative correlation between minimum iterations and solvent content across all strategies and designs. Pearson correlation coefficients (r) and p-values are shown in each panel. The improved envelope design consistently achieves lower minimum iterations than the coarse design at equivalent solvent contents.
Table A1. Structure information, data statistics, and ab initio phasing results for 28 test cases.
| PDB ID | Description | Space Group | PDB Reported R-Work | Resolution Range High (Å) | Resolution Range Low (Å) | Number of Non-Hydrogen Atoms | Number of Water Molecules | Number of Reflections Observed | Number of Missing Low-Resolution Reflections | Matthews Coefficient Corresponding Solvent Content (%) | Volume Not Occupied by Model (%) | PDB Posted Solvent Content (%) | Δφ Before Averaging Multi Solutions (°) | Δφ After Averaging Multi Solutions (°) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1a53 | Indole-3-glycerol phosphate synthase | $P2_12_12_1$ | 0.159 | 2.0 | 15.4 | 2264 | 242 | 27,283 | 86 | 64.70 | 58.0 | 68.50 | 31.04 | 27.14 |
| 1gaj | ABC transporter ATP-binding protein | $P432$ | 0.205 | 2.5 | 39.0 | 2171 | 133 | 13,564 | 12 | 65.23 | 58.7 | 67.04 | 49.96 | 41.62 |
| 1gc7 | Radixin FERM domain | $P4_12_12$ | 0.215 | 2.8 | 30.0 | 2482 | 0 | 15,929 | 22 | 72.09 | 66.8 | 71.87 | 39.60 | 32.15 |
| 1gk6 | Vimentin coil 2B fragment | $P3_121$ | 0.199 | 1.9 | 35.0 | 1064 | 194 | 15,422 | 3 | 64.07 | 57.3 | 67.50 | 33.54 | 28.45 |
| 1hes | Mu2 adaptin subunit with peptide | $P6_4$ | 0.21 | 3.0 | 42.0 | 2138 | 33 | 13,045 | 6 | 78.39 | 74.3 | 79.00 | 40.55 | 31.79 |
| 1hp4 | Beta-N-acetylhexosaminidase | $P6_122$ | 0.181 | 2.2 | 32.2 | 4273 | 382 | 47,413 | 25 | 67.19 | 61.0 | 69.43 | 27.37 | 23.58 |
| 1nh6 | Chitinase A-hexasaccharide complex | $I222$ | 0.197 | 2.05 | 37.0 | 4858 | 635 | 61,725 | 17 | 66.04 | 59.6 | 66.04 | 32.83 | 28.23 |
| 1nw3 | DOT1L histone methyltransferase | $P6_5$ | 0.204 | 2.5 | 43.0 | 2780 | 68 | 22,926 | 7 | 71.82 | 66.5 | 71.82 | 44.63 | 34.37 |
| 1reu | Bone morphogenetic protein 2 | $H32$ | 0.215 | 2.65 | 19.8 | 833 | 13 | 4918 | 19 | 67.15 | 61.0 | 67.10 | 45.99 | 36.96 |
| 1vh7 | Imidazolglycerolphosphate synthase | $P6_122$ | 0.202 | 1.9 | 36.0 | 2202 | 245 | 34,390 | 10 | 63.36 | 56.4 | 65.67 | 29.54 | 24.47 |
| 1vmg | MazG nucleotide pyrophosphohydrolase | $I4_122$ | 0.142 | 1.46 | 26.0 | 826 | 107 | 24,059 | 8 | 62.40 | 55.3 | 62.64 | 57.15 | 48.93 |
| 1xhd | Acetyltransferase from B. cereus | $I2_13$ | 0.152 | 1.9 | 32.0 | 1516 | 130 | 37,517 | 12 | 77.99 | 73.8 | 80.20 | 23.58 | 20.45 |
| 1yh2 | Ubiquitin-conjugating enzyme | $P4_32_12$ | 0.202 | 2.0 | 20.0 | 1431 | 179 | 18,669 | 31 | 63.94 | 57.1 | 68.00 | 38.11 | 31.92 |
| 2b44 | Lysostaphin-type enzymes | $P3_221$ | 0.181 | 1.83 | 19.3 | 2226 | 188 | 49,103 | 62 | 74.25 | 69.4 | 76.00 | 31.36 | 25.75 |
| 2b71 | Plasmodium cyclophilin-like protein | $P4_132$ | 0.19 | 2.5 | 40.0 | 1473 | 122 | 13,563 | 7 | 73.56 | 68.6 | 71.80 | 36.12 | 28.48 |
| 2bke | Crenarchaeal RadA recombinase | $P3_121$ | 0.174 | 3.2 | 50.0 | 2327 | 40 | 9062 | 3 | 70.72 | 65.2 | 67.80 | 47.25 | 38.90 |
| 2boz | Photosynthetic reaction center | $P3_121$ | 0.174 | 2.4 | 18.0 | 7475 | 368 | 76,249 | 236 | 74.75 | 70.0 | 74.75 | 31.82 | 25.80 |
| 2bt1 | Epstein Barr Virus dUTPase complex | $P6_222$ | 0.19 | 2.7 | 20.0 | 2100 | 57 | 13,164 | 54 | 69.53 | 63.8 | 69.53 | 40.61 | 34.71 |
| 2buj | Serine-threonine kinase 16 complex | $C222_1$ | 0.185 | 2.6 | 64.8 | 4766 | 76 | 33,626 | 5 | 71.12 | 65.7 | 72.30 | 44.27 | 37.24 |
| 2fg0 | Gamma-D-glutamyl endopeptidase | $P4_12_12$ | 0.154 | 1.79 | 29.5 | 3977 | 428 | 68,770 | 26 | 63.86 | 57.0 | 63.86 | 36.79 | 30.92 |
| 3rd5 | Mycobacterium paratuberculosis protein | $P2_12_12_1$ | 0.133 | 1.5 | 29.0 | 2549 | 413 | 69,906 | 15 | 59.52 | 51.9 | 65.00 | 45.14 | 38.61 |
| 3tqe | Malonyl CoA-acyl carrier protein transacylase | $C121$ | 0.146 | 1.5 | 41.0 | 2915 | 347 | 72,567 | 5 | 56.29 | 48.0 | 63.07 | 45.18 | 38.65 |
| 4bsj | VEGFR-3 extracellular domains D4-5 | $P3_121$ | 0.21 | 2.5 | 44.0 | 1722 | 21 | 17,525 | 5 | 76.20 | 71.7 | 74.10 | 45.01 | 37.74 |
| 4gtf | Flavin-dependent thymidylate synthase | $I4_122$ | 0.162 | 1.77 | 39.0 | 2086 | 124 | 34,764 | 6 | 60.72 | 53.3 | 63.38 | 36.08 | 30.11 |
| 4iqk | Keap1 Kelch domain with inhibitor | $C121$ | 0.153 | 1.97 | 17.0 | 2341 | 122 | 30,772 | 57 | 63.25 | 56.3 | 63.77 | 42.57 | 32.45 |
| 4q82 | Phospholipase/Carboxylesterase | $P2_12_12$ | 0.153 | 1.85 | 36.0 | 4204 | 472 | 66,642 | 14 | 62.70 | 55.7 | 62.49 | 36.55 | 27.62 |
| 4qb6 | CBM35 in complex with aldouronic acid | $P2_12_12_1$ | 0.164 | 1.35 | 32.0 | 1265 | 100 | 41,390 | 4 | 54.75 | 46.2 | 55.82 | - | - |
| 4tpl | West Nile virus NS1 protein | $P321$ | 0.17 | 2.9 | 48.5 | 5767 | 62 | 33,918 | 10 | 73.74 | 68.8 | 72.50 | 41.88 | 33.05 |
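For readers unfamiliar with the Matthews-based solvent content column, the conversion can be sketched as follows. This is a minimal illustration of the standard Matthews relation [36], not the authors' script: it assumes the commonly used constant 1.23 Å³/Da (protein partial specific volume of about 0.74 cm³/g), and the cell volume, molecule count, and molecular weight are hypothetical numbers chosen to give a round result.

```python
def matthews_coefficient(cell_volume_A3, n_molecules_in_cell, mol_weight_Da):
    """Matthews coefficient V_M in cubic angstroms per dalton."""
    return cell_volume_A3 / (n_molecules_in_cell * mol_weight_Da)


def solvent_fraction(v_m, protein_constant=1.23):
    """Estimated solvent fraction, 1 - 1.23 / V_M (constant is an assumption)."""
    return 1.0 - protein_constant / v_m


# Hypothetical crystal: 246,000 A^3 cell holding four copies of a 25 kDa protein.
v_m = matthews_coefficient(246_000.0, 4, 25_000.0)
solvent_percent = 100.0 * solvent_fraction(v_m)
print(round(v_m, 2), round(solvent_percent, 1))  # prints: 2.46 50.0
```

This estimate depends only on cell volume and protein mass. The separate column for volume not occupied by the model is consistently lower in the table, and the caption of Figure A2 notes that ordered waters reduce the available bulk solvent volume.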
Figure A5. Correlation between median convergence iterations and solvent content for conventional (first column), resolution-weighted (second column), and GA-enhanced (third column) phasing strategies. (a–c) Coarse envelope design results; (d–f) improved envelope design results; (g–i) overlaid comparison. Linear regression analysis reveals negative correlation between median iterations and solvent content across all strategies and designs. Pearson correlation coefficients (r) and p-values are shown in each panel. The improved envelope design consistently achieves lower median iterations than the coarse design at equivalent solvent contents.
Table A2. Phase recovery success rates (%) for continuous and classical IPAs across 28 protein structures using conventional, resolution-weighted, and GA-enhanced strategies with one-step envelope reconstruction.
ASR | RAAR | DM | MDM | MRAAR | HIO | CHIO | HPR | MCHIO | THIO
PDB ID Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA
1a530010000100000017100029100041000810078310002710004100
1gaj0000000000000200010000000000400100
1gc7000000101422299383310039271002415100151510032231004956100
1gk60000000005207204010098100501710100108100
1hes00000043461005656765460100554810051489958629957561005660100
1hp4000000000200600740220000105196100
1nh6000000000000000000000000000000
1nw300000000000027331001948100680841001581001023100
1reu000000000220254100292302140838239419259540469
1vh7000000000000021004210042200120060100
1vmg0000001221002315100213110020171003440100374203429032520
1xhd0000003844100214610054811005467100100100100100100100100100100100100100
1yh200000000044062046021120152115402727100
2b4400000023211001519100151235421002714066132916254625100
2b71000000251601927100335110033471003841100442710042421003848100
2bke00000000000000000100000000000000
2boz000000423100152510031371002117100041100252238634100034100
2bt10000000500131000502516100201000232010
2buj0000000002000410064002200008510067100
2fg000000000000032000003030231100120
3rd5000000000000010000100000110240
3tqe000000000293017190081002181002714100828100626100
4bsj03000002012710069100251000200152710029100
4gtf0000000008701712100661006230236023191006310
4iqk00000000000020000100210400210430
4q82000000000000010000010000000000
4qb6000000000000000000000000000000
4tpl0000000000310061100100010218000010
Table A3. Minimum iterations required for successful convergence of continuous and classical IPAs across 28 protein structures using conventional, resolution-weighted, and GA-enhanced strategies with one-step envelope reconstruction.
ASR | RAAR | DM | MDM | MRAAR | HIO | CHIO | HPR | MCHIO | THIO
PDB ID Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA
1a53--1465--1113----1386584-476552-2155899-16207835150540257-1093749-2219574
1gaj-------------8730---1483--------8939--3629
1gc7------3322-3633389892337704171273175265313857781029570831761906589141159452350234280114671188
1gk6---------31225854-11844018-650-113449358439816048-92081677226228462912401542
1hes------12815322515493409248125189269791172221875193824965914370217422236831038179331339
1hp4---------6454--6710--23617382-88399475----9416-8763150417023174
1nh6------------------------------
1nw3------------925136883910047786876898999-8256270712292969393110515251509881
1reu---------11733221-12513770104027802144-37912759146373452378565194618966991616634380
1vh7-------------51816711728892552362739088109041--94657335--2486-4673
1vmg------9871945880295417431278276369481333350228131182297153-153221-168105-
1xhd------77131867558375347831359471768950735768125129556941537579
1yh2---------48737447-67269197-44033034-16455105-141892569167139888-184723722163
2b44------592169911217761785637108992618133373948105418849079-86979160884022338680857032870825992
2b71------200512-2563281901941661852171141527701708196142169180127191419140232
2bke-----------------1155------------
2boz------558412447197707235893969481031468698581-39122568223185158585105259984708-31481292
2bt1-------6697--1318781-5758-1623121710959491-8600---946491009149-8939-
2buj---------9146---4505297222354430--88998779---559987067278167758404516
2fg0------------41535427-----5508-7244-916072027491428331704460-
3rd5-------------8752----7228-----67651470-78271691-
3tqe---------23948167-40952179--26921851643224112751105735014884435322512012464315191538
4bsj-2852-----1416-29858067801782149424600715447-8190--867187426734167451825042305539
4gtf---------22653371-5211553133228541822137214333444-20218452-1034301925441171789-
4iqk------------5261----308077317334-3310--33265672-63184492-
4q82-------------6416-----3790----------
4qb6------------------------------
4tpl----------76095275531897379769--2990-9418-222191698995----7088-
Table A4. Median iterations required for successful convergence of continuous and classical IPAs across 28 protein structures using conventional, resolution-weighted, and GA-enhanced strategies with one-step envelope reconstruction.
ASR | RAAR | DM | MDM | MRAAR | HIO | CHIO | HPR | MCHIO | THIO
PDB ID Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA
1a53--1701--1402----1819900-2127800-42111199-5896105473142618601-51101000-5300800
1gaj-------------8730---2007--------9111--4000
1gc7------3322-9200467889234400531049832018499849591100642687473902884984454700595264582802450246801900
1gk6---------44565854-35564018-4834-15005379365843017202-9208626856003204341217441944
1hes------572184450047294949444515589500351469418842986318611858886088571599170296216123825672337
1hp4---------6616--7251--37287598-89759475----9416-9198340335233500
1nh6------------------------------
1nw3------------2417285610892460210190083886233-890659381902743743561591124219681302
1reu---------11733221-51735516150073426938-507868483768763473905708737961634562599350768771
1vh7-------------51817200758192552600804488109073--94657335--3609-5101
1vmg------3751194512051025250881882414376081674631716120249440714501011-1330563-620706-
1xhd------21227421617659505782224434300172343200371240234326348300253252208286364269
1yh2---------53828466-78389197-48728040-53446965-48549256916746002002-470150552538
2b44------1772405014001044677510575149926188959805151140190239393-877994879327872091289182242189026308
2b71------56947436-18785592572288425145001070151440034668818702818087057032214662465210683418600
2bke-----------------1600------------
2boz------58083156110080325678008403935130015021330700-59293205773589388932132780565092-59341802
2bt1-------7346--66361155-7596-6629456818019491-8600---946493899332-8939-
2buj---------9146---5738340023407514--90929232---777291277817323480754801
2fg0------------76727392-----6779-7761-919472217491451431706188-
3rd5-------------8752----7228-----67651470-83321767-
3tqe---------55709181-67306515--68342036643254473012683268785304634062362500830754141816
4bsj-6638-----4118-4226458710723153780130446002067802-8533--8671881967345832490225047003901
4gtf---------39368020-24624716170061363020156016636649-50978932-34378129302019344660-
4iqk------------5261----330277317334-5875--33265672-63324709-
4q82-------------6416-----3790----------
4qb6------------------------------
4tpl----------78335600787297379769--3300-9418-222191699582----7088-
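Tables A2–A4 report three statistics for each structure, algorithm, and strategy: success rate, minimum iterations, and median iterations. The sketch below shows one way such statistics could be derived from a batch of independent runs. It is an illustration under assumptions, not the authors' pipeline: `summarize_runs` is a hypothetical helper, failed runs are encoded as `None`, and the run outcomes are invented.

```python
from statistics import median


def summarize_runs(iterations_per_run):
    """Summarize independent phasing runs.

    iterations_per_run holds the iteration count at convergence for each
    successful run and None for each run that did not converge (an assumed
    encoding, not the authors' data format).
    """
    converged = [n for n in iterations_per_run if n is not None]
    total = len(iterations_per_run)
    return {
        "success_rate_percent": 100.0 * len(converged) / total if total else 0.0,
        "min_iterations": min(converged) if converged else None,
        "median_iterations": median(converged) if converged else None,
    }


# Hypothetical outcome of eight runs: three converged, five did not.
runs = [1200, None, 800, None, 1000, None, None, None]
summary = summarize_runs(runs)
print(summary)
```

When no run converges, the minimum and median are undefined, which presumably corresponds to the dash entries in Tables A3 and A4 wherever the matching success rate in Table A2 is zero.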
Table A5. Phase recovery success rates (%) for continuous and classical IPAs across 28 protein structures using conventional, resolution-weighted, and GA-enhanced strategies with two-step envelope reconstruction.
ASR | RAAR | DM | MDM | MRAAR | HIO | CHIO | HPR | MCHIO | THIO
PDB ID Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA
1a5300100001000000711006751000101000410037310001010004100
1gaj000000000101001121100821100633100723100817100823100
1gc7000000441002110983733048311004735100503510052421005048100
1gk60000000206410013191007410013121002110100121510088100
1hes0000004452100394284565010058501006150100485210061521005042100
1hp400000000092012601915100170340012401419100
1nh6000000000000000620001000000010060100
1nw3000000000000294210054481004835100483110044421004650100
1reu0000000001512100352110058481003860100293710042401005650100
1vh7000000000250100881001527100212310019810019271001521100
1vmg00000016120322510032381003137019311002633100312710023290
1xhd00000050441006575100100100100100100100100100100100100100100100100100100100
1yh200000000027802547825291002562104017101002912100
2b44000000192510027231000204838100442102115244421993846100
2b71000000101002915100355210050461003546100525410035311002940100
2bke0000000002000002010000000420100020
2boz000000027100334210040311000311001552100465810019481002754100
2bt12002000010021510008100192510082710017810017311002750100
2buj00000000022100660172010210015210019410084100
2fg00000000008407220500410032004017100
3rd50000000003104101801410045013015100
3tqe00000000075331005860100640100452100545010021581001238100
4bsj0400000010010421001512100815100152710012311002191001219100
4gtf00000002029101002540100102101227038380153510019210
4iqk000000000102010012101020600000020
4q820000000002002227060000400000000
4qb6000000000000000000000000000000
4tpl0000000006810048100261000274881002610008100
Table A6. Minimum iterations required for successful convergence of continuous and classical IPAs across 28 protein structures using conventional, resolution-weighted, and GA-enhanced strategies with two-step envelope reconstruction.
ASR | RAAR | DM | MDM | MRAAR | HIO | CHIO | HPR | MCHIO | THIO
PDB ID Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA
1a53--1640--1997----5113063331640478-1801484-117910145771706521-3488452-5047846
1gaj---------693-9191898381859022031019111015021661664581261911251876101810649551459670
1gc7------16197012254975314143397951798-4455536577414191146807120910663601129129766610441131
1gk6-------6651-5841129561476543367436137728833912532904832287737384558540556835349
1hes------1358021981519265920116829234717522622714724449113014034614926427191243265
1hp4---------14053909-47803570-937380815741714-91618368--14138889-105020274163
1nh6---------------56368465---6611-----36603602-2564
1nw3------------12281057848800121643855576894052412519429291388545463873640
1reu---------668555568114714878974636395286181122185117016653969682781807788579924
1vh7---------734-159134601436306117949578801748292113281935376115311699217183288012001296
1vmg------670725-228396424173272320270325-248321180236296257168295258436325-
1xhd------7063696451864756861177776739863785875107108689589838794
1yh2---------20872559-15908081755815951651216910144537900434296839-17173889445611188052902
2b44------666249887533812181168-9223-4432990154310928457-473490338717743706944834641358986
2b71------31502190-171176233278325383162231182195206271211201278167105158184196134
2bke---------5408-----7768-2365-----81423096-6189-4794-
2boz-------1392915470135557279216071137-1310769579117252781910006865611015597934697532
2bt17511--7806----299589201868868-13398599057777677964564792791552212966632374176114701007640
2buj---------19036964113126563770-24493727-190787792883167262203960148452042904183373421447
2fg0---------28443622-329984508504-5444--4849389557596009--5304-466727441920
3rd5---------15798055-20637747-51261576-78023427509730202422-87403050-399623162435
3tqe---------5323465176293516371972414513911226438514545097204919291566319315411356443918751285
4bsj-6951------10291305500592135918905874531133541918115439013803487746365798824591941340
4gtf-------5149-112710557851329104818142010758-19621168-14812608-2414128013161056841-
4iqk---------15678230-3110-9233699461529314-9016-3644------5534-
4q82---------7035--746777528102-3553----6915--------
4qb6------------------------------
4tpl---------140366532673282529293187247439062046-29728847533044902718342925552168-44332978
Table A7. Median iterations required for successful convergence of continuous and classical IPAs across 28 protein structures using conventional, resolution-weighted, and GA-enhanced strategies with two-step envelope reconstruction.
ASR | RAAR | DM | MDM | MRAAR | HIO | CHIO | HPR | MCHIO | THIO
PDB ID Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA Con Res GA
1a53--1902--2304----170558149561824700-2337800-2248120065793008801-5652700-63181200
1gaj---------693-11106182622010007427438613897468581497670924613140176765719140155105558912
1gc7------48388239290035412594380032513246-284430341000233242631524306041001600340429451800303446361500
1gk6-------6651-1124265210112060291270245283452608196223377023355397511031569467690130982360600
1hes------50816454743309410937312201048600482758500525939128275610631010503498600342494500
1hp4---------41963909-53853658-4057530918006048-93499142--35209100-445451804400
1nh6---------------58428465---6900-----40006136-2800
1nw3------------184432461114205628167002437247413042349217612011715428890216341633901
1reu---------223826349693534558215005093533690263165630602413872231340488856451235422946821400
1vh7---------3457-1900432051993600401759121500453054631606443876081901377434641301271639601578
1vmg------39222889-82113656385239025389521565-75271941768014055046181982507918894-
1xhd------24020520019855985081539594300381408218414392219480434300266428236474326215
1yh2---------35264809-54188779800027945534240147645277902047426922-355481284901522946383150
2b44------14295854110082316201455-9223-15698159191057589188-876492369264250886064904119940681462
2b71------63674319-119253073817211285700727142740113771068601104890260014428625001026836500
2bke---------5408-----7768-2802-----90363096-6848-4794-
2boz-------2686114621244732716133436501402-396890513392087903127133759151159231180127641896720
2bt17511--7806----3400892047221222-5580120032833486110162743596801829076621800621052052100294631811100
2buj---------19036964144030868763-40943727-339887793301397762204301304165843102396484011800
2fg0---------60966188-588086778738-6405--5510430073987578--7010-466762662188
3rd5---------24688055-37667747-51265140-78024060536267163691-87406559-399633742701
3tqe---------538667742076403663452304606048911430450933495362506859001824672045981624609440321574
4bsj-7900------150039743023808517556091004382015968065258589580462032426119263655657110749074305616
4gtf-------5149-29933484110038524506210136453312-44934797-50546480-35373842150843792258-
4iqk---------78518230-4158-9233699483539314-9016-7175------5534-
4q82---------7035--746777528515-4613----7956--------
4qb6------------------------------
4tpl---------647367643011481277483604247439392306-29729081754659463002342928742556-61493300

References

  1. Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A.J.; Bambrick, J.; et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024, 630, 493–500.
  2. Terwilliger, T.C.; Afonine, P.V.; Liebschner, D.; Croll, T.I.; McCoy, A.J.; Oeffner, R.D.; Williams, C.J.; Poon, B.K.; Richardson, J.S.; Read, R.J.; et al. Accelerating crystal structure determination with iterative AlphaFold prediction. Acta Cryst. D 2023, 79, 234–244.
  3. Li, Z.; Fan, H.; Ding, W. Solving protein structures by combining structure prediction, molecular replacement and direct-methods-aided model completion. IUCrJ 2024, 11, 152–167.
  4. Sayre, D. The squaring method: A new method for phase determination. Acta Cryst. 1952, 5, 60–65.
  5. Cochran, W.T. Relations between the phases of structure factors. Acta Cryst. 1955, 8, 473–478.
  6. Karle, J.; Hauptman, H. A theory of phase determination for the four types of non-centrosymmetric space groups 1P222, 2P22, 3P12, 3P22. Acta Cryst. 1956, 9, 635–651.
  7. Schenk, H. An Introduction to Direct Methods: The Most Important Phase Relationships and Their Application in Solving the Phase Problem; University College Cardiff Press: Cardiff, UK, 1984.
  8. Miller, R.; DeTitta, G.T.; Jones, R.; Langs, D.A.; Weeks, C.M.; Hauptman, H.A. On the application of the minimal principle to solve unknown structures. Science 1993, 259, 1430–1433.
  9. Sheldrick, G.M. A short history of SHELX. Acta Cryst. A 2008, 64, 112–122.
  10. Giacovazzo, C.; Siliqi, D.; Gonzalez Platas, J.; Hecht, H.J.; Zanotti, G.; York, B. The Ab Initio Crystal Structure Solution of Proteins by Direct Methods. VI. Complete Phasing up to Derivative Resolution. Acta Cryst. A 1996, 52, 813–825.
  11. Su, W.-P. Retrieving low- and medium-resolution structural features of macromolecules directly from the diffraction intensities – A real-space approach to the X-ray phase problem. Acta Cryst. A 2008, 64, 625–630.
  12. Fienup, J.R. Phase retrieval algorithms: A comparison. Appl. Opt. 1982, 21, 2758–2769.
  13. Elser, V. Phase retrieval by iterated projections. J. Opt. Soc. Am. A 2003, 20, 40–55.
  14. Elser, V. Solution of the crystallographic phase problem by iterated projections. Acta Cryst. A 2003, 59, 201–209.
  15. Millane, R.P.; Stroud, W.J. Reconstructing symmetric images from their undersampled Fourier intensities. J. Opt. Soc. Am. A 1997, 14, 568–579.
  16. Millane, R.P. Phase retrieval in crystallography and optics. J. Opt. Soc. Am. A 1990, 7, 394–411.
  17. van der Plas, J.L.; Millane, R.P. Ab initio phasing in protein crystallography. In Image Reconstruction from Incomplete Data; SPIE: Bellingham, WA, USA, 2000; Volume 4123.
  18. Miao, J.; Sayre, D.; Chapman, H.N. Phase retrieval from the magnitude of the Fourier transforms of non-periodic objects. J. Opt. Soc. Am. A 1998, 15, 1662–1669.
  19. Marchesini, S.; He, H.; Chapman, H.N.; Hau-Riege, S.P.; Noy, A.; Howells, M.R.; Weierstall, U.; Spence, J.C.H. X-ray image reconstruction from a diffraction pattern alone. Phys. Rev. B 2003, 68, 140101.
  20. Lunin, V.Y.; Lunina, N.L.; Petrova, T.E.; Urzhumtsev, A.G.; Podjarny, A.D. On the ab initio solution of the phase problem for macromolecules at very low resolution. II. Generalized likelihood based approach to cluster discrimination. Acta Cryst. D 1998, 54, 726–734.
  21. Liu, Z.C.; Xu, R.; Dong, Y.H. Phase retrieval in protein crystallography. Acta Cryst. A 2012, 68, 256–265.
  22. He, H.; Su, W.-P. Direct phasing of protein crystals with high solvent content. Acta Cryst. A 2015, 71, 92–98.
  23. Millane, R.P.; Lo, V.L. Iterative projection algorithms in protein crystallography. I. Theory. Acta Cryst. A 2013, 69, 517–527.
  24. Lo, V.L.; Kingston, R.L.; Millane, R.P. Iterative projection algorithms in protein crystallography. II. Application. Acta Cryst. A 2015, 71, 451–459.
  25. Lo, V.L.; Kingston, R.L.; Millane, R.P. Iterative projection algorithms for ab initio phasing in virus crystallography. J. Struct. Biol. 2016, 196, 407–413.
  26. Kingston, R.L.; Millane, R.P. A general method for directly phasing diffraction data from high-solvent-content protein crystals. IUCrJ 2022, 9, 648–665.
  27. Barnett, M.J.; Millane, R.P.; Kingston, R.L. Analysis of crystallographic phase retrieval using iterative projection algorithms. Acta Cryst. D 2024, 80, 800–818.
  28. Pan, T.; Dramko, E.; Miller, M.D.; Kyrillidis, A.; Phillips, G.N., Jr. Completion of partial structures using Patterson maps with the CrysFormer machine-learning model. Acta Cryst. D 2025, 81, 668–677.
  29. He, H.; Jiang, M.C.; Su, W.-P. Direct phasing of protein crystals with non-crystallographic symmetry. Crystals 2019, 9, 55.
  30. Fu, R.; Su, W.-P.; He, H. Refining protein envelopes with a transition region for enhanced direct phasing in protein crystallography. Crystals 2024, 14, 85.
  31. He, H.; Su, W.-P. Improving the convergence rate of a hybrid input-output phasing algorithm by varying the reflection data weight. Acta Cryst. A 2018, 74, 36–43.
  32. Fu, R.; Su, W.-P.; He, H. Genetic algorithm-enhanced direct method in protein crystallography. Molecules 2025, 30, 288.
  33. Zhang, K.Y.J.; Main, P. Histogram matching as a new density modification technique for phase refinement and extension of protein molecules. Acta Cryst. A 1990, 46, 41–46.
  34. Fienup, J.R. Phase retrieval with continuous version of hybrid input-output. In Frontiers in Optics, OSA Technical Digest (CD); Optica Publishing Group: Washington, DC, USA, 2003.
  35. Bauschke, H.H.; Combettes, P.L.; Luke, D.R. Hybrid projection-reflection method for phase retrieval. J. Opt. Soc. Am. A 2003, 20, 1025–1034.
  36. Matthews, B.W. Solvent Content of Protein Crystals. J. Mol. Biol. 1968, 33, 491–497.
  37. Wang, B.C. Resolution of phase ambiguity in macromolecular crystallography. Methods Enzymol. 1985, 115, 90–112.
  38. Chojnowski, G.; Pereira, J.; Lamzin, V.S. Sequence assignment for low-resolution modeling of protein crystal structures. Acta Cryst. D 2019, 75, 753–763.
  39. Kovalevskiy, O.; Nicholls, R.A.; Long, F.; Murshudov, G.N. Overview of refinement procedures within REFMAC5: Utilizing data from different sources. Acta Cryst. D 2018, 74, 492–505.
  40. Cowtan, K. Fitting molecular fragments into electron density. Acta Cryst. D 2008, 64, 83–89.
  41. Winn, M.D.; Ballard, C.C.; Cowtan, K.D.; Dodson, E.J.; Emsley, P.; Evans, P.R.; Keegan, R.M.; Krissinel, E.B.; Leslie, A.G.; McCoy, A.; et al. Overview of the CCP4 suite and current developments. Acta Cryst. D 2011, 67, 235–242.
  42. Adams, P.D.; Afonine, P.V.; Bunkóczi, G.; Chen, V.B.; Davis, I.W.; Echols, N.; Headd, J.J.; Hung, L.-W.; Kapral, G.J.; Grosse-Kunstleve, R.W.; et al. PHENIX: A comprehensive Python-based system for macromolecular structure solution. Acta Cryst. D 2010, 66, 213–221.
  43. Terwilliger, T.C.; Grosse-Kunstleve, R.W.; Afonine, P.V.; Moriarty, N.W.; Zwart, P.H.; Hung, L.-W.; Read, R.J.; Adams, P.D. Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard. Acta Cryst. D 2008, 64, 61–69.
  44. Bauschke, H.H.; Combettes, P.L.; Luke, D.R. Phase retrieval, error reduction algorithm, and Fienup variants: A view from convex optimization. J. Opt. Soc. Am. A 2002, 19, 1334–1345.
  45. Luke, D.R. Relaxed averaged alternating reflections for diffraction imaging. Inverse Probl. 2005, 21, 37–50.
  46. He, H.; Liu, Y.; Su, W.-P. Direct phasing of protein crystals with hybrid difference map algorithms. Molecules, 2026; in press.
  47. Baugh, L.; Phan, I.; Begley, D.W.; Clifton, M.C.; Armour, B.; Dranow, D.M.; Taylor, B.M.; Muruthi, M.M.; Abendroth, J.; Fairman, J.W.; et al. Increasing the structural coverage of tuberculosis drug targets. Tuberculosis 2015, 95, 142–148.
  48. Xu, Q.; Sudek, S.; McMullan, D.; Miller, M.D.; Geierstanger, B.; Jones, D.H.; Krishna, S.S.; Spraggon, G.; Bursalay, B.; Abdubek, P.; et al. Structural basis of murein peptide specificity of a gamma-D-glutamyl-L-diamino acid endopeptidase. Structure 2009, 17, 303–313.
  49. Afonine, P.V.; Grosse-Kunstleve, R.W.; Echols, N.; Headd, J.J.; Moriarty, N.W.; Mustyakimov, M.; Terwilliger, T.C.; Urzhumtsev, A.; Zwart, P.H.; Adams, P.D. Towards automated crystallographic structure refinement with phenix.refine. Acta Cryst. D 2012, 68, 352–367.
  50. Cowtan, K. The Clipper C++ libraries for X-ray crystallography. IUCr Comput. Comm. Newsl. 2003, 2, 4–9. Available online: http://www.iucr.org/resources/commissions/computing/newsletters/2 (accessed on 1 December 2025).
  51. Wilson, A.J.C. The probability distribution of X-ray intensities. Acta Cryst. 1949, 2, 318–321. [Google Scholar] [CrossRef]
  52. Fokine, A.; Urzhumtsev, A. Flat bulk-solvent model: Obtaining optimal parameters. Acta Cryst. D 2002, 58, 1387–1392. [Google Scholar] [CrossRef]
  53. Osipiuk, J.; Zhou, M.; Moy, S.; Collart, F.; Joachimiak, A. X-Ray Crystal Structure of Putative Acetyltransferase, Product of BC4754 Gene from Bacillus cereus; RCSB PDB: Piscataway, NJ, USA, 2026. [Google Scholar]
  54. Sainz-Polo, M.A.; Valenzuela, S.V.; Gonzalez, B.; Pastor, F.I.; Sanz-Aparicio, J. Structural analysis of glucuronoxylan-specific xyn30D and its attached CBM35 domain gives insights into the role of modularity in specificity. J. Biol. Chem. 2014, 289, 31088–31101. [Google Scholar] [CrossRef]
  55. Kim, Y.; Hatzos-Skintges, C.; Endres, M.; Joachimiak, A. Crystal Structure of Phospholipase/Carboxylesterase from Haliangium ochraceum. To be published. 2026. Available online: https://www.rcsb.org/structure/4Q82 (accessed on 1 December 2025).
  56. Ariza, A.; Richard, D.L.; White, M.F.; Bond, C.S. Conformational flexibility revealed by the crystal structure of a crenarchaeal RadA. Nucleic Acids Res. 2005, 33, 1465. [Google Scholar] [CrossRef]
  57. Akey, D.L.; Brown, W.C.; Konwerski, J.R.; Ogata, C.M.; Smith, J.L. Use of massively multiple merged data for low-resolution S-SAD phasing and refinement of flavivirus NS1. Acta Cryst. D 2014, 70, 2719–2729. [Google Scholar] [CrossRef]
  58. Mark, B.L.; Vocadlo, D.J.; Knapp, S.; Triggs-Raine, B.L.; Withers, S.G.; James, M.N. Crystallographic evidence for substrate-assisted catalysis in a bacterial beta-hexosaminidase. J. Biol. Chem. 2001, 276, 10330–10337. [Google Scholar] [CrossRef] [PubMed]
  59. Aronson, N.N., Jr.; Halloran, B.A.; Alexyev, M.F.; Amable, L.; Madura, J.D.; Pasupulati, L.; Worth, C.; Van Roey, P. Family 18 chitinase-oligosaccharide substrate interaction: Subsite preference and anomer selectivity of Serratia marcescens chitinase A. Biochem. J. 2003, 376, 87–95. [Google Scholar] [CrossRef] [PubMed]
  60. Potter, J.A.; Fyfe, P.K.; Frolov, D.; Wakeham, M.C.; Van Grondelle, R.; Robert, B.; Jones, M.R. Strong effects of an individual water molecule on the rate of light-driven charge separation in the rhodobacter sphaeroides reaction center. J. Biol. Chem. 2005, 280, 27155. [Google Scholar] [CrossRef] [PubMed]
  61. Koehn, E.M.; Perissinotti, L.L.; Moghram, S.; Prabhakar, A.; Lesley, S.A.; Mathews, I.I.; Kohen, A. Folate binding site of flavin-dependent thymidylate synthase. Proc. Natl. Acad. Sci. USA 2012, 109, 15722–15727. [Google Scholar] [CrossRef]
  62. Schrödinger LLC. The PyMOL Molecular Graphics System; Version 3.0; Schrödinger LLC: New York, NY, USA, 2025. [Google Scholar]
  63. Millane, R.P.; Arnal, R.D. Uniqueness of the macromolecular crystallographic phase problem. Acta Cryst. A 2015, 71, 592–598. [Google Scholar] [CrossRef]
  64. Elser, V.; Millane, R.P. Reconstruction of an object from its symmetry—Averaged diffraction pattern. Acta Cryst. A 2008, 64, 273–279. [Google Scholar] [CrossRef]
  65. Marcotte, D.; Zeng, W.; Hus, J.C.; McKenzie, A.; Hession, C.; Jin, P.; Bergeron, C.; Lugovskoy, A.; Enyedy, I.; Cuervo, H.; et al. Small molecules inhibit the interaction of Nrf2 and the Keap1 Kelch domain through a non-covalent mechanism. Bioorg. Med. Chem. 2013, 21, 4011–4019. [Google Scholar] [CrossRef]
  66. Direct Phasing of Protein Crystals with Continuous IPAs and Refined Envelope. Available online: https://github.com/hhe2/Direct-Phasing-of-Protein-Crystals-with-Continuous-IPAs-and-Refined-Envelope (accessed on 15 December 2025).
Figure 1. Schematic illustration of the iterative projection algorithm for direct phasing. Starting from random initial phases or density, the algorithm alternates between real-space operations and reciprocal-space constraints. Convergence is typically achieved after several thousand iterations, yielding interpretable electron density maps or phases.
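The alternation shown in Figure 1 can be illustrated with a toy one-dimensional example. The sketch below is not the authors' implementation: it assumes a generic Fienup-style HIO update with a fixed binary protein mask, a naive discrete Fourier transform, and a feedback parameter `beta`. All function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; adequate for a toy example."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def project_modulus(rho, target_mag):
    """Reciprocal-space constraint: keep current phases, impose measured magnitudes."""
    F = dft(rho)
    F_new = [mag * cmath.exp(1j * cmath.phase(f)) for f, mag in zip(F, target_mag)]
    # For real input and Hermitian-symmetric magnitudes the result is real
    # up to rounding, so only the real part is kept.
    return [v.real for v in dft(F_new, inverse=True)]

def hio_step(rho, target_mag, in_protein, beta=0.7):
    """One HIO iteration (assumed generic form, not taken from the paper).

    Inside the protein mask the modulus-projected density is accepted;
    in the solvent region the density is pushed towards zero by negative feedback.
    """
    rho_p = project_modulus(rho, target_mag)
    return [p if inside else r - beta * p
            for r, p, inside in zip(rho, rho_p, in_protein)]
```

In this form the update rule switches abruptly at the mask boundary, which is the discontinuity that the continuous variants in Figure 2 are described as smoothing.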
Figure 2. Schematic illustration of density modification strategies across different algorithms. $\rho_k = P_B \rho_k$. The horizontal axis spans from the bulk solvent region (left) through the boundary to the protein region (right), while the vertical axis indicates the protein-solvent boundary defined by the coarse envelope. HIO exhibits discontinuous modification at the boundary, while continuous IPAs (CHIO, HPR, MCHIO, THIO) achieve smooth transitions through a refined transition region. Classical algorithms (DM, ASR, RAAR) show continuous modification but lack protein-region constraints, resulting in lower success rates.
Figure 3. Performance of HPR for ab initio phasing of 3rd5 using a conventional full-resolution strategy with two-step refined envelope reconstruction. Evolution of five quality metrics over 10,000 iterations from 100 independent trials with different random seeds: (a) mean phase error $\Delta\phi$, (b) protein envelope IoU, (c) $R_{\mathrm{work}}$, (d) $R_{\mathrm{free}}$, (e) solvent density deviation $\Delta\rho$. Each curve represents one trial. Four trials successfully converged, showing sudden improvement in all metrics. In iterations 8000–9500, the transition region was linearly reduced from 5% (of the asymmetric unit) to zero, followed by solvent flattening in iterations 9500–10,000. (f) Final R-factor distribution clearly distinguishes the four successful solutions (low R-value) from 96 failed attempts (clustered at high R-values).
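The linear reduction of the transition region described in this caption can be written as a small schedule function. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, and the transition region is assumed to stay at its initial size before the ramp begins, which the caption does not state explicitly.

```python
def transition_fraction(iteration, start=8000, end=9500, initial=0.05):
    """Transition-region size as a fraction of the asymmetric unit.

    Defaults follow the caption: a linear ramp from 5% to zero over
    iterations 8000-9500. Iterations after `end` correspond to the
    solvent-flattening stage, where no transition region remains.
    """
    if iteration <= start:
        return initial  # assumption: constant size before the ramp
    if iteration >= end:
        return 0.0
    return initial * (end - iteration) / (end - start)
```

The same function covers the GA-enhanced run of Figure 7 by calling `transition_fraction(it, start=5500, end=7000)`, assuming the initial size is also 5% there.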
Figure 4. Improved two-step envelope reconstruction versus the conventional one-step approach. The refined method generates transition regions with adaptive thickness based on local density, capturing fine structural details while minimizing protein inclusion, unlike the uniform-shell approach. (a,b) Envelope comparison for 3rd5 and 2fg0 demonstrating superior detail recovery with the two-step method. Blue dashed circles in (b) indicate surface cavities and internal channels. (c,d) Evolution of the 3rd5 envelope during CHIO phasing: cross-section and full view showing progression from incomplete (pre-convergence) to complete protein encapsulation (post-convergence).
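Agreement between a reconstructed envelope and a reference envelope, reported as the protein envelope IoU in Figure 3b, can be computed as intersection over union of two binary masks. The sketch below assumes both masks are flattened to equal-length sequences on the same grid; the function name is hypothetical and the authors' exact definition may differ.

```python
def envelope_iou(mask_a, mask_b):
    """Intersection over union of two binary protein masks on the same grid."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks are treated as identical by convention.
    return inter / union if union else 1.0
```

A value of 1 means the two envelopes coincide voxel for voxel, while values near 0 indicate little overlap.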
Figure 5. Benchmark results on 28 proteins using the conventional full-resolution strategy. (a) Success rates of the HIO and HPR algorithms using the conventional full-resolution phasing strategy with one-step coarse and two-step refined envelope reconstruction schemes for all 28 test structures. (b–d) Success rate, (e–g) minimum convergence iterations, and (h–j) median convergence iterations, organized by algorithm (b,e,h), algorithm type (c,f,i), and envelope scheme (d,g,j). The two-step refined envelope scheme enhances success rates by 60.5% for classical IPAs (excluding ASR and RAAR) and 45.7% for continuous IPAs. Individual algorithm analysis reveals that continuous algorithms (CHIO, HPR, MCHIO, THIO) and improved classical algorithms (MDM, MRAAR) achieve performance comparable to HIO while exceeding ASR, RAAR, and DM. Optimal performance is achieved by combining top-performing algorithms from both categories with refined envelope reconstruction.
Figure 6. Ab initio structure determination for seven proteins. Each column represents one structure. Row 1: PDB-reported structures (gray cartoon, green ligands and ions). Row 2: automatically built models from ab initio phasing results (red cartoon). Row 3: superposition of rows 1 and 2, demonstrating excellent agreement (backbone RMSD < 0.5 Å). Row 4: enlarged views of the regions shown as green sticks in row 1, with electron density maps calculated from ab initio phasing results (orange mesh) superimposed on the PDB-reported structures, highlighting density quality for secondary structures (3rd5, 1hp4), bound ligands (2fg0, 1nh6, 2boz, 4gtf), and ions (2bke, 4gtf). Structure 2boz is a transmembrane protein with membrane-embedded ligands. Visualization using PyMOL 3.1 [62].
Figure 7. GA-enhanced phasing demonstration (CHIO on 3rd5 with refined envelope scheme, 100 trials). (a–e) Quality metrics versus iterations; each line represents one trial. Genetic crossover every 100 iterations enables information sharing. First convergence at around 5100 iterations, population-wide success in about 400 additional iterations. Early stopping at iteration 5500, followed by transition region reduction (iterations 5500–7000) and solvent flattening (iterations 7000–7500). (f) Phase error reduction via multi-solution averaging (100 GA solutions), showing consistent improvement with diminishing returns beyond 20 averaged maps.
Figure 8. Comprehensive performance evaluation of coarse versus refined envelope designs across three phasing strategies (conventional, resolution-weighted, and GA-enhanced). Row 1 (a–c): Strategy comparison for success rate, minimum iterations, and median iterations (averaged over 8 algorithms, excluding ASR and RAAR). Row 2 (d–f): Algorithm-specific trends across strategies. Row 3 (g–i): Envelope design comparison across strategies. Row 4 (j–l): Success rate versus median iterations scatter plots showing algorithm-wise improvements from coarse (blue) to refined (orange) envelopes. Refined envelope consistently outperforms coarse envelope, with GA-enhanced strategy achieving optimal performance.
Figure 9. Average success rate across 28 protein structures ordered by solvent content for three phasing strategies: (a) conventional strategy, (b) resolution-weighted strategy, (c) GA-enhanced strategy. Paired bars compare coarse envelope (blue) versus refined envelope (orange) designs, averaged over 8 algorithms (excluding ASR and RAAR) with 100 trials per algorithm. Across all strategies, success rate generally increases with solvent content, though with notable variation among structures. The refined envelope design consistently outperforms the coarse envelope design regardless of solvent content.
Figure 10. Correlation analysis between success rate and solvent content for coarse and refined envelope designs across three phasing strategies. Rows show: (a–c) coarse envelope design, (d–f) refined envelope design, (g–i) overlaid comparison. Columns represent (a,d,g) conventional strategy, (b,e,h) resolution-weighted strategy, (c,f,i) GA-enhanced strategy. Linear regression analysis reveals positive correlation between success rate and solvent content across all strategies and envelope designs (Pearson r and p-values shown in each panel). The refined envelope design consistently yields higher success rates at equivalent solvent contents, extending the lower limits of direct phasing applicability.
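For readers who want to reproduce this kind of analysis on their own results, the Pearson coefficient quoted in each panel follows the standard definition sketched below. The function name is hypothetical and the numbers in the usage lines are invented placeholders, not values from this study; p-values are not computed here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Placeholder data for illustration only (not taken from the paper).
solvent_content = [55.0, 60.0, 65.0, 70.0, 78.0]   # percent
success_rate = [2.0, 5.0, 4.0, 9.0, 12.0]          # percent of trials
r = pearson_r(solvent_content, success_rate)
```

A positive `r` corresponds to the trend described in the caption, with success rate rising as solvent content increases.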
Figure 11. Benefits of multi-solution averaging in GA phasing. (a) Phase errors before/after averaging for 27 structures (excluding 4qb6) showing consistent improvement. (b) Absolute phase error reduction per structure (average: 6.83°). (c) Phase improvement versus solvent content reveals no correlation, demonstrating uniform benefits across all solvent content ranges. (d,e) Electron density for two ligands in 2boz before (d) and after (e) averaging, demonstrating enhanced density continuity. PDB reported structure displayed as gray sticks.
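The effect of averaging on phase error can be illustrated per reflection. The sketch below rests on several assumptions: all solutions are already aligned to a common origin and enantiomorph, every solution carries the same observed amplitudes (so averaging maps reduces to a circular mean of phases), and the phase error is an unweighted mean absolute difference. The function names are hypothetical, and the authors' metric may be weighted differently.

```python
import math

def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def mean_phase_error(phases, reference):
    """Unweighted mean absolute phase difference in degrees."""
    return sum(abs(wrap_deg(p - r)) for p, r in zip(phases, reference)) / len(phases)

def average_phases(solutions):
    """Per-reflection circular mean over several phase sets, in degrees."""
    averaged = []
    for per_reflection in zip(*solutions):
        s = sum(math.sin(math.radians(p)) for p in per_reflection)
        c = sum(math.cos(math.radians(p)) for p in per_reflection)
        averaged.append(math.degrees(math.atan2(s, c)))
    return averaged
```

Comparing `mean_phase_error` for a single solution with the value for `average_phases(solutions)` against the same reference phases gives the before and after quantities of the kind plotted in panel (a).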
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, Y.; Fu, R.; Su, W.-P.; He, H. Direct Phasing of Protein Crystals with Continuous Iterative Projection Algorithms and Refined Envelope Reconstruction. Biomolecules 2026, 16, 227. https://doi.org/10.3390/biom16020227
