1. Introduction
Protein crystallography remains the primary method for determining three-dimensional structures of biological macromolecules through X-ray diffraction. While diffraction experiments record structure factor amplitudes, the phase information is inherently lost, creating the phase problem. Conventional structure determination approaches, such as molecular replacement, depend on AlphaFold-predicted models [1,2,3] or homologous structures, which may introduce model bias, while experimental phasing methods require heavy-atom derivatization and considerable experimental investment. In contrast, direct methods operate independently of prior structural information, offering an unbiased route to structure determination. This makes them valuable for validating predicted structures and for determining novel protein folds through ab initio approaches. Direct methods iteratively enforce physical constraints in real space, such as uniform bulk solvent density and a characteristic protein density distribution, while simultaneously satisfying the experimental diffraction data in reciprocal space, until convergence to the true electron density is achieved.
Direct phasing methods have evolved from small-molecule techniques to macromolecular approaches. Early methods for small molecules exploited statistical relationships in diffraction data, including Sayre’s equation [4], triplet phase relationships [5], the tangent formula [6,7], and the minimal principle [8], implemented in software such as SHELX [9]. However, these classical techniques require atomic-resolution data better than 1.2 Å and are limited to structures containing fewer than 1000 non-hydrogen atoms [10], rendering them unsuitable for most protein crystals, which diffract to resolutions around 2 Å and contain thousands of non-hydrogen atoms. Traditional methods could only provide low-resolution structural information for macromolecules [11]. The breakthrough came with recognizing that additional physical constraints must be maximally exploited, the most powerful being the nearly constant electron density of bulk solvent regions in protein crystals. This led to the development of iterative projection algorithms (IPAs) designed to leverage this constraint.
IPAs are primary computational tools for the direct solution of macromolecular structures by utilizing constraints in both real and reciprocal space. Seminal contributions include the Hybrid Input–Output (HIO) algorithm by Fienup [12] and the Difference Map (DM) algorithm by Elser [13,14], both applied to ab initio protein structure determination. Millane et al. demonstrated using HIO for phasing periodic structures [15,16,17]. Miao et al. and Marchesini et al. demonstrated using HIO for phasing non-periodic structures [18,19]. Lunin et al. explored the use of density connectivity for searching low-resolution structures [20]. Liu et al. have shown that given a reasonable molecular envelope, HIO can recover atomic-resolution protein structures [21]. Further developments include automated envelope generation for HIO by He and Su [22], the application of DM to proteins and viruses by Lo and Millane [23,24,25], workflows incorporating envelope clustering by Kingston et al. [26,27], and partial structure completion with machine-learning models by Pan et al. [28]. Our previous research introduced ab initio non-crystallographic symmetry (NCS) averaging [29], the transition-region-based THIO algorithm [30], and advanced strategies including resolution-weighted phasing [31] and genetic algorithm co-evolution [32]. Despite these advances, existing IPA-based methods remain largely restricted to structures with solvent content exceeding 65% [32]. However, statistical analysis of the Protein Data Bank (PDB) reveals that only 9.6% of deposited structures exceed this threshold (see Appendix A, Figure A1 for the distribution of solvent content). We categorize crystals as low solvent content (<45%), medium solvent content (45–55%), or high solvent content (>55%), with each category comprising approximately one-third of PDB-reported crystal structures. This distribution highlights the need for methodological improvements to extend applicability toward lower solvent content within the high-solvent range. Two fundamental limitations of current IPAs restrict their broader application to lower-solvent-content and structurally complex systems.
The first limitation arises from discontinuous density modification at the protein-solvent boundary. Traditional IPAs apply different update rules to protein regions, where density is preserved or adjusted through histogram matching [33], versus bulk solvent regions, where strong negative feedback forces density toward zero. This binary treatment imposes an abrupt mathematical discontinuity at the molecular boundary that contradicts the continuous nature of electron density, propagating errors throughout iteration and hindering convergence. The second limitation concerns inaccurately reconstructed molecular envelopes. Starting from random phases, precisely delineating the true protein boundary during early iterations is difficult. Consequently, researchers employ loose, oversized envelopes to ensure complete protein enclosure. However, this approach encompasses substantial volumes of discrete solvent molecules residing within surface cavities and internal channels. Conventional algorithms mistakenly treat these trapped solvent regions as protein density, failing to apply the powerful constant-density constraint these regions would otherwise provide, representing a waste of valuable prior information. When the overall solvent content is already limited, this inefficiency becomes detrimental to successful phase recovery.
To address these challenges, we developed a framework that enhances phase recovery capability and reliability through innovations at three levels. At the algorithmic level, we introduce continuous IPAs, including established algorithms Continuous Hybrid Input–Output (CHIO) [34], Hybrid Projection Reflection (HPR) [35], and Transition Hybrid Input–Output (THIO) [30], along with our improved variant, Modified CHIO (MCHIO). These algorithms establish transition zones between protein and solvent regions, enabling smooth density modification. At the methodological level, we propose a two-step refined envelope reconstruction scheme. This scheme employs a coarse-to-fine strategy using sequential large-radius and small-radius Gaussian filters to identify and recover solvent erroneously included within coarse envelopes. By incorporating recovered solvent into constrained regions, the scheme maximizes utilization of available information. This methodology enhances performance for both continuous and classical IPAs, including HIO, DM, and our improved versions, Modified Difference Map (MDM) and Modified Relaxed Averaged Alternating Reflections (MRAAR). At the strategic level, we performed optimization comparing three phasing strategies: conventional full-resolution, resolution-weighted progressive [31], and genetic algorithm (GA) co-evolution [32]. The GA strategy establishes information-sharing and inheritance mechanisms among multiple independent reconstruction processes, achieving enhancement in global search capability that elevates success rates.
This work makes several contributions. We provide the first comparison and validation of multiple continuous IPAs for protein direct phasing, establishing their advantages and practical performance. We introduce the two-step refined envelope reconstruction scheme as a method of elevating baseline performance across various algorithms. We develop improved iterative algorithms, including MDM, MCHIO, and MRAAR. Through comprehensive testing on 28 protein structures with diverse space groups, solvent contents ranging from 55% to 78%, resolutions spanning 1.46 to 3.2 Å, and PDB-reported R-work below 0.22, we quantify performance gains under different strategies. For structures with favorable conditions, accurate diffraction data, and suitable molecular packing, our methods successfully phase structures and extend applicability to the lower boundary of the high-solvent-content range. Linear regression analysis suggests the potential applicability boundary reaches approximately 55% solvent content, compared with traditional approaches that typically require solvent contents above 65%. This extends the accessible pool from 9.6% (structures > 65%) to potentially 32.3% (structures > 55%) of PDB structures (Appendix A, Figure A1) when data quality and structural characteristics permit, though success rates decrease substantially as solvent content approaches this lower limit.
While the methods presented extend the practical boundaries of direct phasing, we acknowledge their scope and limitations. Success remains dependent on adequate solvent content and sufficient data quality. Although the GA strategy enhances success rates, it requires greater computational resources, and its effectiveness depends on at least one individual achieving convergence. This does not eliminate the dependence of phase recovery on solvent content constraints. For crystals with solvent content approaching or below 55%, integration with higher-quality diffraction data, complementary experimental information, or additional physical constraints will likely remain necessary. Nevertheless, this work provides more powerful computational tools and mechanistic understanding for addressing challenging crystallographic problems, advancing model-free phase retrieval methods toward broader practical applications in structural biology.
2. Materials and Methods
2.1. Overall Workflow and Key Operations in Direct Phasing
Direct phasing in protein crystallography cyclically enforces constraints in both real space and reciprocal space through iterative alternation between these two domains.
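Figure 1 illustrates the workflow of this process. The algorithm begins with structure factor amplitudes $|F_{\mathrm{obs}}(\mathbf{h})|$ obtained from experimentally measured diffraction intensities, where $\mathbf{h}$ denotes the reciprocal lattice indices. Since phase information is unavailable from the experiment, an initial set of random phases $\varphi_0(\mathbf{h})$ is generated and combined with the observed amplitudes to construct the initial complex structure factors: $F_0(\mathbf{h}) = |F_{\mathrm{obs}}(\mathbf{h})| \exp[i\varphi_0(\mathbf{h})]$.
The algorithm then enters an iterative loop where constraints are applied to improve the phase estimates. At the k-th iteration, the current electron density $\rho_k(\mathbf{r})$ is first subjected to reciprocal space constraints. This begins with a Fourier transformation converting the real-space density into reciprocal space:
$$F_k(\mathbf{h}) = \int_V \rho_k(\mathbf{r}) \exp(2\pi i\, \mathbf{h}\cdot\mathbf{r})\, d\mathbf{r} = |F_k(\mathbf{h})| \exp[i\varphi_k(\mathbf{h})] \tag{1}$$
The experimental amplitude constraint is enforced by replacing the calculated amplitude $|F_k(\mathbf{h})|$ with the observed amplitude $|F_{\mathrm{obs}}(\mathbf{h})|$ while preserving the calculated phase $\varphi_k(\mathbf{h})$, yielding an updated structure factor:
$$F'_k(\mathbf{h}) = |F_{\mathrm{obs}}(\mathbf{h})| \exp[i\varphi_k(\mathbf{h})] \tag{2}$$
This amplitude replacement ensures that the electron density satisfies the experimental diffraction data at each iteration. The modified structure factor is then transformed back to real space through inverse Fourier transformation, producing an updated density:
$$\rho'_k(\mathbf{r}) = \frac{1}{V} \sum_{\mathbf{h}} F'_k(\mathbf{h}) \exp(-2\pi i\, \mathbf{h}\cdot\mathbf{r}) \tag{3}$$
where $V$ represents the unit cell volume, and $\mathbf{r}$ denotes the real-space coordinate vector.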
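The reciprocal-space half of one cycle can be sketched on a discrete grid as follows. This is a minimal illustration, not the authors' implementation: it uses NumPy's FFT in grid units (so the 1/V factor is absorbed into NumPy's inverse-transform normalization), ignores crystallographic symmetry and the free reflection set, and the names `amplitude_projection`, `true_rho`, and `phi0` are hypothetical.

```python
import numpy as np

def amplitude_projection(rho, f_obs):
    """Fourier transform, replace amplitudes while keeping phases, transform back."""
    f_calc = np.fft.fftn(rho)              # forward transform of the current density
    phases = np.angle(f_calc)              # calculated phases are preserved
    f_new = f_obs * np.exp(1j * phases)    # observed amplitudes are imposed
    return np.fft.ifftn(f_new).real        # updated real-space density

# Toy data: amplitudes from a random "true" map stand in for measured data.
rng = np.random.default_rng(0)
true_rho = rng.random((8, 8, 8))
f_obs = np.abs(np.fft.fftn(true_rho))

# Random starting phases; taking the real part enforces a real starting density.
phi0 = rng.uniform(-np.pi, np.pi, f_obs.shape)
rho0 = np.fft.ifftn(f_obs * np.exp(1j * phi0)).real

rho1 = amplitude_projection(rho0, f_obs)
```

After one call, the Fourier amplitudes of `rho1` match `f_obs`, and applying the function again leaves `rho1` unchanged, which is the defining property of a projection.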
Following reciprocal space operations, constraints are applied in real space through a procedure comprising two components: molecular envelope reconstruction and density modification based on the reconstructed envelope. The envelope reconstruction step aims to identify the approximate spatial region occupied by the protein molecule, the envelope domain $S$, from the current electron density $\rho_k(\mathbf{r})$, which may still contain inaccuracies. This is achieved by computing a weighted average density $\bar{\rho}_k(\mathbf{r})$ through Gaussian low-pass filtering of $\rho_k(\mathbf{r})$:
$$\bar{\rho}_k(\mathbf{r}) = (G_\sigma * \rho_k)(\mathbf{r}) = \int G_\sigma(\mathbf{r} - \mathbf{r}')\, \rho_k(\mathbf{r}')\, d\mathbf{r}' \tag{4}$$
In this expression, $G_\sigma$ denotes a Gaussian kernel with standard deviation $\sigma$ that smooths local density fluctuations and emphasizes the contiguous regions of macromolecular structure. Using the estimated solvent content derived from the Matthews coefficient [36] based on unit cell parameters and protein molecular weight, a threshold value is determined from the distribution of $\bar{\rho}_k(\mathbf{r})$ values throughout the unit cell. Regions where $\bar{\rho}_k(\mathbf{r})$ exceeds this threshold are then assigned to the protein region $S$, although this initial envelope remains coarse. Depending on whether a classical or continuous iterative projection algorithm is employed, a transition region $T$ may be defined at the protein-solvent interface. Once the envelope domain $S$ for the current iteration has been established, which may include the transition region $T$ for continuous algorithms, the density modification rules of the chosen algorithm are applied to $\rho'_k(\mathbf{r})$ to generate the updated density $\rho_{k+1}(\mathbf{r})$ for the subsequent iteration. This density modification step is the core mathematical operation of the algorithm and enforces physical constraints: the protein region density should conform to expected statistical distributions, achieved through histogram matching [33], while the solvent region density should approach a constant value [37].
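A single-step envelope of this kind can be sketched with a periodic Gaussian filter followed by a quantile threshold. The sketch below is illustrative only: the filter is applied in Fourier space so that it wraps around the cell edges, the smoothing width is given in voxels rather than Å, and the function names, the toy map, and the value `sigma_vox=1.5` are assumptions rather than settings from this study.

```python
import numpy as np

def gaussian_smooth_periodic(rho, sigma_vox):
    """Periodic Gaussian low-pass filter applied in Fourier space (sigma in voxels)."""
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in rho.shape], indexing="ij")
    s2 = sum(f ** 2 for f in freqs)                          # squared frequency, cycles/voxel
    kernel_ft = np.exp(-2.0 * np.pi ** 2 * sigma_vox ** 2 * s2)  # transform of a unit-area Gaussian
    return np.fft.ifftn(np.fft.fftn(rho) * kernel_ft).real

def coarse_envelope(rho, solvent_fraction, sigma_vox):
    """Mark as protein the (1 - solvent_fraction) of voxels with highest smoothed density."""
    smoothed = gaussian_smooth_periodic(rho, sigma_vox)
    cutoff = np.quantile(smoothed, solvent_fraction)         # threshold from the value distribution
    return smoothed > cutoff

# Toy map: a dense cube ("protein") in a noisy flat background ("solvent").
rng = np.random.default_rng(1)
rho = rng.normal(0.0, 0.05, (16, 16, 16))
rho[4:12, 4:12, 4:12] += 1.0

mask = coarse_envelope(rho, solvent_fraction=0.7, sigma_vox=1.5)
```

Because the threshold is taken as a quantile of the smoothed map, the envelope volume is fixed by the assumed solvent fraction regardless of the absolute density scale.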
The iterative cycle repeats for several thousand iterations until convergence is achieved. Throughout this process, multiple quality metrics are computed to monitor progress and evaluate convergence. These include the working R-factor $R_{\mathrm{work}}$, the free R-factor $R_{\mathrm{free}}$, the density deviations in protein and bulk solvent regions, and, when a reference structure is available, the mean phase error and the intersection-over-union ratio of the envelope. Successful convergence is characterized by the reduction of both $R_{\mathrm{work}}$ and $R_{\mathrm{free}}$ to stable low values, accompanied by improvements in the other monitored metrics. The result of this iterative refinement is an electron density map representing the protein structure. This density map can be utilized by automated model-building software packages such as ARP/wARP version 8.0 [38,39], Buccaneer [40] from the CCP4 software suite version 9 [41], or Phenix (version 1.16-3549) AutoBuild [42,43] to construct and refine the atomic model, producing a three-dimensional protein structure.
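For orientation, three of the monitored quantities can be written compactly. The definitions below are the conventional ones and are assumptions about how the metrics are computed here: the R-factor uses a single least-squares scale factor, the mean phase error assumes the two phase sets already share origin and enantiomorph, and all function names are hypothetical.

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """R = sum| |Fobs| - k|Fcalc| | / sum|Fobs| with a least-squares scale k."""
    a_obs, a_calc = np.abs(f_obs), np.abs(f_calc)
    k = np.sum(a_obs * a_calc) / np.sum(a_calc ** 2)
    return np.sum(np.abs(a_obs - k * a_calc)) / np.sum(a_obs)

def mean_phase_error_deg(phi_ref, phi_calc):
    """Mean absolute phase difference in degrees, wrapped into [0, 180]."""
    diff = np.angle(np.exp(1j * (phi_calc - phi_ref)))
    return np.degrees(np.mean(np.abs(diff)))

def envelope_iou(mask_a, mask_b):
    """Intersection-over-union of two boolean envelopes (assumes a non-empty union)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union

# Small worked values.
amps = np.array([10.0, 5.0, 2.0])
r_scaled = r_factor(amps, 2.0 * amps)          # a pure scale error gives R = 0
err = mean_phase_error_deg(np.array([0.0, 0.0]),
                           np.array([np.pi / 2, 3 * np.pi / 2]))
iou = envelope_iou(np.array([True, True, False, False]),
                   np.array([False, True, True, False]))
```

Working and free R-factors would be obtained by calling `r_factor` on the working and reserved reflection subsets separately.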
2.2. Mathematical Framework and Continuous Modification of Iterative Projection Algorithms
The mathematical foundation of iterative projection algorithms centers on identifying a solution that satisfies two constraint sets. In crystallographic direct phasing, these are the real-space constraints $C_S$, which encompass the statistical characteristics of protein density distribution and the uniform nature of solvent regions, and the reciprocal-space constraints $C_M$, which represent the experimentally measured diffraction amplitudes. The objective is to refine the electron density through repeated application of the projection operators $P_S$ and $P_M$ until it converges to a state satisfying both constraint sets.
2.2.1. Partitioned Update Form of Classical Iterative Projection Algorithms
Classical iterative projection algorithms such as the Hybrid Input–Output (HIO) algorithm are implemented using a partitioned update strategy adapted to protein crystals. This approach relies on the premise that at the k-th iteration, a molecular envelope domain $S$ can be estimated from the current electron density $\rho_k(\mathbf{r})$. The real-space projection operator incorporates the following: within the envelope domain $S$, constraints appropriate to protein density are enforced through histogram matching procedures, while in the region outside $S$, corresponding to the bulk solvent, a constant-density constraint is applied through solvent flattening operations.
The HIO algorithm [
12] exemplifies this partitioned approach, with its discretized form for protein crystallography expressed as
where
represents the negative feedback factor applied in the solvent region, typically assigned values between 0.7 and 0.9, with
used throughout this study. In this formulation,
denotes the reciprocal-space projection corresponding to the amplitude replacement operation defined in Equation (
2), while
represents the real-space constraint operations that include histogram matching within the protein region and solvent flattening in the solvent region. We have inserted
into the partitioned form of HIO. This formulation reveals the origin of the discontinuity problem: at the envelope boundary, the density update rule abruptly transitions from
inside the protein region to
in the solvent region, potentially introducing a discontinuous step change in the updated density
across this boundary.
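The boundary step described here can be made concrete with a small numerical sketch. It assumes the textbook partitioned HIO form, in which the constrained density is accepted inside the envelope and subtracted with a feedback factor outside it. The exact placement of the real-space operator in this paper's Equation (5) is not reproduced, and `beta = 0.8` is only an illustrative value within the 0.7 to 0.9 range quoted above.

```python
import numpy as np

def hio_update(rho, rho_constrained, protein_mask, beta):
    """Partitioned HIO-style update (assumed standard form).

    rho             : current density (input of the iteration)
    rho_constrained : density after the constraint operations
    protein_mask    : True inside the envelope S
    """
    inside = rho_constrained                      # accept the constrained density
    outside = rho - beta * rho_constrained        # negative feedback in solvent
    return np.where(protein_mask, inside, outside)

# Four neighbouring voxels with identical densities straddling the envelope edge.
rho = np.full(4, 0.5)
rho_constrained = np.full(4, 0.4)
protein_mask = np.array([True, True, False, False])

updated = hio_update(rho, rho_constrained, protein_mask, beta=0.8)
step = updated[1] - updated[2]    # jump created purely by the change of rule
```

Although the inputs are identical on both sides of the boundary, the output jumps from 0.40 to 0.18, which is the kind of artificial step the continuous algorithms aim to soften.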
2.2.2. Continuous Iterative Projection Algorithms: Introduction of a Transition Region
To address the discontinuity problem in classical algorithms, continuous iterative projection algorithms introduce a transition region $T$ at the k-th iteration, positioned at the interface between the protein region and the bulk solvent region. This partitioning divides real space into three zones: the protein core region $S$, the interfacial transition region $T$, and the bulk solvent region. The objective of these continuous algorithms is to implement a smooth update strategy within the transition region that provides gradual and continuous modulation between the density modification rules applied in the protein and solvent regions. We implemented and evaluated four continuous algorithms that achieve this objective through different mathematical approaches. We have adapted the original forms of these algorithms to make them suitable for protein crystallography.
The Continuous Hybrid Input–Output (CHIO) algorithm [34] extends the framework of HIO by introducing intermediate feedback behavior in the transition region. Its update rule is formulated as
where the transition feedback parameter $\beta_T$ is defined as $\beta_T = (1 - \alpha)/\alpha$. In the original CHIO formulation [34], the parameter $\alpha$ is typically set to 0.4, which yields $\beta_T = (1 - 0.4)/0.4 = 1.5$ and results in negative feedback that is stronger in the transition region than in the bulk solvent region, where the feedback factor $\beta$ is typically between 0.7 and 0.9.
The Hybrid Projection Reflection (HPR) algorithm [
35] employs a reflection-projection framework and can be expressed in partitioned form as
where
is an algorithm-specific parameter set to 0.588 in the original formulation [
35]. In typical partitioned implementations, the HPR algorithm applies the same feedback factor uniformly across both the transition region
and the bulk solvent region, with its continuity manifesting through consistent treatment of the entire region outside the protein core
.
Based on our analysis of CHIO and HPR algorithms, we developed the modified CHIO (MCHIO) algorithm to address a limitation. We observed that the relatively strong feedback in CHIO’s transition region, with
, might be excessive when
contains portions of protein side-chain density, potentially leading to inappropriate suppression of structural features. To address this, we propose a modified update rule:
where
represents the adjusted feedback factor for the transition region. To achieve more appropriate behavior that is gentler than bulk solvent treatment while effectively leveraging the solvent constraint, we set
throughout this study. This value is chosen to be lower than both
and
, thereby implementing a gentler, more conservative density modification in the transition region compared with the bulk solvent. This adjustment aims to avoid excessive suppression of potential protein density features that might be incorrectly assigned to the transition region
, while still effectively applying the solvent-like constraint.
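The contrast between CHIO-style and MCHIO-style treatment of the transition region can be illustrated with a three-zone update in which only the transition feedback factor changes. This is a region-based reading of the text under stated assumptions: the operators inside each branch follow the textbook HIO form, the bulk-solvent factor 0.8 is illustrative, and the MCHIO transition factor 0.5 is a placeholder chosen only to be smaller than both other factors, not the value used in this study.

```python
import numpy as np

def three_zone_update(rho, rho_constrained, core_mask, transition_mask,
                      beta_solvent, beta_transition):
    """Protein core, transition region, and bulk solvent each get their own rule."""
    out = rho - beta_solvent * rho_constrained                    # bulk solvent
    out = np.where(transition_mask,
                   rho - beta_transition * rho_constrained, out)  # transition region
    return np.where(core_mask, rho_constrained, out)              # protein core

alpha = 0.4
beta_t_chio = (1.0 - alpha) / alpha      # 1.5, stronger than bulk-solvent feedback
beta_t_mchio = 0.5                       # hypothetical gentler value

rho = np.ones(3)
rho_constrained = np.full(3, 0.2)
core = np.array([True, False, False])
transition = np.array([False, True, False])

chio_like = three_zone_update(rho, rho_constrained, core, transition, 0.8, beta_t_chio)
mchio_like = three_zone_update(rho, rho_constrained, core, transition, 0.8, beta_t_mchio)
```

With these toy numbers the transition voxel is pushed down to 0.70 under the stronger factor but only to 0.90 under the gentler one, so density that actually belongs to a side chain is suppressed less.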
The Transition Hybrid Input–Output (THIO) algorithm, introduced in our previous research [
30], implements a density-weighted assignment approach within the transition region. Its updated formulation is
where the weight function
takes values in
and is computed based on the weighted average density at each grid point
using a small Gaussian kernel radius of approximately
Å. This weight represents the local probability that a given position belongs to the protein rather than solvent. Through this linear interpolation scheme, THIO achieves continuous and physically motivated density modulation throughout the transition region, with the update rule smoothly varying based on local density characteristics.
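The linear interpolation idea can be sketched as a weighted blend of the protein rule and the solvent rule inside the transition region. The blend below is an assumption consistent with the description, not the published THIO formula: the weight is obtained by linearly rescaling a smoothed density between two hypothetical bounds `lo` and `hi`, and the branch operators follow the textbook HIO form.

```python
import numpy as np

def linear_weight(smoothed, lo, hi):
    """Map smoothed density to a weight in [0, 1]; 1 means protein-like."""
    return np.clip((smoothed - lo) / (hi - lo), 0.0, 1.0)

def thio_like_transition_update(rho, rho_constrained, weight, beta):
    """Blend the protein rule and the solvent rule according to the local weight."""
    protein_rule = rho_constrained
    solvent_rule = rho - beta * rho_constrained
    return weight * protein_rule + (1.0 - weight) * solvent_rule

# Three transition voxels with increasing local smoothed density.
smoothed = np.array([0.1, 0.3, 0.5])
w = linear_weight(smoothed, lo=0.1, hi=0.5)          # weights 0, 0.5, 1

rho = np.ones(3)
rho_constrained = np.full(3, 0.2)
blended = thio_like_transition_update(rho, rho_constrained, w, beta=0.8)
```

The update moves continuously from the pure solvent rule (weight 0) to the pure protein rule (weight 1), so no step appears inside the transition region.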
The fundamental concept unifying all these continuous algorithms is the mathematical refinement of electron density update behavior across the molecular boundary to achieve smoother transitions. This enhancement is designed to improve numerical stability and increase the likelihood of convergence when dealing with ambiguous boundaries and limited solvent content.
Figure 2 provides a schematic comparison of the update rules employed by these algorithms, illustrating their distinctive characteristics and the progressive refinement of boundary treatment strategies.
2.2.3. Classical Iterative Projection Algorithms and Improved Variants
The two-step refined envelope reconstruction scheme demonstrated in the subsequent section exhibits universal applicability, providing performance enhancement for both continuous and classical iterative projection algorithms. To establish a comprehensive algorithmic framework, we evaluated classical iterative projection algorithms, including HIO, DM, ASR, and RAAR, and developed improved variants, MDM and MRAAR, optimized for protein crystallography. The HIO algorithm has been presented in the previous section, following the update rule in Equation (5).
The Difference Map (DM) algorithm [
13] generates two candidate solutions at the
k-th iteration:
These solutions satisfy real-space constraint set
and reciprocal-space constraint set
, respectively. The DM iteration formula is
where
is the relaxation parameter (typically 0.5 to 1.0, set to 0.75 in this study) controlling convergence speed and search capability. Convergence is achieved when the two candidate solutions become equal. Considering partitioned updates for protein region
and solvent region, the DM formula can be expressed as
Experimental results revealed that DM exhibits low success rates for protein phase retrieval, primarily due to insufficient constraint enforcement within the protein region during iteration. To address this, we propose an improved variant that explicitly applies constraint operators
and
to the protein region:
This improved formulation, designated Modified DM (MDM), strengthens constraint application in the protein region and enhances phase retrieval performance.
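For readers unfamiliar with the Difference Map, one generic step can be sketched as below. It follows Elser's standard formulation with the common parameter choice of plus and minus 1/β for the two auxiliary estimates, which is an assumption about the form used here. It is neither the partitioned protein form nor the MDM variant, and two toy affine projections stand in for the real-space and amplitude constraints.

```python
import numpy as np

def difference_map_step(x, proj_s, proj_m, beta=0.75):
    """One generic Difference Map iteration (Elser form, gamma = -/+ 1/beta)."""
    gamma_s, gamma_m = -1.0 / beta, 1.0 / beta
    f_s = (1.0 + gamma_s) * proj_s(x) - gamma_s * x   # estimate built from P_S
    f_m = (1.0 + gamma_m) * proj_m(x) - gamma_m * x   # estimate built from P_M
    cand_s = proj_s(f_m)      # candidate satisfying the real-space-like constraint
    cand_m = proj_m(f_s)      # candidate satisfying the reciprocal-space-like constraint
    return x + beta * (cand_s - cand_m), cand_s, cand_m

# Toy constraints: "S" fixes the first component, "M" fixes the second.
def proj_s(x):
    y = x.copy()
    y[0] = 1.0
    return y

def proj_m(x):
    y = x.copy()
    y[1] = 2.0
    return y

x = np.array([5.0, -3.0])
for _ in range(3):
    x, cand_s, cand_m = difference_map_step(x, proj_s, proj_m)
```

In this toy case the iterate reaches the common point (1, 2) and the two candidates coincide, which mirrors the convergence criterion stated above. Real phasing constraints are non-convex, so convergence is far less direct.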
The Averaged Successive Reflections (ASR) algorithm [
44] employs reflection operators
and
, where
I is the identity operator. The ASR update rule is
Substituting the reflection operators and considering partitioned updates yields
The Relaxed Averaged Alternating Reflections (RAAR) algorithm [
45] introduces a relaxation parameter
(typically 0.95) with the update rule:
In partitioned form, this becomes
Both ASR and RAAR exhibited low success rates in our experiments, attributable to insufficient constraint enforcement in the protein region. We developed an improved variant by setting
, which reduces RAAR to the ASR formulation, then explicitly applying operators
to the protein region:
This improved algorithm, designated Modified RAAR (MRAAR) or Modified ASR (MASR), enhances performance. This formulation is mathematically equivalent to the Hybrid Difference Map - formula 1 (HDM-f1) algorithm developed in our concurrent work [46], which was derived from a different starting point through modification of the DM framework, representing a convergence of approaches toward optimal algorithm design. The improved classical algorithms, when combined with the two-step refined envelope reconstruction scheme, expand the available toolkit for direct phasing of protein crystals.
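The relationship between RAAR and ASR can be checked numerically with the generic operator forms. The sketch assumes the standard definitions (reflection as twice the projection minus the identity, and RAAR as a relaxed combination of the ASR step and the reciprocal-space projection). It does not include the partitioned protein form or the MRAAR modification, and the toy projections are placeholders.

```python
import numpy as np

def reflect(proj, x):
    """Reflection operator R = 2P - I."""
    return 2.0 * proj(x) - x

def asr_step(x, proj_s, proj_m):
    """Averaged Successive Reflections: x <- (R_S R_M + I) x / 2."""
    return 0.5 * (reflect(proj_s, reflect(proj_m, x)) + x)

def raar_step(x, proj_s, proj_m, beta=0.95):
    """Relaxed Averaged Alternating Reflections (standard form)."""
    return beta * asr_step(x, proj_s, proj_m) + (1.0 - beta) * proj_m(x)

# Toy constraints: "S" fixes the first component, "M" fixes the second.
def proj_s(x):
    y = x.copy()
    y[0] = 1.0
    return y

def proj_m(x):
    y = x.copy()
    y[1] = 2.0
    return y

x0 = np.array([0.0, 0.0])
asr_result = asr_step(x0, proj_s, proj_m)
raar_beta_one = raar_step(x0, proj_s, proj_m, beta=1.0)
raar_relaxed = raar_step(x0, proj_s, proj_m, beta=0.95)
```

Setting the relaxation parameter to 1 removes the extra projection term, so `raar_beta_one` equals `asr_result`, matching the reduction described in the text.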
2.3. Envelope Reconstruction Strategies: From One-Step Coarse Design to Two-Step Refined Design
The effectiveness of iterative projection algorithms, their capacity to enforce constraints within solvent regions, depends on the accuracy of the reconstructed protein molecular envelope. Traditional envelope reconstruction employs a single-step approach in which the
k-th iteration’s electron density
is convolved with a Gaussian kernel, usually featuring a large radius such as
Å, to produce a smoothed weighted average density map
. Based on the estimated solvent content
derived from the Matthews coefficient [
36], a threshold value
is determined from the statistical distribution of
values. Grid points satisfying the condition
are classified as belonging to the protein region, while the remaining volume is the solvent region. This method offers computational simplicity and can rapidly delineate the general molecular shape during early iteration stages. However, it suffers from limitations that become problematic for challenging structures. The large-scale smoothing operation erases fine structural details at the molecular surface, resulting in boundaries that are rough and imprecise. The resulting coarse envelope tends to be over-expanded to ensure complete enclosure of the protein molecule, which incorporates volumes of solvent molecules that are situated within surface crevices, pockets, and internal channels. In classical iterative algorithms, these incorrectly assigned solvent regions are treated as protein density throughout the modification process, failing to exploit the constant-density constraint that these regions would otherwise provide. This inefficient use of available information becomes detrimental when the overall solvent content is limited.
To overcome these deficiencies and maximize the utilization of all available solvent constraints within the crystal unit cell, we developed a two-step refined envelope reconstruction scheme. The concept underlying this scheme is a strategy that proceeds from coarse localization to fine-scale local refinement. This approach provides a more physically rational foundation for defining the transition region in continuous algorithms and functions as a universal technique capable of enhancing the performance of all iterative projection algorithms.
The first step focuses on coarse envelope generation following an approach similar to the traditional single-step method. A Gaussian kernel with a large radius, set to
Å in this study, is applied to smooth the
k-th iteration’s electron density
, yielding a globally smoothed density distribution
. Using the estimated solvent content
, a threshold value
is determined to define an initial coarse envelope
that serves as the protein candidate region:
The objective of this initial step is to capture the protein’s center of mass and overall molecular shape while avoiding artificial fragmentation of the envelope that could arise from localized density fluctuations. By employing a large smoothing kernel, this step ensures robust identification of the main protein volume even when the current density estimate contains noise or systematic errors.
The second step is the refinement stage that distinguishes our scheme from traditional approaches. Here, a second smoothing operation is performed on the same electron density
using a Gaussian kernel with a smaller radius, set to
Å in this work, which produces a locally refined density map
that preserves fine-scale density variations. All subsequent operations in this step are confined to the interior of the coarse envelope
established in the first step. Within this domain, the values of
at all grid points are ranked by magnitude, and grid points in the lowest approximately 5% (of the asymmetric unit) are identified as transition region
, most likely to represent trapped solvent rather than protein density. This 5% threshold was determined empirically from our benchmark set and proved robust across diverse protein structures. For truly novel structures without homologous or predicted references, this value can be adjusted within a range of 3–7% through systematic trials, as the method shows tolerance to moderate variations in this parameter. The identified low-density regions are then removed from the protein candidate volume to yield refined spatial assignments. We define the refined protein core region as
, representing the volume that remains after removing the identified solvent-like regions from the coarse envelope. The refined transition region is defined as
The bulk solvent region comprises both the external volume lying outside
and the internal stripped region
, reclassified based on low local density. The transition region identifies trapped solvent molecules within surface cavities and internal channels, enabling the application of solvent-like constraints during iteration. This enhances convergence for crystals with limited solvent content by maximizing utilization of available constraint information. However, although physically consisting of trapped solvent, the transition region exhibits slightly non-uniform density due to proximity to the protein surface. Negative feedback applied during iteration drives density toward uniform values, deviating from actual surface solvent density. This slightly non-uniform density is incompatible with the strict uniform-density requirements of comprehensive solvent flattening [
37] in final-stage refinement. Therefore, in the last 2000 iterations, the transition region is linearly reduced from 5% (of the asymmetric unit) to zero over 1500 iterations, followed by 500 iterations of solvent flattening applied uniformly to all trials. Removing the transition region eliminates density constraint errors from surface-proximal solvent and enables optimal uniform-density enforcement in true bulk solvent regions.
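The two-step assignment and the final-stage shrinking of the transition region can be sketched as follows. Several details are assumptions: smoothing widths are in voxels with placeholder values, the 5% is taken as a fraction of all grid points in the map, the ramp is anchored to the last 2000 iterations of a run of known length, and all function names are hypothetical.

```python
import numpy as np

def gaussian_smooth_periodic(rho, sigma_vox):
    """Periodic Gaussian low-pass filter applied in Fourier space (sigma in voxels)."""
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in rho.shape], indexing="ij")
    s2 = sum(f ** 2 for f in freqs)
    return np.fft.ifftn(np.fft.fftn(rho) * np.exp(-2.0 * np.pi ** 2 * sigma_vox ** 2 * s2)).real

def two_step_envelope(rho, solvent_fraction, sigma_large, sigma_small,
                      transition_fraction=0.05):
    """Return (core, transition) masks from a coarse-then-fine analysis."""
    coarse_map = gaussian_smooth_periodic(rho, sigma_large)
    coarse = coarse_map > np.quantile(coarse_map, solvent_fraction)   # step 1
    fine_map = gaussian_smooth_periodic(rho, sigma_small)             # step 2
    inside = np.flatnonzero(coarse)                                   # only look inside the coarse envelope
    n_transition = min(int(round(transition_fraction * rho.size)), inside.size)
    lowest = inside[np.argsort(fine_map.ravel()[inside])[:n_transition]]
    transition_flat = np.zeros(rho.size, dtype=bool)
    transition_flat[lowest] = True                                    # lowest fine-scale density
    transition = transition_flat.reshape(rho.shape)
    return coarse & ~transition, transition

def transition_fraction_schedule(iteration, total_iterations,
                                 start_fraction=0.05, ramp=1500, flatten=500):
    """Constant, then a linear ramp to zero, then zero for the final flattening stage."""
    ramp_start = total_iterations - ramp - flatten
    if iteration < ramp_start:
        return start_fraction
    if iteration >= total_iterations - flatten:
        return 0.0
    return start_fraction * (1.0 - (iteration - ramp_start) / ramp)

rng = np.random.default_rng(2)
rho = rng.normal(0.0, 0.05, (16, 16, 16))
rho[4:12, 4:12, 4:12] += 1.0
core, transition = two_step_envelope(rho, 0.7, sigma_large=2.0, sigma_small=0.8)
```

The core and transition masks are disjoint and together reproduce the coarse envelope. The schedule returns 0.05 until 2000 iterations before the end, 0.025 halfway through the ramp, and 0 for the last 500 iterations.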
This two-step scheme offers three advantages over traditional single-step methods. First, the small-radius filter is sensitive to local low-density features corresponding to surface cavities and internal channels, enabling accurate identification of trapped solvent. Second, the scheme maximizes constraint utilization by identifying trapped solvent within the coarse envelope. Although the transition region is spatially adjacent to protein, it exhibits low-density characteristics of solvent. During iteration, this region is treated with solvent constraints, such as negative feedback in continuous algorithms, increasing the effective constraint volume. As discussed above, the transition region should be removed in the final refinement stage to enable strict solvent flattening. Third, for continuous algorithms, the transition region delineated through refined local density analysis is less likely to contain protein side-chain density compared with regions defined by traditional single-step approaches, improving the purity and effectiveness of applied constraints.
Section 3.3 provides a direct visual comparison using protein structures 3rd5 [47] and 2fg0 [48] as examples, demonstrating that envelopes generated by our two-step method reconstruct finer and more accurate molecular surface details compared with those produced by the traditional one-step approach. Although multi-step refinement with progressively decreasing kernel sizes was tested, we found that the two-step approach represents an optimal balance between envelope accuracy and computational efficiency. Additional refinement steps beyond two did not improve phasing success rates, as both one-step and two-step envelopes represent approximations to the true molecular boundary, and the two-step scheme already provides sufficient accuracy for iterative projection algorithms to converge to correct solutions.
2.4. Phase Retrieval Strategies: From Full-Resolution to Genetic Co-Evolution
We implemented and compared three phasing strategies to search the solution space for correct phases. These strategies optimize data utilization and search organization to enhance success rate and computational efficiency.
2.4.1. Full-Resolution Phasing Strategy
The full-resolution approach represents the fundamental strategy. All available experimental diffraction data, excluding only a reserved free set for cross-validation, participate equally in the reciprocal-space amplitude constraint at every iteration. This strategy applies no preprocessing or weighting to diffraction data, requiring the algorithm to simultaneously satisfy constraints spanning the entire resolution range. The advantages are simplicity and straightforward implementation. However, for high-dimensional optimization problems, requiring immediate fitting of structural information across all spatial scales can create convergence difficulties and entrapment in local minima for structures with limited solvent content.
2.4.2. Resolution-Weighted Progressive Phasing Strategy
The resolution-weighted progressive strategy implements a hierarchical data utilization scheme prioritizing low-resolution diffraction information during initial iterations, then introducing high-resolution data as convergence progresses. This is implemented through a time-dependent weighting function applied to observed structure factor amplitudes. The weight depends on the reciprocal of the resolution spacing and on a time-varying parameter that controls the filter bandwidth. Initially, this parameter is set to 0.8 to 1.0 Å, attenuating high-resolution contributions. As iterations progress, it is reduced toward zero following a predetermined annealing schedule [31]. This progressive expansion from low to high spatial frequencies follows the logic of first establishing the global protein fold before resolving atomic positions, facilitating the establishment of correct low-resolution phase relationships that provide a stable foundation for convergence.
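The behavior of such a progressive weighting can be sketched in a few lines of Python. The functional form is an assumption made purely for illustration: a Gaussian low-pass weight exp(-2π²σ²s²) with a linear annealing of σ. Neither the form nor the schedule is taken from this section, and the names `resolution_weight`, `sigma_schedule`, and `weighted_amplitudes` are hypothetical.

```python
import math

def resolution_weight(s, sigma):
    """Assumed Gaussian low-pass weight; s is 1/d in 1/Å, sigma in Å."""
    return math.exp(-2.0 * math.pi ** 2 * sigma ** 2 * s ** 2)

def sigma_schedule(iteration, n_anneal, sigma0=1.0):
    """Assumed linear annealing: sigma0 at iteration 0, zero from n_anneal onward."""
    if iteration >= n_anneal:
        return 0.0
    return sigma0 * (1.0 - iteration / n_anneal)

def weighted_amplitudes(f_obs, s_values, sigma):
    """Attenuate observed amplitudes according to their resolution."""
    return [f * resolution_weight(s, sigma) for f, s in zip(f_obs, s_values)]
```

With σ = 1.0 Å, a reflection at 2 Å resolution (s = 0.5 Å⁻¹) is strongly attenuated while one at 10 Å is largely retained; once σ reaches zero, every reflection enters with full weight and the scheme reduces to the full-resolution strategy.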
2.4.3. Genetic Algorithm-Enhanced Co-Evolution Phasing Strategy
To overcome limitations of independent random-start searches, we implemented a genetic algorithm co-evolution strategy using population-based intelligence. This strategy treats multiple parallel phase retrieval processes, typically 100 individuals, as an evolutionary population, with each individual’s electron density map representing an organism. By simulating biological evolution through selection, crossover, and mutation, this approach establishes information exchange mechanisms enabling high-quality density features to propagate throughout the population, enhancing global search capability.
The workflow proceeds through interconnected stages. Population initialization generates N individuals, each with random phases ensuring diversity. During independent evolution, all individuals execute a predetermined number of iterations using standard iterative projection algorithms in parallel, employing envelope reconstruction schemes and resolution-weighted strategies. This phase is a local search within each individual’s solution space neighborhood. Periodically, typically every 100 iterations, the algorithm performs evaluation and selection. Individual fitness
is quantified relative to the best and the average values within the population. Individuals with higher fitness have a greater probability of being selected for genetic operations.
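Fitness-weighted selection of this kind is commonly realized as roulette-wheel sampling. The sketch below takes precomputed fitness values as input and does not reproduce the fitness formula itself; the function name `select_parents` and the requirement that the two parents be distinct are assumptions made for illustration.

```python
import random

def select_parents(fitness, rng):
    """Draw two distinct parent indices with probability proportional to fitness."""
    indices = list(range(len(fitness)))
    first = rng.choices(indices, weights=fitness, k=1)[0]
    # Remove the first parent and redraw among the remaining individuals.
    rest = [i for i in indices if i != first]
    rest_weights = [fitness[i] for i in rest]
    second = rng.choices(rest, weights=rest_weights, k=1)[0]
    return first, second

# Individuals 1 and 2 carry all the fitness, so they are always the chosen pair.
parents = select_parents([0.0, 1.0, 1.0, 0.0], random.Random(0))
```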
Genetic operations implement two mechanisms: crossover and mutation. During crossover, two parents are selected according to fitness-weighted probabilities. The asymmetric unit is divided into spatial blocks, and a chosen subset undergoes density value exchange between parents to generate offspring. Prior to crossover, all density maps undergo rotational and translational alignment to eliminate crystallographic origin and enantiomorphic ambiguities. Mutation introduces stochasticity by randomly selecting approximately 1% of grid points in offspring density maps and assigning them new random values, which adds variation and prevents premature convergence. The resulting offspring typically exhibit improved fitness relative to the parent population. Population update then replaces low-fitness individuals with these offspring, forming the next generation. The algorithm then returns to independent evolution for another round of local search, followed by global information exchange. Further details, such as elite inheritance and the similarity penalty that prevents premature convergence, can be found in our previous paper [32].
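The two genetic operators can be illustrated on flattened density maps. This is a simplified sketch under several assumptions: maps are one-dimensional lists that have already been aligned, blocks are contiguous index ranges, only one offspring is produced per parent pair, and mutated values are drawn uniformly from a caller-supplied range. The names `crossover` and `mutate` are hypothetical.

```python
import random

def crossover(parent_a, parent_b, block_size, swap_fraction, rng):
    """Copy parent_a, then take a random subset of blocks from parent_b."""
    child = list(parent_a)
    n_blocks = (len(child) + block_size - 1) // block_size
    n_swap = round(swap_fraction * n_blocks)
    for b in rng.sample(range(n_blocks), n_swap):
        start = b * block_size
        child[start:start + block_size] = parent_b[start:start + block_size]
    return child

def mutate(density, rate, rng, low=0.0, high=1.0):
    """Assign new random values (assumed uniform) to about `rate` of the grid points."""
    mutated = list(density)
    n_mut = round(rate * len(mutated))
    for i in rng.sample(range(len(mutated)), n_mut):
        mutated[i] = rng.uniform(low, high)
    return mutated

# Toy usage: half of the 10 blocks come from parent_b, then 1% of points mutate.
parent_a = [0.0] * 100
parent_b = [1.0] * 100
child = crossover(parent_a, parent_b, block_size=10, swap_fraction=0.5, rng=random.Random(1))
mutant = mutate([0.0] * 1000, rate=0.01, rng=random.Random(2), low=5.0, high=6.0)
```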
The effectiveness of this strategy derives from synergistic effects between local refinement and global information sharing. Once any individual approaches the correct solution, manifested as a sharp R-factor decrease, high-quality density features disseminate throughout the population via crossover. This collective guidance enables population convergence toward the global optimum at rates exceeding independent random searches. The GA strategy upgrades traditional multi-start independent searching to intelligent multi-start cooperative searching with active information sharing, representing an advancement for enhancing the reliability of direct-method phase retrieval in challenging applications.
2.5. Error Metrics, Missing Reflections, and Model Building
We employed a validation pipeline with quality metrics to monitor convergence, strategies to handle incomplete data, and automated model building from recovered density maps.
2.5.1. Error Metrics
Assessment metrics are organized into two categories: reference-dependent metrics for validation during method development and internal consistency metrics applicable to de novo structure determination. Reference-dependent metrics provide validation when PDB coordinates are available. The mean phase error quantifies the angular deviation between retrieved and true phases, averaged over unique reflections in the working set. The envelope intersection-over-union metric assesses the spatial accuracy of the reconstructed molecular boundary by quantifying the overlap ratio between the reconstructed envelope domain and the true envelope domain.
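Both reference-dependent metrics are simple to compute once the retrieved map has been placed on the same origin and hand as the reference. The sketch below assumes an unweighted average of wrapped phase differences in degrees and represents each envelope as a set of grid-point indices; the function names are hypothetical, and any amplitude weighting or alignment step is outside its scope.

```python
def mean_phase_error(phi_retrieved, phi_true):
    """Unweighted mean absolute phase difference in degrees, wrapped to [0, 180]."""
    total = 0.0
    for a, b in zip(phi_retrieved, phi_true):
        diff = abs(a - b) % 360.0
        total += min(diff, 360.0 - diff)
    return total / len(phi_true)

def envelope_iou(reconstructed, reference):
    """Intersection over union of two non-empty sets of grid-point indices."""
    return len(reconstructed & reference) / len(reconstructed | reference)
```

The wrapping matters because phases are periodic: 350° and 10° differ by 20°, not 340°.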
Internal consistency metrics serve as criteria for evaluating convergence when reference coordinates are unavailable. The working R-factor, defined in Equation (24), and the free R-factor measure agreement between calculated and experimental amplitudes. The free R-factor is computed using a randomly selected subset (1% of reflections in this study), excluded from all iterative processes as an independent validation dataset. Successful convergence is characterized by reduction of both R-factors to stable low values. Density convergence indicators monitor the magnitude of real-space density modifications, with separate definitions for HIO-type and MRAAR-type algorithms; for the latter, the update magnitude is averaged over grid points in each region. As convergence proceeds, these update magnitudes diminish toward zero, indicating stable density satisfying constraints.
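For orientation, the standard crystallographic R-factor and a random free-set split can be sketched as follows. The scale factor is assumed here to be the ratio of summed amplitudes, which is one common simple choice and not necessarily the scaling used in this work; the function names and the `seed` parameter are hypothetical.

```python
import random

def r_factor(f_obs, f_calc):
    """R = sum(| |Fo| - k|Fc| |) / sum(|Fo|), with k assumed to be sum|Fo| / sum|Fc|."""
    k = sum(f_obs) / sum(f_calc)
    residual = sum(abs(fo - k * fc) for fo, fc in zip(f_obs, f_calc))
    return residual / sum(f_obs)

def split_free_set(n_reflections, free_fraction=0.01, seed=0):
    """Randomly reserve a fraction of reflection indices as the free set."""
    rng = random.Random(seed)
    n_free = max(1, round(free_fraction * n_reflections))
    free = set(rng.sample(range(n_reflections), n_free))
    work = [i for i in range(n_reflections) if i not in free]
    return work, sorted(free)
```

Applying `r_factor` to the working indices gives the working R-factor, and applying it to the reserved indices, which never enter the amplitude constraint, gives the free R-factor.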
2.5.2. Handling of Missing and Weak Diffraction Data
Diffraction datasets contain missing reflections due to various factors: reflections in the free set, low-resolution reflections (below 15 Å), and weak reflections failing signal-to-noise criteria. To handle such incomplete data, we implemented a dynamic mixing strategy. For missing or weak reflections, the calculated amplitude is blended with the experimental value using a weight factor and a global scale factor, with different weight settings for missing data and for weak data. This strategy ensures reciprocal-space data completeness, preventing information loss and projection artifacts from data gaps.
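One possible form of such a blend is a linear mix of the observed amplitude with the scaled calculated amplitude. The sketch below is only a guess at the general shape: the linear form, the direction of the weight, the default weight of 0.5 for weak data, and the name `mix_amplitude` are all assumptions and are not values or definitions from this work.

```python
def mix_amplitude(f_calc, f_obs=None, weight_obs=0.5, scale=1.0):
    """Blend an observed amplitude with a scaled calculated amplitude.

    f_obs=None marks a missing reflection, which falls back entirely to the
    scaled calculated amplitude. weight_obs (hypothetical) is the share given
    to the observed value for a weak reflection.
    """
    if f_obs is None:
        return scale * f_calc
    return weight_obs * f_obs + (1.0 - weight_obs) * scale * f_calc
```

For example, `mix_amplitude(4.0)` returns the calculated value unchanged for a missing reflection, while `mix_amplitude(4.0, f_obs=2.0)` returns the midpoint of the two amplitudes.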
2.5.3. Electron Density Map Post-Processing and Automated Model Building
After convergence, we applied post-processing to enhance density map quality. When multiple solutions converged successfully through the GA strategy, we performed spatial alignment and density averaging to reduce noise. The resulting electron density map was used for automated model building, which identifies secondary structure elements, constructs polypeptide backbones, and positions side-chain atoms based on density features and stereochemical constraints. The initial model exhibits over 80% residue completeness and is refined through several iterations using phenix.refine [49], optimizing atomic coordinates and displacement parameters against experimental data to yield the final structural model.
2.6. Test Datasets, Computational Implementation, and Parameter Settings
To ensure robustness, we assembled a benchmark dataset of 28 protein crystal structures with diversity in space group symmetry, solvent content, and diffraction resolution. All diffraction data and reference coordinates were obtained from the Protein Data Bank. The structures comprise proteins containing 826 to 7475 non-hydrogen atoms with 0 to 635 bound water molecules. The dataset spans resolutions from 1.46 Å to 3.2 Å, solvent contents from 55% to 78%, PDB-reported R-work values below 0.22 (ranging from 0.133 to 0.215), and diverse space groups, ensuring evaluation under varied crystallographic conditions. Diffraction datasets contain 4918 to 76,249 observed reflections, with missing low-resolution reflections ranging from 3 to 236. Detailed descriptions of all test structures are provided in Appendix A, Table A1. Solvent content for each structure was estimated using the matthews_coef program from the CCP4 suite [41], based on the Matthews coefficient [36]. The calculations incorporated unit cell parameters, space group symmetry, and protein molecular weight derived from the amino acid sequence. This estimated solvent content, along with space group information and unit cell parameters, constitutes essential prior knowledge used as constraint information during iterative phasing.
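The underlying arithmetic of the Matthews estimate is short enough to show directly. The sketch below uses the general triclinic cell-volume formula and the commonly used protein approximation that the solvent fraction is 1 - 1.23/V_M, with V_M in Å³/Da. It is not the matthews_coef program itself; the function names and the example numbers are hypothetical.

```python
import math

def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume in Å^3 from cell lengths (Å) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca ** 2 - cb ** 2 - cg ** 2 + 2 * ca * cb * cg)

def solvent_fraction(volume, n_molecules, mol_weight):
    """Solvent fraction from the Matthews coefficient V_M = V / (Z * MW).

    n_molecules is the number of protein molecules in the whole unit cell
    (symmetry operators times copies per asymmetric unit); mol_weight is in Da.
    """
    v_m = volume / (n_molecules * mol_weight)
    return 1.0 - 1.23 / v_m

# Hypothetical example: a 100 Å cubic cell holding four 50 kDa molecules.
volume = cell_volume(100.0, 100.0, 100.0, 90.0, 90.0, 90.0)
fraction = solvent_fraction(volume, n_molecules=4, mol_weight=50000.0)
```

Here V_M is 5.0 Å³/Da, giving a solvent fraction of about 0.75, which would fall inside the 55% to 78% range covered by the benchmark.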
The computational framework was implemented using the Clipper C++ libraries for crystallographic computing [50] combined with MPI (Message Passing Interface) for parallel processing. Key parameters were standardized: electron density maps were discretized on 3D grids with approximately 1.0 Å spacing; fixed Gaussian filter radii were used for the large-radius and small-radius filters; the refined scheme stripped 5% of grid points in the asymmetric unit from the coarse envelope; the solvent region negative feedback factor and the continuous algorithm parameters for CHIO, HPR, and MCHIO were held fixed; GA population size was 100 individuals with genetic operations every 100 iterations, crossover probability 0.5, and mutation probability 0.01.
Real-space histogram matching requires a reference electron density distribution. When available, a homologous structure or AlphaFold-predicted model can serve as a reference; structure factors are computed via phenix.fmodel with bulk solvent correction and resolution-appropriate temperature factors. In computing structure factors from model coordinates, the constant term F(0,0,0) (F000) represents the mean electron density of the unit cell. We adjusted F000 such that bulk solvent regions exhibit zero mean electron density. During density modification iterations, solvent flattening constraints drive bulk solvent densities toward zero through negative feedback, and pre-setting the reference histogram with zero-centered solvent regions ensures consistency between the target distribution and algorithmic behavior.
For optimal histogram generation from simulated diffraction data, temperature factors (B-factors) must be assigned to the atomic model. However, there is no exact theoretical method to determine optimal B-factors for histogram matching applications. Several empirical approaches exist: atomic B-factors can be estimated from the Wilson B-factor [51] calculated from the experimental diffraction data of the unknown structure; alternatively, statistical analysis of deposited PDB structures can provide estimates for both atomic and bulk solvent B-factors [52]. In this work, we employ empirical relationships based on the high-resolution limit (in Å): one approximates the B-factor of bulk solvent regions, and another sets the isotropic atomic B-factors of homologous or AI-predicted protein models. These empirical relationships are applicable for typical macromolecular crystallography resolution ranges.
With appropriately configured temperature factors and F000 value, electron density histograms generated from homologous or AI-predicted models closely resemble those from experimentally determined structures. Our tests demonstrate that at identical resolution and with appropriately set parameters, electron density distribution histograms remain remarkably similar even across different protein structures, validating their use as reliable constraint conditions for protein region density in unknown structures. This observation is consistent with the conclusions of Zhang and Main’s seminal work on histogram matching [33]. For benchmarking purposes in this study, we generated reference histograms from deposited PDB coordinates using this temperature factor protocol, which does not affect our conclusions, as the histograms provide statistically representative protein density distributions.
For each combination of test structure, envelope scheme (one-step coarse or two-step refined), phasing strategy (full-resolution, resolution-weighted, or GA-enhanced), and iterative algorithm (10 total), we executed 100 independent phase retrieval attempts from different random phase seeds. Maximum iterations per attempt were 10,000, with early termination upon convergence detection (reduction of the convergence indicators to stable plateaus). Statistics collected from all 100 attempts included success rate, minimum iteration count for first successful convergence, and median iteration count across successful cases. Calculations were performed on a Dell R740 server with 52 cores (104 threads) at 2.1 GHz. A complete evaluation for a single structure with 100 independent trials required approximately 3 h.
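The per-configuration statistics listed above can be gathered with a small helper. In this sketch each attempt is recorded as the iteration count at which it converged, or `None` if it failed within the iteration limit; this data layout and the name `summarize_trials` are assumptions made for illustration.

```python
import statistics

def summarize_trials(iterations_to_converge):
    """Success rate, minimum, and median iteration count over successful attempts."""
    successes = [n for n in iterations_to_converge if n is not None]
    rate = len(successes) / len(iterations_to_converge)
    if not successes:
        return {"success_rate": rate, "min_iterations": None, "median_iterations": None}
    return {
        "success_rate": rate,
        "min_iterations": min(successes),
        "median_iterations": statistics.median(successes),
    }

# Hypothetical record of five attempts, two of which failed to converge.
summary = summarize_trials([1200, None, 800, 3000, None])
```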
5. Conclusions
This study addresses two challenges in direct-method phase retrieval for protein crystals, discontinuous density modification and crude molecular envelope reconstruction, through a systematic approach. We introduced continuous iterative projection algorithms into this domain, validating their value in enhancing convergence stability by implementing smooth density transitions at the molecular interface. We developed improved algorithm variants, including MCHIO, MDM, and MRAAR, that optimize constraint enforcement, achieving performance levels comparable to or exceeding established methods (Figure 5b). Moreover, the proposed two-step refined envelope reconstruction scheme serves as a universal enhancement, elevating average phase retrieval success rates (Figure 8a). This demonstrates that optimizing foundational constraints can be as impactful as algorithmic innovations. While continuous algorithms demonstrate competitive performance comparable to HIO, algorithm effectiveness varies across individual structures, and no single method universally excels for all cases. For practical structure determination, we recommend testing multiple algorithms from both continuous (CHIO, HPR, MCHIO, THIO) and classical (HIO, MDM, MRAAR) categories. At the strategy level, the resolution-weighted strategy provided limited improvement, while the genetic algorithm co-evolution strategy delivered breakthrough performance that enabled multi-solution averaging, reducing mean phase error by approximately 6.83° (Figure 11a,b). This precision gain was achieved universally across different solvent content levels, consistently improving model quality.
Our methods shift the success rate versus solvent content curve upward, extending the applicability of direct methods within the high-solvent-content range to its lower boundary around 55% (Figure 10). Method efficacy nevertheless remains constrained by physical limits, including low solvent contents (approaching or below 55%), insufficient data quality (large R-work values due to measurement errors or poor crystal quality), and complex structural arrangements within the unit cell, so continued development is needed. Even so, the framework constructed in this study provides a powerful and systematic solution for the unbiased structure determination of challenging systems in structural biology. This framework synergistically combines refined envelope reconstruction, continuous and improved iterative projection algorithms whose performance is comparable to or exceeds established methods, and genetic algorithm strategies. The compiled algorithms developed in this work are accessible on GitHub [66].