EM-detwin: A Program for Resolving Indexing Ambiguity in Serial Crystallography Using the Expectation-Maximization Algorithm

Serial crystallography (SX), first used as an application of X-ray free-electron lasers (XFELs), is becoming a useful method to determine atomic-resolution structures of proteins from micrometer-sized crystals with bright X-ray sources. Because of unknown orientations of crystals in SX, indexing ambiguity issue arises when the symmetry of Bravais lattice is higher than the space group symmetry, making some diffraction signals wrongly merged to the total intensity in twinned orientations. In this research, we developed a program within the CrystFEL framework, the EM-detwin, to resolve this indexing ambiguity problem based on the expectation-maximization algorithm. Testing results on the performance of the EM-detwin have demonstrated its usefulness in correctly indexing diffraction data as a valuable tool for SX data analysis.


Introduction
Serial crystallography (SX) is a new technology to solve protein structures from many crystals diffracted by X-rays in a serial fashion. SX was initially developed for X-ray free-electron lasers (XFELs), whose femtosecond X-ray pulses destroy crystal samples after a single exposure. The femtosecond X-ray pulses are considered to be short enough, such that the radiation damage to the crystals does not occur during the exposure [1]. Therefore, the term serial femtosecond crystallography (SFX) was coined due to the femtosecond X-ray pulses at XFELs. The SFX has been demonstrated to be a useful method to determine atomic-resolution structures of proteins from tiny crystals in the size of micrometers or even submicrometers [2][3][4][5][6]. The applications of the SX method have also been demonstrated at synchrotron radiation facilities [7][8][9]. A large number of crystals are required for diffraction measurements, and the signals need to be indexed and merged to obtain a set of full intensity for electron density model building. It is noted that crystals are in unknown orientations during the experiments, because they are either injected to the X-ray interception section, or delivered on a tape, or scanned in their mother liquor. As a result, the crystal orientations are not correlated with serial crystallography. This characteristic brings up the indexing ambiguity issue, which is unique for SX. Here, the indexing ambiguity means that more than one indexing solutions are equally good for a single diffraction pattern to satisfy peak position constraints. This happens when the Bravais symmetry of a crystal lattice is higher than the space group symmetry, and the diffraction pattern from a crystal may be indexed either in the correct orientation or in its twinned orientation(s). Even if no physical twinning occurs to the crystals in SX experiments, merging the data may result in a twined dataset for those crystals with indexing ambiguities. The procedure to solve this problem is called detwinning.
In conventional crystallography, the orientations of the crystals are controlled by goniometers, such that the relative orientation information between diffraction patterns is known. Furthermore, each pair of diffraction patterns from conventional crystallography usually share a good number of common Bragg peaks (or reflections) that are measured with full intensity, which facilitate unambiguous indexing. For each pattern, possible indexing solutions can be tried in turn and the one with the best intensity agreement to a reference pattern is selected as the correct solution for merging. However, the detwinning is more challenging in XFEL nanocrystal crystallography. The number of diffraction peaks in a single pattern is usually smaller in SFX than in conventional crystallography due to the "still" exposure in contrast to the oscillation exposure, and each pattern only records partial intensities for each reflection because of the wide angular spread of Bragg peaks in reciprocal space (due to the small size of crystal samples and micro-focused X-ray beams). In certain SX experiments, a crystal may survive multiple measurements at synchrotrons or with unfocused XFEL X-rays. Under such circumstances, each crystal results in a series of diffraction patterns that are orientationally related, such as in fine ϕ-slicing measurements [10][11][12]. This can facilitate the indexing by treating the series of patterns as groups. In general, autoindexing methods, such as MOSFLM [13], DIRAX [14] or LABELIT [15], cannot immediately distinguish between twin-related indexing solutions based on the locations of diffraction peaks alone. In a situation where two indexing solutions are equivalent, without applying detwinning analysis, nearly half of the patterns will be indexed as its twin and be merged incorrectly. As the intensity variation caused by partial reflection is significant, one needs robust algorithms for detwinning analysis on crystallography data with partial reflections.
Several methods have been developed to solve the indexing ambiguity problem. The BD algorithm first proposed by Brehm and Diederichs [16] and several variants [17,18] reduce the dimensions of data recorded in patterns based on pairwise similarities measured using Pearson's correlation and then do clustering analysis. The algorithm has been implemented in CrystFEL software [19] as the ambigator program. Another method is based on the expectation-maximization (EM) algorithm to iteratively improve the indexing consistency by maximizing the similarities between individual patterns and the merged reflection model [20]. Different from the BD algorithm, which evaluates the pairwise distances between diffraction patterns that record partial reflections, the EM algorithm utilizes the relationship between reflections from any pattern and the full reflection model merged in the previous iteration. The convergence of the indexing solutions and the merged reflection model will yield a solution that is optimally consistent with data in the whole dataset. In this paper, we present the implemented EM algorithm within the CrystFEL framework as an add-on program called EM-detwin. The program was tested with three experimental datasets in two distinct space groups to assess the quality of the merged results.

The Implementation of the EM-detwin
The EM-detwin algorithm is an iterative reconstruction algorithm and uses the Pearson's correlation coefficients between diffraction patterns and the reflection model obtained in the previous iteration as metrics to assign the crystal orientations, such that the associated diffraction patterns can be indexed consistently. The Pearson correlation between a diffraction pattern I i (the term "crystal" will be used in the following text because there may be diffraction signals from multiple crystals recorded in a single pattern) and a model consisting of full reflections I f ull defined as the following [20]: where I i is the diffraction intensities recorded in the crystal i, with {h} as the miller indices of common reflections and r i is the correlation coefficient corresponding to the diffraction data from crystal i.
The mean values for the common dataset are denoted as the I i and I f ull . The algorithm takes the indexing solutions from CrystFEL stream file as the input, and carries out the first round of merging without detwinning to get the initial model of the merged full reflections, the M(0). In the subsequent iterations, the merged reflection model is improved by updating the orientations of crystals based on the comparison between crystals and previously merged full reflection models. Specifically, in the kth iteration, the diffraction signals from the ith crystal {I i } is transformed to its equivalent indexed intensities I t i , where t enumerates all of the possible twinning operators of corresponding space groups (i.e., the miller indices are updated based on the symmetry operations). Then, the correlation coefficients r t i are calculated between I t i and the merged full reflection model obtained in the previous iteration, the M(k − 1). Based on the correlation coefficients, a new model M(k) is obtained by merging the data in the updated orientations. There are two merging strategies, one is the "winner-takes-all" and the other is the "weighted-merging". For the "winner-takes-all" mode, the winner twinning operator t 0 for crystal {I i } is obtained by max t r t i , and only I t 0 i is merged to M(k). In the "weighted merging" strategy, the set of I t i are merged to M(k) in a weighted manner, where the weights are calculated in the following way: Since each crystal can only take a single orientation in reality, at the final iteration, all crystals are merged in the "winner-takes-all" mode to get merged intensities.
The EM-detwin program is designed in a similar way as the process_hkl program in the CrystFEL software. Besides the final merged model from the whole dataset, the EM-detwin also generates the merged models from half datasets (even vs. odd numbered patterns) for consistency analysis. The new input parameters for EM-detwin are listed in Table 1, while other parameters are the same as the process_hkl program. Table 1. New input parameters for the EM-detwin program.

Parameters Explanation
-spacegroupnum = <n> The space group number, used to determine twinning operators.
-winner-takes-all If set, the program will use "winner-takes-all" mode to do merging; otherwise, "weighted-merging" will be used for intermediate model merging -highres = <high> Reject reflections with resolution higher than high Å when calculating correlation coefficients. -lowres = <low> Reject reflections with resolution lower than low Å when calculating correlation coefficients. -write-assignments = <file> Write the final re-indexed results to file.
Add specific twin operators.

Test Datasets
Three datasets were used to assess the performance of the EM-detwin program, based on the availability of the diffraction data. The datasets include: (1) the HIV-1 Integrase Catalytic Core Domain (IN-CCD) data [5] collected in 2018 at the PAL-XFEL (Pohang, South Korea); (2) the Beta-lactamase (BLAC) data [21] collected in 2018 at the European-XFEL (Hamburg, Germany); and (3) the bacteriorhodopsin (BR) data [22] collected in 2017 at the LCLS (Stanford, CA, USA). The symmetry information is summarized in Table 2. The BLAC dataset was downloaded from the CXIDB [23] with the CXIDB id 83. The other two datasets were generously provided by the authors of each work. The input data are CrystFEL stream files generated by indexamajig before being processed by the ambigator. There are 27,311, 12,474 and 241,475 indexed crystals in IN-CCD, BLAC and BR stream files, respectively. The BR dataset is from a pump-probe SFX experiment with extraordinarily high redundancy in measurements that is not common in SX. In order to assess the performance for a typical sized SX experimental dataset, the test was done with the first 20,000 chunks of the BR stream file, containing 15,847 indexed crystals.
* The two twinning operators lead to equivalent miller indices.
The atomic coordinates for these three proteins were downloaded from the RCSB Protein data bank (PDB) [24]. The PDB codes for the IN-CCD, BLAC and BR are 6JCG, 6GTH and 6G7H, respectively. These atomic models were used for molecular replacement phasing and subsequent model refinement with the phenix software (version 1.11.1) [25].

The Convergence of EM-detwin Program
The three datasets were subject to the EM-detwin analysis to get the merged reflections. We applied 30 iterations to test the performance with two merging strategies independently. The merged data from "weighted" and "winner-takes-all" strategies are very similar, with the overall R-factors (see footnote of Table 3 for definition) of 0.36%, 0.32% and 0.25% for IN-CCD, BLAC and BR datasets, respectively. The small R-factors suggest that the differences between the merged results using these two strategies are essentially the same. Therefore, we used the results from the "weighted" strategy for the rest of the analysis.
, where k is the scaling factor between the two datasets. † CC 1/2 is Pearson correlation The convergence was examined by calculating the correlation coefficients between the final reflection model and the intermediately merged models at each iteration, as shown in Figure 1. Although the convergence speed varies among the three test dataset, the curves suggest that 30 iterations are sufficient for the merged model to converge. We also observed that the "winner-takes-all" strategy can converge faster. To visually assess the results of the EM-detwin program, we compared the merged models from original stream files (before detwinning) and the EM-detwin results. The reflections on planes with twinning symmetry were visualized using the CCP4 software [26], showing that the false symmetry disappeared after the analysis by the EM-detwin (Figure 2). This demonstrates that the EM-detwin program is capable of removing the extra symmetries due to twin-related orientations in all three datasets.

Merging Statistics Comparison
The pairwise R-factors were computed for three merged results: ambigator, EM-detwin and the twinned data. No resolution limits were set in the R-factor calculations. The results in Table 3 show that EM-detwin and ambigator results are very similar, and they are both significantly different from the twinned data. This is consistent with the quantification using the correlations, in terms of CC 1/2 and CC star . The quality of the merged results is also assessed using the R-split values, which measures the differences between merged models from two subsets composed of even/odd numbered datasets, as shown in Table 4. The resolutions of published PDB structures were used as the cut-off values for R-split factor calculations. The R-split comparison results indicate that the EM-detwin and the ambigator are comparable in terms of self-consistency.
, where k is the scaling factor between the two datasets. †, * the definitions of CC 1/2 /CC star are the same as in Table 3.

Structure Determination and Refinement with Detwinned Data
The merged data after detwinning was used to build atomic models and refine protein structures. The phenix.phaser program was used to build the initial model using a molecular replacement method. Then the model was refined using the phenix.refine program. The default parameters in phenix.phaser and phenix.refine were applied. The resolution limits were set to be the same as in the published structures. Merged data from original stream files without detwinning were used for structure refinement as control groups. The final R-work and R-free values of model refinement are listed in Table 5 (see Supplementary Table S1 for the complete statistics of structure refinement). The electron density maps are shown in Figure 3 for the three structures, which are calculated based on the merged data by the EM-detwin program.
It is shown that for all datasets, both detwinning programs improve the refinement model quality, in contrast to the large R-factors resulting from the twinned data. The comparison also suggests that ambigator and EM-detwin have similar performance on the qualities of structure determination and refinement.

Robustness of the Program
The EM-detwin program was evaluated for cases where data redundancy is low. We used the first 5000 crystals from each dataset as a reduced dataset to carry out the detwinning analysis. Using the indexing solutions obtained with the full dataset as a reference, the percentage of consistently indexed crystals was computed for the case with the reduced dataset. The similarity between the final models from the full and the reduced datasets was measured using the correlation coefficients. The results are summarized in Table 6. The performance of the ambigator program was also assessed with the reduced datasets in parallel for comparison. The data suggest that both programs yielded to the merged intensity models that are consistent with the results from the full datasets. However, the performance of the EM-detwin with the "winner-takes-all" merging strategy is not as good as the ambigator program in the case of reduced datasets, especially for the BLAC dataset. Based on this result, the "winner-takes-all" mode is not recommended for small datasets, as it is a more aggressive strategy which converges faster (see Figure 1) but with a risk of converging to local optimal points. The EM-detwin algorithm has a computational complexity of O(MN), with N as the number of diffraction patterns and M as the number of iterations. The ambigator program needs to compute the pairwise similarities between all patterns, so it has a computational complexity of O(N 2 ). Therefore, the EM-detwin program shall have speed advantages when N is big. This is confirmed by the execution time comparison for the three datasets. For smaller datasets of BR and BLAC, the EMdetwin program needs to run for 1260 s and 603 s respectively. It takes about a similar time for the ambigator program (3397 s and 649 s for BR and BLAC datasets respectively) to complete the detwinning analysis. For the IN-CCD dataset composed of 23,711 crystals, the EM-detwin execution time is 2150 s, faster than the ambigator that requires 11,873 s. All tests were done with a Macbook Pro A1706 computer utilizing a single 2.9 GHz Core i5 CPU. Therefore, we recommend using the EMdetwin program for larger dataset analysis.
In the case that partial intensities are measured due to the experimental limitations (such as monochromatic X-rays, ultrashort exposure time, crystal mosaicity, etc.), a post-refinement procedure may improve the quality of merged signals [27][28][29]. The post-refinement on partial reflections is a separate procedure from the detwinning, therefore we resort it to a specialized

Robustness of the Program
The EM-detwin program was evaluated for cases where data redundancy is low. We used the first 5000 crystals from each dataset as a reduced dataset to carry out the detwinning analysis. Using the indexing solutions obtained with the full dataset as a reference, the percentage of consistently indexed crystals was computed for the case with the reduced dataset. The similarity between the final models from the full and the reduced datasets was measured using the correlation coefficients. The results are summarized in Table 6. The performance of the ambigator program was also assessed with the reduced datasets in parallel for comparison. The data suggest that both programs yielded to the merged intensity models that are consistent with the results from the full datasets. However, the performance of the EM-detwin with the "winner-takes-all" merging strategy is not as good as the ambigator program in the case of reduced datasets, especially for the BLAC dataset. Based on this result, the "winner-takes-all" mode is not recommended for small datasets, as it is a more aggressive strategy which converges faster (see Figure 1) but with a risk of converging to local optimal points. The EM-detwin algorithm has a computational complexity of O(MN), with N as the number of diffraction patterns and M as the number of iterations. The ambigator program needs to compute the pairwise similarities between all patterns, so it has a computational complexity of O(N 2 ). Therefore, the EM-detwin program shall have speed advantages when N is big. This is confirmed by the execution time comparison for the three datasets. For smaller datasets of BR and BLAC, the EM-detwin program needs to run for 1260 s and 603 s respectively. It takes about a similar time for the ambigator program (3397 s and 649 s for BR and BLAC datasets respectively) to complete the detwinning analysis. For the IN-CCD dataset composed of 23,711 crystals, the EM-detwin execution time is 2150 s, faster than the ambigator that requires 11,873 s. All tests were done with a Macbook Pro A1706 computer utilizing a single 2.9 GHz Core i5 CPU. Therefore, we recommend using the EM-detwin program for larger dataset analysis.
In the case that partial intensities are measured due to the experimental limitations (such as monochromatic X-rays, ultrashort exposure time, crystal mosaicity, etc.), a post-refinement procedure may improve the quality of merged signals [27][28][29]. The post-refinement on partial reflections is a separate procedure from the detwinning, therefore we resort it to a specialized algorithm to handle it.
To facilitate the further analysis of post-refinement, the EM-detwin program has the option to save the re-indexed results in the format of CrystFEL stream file to allow partiality modeling and corrections. To test the impact of post-refinement on the three test datasets, we applied the partialator program in CrystFEL on the detwinned data using the xsphere model [27] and found small improvements in the model R-factors for these three datasets (see Table S1).

Conclusions
As the SX method is becoming more advanced, it attracts broader applications, providing an alternative for the conventional crystallography. We expect structures of proteins packed in various space groups to be determined using this approach. The indexing ambiguity issue needs to be resolved in order for this method to be useful for the crystals belonging to the space groups that possess twinning symmetries. In this research, we implemented an expectation-maximization algorithm-based program, the EM-detwin, to address this issue. Testing results on experimental data show that the diffraction patterns can be correctly indexed and merged for model building and structure refinement. The program of EM-detwin has been tested within the framework of CrystFEL software for versions 0.6.3, 0.7.0, 0.8.0 and 0.9.0. The source codes and installation guide are available on Github (https://github.com/LiuLab-CSRC/detwin).