A Novel Approach to Data Collection for Difficult Structures: Data Management for Large Numbers of Crystals with the BLEND Software

The present article describes how to use the computer program BLEND to help assemble complete datasets for the solution of macromolecular structures, starting from partial or complete datasets collected from multiple crystals. The program is demonstrated on more than two hundred X-ray diffraction datasets obtained from 50 crystals of a complex formed between the SRF transcription factor, its cognate DNA, and a peptide from the SRF cofactor MRTF-A. This structure is currently in the process of being fully solved. While full details of the structure are not yet available, the repeated application of BLEND to the data, as they became available, has made it possible to produce electron density maps clear enough to visualise the potential location of MRTF sequences.

The crystal used for data collection is named in the Crystal column. It is stored in liquid nitrogen inside a Puck, containing several crystals. The crystal itself can be large enough for the beam to be shone through it at several Positions. The Serial Number, thus, assigns a unique number to all sweeps collected for this structure. The remaining 5 columns describe how the crystal was prepared, including details of the presence of a soaked or co-crystallised heavy atom, and cooled down to cryo-temperatures. There are three Base Conditions: bc1, bc2 and bc3. They are a mixture of commercial screens, additive screens (not revealed, as they are sensitive data) and gadolinium (Gd). More specifically:

(1) bc1 = A + B + C1
(2) bc2 = A + B + C2
(3) bc3 = A + B + C1 + Gd

where

A = commercial screen
B = commercial screen for optimization
C1 = additive screen
C2 = additive screen
Gd = gadolinium

There are also three types of Cryogenic Conditions:

(1) cry1 = 30% glycerol + 5 mM magnesium chloride
(2) cry2 = 30% glycerol + OH
(3) cry3 = 30% glycerol + 5 mM magnesium chloride + 1 M sodium bromide

Also, some crystals have been dehydrated by addition of salts directly in the crystallization plates [51] with one of two protocols, dh1 or dh2 (column Dehydration).
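The condition labels defined above can be kept in a small lookup table, so that questions such as "which base conditions contain gadolinium?" are answered programmatically. The following is only a minimal Python sketch, not code from BLEND or from the paper (which works with R dataframes); the dictionary names and the helper `conditions_containing` are hypothetical, and the component strings are copied as they appear in the text.

```python
# Composition of each base condition, as listed in the text.
BASE_CONDITIONS = {
    "bc1": ("A", "B", "C1"),
    "bc2": ("A", "B", "C2"),
    "bc3": ("A", "B", "C1", "Gd"),
}

# Composition of each cryogenic condition, as listed in the text.
CRYO_CONDITIONS = {
    "cry1": ("30% glycerol", "5 mM magnesium chloride"),
    "cry2": ("30% glycerol", "OH"),
    "cry3": ("30% glycerol", "5 mM magnesium chloride", "1 M sodium bromide"),
}


def conditions_containing(component, table):
    """Return the sorted labels of all conditions that include `component`."""
    return sorted(label for label, parts in table.items() if component in parts)


if __name__ == "__main__":
    print(conditions_containing("Gd", BASE_CONDITIONS))
    print(conditions_containing("1 M sodium bromide", CRYO_CONDITIONS))
```

Keeping the compositions in one place makes it harder for the labels used in the dataset table to drift apart from their definitions.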

Appendix A
Crystals and X-ray data for the case described in this paper were produced and collected over a period of 2 years. Full details are included in Table A1. In the table, Visit ID refers to the unique code assigned by the Diamond synchrotron user office to the specific experiment at a given beamline.

Table A1. Information on all datasets used for the work described in this paper. Crystals were prepared with one of three different base conditions, bc1, bc2 and bc3 (see text). They were also prepared for cooling with one of three different cryogenic conditions, cry1, cry2 and cry3 (see text). The majority of crystals included heavy atoms to attempt SAD phasing. Heavy atoms were soaked in solution for most of the crystals; in a few cases, they were co-crystallised. In order to improve resolution, one of two dehydration screenings has been attempted for many crystals. The table also lists details concerning dates of the various data collections, position of the crystals in the pucks and whether crystals were shot once or more times. The serial number, thus, corresponds to a unique and specific sweep obtained from X-ray diffraction.

A table different from Table A1, but related to it, is Table 1, included in Section 4.3. Table 1 is a representation of a reshaped dataframe, an object present in the R programming language [46]. This appendix explains how the reshaped dataframe is obtained. The starting point is the manual construction of a dataframe associated with Table A1. Several solutions can be envisaged to avoid this time-consuming task, all of them making use of database algorithms. These will be implemented shortly in BLEND, but for the work described in this article, preparation of the initial dataframe and the subsequent formation of the reshaped dataframe were carried out manually. A few lines of the code for the initial dataframe are shown in Figure A1. The dataframe is a simple matrix in which each row corresponds to a single dataset. As multiple datasets can be associated with the same Date, VisitID, Puck, etc., values in these columns are often repeated.

Next, a dataframe including all possible combinations of the unique conditions in the initial dataframe is created. Let us call it the theoretical conditions dataframe. It turns out that the base conditions (BC) comprise 3 unique values (bc1, bc2, bc3), the cryogenic conditions (CC) also comprise 3 unique values (cry1, cry2, cry3), the dehydration protocol includes 3 unique values (no = no dehydration, dh1, dh2), the co-crystallisation flag (CO) includes two values (yes, no), and the heavy atom types (HA) are 15 (no = no heavy atom, KlCl6, Tantalum, Hg(Thi), Pt(PIP), KAu(CN)2, Hg(Ace), K2PtCl4, Hg(PMA), K2PtI6, OsCl3, AgN, IC3(m_triangle), GdCl3, Os). The possible combinations of all values listed above are 3 × 3 × 3 × 2 × 15 = 810. This means that the theoretical conditions dataframe has 810 rows. Not all possible combinations can be present in the data collected for this work, because the total number of datasets is only 271. For this reason, the initial dataframe entries are matched against the theoretical conditions dataframe; the result of this comparison is a new dataframe, simply called the conditions dataframe, shown in Table 1.
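The construction of the theoretical conditions dataframe and its matching against the collected datasets can be illustrated with a short sketch. The work described here was done with R dataframes; the Python version below uses only the standard library and is not the code used for the paper. The function names and the three example rows are hypothetical and were invented for illustration; only the lists of unique values are taken from the text.

```python
from itertools import product

# Unique values of each condition, as listed in the text.
BASE = ["bc1", "bc2", "bc3"]
CRYO = ["cry1", "cry2", "cry3"]
DEHYDRATION = ["no", "dh1", "dh2"]
COCRYSTALLISED = ["yes", "no"]
HEAVY_ATOM = [
    "no", "KlCl6", "Tantalum", "Hg(Thi)", "Pt(PIP)", "KAu(CN)2", "Hg(Ace)",
    "K2PtCl4", "Hg(PMA)", "K2PtI6", "OsCl3", "AgN", "IC3(m_triangle)",
    "GdCl3", "Os",
]


def theoretical_conditions():
    """All combinations (BC, CC, dehydration, CO, HA): 3 x 3 x 3 x 2 x 15 rows."""
    return list(product(BASE, CRYO, DEHYDRATION, COCRYSTALLISED, HEAVY_ATOM))


def conditions_table(datasets):
    """Keep only the theoretical combinations that occur in `datasets`.

    `datasets` is a list of (serial_number, combination) pairs. Each returned
    row pairs a combination with the serial numbers of the datasets that
    were collected under it.
    """
    serials_by_combo = {}
    for serial, combo in datasets:
        serials_by_combo.setdefault(tuple(combo), []).append(serial)
    return [
        (combo, serials_by_combo[combo])
        for combo in theoretical_conditions()
        if combo in serials_by_combo
    ]


if __name__ == "__main__":
    # Hypothetical rows, NOT taken from Table A1.
    example = [
        (1, ("bc1", "cry1", "no", "no", "no")),
        (2, ("bc1", "cry1", "no", "no", "no")),
        (3, ("bc3", "cry3", "dh1", "yes", "GdCl3")),
    ]
    print(len(theoretical_conditions()))
    for combo, serials in conditions_table(example):
        print(combo, serials)
```

Iterating over the theoretical combinations, rather than over the datasets, keeps the rows of the resulting table in a fixed, reproducible order that does not depend on the order in which the data were collected.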

Figure A1. Initial R dataframe, corresponding to Table A1. Just a few lines of the dataframe are shown in this figure.

Appendix C
With molecular replacement, models are oriented and placed at specific locations of the unit cell. Two solutions from molecular replacement runs do not necessarily overlap, even if they correspond to the same correct structure. The reason for this is that the asymmetric units selected by the molecular replacement program could be different. Furthermore, the absolute location of the oriented molecule depends on where the unit cell origin has been placed. The origin can be selected arbitrarily, as long as it is compatible with the specific symmetry. Thus, to verify whether two molecules overlap, all symmetry equivalents of the molecules and all allowed origin shifts must be tried. Within the CCP4 suite of programs, this task is carried out by the program CSYMMATCH [52]. The input consists of the two files containing the atomic coordinates of the two models to be compared; one is considered the moving model, the other the reference model. The output consists of a PDB file corresponding to the moving model, transformed to the closest possible location to the reference model still compatible with symmetry and the allowed unit cell origins. To compute the RMSD between all atoms of the reference structure and all atoms of the moved structure, we have used the CCP4 program COMPAR. This is an old program with no related documentation on the official CCP4 website. Details on how to run this program have been learned via the CCP4 Bulletin Board [53]. The RMSD value for the two structures discussed in this paper is 0.773 Å.
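For readers unfamiliar with the quantity reported above, the following minimal Python sketch shows how an all-atom RMSD between two coordinate sets can be computed once the moving model has been brought close to the reference model. It is not the COMPAR implementation, whose internals are not described here. It assumes that both models contain the same atoms in the same order and that no further superposition is applied; the function name `rmsd` and the example coordinates are hypothetical.

```python
import math


def rmsd(reference, moved):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Atoms are paired by position in the lists; no fitting is performed.
    """
    if len(reference) != len(moved):
        raise ValueError("both models must contain the same number of atoms")
    if not reference:
        raise ValueError("at least one atom is required")
    squared_sum = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(reference, moved):
        squared_sum += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared_sum / len(reference))


if __name__ == "__main__":
    # Hypothetical coordinates in Å: every atom is displaced by 1 Å along z.
    ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    mov = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(ref, mov))
```

Because no fitting is done, the value is meaningful only after the symmetry and origin ambiguity has been resolved, which is the role CSYMMATCH plays in the procedure described above.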