Comparison of Intermolecular Halogen...Halogen Distances in Organic and Organometallic Crystals

Statistical analysis of halogen...halogen intermolecular distances was performed for three sets of homomolecular crystals under normal conditions: C–Hal1...Hal2–C distances in crystals of (i) organic compounds (set Org) and (ii) organometallic compounds (set Orgmet), and (iii) M1–Hal1...Hal2–M2 distances (set MHal). In all cases Hal1 = Hal2, and in MHal M1 = M2, where M is any metal. For the analysis of the C–Hal...Hal–C distances, a new method for estimating van der Waals radii is proposed, based on two subsets of distances: (i) the shortest distance from each substance below a threshold; and (ii) all C–Hal...Hal–C distances below the same threshold. As initial approximations for these thresholds for different Hal, the Ragg values previously introduced in investigations with the participation of the author were used (Ragg values make it possible to perform a statistical assessment of the presence of halogen aggregates in crystals). The following values are recommended in this work as universal values for crystals of organic and organometallic compounds: RF = 1.57, RCl = 1.90, RBr = 1.99, and RI = 2.15 Å. They are in excellent agreement with the results of some other works but significantly (by 0.10–0.17 Å) greater than the commonly used values. For the Orgmet set, slightly lower values of RI (2.09–2.11 Å) were obtained, but the number of C–I...I–C distances available for analysis was significantly smaller than in the other subgroups, which may be the reason for the discrepancy with the value for the Org set (2.15 Å). Statistical analysis of the M–Hal...Hal–M distances was performed for the first time. A Hal-aggregation coefficient for M–Hal bonds is proposed, which allows one to estimate the propensity of M–Hal groups with certain M and Hal to participate in Hal-aggregates formed by M–Hal...Hal–M contacts. In particular, it was found that, for the Hg–Hal groups (Hal = Cl, Br, I), there is a high probability that the crystals have Hg–Hal...Hal–Hg distances with length ≤ Ragg.


Introduction
The role played by intermolecular contacts between halogen atoms (halogen...halogen contacts, or Hal...Hal for short) in the formation of crystal structures of both organic [1,2] and organometallic [3,4] compounds is well known. This role is often reduced to considering only the shortest contacts (shorter than the sum of van der Waals radii, i.e., 2R Hal for identical Hal), which can be described as a halogen bond [5]. Meanwhile, it was first noted in [6] that the grouping of several halogen atoms can play an important role in the formation of crystal structures, even if not all Hal...Hal distances between neighboring atoms are shortened compared with 2R Hal. This phenomenon, for which we later proposed the name "aggregation of halogen atoms" or "Hal-aggregation", was considered in a series of our works [7][8][9][10][11][12] with a number of examples.
Halogen atoms are not among the most common in protein or nucleic acid molecules; nevertheless, as early as the July 2004 release of the PDB [13], almost 20 years ago, 964 C-Hal bonds were found [14] in proteins and 321 in nucleic acids (for both sets Hal = Cl, Br, I). At the same time, halogenation of drugs often increases membrane binding and permeation [15], leading to the desired steric and conformational effects when interacting with the target site [16], so it is not surprising that the proportion of halogenated compounds among drugs is quite large (about 25% in DrugBank [17] according to [18]). In addition, intermolecular contacts involving halogen atoms contribute to the stabilization of protein-ligand complexes; therefore, plenty of works have been devoted to their study (e.g., [19][20][21][22]).
However, regardless of whether researchers consider only short Hal...Hal contacts, Hal-aggregates, or halogen bonds in different types of substances, the starting point of the analysis is the values of the van der Waals radii of halogen atoms (R Hal). A large number of works have been devoted both to the actual determination of R Hal [23][24][25][26][27][28][29][30] and to the discussion of values obtained by various methods [31]. In 2020, to determine the van der Waals radii of atoms in crystals, the authors of [30] created arrays of distances, the longest of which obviously greatly exceeded the sum of van der Waals radii, and then described them using Gaussian functions. An approach based on the use of several Gaussian functions to describe distance distributions in crystals was proposed by me a little later [32], independently of the authors of [30].
In [30], the authors considered distances in crystals with molecules containing only H(D), B, C, N, O, F, Si, P, S, Cl, As, Se, Br, and I atoms. If the values of the van der Waals radii are constant in any environment, then the indicated restriction should not have a significant effect on the result. At the same time, without additional research, it cannot be ruled out that some differences in the structure of crystals consisting of molecules of only organic compounds or of molecules of organometallic compounds [33] may also affect the average values of certain intermolecular contacts.
Therefore, one of the purposes of this work was to compare the distributions of intermolecular distances Hal...Hal in two sets of crystals: those consisting of molecules of organic compounds (Org) and those of molecules of organometallic compounds (Orgmet). In addition, the statistical analysis of intermolecular metal-halogen...halogen-metal distances was of significant interest because of very scarce information about these distances in the literature.

Results
The procedure for the formation of the different Hal...Hal distance sets used for statistical analysis is described in detail in Section 4. Table 1 provides the number of substances whose molecules have C-Hal bonds for the sets Org and Orgmet. The proportions of these substances in the total number of crystals in the Org (138,916) and Orgmet (76,439) sets that satisfy all search conditions except for the presence of a certain C-Hal bond were also determined, i.e., how common the substances with such bonds are. It can be seen that this fraction is the maximum for crystals with a C-Cl bond in the Org set (10.6%) and the minimum for crystals with a C-I bond in the Orgmet set (0.3%).
Table 1. Results of search in the CSD for substances with C-Hal bonds and C-Hal...Hal-C distances satisfying the Hal-aggregation statistical criterion proposed in [9] (N is the number of substances with C-Hal bonds, N tot is the total number of substances in the sets Org and Orgmet, and N agg is the number of substances with distances satisfying the Hal-aggregation criterion with R Hal from [26][27][28]).
Next, in the formed sets, the C-Hal...Hal-C intermolecular distances (d) were calculated for identical Hal atoms. In previous work [9], it was shown that the length of the Hal...Hal contacts involved in halogen aggregation usually does not exceed R agg = 2R Hal + 0.5 Å; i.e., in statistical studies (without a detailed analysis of aggregates in each substance), this quantity may be used to estimate the number of structures containing Hal-aggregates. In [9], R F = 1.40 [27], R Cl = 1.90 [26], R Br = 1.97 [28], and R I = 2.14 Å [28] were used; therefore, the upper limit of d (d max) was initially set in this work to 3.300, 4.300, 4.440, and 4.780 Å for the F...F, Cl...Cl, Br...Br, and I...I distances, respectively. The number of substances containing such distances is also indicated in Table 1.
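The initial thresholds quoted above follow directly from R agg = 2R Hal + 0.5 Å. A minimal sketch of that arithmetic is given below; the names `R_HAL_PREVIOUS`, `r_agg`, and `D_MAX` are illustrative and do not come from the original work.

```python
# Van der Waals radii (Å) used in [9], as quoted in the text.
R_HAL_PREVIOUS = {"F": 1.40, "Cl": 1.90, "Br": 1.97, "I": 2.14}


def r_agg(r_hal, margin=0.5):
    """Statistical Hal-aggregation threshold: R_agg = 2 * R_Hal + 0.5 Å."""
    return 2.0 * r_hal + margin


# Initial upper limits d_max (Å) for the F...F, Cl...Cl, Br...Br and I...I distances.
D_MAX = {hal: round(r_agg(r), 3) for hal, r in R_HAL_PREVIOUS.items()}

for hal, d_max in D_MAX.items():
    print(f"{hal}...{hal}: d_max = {d_max:.3f} Å")
```

The same function gives the extended F...F threshold only if a larger R F is supplied; the value d max = 3.50 Å used later in the text is stated there directly rather than derived here.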

It turned out that, according to this criterion, the fraction of substances containing Hal-aggregates is systematically higher in the Orgmet set than in Org, with the greatest difference (by 22.6%) being observed for F-aggregates and the smallest difference for I-aggregates (3.8%).

Analysis of the Shortest C-Hal...Hal-C Distances
The shortest C-Hal...Hal-C distances of various types for crystals of organic compounds are reported in Table 2. The adequacy of such distances is usually not analyzed, although an almost twofold reduction in distance compared to 2R Hal apparently indicates errors in the structure determination. In this work, the following approach was used to assess the adequacy of the shortest distances. All distances of each type were sorted in ascending order, and if the difference between a contact and the next one was more than 0.1 Å, the former was classified as unrealistically short. The presence of too short Hal...Hal distances may be associated with general errors in the determination of the structure; therefore, not only these short distances but also all distances in the corresponding substances (records) were excluded from further analysis. Only in the I...I subsets was there no need to remove any substances. In the remaining subsets, the entries PUCREL, FFMXZP, JODVAZ, VAWNUE, MIWHOP01, and XACXEE in the Org set and TUKNOC, KUSMOD, and CUCNUM in the Orgmet set were deleted. The number and range of distances used in further analysis are shown in Table 3. The results obtained below indicate that the previously used value R F = 1.40 Å is underestimated; therefore, the sample of C-F...F-C distances was also extended, and this type of distance has two rows in Table 3. It can be seen that, in the corrected arrays, the reduction in the shortest distances compared to 2R Hal (proposed in [26][27][28]) does not exceed ~20%, which seems reasonable.
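The gap criterion described above can be sketched as follows. This is a literal reading of the rule, not the author's code: every contact followed by a gap of more than 0.1 Å in the sorted list is flagged, and all contacts of the affected record are dropped. The function name, the input layout, and the reference codes `HYPO01` to `HYPO04` are hypothetical.

```python
def drop_unrealistic_records(contacts, max_gap=0.1):
    """Apply the gap criterion to a list of (refcode, distance in Å) pairs.

    Returns (kept_contacts, rejected_refcodes). A contact is flagged when
    the next longer contact in the sorted list is more than max_gap away;
    every contact of a flagged record is then removed.
    """
    ordered = sorted(contacts, key=lambda c: c[1])
    rejected = set()
    for (code, dist), (_, next_dist) in zip(ordered, ordered[1:]):
        if next_dist - dist > max_gap:
            rejected.add(code)
    kept = [c for c in contacts if c[0] not in rejected]
    return kept, rejected


# Made-up F...F distances; HYPO01 contains one implausibly short contact.
contacts = [
    ("HYPO01", 1.62), ("HYPO01", 3.10),
    ("HYPO02", 2.95), ("HYPO02", 3.15),
    ("HYPO03", 3.02), ("HYPO04", 3.08),
]
kept, rejected = drop_unrealistic_records(contacts)
print(sorted(rejected))
print(kept)
```

In a dense distribution truncated at d max, large gaps are expected only at the short end, which is why this literal reading is assumed to be sufficient for the sketch.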

Analysis of Distance Distributions
Often, when determining van der Waals radii, only the shortest distances are considered. However, van der Waals radii are most valuable when they allow estimation of not the shortest but the most probable non-valence distances. In this paper, for each type of C-Hal...Hal-C distance in each of the two main sets Org and Orgmet, two variants of distance arrays were analyzed. The first one (All) included all distances of the corresponding type within d max in the selected substances, while the second (First) included the one shortest distance of the corresponding type from each selected substance; thus, the longest distances in the second array also did not exceed d max. For these arrays of distances, histograms were constructed, examples of which for the Org set are shown in Figures 1 and 2. The maxima on all histograms were described by a Gaussian function (red lines in Figures 1 and 2). It should be noted that, when the histogram step changes, the appearance of the histogram changes to some extent (for example, Figure 1a,b shows the distributions of the F...F distances with steps of 0.2 and 0.1 Å), but the position of the maximum when described by the Gaussian function remains almost constant (see Tables 4 and 5). In Figure 1a, the position and the very presence of the maximum do not seem obvious, which, as noted earlier, is most likely a consequence of the initially chosen R agg value for the F...F distances being underestimated.
Therefore, additional arrays with d max = 3.50 Å were analyzed for this type of distance. Table 4 shows that, in this case, as for other types of distances, the distribution parameters depend very little on the histogram step. Table 6 shows the differences between the positions of the maxima of the Gaussian functions for different options used for histograms.
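The construction of the two distance arrays and the weak dependence of the fitted maximum on the histogram step can be illustrated with a minimal sketch. It is not the fitting procedure used in this work: a three-point log-parabola estimate around the highest bin is swapped in for a full Gaussian fit, and the function names, the record codes, and the synthetic peak parameters (3.70 and 0.25 Å) are all assumptions made for illustration.

```python
import math


def build_arrays(distances_by_substance, d_max):
    """Return (All, First): every distance <= d_max, and the single
    shortest distance <= d_max from each substance."""
    all_arr, first_arr = [], []
    for dists in distances_by_substance.values():
        within = [d for d in dists if d <= d_max]
        if within:
            all_arr.extend(within)
            first_arr.append(min(within))
    return sorted(all_arr), sorted(first_arr)


def gaussian_peak(centers, counts):
    """Peak position of a Gaussian-shaped histogram from its highest
    interior bin and the two neighbours (vertex of the parabola through
    the log-counts). Assumes evenly spaced bins, nonzero counts in the
    three bins used, and a strict interior maximum."""
    i = max(range(1, len(counts) - 1), key=lambda k: counts[k])
    lo, mid, hi = (math.log(counts[k]) for k in (i - 1, i, i + 1))
    step = centers[i + 1] - centers[i]
    return centers[i] + step * (lo - hi) / (2.0 * (lo - 2.0 * mid + hi))


def synthetic_peak(step, mu=3.70, sigma=0.25):
    """Sample an ideal Gaussian on bins of the given step and recover mu."""
    n_bins = int(round(1.4 / step))
    centers = [3.0 + step * (k + 0.5) for k in range(n_bins)]
    counts = [1000.0 * math.exp(-((c - mu) ** 2) / (2.0 * sigma ** 2)) for c in centers]
    return gaussian_peak(centers, counts)


# Made-up Cl...Cl distances (Å) for three hypothetical records.
data = {"HYPO01": [3.2, 3.9, 4.6], "HYPO02": [4.1, 4.25], "HYPO03": [4.5]}
all_arr, first_arr = build_arrays(data, d_max=4.30)
print(all_arr, first_arr)
print(synthetic_peak(0.2), synthetic_peak(0.1))
```

For an ideal Gaussian the recovered position is the same for both steps; for real histograms the scatter between steps is what Table 6 reports.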
The results for I...I contacts in the Orgmet set do not quite match the trends for other types of contacts in several cases. Perhaps this outcome is due to the significantly smaller number of I...I contacts in this set, especially for the sample of the shortest contacts.
In the Org set, the differences in the positions of the maxima for histograms with steps of 0.2 and 0.1 Å do not exceed 0.011 Å. The same maximum difference appears in the Orgmet set if the results for I...I contacts are not considered. Thus, the differences in the positions of the maxima between Org and Orgmet can be considered significant if they exceed 0.01 Å. It turns out that the positions of the maxima for the shortest contacts of all types in Orgmet correspond to significantly shorter distances than in Org: the contraction is in the range of 0.023-0.055 Å for F...F, Cl...Cl, and Br...Br contacts. For the same contacts from the All arrays, a different pattern is observed: for F...F, on average, shorter (by 0.050-0.058 Å) contacts exist in the Org set, Cl...Cl contacts are also shorter in Org but only by 0.013-0.014 Å, and the Br...Br contacts are on average longer (by 0.017-0.022 Å) in Org than in Orgmet.
As expected, the maxima for First contacts correspond to shorter distances than the maxima for All for all types of distances in both sets (Org and Orgmet). By its meaning, the sum of van der Waals radii should be greater than the position of the maximum for First and less than the position of the maximum for All.
The expression (x all + x first)/4 was used to estimate the values of the van der Waals radii. It turns out (Table 7) that the values obtained in this way are in excellent (within 0.01 Å) agreement with each other, both for different histogram steps and for the Org and Orgmet sets. An exception is the discrepancy in the R I estimates for Org and Orgmet, which, as noted above, may be due to an insufficient number of contacts in the C-I...I-C sample for Orgmet.
Table 7. Estimation of the van der Waals radii of halogen atoms (Å) by the formula (x all + x first)/4, where x is the position of the maximum of the Gaussian function used to describe the subset of C-Hal...Hal-C distances (All: all contacts ≤ d max; First: one shortest contact ≤ d max from each substance; Org: homomolecular crystals of organic compounds; Orgmet: homomolecular crystals of organometallic compounds).
Table 8 compares the results of determining the van der Waals radii in this work with some data from the literature. The values obtained in this work are in good agreement with the data of [30], obtained using a similar technique. At the same time, it is important to note the good agreement of the values for Cl, Br, and I with the results of [26][27][28], which were obtained by another method and which were previously used to estimate R agg statistically. Thus, the previously obtained data on halogen aggregation involving these atoms remain relevant, while the data on F-aggregation can be revised considering the new value of R F. Overall, the obtained results indicate that the following van der Waals radii for halogen atoms bonded to a carbon atom can be recommended as unified values for crystals of organic and organometallic compounds under normal conditions: R F = 1.57, R Cl = 1.90, R Br = 1.99, and R I = 2.15 Å. It makes sense to refine the value of R I in crystals of organometallic compounds when data for a larger number of structures become available.
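The estimate (x all + x first)/4 is half of the mean of the two peak positions, since each peak position approximates a sum of two radii. A minimal sketch follows. The peak positions 3.90 and 3.70 Å are placeholders chosen for illustration and are not taken from Tables 4 to 7. The Bondi radii are the widely tabulated values from [25]; they are not listed in this section and are included only to reproduce the 0.10-0.17 Å difference quoted in this work.

```python
def vdw_radius(x_all, x_first):
    """Van der Waals radius estimate (Å) from the Gaussian peak positions
    of the All and First arrays: R = (x_all + x_first) / 4."""
    return (x_all + x_first) / 4.0


# Placeholder peak positions (Å), NOT values from the tables of this work.
print(vdw_radius(3.90, 3.70))

# Radii recommended in this work (Å) and the commonly used Bondi radii (Å).
RECOMMENDED = {"F": 1.57, "Cl": 1.90, "Br": 1.99, "I": 2.15}
BONDI = {"F": 1.47, "Cl": 1.75, "Br": 1.85, "I": 1.98}

DIFFERENCE = {hal: round(RECOMMENDED[hal] - BONDI[hal], 2) for hal in RECOMMENDED}
print(DIFFERENCE)
```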

The M-Hal...Hal-M Distances
Halogen bonds involving M-Hal groups have been the subject of many studies [3,4][34][35][36]. However, as a rule, contacts of halogen atoms from such groups have been considered either with other elements or with halogen atoms that do not form M-Hal bonds. M-Hal...Hal-M contacts have rarely been noted by researchers [37]. Therefore, one of the goals of this work was to statistically analyze the M1-Hal1...Hal2-M2 distances. In this case, as in the analysis of the C-Hal1...Hal2-C distances, only the distances between the same atoms (Hal1 = Hal2, M1 = M2) under normal conditions were considered. The rules for selecting substances for the MHal set are described in more detail in Section 4.
Data on the number of different metal elements (M) with M-Hal bonds, according to the CSD [38] search, are provided in Table 9. Detailed information about the number of symmetrically independent bonds of each M-Hal type is given in Table 10. It can be seen that, among the studied substances, some types of M-Hal bonds are very rare. Therefore, in addition to the number of metal elements that have at least one particular M-Hal bond in all crystals, the numbers of M with more than 20 and more than 40 symmetrically independent M-Hal bonds in total are presented in Table 9.
[...] symmetrically independent M-Hal bonds in the training set is sufficiently large. The smallest numbers of such bonds were found for the M-F groups; therefore, the N M-Hal boundaries in the analysis of k MHal-agg were chosen considering the available N M-F. The graphs illustrating the changes in k MHal-agg depending on M and Hal show values for metals having more than 20 symmetrically independent M-Hal bonds in the set (Figure 3a) and more than 40 bonds (Figure 3b) (for convenience, the abscissa scales on Figure 3a [...]).
In the set under consideration, when the condition N M-Hal > 20 is fulfilled (Figure 3a), the maximum and minimum values of k MHal-agg correspond to bonds involving fluorine. The largest value of k MHal-agg (2.17) is for the Sb-F group and the smallest (0) for the Ti-F group; i.e., the Sb-F groups participate on average in more than two symmetrically independent Sb-F...F-Sb distances, while the Ti-F groups are not at all inclined to form Ti-F...F-Ti contacts. It should be noted that the number of symmetrically independent M-F groups in both cases is not too large (35 each), and as their number increases, the values of k MHal-agg can notably change.
In the subset with N M-Hal > 40 (Figure 3b), the largest value of k MHal-agg (1.17) is for the Ge-Cl group, while the value for Hg-I is close to it (1.09). In general, for the Hg-Hal groups (Hal = Cl, Br, I), the k MHal-agg values are high (0.80-1.09), while the number of such groups in the set is 190 or more (Table 10); i.e., with high probability, one can expect crystals with these groups to contain Hg-Hal...Hal-Hg distances with length ≤ R agg.
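The definition of k MHal-agg is not spelled out in the text above. The sketch below assumes, based on the remark that Sb-F groups participate on average in more than two symmetrically independent Sb-F...F-Sb distances, that the coefficient is the number of symmetrically independent M-Hal...Hal-M distances ≤ R agg divided by the number of symmetrically independent M-Hal bonds. That reading, the function names, and the contact count of 76 are assumptions; only the bond count of 35 and the resulting values 2.17 and 0 come from the text.

```python
def k_mhal_agg(n_contacts, n_bonds):
    """Assumed Hal-aggregation coefficient: symmetrically independent
    M-Hal...Hal-M distances <= R_agg per symmetrically independent M-Hal bond."""
    if n_bonds <= 0:
        raise ValueError("n_bonds must be positive")
    return n_contacts / n_bonds


def well_sampled(bond_counts, min_bonds=20):
    """Keep only M-Hal types with more than min_bonds independent bonds."""
    return {group: n for group, n in bond_counts.items() if n > min_bonds}


# 35 bonds per group is stated in the text; 76 contacts is a hypothetical
# count chosen only so that the ratio reproduces the reported 2.17.
print(round(k_mhal_agg(76, 35), 2))
print(k_mhal_agg(0, 35))
```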
To estimate the parameters of the distributions of the M-Hal...Hal-M distances, the same approach was used as for the C-Hal...Hal-C distances. It should be noted that the smallest number of C-Hal...Hal-C distances is in the First subset of the Orgmet set for C-I...I-C (140). In the MHal set, there is about the same number of values (124) in the All subset for M-F...F-M distances, and in the First subset for distances of the same type, it is several times smaller (33). Accordingly, the error in the parameters describing these distances can be high, especially in the First subset, as evidenced in particular by the low correlation coefficient for it (r² = 0.702). The M-Hal1...Hal2-M distances (Hal1 = Hal2 = Cl, Br, I) were described using Gaussian functions with two step sizes (0.1 and 0.2 Å). In this case, as in the analysis of the C-Hal...Hal-C distances, the positions of the maxima of the functions changed very little; therefore, Table 11 lists only the parameters of the distributions obtained with a step of 0.2 Å. Considering that the Hal atoms in the M-Hal groups can coexist with large ligands around M, which hinder the approach of the same groups from neighboring molecules, it is not surprising that the positions of the maxima exceed 2R Hal, even in the First subsets. Somewhat surprisingly, for M-Cl...Cl-M, the x all value is less than x first, although the difference (0.026 Å) is almost insignificant. Formally, for the M-F...F-M distances, the same effect is observed and is much larger in magnitude; however, it could be due to the small number of these distances in the set and, accordingly, the inaccuracy of the obtained parameters. For the M-Br...Br-M and M-I...I-M distances, the positions of the maxima in the All subset are, as expected, greater than in the First subset, but the difference x all - x first for M-Br...Br-M is notably smaller (0.053 Å) than the average for the C-Hal...Hal-C distances (Table 6).

General Procedure for the Formation of Hal...Hal Distance Sets
For the selection of crystalline substances, the Cambridge Structural Database (CSD) [38], version 5.43 (November 2021) + 3 updates, was used. The search was carried out with the ConQuest program [39], using combinations of several conditions. Small halogen-containing molecules (CCl4, CHCl3, etc.) often occur in crystals as solvate molecules, and their position and/or orientation is often disordered, which is not always accurately described when determining the structure. Therefore, to increase the accuracy of the results, all records with disorder noted in the CSD were excluded from consideration, and only the structures of homomolecular crystals (crystals consisting of identical molecules; the search condition NRes = 1 in ConQuest) were considered.
Since the goal of this work was to analyze distances involving only terminal halogen atoms, when drawing the C-Hal or M-Hal fragments, the presence of only one bond was specified for Hal.
The specificity of the formation of crystals of polymeric compounds can lead to a difference in the distributions of the lengths of interatomic distances in them compared to crystals of nonpolymeric compounds; therefore, the corresponding entries were also excluded from consideration.
Temperature and pressure can affect the parameters of intermolecular contacts, so room temperature and normal pressure were other search conditions (records with the "pressure" field were excluded).
The Best room temperature list [40] was used to determine the best studies within the same family (records with the same letter part of the reference code).
The conditions listed above, supplemented by the presence of 3D coordinates (which was necessary for further distance calculations) and the absence of CSD error warnings, were used for searches.
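The general selection conditions listed above can be summarized as a single predicate. The sketch below uses plain dictionaries with hypothetical field names; it is not ConQuest syntax and not the CSD Python API, and it only restates the conditions named in this section.

```python
def passes_general_filters(record):
    """True if a record meets the general selection conditions of this section.

    All dictionary keys are hypothetical names introduced for illustration.
    """
    return (
        record["n_residues"] == 1                     # homomolecular crystal (NRes = 1)
        and not record["has_disorder"]                # no disorder noted
        and not record["is_polymeric"]                # polymeric compounds excluded
        and record["has_3d_coordinates"]              # needed for distance calculations
        and not record["has_error_warnings"]          # no CSD error warnings
        and record["pressure"] is None                # records with a pressure field excluded
        and record["in_best_room_temperature_list"]   # best room-temperature study of its family
    )


# A made-up record that satisfies every condition.
example = {
    "n_residues": 1,
    "has_disorder": False,
    "is_polymeric": False,
    "has_3d_coordinates": True,
    "has_error_warnings": False,
    "pressure": None,
    "in_best_room_temperature_list": True,
}
print(passes_general_filters(example))
```

The requirement that Hal be terminal (exactly one bond) belongs to the drawn C-Hal or M-Hal fragment rather than to the record, so it is left out of this record-level predicate.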

Additional Terms for Particular Sets
To form the set of C-Hal...Hal-C distances existing in crystals of organic compounds (Org), a search was carried out among records classified in the CSD as Organics. Accordingly, to form the set of distances observed in crystals of organometallic compounds (Orgmet), the search was performed among records classified in the CSD as Organometallic. To form the set of M-Hal...Hal-M distances, M was taken as 'any metal' when drawing the M-Hal fragment, and the search was performed on all records (i.e., without choosing Organics or Organometallic).

Conclusions
The values of the van der Waals radii of halogen atoms obtained in this work are in excellent agreement with the results of Chernyshov et al. [30] but significantly (by 0.10-0.17 Å) greater than the commonly used values of Bondi [25].
The positions of the maxima of the Gaussian functions describing the shortest C-Hal...Hal-C distances in each substance are shifted to shorter distances in the set of homomolecular crystals of organometallic compounds compared to crystals of organic compounds. At the same time, the values of the van der Waals radii proposed considering the characteristics of the two distributions of C-Hal...Hal-C contacts (the shortest in each substance and all less than the threshold while somewhat greater than R Hal) are almost the same in these two sets of substances, with the exception of the radius of I. The difference in the values of the radius of I may be attributed to a significantly smaller number of C-I...I-C contacts in the set of crystals of organometallic compounds. An