Diamonds Certify Themselves: Multivariate Statistical Provenance Analysis

The country or mine of origin is an important economic and societal issue inherent in the diamond industry. Consumers increasingly want to know the provenance of their diamonds to ensure their purchase does not support inhumane working conditions. Governments around the world reduce the flow of conflict diamonds via paper certificates through the Kimberley Process, a United Nations mandate. However, certificates can be subject to fraud and do not provide a failsafe solution to stopping the flow of illicit diamonds. A solution tied to the diamonds themselves that can withstand the cutting and manufacturing process is required. Here, we show that multivariate analysis of LIBS (laser-induced breakdown spectroscopy) diamond spectra predicts the mine of origin at greater than 95% accuracy, distinguishes between natural and synthetic stones, and distinguishes between synthetic stones manufactured in different laboratories by different methods. Two types of spectral features, elemental emission peaks and emission clusters from C-N and C-C molecules, are significant in the analysis, indicating that the provenance signal is contained in the carbon structure itself rather than in inclusions.


Introduction
Diamonds have played a significant role in human history. As major gemstones, an efficient method for transporting great value, and a component in many industrial applications, diamonds have long been useful, and their uses have evolved with human society. Diamond mining in countries with unstable governments motivated the use of diamonds to fund terrorist activities. Since 2003, international diamond trade has been led by the Kimberley Process, which defines conflict diamonds as those that "can be directly linked to the fueling of armed conflict, the activities of rebel movements aimed at undermining or overthrowing legitimate governments, and the illicit traffic in, and proliferation of, armaments, especially small arms and light weapons" [1]. Furthermore, diamond mining in some countries involves significant human civil rights violations, such as indentured servitude, child labor, and inhumane working conditions [2][3][4]. The Kimberley Process attempts to certify that all rough stones exported from participating nations are conflict-free. All rough diamonds are exported in tamper-resistant packaging with a Kimberley Process Certificate authorized by the country of origin; import and export of rough diamonds is allowed only between Kimberley Process countries [1].
Kimberley Process Certificates accompany rough diamonds to the cutter, but not to the customer. Once a diamond is cut, it is deemed conflict-free, prompting the smuggling of conflict diamonds into cutting facilities. No monitoring by independent observers exists, largely because there has not been an accurate monitoring method. The Kimberley Process requires a sophisticated infrastructure capable  The synthetic stones were cut and polished prior to analysis; natural stones were analyzed in rough form. All stones were cleaned by industry-standard methods prior to use in this study and again with isopropyl alcohol and a lint-free wipe immediately prior to analysis. LIBS spectra were acquired from each stone in four 4 × 4 grids, each of which covered 0.25 mm; grids were located on suitable flat surfaces for analysis. Total analysis time, from the beginning of analysis of one sample to the beginning of analysis of another, is approximately three minutes in this nonautomated system. Samples were analyzed by laser-induced breakdown spectroscopy (LIBS) using a ChemReveal LIBS system (Model 3996, TSI Incorporated, Shoreview, MN, USA). Laser ablation removes atoms from the surface of the sample, creating a plasma that contains atoms from the sample, as well as from the environment of analysis [38]. Samples were analyzed in Ar rather than air, resulting in higher peak intensities and smaller ablation craters at the same laser energy. In the plasma (at temperature >10,000 K), electrons are excited to upper orbitals. Energy is emitted as photons as the plasma cools and electrons decay to lower-energy orbitals. Photons are collected by a lens into fiber optic cables and diffracted by an Echelle spectrometer (SE 200, Catalina Scientific Instruments, LLC, Tucson, AZ, USA). Peak intensity is proportional to the number of photons emitted by specific orbital transitions [38].
This study used an Nd:YAG laser operating at wavelength 266 nm. Pulse energy was 30 mJ. Collection of light was delayed behind the laser pulse by 0.75 ns to prevent the collection of light emitted from nonquantum interactions of electrons, ions, and atoms. The gate width, or length of time for photon collection, was 10 µs. Spectra were acquired in four 4 × 4 grids, resulting in 64 spectra per sample, which were accumulated into eight spectra per stone. Some spectra (3.8%) were lost due to the inability of the computer processor to store the data at the rate it was being produced by the instrument. A total of 3161 spectra were used for multivariate analysis.
Samples were analyzed in an Ar environment rather than ambient air. As Ar is inert with high ionization energy compared to N, more laser energy is expended in exciting atoms from the sample than from those from the atmosphere when analyzing in Ar. The samples were analyzed at 1 atm Minerals 2020, 10, 916 4 of 12 (0.0001 GPa) with a stream of ultra-high-purity Ar gas flowing over the sample surface at a rate of 2 L/min. Average N intensities are measurably different for most of the localities, suggesting that the N originates from the samples rather than from the atmosphere. In addition, N-free materials analyzed with this method in other studies lack N peaks in the spectrum. Thus, it is likely that the Ar flow removed the atmospheric N from the analyzed area.
Multivariate data analysis employs the idea that the LIBS spectra of every material contain information on the origin, composition, and history of the material [42]. LIBS spectra were modeled using two multivariate techniques. In the Quantagenetics ® technique, data analysis was performed using MATLAB programming language (Mathworks, Natick, MA, USA) [42]. Single-shot spectra were pre-processed to minimize the shot-to-shot variation in a three-step process: (1) Set all intensities less than 1.0 to 1.0, (2) normalize each spectrum to its mean, and (3) convert to log 10 . The first step avoids taking log 10 of negative numbers and results in a minimum normalized intensity of zero. Unsupervised cluster analysis [43] of all 3161 spectra in the model was used to define spectral clusters, which corresponded to the known geographic/laboratory groups of diamonds. The probability distance function for each cluster was used as the tool to classify the spectra using a leave-one-out process, in which each spectrum was sequentially left out of the calibration and used to validate the model. Probability distance functions were calculated by first determining the Euclidian distance between each sample and the cluster mean in n-dimensional space, where n is the number of channels in the LIBS spectrum. The distance between a sample and the cluster mean is the square root of the sum of the squares of the residuals for each channel. The probability function, which describes the variation in spectra relative to the mean, was then calculated as the distribution of distances.
The Quantagenetics ® method determines the probability that a spectrum from a sample of unknown origin belongs to one of several clusters of spectra by calculating the Euclidian distance between the unknown spectrum and each of the cluster means [42,44,45]. The probability of a spectrum belonging to a cluster is determined by comparing the distance to the probability distance function for the cluster. The cluster with the highest probability is accepted as the provenance of the unknown. An accurate model depends on predicting which samples are from a given locality, as well as those that are not from the locality. Thus, success rates are calculated as the percent of true positives and true negatives of the total number of spectra. Principal component analysis (PCA) was used to understand the spectral features that represent significant relationships in the dataset. PCA loading plots indicate the spectral features that are highly correlated in each principal component (PC).
The second method was to train and validate a decision tree that consists of a series of partial least squares regression (PLSR) models [46,47]. PLSR is a multivariate method that is used to predict the value of an independent variable from a spectroscopic data set. Spectra were averaged for each specimen for all groups, yielding a total of 320 spectra for 29-30 specimens in each of the 11 location groups (the Brazilian carbonado diamonds were not included in this model). The PLSR decision tree consists of a series of binary models, each of which predict whether a spectrum belonged to the group being modeled or to the group of all other samples. For each model, the independent variable is set to 1 for spectra in the group being modeled and to 0 for spectra in all other groups. The majority of spectra (83%) were used to calibrate the models; the remaining 17% were used to validate the models. During calibration, PLSR calculates the value of the independent variable for each calibration spectrum. Spectra in the group being modeled should have calculated values close to 1; those in the group of all other spectra should have calculated values close to 0 value of the independent variable, called the Value of Apparent Distinction (VAD) [47], is chosen to discriminate between the two groups. Spectra with a calculated independent variable value greater than or equal to the VAD are classified as being in the group being modeled; those with calculated independent variable values less than the VAD fall into the group of all other spectra. Success rate is calculated as the percent of spectra in the validation set that were correctly classified.
The group with the most distinct composition, as observed in a PCA score plot, is the first group to be modeled. After successful calibration and validation, all spectra in that group are removed from all subsequent models. The process of defining the next group, calibrating and validating a model that classifies spectra as belonging to that group or to the group of all others, and the removal of spectra of that group from subsequent models continues until all groups have been modeled. Success rates are calculated as the percent of correctly classified spectra for each model and for the overall decision tree.

Results
Cut and polished synthetic diamonds represent the interior of the synthetic mass; thus, elemental impurities must reside in the diamond structure rather than in surficial impurities. Peaks for C, Mg, Fe, Si, Ti, Ca, Na, Ba, H, Ne, O, N, and Ar were observed in LIBS spectra of synthetic diamonds ( Figure 1A). As analysis was performed in an Ar environment, it is impossible to know whether Ar is present in the diamonds; however, H, Ne, O, and N are not atmospheric. In addition, strong emission peaks for the C-N molecules and weak C-C molecular emissions [48] are present in the spectra (Figure 2A,B).
The natural diamonds analyzed in this study exhibit LIBS peaks for all the elements observed in synthetic stones, as well as Cu, Ni, Al, V, Co, Cr, Sr, and K ( Figure 1B). The Brazilian carbonado diamonds carry the highest trace element load due to their inclusion-rich character; however, inclusion-free natural diamonds also exhibit peaks of these elements, as well as molecular C-N emissions ( Figure 2). C-C molecular emissions are highly variable ( Figure 2).
The success rate for the Quantagenetics© method, calculated as the percent of true positives and true negatives in a leave-one-out cross-validation, ranges from 95.8% to 100% for different sets (Table 2); the average success rate is 98.5%. The synthetic sets are easily identified from natural stones and from each other. A loading plot for a PCA model of all diamond spectra ( Figure 3) shows that the peaks of many elements, including Mg, Si, Ti, Mn, Al, Ca, Na, H, N, K, and O, and the C-N and C-C molecular emissions all exert influence on the principal component and are likely significant in predictions in the Quantagenetics© analysis.
Minerals 2020, 10, x FOR PEER REVIEW 5 of 13 less than the VAD fall into the group of all other spectra. Success rate is calculated as the percent of spectra in the validation set that were correctly classified. The group with the most distinct composition, as observed in a PCA score plot, is the first group to be modeled. After successful calibration and validation, all spectra in that group are removed from all subsequent models. The process of defining the next group, calibrating and validating a model that classifies spectra as belonging to that group or to the group of all others, and the removal of spectra of that group from subsequent models continues until all groups have been modeled. Success rates are calculated as the percent of correctly classified spectra for each model and for the overall decision tree.

Results
Cut and polished synthetic diamonds represent the interior of the synthetic mass; thus, elemental impurities must reside in the diamond structure rather than in surficial impurities. Peaks for C, Mg, Fe, Si, Ti, Ca, Na, Ba, H, Ne, O, N, and Ar were observed in LIBS spectra of synthetic diamonds ( Figure 1A). As analysis was performed in an Ar environment, it is impossible to know whether Ar is present in the diamonds; however, H, Ne, O, and N are not atmospheric. In addition, strong emission peaks for the C-N molecules and weak C-C molecular emissions [48] are present in the spectra (Figure 2A,B).  The PLSR decision tree is depicted in Figure 4. Spectra from RUS 4 diamonds were modeled first with a success rate of 96.9%. After the RUS 4 spectra were removed, no single group of spectra stood out as compositionally unique; however, the AUS 1 and BRZ groups together formed a unique cluster. Thus, Model 2 classifies spectra as belonging either to the group of AUS 1 and BRZ or to the group of all other spectra. The AUS 1 and BRZ spectra are separated from each other in Model 3, with a success rate of 86.7%. Model 4 is similar to Model 2 and classifies spectra from the two synthetic sets of samples as separate from all others. Model 5 then separates spectra from the two synthetic groups from each other with a success rate of 100%. Models 6-11 continue to differentiate between one group of spectra Minerals 2020, 10, 916 6 of 12 and the group of all others, defining a single group in each model. The final groups are RUS 2 and CAN 2. Their placement at the end of the decision tree indicates that they are similar in composition to each other. They can be differentiated from each other with 100% success in this binary model, but the subtle differences between them are lost when they are modeled with all the other diamond spectra. The decision tree correctly classifies 316 of 320 spectra, yielding an overall success rate of 95.8%. The natural diamonds analyzed in this study exhibit LIBS peaks for all the elements observed in synthetic stones, as well as Cu, Ni, Al, V, Co, Cr, Sr, and K ( Figure 1B). The Brazilian carbonado diamonds carry the highest trace element load due to their inclusion-rich character; however, inclusion-free natural diamonds also exhibit peaks of these elements, as well as molecular C-N emissions ( Figure 2). C-C molecular emissions are highly variable (Figure 2).
The success rate for the Quantagenetics© method, calculated as the percent of true positives and true negatives in a leave-one-out cross-validation, ranges from 95.8% to 100% for different sets ( Table  2); the average success rate is 98.5%. The synthetic sets are easily identified from natural stones and from each other. A loading plot for a PCA model of all diamond spectra (Figure 3) shows that the peaks of many elements, including Mg, Si, Ti, Mn, Al, Ca, Na, H, N, K, and O, and the C-N and C-C molecular emissions all exert influence on the principal component and are likely significant in predictions in the Quantagenetics© analysis.   true negatives in a leave-one-out cross-validation, ranges from 95.8% to 100% for different sets (Table  2); the average success rate is 98.5%. The synthetic sets are easily identified from natural stones and from each other. A loading plot for a PCA model of all diamond spectra (Figure 3) shows that the peaks of many elements, including Mg, Si, Ti, Mn, Al, Ca, Na, H, N, K, and O, and the C-N and C-C molecular emissions all exert influence on the principal component and are likely significant in predictions in the Quantagenetics© analysis.

Discussion
Variations in two types of spectral features are significant in the multivariate analysis: Elemental peaks and emissions from C-N and C-C molecules. These molecular emissions are thought to be related to the pre-ablation structure of organic molecules [40]. C-N molecules form in the plasma from ablated C and N atoms; thus, the intensity of the C-N molecular emission peaks will be related to the N concentration of the diamond. By contrast, C-C molecules are thought to be fragments of the

Discussion
Variations in two types of spectral features are significant in the multivariate analysis: Elemental peaks and emissions from C-N and C-C molecules. These molecular emissions are thought to be related to the pre-ablation structure of organic molecules [40]. C-N molecules form in the plasma from ablated C and N atoms; thus, the intensity of the C-N molecular emission peaks will be related to the N concentration of the diamond. By contrast, C-C molecules are thought to be fragments of the original structure [40]. The concentration of C-C molecules in the plasma is controlled by ablation efficacy, which in turn, is inversely related to the bond strength between neighboring C atoms relative to the bond strength between C atoms and other elements. In organic molecules, the stronger F-C bond strength in polytetrafluoroethylene is correlated to lower intensities of C-C molecular emissions compared to the weaker H-C bond of polyethylene, which exhibits more intense C-C molecular emission peaks [40]. Applying these findings to diamonds, C-C molecular intensities will vary depending on the strength of bonds, or lack of bonding, between C atoms and elemental impurities such as N, B, Si, Ti, Ni, Fe, and the other elements observed. Thus, intensities of molecular and elemental emissions are related, resulting in a unique LIBS signature for diamonds crystallized in the same host rock.
Two sets of synthetic diamonds, produced in different laboratories by different processes, were used in this study (Table 1). Our analysis correctly predicted the provenance of all synthetic stones, indicating that this method distinguishes between natural and synthetic diamonds. The success rates for both synthetic sets are essentially 100%, indicating that it is also possible to distinguish between sets of synthetic diamonds. The major spectral features in a PCA loading plot for the two sets of synthetic stones are the elemental emissions for Ca, Mg, Fe, O, and Ne. With only two sets, it is impossible to determine whether these variations are controlled by the manufacturer's specific recipes and contaminants or a process-related factor.
The sets of placer diamonds, RUS 2 and RUS 3, have the lowest success rates (97.0% and 95.8%, respectively). This is not unexpected, because placer deposits potentially contain a mixture of stones from different sources in a drainage basin. RUS 2 and RUS 3 have high occurrences of both false negatives (placer stones predicted to belong to another group) and false positives (stones from other groups predicted to belong to the placer); this is caused by the compositional heterogeneity of multiple diamond sources in placer deposits. Success rates would increase if each placer subgroup were represented by 30 or more samples; with a sufficiently large sample set, it would be possible to model each subgroup independently.
Heterogeneity within sets of diamonds can be assessed using PCA. Diamonds in the RUS 2 and RUS 3 placer deposits fall into two distinct groups in PCA score plots ( Figure 5A,B). Multiple groups of diamonds from different sources are expected in placer deposits. However, the success rate for classification of diamonds from these deposits are still >90% using both methods (Table 1, Figure 4). Using the Quantagenetics ® method, 97% of RUS 3 spectra were correctly classified, as were 95.8% of RUS 3 spectra. Similarly, success rates for the PLSR decision tree were 100% for RUS 2 and 91.4% for RUS 3. This suggests that: (1) The location of diamonds from placer deposits can be determined because there is a consistent regional compositional signature; and (2) compositional subgroups within a placer deposit can be identified. This technique may be useful in exploring for the host rock that supplies diamonds to placer deposits. The existence of multiple diamond sources in a single kimberlite has been documented. For example, kimberlite mines such as Diavik in Canada comprise several individual pipes [49]. Heterogeneity in diamond characteristics such as N aggregation and isotopic composition from single mines has been observed [24]. Features such as growth zones and {100} platelets, thought to be extremely thin plate-like segregations of impurities such as C or N [50], may also differ in diamonds from a single source. This heterogeneity is observed as multiple subgroups are seen in the LIBS data from two Canadian kimberlites, CAN 2 and CAN 3 ( Figure 5C). Both CAN 2 and CAN 3 exhibit two clear subgroups, although the CAN 3 subgroups show a larger difference in composition by the larger separation between them in the PCA plot. In addition, the CAN 2 diamonds and CAN 3 diamonds plot in separate areas on this plot, despite the close proximity of the two kimberlites. Although one subgroup of CAN 3 diamonds plots near the CAN 2 diamonds, both the Quantagenetics ® and decision tree methods were successful in predicting the origin of the spectra (Table 1, Figure 4). The Quantagenetics ® method yielded success rates of 99.6% for CAN 2 and 97.7% for CAN 3; the decision tree had success rates of 100% for CAN 2 and 92% for CAN 3. The analyses in this study were acquired without regard to crystallographic direction and likely encountered various growth zones and platelets. Multivariate analysis of LIBS spectra appears to be able to predict provenance correctly despite these compositional complexities. The existence of multiple diamond sources in a single kimberlite has been documented. For example, kimberlite mines such as Diavik in Canada comprise several individual pipes [49]. Heterogeneity in diamond characteristics such as N aggregation and isotopic composition from single mines has been observed [24]. Features such as growth zones and {100} platelets, thought to be extremely thin plate-like segregations of impurities such as C or N [50], may also differ in diamonds from a single source. This heterogeneity is observed as multiple subgroups are seen in the LIBS data from two Canadian kimberlites, CAN 2 and CAN 3 ( Figure 5C). Both CAN 2 and CAN 3 exhibit two clear subgroups, although the CAN 3 subgroups show a larger difference in composition by the larger separation between them in the PCA plot. In addition, the CAN 2 diamonds and CAN 3 diamonds plot in separate areas on this plot, despite the close proximity of the two kimberlites. Although one subgroup of CAN 3 diamonds plots near the CAN 2 diamonds, both the Quantagenetics ® and decision tree methods were successful in predicting the origin of the spectra (Table 1, Figure 4). The Quantagenetics ® method yielded success rates of 99.6% for CAN 2 and 97.7% for CAN 3; the decision tree had success rates of 100% for CAN 2 and 92% for CAN 3. The analyses in this study were acquired without regard to crystallographic direction and likely encountered various growth zones and platelets. Multivariate analysis of LIBS spectra appears to be able to predict provenance correctly despite these compositional complexities.
The ability to determine the geographic origin of diamond accurately and rapidly is a requisite for controlling the influence of conflict stones in the diamond industry; the work reported in this pilot study suggests that automated Quantagenetics ® analysis of LIBS data could be utilized for this purpose in the gem industry. Previous technology focused on features that were not ubiquitous in stones from each deposit, such as inclusion chemistry or crystal morphology. Studies of nitrogen aggregation, while useful in estimating the age and thermal history of a diamond, have not proven effective in provenance determination. Spectroscopic studies that determine the concentrations of a suite of elements in diamonds have not proven efficacious in the determination of provenance, because the chemical relationships are too complex to be discerned in a data set with~20 trace elements. By contrast, Quantagenetics ® analysis of LIBS spectra simultaneously uses information from every element in the samples in addition to structural information, providing a set of variables that is sufficiently large to unravel inherent complex geochemical relationships. Diamond provenance, like that of rubies and sapphires, depends not on the concentrations of a few elements but on the subtle and consistent proportions of all the trace elements [47]; these relationships are too complex to display on the traditional triangular diagram. This study indicates that multivariate analysis of LIBS spectra has promise in the accurate prediction of diamond provenance using a signal that is contained in the carbon structure itself, providing a robust and unique Quantagenetics ® [42] signature of each deposit. Two of the localities in this study are within 100 km of each other; we are able to determine provenance at least on that spatial scale. In addition, this method easily distinguishes between natural and synthetic diamonds, as well as between synthetic stones from different laboratories.