Seized Ecstasy Pills: Infrared Spectra and Image Datasets

Abstract: According to the World Drug Report 2020, cocaine and ecstasy are the most consumed stimulant drugs, with 19 and 27 million estimated users in 2018, respectively. In this context, large efforts are being made to design fast and cost-effective analytical methods to track and monitor the distribution networks of these synthetic drugs. Here, we share two datasets of ecstasy pills seized in the northeast of Switzerland between 2010 and 2011. The first contains 621 forensic-grade images of pills, while the second consists of 486 mid-infrared (mIR) spectra. Although the two sets do not cover exactly the same seizures, both provide high-quality data with orthogonal information to evaluate clustering and dimension reduction methods.


Summary
Amphetamine-type stimulants (ATS), commonly known as ecstasy, XTC, or designer drugs, are derivatives of amphetamine (Am) and are sold as tablets. The other most commonly found derivatives are MDA (3,4-methylenedioxyamphetamine) and MDMA (3,4-methylenedioxy-N-methylamphetamine) [1]. In 2018 alone, 228 metric tons of methamphetamine (MAm) were seized globally [2]. Testing and monitoring illicit pills is thus a gigantic task and a matter of public health. While many analytical platforms have been showcased for that purpose, the main focus has been the identification and/or quantification of active compounds in pills [3][4][5][6].
Instead, highlighting similarities between seizures as quickly as possible enables tracking the origin of the pills, hence unveiling trafficking routes and supply chains for production [7,8]. The size of this task calls for cheap methods with fast response times that can be deployed at a large scale, as close as possible to where seizures occur. Profiling methods exist that target either the visual aspect of the pills, through imaging [9], or their composition, through optical spectroscopy [10,11] or even portable chromatography [12].
Both imaging and spectroscopy provide high-dimensional data, in which a large number of variables (pixels or frequencies) are measured simultaneously. Consequently, these data are usually large and require mathematical models to extract insightful information, such as the most relevant variables, or to build classifiers. Although high-dimensional data can always be represented as m × n matrices, data processing depends on the nature of the data, i.e., spectral resolution, number of points, quality of the baseline, and signal-to-noise ratio, among others. Images, for example, are analyzed by first defining regions of interest and then applying filters. Different multivariate analysis strategies can then be tested to extract information. Unsupervised approaches, such as Principal Component Analysis (PCA), Non-negative Matrix Factorization [13], or, more recently, Uniform Manifold Approximation and Projection (UMAP) [14], are most commonly used. Again, the choice of method is dictated by the type of data, the experimental design, and the aim of the research. It is not unusual to benchmark different data processing and normalization techniques, as well as analysis pipelines, to pick the most appropriate one. Conversely, the development of new statistical models requires access to high-quality and diverse datasets to evaluate their performance.
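As an illustration of the m × n representation discussed above, the following minimal sketch arranges samples as rows and variables as columns and computes PCA scores through a singular value decomposition. The data are synthetic and the function name `pca_scores` is hypothetical; this is one common way to compute PCA, not necessarily the procedure used to produce the results in this article.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """PCA of an (m samples x n variables) matrix via SVD.

    Returns the scores, the loadings, and the fraction of variance
    explained by each retained component.
    """
    Xc = X - X.mean(axis=0)  # center each variable (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, Vt[:n_components], explained[:n_components]


# ASSUMPTION: synthetic stand-in data, not the published spectra or images.
rng = np.random.default_rng(0)
n = 500                       # number of variables (e.g., frequencies or pixels)
base = rng.normal(size=n)     # profile shared by all samples
shift = rng.normal(size=n)    # systematic difference between two groups

group_a = [base + 0.05 * rng.normal(size=n) for _ in range(10)]
group_b = [base + shift + 0.05 * rng.normal(size=n) for _ in range(10)]
X = np.vstack(group_a + group_b)   # shape (m, n) = (20, 500)

scores, loadings, explained = pca_scores(X)
print(X.shape, scores.shape, explained)
```

In this toy example the two groups separate along the first principal component, because the systematic difference between them dominates the variance.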
The two datasets described herein provide orthogonal information, namely visual aspects (images) and chemical composition (mIR spectra), about different seizures of ecstasy pills made on the streets of Switzerland. The large amount of data made it challenging to cluster the seizures in a single analysis. The second dataset contains 486 mid-infrared (mIR) spectra (650 to 4000 cm⁻¹) of 6701 points each, which were acquired during the same period, 2011, but never published. They represent 41 different seizures, and several replicates were analyzed for each of them, as described in Table 2. Because the two datasets do not cover exactly the same seizures, they cannot be compared directly. However, it is worth mentioning that 31 seizures are common to both datasets.
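The dimensions quoted above can be checked with a short calculation. The sketch below assumes, as a simplification not stated in the source, that the 6701 points are evenly spaced between 650 and 4000 cm⁻¹; under that assumption the point spacing is 0.5 cm⁻¹. The placeholder matrix only illustrates the expected shape of the spectral dataset.

```python
import numpy as np

n_spectra = 486                 # number of mIR spectra in the dataset
n_points = 6701                 # points per spectrum
wn_min, wn_max = 650.0, 4000.0  # spectral range in cm^-1

# ASSUMPTION: evenly spaced wavenumber axis (not confirmed by the source).
wavenumbers = np.linspace(wn_min, wn_max, n_points)
step = (wn_max - wn_min) / (n_points - 1)

# Placeholder matrix with one spectrum per row; real absorbance values
# would be loaded from the published files instead.
X = np.zeros((n_spectra, n_points))

print(f"point spacing: {step} cm^-1, matrix shape: {X.shape}")
```

Note that this spacing describes the sampling of the data points and should not be confused with the optical resolution of the instrument.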

Imaging
All images were acquired by trained forensic staff at Université de Lausanne using well-established protocols and standard equipment, as depicted in Figure 1. While this process was useful to evaluate the prospects of such an approach, a future implementation must perform well with data acquired from different sources, including hand-held smartphones.
Figure 1. Schematic of the experimental setup used to photograph each pill. This setup is standard in forensic procedures. The camera is a Canon D90 with a Canon EFS 60 mm lens. The two black rings ensure homogeneous lighting from two independent light sources (Marcel Aubert SA, MA 1300, Nidau, Switzerland). The pill is placed on the vertical rod in the center, and a white square of paper is used as a reference for color and size.

mIR Spectra
All spectra were acquired using a Nicolet iS5 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) equipped with an iD5 ATR module, using the parameters described in Table 3. Each sample was ground to a fine powder, homogenized, placed on the crystal, and held there under the same pressure for every measurement. Prior to each measurement, the crystal was carefully washed and a blank spectrum was acquired.

Usage Note
To the best of our knowledge, both datasets are the first of their kind to be published, probably due to the access restrictions that apply to police evidence material and, in this case, to controlled substances. The release of these original datasets will allow researchers from different fields to evaluate and propose new strategies for extracting information from such data, and will thus contribute to finding solutions to a very acute public health issue.
Infrared spectra show that most of the pills in this dataset contain 3,4-methylenedioxymethamphetamine (MDMA) as the active compound, which is expected according to recent reports [2]. However, some pills contain different derivatives, as indicated by the large deviations observed in their spectra (see Figure 2). Other molecules are necessary for the preparation of the tablets. These can also be used to profile seizures that may share the same supply chain or that were prepared following a similar recipe. The consistency of the composition found among different pills from the same seizure, where replicates are available, provides additional information about the production method. Images, on the other hand, provide a rapid means of discriminating between different distribution networks, although different-looking tablets may be produced by the same laboratory in an effort to confuse police investigations.
Smartphones are already universally available. They enable users to upload geolocated images of seized pills and to retrieve information about previous seizures that contain similar tablets. An example of clustering analysis for that purpose was demonstrated several years ago by the authors, and readers are referred to that work for further details [9]. Hence, this dataset could prove useful to gauge the impact of recent progress in machine learning and artificial intelligence, in terms of speed and accuracy. These data could also be used to develop smartphone applications that enable users to quickly check whether the product they purchased has been reported as dangerous.
Portable and ultra-portable spectrophotometers are becoming available and may be used during routine police controls or raids to complement visual information. Other applications could also be envisioned, such as assessing the quality or toxicity of tablets in real time and reporting the results to a central database.
An example of a simple multivariate analysis, PCA, is shown below using online and open source tools developed for the c6h6.org electronic notebooks. Figure 2 shows an overlay of the 486 spectra after applying a standard normal variate transformation, color-coded by seizure. The upper portion shows a score plot of the first two principal components.
To access this tool, preferably with Google Chrome, choose the "PCA" tile on the landing page. Enter "XTC" in the search bar on the left to retrieve the data. Seizures can be selected individually using the "+" buttons, or added as a whole using the "+" sign located in the header of the "List of selected sample" window. To calculate the PCA, or to recalculate it after removing outliers (using alt-Draw in the score plot), use the "calculate/recalculate PCA" button.
As expected, some seizures cluster very compactly (seizures 1132, 0966, and 0244, uppermost cluster), while some clusters appear very distant (seizure 0140, leftmost cyan cluster). Visual inspection of the pills from seizures 1132, 0244, and 0966 suggests that all three originate from the same source, while the pill from seizure 0140 clearly looks different (see Figure 3).
Hierarchical clustering analysis is another widely used approach to explore structure in data. The clustering shown in Figure 4 was obtained using the spectra similarity tool available in the open source c6h6 notebook. In this case, only the four seizures 0244, 1132, 0966, and 0140 were used as inputs. As expected, the first three were clustered together, as their pills look similar in shape and color, while seizure 0140 was singled out because of its color. Seizures 0244, 1132, and 0966 could not be further classified, as observed in the close-up on the right, and were thus presumed to originate from the same source.
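The standard normal variate (SNV) transformation mentioned above centers and scales each spectrum individually, which removes additive offsets and multiplicative scaling differences between measurements. The following minimal sketch shows the usual textbook definition on synthetic data; the function name `snv` is hypothetical, and the exact implementation used by the c6h6.org tools is not described in the source.

```python
import numpy as np


def snv(X):
    """Standard normal variate: center and scale each row (spectrum)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)  # sample standard deviation
    return (X - mean) / std


# ASSUMPTION: synthetic single-band "spectrum", for illustration only.
x = np.linspace(0.0, 1.0, 200)
band = np.exp(-((x - 0.4) ** 2) / 0.002)

# The second row is the same spectrum with a different intensity and offset.
raw = np.vstack([band, 2.5 * band + 0.3])
corrected = snv(raw)

# After SNV, both rows coincide.
print(np.allclose(corrected[0], corrected[1]))
```

The corrected matrix can then be passed to a PCA routine to obtain a score plot comparable in spirit to the one described for Figure 2.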
The upper branch groups all the pills from seizure 0140, while the lower branch aggregates all the remaining pills, which were likely produced by the same laboratory. The panel on the right zooms in on the lower branch. As expected, no further sub-clustering was observed within that branch, supporting the view that the pills came from a similar source.
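A hierarchical clustering of this kind can be sketched with SciPy as follows. The spectra here are synthetic, and the choice of correlation distance with average linkage is an assumption made for illustration: the source does not state which similarity measure or linkage the c6h6 spectra similarity tool uses.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 300)


def band(center, width=0.02):
    """Gaussian band used to build synthetic spectra."""
    return np.exp(-((x - center) ** 2) / (2 * width ** 2))


# ASSUMPTION: two hypothetical composition profiles, not real pill spectra.
profile_a = band(0.30) + 0.5 * band(0.70)
profile_b = band(0.45) + 0.8 * band(0.60)

# Nine replicates of profile A and three of profile B, with small noise.
spectra = np.vstack(
    [profile_a + 0.01 * rng.normal(size=x.size) for _ in range(9)]
    + [profile_b + 0.01 * rng.normal(size=x.size) for _ in range(3)]
)

# Agglomerative clustering on pairwise correlation distances.
Z = linkage(spectra, method="average", metric="correlation")

# Cut the tree into two branches.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw a tree similar in layout to Figure 4, with one branch per group of similar spectra.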