AddictedChem: A Data-Driven Integrated Platform for New Psychoactive Substance Identification

The mechanisms underlying drug addiction remain nebulous. Furthermore, new psychoactive substances (NPS) are being developed to circumvent legal control; hence, rapid NPS identification is urgently needed. Here, we present the construction of the comprehensive database of controlled substances, AddictedChem. This database integrates the following information on controlled substances from the US Drug Enforcement Administration: physical and chemical characteristics; classified literature by Medical Subject Headings terms and target binding data; absorption, distribution, metabolism, excretion, and toxicity; and related genes, pathways, and bioassays. We created 29 predictive models for NPS identification using five machine learning algorithms and seven molecular descriptors. The best performing models achieved a balanced accuracy (BA) of 0.940 with an area under the curve (AUC) of 0.986 for the test set and a BA of 0.919 and an AUC of 0.968 for the external validation set, which were subsequently used to identify potential NPS with a consensus strategy. Concurrently, a chemical space that included the properties of vectorised addictive compounds was constructed and integrated with AddictedChem, illustrating the principle of diversely existing NPS from a macro perspective. Based on these potential applications, AddictedChem could be considered a highly promising tool for NPS identification and evaluation.


Introduction
Drug addiction is a chronic relapsing disease that involves drug sourcing and continuous use and adversely affects the brain. Individuals with drug addiction represent all age groups. Drug addiction often causes serious physical and mental damage to the individual, leading to societal problems, such as rising social crime rates and economic losses. The 2021 World Drug Report of the United Nations Office on Drugs and Crime (UNODC) indicated that approximately 275 million people took drugs worldwide in 2020, whereas more than 36 million individuals had drug use disorders [1]. Consequently, specific regulations have been introduced to control addictive compounds. The United Nations formulated three United Nations conventions in 1961 [2], 1971 [3], and 1988 [4] to classify addicted drugs and precursors. The European Union has no legislation to classify narcotic or psychotropic substances, while it has a pan-European control for rapid detection. For the USA, the Controlled Substances Act is a statute of the US Drug Enforcement Administration (DEA), which establishes federal US drug policy under which the manufacture, importation, possession, use, and distribution of certain substances is regulated [5]. Depending on the risk of dependence and abuse, and addiction likelihood, controlled substances are divided into Schedule I-V levels, where level I comprises research drugs whose use for medical purposes is strictly prohibited, and drugs in the other four categories can be used for medical treatment.
Most controlled compounds are agonists of receptors of the human central nervous system, such as opioid receptors, cannabis receptors, 5-hydroxytryptamine receptors, acetylcholine receptors, dopamine receptors, and γ-aminobutyric acid receptors [6]. The specific addiction mechanism may be related to the reward circuit mechanism in the brain [7]. For example, morphine stimulates µ receptors located on γ-aminobutyric acid neurons in the ventral tegmental area (VTA), thus disinhibiting dopaminergic neurons. This effect consequently promotes dopamine release from the VTA to the nucleus accumbens, triggering a rewarding effect [8,9]. However, the specific mechanism of addiction to most controlled substances remains unclear [10], and different levels of information must be integrated for in-depth research into controlled substances. However, a large amount of experimental data on controlled compounds exists in the scientific literature and databases. For example, the DrugBank database [11] contains information on the physiological targets and pharmacological effects of some controlled compounds; admetSAR [12] contains experimental and predicted values of the absorption, distribution, metabolism, excretion, and toxicity (ADMET) of controlled compounds; BindingDB [13] provides comprehensive information about protein-target interactions related to controlled compounds; and the Comparative Toxicogenomics Database (CTD) [14] provides detailed information on the interaction of controlled substances with genes and molecular pathways. To date, no single database provides experimental biologists with comprehensive information on controlled substances.
In addition to controlled substances per se, their analogues, i.e., new psychoactive substances (NPS), are constantly devised [15]. NPS are molecules that exert strong psychoactive effects mimicking the effects of legal recreational drugs and are difficult to detect by routine drug screening [16]. The UNODC first used the term 'new psychoactive substances' to refer to 'substances of abuse, either in a pure form or as a preparation, that are not controlled by the 1961 Single Convention on Narcotic Drugs or the 1971 Convention on Psychotropic Substances, but which may pose a public health threat' [17]. For example, a synthetic stimulant called 'bath salt' is a cathinone analogue [18]. At the end of 2020, approximately 830 NPS were monitored by the European Monitoring Centre for Drugs and Drug Addiction (EMCDDA), and their numbers continue to increase [15]. How to classify NPS from the many substances is an important issue that is challenging. Within drug samples, forensic and toxicological analyses can detect, identify, and quantify NPS. However, without a certified sample, NPS can be screened using methods such as highresolution mass spectrometry [19]. The above efforts to identify and predict NPS are costand time-consuming. From a computational viewpoint, a quantitative structure-activity relationship model was used to verify the analogues of amphetamine cathinone [20]. We still need more computational tools to produce low-cost, rapid predictions of NPS and to pave the way for the earlier identification of emerging drugs.
In this study, we present AddictedChem, which is a scientific database of controlled substances. The database integrates multiple dimensions of data, allowing the analysis of controlled substances from the perspectives of molecules, targets, and functional enrichment, facilitating the discovery of new targets and drugs for the treatment and prevention of addictive diseases. We also created a consensus model to predict NPS to accelerate experimental NPS identification. Finally, we visually explored the chemical space of addictive compounds and discovered the spatial distribution patterns of NPS and controlled substances, providing the user with an interactive online exploration platform (http://design.rxnfinder.org/addictedchem/chemical_space). AddictedChem will facilitate further addictive compound research and identification.

Overview of AddictedChem
The developed analysis platform comprises two parts ( Figure 1): the knowledge base of 622 controlled substances (classified as shown in Figure 2A) and the NPS prediction platform. Five search methods, namely, text query, structure, similarity, maximum common substructure (MCS), and fragment, are available in the knowledge base.

Overview of AddictedChem
The developed analysis platform comprises two parts ( Figure 1): the knowledge base of 622 controlled substances (classified as shown in Figure 2A) and the NPS prediction platform. Five search methods, namely, text query, structure, similarity, maximum common substructure (MCS), and fragment, are available in the knowledge base.
The user can directly enter the name or structure of a controlled substance to obtain the relevant literature, target, ADMET, gene, pathway, and bioassay information ( Figure  2B). The text query method retrieves controlled substances using synonymous names from PubChem (see Materials and Methods for details). The structure search method requires the user to provide the compound in the simplified molecular input line entry specification (SMILES) format or use the JSME tool [21] to draw the structure of the compound of interest. The JSME tool automatically converts the 2D molecular image into the SMILES format. The similarity retrieval method is based on a similarity algorithm [22], which returns the 20 closest molecular structures of the compound searched in the AddictedChem database. The MCS retrieval method is based on the fMCS algorithm [23] and returns 20 compounds with the largest substructure of the searched compound available in the AddictedChem database. Finally, the fragment retrieval method is based on the theory of connected subgraphs [24]. After entering the chemical structure of a compound, all fragment structures of the compound are retrieved.
In the NPS prediction platform, the user obtains the result report after entering the compound in the SMILES format or several compounds formatted in rows. The results are presented as three information sets: (1) consensus model results; (2) visualised results of 29 models based on seven descriptors, to be intuitively viewed by the user; and (3) specific prediction results of the 29 models in a list format. The website allows researchers to easily obtain prediction results. The user can directly enter the name or structure of a controlled substance to obtain the relevant literature, target, ADMET, gene, pathway, and bioassay information ( Figure 2B). The text query method retrieves controlled substances using synonymous names from PubChem (see Materials and Methods for details).
The structure search method requires the user to provide the compound in the simplified molecular input line entry specification (SMILES) format or use the JSME tool [21] to draw the structure of the compound of interest. The JSME tool automatically converts the 2D molecular image into the SMILES format. The similarity retrieval method is based on a similarity algorithm [22], which returns the 20 closest molecular structures of the compound searched in the AddictedChem database. The MCS retrieval method is based on the fMCS algorithm [23] and returns 20 compounds with the largest substructure of the searched compound available in the AddictedChem database. Finally, the fragment retrieval method is based on the theory of connected subgraphs [24]. After entering the chemical structure of a compound, all fragment structures of the compound are retrieved.
In the NPS prediction platform, the user obtains the result report after entering the compound in the SMILES format or several compounds formatted in rows. The results are presented as three information sets: (1) consensus model results; (2) visualised results of 29 models based on seven descriptors, to be intuitively viewed by the user; and (3) specific prediction results of the 29 models in a list format. The website allows researchers to easily obtain prediction results.
The addictive compounds are still far from saturation because more new psychoactive substances are emerging and have not been regulated completely. Because AddictedChem is an ongoing curation, it is continuously maintained and updated as new controlled substances are published. We will update the database annually to improve the comprehensiveness of data and collect more information about emerging controlled substances from more countries and regions in the future. The addictive compounds are still far from saturation because more new psychoactive substances are emerging and have not been regulated completely. Because Addicted-Chem is an ongoing curation, it is continuously maintained and updated as new controlled substances are published. We will update the database annually to improve the comprehensiveness of data and collect more information about emerging controlled substances from more countries and regions in the future.
We then used the Murcko scaffolds method to analyse controlled substances classed into different schedules (Supplementary Figure S1). The analysis revealed that 1,3-dihydro-5-phenyl-1,4-benzodiazepin-2-one is mainly found in Schedule IV drugs, 4-phenylpiperidine is mainly found in Schedule I and II drugs, and 4-anilino-N-phenethyl-piperidine is mainly found in Schedule I drugs. Notably, the top three scaffolds of Schedule III drugs are the same isomer, indicating that the structures of the Schedule III drugs are the most similar. These findings showed the diversity of the scaffolds of controlled substances, but the scaffolds of controlled substances with the same schedule are similar.

Analysis of Controlled Substance Targets
We then analysed controlled substance targets and used the PubChem ID of a compound of interest as the basis to search for its target information in the BindingDB database [13]. Analysis revealed that as of 4 January 2021, only 144 controlled substances (of 622) were associated with 104 (after deduplication) target studies, which comprised 475 compoundtarget pairs ( Figure 3). found in fentanyl, ocfentanil, and isobutyryl fentanyl, and it is the direct precursor of fentanyl and certain fentanyl analogues (such as acetylfentanyl). We then used the Murcko scaffolds method to analyse controlled substances classed into different schedules (Supplementary Figure S1). The analysis revealed that 1,3-dihydro-5-phenyl-1,4-benzodiazepin-2-one is mainly found in Schedule IV drugs, 4-phenylpiperidine is mainly found in Schedule I and II drugs, and 4-anilino-N-phenethyl-piperidine is mainly found in Schedule I drugs. Notably, the top three scaffolds of Schedule III drugs are the same isomer, indicating that the structures of the Schedule III drugs are the most similar. These findings showed the diversity of the scaffolds of controlled substances, but the scaffolds of controlled substances with the same schedule are similar.

Analysis of Controlled Substance Targets
We then analysed controlled substance targets and used the PubChem ID of a compound of interest as the basis to search for its target information in the BindingDB database [13]. Analysis revealed that as of 4 January 2021, only 144 controlled substances (of 622) were associated with 104 (after deduplication) target studies, which comprised 475 compound-target pairs ( Figure 3).  Considering the binding drug targets ( Figure 2D), the most common target of the identified 144 compounds was the Mu-type opioid receptor (UniProt ID: P35372), targeted by 26 compounds; followed by the delta-type opioid receptor (UniProt ID: P41143), targeted by 25 compounds; and the Kappa-type opioid receptor (UniProt ID: P41145), targeted by 24 compounds. While these three targets all interact with Schedule I-IV drugs, they mostly interact with Schedule II drugs. The fourth most common target was the sodiumdependent serotonin transporter (targeted by 19 compounds), which mainly interacts with Schedule I drugs. The other common targets were canalicular multi-specific organic anion transporter 2 (targeted by 18 compounds) and multidrug resistance-associated protein 4 (also targeted by 18 compounds), which mainly interact with Schedule IV drugs. Based on drug classification and statistical analysis of controlled substances from different schedules (Supplementary Tables S1-S5), Schedule II drugs had the highest degree of polymerisation, which was mainly because most of these compounds target cannabinoid receptors.
Considering the compound type ( Figure 2E), the controlled substances with the greatest number of targets were 3,4-methylenedioxymethamphetamine (MDMA), 25I-NBOMe, chlordiazepoxide, acetaminophen (APAP), and lacosamide. MDMA, as a Schedule I drug, induces the reversal of human serotonin transporters, leading to serotonin release [26]. Another Schedule I drug, 25I-NBOMe, is a synthetic hallucinogen that is a derivative of the 2C-I family of substituted phenethylamines. The main receptor of 25I-NBOMe is the human 5-HT 2A receptor, and the compound interacts weakly with other receptors. Chlordiazepoxide, a Schedule IV drug, is a sedative and hypnotic drug from the benzodiazepine class. It interacts with GABA A receptors/ion channels, influencing the central nervous system [27]. APAP, a Schedule IV drug, is a common analgesic and antipyretic drug that acts on many targets. Finally, lacosamide, a Schedule V drug, is an anticonvulsant compound. Controlled substances always activate or inhibit key targets in the signal transduction of the central nervous system; hence, identifying these targets helps to discover the mechanism of addiction.

Functional Enrichment Analysis
The analysis results revealed that pathway information was available for 166 controlled substances with 985 compound-pathway combinations. For the functional pathways of controlled substances, we observed that pathway information was available for 122 controlled substances with 209 compound-pathway combinations. Of the small number of the controlled substances (122), nearly 37% are related to the GABA-A receptor agonists/antagonists, which is followed by opioid receptor agonists/antagonists ( Figure 2F). This is in line with the mechanism by which most regulated substances target receptor agonists or inhibitors.
However, it is worth noting that because of data deficiency, the pathway analysis was conducted only on controlled substances that have reported targets and pathways and did not take dose effects into consideration.
Likewise, we identified 261 compounds and 442 related genes in humans. We counted 5637 gene ontology (GO) annotations for these genes, of which 203 were significantly enriched, and we identified the top 40 GO annotations that were significantly enriched (Supplementary Figure S2). GO analysis revealed that genes that interact with controlled substances were significantly enriched in the steroid metabolic process, response to the drug, response to oxidative process, and the metabolic process of fatty acids. Furthermore, biological process (BP) accounted for 90% of the 203 enriched GO annotations, indicating that addictive compounds are mainly involved in biological processes. These results reveal that controlled substances are widely involved in metabolic pathways in the human body and play a role in biological processes.

Model Performance for NPS Prediction
Using seven types of molecular descriptors and five machine learning algorithms, we built 29 models to predict potential NPS. The model's performance on external validation and test sets is summarised in Table 1. As shown, most models performed well with the test and external validation sets. Model performance with the test set was better than that with the external validation set. Because the test set was extracted from and constituted 20% of the entire dataset, the external validation set was collated using external data for verification. Among the seven molecular feature description methods, Morgan fingerprint performed the best with the four traditional machine learning models with the test set. The performance of SECFP with the NB, RF, and SVM models with the validation set was better than with the test set.
Meanwhile, the performance of the deep learning model was not better than traditional machine learning. The deep learning model performed well with the test set, but its performance with the external validation set was average. Among traditional machine learning models, the RF model performed the best, which was followed by SVM, LR, and NB. The best performance with the test set was obtained with the RF model based on MACCS (BA of 0.940 and AUC of 0.986) ( Table 1). The best performance with the external validation set was obtained with the NF model based on SECFP (BA of 0.919 and AUC of 0.968). However, as shown in Table 1, some models including NB::Mol2vec, LR::RDkit, RF::MHFP6, and ChemBERT have extremely poor performance. We deduced that model overfitting might have caused some models to perform much worse on the external validation set than on the test set. The test set and the training set belong to the same source, although the compounds are different. Compared with the test set, the external validation set is a completely different source. Furthermore, the number of externally labelled validation sets is very small. These reasons may cause some overfitted models to perform poorly on external validation sets. Hence, we selected RF::MACCS and NB::SECFP from the 29 models to form a consensus model. For intuitive comparison, we further compared the receiver-operating characteristic (ROC) and PR curves of all these methods; in the machine learning algorithm, we used Morgan fingerprint because it was very stable in all models (Figure 4). The analysis revealed the necessity of constructing a consensus model that retained the best model performance and feature description to the greatest extent.
the receiver-operating characteristic (ROC) and PR curves of all these methods; in th chine learning algorithm, we used Morgan fingerprint because it was very stable models ( Figure 4). The analysis revealed the necessity of constructing a consensus m that retained the best model performance and feature description to the greatest ext The consensus scoring model is used for the online classification prediction fun of addictive compounds. Deoxymethoxetamine (DXME) is a hallucinogen from the cyclohexylamine family that exerts dissociative effects. It has emerged on the illicit m since October 2020 and was first identified by a Danish forensic laboratory in Feb 2021 [28]. As a newly included NPS by HighResNPS (https://highresnps.forensic.k accessed on 27 April 2022), DXME does not appear in our datasets. We predicted D using the online prediction function freely available at http://design.rxnfinder.org/ad edchem/prediction/. Our consensus model correctly predicted it as an addictive pound (Supplementary Figure S3). This finding, to some extent, indicates that our m can predict potential NPS.
Despite our best efforts to collect compound addiction datasets, the datasets coll in this study are still small. This may affect the model's ability to discriminate structu unique novel addictive compounds. In the future, we will try new machine learning rithms to model methods, including positive unlabelled learning and few shot lear to further expand the generalisation ability and robustness of the model. The consensus scoring model is used for the online classification prediction function of addictive compounds. Deoxymethoxetamine (DXME) is a hallucinogen from the arylcyclohexylamine family that exerts dissociative effects. It has emerged on the illicit market since October 2020 and was first identified by a Danish forensic laboratory in February 2021 [28]. As a newly included NPS by HighResNPS (https://highresnps.forensic.ku.dk, accessed on 27 April 2022), DXME does not appear in our datasets. We predicted DXME using the online prediction function freely available at http://design.rxnfinder.org/addictedchem/ prediction/. Our consensus model correctly predicted it as an addictive compound (Supplementary Figure S3). This finding, to some extent, indicates that our model can predict potential NPS.
Despite our best efforts to collect compound addiction datasets, the datasets collected in this study are still small. This may affect the model's ability to discriminate structurally unique novel addictive compounds. In the future, we will try new machine learning algorithms to model methods, including positive unlabelled learning and few shot learning, to further expand the generalisation ability and robustness of the model.

Exploration of the Chemical Space of Addictive Compounds
To show the chemical space of addictive compounds more intuitively, we created vectorisations of all the compounds used in the mathematical method and explored their spatial distribution patterns. The TMAP (tree-map) [29] visualisation result ( Figure 5) shows the two-dimensional distribution of molecules on the tree and reflects similarities calculated in the high-dimensional MHFP6 space. The compounds in schedules of controlled substances (red) and abused drugs listed by the Cayman Chemical Company (blue) were positive samples, and DrugBank (green) and Generally Recognised as Safe (GRAS) (purple) compounds were negative samples ( Figure 5). The visualisation shows that the positive and negative classes clustered ( Figure 5), and abused drugs from the Cayman Company list surrounded the compounds in schedules of controlled substances. On a local magnification scale, some addictive analogues surrounded the same controlled compound, such as the controlled substance methanone (inset, Figure 5). Its analogues undergo some simple changes, such as a single terminal bond becoming a double bond, a bond change, or the addition of F, Cl, or Br atoms. As shown in Figure 5, most areas of chemical space around controlled substances (red) remain unexplored, which means that there are still many possibilities for NPS to be designed. Thus, expanding the chemical space may accelerate potential NPS discovery research. The user can visit the interactive version to interactively experience the space of addictive compounds.

Exploration of the Chemical Space of Addictive Compounds
To show the chemical space of addictive compounds more intuitively, we created vectorisations of all the compounds used in the mathematical method and explored their spatial distribution patterns. The TMAP (tree-map) [29] visualisation result ( Figure 5) shows the two-dimensional distribution of molecules on the tree and reflects similarities calculated in the high-dimensional MHFP6 space. The compounds in schedules of controlled substances (red) and abused drugs listed by the Cayman Chemical Company (blue) were positive samples, and DrugBank (green) and Generally Recognised as Safe (GRAS) (purple) compounds were negative samples ( Figure 5). The visualisation shows that the positive and negative classes clustered ( Figure 5), and abused drugs from the Cayman Company list surrounded the compounds in schedules of controlled substances. On a local magnification scale, some addictive analogues surrounded the same controlled compound, such as the controlled substance methanone (inset, Figure 5). Its analogues undergo some simple changes, such as a single terminal bond becoming a double bond, a bond change, or the addition of F, Cl, or Br atoms. As shown in Figure 5, most areas of chemical space around controlled substances (red) remain unexplored, which means that there are still many possibilities for NPS to be designed. Thus, expanding the chemical space may accelerate potential NPS discovery research. The user can visit the interactive version to interactively experience the space of addictive compounds. Figure 5. TMAP visualisation of model building datasets in the MHFP6 chemical space. The MHFP6 molecular fingerprints were used to characterise the compounds, generate indexes for fast k-nearest neighbour retrieval using LSH Forest, and visualise data using Faerun. Each data point in the figure indicates a chemical compound. The links connecting the data points represent the edges of the Figure 5. TMAP visualisation of model building datasets in the MHFP6 chemical space. The MHFP6 molecular fingerprints were used to characterise the compounds, generate indexes for fast k-nearest neighbour retrieval using LSH Forest, and visualise data using Faerun. Each data point in the figure indicates a chemical compound. The links connecting the data points represent the edges of the minimum spanning tree. All chemical compounds are shown in colour according to the dataset source. GRAS compounds (purple), DEA controlled substances (red), Cayman Chemical Company abused drugs (blue), and DrugBank compounds (green). In the upper left corner is seen a partial enlargement of methanone and the surrounding compounds. Please use the interactive version at http://design.rxnfinder.org/addictedchem/chemical_space to visualise the molecular structures associated with each point.

Data Sources
A detailed summary of all datasets used in this study is shown in Supplementary Table S6.
To construct a machine learning model for the classification of addictive compounds, we collected compound data from various publicly available sources. We carefully screened 854 data sources from PubChem [30], which is the world's largest collection of freely accessible chemical information to acquire many addiction-related compounds. Among them, substances controlled under the laws and regulations of the United States are listed by the DEA. It provides a relatively comprehensive list and is easily accessible. Furthermore, the Cayman Chemical Company is known as a provider of reference standards to crime labs. Therefore, the positive samples were finally derived from the list of controlled substances by the DEA and a list of abused drugs from Cayman Chemical Company. The compounds in the first list were divided into five schedules according to risk, abuse, and addiction. Schedules I-V contain 622 compounds, with 259 in Schedule I; 102 in Schedule II; 138 in Schedule III; 114 in Schedule IV; and 9 in Schedule V. The second list contains 1880 compounds. Because many of its components are NPS, these compounds were included as positive samples. Before the analysis, controlled substances were removed from that list. The negative samples, which contain compounds considered non-addictive, were also derived from two lists. The first is a list of GRAS compounds from the US Foods and Drug Administration. The compound names were converted to a SMILES [31] format by using PubChemPy (https://pubchempy.readthedocs.io/en/latest/, accessed on 31 December 2021), and the compounds without SMILES were deleted, yielding 173 compounds. The second list was from DrugBank [11]. To avoid the problem of sample imbalance, the ratio of positive to negative samples was set to 1:1. Therefore, 2329 compounds were finally randomly selected from the second list after removing controlled substances, NPS, and GRAS compounds. The above datasets were used for model training and testing.
The external validation sets for model evaluation are derived from two main sources (Supplementary Table S6). The 2017-2020 NFLIS-Drug substance list [32] provided by DEA constituted positive sample data for the external validation set and contained 184 substances. The NFLIS-Drug contains the results of drug chemistry analyses and associated information from drug cases. Negative samples for the external validation set were 149 newly approved drugs retrieved from the Kyoto Encyclopaedia of Genes and Genomes (KEGG) [33] for the period 2017-2021.
In addition to modelling-related data, there are more datasets for data analysis and database construction (Supplementary Table S6). Basic physical and chemical attribute information and bioassay information for controlled substances were obtained from Pub-Chem [30], which was searched using a list of controlled substances obtained from the DEA. The literature on controlled substances was obtained from the literature section in PubChem, which was based on the individual compound interface. ADMET data were obtained from admetSAR [12], which were divided into experimental and predicted data categories on chemicals. The data on controlled substance targets were obtained from BindingDB [13]. Interactions between the controlled substances and pathways and GO annotations originate from CTD [14]. The metabolic pathway data of controlled substances were acquired from the small-molecule pathway database (SMPDB) [34], which is a great source of metabolic information for many psychoactive drugs.

Data Curation
For subsequent modelling and database construction, we processed all compounds obtained from sources to maintain the quality of the final data. We used the open source cheminformatics toolkit RDKit [35] (http://www.rdkit.org) to evaluate the SMILES of compounds for validity and to discard invalid molecules. We then normalised the effective compound structures using RDKit.

Scaffold Analysis of Controlled Substances
Murcko scaffolds [36] were used to analyse the molecular characteristics of the controlled substances and explore their diversity. The Murcko scaffolds calculation method involves retaining the ring structure of the compound and the linker between the ring structures and removing all the side chains on the ring structure. The Murcko scaffolds algorithm was implemented based on RDKit [35], which is a chemical information toolkit.

Target Analysis of Controlled Substances
Cytoscape 3.7.0 [37] was used to comprehensively evaluate the targets of controlled substances to analyse the relationship between controlled substances classified at different addiction levels and their targets. The controlled substance list was from the DEA. The control substance targets were from BindingDB [13]. The relationship between controlled substances and their effects from the perspective of compounds and targets was statistically analysed using Python.

Functional Enrichment Analysis
To evaluate the functional genes targeted by controlled substances, pathway enrichment and GO annotation analyses were performed. The MeSH ID [38] of the controlled substances was a standard to retrieve pathway information related to controlled substances from CTD [14]. That is because the database already contains the information annotated from the literature on the relationship between manually annotated compounds and genes or proteins. There are two main parts in pathway analysis: one part is that, according to the manual annotated compound and gene data list in the CTD database, the R package clusterProfiler [39] was employed to conduct KEGG pathway enrichment analysis using compound-associated genes, and we considered only the pathways with p values lower than 0.1. The other part is that because the first part can be enriched only from the first to sixth categories (1. Metabolism-6. Human Diseases) and does not contain key functional pathways (7. Drug Development), we used Python to retrieve the seventh category in the KEGG pathway for controlled substances. Because the seventh pathway designs the mechanism of addiction, we performed a functional summary of the seventh pathway data.
The clusterProfiler package [39] was used for GO annotation analysis of gene sets related to controlled substances.
Four machine learning methods, logistic regression (LR) [45], naive Bayes (NB) [46], random forest (RF) [47], and support vector machine (SVM) [48] were adopted, which were implemented based on scikit-learn [49]. The training set and test set were randomly generated (ratio: 8:2). In addition, a deep learning method model, ChemBERTa_zinc250k_v2_40k from ChemBERTa [50], was used for pre-training. It was then fine-tuned for 10 epochs using the generated dataset. The dataset was divided into a training set, a test set, and an internal validation set in a ratio of 8:1:1.
After creating 29 models based on seven molecular feature representation methods and five machine learning algorithms, two models that performed the best with the external validation set and the test set were selected to create a consensus model for NPS prediction. To ascertain highly rigorous NPS prediction, the integrated model predicted a positive result when either of the two models yielded a positive result and a negative result when both models predicted a negative result.

Performance Evaluation
In this study, test and external validation sets were used to evaluate the model performance based on the following parameters: true positive (TP), false negative (FN), true negative (TN), false positive (FP), sensitivity (SE), specificity (SP), balance accuracy (BA), and Matthew's correlation coefficient (MCC), as defined below (Equations (1)- (7)).
The F1-measure (F1) score, ROC curve, and area under ROC (AUC) were also used to evaluate model performance.

Chemical Space Exploration
To explore the chemical space of addictive compounds, we visualized the chemical space occupied by all molecules in the model building datasets using TMAP (tree-map) [29], which is capable of representing large datasets and arbitrary high dimensionality as a twodimensional tree. We considered the SMILES of all compounds as input and computed their MHFP6 molecular fingerprints [43]. Then, LSH Forest [51] was used to generate indexes for fast k-nearest neighbour [52] retrieval. Furthermore, the data points indexed in the LSH forest were used to generate an undirected weighted graph of c-approximate k-nearest neighbour (c-k-NNG). We applied Kruskal's algorithm to construct a minimum spanning tree on the weighted c-k-NNG. Finally, the Python package Faerun [29] was used to visualise the data and show the resulting MST. The interactive version results are freely available at http://design.rxnfinder.org/addictedchem/chemical_space.

Database and Webserver
The comprehensive addicted compounds portal comprises a front-end user interface and a back-end database. The data used in the construction of the database mainly include those of controlled substances from DEA, basic physical and chemical attributes, bioassays for controlled substances from PubChem and literature on controlled substances from PubChem [30] and PubMed, experimental and predicted ADMET data from ad-metSAR [12], enriched associations between controlled substances and pathways from CTD [14], enriched associations between controlled substances and GO annotations from CTD [14], and metabolic pathway data of controlled substances from the SMPDB [34]. All of the above information is presented to the user through the search function module, which has five kinds of search methods. AddictedChem provides a browse function for all listed addictive compounds in our database. The user can either download the list of chemicals or follow the link to the compound details page. For the prediction of potential addictive compounds, we implemented an online prediction tool that uses a consensus scoring model formed with the RF::MACCS and NB::SECFP models. The entire project was implemented using Ubuntu (18.04.2), and the back-end framework used Django (2.2) based on Python (3.7.8). The data for the entire project are stored in MySQL (8.0.16). The ECharts (4.2.0) software was used as a graphical visualisation framework. The Bootstrap table (1.15.5) was used for static and dynamic data table display and relied on Bootstrap (3.3.7) and jQuery (2.1.1). Using the Chrome web browser is recommended to access and browse the database. AddictedChem is freely available to the research community at http://design.rxnfinder.org/addictedchem/. The users do not have to register or login to access the features of the databases. User-friendly online tutorials are provided to facilitate user hands-on operations at http://design.rxnfinder.org/addictedchem/help/. Finally, AddictedChem is an online platform outside the box, and the user can download all the data from http://design.rxnfinder.org/addictedchem/downloads/.

Conclusions
We constructed AddictedChem, a comprehensive database of controlled substances; analysed the existing data on molecular scaffolds, targets, gene-related pathways, and GO annotations; and developed a consistence predictive model based on multiple molecular description approaches and machine learning algorithms to assist NPS identification and analysis. Concurrently, we built a chemical space of addictive compounds, which revealed the chemical diversity of NPS. As the first step in the data-driven identification of potentially addictive molecules, this research can be applied to a wide range of scenarios, such as providing a preliminary assessment of drug addiction in the early stages of new drug development, identifying new addictive illegal drugs circulating on the market, and discovering potentially addictive components in food to improve food safety. In the next step, we will combine the prediction model with the metabolite identification model and untargeted metabolomic data to establish a more comprehensive analysis to better support the identification and detection of new illegal drugs. With these potential applications, we expect that AddictedChem will become an indispensable tool for use in future NPS studies.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/molecules27123931/s1, Table S1: Targets of Schedule I drugs.  Figure S1: Top five scaffolds of controlled substances among different-schedule drugs. Figure S2: Gene Ontology (GO) functional annotation of interaction genes of controlled substances. Figure   Institutional Review Board Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.