Cocrystal Prediction Using Machine Learning Models and Descriptors

: Cocrystals are of much interest in industrial application as well as academic research, and screening of suitable coformers for active pharmaceutical ingredients is the most crucial and challeng-ing step in cocrystal development. Recently, machine learning techniques are attracting researchers in many ﬁelds including pharmaceutical research such as quantitative structure-activity/property relationship. In this paper, we develop machine learning models to predict cocrystal formation. We extract descriptor values from simpliﬁed molecular-input line-entry system (SMILES) of compounds and compare the machine learning models by experiments with our collected data of 1476 instances. As a result, we found that artiﬁcial neural network shows great potential as it has the best accuracy, sensitivity, and F1 score. We also found that the model achieved comparable performance with about half of the descriptors chosen by feature selection algorithms. We believe that this will contribute to faster and more accurate cocrystal development.


Introduction
Active pharmaceutical ingredients (APIs) are commonly formulated and delivered to patients in the solid dosage forms (tablets, capsules, powders) for reasons of economy, stability, and convenience of intake [1]. One of the major problems faced during the formulation of drug is its low bioavailability which is mainly reliant on the solubility and permeability of API [2,3], and one of the approaches to enhance the physicochemical and pharmacological properties of API without modifying its intrinsic chemical structure is to develop novel solid forms such as cocrystals [4][5][6][7]. There are extensive reports on cocrystals for the purpose of improving the pharmaceutical properties including dissolution, permeability, bioavailability, stability, photostability, hygroscopicity, and compressibility [8,9].
Cocrystals are much of interest in industrial application as well as academic research because they offer various opportunities for intellectual property rights in respect of the development of new solid forms [10]. Furthermore, the latest Food and Drug Administration (FDA) guidance on pharmaceutical cocrystals, which recognizes cocrystals as drug substances, provides an excellent opportunity for the pharmaceutical industry to develop commercial products of cocrystals [11,12]. According to FDA, cocrystals are "Crystalline materials composed of two or more molecules within the same crystal lattice" [13]. Pharmaceutical cocrystals, a subclass of cocrystals, are stoichiometric molecular complexes composed of APIs and pharmaceutically acceptable coformers held together by non-covalent interactions such as hydrogen bonding within the same crystal lattice [14]. Coformers for pharmaceutical cocrystallization should be from the FDA's list of Everything Added to Food in the United States (EAFUS) or from the Generally Recognized as Safe list (GRAS), as they should have no adverse or pharmacological toxic effects [1]. The list of acceptable co-formers, in principle, is likely to at least extend into the hundreds, which means that screening of suitable coformers for an API is the most crucial and challenging step in cocrystal development [15]. Since experimental determination of cocrystals is time-consuming, costly, and labor-intensive, it is valuable to develop complementary tools that can reduce the list of coformers by predicting which coformers are likely to form cocrystals [16].
Various knowledge-based and computational approaches have been used in the literature to predict cocrystal formation. Supramolecular synthesis introduced by Desiraju is a well-known approach to rationalize the possibility of cocrystal formation [17][18][19]. A common strategy in this method is to first identify the crystal structure of the target molecule and investigate coformers with a desired functional group which can form intermolecular interactions (mainly hydrogen bonding) between the target molecule and coformers [15]. Knowledge of synthons allows the selection of potential coformers and predicts the interaction outcomes, but there is no guarantee that cocrystals with predicted structures would form. Statistical analysis of cocrystal data from the Cambridge Structural Database (CSD) where more than one million crystal structures of small molecules are available, allows researchers to apply virtual screening techniques to find suitable cocrystalforming pairs [20]. Galek et al. introduced hydrogen-bond propensity as a predictive tool and determined the likelihood of co-crystal formation [15,21]. Fábián analyzed the possibility of cocrystal formation by correlating the different descriptors such as polarity and molecular shape [22]. Cocrystal design can also be based on computational approaches, including the use of the ∆pK value [23,24], lattice energy calculation [25][26][27][28], molecular electrostatic potential surfaces (MEPS) calculation [29][30][31][32], Hansen solubility parameters calculation [33,34] and Conductor like screening Model for Real solvents (COSMO-RS) based enthalpy of mixing calculation [35][36][37][38].
In recent years, machine-learning (ML) has emerged as promising tool for data-driven predictions in pharmaceutical research, such as quantitative structure-activity/property relationships (QSAR/QSPR), drug-drug interactions, drug repurposing and pharmacogenomics [39]. In the area of pharmaceutical cocrystal research, Rama Krishna et al. applied artificial neural network to predict three solid-state properties of cocrystals, including melting temperature, lattice energy, and crystal density [40]. Przybylek et al. developed cocrystal screening models based on simple classification regression and Multivariate Adaptive Regression Splines (MARSplines) algorithm using molecular descriptors for phenolic acid coformers and dicarboxylic acid coformers, respectively [41]. Wicker et al. created a predictive model, that can classify a pair of coformers as a possible cocrystal or not, using a support vector machine (SVM) and simple descriptors of coformer molecules to guide the selection of coformers in the discovery of new cocrystals [16]. Devogelaer and co-workers introduced a comprehensive approach to study cocrystallization using network science and linkage prediction algorithms and constructed a data-driven co-crystal prediction tool with co-crystal data extracted from the CSD [42]. Wang et al. also used a data set with co-crystal data available in the CSD and ultimately developed a machine learning model using different model types and molecular fingerprints that can be used to select appropriate coformers for a target molecule [43]. The above existing studies have shown successful results, but they have a common limitation that they only compared model performance (e.g., accuracy) without investigating features (i.e., descriptors) importance.
In this work, we develop a model to predict co-crystal formation of API molecules. We use Mordred [44], one of widely-used descriptor calculators, to extract descriptor values from simplified molecular-input line-entry system (SMILES) strings of API and coformers compounds. There are several other tools or molecular descriptor-calculators used in cheminformatics such as PaDEL [45], PyDPI [46], Rcpi [47], Dragon [48], BlueDesc (http: //www.ra.cs.uni-tuebingen.de/software/bluedesc) and cinfony [49]; PaDEL descriptorcalculator is the most well-known tool and provides 1875 descriptors, and cinfony is a collection or a wrapper of other libraries such as Open Babel [50], RDKit (http://www. rdkit.org), and Chemistry Development Kit (CDK) [51]. We chose to use Mordred and used the descriptor values as features and compared different machine learning models through experimental results using our collected data. Our contributions can be summarized as follows. First, we not only extract descriptor values of compound pairs, but also investigate which descriptors are more important, and show that we can achieve good performance even if we use only a small subset of the descriptors. Second, we compare machine learning models through experiments and find that artificial neural networks (ANN) achieve the best performance. Third, we make our dataset available for free through the online website (http://ant.sch.ac.kr/) so that many other researchers can use the dataset as a benchmark. We believe that this study will advance the field of cocrystal formation prediction, and our dataset will help other researchers to easily develop better models.

Materials and Methods
In this paper, we essentially solve a binary classification problem; we develop a model for predicting a label (e.g., 'fail' and 'success') for a given pair of compounds. The class label 'success' means that the corresponding pair of compounds would successfully cocrystallize, while the label 'fail' means that it would not cocrystallize. As depicted in Figure 1, we first obtain attributes (i.e., features) of the compounds. Then, we select some promising features that are expected to contribute more to the final performance (e.g., accuracy). The selected features are fed into machine learning models that learn patterns behind the compound pairs, so that the models predict labels (e.g., 'success', 'fail') of the given pairs; in other words, the model takes the selected features obtained from compound pairs as input and generates labels as output.

Materials
Since we basically solve this problem using a data-driven approach, we first had to prepare a data set. The compounds pairs of the data set were mainly obtained from the work of Wicker et al. [16], Przybylek et al. [41,52] and Grecu et al. [32] and supplemented with an extensive literature review on cocrystal screening of different APIs. Duplicate records were removed from the data set, resulting in a total of 1476 molecular compound pairs. Of these 1476 pairs, 753 were positive pairs (experimentally verified cocrystals) and the remaining 723 were negative pairs (unsuccessful formation of cocrystals). Labels are converted to a numerical form (e.g., 'success' = 1, and 'fail' = 0). The raw data is a D × 3 matrix and samples are shown in Figure 2. We primarily develop models that predict which pair of API (i.e., 'compound 1') and coformer (i.e., 'compound 2') will successfully result into a new cocrystal formation (i.e., 'label' = 1) and which will not result in a new solid form (i.e., 'label' = 0), from the set of chemical experiments based on the two columns (e.g., 'compound 1' and 'compound 2') as shown in Figure 2. Prediction models use a combination of features known as molecular descriptors from each pair for the classification task. Molecular descriptors often used to develop quantitative structure-property relationships (QSPR) models, and Mordred is one of the attractive tools to extract the molecular descriptors [44]. We have chosen Mordred because of its advantages: (1) it provides a comparable number of PaDEL descriptors while fixing some bugs within the PaDEL descriptor calculator, and (2) it is easy to install and use since it is provided as Python 2 & 3 libraries. As shown on the left in Figure 1, the Mordred tool used to extract feature vectors from the raw data. Canonical SMILES strings for each compound were retrieved from PubChem (https://pubchem.ncbi.nlm.nih.gov/) as shown in Table 1 , and were used as input to the Mordred tool to generate molecular descriptor values. We found that some descriptor values are missing due to implementation issues of the tool; for example, a part of autocorrelation descriptor of a small molecule is known to be missing even if there is no bug [44]. Since the missing values did not occur at random, we simply filter them out instead of using any imputation algorithms. We obtain a F ALL dimensional real-numbered feature vector from a single pair of compounds. After adding a label column, we have D feature vectors of F ALL + 1 dimension. Table 1. Sample of canonical SMILES strings by PubChem for each compound in a sample of raw data.

Service Compound 1 Compound 2 Label
PubChem

Methods
There have been studies that trained machine learning models using molecular descriptors as features [16,53], however such studies only fed the models with the descriptors without performing a crucial analysis of the molecular descriptors. In this work, we apply feature selection algorithms to measure importance of descriptors and use a found set of promising ones. The feature selection algorithms are supposed to select F S features among the F ALL features as depicted in the middle of Figure 1. We tried two feature selection algorithms: Recursive Feature Elimination (RFE) algorithm and K-best algorithm. The RFE algorithm is a wrapper-based algorithm that treats the feature selection as a search problem. It repeatedly removes unpromising features until desired number of features remains. We use an artificial neural network (ANN) as an estimator of the RFE algorithm. The K-best algorithm is a filter-based algorithm that selects potential features according to a particular function σ( f , c) where f and c are a feature and a label, respectively. We use a ANOVA F-value as the function σ.
Before passing the D × F S + 1 real-numbered matrix to machine learning models, we standardize the feature values. This process is, of course, performed using only training data; the mean µ and standard deviation σ are computed only with the training data. We used scikit-learn (https://scikit-learn.org/stable/) to implement the standardization, and found that it is better than normalization (i.e., 0-1 scaling) for our data. Given the standardized matrix X ∈ R D×F S , the machine learning models are supposed to give labels y ∈ {0, 1} D . We have used several machine learning models such as artificial neural network (ANN), support vector machine (SVM) [54], random forest (RF) [55], and extreme gradient boost (XGB) [56]. The ANN is known to be effective in many research fields such as image analysis, natural language processing, and speech recognition; It is a deep learning model if it has a deep structure (i.e., multiple hidden layers). The SVM finds a decision boundary based on boundary instances (i.e., support vectors) and is known to be successful in many classification tasks. The RF and XGB are common ensemble approaches, but RF uses the bagging strategy while the XGB employs boosting strategy. We compared these widely-used models with experimental results.
The total, the dataset D total contains 1476 instances, of which 723 were unsuccessful, and 753 were successful, as shown in Table 2. Since the dataset is balanced, we performed 10-fold cross validation while maintaining the balanced ratio; for each cross validation, we have about 1,329 instances for training and 147 instances for testing. After preprocessing the raw data, we obtained that F ALL = 2207. Throughout all experimental results, we use averaged accuracy, precision, recall, and F1 scores.

Feature Selection Algorithms
We compared the two feature selection algorithms (e.g., RFE algorithm and K-best algorithm) by averaged accuracy with varying number of features F S . Figure 3 shows the results with F S ranging from 298 to 1103, where the classifier used here is artificial neural network (ANN); note that we use only ANN model because we focus on experimental results of feature selection algorithms, but not the models. With greater F S , the K-best algorithm gave generally better accuracies than the RFE algorithm. Therefore, we might say that if we want efficiency (e.g., less parameters), then the RFE algorithm will be preferable; on the other hand, the K-best algorithm is preferable if we want effectiveness (e.g., accuracy). As a compromise, using the K-best algorithm with F S = 900 might be a reasonable choice because its dimension is only a half of the total (e.g., 2207) and its accuracy is comparable.

Model Comparison
We prepared three independent datasets D 1 total , D 2 total , and D 3 total by shuffling the instances of the dataset D total ; so |D total | = |D 1 total | = |D 2 total | = |D 3 total | = 1476. We conducted 10-fold cross validation for each of these independent datasets, and computed averaged accuracy, precision, recall, and F1 scores. The optimal parameter settings of the machine learning models are found by a grid searching using a small portion (e.g., 10%) of the training set as a validation set. The parameter settings are summarized in Table 3; the parameter settings are obtained during the experiments. The ANN has a shallow structure (i.e., one hidden layer of 25 nodes) because we found that it gives better performance than other complex structure; the reason might be the small size of dataset that high model complexity will cause overfitting problem.  Table 4 summarizes the accuracy of machine learning models; note that we focus on the experimental comparison between the models here, but not the feature selection algorithms. The accuracy values are computed by averaging results with three independent datasets (e.g., D 1 total , D 2 total , and D 3 total ). The ANN gives the best accuracy (e.g., 0.833) among the models. The XGB is comparable to the ANN, and it is the best when F S = 300. As a model works faster when the feature dimension is small, the XGB might be preferable if we want better efficiency (i.e., fast prediction) without losing much accuracy. One might argue that the model is not useful if its sensitivity is not high enough. Tables 5 and 6 are per-label precision and recall. The ANN gives the best recall of 'success' label (e.g., 0.800) without losing much precision (e.g., 0.838). In terms of the precision, the XGB seems the best as its precision of 'success' label is 0.841, but we might say that the ANN would be chosen if we need to find as many promising candidates of compound pairs as possible. Table 7 shows per-label F1 scores, and the ANN is turned out to be the best amongst the models. This result is reasonable as the ANN is known to be effective in finding underlying patterns and gives significant performance improvement in many other classification tasks (e.g., malware detection [59], chatbot intent prediction [60]). We believe that the performance will be further improved if we collect more qualified data. Table 5. Per-label averaged precision of different machine learning models, where F ALL is the number of all features, F S means the number of features selected using the K-best algorithm, and 'Success' and 'Fail' mean label 1 and 0, respectively.

Model
All  Table 6. Per-label averaged recall of different machine learning models, where F ALL is the number of all features, F S means the number of features selected using the K-best algorithm, and 'Success' and 'Fail' mean label 1 and 0, respectively.  Table 7. Per-label averaged F1 score of different machine learning models, where F ALL is the number of all features, F S means the number of features selected using the K-best algorithm, and 'Success' and 'Fail' mean label 1 and 0, respectively.

Discussion
Although the models performed best when we use all features, the feature selection algorithms showed their potential using only half of the features (e.g., F S = 1103 or 900) gave comparable results. One might want to see what features were valuable than others. Table 8 shows lists of best and worst features obtained by K-best algorithm with F S = 900, where the scores are ANOVA F-values; the features with larger scores turn out to be more vital than others. Note that the features are grouped in terms of Module; for example, the best feature 'Mp' came from 'Constitutional' module of Mordred. Interestingly, many of the best and worst features commonly came together from the 'Autocorrelation' module, which computes the Moreau-Broto autocorrelation of the topological structure. Most of the best features of the 'Autocorrelation' module are Geary coefficients (e.g., 'GATS' series), implying that the spatial correlation is particularly essential in predict cocrystal formation. Especially, Geary coefficients weighted by intrinsic state (e.g., GATS4s, GATS6s), valence electrons (e.g., GATS6dv), or atomic number (e.g., GATS3Z, GATS6Z, GATS7Z, GATS8Z) turned out to be extremely important. On the other hand, most of the worst features of the 'Autocorrelation' module, are the Moran coefficient (e.g., MATS series). The Moran coefficient focuses on deviations from the mean whereas, the Geary coefficient focuses on the deviations of individual observation area [61], so we might say that the deviations of each observation area are more meaningful information for cocrystal prediction. It is consistent with a recent study by Shiquan Sun et al. [62] that revealed that the Moran coefficient is not very competent in detecting spatial patterns other than simple autocorrelation due to its asymptotic normality for p-value computation.  [63]. This implies that the electrostatic interaction of atoms and their topological environment (connections) within a molecule has a significant impact on cocrystallization. It is in line with the fact that the electrostatic interaction between atoms has been treated importantly in pharmaceutics [64,65]. Meanwhile, many worst features came from the 'BCUT' module that generates burden matrix weighted by ionization potential (e.g., BCUTi-1l), pauling EN (e.g., BCUTpe-1h), or sigma electrons (e.g., BCUTd-1h, BCUTd-1l). Note that this does not mean that these worst descriptor values are harmful to the outcome, but they only have a smaller contribution than the others to the performance (e.g., accuracy). Table 9 describes a comparison our work with recent studies. The best accuracy of this study is definitely highest among them; although the three studies used different datasets, we might say that we proved the potential of the ANN model to predict cocrystal formation by experimental results. It should be noted that the feature sources are different between these studies. That is, Jerome G. P. Wicker et al. [16] used the molecular descriptors (i.e., features) as the model inputs and Jan-Joris Devogelaer et al. [66] used fingerprint vectors and molecular graphs whereas our work uses the molecular descriptors (i.e., features). We employed feature selection algorithms to find some valuable features and explained how they are related to results of previous studies.

Conclusions
To address the co-crystal prediction problem, we extracted molecular descriptor values using the Mordred tool and performed preprocessing. We performed experiments on molecular descriptor values extracted from our collected data, and found that the ANN model was the best among the various known machine learning models. In particular, ANN gave the best recall 0.800 of the 'success' label and the best F1 score of 0.817. This implies that the ANN finds about 80% of co-crystal formation without losing too much precision. We also found that the model achieved comparable performance with only half of the descriptor values (i.e., selected molecular descriptor values), and explained that these selected molecular descriptors are related to the results of some previous studies; for example, the module'EState' refers to the electrostatic interaction between atoms, which is known to be important in pharmaceutics. We believe that this study will be helpful in the process of co-crystals development. In the future, we will examine to collect more data as the size of adequate data is crucial to develop better machine learning models.

Data Availability Statement:
The data presented in this study are openly available in the website at http://ant.sch.ac.kr/.

Conflicts of Interest:
The authors declare no conflict of interest.