Incorporating Domain Knowledge and Structure-Based Descriptors for Machine Learning: A Case Study of Pd-Catalyzed Sonogashira Reactions

Chan, Kalok; Ta, Long Thanh; Huang, Yong; Su, Haibin; Lin, Zhenyang

doi:10.3390/molecules28124730


by Kalok Chan, Long Thanh Ta, Yong Huang *, Haibin Su * and Zhenyang Lin *

Department of Chemistry, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong SAR, China

* Authors to whom correspondence should be addressed.

Molecules 2023, 28(12), 4730; https://doi.org/10.3390/molecules28124730

Submission received: 18 May 2023 / Revised: 10 June 2023 / Accepted: 10 June 2023 / Published: 13 June 2023

(This article belongs to the Special Issue Deep Learning in Molecular Science and Technology)


Abstract:

Machine learning has revolutionized information processing for large datasets across various fields. However, its limited interpretability poses a significant challenge when applied to chemistry. In this study, we developed a set of simple molecular representations to capture the structural information of ligands in palladium-catalyzed Sonogashira coupling reactions of aryl bromides. Drawing inspiration from human understanding of catalytic cycles, we used a graph neural network to extract structural details of the phosphine ligand, a major contributor to the overall activation energy. We combined these simple molecular representations with an electronic descriptor of aryl bromide as inputs for a fully connected neural network unit. The results allowed us to predict rate constants and gain mechanistic insights into the rate-limiting oxidative addition process using a relatively small dataset. This study highlights the importance of incorporating domain knowledge in machine learning and presents an alternative approach to data analysis.

Keywords:

activation energy; homogeneous catalysis; ligand effects; machine learning; phosphine ligands

1. Introduction

Machine learning has become an increasingly popular tool for data analysis in various fields [1]. Its strong pattern recognition and predictive ability have prompted chemists to investigate its applications in several subfields of chemistry [2,3]. In physical chemistry, the prediction of pKa from Lewis structures has been achieved with the aid of a large database [4,5]. In computational chemistry, machine learning force fields have been developed to combine the accuracy of ab initio methods with the efficiency of classical force fields [6,7,8,9,10,11]. In retrosynthetic chemistry, deep learning, a subfield of machine learning, has been applied to achieve automated retrosynthesis planning based on a large pool of published reactions [12,13]. In materials chemistry, machine learning has aided in predicting material properties, such as those of OLEDs [14] and crystals [15].

With the evolution of high-throughput experimentation and easily available online databases, chemists now have access to large, high-quality datasets for predicting reaction yield and selectivity and for optimizing reaction conditions. Jensen, Barzilay, and co-workers [16] trained a graph convolutional neural network (based on patent literature) for product prediction with over 400 k data points. Applying a highly advanced deep learning architecture based on BERT, a recent robust language model pretrained on over 10 k reactions, Schwaller and co-workers [17] used reaction SMILES as inputs and predicted the yields of Suzuki–Miyaura reactions with good performance. Schwaller and co-workers [18] also employed an unsupervised transformer encoder model in conjunction with SMILES to generate products-to-reactants atom-mapping for reactions such as Diels–Alder reactions and epoxidations using 2.8 M reactions.

Studies involving machine learning are often based on a large amount of data, as mentioned above, including experimental and/or DFT-calculated data, which are difficult to obtain and require considerable expenditure and effort for data collection and processing. In reality, data acquired from chemical research, especially in the area of organic reaction studies, are often very limited in size. Thus, when applying machine learning to study chemical reactions with a limited set of data, chemists have started to employ chemically meaningful descriptors in order to derive accurate predictions. For instance, Doyle, Dreher, and co-workers [19] managed to predict the yields of Ni-catalyzed Suzuki–Miyaura cross-coupling of benzaldehyde-derived acetals with aryl boroxines with high accuracy by using generated and empirical spatial descriptors, such as buried volume and Tolman cone angle, with around 4000 data points. In another study, Doyle and co-workers [20] used computed atomic and empirical molecular properties (e.g., electrostatic charge, NMR shift) as well as binary categorical identifiers to predict reaction yields for the deoxyfluorination of a broad range of alcohols with sulfonyl fluorides using random forest based on 640 reactions. Yada, Sato, and co-workers [21] predicted the reaction yields for a tungsten-catalyzed epoxidation of alkenes with hydrogen peroxide by coupling DFT-calculated descriptors with the least absolute shrinkage and selection operator (LASSO) method, a regression analysis method commonly used in machine learning, with 3800 reactions. Denmark and co-workers [22] used feed-forward neural networks coupled with electronic descriptors and novel 3D steric descriptors generated from DFT calculations to predict the selectivity of phosphoric acid-catalyzed thiol addition reactions using around 1000 data points.

Encouraged by the above-mentioned successful applications of machine learning with limited data sizes in chemistry, we attempt to go one step further and study chemical reactions of an even smaller dataset by simply utilizing structural descriptors and traditional electronic descriptors with a specially defined machine learning architecture without the need for DFT calculations. It is worth noting that useful descriptors have been established to study Diels–Alder reactions. For example, DFT-calculated descriptors, such as global electrophilicity power difference (Δω) [23], and empirical descriptors, such as Taft’s polar and steric substituent constants [24], have been employed to describe their activation energies.

Here, we choose to study a palladium-catalyzed Sonogashira cross-coupling reaction using trialkyl phosphine ligands, reported in 2008 by Plenio and co-workers [25] and shown in Scheme 1. Using high-throughput experiments, the authors studied and determined the rate constants of 340 reactions consisting of 17 alkyl phosphine ligands and 20 different meta- and para-substituted aryl bromide substrates. The 340 data points (rate constants) formed a relatively small dataset, which we used here to showcase our deep learning study.

We generated descriptors based only on the topological structures of the phosphine ligands. For the electronic bias of the aryl bromide substrates, we chose a widely used and chemically meaningful descriptor—the Hammett constant—to capture the chemical reactivity of the aryl bromide reagents involved. Using the rather small dataset, we demonstrate that the new machine learning architecture we built performs very well and is able to derive well-known/accepted knowledge regarding the rate-limiting steps of Pd-catalyzed Sonogashira cross-coupling reactions, suggesting the usefulness of the protocol presented in this work.

2. Results

2.1. Performance of Our Ligand Descriptors

The averaged predicted rate constants of the 10 selected optimal sets of trainable parameters for the training, validation, and testing datasets were plotted against the experimental values in Figure 1. To directly compare with the experimental values, we chose to plot the rate constant instead of the corresponding converted free energies. We achieved a high R² of 0.94 on the training dataset, 0.87 on the validation dataset, and 0.84 on the testing dataset. The good performance on the testing dataset demonstrated our proposed machine learning model’s high predictive ability.

2.2. Comparison to Other Ligand Descriptors

To demonstrate the advantage of using the tree-like structural descriptors (for the phosphine ligands) over others, we examined four other types of common descriptors, two of which are based on DFT-calculated structures: buried volumes and cone angles. The buried volumes and cone angles for the 17 phosphines were calculated using the scheme reported in Kraken, a phosphine database with comprehensive physicochemical descriptor information [26]. The other two types of descriptors are based on one-hot encodings and multiple molecular fingerprints [27]. We conducted machine learning studies using these 4 types of descriptors with two machine learning methods: The first method is to modify the architecture we discussed in Section 4.2.1 by replacing the GNN layer and its output (the vertex values) with new descriptors. This modification was necessary because all four descriptors cannot be adequately represented graphically and thus are not directly compatible with GNN. The second method is to employ only the final layer of the architecture we built, i.e., a restricted linear regression. More specifically, we fit the linear regression model with positive coefficients and zero intercept against the rate constant k. This restriction was applied so that a fair comparison can be made between the results using different descriptors. Since we aim to examine different ligand descriptors, the aryl bromide Hammett constant descriptor remains unchanged. Table 1 summarizes the learning performances using these different types of descriptors and compares them against that using the tree-like structural descriptor.
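The restricted linear regression described above (non-negative coefficients, zero intercept) can be illustrated with a minimal sketch. The solver here is a plain projected gradient descent written for illustration; it is not claimed to be the solver used in this work, and the function name, step size, iteration count, and toy data are all assumptions.

```python
def fit_nonnegative_linear(X, y, step=0.1, n_iter=2000):
    """Least squares with w >= 0 and no intercept (projected gradient descent).

    X is a list of descriptor rows, y the target values. The step size and
    iteration count are hand-picked for the toy data below (assumption).
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(n_iter):
        # Residuals of the intercept-free linear model X w - y.
        residual = [sum(wj * xj for wj, xj in zip(w, row)) - yi
                    for row, yi in zip(X, y)]
        # Gradient of 0.5 * ||X w - y||^2 with respect to w.
        grad = [sum(r * row[j] for r, row in zip(residual, X))
                for j in range(n_features)]
        # Gradient step, then projection onto the constraint w >= 0.
        w = [max(0.0, wj - step * gj) for wj, gj in zip(w, grad)]
    return w


# Toy data (hypothetical): the unconstrained fit would need a negative second
# coefficient, so the constraint pins it at zero.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]
w = fit_nonnegative_linear(X, y)
```

With real descriptors the step size would need to be adapted to the scale of the features, or a dedicated non-negative least-squares routine could be used instead.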

The performance comparison presented in Table 1 clearly indicates that, within the rather small dataset studied in this work, the tree-like structural descriptors outperformed the other types of descriptors by a significant margin. We also attempted machine learning by combining both buried volume and cone angle as input descriptors. The performance remained inferior to our tree-like structural descriptors. Both buried volume and cone angle, although excellent descriptors reflecting ligand steric effect, did not perform well with the dataset studied in this work. Although one-hot encoding and multiple fingerprint features (MFF) performed better than buried volume and cone angle, they were much worse than the tree-like structural descriptors.

2.3. Cross-Validation

To evaluate the performance of our model, we conducted k-fold cross-validation as follows: We partitioned the dataset into 5 equal-sized subsets (S₁, S₂, S₃, S₄, S₅), where 4 subsets (S₁–S₄) were utilized as training or validation sets and one subset (S₅) was used exclusively as the testing set and was not involved in any training processes. Therefore, the testing set remained constant across all four cross-validation sets. To create the validation set, we selected one subset from S₁–S₄. This resulted in a total of 4 distinct cross-validation sets, namely, sets 1, 2, 3, and 4, which utilized S₁, S₂, S₃, and S₄ as their validation set, respectively. It is worth noting that the performance of set 1 is identical to the performance displayed in Figure 1. Table 2 shows that the performance on the validation and testing sets is greater than 0.75 and 0.8, respectively, for all cross-validation sets. Additionally, the average performance on the validation and testing sets is 0.84 and 0.87, respectively, which confirms that our models possess a high level of generalizability. Based on the results of cross-validation, we believe that overfitting is not a significant problem for our model. In our model, we used Leaky ReLU activation together with L1 and L2 regularizers, which helped to minimize overfitting.
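The partitioning scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the random seed, and the use of a simple shuffle are hypothetical choices, not details taken from this work.

```python
import random


def make_cv_sets(n_samples, n_folds=5, seed=0):
    """Build cross-validation sets with a fixed testing subset.

    The last subset is held out as the testing set for every cross-validation
    set; each of the remaining subsets serves once as the validation set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slicing gives (nearly) equal-sized subsets S1..S5.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    test = folds[-1]
    cv_sets = []
    for v in range(n_folds - 1):
        train = [i for f, fold in enumerate(folds[:-1]) if f != v for i in fold]
        cv_sets.append({"train": train, "val": folds[v], "test": test})
    return cv_sets


cv_sets = make_cv_sets(340)  # 340 rate constants, as in the dataset used here
```

For 340 data points this yields four cross-validation sets, each with 204 training, 68 validation, and 68 testing points.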

3. Discussion

Our results indicate that the tree-like structural descriptors exhibited superior performance compared to other existing descriptors. Here, we demonstrate how our protocol can provide mechanistic details that are consistent with the current understanding of cross-coupling reactions. As described in Section 4.2.1, we define ΔG^‡ as the sum of ΔG^‡(L) and ΔG^‡(L-S) to separate the purely ligand-dependent contribution from the combined contribution of ligand and substrate. This machine learning architecture allows us to obtain the fitted values ΔG^‡(L) and ΔG^‡(L-S) separately and analyze their corresponding trends.

3.1. ΔG^‡(L)

As shown in Figure 2, the fitted ΔG^‡(L) values were sorted and plotted against Ligands 1–17. A boxplot method was used to display the results predicted by the 10 optimal sets of trainable parameters selected with the criteria mentioned in Section 4.2.2.

Figure 2 reveals that the least sterically demanding ligand, L1 (tri-n-butylphosphine, PⁿBu₃), has the highest ΔG^‡(L), while the bulkiest ligand, L16 (di(1-adamantyl)benzylphosphine, (1-Ad)₂PBn), has the lowest ΔG^‡(L). The results are consistent with our general understanding that bulky phosphines promote the OA process.

According to a kinetic study by Barrios-Landeros and Hartwig [28], in the palladium-catalyzed Sonogashira coupling of aryl bromides using an exceptionally bulky ligand, Q-phos, the reaction rate depends only on ligand concentration. Therefore, they concluded that the rate-determining step is ligand dissociation. In the dataset we used, the rate-determining step is oxidative addition. The obtained ΔG^‡(L) values offer a reconciliation of these two observations. Bulkiness of the ligand negatively correlates with ΔG^‡(L). Therefore, for an exceptionally bulky ligand such as Q-phos, the oxidative addition transition state likely lies even lower in energy than the ligand dissociation transition state.

3.2. ΔG^‡(L-S)

Unlike ΔG^‡(L), which depends only on ligand, ΔG^‡(L-S) is affected by both ligand and aryl bromide substrate. Thus, we plotted ΔG^‡(L-S) against ligand for each substrate and against substrate for each ligand (see Supporting Information for the details). Examination of these plots revealed that all the plots of ΔG^‡(L-S) against ligands showed a similar trend, as did all the plots of ΔG^‡(L-S) against substrates.

To understand and discuss how ligands affect ΔG^‡(L-S), we plotted ΔG^‡(L-S) against ligands for the 1-bromo-4-nitrobenzene substrate, which exhibited the widest range of fluctuation, as an example (Figure 3). The figure shows that the effect of ligand on ΔG^‡(L-S) differed from that on ΔG^‡(L): ΔG^‡(L) showed a strong correlation with the steric property of ligands, while ΔG^‡(L-S) did not. Generally, bulky ligands lead to higher ΔG^‡(L-S) because of increasing steric crowdedness at the Pd center in the transition state. Therefore, the change of ΔG^‡(L-S) with respect to ligands reflected the combined effect of the phosphine ligand, including both electronic and steric contributions. In the case of alkylphosphines, the electronic difference among ligands was relatively small and the effect of ligand on ΔG^‡(L-S) was mostly steric.

Next, we present the plot of ΔG^‡(L-S) against aryl bromides for Ligand 7 (L7) in Figure 4. From this plot, the Hammett constant of the substituent is negatively correlated with ΔG^‡(L-S). Aryl bromides with a more electron-withdrawing substituent (i.e., a more positive Hammett constant) had lower ΔG^‡(L-S), while those with a more electron-donating substituent (i.e., a more negative Hammett constant) had higher ΔG^‡(L-S). This trend aligned with the general understanding of oxidative addition [29].

3.3. ΔG^‡

The previous two subsections discussed the correlations of ΔG^‡(L) and ΔG^‡(L-S) with L (ligand) and S (substrate). To understand the general trend of these reactions, we have to examine the total activation free energy, ΔG^‡. Thus, we plotted ΔG^‡ against ligand for 1-bromo-4-nitrobenzene in Figure 5. The plot reveals that bulkier phosphine ligands have smaller ΔG^‡. A previous experimental study [30] showed that PCy₃ performs only modestly as a ligand in Sonogashira coupling, while CyP^tBu₂ is a better ligand. This experimental observation is consistent with the trend predicted in Figure 5 that CyP^tBu₂ (L12) gives a smaller ΔG^‡ than PCy₃ (L3).

While the general trend related to the steric factor exists, it is clearly not the only factor influencing the reactivity trend. For example, EtPAd₂ (L11) is bulkier than P^tBu₂Bn (L15), but the former gives a larger ΔG^‡ and is thus less reactive. Similar reversed trends are also observed for Cy₂PAd (L8) vs. Cy₂P^tBu (L9) and CyP^tBu₂ (L12) vs. ⁱPr₂P^tBu (L10). In other words, the cooperative nature between ligand and substrate may lead to a scenario where we cannot simply employ a single factor to explain experimental observations. The study by Plenio and co-workers [30] also demonstrates that, in addition to the steric factor, the electronic factors of phosphine ligands are important for their reactivity.

Moreover, Figure 5 shows that ΔG^‡(L-S) is generally greater than ΔG^‡(L), implying that ΔG^‡(L-S) has a greater contribution to the overall predicted ΔG^‡. Comparing the plots in Figure 3 and Figure 4, we observe that the change of ΔG^‡(L-S) with ligands is narrower (Figure 3) than that with substrates (Figure 4), suggesting that the effect of ligands on ΔG^‡(L-S) is not as significant as that of the substrates. This conclusion agrees with the experimental observation by Hartwig and coworkers that the reaction rate of oxidative addition of bromoarenes is positively related to bromobenzene concentration and only weakly dependent on ligand concentration [31].

4. Methodology

As stated in the Introduction, our goal is to apply machine learning to a relatively small dataset of rate constants for Pd-catalyzed Sonogashira cross-coupling reactions. To achieve this goal, incorporating pertinent chemical knowledge is crucial in defining chemically meaningful descriptors and constructing a robust machine learning framework. Therefore, we first briefly discuss the current understanding of the reaction mechanisms. For palladium-catalyzed cross-coupling reactions, it is widely accepted that the general catalytic cycle involves three fundamental steps [32,33]: oxidative addition of an aryl halide to Pd(0) to form an Ar-Pd(II)-X complex, transmetallation between the nucleophile and the Ar-Pd(II)-X complex, and reductive elimination to regenerate Pd(0) and produce the final coupling product. It is also widely understood that in most of these cross-coupling reactions, oxidative addition is often rate-limiting [34,35]. The Sonogashira cross-coupling reaction studied here is also an example of this.

Extensive and detailed kinetic studies [32,33,34,35] on the oxidative addition (OA) of aryl halides to phosphine Pd(0) complexes have concluded that the specific OA mechanisms mainly depend on both the steric properties of the phosphine ligands and the electronic properties of the aryl halides. The possible OA mechanisms include both associative and dissociative pathways (Scheme 2), and their variants. The literature reported 340 Sonogashira cross-coupling reactions that involve a wide range of phosphines and aryl bromides. Thus, it is not reasonable to assume that a single OA pathway can account for the rate-determining oxidative addition step. It is expected that the various mechanistic scenarios presented in Scheme 2 are all possible.

4.1. Descriptors

4.1.1. Descriptors for Aryl Bromides

Based on the above mechanistic analysis, the rate-determining step (OA) likely involves the cleavage of the C-X bond on the Pd(0) metal center. The more electron-deficient the C-X bond is, the easier the OA becomes, i.e., the lower the activation barrier ΔG^‡. In the reactions shown in Scheme 1, various aryl bromides with different substituents were used as substrates. The extent of electron deficiency of the C-Br bond in an aryl bromide largely depends on the identity and position of the substituent on the aryl ring. Thus, Hammett σ constants [36] for substituent groups were chosen as electronic descriptors for aryl bromides because they are widely accepted as electronic parameters for substituted arenes and used in quantitative structure-activity relationship (QSAR) studies. It should be pointed out here that Hammett σ constants were also used to correlate with Sonogashira cross-coupling reactivity of aryl bromides in the work by Plenio and co-workers [25], from which the measured rate constants dataset is drawn for this study. The 20 electrophile substrates used in the Pd-catalyzed Sonogashira cross-coupling reactions were aryl bromides substituted with an electron-withdrawing or electron-donating group at either meta- or para-position. Since steric effects of substituents at meta- and para-positions are generally considered less significant compared to ortho-positions, and considering that most substituents in the dataset were not bulky, the bulkiest substituent, -^tBu, located at para-position and far from the reaction center, should also be negligible sterically. Thus, steric factors likely did not contribute significantly to the reactivity of the underlying electrophiles and were not considered. Each aryl bromide was represented by a vector with 2 elements (σ_meta, σ_para), which contains information relevant to the Hammett constant of the substituent and its location, meta or para. 
Since there was only one substituent, either at meta or para position, on each aryl bromide, the unsubstituted aromatic position was assigned as 0 (the Hammett constant for a hydrogen substituent). For instance, a meta-bromobenzonitrile was represented as (σ_m^CN, 0), while a para-bromobenzonitrile was represented as (0, σ_p^CN).
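The two-element encoding described above can be sketched in a few lines. The σ values in the lookup table are commonly tabulated Hammett constants included only for illustration; they are an assumption here and may differ from the values listed in Table S2. The function and dictionary names are hypothetical.

```python
# Illustrative Hammett constants (assumed; see Table S2 for the values used).
HAMMETT = {
    "CN": {"meta": 0.56, "para": 0.66},
    "NO2": {"meta": 0.71, "para": 0.78},
    "OMe": {"meta": 0.12, "para": -0.27},
    "H": {"meta": 0.0, "para": 0.0},
}


def aryl_bromide_descriptor(substituent, position):
    """Return (sigma_meta, sigma_para) for a mono-substituted aryl bromide.

    The unsubstituted position is assigned 0, the Hammett constant of hydrogen.
    """
    if position not in ("meta", "para"):
        raise ValueError("position must be 'meta' or 'para'")
    sigma = HAMMETT[substituent][position]
    return (sigma, 0.0) if position == "meta" else (0.0, sigma)


meta_cn = aryl_bromide_descriptor("CN", "meta")  # meta-bromobenzonitrile
para_cn = aryl_bromide_descriptor("CN", "para")  # para-bromobenzonitrile
```

Keeping σ_meta and σ_para in separate slots lets the downstream layer weight the two positions differently.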

4.1.2. Descriptors for Phosphine Ligands

In palladium-catalyzed cross-coupling reactions, the structure of the phosphine ligand is crucial. The steric topology of phosphine ligands is frequently a primary consideration in ligand design, which contributes significantly to the OA activation barrier ΔG^‡. Hence, to retain structural information without using any DFT data, we employed graphical representations to describe the 3D molecular structure of a phosphine. We found it convenient to use a tree-like representation to describe a phosphine ligand. The most important atom of a phosphine molecule, phosphorus, was considered the “root”, or reference point of the tree-like layout, while carbon atoms that radiate from the root are called “nodes”. The nodes were organized in a layer-by-layer format, in which nodes with the same smallest number of bonds away from the root were considered to be in the same layer: The nodes that are connected to the phosphorus atom via a single bond (i.e., carbon atoms directly linked to the root) were categorized in the first layer. The nodes that are two bonds away were classified in the second layer, and so on.

Using this tree-like framework, all ligands were mapped as shown in Figure 6a. The 1st layer nodes were prioritized by their substituents according to the Cahn–Ingold–Prelog rules, which are commonly used for assigning stereochemistry. Similarly, the nodes in other layers were assigned using the same procedure, constituting a unique representation (alignment graph) for each phosphine ligand molecule.

The alignment graph for each phosphine ligand molecule was converted to an input graph consisting of node feature data and edge feature data. A Boolean (1 or 0) representation was utilized to denote the presence of a node on the graph, where 1 indicates existence and 0 indicates non-existence. The edges were characterized by the bond order, where 1 denoted a single bond, 1.5 referred to a C-C bond in an aromatic ring, and 2 indicated a double bond. More information on how the model handles such data will be elaborated in the subsequent section.
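The layer-by-layer organization and the node and edge features described above can be sketched with a breadth-first search from the phosphorus root. This is a minimal sketch under stated assumptions: the atom labels and the function name are hypothetical, hydrogens are omitted, and the Cahn–Ingold–Prelog ordering of nodes within a layer is not implemented.

```python
from collections import deque


def layer_nodes(bonds, root="P"):
    """Group atoms by their smallest number of bonds from the root atom.

    bonds is a list of (atom_a, atom_b, bond_order) tuples.
    """
    neighbors = {}
    for a, b, _order in bonds:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    depth = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first search yields shortest bond distances
        atom = queue.popleft()
        for nb in neighbors.get(atom, []):
            if nb not in depth:
                depth[nb] = depth[atom] + 1
                queue.append(nb)
    layers = {}
    for atom, d in depth.items():
        if d > 0:
            layers.setdefault(d, []).append(atom)
    return layers


# Tri-n-butylphosphine (L1): three four-carbon chains attached to phosphorus.
bonds = []
for chain in ("a", "b", "c"):
    previous = "P"
    for position in range(1, 5):
        atom = f"C{chain}{position}"
        bonds.append((previous, atom, 1.0))  # edge feature: bond order 1
        previous = atom

layers = layer_nodes(bonds)
# Node feature: 1 marks an existing node; absent slots would be 0.
node_features = {atom: 1 for layer in layers.values() for atom in layer}
```

For this ligand the sketch gives four layers of three carbon nodes each, with every edge carrying a bond order of 1.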

4.2. Machine Learning

4.2.1. Machine Learning Architecture

Incorporating domain knowledge of the aforementioned mechanisms that dictate the rate-determining oxidative addition step is essential for designing our machine learning model. The practice of incorporating domain knowledge in machine learning has been reported in the literature. For instance, Amal and co-workers [37] proposed integrating bandgap-related physics equations into machine learning to study photocatalysis, and Riley and co-workers [38] applied rule-based chemistry restrictions, e.g., forbidding the formation of additional bonds among non-adjacent atoms within the same ring when using deep reinforcement learning models for drug discovery.

Before being processed by machine learning, the rate constants were first converted into activation energies using the Eyring equation, which resembles the Arrhenius equation (Equation (1)):

k = \frac{k_{B} T}{h} e^{- \frac{E_{a}}{R T}}

(1)

where E_a is the activation energy, k_B is Boltzmann’s constant, T is the absolute temperature, h is Planck’s constant, and R is the gas constant. As a result, the output from the machine learning model is the prediction of the activation free energies (ΔG^‡) that were derived from the experimentally measured rate constants.
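The conversion in Equation (1) can be illustrated with a short sketch that inverts the equation to obtain a barrier from a rate constant and converts it back. The temperature and rate constant below are placeholder values chosen for illustration, not values from the dataset, and the rate constant is assumed to be expressed in s⁻¹.

```python
import math

K_B = 1.380649e-23   # Boltzmann's constant, J/K
H = 6.62607015e-34   # Planck's constant, J s
R = 1.987204e-3      # gas constant, kcal/(mol K)


def rate_to_barrier(k, temperature):
    """Invert Equation (1): barrier in kcal/mol from rate constant k."""
    return -R * temperature * math.log(k * H / (K_B * temperature))


def barrier_to_rate(barrier, temperature):
    """Equation (1): rate constant from barrier in kcal/mol."""
    return (K_B * temperature / H) * math.exp(-barrier / (R * temperature))


T_EXAMPLE = 298.15   # K, placeholder temperature (assumption)
barrier = rate_to_barrier(1.0, T_EXAMPLE)
k_back = barrier_to_rate(barrier, T_EXAMPLE)
```

Because the barrier depends logarithmically on k, rate constants spanning several orders of magnitude map onto a narrow range of activation energies, which is a more convenient training target.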

Considering the various mechanistic scenarios shown in Scheme 2 and the well-established notion that the phosphine ligand is the most important factor in palladium-catalyzed cross-coupling reactions [39,40], we proposed that ΔG^‡ can be reasonably estimated by a linear combination of two parts: (1) ΔG^‡(L), dependent purely on ligand, reflecting the steric effect of phosphine, and (2) ΔG^‡(L-S), dependent on both ligand and substrate, reflecting their cooperative nature in the OA step. We integrated this proposal into our machine learning framework to improve prediction accuracy, given the relatively limited dataset. The most convenient and reasonable way to process the previously defined tree-like structural descriptors of the phosphine ligand is a graph-based method, namely a graph neural network (GNN). Thus, ligand-dependent values (h_L) were generated by passing the ligand structural descriptors to a GNN layer to produce vertex values, followed by a single layer of fully connected neural network (FCNN). The ligand-substrate-dependent values (h_L-S) were generated by passing both the vertex values obtained from the GNN layer for the ligand and the aryl bromide descriptors (Hammett constants) to another single layer of FCNN, as shown in Figure 6b. The activation energy ΔG^‡ (output) is the sum of ΔG^‡(L), derived from the scaled h_L, and ΔG^‡(L-S), derived from the scaled h_L-S.

4.2.2. Machine Learning Model Training Process

Following the framework illustrated above (Figure 6b), neural networks were built with the packages TensorFlow 2.7 [41] and Deep Graph Library [42] (DGL, for the GNN layer) in the Python language. The Adaptive Moment Estimation (Adam) optimizer was used with Mean Absolute Error (MAE) as the loss function. Hyperparameters, including the learning rate, the message dimension, and the number of iterations for the GNN layer, were adjusted using the Optuna package [43]. Further details related to hyperparameters, regularizers, weight restrictions, and early termination are provided in Section 5.

The dataset was randomly divided into three subsets: a training dataset (60%, 204 data points), a validation dataset (20%, 68 data points), and a testing dataset (20%, 68 data points). With the setup described above, we trained the neural networks by mini-batch gradient descent, using batches of 10 samples from the training dataset, for up to 10,000 epochs. To select optimal sets of trainable parameters, a coefficient of determination R² > 0.8 and a mean absolute error MAE < 1.5 on the validation set were used as thresholds. Only those neural networks meeting both criteria were selected. These thresholds were chosen empirically, balancing the number of required trials against the performance of the selected models. Final predictions were calculated by averaging the predictions of the 10 optimal sets of trainable parameters with the largest R² on the validation set. Cross-validation was carried out for further validation and is discussed in Section 2.3.
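The selection and averaging procedure described above can be sketched as follows. The data structure holding each trained network's predictions (a dictionary with `val_pred` and `test_pred` keys) and the function names are hypothetical, and the toy numbers are for illustration only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def select_and_average(candidates, y_val, r2_min=0.8, mae_max=1.5, top_n=10):
    """Keep networks passing both validation thresholds, then average the
    test predictions of the top_n networks ranked by validation R^2."""
    passed = []
    for cand in candidates:
        r2 = r2_score(y_val, cand["val_pred"])
        mae = mean_absolute_error(y_val, cand["val_pred"])
        if r2 > r2_min and mae < mae_max:
            passed.append((r2, cand))
    if not passed:
        raise ValueError("no candidate met both thresholds")
    passed.sort(key=lambda item: item[0], reverse=True)
    chosen = [cand for _, cand in passed[:top_n]]
    n_test = len(chosen[0]["test_pred"])
    return [sum(c["test_pred"][i] for c in chosen) / len(chosen)
            for i in range(n_test)]


# Toy example (hypothetical numbers): two acceptable networks and one poor one.
y_val = [1.0, 2.0, 3.0, 4.0]
candidates = [
    {"val_pred": [1.1, 1.9, 3.0, 4.1], "test_pred": [10.0, 20.0]},
    {"val_pred": [1.0, 2.0, 3.0, 4.0], "test_pred": [12.0, 22.0]},
    {"val_pred": [4.0, 3.0, 2.0, 1.0], "test_pred": [100.0, 100.0]},
]
averaged = select_and_average(candidates, y_val)
```

Averaging over several independently trained networks reduces the dependence of the final prediction on any single random initialization.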

5. Computational Details

The general formulae for the structure of the GNN layer in a message-passing neural network (MPNN) framework are displayed as Equations (2) and (3). As an example, we focus on one of the atoms, atom a₁, in a molecular structural input. Let a_x be the label of any neighboring atom of a₁. For the first iteration, i.e., time step t = 0, h_{a_{1}}^{0} is the node feature (also referred to as the initial hidden state) of a₁, h_{a_{x}}^{0} is the node feature of a_x, and e_{a_{1} a_{x}} is the edge (bond) feature of a₁ and a_x. Then the summation of all the message functions (M_t) results of each a_x gives m_{a_{1}}^{1}, the “message” of a₁ at t = 1, as shown in Equation (2):

m_{a_{1}}^{t + 1} = \sum_{a_{x} \in \mathrm{Neighbor}} M_{t} (h_{a_{1}}^{t}, h_{a_{x}}^{t}, e_{a_{1} a_{x}})

(2)

This message m_{a_{1}}^{1}, together with the current node feature h_{a_{1}}^{0}, will next be used as inputs for the update function, U_t, to generate a new node feature h_{a_{1}}^{1}, as shown in Equation (3):

h_{a_{1}}^{t + 1} = U_{t} (h_{a_{1}}^{t}, m_{a_{1}}^{t + 1})

(3)

For the next iteration, the same process will be carried out, in which the previous node feature inputs (old hidden states) h_{a_{1}}^{0}, h_{a_{x}}^{0} will be replaced by new hidden states h_{a_{1}}^{1}, h_{a_{x}}^{1}, respectively. This process will end if a predefined number of iterations is reached. The final hidden state will be the input of the following layer.
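The iteration defined by Equations (2) and (3) can be sketched with scalar hidden states. The specific forms of the message function M and update function U below (a Leaky ReLU applied to a weighted sum) and all weight values are assumptions made for illustration; they are not the trained functions of this work. The same weights are reused in every iteration.

```python
def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x


def message_passing(node_features, edges, w_msg, w_upd, n_iter=3):
    """Run n_iter rounds of Equations (2) and (3) with shared weights.

    node_features: {atom: scalar hidden state}
    edges: list of (atom_a, atom_b, bond_order)
    w_msg: (w_self, w_neighbor, w_edge) for the assumed message function
    w_upd: (w_hidden, w_message) for the assumed update function
    """
    neighbors = {atom: [] for atom in node_features}
    for a, b, order in edges:
        neighbors[a].append((b, order))
        neighbors[b].append((a, order))
    h = dict(node_features)
    for _ in range(n_iter):
        # Equation (2): sum the messages from all neighbors of each atom.
        messages = {
            a: sum(leaky_relu(w_msg[0] * h[a] + w_msg[1] * h[b] + w_msg[2] * e)
                   for b, e in neighbors[a])
            for a in h
        }
        # Equation (3): all hidden states are updated simultaneously.
        h = {a: leaky_relu(w_upd[0] * h[a] + w_upd[1] * messages[a]) for a in h}
    return h


# Phosphorus bonded to three carbons, all node features 1, all bonds single.
features = {"P": 1.0, "C1": 1.0, "C2": 1.0, "C3": 1.0}
edges = [("P", "C1", 1.0), ("P", "C2", 1.0), ("P", "C3", 1.0)]
h_final = message_passing(features, edges, w_msg=(0.5, 0.5, 0.0),
                          w_upd=(0.5, 0.5), n_iter=1)
```

After one iteration the phosphorus atom, which has three neighbors, already carries a different hidden state from the terminal carbons, which is how connectivity enters the descriptor.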

As mentioned in Section 4.2.2, some of the hyperparameters were chosen using the Optuna package. Learning rates were chosen from 0.0005, 0.001, 0.002, and 0.005. The detailed workflow of the model and the corresponding hyperparameters are described as follows:

1. Phosphine ligand descriptors were passed through a GNN layer with 3–5 iterations, in which all iterations shared the same weights without bias. The GNN layer was constructed in the framework of a message-passing neural network (MPNN) using Leaky ReLU activation with α = 0.01, size 2–4 for the message function, and size 1 for the update function. The graphical output was concatenated according to the order of an aligned graph. The generated vector was referred to as GNN Output States.
2. Aryl bromide descriptors and GNN Output States were concatenated and passed through a fully connected neural network (FCNN) layer using Sigmoid activation, size 1, L2-regularizer, and no bias. The generated vector was referred to as Hidden States 1.
3. GNN Output States were passed through an FCNN layer using Sigmoid activation, size 1, L2-regularizer, with non-negative weights and no bias. The generated vector was referred to as Hidden States 2.
4. Hidden States 1 and Hidden States 2 were passed through an FCNN layer using linear activation, size 1, with non-negative weights and no bias, generating predicted activation energies.
5. Predicted activation energies were converted to reaction rate constants using the Eyring equation.
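The fully connected part of the workflow above (the two size-1 Sigmoid layers and the final linear layer) can be sketched as a forward pass in plain Python. All weight values are hypothetical, the concatenation order of the Hammett descriptors and GNN Output States is an assumption, and regularization and training are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_barrier(gnn_out, hammett, w_ls, w_l, w_out):
    """Return (dG_L, dG_LS, dG_total) from the two-branch architecture.

    gnn_out: GNN Output States (list of floats)
    hammett: (sigma_meta, sigma_para)
    w_ls:    weights of the ligand-substrate layer (any sign, no bias)
    w_l:     weights of the ligand-only layer (non-negative, no bias)
    w_out:   (scale for Hidden States 1, scale for Hidden States 2),
             both non-negative
    """
    if any(w < 0 for w in list(w_l) + list(w_out)):
        raise ValueError("w_l and w_out must be non-negative")
    x = list(hammett) + list(gnn_out)          # assumed concatenation order
    hidden_1 = sigmoid(sum(w * v for w, v in zip(w_ls, x)))        # L-S branch
    hidden_2 = sigmoid(sum(w * v for w, v in zip(w_l, gnn_out)))   # L branch
    dg_ls = w_out[0] * hidden_1
    dg_l = w_out[1] * hidden_2
    return dg_l, dg_ls, dg_l + dg_ls


dg_l, dg_ls, dg_total = predict_barrier(
    gnn_out=[0.2, 0.4, 0.1],
    hammett=(0.0, 0.78),
    w_ls=[-1.0, -1.0, 0.5, 0.5, 0.5],
    w_l=[1.0, 1.0, 1.0],
    w_out=(20.0, 10.0),
)
```

Because the final layer is linear with non-negative weights and no bias, the predicted barrier is always a sum of two non-negative contributions, which is what allows ΔG^‡(L) and ΔG^‡(L-S) to be read out separately.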

Within a maximum of 10,000 epochs, the training process was divided into two stages. In the first stage, to increase training efficiency, only steps 1–4 were performed, i.e., the outputs were the predicted activation energies, while the corresponding ground truths were the activation energies calculated from the experimentally measured rate constants. The first stage ended when the loss fell below the cutoff (<5.0, <6.0 or <7.0 kcal/mol). The remaining epochs constituted the second stage, in which steps 1–5 were performed to achieve finer optimization of the prediction, i.e., the outputs were the predicted reaction rate constants, and the corresponding ground truths were the experimental reaction rate constants. Early termination of the training process was triggered by either of two conditions, both evaluated on the last 10 batch loss changes. If 80% of these loss changes differed in direction from their preceding loss changes, i.e., one was an increase while the other was a decrease, and the mean absolute value of the loss changes exceeded 0.2, the training process was considered highly fluctuating and was rejected. On the other hand, if the mean absolute value of the loss changes was below 0.0001, the training process was considered converged and was terminated early.
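The early-termination rule can be sketched as a small function over the recorded batch losses. This follows one reading of the description above (direction flips and mean absolute change are both computed over the last 10 loss changes); the function name and return labels are hypothetical.

```python
def check_training(losses, window=10, flip_fraction=0.8,
                   fluctuation_tol=0.2, convergence_tol=1e-4):
    """Classify a training run as 'rejected', 'converged', or 'continue'."""
    # window changes plus one preceding change require window + 2 losses.
    if len(losses) < window + 2:
        return "continue"
    changes = [later - earlier for earlier, later in zip(losses[:-1], losses[1:])]
    recent = changes[-window:]
    preceding = changes[-window - 1:-1]
    mean_abs_change = sum(abs(c) for c in recent) / window
    # A flip is a change whose sign is opposite to that of the change before it.
    flips = sum(1 for c, p in zip(recent, preceding) if c * p < 0)
    if flips >= flip_fraction * window and mean_abs_change > fluctuation_tol:
        return "rejected"
    if mean_abs_change < convergence_tol:
        return "converged"
    return "continue"


oscillating = check_training([1.0, 2.0] * 6)
flat = check_training([0.5] * 12)
decreasing = check_training([12.0 - i for i in range(12)])
```

In this sketch a strongly oscillating loss is rejected, a flat loss is treated as converged, and a steadily decreasing loss lets training continue.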

6. Conclusions

In summary, we developed a simple yet highly effective tree-like representation for phosphine ligands that can be utilized to generate input graphs for GNNs. By combining this ligand descriptor with a well-established electronic descriptor of aryl bromides (Hammett constant) and integrating human knowledge of mechanistic details into our machine learning model, we were able to extract ligand- and substrate-specific activation energies (ΔG^‡(L) and ΔG^‡(L-S)), enabling us to investigate the impact of ligands and substrates on the rate-limiting step. Our machine learning protocol confirmed previously established chemical principles regarding cross-coupling reactions. In terms of practicality, our approach required only the tree-like representations (generated easily from the chemical structure) and straightforward empirical electronic descriptors for the substrates (Hammett σ constants of their substituents) as input. It was rewarding to develop a knowledge-driven machine learning model for chemical reaction prediction using a relatively small dataset. Our model not only provided accurate predictions but also shed light on the underlying scientific questions.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/molecules28124730/s1: supplementary plots of ΔG^‡(L-S) for each substrate and ligand, and supplementary tables on cross-validation and results. Figures S1–S17: Plots of ΔG^‡(L-S) against substrate for Ligands 1–17; Figures S18–S37: Plots of ΔG^‡(L-S) against ligand for Aryl bromides 1–20; Figures S38–S45: Parity plots for cross-validation sets 1–4; Table S1: Averaged predicted reaction rate constants of plot; Table S2: List of Hammett constants used. Reference [44] is cited in the Supplementary Materials.

Author Contributions

Conceptualization, Z.L. and H.S.; Data curation, K.C. and L.T.T.; Methodology, K.C. and L.T.T.; Software, K.C. and L.T.T.; Formal analysis, K.C. and L.T.T.; Writing—original draft preparation, Z.L. and K.C.; Writing—review and editing, Z.L., H.S., Y.H., K.C. and L.T.T.; Supervision, Z.L., H.S. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Research Grants Council of Hong Kong, grant number HKUST16300021.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and data used to produce the reported results can be found online at: https://github.com/klchan4207/PdSonoML.

Acknowledgments

The authors thank Fu Kit Sheong for valuable discussion.

Conflicts of Interest

The authors declare no conflict of interest.

Sample Availability

Not applicable.

References

Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260.
Tu, J.V. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J. Clin. Epidemiol. 1996, 49, 1225–1231.
Zanardi, M.M.; Sarotti, A.M. GIAO C–H COSY Simulations Merged with Artificial Neural Networks Pattern Recognition Analysis. Pushing the Structural Validation a Step Forward. J. Org. Chem. 2015, 80, 9371–9378.
Mansouri, K.; Cariello, N.F.; Korotcov, A.; Tkachenko, V.; Grulke, C.M.; Sprankle, C.S.; Allen, D.; Casey, W.M.; Kleinstreuer, N.C.; Williams, A.J. Open-source QSAR models for pKa prediction using multiple machine learning approaches. J. Cheminform. 2019, 11, 60.
Yang, Q.; Li, Y.; Yang, J.D.; Liu, Y.; Zhang, L.; Luo, S.; Cheng, J.P. Holistic Prediction of the pKa in Diverse Solvents Based on a Machine-Learning Approach. Angew. Chem. Int. Ed. Engl. 2020, 59, 19282–19291.
Ramakrishnan, R.; Dral, P.O.; Rupp, M.; von Lilienfeld, O.A. Big data meets quantum chemistry approximations: The Δ-machine learning approach. J. Chem. Theory Comput. 2015, 11, 2087–2096.
Unke, O.T.; Meuwly, M. PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges. J. Chem. Theory Comput. 2019, 15, 3678–3693.
von Lilienfeld, O.A. Quantum machine learning in chemical compound space. Angew. Chem. Int. Ed. Engl. 2018, 57, 4164–4169.
Rupp, M.; Tkatchenko, A.; Müller, K.R.; von Lilienfeld, O.A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 2012, 108, 058301.
Gastegger, M.; Behler, J.; Marquetand, P. Machine learning molecular dynamics for the simulation of infrared spectra. Chem. Sci. 2017, 8, 6924–6935.
Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.R.; von Lilienfeld, O.A. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 2013, 15, 095003.
Coley, C.W.; Barzilay, R.; Jaakkola, T.S.; Green, W.H.; Jensen, K.F. Prediction of Organic Reaction Outcomes Using Machine Learning. ACS Cent. Sci. 2017, 3, 434–443.
Coley, C.W.; Green, W.H.; Jensen, K.F. Machine Learning in Computer-Aided Synthesis Planning. Acc. Chem. Res. 2018, 51, 1281–1289.
Gómez-Bombarelli, R.; Aguilera-Iparraguirre, J.; Hirzel, T.D.; Duvenaud, D.; Maclaurin, D.; Blood-Forsythe, M.A.; Chae, H.S.; Einzinger, M.; Ha, D.G.; Wu, T.; et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 2016, 15, 1120–1127.
Xie, T.; Grossman, J.C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120, 145301.
Coley, C.W.; Jin, W.; Rogers, L.; Jamison, T.F.; Jaakkola, T.S.; Green, W.H.; Barzilay, R.; Jensen, K.F. A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci. 2019, 10, 370–377.
Schwaller, P.; Vaucher, A.C.; Laino, T.; Reymond, J.L. Prediction of chemical reaction yields using deep learning. Mach. Learn. Sci. Technol. 2021, 2, 015016.
Schwaller, P.; Hoover, B.; Reymond, J.L.; Strobelt, H.; Laino, T. Unsupervised Attention-Guided Atom-Mapping. Sci. Adv. 2021, 7, eabe4166.
Ahneman, D.T.; Estrada, J.G.; Lin, S.; Dreher, S.D.; Doyle, A.G. Predicting reaction performance in C–N cross-coupling using machine learning. Science 2018, 360, 186–190.
Nielsen, M.K.; Ahneman, D.T.; Riera, O.; Doyle, A.G. Deoxyfluorination with Sulfonyl Fluorides: Navigating Reaction Space with Machine Learning. J. Am. Chem. Soc. 2018, 140, 5004–5008.
Yada, A.; Matsumura, T.; Ando, Y.; Nagata, K.; Ichinoseki, S.; Sato, K. Ensemble Learning Approach with LASSO for Predicting Catalytic Reaction Rates. Synlett 2021, 32, 1843–1848.
Zahrt, A.F.; Henle, J.J.; Rose, B.T.; Wang, Y.; Darrow, W.T.; Denmark, S.E. Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning. Science 2019, 363, eaau5631.
Domingo, L.R.; Aurell, M.J.; Pérez, P.; Contreras, R. Quantitative characterization of the global electrophilicity power of common diene/dienophile pairs in Diels–Alder reactions. Tetrahedron 2002, 58, 4417–4423.
Teixeira, F.; Cordeiro, M.N.D. Simple descriptors for assessing the outcome of aza-Diels–Alder reactions. RSC Adv. 2015, 5, 50729–50740.
an der Heiden, M.R.; Plenio, H.; Immel, S.; Burello, E.; Rothenberg, G.; Hoefsloot, H.C.J. Insights into Sonogashira Cross-Coupling by High-Throughput Kinetics and Descriptor Modeling. Chem. Eur. J. 2008, 14, 2857–2866.
Gensch, T.; dos Passos Gomes, G.; Friederich, P.; Peters, E.; Gaudin, T.; Pollice, R.; Jorner, K.; Nigam, A.; Lindner-D’Addario, M.; Sigman, M.S.; et al. A comprehensive discovery platform for organophosphorus ligands for catalysis. J. Am. Chem. Soc. 2022, 144, 1205–1217.
Sandfort, F.; Strieth-Kalthoff, F.; Kühnemund, M.; Beecks, C.; Glorius, F. A structure-based platform for predicting chemical reactivity. Chem 2020, 6, 1379–1390.
Barrios-Landeros, F.; Hartwig, J.F. Distinct mechanisms for the oxidative addition of chloro-, bromo-, and iodoarenes to a bisphosphine palladium (0) complex with hindered ligands. J. Am. Chem. Soc. 2005, 127, 6944–6945.
Fleckenstein, C.A.; Plenio, H. Sterically demanding trialkylphosphines for palladium-catalyzed cross coupling reactions—Alternatives to PtBu3. Chem. Soc. Rev. 2010, 39, 694–711.
Schilz, M.; Plenio, H. A guide to Sonogashira cross-coupling reactions: The influence of substituents in aryl bromides, acetylenes, and phosphines. J. Org. Chem. 2012, 77, 2798–2807.
Barrios-Landeros, F.; Carrow, B.P.; Hartwig, J.F. Effect of ligand steric properties and halide identity on the mechanism for oxidative addition of haloarenes to trialkylphosphine Pd (0) complexes. J. Am. Chem. Soc. 2009, 131, 8141–8154.
Crabtree, R.H. The Organometallic Chemistry of the Transition Metals, 6th ed.; John Wiley & Sons: New York, NY, USA, 2014; pp. 163–203.
Spessard, G.; Miessler, G. Euan Cameron. Organometallic Chemistry, 2nd ed.; Oxford University Press: New York, NY, USA, 2010; pp. 585–586.
Labinger, J.A. Tutorial on oxidative addition. Organometallics 2015, 34, 4784–4795.
Xue, L.; Lin, Z. Theoretical aspects of palladium-catalysed carbon–carbon cross-coupling reactions. Chem. Soc. Rev. 2010, 39, 1692–1705.
Hammett, L.P. The effect of structure upon the reactions of organic compounds. Benzene derivatives. J. Am. Chem. Soc. 1937, 59, 96–103.
Masood, H.; Toe, C.Y.; Teoh, W.Y.; Sethu, V.; Amal, R. Machine Learning for Accelerated Discovery of Solar Photocatalysts. ACS Catal. 2019, 9, 11774–11787.
Zhou, Z.; Kearnes, S.; Li, L.; Zare, R.N.; Riley, P. Optimization of Molecules via Deep Reinforcement Learning. Sci. Rep. 2019, 9, 10752.
Miyaura, N.; Suzuki, A. Palladium-Catalyzed Cross-Coupling Reactions of Organoboron Compounds. Chem. Rev. 1995, 95, 2457–2483.
Nicolaou, K.C.; Bulger, P.G.; Sarlah, D. Palladium-Catalyzed Cross-Coupling Reactions in Total Synthesis. Angew. Chem. Int. Ed. Engl. 2005, 44, 4442–4489.
Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://tensorflow.org/ (accessed on 1 January 2021).
Wang, M.; Yu, L.; Zheng, D.; Gan, Q.; Gai, Y.; Ye, Z.; Li, M.; Zhou, J.; Huang, Q.; Ma, C.; et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv 2019, arXiv:1909.01315.
Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; Association for Computing Machinery: Anchorage, AK, USA, 2019; pp. 2623–2631.
Perrin, D.D.; Dempsey, B.; Serjeant, E.P. pKa Prediction for Organic Acids and Bases; Springer: New York, NY, USA, 1981; pp. 109–126.

Scheme 1. Pd-catalyzed Sonogashira cross-coupling reaction.

Figure 1. Comparison of the predicted rate constants with the experimental rate constants for the training, validation and testing sets (k_ML is the reaction rate constant predicted by machine learning; k_Expmt is the experimental reaction rate constant).

Figure 2. Plot of ΔG^‡(L) against Ligands 1–17 arranged in ascending order, with chemical structures linked by a green line to the corresponding columns.

Figure 3. Plot of ΔG^‡(L-S) against ligands (Ligand 1–17) for 1-bromo-4-nitrobenzene, with chemical structures of the ligands linked by a green line to the corresponding columns.

Figure 4. Plot of ΔG^‡(L-S) against meta- and para-substituted aryl bromides (aryl bromide 1–20) for Ligand 7 (tri-sec-butylphosphine) with chemical structures of aryl bromides labelled with a green line to the corresponding columns.

Figure 5. Plot of ΔG^‡ against ligands (Ligand 1–17) for 1-bromo-4-nitrobenzene, with chemical structures of the ligands linked by a green line to the corresponding columns.

Scheme 2. Possible mechanisms for oxidative addition of aryl halides to phosphine Pd(0) complexes (L = phosphine ligand; X = halide).

Figure 6. (a) Alignment scheme for converting a phosphine ligand chemical structure to a GNN input graph; (b) architecture of our proposed machine learning model.

Table 1. Performance comparison of different machine learning methods.

Ligand Descriptor	Model	Training Set R²	Training Set MAE	Validation Set R²	Validation Set MAE	Testing Set R²	Testing Set MAE
Graphical Representation (i.e., our descriptor)	GNN	0.94	0.81	0.87	1.04	0.84	1.15
Buried Volume	FCNN	<0	3.500	<0	3.233	<0	3.508
Buried Volume	Restricted Linear Regression	0.27	3.46	0.34	2.78	0.30	4.06
Cone Angle	FCNN	<0	3.500	<0	3.233	<0	3.508
Cone Angle	Restricted Linear Regression	0.26	3.52	0.33	2.78	0.28	4.03
Buried Volume and Cone Angle	FCNN	<0	3.500	<0	3.233	<0	3.508
Buried Volume and Cone Angle	Restricted Linear Regression	0.27	3.46	0.34	2.78	0.30	4.06
One-Hot Encoding	FCNN	<0	3.502	<0	3.232	<0	3.511
One-Hot Encoding	Restricted Linear Regression	0.52	2.85	0.52	2.38	0.51	3.07
Multiple Fingerprint Features (MFF)	FCNN	<0	3.500	<0	3.233	<0	3.508
Multiple Fingerprint Features (MFF)	Restricted Linear Regression	0.52	2.85	0.52	2.38	0.51	3.07

Table 2. Results of Cross-Validation.

Dataset	Set 1 R²	Set 2 R²	Set 3 R²	Set 4 ¹ R²	Average R²
Training Set	0.94	0.93	0.91	0.92	0.93
Validation Set	0.87	0.76	0.78	0.93	0.84
Testing Set	0.84	0.87	0.87	0.91	0.87

¹ The average of three models was used for this set.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
