1. Introduction
Drug discovery aims to find novel molecules that can modulate specific molecular targets implicated in diseases to produce the desired therapeutic responses. The traditional drug discovery process is labor-intensive, typically spanning over a decade and costing upwards of a billion dollars per successful drug [1]. Despite significant investments, the success rate remains low, with only about 10% of drug candidates entering clinical trials, of which only one may eventually receive approval [2]. To address the complexity, development time, and cost of drug discovery, AI techniques, particularly generative models, have been integrated into the process and have demonstrated promising results: they can shorten development time and improve the efficiency and accuracy of candidate molecule design and optimization [3,4,5]. For instance, in [3], a deep generative technique was used to identify several active Discoidin Domain Receptor 1 (DDR1) inhibitors in just 21 days.
AI models and computational models that combine combinatorial methods and machine learning can be used to generate a large number of new drug molecules, as shown in [5,6], where thousands of new drug molecules were generated for the treatment of hypertension. However, translating AI-generated molecules into tangible compounds requires practical synthesis. The synthesizability of molecules generated by AI methods remains a challenge, as discussed in [7,8,9], where the importance of enhancing synthesizability through recent synthetic planning methods is emphasized. Traditionally, the synthesizability of molecular compounds is determined experimentally by expert chemists who rely on heuristic methods based on empirical rules and experience. For the large sets of molecules typically produced by an AI model, this traditional approach is cumbersome, so methods that are consistent and scale better are needed. One option is to account for synthesizability during the generation process itself. This requires generative models with reaction-aware architectures, which are not easy to develop. Hence, most contemporary AI-based molecular generative models generate as many molecules as possible and then apply post-filtering to determine their synthesizability.
One option for estimating the synthesizability of a large set of molecules is in silico synthetic accessibility scoring. Synthetic accessibility scoring [10] is a computational method for estimating how easy it is to synthesize a drug-like molecule based on molecular fragment contributions and molecular complexity. The method in [10] was validated against the ease of synthesis estimated by experienced medicinal chemists for a set of 40 molecules. Synthetic accessibility scoring is often based on simplified heuristic techniques, which may not adequately capture the complexities of current synthetic chemistry [11]. As a result, even if a molecule scores well, poor yields or expensive reagents may make its synthesis impractical. The score also does not provide reaction pathways for the synthesis. Therefore, it is mainly suitable for a quick estimation of the synthesizability of molecules.
An in silico synthesizability analysis technique that offers reaction pathways and can handle large sets of molecules is data-driven retrosynthetic analysis. This approach integrates AI to enhance the efficiency of synthesizing complex molecules by automating the identification of synthetic routes and optimizing reaction conditions. However, compared to synthetic accessibility scoring, data-driven retrosynthetic analysis is significantly more computationally demanding, with runs on large datasets taking hours or days. Therefore, it is essential that only molecules with a high probability of being synthesizable undergo retrosynthetic analysis.
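The cost asymmetry described above suggests a staged screen, in which the inexpensive score decides which molecules reach the expensive analysis. The following is a minimal sketch and not the implementation used in this work: `staged_screen`, `toy_sa_score`, and `toy_confidence` are hypothetical names, the two scoring functions are toy stand-ins for a real synthetic accessibility scorer and a real retrosynthesis tool, and the convention that a lower score means easier synthesis is an assumption.

```python
from typing import Callable, Dict, Iterable, List


def staged_screen(
    smiles_list: Iterable[str],
    cheap_score: Callable[[str], float],
    expensive_confidence: Callable[[str], float],
    max_cheap_score: float,
) -> Dict[str, float]:
    """Run the costly analysis only on molecules that pass the cheap filter.

    Assumption: a lower cheap score means an easier synthesis.
    """
    shortlisted = [s for s in smiles_list if cheap_score(s) <= max_cheap_score]
    return {s: expensive_confidence(s) for s in shortlisted}


# Records which molecules reached the expensive stage (for illustration only).
expensive_calls: List[str] = []


def toy_sa_score(smiles: str) -> float:
    """Hypothetical stand-in: longer SMILES strings get a worse (higher) score."""
    return 1.0 + 0.1 * len(smiles)


def toy_confidence(smiles: str) -> float:
    """Hypothetical stand-in for a slow retrosynthesis confidence call."""
    expensive_calls.append(smiles)
    return 0.9


candidates = ["CCO", "C" * 30]
results = staged_screen(candidates, toy_sa_score, toy_confidence, max_cheap_score=3.0)
print(results)
print(expensive_calls)
```

In this toy run, only the short SMILES string passes the cut-off, so the costly function is called once rather than once per candidate.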
In this paper, we integrate synthetic accessibility scoring and a data-driven retrosynthesis reliability assessment method to evaluate the synthesizability of AI-generated lead drug molecules. We term this integrated strategy predictive synthetic feasibility analysis. Specifically, it combines traditional computational synthetic accessibility scoring with an AI-driven predictive retrosynthesis confidence assessment to determine the synthesizability of a set of novel lead drug molecules generated using AI in [12]. Unlike synthetic accessibility scoring, the AI-driven retrosynthesis confidence assessment considers factors such as the context of the reactions involved. The integrated strategy enables quick initial qualitative and quantitative screening of large sets of molecules for actionable synthetic routes, balancing speed and detail and favoring easy synthesis routes to reduce the risk of pursuing non-synthesizable compounds in the drug development pipeline. Once a set of molecules is identified as easy to synthesize, the full retrosynthetic analysis is conducted. In this paper, we present the retrosynthetic analysis of the molecules that the proposed method ranks as the easiest to synthesize. Note that the term lead compound is used in this work to designate a potential drug compound that is yet to undergo preclinical evaluation.
2. Results
In this section, we employ the method described in Figure 10 to present the synthesizability analysis of the molecules in the dataset, D. First, we determine the values of the synthetic accessibility (SA) score and the retrosynthesis confidence, CI, for the molecules in D, and then we plot the SA score versus CI characteristics of the molecules for different thresholds that indicate their predictive synthetic feasibility. Second, we present the AI-predicted retrosynthetic routes of the four molecules with the best predictive synthetic feasibility, together with the expert chemist’s opinion on the retrosynthesis routes. We note that all the figures in this section were generated using RDKit in Python version 3.12.
The SA scores for all the elements of D are calculated using the RDKit tool, which is based on the method developed by [9]. Figure 1 shows the SA score
violin plot for the 123 molecules in D. It can be seen that the synthetic accessibility of most of the molecules is concentrated between
and
. However, the SA score threshold that offers good synthesizability of the molecules cannot be determined from this information alone. In Figure 2, we show the CI violin plot for the 123 molecules in D. The CI values for all the elements of D are calculated using the IBM RXN for Chemistry AI tool [13]. The results in the graph show that a considerable number of molecules can be synthesized with over
confidence. However, the plot alone does not clearly specify the threshold CI value that indicates a ‘good’ value for a synthesizable molecule.
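A violin plot is a visual summary of a distribution, and the same information can be inspected numerically through quartiles. The sketch below is illustrative only: `summarize` is a hypothetical helper, and `toy_scores` contains made-up values, not the scores of the molecules in D.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Return the five-number summary that a violin or box plot visualizes."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Made-up scores for illustration; they are NOT taken from the dataset D.
toy_scores = [2.1, 2.4, 2.6, 3.0, 3.3, 3.9, 4.5]
summary = summarize(toy_scores)
print(summary)
```

The interquartile range (`q1` to `q3`) corresponds to the region where a violin plot is typically widest, which is the kind of concentration described in the text.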
Combining the information in the SA scores and the CI values, we present the predictive synthesis feasibility analysis for arbitrary values of the thresholds, Th1 and Th2. The resulting characteristics are shown in Figure 3 for different threshold values. The SA score and CI for the four molecules with the most promising synthetic scores are shown in Table 1.
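The effect of the two thresholds can be illustrated with a small sweep. This is a sketch under stated assumptions rather than the rule defined in Figure 10: it assumes a molecule is retained when its SA score is at most Th1 and its CI is at least Th2, and the `Molecule` class, the `feasible` and `sweep` functions, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Molecule:
    name: str
    sa_score: float  # assumed convention: lower means easier to synthesize
    ci: float        # retrosynthesis confidence in [0, 1]


def feasible(molecules: Sequence[Molecule], th1: float, th2: float) -> List[Molecule]:
    """Keep molecules with sa_score <= th1 and ci >= th2 (assumed rule)."""
    return [m for m in molecules if m.sa_score <= th1 and m.ci >= th2]


def sweep(
    molecules: Sequence[Molecule],
    th1_values: Sequence[float],
    th2_values: Sequence[float],
) -> Dict[Tuple[float, float], int]:
    """Count the retained molecules for every (Th1, Th2) pair."""
    return {
        (t1, t2): len(feasible(molecules, t1, t2))
        for t1 in th1_values
        for t2 in th2_values
    }


# Made-up molecules for illustration; they are NOT taken from the dataset D.
toy = [
    Molecule("m1", 2.1, 0.95),
    Molecule("m2", 3.4, 0.80),
    Molecule("m3", 4.8, 0.91),
    Molecule("m4", 2.9, 0.60),
]
counts = sweep(toy, th1_values=[3.0, 5.0], th2_values=[0.7, 0.9])
for (t1, t2), n in sorted(counts.items()):
    print(f"Th1={t1}, Th2={t2}: {n} molecule(s) retained")
```

Tightening either threshold can only shrink the retained set, which is why a sweep of this kind helps in choosing a pair that leaves a manageable shortlist for full retrosynthetic analysis.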
2.1. Retrosynthetic Feasibility Analysis of Compound A
The principal synthesis precursors to realizing the target molecule are shown in Table 2. The precursor 1,4-dioxane is a cyclic ether used as a solvent, and palladium is used as a catalyst in cross-coupling reactions. Potassium carbonate is a base, butylboronic acid is a reactant used in Suzuki coupling, and ethyl 2-(3-bromo-4-hydroxyphenyl)acetate is an ester containing bromo and hydroxy substituents on a phenyl ring. The reaction occurs in two steps, as shown in Figure 4. The first step replaces the bromine of the starting material, ethyl 2-(3-bromo-4-hydroxyphenyl)acetate, with a butyl group, and the reaction is catalyzed by tetrakis(triphenylphosphine)palladium, Pd(PPh3)4. The base (K2CO3) facilitates the conversion of n-butylboronic acid into a more reactive species, and the two starting materials react at elevated temperatures (50–80 °C) to enhance the reaction rate. This type of reaction is referred to as the Suzuki–Miyaura reaction [14], since it forms a new carbon–carbon (C–C) bond between the phenyl ring of the aryl bromide and the alkyl group of the boronic acid. The second step entails ammonolysis (reaction with ammonia, NH3) of the first-step product, ethyl 2-(3-butyl-4-hydroxyphenyl)acetate, which converts the ester into the corresponding amide. The reaction is carried out in methanol (CH3OH) as a solvent, also at elevated temperature to increase the reaction rate.
2.2. Retrosynthetic Feasibility Analysis of Compound B
The principal synthesis precursors to realizing the target molecule are shown in Table 3. The precursors THF and dichloromethane are a polar solvent and a non-flammable chlorinated solvent, respectively. Triethylamine is an organic base, and triphenylphosphine is a nucleophilic catalyst. The compound 1-(2-azidoethyl)-4-methoxy-2-methylbenzene is an azide-functionalized aromatic compound. The synthesis of O=C(NCCc1ccc(O)cc1C)Cc2cccc3ccccc32 occurs in three steps, as shown in Figure 5. The first step converts the azide, 1-(2-azidoethyl)-4-methoxy-2-methylbenzene, into the amine, 2-(4-methoxy-2-methylphenyl)ethan-1-amine, in the presence of triphenylphosphine (PPh3). Reactions of this type are called Staudinger reactions [15]: organic azides react with phosphines to produce an iminophosphorane, which is then hydrolyzed to the amine. The second step involves deprotonation of the amine group in 2-(4-methoxy-2-methylphenyl)ethan-1-amine, which is enhanced by using triethylamine as a base. The reactant (1-naphthoyl chloride) loses its chloride, and the two starting materials react to form N-(4-methoxy-2-methylphenethyl)-2-(naphthalen-2-yl)acetamide and hydrogen chloride (HCl). The last step entails demethylation with boron tribromide (BBr3) to form the phenol N-(4-hydroxy-2-methylphenethyl)-2-(naphthalen-2-yl)acetamide, and the reaction is carried out at −78 °C to avoid side reactions that may occur when using BBr3.
2.3. Retrosynthetic Feasibility Analysis of Compound C
The principal synthesis precursors to realizing the target molecule are shown in Table 4. The precursor 2-hydroxy-5-(3-(4-hydroxyphenyl)propyl)benzaldehyde is a phenolic aldehyde with antioxidant and potential bioactive properties. Ammonia and sulfuric acid are reactants in the synthesis route, while sodium chlorite acts as a strong oxidizing agent. Methanol and ethanol are solvents in the reactions. The synthesis of Oc1ccc(CCCc2ccc(O)cc2)cc1C(=O)N occurs in three steps, as shown in Figure 6. Reaction step 1 is the reaction between 2-hydroxy-5-(3-(4-hydroxyphenyl)propyl)benzaldehyde and sodium chlorite (NaClO2), which typically results in the oxidation of the aldehyde group (-CHO) to a carboxylic acid group (-COOH). Reaction step 2 is the reaction between 2-hydroxy-5-(3-(4-hydroxyphenyl)propyl)benzoic acid, sulfuric acid, and methanol, which is a classic esterification reaction. This process is often referred to as Fischer esterification [16], in which the carboxylic acid group (-COOH) reacts with methanol in the presence of a strong acid catalyst (sulfuric acid) to form the acid’s methyl ester. Reaction step 3 entails the nucleophilic substitution of the ethoxy group of the ester group with an amine group, resulting in the formation of the amide and, consequently, the target molecule. However, for the AI-predicted synthesis pathway presented above, substituting an ethoxy group with an amine is not straightforward because the ethoxy group is a poor leaving group under normal conditions [17]. Therefore, an alternative reaction pathway is presented in Figure 7.
Reaction step 1 (Figure 7a) is the same as reaction step 1 in Figure 6a. Reaction step 2 (Figure 7b) entails converting the carboxylic acid group to the corresponding acyl chloride, in which chloride is a much better leaving group, using thionyl chloride (SOCl2), which is readily available. Reaction step 3 entails reacting the acyl chloride with ammonia (NH3) in methanol at elevated temperature (reflux). As shown in Figure 7c, the ammonia acts as a nucleophile and replaces the chloride with an amino group via nucleophilic substitution.
2.4. Retrosynthetic Feasibility Analysis of Compound D
The principal synthesis precursors to realizing the target molecule are shown in Table 5. The precursors N-(3-amino-4-methoxyphenyl)acetamide and vinyl acetate chloride are used as reactants in the synthesis process, while dichloromethane and triethylamine serve as the solvent and the base or catalyst, respectively. Reaction step 1, shown in Figure 8a, involves acylation of the amino group on the aromatic ring of the acetamide derivative and is expected to proceed via nucleophilic acyl substitution: the lone pair of the amine (–NH2) attacks the electrophilic carbonyl carbon of vinyl acetate chloride, the chloride ion (Cl−) acts as the leaving group, and an amide bond forms between the amine and the vinyl acetate chloride. Reaction step 2, shown in Figure 8b, involves the demethylation of the methoxy group (–OCH3) on the aromatic ring using boron tribromide, yielding a hydroxyl group (–OH). Boron tribromide is a strong Lewis acid commonly used to cleave methyl ethers in organic compounds [18]. The reaction is carried out at low temperatures (e.g., −78 °C to 0 °C) to minimize side reactions and ensure selective demethylation. Reaction step 3, shown in Figure 8c, entails the conversion of the acrylamide double bond to form the corresponding butyramide via hydrogenation of the acrylamide derivative in the presence of a hydrogenation catalyst (Pd catalyst), with methanol acting as a hydrogen source.
2.5. The CI of Synthesis vs. Steps Analysis
In Figure 9, the CI of synthesis vs. steps graph for the retrosynthesis of each of the compounds is presented to visualize the progression of the retrosynthetic analysis over the steps to the target, Oc1ccc(cc1CCCC)CC(=O)N. It can be observed that compound A requires only two steps, with an overall CI of 0.946, which makes it stand out as the most synthesizable due to its high overall confidence and fewer synthesis steps. The overall CIs for compounds C and D are close, at 0.887 and 0.885, respectively. The overall CI for compound B is the lowest at 0.861, which makes it the least synthesizable of the four compounds. From Figure 9, it can be seen that step 2 is the significant bottleneck that affects the synthesizability of compound B. To improve the synthesizability of compound B, alternative reaction types that achieve the same transformation with easier reactions could be considered.
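The comparison above can be reproduced as a simple ranking. The overall CI values and step counts below are those reported in this section; the ordering rule (higher overall CI first, fewer steps as a tie-breaker) and the names `routes` and `ranked` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (overall CI, number of synthesis steps) as reported in this section.
routes: Dict[str, Tuple[float, int]] = {
    "A": (0.946, 2),
    "B": (0.861, 3),
    "C": (0.887, 3),
    "D": (0.885, 3),
}

# Assumed ordering rule: higher overall CI first, then fewer steps.
ranked: List[str] = sorted(routes, key=lambda name: (-routes[name][0], routes[name][1]))

for position, name in enumerate(ranked, start=1):
    ci, steps = routes[name]
    print(f"{position}. Compound {name}: overall CI = {ci:.3f}, steps = {steps}")
```

Under this rule, the order is A, C, D, B, which matches the qualitative reading of Figure 9 given in the text.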
3. Review of Relevant Literature
The challenge of synthesizing new drug molecules is widely recognized in medicinal chemistry and has led researchers to explore strategies for efficient synthetic routes. In [10], a quantitative method was introduced to estimate the synthetic accessibility (SA) score of drug-like molecules. This approach leverages a large database of known synthetic fragments to assess the ease with which a molecule can be synthesized. Although SA scoring provides numerical values that assess the difficulty of synthesizing a molecule, it may not fully account for reaction feasibility or retrosynthetic pathways [11]. Therefore, a retrosynthesis method is also required.
Retrosynthetic analysis is a systematic process of breaking down a target molecule into simpler precursors to devise a synthetic route. In [19], an overview of retrosynthetic analysis is provided, explaining its conceptual framework and application in organic synthesis. The use of artificial intelligence in retrosynthetic analysis is explored in [20,21,22]. In [20], a large database of reaction patterns is used to train a neural network to predict plausible reaction pathways. A review of AI-driven retrosynthesis tools is presented in [21]. In [22], an open-source retrosynthetic planning tool is introduced that combines a Monte Carlo algorithm with a neural network. This tool explores synthetic routes by prioritizing high-probability reactions, enabling rapid pathway generation.
However, while AI-based retrosynthesis models perform considerably well in many cases, they face challenges in handling rare or novel reactions that are not well represented in the training data. The authors in [11] emphasized the need for integrated approaches that combine SA scores with retrosynthetic tools to improve predictive reliability. Hence, the method proposed in this paper combines SA scores with retrosynthetic tools to improve molecular synthesis prediction.