DerivaPredict: A User-Friendly Tool for Predicting and Evaluating Active Derivatives of Natural Products

Yu Song; Meng Zhang; Sihao Chang; Ganghui Chu; Hongchao Ji

doi:10.3390/molecules30081683

¹ Laboratory of Xinjiang Native Medicinal and Edible Plant Resource Chemistry, College of Chemistry and Environmental Science, Kashi University, Kashi 844006, China

² Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Molecules 2025, 30(8), 1683; https://doi.org/10.3390/molecules30081683

This article belongs to the Section Computational and Theoretical Chemistry

Abstract

While natural products and their derivatives have been crucial in drug discovery, current databases are limited to known compounds. Tools that automatically generate and assess novel derivatives of natural products are needed to enhance early-stage drug discovery. We present DerivaPredict (v1.0), a user-friendly tool that generates novel natural product derivatives through chemical and metabolic transformations. It predicts binding affinities using pretrained deep learning models and assesses drug-likeness via ADMET profiling. DerivaPredict is freely accessible, with its source code available on GitHub.

Keywords:

natural product derivatives; in silico molecular design; software engineering

1. Introduction

Natural products have long been a cornerstone of drug discovery and development, offering an unparalleled diversity of chemical structures and biological activities. From traditional herbal remedies to modern pharmaceuticals, natural products have played a critical role in addressing human health challenges [1]. Iconic examples include paclitaxel (Taxol), a widely used chemotherapy agent derived from the Pacific yew tree, and digoxin, a cardiac glycoside sourced from the foxglove plant. These compounds underscore the immense therapeutic potential of natural products in treating diseases. Over the years, advancements in extraction techniques, synthetic chemistry, and biotechnological methods have expanded the scope of natural products research, allowing scientists to modify existing compounds and create derivatives with improved efficacy, safety, and pharmacokinetic profiles [2,3]. Structural modifications based on natural product scaffolds can lead to novel bioactive compounds, offering a rational approach to expanding chemical space and identifying new drug candidates. However, despite these technological advancements, the process of discovering and optimizing new drug candidates remains a formidable challenge.

Databases of natural products and their derivatives, such as SuperNatural [4], NPAtlas [5], and COCONUT [6], have provided the research community with molecular structures for drug screening. They are constructed by aggregating data from various sources, including experimental studies, literature reviews, and cheminformatics pipelines. Despite their comprehensive construction and utility, these databases are inherently limited to already-cataloged molecules. They reflect the extent of current knowledge and do not address the need for generating novel chemical entities.

Recently, computer-aided drug design has become a crucial component of modern drug discovery efforts, particularly in the development of bioactive compounds [7,8]. Notably, de novo drug design, including molecular generation, has gained significant attention for its ability to accelerate the discovery and optimization of novel ligands with desirable therapeutic properties [9,10]. Various machine learning frameworks and models have shown promising and efficient performance in drug-like molecule generation [11]. For example, sequence-based recurrent neural networks (RNNs) generate molecular structures as SMILES strings by learning the patterns of existing molecules [12,13,14,15,16]. Variational autoencoders (VAEs) encode molecules into a continuous latent space, facilitating the smooth interpolation and modification of chemical structures [17,18,19,20,21]. Generative adversarial networks (GANs) employ a generator–discriminator framework to produce novel molecular structures that resemble real compounds [22,23,24,25,26]. Large language models (LLMs) utilize natural language processing capabilities to design and generate molecular structures based on textual prompts [27,28,29]. While these generative approaches have proven to be powerful tools for creating novel molecular structures, they are not specifically designed to produce derivatives of natural products. Furthermore, they typically operate without explicit consideration of the biochemical transformation rules and reaction mechanisms that underlie the natural derivation of compounds, limiting their applicability in generating bioactive derivatives with realistic modifications.

To address this limitation, we introduce DerivaPredict, a computational framework specifically designed to facilitate the prediction and evaluation of natural product derivatives. Unlike generative models that create molecules without constraints, DerivaPredict incorporates curated chemical and biological reaction rules, ensuring that the generated derivatives align with the known reaction templates of biosynthetic and enzymatic transformations. It is important to note that the purpose of this work is not to predict natural product derivatives that may exist in the real world but rather to provide potential chemical structures for drug screening, which aligns with the goal of molecular generation algorithms. The key distinction is that it aims to expand the chemical space based on the existing natural product scaffolds and known reaction rules, which is beneficial for generating molecules with a lower synthetic complexity. Therefore, it can be used as a complementary solution for molecular generation tools for drug design.

Beyond structure generation, DerivaPredict integrates computational pipelines to evaluate the drug potential of the generated derivatives. It employs state-of-the-art machine learning models to predict binding affinities, providing insights into the likelihood of interaction with specific biological targets. Additionally, it conducts ADMET (absorption, distribution, metabolism, excretion, and toxicity) profiling to assess the pharmacokinetic and safety profiles of the compounds. It is important to emphasize that the primary purpose of DerivaPredict is to generate candidate derivative structures for drug screening. While the predicted results offer preliminary insights into the potential efficacy and safety of the compounds, they do not guarantee performance in in vivo activity experiments. Additional proteomic and cellular experiments are necessary [30,31,32].

2. Results

2.1. Software Integration

DerivaPredict is an open-source, user-friendly software tool designed to streamline natural product-based drug discovery. It features a graphical user interface (GUI) built using the Qt framework. The software is designed to cater to both researchers with minimal computational expertise and advanced users who require flexibility and customizability.

The software architecture, as illustrated in Figure 1, adopts a modular design where the front end and back end are fully independent. This separation ensures that updates or modifications to one component do not disrupt the other, enhancing the system’s reliability and maintainability. The front end provides a clean, interactive interface for users to input data, configure settings, and view results, while the back end performs the computational heavy lifting, including reaction rule applications, derivative generation, and molecular property predictions.

Figure 1. Overview of software integration. (A) Software architecture with front-end and back-end separation. (B) Graphical user interface screenshot of DerivaPredict software.

The modular structure of DerivaPredict also facilitates scalability and extensibility. Developers can easily integrate additional databases by replacing existing resource files without the need for extensive reengineering. This flexibility makes DerivaPredict a platform for ongoing development and collaboration, encouraging contributions from the research community to enhance its functionality further.

2.2. Software Functionalities

The GUI of DerivaPredict is shown in Figure 2. Researchers can start by inputting a natural product structure as the initial substrate. This can be accomplished in several ways: users can directly input a SMILES string, draw the structure interactively using the embedded molecular editor, or upload a file containing multiple SMILES strings, with each line representing a unique structure. The tool includes a built-in structure viewer, allowing users to visualize the input molecules interactively and verify their accuracy before proceeding.
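The multi-structure input described above can be pictured with a short sketch. This is not DerivaPredict's own loader: the function name `read_smiles_lines` and the choice to ignore anything after the first whitespace-separated token are assumptions made for illustration, and no chemical validation is performed here.

```python
from typing import Iterable, List


def read_smiles_lines(lines: Iterable[str]) -> List[str]:
    """Return unique, non-empty SMILES strings in their original order."""
    seen = set()
    smiles: List[str] = []
    for raw in lines:
        # Keep only the first token on the line; a trailing name column
        # (an assumption about the file layout) is ignored.
        tokens = raw.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0] not in seen:
            seen.add(tokens[0])
            smiles.append(tokens[0])
    return smiles


# Lines as they would come from iterating over an open text file.
example = ["CCO\n", "\n", "c1ccccc1 benzene\n", "CCO\n"]
print(read_smiles_lines(example))
```

In practice the same function can be called on an open file handle, since iterating over a text file yields one line at a time.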

Figure 2. Workflow and functions of DerivaPredict software.

In addition to the substrate, users can specify one or more target proteins for the prediction workflow. DerivaPredict supports multi-protein input, making it suitable for projects involving polypharmacology or multitarget drug discovery. To simplify the process, users can input the gene names of the target proteins instead of their amino acid sequences. The software automatically retrieves the corresponding protein sequences from UniProt, ensuring accuracy and eliminating the need for manual sequence curation.
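A gene-name lookup of this kind could be sketched against the public UniProt REST search endpoint as follows. The source does not describe how DerivaPredict issues its query, so the helper names, the restriction to reviewed entries, and the default organism (human, taxonomy ID 9606) are all assumptions of this sketch.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_query_url(gene: str, organism_id: int = 9606) -> str:
    """Build a search URL for reviewed entries matching an exact gene name."""
    query = f"gene_exact:{gene} AND organism_id:{organism_id} AND reviewed:true"
    params = {"query": query, "format": "fasta", "size": 1}
    return UNIPROT_SEARCH + "?" + urlencode(params)


def parse_first_fasta(text: str) -> str:
    """Return the sequence of the first FASTA record, without its header."""
    seq_lines = []
    for line in text.splitlines():
        if line.startswith(">"):
            if seq_lines:
                break  # a second record starts here; stop reading
            continue   # header of the first record
        seq_lines.append(line.strip())
    return "".join(seq_lines)


def fetch_sequence(gene: str) -> str:
    """Download the first matching sequence (requires network access)."""
    with urlopen(build_query_url(gene), timeout=30) as resp:
        return parse_first_fasta(resp.read().decode("utf-8"))
```

Calling `fetch_sequence("EGFR")` would return the amino acid sequence as a single string. A production lookup would also need to handle gene names that match no entry or several entries.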

The parameter-setting panel within DerivaPredict offers users control over key aspects of the prediction process. Users can configure the method for derivative prediction, selecting from available chemical, biochemical, or metabolic transformation options. They can also specify the number of iterations for the derivative generation to explore the chemical space comprehensively. Additionally, users can choose the desired deep learning model for drug–target affinity (DTA) prediction, tailoring the computational pipeline to their specific research needs.

Once the parameters are set, users can run the workflow and save the results using the dedicated save button. The output includes comprehensive data, such as the structures of the generated derivatives, their predicted binding affinities, and detailed ADMET profiles. The results are stored in a user-friendly format, facilitating the downstream analysis and integration with experimental workflows.

2.3. Illustrative Examples

2.3.1. Structural Diversity Evaluation

As a case study, we selected curcumin and paclitaxel as the parent compounds. Using chemical, biochemical, and metabolic transformation rules, a total of 1299 and 1497 unique derivatives were generated, respectively. For the chemical and biochemical transformations, the parameters were configured to perform two iterations. For the metabolic transformations, we selected environmental microbial transformation rules, with the parameters set to perform three iterations. Each iteration was designed to predict up to 30 derivatives. These derivatives spanned a broad chemical space, as visualized using Morgan fingerprints projected with UMAP, where the derivatives exhibited significant structural variation. Moreover, the chemical spaces covered by the different types of transformations exhibited distinct characteristics.

We compared the structural similarity between the derivatives and corresponding substrates and visualized the distribution trends using frequency distribution histograms. The chemical and biochemical transformations typically yielded derivatives with higher similarity to the substrates, as these processes primarily involved the addition or modification of functional groups. This is largely due to the prevalence of such reaction rules in transformation databases. In contrast, the metabolic transformations generated derivatives with greater structural diversity, often introducing more significant changes such as ring closures, chain elongations, or bond breakage. These distinctions underscore the unique capabilities of each transformation approach in exploring diverse regions of chemical space.
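The similarity measure behind these histograms is the Tanimoto coefficient: the number of fingerprint bits two molecules share, divided by the number of bits set in either. The sketch below works on plain Python sets of "on" bit indices rather than real Morgan fingerprints. The bit sets are made-up values for illustration, and returning 1.0 for two empty fingerprints is a convention chosen here.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of on-bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen for this sketch
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical on-bits for a parent compound and one derivative.
parent = {1, 5, 9, 12}
derivative = {1, 5, 9, 20}
print(tanimoto(parent, derivative))  # 3 shared bits out of 5 distinct bits
```

A derivative that differs from its parent by a single functional group changes only a few bits and scores close to 1, whereas a ring closure or bond cleavage alters many atom environments at once and lowers the score.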

Considering synthetic complexity, the SCScore values for the derivatives of paclitaxel were generally higher than those for curcumin. This is likely due to the inherently more complex chemical structure of paclitaxel. For derivatives of the same substrate, the distributions of SCScore values for the chemical and biochemical transformations were relatively similar, while those for the metabolic transformations were more dispersed, reflecting greater structural variability (Figure 3).

Figure 3. Visualization and analysis of generated derivatives. The UMAP projection illustrates the distribution of the derivatives in chemical space, highlighting the structural diversity (left). The Tanimoto similarity histograms depict the structural similarity between the derivatives and their parent compounds, showing the impact of the different transformation types (middle). The SCScore distributions compare the synthetic complexity of the derivatives, indicating variations based on the transformation methods and parent compound structures (right).

2.3.2. Pharmacological Activity Prediction

Previous studies have demonstrated the potential inhibitory activity of curcumin on the epidermal growth factor receptor (EGFR) [33,34]. As a demonstration, we used DerivaPredict to explore the pharmacological potential of the generated curcumin derivatives by predicting their affinity for the EGFR with the pretrained CNN model [35]. The goal was to showcase the software’s ability to predict the binding potential of novel curcumin derivatives. While the algorithms and models used are derived from published works, benchmarking the accuracy of these predictions is beyond the scope of this study.

Among the derivatives, 737 were predicted to have an IC50 lower than that of curcumin. Two example structures are shown in Figure 4, along with their predicted IC50 values and docking scores for the EGFR, compared with those of curcumin. To validate these predictions, we employed AutoDock Vina to calculate the docking scores based on the three-dimensional structure of the EGFR (PDB ID: 1M17). We compared the binding modes of curcumin and the derivative with the highest predicted binding affinity in complex with the EGFR. Molecular docking simulations revealed that the derivative binds the EGFR in a manner similar to curcumin. However, the derivative exhibited tighter binding compared to curcumin, likely due to additional favorable interactions such as stronger hydrogen bonding and hydrophobic interactions.

Figure 4. Structures, docking scores, and predicted IC50 of curcumin and its derivatives.

Notably, Derivative 1 (CHEMBL103410) is registered in the ChEMBL database, with the Max Phase listed as “Preclinical”. Although no experimental data specifically targeting the EGFR are available, prior in silico studies have validated that this compound has a higher binding affinity to the EGFR compared to curcumin. Additionally, the contact sites of the EGFR for this derivative align closely with our study, providing further evidence of its potential as an EGFR inhibitor [33,34].

Additionally, we evaluated the ADMET properties of curcumin and its derivatives using the integrated ADMET prediction function in DerivaPredict (Table 1). The logP values indicate that Derivative 1 is more lipophilic than curcumin, which could affect its bioavailability and membrane permeability. Both Derivative 1 and Derivative 2 show good ADMET profiles, with favorable QED values and bioavailability scores. Furthermore, the derivatives exhibit promising results in terms of the potential interactions with various cytochrome P450 enzymes (CYP1A2, CYP2C19, CYP3A4), suggesting a reasonable metabolic stability. Their predicted solubility and permeability (PAMPA and Caco2 scores) also indicate a good absorption potential. These findings suggest that some of the derivatives may exhibit enhanced stability and potency as EGFR inhibitors.

Table 1. Predicted ADMET properties of curcumin and its derivatives.

3. Materials and Methods

3.1. Extraction of Reaction Templates

DerivaPredict applies various types of in silico chemical, biochemical, or metabolic transformations to the substrate based on user-defined parameters. This process generates a range of potential derivative structures for the given substrate. The chemical transformations are based on reaction rule templates extracted from 50,000 organic chemical reactions in the patent literature [36]. The biochemical transformations, in turn, are based on the reaction rule templates from 95,000 enzymatic reactions sourced from databases, including MetaCyc [37], KEGG [38], SEED [39], Rhea [40], and BiGG [41], as integrated by Zheng et al. [42]. Since the enzymatic reactions are not atom-mapped, RXNMapper [43], a neural network-based automated atom mapping model, was employed to generate an atom-mapped dataset. Reactions containing the wildcard ‘R’ token in their SMILES strings were excluded due to RXNMapper’s canonicalization process.

The metabolic transformations utilize the BioTransformer 3.0 module [44], which includes predictions for eight types of metabolic transformations: promiscuous enzymatic (EC) reactions, environmental microbial transformations, Phase I reactions (cytochrome P450), Phase II reactions, human gut microbial reactions, and various combinations of the above. These integrations enable the derivation of natural products through accessible chemical, enzymatic, microbial, and human metabolic transformations.

3.2. Generation of Potential Derivatives

To generate potential derivatives from natural products, DerivaPredict applies the reaction templates extracted in the previous step to the substrate using the RDKit library, a powerful cheminformatics toolkit. RDKit is employed to efficiently apply the applicable reaction rules, transforming the input natural product into a diverse set of potential derivatives based on the defined transformation templates.

The reaction templates, which include both chemical and biochemical transformations, are mapped to the substrate using RDKit’s reaction engine. This engine performs the transformations by breaking down the molecular structures into their component parts and applying the corresponding reaction rules. For example, if a chemical transformation rule suggests hydroxylation at a specific position, RDKit will modify the substrate molecule accordingly, creating new potential derivatives. The process ensures that only practicable reactions—those that can realistically occur given the structural constraints of the input molecule—are applied. DerivaPredict can also incorporate metabolic transformations by leveraging the BioTransformer 3.0 JDK module, which extends the tool’s capability to predict how the substrate might be modified through human metabolic processes or microbial reactions.

The in silico transformation can be iterated 1–3 times. Through this multi-step process, DerivaPredict generates a broad spectrum of potential derivatives, offering users a rich set of chemical entities for further evaluation in drug discovery workflows. As the number of iterations increases, the number of structures obtained increases exponentially, at the cost of consuming more time.
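The iterative expansion can be sketched as a breadth-first loop in which the products of one round become the substrates of the next. The code below is a toy under stated assumptions: the "rules" are plain string operations standing in for real reaction templates, the names `expand` and `max_per_iteration` are hypothetical, and applying the cap of 30 to each whole round is only one possible reading of the per-iteration limit mentioned in the case study.

```python
from typing import Callable, Iterable, List, Set

Rule = Callable[[str], Iterable[str]]


def expand(substrate: str, rules: List[Rule], iterations: int = 2,
           max_per_iteration: int = 30) -> Set[str]:
    """Apply every rule to every molecule of the previous round."""
    if not 1 <= iterations <= 3:
        raise ValueError("iterations must be between 1 and 3")
    seen = {substrate}
    frontier = [substrate]
    for _ in range(iterations):
        candidates: List[str] = []
        for mol in frontier:
            for rule in rules:
                for product in rule(mol):
                    # Discard structures already produced earlier.
                    if product not in seen and product not in candidates:
                        candidates.append(product)
        frontier = candidates[:max_per_iteration]  # assumed per-round cap
        seen.update(frontier)
    return seen - {substrate}


# Toy rules: append one atom symbol to a SMILES-like string.
toy_rules: List[Rule] = [lambda s: [s + "O"], lambda s: [s + "C"]]
print(sorted(expand("C", toy_rules, iterations=1)))
print(len(expand("C", toy_rules, iterations=2)))
```

With two rules and no duplicates, each round doubles the number of new structures, which illustrates why both the output size and the run time grow quickly with the iteration count.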

3.3. Prediction of Molecular Properties

DerivaPredict integrates a suite of advanced tools to predict and evaluate the molecular properties of the derivatives it generates, enabling comprehensive assessments of their potential as drug candidates.

To assess synthetic complexity, DerivaPredict employs the SCScore algorithm [45], which provides a quantitative measure of the ease or difficulty of synthesizing a compound. This metric helps users prioritize derivatives that are not only chemically novel but also synthetically feasible, an essential consideration in the early stages of drug development. To evaluate drug-likeness, DerivaPredict uses the quantitative estimate of drug-likeness (QED) metric. QED provides a composite score that reflects how closely a compound aligns with known drug-like properties, considering factors such as molecular weight, lipophilicity, and hydrogen bonding.
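For orientation, the composite nature of QED can be written out. In its commonly used formulation, each of the n = 8 molecular properties (molecular weight, logP, hydrogen-bond donors and acceptors, polar surface area, rotatable bonds, aromatic rings, and structural alerts) is mapped to a desirability value d_i between 0 and 1, and the score is their geometric mean, optionally weighted by w_i. Which weighting DerivaPredict reports is not stated in the text, so the formulas below are given as general background only.

```latex
\mathrm{QED} = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\ln d_i\right),
\qquad
\mathrm{QED}_w = \exp\!\left(\frac{\sum_{i=1}^{n} w_i \ln d_i}{\sum_{i=1}^{n} w_i}\right)
```

Because the mean is geometric, a single very unfavorable property (d_i close to 0) pulls the whole score down even when the other properties are drug-like.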

For a detailed evaluation of drug-like properties, DerivaPredict incorporates the ADMET-AI package [46]. This powerful tool predicts a comprehensive set of ADMET (absorption, distribution, metabolism, excretion, and toxicity) profiles, including six primary ADMET classes and 91 specific properties. These predictions encompass critical factors such as bioavailability, blood–brain barrier permeability, metabolic stability, and toxicity risks, providing a holistic view of each derivative’s pharmacokinetic and safety profiles.

3.4. Prediction of Binding Affinity with Specific Targets

Deep learning-based drug–target affinity (DTA) [47,48,49,50,51,52] and quantitative structure–activity relationship (QSAR) [53,54,55,56] prediction have been introduced by various studies. DerivaPredict incorporates several state-of-the-art deep learning models to predict the binding affinity of generated derivative structures against user-defined target proteins. These models leverage advanced architectures, such as convolutional neural networks (CNNs) and graph neural networks (GNNs), to provide accurate predictions of the half-maximal inhibitory concentration (IC50), a critical metric for evaluating the potency of potential drug candidates.

The predictive models used in DerivaPredict are pretrained using the DeepPurpose package with BindingDB datasets [35], a comprehensive resource containing experimentally validated binding affinities for a wide range of small molecules and protein targets. This pretraining ensures that the models are well-optimized for analyzing diverse chemical structures and biological targets, enabling reliable predictions even for novel derivatives. This functionality not only accelerates the screening process but also empowers users to make data-driven decisions in prioritizing compounds for further investigation.
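Affinity models of this kind are often trained on log-scaled values, so converting between an IC50 in nanomolar and its negative logarithm (pIC50) is useful when comparing predictions. The sketch below shows only the standard definitions. Whether DerivaPredict reports IC50 in nM or on a log scale is not stated here, and the function names are illustrative.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def pic50_to_ic50_nm(pic50: float) -> float:
    """Inverse conversion: IC50 in nM = 10 ** (9 - pIC50)."""
    return 10.0 ** (9.0 - pic50)


# A 1 µM (1000 nM) inhibitor corresponds to a pIC50 of 6.
print(ic50_nm_to_pic50(1000.0))
print(pic50_to_ic50_nm(7.0))
```

On this scale a lower IC50 means a higher pIC50, so a derivative described as having an IC50 lower than that of its parent is the one with the larger pIC50.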

3.5. User-Defined Parameters and Initial Settings

DerivaPredict provides a variety of user-configurable parameters to customize the molecular generation and evaluation processes. Users can define the number of transformation steps, with a default setting of two, which is also applied in the case study of this work, and an adjustable range between one and three. They can also select reaction types, including chemical, biochemical, and metabolic transformations. For the binding affinity assessments, DerivaPredict enables users to input target proteins using a UniProt ID or gene name, allowing for precise molecular designs.
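The constraints listed above (three transformation types, one to three steps, a default of two) can be captured in a small validated settings object. This is a hypothetical sketch rather than DerivaPredict's internal representation: the class name `RunSettings` and all of its field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_METHODS = ("chemical", "biochemical", "metabolic")


@dataclass
class RunSettings:
    substrate_smiles: str
    method: str                                        # one of ALLOWED_METHODS
    targets: List[str] = field(default_factory=list)   # UniProt IDs or gene names
    iterations: int = 2                                # default stated in the text

    def __post_init__(self) -> None:
        if self.method not in ALLOWED_METHODS:
            raise ValueError(f"method must be one of {ALLOWED_METHODS}")
        if not 1 <= self.iterations <= 3:
            raise ValueError("iterations must be between 1 and 3")


settings = RunSettings("CCO", "metabolic", targets=["EGFR"], iterations=3)
print(settings)
```

Validating in `__post_init__` means an out-of-range value is rejected when the settings are created, before any long-running generation step starts.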

3.6. Molecular Docking

The X-ray crystallography-based three-dimensional structure of the EGFR was obtained from the RCSB Protein Data Bank (PDB ID: 1M17; accessed on 26 September 2024) and used as a docking template throughout the calculations. The two-dimensional structures of curcumin and its derivatives were converted to 3D structures using the RDKit package (v2022.09.3) and energy-minimized using the MMFF method for compatibility with the docking operations. Molecular docking was performed using AutoDock Tools (1.5.7), with the docking center coordinates set at (23.568, 9.824, 59.369). The docking procedure was carried out independently three times, generating 30 distinct conformations. Finally, PyMOL (2.6.0) was used to visualize and further illustrate the binding modes obtained from the docking analysis.

4. Conclusions

In summary, DerivaPredict serves as a rational design engine that systematically generates candidate molecular structures for drug screening while prioritizing biologically relevant derivatives of natural products. Its core innovation lies in bridging two critical phases of drug discovery: (1) the rational generation of chemically feasible derivatives through biotransformation-aware algorithms and (2) the prioritization of screening candidates via integrated target affinity predictions and automated ADMET evaluation workflows.

By focusing on rule-compliant structural diversification, the tool addresses the challenge of expanding natural products’ chemical space beyond existing database limitations, providing medicinal chemists with pre-filtered compound libraries that balance novelty and drug-likeness. Crucially, DerivaPredict operates as a hypothesis generator—its machine learning models identify high-potential derivatives for experimental validation rather than claiming to replace wet-lab studies. By offering a user-friendly interface and automated workflows, DerivaPredict empowers researchers to generate and evaluate derivative structures without the need for extensive computational expertise, accelerating the discovery of bioactive compounds in natural products research.

Author Contributions

Conceptualization, G.C. and H.J.; methodology, software, validation, formal analysis, resources, and data curation, Y.S., M.Z. and S.C.; writing—original draft preparation, Y.S. and H.J.; writing—review and editing, H.J.; supervision, project administration, and funding acquisition, G.C. and H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Open Project of the Laboratory of Xinjiang Native Medicinal and Edible Plant Resources Chemistry (KSUZDSYS202304) and the National Natural Science Foundation of China (Grant No. 3247040263).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

DerivaPredict is written in Python 3.10 with PyQt5 and is stand-alone, cross-platform software. The manual, tutorial videos, and example data can be found in our GitHub repository (https://github.com/hcji/DerivaPredict, accessed on 4 January 2025).

Acknowledgments

We acknowledge start-up funding from Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture. We also acknowledge ChatGPT 4o for assisting with language polishing.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Harvey, A.L.; Edrada-Ebel, R.; Quinn, R.J. The Re-Emergence of Natural Products for Drug Discovery in the Genomics Era. Nat. Rev. Drug Discov. 2015, 14, 111–129. [Google Scholar] [CrossRef] [PubMed]
Boström, J.; Brown, D.G.; Young, R.J.; Keserü, G.M. Expanding the Medicinal Chemistry Synthetic Toolbox. Nat. Rev. Drug Discov. 2018, 17, 709–727. [Google Scholar] [CrossRef] [PubMed]
Hadadi, N.; Hatzimanikatis, V. Design of Computational Retrobiosynthesis Tools for the Design of de Novo Synthetic Pathways. Curr. Opin. Chem. Biol. 2015, 28, 99–104. [Google Scholar] [CrossRef]
Gallo, K.; Kemmler, E.; Goede, A.; Becker, F.; Dunkel, M.; Preissner, R.; Banerjee, P. SuperNatural 3.0—A Database of Natural Products and Natural Product-Based Derivatives. Nucleic Acids Res. 2023, 51, D654–D659. [Google Scholar] [CrossRef] [PubMed]
Poynton, E.F.; van Santen, J.A.; Pin, M.; Contreras, M.M.; McMann, E.; Parra, J.; Showalter, B.; Zaroubi, L.; Duncan, K.R.; Linington, R.G. The Natural Products Atlas 3.0: Extending the Database of Microbially Derived Natural Products. Nucleic Acids Res. 2025, 53, D691–D699. [Google Scholar] [CrossRef]
Sorokina, M.; Merseburger, P.; Rajan, K.; Yirik, M.A.; Steinbeck, C. COCONUT Online: Collection of Open Natural Products Database. J. Cheminf. 2021, 13, 2. [Google Scholar] [CrossRef]
Fleming, N. How Artificial Intelligence Is Changing Drug Discovery. Nature 2018, 557, S55–S57. [Google Scholar] [CrossRef]
Schneider, P.; Walters, W.P.; Plowright, A.T.; Sieroka, N.; Listgarten, J.; Goodnow, R.A.; Fisher, J.; Jansen, J.M.; Duca, J.S.; Rush, T.S.; et al. Rethinking Drug Design in the Artificial Intelligence Era. Nat. Rev. Drug Discov. 2020, 19, 353–364. [Google Scholar] [CrossRef]
Zeng, X.; Wang, F.; Luo, Y.; Kang, S.; Tang, J.; Lightstone, F.C.; Fang, E.F.; Cornell, W.; Nussinov, R.; Cheng, F. Deep Generative Molecular Design Reshapes Drug Discovery. Cell Rep. Med. 2022, 3, 100794. [Google Scholar] [CrossRef]
Tong, X.; Liu, X.; Tan, X.; Li, X.; Jiang, J.; Xiong, Z.; Xu, T.; Jiang, H.; Qiao, N.; Zheng, M. Generative Models for De Novo Drug Design. J. Med. Chem. 2021, 64, 14011–14027. [Google Scholar] [CrossRef]
Pang, C.; Qiao, J.; Zeng, X.; Zou, Q.; Wei, L. Deep Generative Models in De Novo Drug Molecule Generation. J. Chem. Inf. Model. 2024, 64, 2174–2194. [Google Scholar] [CrossRef] [PubMed]
Pogány, P.; Arad, N.; Genway, S.; Pickett, S.D. De Novo Molecule Design by Translating from Reduced Graphs to SMILES. J. Chem. Inf. Model. 2019, 59, 1136–1146. [Google Scholar] [CrossRef] [PubMed]
Grisoni, F.; Moret, M.; Lingwood, R.; Schneider, G. Bidirectional Molecule Generation with Recurrent Neural Networks. J. Chem. Inf. Model. 2020, 60, 1175–1183. [Google Scholar] [CrossRef]
Stravs, M.A.; Dührkop, K.; Böcker, S.; Zamboni, N. MSNovelist: De Novo Structure Generation from Mass Spectra. Nat. Methods 2022, 19, 865–870. [Google Scholar] [CrossRef]
Li, C.; Wang, C.; Sun, M.; Zeng, Y.; Yuan, Y.; Gou, Q.; Wang, G.; Guo, Y.; Pu, X. Correlated RNN Framework to Quickly Generate Molecules with Desired Properties for Energetic Materials in the Low Data Regime. J. Chem. Inf. Model. 2022, 62, 4873–4887. [Google Scholar] [CrossRef] [PubMed]
Moret, M.; Friedrich, L.; Grisoni, F.; Merk, D.; Schneider, G. Generative Molecular Design in Low Data Regimes. Nat. Mach. Intell. 2020, 2, 171–180. [Google Scholar] [CrossRef]
Colby, S.M.; Nuñez, J.R.; Hodas, N.O.; Corley, C.D.; Renslow, R.R. Deep Learning to Generate in Silico Chemical Property Libraries and Candidate Molecules for Small Molecule Identification in Complex Samples. Anal. Chem. 2020, 92, 1720–1729. [Google Scholar] [CrossRef]
Joo, S.; Kim, M.S.; Yang, J.; Park, J. Generative Model for Proposing Drug Candidates Satisfying Anticancer Properties Using a Conditional Variational Autoencoder. ACS Omega 2020, 5, 18642–18650. [Google Scholar] [CrossRef]
Kotsias, P.-C.; Arús-Pous, J.; Chen, H.; Engkvist, O.; Tyrchan, C.; Bjerrum, E.J. Direct Steering of de Novo Molecular Generation with Descriptor Conditional Recurrent Neural Networks. Nat. Mach. Intell. 2020, 2, 254–265. [Google Scholar] [CrossRef]
Bagal, V.; Aggarwal, R.; Vinod, P.K.; Priyakumar, U.D. MolGPT: Molecular Generation Using a Transformer-Decoder Model. J. Chem. Inf. Model. 2022, 62, 2064–2076. [Google Scholar] [CrossRef]
Wang, S.; Song, T.; Zhang, S.; Jiang, M.; Wei, Z.; Li, Z. Molecular Substructure Tree Generative Model for de Novo Drug Design. Brief. Bioinform. 2022, 23, bbab592. [Google Scholar] [CrossRef] [PubMed]
Wan, C.; Jones, D.T. Improving Protein Function Prediction with Synthetic Feature Samples Created by Generative Adversarial Networks. Nat. Mach. Intell. 2019, 730143. [Google Scholar] [CrossRef]
Bian, Y.; Wang, J.; Jun, J.J.; Xie, X.-Q. Deep Convolutional Generative Adversarial Network (dcGAN) Models for Screening and Design of Small Molecules Targeting Cannabinoid Receptors. Mol. Pharm. 2019, 16, 4451–4460. [Google Scholar] [CrossRef]
Prykhodko, O.; Johansson, S.V.; Kotsias, P.-C.; Arús-Pous, J.; Bjerrum, E.J.; Engkvist, O.; Chen, H. A de Novo Molecular Generation Method Using Latent Vector Based Generative Adversarial Network. J. Cheminf. 2019, 11, 74. [Google Scholar] [CrossRef] [PubMed]
Sousa, T.; Correia, J.; Pereira, V.; Rocha, M. Generative Deep Learning for Targeted Compound Design. J. Chem. Inf. Model. 2021, 61, 5343–5361. [Google Scholar] [CrossRef]
Abbasi, M.; Santos, B.P.; Pereira, T.C.; Sofia, R.; Monteiro, N.R.C.; Simões, C.J.V.; Brito, R.M.M.; Ribeiro, B.; Oliveira, J.L.; Arrais, J.P. Designing Optimized Drug Candidates with Generative Adversarial Network. J. Cheminf. 2022, 14, 40.
Gu, Y.; Xu, Z.; Yang, C. Empowering Graph Neural Network-Based Computational Drug Repositioning with Large Language Model-Inferred Knowledge Representation. Interdiscip. Sci. Comput. Life Sci. 2024, 1–18.
Bran, A.M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A.D.; Schwaller, P. Augmenting Large Language Models with Chemistry Tools. Nat. Mach. Intell. 2024, 6, 525–535.
Wang, J.; Luo, H.; Qin, R.; Wang, M.; Wan, X.; Fang, M.; Zhang, O.; Gou, Q.; Su, Q.; Shen, C.; et al. 3DSMILES-GPT: 3D Molecular Pocket-Based Generation with Token-Only Large Language Model. Chem. Sci. 2025, 16, 637–648.
Dziekan, J.M.; Wirjanata, G.; Dai, L.; Go, K.D.; Yu, H.; Lim, Y.T.; Chen, L.; Wang, L.C.; Puspita, B.; Prabhu, N.; et al. Cellular Thermal Shift Assay for the Identification of Drug–Target Interactions in the Plasmodium falciparum Proteome. Nat. Protoc. 2020, 15, 1881–1921.
Ji, H.; Lu, X.; Zhao, S.; Wang, Q.; Liao, B.; Bauer, L.G.; Huber, K.V.M.; Luo, R.; Tian, R.; Tan, C.S.H. Target Deconvolution with Matrix-Augmented Pooling Strategy Reveals Cell-Specific Drug-Protein Interactions. Cell Chem. Biol. 2023, 30, 1478–1487.e7.
Lomenick, B.; Hao, R.; Jonai, N.; Chin, R.M.; Aghajan, M.; Warburton, S.; Wang, J.; Wu, R.P.; Gomez, F.; Loo, J.A.; et al. Target Identification Using Drug Affinity Responsive Target Stability (DARTS). Proc. Natl. Acad. Sci. USA 2009, 106, 21984–21989.
Liang, Y.; Zhao, J.; Zou, H.; Zhang, J.; Zhang, T. In Vitro and in Silico Evaluation of EGFR Targeting Activities of Curcumin and Its Derivatives. Food Funct. 2021, 12, 10667–10675.
Saeed, M.E.M.; Yücer, R.; Dawood, M.; Hegazy, M.-E.F.; Drif, A.; Ooko, E.; Kadioglu, O.; Seo, E.-J.; Kamounah, F.S.; Titinchi, S.J.; et al. In Silico and In Vitro Screening of 50 Curcumin Compounds as EGFR and NF-κB Inhibitors. Int. J. Mol. Sci. 2022, 23, 3966.
Huang, K.; Fu, T.; Glass, L.M.; Zitnik, M.; Xiao, C.; Sun, J. DeepPurpose: A Deep Learning Library for Drug–Target Interaction Prediction. Bioinformatics 2021, 36, 5545–5547.
Coley, C.W.; Rogers, L.; Green, W.H.; Jensen, K.F. Computer-Assisted Retrosynthesis Based on Molecular Similarity. ACS Cent. Sci. 2017, 3, 1237–1245.
Caspi, R.; Billington, R.; Keseler, I.M.; Kothari, A.; Krummenacker, M.; Midford, P.E.; Ong, W.K.; Paley, S.; Subhraveti, P.; Karp, P.D. The MetaCyc Database of Metabolic Pathways and Enzymes—A 2019 Update. Nucleic Acids Res. 2020, 48, D445–D453.
Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New Perspectives on Genomes, Pathways, Diseases and Drugs. Nucleic Acids Res. 2017, 45, D353–D361.
Overbeek, R.; Olson, R.; Pusch, G.D.; Olsen, G.J.; Davis, J.J.; Disz, T.; Edwards, R.A.; Gerdes, S.; Parrello, B.; Shukla, M.; et al. The SEED and the Rapid Annotation of Microbial Genomes Using Subsystems Technology (RAST). Nucleic Acids Res. 2014, 42, D206–D214.
Bansal, P.; Morgat, A.; Axelsen, K.B.; Muthukrishnan, V.; Coudert, E.; Aimo, L.; Hyka-Nouspikel, N.; Gasteiger, E.; Kerhornou, A.; Neto, T.B.; et al. Rhea, the Reaction Knowledgebase in 2022. Nucleic Acids Res. 2022, 50, D693–D700.
Norsigian, C.J.; Pusarla, N.; McConn, J.L.; Yurkovich, J.T.; Dräger, A.; Palsson, B.O.; King, Z. BiGG Models 2020: Multi-Strain Genome-Scale Models and Expansion across the Phylogenetic Tree. Nucleic Acids Res. 2020, 48, D402–D406.
Zheng, S.; Zeng, T.; Li, C.; Chen, B.; Coley, C.W.; Yang, Y.; Wu, R. Deep Learning Driven Biosynthetic Pathways Navigation for Natural Products with BioNavi-NP. Nat. Commun. 2022, 13, 3342.
Chen, B.; Li, H.; Huang, R.; Tang, Y.; Li, F. Deep Learning Prediction of Electrospray Ionization Tandem Mass Spectra of Chemically Derived Molecules. Nat. Commun. 2024, 15, 8396.
Wishart, D.S.; Tian, S.; Allen, D.; Oler, E.; Peters, H.; Lui, V.W.; Gautam, V.; Djoumbou-Feunang, Y.; Greiner, R.; Metz, T.O. BioTransformer 3.0—A Web Server for Accurately Predicting Metabolic Transformation Products. Nucleic Acids Res. 2022, 50, W115–W123.
Coley, C.W.; Rogers, L.; Green, W.H.; Jensen, K.F. SCScore: Synthetic Complexity Learned from a Reaction Corpus. J. Chem. Inf. Model. 2018, 58, 252–261.
Swanson, K.; Liu, G.; Catacutan, D.B.; Arnold, A.; Zou, J.; Stokes, J.M. Generative AI for Designing and Validating Easily Synthesizable and Structurally Novel Antibiotics. Nat. Mach. Intell. 2024, 6, 338–353.
Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: Deep Drug–Target Binding Affinity Prediction. Bioinformatics 2018, 34, i821–i829.
Lee, I.; Keum, J.; Nam, H. DeepConv-DTI: Prediction of Drug-Target Interactions via Deep Learning with Convolution on Protein Sequences. PLoS Comput. Biol. 2019, 15, e1007129.
Shaikh, F.; Tai, H.K.; Desai, N.; Siu, S.W.I. LigTMap: Ligand and Structure-Based Target Identification and Activity Prediction for Small Molecular Compounds. J. Cheminf. 2021, 13, 1–12.
Karimi, M.; Wu, D.; Wang, Z.; Shen, Y. DeepAffinity: Interpretable Deep Learning of Compound-Protein Affinity through Unified Recurrent and Convolutional Neural Networks. Bioinformatics 2019, 35, 3329–3338.
Li, S.; Wan, F.; Shu, H.; Jiang, T.; Zhao, D.; Zeng, J. MONN: A Multi-Objective Neural Network for Predicting Compound-Protein Interactions and Affinities. Cell Systems 2020, 10, 308–322.e11.
Watanabe, N.; Ohnuki, Y.; Sakakibara, Y. Deep Learning Integration of Molecular and Interactome Data for Protein–Compound Interaction Prediction. J. Cheminf. 2021, 13, 44.
Ding, Q.; Zu, S.; Hou, S.; Zhang, Y.; Li, S. VISAR: An Interactive Tool for Dissecting Chemical Features Learned by Deep Neural Network QSAR Models. Bioinformatics 2020, 36, 3610–3612.
Ji, H.; Deng, H.; Lu, H.; Zhang, Z. Predicting a Molecular Fingerprint from an Electron Ionization Mass Spectrum with Deep Neural Networks. Anal. Chem. 2020, 92, 8649–8653.
Mei, P.-C.; An, N.; Xiao, H.-M.; Chen, Y.-Y.; Zhu, Q.-F.; Feng, Y.-Q. QSAR-Guided Strategy for Accurate Annotation of FAHFA Regioisomers. Talanta 2025, 285, 127421.
Srour, A.M.; Ahmed, N.S.; Abd El-Karim, S.S.; Anwar, M.M.; El-Hallouty, S.M. Design, Synthesis, Biological Evaluation, QSAR Analysis and Molecular Modelling of New Thiazol-Benzimidazoles as EGFR Inhibitors. Bioorg. Med. Chem. 2020, 28, 115657.

Figure 1. Overview of software integration. (A) Software architecture with front-end and back-end separation. (B) Screenshot of the DerivaPredict graphical user interface.

Figure 2. Workflow and functions of the DerivaPredict software.

Figure 3. Visualization and analysis of generated derivatives. The UMAP projection illustrates the distribution of the derivatives in chemical space, highlighting their structural diversity (left). The Tanimoto similarity histograms depict the structural similarity between the derivatives and their parent compounds, showing the impact of the different transformation types (middle). The SCScore distributions compare the synthetic complexity of the derivatives, indicating variations that depend on the transformation method and the parent compound structure (right).

Figure 4. Structures, docking scores, and predicted IC50 of curcumin and its derivatives.

Table 1. Predicted ADMET properties of curcumin and its derivatives.

Property	Curcumin	Derivative 1	Derivative 2
MW	368.385	372.804	354.358
SCScore	1.804	2.109	2.212
logP	3.370	4.015	3.067
nHBA	6	5	6
nHBD	2	2	3
QED	0.548	0.566	0.401
TPSA	93.060	83.830	104.060
BBB	0.344	0.255	0.286
BS	0.463	0.578	0.414
CYP1A2 inhibition	0.439	0.665	0.420
CYP2C19 inhibition	0.582	0.703	0.506
CYP3A4 inhibition	0.751	0.771	0.697
PAMPA	0.739	0.734	0.663
Caco2	−5.171	−5.084	−5.283

MW: molecular weight; SCScore: synthetic complexity score; logP: logarithm of the octanol/water partition coefficient; nHBA: number of hydrogen bond acceptors; nHBD: number of hydrogen bond donors; QED: quantitative estimate of drug-likeness; TPSA: topological polar surface area; BBB: blood–brain barrier permeability; BS: bioavailability score; CYP1A2/CYP2C19/CYP3A4 inhibition: inhibition potential for the corresponding enzyme; PAMPA: parallel artificial membrane permeability assay; Caco2: Caco-2 cell permeability.
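
As one way to read the physicochemical rows of Table 1, the short sketch below checks the tabulated MW, logP, nHBA, and nHBD values against Lipinski's rule-of-five limits. This is an illustration only: the article does not state that DerivaPredict applies this filter, and the names `TABLE_1`, `RO5_LIMITS`, and `ro5_violations` are hypothetical helpers introduced here, not part of the tool.

```python
# Values copied from Table 1; dictionary keys are illustrative labels.
TABLE_1 = {
    "curcumin":     {"MW": 368.385, "logP": 3.370, "nHBA": 6, "nHBD": 2},
    "derivative_1": {"MW": 372.804, "logP": 4.015, "nHBA": 5, "nHBD": 2},
    "derivative_2": {"MW": 354.358, "logP": 3.067, "nHBA": 6, "nHBD": 3},
}

# Lipinski's rule-of-five upper bounds (assumed filter, not from the paper).
RO5_LIMITS = {"MW": 500.0, "logP": 5.0, "nHBA": 10, "nHBD": 5}


def ro5_violations(props):
    """Return the names of the properties that exceed their rule-of-five limit."""
    return [name for name, limit in RO5_LIMITS.items() if props[name] > limit]


if __name__ == "__main__":
    for compound, props in TABLE_1.items():
        violations = ro5_violations(props)
        print(f"{compound}: {len(violations)} violation(s) {violations}")
```

With the values in Table 1, none of the three compounds exceeds any of the four limits, so each one is reported with zero violations.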

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
