AI-Integrated QSAR Modeling for Enhanced Drug Discovery: From Classical Approaches to Deep Learning and Structural Insight

Koirala, Mahesh; Yan, Lindy; Mohamed, Zoser; DiPaola, Mario

doi:10.3390/ijms26199384

Open Access Review

Therabene Inc., Norwood, MA 02062, USA

Authors to whom correspondence should be addressed.

Int. J. Mol. Sci. 2025, 26(19), 9384; https://doi.org/10.3390/ijms26199384

Submission received: 12 August 2025 / Revised: 15 September 2025 / Accepted: 22 September 2025 / Published: 25 September 2025

(This article belongs to the Special Issue Editorial Board Members’ Collection Series: QSAR and Computational Approaches to Drug Discovery)

Abstract

Integrating artificial intelligence (AI) with Quantitative Structure-Activity Relationship (QSAR) modeling has transformed modern drug discovery by enabling faster, more accurate, and more scalable identification of therapeutic compounds. This review outlines the evolution from classical QSAR methods, such as multiple linear regression and partial least squares, to advanced machine learning and deep learning approaches, including graph neural networks and SMILES-based transformers. Molecular docking and molecular dynamics simulations are presented as complementary tools that add mechanistic understanding and structural insight into ligand-target interactions. The review also covers PROTACs and targeted protein degradation, ADMET prediction, and the public databases and cloud-based platforms that democratize access to computational modeling. Challenges related to validation, interpretability, regulatory standards, and ethical concerns are examined, along with emerging trends in AI-driven drug development. This review serves as a guide for using computational models and databases in explainable, data-rich drug discovery pipelines.

Keywords:

QSAR modeling; artificial intelligence; machine learning; deep learning; molecular docking; molecular dynamics; PROTACs; ADMET prediction; cheminformatics; drug discovery; graph neural networks

1. Introduction

Drug discovery is undergoing a significant transformation, driven by the integration of artificial intelligence (AI) into Quantitative Structure-Activity Relationship (QSAR) modeling [1,2,3,4]. Until recently, drug discovery relied primarily on trial and error, an approach that was time-consuming and costly; drug development is now increasingly shaped by data-driven computational methodologies that are expected to lead to faster drug discovery and, ultimately, safer drugs, as shown in Figure 1. QSAR has evolved from basic linear models to sophisticated machine learning (ML) and deep learning (DL) frameworks that capture complex nonlinear patterns across large chemical spaces [1,5,6,7].

When properly combined with AI, the predictive power of QSAR can facilitate virtual screening of extensive chemical databases containing billions of compounds, as well as de novo drug design and lead optimization for specific targets. Algorithms incorporating neural networks, generative models, and reinforcement learning are reshaping how compounds are selected, modified, and evaluated for defined targets. Furthermore, integrating omics data, real-world evidence from medicine, and multi-parametric optimization pushes the frontier of personalized medicine and targeted therapeutics [8,9].

This article explores current trends in AI-augmented QSAR methodologies, highlights key breakthroughs, and outlines emerging directions that could reshape the discovery and development of novel pharmaceutical products, especially small-molecule compounds such as inhibitors, degraders (PROTACs), and other modalities [9,10]. With the prospect of shortening hit-to-lead timelines and designing safer and more effective drugs, the synergy between QSAR and AI is becoming a new foundation of modern drug discovery and development [11]. Recent applications highlight the role of AI-enhanced QSAR in real-world discovery. Talukder et al. integrated docking, QSAR, and simulations to investigate EGFR-targeting phytochemicals in non-small cell lung cancer [12]; Kaur et al. developed BBB-permeable BACE-1 inhibitors for Alzheimer’s disease using 2D-QSAR, docking, ADMET, and MD [13]; Souza et al. combined ML approaches and QSAR to analyze SARS-CoV-2 Mpro inhibitors [14]; and Maliyakkal et al. applied QSAR-driven virtual screening to identify potential therapeutics against Trypanosoma cruzi [15]. These case studies illustrate how QSAR continues to advance modern inhibitor discovery.

Computational approaches significantly accelerate the preclinical stage of drug discovery by reducing costs, minimizing attrition, and expediting the identification of viable candidates. Ou-Yang et al. described how computational pipelines streamline hit-to-lead discovery, reducing reliance on expensive high-throughput screening [16]. More recently, Ouma et al. reviewed modern computational approaches and emphasized their role in decreasing attrition rates and supporting ADMET profiling in early discovery [17]. In parallel, Lavecchia highlighted how virtual screening and QSAR methodologies have become central tools in prioritizing compounds before synthesis, directly cutting down experimental costs [18]. Similarly, Paul et al. demonstrated that computational filtering and optimization strategies reduce the time required to progress leads into the preclinical pipeline, showing tangible benefits in pharmaceutical R&D [19]. These studies collectively demonstrate that computational drug discovery is not only complementary to laboratory methods but also indispensable for modern, cost-effective preclinical development.

2. Foundations of QSAR and Molecular Descriptors

QSAR modeling depends on molecular descriptors, numerical values that encode chemical, structural, or physicochemical properties of compounds. Descriptors are generally classified by dimensionality as 1D, 2D, or 3D, corresponding to properties such as molecular weight, topological indices, and molecular shape or electrostatic potential maps, respectively. To improve model efficiency and reduce overfitting, dimensionality reduction and feature selection techniques such as principal component analysis (PCA) and recursive feature elimination (RFE) are widely used [7,20]. Appropriate selection and interpretation of descriptors are necessary for building predictive, robust QSAR models. More sophisticated methods, including LASSO (Least Absolute Shrinkage and Selection Operator) and mutual information ranking, are frequently used to eliminate irrelevant or redundant variables and to identify the most significant features [21,22,23]. These methods not only improve model performance but also enhance interpretability, which is essential for regulatory approval and hypothesis generation in medicinal chemistry. In addition to 1D, 2D, and 3D descriptors, 4D descriptors have also been developed. These account for conformational flexibility by considering ensembles of molecular structures rather than a single static conformation. Ensemble-based descriptors provide more realistic representations of molecules under physiological conditions and have been applied in ligand-based pharmacophore modeling and QSAR refinement [24,25].
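
As a rough illustration of the dimensionality reduction step mentioned above, the following sketch applies PCA to a small descriptor table using only NumPy. The function name `pca_reduce` and the toy values in `X_toy` are hypothetical and are not taken from the review; a production workflow would more likely use an established library implementation.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project a (compounds x descriptors) matrix onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    # Autoscale each descriptor, since descriptors come in very different units.
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0.0] = 1.0  # leave constant descriptors unscaled to avoid division by zero
    Z = (X - mean) / std
    # SVD of the scaled matrix; the rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Hypothetical toy table: molecular weight, logP, and a redundant column (2 x logP).
X_toy = np.array([
    [180.2, 1.2, 2.4],
    [250.3, 2.5, 5.0],
    [310.4, 3.1, 6.2],
    [150.1, 0.4, 0.8],
    [420.5, 4.0, 8.0],
])
scores, explained = pca_reduce(X_toy, n_components=2)
```

Because the third column is an exact multiple of the second, two components are enough to carry essentially all of the variance in this toy table, which is the kind of redundancy PCA is meant to remove.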

Furthermore, quantum chemical descriptors such as the HOMO-LUMO gap, dipole moment, molecular orbital energies, and electrostatic potential surfaces have found extensive application in QSAR modeling, particularly for drug-like molecules whose electronic properties influence bioactivity [1,9]. The use of 3D descriptors such as molecular surface area, volume, and conformer-based properties has expanded with the availability of tools like DRAGON, PaDEL, and RDKit [26,27,28]. Likewise, the recent integration of deep learning techniques has led to the development of learned molecular representations, or “deep descriptors,” which are derived from molecular graphs or SMILES strings without manual descriptor engineering [29]. These latent embeddings, created by graph neural networks (GNNs) or autoencoders, can capture more abstract and hierarchical molecular features, which opens new possibilities for constructing data-driven, flexible QSAR pipelines applicable across diverse chemical spaces [30,31].

3. Classical QSAR: Statistical Modeling Techniques

In classical QSAR, molecular descriptors are correlated with biological activity by statistical regression. Traditional but effective approaches used extensively in drug discovery and environmental toxicology include Multiple Linear Regression (MLR), Partial Least Squares (PLS), and Principal Component Regression (PCR). These approaches are valued for their simplicity, speed, and ease of interpretation, especially in regulatory settings. They are most practical when a reasonably small number of variables show linear relationships with the biological response. Models are validated with internal metrics such as R² (coefficient of determination) and Q² (cross-validated R²), and with external datasets that test generalizability on unseen compounds. However, these models depend largely on assumptions of linearity, normal distribution, and independence among variables, which might not hold in large and chemically diverse datasets [7,32,33].
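
To make the relationship between R² and Q² concrete, the sketch below fits an MLR model by ordinary least squares and computes a leave-one-out Q². It is a minimal illustration on synthetic data, not a procedure from the review: the function names and the simulated descriptors are assumptions, and Q² is computed here as 1 − PRESS/SS_tot with the full-sample mean, which is one of several conventions in use.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def q_squared_loo(X, y):
    """Leave-one-out cross-validated R^2: refit without compound i, then predict it."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_mlr(X[keep], y[keep])
        preds[i] = predict_mlr(coef, X[i:i + 1])[0]
    return r_squared(y, preds)

# Synthetic example: two descriptors with a linear effect on activity plus noise.
rng = np.random.default_rng(0)
X_mlr = rng.normal(size=(30, 2))
y_mlr = 0.3 + 1.5 * X_mlr[:, 0] - 0.8 * X_mlr[:, 1] + rng.normal(scale=0.1, size=30)

r2 = r_squared(y_mlr, predict_mlr(fit_mlr(X_mlr, y_mlr), X_mlr))
q2 = q_squared_loo(X_mlr, y_mlr)
```

For a least-squares fit, Q² computed this way cannot exceed R², so a large gap between the two is a common warning sign of overfitting.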

Several efforts have been made over the years to extend the capabilities of classical QSAR models by incorporating robust feature selection and data preprocessing. Methods such as stepwise regression, bootstrapping, and residual analysis have been introduced to improve stability and reduce overfitting [34,35]. Nevertheless, these models frequently falter when dealing with strongly nonlinear relationships or noisy data that cannot be modeled with simple parametric equations. Consequently, hybrid approaches that combine classical statistical tools with machine learning methods, such as PLS combined with decision trees or ensemble averaging, have been developed to narrow this gap while retaining interpretability [11,36]. Despite the surge in popularity of complex models like random forests and deep learning, classical QSAR remains essential for preliminary screening, mechanism clarification, and cases where explainability is the greater priority, as in regulatory toxicology and REACH compliance [37]. Software packages such as QSARINS and Build QSAR continue to support classical model development with enhanced validation workflows and visualization tools [7,38].

Classical QSAR remains highly relevant in modern drug discovery when combined with rigorous validation. Olenginski et al. applied QSAR to RNA-binding small molecules, uncovering structural determinants of RNA–ligand recognition [39]. Talukder et al. integrated classical QSAR, docking, and simulations to prioritize EGFR-targeting phytochemicals for non-small cell lung cancer [12]. Kaur et al. designed BBB-permeable BACE-1 inhibitors using 2D-QSAR for Alzheimer’s disease, highlighting how traditional QSAR supports lead design [13]. Finally, Cherkasov et al. provided a landmark review demonstrating how classical QSAR principles underpin contemporary drug discovery pipelines [1]. These cases illustrate the enduring value of classical QSAR.

4. The Rise of Machine Learning in QSAR

Machine learning has significantly increased the predictive power and flexibility of QSAR models, mainly in handling complex, high-dimensional chemical datasets. Algorithms such as Support Vector Machines (SVM), Random Forests (RF), and k-Nearest Neighbors (kNN) are standard tools in cheminformatics and are widely used for tasks ranging from virtual screening to toxicity prediction [40,41,42]. Unlike classical linear models, these algorithms can capture nonlinear relationships between molecular descriptors and biological activity without prior assumptions about data distribution. Random Forests are generally preferred for their robustness, built-in feature selection, and ability to handle noisy data, while SVMs are more effective with limited samples and high descriptor-to-sample ratios [43,44,45]. The robustness of RF has several sources: it can manage irrelevant or redundant descriptors, because random feature selection at each split reduces the risk of overfitting to noisy variables; it tolerates collinearity among descriptors, since only a subset of variables is considered at a time; and it is resilient to outliers in bioactivity data, as ensemble averaging across many trees dampens the effect of anomalous points [46,47]. Hyperparameter optimization strategies such as grid search and Bayesian optimization are regularly applied to fine-tune these models for the best predictive performance.
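
The grid search mentioned above can be sketched with the simplest of the listed algorithms, kNN. The example below tunes the number of neighbours k by leave-one-out RMSE on synthetic, nonlinear data. All names (`knn_predict`, `loo_rmse`, `grid_search_k`) and the simulated dataset are illustrative assumptions; Euclidean distance on raw descriptors is used only for brevity.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Average the activities of the k nearest training compounds (Euclidean distance)."""
    dist = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

def loo_rmse(X, y, k):
    """Leave-one-out RMSE for a given k."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        errors[i] = knn_predict(X[keep], y[keep], X[i], k) - y[i]
    return float(np.sqrt(np.mean(errors ** 2)))

def grid_search_k(X, y, candidates):
    """Evaluate every candidate k and return the one with the lowest LOO RMSE."""
    scores = {k: loo_rmse(X, y, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Synthetic nonlinear structure-activity surface (hypothetical data).
rng = np.random.default_rng(1)
X_demo = rng.uniform(-2, 2, size=(60, 2))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=60)

best_k, cv_scores = grid_search_k(X_demo, y_demo, candidates=[1, 3, 5, 9, 15])
```

The same pattern extends to other hyperparameters, such as the number of trees in a random forest or the kernel width of an SVM, by replacing the candidate list and the scoring function.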

Recent developments have also focused on improving interpretability and reducing the “black-box” nature of machine learning in QSAR. Feature importance methods like permutation importance, SHAP (SHapley Additive exPlanations), and LIME (Local Interpretable Model-agnostic Explanations) now allow researchers to understand which descriptors influence model predictions the most [48]. Ensemble learning methods such as stacking, bagging, and boosting have further improved model stability and precision across diverse chemical spaces. These innovations, combined with increasing access to curated datasets and open-source platforms, including scikit-learn, KNIME [49], and AutoQSAR, have democratized machine learning-based QSAR modeling [50]. Despite these advances, careful attention to validation, applicability domain, and dataset bias remains indispensable to avoid overfitting and secure regulatory acceptance, particularly in safety-critical domains like pharmacovigilance and environmental hazard prediction [51].

Machine learning has rapidly expanded QSAR capabilities. Singh et al. developed ML-guided QSAR for imidazole scaffolds, achieving strong predictive accuracy even in noisy datasets [42]. Souza et al. combined ML and QSAR to reveal antagonistic trends in SARS-CoV-2 Mpro inhibitors, illustrating ML’s role in antiviral discovery [14]. Maliyakkal et al. performed QSAR-driven virtual screening for Trypanosoma cruzi therapeutics, showing the utility of ML-enhanced QSAR in neglected tropical diseases [15]. Zhang et al. demonstrated the power of ML-based QSAR for kinase inhibitor optimization, highlighting broader applicability across target classes [52]. Together, these studies show how ML augments QSAR with robustness and generalizability.
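
Of the interpretability methods listed above, permutation importance is the simplest to write out. The sketch below shuffles one descriptor column at a time and records how much the mean squared error of a fitted model increases. It is a minimal NumPy illustration on synthetic data; the function name, the linear stand-in model, and the dataset are assumptions, and SHAP or LIME would require their own dedicated packages.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each descriptor column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Synthetic data: descriptor 0 matters most, descriptor 1 a little, descriptor 2 not at all.
rng = np.random.default_rng(42)
X_imp = rng.normal(size=(200, 3))
y_imp = 3.0 * X_imp[:, 0] + 0.5 * X_imp[:, 1] + rng.normal(scale=0.1, size=200)

# Stand-in model: a least-squares linear fit (any fitted predictor could be used instead).
A = np.column_stack([np.ones(len(X_imp)), X_imp])
coef, *_ = np.linalg.lstsq(A, y_imp, rcond=None)

def linear_model(X):
    return coef[0] + X @ coef[1:]

importances = permutation_importance(linear_model, X_imp, y_imp)
```

The method only needs a prediction function, so the same code applies unchanged to a random forest, an SVM, or a neural network.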

5. Deep Learning and Neural Models in Drug Discovery

Deep learning has substantially reshaped QSAR modeling by enabling automated feature extraction and representation learning directly from raw molecular structures. In contrast to traditional models that depend on pre-calculated descriptors, deep learning architectures can process molecular graphs, SMILES strings, and even 3D conformations to derive abstract, high-level features relevant to biological activity prediction [9,53,54]. Convolutional Neural Networks (CNNs) have been adapted to operate on molecular grids and images, while Recurrent Neural Networks (RNNs), principally Long Short-Term Memory (LSTM) networks, are suited to sequential representations like SMILES strings [55,56]. Graph Neural Networks (GNNs) have emerged as particularly powerful tools in cheminformatics, as they natively model atoms as nodes and bonds as edges, directly reflecting the topology of molecular structures. Tools like DeepChem [57], Chemprop [58], and DGL-LifeSci have also made these models more accessible to non-experts [30].
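
The atoms-as-nodes, bonds-as-edges idea can be illustrated with a hand-built graph for ethanol and a single round of neighbour aggregation. This is only a sketch of the data structure and of one untrained message-passing step; real GNNs add learned weight matrices and nonlinearities, and the helper names below are hypothetical rather than drawn from DeepChem, Chemprop, or DGL-LifeSci.

```python
import numpy as np

# Ethanol heavy atoms (SMILES "CCO"): atoms 0 and 1 are carbon, atom 2 is oxygen.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

def adjacency(n_atoms, bonds):
    """Symmetric adjacency matrix: entry (i, j) is 1 when atoms i and j are bonded."""
    A = np.zeros((n_atoms, n_atoms))
    for i, j in bonds:
        A[i, j] = A[j, i] = 1.0
    return A

def one_hot(atoms, vocabulary=("C", "N", "O")):
    """Initial node features: one-hot encoding of the element symbol."""
    H = np.zeros((len(atoms), len(vocabulary)))
    for row, symbol in enumerate(atoms):
        H[row, vocabulary.index(symbol)] = 1.0
    return H

def message_pass(A, H):
    """One round of sum aggregation: each atom adds its neighbours' features to its own."""
    return (A + np.eye(len(A))) @ H

A_eth = adjacency(len(atoms), bonds)
H0 = one_hot(atoms)
H1 = message_pass(A_eth, H0)
graph_vector = H1.sum(axis=0)  # order-independent readout over all atoms
```

After one step, the central carbon's feature row already encodes that it sits next to both a carbon and an oxygen, which is the kind of local chemical context that stacked message-passing layers build up.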

Another groundbreaking development in this area is the rise of chemical language models, such as SMILES-based transformers and autoencoders. Models such as ChemBERTa [59] and MolBERT [60] are trained on millions of chemical structures and can be fine-tuned for specific tasks such as activity prediction, retrosynthesis, or de novo molecular generation. Generative models, including Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), are now commonly employed to propose novel compounds with enhanced activity and ADMET properties. One example is REINVENT, which couples reinforcement learning with RNN-based generators to iteratively optimize molecules toward a predefined goal [61]. Deep learning models in QSAR also face issues of interpretability, data imbalance, and transferability across distinct chemical domains. To address these challenges, techniques like attention mechanisms, multi-task learning, and data augmentation are being incorporated into pipelines to improve generalizability and reduce overfitting [62,63,64].

Deep learning enables abstraction of complex features beyond the reach of classical QSAR. Li et al. developed DeepPROTACs, a deep learning framework for PROTAC design [65]. Goh et al. pioneered deep learning models for compound–protein interaction prediction, outperforming conventional QSAR [66]. Kim et al. applied SMILES-based deep networks for bioactivity prediction, achieving superior accuracy in chemical space coverage [67]. Altae-Tran et al. demonstrated one-shot learning for drug discovery, highlighting DL’s ability to learn from extremely limited data [68]. These advances underscore how DL reshapes QSAR by capturing nonlinear, high-dimensional representations.
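
Before a SMILES string reaches a chemical language model, it has to be split into tokens and mapped to integer ids. The sketch below shows one simple way to do this with a regular expression. The pattern is a simplified assumption that covers bracket atoms, the two-letter halogens Cl and Br, ring-closure digits, and common bond and branch symbols; it is not the tokenizer used by ChemBERTa, MolBERT, or REINVENT.

```python
import re

# Simplified token pattern (an assumption): bracket atoms first, then two-letter
# halogens, two-digit ring closures, single letters, digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")

def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens

def build_vocab(smiles_list):
    """Map every distinct token in a corpus to an integer id."""
    unique = sorted({tok for smi in smiles_list for tok in tokenize(smi)})
    return {tok: idx for idx, tok in enumerate(unique)}

def encode(smiles, vocab):
    return [vocab[tok] for tok in tokenize(smiles)]

corpus = ["CC(=O)Oc1ccccc1C(=O)O", "ClCCBr", "C[C@@H](N)C(=O)O"]  # aspirin and two toy inputs
vocab = build_vocab(corpus)
aspirin_ids = encode(corpus[0], vocab)
```

Treating `Cl`, `Br`, and bracket atoms such as `[C@@H]` as single tokens keeps the model from having to learn that, for example, `C` followed by `l` is one atom rather than two.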

6. Molecular Docking and Dynamics

Molecular docking is a powerful tool in drug discovery that simulates the interaction between ligands and target proteins, yielding binding poses and estimating binding affinities through scoring functions. It offers a critical first look at how a small molecule might bind to a receptor and is essential for virtual screening, lead identification, and understanding structure-activity relationships. The most commonly used docking platforms include AutoDock Vina (v1.1.2) [69], Glide (2004 release) [70], GOLD (v5.2) [71], and LeDock (2015 release) [72]. Docking methods differ in their sampling techniques and scoring algorithms [73], and the binding affinities and pose geometries they produce can be incorporated into QSAR models as molecular descriptors to improve predictive performance and interpretability. Despite its popularity and accessibility, docking has shortcomings: basic protocols treat the receptor as rigid, and scoring functions may not fully capture entropic and solvent effects. While rigid docking assumes a fixed receptor structure, most modern approaches employ semi-flexible docking (ligand flexible, receptor side chains partially flexible) or flexible docking (both ligand and receptor sampled). These methods better capture induced-fit effects but also present limitations. First, flexible docking is computationally intensive, often requiring significant resources for conformational sampling. Second, it provides only a limited representation of receptor flexibility, as most algorithms sample a subset of side chains or backbone motions rather than the full conformational landscape. Third, flexible docking can increase the risk of false positives, since greater conformational freedom may allow unrealistic ligand poses without sufficient energetic penalties [73,74]. To address these limitations, flexible docking is frequently combined with molecular dynamics simulations, which provide a more rigorous account of receptor flexibility.

Molecular Dynamics (MD) simulations, which offer atomistic insights into the temporal behavior of protein-ligand complexes in a fully solvated environment, are used to overcome the limitations of docking tools. MD simulations help to characterize the structural and conformational dynamics of protein-ligand complexes, including pose stability, protein flexibility, and significant non-covalent interactions such as hydrogen bonds, salt bridges, van der Waals, and electrostatic interactions, which might go unnoticed in static docking studies. Software packages like GROMACS (v5.x), AMBER (v12/14), NAMD (v2.6), and CHARMM (CHARMM36) are commonly used to simulate biological systems on timescales ranging from nanoseconds to microseconds [75,76,77,78,79]. Commercial packages are also available for easy and fast MD simulations. The simulation results are combined with MM/PBSA or MM/GBSA methods to estimate binding free energies more precisely. Docking and MD are then combined with QSAR or ML/DL frameworks, resulting in hybrid models that achieve higher performance in predicting binding affinity and mechanisms of action [80]. Integrating MD-derived features such as root-mean-square deviation (RMSD), hydrogen bond analysis, and conformational clustering [81,82,83] can considerably improve both regression and classification outcomes in modern drug discovery pipelines [84,85].

Docking and molecular dynamics (MD) simulations complement QSAR by providing mechanistic insight into ligand–target interactions. Koirala and Fagerquist combined MD with QSAR to probe Colicin–Immunity protein complexes [81]. Lu et al. applied long-timescale MD to GPCRs, revealing conformational states essential for ligand binding [86]. Hollingsworth and Dror reviewed how MD simulations guide drug discovery across multiple protein classes, demonstrating predictive power at atomic resolution [75]. Pagadala et al. provided a comprehensive review of docking applications, emphasizing the strengths and limitations of docking tools in lead optimization [73]. These cases illustrate how docking and MD extend QSAR into structure-based discovery.
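
RMSD, one of the MD-derived features mentioned above, is straightforward to compute once coordinates are available. The sketch below calculates the RMSD of each frame against a reference structure, assuming the frames have already been superimposed on the reference (no alignment step is performed here). The coordinates are made-up values for three atoms, and the names `rmsd` and `trajectory` are illustrative only.

```python
import numpy as np

def rmsd(coords, reference):
    """Root-mean-square deviation between two (n_atoms x 3) coordinate arrays.

    Assumes both structures are already superimposed; no fitting is done here.
    """
    diff = np.asarray(coords, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Hypothetical three-atom ligand, coordinates in angstroms.
reference = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [1.5, 1.5, 0.0],
])

# Toy "trajectory": the reference itself, then two rigid translations of it.
trajectory = [
    reference,
    reference + np.array([1.0, 0.0, 0.0]),
    reference + np.array([0.0, 3.0, 4.0]),
]
rmsd_series = [rmsd(frame, reference) for frame in trajectory]
```

Summary statistics of such a series, for example its mean or maximum over the production run, are the kind of per-compound values that could be fed into a QSAR model as additional descriptors.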

7. PROTACs and Targeted Protein Degradation

Proteolysis-targeting chimeras (PROTACs), or degraders, represent a highly promising therapeutic strategy that goes beyond inhibition, eliminating disease-causing proteins by harnessing the ubiquitin-proteasome system. These bifunctional molecular constructs bind a target protein and an E3 ubiquitin ligase concurrently, forming a ternary complex that leads to ubiquitination and subsequent proteolysis of the target protein. This mechanism unlocks potential access to previously “undruggable” proteins and opens new avenues for treating cancer, neurodegenerative diseases, and immune disorders. PROTAC development presents several challenges owing to the size and flexibility of these constructs and their reliance on complex protein–protein interactions. Indeed, existing drug discovery tools are not very effective at accurately predicting degraders’ pharmacokinetics, efficacy, or degradation behavior [87,88,89].

Most degraders are over 800 Da in molecular weight, display high flexibility, and contain three distinct substructures: a moiety specific to the target protein, a linker, and an E3 ligase binder. This modular structure creates complex, nonlinear relationships between chemical structure and biological activity. Existing QSAR models rely on relatively static structure–activity correlations and thus are not well suited to capturing multi-factorial interactions. Descriptors such as LogP, molecular size, and polar surface area (PSA) are insufficient for predicting critical degrader determinants, such as ternary complex stability, target-E3 ligase cooperativity, linker geometry and flexibility, cell permeability, and metabolic stability [90,91,92].

To overcome these shortcomings, combined AI and QSAR models are being trained on relatively sparse, noisy, high-dimensional data to better capture the structure–function relationships of PROTACs, leading to better prediction of degrader constructs with drug-like attributes against specific target proteins. This new paradigm in AI-driven QSAR has resulted in novel predictive algorithms, including graph neural networks (GNNs), generative models for linkers, multi-task and transfer learning algorithms, and explainable AI [93,94], along with the use of transformers to process molecular representations for high-accuracy predictions of PROTAC behavior [95].

GNN modeling describes molecules as graphs, with nodes representing atoms and edges representing bonds. This approach can capture the topological and relational information within PROTACs and learn representations directly from molecular structure. Recent publications [96] have reported that GNNs can accurately predict ternary complex formation and degradation potential across multiple E3 ligases. More recently, a 3D graph neural network, DegradeMaster, was introduced, which utilizes 3D spatial information to further improve the prediction of PROTAC degradation effectiveness [97,98].

Composition and geometry of linker regions in degrader structures are critical to degrader function, affecting ternary complex formation, solubility and cell permeability, and overall degradation efficiency. Novel AI models like DeLinker [99] and DiffLinker [100] employ deep generative techniques to allow for the design of optimal linkers. These models learn linker patterns from large chemical databases and generate novel candidates based on specific spatial constraints, thus reducing the number of experimental iterations. Deep learning models capable of predicting multiple endpoints, such as degradation efficiency, off-target effects, and ADME properties, offer a comprehensive approach to degrader optimization. Multi-task models leverage shared information across tasks to improve learning efficiency, especially within the limitations of available data [101].

Transfer learning approaches are also gaining traction; these reuse knowledge from related drug discovery tasks and apply it to new conditions. For example, models pretrained on large chemical datasets can be fine-tuned on smaller PROTAC-specific datasets, improving performance on degradation tasks. As degraders enter the clinical space, interpretability becomes crucial. Explainable AI techniques like SHAP (SHapley Additive exPlanations) and attention-based mechanisms allow researchers to identify which molecular substructures or linker properties affect degradation efficiency the most, providing feedback to guide the rational design of degrader structures [102,103,104].

8. Predicting ADMET and Toxicity Profiles

A major reason for therapeutic drug failure lies in undesirable Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties. Predicting these critical drug properties early in the development process allows for significant de-risking in drug development, reducing costs and accelerating the delivery of significant therapies. Until recently, ADMET properties have been evaluated through laborious and expensive experimental methods, frequently involving in vitro and in vivo studies. These methods are resource-intensive and time-consuming, and can be applied to only a limited number of compounds. With new developments in AI, however, ADMET properties can now be predicted in silico, leading to a significant reduction in costs and time and potentially lowering the risk of drug failure [105,106].

AI, through the combination of machine learning (ML) and deep learning (DL), has emerged as a promising tool in overcoming the ADMET bottleneck in drug development. By leveraging vast datasets of chemical structures and their associated experimental ADMET and toxicity data, AI models can learn complex patterns and make highly accurate predictions of ADMET properties for novel compounds [106,107].

Most ML approaches in ADMET prediction depend on algorithms such as Random Forest, Support Vector Machines (SVM), and Gradient Boosting Machines (GBM). These models require carefully curated datasets of molecular descriptors and associated ADMET characteristics. Alternatively, DL algorithms, especially GNNs, have shown great success. GNNs process molecular structures as graphs, capturing complex relationships between atoms and bonds, which is vital for understanding how molecules interact with biological systems. GNN-based platforms, when trained on extensive datasets, can provide fast and accurate predictions across a wide range of ADMET properties. Additional methods, such as AutoML, automate the modeling process by choosing the best machine learning algorithms and tuning their hyperparameters, making AI-driven ADMET prediction more accessible and efficient. Furthermore, newly emerging research focuses on the use of quantum machines in ADMET predictions, promising significant improvement in performance and prediction accuracy [108,109,110,111].

A series of ADMET prediction software packages, both open source and commercial, is provided in Table 1, and a list of available databases for training AI models is provided in Table 2.

9. Assessing Model Validity and Reliability

Model validation is important in QSAR development because it confirms that predictive models are accurate and generalizable. Internal validation practices such as leave-one-out (LOO), k-fold cross-validation, and bootstrapping [131] are regularly used to evaluate a model’s stability and validity during training. Internal validation alone is not enough, so external validation using an independent dataset is crucial to confirm that the model performs well on unseen compounds. Metrics such as the coefficient of determination (R²), cross-validated R² (Q²), root mean square error (RMSE), and concordance correlation coefficient (CCC) are widely used to evaluate performance. OECD guidelines require a model to pass both internal and external validation criteria to be considered predictive [3,132].
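
Two of the metrics named above, RMSE and CCC, can be written in a few lines. The sketch below uses Lin's form of the concordance correlation coefficient with population (1/n) moments; the observed values in `y_obs` are hypothetical pIC50-like numbers, and the "predictions" are simply the observations shifted by a constant to show how CCC penalizes systematic bias.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted activities."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def ccc(y_true, y_pred):
    """Concordance correlation coefficient: agreement with the identity line y = x."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    covariance = np.mean((y_true - y_true.mean()) * (y_pred - y_pred.mean()))
    denominator = y_true.var() + y_pred.var() + (y_true.mean() - y_pred.mean()) ** 2
    return float(2.0 * covariance / denominator)

# Hypothetical external test set and predictions with a constant +0.5 bias.
y_obs = np.array([5.1, 6.3, 7.0, 8.2, 4.4])
y_biased = y_obs + 0.5

rmse_biased = rmse(y_obs, y_biased)
ccc_perfect = ccc(y_obs, y_obs)
ccc_biased = ccc(y_obs, y_biased)
```

A constant offset leaves the Pearson correlation at exactly 1, whereas CCC drops below 1, which is why CCC is often reported alongside R² for external test sets.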

Another cornerstone of QSAR model trustworthiness is the Applicability Domain (AD), the chemical space in which a model can make reliable predictions. Several techniques exist for defining it, including the leverage approach (Williams plot), distance-based methods (Euclidean or Mahalanobis distance), and ensemble-based probability estimates [133]. ADAN (Applicability Domain Analysis) and other tools like Ambit Discovery and QSAR Toolbox provide practical means of assessing whether query compounds fall within the domain. In addition, Y-randomization (or response permutation testing) is used to identify chance correlations by randomly shuffling the activity data and rebuilding the model several times. A valid model is expected to perform substantially better than its randomized counterparts. Recently, Monte Carlo sampling, double cross-validation, and consensus modeling have further improved model robustness and reproducibility [134]. These validation pipelines are especially important for regulatory approval in fields like environmental toxicology and pharmaceuticals, where consistency and transparency are paramount.
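
The leverage approach behind the Williams plot can be sketched as follows. A query compound's leverage is computed from the training descriptor matrix, and compounds above a warning threshold are flagged as outside the domain. The threshold h* = 3(p + 1)/n used here is a commonly quoted convention and is an assumption of this sketch, as are the function names and the simulated descriptors.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h = x (A^T A)^-1 x^T, where A is the training matrix with an intercept column."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    inv = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, inv, Q)

def in_domain(X_train, X_query):
    """Flag query compounds whose leverage does not exceed h* = 3(p + 1)/n."""
    n, p = X_train.shape
    h_star = 3.0 * (p + 1) / n
    h = leverages(X_train, X_query)
    return h <= h_star, h, h_star

# Hypothetical training set of 50 compounds described by 2 autoscaled descriptors.
rng = np.random.default_rng(3)
X_train_ad = rng.normal(size=(50, 2))

# One query near the centre of the training data, one far outside it.
X_query_ad = np.array([[0.0, 0.0], [10.0, 10.0]])
inside, h_query, h_star = in_domain(X_train_ad, X_query_ad)
h_train_sum = float(leverages(X_train_ad, X_train_ad).sum())
```

In a full Williams plot, these leverages are plotted against standardized residuals, so that compounds can be identified as structural outliers, response outliers, or both.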

While model validation is an established requirement in QSAR, inadequate or superficial validation often leads to overfitting and misleading performance metrics. Internal validation methods such as k-fold cross-validation, bootstrapping, and Y-randomization assess robustness within the training dataset, but they cannot fully guarantee generalizability [135]. External validation using independent test sets is considered the gold standard, as emphasized by Tropsha and Golbraikh [136]. Moreover, the OECD principles for QSAR validation recommend defining the applicability domain (AD), which delineates the chemical space where predictions are reliable [137]. To mitigate failure cases, best practices include combining internal and external validation, applying consensus modeling, and benchmarking against well-curated datasets. Practical adoption of these recommendations can reduce model overestimation and enhance regulatory acceptance.
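
Y-randomization, one of the internal checks discussed above, can be sketched by refitting the same model on shuffled activities and comparing the resulting R² values with that of the real model. The example below uses a least-squares linear model on synthetic data; the function names, the number of shuffling rounds, and the dataset are illustrative assumptions.

```python
import numpy as np

def fit_r2(X, y):
    """Fit ordinary least squares with an intercept and return the training R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_rounds=100, seed=0):
    """Compare the real model's R^2 with R^2 values obtained after shuffling the activities."""
    rng = np.random.default_rng(seed)
    r2_true = fit_r2(X, y)
    r2_random = np.array([fit_r2(X, rng.permutation(y)) for _ in range(n_rounds)])
    return r2_true, r2_random

# Synthetic dataset with a genuine descriptor-activity relationship.
rng = np.random.default_rng(7)
X_yr = rng.normal(size=(40, 3))
y_yr = X_yr @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=40)

r2_true, r2_random = y_randomization(X_yr, y_yr)
```

If the shuffled models scored nearly as well as the real one, the apparent fit would most likely be a chance correlation rather than a real structure-activity relationship.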

Despite significant progress, ML and DL approaches face critical limitations. ML methods, though robust against noise, often struggle with interpretability and may overfit on small datasets. DL models, while powerful in capturing nonlinear relationships, typically require large, high-quality datasets and are vulnerable to bias from imbalanced training data [138]. Furthermore, most reported QSAR successes underrepresent negative results, limiting awareness of failure cases [139]. Addressing these challenges requires transparent reporting, model explainability, and rigorous external validation.

10. Software, Databases and Computational Platforms

With the growth of computational resources and AI tools, many platforms now play a vital role in streamlining computational drug discovery pipelines, especially in QSAR modeling, as presented in Table 3. RDKit (2019 release) [28] is a very popular open-source cheminformatics toolkit for Python that is used extensively for calculating molecular descriptors, generating molecular fingerprints, and performing structure-based filtering. Similarly, PaDEL-Descriptor (2011 release) [27] is another widely used tool that computes more than 1,400 descriptors and fingerprints, making it suitable for both academic and industrial workflows. KNIME (2015 release) [49] is a popular graphical, workflow-based analytics platform that lets users build and automate modeling pipelines from drag-and-drop components, integrating tools for data preprocessing, machine learning, and molecular descriptor generation [140,141].
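To make the RDKit use cases above concrete, the following short sketch computes a few descriptors, generates Morgan fingerprints, and applies a substructure filter. It assumes a reasonably recent RDKit installation that provides `rdFingerprintGenerator`; the choice of descriptors, the fingerprint settings (radius 2, 2048 bits), the carboxylic acid SMARTS pattern, and the example molecules are illustrative assumptions rather than recommendations from the cited references.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, rdFingerprintGenerator

# Morgan (circular) fingerprint generator; radius and size are assumed settings
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

# SMARTS pattern for a carboxylic acid group, used as an example filter
ACID = Chem.MolFromSmarts("[CX3](=O)[OX2H1]")


def describe(smiles):
    """Return a few common descriptors for a SMILES string, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
        "HBD": Descriptors.NumHDonors(mol),
        "HBA": Descriptors.NumHAcceptors(mol),
    }


def has_acid(smiles):
    """Structure-based filter: True if the molecule contains a carboxylic acid."""
    return Chem.MolFromSmiles(smiles).HasSubstructMatch(ACID)


def tanimoto(smiles_a, smiles_b):
    """Tanimoto similarity between the Morgan fingerprints of two molecules."""
    fa = morgan.GetFingerprint(Chem.MolFromSmiles(smiles_a))
    fb = morgan.GetFingerprint(Chem.MolFromSmiles(smiles_b))
    return DataStructs.TanimotoSimilarity(fa, fb)


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
salicylic_acid = "OC(=O)C1=CC=CC=C1O"
ethanol = "CCO"

desc = describe(aspirin)
similarity = tanimoto(aspirin, salicylic_acid)
print(desc)
print(has_acid(aspirin), has_acid(ethanol), round(similarity, 2))
```

The resulting descriptor dictionaries or fingerprint bit vectors can then be passed to any downstream QSAR model.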

DeepChem (2018 release) [57] and Chemprop (2019 release) [58] are deep learning and graph-based frameworks that have become increasingly popular. DeepChem provides implementations of deep neural networks, graph convolutional networks (GCNs), and multitask learning frameworks tailored to molecular property prediction. In contrast, Chemprop, built with PyTorch (2017 release), is designed around message passing neural networks and is particularly effective for structure-activity relationship tasks using SMILES strings [30]. Several public databases, such as ChEMBL (2017 release) [142], ZINC (2012 release) [143], and PubChem (2004/2006 release) [144], provide broad, curated chemical and bioactivity data, enabling large-scale QSAR and machine learning studies [145]. These are complemented by cloud-based resources such as Google Colab (2018 release) [57] and AWS SageMaker (2017 release) [146], which make it possible to run QSAR workflows at scale without demanding local hardware. Furthermore, integrating cheminformatics with cloud-based notebooks accelerates reproducible research and easy sharing of computational pipelines, substantially lowering the barrier to entry for researchers worldwide [147].

11. Challenges, Ethical Considerations and Regulatory Aspects

AI-driven drug discovery and development represent a promising platform for accelerating the development of new medicines; at the same time, such technology brings significant challenges, ethical considerations, and evolving regulatory aspects that must be carefully navigated [148,149]. Among the many challenges, key issues are data quality and availability. AI models require extensive data for model training; furthermore, these data must be of high quality. In the drug discovery space, the availability of large amounts of quality data, that is, comprehensive and unbiased data from various sources, is a major hurdle. Frequently, the available datasets are incomplete, noisy, or proprietary, making them difficult to integrate and use effectively for training AI models. AI models trained on narrowly defined datasets are likely to have limited utility, as minor variations in training parameters or input data can lead to different outputs, raising concerns about reproducibility and reliability. Additionally, certain data, especially for pharmacokinetic and toxicity analysis, may be biased toward particular chemical scaffolds or patient demographics, limiting model generalization.

Modeling complex biological systems to determine how a drug will interact with such systems presents major challenges; indeed, modeling even simpler systems, such as cellular structures, can be overwhelming given the complex interrelations between the cellular components and the dynamic nature of these systems. Commercially, the employment of AI in drug discovery requires considerable investment in advanced computational hardware and software, infrastructure, and a skilled workforce with proficiency in both AI/machine learning and pharmaceutical sciences. Such investments can be a barrier for smaller companies or research institutions. Furthermore, integrating AI technology into existing drug discovery and development pipelines and workflows within pharmaceutical organizations can be difficult and may result in low adoption.

From an ethical standpoint, AI models trained on biased data can perpetuate and even reinforce existing disparities in healthcare. If datasets are not representative of diverse patient populations, the AI’s predictions may lead to inequitable outcomes in drug efficacy and safety for underrepresented groups. The use of patient-level data, including genomic data, electronic health records, and clinical trial data, for AI training must be governed by strict privacy protections and ethical standards of informed consent to prevent re-identification of anonymized data, data breaches, or unauthorized secondary use of the data.

The opaque nature of some AI models raises ethical concerns regarding accountability. If an AI-driven drug leads to adverse outcomes, it can be challenging to pinpoint the exact reasons for the AI’s decision, making it difficult to allocate responsibility. Likewise, when AI systems are used to guide therapeutic decisions or influence clinical trial designs, questions of accountability arise; therefore, developers must ensure traceability of model inputs, outputs, and decision logic. The regulatory landscape as it applies to the use of AI in drug development is as challenging as the ethical issues discussed above. Meanwhile, regulatory bodies worldwide, such as the FDA (US) and EMA (Europe), are working to develop appropriate guidelines for AI-driven drug development [150]. This is a rapidly evolving area, and companies face the challenge of complying with frameworks that are still conceptual in some respects and, at best, partially defined.

Regulatory agencies are increasingly incorporating AI and QSAR into safety assessment. While still in the very early stages, both FDA and EMA currently accept QSAR assessments under ICH M7(R2) for the identification and control of DNA-reactive impurities [151]. Additionally, the FDA’s Center for Drug Evaluation and Research (CDER) has partnered with the National Center for Toxicological Research (NCTR) through the ‘SafetAI Initiative’ to develop and validate AI-based QSAR models for predicting key safety endpoints, including hepatotoxicity, carcinogenicity, and cardiotoxicity [152]. This effort not only strengthens internal expertise but also establishes standards against which industry submissions can be evaluated.

Interpretability remains a regulatory priority. Approaches such as SHAP (Shapley Additive Explanations) have been successfully applied to QSAR toxicity models, where they highlight the molecular substructures most responsible for predictions, thereby improving transparency and confidence in model use [153,154]. These developments indicate a convergence between regulatory acceptance and explainable AI, supporting safer integration of AI-enhanced QSAR into decision-making.
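The idea behind Shapley-based attribution can be illustrated without any external library by computing exact Shapley values for a very small model. The sketch below is not the `shap` package and not a model from the cited studies: the toy toxicity function, the feature names, and the choice of a reference compound as baseline are all hypothetical. Exact enumeration is feasible only for a handful of features, which is why practical tools rely on approximations.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for a single prediction.

    Features outside a coalition are set to their baseline value, which is
    one common (assumed) way of "removing" a feature.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_tox_model(d):
    """Hypothetical toxicity score with an interaction between two features."""
    return (0.8 * d["nitro_groups"]
            + 0.3 * d["logP"]
            + 0.5 * d["nitro_groups"] * d["aromatic_rings"])


compound = {"nitro_groups": 1, "logP": 2.0, "aromatic_rings": 2}
reference = {"nitro_groups": 0, "logP": 1.0, "aromatic_rings": 1}

phi = shapley_values(toy_tox_model, compound, reference)
gap = toy_tox_model(compound) - toy_tox_model(reference)
print(phi, gap)
```

The attributions sum to the difference between the prediction for the compound and the prediction for the reference, which is the additivity property that makes such explanations straightforward to audit.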

When AI models are used to support regulatory submissions, regulatory agencies invariably require robust validation and verification of those models, especially in relation to drug safety, effectiveness, and quality. Regulatory requirements will necessitate demonstration of data integrity and traceability, as well as verifiable model performance. Additionally, regulatory submissions that rely on AI-driven decisions will require that the AI models be extensively validated using external datasets and benchmarked against established clinical standards, and that they include documentation of data sources, preprocessing, model architecture, and performance metrics. Regulatory agencies are likely to shift towards a risk-based approach, where the level of scrutiny for an AI model depends on its “context of use” (COU) and the potential risk associated with its output. Ultimately, data used to train AI models must be “fit for use,” meaning that the data must be applicable, reliable, and of sufficient quality to support the intended regulatory purpose. Furthermore, regulators will demand greater transparency and explainability in AI models, especially those affecting critical regulatory decisions.

For continuous-learning AI models that adapt over time (“self-evolving”), regulatory agencies will need to ensure ongoing model validation and monitoring to prevent “model drift,” so that the model’s performance does not degrade or become unreliable. Where AI is used in pharmacovigilance to identify adverse drug events and safety signals from large datasets, regulators should make sure that the AI model used in this crucial post-marketing phase is accurate and reliable. Correspondingly, if AI is applied in patient stratification, simulation of trial outcomes, and identification of relevant biomarkers, it is important that such applications do not introduce bias or compromise statistical integrity.
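One simple way to picture ongoing monitoring for model drift is to compare the prediction error on each new batch of measured compounds with the error observed at validation time. The following sketch assumes a hypothetical baseline RMSE, invented batch data, and an arbitrary tolerance factor of 1.25; none of these values come from a regulatory guideline.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def detect_drift(baseline_rmse, batches, tolerance=1.25):
    """Return indices of batches whose RMSE exceeds tolerance * baseline_rmse."""
    flagged = []
    for i, (y_true, y_pred) in enumerate(batches):
        if rmse(y_true, y_pred) > tolerance * baseline_rmse:
            flagged.append(i)
    return flagged


# Hypothetical (measured, predicted) activities for three successive batches
batches = [
    ([5.1, 6.0, 7.2], [5.0, 6.1, 7.0]),  # errors comparable to validation
    ([5.5, 6.4, 4.9], [5.4, 6.6, 5.0]),  # errors comparable to validation
    ([6.8, 5.2, 7.5], [5.9, 6.1, 6.4]),  # errors much larger
]

flagged = detect_drift(baseline_rmse=0.2, batches=batches)
print(flagged)
```

A flagged batch would typically trigger a review of the applicability domain and, if warranted, retraining and revalidation of the model.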

12. Conclusions

The future direction of AI in drug development will be characterized by several key trends and advancements. Firstly, AI will continue to revolutionize the early stages of drug development through the use and analysis of vast datasets, including genomics, proteomics, and clinical studies, to efficiently identify and validate promising drug targets. This process will allow drug developers to focus resources on the most relevant proteins and molecules involved in disease pathways. Additionally, generative AI and deep learning models, including variational autoencoders, generative adversarial networks, and large language models, will become even more sophisticated, especially with appropriate training, in designing novel compounds with drug-like properties, predicting molecular interactions, simulating biological responses, and optimizing chemical structures to improve efficacy and reduce off-target interactions and side effects, thus minimizing toxicity.

AI will also be deployed for the rapid identification of existing drugs that can be repurposed for new indications, thereby shortening the development timeline relative to the discovery and development of entirely new compounds, and for proposing potential synergistic drug combinations.

AI will play a more dominant role in clinical trial design by analyzing historical data to derive optimal trial protocols, endpoints, and timelines [155]. It will be used to simulate trial scenarios and allow for fine-tuning of dosage and treatment duration. AI algorithms will be used to analyze available medical data to determine suitable patient candidates for clinical trials and ensure more diverse and representative cohorts. The use of AI in trial design and management will enable real-time modifications to trials, allowing adjustments to dosages or patient cohorts based on interim results, increasing efficiency and supporting decentralized clinical trials. Furthermore, AI algorithms will enable better patient stratification via the analysis of genomic, proteomic, and clinical data, enabling targeted treatments and personalized medicine approaches, and even predicting individual patient responses to specific therapies so that treatment plans can be customized for optimal outcomes.

To conclude, this review presented the integration of QSAR modeling with artificial intelligence, molecular docking, and molecular dynamics simulations, which has meaningfully advanced the field of computational drug discovery. Classical regression methods, which remain foundational, are now being complemented by robust machine learning and deep learning approaches capable of capturing complex, nonlinear relationships in chemical data. While rigorous validation practices continue to ensure their scientific and regulatory credibility, the implementation of cloud-based platforms and open-source tools has made these technologies more accessible. As the field progresses, the convergence of data-driven methods with mechanistic modeling may be the key to building more robust, reliable, and ethical drug discovery pipelines.

Author Contributions

M.K. and M.D. jointly wrote the manuscript and revised it; L.Y. and Z.M. cross-checked the content and performed proofreading. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

Mahesh Koirala is a consultant at Therabene Inc., serving as a Computational Chemistry Scientist. Lindy Yan, Zoser Mohamed and Mario DiPaola are employees and shareholders of Therabene, Inc.

Abbreviations

QSAR, Quantitative Structure–Activity Relationship; QSPR, Quantitative Structure–Property Relationship; AI, Artificial Intelligence; ML, Machine Learning; DL, Deep Learning; MD, Molecular Dynamics; CNN, Convolutional Neural Network; RNN, Recurrent Neural Network; GNN, Graph Neural Network; GCN, Graph Convolutional Network; SVM, Support Vector Machine; RF, Random Forest; kNN, k-Nearest Neighbors; PLS, Partial Least Squares; MLR, Multiple Linear Regression; PCR, Principal Component Regression; PCA, Principal Component Analysis; RFE, Recursive Feature Elimination; RMSE, Root Mean Square Error; RMSD, Root Mean Square Deviation; LOO, Leave-One-Out; AD, Applicability Domain; ADAN, Applicability Domain Analysis; SHAP, SHapley Additive exPlanations; LIME, Local Interpretable Model-agnostic Explanations; HOMO, Highest Occupied Molecular Orbital; LUMO, Lowest Unoccupied Molecular Orbital; SMILES, Simplified Molecular Input Line Entry System; GANs, Generative Adversarial Networks; VAEs, Variational Autoencoders; PROTAC, PROteolysis Targeting Chimera; ADMET, Absorption, Distribution, Metabolism, Excretion, and Toxicity; GBSA, Generalized Born Surface Area; MM/PBSA, Molecular Mechanics/Poisson–Boltzmann Surface Area; OECD, Organisation for Economic Co-operation and Development; FDA, Food and Drug Administration; EMA, European Medicines Agency; REACH, Registration, Evaluation, Authorisation and Restriction of Chemicals; KNIME, Konstanz Information Miner; CHARMM, Chemistry at HARvard Macromolecular Mechanics; GROMACS, GROningen MAchine for Chemical Simulations; NAMD, Nanoscale Molecular Dynamics.

References

Cherkasov, A.; Muratov, E.N.; Fourches, D.; Varnek, A.; Baskin, I.I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y.C.; Todeschini, R. QSAR modeling: Where have you been? Where are you going to? J. Med. Chem. 2014, 57, 4977–5010. [Google Scholar] [CrossRef]
Wu, Z.; Ramsundar, B.; Feinberg, E.N.; Gomes, J.; Geniesse, C.; Pappu, A.S.; Leswing, K.; Pande, V. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 2018, 9, 513–530. [Google Scholar] [CrossRef] [PubMed]
Tropsha, A. Best practices for QSAR model development, validation, and exploitation. Mol. Inform. 2010, 29, 476–488. [Google Scholar] [CrossRef]
Roy, K.; Kar, S.; Das, R.N. A Primer on QSAR/QSPR Modeling: Fundamental Concepts; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
Hansch, C.; Fujita, T. p-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure. J. Am. Chem. Soc. 1964, 86, 1616–1626. [Google Scholar] [CrossRef]
Kubinyi, H. QSAR and 3D QSAR in drug design Part 1: Methodology. Drug Discov. Today 1997, 2, 457–467. [Google Scholar] [CrossRef]
De, P.; Kar, S.; Ambure, P.; Roy, K. Prediction reliability of QSAR models: An overview of various validation tools. Arch. Toxicol. 2022, 96, 1279–1295. [Google Scholar] [CrossRef]
Ren, B. Novel atomic-level-based AI topological descriptors: Application to QSPR/QSAR modeling. J. Chem. Inf. Comput. Sci. 2002, 42, 858–868. [Google Scholar] [CrossRef]
Tropsha, A.; Isayev, O.; Varnek, A.; Schneider, G.; Cherkasov, A. Integrating QSAR modelling and deep learning in drug discovery: The emergence of deep QSAR. Nat. Rev. Drug Discov. 2024, 23, 141–155. [Google Scholar] [CrossRef]
Li, F.; Hu, Q.; Zhang, X.; Sun, R.; Liu, Z.; Wu, S.; Tian, S.; Ma, X.; Dai, Z.; Yang, X. DeepPROTACs is a deep learning-based targeted degradation predictor for PROTACs. Nat. Commun. 2022, 13, 7133. [Google Scholar] [CrossRef]
Sheridan, R.P.; Baskin, I.I.; Curtarolo, S.; Isayev, O.; Tropsha, A.; Filimonov, D.; Poroikov, V.; Tetko, I.V.; Varnek, A.; Roitberg, A.E. Correction: QSAR without borders. Chem. Soc. Rev. 2020, 49, 3716, Correction in Chem. Soc. Rev. 2020, 49, 3525–3564. [Google Scholar] [CrossRef]
Talukder, M.E.K.; Atif, M.F.; Siddiquee, N.H.; Rahman, S.; Rafi, N.I.; Israt, S.; Shahir, N.F.; Islam, M.T.; Samad, A.; Wani, T.A. Molecular docking, QSAR, and simulation analyses of EGFR-targeting phytochemicals in non-small cell lung cancer. J. Mol. Struct. 2025, 1321, 139924. [Google Scholar] [CrossRef]
Kaur, N.; Gupta, S.; Pal, J.; Bansal, Y.; Bansal, G. Design of BBB permeable BACE-1 inhibitor as potential drug candidate for Alzheimer disease: 2D-QSAR, molecular docking, ADMET, molecular dynamics, MMGBSA. Comput. Biol. Chem. 2025, 116, 108371. [Google Scholar] [CrossRef]
Souza, A.S.d.; Amorim, V.M.d.F.; Soares, E.P.; de Souza, R.F.; Guzzo, C.R. Antagonistic trends between binding affinity and drug-likeness in SARS-CoV-2 MPRO inhibitors revealed by machine learning. Viruses 2025, 17, 935. [Google Scholar] [CrossRef]
Maliyakkal, N.; Kumar, S.; Bhowmik, R.; Vishwakarma, H.C.; Yadav, P.; Mathew, B. Two-dimensional QSAR-driven virtual screening for potential therapeutics against Trypanosoma cruzi. Front. Chem. 2025, 13, 1600945. [Google Scholar] [CrossRef]
Ou-Yang, S.-S.; Lu, J.-Y.; Kong, X.-Q.; Liang, Z.-J.; Luo, C.; Jiang, H. Computational drug discovery. Acta Pharmacol. Sin. 2012, 33, 1131–1140. [Google Scholar] [CrossRef]
Ouma, R.B.; Ngari, S.M.; Kibet, J.K. A review of the current trends in computational approaches in drug design and metabolism. Discov. Public Health 2024, 21, 108. [Google Scholar] [CrossRef]
Lavecchia, A. Machine-learning approaches in drug discovery: Methods and applications. Drug Discov. Today 2015, 20, 318–331. [Google Scholar] [CrossRef]
Paul, D.; Sanap, G.; Shenoy, S.; Kalyane, D.; Kalia, K.; Tekade, R.K. Artificial intelligence in drug discovery and development. Drug Discov. Today 2021, 26, 80–93. [Google Scholar] [CrossRef] [PubMed]
Roy, K.; Kar, S.; Das, R.N. Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment; Academic Press: Cambridge, MA, USA, 2015. [Google Scholar]
Le, T.T.; Fu, W.; Moore, J.H. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 2020, 36, 250–256. [Google Scholar] [CrossRef] [PubMed]
Romano, J.D.; Le, T.T.; Fu, W.; Moore, J.H. TPOT-NN: Augmenting tree-based automated machine learning with neural network estimators. Genet. Program. Evolvable Mach. 2021, 22, 207–227. [Google Scholar] [CrossRef]
Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
Das, J.; Chen, P.; Norris, D.; Padmanabha, R.; Lin, J.; Moquin, R.V.; Shen, Z.; Cook, L.S.; Doweyko, A.M.; Pitt, S. 2-Aminothiazole as a Novel Kinase Inhibitor Template. Structure—Activity Relationship Studies toward the Discovery of N-(2-Chloro-6-methylphenyl)-2-[[6-[4-(2-hydroxyethyl)-1-piperazinyl)]-2-methyl-4-pyrimidinyl] amino)]-1, 3-thiazole-5-carboxamide (Dasatinib, BMS-354825) as a Potent pan-Src Kinase Inhibitor. J. Med. Chem. 2006, 49, 6819–6832. [Google Scholar]
Vedani, A.; Dobler, M. 5D-QSAR: The key for simulating induced fit? J. Med. Chem. 2002, 45, 2139–2149. [Google Scholar] [CrossRef]
Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics: Volume I: Alphabetical Listing/Volume II: Appendices, References; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
Yap, C.W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32, 1466–1474. [Google Scholar] [CrossRef]
Landrum, G. Rdkit: Open-Source Cheminformatics Software. 2016. Available online: https://github.com/rdkit/rdkit (accessed on 2 August 2025).
Duvenaud, D.K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R.P. Convolutional networks on graphs for learning molecular fingerprints. Adv. Neural Inf. Process. Syst. 2015, 2, 2224–2232. [Google Scholar]
Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 2019, 59, 3370–3388, Correction in J. Chem. Inf. Model. 2019, 12, 5304–5305. [Google Scholar] [CrossRef] [PubMed]
Hung, C.; Gini, G. QSAR modeling without descriptors using graph convolutional neural networks: The case of mutagenicity prediction. Mol. Divers. 2021, 25, 1283–1299. [Google Scholar] [CrossRef]
Varmuza, K.; Dehmer, M.; Bonchev, D. Statistical Modelling of Molecular Descriptors in QSAR/QSPR; Wiley Online Library: Hoboken, NJ, USA, 2012. [Google Scholar]
Gini, G. QSAR methods. In In Silico Methods for Predicting Drug Toxicity; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–26. [Google Scholar]
Tetko, I.V.; Sushko, I.; Pandey, A.K.; Zhu, H.; Tropsha, A.; Papa, E.; Oberg, T.; Todeschini, R.; Fourches, D.; Varnek, A. Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: Focusing on applicability domain and overfitting by variable selection. J. Chem. Inf. Model. 2008, 48, 1733–1746. [Google Scholar] [CrossRef]
Riley, R.D.; Collins, G.S. Stability of clinical prediction models developed using statistical or machine learning methods. Biom. J. 2023, 65, 2200302. [Google Scholar] [CrossRef] [PubMed]
Cai, Z.; Zafferani, M.; Akande, O.M.; Hargrove, A.E. Quantitative Structure–Activity Relationship (QSAR) Study Predicts Small-Molecule Binding to RNA Structure. J. Med. Chem. 2022, 65, 7262–7277. [Google Scholar] [CrossRef]
Bueso-Bordils, J.I.; Antón-Fos, G.M.; Martín-Algarra, R.; Alemán-López, P.A. Overview of computational toxicology methods applied in drug and green chemical discovery. J. Xenobiot. 2024, 14, 1901–1918. [Google Scholar] [CrossRef]
Mora, J.R.; Marquez, E.A.; Pérez-Pérez, N.; Contreras-Torres, E.; Perez-Castillo, Y.; Agüero-Chapin, G.; Martinez-Rios, F.; Marrero-Ponce, Y.; Barigye, S.J. Rethinking the applicability domain analysis in QSAR models. J. Comput.-Aided Mol. Des. 2024, 38, 9. [Google Scholar] [CrossRef]
Olenginski, L.T.; Wierzba, A.J.; Laursen, S.P.; Batey, R.T. Designing small molecules targeting a cryptic RNA binding site through base displacement. Nat. Chem. Biol. 2025, 1–10. [Google Scholar] [CrossRef] [PubMed]
Wu, Z.; Zhu, M.; Kang, Y.; Leung, E.L.-H.; Lei, T.; Shen, C.; Jiang, D.; Wang, Z.; Cao, D.; Hou, T. Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets. Brief. Bioinform. 2021, 22, bbaa321. [Google Scholar] [CrossRef] [PubMed]
Zhang, F.; Wang, Z.; Peijnenburg, W.J.; Vijver, M.G. Machine learning-driven QSAR models for predicting the mixture toxicity of nanoparticles. Environ. Int. 2023, 177, 108025. [Google Scholar] [CrossRef]
Singh, K.; Ghosh, I.; Jayaprakash, V.; Jayapalan, S. Building a ML-based QSAR model for predicting the bioactivity of therapeutically active drug class with imidazole scaffold. Eur. J. Med. Chem. Rep. 2024, 11, 100148. [Google Scholar] [CrossRef]
Lenselink, E.B.; Ten Dijke, N.; Bongers, B.; Papadatos, G.; Van Vlijmen, H.W.; Kowalczyk, W.; IJzerman, A.P.; Van Westen, G.J. Beyond the hype: Deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J. Cheminform. 2017, 9, 45. [Google Scholar] [CrossRef]
Nayarisseri, A.; Khandelwal, R.; Tanwar, P.; Madhavi, M.; Sharma, D.; Thakur, G.; Speck-Planche, A.; Singh, S.K. Artificial intelligence, big data and machine learning approaches in precision medicine & drug discovery. Curr. Drug Targets 2021, 22, 631–655. [Google Scholar]
Matboli, M.; Al-Amodi, H.S.; Khaled, A.; Khaled, R.; Roushdy, M.M.; Ali, M.; Diab, G.I.; Elnagar, M.F.; Elmansy, R.A.; TAhmed, H.H. Comprehensive machine learning models for predicting therapeutic targets in type 2 diabetes utilizing molecular and biochemical features in rats. Front. Endocrinol. 2024, 15, 1384984. [Google Scholar] [CrossRef]
Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random forest: A classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958. [Google Scholar] [CrossRef]
Koutsoukas, A.; Monaghan, K.J.; Li, X.; Huan, J. Deep-learning: Investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. J. Cheminform. 2017, 9, 42. [Google Scholar] [CrossRef]
Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; p. 30. [Google Scholar]
Mazanetz, M.P.; Marmon, R.J.; Reisser, C.B.T.; Morao, I. Drug discovery applications for KNIME: An open source data mining platform. Curr. Top. Med. Chem. 2012, 12, 1965–1979. [Google Scholar] [CrossRef]
Niazi, S.K.; Mariam, Z. Recent Advances in Machine-Learning-Based Chemoinformatics: A Comprehensive Review. Int. J. Mol. Sci. 2023, 24, 11488. [Google Scholar] [CrossRef]
Van Tilborg, D.; Alenicheva, A.; Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 2022, 62, 5938–5951. [Google Scholar] [CrossRef] [PubMed]
Scholz, G.E.; Linard, B.; Romashchenko, N.; Rivals, E.; Pardi, F. Rapid screening and detection of inter-type viral recombinants using phylo-k-mers. Bioinformatics 2020, 36, 5351–5360. [Google Scholar] [CrossRef] [PubMed]
Kalian, A.D.; Benfenati, E.; Osborne, O.J.; Gott, D.; Potter, C.; Dorne, J.-L.C.; Guo, M.; Hogstrand, C. Exploring dimensionality reduction techniques for deep learning driven QSAR models of mutagenicity. Toxics 2023, 11, 572. [Google Scholar] [CrossRef]
Noviandy, T.R.; Idroes, G.M.; Maulana, A.; Afidh, R.P.F.; Idroes, R. Optimizing hepatitis C virus inhibitor identification with LightGBM and tree-structured parzen estimator sampling. Eng. Technol. Appl. Sci. Res. 2024, 14, 18810–18817. [Google Scholar] [CrossRef]
Goh, G.B.; Hodas, N.O.; Vishnu, A. Deep learning for computational chemistry. J. Comput. Chem. 2017, 38, 1291–1307. [Google Scholar] [CrossRef]
Zhong, S.; Hu, J.; Yu, X.; Zhang, H. Molecular image-convolutional neural network (CNN) assisted QSAR models for predicting contaminant reactivity toward OH radicals: Transfer learning, data augmentation and model interpretation. Chem. Eng. J. 2021, 408, 127998. [Google Scholar] [CrossRef]
Bisoi, A.V.; Shreyas, V.; Siguenza, J.; Ramsundar, B. DeepChem-Variant: A Modular Open Source Framework for Genomic Variant Calling. In Proceedings of the Championing Open-Source Development in ML Workshop@ ICML25, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
Heid, E.; Greenman, K.P.; Chung, Y.; Li, S.-C.; Graff, D.E.; Vermeire, F.H.; Wu, H.; Green, W.H.; McGill, C.J. Chemprop: A machine learning package for chemical property prediction. J. Chem. Inf. Model. 2023, 64, 9–17. [Google Scholar] [CrossRef]
Chithrananda, S.; Grand, G.; Ramsundar, B. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. arXiv 2020, arXiv:2010.09885. [Google Scholar]
Li, J.; Jiang, X. Mol-BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction. Wirel. Commun. Mob. Comput. 2021, 2021, 7181815. [Google Scholar] [CrossRef]
Olivecrona, M.; Blaschke, T.; Engkvist, O.; Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 2017, 9, 48. [Google Scholar] [CrossRef]
Hajim, W.I.; Zainudin, S.; Daud, K.M.; Alheeti, K. Optimized models and deep learning methods for drug response prediction in cancer treatments: A review. PeerJ Comput. Sci. 2024, 10, e1903. [Google Scholar] [CrossRef]
Ugurlu, S. Machine Learning Applications in Drug Discovery. ChemRxiv. 2024. [Google Scholar] [CrossRef]
Gao, K.; Wang, R.; Chen, J.; Cheng, L.; Frishcosy, J.; Huzumi, Y.; Qiu, Y.; Schluckbier, T.; Wei, X.; Wei, G.-W. Methodology-centered review of molecular modeling, simulation, and prediction of SARS-CoV-2. Chem. Rev. 2022, 122, 11287–11368. [Google Scholar] [CrossRef] [PubMed]
Peng, L.; Wang, F.; Wang, Z.; Tan, J.; Huang, L.; Tian, X.; Liu, G.; Zhou, L. Cell–cell communication inference and analysis in the tumour microenvironments from single-cell transcriptomics: Data resources and computational strategies. Brief. Bioinform. 2022, 23, bbac234. [Google Scholar] [CrossRef]
Ahmad, A.; Fröhlich, H. Towards clinically more relevant dissection of patient heterogeneity via survival-based Bayesian clustering. Bioinformatics 2017, 33, 3558–3566. [Google Scholar] [CrossRef] [PubMed]
Kim, H.; Lee, J.; Ahn, S.; Lee, J.R. A merged molecular representation learning for molecular properties prediction with a web-based service. Sci. Rep. 2021, 11, 11028. [Google Scholar] [CrossRef]
Altae-Tran, H.; Ramsundar, B.; Pappu, A.S.; Pande, V. Low data drug discovery with one-shot learning. ACS Cent. Sci. 2017, 3, 283–293. [Google Scholar] [CrossRef]
Trott, O.; Olson, A.J. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 2010, 31, 455–461. [Google Scholar]
Halgren, T.A.; Murphy, R.B.; Friesner, R.A.; Beard, H.S.; Frye, L.L.; Pollard, W.T.; Banks, J.L. Glide: A new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J. Med. Chem. 2004, 47, 1750–1759. [Google Scholar] [CrossRef]
Verdonk, M.L.; Cole, J.C.; Hartshorn, M.J.; Murray, C.W.; Taylor, R.D. Improved protein–ligand docking using GOLD. Proteins Struct. Funct. Bioinform. 2003, 52, 609–623. [Google Scholar]
Liu, N.; Xu, Z. Using LeDock as a docking tool for computational drug design. IOP Conf. Ser. Earth Environ. Sci. 2019, 218, 012143. [Google Scholar] [CrossRef]
Pagadala, N.S.; Syed, K.; Tuszynski, J. Software for molecular docking: A review. Biophys. Rev. 2017, 9, 91–102. [Google Scholar] [CrossRef] [PubMed]
Yuriev, E.; Ramsland, P.A. Latest developments in molecular docking: 2010–2011 in review. J. Mol. Recognit. 2013, 26, 215–239. [Google Scholar] [CrossRef]
Hollingsworth, S.A.; Dror, R.O. Molecular dynamics simulation for all. Neuron 2018, 99, 1129–1143. [Google Scholar] [CrossRef]
Abraham, M.J.; Murtola, T.; Schulz, R.; Páll, S.; Smith, J.C.; Hess, B.; Lindahl, E. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 2015, 1, 19–25. [Google Scholar] [CrossRef]
Huang, J.; MacKerell, A.D., Jr. CHARMM36 all-atom additive protein force field: Validation based on comparison to NMR data. J. Comput. Chem. 2013, 34, 2135–2145. [Google Scholar] [CrossRef] [PubMed]
Phillips, J.C.; Braun, R.; Wang, W.; Gumbart, J.; Tajkhorshid, E.; Villa, E.; Chipot, C.; Skeel, R.D.; Kale, L.; Schulten, K. Scalable molecular dynamics with NAMD. J. Comput. Chem. 2005, 26, 1781–1802. [Google Scholar] [CrossRef] [PubMed]
Salomon-Ferrer, R.; Case, D.A.; Walker, R.C. An overview of the Amber biomolecular simulation package. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2013, 3, 198–210. [Google Scholar] [CrossRef]
Kumari, R.; Kumar, R.; Open Source Drug Discovery Consortium; Lynn, A. g_mmpbsa—A GROMACS tool for high-throughput MM-PBSA calculations. J. Chem. Inf. Model. 2014, 54, 1951–1962. [Google Scholar] [CrossRef] [PubMed]
Koirala, M.; Fagerquist, C.K. Binding Free Energy Analysis of Colicin D, E3 and E8 to Their Respective Cognate Immunity Proteins Using Computational Simulations. Molecules 2025, 30, 1277. [Google Scholar] [CrossRef]
Koirala, M.; DiPaola, M. Targeting CDK9 in Cancer: An Integrated Approach of Combining In Silico Screening with Experimental Validation for Novel Degraders. Curr. Issues Mol. Biol. 2024, 46, 1713–1730. [Google Scholar] [CrossRef]
Koirala, M.; Alexov, E. Ab-initio binding of barnase–barstar with DelPhiForce steered Molecular Dynamics (DFMD) approach. J. Theor. Comput. Chem. 2020, 19, 2050016. [Google Scholar] [CrossRef]
Shi, W.; Yang, H.; Xie, L.; Yin, X.-X.; Zhang, Y. A review of machine learning-based methods for predicting drug–target interactions. Health Inf. Sci. Syst. 2024, 12, 30. [Google Scholar] [CrossRef] [PubMed]
Liu, H.; Hu, B.; Chen, P.; Wang, X.; Wang, H.; Wang, S.; Wang, J.; Lin, B.; Cheng, M. Docking score ML: Target-specific machine learning models improving docking-based virtual screening in 155 targets. J. Chem. Inf. Model. 2024, 64, 5413–5426. [Google Scholar] [CrossRef] [PubMed]
Lu, S.; He, X.; Yang, Z.; Chai, Z.; Zhou, S.; Wang, J.; Rehman, A.U.; Ni, D.; Pu, J.; Sun, J. Activation pathway of a G protein-coupled receptor uncovers conformational intermediates as targets for allosteric drug design. Nat. Commun. 2021, 12, 4721. [Google Scholar] [CrossRef]
Zou, Y.; Ma, D.; Wang, Y. The PROTAC technology in drug development. Cell Biochem. Funct. 2019, 37, 21–30. [Google Scholar] [CrossRef]
Troup, R.I.; Fallan, C.; Baud, M.G. Current strategies for the design of PROTAC linkers: A critical review. Explor. Target. Anti-Tumor Ther. 2020, 1, 273. [Google Scholar] [CrossRef]
Koirala, M.; DiPaola, M. Overcoming cancer resistance: Strategies and modalities for effective treatment. Biomedicines 2024, 12, 1801. [Google Scholar] [CrossRef] [PubMed]
Ribes, S.; Nittinger, E.; Tyrchan, C.; Mercado, R. Modeling PROTAC degradation activity with machine learning. Artif. Intell. Life Sci. 2024, 6, 100104, Erratum in Artif. Intell. Life Sci. 2024, 6, 100114. [Google Scholar] [CrossRef]
Speck-Planche, A.; Scotti, M.T. BET bromodomain inhibitors: Fragment-based in silico design using multi-target QSAR models. Mol. Divers. 2019, 23, 555–572. [Google Scholar] [CrossRef] [PubMed]
Poongavanam, V.; Kolling, F.; Giese, A.; Goller, A.H.; Lehmann, L.; Meibom, D.; Kihlberg, J. Predictive modeling of PROTAC cell permeability with machine learning. ACS Omega 2023, 8, 5901–5916. [Google Scholar] [CrossRef]
Jarusiewicz, J.A.; Yoshimura, S.; Mayasundari, A.; Actis, M.; Aggarwal, A.; McGowan, K.; Yang, L.; Li, Y.; Fu, X.; Mishra, V. Phenyl dihydrouracil: An alternative cereblon binder for PROTAC design. ACS Med. Chem. Lett. 2023, 14, 141–145. [Google Scholar] [CrossRef]
Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
Tunjic, T.M.; Weber, N.; Brunsteiner, M. Computer aided drug design in the development of proteolysis targeting chimeras. Comput. Struct. Biotechnol. J. 2023, 21, 2058–2067. [Google Scholar] [CrossRef]
Wu, L.; Chen, Y.; Shen, K.; Guo, X.; Gao, H.; Li, S.; Pei, J.; Long, B. Graph neural networks for natural language processing: A survey. Found. Trends® Mach. Learn. 2023, 16, 119–328, Erratum in AI Open 2024, 5, 100001. [Google Scholar] [CrossRef]
Liu, J.; Roy, M.J.; Isbel, L.; Li, F. Accurate PROTAC-targeted degradation prediction with DegradeMaster. Bioinformatics 2025, 41 (Suppl. S1), i342–i351. [Google Scholar] [CrossRef]
Abouzied, A.S.; Alshammari, B.; Kari, H.; Huwaimel, B.; Alqarni, S.; Kassab, S.E. AI-DPAPT: A Machine Learning Framework for Predicting PROTAC Activity. Mol. Divers. 2025, 29, 2995–3007. [Google Scholar] [CrossRef]
Imrie, F.; Bradley, A.R.; van der Schaar, M.; Deane, C.M. Deep generative models for 3D linker design. J. Chem. Inf. Model. 2020, 60, 1983–1995. [Google Scholar] [CrossRef] [PubMed]
Igashov, I.; Stärk, H.; Vignac, C.; Schneuing, A.; Satorras, V.G.; Frossard, P.; Welling, M.; Bronstein, M.; Correia, B. Equivariant 3D-conditional diffusion model for molecular linker design. Nat. Mach. Intell. 2024, 6, 417–427. [Google Scholar] [CrossRef]
Li, F.; Hu, Q.; Zhou, Y.; Yang, H.; Bai, F. DiffPROTACs is a deep learning-based generator for proteolysis targeting chimeras. Brief. Bioinform. 2024, 25. [Google Scholar] [CrossRef]
Mangalathu, S.; Hwang, S.-H.; Jeon, J.-S. Failure mode and effects analysis of RC members based on machine-learning-based SHapley Additive exPlanations (SHAP) approach. Eng. Struct. 2020, 219, 110927. [Google Scholar] [CrossRef]
Ekanayake, I.; Meddage, D.; Rathnayake, U. A novel approach to explain the black-box nature of machine learning in compressive strength predictions of concrete using Shapley additive explanations (SHAP). Case Stud. Constr. Mater. 2022, 16, e01059. [Google Scholar] [CrossRef]
Xie, L.; Xie, L. Elucidation of genome-wide understudied proteins targeted by PROTAC-induced degradation using interpretable machine learning. PLoS Comput. Biol. 2023, 19, e1010974. [Google Scholar] [CrossRef]
Yi, J.; Shi, S.; Fu, L.; Yang, Z.; Nie, P.; Lu, A.; Wu, C.; Deng, Y.; Hsieh, C.; Zeng, X. OptADMET: A web-based tool for substructure modifications to improve ADMET properties of lead compounds. Nat. Protoc. 2024, 19, 1105–1121. [Google Scholar] [CrossRef]
Swanson, K.; Walther, P.; Leitz, J.; Mukherjee, S.; Wu, J.C.; Shivnaraine, R.V.; Zou, J. ADMET-AI: A machine learning ADMET platform for evaluation of large-scale chemical libraries. Bioinformatics 2024, 40, btae416. [Google Scholar] [CrossRef]
Daoud, N.E.-H.; Borah, P.; Deb, P.K.; Venugopala, K.N.; Hourani, W.; Alzweiri, M.; Bardaweel, S.K.; Tiwari, V. ADMET profiling in drug discovery and development: Perspectives of in silico, in vitro and integrated approaches. Curr. Drug Metab. 2021, 22, 503–522. [Google Scholar] [CrossRef]
Raju, B.; Verma, H.; Narendra, G.; Sapra, B.; Silakari, O. Multiple machine learning, molecular docking, and ADMET screening approach for identification of selective inhibitors of CYP1B1. J. Biomol. Struct. Dyn. 2022, 40, 7975–7990. [Google Scholar] [CrossRef]
Abdelwahab, A.A.; Elattar, M.A.; Fawzi, S.A. Advancing ADMET prediction for major CYP450 isoforms: Graph-based models, limitations, and future directions. Biomed. Eng. OnLine 2025, 24, 93. [Google Scholar] [CrossRef]
Göller, A.H.; Kuhnke, L.; Ter Laak, A.; Meier, K.; Hillisch, A. Machine learning applied to the modeling of pharmacological and ADMET endpoints. Artif. Intell. Drug Des. 2021, 2390, 61–101. [Google Scholar]
Zonghuang, X. Machine learning-based quantitative structure-activity relationship and ADMET prediction models for ERα activity of anti-breast cancer drug candidates. Wuhan Univ. J. Nat. Sci. 2023, 28, 257–270. [Google Scholar]
Dong, J.; Wang, N.-N.; Yao, Z.-J.; Zhang, L.; Cheng, Y.; Ouyang, D.; Lu, A.-P.; Cao, D.-S. ADMETlab: A platform for systematic ADMET evaluation based on a comprehensively collected ADMET database. J. Cheminform. 2018, 10, 29. [Google Scholar] [CrossRef] [PubMed]
Pires, D.E.; Blundell, T.L.; Ascher, D.B. pkCSM: Predicting small-molecule pharmacokinetic and toxicity properties using graph-based signatures. J. Med. Chem. 2015, 58, 4066–4072. [Google Scholar] [CrossRef] [PubMed]
Daina, A.; Michielin, O.; Zoete, V. SwissADME: A free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci. Rep. 2017, 7, 42717. [Google Scholar] [CrossRef] [PubMed]
Banerjee, P.; Eckert, A.O.; Schrey, A.K.; Preissner, R. ProTox-II: A webserver for the prediction of toxicity of chemicals. Nucleic Acids Res. 2018, 46, W257–W263. [Google Scholar] [CrossRef] [PubMed]
Martin, T.; Harten, P.; Young, D. TEST (Toxicity Estimation Software Tool); Version 4.1; US Environmental Protection Agency: Washington, DC, USA, 2012.
Benfenati, E.; Manganaro, A.; Gini, G.C. VEGA-QSAR: AI inside a platform for predictive toxicology. CEUR Workshop Proc. 2013, 1107, 21–28. [Google Scholar]
Cheng, F.; Li, W.; Zhou, Y.; Shen, J.; Wu, Z.; Liu, G.; Lee, P.W.; Tang, Y. admetSAR: A comprehensive source and free tool for assessment of chemical ADMET properties. J. Chem. Inf. Model. 2012, 52, 3099–3105, Correction in J. Chem. Inf. Model. 2019, 59, 4959. [Google Scholar] [CrossRef]
Ioakimidis, L.; Thoukydidis, L.; Mirza, A.; Naeem, S.; Reynisson, J. Benchmarking the reliability of QikProp. Correlation between experimental and predicted values. QSAR Comb. Sci. 2008, 27, 445–456. [Google Scholar] [CrossRef]
Advanced Chemistry Development, Inc. Available online: https://www.acdlabs.com (accessed on 11 August 2025).
Lhasa Limited. DEREK Nexus; Lhasa Limited: Leeds, UK. Available online: https://www.lhasalimited.org (accessed on 11 August 2025).
BIOVIA Discovery Studio Solutions, Version 2.1; Dassault Systèmes: San Diego, CA, USA. Available online: https://www.3ds.com/products/biovia/discovery-studio (accessed on 11 August 2025).
ADMET Predictor, Version 12; Simulations Plus, Inc.: Lancaster, CA, USA, 2025. Available online: https://www.businesswire.com (accessed on 11 August 2025).
StarDrop; Optibrium Ltd.: Cambridge, UK, 2025. Available online: https://optibrium.com (accessed on 11 August 2025).
Chemaxon. Available online: https://www.chemaxon.com (accessed on 12 August 2025).
Patlewicz, G.; Jeliazkova, N.; Safford, R.; Worth, A.; Aleksiev, B. An evaluation of the implementation of the Cramer classification scheme in the Toxtree software. SAR QSAR Environ. Res. 2008, 19, 495–524. [Google Scholar] [CrossRef]
U.S. Environmental Protection Agency. Toxicity Forecasting (ToxCast). Available online: https://www.epa.gov/comptox-tools/toxicity-forecasting-toxcast (accessed on 11 August 2025).
Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B. PubChem 2023 update. Nucleic Acids Res. 2023, 51, D1373–D1380. [Google Scholar] [CrossRef]
Mendez, D.; Gaulton, A.; Bento, A.P.; Chambers, J.; De Veij, M.; Félix, E.; Magariños, M.P.; Mosquera, J.F.; Mutowo, P.; Nowotka, M. ChEMBL: Towards direct deposition of bioassay data. Nucleic Acids Res. 2019, 47, D930–D940. [Google Scholar]
Wishart, D.S.; Feunang, Y.D.; Guo, A.C.; Lo, E.J.; Marcu, A.; Grant, J.R.; Sajed, T.; Johnson, D.; Li, C.; Sayeeda, Z. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 2018, 46, D1074–D1082. [Google Scholar] [CrossRef]
Lumumba, V.W.; Kiprotich, D.; Lemasulani Mpaine, M.; Grace Makena, N.; Daniel Kavita, M. Comparative analysis of Cross-Validation techniques: LOOCV, K-folds Cross-Validation, and repeated K-folds Cross-Validation in machine learning models. Am. J. Theor. Appl. Stat. 2024, 13, 127–137. [Google Scholar] [CrossRef]
Gramatica, P. Principles of QSAR modeling: Comments and suggestions from personal experience. Int. J. Quant. Struct.-Prop. Relatsh. (IJQSPR) 2020, 5, 61–97. [Google Scholar] [CrossRef]
Sahigara, F.; Mansouri, K.; Ballabio, D.; Mauri, A.; Consonni, V.; Todeschini, R. Comparison of different approaches to define the applicability domain of QSAR models. Molecules 2012, 17, 4791–4810. [Google Scholar] [CrossRef] [PubMed]
Cassotti, M.; Ballabio, D.; Todeschini, R.; Consonni, V. A similarity-based QSAR model for predicting acute toxicity towards the fathead minnow (Pimephales promelas). SAR QSAR Environ. Res. 2015, 26, 217–243. [Google Scholar] [CrossRef]
Chirico, N.; Gramatica, P. Real external predictivity of QSAR models. Part 2. New intercomparable thresholds for different validation criteria and the need for scatter plot inspection. J. Chem. Inf. Model. 2012, 52, 2044–2058. [Google Scholar] [CrossRef]
Golbraikh, A.; Tropsha, A. Beware of q²! J. Mol. Graph. Model. 2002, 20, 269–276. [Google Scholar] [CrossRef]
Organisation for Economic Co-Operation and Development. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q) SAR] Models; Organisation for Economic Co-Operation and Development: Paris, France, 2014.
Vamathevan, J.; Clark, D.; Czodrowski, P.; Dunham, I.; Ferran, E.; Lee, G.; Li, B.; Madabhushi, A.; Shah, P.; Spitzer, M. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 2019, 18, 463–477. [Google Scholar] [CrossRef]
Bender, A.; Glen, R.C. Molecular similarity: A key technique in molecular informatics. Org. Biomol. Chem. 2004, 2, 3204–3218. [Google Scholar] [CrossRef]
Fu, X.; Liu, L.; Guan, W.W.; Kalra, Y.; Bao, S.; Kötter, T.; Sturm, K. Advancing replicable and reproducible GIScience: An approach with KNIME. Cartogr. Geogr. Inf. Sci. 2025, 1–21. [Google Scholar] [CrossRef]
Neves, B.J.; Moreira-Filho, J.T.; Silva, A.C.; Borba, J.V.; Mottin, M.; Alves, V.M.; Braga, R.C.; Muratov, E.N.; Andrade, C.H. Automated framework for developing predictive machine learning models for data-driven drug discovery. J. Braz. Chem. Soc. 2021, 32, 110–122. [Google Scholar] [CrossRef]
Zdrazil, B.; Felix, E.; Hunter, F.; Manners, E.J.; Blackshaw, J.; Corbett, S.; De Veij, M.; Ioannidis, H.; Lopez, D.M.; Mosquera, J.F. The ChEMBL Database in 2023: A drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 2024, 52, D1180–D1192. [Google Scholar] [CrossRef]
Irwin, J.J.; Shoichet, B.K. ZINC—A free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 2005, 45, 177–182. [Google Scholar] [CrossRef] [PubMed]
Uzundurukan, A.; Nelson, M.; Teske, C.; Islam, M.S.; Mohamed, E.; Christy, J.V.; Martin, H.-J.; Muratov, E.; Glover, S.; Fuoco, D. Meta-analysis and review of in silico methods in drug discovery—Part 1: Technological evolution and trends from big data to chemical space. Pharmacogenom. J. 2025, 25, 8. [Google Scholar] [CrossRef] [PubMed]
Gaulton, A.; Bellis, L.J.; Bento, A.P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B. ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40, D1100–D1107. [Google Scholar] [CrossRef] [PubMed]
Vinogradov, V.; Izmailov, I.; Steshin, S.; Nguyen, K.T. Bioptic--A Target-Agnostic Potency-Based Small Molecules Search Engine. arXiv 2024, arXiv:2406.14572. [Google Scholar]
Ramsundar, B.; Eastman, P.; Walters, P.; Pande, V. Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More; O’Reilly Media: Sebastopol, CA, USA, 2019. [Google Scholar]
Nene, L.; Flepisi, B.T.; Brand, S.J.; Basson, C.; Balmith, M. Evolution of drug development and regulatory affairs: The demonstrated power of artificial intelligence. Clin. Ther. 2024, 46, e6–e14. [Google Scholar] [CrossRef]
Blanco-Gonzalez, A.; Cabezon, A.; Seco-Gonzalez, A.; Conde-Torres, D.; Antelo-Riveiro, P.; Pineiro, A.; Garcia-Fandino, R. The role of AI in drug discovery: Challenges, opportunities, and strategies. Pharmaceuticals 2023, 16, 891. [Google Scholar] [CrossRef]
Mirakhori, F.; Niazi, S.K. Harnessing the AI/ML in drug and biological products discovery and development: The regulatory perspective. Pharmaceuticals 2025, 18, 47. [Google Scholar] [CrossRef]
ICH Guideline. Assessment and control of DNA reactive (mutagenic) impurities in pharmaceuticals to limit potential carcinogenic risk M7. In Proceedings of the International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH), Geneva, Switzerland, 8–13 November 2014. [Google Scholar]
Okumoto, A.; Nomura, Y.; Maki, K.; Ogawa, T.; Onodera, H.; Shikano, M.; Okabe, N. Addressing practical issues in the smooth implementation of revised guidelines for non-clinical studies of vaccines for infectious disease prevention. Regul. Toxicol. Pharmacol. 2023, 142, 105413. [Google Scholar] [CrossRef]
Lundberg, S.M.; Nair, B.; Vavilala, M.S.; Horibe, M.; Eisses, M.J.; Adams, T.; Liston, D.E.; Low, D.K.-W.; Newman, S.-F.; Kim, J. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2018, 2, 749–760. [Google Scholar] [CrossRef] [PubMed]
Rodríguez-Pérez, R.; Bajorath, J. Interpretation of machine learning models using shapley values: Application to compound potency and multi-target activity predictions. J. Comput.-Aided Mol. Des. 2020, 34, 1013–1026. [Google Scholar] [CrossRef] [PubMed]
Wilczok, D.; Zhavoronkov, A. Progress, pitfalls, and impact of AI-driven clinical trials. Clin. Pharmacol. Ther. 2025, 117, 887–890. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Overview of the shift from trial-and-error methods to personalized drug design, spanning classical approaches through modern AI techniques.

Table 1. List of ADMET prediction software programs.

Name	Type	Key Features
ADMETlab (v3.0)	Open-Source	Online multi-endpoint ADMET and toxicity prediction [112].
pkCSM (2015 release)	Open-Source	Graph-based signatures for predicting pharmacokinetic and toxicity properties [113].
SwissADME (2017 release)	Open-Source	Online tool for ADME, physicochemical, and drug-likeness evaluation [114].
ProTox-II (v2, 2018)	Open-Source	Toxicity endpoints including LD50, hepatotoxicity [115].
T.E.S.T. (v5.1.1)	Open-Source	EPA tool for QSAR-based toxicity estimates [116].
DeepChem (v2.x)	Open-Source	Python ML/AI library for molecular modeling [2].
VEGA QSAR (v1.2.3)	Open-Source	Rule-based QSAR toxicity predictor [117].
admetSAR (v2.0, 2019)	Open-Source	Predictive models for ADMET endpoints [118].
ADMET-AI (2023 release)	Open-Source	ML-based tool for fast and accurate ADMET predictions [106].
CtoxPred3 (v3)	Open-Source	In silico prediction of peptide toxicity.
Schrödinger QikProp (v6.2)	Commercial	50+ ADME properties, integrated in Schrödinger [119].
ACD/Percepta (v2023.1)	Commercial	Physicochemical, ADME, and toxicity predictions [120].
DEREK Nexus (v6.x)	Commercial	Rule-based toxicology and safety predictions [121].
TOPKAT (BIOVIA) (v6.2)	Commercial	QSAR-based toxicity (mutagenicity, carcinogenicity) [122].
ADMET Predictor (v11.5)	Commercial	175+ ADMET properties and metabolism simulation [123].
StarDrop (v7.3)	Commercial	ADMET modeling with compound prioritization [124].
ChemAxon’s cxcalc (v23.15)	Commercial	Command-line ADME prediction tool [125].
Toxtree (v2.6.13)	Open-Source	Decision tree-based toxicity analysis [126].

Table 2. ADMET databases for AI model training.

Database	Description
Tox21/ToxCast	Large public databases with toxicity screening data for thousands of compounds [127].
PubChem BioAssay	ADMET-related assay data and curated datasets for ML model development [128].
ChEMBL	Contains bioactivity data including absorption, CYP inhibition, and off-target toxicity [129].
DrugBank	Pharmacokinetic and toxicity data for approved drugs, useful for validation and model tuning [130].

Table 3. Computational Tools and Their Applications in QSAR-Based Drug Discovery.

Tool/Database	Application	Reference
RDKit (2019 release)	Molecular descriptor generation, structure processing	[28]
KNIME (2015 release)	Workflow automation and model building	[49]
ChEMBL (2017 release)	Bioactivity database for QSAR modeling	[129,142]
DeepChem (2018 release)	Deep learning platform for cheminformatics	[57]
ZINC Database (2012 release)	Commercially available compound repository	[143]
Chemprop (2019 release)	SMILES-based deep learning QSAR modeling	[58]
Google Colab (2018)	Cloud-based Python notebook environment	[57]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Koirala, M.; Yan, L.; Mohamed, Z.; DiPaola, M. AI-Integrated QSAR Modeling for Enhanced Drug Discovery: From Classical Approaches to Deep Learning and Structural Insight. Int. J. Mol. Sci. 2025, 26, 9384. https://doi.org/10.3390/ijms26199384

