Next Article in Journal
Recent Progress in the Vaccine Development Against Epstein–Barr Virus
Previous Article in Journal
Beta-Sitosterol Enhances Classical Swine Fever Virus Infection: Insights from RNA-Seq Analysis
Previous Article in Special Issue
Broad-Spectrum Antiviral Activity of Pyridobenzothiazolone Analogues Against Respiratory Viruses
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Antagonistic Trends Between Binding Affinity and Drug-Likeness in SARS-CoV-2 Mpro Inhibitors Revealed by Machine Learning

by
Anacleto Silva de Souza
*,
Vitor Martins de Freitas Amorim
,
Eduardo Pereira Soares
,
Robson Francisco de Souza
and
Cristiane Rodrigues Guzzo
Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, Sao Paulo 05508-000, Brazil
*
Author to whom correspondence should be addressed.
Viruses 2025, 17(7), 935; https://doi.org/10.3390/v17070935
Submission received: 11 June 2025 / Revised: 26 June 2025 / Accepted: 26 June 2025 / Published: 30 June 2025
(This article belongs to the Special Issue Advances in Small-Molecule Viral Inhibitors)

Abstract

The SARS-CoV-2 main protease (Mpro) is a validated therapeutic target for inhibiting viral replication. Few compounds have advanced clinically, underscoring the difficulty in optimizing both target affinity and drug-like properties. To address this challenge, we integrated machine learning (ML), molecular docking, and molecular dynamics (MD) simulations to investigate the balance between pharmacodynamic (PD) and pharmacokinetic (PK) properties in Mpro inhibitor design. We developed ML models to classify Mpro inhibitors based on experimental IC50 data, combining molecular descriptors with structural insights from MD simulations. Our Support Vector Machine (SVM) model achieved strong performance (training accuracy = 0.84, ROC AUC = 0.91; test accuracy = 0.79, ROC AUC = 0.86), while our Logistic Regression model (training accuracy = 0.78, ROC AUC = 0.85; test accuracy = 0.76, ROC AUC = 0.83). Notably, PK descriptors often exhibited opposing trends to binding affinity: hydrophilic features enhanced binding affinity but compromised PK properties, whereas hydrogen bonding, hydrophobic, and π–π interactions in Mpro subsites S2 and S3/S4 are fundamental for binding affinity. Our findings highlight the need for a balanced approach in Mpro inhibitor design, strategically targeting these subsites may balance PD and PK properties. For the first time, we demonstrate antagonistic trends between pharmacokinetic (PK) and pharmacodynamic (PD) features through the integrated application of ML/MD. This study provides a computational framework for rational Mpro inhibitors, combining ML and MD to investigate the complex interplay between enzyme inhibition and drug likeness. These insights may guide the hit-to-lead optimization of the novel next-generation Mpro inhibitors of SARS-CoV-2 with preclinical and clinical potential.

1. Introduction

Despite efforts in vaccination in several countries, more than 777 million cases and 7.05 million deaths by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) have been reported since the beginning of the Coronavirus disease 2019 (COVID-19) pandemic until 14 May 2025 [1]. Several SARS-CoV-2 variants of concern (VOC) such as the Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), and Omicron (B.1.1.529) and its sublineages emerged, all showing higher transmission rates and immune system evasion than the Wuhan strain (wild type, WT) [2,3]. In particular, the Omicron variant emerged at the end of November 2021, and within three months, it rapidly prevailed over the Delta variant [2,3]. The lineage B.1.529 (Clade 21M) gave rise to its sublineages, BA.1 (Clade 21K), BA.2 (Clade 21L), BA.2.12.1 (Clade 22C), BA.4 (Clade 22A), and BA.5 (Clade 22B). As of 14 May 2025, the proportion of Omicron variants summed almost 100%, in which the LP.8.1 (Clade 25A) represented more than ~34% followed by XEC sub-variant (Clade 24F) with 24%, NB.1.8.1 (Clade 25B) with 16%, and KP.3.1.1 with 12% [3]. Recently, the discovery of the HKU5-CoV lineage 2 (HKU5-CoV-2) in bats has raised concerns, as this virus efficiently binds human ACE2 (hACE2) and exhibits broad host tropism [4]. Comparative genomic analyses reveal significant sequence divergence in the Spike protein-coding region relative to other Coronaviruses, underscoring the potential for novel cross-species transmission [4].
At the molecular level, after viral infection, viral mRNA is released into the cytosol of the host cell and translated into two polyproteins, pp1a and pp1ab, which compose non-structural proteins responsible for different functions in the viral infection process [5]. These polyproteins are cleaved by two proteases, PLpro (papain-like protease), responsible for cleaving three sites in the pp1a and pp1ab, and 3CLpro (3-chymotrypsin-like protease), also known as Mpro (main protease), that cleaves at least eleven cleavage sites [5,6]. The Mpro is a cysteine protease and plays a key role in the viral infection process. Its active form is a homodimer, in which each protomer consists of three domains, known as domains I, II, and III. In a previous study, we suggested that residues His41–Cys145–Asp187 compose the catalytic triad of Mpro, being essential for the catalytic process [5]. Cysteine is a conserved element within the Coronaviridae family, functioning as a nucleophile crucial for the proteolytic process [5,7]. The 3CLpro protein’s active site can be divided into the subsites S1, S1′, S2, and S3/S4, adhering to the Schecter–Berger nomenclature for proteases [5,8]. This arrangement underscores the critical contribution of specific residues to enzymatic activity in the context of viral infection.
To date, more than 55,429 chemical structures have been experimentally evaluated against Mpro and reported in the scientific literature [9]. The extensive diversity in chemical compounds arises from the large active site with distinct subsites, making them highly potent against the main enzyme of the novel Coronavirus. Although this diversity may lead to the existence of a broad range of bioactive compounds, as shown in Figure 1, it is not clear why only a few 3CLpro antiviral compounds are available for clinical use. Additionally, previous studies show that 3CLpro inhibitors of MERS-CoV, SARS-CoV, and SARS-CoV-2 have different binding affinities [10], thus raising the possibility that 3CLpro inhibitors could be less effective in VOCs [5,11].
A substantial number of chemical structures have been screened against SARS-CoV-2 Mpro [9], and only a small fraction of the compounds have advanced from preclinical to clinical stages [5]. In our recent study, we explored pharmacokinetic (PK) challenges that may significantly influence the efficacy of these compounds in preclinical and clinical phases [5]. One example is Boceprevir, a hepatitis C virus protease inhibitor, which demonstrated promising affinity for Mpro and antiviral activity against SARS-CoV-2 (IC50 = 4.1 μM and EC50 = 1.3 μM) [5,12]. In the case of hepatitis, this antiviral is only available by controlled prescription. In the case of possible use in SARS-CoV-2, further studies are required to advance this inhibitor to clinical trials due to PK limitations, including human cytochrome P450 isoform (CYP3A4) inhibition [5,12]. To address similar PK challenges, the combination of Nirmatrelvir with Ritonavir—marketed as Paxlovid—was developed, utilizing Ritonavir’s CYP3A4 affinity and plasma protein binding properties to enhance therapeutic efficacy [5]. Despite these advances, a lack of predictive models to guide the optimization of PK properties continues to limit the search for novel Mpro inhibitors. To address this gap, we investigated the interplay between PK and pharmacodynamic (PD) properties using machine learning (ML) and molecular dynamics (MD) simulations. We developed classification models employing different ML algorithms such as Support Vector Machines (SVM) and Logistic Regression (LR). These models were trained using a database of Mpro inhibitors using different biophysics analysis that includes the value of IC50 from FRET assays (353 compounds), fluorescence assays (1595 compounds), and mixed datasets (FRET, fluorescence, and SPR assays) (1943 compounds). Our analysis revealed that hydrophilic features of inhibitors enhance Mpro binding but compromise PK properties. We also simulated different potent Mpro inhibitors and determined the free-energy landscape (FEL) by MD trajectories and principal component analysis. Our MD analysis also showed that hydrogen bonding, hydrophobic, and π–π interactions in the subsites S2 and S3/S4 are fundamental for binding affinity. These findings highlight the importance of targeting the subsites S2 and S3/S4 to balance PD and PK properties and provide a computational framework for future rational design of Mpro inhibitors, offering a pathway for hit-to-lead optimization.

2. Materials and Methods

2.1. Dataset Construction and Data Preprocessing

We identified 18 articles on SARS-CoV-2 Mpro inhibitors, of which 54,087 molecular entries with biological activity information were provided in the AI-driven Structure-enabled Antiviral Platform (ASAP) [9,13]. For the remaining 17 papers, data extraction was performed manually and curated using Open Babel [14,15] and MarvinSketch [16]. These efforts resulted in an initial, uncleaned dataset comprising 55,419 molecules. Data transformations included conversion to numerical format and log-scale adjustment (pIC50 = −log IC50) in order to facilitate statistical analyses. The dataset was organized and analyzed to investigate the correlation between different IC50 measurement methods, specifically focusing on experimental assays involving FRET, fluorescence, and SPR as measurement techniques. The dataset was imported using the pandas library, and unnecessary columns were dropped. We developed a set of scripts that provide methods for calculating substrate affinity and mean binding constants, treating missing or censored IC50 values. The predicted KM calculation estimates the Michaelis-Menten constant (KM) based on substrate and protein concentrations. Based on our published dataset in our previous study [17], we defined that if the protein concentration ([Mpro]) exceeds 50 nM and the substrate concentration ([subst]) is greater than 15 µM, KM was set to 24 µM. If only the substrate concentration is above 15 µM, KM was assigned a value of 75.41 µM; otherwise, it defaults to 40.87 µM (mean of these data). The mean calculation of Ki and KD values assumes that both values present a similar magnitude, calculating the mean when both values are available. Indeed, in our dataset, we observed that both have similar magnitude (dataset is available in Supplementary File S2). If only one of the values was present, that value was used directly. If neither Ki nor KD was available, the result was NaN. This approach ensures a representative binding constant even when experimental data is incomplete. We processed IC50 values, treating censored data and cleaning invalid entries. We retained values starting with ‘>’ or ‘<’ and recognized both integers and decimals. If the value did not meet these criteria, we attributed NaN. We ensured data integrity, preserving information from censored measurements while filtering out invalid data points. Thus, we established a reliable framework for processing enzymatic inhibition data, facilitating more accurate quantitative modeling of compound potency.
We computed the pIC50 value from experimental IC50 or Ki/KD measurements, accounting for adjustments based on the concentration of the target protein (Mpro) and the Michaelis-Menten constant (KM). The input data consisted of IC50 values converted to nanomolar, the concentration of Mpro in nanomolar, and the predicted KM, Ki, or KD value (or mean of both). Initially, the method attempted a direct conversion of IC50 to pIC50 when the IC50 value was numerical, positive, and below 90 nM, using the equation as follows:
p I C 50 = l o g 10   I C 50 10 6
If the direct conversion was not possible, an adjustment factor was calculated considering the expression as follows:
f a c t o r = 1 + [ M p r o ] 10 9 K M 10 6 + [ M p r o ] 10 9 2 1
For IC50 values with ‘>’ or ‘<’, the method computed pIC50 using the adjusted Ki/KD value was as follows:
p I C 50 = l o g 10   K i K D × f a c t o r 10 6
This approach enabled the estimation of pIC50 even when the IC50 value was censored or ambiguous. In cases where IC50 was missing or invalid but Ki/KD was a positive numerical value, pIC50 was calculated using the same adjustment factor and equation. Our strategy facilitated implementation by handling exceptions such as non-numeric or invalid types, preventing interruptions during data processing. This approach provided a robust conversion of enzymatic inhibition data to pIC50, integrating adjustments for concentration and affinity constants, thereby supporting quantitative analysis of inhibitor activity even with incomplete or censored measurements. For pIC50 values below 5, molecules were considered inactive against Mpro; otherwise, they were classified as active against this enzyme. All outputs were systematically organized and made easily accessible for further analysis and visualization. The resulting datasets included Dataset 1: FRET (353 molecules); Dataset 2: fluorescence (1595 molecules); and Dataset 3: Combined dataset (FRET, SPR, and fluorescence; 1943 molecules). Dataset 3 was specifically structured to evaluate whether assay types independently influenced machine learning predictions, regardless of inhibition assay types.

2.2. Computational Environment

The computational analyses were performed in Python 3.12.2, utilizing a suite of scientific libraries including NumPy [18], Pandas [19], RDKit [20], scikit-learn [21], Plotly [22], Matplotlib [23,24], and seaborn [25]. These tools support data manipulation, molecular descriptor computation, dimensionality reduction, and data visualization. Protonation states of molecular structures were predicted using Dimorphite-DL [26], which we adapted to enable parallel processing across datasets and to constrain the analysis within a physiologically relevant pH range (6.8 to 7.2), allowing a maximum of one variant per molecule. We computed 2D molecular descriptors using RDKit, where molecules were converted from SMILES strings to RDKit Mol objects, followed by molecular descriptor calculations. We also generated 3D descriptors by embedding molecules with the ETKDG algorithm, implemented on RDKit, where hydrogen atoms were added, and 3D conformers were generated. Due to the high dimensionality inherent in molecular descriptor datasets—known to complicate machine learning (ML) applications by increasing overfitting risk, computational cost, and reducing model interpretability [27]—we applied dimensionality reduction based on Pearson’s correlation coefficient, removing highly correlated features. Features with missing values or zero variance were excluded, and remaining missing values were imputed with mean values. Further procedural details are provided at https://github.com/anacletosouza/Mpro_AI-ML_pipeline/ (accessed on 11 June 2025) and Supplementary File S2. Feature sets derived from various filtering strategies (see Table S1) were used to train predictive models, whose performance was evaluated using precision, recall, F1-score, and accuracy as follows:
P r e c i s i o n = T P T P + F P
R e c a l l = T P T P + F N
F 1 - score = 2 × T P 2 × T P + F P + F N
A c c u r a c y = T P + T N T P + T N + F P + F N

2.3. Support Vector Machine (SVM)

The data processing and model training pipeline was structured to evaluate Support Vector Machines (SVM) on a set of molecular descriptor datasets. The datasets were split into training and test sets, with test sets ranging from 20%, 25%, or 30% of the corresponding dataset. Stratified splitting was applied to preserve the class distribution. The data was standardized to ensure that all features were on a comparable scale. SVM models were trained using various kernel functions, including radial basis function (RBF), linear, polynomial, and sigmoid kernels. The models were evaluated using non-cross-validation (CV) and k-fold CV (with k = 5 or 10). The performance of each model was evaluated using accuracy as the metric, calculated as the proportion of correct predictions. Hyperparameters were performed by defining a list of potential values for the number of splits for k-fold CV, random states for CV and test split, test size splits, and kernel functions (kernel). Thus, we evaluated all combinations of these hyperparameters. Only the results with accuracy more or equal than 0.60 for the training set (accnon−CV and acck−fold CV) and accuracy more or equal than 0.50 for the test set (accpred) were retained for ROC AUC analysis. The scripts are available in https://github.com/anacletosouza/Mpro_AI-ML_pipeline/ and Supplementary File S2.

2.4. Logistic Regression (LR)

To construct and evaluate the Logistic Regression models applied to binary classification tasks derived from molecular descriptor datasets, we employed the same approach used for building the SVM models. However, the Logistic Regression models were trained using five distinct solvers: lbfgs, liblinear, newton-cg, sag, and saga, initially without cross-validation. The scripts are available in https://github.com/anacletosouza/Mpro_AI-ML_pipeline/ and Supplementary File S2. The logistic function used in this study was defined by the following:
p ( 1   |   X ) = 1 1 + e x p [ ( β 0 + β 1 x 1 + β 2 x 2 + + β n x n   ) ]
where the following is established:
X = ( x 1 , , x n )
This denotes the vector of molecular descriptors used during the training of the Logistic Regression model.

2.5. Molecular Docking and Molecular Dynamics Simulations

The inhibitors, represented by the SMILES, were converted into a 3D structure using Chimera [28] tools, which was then used for molecular docking. The 3D structure of monomeric Mpro was built with AlphaFold3 [29]. We prepared each Mpro/inhibitor complex using the UCSF Chimera tool for molecular docking simulations, according to our previous studies [17]. Histidines 64, 163 were protonated in the ϵ2 tautomer; histidines were protonated in residues 41, 80, 164 as δ1 tautomer; charged histidines were protonated in 172 and 246. To build a Mpro/inhibitor complex, we docked each inhibitor into the binding site. Subsequently, the inhibitor was submitted to a systematic conformational assembly into the binding site using the autodocking Vina software [30,31]. The grid was centered in the Cys145 (SG, chain A, with coordinates (11.4, 12.1, −2.2) and size (40, 30, 30)). In the molecular docking parameters, we used exhaustiveness and number of modes of 8 and 1, respectively. We differed in the molecule poses, considering differences of 3 kcal/mol. The poses were evaluated by scoring representing the binding free energy. We performed nine consecutive minimizations of the Mpro/inhibitor complex, followed by MD runs performed in the Groningen Machine for Chemical Simulation Software (GROMACS, version 2022) [32,33], using optimized potentials for liquid simulations for all-atom (OPLS-AA) force fields. Inhibitor topology was built in the LigParGen web-based service [34,35,36], and the 1.14*CM1A charges were calculated considering neutral charge and 1 interaction. All systems were then explicitly solvated with TIP3P water models in a triclinic box and neutralized, keeping a 150 mM NaCl concentration, and minimized until reaching a maximum force of 10.0 kJ/mol or a maximum number of steps of 5000. The Mpro/inhibitor was equilibrated consecutively in isothermal-isochoric (NVT) and isothermal-isobaric (1 bar; NpT) ensembles, both at a temperature of 310 K for 2 ns (number of steps and intervals of 1,000,000 and 2 fs, respectively). All simulations were then performed in a periodic triclinic box, considering the minimum distance of 1.2 nm between any protein atom and triclinic box walls. Molecular dynamics runs were performed for 150 ns (number of steps and intervals of 75,000,000 and 2 fs, respectively) in NpT ensembles. We used the leapfrog algorithm to integrate Newton equations. We selected LINCS (LINear Constraint Solver) as a solver that satisfies holonomic constraints, whose number of iterations and order were 1 and 4, respectively. We used neighbor searching grid cells (cutoff scheme Verlet, frequency to update the neighbor list of 20 steps, and cutoff distance for the short-range neighbor list of 12 Å). In the van der Waals parameters, we smoothly switched the forces to zero between 10 and 12 Å. In electrostatic Coulomb, we used fast, smooth Particle-Mesh Ewald (SPME) electrostatics for long-range electrostatics. In addition, we defined the distance for the Coulomb cutoff of 12 Å, interpolation order for PME for a value of 4 (cubic interpolation), and grid spacing for Fast Fourier Transform (FFT) of 1.6 Å. In temperature coupling, we used velocity rescaling with a stochastic term (V-rescale; modified Berendsen thermostat), and after obtaining two coupling groups (Mpro/inhibitor and water/ions), we considered them for the time constant of 0.1 ps and 310 K as the reference temperature. In pressure coupling (NpT ensembles), we used a Parrinello–Rahman barostat (type isotropic, time constant of 1 ps, reference pressure of 1 bar, and isothermal compressibility of 4.5 × 10−5 bar−1). In the molecular dynamics calculations, we used periodic boundary conditions in xyz coordinates (3D space). Next, we calculated root-mean-square deviation (RMSD) and root-mean-square fluctuation (RMSF) values using GROMACS modules. To analyze structural dynamics of the Mpro/inhibitor complex, we extracted the Cα atomic coordinates of each residue from molecular dynamics trajectories and computed their Euclidean distances from the origin for each frame. The data were mean-centered, followed by Dynamical Cross-Correlation Matrix (DCCM) computation using Pearson correlation coefficients between atomic displacements. Correlated and anti-correlated motions were visualized with heatmaps and 3D structural networks in PyMOL (v. 2.5.0). From DCCM, we extracted eigenvalues and eigenvectors to identify normal modes of motion. Free Energy Landscapes (FELs) were derived from PCA scores using the relation ∆G = − RT ln(P + ξ), with ξ = 10−10 to prevent logarithmic singularities. Energies were binned (bin size = 50), normalized, and mapped back onto trajectory frames for visualization. To identify interactions within the subsites of Mpro, four main subsites were defined based on our previous study: S1 (containing PHE140, CYS145, HIS163), S1′ (THR25, THR26, HIS41, LEU27, ASN142, GLY143), S2 (MET49, TYR54, PRO52, HIS164, MET165, ASP187, ARG188, GLN189), and S3/S4 (MET165, GLU166, LEU167, PRO168, PHE185, THR190, ALA191) [5]. The analysis considered three distant atoms of the ligand: 3AN [C0F (quinoline), C05 (benzothiazole), O00 (thioamide)], 37 [C0F (pyrazole), C0O (chlorophenyl), C0T (benzotriazole)], 21 [C0R (benzotriazole), S07 (thiophene), N0H (imidazole)], and 40 [C0R (benzotriazole), N0K (imidazole), C0A (chlorophenyl)]. The coordinates of these atoms define the center of a sphere with a radius equal to 4 Å. All atoms within this sphere were considered potential interaction residues, and corresponding interacting subsites were assigned. Based on that, the contact frequency per residue and per subsite was calculated. Scripts used in these methodologies are available in Supplementary File S1 and https://github.com/anacletosouza/Mpro_AI-ML_pipeline/. Simulations and analyses were run on an HPC cluster comprising four nodes equipped with high-memory CPUs and Nvidia GPUs (L40S and A30) (Institute of Biomedical Sciences, University of São Paulo, São Paulo city, São Paulo state, Brazil), ensuring efficient large-scale data processing.

3. Results

3.1. Support Vector Machine (SVM) Models

In this study, we adopted a systematic approach to build and evaluate SVM-based classification models using different datasets and molecular descriptors in the feature selection process. The datasets varied in the IC50 measurement method (FRET, fluorescence, or a combination of FRET/fluorescence/SPR), the type of molecular descriptors (2D or 3D), and the filtering thresholds applied based on the pairwise Pearson correlation between the numerical descriptors. We performed multiple hyperparameter combinations, adjusting the number of cross-validation splits (k-fold), the training and test set splits, the random seed values, and the kernel types (linear, radial basis function, polynomial, and sigmoid). For each combination, the SVM models were trained and evaluated using the metrics of accuracy, recall, precision, F1-score, and area under the curve of the ROC curve (ROC AUC). Then, the performance of these models was evaluated using the same metrics in the test set. In total, we built and evaluated 1584 models, of which 1204 presented accuracy without cross-validation (CV) (accnon−CV), accuracy with cross-validation (acck−fold CV), and prediction accuracy (accpred) values greater than or equal to 0.6, 0.6, and 0.5, respectively. This strategy allowed us to explore the hyperparameter space and select the best models for solving classification problems.
Next, we selected the three best models constructed from each dataset, considering only those with accnon−CV and accpred greater than 0.7, in addition to prioritizing those that presented the largest AUC ROC in both the training and test sets (Table 1). The best model (Model 9) was obtained using the set of molecules whose IC50 measurement method was by fluorescence and 36 molecular descriptors (condition 43, Table S1). Molecular descriptors used in the construction of model 9 are shown in Table S4, and parameters and hyperparameters are summarized in the Supplementary Text, in the Supporting Information. This SVM model was built with radial basis function (RBF), 5- and 10-fold cross-validation, a random seed for k-fold CV (kf) of 37, a test size of 0.2 (319 compounds), and a training size of 1595 compounds, using a random seed (split) of 21. This SVM classifier utilized 769 support vectors—357 corresponding to active and 412 to inactive compounds—and a gamma value (for RBF) of ~0.03. In the test set, the model achieved an accuracy of 0.79 and a precision of 0.75, indicating reliable overall classification performance. This pattern is corroborated by the confusion matrix, which showed 79 true positives and 26 false negatives; in contrast, the classification of inactive compounds was more robust, with 174 true negatives and 40 false positives (Figure 2). This disparity may be related to class imbalance, as the training set contained 475 active and 801 inactive compounds, and the test set included 119 active and 200 inactive compounds. Despite this limitation, the ROC-AUC value of 0.86 indicates strong discriminative ability, supporting the model’s overall suitability for identifying active and inactive Mpro inhibitors. In contrast, models based on IC50 measured by FRET presented high recall (close to 1) but lower precision (between 0.72 and 0.73) and SVM models built from combined dataset (FRET, fluorescence, and SPR) presented lower ROC-AUC values (between 0.20 and 0.28), showing limited capacity to separate active and inactive inhibitors. This result suggests that, even in high data dimensionality—generally favorable for SVMs—the mixture of measurement methods impairs model performance.

3.2. Logistic Regression (LR) Models

Model description. The Logistic Regression (LR) models were built and evaluated with a focus on interpretability and generalization, both assessed through performance metrics on training and test sets. Table 2 presents the performance of LR models. Model 1 (FRET, condition 14, 47 molecular descriptors—Table S1) achieved a recall of ~0.97, ROC AUC was ~0.71, and accnon−CV of ~0.75 for the test set was considered satisfactory in our analysis. Conversely, models 10–13 (condition 45 in Table S1, based on fluorescence data, using 85 molecular descriptors)—trained and tested with a random split of 75% (1196 inhibitors) and 25% (399 inhibitors), respectively—exhibited superior generalization performance. These models reached AUCs of ~0.85 and ~0.83 on the training and test sets, respectively. The average accuracies under 5-fold and 10-fold CV (acc5−fold CV = 0.78; acc10−fold CV = 0.76) were consistent across training and test sets, suggesting that the models learned stable and generalizable patterns. Although the recall (0.60) was lower than that of the models generated under condition 14, the overall predictive performance was more robust. The models built under condition 31 (integrating FRET, fluorescence, and SPR data, with 81 descriptors) showed intermediate performance compared to the previously described models, with AUCs ranging from 0.76 to 0.77. The accuracies under 5-fold and 10-fold CV ranged from 0.73 to 0.74, and test set accuracies ranged from 0.71 to 0.72. These results indicate that, although all these LR models exhibit acceptable generalization capabilities, models 10–13 stand out in this regard. Given that models 10–13 differ only in the method used for estimating the Logistic Regression coefficients, model 10 was selected for subsequent analyses due to its consistent and superior generalization performance. Specifically, model 10 (Logistic Regression with L2 regularization, newton-cg solver) was trained using 85 molecular descriptors covering diverse physicochemical and topological features (Table 2 and Supplementary Table S1).
The parameters and hyperparameters of model 10 are summarized in the Supplementary Text, in the Supporting Information. The model employed an adjusted intercept (∼0.85), L2 regularization (C = 1.0), and a convergence tolerance of 0.0001, reaching the stopping criterion in eight iterations (maximum: 1000). Performance metrics indicated robust generalization, with training accuracy (~0.78) closely matching test accuracy (~0.76). Precision exceeded recall in both sets (training: ~0.73; test: ~0.71 vs. training recall: ~0.66; test recall: ~0.60), while F1-scores were ~0.69 (training) and ~0.66 (test). The strong ROC-AUC scores (~0.85 training, ~0.83 test; Figure 3) further confirmed discriminatory power between active and inactive classes. As shown in Figure 3, the confusion matrices presented consistency in the training/test ratio with 294/90 true positives, 151/59 false negatives, 111/36 false positives, and 640/214 true negatives (Figure 3). Since the coefficients (provided in the Supplementary File S2) quantify each descriptor’s contribution to predicting the active class, and the consistent performance between training and test distributions demonstrates minimal overfitting, this Logistic Regression model emerges as the optimal choice for reliable coefficient interpretation. The model’s robust performance across all evaluation metrics further supports its suitability for the mechanistic interpretation of descriptor contributions in binding affinity of the Mpro inhibitors. In contrast to the SVM model, the Logistic Regression (LR) model allows interpretability, where weights associated with Logistic Regression indicate positive contributions (positive coefficient values) and negative contributions (negative coefficient values) toward a compound being classified as active. In this regard, we considered the coefficients obtained from the standardized data. Thus, to investigate how pharmacokinetic properties relate to biological activity, we compared the contribution of each coefficient representing a molecular descriptor with the probability of the inhibitor being active. The selected descriptors in this discussion are involved with Lipinski’s Rule [37], TPSA [38], and QED [39].
Drug-likeness profile. Other molecular descriptors used in the construction of Model 10 are shown in Table S5. Lipinski’s Rule is widely used in the early screening of drug-like compounds to predict oral bioavailability compounds with good intestinal absorption tend to violate no more than one of the following empirical criteria: molecular mass ≤ 500 Da, octanol/water partition coefficient (log P) ≤ 5, ≤5 hydrogen bond donors (OH and NH groups), and ≤ 10 hydrogen bond acceptors (oxygen and nitrogen atoms) [37]. These parameters are associated with physicochemical properties favorable to membrane permeability and aqueous solubility. Beyond Lipinski’s criteria, topological polar surface area (TPSA) is another key parameter for predicting oral absorption. TPSA represents the sum of polar surface areas from chemical functionalities—such as oxygen and nitrogen atoms—and correlates with a compound’s ability to cross cell membranes. Generally, compounds with TPSA < 140 Å2 are more likely to be well-absorbed orally [38], while values below 90 Å2 are considered favorable for central nervous system penetration [40]. The quantitative estimate of drug-likeness (QED) is particularly important during the hit-to-lead and lead optimization phases because it captures the balance among key features such as molecular weight, lipophilicity, polarity, solubility, and synthetic accessibility—interpretable score ranging from 0 (poor drug-likeness) to 1 (optimal drug-likeness) [39]. In this context, the relatively low QED value highlights the importance of further rational modifications to enhance pharmacokinetic compatibility while retaining biological activity. In this regard, combining Lipinski’s Rule with TPSA and QED thresholds provides a strategic approach to selecting compounds with potential for good oral bioavailability.
Available inhibitors have a hydrophilic profile with a penalty in PK properties. Based on that, we observe that the model coefficients indicate that certain pharmacokinetic properties, often associated with good absorption and distribution, may reduce the probability of a compound being active against Mpro. For example, the negative coefficient for MolLog P (~−0.6) suggests that hydrophilic compounds tend to be more active. This contrasts with Lipinski’s Rule, which favors log P ≤ 5 for good permeability. This implies that reducing lipophilicity increases the probability of activity, possibly due to interactions with hydrophilic residues located in the Mpro S1′ and S1 subsites, which are regions with more polar physicochemical characteristics. Additionally, TPSA exhibited a negative coefficient approaching zero (−0.000156), indicating that this parameter has minimal influence on activity prediction. This suggests that Mpro may tolerate molecules with higher polarity without compromising activity, differing from more hydrophobic targets.
Another notable aspect is the positive coefficient for ExactMolWt (~0.37), indicating that heavier molecules tend to be more active. This may be related to the need for more extensive interactions with Mpro, aiming to enhance inhibitors that maximize the number of contacts in a large and complex active site. Conversely, the negative coefficient for LabuteASA (~−0.77) suggests that molecules with a smaller solvent-accessible surface area are favored. Considering the hydrophobic, aromatic, and hydrogen bonding natures of the subsites S2 and S3/S4, these features may indicate that these subsites could be further exploited to improve both pharmacodynamic and pharmacokinetic properties of these inhibitors. Furthermore, descriptors such as BertzCT (~0.44) and Kappa3 (~0.53) have positive coefficients, indicating that greater molecular complexity and higher structural rigidity may favor activity. SlogP_VSA2 (~−0.55) and VSA_EState2 (~−0.51) reinforce that properties related to highly lipophilic regions or certain electronic states may decrease binding affinity to Mpro.
The QED coefficient for the current LR model was ~0.37, indicating that QED may be explored for structural optimization of inhibitors. Therefore, these results suggest that while properties such as log P and TPSA within conventional ranges are typically sought to optimize pharmacokinetics, a more flexible approach may be necessary for Mpro. Excessive increase in lipophilicity to satisfy drug-likeness rules may compromise target interaction, whereas higher TPSA does not appear to be a limiting factor. Thus, the discovery of effective Mpro inhibitors may require a different balance between pharmacokinetics and pharmacodynamics, prioritizing structural features that favor target binding even if this entails partial deviations from traditional permeability rules. This underscores the importance of target-specific predictive models, avoiding generalizations that could lead to false negatives in virtual screening.
Comparison with FDA-approved drugs. We applied the RDkit functions on the protonated molecules to calculate the 36 descriptors used in the construction of the SVM model (Table S2) and the 85 descriptors used in the LR model (Table S3) for Boceprevir, Ensitrelvir, Lufotrelvir, Nirmatrelvir, Ritonavir, and Simnotrelvir (active Mpro inhibitors in preclinical or clinical phases). From these descriptors, we constructed a 95% interval of confidence (IC95%) for the descriptors used in these models to establish the drug-likeness profile for molecules in advanced development stages. We included Ritonavir due to its inhibitory activity against Mpro (IC50 = 13.7 µM and EC50 = 19.9 µM) and its favorable pharmacokinetic properties (it is a component of Paxlovid) [5,41,42]. Tables S4 and S5 present the mean, standard deviation, and confidence interval (α = 0.05) for the 35 and 85 descriptors of these inhibitors. For these Mpro inhibitors, the means of ExactMolWt, QED, TPSA, and MolLog P were respectively 561.56 Da [IC95% = (477.5 − 645.6)], 0.33 [IC95% = (0.18 − 0.48)], 146.4 Å2 [IC95% = (0.18–0.48) Å2], and 2.03 [IC95% = (−0.2 − 4.2)]. The means, standard deviations, and confidence intervals for other descriptors are provided in Tables S4 and S5 for these inhibitors. Nirmatrelvir and Simnotrelvir showed the highest QED values of 0.50 and 0.46, respectively, while Boceprevir, Ritonavir, and Lufotrelvir exhibited values of 0.36, 0.11, and 0.24, respectively. All compounds displayed comparable molecular weights (MolWt) with a mean of 562 Da and a standard deviation of 80 Da [IC95% = 477.9 − 646.2 Da]. The minimum and maximum partial charges had mean values and standard deviations of −0.4 ± 0.2 and 0.3 ± 0.1, respectively. The confidence intervals, as well as mean values and standard deviations of the molecular descriptors, are available in Tables S4 and S5. These results are fundamental for hit-to-lead optimization studies and for high-throughput virtual screening (HTVS) to identify novel hits.

3.3. Molecular Docking and Molecular Dynamics Simulations

We performed molecular docking followed by molecular dynamics (MD) simulations in order to investigate the interactions of Mpro inhibitors at the active site, aiming to elucidate how their pharmacodynamic and structural features may help to improve pharmacokinetic behavior. To that end, 150 ns MD simulations were carried out for a subset of randomly selected inhibitors from a dataset with IC50 values determined by fluorescence-based assays (Figure 4 and Figure 5).
Mpro S2 and S3/S4 may be a potential pharmacophoric center. Overall, the high numbers of correlations and anti-correlations are directly linked to inhibitor activity in Mpro. Potent inhibitors are more stable in the active site, maintained by hydrogen bonding and van der Waals interactions. In general, these potent inhibitors stabilize domains I and II, reducing their flexibility and inducing allosteric effects in other regions of the enzyme, as shown in RMSD and RMSF plots (Figures S1 and S2). This is evidenced by strong correlations and anti-correlations between residue pairs across different domains, as shown in Figure 4. Examples of potent inhibitors following this pattern include inhibitors 37 (2163 correlations and 1029 anti-correlations; IC50 of 0.33 μM). These molecules are characterized by performing aromatic, hydrogen bonding, and hydrophobic interactions in the subsites S2 and S3/S4. These findings suggest that subsites S2 and S3/S4 may be a potential pharmacophoric center to be explored in hit-to-lead optimization.
We also investigated the free-energy landscape (FEL) (Figure 5) and potential interactions (Figure 6) at the different sites of Mpro during binding with inhibitors 3AN, 37, 21, and 40. In relation to FEL results, the number of blue points (favorable free-energy values) (compared with red points that represent unfavorable free-energy values) in the PCA plot for each inhibitor is in agreement with the corresponding experimental IC50 value of them. In relation to potential interactions, our analysis focused on key residues in subsites S1 (PHE140, CYS145, HIS163), S1’ (THR25, THR26, LEU27, HIS41, ASN142, GLY143), S2 (MET49, PRO52, TYR54, HIS164, MET165, ASP187, ARG188, GLN189), and S3/S4 (MET165, GLU166, LEU167, PRO168, PHE185, THR190, ALA191), as defined in our previous study [5]. Contact frequencies were quantified for each residue and subsite, and amino acid residues interacting with these inhibitors were identified (Figure 6), enabling a systematic assessment of interactions between inhibitor regions, Mpro, and the involved amino acids. Our analyses revealed interaction preferences of these inhibitors in subsites S2 and S3/S4. For instance, the quinoline group of 3AN interacted predominantly with subsite S3/S4 (1355 occurrences) compared to S2 (643 occurrences), while the benzothiazole group preferentially interacted with S2 (1799 occurrences) rather than S3/S4 (618 occurrences). At the residue level (Figure S1A), quinoline primarily interacted with MET49, LEU50, GLN189, and THR190, whereas benzothiazole had more contacts with SER46, MET49, GLN189, and THR190. These results suggest permanent dipole–induced dipole interactions and potential hydrogen bonding mediated by the heteroatoms of the aromatic system of this chemical group. In addition, the thioamide group interacted predominantly with MET39, SER46, GLN189, and THR190. This pattern highlights the central role of the C(=S)NH moiety in establishing hydrogen bonding and van der Waals interactions. Thus, interactions identified in the 3AN/Mpro complex highlighted specific subsites S2 and S3/S4 as critical interaction regions. On the other hand, inhibitors 21, 37, and 40 possess a benzotriazole group responsible for interacting with subsites S1’, S1, S2, and S3/S4 (Figure 6B–D). However, the pyrazole group (inhibitor 37) specifically interacted with subsite S1’ (2259 occurrences) (primarily with ASN119, THR26, TYR118, GLN19, GLY143), being more solvent-exposed, whereas the imidazole group of inhibitor 21 preferentially contacted subsites S2 (1358 occurrences), S3/S4 (543 occurrences), and S1’ (129 occurrences) (main contacts with SER46, LEU50, GLN189, GLU47, PRO168, MET49). The imidazole group of inhibitor 40 was also more solvent-exposed (1593 occurrences in S1’) and interacted less with S3/S4 (512 occurrences) (preferentially with THR25, ASN142, GLY143, THR24).
Overall, this may explain why compounds 37 and 40 exhibit lower potency (IC50 of 0.33 and 0.17 μM, respectively), whereas 3AN and compound 21 show higher binding affinity with Mpro (0.09 and 0.06 μM, respectively). All these are in agreement with our FEL results (Figure 5). Considering that the PK/PD trade-offs are fundamental for hit-to-lead optimization and that, according to our ML data, the most potent inhibitors are those with more hydrophilic characteristics (contrary to PK properties) and vice versa. Additionally, assuming that MD simulation data were substantially important for identifying the S2 and S3/S4 subsites, as well as the mean and confidence interval data of advanced-stage inhibitors discussed here, we suggest that inhibitors with the properties listed in Tables S4 and S5, combined with the confidence interval and MD simulation data, may provide a basis for selecting new hits and lead optimization. This is based on the fact that the most potent inhibitors interact with the S2 and S3/S4 subsites. Thus, these results provide a rational direction for the search for new SARS-CoV-2 Mpro inhibitors.

4. Discussion

The rising costs and extended timelines associated with conventional drug development have positioned computational approaches as an increasingly attractive strategy in pharmaceutical research [43,44]. By capitalizing on compounds with well-characterized safety profiles, this approach significantly reduces clinical risks while providing expedited and cost-efficient routes to therapeutic discovery [45,46,47]. Molecular docking and dynamics simulations have enabled systematic evaluation of diverse chemical scaffolds against key viral targets, facilitating rapid identification of high-affinity binders [48]. The integration of pharmacophore modeling with molecular dynamics has further refined this process, yielding several FDA-approved drug candidates with predicted activity against SARS-CoV-2 main protease (Mpro) [49,50]. Despite these advances, a critical bottleneck persists, while numerous compounds show promising in silico activity against Mpro, few progress to clinical evaluation [5]. This translational gap may often be related to pharmacokinetic (PK) limitations, as exemplified by Boceprevir [5]. Although this HCV protease inhibitor demonstrates anti-SARS-CoV-2 activity (IC50 = 4.13 µM, EC50 = 1.31 µM), its clinical potential remains constrained by CYP3A4 interactions. The successful development of Paxlovid (Nirmatrelvir/Ritonavir) illustrates how strategic PK optimization may overcome these barriers [5].
In this study, we investigated the relationship between pharmacokinetic (PK) and pharmacodynamic (PD) properties using integrated machine learning (ML) and molecular dynamics (MD) approaches. We developed binary classification models employing multiple ML algorithms—Support Vector Machines (SVM) and Logistic Regression (LR). These models were trained on IC50 data from FRET assays, fluorescence assays, and combined datasets (FRET, fluorescence, and SPR assays). These computational approaches use large datasets and predictive algorithms to rapidly evaluate chemical compounds, accelerating the discovery of promising candidates with inhibitory activity against SARS-CoV-2 [51,52,53]. Other QSAR models have been developed to identify SARS-CoV-2 Mpro inhibitors. For example, convolutional neural networks (CNNs) were used to predict pIC50 values from a dataset of 467 molecules, having a training accuracy of 95.7% and a test accuracy of 90.43% [54] while random forests (RF) classified compounds as active or inactive using 940 samples, with a training accuracy of 89.8% and test accuracy of 64.2% [55]. On the other hand, partial least squares (PLS) regression was applied to predict pIC50 values from a smaller dataset of 32 samples, resulting in a training accuracy of 90.1% and a test accuracy of 83.1% [56], support vector regression (SVR) predicted pIC50 values for a dataset of 71 samples, achieving 96.0% training accuracy and 89.3% test accuracy [57], and artificial neural networks (ANNs) were employed to predict pKi values from 28 samples, reaching a training accuracy of 80.0% and a test accuracy of 91.0% [17]. This comparison highlights the variability in performance across different models and datasets (containing different approaches and sizes), emphasizing the need for careful selection of methods based on specific prediction tasks.
The main limitation across the compared machine learning methods is the variability in performance, driven by sample size, model complexity, and data representativeness. While complex models such as deep neural networks achieve high training accuracy, their generalization may be inconsistent, particularly with small datasets. Methods such as random forests also show limited generalizability, highlighting the need for model selection in order to improve the prediction of novel SARS-CoV-2 Mpro inhibitors [58,59]. Our Support Vector Machine (SVM) and Logistic Regression (LR) models, trained on chemically diverse datasets, demonstrated robust performance in distinguishing active from inactive compounds, achieving strong predictive accuracy validated by consistent ROC AUC values across both training and test sets. Recent advances have demonstrated the powerful synergy of machine learning (ML), molecular docking, and molecular dynamics (MD) in studying SARS-CoV-2 drug candidates, enabling simultaneous evaluation of structure-activity relationships and pharmacokinetic properties [17,60,61,62]. These approaches have successfully identified diverse chemical scaffolds with high predicted affinity for viral targets, as exemplified by evaluation of ten Mpro inhibitors exhibiting favorable drug-like properties (molecular weights 383.15–534.18 Da, logP~2.9–5.3, TPSA 55.11–113.84 Å2, and three-to-seven rotatable bonds) [60]. The integration of computational techniques—particularly pharmacophore modeling with MD simulations has accelerated drug repurposing efforts by efficiently prioritizing stable, high-affinity binders among FDA-approved compounds [49,50].
Our results are supported by Nirmatrelvir (PF-07321332) (Figure 1). The binding to the S3 and S4 subsites of Mpro validates the relevance of these regions for antiviral activity—a finding consistent with our molecular dynamics simulations, which revealed dynamic correlations between the S2 and S3/S4 [63]. Furthermore, Nirmatrelvir exhibits a balanced profile of hydrophilicity (LogP~2.9) and oral bioavailability, reinforcing our Logistic Regression-based hypothesis that more hydrophilic compounds tend to be active, although they require pharmacokinetic optimization. Recent structural studies by Blankenship and co-authors provide direct experimental support for our mechanistic insights, showing that inhibitor analogs to Nirmatrelvir, such as MPI105 (IC50 = 0.039 μM), achieve high potency targeting S3, and MPI108 (IC50 = 0.025 μM) binds at S3/S4 [64]. These findings further corroborate our conclusion that effective inhibitor design depends on integrating interactions across S2 and S3/S4.
Our study advances this field by establishing, for the first time, a structure-based framework for PK optimization of Mpro inhibitors through integrated ML/MD analysis. Key insights reveal that while hydrophilic features enhance target binding, they often impair PK properties, whereas hydrogen bonding, hydrophobic, and π–π interactions in subsites S2 and S3/S4 stabilize the inhibitor contacts into active sites of Mpro. This finding highlights the subsites S2 and S3/S4 as a strategic target for achieving optimal PD/PK balance. Therefore, our computational pipeline provided predictive models for classification in inactive and active compounds against Mpro and used structure-based design principles for Mpro inhibitors with applications in the preclinical candidate selection. These advances address a critical gap in current computational drug discovery by enabling hit-to-lead optimization. This integrative approach not only enhances Mpro inhibitors but also provides a transferable framework for addressing other molecular targets through structure-guided computational design.

5. Conclusions

Herein, we used a computational approach that integrates ML algorithms, molecular docking, and MD simulations to explore the often conflicting relationship between PD potency and PK in the context of designing inhibitors targeting the SARS-CoV-2 Mpro. The ML models employed in this framework, specifically Support Vector Machine (SVM) and Logistic Regression (LR), showed strong predictive capabilities in distinguishing active Mpro inhibitors from inactive compounds. The SVM model achieved a test accuracy of 0.79 with an area under the ROC curve (AUC) of 0.86, while the LR model closely followed with an accuracy of 0.76 and an AUC of 0.83. These results underscore the potential of data-driven methods in early-stage drug discovery, particularly when predicting activity across diverse chemical scaffolds.
Through detailed analysis, it was observed that certain molecular features—especially those associated with hydrophilicity, such as lower MolLog P values—tended to correlate with higher binding affinity to Mpro. Conversely, properties typically favorable for PK profiles, such as increased lipophilicity and reduced molecular weight, were frequently found to compromise binding potency. This observation illustrates a fundamental trade-off between optimizing a compound for effective target engagement and ensuring it possesses suitable drug-like characteristics for clinical development. Molecular dynamics simulations provided further mechanistic insights by revealing the importance of specific interaction regions within the protease structure. In particular, the Mpro S2 and S3/S4 subsites emerged as key for stabilizing inhibitor binding, mediated primarily by a combination of hydrogen bonding, hydrophobic contacts, and π–π stacking interactions. These analyses, including dynamical cross-correlation matrices and free-energy landscapes, indicated that potent inhibitors tend to induce conformational stabilization in the protease, particularly by reducing the flexibility of domains I and II, while potentially promoting long-range allosteric effects that may further influence enzymatic activity.
An exploration of molecular descriptors highlighted how features linked to chemical complexity—such as ExactMolWt, BertzCT, and Kappa3—correlated with increased biological activity. However, these same descriptors often conflict with attributes that favor desirable PK outcomes, such as oral bioavailability and metabolic stability. This intrinsic tension complicates the development process, as enhancing one aspect of a molecule’s profile frequently detracts from another. Nonetheless, a few advanced-stage inhibitors currently in clinical evaluation, such as Nirmatrelvir and Simnotrelvir, exemplify an emerging class of compounds that manage to strike a workable balance between PD and PK demands. Their quantitative estimates of drug-likeness (QED scores ranging from 0.33 to 0.50) and topological polar surface areas (around 146 Å2) serve as practical reference points for future optimization efforts.
Overall, this study presents an interpretable, predictive, and integrative computational strategy aimed at enhancing the PK/PD properties of antiviral drug candidates targeting SARS-CoV-2 Mpro. By focusing on structurally relevant subsites of the Mpro enzyme, particularly S2 and S3/S4, the framework not only pinpointed high-potential binding regions for drug optimization but also revealed critical insights that can drive the next generation of Mpro inhibitors. Additionally, the developed machine learning models demonstrated strong potential for large-scale virtual screening of novel chemical entities (NCEs), offering a robust platform for accelerating hit-to-lead discovery. Future efforts should prioritize the experimental validation of top computational hits, expand the chemical space through fragment-based design strategies that emphasize PK/PD synergy, and integrate resistance profiling to ensure sustained efficacy against emerging SARS-CoV-2 variants.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/v17070935/s1, Supplementary File S1 contains datasets and ML and MD scripts used in this study. Supplementary File S2 contains parameters, models, and tables corresponding to models SVM and LR. The Supporting Information contains Tables S1–S5 and Figures S1–S3.

Author Contributions

Conceptualization, A.S.d.S., E.P.S., R.F.d.S. and C.R.G.; Methodology, A.S.d.S. and V.M.d.F.A.; Validation, A.S.d.S.; Formal analysis, A.S.d.S., V.M.d.F.A. and C.R.G.; Investigation, A.S.d.S., V.M.d.F.A., R.F.d.S. and C.R.G.; Resources, A.S.d.S. and R.F.d.S.; Data curation, A.S.d.S. and V.M.d.F.A.; Writing—original draft, A.S.d.S., V.M.d.F.A., E.P.S., R.F.d.S. and C.R.G.; Writing—review & editing, A.S.d.S., V.M.d.F.A., E.P.S., R.F.d.S. and C.R.G.; Visualization, A.S.d.S.; Supervision, A.S.d.S. and C.R.G.; Project administration, A.S.d.S.; Funding acquisition, A.S.d.S., R.F.d.S. and C.R.G. All authors have read and agreed to the published version of the manuscript.

Funding

National Council for Scientific and Technological Development (CNPq); Coordination for the Improvement of Higher Education Personnel (CAPES) grant 88887.103056/2025-00; São Paulo Research Foundation (FAPESP, 2021/10577-0, 2022/08730-7, 2021/05262-0, 2023/13894-1, 2023/07366-2, 2023/18211-0, 2025/02990-5, 2024/05454-4); FINEP grant 29794 (FINEP/UFMG/PRPQ/CT VACINAS/SARS-CoV-2). The APC of this paper was funded by FAPESP grant 2023/18211-0.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset containing the implementation of the machine learning approaches (pipeline and scripts) are available at https://github.com/anacletosouza/Mpro_AI-ML_pipeline/.

Acknowledgments

The authors acknowledge Rafaella Radmylan Silva do Amaral for her assistance in organizing the datasets.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. World Health Organization. Available online: https://covid19.who.int (accessed on 8 June 2025).
  2. de Souza, A.S.; de Amorim, V.M.F.; Guardia, G.D.A.; dos Santos, F.F.; Ulrich, H.; Galante, P.A.F.; de Souza, R.F.; Guzzo, C.R. Severe Acute Respiratory Syndrome Coronavirus 2 Variants of Concern: A Perspective for Emerging More Transmissible and Vaccine-Resistant Strains. Viruses 2022, 14, 827. [Google Scholar] [CrossRef] [PubMed]
  3. GISAID Genomic Epidemiology of SARS-CoV-2 with Subsampling Focused Globally over the Past 6 Months. Available online: https://www.gisaid.org/phylodynamics/global/nextstrain/ (accessed on 25 January 2025).
  4. Chen, J.; Zhang, W.; Li, Y.; Liu, C.; Dong, T.; Chen, H.; Wu, C.; Su, J.; Li, B.; Zhang, W.; et al. Bat-Infecting Merbecovirus HKU5-CoV Lineage 2 Can Use Human ACE2 as a Cell Entry Receptor. Cell 2025, 188, 1729–1742.e16. [Google Scholar] [CrossRef] [PubMed]
  5. De Amorim, V.M.F.; Soares, E.P.; de Ferrari, A.S.A.; Merighi, D.G.S.; de Souza, R.F.; Guzzo, C.R.; de Souza, A.S. 3-Chymotrypsin-like Protease (3CLpro) of SARS-CoV-2: Validation as a Molecular Target, Proposal of a Novel Catalytic Mechanism, and Inhibitors in Preclinical and Clinical Trials. Viruses 2024, 16, 844. [Google Scholar] [CrossRef] [PubMed]
  6. Zhao, Y.; Zhu, Y.; Liu, X.; Jin, Z.; Duan, Y.; Zhang, Q.; Wu, C.; Feng, L.; Du, X.; Zhao, J.; et al. Structural Basis for Replicase Polyprotein Cleavage and Substrate Specificity of Main Protease from SARS-CoV-2. Proc. Natl. Acad. Sci. USA 2022, 119, e2117142119. [Google Scholar] [CrossRef]
  7. Parmar, M.; Thumar, R.; Patel, B.; Athar, M.; Jha, P.C.; Patel, D. Structural Differences in 3C-like Protease (Mpro) from SARS-CoV and SARS-CoV-2: Molecular Insights Revealed by Molecular Dynamics Simulations. Struct. Chem. 2023, 34, 1309–1326. [Google Scholar] [CrossRef]
  8. Schechter, I.; Berger, A. On the Size of the Active Site in Proteases. I. Papain. Biochem. Biophys. Res. Commun. 1967, 27, 157–162. [Google Scholar] [CrossRef]
  9. Boby, M.L.; Fearon, D.; Ferla, M.; Filep, M.; Koekemoer, L.; Robinson, M.C.; COVID Moonshot Consortium; Chodera, J.D.; Lee, A.A.; London, N.; et al. Open Science Discovery of Potent Noncovalent SARS-CoV-2 Main Protease Inhibitors. Science 2023, 382, eabo7201. [Google Scholar] [CrossRef]
  10. Li, X.; Song, Y. Structure and Function of SARS-CoV and SARS-CoV-2 Main Proteases and Their Inhibition: A Comprehensive Review. Eur. J. Med. Chem. 2023, 260, 115772. [Google Scholar] [CrossRef]
  11. Noske, G.D.; de Souza Silva, E.; de Godoy, M.O.; Dolci, I.; Fernandes, R.S.; Guido, R.V.C.; Sjö, P.; Oliva, G.; Godoy, A.S. Structural Basis of Nirmatrelvir and Ensitrelvir Activity against Naturally Occurring Polymorphisms of the SARS-CoV-2 Main Protease. J. Biol. Chem. 2023, 299, 103004. [Google Scholar] [CrossRef]
  12. Ma, C.; Sacco, M.D.; Hurst, B.; Townsend, J.A.; Hu, Y.; Szeto, T.; Zhang, X.; Tarbet, B.; Marty, M.T.; Chen, Y.; et al. Boceprevir, GC-376, and Calpain Inhibitors II, XII Inhibit SARS-CoV-2 Viral Replication by Targeting the Viral Main Protease. Cell Res. 2020, 30, 678–692. [Google Scholar] [CrossRef]
  13. ASAP AI-Driven Structure-Enabled Antiviral Platform (ASAP). Available online: https://asapdiscovery.org/ (accessed on 26 January 2025).
  14. O’Boyle, N.M.; Banck, M.; James, C.A.; Morley, C.; Vandermeersch, T.; Hutchison, G.R. Open Babel: An Open Chemical Toolbox. J. Cheminform. 2011, 3, 33. [Google Scholar] [CrossRef] [PubMed]
  15. The Open Babel Team. Open Babel—The Chemistry Toolbox—Open Babel Openbabel-3-1-1 Documentation. Available online: https://openbabel.org/ (accessed on 26 January 2025).
  16. Marvin. MarvinSketch. Available online: https://docs.chemaxon.com/display/lts-europium/marvinsketch-downloads.md (accessed on 26 January 2025).
  17. de Souza, A.S.; de Souza, R.F.; Guzzo, C.R. Quantitative Structure-Activity Relationships, Molecular Docking and Molecular Dynamics Simulations Reveal Drug Repurposing Candidates as Potent SARS-CoV-2 Main Protease Inhibitors. J. Biomol. Struct. Dyn. 2021, 40, 11339–11356. [Google Scholar] [CrossRef] [PubMed]
  18. NumPy Community. NumPy Reference. Available online: https://numpy.org/community/ (accessed on 26 January 2025).
  19. The Pandas Development Team. Pandas-Dev/Pandas. Available online: https://github.com/pandas-dev/pandas (accessed on 26 January 2025).
  20. Landrum, G.; Tosco, P.; Kelley, B.; Rodriguez, R.; Cosgrove, D.; Vianello, R.; Sriniker; Gedeck, P.; Jones, G.; NadineSchneider; et al. Rdkit/rdkit: 2024_09_6 (Q3 2024) Release. Available online: https://zenodo.org/records/14535873 (accessed on 25 January 2025).
  21. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Louppe, G.; Prettenhofer, P.; Weiss, R.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  22. Plotly Technologies Inc. Collaborative Data Science Publisher: Plotly Technologies Inc. Place of Publication: Montréal, QC Date of Publication: 2015. Available online: https://plot.ly (accessed on 25 January 2025).
  23. Hunter, J.D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 2007, 9, 90–95. [Google Scholar] [CrossRef]
  24. The Matplotlib Development Team. Matplotlib: Visualization with Python. Available online: https://matplotlib.org/ (accessed on 25 January 2025).
  25. Waskom, M. Seaborn: Statistical Data Visualization. J. Open Source Softw. 2021, 6, 3021. [Google Scholar] [CrossRef]
  26. Ropp, P.J.; Kaminsky, J.C.; Yablonski, S.; Durrant, J.D. Dimorphite-DL: An Open-Source Program for Enumerating the Ionization States of Drug-like Small Molecules. J. Cheminform. 2019, 11, 14. [Google Scholar] [CrossRef]
  27. Azlim Khan, A.K.; Ahamed Hassain Malim, N.H. Comparative Studies on Resampling Techniques in Machine Learning and Deep Learning Models for Drug-Target Interaction Prediction. Molecules 2023, 28, 1663. [Google Scholar] [CrossRef]
  28. Pettersen, E.F.; Goddard, T.D.; Huang, C.C.; Couch, G.S.; Greenblatt, D.M.; Meng, E.C.; Ferrin, T.E. UCSF Chimera—A Visualization System for Exploratory Research and Analysis. J. Comput. Chem. 2004, 25, 1605–1612. [Google Scholar] [CrossRef]
  29. Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A.J.; Bambrick, J.; et al. Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature 2024, 630, 493–500. [Google Scholar] [CrossRef]
  30. Eberhardt, J.; Santos-Martins, D.; Tillack, A.F.; Forli, S. AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. J. Chem. Inf. Model. 2021, 61, 3891–3898. [Google Scholar] [CrossRef]
  31. Trott, O.; Olson, A.J. AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 2010, 31, 455–461. [Google Scholar] [CrossRef] [PubMed]
  32. GROMACS 2022 Documentation. Available online: https://manual.gromacs.org/2022/index.html (accessed on 25 January 2025).
  33. GROMACS 2022 Source Code. Available online: https://manual.gromacs.org/2022/download.html (accessed on 25 January 2025).
  34. Dodda, L.S.; Vilseck, J.Z.; Tirado-Rives, J.; Jorgensen, W.L. 1.14*CM1A-LBCC: Localized Bond-Charge Corrected CM1A Charges for Condensed-Phase Simulations. J. Phys. Chem. B 2017, 121, 3864–3870. [Google Scholar] [CrossRef] [PubMed]
  35. Dodda, L.S.; Cabeza de Vaca, I.; Tirado-Rives, J.; Jorgensen, W.L. LigParGen Web Server: An Automatic OPLS-AA Parameter Generator for Organic Ligands. Nucleic Acids Res. 2017, 45, W331–W336. [Google Scholar] [CrossRef] [PubMed]
  36. Jorgensen, W.L.; Tirado-Rives, J. Potential Energy Functions for Atomic-Level Simulations of Water and Organic and Biomolecular Systems. Proc. Natl. Acad. Sci. USA 2005, 102, 6665–6670. [Google Scholar] [CrossRef]
  37. Lipinski, C.A.; Lombardo, F.; Dominy, B.W.; Feeney, P.J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Deliv. Rev. 2001, 46, 3–26. [Google Scholar] [CrossRef]
  38. Pajouhesh, H.; Lenz, G.R. Medicinal Chemical Properties of Successful Central Nervous System Drugs. NeuroRx 2005, 2, 541–553. [Google Scholar] [CrossRef]
  39. Bickerton, G.R.; Paolini, G.V.; Besnard, J.; Muresan, S.; Hopkins, A.L. Quantifying the Chemical Beauty of Drugs. Nat. Chem. 2012, 4, 90–98. [Google Scholar] [CrossRef]
  40. Hitchcock, S.A.; Pennington, L.D. Structure-Brain Exposure Relationships. J. Med. Chem. 2006, 49, 7559–7583. [Google Scholar] [CrossRef]
  41. Zhang, L.; Liu, J.; Cao, R.; Xu, M.; Wu, Y.; Shang, W.; Wang, X.; Zhang, H.; Jiang, X.; Sun, Y.; et al. Comparative Antiviral Efficacy of Viral Protease Inhibitors against the Novel SARS-CoV-2 In Vitro. Virol. Sin. 2020, 35, 776–784. [Google Scholar] [CrossRef]
  42. Mahdi, M.; Mótyán, J.A.; Szojka, Z.I.; Golda, M.; Miczi, M.; Tőzsér, J. Analysis of the Efficacy of HIV Protease Inhibitors against SARS-CoV-2’s Main Protease. Virol. J. 2020, 17, 190. [Google Scholar] [CrossRef]
  43. Sun, Y.; Jiao, Y.; Shi, C.; Zhang, Y. Deep Learning-Based Molecular Dynamics Simulation for Structure-Based Drug Design against SARS-CoV-2. Comput. Struct. Biotechnol. J. 2022, 20, 5014–5027. [Google Scholar] [CrossRef] [PubMed]
  44. Sadybekov, A.V.; Katritch, V. Computational Approaches Streamlining Drug Discovery. Nature 2023, 616, 673–685. [Google Scholar] [CrossRef] [PubMed]
  45. Ahuja, A.S.; Reddy, V.P.; Marques, O. Artificial Intelligence and COVID-19: A Multidisciplinary Approach. Integr. Med. Res. 2020, 9, 100434. [Google Scholar] [CrossRef] [PubMed]
  46. Mohanty, S.; Harun Ai Rashid, M.; Mridul, M.; Mohanty, C.; Swayamsiddha, S. Application of Artificial Intelligence in COVID-19 Drug Repurposing. Diabetes Metab. Syndr. 2020, 14, 1027–1031. [Google Scholar] [CrossRef]
  47. Tan, G.S.Q.; Sloan, E.K.; Lambert, P.; Kirkpatrick, C.M.J.; Ilomäki, J. Drug Repurposing Using Real-World Data. Drug Discov. Today 2023, 28, 103422. [Google Scholar] [CrossRef]
  48. Sharma, J.; Kumar Bhardwaj, V.; Singh, R.; Rajendran, V.; Purohit, R.; Kumar, S. An in-Silico Evaluation of Different Bioactive Molecules of Tea for Their Inhibition Potency against Non Structural Protein-15 of SARS-CoV-2. Food Chem. 2021, 346, 128933. [Google Scholar] [CrossRef]
  49. Avti, P.; Chauhan, A.; Shekhar, N.; Prajapat, M.; Sarma, P.; Kaur, H.; Bhattacharyya, A.; Kumar, S.; Prakash, A.; Sharma, S.; et al. Computational Basis of SARS-CoV 2 Main Protease Inhibition: An Insight from Molecular Dynamics Simulation Based Findings. J. Biomol. Struct. Dyn. 2021, 40, 8894–8904. [Google Scholar] [CrossRef]
  50. Gupta, A.; Zhou, H.-X. Profiling SARS-CoV-2 Main Protease (M) Binding to Repurposed Drugs Using Molecular Dynamics Simulations in Classical and Neural Network-Trained Force Fields. ACS Comb. Sci. 2020, 22, 826–832. [Google Scholar] [CrossRef]
  51. Alves, V.M.; Bobrowski, T.; Melo-Filho, C.C.; Korn, D.; Auerbach, S.; Schmitt, C.; Muratov, E.N.; Tropsha, A. QSAR Modeling of SARS-CoV M Inhibitors Identifies Sufugolix, Cenicriviroc, Proglumetacin, and Other Drugs as Candidates for Repurposing against SARS-CoV-2. Mol. Inform. 2021, 40, e2000113. [Google Scholar] [CrossRef]
  52. Yu, X. Prediction of Inhibitory Constants of Compounds against SARS-CoV 3CLpro Enzyme with 2D-QSAR Model. J. Saudi Chem. Soc. 2021, 25, 101262. [Google Scholar] [CrossRef]
  53. Zaki, M.E.A.; Al-Hussain, S.A.; Masand, V.H.; Akasapu, S.; Bajaj, S.O.; El-Sayed, N.N.E.; Ghosh, A.; Lewaa, I. Identification of Anti-SARS-CoV-2 Compounds from Food Using QSAR-Based Virtual Screening, Molecular Docking, and Molecular Dynamics Simulation Analysis. Pharmaceuticals 2021, 14, 357. [Google Scholar] [CrossRef] [PubMed]
  54. Tong, J.; Gao, P.; Xu, H.; Liu, Y. Improved SAR and QSAR Models of SARS-CoV-2 Mpro Inhibitors Based on Machine Learning. J. Mol. Liq. 2024, 394, 123708. [Google Scholar] [CrossRef]
  55. Hazemann, J.; Kimmerlin, T.; Lange, R.; Mac Sweeney, A.; Bourquin, G.; Ritz, D.; Czodrowski, P. Identification of SARS-CoV-2 Mpro Inhibitors through Deep Reinforcement Learning for Drug Design and Computational Chemistry Approaches. RSC Med. Chem. 2024, 15, 2146–2159. [Google Scholar] [CrossRef] [PubMed]
  56. Costa, A.S.; Martins, J.P.A.; de Melo, E.B. SMILES-Based 2D-QSAR and Similarity Search for Identification of Potential New Scaffolds for Development of SARS-CoV-2 MPRO Inhibitors. Struct. Chem. 2022, 33, 1691–1706. [Google Scholar] [CrossRef]
  57. Madani, A.; Benkortbi, O.; Laidi, M. In Silico Prediction of the Inhibition of New Molecules on SARS-CoV-2 3CL Protease by Using QSAR: PSOSVR Approach. Braz. J. Chem. Eng. 2023, 41, 427–442. [Google Scholar] [CrossRef]
  58. Andrianov, A.M.; Shuldau, M.A.; Furs, K.V.; Yushkevich, A.M.; Tuzikov, A.V. AI-Driven De Novo Design and Molecular Modeling for Discovery of Small-Molecule Compounds as Potential Drug Candidates Targeting SARS-CoV-2 Main Protease. Int. J. Mol. Sci. 2023, 24, 8083. [Google Scholar] [CrossRef]
  59. Gupta, R.; Srivastava, D.; Sahu, M.; Tiwari, S.; Ambasta, R.K.; Kumar, P. Artificial Intelligence to Deep Learning: Machine Intelligence Approach for Drug Discovery. Mol. Divers. 2021, 25, 1315. [Google Scholar] [CrossRef]
  60. Sharma, G.; Kumar, N.; Sharma, C.S.; Alqahtani, T.; Tiruneh, Y.K.; Sultana, S.; Rolim Silva, G.V.; de Lima Menezes, G.; Zaki, M.E.A.; Nobre Oliveira, J.I. Identification of Promising SARS-CoV-2 Main Protease Inhibitor through Molecular Docking, Dynamics Simulation, and ADMET Analysis. Sci. Rep. 2025, 15, 2830. [Google Scholar] [CrossRef]
  61. de Gomes, I.S.; Santana, C.A.; Marcolino, L.S.; de Lima, L.H.F.; de Melo-Minardi, R.C.; Dias, R.S.; de Paula, S.O.; de Silveira, S.A. Computational Prediction of Potential Inhibitors for SARS-CoV-2 Main Protease Based on Machine Learning, Docking, MM-PBSA Calculations, and Metadynamics. PLoS ONE 2022, 17, e0267471. [Google Scholar] [CrossRef]
  62. Chhetri, S.P.; Bhandari, V.S.; Maharjan, R.; Lamichhane, T.R. Identification of Lead Inhibitors for 3CLpro of SARS-CoV-2 Target Using Machine Learning Based Virtual Screening, ADMET Analysis, Molecular Docking and Molecular Dynamics Simulations. RSC Adv. 2024, 14, 29683–29692. [Google Scholar] [CrossRef]
  63. Owen, D.R.; Allerton, C.M.N.; Anderson, A.S.; Aschenbrenner, L.; Avery, M.; Berritt, S.; Boras, B.; Cardin, R.D.; Carlo, A.; Coffman, K.J.; et al. An Oral SARS-CoV-2 M Inhibitor Clinical Candidate for the Treatment of COVID-19. Science 2021, 374, 1586–1593. [Google Scholar] [CrossRef]
  64. Blankenship, L.R.; Yang, K.S.; Vulupala, V.R.; Alugubelli, Y.R.; Khatua, K.; Coleman, D.; Ma, X.R.; Sankaran, B.; Cho, C.-C.D.; Ma, Y.; et al. SARS-CoV-2 Main Protease Inhibitors That Leverage Unique Interactions with the Solvent Exposed S3 Site of the Enzyme. ACS Med. Chem. Lett. 2024, 15, 950–957. [Google Scholar] [CrossRef]
Figure 1. Chemical diversity and potency of Mpro inhibitors. The broad active site of Mpro supports the development of diverse inhibitors with high potency (IC50 between 0.3 and 2500 nM). This chemical diversity offers significant potential for anti-SARS-CoV-2 drug development.
Figure 1. Chemical diversity and potency of Mpro inhibitors. The broad active site of Mpro supports the development of diverse inhibitors with high potency (IC50 between 0.3 and 2500 nM). This chemical diversity offers significant potential for anti-SARS-CoV-2 drug development.
Viruses 17 00935 g001
Figure 2. Chemical diversity and performance of the Support Vector Machine model (Model 9). (A) Distribution of active and inactive compounds in the PCA chemical space. (B) Random partitioning of compounds into training and test sets for SVM model construction, preserving overall data distribution. (C) Probability density of predicted class assignments from the SVM classifier. (D) Receiver operating characteristic (ROC) curve depicting SVM performance in distinguishing active from inactive compounds.
Figure 2. Chemical diversity and performance of the Support Vector Machine model (Model 9). (A) Distribution of active and inactive compounds in the PCA chemical space. (B) Random partitioning of compounds into training and test sets for SVM model construction, preserving overall data distribution. (C) Probability density of predicted class assignments from the SVM classifier. (D) Receiver operating characteristic (ROC) curve depicting SVM performance in distinguishing active from inactive compounds.
Viruses 17 00935 g002
Figure 3. Chemical diversity and performance of the Logistic Regression (LR) model (Model 10). (A) Distribution of active and inactive compounds in the PCA chemical space. (B) Random partitioning of compounds into training and test sets for LR model construction, preserving overall data distribution. (C) Probability density of predicted class assignments from the LR model. (D) Receiver operating characteristic (ROC) curve depicting LR performance in distinguishing active from inactive compounds.
Figure 3. Chemical diversity and performance of the Logistic Regression (LR) model (Model 10). (A) Distribution of active and inactive compounds in the PCA chemical space. (B) Random partitioning of compounds into training and test sets for LR model construction, preserving overall data distribution. (C) Probability density of predicted class assignments from the LR model. (D) Receiver operating characteristic (ROC) curve depicting LR performance in distinguishing active from inactive compounds.
Viruses 17 00935 g003
Figure 4. Structural analysis of the Mpro interacting with random inhibitors using dynamical cross-correlation matrix (DCCM). Dynamical cross-correlation matrix (DCCM) calculated using pairwise Pearson correlations of Cα–Cα distances for each residue in the Mpro. The correlation matrix has dimensions 297 × 297, reflecting the correlation between residue pairs, where blue represents anti-correlated residue pairs while red represents correlated residue pairs. DCCM was filtered, keeping correlations more than 0.8 and anti-correlations less than −0.8. These correlations and anti-correlations were projected into the Mpro/inhibitor complex structure (after NpT equilibrium, t = 0 ns). The inhibitors considered in these analyses include: (A) 3AN; (B) 37; (C) 21; (D) 40.
Figure 4. Structural analysis of the Mpro interacting with random inhibitors using dynamical cross-correlation matrix (DCCM). Dynamical cross-correlation matrix (DCCM) calculated using pairwise Pearson correlations of Cα–Cα distances for each residue in the Mpro. The correlation matrix has dimensions 297 × 297, reflecting the correlation between residue pairs, where blue represents anti-correlated residue pairs while red represents correlated residue pairs. DCCM was filtered, keeping correlations more than 0.8 and anti-correlations less than −0.8. These correlations and anti-correlations were projected into the Mpro/inhibitor complex structure (after NpT equilibrium, t = 0 ns). The inhibitors considered in these analyses include: (A) 3AN; (B) 37; (C) 21; (D) 40.
Viruses 17 00935 g004
Figure 5. Principal component analysis (PCA) and free-energy landscape (FEL) of SARS-CoV-2 Mpro conformational dynamics. PCA considers the projection of Cα distances for each residue over time onto the PC1, PC2, and PC3 eigenvectors. The FEL heat map illustrates the evolution of energy changes associated with conformational transitions of SARS-CoV-2 Mpro along the PC1–PC2 and PC1–PC3 projected trajectories. The data are presented for the following inhibitors: (A) 3AN; (B) 37; (C) 21; (D) 40. For each case, PC1 explains 18.67, 58.46, 24.16, and 31.44% of the data variability, respectively; PC2 explains 15.34, 14.68, 15.98, and 17.08%, respectively; and PC3 explains 10.21, 5.06, 8.39, and 13.09%, respectively. Additionally, a cumulative variance plot was generated to depict the contribution of each principal component to the total variance. The FEL, obtained via PCA, represents the system’s free energy landscape in relation to its principal coordinates, highlighting stable states, conformational transitions, and energetic barriers. The FEL was calculated by generating a two-dimensional histogram of the PC1 and PC2 scores, with the corresponding probability density converted into free energy using the equation ∆G = − RT ln(P + ξ), where ∆G is the free energy, R is the gas constant, T is the absolute temperature (310 K), P is the probability density, and ξ = 10−10 is a regularization term to avoid singularities. Each point in the PC1–PC2 space corresponds to a free energy value based on its bin (bin size = 50) in the 2D histogram. Blue regions indicate low-energy (stable) conformational states, whereas red regions denote high-energy barriers separating these states. The pathways between these valleys suggest possible routes for conformational transitions.
Figure 5. Principal component analysis (PCA) and free-energy landscape (FEL) of SARS-CoV-2 Mpro conformational dynamics. PCA considers the projection of Cα distances for each residue over time onto the PC1, PC2, and PC3 eigenvectors. The FEL heat map illustrates the evolution of energy changes associated with conformational transitions of SARS-CoV-2 Mpro along the PC1–PC2 and PC1–PC3 projected trajectories. The data are presented for the following inhibitors: (A) 3AN; (B) 37; (C) 21; (D) 40. For each case, PC1 explains 18.67, 58.46, 24.16, and 31.44% of the data variability, respectively; PC2 explains 15.34, 14.68, 15.98, and 17.08%, respectively; and PC3 explains 10.21, 5.06, 8.39, and 13.09%, respectively. Additionally, a cumulative variance plot was generated to depict the contribution of each principal component to the total variance. The FEL, obtained via PCA, represents the system’s free energy landscape in relation to its principal coordinates, highlighting stable states, conformational transitions, and energetic barriers. The FEL was calculated by generating a two-dimensional histogram of the PC1 and PC2 scores, with the corresponding probability density converted into free energy using the equation ∆G = − RT ln(P + ξ), where ∆G is the free energy, R is the gas constant, T is the absolute temperature (310 K), P is the probability density, and ξ = 10−10 is a regularization term to avoid singularities. Each point in the PC1–PC2 space corresponds to a free energy value based on its bin (bin size = 50) in the 2D histogram. Blue regions indicate low-energy (stable) conformational states, whereas red regions denote high-energy barriers separating these states. The pathways between these valleys suggest possible routes for conformational transitions.
Viruses 17 00935 g005
Figure 6. Potential interactions of SARS-CoV-2 Mpro with different inhibitors. To identify the interactions in the subsites of Mpro, four main subsites were defined in the protein based on our previous study [5]: S1 (containing PHE140, CYS145, HIS163), S1’ (THR25, THR26, HIS41, LEU27, ASN142, GLY143), S2 (MET49, TYR54, PRO52, HIS164, MET165, ASP187, ARG188, GLN189), and S3/S4 (MET165, GLU166, LEU167, PRO168, PHE185, THR190, ALA191). This analysis considered three distant atoms (and different chemical groups) that compose the inhibitor: 3AN [C0F (quinoline), C05 (benzothiazole), O00 (thioamide)], 37 [C0F (pyrazole), C0O (chlorophenyl), C0T (benzotriazole)], 21 [C0R (benzotriazole), S07 (thiophene), N0H (imidazole)], and 40 [C0R (benzotriazole), N0K (imidazole), C0A (chlorophenyl)]. The coordinates of these atoms define the center of a sphere with a radius of 4 Å. All atoms within the sphere were considered potential residues for interaction, and the subsite with which the inhibitor interacted was assigned. Next, the contact frequency per residue and per subsite was calculated. The data of the interactions are presented for the following inhibitors: (A) 3AN; (B) 37; (C) 21; (D) 40.
Figure 6. Potential interactions of SARS-CoV-2 Mpro with different inhibitors. To identify the interactions in the subsites of Mpro, four main subsites were defined in the protein based on our previous study [5]: S1 (containing PHE140, CYS145, HIS163), S1’ (THR25, THR26, HIS41, LEU27, ASN142, GLY143), S2 (MET49, TYR54, PRO52, HIS164, MET165, ASP187, ARG188, GLN189), and S3/S4 (MET165, GLU166, LEU167, PRO168, PHE185, THR190, ALA191). This analysis considered three distant atoms (and different chemical groups) that compose the inhibitor: 3AN [C0F (quinoline), C05 (benzothiazole), O00 (thioamide)], 37 [C0F (pyrazole), C0O (chlorophenyl), C0T (benzotriazole)], 21 [C0R (benzotriazole), S07 (thiophene), N0H (imidazole)], and 40 [C0R (benzotriazole), N0K (imidazole), C0A (chlorophenyl)]. The coordinates of these atoms define the center of a sphere with a radius of 4 Å. All atoms within the sphere were considered potential residues for interaction, and the subsite with which the inhibitor interacted was assigned. Next, the contact frequency per residue and per subsite was calculated. The data of the interactions are presented for the following inhibitors: (A) 3AN; (B) 37; (C) 21; (D) 40.
Viruses 17 00935 g006
Table 1. Performance of Support Vector Machine models. The condition column specifies the descriptors computed and included in model construction, as detailed in Table S1. Kernel functions in SVMs create a nonlinear decision boundary. In these cases, the RBF kernel used Euclidean distance for flexible nonlinear separation of inhibitors into actives and inactives in Mpro.
Table 1. Performance of Support Vector Machine models. The condition column specifies the descriptors computed and included in model construction, as detailed in Table S1. Kernel functions in SVMs create a nonlinear decision boundary. In these cases, the RBF kernel used Euclidean distance for flexible nonlinear separation of inhibitors into actives and inactives in Mpro.
Training SetTest Set
ModelCond.SplitNon-CVCVNon-CV
Acc.Rec.Prec.F1AUCk = 5k = 10Acc.Rec.Prec.F1AUC
170.20.781.000.770.870.800.790.790.720.980.720.830.76
280.20.751.000.740.850.860.760.760.721.000.720.840.75
390.20.791.000.780.880.860.790.800.720.980.720.830.80
4110.20.781.000.770.870.910.790.780.721.000.720.840.75
5130.20.781.000.770.870.910.790.790.731.000.730.840.72
6400.20.810.720.760.740.890.820.820.800.700.740.720.85
7410.20.830.730.790.760.900.820.830.790.670.730.700.86
8420.20.820.730.770.750.900.830.820.780.660.720.690.85
9 *430.20.840.740.810.780.910.840.840.790.660.750.710.86
10450.20.820.700.790.740.890.820.820.770.610.730.660.85
11220.250.720.630.700.660.220.720.720.710.630.670.650.23
12240.250.730.670.700.690.200.740.730.710.670.660.670.24
13240.30.730.640.710.670.200.730.730.720.660.680.670.23
14290.20.800.710.800.760.120.800.780.720.630.700.660.22
Abbreviations: acc = accuracy, F1 = F1-score, and AUC = area under curve of the ROC curve. The terms k = 5 and k = 10 represent the results obtained using 5-fold and 10-fold cross-validation, respectively. RSK (random state of k-fold CV) = 37; RSS (random state for data splitting) = 21; all these SVM models used RBF kernel.
Table 2. Performance metrics for different Logistic Regression models. The table shows evaluation metrics for various LR model configurations across training and test sets.
Table 2. Performance metrics for different Logistic Regression models. The table shows evaluation metrics for various LR model configurations across training and test sets.
Training SetTest Set
ModelCond.SolverNon-CVk-Fold CVNon-CV
Acc.Rec.Prec.F1AUCk = 5k = 10Acc.Rec.Prec.F1AUC
1 14liblinear0.790.960.790.870.780.780.780.750.970.760.850.71
2 14saga0.790.960.790.870.780.780.790.750.970.760.850.70
3 14sag0.790.960.790.870.780.780.790.750.970.760.850.70
4 14newton-cg0.790.960.790.870.780.780.790.750.970.760.850.70
5 14lbfgs0.790.960.790.870.780.780.790.750.970.760.850.70
631lbfgs0.740.660.720.690.800.740.740.710.640.680.660.77
731liblinear0.740.660.720.690.800.740.740.720.640.680.660.77
831newton-cg0.740.660.720.690.800.740.740.720.640.680.660.77
931sag0.740.660.720.690.800.740.730.710.640.680.660.76
10 *45newton-cg0.780.660.730.690.850.780.780.760.600.710.650.83
1145sag0.780.650.720.690.850.780.780.760.600.710.650.83
1245liblinear0.780.660.730.690.850.780.780.760.600.710.650.83
1345lbfgs0.780.660.730.690.850.780.780.760.600.710.650.83
Note: All models employed the RBF kernel. Abbreviations: cond. = condition, rec. = recall, prec. = precision, acc = accuracy, F1 = F1-score, and AUC = area under curve of the ROC curve. The terms k = 5 and k = 10 represent the results obtained using 5-fold and 10-fold cross-validation, respectively. RSK (random state of k-fold CV) = 11; RSS (random state for data splitting) = 75; represents test split of 0.25, otherwise the value represents 0.3.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Souza, A.S.d.; Amorim, V.M.d.F.; Soares, E.P.; de Souza, R.F.; Guzzo, C.R. Antagonistic Trends Between Binding Affinity and Drug-Likeness in SARS-CoV-2 Mpro Inhibitors Revealed by Machine Learning. Viruses 2025, 17, 935. https://doi.org/10.3390/v17070935

AMA Style

Souza ASd, Amorim VMdF, Soares EP, de Souza RF, Guzzo CR. Antagonistic Trends Between Binding Affinity and Drug-Likeness in SARS-CoV-2 Mpro Inhibitors Revealed by Machine Learning. Viruses. 2025; 17(7):935. https://doi.org/10.3390/v17070935

Chicago/Turabian Style

Souza, Anacleto Silva de, Vitor Martins de Freitas Amorim, Eduardo Pereira Soares, Robson Francisco de Souza, and Cristiane Rodrigues Guzzo. 2025. "Antagonistic Trends Between Binding Affinity and Drug-Likeness in SARS-CoV-2 Mpro Inhibitors Revealed by Machine Learning" Viruses 17, no. 7: 935. https://doi.org/10.3390/v17070935

APA Style

Souza, A. S. d., Amorim, V. M. d. F., Soares, E. P., de Souza, R. F., & Guzzo, C. R. (2025). Antagonistic Trends Between Binding Affinity and Drug-Likeness in SARS-CoV-2 Mpro Inhibitors Revealed by Machine Learning. Viruses, 17(7), 935. https://doi.org/10.3390/v17070935

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop