A Computer-Aided Drug Design Approach to Predict Marine Drug-Like Leads for SARS-CoV-2 Main Protease Inhibition

The investigation of marine natural products (MNPs) as key resources for the discovery of drugs to mitigate the COVID-19 pandemic is a developing field. In this work, computer-aided drug design (CADD) approaches comprising ligand- and structure-based methods were explored for predicting SARS-CoV-2 main protease (Mpro) inhibitors. The CADD ligand-based method used a quantitative structure–activity relationship (QSAR) classification model that was built using 5276 organic molecules extracted from the ChEMBL database with SARS-CoV-2 screening data. The best model achieved an overall predictive accuracy of up to 67% for an external and internal validation using test and training sets. Moreover, based on the best QSAR model, a virtual screening campaign was carried out using 11,162 MNPs retrieved from the Reaxys® database, 7 in-house MNPs obtained from marine-derived actinomycetes by the team, and 14 MNPs that are currently in the clinical pipeline. All the MNPs from the virtual screening libraries that were predicted as belonging to class A were selected for the CADD structure-based method. In the CADD structure-based approach, the 494 MNPs selected by the QSAR approach were screened by molecular docking against Mpro enzyme. A list of virtual screening hits comprising fifteen MNPs was assented by establishing several limits in this CADD approach, and five MNPs were proposed as the most promising marine drug-like leads as SARS-CoV-2 Mpro inhibitors, a benzo[f]pyrano[4,3-b]chromene, notoamide I, emindole SB beta-mannoside, and two bromoindole derivatives.


Introduction
To date, over 68.2 million people worldwide have been infected by coronavirus disease 2019 , resulting in over 1.5 million deaths caused by severe acute respiratory syndrome (SARS-CoV-2) and in addition, severe economic costs, national health systems crises, and massive unemployment [1,2]. Despite the enormous human and financial efforts, there is still no appropriate treatment or prevention for COVID-19 and the number of deaths keeps increasing, which makes the discovery of drugs for infection treatment of foremost importance and an emergency. New and efficient therapeutic agents will relieve social, economic, and management burdens, improving patients' quality of life, and reducing hospitals pressure due to ineffective/long hospitalizations.
to the class A, were selected to proceed to the CADD structure-based method. Where 494 MNPs selected by QSAR approach were screened by molecular docking against M pro enzyme. In this CADD approach, a list of virtual screening hits comprising fifteen MNPs was assented on the basis of some established limits, such as: confidence value (3), probability of being active against SARS-CoV-2 in the best model, prediction of the affinity between the M pro of the selected MNPs through molecular docking. Five MNPs, a benzo[f]pyrano [4,3-b]chromene, notoamide I, emindole SB beta-mannoside, and two bromoindole derivatives were proposed as the most promise marine drug-like leads as SARS-CoV-2 M pro inhibitors.

Chemical Space of the SARS-CoV-2 Model
The whole data set of 5272 organic molecules from the ChEMBL database with SARS-CoV-2 screening data (antiviral activity determined as inhibition of SARS-CoV-2 induced cytotoxicity of Caco-2 cells) was randomly divided into a training set of 3499 molecules (comprising 302 molecules from class A with inhibition % ≥50%, 265 molecules from class B with 50% > inhibition % ≥ 30%, and 2932 molecules from class C with inhibition % < 30%), a test set of 1533 molecules (comprising 145 molecules from class A with inhibition % ≥ 50%, 99 molecules from class B with 50% > inhibition % ≥ 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table 1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts.
The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30]. To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and ALogP. The analysis of these data indicates that the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set are distributed over a wide range of MW (i.e., 84-4187 Da) and ALogP (i.e., −8.9-17.85).
Interestingly, approximately 84% of the molecules present in the training set have a MW bellow 500 Da. This MW interval contains approximately 78%, 82%, and 85% of all active, intermediate, and inactive molecules against SARS-CoV-2 in the training set, respectively. In spite of this, using this rule (MW < 500 Da), it is possible to discriminate actives molecules in relation to intermediate and inactive molecules in three structural clusters, namely in the clusters III, IV, and X, which comprises 80%, 82%, and 86% of inactive molecules, respectively, when compared to 78% of active molecules in the overall training set. In addition, more than 68%, 82%, and 86% of the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set have an ALogP that is lower than 5, respectively. In the same way, using the ALogP < 5 rule, it is possible to prioritize active molecules in relation to the other molecules in three clusters, namely in the clusters: VII, VIII and IX, which comprises 85%, 81%, and 79% of active molecules when compared to 68% of active molecules in the overall training set, respectively. Table 1. Structural clusters and SARS-CoV-2 activity class counts within the ten structural clusters.

Clusters 1 # 2 (A class) 3 Average MW 4 Average ALogP 5
Tr Set Te Set Tr Set Te Set Tr Set Te Set I-Indole derivative molecules from class A with inhibition % ≥ 50%, 99 molecules from class B with 50% > inhibition % ≥ 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. molecules from class A with inhibition % ≥ 50%, 99 molecules from class B with 50% > inhibition % ≥ 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. III-γ-Lactone derivative molecules from class A with inhibition % ≥ 50%, 99 molecules from class B with 50% > inhibition % ≥ 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts.

IV-Benzimidazole derivative
30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. V-α-Amino acid ester derivative 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30]. To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and 279 (13) 131 ( The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30]. To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and 588 (31) 254 ( The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30]. To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and 285 (29) 136 ( The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30]. To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and 266 (35) 98 (12)

QSAR Classification Modeling
Two extensive sets of descriptors were explored, one with 6 different types of fingerprints (FPs) with different sizes (166 MACCS, MACCS keys; 307 Substructure, presence and count Sub and SubC, respectively; 881 PubChem fingerprints; 1024 CDK, circular fingerprints; and 1024 CDKExt, extended circular fingerprints with additional bits describing ring features) and another with a total of 1442 1D&2D descriptors (including electronic, topological, and constitutional descriptors). The FPs and the Mar. Drugs 2020, 18, 633 5 of 17 molecular descriptors were calculated by PaDEL-Descriptor [31]. Random forests (RF) [32] machine learning (ML) technique was used for building SARS-CoV-2 activity prediction classification models, and the performance of the models was successfully evaluated by internal validation (out-of-bag, OOB, estimation on the training set), Table 2. Table 2. Evaluation of the predictive performance of FPs and 1D&2D molecular descriptors for modeling the antiviral activity against SARS-CoV-2 of organic molecules using the RF algorithm for the training set in an OOB estimation.  4 False A that was B. 5 False A that was C. 6 False B that was A. 7 False B that was C. 8 False C that was A. 9 False C that was B. 10 Sensitivity, the ratio of true A to the sum of true A and false A. 11 Specificity B, the ratio of true B to the sum of true B and false B. 12 Specificity C, the ratio of true C to the sum of true C and false C. 13 Overall predictive accuracy, the ratio of the sum of true A, true B, and true C to the sum of true A, true B, true C, false A, false B, and false C. 14 MCC, Matthews correlation coefficient.

MACCS
In Table 2, were highlighted in bold the three selected models, MACCS model-best model built with sets of fragment FPs (MACCS, Sub, SubC, and PubChem), ExtCDK model-best model built with sets of circular FPs (CDK and ExtCDK), and 1D&2D descriptors in accordance with Q and MCC parameters. For the two best models for the training set, ExtCDK, and 1D&2D (Table 2), the descriptor selection was evaluated based on the importance assigned by the RF model with the R program- Table 3. For the MACCS model, descriptors were not selected since these comprised only 166 FPs. The best models for each set of ExtCDK FPs and 1D&2D descriptors were accomplished with the RF algorithm using the 150 selected FPs and descriptors, respectively, (the best models were highlighted in bold- Table 3) for the training set. Majority voting predictions (consensus) obtained by ExtCDK, 1D&2D, and MACCS models (consensus model, CM) further improved the results with a Q and MCC of 0.68 and 0.31 for the training set, respectively, Table 4. The results obtained by the CM for the test set in accordance with the ten structural categories (I-X), which were set up using the ward tool in JChem, are shown in Table 5.  3 Specificity C, the ratio of true C to the sum of true C and false C. 4 Overall predictive accuracy, the ratio of the sum of true A, true B, and true C to the sum of true A, true B, true C, false A, false B, and false C. 5 MCC, Matthews correlation coefficient. 6 OOB estimation.  3 Specificity C, the ratio of true C to the sum of true C and false C. 4 Overall predictive accuracy, the ratio of the sum of true A, true B, and true C to the sum of true A, true B, true C, false A, false B, and false C. 5 MCC, Matthews correlation coefficient.
In general, the predictions obtained for the structural clusters are better than those obtained for the overall test set simultaneously taking into account the Q and MCC values, except for Clusters IV-V and IX-X (bold highlighted in Table 5). An improvement in the CM model prediction accuracies (Q = 0.67-0.72 and MCC = 0.25-0.38) was achieved for the other clusters (Clusters I-III and VI-VIII) of the test set, when compared with the prediction accuracy obtained for all the molecules of the test set (Q = 0.67 and MCC = 0.19). For the clusters IV-V and IX-X inferior prediction accuracies were obtained, Q = 0.58-0.68 and MCC = 0.05-0.21. Interestingly, the best achieved predictions for structural clusters I and VI are related to the best performance obtained for the class A (active) prediction with SE values of 0.71 and 0.67, respectively, compared to the SE value of 0.48 for all test sets. Furthermore, an additional predicting criterion can be generated according to the vote count of each of the models (i.e., MACCS, ExtCDK, and 1D&2D). Thus, if classes A, B, or C are predicted by the three models, these have the maximum confidence value (3), by two models (2), and by only one model (1), in this case, the CM prediction is obtained by the prediction of the best model, ExtCDK. An additional parameter, probability of being of class A (Prob_A) or probability of being active, was assigned by the RF algorithm and it can be used as a predicting criterion. For instance, the percentage of TA with the maximum confidence value (3) was 57%, which compares with the percentages of 42% and 43% obtained for false A, FA_B (false A which was B) and FA_C (false A which was C), respectively. If the criterion Prob_A ≥ 0.5 was added for the percentage of TA, the percentage of 33% was obtained and it compares with percentages of 13% and 15% obtained for FA_B and FA_C, respectively.
There are thirty-seven molecules from test set 2 that were predicted as being for class A, from those fourteen, were predicted with the maximum confidence value (3) and only five were predicted with Prob_A ≥ 0.5. Chemical structures of the five molecules in test set 2 that were predicted as belonging to class A with the confidence value (3) and Prob_A ≥ 0.5 are described in Figure 1.
each of the models (i.e., MACCS, ExtCDK, and 1D&2D). Thus, if classes A, B, or C are predicted by the three models, these have the maximum confidence value (3), by two models (2), and by only one model (1), in this case, the CM prediction is obtained by the prediction of the best model, ExtCDK. An additional parameter, probability of being of class A (Prob_A) or probability of being active, was assigned by the RF algorithm and it can be used as a predicting criterion. For instance, the percentage of TA with the maximum confidence value (3) was 57%, which compares with the percentages of 42% and 43% obtained for false A, FA_B (false A which was B) and FA_C (false A which was C), respectively. If the criterion Prob_A ≥ 0.5 was added for the percentage of TA, the percentage of 33% was obtained and it compares with percentages of 13% and 15% obtained for FA_B and FA_C, respectively.
There are thirty-seven molecules from test set 2 that were predicted as being for class A, from those fourteen, were predicted with the maximum confidence value (3) and only five were predicted with Prob_A ≥ 0.5. Chemical structures of the five molecules in test set 2 that were predicted as belonging to class A with the confidence value (3) and Prob_A ≥ 0.5 are described in Figure 1. NCATS released, via OpenData Portal (https://opendata.ncats.nih.gov/covid19/assays) the quantitative HTS data on drugs approved for clinical use tested in the SARS-CoV-2 cytopathic effect (CPE) assay. We used the chemical structures and data of 3957 molecules in SMILES format of this SARS-CoV-2 CPE evaluation from the work of Alves et al. [17]. From those 3957 molecules, there are 1401 molecules that were used to build (910 molecules in the training set) and validate (409 and 82 molecules in the test and test 2 sets, respectively) the SARS-CoV-2 QSAR model. Only the KRCA-0008 molecule in Figure 1, which was proposed as the most promising active molecules against SARS-CoV-2 for test set 2, was tested by NCATS and had no activity. However, there is a proposed hit of the test 2 set, quizartinib (Figure 2), predicted with the maximum confidence value (3) but with NCATS released, via OpenData Portal (https://opendata.ncats.nih.gov/covid19/assays) the quantitative HTS data on drugs approved for clinical use tested in the SARS-CoV-2 cytopathic effect (CPE) assay. We used the chemical structures and data of 3957 molecules in SMILES format of this SARS-CoV-2 CPE evaluation from the work of Alves et al. [17]. From those 3957 molecules, there are 1401 molecules that were used to build (910 molecules in the training set) and validate (409 and 82 molecules in the test and test 2 sets, respectively) the SARS-CoV-2 QSAR model. Only the KRCA-0008 molecule in Figure 1, which was proposed as the most promising active molecules against SARS-CoV-2 for test set 2, was tested by NCATS and had no activity. However, there is a proposed hit of the test 2 set, quizartinib (Figure 2), predicted with the maximum confidence value (3) but with Prob_A < 0.5 by the QSAR model that was experimentally validated by the SARS-CoV-2 CPE assay as active with AC 50 of 2.51 µM.
All six molecules predicted to be the most promising against SARS-CoV-2 from the test set 2 in Figures 1 and 2 are N-heterocyclic molecules, four of which have a pyrimidine scaffold (BIBU-1361, CHEMBL214796, NGD-94-1, KRCA-0008), one, OXA-06, has a pyrrolo[2¨C3-b]pyridine scaffold, and quizartinib has a benzo[d]imidazo[2,1-b]thiazole scaffold. Better predictions compared to those obtained for the test set (Table 4) were obtained taking into account the results of the SARS-CoV-2 CPE assay for the 82 molecules of the test set 2 and the values of 0.21, 0.94, and 0.74 for sensitivity, specificity C, and general predictive precision, respectively. Likewise, the other 2556 molecules of the NCATS data were predicted by the CM model, obtaining 0.01, 0.01, 0.98, and 0.91 for sensitivity, specificity B, specificity C, and general predictive precision, respectively. Only one molecule, vorapaxar (Figure 3 Prob_A < 0.5 by the QSAR model that was experimentally validated by the SARS-CoV-2 CPE assay as active with AC50 of 2.51 µM. All six molecules predicted to be the most promising against SARS-CoV-2 from the test set 2 in Figure 1 and  Figure 3. Chemical structure of vorapaxar from the test set 2 that was predicted as class A with the confidence value (1) and Prob_A of 0.17, which is active in the SARS-CoV-2 CPE assay.

Analysis of MACCS FPs and 1D&2D Descriptors Identified as Relevant for Modeling the Antiviral Activity against SARS-CoV-2
A comparison of the best twenty FPs (i.e., MACCS keys) and descriptors (i.e., 1D_2D) selected by descriptor importance of RF used to build the QSAR classification models is comprised in Table 2 and these were analyzed and presented in Figure 4.  All six molecules predicted to be the most promising against SARS-CoV-2 from the test set 2 in Figure 1 and Figure 2 are N-heterocyclic molecules, four of which have a pyrimidine scaffold (BIBU-1361, CHEMBL214796, NGD-94-1, KRCA-0008), one, OXA-06, has a pyrrolo[2¨C3-b]pyridine scaffold, and quizartinib has a benzo[d]imidazo[2,1-b]thiazole scaffold. Better predictions compared to those obtained for the test set (Table 4) were obtained taking into account the results of the SARS-CoV-2 CPE assay for the 82 molecules of the test set 2 and the values of 0.21, 0.94, and 0.74 for sensitivity, specificity C, and general predictive precision, respectively. Likewise, the other 2556 molecules of the NCATS data were predicted by the CM model, obtaining 0.01, 0.01, 0.98, and 0.91 for sensitivity, specificity B, specificity C, and general predictive precision, respectively. Only one molecule, vorapaxar (Figure 3 Figure 3. Chemical structure of vorapaxar from the test set 2 that was predicted as class A with the confidence value (1) and Prob_A of 0.17, which is active in the SARS-CoV-2 CPE assay.

Analysis of MACCS FPs and 1D&2D Descriptors Identified as Relevant for Modeling the Antiviral Activity against SARS-CoV-2
A comparison of the best twenty FPs (i.e., MACCS keys) and descriptors (i.e., 1D_2D) selected by descriptor importance of RF used to build the QSAR classification models is comprised in Table 2 and these were analyzed and presented in Figure 4.

Analysis of MACCS FPs and 1D&2D Descriptors Identified as Relevant for Modeling the Antiviral Activity against SARS-CoV-2
A comparison of the best twenty FPs (i.e., MACCS keys) and descriptors (i.e., 1D_2D) selected by descriptor importance of RF used to build the QSAR classification models is comprised in Table 2 and these were analyzed and presented in Figure 4.
Of the twenty most important MACCS FPs for modeling the antiviral activity against SARS-CoV-2 in the set, there are more MACCS FPs (i.e., eighteen MACCS key) that are relevant in discriminating the class C than the classes A or B. For example, only the 11th, which codifies the presence of N-heterocyclic scaffold, and the 19th, which encodes the existence of a 4-membered ring, MACCS FPs out of the twenty most important FPs of MACCs are more relevant in discriminating the A and B classes, respectively. The importance of the N-heterocyclic scaffold in class A discrimination has already been highlighted in the proposed list of the six most promising molecules as antiviral against SARS-CoV-2 for test set 2, namely four molecules containing a pyrimidine scaffold, a molecule with a pyrrolo[2,3-b]pyridine scaffold, and a benzo[d]imidazo[2,1-b]thiazole scaffold. In the same way, of the twenty most important descriptors for modeling the antiviral activity against SARS-CoV-2, there are more 1D&2D descriptors that are relevant in discriminating the class C than the classes A or B (i.e., eighteen descriptors) in the set. There are two 1D&2D descriptors that are more relevant in discriminating the class A than the others classes in the set of the twenty most important 1D&2D descriptors for modelling the antiviral activity against SARS-CoV-2, Figure 4, both are topological descriptors, VR3_Dzp (logarithmic Randic-like eigenvector-based index from Barysz matrix/weighted by polarizabilities) and piPC4 (conventional bond order ID number of order 4, log scale). Despite the two most relevant 1D&2D descriptors for class B, an autocorrelation (ATSC0e, centered Broto-Moreau autocorrelation-lag 0/weighted by Sanderson electronegativities) and a topological (ETA_dEpsilon_A, a measure of contribution of unsaturation and electronegative atom count) descriptors, these are more relevant in class C discrimination. Of the twenty most important MACCS FPs for modeling the antiviral activity against SARS-CoV-2 in the set, there are more MACCS FPs (i.e., eighteen MACCS key) that are relevant in discriminating the class C than the classes A or B. For example, only the 11th, which codifies the presence of N-heterocyclic scaffold, and the 19th, which encodes the existence of a 4-membered ring, MACCS FPs out of the twenty most important FPs of MACCs are more relevant in discriminating the A and B classes, respectively. The importance of the N-heterocyclic scaffold in class A discrimination has already been highlighted in the proposed list of the six most promising molecules as antiviral against SARS-CoV-2 for test set 2, namely four molecules containing a pyrimidine scaffold, a molecule with a pyrrolo[2,3-b]pyridine scaffold, and a benzo[d]imidazo[2,1-b]thiazole scaffold. In the same way, of the twenty most important descriptors for modeling the antiviral activity against SARS-CoV-2, there are more 1D&2D descriptors that are relevant in discriminating the class C than the classes A or B (i.e., eighteen descriptors) in the set. There are two 1D&2D descriptors that are more relevant in discriminating the class A than the others classes in the set of the twenty most important 1D&2D descriptors for modelling the antiviral activity against SARS-CoV-2, Figure 4, both are topological descriptors, VR3_Dzp (logarithmic Randic-like eigenvector-based index from Barysz matrix/weighted by polarizabilities) and piPC4 (conventional bond order ID number of order 4, log scale). Despite the two most relevant 1D&2D descriptors for class B, an autocorrelation (ATSC0e, centered Broto-Moreau autocorrelation-lag 0/weighted by Sanderson electronegativities) and a topological (ETA_dEpsilon_A, a measure of contribution of unsaturation and electronegative atom count) descriptors, these are more relevant in class C discrimination.

Application of the In Silico Anti-Viral Model Against SARS-CoV-2 in Virtual Screening
A virtual screening campaign was carried out to search for new lead-like inhibitors against SARS-CoV-2. The best model, the CM, was selected for the virtual screening procedure using 11,162 MNPs retrieved from the Reaxys ® database, 7 in-house MNPs obtained by the team from marine-derived actinomycetes, and 14 MNPs in the pharmaceutical pipeline (eight approved drugs and six MNPs in Phase II and III of clinical trials). All the MNPs from the virtual screening libraries that were predicted as belonging to the class A, were selected to the CADD structure-based approach. In the CADD structure-based method, the 494 MNPs selected by QSAR classification approach were screened by molecular docking against M pro enzyme.

Molecular Docking Against M pro Enzyme
The 494 MNPs from the three MNPs libraries (491, 1, and 2 from MNPs from Reaxys ® database, in-house MNPs, MNPs pharmaceutical pipeline, respectively) selected by QSAR classification approach were screened by molecular docking against M pro enzyme (PDB ID: 6LU7) [3]. The antiviral drugs nelfinavir and lopinavir were used as positive controls and allicin was used as negative control in molecular docking experiments [7]. A list of virtual screening hits comprising thirteen MNPs was assented on the basis of some limits established in this CADD ligand-and structure-based approach, such as: confidence value (3) and probability of being active against SARS-CoV-2 in the best model and prediction of the affinity between the M pro and the selected MNPs through molecular docking (∆G B ≤ −8 kcal/mol). The Autodock Vina software [33] (http://vina.scripps.edu/) was used to perform the flexible virtual screening of the 494 MNPs from the three MNPs libraries to find the most favorable binding interactions, and the calculated free binding energies by the two sets of search space coordinates were reported in Table 6 for the 15 MNPs selected, one from in-house MNPs (hydroxydebromomarinone), two from MNPs clinical trials (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and the negative (allicin) controls. Table 6. Structures and calculated free binding energies (∆G B , in kcal/mol) of the fifteen selected MNPs, one from in-house MNPs (hydroxydebromomarinone), two from MNPs pharmaceutical pipeline (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and negative (allicin) controls, using two sets of search space coordinates.

Code
Chemical Structure Structural Category Natural Source Prob_A ∆G B (kcal/mol)

1
The 494 MNPs from the three MNPs libraries (491, 1, and 2 from MNPs from Reaxys ® database, in-house MNPs, MNPs pharmaceutical pipeline, respectively) selected by QSAR classification approach were screened by molecular docking against M pro enzyme (PDB ID: 6LU7) [3]. The antiviral drugs nelfinavir and lopinavir were used as positive controls and allicin was used as negative control in molecular docking experiments [7]. A list of virtual screening hits comprising thirteen MNPs was assented on the basis of some limits established in this CADD ligand-and structure-based approach, such as: confidence value (3) and probability of being active against SARS-CoV-2 in the best model and prediction of the affinity between the M pro and the selected MNPs through molecular docking (ΔGB ≤ −8 kcal/mol). The Autodock Vina software [33] (http://vina.scripps.edu/) was used to perform the flexible virtual screening of the 494 MNPs from the three MNPs libraries to find the most favorable binding interactions, and the calculated free binding energies by the two sets of search space coordinates were reported in Table 6 for the 15 MNPs selected, one from in-house MNPs (hydroxydebromomarinone), two from MNPs clinical trials (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and the negative (allicin) controls. Table 6. Structures and calculated free binding energies (∆GB, in kcal/mol) of the fifteen selected MNPs, one from in-house MNPs (hydroxydebromomarinone), two from MNPs pharmaceutical pipeline (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and negative (allicin) controls, using two sets of search space coordinates.

Code
Chemical The 494 MNPs from the three MNPs libraries (491, 1, and 2 from MNPs from Reaxys ® database, in-house MNPs, MNPs pharmaceutical pipeline, respectively) selected by QSAR classification approach were screened by molecular docking against M pro enzyme (PDB ID: 6LU7) [3]. The antiviral drugs nelfinavir and lopinavir were used as positive controls and allicin was used as negative control in molecular docking experiments [7]. A list of virtual screening hits comprising thirteen MNPs was assented on the basis of some limits established in this CADD ligand-and structure-based approach, such as: confidence value (3) and probability of being active against SARS-CoV-2 in the best model and prediction of the affinity between the M pro and the selected MNPs through molecular docking (ΔGB ≤ −8 kcal/mol). The Autodock Vina software [33] (http://vina.scripps.edu/) was used to perform the flexible virtual screening of the 494 MNPs from the three MNPs libraries to find the most favorable binding interactions, and the calculated free binding energies by the two sets of search space coordinates were reported in Table 6 for the 15 MNPs selected, one from in-house MNPs (hydroxydebromomarinone), two from MNPs clinical trials (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and the negative (allicin) controls. Table 6. Structures and calculated free binding energies (∆GB, in kcal/mol) of the fifteen selected MNPs, one from in-house MNPs (hydroxydebromomarinone), two from MNPs pharmaceutical pipeline (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and negative (allicin) controls, using two sets of search space coordinates.

Code
Chemical     To prioritize the best marine drug-like leads as SARS-CoV-2 M pro inhibitors from the list of fifteen selected MNPs by QSAR model against SARS-CoV-2 and molecular docking of M pro enzyme, the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties were predicted via in silico methods using the pKCSM software (http://biosig.unimelb.edu.au/pkcsm/) [34]. Five MNPs, a benzo[f]pyrano [4,3-b]chromene (Reaxys ID 7450892), notoamide I (Reaxys ID 19384758), emindole SB beta-mannoside (Reaxys ID 26845562), and two bromoindole derivatives (Reaxys IDs 10714788 and 10720065) were proposed as marine drug-like leads as SARS-CoV-2 M pro inhibitors. In accordance with Lipinski rule of five, 3 molecules failure in only one rule (one from indoloditerpene and two bromoindole derivatives). Regarding the desirable ADMET properties, no AMES toxicity, hERG I inhibition, or skin sensitization was predicted by the first three hits, and no hERG I inhibition, hepatotoxicity, or skin sensitization by the other two hits, Table 6. The prediction of ADMET properties of the fifteen selected MNPs by QSAR model against SARS-CoV-2 and molecular docking of M pro enzyme was presented in Table S1, in the Supplementary Materials. In Figure 5, the interaction profiles of the best-docked poses for the five lead-like SARS-CoV-2 M pro inhibitors were represented.

Data Sets and Selection of Training, Test, Test 2 Sets
In total, 5466 organic molecules were extracted from the ChEMBL (https://www.ebi.ac.uk/chembl/) database [35], searching by antiviral activity determined as

Data Sets and Selection of Training, Test, Test 2 Sets
In total, 5466 organic molecules were extracted from the ChEMBL (https://www.ebi.ac.uk/chembl/) database [35], searching by antiviral activity determined as inhibition of SARS-CoV-2 induced cytotoxicity of Caco-2 cells. Their chemical structures were saved in the simplified molecular input line entry specification (SMILES) data format. The Caco-2 cell line is widely used with in vitro assays to predict the absorption rate of lead drug molecules across the intestinal epithelial cell barrier and it is extensively used to study infection by SARS-CoV and can be used for SARS-CoV-2 infection [36]. The antiviral activity was classified using three activity classes: (A)-inhibition % ≥ 50%; (B)-50% > inhibition % ≥ 30%; (C)-inhibition % < 30%. After calculating the molecular descriptors and the fingerprints, the final data set comprises 5273 organic molecules, the molecules that showed errors in the calculation were removed. From these 447 molecules are recorded as active (class A), 364 as intermediate active (class B), 4220 as inactive (class C), and 241 without defined recorded (label as "Outside typical range"). The data set was randomly divided into a training set of 3499 molecules (class A: 302 molecules, class B: 265 molecules, and class C: 2932 molecules), and a test set of 1533 molecules (class A: 145 molecules, class B: 99 molecules, and class C: 1288 molecules), a partition of approximately 56:44 for the training and test sets. The test 2 set comprises the 241 molecules without activity recorded. The SMILES strings of the data set, the corresponding experimental, and the predicted activities are available as Supplementary Materials.
The virtual library comprises 11,162 MNPs set retrieved from the Reaxys ® database (Elsevier Information Systems GmbH)) in the MDL SDF data format, 7 in-house MNPs set obtained by the team from marine-derived actinomycetes, and 14 MNPs from the pharmaceutical pipeline set (eight approved drugs and six MNPs in Phase II and III of clinical trials). The SMILES strings of the in-house MNPs and MNPs clinical trials sets, the corresponding predicted activities are available as Supplementary Materials.

Selection of Descriptors and Optimization of QSAR Models
In the quest for QSAR models with the minimum possible number of descriptors, descriptor selection was performed based on the importance of descriptors assessed by random forest (RF) (computeAttributeImportance) [32]. Selection of descriptors was accomplished using this procedure with the importance of descriptors assessed by RF within an OOB methodology using the 50, 100, 150, and 200 most important descriptors and RF algorithm as ML technique employing the following statistical metrics: true A (TA), true B (TB), true C (TC), false A that was B (FA_B), false A that was C (FA_C), false B that was A (FB_A), false B that was C (FB_C), false C that was A (FC_A), false C that was B (FC_B), sensitivity (SE, prediction accuracy for class A, active against SARS-CoV-2), specificity B (SP_B, prediction accuracy for class B, intermediate activity against SARS-CoV-2), specificity C (SP_C, prediction accuracy for class C, inactive against SARS-CoV-2), overall predictive accuracy (Q), and Matthews correlation coefficient (MCC).

Class Balancer
In general, class imbalance introduces a bias in the performance of the ML algorithms due to their preference towards the majority class [37]. Our SARS-CoV-2 activity training set is unbalanced, and the imbalance ratio is 9:8:84 for the A: active/ B: intermediate/ C: inactive classes, respectively. To address this issue, a base-classifier with RF in R program version 3.6.1 [38], was used using the sampsize, which is a parameter in the RF with R specified for balancing the classes. This parameter was set to be of the same size as the minority class (class B). With this parameter, some molecules belonging to the minority class were used more than once.

Random Forest (RF)
A RF [32,39] is implemented as an ensemble of unpruned classification trees which are created using bootstrap samples of the training set. For each individual tree, the best split at each node is defined using a randomly selected subset of descriptors. Each individual tree is created using a different training and validation set. Prediction is made by a majority vote of the classification trees in the forest. Performance is internally assessed with the prediction error for the objects left out in the bootstrap procedure (OOB estimation). The method quantifies the importance of a descriptor by the increase in misclassification occurring when the values of the descriptor are randomly permuted, correlated with the mean decrease in accuracy parameter. RFs also assign a probability to every prediction based on the number of votes obtained by the predicted class. RFs were grown with the R program [38], version 3.6.1, using the random forest library [40]. As a result of the nature of three-class imbalance, this problem was alleviated by setting the class weight ranges from 1-265, 1-265, and 1-265 for class A, B, and C, respectively, using the sampsize parameter.

Molecular Docking
A list of 494 virtual screening hits were prioritized by the QSAR approach from the three MNPs libraries (491, 1, and 2 from MNPs from Reaxys ® database, in-house MNPs, MNPs clinical pipeline, respectively). The software program OpenBabel (version 2.3.1) [41] was used to convert the mol2 files to PDBQT files. PDBQT files were used for docking to M pro enzyme (PDB ID: 6LU7) [3] with AUTODOCK VINA (version 1.1) [33]. Water molecules and N3 ligand (https://www.tocris.com/ products/mpro-n3_7230) were removed from 6LU7 [3] prior to docking using the AutoDockTools (http://mgltools.scripps.edu/). An experiment to evaluate the importance of water molecules in the binding of ligands to the active site of the protease was carried out and it was concluded that water molecules do not seem to have a relevant role, since a Pearson correlation of 0.885 between the calculated free binding energies with and without water molecules was obtained. The AutoDockTools is graphical front-end for setting up and running AutoDock-an automated docking software designed to predict how small molecules bind to a receptor of known 3D structure. The coordinates of the search space for M pro enzyme were maximized to allow the entire macromolecule to be considered for docking. The search space coordinates were: M pro enzyme; Center X: -36.149 Y: -3.796 Z: 45.045, and X: -12.806 Y: 18.646 Z: 65.607, Dimensions X: 40.000 Y: 40.000 Z: 40.000. Ligand tethering of the M pro enzyme was performed by regulating the genetic algorithm (GA) parameters, using 10 runs of the GA criteria. The docking binding poses were visualized with PyMOL Molecular Graphics System, Version 2.0 Schrödinger, LLC.

Conclusions
The current results suggest that CADD approaches relying on ligand-and structure-based methodologies could be used with success to predict new inhibitory MNPs against SARS-CoV-2 M pro . Five MNPs, a benzo[f]pyrano [4,3-b]chromene (Reaxys ID 7450892), notoamide I (Reaxys ID 19384758), emindole SB beta-mannoside (Reaxys ID 26845562), and two bromoindole derivatives (Reaxys IDs 10714788 and 10720065) were proposed as the most promise marine drug-like leads as SARS-CoV-2 M pro inhibitors. To our knowledge, the CADD ligand-based using a QSAR classification model, that was developed here, is the largest study ever performed with regard both to the number of molecules involved and to the number of structural families involved in the modeling of the antiviral activity against SARS-CoV-2 and the best model achieved an overall predictive accuracy of up to 67% for both test and training sets. In future works, the proposed five marine drug-like leads against SARS-CoV-2 M pro can be validated experimentally. Based on these results, virtual libraries of marine-derived drug-like leads can be built, which can be virtual screened using the best QSAR model and molecular docking against SARS-CoV-2 M pro .
Supplementary Materials: The following data are available online at http://www.mdpi.com/1660-3397/18/12/633/ s1, Tables S1-S6. The following files are available free of charge. Predictions of ADMET properties with in silico methods, using the pKCSM software for a list of fifteen selected MNPs by QSAR model against SARS-CoV-2 and molecular docking of Mpro enzyme are available as Supplementary Materials, Table S1. SMILES strings of the data set (training, test, and test 2 sets), the corresponding experimental and predicted activities are available as Supplementary Materials, Tables S2, S3, and S4, respectively. Moreover, SMILES strings of the in-house MNPs and MNPs clinical pipeline sets, for the virtual screening data set, the corresponding predicted activities are available as Supplementary Materials, Tables S5 and S6, respectively (XLSX).