Discovery of Potential Inhibitors of SARS-CoV-2 Main Protease by a Transfer Learning Method

The COVID-19 pandemic caused by SARS-CoV-2 remains a global public health threat and has prompted the development of antiviral therapies. Artificial intelligence may be one of the strategies to facilitate drug development for emerging and re-emerging diseases. The main protease (Mpro) of SARS-CoV-2 is an attractive drug target due to its essential role in the virus life cycle and high conservation among SARS-CoVs. In this study, we used a data augmentation method to boost transfer learning model performance in screening for potential inhibitors of SARS-CoV-2 Mpro. This method appeared to outperform a graph convolutional neural network, random forest and Chemprop on an external test set. The fine-tuned model was used to screen a natural compound library and a de novo generated compound library. In combination with other in silico analysis methods, a total of 27 compounds were selected for experimental validation of anti-Mpro activities. Among all the selected hits, two compounds (gossypol acetic acid and hyperoside) displayed inhibitory effects against Mpro with IC50 values of 67.6 μM and 235.8 μM, respectively. The results obtained in this study may suggest an effective strategy for discovering potential therapeutic leads for SARS-CoV-2 and other coronaviruses.


Introduction
SARS-CoV-2, first reported at the beginning of 2020 [1], had caused over 759 million confirmed infections, including 6.8 million deaths, as of March 2023 according to reports to the World Health Organization (WHO). SARS-CoV-2 is a novel coronavirus which shares 79.5% sequence similarity with SARS-CoV [2]. Both belong to the Coronaviridae family, which comprises enveloped positive-sense single-stranded RNA viruses [3]. The virus genome contains several open-reading frames (ORFs) that encode four structural proteins (sps), 16 non-structural proteins (nsps) and several accessory proteins [4,5]. Nsp5 is the main protease (Mpro), which is also known as 3-chymotrypsin-like protease (3CLpro). It has been characterized as one of the potential druggable targets of SARS-CoV-2 owing to its essential role in viral replication and transcription [6]. Active Mpro is a homodimer in which each protomer has three domains (I-III) [7]. The active site of Mpro is located in the cleft between domains I and II and features the catalytic Cys-His dyad (Cys145-His41) [8][9][10]. After ORF1a/b is translated into the two polyproteins pp1a and pp1ab, Mpro cleaves them at 11 distinct sites to release functional polypeptides [6,11,12]. The core recognition sequence is Leu-Gln↓ (Ser/Ala/Gly) [7,13]. Moreover, the high conservation of Mpro among coronaviruses and the absence of homologues with similar cleavage specificity in humans make it an attractive target for antiviral drug discovery [14,15].
Many clinical trials have been initiated in the search for the prevention and treatment of coronavirus disease 2019 (COVID-19). At the time of writing, several vaccines have been approved by the U.S. Food and Drug Administration (FDA), including ones by Pfizer/BioNTech, Moderna and Johnson and Johnson/Janssen (JnJ) [16]. There have also been attempts at preclinical development of multiple formulations of vaccine candidates [17]. However, the continuing mutations in the viral genome may reduce the protective effects of current vaccines. Notably, the Omicron (B.1.1.529) variant of concern (VoC), which contains a high number of mutations in the viral spike protein, is associated with an increased reinfection risk [18]. As the pandemic threat continues and vaccines cannot provide complete and lasting protection [19], the need for antiviral agents to treat infected patients remains. Drug repurposing, which benefits from already established clinical profiles, is considered a fast and low-cost approach to finding potentially effective therapeutic agents against COVID-19 [20][21][22]. At present, only three drugs are approved by the FDA for the treatment of COVID-19: Actemra (tocilizumab), Veklury (remdesivir) and Olumiant (baricitinib) [23]. Several products are also authorized under an Emergency Use Authorization (EUA) for the clinical treatment of COVID-19, including two antiviral drugs, Paxlovid (nirmatrelvir and ritonavir) and Lagevrio (molnupiravir), three immune modulators, five SARS-CoV-2-targeting monoclonal antibodies, sedatives and renal replacement therapies. Hundreds of drugs are undergoing clinical trials for COVID-19, such as favipiravir, lopinavir, ribavirin, ritonavir and tocilizumab, which have shown positive effects in vitro [17,24]. Dexamethasone and hydroxychloroquine have been withdrawn from treatment options because of insignificant protective benefits and serious side effects [24][25][26].
Drug discovery and development is a time-consuming process in which computational methods can help speed up the identification and application of drug candidates. Deep learning techniques have recently received wide attention and have been applied to drug discovery [27]. To facilitate efforts in exploring the chemical space against various therapeutic targets for SARS-CoV-2, deep learning combined with computer-aided drug design (CADD) methodologies such as docking and molecular dynamics simulation has been extensively used [20,28–35]. However, the scarcity of labeled data remains a challenge for supervised learning because benchwork testing is time-consuming and laborious. To address this problem, transformer pre-training on large amounts of unlabeled data followed by downstream task-specific fine-tuning has become a powerful architecture for learning representations of text, i.e., in natural language processing (NLP) [36][37][38][39][40]. Compared with many previous approaches such as graph neural networks (GNNs), modern transformers display a substantial gain in efficiency and throughput [41,42]. Given the availability of millions of Simplified Molecular-Input Line-Entry System (SMILES) strings, different molecular property prediction tasks can be tackled by using the representations of functional groups and atoms learned by the model [43][44][45].
In the present study, we used pre-trained ChemBERTa [39], which is based on the RoBERTa [37] transformer implementation from HuggingFace, and fine-tuned it on a dataset which contains over 280,000 molecules screened against SARS-CoV-1 Mpro [29]. Considering that natural compounds have long been sources of pharmacologically active molecules and that the de novo design of novel scaffolds might expand the chemical space of active drug candidates, we made predictions on two libraries, a natural compound library (TargetMol) and a de novo generated compound library from the literature by Santana et al. [29], to seek molecules against SARS-CoV-2 Mpro. The predicted active molecules were evaluated using molecular docking and PAINS filtering. In vitro enzyme activity inhibition experiments were performed to validate the selected hits.

Dataset Preparation
Due to the high sequence similarity (~76%) shared between SARS-CoV-2 Mpro and SARS-CoV-1 Mpro, we selected a dataset which contains over 280,000 molecules screened against SARS-CoV-1 Mpro as the fine-tuning dataset. Obtained from the publication of Santana and Silva-Jr [29], it consisted of 629 active molecules and 288,940 inactive molecules. Because one molecule can be represented by more than one SMILES string, and because a dataset augmented with enumerated SMILES can help improve model performance [46], we used the same approach to augment the dataset. Different ratios of SMILES enumeration were generated with a Python script, which is available at https://github.com/Ebjerrum/SMILES-enumeration (accessed on 1 July 2020).
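To give a sense of how strongly augmentation of the active class changes the class balance, the short sketch below computes the inactive-to-active ratio for several enumeration factors. It assumes that a factor of k means each active molecule is represented by k SMILES strings in total, which is our reading and is not stated explicitly in the text; the function name is illustrative only.

```python
# Counts reported for the fine-tuning dataset.
N_ACTIVE = 629
N_INACTIVE = 288_940


def imbalance_after_augmentation(factor: int) -> float:
    """Inactive-to-active ratio when every active molecule is represented by
    `factor` SMILES strings (assumption: factor=1 means no augmentation)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return N_INACTIVE / (N_ACTIVE * factor)


if __name__ == "__main__":
    # The factors below are examples; 1 corresponds to the raw dataset.
    for k in (1, 10, 20, 80):
        ratio = imbalance_after_augmentation(k)
        print(f"x{k:<3d} active entries = {N_ACTIVE * k:>6d}  inactive:active = {ratio:.1f}:1")
```

Under this assumption the raw dataset has roughly 459 inactive entries per active entry, and a 20-fold enumeration brings this down to about 23 to 1.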

Chemical Space Analysis
Canonical SMILES were obtained with RDKit in Python, and Morgan fingerprints (radius 2, 2048-bit vectors) were then computed for each molecule. t-Distributed Stochastic Neighbor Embedding (t-SNE) analysis was performed with the scikit-learn package in Python, reducing the data points from 2048 dimensions to 2 dimensions. All t-SNE parameters were left at scikit-learn's default values.

Model Performance Evaluation
The performance of the fine-tuned model was evaluated with five-fold cross-validation. Scaffold splitting was used to ensure that the training and validation sets are structurally dissimilar, which makes the task more challenging for the model. Additionally, we used an external independent test dataset collected from the results of a screening assay against SARS-CoV-2 Mpro using X-ray crystallography (at Diamond Light Source, Oxfordshire, United Kingdom) [47]. It consisted of 880 molecules with 78 hits. The performance of Chemprop [48], a freely available message passing neural network (MPNN) (http://chemprop.csail.mit.edu/predict (accessed on 27 October 2021)), on the same dataset was also determined for comparison. Various evaluation metrics were calculated, including area under the receiver operating characteristic curve (au_roc), area under the precision-recall curve (au_prc), recall score, accuracy score, precision score and f1 score.
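The idea behind scaffold splitting can be sketched as follows: molecules sharing a scaffold are kept together, so no scaffold appears in both the training and validation folds. This is a minimal illustration and not the splitting code used in this study; it assumes that a scaffold identifier per molecule (for example a Bemis-Murcko scaffold SMILES) has already been computed elsewhere, and the function name and greedy balancing rule are our own choices.

```python
from collections import defaultdict


def scaffold_kfold(scaffolds, n_folds=5):
    """Assign molecule indices to folds so that no scaffold spans two folds.

    `scaffolds` holds one precomputed scaffold identifier per molecule
    (hypothetical input; how the scaffolds are derived is not shown here).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Place the largest scaffold groups first, each into the currently
    # smallest fold, which keeps fold sizes roughly balanced.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


if __name__ == "__main__":
    # Toy scaffold strings for seven molecules.
    toy = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1",
           "c1ccncc1", "c1ccoc1", "C1CCNCC1"]
    for i, fold in enumerate(scaffold_kfold(toy, n_folds=3)):
        print(f"fold {i}: molecule indices {sorted(fold)}")
```

Each fold can then serve once as the validation set while the remaining folds are used for training.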

PAINS Filtering
All predicted active compounds were submitted to the FAF-Drugs4 server (available at http://fafdrugs4.mti.univ-paris-diderot.fr (accessed on 25 November 2021)) to evaluate their physicochemical properties [49]. Molecules with suspicious substructure features were flagged by the Pan Assay Interference Compounds (PAINS) filter.

Molecular Docking Protocol
Crystal structures of SARS-CoV-2 Mpro bound with inhibitor PF-07321332 (PDB ID: 7VH8) and inhibitor N3 (PDB ID: 6LU7) were obtained from the RCSB Protein Data Bank. The Mpro protein and inhibitor ligands were prepared using AutoDockTools by removing water molecules and adding polar hydrogen atoms and charges. Prepared protein and ligand files were converted to PDBQT format. Molecular docking was carried out using AutoDock Vina 1.2.0, with Mpro from the 7VH8 structure used as the docking receptor due to its higher resolution. PF-07321332 and N3 were redocked to validate the docking protocol, which was then used for the virtual screening process. The grid box center was set at X: −18.217, Y: 17.605, Z: −25.603 and the box dimensions were set to X: 20, Y: 26, Z: 24. The binding affinities of the compounds with the Mpro protein were calculated and ranked.
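For readers who wish to reproduce the search box, the grid parameters above can be written into an AutoDock Vina configuration file. The sketch below only generates the file text; the receptor, ligand and output file names are hypothetical placeholders, and options not mentioned in the text (such as exhaustiveness) are left at Vina's defaults.

```python
def vina_config(receptor, ligand, center, size, out):
    """Return the text of an AutoDock Vina configuration file."""
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {cx}",
        f"center_y = {cy}",
        f"center_z = {cz}",
        f"size_x = {sx}",
        f"size_y = {sy}",
        f"size_z = {sz}",
        f"out = {out}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = vina_config(
        receptor="7vh8_receptor.pdbqt",   # hypothetical file name
        ligand="ligand.pdbqt",            # hypothetical file name
        center=(-18.217, 17.605, -25.603),
        size=(20, 26, 24),
        out="ligand_docked.pdbqt",        # hypothetical file name
    )
    print(text)
```

Saving the printed text as, for example, conf.txt would allow a run of the form `vina --config conf.txt` for each prepared ligand.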

Protein Expression and Purification of SARS-CoV-2 Mpro
The plasmid pET-28b-SARS-CoV-2-Mpro was a kind gift from Professor George Fu Gao from the Institute of Microbiology, Chinese Academy of Sciences. The expression plasmid was transformed into E. coli strain BL21 cells, which were then cultured in LB medium containing 50 µg/mL kanamycin in a shaking incubator at 37 °C. When the cells had grown to an OD600 of 0.6-0.8, 0.6 mM IPTG was added to the cell culture to induce protein expression at 16 °C. After 18 h, the cells were harvested by centrifugation at 4000 rpm for 20 min at 4 °C. The cell pellets were washed twice with PBS, resuspended in lysis buffer (50 mM HEPES, 300 mM NaCl, 10 mM imidazole, pH 7.5), lysed by sonication on ice (3 s ON, 5 s OFF, 30 min total time) and then clarified by ultracentrifugation at 18,000 rpm at 4 °C for 40 min to remove debris. The supernatants were then purified with TALON metal affinity resin and washed with washing buffer (25 mM HEPES, 500 mM NaCl, pH 7.5) to remove nonspecifically bound proteins. The His-tagged Mpro was eluted with elution buffer (25 mM HEPES, 500 mM NaCl, 300 mM imidazole, pH 7.5). His-tagged SUMO protease (home-made) was added to cleave the His-tag overnight at 4 °C. Mpro was then further purified with His60 Ni Superflow resin to remove the cleaved His-tag, the His-tagged SUMO protease and uncleaved His-tagged protein. The quality of Mpro was checked by SDS-PAGE, and the concentration of Mpro was determined with a BCA Protein Assay Kit. The purified Mpro was stored in storage buffer (10 mM Tris-HCl, 1 mM DTT, 1 mM EDTA, 10% glycerol, pH 7.5).

FRET-Based Mpro Enzyme Activity Inhibition Assay
The fluorescence resonance energy transfer (FRET)-based Mpro enzyme activity inhibition assay was conducted as follows. First, 5 µL of serially diluted candidate compounds was incubated with 35 µL of 150 nM Mpro in Assay Buffer (10 mM Tris-HCl, pH 7.5; 1 mM DTT; 1 mM EDTA; 0.01% Triton X-100) in a 96-well plate at room temperature for 30 min. Next, 10 µL of 20 µM fluorogenic substrate (Dabcyl-KTSAVLQSGFRKME-Edans, P9733-5 mg, purchased from Beyotime) in Assay Buffer was added on ice, after which the plate was shaken for 1 min and then transferred to a 37 °C incubator for 30 min of incubation. Fluorescence signals (excitation wavelength at 340 nm and emission wavelength at 490 nm) were measured using a PerkinElmer Envision multimode plate reader. Experiments were performed in triplicate. Experimental data were plotted using GraphPad Prism 8.0.
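The volumes above imply a 50 µL reaction, so the final concentrations in the well differ from those of the added solutions. The following sketch works through this dilution arithmetic; it assumes that 150 nM and 20 µM refer to the solutions as added rather than to final in-well concentrations, which the protocol does not state explicitly.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Concentration after diluting `stock_volume_ul` into `total_volume_ul`."""
    return stock_conc * stock_volume_ul / total_volume_ul


COMPOUND_UL, ENZYME_UL, SUBSTRATE_UL = 5, 35, 10
TOTAL_UL = COMPOUND_UL + ENZYME_UL + SUBSTRATE_UL  # 50 µL per well

# Assumption: the stated concentrations are those of the added solutions.
enzyme_final_nM = final_concentration(150, ENZYME_UL, TOTAL_UL)
substrate_final_uM = final_concentration(20, SUBSTRATE_UL, TOTAL_UL)
compound_dilution = TOTAL_UL / COMPOUND_UL  # fold dilution of each compound

if __name__ == "__main__":
    print(f"Mpro in well:      {enzyme_final_nM:.0f} nM")
    print(f"Substrate in well: {substrate_final_uM:.0f} µM")
    print(f"Compound dilution: {compound_dilution:.0f}-fold")
```

Under this assumption the well contains 105 nM Mpro and 4 µM substrate, and each compound is diluted 10-fold from its added concentration.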

Dataset Preprocessing and Chemical Space Analysis
Because of the highly conserved sequence and the similar substrate binding site of Mpro between SARS-CoV-1 and SARS-CoV-2, the previously described inhibitors targeting SARS-CoV-1 Mpro could be used as templates for the design of novel inhibitors against SARS-CoV-2. Thus, the dataset used for fine-tuning was collected from PubChem (AID:1706) and from the literature, and contains 629 active molecules and 288,940 inactive molecules [29,50]. Structural relationships between active and inactive compounds were analyzed using t-SNE (t-distributed stochastic neighbor embedding) (Figure 1A). Analysis details are provided in the Supplementary Information. Data obtained from PubChem were the result of a QFRET-based biochemical high-throughput screening assay. Two inactive molecules were dropped because their SMILES strings were longer than 150 characters. A scaffold-based 5-fold split was used to split the data. Due to the high imbalance of the dataset, data augmentation via a SMILES enumeration script was used to create more copies of active molecules. As shown in Table 1, different ratios of augmentation were applied for later comparison to seek the optimum dataset size. To confirm the scaffold differences among

Performance of the Fine-Tuned Model
We used transfer learning to fine-tune a classification model for Mpro target bioactivity prediction. A pre-trained ChemBERTa model was downloaded from HuggingFace. To compare the performance of the classifier on different datasets, we calculated various evaluation scores using five-fold cross-validation. As shown in Table 2, the model fine-tuned on augmented training data displayed better predictive ability on the validation dataset than the model fine-tuned on non-augmented data. An obvious improvement in evaluation scores was observed in all augmented datasets, especially in datasets with augmented active molecules 20 and
To assess the model performance more realistically, we also evaluated it on an external test dataset [47]. The external test dataset contains 880 fragments including 78 hits, which were screened through a combined mass spectrometry and X-ray approach against SARS-CoV-2 Mpro. The structural diversity between the training and external datasets was also analyzed using t-SNE, as shown in Figure 1D. As shown in Table 3, a drop in performance on the external dataset was observed compared with the performance on the validation dataset, which was expected because none of the molecules in the test dataset had been seen by the model during training. The f1 score is one of the most meaningful metrics because it is the harmonic mean of recall and precision. The dataset with 20 times more active molecules exhibited the highest f1 score of 0.34793, while GCNN and RF using the same training dataset only scored 0.0788 and 0.02025, respectively. au_prc and au_roc are two other evaluation metrics; the former is more sensitive to improvements in the positive class and is therefore the better indicator for imbalanced data. Among the fine-tuned models, training datasets with 10 and 20 times more active molecules achieved similar au_prc scores of 0.28671 and 0.28472, respectively, while the 80 times augmented dataset achieved a lower au_prc of 0.23152.
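To make the f1 values above easier to interpret, the following sketch computes precision, recall and f1 from binary labels and applies them to a trivial reference case on a set of the same size as the external test set (880 molecules, 78 hits): a classifier that labels every molecule as active. This is an illustration only and not the evaluation code used in this study; the function name is our own.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and f1 for binary labels (1 = active, 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # f1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Reference case: 78 hits among 880 molecules, everything predicted active.
    y_true = [1] * 78 + [0] * 802
    y_pred = [1] * 880
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

The all-active reference reaches an f1 of about 0.16, which gives a rough lower bar against which the reported f1 scores can be read.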
Having evaluated the performance of various models and confirmed the advantages of data augmentation, we used the whole dataset as training input to compare the prediction abilities of transfer learning and the freely available classifier Chemprop (http://chemprop.csail.mit.edu/ (accessed on 27 October 2021)) on this external test dataset. Chemprop performs molecular property prediction through a Message Passing Neural Network (MPNN), which works directly on a molecular graph [48]. Transfer learning with a 20 times augmented dataset achieved the highest au_prc of 0.34433, while the au_prc of Chemprop was 0.19321. The f1 score of transfer learning using the 20 times augmented data was 0.41321, while that of Chemprop was 0.19048 (Table 4).

Prediction of Bioactivities of Natural Compound and De Novo Generated Molecule Libraries
The model fine-tuned on the 20 times augmented dataset was then used to make predictions on the TargetMol natural compound library and a de novo generated molecule library. Scoring ranks were the average results of five independent predictions. A total of 385 natural compounds and 66 de novo generated molecules were predicted as bioactive. The lists of predicted active compounds are provided in Tables S1 and S2. The top-ranked 20 compounds from the natural compound library and 20 from the de novo generated molecule library are shown in Figures 2 and 3, respectively.
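Averaging over independent prediction runs before ranking can be sketched as below. The sketch assumes that each run yields a score between 0 and 1 per compound and that a mean score of at least 0.5 marks a compound as predicted active; the score type, the threshold and the compound identifiers are illustrative assumptions and are not taken from the study.

```python
from statistics import mean


def rank_by_mean_score(runs, threshold=0.5):
    """Average per-compound scores over runs and rank predicted actives.

    `runs` is a list of dicts mapping compound ID -> predicted score; every
    run is assumed to cover the same compounds. The threshold is an assumption.
    """
    compound_ids = runs[0].keys()
    averaged = {cid: mean(run[cid] for run in runs) for cid in compound_ids}
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    return [(cid, score) for cid, score in ranked if score >= threshold]


if __name__ == "__main__":
    # Five hypothetical runs over three hypothetical compounds.
    runs = [
        {"cpd_A": 0.91, "cpd_B": 0.40, "cpd_C": 0.66},
        {"cpd_A": 0.88, "cpd_B": 0.55, "cpd_C": 0.70},
        {"cpd_A": 0.93, "cpd_B": 0.35, "cpd_C": 0.61},
        {"cpd_A": 0.90, "cpd_B": 0.42, "cpd_C": 0.73},
        {"cpd_A": 0.89, "cpd_B": 0.38, "cpd_C": 0.65},
    ]
    for cid, score in rank_by_mean_score(runs):
        print(f"{cid}: mean score {score:.3f}")
```

Averaging in this way reduces the influence of a single run in which a borderline compound happens to cross the threshold.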

Molecular Docking Screening
We next submitted all the predicted active compounds to docking simulation using AutoDock Vina (version 1.2.0). Crystal structures of SARS-CoV-2 Mpro bound with inhibitor PF-07321332 (PDB: 7VH8) and N3 (PDB: 6LU7) were both downloaded from the Protein Data Bank. PF-07321332 (nirmatrelvir, the Mpro inhibitor component of Paxlovid) is an oral SARS-CoV-2 Mpro inhibitor developed by Pfizer and has shown positive responses in Phase III trials in combination with ritonavir [51]. N3 is a covalent inhibitor of SARS-CoV-2 Mpro derived from an inhibitor targeting SARS-CoV-1 Mpro [15]. After calculating the binding affinities of the compounds with Mpro, 46 compounds were selected for further binding pose analysis according to a cutoff score of −8.5 kcal/mol. After analysis of residue interactions in crystal structures of Mpro with PF-07321332 and N3, ligand interactions with F140 and E166 were considered critical for binding with Mpro. Twelve molecules were finally confirmed as hits because they formed more than two H-bonds with residues F140 and E166. These hits include 10 natural compounds (T5429, T2727, T5497, T1035, T1609, T6S1529, T3149, T3S1612, TL0006) and two de novo generated molecules (58353 and 52917). The binding poses of these hit compounds with Mpro are displayed in Table 5.

PAINS Filtering
In the final round of the in silico analysis, we performed PAINS (pan assay interference compounds) filtering through the freely available web server FAF-Drugs4 to identify molecules that may interfere with biological assays [49]. Such compounds may produce false positives in screening assays via a number of mechanisms and therefore represent poor choices for drug development [52]. We submitted all predicted active molecules to the server; 78 natural compounds and 5 de novo generated molecules were flagged as PAINS. Among the top 20 predicted natural compound hits and the 10 high dock-scoring hits, T2765 (rosmarinic acid), T2730 (gossypol acetic acid), T3012 (mangiferin), T3227 (danshensu), T2844 (hyperoside), T2775 (baicalin), T3232 (higenamine hydrochloride), T5429 (theaflavin 3,3′-digallate), T2727 (salvianolic acid B), T6S1529 (1,5-dicaffeoylquinic acid), T3149 (salvianolic acid C), TL0006 (chicoric acid) and T3242 (breviscapin) were flagged as PAINS. Among the top 20 predicted de novo generated hits and the two high dock-scoring hits, compound 52917, compound 42806, compound 64500 and compound 58353 were flagged as PAINS. However, virtual filters may not be perfect in identifying molecules that interfere with biological assays. Therefore, PAINS flags should be interpreted with caution, and experimental confirmation is always necessary before any 'problematic' molecules are discarded.

In Vitro Binding Assay Validation
In order to validate the in vitro activities of the selected hits, we purchased from TargetMol 18 natural compounds from the top 20 scored active compounds predicted by deep learning and 9 natural compounds selected by molecular docking. PF-07321332 and boceprevir were used as positive controls. These 27 compounds were tested in the SARS-CoV-2 Mpro inhibition assay at concentrations of 200 µM and 40 µM. As shown in Figure 4A, apart from PF-07321332, only compounds T2730 (gossypol acetic acid) and T2844 (hyperoside) inhibited Mpro catalytic activity by more than 50% at 200 µM, while all tested compounds showed less than 50% inhibition at 40 µM. The IC50 values of compounds T2730 and T2844 were further determined in dose-dependent studies and were 67.6 µM and 235.8 µM, respectively. Notably, because many researchers have reported that self-association of molecules into colloidal aggregates is one of the most common causes of non-specific inhibition [53,54], we added the detergent Triton X-100 to the assay buffer so that false positives caused by aggregate-based inhibition could be avoided. With and without Triton X-100, the inhibitory efficacies of the positive control boceprevir and compound T2730 displayed no obvious differences within the experimental error, although a slight decrease in the inhibitory effect of T2844 was observed when Triton X-100 was added. Gossypol acetic acid, a polyphenolic compound isolated from cottonseeds, has been reported to inhibit Bcl-2, Bcl-xL and Mcl-1 function and to have antiproliferative effects on some cancer cells in vitro [55]. Hyperoside, a naturally occurring flavonoid compound isolated from Artemisia capillaris, shows myocardial protective, hepatoprotective, anti-redox and anti-inflammatory activities [56]. It is also a derivative of quercetin, which was predicted to potentially inhibit SARS-CoV-2 Mpro [57]. Recently, Dr. Souza's group has demonstrated that a biflavonoid (agathisflavone) and two flavonols (myricetin and fisetin) are non-competitive inhibitors of SARS-CoV-2 Mpro, which indicates an interesting potential mode of action for these classes of compounds [58,59]. Further studies to better understand the mechanisms of action of these compounds are essential for chemical design to improve their activity profiles. Taken together, we have found that two natural compounds showed biological activity against Mpro in vitro.

Discussion
Artificial intelligence-aided drug design is becoming extensively used, especially for emerging diseases, because of its potential to save time and cost in the drug discovery and development process. Here, we used a data augmentation method to boost transfer learning model performance in the fine-tuned bioactivity prediction task. The model outperformed GCNN, RF and Chemprop. A natural compound library and a de novo generated molecule library were screened with this fast and efficient model. In combination with frequently used CADD techniques, such as molecular docking and PAINS filtering, this method allowed us to select a group of 27 commercially available compounds for further experimental validation. Among these experimentally tested compounds, gossypol acetic acid and hyperoside displayed inhibitory effects against Mpro with IC50 values of 67.6 μM and 235.8 μM, respectively. Even though these two compounds displayed only micromolar potency, they still provide valuable scaffolds for further drug design in the search for treatments of COVID-19. Follow-up cellular assays and in vivo experiments are also necessary to establish the efficacy and safety of these compounds and to better understand their mechanisms of action. Overall, our results demonstrate the feasibility of finding potential candidate compounds using a deep learning method, and the experimental outcome suggests that these natural products may merit further biological studies of their potential ability to block SARS-CoV-2 infection.

Supplementary Materials:
The following supporting information can be downloaded at: www.mdpi.com/xxx/s1. The fine-tuned model prediction results are provided in Tables S1 and S2 (.xls).

Data Availability Statement:
The pre-trained model used for transfer learning can be freely downloaded from HuggingFace (https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1 (accessed on 15 June 2021)). AutoDock Vina (version 1.2.0) used for molecular docking can be downloaded from its GitHub repository (https://github.com/ccsb-scripps/AutoDock-Vina (accessed on 11 November 2021)). The web server FAF-Drugs4 used for PAINS filtering is publicly available at https://fafdrugs4.rpbs.univ-paris-diderot.fr/. All relevant data are shown in the figures, tables and Supplementary Materials. The active natural compounds and de novo generated compounds predicted by the ChemBERTa model are listed in Table S1 and Table S2, respectively. Molecular docking scores are provided in Table S3.

Conflicts of Interest:
The authors declare no conflict of interest.