Discovering the Active Ingredients of Medicine and Food Homologous Substances for Inhibiting the Cyclooxygenase-2 Metabolic Pathway by Machine Learning Algorithms

Cyclooxygenase-2 (COX-2) and microsomal prostaglandin E2 synthase (mPGES-1) are two key targets in anti-inflammatory therapy. Medicine and food homology (MFH) substances have both edible and medicinal properties, providing a valuable resource for the development of novel, safe, and efficient COX-2 and mPGES-1 inhibitors. In this study, we collected active ingredients from 503 MFH substances and constructed the first comprehensive MFH database containing 27,319 molecules. Subsequently, we performed Murcko scaffold analysis and K-means clustering to deeply analyze the composition of the constructed database and evaluate its structural diversity. Furthermore, we employed four supervised machine learning algorithms, including support vector machine (SVM), random forest (RF), deep neural networks (DNNs), and eXtreme Gradient Boosting (XGBoost), as well as ensemble learning, to establish 640 classification models and 160 regression models for COX-2 and mPGES-1 inhibitors. Among them, ModelA_ensemble_RF_1 emerged as the optimal classification model for COX-2 inhibitors, achieving predicted Matthews correlation coefficient (MCC) values of 0.802 and 0.603 on the test set and external validation set, respectively. ModelC_RDKIT_SVM_2 was identified as the best regression model based on COX-2 inhibitors, with root mean squared error (RMSE) values of 0.419 and 0.513 on the test set and external validation set, respectively. ModelD_ECFP_SVM_4 stood out as the top classification model for mPGES-1 inhibitors, attaining MCC values of 0.832 and 0.584 on the test set and external validation set, respectively. The optimal regression model for mPGES-1 inhibitors, ModelF_3D_SVM_1, exhibited predictive RMSE values of 0.253 and 0.35 on the test set and external validation set, respectively. 
Finally, we proposed a ligand-based cascade virtual screening strategy, which integrated the well-performing supervised machine learning models with unsupervised learning: the self-organizing map (SOM) and molecular scaffold analysis. Using this virtual screening workflow, we discovered 10 potential COX-2 inhibitors and 15 potential mPGES-1 inhibitors from the MFH database. We further verified the candidates by molecular docking and investigated the interactions of the candidate molecules upon binding to COX-2 or mPGES-1. The constructed comprehensive MFH database lays a solid foundation for further research on and utilization of MFH substances. The series of well-performing machine learning models can be employed to predict the COX-2 and mPGES-1 inhibitory capabilities of unknown compounds, thereby aiding in the discovery of anti-inflammatory medications. The potential COX-2 and mPGES-1 inhibitor molecules identified through the cascade virtual screening approach provide insights and references for the design of highly effective and safe novel anti-inflammatory drugs.


Introduction
Eicosanoids, derived from arachidonic acid (AA) and related polyunsaturated fatty acids (PUFAs), play a crucial role in regulating a wide range of homeostatic and inflammatory processes associated with many diseases, such as atherosclerosis, Alzheimer's disease, and cancer [1]. Studies of eicosanoids have primarily focused on prostaglandin E2 (PGE2), because PGE2 is the main contributor to the fundamental symptoms of inflammation, such as fever, swelling, redness, pain, and loss of function [2]. In the production of PGE2, cyclooxygenase-2 (COX-2) catalyzes the conversion of AA into prostaglandin H2 (PGH2). Subsequently, microsomal prostaglandin E2 synthase (mPGES-1) converts PGH2 into PGE2. Therefore, suppressing the overexpression of COX-2 and mPGES-1 is a reasonable strategy in the treatment of inflammation.
Traditional non-steroidal anti-inflammatory drugs (NSAIDs), such as aspirin, inhibit both COX-1 and COX-2. This reduces the production of the pro-inflammatory factor PGE2 along with the cytoprotective prostaglandin I2, leading to an increased risk of severe gastrointestinal and cardiovascular disease. Although highly selective COX-2 inhibitors, such as celecoxib and rofecoxib, can alleviate some of the side effects associated with NSAIDs, their long-term use does not completely avoid the risk of cardiovascular disease [3]. Consequently, current research is focused on discovering potential COX-2 inhibitors that exhibit both high selectivity and low toxicity. In recent years, mPGES-1 inhibitors have been suggested as attractive anti-inflammatory therapeutics since they can selectively reduce PGE2 without affecting the cytoprotective prostaglandins that regulate homeostasis. However, the wide variation in mPGES-1 inhibitor efficacy between humans and mice makes it difficult to fully exploit rodent disease models, and the lack of effective inhibitors with cross-species activity further hinders the development of safe and effective mPGES-1 inhibitors [4].
Medicine and food homology (MFH) refers to substances that serve as both food and herbal medicine. It reflects the traditional Chinese concept of health care and encompasses food therapy, health care, and medicinal food in Chinese medicine [5]. MFHs have passed the food safety risk assessment of the Chinese National Health Commission and have been shown to be suitable both for long-term use as daily food and for disease treatment and health benefits [6]. Natural products have served as a valuable source of drugs and lead compounds due to their complexity, structural diversity, and extensive pharmacological effects. A growing number of studies have revealed that many active ingredients of plant-derived and marine-derived natural products are involved in the regulation of inflammatory responses in humans [7], especially through COX-2 and mPGES-1 inhibitory activity. A series of secondary metabolites isolated from marine products have shown potency in inhibiting COX-2; their chemical structures are shown in Figure 1A. Vaccinal A, Botryoisocoumarin A [8], Axinelline A, Capnellene, Stachybogrisephenone B, and Actinoquinoline A inhibit COX-2 in vitro with IC50 values of 1.8 µM, 6.5 µM, 2.8 µM, 6.2 µM, 8.9 µM, and 2.1 µM, respectively [9]. Some compounds from medicinal plants have been verified to exhibit a prominent mPGES-1 inhibitory effect (Figure 1B): Curcumin, Hyperforin, Garcinol, Arzanol, Myrtucommulone, Carnosic acid, and Embelin inhibit mPGES-1 in vitro with IC50 values of 0.3 µM, 1 µM, 0.3 µM, 0.4 µM, 1 µM, 5 µM, and 0.2 µM, respectively [10]. Although many natural products are reported to be potent COX-2 and mPGES-1 inhibitors or transcriptional suppressors, poor oral availability and non-negligible toxicity still limit their further clinical use. By contrast, MFHs are easily absorbed and utilized by humans and are relatively nontoxic. Additionally, there are some multitarget anti-inflammatory compounds in Chinese herbal medicine.
Acrovestone, extracted from Acronychia, inhibits mPGES-1 and 5-lipoxygenase (5-LOX) in vitro with IC50 values of 2.7 µM and 1.1 µM, respectively [11]. Cannflavin A, extracted from Cannabis sativa, inhibits mPGES-1 and 5-LOX in vitro with IC50 values of 1.8 µM and 0.9 µM, respectively. Such multi-target inhibitors can reduce the risk of drug-drug interactions and contribute to efficient anti-inflammatory therapy [12]. Therefore, MFHs are capable of providing insight into the discovery of novel, safe, and effective COX-2 and mPGES-1 inhibitors.
There has been a growing interest in the application of machine learning (ML) and deep learning (DL) in the field of biomedicine. Various emerging supervised and unsupervised learning algorithms have shown potential in providing valuable insights and predictions for drug discovery, drug repurposing, diagnostics, and pharmaceutical production [13]. Currently, virtual screening (VS) combined with experimental verification is the standard protocol for drug discovery. Some ML-facilitated VS approaches have been widely used to mine novel molecular substances from in-house or commercial compound libraries [14]. ML-based quantitative structure-activity relationship (QSAR) models have become a favorable tool in VS due to their ability to rapidly output predictions based on input datasets and their high hit rate [15]. Recent studies have focused on implementing VS on natural compound libraries because active ingredients derived from natural products are considered an ideal starting point for designing anti-inflammatory agents [16]. Wang's group [17] built several classification models and employed these models to identify potential P-glycoprotein inhibitors in the Traditional Chinese Medicine Systems Pharmacology (TCMSP) database. Sattar et al. [18] utilized molecular docking to screen potential COX inhibitors among compounds isolated from Eucalyptus maculata resin, and they found that 1,6-dicinnamoyl-O-α-D-glucopyranoside exhibited a COX-2 inhibitory effect.
Currently, obscure functions and unclear therapeutic targets limit the clinical practice of medicine and food homology (MFH) substances. In this study, we constructed the first comprehensive MFH database with the aim of further investigating the complex active ingredients of MFH and deciphering their biology and functional activities (the detailed workflow is shown in Figure 2). Additionally, we constructed an updated comprehensive dataset of COX-2 and mPGES-1 inhibitors. Leveraging these datasets, we employed supervised machine learning algorithms and ensemble learning techniques to develop a series of QSAR models for predicting inhibitory efficacy against COX-2 and mPGES-1. Furthermore, utilizing the well-performing machine learning models combined with unsupervised learning and scaffold analysis, we performed virtual screening on the MFH database to identify potential COX-2 and mPGES-1 inhibitors that could contribute to efficient and safe anti-inflammatory therapy.
Figure 2. The workflow of this study. The pink circle in the interaction diagram represents polar amino acids, the green represents hydrophobic amino acids.

Chemical Space and Scaffold Analysis of the MFH Database
We constructed a database containing 27,319 active ingredient molecules derived from 503 kinds of MFH substances by collecting data from the Traditional Chinese Medicines Integrated Database (TCMID) [19] and the Traditional Chinese Medicine Systems Pharmacology Database and Analysis Platform (TCMSP) [20], as well as 92 literature sources. Details of the constructed MFH database can be seen in Table S1. The distribution of two physicochemical properties of the MFH database is shown in Figure 3. The molecular weight (MW) of molecules derived from TCMID, TCMSP, and the literature ranges from 32 to 1900, 30 to 1466, and 59 to 1263, respectively. The octanol-water partition coefficient (LogP) of molecules derived from TCMID, TCMSP, and the literature ranges from −24.9 to 24, −11.9 to 24.5, and −8.8 to 12.6, respectively. The distribution of basic physicochemical properties across two orders of magnitude demonstrates the extensive chemical space of our MFH database.
To further evaluate the structural diversity of our MFH database, we calculated the Tanimoto coefficients (TCs) [21] between molecules (represented by 1024-bit ECFP4 fingerprints). The average TC value for the entire MFH database was 0.216, with 95.92% of molecule pairs having TC values of less than 0.6, indicating significant structural differences within our MFH database.
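As an illustration of the diversity statistics quoted above, the following minimal sketch computes the mean pairwise TC and the fraction of pairs below a threshold. It assumes each fingerprint is given as a set of on-bit indices; the function names and the toy fingerprints are hypothetical and are not real ECFP4 output or the code used in this study.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def diversity_summary(fps, threshold=0.6):
    """Return (mean pairwise TC, fraction of pairs with TC below threshold)."""
    tcs = [tanimoto(a, b) for a, b in combinations(fps, 2)]
    mean_tc = sum(tcs) / len(tcs)
    frac_below = sum(tc < threshold for tc in tcs) / len(tcs)
    return mean_tc, frac_below


# Toy fingerprints: hypothetical on-bit indices within a 1024-bit space.
toy_fps = [
    frozenset({1, 5, 9, 200}),
    frozenset({1, 5, 9, 512}),
    frozenset({3, 77, 1023}),
]
mean_tc, frac_below = diversity_summary(toy_fps)
```

For a database of 27,319 molecules the number of pairs is large (about 3.7 × 10^8), so a practical implementation would likely rely on vectorized bit operations rather than Python sets.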
In addition, we extracted the Murcko scaffolds of the MFH database and performed a K-means clustering [22] analysis. As a result, the MFH database was divided into 11 clusters. The Flavonoids cluster was the most abundant, comprising approximately one-fifth of the database, while Fatty Acids, Saponins, and Steroids each accounted for roughly one-tenth. Moreover, Lignans, Alkaloids, Triterpenoids, Sesquiterpenes, Diterpenes, and Stilbenes were also present in the MFH database (see Figure 4A). These are commonly isolated components of natural products, and some Flavonoids, Steroids, Alkaloids, and Triterpenoids have been reported to play important roles in the regulation of immune responses [23]. We further summarized the top 20 Murcko scaffolds in the MFH database, as shown in Figure 4B; the scaffolds ranged from simple aromatic natural products with a single ring to complex skeletons with 7-8 ring systems. These observations demonstrate that our MFH database is comprehensive and exhibits high structural diversity.

Performances Evaluation and Comparison of Developed Models
In this study, we developed classification models using four datasets: Datasets 1 and 2, which contain COX-2 inhibitors, and Datasets 4 and 5, which are composed of mPGES-1 inhibitors. To generate training and test sets, we randomly split each classification dataset 10 times. We then characterized these sets using three types of fingerprints: Avalon, ECFP4, and MACCS. Four machine learning methods were applied to predict high or weak inhibition of COX-2 and mPGES-1 separately. As a result, we constructed a total of 480 classification models. Subsequently, we integrated the predicted probabilities of classification models (built with the same algorithm but with different molecular characterizations) by a stacked generalization approach. This resulted in the construction of an additional 160 ensemble classification models: 80 for COX-2 inhibitors and 80 for mPGES-1 inhibitors. Additionally, we utilized two datasets, Dataset 3 and Dataset 6, each represented by two types of descriptors, to build 160 regression models following a process similar to that used for the classification models.
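The model counts above follow directly from the experimental design. The short sketch below reproduces the arithmetic; the variable names are illustrative, and the two descriptor-set names (RDKit and G_3D) are taken from the regression results reported later in the text.

```python
n_splits = 10
algorithms = ["RF", "SVM", "DNN", "XGBoost"]
fingerprints = ["Avalon", "ECFP4", "MACCS"]
descriptor_sets = ["RDKit", "G_3D"]

classification_datasets = {"COX-2": [1, 2], "mPGES-1": [4, 5]}
regression_datasets = {"COX-2": [3], "mPGES-1": [6]}

n_cls = sum(len(ids) for ids in classification_datasets.values())
n_reg = sum(len(ids) for ids in regression_datasets.values())

# One base classifier per dataset, split, fingerprint, and algorithm.
base_classifiers = n_cls * n_splits * len(fingerprints) * len(algorithms)

# One ensemble per dataset, split, and algorithm (the fingerprints are merged).
ensembles_per_target = {
    target: len(ids) * n_splits * len(algorithms)
    for target, ids in classification_datasets.items()
}
ensemble_classifiers = sum(ensembles_per_target.values())

# One regressor per dataset, split, descriptor set, and algorithm.
regressors = n_reg * n_splits * len(descriptor_sets) * len(algorithms)

print(base_classifiers, ensemble_classifiers, regressors)
```

The base and ensemble classifiers together give the 640 classification models stated in the abstract.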

Performances of Classification and Ensemble Models on Dataset 1
The Matthews correlation coefficients (MCCs) on the test sets were employed to evaluate the predictive ability and stability of constructed classification models, and the MCC on external validation sets was applied to assess the generalization ability of models. Table 1 lists the overall model performances based on Dataset 1 (1640 COX-2 inhibitors), and the detailed performances of all models on Dataset 1 can be seen in Table S1. Figure 5A depicts the MCC values of 10 randomly split test sets for Dataset 1.
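For reference, the MCC is computed from the four cells of a binary confusion matrix. The sketch below uses the standard definition; the function name and the example counts are hypothetical and are not taken from the study's results.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical test-set counts: highly active is the positive class.
example = mcc(tp=90, tn=85, fp=15, fn=10)
```

The MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it remains informative when the two classes are imbalanced.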
When comparing the performance of different algorithms, SVM and XGBoost slightly outperformed the other machine learning methods. When using Avalon fingerprints to characterize the dataset, the mean MCC values on 10 randomly split test sets for the SVM and XGBoost models were 0.783 and 0.744, respectively. When using ECFP4 fingerprints, the mean MCC values for the SVM and XGBoost models were 0.763 and 0.768, respectively. When using MACCS fingerprints, the mean MCC values for the SVM and XGBoost models were 0.746 and 0.753, respectively. Meanwhile, the RF algorithm performed slightly worse, with average MCC values on 10 randomly split test sets of 0.721, 0.704, and 0.701 when using Avalon, ECFP4, and MACCS fingerprints to characterize the dataset, respectively.
When comparing the performance of different descriptors, all three fingerprints (Avalon, ECFP4, and MACCS) were equally effective in characterizing the dataset, with no significant differences. The MCC values on the test sets for all models exceeded 0.66. Compared to the base classification models, the ensemble models that integrated all three fingerprints showed improved performance, with mean MCC values of 0.735, 0.782, 0.747, and 0.774 on the test sets when using RF, SVM, DNN, and XGBoost models, respectively. The ensemble model for each algorithm had an average MCC improvement of 0.02 on the test set compared to the models constructed with a single type of fingerprint.
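The ensemble idea, stacked generalization over the probabilities predicted by the three fingerprint-specific models, can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the meta-learner, so a plain logistic regression fitted by gradient descent is assumed here, and the probabilities and labels are toy values. In practice, the meta-learner would be trained on out-of-fold predictions to avoid leakage.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_learner(prob_rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-learner on base-model probabilities."""
    n_features = len(prob_rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(prob_rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(prob_rows, labels):
            err = sigmoid(bias + sum(w * p for w, p in zip(weights, row))) - y
            for j, p in enumerate(row):
                grad_w[j] += err * p
            grad_b += err
        # Batch gradient-descent step on the mean log-loss gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_meta(weights, bias, row) -> float:
    """Ensemble probability that a molecule is a highly active inhibitor."""
    return sigmoid(bias + sum(w * p for w, p in zip(weights, row)))


# Toy held-out probabilities from three base models (Avalon, ECFP4, MACCS).
train_probs = [
    [0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.7, 0.6, 0.9],
    [0.2, 0.3, 0.4], [0.1, 0.2, 0.3], [0.3, 0.1, 0.2],
]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_meta_learner(train_probs, train_labels)
p_high = predict_meta(weights, bias, [0.95, 0.95, 0.95])
p_low = predict_meta(weights, bias, [0.05, 0.05, 0.05])
```

Because the meta-learner weighs the three base models, it can down-weight a fingerprint that is less reliable for a given algorithm, which is one plausible reason for the modest MCC gains reported above.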

Performances of Classification and Ensemble Models on Dataset 2
The overall performances of the developed classifiers on Dataset 2 are summarized in Table 2, and the detailed performances of all models on Dataset 2 (containing 2925 COX-2 inhibitors) can be seen in Table S2. The detailed MCC values of 10 randomly split test sets for Dataset 2 are visualized in Figure 5B. The MCC values on the test sets of the models on Dataset 2 all exceeded 0.52, which was a decrease in performance compared to the classification models constructed with Dataset 1. This was mainly because the moderately active inhibitors were removed from Dataset 1, so the structural differences between the highly and weakly active inhibitors were more easily distinguished. Similar to the performances on Dataset 1, the SVM and XGBoost algorithms also exhibited excellent performance for Dataset 2, with MCC values of 0.632 ± 0.018, 0.63 ± 0.022, and 0.621 ± 0.025 for the SVM model on the test set of the dataset characterized using Avalon, ECFP4, and MACCS, respectively. The MCC values of the XGBoost model on the test set of the dataset represented by Avalon, ECFP4, and MACCS were 0.644 ± 0.014, 0.625 ± 0.028, and 0.626 ± 0.012, respectively. The MCC performances of the ensemble RF, SVM, DNN, and XGBoost models on the test sets were 0.576 ± 0.02, 0.642 ± 0.016, 0.604 ± 0.015, and 0.643 ± 0.012, respectively. Compared with models constructed with one type of fingerprint, the ensemble models of each algorithm increased the average MCC on the test set by 0.01.

Performances of Regression Models on Dataset 3
Table 3 displays the overall performances of the models using two criteria (R² and RMSE) based on Dataset 3, which consisted of 1511 COX-2 inhibitors. Detailed results of all 80 QSAR models are listed in Table S3. The RMSE values on the test sets and external validation sets were utilized as the main metric to evaluate the performance of the regression models. Figure 6 shows the specific RMSE values of 10 randomly split test sets on Dataset 3. The RMSE values for the test sets of models constructed using the G_3D descriptors were all below 0.56. From the perspective of modeling algorithms, the SVM algorithm performed the best with an average RMSE of 0.465 on the test sets. Among the models constructed using RDKit descriptors, the SVM algorithm also showed the highest predictive power with a mean RMSE of 0.436 ± 0.019, followed by the comparable DNN and XGBoost algorithms with mean RMSE values of 0.471 ± 0.019, respectively. From the descriptor perspective, the RMSEs of the test sets of the models constructed using RDKit descriptors were all below 0.54. Based on the same algorithm, RDKit-based models performed slightly better than the G_3D-based models, which indicates that the RDKit 2D descriptors are more suitable for characterizing the COX-2 inhibitors collected in this study.

Performances of Classification and Ensemble Models on Dataset 4
Table 4 summarizes the overall model performances on Dataset 4, which included 3179 mPGES-1 inhibitors. The detailed performances of all models on Dataset 4 can be seen in Table S4 and Figure 7A. The XGBoost algorithm performed the best, with mean MCC values of 0.74, 0.83, and 0.765 for 10 randomly partitioned test sets based on the Avalon, ECFP4, and MACCS characterized datasets, respectively. From the perspective of dataset characterization, ECFP4 fingerprints slightly outperformed the other two fingerprints. The highest MCC values were obtained on the test sets of the SVM, DNN, and XGBoost models using the dataset characterized by ECFP4 fingerprints. Among the ensemble classification models constructed based on the three fingerprints, the ensemble SVM and XGBoost models achieved excellent MCC performances of 0.783 ± 0.017 and 0.788 ± 0.006 on the test sets, respectively.

Performances of Classification and Ensemble Models on Dataset 5
Compared to the classification models from Dataset 4 (Table 4), the classification models constructed based on Dataset 5, containing 3455 mPGES-1 inhibitors, were slightly less discriminative for highly/weakly active mPGES-1 inhibitors, as shown in Table 5, Table S5, and Figure 7B. In terms of machine learning algorithms, the models established by the RF algorithm had the worst performance, with all the average MCC values below 0.6 on the test sets based on the datasets characterized by the three types of fingerprints. Conversely, the SVM and XGBoost algorithms exhibited better performance. XGBoost models achieved the best performance on the test sets based on the datasets characterized by Avalon and MACCS. On the test sets based on the ECFP4 characterization, the SVM and XGBoost models achieved excellent performance with average MCC values of 0.773 and 0.772, respectively. In addition, the ensemble SVM and XGBoost classification models obtained average MCC values of 0.738 ± 0.017 and 0.749 ± 0.011 on the test sets, respectively.

Performances of Regression Models on Dataset 6
Based on Dataset 6 consisting of 735 mPGES-1 inhibitors, a total of 80 regression models (Table 6) were constructed using the four algorithms to predict the bioactivities of the inhibitors; the details of all model performances are listed in Table S6. As seen in Table 6 and Figure 8, the RMSE values of the test sets of the models constructed using the G_3D descriptors were all below 0.42, with the SVM algorithm performing the best, with an average RMSE of 0.329 on the test sets. This was followed by models constructed by the XGBoost algorithm, with an average RMSE of 0.339 on the test sets. The RMSE values of the test sets of the models constructed using the 2D RDKit descriptors were all below 0.35. From the descriptors' perspective, RDKit-based models performed slightly better than G_3D-based models for the same algorithm, with an average RMSE reduction of 0.03 in the test sets.
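The two regression criteria used in this work, RMSE and R², can be computed as in the following minimal sketch. The function names and the activity values are hypothetical placeholders; the text does not state the scale on which bioactivity was modeled, so no unit is implied here.

```python
import math


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between observed and predicted activities."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted activity values for four molecules.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

A lower RMSE indicates predictions closer to the measured activities, whereas R² measures the share of activity variance explained by the model, so the two criteria complement each other.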

Performances of the External Validation Sets
In addition to evaluating the predictive accuracy and stability of the models through the performance of the five-fold cross-validation and test sets, four external validation sets were also employed to verify the generalization ability of the constructed classification and regression models. Only models that performed well on the external validation sets were applied to further implement virtual screening on the MFH database. The ensemble SVM models and the ensemble XGBoost models performed best among the classification models constructed based on Dataset 1, with predicted MCC values of 0.551 ± 0.012 and 0.546 ± 0.014 for external validation set A1, respectively. Similarly, the ensemble SVM and the ensemble XGBoost models outperformed the others among the classification models developed based on Dataset 2, with predicted MCC values of 0.544 ± 0.011 and 0.531 ± 0.01, respectively. Among the regression models built on Dataset 3, those constructed using the XGBoost algorithm and the RDKit descriptors gave the most accurate predictions, with an RMSE of 0.584 ± 0.035 for external validation set A2. Among the classification models constructed based on Dataset 4, models constructed by XGBoost with the ECFP4 fingerprints had the highest predictive performance for external validation set B1, with an MCC of 0.547 ± 0.015. For the classification models established based on Dataset 5, the models developed using the SVM and XGBoost algorithms with ECFP4 fingerprints had the best performance; the predicted MCC values for external validation set B1 reached 0.538 ± 0.016 and 0.537 ± 0.009, respectively. Models constructed through the SVM algorithm combined with RDKit descriptors outperformed the other regression models built based on Dataset 6, with a predicted RMSE of 0.424 ± 0.019 for external validation set B2. The specific model results and parameters applied for virtual screening of the MFH database are summarized in Table S7.

Virtual Screening on the MFH Database
In this study, we employed a ligand-based cascade virtual screening strategy to predict potential COX-2 and mPGES-1 inhibitors from the constructed MFH database containing 27,319 molecules. Our goal was to advance the development of safer anti-inflammatory therapies. The virtual screening workflow proceeded as follows, taking the screening of COX-2 inhibitors as an example. Firstly, molecules of the MFH database were predicted using 39 classification models that exhibited excellent performance on external validation sets (see Table S8). These 39 classification models comprised both single classifiers and ensemble classifiers constructed based on Dataset 1, as well as single classifiers and ensemble classifiers constructed based on Dataset 2. By combining different datasets with multiple fingerprints and employing various supervised machine learning algorithms and the ensemble algorithm, the classification models used to predict the COX-2 inhibitory activity of molecules in the MFH database demonstrated strong generalization ability and robustness, which lends a degree of reliability to the prediction results. Subsequently, the molecules predicted as highly active inhibitors by the 39 classification models were further input into 17 regression models to predict their specific inhibitory values against COX-2 (detailed performances are listed in Table S8). The predicted values from the 17 regression models were averaged to mitigate occasional errors introduced by individual models, further enhancing the credibility of the prediction results. Based on the ranked mean predicted bioactivities from the 17 regression models, molecules with average predicted IC50 values below 10 µM (a common threshold for inhibitory potency) were retained. Additionally, we applied the self-organizing map (SOM), an unsupervised algorithm, to the molecules of the MFH database. The positioning of molecules on a two-dimensional neural network was used to determine their high or weak COX-2 inhibitory activity. This approach enabled the retention of MFH molecules that resided in the same location as the majority of highly active molecules (over 80%). Finally, COX-2 inhibitor candidates were selected through an extensive literature review and molecular scaffold analysis. The screening process for mPGES-1 inhibitors followed a similar procedure.
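The automated stages of this cascade can be summarized in a short sketch. It is an illustration under stated assumptions, not the pipeline used in this study: the classifiers, regressors, and SOM lookup are hypothetical callables, a molecule is assumed to pass the first stage only if every classifier labels it as highly active, and the regressors are assumed to return IC50 values in µM. The final literature review and scaffold analysis are manual and are not modeled.

```python
from statistics import mean


def cascade_screen(molecules, classifiers, regressors, som_hp,
                   ic50_cutoff_um=10.0, hp_cutoff=0.8):
    """Return (molecule, mean predicted IC50) pairs that pass all stages."""
    hits = []
    for mol in molecules:
        # Stage 1: consensus of the classification models (assumed unanimous).
        if not all(clf(mol) for clf in classifiers):
            continue
        # Stage 2: average the regression predictions and apply the cutoff.
        mean_ic50 = mean(reg(mol) for reg in regressors)
        if mean_ic50 >= ic50_cutoff_um:
            continue
        # Stage 3: keep molecules mapped to SOM cells dominated by actives.
        if som_hp(mol) <= hp_cutoff:
            continue
        hits.append((mol, mean_ic50))
    # Rank the surviving molecules by predicted potency.
    return sorted(hits, key=lambda hit: hit[1])


# Toy molecules carrying precomputed, hypothetical model outputs.
toy = [
    {"id": "m1", "votes": [1, 1], "ic50": [2.0, 4.0], "hp": 0.9},
    {"id": "m2", "votes": [1, 0], "ic50": [1.0, 1.0], "hp": 0.95},
    {"id": "m3", "votes": [1, 1], "ic50": [12.0, 15.0], "hp": 0.9},
    {"id": "m4", "votes": [1, 1], "ic50": [3.0, 5.0], "hp": 0.5},
    {"id": "m5", "votes": [1, 1], "ic50": [1.0, 2.0], "hp": 0.85},
]
classifiers = [lambda m, i=i: bool(m["votes"][i]) for i in range(2)]
regressors = [lambda m, i=i: m["ic50"][i] for i in range(2)]
hits = cascade_screen(toy, classifiers, regressors, som_hp=lambda m: m["hp"])
```

Ordering the stages from cheapest and most permissive to most specific keeps the number of molecules that require manual inspection small.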

Potential COX-2 Inhibitors in the MFH Database
Through the cascade virtual screening described above, 10 potential COX-2 inhibitors were selected from a pool of 27,319 MFH molecules. The structures of these molecules, along with their positions on the two-dimensional grid mapped by SOM, are presented in Figure 9. From Figure 9, it can be observed that the majority of the MFH molecules are distributed in different grid cells compared to known COX-2 inhibitor molecules. This discrepancy arises from significant structural differences between natural MFH molecules and the synthesized or modified COX-2 inhibitors. Our focus lies primarily on the MFH molecules that share the same position as reported natural COX-2 inhibitors and on those located in the grids predominantly occupied by highly active inhibitors (rendered in red in Figure 9). HP represents the proportion of the grid occupied by highly active inhibitors, that is, the ratio of the number of highly active inhibitors in the grid to the number of all molecules in the grid. The closer the calculated HP of a grid is to 1, the warmer its color (red); conversely, the closer the HP value of a grid is to 0, the cooler its color (blue). The origin, predicted bioactivity values, and reported effects of the candidate molecules are summarized in Table 7.
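The HP value defined above is a simple per-cell ratio. The following sketch computes it from SOM cell assignments of known inhibitors; the cell coordinates and activity labels are toy values, and the function name is hypothetical.

```python
from collections import defaultdict


def grid_hp(assignments):
    """Map each SOM cell to its HP: highly active count / all molecules in cell.

    `assignments` is an iterable of (cell, is_highly_active) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_highly_active, n_total]
    for cell, is_highly_active in assignments:
        counts[cell][0] += int(is_highly_active)
        counts[cell][1] += 1
    return {cell: high / total for cell, (high, total) in counts.items()}


# Toy assignments of known inhibitors to (row, column) grid cells.
toy_assignments = [
    ((0, 0), True), ((0, 0), True), ((0, 0), False), ((0, 0), True),
    ((0, 0), True),
    ((1, 2), False), ((1, 2), False),
    ((3, 1), True),
]
hp = grid_hp(toy_assignments)
```

A database molecule mapped to a cell with an HP close to 1 would then be treated as a likely highly active inhibitor, while cells that contain no known inhibitors have no defined HP in this sketch.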
Molecules 2023, 28, 6782
Figure 9. The mapping positions of COX-2 inhibitor candidates in the self-organizing map. HP represents the proportion of the number of highly active inhibitors in the grid to all molecules in the grid. The larger the calculated HP of one grid (close to 1), the warmer the color tends to be (red). On the contrary, the closer the HP value of a grid is to 0, the cooler the color tends to be (blue).
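The HP value used to color the map can be computed per grid cell as in the following sketch. It assumes that the best-matching cell of every training inhibitor is already known and that HP is computed over the training inhibitors only; the function name `cell_hp`, the cell coordinates, and the toy data are hypothetical.

```python
from collections import defaultdict


def cell_hp(training_hits):
    """Map each SOM cell to HP = (# highly active inhibitors) / (# all molecules) in the cell.

    training_hits: iterable of ((row, col), is_highly_active) for the training inhibitors.
    """
    totals = defaultdict(int)
    actives = defaultdict(int)
    for cell, is_highly_active in training_hits:
        totals[cell] += 1
        if is_highly_active:
            actives[cell] += 1
    return {cell: actives[cell] / totals[cell] for cell in totals}


# Toy data: best-matching cells of known inhibitors with their activity class.
training = [
    ((0, 0), True), ((0, 0), True), ((0, 0), True), ((0, 0), True), ((0, 0), False),
    ((1, 2), False), ((1, 2), True),
    ((3, 1), True),
]
hp = cell_hp(training)

# An MFH molecule inherits the HP of the cell it maps to; cells without
# training inhibitors have no HP and are reported as None.
candidate_cells = {"mfh_a": (0, 0), "mfh_b": (1, 2), "mfh_c": (2, 2)}
candidate_hp = {name: hp.get(cell) for name, cell in candidate_cells.items()}
print(candidate_hp)
```

Under this reading, the value reported for a candidate is the HP of its cell, which matches the interpretation of HP as the probability of being predicted as a highly active inhibitor by the unsupervised algorithm.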
Candidate cmp_A3, also known as Humulene, is derived from Panax Ginseng and is a monocyclic sesquiterpene. It has been reported to exhibit inhibitory activity against COX-2, suppressing the expression of COX-2 in mice and reducing the production of prostaglandin E2 (PGE2) [24]. Another candidate, cmp_A1 (Dehydrotanshinone II A), derived from Radix Salviae, is a benzofuran derivative. Kwon et al. isolated Dehydrotanshinone II A from Salvia miltiorrhiza Bunge and validated its inhibitory effect on COX-2 using a platelet activation model [25]. Candidate cmp_A6 (β-Sesquiphellandrene) is commonly found in Atractylodes macrocephala and Zingiber officinale. Zingiber officinale oil has been reported to possess anti-inflammatory activity, particularly by inhibiting lipoxygenase [26]. Candidate cmp_A7, derived from Lycii Fructus, is an apocarotenoid that significantly inhibits the expression of IL-1β in vitro, thus exerting an anti-inflammatory effect [27]. Candidate cmp_A5, named Scutianine C, is derived from Jujubae Fructus and has been reported as a biologically active alkaloid with antimicrobial properties [28]. Candidate cmp_A10 (Hispaglabridin B), isolated from Glycyrrhiza glabra L., is an isoflavone derivative located in the same grid as the reported natural COX-2 inhibitors, and it has demonstrated antioxidant effects [29]. The peptidomimetic candidates cmp_A2 and cmp_A8 originate from Corneum Gigeriae Galli Endothelium and Fagopyrum esculentum, respectively. They were predicted by SOM to be highly active COX-2 inhibitors with a probability of 0.821, with average predicted IC50 values of 2.69 µM and 5.2 µM, respectively. Candidate cmp_A4, classified as a sesquiterpene, is derived from Angelica sinensis Radix and was predicted by SOM to be a highly active COX-2 inhibitor with a probability of 100%, with an average predicted IC50 value of 3.69 µM. Candidate cmp_A9, found in Mori Folium, is a folinic acid with an average predicted IC50 value of 5.92 µM.

Table 7 (only the final row is recoverable here). Origin: Glycyrrhiza glabra L.; average predicted IC50 a: 6.97; proportion of highly active inhibitors b: 1; reported effect: Anti-oxidation [29].
a: the average predicted IC50 values by a series of optimal regression models. b: the percentage of highly active inhibitors mapped to the target molecules in the same position during the self-organized map (SOM) training. It can also be taken as the probability of being predicted by an unsupervised algorithm as a highly active COX-2 inhibitor.

Potential mPGES-1 Inhibitors in the MFH Database
Through the virtual screening process described above, 15 potential mPGES-1 inhibitors were screened from the MFH database. As seen in Figure 10 and Table 8, candidate cmp_B1, a cannabinolic acid derived from Cannabis Sativa L., was mapped by SOM into a grid cell occupied exclusively by highly active mPGES-1 inhibitors. The average predicted IC50 value from the regression models for cmp_B1 was 0.88 µM. Candidate cmp_B1 has been reported to decrease COX enzyme activity, although its selectivity toward COX1/2 is still unknown [30]. Candidate cmp_B2, isolated from Ramulus Mori, is a prenylated flavanone. It was predicted by the SOM to be a highly active mPGES-1 inhibitor with a probability of 0.93, and the regression models predicted a mean IC50 of 0.25 µM. Candidate cmp_B3, an active ingredient of Amomum longiligulare, is an isopentenyl flavonoid with a predicted IC50 of 0.34 µM, and it has been reported to inhibit the growth of breast cancer cells in vitro and in vivo [31]. Candidate cmp_B4, also known as Kanzonol C, is a flavonoid-like active ingredient of Glycyrrhiza glabra L. It has been demonstrated to be a PTP1B inhibitor [32] and has been found to exhibit inhibitory activity against nitric oxide (NO), making it a potential anti-inflammatory agent [33]. Candidate cmp_B5, derived from Gardeniae Fructus, is a caffeoylquinic acid with a predicted IC50 value of 0.18 µM. It is able to inhibit lipoxygenase [34]. Candidate cmp_B6 is an active ingredient of Schisandra chinensis with an average predicted IC50 value of 0.37 µM, and it has been reported to inhibit UDP-glucuronosyltransferase [35] and oxidized low-density lipoprotein (OxLDL) [36]. Candidate cmp_B7, also known as Garcinone B, derived from Rhizoma Dioscoreae (Chinese yam), has a predicted IC50 value of 0.55 µM. It has been found to reduce the production of prostaglandin E2, although the specific metabolic pathway of its action has yet to be demonstrated [37]. Candidate cmp_B8 (Kuwanon M) was isolated from Ramulus Mori. It was predicted by the SOM to be a highly active mPGES-1 inhibitor with a probability of 1, and the regression models predicted a mean IC50 of 0.18 µM. Candidate cmp_B9, derived from Mori Cortex, is an isopropenylated phenol derivative that has been reported to show an inhibitory effect on tyrosinase [38]. Candidate cmp_B10, derived from Glehniae Radix, is a coumarin analogue with a predicted IC50 value of 0.2 µM. Candidate cmp_B11, also known as Xanthochymol, is a component of Colla and is a polycyclic phloroglucinol. It has been found to regulate inflammation by downregulating the expression of several major histocompatibility complex (MHC) molecules [39]. Candidate cmp_B12, derived from Epimedii Herba, is a flavonoid with a predicted IC50 value of 0.5 µM. It has been reported to inhibit CYP3A4, thus exhibiting anti-inflammatory effects [40]. Candidate cmp_B13, isolated from Coicis Semen, is a steroid with an average predicted IC50 value of 0.27 µM. Lee et al. [41] verified the inhibitory effect of γ-oryzanol, composed of cmp_B13 and various ferulate molecules, on inflammation-related diseases. Candidate cmp_B14, also known as Kaikasaponin III, derived from Radix Puerariae, is a triterpenoid with a predicted IC50 value of 0.2 µM. It possesses antioxidant effects [42] and has been shown to have therapeutic effects on colitis [43]. Candidate cmp_B15, located in the same grid as cmp_B14, is also a triterpenoid, primarily found in Alisma Orientale. It has a predicted IC50 value of 0.29 µM and has been reported to inhibit COX-2 expression [44].

a: the average predicted IC50 values by a series of optimal regression models. b: the percentage of highly active inhibitors mapped to the target molecules in the same position during the self-organized map (SOM) training. It can also be taken as the probability of being predicted by an unsupervised algorithm as a highly active mPGES-1 inhibitor.
Figure 10. The mapping positions of mPGES-1 inhibitor candidates in the self-organizing map. HP represents the proportion of the number of highly active inhibitors in the grid to all molecules in the grid. The larger the calculated HP of one grid (close to 1), the warmer the color tends to be (red). On the contrary, the closer the HP value of a grid is to 0, the cooler the color tends to be (blue).

Molecular Docking on the Potential COX-2 and mPGES-1 Inhibitors
Through a series of ligand-based virtual screening processes, we have identified potential inhibitors of COX-2 and mPGES-1 from the established MFH database, and we further filtered these candidate molecules by the pan assay interference compounds (PAINS) rule [45]. To further validate the reliability and validity of the virtual screening, we conducted molecular docking (a widely utilized structure-based virtual screening approach) on the candidate MFH molecules. This procedure aimed to examine the binding modes between the candidate molecules and the proteins, while also assessing the ability of the candidate molecules to interact with the key amino acids of the target protein.

Molecular Docking Analysis on Potential COX-2 Inhibitors
The active site of COX-2 is demarcated from the initial substrate binding site by a constriction formed by three residues: Arg120, Tyr355, and Glu524. This constriction must dilate to allow substrates to enter or leave the active site [46]. Ser530 and Tyr385 are also vital during the catalytic process of COX-2. Several typical binding modes of non-steroidal anti-inflammatory drugs (NSAIDs) interacting with COX-2 have been reported. These modes include the hydrophobic region of NSAIDs interacting with Tyr385, Trp387, and neighboring residues; the polar region of NSAIDs interacting with residues located above Tyr355; and negatively charged NSAIDs interacting with Arg120 [47].
We calculated the structural similarity (measured by the Tanimoto coefficient) between the candidate MFH molecules and the ligand molecules in the published crystal structures, and for each candidate we selected for docking the protein crystal structure whose bound ligand had the highest structural similarity to that candidate; the details are shown in Table S10. As shown in Table 9, candidate compounds cmp_A3, cmp_A4, cmp_A6, and cmp_A7 bind within the active cavity of COX-2 (PDB: 4PH9), with binding affinities of −7.37, −8.53, −8.68, and −7.69 kcal/mol, respectively. The co-crystallized ligand in complex 4PH9, ibuprofen, exhibits a binding affinity of −9.41 kcal/mol under the same docking conditions. Although the four candidate molecules displayed weaker affinities than the original ligand, they all established interactions with key amino acid residues in the catalytic domain of COX-2: hydrophobic interactions with Trp388 and polar interactions involving Tyr356, Ser531, and Tyr386. Candidate cmp_A7 formed a hydrogen bond with Arg121, indicating a robust binding force. Candidate cmp_A1 bound within the active pocket of COX-2 (PDB: 5KIR), exhibiting a binding affinity of −7.57 kcal/mol, whereas the co-crystallized ligand within complex 5KIR achieved a binding affinity of −9.80 kcal/mol under identical docking conditions. Candidate cmp_A3 engaged in polar interactions with Tyr355 and formed a π-H stacking interaction with Ser533. Candidates cmp_A2 and cmp_A10 occupied the active site of the protein with PDB ID 6BL3, with binding affinities of −11.75 and −8.92 kcal/mol, respectively. The co-crystallized ligand within complex 6BL3 had a binding affinity of −12.15 kcal/mol under the same docking procedure. Candidate cmp_A2 formed hydrogen bonds with Ser530, Lys83, and Glu524, engaged in hydrophobic interactions with Trp387, and established polar interactions with Tyr355 and Tyr385. Candidate cmp_A10 formed hydrogen bonds with Glu524, had hydrophobic interactions with Trp387, and engaged in polar interactions with Ser530 and Tyr355. Candidates cmp_A5, cmp_A8, and cmp_A9 bound to the active cavity of COX-2 (PDB: 6BL4) with affinities of −7.25, −11.24, and −9.61 kcal/mol, respectively. The affinity of the original ligand of complex 6BL4 after re-docking was −12.27 kcal/mol. Candidate cmp_A5 engaged in hydrophobic interactions with Trp100 and formed polar interactions with Tyr115, Lys79, and Lys83. Candidate cmp_A8 established hydrogen bonds with Tyr355, Arg120, Glu524, Ser530, and Lys83, which contributed significantly to its enhanced protein affinity. Candidate cmp_A9 had hydrogen bond interactions with Ser119, Met522, Glu524, and Ser530.

Table 9. The interactions between potential COX-2 inhibitors and COX-2 protein obtained by docking computations.
In summary, the candidate MFH molecules could effectively bind to the active pocket of COX-2 and interact with key amino acids involved in COX-2 catalysis. These interactions were consistent with the classical binding interactions reported between NSAIDs and COX-2. Most candidate MFH molecules exhibited strong hydrogen bonding interactions with the key amino acid residues of COX-2. These observations support the reliability of the potential COX-2 inhibitor molecules discovered through the ligand-based virtual screening on the MFH database.

Molecular Docking Analysis on Potential mPGES-1 Inhibitors
mPGES-1 is a homotrimer, with each subunit consisting of four transmembrane helices. The mPGES-1 trimer contains three active site cavities, which are formed collectively by transmembrane helices 1, 2, and 4 along with neighboring monomers [48]. Among the currently resolved co-crystal complexes, key amino acids include AlaA123, ProA124, SerA127, ValA128, TyrA130, ThrA131, GlnA134, TyrB28, IleB32, AsnB36, ArgB38, LeuB39, PheB44, ArgB52, and HisB53, where A and B represent different monomers in the mPGES-1 trimer.
The selection of mPGES-1 crystals for docking was carried out using a methodology similar to that of the COX-2 crystal selection, with the detailed results presented in Table S10. The binding affinities and interactions between candidate MFH molecules and mPGES-1 are summarized in Table 10. Candidates cmp_B4 and cmp_B5 bound within the active pocket of the protein 4AL1, exhibiting binding affinities of −8.08 and −11.25 kcal/mol, respectively. The binding affinity of the original ligand of protein 4AL1 is −12.12 kcal/mol. Specifically, candidate cmp_B4 formed hydrogen bond interactions with AspB49, AsnB46, and ThrA131. Candidate cmp_B5 established hydrogen bond interactions with AsnB46, ArgB52, GluA77, and TyrA117, with ionic bonds between ArgB38 and ArgA126, and π-H stacking with IleB32. Candidates cmp_B1, cmp_B6, cmp_B7, cmp_B11, cmp_B12, cmp_B13, cmp_B14, and cmp_B15 were docked within the active site of the protein 4YL0, exhibiting binding affinities of −7.93, −7.25, −7.26, −6.89, −8.73, −7.19, −10.71, and −7.34 kcal/mol, respectively. The binding affinity of protein 4YL0 after docking with its ligand is −7.17 kcal/mol; except for cmp_B11, the docking affinities of the candidate molecules surpass that of the original ligand of protein 4YL0 under the same docking conditions. Notably, candidate cmp_B14 demonstrated significantly improved docking affinity compared to the other candidates. Candidate cmp_B1 had hydrogen bond interactions with ArgB38 and AsnB46. Hydrogen bond interactions were established between candidate cmp_B6 and AspB49. Candidate cmp_B7 generated hydrogen bond interactions with GluA77 and SerA127, while also forming a π-π stacking interaction with TyrA130. It is noteworthy that candidate cmp_B7 has been reported to inhibit the generation of PGE2 [37]. Coupled with its hydrogen bond interactions with key amino acids at the active site of mPGES-1, these findings underscore the potential of cmp_B7 as an inhibitor of mPGES-1. Candidates cmp_B2 and cmp_B3 bound to the active pocket of protein 4YL1 with binding affinities of −9.19 and −8.82 kcal/mol, both of which were superior to the docking affinity of the original ligand of protein 4YL1 (−8.71 kcal/mol). Candidate cmp_B2 formed hydrogen bond interactions with AspB49 and ThrA131. Candidate cmp_B3 generated a hydrogen bond interaction with AspB49. Candidate cmp_B10 docked within the active domain of protein 5K0I, displaying a binding affinity of −10.31 kcal/mol, which outperformed the binding affinity of the original ligand of protein 5K0I under identical docking conditions (−8.57 kcal/mol). Candidate cmp_B10 engaged in hydrogen bond interactions with AspB49, SerA127, GluA77, and TyrA117. Candidates cmp_B8 and cmp_B9 bound within the active cavity of mPGES-1 (PDB: 5TL9), with binding affinities of −9.55 and −9.2 kcal/mol, respectively. The binding affinity of the original ligand within complex 5TL9 was −8.63 kcal/mol, indicating that cmp_B8 and cmp_B9 exhibited superior binding performances under the same docking parameters. Candidate cmp_B8 formed a hydrogen bond interaction with SerA127. Candidate cmp_B9 had hydrogen bond interactions with SerA127, GluA77, and TyrA117.

Table 10. The interactions between potential mPGES-1 inhibitors and mPGES-1 protein obtained by molecular docking.

The majority of candidate molecules exhibited the ability to form hydrogen bond interactions with key amino acids of mPGES-1. Notably, the docking affinities of several candidate molecules even surpass those of the original ligands within the co-crystal complexes. These findings collectively contribute to bolstering the plausibility of the potential mPGES-1 inhibitors identified through our virtual screening process.

Construction of the Catalogue for MFH Substances
The culture of traditional Chinese medicine has a long history. The "Compendium of Materia Medica", compiled in the 16th century, records more than 300 kinds of medicine and food homology (MFH) substances. In 2002, the Chinese National Health Commission issued a list of "Chinese medicine that can be used in health food" (including 114 kinds of substances), which serve both as traditional Chinese medicines and as food [49]. In 2018, the Chinese National Health Commission published a list of 110 MFH substances [50]. Additionally, the 2020 edition of the "Pharmacopoeia" records 86 substances as "healthy food" and 193 substances as "therapeutic food", both of which are in line with the concept of medicine and food homology [51]. By examining and integrating these four lists, a total of 503 MFH substances were collected in the catalogue of this study. The active ingredients were then collected based on this catalogue, with the aim of constructing a comprehensive database of MFH substances.

Collection and Preparation of Active Ingredients from MFH Substances
Based on the constructed catalogue of MFH substances, we collected the active ingredients of the 503 MFH substances from two public Chinese medicine databases: the Traditional Chinese Medicines Integrated Database (TCMID) [19] and the Traditional Chinese Medicine Systems Pharmacology Database and Analysis Platform (TCMSP) [20]. However, some MFH substances were not included in these databases, or the active ingredients collected for them were inadequate. Therefore, we also conducted a literature search (92 pieces of Chinese literature from the China National Knowledge Infrastructure (CNKI) [52]) and manually collected the active ingredients of MFH substances to ensure the integrity and diversity of our MFH substances database. For the collected active ingredient molecules, we further disconnected metals in simple salts, removed minor components, removed duplicate molecules, and retrieved the isomeric SMILES of the molecules to improve the data quality of our MFH substances database. As a result, we obtained 15,362 active ingredients of MFH substances from TCMID, 11,154 active ingredients from TCMSP, and an additional 803 active ingredients from the 92 pieces of literature.

Chemical Space Analysis on the MFH Substances Database
Chemical space analysis is a widely utilized approach for exploring, comprehending, and optimizing a multitude of potential molecules [53]. Herein, we calculated the molecular weight (MW) and octanol-water partition coefficient (LogP) of the collected active ingredient molecules. These two physicochemical properties enable us to measure and demonstrate the breadth of the chemical space distribution of the MFH database. MW and LogP were calculated with the Python-based RDKit package [54], and the chemical space distribution was visualized with Matplotlib [55]. To measure the structural similarity within our MFH database, we computed Tanimoto coefficients (TCs) [21] based on 1024-bit ECFP4 fingerprints using RDKit [54]. To thoroughly analyze the diversity of molecular structures within the MFH database, we computed the Murcko scaffolds of the molecules and conducted a clustering analysis using the K-means algorithm. Prior to clustering, we characterized the calculated Murcko scaffolds by ECFP4 fingerprints and employed t-distributed Stochastic Neighbor Embedding (t-SNE) [56] to reduce the high-dimensional data to two dimensions. ECFP4 fingerprints were computed via RDKit [54], and K-means clustering and t-SNE were implemented with scikit-learn [57].
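The pairwise similarity measurement described above can be illustrated with a minimal, dependency-free sketch. It assumes each fingerprint is represented as a Python set of on-bit indices (the study itself computed ECFP4 fingerprints with RDKit); the function names and the toy fingerprints are hypothetical, and the 0.6 cut-off is the one quoted in the Conclusions.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints (an assumption)
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def fraction_below(fingerprints, threshold=0.6):
    """Fraction of all molecule pairs whose Tanimoto coefficient is below the threshold."""
    pairs = list(combinations(fingerprints, 2))
    below = sum(1 for a, b in pairs if tanimoto(a, b) < threshold)
    return below / len(pairs)


# Hypothetical toy fingerprints (on-bit indices), not taken from the MFH database.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {1, 9}]
share_below = fraction_below(toy_fps)
```

A large share of low-similarity pairs, as reported for the MFH database, indicates a structurally diverse collection.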

Datasets for Modeling on COX-2 Inhibitors
Datasets 1 and 2 (Table 11), used to develop classification models that distinguish highly active from weakly active COX-2 inhibitors, were identical to the datasets in our previous work [58]. Different thresholds were employed to label highly/weakly active inhibitors, which enabled the constructed classification models to cover different chemical spaces. This strategy should further improve the hit rate of virtual screening (VS) on MFH substances. A total of 1511 molecules (Dataset 3) were collected from ChEMBL [59], Reaxys [60], and SciFinder [61]; their IC50 values were tested by enzyme-linked immunoassay. The pIC50 (−log10 IC50) values ranged from 5.06 to 9.52. Dataset 3 was utilized to develop regression models with the aim of accurately predicting the bioactivities of MFH substances. External validation sets A1 and A2 were collected from newly published literature and used to evaluate the generalizability of the constructed classification and regression models, respectively. The IC50 values of the molecules in external validation set A2 were all tested by enzyme-linked immunoassay.

Datasets for Modeling on mPGES-1 Inhibitors
Numerous mPGES-1 inhibitors with diverse structures were collected from ChEMBL, Reaxys, and SciFinder. Dataset 4 was composed of 3179 mPGES-1 inhibitors with IC50 values ranging from 0.0001 to 20,000 µM. Molecules with IC50 > 10 µM were labeled as weakly active inhibitors, and molecules with IC50 < 0.6 µM as highly active inhibitors. Dataset 5 comprised 3455 inhibitors with IC50 values ranging from 0.0001 to 20,000 µM. Molecules with IC50 ≥ 10 µM and IC50 < 10 µM were labeled as weakly and highly active inhibitors, respectively. Datasets 4 and 5 were employed to construct classification models on mPGES-1 inhibitors (shown in Table 9). Dataset 6, containing 735 inhibitors, was derived from our previous work [62]; the pIC50 values of inhibitors in Dataset 6 ranged from 5.54 to 9 (all tested by homogeneous time-resolved fluorescence assay). External validation sets B1 and B2 were used to evaluate the generalizability of the constructed classification and regression models, respectively.
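The activity thresholds and the pIC50 scale used for these datasets can be written out as a short sketch. It rests on two assumptions that the text does not state explicitly: highly active molecules are coded as "1" and weakly active molecules as "0", and molecules falling between the two Dataset 4 thresholds are left unlabeled. All function names are hypothetical.

```python
import math


def pic50_from_micromolar(ic50_um):
    """pIC50 = -log10(IC50 in mol/L); since 1 uM = 1e-6 mol/L, pIC50 = 6 - log10(IC50 in uM)."""
    return 6.0 - math.log10(ic50_um)


def label_dataset4(ic50_um):
    """Dataset 4 rule: highly active below 0.6 uM, weakly active above 10 uM."""
    if ic50_um < 0.6:
        return 1  # highly active (coding as 1 is an assumption)
    if ic50_um > 10:
        return 0  # weakly active
    return None   # intermediate range: assumed to be left out of Dataset 4


def label_dataset5(ic50_um):
    """Dataset 5 rule: a single 10 uM threshold separates the two classes."""
    return 1 if ic50_um < 10 else 0
```

For example, a hypothetical compound with an IC50 of 5 µM would be unlabeled under the Dataset 4 rule but highly active under the Dataset 5 rule, which is how the two datasets come to cover different chemical spaces.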

Splitting Strategy for Generating the Training/Test Set
The datasets for modeling on COX-2 inhibitors were divided into training/test sets at a ratio of 4:1. The datasets for modeling on mPGES-1 inhibitors were randomly divided into training/test sets at a ratio of 3:1. The datasets of both COX-2 and mPGES-1 inhibitors were randomly split 10 times to generate the training/test sets, in order to reduce the effect of random error. The random splitting was conducted by using the function StratifiedSplit of the Python toolkit scikit-learn 0.22.1 [57].
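A repeated stratified split of this kind can be sketched without any third-party dependency. The code below is an illustration only, not the scikit-learn routine used in the study: the function name, the toy labels, and the use of seeds 0 to 9 are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx) with class proportions approximately preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)                          # random order within each class
        n_test = round(len(members) * test_fraction)  # test share taken from each class
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 12 highly active (1) and 8 weakly active (0) molecules.
labels = [1] * 12 + [0] * 8
# Ten random 4:1 splits, one per seed; test_fraction=0.25 would give the 3:1 ratio.
splits = [stratified_split(labels, test_fraction=0.2, seed=s) for s in range(10)]
```

Model performance can then be averaged over the ten splits so that no single partition dominates the comparison.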

Binary Fingerprints for Classification Models
Three types of well-known fingerprints were employed to characterize the structural features of molecules within our MFH database from multiple perspectives. 166-bit MACCS fingerprints (dictionary-based fingerprints), 1024-bit Avalon fingerprints (topological fingerprints) [63], and 1024-bit ECFP4 fingerprints (circular fingerprints) [64] were computed with the RDKit package [54]. To avoid including redundant information, the calculated fingerprint bits were then filtered by variance, and bits with variance in the bottom quartile were excluded from the construction of classification models.
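One possible reading of the variance filter is sketched below: bit positions are ranked by their variance across molecules and the lowest quarter is dropped. How ties at the quartile boundary were handled in the study is not stated, so the ranking rule here is an assumption, as are the function name and the toy matrix.

```python
def drop_low_variance_bits(fingerprints):
    """Drop the quarter of bit positions with the lowest variance across molecules."""
    n_mols = len(fingerprints)
    n_bits = len(fingerprints[0])
    variances = []
    for j in range(n_bits):
        p = sum(fp[j] for fp in fingerprints) / n_mols  # fraction of molecules with bit j set
        variances.append(p * (1.0 - p))                 # variance of a 0/1 variable
    # Stable sort: among equal variances, lower bit indices are dropped first (assumption).
    ranked = sorted(range(n_bits), key=lambda j: variances[j])
    dropped = set(ranked[: n_bits // 4])
    kept = [j for j in range(n_bits) if j not in dropped]
    return kept, [[fp[j] for j in kept] for fp in fingerprints]


# Hypothetical 8-bit fingerprints for four molecules; bits 0 and 1 are constant.
toy = [
    [1, 0, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 1],
]
kept, filtered = drop_low_variance_bits(toy)
```

Constant or near-constant bits carry little information for separating highly active from weakly active molecules, which is why they are the first to be removed.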

Physicochemical Molecular Descriptors for Regression Models
Two types of physicochemical molecular descriptors were utilized to represent molecules for developing QSAR models. A total of 22 global molecular descriptors and 96 3D property-weighted autocorrelation descriptors were calculated with the CORINA Symphony software V1.0 [65]. A total of 115 2D physicochemical molecular descriptors and 85 FragmentCount descriptors from RDKit were computed via MayaChemTools [66]. The calculated descriptors were first screened by Pearson correlation coefficient (PCC) and then ranked in descending order by recursive feature elimination with a random forest estimator (RF-RFE) before modeling; more details can be found in our previous work [62]. Additionally, all retained descriptors were scaled to the same range, from 0.1 to 0.9.
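The scaling step can be illustrated with a small min-max sketch that maps one descriptor column onto the interval from 0.1 to 0.9. The function name and the handling of constant columns are assumptions; the text does not say how such columns were treated.

```python
def scale_column(values, low=0.1, high=0.9):
    """Linearly rescale one descriptor column so its minimum maps to low and its maximum to high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant descriptor: mapped to the midpoint (a choice made for this sketch).
        return [(low + high) / 2.0] * len(values)
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


# Hypothetical descriptor values for three molecules.
scaled = scale_column([0.0, 5.0, 10.0])
```

Putting all descriptors on a common range keeps descriptors with large numerical values from dominating distance-based learners such as the RBF-kernel SVM.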

Supervised Machine Learning Algorithms for Modeling
Machine learning (ML) algorithms are capable of discerning relationships within large datasets and devising optimal approaches for their analysis without prior specification. Four supervised ML algorithms, including support vector machine (SVM) [67], random forest (RF) [68], deep neural networks (DNNs) [69], and eXtreme Gradient Boosting (XGBoost) [70], were utilized to provide predictions on COX-2 and mPGES-1 inhibitors.

Modeling with SVM, RF, and XGBoost
The SVM algorithm with a radial basis function (RBF) kernel was used to develop both classification and regression models. The penalty parameter (C) and γ were the hyperparameters tuned during the optimization of the classification models; for regression models, the insensitivity parameter (ε) was tuned in addition to these two. For RF models, the number of trees (n_estimators) and the maximum number of leaf nodes (max_leaf_nodes) were determined by grid search. In modeling with XGBoost, the number of trees (n_estimators), the maximum depth of a tree (max_depth), the subsample ratio of the training instances (subsample), and the subsample ratio of columns when constructing each tree (colsample_bytree) were optimized by grid search. All other parameters were left at their default values.
In addition to the common hyperparameters of the SVM, RF, and XGBoost algorithms, the number of descriptors also served as a hyperparameter in the grid-based optimization of regression models. During the grid search, 5-fold cross-validation was repeated 10 times, and the optimal hyperparameters were determined based on the smallest mean squared error (MSE) of the 5-fold cross-validation [71]. Detailed ranges of these hyperparameters are listed in Table S7 in the Supplementary Materials.
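The selection rule, in which each grid point is scored by its mean MSE over 10 repetitions of 5-fold cross-validation and the smallest value wins, can be sketched in plain Python. This shows the procedure only: `fit_predict` is a hypothetical callable standing in for SVM, RF, or XGBoost, and the "shrunken mean" toy model and data are invented for the example.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for `repeats` independently shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for i in range(k):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, folds[i]


def grid_search(grid, fit_predict, x, y, k=5, repeats=10, seed=0):
    """Return (best_params, best_mse): the grid point with the smallest mean cross-validated MSE."""
    best_params, best_mse = None, float("inf")
    for params in grid:
        fold_mses = []
        for train, valid in repeated_kfold(len(y), k, repeats, seed):
            preds = fit_predict([x[i] for i in train], [y[i] for i in train],
                                [x[i] for i in valid], params)
            fold_mses.append(sum((p - y[i]) ** 2 for p, i in zip(preds, valid)) / len(valid))
        mean_mse = sum(fold_mses) / len(fold_mses)
        if mean_mse < best_mse:
            best_params, best_mse = params, mean_mse
    return best_params, best_mse


def shrunken_mean(x_train, y_train, x_valid, params):
    """Hypothetical toy model: predicts a shrunken training-set mean for every validation sample."""
    mean_y = sum(y_train) / len(y_train)
    return [params["shrink"] * mean_y] * len(x_valid)


# Hypothetical pIC50-like values; x is unused by the toy model.
y = [6.0, 7.0, 8.0, 6.5, 7.5, 7.0, 6.8, 7.2, 7.1, 6.9]
x = [[0.0]] * len(y)
grid = [{"shrink": 0.0}, {"shrink": 0.5}, {"shrink": 1.0}]
best_params, best_mse = grid_search(grid, shrunken_mean, x, y)
```

Reusing the same seed for every grid point means all candidates are compared on identical folds, so differences in MSE reflect the hyperparameters and not the partitioning.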

Modeling with DNN
Fully connected feed-forward neural networks with four hidden layers were constructed to develop classification and regression models. Neurons within the hidden layers were activated using the ReLU function [72], and the networks were trained using the Adam optimizer with a learning rate of 0.0001. The number of training epochs was determined through repeated 5-fold cross-validation combined with early stopping [73]. For classification models, early stopping monitored the predicted accuracy on the validation set, which was generated through 5-fold cross-validation and was part of the training set. For regression models, early stopping monitored the MSE on the validation set. Training was halted when the accuracy or MSE of the validation set ceased to change within 50 epochs. To reduce the chance effects of early stopping based on a single validation set, each cross-validation was repeated 50 times.
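The stopping criterion can be expressed as a small patience rule applied to the per-epoch validation scores. The sketch below interprets "ceased to change" as "did not improve", which is an assumption, and the function name is hypothetical; it works on a recorded score history and is independent of any deep learning framework.

```python
def early_stopping_epoch(history, patience=50, mode="min"):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of per-epoch validation scores.

    mode="min" suits MSE (regression); mode="max" suits accuracy (classification).
    """
    if mode == "min":
        def improved(new, old):
            return new < old
    else:
        def improved(new, old):
            return new > old
    best_value, best_epoch = history[0], 0
    for epoch in range(1, len(history)):
        if improved(history[epoch], best_value):
            best_value, best_epoch = history[epoch], epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs: stop here
    return len(history) - 1, best_epoch  # history ended before the patience ran out


# Hypothetical validation MSE curve, with a short patience for illustration.
stop, best = early_stopping_epoch([1.0, 0.8, 0.6, 0.61, 0.62, 0.63, 0.64], patience=3)
```

With the 50-epoch patience described above, the best epoch found in each repetition can be averaged to fix the final number of training epochs.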

Ensemble Learning Based on Developed Classification Models
Ensemble learning integrates the predictions of multiple machine learning models to improve the robustness and accuracy of predictions made by individual base models [74]. In this study, stacked generalization [75] was employed to combine the predicted probabilities of classification models built using the same algorithm but on datasets characterized by different fingerprints. These probabilities were then input into a logistic regression algorithm to generate the predictions of the ensemble model for that algorithm.
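The stacking step can be illustrated with a minimal sketch in which the predicted probabilities of three base classifiers (one per fingerprint type) form the input of a logistic regression meta-learner. The meta-learner is fitted here by plain gradient descent to keep the example dependency-free; the toy probabilities, learning rate, and iteration count are assumptions, not settings from the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, lr=0.5, n_iter=2000):
    """Fit logistic regression by full-batch gradient descent; returns (weights, bias)."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        b -= lr * grad_b
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return w, b


def predict_proba(x, w, b):
    """Ensemble probability of class 1 for one molecule's base-model probabilities."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical base-model probabilities per molecule: [MACCS model, Avalon model, ECFP4 model].
meta_features = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.8, 0.9, 0.6], [0.1, 0.3, 0.2]]
labels = [1, 0, 1, 0]
w, b = fit_logistic(meta_features, labels)
ensemble_labels = [round(predict_proba(x, w, b)) for x in meta_features]
```

In practice the meta-learner is usually fitted on base-model predictions for molecules those models did not see during training, so that it does not simply learn the base models' overfitting.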

Unsupervised Machine Learning on MFH Substances
Unsupervised learning algorithms are trained on unlabeled data to discover hidden patterns or relationships among the data, which can serve as a foundation for exploratory analysis [76]. The self-organized map (SOM) is a type of unsupervised artificial neural network. During the training of a SOM, input vectors that are similar to each other are mapped onto the same neuron or onto neurons that are close together in the two-dimensional grid [77]. In this study, SOM was applied to perform clustering analysis on molecules of the MFH database and to further predict their inhibitory effects on COX-2 and mPGES-1.

Evaluation of Model Performances
The predicted accuracy (Q) and the Matthews correlation coefficient (MCC) were utilized as indicators of the performances of classification models. The coefficient of determination (R²) and root mean squared error (RMSE) were applied to evaluate the performances of regression models. These criteria were calculated by the following equations:

$$Q = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where true positives (TP) and true negatives (TN) represent the numbers of "1" and "0" that were correctly predicted, respectively, and false positives (FP) and false negatives (FN) represent the numbers of compounds that were wrongly predicted as "1" and "0", respectively.

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where n represents the total number of compounds; y represents the observed value of a compound; ŷ represents the predicted value of a compound; and ȳ represents the average of y.
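For reference, the four criteria can be computed with a few lines of Python using their standard definitions. The function names and the example counts are hypothetical; returning 0 for MCC when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Predicted accuracy Q: share of correctly classified compounds."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0 when undefined (assumption)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical confusion-matrix counts and pIC50 values, for illustration only.
q_value = accuracy(tp=40, tn=45, fp=5, fn=10)
mcc_value = mcc(tp=40, tn=45, fp=5, fn=10)
rmse_value = rmse([6.0, 7.0, 8.0], [6.5, 7.0, 7.5])
r2_value = r2([6.0, 7.0, 8.0], [6.5, 7.0, 7.5])
```

Unlike Q, MCC uses all four cells of the confusion matrix, which makes it the more informative criterion when the highly active and weakly active classes are imbalanced.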

Pan Assay Interference Compounds (PAINS) Screening
Pan assay interference compounds (PAINS) screening has become a pivotal element of drug design. The PAINS rule was introduced to identify false-positive compounds (frequent hitters) in biological screening campaigns. We obtained a list of known aggregators (12,645 molecules, shown in Table S9) from Aggregator Advisor [78], and represented the known aggregator molecules and our screened MFH candidates with MACCS fingerprints. We then employed RDKit to match the structure of each screened MFH candidate molecule against the 12,645 known aggregator molecules. As a result, none of the MFH candidates (10 potential COX-2 inhibitors and 15 potential mPGES-1 inhibitors) appeared in the list of known aggregators.

Molecular Docking
In this study, molecular docking was conducted using the latest release of the widely used open-source program AutoDock Vina 1.2.0 [79]. Given the availability of multiple COX-2 and mPGES-1 crystal structures, selecting an appropriate, well-resolved receptor structure is a prerequisite for reliable docking computations. Therefore, we chose COX-2 and mPGES-1 co-complex crystal structures from the PDB database whose bound ligands were similar to the screened medicine and food homologous (MFH) candidates. Ligands within the complex crystal structures and the screened MFH candidates were characterized using MACCS fingerprints. Subsequently, RDKit functions were employed to calculate the Tanimoto similarity between the molecular structures of these entities. For each screened MFH candidate molecule, docking was performed with the crystal structure that exhibited the highest structural similarity (details are listed in Table S10).
Before formal docking, the ligands within the original complex crystal structures underwent re-docking. The better the alignment between the re-docked ligand and the experimentally determined ligand, the more appropriate the parameter settings and system preparation of the docking calculation. Results of the re-docking of ligands within the complex crystal structures are presented in Figure S1 of the Supplementary Materials. The protein preparation process involved the removal of water and other solvents, repair of missing residue sections, addition of hydrogen atoms to heavy atoms, and subsequent pre-docking energy minimization of the entire protein. The ligand preparation process included the addition of hydrogen atoms, computation of Gasteiger charges for all atoms, definition of rotatable bonds, and energy optimization. The grid box was centered on the spatial center of the ligand within the crystal structure. The Vina force field was employed during docking, with the exhaustiveness parameter set to 32.
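Two of the preparatory steps, choosing the co-crystal structure whose ligand is most similar to a candidate and centering the grid box on that ligand, can be sketched as follows. Fingerprints are represented as sets of on-bit indices, the structure identifiers are placeholders and not real PDB codes, and taking the unweighted geometric center of the ligand atoms is an assumption about what "spatial center" means.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    shared = len(bits_a & bits_b)
    union = len(bits_a) + len(bits_b) - shared
    return shared / union if union else 0.0


def most_similar_structure(candidate_bits, ligand_bits_by_structure):
    """Return (structure_id, similarity) for the co-crystal ligand most similar to the candidate."""
    scored = ((sid, tanimoto(candidate_bits, bits))
              for sid, bits in ligand_bits_by_structure.items())
    return max(scored, key=lambda item: item[1])


def box_center(ligand_coords):
    """Unweighted geometric center of the ligand atom coordinates (x, y, z)."""
    n = len(ligand_coords)
    return tuple(sum(atom[axis] for atom in ligand_coords) / n for axis in range(3))


# Hypothetical fingerprints and coordinates; "STRUCT_A" and "STRUCT_B" are placeholder identifiers.
ligands = {"STRUCT_A": {1, 2, 3, 9}, "STRUCT_B": {1, 7}}
chosen, similarity = most_similar_structure({1, 2, 3, 4}, ligands)
center = box_center([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)])
```

The resulting center, together with a box size large enough to enclose the ligand, would then be passed to the docking program as the search space.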

Conclusions
In this study, we constructed a comprehensive database of 27,319 active ingredient molecules from 503 medicine and food homologous (MFH) substances. Analysis of the distribution of molecular weight (MW) and octanol-water partition coefficient (LogP) showed a wide range of values, indicating that our MFH database covers a wide chemical space. Structural diversity was assessed using Tanimoto coefficients (TCs), which showed significant structural differences between molecules, with 95.92% of molecule pairs having TC values below 0.6. In addition, we performed Murcko scaffold analysis and K-means clustering, resulting in the identification of 11 different clusters in the MFH database. Among them, flavonoid clusters were the most abundant, followed by fatty acids, saponins, and sterols. The database was further enriched by the presence of lignans, alkaloids, triterpenoids, sesquiterpenoids, diterpenoids, and stilbenoids. Furthermore, we summarized the top 20 Murcko scaffolds, revealing diverse structures ranging from simple aromatic compounds with a single ring to complex systems with 7-8 rings. These findings collectively demonstrate the comprehensiveness and high structural diversity of our MFH database. Our MFH database will serve as a foundation for future studies, as it could facilitate the assessment of the effects of MFH substances on health and the identification of potential mechanisms, thereby accelerating the development of MFH-inspired products with nutritional and therapeutic value.
Based on datasets with different distributions of bioactivities, we employed four supervised learning algorithms (RF, SVM, DNN, and XGBoost), incorporating various fingerprints and physicochemical descriptors for modeling. As a result, 240 classification models and 80 QSAR models were constructed for each of the two targets, COX-2 and mPGES-1. Additionally, we utilized ensemble learning to develop classification models: based on the constructed single classifiers, another 80 integrated classification models were constructed for each target. For COX-2 inhibitors, ModelA_ensemble_RF_1, built on Dataset 1, demonstrated the best classification performance with MCC values of 0.802 and 0.603 on the test set and external validation set, respectively. ModelB_MACCS_SVM_6, constructed using Dataset 2, achieved the highest performance on that dataset, with MCC values of 0.657 and 0.572 on the test set and external validation set, respectively. ModelC_RDKIT_SVM_2, the optimal regression model based on Dataset 3, yielded RMSE values of 0.419 and 0.513 on the test set and external validation set, respectively. For mPGES-1 inhibitors, ModelD_ECFP_SVM_4 emerged as the top-performing classification model on Dataset 4, exhibiting MCC values of 0.832 and 0.584 on the test set and external validation set, respectively. ModelE_ECFP_SVM_1, the best classification model for Dataset 5, achieved MCC values of 0.799 and 0.579 on the test set and external validation set, respectively. ModelF_3D_SVM_1, based on Dataset 6, served as the optimal regression model with RMSE values of 0.253 and 0.35 on the test set and external validation set, respectively. These well-performing machine learning models can serve as powerful tools for virtual screening of the constructed MFH database, aiming to identify potential COX-2 and mPGES-1 inhibitors from MFH substances. Moreover, these models can be employed to predict the inhibitory capabilities of unknown compounds against COX-2 and mPGES-1, thus facilitating the discovery of novel anti-inflammatory drugs.
Finally, by means of a cascade ligand-based virtual screening strategy and a PAINS screening rule, we identified 10 potential COX-2 inhibitors and 15 potential mPGES-1 inhibitors from the MFH database. We verified the candidates by molecular docking and investigated the interactions of the candidate molecules upon binding to COX-2 or mPGES-1. It is worth mentioning that some of these molecules have been previously reported to exhibit COX-2 inhibitory or anti-inflammatory activities. This supports the effectiveness of the cascade ligand-based virtual screening strategy employed in this study and provides design and modification ideas for the development of new, effective anti-inflammatory drugs targeting COX-2 and mPGES-1.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/molecules28196782/s1. Detailed performances of all classification models built on Dataset 1 can be seen in Table S1. Performances of all classification models of Dataset 2 are listed in Table S2. Detailed results of all 80 QSAR models built on Dataset 3 are listed in Table S3. Detailed performances of all classification models built on Dataset 4 can be seen in Table S4. Performances of all classification models of Dataset 5 are listed in Table S5. Detailed results of all regression models built on Dataset 6 are listed in Table S6. The specific model results and parameters applied for virtual screening of the MFH database are summarized in Table S7. All the datasets utilized in this study are summarized in Table S8. A list of known aggregators (12,645 molecules) is shown in Table S9. The PDB complex crystals utilized in this work are listed in Table S10. The alignment results of the re-docking of ligands within the complex crystal structures are presented in Figure S1.
Author Contributions: Y.T. Tian conceived the original idea of this work, built all the machine learning models, carried out the virtual screening, analyzed the results, and took the lead in writing the manuscript. Z.Z. contributed to the construction of the medicine and food homology database and verified the developed models. A.Y. supervised the findings of this work, provided critical feedback, and helped shape the research, analysis, and manuscript. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: All the code used in this work can be found at: tyj-19951029/medicineand-food-homologous-virtual-screnning (github.com, accessed on 3 June 2023). The MFH database established in this study can be obtained by sending an email to the author; commercial use should be avoided.