Deep Learning Approach for Predicting the Therapeutic Usages of Unani Formulas towards Finding Essential Compounds

The use of herbal medicines has increased in recent decades because their side effects are considered lower than those of conventional medicine. Unani herbal medicines are often used in Southern Asia. These herbal medicines are usually composed of several types of medicinal plants to treat various diseases. Research on herbal medicine usually focuses on the composition of plants used as ingredients. In the present study, however, we extended the analysis to the level of the metabolites that exist in the medicinal plants. This study aimed to develop a predictive model of Unani therapeutic usage based on constituent metabolites using deep learning and data-intensive science approaches. The best prediction model was then utilized to extract important metabolites for each therapeutic usage of Unani. We observed that the deep neural network approach provided a much better prediction model than other algorithms, including random forest and support vector machine. Moreover, according to the best prediction model using the deep neural network, we identified 118 important metabolites for nine therapeutic usages of Unani.


Introduction
Herbal medicines are plant-based medicines made from different combinations of medicinal plant parts, e.g., leaves, flowers, or roots. Each part can have different medicinal uses, and the many types of chemical constituents require different extraction methods. Both fresh and dried plants are used, depending on the herb (https://www.nimh.org.uk/whats-herbal-medicine, accessed on 4 June 2021). Herbal medicines have become popular over the last three decades, and no less than 80% of people worldwide depend on them. The main reasons people choose herbal medicines are that they provide better efficacy and relatively fewer side effects compared to conventional drugs [1]. The use of herbal medicines throughout the world reached USD 60 billion in 2010 and is expected to reach USD 5 trillion by 2050 [2,3]. This shows that the use of herbal medicines is prevalent throughout the world. Some examples of herbal medicine systems are Traditional Chinese Medicine (TCM) from China; Kampo from Japan; Jamu from Indonesia; and Ayurvedic, Siddha, or Unani from Southern Asia.
Unani Tibb, commonly known as Unani medicine, is practiced widely in South and Central Asia. The Arabic term "Tibb" means "medicine," while the name "Unani" is assumed to have its roots in the Greek word "Ionan" [4]. Later on, it was also influenced by Indian and Chinese traditional systems. Unani herbal medicines mostly utilize medicinal plants as their ingredients, and this system follows ancient concepts and principles.
The initial Unani formulas consisted of plants as ingredients. Unani compounds were collected according to the corresponding plants by using the following databases: KNApSAcK Family Databases (http://www.knapsackfamily.com/KNApSAcK_Family, accessed on 25 June 2021), IJAH Analytics (http://ijah.apps.cs.ipb.ac.id, accessed on 3 July 2021), KEGG (https://www.genome.jp/kegg/, accessed on 10 July 2021), and ChEBI (https://www.ebi.ac.uk/chebi/, accessed on 11 September 2021). The distribution of metabolites collected for each medicinal plant is shown in Figure 3. The number of compounds belonging to a medicinal plant varies considerably: some plants are associated with only a few metabolites, whereas others are associated with many.
The KNApSAcK database (DB) contains 101,500 species-metabolite relationships, encompassing 20,741 species and 50,048 metabolites. This database also contains information on accurate mass, molecular formula, metabolite name, and mass spectra in several ionization modes [10]. IJAH Analytics is an open-access database specifically for Jamu data. This database provides plant-metabolite relations, and we assume some metabolites might be common between Jamu and Unani because both are classified as traditional medicine. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is also an open-access database, containing cell, organism, and molecular information together with large-scale molecular datasets. The Chemical Entities of Biological Interest (ChEBI) database contains molecular entities, focusing on small chemical compounds.
The minimum and maximum number of compounds associated with a formula corresponding to the 18 disease classes are shown in Table 1. Finally, we represented the collected data as a two-dimensional table, in which the rows represent the Unani formulas and the columns represent metabolites. Figure 2b illustrates the data representation of herbal medicine-metabolite relations. The number of metabolites associated with the 369 medicinal plants is 4688. Therefore, the dimension of the matrix indicating relations between Unani formulas and metabolites is 609 × 4688.
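The formula-by-metabolite table described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the formula and plant names, and the toy metabolite lists are all hypothetical. A cell is set to 1 when any plant in the formula is associated with the metabolite.

```python
def build_formula_metabolite_matrix(formula_plants, plant_metabolites):
    """Return (formula_names, metabolite_names, binary_matrix)."""
    # Columns: every metabolite reported for at least one plant, sorted for a stable order.
    metabolites = sorted({m for ms in plant_metabolites.values() for m in ms})
    column = {m: j for j, m in enumerate(metabolites)}
    formulas = sorted(formula_plants)
    matrix = []
    for name in formulas:
        row = [0] * len(metabolites)
        for plant in formula_plants[name]:
            # Plants without a database entry contribute no metabolites.
            for m in plant_metabolites.get(plant, ()):
                row[column[m]] = 1
        matrix.append(row)
    return formulas, metabolites, matrix


# Hypothetical toy data (not taken from the Unani dataset).
formula_plants = {
    "formula_A": ["plant_1", "plant_2"],
    "formula_B": ["plant_2", "plant_3"],
}
plant_metabolites = {
    "plant_1": ["quercetin", "linalool"],
    "plant_2": ["linalool"],
    "plant_3": ["oleanolic acid"],
}
formulas, metabolites, matrix = build_formula_metabolite_matrix(
    formula_plants, plant_metabolites
)
```

With the real data, the same construction would give one row per formula and one column per metabolite, i.e., the 609 × 4688 shape reported above.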

Data Preprocessing
We first eliminated Unani formulas with missing values and those with multiple therapeutic usages, because we focused on determining compounds for a specific efficacy. Filtering is one way to address the problems of imbalanced data, multi-class classification, and inconsistent data. We used a single filtering method in this research: a model is built using the entire dataset as training data, the class of every sample is then predicted, and the misclassified samples are eliminated. According to [11], random forest and other classifiers can be used in this way to remove inconsistent data and increase classifier performance. We used two machine learning methods to filter the dataset. The first dataset was created using random forest as the filter, whereas the second was created using deep learning. The two types of filtering were applied so that the results could be compared and the better option used for the final prediction.
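The filtering step can be illustrated with a short sketch. It assumes a classifier object with scikit-learn style `fit` and `predict` methods; the `NearestCentroid` class below is a hypothetical toy stand-in written only to keep the example self-contained, and it is not the random forest or deep neural network used in this study.

```python
def filter_misclassified(X, y, model):
    """Fit `model` on all samples, then keep only the samples it classifies correctly."""
    model.fit(X, y)
    predicted = model.predict(X)
    keep = [i for i, (p, t) in enumerate(zip(predicted, y)) if p == t]
    return [X[i] for i in keep], [y[i] for i in keep], keep


class NearestCentroid:
    """Toy stand-in classifier (hypothetical); not the RF or DNN of the study."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, v in enumerate(row):
                acc[j] += v
            counts[label] = counts.get(label, 0) + 1
        # One mean feature vector per class.
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, X):
        def dist(row, centre):
            return sum((a - b) ** 2 for a, b in zip(row, centre))

        return [
            min(self.centroids, key=lambda c: dist(row, self.centroids[c]))
            for row in X
        ]


# Toy binary features; the last sample is labelled "b" but looks like class "a".
X = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1], [1, 1, 0]]
y = ["a", "a", "a", "b", "b", "b"]
X_kept, y_kept, kept_idx = filter_misclassified(X, y, NearestCentroid())
```

Because the model is evaluated on the same data it was trained on, a flexible classifier will remove few samples, whereas a rigid one will remove many; the choice of filter therefore changes the resulting dataset.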

Model Generation and Comparison
We generated a prediction model using the deep learning method. Deep learning is a form of machine learning that allows computers to learn from experience and to represent what they learn in the form of concepts. Deep learning techniques and algorithms can be used for supervised, unsupervised, and semi-supervised learning in various applications. The architecture used in this study was the deep neural network (DNN) [8].
Deep learning allows a computational model consisting of several processing layers to learn representations of data at various levels of abstraction. These representations are obtained by composing simple non-linear modules. For classification, the higher layers of representation strengthen the relevant aspects of the input and suppress irrelevant variations. The deep learning method can be used to find complex structures in high-dimensional data [9]. In this study, the network used had more than one hidden layer. Figure 4 shows the input layer, hidden layer, and output layer components in deep learning.
Initially, we tuned the DNN to obtain the optimal parameter values. The DNN is an advanced artificial neural network that has more than one hidden layer between the input and output layers. Each hidden layer has an activation function such as a sigmoid, rectified linear unit (ReLU), or hyperbolic tangent (tanh) function to map the input from the previous layer to the output that will be sent to the next layer.
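To make the role of the hidden layers and activation functions concrete, the following sketch computes a forward pass through a small fully connected network. The layer sizes and weight values are made up for illustration and are not those of the model in this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


ACTIVATIONS = {"sigmoid": sigmoid, "relu": relu, "tanh": math.tanh}


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one weight row per unit."""
    act = ACTIVATIONS[activation]
    return [
        act(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass `inputs` through a list of (weights, biases, activation) layers."""
    out = inputs
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out


# Hypothetical weights: 3 inputs -> 2 hidden units -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0, 0.5], [0.0, 1.0, 1.0]], [0.0, -1.0], "relu"),
    ([[1.0, 1.0], [-1.0, 2.0]], [0.0, 0.0], "relu"),
    ([[1.0, -1.0]], [0.0], "sigmoid"),
]
output = forward([1.0, 0.0, 1.0], layers)
```

Each layer maps the output of the previous layer through a weighted sum and a non-linearity, which is what allows stacked layers to represent increasingly abstract features.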
The DNN can be discriminatively trained with backpropagation, using the derivatives of a cost function that measures the difference between the target output and the actual output. For large training data, backpropagation is performed on a small portion of the data taken at random, which is more efficient than considering all data together.
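The idea of updating the weights from a small random portion of the data at a time can be sketched with mini-batch gradient descent. For brevity, the example fits a single weight of a linear model rather than a full network, so the gradient is written out by hand instead of being obtained by backpropagation; the data, learning rate, and batch size are hypothetical.

```python
import random


def minibatches(n_samples, batch_size, rng):
    """Yield lists of indices that cover the data once, in random order."""
    order = list(range(n_samples))
    rng.shuffle(order)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]


def train(xs, ys, epochs=5, batch_size=2, lr=0.5, seed=0):
    """Fit y ~ w * x by mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for batch in minibatches(len(xs), batch_size, rng):
            # Gradient of the mean squared error over this batch only.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


xs = [0.1 * k for k in range(1, 11)]
ys = [2.0 * x for x in xs]  # noise-free toy target; the true weight is 2.0
w = train(xs, ys)
```

Each update uses only `batch_size` samples, so its cost does not grow with the size of the training set.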
A DNN with a large number of hidden layers is challenging to optimize. Gradient descent from a randomly generated starting point does not reliably produce a good set of weights unless careful weight-scale initialization is performed. Therefore, the initialization of weights becomes essential to improve DNN modeling performance. We also compared the performance of the DNN with other supervised learning methods, such as random forest [12] and support vector machine [13].
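The text does not state which initialization scheme was used. As one common example of weight-scale initialization, the sketch below implements Glorot (Xavier) uniform initialization, in which the sampling range shrinks as the number of incoming and outgoing connections grows. The hidden-layer width of 64 is a made-up value; 4688 is the number of metabolite features reported above.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_in)]
        for _ in range(fan_out)
    ]


rng = random.Random(0)
# First hidden layer: 4688 metabolite inputs, 64 units (the width is an assumption).
weights = glorot_uniform(fan_in=4688, fan_out=64, rng=rng)
limit = math.sqrt(6.0 / (4688 + 64))
```

Scaling the range by the layer size keeps the variance of the activations roughly constant from layer to layer, which is the property that makes deep networks easier to train.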

Extracting Important Metabolites
According to the best prediction model, we extracted important metabolites for each class by considering the variable-importance weights in the DNN. We selected the top 15 important metabolites for each disease class and examined their weights. Among these, we discarded the metabolites whose weights were less than the threshold.
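The two-step selection (rank, then threshold) can be sketched as below. The function name, metabolite names, and weights are hypothetical; the default values follow the top-15 rule and the threshold of 0.01 used in this study.

```python
def select_important(importances, top_k=15, threshold=0.01):
    """Keep the top_k features by weight, then drop those below the threshold."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, w) for name, w in ranked[:top_k] if w >= threshold]


# Hypothetical importance weights for one disease class.
weights = {"m1": 0.30, "m2": 0.12, "m3": 0.04, "m4": 0.009, "m5": 0.001}

# top_k is reduced to 4 here so that both steps are visible on five features.
selected = select_important(weights, top_k=4, threshold=0.01)
names = [name for name, _ in selected]
```

In this toy case, m5 is excluded by rank and m4 by the threshold, leaving m1, m2, and m3.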

Filtering Dataset
First, we removed 33 Unani formulas for fever because this symptom can be found in many disease classes. Then, we eliminated 195 Unani formulas that have more than one therapeutic usage, and also eliminated the metabolites that were no longer related to any formula after this reduction. We applied single filtering using random forest and the deep neural network, separately. The filtering process was conducted by using the entire dataset as both training data and testing data, and misclassified formulas were deleted. We thereby obtained two datasets from the two types of filtering, namely dataset 1, obtained after filtering using random forest, and dataset 2, obtained after filtering using the deep neural network. The dimensions of the data after filtering can be seen in Table 2. Next, we examined the distribution of formulas across the efficacy classes after filtering. Each class in both datasets should have enough Unani samples to generate good prediction models. Therefore, we eliminated efficacy classes 1, 2, 4, 5, 7, 9, 14, and 18 because only a few Unani formulas were available in both datasets, as follows (dataset 1, dataset 2): (8, 4), (1, 0), (10, 5), (7, 1), (3, 3), (0, 0), (3, 0), and (13, 0). After this removal, the distribution of the Unani formulas in dataset 1 and dataset 2 is shown in Figure 5.
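The removal of sparsely populated classes can be expressed as a simple rule, sketched below. The text does not give a numeric cut-off, so the `min_count` value is an assumption, and the two retained classes "X" and "Y" with their counts are hypothetical; only the counts for the eliminated classes come from the text.

```python
def classes_to_keep(class_counts, min_count):
    """Keep classes with at least min_count formulas in every dataset."""
    return [
        c for c, counts in class_counts.items()
        if all(n >= min_count for n in counts)
    ]


# (dataset 1, dataset 2) formula counts per efficacy class.
class_counts = {
    # Counts reported in the text for the eliminated classes.
    1: (8, 4), 2: (1, 0), 4: (10, 5), 5: (7, 1),
    7: (3, 3), 9: (0, 0), 14: (3, 0), 18: (13, 0),
    # Hypothetical retained classes, for illustration only.
    "X": (40, 35), "Y": (25, 22),
}
kept = classes_to_keep(class_counts, min_count=15)  # the threshold is an assumption
```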

Performance of Prediction
The datasets obtained from the previous process (Table 2) were used to develop a model for the prediction of therapeutic usages of Unani using machine learning approaches. We adopted several methods, including deep neural networks (DNN), random forest (RF), and support vector machine (SVM). The deep neural network was chosen as the recommended classifier because this method is robust for imbalanced and multi-class data. The DNN model for this study was built according to the method proposed in [14], which is considered able to model complex data.
Tuning parameters are important factors in forming a prediction model. For the deep neural network, several parameters affected the accuracy of the model, such as the activation function, the dropout value, the number of folds k in the validation process (k-fold cross-validation), the number of hidden layers, and the number of epochs. Each parameter was tuned over a range of values as follows: activation function ("relu", "tanh", "sigmoid") [15], dropout value (0.15, 0.25, 0.40, 0.50), value of k for cross-validation (4, 5, 6, 7, 8, 9, 10), number of hidden layers (4, 6, 8, 12), and number of epochs (30, 50, 100, 500). The best DNN parameters were then selected using a grid search for both datasets. The optimal parameters were the same for both datasets: activation function = "relu", dropout value = 0.40, k value = 5, number of hidden layers = 4, and number of epochs = 30. The prediction results for each fold using the DNN with the best parameters can be seen in Figure 6.
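The size of this search space can be checked with a short sketch that enumerates the grid. The parameter names and the `grid_search` helper are hypothetical; the `score` argument stands for whatever evaluation is used in practice (for example, the mean k-fold accuracy of a DNN trained with the given settings), which is not implemented here.

```python
from itertools import product

PARAM_GRID = {
    "activation": ["relu", "tanh", "sigmoid"],
    "dropout": [0.15, 0.25, 0.40, 0.50],
    "k_folds": [4, 5, 6, 7, 8, 9, 10],
    "hidden_layers": [4, 6, 8, 12],
    "epochs": [30, 50, 100, 500],
}


def grid(param_grid):
    """Yield one dict per combination of parameter values."""
    names = list(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        yield dict(zip(names, values))


def grid_search(param_grid, score):
    """Return the best-scoring combination; `score` maps a parameter dict to a number."""
    return max(grid(param_grid), key=score)


combinations = list(grid(PARAM_GRID))
n_combinations = len(combinations)  # 3 * 4 * 7 * 4 * 4 = 1344
```

An exhaustive search over these ranges therefore evaluates 1344 settings per dataset, each of which involves k training runs under k-fold cross-validation.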
The comparison of the classifier performances is shown in Figure 7. For random forest, support vector machine, and XGBoost, the average prediction accuracies were below 40%, and for KNN the accuracy was around 60%, still much lower than that of the deep learning method. In this study, the DNN achieved 87.4% accuracy. The results imply that the prediction models based on RF and SVM are not able to make a good efficacy prediction using Unani's compounds as features. One of the reasons that influenced the results was the imbalanced number of Unani formulas belonging to different efficacy classes. It is noteworthy that the prediction model based on the DNN increased the accuracy by about 50 percentage points compared to RF and SVM.

Identification of Important Metabolites
After obtaining the best prediction model, we extracted essential features, in this case metabolites, for each therapeutic usage. The potential compounds for each disease class were obtained based on variable importance from the best deep neural network model using the KerasRegressor and PermutationImportance packages. First, we selected the top 15 compounds and then discarded the compounds whose variable-importance weight was lower than the threshold. In this study, we set the threshold to 0.01. In total, we selected 118 unique compounds for 9 efficacy groups. The statistics of the selected compounds can be seen in Table 3, and the details of the selected compounds for each disease class are available in Supplementary Table S1.
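The general idea behind permutation-based variable importance can be sketched without any library: a feature is important if shuffling its column degrades the performance of the fitted model. The code below is a minimal illustration of that idea and is not the implementation of the PermutationImportance package; the `FirstFeatureModel` class and the toy data are hypothetical.

```python
import random


def accuracy(model, X, y):
    return sum(p == t for p, t in zip(model.predict(X), y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances


class FirstFeatureModel:
    """Hypothetical fitted model that only looks at feature 0."""

    def predict(self, X):
        return [row[0] for row in X]


X = [[0, 1], [0, 0], [0, 1], [0, 0], [1, 1], [1, 0], [1, 1], [1, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
importances = permutation_importance(FirstFeatureModel(), X, y)
```

Here the second feature is ignored by the model, so shuffling it never changes the predictions and its importance is zero, whereas shuffling the first feature lowers the accuracy.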

Validation of Important Metabolites
We utilized three approaches to validate metabolites for each therapeutic group: (1) searching in supporting journals/articles; (2) searching for the same metabolites in other traditional medicine systems, in this case Jamu and TCM; and (3) searching for metabolites with similar structures in the PubChem database (using the Simpson similarity). Equation (1) shows the formula for calculating the Simpson similarity between two compounds:

$$S = \frac{a}{\min(a + b,\ a + c)} \qquad (1)$$

where a is the number of common features between two compounds, b is the number of features present in only one compound, and c is the number of features only present in the other compound.
Table 4 lists the predicted metabolites/compounds for which we could find validation, by disease class. Corresponding to the disease category 'The Digestive System', there were eight validated compounds. Out of them, 6H-dibenzo[b,d]pyran-6-one is effective against Enterophytoestrogens [16]. Lyratol C is used as a drug to treat colorectal neoplasms [17]. Epithienamycin E is a substance that kills or slows the growth of microorganisms, including bacteria, viruses, fungi, and protozoans [18]. 9(S)-HOTrE enhances reverse cholesterol transport (RCT) by increasing the apoA-I transcription in human hepatocellular carcinoma (HepG2) cells [18]. Cimifoetiside A is the active ingredient in Cimicifuga spp., which is used to relieve diarrhea in TCM [20]. Gymnemic acid XII possesses a higher binding affinity to PPARγ, a promising drug target for diabetes [21]. Quercetin 7,4′-di-O-β-D-glucoside is the active ingredient in Delonix elata, which is used to relieve flatulence and purgatives in Saudi Arabia [22]. Furthermore, as therapeutic agents, phenethylamine acts as an appetite suppressant [23].
For the category 'The Heart and Blood Vessels', we found four validated compounds. Out of them, kaempferol 3-O-[α-L-rhamnopyranosyl(1→2)-β-D-galactopyranosyl]-7-O-α-L-rhamnopyranoside is a candidate agent for the treatment of cardiovascular diseases [26]. Succinic acid is an active component that is applied in Jamu. Linalyl acetate prevents hypertension-related ischemic injury and can prevent the production of ROS [28].
In the case of 'Male-Specific Diseases', there were seven validated compounds. According to the Simpson similarity, Obtusifoliol resembles Euphadienol, which has anti-inflammatory effects [29]. Methyl 4-hydroxy cinnamate, ∆6-protoilludene, and 3-O-Acetyloleanolic acid are active against prostate cancer [30]. Butiin demonstrates the growth inhibition of Gram-positive and Gram-negative bacteria that cause male-specific infections [32]. Gibberellin A12 is implicated in the treatment of male infertility [33]. ∆6-Protoilludene is a precursor for the synthesis of both melleolides and armillyl orsellinates, whose cytotoxicity reflects their ability to induce apoptosis [34]. In addition, erythrodiol is an active ingredient from the herb Rhododendron ferrugineum, which is used in TCM.
For the category 'Muscle and Bone', four compounds were validated. Among them, 14-deoxo-3-O-propionyl-5,15-di-O-acetyl-7-O-benzoylmyrsinol-14beta-nicotinoate shows similarities with perfluorooctyl iodide. These metabolites are useful as organocatalysts through the activation of substrates with halogen bonds. Euphorbiaproliferin I resembles cesium and Euphorbiaproliferin G is similar to moli001259. Structural similarity was measured using the Simpson similarity. Furthermore, Euphorbiaproliferin D can be isolated from a TCM ingredient, namely Euphorbia prolifera, which according to TCM can cure various diseases.
For the category 'Skin and Connective Tissue', Taxifolin 3′-glucoside, Oleanolic acid, Oleandrin, Himaphenolone, Coniferyl aldehyde, and Cedrin were the validated metabolites. Taxifolin 3′-glucoside is effective for preventing the production of inflammatory cytokines and reducing atopic dermatitis [40]. Oleanolic acid can inhibit skin tumor promotion [41]. Oleandrin is shown to induce the apoptosis of malignant cells in melanoma cell lines [42]. Himaphenolone is the active ingredient of the herb, Cedrus deodara (Roxb.) Loud, which can be used for the treatment of carbuncle sores, eczema, traumatic bleeding, burns, and scalds. Coniferyl aldehyde is similar to a drug, and Nalco L. and Cedrin resemble dihydroquercetin.
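The structural comparisons mentioned above rely on the Simpson similarity. A minimal sketch is given below, assuming the usual definition a / min(a + b, a + c), where a counts the shared features and b and c count the features unique to each compound. The feature sets are hypothetical labels chosen for illustration; in practice they would be derived from structural fingerprints.

```python
def simpson_similarity(features_x, features_y):
    """Simpson similarity a / min(a + b, a + c) for two sets of structural features."""
    x, y = set(features_x), set(features_y)
    a = len(x & y)  # features common to both compounds
    b = len(x - y)  # features present only in the first compound
    c = len(y - x)  # features present only in the second compound
    smaller = min(a + b, a + c)
    return a / smaller if smaller else 0.0


# Hypothetical feature sets for two compounds.
compound_1 = {"hydroxyl", "carbonyl", "aromatic_ring", "methoxy"}
compound_2 = {"hydroxyl", "aromatic_ring", "glycoside"}
similarity = simpson_similarity(compound_1, compound_2)  # a = 2, b = 2, c = 1
```

Because the denominator is the size of the smaller feature set, a compound whose features are entirely contained in those of a larger compound receives a similarity of 1, which is worth keeping in mind when interpreting matches between molecules of very different size.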

Discussion
We tried our best to collect as many metabolites as possible for each Unani plant from various resources. Medicinal metabolites are of more importance to researchers and usually they are the first identified for various plants. Therefore, we assumed that the currently available plant-metabolite relation could produce good results up to a certain extent.
The approach adopted in the current work can be considered a top-down approach because we started with a global set of Unani formulas in terms of plants, and then moved down to the metabolite level and utilized state-of-the-art machine learning techniques to identify significant compounds. The approach is therefore also computational. The results we obtained are promising, showing the strength and usefulness of computational approaches in drug discovery. Our input data correspond to diverse types of diseases. In this work, we considered disease classes at an upper level of the hierarchy, and under each class there were diseases with some differences. Interestingly, our results also show compounds corresponding to different types of diseases under each category. This was made possible by using efficient algorithms to identify significant compounds within formulas that show bias toward specific disease classes/categories. These are therefore the results of a systems-level investigation. The compounds extracted from the best model that could not be validated are also worth discussing. The validation results show that around 43% of compounds are directly or indirectly related to the therapeutic group of diseases. The remaining 69 compounds are potential candidates for further research, for example, in the fields of biochemistry, pharmacy, and medicine. Last, the simple binary representation of metabolites performed well in this study. However, other approaches can be explored to improve the results.

Conclusions
Prediction of the therapeutic usage of Unani formulas based on their constituent metabolites using the deep neural network achieved the highest accuracy among the algorithms compared, including random forest and support vector machine. The best prediction accuracies for DNN, KNN, XGBoost, RF, and SVM were 87.4%, 63.2%, 39.3%, 37.9%, and 38.6%, respectively. These results indicate that the DNN performed much better than the other algorithms. In this work, two datasets were prepared using filtering techniques, namely dataset 1 and dataset 2. For the DNN, the best accuracy was obtained from dataset 1, while RF and SVM obtained their best accuracy from dataset 2. In general, the filtering process improves prediction accuracy, but our results were mainly influenced by the type of classifier algorithm.
Based on the best classification model, we extracted important metabolites by making use of the variable importance of the DNN. Corresponding to the nine therapeutic uses of the Unani formulas, we extracted 118 essential metabolites, 49 of which were validated using the following methods: searching in supporting health-related journals/articles, searching for the same metabolites in Jamu or TCM, and searching for metabolites with a similar structure and activity in the PubChem database.
In future work, we need to consider increasing the number of Unani formulas, which will simultaneously increase the number of plants and metabolites. We will look for more sources of plant-metabolite relations, such as open-source databases, books, and journals, so that our dataset is closer to actual conditions and also acceptable to industry. The authors also recommend using artificially generated data in testing to support and strengthen the reported prediction accuracy of the model.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/life13020439/s1, Table S1: List of important metabolites for each disease class extracted from best prediction model using variable importance of Deep Neural Network.