Screening of Potential Indonesia Herbal Compounds Based on Multi-Label Classification for 2019 Coronavirus Disease

Coronavirus disease 2019 pandemic spreads rapidly and requires an acceleration in the process of drug discovery. Drug repurposing can help accelerate the drug discovery process by identifying new efficacy for approved drugs, and it is considered an efficient and economical approach. Research in drug repurposing can be done by observing the interactions of drug compounds with protein related to a disease (DTI), then predicting the new drug-target interactions. This study conducted multilabel DTI prediction using the stack autoencoder-deep neural network (SAE-DNN) algorithm. Compound features were extracted using PubChem fingerprint, daylight fingerprint, MACCS fingerprint, and circular fingerprint. The results showed that the SAE-DNN model was able to predict DTI in COVID-19 cases with good performance. The SAE-DNN model with a circular fingerprint dataset produced the best average metrics with an accuracy of 0.831, recall of 0.918, precision of 0.888, and F-measure of 0.89. Herbal compounds prediction results using the SAE-DNN model with the circular, daylight, and PubChem fingerprint dataset resulted in 92, 65, and 79 herbal compounds contained in herbal plants in Indonesia respectively.

The COVID-19 pandemic has spread rapidly and requires an acceleration in the process of drug discovery to fight this disease. Drug repurposing is one of the processes that can help to accelerate the drug discovery process to fight COVID-19 [4]. In drug repurposing, the drug discovery process is conducted by identifying new efficacy for approved drugs, and it is considered an efficient and economical approach [5]. Research in drug repurposing can be done by observing the interactions of drug compounds with protein related to a disease (drug-target interaction or DTI), then predicting the new drug-target interactions [3].
The search for potential drugs through drug repurposing should consider the ease of access to medicinal ingredients to be more accessible by the public, especially for the
Multilabel classification to predict DTI can be used to overcome binary classification problems. In multilabel classification, the training process is conducted to produce a model that maps input vectors to one or more classes. In the multilabel classification for DTI, m compounds are samples, and n proteins are target classes. The sample is characterized as an input vector which is then used to predict the target (protein) of the compound with a multilabel learning algorithm [17]. The use of protein as a class label can reduce the dimensions of the input because it no longer requires feature extraction of the protein. Prediction of the target is only determined based on the pattern of the existing compound structure. In addition, from a machine learning perspective, apart from being able to predict several interactions at once, the multilabel classification model can also identify possible correlations between class labels (proteins) to increase the performance of DTI predictions [18].
Research [17] shows that multilabel classification for DTI problems can outperform binary classification with better computational speed, especially for large datasets.
Several studies related to the multilabel classification of DTI have been conducted before. Research [19] conducted a multilabel DTI search using a deep belief network (DBN) model with a binary relevance data transformation approach on protease and kinase data taken from the DUD-E site. Feature extraction on compounds was carried out using the PubChem fingerprint and Klekota-Roth fingerprint descriptors. As a result, the DBN method can be used as a model to predict multi-target DTI with an accuracy range of 97-99% and an AUC range of 83-99%. Research [20] predicted multi-target DTI using the ensemble tree model on the golden standard dataset from research [21]. First, the data are reconstructed using the Neighborhood Regularized Logistic Matrix Factorization (NRLMF) method to overcome the imbalanced data problem. The ensemble tree model with data reconstruction using the proposed NRLMF produces good predictive ability in multi-target DTI.
Research [3,6] that used a binary classification approach for DTI prediction in COVID-19 cases simplified the DTI problem, which could bias the model. Therefore, it is necessary to try a multilabel classification approach to predict DTI in COVID-19. This study used a multilabel classification approach to predict DTI with the SAE-DNN algorithm. SAE is used as a pre-training model for the DNN through unsupervised learning. DNN uses an algorithm adaptation approach to solve multilabel classification problems and performs well in multilabel classification [22]. Feature extraction was conducted on the compound data using four compound fingerprints: PubChem fingerprint, daylight fingerprint, MACCS fingerprint, and circular fingerprint. The trained SAE-DNN model is then used to predict herbal compounds. Then a search for herbal plants containing the predicted herbal compounds was made on the KNApSAcK site (http://www.knapsackfamily.com/KNApSAcK/ accessed on 2 November 2021) [23].

Dataset
This study used three datasets: a protein dataset, a drug-target interaction dataset, and a herbal compound dataset. The protein dataset was obtained from GeneCards [24], which yielded 1567 genes related to COVID-19. The data were filtered by selecting genes categorized as protein-coding, producing a total of 1498 genes (proteins). Protein data can be seen in Supplementary Spreadsheet 1. The drug-target interaction dataset, obtained from the SuperTarget [25] and DrugBank [26] sites, yielded 58,446 interactions. All of the drug-target interaction data can be seen in Supplementary Spreadsheet 2. Herbal compound data were obtained from HerbalDB [27], which consists of 403 herbal compounds. Data acquisition was carried out on 12-13 July 2021. Feature extraction on the compounds was carried out with four fingerprints of two different types: substructural fingerprints (PubChem fingerprint [28] and MACCS fingerprint, also referred to as MDL keys [29]) and topological fingerprints (daylight fingerprint [30] and circular fingerprint (ECFPs) [31]). In chemoinformatics, a fingerprint is one way to represent a chemical structure [12].

Workflow
DTI multilabel prediction consists of three steps:

1. Data preprocessing step, which includes feature extraction on compounds and class data transformation.
2. Multilabel modeling step using SAE-DNN model. Hyperparameter tuning is also conducted to find the optimal parameter of SAE-DNN for all feature extraction datasets.
3. Post-processing step including model evaluation and herbal compounds prediction.

Data Preprocessing
The first stage of data preprocessing is feature extraction on the compounds. Feature extraction aims to form a representation of a compound's chemical structure. One of the most commonly used representations is the molecular fingerprint. Molecular fingerprinting simplifies the chemical information in a compound by treating the molecular structure as a graph and representing it as a binary vector [12].
The feature extraction process used four fingerprints of two different types: substructure-based fingerprints (PubChem fingerprint [28] and MACCS (MDL) fingerprint [29]) and topological fingerprints (daylight fingerprint [30] and circular fingerprint (ECFPs) [31]). In a substructure-based fingerprint, an array is formed to represent the chemical substructures of a compound, with each substructure assigned to a specific location in the array. For each substructure that occurs in the compound, the corresponding position in the fingerprint vector is 1; otherwise, it is 0 [32]. Topology-based fingerprints are formed by enumerating the molecular fragments that emerge from a specified path or radius in a molecule, then hashing each fragment to a position in the array. The bit value in a topology-based fingerprint array is 1 if there is a molecular fragment at a certain path length or radius; otherwise, it is 0 [33].
The illustration of the feature extraction process on compounds can be seen in Figure 1.
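The path-hashing idea behind topological fingerprints can be illustrated with a minimal sketch. It is not the daylight or ECFP algorithm used in this study: the molecule is given as a hand-written atom and bond list rather than parsed from SMILES, `zlib.crc32` stands in for the hash function, and the function name `path_fingerprint` and its parameters are hypothetical. Path length is counted in atoms here, which is also an assumption.

```python
import zlib


def path_fingerprint(atoms, bonds, max_atoms=5, n_bits=1024):
    """Toy linear-path fingerprint: every simple path of up to max_atoms
    atoms is hashed to one position of an n_bits binary vector."""
    # Build an undirected adjacency list from the bond list.
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    bits = [0] * n_bits

    def walk(path):
        symbols = [atoms[i] for i in path]
        # A path and its reverse describe the same fragment, so use one key.
        key = min("-".join(symbols), "-".join(reversed(symbols)))
        bits[zlib.crc32(key.encode("ascii")) % n_bits] = 1
        if len(path) < max_atoms:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:  # keep paths simple (no revisits)
                    walk(path + [nxt])

    for start in neighbours:
        walk([start])
    return bits


# Heavy atoms of ethanol, C-C-O, written by hand (illustrative input).
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(len(ethanol), sum(ethanol))
```

Because the key is built from atom symbols only, listing the same atoms in a different order yields the same bit vector. A circular fingerprint would replace the path walk with fragments collected within a given radius of each atom.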
Figure 1. (a) Illustration of the substructure-based fingerprint to represent the chemical structure of a compound. Bit 1 indicates that the substructure it represents is present in the molecule, whereas bit 0 indicates that the substructure is not present. (b) Illustration of the topological-based fingerprint to represent the chemical structure of a compound, in this case a linear path-based (daylight) fingerprint with a path length of 5. Every fragment found from the starting point (circled) up to a certain path length is hashed to the corresponding bit in the fingerprint. Circular fingerprints follow a similar approach, but instead of using path length, they use a radius around the starting point to find the molecular fragments.
The feature extraction process is done using the following steps:

1. Identify each unique compound in the interaction data. There are 49,862 unique compounds in the interaction data.
2. Identify the PubChem ID of each compound.
3. Identify the SMILES (Simplified Molecular-Input Line-Entry System), which represents the chemical structure of each compound. SMILES data can be seen in Supplementary Spreadsheet 3.
4. Form the fingerprint of each compound according to its SMILES.
5. The fingerprint feature retrieval process produces a feature vector C (C = [c1, c2, c3, ..., cn] with n = the number of substructures on the fingerprint), which will be used as input to the DNN.

The compound feature extraction process produces 881 attributes for the PubChem fingerprint, 1024 for the daylight fingerprint, 166 for the MACCS fingerprint, and 1024 for the circular fingerprint.
The next step is transforming the class attributes from a single-label to a multilabel problem. The transformation is done by creating an array P (P = [p1, p2, p3, ..., pm] with m = the number of proteins). In each data row, the value of p in the P array is 1 if the compound in that row interacts with protein p, and 0 if the compound does not interact, or is not known to interact, with protein p.
Class data were transformed into a multilabel problem by first identifying the unique proteins in the interaction data. There are 467 unique proteins in the interaction data. The formation of multilabel data resulted in 49,862 rows of data representing unique compounds and 27 unique classes representing the set of proteins that interacted with these compounds. The data are dominated by compounds that interact with only one protein. Details of the number of compounds that interact with at least one protein can be seen in Figure 2.
The data are then separated into two forms: data table X, compound data containing feature extraction results as predictor variables, and data table Y, the class data in multilabel form. An example of all feature extraction data and multilabel class data can be seen in Figure 3.
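The transformation of interaction pairs into multilabel class data can be sketched as follows. The compound and protein identifiers are placeholders invented for illustration, not rows from the interaction dataset, and the function name `to_multilabel` is hypothetical.

```python
def to_multilabel(interactions):
    """Turn (compound, protein) pairs into one binary label row per compound."""
    proteins = sorted({protein for _, protein in interactions})
    column = {protein: j for j, protein in enumerate(proteins)}
    rows = {}
    for compound, protein in interactions:
        # A compound starts with all zeros: no known interaction.
        row = rows.setdefault(compound, [0] * len(proteins))
        row[column[protein]] = 1
    return proteins, rows


# Placeholder interaction pairs (hypothetical identifiers).
pairs = [("cmp_1", "prot_A"), ("cmp_1", "prot_C"), ("cmp_2", "prot_B")]
proteins, Y = to_multilabel(pairs)
print(proteins)  # ['prot_A', 'prot_B', 'prot_C']
print(Y)         # {'cmp_1': [1, 0, 1], 'cmp_2': [0, 1, 0]}
```

Each row of `Y` plays the role of the array P for one compound; the matching fingerprint vectors would form table X.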

SAE-DNN Model
In the SAE-DNN model, SAE is used to pre-train the DNN model in order to initialize the weights of the DNN. The DNN uses an algorithm adaptation approach to predict multilabel DTI. Initializing the DNN weights with the SAE training results is intended to produce a better model than one initialized with random weights [34]. The SAE-DNN architecture can be seen in Figure 4. The DNN pre-training process using SAE can be seen in Algorithm 1 [34].
Big Data Cogn. Comput. 2021, 5
Figure 3. Example of all feature extraction data and multilabel class data.

In the SAE architecture, the input data (x) that enters the encoder layer is converted to a new data representation h, as formulated in Equation (1):
$$h = \phi(w_e x + b_e) \quad (1)$$
where $\phi$ is the activation function, $w_e$ is the weight, and $b_e$ is the bias of the encoder layer. The decoder layer then maps the new data representation from the bottleneck layer back to the initial data form ($x'$) with Equation (2):
$$x' = \phi(w_d h + b_d) \quad (2)$$
where $\phi$ is the activation function, $w_d$ is the weight, and $b_d$ is the bias of the decoder layer. The AE model is trained to reduce the reconstruction error between $x$ and $x'$. The error metric used is the mean squared error (MSE). In the SAE training process, the new data representation at the bottleneck of the current AE is used as input for the next AE.
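A single autoencoder pass and its reconstruction error can be sketched in plain Python. The sigmoid activation, the layer sizes, and the weight values below are assumptions chosen for illustration; in the study the weights are learned and the models are built with TensorFlow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, w, b):
    """One fully connected layer: activation(w x + b), one row of w per unit."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(w, b)]


def mse(x, x_rec):
    """Mean squared reconstruction error between input and reconstruction."""
    return sum((a - c) ** 2 for a, c in zip(x, x_rec)) / len(x)


x = [1.0, 0.0, 1.0, 1.0]                                   # 4-bit toy fingerprint
w_e = [[0.5, -0.5, 0.5, 0.5], [-0.5, 0.5, -0.5, 0.5]]      # encoder: 4 -> 2
b_e = [0.0, 0.0]
w_d = [[1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]  # decoder: 2 -> 4
b_d = [0.0, 0.0, 0.0, 0.0]

h = dense(x, w_e, b_e)        # bottleneck representation
x_rec = dense(h, w_d, b_d)    # reconstruction of the input
loss = mse(x, x_rec)
print(len(h), len(x_rec), round(loss, 4))
```

In a stacked autoencoder, `h` would then serve as the input of the next autoencoder, which is trained in the same way.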
In the DNN architecture, the learning process starts with the feed-forward pass, in which the input data move through the layers to the output layer. Each hidden layer maps its input to an output that is sent to the next layer according to Equation (3).
where i is the current layer, j is the next layer, y is the output of the layer, $f(x_j)$ is the activation function, x is the input of the layer, N is the number of nodes in the layer, $w_{i,j}$ is the weight that connects the i-th layer to the j-th layer, and b is the bias of the layer. The backpropagation process is then carried out to update the weight and bias values in the DNN architecture to reduce the training error. The weight update is carried out using stochastic gradient descent, which aims to optimize the objective function and the learning process of the DNN. Equation (4) shows the formulation of the error value, and Equation (5) shows the formulation of the stochastic gradient. Finally, the model training process is repeated until the maximum number of iterations is reached or the results converge.
where α is the learning rate, ∆w is the change in the weight value, t is the batch size of the data, e is the error, and f′(C) is the derivative of the cost function used during backpropagation to calculate the error gradient.
The pre-training DNN process using SAE can be seen in Algorithm 1 [34]. The number of autoencoder layers in SAE is adjusted to the number of hidden layers in DNN. The weights and biases in the DNN hidden layers use the weights and biases from the SAE training, while the weights and biases in the output layer are initialized randomly.
After getting the initial weights and biases for the DNN, the next step is to build the DNN model. The DNN is adapted to multilabel classification by setting the number of nodes in the output layer to the number of classes (proteins) in the data. Each node in the output layer uses a binary cross-entropy loss function formulated by Equation (6).
The DNN architecture applies batch normalization and dropout to improve model performance [3]. Batch normalization normalizes the input of each layer so that all inputs have a mean close to 0 and a standard deviation close to 1, which speeds up the DNN training process [35]. Dropout reduces the complexity of the DNN architecture and prevents overfitting by temporarily and randomly removing several nodes in a layer during model training [36]. Dropout is applied after batch normalization to produce a more stable training process, faster convergence, and better generalization [37].
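The transfer of SAE parameters into the DNN can be sketched as below. This is not Algorithm 1 from [34]; the helper name `init_dnn_from_sae`, the layer sizes, and the uniform range used for the random output layer are assumptions, and the encoder parameters are placeholders standing in for trained values.

```python
import random


def init_dnn_from_sae(encoders, n_labels, seed=0):
    """Build initial DNN parameters: hidden layers copy the trained encoder
    weights and biases; the output layer is initialized randomly."""
    rng = random.Random(seed)
    # One hidden layer per autoencoder, copied so later training of the DNN
    # does not modify the stored SAE parameters.
    layers = [{"w": [row[:] for row in enc["w"]], "b": enc["b"][:]}
              for enc in encoders]
    last_hidden = len(encoders[-1]["b"])
    layers.append({
        "w": [[rng.uniform(-0.05, 0.05) for _ in range(last_hidden)]
              for _ in range(n_labels)],
        "b": [rng.uniform(-0.05, 0.05) for _ in range(n_labels)],
    })
    return layers


# Placeholder encoder parameters for two stacked autoencoders (4 -> 3 -> 2).
encoders = [
    {"w": [[0.1] * 4 for _ in range(3)], "b": [0.0] * 3},
    {"w": [[0.2] * 3 for _ in range(2)], "b": [0.0] * 2},
]
dnn = init_dnn_from_sae(encoders, n_labels=5)
print(len(dnn), len(dnn[-1]["w"]), len(dnn[-1]["w"][0]))  # 3 5 2
```

The output layer has one node per label, so `n_labels` would equal the number of protein classes in the multilabel data.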
Each node in the output layer produces a class probability value for the input. If the probability value of the class is above 0.5, the predicted label for that class is 1; otherwise, it is 0. The output of the output layer is an array P (P = [p1, p2, p3, ..., pm]) that holds the prediction result of each node in the output layer. SAE-DNN modeling uses the "TensorFlow-GPU" library version 2.3 and uses a GPU to speed up the training process.
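The thresholding rule and a per-node binary cross-entropy loss can be sketched as follows. Equation (6) is not reproduced here; the code uses the standard definition, -[y log(p) + (1 - y) log(1 - p)], which is assumed to match it. The probabilities and labels are made-up values, and the `eps` clipping is an implementation detail added to avoid log(0).

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Standard binary cross-entropy for one output node (one label)."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to keep log() finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def predict_labels(probabilities, threshold=0.5):
    """Label is 1 when the node's probability is above the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


probs = [0.91, 0.12, 0.50, 0.67]   # made-up output-layer probabilities
labels = [1, 0, 0, 1]              # made-up true labels
print(predict_labels(probs))       # [1, 0, 0, 1]
losses = [binary_cross_entropy(y, p) for y, p in zip(labels, probs)]
print(round(sum(losses) / len(losses), 4))
```

Note that a probability of exactly 0.5 is labeled 0, since the rule requires the value to be above 0.5.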
The hyperparameter tuning process is carried out using Bayesian Optimization (BO). BO builds a probabilistic model to select promising hyperparameters from the candidate values and reuses the best hyperparameters found so far to guide the search in the next iteration, which speeds up the search for the best hyperparameters [38]. The implementation of BO is carried out using the "Keras-tuner" library [39]. The hyperparameter search space can be seen in Table 1.

Postprocess Step
Model evaluation is carried out using iterative stratification, a modification of k-fold cross-validation that aims to balance the number of label combinations from multilabel data in each fold [40,41], with k = 5 folds. The evaluation metrics used are accuracy, recall, precision, and F-measure. Accuracy measures how well the model predicts the test data. Precision measures the proportion of positive predictions that are actually positive. Recall measures the proportion of actual positives that the model predicts as positive. F-measure, the harmonic mean of precision and recall, reflects performance on the minority class [3].
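The four metrics can be computed for multilabel predictions as in the sketch below. The averaging scheme used in the study is not stated in this section, so exact-match accuracy and micro-averaged precision, recall, and F-measure are assumptions; the label matrices are made-up toy data, and the function name `multilabel_metrics` is hypothetical.

```python
def multilabel_metrics(Y_true, Y_pred):
    """Exact-match accuracy plus micro-averaged precision, recall, F-measure."""
    tp = fp = fn = exact = 0
    for true_row, pred_row in zip(Y_true, Y_pred):
        exact += int(true_row == pred_row)
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": exact / len(Y_true), "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Toy label matrices: rows are compounds, columns are protein labels.
Y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
metrics = multilabel_metrics(Y_true, Y_pred)
print(metrics)
```

In this toy case two of the four rows match exactly, and there are five true positives, one false positive, and one false negative.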
Prediction of herbal compounds is made by predicting the set of proteins (classes) that interact with the data of herbal compounds using the SAE-DNN model with optimal hyperparameters. A set of proteins is considered to interact with herbal compounds if the probability value of the prediction results is above 0.5. Prediction of herbal compounds was carried out using two models, the SAE-DNN model trained with the PubChem fingerprint feature and the best SAE-DNN model from the results of a comparison of four feature extractions. Then a search for herbal plants containing predicted herbal compounds was made on the KNapSAcK site (http://www.knapsackfamily.com/KNApSAcK/ accessed on 2 November 2021).

Results
First, we present the comparison between the SAE-DNN model and the DNN model without pre-training (DNN only) for all feature extraction datasets using default parameters such as HL0 Node = 1024, HLi Node = 0.5, hidden layer = 3, optimizer = Adam, activation function = ReLU, learning rate = 0.01, and dropout rate = 0.5. Second, we present the performance comparison for all the feature extraction datasets using optimal hyperparameters. Third, we show the herbal prediction result using the SAE-DNN model.

Performance Comparison between SAE-DNN and DNN Only
Model performance is presented with the mean and standard deviation of each metric. Figure 5 shows the performance results of SAE-DNN and DNN without a pre-training process for all feature extraction datasets.
The SAE-DNN model performs slightly better than the DNN without pre-training, with higher average accuracy, recall, precision, and F-measure. This indicates that the SAE-DNN model is better able to predict multilabel classes (high accuracy), better at finding the positive class of each label (high recall), more often correct when it predicts a label as positive (high precision), and able to recognize minority classes well (high F-measure).
Although SAE-DNN performs only slightly better than DNN without pre-training, the use of SAE for DNN pre-training has several advantages, including preventing layer activation outputs from exploding or vanishing during the training of the DL technique [42] and helping the DNN achieve better convergence and better generalization power [34]. One way to analyze the generalization performance of a learning algorithm is the stability of its prediction performance [43]. Figure 6 shows the standard deviation of the metrics for SAE-DNN and DNN without pre-training.
In general, SAE-DNN produced a lower standard deviation for all metrics compared to DNN without pre-training. A lower standard deviation indicates that the model's performance is more stable across folds in the cross-validation process. These results imply that the SAE-DNN model has better generalization power than the DNN without pre-training.
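Fold-level stability can be summarized as in the following sketch. The per-fold scores are invented purely for illustration and are not results from this study; whether the study used the sample or population standard deviation is not stated, so the sample form (`statistics.stdev`) is an assumption.

```python
import statistics

# Invented F-measure values for k = 5 folds (not results from the study).
fold_scores = {
    "model_a": [0.88, 0.89, 0.90, 0.89, 0.89],
    "model_b": [0.84, 0.91, 0.87, 0.93, 0.85],
}

summary = {
    name: (statistics.mean(scores), statistics.stdev(scores))
    for name, scores in fold_scores.items()
}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.3f} std={std:.3f}")

# The model with the smaller spread is the more stable one across folds.
most_stable = min(summary, key=lambda name: summary[name][1])
print(most_stable)  # model_a
```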

SAE-DNN Performance Comparison for All the Feature Extraction Datasets
The SAE-DNN model is trained using the optimal hyperparameters from the hyperparameter tuning process. The optimal hyperparameters can be seen in Table 2.

The performance of each model can be seen in Table 3. Topological fingerprints (daylight fingerprint and circular fingerprint) perform better than substructure-based fingerprints (PubChem fingerprint and MACCS fingerprint). The SAE-DNN model with the circular fingerprint feature produces the best average metric values compared to the other feature extraction processes, with an accuracy of 0.83160, recall of 0.91836, precision of 0.88848, and F-measure of 0.89368. The standard deviation of each metric in the circular, daylight, and PubChem models is also relatively low, indicating that the model's performance tends to be stable across folds. The low performance of the MACCS fingerprint may be due to a lack of explanatory features, considering that the MACCS fingerprint yields only 166 features. Based on these results, it can be concluded that the SAE-DNN model with the circular fingerprint feature has the best performance among the models compared.
Prediction of herbal compounds was carried out using three SAE-DNN models, namely SAE-DNN with the PubChem fingerprint, daylight fingerprint, and circular fingerprint datasets. The SAE-DNN model with the MACCS fingerprint dataset was not used to predict herbal compounds due to the low performance of the model.

Comparison with Other Approaches from the Literature
From a methodological point of view, recent studies on DTI prediction commonly use a binary classification approach. In our proposed method, DTI prediction is done using the multilabel classification approach, which has several advantages over the binary classification approach. First, the proposed method does not require balancing positive and negative data to achieve fair results, whereas the existing binary classification approach needs to randomly sample the negative DTI in order to balance the data, such as in research [3], which can result in false negatives and bias in the model results [15]. Second, the proposed method does not require feature extraction on protein data, which reduces the data dimensions and speeds up the training process.
From a machine learning performance point of view, we compare the DTI prediction performance between SAE-DNN and other deep learning models implemented in research [44,45]. Although these studies used a binary approach to predict DTI, comparisons can be made by looking at the model's performance in predicting positive classes. In terms of DTI, only the positive class is considered validated information, while the negative class cannot be validated due to the lack of experimental data on drug-target pairs [46]. Therefore, the comparison is done using the recall and F-measure metrics. SAE-DNN outperforms other deep learning methods such as the standard artificial neural network (ANN) and deep belief network (DBN) from research [44], with a best F-measure of 0.89368 compared to 0.88 for the standard ANN and 0.885 for the DBN. SAE-DNN also outperforms the proposed ComboNet method [45], with a best recall of 0.918 compared to the ComboNet recall of 0.8.

Herbal Compounds Prediction
The first prediction of herbal compounds was carried out using the SAE-DNN model trained with the circular fingerprint dataset. Since the herbal compound data taken from the HerbalDB website contain only the PubChem fingerprint feature, we must first search for the PubChem ID of each compound. The PubChem ID is used to determine the SMILES and to obtain each herbal compound's daylight fingerprint and circular fingerprint. The PubChem ID search yielded 305 herbal compounds that have a PubChem ID. Prediction with the SAE-DNN model with the circular fingerprint dataset produced 169 compounds interacting with COVID-19 proteins. Of the 169 compounds, 79 compounds interacted with more than one protein, while the rest interacted with only one protein.
Details of the number of herbal compounds that interact with the protein set predicted by SAE-DNN with the circular fingerprint feature can be seen in Figure 7.

This study did not conduct molecular docking on the compound-protein interactions predicted by the SAE-DNN model. Potential compounds for COVID-19 can be determined from the relevance score of the predicted proteins that interact with these compounds, which indicates the relevance of the protein to COVID-19 on the GeneCards site. The relevance score is calculated based on several factors, including how often the protein appears in publications and disease pathways. The herbal plant search shows that, of the 169 compounds predicted to interact with COVID-19 proteins by SAE-DNN with the circular fingerprint dataset, 92 compounds are contained in herbal plants in Indonesia, with a total of 378 herbal plants. The ten compound-protein interactions with the highest relevance scores predicted by SAE-DNN with the circular fingerprint dataset can be seen in Table 4.
All compoundprotein interactions of SAE-DNN prediction results with circular fingerprint dataset can be seen in Supplementary Spreadsheet 4.  This study did not conduct molecular docking on the predicted results of compoundprotein interactions from the SAE-DNN model. Potential compounds for COVID-19 can be determined from the relevance score of predicted proteins that interact with these compounds which shows the value of the relevance of the protein to COVID-19 disease on the GeneCards site. The relevance score is calculated based on several factors, including how often the protein appears in a publication and disease pathways. Herbal plants search results show that of the 169 compounds predicted to interact with the COVID-19 protein according to the results of the SAE-DNN prediction with a circular fingerprint dataset; there are 92 compounds contained in herbal plants in Indonesia with a total of 378 herbal plants. Ten compound-protein interactions with the highest relevance score predicted by SAE-DNN with the circular fingerprint dataset can be seen in Table 4. All compoundprotein interactions of SAE-DNN prediction results with circular fingerprint dataset can be seen in Supplementary Spreadsheet 4. The scientific names of species are italicized. The genus name is always capitalized and is written first; the specific epithet follows the genus name and is not capitalized. Based on the results in Table 4. Ten herbal compounds were predicted to have interactions with four proteins with high relevance score, which is DPP4 (12.445), TNF (12.213), TLR4 (11.928), and F3 (10.37) protein. According to research [47], DPP4 protein can be a receptor for SARS-CoV-2 and help the hyper inflammation process in the body. Rutin compound is predicted to interact with DPP4 protein with a probability of 0.534, together with PTGS1 protein with a small relevance value (0.895) and a probability of 0.628. 
In Indonesia, Rutin is found only in the Camellia sinensis plant (tea leaves). TNF also plays a role in the formation of the cytokine storm in acute COVID-19 patients [48]. It is predicted to interact with Damnacanthal with a probability of 0.819. Damnacanthal is found in the Morinda citrifolia L. (noni) plant. According to [49], TLR4 could be a promising drug target in COVID-19 cases because it contributes significantly to the pathogenesis of SARS-CoV-2, and its over-activation can cause an exaggerated innate immune response. This protein is predicted to interact with ascorbic acid, palmitoleic acid, petunidin, naringin, malvidin, sterculic acid, and ricinoleic acid with a high probability (0.975-0.998). Among these seven compounds, ascorbic acid is contained in the largest number of herbal plants (5).
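The ranking of predicted interactions by protein relevance score can be sketched as follows. This is a minimal illustration rather than the authors' code: the `Interaction` type, the function name `top_interactions`, and the tie-breaking rule are assumptions, and the data are only the scores and probabilities quoted in the text above.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    compound: str
    protein: str
    probability: float  # SAE-DNN output probability for this pair


# GeneCards relevance scores quoted in the text (subset only).
RELEVANCE = {"DPP4": 12.445, "TNF": 12.213, "TLR4": 11.928, "F3": 10.37, "PTGS1": 0.895}

# Predicted interactions quoted in the text (subset only).
PREDICTED = [
    Interaction("Rutin", "PTGS1", 0.628),
    Interaction("Damnacanthal", "TNF", 0.819),
    Interaction("Rutin", "DPP4", 0.534),
]


def top_interactions(predicted, relevance, k=10):
    """Return the k interactions whose protein has the highest relevance score.

    Ties on relevance are broken by the higher predicted probability
    (an assumed rule; the source does not describe tie handling).
    """
    return sorted(
        predicted,
        key=lambda it: (relevance.get(it.protein, 0.0), it.probability),
        reverse=True,
    )[:k]


ranked = top_interactions(PREDICTED, RELEVANCE, k=2)
for it in ranked:
    print(it.compound, it.protein, RELEVANCE[it.protein], it.probability)
```

Note that the ranking key is the relevance of the protein, not the predicted probability, so a low-probability pair such as Rutin and DPP4 can still rank first.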
Next, herbal compounds were predicted using the SAE-DNN model with the daylight fingerprint dataset. This model predicted 119 compounds to interact with COVID-19 proteins. Of the 119 compounds, 63 interacted with one protein while the rest interacted with more than one protein. Details of the number of herbal compounds that interact with the protein set predicted by SAE-DNN with the daylight fingerprint feature can be seen in Figure 8. The herbal plant search on the KNApSAcK site shows that, of the 119 compounds predicted to interact with COVID-19 proteins by the SAE-DNN model with the daylight fingerprint dataset, 65 are contained in herbal plants in Indonesia, with a total of 272 herbal plants. The ten compound-protein interactions with the highest relevance score predicted by SAE-DNN with the daylight fingerprint dataset can be seen in Table 5. All compound-protein interactions of the SAE-DNN prediction results with the daylight fingerprint dataset can be seen in Supplementary Spreadsheet 5.
Based on the results in Table 5, ten herbal compounds with the highest relevance score predicted by the SAE-DNN model with the daylight fingerprint feature interacted with seven proteins with four proteins having the highest relevance score, namely EGFR protein (9.887), TNFRSF1A (5.459), TTR (2.638), and AR (2.428). EGFR protein became the protein with the highest relevance from the prediction results of the SAE-DNN model with daylight fingerprint feature. Research [50] stated that EGFR protein is involved in the infection process in lung cells, triggers a proinflammatory response, and is a potential drug target in the treatment of COVID-19. EGFR protein is predicted to interact with Hyperoside compounds with a probability of 0.509. Hyperoside compounds are found in seven herbal plants in Indonesia, one of which is Mangifera indica (mango). According to the GeneCards website, the TNFRSF1A protein is a variant of the TNF protein and plays a role in inflammatory processes in the body. The TNFRSF1A protein is predicted to interact with safrole, estradiol, tetrahydroxyflavone, myristic acid, and rhamnetin compounds, with the highest probability of rhamnetin compounds being 0.916. Rhamnetin compounds are found in Syzygium aromaticum (cloves) and Averrhoa carambola (starfruit) plants. The direct impact of the TTR protein on COVID-19 is not yet known. According to the GeneCards website, the TTR protein is associated with respiratory failure, which may be one of the effects of COVID-19. According to research [51], AR protein affects the severity of COVID-19 in patients. AR protein regulates the transcription of the TMPRSS2 protein, which is the host for the SARS-CoV-2 spike protein. AR protein is predicted to interact with Epicatechin, Proanthocyanidin a2, and Momordicilin compounds. Epicatechin had the highest probability of interacting with AR protein of 0.82.
Next, herbal compounds were predicted using the SAE-DNN model with the PubChem fingerprint dataset. This model predicted 187 compounds to interact with COVID-19 proteins. Of the 187 compounds, 15 interact with two proteins, while 172 interact with only one protein. Details of the number of herbal compounds that interact with the protein set predicted by SAE-DNN with the PubChem fingerprint feature can be seen in Figure 9. The ten compound-protein interactions with the highest protein relevance score from the SAE-DNN prediction results with the PubChem fingerprint dataset can be seen in Table 6. All compound-protein interactions of the SAE-DNN prediction results with the PubChem fingerprint dataset can be seen in Supplementary Spreadsheet 6.
From the results in Table 6, the ten compounds with the highest protein relevance score interacted with the EGFR protein. Glucobrassicin has the highest probability of interacting with EGFR according to the prediction results of SAE-DNN with the PubChem fingerprint. This compound is found in eight herbal plants in Indonesia, one of the commonly used ones being Brassica oleracea (cabbage). Of the ten compound-protein interactions with the highest relevance score predicted by SAE-DNN with the PubChem fingerprint, P-cymene is the compound contained in the most herbal plants (23).
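The way single-protein and multi-protein counts such as those above follow from a multilabel model's outputs can be sketched as below. The 0.5 decision threshold, the function names, and the toy probabilities are assumptions for illustration only; the source does not state the threshold that was used.

```python
from collections import Counter

THRESHOLD = 0.5  # assumed decision threshold; not stated in the source


def predicted_proteins(probabilities, threshold=THRESHOLD):
    """Map each compound to the proteins whose probability meets the threshold."""
    return {
        compound: [p for p, prob in scores.items() if prob >= threshold]
        for compound, scores in probabilities.items()
    }


def count_by_label_cardinality(assignments):
    """Count compounds by how many proteins they are predicted to interact with.

    Compounds with no predicted protein are left out, since they are not
    counted as interacting compounds.
    """
    return Counter(len(proteins) for proteins in assignments.values() if proteins)


# Toy per-protein probabilities for three hypothetical compounds.
toy_probabilities = {
    "compound_a": {"EGFR": 0.91, "AR": 0.12, "PRKCA": 0.07},
    "compound_b": {"EGFR": 0.64, "AR": 0.58, "PRKCA": 0.22},
    "compound_c": {"EGFR": 0.31, "AR": 0.05, "PRKCA": 0.18},
}

assignments = predicted_proteins(toy_probabilities)
counts = count_by_label_cardinality(assignments)
print(assignments)
print(counts)
```

In this toy input, one compound is assigned one protein, one is assigned two, and one is assigned none and therefore does not count as an interacting compound.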
Three proteins were predicted by all three SAE-DNN models: EGFR, AR, and PRKCA, with relevance scores of 9.887, 2.428, and 1.003, respectively. The SAE-DNN models with the circular and PubChem fingerprints both predict that EGFR interacts with P-cymene, while the SAE-DNN model with the daylight fingerprint predicts that EGFR interacts with Hyperoside. For AR, the three models gave different predictions of the interacting compounds. As for PRKCA, the circular and daylight SAE-DNN models predict that this protein interacts with Catalpol and Sinigrin, while the PubChem SAE-DNN model predicts that it interacts with 31 different compounds.
Several compounds also emerged from the prediction results of all three SAE-DNN models, namely Hyperoside, Aloin, Garcimangosone d, Rhamnetin, Anisaldehyde, Laurotetanine, Momordin I, Isoquercetin, and Cycloeucalenone. Details of these compounds can be seen in Table 7. The prediction results of the SAE-DNN models using the daylight and circular fingerprints also share several similarities, which can be explained by both being topology-based fingerprints. The two differ in how the fingerprint array is built: daylight builds it from paths of a specified length through the molecule, while circular builds it from atom neighborhoods within a specified radius [33].
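The distinction between path-based and radius-based feature enumeration can be illustrated with a toy sketch. It is a deliberate simplification and not the Daylight or circular fingerprint algorithm as used in the study: it ignores bond orders, hydrogens, and ring handling, and it keeps the features as readable labels instead of hashing them into a fixed-length bit array. All function names are hypothetical.

```python
def build_adjacency(n_atoms, bonds):
    """Turn a list of (i, j) bonds into a neighbor list per atom index."""
    adjacency = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    return adjacency


def path_features(atoms, adjacency, max_bonds):
    """Path-based (Daylight-like): label sequences of simple paths up to max_bonds."""
    features = set()

    def extend(path):
        labels = tuple(atoms[i] for i in path)
        # A path and its reverse describe the same substructure.
        features.add(min(labels, labels[::-1]))
        if len(path) - 1 == max_bonds:
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return features


def circular_features(atoms, adjacency, radius):
    """Circular-like: grow each atom's identifier outward, one bond per round."""
    identifiers = list(atoms)
    features = set(identifiers)
    for _ in range(radius):
        updated = []
        for i in range(len(atoms)):
            neighbors = ",".join(sorted(identifiers[j] for j in adjacency[i]))
            updated.append(f"{identifiers[i]}({neighbors})")
        identifiers = updated
        features.update(identifiers)
    return features


# Heavy-atom graph of ethanol (SMILES "CCO"): C0-C1-O2.
atoms = ["C", "C", "O"]
adjacency = build_adjacency(len(atoms), [(0, 1), (1, 2)])

paths = path_features(atoms, adjacency, max_bonds=2)
circles = circular_features(atoms, adjacency, radius=1)
print(sorted(paths))
print(sorted(circles))
```

Both views are derived from the same molecular graph, which is consistent with the overlap in predictions noted above. In practice a cheminformatics toolkit such as RDKit would be used to compute the actual fingerprints.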
The compounds that emerged from the prediction results of all three SAE-DNN models were predicted to interact with COVID-19 proteins. This is supported by several literature studies showing that Hyperoside, Aloin, Rhamnetin, Laurotetanine, and Isoquercetin have anti-inflammatory properties [52][53][54][56][57][58], which can help counter the hyperinflammation process in COVID-19 patients. For Garcimangosone d, efficacy against COVID-19 or the inflammatory process in the body is not yet known. This compound is found in the Garcinia mangostana plant, which is commonly used to treat various diseases such as inflammation and fever in several Asian countries [55]. Usually, databases such as DrugBank and Stitch [59] (http://stitch.embl.de/ accessed on 27 November 2021) can be used to verify compound-protein interactions against their collections of known compounds and their known interactions with proteins. However, the herbal compounds predicted by the three SAE-DNN models in this study have not been found to interact with the predicted proteins in these databases, so these compound-protein interactions cannot yet be verified. This is due to the lack of experiments on herbal compounds. Further research is needed to verify the compound-protein interactions and determine the potential value of the herbal compounds from the SAE-DNN prediction results in this study.

Conclusions
The multilabel classification approach to search for potential compounds that interact with COVID-19 proteins using the SAE-DNN model has been successfully carried out. The results showed that the SAE-DNN model was able to predict drug-target interactions in COVID-19 cases with good performance and outperformed a DNN without pre-training. The results also show that using the circular fingerprint feature as the predictor variable of the model produces the best average metrics, with an accuracy of 0.831, recall of 0.918, precision of 0.888, and F-measure of 0.893.
The SAE-DNN model with the circular fingerprint dataset predicted 92 herbal compounds contained in herbal plants in Indonesia, while the SAE-DNN models with the daylight and PubChem fingerprint datasets predicted 65 and 79 such compounds, respectively. Hyperoside, Aloin, Rhamnetin, Laurotetanine, and Isoquercetin are predicted to interact with COVID-19 proteins by the SAE-DNN models with the circular, daylight, and PubChem datasets and are known to have anti-inflammatory properties according to several literature studies.