A Machine Learning Approach for Recommending Herbal Formulae with Enhanced Interpretability and Applicability

Herbal formulae (HFs) are representative interventions in Korean medicine (KM) for the prevention and treatment of various diseases. Here, we proposed a machine learning-based approach for HF recommendation with enhanced interpretability and applicability. A dataset consisting of clinical symptoms, Sasang constitution (SC) types, and prescribed HFs was derived from a multicenter study. Case studies published over 10 years were collected and curated by experts. Various classifiers, oversampling methods, and data imputation techniques were comprehensively considered. The local interpretable model-agnostic explanation (LIME) technique was applied to identify the clinical symptoms that led to the recommendation of specific HFs. We found that the cascaded deep forest (CDF) model with data imputation and oversampling yielded the best performance on the training set and holdout test set. Our model also achieved top-1 and top-3 accuracies of 0.35 and 0.89, respectively, on case study datasets in which clinical symptoms were only partially recorded. We performed an expert evaluation on the reliability of interpretation results using case studies and achieved a score close to normal. Taken together, our model will contribute to the modernization of KM and the identification of an HF selection process through the development of a practically useful HF recommendation model.


Introduction
An herbal formula (HF) is a representative intervention in Korean medicine (KM) [1]. KM doctors select and administer HFs consisting of medical herbs to patients according to their clinical diagnoses. HFs play a crucial role in the prevention and treatment of various diseases. For instance, a nationwide multicenter study suggested that Chunggan granules, a type of HF, effectively protect against liver fibrosis in patients with chronic liver diseases [2]. Additionally, a double-blind clinical trial indicated that Gyejigachulbutang (Gui-Zhi-Jia-Shu-Fu-Tang for China and Keishikajutsubuto for Japan) might be effective for non-obese patients with degenerative knee osteoarthritis [3]. During the COVID-19 pandemic, a systematic review and meta-analysis showed that HFs may provide potential benefits to patients suffering from COVID-19 [4].
Despite recent technological advances, the core process of diagnosing and prescribing HFs still mainly depends on the experience of individual KM doctors. Generally, KM doctors comprehensively identify the pathophysiological patterns (also called "zheng") of patients based on given clusters of symptoms and signs through observation, listening, asking, and cutting [5]. By comprehensively considering the identified patterns, the KM doctors administer HFs to ameliorate the major complaints of the patients. Although this procedure provides a comprehensive treatment strategy for various diseases, its usefulness is limited by the high variability among experts and its reliance on implicit knowledge. The limitations of this diagnostic process pose a formidable challenge to the standardization and modernization of KM. Therefore, it is necessary to develop an interpretable and practically useful model that can explicitly understand the HF recommendation process.
The conventional approaches used for HF recommendation systems mainly rely on expert knowledge. The majority of these conventional approaches are clinical practice guidelines that collect diagnostic criteria and determine administrable HFs based on expert consensus or questionnaires that can classify pathological patterns or formula indications [6][7][8]. These approaches can act as indirect guides in clinical practice, but they cannot directly assist in the process of HF selection for individual patients. Researchers have thus attempted to develop data-driven models based on machine learning approaches. Seq2seq models, probabilistic graphic models, and multistep analysis methods are representative machine learning models that have been employed to develop HF recommendation systems [9][10][11]. In particular, Zhou et al. developed an HF recommendation model by combining phenotype and molecular information based on a convolutional neural network and a network embedding model [12]. Although these approaches can be useful for recommending prescriptions to patients, the applicability and/or interpretability of their predictive results remains an unresolved issue.
Interpretable machine learning methods have been proposed and have been applied to various fields, including the biomedical field. The local interpretable model-agnostic explanation (LIME) technique proposed by Ribeiro et al. is a representative model-agnostic interpretation method [13]. The characteristic of LIME is that by observing how the output is affected by a perturbed input, it can be applied to any supervised learning model to provide an interpretation of the given instance. Furthermore, incomplete data are an unavoidable problem and one that should be addressed during the data preprocessing stage. Missing values in the input dataset can cause the performance of the utilized classification algorithm to decrease, so data imputation algorithms have been proposed to solve this problem. Data imputation can also be applied to datasets containing nan values, increasing the applicability of trained models. Therefore, combining data imputation and LIME would be a useful approach for increasing the applicability and interpretability of prediction results.
In this study, we developed an HF recommendation model based on the clinical symptoms of patients ( Figure 1). To improve its applicability and interpretability, we conducted data imputation on the training dataset using missForest and derived major features for the prediction results using LIME. Our model was developed using a multicenter study dataset collected by the Korean Institute of Oriental Medicine. We tried to construct a recommendation model with the best performance by comparing various classifiers, applying data imputation, and performing oversampling. The generalization ability of the developed model for predicting HFs was evaluated on the holdout test set and case study dataset. We also employed LIME to investigate the clinical symptoms that contributed to the predictions of administered HFs and evaluated their reliability and consistency. We believe that our model provides understandable recommendations for KM doctors who need decision making assistance and researchers who wish to study the HF selection process.

Data Collection and Selection Procedure
A dataset was obtained from the Korean Medicine Data Center (KDC) of the Korea Institute of Oriental Medicine. They conducted a clinical observational study between 2013 and 2015 with 10 Korean medicine clinics. One of the characteristics of the dataset is that it contains Sasang constitutional (SC) types diagnosed by licensed SCM specialists. Briefly, SC type is a typological concept used in SCM. In SCM, people are classified into four SC types, which are referred as Soeumin (SE), Soyangin (SY), Taeeumin (TE), and Taeyangin (TY) [14]. By using the diagnostic information of a patient's SC type, an SCM expert can identify the risk factors for certain SC type-specific symptoms and provide a patient with tailored treatments. In this dataset, the diagnosis of the Sasang type was conducted by licensed SCM specialists who had been in clinical practice for at least 5 years. The SCM specialists diagnosed the Sasang type by carefully considering the physical body shapes, appearances, temperaments, and pathological symptoms of patients based on the detailed SC type determination processes that have been previously described [15]. All

Data Collection and Selection Procedure
A dataset was obtained from the Korean Medicine Data Center (KDC) of the Korea Institute of Oriental Medicine. They conducted a clinical observational study between 2013 and 2015 with 10 Korean medicine clinics. One of the characteristics of the dataset is that it contains Sasang constitutional (SC) types diagnosed by licensed SCM specialists. Briefly, SC type is a typological concept used in SCM. In SCM, people are classified into four SC types, which are referred as Soeumin (SE), Soyangin (SY), Taeeumin (TE), and Taeyangin (TY) [14]. By using the diagnostic information of a patient's SC type, an SCM expert can identify the risk factors for certain SC type-specific symptoms and provide a patient with tailored treatments. In this dataset, the diagnosis of the Sasang type was conducted by licensed SCM specialists who had been in clinical practice for at least 5 years. The SCM specialists diagnosed the Sasang type by carefully considering the physical body shapes, appearances, temperaments, and pathological symptoms of patients based on the detailed SC type determination processes that have been previously described [15]. All processes were approved by the Korea Institute of Oriental Medicine (I-1210/002-002-03), and written informed consent for participation was obtained from each subject.
The dataset initially included 241 measured clinical symptoms (features) and the resulting prescriptions (labels) for 1148 patients. We found that data selection was needed because some features were not closely related to clinical practice and some samples included sparsely appearing HFs. To this end, we selected the clinical symptoms and SC type, which were recorded as indicator symptoms in the clinical guidelines, as input characteristics. We also selected patients who received HFs that appeared at least 15 times in the dataset. For convenience, we refer to the constructed dataset as the KDC dataset.

Classifier Models
Supervised learning was conducted using various classifier models on the KDC dataset. Six classifiers were considered as potential classifiers for recommending HFs: logistic regression (logit), a decision tree (DT), a multilayer perceptron (MLP), a support vector machine (SVM), a random forest (RF), and a cascaded deep forest (CDF). These models showed sufficient performance in the recommendation system, and the CDF model was additionally considered as it yielded excellent performance in classification tasks including image classification. Logit is a type of linear regression method that adds a sigmoid function layer to the result of a linear mapping function. A DT uses a tree-like graph or models to make decisions and examine their possible outcomes. An MLP is a supervised learning algorithm that can learn nonlinear models. An SVM classifies outputs by finding the hyperplane that maximizes the margins between classes. An RF is an ensemble model that combines the probabilistic predictions of a number of DT-based classifiers to improve its generalization capability over that of a single estimator. A CDF employs a cascade structure, where the model consists of a multilayered architecture, and each level consists of an RF and extra trees [16]. The greedy search method was used to select the optimal hyperparameters of the classifiers within predetermined search ranges (Supplementary  Table S1). Other hyperparameter not included in the search range were set to default values. CDF model and other classifier models was trained using DF21 module and scikit-learn module, respectively [16,17].

Data Oversampling
An imbalanced situation can be defined as a case in which the number of instances of the majority class is much higher than the number of minority class instances, and this problem can often be seen in multiclass classification. Having balanced data is important since machine learning algorithms may be biased toward the majority class. Oversampling is an efficient way to address this potential issue. One widely employed method for generating synthetic instances is the synthetic minority oversampling technique for nominal and continuous cases (SMOTE-NC) [18,19]. It is a variant of SMOTE and is suitable for use with datasets containing both continuous and categorical variables. SMOTE-NC uses the characteristics of the k-nearest neighbors classifier in the given feature space to generate synthetic data for the minority classes. It can effectively handle both numerical and categorical features that appear in the dataset. SMOTE-NC is conducted via the following procedure. (1) Take the training set obtained from the data splitting stage; (2) calculate the number of samples needed for resampling based on the number of samples of the majority class; (3) generate synthetic samples for the three minority classes (no damage, penetration, and scabbing) using SMOTE-NC; and (4) output the rebalanced dataset for training.

Data Imputation
MissForest is an RF-based data imputation method that can yield increased performance in classification problems relative to that of other methods when applied to a dataset containing both continuous and categorical variables [20]. Let the data matrix X = X 1 , X 2 , . . . , X p be an n-by-p matrix containing missing values, where n and p represent the numbers of subjects and features, respectively. Depending on the missing mis is predicted and replaced; this also based on the RF model. These processes are repeated until a stopping criterion is met or the maximum number of iterations is reached.

Case Study Selection and Data Processing
The selection of the case study, data curation approach, and data processing strategy were carried out according to the following procedure. We reviewed manuscripts published over 10 years (from July 2013 to June 2022) in the official journal of SCM (Journal of Sasang Constitutional Medicine) and initially collected 41 case studies and their full texts. The identifier of the case study was determined by the year-volume-issue-number of the published paper (i.e., 2018-30-4-06). The first author (W.Y. Lee) and independent curator carefully read each full manuscript and extracted the text concerning clinical symptoms and administered HFs into a structured table. The two authors (W.Y. Lee and J.H. Kim) read the extracted texts and discussed what values the clinical symptoms belonged to. Conflicts between the two authors over the assignment of values were resolved through discussions with other authors. During data curation, the authors found that some of the manuscripts in the case study did not fully describe the features defined in the KDC dataset and inevitably contained nan values (Supplementary Table S2). To conduct a performance evaluation, we selected 23 case studies in which more than half of the clinical symptoms were recorded and imputed the remaining clinical symptoms by using the missForest model trained on the KDC dataset.

LIME
LIME is essentially a model-agnostic interpretable framework that is utilized to explain the independent instance predictions of 'black box' machine learning models [13]. LIME conducts tests on what would happen to the predictions of a model when the user provides altered versions of their data instances to the model. LIME modifies a single data sample by tweaking its feature values and observes the resulting impact on the output. Based on this principle, LIME engenders a novel dataset comprising permuted samples and their corresponding predictions from the black box model. LIME was implemented in this study using the Python lime module (https://lime-ml.readthedocs.io/en/latest/, accessed on 3 March 2022).

Performance Evaluation
The performance achieved on the training set was evaluated using k-fold stratified cross-validation (CV). K-fold CV was conducted by dividing the training dataset into 5 folds, evaluating the performance on each fold individually, and utilizing the remaining folds for the model training. Stratified CV steps were used to adjust each CV fold to have the same proportion of administered HFs. For each fold of the predictive model, the following metrics were calculated: where TP denotes true positives, FP represents false positives, FN denotes false negatives, and TN represents true negatives. To reduce the model variance, the performances of the classifier model were measured by conducting k-fold stratified CV several times. On the training dataset, the performance of the classifiers was evaluated under a condition where 5-fold CV was repeatedly conducted 5 times under different data splits. On the unseen dataset, the performance of the classifier was measured 10 times.

Dataset Construction and Description
We obtained a dataset for developing a prescription recommendation model from the Korean Medicine Data Center of the Korea Institute of Oriental Medicine, hereafter referred to as the KDC dataset. By applying the selection criteria defined in the Methods section, we obtained a dataset for 973 patients with 54 clinical symptoms, SC types (features) and administered herbal prescriptions (labels). The included features for recommending HFs consisted of information on thirst, stool, urine, temperature and sweat, which are known clinical diagnostic indicators that are primarily relevant to prescription selection in KM. The name, type, description, scale, and distribution for the included features can be found in Supplementary Tables S3 and S4. We additionally considered the SC type as an input function, as this factor is known as useful information that can identify risk factors and provide customized treatments. The dataset also contained a total of 15 types of HFs, including 7 to 12 herbal medicines (Table 1). Table 1. Names of the herbal formulas included in the recommendation system and their compositions.

Formula Name (Abbreviation) Composition (per Serving)
We carefully examined the dataset and found two potential problems. First, we found significant deviations between the administered HM frequencies. For example, the top five prescriptions accounted for more than half of the total prescriptions (574/973). This type of imbalance can lead to shortened learning, where the model is trained with a bias toward labels that appear frequently in the training dataset. Second, our dataset contained samples for which the clinical symptoms were not fully recorded. A classifier trained with a model using only fully recorded features may have suboptimal performance. To address these potential issues, we tried to develop an optimal model that recommends HM through the following processes ( Figure 1). First, we explored various classifiers and their hyperparameters using a fully valid dataset and selected the combination of classifiers and hyperparameters with the best predictive performance. Then, we evaluated whether the prediction performance was improved through the dataset obtained using oversampling and data imputation.

Hyperparameter Selection for the Classifiers
We first tried to select the best hyperparameters for each classifier model based on the KDC dataset. We first obtained features and their labels by selecting information from the 709 patients whose features were all validly recorded in the constructed dataset. The following classifiers were considered as potential models for predicting administered HFs: logit, DT, SVM, RF, MLP, and CDF models. A greedy search algorithm was used to find the optimized hyperparameters by comparing the performance of the classifier models in a predefined search range. We selected the optimized hyperparameters of the classifier model with the highest performance ( Table 2). The result showed that the SVM, CDF, and MLP models achieved superior performance in terms of the top-1 accuracy, top-3 accuracy, and F1 score, respectively. Specifically, all classifiers achieved MCC values higher than zero, which indicates that a significant association is present between clinical symptoms and administered HMs.

Impact of Oversampling and Data Imputation
The above experiment did not adjust for data imbalance and excluded all samples containing nan values. This indicates that the prediction performance could be further improved by converting the dataset into a balanced dataset with imputed nan values. To this end, we first evaluated whether oversampling could improve the performance of the classifier models. We measured the prediction performance by conducting oversampling on the training set during the CV experiments (Table 2). Oversampling was performed by synthesizing samples belonging to the minority classes using random sampling and SMOTE-NC until the numbers of majority class and minority class samples matched. We found that using random sampling significantly lowered the top-3 accuracies, Macro-F1 and Macro-MCC values, indicating that random sampling may introduce overfitting issues. On the other hand, we found a significant improvement in the F1 scores and MCC values of all classifier models when utilizing SMOTE-NC. In particular, the CDF model improved its F1 score and MCC value substantially and achieved the highest top-1 accuracy, top-3 accuracy, and macro precision values. To understand the effect of oversampling, we further measure the predictive performance for each HM of the CDF model across oversampling methods. Without oversampling, we found that the accuracies and F1 scores of CDF models were highly biased by prescriptions (Figure 2 and Supplementary Table S5). This result indicates that accurate prediction can be performed only for a specific prescription. Introducing random sampling also did not improve or worsen the bias of performance. On the other hand, we found that the performance bias across HFs was significantly reduced when using SMOTE-NC. Specifically, applying SMOTE-NC achieved the best accuracy and F1 score for 80% of HFs (12/15). These results indicate that SMOTE-NC could be an efficient way to improve the performance of predicting various HFs.
We further evaluated whether data imputation could further improve performance. We expected that the performance of the recommendation model could be further improved if a sample with a high ratio of valid features acquired through imputation could be used for classifier training. We measured the prediction performance using a synthesized dataset by applying data imputation and oversampling to the training dataset. To find the optimal criteria for imputation, we selected the samples to be imputed and compared their performance while sequentially increasing the threshold of the nan ratio from 0 to 0.5. We found that the performance of classifier models progressively improved up to a nan ratio of 0.35 (Figure 3). In particular, the CDF model achieved a superior F1 score with the highest top-1 accuracy and top-3 accuracy at a threshold of 0.35. However, after this threshold, we found that the prediction performance decreased, especially for the F1 score, even if the nan ratio was higher. This result indicates that the imputation of appropriate samples and their utilization for subsequent supervised learning can contribute to constructing an optimized prediction model. achieved the best accuracy and F1 score for 80% of HFs (12/15). These results indicate that SMOTE-NC could be an efficient way to improve the performance of predicting various HFs. We further evaluated whether data imputation could further improve performance. We expected that the performance of the recommendation model could be further improved if a sample with a high ratio of valid features acquired through imputation could be used for classifier training. We measured the prediction performance using a synthesized dataset by applying data imputation and oversampling to the training dataset. To find the optimal criteria for imputation, we selected the samples to be imputed and compared their performance while sequentially increasing the threshold of the nan ratio from 0 to 0.5. We found that the performance of classifier models progressively improved up to a nan ratio of 0.35 (Figure 3). In particular, the CDF model achieved a superior F1 score with the highest top-1 accuracy and top-3 accuracy at a threshold of 0.35. However, after this threshold, we found that the prediction performance decreased, especially for the F1 score, even if the nan ratio was higher. This result indicates that the imputation of appropriate samples and their utilization for subsequent supervised learning can contribute to constructing an optimized prediction model. achieved the best accuracy and F1 score for 80% of HFs (12/15). These results indicate that SMOTE-NC could be an efficient way to improve the performance of predicting various HFs. We further evaluated whether data imputation could further improve performance. We expected that the performance of the recommendation model could be further improved if a sample with a high ratio of valid features acquired through imputation could be used for classifier training. We measured the prediction performance using a synthesized dataset by applying data imputation and oversampling to the training dataset. To find the optimal criteria for imputation, we selected the samples to be imputed and compared their performance while sequentially increasing the threshold of the nan ratio from 0 to 0.5. We found that the performance of classifier models progressively improved up to a nan ratio of 0.35 (Figure 3). In particular, the CDF model achieved a superior F1 score with the highest top-1 accuracy and top-3 accuracy at a threshold of 0.35. However, after this threshold, we found that the prediction performance decreased, especially for the F1 score, even if the nan ratio was higher. This result indicates that the imputation of appropriate samples and their utilization for subsequent supervised learning can contribute to constructing an optimized prediction model.

Evaluating the Model Generalization Abilities on Unseen Datasets
To test the generalization abilities of the models, the performance of the classifier models was further evaluated on unseen datasets. We first attempted to evaluate the performance of the trained classifier models on the holdout test set of the KDC dataset. We started by constructing a balanced dataset using data imputation and oversampling on the training set of the KDC dataset. We trained the classifier models on this dataset and predict the labels (administered HFs) using the features of the holdout test set. The results showed that the trained models achieved superior performance even on unseen datasets (Table 3). Specifically, we found that the CDF model still had the highest top-3 accuracy and high F1 scores. This result indicates that the predictions of our model did not have a significant overfitting issue. To evaluate the practical applicability of our model, we constructed a case study dataset and used it for evaluation. We expected that if our model correctly estimated the actual administered HFs using recorded clinical symptoms, then it could be usefully used in the real world as well. Through the selection criteria and data processing strategy defined in the Methods section, we obtained the clinical symptoms and administered HFs for 23 published case studies. We predicted the HFs of the case study using a classifier model trained on the dataset obtained by applying data imputation and oversampling to the entire KDC dataset. We found that our classifier model performed much better than it did on the training set (Table 3). In particular, we found that the CDF model still had the highest top-3 accuracy and F1 score. It is noteworthy that the proportion of valid features in the case study was only 0.5-0.68, indicating that the classifier model accurately predicted HFs even with data in which only partially recorded features were recorded.

Model Interpretation Using Case Study Data
Understanding what a trained classifier learns in a data-driven manner may provide valuable clinical insights. Therefore, we applied an explainability technique called LIME to highlight which clinical symptoms drove particular HFs. The CDF model was selected as the classifier model to which LIME was applied, as it achieved superior performance across various experiments. We selected a case study (case id: 2018-30-4-06) and applied LIME as a representative example [21]. This case study concerned a patient who visited the KM hospital with an essential tremor after undergoing a total vaginal hysterectomy. After considering the clinical symptoms, the authors of the case study administered HBJHT, which was consistent with our model's predictions. We computed the weights of the features that contributed to the prediction using LIME and represented the top 10 features and their descriptions ( Table 4). The weights of the features derived from LIME can be understood to the extent that the features affected the HF selection outcome. For example, the increase in the feature values of Stool_form, Fatigue and Stool_tenesmus would have contributed more positively to the HBJHT administration probability provided by the recommender system. On the other hand, the decreases in Stool_frequency and stress would have contributed more positively to the probabilities of the administered HFs. These results indicate that the application of LIME aids in producing a more reliable recommendation system by providing clinicians with criteria for evaluating prescription choices. We tried to evaluate the interpretation results by expert's manual audit. The remaining 22 instances included in the case study dataset were considered as data for model evaluation. We applied LIME to each instance consisting of the patient's clinical symptoms and administered HFs, and identified the top-weighted features and their descriptions in a form similar to the representative example. We then invited three SCM experts to rate whether these features were appropriate for choosing a HFs on a scale of 1 (most inappropriate) to 5 (most appropriate). The results showed that the average score of evaluation was close to normal (2.94) ( Figure 4A). JWSCT had the highest average score with 3.7 points, followed by CSYJT (3.4 points) and HBDJS (3.3 points). Then, we evaluate the consistency of the interpretation results. We expected that if the overlap between the features with high weights in the same prescription was higher than that between other features, it would support the consistency of our model. We calculated the weight of each clinical symptom for the top predicted HF from each case study dataset using LIME. Then, we calculated the Jaccard similarities between the top 10 features of the same prescription (intra-similarity) and different prescriptions (inter-similarity). The results showed that the intrasimilarity was significantly higher than the intersimilarity (p < 0.001, Figure 4B), indicating that the LIME model interprets with consistent criteria.

Discussion
KM has been a practical science for thousands of years, and KM doctors have long been prescribing HFs based on the clinical symptoms of patients. Experiences on the efficacy of using HFs and herbal medicines have been accumulated, but they are mainly im-

Discussion
KM has been a practical science for thousands of years, and KM doctors have long been prescribing HFs based on the clinical symptoms of patients. Experiences on the efficacy of using HFs and herbal medicines have been accumulated, but they are mainly implicit and cannot be quantitatively measured. In this study, we proposed an HF recommendation model with enhanced applicability and interpretability. To our knowledge, this is the first approach that can explicitly identify the HF recommendation process. In addition, our model enhanced its practicality and usefulness by using only the clinical information defined in the clinical treatment guidelines as input data for HF recommendation. We also proposed a comprehensive strategy that combines various classifiers, data imputation, and oversampling to develop models with optimal predictive performance. A comprehensive evaluation demonstrated that our model accurately predicts HFs, even on the holdout dataset and a partially recorded case study dataset. The applied LIME model was able to identify the clinical symptoms that were mainly considered in the HF recommendation process.
We proposed an HF recommendation model using a comparison between classifiers, a data imputation strategy, and oversampling techniques. Among the classifier models, we found that the RF and CDF model achieved the highest performance in unseen datasets. Random forest requires less data preparation and preprocessing than neural networks or SVMs and is known to exhibit superior performance on various datasets. Because of this usefulness, random forests have been successfully applied in recommendation systems in the biomedical field [22,23]. The CDF model, a random forest-based ensemble model, automatically determined the appropriate complexity in a data-dependent way with relatively few parameters and achieved excellent performance across various domains. Additionally, we confirmed that data imputation and oversampling can improve the performance of the recommendation system. These results show that an appropriate data preprocessing approach is an efficient way to solve incomplete and imbalanced data issues, which are unavoidable for most multiclassification problems in the biomedical field. Finally, the robust performance of our recommendation system was evaluated on various holdout test sets and a case study dataset. In particular, the superior performance achieved on the case study dataset with partially recorded features supported the practical applicability of the developed system. Taken together, we suggest that combining various classifiers, data imputation methods, and oversampling techniques can be an efficient strategy for developing machine learning models with optimal performance and high applicability.
This study exhibits several limitations with the potential for further improvement. First, our recommendation model is limited to 15 types of HF recommendations. If a dataset consisting of various HFs is secured, our model can be extended to recommend various HFs. Second, the prediction performance can be further improved by applying advanced algorithms, such as graph convolutional networks (GCNs), which have been recently reported to achieve state-of-the-art performance [12]. In addition, expert evaluation results suggested that there is room for improvement in the interpretation results of our model. Finally, although our model only requires information used in the clinical setting, but it requires the additional process of recording clinical information in a predefined format to recommend HFs. An interesting future research topic is to improve the explainability of models by combining state-of-the-art models such as DLP or to develop models based on natural language process models using directly progress notes recorded in clinical practice as input data [24]. Despite these limitations, we are the first to propose an interpretable HF recommendation model, which will contribute to the development of explicit knowledge for recommending HFs and the modernization of KM.

Informed Consent Statement:
The written informed consent for participation was obtained from each of the subjects.

Data Availability Statement:
The data supporting the findings of this study are available from the Korea medicine Data Center (KDC). Data can be made available from the authors upon reasonable request and with permission of the KDC and the Institutional Data Access/Ethics Committee of Korea Institute of Oriental Medicine. Requests for access to these data can be made at the KDC website (kdc.kiom.re.kr, accessed on 3 March 2022).

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.