Machine Learning Techniques for Predicting Drug-Related Side Effects: A Scoping Review

Background: Drug safety relies on advanced methods for timely and accurate prediction of side effects. To tackle this requirement, this scoping review examines machine-learning approaches for predicting drug-related side effects with a particular focus on chemical, biological, and phenotypical features. Methods: This was a scoping review in which a comprehensive search was conducted in various databases from 1 January 2013 to 31 December 2023. Results: The results showed the widespread use of Random Forest, k-nearest neighbor, and support vector machine algorithms. Ensemble methods, particularly random forest, emphasized the significance of integrating chemical and biological features in predicting drug-related side effects. Conclusions: This review article emphasized the significance of considering a variety of features, datasets, and machine learning algorithms for predicting drug-related side effects. Ensemble methods and Random Forest showed the best performance and combining chemical and biological features improved prediction. The results suggested that machine learning techniques have some potential to improve drug development and trials. Future work should focus on specific feature types, selection techniques, and graph-based methods for even better prediction.


Introduction
Drug-related side effects include undesirable, unpleasant, unexpected, and adverse hazardous reactions in organs and tissues [1]. Some market-approved drugs may cause unacceptable side effects, endangering human health and raising concerns among pharmaceutical companies [2]. Ensuring drug efficacy is crucial since unfavorable drug responses are the main cause of drug failure, often leading to side effects and drug withdrawal [2,3]. However, the traditional method of identifying side effects through solid clinical trials is time-consuming and expensive, making it unsuitable for large-scale tests [4,5]. As a result, there is a critical need to develop rapid and cost-effective methods for predicting drug-related side effects [6,7].
The ability to predict drug-related side effects presents itself as an indispensable facet of contemporary pharmaceutical research and development [8]. By enabling the early and accurate identification of potential side effects, such methodologies have the potential to revolutionize the drug development landscape, which can lead to significant time and resource efficiencies [9]. This transformative capacity facilitates the prioritization of drug candidates with favorable safety profiles while concurrently enabling the exclusion of those exhibiting a high propensity to induce adverse events [6,8]. Ultimately, the development of robust drug side effect prediction methodologies paves the way for the introduction of safer and more efficacious medications, thereby fostering improved patient outcomes and propelling advancements in personalized medicine [7,10,11].
The development of advanced computational algorithms provides strong technical support for addressing a wide range of medical challenges [12]. Specifically, numerous computational methods have been developed for predicting drug-related side effects, with a strong emphasis on machine learning-based approaches [13]. These methods delve into current information on drug-related side effects to create patterns that allow for the prediction of side effects for various drugs [1,13,14].
Recently, machine learning techniques emerged as the leading computational approaches for predicting drug-related side effects, leveraging previous experiences with similar drugs to learn and develop predictive models [1]. Existing machine learning-based approaches have rigorously examined hundreds of side effects and the probability of their occurrence [13,14]. This critical role of machine learning in side effect prediction entails developing models that predict outcomes based on the available data [1,15]. Machine learning techniques use drug properties and well-labeled side effects to predict drug-related side effects and build models for targeted predictions [16]. Integrating chemical, biological, and phenotypic features is critical in effectively predicting drug-related side effects, as diverse information and features from many sources contribute to the total understanding [17].
Researchers such as Pauwels et al. [18], Mizutani et al. [19], and Liu et al. [17] have contributed to the field by building drug-related side effect prediction models using various machine-learning techniques and incorporating different drug properties. Their findings highlight the importance of combining chemical, biological, and phenotypic data to make comprehensive drug-related side effect predictions [17–19]. Chemical features, such as molecular structure and composition, provide insights into a drug's nature, while biological features explore interactions with cellular components [20,21]. Phenotypic features capture a drug's effects on organisms, covering both therapeutic benefits and adverse reactions [22]. Integrating these features offers a holistic understanding of drug mechanisms and outcomes. Through machine learning analysis of these integrated features, robust predictive models can be developed, facilitating the early identification and mitigation of drug-related side effects [23,24]. These models empower researchers to optimize drug efficacy and safety profiles, ultimately leading to safer medications, improved patient outcomes, and advancing personalized medicine and pharmaceutical innovation [25].
Although several reviews have examined computational methodologies for predicting drug-related side effects, there are still significant gaps [1,13,14]. Das and Mazumder's review of supervised machine-learning techniques looked at drug descriptors, commonly used drug property sources, and computational models, but they did not report or compare the performance of individual machine-learning algorithms [1]. Moreover, their focus did not encompass drug-related features. A separate review focused extensively on using computational techniques to predict drug-related side effects without comparing or comprehensively focusing on machine learning approaches [13]. Another review study examined three data sources, namely omics data, social network data, and electronic medical records, to predict adverse drug effects [14]. To our knowledge, none of these studies specifically focused on predicting drug-related side effects using drugs' chemical, phenotypic, or biological features and machine learning techniques. Therefore, the aim of the current study was to review studies in which machine-learning techniques were used to predict drug-related side effects based on chemical, biological, or phenotypic features.

Materials and Methods
This scoping review was conducted according to Arksey and O'Malley's framework in 2023 [26]. Before conducting the research, ethics approval was obtained from the ethics committee of Iran University of Medical Sciences (IR.IUMS.REC.1401.1007).

Stage 1: Identifying Research Questions
A comprehensive understanding of machine learning techniques is essential to predict drug-related side effects based on chemical, biological, or phenotypic features for improving personalized medicine and safe medication prescriptions. Therefore, the research questions were as follows:

• What were the machine learning techniques used for predicting drug-related side effects?
• What were the main features used for predicting drug-related side effects?

Stage 2: Identifying Relevant Studies
The related articles were searched in different databases, including Web of Science, PubMed, Ovid, Scopus, ProQuest, IEEE Xplore, and the Cochrane Library. The search strategy included three main concepts: namely, "drug-related side effect", "machine learning", and "prediction". The MeSH terms, synonyms, and other related keywords were also included in the search strategies. To identify the relevant papers, the search strategies were applied in three fields: title, abstract, and keywords of the articles (Supplementary Table S1). Articles were searched from 1 January 2013 to 31 December 2023. The citations and reference lists of the retrieved papers were also checked to ensure that all relevant studies were included.

Stage 3: Study Selection
In this study, original research papers published in English between 2013 and 2023 with a focus on predicting drug-related side effects using chemical, biological, or phenotypical features were included. Papers published in languages other than English, papers whose full texts were not accessible, review papers, letters to the editor, and papers that did not primarily focus on machine learning techniques were excluded.
The retrieved papers were entered into the EndNote software version 19, and after removing duplicates, the remaining articles were assessed in terms of the title and abstract relevancy to the study objective. After removing the irrelevant articles, the full texts of the remaining ones were examined by two authors (E.T. and H.A.) separately, and any disagreements were resolved by the third author (A.F.S.).

Stage 4: Charting the Data
We used a data extraction form to collect the required data. This form contained the author's name, publication year, country, study objective, selected features and data sources, algorithms, evaluation metrics, and main results. In this study, conducting a meta-analysis was not feasible due to the inherent heterogeneity of the study designs and methodologies. As a result, the findings were organized and reported narratively. For each evaluation metric, namely precision, accuracy, recall, F1 score, area under the curve (AUC), and area under the precision-recall curve (AUPR), the average of the reported values was calculated and reported.
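The averaging step described above can be sketched as follows. This is a minimal illustration that assumes each study reports only a subset of the metrics and that a metric is averaged over the studies reporting it; the study labels and values are hypothetical placeholders, not data extracted from the included papers.

```python
from statistics import mean

# Hypothetical per-study metric values (placeholders, not extracted data).
reported = {
    "study_a": {"auc": 0.90, "precision": 0.80},
    "study_b": {"auc": 0.80, "recall": 0.70},
    "study_c": {"auc": 0.85, "precision": 0.60, "recall": 0.90},
}

def average_metrics(studies):
    """Average each metric over the studies that actually report it."""
    collected = {}
    for metrics in studies.values():
        for name, value in metrics.items():
            collected.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in collected.items()}

averages = average_metrics(reported)
```

Averaging only over reporting studies avoids treating a missing metric as zero, but it also means that averages for different metrics may rest on different subsets of studies.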

Results
In total, 1698 papers were retrieved from databases. After removing duplicates (n = 809), the remaining papers (n = 889) were examined in terms of their titles and abstracts, and irrelevant papers were excluded (n = 827). Among the remaining papers (n = 62), the full texts of three papers were not retrieved. As a result, the full texts of 59 papers were reviewed. Finally, 22 papers were selected to be included in the study. A total of 37 papers were removed because they either were not related to machine learning algorithms or did not include the expected features. The process of selecting the articles is illustrated in Figure 1.
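The selection counts above can be checked with simple arithmetic. The snippet below is only a consistency check of the numbers reported in the text; the variable names are illustrative.

```python
# Counts as reported in the text.
retrieved = 1698
duplicates = 809
excluded_on_title_abstract = 827
full_text_not_retrieved = 3
excluded_on_full_text = 37

screened = retrieved - duplicates                 # papers screened by title/abstract
sought = screened - excluded_on_title_abstract    # papers sought for full-text review
assessed = sought - full_text_not_retrieved       # full texts actually reviewed
included = assessed - excluded_on_full_text       # papers included in the review
```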

Selected Features and Data Sources
The study findings revealed that the selected features across various studies could be classified into four main categories, including general, chemical, biological, and phenotypical features. Different models employed one or more of these categories in predicting drug-related side effects. Furthermore, the data sources utilized for feature extraction displayed a degree of variability. DrugBank, Liu's dataset, and SIDER 4 were consistently employed for extracting features across all categories. Bio2RDF v2 was utilized for all categories except for the general category, and Mizutani's dataset was utilized across all categories except for the phenotypical category. The subsequent sections describe the features and data sources encompassed within each category.

Principal Findings
This scoping review investigated the use of machine learning techniques for the prediction of drug-related side effects. Based on the findings, general features were mainly extracted from SIDER, Pauwels' dataset, Mizutani's dataset, Liu's dataset, and DrugBank. Chemical features were predominantly obtained from PubChem, Molecular Operating Environment, and DrugBank using fingerprint analysis software. DrugBank, Liu's dataset, and Pauwels' dataset were used to provide biological features, and SIDER 4, Liu's dataset, SIDER, DrugBank, and Bio2RDF v2 provided therapeutic indications and phenotypes.
According to the current review findings, when chemical and biological features were combined, the prediction outcomes were impressive. Moreover, ensemble methods showed the best results in terms of precision and AUPR metrics. SVM exhibited superior performance in accuracy and recall measures, and decision trees excelled in F1 score metrics. In addition, clustering methods demonstrated proficiency in AUC assessment.
The results showed that careful selection of features from relevant databases or datasets is crucial in predicting drug-related side effects. In the present study, features were classified into four primary groups. This classification scheme is aligned with the findings reported by Das and Mazumder [1]. Likewise, the review conducted by Sachdev and Gupta on computational techniques for identifying drug-related side effects introduced some features and datasets [13]; however, the focus was not primarily on machine learning techniques, resulting in a limited range of features compared to the current study.
Various studies highlighted the importance of specific features in predicting drug-related side effects, such as chemical fingerprints from SMILES strings and target protein associations from DrugBank, indicating the necessity for a combination of chemical and biological data for accurate predictions. However, biases exist within data sources like SIDER, which may skew towards common side effects [50], and limitations in PubChem exclude information on biologic drugs, urging integration with databases capturing biologic complexities [51]. Feature engineering techniques, like fingerprint generation algorithms and text-mining, aid in translating raw data into interpretable formats [52], while network-based approaches offer promise in modeling complex relationships between chemical structures, biological targets, and side effects [53]. Despite the potential of emerging data sources such as electronic health records and genomics data for personalized prediction, challenges like data standardization and interoperability persist [54], highlighting the need for standardized efforts and common ontologies to facilitate comprehensive dataset creation for machine learning models in side effect prediction.
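One common way to combine chemical and biological data is to concatenate a binary fingerprint with a binary target-association vector. The sketch below assumes the fingerprint has already been computed elsewhere and is given as a set of "on" bit positions; the function names, the 4-bit fingerprint length, and the target identifiers are hypothetical and chosen only for illustration.

```python
def to_bits(active, vocabulary):
    """Encode membership of each vocabulary item in `active` as 0/1."""
    return [1 if item in active else 0 for item in vocabulary]

def combine_features(fingerprint_bits, targets, n_bits, target_vocab):
    """Concatenate chemical (fingerprint) and biological (target) indicators."""
    chemical = to_bits(fingerprint_bits, range(n_bits))
    biological = to_bits(targets, target_vocab)
    return chemical + biological

# Hypothetical target vocabulary; real studies would use protein identifiers.
TARGET_VOCAB = ["target_A", "target_B", "target_C"]

# A drug with fingerprint bits 0 and 3 set and one known target.
vector = combine_features({0, 3}, {"target_B"}, n_bits=4, target_vocab=TARGET_VOCAB)
```

The resulting vector can be passed to any classifier that accepts fixed-length numeric input; real fingerprints are typically hundreds or thousands of bits long.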
According to the findings of this review, the integration of chemical and biological features showcased proficiency in precision, F1 score, AUC, and AUPR metrics. In the research conducted by Mizutani et al., canonical correlation analysis and sparse canonical correlation analysis were used, which provided valuable insights into the significance of feature selection. Their study highlighted the superiority of employing the targeted protein-based approach as a biological feature for the prediction of drug-related side effects [19]. Moreover, the research conducted by Liu et al. evaluated various machine-learning algorithms with different features and demonstrated the exceptional performance of SVM when combining chemical, biological, and phenotypic features [17].
Random Forest emerged as the most common algorithm used across the included studies, followed by KNN and SVM. However, there are discrepancies regarding the most frequently used algorithms within this research domain [55]. Das and Mazumder reported that SVM and logistic regression are commonly used for predicting drug-related side effects [1]. In contrast, Sachdev and Gupta emphasized the efficacy of multi-label KNN learning, SVM, and random forest [13]. Random Forest's interpretability and resistance to overfitting are among the advantages of this algorithm; however, it may struggle with high-dimensional data [56]. Techniques like Mean Decrease in Impurity (MDI) could enhance its efficacy [57]. KNN is valued for simplicity but requires careful parameter selection, while SVM handles high-dimensional data well but can be computationally expensive [58]. Beyond these, gradient-boosting machines and deep learning architectures offer promising alternatives and are adept at capturing complex relationships in drug data [6].
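To make the MDI idea concrete, the sketch below computes the weighted Gini impurity decrease for a single split. In a Random Forest, a feature's MDI importance is usually obtained by summing this quantity over every node that splits on the feature and averaging across trees; the labels and the split shown here are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease contributed by one split of `parent`."""
    n = len(parent)
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - weighted_children)

# Hypothetical node: 1 = side effect present, 0 = absent.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
decrease = impurity_decrease(parent, left, right, n_total=len(parent))
```

Features that rarely produce large decreases receive low importance and are candidates for removal, which is one way MDI can help with high-dimensional inputs.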
This study highlighted the significance of different feature combinations in predicting outcomes. Similarly, Das and Mazumder focused on four distinct features, namely, chemical, biological, phenotypic, and other drug descriptors [1]. Other studies concentrated on patient-centric data sources such as prospective data collection and derived data from Electronic Health Records (EHRs) and social media platforms to enrich their predictive capabilities [59]. For example, Zhao et al. used EHR data to predict drug-related side effects. They applied multiple supervised algorithms to analyze patient data, including demographics, lab results, and medication history, achieving significant accuracy with the Random Forest algorithm in identifying potential drug-related side effects before they manifested clinically [60]. Ietswaart et al. used data from the FDA's Adverse Event Reporting System (FAERS) to train a Random Forest model. This model was able to detect subtle patterns and correlations within the vast datasets, effectively predicting the side effects of new and existing drugs [61].
It is essential to distinguish between studies that used patient-centric data and those that focused on drug features, as their objectives vary significantly. Patient-centric studies primarily aim to predict the overall incidence of specific drug-related side effects, diagnose individuals experiencing side effects, or prognosticate patients at high risk of drug side effects [62]. Conversely, studies included in this review predominantly focused on predicting drug-related side effects based on drug features prior to their manifestation in patients. For instance, Kim et al. reviewed existing statistical and machine-learning methods to detect drug-related side effects in humans [59]. La et al. integrated theoretical biological data into machine-learning models to predict Active Pharmaceutical Ingredient (API) side effects, validating their approach against real-world clinical outcomes [63]. This underscores the multifaceted nature of data used in predicting drug-related side effects, reflecting the inherent challenges in directly comparing machine learning techniques used across these two distinct groups of studies.
Additionally, different metrics, including AUC, F1 score, precision, recall, AUPR, and accuracy, were used to evaluate the algorithms' effectiveness. According to the results, AUC was the most frequently used metric. These findings are consistent with Ho et al., who underscored the importance of metrics such as AUC, F1 score, and precision in evaluating machine learning algorithms for ADR detection and prediction [14]. Drug-related side effect data often suffer from class imbalance, where some side effects are significantly rarer than others [33]. Exploring alternative metrics like balanced accuracy or the Matthews correlation coefficient, which account for class imbalance, could provide a more nuanced perspective on model performance, especially for datasets with imbalanced classes [64].
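The effect of class imbalance on these metrics can be illustrated with a small worked example. The sketch below uses the standard definitions of accuracy, balanced accuracy, and the Matthews correlation coefficient (MCC); the confusion-matrix counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, balanced accuracy, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0.0 when it is undefined.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy, "mcc": mcc}

# Hypothetical data: 1000 drug-side effect pairs, only 10 positives,
# scored by a model that always predicts "no side effect".
always_negative = confusion_metrics(tp=0, fp=0, tn=990, fn=10)
```

In this example the trivial model reaches 99% accuracy, while its balanced accuracy of 0.5 and MCC of 0 show that it has no discriminative ability.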
The results showed that Random Forest had superior performance compared to other machine learning algorithms included in this study. However, the prominent algorithm in Das and Mazumder's study was SVM [1], and multi-label KNN learning prevailed in Sachdev and Gupta's research [13]. Random Forest's prominence in drug-related side effect prediction arises from its adeptness at handling high-dimensional data and its robustness to imbalanced class distributions commonly found in such datasets [9]. Ensemble methods like Random Forest often outperform single-learner methods like SVM due to their ability to leverage multiple learners for greater generalizability, although SVMs may excel in specific scenarios, particularly with limited dataset sizes [36]. However, a deeper analysis beyond average performance metrics is essential to unveil algorithm-specific nuances and assess generalizability across independent datasets [58]. Combining chemical and biological features enhances performance, but further exploration into specific types of features and feature selection techniques is warranted.
Overall, a comprehensive examination of multiple studies reveals common trends and variations in the selection of features, databases, and algorithms for predicting drug-related side effects. The diversity of machine learning approaches highlighted the complex nature of this task, and the emphasis on using different evaluation metrics underscores the significance of thorough evaluation to guarantee the reliability and effectiveness of predictive models in the pharmaceutical research domain.

Implication for Practice
By leveraging comprehensive datasets that integrate chemical, biological, and phenotypic information, machine learning algorithms demonstrate promise in robustly predicting drug-related side effects. These capabilities translate into several key benefits for clinical translation and drug development applications. This type of prediction can facilitate a paradigm shift towards precision medicine. Integration of pharmacogenetic data into these algorithms could empower clinicians to tailor drug therapies based on the individual patient's unique genetic profiles, significantly mitigating the risk of drug-related side effects [25,65]. Furthermore, machine learning can serve as a powerful catalyst in drug development by enabling the early identification of potential side effects. This crucial information allows researchers to prioritize promising drug candidates and circumvent late-stage clinical trial failures stemming from unforeseen safety concerns [66].
Machine learning also offers the potential to optimize clinical trial design through patient stratification based on the risk of the predicted side effect. This targeted approach can enhance the efficiency and safety of clinical trials by focusing on patient populations demonstrably susceptible to side effects [67]. Additionally, graph learning approaches have emerged as a powerful tool for uncovering the intricate relationships between drugs, their targets, and potential side effects [68–70]. By leveraging biological networks that integrate information on drugs, targets, and their interactions, graph neural networks (GNNs) offer a promising avenue for improved prediction [71]. However, GNN-based methods are susceptible to the over-smoothing problem, which can hinder their ability to learn discriminative representations of drugs and targets [72]. To address this challenge, recent studies have proposed novel GNN architectures that incorporate strategies to mitigate over-smoothing, such as node-dependent local smoothing techniques [73]. These advancements pave the way for more accurate drug side effect predictions by capturing the nuanced relationships within biological networks [71,72].
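The over-smoothing problem can be illustrated with a deliberately simplified model. The sketch below applies repeated mean aggregation over neighbors, omitting the learned weights and nonlinearities of a real GNN, on a hypothetical four-node path graph; it shows node features converging to nearly the same value as layers are stacked.

```python
def propagate(features, neighbors):
    """One mean-aggregation layer: each node averages itself and its neighbors."""
    updated = []
    for node, nbrs in enumerate(neighbors):
        group = [node] + nbrs
        updated.append(sum(features[i] for i in group) / len(group))
    return updated

def spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)

# Hypothetical path graph 0-1-2-3 with one scalar feature per node.
neighbors = [[1], [0, 2], [1, 3], [2]]
features = [1.0, 0.0, 0.0, 0.0]

initial_spread = spread(features)
for _ in range(30):  # 30 stacked aggregation layers
    features = propagate(features, neighbors)
final_spread = spread(features)
```

After enough layers the nodes become almost indistinguishable, which is why approaches that limit or adapt the amount of smoothing per node are of interest.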

Strengths and Limitations of the Study
In this study, the literature related to the use of machine learning algorithms for predicting drug-related side effects, their selected features, and evaluation metrics was reviewed. However, there were some limitations. First, due to the inherent diversity in the design, datasets, and methodologies across the literature, conducting a meta-analysis was not feasible. In response to this limitation, a qualitative comparison approach was adopted, enabling a comprehensive evaluation of the available evidence. The second limitation was related to the exclusion of non-English studies due to time and resource constraints. The third limitation might be related to overfitting of the included models, particularly as they are in-silico models lacking empirical verification in real-world scenarios. To mitigate this concern, future research can prioritize the validation of predictive models using real-world data and clinical trials. Rigorous cross-validation techniques and external validation on independent datasets can further enhance the robustness and generalizability of predictive models. Moreover, the algorithms should be used to predict the side effects of commercially available drugs so that their performance and effectiveness can be evaluated.
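As an illustration of the cross-validation mentioned above, the following sketch generates train and test index sets for k-fold cross-validation. It is a minimal version that assumes contiguous, unshuffled folds; the function name is hypothetical, and in practice the samples would usually be shuffled or grouped by drug before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# Example: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one test fold, so every prediction used for evaluation comes from a model that did not see that sample during training.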

Conclusions
In conclusion, this scoping review comprehensively analyzed the use of machine learning techniques for predicting drug-related side effects. The findings underscore the critical role of selecting features from diverse databases encompassing chemical, biological, and phenotypic data for robust prediction. Ensemble methods, particularly Random Forest, emerged as superior algorithms across a spectrum of evaluation metrics, including AUC, precision, recall, F1 score, and AUPR. The integration of chemical and biological features enhanced performance in predicting drug-related side effects. These findings suggested that machine learning algorithms are useful for various applications in the pharmaceutical domain, including drug development through early prediction of side effects and optimizing clinical trial designs via patient stratification based on the predicted risk of side effects. Future research should explore specific feature types, refine feature selection techniques, and investigate the potential of graph-based methods to predict drug-related side effects even more accurately.

Table 1. Summary of the selected articles.

Table 2. Comparing algorithms, selected features, and evaluation metrics.