2.1. Statistical Analysis
The statistical analysis began with a global characterization of the data using descriptive statistics and association analysis techniques. The purpose was twofold: to evaluate the degree of relationship among the independent variables, in order to avoid information redundancy, and to assess their relevance to the target variable to be predicted. This approach allowed us to identify preliminary patterns and potential dependencies among the variables considered in the study. Likewise, the Chi-square test was used to determine whether adverse effects were associated with the patient’s condition or with the type of medication prescribed.
The core of the work followed a predictive approach based on machine learning models, aimed at estimating the probability of one or more side effects associated with the use of certain drugs. In particular, the prediction strategy was formulated to generate, for each patient or input configuration, the five most likely side effects together with their respective probabilities of occurrence, in order to contribute to the prevention and early detection of potential adverse events.
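As a minimal sketch of this ranking step, the following Python snippet takes a matrix of class probabilities and returns the five most probable side effects per record. The label names, the probability values, and the helper `top_k_side_effects` are illustrative assumptions and are not taken from the study data or code.

```python
import numpy as np


def top_k_side_effects(proba, labels, k=5):
    """Return, for each row, the k most probable labels with their probabilities."""
    proba = np.asarray(proba, dtype=float)
    # Negate so that argsort ranks from most to least probable;
    # a stable sort keeps the original label order when probabilities tie.
    order = np.argsort(-proba, axis=1, kind="stable")[:, :k]
    return [
        [(labels[j], float(proba[i, j])) for j in row]
        for i, row in enumerate(order)
    ]


# Hypothetical labels and class probabilities for two records (assumed values).
labels = ["anxiety", "somnolence", "dry mouth", "cough", "fatigue", "constipation"]
proba = [
    [0.30, 0.25, 0.05, 0.10, 0.20, 0.10],
    [0.05, 0.10, 0.40, 0.15, 0.10, 0.20],
]

ranked = top_k_side_effects(proba, labels, k=5)
for record in ranked:
    print(record)
```

In practice the probability matrix would come from a fitted classifier, for example the `predict_proba` output of a scikit-learn model.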
Furthermore, to explore the possible overlap between the categories of the side effect variable, a principal component analysis was carried out, which in turn helped to assess the adequacy and robustness of conventional classification metrics such as accuracy, precision, recall, and F1-score.
To optimize the hyperparameters of the machine learning models, the Grid Search method was used as an iterative training and evaluation procedure, with the aim of maximizing the overall accuracy of the models. The hyperparameters obtained through this process were later used in the final prediction of side effects in patients.
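The grid search procedure can be sketched with scikit-learn's `GridSearchCV`. The example below is an illustration under stated assumptions: the data are synthetic, the search space is hypothetical, and five-fold cross-validation with accuracy as the scoring metric is assumed; the grids actually explored in the study are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the study data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical search space; each combination is trained and scored in turn.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",  # the metric being maximized
    cv=5,                # assumed number of folds
)
search.fit(X, y)

best_params = search.best_params_  # combination with the highest mean accuracy
best_score = search.best_score_    # mean cross-validated accuracy of that combination
print(best_params, round(best_score, 3))
```

The same pattern applies to the other estimators by swapping the `estimator` and its `param_grid`.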
The principal component analysis made it possible to examine the degree of overlap between classes and its potential impact on the discriminative capacity of the models, as well as on the interpretability of traditional metrics, which can be affected when there are shared patterns between multiple response categories.
2.2. Findings
Table 1 shows a heterogeneous population in terms of age and treatment regimens, with wide variability in both dosage and duration of treatment. On average, patients are in middle adulthood, while the doses administered are considerably dispersed, reflecting the coexistence of different drug regimens. The duration of treatment is concentrated around an intermediate value, with no marked extremes, suggesting a relatively balanced distribution. Regarding clinical response, most patients achieved moderate to high levels of improvement, indicating a generally favorable therapeutic effect within the analyzed set.
Table 2 shows that the sample had a balanced composition, with a slight predominance of males versus females. Regarding the clinical condition associated with pharmacological treatment, a relatively homogeneous distribution was observed among the categories considered. The most frequent condition was infection, followed by pain, diabetes, hypertension, and depression. These proportions indicate that no single condition dominates the sample markedly, suggesting adequate representativeness of the different clinical scenarios included in the study.
In the present study, one relevant methodological aspect is the analysis of possible information redundancy between the predictor variables [19]. To identify strong linear relationships that could indicate duplicated information, the correlation matrix of the numerical independent variables was calculated and is presented in Figure 1. The results show the absence of significant correlations between these characteristics, which indicates a low degree of linear dependence between them.
This behavior suggests that there is minimal information redundancy in the set of variables considered, so that each characteristic provides differentiated information to the model without substantially overlapping with other predictor variables. Consequently, independent variables can be incorporated together in machine learning models without a high risk of multicollinearity, thus strengthening the stability of the estimators and the interpretability of the results obtained.
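A redundancy check of this kind can be sketched as follows. The variable names, the simulated values, and the 0.8 cutoff for flagging a pair as redundant are assumptions made for illustration; the study does not state a specific threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical, independently simulated numerical predictors.
age = rng.normal(45, 12, n)
dosage = rng.normal(250, 80, n)
duration = rng.normal(30, 5, n)

names = ["age", "dosage_mg", "duration_days"]
X = np.column_stack([age, dosage, duration])

# Pearson correlation matrix with variables in columns.
corr = np.corrcoef(X, rowvar=False)

# Flag pairs whose absolute correlation reaches the assumed cutoff.
threshold = 0.8
redundant = [
    (names[i], names[j], float(corr[i, j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) >= threshold
]

print(np.round(corr, 2))
print("redundant pairs:", redundant)
```

An empty list of flagged pairs corresponds to the situation described above, where no predictor substantially duplicates another.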
The results in Table 3 show a statistically significant association between side effects and clinical condition (p < 0.05) and between side effects and the drug administered (p < 0.05).
These findings indicate that the occurrence of side effects is not independent of the patient’s underlying medical condition, but varies significantly depending on the type of pathology treated. Likewise, there is evidence of a significant relationship between the reported side effects and the drug used, suggesting that different active ingredients have differentiated safety profiles.
Overall, these results support the hypothesis that both the clinical condition and the medication are relevant factors in the manifestation of adverse effects, which justifies their inclusion as explanatory variables in the proposed predictive models. The magnitude of the chi-square statistics obtained also suggests a structured and non-random relationship between these variables, reinforcing the probabilistic modelling approach adopted in the study.
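For reference, the chi-square test of independence compares the observed contingency counts with the counts expected if the two variables were independent. The sketch below computes the statistic and the degrees of freedom for a small table with invented counts; the function name and the data are hypothetical, and the decision uses the standard 5% critical value for one degree of freedom (3.841) rather than an exact p-value.

```python
import numpy as np


def chi_square_independence(observed):
    """Return the chi-square statistic, degrees of freedom, and expected counts."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    total = observed.sum()
    # Expected counts under independence: (row total * column total) / grand total.
    expected = row_totals @ col_totals / total
    statistic = float(((observed - expected) ** 2 / expected).sum())
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof, expected


# Invented counts: rows are two conditions, columns are two side effects.
table = [[10, 20], [20, 10]]
statistic, dof, expected = chi_square_independence(table)

CRITICAL_VALUE_DOF1 = 3.841  # chi-square critical value for alpha = 0.05, 1 degree of freedom
reject_independence = dof == 1 and statistic > CRITICAL_VALUE_DOF1

print(round(statistic, 4), dof, reject_independence)
```

For larger tables, a library routine such as `scipy.stats.chi2_contingency` returns the statistic together with the p-value and the expected counts.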
To analyze the overlap between the categories of the side effects variable, Principal Component Analysis (PCA) was applied to the numerical predictor variables in order to project the data into two dimensions; the results are presented in Figure 2. Since the problem involves a multidimensional prediction setting, this approach enables a two-dimensional representation of information that would otherwise remain in a high-dimensional feature space. In an ideal scenario, a clear and defined separation between the color groups would indicate an adequate capacity to discriminate between the different categories of side effects. However, the results show considerable overlap between the groups, without a clear delimitation of regions corresponding to each category.
This lack of clear separation indicates that there are similar configurations of predictor variables that may lead to similar side effects or, in certain cases, to the simultaneous manifestation of multiple adverse effects. This behavior reflects the inherent complexity of the pharmacological response and the interindividual variability in the occurrence of adverse events.
Consequently, these findings support the relevance of an approach based on probabilistic prediction of potential side effects, rather than a strict classification scheme of a single category per patient. This approach is more in line with the nature of the problem addressed, as it allows estimating a reduced set of plausible adverse effects associated with the use of certain drugs, thus strengthening the clinical utility of the model as a support tool for the prevention and early surveillance of adverse events.
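The two-dimensional projection discussed above can be sketched with a singular value decomposition. This is an illustration only: the data are random, the helper `pca_2d` is hypothetical, and standardizing the variables before the projection is an assumption, since the preprocessing used in the study is not described here.

```python
import numpy as np


def pca_2d(X):
    """Project standardized data onto its first two principal components."""
    X = np.asarray(X, dtype=float)
    # Standardize so that variables on different scales contribute comparably (assumption).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:2].T
    explained = (S ** 2) / (S ** 2).sum()
    return scores, explained[:2]


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # random stand-in for the numerical predictors

scores, explained = pca_2d(X)
print(scores.shape, np.round(explained, 3))
```

Plotting the two columns of `scores`, colored by side effect category, would give a scatter plot of the kind used to inspect class overlap.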
In line with the results derived from principal component analysis (PCA), the grouping obtained through the t-SNE dimensionality reduction technique exhibits a comparable pattern (Figure 3). That is, no distinct separation among side effects is observed; rather, the data form relatively heterogeneous clusters, which reinforces the hypothesis of symptom overlap across records.
Table 4 presents the optimal hyperparameters identified for each of the machine learning models evaluated, along with the value of the metric used during the optimization process. In general terms, it is observed that models based on decision trees (Random Forest and Decision Tree) achieved superior performance compared to distance-based methods (KNN) and nonlinear separation functions (SVC). These results suggest that approaches that exploit hierarchical structures and nonlinear relationships between explanatory variables show a greater ability to capture the complexity inherent in the problem of predicting side effects, even when overlap between categories limits the overall performance of the models.
Table 5 presents the results obtained from the training process using cross-validation. Overall, the results indicate a relatively stable performance across the different models, as evidenced by the standard deviation values associated with each model. Furthermore, the Random Forest model achieved the highest performance metric during the five-fold cross-validation training process, followed closely by the Decision Tree model, which exhibited a relatively similar performance. In contrast, the remaining two models showed considerably lower performance compared with the first two models. The models were trained on a system equipped with an Intel i7 CPU, 12 GB of RAM, without GPU acceleration, and approximately 110 GB of available storage.
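A comparison of this kind, reporting the mean and standard deviation of the accuracy over five folds for each model, can be sketched with scikit-learn. The data are synthetic and the models use illustrative or default settings, not the tuned hyperparameters reported in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the study data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Illustrative settings only; these are not the study's tuned hyperparameters.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
}

summary = {}
for name, model in models.items():
    # One accuracy value per fold; the spread indicates how stable the model is.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    summary[name] = (float(scores.mean()), float(scores.std()))

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```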
Table 6 presents an example of the output generated by the Random Forest model for the top four most probable symptoms, which shows, for each instance evaluated, the side effects with the highest estimated probability, ordered according to their relevance. In general terms, it is observed that the model systematically assigns the highest probability to the main side effect, which indicates a consistent pattern in the hierarchy of predictions. Likewise, there is evidence of a correspondence between the observed side effect and the one with the highest predicted probability, which suggests an adequate internal coherence of the model in the cases illustrated.
In addition, the presentation of multiple side effects per instance allows representing the uncertainty inherent in the predictive process and reflects the clinical possibility of the simultaneous occurrence of different adverse events. This probabilistic approach is especially relevant in the pharmacological field, since it provides a limited set of plausible side effects that can be considered in clinical surveillance tasks and in early prevention strategies, instead of being restricted to a single deterministic prediction [20,21].
As an overview of the performance of the proposed models, the prediction generated by each of the algorithms considered in this study was compiled, together with the probability associated with the side effect estimated as most likely (Table 7). This representation allows direct comparison of the output of the different machine learning approaches and facilitates the clinical interpretation of the results by providing not only a category of expected adverse effect but also a quantitative measure of the uncertainty associated with this prediction.
As shown in Figure 4, patients with depression exhibit greater proximity in the factorial space to effects related to neurological or central nervous system symptoms, particularly anxiety, somnolence, dry mouth, sleep disturbances, and sweating. In the case of diabetes, the most associated side effects mainly correspond to gastrointestinal and metabolic alterations, such as stomach discomfort, diarrhea, and weight gain, in addition to dermatological manifestations. For hypertension, proximity is observed with symptoms related to the cardiovascular and respiratory systems, including bradycardia, cough, swelling, and fatigue. In the case of infections, the most associated effects include abdominal pain and allergic reactions, followed by dermatological manifestations such as rashes. Finally, for pain relief treatments, the most commonly associated side effects mainly correspond to gastrointestinal symptoms, such as abdominal pain and constipation, followed by cutaneous reactions.
Furthermore, Table 8 evaluates the reliability of the predictions of the most probable side effects. A top-k evaluation was performed with k = 1, 3, and 5. This metric assesses whether the true side effect appears as the most probable prediction (top-1), within the three most probable predictions (top-3), or within the five most probable predictions (top-5) for each trained model.
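The top-k evaluation can be illustrated with a short sketch. The probability matrix and the true labels below are invented for the example, and the helper `top_k_accuracy` is hypothetical rather than the code used in the study.

```python
import numpy as np


def top_k_accuracy(y_true, proba, k):
    """Fraction of records whose true class is among the k most probable classes."""
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true)
    # Indices of the k highest-probability classes for each record.
    top_k = np.argsort(-proba, axis=1, kind="stable")[:, :k]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())


# Invented probabilities over six side effect classes for four records.
proba = [
    [0.40, 0.25, 0.15, 0.10, 0.06, 0.04],  # true class 0 is ranked 1st
    [0.10, 0.35, 0.30, 0.15, 0.06, 0.04],  # true class 2 is ranked 2nd
    [0.30, 0.25, 0.20, 0.05, 0.15, 0.05],  # true class 4 is ranked 4th
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],  # true class 5 is ranked 6th
]
y_true = [0, 2, 4, 5]

results = {k: top_k_accuracy(y_true, proba, k) for k in (1, 3, 5)}
print(results)  # {1: 0.25, 3: 0.5, 5: 0.75}
```

By construction the metric cannot decrease as k grows, which is why top-5 values are at least as high as top-1 values for the same model. scikit-learn offers an equivalent routine, `sklearn.metrics.top_k_accuracy_score`.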
2.3. Comparison with Alternative Models
In the previous results, a multi-output prediction approach was implemented using Random Forest, Decision Tree, SVC, and KNN models. However, it is important to note that these are not the only machine learning models that can be applied to this type of problem. Emphasis can be placed on the two best-performing models, namely Random Forest and Decision Tree, which showed favorable classification results by correctly identifying the true side effects within their top-3 predicted outcomes.
Nevertheless, alternative approaches such as XGBoost and ML-KNN can also be considered for multi-output or multi-label prediction tasks. These methods have been widely applied in classification problems and, in many cases, have demonstrated competitive or slightly superior predictive performance compared to traditional machine learning models [22]. However, the computational cost associated with these algorithms may increase as the size of the dataset grows, particularly in the case of gradient boosting methods such as XGBoost [23].
Similarly, ML-KNN has been proposed as an extension of the k-nearest neighbors algorithm specifically designed for multi-label classification problems. This method estimates label probabilities based on the distribution of labels among neighboring instances and has shown competitive performance in several multi-label learning tasks [24]. Nevertheless, in practice, the predictive performance obtained with ML-KNN is often comparable to that achieved by conventional tree-based models such as Decision Trees, especially in datasets of moderate size.