by
  • Camila Barrios-Cogollo1,*,
  • Jorge Gómez Gómez2 and
  • Emiro De-La-Hoz-Franco1

Reviewer 1: Anonymous Reviewer 2: Anonymous Reviewer 3: Anonymous Reviewer 4: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper conducts a comparative study of seven supervised machine learning models (Decision Tree, Random Forest, Logistic Regression, KNN, SVC, LDA, and MLP) to detect cyberbullying among university students. 


The overall structure of the paper is appropriate. Some comments are provided to improve the quality of this manuscript. In my opinion, this manuscript should be reconsidered after major revision.

1)    The figures quality is too low. Please kindly increase their resolution.
2)    Please bold best result in each table.
3)    The Decision Tree achieving 100% precision, recall, F1, and accuracy is extremely suspicious. Given the relatively small dataset (615 instances), it strongly suggests overfitting, especially since there’s no indication that cross-validation or external test set was used. The authors used SMOTE before the train-test split or didn't specify otherwise, which may cause data leakage. Please use k-fold cross-validation and clarify whether SMOTE was applied only to the training set.
4)    Please kindly provide confusion matrix after using k-fold cross validation.
5)    The authors should explain briefly each machine learning model in a subsection. Please add one or two mathematical equations of each method. I recommend referring to the following papers:  doi.org/10.1038/s41598-025-03232-z, doi.org/10.1038/s41598-022-11429-9
6)    There is no mention of statistical tests to assess whether differences in performance across models are statistically significant.
7)    The paper does not offer any model interpretability techniques. Please use SHAP plots to show the feature importance. What are the most important features driving the prediction? Can you interpret the model outputs?
8)    Why were deep learning models excluded from empirical comparison, despite being discussed?
9)    Paper could benefit from a flowchart of the pipeline.
10)    I highly recommend using more recent machine learning models like XGBoost for experiments.
11)    It is recommended to share a link to the source code to make the project reproducible.
12)    Please provide more discussion on the results.
13)    The English in the present paper is not of publication quality and require major improvement. Please carefully proof-read spell check to eliminate grammatical errors. As an example: “Machine learning models, particularly decision trees, how a high potential for accurately detecting cyberbullying.”

Author Response

Reviewer 1's responses:

Dear Reviewer,

We greatly appreciate the time and effort you put into reviewing our work, "Comparative Analysis of Classification Models for Cyberbullying Detection in University Environments." Your comments have allowed us to reflect deeply on various aspects of our study and, in most cases, substantially improve the manuscript.

We have addressed all of your points in the review. For most of your suggestions, we have modified the text accordingly. We trust that our clarifications address your concerns.

We greatly appreciate your contribution to the process of improving this article.

1)    The figures quality is too low. Please kindly increase their resolution.

R:/

We have improved the quality of all figures.


2)    Please bold best result in each table.

R:/

We have highlighted the best-performing model.

Table 3. Comparison of performance.

| Model | Accuracy (mean ± std) | Precision (mean ± std) | Recall (mean ± std) | F1 (mean ± std) | ROC-AUC (mean ± std) |
|---|---|---|---|---|---|
| DecisionTree | 0.99186992 ± 0.01817941 | 1 ± 0 | 0.98630137 ± 0.03063107 | 0.9929078 ± 0.01585864 | 0.99315068 ± 0.01531553 |
| RandomForest | 0.97560976 ± 0.02072772 | 0.99166667 ± 0.0124226 | 0.96712329 ± 0.03153661 | 0.97901478 ± 0.01802444 | 0.9949589 ± 0.00670212 |
| XGBoost | 0.95447154 ± 0.03020191 | 0.99108225 ± 0.00814842 | 0.93150685 ± 0.04543322 | 0.96001551 ± 0.02679039 | 0.97879452 ± 0.02007735 |
| LogisticRegression | 0.94471545 ± 0.01563852 | 0.98863692 ± 0.01159546 | 0.91780822 ± 0.03624317 | 0.95143064 ± 0.0146466 | 0.98838356 ± 0.00922926 |
| MLP | 0.88292683 ± 0.01957983 | 0.92723149 ± 0.01714454 | 0.87123288 ± 0.0248848 | 0.89821891 ± 0.01762364 | 0.94975342 ± 0.00913525 |
| SVC | 0.85365854 ± 0.03941203 | 0.91875508 ± 0.03457614 | 0.82739726 ± 0.05697708 | 0.8697079 ± 0.03771287 | 0.93484932 ± 0.01807326 |
| LDA | 0.76585366 ± 0.03515731 | 0.89994842 ± 0.03092746 | 0.68219178 ± 0.06002439 | 0.77463033 ± 0.04106359 | 0.84728767 ± 0.03139085 |
| KNN | 0.52195122 ± 0.02024374 | 0.88026144 ± 0.04781896 | 0.22465753 ± 0.0248848 | 0.35772027 ± 0.03498954 | 0.67975342 ± 0.02974455 |

 


3)    The Decision Tree achieving 100% precision, recall, F1, and accuracy is extremely suspicious. Given the relatively small dataset (615 instances), it strongly suggests overfitting, especially since there’s no indication that cross-validation or external test set was used. The authors used SMOTE before the train-test split or didn't specify otherwise, which may cause data leakage. Please use k-fold cross-validation and clarify whether SMOTE was applied only to the training set.

R:/

Categorical variables, represented as text, were transformed into numeric variables to facilitate processing using machine-learning models. To assess the performance of the classifiers, stratified k-fold cross-validation (k = 5) was employed to provide a robust estimation of the model's generalization capability. Given the class imbalance in the target variable (bullying vs. non-bullying), the synthetic minority oversampling technique (SMOTE) was utilized to achieve class balance. In each iteration of the k-fold, the following procedure was implemented to prevent information leakage: (1) the dataset was partitioned into training and test sets according to the fold; (2) within the training set, missing values were imputed using the mean (SimpleImputer), and a StandardScaler was applied; (3) SMOTE was applied exclusively to the scaled training set to generate synthetic samples of the minority class; and (4) the classifier was trained using the augmented training set, and the validation (test) set of the fold was evaluated without any modification (i.e., without SMOTE, using the same imputer and scaler fitted on the training set). Figure 2 shows the confusion matrix of a decision tree evaluated using this protocol.


SMOTE was configured with k_neighbors = 5. To address cases where the minority class within a fold had very few samples (fewer than six), the k_neighbors parameter was automatically set to a lower value to allow for the generation of synthetic samples, as suggested by the standard implementation. Random seeds were fixed to ensure reproducible results. Model performance was evaluated using the accuracy, precision, recall, F1 score, and area under the ROC curve (ROC-AUC). All reported metrics correspond to the mean and standard deviation calculated over the k folds of the cross-validation process. To prevent overfitting during the model evaluation, the hyperparameters of each classifier were defined a priori and kept fixed across all iterations, thus avoiding any within-fold optimization processes that could artificially inflate the metrics.
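For clarity, the following minimal sketch illustrates the per-fold protocol described above (an illustrative reconstruction, not the exact notebook code; it assumes X and y are NumPy arrays with binary 0/1 labels and uses scikit-learn together with imbalanced-learn):

```python
# Illustrative sketch of the leak-free evaluation protocol (assumptions:
# X, y are NumPy arrays; y contains binary 0/1 labels).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE

def evaluate_without_leakage(X, y, n_splits=5, seed=42):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    f1s, aucs = [], []
    for train_idx, test_idx in skf.split(X, y):
        X_tr, X_te = X[train_idx], X[test_idx]
        y_tr, y_te = y[train_idx], y[test_idx]

        # Imputer and scaler are fitted on the training fold only
        imputer = SimpleImputer(strategy="mean").fit(X_tr)
        scaler = StandardScaler().fit(imputer.transform(X_tr))
        X_tr_p = scaler.transform(imputer.transform(X_tr))
        X_te_p = scaler.transform(imputer.transform(X_te))  # same fitted transformers, no refit

        # SMOTE only on the (scaled) training fold, with adaptive k_neighbors
        k = min(5, int(np.bincount(y_tr).min()) - 1)
        if k >= 1:
            X_tr_p, y_tr = SMOTE(k_neighbors=k, random_state=seed).fit_resample(X_tr_p, y_tr)

        clf = DecisionTreeClassifier(random_state=seed).fit(X_tr_p, y_tr)
        f1s.append(f1_score(y_te, clf.predict(X_te_p)))
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te_p)[:, 1]))
    return np.mean(f1s), np.std(f1s), np.mean(aucs), np.std(aucs)
```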

To compare the performance of the classifiers, the Friedman test was applied to the aggregated metrics per fold (k = 5, stratified cross validation). When Friedman was significant, pairwise comparisons between classifiers were performed using the Wilcoxon signed-rank test (paired). To control for type I errors due to multiple comparisons, the Holm correction was applied. In addition to p-values, the effect sizes were reported using paired Cohen's d values. Statistical comparisons were calculated based on the F1 and ROC-AUC metrics obtained per fold.
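As a complement, a minimal sketch of this statistical comparison (illustrative only; it assumes `scores` maps each classifier name to its list of per-fold F1 or ROC-AUC values, and uses SciPy and statsmodels):

```python
# Illustrative sketch: Friedman test, pairwise Wilcoxon tests with Holm
# correction, and paired Cohen's d as effect size.
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

def paired_cohens_d(a, b):
    diff = np.asarray(a) - np.asarray(b)
    return diff.mean() / diff.std(ddof=1)

def compare_classifiers(scores):
    names = list(scores)
    chi2, p = friedmanchisquare(*[scores[n] for n in names])
    print(f"Friedman: chi2 = {chi2:.3f}, p = {p:.2e}")

    pairs, pvals = [], []
    for a, b in combinations(names, 2):
        _, p_pair = wilcoxon(scores[a], scores[b])  # paired, per-fold comparison
        pairs.append((a, b))
        pvals.append(p_pair)

    reject, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    for (a, b), p_h, sig in zip(pairs, p_holm, reject):
        d = paired_cohens_d(scores[a], scores[b])
        print(f"{a} vs {b}: Holm-adjusted p = {p_h:.3f}, significant = {sig}, d = {d:.2f}")
```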

 


4)    Please kindly provide confusion matrix after using k-fold cross validation.

R:/

Figure 2. Decision tree confusion matrix.


5)    The authors should explain briefly each machine learning model in a subsection. Please add one or two mathematical equations of each method. I recommend referring to the following papers:  doi.org/10.1038/s41598-025-03232-z, doi.org/10.1038/s41598-022-11429-9

 

R:/

  • Supervised learning models
  • Random Forest

This ensemble method constructs T independent decision trees, each trained on a bootstrap sample with random feature selection, and integrates their predictions by voting for classification or averaging for regression. This approach reduces the variance compared to a single tree and mitigates overfitting [21]. Random Forest combines multiple decision trees, each trained on a different subset of the data and features, to make predictions. Combining the results of these trees through voting or averaging reduces the risk of errors and overfitting, which can occur with a single tree, leading to more reliable predictions. The associated equations are as follows:

- Prediction (regression): $\hat{y}(x) = \frac{1}{T}\sum_{t=1}^{T} f_t(x)$ (2)

- Prediction (majority vote, classification): $\hat{y}(x) = \operatorname{mode}\{ f_1(x), \ldots, f_T(x) \}$ (3)

 

  • Logistic Regression

The linear model for binary class probability applies the logistic (sigmoid) function to the linear predictor; it is typically trained using maximum likelihood [22], as shown in the following equation:

  • Probability function: $P(y = 1 \mid x) = \sigma(w^{\top}x + b) = \dfrac{1}{1 + e^{-(w^{\top}x + b)}}$ (4)

The logistic regression model uses a special function called the logistic or sigmoid function to predict the probability of an outcome belonging to one of the two categories. This function transforms a linear combination of input variables into a value between 0 and 1, representing the probability. The model is usually trained by finding the parameters that make the observed data most likely to occur, a method known as maximum-likelihood estimation.

  • Decision Tree

This model recursively partitions the feature space into homogeneous regions based on an impurity measure (Gini, entropy), and places a single prediction at each leaf. These are some of the most important equations in this model.

  • Gini impurity for a partition: $G = 1 - \sum_{k=1}^{K} p_k^{2}$ (5)
  • Information gain/impurity reduction during splitting: $\Delta I = I(\text{parent}) - \sum_{j} \dfrac{N_j}{N}\, I(\text{child}_j)$ (6)

Decision trees work by repeatedly dividing data into smaller groups based on specific features, aiming to create groups that are as similar as possible. This process continues until the tree reaches its endpoints, called leaves, where the predictions are made. The model uses mathematical calculations to determine the best way to split data at each step, ensuring the most effective grouping of similar items [23].

  • K-Nearest Neighbors (K-NN)

This lazy learning method assigns a test point to the majority (or average) class of its k-nearest neighbors based on distance. The main equations are as follows:

  • Euclidean distance: $d(x, x') = \sqrt{\sum_{j=1}^{n} (x_j - x'_j)^{2}}$ (7)
  • Classification rule: $\hat{y} = \operatorname{mode}\{\, y_i : x_i \in N_k(x) \,\}$ (8)

K-Nearest Neighbors is a straightforward machine learning technique that classifies new data points based on the characteristics of their closest neighbors. It works by looking at the k closest data points to the new one and deciding its category based on the majority of those neighbors. The method is considered "lazy" because it does not create a general model but instead makes decisions on the spot using the available data [24].

  • Support Vector Classifier

This model searches for a hyperplane that maximizes the margin between classes, allowing for violations (slack variables), and using kernels to separate in higher-dimensional spaces, if necessary. One of the main equations is as follows:

  • SV classification constraint: $y_i(w^{\top}x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$ for all $i$ (9)

where the variables $\xi_i$ are called slack variables and measure the error committed at the point $(x_i, y_i)$.

The Support Vector Classifier is a machine-learning technique that attempts to find the best way to separate different groups of data. It does this by looking for a line (or plane in higher dimensions) that creates the widest possible gap between groups. This method is flexible, allowing for errors and using special mathematical tricks to handle complex data that might not be easily separable in its original form [25].

  • Linear Discriminant Analysis (LDA)

This method assumes Gaussian classes with the same covariance matrix; it projects the data in a direction that maximizes the separation between classes relative to the intraclass variance [26]. The equations include: 

  • Fisher criterion: $J(w) = \dfrac{w^{\top} S_B\, w}{w^{\top} S_W\, w}$ (10)
  • Two-class solution: $w \propto S_W^{-1}(\mu_1 - \mu_2)$ (11)

where $S_W$ is the intra-class (within-class) covariance matrix, $S_B$ the between-class covariance matrix, and $\mu_i$ the class means.

  • Multi-layer Perceptron (MLP)

It is a multilayer neural network with nonlinear activation functions; the weights are adjusted by minimizing a loss function using backpropagation (gradient descent over layer composition) [27]. One of the main equations for this method is as follows:

  • Forward propagation for layer $l$: $a^{(l)} = \phi\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$ (12)
  • Loss function (cross-entropy): $\mathcal{L} = -\dfrac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$ (13)

 

  • XGBoost (Extreme Gradient Boosting)

It is an efficient and scalable implementation of gradient tree boosting (an additive tree model), with regularization of the tree functions and practical optimizations (sparsity-aware and cache-aware splitting). It is widely used in competitions and industrial applications. Its central equation is the additive gradient tree boosting prediction:

  • Gradient tree boosting: $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$ (14)

XGBoost is a powerful machine-learning technique that combines multiple simple decision trees to create a more accurate model. It incorporates special features to prevent overfitting and improve the performance, making it particularly effective for handling complex datasets. XGBoost has gained popularity in both competitive data science challenges and real-world applications, owing to its ability to produce highly accurate predictions while maintaining computational efficiency [28].

 

  • SHapley Additive ExPlanations (SHAP)

The SHAP method uses Shapley values (derived from game theory) to attribute to each feature its fair contribution to a model's prediction. It therefore produces a local additive explanatory model of the form:

  • $g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i$ (15)

where $z' \in \{0, 1\}^M$, $M$ is the number of simplified input features, and $\phi_i \in \mathbb{R}$ [29].

SHAP is a technique that helps us understand how different factors contribute to a model's prediction. It borrows ideas from game theory to determine the extent to which each feature influences the final result. This method creates a simple explanation that shows how each factor adds up to produce the model's output [29]. It can be adopted in future implementations to determine which features drive the best predictions.
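For reference, a minimal sketch of how such SHAP importances and their signs can be obtained (illustrative only; it assumes a fitted RandomForestClassifier `rf` and a pandas DataFrame `X` of features, with class 1 denoting bullying; depending on the SHAP version, shap_values may be returned as a list per class or as a single 3-D array):

```python
# Illustrative SHAP sketch: global importance and the sign of each feature's effect.
import numpy as np
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)   # classic API: one array per class for tree classifiers
sv_pos = shap_values[1]                  # contributions toward the positive (bullying) class

# Global importance: mean absolute SHAP value per feature
importance = np.abs(sv_pos).mean(axis=0)
ranking = sorted(zip(X.columns, importance), key=lambda t: -t[1])

# Direction of the effect: correlation between feature values and their SHAP values
signs = {c: np.corrcoef(X[c].values, sv_pos[:, i])[0, 1] for i, c in enumerate(X.columns)}

shap.summary_plot(sv_pos, X)             # beeswarm plot (Figure 5-style visualization)
```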


6)    There is no mention of statistical tests to assess whether differences in performance across models are statistically significant.

R:/

Materials and methods section

 

To compare the performance of the classifiers, the Friedman test was applied to the aggregated metrics per fold (k = 5, stratified cross-validation). When Friedman was significant, pairwise comparisons between classifiers were performed using the Wilcoxon signed-rank test (paired). To control for type I errors due to multiple comparisons, the Holm correction was applied. In addition to p-values, the effect sizes were reported using paired Cohen's d values. Statistical comparisons were calculated based on the F1 and ROC-AUC metrics obtained per fold.

Results section

For the F1 metric, the Friedman test [31] showed overall differences among the eight classifiers evaluated (χ² = 33.923, p = 1.8e-05). Similarly, for ROC-AUC (Figure 4), the Friedman test was significant (χ² = 32.106, p = 3.9e-05). However, after applying paired Wilcoxon comparisons and correcting with Holm's procedure, no pair-by-pair comparison reached significance (p < 0.05). In terms of practical magnitude, effect sizes were reported as paired Cohen's d values [32][33]. This combination of results suggests the presence of overall differences between models; however, the lack of significant pairs after correction indicates that individual differences are not robust to the sample size of the folds (k = 5).


7)    The paper does not offer any model interpretability techniques. Please use SHAP plots to show the feature importance. What are the most important features driving the prediction? Can you interpret the model outputs?

R:/

Materials and methods section

  • Analysis of the Importance of Variables and their Correlation with Prediction

Variables prefixed with "cbp" and "cbv" (e.g., cbp2, cbp1, cbv1, cbv2, cbv10, etc.) are identified as the most influential in enhancing the prediction of the bullying class within the RandomForest model (refer to Figure 2). The naming convention (where cbp* denotes perpetration questions and cbv* denotes victimization questions) implies that responses indicating higher levels of perpetration or victimization elevate the predicted probability of an observation being classified as bullying. For instance, cbp2 had the highest SHAP importance; its positive correlation (approximately +0.74) signified that elevated values of cbp2 substantially increased its contribution to the bullying prediction score. Similarly, cbv1 and cbv2 exhibited high positive correlations, with higher values for these questions driving the prediction towards the positive class. Conversely, certain characteristics (e.g., pbjw7 and sef6) demonstrated negative correlations with their SHAP values (e.g., pbjw7 corr ≈ −0.63, sef6 corr ≈ −0.79). Higher values for these questions diminished the probability that the model assigned to bullying. Depending on the content of these questions (a review of the questionnaire is recommended), they may correspond to protective factors or constructs inversely associated with the probability of bullying, as identified by the model.

 

 

Results section

TreeExplainer over RandomForest (SHAP) scores were employed to elucidate the contribution of each characteristic to the prediction of bullying. The variables exerting the most significant influence were cbp2, cbv1, cbv2, cbp1, and cbp10, as ordered by SHAP mean, as illustrated in Figure 5. Generally, responses indicating higher levels of perpetration or victimization were associated with an increased probability of bullying. Conversely, certain variables (e.g., pbjw7 and sef6) exhibited a negative association with the prediction, suggesting that they may function as protective factors or are inversely related to labeling within this dataset.

Figure 5. Feature importance SHAP – Random Forest (positive class)

 
8)    Why were deep learning models excluded from empirical comparison, despite being discussed?

R:/

In our study, we did not include deep learning models in the empirical comparison for several methodological and practical reasons.

  1. Nature of the dataset: The dataset used (CyberBullying University Students) was relatively small (615 instances, 98 variables). Deep-learning models typically require large volumes of data for adequate generalization. In this case, their application could lead to overfitting without providing a significant advantage over traditional machine learning models.
  2. Objective: The main purpose of this study was to evaluate the structural quality of a dataset and compare different supervised classification models that are explainable and interpretable in educational contexts. Algorithms such as decision trees, logistic regression, and Random Forest allow for a more direct interpretation of the predictors and their relationship with the cyberbullying phenomenon, which is essential for the adoption of these tools in academic institutions.
  3. Computational resources and reproducibility: Deep-learning models require greater computing power and hyperparameter settings. We opted for lighter algorithms that ensured reproducibility and accessibility for researchers or education professionals with limited resources.

However, we recognize the potential of deep neural network-based approaches, especially in the context of natural language processing and analysis of large volumes of data from social networks. We believe that exploring these techniques constitutes a future line of research complementary to this work.

 


9)    Paper could benefit from a flowchart of the pipeline.

R:/


10)    I highly recommend using more recent machine learning models like XGBoost for experiments.

R:/

We have integrated it as an eighth model


11)    It is recommended to share a link to the source code to make the project reproducible.

R:/

https://github.com/jeliecergomez/Machine_Learning/blob/main/CyberBullying6.ipynb

12)    Please provide more discussion on the results.

R:/

The results showed that tree-based models, particularly Decision Tree and Random Forest, achieved very high performance on conventional metrics (accuracy, precision, recall, F1, and ROC-AUC), with the Decision Tree obtaining the highest mean value in our experiment (see Table 3). However, interpretation of this superiority requires some qualifications. Although the Friedman test indicated overall differences between the classifiers for F1 and ROC-AUC (χ² F1 = 33.923, p = 1.8e-05; χ² ROC-AUC = 32.106, p = 3.9e-05), Holm-corrected pairwise (Wilcoxon) comparisons did not show statistically significant pairs after controlling for multiple comparisons. This pattern suggests that while there are indications of overall differences between models, the pair-by-pair differences are not sufficiently robust with the current evaluation structure (k = 5 folds), which advises caution when declaring an outright winner.

The SHAP interpretability analysis identified that questions related to victimization and perpetration (items cbv* and cbp*, respectively) had the greatest influence on predicting the bullying label. The positive correlation between the scores of these items and their SHAP values indicated that higher responses to these questions increased the model's contribution to the positive class. In contrast, some psychological scales (e.g., pbjw7 and sef6) showed negative correlations, suggesting that certain traits or beliefs could act as protective factors or be inversely related to the estimated probability of cyberbullying. These observations are consistent with previous theories and findings linking experiences of victimization/perpetration to the likelihood of involvement in cyberbullying episodes. Several caveats qualify these results. First, the use of SMOTE to balance classes is appropriate for mitigating imbalance; however, it introduces synthetic observations that can alter the true population distribution and, in some cases, inflate metrics if not externally validated. Second, with k = 5, only five observations per classifier are available for the paired tests, which reduces the statistical power of the tests and partly explains the lack of significant pairs after correcting for multiple comparisons. Finally, the sample came from German universities and may contain cultural or contextual biases; generalization to other populations or environments requires external validation.

In terms of practical applications, the models can be used as human-in-the-loop support systems for early detection: automatic recommendations and filters that alert guidance teams, not as autonomous decision-making tools. Operational use should include oversight of psychology/student service professionals, clear review procedures, and privacy/consent protocols.


13)    The English in the present paper is not of publication quality and require major improvement. Please carefully proof-read spell check to eliminate grammatical errors. As an example: “Machine learning models, particularly decision trees, how a high potential for accurately detecting cyberbullying.”

            R:/

We have corrected the grammatical errors throughout the document.

 

 

Sincerely,

 

The Authors

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This is a good, well-presented, and well-written paper. Congratulations.

Comments:

Include a table with the hyperparameters used in each model.

The models used are classic and old ML models; I suggest implementing more recent models used in classification (GB, RNN, LSTM).

It would also be desirable to use other performance metrics such as Cohen's kappa (κ) metric, the Jaccard index (IJ), and Matthews' correlation coefficient (MCC). These metrics allow us to determine the quality of the database structure and how the classification models will perform under uncertainty (real-world domains).

The conclusions should be rewritten in terms of the objective and contributions of the work (structural quality of the dataset on cyberbullying in university students).

Author Response

Reviewer 2's responses:

Dear Reviewer,

We greatly appreciate the time and effort you put into reviewing our work, "Comparative Analysis of Classification Models for Cyberbullying Detection in University Environments." Your comments have allowed us to reflect deeply on various aspects of our study and, in most cases, substantially improve the manuscript.

We have addressed all of your points in the review. For most of your suggestions, we have modified the text accordingly. We trust that our clarifications address your concerns.

We greatly appreciate your contribution to the process of improving this article.

 

Comments:

 

Include a table with the hyperparameters used in each model.

R:/

Table 2. Hyperparameters used in each model.

| Classifier | Fixed hyperparameters (explicit) |
|---|---|
| Random Forest | n_estimators=200, random_state=42 |
| Logistic Regression | max_iter=1000, random_state=42 |
| Decision Tree | random_state=42 |
| K-Nearest Neighbors (KNN) | n_neighbors=5 |
| Support Vector Classifier (SVC) | probability=True, random_state=42 |
| Linear Discriminant Analysis (LDA) | default parameters (no explicit change) |
| Multi-layer Perceptron (MLP) | hidden_layer_sizes=(100,), max_iter=500, random_state=42 |
| XGBoost (XGBClassifier) | use_label_encoder=False, eval_metric='logloss', random_state=42 |
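A minimal sketch instantiating these classifiers with the fixed hyperparameters from Table 2 (illustrative; all other parameters are left at their scikit-learn/xgboost defaults, and use_label_encoder is ignored or removed in recent xgboost releases):

```python
# Illustrative sketch: the eight classifiers with the Table 2 hyperparameters.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

classifiers = {
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(probability=True, random_state=42),
    "LDA": LinearDiscriminantAnalysis(),  # defaults, no explicit change
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42),
}
```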

 

 

The models used are classic and old ML models; I suggest implementing more recent models used in classification (GB, RNN, LSTM).

R:/

We have included XGBoost (XGBClassifier)

 

It would also be desirable to use other performance metrics such as Cohen's kappa (κ) metric, the Jaccard index (IJ), and Matthews' correlation coefficient (MCC). These metrics allow us to determine the quality of the database structure and how the classification models will perform under uncertainty (real-world domains).

R:/

Eight classification models were evaluated to detect cyberbullying in a sample of university students. The models were compared based on the following standard evaluation metrics: precision, recall, F1 score, and accuracy, as shown in Figure 3 and Table 3.

 

Figure 3. Comparison of performance between models.

Table 3. Comparison of performance.

| Model | Accuracy (mean ± std) | Precision (mean ± std) | Recall (mean ± std) | F1 (mean ± std) | ROC-AUC (mean ± std) |
|---|---|---|---|---|---|
| DecisionTree | 0.99186992 ± 0.01817941 | 1 ± 0 | 0.98630137 ± 0.03063107 | 0.9929078 ± 0.01585864 | 0.99315068 ± 0.01531553 |
| RandomForest | 0.97560976 ± 0.02072772 | 0.99166667 ± 0.0124226 | 0.96712329 ± 0.03153661 | 0.97901478 ± 0.01802444 | 0.9949589 ± 0.00670212 |
| XGBoost | 0.95447154 ± 0.03020191 | 0.99108225 ± 0.00814842 | 0.93150685 ± 0.04543322 | 0.96001551 ± 0.02679039 | 0.97879452 ± 0.02007735 |
| LogisticRegression | 0.94471545 ± 0.01563852 | 0.98863692 ± 0.01159546 | 0.91780822 ± 0.03624317 | 0.95143064 ± 0.0146466 | 0.98838356 ± 0.00922926 |
| MLP | 0.88292683 ± 0.01957983 | 0.92723149 ± 0.01714454 | 0.87123288 ± 0.0248848 | 0.89821891 ± 0.01762364 | 0.94975342 ± 0.00913525 |
| SVC | 0.85365854 ± 0.03941203 | 0.91875508 ± 0.03457614 | 0.82739726 ± 0.05697708 | 0.8697079 ± 0.03771287 | 0.93484932 ± 0.01807326 |
| LDA | 0.76585366 ± 0.03515731 | 0.89994842 ± 0.03092746 | 0.68219178 ± 0.06002439 | 0.77463033 ± 0.04106359 | 0.84728767 ± 0.03139085 |
| KNN | 0.52195122 ± 0.02024374 | 0.88026144 ± 0.04781896 | 0.22465753 ± 0.0248848 | 0.35772027 ± 0.03498954 | 0.67975342 ± 0.02974455 |

For the F1 metric, the Friedman test [31] showed overall differences among the eight classifiers evaluated (χ² = 33.923, p = 1.8e-05). Similarly, for ROC-AUC (Figure 4), the Friedman test was significant (χ² = 32.106, p = 3.9e-05). However, after applying paired Wilcoxon comparisons and correcting with Holm's procedure, no pair-by-pair comparison reached significance (p < 0.05). In terms of practical magnitude, effect sizes were reported as paired Cohen's d values [32][33]. This combination of results suggests the presence of overall differences between models; however, the lack of significant pairs after correction indicates that individual differences are not robust to the sample size of the folds (k = 5).

Figure 4. Comparative Mean ROC across classifier.

TreeExplainer over RandomForest (SHAP) scores were employed to elucidate the contribution of each characteristic to the prediction of bullying. The variables exerting the most significant influence were cbp2, cbv1, cbv2, cbp1, and cbp10, as ordered by SHAP mean, as illustrated in Figure 5. Generally, responses indicating higher levels of perpetration or victimization were associated with an increased probability of bullying. Conversely, certain variables (e.g., pbjw7 and sef6) exhibited a negative association with the prediction, suggesting that they may function as protective factors or are inversely related to labeling within this dataset.

Figure 5. Feature importance SHAP – Random Forest (positive class)

The Decision Tree model demonstrated superior performance, achieving an average accuracy of 99.1% with exceptional evaluation metrics: precision of 100%, recall of 98.6%, and F1-score of 99.2%. These metrics indicate the model's high efficacy in accurately classifying both the positive and negative instances of cyberbullying. Furthermore, the area under the ROC curve (ROC-AUC) of 0.993 corroborates its substantial predictive capability, rendering it a reliable tool for the development of preventive monitoring systems within educational contexts.

The Random Forest model also exhibited commendable performance, with an average accuracy of 97.6%, precision of 99.2%, and F1-score of 97.9%. This robust performance, coupled with its ROC-AUC of 0.995, is one of the most reliable and consistent alternatives for predicting cyberbullying.

The XGBoost model ranked third and performed competitively, achieving an average accuracy of 95.4%, a precision of 99.1%, a recall of 93.1%, and an F1-score of 96%, with an ROC-AUC of 0.979. These metrics reflect a favorable balance between sensitivity and specificity, making it particularly appealing in scenarios where minimizing false negatives is crucial.

Logistic Regression demonstrated robust performance with an accuracy of 94.5%, precision of 98.9%, recall of 91.8%, and F1-score of 95.1%. Additionally, its ROC-AUC of 0.988 indicates that despite its simplicity, this model yields highly competitive results.

 Conversely, the Multilayer Perceptron (MLP) model exhibited moderate performance, with an accuracy of 88.3% and an F1-score of 89.8%, reflecting acceptable performance, but inferior to decision tree-based methods. In contrast, the SVC model yielded more limited results, with an accuracy of 85.4% and an F1-score of 86.9%. Although its precision was relatively high (91.8%), its recall of 82.7% suggests some challenges in detecting positive cyberbullying cases.

Linear Discriminant Analysis (LDA) demonstrated intermediate performance, with an accuracy of 76.6%, a precision of 90%, a recall of 68.2%, and an F1-score of 77.4%, indicating a higher false-negative rate compared to the more robust models.

Finally, the KNN model performed the worst, with an average accuracy of 52.2% and an F1-score of 35.8%, primarily owing to its low recall (22.5%). This suggests that the model encountered significant difficulties in adequately identifying positive cyberbullying cases.

In summary, the results indicate that the Decision Tree and Random Forest models are the most effective for this dataset, with metrics exceeding 97% in all categories and an excellent ROC curve performance. Notably, the Decision Tree has emerged as the most suitable option owing to its high capacity to accurately identify both cyberbullying and non-bullying cases, making it a valuable tool for the prevention and early detection of this phenomenon in educational institutions. Within the framework of institutional strategies to prevent and mitigate cyberbullying, this model possesses features that can be integrated into a support tool for early detection and timely intervention. However, it is recommended to validate these results using independent data and expand the dataset to ensure the robustness of the model in real-world and diverse scenarios.

 

 

The conclusions should be rewritten in terms of the objective and contributions of the work (structural quality of the dataset on cyberbullying in university students).

R:/

The main objective of this study was to evaluate the suitability and structural quality of a survey-based dataset on cyberbullying among university students for use in supervised machine learning systems, and to compare the predictive performance and interpretability of several standard classifiers on that dataset. To that end, we curated and documented a clean version of the Cyber Bullying among University Students dataset (n = 615, originally 100 variables; 98 variables used for modeling), defined a clear target variable that combines victimization and perpetration items (cbv1–cbv11 and cbp1–cbp11), and implemented a reproducible modeling pipeline that addresses missing data, categorical encoding, class imbalance (SMOTE applied only to training folds), feature scaling, stratified k-fold evaluation (k = 5), and model interpretability (SHAP).

From the modeling experiments, tree-based methods (Decision Tree and Random Forest) yielded the highest predictive scores on this dataset, while other methods (Logistic Regression, MLP, SVC, LDA, KNN) presented varied performance profiles. However, statistical comparison across classifiers (Friedman test; Wilcoxon paired tests with Holm correction) indicates that although there are global differences, pairwise differences are not uniformly robust after multiple-comparison correction, which is likely influenced by the CV design (k = 5) and finite sample size. Critically, model performance must therefore be interpreted with caution; exceptionally high metrics for a given classifier may reflect dataset idiosyncrasies (informative item overlap and redundancy among cbv/cbp items) rather than universally generalizable superiority.

Interpretability analyses using SHAP identified the cbv* and cbp* items (victimization and perpetration) as the primary drivers of the predictions. Several psychological scales (e.g., some perceived justice and self-efficacy items) were negatively associated with the predicted bullying probability, suggesting potential protective factors. These findings align with the theoretical literature and help explain what the models have learned, supporting the use of explainability as a part of any deployment pipeline.

Finally, while the dataset and pipeline provide a solid basis for exploratory and methodological work on automated cyberbullying detection, before operational deployment, we recommend (a) validation of independent external samples and data from other institutions or cultural contexts, (b) the use of repeated cross-validation or nested CV when hyperparameter selection is introduced, (c) careful monitoring of fairness across subgroups, and (d) maintaining human-in-the-loop review for any intervention triggered by automated detection. In summary, this study contributes to both a documented, model-ready dataset and a reproducible evaluation framework that advance the methodological foundations for responsible cyberbullying detection at the university level.

 

 

 

Sincerely,

 

The Authors

 

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper compares seven models for university cyberbullying detection using a well-described dataset, with Decision Tree achieving perfect performance, and emphasizes ethical considerations and psychosocial supervision in practice.

However, several issues limit the paper’s impact and generalizability:

  1.  Perfect metrics on a relatively small dataset (615 participants) suggest possible overfitting or data leakage. Please clarify whether strict train-test separation or nested cross-validation was applied.

  2. The study would be stronger with results from an independent dataset or at least a detailed discussion of generalization limitations.

  3.  Given the sensitive nature of cyberbullying detection, model transparency and explainability are important. Consider including feature importance analysis or SHAP/LIME explanations.

  4. As the dataset involves only German university students, results may not directly transfer to other cultural or linguistic contexts.

  5.  Provide more specifics on categorical encoding type, scaling, and SMOTE parameters.

 

Comments on the Quality of English Language

The English language is generally clear and understandable, with appropriate technical terminology. However, there are occasional grammatical inconsistencies, formatting issues, and minor awkward phrasing that could be improved for smoother readability and precision.

Author Response

Reviewer 3's responses:

Dear Reviewer,

We greatly appreciate the time and effort you put into reviewing our work, "Comparative Analysis of Classification Models for Cyberbullying Detection in University Environments." Your comments have allowed us to reflect deeply on various aspects of our study and, in most cases, substantially improve the manuscript.

We have addressed all of your points in the review. For most of your suggestions, we have modified the text accordingly. We trust that our clarifications address your concerns.

We greatly appreciate your contribution to the process of improving this article.

 

Comments:

However, several issues limit the paper’s impact and generalizability:

  1. Perfect metrics on a relatively small dataset (615 participants) suggest possible overfitting or data leakage. Please clarify whether strict train-test separation or nested cross-validation was applied.

R:/

Categorical variables, represented as text, were transformed into numeric variables to facilitate processing using machine-learning models. To assess the performance of the classifiers, stratified k-fold cross-validation (k = 5) was employed to provide a robust estimation of the model's generalization capability. Given the class imbalance in the target variable (bullying vs. non-bullying), the synthetic minority oversampling technique (SMOTE) was utilized to achieve class balance. In each iteration of the k-fold, the following procedure was implemented to prevent information leakage: (1) the dataset was partitioned into training and test sets according to the fold; (2) within the training set, missing values were imputed using the mean (SimpleImputer), and a StandardScaler was applied; (3) SMOTE was applied exclusively to the scaled training set to generate synthetic samples of the minority class; and (4) the classifier was trained using the augmented training set, and the validation (test) set of the fold was evaluated without any modification (i.e., without SMOTE, using the same imputer and scaler fitted on the training set). Figure 2 shows the confusion matrix of a decision tree evaluated using this protocol.

Figure 2. Decision tree confusion matrix.

SMOTE was configured with k_neighbors = 5. To address cases where the minority class within a fold had very few samples (fewer than six), the k_neighbors parameter was automatically set to a lower value to allow for the generation of synthetic samples, as suggested by the standard implementation. Random seeds were fixed to ensure reproducible results. Model performance was evaluated using the accuracy, precision, recall, F1 score, and area under the ROC curve (ROC-AUC). All reported metrics correspond to the mean and standard deviation calculated over the k folds of the cross-validation process. To prevent overfitting during the model evaluation, the hyperparameters of each classifier were defined a priori and kept fixed across all iterations, thus avoiding any within-fold optimization processes that could artificially inflate the metrics. Table 2 shows the hyperparameters used in each model.

Table 2. Hyperparameters used in each model.

| Classifier | Fixed hyperparameters (explicit) |
|---|---|
| Random Forest | n_estimators=200, random_state=42 |
| Logistic Regression | max_iter=1000, random_state=42 |
| Decision Tree | random_state=42 |
| K-Nearest Neighbors (KNN) | n_neighbors=5 |
| Support Vector Classifier (SVC) | probability=True, random_state=42 |
| Linear Discriminant Analysis (LDA) | default parameters (no explicit change) |
| Multi-layer Perceptron (MLP) | hidden_layer_sizes=(100,), max_iter=500, random_state=42 |
| XGBoost (XGBClassifier) | use_label_encoder=False, eval_metric='logloss', random_state=42 |

 

To compare the performance of the classifiers, the Friedman test was applied to the aggregated metrics per fold (k = 5, stratified cross validation). When Friedman was significant, pairwise comparisons between classifiers were performed using the Wilcoxon signed-rank test (paired). To control for type I errors due to multiple comparisons, the Holm correction was applied. In addition to p-values, the effect sizes were reported using paired Cohen's d values. Statistical comparisons were calculated based on the F1 and ROC-AUC metrics obtained per fold.

 

 

  2. The study would be stronger with results from an independent dataset or at least a detailed discussion of generalization limitations.

R:/

Limitations of the generalization. A primary limitation of this study is that the model evaluation was conducted on a single curated survey dataset of university students (n = 615). Although we followed the best practices to reduce optimistic bias, notably stratified k-fold cross-validation (k = 5), imputing and scaling using training-fold statistics only, and applying SMOTE solely to the training folds, these procedures do not substitute for an external test on a truly independent population. To provide more robust estimates of model stability, we augmented the evaluation with repeated cross-validation and bootstrap confidence intervals for key metrics and used SHAP explanations to assess feature-level stability across folds. Nonetheless, trained models may exploit dataset-specific idiosyncrasies (item redundancy or local response patterns), limiting their immediate generalizability to other institutions, cultures, or survey versions. Therefore, we recommend external validation of samples collected independently (other universities, different countries, or later cohorts) and subgroup analyses (e.g., by gender, faculty, or year) before any operational use. These caveats are now described in the Discussion and Conclusion sections.
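As an illustration of the bootstrap confidence intervals mentioned above (a generic percentile-bootstrap sketch, not necessarily the exact procedure used; it assumes pooled out-of-fold labels `y_true` and predictions `y_pred` as NumPy arrays):

```python
# Illustrative sketch: percentile-bootstrap confidence interval for a metric.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, metric=f1_score, n_boot=2000, alpha=0.05, seed=42):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```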

 

 

  3. Given the sensitive nature of cyberbullying detection, model transparency and explainability are important. Consider including feature importance analysis or SHAP/LIME explanations.

R:/

TreeExplainer over RandomForest (SHAP) scores were employed to elucidate the contribution of each characteristic to the prediction of bullying. The variables exerting the most significant influence were cbp2, cbv1, cbv2, cbp1, and cbp10, as ordered by SHAP mean, as illustrated in Figure 5. Generally, responses indicating higher levels of perpetration or victimization were associated with an increased probability of bullying. Conversely, certain variables (e.g., pbjw7 and sef6) exhibited a negative association with the prediction, suggesting that they may function as protective factors or are inversely related to labeling within this dataset.

Figure 5. Feature importance SHAP – Random Forest (positive class)

 

 

 

  4. As the dataset involves only German university students, results may not directly transfer to other cultural or linguistic contexts.

R:/

Limitations / Generalizability

Generalizability and cultural limitations. This study was based on a single survey dataset collected from German university students. Consequently, our results reflect the patterns and response distributions of this specific population, and may not be generalized directly to student populations in other countries, languages, or cultural contexts. Cultural differences can affect both the prevalence and expression of cyberbullying behaviors as well as the interpretation of questionnaire items, producing different response styles (e.g., acquiescence, social desirability) and factor structures. To mitigate optimistic conclusions, we applied stratified cross-validation, SMOTE only within the training folds, and stability checks (repeated CV and bootstrap confidence intervals). However, we emphasize that the external validation of independently collected samples (other universities, countries, or translated survey waves) and formal tests of measurement invariance (e.g., configural/metric/scalar invariance across groups) are required to establish transportability. Until such a validation is performed, any operational deployment should be local, human-supervised, and preceded by fairness audits across demographic subgroups.

 

  5. Provide more specifics on categorical encoding type, scaling, and SMOTE parameters.

R:/

In the code, we report the exact preprocessing choices: we encode categorical columns with LabelEncoder (per column), impute missing values with SimpleImputer(strategy='mean'), standardize features with StandardScaler(), and apply SMOTE only to the training partition in each fold with random_state=42 and k_neighbors = min(5, minority_count - 1) (SMOTE is skipped if minority_count ≤ 1). We also add a small table with these hyperparameters and a short paragraph describing limitations and alternatives (One-Hot/Target encoding, effect of scaling before SMOTE).

https://github.com/jeliecergomez/Machine_Learning/blob/main/CyberBullying6.ipynb
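A minimal sketch of the per-column categorical encoding step described above (illustrative, not copied verbatim from the notebook; it assumes the survey data are loaded into a pandas DataFrame `df` with text-valued columns stored as object dtype):

```python
# Illustrative sketch: one LabelEncoder per categorical (object-dtype) column;
# numeric columns are left unchanged.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = LabelEncoder().fit_transform(out[col].astype(str))
    return out
```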

 

Sincerely

 

The authors

 

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Advantages:  

1. Clear Contribution: The study provides a good comparison of machine learning models for cyberbullying detection, addressing an important gap in the university setting.  

2. Methodological rigor: We describe the setup steps (e.g., detail SMOTE for class imbalance) and evaluation metrics to enhance reproducibility.  

3. Practical Implications: Discussions on ethical frameworks and institutional integration find valuable practical implications for technology.  

Disadvantages  

  1. 100% performance for decision trees is not the opposite; Validation strategies (such as cross-validation) or external datasets should be included to ensure completeness.  
  2. Possible applicability of the dataset's focus on German students to other cultural/linguistic backgrounds - this bias should be explicitly discussed.  
  3. The ROC curve (Figure 4) is incorrectly labeled as "pulsar classifier", which may confuse the reader.  
  4. SVC/KNN models have low accuracy and lack deeper exploration (e.g. hyperparameter tuning or feature correlation).

Author Response

Reviewer 4's responses:

Dear Reviewer,

We greatly appreciate the time and effort you put into reviewing our work, "Comparative Analysis of Classification Models for Cyberbullying Detection in University Environments." Your comments have allowed us to reflect deeply on various aspects of our study and, in most cases, substantially improve the manuscript.

We have addressed all of your points in the review. For most of your suggestions, we have modified the text accordingly. We trust that our clarifications address your concerns.

We greatly appreciate your contribution to the process of improving this article.

Comments:

 

100% performance for decision trees is not the opposite; Validation strategies (such as cross-validation) or external datasets should be included to ensure completeness. 

R:/

Categorical variables, represented as text, were transformed into numeric variables to facilitate processing using machine-learning models. To assess the performance of the classifiers, stratified k-fold cross-validation (k = 5) was employed to provide a robust estimation of the model's generalization capability. Given the class imbalance in the target variable (bullying vs. non-bullying), the synthetic minority oversampling technique (SMOTE) was utilized to achieve class balance. In each iteration of the k-fold, the following procedure was implemented to prevent information leakage: (1) the dataset was partitioned into training and test sets according to the fold; (2) within the training set, missing values were imputed using the mean (SimpleImputer), and a StandardScaler was applied; (3) SMOTE was applied exclusively to the scaled training set to generate synthetic samples of the minority class; and (4) the classifier was trained using the augmented training set, and the validation (test) set of the fold was evaluated without any modification (i.e., without SMOTE, using the same imputer and scaler fitted on the training set). Figure 2 shows the confusion matrix of a decision tree evaluated using this protocol.

Figure 2. Decision tree confusion matrix.

SMOTE was configured with k_neighbors = 5. To address cases where the minority class within a fold had very few samples (fewer than six), the k_neighbors parameter was automatically set to a lower value to allow for the generation of synthetic samples, as suggested by the standard implementation. Random seeds were fixed to ensure reproducible results. Model performance was evaluated using the accuracy, precision, recall, F1 score, and area under the ROC curve (ROC-AUC). All reported metrics correspond to the mean and standard deviation calculated over the k folds of the cross-validation process. To prevent overfitting during the model evaluation, the hyperparameters of each classifier were defined a priori and kept fixed across all iterations, thus avoiding any within-fold optimization processes that could artificially inflate the metrics. Table 2 shows the hyperparameters used in each model.

Table 2. Hyperparameters used in each model.

| Classifier | Fixed hyperparameters (explicit) |
|---|---|
| Random Forest | n_estimators=200, random_state=42 |
| Logistic Regression | max_iter=1000, random_state=42 |
| Decision Tree | random_state=42 |
| K-Nearest Neighbors (KNN) | n_neighbors=5 |
| Support Vector Classifier (SVC) | probability=True, random_state=42 |
| Linear Discriminant Analysis (LDA) | default parameters (no explicit change) |
| Multi-layer Perceptron (MLP) | hidden_layer_sizes=(100,), max_iter=500, random_state=42 |
| XGBoost (XGBClassifier) | use_label_encoder=False, eval_metric='logloss', random_state=42 |

 

To compare the performance of the classifiers, the Friedman test was applied to the aggregated metrics per fold (k = 5, stratified cross validation). When Friedman was significant, pairwise comparisons between classifiers were performed using the Wilcoxon signed-rank test (paired). To control for type I errors due to multiple comparisons, the Holm correction was applied. In addition to p-values, the effect sizes were reported using paired Cohen's d values. Statistical comparisons were calculated based on the F1 and ROC-AUC metrics obtained per fold.

 

 

Possible applicability of the dataset's focus on German students to other cultural/linguistic backgrounds - this bias should be explicitly discussed. 

R:/

Limitations / Generalizability

Generalizability and cultural limitations. This study was based on a single survey dataset collected from German university students. Consequently, our results reflect the patterns and response distributions of this specific population, and may not be generalized directly to student populations in other countries, languages, or cultural contexts. Cultural differences can affect both the prevalence and expression of cyberbullying behaviors as well as the interpretation of questionnaire items, producing different response styles (e.g., acquiescence, social desirability) and factor structures. To mitigate optimistic conclusions, we applied stratified cross-validation, SMOTE only within the training folds, and stability checks (repeated CV and bootstrap confidence intervals). However, we emphasize that the external validation of independently collected samples (other universities, countries, or translated survey waves) and formal tests of measurement invariance (e.g., configural/metric/scalar invariance across groups) are required to establish transportability. Until such a validation is performed, any operational deployment should be local, human-supervised, and preceded by fairness audits across demographic subgroups.

 

 

The ROC curve (Figure 4) is incorrectly labeled as "pulsar classifier", which may confuse the reader. 

R:/

Figure 4. Comparative Mean ROC across classifier.

 

SVC/KNN models have low accuracy and lack deeper exploration (e.g. hyperparameter tuning or feature correlation).

R:/

For future work, we will perform hyperparameter tuning and feature-correlation analysis, since the dataset only has 615 instances.

 

 

 

 

 

Sincerely

 

 

The authors

Author Response File: Author Response.pdf