Article
Peer-Review Record

Prevention Is Better than Cure: Exposing the Vulnerabilities of Social Bot Detectors with Realistic Simulations

Appl. Sci. 2025, 15(11), 6230; https://doi.org/10.3390/app15116230
by Rui Jin and Yong Liao *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 24 April 2025 / Revised: 25 May 2025 / Accepted: 29 May 2025 / Published: 1 June 2025
(This article belongs to the Special Issue Artificial Neural Network and Deep Learning in Cybersecurity)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper deals with an exciting topic. The article has been read carefully, and some minor issues have been highlighted for the author(s) to consider.

(1) The proposed simulation-based framework for modeling evolved social bots represents a novel and valuable approach. However, to fully establish its efficacy, it is important to include a direct experimental comparison with existing bot simulation or generation techniques. Such benchmarking would clarify how the proposed method advances the state of the art in terms of realism, adversarial effectiveness, or detector evasion rate.

(2) Given the large-scale nature of the simulations and the multi-component structure of the proposed framework (e.g., bot influence model, user engagement model, profile editor), a detailed evaluation of the computational costs in terms of time and memory usage is necessary. This would provide insights into the scalability and practical feasibility of the proposed method for deployment in real-world settings or online platforms.

(3) While the paper mentions ablation experiments, a more systematic ablation study is recommended. Specifically, isolating the contribution of each component—such as the influence model, engagement model, or profile editing—would offer a clearer understanding of their respective roles in enhancing bot realism and detector evasion.

(4) The manuscript would benefit from a clearer articulation of how the proposed method differs from and improves upon prior approaches. Emphasizing the unique aspects of the bot strategy modeling (e.g., user engagement simulation, target preselection) and how these elements contribute to more effective penetration testing would strengthen the reader’s understanding of the framework’s innovation and significance.

(5) To enhance the interpretability and rigor of the evaluation, it is recommended to include additional quantitative performance metrics, such as confusion matrices, precision/recall/F1 scores, or ROC curves, particularly for the detection tasks. These metrics would provide a more granular view of the detectors' strengths and weaknesses when confronted with different bot variants.

(6) Including references to relevant literature on adversarial samples, such as the papers available at https://link.springer.com/article/10.1007/s11042-023-15961-2 would provide a comprehensive understanding of the potential vulnerabilities.

Author Response

Comments 1: The proposed simulation-based framework for modeling evolved social bots represents a novel and valuable approach. However, to fully establish its efficacy, it is important to include a direct experimental comparison with existing bot simulation or generation techniques. Such benchmarking would clarify how the proposed method advances the state of the art in terms of realism, adversarial effectiveness, or detector evasion rate.

Response 1: We agree and have made efforts to include relevant existing studies. For instance, the baseline model used in our experiments is adapted from ACORN, a prior bot simulation framework. However, we encountered difficulties in identifying other directly applicable bot generation techniques. Most existing approaches fail to account for the dynamic behavior of bots, instead generating static bot data without simulating their interactive actions or the consequences of those actions. As a result, these methods could not be feasibly integrated into our simulation environment, preventing a direct experimental comparison.

 

Comments 2: Given the large-scale nature of the simulations and the multi-component structure of the proposed framework (e.g., bot influence model, user engagement model, profile editor), a detailed evaluation of the computational costs in terms of time and memory usage is necessary. This would provide insights into the scalability and practical feasibility of the proposed method for deployment in real-world settings or online platforms.

Response 2: Thank you for your suggestion. We have presented a detailed complexity analysis in Section 3.7.

 

Comments 3: While the paper mentions ablation experiments, a more systematic ablation study is recommended. Specifically, isolating the contribution of each component—such as the influence model, engagement model, or profile editing—would offer a clearer understanding of their respective roles in enhancing bot realism and detector evasion.

Response 3: Apologies for the lack of clarity. We presented the comparison of the mean absolute error (MAE) of the influence model and the engagement model in the last paragraphs of Sections 3.3.1 and 3.3.2, respectively. These are now presented in Section 4.2. As for the components contributing to detector evasion, we believe it would be more appropriate to evaluate them in more realistic simulations, i.e., with the proposed influence and engagement models.

Comments 4: The manuscript would benefit from a clearer articulation of how the proposed method differs from and improves upon prior approaches. Emphasizing the unique aspects of the bot strategy modeling (e.g., user engagement simulation, target preselection) and how these elements contribute to more effective penetration testing would strengthen the reader’s understanding of the framework’s innovation and significance.

Response 4: Thank you for pointing this out. We have included the current research status in the first paragraphs of Sections 3.3.1, 3.3.2, 3.4, and 3.5.

Comments 5: To enhance the interpretability and rigor of the evaluation, it is recommended to include additional quantitative performance metrics, such as confusion matrices, precision/recall/F1 scores, or ROC curves, particularly for the detection tasks. These metrics would provide a more granular view of the detectors' strengths and weaknesses when confronted with different bot variants.

Response 5: While we appreciate the reviewer’s suggestion, we are unsure how to include the mentioned performance metrics. To rule out the potential advantages and disadvantages of using existing bots, the proposed framework simulates creating a new bot account and controlling this bot. Thus, the testing set would only include the generated bot accounts.

Comments 6: Including references to relevant literature on adversarial samples, such as the papers available at https://link.springer.com/article/10.1007/s11042-023-15961-2 would provide a comprehensive understanding of the potential vulnerabilities.

Response 6: Thank you for your advice. We have included the provided paper in the second paragraph of the introduction.

Reviewer 2 Report

Comments and Suggestions for Authors

The proposed approach and the experiments performed are rather interesting, and the paper could be considered for publication. Currently I see just several weaknesses that should be improved:

  • The article in fact combines two approaches, simulation and ML methods, and it is sometimes unclear from the text what is being done. A high-level diagram presenting the approach would be helpful for understanding.
  • I do not agree that simulation is a good method for bot detection. It is fine for analysis, forecasting, etc. Even in the experiments, ML methods are in fact used for detection. I suggest strengthening/reformulating the motivational part.
  • The author contributions are not defined according to the requirements.

Author Response

Comments 1: The article in fact combines two approaches, simulation and ML methods, and it is sometimes unclear from the text what is being done. A high-level diagram presenting the approach would be helpful for understanding.

Response 1: We agree and have updated figure 2 for a clearer presentation of the framework by specifying the techniques and purposes.

Comments 2: I do not agree that simulation is a good method for bot detection. It is fine for analysis, forecasting, etc. Even in the experiments, ML methods are in fact used for detection. I suggest strengthening/reformulating the motivational part.

Response 2: Thank you for your suggestion. We have updated the abstract and the first and second paragraphs of the introduction to clarify that the purpose of the simulations is to help explore the possible adversarial behaviors of the bots.

Comments 3: The author contributions are not defined according to the requirements.

Response 3: Thank you for pointing this out. We have revised the author contributions according to the requirements.

Reviewer 3 Report

Comments and Suggestions for Authors

Please, find my remarks below:

  1. The authors use a bot score threshold of 0.5 as the cutoff for detection. While this is a standard default in binary classification, it is a heuristic. The paper would be strengthened by exploring how varying this threshold impacts false positives and false negatives, especially given the adversarial nature of the evaluation.
  2. Several parameters (e.g., q=0.3, cluster count = 1000, etc.) are introduced without justification or tuning. These choices affect key components of the model and should be supported by either empirical evidence or ablation studies.
  3. The authors largely avoid text-based features, citing multilingual limitations. However, this rationale is outdated. Multilingual language models allow robust cross-language feature extraction. Given that bots often have content-driven goals (e.g., propaganda, spam), the lack of content analysis significantly limits detection power, especially for more sophisticated or goal-oriented bots.
  4. Reputation is defined solely by the inverse of a bot score. This is a functional heuristic for simulation, but it lacks empirical validation. The paper would benefit from testing whether this metric correlates with actual influence or trustworthiness.
  5. The paper only evaluates its simulated bots against two detectors, without comparing the simulation approach to other evasion methods or bot-generation models. This limits the ability to judge the novelty or effectiveness of the proposed approach within the broader research landscape.
  6. While the paper rightly focuses on evasion and influence (e.g., survival rate, follower growth), it avoids classical evaluation metrics (e.g., precision, recall, F1-score). Including these would offer a complementary view of detection performance.
  7. The authors do not conduct feature ablation or importance analysis. It remains unclear which features contribute most to bot success or evasion. Additionally, the system assumes a single type of bot behavior, without testing generalizability to different bot strategies or objectives.

Author Response

Comments 1: The authors use a bot score threshold of 0.5 as the cutoff for detection. While this is a standard default in binary classification, it is a heuristic. The paper would be strengthened by exploring how varying this threshold impacts false positives and false negatives, especially given the adversarial nature of the evaluation.

Response 1: Thank you for your suggestion. We have considered the effect of this threshold, and have therefore used the average bot score as an evaluation metric. We have also analyzed the potential consequences of moving the threshold toward the average bot score of our bots in the last paragraph of Section 4.3.
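The trade-off the reviewer raises can be illustrated with a short sketch. The score distributions below are synthetic stand-ins (not data from the paper); the point is only that sweeping the threshold trades false positives against bot survival:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic detector outputs: humans skew toward low bot scores,
# evasive bots skew higher but overlap with the human distribution.
scores_humans = rng.beta(2, 8, size=1000)
scores_bots = rng.beta(4, 6, size=200)

for thr in (0.3, 0.5, 0.7):
    fpr = float((scores_humans >= thr).mean())  # humans wrongly flagged
    fnr = float((scores_bots < thr).mean())     # bots that survive detection
    print(f"threshold={thr}: FPR={fpr:.3f}, bot survival={fnr:.3f}")
```

Raising the threshold always lowers the false-positive rate and raises bot survival, which is why reporting the average bot score (rather than a single cutoff) gives a threshold-independent view.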

Comments 2: Several parameters (e.g., q=0.3, cluster count = 1000, etc.) are introduced without justification or tuning. These choices affect key components of the model and should be supported by either empirical evidence or ablation studies.

Response 2: We apologize for the lack of clarity. q is a parameter of the influence model used in an existing study. Since we mapped the clusters to the discrete action space of the target selection agent, the number of clusters was set to 1,000 to ensure that the number of alternative targets aligns with previous studies. We have cited the corresponding paper near Equation 3 and in the second paragraph of Section 3.4.
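The cluster-to-action mapping described above can be sketched as follows. This is a toy illustration with synthetic user embeddings and 10 clusters instead of the paper's 1,000; the feature dimensionality and the `targets_for_action` helper are assumptions, not details from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
user_features = rng.normal(size=(500, 8))  # synthetic user embeddings

# Cluster the user base; each cluster index then serves as one
# discrete action for the target selection agent.
n_clusters = 10  # the paper uses 1,000 on the full user base
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(user_features)

def targets_for_action(action_id):
    """Resolve a discrete action back to the candidate users in that cluster."""
    return np.flatnonzero(km.labels_ == action_id)

print(len(targets_for_action(3)), "candidate targets in cluster 3")
```

The design choice is that the agent picks a cluster, not an individual user, keeping the action space fixed-size regardless of how many users the simulation contains.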

Comments 3: The authors largely avoid text-based features, citing multilingual limitations. However, this rationale is outdated. Multilingual language models allow robust cross-language feature extraction. Given that bots often have content-driven goals (e.g., propaganda, spam), the lack of content analysis significantly limits detection power, especially for more sophisticated or goal-oriented bots.

Response 3: We apologize for any possible misunderstanding. We did not adopt manually engineered text-based features when training the RF-based detector, due to multilingual limitations. However, we have also adopted BotRGCN, a DL-based detector that does utilize the users' tweets. We have emphasized this in Section 3.3.3.

Comments 4: Reputation is defined solely by the inverse of a bot score. This is a functional heuristic for simulation, but it lacks empirical validation. The paper would benefit from testing whether this metric correlates with actual influence or trustworthiness.

Response 4: We appreciate the reviewer's valuable suggestion regarding target selection. While we acknowledge that a more human-like target selection mechanism could potentially enhance the bot's strategy model, we would like to clarify that the current heuristic approach was specifically designed for selecting targets for the bot rather than simulating real-user behavior. As such, we respectfully submit that empirical validation may not be essential for this particular implementation.

Comments 5: The paper only evaluates its simulated bots against two detectors, without comparing the simulation approach to other evasion methods or bot-generation models. This limits the ability to judge the novelty or effectiveness of the proposed approach within the broader research landscape.

Response 5: We agree and have made efforts to include relevant existing studies. For instance, the baseline model used in our experiments is adapted from ACORN, a prior bot simulation framework. However, we encountered difficulties in identifying other directly applicable bot generation techniques. Most existing approaches fail to account for the dynamic behavior of bots, instead generating static bot data without simulating their interactive actions or the consequences of those actions. As a result, these methods could not be feasibly integrated into our simulation environment, preventing a direct experimental comparison.

Comments 6: While the paper rightly focuses on evasion and influence (e.g., survival rate, follower growth), it avoids classical evaluation metrics (e.g., precision, recall, F1-score). Including these would offer a complementary view of detection performance.

Response 6: While we appreciate the reviewer’s suggestion, we are unsure how to include the mentioned performance metrics. To rule out the potential advantages and disadvantages of using existing bots, the proposed framework simulates creating a new bot account and controlling this bot. Thus, the testing set would only include the generated bot accounts.

Comments 7: The authors do not conduct feature ablation or importance analysis. It remains unclear which features contribute most to bot success or evasion. Additionally, the system assumes a single type of bot behavior, without testing generalizability to different bot strategies or objectives.

Response 7: Thank you for your suggestion. We agree that a feature ablation analysis would help strengthen the manuscript. However, we hope the reviewer can appreciate the challenge of conducting a feature ablation analysis on an RF-based classifier with hundreds of features, let alone explaining the features utilized in the DL-based detector. We have also considered conducting a simplified feature ablation analysis by dividing the features into several categories (such as profile-based or graph-based), but the component ablation analysis we presented can also interpret the contribution of different kinds of features.

       For the second concern, we have introduced two kinds of bots: fake accounts and influencer bots. Overall, we believe that fake accounts and influencer bots represent the majority of bot types. For instance, bots designed to manipulate votes and fake followers are essentially fake accounts since their primary goal is to evade detection. Similarly, influencer bots embody the objectives of spam and political bots, as they aim to gain influence while avoiding detection. Therefore, unless we are analyzing a unique type of bot with a different and uncommon goal, the two proposed reward functions should be sufficient. We have added the suggested content to the manuscript in Section 3.6.

Reviewer 4 Report

Comments and Suggestions for Authors

The article presents the following concerns:

  • Add the main quantitative results of the research in the abstract.
  • It is recommended not to use We; instead, write in the passive voice.
  • Add a brief introduction between section and subsection titles.
  • Please improve the quality of the images.
  • All variables and parameters in all equations must be named and described in the text.
  • Add a comparative table of the results of this research with previous work.
  • Although it is mentioned that the environment simulates Twitter using three directed graphs (followers, retweets, and mentions) and an attribute matrix, the assumptions of temporal dynamics and the exact rules governing the evolution of the social network are not specified. For example, how frequently is the graph topology updated? How is user inactivity handled?
  • The article states that a random forest (RF) model and BotRGCN were used as adversarial detectors but does not provide sufficient details about their training: What were the initial performance metrics (accuracy, F1) on TwiBot-22? How was the data split into train/test? Was there hyperparameter tuning?
  • The reduction of the action space through clustering and heuristics is understandable due to computational limitations, but it is unclear whether this introduces biases. For example, selecting users with low bot scores can generate an artificially "easy" target set. Was this strategy compared to a stratified random selection?
  • The profile editor's (GAN-like) training lacks critical details: no convergence metrics are reported, and it is not analyzed whether the generator significantly alters the true distribution of human profiles, which could generate bots that are easily detectable by more robust models. Nor is the stability of training between the surrogate model and the generator discussed.
  • Key metrics (survival, N(u), Pi, score) are reported as averages without any indication of variability. This makes it difficult to assess the robustness of the results.
  • In the GNN simulation, some bot models lose effectiveness when incorporating profile edits or preselection. Although briefly mentioned, the reason for this is not analyzed in depth, nor are clear technical hypotheses proposed. This weakens the explanatory value of the article.
  • It is mentioned that the influencer-type bot model is sensitive to the parameter θ (the weight between avoidance and influence). However, the effects of different values are not explored, nor is a sensitivity curve or Pareto analysis presented to guide its selection.

Author Response

Comments 1: Add the main quantitative results of the research in the abstract.

Response 1: Thank you for your suggestion. We have added the highest survival rate against both RF-based and GNN-based detection models in the abstract.

Comments 2: It is recommended not to use We; instead, write in the passive voice.

Response 2: Thank you for your suggestion. In response, we have revised the manuscript to employ passive voice constructions where appropriate.

Comments 3: Add a brief introduction between section and subsection titles.

Response 3: Thank you for your suggestion. We have added brief introductions for Sections 2, 3, and 4.

Comments 4: Please improve the quality of the images.

Response 4: Thank you for your suggestion. We have enlarged all images.

Comments 5: All variables and parameters in all equations must be named and described in the text.

Response 5: Thank you for your suggestion. We have updated Equations 1, 2, 4, and 5 by replacing u’ with v (which has been used to represent users in the first paragraph of Section 3.3.1) and added descriptions for U and q’ to ensure readability.

Comments 6: Add a comparative table of the results of this research with previous work.

Response 6: Thank you for your advice and we have made efforts to include relevant existing studies. For instance, the baseline model used in our experiments is adapted from ACORN, a prior bot simulation framework. However, we encountered difficulties in identifying other directly applicable bot generation techniques. Most existing approaches fail to account for the dynamic behavior of bots, instead generating static bot data without simulating their interactive actions or the consequences of those actions. As a result, these methods could not be feasibly integrated into our simulation environment, preventing a direct experimental comparison.

Comments 7: Although it is mentioned that the environment simulates Twitter using three directed graphs (followers, retweets, and mentions) and an attribute matrix, the assumptions of temporal dynamics and the exact rules governing the evolution of the social network are not specified. For example, how frequently is the graph topology updated? How is user inactivity handled?

Response 7: Apologies for the lack of clarity. To maintain computational efficiency, we freeze accounts that have no direct interactions with the controlled bot. The graph topology will be updated when the bot performs network-altering actions—such as following, retweeting, or mentioning other users—or when it receives engagements from other accounts within the simulation. We have added the suggested content in the first paragraph of section 3.3.

Comments 8: The article states that a random forest (RF) model and BotRGCN were used as adversarial detectors but does not provide sufficient details about their training: What were the initial performance metrics (accuracy, F1) on TwiBot-22? How was the data split into train/test? Was there hyperparameter tuning?

Response 8: Apologies for the lack of clarity. During the training of the RF detector, we assign weights of 1 to normal users and 3 to bots to account for class imbalance. In a 5-fold cross-validation on the dataset, the RF detector achieves an accuracy of 84.9% and an F1-score of 50.1%, while BotRGCN attains 79.7% accuracy and 57.5% F1-score. For our simulation, both detectors are trained on the entire user base of the dataset. For benchmarking, please refer to https://proceedings.neurips.cc/paper_files/paper/2022/hash/e4fd610b1d77699a02df07ae97de992a-Abstract-Datasets_and_Benchmarks.html . We have added the suggested content in section 3.3.3.
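The weighting and cross-validation scheme described above can be sketched as follows. The 1:3 sample weights and 5-fold protocol are from the response; the synthetic dataset, its size, and all other hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced stand-in for TwiBot-22 (label 0 = human, 1 = bot)
X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=0)

f1s = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    clf = RandomForestClassifier(random_state=0)
    # Per-sample weights from the response: 1 for normal users, 3 for bots
    w = np.where(y[train_idx] == 1, 3.0, 1.0)
    clf.fit(X[train_idx], y[train_idx], sample_weight=w)
    f1s.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print(f"5-fold F1: {np.mean(f1s):.3f} +/- {np.std(f1s):.3f}")
```

Upweighting the minority bot class pushes the forest toward higher recall on bots at some cost in precision, which is the usual motivation for this kind of reweighting on imbalanced data.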

Comments 9: The reduction of the action space through clustering and heuristics is understandable due to computational limitations, but it is unclear whether this introduces biases. For example, selecting users with low bot scores can generate an artificially "easy" target set. Was this strategy compared to a stratified random selection?

Response 9: We appreciate the reviewer’s valuable suggestion. As a preliminary evaluation, we implemented the proposed approach (additional actions + preselection models) in a simulated fake-account environment, substituting our original selection strategy with stratified random sampling. The observed metrics against the RF detector and BotRGCN were as follows:

    Survival rate: 94.4% (RF) and 48.0% (BotRGCN)

    Bot score: 0.391±0.010 (RF) and 0.024±0.076 (BotRGCN)

While this initial test did not yield significant performance improvements, we fully acknowledge the importance of further investigation and would seek to conduct a more systematic evaluation in the future.

Comments 10: The profile editor's (GAN-like) training lacks critical details: No convergence metrics are reported, and it is not analyzed whether the generator significantly alters the true distribution of human profiles, which could generate bots that are easily detectable by more robust models. Nor is the stability of training between the surrogate model and the generator discussed.

Response 10: We apologize for any possible misunderstanding. To clarify, during the training of the profile editor, the surrogate detector model remains fixed, ensuring there are no convergence issues arising from adversarial feedback loops. Regarding the concern that the generator significantly alters the true distribution of human profiles: as noted in Section 4.5, this phenomenon is indeed an anticipated outcome of our study design. Rather than viewing this as a limitation, we consider it a valuable opportunity to: (1) stress-test current detection paradigms under evolving adversarial conditions, and (2) provide benchmark data for developing more robust detectors capable of handling such distributional changes in the future.
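The fixed-surrogate setup can be illustrated with a toy gradient sketch. A frozen logistic model stands in for the surrogate detector and an additive, norm-bounded perturbation stands in for the profile editor; both are deliberate simplifications of the paper's GAN-like editor, not its actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=5)        # frozen surrogate: logistic bot-score model
profile = rng.normal(size=5)  # initial bot profile features

def bot_score(x):
    # Surrogate stays fixed for the whole editor optimization,
    # so there is no adversarial feedback loop to destabilize training.
    return 1.0 / (1.0 + np.exp(-(w @ x)))

edit = np.zeros(5)            # the "editor": an additive perturbation
lr, budget = 0.5, 1.0
for _ in range(200):
    s = bot_score(profile + edit)
    grad = s * (1 - s) * w    # d(score)/d(edit); descend to evade the surrogate
    edit -= lr * grad
    # keep the edit norm-bounded so the profile stays plausible
    norm = np.linalg.norm(edit)
    if norm > budget:
        edit *= budget / norm

print(f"score before: {bot_score(profile):.3f}, "
      f"after: {bot_score(profile + edit):.3f}")
```

Because the surrogate's parameters never move, the optimization is an ordinary descent problem rather than a min-max game, which is why no discriminator convergence metrics arise in this setup.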

Comments 11: Key metrics (survival, N(u), Pi, score) are reported as averages without any indication of variability. This makes it difficult to assess the robustness of the results.

Response 11: We agree and have added the standard deviation of N(u), Pi, and bot score for a more comprehensive evaluation. Please note that the standard deviation is not applicable to the survival rate.

Comments 12: In the GNN simulation, some bot models lose effectiveness when incorporating profile edits or preselection. Although briefly mentioned, the reason for this is not analyzed in depth, nor are clear technical hypotheses proposed. This weakens the explanatory value of the article.

Response 12: Apologies for the lack of clarity. We believe the observation that abstaining from certain actions might have yielded better outcomes, yet the bot strategy model failed to recognize this, implies that while such actions offer short-term rewards, they can be detrimental in the long run, disturbing the training of the bot strategy model. We have revised the second paragraph of Section 4.3 to provide a clearer technical hypothesis.

Comments 13: It is mentioned that the influencer-type bot model is sensitive to the parameter θ (the weight between avoidance and influence). However, the effects of different values ​​are not explored, nor is a sensitivity curve or Pareto analysis presented to guide its selection.

Response 13: We appreciate the reviewer's valuable suggestion regarding parameter optimization. However, please note that systematic tuning of this parameter presents significant challenges, as its effects are highly context-dependent across different scenarios and bot strategies. A brute-force exploration would be very time-consuming given the combinatorial complexity involved. More fundamentally, we wish to reiterate that our study's primary objective was to help explore the possible adversarial behaviors of the bots and support vulnerability analysis, rather than to optimize the bots by tuning parameters.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I recommend the acceptance.

Reviewer 4 Report

Comments and Suggestions for Authors

The manuscript has been updated properly.
