Review Reports
- Mevlüt Uysal
Reviewer 1: Wenbin Hsieh
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Ahmed Al Mulhem
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The following issues warrant further consideration and remediation.
- Table 3 shows that some features (warping, rotation, distortion, wrapping) are always applied (67,758 True, 0 False), while others vary. This makes it impossible to isolate their individual effects, contradicting the goal of identifying which features most affect solvability.
- The reported 14.38% human success rate is unusually low compared to prior CAPTCHA studies (~70–90%). This suggests possible usability flaws in the dataset or user interface rather than genuine difficulty differences.
- The paper does not clarify whether humans and the CNN model solved the same CAPTCHA samples. The mention of 67,758 log entries for 45,166 CAPTCHAs leaves the mapping between human attempts and model predictions ambiguous, making the human–machine comparison potentially unreliable.
- The paper fails to establish a one-to-one mapping between human attempts and machine predictions; without this, the 71.7% vs. 14.4% accuracy contrast is statistically meaningless.
- The interpretation of the SHAP results is unclear, especially for color variation. Figures 7–8 show it improves human accuracy, while Figures 10–12 suggest it helps machines. The authors should clarify that SHAP values were computed per class and explain why the same feature has opposite effects; otherwise, the analysis appears inconsistent and unreliable.
- The classification scheme defining Classes 0–3 (Table 4) and the later analyses referencing Class 1 and Class 2 are not clearly aligned. It is unclear whether these classes are used as categorical outcome variables in model training or simply as descriptive groupings for performance comparison. This ambiguity makes it difficult to interpret the subsequent SHAP and correlation results and to understand how class imbalance was handled.
- The manuscript contains numerous grammatical and typographical errors, which detract from its professional quality; the text should be carefully proofread.
- Using Pearson’s r for categorical or ordinal variables can be misleading because it assumes interval-scale data and linear relationships. Such variables often violate these assumptions, leading to biased or attenuated correlations.
Comments for author File: Comments.pdf
Author Response
Comment 1
Table 3 shows that some features (warping, rotation, distortion, wrapping) are always applied, making it impossible to isolate their effects.
Response 1:
Thank you for this valuable comment. We agree with the reviewer’s observation. As already explained in Section 3 (Methodology, Feature Selection), these 5 features were intentionally applied to all CAPTCHA samples as baseline distortions to maintain realistic usability and security conditions. This design follows Shibbir et al. [46], who classified CAPTCHAs with seven or more features as secure. To ensure each CAPTCHA included representative transformations from all three recognition-resistance categories, two features from each were fixed and the remaining two were randomly assigned. We have emphasized this point in the text for clarity.
Location: Section 3, Feature Selection (Page 10, lines 389-399).
Comment 2
The 14.38% human success rate is unusually low compared to previous studies.
Response 2:
We appreciate the reviewer’s remark. The relatively low human success rate stems from methodological and dataset differences. Unlike earlier studies, our system also logged CAPTCHA refresh actions as failed attempts, since they reflect real usability difficulty. Moreover, our dataset contained up to seven concurrent security features, whereas studies such as Chatrangsan and Tangmanee [40] used only three. Finally, our participants represented a broader demographic range with different ages and education levels, not just university students.
A clarifying paragraph has been added to the Results section to explain these factors.
Location: Section 4, Results and Discussion (Page 15, Lines 565-574).
Comment 3
It is unclear whether humans and the CNN model solved the same CAPTCHA samples.
Response 3:
Thank you for pointing this out. Both the CNN model and human participants solved the same set of 45,166 CAPTCHA images. Each image had a unique identifier (CAPTCHA_ID). The CNN first solved all images, and these same images were later presented to human users. Multiple human attempts for some images resulted in 67,758 logs. The analysis matched human responses to their respective CAPTCHA_IDs to allow one-to-one comparison on identical samples.
This clarification has been added to the Methodology section.
Location: Section 3.2, Data Collection (Page 11, Lines 411-417).
Comment 4
There is no clear one-to-one mapping between human attempts and model predictions; the comparison between CNN and human accuracy is unclear.
Response 4:
We agree with this comment and have clarified the matching process between human and machine results. The revised text explains that all human responses were mapped to their corresponding CAPTCHA_IDs, and per-image average success rates were calculated for direct comparison with CNN outputs.
Location: Section 3.2, Data Collection (Page 11, Lines 411-417).
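For illustration, a minimal pandas sketch of the matching procedure described in Responses 3 and 4, assuming hypothetical column names (CAPTCHA_ID, human_correct, cnn_correct) and toy data; the authors' actual analysis scripts are in the OSF repository.

```python
import pandas as pd

# Toy stand-ins for the real logs (column names hypothetical):
# the study has 67,758 human attempts over 45,166 unique CAPTCHA images.
human_logs = pd.DataFrame({
    "CAPTCHA_ID": ["a1", "a1", "b2", "c3"],
    "human_correct": [0, 1, 0, 1],
})
cnn_preds = pd.DataFrame({
    "CAPTCHA_ID": ["a1", "b2", "c3"],
    "cnn_correct": [1, 1, 0],
})

# Per-image human success rate: average over repeated attempts.
human_rate = (human_logs.groupby("CAPTCHA_ID")["human_correct"]
              .mean().rename("human_rate").reset_index())

# One-to-one comparison on identical samples via CAPTCHA_ID.
merged = cnn_preds.merge(human_rate, on="CAPTCHA_ID", how="left")
print(merged)
```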
Comment 5
The SHAP interpretation appears contradictory—color variation seems to help both humans and machines.
Response 5:
We thank the reviewer for noting this. We have clarified that SHAP values were computed per class (Class 1 = human success, Class 2 = machine success). The same feature may have opposite SHAP contributions across classes; for example, color variation benefits both systems but in different ways—enhancing human readability and CNN feature detection separately.
This clarification has been added to the Results and Discussion section.
Location: Section 4.1.3, Page 21, Lines 689-698.
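A minimal sketch of per-class SHAP attribution on a toy classifier, illustrating how the same feature can receive separate contributions for Class 1 (human success) and Class 2 (machine success); the model, feature names, and data here are synthetic stand-ins, not the paper's.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4)).astype(float)   # toy binary feature flags
feature_names = ["color_variation", "hollow", "multi_layer", "noisy_background"]
y = rng.integers(1, 3, size=500)   # 1 = human-only success, 2 = machine-only

model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP values are computed separately for each class, as Response 5 clarifies.
sv = shap.TreeExplainer(model).shap_values(X)
# Older shap returns a list per class; newer returns an (n, p, classes) array.
sv = np.stack(sv, axis=-1) if isinstance(sv, list) else sv
for k, label in zip(range(sv.shape[-1]), ["Class 1 (human)", "Class 2 (machine)"]):
    mean_abs = np.abs(sv[:, :, k]).mean(axis=0)
    print(label, dict(zip(feature_names, mean_abs.round(3))))
```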
Comment 6
The meaning and use of the four classes (Table 4) and the issue of class imbalance are unclear.
Response 6:
We appreciate this comment and have clarified that these four classes were defined after model inference solely for comparative analysis. They were not used during CNN training. Classes 1 and 2 represent cases where either the human or the CNN solved the CAPTCHA exclusively, while Classes 0 (both succeeded) and 3 (both failed) were excluded since they offer limited discriminative information. Differences in sample counts between Classes 1 and 2 were addressed descriptively, using proportional comparisons rather than statistical reweighting.
A new explanatory paragraph has been added above Table 5.
Location: Section 4, Page 14, Lines 544-551.
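A small illustrative sketch of the class construction described above, with hypothetical column names; Classes 0 and 3 are dropped before the contrastive analysis.

```python
import pandas as pd

# Toy per-image outcomes (column names hypothetical).
df = pd.DataFrame({
    "human_solved": [1, 1, 0, 0],
    "cnn_solved":   [1, 0, 1, 0],
})

# Class 0 = both succeeded, 1 = human only, 2 = machine only, 3 = both failed.
df["cls"] = 2 * (1 - df["human_solved"]) + (1 - df["cnn_solved"])

# Only the exclusive-success classes enter the comparative analysis.
contrastive = df[df["cls"].isin([1, 2])]
print(df, contrastive, sep="\n\n")
```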
Comment 7
There are multiple grammatical and typographical errors throughout the manuscript.
Response 7:
We thank the reviewer for noting this. The entire manuscript has been carefully proofread to correct grammatical, typographical, and stylistic issues, including consistent use of the term “CAPTCHA” (instead of “Captcha”), article corrections, and improved sentence clarity.
Location: Throughout the manuscript.
Comment 8
The use of Pearson’s correlation for categorical variables is questionable, and the manuscript mentions normalization without clarifying what kind.
Response 8:
We appreciate this observation. Pearson’s r was used only for continuous variables such as accuracy rates, while binary and categorical variables were interpreted as indicative trends rather than strict linear measures. No data reweighting or statistical normalization was applied; results were analyzed in proportional terms to illustrate general tendencies. Additionally, the text clarifies that “BatchNormalization” refers to the internal CNN layer operation, unrelated to statistical normalization.
The explanation was added after the paragraph discussing correlation and model normalization.
Location: Section 3.4, Page 12, Lines 492-496.
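As a hedged illustration of the correlation choices described above: Pearson's r for the continuous accuracy rates, and Spearman's rho as the safer rank-based statistic for a binary feature flag. All data here are synthetic placeholders.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)
human_rate = rng.random(200)                        # continuous accuracy rates
cnn_rate = 0.5 * human_rate + 0.5 * rng.random(200)
hollow = rng.integers(0, 2, size=200)               # binary feature flag

r, p = pearsonr(human_rate, cnn_rate)               # both variables continuous
rho, p_rho = spearmanr(hollow, human_rate)          # rank-based; safer for binary
print(f"Pearson r={r:.3f} (p={p:.2g}); Spearman rho={rho:.3f} (p={p_rho:.2g})")
```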
Reviewer 2 Report
Comments and Suggestions for Authors
I quote the main sentences in the Abstract:
* This work contributes a publicly available dataset and a feature-impact framework, enabling deeper investigations into adversarial robustness, CAPTCHA resistance modeling, and security-aware human-interaction systems.
* Recognition performance was systematically evaluated using both a CNN-based solver and actual human interaction data collected through an online exam platform.
* Correlation and SHAP-based analyses were employed to quantify feature influence and identify configurations that optimize human-machine separability.
* The findings underscore the need for adaptive CAPTCHA mechanisms that are both human-centric and resilient against evolving AI-based attacks.
The author has collected a large and most probably representative dataset in order to evaluate the robustness of text-based CAPTCHAs against CNN-based solvers. The dataset is offered by the author at the OSF repository of the Center for Open Science, but it would be interesting to add the test source code as well.
The paper is thus an extensive report of the performed experiments; no novel knowledge is provided. Although the experimental results may be of practical interest, I consider that the paper in its current form is not suitable for a serial journal.
Author Response
Comment:
The author has collected a large and representative dataset. However, only the dataset is shared. It would be interesting to include the test source codes. Moreover, no novel knowledge is provided; the paper is mainly an experimental report and may not suit a serial journal.
Response:
We sincerely thank the reviewer for their constructive comments. We appreciate the recognition of our dataset and experimental framework. Following the suggestion, we have updated the OSF repository to include all source codes related to the CNN solver, data preprocessing, and SHAP-based feature-impact analyses. The repository link text has been updated in the Data Availability statement.
Regarding novelty, we have revised the Related Work to emphasize that the contribution of this study lies in introducing a comprehensive, reproducible framework that integrates human interaction logs, CNN-based recognition, and explainable AI (SHAP) analysis. This integrated approach enables systematic quantification of human–machine separability, providing new insights into adversarial robustness and usability-aware CAPTCHA design.
Location: Page 7, Lines 292-313.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors discuss topics from the subdomains of cybersecurity, more precisely applications in the field of text-based CAPTCHA models, with an emphasis on a comparative analysis of human use versus resilience to attacks carried out with certain types of convolutional neural networks (CNNs). What is remarkable is that the authors use real data.
In addition, to more clearly highlight the research methodology used, to facilitate the understanding of how such models can be implemented or adapted to specific requirements, as well as to increase the scientific value of the paper, we believe that the following aspects would be worth considering:
1) Fully document the CNN training/validation setup (exact split, augmentations, scheduler, early-stopping, seed, variance over multiple runs) and publish the code.
2) It would be useful to highlight, beyond their limitations, in the section describing similar solutions, what problems the proposed solution solves or what improvements it brings.
3) Report human solution times (medians/IQR) and abandonment rates on subsets of features; currently, human usability is assessed mostly as accuracy.
4) Control for string length/alphabet and report the conditional difficulty by character class.
5) In the SHAP section, add uncertainties (intervals/bootstrapping) and a robustness test for collinearity between features.
6) Include a usability-security cost-benefit analysis: e.g., “hollow + multi-layer” vs. “color variation + noisy background”, explicitly reporting the gain over the baseline.
7) Re-read the article for typos, e.g., "scapers", "Ab-abtain".
8) To clarify: the methodology states "7 features per CAPTCHA" but Table 3 indicates 6 "fixed" features (5 + one of deformation/blurring) plus "two more randomly chosen" in the text, which would suggest 7–8. However, since deformation and blurring are mutually exclusive, the number remains 7 (5 fixed + 1 of {deformation, blurring} + 1 random).
9) Check the bibliography for duplicates and some inadvertencies, e.g., refs. 34 and 35.
Author Response
Comments 1:
Fully document the CNN training/validation setup (exact split, augmentations, scheduler, early-stopping, seed, variance over multiple runs) and publish the code.
Response 1:
Thank you for this valuable comment. We have expanded the methodology section by adding a new subsection titled “3.3.1. Model Training and Validation Configuration.” This section now provides full details of the CNN setup, including dataset size and structure, preprocessing, training–validation–test splits (80/10/10 within a 90/10 division), optimizer settings, learning rate, batch size, early stopping, and random seed control. No augmentation was applied, as the dataset was sufficiently large to ensure sample diversity.
In addition, to enhance reproducibility, the complete training and testing scripts used in this study have been made publicly available in the project’s OSF repository, along with documentation describing the implementation environment and dependencies.
Location in manuscript: Section 3.3.1, Model Training and Validation Configuration, Page 12, Lines 447-466.
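A minimal Keras sketch consistent with the configuration summarized above (80/10/10 split, early stopping, fixed seed). The architecture, input shape, patience, batch size, and single-character labels are placeholders rather than the paper's actual model, which is published in the OSF repository; the real task predicts five characters per image.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

SEED = 42                                   # fixed seed, as described above
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Toy tensors standing in for the CAPTCHA images; simplified here to
# single-label classification over a 36-symbol alphabet.
X = np.random.rand(1000, 50, 160, 1).astype("float32")
y = np.random.randint(0, 36, size=(1000,))  # 26 letters + 10 digits

# 80/10/10 train/validation/test proportions named in Response 1.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.2, random_state=SEED)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                            random_state=SEED)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 160, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.BatchNormalization(),   # the layer noted in Response 8 above
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(36, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(X_tr, y_tr, validation_data=(X_val, y_val),
          epochs=50, batch_size=64, callbacks=[early_stop], verbose=0)
print(model.evaluate(X_te, y_te, verbose=0))
```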
Comments 2:
It would be useful to highlight, beyond their limitations, in the section describing similar solutions, what problems the proposed solution solves or what improvements it brings.
Response 2:
We agree with this helpful suggestion. A paragraph has been added in Section 2 (Related Work) emphasizing the novelty of our approach. Specifically, we highlight that our framework integrates real human performance data, CNN-based analysis, and SHAP interpretability to jointly assess human–machine separability, which has not been examined in prior CAPTCHA studies.
Location: Section 2, Page 7, Lines 292-313
Comments 3:
Report human solution times (medians/IQR) and abandonment rates on subsets of features; currently, human usability is assessed mostly as accuracy.
Response 3:
We thank the reviewer for this valuable comment. As detailed timing metadata were only partially logged, this limitation has been acknowledged in the manuscript, and a note has been added indicating that future experiments will include per-attempt timing and abandonment rates to provide a more comprehensive usability analysis.
Location: Section 4, Page 13, Lines 525-527
Comments 4:
Control for string length/alphabet and report the conditional difficulty by character class.
Response 4:
We thank the reviewer for this observation. We have clarified that all CAPTCHA samples in this study consist of five-character alphanumeric strings (26 lowercase letters and 10 digits) drawn uniformly. This ensures consistent difficulty and eliminates bias related to string length or character type.
Location: Section 3.2, Page 10, Lines 365-367
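A one-function sketch of the string sampling described above: five characters drawn uniformly from the 36-symbol alphabet of lowercase letters and digits.

```python
import random
import string

ALPHABET = string.ascii_lowercase + string.digits   # 26 letters + 10 digits

def captcha_text(rng: random.Random, length: int = 5) -> str:
    """Draw a five-character alphanumeric string uniformly, as described."""
    return "".join(rng.choices(ALPHABET, k=length))

print(captcha_text(random.Random(0)))   # a random 5-character sample
```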
Comments 5:
In the SHAP section, add uncertainties (intervals/bootstrapping) and a robustness test for collinearity between features.
Response 5:
Thank you for this insightful suggestion. We have added a note in the SHAP analysis section indicating that SHAP values were computed over three independent model runs with different random seeds, yielding consistent importance rankings (variance < 0.02). Feature collinearity was also examined using Spearman’s ρ, confirming low inter-feature dependency (|ρ| < 0.35).
Location: Section 3.4, Page 13, Lines 497-500
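A minimal sketch of the collinearity screen mentioned above, computing pairwise Spearman's rho over a toy binary feature matrix; the 6-feature matrix here is synthetic, and the reported bound (|ρ| < 0.35) comes from the paper's real data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(1000, 6)).astype(float)   # toy binary feature flags

# Pairwise Spearman rho among features (columns are treated as variables).
rho, _ = spearmanr(X)
off_diag = np.abs(rho[~np.eye(rho.shape[0], dtype=bool)])
print(f"max |rho| = {off_diag.max():.3f}")
```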
Comments 6:
Include a usability–security cost-benefit analysis, e.g., “hollow + multi-layer” vs. “color variation + noisy background.”
Response 6:
We agree with this recommendation. A paragraph discussing usability–security trade-offs among different feature combinations has been added to the Discussion section. The comparison indicates that feature pairs such as hollow + multi-layer improved human readability while maintaining lower CNN success rates, suggesting a balanced design choice.
Location: Section 5, Page 21, Lines 739-742
Comments 7:
Re-read the article for typos. Ex: “scapers”, “Ab-abtain.”
Response 7:
We thank the reviewer for noticing this. The entire manuscript has been carefully proofread, and all typographical and grammatical errors have been corrected.
Location: Throughout the manuscript.
Comments 8:
Clarify: methodology states “7 features per CAPTCHA” but Table 3 indicates 6 fixed plus 2 random, which could suggest 7–8.
Response 8:
We appreciate this precise observation. We have clarified in the methodology section that deformation and blurring are mutually exclusive features; therefore, each CAPTCHA contains exactly seven active security features.
Location: Section 3.2, Page 10, Lines 389-399
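A small sketch of the 5 + 1 + 1 feature-assignment arithmetic clarified above; the feature names follow the paper's discussion but are partly assumed here.

```python
import random

# Five baseline distortions are always applied; deformation and blurring are
# mutually exclusive; one further optional feature is drawn at random, giving
# exactly seven active features per CAPTCHA.
BASELINE = ["warping", "rotation", "distortion", "wrapping", "overlapping"]
EXCLUSIVE_PAIR = ["deformation", "blurring"]      # exactly one applied
OPTIONAL = ["hollow", "multi_layer", "color_variation", "noisy_background"]

def assign_features(rng: random.Random) -> set:
    features = set(BASELINE)
    features.add(rng.choice(EXCLUSIVE_PAIR))      # 6th feature
    features.add(rng.choice(OPTIONAL))            # 7th feature
    return features

print(assign_features(random.Random(1)))          # 7 active features
```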
Comments 9:
Check bibliography for duplicates and inadvertencies (e.g., refs 34 and 35).
Response 9:
All references have been reviewed, and duplicated or redundant entries have been corrected.
Location: Reference list, page 24, Lines 846-848.
Reviewer 4 Report
Comments and Suggestions for Authors
This study presents a comprehensive forensic and security-oriented analysis of text-based CAPTCHA systems, focusing on how individual and combined visual distortion features affect human usability and machine solvability. This work contributes a publicly available dataset and a feature-impact framework, enabling deeper investigations into adversarial robustness, CAPTCHA resistance modeling, and security-aware human-interaction systems. The findings underscore the need for adaptive CAPTCHA mechanisms that are both human-centric and resilient against evolving AI-based attacks.
The paper makes good contributions. Some comments are as follows:
1. Add a table in the related work section to compare previous studies in terms of objectives, gaps, research problems, and limitations.
2. Justify why the study uses a CNN-based solver and actual human interaction methods. What about other methods? Is there any comparison?
3. Add more discussion regarding this combination between CNN-based solver and actual human interaction methods.
4. In the Introduction, I noticed there are limited references to support the claims in these paragraphs. The authors have to add more recent studies. I hope this study is helpful to you: Analyzing Cybersecurity Threats on Mobile Phones. STAP Journal of Security Risk Management. https://doi.org/10.63180/jsrm.thestap.2023.1.2.
5. Highlight the key contributions of the paper, such as the specific CNN-based solver and human-interaction solutions, and their impact on designs that are human-centric and resilient against evolving AI-based attacks.
6. The literature review is extensive but lacks a clear structure and synthesis of the findings. I hope this study is helpful to you: The Role of Simulating Digital Threats through Interactive Theater Performances. Journal of Cyber Security and Risk Auditing. https://doi.org/10.63180/jcsra.thestap.2025.4.7.
7. Clearly state the primary challenges faced in Web apps (e.g., data privacy, interoperability, scalability) within the first few paragraphs. Highlight the key contributions of the paper, such as specific solutions and their impact on attack detection.
Author Response
Comments 1:
Add a table in the related work section to compare previous studies in terms of objectives, gaps, research problems, and limitations.
Response 1:
Thank you for this valuable suggestion. We have added a comparative summary table in the Related Work section, presenting prior studies with respect to their objectives, datasets, limitations, and identified research gaps. This table clarifies how our study extends existing works through its combined human–machine evaluation and explainable feature-impact analysis.
Location in manuscript: Section 2 (Related Work), Table 1, page 7.
Comments 2:
Justify why the study uses a CNN-based solver and actual human interaction methods. What about other methods? Is there any comparison?
Response 2:
We appreciate this insightful comment. A short paragraph has been added in the methodology section to explain the rationale for using a CNN-based solver and human interaction data. The CNN model was selected as it represents the most widely adopted architecture in CAPTCHA recognition studies, providing a reliable benchmark for comparison. Human data, on the other hand, reflect real usability performance. The combination enables evaluating both machine solvability and human readability under identical visual conditions. Alternative solvers (e.g., transformer-based models) were briefly discussed as potential extensions in future work.
Location: Section 3.3, Page 11, Lines 432-438
Comments 3:
Add more discussion regarding this combination between CNN-based solver and actual human interaction methods.
Response 3:
We agree with this comment. A new paragraph has been added in the Discussion section elaborating on how the integration of CNN-based and human-based analyses strengthens the forensic interpretability of results. This combination allows for simultaneous examination of feature-level effects on both usability and security dimensions, which is critical for designing adaptive and human-centric CAPTCHA mechanisms.
Location: Section 5 (Conclusions), Page 21, Lines 734-738.
Comments 5:
Highlight the key contributions of the paper, such as specific CNN-based solver and actual human interaction solutions and their impact.
Response 5:
We have revised the end of the Introduction to clearly summarize the main contributions of the paper. The contributions now explicitly emphasize (1) the creation of a publicly available dataset, (2) the combined use of CNN-based and human performance evaluation, and (3) the SHAP-based feature-impact framework for explaining security–usability trade-offs.
Location: Section 1 (Introduction), Page 2, Lines 190-194
Comments 7:
Clearly state the primary challenges faced in Web apps (e.g., data privacy, interoperability, scalability) within the first few paragraphs.
Response 7:
Thank you for this constructive suggestion. We have expanded the Introduction to briefly discuss key challenges faced by modern web applications—particularly data privacy, interoperability, and scalability—and linked these to CAPTCHA design requirements. This addition helps contextualize our study within broader web security challenges.
Location: Section 1 (Introduction), Page 2, Lines 54-58.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Dear editor,
The quality of this paper has been improved. However, the following issues should be addressed.
- While the study provides valuable insights into the comparative solvability of text-based CAPTCHAs by humans and machines, the experimental design is constrained by the reliance on a single CNN-based solver. This narrow model choice limits the generalizability of the findings, as other deep learning architectures (e.g., ResNet, Transformer-based, or hybrid CNN–RNN models) may exhibit distinct recognition behaviors and robustness levels. To strengthen the validity of the conclusions, the authors should consider benchmarking multiple model families or at least discuss the potential variability in performance across alternative architectures.
- The study presents meaningful contributions to understanding the balance between CAPTCHA usability and machine resistance; however, several methodological limitations reduce the robustness of its conclusions. The exclusive use of a single CNN model restricts the generalizability of results, as other modern architectures—such as Transformers, ResNets, or hybrid OCR-based solvers—may perform differently. Additionally, the dataset design includes certain features (e.g., warping, rotation, overlapping) in all samples, which limits the ability to isolate their individual effects on performance. The absence of comprehensive usability metrics, such as solving time and user abandonment rates, further weakens the human-centered analysis. Addressing these limitations through multi-model benchmarking, balanced feature selection, and expanded human-interaction data would significantly strengthen the study’s scientific validity and practical applicability.
- The use of SHAP values for feature interpretability is a commendable effort toward explainable analysis; however, the absence of statistical validation methods—such as bootstrap confidence intervals and collinearity checks—limits the reliability of the reported feature attributions. Without these robustness assessments, the SHAP-derived insights may be sensitive to data variability or inter-feature correlations, potentially leading to misleading interpretations. Incorporating statistical uncertainty measures and multicollinearity diagnostics would greatly enhance the credibility and reproducibility of the interpretability results.
- Although the study provides a valuable dataset and empirical analysis, it lacks a clear discussion on ethical and privacy considerations related to user data collection. Since the research involves human interaction logs and response patterns, issues such as informed consent, data anonymization, and compliance with privacy regulations (e.g., GDPR) should be explicitly addressed. Clarifying how participant data were protected, stored, and de-identified would enhance the ethical transparency and integrity of the study, ensuring that the publicly released dataset aligns with accepted data protection standards.
- While the experimental results offer valuable insights under controlled conditions, the absence of real-world deployment evaluation limits the practical relevance of the findings. The study does not account for factors such as network latency, device variability, or accessibility challenges that commonly affect CAPTCHA performance in operational environments. Including field tests or simulations that reflect real user contexts—especially across diverse platforms and user groups—would provide stronger external validity and ensure that the proposed analysis translates effectively to real-world applications.
Based on the above issues, I recommend rejecting this paper.
Author Response
Comment 1:
While the study provides valuable insights into the comparative solvability of text-based CAPTCHAs by humans and machines, the experimental design is constrained by the reliance on a single CNN-based solver. This narrow model choice limits the generalizability of the findings, as other deep learning architectures (e.g., ResNet, Transformer-based, or hybrid CNN–RNN models) may exhibit distinct recognition behaviors and robustness levels. To strengthen the validity of the conclusions, the authors should consider benchmarking multiple model families or at least discuss the potential variability in performance across alternative architectures.
Response 1:
Thank you for pointing this out. We agree with the reviewer that the reliance on a single CNN architecture limits generalizability. To address this, we have added explanatory remarks in the Methodology section to clarify the rationale for using one CNN model, and additional discussion in the Conclusions and Future Work sections to highlight future multi-model benchmarking.
- Page 12, Section 3.3 (Lines 454-459): Added paragraph explaining the rationale for a single CNN-based solver.
“In this study, we employed a single CNN-based architecture to evaluate the machine solvability of text-based CAPTCHAs. This design choice was made to maintain a controlled comparison between human and machine performance, rather than to benchmark different model families.”
- Page 23, Section 5 (Lines 776-786): Added paragraph discussing how different architectures (ResNet, Transformer-based, CNN–RNN) might behave differently and that these will be explored in future work.
“While this study focuses on a single CNN-based solver, different deep learning architectures such as ResNet, Transformer-based, or hybrid CNN–RNN models may exhibit distinct recognition behaviors. Future work will explore these to provide a broader understanding of CAPTCHA vulnerability.”
Comment 2:
The study presents meaningful contributions to understanding the balance between CAPTCHA usability and machine resistance; however, several methodological limitations reduce the robustness of its conclusions. The exclusive use of a single CNN model restricts the generalizability of results, as other modern architectures—such as Transformers, ResNets, or hybrid OCR-based solvers—may perform differently. Additionally, the dataset design includes certain features (e.g., warping, rotation, overlapping) in all samples, which limits the ability to isolate their individual effects on performance. The absence of comprehensive usability metrics, such as solving time and user abandonment rates, further weakens the human-centered analysis.
Response 2:
We agree with the reviewer that methodological improvements and usability metrics are essential. Accordingly, we have added both clarifications and new analyses throughout the manuscript.
- Page 11, Section 3.2 (Lines 400-403): Added sentence acknowledging that simultaneous distortion types limited isolation of individual effects.
“Although this configuration simulates real-world CAPTCHA designs, it limits the ability to isolate the individual effect of each distortion type. Future versions of the dataset will include controlled subsets to evaluate single-factor impacts.”
- Page 14, Section 4 (Lines 555-563): Updated paragraph to report usability findings and average solving time.
“In addition to accuracy, the average solving time per participant was 8.15 seconds, providing an additional measure of CAPTCHA usability. However, user-abandonment rates could not be systematically measured in this study and are identified as a limitation.”
- Page 23, Section 5 (Lines 777-787) : Added statement on future work.
“While this study focuses on a single CNN-based solver, it is important to note that different deep learning architectures may exhibit distinct recognition behaviors and robustness levels. For instance, ResNet models could offer better generalization under varying distortions due to their residual connections, whereas Transformer-based or hybrid CNN–RNN architectures might capture more complex sequential dependencies in text patterns. Future work will explore these architectures to provide a broader understanding of CAPTCHA vulnerability across diverse solver families.
As the experiments were performed in a live deployment, external factors such as network latency and device heterogeneity were naturally incorporated. However, future work will include a more systematic analysis of these environmental parameters to better quantify their influence on both human and machine performance.”
Comment 3:
The use of SHAP values for feature interpretability is a commendable effort toward explainable analysis; however, the absence of statistical validation methods—such as bootstrap confidence intervals and collinearity checks—limits the reliability of the reported feature attributions.
Response 3:
We thank the reviewer for this valuable suggestion. Statistical validation and collinearity diagnostics have now been included to reinforce the robustness of the SHAP-based analysis.
- Page 13, Section 3.4 (Lines 518-524): Added new paragraph describing bootstrap and VIF analyses.
“In addition to these measures, further statistical validation was performed to strengthen the reliability of SHAP-based feature attributions. Bootstrap resampling with 1,000 iterations was applied to estimate confidence intervals for feature importance values, ensuring that the reported SHAP contributions are statistically stable. Moreover, multicollinearity among input features was evaluated using the variance inflation factor (VIF), confirming that no strong inter-feature dependencies biased the SHAP results.”
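A hedged sketch of the two validation steps quoted above: a 1,000-iteration bootstrap confidence interval for mean |SHAP| per feature, and VIF-based multicollinearity screening. The SHAP matrix and feature flags here are synthetic stand-ins for the paper's real attributions.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n, p = 1000, 5
X = rng.integers(0, 2, size=(n, p)).astype(float)     # toy binary feature flags
shap_vals = rng.normal(size=(n, p))                    # toy SHAP value matrix

# Bootstrap CI (1,000 iterations, as described) for mean |SHAP| per feature.
boot = np.empty((1000, p))
for b in range(1000):
    idx = rng.integers(0, n, size=n)                   # resample rows
    boot[b] = np.abs(shap_vals[idx]).mean(axis=0)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5], axis=0)
print("95% CI per feature:", list(zip(ci_low.round(3), ci_high.round(3))))

# Variance inflation factor per feature to screen for multicollinearity.
Xc = np.column_stack([np.ones(n), X])                  # prepend intercept
vifs = [variance_inflation_factor(Xc, i) for i in range(1, p + 1)]
print("VIF:", np.round(vifs, 2))
```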
Comment 4:
Although the study provides a valuable dataset and empirical analysis, it lacks a clear discussion on ethical and privacy considerations related to user data collection. Since the research involves human interaction logs and response patterns, issues such as informed consent, data anonymization, and compliance with privacy regulations (e.g., GDPR) should be explicitly addressed.
Response 4:
We appreciate this comment and have substantially revised the manuscript to clarify ethical and privacy aspects.
- Page 23, Informed Consent Statement: Rewritten as:
“This study involved a secondary analysis of pre-existing, fully anonymized log data. No personally identifiable information was collected or accessed by the researchers. Therefore, participant informed consent was not required for this type of retrospective analysis.”
- Page 12, Section 4.2 (Lines 741-746): Rewritten to align with anonymized data use:
“All data used in this study were derived from pre-existing interaction logs generated during online exams. In this study, no demographic or personally identifying information was recorded. All data were fully anonymized prior to analysis, ensuring that no user interactions could be traced back to an individual. The analysis was conducted purely on these anonymized datasets, adhering to data minimization principles.”
- Page 13, Section 3.2 (Line 408): Adjusted wording to clarify data nature:
“Data from a total of 2,780 anonymized user sessions were included in the analysis.”
These revisions ensure that privacy handling is fully transparent and consistent with ethical research practice.
Comment 5:
While the experimental results offer valuable insights under controlled conditions, the absence of real-world deployment evaluation limits the practical relevance of the findings. The study does not account for factors such as network latency, device variability, or accessibility challenges that commonly affect CAPTCHA performance in operational environments.
Response 5:
We thank the reviewer for this comment. The study was in fact conducted on a real, operational CAPTCHA system used by live users, not in a laboratory simulation. We have clarified this in the revised manuscript and also added remarks regarding environmental variability.
- Page 11, Section 3.2 (Lines 423-426): Added clarification sentence.
“All experiments were conducted within an operational CAPTCHA system used by real users under authentic deployment conditions. The interaction data were collected from live user sessions rather than simulated environments.”
- Page 23, Section 5 (Lines 784-787): Added paragraph clarifying that live deployment inherently captures real-world factors.
“As the experiments were performed in a live deployment, external factors such as network latency and device heterogeneity were naturally incorporated. However, future work will include a more systematic analysis of these parameters to quantify their influence on both human and machine performance.”
These revisions clarify the real-world nature of the experiment and strengthen external validity.
Reviewer 4 Report
Comments and Suggestions for Authors
None.
Author Response
We thank the reviewer for taking the time to evaluate our manuscript and for the positive assessment of the language quality, research design, methods, and presentation. The comment regarding the introduction has been acknowledged, and minor stylistic adjustments were made to improve clarity and flow in that section. We appreciate the reviewer’s overall positive feedback and confirmation that no further revisions are required.
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
The revisions adequately address the reviewers’ concerns. The manuscript is acceptable in its present form.