Designing CAPTCHA Systems with Reinforcement Learning for Adaptive Defense

Indukuri, Meghana; Naseerkhan, Eman; Rose, Joshua; Tran, Martin; Park, Younghee

doi:10.3390/electronics15112363


by Meghana Indukuri, Eman Naseerkhan, Joshua Rose, Martin Tran and Younghee Park *

Department of Computer Engineering, San Jose State University, San Jose, CA 95192, USA

* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2363; https://doi.org/10.3390/electronics15112363 (registering DOI)

Submission received: 10 April 2026 / Revised: 17 May 2026 / Accepted: 22 May 2026 / Published: 30 May 2026

(This article belongs to the Special Issue Novel Approaches for Deep Learning in Cybersecurity)


Abstract

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) systems remain a widely deployed defense against automated abuse, but advances in machine learning have reduced the effectiveness of traditional challenge-based designs and exposed limitations in proprietary risk-scoring systems. This paper presents an adaptive, reinforcement learning-based CAPTCHA defense framework for high-security web applications. The proposed system formulates bot detection as a partially observable Markov decision process and uses a Proximal Policy Optimization (PPO) agent with Long Short-Term Memory to analyze streamed behavioral telemetry, including mouse movements, clicks, keystrokes, and scrolling, over sequential interaction windows. During the observation phase, the agent can continue observing or deploy a honeypot as an early-intervention and evidence-gathering action; after sufficient session evidence is accumulated, it can issue graded CAPTCHA challenges, allow a session, or block it. To complement the sequential agent, the framework also includes an XGBoost classifier that produces a session-level human-likelihood score as a supervised benchmark. The accompanying reinforcement learning environment and code base are publicly available, allowing future researchers to train, evaluate, and extend adaptive CAPTCHA policies as bot capabilities evolve. Experiments conducted on a sandbox ticket-purchasing web application demonstrate that the proposed methodology achieves strong preliminary performance on human-generated sessions and real bot sessions produced by scripted, replay-based, and Large Language Model (LLM)-powered agents. Among the evaluated reinforcement learning algorithm variants, Soft PPO achieved the best performance with 97.7% accuracy, 100% precision, and a 97.6% F1 score. Correspondingly, the XGBoost classifier achieved 99.48% accuracy, a 1.000 ROC-AUC (receiver operating characteristic area under the curve), and a 0.9919 F1 score. 
Our results indicate that sequential reinforcement learning can support accurate and low-friction bot detection, while the accompanying classifier provides a complementary binary benchmark. Compared to proprietary systems, the proposed framework emphasizes transparency, auditability, and explicit sequential decision-making rather than black-box risk scoring. Overall, this work introduces a publicly available, open, and adaptive CAPTCHA defense framework that supports transparent experimentation with behavior-based bot mitigation while also identifying the remaining limits that must be addressed before commercial deployment.

Keywords:

CAPTCHA; large language models; reinforcement learning; cybersecurity; proximal policy optimization; bot detection; reinforcement learning agent

1. Introduction

Malicious bots and botnets are a persistent problem for online platforms. They are used to automate ticket scalping, credential stuffing, spam, large-scale scraping, fraudulent account creation, and other forms of abuse [1]. As such, bot detection has become a fundamental requirement of modern web security. Effective defenses must distinguish legitimate users from automated agents while minimizing unnecessary friction for humans. In 2003, CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) was formally introduced by von Ahn et al. [2] as a class of artificial intelligence (AI)-hard (i.e., difficult for automated systems but frictionless for humans) challenge–response tests designed to block automated programs from abusing online services while ensuring they remain accessible to human users. The foundation of CAPTCHA design is to exploit tasks that are computationally trivial for humans, but onerous for machines, providing a mechanism for distinguishing legitimate users from bots that is both reliable and scalable.

Due to the advent of artificial intelligence, and more specifically machine learning, traditional CAPTCHA schemes have been defeated convincingly. Comprehensive literature reviews, such as the one from Dinh and Hoang [3], divide CAPTCHA into specific security schemes. One such scheme is text-based CAPTCHA, where humans must decipher distorted text and numbers with the aim of blocking bots from accessing web pages. However, researchers have developed a modified variant of convolutional neural networks (CNNs) that quickly solves a wide range of text-based CAPTCHA schemes [4]. Image-based CAPTCHA follows a similar logical flow but instead uses graphical puzzles. In a similar vein, image-based CAPTCHA security measures have also been compromised using computer vision algorithms, CNNs, Support Vector Machines, and other methods [3]. Audio CAPTCHA, which was introduced to improve accessibility, has proven equally vulnerable. Bock et al. [5] demonstrated that an automated speech recognition ensemble could defeat reCAPTCHA’s audio challenge with over 85% accuracy. Video-based CAPTCHA, which typically involves moving text or an animated object, creates difficulty for bots due to its motion and temporal elements. However, it is not widely used due to issues with bandwidth requirements, accessibility, and excessive friction for human users [6,7].

Across all of its versions, a core limitation of CAPTCHA remains the same: high-performing systems remain proprietary. Google’s reCAPTCHA v3 incorporates a risk-based scoring system that is based on behavioral biometrics, yet the implementation is entirely proprietary, making it impossible to audit, validate, or extend [8]. Current open-source systems apply a fixed detection approach and a fixed challenge response that cannot adapt to adversarial behavior over time. Open-source alternatives such as ALTCHA utilize cryptographic proof-of-work mechanisms to raise the cost of bot attacks but lack the behavioral intelligence to distinguish bots from humans in real time [9]. To the best of our knowledge, no existing open-source CAPTCHA framework treats bot detection as a sequential decision-making problem in which a defender observes behavior over time, deploys evidence-gathering interventions such as honeypots, and adaptively selects a challenge or terminal decision based on accumulated evidence. This project directly addresses that gap.

This paper proposes a silent, reinforcement learning (RL)-based CAPTCHA system designed for high-security web applications. The system frames bot detection as a partially observable Markov decision process (POMDP), in which a Proximal Policy Optimization agent with a Long Short-Term Memory architecture (PPO+LSTM) makes sequential classification decisions over a sliding window of behavioral telemetry events. The system collects mouse movements, keystrokes, scroll events, and click data silently as users navigate a simulated ticket-purchasing web application, encoding these signals into a 26-dimensional feature vector per window. During the observation phase of each session, the agent may either continue observing or deploy a honeypot to gather additional evidence before a final decision is required. In general, honeypots are deceptive decoy systems designed to attract attackers [10]; in our context, honeypot deployment refers to embedding form fields that are visually hidden from human users but still processed by the backend, enabling the detection of automated agents that interact with these otherwise invisible elements. At the terminal state of each session, the agent adaptively deploys an easy, medium, or hard CAPTCHA challenge, blocks the user, or passes the user without friction, depending on its classification confidence. We also introduce a companion XGBoost classifier, which provides a holistic, session-level human-likelihood score, allowing direct comparison between the two approaches.
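The honeypot idea in the paragraph above can be illustrated with a minimal server-side check. This is only a sketch under stated assumptions: the field names in `HONEYPOT_FIELDS` and the helper `honeypot_triggered` are hypothetical and are not taken from the released code base.

```python
# Hypothetical decoy field names; the paper does not list the actual ones.
HONEYPOT_FIELDS = frozenset({"website_url", "fax_number"})


def honeypot_triggered(form_data: dict) -> bool:
    """Return True if any hidden decoy field arrived with a non-empty value.

    Human users never see these fields, so a filled-in value suggests that an
    automated agent populated every input it found in the page markup.
    """
    return any(
        str(form_data.get(name, "")).strip() for name in HONEYPOT_FIELDS
    )
```

In a design like this, a triggered honeypot would be treated as one more piece of evidence for the sequential agent rather than as an automatic block.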

This study is guided by the following research questions:

RQ1: Can a sequential reinforcement learning agent distinguish human users from scripted, replay-based, adversarially humanized, and large language model (LLM)-powered bots exclusively using streamed behavioral telemetry, without relying on personally identifiable information?

RQ2: Can unnecessary user friction be reduced while maintaining strong bot detection performance through an adaptive action space that includes both mid-session interventions (i.e., continued observation and honeypot deployment) and terminal decisions (i.e., graded CAPTCHA, allow, and block)?

We evaluate the corresponding hypotheses.

H1: Temporal behavioral telemetry contains sufficient sequential and interaction-level patterns for a reinforcement learning agent to achieve statistically significant discrimination between human and automated actors.

H2: A multi-action adaptive response policy with progressive intervention mechanisms, including honeypots and graded CAPTCHA escalation, will preserve or improve overall bot detection effectiveness while lowering interaction costs imposed on legitimate users.

The proposed framework should not be interpreted as a ready-to-deploy commercial replacement for mature CAPTCHA services. Its practical value lies in the methodology and environment: the defender policy is more explainable than traditional binary classifiers (e.g., XGBoost); the reward function can be modified to prioritize security without losing performance, user friction, accessibility, or false-positive reduction; and new bot families can be added as AI-based automation improves. This makes the system future-adaptable rather than future-proof. In particular, the same environment can be reused to evaluate stronger browser agents, retrain policies against newly observed behaviors, and compare different intervention strategies without relying on closed risk-scoring logic. Overall, the key contributions of this work are as follows:

This paper proposes a new RL-based CAPTCHA system that learns a sequential intervention policy over windowed behavioral telemetry.
We design a new feature vector space by considering users’ real-time behaviors using keystrokes and mouse movements.
We implement the proposed system and evaluate the performance of the RL agent and XGBoost classifier.
We release an open-source RL environment for adaptive CAPTCHA defense research, including session replay, configurable reward schedules, honeypot deployment, graded CAPTCHA interventions, and multiple bot-tier behaviors.

The remainder of this paper is organized as follows: Section 2 reviews related work in CAPTCHA systems, their vulnerabilities, behavioral biometrics, and the use of reinforcement learning in cybersecurity. Section 3 describes our methodology, including the web application, data collection procedure, adversarial bot framework, RL agent architecture, and XGBoost classifier design. Section 4 presents experimental results for both the RL agents and the classifier. Section 5 discusses our key findings and system limitations. Section 6 and Section 7 summarize our work and discuss directions for future research.

2. Related Work

2.1. CAPTCHA Systems and Their Evolution

CAPTCHA was introduced as an online security mechanism designed to prevent automated fraud and abuse, wherein unique challenges are used to distinguish bots and humans [2]. The first generation consisted of text-based CAPTCHA, which required users to identify and transcribe distorted alphanumeric characters rendered against noisy backgrounds. These were followed by image-based CAPTCHA, which presented users with visual puzzles requiring the identification of objects belonging to a specified category, a format that was used in Google’s reCAPTCHA v2 image grid challenges. Both generations relied on the premise that the visual recognition tasks involved were computationally intractable for automated systems.

Advances in machine learning have brought about further evolution in CAPTCHA design. One such methodology tracks user behavior on a web page, distinguishing between humans and malicious automation through behavioral and sensor metrics [3,11]. This invisible CAPTCHA works in the background, collecting information about the user without presenting a direct puzzle. This approach has become prevalent due to its ease for users and its security performance.

Beyond purely invisible approaches, some systems have explored hybrid designs. Open-source alternatives such as ALTCHA couple cryptographic proof of work with behavioral signals to raise the cost of automated abuse in a transparent manner [9]. However, a key concern with this system, and others, is that their detection strategies are fixed at the start of the deployment. Furthermore, research has highlighted an inherent tradeoff in CAPTCHA design: increasing difficulty to deter bots correspondingly raises friction and error rates for legitimate users [12].

2.2. Bots and AI-Based CAPTCHA Attacks

The vulnerability of text-based CAPTCHA to machine learning attacks has been well established. Tang et al. [4] demonstrated that CNNs could break text-based CAPTCHA schemes deployed by the top 50 most popular international websites with high accuracy, targeting schemes that utilize noise, distortion, and anti-segmentation techniques.

Image-based CAPTCHA has been proven equally susceptible. Sukhani and Chitaliya [13] showed that a multi-class CNN model could solve image grid puzzles with 92.98% accuracy, while Sivakorn et al. [14] defeated Google’s image reCAPTCHA at scale by combining deep learning-based object recognition with cookie manipulation to influence the risk analysis system. Plesner et al. [15] showed that deep learning models such as the YOLO (You Only Look Once) v8 model can solve reCAPTCHAv2 image challenges with 100% success. Audio CAPTCHAs, introduced to improve accessibility, have fared no better: Bock et al. [5] showed that an automated speech recognition ensemble could mount a large-scale attack against reCAPTCHA’s audio challenge with an attack success rate of 85.15%.

Behavioral CAPTCHA has been attacked as well. Akrout et al. [16] demonstrated that RL agents can be trained to bypass reCAPTCHA v3 by learning mouse movement patterns that receive a high (human-like) score. The authors formulated the problem as a grid-world and achieved a success rate of 97.4%. Together, these studies make a compelling case that both challenge-based and behavioral CAPTCHA are vulnerable to targeted AI attacks, motivating the need for adaptive defenses.

2.3. Behavioral Biometrics

The high data cost associated with this project required that our selected data be simple to collect, widely available in environments similar to ours, and effective for bot detection. To meet these data constraints, we track a variety of mouse and keyboard interaction telemetry.

To ensure the effectiveness of bot detection, we reviewed prior studies to identify the most valuable telemetry to track. Mouse dynamics, including movement and clicking behavior, were used in [17] to detect bots with up to 99.20% accuracy. Similarly, ref. [18] used real and synthetic mouse trajectories to supplement other approaches (e.g., Google’s reCAPTCHA), achieving 98.7% accuracy. In another study [19], a variety of mouse and keyboard inputs were used to detect bots in online video games with 99% accuracy and negligible performance overhead.

These studies show that bot interaction with online environments is often simple and contains many obvious differences from human data. Mouse movements are often overly smooth, while keyboard inputs are unrealistically fast or uniformly timed compared to human inputs. Based on these findings, we conclude that a variety of mouse movement and keyboard inputs combined provide an effective approach to bot detection.

2.4. Reinforcement Learning for Cybersecurity

Reinforcement learning has emerged as an increasingly viable approach for cybersecurity problems because many defensive tasks are inherently sequential. Nguyen and Reddi [20] provide a comprehensive survey of deep RL (DRL) applied to cybersecurity, covering its application to intrusion detection systems (IDSs), cyber-physical system defense, and multi-agent game-theoretic simulations of attacker–defender dynamics. Their review highlights an advantage of RL over static classifiers: while supervised models are trained once on historical attack patterns, RL agents can continuously refine their policies, making them well-suited to adversaries that evolve their strategies over time.

Within intrusion detection specifically, RL-based methods have been used to detect anomalies in network traffic and respond to attack patterns that fall outside the distribution of training data. This robustness to distributional shift is valuable in security contexts, where adversaries actively probe for the gaps in a deployed system’s knowledge. The adversarial framing, in which the RL agent is cast as a defender operating against an implicit attacker, maps to problems like bot detection, where the goal is not simply to classify known bots but to maintain a reliable policy as bot strategies evolve [20].

2.5. Existing Dynamic CAPTCHA Systems

Modern CAPTCHA systems have shifted from static challenge–response tasks toward behavioral and adaptive risk-scoring approaches [11]. Although popular commercial systems reduce visible user friction through passive scoring mechanisms, their internal methodology remains proprietary. This limits transparency, reproducibility, and independent adaptation to emerging adversarial behaviors.

A qualitative reference point for the framework proposed in this paper is reCAPTCHA v3. Its approach operates in the background and is based on behavioral evidence rather than requiring an immediate visible challenge. However, reCAPTCHA v3 ultimately exposes a proprietary risk score, leaving the downstream intervention logic to the site operator or to Google’s closed infrastructure. By contrast, the framework presented in this paper formulates bot defense as a partially observable Markov decision process (POMDP) and learns an explicit sequential policy over continued observation, honeypot deployment, graded CAPTCHA challenges, and allow and block actions. The difference, therefore, is not simply whether adaptation exists but where that adaptation resides: in reCAPTCHA v3, it is embedded within a proprietary service, whereas in the system presented in this paper, the intervention policy itself is learned, auditable, and modifiable.

At the same time, reCAPTCHA v3 remains substantially stronger in deployment maturity, ecosystem integration, and scale in the real world. Since Google does not publish public precision, recall, and false-positive benchmarks that are directly comparable to the dataset and task formulation used in this study, this comparison should be understood as architectural and operational rather than as a strict head-to-head performance evaluation. Consequently, the contribution of the present work is not to claim benchmark superiority over reCAPTCHA v3 but to introduce an auditable and sequentially adaptive CAPTCHA defense framework with strong preliminary internal results.

In summary, existing dynamic CAPTCHA systems generally adapt either a risk score or a challenge difficulty, while our framework formulates the broader bot-mitigation process as a POMDP. The defender observes behavioral telemetry over time, may deploy honeypots as early evidence-gathering interventions, and then selects among graded CAPTCHA challenges and allow or block actions.

2.6. Gap Analysis and Contribution

The related work identifies a remaining gap between proprietary behavioral-risk systems and open-source CAPTCHA alternatives. Commercial systems provide large-scale adaptive scoring, but their decision logic is closed and difficult to audit or extend. Open-source alternatives provide transparency but generally do not model bot mitigation as a sequential defender policy that can observe behavior, gather evidence, and choose among multiple interventions over time. This study addresses that gap by introducing a publicly available reinforcement learning environment for adaptive CAPTCHA defense. The environment formulates bot detection as a POMDP and supports continued observation, honeypot-based evidence gathering, graded CAPTCHA challenges, and allow and block decisions within a single policy. We evaluate this framework using a PPO+LSTM agent and a companion XGBoost classifier against scripted, replay-based, adversarially humanized, and LLM-powered bots. The goal is to provide a transparent testbed for studying how CAPTCHA defense policies can be adapted as bot capabilities evolve.

3. Methodology

3.1. Web Application

For this project, a sandbox web application was developed to simulate a real-world high-security environment requiring anti-bot security measures. The system records per-session user inputs, which are used by a silent RL agent that evaluates activity for each unique session. User interaction with the application flows through three primary pages: concert selection, seat selection, and checkout. These pages and the CAPTCHA challenges delivered to the user are shown in Figure 1. Additionally, the application includes a developer dashboard that provides session telemetry and displays RL agent decisions.

The mock e-commerce application uses CAPTCHA challenges instead of hard blocks to avoid mistakenly blocking legitimate customers. Suspected bots receive one of three CAPTCHA puzzles based on risk level. The easy puzzle asks users to rotate an asymmetric object upright with generous tolerance. The medium puzzle uses a jigsaw slider with more visual ambiguity and stricter tolerance. The hard puzzle, reserved for highly suspicious sessions, requires users to click moving objects in a specified order within a time limit; two failed attempts result in blocking.

3.2. Telemetry Data Collection

The dataset contains two categories of interactions: human-generated and bot-generated. Human data was collected from participants directly using the application. Bot data was collected from LLM agents and replay bots directly interacting with the application, generating data from actual automated use of the interface rather than synthetically constructed traces. These bot sessions reflect real event streams produced from automated agents navigating the application. To further supplement and vary the bot data, a data augmentation process was additionally applied; this process is detailed in Section 3.3. Overall, the dataset contains 643 original sessions (204 human; 439 bot) with an additional 2628 sessions from the bot augmentation process.

For each session, telemetry data included keystrokes, mouse movements, scroll events, and button presses. Keystrokes, scrolls, and clicks were logged as discrete events, whereas mouse movements were recorded at 15 millisecond intervals. Both the human and bot samples were represented using telemetry captured from interaction with the same application interface. A detailed summary of the collected features is provided in Table 1.

The distinctions between human and bot activity are illustrated in Figure 2 and Figure 3, which present heatmap visualizations of interaction data. These figures map the X-Y pixel positions of all recorded interactions, with the mean position of each distribution indicated by a red circle. For the human, original bot, and augmented bot data, respectively, the mean mouse-movement positions are approximately (507, 358), (593, 420), and (594, 419), while the mean mouse-click positions are (425, 379), (483, 453), and (483, 453).

These visualizations highlight clear behavioral differences between human and bot activity. Human interaction exhibits broad spatial dispersion across the interface, indicating more exploratory behavior. In contrast, bot interactions are highly structured, with limited horizontal variance and a small number of unique x-axis positions. This results in the distinct vertical bands of activity observed in both the movement and click heatmaps. The standard deviation along the x-axis for human clicks is approximately 1.9 times greater than that of bot data. Additionally, human clicks span 562 unique x-axis positions, while bot clicks are confined to only 73. Human click interactions also occupy a substantially larger number of spatial bins, covering 17% of the provided space, while bot activity covers only approximately 9.4%. Furthermore, human mouse movements occupy approximately 65% of the environment space, while human clicks occupy only about 17%. This demonstrates the highly exploratory nature of mouse movements and positioning versus the more targeted nature of clicking.
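The dispersion statistics quoted above (x-axis standard deviation, number of unique x positions, and share of occupied spatial bins) could be computed along the following lines. This is a sketch rather than the authors' analysis code: the 50-pixel bin size, the canvas dimensions, and the toy coordinates are all assumptions made for illustration.

```python
from statistics import pstdev


def spatial_summary(points, width, height, bin_size=50):
    """Summarize how widely a set of (x, y) pixel positions is spread.

    Returns the x-axis standard deviation, the number of unique x positions,
    and the fraction of bin_size x bin_size grid cells containing a point.
    """
    xs = [x for x, _ in points]
    occupied = {(int(x // bin_size), int(y // bin_size)) for x, y in points}
    # Ceiling division so partially covered edge cells still count as cells.
    n_cols = -(-width // bin_size)
    n_rows = -(-height // bin_size)
    return {
        "x_std": pstdev(xs),
        "unique_x": len(set(xs)),
        "coverage": len(occupied) / (n_cols * n_rows),
    }


# Toy data (not from the paper): scattered "human" clicks versus "bot"
# clicks that share two x positions and form vertical bands.
human = [(120, 80), (430, 310), (655, 95), (910, 520), (275, 640)]
bot = [(480, 100), (480, 110), (480, 300), (600, 300), (600, 310)]

h = spatial_summary(human, width=1000, height=700)
b = spatial_summary(bot, width=1000, height=700)
```

On the toy data, the human summary shows more unique x positions, a larger x-axis spread, and higher bin coverage than the bot summary, which is the qualitative pattern described in the text.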

In total, the dataset includes 39,904 human mouse movement events and 28,741 bot movement events, which were augmented to produce an additional 172,138 samples. For mouse clicks, the dataset contains 1392 human events and 4214 bot events, with an additional 25,260 augmented samples. Notably, the augmented bot dataset closely mirrors the statistical properties of the original bot data across all evaluated metrics, suggesting that augmentation does not fully introduce human-like variability.

3.3. Adversarial Bot-Tier Framework and Augmentation

To capture a broad spectrum of adversarial behavior on websites, we organize our bot implementations into a five-tier difficulty hierarchy. Each tier reflects increasing levels of behavioral complexity, ranging from simple scripted automation to highly realistic human-mimicking agents, including those powered by LLMs. The characteristics and capabilities of each tier are summarized in Table 2 below.

For Tiers 1 through 4, browser interactions are implemented using Selenium WebDriver to simulate automated behaviors of varying complexity [28]. In general, we expect detection performance to decrease from Tier 1 to Tier 5, reflecting increasing adversarial difficulty. Deviations from this trend may indicate that certain behavioral strategies are more effective at evasion than their tier ranking suggests, or that specific agents are better suited to particular adversary types. The corresponding results are presented in Section 4 and Section 5.

Adversarial Augmentation

A key challenge with bot detection is that trivially separable features (e.g., Selenium’s $\sim 1$ ms key-hold durations) can inflate model accuracy without the model learning genuinely discriminative behavioral patterns. To address this, we draw inspiration from adversarial training [29], which improves model robustness by exposing it to perturbed inputs during training. We adapt this principle to the behavioral bot detection domain by introducing a novel adversarial augmentation procedure, the HumanProfiler pipeline, that progressively humanizes bot sessions at three difficulty levels, directly mapping to the bot-tier framework in Section 3.3. To our knowledge, no prior work has applied progressive humanization of bot telemetry as a data augmentation strategy for behavioral bot detection; this pipeline is a novel contribution of this work.

A HumanProfiler first learns statistical profiles from real human sessions across six signal categories: key-hold duration, mouse inter-event $Δ t$, jitter ratio, mouse speed, direction-change frequency, and event-type ratios. Let $μ_{h}^{(f)}$ and $σ_{h}^{(f)}$ denote the human mean and standard deviation for feature $f$. Bot sessions are then transformed as follows:

Easy augmentation: Fixes the most obvious giveaways. Key-hold durations are resampled from a clipped Gaussian centered on the human profile, and micro-jitter is injected into mouse trajectories:

$d_{hold}^{'} \sim N (μ_{h}^{(hold)}, σ_{h}^{(hold)}), \quad p_{t}^{'} = p_{t} + ϵ_{t}, \quad ϵ_{t} \sim N (0, σ_{jitter}^{2} I) . \qquad (1)$

Medium augmentation: Applies all easy transforms and additionally humanizes timing distributions by compressing mouse $Δ t$ values toward human rates and applies exponential smoothing to reduce abrupt direction changes in mouse paths:

$Δ t_{k}^{'} = β Δ t_{k} + (1 - β) μ_{h}^{(Δ t)}, \quad p_{t}^{'} = α_{s} p_{t} + (1 - α_{s}) p_{t - 1}^{'}, \qquad (2)$

where $β$ controls timing compression and $α_{s}$ is the smoothing coefficient.

Hard augmentation: Applies all transforms with tighter parameters (smaller $σ_{jitter}$, stronger smoothing, and narrower timing compression), producing near-human sessions that are challenging to distinguish from real users.

The specific parameter values used at each difficulty level are summarized in Table 3.

For each original bot session, multiple augmented copies are generated (default: 2 copies × 3 levels = 6 augmented sessions per bot). This forces the model to learn subtle behavioral signals rather than relying on trivially separable artifacts, directly strengthening robustness against higher-tier adversaries.
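The transforms in Equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative reading of the equations, not the released HumanProfiler code: the function names, the clipping bounds `min_hold` and `max_hold`, the initial condition for the smoothed path, and every numeric parameter value below are assumptions (the actual per-level values are those reported in Table 3).

```python
import random


def easy_augment(holds, path, mu_hold, sigma_hold, sigma_jitter, rng,
                 min_hold=20.0, max_hold=400.0):
    """Eq. (1): resample key-hold durations and add Gaussian micro-jitter."""
    new_holds = [
        # Clipped Gaussian centered on the human key-hold profile.
        min(max(rng.gauss(mu_hold, sigma_hold), min_hold), max_hold)
        for _ in holds
    ]
    new_path = [
        (x + rng.gauss(0.0, sigma_jitter), y + rng.gauss(0.0, sigma_jitter))
        for x, y in path
    ]
    return new_holds, new_path


def medium_augment(dts, path, mu_dt, beta, alpha_s):
    """Eq. (2): pull inter-event gaps toward the human mean, smooth the path."""
    new_dts = [beta * dt + (1.0 - beta) * mu_dt for dt in dts]
    smoothed = []
    for point in path:
        if not smoothed:
            smoothed.append(point)  # p'_0 = p_0 (assumed initial condition)
            continue
        prev_x, prev_y = smoothed[-1]
        smoothed.append((alpha_s * point[0] + (1.0 - alpha_s) * prev_x,
                         alpha_s * point[1] + (1.0 - alpha_s) * prev_y))
    return new_dts, smoothed


# Hypothetical bot session: 1 ms key holds and a sharp right-angle path.
rng = random.Random(0)
holds, path = easy_augment(
    [1.0, 1.0, 1.0], [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)],
    mu_hold=95.0, sigma_hold=30.0, sigma_jitter=1.5, rng=rng)
dts, smooth = medium_augment([5.0, 5.0], path, mu_dt=15.0, beta=0.5,
                             alpha_s=0.6)
```

With `beta = 0.5`, each 5 ms gap moves halfway toward the assumed 15 ms human mean, giving 10 ms. A hard-level variant would reuse the same two functions with tighter parameter values.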

3.4. Reinforcement Learning for Adaptive CAPTCHA Defenses

3.4.1. Reinforcement Learning Formulation

Reinforcement learning (RL) is a machine learning paradigm in which an agent learns to make decisions through trial-and-error interactions with an environment. The agent’s objective is to learn a policy that maximizes the expected cumulative reward over time [30]. RL formally stems from Markov decision processes (MDPs), which are defined as a tuple $M = (\mathcal{S}, \mathcal{A}, P, R, γ)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $P (s^{'} \mid s, a)$ the transition probability function, $R (s, a)$ the reward function, and $γ \in [0, 1)$ the discount factor that determines the relative importance of future rewards [30,31]. A policy $π (a \mid s)$ specifies the probability of selecting action $a$ in state $s$ [30]. The value function $V^{π} (s)$ represents the expected sum of discounted future rewards when starting in state $s$ and following policy $π$. Similarly, the advantage function is defined as

$A^{π} (s, a) = Q^{π} (s, a) - V^{π} (s),$

which measures the relative benefit of taking action $a$ compared to the expected value of the policy [30,32]. These components lay the foundation for modern policy gradient methods such as Proximal Policy Optimization (PPO) [33,34,35], which commonly use generalized advantage estimation (GAE) [32] to reduce variance in gradient updates.
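To make the role of GAE concrete, the following sketch computes advantage estimates for a single finished episode using the standard backward recursion over temporal-difference errors. The function name `compute_gae` and the default values of `gamma` and `lam` are assumptions chosen for illustration (they are common defaults, not the hyperparameters used in this paper).

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one finished episode.

    rewards[t] is the reward received after acting in state t, and values[t]
    is the critic's estimate V(s_t). The value after the terminal step is
    taken to be zero.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam = 0` reduces the estimate to the one-step TD error, while `lam = 1` recovers the discounted return minus the value baseline; intermediate values trade bias against variance.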

In the context of our problem (i.e., CAPTCHA defenses), this RL formulation can be mapped to user interaction sessions on a web application. The environment corresponds to a user session, and the state space ($\mathcal{S}$) captures user interaction behavior represented as windowed telemetry features (e.g., mouse movement, click events, keystrokes, and scrolling patterns). The action space ($\mathcal{A}$) consists of the RL agent’s possible interventions, such as continuing observation, deploying a honeypot, issuing challenges, allowing the user, or blocking the session. The reward function ($R$) is defined based on the outcome of these actions, assigning positive rewards for correctly identifying bots and legitimate users and penalties for incorrect decisions or user friction. The transition dynamics ($P$) are governed by how users (human or automated) respond to these interventions over time but are not modeled explicitly, because the RL agent is trained in a model-free setting, learning directly from observed interaction data without access to the transition probabilities [30]. In our setting, the initial state ($s_{0}$) corresponds to the start of a user session, prior to observing any interaction behavior. A terminal state ($s_{t}$) is reached when a final action is taken, such as allowing the user, blocking the session, or issuing a challenge. The overarching objective of the RL agent is to learn a policy ($π$) that maximizes the expected cumulative reward by accurately distinguishing between bots and legitimate users while minimizing user friction.

Critically, this problem is more accurately characterized as a partially observable MDP (POMDP) [36], since the agent does not directly observe whether the user is a bot or human. Instead, it must infer this hidden state from observable behavioral signals (e.g., mouse movements) and interaction patterns over time. Because of this partial observability, we use a sequence model, specifically a Long Short-Term Memory (LSTM) network, to retain information from past observations [37]. Concretely, the LSTM captures temporal dependencies and aggregates information across multiple timesteps, enabling the agent to approximate the underlying hidden state (i.e., whether the user is a bot or human). A visual representation of the POMDP for this problem is shown in Figure 4.

3.4.2. Observation, State, Action, and Reward Space

Observation Space

As discussed, the RL agent cannot directly observe the true state of the environment (i.e., whether the user is a bot or a human). Instead, it receives observations derived from user interaction telemetry. These observations consist of raw behavioral signals (i.e., mouse movements, click events, keystrokes, and scrolling activity) collected during a user session. The raw telemetry is used to construct a chronological event timeline that begins when the user opens the web page and ends at checkout completion or session termination. To prevent mouse movement data from dominating the representation, mouse events are subsampled by retaining every fifth sample. The resulting event stream is then segmented into fixed-length windows of 30 events with a 50% overlap, producing a sequence of observations for each session; overlapping windows are a standard technique for preserving temporal structure in sequential data [38].
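The subsampling and windowing steps can be sketched as follows. This is a minimal illustration, not the released implementation: the event record layout (a dict with a `type` field), the choice to keep the first of every five mouse samples, and the decision to drop a trailing partial window are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical event record, e.g. {"type": "mousemove", "t": 12.0}
Event = Dict[str, object]


def subsample_mouse(events: List[Event], keep_every: int = 5) -> List[Event]:
    """Keep every `keep_every`-th mouse-move event; keep all other events."""
    kept, mouse_seen = [], 0
    for ev in events:
        if ev["type"] == "mousemove":
            if mouse_seen % keep_every == 0:
                kept.append(ev)
            mouse_seen += 1
        else:
            kept.append(ev)
    return kept


def sliding_windows(events: List[Event], size: int = 30,
                    overlap: float = 0.5) -> List[List[Event]]:
    """Split an event stream into fixed-length windows with fractional overlap.

    Assumption: a trailing window shorter than `size` is dropped.
    """
    stride = max(1, int(size * (1.0 - overlap)))  # 30 events, 50% overlap -> 15
    return [events[i:i + size]
            for i in range(0, len(events) - size + 1, stride)]


# Usage: 200 mouse moves followed by 20 clicks.
events = [{"type": "mousemove", "t": float(i)} for i in range(200)]
events += [{"type": "click", "t": 200.0 + i} for i in range(20)]
stream = subsample_mouse(events)    # 40 mouse moves + 20 clicks
windows = sliding_windows(stream)   # windows start at events 0, 15, and 30
```

With a stride of 15 events, each event away from the session boundaries appears in two consecutive windows.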

State Representation

Formally, the true underlying state

s \in S

corresponds to whether the user is a bot or a human, which is not directly observable. The agent must infer this hidden state from the sequence of observations. Each observation window is encoded into a 26-dimensional feature vector capturing behavioral characteristics such as motor dynamics, spatial coverage, temporal patterns, and interaction metadata. This feature vector is shown in Table A1. Since a single observation window is insufficient to determine the user type, the agent utilizes an LSTM-based architecture [37] to aggregate information across multiple timesteps (t). The LSTM maintains a hidden state that summarizes past observations, enabling the agent to create a latent representation that approximates the underlying user state (i.e., bot or human).

Action Space

The action space consists of seven discrete actions corresponding to different intervention strategies, including continuing observation, deploying a honeypot, issuing CAPTCHA challenges of varying difficulty, allowing the user, or blocking the session. Since non-terminal actions (continue; deploy_honeypot) are only valid during the observation phase and terminal actions (puzzles, allow, and block) are only valid at the final window, invalid action masking [39] is applied to restrict the set of available actions at each timestep t, ensuring that only valid actions are selected.
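A minimal sketch of how such a mask might be built is shown below. The action ordering and the puzzle action names are hypothetical (only `continue` and `deploy_honeypot` are named in the text), and enforcing the two-honeypot cap stated later in this section through the mask is an assumption about the mechanism.

```python
# Hypothetical action ordering; only "continue" and "deploy_honeypot"
# are names taken from the text.
ACTIONS = ["continue", "deploy_honeypot",
           "puzzle_easy", "puzzle_medium", "puzzle_hard",
           "allow", "block"]
NON_TERMINAL = {"continue", "deploy_honeypot"}
MAX_HONEYPOTS = 2  # per-session cap stated in the reward description


def action_mask(is_final_window: bool, honeypots_used: int = 0) -> list:
    """Return one boolean per action; True means the action is selectable."""
    mask = []
    for name in ACTIONS:
        if name in NON_TERMINAL:
            valid = not is_final_window
            # Assumption: the honeypot cap is enforced through the mask.
            if name == "deploy_honeypot" and honeypots_used >= MAX_HONEYPOTS:
                valid = False
        else:
            valid = is_final_window
        mask.append(valid)
    return mask
```

During the observation phase the mask exposes only the two non-terminal actions; at the final window it exposes only the five terminal ones.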

Reward Structure and Challenge Outcome Formulation

The reward function reflects the asymmetric costs of CAPTCHA decision-making: correct identifications yield positive rewards while false positives, false negatives, and unnecessary user friction incur penalties. Terminal actions consist of three puzzle difficulties (easy, medium, and hard), as well as allow and block. For puzzle actions, outcomes are sampled stochastically using fixed pass probabilities (Table 4): humans are more likely to pass easier challenges, while bots are less likely to pass harder ones. The resulting reward depends on both the user type and the outcome. Humans who pass incur a small friction cost that increases with difficulty, while humans who fail receive a false-positive penalty. Bots who pass incur a slip-through penalty, whereas bots caught by a puzzle receive a positive reward scaled by difficulty, incentivizing harder challenges for higher-confidence detections. Direct actions follow a simpler structure: allowing a human yields a positive reward (

+ 0.5

), while allowing a bot incurs a false-negative penalty (

- 1.0

). Blocking a bot yields a moderate positive reward (

+ 0.7

), deliberately set below the puzzle-catch rewards (

+ 0.8

to

+ 1.2

) to encourage evidence-based detection over opaque blocking. Blocking a human incurs the largest penalty (

- 1.5

) to reflect the high friction cost of incorrectly denying access. The full pipeline is illustrated in Figure 5.
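The terminal reward logic can be sketched as below. The direct-action and puzzle-catch rewards are the values stated in the text; the pass probabilities, human friction costs, human-fail penalty, and bot slip-through penalty are HYPOTHETICAL placeholders, since the actual numbers appear in Tables 4 and 6 and are not reproduced here.

```python
import random

# Values stated in the text (revised schedule).
ALLOW_HUMAN, ALLOW_BOT = 0.5, -1.0
BLOCK_BOT, BLOCK_HUMAN = 0.7, -1.5
PUZZLE_CATCH = {"easy": 0.8, "medium": 1.0, "hard": 1.2}

# HYPOTHETICAL placeholders: the real values are in Tables 4 and 6.
PASS_PROB = {
    "human": {"easy": 0.95, "medium": 0.90, "hard": 0.80},
    "bot": {"easy": 0.50, "medium": 0.30, "hard": 0.10},
}
HUMAN_FRICTION = {"easy": -0.05, "medium": -0.10, "hard": -0.20}
HUMAN_FAIL_PENALTY = -1.0
BOT_SLIP_PENALTY = -1.0


def terminal_reward(action: str, is_bot: bool, rng: random.Random) -> float:
    """Reward for a terminal action: 'allow', 'block', or a puzzle difficulty."""
    if action == "allow":
        return ALLOW_BOT if is_bot else ALLOW_HUMAN
    if action == "block":
        return BLOCK_BOT if is_bot else BLOCK_HUMAN
    # Puzzle actions: the pass/fail outcome is sampled stochastically.
    user = "bot" if is_bot else "human"
    passed = rng.random() < PASS_PROB[user][action]
    if is_bot:
        return BOT_SLIP_PENALTY if passed else PUZZLE_CATCH[action]
    return HUMAN_FRICTION[action] if passed else HUMAN_FAIL_PENALTY
```

Because a bot caught by any puzzle earns at least +0.8 while a direct block earns +0.7, the agent is rewarded for choosing a challenge whenever it expects the bot to fail it.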

Non-terminal actions include continue, which applies a small per-step penalty (

- 0.001

), and deploy_honeypot. Honeypots trigger with tier-dependent probabilities for bots (ranging from 85% for Tier 1 Commodity bots down to 5% for Tier 5 LLM-Powered agents) and a fixed 1% probability for humans. When a honeypot is triggered by a bot, an information bonus (

+ 0.5

) is applied at the following timestep, encouraging the use of honeypots as a signal-gathering mechanism without assuming perfect knowledge at deployment. Because honeypots are not foolproof, at most two may be deployed per session to prevent over-reliance on them.

Evolution of the Reward Design (Legacy Baseline vs. Revised Schedule)

The legacy reward schedule treated direct blocking and puzzle-based detection nearly identically. Blocking a bot directly yielded

+ 1.0

, the same base reward as catching a bot via any puzzle, making direct blocking the strictly dominant strategy. The legacy schedule also used a single aggregate honeypot trigger probability (

0.60

) for all bot tiers rather than tier-stratified rates and assigned a smaller information bonus (

+ 0.3

). The full legacy mapping is shown in Table 5.

The revised schedule (Table 6) addresses these issues by making the catch reward scale with difficulty (

+ 0.8

/

+ 1.0

/

+ 1.2

) and setting the direct block reward (

+ 0.7

) below all puzzle catches. This makes puzzle-based detection the higher-reward strategy, encouraging evidence gathering over opaque blocking. The penalty for incorrectly blocking a human increases from

- 1.0

to

- 1.5

, further discouraging false positives. Honeypot trigger probabilities are stratified by bot tiers to reflect realistic differences in bot sophistication, and the information bonus increases from

+ 0.3

to

+ 0.5

. The behavioral impact is directly observable in Section 4.1.1.

3.4.3. Architecture

The architecture has four main components: an LSTM backbone, an actor head, a critic head, and a shared representation. Our actor–critic architecture, originating from [34], uses an LSTM backbone that enables the RL agent to process sequential observations and accumulate evidence over time. This is well suited to our setting because user behavior unfolds temporally, and early signals must be integrated with later observations to infer whether the user is a bot or human.

LSTM Backbone

The core of the architecture is a Long Short-Term Memory (LSTM) network [37], which captures long-term dependencies in sequential data through gated memory mechanisms. At each timestep t, the network receives a 26-dimensional feature vector

x_{t} \in R^{26}

representing a window of user interaction telemetry and updates its hidden and cell states

(h_{t}, c_{t})

accordingly. Intuitively, the LSTM acts as a memory system that maintains a running summary of user behavior over time. At each step, it decides what past information to retain (forget gate), what new information to incorporate from the current observation (input gate), and what information to expose for decision-making (output gate). This allows the model to retain relevant behavioral patterns while discarding noise. By aggregating information across multiple timesteps, the LSTM enables the agent to accumulate evidence over time (e.g., combining early navigation behavior with later typing patterns, which helps form a more complete understanding of the user). Thus, the hidden state

h_{t} \in R^{128}

serves as a compressed representation of the observation history. Due to the partial observability of the environment, the LSTM hidden state can be interpreted as an implicit belief representation that summarizes past observations and approximates the underlying latent user state (bot or human) [40].

Actor Head

The actor head defines the policy

π_{θ} (a_{t} ∣ h_{t})

over the discrete action space. It takes the LSTM hidden state

h_{t}

as input and maps it to action scores (logits) using the following layers:

h_{t} \to Linear (128, 128) \to tanh \to Linear (128, 64) \to tanh \to Linear (64, 7) .

These logits are then masked to remove invalid actions and passed through a softmax function to produce a probability distribution over actions. During training, actions are sampled from this distribution to allow for exploration, while during evaluation, the highest-probability (greedy) action is selected. Invalid action masking is applied by setting the logits of invalid actions to

- \infty

before the softmax, ensuring they receive zero probability [39].
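The masking step can be illustrated with a small pure-Python sketch. The function name and the example logits are illustrative only, and the sketch assumes at least one action is valid.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over logits after setting invalid entries to -inf.

    Assumes at least one entry of `mask` is True.
    """
    masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
    m = max(masked)                            # subtract max for stability
    exps = [math.exp(l - m) for l in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Usage: observation phase, so only the first two actions are valid.
logits = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0]
mask = [True, True, False, False, False, False, False]
probs = masked_softmax(logits, mask)
greedy_action = max(range(len(probs)), key=probs.__getitem__)
```

Although the third logit is the largest, it is masked out, so the greedy choice falls on the best valid action.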

Critic Head

The critic estimates how good the current state is by predicting the value function:

V_{θ} (h_{t}) = E_{π} [G_{t} ∣ h_{t}]

It takes the LSTM hidden state

h_{t}

as input and maps it through a small neural network:

h_{t} \to Linear (128, 128) \to tanh \to Linear (128, 64) \to tanh \to Linear (64, 1) .

The output is a single value representing the expected future reward from the current state. This value is used as a baseline during training, which helps reduce variance in the updates, and makes learning more stable [30].

Shared Representation

The LSTM backbone is shared between the actor and critic. This keeps the model smaller and allows both parts of the network to learn from the same representation of user behavior. Since both actor and critic rely on similar information (i.e., whether the user is a bot or human), sharing the backbone works well in practice and is commonly used in PPO-based methods [41]. Figure 6 visualizes the full LSTM actor–critic architecture, and Table 7 summarizes its parameter count. The architecture is deliberately compact (∼130 K parameters) to limit overfitting on the limited training data while retaining enough capacity for the sequential classification task.
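The approximate parameter count can be checked from the layer sizes given above. The sketch below assumes a single-layer LSTM parameterized in the PyTorch style (two bias vectors per gate) and fully connected layers with biases; the breakdown in Table 7 may be organized differently.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def lstm_params(n_in: int, n_hidden: int, two_biases: bool = True) -> int:
    """Single-layer LSTM: four gates, each with input and recurrent weights."""
    weights = 4 * n_hidden * (n_in + n_hidden)
    biases = 4 * n_hidden * (2 if two_biases else 1)
    return weights + biases


lstm = lstm_params(26, 128)                                   # 79,872
actor = (linear_params(128, 128) + linear_params(128, 64)
         + linear_params(64, 7))                              # 25,223
critic = (linear_params(128, 128) + linear_params(128, 64)
          + linear_params(64, 1))                             # 24,833
total = lstm + actor + critic                                 # 129,928
```

Under these assumptions the total is 129,928, consistent with the stated ∼130 K, and the LSTM backbone accounts for roughly 61% of the parameters.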

3.4.4. Training Algorithms

We train three algorithm variants, all sharing the same LSTM architecture and environment. This controlled comparison, illustrated in Figure 7, isolates the effect of the policy optimization method from the architectural and environmental factors.

PPO (Proximal Policy Optimization)

PPO [35] is our primary training algorithm, chosen for its empirical stability and strong performance across a wide range of RL tasks. PPO belongs to the family of policy gradient methods [35] but addresses the instability of vanilla policy gradients through a clipped surrogate objective that constrains how far the policy can change in a single update.

Each training iteration collects a 4096-step on-policy rollout by running the current policy in the environment. Each interaction step (transition) stores the observation, action, reward, termination flag, action log-probability, value estimate, and action mask. These stored quantities are required for computing policy gradient updates, including the likelihood ratio and advantage estimates needed for PPO updates. Because the model incorporates an LSTM, the hidden state is reset at the beginning of each episode to prevent information leakage across independent trajectories. The initial hidden state for each sequence is also stored to ensure correct reconstruction of temporal dependencies during training updates. Advantages are computed using generalized advantage estimation (GAE) [32] and normalized across the rollout buffer to improve training stability.
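A minimal sketch of GAE followed by buffer-wide normalization is given below. The discount factor and GAE parameter shown as defaults are common choices used here as assumptions; the values actually used are those in Table 8.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout buffer.

    `gamma` and `lam` defaults are assumptions, not the paper's settings.
    `last_value` bootstraps the return if the rollout ends mid-episode.
    """
    advantages = [0.0] * len(rewards)
    gae, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0   # cut the trace at episode ends
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


def normalize(xs, eps=1e-8):
    """Zero-mean, unit-variance normalization across the buffer."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / (var ** 0.5 + eps) for x in xs]


# Usage: two `continue` steps (-0.001 each), then a correct block (+0.7).
adv = compute_gae(rewards=[-0.001, -0.001, 0.7], values=[0.0, 0.0, 0.0],
                  dones=[False, False, True], last_value=0.0,
                  gamma=1.0, lam=1.0)
adv_norm = normalize(adv)
```

With a zero value baseline and no discounting, each advantage equals the remaining return from that step, which makes the effect of the per-step penalty easy to see.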

PPO then updates the policy using a clipped surrogate objective, which constrains large policy changes between updates [35]. Value learning uses a clipped value loss, and an entropy bonus is included to encourage exploration.
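The clipped surrogate term can be sketched as follows. The clip range default is a commonly used value taken here as an assumption (the paper's setting is in Table 8), and the clipped value loss and entropy bonus are omitted for brevity.

```python
import math


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over transitions.

    `clip_eps` is an assumed default, not the paper's reported value.
    """
    terms = []
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)                          # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))      # pessimistic bound
    return -sum(terms) / len(terms)
```

For a positive advantage, the objective stops improving once the probability ratio exceeds 1 + `clip_eps`, which removes the incentive to move the policy further in a single update.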

Updates are performed over sequential episode segments to preserve LSTM state consistency. Segments are shuffled across epochs, but transitions within a segment retain temporal order. Gradients are clipped to a maximum norm of 0.5 to prevent instability during LSTM training. The hyperparameters for PPO are listed in Table 8.

DG (Delightful Policy Gradient)

DG [43] is a recent alternative to standard policy gradients that addresses noisy updates and redirects the gradient signal across contexts. Specifically, DG not only reduces variance within a context but also shifts the expected gradient direction across contexts toward a cross-entropy-like objective. Instead of weighting updates only by the advantage estimate, DG gates each update using both the advantage and the surprisal of the selected action. As a result, actions that are both beneficial and unlikely under the current policy receive stronger learning signals.

This mechanism emphasizes rare, informative successes while suppressing noisy or uninformative updates. In our CAPTCHA defense setting, this is particularly useful because rare but critical cases, such as complex bots, can have a disproportionate impact on system performance. DG uses only current policy probabilities and does not rely on importance sampling or PPO-style clipping. The DG hyperparameters used in this study are summarized in Table 9. All other hyperparameters (learning rate, rollout steps, etc.) match those used for PPO (Table 8).

PPO with Adaptive Entropy (Soft PPO)

Soft PPO extends PPO [35] by learning an entropy coefficient

α

to automatically balance exploration and exploitation, inspired by Soft actor–critic (SAC) [44]. Rather than using a fixed entropy bonus, Soft PPO adjusts the strength of entropy regularization based on how the policy entropy compares to a target entropy.

When the policy becomes too deterministic,

α

increases the influence of the entropy term to encourage additional exploration. When the policy is overly stochastic,

α

reduces this influence so that the agent can make more confident decisions. This allows the agent to balance exploration and decision-making automatically without manual tuning. In our implementation,

α

is optimized in log-space using Adam and constrained to a fixed range for stability. The Soft PPO hyperparameters are summarized in Table 10.
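The direction of the coefficient update can be illustrated with a simplified sketch. It replaces Adam with a plain gradient step on a SAC-style loss, and the learning rate and clamp range are hypothetical; the actual settings are those in Table 10.

```python
import math


def update_log_alpha(log_alpha, entropy, target_entropy, lr=1e-3,
                     log_min=math.log(1e-4), log_max=math.log(1.0)):
    """One plain gradient-descent step on J = log_alpha * (entropy - target).

    Simplification: the paper uses Adam; `lr`, `log_min`, and `log_max`
    are hypothetical values chosen for illustration.
    """
    grad = entropy - target_entropy        # dJ / d(log_alpha)
    log_alpha = log_alpha - lr * grad
    return min(max(log_alpha, log_min), log_max)   # clamp for stability


# Usage: entropy below target, so the coefficient grows.
log_alpha = math.log(0.01)
log_alpha = update_log_alpha(log_alpha, entropy=0.5, target_entropy=1.0)
alpha = math.exp(log_alpha)
```

Optimizing in log-space keeps the coefficient strictly positive without requiring an explicit constraint.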

3.4.5. Data Splitting and Training Protocol

All experiments use a stratified 70/15/15 train/validation/test split across 643 sessions (204 human; 439 bot), partitioned at the session level with a fixed split seed (

s = 42

). Human and bot sessions are split independently to maintain class proportions across all three sets. The partition is determined solely by the original sessions; augmented copies are assigned to the same split as their source session, preventing data leakage across splits. The training set contains

N_{train} = 449

original sessions (142 human; 307 bot), with 97 sessions each reserved for validation and test.

Each algorithm is trained in two configurations: noaug (original sessions only) and advaug (original sessions plus adversarially augmented bot sessions from the humanization pipeline described in Section 3.3). Each configuration is trained independently with five random seeds (42, 123, 456, 789, and 1024) to account for training variability, yielding 30 trained models in total (3 algorithms × 2 configurations × 5 seeds). The advaug training set additionally includes

6 \times | B_{train} | = 1842

augmented bot sessions (2 copies × 3 difficulty levels per training bot), where

B_{train}

denotes the 307 bot sessions in the training split. Primary evaluation uses the original test set (97 sessions: 31 human; 66 bot). A separate augmented test evaluation additionally includes the pre-generated humanized bot copies in the test set to assess robustness to humanized behaviors (Section 4.1.1). All 30 models share the same underlying data split.
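The session counts above can be reproduced with simple integer arithmetic. The rounding rule used here (floor the training share, then halve the remainder) is an assumption chosen because it matches the reported numbers; the released code may split differently.

```python
def split_counts(n: int, train_pct: int = 70):
    """Per-class 70/15/15 counts; the rounding rule is an assumption."""
    n_train = n * train_pct // 100
    n_val = (n - n_train) // 2
    n_test = n - n_train - n_val
    return n_train, n_val, n_test


human = split_counts(204)   # (142, 31, 31)
bot = split_counts(439)     # (307, 66, 66)

n_train = human[0] + bot[0]              # 449 original training sessions
n_test = human[2] + bot[2]               # 97 test sessions
augmented_bots = 2 * 3 * bot[0]          # 2 copies x 3 difficulty levels = 1842
n_models = 3 * 2 * 5                     # algorithms x configurations x seeds = 30
```

Splitting each class separately is what keeps the human-to-bot ratio nearly identical across the training, validation, and test sets.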

Training Loop

Training follows the standard on-policy rollout collection and update cycle. In each iteration, the agent interacts with the environment for 4096 steps, collecting a rollout buffer of transitions. At the start of each episode, the environment samples a session uniformly at random from the training split (with on-the-fly augmentation applied stochastically), and the LSTM hidden state is reset. For each transition, the agent receives a 26-dimensional observation window, selects an action according to its current policy

π_{θ}

subject to the applicable action mask, and records the following information: observation, action, reward, termination flag, action log-probability, value estimate, and action mask. Episodes that terminate mid-rollout (via a terminal action or truncation) contribute their final outcome to per-rollout statistics; the remaining budget of steps continues with a new session. If the rollout ends mid-episode, the critic’s value estimate for the final observation is used to bootstrap the return.

After each rollout, the advantages are computed using GAE and normalized across the buffer. The policy is then updated over 4 epochs. Within each epoch, the buffer is split into episode-level segments, which are shuffled across epochs to reduce correlation. Transitions within each segment retain their temporal order to preserve LSTM hidden-state consistency. Each segment is processed by passing its full observation sequence through the LSTM from the recorded initial hidden state

(h_{0}, c_{0})

, recomputing logits and value estimates, and applying the clipped surrogate loss. Gradients are clipped to a maximum norm of 0.5 to prevent exploding gradients in the recurrent backbone [42]. This process repeats for

⌊ 500,000 / 4096 ⌋ = 122

rollout iterations. Checkpoints are saved every 10 rollouts alongside a deterministic validation pass over 100 episodes (with augmentation disabled), and the final checkpoint is used for evaluation.

On-the-Fly Stochastic Augmentation

To mitigate overfitting on our limited dataset, we apply stochastic augmentation to every training episode at sampling time. Unlike the adversarial augmentation pipeline (Section 3.3), which pre-generates static humanized copies, on-the-fly augmentation applies random perturbations each time a session is drawn, ensuring the agent never observes the same input twice across training. This is applied independently of, and in addition to, adversarial augmentation. With probability

p_{aug} = 0.5

, all three perturbation types are applied simultaneously to the episode:

Position noise: Gaussian noise $N (0, σ_{pos}^{2})$ is added to all mouse and click coordinates, with $σ_{pos} = 15$ px for bot sessions and $σ_{pos} = 5$ px for human sessions. The lighter human perturbation preserves the natural structure of genuine mouse trajectories while still providing regularization.
Timing jitter: Gaussian noise $N (0, σ_{t}^{2})$ is added to event timestamps, with $σ_{t} = 30$ ms for bots and $σ_{t} = 15$ ms for humans. This prevents the agent from relying on exact inter-event timing, which can vary across hardware and network conditions.
Speed warping: All timestamps are scaled by a single random factor $w \sim U (w_{min}, w_{max})$, with $(w_{min}, w_{max}) = (0.7, 1.4)$ for bots and $(0.85, 1.15)$ for humans. This simulates sessions that are faster or slower overall.

Bot sessions receive stronger perturbations than human sessions by design. The asymmetry serves two purposes: (1) it broadens the distribution of bot behaviors the agent encounters during training, improving generalization to unseen bot variants, and (2) it preserves the subtler statistical structure of human sessions, which the agent must learn to recognize as legitimate. On-the-fly stochastic augmentation is disabled during validation and test evaluation to ensure metrics reflect performance on the fixed evaluation distribution. This does not affect whether pre-generated adversarially augmented bot sessions are included in the test split.
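The three perturbations can be sketched as a single per-episode function. The event layout (dicts with `t` in milliseconds and optional `x`, `y` in pixels) and the order of operations (speed warp first, then timing jitter) are assumptions; the noise scales and warp ranges are the values stated above.

```python
import random

# Perturbation strengths stated in the text.
PARAMS = {
    "bot": {"sigma_pos": 15.0, "sigma_t": 30.0, "warp": (0.7, 1.4)},
    "human": {"sigma_pos": 5.0, "sigma_t": 15.0, "warp": (0.85, 1.15)},
}


def augment_episode(events, label, rng, p_aug=0.5):
    """Apply all three perturbations together with probability `p_aug`.

    Assumed event layout: {"t": ms, "x": px, "y": px}; x and y are optional.
    """
    if rng.random() >= p_aug:
        return [dict(ev) for ev in events]       # unmodified copy
    p = PARAMS[label]
    w = rng.uniform(*p["warp"])                  # one warp factor per episode
    out = []
    for ev in events:
        new = dict(ev)
        # Assumed order: speed warping first, then timing jitter.
        new["t"] = ev["t"] * w + rng.gauss(0.0, p["sigma_t"])
        if "x" in ev and "y" in ev:              # mouse and click events only
            new["x"] = ev["x"] + rng.gauss(0.0, p["sigma_pos"])
            new["y"] = ev["y"] + rng.gauss(0.0, p["sigma_pos"])
        out.append(new)
    return out


# Usage: perturb a short bot session.
rng = random.Random(42)
session = [{"t": 100.0 * i, "x": 10.0 * i, "y": 5.0 * i} for i in range(5)]
augmented = augment_episode(session, "bot", rng, p_aug=1.0)
```

Note that timing jitter can reorder closely spaced events; whether timestamps are re-sorted afterwards is not specified in the text, and this sketch leaves them as they are.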

3.4.6. Inference and Pseudo-Online Training

At inference time, the agent processes a user session sequentially by constructing a telemetry timeline and passing observation windows through the LSTM-based policy. The LSTM hidden state is reset at the start of each session. The terminal action selected by the agent determines the intervention shown to the user (e.g., allow, block, or issue a challenge). To enable adaptation to evolving bot behaviors, we implement a pseudo-online training mechanism based on single-session PPO updates. After a session is completed and a ground-truth label is obtained via the confirmation endpoint, the full interaction trajectory is replayed through the network with the same action masking used during offline training. The resulting transitions are stored in a single-episode rollout buffer, advantages are computed using the same GAE procedure described in Section 3.4.4, and a PPO update is performed. To ensure stability, the online learning rate is reduced to 60% of the offline rate (

1.8 \times 10^{- 4}

), and updates run for 3 optimization epochs instead of 4. The agent’s decision on the session is logged both before and after the weight update, providing an auditable record of whether online adaptation improved, regressed, or left the decision unchanged. The updated checkpoint is saved immediately after each update. This assumes the confirmation label is reliable; noisy labels could reinforce incorrect decisions, so online updates are gated on label confidence in deployment.

3.4.7. Evaluation Protocol

To assess performance, all 30 models are evaluated on the held-out test split using deterministic (greedy) policy execution. The primary results use the revised reward structure; legacy reward evaluations are included for comparison only (Section 3.4.2).

Each evaluation episode corresponds to a single user session processed sequentially through the LSTM using the same action masking scheme applied during training. To account for environmental stochasticity (puzzle pass rates; honeypot triggers), we run 500 episodes per agent across 5 independent random seeds and report mean ± standard deviation. Episodes are sampled with replacement from the test sessions to produce stable metric estimates under varying stochastic outcomes. Beyond the primary evaluation, we conduct seven additional evaluation regimes:

Cross-environment transfer: Tests revised-trained agents in the legacy reward environment to assess robustness to reward changes.
Augmented test evaluation: Evaluates all models on adversarially humanized bot copies to test robustness to humanized behavior.
Disjoint tier generalization: Withholds bot tiers during PPO training and then evaluates on the held-out tiers to measure zero-shot generalization.
Human disjoint generalization: Withholds one human participant during PPO training and then evaluates on that participant to test unseen-human generalization.
Reward sensitivity analysis: Sweeps key reward parameters to test robustness to reward misspecification.
Ablation study: Removes or modifies architectural and reward components to identify key performance drivers.
Strict accuracy evaluation: Counts human puzzle challenges as false positives to estimate conservative UX impact.

3.4.8. Evaluation Metrics

We report the following metrics capturing both classification performance and reinforcement learning behavior:

Accuracy, precision, recall, and F1: Standard binary classification metrics. Precision measures the fraction of bot predictions that are correct (minimizing disruption to humans), while recall measures the fraction of actual bots detected.
Per-tier and per-family detection rate: Recall stratified across the five adversarial tiers (T1: Commodity through T5: LLM-Powered) and across bot families, capturing adversarial resilience as sophistication increases.
Average reward: Mean episodic return under the revised reward function, reflecting correct classification, user friction costs, and information-gathering bonuses.
Honeypot usage: Fraction of episodes in which the agent deploys at least one honeypot, capturing preference for evidence gathering over direct action.
Challenge rate: Fraction of human sessions in which the agent issues a puzzle challenge, measuring UX friction imposed on legitimate users.
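The classification metrics can be computed as in the sketch below, which treats bot as the positive class in line with the definitions of precision and recall above. The function name and the label encoding (1 for bot, 0 for human) are choices made for this example.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with bot (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Usage: three bots (one missed) and two humans (none flagged).
metrics = binary_metrics(y_true=[1, 1, 1, 0, 0], y_pred=[1, 1, 0, 0, 0])
```

In this toy case no human is flagged, so precision is 1.0, while the missed bot lowers recall to 2/3 and F1 to 0.8.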

3.5. Classifier

In addition to the RL agent, we developed a supervised machine learning classifier that serves as a critical benchmark. While the RL agent makes sequential decisions over sliding windows of behavioral telemetry, the classifier takes a holistic approach by analyzing an entire session’s telemetry after it concludes. It produces a single human-likelihood score

\hat{p} \in [0, 1]

, where

\hat{p} = 1.0

indicates very high confidence that the session belongs to a human and

\hat{p} = 0.0

indicates very high confidence that the session belongs to a bot.

3.5.1. Feature Engineering

Raw JSON telemetry collected during a user session is condensed into a 39-dimensional feature vector organized across eight behavioral groups, summarized in Table 11. Each group targets a distinct aspect of user interaction, chosen based on the behavioral differences between humans and bots identified in Section 2.3.

This 39-dimensional representation aggregates each session into fixed-length statistics that summarize how the user interacted rather than what they did, making the classifier robust to differences in page layout or task content. The feature set is intentionally compact and human-interpretable so that XGBoost can train reliably without overfitting and so that feature importance analysis remains meaningful for research interpretability.

The classifier’s 39-dimensional session-level feature set differs deliberately from the 26-dimensional window-level representation used by the RL agent (Table A1). The two models operate under different temporal constraints that dictate their feature design. The RL agent must produce an action at every observation window during a live session, so its features are restricted to quantities computable from a fixed-length window of 30 events; several session-level statistics used by the classifier—such as total session duration, global rhythm regularity, and cumulative spatial coverage—are undefined or degenerate at this scale, since a single window may contain zero keystrokes or only a narrow slice of the user’s spatial trajectory. The classifier, by contrast, operates post hoc on completed sessions and can therefore exploit the full event stream to compute stable, whole-session aggregates. Despite the dimensional difference, both feature sets draw from the same underlying telemetry channels (mouse dynamics, click patterns, keystroke timing, scroll behavior, spatial coverage, and event-type composition) and encode the same behavioral intuitions at different temporal granularities. This asymmetry is central to the complementary roles of the two models: the RL agent provides real-time sequential intervention under streaming constraints, while the classifier provides an interpretable session-level score suitable for auditing and post hoc analysis.

3.5.2. Model Architecture

We selected XGBoost (eXtreme Gradient Boosting) [45] as the classification algorithm based on four considerations. First, the 39 input features are aggregate session-level statistics (i.e., tabular data), a domain in which gradient-boosted decision trees consistently match or outperform neural network approaches [46]. Second, XGBoost handles small labeled datasets effectively through built-in

L_{1}

/

L_{2}

regularization and early stopping. Third, the model provides interpretable feature importance scores via gain-based splitting, which aids in understanding which behavioral signals are most discriminative. Fourth, XGBoost achieves sub-millisecond CPU inference, requiring no GPU, which is critical for real-time deployment alongside the RL agent.

XGBoost constructs an additive ensemble of CART regression trees, optimizing a regularized binary cross-entropy objective with

L_{1}

,

L_{2}

, and minimum-split-loss penalties; we refer the reader to Chen and Guestrin [45] for the full formulation. The model is configured with the hyperparameters listed in Table 12. Regularization is intentionally strong (

α = 0.3

,

λ = 2.0

,

γ = 0.3

, and min_child_weight

= 5

) to prioritize generalization over memorization on a small dataset.

3.5.3. Training Pipeline

The classifier is trained with a regularization-heavy pipeline designed to generalize from a small labeled corpus and to resist adversarial mimicry of human behavior.

Data Split

A stratified 70/30 train/test split is applied at the session level, preserving the class distribution between human (

y = 1

) and bot (

y = 0

) sessions. Pre-generated augmented copies (Section 3.3) are appended to the train split only, never the test split, so that evaluation reflects the real-world distribution and no augmented sample leaks across the boundary.

This split differs from the 70/15/15 train/validation/test partition used for the RL agents (Section 3.4.5). The difference is deliberate: the RL agents require a held-out validation set for checkpoint selection during training, whereas the XGBoost classifier performs model selection via Optuna’s 5-fold stratified cross-validation on the training set, making a separate validation partition redundant. Allocating the full 30% to the test split increases the number of held-out sessions available for evaluation, yielding tighter confidence intervals on the reported metrics. Despite the different partition ratios, both pipelines draw from the same underlying session pool with the same random seed, and neither allows augmented data to leak into the test set.

Feature Standardization

A StandardScaler is fit on the training features and applied to both splits, zero-centering and unit-scaling each feature so that subsequent noise augmentation is uniform regardless of the original feature scale.

Feature-Space Adversarial Augmentation

For each bot sample,

n_{adv} = 2

humanized copies are generated by blending the bot’s standardized feature vector toward the mean of the human samples with a per-sample random factor

β \sim U (0.2, 0.6)

and then adding Gaussian noise scaled by the human-feature standard deviation (

σ_{adv} = 0.3

). These copies retain the bot label, forcing the classifier to look beyond surface-level differences.
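One plausible reading of this blending step is a linear interpolation toward the human mean followed by additive noise, sketched below. The interpolation form and the function name are assumptions; the blend range, noise scale, and number of copies are the values stated above.

```python
import random


def humanize(bot_x, human_mean, human_std, rng,
             beta_range=(0.2, 0.6), sigma_adv=0.3):
    """Blend one standardized bot vector toward the human mean, then add noise.

    Assumed form: x' = (1 - beta) * x + beta * mu_human + noise, with noise
    standard deviation sigma_adv * sigma_human per feature.
    """
    beta = rng.uniform(*beta_range)              # one blend factor per sample
    return [(1.0 - beta) * b + beta * m + rng.gauss(0.0, sigma_adv * s)
            for b, m, s in zip(bot_x, human_mean, human_std)]


# Usage: two humanized copies of one bot sample; both keep the bot label (0).
rng = random.Random(42)
bot_x = [0.0, 0.0, 0.0]
human_mean, human_std = [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]
copies = [(humanize(bot_x, human_mean, human_std, rng), 0) for _ in range(2)]
```

Because the copies keep the bot label while moving toward the human region of feature space, they act as hard negatives near the decision boundary.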

Feature Noise Augmentation

To prevent overfitting to exact feature values, the training set is duplicated into

n_{copies} = 3

noisy versions, each perturbed by Gaussian noise:

x_{i}^{'} = x_{i} + η, η_{f} \sim N (0, {(σ_{n} \cdot {\hat{σ}}_{f})}^{2}),

(3)

where

σ_{n} = 0.5

controls the noise scale relative to each feature’s training-set standard deviation

{\hat{σ}}_{f}

.
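Equation (3) can be sketched in a few lines. Whether the original samples are kept alongside the noisy copies is not stated, so this sketch returns only the noisy copies and leaves that choice to the caller; the function name is illustrative.

```python
import random


def noisy_copies(X, feat_std, rng, n_copies=3, sigma_n=0.5):
    """Return n_copies perturbed versions of every row of X, per Equation (3).

    Each feature f receives Gaussian noise with standard deviation
    sigma_n * feat_std[f].
    """
    out = []
    for _ in range(n_copies):
        for x in X:
            out.append([v + rng.gauss(0.0, sigma_n * s)
                        for v, s in zip(x, feat_std)])
    return out


# Usage: after standardization, each training-set feature has unit std.
rng = random.Random(42)
X_train = [[0.1, -0.3], [1.2, 0.4]]
X_noisy = noisy_copies(X_train, feat_std=[1.0, 1.0], rng=rng)
```

Scaling the noise by each feature's standard deviation keeps the perturbation proportionate across features, which is why the scaler is fit before this step.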

Label Smoothing

Inspired by the regularization principle behind label smoothing [47], we reduce the influence of every training sample by scaling its weight:

$$w_{i} = 1 - \alpha_{ls}, \tag{4}$$

with $\alpha_{ls} = 0.05$, so that each sample’s contribution to the loss is multiplied by $0.95$. This prevents the model from over-committing to any single training example, producing softer probability estimates without modifying the binary labels themselves.

Class Imbalance Handling

To account for potential class imbalance, XGBoost’s scale_pos_weight is set to the ratio of negative to positive samples:

$$w_{pos} = \frac{N_{bot}}{N_{human}}, \tag{5}$$

ensuring that the loss contribution of each class is balanced during training.
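Equations (4) and (5) reduce to two small computations, sketched below. The function name `training_weights` and the class counts are hypothetical; in an XGBoost workflow the first result would typically be passed as `sample_weight` when fitting and the second as the `scale_pos_weight` parameter.

```python
def training_weights(labels, alpha_ls=0.05):
    """labels use 1 = human (positive class) and 0 = bot (negative class)."""
    n_human = sum(1 for y in labels if y == 1)
    n_bot = len(labels) - n_human
    sample_weight = [1.0 - alpha_ls] * len(labels)  # Eq. (4): uniform down-weighting
    scale_pos_weight = n_bot / n_human              # Eq. (5): negatives / positives
    return sample_weight, scale_pos_weight

# Illustrative counts only, not the paper's dataset sizes.
labels = [1] * 40 + [0] * 120
sample_weight, scale_pos_weight = training_weights(labels)
```

With three bots per human in this toy set, each human sample's loss contribution is scaled up by a factor of 3 so that the two classes carry equal total weight.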

Optuna Hyperparameter Tuning

An optional automated tuning stage uses Optuna [48] with 5-fold stratified cross-validation and ROC-AUC as the optimization objective. The search space covers XGBoost regularization parameters (max_depth, min_child_weight, $\eta$, subsample, colsample_bytree, $\alpha$, $\lambda$, and $\gamma$) as well as the augmentation hyperparameters ($\sigma_{n}$, $n_{copies}$, $\alpha_{ls}$, $n_{adv}$, and $\sigma_{adv}$). The resulting configuration is then used to retrain the final model on the full training split. The trained model exposes human_score($x$), returning a probability in $[0, 1]$ that serves as an interpretable session-level human-likelihood score.

3.5.4. Evaluation

The classifier is evaluated on the held-out test set using a comprehensive set of metrics: accuracy, precision, recall, $F_{1}$ score, and area under the receiver operating characteristic curve (ROC-AUC). The predicted human-likelihood score $\hat{p}$ is compared against a default decision threshold of $\tau = 0.5$:

$$\hat{y}_{i} = \begin{cases} 1 \ (\text{human}) & \text{if } \hat{p}_{i} \geq \tau, \\ 0 \ (\text{bot}) & \text{otherwise}. \end{cases} \tag{6}$$

In addition to aggregate metrics, we report the following:

Score distribution analysis: Histograms of $\hat{p}$ for human and bot sessions, assessing the separation between class distributions.
Feature importance ranking: Gain-based importance from XGBoost, identifying which behavioral signals contribute most to classification decisions.
Confusion matrix: Visualizing the tradeoff between false positives (humans incorrectly blocked) and false negatives (bots incorrectly allowed), which have asymmetric costs in a CAPTCHA deployment setting, since blocking a legitimate user has greater consequences than admitting a bot.
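The thresholding rule in Equation (6) and the resulting confusion-matrix metrics can be sketched as follows. Treating "bot" as the positive class for precision and recall is an assumption inferred from the definition of false positives as humans incorrectly blocked; the function name `evaluate` and the scores are synthetic.

```python
def evaluate(scores, is_human, tau=0.5):
    """Threshold human-likelihood scores (Eq. 6) and score bot detection."""
    pred_human = [p >= tau for p in scores]
    pairs = list(zip(pred_human, is_human))
    tp = sum(1 for ph, h in pairs if not ph and not h)  # bots flagged as bots
    fp = sum(1 for ph, h in pairs if not ph and h)      # humans flagged as bots
    fn = sum(1 for ph, h in pairs if ph and not h)      # bots passed as human
    tn = sum(1 for ph, h in pairs if ph and h)          # humans passed as human
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(scores), "precision": precision,
            "recall": recall, "f1": f1, "confusion": (tp, fp, fn, tn)}

# Synthetic scores: three humans, five bots, one bot scoring above the threshold.
scores = [0.97, 0.88, 0.61, 0.55, 0.12, 0.03, 0.40, 0.08]
is_human = [True, True, True, False, False, False, False, False]
report = evaluate(scores, is_human)
```

In this toy case the only error is a bot that scores 0.55 and is passed as human, so precision stays at 1.0 while recall falls to 0.8, the same error pattern the classifier results later describe (no false positives, occasional missed bots).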

4. Results

4.1. Reinforcement Learning (RL) Results

We evaluate 30 trained agents spanning six configurations: three algorithms (PPO, DG, and Soft PPO; Section 3.4.4), each trained in two data regimes (noaug and advaug) with five random seeds per configuration. The noaug regime trains on original sessions only, while the advaug regime additionally includes adversarially humanized bot sessions (Section 3.3). All primary results use the revised reward structure (Table 6); legacy reward comparisons appear only where relevant to illustrate the impact of the reward redesign.

4.1.1. Overall RL Classification Performance

Table 13 reports RL classification performance for all six agent configurations under the revised reward structure on the held-out test split. Each configuration trains five independently seeded agents, and each agent is evaluated across five evaluation seeds of 500 episodes each, giving 25 total runs per configuration. The reported mean and standard deviation reflect variance across the five training seeds. All six configurations reach at least 0.961 accuracy and 0.958 F1, showing that the revised reward structure successfully guides all three algorithms toward strong bot detection regardless of augmentation.
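The aggregation described above (five evaluation seeds averaged within each training seed, then mean and standard deviation taken across the five training seeds) can be sketched as follows. The function name `summarize`, the seed identifiers, and the accuracy values are synthetic, and the use of the sample standard deviation is an assumption, since the text does not specify the estimator.

```python
import statistics

def summarize(runs):
    """runs maps training seed -> list of per-evaluation-seed metric values."""
    per_training_seed = [statistics.fmean(vals) for vals in runs.values()]
    # Spread is measured across training seeds, not across all 25 runs.
    return statistics.fmean(per_training_seed), statistics.stdev(per_training_seed)

# Synthetic accuracies, not values from the paper: 5 training seeds x 5 evaluation seeds.
offsets = (-0.002, -0.001, 0.0, 0.001, 0.002)
runs = {seed: [base + d for d in offsets]
        for seed, base in zip((1, 2, 3, 4, 5),
                              (0.975, 0.976, 0.977, 0.978, 0.979))}
mean_acc, std_acc = summarize(runs)
print(f"{mean_acc:.3f} +/- {std_acc:.4f}")
```

Averaging within each training seed first means evaluation-seed noise does not inflate the reported spread, which then reflects only how much independently trained agents differ.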

Overall Ranking

Soft PPO with adversarial augmentation is the top-performing configuration across nearly every metric: accuracy $0.977 \pm 0.001$, F1 $0.976 \pm 0.002$, recall $0.954 \pm 0.004$, and average reward $1.008 \pm 0.010$. Its accuracy standard deviation of $0.001$ is the lowest of any configuration by a wide margin, indicating that Soft PPO+Advaug converges reliably across seeds. Soft PPO without augmentation follows closely ($0.974$ accuracy; $0.973$ F1), confirming that the adaptive entropy mechanism is the primary driver of stability rather than augmentation alone.

Precision and False Positives

Three of six configurations achieve perfect precision ($1.000 \pm 0.000$): PPO+Advaug, DG+Advaug, and Soft PPO+Advaug. PPO noaug reaches $0.997 \pm 0.007$ and Soft PPO noaug $0.996 \pm 0.004$, both indicating very rare false positives. DG noaug is the exception at $0.975 \pm 0.045$, the highest standard deviation of any classification metric across any configuration. Per-seed inspection shows that seed 789 drops to $0.885$ precision and drags the mean down, while the other four seeds achieve perfect or near-perfect precision. This points to DG’s delight-gated update rule being sensitive to initialization: most seeds learn to avoid false positives, but an occasional unlucky seed does not.

Recall

DG noaug achieves the tightest recall ($0.949 \pm 0.003$), suggesting its update gating acts as an implicit regularizer that prevents overconfident terminal decisions. DG+Advaug, however, shows the lowest recall ($0.921 \pm 0.041$) with high variance, meaning adversarial augmentation appears to destabilize DG’s recall under the revised reward signal, a pattern not seen in PPO or Soft PPO. PPO variants hold recall near $0.936$–$0.937$ with consistent variance ($\pm 0.013$), while Soft PPO noaug and Soft PPO+Advaug reach $0.951$ and $0.954$, respectively.

Effect of Adversarial Augmentation

Augmentation has algorithm-dependent effects. For PPO, it eliminates false positives (precision $1.000$ vs. $0.997$) with no meaningful change in recall or accuracy, suggesting it primarily shrinks the false-positive tail without affecting overall decision quality. For Soft PPO, augmentation improves every metric and cuts the accuracy standard deviation by $4\times$ ($0.001$ vs. $0.004$), indicating that advaug and Soft PPO’s entropy schedule interact well to stabilize learning. For DG, augmentation flips the tradeoff: precision improves from $0.975$ to $1.000$, but recall drops from $0.949$ to $0.921$, yielding a net decrease in F1 ($0.958$ vs. $0.962$). This suggests DG+Advaug over-corrects toward caution and waits for stronger evidence before acting, at the cost of missing some bots.
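As a rough consistency check on the DG tradeoff, F1 can be recomputed from the mean precision $P$ and recall $R$ quoted above. This assumes the reported F1 values are averages of per-seed F1 scores, so the third decimal need not match exactly:

$$F_{1} = \frac{2PR}{P + R}, \qquad F_{1}^{\text{DG noaug}} \approx \frac{2(0.975)(0.949)}{0.975 + 0.949} = \frac{1.85055}{1.924} \approx 0.962, \qquad F_{1}^{\text{DG+Advaug}} \approx \frac{2(1.000)(0.921)}{1.000 + 0.921} = \frac{1.842}{1.921} \approx 0.959.$$

The recomputed values agree with the reported $0.962$ and $0.958$ to within rounding, and they show that the recall loss outweighs the precision gain.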

Honeypot Behavior

Honeypot usage rates range from $0.791$ to $0.862$. Soft PPO+Advaug, despite being the best classifier, deploys honeypots least frequently ($0.791 \pm 0.040$). This suggests the policy has learned to make accurate terminal decisions from behavioral features alone and only uses honeypots when genuinely uncertain. DG noaug has both the lowest honeypot rate among no-augmentation agents ($0.826$) and the highest variance ($\pm 0.066$), consistent with its seed-dependent instability, where seeds that converge well use honeypots discriminately while unstable seeds show erratic deployment.

Average Reward Alignment

The two Soft PPO configurations are the only agents to exceed average reward $1.0$ ($1.004$ and $1.008$), while DG noaug posts the lowest ($0.942 \pm 0.122$). The reward ranking closely tracks the classification ranking, with the highest-reward agent also being the highest-accuracy agent. This confirms that the revised reward structure, with tiered puzzle-catch rewards scaled by difficulty, successfully encoded the task objective without introducing reward hacking or divergence between reward optimization and detection quality.

4.1.2. RL Training Dynamics

Figure 8 shows training dynamics for all three algorithms under adversarial augmentation and the revised reward structure, averaged across five seeds.

Training Reward

The training reward shows that PPO converges fastest, reaching near-peak reward within the first 100k steps and remaining stable. Soft PPO rises more gradually but steadily, approaching a similar ceiling by 300k steps. DG exhibits the widest seed variance throughout training, consistent with the precision instability observed at evaluation and reflecting its dependence on advantage–surprisal alignment for gradient updates.

Validation Accuracy

The validation accuracy is high and stable across all three algorithms from early in training, confirming that classification performance is not sensitive to when training is stopped past around 100k steps. DG again shows the most inter-seed spread, while Soft PPO and PPO maintain tighter bands.

Policy Entropy

The policy entropy reveals the clearest behavioral difference between algorithms. PPO’s entropy steadily decays toward near zero, indicating that the policy commits to deterministic action preferences over time. DG follows a similar trend with higher variance. Soft PPO, by contrast, maintains a relatively stable entropy level throughout training due to its adaptive entropy coefficient, which resists premature commitment. This sustained exploration explains the consistency in Soft PPO’s evaluation performance across seeds.
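To make the entropy comparison concrete, the sketch below computes the Shannon entropy of a categorical action distribution. The seven-action space and the probability values are hypothetical and chosen only to contrast a spread-out policy with a nearly deterministic one.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical 7-action policies (action count is an assumption for illustration).
uniform_policy = [1.0 / 7] * 7             # maximally exploratory
peaked_policy = [0.97] + [0.005] * 6       # almost always picks one action

h_uniform = entropy(uniform_policy)        # equals ln(7)
h_peaked = entropy(peaked_policy)          # close to zero
```

A policy whose entropy has decayed toward zero, as described for PPO, behaves like `peaked_policy`; an adaptive entropy coefficient keeps the distribution further from that limit.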

Training Correctness Rate

The training correctness rate tracks the fraction of correct decisions made during rollouts. All three algorithms reach above 0.8 and improve steadily, with PPO and Soft PPO showing tighter convergence than DG.

4.1.3. RL Reward Structure Impact

Figure 9 shows how the reward redesign changes agent decision behavior while preserving classification performance. Each bar shows the fraction of terminal decisions split into allow, direct block, and puzzle challenges (medium and hard) across both reward structures.

Allow Rate

The allow rate holds near 50% across all three algorithms and both reward structures, matching the balanced human/bot mix used during evaluation (500 episodes per agent, sampled with equal class probability rather than following the natural test split distribution of 31 humans and 66 bots). The redesign does not shift the allow/block boundary, confirming it does not introduce additional false positives or negatives. The classification accuracy observed in Table 13 is preserved regardless of which reward structure is used.
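The balanced evaluation mix can be sketched as a two-stage draw: pick the class with equal probability, then pick a session uniformly within that class. The function name `sample_balanced` and the session identifiers are hypothetical; the pool sizes of 31 and 66 follow the test split counts quoted above.

```python
import random

def sample_balanced(humans, bots, n_episodes=500, seed=0):
    """Draw episodes with equal class probability, regardless of pool sizes."""
    rng = random.Random(seed)
    episodes = []
    for _ in range(n_episodes):
        if rng.random() < 0.5:
            episodes.append((rng.choice(humans), "human"))
        else:
            episodes.append((rng.choice(bots), "bot"))
    return episodes

humans = [f"h{i}" for i in range(31)]   # placeholder session ids
bots = [f"b{i}" for i in range(66)]
episodes = sample_balanced(humans, bots)
human_frac = sum(label == "human" for _, label in episodes) / len(episodes)
natural_frac = len(humans) / (len(humans) + len(bots))   # about 0.32
```

Under this scheme roughly half of the episodes are human even though humans make up only about 32% of the underlying pool, which is why an allow rate near 50% is the expected outcome for an accurate agent.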

Direct Block

Under the legacy reward, direct block is the dominant bot-handling strategy: PPO blocks 50% of sessions directly, Soft PPO 45%, and DG 33%. Under the revised structure, direct block effectively disappears in all three algorithms, dropping to near zero. This confirms that the legacy reward, which assigned equal catch reward to both puzzle and direct block while adding per-action puzzle costs, made direct blocking the rational choice. Removing those costs and scaling puzzle-catch rewards by difficulty makes puzzle challenges strictly preferable for bot sessions.

Hard Puzzle

In the revised structure, all three algorithms converge to routing nearly all bot-handling decisions through hard puzzle challenges: PPO at 49%, DG at 50%, and Soft PPO at 50%. This shift is operationally significant since puzzle-based detections provide interpretable evidence of bot behavior rather than opaque blocking decisions, which matters for auditability and false-positive recovery.

Partial Puzzle Affinity

Unlike PPO and Soft PPO, DG already allocates roughly 15% of decisions to hard puzzles and a small 2% slice to medium puzzles under the legacy reward. This suggests DG’s delight-gated update rule naturally favors actions with higher evidence value even when not explicitly incentivized. The revised reward then amplifies this preexisting tendency into full puzzle dominance.

4.1.4. RL Generalization: Bot-Tier Difficulty

Figure 10 evaluates how well a trained agent generalizes to bot tiers it has never seen during training. For each tier configuration, we train a PPO agent with those sessions excluded entirely and evaluate only on the held-out sessions, comparing against a baseline agent trained on all tiers. Recall and F1 are reported rather than accuracy because holding out a tier shifts the bot proportion in the test set, making accuracy misleadingly high even when the agent fails to detect the held-out bots.
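The tier-disjoint protocol amounts to a leave-group-out split followed by a recall computation over the held-out bots. The sketch below assumes each session carries a `tier` field (with `None` for humans); the field names, helper names, and toy sessions are hypothetical.

```python
def holdout_split(sessions, held_out_tiers):
    """Exclude the given bot tiers from training and evaluate only on them."""
    held = set(held_out_tiers)
    train = [s for s in sessions if s["tier"] not in held]
    test = [s for s in sessions if s["tier"] in held]
    return train, test

def bot_recall(blocked_flags):
    """Fraction of held-out bot sessions that the agent caught."""
    return sum(blocked_flags) / len(blocked_flags)

# Toy pool: 4 humans plus 2 bots in each of tiers 1 to 5.
sessions = (
    [{"id": f"h{i}", "label": "human", "tier": None} for i in range(4)]
    + [{"id": f"b{t}_{i}", "label": "bot", "tier": t}
       for t in range(1, 6) for i in range(2)]
)
train, test = holdout_split(sessions, held_out_tiers=[5])
example_recall = bot_recall([True, False, True, True])   # 3 of 4 bots caught
```

Because the held-out evaluation set is dominated by one tier of bots, recall over those sessions answers the question of interest directly.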

Lower Tiers

Tier 1 (commodity) and Tier 2 (careful automation) show no meaningful degradation when held out: recall holds at $0.939$ and $0.941$, respectively, matching or slightly exceeding the baseline ($0.937$ and $0.930$). F1 is identical to the baseline at $0.967$ and $0.964$. These tiers exhibit distinctive behavioral signatures, such as high movement regularity and unnatural timing patterns, that the agent detects through general learned features without needing explicit training exposure.

Mid-Tier Degradation

Tier 3 (adaptive) shows a small but consistent drop: recall falls from $0.906$ to $0.873$ and F1 from $0.950$ to $0.932$. Tier 4 (stealth) degrades more noticeably, with recall dropping from $0.943$ to $0.874$ and F1 from $0.971$ to $0.925$. These bots use deliberate behavioral mimicry that the agent has learned to detect partly through exposure, so excluding them reduces detection reliability.

Tier 5 (LLM-Powered)

When LLM bots are entirely excluded from training, recall drops from $0.917$ to $0.528$, and F1 falls from $0.957$ to $0.687$. This is the sharpest degradation in the evaluation: LLM-powered bots generate highly variable, human-like behavioral sequences that do not share obvious surface patterns with other tiers. Without training exposure, the agent fails to detect nearly half of them.

Multi-Tier Holdouts

Holding out T4 + T5 simultaneously yields recall $0.669$ and F1 $0.795$, while holding out T3 + T4 + T5 produces nearly identical results ($0.669$ recall; $0.797$ F1). The similarity between these two conditions indicates that T5 LLM bots drive most of the degradation and that T3 contributes relatively little additional difficulty when T5 is already excluded.

4.1.5. RL Generalization: Bot Family

Figure 11 breaks down detection accuracy by bot family, evaluating the standard advaug model on each family’s test sessions independently.

Semi-Auto Bot Family

PPO drops to $93.8 \pm 1.4\%$ on semi-automated bots, the lowest result of any algorithm–family pair in this evaluation. DG ($96.6 \pm 0.4\%$) and Soft PPO ($95.6 \pm 1.0\%$) handle this family noticeably better, suggesting that their update rules are better suited to the subtler temporal patterns that semi-automated bots produce.

Replay Bots

DG and Soft PPO both reach $98.0\%$ accuracy on replay bots, and PPO follows at $96.6\%$. Replay sessions repeat fixed behavioral traces, making them straightforward to identify from timing and movement regularity.

LLM Bots

All three algorithms score between $95.9\%$ and $96.7\%$ on LLM-powered sessions, a notably higher result than the disjoint tier evaluation (Section 4.1.4), which showed recall dropping to $0.528$ when LLM bots are completely excluded from training. This contrast confirms that LLM bots are not inherently undetectable, but they do require training exposure to be reliably caught.

PPO

Across stealth, LLM, and semi-auto, PPO scores 1–3 percentage points below DG and Soft PPO, while the gap closes on easier families like replay and trace-conditioned. This pattern is consistent with PPO’s faster entropy collapse during training, which may cause it to commit to simpler decision boundaries that miss edge cases in harder families.

All Algorithms

Despite the variation, every configuration exceeds $93\%$ accuracy on every family. Trace-conditioned bots, which use structured behavioral conditioning, are detected reliably by all algorithms ($97.5$–$97.7\%$), suggesting the revised reward structure and adversarial augmentation together produce robust coverage across bot family types.

4.1.6. RL Generalization: Unseen Human Users

Figure 12 evaluates whether the agent falsely blocks real humans whose behavioral profiles were never seen during training. For each person, we train a PPO agent with all of their sessions excluded and measure how often the agent correctly allows them through.

Correctness Rate

For Person A, the disjoint agent achieves a pass-through rate of $0.979$, a small drop from the baseline of $0.997$. For Person B, the disjoint agent passes $0.956$ of sessions, down from $1.000$. In both cases, the agent correctly allows the unseen human the vast majority of the time despite having no exposure to their behavioral style during training.

Person B

The drop for Person B ($0.044$) is larger than for Person A ($0.018$), and the per-seed variance is higher, suggesting Person B’s behavioral style is somewhat further from the training distribution. Even so, the agent passes $95.6\%$ of Person B’s sessions correctly, leaving a $4.4\%$ false-positive rate that remains operationally acceptable.

Generalization

The near-perfect pass-through rates under disjoint training confirm that the agent has learned general human behavioral features rather than memorizing individual users. This provides preliminary evidence that the agent will not overfit to specific behavioral profiles seen during training.

4.1.7. RL Ablation Study

Figure 13 reports the impact of removing or modifying individual design choices relative to the PPO+Advaug baseline (F1 $= 0.967$; accuracy $= 96.8\%$).

Reward Ablations

Removing the honeypot information bonus, doubling it, applying a stricter false-positive penalty, or removing the per-step continue cost all produce F1 scores within $\pm 0.006$ of the baseline. No single reward component is individually critical to classification performance, suggesting the overall reward structure is robust to moderate changes in any one parameter. That said, these components shape behavioral policy rather than just accuracy: the honeypot bonus in particular drives evidence-gathering behavior as shown in Section 4.1.3.

Smaller LSTM Capacity

LSTM-64 (half the hidden size of the baseline’s 128) achieves F1 $97.4\%$ and accuracy $97.5\%$, slightly above baseline. LSTM-256 (double size) also stays near baseline at $97.1\%$. This suggests the classification task does not require large temporal memory capacity, and the default hidden size of 128 is already more than sufficient for the sequence lengths seen in practice (∼17.8 windows per episode on average).

Deeper LSTM Layers

The two-layer LSTM drops slightly below baseline to F1 $96.3\%$ and accuracy $96.4\%$. Stacking a second LSTM layer adds parameters without adding useful representational capacity for this task and may introduce optimization difficulty on the small dataset.

Temporal Reasoning

The Single View ablation restricts the agent to a single randomly selected window of the session, dropping F1 to $94.5\%$ from the baseline of $96.7\%$, a gap of $2.2$ percentage points. Random selection is a stricter test than evaluating only the opening of a session, because the agent cannot rely on the easier early-stage signals (mechanical timing; unnatural movement regularity) that are concentrated in the first few windows for lower-tier bots. With only one arbitrarily positioned window to reason about, the agent loses the temporal context that lets it integrate evidence as a session unfolds. The gap widens further under harder conditions: the disjoint tier evaluation (Section 4.1.4) shows that LLM-powered bot recall drops to $0.528$ when those bots are unseen during training, precisely because their behavior is ambiguous on any individual window and only becomes distinguishable over time. The full temporal model therefore provides both a measurable accuracy advantage on the standard test mix and a robustness margin against more adaptive adversaries that the single-window agent does not have.

4.1.8. RL Sensitivity Analysis

Figure 14 and Figure 15 evaluate how robust the trained agents are to misspecified environment assumptions, sweeping individual parameters across plausible ranges while holding all others fixed.
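The sweep procedure (vary one parameter, hold the rest at their defaults) can be sketched as below. The `toy_evaluate` function is a stand-in for evaluating a trained agent and was written only so the example runs; its numbers are not results from the paper, and the parameter names and the default honeypot bonus of 0.5 are assumptions.

```python
def one_at_a_time_sweep(evaluate, defaults, grid):
    """Vary each parameter over its grid while all others stay at their defaults."""
    results = {}
    for name, values in grid.items():
        results[name] = [(v, evaluate({**defaults, name: v})) for v in values]
    return results

def toy_evaluate(params):
    """Toy accuracy model that depends only on the hard-puzzle bot pass rate."""
    return round(1.0 - 0.5 * params["hard_puzzle_bot_pass_rate"], 4)

# Hypothetical parameter names; grids follow the ranges quoted in this section.
defaults = {"honeypot_bonus": 0.5, "hard_puzzle_bot_pass_rate": 0.05}
grid = {
    "honeypot_bonus": [0.0, 0.5, 1.0],
    "hard_puzzle_bot_pass_rate": [0.01, 0.05, 0.15],
}
results = one_at_a_time_sweep(toy_evaluate, defaults, grid)
```

A flat curve for one parameter and a sloped curve for another, as in this toy output, is the pattern that separates insensitive from sensitive assumptions.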

Reward Parameters

Reward parameters show no meaningful sensitivity. Varying the honeypot information bonus from $0.0$ to $1.0$, the direct block reward from $0.25$ to $1.2$, the human block penalty from $-3.0$ to $-0.5$, and the missed bot penalty from $-2.0$ to $-0.25$ all produce flat accuracy curves across all three algorithms. No parameter produces a drop below $0.95$ at any tested value. This confirms the finding from the ablation study: the overall reward structure is what shapes agent behavior, and individual component values within reasonable ranges do not meaningfully affect classification performance.

Challenge-Outcome Assumptions

Challenge-outcome assumptions are similarly robust. Easy puzzle bot pass rate, hard puzzle human pass rate, the Tier 5 honeypot trigger rate, and the all-tier honeypot scale all produce flat accuracy curves with no meaningful degradation across their tested ranges.

Sensitive Parameters

Hard puzzle bot pass rate is the only sensitive parameter. As the assumed bot pass rate for hard puzzles increases from $0.01$ to $0.15$ (default $0.05$), accuracy drops from approximately $0.985$–$0.998$ down to $0.915$–$0.930$ across all three algorithms. This is the sharpest degradation in the entire sensitivity analysis. The mechanism is straightforward: if bots can pass hard puzzle challenges more easily than assumed during training, the agent’s puzzle-based detection strategy becomes less reliable, since a bot passing a hard puzzle no longer provides strong evidence of legitimacy. This finding identifies a concrete deployment risk. Sophisticated bots capable of solving hard CAPTCHAs at higher rates than the training assumption would degrade detection quality, and the hard puzzle pass rate assumption should be calibrated carefully against observed bot capabilities. In fact, the legacy reward structure may prove more useful against high-capability bots.
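The mechanism can be illustrated with Bayes' rule: the share of bots among sessions that pass a hard puzzle grows with the bot pass rate. This is a simplified sketch, not the environment's actual logic; the human pass rate of 0.9 and the balanced prior of 0.5 are assumptions chosen only for illustration.

```python
def p_bot_given_pass(bot_pass, human_pass=0.9, prior_bot=0.5):
    """Posterior probability that a session is a bot given that it passed."""
    numerator = bot_pass * prior_bot
    return numerator / (numerator + human_pass * (1.0 - prior_bot))

# Bot pass rates taken from the swept range (0.05 is the stated default).
posteriors = {q: round(p_bot_given_pass(q), 3) for q in (0.01, 0.05, 0.15)}
```

Under these assumptions, the fraction of passing sessions that are bots rises from about 1% at a pass rate of 0.01 to about 14% at 0.15, so a pass becomes markedly weaker evidence that the session is human.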

4.1.9. RL Baseline Comparison

Figure 16 compares all three RL agents against seven rule-based baselines on the held-out test split.

RL Agent Performance

All RL agents substantially outperform all rule-based baselines. The weakest RL agent (DG, $96.1\%$ accuracy) outperforms the strongest rule-based baseline (hard puzzle, $81.4\%$) by nearly 15 percentage points. This gap confirms that learned sequential decision-making captures behavioral patterns that no fixed decision rule can match.

Rule-Based Baselines

Always Block and HP+Block achieve $100\%$ recall by blocking everything but drop to ∼50% accuracy by blocking all humans too. Hard puzzle achieves the highest rule-based accuracy ($81.4\%$) and recall ($94.3\%$) but at the cost of precision ($74.8\%$), generating many false positives because real users also occasionally fail hard puzzles. HP+Decide flips this: it reaches $97.2\%$ precision (nearly matching RL) but only $59.4\%$ recall, correctly avoiding false positives but missing roughly 40% of bots. No rule-based policy resolves this tradeoff.

Precision–Recall Resolution

All three RL agents achieve perfect or near-perfect precision ($100\%$) while maintaining strong recall ($92.1$–$95.4\%$), a combination no rule-based system approaches. This is because the RL agent can adapt its decision strategy to each session’s specific behavioral evidence rather than applying a fixed rule uniformly.

Task Difficulty

The random baseline confirms the difficulty of this task. The random policy achieves only $68.1\%$ accuracy and $67.2\%$ F1, reflecting the approximately $68\%$ bot prevalence in the test split. Any policy that does better than this is genuinely learning to distinguish bot from human behavior.

4.2. XGBoost Classifier Results

To isolate the contribution of hyperparameter tuning and adversarial augmentation, we trained four configurations of the classifier on the same stratified 70/30 train/test partition. All four share the feature pipeline and label-smoothing setup described in Section 3.5.3 and are evaluated on identical held-out sessions; only Optuna tuning and the inclusion of HumanProfiler-augmented bot copies are toggled.

Model Configurations

xgb_v1: Optuna-tuned hyperparameters and adversarial augmentation.
xgb_v1_noaug: Optuna-tuned hyperparameters, no augmentation.
xgb_v2: Default ClassifierConfig hyperparameters with adversarial augmentation.
xgb_v2_noaug: Default hyperparameters, no augmentation (baseline).

Table 14 reports the headline metrics for all four configurations on the held-out test split at the default decision threshold $\tau = 0.5$.

Key Observations

1. Perfect separability across configurations. All models achieve an ROC-AUC of $1.000$, indicating complete separability between human and bot score distributions at some threshold. The 39-dimensional feature representation is sufficient to capture behavioral differences within the current dataset, while regularization (label smoothing, noise augmentation, class balancing, and $L_{1}$/$L_{2}$ penalties) prevents overfitting.
2. No false positives. No configuration misclassifies a human as a bot at $\tau = 0.5$. All errors are false negatives. From a deployment perspective, this is critical: legitimate users are never blocked, and missed bots can be handled downstream via other interventions. Performance differences are minimal: xgb_v1_noaug and xgb_v2 each miss one bot, while xgb_v1 and xgb_v2_noaug miss two.
3. Marginal gains from tuning and augmentation. Hyperparameter tuning and adversarial augmentation each provide slight improvements over the baseline, but their combination does not yield further gains. Optuna selected deeper trees (max_depth $= 5$ vs. 3), a moderate learning rate (≈0.099), and stronger $L_{2}$ regularization (≈5.37). However, differences across configurations amount to at most one test sample, suggesting that default settings are already well-aligned with the feature space.

Figure 17, Figure 18, Figure 19 and Figure 20 show evaluation plots for each configuration. Feature importance analysis indicates that predictive power is distributed across feature groups rather than dominated by a single signal.

Feature Insights

Mouse dynamics features—including average speed, jitter ratio, and direction-change ratio—consistently rank among the most informative predictors. This aligns with the intuition that human motor behavior exhibits irregular yet bounded patterns that are difficult for simple bots to replicate.

4.2.1. Classifier Generalization: Bot Family Disjoint Evaluation

The standard 70/30 test split contains every bot family the classifier saw during training, so its near-perfect headline numbers (Table 14) characterize a within-distribution performance ceiling rather than a guarantee of generalization to unseen bot behavior. As a stricter test, and to mirror the disjoint-tier protocol used for the RL agents in Section 4.1.4, we conduct a family-disjoint evaluation: for each of the ten bot families in our dataset (linear, tabber, speedrun, scripted, stealth, slow, erratic, semi-auto, trace-conditioned, and LLM), we retrain the classifier from scratch with every session of that family excluded (including any pre-generated humanized copies) and then evaluate exclusively on the held-out family. This isolates the question: how well does the model recognize bot behavior it has never encountered? Each family is retrained across five independent seeds (42–46), and we run two configurations to match Section 4.1.1: noaug (no pre-generated humanized copies) and advaug (pre-generated humanized copies of all non-held-out families included in training). For comparability, we use the default ClassifierConfig for both configurations.

Figure 21 reports detection accuracy on the held-out family for each configuration.

Most Families Generalize Cleanly

Seven of ten families reach perfect detection accuracy even when entirely withheld from training. Commodity bots (linear, tabber, and speedrun) and careful-automation bots (scripted, stealth, and slow) all exhibit behavioral signatures—near-constant cursor velocities, near-zero key-hold durations, and narrow spatial extents—that the classifier identifies through general learned features, not through memorization of family-specific patterns. Trace-conditioned bots, which replay recorded human trajectories with added noise, are also detected with perfect accuracy in the disjoint setting, suggesting that the structural artifacts they introduce (e.g., uniform sampling intervals on replayed traces) are reliably distinguishable from genuine human telemetry.

LLM Bots Are the Hardest Disjoint Case

The sharpest degradation is on the LLM family: detection drops from near perfect to $0.855 \pm 0.035$ (noaug) and $0.804 \pm 0.087$ (advaug). This mirrors the trend observed for the RL agents (Section 4.1.4), where LLM-powered bots are the dominant source of generalization error. However, the classifier degrades far less severely than the RL agents, whose recall collapses to $0.528$ on disjoint LLM bots (Section 4.1.4), indicating that the holistic, session-level feature aggregation in our 39-dimensional representation captures behavioral cues that survive the LLM family’s strong human-like mimicry. Semi-auto bots, which interleave scripted navigation with human-like checkout phases, are the second-hardest case ($0.950 \pm 0.021$ noaug/$0.933 \pm 0.000$ advaug). Their mixed behavioral signatures mean that no single feature group cleanly separates them from humans, so the classifier cannot rely on the same trivially separable artifacts (constant cursor velocity and near-zero key-hold durations) that make Tier 1 and Tier 2 families detectable without training exposure.

Adversarial Augmentation Slightly Hurts Disjoint LLM Detection

Counterintuitively, including HumanProfiler-augmented copies of the other families during training reduces detection on unseen LLM bots by roughly five percentage points ($0.855 \to 0.804$) and increases variance across seeds (std $0.035 \to 0.087$). A plausible explanation is that humanized copies of simpler bots pull the decision boundary toward the human distribution in regions that real LLM sessions also occupy, making it harder to flag truly human-like LLM behavior at test time. On the within-distribution test split (Table 14), this tradeoff does not appear because LLM sessions present in training anchor the boundary correctly. The implication for deployment is that augmentation is beneficial against known bot families but should be re-evaluated as new families emerge.

Test-Split Per-Family Slice

For completeness, we also examined a per-family slice of the standard 70/30 test split (no retraining) using all four saved models from Table 14. As expected from the near-saturated headline metrics, every family reaches $1.000$ detection accuracy except the LLM family, where two of the four configurations miss the same single bot session ($17/18 = 0.944$). This confirms that the within-distribution per-family breakdown carries no additional signal beyond the aggregate metrics already reported and that the disjoint protocol above is the more informative stress test.

4.3. XGBoost Classifier Versus RL Agent

Table 15 compares the three methods on the standard 70/30 test split. XGBoost leads or ties on every metric, reaching 99.5% accuracy with only one misclassified bot session and zero false positives. Soft PPO is the strongest RL agent at 97.7% accuracy and 0.976 F1, within 2 percentage points of XGBoost. All three methods achieve perfect precision, so the gap is entirely on the recall side: XGBoost recovers 99.2% of bot sessions while the RL agents recover 93.6% (PPO) and 95.4% (Soft PPO). On aggregate session-level classification, the hand-engineered feature ensemble does have a real edge, but as the next figure shows, that edge is not uniform across bot families.

Figure 22 breaks down detection rate by bot family for the three best-performing methods, evaluated on the standard test split. We pair RL advaug against XGBoost v2 default+aug so both methods see the same augmented training data and neither benefits from extra hyperparameter tuning. This keeps the comparison apples-to-apples on augmentation and out-of-the-box hyperparameters.

XGBoost Performance

XGBoost dominates on commodity families. For nine of the ten bot families (linear, tabber, speedrun, scripted, stealth, slow, erratic, semi-auto, and trace-conditioned), XGBoost achieves 100% detection, reflecting the strength of hand-engineered session-level features for distinguishing mechanically generated traces from human behavior. The RL agents trail by 3–14 percentage points on these families: PPO ranges from 0.86 (semi-auto) to 0.97 (erratic) and Soft PPO from 0.94 (stealth; tabber) to 0.97 (scripted; slow).

RL Performance

RL outperforms XGBoost on LLM-driven bots. The single family where the RL agents take the lead is LLM (Tier 5), the most sophisticated adversary in the dataset. XGBoost detects only 94.4% of LLM sessions, while RL PPO and RL Soft PPO both reach 95%. The gap is small in absolute terms, but its direction is notable: LLM-driven bots are designed to mimic human browsing patterns, which defeats the static feature signatures XGBoost relies on. The RL agents, in contrast, observe the session as a sequence of windowed events and can issue tiered challenges or honeypots to actively probe ambiguous sessions rather than commit to a binary label up front. This is the strongest evidence in the evaluation that the RL framework adapts better than a static classifier when faced with novel, human-imitating adversaries, which is exactly the threat the system is being built to defend against.

PPO’s Weakness

PPO drops to $0.86 \pm 0.07$ on semi-automated bots, by far the worst RL result. Soft PPO (0.96) handles this family much more reliably, suggesting that the adaptive entropy schedule in Soft PPO is better matched to the partially human behavioral signature that semi-automated bots produce, where a fixed exploration coefficient leaves PPO under-committed.
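To make the contrast between a fixed and an adaptive entropy coefficient concrete, the sketch below adjusts the coefficient toward a target policy entropy, in the style of the temperature update used in Soft Actor-Critic [44]. It illustrates the general idea only: the update rule, learning rate, and target entropy are assumptions and are not necessarily the Soft PPO schedule implemented in the released code base.

```python
import math


def update_entropy_coef(log_coef, policy_entropy, target_entropy, lr=0.05):
    """One gradient step on the loss log_coef * (policy_entropy - target_entropy).

    The coefficient grows while the policy entropy is below the target and
    shrinks once the policy explores more than the target requires.
    """
    return log_coef - lr * (policy_entropy - target_entropy)


fixed_coef = 0.01  # a fixed coefficient never reacts to the policy's entropy

log_coef = math.log(fixed_coef)
for entropy in [0.2, 0.2, 0.2]:  # hypothetical entropies below the target
    log_coef = update_entropy_coef(log_coef, entropy, target_entropy=0.8)
adaptive_coef = math.exp(log_coef)

print(adaptive_coef > fixed_coef)  # True
```

Under these assumptions, a policy that collapses toward a single action early in training receives a growing entropy bonus, which is one way an adaptive schedule can resist premature commitment.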

5. Discussion

5.1. Reinforcement Learning

The contribution of the RL framework is not classification accuracy. Across all three algorithms and both augmentation conditions, accuracy does not separate the configurations, and every advaug configuration converges to similar precision and F1 scores. What the policies provide that a static classifier does not is action selection: the choice of when to deploy a honeypot, which puzzle difficulty to issue, and when to allow a borderline session through without friction. These are policy decisions a deployed CAPTCHA system must make regardless of how strong its classifier is, and the framework makes them jointly with the detection decision rather than through hand-tuned post hoc rules. The honeypot deployment rates above 0.79 and the migration from direct block to hard puzzle under the revised reward both confirm that the agents actively exploit the full action space, not just the terminal verdict.

The choice of algorithm matters less than the reward shaping. Once the reward function penalizes blocking a human more heavily than missing a bot, every algorithm we tested learns the same caution-first behavior. The differences that remain between PPO, DG, and Soft PPO are in stability and seed sensitivity rather than peak accuracy. Soft PPO’s advantage is best characterized as a stability advantage: its adaptive entropy schedule resists premature commitment, producing more uniform per-family behavior and tighter cross-seed variance. The practical implication for follow-up work is that the architectural decision is not “which algorithm” but “how aggressively exploration should be constrained as training progresses.” Reporting five training seeds per configuration was essential to seeing this: single-seed runs would have masked DG’s seed-dependent instability and overstated its average behavior.
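The effect of an asymmetric penalty can be illustrated with a small expected-reward calculation. In the sketch below, the reward magnitudes are hypothetical; only the ordering (blocking a human costs more than missing a bot) is taken from the text, and the two-action verdict is a simplification of the full action space.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"


# Hypothetical magnitudes: (verdict, session_is_human) -> reward.
REWARDS = {
    (Verdict.ALLOW, True): 1.0,    # human allowed
    (Verdict.BLOCK, False): 1.0,   # bot blocked
    (Verdict.ALLOW, False): -1.0,  # bot missed (false negative)
    (Verdict.BLOCK, True): -3.0,   # human blocked (false positive)
}


def expected_reward(verdict: Verdict, p_human: float) -> float:
    """Expected terminal reward given the belief that the session is human."""
    return (p_human * REWARDS[(verdict, True)]
            + (1.0 - p_human) * REWARDS[(verdict, False)])


# At maximal uncertainty, allowing is preferred over blocking.
print(expected_reward(Verdict.ALLOW, 0.5), expected_reward(Verdict.BLOCK, 0.5))
# 0.0 -1.0
```

With these illustrative values, blocking only becomes the better choice once the believed probability of a human falls below one third, which is the caution-first behavior described above.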

Mid-session intervention is built into the framework. The agent already exposes a non-terminal action (i.e., honeypot deployment) that fires during the session and gathers behavioral evidence without revealing detection. The honeypot is one realization of a generic mid-session intervention slot in the action space; the same architectural mechanism supports any non-blocking response (rate limiting, soft micro-challenges, silent re-authentication, and server-side request throttling) that an operator wants to attach. Swapping the honeypot for one of these alternatives is an implementation detail, not a redesign of the MDP. What is true by design is that the binary block decision is reserved for the closing window, and we view that as a property of the framework rather than a defect. A system that issues hard blocks mid-session, before observing enough behavioral evidence to be confident, is the design that produces false positives on legitimate users with unusual interaction styles. The agents in this work are explicitly trained to gather evidence first (continue; honeypot) and commit second, which is the inverse of the failure mode a premature blocker would have.
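One way to realize this separation between evidence-gathering and committing is an action mask that depends on the session phase. The following is a minimal sketch under stated assumptions: the action names, the number of puzzle tiers, and the function name are illustrative and may differ from the released environment.

```python
from enum import IntEnum


class Action(IntEnum):
    CONTINUE = 0       # keep observing
    HONEYPOT = 1       # non-terminal mid-session intervention
    ALLOW = 2
    EASY_PUZZLE = 3    # puzzle tiers are assumed for illustration
    MEDIUM_PUZZLE = 4
    HARD_PUZZLE = 5
    BLOCK = 6


OBSERVATION_ACTIONS = {Action.CONTINUE, Action.HONEYPOT}


def action_mask(is_closing_window: bool) -> list[bool]:
    """True marks an action the policy may select in the current window."""
    if is_closing_window:
        return [action not in OBSERVATION_ACTIONS for action in Action]
    return [action in OBSERVATION_ACTIONS for action in Action]


print(action_mask(is_closing_window=False))
# [True, True, False, False, False, False, False]
```

Replacing the honeypot with another non-blocking response would, in this sketch, only change what the HONEYPOT slot triggers, not the mask or the rest of the action space.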

Generalization in the framework is bounded by training exposure. The held-out tier experiments show that the agents generalize well across familiar adversaries but degrade sharply when an entire bot tier is excluded from training, with the largest drop on held-out LLM-driven bots. This is a structural property of policy methods: the same caution-on-uncertainty behavior that produces perfect precision on familiar distributions defaults to allowing unfamiliar bots through. Training-set composition is therefore a deployment constraint, not a one-time choice. A system put into production would need ongoing exposure to evolving adversary distributions to maintain its detection rate. The flip side of this result is what the user-disjoint experiments show: when held-out humans are presented to an agent that has not seen them during training, the agent still passes them through at high rates. The framework’s failure mode is therefore asymmetric. It under-detects unfamiliar bots rather than over-blocking unfamiliar humans, which is the correct asymmetry for a UX-sensitive deployment.
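The held-out tier protocol can be sketched as a split in which one bot tier never appears in training and is used only for testing. The session representation below (a dictionary with `label` and `tier` keys) and the toy session counts are hypothetical and serve only to illustrate the split.

```python
def split_by_held_out_tier(sessions, held_out_tier):
    """Train on every session outside the held-out bot tier; test on that tier."""
    train = [s for s in sessions if s["tier"] != held_out_tier]
    test = [s for s in sessions if s["tier"] == held_out_tier]
    return train, test


# Toy data: humans carry tier None; tier 5 stands for the LLM-driven bots.
sessions = (
    [{"label": "human", "tier": None}] * 4
    + [{"label": "bot", "tier": 1}] * 3
    + [{"label": "bot", "tier": 5}] * 2
)
train, test = split_by_held_out_tier(sessions, held_out_tier=5)
print(len(train), len(test))  # 7 2
```

Recall on the test portion of such a split measures how the policy treats a tier it has never encountered, which is the quantity that degrades in the held-out experiments.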

5.2. XGBoost

Across four configurations spanning tuned and untuned hyperparameters and the presence or absence of adversarial augmentation, the XGBoost classifier reaches an ROC-AUC of 1.000 on the held-out test split.

All four configurations achieve near-perfect classification at the default decision threshold ($\tau = 0.5$): xgb_v1_noaug and xgb_v2 each misclassify a single bot, while xgb_v1 and xgb_v2_noaug each misclassify two. Crucially, no configuration blocks a legitimate user: every error across all four models is a false negative, corresponding to a bot whose human-likelihood score falls above the threshold.
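The relationship between the human-likelihood score, the threshold, and the two error types can be written out as follows. This is a minimal sketch that treats "bot" as the positive class; the function names and the choice of an inclusive comparison at the threshold are assumptions.

```python
def classify(human_score: float, tau: float = 0.5) -> str:
    """Label a session from its human-likelihood score."""
    return "human" if human_score >= tau else "bot"


def error_type(true_label: str, human_score: float, tau: float = 0.5):
    """Return None for a correct decision, else the kind of error."""
    predicted = classify(human_score, tau)
    if predicted == true_label:
        return None
    # A bot scored as human is a miss; a human scored as bot is a wrongful block.
    return "false_negative" if true_label == "bot" else "false_positive"


print(error_type("bot", 0.7))    # false_negative
print(error_type("human", 0.3))  # false_positive
```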

Across all four configurations, the top-ranked feature comes from the mouse dynamics group—average speed, jitter ratio, or direction-change ratio—confirming that motor behavior is the most discriminative channel for bot detection in the classifier.

The adversarial augmentation pipeline (HumanProfiler) was designed to push the classifier away from trivially separable artifacts, such as Selenium’s $\sim 1$ ms key-hold durations, toward deeper behavioral signals that generalize across bot tiers. In practice, however, its effect on the current dataset is mixed: augmentation helps the default-hyperparameter model (xgb_v2: 2 FN → 1 FN) but hurts the tuned model (xgb_v1: 1 FN → 2 FN), suggesting that Optuna’s regularization choices already provide the protection augmentation is meant to supply.

Beyond the in-distribution test split, the family-disjoint evaluation (Section 4.2.1) provides a stricter view of generalization. Seven of ten bot families reach perfect detection accuracy even when entirely withheld from training, suggesting that the classifier’s decision boundary captures general behavioral signatures rather than family-specific patterns. The exception is the LLM family, where disjoint detection drops to $0.855 \pm 0.035$ without augmentation and, counterintuitively, falls further to $0.804 \pm 0.087$ when augmentation is enabled.

A plausible explanation is that humanized copies of simpler bots pull the decision boundary toward the human distribution in regions that real LLM sessions also occupy, making genuinely human-like LLM behavior harder to flag at test time. This inverts the usual interpretation of augmentation as universally beneficial and implies that augmentation strategies should be re-validated whenever a new bot family emerges.

All four configurations share the same zero-false-positive outcome on the in-distribution test split, so the observed differences there amount to at most one test sample; the family-disjoint results above provide the more informative picture of how the classifier behaves under a genuine distribution shift.

5.3. XGBoost Versus RL Agents

The two models are answering different questions. XGBoost answers, “Given the complete session, is this a bot?” and the RL agents answer, “As events arrive, what should the system do?” A deployed CAPTCHA system must answer the second question regardless of how strong its classifier is, because every threshold, every challenge-difficulty selection, and every honeypot deployment is a policy decision. The aggregate accuracy gap is therefore best read as the cost of folding the policy decision into the same optimization as the detection decision rather than as evidence that the RL approach is worse at classification. A pure classifier still requires a downstream policy layer; the framework collapses both into a single learned mechanism whose tradeoffs are encoded explicitly in the reward function rather than implicitly in hand-tuned thresholds. The evaluation protocol was matched where it could be (same advaug condition, same default hyperparameters, and same per-family breakdown for the comparison), so the gap reflects an architectural difference rather than a methodological one. Where the protocols differ (RL uses a 70/15/15 split with multi-seed averaging across a stochastic policy; XGBoost uses a 70/30 split with a single deterministic model), the difference is structural to the two method families rather than a choice we made.

XGBoost outperforms the strongest RL configuration on aggregate classification, but the gap matters less than where it appears. On the standard test split, XGBoost reaches 0.995 accuracy and 0.992 recall while Soft PPO+Advaug reaches 0.977 accuracy and 0.954 recall; both methods achieve perfect precision, so the gap is concentrated entirely on the recall side rather than on misclassified humans. Per-family, XGBoost detects 100% of nine of the ten bot families (linear, tabber, speedrun, scripted, stealth, slow, erratic, semi-auto, and trace-conditioned), reflecting the strength of hand-engineered behavioral features at separating mechanically generated traces from human behavior. As a session-level binary classifier on this dataset, XGBoost is straightforwardly the stronger method, and the RL framework’s classification contribution should not be overstated.

The one per-family result where the RL agents take the lead is the family the comparison cares about most. XGBoost detects 94.4% of LLM-driven sessions while RL PPO and Soft PPO both reach 95%. This is a small absolute gap, but a meaningful directional one, because LLM-driven bots are explicitly designed to defeat the kind of static behavioral signatures XGBoost depends on. The RL agents observe each session as an unfolding sequence and can actively probe ambiguous sessions through tiered challenges and honeypots rather than committing to a verdict from a single feature vector. The framework outperforms specifically on the family designed to be hardest for hand-engineered features, which is evidence in this evaluation that policy-based methods provide headroom against adaptive adversaries.

The natural production architecture can combine both methods rather than forcing a choice between them. A feature-based classifier such as XGBoost supplies a robust, high-recall per-session likelihood estimate that can feed the policy as an additional observation feature, while the policy retains responsibility for action selection (challenge difficulty, honeypot deployment, and allow/block timing). Under this framing, the classification gap on the standard split is the operational cost of having a learnable policy at all, and the LLM-family result indicates that this cost buys real properties (explicit reward shaping, uniform caution-first behavior, and auditable puzzle-based detections) that a classifier-plus-threshold pipeline does not. Whether to combine the two or deploy one alone is a deployment choice, not a structural requirement of the framework.
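A minimal sketch of this combination appends the classifier's score to the window-level observation before it reaches the policy. The function name is hypothetical, and the sketch assumes that a human-likelihood score can be computed from the events seen so far in the session, which the session-level classifier evaluated here was not designed to do.

```python
WINDOW_FEATURE_DIM = 26  # window-level feature count from Table A1


def augment_observation(window_features, human_score):
    """Append a classifier score in [0, 1] to a 26-dimensional window vector."""
    if len(window_features) != WINDOW_FEATURE_DIM:
        raise ValueError(f"expected {WINDOW_FEATURE_DIM} window features")
    if not 0.0 <= human_score <= 1.0:
        raise ValueError("human_score must lie in [0, 1]")
    return list(window_features) + [human_score]


observation = augment_observation([0.0] * WINDOW_FEATURE_DIM, human_score=0.92)
print(len(observation))  # 27
```

In this arrangement the policy remains responsible for action selection and treats the score as one more input rather than as a verdict.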

5.4. Limitations

There are several limitations to the proposed system that should be acknowledged.

5.4.1. Data Collection

The dataset currently used was collected from our sandbox web application with a fixed page layout and user flow, meaning the learned policies may not transfer directly to sites with different interaction patterns or form structures. Due to this, implementations of the proposed system would require a large initial dataset for training before reaching adequate bot detection performance in some environments.

An additional potential limitation of the proposed system is the volume of data collected. Currently, the system collects a significant amount of information to properly detect bots. While this implementation is simple to apply to small-scale projects such as the application used for this project, the amount of data for each individual user session is substantial, and the collection of mouse positions every 15 ms specifically adds a significant burden to the data collection system. In some applications, a significant user count may create a volume of telemetry data that overloads the processing systems. This may create damaging slowdowns for services relying on the detection system if processing resources get overloaded. Consequently, large-scale implementations may have to investigate adjustments to the data collection model to get acceptable performance from the system. Consideration should be made for reducing processing costs through reductions in the amount of data collected, compressing the collected data, improving the processing pipeline, or some combination of these methods. Additional solutions, such as those found in [18,19], demonstrate data collection systems that are designed with low-cost collection in mind.
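A back-of-the-envelope estimate illustrates how the 15 ms sampling interval scales. The payload size and number of concurrent sessions below are hypothetical, and the result is an upper bound that assumes every session moves the mouse continuously.

```python
def mouse_samples_per_second(interval_ms: float = 15.0) -> float:
    """Sampling rate implied by a fixed collection interval."""
    return 1000.0 / interval_ms


def telemetry_bytes_per_second(concurrent_sessions: int,
                               bytes_per_sample: int,
                               interval_ms: float = 15.0) -> float:
    """Upper-bound ingest rate if every session streams mouse samples nonstop."""
    return concurrent_sessions * bytes_per_sample * mouse_samples_per_second(interval_ms)


print(round(mouse_samples_per_second(), 1))  # 66.7
# Hypothetical load: 10,000 concurrent sessions, 32 bytes per sample.
print(round(telemetry_bytes_per_second(10_000, 32) / 1e6, 1))  # 21.3 (MB/s)
```

Doubling the sampling interval halves this rate, which is why reducing the collection frequency is among the first adjustments to consider at scale.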

5.4.2. Accessibility and Fairness

The proposed system may additionally raise concerns regarding accessibility and fairness. The human session data used in this study consists entirely of desktop browser interactions from a small group of university participants without impairments or assistive-technology use. Therefore, the dataset does not capture the behavioral diversity of mobile users, tablet users, accessibility tool users, or users with varying levels of technical familiarity. As a result, the system may not fully capture the behavioral patterns of users who navigate websites using different devices or assistive technologies such as screen readers, voice controls, or eye-tracking systems. Users with tremors, cognitive disabilities, motor impairments, or unusual telemetry behavior may also produce events that do not match the patterns in the training data.

This may lead to situations where legitimate users are incorrectly labeled as bots, creating access barriers for some groups. Commercial implementation of the proposed system should evaluate performance across diverse accessibility contexts, including false-positive rates for users with impairments or assistive technologies. The proposed system should not be used as the sole basis for denying access to a website or service without appropriate safeguards that account for accessibility and fairness for users.

5.4.3. Privacy and Consent

We recognize that the proposed data collection system and its passive nature raise privacy and consent considerations. Conventional key logging systems can expose sensitive user information, bringing significant privacy risks. To avoid this, the system reduces risks through data minimization by recording only the event types and timing or interaction features needed for the bot detection model. The proposed data collection system is limited to mouse movements, clicks, scrolls, and non-content keyboard events such as Backspace, Tab, and Delete. The system does not record typed characters, form contents, clipboard data, device information, or website data, such as purchase details or account information. While the system avoids collecting personally identifying data, the tracked data remains privacy-relevant behavioral information because it is collected passively in the background and may be retained or used for future bot detection training. Therefore, implementations of the proposed system should follow best practices regarding user privacy, security, consent, retention, and deletion, in accordance with relevant legal and institutional requirements.
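The data-minimization rule described above can be sketched as a filter applied to each raw event before storage. The event field names (`type`, `t`, `x`, `y`, `dy`, `key`) and the set of event types are assumptions for illustration, as is the choice to retain timing for content keys while discarding which key was pressed.

```python
ALLOWED_EVENT_TYPES = {"mousemove", "click", "scroll", "keydown", "keyup"}
NON_CONTENT_KEYS = {"Backspace", "Tab", "Delete"}


def minimize_event(raw: dict):
    """Return a reduced copy of the event, or None if its type is not collected."""
    event_type = raw.get("type")
    if event_type not in ALLOWED_EVENT_TYPES:
        return None
    reduced = {"type": event_type, "t": raw["t"]}
    if event_type in {"mousemove", "click"}:
        reduced["x"], reduced["y"] = raw["x"], raw["y"]
    elif event_type == "scroll":
        reduced["dy"] = raw["dy"]
    else:
        # Keep the key name only for non-content keys; typed characters are dropped.
        key = raw.get("key")
        reduced["key"] = key if key in NON_CONTENT_KEYS else None
    return reduced


print(minimize_event({"type": "keydown", "t": 120, "key": "a"}))
# {'type': 'keydown', 't': 120, 'key': None}
```

Any field not explicitly copied, such as form values or clipboard contents attached to the raw event, never reaches storage in this sketch.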

6. Conclusions

This paper presented an adaptive CAPTCHA defense framework that formulates bot detection as a sequential decision-making problem. Unlike traditional CAPTCHA systems that rely on fixed challenges or proprietary risk scores, the proposed framework uses reinforcement learning to observe user behavior over time, gather additional evidence through honeypot deployment, and select an appropriate terminal action, such as allowing the session, issuing a graded CAPTCHA challenge, or blocking the user. By modeling the task as a partially observable Markov decision process, the system makes decisions from streamed behavioral telemetry, including mouse movements, clicks, keystrokes, and scrolling patterns.

Our experimental results support the proposed hypotheses. Specifically, the results demonstrate that reinforcement learning can support effective and low-friction bot detection in a sandbox ticket-purchasing environment. Among the evaluated RL variants, Soft PPO with adversarial augmentation achieved the strongest overall performance, reaching 97.7% accuracy, 100% precision, and a 97.6% F1 score. These results support the first hypothesis that temporal behavioral telemetry contains sufficient sequential and interaction-level patterns for a reinforcement learning agent to distinguish between human and automated actors. Furthermore, the revised reward structure encouraged evidence-based interventions by shifting the agent away from opaque direct blocking and toward honeypot-assisted observation and graded challenge deployment. This supports the second hypothesis, which states that a multi-action adaptive response policy with progressive intervention mechanisms, including honeypots and graded CAPTCHA escalation, can preserve strong bot detection performance while reducing unnecessary friction for legitimate users.

In addition to the RL agent, we evaluated an XGBoost classifier as a supervised session-level benchmark. The classifier achieved near-perfect performance on the held-out test split, demonstrating that behavioral telemetry contains strong signals for distinguishing human users from automated agents. However, the RL framework provides a complementary advantage by learning not only whether a session is likely human or automated but also what intervention should be taken during the session. This distinction is especially important for adaptive defense, where the system must respond to uncertain or evolving adversarial behavior rather than simply output a binary classification.

Overall, this work shows that reinforcement learning is a promising direction for designing adaptive, transparent, and behavior-based CAPTCHA defenses. The proposed framework is not intended to replace mature commercial CAPTCHA systems in their current form but rather to provide an open and extensible research environment for studying sequential bot-mitigation policies. Future work should evaluate the system across larger and more diverse user populations, stronger adversarial bots, real-world deployment settings, and privacy-preserving telemetry collection methods.

7. Future Work

Future work should explore several directions to strengthen and extend this system. First, developing an RL-based attacker agent that learns evasion strategies against the defender would enable adversarial co-training, where both the attacker and defender continuously improve against each other in a minimax loop. This would produce more robust detection policies than training against a fixed set of bot behaviors. Second, while our current dataset of 439 bot sessions across five tiers and 204 human sessions is sufficient for initial validation, expanding the dataset to include a wider range of real-world user behavior would improve generalization. In particular, users who rely on assistive technologies such as screen readers or switch inputs may produce behavioral telemetry that resembles bot activity, and the current system has not been evaluated against such edge cases. Ensuring low false-positive rates for accessibility users is critical for real-world deployment. Third, LLM-powered bots remain a partially open challenge: although the agents detect them at ∼95% when LLM sessions are present in training, recall collapses to 0.528 when LLM bots are entirely excluded, indicating that the policy does not generalize to LLM behavior it has never seen. Research into richer feature representations, such as sub-window micro-patterns or cross-page behavioral continuity, may capture subtler differences that survive this generalization gap. Fourth, integrating the RL agent with browser-level signals beyond DOM telemetry, such as GPU rendering timing or network request fingerprints, could narrow the remaining detection gap for sophisticated adversaries. Finally, while the current data collection system performs well in our sandbox application, improving its throughput would support deployment in larger-scale environments.

Author Contributions

Conceptualization, M.I., E.N., J.R. and M.T.; Methodology, M.I., E.N., J.R. and M.T.; Software, M.I., E.N., J.R. and M.T.; Formal Analysis, M.I., J.R. and M.T.; Investigation, M.I., E.N., J.R. and M.T.; Data Curation, M.I., E.N., J.R. and M.T.; Writing—Original Draft Preparation, M.I., E.N., J.R. and M.T.; Visualization, M.I., E.N., J.R. and M.T.; Supervision, Y.P.; Project Administration, E.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original dataset used in this study is openly available in Zenodo at https://doi.org/10.5281/zenodo.20391172 (accessed on 21 May 2026). The code base presented in the study is openly available on GitHub at adaptive-captcha-defense: https://github.com/meghanai28/adaptive-captcha-defense (accessed on 21 May 2026).

Acknowledgments

Use of AI-Assisted Tools: During the preparation of this manuscript, the authors used Claude Sonnet 4.6 [49] and Claude Opus 4.6 (Anthropic) [50] for manuscript organization, improving readability, and refining written content. The authors have reviewed and edited the suggestions and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. RL Window-Level Feature Definitions

Table A1. RL Window-Level Feature Groups (26 dimensions).

Group	Dim	Features
Event composition	4	mouse/click/key/scroll ratios
Mouse dynamics	4	mean speed, speed variance, mean acceleration, path curvature
Timing features	3	mean/variance/min inter-event $Δ t$ (log1p normalized)
Click timing	2	mean/variance inter-click interval
Keystroke dynamics	4	mean/variance hold duration, mean/variance inter-key interval
Spatial & contextual	9	scroll magnitude, scroll direction changes, unique cursor positions, spatial $x / y$ range, interactive click ratio, window duration, normalized event count
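As an illustration of how two of these feature groups might be computed from one window of events, the sketch below derives the four event-composition ratios and the three timing features. The event representation (`kind`, `t_ms`) and function names are hypothetical, and applying log1p to each inter-event interval before aggregation is an assumption about what "log1p normalized" means here.

```python
import math
from statistics import fmean, pvariance

EVENT_KINDS = ("mouse", "click", "key", "scroll")


def event_composition(events):
    """Share of each event kind in the window (4 dimensions)."""
    if not events:
        return [0.0] * len(EVENT_KINDS)
    return [sum(e["kind"] == kind for e in events) / len(events)
            for kind in EVENT_KINDS]


def timing_features(events):
    """Mean, variance, and minimum of log1p inter-event intervals (3 dimensions)."""
    timestamps = sorted(e["t_ms"] for e in events)
    deltas = [math.log1p(b - a) for a, b in zip(timestamps, timestamps[1:])]
    if not deltas:
        return [0.0, 0.0, 0.0]
    return [fmean(deltas), pvariance(deltas), min(deltas)]


window = [
    {"kind": "mouse", "t_ms": 0},
    {"kind": "mouse", "t_ms": 15},
    {"kind": "click", "t_ms": 30},
    {"kind": "key", "t_ms": 45},
]
print(event_composition(window))  # [0.5, 0.25, 0.25, 0.0]
```

Perfectly regular intervals, as in this toy window, yield zero timing variance, which is the kind of mechanical regularity these features are intended to expose.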

References

Geer, D. Malicious bots threaten network security. Computer 2005, 38, 18–20. [Google Scholar] [CrossRef]
Von Ahn, L.; Blum, M.; Hopper, N.J.; Langford, J. CAPTCHA: Using hard AI problems for security. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques; Springer: Berlin/Heidelberg, Germany, 2003; pp. 294–311. [Google Scholar] [CrossRef]
Dinh, N.T.; Hoang, V.T. Recent advances of Captcha security analysis: A short literature review. Procedia Comput. Sci. 2023, 218, 2550–2562. [Google Scholar] [CrossRef]
Tang, M.; Gao, H.; Zhang, Y.; Liu, Y.; Zhang, P.; Wang, P. Research on deep learning techniques in breaking text-based captchas and designing image-based captcha. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2522–2537. [Google Scholar] [CrossRef]
Bock, K.; Hughey, D.; Bhargava, V.; Hoy, N.; Bhatt, D.; Suba, T.; Levin, D. unCaptcha: A Low-Resource Defeat of reCaptcha’s Audio Challenge. In Proceedings of the 11th USENIX Workshop on Offensive Technologies (WOOT 17); USENIX Association: Berkeley, CA, USA, 2017. [Google Scholar]
Xu, Y.; Reynaga, G.; Chiasson, S.; Frahm, J.M.; Monrose, F.; Van Oorschot, P. Security and Usability Challenges of Moving-Object CAPTCHAs: Decoding Codewords in Motion. In Proceedings of the 21st USENIX Security Symposium (USENIX Security 12), Bellevue, WA, USA, 8–10 August 2012; pp. 49–64. [Google Scholar]
Singh, V.P.; Pal, P. Survey of different types of CAPTCHA. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 2242–2245. [Google Scholar]
Sakthivel, M.; Naveenkumar, E.; Mukilarasan, S.; Boopathi, A.; Akila, V.; Rithanya, A. CAPTCHA in the Age of Automation: Enhancing Security and Ensuring User Convenience. In Proceedings of the 2025 5th International Conference on Pervasive Computing and Social Networking (ICPCSN), Salem, India, 14–16 May 2025; pp. 1208–1214. [Google Scholar] [CrossRef]
ALTCHA. ALTCHA: Open-Source CAPTCHA Alternative. 2025. Available online: https://altcha.org (accessed on 28 February 2025).
Zobal, L.; Kolář, D.; Fujdiak, R. Current State of Honeypots and Deception Strategies in Cybersecurity. In Proceedings of the 2019 11th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), Dublin, Ireland, 28–30 October 2019; pp. 1–9. [Google Scholar] [CrossRef]
Searles, A.; Nakatsuka, Y.; Ozturk, E.; Paverd, A.; Tsudik, G.; Enkoji, A. An Empirical Study & Evaluation of Modern CAPTCHAs. In Proceedings of the 32nd Usenix Security Symposium (Usenix Security 23); USENIX Association: Berkeley, CA, USA, 2023; pp. 3081–3097. [Google Scholar]
Yan, J.; El Ahmad, A.S. Usability of CAPTCHAs or usability issues in CAPTCHA design. In Proceedings of the 4th Symposium on Usable Privacy and Security; Association for Computing Machinery: New York, NY, USA, 2008; pp. 44–52. [Google Scholar] [CrossRef]
Sukhani, K.; Sawant, S.; Maniar, S.; Pawar, R. Automating the bypass of image-based CAPTCHA and assessing security. In Proceedings of the 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT); IEEE: Piscataway, NJ, USA, 2021; pp. 1–8. [Google Scholar] [CrossRef]
Sivakorn, S.; Polakis, I.; Keromytis, A.D. I am robot:(deep) learning to break semantic image captchas. In Proceedings of the 2016 IEEE European Symposium on Security and Privacy (EuroS&P); IEEE: Piscataway, NJ, USA, 2016; pp. 388–403. [Google Scholar] [CrossRef]
Plesner, A.; Vontobel, T.; Wattenhofer, R. Breaking reCAPTCHAv2. In Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC); IEEE: Piscataway, NJ, USA, 2024; pp. 1047–1056. [Google Scholar] [CrossRef]
Akrout, I.; Feriani, A.; Akrout, M. Hacking Google reCAPTCHA v3 using Reinforcement Learning. Presented at the 4th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2019), Montréal, QC, Canada, 7–10 July 2019. [Google Scholar]
Acien, A.; Morales, A.; Fierrez, J.; Vera-Rodriguez, R. BeCAPTCHA-Mouse: Synthetic mouse trajectories and improved bot detection. Pattern Recognit. 2022, 127, 108643. [Google Scholar] [CrossRef]
Niu, H.; Wei, A.; Song, Y.; Cai, Z. Exploring visual representations of computer mouse movements for bot detection using deep learning approaches. Expert Syst. Appl. 2023, 229, 120225. [Google Scholar] [CrossRef]
Gianvecchio, S.; Wu, Z.; Xie, M.; Wang, H. Battle of Botcraft: Fighting bots in online games with human observational proofs. In Proceedings of the 16th ACM Conference on Computer and Communications Security, Chicago, IL, USA, 9–13 November 2009; CCS ’09. pp. 256–268. [Google Scholar] [CrossRef]
Nguyen, T.T.; Reddi, V.J. Deep reinforcement learning for Cyber Security. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 3779–3795. [Google Scholar] [CrossRef] [PubMed]
S, S.A.; Venkateshan, P.; Suresh, P.A. Emulating Human-Like Mouse Movement Using Bezier Curves and Behavioural Models for Advanced Web Automation. Int. J. Innov. Res. Technol. 2025, 12, 1364–1370. [Google Scholar] [CrossRef]
Anthropic. Claude 3.5 Sonnet. 2025. Available online: https://www.anthropic.com/news/claude-3-5-sonnet (accessed on 27 February 2026).
OpenAI. GPT-4o Technical Overview. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 21 May 2026).
Google. Gemini: A Family of Highly Capable Multimodal Models. 2024. Available online: https://deepmind.google/technologies/gemini/ (accessed on 31 March 2026).
Browser-Use Contributors. Browser-Use. 2026. Available online: https://github.com/browser-use/browser-use (accessed on 7 April 2026).
Google. Chrome DevTools Protocol Documentation. 2026. Available online: https://chromedevtools.github.io/devtools-protocol/ (accessed on 7 April 2026).
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar] [CrossRef]
Selenium Project. Selenium—Web Browser Automation. 2024. Available online: https://www.selenium.dev/ (accessed on 7 April 2026).
Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; Wiley Series in Probability and Statistics; John Wiley & Sons: New York, NY, USA, 1994. [Google Scholar]
Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.I.; Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
Williams, R.J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef]
Konda, V.R.; Tsitsiklis, J.N. Actor-Critic Algorithms. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Solla, S., Leen, T., Müller, K., Eds.; MIT Press: Cambridge, MA, USA, 1999; Volume 12. [Google Scholar]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. Planning and Acting in Partially Observable Stochastic Domains. Artif. Intell. 1998, 101, 99–134. [Google Scholar] [CrossRef]
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
Harris, F.J. On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform. Proc. IEEE 1978, 66, 51–83. [Google Scholar] [CrossRef]
Huang, S.; Ontañón, S. A Closer Look at Invalid Action Masking in Policy Gradient Algorithms. In Proceedings of the International FLAIRS Conference, Sandestin Beach, FL, USA, 19–21 May 2022. [Google Scholar] [CrossRef]
Hausknecht, M.; Stone, P. Deep Recurrent Q-Learning for Partially Observable MDPs. In Proceedings of the AAAI Fall Symposia, Arlington, VA, USA, 12–14 November 2015; p. 141. [Google Scholar]
Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1928–1937. [Google Scholar]
Pascanu, R.; Mikolov, T.; Bengio, Y. On the Difficulty of Training Recurrent Neural Networks. In Proceedings of the 30th International Conference on Machine Learning, PMLR, Atlanta, GA, USA, 17–19 June 2013; pp. 1310–1318.
Osband, I. Delightful Policy Gradient. arXiv 2026, arXiv:2603.14608.
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning; Dy, J., Krause, A., Eds.; Curran Associates: Red Hook, NY, USA, 2018; Volume 80, pp. 1861–1870.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 507–520.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2019; pp. 2623–2631.
Anthropic. Claude 4.6 Sonnet. 2026. Available online: https://www.anthropic.com/news/claude-sonnet-4-6 (accessed on 21 May 2026).
Anthropic. Claude 4.6 Opus. 2026. Available online: https://www.anthropic.com/news/claude-opus-4-6 (accessed on 21 May 2026).

Figure 1. Ticket Monarch Web Environment. Arrows represent user navigation flow through the application.

Figure 2. Mouse movement telemetry heatmaps.

Figure 3. Mouse click telemetry heatmaps.

Figure 4. POMDP diagram of CAPTCHA defense problem.

Figure 5. RL-based bot detection framework.

Figure 6. LSTM actor–critic architecture.

Figure 7. Training algorithm comparison.

Figure 8. RL algorithm training comparison: revised reward structure.

Figure 9. Terminal action distribution under the legacy and revised reward structures with adversarial augmentation.

Figure 10. Disjoint tier generalization for PPO.

Figure 11. Per-family detection accuracy for all three algorithms (advaug, revised reward structure).

Figure 12. Human pass-through rate for a baseline PPO agent (trained with the person’s sessions included) versus a disjoint PPO agent (trained with that person’s sessions completely excluded).

Figure 13. Ablation study using PPO+Advaug as the base configuration.

Figure 14. Sensitivity of all three advaug agents to reward parameter misspecification.

Figure 15. Sensitivity to challenge-outcome assumption misspecification.

Figure 16. Bot detection accuracy of rule-based baselines versus RL agents (advaug, revised reward structure).

Figure 17. xgb_v1 (tuned + augmentation). Two false negatives; 0 false positives. Top feature: mouse_avg_speed.

Figure 18. xgb_v1_noaug (tuned, no augmentation). One false negative; 0 false positives. Top feature: mouse_avg_speed.

Figure 19. xgb_v2 (default + augmentation). One false negative; 0 false positives. Top feature: mouse_avg_speed.

Figure 20. xgb_v2_noaug (default, no augmentation). Two false negatives; 0 false positives. Top feature: mouse_direction_change_ratio.

Figure 21. Family-disjoint detection accuracy for the XGBoost classifier. For each bot family, the model is retrained with every session of that family excluded and then evaluated exclusively on the held-out family. Bars show the mean across five training seeds; error bars indicate one standard deviation.

Figure 22. Per-family bot detection rate for XGBoost, RL PPO, and RL Soft PPO (advaug).

Table 1. Telemetry data.

Signal	Rate	Data
Mouse Movement	15 ms sampling	Position (x, y), Timestamp
Mouse Clicks	Every click	Position (x, y), Target Element, Time Delta
Key Strokes	Every key press	Field ID, Type (Down/Up), Timestamp, Special Keys (e.g., Backspace, Tab, Delete)
Mouse Scrolls	Every scroll	Scroll X, Scroll Y Delta, Time Delta

Table 2. Adversarial bot-tier hierarchy.

Tier	Name	Bot Types	Description
1	Commodity	linear, tabber, speedrun	These bots have obvious automation patterns such as constant cursor speed, tab-only navigation, and minimal mouse movement. Their behavior is easy to distinguish from human interaction.
2	Careful Automation	scripted, stealth, slow, erratic, replay	These bots attempt to mimic human behavior using Bezier curve mouse trajectories [21], randomized delays, and natural pacing. While less predictable than Tier 1, they remain detectable through timing variance and movement regularity.
3	Pseudo-Semi-Automated	semi_auto	These bots combine scripted navigation with simulated human-operator handoffs. One phase is automated (e.g., browsing and seat selection) while another uses human-like interaction patterns (e.g., checkout), creating sessions with mixed behavioral signatures that are harder to classify with a single global decision.
4	Trace-Conditioned	trace_conditioned	These bots replay recorded human mouse trajectories with added Gaussian noise and simulate human-profiled typing intervals, producing sessions that statistically resemble real user behavior.
5	LLM-Powered	Claude [22], GPT-4o [23], Gemini [24]	These bots are autonomous AI agents navigating web pages via browser-use [25]. They perceive the page through screenshots, Document Object Model (DOM), or an accessibility tree via Chrome DevTools Protocol (CDP) [26]. They then decide actions through a ReAct-style observe–reason–act loop [27], where the LLM observes the current page state, selects an action (click, type, and scroll), and receives updated observations after each interaction.

Table 3. Adversarial augmentation parameters by difficulty level.

Parameter	Easy	Medium	Hard
$σ_{jitter}$ (px)	3.0	2.0	1.0
Timing compression $β$	—	0.7	0.4
Path smoothing $α_{s}$	—	0.8	0.6
Fix key-hold durations	✓	✓	✓

—: parameter not applicable at this difficulty level; ✓: transform is applied.

Table 4. Challenge pass probabilities used in training.

Difficulty	Human Pass	Bot Pass
Easy	0.95	0.40
Medium	0.85	0.15
Hard	0.70	0.05

Table 5. Legacy terminal reward mapping. Bot catch and pass rewards do not vary by difficulty.

Scenario	Reward
Puzzle actions
Human issued easy puzzle	$- 0.10$
Human issued medium puzzle	$- 0.30$
Human issued hard puzzle	$- 0.50$
Bot passes puzzle (any difficulty)	$- 0.40$
Bot caught by puzzle (any difficulty)	$+ 1.00$
Direct actions
Allow human	$+ 0.50$
Allow bot	$- 0.80$
Block bot	$+ 1.00$
Block human	$- 1.00$

Table 6. Revised terminal reward mapping. Puzzle outcomes are sampled stochastically with no per-action costs.

Scenario	Reward
Puzzle actions
Human passes easy puzzle	$- 0.05$
Human passes medium puzzle	$- 0.20$
Human passes hard puzzle	$- 0.40$
Human fails puzzle	$- 1.00$
Bot passes puzzle	$- 0.50$
Bot caught by easy puzzle	$+ 0.80$
Bot caught by medium puzzle	$+ 1.00$
Bot caught by hard puzzle	$+ 1.20$
Direct actions
Allow human	$+ 0.50$
Allow bot	$- 1.00$
Block bot	$+ 0.70$
Block human	$- 1.50$
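
As a worked example of how Tables 4 and 6 combine, the sketch below computes the expected terminal reward of each puzzle action for a human and for a bot under the revised schedule. It assumes that a bot failing a puzzle counts as "caught" and that a human failing one receives the "Human fails puzzle" reward; the function and constant names are illustrative and not taken from the released code.

```python
# Expected terminal reward of a puzzle action (Table 4 pass rates x Table 6 rewards).
PASS_PROB = {  # difficulty -> (human pass probability, bot pass probability)
    "easy": (0.95, 0.40),
    "medium": (0.85, 0.15),
    "hard": (0.70, 0.05),
}
HUMAN_PASS_REWARD = {"easy": -0.05, "medium": -0.20, "hard": -0.40}
HUMAN_FAIL_REWARD = -1.00
BOT_PASS_REWARD = -0.50
BOT_CAUGHT_REWARD = {"easy": 0.80, "medium": 1.00, "hard": 1.20}


def expected_puzzle_reward(difficulty: str, is_bot: bool) -> float:
    """Average reward over the stochastic pass/fail outcome of one puzzle."""
    p_human, p_bot = PASS_PROB[difficulty]
    if is_bot:
        # Assumption: a bot that does not pass is treated as caught.
        return p_bot * BOT_PASS_REWARD + (1 - p_bot) * BOT_CAUGHT_REWARD[difficulty]
    return p_human * HUMAN_PASS_REWARD[difficulty] + (1 - p_human) * HUMAN_FAIL_REWARD


for level in PASS_PROB:
    human = expected_puzzle_reward(level, is_bot=False)
    bot = expected_puzzle_reward(level, is_bot=True)
    print(f"{level:6s} human={human:+.4f} bot={bot:+.4f}")
```

Under these assumptions the expected values are −0.0975, −0.32, and −0.58 for a human and +0.28, +0.775, and +1.115 for a bot (easy, medium, hard). A hard puzzle therefore has a higher expected reward against a bot than a direct block (+0.70), while any puzzle is worse for a human than a direct allow (+0.50).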

Table 7. Model parameter count.

Component	Parameters
LSTM	$4 \times (26 \times 128 + 128 \times 128 + 128 + 128) = 79,872$
Actor head	$(128 \times 128 + 128) + (128 \times 64 + 64) + (64 \times 7 + 7) = 25,223$
Critic head	$(128 \times 128 + 128) + (128 \times 64 + 64) + (64 \times 1 + 1) = 24,833$
Total	129,928
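
The arithmetic in Table 7 can be checked with a few lines of Python. The sketch assumes a single-layer LSTM with two bias vectors per gate (the convention used by, for example, PyTorch's `nn.LSTM`) and fully connected heads with bias terms; the helper names are illustrative.

```python
def lstm_params(input_dim: int, hidden_dim: int) -> int:
    """Four gates, each with input weights, recurrent weights, and two bias vectors."""
    per_gate = input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim
    return 4 * per_gate


def mlp_params(layer_sizes: list[int]) -> int:
    """Dense layers with biases: sum of (in * out + out) over consecutive sizes."""
    return sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))


lstm = lstm_params(input_dim=26, hidden_dim=128)
actor = mlp_params([128, 128, 64, 7])   # 7 outputs, one per action
critic = mlp_params([128, 128, 64, 1])  # scalar value estimate
total = lstm + actor + critic

print(lstm, actor, critic, total)  # 79872 25223 24833 129928
```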

Table 8. PPO hyperparameters.

Parameter	Value	Justification
Learning rate	$3 \times 10^{- 4}$	Standard PPO setting [35]
Discount factor $γ$	$0.99$	Long-horizon reasoning [30]
GAE $λ$	$0.95$	Bias–variance tradeoff [32]
Clip $ϵ$	$0.2$	PPO default [35]
Value loss coefficient	$0.5$	Standard
Entropy coefficient	$0.02$	Encourages exploration
Max grad norm	$0.5$	LSTM stability [42]
Rollout steps	4096	On-policy buffer size
Epochs per rollout	4	Standard
Total timesteps	500,000	Convergence budget
Optimizer	Adam	Standard optimizer [35]
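
To illustrate how the discount factor, GAE $λ$, and clip $ϵ$ in Table 8 enter the update, the following is a minimal pure-Python sketch of generalized advantage estimation [32] and the per-sample PPO clipped surrogate [35]. It is a textbook formulation under the listed defaults, not the authors' training loop; the function names and the list-based interface are assumptions made for illustration.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout, scanning backwards in time."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; the bootstrap term is dropped at episode ends.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO objective for one sample: the pessimistic of unclipped and clipped terms."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A three-step session that ends with a terminal reward of +1.
adv, ret = compute_gae(
    rewards=[0.0, 0.0, 1.0],
    values=[0.5, 0.5, 0.5],
    dones=[False, False, True],
    last_value=0.0,
)
print(adv, ret)
```

Because the reward in this setting arrives mainly at the terminal action, $γ λ$ controls how much of that terminal signal is credited to the earlier observation steps.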

Table 9. DG hyperparameters.

Parameter	Value	Justification
Temperature $η$	$1.0$	Stable and robust across experiments per [43]

Table 10. Soft PPO hyperparameters.

Parameter	Value	Justification
Target entropy ratio	$0.5$	SAC-style entropy scaling [44]
$α$ learning rate	$3 \times 10^{- 4}$	Standard RL learning rate [35,44]
Initial $log α$	$- 2.0$	Common entropy initialization [44]
$α$ range	$[0.001, 1.0]$	Stability constraint (ours)
$α$ optimizer	Adam	Standard optimizer [35,44]
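
The entropy-temperature settings in Table 10 can be read as a SAC-style update on $log α$ [44]. The sketch below is one plausible reading and not the authors' implementation: it assumes the target entropy is the target ratio times the maximum entropy of a 7-action policy (the actor output size in Table 7), and it substitutes a plain gradient step for the Adam optimizer listed in the table. All names are illustrative.

```python
import math

NUM_ACTIONS = 7                                  # actor head output size (Table 7)
TARGET_ENTROPY = 0.5 * math.log(NUM_ACTIONS)     # assumed: ratio x max entropy
ALPHA_LR = 3e-4
LOG_ALPHA_MIN = math.log(0.001)                  # lower bound of the alpha range
LOG_ALPHA_MAX = math.log(1.0)                    # upper bound of the alpha range


def entropy(probs):
    """Shannon entropy (nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def update_log_alpha(log_alpha, policy_entropy):
    """One gradient step on the loss log_alpha * (H - H_target), then clamp."""
    grad = policy_entropy - TARGET_ENTROPY
    log_alpha = log_alpha - ALPHA_LR * grad      # plain SGD stands in for Adam
    return min(max(log_alpha, LOG_ALPHA_MIN), LOG_ALPHA_MAX)


log_alpha = -2.0                                 # initial value from Table 10
peaked = [0.94] + [0.01] * 6                     # low-entropy policy
log_alpha = update_log_alpha(log_alpha, entropy(peaked))
print(math.exp(log_alpha))                       # alpha rises slightly above exp(-2)
```

When the policy entropy falls below the target, $α$ grows and the entropy bonus is weighted more heavily; when the policy is more random than the target, $α$ shrinks.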

Table 11. Session-level feature groups (39 dimensions).

Group	Dim	Features	Rationale
Mouse dynamics	9	count, avg/std speed, avg/std $Δ t$, direction-change ratio, straightness, jitter ratio, acceleration std	Bots move in straight lines at constant speed; humans exhibit curved paths, hand tremor (jitter), and variable acceleration.
Click patterns	4	count, avg/std inter-click interval, interactive-element ratio	Bots click at regular intervals and may miss interactive targets.
Keystroke timing	8	count, avg/std inter-key interval, unique fields, field-switch ratio, rhythm regularity (CV), avg/std hold duration	Bots type with uniform timing; Selenium fires near-zero hold durations (∼1 ms vs. ∼80–200 ms for humans).
Scroll behavior	6	count, avg/std $Δ y$, total $| Δ y |$, avg speed, direction-change ratio	Bots scroll monotonically; humans reverse direction frequently.
Session-level	1	total duration	Bots complete sessions significantly faster than humans.
Event-type ratios	4	mouse/click/key/scroll ratios	Bots produce abnormal event distributions (e.g., no mouse events for tab-only Selenium bots).
Global timing	3	mean/variance/min inter-event $Δ t$	Bots generate unnaturally regular or fast event streams.
Spatial coverage	4	unique x positions, unique y positions, x range, y range	Bots visit fewer unique screen positions with narrower spatial extent.
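
A few of the features in Table 11 can be sketched directly from raw telemetry. The definitions below are common choices and are assumptions here, since the table lists feature names but not formulas: straightness is taken as net displacement divided by path length, and rhythm regularity as the coefficient of variation (std/mean) of inter-key intervals. Only `mouse_avg_speed` is a feature name that appears in this article; the other keys are hypothetical.

```python
import math
import statistics


def mouse_features(points):
    """points: time-ordered (x, y, t_ms) samples. Returns a small feature dict."""
    if len(points) < 2:
        return {"mouse_avg_speed": 0.0, "mouse_std_speed": 0.0, "mouse_straightness": 0.0}
    steps = list(zip(points[:-1], points[1:]))
    dists = [math.hypot(x1 - x0, y1 - y0) for (x0, y0, _), (x1, y1, _) in steps]
    dts = [t1 - t0 for (_, _, t0), (_, _, t1) in steps]
    speeds = [d / dt for d, dt in zip(dists, dts) if dt > 0]  # px per ms
    path_length = sum(dists)
    net = math.hypot(points[-1][0] - points[0][0], points[-1][1] - points[0][1])
    return {
        "mouse_avg_speed": statistics.fmean(speeds) if speeds else 0.0,
        "mouse_std_speed": statistics.pstdev(speeds) if speeds else 0.0,
        # Assumed definition: 1.0 for a perfectly straight path, lower for curved ones.
        "mouse_straightness": net / path_length if path_length > 0 else 0.0,
    }


def rhythm_cv(intervals_ms):
    """Coefficient of variation of inter-key intervals; 0.0 means perfectly uniform."""
    mean = statistics.fmean(intervals_ms)
    return statistics.pstdev(intervals_ms) / mean if mean > 0 else 0.0


linear_bot = [(0, 0, 0), (10, 0, 15), (20, 0, 30)]     # constant speed, straight line
curved_path = [(0, 0, 0), (10, 0, 15), (10, 10, 30)]   # same speed, one right-angle turn
print(mouse_features(linear_bot), mouse_features(curved_path))
print(rhythm_cv([100, 100, 100]), rhythm_cv([80, 140, 95]))
```

The straight, constant-speed trace yields zero speed variance and a straightness of 1.0, which is the signature the table's rationale column attributes to simple bots.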

Table 12. XGBoost hyperparameter configuration.

Parameter	Value
Number of estimators (K)	200
Max depth	3
Learning rate ( $η$ )	0.05
Row subsampling (`subsample`)	0.7
Column subsampling (`colsample_bytree`)	0.7
Min child weight	5
$L_{1}$ regularization ( $α$ )	0.3
$L_{2}$ regularization ( $λ$ )	2.0
Min split loss ( $γ$ )	0.3
Early stopping rounds	20
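
For readers who want to reproduce the configuration, the values in Table 12 map onto the scikit-learn style parameter names of the `xgboost` Python package as sketched below. The mapping of $α$, $λ$, and $γ$ to `reg_alpha`, `reg_lambda`, and `gamma` follows the library's naming; the binary logistic objective and log-loss evaluation metric are assumptions, and `build_classifier` is a hypothetical helper rather than part of the released code.

```python
# Table 12 expressed with xgboost's scikit-learn parameter names.
XGB_PARAMS = {
    "n_estimators": 200,
    "max_depth": 3,
    "learning_rate": 0.05,
    "subsample": 0.7,
    "colsample_bytree": 0.7,
    "min_child_weight": 5,
    "reg_alpha": 0.3,      # L1 regularization
    "reg_lambda": 2.0,     # L2 regularization
    "gamma": 0.3,          # minimum split loss
    "early_stopping_rounds": 20,
}


def build_classifier(params=None):
    """Construct the classifier; imported lazily so the config is usable without xgboost."""
    from xgboost import XGBClassifier  # constructor-level early stopping needs xgboost >= 1.6

    merged = dict(XGB_PARAMS if params is None else params)
    # Assumed objective and metric for a human-vs-bot binary task.
    return XGBClassifier(objective="binary:logistic", eval_metric="logloss", **merged)
```

Early stopping only takes effect when a validation set is passed at fit time, for example `model.fit(X_train, y_train, eval_set=[(X_val, y_val)])`.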

Table 13. RL classification performance on test split (Mean ± Std over 5 seeds)—revised reward structure. Bold values indicate the best-performing result for each metric.

Metric	PPO	PPO+Advaug	DG	DG+Advaug	Soft PPO	Soft PPO+Advaug
Accuracy	0.967 ± 0.006	0.968 ± 0.007	0.962 ± 0.025	0.961 ± 0.021	0.974 ± 0.004	0.977 ± 0.001
Precision	0.997 ± 0.007	1.000 ± 0.000	0.975 ± 0.045	1.000 ± 0.000	0.996 ± 0.004	1.000 ± 0.000
Recall	0.937 ± 0.013	0.936 ± 0.013	0.949 ± 0.003	0.921 ± 0.041	0.951 ± 0.005	0.954 ± 0.004
F1 Score	0.966 ± 0.007	0.967 ± 0.007	0.962 ± 0.024	0.958 ± 0.023	0.973 ± 0.004	0.976 ± 0.002
Avg Reward	0.990 ± 0.017	0.996 ± 0.016	0.942 ± 0.122	0.961 ± 0.067	1.004 ± 0.015	1.008 ± 0.010
Honeypot %	0.862 ± 0.005	0.857 ± 0.009	0.826 ± 0.066	0.858 ± 0.013	0.851 ± 0.012	0.791 ± 0.040

Table 14. XGBoost classifier evaluation across four configurations ($τ = 0.5$, test $n = 192$: 61 human, 131 bot). Aug. indicates whether HumanProfiler-augmented bot sessions were included; Tuned indicates Optuna-based hyperparameter selection (5-fold CV). All errors are false negatives (bots misclassified as human); no configuration blocked a legitimate user.

Model	Tuned	Aug.	Accuracy	F1	ROC-AUC	FN
xgb_v1	✓	✓	0.9896	0.9839	1.0000	2
xgb_v1_noaug	✓		0.9948	0.9919	1.0000	1
xgb_v2		✓	0.9948	0.9919	1.0000	1
xgb_v2_noaug			0.9896	0.9839	1.0000	2

✓ indicates yes (feature enabled); blank indicates no (feature not used).
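
The metrics in Table 14 can be recomputed from the confusion counts given in its caption (61 human and 131 bot test sessions, with one or two bots misclassified as human). The sketch below does so for both choices of positive class. The observation that the F1 column matches the human-as-positive reading, while the recall and F1 reported for XGBoost in Table 15 match the bot-as-positive reading, is inferred from this arithmetic and is not stated in the tables themselves.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


N_HUMAN, N_BOT = 61, 131


def evaluate(missed_bots):
    """missed_bots: bots classified as human; no humans are flagged as bots."""
    bot_positive = binary_metrics(
        tp=N_BOT - missed_bots, fp=0, fn=missed_bots, tn=N_HUMAN
    )
    human_positive = binary_metrics(
        tp=N_HUMAN, fp=missed_bots, fn=0, tn=N_BOT - missed_bots
    )
    return bot_positive, human_positive


for missed in (1, 2):
    bot_pos, human_pos = evaluate(missed)
    print(
        missed,
        round(bot_pos["accuracy"], 4),
        round(bot_pos["f1"], 4),
        round(human_pos["f1"], 4),
    )
```

With one missed bot the accuracy is 191/192 ≈ 0.9948, the bot-positive F1 is 260/261 ≈ 0.996, and the human-positive F1 is 122/123 ≈ 0.9919; with two missed bots the accuracy is 0.9896 and the human-positive F1 is 0.9839.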

Table 15. Standard test split: XGBoost vs. RL PPO+Advaug vs. RL Soft PPO+Advaug.

Metric	XGBoost (v2 Default + Aug)	PPO + Advaug	Soft PPO + Advaug
Accuracy	0.995	0.968 ± 0.007	0.977 ± 0.001
Precision	1.000	1.000 ± 0.000	1.000 ± 0.000
Recall	0.992	0.936 ± 0.013	0.954 ± 0.004
F1 Score	0.996	0.967 ± 0.007	0.976 ± 0.002

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
