1. Introduction
Malicious bots and botnets are a persistent problem for online platforms. They are used to automate ticket scalping, credential stuffing, spam, large-scale scraping, fraudulent account creation, and other forms of abuse [
1]. As such, bot detection has become a fundamental requirement of modern web security. Effective defenses must distinguish legitimate users from automated agents while minimizing unnecessary friction for humans. In 2003, CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) was formally introduced by von Ahn et al. [
2] as a class of artificial intelligence (AI)-hard (i.e., difficult for automated systems but frictionless for humans) challenge–response tests designed to block automated programs from abusing online services while ensuring they remain accessible to human users. The foundation of CAPTCHA design is to exploit tasks that are computationally trivial for humans, but onerous for machines, providing a mechanism for distinguishing legitimate users from bots that is both reliable and scalable.
Due to the advent of artificial intelligence, and more specifically machine learning, traditional CAPTCHA schemes have been defeated convincingly. Comprehensive literature reviews, such as the one from Dinh and Hoang [
3], divide CAPTCHA into specific security schemes. One such scheme is text-based CAPTCHA, where humans must decipher distorted text and numbers with the aim of blocking bots from accessing web pages. However, researchers have developed a modified variant of convolutional neural networks (CNNs) that quickly solves a wide range of deployed text-based CAPTCHA schemes [
4]. Image-based CAPTCHA follows a similar logical flow but instead uses graphical puzzles. In a similar vein, with computer vision algorithms, CNNs, Support Vector Machines, and other methods, image-based CAPTCHA security measures have also been compromised [
3]. Audio CAPTCHA, which was introduced to improve accessibility, has proven equally vulnerable. Bock et al. [
5] demonstrated that an automated speech recognition ensemble could defeat reCAPTCHA’s audio challenge with over 85% accuracy. Video-based CAPTCHA, which typically involves moving text or an animated object, creates difficulty for bots due to its motion and temporal elements. However, it is not widely used because of its bandwidth requirements, accessibility issues, and excessive friction for human users [
6,
7].
Across all of its versions, a core limitation of CAPTCHA remains the same: high-performing systems remain proprietary. Google’s reCAPTCHA v3 incorporates a risk-based scoring system that is based on behavioral biometrics, yet the implementation is entirely proprietary, making it impossible to audit, validate, or extend [
8]. Current open-source systems apply a fixed detection approach and a fixed challenge response that cannot adapt to adversarial behavior over time. Open-source alternatives such as ALTCHA utilize cryptographic proof-of-work mechanisms to raise the cost of bot attacks but lack the behavioral intelligence to distinguish bots from humans in real time [
9]. To the best of our knowledge, no existing open-source CAPTCHA framework treats bot detection as a sequential decision-making problem in which a defender observes behavior over time, deploys evidence-gathering interventions such as honeypots, and adaptively selects a challenge or terminal decision based on accumulated evidence. This project directly addresses that gap.
This paper proposes a silent, reinforcement learning (RL)-based CAPTCHA system designed for high-security web applications. The system frames bot detection as a partially observable Markov decision process (POMDP), in which a Proximal Policy Optimization agent with a Long Short-Term Memory architecture (PPO+LSTM) makes sequential classification decisions over a sliding window of behavioral telemetry events. The system collects mouse movements, keystrokes, scroll events, and click data silently as users navigate a simulated ticket-purchasing web application, encoding these signals into a 26-dimensional feature vector per window. During the observation phase of each session, the agent may either continue observing or deploy a honeypot to gather additional evidence before a final decision is required. Honeypots generally refer to deceptive decoy systems designed to attract attackers [10]; in our context, honeypot deployment refers to embedding fields that are visually hidden from human users but exist in the backend, enabling the detection of automated agents that interact with these otherwise invisible elements. At the terminal state of each session, the agent adaptively deploys an easy, medium, or hard CAPTCHA challenge, blocks the user, or passes the user without friction, depending on its classification confidence. We also introduce a companion XGBoost classifier, which provides a holistic, session-level human-likelihood score, allowing direct comparison between the two approaches.
This study is guided by the following research questions:
RQ1: Can a sequential reinforcement learning agent distinguish human users from scripted, replay-based, adversarially humanized, and large language model (LLM)-powered bots exclusively using streamed behavioral telemetry, without relying on personally identifiable information?
RQ2: Can unnecessary user friction be reduced while maintaining strong bot detection performance through an adaptive action space that includes both mid-session interventions (i.e., continued observation and honeypot deployment) and terminal decisions (i.e., graded CAPTCHA, allow, and block)?
We evaluate the corresponding hypotheses.
H1: Temporal behavioral telemetry contains sufficient sequential and interaction-level patterns for a reinforcement learning agent to achieve statistically significant discrimination between human and automated actors.
H2: A multi-action adaptive response policy with progressive intervention mechanisms, including honeypots and graded CAPTCHA escalation, will preserve or improve overall bot detection effectiveness while lowering interaction costs imposed on legitimate users.
The proposed framework should not be interpreted as a ready-to-deploy commercial replacement for mature CAPTCHA services. Its practical value lies in the methodology and environment: the defender policy is more explainable than traditional binary classifiers (e.g., XGBoost); the reward function can be modified to prioritize security without losing performance, user friction, accessibility, or false-positive reduction; and new bot families can be added as AI-based automation improves. This makes the system future-adaptable rather than future-proof. In particular, the same environment can be reused to evaluate stronger browser agents, retrain policies against newly observed behaviors, and compare different intervention strategies without relying on closed risk-scoring logic. Overall, the key contributions of this work are as follows:
This paper proposes a new RL-based CAPTCHA system that learns a sequential intervention policy over windowed behavioral telemetry.
We design a new feature vector space by considering users’ real-time behaviors using keystrokes and mouse movements.
We implement the proposed system and evaluate the performance of the RL agent and XGBoost classifier.
We release an open-source RL environment for adaptive CAPTCHA defense research, including session replay, configurable reward schedules, honeypot deployment, graded CAPTCHA interventions, and multiple bot-tier behaviors.
The remainder of this paper is organized as follows:
Section 2 reviews related work in CAPTCHA systems, their vulnerabilities, behavioral biometrics, and the use of reinforcement learning in cybersecurity.
Section 3 describes our methodology, including the web application, data collection procedure, adversarial bot framework, RL agent architecture, and XGBoost classifier design.
Section 4 presents experimental results for both the RL agents and the classifier.
Section 5 discusses our key findings and system limitations.
Section 6 and
Section 7 summarize our work and discuss directions for future research.
2. Related Work
2.1. CAPTCHA Systems and Their Evolution
CAPTCHA was introduced as an online security mechanism designed to prevent automated fraud and abuse, wherein unique challenges are used to distinguish bots and humans [
2]. The first generation consisted of text-based CAPTCHA, which required users to identify and transcribe distorted alphanumeric characters rendered against noisy backgrounds. These were followed by image-based CAPTCHA, which presented users with visual puzzles requiring the identification of objects belonging to a specified category, a format that was used in Google’s reCAPTCHA v2 image grid challenges. Both generations relied on the premise that the visual recognition tasks involved were computationally intractable for automated systems.
Advances in machine learning have brought about further evolutions in CAPTCHA. One such methodology tracks user behavior on a web page, distinguishing between humans and malicious automation through behavioral and sensor metrics [
3,
11]. This invisible CAPTCHA works in the background, collecting information about the user without a direct puzzle. This approach has become prevalent due to its ease for users and its security performance.
Beyond purely invisible approaches, some systems have explored hybrid designs. Open-source alternatives such as ALTCHA couple cryptographic proof of work with behavioral signals to raise the cost of automated abuse in a transparent manner [
9]. However, a key concern with this system, and others, is that their detection strategies are fixed at the start of the deployment. Furthermore, research has highlighted an inherent tradeoff in CAPTCHA design: increasing difficulty to deter bots correspondingly raises friction and error rates for legitimate users [
12].
2.2. Bots and AI-Based CAPTCHA Attacks
The vulnerability of text-based CAPTCHA to machine learning attacks has been well established. Tang et al. [
4] demonstrated that CNNs could break text-based CAPTCHA schemes deployed by the top 50 most popular international websites with high accuracy, targeting schemes that utilize noise, distortion, and anti-segmentation techniques.
Image-based CAPTCHA has been proven equally susceptible. Sukhani and Chitaliya [
13] showed that a multi-class CNN model could solve image grid puzzles with 92.98% accuracy, while Sivakorn et al. [
14] defeated Google’s image reCAPTCHA at scale by combining deep learning-based object recognition with cookie manipulation to influence the risk analysis system. Plesner et al. [
15] showed that deep learning models such as the YOLO (You Only Look Once) v8 model can solve reCAPTCHA v2 image challenges with 100% success. Audio CAPTCHAs, introduced to improve accessibility, have fared no better: Bock et al. [
5] showed that an automated speech recognition ensemble could mount a large-scale attack against reCAPTCHA’s audio challenge with an attack success rate of 85.15%.
Behavioral CAPTCHA is also vulnerable: Akrout et al. [16] demonstrated that RL agents can be trained to bypass reCAPTCHA v3 by learning mouse movement patterns that receive a high (human-like) score. The authors formulated the problem as a grid world, and their agent achieved a success rate of 97.4%. Together, these studies make a compelling case that both challenge-based and behavioral CAPTCHA are vulnerable to targeted AI attacks, motivating the need for adaptive defenses.
2.3. Behavioral Biometrics
The high data cost associated with this project required that our selected data be simple to collect, widely available in environments similar to ours, and effective for bot detection. To meet these data constraints, we track a variety of mouse and keyboard interaction telemetry.
To ensure the effectiveness of bot detection, we reviewed prior studies to identify the most valuable telemetry to track. Mouse dynamics, including movement and clicking behavior, were used in [17] to detect bots with up to 99.20% accuracy. Similarly, the authors of [18] used real and synthetic mouse trajectories to supplement other approaches (e.g., Google’s reCAPTCHA) and achieved 98.7% accuracy. In another study [19], a variety of mouse and keyboard inputs were used to detect bots in online video games with 99% accuracy and negligible performance overhead.
These studies show that bot interaction with online environments is often simple and contains many obvious differences from human data. Mouse movements are often overly smooth, while keyboard inputs are unrealistically fast or uniformly timed compared to human inputs. Based on these findings, we conclude that a variety of mouse movement and keyboard inputs combined provide an effective approach to bot detection.
2.4. Reinforcement Learning for Cybersecurity
Reinforcement learning has emerged as an increasingly viable approach for cybersecurity problems because many defensive tasks are inherently sequential. Nguyen and Reddi [
20] provide a comprehensive survey of deep RL (DRL) applied to cybersecurity, covering its application to intrusion detection systems (IDSs), cyber-physical system defense, and multi-agent game-theoretic simulations of attacker–defender dynamics. Their review highlights an advantage of RL over static classifiers: while supervised models are trained once on historical attack patterns, RL agents can continuously refine their policies, making them well-suited to adversaries that evolve their strategies over time.
Within intrusion detection specifically, RL-based methods have been used to detect anomalies in network traffic and respond to attack patterns that fall outside the distribution of training data. This robustness to distributional shift is valuable in security contexts, where adversaries actively probe for the gaps in a deployed system’s knowledge. The adversarial framing, in which the RL agent is cast as a defender operating against an implicit attacker, maps to problems like bot detection, where the goal is not simply to classify known bots but to maintain a reliable policy as bot strategies evolve [
20].
2.5. Existing Dynamic CAPTCHA Systems
Modern CAPTCHA systems have shifted from static challenge–response tasks toward behavioral and adaptive risk-scoring approaches [
11]. Although popular commercial systems reduce visible user friction through passive scoring mechanisms, their internal methodology remains proprietary. This limits transparency, reproducibility, and independent adaptation to emerging adversarial behaviors.
A qualitative reference point for the framework proposed in this paper is reCAPTCHA v3. Its approach operates in the background and is based on behavioral evidence rather than requiring an immediate visible challenge. However, reCAPTCHA v3 ultimately exposes a proprietary risk score, leaving the downstream intervention logic to the site operator or to Google’s closed infrastructure. By contrast, the framework presented in this paper formulates bot defense as a partially observable Markov decision process (POMDP) and learns an explicit sequential policy over continued observation, honeypot deployment, graded CAPTCHA challenges, and allow and block actions. The difference, therefore, is not simply whether adaptation exists but where that adaptation resides: in reCAPTCHA v3, it is embedded within a proprietary service, whereas in the system presented in this paper, the intervention policy itself is learned, auditable, and modifiable.
At the same time, reCAPTCHA v3 remains substantially stronger in deployment maturity, ecosystem integration, and scale in the real world. Since Google does not publish public precision, recall, and false-positive benchmarks that are directly comparable to the dataset and task formulation used in this study, this comparison should be understood as architectural and operational rather than as a strict head-to-head performance evaluation. Consequently, the contribution of the present work is not to claim benchmark superiority over reCAPTCHA v3 but to introduce an auditable and sequentially adaptive CAPTCHA defense framework with strong preliminary internal results.
Overall, existing dynamic CAPTCHA systems generally adapt either a risk score or a challenge difficulty, while our framework formulates the broader bot-mitigation process as a POMDP. The defender observes behavioral telemetry over time, may deploy honeypots as early evidence-gathering interventions, and then selects among graded CAPTCHA challenges and allow or block actions.
2.6. Gap Analysis and Contribution
The related work identifies a remaining gap between proprietary behavioral-risk systems and open-source CAPTCHA alternatives. Commercial systems provide large-scale adaptive scoring, but their decision logic is closed and difficult to audit or extend. Open-source alternatives provide transparency but generally do not model bot mitigation as a sequential defender policy that can observe behavior, gather evidence, and choose among multiple interventions over time. This study addresses that gap by introducing a publicly available reinforcement learning environment for adaptive CAPTCHA defense. The environment formulates bot detection as a POMDP and supports continued observation, honeypot-based evidence gathering, graded CAPTCHA challenges, and allow and block decisions within a single policy. We evaluate this framework using a PPO+LSTM agent and a companion XGBoost classifier against scripted, replay-based, adversarially humanized, and LLM-powered bots. The goal is to provide a transparent testbed for studying how CAPTCHA defense policies can be adapted as bot capabilities evolve.
3. Methodology
3.1. Web Application
For this project, a sandbox web application was developed to simulate a real-world high-security environment requiring anti-bot security measures. The system records per-session user inputs, which are used by a silent RL agent that evaluates activity for each unique session. User interaction with the application flows through three primary pages: concert selection, seat selection, and checkout. These pages and the CAPTCHA challenges delivered to the user are shown in
Figure 1. Additionally, the application includes a developer dashboard that provides session telemetry and displays RL agent decisions.
The mock e-commerce application uses CAPTCHA challenges instead of hard blocks to avoid mistakenly blocking legitimate customers. Suspected bots receive one of three CAPTCHA puzzles based on risk level. The easy puzzle asks users to rotate an asymmetric object upright with generous tolerance. The medium puzzle uses a jigsaw slider with more visual ambiguity and stricter tolerance. The hard puzzle, reserved for highly suspicious sessions, requires users to click moving objects in a specified order within a time limit; two failed attempts result in blocking.
3.2. Telemetry Data Collection
The dataset contains two categories of interactions: human-generated and bot-generated. Human data was collected from participants directly using the application. Bot data was collected from LLM agents and replay bots directly interacting with the application, generating data from actual automated use of the interface rather than synthetically constructed traces. These bot sessions reflect real event streams produced from automated agents navigating the application. To further supplement and vary the bot data, a data augmentation process was additionally applied; this process is detailed in
Section 3.3. Overall, the dataset contains 643 original sessions (204 human; 439 bot) with an additional 2628 sessions from the bot augmentation process.
For each session, telemetry data included keystrokes, mouse movements, scroll events, and button presses. Keystrokes, scrolls, and clicks were logged as discrete events, whereas mouse movements were recorded at 15 millisecond intervals. Both the human and bot samples were represented using telemetry captured from interaction with the same application interface. A detailed summary of the collected features is provided in
Table 1.
The distinctions between human and bot activity are illustrated in
Figure 2 and
Figure 3, which present heatmap visualizations of interaction data. These figures map the X-Y pixel positions of all recorded interactions, with the mean position of each distribution indicated by a red circle. For mouse movements, the mean positions are approximately (507, 358), (593, 420), and (594, 419), while for mouse clicks they are (425, 379), (483, 453), and (483, 453).
These visualizations highlight clear behavioral differences between human and bot activity. Human interaction exhibits broad spatial dispersion across the interface, indicating more exploratory behavior. In contrast, bot interactions are highly structured, with limited horizontal variance and a small number of unique x-axis positions. This results in the distinct vertical bands of activity observed in both the movement and click heatmaps. The standard deviation along the x-axis for human clicks is approximately 1.9 times greater than that of bot data. Additionally, human clicks span 562 unique x-axis positions, while bot clicks are confined to only 73. Human clicks also occupy a substantially larger number of spatial bins, covering approximately 17% of the available space, while bot clicks cover only approximately 9.4%. Furthermore, human mouse movements occupy approximately 65% of the environment space, compared with about 17% for human clicks. This demonstrates the highly exploratory nature of mouse movements and positioning versus the more targeted nature of clicking.
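The spatial statistics quoted above (x-axis standard deviation, number of unique x-axis positions, and the fraction of occupied spatial bins) can be illustrated with a minimal sketch. The function name `spatial_summary`, the 50-pixel bin size, the canvas dimensions, and the sample points are all hypothetical choices made for illustration; the binning actually used for the heatmaps is not stated in the text.

```python
import math
from statistics import pstdev

def spatial_summary(points, width, height, bin_px=50):
    """Summarize (x, y) pixel positions of clicks or mouse movements.

    bin_px is an assumed grid resolution, not a value from the paper.
    """
    xs = [x for x, _ in points]
    # Set of grid cells that contain at least one event.
    occupied = {(int(x) // bin_px, int(y) // bin_px) for x, y in points}
    total_bins = math.ceil(width / bin_px) * math.ceil(height / bin_px)
    return {
        "x_std": pstdev(xs),              # horizontal dispersion
        "unique_x": len(set(xs)),         # distinct x-axis positions
        "coverage": len(occupied) / total_bins,  # fraction of bins touched
    }

# Toy data: dispersed "human" clicks vs. "bot" clicks confined to a vertical band.
human = [(100, 80), (420, 300), (760, 510), (230, 640), (905, 120)]
bot = [(480, 100), (480, 300), (480, 500), (490, 300), (490, 500)]

h = spatial_summary(human, width=1000, height=700)
b = spatial_summary(bot, width=1000, height=700)
print(h["x_std"] / b["x_std"], h["unique_x"], b["unique_x"])
```

On real telemetry, the ratio of the two `x_std` values corresponds to the dispersion ratio reported above, and `coverage` corresponds to the share of occupied bins.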
In total, the dataset includes 39,904 human mouse movement events and 28,741 bot movement events, which were augmented to produce an additional 172,138 samples. For mouse clicks, the dataset contains 1392 human events and 4214 bot events, with an additional 25,260 augmented samples. Notably, the augmented bot dataset closely mirrors the statistical properties of the original bot data across all evaluated metrics, suggesting that augmentation does not fully introduce human-like variability.
3.3. Adversarial Bot-Tier Framework and Augmentation
To capture a broad spectrum of adversarial behavior on websites, we organize our bot implementations into a five-tier difficulty hierarchy. Each tier reflects increasing levels of behavioral complexity, ranging from simple scripted automation to highly realistic human-mimicking agents, including those powered by LLMs. The characteristics and capabilities of each tier are summarized in
Table 2 below.
For Tiers 1 through 4, browser interactions are implemented using Selenium WebDriver to simulate automated behaviors of varying complexity [
28]. In general, we expect detection performance to decrease from Tier 1 to Tier 5, reflecting increasing adversarial difficulty. Deviations from this trend may indicate that certain behavioral strategies are more effective at evasion than their tier ranking suggests, or that specific agents are better suited to particular adversary types. The corresponding results are presented in
Section 4 and
Section 5.
A key challenge with bot detection is that trivially separable features (e.g., Selenium’s
ms key-hold durations) can inflate model accuracy without learning genuinely discriminative behavioral patterns. To address this, we draw inspiration from adversarial training [
29], which improves model robustness by exposing it to perturbed inputs during training. We adapt this principle to the behavioral bot detection domain by introducing a novel adversarial augmentation procedure—the
HumanProfiler pipeline—that progressively humanizes bot sessions at three difficulty levels, directly mapping to the bot-tier framework in
Section 3.3. To our knowledge, no prior work has applied progressive humanization of bot telemetry as a data augmentation strategy for behavioral bot detection; this pipeline is a novel contribution of this work.
A HumanProfiler first learns statistical profiles from real human sessions across six signal categories: key-hold duration, mouse inter-event time, jitter ratio, mouse speed, direction-change frequency, and event-type ratios. Let $\mu_f$ and $\sigma_f$ denote the human mean and standard deviation for feature $f$. Bot sessions are then transformed as follows:
Easy augmentation: Fixes the most obvious giveaways. Key-hold durations are resampled from a clipped Gaussian centered on the human profile, and micro-jitter is injected into mouse trajectories:
Medium augmentation: Applies all easy transforms and additionally humanizes timing distributions by compressing mouse inter-event intervals toward human rates and applying exponential smoothing to reduce abrupt direction changes in mouse paths. A compression parameter controls the timing compression, and a smoothing coefficient controls the strength of the smoothing.
Hard augmentation: Applies all transforms with tighter parameters (smaller , stronger smoothing, and narrower timing compression), producing near-human sessions that are challenging to distinguish from real users.
The specific parameter values used at each difficulty level are summarized in
Table 3.
For each original bot session, multiple augmented copies are generated (default: 2 copies × 3 levels = 6 augmented sessions per bot). This forces the model to learn subtle behavioral signals rather than relying on trivially separable artifacts, directly strengthening robustness against higher-tier adversaries.
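The four transforms described above (clipped-Gaussian resampling of key-hold durations, micro-jitter injection, timing compression, and exponential smoothing) can be sketched as follows. This is an illustrative sketch rather than the HumanProfiler implementation: the function names, the parameter names `alpha` and `beta`, the clipping bounds, and all numeric values are assumptions, and the exact functional forms used in the pipeline may differ. The values actually used at each level are those listed in Table 3.

```python
import random

def resample_key_hold(mu, sigma, lo, hi, rng):
    """Draw a key-hold duration (ms) from a Gaussian clipped to [lo, hi]."""
    return min(hi, max(lo, rng.gauss(mu, sigma)))

def add_jitter(path, jitter_px, rng):
    """Add zero-mean Gaussian micro-jitter to every (x, y) point."""
    return [(x + rng.gauss(0, jitter_px), y + rng.gauss(0, jitter_px))
            for x, y in path]

def compress_dt(dts, human_mean_dt, alpha):
    """Pull inter-event intervals toward the human mean.

    alpha = 0 leaves the bot timing unchanged; alpha = 1 replaces it
    with the human mean (assumed linear interpolation).
    """
    return [(1 - alpha) * dt + alpha * human_mean_dt for dt in dts]

def smooth_path(path, beta):
    """Exponentially smooth a path; larger beta keeps more of the previous point."""
    if not path:
        return []
    out = [path[0]]
    for x, y in path[1:]:
        px, py = out[-1]
        out.append((beta * px + (1 - beta) * x, beta * py + (1 - beta) * y))
    return out

rng = random.Random(0)  # fixed seed for reproducibility
# Hypothetical human profile: mean 95 ms, std 30 ms, clipped to [20, 300] ms.
holds = [resample_key_hold(95.0, 30.0, 20.0, 300.0, rng) for _ in range(5)]
jittered = add_jitter([(0, 0), (10, 0), (10, 10)], jitter_px=1.5, rng=rng)
dts = compress_dt([15.0, 15.0, 15.0], human_mean_dt=40.0, alpha=0.5)
path = smooth_path([(0, 0), (10, 0), (10, 10)], beta=0.5)
print(dts, path)
```

Under this reading, the easy level would apply only the first two functions, the medium level would add the last two, and the hard level would reuse all four with tighter parameters.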
3.4. Reinforcement Learning for Adaptive CAPTCHA Defenses
3.4.1. Reinforcement Learning Formulation
Reinforcement learning (RL) is a machine learning paradigm in which an agent learns to make decisions through trial-and-error interactions with an environment. The agent’s objective is to learn a policy that maximizes the expected cumulative reward over time [30]. RL formally stems from Markov decision processes (MDPs), which are defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $P$ the transition probability function, $R$ the reward function, and $\gamma$ the discount factor that determines the relative importance of future rewards [30,31]. A policy $\pi(a \mid s)$ specifies the probability of selecting action $a$ in state $s$ [30]. The value function $V^{\pi}(s)$ represents the expected sum of discounted future rewards when starting in state $s$ and following policy $\pi$. Similarly, the advantage function is defined as
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s),$$
which measures the relative benefit of taking action $a$ compared to the expected value of the policy, where $Q^{\pi}(s, a)$ is the expected return after taking action $a$ in state $s$ and following $\pi$ thereafter [30,32]. These components lay the foundation for modern policy gradient methods such as Proximal Policy Optimization (PPO) [33,34,35], which commonly use generalized advantage estimation (GAE) [32] to reduce variance in gradient updates.
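To make the role of GAE concrete, the following minimal sketch computes advantage estimates for one episode using the standard backward recursion over temporal-difference residuals. The function name `gae` and the default values of `gamma` and `lam` are illustrative assumptions, not the hyperparameters used in this study.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for a single trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T); the last entry is the bootstrap value
             (0.0 when s_T is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Toy episode: reward only at the terminal decision, zero value estimates.
adv = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
print(adv)
```

With `lam = 1` the estimate reduces to the discounted return minus the value baseline, while smaller `lam` relies more heavily on the critic and lowers variance at the cost of bias.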
In the context of our problem (i.e., CAPTCHA defenses), this RL formulation can be mapped to user interaction sessions on a web application. The environment corresponds to a user session, and the state space ($\mathcal{S}$) captures user interaction behavior represented as windowed telemetry features (e.g., mouse movement, click events, keystrokes, and scrolling patterns). The action space ($\mathcal{A}$) consists of the RL agent’s possible interventions, such as continuing observation, deploying a honeypot, issuing challenges, allowing the user, or blocking the session. The reward function ($R$) is defined based on the outcome of these actions, assigning positive rewards for correctly identifying bots and legitimate users and penalties for incorrect decisions or user friction. The transition dynamics ($P$) are governed by how users (human or automated) respond to these interventions over time but are not modeled explicitly. This is because the RL agent is trained in a model-free setting, learning directly from observed interaction data without access to the transition probabilities [30]. In our setting, the initial state ($s_0$) corresponds to the start of a user session, prior to observing any interaction behavior. A terminal state ($s_T$) is reached when a final action is taken, such as allowing the user, blocking the session, or issuing a challenge. The overarching objective of the RL agent is to learn a policy ($\pi$) that maximizes the expected cumulative reward by accurately distinguishing between bots and legitimate users while minimizing user friction.
Critically, this problem is more accurately characterized as a partially observable MDP (POMDP) [
36], since the agent does not directly observe whether the user is a bot or human. Instead, it must infer this hidden state from observable behavioral signals (e.g., mouse movements) and interaction patterns over time. Thus, due to this partial observability, sequence models, specifically Long Short-Term Memory (LSTM) networks, were utilized in this paper to keep track of past information over time [
37]. Concretely, LSTM networks capture temporal dependencies and aggregate information across multiple timesteps, enabling the agent to approximate the underlying hidden state (i.e., whether the user is a bot or human). A visual representation of the POMDP for this problem is shown in
Figure 4.
3.4.2. Observation, State, Action, and Reward Space
Observation Space
As discussed, the RL agent cannot directly observe the true state of the environment (i.e., whether the user is a bot or a human). Instead, it receives observations derived from user interaction telemetry. These observations consist of raw behavioral signals (i.e., mouse movements, click events, keystrokes, and scrolling activity), which are collected during a user session. Raw telemetry is used to construct a chronological event timeline, where the initial state corresponds to the user opening the web page, and the terminal state corresponds to checkout completion or session termination. To prevent mouse movement data from dominating the representation, mouse events are subsampled by retaining every fifth sample. The resulting event stream is then segmented into fixed-length windows of 30 events with a 50% overlap, producing a sequence of observations for each session, a standard technique for preserving temporal structure in sequential data [
38].
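The subsampling and windowing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: events are represented as dictionaries with a hypothetical `type` field, the label `mousemove` is an assumed event name, and trailing events that do not fill a complete window are dropped, which may differ from how the actual pipeline handles short or partial windows.

```python
def subsample_mouse(events, keep_every=5):
    """Keep every fifth mouse-move event and all other event types."""
    out, seen = [], 0
    for ev in events:
        if ev["type"] == "mousemove":
            if seen % keep_every == 0:
                out.append(ev)
            seen += 1
        else:
            out.append(ev)
    return out

def sliding_windows(events, size=30, overlap=0.5):
    """Split an event stream into fixed-length windows with fractional overlap."""
    stride = max(1, int(size * (1 - overlap)))  # 15 events for size=30, overlap=0.5
    return [events[i:i + size]
            for i in range(0, len(events) - size + 1, stride)]

# Toy session: 100 mouse moves followed by 40 key presses.
session = ([{"type": "mousemove", "t": i} for i in range(100)]
           + [{"type": "keydown", "t": 100 + i} for i in range(40)])

stream = subsample_mouse(session)   # 20 mouse moves + 40 key presses
windows = sliding_windows(stream)   # windows start at indices 0, 15, 30
print(len(stream), len(windows))
```

Each resulting window would then be encoded into one feature vector, so a session yields one observation per window in chronological order.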
State Representation
Formally, the true underlying state $s_t$ corresponds to whether the user is a bot or a human, which is not directly observable. The agent must infer this hidden state from the sequence of observations. Each observation window is encoded into a 26-dimensional feature vector capturing behavioral characteristics such as motor dynamics, spatial coverage, temporal patterns, and interaction metadata. This feature vector is shown in Table A1. Since a single observation window is insufficient to determine the user type, the agent utilizes an LSTM-based architecture [37] to aggregate information across multiple timesteps ($t$). The LSTM maintains a hidden state that summarizes past observations, enabling the agent to create a latent representation that approximates the underlying user state (i.e., bot or human).
Action Space
The action space consists of seven discrete actions corresponding to different intervention strategies, including continuing observation, deploying a honeypot, issuing CAPTCHA challenges of varying difficulty, allowing the user, or blocking the session. Since non-terminal actions (continue and deploy_honeypot) are only valid during the observation phase and terminal actions (puzzles, allow, and block) are only valid at the final window, invalid action masking [39] is applied to restrict the set of available actions at each timestep $t$, ensuring that only valid actions are selected.
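Invalid action masking can be illustrated with a short sketch in which the logits of invalid actions are set to negative infinity before the softmax, so that those actions receive zero probability. The action names `puzzle_easy`, `puzzle_medium`, and `puzzle_hard`, the ordering of the action list, and the helper names are assumptions made for illustration; only `continue`, `deploy_honeypot`, `allow`, and `block` are named in the text.

```python
import math

# Assumed ordering and naming of the seven discrete actions.
ACTIONS = ["continue", "deploy_honeypot",
           "puzzle_easy", "puzzle_medium", "puzzle_hard",
           "allow", "block"]
NON_TERMINAL = {"continue", "deploy_honeypot"}

def action_mask(is_final_window):
    """True marks a valid action: non-terminal before the final window, terminal at it."""
    if is_final_window:
        return [a not in NON_TERMINAL for a in ACTIONS]
    return [a in NON_TERMINAL for a in ACTIONS]

def masked_softmax(logits, mask):
    """Softmax over valid actions only; invalid actions get probability 0."""
    masked = [l if m else -math.inf for l, m in zip(logits, mask)]
    top = max(masked)  # finite, since every mask has at least one valid action
    exps = [math.exp(l - top) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

uniform_logits = [0.0] * len(ACTIONS)
p_observe = masked_softmax(uniform_logits, action_mask(is_final_window=False))
p_final = masked_softmax(uniform_logits, action_mask(is_final_window=True))
print(p_observe)
print(p_final)
```

Because masked actions contribute exactly zero probability, they are never sampled and do not receive gradient through the policy, which is the property that motivates masking over penalizing invalid actions through the reward.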
Evolution of the Reward Design (Legacy Baseline vs. Revised Schedule)
The legacy reward schedule treated direct blocking and puzzle-based detection nearly identically. Blocking a bot directly yielded
, the same base reward as catching a bot via any puzzle, making direct blocking the strictly dominant strategy. The legacy schedule also used a single aggregate honeypot trigger probability (
) for all bot tiers rather than tier-stratified rates and assigned a smaller information bonus (
). The full legacy mapping is shown in
Table 5.
The revised schedule (
Table 6) addresses these issues by making the catch reward scale with difficulty (
/
/
) and setting the direct block reward (
) below all puzzle catches. This makes puzzle-based detection the higher-reward strategy, encouraging evidence gathering over opaque blocking. The penalty for incorrectly blocking a human increases from
to
, further discouraging false positives. Honeypot trigger probabilities are stratified by bot tiers to reflect realistic differences in bot sophistication, and the information bonus increases from
to
. The behavioral impact is directly observable in
Section 4.1.1.
3.4.3. Architecture
The architecture we implemented has four main components: an LSTM backbone, an actor head, a critic head, and a shared representation. Our actor–critic architecture, originating from [34], uses an LSTM backbone that enables the RL agent to process sequential observations and accumulate evidence over time. This is well suited to our setting because the user’s behavior unfolds temporally, and early signals must be integrated with later observations to infer whether the user is a bot or human.
LSTM Backbone
The core of the architecture is a Long Short-Term Memory (LSTM) network [
37], which captures long-term dependencies in sequential data through gated memory mechanisms. At each timestep
t, the network receives a 26-dimensional feature vector
representing a window of user interaction telemetry and updates its hidden and cell states
accordingly. Intuitively, the LSTM acts as a memory system that maintains a running summary of user behavior over time. At each step, it decides what past information to retain (forget gate), what new information to incorporate from the current observation (input gate), and what information to expose for decision-making (output gate). This allows the model to retain relevant behavioral patterns while discarding noise. By aggregating information across multiple timesteps, the LSTM enables the agent to accumulate evidence over time (e.g., combining early navigation behavior with later typing patterns, which helps form a more complete understanding of the user). Thus, the hidden state
serves as a compressed representation of the observation history. Due to the partial observability of the environment, the LSTM hidden state can be interpreted as an implicit belief representation that summarizes past observations and approximates the underlying latent user state (bot or human) [
40].
Actor Head
The actor head defines the policy
over the discrete action space. It takes the LSTM hidden state
as input and maps it to action scores (logits) using the following layers:
These logits are then masked to remove invalid actions and passed through a softmax function to produce a probability distribution over actions. During training, actions are sampled from this distribution to allow for exploration, while during evaluation, the highest-probability (greedy) action is selected. Invalid action masking is applied by setting the logits of invalid actions to $-\infty$ before the softmax, ensuring they receive zero probability [39].
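The masking step can be illustrated with a short sketch. The action names for the three puzzle difficulties, the ordering of the actions, and the helper names are assumptions introduced for the example; only the seven-action structure and the split into non-terminal and terminal actions come from the text.

```python
import math
from typing import List

# Hypothetical action ordering; puzzle names are placeholders.
ACTIONS = ["continue", "deploy_honeypot",
           "easy_puzzle", "medium_puzzle", "hard_puzzle",
           "allow", "block"]
NON_TERMINAL = {"continue", "deploy_honeypot"}


def action_mask(is_final_window: bool) -> List[bool]:
    """Non-terminal actions are valid before the final window; terminal ones only at it."""
    if is_final_window:
        return [a not in NON_TERMINAL for a in ACTIONS]
    return [a in NON_TERMINAL for a in ACTIONS]


def masked_softmax(logits: List[float], mask: List[bool]) -> List[float]:
    """Softmax with invalid logits set to -inf (assumes at least one valid action)."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, mask)]
    peak = max(masked)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(probs: List[float]) -> str:
    """Evaluation-time selection: the highest-probability action."""
    return ACTIONS[max(range(len(probs)), key=probs.__getitem__)]
```

Because masked logits contribute exactly zero to the normalizer, sampling during training and greedy selection during evaluation can both operate on the returned distribution without further checks.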
Critic Head
The critic estimates how good the current state is by predicting the value function:
It takes the LSTM hidden state
as input and maps it through a small neural network:
The output is a single value representing the expected future reward from the current state. This value is used as a baseline during training, which helps reduce variance in the updates, and makes learning more stable [
30].
Shared Representation
The LSTM backbone is shared between the actor and critic. This keeps the model smaller and allows both parts of the network to learn from the same representation of user behavior. Since both actor and critic rely on similar information (i.e., whether the user is a bot or human), sharing the backbone works well in practice and is commonly used in PPO-based methods [
41]. This entire LSTM actor–critic architecture is visualized in
Figure 6. Furthermore, the parameter count for this architecture is summarized in
Table 7. The architecture is deliberately compact (∼130 K parameters) to avoid overfitting on limited training data while retaining enough capacity for the sequential classification task at hand.
3.4.4. Training Algorithms
We train three algorithm variants, all sharing the same LSTM architecture and environment. This controlled comparison, illustrated in
Figure 7, isolates the effect of the policy optimization method from the architectural and environmental factors.
PPO (Proximal Policy Optimization)
PPO [
35] is our primary training algorithm, chosen for its empirical stability and strong performance across a wide range of RL tasks. PPO belongs to the family of policy gradient methods [
35] but addresses the instability of vanilla policy gradients through a clipped surrogate objective that constrains how far the policy can change in a single update.
Each training iteration collects a 4096-step on-policy rollout by running the current policy in the environment. Each interaction step (transition) stores the observation, action, reward, termination flag, action log-probability, value estimate, and action mask. These stored quantities are required for computing policy gradient updates, including the likelihood ratio and advantage estimates needed for PPO updates. Because the model incorporates an LSTM, the hidden state is reset at the beginning of each episode to prevent information leakage across independent trajectories. The initial hidden state for each sequence is also stored to ensure correct reconstruction of temporal dependencies during training updates. Advantages are computed using generalized advantage estimation (GAE) [
32] and normalized across the rollout buffer to improve training stability.
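For readers unfamiliar with GAE, the advantage computation and the buffer-level normalization can be sketched as below. The discount and trace parameters (`gamma`, `lam`) are placeholder values chosen for the example rather than the settings reported in Table 8, and the function names are hypothetical.

```python
from typing import List


def compute_gae(rewards: List[float], values: List[float], dones: List[bool],
                last_value: float, gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation over one rollout buffer.

    `last_value` bootstraps the return when the rollout ends mid-episode;
    it is ignored if the final transition is terminal.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0      # do not propagate across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


def normalize(xs: List[float], eps: float = 1e-8) -> List[float]:
    """Zero-mean, unit-variance normalization across the buffer."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / (std + eps) for x in xs]
```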
PPO then updates the policy using a clipped surrogate objective, which constrains large policy changes between updates [
35]. Value learning uses a clipped value loss, and an entropy bonus is included to encourage exploration.
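A minimal sketch of the clipped surrogate objective is given below, covering the policy term only (the clipped value loss and entropy bonus are omitted). The clipping range `clip_eps = 0.2` is a commonly used default and is an assumption here, not the value from Table 8.

```python
import math
from typing import List


def clipped_surrogate_loss(new_logp: List[float], old_logp: List[float],
                           advantages: List[float], clip_eps: float = 0.2) -> float:
    """PPO policy loss: negative mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # likelihood ratio pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

When the ratio moves outside the clipping range in the direction that would increase the objective, the clipped term is selected and the gradient for that sample vanishes, which is what bounds the size of each policy update.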
Updates are performed over sequential episode segments to preserve LSTM state consistency. Segments are shuffled across epochs, but transitions within a segment retain temporal order. Gradients are clipped to a maximum norm of 0.5 to prevent instability during LSTM training. The hyperparameters for PPO are highlighted in
Table 8.
DG (Delightful Policy Gradient)
DG [
43] is a recent alternative to standard policy gradients that addresses noisy updates and redirects the gradient signal across contexts. Specifically, DG not only reduces variance within a context but also shifts the expected gradient direction across contexts toward a cross-entropy-like objective. Instead of weighting updates only by the advantage estimate, DG gates each update using both the advantage and the surprisal of the selected action. As a result, actions that are both beneficial and unlikely under the current policy receive stronger learning signals.
This mechanism emphasizes rare, informative successes while suppressing noisy or uninformative updates. In our CAPTCHA defense setting, this is particularly useful because rare but critical cases, such as complex bots, can have a disproportionate impact on system performance. DG uses only current policy probabilities and does not rely on importance sampling or PPO-style clipping. The DG hyperparameters used in this study are summarized in
Table 9. All other hyperparameters (learning rate, rollout steps, etc.) match those used for PPO (
Table 8).
PPO with Adaptive Entropy (Soft PPO)
Soft PPO extends PPO [35] by learning an entropy coefficient that automatically balances exploration and exploitation, inspired by Soft Actor–Critic (SAC) [44]. Rather than using a fixed entropy bonus, Soft PPO adjusts the strength of entropy regularization based on how the policy entropy compares to a target entropy.
When the policy becomes too deterministic, the learned coefficient increases the influence of the entropy term to encourage additional exploration. When the policy is overly stochastic, the coefficient reduces this influence so that the agent can make more confident decisions. This allows the agent to balance exploration and decision-making automatically without manual tuning. In our implementation, the coefficient is optimized in log-space using Adam and constrained to a fixed range for stability. The Soft PPO hyperparameters are summarized in
Table 10.
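The adaptive entropy mechanism can be illustrated with a SAC-style coefficient update. This sketch substitutes a plain gradient step for the Adam optimizer used in our implementation, and the learning rate, clamp range, and function name are placeholder assumptions rather than the values in Table 10.

```python
import math


def update_log_alpha(log_alpha: float, policy_entropy: float, target_entropy: float,
                     lr: float = 0.01,
                     log_min: float = math.log(1e-3),
                     log_max: float = math.log(1.0)) -> float:
    """One gradient step on the SAC-style loss L(alpha) = alpha * (H(pi) - H_target).

    The coefficient is parameterized in log-space, so
    dL/d(log_alpha) = alpha * (H(pi) - H_target).
    """
    alpha = math.exp(log_alpha)
    grad = alpha * (policy_entropy - target_entropy)
    log_alpha -= lr * grad                         # plain gradient descent (assumption)
    return min(max(log_alpha, log_min), log_max)   # clamp to a fixed range
```

If the policy entropy falls below the target, the gradient is negative and the coefficient grows, strengthening the entropy bonus; if the entropy exceeds the target, the coefficient shrinks.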
3.4.5. Data Splitting and Training Protocol
All experiments use a stratified 70/15/15 train/validation/test split across 643 sessions (204 human; 439 bot), partitioned at the session level with a fixed split seed (). Human and bot sessions are split independently to maintain class proportions across all three sets. The partition is determined solely by the original sessions; augmented copies are assigned to the same split as their source session, preventing data leakage across splits. The training set contains 449 original sessions (142 human; 307 bot), with 97 sessions each reserved for validation and test.
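The session-level partition can be sketched as follows. The rounding rule (floor the training count, then halve the remainder), the seed value, and all function names are assumptions made for illustration; under these assumptions the sketch reproduces the reported 142/31/31 human and 307/66/66 bot counts.

```python
import random
from typing import Dict, List


def split_ids(ids: List[str], seed: int, train_frac: float = 0.70) -> Dict[str, List[str]]:
    """Shuffle one class with a fixed seed and cut it into train/val/test."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_val = (len(shuffled) - n_train) // 2  # remainder split evenly (assumption)
    return {"train": shuffled[:n_train],
            "val": shuffled[n_train:n_train + n_val],
            "test": shuffled[n_train + n_val:]}


def stratified_split(human_ids: List[str], bot_ids: List[str], seed: int = 0) -> Dict[str, List[str]]:
    """Split humans and bots independently so class proportions are preserved."""
    humans = split_ids(human_ids, seed)
    bots = split_ids(bot_ids, seed + 1)
    return {name: humans[name] + bots[name] for name in ("train", "val", "test")}


def split_lookup(split: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each original session id to its split; augmented copies inherit it."""
    return {sid: name for name, ids in split.items() for sid in ids}
```

An augmented copy is then routed with `split_lookup(split)[source_id]`, so a copy can never land in a different split from the session it was derived from.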
Each algorithm is trained in two configurations:
noaug (original sessions only) and
advaug (original sessions plus adversarially augmented bot sessions from the humanization pipeline described in
Section 3.3). Each configuration is trained independently with five random seeds (42, 123, 456, 789, and 1024) to account for training variability, yielding 30 trained models in total (3 algorithms × 2 configurations × 5 seeds). The advaug training set additionally includes $6 \times N_{\text{bot}} = 1842$ augmented bot sessions (2 copies × 3 difficulty levels per training bot), where $N_{\text{bot}} = 307$ denotes the number of bot sessions in the training split. Primary evaluation uses the original test set (97 sessions: 31 human; 66 bot). A separate augmented test evaluation additionally includes the pre-generated humanized bot copies in the test set to assess robustness to humanized behaviors (
Section 4.1.1). All 30 models share the same underlying data split.
Training Loop
Training follows the standard on-policy rollout collection and update cycle. In each iteration, the agent interacts with the environment for 4096 steps, collecting a rollout buffer of transitions. At each step, the environment samples a session uniformly at random from the training split (with on-the-fly augmentation applied stochastically). The LSTM hidden state is reset at the beginning of each episode. For each transition, the agent receives a 26-dimensional observation window, selects an action according to its current policy subject to the applicable action mask, and records the following information: observation, action, reward, termination flag, action log-probability, value estimate, and action mask. Episodes that terminate mid-rollout (via a terminal action or truncation) contribute their final outcome to per-rollout statistics; the remaining budget of steps continues with a new session. If the rollout ends mid-episode, the critic’s value estimate for the final observation is used to bootstrap the return.
After each rollout, the advantages are computed using GAE and normalized across the buffer. The policy is then updated over 4 epochs. Within each epoch, the buffer is split into episode-level segments, which are shuffled across epochs to reduce correlation. Transitions within each segment retain their temporal order to preserve LSTM hidden-state consistency. Each segment is processed by passing its full observation sequence through the LSTM from the recorded initial hidden state
, recomputing logits and value estimates, and applying the clipped surrogate loss. Gradients are clipped to a maximum norm of 0.5 to prevent exploding gradients in the recurrent backbone [
42]. This process repeats for
rollout iterations. Checkpoints are saved every 10 rollouts alongside a deterministic validation pass over 100 episodes (with augmentation disabled), and the final checkpoint is used for evaluation.
On-the-Fly Stochastic Augmentation
To mitigate overfitting on our limited dataset, we apply stochastic augmentation to every training episode at sampling time. Unlike the adversarial augmentation pipeline (
Section 3.3), which pre-generates static humanized copies, on-the-fly augmentation applies random perturbations each time a session is drawn, ensuring the agent never observes the same input twice across training. This is applied independently of, and in addition to, adversarial augmentation. With probability
, all three perturbation types are applied simultaneously to the episode:
Position noise: Gaussian noise is added to all mouse and click coordinates, with px for bot sessions and px for human sessions. The lighter human perturbation preserves the natural structure of genuine mouse trajectories while still providing regularization.
Timing jitter: Gaussian noise is added to event timestamps, with ms for bots and ms for humans. This prevents the agent from relying on exact inter-event timing, which can vary across hardware and network conditions.
Speed warping: All timestamps are scaled by a uniform random factor , with for bots and for humans. This simulates variation in overall interaction pace by uniformly scaling all event timestamps, representing sessions that are faster or slower overall.
Bot sessions receive stronger perturbations than human sessions by design. The asymmetry serves two purposes: (1) it broadens the distribution of bot behaviors the agent encounters during training, improving generalization to unseen bot variants, and (2) it preserves the subtler statistical structure of human sessions, which the agent must learn to recognize as legitimate. On-the-fly stochastic augmentation is disabled during validation and test evaluation to ensure metrics reflect performance on the fixed evaluation distribution. This does not affect whether pre-generated adversarially augmented bot sessions are included in the test split.
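The three perturbations and their class-dependent strength can be sketched as follows. Every numeric value in `PARAMS`, the application probability, the event record layout, and the re-sorting of events after jitter are placeholder assumptions for the example; they are not the settings used in our experiments.

```python
import random
from typing import Dict, List, Optional

# Placeholder scales only: bots receive stronger perturbations than humans.
PARAMS = {
    "bot":   {"pos_sigma_px": 4.0, "time_sigma_ms": 10.0, "speed_range": 0.20},
    "human": {"pos_sigma_px": 1.0, "time_sigma_ms": 3.0,  "speed_range": 0.05},
}


def augment_session(events: List[Dict[str, float]], label: str,
                    p_apply: float = 0.5,
                    rng: Optional[random.Random] = None) -> List[Dict[str, float]]:
    """Apply position noise, timing jitter, and speed warping together with probability p_apply."""
    rng = rng or random.Random()
    if rng.random() >= p_apply:
        return [dict(ev) for ev in events]         # unperturbed copy
    cfg = PARAMS[label]
    # Speed warping: one uniform factor scales every timestamp in the session.
    speed = rng.uniform(1.0 - cfg["speed_range"], 1.0 + cfg["speed_range"])
    out = []
    for ev in events:
        new = dict(ev)
        # Timing jitter: Gaussian noise on each (warped) timestamp, in milliseconds.
        new["t"] = ev["t"] * speed + rng.gauss(0.0, cfg["time_sigma_ms"])
        # Position noise: Gaussian noise on mouse and click coordinates, in pixels.
        if "x" in ev and "y" in ev:
            new["x"] = ev["x"] + rng.gauss(0.0, cfg["pos_sigma_px"])
            new["y"] = ev["y"] + rng.gauss(0.0, cfg["pos_sigma_px"])
        out.append(new)
    out.sort(key=lambda e: e["t"])                 # keep the timeline chronological (assumption)
    return out
```

Calling the function with `p_apply=0.0` returns an unmodified copy, which corresponds to the behavior used during validation and test evaluation.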
3.4.6. Inference and Pseudo-Online Training
At inference time, the agent processes a user session sequentially by constructing a telemetry timeline and passing observation windows through the LSTM-based policy. The LSTM hidden state is reset at the start of each session. The terminal action selected by the agent determines the intervention shown to the user (e.g., allow, block, or issue a challenge). To enable adaptation to evolving bot behaviors, we implement a pseudo-online training mechanism based on single-session PPO updates. After a session is completed and a ground-truth label is obtained via the confirmation endpoint, the full interaction trajectory is replayed through the network with the same action masking used during offline training. The resulting transitions are stored in a single-episode rollout buffer, advantages are computed using the same GAE procedure described in
Section 3.4.4, and a PPO update is performed. To ensure stability, the online learning rate is reduced to 60% of the offline rate (
), and updates run for 3 optimization epochs instead of 4. The agent’s decision on the session is logged both before and after the weight update, providing an auditable record of whether online adaptation improved, regressed, or left the decision unchanged. The updated checkpoint is saved immediately after each update. This assumes the confirmation label is reliable; noisy labels could reinforce incorrect decisions, so online updates are gated on label confidence in deployment.
3.4.7. Evaluation Protocol
To assess performance, all 30 models are evaluated on the held-out test split using deterministic (greedy) policy execution. The primary results use the revised reward structure; legacy reward evaluations are included for comparison only (
Section 3.4.4).
Each evaluation episode corresponds to a single user session processed sequentially through the LSTM using the same action masking scheme applied during training. To account for environmental stochasticity (puzzle pass rates; honeypot triggers), we run 500 episodes per agent across 5 independent random seeds and report mean ± standard deviation. Episodes are sampled with replacement from the test sessions to produce stable metric estimates under varying stochastic outcomes. Beyond the primary evaluation, we conduct seven additional evaluation regimes:
Cross-environment transfer: Tests revised-trained agents in the legacy reward environment to assess robustness to reward changes.
Augmented test evaluation: Evaluates all models on adversarially humanized bot copies to test robustness to humanized behavior.
Disjoint tier generalization: Withholds bot tiers during PPO training and then evaluates on the held-out tiers to measure zero-shot generalization.
Human disjoint generalization: Withholds one human participant during PPO training and then evaluates on that participant to test unseen-human generalization.
Reward sensitivity analysis: Sweeps key reward parameters to test robustness to reward misspecification.
Ablation study: Removes or modifies architectural and reward components to identify key performance drivers.
Strict accuracy evaluation: Counts human puzzle challenges as false positives to estimate conservative UX impact.
3.4.8. Evaluation Metrics
We report the following metrics capturing both classification performance and reinforcement learning behavior:
Accuracy, precision, recall, and F1: Standard binary classification metrics. Precision measures the fraction of bot predictions that are correct (minimizing disruption to humans), while recall measures the fraction of actual bots detected.
Per-tier and per-family detection rate: Recall stratified across the five adversarial tiers (T1: Commodity through T5: LLM-Powered) and across bot families, capturing adversarial resilience as sophistication increases.
Average reward: Mean episodic return under the revised reward function, reflecting correct classification, user friction costs, and information-gathering bonuses.
Honeypot usage: Fraction of episodes in which the agent deploys at least one honeypot, capturing preference for evidence gathering over direct action.
Challenge rate: Fraction of human sessions in which the agent issues a puzzle challenge, measuring UX friction imposed on legitimate users.
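For concreteness, the classification metrics and the challenge rate can be computed as in the sketch below, with bots treated as the positive class (label 1). The function names and the `_puzzle` suffix used to identify challenge actions are hypothetical.

```python
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with bot = 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


def challenge_rate(human_terminal_actions: List[str]) -> float:
    """Fraction of human sessions whose terminal action is a puzzle challenge."""
    challenged = sum(1 for a in human_terminal_actions if a.endswith("_puzzle"))
    return challenged / len(human_terminal_actions)
```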
3.5. Classifier
In addition to the RL agent, we developed a supervised machine learning classifier that serves as a critical benchmark. While the RL agent makes sequential decisions over sliding windows of behavioral telemetry, the classifier takes a holistic approach by analyzing an entire session’s telemetry after it concludes. It produces a single human-likelihood score in $[0, 1]$, where a score of 1 indicates very high confidence that the session belongs to a human and a score of 0 indicates very high confidence that the session belongs to a bot.
3.5.1. Feature Engineering
Raw JSON telemetry collected during a user session is condensed into a 39-dimensional feature vector organized across eight behavioral groups, summarized in
Table 11. Each group targets a distinct aspect of user interaction, chosen based on the behavioral differences between humans and bots identified in
Section 2.3.
This 39-dimensional representation aggregates each session into fixed-length statistics that summarize how the user interacted rather than what they did, making the classifier robust to differences in page layout or task content. The feature set is intentionally compact and human-interpretable so that XGBoost can train reliably without overfitting and so that feature importance analysis remains meaningful for research interpretability.
The classifier’s 39-dimensional session-level feature set differs deliberately from the 26-dimensional window-level representation used by the RL agent (
Table A1). The two models operate under different temporal constraints that dictate their feature design. The RL agent must produce an action at every observation window during a live session, so its features are restricted to quantities computable from a fixed-length window of 30 events; several session-level statistics used by the classifier—such as total session duration, global rhythm regularity, and cumulative spatial coverage—are undefined or degenerate at this scale, since a single window may contain zero keystrokes or only a narrow slice of the user’s spatial trajectory. The classifier, by contrast, operates post hoc on completed sessions and can therefore exploit the full event stream to compute stable, whole-session aggregates. Despite the dimensional difference, both feature sets draw from the same underlying telemetry channels (mouse dynamics, click patterns, keystroke timing, scroll behavior, spatial coverage, and event-type composition) and encode the same behavioral intuitions at different temporal granularities. This asymmetry is central to the complementary roles of the two models: the RL agent provides real-time sequential intervention under streaming constraints, while the classifier provides an interpretable session-level score suitable for auditing and post hoc analysis.
3.5.2. Model Architecture
We selected XGBoost (eXtreme Gradient Boosting) [
45] as the classification algorithm based on four considerations. First, the 39 input features are aggregate session-level statistics (i.e., tabular data), a domain in which gradient-boosted decision trees consistently match or outperform neural network approaches [
46]. Second, XGBoost handles small labeled datasets effectively through built-in $L_1$/$L_2$ regularization and early stopping. Third, the model provides interpretable feature importance scores via gain-based splitting, which aids in understanding which behavioral signals are most discriminative. Fourth, XGBoost achieves sub-millisecond CPU inference, requiring no GPU, which is critical for real-time deployment alongside the RL agent.
XGBoost constructs an additive ensemble of CART regression trees, optimizing a regularized binary cross-entropy objective with $L_1$, $L_2$, and minimum-split-loss penalties; we refer the reader to Chen and Guestrin [
45] for the full formulation. The model is configured with the hyperparameters listed in
Table 12. Regularization is intentionally strong (
,
,
, and
min_child_weight ) to prioritize generalization over memorization on a small dataset.
3.5.3. Training Pipeline
The classifier is trained with a regularization-heavy pipeline designed to generalize from a small labeled corpus and to resist adversarial mimicry of human behavior.
Data Split
A stratified 70/30 train/test split is applied at the session level, preserving the class distribution between human (
) and bot (
) sessions. Pre-generated augmented copies (
Section 3.3) are appended to the
train split only, never the test split, so that evaluation reflects the real-world distribution and no augmented sample leaks across the boundary.
This split differs from the 70/15/15 train/validation/test partition used for the RL agents (
Section 3.4.5). The difference is deliberate: the RL agents require a held-out validation set for checkpoint selection during training, whereas the XGBoost classifier performs model selection via Optuna’s 5-fold stratified cross-validation on the training set, making a separate validation partition redundant. Allocating the full 30% to the test split increases the number of held-out sessions available for evaluation, yielding tighter confidence intervals on the reported metrics. Despite the different partition ratios, both pipelines draw from the same underlying session pool with the same random seed, and neither allows for augmented data to leak into the test set.
Feature Standardization
A StandardScaler is fit on the training features and applied to both splits, zero-centering and unit-scaling each feature so that subsequent noise augmentation is uniform regardless of the original feature scale.
Feature-Space Adversarial Augmentation
For each bot sample, humanized copies are generated by blending the bot’s standardized feature vector toward the mean of the human samples with a per-sample random factor and then adding Gaussian noise scaled by the human-feature standard deviation (). These copies retain the bot label, forcing the classifier to look beyond surface-level differences.
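The blending step can be sketched as a convex combination of a bot's standardized feature vector and the human mean, followed by additive noise. The blend range, the noise scale, and the function name are placeholder assumptions; the label of the returned vector remains "bot".

```python
import random
from typing import List, Optional, Tuple


def humanize_features(bot_vec: List[float], human_mean: List[float], human_std: List[float],
                      blend_range: Tuple[float, float] = (0.2, 0.6),
                      noise_scale: float = 0.1,
                      rng: Optional[random.Random] = None) -> List[float]:
    """Pull a bot feature vector toward the human mean and add scaled Gaussian noise."""
    rng = rng or random.Random()
    lam = rng.uniform(*blend_range)  # per-sample blend factor (placeholder range)
    return [(1.0 - lam) * b + lam * m + rng.gauss(0.0, noise_scale * s)
            for b, m, s in zip(bot_vec, human_mean, human_std)]
```

Because the copies keep the bot label while moving toward the human region of feature space, the classifier cannot separate the classes using the blended features alone.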
Feature Noise Augmentation
To prevent overfitting to exact feature values, the training set is duplicated into
noisy versions, each perturbed by Gaussian noise:
where
controls the noise scale relative to each feature’s training-set standard deviation
.
Label Smoothing
Inspired by the regularization principle behind label smoothing [
47], we reduce the influence of every training sample by scaling its weight:
with
, so that each sample’s contribution to the loss is multiplied by
. This prevents the model from over-committing to any single training example, producing softer probability estimates without modifying the binary labels themselves.
Class Imbalance Handling
To account for potential class imbalance, XGBoost’s scale_pos_weight is set to the ratio of negative to positive samples:
$$\text{scale\_pos\_weight} = \frac{N_{\text{neg}}}{N_{\text{pos}}},$$
ensuring that the loss contribution of each class is balanced during training.
Optuna Hyperparameter Tuning
An optional automated tuning stage uses Optuna [
48] with 5-fold stratified cross-validation and ROC-AUC as the optimization objective. The search space covers XGBoost regularization parameters (
max_depth,
min_child_weight,
,
subsample,
colsample_bytree,
,
, and
) as well as the augmentation hyperparameters (
,
,
,
, and
). The resulting configuration is then used to retrain the final model on the full training split. The trained model exposes
human_score(), returning a probability in
that serves as an interpretable session-level human-likelihood score.
3.5.4. Evaluation
The classifier is evaluated on the held-out test set using a comprehensive set of metrics: accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (ROC-AUC). The predicted human-likelihood score
is compared against a default decision threshold of
:
In addition to aggregate metrics, we report the following:
Score distribution analysis: Histograms of the human-likelihood score for human and bot sessions, assessing the separation between class distributions.
Feature importance ranking: Gain-based importance from XGBoost, identifying which behavioral signals contribute most to classification decisions.
Confusion matrix: Visualizing the tradeoff between false positives (humans incorrectly blocked) and false negatives (bots incorrectly allowed), which have asymmetric costs in a CAPTCHA deployment setting, since blocking a legitimate user has greater consequences than admitting a bot.
5. Discussion
5.1. Reinforcement Learning
The contribution of the RL framework is not classification accuracy. Across all three algorithms and both augmentation conditions, classification accuracy is not the differentiating factor, since every advaug configuration converges to similar precision and F1 scores. What the policies provide that a static classifier does not is action selection: the choice of when to deploy a honeypot, which puzzle difficulty to issue, and when to allow a borderline session through without friction. These are policy decisions a deployed CAPTCHA system must make regardless of how strong its classifier is, and the framework makes them jointly with the detection decision rather than through hand-tuned post hoc rules. The honeypot deployment rates above 0.79 and the migration from direct block to hard puzzle under the revised reward both confirm that the agents actively exploit the full action space, not just the terminal verdict.
The choice of algorithm matters less than the reward shaping. Once the reward function penalizes blocking a human more heavily than missing a bot, every algorithm we tested learns the same caution-first behavior. The differences that remain between PPO, DG, and Soft PPO are in stability and seed sensitivity rather than peak accuracy. Soft PPO’s advantage is best characterized as a stability advantage: its adaptive entropy schedule resists premature commitment, producing more uniform per-family behavior and tighter cross-seed variance. The practical implication for follow-up work is that the architectural decision is not “which algorithm” but “how aggressively exploration should be constrained as training progresses.” Reporting five training seeds per configuration was essential to seeing this: single-seed runs would have masked DG’s seed-dependent instability and overstated its average behavior.
Mid-session intervention is built into the framework. The agent already exposes a non-terminal action (i.e., honeypot deployment) that fires during the session and gathers behavioral evidence without revealing detection. The honeypot is one realization of a generic mid-session intervention slot in the action space; the same architectural mechanism supports any non-blocking response (rate limiting, soft micro-challenges, silent re-authentication, and server-side request throttling) that an operator wants to attach. Swapping the honeypot for one of these alternatives is an implementation detail, not a redesign of the MDP. What is true by design is that the binary block decision is reserved for the closing window, and we view that as a property of the framework rather than a defect. A system that issues hard blocks mid-session, before observing enough behavioral evidence to be confident, is the design that produces false positives on legitimate users with unusual interaction styles. The agents in this work are explicitly trained to gather evidence first (continue; honeypot) and commit second, which is the inverse of the failure mode a premature blocker would have.
Generalization in the framework is bounded by training exposure. The held-out tier experiments show that the agents generalize well across familiar adversaries but degrade sharply when an entire bot tier is excluded from training, with the largest drop on held-out LLM-driven bots. This is a structural property of policy methods: the same caution-on-uncertainty behavior that produces perfect precision on familiar distributions defaults to allowing unfamiliar bots through. Training-set composition is therefore a deployment constraint, not a one-time choice. A system put into production would need ongoing exposure to evolving adversary distributions to maintain its detection rate. The flip side of this result is what the user-disjoint experiments show: when held-out humans are presented to an agent that has not seen them during training, the agent still passes them through at high rates. The framework’s failure mode is therefore asymmetric. It under-detects unfamiliar bots rather than over-blocking unfamiliar humans, which is the correct asymmetry for a UX-sensitive deployment.
5.2. XGBoost
Across four configurations spanning tuned and untuned hyperparameters and the presence or absence of adversarial augmentation, the XGBoost classifier reaches an ROC-AUC of on the held-out test split.
All four configurations achieve near-perfect classification at the default decision threshold (): xgb_v1_noaug and xgb_v2 each misclassify a single bot, while xgb_v1 and xgb_v2_noaug each misclassify two. Crucially, no configuration blocks a legitimate user—every error across all four models is a false negative, corresponding to a bot scoring above the threshold.
Across all four configurations, the top-ranked feature comes from the mouse dynamics group—average speed, jitter ratio, or direction-change ratio—confirming that motor behavior is the most discriminative channel for bot detection in the classifier.
The adversarial augmentation pipeline (HumanProfiler) was designed to push the classifier away from trivially separable artifacts, such as Selenium’s ms key-hold durations, toward deeper behavioral signals that generalize across bot tiers. In practice, however, its effect on the current dataset is mixed: augmentation helps the default-hyperparameter model (xgb_v2: 2 FN → 1 FN) but hurts the tuned model (xgb_v1: 1 FN → 2 FN), suggesting that Optuna’s regularization choices already provide the protection augmentation is meant to supply.
Beyond the in-distribution test split, the family-disjoint evaluation (Section 4.2.1) provides a stricter view of generalization. Seven of ten bot families reach perfect detection accuracy even when entirely withheld from training, suggesting that the classifier’s decision boundary captures general behavioral signatures rather than family-specific patterns. The exception is the LLM family, where disjoint detection drops to
without augmentation and, counterintuitively, falls further to
when augmentation is enabled.
A plausible explanation is that humanized copies of simpler bots pull the decision boundary toward the human distribution in regions that real LLM sessions also occupy, making genuinely human-like LLM behavior harder to flag at test time. This inverts the usual interpretation of augmentation as universally beneficial and implies that augmentation strategies should be re-validated whenever a new bot family emerges.
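A family-disjoint split of the kind discussed here can be sketched as follows. This is a minimal illustration, not the protocol of Section 4.2.1: the record fields (`family`, `is_bot`), the convention that human sessions carry the family name `"human"`, and the choice to keep all humans in training are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical session record: {"family": str, "is_bot": bool, ...}
Session = Dict[str, object]


def family_disjoint_split(
    sessions: List[Session], held_out_family: str
) -> Tuple[List[Session], List[Session]]:
    """Withhold every session of one bot family from training."""
    train = [s for s in sessions if s["family"] != held_out_family]
    test = [s for s in sessions if s["family"] == held_out_family]
    return train, test


def detection_rate(flags: List[bool]) -> float:
    """Fraction of held-out bot sessions that the model flagged."""
    return sum(flags) / len(flags) if flags else 0.0


# Toy sessions (hypothetical): the classifier never sees an LLM session in training.
sessions = [
    {"family": "human", "is_bot": False},
    {"family": "linear", "is_bot": True},
    {"family": "llm", "is_bot": True},
    {"family": "llm", "is_bot": True},
]
train, test = family_disjoint_split(sessions, "llm")
```

Re-running such a split with and without augmented training data for each newly observed family is one way to carry out the re-validation the paragraph calls for.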
All four configurations share the same zero-false-positive outcome on the in-distribution test split, so the observed differences there amount to at most one test sample; the family-disjoint results above provide the more informative picture of how the classifier behaves under a genuine distribution shift.
5.3. XGBoost Versus RL Agents
The two models are answering different questions. XGBoost answers, “Given the complete session, is this a bot?” and the RL agents answer, “As events arrive, what should the system do?” A deployed CAPTCHA system must answer the second question regardless of how strong its classifier is, because every threshold, every challenge-difficulty selection, and every honeypot deployment is a policy decision. The aggregate accuracy gap is therefore best read as the cost of folding the policy decision into the same optimization as the detection decision rather than as evidence that the RL approach is worse at classification. A pure classifier still requires a downstream policy layer; the framework collapses both into a single learned mechanism whose tradeoffs are encoded explicitly in the reward function rather than implicitly in hand-tuned thresholds. The evaluation protocol was matched where it could be (same advaug condition, same default hyperparameters, and same per-family breakdown for the comparison), so the gap reflects an architectural difference rather than a methodological one. Where the protocols differ (RL uses a 70/15/15 split with multi-seed averaging across a stochastic policy; XGBoost uses a 70/30 split with a single deterministic model), the difference is structural to the two method families rather than a choice we made.
XGBoost outperforms the strongest RL configuration on aggregate classification, but the gap matters less than where it appears. On the standard test split, XGBoost reaches accuracy and recall while Soft PPO+Advaug reaches accuracy and recall; both methods achieve perfect precision, so the gap is concentrated entirely on the recall side rather than on misclassified humans. Per-family, XGBoost detects of nine of the ten bot families (linear, tabber, speedrun, scripted, stealth, slow, erratic, semi-auto, and trace-conditioned), reflecting the strength of hand-engineered behavioral features at separating mechanically generated traces from human behavior. As a session-level binary classifier on this dataset, XGBoost is straightforwardly the stronger method, and the RL framework’s classification contribution should not be overstated.
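Because both methods report perfect precision, recall and F1 are tied together, which gives a quick consistency check on the RL side. The worked step below assumes that the 97.6% F1 score quoted for Soft PPO with adversarial augmentation in the Conclusions refers to this same test split. The result is a back-of-the-envelope estimate from a rounded figure, not a reported value.

```latex
\begin{align*}
F_1 &= \frac{2PR}{P + R}
  && \text{definition, with precision } P \text{ and recall } R \\
F_1 &= \frac{2R}{1 + R}
  && \text{substituting } P = 1 \\
R &= \frac{F_1}{2 - F_1}
  && \text{solving for } R \\
R &\approx \frac{0.976}{1.024} \approx 0.953
  && \text{using } F_1 \approx 0.976
\end{align*}
```

Under that assumption, a recall of roughly 95% is consistent with the statement that the gap is concentrated entirely on missed bots rather than on misclassified humans.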
The one per-family result where the RL agents take the lead is the family the comparison cares about most. XGBoost detects of LLM-driven sessions while RL PPO and Soft PPO both reach . This is a small absolute gap, but a meaningful directional one, because LLM-driven bots are explicitly designed to defeat the kind of static behavioral signatures XGBoost depends on. The RL agents observe each session as an unfolding sequence and can actively probe ambiguous sessions through tiered challenges and honeypots rather than committing to a verdict from a single feature vector. The framework outperforms specifically on the family designed to be hardest for hand-engineered features, which is evidence in this evaluation that policy-based methods provide headroom against adaptive adversaries.
The natural production architecture can combine both methods rather than forcing a choice between them. A feature-based classifier such as XGBoost supplies a robust, high-recall per-session likelihood estimate that can feed the policy as an additional observation feature, while the policy retains responsibility for action selection (challenge difficulty, honeypot deployment, and allow/block timing). Under this framing, the classification gap on the standard split is the operational cost of having a learnable policy at all, and the LLM-family result indicates that this cost buys real properties (explicit reward shaping, uniform caution-first behavior, and auditable puzzle-based detections) that a classifier-plus-threshold pipeline does not. Whether to combine the two or deploy one alone is a deployment choice, not a structural requirement of the framework.
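The coupling described above, in which the classifier's output becomes one more input to the policy, can be sketched in a few lines. The function name, the list-based observation format, and the assumption that the score lies in [0, 1] are illustrative choices and are not taken from the framework's implementation.

```python
from typing import List


def augment_observation(telemetry_features: List[float], classifier_score: float) -> List[float]:
    """Append a per-session bot-likelihood score to the policy's observation.

    The policy still chooses the action (challenge tier, honeypot, allow/block);
    the score is only an additional feature, not a verdict.
    """
    if not 0.0 <= classifier_score <= 1.0:
        raise ValueError("classifier_score is assumed to be a probability in [0, 1]")
    # Copy so the caller's feature list is not mutated.
    return list(telemetry_features) + [classifier_score]


# Hypothetical two-feature observation plus a classifier score of 0.9.
observation = augment_observation([0.2, 0.5], 0.9)
```

Keeping the score as an input rather than a gate leaves the reward function, not a hand-tuned threshold, in charge of how much weight the classifier receives.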
5.4. Limitations
There are several limitations to the proposed system that should be acknowledged.
5.4.1. Data Collection
The dataset currently used was collected from our sandbox web application with a fixed page layout and user flow, meaning the learned policies may not transfer directly to sites with different interaction patterns or form structures. As a result, deploying the proposed system in a new environment may require a large initial training dataset before it reaches adequate bot detection performance.
A further potential limitation is the volume of data the system collects. Detecting bots reliably currently requires a significant amount of telemetry. This is simple to manage for small-scale projects such as the application used here, but the data for each user session is substantial, and sampling mouse positions every 15 ms in particular places a heavy burden on the data collection system. With a large user base, the resulting telemetry volume may overload the processing systems and cause damaging slowdowns for services that rely on the detection system. Large-scale implementations may therefore need to adjust the data collection model to reach acceptable performance, for example by reducing the amount of data collected, compressing the collected data, improving the processing pipeline, or some combination of these methods. The systems described in [18,19] demonstrate data collection designed with low-cost collection in mind.
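To give a sense of scale, the sketch below estimates how many mouse samples a 15 ms interval produces and shows one simple way to thin them. Only the 15 ms interval comes from the text. The function names, the `(t_ms, x, y)` tuple layout, and the minimum-gap downsampling rule are assumptions for illustration, and whether detection quality survives such thinning would have to be re-measured.

```python
from typing import List, Tuple

# Hypothetical mouse sample: (timestamp in ms, x, y)
MouseSample = Tuple[float, float, float]


def samples_per_session(duration_s: float, interval_ms: float = 15.0) -> int:
    """Number of mouse samples produced by fixed-interval sampling."""
    return int(duration_s * 1000 // interval_ms)


def downsample(points: List[MouseSample], min_gap_ms: float) -> List[MouseSample]:
    """Keep a sample only if at least min_gap_ms has passed since the last kept one.

    Assumes points are sorted by timestamp.
    """
    kept: List[MouseSample] = []
    last_t = None
    for t, x, y in points:
        if last_t is None or t - last_t >= min_gap_ms:
            kept.append((t, x, y))
            last_t = t
    return kept


# A 60-second session sampled every 15 ms.
n_samples = samples_per_session(60.0)

# Eight toy samples at 15 ms spacing, thinned to a 45 ms minimum gap.
toy = [(i * 15.0, float(i), float(i)) for i in range(8)]
thinned = downsample(toy, 45.0)
```

A one-minute session yields thousands of mouse samples at this rate, which is why per-session thinning or compression is a natural first adjustment for large deployments.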
5.4.2. Accessibility and Fairness
The proposed system may also raise accessibility and fairness concerns. The human session data used in this study consists entirely of desktop browser interactions from a small group of university participants without impairments or assistive-technology use. The dataset therefore does not capture the behavioral diversity of mobile users, tablet users, accessibility tool users, or users with varying levels of technical familiarity, and the system may not fully capture the behavioral patterns of users who navigate websites with different devices or with assistive technologies such as screen readers, voice controls, or eye-tracking systems. Users with tremors, cognitive disabilities, motor impairments, or unusual telemetry behavior may also produce events that do not match the patterns in the training data.
This may lead to situations where legitimate users are incorrectly labeled as bots, creating access barriers for some groups. Commercial implementation of the proposed system should evaluate performance across diverse accessibility contexts, including false-positive rates for users with impairments or assistive technologies. The proposed system should not be used as the sole basis for denying access to a website or service without appropriate safeguards that account for accessibility and fairness for users.
5.4.3. Privacy and Consent
We recognize that the proposed data collection system and its passive nature raise privacy and consent considerations. Conventional keylogging systems can expose sensitive user information and carry significant privacy risks. To limit this exposure, the system applies data minimization, recording only the event types and the timing or interaction features needed for the bot detection model. Collection is limited to mouse movements, clicks, scrolls, and non-content keyboard events such as Backspace, Tab, and Delete. The system does not record typed characters, form contents, clipboard data, device information, or website data such as purchase details or account information. Although the system avoids collecting personally identifying data, the tracked data remains privacy-relevant behavioral information because it is collected passively in the background and may be retained or used for future bot detection training. Implementations of the proposed system should therefore follow best practices regarding user privacy, security, consent, retention, and deletion, in accordance with relevant legal and institutional requirements.
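The data-minimization rule described above can be illustrated with a small event filter. This is a sketch under stated assumptions rather than the system's collector: the field names (`type`, `t_ms`, `key`) are hypothetical, the allow-list contains only the three keys named in the text, and the choice to keep the timing of other key events while discarding their identity is an assumption.

```python
from typing import Dict, Optional

# Non-content keys named in the text; a real allow-list may differ.
NON_CONTENT_KEYS = frozenset({"Backspace", "Tab", "Delete"})


def minimize_key_event(raw: Dict[str, object]) -> Dict[str, Optional[object]]:
    """Reduce a raw keyboard event to the fields a behavioral model needs.

    The key identity survives only for allow-listed non-content keys, so typed
    characters never reach storage. Any other field in the raw event is dropped.
    """
    key = raw.get("key")
    return {
        "type": raw["type"],   # e.g. "keydown" or "keyup"
        "t_ms": raw["t_ms"],   # event timestamp in milliseconds
        "key": key if key in NON_CONTENT_KEYS else None,
    }


# A typed character loses its identity; a Tab press keeps it.
typed = minimize_key_event({"type": "keydown", "t_ms": 10, "key": "a", "target_value": "secret"})
tabbed = minimize_key_event({"type": "keydown", "t_ms": 25, "key": "Tab"})
```

Filtering at the point of capture, rather than after storage, means the sensitive fields are never retained and so fall outside later retention and deletion obligations.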
6. Conclusions
This paper presented an adaptive CAPTCHA defense framework that formulates bot detection as a sequential decision-making problem. Unlike traditional CAPTCHA systems that rely on fixed challenges or proprietary risk scores, the proposed framework uses reinforcement learning to observe user behavior over time, gather additional evidence through honeypot deployment, and select an appropriate terminal action, such as allowing the session, issuing a graded CAPTCHA challenge, or blocking the user. By modeling the task as a partially observable Markov decision process, the system makes decisions from streamed behavioral telemetry, including mouse movements, clicks, keystrokes, and scrolling patterns.
Our experimental results support the proposed hypotheses. Specifically, the results demonstrate that reinforcement learning can support effective and low-friction bot detection in a sandbox ticket-purchasing environment. Among the evaluated RL variants, Soft PPO with adversarial augmentation achieved the strongest overall performance, reaching 97.7% accuracy, 100% precision, and a 97.6% F1 score. These results support the first hypothesis that temporal behavioral telemetry contains sufficient sequential and interaction-level patterns for a reinforcement learning agent to distinguish between human and automated actors. Furthermore, the revised reward structure encouraged evidence-based interventions by shifting the agent away from opaque direct blocking and toward honeypot-assisted observation and graded challenge deployment. This supports the second hypothesis, which states that a multi-action adaptive response policy with progressive intervention mechanisms, including honeypots and graded CAPTCHA escalation, can preserve strong bot detection performance while reducing unnecessary friction for legitimate users.
In addition to the RL agent, we evaluated an XGBoost classifier as a supervised session-level benchmark. The classifier achieved near-perfect performance on the held-out test split, demonstrating that behavioral telemetry contains strong signals for distinguishing human users from automated agents. However, the RL framework provides a complementary advantage by learning not only whether a session is likely human or automated but also what intervention should be taken during the session. This distinction is especially important for adaptive defense, where the system must respond to uncertain or evolving adversarial behavior rather than simply output a binary classification.
Overall, this work shows that reinforcement learning is a promising direction for designing adaptive, transparent, and behavior-based CAPTCHA defenses. The proposed framework is not intended to replace mature commercial CAPTCHA systems in their current form but rather to provide an open and extensible research environment for studying sequential bot-mitigation policies. Future work should evaluate the system across larger and more diverse user populations, stronger adversarial bots, real-world deployment settings, and privacy-preserving telemetry collection methods.