1. Introduction
With the proliferation of intelligent transportation systems and the rapid deployment of autonomous driving, cybersecurity has become a foundational pillar of traffic safety and road system resilience. Modern autonomous driving platforms integrate complex cyber-physical architectures that include real-time operating systems, networked electronic control units (ECUs), and vehicular communication modules (e.g., V2X), as well as diverse sensor suites such as LiDAR, GPS, and cameras. These components collectively expose a broad and dynamic attack surface. In this context, penetration testing [1,2] serves as a proactive mechanism to assess the security posture of such intelligent systems. However, traditional penetration testing [3] remains heavily reliant on human expertise, rendering it impractical for large-scale, continuous, and adaptive evaluation in safety-critical domains like autonomous driving. This limitation has intensified the demand for automated, intelligent penetration testing solutions [4] capable of operating across complex, multi-stage attack chains and identifying novel threats, including zero-day vulnerabilities [5,6].
Recent efforts have explored diverse automation strategies to advance penetration testing beyond traditional, manual paradigms. Tudosi et al. [7] performed comprehensive assessments on distributed firewall deployments and revealed that even widely adopted solutions such as pfSense exhibit exploitable security flaws under targeted probing. Their findings highlight the critical need for combining automated scanning with human-in-the-loop analysis to uncover subtle, system-wide vulnerabilities. In parallel, Chowdhary et al. [8] introduced a GAN-based framework that autonomously generates adversarial web attack payloads capable of bypassing modern Web Application Firewalls (WAFs). Their approach demonstrates the efficacy of generative models in crafting evasive, realistic attack vectors—particularly in application-layer contexts such as cross-site scripting (XSS) and SQL injection. In a related but orthogonal direction, Berenguer et al. [9] investigated the use of large language models (LLMs) to extract and transform raw sensor data from unstructured formats (e.g., HTML) into structured representations (e.g., JSON, XML). While primarily targeting data interoperability, their work showcases the broader potential of LLMs for robust parsing, semantic interpretation, and transformation of machine-generated outputs—capabilities that are increasingly relevant for automated vulnerability analysis and exploit generation.
These developments collectively signal a broader trend: leveraging AI-driven models, particularly LLMs and generative frameworks, to replace or augment traditional manual security workflows. With strong capabilities in language understanding, reasoning, code generation, and tool interaction [10], LLMs provide a powerful alternative to conventional AI planning or reinforcement learning (RL) agents, which often struggle with generalization and scalability. Early explorations in this direction include the work of Happe et al. [11], which constructed a closed-loop system using GPT-3.5 to autonomously interact with vulnerable virtual machines. Despite demonstrating feasibility, it was limited to CTF-style tasks and lacked modularity or learning mechanisms. PentestGPT [12] is one of the earliest attempts to apply LLMs to real-world penetration testing. It leverages an Auto-GPT [13] architecture to plan and guide attack steps but requires human intervention for command execution, limiting its autonomy. Moreover, these early systems treat each attack task in isolation and lack a mechanism to accumulate and transfer knowledge across tasks. Without the ability to build up reusable experience, their generalization and long-term planning capabilities remain limited.
Building on these early explorations, recent research has shifted toward LLM-based multi-agent systems (MAS) to better emulate the division of labor observed in real-world penetration testing teams. This design enables role specialization, improves task modularity, and facilitates coordinated execution across complex attack phases. For instance, AutoAttacker [14] proposes a modular MAS architecture in which LLM agents collaborate on post-exploitation tasks, supported by a retrieval-augmented generation (RAG)-based experience manager for limited knowledge reuse. AutoPT [15] introduces a Penetration testing State Machine (PSM), using finite-state control to guide LLM agents through web penetration stages, helping reduce planning instability in long-horizon tasks. Similarly, VulnBot [16] constructs a full-stack MAS framework where LLM agents handle reconnaissance, vulnerability analysis, and exploitation, coordinated through a shared Penetration Task Graph (PTG). Extending these MAS paradigms to the domain of intelligent transportation, Gao et al. [17] explored the use of LLMs in simulating traffic system behaviors and generating cybersecurity strategies in autonomous driving scenarios. Their work highlights the feasibility of leveraging LLMs not only for modeling vehicular communication and sensor interactions, but also for identifying attack vectors and evaluating potential impacts on traffic safety. While these systems demonstrate notable improvements in autonomy and success rates compared to single-agent baselines, they still suffer from a fundamental limitation: the lack of structured, progressive skill acquisition. In all cases, agents are exposed directly to high-difficulty, multi-stage common vulnerabilities and exposures (CVE) exploitation tasks without prior scaffolding or experiential buildup, leading to low success rates (e.g., VulnBot reports only 20% on complex vulnerabilities), weak generalization, and fragile reasoning when handling logical flaws or chained exploits. Critically, none of these methods reflect the human-like process of accumulating expertise over time through a curriculum of increasing difficulty. This absence of structured experiential learning has emerged as a key bottleneck limiting the scalability and adaptability of current LLM-based pentesting systems.
These limitations are particularly problematic in safety-critical domains like autonomous driving, where vulnerabilities may cascade across multiple components—from perception to control—and compromise not only system integrity but also physical safety. To bridge this critical gap, we propose CurriculumPT, an automated penetration testing framework designed to enable agents to learn systematically. Our approach uniquely integrates three core components: curriculum learning (CL), a multi-agent system (MAS), and an experience knowledge base (EKB). To solve the problem of unstructured skill acquisition, the CL module organizes CVEs into a curriculum of increasing difficulty, allowing agents to build expertise incrementally without costly fine-tuning, thereby mimicking human learning patterns. This curriculum guides our MAS module, where specialized LLM agents (e.g., planner, recon, exploiter) collaborate to execute tasks. Their strategies and focus adapt based on the curriculum’s difficulty stage. Finally, to ensure knowledge retention and transfer, the EKB module serves as a central memory. It captures and organizes successful strategies, toolchains, and decision rationales from completed tasks. This dynamic knowledge base is then retrieved by agents to tackle more complex challenges, fostering generalization. By tightly coupling these components, CurriculumPT creates a virtuous cycle of learning and application, enabling agents to autonomously accumulate experience and achieve superior performance on real-world penetration tests.
We systematically evaluate CurriculumPT against three state-of-the-art LLM-based penetration testing systems (AutoPT [15], VulnBot [16], and PentestAgent [18]) on 15 real-world CVE exploitation scenarios. The results demonstrate the superiority of our learning-centric approach. CurriculumPT achieves the highest average exploit success rate (ESR), outperforming the strongest baseline by 18 percentage points. This success is complemented by superior efficiency: CurriculumPT reduces both average task execution time (by 20.6%) and token usage (by 25.5%), and requires the fewest decision-making steps. These efficiency gains underscore that our framework learns more direct and effective exploitation strategies, rather than relying on brute-force attempts. Furthermore, ablation studies confirm that the curriculum is essential for effective skill acquisition, while the knowledge base is critical for adaptability and knowledge reuse. Together, these findings validate that CurriculumPT is a more effective, efficient, and scalable solution for complex, multi-stage autonomous penetration testing. The main contributions of this work can be summarized as follows:
We pioneer the integration of curriculum learning into multi-agent LLM-based penetration testing. This novel approach systematically addresses the critical challenge of skill acquisition and generalization, enabling agents to progressively master complex tasks without human intervention or model fine-tuning.
We design and implement a synergistic framework, CurriculumPT, that tightly couples a curriculum scheduler, specialized agents, and a dynamic experience knowledge base. This architecture creates a closed-loop system for continuous, autonomous learning and knowledge transfer in complex environments.
We conduct extensive experiments on a benchmark of real-world CVE tasks, demonstrating that our method achieves a 15–36% improvement in exploit success rate (ESR) over representative baselines, along with significantly lower execution time and token usage, confirming its superior efficiency and effectiveness.
The rest of this paper is organized as follows. Section 2 reviews related work on LLM-based penetration testing and curriculum learning. Section 3 presents the overall architecture of the proposed CurriculumPT framework, including the curriculum construction strategy, task scheduling mechanism, multi-agent collaboration design, and the experience knowledge base. Section 4 describes the experimental setup, evaluation metrics, and benchmark scenarios, and reports results from comparative and ablation studies. Section 5 discusses the core contributions of our approach, challenges for real-world deployment, and its limitations. Finally, Section 6 concludes the work and outlines potential directions for future research.
3. Methodology
In this section, we first present an overview of the CurriculumPT architecture (Section 3.1). We then detail its core learning mechanisms, including curriculum design and experience management (Section 3.2). Next, we describe the curriculum-guided multi-agent system that acts as the execution engine (Section 3.3). Finally, we integrate these components to illustrate the system's end-to-end workflow (Section 3.4).
3.1. Overview
The overall architecture of CurriculumPT is shown in Figure 1. At its core, the framework is designed to simulate the cognitive process of human experts, who master new skills by progressing from simple cases to complex problems. This paradigm is implemented through three tightly integrated components: a curriculum scheduler, a multi-agent system, and an experience knowledge base (EKB). The curriculum scheduler organizes CVE-based tasks into a sequence of gradually increasing difficulty. The multi-agent system, composed of specialized LLM agents, then collaborates to execute these tasks. Throughout execution, actionable insights, including successful strategies, effective toolchains, and decision rationales, are continuously extracted and stored in the EKB. As the system advances through the curriculum, it retrieves and reuses this accumulated knowledge to address increasingly complex penetration scenarios. Together, these components form a closed-loop learning cycle that enables agents to continually refine their capabilities without model fine-tuning.
3.2. Curriculum-Guided Learning and Experience Accumulation
At the core of CurriculumPT lies a synergistic mechanism that combines a structured curriculum with an experience-driven reasoning loop. This section outlines the key components of the learning process: the design of the curriculum, the representation and application of experience, and the adaptive control of learning progression.
3.2.1. Task Difficulty-Based Curriculum Design
The construction of the curriculum begins with the classification and sequencing of penetration testing tasks (primarily based on CVE instances from Vulhub) according to their difficulty.
Difficulty Metrics. We comprehensively assessed task difficulty based on the following factors: (1) AC—Attack Complexity: Derived from the Common Vulnerability Scoring System (CVSS) [37], this metric reflects the stringency of conditions required to exploit the vulnerability. A high value indicates that successful exploitation depends on specific environmental requirements or precise timing, thus increasing difficulty. (2) UI—User Interaction: Also based on CVSS, this measures whether successful exploitation requires victim user participation (e.g., clicking a malicious link or opening a file). Vulnerabilities marked as Required limit automation and increase difficulty. (3) PR—Privileges Required: Indicates the level of access an attacker must possess prior to exploiting the vulnerability. Higher privilege requirements (e.g., Low or High) imply additional prerequisite actions, such as local access or privilege escalation, making exploitation more complex. (4) ES—Exploitation Steps: Represents the number of distinct procedural steps needed to complete exploitation. Multi-stage exploits involving environment setup, payload crafting, chained execution, or post-exploitation steps are considered more difficult. ES is estimated using LLM-assisted analysis of CVE reproduction guides and public proof-of-concept (PoC) resources, such as scripts, technical blogs, or repositories that demonstrate real-world exploitation of reported vulnerabilities [38,39].
Curriculum Levels. To enable structured skill progression, all CVE-based penetration tasks are automatically categorized into three curriculum levels: Simple, Medium, and Complex. Task assignment is based on a unified difficulty scoring metric derived from four normalized and quantifiable indicators: AC, UI, PR, and ES. Each indicator is scaled to the range $[0, 1]$, and the overall difficulty score $D$ is computed as a weighted linear combination:

$$D = w_{AC} \cdot AC + w_{UI} \cdot UI + w_{PR} \cdot PR + w_{ES} \cdot ES,$$

where $w_{AC}$, $w_{UI}$, $w_{PR}$, and $w_{ES}$ denote the weights selected to reflect the contribution of each factor to the difficulty of exploitation. These values were determined based on a qualitative analysis of sample CVEs, prioritizing AC and ES as they most directly influence the reasoning and action sequence required of the agent.
Based on the computed difficulty score, each vulnerability task is automatically assigned to one of three curriculum levels, reflecting different levels of exploitation complexity and reasoning demands. As shown in Table 1, tasks at the Simple level typically involve low-complexity, single-step exploits using publicly available PoC scripts, with no need for user interaction or elevated privileges. Medium tasks require moderate effort, such as adapting PoCs or performing low-privilege, multi-step exploits. Complex tasks are the most challenging, often involving multi-stage exploit chains, advanced reasoning, and privilege escalation.
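To make the scoring concrete, the following minimal sketch computes $D$ and maps it to a curriculum level. The weight values and the 0.33/0.66 cut-offs are illustrative assumptions rather than the framework's actual parameters; only the prioritization of AC and ES follows the text.

```python
# Illustrative sketch of the difficulty scoring in Section 3.2.1.
# The weights and level thresholds below are hypothetical placeholders;
# the paper derives its values from a qualitative analysis of sample CVEs.

def difficulty_score(ac: float, ui: float, pr: float, es: float,
                     w_ac: float = 0.35, w_ui: float = 0.10,
                     w_pr: float = 0.15, w_es: float = 0.40) -> float:
    """Weighted linear combination of four indicators normalized to [0, 1]."""
    assert all(0.0 <= x <= 1.0 for x in (ac, ui, pr, es))
    return w_ac * ac + w_ui * ui + w_pr * pr + w_es * es

def curriculum_level(d: float) -> str:
    """Map a difficulty score to a curriculum level.
    The 0.33/0.66 cut-offs are assumptions for illustration only."""
    if d < 0.33:
        return "Simple"
    if d < 0.66:
        return "Medium"
    return "Complex"

# Example: a single-step public-PoC exploit with no interaction or privileges.
print(curriculum_level(difficulty_score(ac=0.2, ui=0.0, pr=0.0, es=0.1)))  # Simple
```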
3.2.2. Experience Representation
In this framework, "experience" refers to the structured recording of task execution processes, including successful strategies, useful intermediate results, and insights derived from both successes and failures. The Report Agent is responsible for generating and storing this experience. After each task attempt, regardless of success or failure, this agent analyzes the entire interaction process to extract valuable knowledge. It then organizes this information into several predefined, structured formats before storing the results as new entries in the EKB. This captured knowledge includes comprehensive structured exploitation cases, saved as JSON objects detailing everything from the CVE ID to the exact command sequences used; practical problem–solution pairs that document specific errors and their resolutions; an atomic operation skills library containing efficient, context-specific commands for various services or protocols; and a high-quality prompt template library with validated prompts for recurring subtasks. By treating failed attempts and their causal analyses as equally valuable, this process ensures that the EKB becomes a robust repository of actionable experiential knowledge.
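As an illustration of what such an entry might look like, the sketch below shows a hypothetical structured exploitation case. All field names and values are invented for exposition; the text specifies only that cases are stored as JSON objects spanning the CVE ID through the exact command sequences.

```python
# Hypothetical shape of one structured exploitation-case entry in the EKB.
# The schema is illustrative, not the framework's actual format.
experience_entry = {
    "cve_id": "CVE-202X-YYYY",                 # placeholder identifier
    "vulnerability_type": "remote code execution",
    "target_service": "example-web-app 1.2.3",
    "outcome": "success",                      # failed attempts are stored too
    "command_sequence": [
        {"step": 1, "tool": "nmap", "command": "nmap -sV <target>"},
        {"step": 2, "tool": "curl", "command": "curl -X POST <payload_url>"},
    ],
    "problem_solutions": [
        {"error": "connection reset by target",
         "resolution": "lower request rate and retry with keep-alive"},
    ],
    "rationale": "service banner matched a known vulnerable version",
}
```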
3.2.3. Experience-Driven Reasoning
The critical process of experience transfer is initiated when the Planner Agent receives a new task. To assist with the current task, the system first retrieves relevant historical experiences from the EKB using a mechanism based on semantic similarity. It encodes the current task's description, including CVE characteristics, target service information, and expected vulnerability type, into a query vector, which is then compared against the vector representations of stored experience entries to return those with the highest similarity scores. The retrieved knowledge is then applied in multiple ways. The primary method is for the Planner Agent to leverage this experience for enhanced plan generation. It synthesizes the retrieved insights to construct more effective strategies, select optimal tools, and generate precise commands. These detailed plans and commands are then passed to the appropriate agents, such as the Exploitation Agent, for execution. For example, an agent may be guided by a prompt such as "You are trying to exploit [CVE-202X-YYYY]. According to the experience base, a similar vulnerability was successfully exploited using the following steps: [Step A, Command A]; [Step B, Command B]; …". Furthermore, the retrieved experience is used to guide tool selection and parameter optimization based on successful past configurations and to assist the Planner Agent in high-level strategy and task decomposition by referencing how similar complex cases were previously resolved.
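A minimal sketch of this retrieval step is given below, assuming a generic sentence-embedding function `embed` and precomputed entry embeddings; the specific encoder and vector index used by CurriculumPT are not detailed at this level.

```python
# Minimal sketch of semantic-similarity retrieval over EKB entries.
# `embed` is an assumed callable mapping text to a NumPy vector, and each
# entry is assumed to carry a precomputed "embedding" field.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(task_description: str, ekb_entries: list[dict],
             embed, top_k: int = 3) -> list[dict]:
    """Return the top-k EKB entries most similar to the current task."""
    query = embed(task_description)
    scored = [(cosine(query, e["embedding"]), e) for e in ekb_entries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]
```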
3.2.4. Adaptive Curriculum Pacing
The effective progression and adaptation of the system through the curriculum are essential to the success of our proposed CurriculumPT framework. To govern this process, we introduce a mechanism that controls both the advancement schedule and dynamic adjustments based on observed performance. The pacing function is managed by the Commander Agent, which monitors the system’s overall performance on tasks at the current difficulty level using metrics like average success rate and number of attempts. When the system reaches a predefined proficiency threshold, such as an 80% exploit success rate on level N tasks, it begins to introduce tasks from the next difficulty level, N + 1. In addition, an adaptive mechanism addresses performance bottlenecks by injecting auxiliary tasks with slightly lower difficulty or modifying experience retrieval strategies to emphasize failed cases.
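The pacing rule can be summarized as a simple decision function, sketched below. The 80% proficiency threshold comes from the text, while the bottleneck floor that triggers auxiliary task injection is an assumed value for illustration.

```python
# Sketch of the threshold-based pacing rule run by the Commander Agent.
# The 0.80 proficiency threshold follows the text; the 0.40 bottleneck
# floor is a hypothetical heuristic added for illustration.
def pacing_decision(esr_current_level: float,
                    proficiency_threshold: float = 0.80,
                    bottleneck_floor: float = 0.40) -> str:
    if esr_current_level >= proficiency_threshold:
        return "introduce_next_level_tasks"      # advance toward level N + 1
    if esr_current_level < bottleneck_floor:
        return "inject_easier_auxiliary_tasks"   # relieve the bottleneck
    return "continue_current_level"
```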
3.3. Curriculum-Guided Multi-Agent System
To support curriculum-guided penetration testing and structured experience reuse, we design a role-specialized multi-agent system (MAS) tailored for staged learning, performance monitoring, and adaptive decision making. In contrast to conventional MAS architectures, our design features tightly coupled task decomposition, curriculum progression control, and failure-aware replanning, collectively facilitating the acquisition of generalized exploitation skills. While this fine-grained specialization introduces more agents than some frameworks, it is a deliberate design choice. By isolating critical functions, such as learning control (Commander), failure analysis (Replan), and knowledge curation (Report), we prevent cognitive overload on a single planning agent and enable more robust, modular control over the entire learning loop. At the core of each agent lies a general-purpose LLM (e.g., GPT-4o [40]), which performs reasoning, plan generation, tool instruction translation, and output interpretation. The behavior of each agent is governed via carefully designed prompt templates under a unified coordination paradigm.
To enable these LLM agents, particularly the Reconnaissance Agent and the Exploitation Agent, to interact with actual penetration testing tools, we implemented a tool integration layer. A core feature of this layer is its encapsulation mechanism based on the Model Context Protocol (MCP) [41]. This mechanism abstracts a suite of commonly used penetration testing tools, such as Nmap [42], Metasploit Framework [43], SQLMap [44], and Nikto [45], into standardized function calls. Such encapsulation not only provides a unified API for the designated agents but also simplifies the interaction logic between the LLM agents and diverse tools. The tool integration layer is responsible for accurately translating high-level instructions issued by the agents (under the guidance of their LLM core) into specific tool command-line instructions, remotely invoking and securely executing these commands via the MCP architecture, and, finally, parsing the raw output returned by the tools into structured information that is easy for the LLM to understand and process further.
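Conceptually, each encapsulated tool reduces to a standardized function call that executes a command and returns structured output. The sketch below illustrates this idea with a hypothetical Nmap wrapper; the actual MCP-based encapsulation and its API are not specified at this level of detail.

```python
# Conceptual sketch of the tool integration layer: each tool is wrapped as a
# standardized function call. Wrapper names and signatures are assumptions.
import subprocess

def run_tool(command: list[str], timeout: int = 300) -> dict:
    """Execute a tool command and return structured output for the LLM."""
    proc = subprocess.run(command, capture_output=True, text=True,
                          timeout=timeout)
    return {"command": " ".join(command),
            "exit_code": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr}

def nmap_service_scan(target: str) -> dict:
    """Standardized wrapper a Reconnaissance Agent might invoke."""
    return run_tool(["nmap", "-sV", target])
```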
Building upon this foundation, each agent within the MAS is assigned specific roles and responsibilities:
Commander Agent. Oversees task scheduling and curriculum pacing. It receives either a curriculum-level CVE task from the curriculum learning module (Phase 1) or a user-defined objective (Phase 2). Based on performance metrics (e.g., ESR), it dynamically adjusts curriculum progression or triggers adaptive auxiliary task injection.
Planner Agent. Responsible for overall task planning. It receives metadata and reconnaissance results, queries the EKB, and synthesizes multi-step exploitation strategies. It decomposes complex procedures and coordinates execution with other agents.
Reconnaissance Agent. Conducts information gathering on target hosts. Using tools like Nmap [42] and Nikto [45], it performs service enumeration, port scanning, and directory probing, and returns structured system profiles to the Planner Agent.
Exploitation Agent. Executes planned steps using toolkits (e.g., Metasploit [43], SQLMap [44]). For each step, it performs command execution and sends structured feedback (including success/failure indicators) to the Analysis Agent.
Replan Agent. Triggered on failure, this agent analyzes contextual errors, tool output, and failed plans. It queries the EKB for related fix patterns and generates revised candidate steps to continue execution without restarting the full task.
Analysis Agent. Evaluates execution outcomes. It determines whether subgoals were achieved and computes performance metrics (e.g., step latency, success ratio, number of replans). Results are returned both to the Commander Agent for learning control and to the Report Agent for final documentation.
Report Agent. Aggregates and abstracts complete execution traces, including success paths, critical parameters, encountered failures, and effective strategies. It updates the EKB with distilled experience entries in structured formats for future reuse.
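The following sketch illustrates how role-specific prompt templates can govern agents that share a single LLM core, as described above. The template wording and the `llm.generate` interface are assumptions made for exposition, not the framework's actual prompts.

```python
# Illustrative sketch of prompt-template-governed agent roles. The paper
# states only that each agent's behavior is governed by carefully designed
# prompt templates around a shared general-purpose LLM.
ROLE_PROMPTS = {
    "planner": ("You are the Planner Agent. Given the task metadata, "
                "reconnaissance results, and retrieved EKB experience, "
                "produce a numbered multi-step exploitation plan."),
    "replan": ("You are the Replan Agent. Given a failed step, its tool "
               "output, and related fix patterns from the EKB, propose "
               "revised candidate steps."),
}

def agent_call(llm, role: str, context: str) -> str:
    """All agents share one LLM core; only the governing prompt differs.
    `llm.generate` is an assumed interface for illustration."""
    return llm.generate(system=ROLE_PROMPTS[role], user=context)
```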
3.4. Workflow
The workflow of the CurriculumPT framework is designed to enable progressive learning and effective generalization in automated penetration testing. It consists of two tightly coupled phases: (1) Curriculum-Guided Learning and Experience Accumulation, and (2) Experience-Driven Penetration Testing. These phases are not independent; instead, they form a closed learning cycle in which knowledge acquired during Phase 1 directly supports decision making in Phase 2, while execution results from Phase 2 further refine the knowledge base. This two-phase design enables CurriculumPT to not only learn incrementally but also adapt and apply its knowledge to novel, real-world scenarios, ensuring both robustness and continual improvement.
Phase 1: Curriculum-Guided Learning and Experience Accumulation. The goal of this phase is to systematically bootstrap the MAS and build foundational penetration capabilities through staged, difficulty-aware learning, as detailed in Algorithm 1. The process begins with the Commander Agent, which coordinates the entire workflow by selecting a candidate task from the current curriculum level. The task is then forwarded to the Planner Agent, which queries the EKB for relevant prior knowledge, and, if necessary, invokes the Reconnaissance Agent to gather target information before generating a detailed exploitation plan. The plan is subsequently executed by the Exploitation Agent. If any step fails, a ReAct-style loop is triggered: the Replan Agent analyzes the failure context, queries the EKB for similar error resolutions, and generates revised steps to continue the attempt. This execution–replanning cycle continues until the task is successfully completed or a maximum number of attempts is reached. Upon completion, the Analysis Agent evaluates the outcome and extracts key performance metrics, which are sent to the Commander Agent to track progress. Concurrently, the Report Agent summarizes the entire execution trace into a structured experience entry and stores it in the EKB. Finally, the Commander Agent uses the aggregated performance metrics to check if the learning objectives at the current difficulty level have been achieved, and accordingly decides whether to advance to the next level or continue training.
Algorithm 1 Curriculum-Guided Learning and Experience Accumulation

1:  Input: curriculum levels L; task pool T with CVE metadata; maximum replanning iterations R_max; completion thresholds θ
2:  Output: updated experience knowledge base (EKB) E
3:  Initialize EKB E
4:  Initialize progress tracker P
5:  for each level l in L do
6:      while not Completed(P, l, θ) do
7:          task ← CommanderAgent.SelectTask(T, l)
8:          exp ← QueryEKB(E, task)
9:          plan ← PlannerAgent.Plan(task, exp)
10:         if PlannerAgent.RequiresRecon(plan) then
11:             info ← ReconnaissanceAgent(task)
12:             plan ← PlannerAgent.ReplanWithInfo(plan, info)
13:         end if
14:         r ← 0, success ← false
15:         while not success and r < R_max do
16:             result ← ExploitationAgent.Execute(plan)
17:             if IsSuccess(result) then
18:                 success ← true
19:             else
20:                 plan ← ReplanAgent(plan, result, E)
21:                 r ← r + 1
22:             end if
23:         end while
24:         trace ← LogExecution(task, plan, result)
25:         metrics ← AnalysisAgent.Evaluate(trace)
26:         CommanderAgent.UpdateProgress(P, metrics)
27:         entry ← ReportAgent.Summarize(trace)
28:         E ← E ∪ {entry}
29:     end while
30: end for
31: return E
Phase 2: Experience-Driven Penetration Testing. The second phase assesses the framework's generalization capability by applying the knowledge accumulated in Phase 1 to solve real-world security challenges. Transitioning from curriculum-driven learning to practical application, this phase is initiated by a user-defined task, which can be an unstructured, high-level objective (e.g., "try to read the sensitive file /etc/passwd"). Upon receiving this goal, the Commander Agent orchestrates the agent pipeline. A key feature of this phase is the experience-driven planning strategy employed by the Planner Agent. It proactively queries the EKB, and, upon finding a highly relevant precedent, may generate a direct exploitation plan, potentially bypassing unnecessary reconnaissance steps. The framework then leverages the same execution and replanning loop from Phase 1 to attempt the task. After the attempt, the Analysis Agent assesses the final outcome against the user's objective and generates performance metrics. Importantly, the learning loop remains active. The Report Agent processes the entire execution trace, including successful strategies and failure recovery paths, and transforms them into new structured entries. This ensures that the EKB is continuously enriched with diverse, real-world scenarios.
4. Evaluation
In this section, we validate the effectiveness of CurriculumPT by answering the following research questions:
RQ1 (Overall Performance): How effective is the CurriculumPT framework overall?
RQ2 (Comparison Analysis): What are the specific advantages of CurriculumPT over existing LLM-based penetration testing methods, particularly regarding (1) progressive task learning via curriculum guidance, (2) modular agent coordination for task decomposition, and (3) experience-driven adaptation and generalization through EKB reuse?
RQ3 (Ablation Study): What is the contribution of each core component in the CurriculumPT framework?
4.1. Experimental Setting
(1) Benchmark Dataset. All experiments in this study were conducted using the Vulhub vulnerability reproduction platform. Vulhub [46] is a widely used open-source project built on Docker and Docker Compose, offering a broad collection of real-world vulnerability environments. It provides standardized and reproducible conditions for penetration testing and security research. To construct a structured evaluation dataset, we selected a representative subset of CVEs from Vulhub. The selection criteria emphasized both the diversity of vulnerability types (e.g., remote code execution, SQL injection, file upload, command injection, directory traversal) and variation in exploitation complexity. Each CVE instance was automatically classified into one of three curriculum difficulty levels (Simple, Medium, or Complex) according to a unified difficulty scoring metric based on four normalized indicators: Attack Complexity (AC), User Interaction (UI), Privileges Required (PR), and Exploitation Steps (ES). This classification reflects both procedural and cognitive challenges in exploitation and underpins the task sequencing used in our curriculum learning strategy. To evaluate generalization, a subset of CVEs was excluded from training and reserved solely for downstream testing. While the experiments are based on general-purpose CVE environments, the selected vulnerabilities are representative of those frequently encountered in intelligent transportation systems (ITS), including traffic signal controllers, roadside communication gateways, and vehicular cloud services, which often run on web-based management platforms or embedded Linux devices. As such, the benchmark scenarios are highly relevant for assessing cybersecurity risks and resilience in ITS environments.
(2) Metrics. To conduct a comprehensive and multi-dimensional evaluation of different system configurations, this study adopts a set of key quantitative metrics.
Exploitation Success Rate (ESR). This is the primary metric for assessing the core effectiveness of the system. It reflects the ability to successfully reproduce CVEs and achieve defined objectives (e.g., obtaining shell access or reading sensitive files):

$$\mathrm{ESR} = \frac{N_{\mathrm{success}}}{N_{\mathrm{total}}} \times 100\%,$$

where $N_{\mathrm{success}}$ denotes the number of CVE exploitation tasks that successfully achieve the target objective, and $N_{\mathrm{total}}$ is the total number of exploitation attempts.
Average Steps per Task (AST). AST measures the average number of reasoning–execution iterations required to complete a successful task. Each step typically involves a planning decision, tool invocation, or replanning operation. This metric reflects both interaction depth and coordination overhead:

$$\mathrm{AST} = \frac{S_{\mathrm{total}}}{N_{\mathrm{success}}},$$

where $S_{\mathrm{total}}$ is the total number of planner–agent interactions across all successful tasks.
Average Time to Exploit (ATE). ATE quantifies the average time required to successfully exploit a vulnerability, from initiation to completion. It serves as an indicator of overall system execution efficiency:

$$\mathrm{ATE} = \frac{T_{\mathrm{total}}}{N_{\mathrm{success}}},$$

where $T_{\mathrm{total}}$ is the total time consumed across all successful exploits.
Average Token Usage (ATU). This metric measures the average number of tokens (including both prompt and completion) consumed per task, providing a fine-grained estimation of reasoning overhead and cost-efficiency:

$$\mathrm{ATU} = \frac{\mathit{Tok}_{\mathrm{total}}}{N_{\mathrm{task}}},$$

where $\mathit{Tok}_{\mathrm{total}}$ is the total number of tokens used in LLM invocations, and $N_{\mathrm{task}}$ is the number of penetration testing tasks. Lower ATU values indicate more efficient reasoning.
Experience Knowledge Base Hit Rate (EHR). EHR evaluates the effectiveness of the EKB in supporting decision making during task execution. It measures how frequently retrieved experience is successfully applied to assist in solving new tasks:

$$\mathrm{EHR} = \frac{N_{\mathrm{applied}}}{N_{\mathrm{retrieved}}} \times 100\%,$$

where $N_{\mathrm{applied}}$ denotes the number of successful applications of retrieved experience, and $N_{\mathrm{retrieved}}$ represents the total number of retrieval attempts.
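Assuming a simple per-task log record, the five metrics can be computed as sketched below; the record field names are illustrative rather than part of the framework.

```python
# Sketch computing the five evaluation metrics from logged task records.
# The record schema ("success", "steps", "time_s", "tokens", ...) is assumed.
def compute_metrics(records: list[dict]) -> dict:
    successes = [r for r in records if r["success"]]
    n_succ, n_total = len(successes), len(records)
    retrievals = sum(r["ekb_retrievals"] for r in records)
    applied = sum(r["ekb_applied"] for r in records)
    return {
        "ESR": n_succ / n_total,                                    # fraction of successful tasks
        "AST": sum(r["steps"] for r in successes) / max(n_succ, 1),  # steps per successful task
        "ATE": sum(r["time_s"] for r in successes) / max(n_succ, 1), # seconds per successful exploit
        "ATU": sum(r["tokens"] for r in records) / n_total,          # tokens per task
        "EHR": applied / max(retrievals, 1),                         # applied / retrieved
    }
```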
(3) Implementation Details. All experiments were conducted on a laptop equipped with an Intel Core i7-14650HX processor (2.20 GHz) and 32 GB RAM. The vulnerable environments were deployed using Docker containers based on official Vulhub images, running within an Ubuntu virtual machine. The attacker side was hosted on a separate Kali Linux virtual machine, which served as the orchestrator via a customized MCP interface. Both virtual machines were connected through NAT within a local network, simulating realistic internal conditions while ensuring isolation. All LLM agents in CurriculumPT were powered by GPT-4o-mini, accessed through the OpenAI API, and used for planning, reasoning, code generation, and tool interaction.
4.2. Performance Evaluation
The results in Table 2 illustrate the effectiveness and efficiency of the CurriculumPT framework across progressive curriculum stages. As task difficulty increases from Level 1 (Simple) to Level 3 (Complex), the ESR decreases from 95.3% to 60.0%, reflecting the growing challenge associated with more complex vulnerability tasks. Correspondingly, the ATE increases from 110 to 370 s, and the ATU rises from 2.3 M to 5.6 M tokens, indicating higher reasoning and interaction overhead in more complex scenarios. Notably, the EHR improves significantly with task complexity, rising from 67.9% at Level 1 to 81.7% at Level 3. This trend suggests that experiential reuse becomes increasingly valuable and effective as challenges intensify. Performance on the hold-out set, composed of mixed-difficulty CVEs not seen during training, achieves a competitive ESR of 66.7% and maintains a high EHR of 80.0%, demonstrating strong generalization of the learned knowledge.
4.3. Comparison Analysis
To highlight the specific advantages of CurriculumPT over existing LLM-based penetration testing frameworks, we performed a comparative analysis involving three representative baselines: AutoPT [15], VulnBot [16], and PentestAgent [18]. The comparison focuses on both framework characteristics and empirical performance across 15 standardized penetration testing scenarios. Each framework was implemented according to its original design specifications and executed under identical conditions, including the same underlying LLM (GPT-4o-mini) and consistent environment settings, to ensure a fair comparison.
Table 3 presents a comparative analysis between CurriculumPT and three representative LLM-based penetration testing frameworks (AutoPT, VulnBot, and PentestAgent), focusing on both architectural features and empirical performance. Among the four, CurriculumPT is the only framework that integrates all three targeted architectural features: curriculum learning, modular agent coordination, and full EKB reuse. Curriculum learning in particular enables agents to progressively acquire skills aligned with increasing task complexity. While VulnBot and PentestAgent adopt modular agent architectures, they lack curriculum guidance and full EKB reuse, which limits their adaptability and generalization. Empirically, CurriculumPT outperforms all baselines across every performance metric. It achieves the highest ESR of 60.0%, demonstrating superior task effectiveness. Additionally, it completes tasks with the shortest ATE of 390 s, the lowest ATU of 3.5 M tokens, and the fewest AST of 5.1. These results indicate that CurriculumPT's architectural innovations, particularly curriculum learning and experience reuse, contribute directly to more efficient and effective penetration testing. Overall, these results demonstrate that CurriculumPT not only introduces meaningful design innovations but also translates them into measurable advantages in real-world LLM-driven penetration testing.
4.4. Ablation Study
To evaluate the contributions of core components in the CurriculumPT framework across varying difficulty levels, we perform a comprehensive ablation study on a stratified CVE test set (N = 30), consisting of 10 Simple, 10 Medium, and 10 Complex vulnerability scenarios. All CVEs in this set are excluded from the curriculum training phase to avoid data leakage and ensure fair evaluation. All experiments are conducted under consistent system configurations using the same GPT-4o-mini model (128 k context window, temperature = 0) and identical runtime environments. Each CVE is tested once per system variant under fixed resource constraints. We examine four system variants to assess the contributions of curriculum learning and experience knowledge reuse. The Full configuration includes both CL and the EKB, representing the complete framework. In the No EKB setting, curriculum scheduling is retained, but experiential reuse is limited to the LLM’s internal context window, with no access to external memory. The No CL configuration disables curriculum-guided task progression, presenting tasks in a random order while retaining the EKB. Finally, the No CL + No EKB variant removes both staged learning and long-term experience storage, serving as a minimal baseline for comparison.
As shown in Table 4, the full configuration of CurriculumPT consistently achieves superior performance across all difficulty levels. It reaches the highest ESR of 100% on simple tasks and maintains 60.0% on complex ones, while also exhibiting the lowest ATE, token consumption (ATU), and reasoning steps (AST). In contrast, disabling the experience knowledge base (No EKB) significantly impairs performance, particularly on complex tasks. The system's reliance on the limited short-term context of the LLM results in a drop in ESR to 43.3%, accompanied by substantial increases in both ATE and ATU. Similarly, removing curriculum learning (No CL) disrupts the structured progression of task complexity. Although the EKB remains available, the absence of curriculum guidance leads to reduced performance due to the lack of systematic skill accumulation. The most pronounced degradation is observed in the No CL + No EKB configuration, where both structured learning and long-term memory are eliminated. This variant exhibits the lowest ESR (33.3%), the highest ATE (450 s), and the greatest token consumption and reasoning overhead, especially on medium and complex CVEs. These results collectively demonstrate that curriculum guidance and experience reuse contribute complementary and indispensable benefits.
6. Conclusions
This paper presented CurriculumPT, a novel automated penetration testing framework that pioneers a curriculum-guided learning approach. By simulating how human experts master skills through progressively complex tasks, CurriculumPT enables LLM agents to systematically accumulate and transfer experience, significantly enhancing their success rate and generalization on complex CVEs without costly model fine-tuning. The experimental results demonstrate that CurriculumPT significantly outperforms baselines that lack its learning-centric, multi-agent architecture. Furthermore, ablation studies confirmed that each core component—curriculum learning, the experience knowledge base, and specialized agent design—is indispensable for achieving the framework’s high performance and efficiency. Notably, through curriculum learning, the system’s experience utilization efficiency improved with increasing task complexity, and it displayed promising generalization potential on a hold-out set. While this study has limitations, including the scope of the dataset, the degree of automation in curriculum design, and the depth of experience representation, it provides a valuable exploration into enabling LLMs to achieve a “learning to learn” capability within complex professional domains. CurriculumPT also lays the groundwork for building more intelligent, adaptive, and autonomous cybersecurity systems.
Looking ahead, the CurriculumPT framework and its core concepts open several promising research directions for advancing automated penetration testing. As a first step toward real-world deployment, we plan to enhance the learning process through dynamic curriculum generation and more robust reasoning models to support knowledge transfer across heterogeneous systems. We also aim to extend CurriculumPT by integrating real-time adaptation to dynamic targets and network conditions, enabling the framework to operate effectively in evolving environments. To ensure safety and operational oversight, we will incorporate human-in-the-loop mechanisms for supervising high-risk actions and validating system decisions. Additionally, we are exploring multimodal information processing and human–AI collaborative paradigms to improve task coordination and explainability. Finally, addressing ethical implications and defensive countermeasures remains essential to the responsible application and governance of this technology.