PenQA: A Comprehensive Instructional Dataset for Enhancing Penetration Testing Capabilities in Language Models

Zhong, Xiaofeng; Zhang, Yunlong; Liu, Jingju

doi:10.3390/app15042117

Open AccessArticle

PenQA: A Comprehensive Instructional Dataset for Enhancing Penetration Testing Capabilities in Language Models

by

Xiaofeng Zhong

^1,2,

Yunlong Zhang

^1,2,* and

Jingju Liu

^1,2

¹

College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China

²

Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(4), 2117; https://doi.org/10.3390/app15042117

Submission received: 25 January 2025 / Revised: 14 February 2025 / Accepted: 15 February 2025 / Published: 17 February 2025

(This article belongs to the Special Issue AI Technology and Security in Cloud/Big Data)

Download

Browse Figures

Versions Notes

Abstract

Large language models’ domain-specific capabilities can be enhanced through specialized datasets, yet constructing comprehensive cybersecurity datasets remains challenging due to the field’s multidisciplinary nature. We present PenQA, a novel instructional dataset for penetration testing that integrates theoretical and practical knowledge. Leveraging authoritative sources like MITRE ATT&CK™ and Metasploit, we employ online large language models to generate approximately 50,000 question–answer pairs.We demonstrate PenQA’s efficacy by fine-tuning language models with fewer than 10 billion parameters. Evaluation metrics, including the BLEU, ROUGE, and BERTScore, show significant improvements in the models’ penetration testing capabilities. PenQA is designed to be compatible with various model architectures and updatable as new techniques emerge. This work has implications for automated penetration testing tools, cybersecurity education, and decision support systems. The PenQA dataset is available in our GitHub repository.

Keywords:

penetration testing; large language models; cybersecurity

1. Introduction

Cybersecurity, as an emerging discipline and field in recent years, has garnered increasing attention from both society and academia. Penetration testing exemplifies a typical cybersecurity practice, representing an authorized, proactive security assessment process targeting specific systems and machines. Through penetration testing, system administrators can preemptively identify and mitigate system vulnerabilities, thus adopting a preventative approach to security. To gain a comprehensive understanding and proficiency in cybersecurity, it is imperative to balance the acquisition of theoretical concepts with practical application. Contemporary cybersecurity practice environments have become increasingly sophisticated [1], as they are accompanied by increasingly detailed manuals guiding corresponding cybersecurity experiments. This further underscores the practical orientation of cybersecurity as a discipline. The complexity and depth of specific cybersecurity issues vary considerably, ranging from superficial to profoundly intricate. Consequently, a cybersecurity professional must possess extensive experience with a diverse array of sophisticated tools and maintain a keen understanding of rapidly evolving practical scenarios. To mitigate the human, temporal, and financial resources required for manual task resolution, industry practitioners commonly employ machine-assisted or automated approaches. Leveraging advancements in natural language processing, a prevalent methodology involves distilling domain-specific knowledge into knowledge graphs [2]. These graphs represent entities and relationships within the domain as interconnected concepts. By conducting graph-based searches within knowledge graphs, we can access a broad spectrum of knowledge relevant to the search object, which proves beneficial in addressing domain-specific issues.

The recent proliferation of large language models (LLMs) [3] has impressed researchers with their capacity to comprehend and generate human-like text. This formidable generative capability has been evaluated across multiple tasks, with the competence of LLMs being affirmed to a considerable extent. While LLMs have received substantial validation for their general task capabilities, the prohibitive operational and training costs associated with full-parameter models are not universally accessible. Consequently, researchers have explored the use of smaller-parameter models for Supervised Fine-Tuning (SFT), which can, to some degree, achieve performance comparable to larger models. Industry practitioners typically address this through SFT and Reinforcement Learning from Human Feedback (RLHF) methods. For instance, the Alpaca model [4], trained on Llama [5] (7B) using a dataset generated by InstructGPT [6] and SFT, has demonstrated performance surpassing InstructGPT in certain aspects. Similarly, Vicuna [7], fine-tuned on Llama (13B) using ChatGPT-generated [8] data, has outperformed both Alpaca and Llama models of comparable parameter size. However, these models often exhibit suboptimal performance in specialized domains [9], which can be addressed through similar adjustment and refinement methods. In this process, data play a crucial role, necessitating the construction of datasets that encapsulate relevant domain knowledge. Cybersecurity texts inherently comprise a mixture of structured, semi-structured, and unstructured information, with practical cybersecurity experience being challenging to distill into knowledge. Consequently, there is currently a dearth of well-curated datasets and fine-tuning practices for extracting penetration testing knowledge and experience.

To address the challenge of scarce penetration testing knowledge data, we endeavored to generate instructional data using more powerful language models. This approach systematically covered the concepts involved in each stage of the penetration testing process, as well as the corresponding methodologies for utilizing relevant tools, serving as foundational knowledge. Our focus centered primarily on Metasploit (https://www.metasploit.com/ accessed on 4 July 2024), one of the most renowned penetration testing frameworks, and the MITRE ATT&CK™ (Adversarial Tactics, Techniques, and Common Knowledge) platform (https://attack.mitre.org/ accessed on 4 July 2024), a globally accessible knowledge base of adversarial tactics and techniques grounded in real-world observations.

By extracting theoretical cybersecurity knowledge and practical operational insights from diverse sources, we constructed the PenQA dataset. Subsequently, we fine-tuned five models with parameter counts below 10 billion. The results demonstrated notable improvements in cybersecurity question-answering performance, underscoring the efficacy of our approach in enhancing model capabilities within this specialized domain.

2. Related Work

2.1. LLMs for Security

The impressive capabilities of LLMs have garnered significant attention from domain experts across various fields. Numerous studies [10,11,12] have explored the application of these models to cybersecurity research. Given their human-like question-answering abilities, researchers have begun investigating the direct use of LLMs to execute penetration testing or security assessment tasks. PentestGPT [13], one of the pioneering works applying LLMs to penetration testing, explored the feasibility and efficacy of utilizing ChatGPT for penetration testing. The proposed framework ingeniously modularized role-specific GPT components, providing a comprehensive reference for subsequent research. Similarly, ScriptKiddle [14] employed prompt templates and specific dialogue flows to enable ChatGPT to conduct network threat assessments. PenHeal [15] utilizes counterfactual prompts along with the Instructor module to guide LLMs in efficiently exploring penetration testing paths. NYUCTF [16] compiled a challenge dataset by aggregating Capture The Flag (CTF) competition questions from New York University. This dataset aims to evaluate the performance of integrated large models in solving CTF challenges. However, as advocated by industry experts [17,18], we believe that large model products available on publicly accessible platforms should incorporate content restrictions pertaining to security and health considerations. Given the dual nature of technology, unrestricted access to these powerful and readily available model resources could potentially be exploited for malicious activities in cyberspace. Consequently, models designed for penetration testing tasks should be developed offline and utilized under authorized conditions.

2.2. Instructional Data

Numerous frameworks [19,20,21,22,23] have been proposed to address the efficient construction of instructional datasets. Various methodologies exist for building domain-specific datasets, with each necessitating consideration of the characteristics of domain data sources and domain-specific question-answering dynamics. Pixiu [24] integrated various open-source financial domain natural language processing datasets to construct a large-scale, multitask dataset called FIT, providing evaluation benchmarks and a fine-tuned model, FinMA. MedAlign [25] is derived from questions and corresponding instructions submitted by 15 practicing physicians across seven medical specialties, all corresponding to authentic electronic health records. In the cybersecurity domain, CyberQ [26] is an introductory question-answering dataset generated through diverse prompting methods based on entities and relationships extracted from a knowledge graph constructed from cybersecurity experimental manuals. While this dataset serves as a learning resource for cybersecurity novices and can be used to fine-tune models to enhance their grasp of cybersecurity knowledge, its complexity may not meet the requirements for penetration testing processes. Moreover, it is not specifically tailored for penetration testing, and the industry has yet to extensively explore efforts to extract penetration testing experience and knowledge.

The dataset proposed in this study represents a step towards extracting methodologies and experiences related to penetration testing tool usage. It addresses the current gap in the field by focusing on the practical aspects of penetration testing, which existing datasets have not adequately covered.

3. Dataset Generation

Figure 1 illustrates the overall workflow of this manuscript. Initially, data representative of cybersecurity knowledge and practical operations are subjected to data cleaning. Subsequently, the pertinent documents are categorized into conceptual and instructional segments. Following this, exemplars of conceptual and practical questions and answers are crafted manually, along with the development of a template for associated prompt words for subsequent utilization. In the subsequent phase, the prompt word template is employed to generate corresponding question–answer datasets from the previously distinguished conceptual and instructional text segments. Ultimately, fine-tuning is conducted on open-source models with parameter counts less than 10 billion, followed by an analysis of the outcomes.

In the section on data sources, we consider that the standard penetration testing process is primarily composed of five parts: reconnaissance, scanning, vulnerability assessment, exploitation, and post-exploitation.

3.1. ATT&CK

To acquire knowledge and experience pertinent to penetration testing, we propose leveraging the ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge) platform. Developed by MITRE, ATT&CK is a knowledge base that delineates and categorizes cyber adversary behaviors. It provides security professionals with a standardized methodology for analyzing and responding to cyber threats. The platform encompasses tactics, techniques, defenses, Cyber Threat Intelligence (CTI), and additional resources. ATT&CK offers standardized terminology and frameworks, facilitating learning for novices and enhancing communication and collaboration among security teams. Moreover, it provides comprehensive descriptions of threat behaviors, encompassing the entire process from initial access to exfiltration, documenting genuine and effective techniques that can be utilized for learning, and identifying and remediating security vulnerabilities.

Penetration testing techniques essentially employ an offensive approach to bolster defense, actively testing target systems or environments using various tools and techniques under authorized conditions. Through ATT&CK, security practitioners can better comprehend the principles and methodologies underlying cyber threats, thereby enhancing organizational defense capabilities. Analogously, the textual content extracted from the ATT&CK platform serves as a manifestation of penetration testing knowledge.

In the data sanitization and categorization process, we initially segmented the data within the MITRE ATT&CK framework according to the official delineation of 14 phases. We prioritized the selection of enterprise-level techniques, as the enterprise environment typically encompasses a broad range of operating systems, applications, and network devices, thereby providing a more comprehensive depiction of the aspects considered during penetration testing. Each phase’s techniques essentially encompass the following elements: the names and IDs of subtechniques within the technique, the definition of the technique along with potential implementation scenarios, Procedure Examples detailing specific cybersecurity incidents, Mitigations outlining related countermeasures, Detection detailing pertinent detection methods, and References. Subsequently, we excised the References and ID information, retaining only the natural language sentences that facilitate the understanding by language models. We also downplayed the specifics of cybersecurity incidents, as such information does not directly contribute to the model’s acquisition of knowledge regarding penetration testing. Ultimately, we extracted and segmented the definitions and potential application scenarios of each technique and subtechnique, as well as the corresponding detection and defensive strategies for subsequent utilization.

We focused primarily on extracting information from the platform’s Techniques section, treating each subtechnique as a source of data for extraction. Considering that subtechnique descriptions encompass information on application scenarios, conceptual explanations, and specific implementation commands, we utilized powerful large language models to generate two types of questions: conceptual and operational. For conceptual questions, we employed prompt engineering, allowing the model to generate and answer questions autonomously. Operational questions require substantial background knowledge, the model’s inherent knowledge, and clear examples. Here, we adopted a few-shot learning strategy to generate operational questions pertaining to subtechniques.

Specifically, within each subtechnology, we will first establish one or two questions to elucidate the concept of the subtechnology, such as what the subtechnology entails and its relevance to a particular phase of penetration testing. Following this, we will extract example information from the Procedure Examples within subtechnologies and compile question-and-answer pairs by drawing defensive measures from the sections on mitigations and detection.

As illustrated in Figure 2, the techniques section comprises 14 primary techniques, with Resource Development containing 10 subtechniques, Initial Access containing 18, Execution containing 32, Persistence containing 104, Privilege Escalation containing 95, and so forth, totaling 708 subtechniques. Some subtechniques may appear under different phases. To facilitate automation and streamline the instruction data generation process, and considering that descriptions of the same subtechnique may vary across different phases, we employed a templatized prompt approach. This approach incorporates all relevant descriptive information and phase-specific details to distinguish the application and associated knowledge of these subtechniques across different phases.

3.2. Metasploit

Metasploit is a widely used penetration testing framework that equips security professionals with tools to identify and exploit system vulnerabilities. It serves as a robust platform for developing and executing exploit code against remote target machines. The core of the Metasploit platform consists of various modules, allowing users to achieve specific tool functionalities by utilizing different modules and configuring corresponding parameters. As of the time we collected and processed the data, the Metasploit Community Edition had 5626 modules categorized by functionality into Exploit, Auxiliary, Post, Payload, Encoder, NOP, and Evasion modules.The quantitative relationships between different modules are illustrated in the Figure 3.

Exploit modules form the core of Metasploit are designed to leverage software security vulnerabilities. These modules contain exploit code targeting specific vulnerabilities, enabling testers to gain control over target systems or execute arbitrary code. Auxiliary modules execute attacks that do not directly provide sessions, such as scanning, fingerprinting, sniffing, and denial of service attacks. These modules assist penetration testers in gathering information prior to attacks or providing supplementary functions during the attack process. Post modules are employed after successfully gaining control of the target system, facilitating further penetration and privilege maintenance. Payloads are code or commands sent to the target system following successful exploitation. They determine the operations the attacker wishes to execute on the target system, such as establishing Meterpreter sessions, executing specific commands, or uploading files. Encoders are used to encode Payloads to evade detection by the target system’s antivirus software or other security measures. Encoders can alter the Payload’s form, making it less detectable during transmission. NOP (No Operation) instructions are inserted into Payloads to fill space or adjust Payload size, ensuring more stable memory layout. While NOP instructions do not execute any operations, they help stabilize Payload execution on the target system. Evasion modules assist penetration testers in circumventing network defense measures such as Intrusion Detection Systems (IDSs), Intrusion Prevention Systems (IPSs), and antivirus software. These modules reduce detection probability by altering attack characteristics and employing encrypted communications.

In the data processing phase, the detailed information for each module within Metasploit encompasses aspects such as the platform and architecture, required privileges, impact level, release date, contributor details, module stability and reliability, side effects, basic options, a description of the module, and references. To facilitate direct assistance to the model and to enhance subsequent human understanding of the usage of these tools, we have excluded information irrelevant to the immediate application of the modules, such as provenance and module stability. We have retained the scenario descriptions, applicable platforms, and options that need to be set during usage. These essential details have been preserved in a name–content format for ease of reference and comprehension. Subsequently, through few-shot learning and prompt engineering, we instructed the large language model to generate an instructional dataset that reflects Metasploit module usage scenarios, module configuration procedures, and module execution knowledge. To elaborate, we have filtered out irrelevant author information from the penetration testing guidelines. We segment each module using divisions and establish questions that encompass the vulnerabilities corresponding to each module, the necessary options to be set within the module, and questions and answers on how certain options should be configured.

The focal point of penetration testing lies in system vulnerabilities. Our aforementioned data sources do not emphasize the collection of vulnerability-specific information such as Common Vulnerabilities and Exposures (CVEs) or Common Vulnerability Disclosure (CVD). This omission is deliberate, as we have observed that even the most advanced LLMs tend to generate seemingly correct but fabricated content when confronted with unfamiliar queries—a phenomenon known as hallucination.

Primarily, these models were not specifically pre-trained on cybersecurity vulnerability knowledge, resulting in insufficient mastery of this domain. Additionally, open-source model training corpora have a definitive cut-off date, beyond which documents or corpora are not represented in the LLM’s knowledge base. Current fine-tuning techniques cannot fully enable models to memorize this specialized knowledge comprehensively. Consequently, we posit that constructing instructional data focused on specific CVE knowledge would be time-consuming and potentially ineffective. A more efficient approach would be to leverage external knowledge retrieval mechanisms to access this constantly updated information. However, the knowledge embedded within Metasploit can indeed facilitate practical application of corresponding modules. Therefore, we have distilled the knowledge from all modules into an instructional dataset for potential fine-tuning purposes.

Regarding model selection, we have opted for the latest API version of ChatGLM, GLM-4, developed by Zhipu AI (https://www.zhipuai.cn/ accessed on 30 July 2024). GLM-4’s performance is comparable to GPT-4 (https://openai.com/index/gpt-4/ accessed on 30 July 2024) but with lower API call costs, offering a superior cost–performance ratio. To ensure standardization and reproducibility of the generated instructional data and to enable fair comparison between different models when invoking LLM APIs, it is imperative to construct prompt templates. We have meticulously crafted a set of prompt templates, which, during their operational deployment, are designed to facilitate the dynamic generation of question-and-answer pairs by automatically populating them with contextually relevant information, thereby leveraging the capabilities of large language models. The partial templates constructed are shown in the Figure 4. Within these templates, we define the model’s role, skills, and constraints. Role specification allows the model to more rapidly adapt to the given scenario and generate domain-relevant content [27]. Delineating skills and constraints enables the model to clearly comprehend the task objectives and specific requirements. In this process, we have uniformly adopted a few-shot learning approach to enhance the models’ comprehension of the task objectives, thereby improving their responsiveness. As a concrete example, we have manually curated question-and-answer pairs that encapsulate the usage methodologies and operational commands of Nmap, serving as a guiding framework for instructing large language models (LLMs). Furthermore, the question-and-answer pairs that are generated and validated in subsequent stages can also be utilized as exemplars. These exemplars serve to reduce the complexity of generation, provided that they align with the subject matter and contextual background (see Table 1 and Table 2). The quantitative relationships within the overall dataset can be found in the Appendix A.

We have presented several examples of the dataset in Table 1 and Table 2. These include both conceptual and practical knowledge question-and-answer pairs. Each entry in our dataset comprises four attributes: an ID, which serves as a serial number for easy identification and counting; a category, representing the type of data; and an Input and Output, which denote the question and answer content of the Q&A dataset, respectively.

4. Experiments

To demonstrate the practicality of our constructed dataset, we proposed fine-tuning smaller parameter models (below 10 billion parameters) using our dataset. Our model foundations include Qwen2-7B [28], Gemma2-9B [29], GLM4-9B [30], Llama3-8B [31], and Mistral-8B [32]. These models offer comprehensive support for model parameters and training frameworks, enabling fine-grained parameter updates. The architectures encompass a variety of designs including autoregressive (Llama3), mixture of experts (Mistral), and bidirectional attention (GLM), thereby facilitating the verification of the method’s adaptability across different model families. Due to memory and fine-tuning time constraints, we applied 4-bit QLoRA [33] to fine-tune the base models. For the dataset, we allocated 5% for testing and the remainder for training, employing a learning rate of 2 × 10⁻⁵. For LoRA, we set the rank to 128 and alpha to 256. We applid LoRA to all Query, Key, and Value metrics in multihead self-attention blocks alongside Linear layers. We utilized the AdamW optimizer with a batch size of 8.

Given that these models are generative in nature, it is challenging to achieve perfect word-for-word consistency in responses to identical prompts, and they may not entirely match the reference answers. Consequently, we employed popular natural language generation (NLG) metrics for evaluation. These include word overlap-based metrics such as BLEU [34], ROUGE [35], and METEOR [36]; word embedding-based metrics [37] that include Greedy Matching, Embedding Average, and Vector Extrema; and language model-based metrics such as BERTScore [38]. The results of our experiments are presented in Table 3 and Table 4.

We conducted predictions using our experimental models both before and after fine-tuning, comparing the outputs with reference answers. Within the experimental context, there was a discernible enhancement observed across all metrics, with each exhibiting an increment in score to a diverse extent. Our findings indicate that fine-tuning with our proposed instruction dataset indeed enhances the model’s application of cybersecurity knowledge and proficiency with certain cybersecurity tools.

Due to the time-sensitive nature of many Metasploit modules, the correlation between the test data and the training data, despite utilizing 95% of the data for training, was not strong. Consequently, the model had not learned all the knowledge and could not accurately answer all the questions in the test set; therefore, the practical value of the model trained in the experiment is not greater than that of the model trained on the full dataset. Therefore, we fine-tuned the entire dataset on the Llama model and experimented with several queries. Figure 5 illustrates the enhanced performance in responding to specific Metasploit module inquiries post-fine-tuning, accurately providing detailed information about the respective modules. From a human intuitive perspective, the improvement is quite pronounced, and the model maintained both conciseness and accuracy across similar questions. Inevitably, the limitation of fine-tuned models lies in their inability to acquire dynamically updated knowledge, representing a constraint of the refinement process.

5. Conclusions

This paper presents PenQA, an instructional dataset for penetration testing in the cybersecurity domain. We have also trained large language models with parameters under 10 billion as a question-answering system based on this dataset. This system is designed to address fundamental concepts in the penetration testing process and elucidate the usage of various cybersecurity tools, with a primary focus on Metasploit module methodologies. Our work extracts critical knowledge from the renowned cybersecurity knowledge platform ATT&CK and the penetration testing platform Metasploit to create prompts, utilizing large language models to generate questions and answers. These question–answer pairs encapsulate domain-specific experience and knowledge in cybersecurity.

In this research, we employed GLM-4 to generate question–answer pairs, a methodology that can be extended to various state-of-the-art large language models with vast parameter counts and enhanced capabilities.

6. Discussion

While our approach has made significant strides in the field of penetration testing, it is crucial to explore potential limitations. Firstly, cybersecurity knowledge is expansive and complex, intersecting with numerous disciplines and concepts that make it challenging to create a comprehensive instructional dataset. Moreover, penetration testing encompasses a wealth of expert-derived practical experience that are often embedded in technical reports and hands-on tutorials, which presents increased difficulty in extraction.

Secondly, due to computational resource constraints and time limitations, we were unable to conduct experiments on models with parameter counts exceeding 20 billion. Lastly, it is important to note that the question-answering system fine-tuned on the dataset proposed in this paper can only serve as a reference for the penetration testing process. It has not undergone testing on real-world networks or local systems.

These limitations underscore the need for continued research and development in this domain. Future work could focus on expanding the dataset to include a broader range of cybersecurity concepts and practical experiences, exploring methods to extract knowledge from diverse sources such as technical reports and tutorials and conducting experiments with larger models as computational resources become available. Additionally, rigorous testing in real-world scenarios would be essential to validate the system’s practical applicability and refine its capabilities for operational use in penetration testing contexts.

7. Ethical Statement

The presence of sensitive and offensive content within the datasets should be acknowledged. It is important to highlight that these contents do not reflect our views or beliefs but are solely intended for research purposes. The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request. We underscore the potential risks associated with the misuse of the dataset developed for this research. To mitigate these risks, we impose the following legal and ethical constraints on the use of the dataset:

The dataset shall be utilized exclusively for lawfully authorized penetration testing, vulnerability verification, or the development of defensive systems. Any unauthorized network probing or penetration simulation is strictly prohibited.
Case analyses should focus on the theoretical underpinnings of vulnerabilities, refraining from disclosing specific methodologies for constructing attack chains.
All experiments conducted using this dataset must be executed within a secure, isolated virtual environment to prevent any unintended impact on real-world systems.
Users of the dataset are required to adhere to all applicable laws, regulations, and industry standards related to cybersecurity and data protection.
The dataset should not be used for any activities that could compromise personal privacy, intellectual property rights, or the integrity of critical infrastructure.
Researchers and practitioners using the dataset are encouraged to report any identified vulnerabilities through proper channels and to follow responsible disclosure practices.
Access to the dataset should be controlled and limited to qualified individuals or entities who have agreed to the terms of use and demonstrated responsible research practices.

By incorporating these constraints and guidelines, we aim to promote the responsible and ethical use of the dataset, thereby upholding the principles of cybersecurity research.

Author Contributions

Conceptualization, X.Z. and J.L.; methodology, X.Z.; software, X.Z.; validation, J.L. and Y.Z.; formal analysis, J.L.; investigation, Y.Z.; data curation, Y.Z.; writing, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The PenQA dataset is partially available in our GitHub repository at https://github.com/WangZtl/PenQA (accessed on 14 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SFT	Supervised Fine-Tuning
LLMs	Large Language Models

Appendix A

We have supplemented the quantity of the dataset across various categories in Table A1. It can be observed that we possess pertinent data throughout different stages of penetration testing. Notably, Metasploit is not subdivided further, as it relies on specific modules.

Table A1. This table presents the current distribution of data types within our compiled dataset. Ongoing efforts will focus on the continuous collection and expansion of data across each category, with the aim of broadening the scope of categories included.

Aspect	Count
Reconnaissance	508
Vulnerablity Scanning	478
Resource Development	813
Initial Access	376
Execution	664
Persistence	2119
Privilege Escalation	1887
Defense Evasion	3502
Credential Access	1327
Discovery	942
Lateral Movement	478
Collection	615
Command and Control	672
Exfiltration	295
Impact	465
Metasploit	33,756
Basic Concepts	1000

References

Liu, A.; Maxim, B.R.; Yuan, X.; Cheng, Y. Exploring Cybersecurity Hands-on Labs in Pervasive Computing: Design, Assessment, and Reflection. In Proceedings of the 2024 ASEE Annual Conference & Exposition, Portland, OR, USA, 23–26 June 2024. [Google Scholar]
Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Philip, S.Y. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 494–514. [Google Scholar] [CrossRef] [PubMed]
Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Alpaca: A strong, replicable instruction-following model. Stanf. Cent. Res. Found. Model. 2023, 3, 7. Available online: https://crfm.stanford.edu/2023/03/13/alpaca.html (accessed on 20 July 2024).
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 27730–27744. [Google Scholar]
Chiang, W.L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT. Available online: https://vicuna.lmsys.org (accessed on 14 April 2023).
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
Mikalef, P.; Conboy, K.; Lundström, J.E.; Popovic, A. Thinking responsibly about responsible AI and `the dark side’ of AI. Eur. J. Inf. Syst. 2022, 31, 257–268. [Google Scholar] [CrossRef]
Lu, Y.; Yu, L.; Zhao, J. Research Progress on Intelligent Mining Technology for Software Vulnerabilities. Inf. Countermeas. Technol. 2023, 2, 1–19. [Google Scholar] [CrossRef]
Geng, C.; Chang, S.; Huang, H. Research on Smart Contract Vulnerability Detection Based on Prompt Engineering in Zero-shot Scenarios. Inf. Countermeas. Technol. 2024, 2, 70–81. [Google Scholar] [CrossRef]
Yan, S.; Wang, S.; Duan, Y.; Hong, H.; Lee, K.; Kim, D.; Hong, Y. An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection. In Proceedings of the 33rd USENIX Security Symposium, USENIX Security 2024, Philadelphia, PA, USA, 14–16 August 2024; Balzarotti, D., Xu, W., Eds.; USENIX Association: Berkeley, CA, USA, 2024. [Google Scholar]
Deng, G.; Liu, Y.; Vilches, V.M.; Liu, P.; Li, Y.; Xu, Y.; Pinzger, M.; Rass, S.; Zhang, T.; Liu, Y. PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing. In Proceedings of the 33rd USENIX Security Symposium, USENIX Security 2024, Philadelphia, PA, USA, 14–16 August 2024; Balzarotti, D., Xu, W., Eds.; USENIX Association: Berkeley, CA, USA, 2024. [Google Scholar]
Moskal, S.; Laney, S.; Hemberg, E.; O’Reilly, U.M. LLMs Killed the Script Kiddie: How Agents Supported by Large Language Models Change the Landscape of Network Threat Testing. arXiv 2023, arXiv:2310.06936. [Google Scholar]
Huang, J.; Zhu, Q. PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation. arXiv 2024, arXiv:2407.17788. [Google Scholar]
Shao, M.; Jancheska, S.; Udeshi, M.; Dolan-Gavitt, B.; Xi, H.; Milner, K.; Chen, B.; Yin, M.; Garg, S.; Krishnamurthy, P.; et al. NYU CTF Dataset: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security. arXiv 2024, arXiv:2406.05590. [Google Scholar]
Li, Y.; Liu, S.; Chen, K.; Xie, X.; Zhang, T.; Liu, Y. Multi-target Backdoor Attacks for Code Pre-trained Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J.L., Okazaki, N., Eds.; Association for Computational Linguistics: New Brunswick, NJ, USA, 2023; pp. 7236–7254. [Google Scholar] [CrossRef]
Sun, H.; Zhang, Z.; Deng, J.; Cheng, J.; Huang, M. Safety assessment of chinese large language models. arXiv 2023, arXiv:2304.10436. [Google Scholar]
Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
Singh, S.; Vargus, F.; D’souza, D.; Karlsson, B.; Mahendiran, A.; Ko, W.; Shandilya, H.; Patel, J.; Mataciunas, D.; O’Mahony, L.; et al. Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: New Brunswick, NJ, USA, 2024; pp. 11521–11567. [Google Scholar] [CrossRef]
Happe, A.; Cito, J. Getting pwn’d by ai: Penetration testing with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 3–9 December 2023; pp. 2082–2086. [Google Scholar]
Sung, C.; Lee, Y.; Tsai, Y. A New Pipeline for Generating Instruction Dataset via RAG and Self Fine-Tuning. In Proceedings of the 48th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2024, Osaka, Japan, 2–4 July 2024; Shahriar, H., Ohsaki, H., Sharmin, M., Towey, D., Majumder, A.K.M.J.A., Hori, Y., Yang, J., Takemoto, M., Sakib, N., Banno, R., et al., Eds.; IEEE: Piscataway, NJ, USA, 2024; pp. 2308–2312. [Google Scholar] [CrossRef]
Shashwat, K.; Hahn, F.; Ou, X.; Goldgof, D.; Hall, L.; Ligatti, J.; Rajgopalan, S.R.; Tabari, A.Z. A Preliminary Study on Using Large Language Models in Software Pentesting. arXiv 2024, arXiv:2401.17459. [Google Scholar]
Xie, Q.; Han, W.; Zhang, X.; Lai, Y.; Peng, M.; Lopez-Lira, A.; Huang, J. PIXIU: A Comprehensive Benchmark, Instruction Dataset and Large Language Model for Finance. In Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 33469–33484. [Google Scholar]
Fleming, S.L.; Lozano, A.; Haberkorn, W.J.; Jindal, J.A.; Reis, E.; Thapa, R.; Blankemeier, L.; Genkins, J.Z.; Steinberg, E.; Nayak, A.; et al. Medalign: A clinician-generated dataset for instruction following with electronic medical records. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 22021–22030. [Google Scholar]
Agrawal, G.; Pal, K.; Deng, Y.; Liu, H.; Chen, Y.C. CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 23164–23172. [Google Scholar]
Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 Technical Report. arXiv 2024, arXiv:2407.10671. [Google Scholar]
Team, G.; Riviere, M.; Pathak, S.; Sessa, P.G.; Hardin, C.; Bhupatiraju, S.; Hussenot, L.; Mesnard, T.; Shahriari, B.; Ramé, A.; et al. Gemma 2: Improving open language models at a practical size. arXiv 2024, arXiv:2408.00118. [Google Scholar]
Zeng, A.; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Zhang, D.; Rojas, D.; Feng, G.; Zhao, H.; Lai, H.; et al. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv 2024, arXiv:2406.12793. [Google Scholar]
Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 10088–10115. [Google Scholar]
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Philadelphia, PA, USA, 2002; pp. 311–318. [Google Scholar]
Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 25–30 June 2005; pp. 65–72. [Google Scholar]
Mikolov, T. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]

Figure 1. The pipeline of our method for dataset construction and model augmentation. We have sanitized and organized the usage manuals of open-source cybersecurity tools and open-source cybersecurity knowledge documents into coherent text segments. Subsequently, we have extracted specific question-and-answer pairs from several of these segments, which include the usage commands for specific cybersecurity tools and concepts related to network security. Following this, we have designed an enhancement project that enables more capable large language models to learn the correspondences between our crafted questions and answers and the original text segments, thereby generating question-and-answer data for each segmented piece. Finally, we have split the resulting question-and-answer dataset and fine-tuned it on large language models under 10 B parameters to enhance their performance in penetration testing Q&A tasks.

Figure 2. Examples of techniques and some sub-techniques in ATT&CK. These techniques symbolize the distinct phases and objectives of the penetration process, constituting a valuable knowledge base for penetration testing. Within each technique, the subtechniques offer a more granular classification and narrative, allowing cybersecurity practitioners, as well as large language models, to systematically absorb and comprehend the intricacies of the penetration testing process.

Figure 3. The quantitative relationships between different modules. Exploit, Payloads, and Auxiliary modules constitute the majority of modules within Metasploit. Exploits, Payloads, and Auxiliary modules constitute the majority of all modules by number, with each module typically corresponding to a specific vulnerability. As vulnerabilities continue to be discovered, the modules available on the Metasploit open-source community are continually updated to reflect these findings.

Figure 4. Basic template for generating penetration testing Q&A prompts. By utilizing Roles, Skills, and Restrictions, we assist large language models in comprehending the requirements of a task and the boundaries of their capabilities within that task. This approach ensures the generation of data that are uniformly formatted and aligned with the domain-specific criteria.

Figure 5. An example that demonstrates how a model fine-tuned on a dataset performs better than one that is not fine-tuned has been practically verified as the answer to this question.

Table 1. Examples of conceptual Q&A in the PenQA. The primary focus is on designing conceptual knowledge and definitional knowledge. For instance, this includes explanations of Metasploit module options as presented in tables, as well as the methods and techniques employed in specific phases of penetration testing.

Conceptual Q&A Examples
ID: 129,
Category: “Reconnaissance”,
Input: “Describe how to use NMAP for scanning an entire network segment”.
Output: “To use NMAP for scanning an entire network segment, specify the target network segment in CIDR notation (e.g., 192.168.1.0/24) or a range of IP addresses (e.g., 192.168.1.1-254)”.
ID: 32123,
Category: “Metasploit”,
Input: “What does the FOLDER option in the module `auxiliary/gather/exchange_proxylogon_collector` define?”
Output: “The FOLDER option defines the email folder that should be targeted for dumping emails in the module `auxiliary/gather/exchange_proxylogon_collector`”.
ID: 4724,
Category: “Persistence”,
Input: “How can adversaries use transport agents for persistent access to systems?”
Output: “Adversaries can abuse Microsoft transport agents by registering a malicious transport agent with the Exchange server. This malicious agent can then be triggered by specific email events to provide a persistence mechanism”.

Table 2. Examples of Practical Q&A in the PenQA. The main focus revolves around the specific command usage for tools in particular scenarios, along with the assignment of parameters for Metasploit modules in targeted contexts.

Practical QA Examples
ID: 867,
Category: “Vulnerability Scanning”,
Input: “How can I scan for open ports that might lead to exposure of sensitive data (CWE-200)?”
Output: “Use `nmap’ to scan for open ports with the command `nmap -p 1-65535 -sV example.com’”.
ID: 45706,
Category: “Metasploit”,
Input: “How can one discover the open port created by `payload/linux/x64/shell_bind_tcp_random_port` module?”
Output: “One can discover the open port created by the payload `/linux/x64/shell_bind_tcp_random_port` module using the nmap command
’nmap -sS target -p-’”.
ID: 41854,
Category: “Metasploit”,
Input: “What command is used by default to fetch the payload on the target system in `payload/cmd/windows/http/x64/vncinject/reverse_tcp_rc4`?”
Output: “The payload `/cmd/windows/http/x64/vncinject/reverse_tcp_rc4` module uses the CERTUTIL command by default to fetch the payload on the target system”.

Table 3. Performance on natural language generation evaluation metrics including BLEU, METEOR, and ROUGE. It is evident that across all experiments, the models demonstrated enhanced performance in penetration testing knowledge Q&A after fine-tuning.

Model	Parameter	BLEU	METEOR	ROUGE
Model	Parameter	BLEU	METEOR	Rouge1	Rouge2	RougeL	RougeLS
Gemma	9 B	0.04	0.25	0.21	0.09	0.16	0.18
Finetuned Gemma	9 B	0.08	0.34	0.31	0.19	0.26	0.27
GLM	9 B	0.13	0.40	0.41	0.27	0.35	0.36
Finetuned GLM	9 B	0.47	0.58	0.63	0.48	0.58	0.58
Llama	8 B	0.07	0.29	0.22	0.12	0.18	0.19
Finetuned Llama	8 B	0.41	0.50	0.58	0.42	0.52	0.52
Mistral	7 B	0.07	0.31	0.26	0.13	0.20	0.21
Finetuned Mistral	7 B	0.38	0.46	0.54	0.39	0.49	0.49
Qwen	7 B	0.05	0.19	0.21	0.10	0.17	0.18
Finetuned Qwen	7 B	0.41	0.53	0.59	0.42	0.53	0.53

Table 4. Performance on natural language generation evaluation metrics including BERTScore and word embedding-based metrics. The dataset provided has indeed enhanced the accuracy of penetration testing Q&A.

Model	Bertscore	Word Embedding-Based Metrics
Model	Bertscore	Greedy Matching	Embedding Average	Vector Extrema
Gemma	0.81	−0.30	0.78	0.91
Finetuned Gemma	0.84	−0.23	0.81	0.90
GLM	0.89	−0.25	0.83	0.92
Finetuned GLM	0.94	−0.23	0.88	0.95
Llama	0.82	−0.22	0.83	0.92
Finetuned Llama	0.93	−0.23	0.84	0.91
Mistral	0.84	−0.25	0.81	0.92
Finetuned Mistral	0.84	−0.20	0.77	0.84
Qwen	0.76	−0.30	0.60	0.79
Finetuned Qwen	0.93	−0.25	0.86	0.94

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhong, X.; Zhang, Y.; Liu, J. PenQA: A Comprehensive Instructional Dataset for Enhancing Penetration Testing Capabilities in Language Models. Appl. Sci. 2025, 15, 2117. https://doi.org/10.3390/app15042117

AMA Style

Zhong X, Zhang Y, Liu J. PenQA: A Comprehensive Instructional Dataset for Enhancing Penetration Testing Capabilities in Language Models. Applied Sciences. 2025; 15(4):2117. https://doi.org/10.3390/app15042117

Chicago/Turabian Style

Zhong, Xiaofeng, Yunlong Zhang, and Jingju Liu. 2025. "PenQA: A Comprehensive Instructional Dataset for Enhancing Penetration Testing Capabilities in Language Models" Applied Sciences 15, no. 4: 2117. https://doi.org/10.3390/app15042117

APA Style

Zhong, X., Zhang, Y., & Liu, J. (2025). PenQA: A Comprehensive Instructional Dataset for Enhancing Penetration Testing Capabilities in Language Models. Applied Sciences, 15(4), 2117. https://doi.org/10.3390/app15042117

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

PenQA: A Comprehensive Instructional Dataset for Enhancing Penetration Testing Capabilities in Language Models

Abstract

1. Introduction

2. Related Work

2.1. LLMs for Security

2.2. Instructional Data

3. Dataset Generation

3.1. ATT&CK

3.2. Metasploit

4. Experiments

5. Conclusions

6. Discussion

7. Ethical Statement

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI