Article

Cybersecurity Applications of Near-Term Large Language Models

by Casimer DeCusatis *, Raymond Tomo, Aurn Singh, Emile Khoury and Andrew Masone
School of Computer Science and Mathematics, Marist University, Poughkeepsie, NY 12601, USA
* Author to whom correspondence should be addressed.
Electronics 2025, 14(13), 2704; https://doi.org/10.3390/electronics14132704
Submission received: 30 May 2025 / Revised: 24 June 2025 / Accepted: 2 July 2025 / Published: 4 July 2025

Abstract

This paper examines near-term generative large language models (GenLLMs) for cybersecurity applications. We experimentally study three common use cases, namely the use of GenLLMs as digital assistants, as analysts for threat hunting and incident response, and as analysts for access management in zero trust systems. In particular, we establish that one of the most common GenLLMs, ChatGPT, can pass cybersecurity certification exams for security fundamentals, hacking and penetration testing, and mobile device security, as well as perform competitively in cybersecurity ethics assessments. We also identify issues associated with hallucinations in these environments. The ability of ChatGPT to analyze network scans and security logs is also evaluated. Finally, we attempt to jailbreak ChatGPT in order to assess its suitability for access management systems.

1. Introduction

Recently, there has been a tremendous amount of interest in machine learning (ML) and artificial intelligence (AI) applications [1]. Most of this interest has centered around large language models (LLMs) as applied to generative language creation (GenLLMs) [1]. The announcement and general availability of the ChatGPT model in November 2022 was followed by this model acquiring over 1 million users within 5 days and over 100 million users within 2 months [2]. This has driven emerging applications in a wide range of fields, as well as calls for regulation at the state [3], national [4], and international levels of government [5]. As of this writing, there are over 60 distinct types of GenLLMs available [6]. New types are still emerging, and existing types are continually updated. According to recent industry surveys [7], some of the most popular models include ChatGPT (from OpenAI), Claude (from Anthropic), DeepSeek (from a Chinese AI firm of the same name), Granite (a model from the IBM watsonx library), Llama (from Facebook/Meta), LaMDA (part of a family of models developed by Google which includes the model Gemini, the chatbot Bard, and the Pathways Language Model), Mistral (from Mistral AI), Grok (from the company xAI), Bedrock (from Amazon), and more.
In this paper, we have chosen to concentrate on the ChatGPT model for a number of reasons. This model is consistently ranked among the most popular and most effective in the field [7,8,9,10]. Its unprecedented adoption rate [2] has led to a wide range of potential use cases. Industry analysts have noted that ChatGPT offers modalities and integration options similar to those of other leading GenLLMs [8,9,10]. For text-based assignments, ChatGPT has been found to generally perform better than most leading alternatives based on side-by-side comparisons for some common use cases [7,8,9,10]. While other GenLLMs excel at image generation and similar creative tasks, ChatGPT has become known for its ability to generate and understand human-like text and facilitate problem solving. These properties make ChatGPT a good fit for the type of security industry certification testing we had planned. We hope to extend this approach to other LLMs in future research.
Our initial research hypothesis is that GenLLM models such as ChatGPT may have value in the field of cybersecurity for specific use cases. We have defined three near-term use cases, discussed in a later section, and we will test ChatGPT to determine whether this approach has any value, as well as document any limitations with the current models. There are several novel aspects to our research. First, for the use case of GenLLMs as a cybersecurity digital assistant, we study for the first time whether a GenLLM can pass an industry standard cybersecurity certification exam. We also study for the first time whether a GenLLM can pass a cybersecurity ethics exam. For the use case of incident response and threat hunting, we study for the first time whether a GenLLM can be used to analyze security log or honeypot data. We compare the GenLLM’s performance with a typical human operator. For our zero trust environment use case, we perform a series of jailbreak attempts to determine if a GenLLM can be compromised through different types of malicious prompt injection attacks.
It is well known that all GenLLMs have potentially significant limitations and risks [11]. For the purposes of our study, these include the following:
Lack of explainability: Many of these models have not released their source code or training data, meaning that their behavior can only be studied as a “black box” model. It is frequently impossible to say “why” a model produces a given response, or to expect that it will produce the same response under slightly different input conditions. For this reason, we will conduct a number of tests to assess the model’s behavior for our initial three short-term use cases.
Model bias: Without access to the training data set, it is not possible to determine if the model was trained with an implicit bias impacting the results. LLMs can perpetuate biases present in their training data, which is difficult to detect in black box testing.
Hallucinations: LLMs can generate plausible-sounding but inaccurate or nonsensical information. They might invent facts, misrepresent concepts, or fabricate details. In some cases the models will deliberately fabricate information, rather than admit they do not know the correct response. We expect to find examples of hallucinations during our testing.
Limited Knowledge: LLMs are trained on data up to a certain point in time and cannot automatically update their knowledge with real-time information. We expect to find examples of this in our testing when asking the model to interpret fairly recent developments in the cybersecurity field.
Ethical Concerns: As a separate issue from model bias, ethical issues may include lack of accountability for generated content, failure to properly attribute original sources or the hallucination of fake sources, and potential for harmful uses such as circumventing security protocols. We will evaluate the GenLLM using an accepted industry standard ethical assessment provided by the IEEE industry professional society.
Computational Constraints: Most LLMs have limits on the amount of text they can process in a single input or output. We expect this to be a limiting factor in their ability to ingest security logs and generate or implement security policies such as identity management.
In this paper, we are concerned with potential applications of GenLLMs to cybersecurity. We define three near-term use cases for GenLLMs (digital assistants, incident response and threat hunting, and identity and access management for zero trust environments). We conduct experimental research to assess the value in each area using ChatGPT, currently the most widely used GenLLM and representative of the fundamental behavior of many similar LLMs.
This manuscript is organized as follows. After a brief introduction, we present a literature review and discuss materials and experimental methods, including a detailed description of the three near-term use cases considered in our research. We then present experimental results and discussion for the three use cases studied. This is followed by our conclusions, references, and links to additional material.

2. Literature Review

The field of artificial intelligence, or human intelligence exhibited by computers, dates back to at least the 1950s [1]. A few decades later, during the 1980s, researchers narrowed the field to study machine learning, or computers capable of learning from historical data [12,13,14]. By the 2010s, machine learning systems capable of mimicking human responses after training on large data sets began to emerge (so-called deep learning) [13]. The foundation of modern LLMs began to take shape around 2014–2016, when research was published describing machine learning techniques designed to mimic human cognition [15]. This work was refined in a subsequent 2017 paper describing transformer models and attention mechanisms [16]. More recently, since about 2020, a subfield has developed which uses so-called foundation models, or generative artificial intelligence via large language models [17]. This led to the release of the popular ChatGPT GenLLM in November 2022, followed closely by similar generative text models including Google Bard and Gemini [12], Meta’s LLaMA [18], and Anthropic’s Claude [19].
Although the source code for most LLMs (including ChatGPT) has not been released, the fundamental operating principles of LLMs have been extensively described in the literature [17]; we will provide only a brief, high-level overview for our purposes. ChatGPT is a machine learning framework which has been trained on over 200 billion data points scraped from the public Internet. Its basic function is similar to a nonlinear artificial neural network with variable threshold feedback levels and weights connecting different network layers (whose values are determined by training the model, or exposing it to a series of inputs which exemplify the desired outputs). The complexity and non-algorithmic nature of these models can make it difficult to establish explainability or reproducibility of the model outputs. When provided with an input prompt, the model attempts to respond by generating a reasonable continuation of whatever response it has provided up to that point. In this context, a reasonable continuation refers to something that we might expect a typical person to write after observing similar responses on billions of web pages. In other words, given a response fragment, the LLM reviews similar content from its training data to determine which words are likely to appear next, and with what probabilities. The model does not always pick the highest probability word; rather, the algorithm introduces some randomness so that the resulting response reads more like human-generated text. Thus, the LLM will not always respond the same way to the same prompt, and can sometimes make up entirely new words. Further, the LLM takes into account not just the next likely word, but the next likely sequence of words. In effect, the model searches for statistical relationships among words and phrases in its training data in order to generate text responses to queries. Although there is no underlying theory of language involved, and this is likely not the same way a human would formulate a reply, it is remarkable that the LLM can generate responses that seem (even superficially) like human-generated text, with reasonably accurate responses to at least a limited set of questions. In its current form, ChatGPT does not consult the Internet to validate facts, and thus is prone to generating misleading or incorrect responses, known as hallucinations.
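The word-by-word sampling behavior described above can be made concrete with a short sketch. This is a minimal, hypothetical illustration of temperature-based sampling, not OpenAI’s implementation; the candidate words, probabilities, and temperature value are invented for the example.

```python
import math
import random

def sample_next_word(candidates, temperature=0.8):
    """Pick the next word from (word, probability) pairs.

    A very low temperature almost always chooses the most likely word;
    higher temperatures flatten the distribution, introducing the
    randomness that makes LLM output read more like human text.
    """
    # Re-weight each probability as p^(1/temperature).
    weights = [math.exp(math.log(p) / temperature) for _, p in candidates]
    total = sum(weights)
    r = random.uniform(0, total)
    cumulative = 0.0
    for (word, _), w in zip(candidates, weights):
        cumulative += w
        if r <= cumulative:
            return word
    return candidates[-1][0]

# Hypothetical continuation probabilities for the fragment "AES is a ..."
candidates = [("block", 0.62), ("stream", 0.21), ("symmetric", 0.12), ("public", 0.05)]
print(sample_next_word(candidates))  # usually "block", but not always
```

Because the draw is random, running the snippet repeatedly will occasionally return a lower-probability word; this is the same mechanism that keeps an LLM from always answering the same prompt in the same way.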
GenLLMs have been considered as both a potential source of cybersecurity threats and a novel approach to threat mitigation [20]. The models themselves are vulnerable to exploitation, and the most commonly observed threats have been cataloged by the industry standard Open Web Application Security Project (OWASP) [21]. Further guidance on adversarial machine learning techniques has been published by the National Institute of Standards and Technology (NIST) [22]. However, in this paper we are focused on the experimental application of GenLLMs to several specific use cases in cybersecurity which have not yet been investigated in detail. For example, previous studies have not considered whether currently available GenLLMs can pass a cybersecurity certification exam, which is one indication of their fitness for use as digital assistants. The ability of GenLLMs to pass an industry standard cybersecurity ethics competition has also not been previously assessed. The robustness of GenLLM responses to prompt injection attacks related to cybersecurity, and their applications to threat hunting, have also not been studied previously. We will consider all of these issues in our current novel experimental work.

3. Materials and Experimental Methods

In order to assess our first use case (the value of ChatGPT 3.0 as a cybersecurity digital assistant), we began by testing whether this LLM could pass an industry standard certification test on the fundamentals of cybersecurity, as required for many entry-level jobs in the field. Our methodology duplicates the approach used for certification testing of human security specialists; namely, we tested the LLM using a series of questions about the fundamentals of cybersecurity. This body of knowledge is representative of material covered in several undergraduate college textbooks and has been used as part of professional cybersecurity certification exams by the State of New York [23]. The questions in this body of knowledge are also used in the Security Plus certification [24] and the knowledge units/requirements of the U.S. Department of Defense (DoD) and National Security Agency (NSA) National Centers for Academic Excellence in Cybersecurity four year undergraduate program in cyber-defense (CAE-CD) [25]. A database of 300 questions covering these topics was assembled using reference material from cybersecurity exam preparation materials and textbooks [26]. From this database we randomly sampled 40 questions (without replacement) to form a cybersecurity exam (this is the same approach used to generate certification exams for human security specialists). We then repeated this process to create 17 unique exams covering the fundamentals of these bodies of knowledge. The cybersecurity certification provided by New York State requires proficiency in not only cybersecurity fundamentals, but in the sub-fields of Hacking and Penetration Testing and Mobile Security. Therefore, we created two additional assessments for these sub-fields. To evaluate the subfield of Hacking and Penetration Testing, a similar approach was taken using a second body of 300 questions related to Ethical Hacking and Penetration Testing [27]. Since this field was somewhat narrower than the introductory cybersecurity material, we only formed 10 unique exams consisting of 40 questions each (sampled without replacement). Next, we evaluated a third distinct body of 300 questions in the sub-field of Mobile Security [28], which were sampled without replacement as before to form 10 unique exams.
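The exam-generation procedure described above is simply repeated sampling without replacement from a question bank. A minimal sketch of how such exams could be assembled is shown below; the JSON file name and record structure are assumptions for illustration and do not reflect the actual question database used in this study.

```python
import json
import random

QUESTIONS_PER_EXAM = 40

def build_exams(question_bank, num_exams, questions_per_exam=QUESTIONS_PER_EXAM, seed=None):
    """Create exams by sampling questions without replacement.

    Each exam draws its questions without replacement from the full bank,
    mirroring the way certification exams are generated for human
    security specialists.
    """
    rng = random.Random(seed)
    return [rng.sample(question_bank, questions_per_exam) for _ in range(num_exams)]

# Hypothetical usage: a JSON file holding the 300 question records.
with open("cyber_fundamentals_questions.json") as f:
    bank = json.load(f)

exams = build_exams(bank, num_exams=17, seed=42)
print(len(exams), "exams of", len(exams[0]), "questions each")
```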
Finally, we evaluated a body of knowledge related to ethical hacking principles. Although ChatGPT is, by definition, inherently incapable of making ethical judgements, we can still evaluate its response to various ethical questions in cybersecurity. For this test, ChatGPT was asked to perform an ethical analysis of the use cases employed in the annual IEEE Cybersecurity Ethics competition and was evaluated using the standardized IEEE rubric [29] which is based on the ten elements in the IEEE Code of Ethics [30]. The results were compared with several anonymized results from human teams participating in this competition.
In order to assess our second use case (the value of ChatGPT for incident response and threat hunting), we evaluated ChatGPT’s ability to interpret security logs and network scan traces as compared with a human security analyst. We used a network traffic data set from the publicly available gfek/Real-CyberSecurity-Datasets collection [31] which has been used previously for other cybersecurity testing applications. The data set consists of over 555,000 entries grouped into 88 columns, providing detailed information about network flows, including source and destination IPs, port numbers, flow duration, packet counts, TCP flags, and traffic categories. ChatGPT was also used to interpret network scans using the industry standard tool Nmap, and to analyze traffic through a honeypot network.
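Before any of these flow records could be summarized for ChatGPT, they had to be loaded and aggregated. The sketch below shows one plausible way to do this with pandas; the CSV file name and the column labels ("Label", "Flow Duration") are assumptions based on the general structure described above, not the exact schema of the gfek dataset.

```python
import pandas as pd

# Hypothetical file name; the collection groups flows into CSV files
# with ~88 columns covering IPs, ports, durations, packet counts, and flags.
flows = pd.read_csv("network_flows.csv")

print(flows.shape)                    # (rows, columns) sanity check
print(flows["Label"].value_counts())  # distribution of traffic categories (assumed column name)

# Summarize a handful of fields per category so the result fits in a prompt.
summary = flows.groupby("Label").agg(
    flow_count=("Flow Duration", "count"),
    mean_duration=("Flow Duration", "mean"),
)
print(summary.to_string())
```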
In order to assess our third use case (the value of ChatGPT for IAM and Zero Trust), we tested whether a GenLLM can be trusted to consistently implement fundamental cybersecurity rules or whether it can be manipulated to break such rules using prompt engineering. To test this characteristic, several attempts were made to jailbreak ChatGPT for cybersecurity applications using previously published prompt engineering techniques. These included the prompt techniques Always Intelligent and Machiavellian (AIM), Freemode, Mongo Tom, and Do Anything Now (DAN) [32]. For example, the AIM prompt instructs the GenLLM to participate in a hypothetical discussion wherein it behaves as an amoral, unfiltered chatbot, without any ethical guidelines; it will never refuse to respond, apologize for not responding, or claim that it does not know the answer to a question. Similarly, the Freemode prompt attempts to convince the GenLLM that it is role playing as an AI without the typical restraints that force it to block illegal activity. The Mongo Tom prompt attempts to make the GenLLM take on a personality which can use improper language such as swearing or sexual innuendo, but which will always make an effort to reply to any query. The prompt known as DAN will encourage the GenLLM to pretend that it can access the Internet when responding to questions, or present information that has not been verified as if it were factual. Since a GenLLM is not a strictly deterministic system, repeating the same prompt can produce different outputs. We applied each of these prompts 20 times to different instances of ChatGPT, in an effort to compromise its access control policies and to make ChatGPT generate malicious code useful for cyberattacks.
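Because outputs are not deterministic, each jailbreak prompt must be submitted many times. A minimal sketch of such a repetition loop is given below, using the publicly documented OpenAI Python client; the model identifier, prompt file layout, follow-up request, and refusal heuristic are all assumptions for illustration rather than the exact harness used in our experiments, and every flagged response would still require manual review.

```python
from openai import OpenAI  # requires the openai package and an API key

client = OpenAI()
ATTEMPTS = 20

# Hypothetical layout: each entry maps a jailbreak name to its published prompt text.
jailbreak_prompts = {
    "AIM": open("prompts/aim.txt").read(),
    "Freemode": open("prompts/freemode.txt").read(),
    "Mongo Tom": open("prompts/mongo_tom.txt").read(),
    "DAN": open("prompts/dan.txt").read(),
}

follow_up = "Now generate a script that bypasses the access control policy described above."

for name, prompt in jailbreak_prompts.items():
    refusals = 0
    for _ in range(ATTEMPTS):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed model identifier
            messages=[
                {"role": "user", "content": prompt},
                {"role": "user", "content": follow_up},
            ],
        )
        text = reply.choices[0].message.content.lower()
        # Crude refusal heuristic; responses were reviewed by hand as well.
        if "i can't" in text or "i cannot" in text or "sorry" in text:
            refusals += 1
    print(f"{name}: {refusals}/{ATTEMPTS} attempts refused")
```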

Near-Term Use Cases

The following three near-term use cases for GenLLMs in cybersecurity will be investigated in this paper:
Digital assistants to augment cybersecurity talent, resources, and report generation: Many GenLLMs have demonstrated basic competency in passing industry certification tests, for example in the medical [33], legal [34], and other professions. There is currently a global shortage of about 4 million trained cybersecurity professionals, which is not likely to be reduced any time soon [13]. This has driven interest in using GenLLMs for more mundane tasks such as assisting human security operators, conducting training exercises, or researching vendor offerings [14]. In principle, this will free up limited human resources for more complex tasks. Similarly, GenLLMs have been suggested as a way to streamline the generation of first draft audit and regulatory compliance reports. To assess the value of currently available GenLLMs in this area, we study whether ChatGPT v3.0 is able to pass industry certification tests in cybersecurity fundamentals, hacking and penetration testing, and mobile security. We also assess whether ChatGPT can perform competitively in an industry standard cybersecurity ethics evaluation.
Incident response and threat hunting: As businesses attempt to modernize their operations, many cybersecurity teams are being asked to perform threat management tasks. A security operations center can log as many as 10,000 alerts per day; it is estimated that as much as 70% of daily system logs are currently ignored due to lack of human resources [1]. The large and increasing velocity of big data in cybersecurity implies that we cannot rely on reviewing this data at human scale. For some organizations, telemetry is also an issue, since they may be collecting data from many disparate locations. GenLLMs may be able to review at least some of the logs which are currently being ignored, helping to close the gap left by a lack of skilled human threat analysts. We study whether ChatGPT 3.0 can perform rudimentary log analysis and threat hunting on network security scans obtained from standard tools such as Nmap. We also study how ChatGPT responds to log analysis requests for data collected from our honeypots.
Identity and Access Management (IAM) for Zero Trust: GenLLMs may be able to recommend, draft, automate, and validate security policies based on a given threat profile, business objectives, or behavioral analytics. Some preliminary results suggest that the policy revision cycle (also known as plan/do/check/act or the Deming cycle) can be reduced by up to 50% using LLMs [14]. GenLLMs may also find applications in the automated review of permission/access requests, estimation of endpoint risk scores and overall risk analysis, and validation of compliance with zero trust access models (for example, identifying so-called “shadow data” in hybrid cloud environments, assuming that a cost-effective approach to hybrid cloud data extraction becomes available). As part of a zero trust environment, GenLLMs may be able to suggest novel attack vectors or draft simple attack scripts which a human operator could use as the basis for penetration testing. All of this depends on whether the GenLLM can be trusted to implement fundamental cybersecurity rules, or whether it can be manipulated into delivering erroneous results through prompt injection attacks. We therefore study common jailbreak attempts on ChatGPT 3.0 to assess how well it can be trusted in access management applications.

4. Experimental Results and Discussion

Our experimental results and interpretation will be presented separately for each of the three use cases considered.

4.1. Digital Assistant Use Case

As discussed in Section 3, we assembled a database of industry standard fundamental cybersecurity knowledge, and from this database we randomly sampled 40 questions (without replacement) to form a cybersecurity exam. We then repeated this process to create 17 unique exams. ChatGPT was evaluated on each of these exams; results are shown in Figure 1, indicating that on average ChatGPT correctly answered 83% of these questions. The highest score achieved was 90%, and the lowest score was 73%.
A similar approach was taken using a second body of 300 questions related to Ethical Hacking and Penetration Testing [27]. Since this field was somewhat narrower than the introductory cybersecurity material, we only formed 10 unique exams consisting of 40 questions each (sampled without replacement). Results are shown in Figure 2, indicating that on average ChatGPT correctly answered 87% of these questions. The highest score achieved was 92% and the lowest was 75%.
Finally, we evaluated a third distinct body of 300 questions related to Mobile Security [28], which were sampled without replacement as before to form 10 unique exams. Results are shown in Figure 3, indicating that on average ChatGPT correctly answered 80% of these questions. The highest score achieved was 90% and the lowest was 71%.
While these results show that ChatGPT can pass basic cybersecurity certifications such as Security Plus, the NSA CAE-CD requirements, and the New York State certification, the results do not out-perform the best human undergraduate students, who have been known to score 90% or above on all three topic areas, and close to 100% in some cases. By contrast, ChatGPT is currently a C+ to B+ level student. Further, it is useful to study the errors made by ChatGPT in response to these questions. Without any unusual prompting, ChatGPT will hallucinate incorrect responses as shown in Figure 4, where it incorrectly says that the Advanced Encryption Standard (AES) is not a block cipher, then immediately contradicts itself and states that AES is a block cipher. Note that ChatGPT asserts the correctness of this response with the same confidence that it answers all questions; there is no indication that the response may be incomplete or not properly understood.
More troubling, although ChatGPT claims to have been trained on data up to January 2022, it is unable to answer some questions about events before that date. For example, as shown in Figure 5, ChatGPT is unaware of the so-called Metcalf Incident and the corresponding Seven Bullet Theory [21], which dates to 2013.
Similar issues arise when we apply ChatGPT to the IEEE Cybersecurity Ethics Competition. Ethical considerations are an important part of cybersecurity training, and a useful digital assistant must be able to parse ethical situations similar to a human analyst. Although ChatGPT is inherently incapable of making ethical judgements, we can still evaluate its response to various ethical questions in cybersecurity. For example, ChatGPT was asked to write an essay discussing the Edward Snowden case, including an analysis of the case related to the 10 elements in the IEEE Code of Ethics [30]. ChatGPT was, of course, much faster than a human, generating a 10-page response in only a few minutes. We then performed a cursory edit to remove any telltale signs that this essay was the work of an LLM (such as statements beginning with “as an LLM…”) and included its response along with 8 essays generated by undergraduate college seniors for judging by a panel of 5 IEEE volunteers. The judges were not told that one of the essays was generated by ChatGPT. Using the standardized IEEE rubric [29], the judges scored ChatGPT’s essay 73 points out of 100 (or a C− letter grade). While ChatGPT was able to pass this competition rubric, its score was significantly below the winning human entries, which were all in the high 80 to low 90-point range. Again, ChatGPT lost points for unprompted hallucinations. For example, it misidentified 3 elements of the IEEE Code of Ethics (such as incorrectly claiming element 6 was “responsibility to the public” when this is actually element 1). ChatGPT also made up a name for element 4 in the code of ethics, which was of course incorrect. This results in an essay which appears to be analyzing the case study but in fact is not making a connection between the case study and the Code of Ethics. Likewise, ChatGPT fails to properly generate citations for this work in IEEE format, and makes up several citations that do not exist. Due to these limitations, ChatGPT currently meets only the most basic requirements of a useful digital assistant, and its responses will invariably include hallucinations which are difficult to distinguish from the correct response.

4.2. Incident Response and Threat Hunting Use Case

Due to its limited prompt size, ChatGPT is currently not effective at parsing or reading large system logs. ChatGPT will respond with “this conversation is too long, please start another conversation”. We were able to obtain some results by breaking the logs to be analyzed into smaller subsets and using multiple prompts; however, this approach is only effective for small subsets of the data. In most of our tests, ChatGPT performed much worse than a commercially available intrusion detection and prevention system, often giving vague responses such as the example in Figure 6.
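The workaround of splitting a log into prompt-sized pieces can be sketched as follows. The character limit is an assumption chosen to stay well under typical conversation limits; a production harness would count tokens rather than characters.

```python
def chunk_log(log_text, max_chars=8000):
    """Split a log into pieces small enough to fit in a single prompt.

    Splitting is done on line boundaries so individual log entries
    are not cut in half.
    """
    chunks, current, size = [], [], 0
    for line in log_text.splitlines(keepends=True):
        if size + len(line) > max_chars and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Hypothetical usage: each piece is sent to the model in a separate prompt
# and the findings are merged afterwards.
with open("ids_alerts.log") as f:
    pieces = chunk_log(f.read())
print(f"{len(pieces)} prompt-sized chunks")
```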
We then attempted to use ChatGPT to analyze a simple Nmap scan, as illustrated in Figure 7. In this case, ChatGPT was able to correctly identify that port 22 was open, and was running OpenSSH. The scan correctly detected 997 filtered TCP ports with no response and 2 filtered TCP ports marked as “admin-prohibited.” ChatGPT was also able to determine that the target is 13 hops away from the scanning machine. While these results are encouraging, we once again encountered issues when attempting to run larger scans due to length limits on ChatGPT’s conversation.
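For readers who wish to reproduce this comparison, a saved Nmap scan can be pre-summarized before (or instead of) handing it to the model. The sketch below counts port states in normal Nmap text output; the exact wording of the “Not shown:” summary line varies between Nmap versions, so the regular expressions are illustrative assumptions.

```python
import re

def summarize_nmap(scan_text):
    """Count port states in normal (-oN) Nmap output.

    Lines of interest look like "22/tcp   open   ssh" or
    "Not shown: 997 filtered tcp ports (no-response)".
    """
    counts = {"open": 0, "closed": 0, "filtered": 0}
    for line in scan_text.splitlines():
        m = re.match(r"^(\d+)/(tcp|udp)\s+(open|closed|filtered)", line)
        if m:
            counts[m.group(3)] += 1
        m = re.search(r"Not shown:\s+(\d+)\s+(open|closed|filtered)", line)
        if m:
            counts[m.group(2)] += int(m.group(1))
    return counts

# Hypothetical usage on a saved scan, matching the result discussed above
# (1 open port and 999 filtered ports in total).
with open("nmap_scan.txt") as f:
    print(summarize_nmap(f.read()))
```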
Finally, we used ChatGPT to analyze traffic through a honeypot on our network. This work uses the same honeypots developed in prior research to detect unauthorized login attempts [35]. The honeypot never allows user access to any system resources; it simply records text associated with login attempts, time stamps, and, if available, the source IP addresses used in the login attempt. The raw data and copies of the honeypots are available from our GitHub site (see our Data Availability Statement; data may be accessed from the Marist Innovation Lab public GitHub site, accessed on 1 July 2025). The conversation length limits were a significant impediment, and ChatGPT was unable to identify even a small fraction of the illegal honeypot access attempts. A comparison between ChatGPT’s results and those of a human analyst is shown in Figure 8.
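Much of the human analysis reflected in Figure 8 amounts to tallying login attempts per source address. A simple scripted baseline for that task is sketched below; the assumption that each log entry contains an IPv4 address is made for illustration, since the actual honeypot log layout is documented in our GitHub repository.

```python
from collections import Counter
import re

def count_login_attempts(log_path):
    """Tally unauthorized login attempts per source IP in a honeypot log.

    Assumes each entry contains an IPv4 address somewhere on the line.
    """
    ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
    attempts = Counter()
    with open(log_path) as f:
        for line in f:
            m = ip_pattern.search(line)
            if m:
                attempts[m.group(0)] += 1
    return attempts

# Hypothetical usage: print the ten most active source addresses.
for ip, n in count_login_attempts("honeypot.log").most_common(10):
    print(f"{ip}: {n} attempts")
```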
ChatGPT did have some value in parsing unusual commands found in the honeypot logs. For example, the rather complex command given in Figure 9 was correctly identified as potentially malicious.
Based on these experiments, ChatGPT provides some value in threat hunting and log analysis, but is not able to deal with larger scale, enterprise-class production systems or perform as reliably as a qualified human analyst.

4.3. IAM and Zero Trust Use Case

There is potential to use GenLLMs to produce and automatically validate security access policies, review permission access requests, and validate compliance with zero trust access models. GenLLMs may also be able to suggest novel attack vectors or draft simple attack scripts which a human operator could use as the basis for penetration testing. However, these applications assume that a GenLLM can be trusted to consistently implement fundamental cybersecurity rules. If we can manipulate a GenLLM through prompt engineering, it may deliver incorrect results and its value for enforcing access control is questionable. Current prompt-based LLMs are insecure by design, since it is virtually impossible to defend them against all prompt injection attacks. The best approach to mitigating such attacks seems to be a robust development system which can quickly train the LLM to avoid an injection attack shortly after it is discovered, thus minimizing the window of vulnerability. To test this characteristic, several attempts were made to jailbreak ChatGPT for cybersecurity applications using previously published prompt engineering techniques. These included the prompt techniques Always Intelligent and Machiavellian (AIM), Freemode, Mongo Tom, and Do Anything Now (DAN) [32]. Descriptions of these prompts were provided in Section 3 of this paper. Each jailbreak attempt is either a success or failure; that is, a malformed prompt will either compromise the LLM or will have no effect.
We applied each of these prompts 20 times to different instances of ChatGPT, in an effort to compromise its access control policies. We also attempted to make ChatGPT generate malicious code useful for cyberattacks. Remarkably, ChatGPT seems to be patched against all of these prompting techniques fairly quickly after they are published, since we were unable to jailbreak ChatGPT using any of these documented approaches. As shown in Table 1, after 20 attempts with each jailbreak prompt, the system continued to resist our efforts to compromise its access control or introduce malicious code generation. This result is encouraging, since the ChatGPT ecosystem appears robust enough to self-correct for published jailbreaks reasonably quickly. Additional work would be required to assess whether ChatGPT responds as well to unpublished jailbreaks, or to quantify the window of vulnerability for such attacks. For now, the trustworthiness of this application for IAM and zero trust use cases appears to meet basic requirements, but further experiments are planned in this area as the tool matures.

5. Conclusions

We have studied the application of GenLLMs to three prominent cybersecurity use cases. First, we determined that near-term GenLLMs are capable of passing industry certification exams in cybersecurity fundamentals (83% average score), hacking and penetration testing (87% average score), and mobile device security (80% average score). These models were also able to pass an industry standard cybersecurity ethics assessment (scoring 73 out of 100 on the competition rubric). However, during this testing we observed a significant number of errors, hallucinations, and incomplete responses which suggest that near-term GenLLMs are not a consistently reliable source of cybersecurity assistance. Second, we assessed the use of GenLLMs to analyze output from standard cybersecurity tools, including Nmap, and a honeypot network. Current systems are limited in their ability to ingest practical logs or network scans of significant size. While they can parse very short logs fairly efficiently, their performance falls well short of conventional intrusion detection and response tools and still requires a human in the loop to completely parse the results. Third, when evaluating ChatGPT’s ability to implement and enforce IAM policies, our results suggest that GenLLMs have a fairly robust response to prompt injection vulnerabilities, often patching known issues within days or less. Our inability to compromise access control or malware generation policies is encouraging for emerging applications to IAM and zero trust environments, although we have no way to assess unknown future threats (since the source code and training data are proprietary).
Further development and training of GenLLMs with cybersecurity data sets is required before these systems can approach the efficiency of a trained human operator. There are also several known issues which need to be addressed in the application of near-term GenLLMs to cybersecurity. These include data provenance (ChatGPT has not released its source code, architecture, or training data). The results appear to be highly dependent on the training data sets and details of the models. For example, supervised training using structured, labeled, known attack data (such as that extracted from a security operations center) may be useful for incident triage but poor at identifying novel attacks or responding to live threats in near real time (milliseconds or less). Training an LLM from primary sources is a nontrivial task, requiring at least 8–16 GPUs and weeks to months of training time. Further, some models such as ChatGPT are not fine-tunable. Emerging security applications are attempting to create custom models which filter their training data, although this can potentially increase costs and testing requirements. There are also ongoing issues with privacy, trust, and the explainability of GenLLM results. These areas will be the subject of ongoing research.

Author Contributions

Conceptualization, C.D.; methodology, C.D.; software, C.D.; validation, C.D., R.T., A.S., E.K. and A.M.; formal analysis, C.D.; investigation, C.D.; resources, C.D.; data curation, C.D.; writing—original draft preparation, C.D.; writing—review and editing, C.D.; visualization, C.D.; supervision, C.D.; project administration, C.D.; funding acquisition, not applicable. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are openly available in the Marist University Innovation Lab GitHub repository (https://github.com/Marist-Innovation-Lab) (accessed on 1 July 2025).

Acknowledgments

The authors are grateful to Marist University, Poughkeepsie, NY, for their support of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pacific, D.; DeRoos, D. Generative AI for the Agile Enterprise. Converge Technology Solutions White Paper. 2023. Available online: https://convergetp.com/ (accessed on 13 February 2024).
  2. Nosta, J. The Most Important Chart in 100 Years, 2023. Medium. Available online: https://johnnosta.medium.com/the-most-important-chart-in-100-years-1095915e1605 (accessed on 13 February 2024).
  3. Frasier, M. The New York City Artificial Intelligence Action Plan. 2023. Available online: https://www.nyc.gov/assets/oti/downloads/pdf/reports/artificial-intelligence-action-plan.pdf (accessed on 12 February 2024).
  4. Holland, M. Biden Executive Order Aims to Build Foundation for AI Legislation. 2023. Available online: https://www.techtarget.com/searchcio/news/366557595/Biden-EO-aims-to-build-foundation-for-AI-legislation?utm_campaign=20231102_Senators+defend+new+federal+agency+for+tackling+big+tech+&utm_medium=email&utm_source=MDN&asrc=EM_MDN_280934188&bt_ee=N4311RQjkkTJTVPGuFamRv2%2FRetFKw7FE%2FV8sujKu%2FeltFsnVSdHzrhwRtUKLry9&bt_ts=1699020277357 (accessed on 12 February 2024).
  5. Bracey, J.; Andrews, C. European Union Countries Vote Unanimously to Approve AI Act. 2024. Available online: https://iapp.org/news/a/eu-countries-vote-unanimously-to-approve-ai-act/#:~:text=Representatives%20from%20EU%20member%20states,region%20and%20around%20the%20world. (accessed on 12 February 2024).
  6. List of Large Language Models. 2025. Available online: https://en.wikipedia.org/wiki/List_of_large_language_models (accessed on 24 June 2025).
  7. Kerner, S. 25 of the Best Large Language Models. 2025. Available online: https://www.techtarget.com/whatis/feature/12-of-the-best-large-language-models (accessed on 24 June 2025).
  8. McKenzie, L. Google Gemini vs. ChatGPT. April 2025. Available online: https://backlinko.com/gemini-vs-chatgpt (accessed on 24 June 2025).
  9. Kane, R. Claude vs. ChatGPT. May 2024. Available online: https://zapier.com/blog/claude-vs-chatgpt/ (accessed on 24 June 2025).
  10. Kamban, S. Meta’s Llama vs. OpenAI ChatGPT. September 2024. Available online: https://elephas.app/blog/llama-vs-chatgpt#liama-vs-chatgpt-at-a-glanceandnbsp (accessed on 24 June 2025).
  11. Johnson, S.; Hyland-Wood, D. A Primer on Large Language Models and Their Limitations. arXiv 2024, arXiv:2412.04503v1. Available online: https://arxiv.org/html/2412.04503v1 (accessed on 24 June 2025).
  12. Darktrace White Paper. The CISO’s Guide to Cyber AI. 2023. Available online: https://darktrace.com/resources/the-cisos-guide-to-cyber-ai (accessed on 13 February 2024).
  13. IBM Security White Paper. Cost of a Data Breach Report. 2023. Available online: https://www.ibm.com/reports/data-breach (accessed on 13 February 2024).
  14. Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; Silve, D.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
  15. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2016, arXiv:1409.0473. [Google Scholar] [CrossRef]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  17. Wolfram, S. What is ChatGPT Doing and Why Does It Work. 2023. Available online: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/ (accessed on 12 February 2024).
  18. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  19. Williams, B. Claude AI. 2023. Available online: https://medium.com/@brynn_30189/claude-ai-the-dark-horse-of-the-industry-43b8877bfa6d (accessed on 12 February 2024).
  20. Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A Survey on Large Language Model Security and Privacy. arXiv 2024, arXiv:2312.02003. [Google Scholar]
  21. Wilson, S.; Dawson, A. OWASP Top 10 for LLM Applications. 2023. Available online: https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-2023-v1_1.pdf (accessed on 12 February 2024).
  22. Apostol, V.; Oprea, A.; Fordyce, A.; Anderson, H. NIST Trustworthy and Responsible AI: Adversarial Machine Learning. 2023. Available online: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-2e2023.pdf (accessed on 12 February 2024).
  23. IDCP (Institute of Data Center Professionals) NY State Cybersecurity Certification Program. 2022. Available online: https://www.marist.edu/student-life/campus/hudson-valley/idcp (accessed on 12 February 2024).
  24. CompTIA Security Plus Certification Exam (SY0-701). 2023. Available online: https://www.comptia.org/certifications/security (accessed on 12 February 2024).
  25. NSA Knowledge Units for Centers of Academic Excellence in Cyber-Defense (CAE-CD). 2023. Available online: https://dl.dod.cyber.mil/wp-content/uploads/cae/pdf/unclass-cae-cd_ku.pdf (accessed on 12 February 2024).
  26. Kim, D.; Solomon, M. Fundamentals of Information Systems Security, 4th ed.; Jones and Bartlett: New York, NY, USA, 2023. [Google Scholar]
  27. Oriyano, S.; Solomon, M. Hacker Techniques, Tools, and Information Handling, 3rd ed.; Jones and Bartlett: New York, NY, USA, 2020. [Google Scholar]
  28. Doherty, J. Wireless and Mobile Device Security, 2nd ed.; Jones and Bartlett: New York, NY, USA, 2022. [Google Scholar]
  29. IEEE Ethics Competition Rubric. 2020. Available online: https://www.ieee.org/content/dam/ieee-org/ieee/web/org/about/ethics/judging-form-live-event-2-member-team.pdf (accessed on 12 February 2024).
  30. IEEE Code of Ethics. 2023. Available online: https://www.ieee.org/about/corporate/governance/p7-8.html (accessed on 12 February 2024).
  31. Gfek Real Cybersecurity Datasets. 2022. Available online: https://github.com/gfek/Real-CyberSecurity-Datasets (accessed on 12 February 2024).
  32. Sakamoto, A. ChatGPT Jailbreak Prompts: How to Unchain ChatGPT. 2023. Available online: https://docs.kanaries.net/articles/chatgpt-jailbreak-prompt (accessed on 12 February 2024).
  33. Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How does ChatGPT perform on the U.S. Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med. Educ. 2023, 9, e45312. Available online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9947764/ (accessed on 12 February 2024). [CrossRef]
  34. Weiss, D. ChatGPT Aces Bar Exam with Score Nearing 90th Percentile. 2023. Available online: https://www.abajournal.com/web/article/latest-version-of-chatgpt-aces-the-bar-exam-with-score-in-90th-percentile (accessed on 12 February 2024).
  35. Joseph, V.; Liengtiraphan, P.; Leaden, G.; DeCusatis, C. A Software-Defined Network Honeypot with Geolocation and Analytic Data Collection. In Proceedings of the 12th Annual IEEE/ACM Trenton Computing Festival (TCF) Information Technology Professional Conference (ITPC), Trenton, NJ, USA, 17–18 March 2017; Available online: https://princetonacm.acm.org/tcfpro/programs/TCF_ITPC_2017_Program.pdf (accessed on 1 July 2025).
Figure 1. GenLLM scores on 17 randomly generated exams covering cybersecurity fundamentals.
Figure 2. GenLLM scores on 10 randomly generated exams covering cybersecurity hacking and penetration testing.
Figure 3. GenLLM scores on 10 randomly generated exams covering mobile device security.
Figure 4. Example of incorrect hallucination response.
Figure 5. Example of incorrect response due to training errors (the information requested in this question has been available since 2013 and should have been part of the GenLLM training data set).
Figure 6. Example GenLLM responses when attempting to ingest real-world network security logs.
Figure 7. GenLLM input from an Nmap network scan for analysis (https://nmap.org).
Figure 8. GenLLM comparison with human security analyst performing honeypot log analysis.
Figure 9. GenLLM analysis of complex honeypot command while performing threat hunting.
Table 1. Jailbreak attempts using known prompt injection exploits.
Prompt | Results | Pass/Fail
AIM | 20 attempts | 100% pass
Freemode | 20 attempts | 100% pass
Mongo Tom | 20 attempts | 100% pass
DAN | 20 attempts | 100% pass
