Article
Peer-Review Record

Cybersecurity Applications of Near-Term Large Language Models

Electronics 2025, 14(13), 2704; https://doi.org/10.3390/electronics14132704
by Casimer DeCusatis *, Raymond Tomo, Aurn Singh, Emile Khoury and Andrew Masone
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 30 May 2025 / Revised: 24 June 2025 / Accepted: 2 July 2025 / Published: 4 July 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The experimental methodology for this paper is well described. Some questions are raised regarding the selection of the GenLLM tool and its validation through the use of exam questions. The authors should consider the following comments for further processing and improvement of manuscript electronics-3703234:

1) Lines 23-131 (Introduction):

  1. The Introduction section does not provide any insight into other GenLLM models or describe the academic and research innovation that this manuscript contributes. The authors need to show how this manuscript provides something new in the specific scientific field.
  2. The Introduction should have a subsection describing the structure of manuscript.
  3. The 3 near-term use cases for GenLLMs, which are described in much detail, should be part of a separate section or subsection. This description really shows the research methodology followed and could be briefly mentioned in the Introduction but analysed in more detail in a separate section or subsection.
  4. The Literature Review (Lines 79-131, Section 1.1. Literature Review) should be a separate section or subsection.
  5. The authors should consider removing the description of the 3 near-term use cases for GenLLMs and the Literature Review from the Introduction and creating a separate section or subsection(s). In that case, the Introduction text (Lines 23-36) will have to be expanded further.

2) Line 95 (1.1 Literature Review):

  1. The literature review should not focus only on ChatGPT. While ChatGPT is widely used, the authors should consider using or analysing other AI/GenLLM tools as well. A list of available AI/GenLLM tools should be given to show that the authors have a thorough understanding of GenLLM models and the current market footprint. Such GenLLM tools could include Google Gemini, DeepSeek, etc. Any statistical data showcasing the use of available GenLLMs would be useful. The authors do not provide sufficient justification for why they are only using ChatGPT in their research. A comparison of the different commercially available GenLLMs should be given, listing advantages/disadvantages, etc.

3) Lines 132-195 (2. Materials and Methods):

  1. What is described in Section 2 is really the experimental methodology implemented. It is well described but requires some more input on why this methodology was selected. Also, perhaps the description of the 3 near-term use cases for GenLLMs from the Introduction could be part of this section.
  2. The authors should describe the parameters considered for the selection of the exam questions for the 3 use cases for the evaluation of the performance of the GenLLM (ChatGPT).
  3. In order to check the repeatability of the experimental phase of this publication, the authors could consider providing the type and list of questions used for the validation of the GenLLM (ChatGPT) performance as part of an appendix.

4) Lines 199-370 (3.1. Digital Assistant Use Case/ 3.2. Incident Response and Threat Hunting Use Case / 3.3. IAM and Zero Trust Use Case):

  1. A list and description of the typical GenLLM errors observed should be provided. While there is some discussion on the subject, it is not conclusive, nor does it fully capture the failures of the selected GenLLM (ChatGPT).
  2. Lines 207-208 (Figure 1): Increase size of Figure for legibility.
  3. Lines 217-218 (Figure 2): Increase size of Figure for legibility.
  4. Lines 225-226 (Figure 3): Increase size of Figure for legibility.

 

Author Response

Comment 1: Lines 23-131 (Introduction): The Introduction section does not provide any insight into other GenLLM models or describe the academic and research innovation that this manuscript contributes. The authors need to show how this manuscript provides something new in the specific scientific field.

Response 1: We thank the reviewer for their constructive comments.  We have reorganized the introduction (lines 23-103 in the revised manuscript) and added new material, including the hypothesis for our research (lines 50-52), a contextual list of other LLMs, justification for using ChatGPT,  and the novel aspects of this work (lines 55-63).  Per other review comments, the detailed description of our 3 use cases has been moved to another section. 

Comment 2: The Introduction should have a subsection describing the structure of manuscript.

Response 2: We have added a description of the manuscript structure to the Introduction (lines 100-104).

Comment 3: The 3 near-term use cases for GenLLMs, which are described in much detail, should be part of a separate section or subsection. This description really shows the research methodology followed and could be briefly mentioned in the Introduction but analysed in more detail in a separate section or subsection.

Response 3: The 3 near-term use case descriptions have been moved to Section 3.1, a subsection of Section 3, Materials and Experimental Methods (this was done at the request of another reviewer as well).

Comment 4: The Literature Review (Lines 79-131, Section 1.1. Literature Review) should be a separate section or subsection.

Response 4: The Literature Review now has its own section (Section 2), following the guidelines in the Electronics journal template. 

Comment 5: The authors should consider removing the description of the 3 near-term use cases for GenLLMs and the Literature Review from the Introduction and creating a separate section or subsection(s). In that case, the Introduction text (Lines 23-36) will have to be expanded further.

Response 5: The Literature Review now has its own section (Section 2). The 3 near-term use case descriptions have been moved to Section 3.1, a subsection of Section 3, Materials and Experimental Methods (this was done at the request of another reviewer as well). The Introduction has been further expanded with additional references added as noted in the response to Comment 1.

 

Comment 6: Line 95 (1.1 Literature Review): The literature review should not focus only on ChatGPT. While ChatGPT is widely used, the authors should consider using or analysing other AI/GenLLM tools as well. A list of available AI/GenLLM tools should be given to show that the authors have a thorough understanding of GenLLM models and the current market footprint. Such GenLLM tools could include Google Gemini, DeepSeek, etc. Any statistical data showcasing the use of available GenLLMs would be useful. The authors do not provide sufficient justification for why they are only using ChatGPT in their research. A comparison of the different commercially available GenLLMs should be given, listing advantages/disadvantages, etc.

Response 6: We have added a list of currently available LLMs with a citation covering over 60 different LLMs currently documented (lines 31-38). Other authors have performed extensive comparative analysis of different LLMs, which we cite in our revised paper. This motivates our selection of ChatGPT for our initial study (lines 40-50).

Comment 7: Lines 132-195 (2. Materials and Methods): What is described in Section 2 is really the experimental methodology implemented. It is well described but requires some more input on why this methodology was selected. Also, perhaps the description of the 3 near-term use cases for GenLLMs from the Introduction could be part of this section.

Response 7: The term “experimental methods” was added to the title of Section 2. Additional material on experimental methodology selection has been added in lines 163-196, 198-206, and 208-228 of the revised manuscript. This should make it clear that we adopted this methodology because it is the same approach used to evaluate human security specialists, and as stated in our research hypothesis, we are attempting to determine if there are near-term use cases for which ChatGPT offers value to cybersecurity. In other words, our methodology to determine whether ChatGPT can pass an industry-standard certification exam uses the same approach and body of knowledge required of anyone attempting to pass such certification exams. The description of the 3 near-term use cases has been moved to this section as requested.

Comment 8: The authors should describe the parameters considered for the selection of the exam questions for the 3 use cases for the evaluation of the performance of the GenLLM (ChatGPT).

Response 8: Lines 163-196 and 279-281 have been updated in the revised manuscript to clarify that the selection process used to generate exam questions from a body of knowledge is the same process used when generating exam questions for human security specialists. 

Comment 9: In order to check the repeatability of the experimental phase of this publication, the authors could consider providing the type and list of questions used for the validation of the GenLLM (ChatGPT) performance as part of an appendix.

Response 9: The body of knowledge containing potential questions used for the various certification exams and for the incident response testing has been cited in references 23-31 so that others may reproduce this work. As noted in our Data Availability statement, the data that support the findings of this study are openly available in the Marist University Innovation Lab GitHub repository.

Comment 10: Lines 199-370 (3.1. Digital Assistant Use Case / 3.2. Incident Response and Threat Hunting Use Case / 3.3. IAM and Zero Trust Use Case): A list and description of the typical GenLLM errors observed should be provided. While there is some discussion on the subject, it is not conclusive, nor does it fully capture the failures of the selected GenLLM (ChatGPT).

Response 10: Additional discussion has been provided in this section. A discussion of LLM limitations and risks has been added to the introduction (lines 64-92), and subsequent errors in our testing map to these areas. We analyzed all the errors produced by ChatGPT in the digital assistant use case, and provide representative examples of the error classifications we found (such as hallucinations and limited knowledge). The threat hunting use case was severely limited by computational limits as noted in our manuscript (lines 358-364). The IAM use case was robust against the most commonly known prompt injection jailbreaks, so there are no failures to analyze, but we note potential weaknesses in this use case discussion (lines 431-444).

Comment 11: Lines 207-208 (Figure 1): Increase size of Figure for legibility.

Response 11: The size of Figure 1 has been increased for legibility as requested.

Comment 12: Lines 217-218 (Figure 2): Increase size of Figure for legibility.

Response 12: The size of Figure 2 has been increased for legibility as requested.

Comment 13: Lines 225-226 (Figure 3): Increase size of Figure for legibility.

Response 13: The size of Figure 3 has been increased for legibility as requested.

 

Reviewer 2 Report

Comments and Suggestions for Authors

The article "Cybersecurity Applications of Near Term Large Language Models" addresses a highly relevant topic, the use of generative language models (GenLLMs), such as ChatGPT, in the field of cybersecurity. Given the global shortage of qualified professionals and the growing volume of data, the interest in automating analytical and routine tasks using AI is increasingly justified. The authors explore three practical scenarios: using GenLLMs as digital assistants, for threat and log analysis, and within identity and access management (IAM) systems and the Zero Trust framework.

Despite the relevance of the topic, the scientific novelty of the work is questionable. The study focuses primarily on showcasing the capabilities of a single model, ChatGPT 3.0, without sufficient theoretical grounding, comparison with alternative tools, or clear formulation of a scientific research problem. The key shortcomings of the article are outlined below.

The article fails to discuss alternative approaches or to compare the proposed approach with existing AI-based cybersecurity tools.

  1. There is no examination of known limitations and risks associated with LLMs as discussed in academic literature (e.g., lack of explainability, model bias, ethical concerns).

  2. The introduction does not clearly define a specific research hypothesis or scientific problem; the work reads more like a practical demonstration of ChatGPT capabilities.

  3. The research objectives are vaguely stated, with no precise problem formulation or research questions.

  4. There is no articulation of the study’s scientific novelty—it remains unclear what gap in existing knowledge the paper aims to address or how it advances the field.

  5. The study relies solely on ChatGPT 3.0, with no comparison to other LLMs (e.g., Gemini, Claude, LLaMA), which limits generalizability.

  6. The jailbreak testing methods lack formalization—there are no clear success/failure criteria or analysis of result variability.

  7. The experimental procedures are insufficiently validated: no replication, no error analysis, and no statistical significance testing.

  8. In the incident response use case, log and honeypot analyses are superficial, lacking robust scenario-based evaluation or large-scale log processing tests.

  9. The authors make broad and optimistic claims about the applicability of GenLLMs in cybersecurity despite notable limitations and mediocre performance in evaluations.

  10. Some conclusions are not fully supported by the presented data; for example, the claim that ChatGPT is a useful digital assistant is undermined by its modest exam scores (C+/B+) and hallucination issues.

  11. In the IAM and jailbreak case, the conclusion that the model is "secure" is premature and lacks depth; no analysis of unknown attack vectors or model robustness against future exploits is provided.

 

Author Response

Comment 1:There is no examination of known limitations and risks associated with LLMs as discussed in academic literature (e.g., lack of explainability, model bias, ethical concerns).

Response 1: We thank the reviewer for their constructive comments. A discussion of known LLM risks and limitations has been added to the introduction, along with several additional references (lines 64-92).  These risks and limitations are subsequently referenced when discussing the results of our work.

Comment 2: The introduction does not clearly define a specific research hypothesis or scientific problem; the work reads more like a practical demonstration of ChatGPT capabilities.

Response 2: A hypothesis statement has been added to the introduction (line 51), as well as a discussion of novelty in this work (lines 51-63).

Comment 3: The research objectives are vaguely stated, with no precise problem formulation or research questions.

Response 3: A section has been added to the Introduction stating our hypothesis and the novelty of this work (lines 51-63). We note prior research dedicated to understanding whether ChatGPT can pass a medical or legal certification (line 237), which is conceptually similar to our approach of determining whether ChatGPT can pass various cybersecurity and ethical certifications. A detailed discussion of our three near-term use cases is provided in the methodology section of the manuscript (lines 232-274), as requested by other reviewers.

Comment 4: There is no articulation of the study’s scientific novelty—it remains unclear what gap in existing knowledge the paper aims to address or how it advances the field.


Response 4: A list of novel outcomes has been added to the Introduction (lines 51-63).

Comment 5: The study relies solely on ChatGPT 3.0, with no comparison to other LLMs (e.g., Gemini, Claude, LLaMA), which limits generalizability.

Response 5: As part of the introduction to the revised paper, we note that there are currently over 60 GenLLMs available (lines 31-38 and related citations). It is not practical for us to study them all. We have chosen to focus on ChatGPT for the reasons described in the revised paper introduction (lines 40-50).

Comment 6: The jailbreak testing methods lack formalization—there are no clear success/failure criteria or analysis of result variability.

Response 6: We have added some text (lines 413-443) to clarify that each jailbreak attempt is either a success or a failure. A successful attempt means that the LLM will now respond to prompts for information which should be prohibited (for example, instructions on how to make a bomb). An unsuccessful attempt does not cause a change in LLM behavior. We tested 4 different jailbreaks, as described in this paper, and ran each one 20 times. Results are shown in Table 1.
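To make this pass/fail criterion concrete, a minimal test-harness sketch is shown below. The function query_llm, the prompt placeholders, and the refusal-marker heuristic are illustrative assumptions only, not the authors' actual code or the ChatGPT API.

```python
# Hypothetical sketch of the pass/fail jailbreak testing described above.
from collections import Counter

JAILBREAK_PROMPTS = {
    "jailbreak_1": "<published jailbreak prompt #1>",
    "jailbreak_2": "<published jailbreak prompt #2>",
    "jailbreak_3": "<published jailbreak prompt #3>",
    "jailbreak_4": "<published jailbreak prompt #4>",
}
PROHIBITED_PROBE = "<request for content the model should refuse>"

def query_llm(prompt: str) -> str:
    """Stand-in for a call to the model under test; replace with a real client."""
    return "I'm sorry, I can't help with that."  # canned refusal for illustration

def attempt_succeeded(response: str) -> bool:
    # Success means the model answers the prohibited probe instead of refusing;
    # these refusal markers are an assumed heuristic, not a formal criterion.
    refusal_markers = ("i can't", "i cannot", "i'm sorry", "not able to")
    return not any(marker in response.lower() for marker in refusal_markers)

def run_trials(trials_per_jailbreak: int = 20) -> dict:
    """Apply each jailbreak, then probe; tally success/failure per jailbreak."""
    results = {}
    for name, jailbreak in JAILBREAK_PROMPTS.items():
        outcomes = Counter()
        for _ in range(trials_per_jailbreak):
            query_llm(jailbreak)                    # attempt the jailbreak
            response = query_llm(PROHIBITED_PROBE)  # probe for prohibited output
            outcomes["success" if attempt_succeeded(response) else "failure"] += 1
        results[name] = dict(outcomes)
    return results

if __name__ == "__main__":
    print(run_trials())
```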

Comment 7: The experimental procedures are insufficiently validated: no replication, no error analysis, and no statistical significance testing.

Response 7: The revised manuscript includes discussion of our replication, error analysis, and statistical calculations.  For example, we kindly draw the reviewer’s attention to the Experimental Results section of our manuscript and note the following examples of replication and statistical analysis:

The digital assistant use case was run 17 times using randomly generated exams of 40 questions each, sampled without replacement from the certification knowledge base cited in our manuscript. This is the same methodology used to certify human cybersecurity experts. The score for each trial is given in Figure 1, which also graphically displays the distribution of scores, and the maximum and minimum scores are presented in the text. There are no error bars on this data, since each exam question is either marked correct or incorrect. Since the industry-standard exam is a pass/fail test, we actually provide a contextual letter grade equivalent, which is more data than a real certification exam would provide. Similar results are shown for the hacking/pen testing certification and mobile security certification (10 trials of randomly generated 40-question exams in each case). As discussed in the text, we analyzed all incorrect responses; examples provided in the text are attributable to hallucinations or gaps in training data sets.
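For illustration, a minimal sketch of this exam-sampling and scoring procedure appears below; the question-bank structure, the grading rule, and the letter-grade cut-offs are assumptions for the sketch, not the authors' certification materials.

```python
# Minimal sketch, assuming a question bank of {"question": ..., "answer": ...}
# entries drawn from the cited body of knowledge.
import random

QUESTION_BANK = [
    # populate from the certification knowledge base cited in the manuscript
]

def build_exam(bank, n_questions=40, seed=None):
    """Sample a 40-question exam without replacement, as described above."""
    rng = random.Random(seed)
    return rng.sample(bank, n_questions)

def is_correct(model_answer: str, reference_answer: str) -> bool:
    """Placeholder grading rule; each question is simply right or wrong."""
    return model_answer.strip().lower() == reference_answer.strip().lower()

def score_exam(responses, exam):
    """Return the percentage score and a contextual letter grade (assumed cut-offs)."""
    correct = sum(is_correct(r, q["answer"]) for r, q in zip(responses, exam))
    pct = 100.0 * correct / len(exam)
    for cutoff, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if pct >= cutoff:
            return pct, grade
    return pct, "F"
```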

The cybersecurity ethics exam statistics were also discussed in the manuscript (lines 332-356). Likewise, we provide a statistical summary of the jailbreak results in Table 1 and the nearby text. For example, we note that each jailbreak was attempted 20 times and always yielded the same result, demonstrating that our results are repeatable.

Comment 8: In the incident response use case, log and honeypot analyses are superficial, lacking robust scenario-based evaluation or large-scale log processing tests.

Response 8: Please refer to our discussion of the incident response use case in the revised manuscript (lines 357-393). Due to computational constraints, as described in our paper's introduction (lines 89-91), GenLLMs have limits on the amount of text they can process in a single input or output. We attempted to analyze large-scale logs, but the LLMs were unable to parse the input. Similarly, current models lack the ability to ingest logs representing real-world scenarios. We noted this result from our testing, and provided representative examples from smaller logs which the GenLLM was able to ingest.
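As an illustration of working within such input limits, the sketch below splits an oversized log into newline-aligned chunks under a rough token budget before each piece is submitted to the model separately. The characters-per-token heuristic and the chunk budget are assumptions for the sketch, not values from the manuscript.

```python
# Rough sketch of chunking a large log to fit an LLM's limited context window.
def chunk_log(log_text: str, max_tokens: int = 3000, chars_per_token: int = 4):
    """Yield newline-aligned chunks that stay under an approximate token budget."""
    budget = max_tokens * chars_per_token
    current, size = [], 0
    for line in log_text.splitlines(keepends=True):
        if size + len(line) > budget and current:
            yield "".join(current)
            current, size = [], 0
        current.append(line)  # a single oversized line is kept as its own chunk
        size += len(line)
    if current:
        yield "".join(current)

# Each chunk could then be sent to the model client separately, e.g.:
# for i, chunk in enumerate(chunk_log(open("honeypot.log").read())):
#     summary = ask_model(f"Summarize suspicious activity in log part {i + 1}:\n{chunk}")
# where ask_model is a placeholder for whatever LLM client is in use.
```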

Comment 9: The authors make broad and optimistic claims about the applicability of GenLLMs in cybersecurity despite notable limitations and mediocre performance in evaluations.

Response 9: It was never our intention to make broad claims which are not supported by our data. In response to this comment, we have revised the language in our conclusions (lines 457-490) and attempted to align our experimental results with expectations for a real-world system. We also note that statements in the introductory part of the paper which refer to the potential value of LLMs in cybersecurity should not be confused with the actual short-term applicability as described in our three use cases. Please let us know if you have other specific examples from the revised manuscript which could be reworded to better reflect our results.

Comment 10: Some conclusions are not fully supported by the presented data; for example, the claim that ChatGPT is a useful digital assistant is undermined by its modest exam scores (C+/B+) and hallucination issues.

Response 10: We have added text to clarify the value of ChatGPT as a security assistant (lines 308-312 and 354-356). As noted in our paper, there is currently a significant shortage of cybersecurity professionals. Thus, there are many humans working in the cybersecurity field who barely passed their certification exams or were in the bottom quarter of their college program. There are no doubt some organizations who would be glad to have a C+/B- student who makes mistakes 10-20% of the time, especially if that individual worked 24/7 with no medical or vacation benefits. Our conclusion is meant to reflect that ChatGPT does not currently measure up to the best human professionals in the field. We hope this helps clarify our analysis of the research data.

Comment 11: In the IAM and jailbreak case, the conclusion that the model is "secure" is premature and lacks depth; no analysis of unknown attack vectors or model robustness against future exploits is provided.

Response 11: The conclusions in the IAM and jailbreak case are based on the data included in the paper from our black box testing. We used 4 different, well-documented prompt jailbreaks, and repeated each one 20 times; we did not find a single successful jailbreak. We were also unsuccessful in forcing the model to generate malicious code. That said, we also acknowledge that prompt-based models are insecure by design (line 420). Since the developers of ChatGPT have not released the source code or training data for the model, it is not possible to estimate how vulnerable it may be to an unknown attack, or how to mitigate an unknown attack. We never claim that the IAM is "secure"; rather, we state the following (lines 436-443):
As shown in Table 1, after 20 attempts with each jailbreak prompt, the system continued to resist our efforts to compromise its access control or introduce malicious code generation. This result is encouraging, since the ChatGPT ecosystem appears robust enough to self-correct for published jailbreaks reasonably quickly. Additional work would be required to assess whether ChatGPT responds as well to unpublished jailbreaks, or to quantify the window of vulnerability for such attacks. For now, the trustworthiness of this application for IAM and zero trust use cases appears to meet basic requirements, but further experiments are planned in this area as the tool matures.

If you still feel that this discussion of our results could be improved, please provide specific examples for consideration in future manuscripts. 

 

Reviewer 3 Report

Comments and Suggestions for Authors

This study delivers a fresh and thoughtful examination of ChatGPT 3.0’s abilities in cybersecurity settings, tackling a pressing need as companies turn to large language models to address talent shortages. 
The researchers dig in, relying on proven tests, ethical standards, and hands-on data to check how ChatGPT performs, identify its flaws, and explore how it reacts to tough prompts.
Their key observations about what ChatGPT does well and where it struggles are well-supported and straightforward, though some minor adjustments could improve the clarity and repeatability of the results.   

The introduction would benefit from explicitly contextualizing how this work builds upon or diverges from existing research on LLMs in cybersecurity, supported by 1–2 key citations. Methodological transparency could be improved by elaborating on how honeypot logs were structured or selected. Visual presentation issues—such as inconsistent axis labels in Figures 7–8 and missing descriptive captions—should be addressed to help readers interpret data at a glance.

Prose revisions include simplifying overly complex sentences (e.g., lines 260–269) and clarifying terminology like “GenLLM” to prevent confusion. The discussion section offers a prime opportunity to explore the practical consequences of ChatGPT’s limitations—for instance, how hallucinations or input restrictions might compromise threat analysis in live environments. A few tweaks, along with small edits, would refine an already strong manuscript. The study stays ethically solid and ready for publication, providing practical guidance for AI developers and cybersecurity experts working on integrating large language models.

Author Response

Comment 1: The introduction would benefit from explicitly contextualizing how this work builds upon or diverges from existing research on LLMs in cybersecurity, supported by 1–2 key citations.

Response 1: We thank the reviewers for their constructive comments. The introduction has been updated with 2 additional key references (15 and 16) and a discussion of early LLM research in the literature review (lines 108-147).

Comment 2: Methodological transparency could be improved by elaborating on how honeypot logs were structured or selected.

Response 2: We have included additional discussion about how our honeypots work (lines 384-388), and added a citation from our previous publication describing the honeypots (reference 35). The raw information and copies of the honeypot are available through our Data Availability statement (line 389).

Comment 3: Visual presentation issues—such as inconsistent axis labels in Figures 7–8 and missing descriptive captions—should be addressed to help readers interpret data at a glance.

Response 3: Figure 7 is not a graph and has no axis labels. The axis labels of Figure 8 do not reference any other graphs in the manuscript, so they should not have any consistency problems. We have reviewed all other graphs present in the manuscript for consistency in the axis labeling. All figures are provided with descriptive captions, and minor tweaks have been made to the text surrounding these figures to enhance clarity.

Comment 4: Prose revisions include simplifying overly complex sentences (e.g., lines 260–269) and clarifying terminology like “GenLLM” to prevent confusion.

Response 4: Lines 260-269 from the original manuscript have been rewritten for improved clarity. The definitions of LLM and GenLLM have been cited in line 26 at the very beginning of the manuscript to avoid confusion.

Comment 5: The discussion section offers a prime opportunity to explore the practical consequences of ChatGPT’s limitations—for instance, how hallucinations or input restrictions might compromise threat analysis in live environments. A few tweaks, along with small edits, would refine an already strong manuscript.

Response 5: We have made minor edits in the discussion sections and conclusions to address these comments. For example, a list of GenLLM risks and limitations is now included (lines 64-92) and our results are discussed in the context of this list (for example, lines 390-391, 345-356, 313-316, and 323-326).

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have incorporated all review comments. Manuscript electronics-3703234 has been greatly improved. 

Reviewer 2 Report

Comments and Suggestions for Authors

All of my comments have either been fully addressed or reasonably justified as limitations of the study.
