AI-Augmented SOC: A Survey of LLMs and Agents for Security Automation
Round 1
Reviewer 1 Report
This work provides an overview of the integration of Large Language Models (LLMs) and AI agents into modern Security Operations Centers (SOCs). The authors systematically review recent literature to map the application of these AI technologies across eight critical SOC functions: log summarization, alert triage, threat intelligence, ticket handling, incident response, report generation, asset discovery, and vulnerability management. The primary contributions are (i) a structured taxonomy of AI-augmented SOC tasks, (ii) a detailed synthesis of the models, techniques, evaluation methods, and datasets currently in use, and (iii) the proposal of a five-level Capability Maturity Model (CMM) to classify the progressive autonomy of AI within SOC environments. The work is based on a PRISMA-guided literature review, distilling findings from 100 relevant papers.
The paper is well written overall, and I also think that the structure and the general argumentation are easy to follow. In my opinion the paper is relevant for the cybersecurity community.
I do have some comments and questions, see below.
- SOC capability maturity models already exist (e.g., the widely used SOC-CMM). The model proposed here overlaps conceptually with them, and its positioning relative to SOC-CMM is not clear.
- Surveys of LLMs for cyber defense, as well as domain-specific surveys, already exist. The contribution here is at best a consolidation rather than the “first comprehensive vision”.
- I also have some concerns about the database coverage. Important venues such as USENIX, NDSS, ACM CCS, IEEE S&P, and Elsevier/ACM journals are missing. This is a threat to validity.
- The authors should explicitly discuss how the capability maturity model could be validated.
- The paper does not provide enough detail about how the final 100 papers were selected from the 200 articles assessed. What was the final selection based on?
- It would also be nice to see a more integrated and critical discussion of the identified challenges. For example, how do the interdependencies between them exacerbate the lack of trust and make human-AI collaboration more difficult?
The paper contains a few typos and minor inconsistencies, listed below.
- Page 1, Section 1, “or signature-driven processes prove inadequate..." -> change “prove” to “proves”.
- Page 3, Table 1, change “Iot Traffic” to “IoT Traffic”.
- Page 4, Section 3.1:
- "...available for real real-world deployment..." -> remove “real”
- "...with over 100% reduction in LLM costs." -> rephrase for clarity: “with "a 100% reduction in LLM costs"
- Pages 14 and 15: the section numbering is incorrect.
Author Response
Comments 1: SOC capability maturity models already exist (e.g., the widely used SOC-CMM). The model proposed here overlaps conceptually with them, and its positioning relative to SOC-CMM is not clear.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have added a subsection titled "Relevance compared to SOC-CMM" that positions our maturity model relative to SOC-CMM. Page 12, Section 4.1. Line number 435.
Comments 2: Surveys of LLMs for cyber defense, as well as domain-specific surveys, already exist. The contribution here is at best a consolidation rather than the “first comprehensive vision”.
Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we have reframed the stated contribution from a comprehensive overview to a consolidation and extension of existing work. We have also acknowledged the previous LLM and domain-specific surveys. Page 2, Section 1. Line number 63.
Comments 3: I also have some concerns about the database coverage. Important venues such as USENIX, NDSS, ACM CCS, IEEE S&P, and Elsevier/ACM journals are missing. This is a threat to validity.
Response 3: Agree. We have, accordingly, revised the threats to validity. We added two sentences that directly address the databases not represented in the survey, acknowledging that their absence could threaten validity through underrepresentation. Page 18, Section 7. Line number 684.
Comments 4: The authors should explicitly discuss how the capability maturity model could be validated.
Response 4: Agree. We have, accordingly, discussed how the maturity model could be validated. We have added a subsection titled "Validity of Capability Maturity Model" that discusses approaches to validating the capability maturity model. Page 13, Section 4.2. Line number 454.
Comments 5: The paper does not provide enough detail about how the final 100 papers were selected from the 200 articles assessed. What was the final selection based on?
Response 5: Thank you for pointing this out. We agree with this comment. Therefore, we have added the specific criteria we used to narrow the final 100 papers down from the 200 articles assessed. Page 3, Section 2. Line number 102.
Comments 6: It would also be nice to see a more integrated and critical discussion of the identified challenges. For example, how do the interdependencies between them exacerbate the lack of trust and make human-AI collaboration more difficult?
Response 6: Agree. We have, accordingly, expanded the critical discussion of the interdependencies among the challenges. We have added a subsection, "Integrated and Critical Discussion of Challenges", that makes the interrelations of the challenges explicit and shows how they interact to undermine human-AI collaboration. Page 16, Section 5.5. Line number 588.
Reviewer 2 Report
- Add a subsection on cross-domain lessons from autonomous driving (AD).
I strongly encourage the authors to add a dedicated subsection (e.g., Cross-domain Lessons from Autonomous Driving) in the Related Work or Discussion. This should explicitly address: which aspects LLMs have improved in AD, how these were integrated into existing planning/control pipelines, and what safety and explainability lessons can be transferred to SOC. The discussion can be structured along four axes and mapped to the eight SOC tasks presented in the paper:
- Planning and decision coordination: In AD, LLMs are often used to generate semantic sub-goals/waypoints (integrated with local planners such as DWA, TEB, or CBF) or to interpret visual-linguistic scenes. This parallels SOC applications such as alert triage and automated ticketing, where LLMs handle semantic constraints and task decomposition, while agents execute tool invocation.
- In AD, robust planning critically depends on semantic-enhanced perception. Works such as ViT-SAMBA: Semantic Segmentation of Remotely Sensed Images with State Space Models demonstrate how semantic cues stabilize mapping and reduce hallucinations in downstream planners. A similar insight applies to SOC, where data quality and context representation strongly affect the reliability of LLM reasoning.
- Tool use and constraint optimization: In AD, LLMs frequently serve as policy routers, interfacing with cost functions, safety constraints (e.g., CBF), or trajectory optimizers (e.g., TEB) to search within safe sets. This is analogous to SOC’s SOAR/XDR-based orchestration. Emphasizing the paradigm “LLMs map semantics to safety constraints, while specialized optimizers guarantee hard safety” would highlight practical strategies to mitigate hallucination risks in SOC (a minimal sketch of this pattern follows this list).
- In mobile robotics, methods like Control Barrier Functions via Minkowski Operations for Safe Navigation among Polytopic Sets show how LLMs can provide high-level semantic guidance, while CBFs enforce hard safety constraints. An analogous paradigm in SOC would be LLM-generated remediation suggestions filtered through policy enforcement layers, ensuring fail-safe execution.
- Recent works such as Bio-Inspired Hybrid Path Planning for Efficient and Smooth Robotic Navigation and Multi-Strategy Enhanced Secret Bird Optimization Algorithm for Obstacle Avoidance highlight how bio-inspired optimizers balance exploration–exploitation in cluttered environments. These approaches could inspire SOC researchers to explore analogous hybrid strategies that combine LLM-based reasoning with heuristic or evolutionary optimization for complex decision spaces.
- Robustness and failure handling: AD emphasizes fail-safe and fail-operational architectures, often through redundancy where an “LLM soft layer” is safeguarded by a classical control hard layer. A similar pattern exists in SOC when LLM-generated remediation suggestions are filtered through strict rule engines or sandbox environments. Discussing consistency across the two domains (e.g., least-privilege execution and revocability) would be valuable.
- Evaluation and closed-loop feedback: AD relies on standardized benchmarks such as success rate, collision rate, minimum distance, trajectory smoothness, and inference latency. SOC could adopt analogous practices such as corner-case scenario testing, long-tail replay, and closed-loop evaluation. Beyond F1, action-level KPIs such as MTTD/MTTR, closed-loop automation success rate, and severity misclassification rates should be considered (a small computation sketch follows this list).
Such a cross-domain comparison would clarify how LLM semantic reasoning can be decoupled from verifiable optimizers/controllers, reducing systemic risks of hallucinations in safety-critical applications. It would also enrich your “Challenges and Governance” section with practical engineering prescriptions.
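To make the constraint-filtering paradigm above concrete, the following is a minimal sketch of the "LLM proposes candidate actions, a deterministic policy layer filters them" pattern. Everything here is an illustrative assumption: the Action type, the policy_filter function, and the whitelist/blocklist entries are hypothetical, not APIs from the surveyed papers or from any SOAR product.

```python
# Hedged sketch: an LLM suggests remediation actions; a deterministic policy
# layer enforces hard constraints before anything executes. All names and
# rules below are hypothetical illustrations.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    verb: str      # e.g., "isolate_host", "block_ip"
    target: str    # asset or account identifier
    severity: int  # severity of the triggering alert (1-5)

# Hard constraints live outside the LLM: a verb whitelist, a protected-asset
# blocklist, and a severity floor for autonomous execution.
ALLOWED_VERBS = {"isolate_host", "block_ip", "reset_password"}
PROTECTED_ASSETS = {"domain-controller-01", "backup-server"}

def policy_filter(candidates: list[Action], min_severity: int = 3) -> list[Action]:
    """Keep only suggestions that satisfy every hard constraint; everything
    else is routed to a human analyst instead of being executed."""
    approved = []
    for a in candidates:
        if a.verb not in ALLOWED_VERBS:
            continue  # unknown or over-privileged verb: never auto-execute
        if a.target in PROTECTED_ASSETS:
            continue  # fail-safe: protected assets require human sign-off
        if a.severity < min_severity:
            continue  # low-severity alerts stay under human-on-the-loop review
        approved.append(a)
    return approved

# Only the first suggestion survives the policy layer.
suggestions = [
    Action("isolate_host", "workstation-42", severity=4),
    Action("wipe_disk", "workstation-42", severity=4),           # verb not whitelisted
    Action("isolate_host", "domain-controller-01", severity=5),  # protected asset
]
print(policy_filter(suggestions))
```

The design point mirrors the AD paradigm: the LLM never holds execution authority; a small, auditable rule set does.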
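Likewise, the action-level KPIs named in the evaluation axis can be computed directly from incident timestamps. The sketch below uses invented data and assumes one common definition (MTTD as occurrence-to-detection, MTTR as detection-to-resolution); exact definitions vary across SOCs.

```python
# Hedged sketch: MTTD/MTTR from (occurred, detected, resolved) timestamps.
# The incident data is invented for illustration.
from datetime import datetime

incidents = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 20), datetime(2024, 5, 1, 11, 0)),
    (datetime(2024, 5, 2, 14, 0), datetime(2024, 5, 2, 14, 5), datetime(2024, 5, 2, 15, 30)),
]

def mean_minutes(deltas):
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([det - occ for occ, det, _ in incidents])  # mean time to detect
mttr = mean_minutes([res - det for _, det, res in incidents])  # mean time to respond
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")  # MTTD: 12.5 min, MTTR: 92.5 min
```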
- Add a cross-domain comparison figure or table.
I recommend adding a matrix-style comparison between AD and SOC across eight dimensions: input modalities, task objectives, constraint forms, LLM role, hard safety modules, evaluation metrics, failure handling, and regulatory compliance. The adjacent column could summarize transferable engineering practices (e.g., “LLMs only produce candidate plans → formal verifiers filter unsafe ones”), risk control points (prompt injection, data leakage, over-privileged execution), and mitigation strategies (RAG whitelisting, policy engines, audit logs, rollback mechanisms). Such a figure would significantly improve the generality and applicability of the survey.
- Extend the maturity model into a two-dimensional framework.
The manuscript currently defines a 0–4 autonomy gradient. I suggest extending this into a two-dimensional framework (autonomy level × supervision/constraint strength). The vertical axis could include categories such as HoTL/HITL, revocable execution, enforced CBF, or policy auditing. The figure could plot common AD use cases (e.g., LLM-as-planner: low autonomy, high constraint; LLM-as-router: moderate autonomy, moderate constraint) and map them to SOC tasks (IR, CTI, ticketing). This would provide readers with a more nuanced tool for risk-aware adoption.
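One possible reading of this two-dimensional framework is a simple (autonomy, supervision) coordinate per use case. The placements below are purely illustrative assumptions for the AD and SOC examples named above, not values taken from the manuscript.

```python
# Hedged sketch: encoding the proposed autonomy x supervision grid.
# Autonomy follows the manuscript's 0-4 gradient; supervision labels reuse
# the categories named in the comment. Coordinates are invented examples.
placements = {
    # use case: (autonomy 0-4, supervision/constraint category)
    "LLM-as-planner (AD)": (1, "enforced CBF"),         # low autonomy, high constraint
    "LLM-as-router (AD)":  (2, "revocable execution"),  # moderate autonomy and constraint
    "CTI summarization":   (3, "human-on-the-loop"),
    "Incident response":   (1, "policy auditing"),
    "Ticket handling":     (2, "human-in-the-loop"),
}
for task, (autonomy, supervision) in placements.items():
    print(f"{task:22s} autonomy={autonomy} supervision={supervision}")
```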
- Reference coverage and balance.
The survey cites a strong set of recent works but appears somewhat skewed toward 2023–2024 papers. Earlier foundational works on SOC automation or traditional security orchestration could be cited to better situate LLM-driven methods in the historical context.
A few citations are repeated or not consistently formatted.
- Evaluation metrics.
While the paper reports commonly used metrics (precision, recall, F1), it does not always clarify whether these come from primary studies or are suggested by the authors. Adding footnotes or explicit attribution would help.
- Repetition and redundancy.
The Challenges and Discussion sections partially overlap. Consider merging or restructuring to avoid repetition.
Certain descriptions (e.g., “LLMs reduce human workload and improve efficiency”) recur multiple times without adding new detail.
Author Response
Major Comments 1, 2, and 3:
- Add a subsection on cross-domain lessons from autonomous driving (AD).
I strongly encourage the authors to add a dedicated subsection (e.g., Cross-domain Lessons from Autonomous Driving) in the Related Work or Discussion. This should explicitly address: which aspects LLMs have improved in AD, how these were integrated into existing planning/control pipelines, and what safety and explainability lessons can be transferred to SOC. The discussion can be structured along four axes and mapped to the eight SOC tasks presented in the paper:
- Planning and decision coordination: In AD, LLMs are often used to generate semantic sub-goals/waypoints (integrated with local planners such as DWA, TEB, or CBF) or to interpret visual-linguistic scenes. This parallels SOC applications such as alert triage and automated ticketing, where LLMs handle semantic constraints and task decomposition, while agents execute tool invocation.
- In AD, robust planning critically depends on semantic-enhanced perception. Works such as ViT-SAMBA: Semantic Segmentation of Remotely Sensed Images with State Space Models demonstrate how semantic cues stabilize mapping and reduce hallucinations in downstream planners. A similar insight applies to SOC, where data quality and context representation strongly affect the reliability of LLM reasoning.
- Tool use and constraint optimization: In AD, LLMs frequently serve as policy routers, interfacing with cost functions, safety constraints (e.g., CBF), or trajectory optimizers (e.g., TEB) to search within safe sets. This is analogous to SOC’s SOAR/XDR-based orchestration. Emphasizing the paradigm “LLMs map semantics to safety constraints, while specialized optimizers guarantee hard safety” would highlight practical strategies to mitigate hallucination risks in SOC.
- In mobile robotics, methods like Control Barrier Functions via Minkowski Operations for Safe Navigation among Polytopic Sets show how LLMs can provide high-level semantic guidance, while CBFs enforce hard safety constraints. An analogous paradigm in SOC would be LLM-generated remediation suggestions filtered through policy enforcement layers, ensuring fail-safe execution.
- Recent works such as Bio-Inspired Hybrid Path Planning for Efficient and Smooth Robotic Navigation and Multi-Strategy Enhanced Secret Bird Optimization Algorithm for Obstacle Avoidance highlight how bio-inspired optimizers balance exploration–exploitation in cluttered environments. These approaches could inspire SOC researchers to explore analogous hybrid strategies that combine LLM-based reasoning with heuristic or evolutionary optimization for complex decision spaces.
- Robustness and failure handling: AD emphasizes fail-safe and fail-operational architectures, often through redundancy where an “LLM soft layer” is safeguarded by a classical control hard layer. A similar pattern exists in SOC when LLM-generated remediation suggestions are filtered through strict rule engines or sandbox environments. Discussing consistency across the two domains (e.g., least-privilege execution and revocability) would be valuable.
- Evaluation and closed-loop feedback: AD relies on standardized benchmarks such as success rate, collision rate, minimum distance, trajectory smoothness, and inference latency. SOC could adopt analogous practices such as corner-case scenario testing, long-tail replay, and closed-loop evaluation. Beyond F1, action-level KPIs such as MTTD/MTTR, closed-loop automation success rate, and severity misclassification rates should be considered.
Such a cross-domain comparison would clarify how LLM semantic reasoning can be decoupled from verifiable optimizers/controllers, reducing systemic risks of hallucinations in safety-critical applications. It would also enrich your “Challenges and Governance” section with practical engineering prescriptions.
- Add a cross-domain comparison figure or table.
I recommend adding a matrix-style comparison between AD and SOC across eight dimensions: input modalities, task objectives, constraint forms, LLM role, hard safety modules, evaluation metrics, failure handling, and regulatory compliance. The adjacent column could summarize transferable engineering practices (e.g., “LLMs only produce candidate plans → formal verifiers filter unsafe ones”), risk control points (prompt injection, data leakage, over-privileged execution), and mitigation strategies (RAG whitelisting, policy engines, audit logs, rollback mechanisms). Such a figure would significantly improve the generality and applicability of the survey.
- Extend the maturity model into a two-dimensional framework.
The manuscript currently defines a 0–4 autonomy gradient. I suggest extending this into a two-dimensional framework (autonomy level × supervision/constraint strength). The vertical axis could include categories such as HoTL/HITL, revocable execution, enforced CBF, or policy auditing. The figure could plot common AD use cases (e.g., LLM-as-planner: low autonomy, high constraint; LLM-as-router: moderate autonomy, moderate constraint) and map them to SOC tasks (IR, CTI, ticketing). This would provide readers with a more nuanced tool for risk-aware adoption.
Response 1, 2, 3: Thank you for pointing this out. We agree that these suggestions would strengthen the security aspects of the paper. Unfortunately, the scope of our paper is specifically the AI augmentation of Security Operations Centers.
Detailed Comments 1: Reference coverage and balance. The survey cites a strong set of recent works but appears somewhat skewed toward 2023–2024 papers. Earlier foundational works on SOC automation or traditional security orchestration could be cited to better situate LLM-driven methods in the historical context. A few citations are repeated or not consistently formatted.
Response 1: Thank you for pointing this out. We have further clarified that we chose papers from 2022–2025 because of the rapid pace of AI research. Page 2, Section 2. Line number 85. We have also deleted the repeated citations. Page 8, Section 3.6.
Detailed Comments 2: Evaluation metrics. While the paper reports commonly used metrics (precision, recall, F1), it does not always clarify whether these come from primary studies or are suggested by the authors. Adding footnotes or explicit attribution would help.
Response 2: Agree. We have, accordingly, added statements clarifying the provenance of the evaluation metrics. We have added sentences in the Methodology and Threats to Validity sections to clarify this issue and its consequences. Page 3, Section 2. Line number 100. Page 18, Section 7. Line number 684.
Detailed Comments 3: Repetition and redundancy. The Challenges and Discussion sections partially overlap. Consider merging or restructuring to avoid repetition. Certain descriptions (e.g., “LLMs reduce human workload and improve efficiency”) recur multiple times without adding new detail.
Response 3: Thank you for pointing this out. We agree with this comment. Therefore, we have carefully checked Sections 5.1–5.4 for repetition and redundancy and eliminated them. Pages 14–16, Sections 5.1–5.4.
Round 2
Reviewer 1 Report
I appreciate the authors' efforts, as I believe they did a good job addressing most of my comments. However, there is still a major issue regarding the coverage of the relevant literature that they have deliberately avoided addressing, which prevents me from accepting the work in its current form. Please see my detailed comment below.
In my previous review, I observed that papers from important venues like USENIX, NDSS, ACM CCS, IEEE S&P, Elsevier/ACM Journals are totally missing.
The authors only briefly acknowledged this limitation and dealt with it rather quickly. In my opinion, this is somewhat insufficient for a paper that aims to provide a state-of-the-art overview of LLMs and Agents for Security Automation. I strongly encourage the authors to conduct a comprehensive review of papers published in those relevant venues on this topic and provide readers with a clearer and more complete picture. As it stands, the paper is not ready for publication.
Author Response
Comment 1:
In my previous review, I observed that papers from important venues like USENIX, NDSS, ACM CCS, IEEE S&P, Elsevier/ACM Journals are totally missing.
The authors only briefly acknowledged this limitation and dealt with it rather quickly. In my opinion, this is somewhat insufficient for a paper that aims to provide a state-of-the-art overview of LLMs and Agents for Security Automation. I strongly encourage the authors to conduct a comprehensive review of papers published in those relevant venues on this topic and provide readers with a clearer and more complete picture. As it stands, the paper is not ready for publication.
Response 1: Thank you for clarifying your review of our paper. Accordingly, we manually screened the accepted papers at all the venues you suggested (USENIX, NDSS, ACM CCS, IEEE S&P, Elsevier/ACM journals) against our criteria (2022–2025, experimental papers) and the keywords "SOC", "AI Agent", "LLM", "Log Summarization", "Alert Triage", "Threat Intelligence", "Report Generation", "Ticket Handling", "Incident Response", "Vulnerability Management", and "Asset Discovery and Management", via the following links:
https://www.ndss-symposium.org/ndss2022/accepted-papers/
https://www.ndss-symposium.org/ndss2023/accepted-papers/
https://www.ndss-symposium.org/ndss2024/accepted-papers/
https://www.ndss-symposium.org/ndss2025/accepted-papers/
https://www.computer.org/csdl/proceedings/sp/2022/1FlQurJZBuw
https://www.computer.org/csdl/proceedings/sp/2023/1He7WWuJExG
https://www.computer.org/csdl/proceedings/sp/2024/1RjE8VKKk1y
https://www.computer.org/csdl/proceedings/sp/2025/21B7ONGXzZ6
https://www.sigsac.org/ccs/CCS2022/program/accepted-papers.html
https://www.sigsac.org/ccs/CCS2023/program.html
https://www.sigsac.org/ccs/CCS2024/program/accepted-papers.html
https://www.sigsac.org/ccs/CCS2025/accepted-papers/
https://www.usenix.org/conference/usenixsecurity22/summer-accepted-papers
https://www.usenix.org/conference/usenixsecurity22/fall-accepted-papers
https://www.usenix.org/conference/usenixsecurity22/winter-accepted-papers
https://www.usenix.org/conference/usenixsecurity23/summer-accepted-papers
https://www.usenix.org/conference/usenixsecurity23/fall-accepted-papers
https://www.usenix.org/conference/usenixsecurity24/summer-accepted-papers
https://www.usenix.org/conference/usenixsecurity24/fall-accepted-papers
https://www.usenix.org/conference/usenixsecurity25/cycle1-accepted-papers
Unfortunately, there were only 1–3 papers each in the following areas: Log Summarization; Threat Intelligence; Vulnerability Management; other security-related incident response issues (not applicable to our SOC scope); and LLM jailbreak safety (not applicable to our paper).
There are clear gaps at these venues in papers that focus specifically on AI agent and LLM integration with SOC tasks such as Alert Triage, Ticket Handling, Report Generation, and Asset Discovery and Management.
We have added 5 new papers from Elsevier Computers & Security, ACM CCS, and USENIX Security in the following sections: 3.1 Log Summarization (Page 5, Line number 155), 3.3 Threat Intelligence (Page 6, Line number 231), 3.5 Incident Response (Page 8, Line number 309), and 3.8 Vulnerability Management (Page 10, Line numbers 407 and 415).
Due to these changes, we have updated the Methodology (Page 2), the PRISMA figure (Page 3), the Threats to Validity (Page 18), and the Conclusion (Page 19).
Reviewer 2 Report
The authors have addressed all comments thoroughly, and the revised version shows significant improvement in clarity, depth, and structure. The paper is now comprehensive, well-balanced, and provides valuable insights into the role of LLMs in SOC automation. I am pleased with the revision and recommend the manuscript for acceptance.
The authors have carefully addressed all previous comments, and the revised manuscript shows clear improvement in both depth and presentation. The newly added discussion on cross-domain insights, along with the expanded literature coverage and clearer structural organization, has significantly strengthened the contribution.
The paper is now well-written, comprehensive, and insightful. It provides a valuable and timely overview of how LLMs and intelligent agents are transforming SOC automation, with broader implications for safety-critical systems.
I would like to commend the authors for their thorough revisions and thoughtful responses. The manuscript has reached a publishable standard and is recommended for acceptance.
Author Response
Comment 1: The authors have carefully addressed all previous comments, and the revised manuscript shows clear improvement in both depth and presentation. The newly added discussion on cross-domain insights, along with the expanded literature coverage and clearer structural organization, has significantly strengthened the contribution.
The paper is now well-written, comprehensive, and insightful. It provides a valuable and timely overview of how LLMs and intelligent agents are transforming SOC automation, with broader implications for safety-critical systems.
I would like to commend the authors for their thorough revisions and thoughtful responses. The manuscript has reached a publishable standard and is recommended for acceptance.
Response 1: Thank you for your support and clear guidance. We are very happy that our work has been recognized.