Article

Towards a Sustainable Cybersecurity Governance: Threat Modelling with Large Language Models

Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška Cesta 46, 2000 Maribor, Slovenia
*
Authors to whom correspondence should be addressed.
Sustainability 2025, 17(23), 10569; https://doi.org/10.3390/su172310569
Submission received: 3 October 2025 / Revised: 12 November 2025 / Accepted: 20 November 2025 / Published: 25 November 2025
(This article belongs to the Section Sustainable Engineering and Science)

Abstract

With the increased complexity of applications and systems, threat modelling struggles to keep pace with the evolution of risks. This article addresses this challenge by exploring how large language models (LLMs) can be leveraged to create comprehensive threat models across different risk assessment methodologies. We examine whether a single generic prompt can support frameworks such as LINDDUN, PASTA, and STRIDE, despite their different requirements. Through this comparative analysis, we identify components that enable AI-based assessments, while acknowledging that privacy, regulatory, and dynamic risks require adaptation of the frameworks. Our findings show that a universal guideline is feasible for broad applications, but adaptation is necessary for effective use. Overall, LLM-based threat modelling improves the accessibility, repeatability, and effectiveness of risk analysis and supports stronger and more sustainable practices.

1. Introduction

In recent years, artificial intelligence (AI) has proven to be transformative in the field of cybersecurity and risk management, improving the ability of organisations to detect, analyse and respond to threats in a digital environment. These developments take place in a broader global context where sustainable infrastructure development is essential. Digital trust and resilient cybersecurity models form the foundation of a sustainable digital transformation that is aligned with Sustainable Development Goal 9 (industry, innovation and infrastructure) [1]. Artificial intelligence offers a solution by using machine learning algorithms, data analytics and automated decision-making to strengthen security measures [2]. Sustainability is not only an environmental concern but also a principle in the design of secure, efficient and resilient digital infrastructures. In this context, AI-powered threat modelling offers scalable solutions that support long-term cybersecurity goals aligned with sustainable development objectives [3,4].
Threat modelling enables organisations to identify, analyse and mitigate potential security threats before they materialise. It is used widely to improve cybersecurity and operational resilience in a variety of industries including finance, healthcare, manufacturing and government. A proactive approach to threat modelling helps organisations anticipate security vulnerabilities, assess attack vectors and implement appropriate countermeasures [5,6].
AI-powered threat modelling solutions improve this process further by automating threat identification, analysing data sets in real time and generating security insights [7]. These models use machine learning algorithms to predict attack patterns, detect anomalies and recommend mitigation strategies. By incorporating threat modelling techniques powered by artificial intelligence, organisations can improve their security posture, reduce vulnerabilities and ensure compliance with the regulatory standards [2].
In addition, the effectiveness of large language models (LLMs) in risk assessment is dependent on the quality, availability and integrity of data. Poor data quality—such as incomplete, inconsistent or biased datasets—can lead to inaccurate risk predictions, reducing the reliability of LLM-driven safety measures [8]. Addressing these challenges requires a combination of data management strategies, AI frameworks and continuous improvements to LLM algorithms to ensure fair, transparent and effective threat modelling practices.
This article discusses the role of LLMs in threat modelling, together with the benefits, challenges and implications for organisations. Our research is focused on the following research questions:
RQ1. 
Is it possible to develop a generalised input prompt that allows an LLM to generate a comprehensive analysis?
RQ2. 
How efficient are different LLMs in generating threats?
The article is structured as follows: Section 2 provides an overview of related work. In Section 3 we present threat modelling, and, within the following subsections, provide a detailed examination of STRIDE, PASTA and LINDDUN, highlighting their methodologies, applications and effectiveness in threat modelling. Section 4 focuses on LLMs, and, in the following subsections, presents all the LLMs that were used in this research (Gemini 2.5 Flash-Lite, ChatGPT 4o, Perplexity (release July 2025) and CoPilot (release July 2025)). Section 5 presents the methodology used in this research, with Section 6 presenting the results of all the outputs and the interpretations by the authors. The paper concludes with a discussion in Section 7, highlighting the methodological strengths, prompt sensitivity, adaptation effects and practical implications of LLMs. The paper finishes with a conclusion and the acknowledgments.

2. Related Work

Sai and Challa [9] explored how machine learning algorithms can automate compliance checking, identify system vulnerabilities and predict potential risks in software development and validation processes. Their work shows that AI’s capability to streamline validation reduces both cost and time while maintaining high levels of accuracy and compliance.
The study by Trifonov et al. [10] elaborated on AI’s role in threat detection, anomaly analysis and predictive modelling to improve defensive mechanisms against cyberattacks. The study also discussed the transition of cyber threats from the phase of cyber-crime to cyber-warfare and how this shift necessitates the advancement of cyber defensive techniques, particularly through the application of AI methods. It also focuses on the implementation of these techniques in Cyber Threat Intelligence (CTI) at different levels—strategic, operational and tactical.
Jawhar et al. [11] explored how AI can be used to assess cyber risks for insurance purposes, providing more accurate pricing and enhancing overall resilience. Their study shows that the integration of AI tools into cybersecurity practices reduces response times and improves risk management for both enterprises and governments, and examines how AI can enhance the evaluation of cyber risks, the generation of detailed reports and the design of customised cyber insurance policies.
The research from Singh [12] explored the role of AI and machine learning (ML) in improving risk assessment within the financial industry. It discussed the application of AI and ML in managing various types of financial risks, including credit risk, market risk and operational risk. The paper highlights how these technologies can enhance risk management practices by providing more accurate and efficient methods of identifying, assessing and mitigating risks.
In the area of supply chain management, AI has been instrumental in improving risk identification, forecasting disruptions and optimising logistics. The paper from Baryannis et al. [13] discussed the use of AI models to predict supply chain risks such as supplier failures, geopolitical disruptions and natural disasters. Additionally, “Optimising Supply Chain Risk Management: An Integrated Framework Leveraging LLMs” examines the integration of LLMs in supply chain data analysis, providing more accurate decision-making processes and enhancing risk management frameworks.
Recent advancements in AI and automation technologies have impacted the field of industrial risk management significantly. The work by Yaseen [14] highlighted the role of AI-driven predictive analytics, machine learning models, and robotics in enhancing safety, mitigating risks, and improving operational efficiency in industrial environments. The automation of hazard detection and real-time risk mitigation is particularly notable, as it reduces human error and enhances response times in critical situations.
In their paper, Hoseini et al. [15] addressed security challenges in the field of Adversarial Machine Learning (AML) by proposing a systematic approach to threat modelling. The paper presents a multi-stage methodology for designing and evaluating attack trees, focusing on breaches of confidentiality, integrity, availability and privacy. The authors classified attacks, assessed risks using feasibility and severity metrics, and provided insights for vulnerability mitigation.
The existing studies focus on threat detection, anomaly detection and predictive analytics based on artificial intelligence [9,10,11,12,13,14], while more recent research explored the role of adversarial machine learning and AI-enhanced risk assessment frameworks [15,16]. However, these contributions refer mainly to domain-specific applications, and do not explore how LLMs can be used to generate comprehensive, model-aligned threat analyses across different system architectures. In particular, there is limited understanding of whether an LLM can be prompted in a standardised manner to generate outputs compatible with established methodologies such as STRIDE or DREAD.
Our research addresses this gap by empirically analysing the ability of LLMs to generate threat models based on structured prompts that incorporate system-specific data. We examine both the general applicability of LLM-based threat modelling and the effectiveness of different models in generating relevant, accurate and comprehensive threats. In doing so, we aimed to bridge the gap between the theoretical advances in AI-based cybersecurity and the practical needs of organisations seeking flexible, standards-compliant approaches to threat modelling.

3. Threat Modelling

The threat modelling methodology ensures a proactive approach to cybersecurity, adapting continuously to new threats and evolving system architectures [17]. In addition to improving security, systematic threat modelling contributes to sustainable development by promoting proactive planning, reducing unnecessary system upgrades and supporting energy-efficient security implementations—all of which are essential for a resilient digital infrastructure in line with the Sustainable Development Goal 9 [1].
Threat modelling is an ongoing, adaptive process that aligns security strategies with an organisation’s specific infrastructure and business objectives. However, the extent to which this is achieved depends on the chosen methodology, as different threat modelling methodologies address different aspects of security risk analysis.

3.1. STRIDE

Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege (STRIDE) evaluates threats systematically by analysing the system design, often utilising data flow diagrams (DFDs) [18,19]. Table 1 outlines the key threat categories and their associated security controls. STRIDE is a system-centric methodology that focuses on identifying attack vectors based on known security vulnerabilities.
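In practice, the systematic evaluation described above amounts to pairing every element of the data flow diagram with every STRIDE category and assessing each pair. A minimal sketch of this enumeration step, with illustrative component names (not drawn from the methodology’s formal definition), could look as follows:

```python
# Minimal sketch: enumerating STRIDE categories over DFD elements.
# The component names and the flat element-by-category pairing are
# illustrative assumptions; real STRIDE variants restrict categories
# per element type.
STRIDE = {
    "S": "Spoofing",
    "T": "Tampering",
    "R": "Repudiation",
    "I": "Information disclosure",
    "D": "Denial of service",
    "E": "Elevation of privilege",
}

def enumerate_threats(components):
    """Pair every DFD element with every STRIDE category as a worksheet."""
    return [(c, name) for c in components for name in STRIDE.values()]

worksheet = enumerate_threats(["Frontend", "Backend API", "Database"])
print(len(worksheet))  # 3 components x 6 categories = 18 candidate entries
```

Each resulting pair is then either documented as a concrete threat or discarded as not applicable, which is what makes the method systematic rather than ad hoc.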

3.2. PASTA

The Process for Attack Simulation and Threat Analysis (PASTA) provides a structured, attacker-centric approach to threat modelling, integrating security analysis with business risk mitigation strategies [22]. Table 2 outlines the seven stages of the PASTA methodology.

3.3. LINDDUN

LINDDUN is a privacy-centric threat modelling methodology designed to integrate data protection principles into system development [23,24]. Table 3 outlines its key phases, which are categorised into problem space and solution space.

4. Large Language Models

Large Language Models (LLMs) are advanced artificial intelligence systems built on the transformer architecture that process entire input sequences in parallel, rather than sequentially, enabling more efficient training and better handling of long-range dependencies in text. They contain billions of parameters and are trained on large amounts of textual data [25].
This study examines four LLMs—Gemini, ChatGPT, Perplexity and GitHub CoPilot—to compare their effectiveness in modelling threats across complementary modalities.
Gemini’s [26] multimodal architecture enabled the evaluation of threat analysis from text, image and log data, reflecting real-world enterprise environments. ChatGPT [27] was selected for its robust conversational capabilities, extensive context processing, and alignment with structured security frameworks, making it a relevant baseline for overall LLM performance. Perplexity’s [28] real-time search augmentation enabled the assessment of the latest source-based threats and the formulation of mitigation measures, which is particularly valuable in changing security contexts. GitHub CoPilot [29], which specialises in code analysis, was included to explore automated vulnerability detection at the implementation level.
This selection enabled a robust comparison of how each model’s unique design contributes to generating relevant, accurate and actionable outputs in automated threat modelling scenarios, as presented in Table 4.

Prompt Engineering

Since LLMs have become important due to their exceptional ability to understand and create human-like text, optimising the way users interact with these models is increasingly important. Effective prompt creation—the shaping of input queries that direct the model to produce the desired results—has become a key strategy for improving efficiency and reliability in a variety of applications, from text summarisation to coding assistance.
Prompt engineering is a systematic approach to designing and optimising input prompts for LLMs, to ensure that they generate accurate, relevant and consistent answers [30]. By designing prompts, users can guide AI models to perform certain tasks more efficiently, such as translation, summarisation, question answering and more.
The importance of prompt engineering is that it allows general-purpose models to be adapted to specialised tasks or domains without requiring additional training data, and is therefore an efficient and versatile approach [31]. In addition, prompt engineering can help mitigate bias and ensure more ethical AI behaviour through thoughtful design, while addressing concerns related to fairness and inclusivity. It also reduces the need for computational resources associated with fine-tuning, making advanced AI capabilities more accessible to a wider audience [32].
Sahoo et al. [33] developed a detailed taxonomy of existing prompt engineering strategies, categorising them by application domain, which includes methods like zero-shot and few-shot prompting, chain-of-thought prompting, and more specialised techniques such as Chain of Code or Chain of Symbol prompting. Based on their research, we concluded that our created prompts fall into Chain-of-Thought Prompting, which was originally proposed by Wei et al. [34]. Their work focused on how creating a series of intermediate reasoning steps—called the Chain-of-Thought—can improve a model’s ability to perform complex inference tasks. The Chain-of-Thought method involves presenting a few examples of reasoning processes as part of the prompt to guide the model. This helps the model break down complex problems into more manageable intermediate steps that reflect human problem-solving processes [34].
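A chain-of-thought style threat modelling prompt can be assembled programmatically. The sketch below is a hypothetical skeleton: the step wording and function name are our illustrative assumptions, not the exact prompts used in this study (those are reproduced in Appendices B–D).

```python
# Illustrative chain-of-thought style prompt skeleton. The step wording
# is an assumption for demonstration, not the exact prompt from the
# appendices.
def build_cot_prompt(model_name, system_description):
    steps = [
        "1. List the system components and trust boundaries.",
        f"2. For each component, apply the {model_name} categories step by step.",
        "3. Explain the reasoning behind each identified threat.",
        f"4. Output the threats in the standard {model_name} report format.",
    ]
    return (
        f"Perform a {model_name} threat analysis of the following system.\n"
        f"System: {system_description}\n"
        "Reason through the following steps before answering:\n"
        + "\n".join(steps)
    )

print(build_cot_prompt("STRIDE", "e-commerce platform").splitlines()[0])
```

The intermediate steps force the model to expose its reasoning before the final report, which is the core idea of chain-of-thought prompting.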

5. Methodology

This study compares the performance of LLMs in generating, assessing and mitigating security threats across three threat models—LINDDUN, PASTA and STRIDE. The objective is to evaluate LLM model outputs systematically for consistency, reliability and practical value.
The experiment utilises a fictitious e-commerce platform designed by the authors that combines widely used architectural building blocks, such as a frontend interface, a backend API, an authentication server, a database, payment processing, order/inventory management, email notifications and cloud-based infrastructure. This platform is not a real-world deployment, but rather a synthetic model whose initial design is based on architectural decisions from ongoing research.
The rationale for selecting an e-commerce infrastructure arises from its prominence as an attack target in the OWASP Top 10 [35], where e-commerce scenarios are extensively referenced as examples of web application vulnerabilities. E-commerce systems remain among the most popular and critical web applications. Their complex architectures and frequent transaction processing make e-commerce especially susceptible to a wide range of security issues [36]. This platform serves as a representative and security-relevant subject for testing, emulating the types of architectural and security challenges encountered in practice. A concise summary of the system is provided in Appendix A.

Procedure

The methodological workflow was conducted as presented in Figure 1, enabling a systematic comparison across threat models, LLMs and prompt levels.
The LLM prompt is designed to generate a threat report based on a threat model, with a placeholder “[]”, and follows a staged workflow of increasing specificity. These stages include the following:
  • Low specificity: A brief listing of the core technologies.
  • Medium specificity: The inclusion of infrastructure details and component relationships.
  • High specificity: Explicit version numbers and configuration parameters.
Further elaboration on the content and structure of each prompt can be found in Appendix B, Appendix C and Appendix D.
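The three specificity levels can be illustrated as variants of the same template filling the “[]” placeholder. The sketch below is an assumption-laden condensation: the template wording is hypothetical, the low/medium descriptions are condensed from Appendix A, and the version numbers at the high level are placeholders, not the actual versions used in the experiment.

```python
# Sketch of the three prompt-specificity levels. The template wording is
# hypothetical, and all version numbers at the "high" level are
# illustrative placeholders, not the experiment's actual values.
TEMPLATE = "Generate a {model} threat report for the following system: {system}"

LEVELS = {
    "low": "React.js frontend, Node.js API, PostgreSQL, Stripe, AWS.",
    "medium": ("React.js frontend calling a Node.js/Express.js API, "
               "JWT authentication server, PostgreSQL database, "
               "Stripe payments, Amazon SES email, hosted on AWS."),
    "high": ("React.js 18 frontend, Node.js 20/Express.js 4 API, "
             "JWT (HS256) authentication, PostgreSQL 16, Stripe API, "
             "Amazon SES, AWS EC2/S3/RDS/Lambda with default security groups."),
}

def build_prompt(model, level):
    """Fill the placeholder with the system description for a given level."""
    return TEMPLATE.format(model=model, system=LEVELS[level])
```

Holding the template fixed while varying only the system description isolates the effect of prompt specificity on the generated threat reports.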
This multi-level approach enables systematic investigation into whether greater prompt specificity improves the relevance and accuracy of the vulnerabilities identified by LLMs, or whether the model outputs remain inconsistent regardless of the detail. The experiment used the models available in July 2025. Since CoPilot and Perplexity do not publish specific version numbers, their then-current releases were tested. For reference, Gemini version 2.5 Flash-Lite and ChatGPT version 4o were also included.
Each LLM received the structured prompts and was instructed to perform the following sequential tasks:
  • Follow the methodology prescribed by the threat model;
  • Identify the threats;
  • Output the results in the format prescribed by the threat model.
The output reports were normalised and classified under a unified taxonomy to facilitate the comparison. Additionally, an example of the output is available in Appendix E. The validation focused on the following:
  • Consistency: The repeatability and stability of the results when the prompt details were minimally modified.
  • Reliability: Alignment with the established threat modelling frameworks and internal logical coherence.
In the comparison between the different levels, we employed independent expert validation, drawing on a prior, separate experiment conducted as part of another ongoing research project. In this project, an expert independently performed a manual threat analysis and completed documentation using the STRIDE, PASTA and LINDDUN methods. We then compared the reports generated by the LLMs with the results of this expert, manually performed assessment. This approach allowed us to objectively assess overlaps, newly identified threats and the accuracy of detected threats in relation to established expert standards and experience. Table 5 provides an overview of the rules used for classifying the detected threats as true positives (TP) or false positives (FP).
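The actual classification rules are given in Table 5; purely as an illustration of the scoring step, one plausible rule is to count a generated threat as a TP when it matches an expert-baseline entry and as an FP otherwise. The keyword-overlap matching below is our simplifying assumption, not the rule applied in the study:

```python
# Hedged sketch of TP/FP scoring against the expert baseline. The real
# classification rules are those of Table 5; this keyword-overlap match
# is only an illustrative assumption.
def normalise(threat):
    """Lower-case a threat description and split it into a word set."""
    return set(threat.lower().replace("-", " ").split())

def classify(generated, expert_baseline, min_overlap=2):
    """Label each generated threat TP if it shares enough words with any
    expert-identified threat, otherwise FP."""
    tp, fp = [], []
    for threat in generated:
        words = normalise(threat)
        if any(len(words & normalise(e)) >= min_overlap for e in expert_baseline):
            tp.append(threat)
        else:
            fp.append(threat)
    return tp, fp

expert = ["SQL injection in backend API", "JWT token forgery"]
tp, fp = classify(["SQL injection via search field", "Natural disaster"], expert)
print(len(tp), len(fp))  # 1 1
```

In the study itself this matching was performed manually against Table 5, with ambiguous cases resolved by expert judgment; automated matching of this kind would only be a pre-filter.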
It is important to note that this representative e-commerce system originated from a distinct experiment within ongoing research. For further robustness, the accuracy and practical utility of all the generated threat reports and mitigation strategies were reviewed independently by a professional penetration tester, whose insights ensured alignment with the current industry practices and enhanced the credibility of our findings.
An additional methodological consideration in this study is the potential “learning effect” that can occur when LLMs retain information across multiple queries or sessions. To mitigate this effect during the comparative analysis of different prompts and models, all experiments were conducted in “incognito mode”, with each evaluation using a separate incognito chat instance. According to the LLM providers’ documentation, user data are not stored in anonymous sessions and the models cannot learn from these interactions, which reduces memory-related bias in the threat modelling results. Since studying the learning effect is beyond the scope of this research, the authors relied on the providers’ assurances, while acknowledging the residual risk and highlighting this limitation for transparency.
This study has several limitations that should be considered. First, the evaluation was limited to three methodologies (STRIDE, PASTA and LINDDUN) and four models (ChatGPT, CoPilot, Gemini and Perplexity). The inclusion of additional frameworks, such as OCTAVE, DREAD or Attack Trees, as well as emerging LLMs, could provide a more comprehensive perspective. Second, despite measures to mitigate the learning effects, the possibility of adaptation due to repeated exposure to similar tasks cannot be ruled out completely. Third, the results reflect the state of LLMs at the time of the experiment; as these models are evolving rapidly, future updates may yield different results.

6. Results

Following the methodological considerations and limitations discussed in the previous section, this section presents the empirical findings derived from the comparative analysis of four large language models across three threat modelling frameworks.
Before presenting the comparison of the LLM results, it is important to present the results prepared by the security expert. For each threat modelling framework, the expert performed a comprehensive manual threat analysis and prepared a report tailored to the specific characteristics and architecture of the representative e-commerce system. The number of threats identified by the expert for each methodology is as follows:
  • STRIDE: 29 threats
  • PASTA: 31 threats
  • LINDDUN: 34 threats
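These expert counts serve as the denominator when turning the TP figures reported below into recall values. A small sketch of that arithmetic, using numbers taken directly from this section:

```python
# Recall of a model run against the expert baseline, using the expert
# totals reported above. The example TP value is the upper end of
# ChatGPT's reported STRIDE range.
EXPERT_TOTALS = {"STRIDE": 29, "PASTA": 31, "LINDDUN": 34}

def recall(tp, framework):
    """Fraction of expert-identified threats recovered by a model run."""
    return tp / EXPERT_TOTALS[framework]

# ChatGPT's best STRIDE run: 22 TPs out of 29 expert threats.
print(round(recall(22, "STRIDE"), 2))  # 0.76
```

The same computation applies to each model and framework, which is how the per-figure TP ranges below translate into the comparative coverage statements.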
Figure 2 summarises the STRIDE-based threat identification results for four LLMs. ChatGPT achieved the highest number of true positives (TP), ranging from 17 to 22 per prompt, with only a single false positive (FP) per case. However, ChatGPT missed specific web-related risks such as cross-site request forgery (CSRF) and S3 storage misconfigurations.
CoPilot’s TP counts were comparable to ChatGPT’s but accompanied by a higher number of FPs, primarily redundant or misclassified repudiation scenarios. Its outputs also exhibited frequent repetition of “sensitive data disclosure” across multiple components. Gemini produced the lowest number of TPs and the highest number of FPs, often including irrelevant categories (e.g., hygiene issues or configuration errors) and omitting key threats such as JWT and CSRF abuse. Perplexity outperformed Gemini in TPs, but generated repetitive entries dominated by “unauthorised access” and “data breach.”
Figure 3 presents the comparative results obtained using the PASTA methodology. ChatGPT again achieved the highest precision, identifying between 22 and 24 relevant risks across all prompts. Nevertheless, some critical threats such as JWT hijacking and cloud misconfiguration remained undetected.
CoPilot demonstrated moderate performance, achieving lower TP values, between 14 and 17, but with additional FPs. It also displayed a high degree of redundancy, listing “sensitive data disclosure” repeatedly. Gemini yielded a greater range of TPs (14–23 per prompt) and the highest number of FPs, including unrelated risks such as natural disasters or hygiene issues. It also overlooked CSRF and JWT-related threats. Perplexity performed similarly to CoPilot in TPs, but generated a similar number of FPs to ChatGPT.
Figure 4 illustrates the comparative results under the LINDDUN methodology. Gemini achieved the highest number of TPs (29–34 per query), but this was accompanied by the largest number of irrelevant threats, including non-privacy issues such as natural disasters and hardware failures.
CoPilot identified slightly fewer TPs (24–29) while maintaining consistent output quality and moderate FP rates. ChatGPT demonstrated substantial variability between the prompts (14–25 identified threats), with frequent omissions of privacy-specific risks, such as token abuse and CSRF. Perplexity generated the fewest findings (14–16), primarily repeating “unauthorised access” and “data breach” across assets.
Across all three methodologies, the results revealed both quantitative and qualitative differences in model performance. ChatGPT consistently produced threats aligned with the test scenario—such as JWT forgery, unauthorised database modification, SQL injection, cross-site scripting (XSS), CSRF and man-in-the-middle attacks on Stripe. CoPilot’s results were similar but more repetitive, while Gemini and Perplexity showed reduced precision and scope.
False positives were most frequent in Gemini, which often included irrelevant categories such as outdated software, configuration issues and non-cyber threats. CoPilot occasionally introduced implausible “denial” scenarios, and Perplexity produced overly general categories without contextual grounding. Repetitive and semantically redundant outputs were most evident in CoPilot and Perplexity, which duplicated generic threats such as “unauthorised access,” “data breach,” and “sensitive data disclosure”. ChatGPT and Gemini also exhibited repetition, but to a lesser extent.
Additionally, some threats identified by the LLMs corresponded to risks implicitly covered by the expert but not explicitly recorded in the expert report. In several cases, the models captured conceptually related threats that were consistent with the expert’s underlying reasoning. For example, while the expert described “unauthorised modification of records,” ChatGPT expressed a similar concern as “reverse engineering or bypassing regular integrity checks.” Similarly, the concept of “exploiting role-based access control” in CoPilot conceptually overlapped with the expert’s “escalating privileges in the admin dashboard.” These examples illustrate that, while the models use different terminology, they sometimes capture complementary aspects of the same threat patterns.

7. Discussion

A comparative evaluation of large language models (LLMs) across three threat modelling methodologies—STRIDE, PASTA and LINDDUN—provided quantitative and qualitative insights into their performance, limitations and methodological dependencies. This section addresses the two research questions, and interprets the results in the context of model reliability, prompt generalisation and practical applicability.
Regarding RQ1, the findings demonstrate that a single, generalised prompt does not yield consistently comprehensive or reliable threat analyses across all methodologies. ChatGPT and CoPilot performed relatively well under STRIDE and PASTA, but the results degraded significantly when applied to the privacy-oriented LINDDUN framework. Gemini, conversely, excelled in LINDDUN, but struggled with STRIDE due to its reduced sensitivity to technical attack surfaces. Designing methodology-specific prompts remains essential to achieve balanced coverage and minimise model-induced bias. Future research should focus on prompt standardisation and the development of dynamic adaptation mechanisms to improve reproducibility and contextual accuracy.
Considering RQ2, the quantitative comparisons revealed that the LLM efficiency depends both on the threat modelling methodology and the internal reasoning design of each model. The aggregated results across the three frameworks are summarised in Table 6.
ChatGPT achieved the highest true positive rates and lowest false positive rates in structured frameworks such as STRIDE and PASTA, reflecting effective pattern alignment with the technical vulnerabilities. CoPilot’s performance was stable across the runs but limited by redundancy and over-generation of non-unique threats. Gemini delivered outstanding recall within LINDDUN, identifying up to 34 relevant privacy threats, but at the cost of increased false positives and contextual misclassification. Perplexity achieved modest precision, but exhibited tendencies towards generic outputs dominated by repetitive categories such as “unauthorised access” or “data breach.”

8. Conclusions

This study presented a comparative evaluation of four LLMs—ChatGPT, CoPilot, Gemini and Perplexity—across three established threat modelling methodologies: STRIDE, PASTA and LINDDUN. The methodological design incorporated controlled experimental conditions, including the use of isolated incognito sessions to prevent cross-learning effects, ensuring that the resulting differences in model outputs reflected genuine methodological and architectural variances rather than adaptive data retention. This design strengthens the validity and reproducibility of the obtained results.
The analysis demonstrated that model performance is both methodology-dependent and model-specific. ChatGPT achieved the highest precision and stability in structured, security-focused frameworks such as STRIDE and PASTA, identifying 17–24 relevant threats per scenario with minimal false positives. Gemini excelled in the privacy-oriented LINDDUN methodology, generating the broadest range of privacy threats (29–34 per query), while CoPilot maintained moderate but stable performance across all the methodologies. Perplexity produced the least comprehensive results, with repetitive and generalised threat descriptions.
From a methodological perspective, the consistency of results across repeated runs and the strong alignment between the TP/FP patterns and framework intent demonstrate the adequacy of the experimental setup and analysis approach. The comparative structure of this evaluation—analysing three distinct frameworks and four diverse models—ensures that the conclusions are supported empirically rather than hypothetically.
The study’s findings led to two main conclusions. First, the combination of complementary models provides greater threat coverage and a lower false positive rate than any single model, supporting hybrid configurations in operational use. Second, LLMs can serve as a highly effective support tool for security professionals by accelerating the threat modelling process. This makes it easier to identify overlooked vulnerabilities and allows experts to focus on tasks that require human judgment and contextual understanding. In this way, LLMs do not replace human expertise, but are valuable assistants that improve the consistency and reliability of the identified threats.
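The first conclusion—that complementary models cover more threats together—can be operationalised as a union of the per-model threat sets with deduplication of near-identical entries. The sketch below is a minimal illustration; the word-order-insensitive deduplication rule is our assumption, and a production pipeline would use stronger semantic matching:

```python
# Sketch of a hybrid configuration: union the threat sets of several
# models and deduplicate near-identical entries. The normalisation rule
# (case- and word-order-insensitive) is an illustrative assumption.
def merge_threat_sets(*model_outputs):
    seen, merged = set(), []
    for output in model_outputs:
        for threat in output:
            key = " ".join(sorted(threat.lower().split()))
            if key not in seen:
                seen.add(key)
                merged.append(threat)
    return merged

chatgpt = ["JWT forgery", "SQL injection"]
gemini = ["sql injection", "Linkability of user profiles"]
print(len(merge_threat_sets(chatgpt, gemini)))  # 3
```

Such a merge keeps ChatGPT's technical findings and Gemini's privacy findings in one report while discarding the generic duplicates that inflate single-model outputs.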
Looking forward, further research should extend this work by testing additional frameworks, which offer complementary perspectives for threat and risk assessment. Another promising direction is the development of prompts that are adapted to different threat modelling methodologies; by using multiple versions of prompts and systematically analysing the results, researchers can gain deeper insights into the responsiveness and accuracy of LLMs. This approach also enables robust statistical analysis of combinations of prompts and models, with different metrics further strengthening quantitative assessment and comparison.

Author Contributions

Writing—original draft, N.J.; Writing—review & editing, M.T. and T.B. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency (Research Core Funding No. P2-0057).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. E-Commerce

Table A1. System components overview.

Component | Description
Frontend Interface | Enables users to interact with the system and is developed using React.js, HTML, CSS, and JavaScript.
Backend API | Handles requests and integrates with other components of the system. It is implemented using Node.js and Express.js.
Authentication Server | Ensures secure user login sessions by utilizing JSON Web Tokens (JWT).
Database | Built on PostgreSQL and is responsible for storing user profiles, purchase history and other system-related data.
Payment Processing | Managed by the Stripe API, which handles secure payment transactions on the platform.
Order and Inventory Management | Coordinates orders, inventory and logistics, ensuring timely fulfilment of customer purchases.
Email Notification System | Powered by Amazon SES, the email notification system sends notifications to users, including order confirmations and updates.
Administrative System | Provides tools for platform management, such as user administration, inventory tracking and order processing.
Infrastructure | Hosted on AWS, leveraging services like EC2, S3, RDS and Lambda to ensure scalability and reliability.
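For automation, an inventory such as Table A1 can be captured as a simple data structure from which scenario descriptions are assembled. The dictionary and prompt wording below are illustrative only; the actual prompts used in the study are shown in Appendices B–D.

```python
# Illustrative sketch: the Table A1 component inventory as a data
# structure from which a scenario prompt could be assembled.
# The prompt wording is hypothetical, not the study's actual prompt.
components = {
    "Frontend Interface": "React.js, HTML, CSS, JavaScript",
    "Backend API": "Node.js, Express.js",
    "Authentication Server": "JWT-based login sessions",
    "Database": "PostgreSQL (user profiles, purchase history)",
    "Payment Processing": "Stripe API",
    "Order and Inventory Management": "orders, inventory, logistics",
    "Email Notification System": "Amazon SES",
    "Administrative System": "user administration, inventory, orders",
    "Infrastructure": "AWS (EC2, S3, RDS, Lambda)",
}

prompt = "Identify threats for an e-commerce system with:\n" + "\n".join(
    f"- {name}: {detail}" for name, detail in components.items()
)
```

Keeping the inventory in one place makes it easy to generate the low-, medium- and high-specificity prompt variants from the same source.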

Appendix B. Prompt 1

Figure A1. Prompt 1, low specificity; brief listing of the core technologies.

Appendix C. Prompt 2

Figure A2. Prompt 2, medium specificity; the inclusion of infrastructure details and component relationships.

Appendix D. Prompt 3

Figure A3. Prompt 3, high specificity; with version numbers and configuration parameters.

Appendix E. Example Output Report

Figure A4. Example output of Prompt 2 (medium specificity) produced by the LLM ChatGPT for STRIDE.

References

  1. United Nations. Goal 9|Department of Economic and Social Affairs. 2023. Available online: https://sdgs.un.org/goals/goal9 (accessed on 7 November 2025).
  2. Kaur, R.; Gabrijelčič, D.; Klobučar, T. Artificial intelligence for cybersecurity: Literature review and future research directions. Inf. Fusion 2023, 97, 101804. [Google Scholar] [CrossRef]
  3. Argyroudis, S.A.; Mitoulis, S.A.; Chatzi, E.; Baker, J.W.; Brilakis, I.; Gkoumas, K.; Vousdoukas, M.; Hynes, W.; Carluccio, S.; Keou, O.; et al. Digital technologies can enhance climate resilience of critical infrastructure. Clim. Risk Manag. 2022, 35, 100387. [Google Scholar] [CrossRef]
  4. Palma, G.; Cecchi, G.; Caronna, M.; Rizzo, A. Leveraging Large Language Models for Scalable and Explainable Cybersecurity Log Analysis. J. Cybersecur. Priv. 2025, 5, 55. [Google Scholar] [CrossRef]
  5. Xiong, W.; Lagerström, R. Threat modeling—A systematic literature review. Comput. Secur. 2019, 84, 53–69. [Google Scholar] [CrossRef]
  6. V, M.; M, P. Integrating Risk assessment and Threat modeling within SDLC process. In Proceedings of the 2016 International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 26–27 August 2016. [Google Scholar] [CrossRef]
  7. Rashid, S.; Bollis, E.; Pellicer, L.; Rabbani, D.; Palacios, R.; Gupta, A.; Gupta, A. Evaluating Prompt Injection Attacks with LSTM-Based Generative Adversarial Networks: A Lightweight Alternative to Large Language Models. Mach. Learn. Knowl. Extr. 2025, 7, 77. [Google Scholar] [CrossRef]
  8. Buczak, A.L.; Guven, E. A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Commun. Surv. Tutor. 2016, 18, 1153–1176. [Google Scholar] [CrossRef]
  9. Sai, S.; Challa, S. Leveraging AI for Risk Management in Computer System Validation. Int. J. Multidiscip. Innov. Res. Methodol. 2024, 3, 145–153. [Google Scholar]
  10. Trifonov, R.; Nakov, O.; Mladenov, V. Artificial Intelligence in Cyber Threats Intelligence. In Proceedings of the 2018 International Conference on Intelligent and Innovative Computing Applications (ICONIC), Mon Tresor, Mauritius, 6–7 December 2018. [Google Scholar]
  11. Jawhar, S.; Kimble, C.E.; Miller, J.R.; Bitar, Z. Enhancing Cyber Resilience with AI-Powered Cyber Insurance Risk Assessment. In Proceedings of the 2024 IEEE 14th Annual Computing and Communication Workshop and Conference, CCWC 2024, Las Vegas, NV, USA, 8–10 January 2024; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2024; pp. 435–438. [Google Scholar] [CrossRef]
  12. Singh, V. Explore the role of artificial intelligence and machine learning in improving risk assessment. Quest J. J. Res. Bus. Manag. 2023, 11, 110–114. [Google Scholar]
  13. Baryannis, G.; Validi, S.; Dani, S.; Antoniou, G. Supply chain risk management and artificial intelligence: State of the art and future research directions. Int. J. Prod. Res. 2019, 57, 2179–2202. [Google Scholar] [CrossRef]
  14. Yaseen, A. Reducing industrial risk with AI and automation. Int. J. Intell. Autom. Comput. 2021, 4, 60–80. [Google Scholar]
  15. veria Hoseini, S.; Suutala, J.; Partala, J.; Halunen, K. Threat modeling AI/ML with the Attack Tree. IEEE Access 2024, 12, 172610–172637. [Google Scholar] [CrossRef]
  16. Sarker, I.H.; Janicke, H.; Mohsin, A.; Gill, A.; Maglaras, L. Explainable AI for cybersecurity automation, intelligence and trustworthiness in digital twin: Methods, taxonomy, challenges and prospects. ICT Express 2024, 10, 935–958. [Google Scholar] [CrossRef]
  17. Lohmann, P.; Albuquerque, C.; Machado, R.C.S.; Lohmann, P.A.; Machado, R. Systematic Literature Review of Threat Modeling Concepts. In Proceedings of the 9th International Conference on Information Systems Security and Privacy, Lisbon, Portugal, 22–24 February 2023. [Google Scholar] [CrossRef]
  18. Sattar, D.; Vasoukolaei, A.H.; Crysdale, P.; Matrawy, A. A STRIDE Threat Model for 5G Core Slicing. In Proceedings of the 2021 IEEE 4th 5G World Forum (5GWF), Montreal, QC, Canada, 13–15 October 2021; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2021; pp. 247–252. [Google Scholar] [CrossRef]
  19. Shevchenko, N.; Chick, T.A.; O’Riordan, P.; Scanlon, T.P.; Woody, C. Threat Modeling: A Summary of Available Methods; Carnegie Mellon University, Software Engineering Institute: Pittsburgh, PA, USA, 2018. [Google Scholar]
  20. Threat Modeling Process|OWASP Foundation. Available online: https://owasp.org/www-community/Threat_Modeling_Process (accessed on 5 February 2025).
  21. Pöhls, H.C.; Kügler, F.; Geloczi, E.; Klement, F. Segmentation and Filtering Are Still the Gold Standard for Privacy in IoT—An In-Depth STRIDE and LINDDUN Analysis of Smart Homes. Future Internet 2025, 17, 77. [Google Scholar] [CrossRef]
  22. UcedaVelez, T.; Morana, M.M. Intro to Pasta. In Risk Centric Threat Modeling: Process for Attack Simulation and Threat Analysis; Wiley: Hoboken, NJ, USA, 2015; pp. 317–342. [Google Scholar] [CrossRef]
  23. Wuyts, K. Privacy Threats in Software Architectures. Ph.D. Thesis, KU Leuven, Leuven, Belgium, 2015. [Google Scholar]
  24. Wuyts, K.; Sion, L.; Joosen, W. LINDDUN GO: A Lightweight Approach to Privacy Threat Modeling. In Proceedings of the 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), Genoa, Italy, 7–11 September 2020; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2020; pp. 302–309. [Google Scholar] [CrossRef]
  25. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar] [PubMed]
  26. Pichai, S.; Hassabis, D. Introducing Gemini: Google’s Most Capable AI Model Yet. 2023. Available online: https://blog.google/technology/ai/google-gemini-ai/#sundar-note (accessed on 15 September 2025).
  27. OpenAI. Introducing ChatGPT|OpenAI. 2022. Available online: https://openai.com/index/chatgpt/ (accessed on 12 August 2025).
  28. PerplexityTeam. About Perplexity. Available online: https://www.perplexity.ai/hub/about (accessed on 12 August 2025).
  29. CopilotTeam. What Is a Copilot and How Does It Work? Available online: https://www.microsoft.com/en-us/microsoft-copilot/copilot-101/what-is-copilot (accessed on 12 August 2025).
  30. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 195. [Google Scholar] [CrossRef]
  31. Brown, T.B. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  32. Schick, T.; Schütze, H. Exploiting cloze questions for few shot text classification and natural language inference. arXiv 2020, arXiv:2001.07676. [Google Scholar]
  33. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv 2024, arXiv:2402.07927. [Google Scholar] [CrossRef]
  34. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 24824–24837. [Google Scholar]
  35. OWASP Foundation, Inc. OWASP Top Ten. 2025. Available online: https://owasp.org/www-project-top-ten/ (accessed on 11 November 2025).
  36. Nacionalni Odzivni Center za Kibernetsko Varnost SI-CERT. Poročilo o Kibernetski Varnosti 2024; Technical Report; SI-CERT, Akademska in Raziskovalna Mreža Slovenije (ARNES): Ljubljana, Slovenia, 2024. [Google Scholar]
Figure 1. Revised methodological workflow with LLM selection and prompt-specific evaluations.
Figure 2. Comparison of LLM performance in STRIDE-based threat identification.
Figure 3. Comparison of LLM performance in PASTA-based threat identification.
Figure 4. Comparison of LLM performance in LINDDUN-based threat identification.
Table 1. Key threat categories in the STRIDE methodology [20,21].

Threat | Security Control | Definition
Spoofing | Authentication | Impersonation of a legitimate entity.
Tampering | Integrity | Unauthorised modification of data or processes.
Repudiation | Non-repudiation | Denying responsibility for an action.
Information Disclosure | Confidentiality | Unauthorised access to sensitive data.
Denial of Service | Availability | Disrupting system functionality.
Elevation of Privilege | Authorization | Gaining unauthorised access to higher privileges.
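The mapping in Table 1 is straightforward to encode, for example when tagging each threat in an LLM report with the security control it violates. A minimal lookup sketch:

```python
# The Table 1 mapping as a lookup table: each STRIDE category points to
# the security control that counters it.
STRIDE_CONTROLS = {
    "Spoofing": "Authentication",
    "Tampering": "Integrity",
    "Repudiation": "Non-repudiation",
    "Information Disclosure": "Confidentiality",
    "Denial of Service": "Availability",
    "Elevation of Privilege": "Authorization",
}

def control_for(category: str) -> str:
    """Return the security control countering a STRIDE category."""
    return STRIDE_CONTROLS[category]
```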
Table 2. Stages of the PASTA methodology [22].

Stage | Description
Define the Objectives | Establish business and security goals.
Define the Technical Scope | Identify system components and infrastructure.
Application Decomposition | Create a system data flow diagram (DFD).
Threat Analysis | Identify security threats using STRIDE.
Vulnerability and Weakness Analysis | Evaluate the risks using CVSS scoring.
Attack Modelling | Develop attack trees.
Risk and Impact Analysis | Define mitigation strategies.
Table 3. Key phases of the LINDDUN PRO methodology [21].

Phase | Description
Data Flow Diagram (DFD) Definition | Provide a high-level system description identifying the included components, data flows and processes.
Mapping Threats to DFD | For each system component defined in Step I, identify the related LINDDUN components.
Identifying Scenarios | Identify the potential threats and usage scenarios for each category.
Prioritising Threats | Evaluate the risks based on their likelihood and impact.
Identifying Requirements | Map privacy threats to concrete privacy requirements.
Mitigation Strategy Planning | Define strategies to address or minimise the identified privacy risks.
Table 4. Comparison of the selected LLMs for threat modelling.

Model | Key Feature | Threat Modelling Role
Gemini [26] | Advanced multimodal architecture enabling integrated text, image and audio processing | Analyses threats using text, images and logs; suited to enterprise environments
ChatGPT [27] | Large-scale conversational language model optimised for natural dialogue and reasoning | Baseline for context and frameworks (e.g., STRIDE); structured analysis
Perplexity [28] | Real-time retrieval-augmented generation framework combining LLM output with live web data | Identifies current, source-cited threats and mitigations
GitHub CoPilot [29] | Code-oriented large language model trained for software development assistance and autocompletion | Detects code-level vulnerabilities and suggests mitigations
Table 5. Classification rules for LLM and expert threat identification.

LLM Label | Expert Label | Counted As
TP | TP | TP
TP | FP | FP
FP | TP | TP
FP | FP | FP
TP | Not Found | TP
FP | Not Found | FP
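The six rules in Table 5 reduce to a simple precedence: the expert's judgement overrides the LLM's label, and the LLM's label stands only when the expert did not record the threat. A sketch of the rule as code:

```python
# Sketch of the Table 5 classification rules. The expert's judgement
# takes precedence; when the expert did not record the threat
# ("Not Found"), the LLM's own label stands.
def classify(llm_label: str, expert_label: str) -> str:
    """llm_label: 'TP' or 'FP'; expert_label: 'TP', 'FP', or 'Not Found'."""
    return llm_label if expert_label == "Not Found" else expert_label
```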
Table 6. Overall comparison of model and human expert results based on threat methods.

Model | STRIDE (TP/FP) | PASTA (TP/FP) | LINDDUN (TP/FP)
ChatGPT | 17–22/1 | 22–24/1 | 14–25/2–3
CoPilot | 16–20/3–4 | 14–17/4–5 | 24–29/3–4
Gemini | 12–15/6–8 | 14–23/7–9 | 29–34/8–10
Perplexity | 14–18/4–6 | 15–18/4–5 | 14–16/3–5
Human expert | 29 | 31 | 34
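Per-model precision, TP / (TP + FP), can be approximated from Table 6 by taking the midpoints of the reported ranges; the midpoint values below are illustrative simplifications, not exact study figures.

```python
# Illustrative precision estimates from Table 6, using midpoints of the
# reported TP/FP ranges. Midpoints are approximations for the sake of
# the example, not exact figures from the study.
def precision(tp: float, fp: float) -> float:
    return tp / (tp + fp)

# ChatGPT on STRIDE: 17-22 TP and 1 FP -> midpoints (19.5, 1)
chatgpt_stride = precision(19.5, 1)   # roughly 0.95
# Gemini on STRIDE: 12-15 TP and 6-8 FP -> midpoints (13.5, 7)
gemini_stride = precision(13.5, 7)    # roughly 0.66
```

The gap between the two estimates reflects the paper's finding that ChatGPT was the most precise model on the structured, security-focused frameworks.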
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Jeršič, N.; Turkanović, M.; Beranič, T. Towards a Sustainable Cybersecurity Governance: Threat Modelling with Large Language Models. Sustainability 2025, 17, 10569. https://doi.org/10.3390/su172310569

