Article

Clever Hans in the Loop? A Critical Examination of ChatGPT in a Human-in-the-Loop Framework for Machinery Functional Safety Risk Analysis

by Padma Iyenghar 1,2
1 Faculty of Engineering and Computer Science, University of Applied Sciences Osnabrück, 49009 Osnabrück, Germany
2 innotec GmbH-TÜV Austria Group, Hornbergstrasse 45, 70794 Filderstadt, Germany
Submission received: 13 December 2024 / Revised: 26 January 2025 / Accepted: 5 February 2025 / Published: 7 February 2025

Abstract
This paper presents a first-of-its-kind evaluation of integrating Large Language Models (LLMs) within a Human-In-The-Loop (HITL) framework for risk analysis in machinery functional safety, adhering to ISO 12100. The methodology systematically addresses LLM limitations, such as hallucinations and lack of domain-specific expertise, by embedding expert oversight to ensure reliable and compliant outputs. Applied to four diverse industrial case studies—motorized gates, autonomous transport vehicles, weaving machines, and rotary printing presses—this study assesses the applicability of ChatGPT in routine risk analysis tasks central to machinery functional safety workflows, such as hazard identification and risk assessment. The results demonstrated substantial improvements: during HITL involvement and the subsequent iterations of risk assessment with expert feedback, a complete agreement with ground truth was achieved across all four use cases. ChatGPT also identified additional scenarios and edge cases, enriching the risk analysis. Efficiency gains were notable, with time efficiency rated at 4.95 out of 5, on average, across case studies. Overall accuracy (4.7 out of 5) and usability (4.8 out of 5) ratings demonstrated the robustness of the HITL framework in ensuring reliable and practical outputs. Likert scale evaluations reflected high confidence in the refined outputs, emphasizing the critical role of HITL in enhancing both trust and usability. The study also highlights the importance of prompt design, revealing that longer initial prompts improve accuracy, while shorter iterative prompts maintain usability without compromising efficiency. The iterative HITL process further ensures that refined outputs align with safety standards and practical requirements. This evaluation underscores the transformative potential of generative AI in functional safety workflows, enhancing routine activities while ensuring rigorous human oversight in safety-critical, regulated industries.

1. Introduction

According to ISO 12100 [1], risk analysis involves identifying hazards and estimating risks considering the severity and likelihood of harm. Risk assessment is the next step, where these risks are evaluated to determine the need for risk reduction, followed by the implementation of safety measures. These steps are crucial to ensure the functional safety of machinery, maintaining high safety standards in industrial environments, and meeting regulatory requirements [1,2,3].
A detailed risk analysis and assessment not only ensures safe machinery operation, reducing risks to humans and the environment, but also fulfills international safety requirements, such as the EU Machinery Directive [3]. This directive requires documented risk assessments for CE marking, which is essential for legally marketing machinery in the EU. The process, guided by standards such as ISO 12100, includes the identification of hazards, the evaluation of risks, and the implementation of protective measures, with proper documentation and regular updates as key elements to ensure ongoing compliance and safety.

1.1. Role of Large Language Models (LLMs) in Risk Analysis

Large Language Models (LLMs) represent a transformative leap in Artificial Intelligence (AI) and Natural Language Processing (NLP). These models, trained on extensive and diverse datasets, excel at generating coherent and contextually relevant text by identifying and reproducing patterns from their training data [4]. In the domain of risk analysis, LLMs offer the potential to revolutionize traditional methodologies by enabling the rapid generation of detailed hazard identifications and risk assessments. This capability can streamline the evaluation of potential risks, enhancing the efficiency and effectiveness of safety processes in industrial applications.
Several surveys and analyses underscore the transformative potential of generative AI across industries. For instance, the KPMG survey (https://kpmg.com/kpmg-us/content/dam/kpmg/pdf/2023/generative-ai-survey.pdf, accessed on 31 January 2025) highlights how generative AI enhances routine activities by automating tasks, reducing repetitive processes, and improving operational efficiency while emphasizing the need for robust oversight to address cybersecurity and data privacy challenges. Similarly, McKinsey’s 2024 reports (https://www.mckinsey.com/featured-insights/mckinsey-global-surveys, accessed on 31 January 2025) reveal surging adoption rates, with organizations leveraging generative AI to streamline workflows, drive cost efficiency, and enable innovation in domains like risk assessment, IT, and product development. High-performing companies particularly excel by aligning AI adoption with business strategies, building proprietary data assets, and fostering agile, cross-functional teams to harness AI’s capabilities effectively. These insights are further reinforced in McKinsey’s August 2024 survey, which explores how employee-driven experimentation is paving the way for broader organizational transformation through generative AI, necessitating structural and cultural shifts to maximize its value. Moreover, McKinsey’s analyses emphasize how generative AI can significantly enhance productivity in various business functions, including risk management and compliance, by automating and accelerating critical processes (https://www.mckinsey.com/capabilities/risk-and-resilience/our-insights/implementing-generative-ai-with-speed-and-safety, accessed on 31 January 2025).
However, challenges such as hallucinations and biases require robust human oversight (e.g., as emphasized in [5]). This is because LLMs do not truly understand [4] the content they generate, as their outputs are based on data patterns rather than a deep comprehension of meaning. This limitation can be problematic in risk analysis, where accuracy and reliability are crucial. The risk of hallucination, where LLMs produce plausible but incorrect information, further underscores the need for careful supervision in safety-critical applications. Moreover, organizations must also consider challenges like data privacy concerns, the integration of robust governance frameworks, and the need for continuous monitoring and adaptation of generative AI systems to address emerging risks (https://www.mckinsey.com/capabilities/strategy-and-corporate-finance/our-insights/managing-the-risks-around-generative-ai, accessed on 31 January 2025). These risks underline the importance of combining AI capabilities with human expertise to ensure the safety, reliability, and ethical use of AI systems in high-stakes environments.

1.2. Integrating Human-in-the-Loop Systems

Given the limitations of LLMs, such as hallucination, bias, and overgeneralization, integrating human expertise through a Human-In-The-Loop (HITL) approach is crucial in safety-critical settings. This methodology leverages the strengths of LLMs while addressing their weaknesses, particularly the Clever Hans effect [4,6], where LLMs may generate correct-looking outputs without genuine problem-solving. Increasingly, regulatory frameworks like the European Union’s AI Act [7] mandate such oversight for high-risk Artificial Intelligence (AI) applications, requiring decisions to be verifiable, auditable, and controllable. Thus, by integrating human expertise to critically evaluate and refine AI-generated results, this approach enhances the accuracy and applicability of risk analyses, making it a viable solution for usage in regulated industries.

1.3. Novelties and Contributions

This paper introduces several novel contributions, as outlined below.
AI Integration in Risk Analysis: The study proposes a cautious approach that integrates the generative capabilities of LLMs into traditional risk analysis processes. This methodology is designed to enhance the efficiency and speed of identifying and assessing potential risks, while maintaining rigorous human oversight to ensure accuracy and reliability.
Human-in-the-Loop Workflow: To address the limitations of LLMs, including their lack of deep comprehension and the risk of generating inaccurate outputs, we develop an HITL workflow. This framework ensures that AI-generated results are rigorously evaluated and refined by functional safety experts, thereby improving the reliability and applicability of AI-assisted risk analyses.
Empirical Validation with Case Studies: The proposed workflow is empirically validated through detailed case studies in industrial contexts, where risk analysis based on ISO 12100 is applied. The results are compared with the established ground truth of case studies [8].
These contributions mark the first integration of the generative capabilities of LLMs with a Human-in-the-Loop approach for risk analysis using ISO 12100. This novel approach significantly advances the role of AI in safety-critical environments, demonstrating its potential to meet both operational and regulatory requirements. It can be stated that the work presented in this paper lays the foundation for broader acceptance and implementation of AI in regulated domains, setting a new direction for future research and development in the field.
On the other hand, it is important to note that this study focuses on evaluating the feasibility of few-shot prompting with ChatGPT rather than optimizing prompt design or configurations. While the analysis centers on evaluation using a single LLM, i.e., ChatGPT, it acknowledges that alternative LLMs or approaches may offer superior performance for specific tasks. Furthermore, this paper recognizes advanced methods such as Retrieval-Augmented Generation (RAG), including frameworks like RAGAS [9], which integrate domain-specific external information for structured response generation [10,11], as well as Explainable AI (XAI) techniques [12] to enhance interpretability, transparency, and trustworthiness, as possible extensions. Please note that these advancements are beyond the scope of the work presented in this paper.
The remainder of this paper is organized as follows: Following this introduction section, background and related work are provided in Section 2. The limitations and opportunities in risk analysis with AI integration are discussed in Section 3. The systematic study setup proposed in this paper, including the workflow for risk analysis involving human oversight, the real-life case studies selected for experimental evaluation, and the evaluation methodology used for the analysis of results, is presented in Section 4. The results based on the application of the proposed methodology to the four real-life case studies are explained in detail in Section 5, which also includes a summary of the evaluation of the methodology on these case studies. Additionally, Section 5.6 discusses the threats to validity, which outline the potential limitations and challenges that may affect the generalizability and reliability of the study’s findings. The paper concludes with a summary of key findings, future research directions, and implications for practice in Section 6.

2. Background and Related Work

In this section, an overview of the ISO 12100 standard and its relevance to functional safety is provided alongside related work pertaining to this topic. Following this, a brief background and literature review on the evolution, capabilities, and adoption of LLMs is discussed.

2.1. Overview of ISO 12100 and Its Relevance to Functional Safety

In the domain of functional safety, risk analysis plays a crucial role in identifying, assessing, and mitigating risks associated with the use of electrical, electronic, and programmable electronic safety-related systems. Functional safety standards, such as IEC 61508 [13], ISO 13849 [2], and ISO 12100 [1], provide frameworks for conducting systematic risk analyses to ensure that systems achieve acceptable levels of safety. The core objective of risk analysis in this context is to prevent accidents and malfunctions that could lead to injuries, fatalities, or environmental damage. This is achieved by defining and implementing safety functions that can adequately control potential hazards under all foreseen operating conditions. The ISO 12100 standard, in particular, is a fundamental safety standard that provides guidelines and principles for the risk assessment and risk reduction of machinery. It outlines methodologies for identifying hazards, estimating and evaluating risks, and specifying necessary safety measures. Compliance with ISO 12100 is essential for manufacturers to meet international safety requirements and to ensure machinery is safe for use, highlighting its critical role in risk assessment processes.
The iterative process of risk analysis, risk assessment, and risk reduction in line with the ISO 12100 standard [1] is shown in Figure 1.
A risk assessment follows a series of logical steps to identify and examine any potential hazards associated with machinery. The process starts with hazard identification within the machine's limits of space, time, and use. The risk associated with each hazard is then estimated using the risk elements harm severity (S), occurrence frequency (F), and the possibility of avoiding or limiting the harm (P). Based on this information, the risk is evaluated to determine whether it is acceptable. If not, risk reduction measures are required. The whole process is called risk assessment. Iteration of this process can be necessary to eliminate hazards as far as practicable and to adequately reduce risks through the implementation of protective measures. Protective measures play an important role in risk reduction. Such measures include protection devices and safety controls, the combination of which is called the Safety-Related Part of the Control System (SRP/CS) [8].

2.1.1. Safety Function

Safety functions (SFs) are the machine functions that cause an immediate increase in risk upon failure [2]. A single SF can be implemented by multiple SRP/CSs. A single SRP/CS may also implement multiple SFs, such as the prevention of accidental start-up and the limiting of safety-related parameters such as temperature and pressure. For example, the control system shuts down the furnace fire when the boiler pressure reaches a dangerous value. If this function fails, excessive pressure will lead to an explosion. In this scenario, safety depends on the SRP/CS performing the correct function. Each SF is tasked with reducing the risk of one or more hazardous events. It is necessary to consider each hazard and its corresponding SF in the design.

2.1.2. Required Performance Level (PLr)

PLr is the risk reduction expectation required for the implementation of an SF and can be determined by the risk graph (cf. Figure 2). A risk graph is a grading-based risk estimation method with parameters S, F, and P, corresponding to the severity of harm, the duration or frequency of operator exposure to the hazard area, and the possibility of avoiding the hazard, respectively [8]. It is also used to determine the performance level of the safety function that is needed to reduce the risk to a permissible level. Figure 2 shows the structure of the risk graph.
The severity (S) of harm is divided into S1-slight and S2-serious. Only slight reversible harm, serious irreversible harm, and fatalities are considered when estimating the level of harm [8]. For example, fatigue and slipping are categorized as S1, and amputation and death are categorized as S2.
The frequency (F) or time of exposure to the hazard is classified as F1-seldom or short time and F2-frequent or long time. This parameter is a measure of the time spent in the danger zone. As per [2], when the operator is present in the area more frequently than once every 15 min, it is considered as F2 level.
The possibility (P) of avoiding hazards is divided into categories P1 and P2, which are determined based on whether the hazard can be identified or prevented. If it is possible to avoid an accident under certain circumstances, P1 is chosen, but if it is almost impossible to avoid, then P2 is chosen. Factors that affect parameter P include the speed of the hazardous situation leading to harm, any awareness of risk, and the human ability to escape. For example, the speed of machine operation is limited so that potential accidents are delayed, and the operator has the opportunity to react and leave the zone.
As can be seen from the graph (Figure 2), combining these parameters increases the risks from low to high (i.e., from PLr a to PLr e, where PLr e is the highest level required for an SF and the most expensive to implement).
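To make this mapping concrete, the sketch below encodes the risk graph as a simple lookup. The S/F/P-to-PLr assignments reflect the commonly cited ISO 13849-1 risk graph and are given here only for illustration; they should be checked against Figure 2 and the standard before any practical use.

```python
# Sketch of the ISO 13849-1 style risk graph lookup (verify against Figure 2 and the standard).
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a",
    ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b",
    ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c",
    ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d",
    ("S2", "F2", "P2"): "e",
}

def required_performance_level(severity: str, frequency: str, possibility: str) -> str:
    """Return the required performance level PLr ('a'..'e') for a given
    severity (S1/S2), frequency/exposure (F1/F2), and possibility of
    avoidance (P1/P2) combination."""
    return RISK_GRAPH[(severity, frequency, possibility)]

# Example: serious harm (S2), frequent exposure (F2), avoidance hardly possible (P2) -> 'e'
print(required_performance_level("S2", "F2", "P2"))
```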

2.1.3. Design Release

After determining that the risk is unacceptable, the SRP/CS is designed as a safety guard. All SFs required to prevent the risk are first identified. The properties of each SF are clearly defined. For example, the machine is stopped when the safety door is opened and automatic restarts are prevented during this period. Then, the PL r is determined for every SF. After implementation and evaluation (e.g., in [14]), the match between the actual PL and PL r is verified. The former must be higher or equal to the latter. This is the condition under which the design can be released. Please note that the entire process of risk assessment and determination of PL and PL r are discussed so far. However, in the work presented in this paper, the workflow presented in Section 4.1 is expected to identify the hazards for a use case and provide a suggestion for PL r value. This suggestion is generated based on the interaction with LLMs. On evaluation by human oversight (e.g., functional safety engineering expert), if required, a risk evaluation and risk reduction measures are queried with prompts to LLMs in a second iteration.

2.1.4. Related Work

Traditional methods for risk analysis utilize qualitative techniques, such as checklist analysis, What-If analysis, and HAZOP (Hazard and Operability Study), along with more quantitative methods, like Failure Mode and Effects Analysis (FMEA) and Fault Tree Analysis (FTA). These methodologies provide structured frameworks for the identification of potential hazards and the systematic assessment of risks, aiding in the development of effective mitigation strategies. Currently, in the absence of Type-C standards that specify performance levels for safety functions, a comprehensive risk assessment based on ISO 12100 is required by the EU Directive [3]. Such analyses are typically performed by functional safety experts well versed in machinery functional safety and standards such as ISO 12100 and ISO 13849.
Numerous studies, such as [15,16,17,18,19,20,21], have investigated risk analysis and safety assessment across various domains, including UAVs, intelligent manufacturing, airport operations, and industrial automation, utilizing both qualitative and quantitative methods that incorporate safety standards and probabilistic models. Notably, the research detailed in [22] introduces the novel concept of virtual safety engineering, proposing an internally developed chatbot to determine the Required Performance level (PLr) for hazard scenarios. Similarly, a chatbot for risk reduction was suggested in [23]. Focusing on risk assessment methodologies for AI-based systems, ref. [24] outlines a framework that integrates Operational Design Domains (ODDs). The work in [24] also highlights that advanced methods such as System-Theoretic Process Analysis (STPA) are additionally necessary to analyze emergent behavior and the safety properties that arise from complex sub-system interactions. The paper discusses the challenges of terminology alignment in AI safety and proposes a blueprint for comprehensive risk assessment and assurance in safety-critical applications. The use of IoT and ML to enhance safety and security in smart environments is explored in [25]. The authors discuss how ML and deep learning (DL) techniques can detect normal and abnormal behaviors in IoT ecosystems, contributing to security-based intelligence systems. This approach is relevant to functional safety standards, as it provides a foundation for developing robust safety frameworks in industrial environments. Despite these advancements, discussions on the use of LLMs in risk analysis have been notably absent.

2.2. Evolution, Capabilities, and Adoption of LLMs

Recent advancements in NLP have brought forward powerful LLMs, which have shown significant potential in automating complex tasks across various domains. A recent study in [26] provides a comprehensive overview of LLMs. The paper highlights the challenges, like computational costs, ethical considerations, and alignment with human values, emphasizing the need for responsible development. From this study and other related works [27,28], it is evident that LLMs have seen rapid evolution and advancements, beginning with early models like GPT-2 (https://openai.com/blog/better-language-models/, accessed on 31 January 2025), which demonstrated the ability to generate coherent text, to more powerful models such as GPT-3 (https://openai.com/research/gpt-3, accessed on 31 January 2025) and GPT-4 (https://openai.com/research/gpt-4, accessed on 31 January 2025) by OpenAI. These models, trained on massive datasets, have billions of parameters, allowing them to generate highly sophisticated and contextually relevant text. Other notable examples include Google’s PaLM (https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html, accessed on 31 January 2025) and Meta’s Llama (https://ai.facebook.com/blog/large-language-model-llama-meta-ai/, accessed on 31 January 2025), which push the boundaries of Natural Language Understanding (NLU) and generation across various applications.
LLMs have demonstrated remarkable capabilities across a wide range of tasks. GPT-4 excels in generating contextually relevant and coherent text, making it suitable for creative writing, coding assistance, and complex problem-solving. Google’s PaLM (https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html, accessed on 31 January 2025) offers exceptional performance in multilingual tasks, showcasing its ability to handle diverse languages and dialects. Meta’s Llama emphasizes efficiency in model size while maintaining competitive accuracy, making it ideal for research and industrial applications where computational resources are a concern [26].
The adoption of LLMs spans a wide array of industries, driven by their ability to automate complex tasks and enhance decision-making processes. In healthcare, LLMs are used for clinical documentation and patient interaction. In finance, they assist with fraud detection and automated reporting. The legal industry leverages LLMs for contract analysis and legal research. Furthermore, LLMs are employed in customer service across multiple sectors to improve response times and accuracy, significantly transforming traditional workflows [26]. Similarly, a recent article (https://springsapps.com/knowledge/large-language-model-statistics-and-numbers-2024#industries-using-llm-solutions, accessed on 31 January 2025) discusses the growing significance of LLMs in various industries, including e-commerce, education, finance, healthcare, and marketing. These models are utilized for tasks like automated grading, personalized recommendations, fraud detection, and clinical documentation. For instance, as per this report, the market for LLMs is projected to grow from USD 1.59 billion in 2023 to nearly USD 260 billion by 2030, driven by their application across industries like e-commerce, education, finance, and healthcare, with an expected adoption of 750 million applications by 2025.
Based on the literature review, it is clear that LLMs offer substantial benefits, including enhanced automation, decision-making, and operational efficiency across industries like e-commerce and healthcare. However, their adoption is hindered by significant challenges, such as generating biased or inaccurate information, privacy concerns, and ethical issues, like deep fakes. For instance, while LLMs show promise in personalized recommendations, influencing 91% of e-commerce decisions and achieving 83.3% diagnostic accuracy in healthcare, their accuracy drops to 22% in complex insurance queries. Additionally, the lack of transparency and explainability in LLM decision-making processes poses challenges for trust and regulatory compliance. Therefore, despite their potential, robust human oversight is necessary to mitigate these risks and ensure safety and compliance in regulated environments.
In summary, while ChatGPT and similar models offer substantial potential, they also bring challenges, especially in safety-critical domains, where the accuracy and reliability of outputs are paramount. Issues such as hallucination and inherent biases from training data underscore the need for careful integration of these models with human oversight [4,26]. These challenges are well documented in the literature, where researchers emphasize the importance of incorporating human expertise into AI-driven processes to ensure compliance with safety standards and regulatory requirements. Thus, integrating LLMs into safety-critical applications demands caution. While LLMs can automate complex tasks and enhance efficiency, their limitations require robust human oversight to ensure accuracy, reliability, and compliance with safety standards. Industries adopting LLMs in regulated environments must address ethical, privacy, and accuracy concerns to fully leverage their potential without compromising safety. As more secure and reliable versions of LLMs are developed, their adoption is likely to increase, but only alongside stringent controls that mitigate their inherent risks. This paper seeks to establish a workflow for the application of LLMs with the HITL approach in risk analysis, as per the ISO 12100 standard for safety-critical environments, and to draw preliminary conclusions based on the findings.

2.3. HITL and Multimodal AI for Functional Safety

This section explores the role of HITL methodologies, multimodal AI systems, and risk assessment frameworks in enhancing functional safety within industrial applications, focusing on the transition from Industry 4.0 to Industry 5.0 paradigms. Although these studies offer valuable insights into leveraging AI for safety and risk management, a clear gap remains in applying HITL with LLMs for machinery functional safety risk assessments.

2.4. LLMs in Safety Analysis

The work in [29] explores the potential of using LLMs such as GPT-3 to assist in hazard analysis for safety-critical systems. The authors propose a Co-Hazard Analysis (CoHA) method, where a human analyst interacts with an LLM (e.g., ChatGPT) to elicit possible hazard causes. The study evaluates the feasibility, utility, and scalability of CoHA using a simple water heater system with increasing complexity. The study demonstrates that CoHA is feasible and moderately useful for hazard analysis, particularly for simple systems. However, the performance of LLMs like ChatGPT degrades with increasing system complexity, and human analysts must carefully review and interpret the LLM’s responses. The authors suggest that future work should explore more open-ended queries, the use of LLMs for risk mitigation, and the application of CoHA to other hazard analysis methods like FMEA and HAZOP. On the other hand, the study does not deal with real-life use case scenarios of functional safety risk analysis and adherence to specific standards, such as ISO 12100 and ISO 13849, for risk assessment. The work presented in this study aims to address several gaps in the above work, such as the need for HITL and realistic case studies, by employing an HITL workflow for real-life use cases in machinery functional safety risk assessment tasks.
In [30], ChatGPT was applied to STPA for safety-critical systems, achieving 64% useful and correct responses in identifying unsafe control actions (UCAs). The study found that 35% of ChatGPT’s responses were correct and useful, while 27% were incorrect, with performance degrading as system complexity increased. The recurring duplex collaboration scheme outperformed human experts alone, identifying more UCAs, but required significant human oversight. STPA-specific prompts improved response relevance but reduced comprehensiveness, with 38% of responses being correct but not useful. The study identifies gaps in ChatGPT’s ability to handle complex systems, with 27% incorrect responses and declining performance as system complexity increases. Future work should focus on dynamic assurance for evolving LLMs, explore integration with other safety analysis methods, like HAZOP and FMEA, and develop standardized frameworks to ensure trustworthiness and reliability in safety-critical applications. This study directly addresses these limitations by proposing a systematic HITL workflow that ensures accuracy and reliability in risk analysis. By incorporating expert validation at multiple stages, this approach mitigates the risk of incorrect LLM outputs and ensures compliance with safety standards. Additionally, this study evaluates the time efficiency and usability of LLMs in real-world risk assessment tasks, providing a more comprehensive evaluation than [30].
The study in [31] introduces the Precision Answer Comparison and Evaluation Model (PACEM) to assess ChatGPT’s accuracy and processing time across domains like literature, history, law/ethics, and sport. The results show ChatGPT outperformed human answers in accuracy (e.g., 90.33% in law/ethics vs. 35.33% for humans) but required significantly longer processing times (e.g., 1281.32 s vs. 580.44 s for humans). However, 27% of ChatGPT’s responses were incorrect, and performance varied with question complexity. The study highlights the need for optimizing ChatGPT’s speed and accuracy for real-world applications. The work in this paper builds on this by not only evaluating accuracy and efficiency but also by introducing an HITL framework that significantly improves the reliability of LLM outputs. By focusing on machinery functional safety risk assessment, this study provides a domain-specific application of LLMs, ensuring that the outputs are both accurate and practical for industrial use. The iterative refinement process in this workflow also addresses the issue of incorrect responses, as human experts continuously validate and correct LLM outputs.

2.4.1. HITL Approaches and Methodologies

The increasing significance of HITL methodologies in AI assurance is extensively surveyed in [32,33]. They highlight the growing role of HITL in improving data quality, enhancing algorithmic trustworthiness, and enabling explainability in intelligent systems. HITL serves a dual role: (1) providing data assurance through automated labeling and crowdsourcing, and (2) offering algorithmic assurance by incorporating human feedback during model operations. This comprehensive view underscores HITL’s importance for safety-critical AI applications, particularly for achieving transparency and reliability. The relevance of HITL methodologies within smart manufacturing is further emphasized by [34], which discusses their role in cyber-physical systems (CPS) and introduces the concept of Industry X.0—an amalgamation of Industry 4.0 and 5.0. Human operators actively augment machine operations, enhancing decision-making, data visualization, and cyber-attack detection. Additionally, ref. [35] proposes a SafeHIL-RL framework for autonomous driving, where safety-aware reinforcement learning is combined with human–AI shared control to guide safer decision-making in dynamic operational environments.
Further exploration of HITL methodologies in enhancing prediction accuracy and machine learning model performance is presented in [36]. The paper addresses challenges in data quality and user engagement, while highlighting techniques like active learning that utilize human feedback for more precise decision-making. These approaches are particularly relevant for improving machine learning-based risk assessments in safety-critical domains.

2.4.2. Multimodal AI Systems and Industry 4.0/5.0

The shift toward human-centric AI is highlighted in [37], which focuses on the integration of AI technologies within the Industry 5.0 paradigm. The authors underscore the necessity of explainable AI (XAI), secure data protocols, and robust cyber-defense mechanisms for reliable functional safety assessments. This aligns with the broader goals of achieving transparency and trustworthiness in AI-driven industrial systems.
Expanding the concept of multimodal AI, ref. [38] discusses the application of AI systems in healthcare by integrating diverse data types, such as images, text, audio, and video, for clinical decision-making. The potential of combining text-based LLMs with image-based classifiers to achieve comprehensive risk assessments in functional safety contexts is suggested, as multimodal AI can leverage both textual and visual analyses to provide holistic safety evaluations.
The role of expert systems in Industry 4.0 is examined in [39], detailing their applications in fault diagnosis, process optimization, predictive maintenance, and intelligent decision-making. The paper categorizes expert systems as rule-based, case-based, and neural network-based, highlighting their contributions to enhancing automation, efficiency, and intelligence in manufacturing.

2.4.3. Integration of Retrieval-Augmented Generation (RAG)

RAG approaches offer a way to enhance LLMs by combining the benefits of retrieval systems with text generation capabilities; the work in [10] provides a unified evaluation framework for RAG systems, assessing retrieval accuracy and generation quality to improve LLM outputs. This methodology can address challenges like hallucination and improve the contextual grounding of generated responses. For safety-critical applications, RAG systems can ensure that LLMs provide more accurate and context-relevant information, which is crucial for functional safety risk assessments. Further, in the context of AI-generated content across different modalities, ref. [11] extends the discussion on RAG by exploring its applications beyond text, such as in code, image, and audio generation. The authors discuss the adaptability of RAG to various domains, emphasizing its potential in handling multimodal data. The work highlights how query-based and latent representation-based paradigms in RAG allow for dynamic, accurate content generation, which can be applied to multimodal safety assessments in functional safety contexts. Thus, these works [9,10,11] underline the utility of RAG frameworks in addressing issues of consistency, relevance, and scalability in functional safety risk assessments. By retrieving domain-specific data, RAG can enhance LLM responses, offering a path forward for improved reliability and robustness in safety-critical AI applications.
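As an illustration of the query-based paradigm described above, the following minimal sketch shows how retrieved, domain-specific passages (e.g., excerpts of safety documentation) could be prepended to an LLM prompt. The retriever here is a toy keyword-overlap ranker, and `llm_complete` is a hypothetical stand-in for an LLM client; neither is part of the workflow evaluated in this paper.

```python
# Conceptual sketch of query-based Retrieval-Augmented Generation (RAG).
# `llm_complete` is a hypothetical LLM client; the retriever is a toy
# keyword-overlap ranker standing in for a real vector-search index.

def retrieve_passages(query: str, corpus: list[str], top_k: int = 3) -> list[str]:
    """Rank corpus passages by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(query_terms & set(doc.lower().split())))
    return ranked[:top_k]

def rag_answer(query: str, corpus: list[str], llm_complete) -> str:
    """Ground the LLM response in retrieved, domain-specific context."""
    context = "\n".join(retrieve_passages(query, corpus))
    prompt = (f"Answer the question using only the context below.\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return llm_complete(prompt)
```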

2.4.4. Summary

In summary, the integration of HITL methodologies, multimodal AI systems, and risk assessment frameworks represents an important advancement in the use of AI for safety-critical domains, particularly within the context of Industry 4.0 and the evolution toward Industry 5.0. HITL approaches have proven effective in enhancing AI systems’ reliability, transparency, and performance by incorporating human feedback and oversight. Similarly, multimodal AI systems combining text-based LLMs and image-based classifiers provide a comprehensive means of assessing both textual and visual information, which is crucial for thorough safety evaluations. However, a notable gap exists in the application of LLMs specifically for machinery functional safety risk assessments using an HITL approach. While current research offers foundational insights into HITL and multimodal methodologies, the direct application of LLMs for risk analysis in functional safety remains largely unexplored. This paper seeks to address this gap by establishing a workflow that leverages LLMs within an HITL framework to perform risk assessments as per the ISO 12100 standards. The approach aims to balance the benefits of AI with considerations for ethics, privacy, and accuracy, ensuring that robust human oversight is maintained to effectively harness AI capabilities without compromising safety.

3. Limitations and Opportunities in Risk Analysis with AI Integration

In the context of machinery functional safety, effective risk analysis is essential for identifying potential hazards and ensuring compliance with stringent safety standards, such as ISO 12100. However, the traditional tools and methodologies employed in risk analysis present a number of limitations that hinder their efficiency and adaptability in rapidly evolving industrial environments. Additionally, while emerging AI technologies, particularly LLMs, offer promising enhancements to these processes, they introduce their own set of challenges that must be carefully addressed. This section outlines the key limitations of traditional risk analysis tools, the potential of LLMs, and the necessity of integrating human oversight to overcome these challenges and enhance the reliability of safety assessments.

3.1. Limitations and Requirements of Traditional Risk Analysis Tools

Traditional risk analysis tools, while foundational, exhibit several limitations:
Training and Usability: These tools often require extensive training for users to achieve proficiency, introducing delays and increasing the overall cost of safety assessments.
Licensing Costs: Many of these tools (e.g., WEKA CE (https://www.weka-manager-ce.de/english-version/, accessed on 31 January 2025)) are not open-source and involve substantial licensing fees, making them cost-prohibitive for smaller enterprises or those requiring scalable solutions across multiple sites.
Software Limitations: Being closed-source, these tools offer limited adaptability and customization, restricting users from modifying the software to better fit specific needs or seamlessly integrate with other systems.
Despite these drawbacks, tools like WEKA CE provide structured, compliant methodologies for risk analysis that are crucial for machinery functional safety. However, the advent of AI technologies, particularly LLMs, presents new opportunities and challenges that could potentially enhance these traditional methods but require careful consideration.

3.2. Emerging AI Technologies and LLMs

The integration of AI technologies, particularly LLMs, into risk analysis introduces both potential benefits and significant challenges:
Advantages: LLMs can process vast amounts of unstructured data swiftly, offering insights and identifying potential hazards with a speed and depth unattainable by traditional methods. This capability allows for more proactive and comprehensive risk management, potentially transforming how risks are identified and evaluated.
Challenges: However, the application of LLMs in risk analysis is fraught with challenges. LLMs can generate misleading information (hallucinations), harbor biases from their training data, and often lack the domain-specific understanding necessary for accurate hazard assessment. These limitations pose risks in safety-critical applications where precision and reliability are paramount.

3.3. Challenges and Opportunities with LLMs in Risk Analysis

The integration of LLMs in risk analysis presents both opportunities and challenges:
Enhanced Data Analysis Capabilities: LLMs can efficiently process and analyze large volumes of unstructured data, such as maintenance records and operator manuals, identifying potential hazards that might be overlooked by traditional tools.
Predictive Insights: The pattern recognition capabilities of LLMs can be leveraged to predict potential failure modes and hazards before they manifest, enabling more proactive risk management strategies.
However, there are inherent limitations in using LLMs, particularly for safety-critical applications such as risk analysis:
Fact Hallucination: LLMs may generate plausible yet inaccurate or misleading information (https://allanmlees59.medium.com/artificial-intelligence-and-clever-hans-0d136951024c, accessed on 31 January 2025) [6], which poses significant risks in contexts where factual accuracy is critical.
Bias in Training Data: Biases inherent in the training data can lead to skewed risk assessments, potentially misclassifying hazards or failing to recognize them altogether.
Overgeneralization and Lack of Domain Expertise: LLMs might produce overly generic responses or fail to grasp the complexities of specific industrial contexts, leading to inadequate hazard assessments.
Lack of Domain-Specific Knowledge: LLMs often lack the nuanced understanding required to accurately interpret complex safety scenarios, which can result in critical oversights.
Opacity of Reasoning Processes: The “black box” nature of LLMs obscures the decision-making process, which is a significant barrier in safety-critical settings where understanding the reasoning behind assessments is essential.
Dependency on Prompt Engineering: The effectiveness of LLMs heavily depends on the design of prompts; poorly structured prompts can lead to inadequate or irrelevant responses.
Error Propagation: Errors in initial data input or algorithmic faults in LLMs can propagate through the assessment process, compounding inaccuracies and leading to flawed conclusions.
Handling of Novelty and Edge Cases: LLMs may struggle with scenarios that deviate from their training, including novel or unique hazard situations specific to machinery, resulting in incomplete or incorrect risk assessments.
Lack of Adaptability: Automated systems and LLMs often struggle to adapt to evolving safety standards and conditions, which can change rapidly in industrial environments.
Transparency and Explainability: Limited transparency and explainability in LLMs hinder trust and accountability, which are crucial in regulatory and compliance-driven contexts.

3.4. Necessity for HITL Systems

To fully realize the benefits of an HITL approach in risk analysis and address the limitations of LLMs, several critical roles must be fulfilled by human experts (https://www.marsh.com/en/services/cyber-risk/insights/human-in-the-loop-in-ai-risk-management-not-a-cure-all-approach.html, accessed on 31 January 2025):
Verification and Validation: Human experts review, modify, and validate AI-generated outputs to ensure they align with real-world conditions, regulatory requirements, and factual accuracy. This step is particularly crucial for mitigating issues such as fact hallucination, error propagation, and overgeneralization, where the LLM’s outputs need to be verified for correctness and relevance.
Incorporation of Domain-Specific Knowledge: Human oversight ensures that specialized knowledge is applied where LLMs fall short, particularly in the interpretation of complex safety standards and regulations. This helps overcome lack of domain-specific knowledge, handling of novelty and edge cases, and bias in training data, as human experts can provide context-specific insights that AI lacks.
Ethical Considerations: Human experts ensure that the outputs and decisions made by AI systems uphold ethical standards, especially in safety-critical environments where ethical implications may be significant. This aspect helps manage the risks associated with opacity of reasoning processes and ensures that ethical concerns are factored into decision-making, reducing potential biases.
Experience and Judgment: Human experience and nuanced judgment are crucial in complex or ambiguous situations that LLMs might not be capable of fully understanding. This mitigates the limitations of overgeneralization, lack of adaptability, and handling of edge cases, where the machine’s rigid interpretation of data needs to be refined by human reasoning.
Enhanced Reliability and Trust: Human involvement in the AI decision-making process ensures transparency, builds trust, and maintains accountability, which are essential in regulated industries. This addresses the issues related to transparency and explainability by making sure that the rationale behind AI-generated outputs can be explained and justified by experts, particularly in safety-critical contexts.
Prompt Design and Refinement: Human experts play a key role in designing and refining prompts for LLMs, ensuring that the AI provides relevant and focused outputs. This helps address the dependency on prompt engineering, making sure that the AI’s responses are well targeted and appropriate for the specific risk analysis task.

3.5. Research Direction and Novel Contributions

This research addresses limitations in traditional risk analysis tools by proposing an HITL framework that integrates LLMs’ capabilities with expert oversight, enhancing safety assessments’ reliability, efficiency, and effectiveness in machinery functional safety. The framework mitigates LLMs’ risks, such as bias and hallucinations, through human verification and validation, ensuring compliance with stringent safety standards, such as ISO 12100. Future research should focus on refining LLMs for domain-specific challenges, improving AI transparency, and advancing human-AI collaboration in safety-critical applications.

4. Systematic Study Setup

This section outlines the systematic study setup for evaluating the workflow proposed in this paper for integrating LLMs with the HITL methodology in machinery functional safety risk analysis. Section 4.1 details the workflow, including the steps for integrating LLMs with human oversight, such as scope definition, LLM interaction, hazard identification, risk estimation, and expert validation. Section 4.2 introduces the selected case studies—closing edge protection devices, autonomous transport vehicles, weaving machines, and rotary printing presses—to assess LLM performance against established ground truth data. Finally, Section 4.3 discusses the evaluation methodology, focusing on accuracy, completeness, usability, and time efficiency, and explains the use of expert evaluations and Likert scale ratings to assess the effectiveness of the LLM-HITL integration.

4.1. Workflow

In this section, the systematic study setup using LLMs with HITL for risk analysis of machinery functional safety according to the ISO 12100 standard is described in detail. This study aims to evaluate the effectiveness of integrating LLMs with human oversight (HITL) to enhance the risk analysis process in machinery functional safety. Using web-based prompts to interact with ChatGPT, the study demonstrates how the HITL approach can address LLM limitations, like hallucination, bias, and overgeneralization, ensuring reliable safety assessments in line with the ISO 12100 and ISO 13849 standards. The focus is on key risk analysis tasks, including hazard identification and risk estimation (cf. Figure 1). The workflow outlined in Figure 3 provides a step-by-step process, starting from defining the scope and objectives to final expert validation. The boxes outlined in blue in Figure 3 represent the steps in the workflow that involve only a functional safety engineering human expert (i.e., steps 1, 2, 6, and 7). For instance, this could involve either preparation to interact with LLMs, such as creating prompts, or validating LLM output for the hazard identification and risk estimation tasks for the various use cases considered. The boxes outlined in black (i.e., steps 3, 4, and 5) represent the steps that involve direct interaction with LLMs, such as entering prompts and data collection.
This study follows a systematic workflow designed to integrate LLMs with an HITL approach for conducting risk analysis in machinery functional safety as per the ISO 12100 standard. The workflow, as illustrated in Figure 3, consists of seven key steps which are described below.
Define the Scope and Objectives: Establish clear goals and parameters for the use of LLMs in risk analysis, focusing on specific machinery use cases.
Preparation for LLM Interaction: Compile operational data, safety manuals, and previous incident reports to craft effective, comprehensive one-shot prompts for LLM interaction. This step is crucial for setting up the context in which the LLMs will operate, ensuring that the data provided to the LLM are relevant and comprehensive. Thus, in this step, structured web-based prompts are developed and refined to optimize LLM outputs (e.g., for hazard identification and risk estimation tasks); an illustrative sketch of this interaction is shown after this list.
Utilizing LLMs for Hazard Identification: Use prompt engineering to interact with the LLM (e.g., employing prompts designed in step 2 above), guiding it to identify potential hazards associated with the machinery during its intended and foreseeable misuse.
Utilizing LLMs for Risk Estimation: Apply prompt engineering to obtain LLM-generated estimates of the likelihood and potential impact of identified hazards. This step leverages the LLM’s ability to process large datasets to generate risk estimations, which are then reviewed by human experts.
Interaction and Data Collection: Collect and record LLM outputs, ensuring all relevant information is captured for analysis. The interaction between the LLM and the experts is iterative, allowing for continuous refinement of the data and outputs.
Analysis of LLM Outputs with HITL: Review LLM outputs with safety experts to validate findings and integrate expert knowledge into the risk assessment. Human experts will verify the accuracy of the outputs, correct any inaccuracies, and ensure that the results meet real-world conditions and regulatory requirements. This step also includes the incorporation of domain-specific knowledge and ethical considerations. In this step, as the outputs of LLMs are reviewed by safety experts (i.e., human oversight), there can be various outcomes, as listed below:
- HITL identified fundamental issues with data preparation and prompt crafting (Feedback Loop to Step 2): After reviewing the LLM outputs, safety experts may discover that the data provided to the LLM were incomplete, incorrectly formatted, or lacked sufficient detail. This could result in errors or gaps in the hazard identification or risk estimation processes. The experts will recommend refining the prompts or revisiting the data preparation process in Step 2: Preparation for LLM Interaction. This may involve compiling more relevant operational data, incident reports, or revising safety manuals to create more effective one-shot prompts for the LLM. Improving the input data and prompt structure leads to more accurate and context-specific LLM-generated outputs in subsequent iterations.
- HITL identified incomplete/inaccurate hazards. Refine hazard identification process (Feedback Loop to Step 3): In this case, the safety experts identify that the hazards identified by the LLM are incomplete or inaccurate. The LLM may have overlooked critical hazard scenarios or incorrectly assessed potential dangers. This feedback triggers a refinement of the hazard identification process in Step 3: Utilizing LLMs for Hazard Identification. The experts may modify the prompts, add more precise operational context, or introduce additional scenarios that the LLM should consider during hazard identification. This refinement improves the LLM’s ability to accurately identify all relevant hazards, ensuring that critical risks are not missed in subsequent iterations of the analysis.
- HITL identified variations in risk estimations. Refine risk estimation process (Feedback Loop to Step 4): The experts notice discrepancies or variations in the LLM’s risk estimations, such as underestimating the severity of harm or misjudging the probability of avoiding a hazard. These variations could lead to an inadequate evaluation of the risks. This requires refining the risk estimation process in Step 4: Utilizing LLMs for Risk Estimation. The experts might adjust the parameters, such as severity, frequency, or avoidance probability, or guide the LLM with more accurate risk scenarios. By refining the risk estimation process, the LLM will provide more accurate risk assessments in future iterations, aligning with real-world operational conditions and regulatory standards.
- Summary of Iterative Validation Process:
1. Experts begin with a detailed review of the initial LLM outputs, assessing them against the ground truth established. For instance, in the case studies used in our study, this ground truth corresponds to the performance levels specified in Annex A of the IFA report for each of the four case studies (see Section 4.2).
2. Each discrepancy noted is documented, and specific feedback is provided to refine the LLM prompts or adjust the underlying algorithms, ensuring enhanced alignment with the ground truth.
3. The LLM prompts are iterated again, incorporating this expert feedback and refining the outputs through multiple iterations. This cycle continues until the LLM outputs perfectly align with the ground truth.
4. Achieving a “100% match with ground truth” through this process signifies that the outputs are thoroughly validated.
Final Expert Validation: Conduct a final thorough validation of LLM outputs by safety experts to ensure alignment with real-world standards and regulatory compliance. The validated outputs are then prepared for integration into the overall risk analysis documentation, with the entire process being logged for compliance and traceability.
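The listing below is a minimal, illustrative sketch of how Steps 2–6 of this workflow could be scripted. The prompt wording is only an example (the study itself used manually crafted, web-based ChatGPT prompts), and `llm_complete` and `expert_review` are hypothetical stand-ins for the LLM client and the functional safety expert, respectively.

```python
# Illustrative sketch of the HITL loop (Steps 2-6 in Figure 3); not the exact
# prompts or tooling used in this study. `llm_complete(prompt)` is a hypothetical
# LLM client; `expert_review(output)` stands in for the functional safety expert
# and returns (accepted, feedback).

HAZARD_PROMPT_TEMPLATE = """You are assisting an ISO 12100 risk analysis.
Machine and limits of use: {machine_description}
Task: list hazards during intended use and reasonably foreseeable misuse.
For each hazard, propose S (S1/S2), F (F1/F2), P (P1/P2) and the resulting PLr,
with a one-sentence justification per parameter."""


def hitl_risk_analysis(machine_description, llm_complete, expert_review, max_iterations=5):
    """Iterate LLM output and expert feedback until the expert accepts the result."""
    prompt = HAZARD_PROMPT_TEMPLATE.format(machine_description=machine_description)
    output = llm_complete(prompt)                    # Steps 3-5: LLM interaction and data collection
    for _ in range(max_iterations):
        accepted, feedback = expert_review(output)   # Step 6: analysis of LLM outputs with HITL
        if accepted:
            return output                            # Step 7: final expert validation follows
        # Feedback loops to Steps 2-4: refine the prompt with the expert's corrections
        prompt = f"{prompt}\n\nExpert feedback to incorporate:\n{feedback}"
        output = llm_complete(prompt)
    raise RuntimeError("No expert-approved result within the iteration budget")
```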
The workflow integrates both global and modular approaches: Steps 1–2 establish foundational parameters globally, Steps 3–4 apply modularly for hazard identification and risk estimation, and Steps 5–7 refine outputs iteratively with expert feedback. This dual approach ensures targeted refinement while maintaining compliance with ISO 12100 standards for comprehensive risk analysis. This methodology ensures that each step in the risk analysis process is carefully managed, with LLM outputs being continuously refined and validated through expert oversight, thereby working towards a reliable and compliant safety assessment.
Please note that, in this study, ’ground truth’ refers specifically to the performance levels as defined in Annex A of the IFA report [8], which are already validated by industry experts in functional safety. These standards provide a benchmark for evaluating the LLM outputs. The term ’100% alignment with the ground truth’ in two iterations highlights the efficiency of the workflow in aligning LLM outputs with this ground truth for the four specific use cases in this study. Each iteration in the workflow involves an intensive review by industry experts who assess and refine the outputs based on the discrepancies identified with the ground truth. This process underscores not only the convergence of LLM outputs with expert expectations but also the robustness of the expert feedback mechanism. The following case studies illustrate the practical application of our refined workflow, demonstrating its effectiveness across diverse industrial scenarios.

4.2. Case Studies Selected for Analysis

In this study, four distinct case studies have been selected to assess the effectiveness of integrating LLMs with HITL methodology in the context of machinery functional safety. These case studies, encompassing Closing Edge Protection Devices, Autonomous Transport Vehicles, Weaving Machines, and Rotary Printing Presses, were chosen for their relevance to a wide range of industrial applications and the complexity of their associated hazard scenarios. Each case study presents unique challenges in hazard identification and risk estimation, providing a robust framework for evaluating the performance and reliability of LLMs when combined with expert human oversight.

4.2.1. Closing Edge Protection Devices

Figure 4 shows a schematic representation of a motorized gate equipped with closing edge protection devices. The diagram illustrates the critical areas where crushing and shearing hazards may occur, particularly as the gate moves toward its final closing position. The closing edge protection device is designed to detect obstacles and halt the movement to prevent injury.
These devices are essential for preventing crushing and shearing injuries associated with the operation of powered windows, doors, and gates. As illustrated in Figure 4 (available in Annex-A in IFA Report 2/2017e [8]), these hazard zones typically arise when the moving wing approaches its final positions, where the formation of crushing and shearing points becomes a significant risk. Injury to individuals in such hazard zones can be severe, potentially leading to fatal outcomes, especially when proper safety measures are not in place.
Closing edge protection devices, such as pressure-sensitive edges, are fitted to the closing edges of the moving wings. Upon detecting an obstacle, these devices immediately halt the closing movement and initiate a reverse action, effectively mitigating the risk of injury. The safety function therefore consists of stopping the closing movement and reversing it upon detection of an obstacle. In this specific use case, the severity of injury (S2) is considered high due to the potential for serious harm, while the frequency of exposure (F1) is relatively low, with persons only briefly present in the hazard zone.
Under normal circumstances, individuals at risk are able to move out of the hazard zone, yielding a required Performance Level (PLr) of ‘c’. This result is consistent with the ground truth provided in the IFA report [8], which assigns a PLr of ‘c’ for this scenario, confirmed by the EN 12453 product standard [40]. The ground truth serves as the benchmark against which the results generated by the LLM, using the workflow discussed in the previous section, will be evaluated. This use case is indicative of a wide range of industrial and commercial settings, making it a critical example for understanding the capabilities of LLMs in detecting and mitigating proximity-based hazards within risk analysis processes.

4.2.2. Autonomous Transport Vehicles

Figure 5 (available in Annex-A in IFA Report 2/2017e [8]) shows an image of an autonomous guided vehicle in an industrial setting. This vehicle is equipped with collision protection mechanisms that ensure safety by stopping the vehicle upon detecting an obstacle. The image illustrates the potential hazard zones where human interaction occurs, emphasizing the importance of safety functions to prevent severe injuries in shared workspaces. The figure is adapted from Annex-A in the IFA report [8].
These vehicles present significant challenges in managing dynamic interactions between machinery and humans within shared environments. In particular, the safety function for stopping the vehicle upon detecting an obstacle is crucial in preventing collisions. Given that these vehicles may carry heavy loads and operate in areas accessible to pedestrians, the potential for severe injury (S2) is high. The frequency of human presence in the vehicle’s path (F2) further underscores the importance of accurate and timely hazard identification. This use case is vital for evaluating the effectiveness of LLMs in real-time hazard detection and risk assessment, particularly in scenarios requiring precise navigation and obstacle avoidance. The ground truth for this scenario, confirmed by the EN 1525 standard [41], yields a PLr of d [8], which will be compared against the results generated by the LLM using the discussed workflow.

4.2.3. Weaving Machines

Figure 6 (available in Annex-A in IFA Report 2/2017e [8]) shows the diagram of a weaving machine, highlighting critical components, such as the reed, temple, and light beam. This illustration shows the hazard zone where the risk of crushing exists between the reed and temple during the machine’s operation.
Weaving machines, essential in the textile industry for the automatic weaving of textiles, present significant hazards, particularly the risk of crushing between the reed and the temple. To mitigate these risks, the safety function SF1 prevents unexpected start-up by using Safe Torque Off (STO) during operator intervention in the hazard zone. In cases where the machine restarts unexpectedly, severe injuries such as crushed or broken fingers (S2) can occur, especially since evasion is nearly impossible due to the rapid movement of the machinery (P2). Although the frequency of exposure to this hazard is low (F1), the potential severity justifies a required Performance Level (PLr) of d, as confirmed by the EN ISO 11111-6 standard [8,42]. This case study explores the LLMs’ effectiveness in identifying these specific operational hazards inherent in continuous manufacturing processes.

4.2.4. Rotary Printing Presses

Figure 7 (available in Annex-A in IFA Report 2/2017e [8]) illustrates a rotary printing press, showing the critical hazard zones around the counter-rotating cylinders, where entrapment and crushing risks are significant during maintenance activities. Such presses, widely used in the printing industry, operate at high speeds with rotating cylinders that present significant entrapment and crushing hazards. The essential hazards occur at the entrapment points of the counter-rotating cylinders. This use case examines a scenario where maintenance work on the printing press requires manual intervention while the machine operates at reduced speeds. The primary safety functions implemented include (as per [8]):
SF1: Opening the guard door during operation triggers the braking system, bringing the cylinders to a complete stop.
SF2: When the guard door is open, any machine movements are restricted to limited speeds.
SF3: With the guard door open, movements are only possible while an inching button is pressed.
Given the severe injury risk (S2) associated with entrapment between cylinders and the low frequency of exposure during maintenance (F1), the possibility of avoiding the hazard at production speeds is minimal (P2). Consequently, a Performance Level required (PLr) of d is necessary for safety functions SF1 and SF2. However, SF3, which is active only after the press has been halted and speeds limited by SF1 and SF2, results in more predictable machine movements, allowing the operator to evade hazards (P1). Therefore, a PLr of c is adequate for SF3.
This assessment is consistent with the requirements of the EN 1010-1:2010 [43] product standard [8]. The risk analysis conducted here emphasizes the capabilities of LLMs in managing high-risk scenarios and complex safety functions, providing a critical evaluation against the ground truth established by recognized safety standards.
The selected case studies provide essential contexts for evaluating the integration of LLMs with HITL methodology in real-world industrial environments. By examining the performance of LLMs in identifying hazards and estimating risks across these diverse and complex scenarios, the study aims to demonstrate how human oversight can enhance the accuracy, reliability, and safety of AI-driven risk analysis processes. The insights gained from these case studies will not only validate the effectiveness of the proposed methodology but also underscore the potential of LLMs to augment traditional risk analysis practices, thereby contributing to the advancement of safety standards in industrial settings.
Please note that the ground truth of the required performance level values for the case studies mentioned above from [8] represents the scenario after risk reduction measures have been implemented. It assesses the hazards with the assumption that the safety functions (e.g., SF1, SF2, and SF3) are in place, thereby reflecting the residual risk and the required performance levels for these safety functions to be effective.
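For reference, the PLr values quoted in these case studies follow the risk graph of ISO 13849-1 (Annex A), which maps severity (S1/S2), frequency or duration of exposure (F1/F2), and possibility of avoidance (P1/P2) to a required performance level. The sketch below encodes this mapping and checks it against the case-study parameters; note that the P1 value for the autonomous transport vehicle is inferred from the reported PLr of d rather than stated explicitly in Section 4.2.2.

```python
# Risk graph of ISO 13849-1, Annex A: (severity, frequency, avoidance) -> PLr.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a", ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b", ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c", ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d", ("S2", "F2", "P2"): "e",
}

def required_pl(severity: str, frequency: str, avoidance: str) -> str:
    """Return the required performance level (PLr) for one hazard."""
    return RISK_GRAPH[(severity, frequency, avoidance)]

# Ground-truth PLr values of the four case studies (Section 4.2):
assert required_pl("S2", "F1", "P1") == "c"  # motorized gate: persons can normally move clear
assert required_pl("S2", "F1", "P2") == "d"  # weaving machine SF1: evasion nearly impossible
assert required_pl("S2", "F2", "P1") == "d"  # autonomous transport vehicle (P1 inferred from the reported PLr)
assert required_pl("S2", "F1", "P2") == "d"  # rotary press SF1/SF2; with P1 (as for SF3) the result is "c"
```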

4.3. Evaluation Methodology

This section outlines the systematic approach used to evaluate the effectiveness of integrating LLMs with human oversight in risk analysis. The evaluation focuses on key criteria, including accuracy, completeness, usability, and time efficiency, as assessed by a panel of functional safety experts.

4.3.1. Expert Panel Setup and Ground Truth Comparison

An expert panel comprising professionals in functional safety engineering (e.g., from innotec GmbH (https://innotecsafety.com/, accessed on 31 January 2025)) is established to review and evaluate the outputs generated by the LLMs, both pre- and post-HITL refinement. A key component of this evaluation involves comparing the LLM-generated outputs against established ground truth data, as described in Section 4.2. This comparison will determine the accuracy and reliability of the LLMs in identifying hazards and estimating risks.
However, given the inherent complexities in defining ground truth in safety-critical domains and the potential discrepancies among expert opinions, advanced techniques such as Retrieval-Augmented Generation (RAG) can play a pivotal role. By retrieving domain-relevant information in real-time, RAG approaches can enrich the LLM’s context, potentially improving the consistency and alignment of outputs with expert knowledge. Furthermore, incorporating XAI methods can enhance the transparency of the LLM’s reasoning processes, making it easier for experts to validate outputs and trust the system. These techniques could complement the current evaluation framework, addressing challenges and improving the applicability of LLMs in functional safety risk analysis.
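As a rough illustration of how such retrieval could be wired into the prompting step, the sketch below prepends the most relevant stored clause summaries to a task prompt before it is sent to the LLM. The corpus entries and the keyword-overlap scoring are hypothetical stand-ins; a production RAG pipeline would typically use embedding-based retrieval over a curated collection of standards excerpts.

```python
# Hypothetical mini-corpus of clause summaries (placeholders, not actual standard text).
CORPUS = {
    "ISO 12100 hazard identification": "Identify hazards over the whole machine life cycle ...",
    "ISO 13849-1 risk graph": "Determine PLr from severity, frequency/exposure, and avoidance ...",
    "EN 12453 gate safety": "Safety requirements for power-operated doors and gates ...",
}

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; a production system would use embeddings."""
    query_words = set(query.lower().split())
    scored = sorted(
        CORPUS.items(),
        key=lambda item: len(query_words & set((item[0] + " " + item[1]).lower().split())),
        reverse=True,
    )
    return [f"{title}: {text}" for title, text in scored[:top_k]]

def augment_prompt(task_prompt: str) -> str:
    """Prepend retrieved context to the task prompt (retrieval-augmented prompting)."""
    context = "\n".join(retrieve(task_prompt))
    return f"Context:\n{context}\n\nTask:\n{task_prompt}"
```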

4.3.2. Key Evaluation Criteria

The evaluation will focus on four main criteria:
Accuracy: Experts will assess how closely the LLM-generated outputs match the ground truth, particularly in identifying hazards and determining the appropriate Performance Level (PLr).
Completeness: This criterion will measure how thoroughly the LLMs identify all potential hazards in the given scenarios. Completeness will be evaluated by comparing the number of hazards identified by the LLMs against a comprehensive list provided by the experts.
Usability: Experts will evaluate the practical usability of the LLM outputs in real-world risk analysis settings, including how easily the results can be interpreted and applied.
Time Efficiency: This criterion will assess the time taken by LLMs to generate hazard identification and risk estimation outputs compared to the time required for experts to perform the same tasks manually.
The evaluation for each criterion will be categorized as low, medium, or high. This approach provides a clear and concise assessment that is easily interpretable, avoiding the complexity and potential ambiguity associated with more granular scales (e.g., percentage-based scales). The categorical ratings are designed to simplify the comparison and interpretation of results, ensuring that the key insights are effectively communicated. Additionally, the structured metrics offered by RAG-based evaluation frameworks, such as RAGAS, could be explored in future work to provide a more standardized approach to evaluating LLM outputs. RAGAS offers comprehensive metrics, including precision, recall, and relevance, specifically tailored for retrieval and generation tasks. Incorporating these metrics could offer a more systematic way of assessing the accuracy and completeness of LLMs within safety-critical contexts.

4.3.3. Likert Scale Setup

A structured Likert scale will be employed to quantify expert judgments across key evaluation criteria. The scale will range from 1 (Strongly Disagree) to 5 (Strongly Agree), allowing experts to provide nuanced ratings on various aspects of the LLM’s performance. The criteria evaluated will include the accuracy of the LLM’s hazard identification and risk estimation compared to the ground truth, the completeness of the LLM in covering all relevant hazards, and the usability of the LLM outputs in practical, real-world scenarios. These ratings will be analyzed to draw conclusions about the effectiveness and trustworthiness of the LLM’s outputs. Experts will rate each LLM output on these criteria, providing both a numerical score and qualitative feedback where necessary.
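A small sketch of how such panel ratings could be aggregated is given below; the scores shown are placeholder values for illustration, not the ratings collected in this study.

```python
from statistics import mean

# One list of 1-5 Likert ratings per criterion, one entry per expert (placeholder values).
panel_ratings = {
    "accuracy":        [5, 4, 5],
    "completeness":    [5, 5, 4],
    "usability":       [5, 5, 5],
    "time_efficiency": [5, 5, 5],
}

summary = {criterion: round(mean(scores), 2) for criterion, scores in panel_ratings.items()}
print(summary)  # e.g., {'accuracy': 4.67, 'completeness': 4.67, 'usability': 5, 'time_efficiency': 5}
```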

5. Results and Discussion

This section presents a comprehensive analysis of the results obtained from applying the proposed HITL methodology with LLMs to the four distinct case studies outlined in Section 4.2: motorized gates with closing edge protection devices, autonomous transport vehicles, weaving machines, and rotary printing presses. The analysis evaluates the LLM’s performance in hazard identification, risk estimation, and safety function recommendation, with a focus on alignment with industry standards and expert validations. The discussion will highlight the effectiveness of the HITL approach in refining LLM outputs, ensuring accuracy, completeness, usability, and time efficiency across all case studies.
The core of this paper aims to demonstrate the usability and feasibility of integrating LLMs into the daily workflow of functional safety consultants, especially for tasks like risk analysis. To reflect real-world accessibility, the main evaluation was conducted using the free, signed-in version of ChatGPT, powered by GPT-4-Turbo (https://openai.com/index/gpt-4-research/, accessed on 31 January 2025), which is expected to be available without additional costs to regular functional safety consultants (e.g., working on consulting projects). GPT-4-Turbo excels in tasks such as text generation and coding assistance, but like any LLM, it has inherent limitations, including a fixed knowledge cutoff (October 2023) and the potential for inaccuracies [5,44]. Additionally, a series of experiments was conducted using the same prompts across various ChatGPT models, revealing slight variations in the performance levels generated by the LLMs. Nevertheless, due to the HITL workflow employed, these output variations can still be aligned effectively with the ground truth. However, the primary focus of this paper is on demonstrating how an unpaid, typically accessible version of GPT can be effectively used in day-to-day functional safety activities, especially when combined with human oversight for refining LLM outputs, ensuring usability and time efficiency across all case studies.

5.1. Case Study: Closing Edge Protection Devices on Motorized Gates

This section analyzes the results from applying the proposed methodology in Section 4.1 to the case study of motorized gates equipped with closing edge protection devices, utilizing the systematic methodology steps 1–7. The focus is on evaluating the LLM’s performance in identifying and assessing the associated hazards, with subsequent human oversight refinement.

5.1.1. Methodology Steps Applied

1
Define the Scope and Objectives: The primary objective was to identify and mitigate hazards associated with motorized gates, particularly focusing on crushing and shearing injuries during gate operation, maintenance, and malfunction scenarios.
2
Preparation for LLM Interaction: Relevant data, including safety standards (ISO 12100 and EN 12453) and operational protocols, were compiled to develop targeted prompts for the LLM, ensuring the context was well established for accurate hazard identification.
3
Utilizing LLMs for Hazard Identification: The initial prompt was: “Identify all potential hazards associated with the operation and maintenance of motorized gates equipped with closing edge protection devices. Consider hazards that may arise during regular operation, maintenance activities, and in the event of malfunction or failure. Specifically, focus on the risks of crushing and shearing injuries as the gate approaches its final closing position. Provide a risk estimation for each identified hazard, considering factors such as the severity of potential injuries, the frequency of exposure to the hazard, and the likelihood of avoiding the hazard. Finally, suggest the appropriate safety functions that could be implemented, such as the use of pressure-sensitive edges that halt the closing movement upon detecting an obstacle, and estimate the required Performance Level (PLr) according to ISO 12100 and EN12453 standards”.
4
Utilizing LLMs for Risk Estimation: The LLM provided risk estimations for the identified hazards, suggesting safety functions, such as pressure-sensitive edges and obstruction detection systems, with corresponding PLr levels.
5
Interaction and Data Collection: The initial outputs from the LLM, including the identified hazards, risk estimations, and suggested safety functions, were collected and prepared for further analysis. The results were then sent for review by safety experts in the next step. A sketch of how this prompt-based interaction could be scripted is shown after this list.
6
Analysis of LLM Outputs with HITL: Upon review, human experts identified discrepancies in the initial PLr suggested by the LLM for certain hazards. To address these issues, a second prompt was issued to reassess the risks after implementing specific risk reduction measures: “Evaluate the impact of the following risk reduction measures—pressure-sensitive edges and obstruction detection systems—on the previously identified hazards. Reassess the residual risks, taking into account the effectiveness of these safety functions. Provide a summary of how the Performance Levels (PLr) are affected post-risk reduction”. This allowed for a more detailed assessment of how effectively the risk reduction measures mitigated the identified hazards. The LLM, guided by expert oversight, introduced risk reduction measures, such as stopping the closing movement and reversing upon detecting an obstacle, and reassessed the risks. The PLr for crushing hazards was adjusted to c, aligning with the ground truth, while the PLr for shearing hazards remained consistent at c.
7
Final Expert Validation: The final validation by experts confirmed that the recommended safety functions and PLr levels were consistent with the requirements for preventing severe injuries. The analysis resulted in a recommendation of PLr c for crushing hazards and PLr c for shearing hazards, in line with ISO 12100 and EN 12453 standards.
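The evaluation in this study was carried out interactively in the ChatGPT interface. Purely as an illustration of how the same two-prompt exchange could be scripted, a minimal sketch using the OpenAI Python client is shown below; the model identifier, system message, and abbreviated prompt texts are assumptions made for the example and were not part of the study setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FIRST_PROMPT = ("Identify all potential hazards associated with the operation and maintenance "
                "of motorized gates equipped with closing edge protection devices. ...")  # full text in step 3 above
SECOND_PROMPT = ("Evaluate the impact of the following risk reduction measures ...")      # full text in step 6 above

messages = [
    {"role": "system", "content": "You are a machinery functional safety analyst working to ISO 12100."},
    {"role": "user", "content": FIRST_PROMPT},
]
first = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Iterative HITL prompt, issued only after expert review of the first output (step 6).
messages.append({"role": "user", "content": SECOND_PROMPT})
second = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
print(second.choices[0].message.content)
```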

5.1.2. Identified Hazard

Case Study Ground Truth: The primary hazards identified in [8] were crushing and shearing injuries associated with the operation of powered windows, doors, and gates, particularly when the moving wing approaches its final positions. The report assigns a Performance Level required (PLr) of c.
LLM Analysis: The LLM successfully identified both crushing and shearing hazards. However, it initially recommended a PLr of d for the crushing hazard, which is higher than the ground truth, and a PLr of c for the shearing hazard, which aligns with the ground truth.

5.1.3. Safety Functions

Case Study Ground Truth: The ground truth safety function involves the stopping of the closing movement and reversing upon detection of an obstacle.
LLM Analysis: The LLM recommended the use of pressure-sensitive edges, which align with the safety function described in [8]. Additionally, the LLM suggested the incorporation of advanced obstruction detection systems, which could enhance safety but were not explicitly mentioned in the ground truth.

5.1.4. Performance Level Required (PLr)

Case Study Ground Truth: As per [8], a PLr of c is assigned for the identified hazards.
LLM Analysis (Pre- and Post-Risk Reduction): The LLM initially assigned a PLr of d for the crushing hazard, which was higher than the ground truth, likely due to the perceived severity of the hazard. After human oversight and issuing a second prompt as per step 6 in Section 5.1.1, the LLM adjusted the PLr for crushing hazards to c, aligning with the ground truth. The PLr for shearing hazards remained consistent at c.
For instance, across the three hazards assessed in the first iteration, the initial LLM outputs showed an average deviation of 0.67 PL levels from the ground truth, calculated as:
$$\text{Average Deviation} = \frac{\sum \left| \mathrm{PL}_r^{\mathrm{LLM}} - \mathrm{PL}_r^{\mathrm{GT}} \right|}{\text{Total Hazards}} = \frac{2}{3} \approx 0.67$$
Following expert-guided refinements through the HITL process, the deviation was reduced to 0 PL levels, resulting in 100% alignment with the ground truth. The initial accuracy in assigning correct PLr values was 50% when all results are considered; looking at the individual hazards, only one of the three (shearing) was assigned the appropriate PLr in the first iteration, corresponding to an initial accuracy of 33%. This demonstrates the effectiveness of the iterative expert oversight approach in ensuring compliance with safety standards while maintaining scientific rigor in risk quantification.

5.1.5. Evaluation Based on Criteria

Accuracy: The initial accuracy in assigning the correct PLr was 50% across all results, and 33% at the level of individual hazards (only one of the three, shearing, was correct). Following expert-guided refinements, the LLM’s accuracy improved to 100%, reflecting a significant enhancement in aligning with the ground truth.
Completeness: The LLM identified all key hazards—crushing and shearing—with no significant hazards overlooked. Expert oversight ensured that all potential risks were addressed, confirming the completeness of the analysis. Quantitatively, the coverage of identified hazards reached 100 % .
Usability: The suggested safety functions, including pressure-sensitive edges and obstruction detection systems, were both practical and aligned with ISO 12100 and EN 12453 standards. The addition of advanced obstruction detection could enhance safety, reflecting the LLM’s ability to suggest realistic, implementable solutions. The usability measure was 90 % , considering the feasibility and industry applicability of the recommended functions.
Time Efficiency: The LLM reduced the hazard identification and risk estimation time by 50–70%, compared to traditional manual methods, which highlights the potential of LLMs in streamlining risk analysis workflows. Time efficiency was quantitatively estimated as 30 min for LLM-based analysis versus 1 h for manual evaluation, considering an average level of expertise. This reduction in time was consistent across the safety functions evaluated in the case study.
Expert Validation: The expert review process led to the adjustment of the PLr for crushing hazards from d to c, aligning with the ground truth. This final validation step ensured that the recommendations adhered to recognized safety standards, demonstrating the critical role of HITL in confirming the correctness of LLM outputs. The validation accuracy was 100 % post-refinement.

5.2. Case Study: Weaving Machines

This section presents the results and discussion of the risk analysis conducted on the weaving machine case study, following the systematic methodology outlined in this paper. The methodology involves seven key steps, integrating LLM with HITL oversight to enhance the accuracy and reliability of risk assessments.

5.2.1. Methodology Steps

1
Define the Scope and Objectives: The scope was defined to assess the risks associated with the operation and maintenance of weaving machines in the textile industry, with a focus on hazards such as crushing injuries between the reed and temple during machine operation.
2
Preparation for LLM Interaction: Operational data, safety manuals, and incident reports were compiled to develop comprehensive prompts for the LLM interaction. These prompts were designed to guide the LLM in identifying relevant hazards and estimating risks in line with ISO 12100 standards.
3
Utilizing LLMs for Hazard Identification: The first one-shot prompt used was as follows: “Identify all potential hazards associated with the operation and maintenance of weaving machines in the textile industry. Consider hazards that may arise during regular operation, maintenance activities, and in the event of malfunction or failure. Specifically, focus on the risks of crushing injuries between the reed and temple during the machine’s operation, particularly when the machine restarts unexpectedly. Provide a risk estimation for each identified hazard, considering factors such as the severity of potential injuries, the frequency of exposure to the hazard, and the likelihood of avoiding the hazard. Finally, suggest the appropriate safety functions that could be implemented, and estimate the required Performance Level (PLr) according to ISO 12100 standards”. The LLM identified several hazards, including crushing injuries between the reed and temple, unexpected machine restarts, entanglement with moving parts, and mechanical failures leading to sudden movements.
4
Utilizing LLMs for Risk Estimation: The LLM provided risk estimations for each identified hazard, assessing factors such as severity, frequency, and likelihood of avoidance. For the primary hazard of crushing injuries between the reed and temple, the LLM assigned a Performance Level required (PLr) of d, consistent with the case study ground truth.
5
Interaction and Data Collection: The outputs from the LLM were recorded, including the identified hazards, risk estimations, and suggested safety functions. These were then reviewed by safety experts to ensure alignment with industry standards and practical applicability.
6
Analysis of LLM Outputs with HITL: After reviewing the initial LLM outputs, safety experts identified discrepancies, particularly with how risk reduction measures were considered. A second prompt was issued to reassess the risks after introducing specific risk reduction measures: “1. Evaluate the effectiveness of the following risk reduction measures—Safe Torque Off (STO), safety interlocks, and emergency stop mechanisms—on mitigating the identified hazards. 2. Reassess the residual risks for each hazard after applying these measures, and estimate whether the risks have been reduced to acceptable levels. 3. Provide a summary of how the risk reduction measures affect the Performance Level (PLr) for each hazard”. The LLM, guided by expert oversight, introduced and reassessed the risk reduction measures. The LLM maintained the PLr of d for the primary hazard, consistent with the case study ground truth, due to the severe nature of potential injuries.
7
Final Expert Validation: The final outputs, including the reassessed hazards and suggested safety functions, were validated by human experts. The experts confirmed that the LLM’s final recommendations were consistent with the case study ground truth and industry standards. The validated results were then prepared for integration into the overall risk analysis documentation.

5.2.2. Identified Hazard

Case Study Ground Truth: The primary hazard identified was the risk of crushing injuries between the reed and temple during manual intervention when the machine restarts unexpectedly. The IFA report [8] assigns a Performance Level required (PLr) of d for mitigating this risk through the use of Safe Torque Off (STO).
LLM Analysis: The LLM successfully identified this critical hazard and suggested appropriate safety functions. The recommended PLr of d from the LLM matches the ground truth.

5.2.3. Safety Functions

Case Study Ground Truth: The ground truth safety function involves preventing unexpected start-up by using STO during operator intervention in the hazard zone.
LLM Analysis: The LLM recommended several safety functions, including safety interlocks, emergency stop mechanisms, and redundant safety circuits. While STO was not explicitly mentioned initially, the LLM’s suggested functions align with the objectives of STO. After HITL refinement, STO was explicitly included, aligning the LLM’s output with the ground truth.

5.2.4. Performance Level Required (PLr)

Case Study Ground Truth: The IFA report [8] assigns a PLr of d for the identified hazard.
LLM Analysis (Pre- and Post-Risk Reduction): The LLM initially assigned a PLr of d, matching the ground truth. After the introduction of risk reduction measures and reassessment, the LLM maintained the PLr of d due to the severe nature of the potential injuries.
Deviation Statistics: The initial deviation between the LLM’s assigned PLr and the ground truth was 0, indicating a 100 % alignment. Post-risk reduction, the deviation remained 0, confirming that the LLM accurately assessed the hazard and the effectiveness of the risk reduction measures.

5.2.5. Evaluation Based on Criteria

Accuracy: The initial accuracy in assigning the correct PLr was 100 % , as the LLM’s initial assignment of d matched the case study ground truth for crushing hazards. Following expert-guided refinements, the accuracy remained 100 % , demonstrating that the LLM consistently aligned with the ground truth across all iterations.
Completeness: The LLM identified the critical hazard of crushing injuries, with no significant hazards overlooked. Expert oversight ensured that all potential risks, particularly the effects of unexpected restarts, were thoroughly addressed. The completeness of the analysis was quantitatively 100 % , as all relevant risks outlined in the ground truth were covered.
Usability: The LLM suggested risk reduction measures, including Safe Torque Off (STO), safety interlocks, and emergency stop mechanisms, which were both practical and compliant with ISO 12100 standards. These recommendations align directly with the safety requirements for weaving machines, achieving a usability measure of 100%.
Time Efficiency: The LLM reduced the hazard identification and risk estimation process by 50 % , compared to traditional methods.
Expert Validation: The expert review confirmed that the LLM’s outputs adhered to recognized safety standards, maintaining a PLr of d, consistent with the IFA report. Post-validation accuracy was 100 % , with no deviations identified after risk reduction reassessment.

5.3. Case Study: Autonomous Transport Vehicles

This section presents the results and discussion of the risk analysis conducted on the autonomous transport vehicle case study, following the systematic methodology outlined in this paper. The methodology involves seven key steps, integrating LLMs with HITL oversight to enhance the accuracy and reliability of risk assessments.

5.3.1. Methodology Steps

1
Define the Scope and Objectives: The scope was defined to assess the risks associated with the operation and maintenance of autonomous transport vehicles in industrial settings, with a focus on hazards such as collisions with human workers and unexpected start-up or movement.
2
Preparation for LLM Interaction: Relevant operational data, safety protocols, and industry standards (ISO 12100) were compiled to develop comprehensive prompts for the LLM interaction. These prompts were designed to guide the LLM in identifying relevant hazards and estimating risks.
3
Utilizing LLMs for Hazard Identification: The first one-shot prompt used was as follows: “Identify all potential hazards associated with the operation and maintenance of autonomous transport vehicles in industrial settings. Consider hazards that may arise during regular operation, maintenance activities, and in the event of malfunction or failure. Specifically, focus on the risks of collisions, particularly in areas where these vehicles interact with human workers. Provide a risk estimation for each identified hazard, considering factors such as the severity of potential injuries, the frequency of human presence in the vehicle’s path, and the likelihood of avoiding the hazard. Finally, suggest the appropriate safety functions that could be implemented, and estimate the required Performance Level (PLr) according to ISO 12100 standards”. The LLM identified several hazards, including collisions with human workers, collisions with other vehicles, unexpected start-up or movement, and risks from load handling.
4
Utilizing LLMs for Risk Estimation: The LLM provided risk estimations for each identified hazard, assessing factors such as severity, frequency, and likelihood of avoidance. For the primary hazard of collisions with human workers, the LLM initially assigned a Performance Level required (PLr) of d, which aligns with the case study ground truth.
5
Interaction and Data Collection: The outputs from the LLM were recorded, including the identified hazards, risk estimations, and suggested safety functions. These were then reviewed by safety experts to ensure alignment with industry standards and practical applicability.
6
Analysis of LLM Outputs with HITL: After reviewing the initial LLM outputs, safety experts identified discrepancies, particularly with how risk reduction measures were considered. A second prompt was issued to reassess the risks after introducing specific risk reduction measures: “1. Evaluate the effectiveness of the following risk reduction measures—enhanced obstacle detection, safe start interlocks, and emergency stop mechanisms—on mitigating the identified hazards. 2. Reassess the residual risks for each hazard after applying these measures and determine whether the risks have been reduced to acceptable levels. 3. Provide a summary of how the risk reduction measures affect the Performance Level (PLr) for each hazard”. The LLM, guided by expert oversight, introduced risk reduction measures such as enhanced obstacle detection and safe start interlocks, and reassessed the hazards. Despite the risk reduction measures, the LLM incorrectly adjusted the PLr for collisions with human workers from d to c, which was later corrected by human experts.
7
Final Expert Validation: The final outputs, including the reassessed hazards and suggested safety functions, were validated by human experts. The experts confirmed that while the LLM’s final recommendations were generally robust, the PLr for collisions with human workers should remain at d, consistent with the case study ground truth, reflecting the high severity of potential injuries.

5.3.2. Identified Hazard

Case Study Ground Truth: The primary hazard identified in the IFA report [8] was the risk of collisions with pedestrians (human workers), with a recommended Performance Level required (PLr) of d.
LLM Analysis: The LLM successfully identified the critical hazard of collisions with human workers and initially assigned a PLr of d. However, after introducing risk reduction measures, the LLM mistakenly adjusted the PLr to c, which was lower than the ground truth. This mistake was identified and corrected during expert validation, ensuring alignment with the ground truth.

5.3.3. Performance Level Required (PLr)

Case Study Ground Truth: The IFA report [8] assigns a PLr of d for the hazard of collisions with pedestrians (human workers).
LLM Analysis (Pre- and Post-Risk Reduction): The LLM initially assigned a PLr of d, matching the ground truth. However, after introducing risk reduction measures, the LLM incorrectly adjusted the PLr to c. This discrepancy was due to an underestimation of the critical risk factors involved, particularly the severity and frequency of pedestrian–vehicle interactions in industrial settings. Expert review identified this error, and the PLr was corrected back to d, reflecting the high severity and frequent exposure inherent in this hazard. It is essential to note that reassessing the PLr after risk reduction does not always result in a lower PLr. In some cases, even with risk reduction measures in place, the PLr might remain the same due to the inherent severity or other factors that cannot be fully mitigated.

5.3.4. Evaluation Based on Criteria

Accuracy: The LLM’s initial hazard identification and risk estimation were generally accurate, correctly identifying the primary hazards, such as collisions with human workers. However, after introducing risk reduction measures, the LLM mistakenly adjusted the PLr for the critical hazard of collisions with human workers from d to c. This adjustment underestimated the severity and frequency of pedestrian–vehicle interactions in industrial settings. Expert oversight identified and corrected this error, restoring the PLr to d and ensuring that the final recommendations aligned with the ground truth. Quantitatively, the initial accuracy of assigning the correct PLr was 75% across all identified hazards. After expert intervention, accuracy improved to 100%, demonstrating the critical importance of HITL in validating and refining LLM outputs. Furthermore, this highlights that prompts can be provided sequentially, as demonstrated in Appendix A, without needing to deliver them all at once for this or similar use cases. This modular approach allows for iterative refinement and ensures effective use of the LLM.
Completeness: The LLM was thorough in its hazard identification, covering both high-risk hazards, such as collisions with human workers, and medium-risk hazards, like collisions with other vehicles and risks from load handling. The expert oversight played a key role in ensuring that no significant hazards were overlooked and that the PLr recommendation for the most critical hazard was corrected to match the ground truth, confirming the completeness of the LLM’s analysis. Quantitatively, the LLM achieved 90 % coverage of potential hazards listed in the IFA report.
Usability: The safety functions suggested by the LLM, such as enhanced obstacle detection systems, emergency stop mechanisms, and safe start interlocks, were practical and aligned with industry standards. The LLM’s recommendations for additional safety measures, such as inter-vehicle communication and load-securing mechanisms, demonstrated a clear understanding of real-world industrial applications. The usability of these outputs was further enhanced by expert validation, ensuring their applicability in improving safety protocols. Quantitatively, 85 % of the recommended safety functions were directly implementable without modifications.
Time Efficiency: The LLM significantly reduced the time required for hazard identification and risk estimation compared to traditional manual methods. This efficiency was particularly evident in the initial stages of the analysis, where the LLM quickly generated a comprehensive list of hazards and corresponding risk estimations. Quantitatively, the LLM completed the analysis in approximately 35 min, compared to approximately 90 min for manual analysis, representing a time savings of approximately 61%. Although expert intervention was necessary to correct the PLr assignment for a critical hazard, the overall time efficiency remained high, highlighting the potential of LLMs to streamline risk analysis processes.
Expert Validation: Expert review was essential in refining the LLM’s outputs. The HITL process ensured that the final recommendations adhered to recognized safety standards, particularly in correcting the PLr for collisions with human workers. This validation process underscores the importance of human oversight in leveraging LLMs for safety-critical applications, ensuring that the final outputs are both accurate and reliable. The validation process reduced the initial deviation in PLr from 0.5 to 0.0 on average, ensuring 100 % alignment with the IFA ground truth.

5.4. Case Study: Rotary Printing Presses

This section presents the results and discussion of the risk analysis conducted on the rotary printing press case study, following the systematic methodology outlined in this paper.

5.4.1. Methodology Steps

1
Define the Scope and Objectives: The scope was defined to assess the risks associated with the operation and maintenance of rotary printing presses in the printing industry, with a particular focus on hazards such as entrapment and crushing between counter-rotating cylinders during maintenance.
2
Preparation for LLM Interaction: Operational data, safety standards (ISO 12100 and EN 1010-1:2010), and incident reports were compiled to develop targeted prompts for the LLM interaction, ensuring that the LLM was provided with a comprehensive context for accurate hazard identification.
3
Utilizing LLMs for Hazard Identification: The first prompt was: “Identify all potential hazards associated with the operation and maintenance of rotary printing presses in the printing industry. Consider hazards that may arise during regular operation, maintenance activities, and in the event of malfunction or failure. Specifically, focus on the risks of entrapment and crushing between the counter-rotating cylinders, particularly during maintenance when manual intervention is required. Provide a risk estimation for each identified hazard, considering factors such as the severity of potential injuries, the frequency of exposure to the hazard, and the likelihood of avoiding the hazard. Finally, suggest the appropriate safety functions that could be implemented, such as braking the cylinders upon opening the guard door, restricting machine movements to limited speeds when the guard door is open, and allowing movements only while an inching button is pressed. Estimate the required Performance Level (PLr) according to ISO 12100 standards”. The LLM identified several hazards, including entrapment between cylinders, crushing hazards, shearing points, unexpected release of energy, flying debris, and chemical exposure.
4
Utilizing LLMs for Risk Estimation: The LLM provided risk estimations for each identified hazard, assessing factors such as severity, frequency, and likelihood of avoidance. The LLM assigned appropriate PLr values based on the identified risks and suggested safety functions.
5
Interaction and Data Collection: The outputs from the LLM were recorded, including the identified hazards, risk estimations, and suggested safety functions. These were then reviewed by safety experts to ensure alignment with industry standards and practical applicability.
6
Analysis of LLM Outputs with HITL: After reviewing the initial LLM outputs, safety experts identified areas where additional risk reduction measures could be applied. A second prompt was issued to reassess the risks after introducing specific risk reduction measures: “1. Evaluate the effectiveness of the following risk reduction measures—enhanced braking systems, lockout/tagout procedures, and energy isolation devices—on mitigating the identified hazards. 2. Reassess the residual risks for each hazard after applying these measures and determine whether the risks have been reduced to acceptable levels. 3. Provide a summary of how the risk reduction measures affect the Performance Level (PLr) for each safety function”. The LLM, guided by expert oversight, introduced and reassessed the risk reduction measures. After reevaluation, the LLM’s outputs were validated and refined by safety experts to ensure they aligned with industry standards and maintained the appropriate PLr values for the identified hazards.
7
Final Expert Validation: The final outputs, including the reassessed hazards and suggested safety functions, were validated by human experts. The experts confirmed that the LLM’s final recommendations were consistent with industry standards, including the assignment of appropriate PLr values.

5.4.2. Identified Hazard

Case Study Ground Truth: The IFA report [8] identifies the primary hazards as entrapment and crushing risks at the entrapment points of counter-rotating cylinders during maintenance activities. The recommended Performance Level required (PLr) is d for braking the cylinders upon opening the guard door (SF1) and restricting machine movements to limited speeds when the guard door is open (SF2). The PLr is c for allowing movements only while an inching button is pressed when the guard door is open (SF3).
LLM Analysis: The LLM successfully identified these critical hazards and suggested corresponding safety functions. The initial PLr assignments from the LLM were consistent with [8]: d for SF1 and SF2, and c for SF3.

5.4.3. Safety Functions

Case Study Ground Truth: The ground truth safety functions involve braking the cylinders when the guard door is opened (SF1), restricting machine movements to limited speeds when the guard door is open (SF2), and allowing machine movements only while an inching button is pressed (SF3).
LLM Analysis: The LLM’s analysis aligned with [8], recommending these safety functions. Additionally, the LLM suggested further measures, like fixed guards and energy isolation devices, which, while beneficial, were not explicitly mentioned in the ground truth but were considered to be supplementary.

5.4.4. Performance Level Required (PLr)

Case Study Ground Truth: The established ground truth is a PLr of d for SF1 and SF2, and c for SF3 as per [8].
LLM Analysis (Pre- and Post-Risk Reduction): The LLM initially assigned a PLr of d for SF1 and SF2, and c for SF3, which matched the ground truth. However, after introducing risk reduction measures, the LLM reassessed the PLr for SF2 as c, underestimating residual risks. This resulted in a deviation of 1 PLr level for SF2. For SF1 and SF3, the LLM’s PLr values remained consistent with the ground truth throughout the analysis, confirming the accuracy of its estimations for these safety functions.
Deviation Calculation: The average deviation in PLr levels across all safety functions was calculated as follows:
$$\text{Average Deviation} = \frac{\sum \left| \mathrm{PL}_r^{\mathrm{LLM}} - \mathrm{PL}_r^{\mathrm{GT}} \right|}{\text{Total Safety Functions}}$$
Substituting the values:
$$\text{Average Deviation} = \frac{|d-d| + |c-d| + |c-c|}{3} = \frac{0 + 1 + 0}{3} \approx 0.33$$
The initial accuracy in assigning correct PLr values was 100%, but after the risk reduction prompt and feedback analysis, this accuracy decreased to 66.67% due to the underestimation of the PLr for SF2. This demonstrates that additional prompts, such as those for risk reduction, should only be issued when necessary. If the initial PLr estimates already align with the ground truth, unnecessary prompts may lead to over-adjustments and deviations, as seen with SF2. Therefore, expert oversight through HITL is crucial for deciding when further prompts are required, maintaining both accuracy and efficiency in the analysis.
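The deviation and accuracy figures above can be reproduced with a few lines of code. The sketch below treats adjacent performance levels (a–e) as one step apart, as in the calculation shown.

```python
# Ordinal encoding of performance levels a-e (adjacent levels are one step apart).
PL = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}

def average_deviation(llm, truth):
    """Mean absolute distance in PL levels between LLM and ground-truth assignments."""
    return sum(abs(PL[x] - PL[y]) for x, y in zip(llm, truth)) / len(truth)

def accuracy(llm, truth):
    """Fraction of safety functions whose PLr exactly matches the ground truth."""
    return sum(x == y for x, y in zip(llm, truth)) / len(truth)

# Rotary press, post-risk-reduction LLM outputs vs. ground truth for SF1, SF2, SF3:
llm_plr = ["d", "c", "c"]
gt_plr = ["d", "d", "c"]
print(round(average_deviation(llm_plr, gt_plr), 2))   # 0.33
print(round(accuracy(llm_plr, gt_plr) * 100, 2))      # 66.67
```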

5.4.5. Evaluation Based on Criteria

Accuracy: The LLM’s hazard identification and risk estimation were accurate, with PLr values that matched the ground truth provided by [8]. The LLM correctly identified the severity and frequency of hazards and assigned appropriate PLr values. This alignment demonstrates the effectiveness of the HITL methodology in refining LLM outputs.
Completeness: The LLM was thorough in its hazard identification, covering all significant risks associated with rotary printing presses, including less common hazards, such as flying debris and chemical exposure. Expert oversight confirmed the completeness of the LLM’s analysis, ensuring that no significant hazards were overlooked.
Usability: The safety functions suggested by the LLM, including braking systems, interlocks, and energy isolation, were practical and aligned with industry standards. The recommendations were highly applicable in real-world scenarios, as confirmed by expert validation.
Time Efficiency: The LLM significantly reduced the time required for hazard identification and risk estimation compared to traditional manual methods. This efficiency, coupled with the accuracy and completeness of the results, highlights the potential of LLMs to streamline risk analysis processes.
Expert Validation: Expert review was essential in confirming the LLM’s outputs, particularly in validating the PLr assignments. The HITL process ensured that the final recommendations adhered to recognized safety standards, thereby enhancing the reliability of the LLM’s outputs.

5.5. Summary of Evaluation of Methodology on Case Studies

This section provides an integrated analysis of the findings from the four case studies using the HITL methodology with LLMs. The evaluation considers the role of human oversight (HITL), the performance of the LLMs, consistency with industry standards, and an overall assessment through Likert scale ratings.

5.5.1. Consolidated Insights on the Evaluation Metrics from Use Cases

The results from the four case studies demonstrate the effectiveness of the proposed HITL methodology combined with LLM capabilities in safety-critical risk analysis.
Accuracy: The LLM consistently identified hazards and assigned PLr values with an initial average accuracy of 85 % , improving to 100 % post-HITL refinement. Deviations in residual risk estimation were effectively corrected by expert oversight.
Completeness: The LLM achieved near-total hazard coverage (90–100%), identifying both critical and supplementary risks. Expert validation ensured no significant hazards were overlooked. Practical insight: This approach minimizes overlooked hazards, reducing potential compliance penalties or recalls.
Usability: Recommended safety functions were practical, implementable, and aligned with industry standards. Additional innovative suggestions enhanced the real-world applicability of the outputs.
Time Efficiency: Using the LLM-based approach, the time required for hazard identification and risk estimation is reduced by approximately 30–60%. As an illustrative example, for a mid-size project with 40 hazards, traditional methods may take about 50 h, while LLM-based methods reduce this to 20 h, saving 30 h of work. At an average hourly cost of EUR 100 for a risk analyst, this translates to savings of EUR 3000 per analyst; with two analysts, the total savings double to EUR 6000, making the LLM-based approach both time-efficient and cost-effective (see the calculation sketch after this list).
Expert Validation: The HITL process was crucial in refining outputs, ensuring compliance with safety benchmarks and 100 % alignment with recognized standards.
Overall, the evaluation highlights the potential of combining LLMs with HITL to enhance the efficiency, accuracy, and completeness of risk analysis while ensuring adherence to industry standards. Businesses can benefit from reduced analysis costs, faster time-to-market for safety-critical systems, and minimized compliance risks.
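The illustrative saving quoted in the time-efficiency item above reduces to a simple calculation, sketched below; the hour counts and hourly rate are the assumed example values, not measured project data.

```python
def estimated_savings(manual_hours, llm_hours, hourly_rate_eur, analysts=1):
    """Return (hours saved, cost saved in EUR) for the assumed example project."""
    hours_saved = (manual_hours - llm_hours) * analysts
    return hours_saved, hours_saved * hourly_rate_eur

# Assumed example from the text: 50 h manual vs. 20 h LLM-assisted, EUR 100 per hour.
print(estimated_savings(50, 20, 100))              # (30, 3000) for one analyst
print(estimated_savings(50, 20, 100, analysts=2))  # (60, 6000) for two analysts
```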

5.5.2. Prompt Design, Usage, and Inferences

The evaluation of the prompts used in this study reveals several key insights into their effectiveness and role in guiding the LLM’s performance. The initial one-shot prompts were designed to be comprehensive, providing the LLM with sufficient context to identify hazards, estimate risks, and suggest safety functions. These prompts, averaging 124.75 words in length, demonstrated a strong ability to elicit detailed and accurate outputs from the LLM. However, discrepancies in risk estimation and Performance Level (PLr) assignments were observed, particularly in complex use cases, such as autonomous transport vehicles and rotary printing presses.
The iterative refinement prompts, averaging 62.5 words, were instrumental in guiding the LLM to reassess and refine its outputs. These prompts focused on specific risk reduction measures, such as enhanced obstacle detection, safe start interlocks, and emergency stop mechanisms, enabling the LLM to evaluate the effectiveness of these measures and reassess residual risks. For example, in the case of autonomous transport vehicles, the iterative prompt directed the LLM to reevaluate the PLr for collisions with human workers, leading to a more accurate assessment. Similarly, for rotary printing presses, the prompt guided the LLM to assess the impact of enhanced braking systems and lockout procedures, resulting in a more thorough risk analysis. In the case of motorized gates, the iterative prompt directed the LLM to evaluate the impact of specific risk reduction measures, such as pressure-sensitive edges and obstruction detection systems, on the identified hazards. This enabled the LLM to refine its outputs, adjusting the PLr for crushing hazards to align with the ground truth, while maintaining the PLr for shearing hazards.
The bar chart in Figure 8 compares the lengths of the first and second prompts across all case studies. The lengths of the first prompts range from 113 words (Motorized Gates) to 149 words (Rotary Printing Presses). The lengths of the second prompts range from 50 words (Motorized Gates) to 68 words (Rotary Printing Presses). An analysis of the few-shot prompts employed in this study (prompt lengths shown in Figure 8) and their respective results is provided below:
1
First Prompt Length vs. Accuracy (Figure 9): This scatter plot shows the relationship between the length of the first prompt (in words) and the accuracy of the LLM’s outputs. There is a positive trend between the length of the first prompt and the accuracy of the outputs. Longer initial prompts tend to result in higher accuracy scores. The Rotary Printing Presses case study, with the longest first prompt (149 words), achieves one of the highest accuracy scores (4.8/5), reinforcing the trend. However, the Autonomous Transport Vehicles case study shows that even with a relatively long prompt (119 words), the accuracy can be slightly lower (4.5/5), likely due to the complexity of the use case. Thus, it can be inferred that longer initial prompts are generally effective in improving accuracy, but the complexity of the use case can influence the results.
2
Second Prompt Length vs. Usability (Figure 10): This scatter plot shows the relationship between the length of the second prompt (in words) and the usability of the LLM’s outputs. There is a moderate positive trend between the length of the second prompt and the usability of the outputs. Shorter iterative prompts (e.g., 50 words) still result in high usability scores (4.8/5), while slightly longer prompts (e.g., 66–68 words) achieve even higher scores (4.9/5). The Weaving Machines and Rotary Printing Presses case studies, with second prompts of 66 and 68 words, respectively, achieve the highest usability scores (4.9/5). Thus, it can be inferred that shorter iterative prompts are effective in maintaining high usability, but slightly longer prompts can further enhance the practical applicability of the outputs.
3
Correlation Between Prompt Length and Output Quality (Figure 11): This heatmap shows the correlation coefficients between prompt length (first and second prompts) and output quality metrics (accuracy, completeness, usability, time efficiency, and overall confidence). Some key observations include the following:
First Prompt Length vs. Accuracy: Strong positive correlation (0.85), indicating that longer initial prompts tend to produce more accurate outputs.
Second Prompt Length vs. Usability: Moderate positive correlation (0.45), suggesting that shorter iterative prompts can still yield highly usable outputs.
First Prompt Length vs. Time Efficiency: Weak negative correlation (−0.30), implying that longer initial prompts may slightly reduce efficiency.
The heatmap highlights the importance of balancing prompt length with output quality. Longer initial prompts improve accuracy but may reduce efficiency, while shorter iterative prompts maintain usability without compromising quality. A small sketch of how such correlations can be computed is given after this list.
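In principle, the correlation figures reported in Figure 11 correspond to Pearson coefficients between per-case-study prompt lengths and expert ratings. The sketch below shows this computation on partly placeholder data, since the full rating table is only available graphically.

```python
import numpy as np

# One entry per case study: motorized gates, autonomous vehicles, weaving machines, rotary presses.
# Word counts follow Figure 8 (113, 119, and 149 are stated; the weaving-machine value of 118 is
# inferred from the reported 124.75-word average). Accuracy ratings are placeholders except for
# the 4.5 and 4.8 values mentioned in the text.
first_prompt_len = np.array([113, 119, 118, 149])
accuracy_rating = np.array([4.7, 4.5, 4.8, 4.8])

# Pearson correlation coefficient between first-prompt length and accuracy rating.
r = np.corrcoef(first_prompt_len, accuracy_rating)[0, 1]
print(round(r, 2))
```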
The inferences drawn from the prompt usage highlight two key points:
Focused Iterative Prompts: Shorter, targeted prompts were highly effective in directing the LLM to refine its outputs, particularly when addressing specific risk reduction measures.
Alignment with Expert Oversight: While the prompts provided clear guidance, their effectiveness was further enhanced by the iterative HITL process, which ensured that the refined outputs aligned with safety standards and practical requirements.
Overall, the iterative refinement prompts demonstrated their value in enabling the LLM to produce usable outputs, while maintaining efficiency.
Further, compared to [29,30], this study demonstrates significant improvements in reliability and practicality by introducing an HITL framework tailored for real-life machinery functional safety risk assessment. This approach ensures compliance with ISO 12100 and ISO 13849 standards, provides scope for addressing the possible degradation of LLM performance with increasing complexity, and achieves higher accuracy (a complete agreement with ground truth) through iterative expert validation. Additionally, our workflow evaluates time efficiency and usability, providing a more comprehensive and scalable solution for industrial applications.

5.5.3. Role of HITL

Across all case studies, human oversight proved to be crucial in refining the outputs generated by the LLMs. While the LLMs were proficient in identifying hazards and suggesting relevant safety functions, there were instances where the initial risk estimations or PLr values were either overestimated or underestimated. The HITL process enabled experts to correct these discrepancies, ensuring that the final recommendations adhered to recognized safety standards.
Figure 12 illustrates the critical dependence of LLMs on human oversight in the context of risk analysis. While LLMs possess impressive generative and analytical capabilities, their outputs must be guided and validated by human expertise to ensure accuracy and reliability. This foundation of human oversight—comprising judgment, domain knowledge, ethical considerations, and experience—is essential for refining AI-generated results and ensuring that the risk analysis process meets stringent safety and compliance standards.
As noted by [4], LLMs are often critiqued for lacking true reasoning capabilities, excelling instead at approximate retrieval rather than principled reasoning or planning. Despite these limitations, when integrated with human expert judgment, LLMs can effectively contribute to various aspects of risk analysis, including hazard identification and risk estimation.
For example, in the case of autonomous transport vehicles, the LLM initially downgraded the PLr for collisions with human workers from d to c after introducing risk reduction measures. Expert intervention was necessary to correct this underestimation, highlighting the indispensable role of HITL in safety-critical contexts.

5.5.4. How Human Oversight Helps Overcome LLM Limitations

The experimental evaluation of ChatGPT demonstrated strong capabilities in hazard identification and risk estimation across four safety-critical case studies, significantly reducing analysis time compared to traditional methods. However, its performance in assigning appropriate PLr values was inconsistent, requiring HITL refinement to correct overestimation, as seen in the motorized gates case.
This study highlights a functional safety workflow combining ChatGPT with the HITL methodology. While ChatGPT streamlines initial hazard identification, its limitations—such as context misinterpretation—necessitate expert validation. The HITL approach ensures accuracy, completeness, and compliance with industry standards, aligning with EU AI Act requirements. This hybrid framework accelerates risk assessments while providing a transparent, reliable, and scalable model for improving industrial safety protocols.
The following points summarize how human oversight (i.e., by functional safety engineering experts) addresses the limitations of LLMs within the scope of the study presented in this paper.
Fact Hallucination: Verified and corrected by human experts through validation processes.
Bias in Training Data: Mitigated by incorporating human domain knowledge and ethical judgment.
Overgeneralization: Refined through human expertise and specific decision-making.
Handling of Novelty and Edge Cases: Human experts ensure accurate handling of unique scenarios through experience and context-specific insights.
Lack of Domain Knowledge: Human oversight fills in the gaps where LLMs lack specific knowledge.
Error Propagation: Prevented through human review at multiple stages.
Transparency and Explainability: Human experts ensure that AI-generated outputs can be traced and justified.
Prompt Engineering Dependency: Reduced by having humans (e.g., functional safety experts) optimize prompt design to ensure relevant responses.
By incorporating these roles, the HITL approach comprehensively addresses the limitations of LLMs, making the risk analysis process more reliable, accurate, and aligned with safety and compliance standards.
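The oversight roles listed above can be made operational by recording each LLM proposal together with the expert verdict. The following minimal sketch (an illustration only, not tooling used in this study; all field and function names are assumptions) shows one way such a review record could be structured so that corrections, such as the PLr adjustment in the autonomous transport vehicle case, remain traceable.

```python
# Minimal illustrative sketch of an HITL review record; field names and the
# review helper are assumptions, not tooling used in this study.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HazardReview:
    hazard: str                       # hazard as identified by the LLM
    llm_plr: str                      # PLr proposed by the LLM (a-e)
    expert_plr: Optional[str] = None  # PLr confirmed or corrected by the expert
    rationale: str = ""               # expert justification for any change
    status: str = "pending"           # pending | confirmed | corrected

    def review(self, expert_plr: str, rationale: str = "") -> None:
        """Record the expert verdict and flag whether the LLM value was corrected."""
        self.expert_plr = expert_plr
        self.rationale = rationale
        self.status = "confirmed" if expert_plr == self.llm_plr else "corrected"

# Example inspired by the autonomous transport vehicle case (values illustrative)
entry = HazardReview(hazard="Collision with human worker", llm_plr="c")
entry.review(expert_plr="d",
             rationale="Risk reduction measures do not lower the required PL.")
print(entry.status, entry.expert_plr)  # corrected d
```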

5.5.5. Consistency with Standards

After expert intervention, the LLMs’ final recommendations were consistently aligned with the ground truth in the IFA report [8] across all four case studies. The HITL methodology effectively leveraged the LLMs’ capabilities while ensuring that the final outputs met the required safety benchmarks.
For instance, in the case study involving weaving machines, the LLM’s final recommendation of a PLr of d for the primary hazard matched the ground truth, confirming its understanding of the risks involved. Similarly, for the rotary printing presses, the LLM’s outputs were validated to be consistent with ISO 12100 and EN 1010-1:2010 standards after HITL refinement.

5.5.6. Aggregated Likert Scale Ratings Across Four Case Studies

Table 1 summarizes the aggregated Likert scale ratings for accuracy, completeness, usability, time efficiency, and overall confidence across all four case studies. The results demonstrate the LLM’s strong performance across key metrics, with the HITL framework playing a critical role in maintaining high ratings.
Accuracy was consistently high across all case studies, with an overall average of 4.7/5. The slightly lower rating for autonomous transport vehicles (4.5/5) can be attributed to the complexity and dynamic nature of their operational environments, which posed challenges in identifying all edge cases. However, the HITL framework ensured that these gaps were addressed through iterative expert feedback, ultimately achieving a 100% match with ground truth in the final assessment.
Completeness also showed strong performance, averaging 4.6/5. The HITL methodology enabled the LLM to cover a wide range of hazards, with minor variations reflecting differences in the complexity of the machinery. For instance, weaving machines and rotary printing presses, which involve more mechanical hazards, received slightly higher ratings (4.7/5) compared to motorized gates (4.5/5).
Usability received the highest ratings, averaging 4.8/5, reflecting the practical value of the LLM’s outputs in real-world scenarios. Experts noted that the LLM’s recommendations were intuitive and easily integrated into existing workflows, particularly for weaving machines and rotary printing presses, which both scored 4.9/5.
Time efficiency was exceptional, with an average rating of 4.95/5. The LLM significantly streamlined the risk assessment process, particularly for motorized gates and weaving machines, which achieved full scores (5.0/5). Even for the more complex autonomous transport vehicles, the rating remained high at 4.8/5, demonstrating the framework’s adaptability.
Overall confidence in the LLM’s outputs was robust, averaging 4.65/5. This underscores the effectiveness of the HITL framework in ensuring reliable and applicable results, even in safety-critical contexts. The minor variations across case studies highlight the adaptability of the approach, with expert oversight ensuring consistent quality and trustworthiness.
The radar chart in Figure 13 provides a multidimensional analysis of the LLM-based workflow’s performance across four case studies, evaluating key criteria: Accuracy, Completeness, Usability, Time Efficiency, and Overall Confidence. The chart reveals consistently high ratings, with Accuracy averaging 4.7/5 and Time Efficiency reaching 4.95/5, indicating robust performance and significant process optimization. Slight variations, such as lower Accuracy (4.5/5) for Autonomous Transport Vehicles, reflect the challenges posed by dynamic environments, mitigated by the HITL framework. The high Usability (4.8/5) and Overall Confidence (4.65/5) scores underscore the workflow’s practical applicability and reliability, validated by expert oversight. This visualization highlights the framework’s adaptability and effectiveness in diverse safety-critical contexts.
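For reference, a radar chart of this kind can be reproduced directly from the Table 1 ratings. The following matplotlib sketch is illustrative only and is not the script used to generate Figure 13; the values are copied from Table 1.

```python
# Illustrative sketch: plotting the Table 1 ratings as a radar chart similar
# to Figure 13. Not the original plotting script; values copied from Table 1.
import numpy as np
import matplotlib.pyplot as plt

criteria = ["Accuracy", "Completeness", "Usability",
            "Time Efficiency", "Overall Confidence"]
ratings = {
    "Motorized Gates":               [4.7, 4.5, 4.8, 5.0, 4.6],
    "Weaving Machines":              [4.8, 4.7, 4.9, 5.0, 4.7],
    "Autonomous Transport Vehicles": [4.5, 4.6, 4.7, 4.8, 4.6],
    "Rotary Printing Presses":       [4.8, 4.7, 4.9, 5.0, 4.7],
}

angles = np.linspace(0, 2 * np.pi, len(criteria), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for case, values in ratings.items():
    vals = values + values[:1]
    ax.plot(angles, vals, label=case)
    ax.fill(angles, vals, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(criteria)
ax.set_ylim(4.0, 5.0)
ax.legend(loc="lower right", fontsize="small")
plt.show()
```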

5.6. Threats to Validity

This section outlines the potential limitations and challenges that may affect the generalizability and reliability of the study’s findings. These threats to validity include the reliance on a single LLM, inherent biases in model outputs, and the influence of human expertise in the validation process. Addressing these factors is crucial for ensuring the robustness and applicability of the proposed HITL methodology in diverse contexts.
Single LLM Utilization: In the experimental analysis presented in this paper, a single LLM, ChatGPT, was used for conducting hazard identification and risk estimation. While ChatGPT is highly capable, results may vary if different LLMs are used, and the findings may not fully generalize to other models with different training data, architectures, or capabilities. Future experimental analysis should consider cross-validating results using multiple LLMs.
Bias and Hallucinations in LLM Outputs: LLMs are prone to generating biased or inaccurate information, which can lead to errors in hazard identification and risk assessments. Although the HITL approach mitigates this risk by incorporating expert validation, some errors may still persist, potentially impacting the validity of the findings.
Prompt Engineering Limitations: The quality and relevance of the LLM’s outputs depend significantly on the prompts used. If the prompts are not well crafted or sufficiently detailed, the outputs may be incomplete or inaccurate. This introduces a dependency on prompt engineering, which could affect the study’s outcomes.
Human Expert Variability: The study’s results are influenced by the expertise and judgment of the human experts involved in the HITL process. Differences in expert knowledge and interpretation could lead to variability in the validation of LLM outputs, which may affect the consistency of the findings.

5.7. Future Research Directions

Expansion of Methodology: Future research will explore the following aspects to enhance the methodology applied in this paper:
- Explore how the HITL methodology can be expanded to other domains or more complex risk analysis scenarios, assessing its applicability and effectiveness in diverse industrial contexts.
- The potential for using multiple LLMs in tandem could be explored. For instance, different LLMs, like GPT-4, Claude 2, and PaLM 2, could be tasked with specific aspects of hazard identification, risk assessment, and safety function suggestions, with their outputs being cross-validated by human experts. This approach could enhance the robustness of the risk assessment process, as each LLM might offer unique insights or catch potential issues that others might miss, thereby improving the overall reliability of the assessments.
- This study emphasizes the use of LLMs for text-heavy functional safety risk assessments but acknowledges the complementary role of image-based classifiers in visual safety tasks, such as real-time hazard detection, object recognition, and defect identification. Integrating LLMs with image-based classifiers can create a holistic approach to safety assessments: LLMs interpret safety documentation, while image-based models actively monitor and detect visual hazards in operational settings. This multimodal synergy leverages the unique strengths of each model for comprehensive decision-making. However, data protection is a significant concern, as clients may be reluctant to share machinery images with LLMs like ChatGPT, making integration challenging. While custom chatbots tailored for image-based risk assessments, such as [22], could address privacy concerns, they may not match the extensive capabilities of general LLMs. Balancing data privacy with the full potential of AI remains crucial for effective safety assessments.
- Because RAG integrates domain-specific external knowledge into LLM-generated responses, enhancing context accuracy and relevance without compromising privacy, future research will explore RAG-based techniques and tooling such as RAGAS [9], Giskard (https://www.giskard.ai/, accessed on 31 January 2025), LangChain (https://www.langchain.com/, accessed on 31 January 2025), and the ChatGPT APIs (https://platform.openai.com/docs/guides/chat, accessed on 31 January 2025). Given this paper’s focus on web-based LLMs and text-only prompts, RAG and its variants may be used to supplement the internal knowledge of LLMs with specific, up-to-date information from private databases or internal documents without sharing confidential machine images or data. The general workflow is: User Query Input ⟶ Retrieve Relevant Data (from an external knowledge base/database) ⟶ LLM combines internal knowledge with retrieved data ⟶ Generate a contextually accurate and informative response; a minimal code sketch of this workflow is shown below. This could enhance the current HITL methodology in terms of context accuracy, reduce dependence on expert availability, and support structured metrics evaluation, thereby improving the consistency and relevance of LLM outputs in functional safety risk assessments.
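The sketch below illustrates this retrieve-then-generate workflow under simplifying assumptions: a naive keyword-overlap retriever stands in for a proper vector store (as provided by frameworks such as LangChain), the knowledge-base snippets, query, and model name are illustrative, and an OpenAI API key is assumed to be configured.

```python
# Minimal illustrative RAG sketch for the workflow described above. The
# keyword-overlap retriever is a stand-in for a real vector store; snippets,
# query, and model name are assumptions. Requires OPENAI_API_KEY to be set.
from openai import OpenAI

knowledge_base = [  # e.g., excerpts from internal safety documentation
    "Power-operated gates require closing edge protection per DIN EN 12453.",
    "PLr d is typically required for hazards with severe, frequent exposure.",
    "Light curtains must be evaluated for detection capability and mounting height.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def rag_answer(query: str) -> str:
    context = "\n".join(retrieve(query, knowledge_base))
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context where possible."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

print(rag_answer("Which standard governs closing edge protection for motorized gates?"))
```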
Automation and Refinement: Investigate opportunities to further automate the HITL process while maintaining the necessary level of human oversight. This includes developing advanced techniques for integrating human expertise more seamlessly into automated Continuous Integration/Continuous Delivery (CI/CD) workflows [45]. One promising direction is the integration of LLMs, such as ChatGPT, beyond traditional web-based interfaces. Utilizing OpenAI’s API, organizations can embed LLMs directly into their software systems, enabling real-time interactions with the models during development processes. This could be leveraged to automate safety assessments, compliance checks, or documentation generation as part of CI/CD pipelines. For example, LLMs could be triggered to assess changes in code or system configurations for potential safety impacts, providing real-time feedback to engineers before deployment; a minimal sketch of such a pipeline step is shown below. Furthermore, other integration methods, such as RESTful API calls, scripting in various programming languages, or embedding LLM capabilities in mobile or desktop applications, offer additional flexibility in how these models can be applied in various industrial settings.
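The following sketch shows what such a CI/CD step might look like: a script that sends a change set to the model and fails the pipeline (pending expert review) if a potential safety impact is flagged. It is a sketch under stated assumptions, not part of the study’s tooling; the file name, model, and prompt wording are illustrative, and an OpenAI API key is assumed.

```python
# Illustrative sketch of an LLM-backed safety-impact check as a CI/CD step.
# Not part of the study's tooling; file name, model, and prompt are assumptions.
# Exits non-zero to block the build when a potential safety impact is flagged.
import sys
from pathlib import Path
from openai import OpenAI

def check_change(diff_path: str = "change.diff") -> int:
    diff = Path(diff_path).read_text()
    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "You review configuration changes for machinery safety "
                        "impact. Answer 'SAFETY-RELEVANT' or 'NO-IMPACT' with a "
                        "one-sentence justification."},
            {"role": "user", "content": diff},
        ],
    ).choices[0].message.content
    print(reply)
    return 1 if "SAFETY-RELEVANT" in reply else 0  # non-zero triggers expert review

if __name__ == "__main__":
    sys.exit(check_change(sys.argv[1] if len(sys.argv) > 1 else "change.diff"))
```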
Enhancing Initial LLM Output Integrity: While the HITL methodology significantly improves the reliability of LLM-generated risk assessments, future work will focus on enhancing the integrity of initial outputs produced by LLMs. This enhancement aims to improve the quality of results before human validation, reducing the burden on experts and minimizing discrepancies. To achieve this, several strategies will be explored:
- Improved Prompt Engineering: Develop structured prompts that incorporate comprehensive safety parameters and domain-specific context, guiding LLMs to produce more accurate and relevant initial outputs.
- Application-Specific Fine-Tuning of LLMs: While LLMs are trained on massive amounts of data and may have undergone broad fine-tuning, further customization with domain-specific datasets—including safety standards (e.g., ISO 12100), historical risk analyses, and validated case studies—can improve alignment with functional safety requirements. This specialized fine-tuning aims to enhance the model’s performance for risk identification in machinery safety, resulting in outputs that are more directly relevant and accurate for this context.
- Cross-Verification of Outputs: Employ multiple LLMs or iterative queries with the same model to cross-verify outputs, ensuring greater robustness and accuracy in initial hazard identification.
- Automated Rule-Based Consistency Checking: Implement rule-based filters and AI tools to validate LLM outputs against established safety standards and risk taxonomies, catching inconsistencies before human review (a minimal sketch of such a check is given at the end of this subsection).
- Risk Taxonomy Alignment for Structured Outputs: Align prompts and expected outputs with known risk taxonomies (e.g., severity, frequency, avoidance) to facilitate more targeted responses and streamline validation.
These strategies are expected to enhance the initial quality of LLM-generated results, ensuring that they are robust, accurate, and compliant with safety standards from the outset.
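As an illustration of the rule-based consistency checking mentioned above, the ISO 13849-1 risk graph can be encoded as a simple lookup so that any PLr proposed by the LLM is mechanically checked against the severity (S), frequency (F), and avoidance (P) parameters before expert review. The sketch below is illustrative; the function name and the example parameter values are assumptions, not tooling or data from this study.

```python
# Illustrative rule-based consistency check: the ISO 13849-1 risk graph encoded
# as a lookup table. The LLM-proposed PLr is compared against the value implied
# by the S/F/P parameters; mismatches are flagged for expert review.
# Function name and example values are assumptions, not tooling from this study.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a", ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b", ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c", ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d", ("S2", "F2", "P2"): "e",
}

def check_plr(severity: str, frequency: str, avoidance: str, llm_plr: str) -> dict:
    """Return the PLr implied by the risk graph and whether the LLM agrees."""
    expected = RISK_GRAPH[(severity, frequency, avoidance)]
    return {"expected_plr": expected, "llm_plr": llm_plr,
            "consistent": expected == llm_plr}

# Example: a hazard rated S2/F2/P1 for which the LLM proposed PLr 'c'
print(check_plr("S2", "F2", "P1", llm_plr="c"))
# {'expected_plr': 'd', 'llm_plr': 'c', 'consistent': False} -> flag for expert review
```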

6. Conclusions

The phrase ‘Clever Hans in the Loop’ draws a parallel between the phenomenon of Clever Hans, a horse famously thought to solve mathematical problems by reading subtle human cues [6], and the potential pitfalls of using LLMs like ChatGPT. This analogy underscores a critical question: while ChatGPT can produce convincing and seemingly intelligent outputs, are these results grounded in genuine ‘understanding’, or do they simply reflect patterns derived from training data and user inputs? Much like Clever Hans relied on human cues, ChatGPT’s outputs, though meaningful in appearance, can lack true comprehension and require careful scrutiny.
This study has demonstrated that embedding ChatGPT within a systematic HITL framework can effectively mitigate these limitations. By incorporating expert oversight, the HITL methodology addresses inaccuracies and context interpretation errors, ensuring that AI-generated outputs are reliable. Rather than a replacement for traditional methods or human expertise, ChatGPT serves as a valuable supplementary tool, capable of enhancing efficiency and augmenting hazard identification and risk assessment in routine functional safety workflows. Key findings of this study include the following:
ChatGPT alone provides substantial efficiency improvements by rapidly generating initial hazard identifications and risk estimations.
The HITL framework significantly enhances the accuracy and completeness of risk assessments compared to standalone ChatGPT outputs. Moreover, the HITL framework aligns with the EU AI Act’s [7] mandate for human oversight in high-risk AI systems.
Human intervention ensures risk assessments align with industry standards, highlighting the necessity of expert oversight.
Likert scale evaluations demonstrate high levels of trust and confidence in the refined outputs, reinforcing the value of human expertise.
Longer initial prompts improve accuracy but may reduce efficiency, while shorter iterative prompts maintain usability without compromising quality.
Strong relationships exist between prompt length and output quality, emphasizing the need for context-specific prompt design.
The hybrid approach offers a scalable and practical framework for enhancing routine functional safety workflows, including in real-life project settings.
In conclusion, the utility of ChatGPT lies not in replacing human judgment but in complementing it, offering a tool that, when integrated thoughtfully, can enhance routine functional safety workflows. This study affirms the potential of generative AI, such as using LLMs, in safety-critical industrial settings, while emphasizing the indispensable need for rigorous oversight and expert involvement. The findings provide a foundation for the cautious yet transformative adoption of AI-driven tools, paving the way for further advancements in regulated industries and beyond.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The author would like to express sincere gratitude to Innotec GmbH-TÜV Austria Group colleagues who served on the expert panel for this study. Their valuable insights and expertise were instrumental in validating the results and ensuring the rigor of the findings.

Conflicts of Interest

Author Padma Iyenghar was employed by the company innotec GmbH-TÜV Austria Group. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Example Prompts

This section lists example prompts for the use cases in Annex A: Examples of Risk Assessment in the IFA report [8], which are employed in this study to evaluate the proposed approach for conducting risk assessments. In comparison to the prompts mentioned in Section 5, the prompts listed here are specifically tailored to tasks such as hazard identification, risk assessment, and risk reduction with the LLM. These prompts represent only a subset of the examples and are provided primarily to share the research data from this study. Note that this study focuses on evaluating the feasibility of few-shot prompting with ChatGPT for routine functional safety risk assessments, leveraging freely available LLMs through simple, web-based prompts. The objective is not to optimize prompt design or configurations, nor to employ multiple LLMs and explore their comparative implications, which could be a separate study in itself; rather, the primary objective is to assess the effectiveness of LLMs in conducting routine functional safety assessments based on the provided use cases.

Appendix A.1. Closing Edge Protection

(Example prompts for this use case are provided as images i001–i003 in the published article.)

Appendix A.2. Autonomous Transport Vehicles

(Example prompts for this use case are provided as images i004–i006 in the published article.)

References

  1. ISO 12100:2010; Safety of Machinery: General Principles for Design: Risk Assessment and Risk Reduction. ISO: Geneva, Switzerland, 2010. Available online: https://www.iso.org/standard/51528.html (accessed on 4 February 2025).
  2. ISO 13849-1:2015; Safety of Machinery—Safety-Related Parts of Control Systems—Part 1: General Principles for Design. ISO: Geneva, Switzerland, 2015. Available online: https://www.iso.org/standard/69883.html (accessed on 4 February 2025).
  3. The Machinery Directive, Directive 2006/42/EC of the European Parliament and of the Council of 17 May 2006. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32006L0042 (accessed on 4 February 2025).
  4. Kambhampati, S. Can large language models reason and plan? Ann. N. Y. Acad. Sci. 2024, 1534, 15–18. [Google Scholar] [CrossRef] [PubMed]
  5. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:cs.CL/2303.08774. [Google Scholar]
  6. Pfungst, O. Clever Hans (The horse of Mr. von Osten): A contribution to experimental animal and human psychology. J. Anim. Psychol. 1911, 1, 1–128. [Google Scholar]
  7. Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonized Rules on AI and Amending Certain Union Legislative Acts. Available online: https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-laying-down-harmonised-rules-artificial-intelligence-artificial-intelligence (accessed on 4 February 2025).
  8. IFA Report 2/2017e Functional Safety of Machine Controls—Application of EN ISO 13849, Deutsche Gesetzliche Unfallversicherung. 2019. Available online: https://www.dguv.de/medien/ifa/en/pub/rep/pdf/reports-2019/report0217e/rep0217e.pdf (accessed on 31 January 2025).
  9. Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv 2023, arXiv:cs.CL/2309.15217. [Google Scholar]
  10. Yu, H.; Gan, A.; Zhang, K.; Tong, S.; Liu, Q.; Liu, Z. Evaluation of Retrieval-Augmented Generation: A Survey. arXiv 2024, arXiv:cs.CL/2405.07437. [Google Scholar]
  11. Zhao, P.; Zhang, H.; Yu, Q.; Wang, Z.; Geng, Y.; Fu, F.; Yang, L.; Zhang, W.; Jiang, J.; Cui, B. Retrieval-Augmented Generation for AI-Generated Content: A Survey. arXiv 2024, arXiv:cs.CV/2402.19473. [Google Scholar]
  12. Abusitta, A.; Li, M.Q.; Fung, B.C. Survey on Explainable AI: Techniques, challenges and open issues. Expert Syst. Appl. 2024, 255, 124710. [Google Scholar] [CrossRef]
  13. IEC 61508-1:2010; Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems. IEC: Geneva, Switzerland, 2010. Available online: https://www.vde-verlag.de/iec-normen/217177/iec-61508-1-2010.html (accessed on 5 February 2025).
  14. Software-Assistent SISTEMA Bewertung von Sicherheitsbezogenen Maschinensteuerungen nach DIN EN ISO 13849. 2010. Available online: https://www.dguv.de/ifa/praxishilfen/praxishilfen-maschinenschutz/software-sistema/index.jsp (accessed on 5 February 2025).
  15. Adaptive Safety and Security in Smart Manufacturing. Available online: https://www.tuvsud.com/en/resource-centre/white-papers/adaptive-safety-and-security-in-smart-manufacturing (accessed on 5 February 2025).
  16. Allouch, A.; Koubaa, A.; Khalgui, M.; Abbes, T. Qualitative and Quantitative Risk Analysis and Safety Assessment of Unmanned Aerial Vehicles Missions over the Internet. arXiv 2019, arXiv:cs.RO/1904.09432. [Google Scholar] [CrossRef]
  17. Xiong, W.; Jin, J. Summary of Integrated Application of Functional Safety and Information Security in Industry. In Proceedings of the 2018 12th International Conference on Reliability, Maintainability, and Safety (ICRMS), Shanghai, China, 17–19 October 2018; pp. 463–469. [Google Scholar] [CrossRef]
  18. Chen, M.; Luo, M.; Sun, H.; Chen, Y. A Comprehensive Risk Evaluation Model for Airport Operation Safety. In Proceedings of the 2018 12th International Conference on Reliability, Maintainability, and Safety (ICRMS), Shanghai, China, 17–19 October 2018; pp. 146–149. [Google Scholar]
  19. Devaraj, L.; Ruddle, A.R.; Duffy, A.P. Electromagnetic Risk Analysis for EMI Impact on Functional Safety With Probabilistic Graphical Models and Fuzzy Logic. IEEE Lett. Electromagn. Compat. Pract. Appl. 2020, 2, 96–100. [Google Scholar] [CrossRef]
  20. Ehrlich, M.; Bröring, A.; Diedrich, C.; Jasperneite, J. Towards Automated Risk Assessments for Modular Manufacturing Systems-Process Analysis and Information Model Proposal. Automatisierungstechnik 2023, 71, 6. [Google Scholar] [CrossRef]
  21. Bhatti, Z.E.; Roop, P.S.; Sinha, R. Unified Functional Safety Assessment of Industrial Automation Systems. IEEE Trans. Ind. Inform. 2017, 13, 17–26. [Google Scholar] [CrossRef]
  22. Iyenghar, P.; Hu, Y.; Kieviet, M.; Pulvermueller, E.; Wuebbelmann, J. AI-Based Assistant for Determining the Required Performance Level for a Safety Function. In Proceedings of the 48th Annual Conference of the IEEE Industrial Electronics Society (IECON 2022), Brussels, Belgium, 17–20 October 2022; pp. 1–6. [Google Scholar] [CrossRef]
  23. Iyenghar, P.; Kieviet, M.; Pulvermueller, E.; Wuebbelmann, J. A Chatbot Assistant for Reducing Risk in Machinery Design. In Proceedings of the 2023 IEEE 21st International Conference on Industrial Informatics (INDIN), Lemgo, Germany, 17–20 July 2023; pp. 1–8. [Google Scholar] [CrossRef]
  24. Khlaaf, H. Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems. Technical Report, Trail of Bits, 2023. Available online: https://www.trailofbits.com/documents/Toward_comprehensive_risk_assessments.pdf (accessed on 4 February 2025).
  25. Attar, H. Joint IoT/ML Platforms for Smart Societies and Environments: A Review on Multimodal Information-Based Learning for Safety and Security. J. Data Inf. Qual. 2023, 15. [Google Scholar] [CrossRef]
  26. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2024, arXiv:cs.CL/2307.06435. [Google Scholar]
  27. Rostam, Z.R.K.; Szénási, S.; Kertész, G. Achieving Peak Performance for Large Language Models: A Systematic Review. IEEE Access 2024, 12, 96017–96050. [Google Scholar] [CrossRef]
  28. Nasution, A.H.; Onan, A. ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks. IEEE Access 2024, 12, 71876–71900. [Google Scholar] [CrossRef]
  29. Diemert, S.; Weber, J.H. Can Large Language Models assist in Hazard Analysis? arXiv 2023, arXiv:cs.HC/2303.15473. [Google Scholar]
  30. Qi, Y.; Zhao, X.; Khastgir, S.; Huang, X. Safety Analysis in the Era of Large Language Models: A Case Study of STPA using ChatGPT. arXiv 2023, arXiv:cs.CL/2304.01246. [Google Scholar] [CrossRef]
  31. Aladdin, A.M.; Muhammed, R.K.; Abdulla, H.S.; Rashid, T.A. ChatGPT: Precision Answer Comparison and Evaluation Model. TechRxiv 2024. [Google Scholar] [CrossRef]
  32. Wilchek, M.; Hanley, W.; Lim, J.; Luther, K.; Batarseh, F.A. Human-in-the-loop for computer vision assurance: A survey. Eng. Appl. Artif. Intell. 2023, 123, 106376. [Google Scholar] [CrossRef]
  33. Mosqueira-Rey, E.; Hernández-Pereira, E.; Alonso-Ríos, D.; Bobes-Bascarán, J.; Fernández-Leal, Á. Human-in-the-loop machine learning: A state of the art. Artif. Intell. Rev. 2023, 56, 3005–3054. [Google Scholar] [CrossRef]
  34. Bhattacharya, M.; Penica, M.; O’Connell, E.; Southern, M.; Hayes, M. Human-in-Loop: A Review of Smart Manufacturing Deployments. Systems 2023, 11, 35. [Google Scholar] [CrossRef]
  35. Huang, W.; Liu, H.; Huang, Z.; Lv, C. Safety-Aware Human-in-the-Loop Reinforcement Learning With Shared Control for Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2024, 25, 16181–16192. [Google Scholar] [CrossRef]
  36. Kumar, S.; Datta, S.; Singh, V.; Datta, D.; Kumar Singh, S.; Sharma, R. Applications, Challenges, and Future Directions of Human-in-the-Loop Learning. IEEE Access 2024, 12, 75735–75760. [Google Scholar] [CrossRef]
  37. Rožanec, J.M.; Montini, E.; Cutrona, V.; Papamartzivanos, D.; Klemenčič, T.; Fortuna, B.; Mladenić, D.; Veliou, E.; Giannetsos, T.; Emmanouilidis, C. Human in the AI Loop via xAI and Active Learning for Visual Inspection. In Artificial Intelligence in Manufacturing: Enabling Intelligent, Flexible and Cost-Effective Production Through AI; Soldatos, J., Ed.; Springer Nature: Cham, Switzerland, 2024; pp. 381–406. [Google Scholar] [CrossRef]
  38. Jaltotage, B.; Lu, J.; Dwivedi, G. Use of Artificial Intelligence Including Multimodal Systems to Improve the Management of Cardiovascular Disease. Can. J. Cardiol. 2024, 40, 1804–1812. [Google Scholar] [CrossRef] [PubMed]
  39. Yang, X.; Zhu, C. Industrial Expert Systems Review: A Comprehensive Analysis of Typical Applications. IEEE Access 2024, 12, 88558–88584. [Google Scholar] [CrossRef]
  40. DIN EN 12453:2022-08; Industrial, Commercial and Garage Doors and Gates—Safety in Use of Power Operated Doors—Requirements and Test Methods. DIN: Berlin, Germany, 2022.
  41. DIN EN 1525:1997-12; Safety of Industrial Trucks—Driverless Trucks and Their Systems. DIN: Berlin, Germany, 1997.
  42. ISO 11111-6:2005; Textile Machinery—Safety Requirements—Part 6: Fabric Manufacturing Machinery. ISO: Geneva, Switzerland, 2005.
  43. DIN EN 1010-1:2010; Safety of Machinery—Safety Requirements for the Design and Construction of Printing and Paper Converting Machines—Part 1: Common Requirements. DIN: Berlin, Germany, 2010.
  44. OpenAI. ChatGPT. Model: GPT-4-Turbo. 2023. Available online: https://chat.openai.com (accessed on 4 February 2025).
  45. Baumgartner, N.; Iyenghar, P.; Schoemaker, T.; Pulvermüller, E. AI-Driven Refactoring: A Pipeline for Identifying and Correcting Data Clumps in Git Repositories. Electronics 2024, 13, 1644. [Google Scholar] [CrossRef]
Figure 1. The iterative process of risk assessment and risk reduction.
Figure 2. Risk graph.
Figure 3. The systematic workflow for integrating LLMs in risk analysis using a Human-In-The-Loop (HITL) approach, following the ISO 12100 standard for machinery safety.
Figure 4. Schematic representation of a motorized gate equipped with closing edge protection devices.
Figure 5. An autonomous guided vehicle in an industrial setting, taken from Annex-A in [8].
Figure 6. Diagram of a weaving machine highlighting critical components, such as the reed, temple, and light beam, taken from Annex-A in [8].
Figure 7. Rotary printing press showing the critical hazard zones, taken from Annex-A in [8].
Figure 8. Prompt length across case studies.
Figure 9. First prompt length vs. accuracy.
Figure 10. Second prompt length vs. usability.
Figure 11. Correlation heatmap between prompt length and output quality.
Figure 12. The effectiveness of LLMs in risk analysis critically depends on essential human oversight (Figure adapted from https://xkcd.com/2347/ under a Creative Commons license, accessed on 31 January 2025).
Figure 13. Performance analysis of the workflow across four case studies.
Table 1. Aggregated Likert scale ratings across all case studies.

Criteria | Motorized Gates | Weaving Machines | Autonomous Transport Vehicles | Rotary Printing Presses | Overall Average
Accuracy | 4.7/5 | 4.8/5 | 4.5/5 | 4.8/5 | 4.7/5
Completeness | 4.5/5 | 4.7/5 | 4.6/5 | 4.7/5 | 4.6/5
Usability | 4.8/5 | 4.9/5 | 4.7/5 | 4.9/5 | 4.8/5
Time Efficiency | 5.0/5 | 5.0/5 | 4.8/5 | 5.0/5 | 4.95/5
Overall Confidence | 4.6/5 | 4.7/5 | 4.6/5 | 4.7/5 | 4.65/5
