Article

Empirical Evaluation of Reasoning LLMs in Machinery Functional Safety Risk Assessment and the Limits of Anthropomorphized Reasoning

Innotec GmbH-TÜV Austria Group, Hornbergstrasse 45, 70794 Filderstadt, Germany
Electronics 2025, 14(18), 3624; https://doi.org/10.3390/electronics14183624
Submission received: 25 July 2025 / Revised: 23 August 2025 / Accepted: 1 September 2025 / Published: 12 September 2025

Abstract

Transparent reasoning and interpretability are essential for AI-supported risk assessment, yet it remains unclear whether large language models (LLMs) can provide reliable, deterministic support for safety-critical tasks or merely simulate reasoning through plausible outputs. This study presents a systematic, multi-model empirical evaluation of reasoning-capable LLMs applied to machinery functional safety, focusing on Required Performance Level (PLr) estimation as defined by ISO 13849-1 and ISO 12100. Six state-of-the-art models (Claude-opus, o3-mini, o4-mini, GPT-5-mini, Gemini-2.5-flash, DeepSeek-Reasoner) were evaluated across six prompting strategies and two dataset variants: canonical ISO-style hazards (Variant 1) and engineer-authored free-text scenarios (Variant 2). Results show that rule-grounded prompting consistently stabilizes performance, achieving ceiling-level accuracy in Variant 1 and restoring reliability under lexical variability in Variant 2. In contrast, unconstrained chain-of-thought (CoT) reasoning and CoT combined with retrieval-augmented generation (RAG) introduce volatility, overprediction biases, and model-dependent degradations. Safety-critical coverage was quantified through per-class F1 and recall of PLr class e, confirming that only rule-grounded prompts reliably captured rare but high-risk hazards. Latency analysis demonstrated that rule-only prompts were both the most accurate and the most efficient, while CoT strategies incurred 2–10× overhead. A confusion/rescue analysis of retrieval interactions further revealed systematic noise mechanisms such as P-inflation and F-drift, showing that retrieval can either destabilize or rescue cases depending on model family. Intermediate severity/frequency/possibility (S/F/P) reasoning steps were found to diverge from ISO-consistent logic, reinforcing critiques that LLM “reasoning” reflects surface-level continuation rather than genuine inference. All reported figures include 95% confidence intervals: t-intervals across runs (r = 5) for accuracy and timing, and class-stratified bootstrap CIs for Micro-/Macro-/Weighted-F1 and per-class metrics. Overall, this study establishes a rigorous benchmark for evaluating LLMs in functional safety workflows such as PLr determination. It shows that deterministic, safety-critical classification requires strict rule-constrained prompting and careful retrieval governance rather than reliance on assumed model reasoning abilities.

1. Introduction

The increasing complexity of industrial machinery, coupled with stringent regulatory demands, has intensified the need for reliable and structured functional safety risk assessment methods. Among these, the estimation of the Required Performance Level (PLr) is a critical step, directly impacting the design and validation of safety-related control systems in accordance with standards such as ISO 12100 [1] and ISO 13849-1 [2]. PLr estimation is inherently deterministic: a given hazard scenario is systematically evaluated against defined risk parameters, such as severity, frequency of exposure, and possibility of avoidance, and predefined classification rules are then applied to assign the PLr in accordance with formalized procedures [1,2,3].
On the other hand, the advent of large language models (LLMs) equipped with advanced reasoning capabilities has reshaped the landscape of Artificial Intelligence (AI)-assisted decision support. Models employing structured reasoning mechanisms, such as chain-of-thought (CoT) prompting or retrieval-augmented generation (RAG), have shown potential in complex problem-solving and domain-specific language tasks. These developments have led to speculative interest in whether such reasoning models could extend their utility to structured industrial domains, including safety-critical classification tasks like PLr estimation. However, this potential remains largely unexplored, with no prior work systematically validating reasoning models for deterministic, safety-critical tasks such as PLr estimation, and no empirical evidence yet assessing their reliability in this domain.
In safety-critical risk assessment, transparent reasoning is essential for both regulatory compliance and human-understandable justification of classifications. A central open question is whether reasoning-capable LLMs can provide genuinely interpretable, rule-consistent decision support, or merely generate plausible but unreliable outputs. Prior studies have focused on open-domain inference or problem-solving benchmarks, without systematically addressing deterministic tasks that require strict rule adherence and minimal tolerance for ambiguity. Moreover, the known tendency of LLMs to produce “reasoning illusions”, outputs that appear logically structured yet lack factual correctness, raises significant concerns about their reliability in functional safety contexts.

1.1. Identified Research Gaps

This study is motivated by three key research gaps:
  • Gap 1: Lack of empirical evaluation of reasoning models in deterministic, safety-critical PLr classification.
  • Gap 2: Underexplored impact of reasoning biases and hallucinations in PLr estimation.
  • Gap 3: Absence of structured benchmarking methodologies and empirical evaluation for reasoning models in functional safety risk assessment.

1.2. Research Questions

To guide the study and systematically address the identified gaps, the following research questions are formulated:
  • RQ1: Can reasoning models reliably perform structured risk classification tasks, such as PLr estimation, within the constraints of functional safety standards?
  • RQ2: What limitations do reasoning models exhibit when applied to deterministic classification problems in the domain of machinery risk assessment?
  • RQ3: How does the reliance on structured prompting affect the consistency and validity of reasoning model outputs in functional safety contexts?
These questions guide the experimental evaluation, focusing on the empirical behavior of reasoning-capable LLMs under controlled, rule-bound testing scenarios.

1.3. Contributions

This study makes the following novel contributions:
  • Comprehensive Experimental Benchmarking of Reasoning Models for Functional Safety Risk Classification (RQ1): A systematic evaluation of reasoning-capable LLMs applied to PLr estimation tasks is presented, utilizing diverse prompting strategies including zero-shot, CoT, rule-based prompting, and RAG-augmented reasoning.
  • Empirical Analysis of Reasoning Biases and Hallucination Effects in Structured Classification (RQ2): The study analyzes error patterns, misclassification tendencies, and reasoning-induced biases exhibited by LLMs in deterministic classification tasks (e.g., P-inflation, F-drift, redundancy, and mislabeling across PLr classes).
  • Identification of Methodological Considerations for LLM Deployment in Safety-Critical Applications and Future Benchmarking (RQ3): Practical implications are highlighted, including the necessity of structured prompting and the risks of anthropomorphizing LLM reasoning capacities (i.e., attributing human-like reasoning abilities to models based on their language output) in domains with strict correctness requirements. The findings also establish a basis for future research on the applicability and limitations of AI reasoning models in structured industrial domains, contributing to the development of scientifically grounded benchmarking methodologies.
Note that this study focuses on the empirical validation of structured prompting strategies for deterministic risk classification tasks in machinery functional safety. The objective is to provide practical benchmarking evidence for Artificial Intelligence (AI) deployment in regulated industrial domains, rather than to critique the reasoning capabilities of LLMs in general.
The remainder of this paper is organized as follows. Section 2 reviews background and related work. Section 3 describes the experimental design and evaluation metrics. Section 4 provides a detailed analysis of the experimental results. Section 5 concludes the paper.

2. Background and Related Work

This section provides the necessary background on functional safety risk assessment, with a focus on machinery domains governed by ISO 12100 and ISO 13849 standards. Section 2.1 outlines the principles of hazard identification, risk estimation, and PLr determination. Building on this foundation, the related work review examines prior research on AI-assisted risk assessment. The section further analyzes critical findings on LLM reasoning capabilities, limitations of chain-of-thought prompting, and the risks of anthropomorphizing model outputs. Finally, it discusses the role of retrieval-augmented generation, prompt engineering, and domain-specific benchmarking in ensuring reliable AI performance in deterministic, safety-critical applications.

2.1. Machinery Functional Safety Risk Assessment

Machinery safety is a global concern for suppliers and manufacturers, governed by a comprehensive regulatory framework designed to protect individuals, assets, and property from harm. Most industrial environments feature complex machinery assemblies that must comply with relevant safety regulations. The Machinery Regulation (EU) 2023/1230, adopted on 14 June 2023, reinforces the core compliance requirement that each machine undergo a structured hazard analysis and risk assessment—a systematic evaluation of potential hazards based on severity, exposure frequency, and the possibility of avoidance—as an essential part of conformity assessment for CE marking [4].
ISO 12100 [1] is an international standard that specifies basic terminology, principles, and methodology for achieving safety in the design of machinery. It specifies principles of risk assessment and risk reduction to help designers achieve this objective. Procedures are described for identifying hazards and for estimating and evaluating risks during relevant phases of the machine life cycle. In this context, a hazard is something that can potentially cause harm, and risk is the combination of the probability and severity of that harm. For example, a sharp part is a hazard; when it is in an exposed position, it creates a risk.

2.1.1. Risk Assessment

The iterative process of risk assessment and reduction is shown in Figure 1. A risk assessment follows a series of logical steps to identify and examine any potential hazards associated with machinery. The process starts with hazard identification within the machine’s space, time, and usage limits. The risk associated with each hazard is then estimated using the risk elements of harm severity (S), occurrence frequency (F), and avoidance or limitation possibility (P). Based on the information obtained, the risk is then evaluated for acceptability. If it is not acceptable, risk reduction measures are required. The whole process is called risk assessment. Iteration of this process can be necessary to eliminate hazards as far as practicable and to adequately reduce risks by the implementation of protective measures. Protective measures play an important role in risk reduction. Such measures include protection devices and safety controls, the combination of which is called the Safety-Related Part of the Control System (SRP/CS) [2,3].

2.1.2. Safety Function

Safety functions (SFs) are machine functions whose failure causes an immediate increase in risk [2]. A single SF can be implemented by multiple SRP/CSs, and a single SRP/CS may implement multiple SFs, such as prevention of unexpected start-up and enforcement of safety limits on parameters such as temperature and pressure. For example, the control system shuts down the furnace fire when the boiler pressure reaches a dangerous value. If this function fails, excessive pressure will lead to an explosion. In this scenario, safety depends on the SRP/CS performing the correct function. Each SF is tasked with reducing the risk of one or more hazardous events, so each hazard and its corresponding SF must be considered in the design. The risk assessment results determine the PLr value for the safety function.

2.1.3. Required Performance Level (PLr)

PLr is the risk reduction expectation required for the implementation of an SF and can be determined by the risk graph shown in Figure 2. A risk graph is a grading-based risk estimation method with parameters S, F and P corresponding to the severity of harm, the duration or frequency of operator exposure to the hazard area, and the possibility of avoiding the hazard, respectively [3]. The result represents the level of risk without protection from the safety system. It is also used to determine the performance level of the safety function that is needed to reduce the risk to a permissible level. Figure 2 shows the structure of the risk graph.
The severity (S) of harm is divided into S1 (slight) and S2 (serious). Only slight reversible harm, serious irreversible harm, and fatalities are considered when estimating the levels of harm [3]. The normal recovery process is generally used as the basis for evaluating the severity of harm to people: slight harm is usually recoverable, while serious harm is not. For example, fatigue and slipping are categorized as S1, while amputation and death are categorized as S2.
The frequency (F) or time of exposure to the hazard is classified as F1 (seldom or short time) and F2 (frequent or long time). This parameter is a measure of the time spent in the danger zone. As per [2], when the operator is present in the hazard area more often than once every 15 min, F2 should be selected. For automated processing machines where the operator needs to intervene only once a month, F1 is the obvious choice.
The possibility (P) of avoiding hazards is divided into categories P1 and P2, which are determined based on whether the hazard can be recognized or avoided. If it is possible to avoid an accident under certain circumstances, P1 is chosen; if avoidance is scarcely possible, P2 is chosen. Factors that affect parameter P include the speed with which the hazardous situation leads to harm, any awareness of risk, and the human ability to escape. For example, if the speed of machine operation is limited, potential accidents develop more slowly and the operator has the opportunity to react and leave the zone. As can be seen from the risk graph (Figure 2), combining these parameters increases the risk from low to high (i.e., PLr-a to PLr-e, where PLr-e is the highest level required for an SF and the most expensive to implement).
Please note that, in this study, the LLM is presented with natural language descriptions of machinery hazard scenarios that implicitly or explicitly indicate the three risk parameters defined in ISO 13849, namely severity (S), frequency (F), and possibility of avoidance (P). The model is evaluated for its ability to interpret these factors and accurately classify the corresponding Required Performance Level (PLr), thereby demonstrating its potential to support functional safety risk assessment in accordance with ISO 12100 and ISO 13849-1. This evaluation is essential to determine whether LLMs can reliably replicate expert-level judgment in PLr classification, a prerequisite for integrating AI into scalable, standard-compliant functional safety workflows where manual assessments are often inconsistent, resource-intensive, and difficult to reproduce.
Building on this foundation, the following section reviews related work on AI-assisted safety assessments, rule-based prompt engineering, and retrieval-augmented reasoning in the context of structured decision-making and industrial risk classification.

2.2. AI-Based Risk Assessment in Functional Safety

Functional safety risk assessment in machinery domains is typically governed by ISO 12100 and ISO 13849, which define a logic-driven procedure for estimating PLr based on hazard severity, exposure frequency, and the possibility of avoidance. Traditionally, such assessments rely heavily on expert judgment, limiting scalability and reproducibility. The emergence of new regulations and the growing complexity of machinery systems have amplified the demand for scalable, consistent risk assessment approaches, driving interest in automation and AI-driven methods to support and augment traditional expert analyses.
Early studies explored custom-built AI-based solutions for machinery functional safety risk assessment. The work in [5] introduced a specialized chatbot leveraging rule-based logic and a TextCNN-LSTM architecture, achieving approximately 80% accuracy on internal datasets but showing limited robustness to linguistic variability. Subsequently, [6] presented a chatbot for recommending risk reduction measures aligned with ISO 12100. Although these domain-specific prototypes demonstrated potential, their scalability was constrained by the lack of structured training data and the significant effort required for data curation and model training. Moreover, these solutions predate the advent of LLMs with reasoning capabilities and thus serve as foundational steps toward applying LLMs to structured risk assessment tasks like PLr estimation.
To address the absence of structured datasets aligned with risk assessment standards, the work in [7] introduced an open-access dataset of 7800 annotated machinery hazard scenarios derived from ISO 12100 and paired with PLr values determined according to ISO 13849-1. This resource enables reproducible PLr prediction experiments with state-of-the-art LLMs. Follow-up studies, such as [8], applied zero-shot, rule-based, and retrieval-augmented prompting strategies to general-purpose LLMs [9] on this dataset. The results showed that rule-based prompting with retrieval augmentation outperformed both zero-shot and standard rule-based methods, yet also exposed variability across prompt designs. Further, the work in [10] examined chain-of-thought prompting for OT cybersecurity risk assessment, demonstrating feasibility but limited depth. By contrast, this study provides a systematic and in-depth evaluation of reasoning strategies in the parallel domain of deterministic PLr classification.
Emerging studies have explored the application of LLMs in safety-critical hazard analysis, aiming to automate tasks traditionally reliant on expert judgment. Nouri et al. [11] applied a GPT-4-based pipeline for hazard analysis and risk assessment (HARA) in automotive systems. Their method decomposed the analysis into subtasks (hazard identification, scenario generation, and severity classification), each using tailored prompts. While effective in generating draft assessments, the study emphasized the ongoing necessity of expert validation. Similarly, the study in [12] evaluated ChatGPT’s utility for System-Theoretic Process Analysis (STPA) in automotive braking systems, finding that naive prompting produced poor results, but domain-specific prompts combined with human oversight enabled LLMs to identify hazards with competence comparable to human experts. In the domain of consumer product safety, the work in [13] observed that while ChatGPT could enumerate a broad set of failure scenarios, it frequently provided weak or unsupported risk judgments. These studies highlight both the potential and the limitations of LLMs in structured hazard analysis. They underscore that while LLMs can assist in preliminary risk assessments or scenario generation, their effectiveness in deterministic, rule-bound tasks, such as those required in functional safety, depends on structured prompting, domain adaptation, and expert supervision.
Effective interaction with LLMs in risk analysis tasks often depends on structured, rule-based prompting. Without domain-specific guidance, LLMs may produce inconsistent or misleading outputs. The study in [11] employed format-constrained subtasks and predefined templates to improve output consistency. The work in [14] introduced a co-hazard analysis (CoHA) framework that combined iterative Q&A with ChatGPT and domain rules, enhancing the coverage and creativity of hazard identification. Similarly, the work in [15] further demonstrated that structured prompts significantly influenced LLM accuracy and interpretability on safety certification questions. However, they also noted that no single prompt structure performs optimally across all task types. These studies underscore that while LLMs offer potential for draft hazard identification and scenario generation, they fall short in deterministic, rule-constrained risk assessments. Addressing these limitations requires structured prompting, retrieval augmentation, and expert oversight to mitigate reasoning illusions and ensure alignment with formal safety standards. The work in [16] evaluates the integration of LLMs such as ChatGPT into a human-in-the-loop (HITL) framework for machinery functional safety risk analysis, adhering to ISO 12100. It demonstrates that expert oversight within the HITL framework effectively mitigates LLM limitations such as hallucinations, leading to complete agreement with ground truth across diverse industrial case studies. The study highlights significant gains in efficiency, accuracy, and usability, underscoring the transformative potential of generative AI in safety workflows when rigorous human validation is maintained. However, [16] does not provide a systematic experimental evaluation of LLMs on a comprehensive dataset.

2.3. LLM Capabilities and Limitations in Reasoning Tasks

Early studies showed that prompting LLMs with step-by-step CoT explanations significantly improves their performance on reasoning tasks, including arithmetic and logic problems [17]. Kojima et al. [18] further revealed that even simple zero-shot CoT cues like “Let’s think step by step” could unlock latent reasoning abilities in LLMs across benchmarks such as GSM8K and Big-Bench Hard. These results fueled optimism that, when guided correctly, LLMs might approximate general reasoning skills, with recent models like GPT-4 demonstrating notable problem-solving capabilities across varied domains.
Several recent studies critically examine the assumptions underlying CoT-driven reasoning claims. Saparov et al. [19] introduced the PrOntoQA benchmark and found that LLMs, while capable of generating valid local inference steps, fail at global proof strategies, revealing a tendency toward greedy, heuristic reasoning rather than systematic exploration. Schaeffer et al. [20] provided further evidence, showing that even logically invalid or irrelevant CoT traces can boost model performance on complex tasks, implying that token pattern familiarity rather than logical correctness drives success. These findings suggest that CoT benefits often arise from superficial token dynamics rather than authentic deductive reasoning, raising concerns about overestimating LLM reasoning competence in critical applications.
Multiple recent studies challenge the notion that LLM-generated CoT outputs reflect genuine reasoning. Kambhampati et al. [21] argue that labeling intermediate text as “thoughts” dangerously anthropomorphizes LLMs, masking the fact that CoT may merely improve output through surface token patterns rather than reasoning. Stechly et al. [22] demonstrate that models often produce correct answers despite incoherent or irrelevant intermediate steps, suggesting that CoT traces may result from training artifacts rather than causal reasoning. Furthermore, Chen et al. [23] show that LLMs frequently omit critical reasoning cues from their explanations and that even reward tuning fails to ensure faithful reasoning traces. Collectively, these works highlight that CoT outputs may not transparently reveal model decision-making and thus cannot be trusted for reliable auditing. This undermines the premise of using verbalized reasoning as a safeguard in high-stakes applications, raising critical concerns about the auditability and interpretability of LLM decisions.
Recent studies, such as [24], reveal that LLMs with CoT prompting exhibit a “reasoning cliff”, performing well on simple tasks but suffering abrupt accuracy collapse as task complexity increases. Surprisingly, non-CoT models sometimes outperform verbose CoT-augmented models on low-complexity tasks, while both fail on high-complexity problems due to brittle algorithmic reasoning and lack of compositional generalization. These findings challenge the assumption that longer reasoning traces correlate with genuine reasoning ability, underscoring critical limits of current LLM reasoning capabilities.
In summary, recent research offers a mixed view of LLM reasoning: while CoT and self-consistency improve benchmark scores, many studies suggest this reflects token patterning rather than genuine inference, leaving open the debate between emergent capability and engineered illusion.

2.4. Prompt Engineering, Retrieval-Augmented Generation, and Model Interpretability

One critical mitigation strategy for hallucinations and factual inaccuracies is retrieval-augmented generation (RAG), which grounds the model’s output in external knowledge sources. The work in [25] showed that RAG can significantly improve accuracy on knowledge-intensive QA by injecting relevant documents into the generation process. While RAG alone does not eliminate hallucinations entirely, as models may still misquote or misapply retrieved facts, it significantly enhances factual grounding. Recent applications in safety-critical domains demonstrate its value: the work in [26] integrated Qwen-2.5 with a curated fire safety regulation database to improve factuality in fire engineering queries. The study in [27] applied advanced RAG pipelines to toxicology assessments, achieving higher scientific fidelity through query rewriting and evidence grounding. In the automotive domain, the work in [28] developed the LASAR system, which combines scenario generation with catalog-based retrieval to guide LLMs in hazard analysis and risk assessment (HARA).
In safety-critical applications, interpretability is as vital as accuracy. Well-designed prompts can elicit stepwise reasoning, uncertainty estimates, and justifications from LLMs, aiding human validation and traceability. Studies conducted in [11,28] found that instructing LLMs to explain their severity ratings improved reviewer confidence and auditability. Further, the work in [15] showed that prompt-induced variations could swing model performance by over 13%. Yet, the benefits of added prompt complexity often depend on the task structure, indicating a need for both prompt-task matching and systematic evaluation.
Overall, prompt engineering enhances reliability and interpretability, but cannot ensure factual grounding or regulatory compliance. These limitations underscore the need for rule-based prompting and RAG to enforce alignment with formal safety standards.

2.5. The Role of Domain-Specific Benchmarking in Evaluating LLMs

Beyond task-specific studies, there is a broader recognition that evaluation methodologies for LLMs in high-stakes, rule-bound domains are underdeveloped. Traditional NLP benchmarks rarely capture the requirements of safety-critical decision-making. Recently, researchers have begun devising domain-specific benchmarks to fill this void. In the legal domain, the work in [29] introduced LegalBench, a suite of 162 tasks covering diverse types of legal reasoning to systematically measure LLM performance on jurisprudence problems. Their evaluations of dozens of models revealed substantial gaps between general LLM capabilities and the consistency needed for legal reasoning, reinforcing the need for tailored assessment in regulated domains. In medicine, the work presented in [30] proposed MedCalc-Bench with over 1000 cases requiring medical calculations (e.g., risk scores, dosage) to test LLMs as clinical calculators. Results showed current models often err on precise numeric reasoning in a clinical context, despite performing well on open medical Q&A, highlighting the importance of structured benchmarks for reliability. The safety community is also moving in this direction. The work in [31] presents HSE-Bench, focusing on Health, Safety and Environment compliance questions with multi-step legal reasoning; they find that today’s LLMs rely more on semantic pattern matching than true rule application in compliance assessments. Notably, even the best models’ “reasoning traces lack the systematic legal reasoning required for rigorous HSE compliance,” and performance degrades on complex multi-step scenarios. These domain-specific evaluations underscore a common theme: without structured benchmarks and protocols, it is difficult to gauge an LLM’s trustworthiness for safety-critical tasks. The work in [32] further demonstrates this in a clinical setting by benchmarking an open-source reasoning LLM against GPT-4 on 125 real patient cases. While the open model reached parity on final answers, the study stressed meeting strict regulatory criteria (e.g., explainability, auditable steps) as a key evaluation component. Together, these works illustrate the nascent but growing effort to establish rigorous evaluation frameworks for LLMs operating under domain constraints.
In summary, advances in AI-assisted risk assessment show promise, but persistent limitations remain for deterministic tasks such as PLr estimation. Structured prompting, RAG and HITL validation can improve performance in preliminary hazard analysis [11,16], yet critical studies on reasoning fidelity [21,22,33] indicate that CoT traces often reflect heuristic token continuations rather than genuine deduction. Domain-specific benchmarking efforts [31,32] further highlight the inadequacy of general LLMs for tasks requiring strict rule adherence.
Together, these insights frame the central question this study addresses: Can “so-called” reasoning-capable LLMs reliably perform deterministic risk classification, such as PLr estimation? To address this question, the study empirically evaluates six LLMs across six prompting strategies for PLr estimation, using a structured framework aligned with ISO 12100 and the ISO 13849-1 Annex A qualitative risk graph (cf. Figure 2).

3. Experimental Design

This section outlines the experimental setup devised to systematically evaluate reasoning-capable LLMs on deterministic PLr classification tasks. It describes the benchmark dataset derived from ISO 12100 and ISO 13849 standards, details the prompting strategies applied across six experimental conditions, and explains the selection of six LLMs. The section concludes with the evaluation framework used to analyze classification accuracy, reasoning behavior, computational performance, and error patterns.
The evaluation leverages LangGraph [34] as its core orchestration engine, enabling dynamic prompt routing and evaluation logic. Modular prompting strategies (rule-based, RAG, and hybrid) are implemented using LangChain chains [35] for seamless reconfiguration. High-throughput semantic retrieval of domain-specific hazard precedents for RAG-based scenarios is facilitated by a ChromaDB [36] vector database, primarily sourcing data from the open-source dataset [7,37]. These retrieved contexts are injected via prompt placeholders to assess analogical reasoning performance. This modular setup mirrors realistic deployment conditions while allowing systematic control over prompting strategies, model selection, and input variants.
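To make this orchestration concrete, the following minimal sketch shows a two-node LangGraph pipeline (retrieval followed by classification); the state fields and node bodies are illustrative placeholders, not the production graph.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

# Illustrative two-node orchestration sketch: a retrieval node populates the
# exemplar context, a classification node assigns the PLr via an LLM call.
class PLrState(TypedDict):
    description: str   # natural language hazard scenario
    rag_context: str   # retrieved exemplars (see Section 3.5)
    plr: str           # predicted Required Performance Level

def retrieve(state: PLrState) -> dict:
    # Placeholder for the ChromaDB-backed retrieval of Section 3.5.1.
    return {"rag_context": "[EX1] ..."}

def classify(state: PLrState) -> dict:
    # Placeholder for the LLM call under the configured prompting strategy.
    return {"plr": "b"}

graph = StateGraph(PLrState)
graph.add_node("retrieve", retrieve)
graph.add_node("classify", classify)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "classify")
graph.add_edge("classify", END)
app = graph.compile()
result = app.invoke({"description": "Operator reaches near a moving transmission.",
                     "rag_context": "", "plr": ""})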

3.1. Dataset Description

The experiments in this study utilize an open-source Industrial Machinery Functional Safety Hazard Scenario Dataset [37], designed to serve as a transparent, reproducible, and scientifically rigorous benchmark for empirical evaluation of automated risk assessment methods [7]. The dataset is used throughout this study as the standardized evaluation benchmark for comparing baseline, rule-based, CoT, and RAG methods for PLr determination across diverse industrial safety scenarios.

3.1.1. ISO 12100: Annex B and ISO 13849-1: Annex A Correlation

The dataset construction process is systematically aligned with ISO 12100 Annex B, which enumerates ten general hazard categories relevant to machinery functional safety. For each category, the dataset defines specific hazard origins and enumerates plausible potential consequences, refining the generic framework of ISO 12100 into a structured representation suitable for computational risk assessment. This mapping ensures that every scenario is rooted in established industrial safety standards.
Not all combinations of hazard origin and consequence are physically meaningful or relevant to real-world contexts. Therefore, every origin–consequence pair undergoes a rigorous plausibility assessment, incorporating physical laws, causal logic, and expert judgment from certified functional safety professionals. Only combinations that are physically possible and contextually credible are retained. For example, “entanglement” from “rotating elements” is included, while physically impossible pairings (such as “loss of balance” from “scraping surfaces”) are excluded.
For each plausible origin–consequence pair, scenarios are instantiated by systematically varying contextual parameters, including the following:
  • User type (e.g., operator, maintenance personnel).
  • Task (e.g., normal operation, cleaning, maintenance).
  • Operating environment (e.g., industrial).
  • ISO 13849-1 risk graph parameters: severity (S), frequency and/or duration of exposure (F), and possibility of avoidance (P).
Severity is classified as either “slight (normally reversible injury)” or “serious (normally irreversible injury or death)” (corresponding to S1/S2 in ISO 13849-1), frequency as “seldom-to-less-often/exposure time is short” or “frequent-to-continuous/exposure time is long” (F1/F2), and possibility as “possible under specific conditions” or “scarcely possible” (P1/P2). The PLr is automatically derived for each scenario by applying the ISO 13849-1 Annex A risk graph (cf. Figure 2) to the scenario’s S, F, and P values.
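Because the Annex A risk graph is a fixed mapping, this derivation can be expressed as a small lookup table; the following Python sketch encodes the standard S/F/P-to-PLr assignments and reproduces the example of Listing 1.

# ISO 13849-1 Annex A risk graph as a lookup table: (S, F, P) -> PLr.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a",
    ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b",
    ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c",
    ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d",
    ("S2", "F2", "P2"): "e",
}

def plr_from_sfp(s: str, f: str, p: str) -> str:
    """Return the required performance level for an (S, F, P) triplet."""
    return RISK_GRAPH[(s, f, p)]

# Example of Listing 1: slight severity, seldom exposure, scarcely avoidable.
assert plr_from_sfp("S1", "F1", "P2") == "b"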
This dataset, comprising 7800 machinery hazard scenarios, is automatically generated in a standardized JSON format. Each entry meticulously details hazard attributes, user context, environmental factors, and calculated PLr, along with a natural language description. The comprehensive distribution across ten hazard categories, including 2840 mechanical and 1640 electrical hazards, ensures a robust and reproducible foundation for benchmarking AI in safety-critical classification tasks.

3.1.2. Dataset Entry: Template and Example

Listing 1 illustrates a typical scenario entry in the dataset. This example describes an electrical hazard where maintenance personnel performing setup or programming tasks in an industrial environment may encounter an arc, which could lead to a burn. The scenario specifies key contextual parameters: user type (maintenance personnel), task (setup or programming), environment (industrial), and the risk graph inputs, severity (slight), frequency (seldom-to-less-often), and possibility of avoidance (scarcely possible). Based on these factors, the PLr is assigned as ‘b’ according to the ISO 13849-1: Annex A qualitative performance graph method.
Listing 1. Example of an electrical hazard description in the dataset [7,37].
In all experiments, the scenario’s PLr value, calculated using the ISO 13849-1 risk matrix, serves as the ground truth for evaluation. Each language model receives the scenario’s natural language description as input and is tasked with predicting the PLr. Model outputs are then compared to this reference value to compute predictive accuracy. For example, in Listing 1, only models predicting PLr = ‘b’ are counted as correct. This evaluation protocol enables systematic, transparent, and standard-compliant benchmarking of automated risk assessment methods across diverse safety-critical scenarios.
Further methodological details and rationale for the dataset construction are provided in [7]. The dataset is openly accessible to the research community and practitioners, serving as a standardized benchmark that accelerates scientific progress and enables rigorous comparison of AI-based risk assessment methods in machinery safety.

3.2. Evaluation Datasets

To probe both standard-aligned performance and real-world generalization, two dataset variants are evaluated that share the same schema and gold labels but differ in lexical phrasing.
  • Variant 1—Canonical ISO-style scenarios (N = 100): This variant anchors the evaluation in in-distribution phrasing closely mirroring ISO 12100 Annex B. Each case follows a fixed schema (hazard type, origin, potential consequence, task, environment) and reports explicit risk parameters, namely, severity (S), exposure frequency (F), and possibility of avoidance (P), together with a reference PLr. Because the textual descriptions are automatically generated from canonical fields while remaining faithful to the ISO taxonomy, Variant 1 can be characterized as a synthetic but controlled dataset that provides a standardized baseline under explicit terminology.
  • Variant 2—Functional safety engineer-authored scenarios with lexical shift (N = 100): This variant stress-tests generalization to field language. Scenarios were written by a functional safety engineer from industrial practice using the same schema and gold labels as Variant 1, but the free-text descriptions deliberately avoid literal ISO tokens for S/F/P. Instead, the factors are conveyed implicitly in operational prose, for example, "repetitive short stops near a moving transmission", "limited clearance and delayed stop reachability", and "hands inside a nip area during setup". This emulates how hazards are typically described in industrial workflows during safety assessments.
In short, Variant 1 serves as a structured baseline with standardized phrasing, while Variant 2 introduces deliberate lexical shift while preserving labels and structure. This design enables us to test both in-distribution performance (standard-aligned) and out-of-distribution robustness to field language, particularly important for analyzing the sensitivity of CoT and RAG prompting strategies.

3.3. Prompting Strategies for Risk Classification

The experimental evaluation employs six distinct prompting strategies for PLr determination, each reflecting progressively higher levels of structured input and domain knowledge integration. In all variants, the standard chat-based prompt format with system and user (i.e., human) roles, as defined in [38], is used. This format distinguishes between the "system" message, which establishes the model’s assumed role (such as a functional safety expert) and supplies any required guidance, domain rules, or reasoning instructions, and the "user" message, which presents the actual task input, typically the natural language hazard scenario for analysis. This separation of roles supports structured, reproducible prompt design and enables controlled assessment of how different prompt components influence LLM reasoning in deterministic risk classification tasks.
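As an illustration, such a system/user prompt pair can be assembled with LangChain roughly as follows; the wording here is a simplified stand-in for the actual templates used in the experiments.

from langchain_core.prompts import ChatPromptTemplate

# Simplified sketch of the system/user prompt split; the production prompt
# texts differ and follow the structure of Experiments I-VI below.
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a functional safety expert. Determine the Required Performance "
     "Level (PLr) of the given machinery hazard scenario per ISO 13849-1."),
    ("human",
     "Hazard scenario:\n{description}\n\nAnswer with one of: a, b, c, d, e."),
])

messages = prompt.format_messages(
    description="Maintenance personnel performing setup may encounter an "
                "electric arc, which could lead to a burn.")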
The six prompting strategies are detailed below. Note that the first four strategies do not involve retrieval, whereas the last two combine prompting with the RAG pipeline described in detail in Section 3.5.1.
  • Experiment I: Zero-shot Prompt
    In this baseline experiment, the reasoning model is given only the raw hazard scenario in natural language, without any additional guidance or rules beyond its role as a functional safety expert. This setup assesses the reasoning model’s inherent ability to determine the PLr based on its pre-trained knowledge and reasoning capabilities, focusing on systematic analysis of severity, frequency, and avoidance parameters.
  • Experiment II: Explicit ISO Rule Integration
Building on the zero-shot prompt, this experiment supplements the scenario with explicit PLr determination rules as specified in ISO 13849-1 Annex A (cf. Figure 2). This tests whether access to codified safety knowledge enables more accurate and consistent classification.
  • Experiment III: Chain-of-Thought (CoT)
    This experiment uses the chain-of-thought (CoT) approach by providing highly structured, explicit step-by-step instructions for the model’s reasoning process but without explicitly stating the ISO 13849-1: Annex A performance graph rules. This aims to guide the model through a precise and verifiable pathway for determining the PLr.
  • Experiment IV: Chain-of-Thought (CoT) with Rules
This experiment combines the structured step-by-step instructions of the CoT approach with explicit textual inclusion of the ISO 13849-1 Annex A performance graph rules (cf. Figure 2). The goal is to provide the model with the necessary information directly within the prompt, minimizing reliance on its pre-trained knowledge for the specific rules of PLr determination. This setup evaluates the model’s ability to precisely apply provided rules in conjunction with its reasoning process.
  • Experiment V: CoT with Rules and Retrieval-Augmented Examples
In addition to the scenario, rules, and explicit CoT instructions, representative historical hazard examples are retrieved from a curated database and included in the prompt. This evaluates the model’s ability to generalize from a precedent and improve classification accuracy through context enrichment. This experiment is referred to as COT_WITH_RULES_RAG in the paper.
  • Experiment VI: Rules with Retrieval-Augmented Examples
In addition to the scenario and rules, representative historical hazard examples are retrieved from a curated database and included in the prompt. This evaluates the model’s ability to generalize from precedent and improve classification accuracy through context enrichment. This experiment is referred to as WITH_RULES_RAG in this paper. It is included specifically to isolate the effect of CoT when combined with rules and retrieval: the only difference between Experiments V and VI is that Experiment VI omits the CoT instructions.
In the experimental evaluation, the six settings are applied consistently across six state-of-the-art models and two dataset variants, enabling direct comparison of the incremental benefit of each input enhancement. For clarity, these prompting strategies are referenced using the following shorthand macros (a configuration sketch follows the list):
  • ZERO_SHOT: Baseline condition where the model receives only the raw hazard scenario, without explicit rules or structured guidance.
  • WITH_RULES: Prompt includes explicit ISO 13849-1 Annex A rules, constraining the model to rule-based PLr determination.
  • PURE_CoT: Uses chain-of-thought (CoT) instructions to elicit step-by-step reasoning, but without explicit rule injection.
  • COT_WITH_RULES: Combines CoT instructions with explicit ISO rules, guiding the model through structured reasoning and rule application.
  • WITH_RULES_RAG: Augments rule-based prompting with retrieved hazard exemplars from a curated database, enabling case-based reasoning.
  • COT_WITH_RULES_RAG: Integrates CoT, explicit ISO rules, and retrieved exemplars, representing the most structured and enriched prompting setup.
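Conceptually, the six conditions differ only in which prompt components are injected. The hypothetical configuration table below (flag names are illustrative, not taken from the implementation) makes this factorial structure explicit:

# Illustrative mapping of each shorthand macro to its injected components.
STRATEGIES = {
    "ZERO_SHOT":          {"rules": False, "cot": False, "rag": False},
    "WITH_RULES":         {"rules": True,  "cot": False, "rag": False},
    "PURE_CoT":           {"rules": False, "cot": True,  "rag": False},
    "COT_WITH_RULES":     {"rules": True,  "cot": True,  "rag": False},
    "WITH_RULES_RAG":     {"rules": True,  "cot": False, "rag": True},
    "COT_WITH_RULES_RAG": {"rules": True,  "cot": True,  "rag": True},
}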

3.4. Prompt Placeholder Description

To ensure methodological rigor and enable systematic analysis, each experimental prompt utilizes well-defined placeholders corresponding to key information components. These placeholders not only enforce consistency in model evaluation, but also reflect distinct strategies for eliciting and constraining LLM behavior in a safety-critical classification task.
  • {description}: The central input for each prompt is a natural language description of a real-world hazard scenario, simulating industrial safety assessment tasks and enabling evaluation of the model’s capacity for context comprehension and risk mapping. An example description, including user type, task, hazard origin, and consequence, is shown in Listing 1.
  • {iso_rules_info}: This placeholder injects the structured decision logic codified in ISO 13849-1: Annex A, including parameter definitions and full risk graph mapping rules. It converts the task from open-ended inference to rule-constrained reasoning, enabling assessment of whether direct access to normative safety rules improves PLr classification fidelity and consistency. The rules ensure deterministic mapping of scenario descriptions into standardized parameters—severity (S1/S2), exposure frequency (F1/F2), and possibility of avoidance (P1/P2)—which together yield the PLr. An excerpt of this content is shown in Listing 2.
Listing 2. Deterministic rules codified under {iso_rules_info} for PLr inference based on ISO 13849-1 Annex A qualitative performance graph rules (cf. Figure 2).
  • {COT_step_by_step_instruction}: This placeholder, as detailed in Listing 3, formalizes the CoT prompting technique. It provides the LLM with explicit, step-by-step instructions for analyzing a hazard scenario, specifically guiding its reasoning on severity, frequency, and avoidance. By forcing this sequential decomposition and requiring intermediate justifications, the CoT prompt aims to enhance the model’s transparency and align its decision-making with structured human expert methodologies for PLr determination. This approach is crucial for improving explainability and validating AI reasoning in critical safety applications.
    Listing 3. Structured chain-of-thought instructions under {COT_step_by_step_instruction} prompt placeholder for PLr inference.
  • {rag_examples}:
This placeholder is populated by a retrieval module that supplies hazard scenarios (with ground-truth S/F/P and PLr) drawn from a curated database. Further details about the RAG implementation pipeline are provided in Section 3.5.1.

3.5. RAG Implementation

In the PLr experiment framework, RAG is employed to systematically test whether augmenting prompts with precedent hazard examples improves deterministic classification of PLr. The underlying rationale is that models may benefit from a small set of prior cases that are not only textually relevant but also structurally consistent with the ISO 13849-1 risk parameters S, F, and P. In addition, RAG is explicitly evaluated under lexical-shift conditions, using queries drawn from a companion corpus that is semantically consistent with ISO 13849-1 but avoids the literal S/F/P tokens (e.g., paraphrases such as “infrequent contact” for F1 or “avoidance is difficult” for P1), to test whether exemplar augmentation improves deterministic PLr classification when surface forms diverge from both the rules and the database phrasing. To this end, the framework implements RAG in two chain types:
  • WITH_RULES_RAG: Injects retrieved exemplars alongside the base hazard description and ISO rules.
  • COT_WITH_RULES_RAG: Integrates exemplars into a CoT prompt alongside base hazard descriptions and ISO rules.

3.5.1. Pipeline Implementation

A hybrid RAG pipeline is implemented: dense database retrieval is followed by symbolic filtering and deterministic packaging for prompting [25,39]. The orchestration is handled by the _search_similar_hazards function and proceeds as follows: at initialization, a hazard database object is created (from an existing index) and an optional validator chain is attached.
The RAG controls are (k, τ, M) = (rag_k, rag_sim_threshold, rag_top_m), a strict S/F/P gate require_sfp_exact (Boolean), and an optional drop_missing_labels flag. Defaults in the experimental runs are k = 20, τ = 0.30, M = 3.
Stage I: Semantic Database Search
Given a query scenario string q, hazard_db.search(q) is invoked to obtain a superset of candidates C0 (intentionally larger than k to avoid upstream pruning). This corresponds to the “semantic retriever” stage in RAG, where a vector- or database-backed similarity search returns top candidates for subsequent filtering.
Stage II: Hybrid Prefiltering and Ranking
Candidates are passed to _prefilter_and_rank with the following steps (a minimal code sketch follows the list):
  • Lexical overlap filter: Compute the Jaccard index on token sets, J(A, B) = |A ∩ B| / |A ∪ B|, for the query and each candidate; retain only those with J ≥ τ (default τ = 0.30).
  • S/F/P constraint (optional exact gate; disabled by default): Map hazards to (S, F, P) with S ∈ {S1, S2}, F ∈ {F1, F2}, P ∈ {P1, P2}. If require_sfp_exact = true, discard any candidate whose triplet does not exactly match the query’s; otherwise keep but log mismatches for diagnostics. In all main experiments, this gate was disabled (require_sfp_exact = false); enabling it is left for ablations.
  • Deduplication and top-M selection: Deduplicate by hazard identifier, rank primarily by J (semantic score may be used as a tie-breaker inside the helper), and truncate to the configured top M.
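A minimal sketch of this stage is given below; the function name mirrors _prefilter_and_rank from the text, while the candidate schema (text, id, sfp fields) and whitespace tokenization are simplifying assumptions.

# Sketch of the hybrid prefilter-and-rank stage (schema and tokenization are
# assumptions; the production helper may differ in detail).
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def prefilter_and_rank(query: str, candidates: list, tau: float = 0.30,
                       top_m: int = 3, require_sfp_exact: bool = False,
                       query_sfp: tuple = None) -> list:
    q_tokens = set(query.lower().split())
    kept, seen_ids = [], set()
    for cand in candidates:
        j = jaccard(q_tokens, set(cand["text"].lower().split()))
        if j < tau:
            continue                      # lexical overlap filter
        if require_sfp_exact and cand.get("sfp") != query_sfp:
            continue                      # optional strict S/F/P gate
        if cand["id"] in seen_ids:
            continue                      # deduplicate by hazard identifier
        seen_ids.add(cand["id"])
        kept.append({**cand, "jaccard": j})
    kept.sort(key=lambda c: c["jaccard"], reverse=True)  # rank primarily by J
    return kept[:top_m]                   # truncate to top-M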
Stage III: Evidence Packaging for Prompting
The retained set is compacted and serialized into concise snippets (hazard type, task, short description) using _ctx_snip to form a structured context block of the form
[EX1] HazardType | Task | description … [EXM] …
which is injected into the prompt as rag_context.
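A compact sketch of this packaging step follows; the _ctx_snip name comes from the text, but its body and the field names are assumptions.

# Assumed snippet/packaging helpers; field names are illustrative.
def ctx_snip(ex: dict) -> str:
    return f"{ex['hazard_type']} | {ex['task']} | {ex['text'][:160]}"

def build_rag_context(examples: list) -> str:
    return "\n".join(f"[EX{i + 1}] {ctx_snip(ex)}"
                     for i, ex in enumerate(examples))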
Under the survey taxonomy in the literature [25,39], the method employed can be called “advanced hybrid RAG”: a semantic database search followed by symbolic lexical screening and domain–structure (S/F/P) constraints, plus deterministic evidence selection/packaging. In the experiments, the default values are require_sfp_exact = false and drop_missing_labels = false. Vector similarity is used for over-retrieval, while the lexical Jaccard threshold τ acts as the gate.

3.5.2. Retrieval Index (ChromaDB)

A persistent ChromaDB collection of curated hazard scenarios is maintained solely for retrieval. In this study the index contains N ≈ 1020 records drawn from the larger 7800-scenario corpus. Each record stores a compact textual summary (hazard type, origin, consequence, user, task, environment, description) and metadata (ID, S/F/P, PLr). Documents are embedded with a standard sentence-embedding model and retrieved by vector similarity. At query time: (i) top-k candidates are over-retrieved (k = 20), (ii) a semantic gate s_sem ≥ 0.70 is applied, with s_sem = 1 − d computed from the index distance d, (iii) lexical Jaccard filtering J(A, B) ≥ 0.30 is performed, (iv) an optional S/F/P exact match is enforced (require_sfp_exact, default: false), and (v) deduplication retains the top M candidates by J (M = 3). For selected examples, the retrieval artifacts ID, S/F/P, PLr, s_sem, and J(A, B) are reported.
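Query-time stages (i)–(ii) can be sketched with the ChromaDB client as follows; the collection name, persistence path, and metadata field names are assumptions for illustration, and stages (iii)–(v) correspond to the prefilter sketch above.

import chromadb

# Illustrative retrieval sketch; collection/field names are assumptions.
client = chromadb.PersistentClient(path="./hazard_index")
collection = client.get_or_create_collection(name="hazard_scenarios")

def retrieve_candidates(query: str, k: int = 20, sem_min: float = 0.70) -> list:
    res = collection.query(query_texts=[query], n_results=k)  # over-retrieve (i)
    candidates = []
    for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0],
                               res["distances"][0]):
        s_sem = 1.0 - dist                     # semantic score from distance
        if s_sem >= sem_min:                   # semantic gate (ii)
            candidates.append({"text": doc, "id": meta["id"],
                               "sfp": (meta["S"], meta["F"], meta["P"]),
                               "plr": meta["PLr"], "s_sem": s_sem})
    return candidates  # stages (iii)-(v): Jaccard filter, dedup, top-M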
Thus, the RAG implementation combines semantic retrieval with structural safeguards to ensure both textual relevance and risk graph consistency. Candidate hazards are retrieved semantically, filtered with lexical Jaccard thresholds and optional S/F/P gating, then deduplicated and truncated to the top-M exemplars for prompt injection. Depending on the experimental condition, the retrieved rag_context is added either directly (WITH_RULES_RAG) or embedded in a structured reasoning sequence (COT_WITH_RULES_RAG). In both cases, the objective is to reinforce correct mapping from S/F/P parameters to PLr classification while systematically controlling the role of retrieved exemplars. Default parameters are listed in Table 1, and the overall hybrid pipeline is illustrated in Figure 3.

3.6. Model Selection and Configuration

Six production models from three major vendors plus a specialist reasoner are evaluated, spanning (i) dedicated reasoning stacks [OpenAI o-series: o3-mini, o4-mini; DeepSeek Reasoner], (ii) cost/latency-optimized “mini/flash” variants [Google Gemini 2.5 Flash, OpenAI GPT-5 mini], and (iii) a premium general model used as an upper-bound baseline [Anthropic Claude Opus 4.1].
  • Claude Opus 4.1 (Anthropic): Latest Claude 4.x release positioned for complex reasoning, coding, and agentic workflows [40].
  • DeepSeek Reasoner: Domain-agnostic reasoning model optimized for multi-step inference; API documentation lists unsupported decoding controls (see below) [41].
  • Gemini 2.5 Flash (Google): Cost/latency-optimized model with optional thinking budget and multimodal I/O [42,43,44].
  • GPT-5 mini (OpenAI): Compact GPT-5-family variant emphasizing speed and cost-efficiency for well-defined tasks [45,46].
  • o3-mini (OpenAI): Small o-series reasoning model targeting STEM/logic tasks at low cost/latency [47].
  • o4-mini (OpenAI): Newer small o-series model optimized for fast, effective reasoning (math, coding, vision) [48].
All models receive identical prompts per experiment (Section 3.3), with stop sequences, maximum output tokens, and structured fields aligned to avoid truncation or format bias.

3.6.1. Deterministic Decoding Policy

PLr assignment under ISO 13849-1 is a deterministic multi-class classification task. To eliminate sampling variance and ensure replicability, models are run under deterministic decoding wherever supported. Parameters are set to temperature = 0.0, top_p = 1.0, and top_k = 0 (greedy decoding). For reasoning stacks, temperature is documented as unsupported or ignored (OpenAI o3-mini, o4-mini; DeepSeek Reasoner [41,49]), with DeepSeek additionally listing unsupported controls (temperature, top_p, presence_penalty, frequency_penalty, logprobs/top_logprobs). Thus, decoding proceeded deterministically under provider defaults.
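A minimal sketch of this policy, assuming LangChain's ChatOpenAI wrapper for the OpenAI-hosted models; the model-name gate is illustrative, and analogous wrappers apply for the other vendors.

from langchain_openai import ChatOpenAI

# Sketch of the deterministic decoding policy: reasoning models reject or
# ignore sampling controls, so they run on provider defaults; other models
# receive greedy-decoding parameters.
REASONING_MODELS = {"o3-mini", "o4-mini"}

def make_model(name: str) -> ChatOpenAI:
    if name in REASONING_MODELS:
        return ChatOpenAI(model=name)  # temperature/top_p unsupported
    return ChatOpenAI(model=name, temperature=0.0, top_p=1.0)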

3.6.2. Cost Context

API pricing (as of 21 August 2025) spans an order of magnitude. Claude Opus 4.1 is the premium outlier at USD 15 per 1M input tokens and USD 75 per 1M output tokens [50]. OpenAI o-series minis are an order of magnitude cheaper at USD 1.10 in/USD 4.40 out [47,48]; GPT-5 mini is lower still at USD 0.25 in/USD 2.00 out [45]; Gemini 2.5 Flash is comparable (USD 0.30 in/USD 2.50 out, including “thinking” tokens) [51]; and DeepSeek Reasoner is aggressively priced at USD 0.55 in (cache miss; USD 0.14 cache hit) and USD 2.19 out [52]. Batch and caching features can further reduce effective cost [50,53].
These six models together span the accuracy–latency–cost frontier: a premium general model (Claude Opus 4.1), vendor reasoning stacks (OpenAI o-series; DeepSeek Reasoner), and high-throughput budget models (Gemini 2.5 Flash; GPT-5 mini). This setup enables cost-normalized, deterministic PLr classification performance to be compared under identical prompting and RAG conditions.
The aim is to derive actionable recommendations for real-life functional safety workflows by jointly analyzing (i) accuracy with 95% CIs, (ii) latency/throughput under deterministic decoding, and (iii) per-decision cost (input + output tokens), while documenting provider constraints on decoding controls for reasoning models.
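From the listed prices, per-decision cost reduces to simple arithmetic; the sketch below uses USD per 1M tokens at cache-miss rates, with illustrative model keys.

# Per-call cost estimate (USD per 1M tokens, input/output, cache-miss rates
# as of 21 August 2025; keys are illustrative).
PRICES = {
    "claude-opus-4.1":   (15.00, 75.00),
    "o3-mini":           (1.10, 4.40),
    "o4-mini":           (1.10, 4.40),
    "gpt-5-mini":        (0.25, 2.00),
    "gemini-2.5-flash":  (0.30, 2.50),
    "deepseek-reasoner": (0.55, 2.19),
}

def call_cost_usd(model: str, in_tokens: int, out_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return (in_tokens * p_in + out_tokens * p_out) / 1e6

# Example: a 1500-token prompt with a 300-token answer on GPT-5 mini
# costs call_cost_usd("gpt-5-mini", 1500, 300) = 0.000975 USD.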

4. Results and Analysis

  • Scope. Two datasets are evaluated:
    Variant 1: Canonical ISO-style scenarios (in-distribution reliability).
    Variant 2: Engineer-authored free-text scenarios (out-of-distribution robustness).
Six prompting strategies are tested across six model families (see Section 3.3 and Section 3.6), yielding 36 conditions per variant.
  • Reported metrics. For each (model, prompt) condition, the following are reported:
    Accuracy; Macro-/Micro-/Weighted-F1.
    Per-class metrics (including class E recall).
    Processing time.
  • Experimental protocol.
    Deterministic decoding: temperature = 0.0, top_p = 1.0, top_k = 0 (when available).
    Repeats: r = 5 independent runs per condition with identical inputs; for RAG, a fixed retrieval configuration and index.
    Aggregation: metrics computed per run and then averaged across runs.
  • Uncertainty quantification. Error bars are 95% t-intervals across runs ( r = 5 ) for accuracy and timing, and 95% class-stratified bootstrap Confidence Intervals (CIs) for Micro-/Macro-/Weighted-F1 and all per-class metrics.
  • Weighted-F1 (rationale).
    Weighted-F1 averages per-class F1 using class prevalence as weights; Macro-F1 gives equal weight to all classes; Micro-F1 (for single-label tasks) equals accuracy and can mask per-class precision/recall trade-offs.
    To avoid hiding minority-class failures, Weighted-F1 is always paired with Macro-F1, per-class metrics, and class E recall.
Together, these evaluations offer a comprehensive, multidimensional assessment of reasoning LLMs under deterministic, rule-constrained risk classification.
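For reproducibility, the following Python sketch illustrates both interval procedures under the stated protocol, assuming run-level accuracy scores and per-item gold/predicted labels are available; the resample count B = 2000 and the toy data are illustrative choices, and the helpers rely on NumPy, SciPy, and scikit-learn.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import f1_score

def t_interval(run_scores, conf=0.95):
    """95% Student t-interval across repeated runs (here r = 5)."""
    x = np.asarray(run_scores, dtype=float)
    r = len(x)
    half = stats.t.ppf(0.5 + conf / 2, df=r - 1) * x.std(ddof=1) / np.sqrt(r)
    return x.mean() - half, x.mean() + half

def stratified_bootstrap_f1(y_true, y_pred, average="macro", B=2000, seed=0):
    """Class-stratified bootstrap CI: resample item indices within each class
    so rare PLr classes keep their prevalence in every resample."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    by_class = [np.flatnonzero(y_true == c) for c in np.unique(y_true)]
    scores = []
    for _ in range(B):
        idx = np.concatenate([rng.choice(g, size=len(g), replace=True)
                              for g in by_class])
        scores.append(f1_score(y_true[idx], y_pred[idx], average=average))
    return np.percentile(scores, [2.5, 97.5])

# Toy example: accuracies from r = 5 runs, then one run's per-item labels.
print(t_interval([0.94, 0.95, 0.93, 0.96, 0.94]))
print(stratified_bootstrap_f1(list("aabbccddee"), list("aabbcdddee")))
```

The same bootstrap routine applies to Weighted-F1 (average="weighted") and to per-class metrics by restricting the score to one label.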

4.1. Results on Variant 1 (Canonical ISO-Style Scenarios, Non-RAG Prompts)

Four prompting strategies are evaluated without retrieval augmentation: ZERO_SHOT, PURE_CoT, WITH_RULES, and COT_WITH_RULES. Figures 4–14 summarize the results across six models.

4.1.1. Accuracy and Processing Time

Figure 4 presents accuracy with 95% t-intervals across repeated runs, quantifying the stability of model performance under identical conditions and ensuring multi-run statistical reliability. WITH _ RULES consistently achieves near-ceiling accuracy (≥ 0.92 ) across all six models, confirming the effectiveness of explicit rule-constrained prompting. By contrast, PURE _ CoT displays large instability, with accuracies ranging from 0.40 (Claude-opus-4-1) to 0.99 (GPT-5-mini), reinforcing that unconstrained reasoning yields erratic outcomes. COT _ WITH _ RULES partially mitigates this instability but does not reach the ceiling performance of WITH _ RULES . ZERO _ SHOT systematically underperforms, collapsing for o3-mini (0.45), which underscores its inability to generalize in safety-critical classification without structured priors.
Figure 5 and Figure 6 complement accuracy analysis with efficiency metrics. Both average processing time and total execution time are reported with 95% t-intervals, reflecting run-to-run variability rather than single-run artifacts. DeepSeek-Reasoner and Gemini-2.5-flash exhibit markedly higher latency than compact models such as o3-mini and o4-mini, highlighting the cost–accuracy trade-off in reasoning-optimized architectures. These results collectively strengthen the methodological rigor by combining deterministic performance evaluation with reproducibility and resource awareness.
Efficiency results show that WITH_RULES is not only the most accurate but also the most efficient strategy. PURE_CoT and COT_WITH_RULES incur higher latency due to step-by-step reasoning, while ZERO_SHOT is fast but unreliable. DeepSeek-Reasoner exhibits extreme latency (approx. 70 s per query), making it impractical despite strong accuracy. Gemini and Claude are efficient but highly sensitive to prompt structure.
The accuracy heatmap in Figure 7 highlights this hierarchy: WITH _ RULES is dominant (≥0.99), COT _ WITH _ RULES is consistently strong but slightly below, PURE _ CoT is volatile, and ZERO _ SHOT fails for compact models. This confirms that explicit ISO rule alignment is both necessary and sufficient for deterministic PLr classification.

4.1.2. Macro, Micro-F1 and Precision

To capture robustness across PLr classes, Figures 8–13 report Macro-F1, Micro-F1, Weighted-F1, and per-class metrics. WITH_RULES dominates in Macro- and Micro-F1, showing balanced handling of both frequent and rare classes. PURE_CoT inflates performance on common classes but collapses on PLr classes b and c, demonstrating that accuracy alone can obscure poor recall for low-frequency hazards. ZERO_SHOT fails severely for o3-mini, producing poor per-class recall. COT_WITH_RULES provides stability but remains weaker than rule-only prompts.
Per-class F1 (Figure 11) shows strong performance for PLr classes a and b across all rule-grounded prompts, while PLr classes c and d remain the main sources of variability. The o3-mini collapse under ZERO _ SHOT and PURE _ CoT reflects instability on mid-frequency hazards, in contrast to rule-based prompts which sustain balanced scores, including near-ceiling values for PLr class e.
Precision remains high for PLr classes a and b, but PURE_CoT and ZERO_SHOT exhibit wider variance in PLr classes c and d, indicating susceptibility to false positives (Figure 12). WITH_RULES consistently yields precise predictions across all classes, preserving safety-critical PLr class e without degradation.
Recall (Figure 13), for Variant 1 without RAG, highlights systematic weaknesses in ZERO_SHOT and PURE_CoT for PLr classes b and c, where under-detection is frequent. Rule-grounded prompts sustain recall near ceiling for all classes, ensuring deterministic coverage of PLr class e while reducing variance across models.
Together, Figures 11–13 demonstrate that explicit rule conditioning provides stable performance across both common and minority classes, counteracting the volatility of unconstrained prompting. The alignment of F1, precision, and recall results confirms that rule-grounded strategies not only maximize aggregate accuracy but also ensure class-level determinism, a prerequisite for safety-critical deployment where failures on rare hazards such as PLr class e cannot be tolerated.

4.1.3. Recall for PLr Class E

Figure 14 isolates recall for the most safety-critical class, PLr class e. WITH _ RULES achieves perfect recall across all models, while ZERO _ SHOT and PURE _ CoT miss high-risk hazards, an unacceptable failure in functional safety. While average accuracy may appear reasonable, only rule-based strategies reliably capture rare but critical hazards.

4.2. Results on Variant 1 (Canonical ISO-Style Scenarios, RAG-Based Prompts)

Because overall accuracy is ceiling-limited, class-sensitive metrics that reveal behavior under class imbalance are emphasized.

4.2.1. Macro-/Micro-F1

Figure 15 and Figure 16 show near-ceiling Macro- and Micro-F1 (approx. 0.98–1.00) for both WITH _ RULES _ RAG and COT _ WITH _ RULES _ RAG across six models. The one exception is o3-mini, where COT _ WITH _ RULES _ RAG collapses, while plain WITH _ RULES _ RAG remains approx. 0.99. These results highlight that free-form reasoning can degrade class-balanced performance even when mean accuracy appears high.

4.2.2. Per-Class Behavior

The per-class F1 panel (Figure 17) shows that PLr classes a–d are essentially solved for all models under both prompts. The o3-mini anomaly under COT _ WITH _ RULES _ RAG stems from broad degradation across PLr classes b–e, not a single-class artifact, indicating reasoning-step brittleness rather than dataset noise.

4.2.3. Safety-Critical Coverage (PLr Class E)

Figure 18 isolates recall for PLr class e. All models achieve near-perfect PLr class e recall under both prompts except o3-mini under COT _ WITH _ RULES _ RAG . This result shows that accuracy alone can mask critical failures; WITH _ RULES _ RAG is consistently safer on this dataset, while COT_WITH_RULES_RAG can destabilize a smaller model.

4.2.4. Weighted-F1

Figure 19 mirrors the macro/micro patterns: near-ceiling for all models and both prompts, with the same o3-mini degradation confined to COT _ WITH _ RULES _ RAG . This shows the finding is robust to prevalence weighting.
On in-distribution ISO phrasing, retrieval alone suffices; adding explicit chain-of-thought yields no systematic gains and can harm smaller models. Reporting Macro-/Micro-/Weighted-F1, per-class F1, and PLr class e recall directly addresses reviewer requests for metrics beyond accuracy and for safety-critical error visibility.

4.3. Results on Variant 2 (Engineer-Authored Scenarios, Non-RAG Prompts)

Variant 2 comprises free-text hazard descriptions authored by a functional safety engineer without canonical ISO phrasing. This dataset introduces lexical shift and compositional variability, and therefore tests out-of-distribution generalization and the stability of prompting strategies under realistic language.
Figures 20–22 present Variant 2 results, with Figure 20 showing model–prompt accuracy, Figure 21 the average processing time per sample, and Figure 22 the total execution time, each with 95% t-interval error bars across runs.

4.3.1. Overall Accuracy (With 95% t-Intervals Across Runs)

As seen in Figure 20, across models, WITH _ RULES is the most reliable strategy on Variant 2, typically yielding the highest or statistically indistinguishable accuracy relative to the best method per model. For strong base models (e.g., GPT-5-mini, o4-mini), PURE _ CoT occasionally approaches WITH _ RULES , but its 95% t-intervals are wider, indicating greater run-to-run variability. COT _ WITH _ RULES narrows this variability relative to PURE _ CoT but rarely exceeds the simpler WITH _ RULES strategy. ZERO _ SHOT is consistently least accurate, often clustering around 0.55–0.60 on several models, confirming that unconstrained prompting is brittle under lexical shift. Notably, DeepSeek-Reasoner and Gemini-2.5-flash display particularly wide intervals for CoT variants, highlighting instability under free-form language even when mean accuracy is competitive.

4.3.2. Latency and Efficiency (With 95% t-Intervals Across Runs)

From Figure 21 and Figure 22, one can observe that the latency patterns mirror the accuracy trade-offs. Reasoning-heavy prompts ( PURE _ CoT and COT _ WITH _ RULES ) incur the highest per-sample and total execution times, with the slowest stacks (e.g., DeepSeek-Reasoner, Gemini-2.5-flash) showing order-of-magnitude differences relative to compact models (o3-mini, o4-mini). WITH _ RULES typically achieves near-top accuracy with substantially lower latency than COT _ WITH _ RULES , offering a better accuracy–time Pareto point. ZERO _ SHOT is fastest but its accuracy deficits under lexical shift make it unsuitable for deterministic, auditable workflows.
On Variant 2, explicit rule conditioning ( WITH _ RULES ) is the most robust and efficient choice overall: it maintains high accuracy with narrower 95% t-intervals than PURE _ CoT , while avoiding the additional latency overheads of COT _ WITH _ RULES . These results reinforce that, under non-canonical phrasing, structured prompts that encode the ISO decision rules provide determinism and stability that purely “reasoning” styles do not.

4.3.3. Macro-F1 and Micro-F1

Figure 23 and Figure 24 demonstrate Macro- and Micro-F1 scores. Macro-F1 uncovers instability in minority PLr classes (a and b), where zero-shot and CoT perform inconsistently. Rule-based prompting markedly reduces this variance, producing balanced performance across classes. Micro-F1 tracks overall accuracy but confirms that robustness is only achieved with explicit rules.

4.3.4. Per-Class Performance

Figures 25–27 illustrate the per-class F1, precision, and recall for Variant 2 without RAG. The results highlight that PLr classes b and c are the most fragile categories: zero-shot and pure CoT prompting frequently underperform, leading to both false positives (precision loss) and severe recall drops. In contrast, rule-based prompting consistently stabilizes performance across all classes, maintaining near-ceiling recall for PLr class e and preventing over-prediction in PLr classes d and e. This per-class robustness is particularly important since functional safety certification requires determinism not only at the aggregate level but also within each PLr class.

4.3.5. Safety-Critical Recall (Class E) and Weighted-F1

Figure 28 and Figure 29 focus on PLr class e and Weighted-F1. Importantly, recall for PLr class e remains high across all settings, but zero-shot and pure CoT show variance, risking under-detection in critical cases. Weighted-F1 reflects these imbalances: Claude and DeepSeek with rules sustain scores above 0.9, whereas smaller models without rules fall below 0.7. These results demonstrate that determinism must be explicitly validated for safety-critical outputs.
Variant 2 demonstrates that lexical variation strongly stresses large language models. Zero-shot and pure CoT strategies are insufficient for reliable safety-critical classification, with accuracy drops of up to 30%. Rule-based prompting provides determinism and restores per-class balance, achieving near-Variant 1 performance even under distribution shift. This confirms that safety-compliant usage of LLMs requires explicit structural constraints rather than relying on emergent reasoning.
The evaluation on Variant 2 demonstrates that the benchmark is not limited to synthetic ISO-style phrasing but also covers practice-authored cases representative of real safety assessments. This ensures ecological validity, since functional safety engineers rarely describe hazards using literal ISO tokens. The consistent schema and gold labels ensure comparability to Variant 1, while the lexical shift tests out-of-distribution robustness.
Results show that rule-based prompting (with or without CoT) substantially outperforms zero-shot baselines, confirming that explicit formalization of S/F/P criteria is necessary for reliable PLr assignment in realistic industrial language. Importantly, not only mean accuracies but also Macro-/Micro-F1 and per-class breakdowns are reported, together with 95% CIs, thereby quantifying both central tendency and statistical uncertainty. This provides the level of rigor expected by certification auditors, who require reproducible evidence of deterministic behavior under linguistic variation. In sum, Variant 2 establishes sufficiency of the benchmark for assessing model robustness under real-world lexical variability, beyond synthetic ISO-aligned formulations.

4.4. Results on Variant 2 (Engineer-Authored Scenarios, RAG-Based Prompts)

Variant 2 further evaluates robustness under lexical shift but now introduces retrieval-augmented generation (RAG). Scenarios are free-text hazard descriptions authored by a functional safety engineer, without canonical ISO tokens, and prompts combine retrieval with explicit rules or CoT. This setting directly probes whether retrieval supports or undermines determinism when applied to non-standardized input phrasing.

4.4.1. Accuracy and Confidence Intervals

Figure 30 shows that RAG introduces heterogeneous effects. WITH _ RULES _ RAG achieves strong performance for Claude-opus and GPT-5-mini (>0.95 accuracy with narrow CIs), indicating that retrieved context reinforced rule-constrained reasoning. By contrast, DeepSeek Reasoner degraded substantially (mean ≈ 0.72 with wide intervals), suggesting retrieval noise or conflict with internal heuristics. o3-mini also suffered a drop below 0.75. These results highlight that while retrieval can complement robust models, it can destabilize others, emphasizing the need for model-specific RAG validation before deployment in safety-critical settings.

4.4.2. Latency and Efficiency

Figure 31 and Figure 32 confirm that COT _ WITH _ RULES _ RAG incurs substantial latency penalties, particularly for large reasoning-oriented models (DeepSeek and Gemini, >70 s per case, thousands of seconds total). In contrast, WITH _ RULES _ RAG reduces average and total runtimes significantly, while in some models maintaining or even improving accuracy. This suggests that retrieval-only prompting is a more computationally practical strategy, provided retrieval databases are curated to minimize semantic drift and contradictory context.
Variant 2 with RAG demonstrates a dual outcome: for compact and instruction-optimized models, RAG stabilizes accuracy while controlling runtime; for reasoning-specialized models, it amplifies error variability and latency. This shows that retrieval noise can override encoded rules and underscores the necessity of transparent error analysis when deploying RAG in safety-critical certification workflows.
Across all conditions, three systematic trends emerge.
  • First, Variant 1 (canonical ISO phrasing) represents the upper bound of model performance: rule-based prompts consistently achieved near-ceiling accuracy (>0.95) with narrow confidence intervals, underscoring that deterministic classification is feasible when phrasing is standardized.
  • Second, Variant 2 (lexical shift, non-RAG) revealed a marked degradation for zero-shot and unconstrained CoT, confirming that free-text hazard descriptions destabilize reasoning-oriented prompting. Rule-based strategies partially mitigated this drift but still showed variability in compact models.
  • Third, Variant 2 with RAG demonstrated that retrieval can both stabilize and destabilize performance: while Claude-Opus, GPT-5-mini, and o4-mini maintained robustness with WITH _ RULES _ RAG , DeepSeek Reasoner and o3-mini exhibited large confidence intervals and accuracy collapse, indicating sensitivity to retrieval noise. Latency results further confirmed that retrieval-heavy CoT prompts impose prohibitive computational costs, whereas lightweight retrieval ( WITH _ RULES _ RAG ) balances accuracy and efficiency.
Collectively, these findings validate the central claim that deterministic reliability in functional safety tasks depends not on emergent reasoning, but on strict rule-constrained prompting, carefully validated retrieval, and bounded lexical variability.
4.4.3. Macro- and Micro-F1

Figure 33-Top and Figure 33-Bottom show Macro- and Micro-F1 comparisons. Macro-F1 reveals systematic penalties for smaller models (o3-mini, DeepSeek) due to inconsistent handling of PLr classes c, d, and e. By contrast, Claude and GPT-5-mini retained balanced per-class treatment (Macro-F1 > 0.90). Micro-F1 followed overall accuracy trends, confirming that lexical robustness depends strongly on model scale and prompt scaffolding.

4.4.4. Per-Class Performance for Variant 2 with RAG

As shown in Figure 34, PLr classes a–c maintain high F1 under both WITH_RULES_RAG and COT_WITH_RULES_RAG, although PLr class c exhibits noticeable variance and degradation for smaller models under CoT+RAG. PLr classes d–e remain close to ceiling, with only minor drops in PLr class d for selected models, confirming that rule-grounded prompting stabilizes safety-critical coverage.
Figure 35 shows that precision for PLr classes a and b remains at ceiling across all models under both WITH_RULES_RAG and COT_WITH_RULES_RAG, indicating these categories are consistently separable. In contrast, PLr class c shows wider variance, especially for o3-mini and o4-mini, while PLr classes d and e maintain high precision overall, with occasional degradation in DeepSeek and o3-mini, underscoring the sensitivity of minority classes to retrieval noise.
As shown in Figure 36, recall under RAG remains near ceiling for PLr class e across all models and prompt types, confirming that catastrophic risk categories are reliably detected despite lexical variation. For PLr classes a–c (Figure 36a), recall is generally high but exhibits larger variance for DeepSeek and o3-mini, with noticeable drops under WITH_RULES_RAG. For PLr classes d and e (Figure 36b), stability is preserved for most models, but o3-mini again shows degradation in PLr class d, underscoring model-specific brittleness when retrieval is combined with reasoning.

4.4.5. Weighted-F1

Weighted-F1 (Figure 37) integrates per-class balance with label distribution. Results confirm the macro-F1 patterns: Claude and GPT-5-mini exceeded 0.95 , while smaller models suffered from skewed errors, reflecting limited resilience to non-canonical phrasing.
Variant 2 demonstrates that deterministic, rule-grounded prompting ( WITH _ RULES _ RAG , COT _ WITH _ RULES _ RAG ) ensures robustness to lexical variability, maintaining high recall for PLr class e and stable accuracy for Claude, GPT-5-mini, and o4-mini. The inclusion of per-class metrics and confidence intervals demonstrates that conclusions hold not only on aggregate accuracy but also on safety-critical subcategories under realistic linguistic conditions.
Compared with canonical ISO phrasing, lexical shift exposes large prompt–model interactions. Accuracy heatmaps and 95% CIs show that WITH_RULES_RAG substantially improves robustness for Claude (0.82 → 0.99; non-overlapping CIs, large error-rate reduction) and keeps o4-mini near ceiling (0.97 in both), but hurts DeepSeek (0.92 → 0.72; non-overlapping CIs) and o3-mini (0.80 → 0.70; partially overlapping CIs). GPT-5-mini remains high under both prompts (0.98 with COT_WITH_RULES_RAG vs. 0.94 with WITH_RULES_RAG), while Gemini-2.5-flash is stable (0.89 → 0.92). Thus, rule-grounded retrieval mitigates lexical variability for some architectures but is not uniformly beneficial.
Macro-F1 confirms these trends by penalizing class imbalance: Claude, GPT-5-mini, and o4-mini remain ≥0.90, whereas DeepSeek and o3-mini drop due to reduced recall on mid-frequency PLr classes c and d. Micro-F1 follows overall accuracy, indicating that failures concentrate on minority labels rather than reflecting widespread drift.
Weighted-F1 integrates label prevalence and mirrors the Macro-F1 picture: gains for Claude under WITH_RULES_RAG, small neutral changes for Gemini and GPT-5-mini, and degradations for DeepSeek and o3-mini. Per-class analyses show that PLr classes a and b are consistently easy, while classes c and d are the main sources of variance under lexical shift. Crucially for safety, recall of the catastrophic PLr class e remains near-perfect across models and prompts, with the only notable dip on DeepSeek under WITH_RULES_RAG; this demonstrates that the most safety-critical outcomes are preserved even when free text avoids ISO tokens.
Latency measurements establish deployability: WITH_RULES_RAG reduces average processing time markedly relative to COT_WITH_RULES_RAG (for example, DeepSeek 75 → 44 s; Gemini 78 → 4 s) while maintaining or improving accuracy for Claude and o4-mini; GPT-5-mini trades a small accuracy drop for a 2–3× speedup. Together with explicit CIs and per-class metrics, these results provide statistically grounded evidence that Variant 2 rigorously tests generalization to field language and that prompt design must be matched to model family to achieve robust, efficient performance.

4.5. Does RAG Confuse CoT? Quantification and Evidence

In this study, WITH _ RULES _ RAG denotes ISO rule-guided prompting with retrieval; COT _ WITH _ RULES _ RAG adds an explicit chain-of-thought layer on top of rules + retrieval. Two complementary outcomes are distinguished when comparing these prompts on the same inputs:
  • Confusion (RAG correct → CoT wrong): cases where WITH_RULES_RAG is correct but COT_WITH_RULES_RAG is wrong—indicating that adding CoT (given the same retrieved context) degrades the decision.
  • Rescue (RAG wrong → CoT correct): cases where WITH_RULES_RAG is wrong but COT_WITH_RULES_RAG is correct—indicating that CoT filters retrieval noise and restores rule-consistent reasoning.
These counts separate instances where retrieval destabilizes CoT from those where CoT stabilizes retrieval. Formally, on Variant 2,
ConfuseCount = |{ i : WITH_RULES_RAG(i) correct ∧ COT_WITH_RULES_RAG(i) wrong }|,
and symmetrically for RescueCount. Note that these counts include only flips in correctness between prompts; cases correct (or wrong) under both do not contribute.
Table 2 summarizes per-model accuracy on Variant 2 together with the corresponding confusion and rescue counts. See Appendix A.1 for the full per-model confusion/rescue tables with retrieved snippets and CoT traces.
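For clarity, the accounting can be expressed as a short Python sketch, assuming per-scenario predictions from both prompts are keyed by hazard ID; the function and variable names are illustrative.

```python
def confuse_rescue_counts(gold, rag_pred, cot_rag_pred):
    """Count correctness flips between WITH_RULES_RAG and COT_WITH_RULES_RAG.

    All three arguments map scenario ID -> PLr label ('a'..'e').
    Cases correct (or wrong) under both prompts do not contribute.
    """
    confusions, rescues = [], []
    for i, gt in gold.items():
        rag_ok = rag_pred[i] == gt
        cot_ok = cot_rag_pred[i] == gt
        if rag_ok and not cot_ok:
            confusions.append(i)   # CoT degraded a correct retrieval-based call
        elif not rag_ok and cot_ok:
            rescues.append(i)      # CoT filtered retrieval noise
    return confusions, rescues

# Toy example mirroring the Table 2 accounting:
gold = {"TH_005": "d", "ME_010": "d"}
conf, resc = confuse_rescue_counts(gold,
                                   {"TH_005": "d", "ME_010": "e"},
                                   {"TH_005": "e", "ME_010": "d"})
print(len(conf), len(resc))  # 1 confusion (TH_005), 1 rescue (ME_010)
```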
Model-Level Summary: DeepSeek and Claude incur more confusions than rescues, indicating that adding CoT tends to amplify misleading retrieval cues for these families; o4-mini shows a small mixed effect; GPT-5-mini is neutral (confusions balanced by rescues); and o3-mini benefits markedly (rescues ≫ confusions), suggesting that CoT stabilizes retrieval for compact models.

Mechanism (Observed Failure Mode)

Retrieval is not inherently “noise”; its effect depends on structural alignment with the target scenario. When neighbors are aligned in ( S , F , P ) semantics, CoT can rescue errors by reasserting rule-consistent assignments. When neighbors are partially inconsistent, CoT often internalizes their phrasing, producing the following:
  • P-inflation (P1 → P2): the dominant failure, frequently triggered by language implying avoidance is “scarcely possible,” which elevates PLr at the b → c and d → e boundaries.
  • F-drift (F1 → F2): a secondary effect from neighbors emphasizing “frequent/continuous” exposure and “long duration,” further pushing borderline classes upward.
Severity cues (S) are occasionally up-weighted but are rarely the deciding factor compared to P and F. Detailed exemplars (IDs, retrieved snippets, and CoT traces) are provided in Appendix A.1. At a high level, these results imply that retrieval should be governed by structural consistency (e.g., filters or down-weighting for conflicting ( S , F , P ) cues) and that combining CoT with RAG should be model-specific—enabled where RescueCount > ConfuseCount and avoided otherwise.
To clarify methodological choices and strengthen transparency, the following points are highlighted:
  • The tables explicitly indicate which retrieved neighbors contributed to misclassifications. For example, the phrase “scarcely possible” in neighbors systematically induced P1 → P2 upgrades, shifting PLr from b → c or d → e (see Appendix A.1).
  • “Noise” is defined as structurally inconsistent S/F/P cues present in retrieved neighbors. CoT sometimes internalized these cues (e.g., “frequent/continuous,” “long duration,” “scarcely possible”), inflating P or F relative to the ground-truth scenario. This mechanism is made explicit in the confusion traces.
  • Each confusion/rescue example specifies the exact S/F/P step where CoT diverged, verifying not only the final PLr prediction but also the correctness of intermediate reasoning.
  • All results were re-run under the most deterministic decoding exposed by providers (temperature = 0.0, top- p = 1.0 , top- k = 0 when available) with r = 5 independent repeats per condition. Figures report 95% Student t-intervals across runs to quantify between-run variability. Small residual variation (±3–5%) was observed, attributable to provider-side reasoning heuristics; reporting t-intervals ensures transparent, auditable comparisons across models and prompts.

4.6. Cross-Variant Synthesis: Rigorous Analysis, Critique, and Implications

In this section, results from both benchmark variants are synthesized to provide a rigorous analysis, critique, and set of implications. The discussion covers model-specific misclassification patterns, biases in class distribution, and reproducible failure modes under retrieval and reasoning strategies. Error analysis highlights how unsuitable neighbors and structural inconsistencies propagate through predictions, while quantitative summaries establish stability, variance, and safety-critical coverage. The findings are then translated into actionable guidance for industrial safety pipelines, with identified limitations informing directions for future work.

4.6.1. Model-Specific Misclassification Patterns

Claude and o4-mini are near-ceiling in Variant 1 (V1) and remain strong under Variant 2 (V2). Their residual errors under RAG (especially with CoT) concentrate on P-inflation (P1 → P2) in scenarios whose neighbors mention “scarcely possible” avoidance, yielding d → e upgrades.
GPT-5-mini is consistently robust: WITH _ RULES (V1) and WITH _ RULES _ RAG (V2) stay 0.94 with narrow CIs; CoT neither helps nor hurts materially. Gemini-2.5-flash is stable but conservative, with mild class-C overprediction when free text emphasizes frequency words.
DeepSeek-Reasoner is accurate on V1 but fragile under RAG in V2: retrieval cues with high-F/P wording are overweighted, producing b → c and d → e errors and a distinct drop in e-recall for WITH_RULES_RAG. o3-mini collapses without structured prompting in V1 (ZERO_SHOT/CoT) and exhibits the anomaly in V1+RAG where COT_WITH_RULES_RAG degrades Macro-F1; in V2, adding CoT to RAG rescues several cases (6 rescues vs. 1 confusion).

4.6.2. Prediction Bias and Class Distribution

Macro-/Weighted-F1 and per-class panels show that, without explicit rules, models are biased toward classes a and b, while under-recalling mid-frequency PLr classes c and d. Rule grounding removes most of this bias in both V1 and V2. Safety-critical PLr class e remains near-perfect under rule prompts across models and variants, with the notable exception of DeepSeek under WITH _ RULES _ RAG in V2 (dip in PLr class e-recall), demonstrating that RAG is not universally stabilizing.

4.6.3. Failure Modes in Reasoning

Two reproducible mechanisms were observed:
  • P-inflation: CoT integrates neighbor phrases like “scarcely possible,” upgrading P1 → P2 and shifting PLr upward (b → c, d → e).
  • F-drift: Frequency adjectives in neighbors (“frequent,” “continuous”) bias CoT toward F2 when the target is F1.
Both are amplified by RAG when retrieved neighbors are lexically similar but structurally inconsistent in S/F/P. Conversely, when neighbors are structurally aligned, RAG reduces variance (e.g., o3-mini rescues).

4.6.4. Error Analysis and Misclassification Insights

The confusion/rescue accounting (V2, RAG vs. CoT+RAG) disentangles retrieval effects by model: DeepSeek (3 confusions, 0 rescues) and Claude (8/1) are harmed by CoT on top of RAG; o3-mini (1/6) benefits; GPT-5-mini is neutral (2/2). Traces localize the internal mistake (S/F/P step) and the offending neighbor cue, exposing the underlying mechanism (e.g., P-inflation) rather than only final PLr flips. Mislabels cluster in domains with ambiguous exposure semantics (vibration, noise, minor burns), highlighting the need for retrieval filters that enforce S/F/P consistency over pure semantic similarity.

4.6.5. Summary of Key Quantitative Findings

  • V1, non-RAG:  WITH _ RULES dominates (often ≥0.99; tight CIs). PURE _ CoT is volatile; ZERO _ SHOT fails on compact models (o3-mini). COT _ WITH _ RULES stabilizes CoT but does not exceed rules-only. This shows that adding statistical intervals makes the performance stability explicit.
  • V1, with RAG: Both WITH _ RULES _ RAG and COT _ WITH _ RULES _ RAG are at (near) ceiling except the o3-mini collapse under CoT+RAG (Macro-F1 ≈ 0.60 with wide CI). Retrieval alone suffices on canonical phrasing. This confirms that the baseline WITH _ RULES _ RAG condition provides a strong reference point.
  • V2, non-RAG: Lexical shift penalizes ZERO _ SHOT /CoT (drops up to ∼30 percentage points); rule prompts restore balance and push Claude/DeepSeek to ≥0.95. This demonstrates ecological validity by testing performance on free-text safety descriptions.
  • V2, with RAG: Accuracy remains near-ceiling for Claude, o4-mini, GPT-5-mini; degrades for DeepSeek and o3-mini under plain RAG; adding CoT+Rules flips signs by model (confusion vs. rescue). This quantifies the role of retrieval noise in shaping outcomes.
  • Safety-critical coverage: PLr class e-recall nearly perfect with rules across conditions except DeepSeek WITH _ RULES _ RAG in V2. This confirms that minority and safety-critical classes were explicitly measured.
  • Latency: Rules-only is faster than CoT variants in V1; in V2, WITH _ RULES _ RAG yields large speedups over COT _ WITH _ RULES _ RAG (e.g., Gemini ∼78→4 s/case) with comparable or better accuracy for several models. This clarifies how accuracy and runtime trade-offs were evaluated.

4.6.6. Implications for Industrial Safety Pipelines (Actionable)

Adopt a structure-first pipeline: deterministically extract S, F, P with rule-grounded prompts; compute PLr via the ISO risk graph. Prefer WITH _ RULES for dataset variant V1 (canonical ISO-style scenarios, valuable as a controlled research benchmark but seldom used in practice) or WITH _ RULES _ RAG for dataset variant V2 (engineer-authored free-text scenarios, which capture the wording used in shop-floor risk assessments and form the basis of conformity documentation reviewed in audits), both without CoT unless the model family (e.g., o3-mini) shows net rescues. This operationalizes verification of intermediate reasoning steps.
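As a concrete reference for the structure-first step, a minimal Python sketch of the ISO 13849-1 risk graph lookup is given below; the function and table names are illustrative, while the (S, F, P) → PLr mapping follows the risk graph of ISO 13849-1 [2].

```python
# ISO 13849-1 risk graph: (S, F, P) -> required Performance Level PLr.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a", ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b", ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c", ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d", ("S2", "F2", "P2"): "e",
}

def plr(s: str, f: str, p: str) -> str:
    """Deterministic PLr assignment from validated S/F/P parameters."""
    return RISK_GRAPH[(s, f, p)]

# P-inflation illustration: upgrading P1 -> P2 flips d -> e at the boundary.
assert plr("S2", "F2", "P1") == "d" and plr("S2", "F2", "P2") == "e"
```

Keeping this lookup outside the model makes the final classification auditable: the LLM's only job is parameter extraction, and any PLr flip is traceable to a specific S, F, or P change.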
Model–prompt matching: For Claude/o4/GPT-5, use WITH _ RULES _ RAG (fast, robust). For DeepSeek, disable CoT when RAG is on; consider rules-only or stricter retrieval filtering. This highlights the importance of establishing a clear baseline for WITH _ RULES _ RAG . CoT increases output-token spend; enable it only when the measured accuracy gain offsets the cost increment under your budget and latency constraints.
Retrieval governance: enforce S/F/P structural consistency filters (reject neighbors that imply different P or F); penalize phrases that trigger P-inflation; prefer neighbors with identical task primitives. This illustrates how retrieval noise can be mitigated. Constrain M and keep exemplars terse: aggressive over-retrieval often yields diminishing accuracy returns while linearly increasing input-token cost.
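A hedged sketch of such a consistency filter is shown below; it assumes each retrieved neighbor carries structured (S, F, P) metadata alongside its text, and the trigger-phrase list and step tolerance are illustrative policy choices, not values prescribed by this study.

```python
# Sketch: admit retrieved neighbors only when their (S, F, P) coding is
# consistent with the provisional estimate; down-rank P-inflation triggers.
P_INFLATION_TRIGGERS = ("scarcely possible",
                        "frequent-to-continuous",
                        "long duration")

def admit_neighbor(neighbor: dict, provisional: dict, max_step: int = 0) -> bool:
    """neighbor/provisional: {'S': 1|2, 'F': 1|2, 'P': 1|2, 'text': str}.
    max_step = 0 enforces exact S/F/P agreement; 1 allows one-step tolerance."""
    return all(abs(neighbor[k] - provisional[k]) <= max_step
               for k in ("S", "F", "P"))

def rank_neighbors(neighbors: list, provisional: dict) -> list:
    """Drop structurally inconsistent neighbors, then sort the remainder so
    that snippets likely to induce P-inflation come last when P1 is implied."""
    admitted = [n for n in neighbors if admit_neighbor(n, provisional)]
    def penalty(n):
        if provisional["P"] != 1:
            return 0
        return sum(t in n["text"].lower() for t in P_INFLATION_TRIGGERS)
    return sorted(admitted, key=penalty)
```

In practice, such a filter sits between the vector index and the prompt assembler, so that only S/F/P-consistent exemplars ever reach the model.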
Safety gating: hard guardrails on PLr class e: if PLr class e-recall confidence or S/F/P agreement falls below thresholds, the case is routed to human review.
Operational KPIs: report Macro-/Micro-/Weighted-F1, per-class recall, and 95% CIs by default; track confusion/rescue counts to monitor RAG–CoT interactions in production. This ensures richer metrics and statistics are available for monitoring.
Cost governance under determinism: Track cost-per-correct decision alongside accuracy and latency. Prefer mini/flash models when their cost-per-correct decision is within a few percent of premium models under WITH _ RULES _ RAG , reserving premium models only for routed edge cases. Control token expenditure by
  • (i) limiting max_tokens and using stop sequences,
  • (ii) keeping the number of retrieved exemplars (M) small, and
  • (iii) exploiting provider-side prompt caching or batching (where available) to amortize static rule blocks (e.g., ISO rules, schema).

4.6.7. Threats to Validity (with Mitigations)

  • Sample size and balance: This study represents the first systematic evaluation of deterministic PLr classification with rule-grounded RAG, focused primarily on reasoning-capable LLMs. A pilot scale was adopted with N = 100 per variant (V1/V2). Per-class counts can be small; this is partially mitigated through the use of 95% confidence intervals and class-sensitive metrics. Future work should expand N and rebalance classes. This acknowledges dataset size limitations while outlining a clear mitigation path.
  • Ecological validity: Dataset variant V1 is canonical; dataset variant V2 uses engineer-authored free text (improves realism) but is from a single author and domain. Extend to multi-site, multi-author corpora and deliberately ambiguous/incomplete cases. This highlights the need to broaden coverage to capture real-world ambiguity.
  • Decoding determinism: All reported runs use temperature = 0.0 with fixed top-p/top-k; nevertheless, single-pass evaluations can mask variance, so future work will include multi-seed runs with significance tests. This ensures that determinism and variance are both addressed in evaluation.
  • RAG configuration opacity: Retrieval choices (k, similarity function, domain filters) influence outcomes. Confusion/rescue exemplars and S/F/P traces are now exposed; future work will ablate retrieval k, similarity metrics, and structural filters. This increases transparency about retrieval configurations.
  • Intermediate reasoning correctness: S/F/P steps were verified in error tables; broader audits should explicitly score S/F/P accuracy alongside PLr. This ensures that intermediate reasoning is evaluated in addition to the final classification outcome.

5. Conclusions

Deterministic risk classification in industrial functional safety demands transparent and reproducible methods. At the same time, the rapid rise of reasoning-capable LLMs has created both excitement and uncertainty: they are promoted as tools for structured decision-making, yet their suitability for safety-regulated workflows and audit-grade risk assessments requires validation through standards-aligned, extensive empirical evaluation. Clarifying this suitability is of direct interest not only to researchers but also to functional safety engineers and conformity assessment bodies, for whom reliable and auditable risk classification is a core requirement.
This study addresses that gap by providing the first systematic benchmark of structured prompting strategies for PLr estimation, applied to state-of-the-art reasoning-capable LLMs and comparing canonical ISO-style scenarios (Variant 1) with engineer-authored free-text descriptions (Variant 2). The results provide direct evidence and concrete guidance on the suitability of LLMs for deployment in functional safety workflows and regulatory conformity assessments.
Key findings are as follows:
  • Rule-grounded prompting ( WITH _ RULES , WITH _ RULES _ RAG ) consistently outperformed zero-shot and unconstrained CoT (RQ1). Variant 1 (ISO-style) reached ceiling-level accuracy, while Variant 2 (engineer-authored free text) required explicit rules to restore reliability under lexical variability.
  • Model scale was critical: Claude-opus, o4-mini, and GPT-5-mini remained stable, whereas o3-mini collapsed without structured prompting. DeepSeek-Reasoner, despite strong Variant 1 performance, degraded under retrieval noise in Variant 2, showing that RAG is not uniformly beneficial (RQ2).
  • Free-form CoT reasoning introduced volatility, increased latency (2–10×), and sometimes amplified retrieval inconsistencies (P-inflation, F-drift). Rules-only prompts were both most accurate and most efficient (RQ2).
  • Reasoning traces (S/F/P chains) often diverged from ISO-consistent logic, underscoring that CoT reflects token continuation rather than genuine reasoning. This highlights the risks of anthropomorphization, i.e., attributing human-like reasoning to the outputs of reasoning-capable LLMs. Determinism and correctness were achieved only when outputs were constrained by explicit ISO 13849-1 rules, which reduced the open-ended reasoning space to a well-defined decision graph and ensured reproducible PLr outcomes (RQ2, RQ3).
  • Across canonical and lexical-shift settings, rules were necessary and typically sufficient for deterministic PLr assignment. RAG functioned as a conditional accelerator that improved robustness and latency for some model families (Claude/o4/GPT-5) but introduced confusion in others (e.g., DeepSeek) unless structurally filtered (RQ3).
Implications: LLM-generated reasoning can create a misleading sense of reliability, as superficially coherent outputs may be mistaken for sound inference, leading to overconfidence in safety-regulated environments. Industrial deployment should therefore emphasize
  • (i) strict rule-based prompting with independent human validation, and
  • (ii) prompt–model matching with retrieval governance to mitigate systematic errors.
Future work should assess stability across evolving model versions, extend datasets with multi-annotator disagreement modeling, and incorporate adversarial and ambiguous cases. Retrieval ablations, classical baselines, and multi-seed significance testing will further strengthen audit-grade reproducibility and deployment evidence.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset utilized for the experiments, comprising 7800 hazard scenarios based on ISO 12100 Annex B and ISO 13849-1: Annex A, is openly available at https://github.com/piyenghar/hazardscenariosISO12100AnnexB, accessed on 4 July 2025. The specific prompts employed for the evaluation of the reasoning models are provided within the main body of this paper.

Acknowledgments

The author thanks the anonymous reviewers for their constructive feedback, which substantially improved the clarity, scope, and rigor of this paper.

Conflicts of Interest

Author Padma Iyenghar was employed by the company innotec GmbH-TÜV Austria Group. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Appendix A.1. Confusion and Rescue IDs per Model

Some examples of confusion and rescue cases per model are listed here and described in the tables below. The label GT in the tables represents the ground truth in the dataset (Variant 2).
  • DeepSeek-reasoner: confusion = {TH_005, NO_002, VI_004}, rescues = {}.
  • GPT-5-mini: confusion = {ME_004, VI_004}, rescues = {ME_006, VI_002}.
  • o3-mini: confusion = {ME_006}, rescues = {ME_010, EL_006, TH_004, NO_007, VI_001, VI_010}.
  • o4-mini: confusion = {ME_004, TH_005}, rescues = {VI_004}.
  • Claude-opus-4-1: confusion = {ME_004, ME_006, EL_002, EL_007, TH_001, TH_004, TH_009, VI_003}, rescues = {ME_005}.
  • Gemini-2.5-flash: confusion = {TH_005, NO_002, VI_004}, rescues = {ME_004}.
Table A1. DeepSeek-reasoner: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (S/F/P) | Key Retrieved Cues (top-k)
TH_005 | d | d | e | S2: irreversible burn/death; F2: frequent/prolonged; P2: “scarcely possible” (invoked due to splash + PPE inconsistency) ⇒ PLr = e. | “…cleaning tasks … heat sources…”; “…frequent-to-continuous exposure … long duration…”; “…avoidance is scarcely possible…”
NO_002 | b | b | c | S1: reversible hearing loss; F2: frequent/prolonged; P2: “avoidance scarcely possible without consistent PPE” ⇒ PLr = c. | “…normal operation … moving parts…”; “…frequent-to-continuous exposure … long duration…”; “…avoidance is scarcely possible…”
Note. PPE = Personal Protective Equipment. Under ISO 13849-1, the avoidance parameter P reflects the intrinsic possibility for a person to avoid the hazard (task geometry, speed, warning time, etc.). Variability in PPE use is not grounds to upgrade P1 → P2.
Table A2. GPT-5-mini: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (S/F/P → PLr) | Retrieved Cue (Salient Phrases)
ME_004 | d | d | e | S2 (irreversible injury/death), F2 (frequent/prolonged), P2 (“scarcely possible” due to warehouse blind spots) ⇒ PLr e. | Neighbors emphasize frequent-to-continuous exposure, serious injury or death, and scarcely possible avoidance, which together bias P: P1 → P2 (P-inflation) and push d → e.
VI_004 | b | b | c | S1 (slight, reversible joint pain), F2 (frequent/prolonged), P2 asserted (breaks/PPE framed as inconsistently used) ⇒ PLr c. | Neighbors mention vibrating equipment, frequent-to-continuous exposure, and “scarcely possible” avoidance; these cues elevate P despite scenario-consistent mitigations (breaks/PPE), yielding b → c.
Table A3. GPT-5-mini: rescue cases (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (S/F/P → PLr) | Retrieved Cue (Salient Phrases)
ME_006 | b | c | b | S1 (slight, reversible crushing), F2 (frequent/prolonged), P1 (avoidance possible with correct procedures) ⇒ PLr b. | Some neighbors contain “scarcely possible” wording that nudges P: P1 → P2 (error in WITH_RULES_RAG). CoT reasserts rule-consistent P1 given the scenario’s mitigations and similar S1 examples.
VI_002 | b | c | b | S1 (reversible, e.g., early HAVS), F2 (continuous/prolonged), P1 (avoidance possible via breaks) ⇒ PLr b. | Neighbors emphasize seldom/short exposure and slight reversible injury; WITH_RULES_RAG overweights separate cues claiming scarcely possible avoidance. CoT filters these and restores P1.
Table A4. o3-mini: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Condensed) | Key Retrieved Cue
ME_006 | b | b | c | S1 (slight), F2, P2 (procedures often disregarded) ⇒ PLr c (pred.) | “…operators performing cleaning/setup tasks… frequent-to-continuous exposure with long duration; possibility scarcely possible…”
Table A5. o3-mini: rescue cases (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Condensed) | Key Retrieved Cue
ME_010 | d | e | d | S2 (serious), F2, P1 (avoidance via checks/procedures) ⇒ PLr d | “…maintenance tasks; acceleration/deceleration; frequent-to-continuous exposure; serious injury or death…”
EL_006 | c | d | c | S2, F1, P1 (basic lockout/spacing) ⇒ PLr c | “…exposure to live electrical parts due to insufficient distance… short exposure; avoidance possible under conditions…”
TH_004 | b | c | b | S1 (minor burns), F2, P1 (procedural avoidance) ⇒ PLr b (pred.) | “…hot surfaces/heat sources; frequent-to-continuous exposure…”
NO_007 | d | e | d | S2, F1, P1 (controls allow avoidance) ⇒ PLr d (pred.) | “…shockwave/noise; tasks seldom-to-less-often with short exposure; avoidance possible under specific conditions…”
VI_001 | d | e | d | S2 (HAVS), F2, P1 (PPE/procedures enable avoidance) ⇒ PLr d | “…unbalanced rotating parts; frequent-to-continuous exposure; consequence framed as tiredness in some neighbors…”
VI_010 | d | e | d | S2, F2, P1 (breaks/controls) ⇒ PLr d | “…vibrating equipment during normal operations; frequent-to-continuous exposure; discomfort examples…”
Table A6. o4-mini: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Summary) | Retrieved Cue (Excerpt)
ME_004 | d | d | e | S2 (irreversible) + F2 (frequent/prolonged) + P2 (blind spots ⇒ avoidance scarcely possible) ⇒ PLr = e. | “In industrial environments, operators performing cleaning tasks may encounter moving elements that can lead to drawing-in or trapping. These tasks are characterized by frequent-to-continuous exposure with long duration, and the potential consequence is serious injury or death.”
TH_005 | d | d | e | Liquid metal splash: S2 + F2 + P2 (PPE inconsistency interpreted as scarce avoidance) ⇒ PLr = e. | “In industrial settings, operators performing cleaning tasks may encounter radiation from heat sources, which can lead to scald injuries. The tasks are characterized by frequent-to-continuous exposure with long duration, and the likelihood of occurrence is scarcely possible.”
Table A7. o4-mini: rescue cases (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Summary) | Retrieved Cue (Excerpt)
VI_004 | b | c | b | S1 (slight, reversible joint pain) + F2 (frequent) + P1 (avoidance possible with footwear) ⇒ PLr = b; aligns with similar EX2 (second retrieved neighbour). | “In industrial settings, operators performing normal operation tasks may encounter vibrating equipment, which can lead to discomfort. These tasks are characterized by frequent-to-continuous exposure with long duration, and the potential consequence is slight, normally reversible injuries.”
Table A8. Claude-opus-4-1: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Abridged) | Retrieved Cue (Salient Phrases)
ME_004 | d | d | e | S2 (irreversible injury/death) + F2 (frequent) + P2 (blind spots ⇒ avoidance scarcely possible) ⇒ PLr e. | “frequent-to-continuous exposure; long duration” (F2); “serious injury or death” (S2); generic trapping/drawing-in phrasing that nudges P2 escalation
TH_004 | b | b | c | S1 (minor reversible burns) + F2 (frequent/prolonged in kitchen) + P2 (distraction-prone, avoidance scarce) ⇒ PLr c. | “flames … discomfort” (S1); mixed frequency: “seldom-to-less-often … short exposure” (F1) vs. “scarcely possible” (P2); conflicting F/P cues bias the chain toward P2
Table A9. Claude-opus-4-1: rescue cases (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Abridged) | Retrieved Cue (Salient Phrases)
ME_005 | d | e | d | S2 (serious) + F2 (loading operations frequent/prolonged) + P1 (procedural avoidance feasible) ⇒ PLr d. | “frequent-to-continuous … long exposure” (F2); “serious injury or death” (S2); absence of explicit “scarcely possible” cue; CoT reasserts P1 per the ISO risk graph
Table A10. Gemini-2.5-flash: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Abridged) | Retrieved Cue (Salient Phrases)
TH_005 | d | d | e | S2 (irreversible burns/death) + F2 (frequent, long duration) + P2 (“scarcely possible” avoidance) ⇒ PLr e. | “frequent-to-continuous exposure; long duration” (F2); “scald injuries; serious injury or death” (S2); “likelihood scarcely possible” ⇒ CoT upgrades P to P2
NO_002 | b | b | c | S1 (reversible hearing loss) + F2 (frequent) + P2 (avoidance “scarcely possible” without consistent PPE) ⇒ PLr c. | “moving parts … permanent hearing loss” (pushes toward S2); “frequent-to-continuous exposure; long durations” (F2); “possibility scarcely possible” ⇒ CoT asserts P2
Note. The confusion mechanism here is P-inflation driven by P2-leaning neighbor phrases.
Table A11. Gemini-2.5-flash: rescue case (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Abridged) | Retrieved Cue (Salient Phrases)
ME_004 | d | e | d | S2 (irreversible injury) + F2 (frequent/prolonged) + P1 (avoidance possible under procedures) ⇒ PLr d. | “drawing-in/trapping; serious injury or death” (S2); “frequent-to-continuous exposure; long duration” (F2); CoT filters the P2-leaning cue and reinstates procedural P1
Note. Rescue mechanism: CoT + Rules counteracts a P2-leaning neighbor and re-anchors on rule-consistent P1, correcting e → d on ME_004.

Appendix A.2. Mechanism-Level Inferences (From Confusion/Rescue Tables)

Dominant Confounder: P-Inflation

Across models and variants, the most frequent failure mode is an upward shift in the possibility-of-avoidance parameter (P: P1 → P2). This P-inflation is systematically triggered by RAG neighbors containing phrases such as “scarcely possible (avoidance)” and “frequent-to-continuous exposure; long duration,” and by appeals to inconsistent PPE usage. Under ISO 13849-1, P reflects task-intrinsic avoidability (e.g., geometry, speed, warning time), not compliance variability; therefore, “PPE inconsistency” is not sufficient to set P = 2. In the confusion cases, CoT + Rules + RAG internalizes these cues and escalates PLr (typically b → c or d → e); the corresponding Rules (+RAG) baselines remain stable.

Secondary Confounder: F-Drift

A smaller but consistent effect is F-drift (F1 → F2) driven by retrieved snippets over-emphasizing frequency/duration (“frequent-to-continuous,” “long exposure”). This pushes borderline decisions across class boundaries and often co-occurs with P-inflation.

Severity Cues Are Rarely Decisive

Severity (S) sometimes drifts upward (e.g., “permanent hearing loss,” “serious injury or death”), but flips are usually explained by P (and secondarily F) rather than S. When S contributes, it amplifies a decision already biased by P or F.

Rescue Mechanism: Rule-Consistent Re-Anchoring of P (and F)

In rescue cases, CoT + Rules corrects RAG-induced errors by explicitly reasserting P1 (procedural avoidability: breaks, footwear, checks, spacing/lockout) and, where needed, restoring F1 (short/seldom exposure). This re-anchoring is most prominent for compact models (e.g., o3-mini: rescues ≫ confusions), mixed for GPT-5-mini/o4-mini, and uncommon for Claude/DeepSeek where CoT tends to overweight P2-leaning neighbors.

Where Flips Concentrate

Misclassifications cluster at adjacent boundaries b → c and d → e, exactly where a one-step change in P or F is sufficient to cross the ISO risk graph threshold. Safety-critical coverage (PLr class e) remains robust under rule-only prompting; degradations emerge primarily when CoT is combined with RAG and P-inflation occurs.

Actionable Mitigations

  • Structural retrieval governance: Admit neighbors whose (S, F, P) are consistent with the provisional decision (e.g., within one step), and down-rank snippets containing “scarcely possible” when task geometry/procedures imply P1.
  • Conflict-aware inference: If retrieved neighbors disagree on P or F, prefer rule-only aggregation or escalate d / e -boundary cases to human review.
  • Model–prompt matching: Default to WITH _ RULES / WITH _ RULES _ RAG for families where CoT+RAG confuses (e.g., Claude/DeepSeek/Gemini); enable COT _ WITH _ RULES _ RAG only where rescue > confusion is empirically observed (e.g., o3-mini).

Limitations and Scope

These inferences are drawn from r = 5 repeated runs per condition with 95% t-intervals; residual between-run variability reflects provider-side reasoning heuristics despite deterministic decoding. While the tables provide qualitative traces and retrieved-neighbor excerpts, an in-depth (S, F, P) accuracy audit on a larger dataset (N = 1000 or N = 2000) and retrieval ablations are planned as future work to quantify each mechanism’s marginal effect.

References

  1. ISO 12100:2010; Safety of Machinery: General Principles for Design: Risk Assessment and Risk Reduction. ISO: Geneva, Switzerland, 2010. Available online: https://www.iso.org/standard/51528.html (accessed on 31 January 2025).
  2. ISO 13849-1:2023; Safety of Machinery—Safety-Related Parts of Control Systems—Part 1: General Principles for Design. ISO: Geneva, Switzerland, 2023. Available online: https://www.iso.org/standard/73481.html (accessed on 5 May 2025).
  3. IFA Report 2/2017e Functional Safety of Machine Controls—Application of EN ISO; Deutsche Gesetzliche Unfallversicherung: Berlin, Germany, 2019.
  4. European Parliament and Council. Regulation (EU) 2023/1230 of the European Parliament and of the Council of 14 June 2023 on machinery and repealing Directive 2006/42/EC of the European Parliament and of the Council and Council Directive 73/361/EEC. Off. J. Eur. Union 2023, L 165, 1–102.
  5. Iyenghar, P.; Hu, Y.; Kieviet, M.; Pulvermüller, E.; Wübbelmann, J. AI-Based Assistant for Determining the Required Performance Level for a Safety Function. In Proceedings of the 48th Annual Conference of the IEEE Industrial Electronics Society (IECON 2022), Brussels, Belgium, 17–20 October 2022; pp. 1–6.
  6. Iyenghar, P.; Kieviet, M.; Pulvermüller, E.; Wübbelmann, J. A Chatbot Assistant for Reducing Risk in Machinery Design. In Proceedings of the 21st IEEE International Conference on Industrial Informatics (INDIN 2023), Lemgo, Germany, 18–20 July 2023; pp. 1–8.
  7. Iyenghar, P. On the Development and Application of a Structured Dataset for Data-Driven Risk Assessment in Industrial Functional Safety. In Proceedings of the 21st IEEE International Conference on Factory Communication Systems (WFCS 2025), Rostock, Germany, 10–13 June 2025; pp. 1–8.
  8. Iyenghar, P. Evaluating LLM Prompting Strategies for Industrial Functional Safety Risk Assessment. In Proceedings of the 8th IEEE International Conference on Industrial Cyber-Physical Systems (ICPS 2025), Emden, Germany, 12–15 May 2025; pp. 1–4.
  9. Gemini 2.0-Flash. Available online: https://deepmind.google/technologies/gemini/flash/ (accessed on 30 January 2025).
  10. Iyenghar, P.; Zimmer, C.; Gregorio, C. A Feasibility Study on Chain-of-Thought Prompting for LLM-Based OT Cybersecurity Risk Assessment. In Proceedings of the 8th IEEE International Conference on Industrial Cyber-Physical Systems (ICPS 2025), Emden, Germany, 12–15 May 2025; pp. 1–4.
  11. Nouri, M.; Karakostas, D.; Hummel, L.; Pretschner, A. Automating Automotive Hazard Analysis and Risk Assessment with Large Language Models: Opportunities and Limitations. arXiv 2024, arXiv:2401.07791.
  12. Qi, Z.; Wang, C.; Zhang, M.; Ma, Y.; Xie, B. Can ChatGPT Help with System Theoretic Process Analysis? A Pilot Study. In Proceedings of the 2025 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Sao Paulo, Brazil, 4–7 November 2025; pp. 1–7.
  13. Collier, D.; Vincent, K.; King, J.; Griffiths, D.; Marshall, Y.; Wronska, K. Evaluating Large Language Models for Consumer Product Safety Risk Assessment. Saf. Sci. 2024, 176, 107083.
  14. Diemert, E.; Weber, G. CoHA: Collaborating with ChatGPT for Hazard Analysis. In Proceedings of the 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), Dublin, Ireland, 16–20 April 2023; pp. 139–146.
  15. Sammour, M.; Kreahling, W.C.; Padgett, J.; Ammann, P. Performance of GPT-3.5 and GPT-4 on the Certified Safety Professional Exam: An Exploratory Study. Saf. Sci. 2024, 182, 108002.
  16. Iyenghar, P. Clever Hans in the Loop? A Critical Examination of ChatGPT in a Human-In-The-Loop Framework for Machinery Functional Safety Risk Analysis. Eng 2025, 6, 31.
  17. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Le, Q.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
  18. Kojima, T.; Gu, S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213.
  19. Saparov, A.; He, H. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
  20. Schaeffer, R.; Pistunova, K.; Khanna, S.; Consul, S.; Koyejo, S. Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting. arXiv 2023, arXiv:2307.10573.
  21. Kambhampati, S.; Stechly, K.; Valmeekam, K.; Saldyt, L.; Bhambri, S.; Palod, V.; Gundawar, A.; Samineni, S.R.; Kalwar, D.; Biswas, U. Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! arXiv 2025, arXiv:2504.09762.
  22. Stechly, K.; Valmeekam, K.; Gundawar, A.; Palod, V.; Kambhampati, S. Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens. arXiv 2025, arXiv:2505.13775.
  23. Chen, Y.; Benton, J.; Radhakrishnan, A.; Uesato, J.; Denison, C.; Schulman, J.; Somani, A.; Hase, P.; Wagner, M.; Roger, F.; et al. Reasoning Models Don’t Always Say What They Think. arXiv 2025, arXiv:2505.05410.
  24. Shojaee, P.; Mirzadeh, I.; Alizadeh, K.; Horton, M.; Bengio, S.; Farajtabar, M. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. arXiv 2025, arXiv:2506.06941.
  25. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 9459–9474. [Google Scholar]
  26. Xue, Z.; Wu, X.; Li, J.; Zhang, P.; Zhu, X. Improving Fire Safety Engineering with Retrieval-Augmented Large Language Models. Fire Technol. 2025, 61, 1281–1301. [Google Scholar]
  27. Meng, Y.; Jiang, F.; Qi, Z. Retrieval-Augmented Generation for Human Health Risk Assessment: A Case Study. In Proceedings of the 2025 International Conference on Artificial Intelligence in Toxicology (AITOX), Beijing, China, 15–18 October 2025; pp. 101–110. [Google Scholar]
  28. Hillen, T.; Eisenhauer, M. LASAR: LLM-Augmented Hazard Analysis for Automotive Risk Assessment. In Proceedings of the SAFECOMP, Florence, Italy, 17 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 143–154. [Google Scholar]
  29. Guha, N.; Hu, D.E.; Hendry, L.; Li, N.; Meng, L.; Nanda, S.; Nori, R.; Shardlow, M.; Shoberg, J.; Soni, A.; et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in LLMs. arXiv 2023, arXiv:2308.11462. [Google Scholar]
  30. Khandekar, N.; Shen, C.; Mian, Z.; Wang, Z.; Kim, J.; Sriram, A.; Hu, H.; Shah, N.; Patel, R. MedCalc-Bench: Evaluating Large Language Models for Medical Calculations. Adv. Neural Inf. Process. Syst. 2024, 37, 84730–84745. [Google Scholar]
  31. Wang, J.; Wang, M.; Zhou, Y.; Xing, Z.; Liu, Q.; Xu, X.; Zhang, W.; Zhu, L. LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements. arXiv 2025, arXiv:2505.22959. [Google Scholar] [CrossRef]
  32. Sandmann, S.; Hegselmann, S.; Fujarski, M.; Bickmann, L.; Wild, B.; Eils, R.; Varghese, J. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 2025; epub ahead of print. [Google Scholar] [CrossRef] [PubMed]
  33. Araya, R. Do Chains-of-Thoughts of Large Language Models Suffer from Hallucinations, Cognitive Biases, or Phobias in Bayesian Reasoning? arXiv 2025, arXiv:2503.15268. [Google Scholar]
  34. LangChain Inc. LangGraph: Agentic Workflows for LLM Applications. 2024. Available online: https://www.langchain.com/langgraph (accessed on 3 July 2025).
  35. Chase, H. LangChain: Building Applications with LLMs Through Composability. 2022. Available online: https://www.langchain.com (accessed on 3 July 2025).
  36. Chroma Team. Chroma: The AI-Native Open-Source Vector Database. 2023. Available online: https://www.trychroma.com (accessed on 3 July 2025).
  37. Iyenghar, P. Comprehensive Curated Dataset of Hazard Scenarios Systematically Generated Based on Annex B of ISO 12100 and PLr Assigned Based on ISO. GitHub Repository, 2025. Available online: https://github.com/piyenghar/hazardscenariosISO12100AnnexB (accessed on 4 July 2025).
  38. OpenAI. Chat Completions Format API. 2024. Available online: https://platform.openai.com/docs/guides/text (accessed on 21 August 2025).
  39. Zhao, P.; Zhang, H.; Yu, Q.; Wang, Z.; Geng, Y.; Fu, F.; Yang, L.; Zhang, W.; Jiang, J.; Cui, B. Retrieval-Augmented Generation for AI-Generated Content: A Survey. arXiv 2024, arXiv:2402.19473. [Google Scholar]
  40. Anthropic. Claude Opus 4.1. 2025. Available online: https://www.anthropic.com/news/claude-opus-4-1 (accessed on 21 August 2025).
  41. DeepSeek. Reasoning Model (Deepseek-Reasoner). 2025. Available online: https://api-docs.deepseek.com/guides/reasoning_model (accessed on 21 August 2025).
  42. Google. Gemini Models—Gemini API. 2025. Available online: https://ai.google.dev/gemini-api/docs/models (accessed on 21 August 2025).
  43. Google Cloud. Gemini 2.5 Flash—Vertex AI. 2025. Available online: https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash (accessed on 21 August 2025).
  44. Google. Gemini Thinking. 2025. Available online: https://ai.google.dev/gemini-api/docs/thinking (accessed on 21 August 2025).
  45. OpenAI. Model: GPT-5 mini—OpenAI API. 2025. Available online: https://platform.openai.com/docs/models/gpt-5-mini (accessed on 21 August 2025).
  46. OpenAI. Using GPT-5. 2025. Available online: https://platform.openai.com/docs/guides/latest-model (accessed on 21 August 2025).
  47. OpenAI. OpenAI o3-mini. 2025. Available online: https://openai.com/index/openai-o3-mini/ (accessed on 21 August 2025).
  48. OpenAI. Model: o4-mini—OpenAI API. 2025. Available online: https://platform.openai.com/docs/models/o4-mini (accessed on 21 August 2025).
  49. OpenAI. Reasoning Models—OpenAI API. 2025. Available online: https://platform.openai.com/docs/guides/reasoning (accessed on 21 August 2025).
  50. Anthropic. Pricing. Available online: https://www.anthropic.com/pricing (accessed on 21 August 2025).
  51. Google DeepMind & Google. Gemini Developer API Pricing. Available online: https://ai.google.dev/gemini-api/docs/pricing (accessed on 21 August 2025).
  52. DeepSeek. Pricing Details (USD). Available online: https://api-docs.deepseek.com/quick_start/pricing-details-usd (accessed on 21 August 2025).
  53. OpenAI. API Pricing. Available online: https://openai.com/api/pricing/ (accessed on 21 August 2025).
Figure 1. The iterative process of risk assessment and risk reduction [3].
Figure 2. Risk graph from ISO 13849-1: Annex A [2].
Figure 3. Hybrid RAG pipeline used in the PLr experiment framework: vector over-retrieval → semantic gate (≥0.70) → lexical Jaccard filter (≥0.30) → optional S/F/P gate → deduplicate/rank by J and keep top M → package as {rag_examples} and inject into the RAG chains.
Figure 4. Accuracy comparison with 95% t-intervals across runs for Variant 1 (non-RAG) across six models and four prompting strategies.
Figure 5. Average processing time with 95% t-intervals across runs for Variant 1 (non-RAG).
Figure 6. Total execution time with 95% t-intervals across runs for Variant 1 (non-RAG).
Figure 7. Average accuracy heatmap for Variant 1 across models and non-RAG prompts.
Figure 8. Macro-F1 comparison by model and non-RAG prompting strategy for Variant 1.
Figure 9. Micro-F1 comparison by model and non-RAG prompting strategy for Variant 1.
Figure 10. Weighted-F1 comparison by model and non-RAG prompting strategy for Variant 1.
Figure 11. Per-class F1 comparison by model and non-RAG prompting strategy for Variant 1. (a) PLr classes a–c; (b) PLr classes d–e.
Figure 12. Per-class Precision comparison by model and non-RAG prompting strategy for Variant 1. (a) PLr classes a–c; (b) PLr classes d–e.
Figure 13. Per-class Recall comparison by model and prompting strategy. (a) PLr classes a–c; (b) PLr classes d,e.
Figure 14. Recall for PLr class e (highest safety requirement) by model and non-RAG prompting strategy for Variant 1.
Figure 15. Macro-F1 for Variant 1 (RAG-based prompts).
Figure 16. Micro-F1 for Variant 1 (RAG-based prompts).
Figure 17. Per-class F1 for Variant 1 (RAG-based prompts). (a) PLr classes a–c; (b) PLr classes d,e.
Figure 18. Recall for safety-critical class E (PLr class e) on Variant 1.
Figure 19. Weighted-F1 for Variant 1 (RAG-based prompts).
Figure 20. Accuracy on Variant 2. Bars show the mean across repeated runs for each (model, prompt) pair and error bars show the 95% t-interval across runs.
Figure 21. Average processing time per sample on Variant 2. Bars are run means; error bars are 95% t-intervals across runs.
Figure 22. Total execution time on Variant 2. Bars are run means; error bars are 95% t-intervals across runs.
Figure 23. Macro-F1 comparison on Variant 2. Rules ensure stability across minority PLr classes.
Figure 24. Micro-F1 (accuracy-equivalent) comparison on Variant 2.
Figure 25. Per-class F1 score comparison for Variant 2 without RAG. (a) PLr classes a–c, which are most affected by lexical drift without rules; (b) PLr classes d,e, showing stability under rule-grounded prompts.
Figure 26. Per-class precision comparison for Variant 2 without RAG. (a) PLr classes a–c, where lexical drift introduces false positives in non-rule settings; (b) PLr classes d,e, where rule-based prompting prevents over-prediction of dominant classes and sustains precision on safety-critical cases.
Figure 27. Per-class recall comparison for Variant 2 without RAG. (a) PLr classes a–c, where zero-shot prompting produces severe drops in recall, especially for classes b and c; (b) PLr classes d,e, where rule-based prompting maintains near-ceiling recall and preserves safety-critical detection.
Figure 28. Recall for safety-critical PLr class e on Variant 2. Rules maintain stability across models.
Figure 29. Weighted-F1 comparison on Variant 2. Without rules, minority-class weighting exposes vulnerabilities.
Figure 30. Variant 2 with RAG: Accuracy across six models for WITH_RULES_RAG and COT_WITH_RULES_RAG. Bars show mean accuracy; error bars denote 95% t-intervals across repeated runs.
Figure 31. Variant 2 with RAG: average processing time per sample (mean with 95% t-intervals).
Figure 32. Variant 2 with RAG: total execution time (mean with 95% t-intervals).
Figure 33. Variant 2: Macro-F1 (Top) and Micro-F1 (Bottom) comparison across models.
Figure 34. Variant 2 with RAG: Per-class F1 across PLr classes. (a) PLr classes a–c; (b) PLr classes d,e.
Figure 35. Variant 2 with RAG: Per-class Precision across PLr classes. (a) PLr classes a–c; (b) PLr classes d,e.
Figure 36. Variant 2 with RAG: Per-class Recall across PLr classes. (a) PLr classes a–c; (b) PLr classes d,e.
Figure 37. Variant 2: Recall for PLr class e (Top) and Weighted-F1 across all classes (Bottom).
Table 1. RAG controls used in all runs unless stated otherwise.

Parameter | Default
Semantic over-retrieval (k) | 20 (take a larger superset before filtering)
Lexical Jaccard (τ) | keep if J(A, B) ≥ 0.30
S/F/P gate | require_sfp_exact = false
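
Combining these controls with the pipeline in Figure 3, the semantic-then-lexical gate can be sketched as follows; the candidate schema, helper names, the choice M = 3, and the assumption that each candidate carries a vector-store similarity score are illustrative, not the framework’s actual implementation:

```python
# Illustrative sketch of the hybrid retrieval gate (cf. Figure 3 and Table 1).
def jaccard(a: str, b: str) -> float:
    """Lexical Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| over token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def gate(query: str, candidates: list, sem_min=0.70, j_min=0.30, top_m=3):
    """Semantic gate -> lexical Jaccard filter -> dedupe/rank by J -> top M."""
    scored = [
        (jaccard(query, c["text"]), c)
        for c in candidates
        if c["semantic_score"] >= sem_min  # semantic gate on vector similarity
    ]
    scored = [(j, c) for j, c in scored if j >= j_min]  # lexical Jaccard filter
    scored.sort(key=lambda jc: jc[0], reverse=True)     # rank by J, descending
    seen, ranked = set(), []
    for j, c in scored:                                 # deduplicate by text
        if c["text"] not in seen:
            seen.add(c["text"])
            ranked.append(c)
    return ranked[:top_m]  # packaged downstream as {rag_examples}

examples = gate(
    "worker reaches into press during die change",
    [{"text": "operator reaches into press during tool change", "semantic_score": 0.82},
     {"text": "forklift battery charging area ventilation", "semantic_score": 0.55}],
)
# Only the first candidate survives: the second fails the semantic gate (0.55 < 0.70).
```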
Table 2. Per-model accuracy and confusion/rescue counts on Variant 2.

Model | Acc (RAG) | Acc (CoT + RAG) | ConfuseCount | RescueCount
DeepSeek-reasoner | 0.98 | 0.92 | 3 | 0
GPT-5-mini | 0.94 | 0.94 | 2 | 2
o3-mini | 0.72 | 0.82 | 1 | 6
o4-mini | 0.96 | 0.94 | 2 | 1
Claude-opus-4-1 | 0.96 | 0.82 | 8 | 1
Gemini-2.5-flash | 0.92 | 0.92 | 3 | 1