Article

Empirical Evaluation of Reasoning LLMs in Machinery Functional Safety Risk Assessment and the Limits of Anthropomorphized Reasoning

Innotec GmbH-TÜV Austria Group, Hornbergstrasse 45, 70794 Filderstadt, Germany
Electronics 2025, 14(18), 3624; https://doi.org/10.3390/electronics14183624
Submission received: 25 July 2025 / Revised: 23 August 2025 / Accepted: 1 September 2025 / Published: 12 September 2025

Abstract

Transparent reasoning and interpretability are essential for AI-supported risk assessment, yet it remains unclear whether large language models (LLMs) can provide reliable, deterministic support for safety-critical tasks or merely simulate reasoning through plausible outputs. This study presents a systematic, multi-model empirical evaluation of reasoning-capable LLMs applied to machinery functional safety, focusing on Required Performance Level (PLr) estimation as defined by ISO 13849-1 and ISO 12100. Six state-of-the-art models (Claude-opus, o3-mini, o4-mini, GPT-5-mini, Gemini-2.5-flash, DeepSeek-Reasoner) were evaluated across six prompting strategies and two dataset variants: canonical ISO-style hazards (Variant 1) and engineer-authored free-text scenarios (Variant 2). Results show that rule-grounded prompting consistently stabilizes performance, achieving ceiling-level accuracy in Variant 1 and restoring reliability under lexical variability in Variant 2. In contrast, unconstrained chain-of-thought (CoT) reasoning and CoT combined with retrieval-augmented generation (RAG) introduce volatility, overprediction biases, and model-dependent degradations. Safety-critical coverage was quantified through per-class F1 and recall of PLr class e, confirming that only rule-grounded prompts reliably captured rare but high-risk hazards. Latency analysis demonstrated that rule-only prompts were both the most accurate and the most efficient, while CoT strategies incurred 2–10× overhead. A confusion/rescue analysis of retrieval interactions further revealed systematic noise mechanisms such as P-inflation and F-drift, showing that retrieval can either destabilize or rescue cases depending on model family. Intermediate severity/frequency/possibility (S/F/P) reasoning steps were found to diverge from ISO-consistent logic, reinforcing critiques that LLM “reasoning” reflects surface-level continuation rather than genuine inference. All reported figures include 95% confidence intervals: t-intervals across runs (r = 5) for accuracy and timing, and class-stratified bootstrap CIs for Micro-/Macro-/Weighted-F1 and per-class metrics. Overall, this study establishes a rigorous benchmark for evaluating LLMs in functional safety workflows such as PLr determination. It shows that deterministic, safety-critical classification requires strict rule-constrained prompting and careful retrieval governance rather than reliance on assumed model reasoning abilities.

1. Introduction

The increasing complexity of industrial machinery, coupled with stringent regulatory demands, has intensified the need for reliable and structured functional safety risk assessment methods. Among these, the estimation of the Required Performance Level (PLr) is a critical step, directly impacting the design and validation of safety-related control systems in accordance with standards such as ISO 12100 [1] and ISO 13849-1 [2]. PLr estimation is inherently deterministic: a given hazard scenario is systematically evaluated against defined risk parameters, such as severity, frequency of exposure, and possibility of avoidance, and predefined classification rules are then applied to assign the PLr in accordance with formalized procedures [1,2,3].
On the other hand, the advent of large language models (LLMs) equipped with advanced reasoning capabilities has reshaped the landscape of Artificial Intelligence (AI)-assisted decision support. Models employing structured reasoning mechanisms, such as chain-of-thought (CoT) prompting or retrieval-augmented generation (RAG), have shown potential in complex problem-solving and domain-specific language tasks. These developments have led to speculative interest in whether such reasoning models could extend their utility to structured industrial domains, including safety-critical classification tasks like PLr estimation. However, this potential remains largely unexplored, with no prior work systematically validating reasoning models for deterministic, safety-critical tasks such as PLr estimation, and no empirical evidence yet assessing their reliability in this domain.
In safety-critical risk assessment, transparent reasoning is essential for both regulatory compliance and human-understandable justification of classifications. A central open question is whether reasoning-capable LLMs can provide genuinely interpretable, rule-consistent decision support, or merely generate plausible but unreliable outputs. Prior studies have focused on open-domain inference or problem-solving benchmarks, without systematically addressing deterministic tasks that require strict rule adherence and minimal tolerance for ambiguity. Moreover, the known tendency of LLMs to produce “reasoning illusions”, outputs that appear logically structured yet lack factual correctness, raises significant concerns about their reliability in functional safety contexts.

1.1. Identified Research Gaps

This study is motivated by three key research gaps:
  • Gap 1: Lack of empirical evaluation of reasoning models in deterministic, safety-critical PLr classification.
  • Gap 2: Underexplored impact of reasoning biases and hallucinations in PLr estimation.
  • Gap 3: Absence of structured benchmarking methodologies and empirical evaluation for reasoning models in functional safety risk assessment.

1.2. Research Questions

To guide the study and systematically address the identified gaps, the following research questions are formulated:
  • RQ1: Can reasoning models reliably perform structured risk classification tasks, such as PLr estimation, within the constraints of functional safety standards?
  • RQ2: What limitations do reasoning models exhibit when applied to deterministic classification problems in the domain of machinery risk assessment?
  • RQ3: How does the reliance on structured prompting affect the consistency and validity of reasoning model outputs in functional safety contexts?
These questions guide the experimental evaluation, focusing on the empirical behavior of reasoning-capable LLMs under controlled, rule-bound testing scenarios.

1.3. Contributions

This study makes the following novel contributions:
  • Comprehensive Experimental Benchmarking of Reasoning Models for Functional Safety Risk Classification (RQ1): A systematic evaluation of reasoning-capable LLMs applied to PLr estimation tasks is presented, utilizing diverse prompting strategies including zero-shot, CoT, rule-based prompting, and RAG-augmented reasoning.
  • Empirical Analysis of Reasoning Biases and Hallucination Effects in Structured Classification (RQ2): The study analyzes error patterns, misclassification tendencies, and reasoning-induced biases exhibited by LLMs in deterministic classification tasks (e.g., P-inflation, F-drift, redundancy, and mislabeling across PLr classes).
  • Identification of Methodological Considerations for LLM Deployment in Safety-Critical Applications and Future Benchmarking (RQ3): Practical implications are highlighted, including the necessity of structured prompting and the risks of anthropomorphizing LLM reasoning capacities (i.e., attributing human-like reasoning abilities to models based on their language output) in domains with strict correctness requirements. The findings also establish a basis for future research on the applicability and limitations of AI reasoning models in structured industrial domains, contributing to the development of scientifically grounded benchmarking methodologies.
Note that this study focuses on the empirical validation of structured prompting strategies for deterministic risk classification tasks in machinery functional safety. The objective is to provide practical benchmarking evidence for Artificial Intelligence (AI) deployment in regulated industrial domains, rather than to critique the reasoning capabilities of LLMs in general.
The remainder of this paper is organized as follows. Section 2 reviews background and related work. Section 3 describes the experimental design and evaluation metrics. Section 4 provides a detailed analysis of the experimental results. Section 5 concludes the paper.

2. Background and Related Work

This section provides the necessary background on functional safety risk assessment, with a focus on machinery domains governed by ISO 12100 and ISO 13849 standards. Section 2.1 outlines the principles of hazard identification, risk estimation, and PLr determination. Building on this foundation, the related work review examines prior research on AI-assisted risk assessment. The section further analyzes critical findings on LLM reasoning capabilities, limitations of chain-of-thought prompting, and the risks of anthropomorphizing model outputs. Finally, it discusses the role of retrieval-augmented generation, prompt engineering, and domain-specific benchmarking in ensuring reliable AI performance in deterministic, safety-critical applications.

2.1. Machinery Functional Safety Risk Assessment

Machinery safety is a global concern for suppliers and manufacturers, governed by a comprehensive regulatory framework designed to protect individuals, assets, and property from harm. Most industrial environments feature complex machinery assemblies that must comply with relevant safety regulations. The Machinery Regulation (EU) 2023/1230, adopted on 14 June 2023, reinforces the core compliance requirement that each machine undergo a structured hazard analysis and risk assessment—a systematic evaluation of potential hazards based on severity, exposure frequency, and the possibility of avoidance—as an essential part of conformity assessment for CE marking [4].
ISO 12100 [1] is an international standard that specifies basic terminology, principles, and methodology for achieving safety in the design of machinery. It specifies principles of risk assessment and risk reduction to help designers achieve this objective. Procedures are described for identifying hazards and for estimating and evaluating risks during relevant phases of the machine life cycle. In this context, a hazard is something that can potentially cause harm, and risk is the combination of the probability and severity of that harm. For example, a sharp part is a hazard; when it is in an exposed position, it creates a risk.

2.1.1. Risk Assessment

The iterative process of risk assessment and reduction is shown in Figure 1. A risk assessment follows a series of logical steps to identify and examine any potential hazards associated with machinery. The process starts with hazard identification within the machine’s space, time, and usage limits. The risk associated with each hazard is then estimated using the risk elements of harm severity (S), occurrence frequency (F), and avoidance or limitation possibility (P). Based on the information obtained, the risk is then evaluated for acceptability. If it is not acceptable, risk reduction measures are required. The whole process is called risk assessment. Iteration of this process can be necessary to eliminate hazards as far as practicable and to adequately reduce risks by the implementation of protective measures. Protective measures play an important role in risk reduction. Such measures include protection devices and safety controls, the combination of which is called the Safety-Related Part of the Control System (SRP/CS) [2,3].

2.1.2. Safety Function

Safety functions (SFs) are machine functions whose failure causes an immediate increase in risk [2]. A single SF can be implemented by multiple SRP/CSs, and a single SRP/CS may implement multiple SFs, such as prevention of unexpected start-up and enforcement of safety limits on parameters such as temperature and pressure. For example, the control system shuts down the furnace fire when the boiler pressure reaches a dangerous value. If this function fails, excessive pressure will lead to an explosion. In this scenario, safety depends on the SRP/CS performing the correct function. Each SF is tasked with reducing the risk of one or more hazardous events, so each hazard and its corresponding SF must be considered in the design. The risk assessment results determine the PLr value for the safety function.

2.1.3. Required Performance Level (PLr)

PLr is the risk reduction expectation required for the implementation of an SF and can be determined by the risk graph shown in Figure 2. A risk graph is a grading-based risk estimation method with parameters S, F and P corresponding to the severity of harm, the duration or frequency of operator exposure to the hazard area, and the possibility of avoiding the hazard, respectively [3]. The result represents the level of risk without protection from the safety system. It is also used to determine the performance level of the safety function that is needed to reduce the risk to a permissible level. Figure 2 shows the structure of the risk graph.
The severity (S) of harm is divided into S1 (slight) and S2 (serious). Only slight reversible harm, serious irreversible harm, and fatalities are considered when estimating the levels of harm [3]. The normal recovery process is generally used as the basis for evaluating the severity of harm to people: slight harm is usually recoverable, while serious harm is not. For example, fatigue and slipping are categorized as S1, while amputation and death are categorized as S2.
The frequency (F) or time of exposure to the hazard is classified as F1 (seldom or short time) and F2 (frequent or long time). This parameter is a measure of the time spent in the danger zone. As per [2], when the operator is present in the hazard area more often than once every 15 min, F2 should be selected. For automated processing machines where the operator needs to intervene only once a month, F1 is the obvious choice.
The possibility (P) of avoiding hazards is divided into categories P1 and P2, which are determined based on whether the hazard can be recognized or avoided. If it is possible to avoid an accident under certain circumstances, P1 is chosen; if avoidance is scarcely possible, P2 is chosen. Factors that affect parameter P include the speed with which the hazardous situation leads to harm, any awareness of risk, and the human ability to escape. For example, if the speed of machine operation is limited, potential accidents develop more slowly and the operator has the opportunity to react and leave the zone. As can be seen from the risk graph (Figure 2), combining these parameters increases the risk from low to high (i.e., PLr-a to PLr-e, where PLr-e is the highest level required for an SF and the most expensive to implement).
Please note that, in this study, the LLM is presented with natural language descriptions of machinery hazard scenarios that implicitly or explicitly indicate the three risk parameters defined in ISO 13849, namely severity (S), frequency (F), and possibility of avoidance (P). The model is evaluated for its ability to interpret these factors and accurately classify the corresponding Required Performance Level (PLr), thereby demonstrating its potential to support functional safety risk assessment in accordance with ISO 12100 and ISO 13849-1. This evaluation is essential to determine whether LLMs can reliably replicate expert-level judgment in PLr classification, a prerequisite for integrating AI into scalable, standard-compliant functional safety workflows where manual assessments are often inconsistent, resource-intensive, and difficult to reproduce.
Building on this foundation, the following section reviews related work on AI-assisted safety assessments, rule-based prompt engineering, and retrieval-augmented reasoning in the context of structured decision-making and industrial risk classification.

2.2. AI-Based Risk Assessment in Functional Safety

Functional safety risk assessment in machinery domains is typically governed by ISO 12100 and ISO 13849, which define a logic-driven procedure for estimating PLr based on hazard severity, exposure frequency, and the possibility of avoidance. Traditionally, such assessments rely heavily on expert judgment, limiting scalability and reproducibility. The emergence of new regulations and the growing complexity of machinery systems have amplified the demand for scalable, consistent risk assessment approaches, driving interest in automation and AI-driven methods to support and augment traditional expert analyses.
Early studies explored custom-built AI-based solutions for machinery functional safety risk assessment. The work in [5] introduced a specialized chatbot leveraging rule-based logic and a TextCNN-LSTM architecture, achieving approximately 80% accuracy on internal datasets but showing limited robustness to linguistic variability. Subsequently, [6] presented a chatbot for recommending risk reduction measures aligned with ISO 12100. Although these domain-specific prototypes demonstrated potential, their scalability was constrained by the lack of structured training data and the significant effort required for data curation and model training. Moreover, these solutions predate the advent of LLMs with reasoning capabilities and thus serve as foundational steps toward applying LLMs to structured risk assessment tasks like PLr estimation.
To address the absence of structured datasets aligned with risk assessment standards, the work in [7] introduced an open-access dataset of 7800 annotated machinery hazard scenarios derived from ISO 12100 and paired with PLr values determined according to ISO 13849-1. This resource enables reproducible PLr prediction experiments with state-of-the-art LLMs. Follow-up studies, such as [8], applied zero-shot, rule-based, and retrieval-augmented prompting strategies to general-purpose LLMs [9] on this dataset. The results showed that rule-based prompting with retrieval augmentation outperformed both zero-shot and standard rule-based methods, yet also exposed variability across prompt designs. Further, the work in [10] examined chain-of-thought prompting for OT cybersecurity risk assessment, demonstrating feasibility but limited depth. By contrast, this study provides a systematic and in-depth evaluation of reasoning strategies in the parallel domain of deterministic PLr classification.
Emerging studies have explored the application of LLMs in safety-critical hazard analysis, aiming to automate tasks traditionally reliant on expert judgment. Nouri et al. [11] applied a GPT-4-based pipeline for hazard analysis and risk assessment (HARA) in automotive systems. Their method decomposed the analysis into subtasks (hazard identification, scenario generation, and severity classification), each using tailored prompts. While effective in generating draft assessments, the study emphasized the ongoing necessity of expert validation. Similarly, the study in [12] evaluated ChatGPT’s utility for System-Theoretic Process Analysis (STPA) in automotive braking systems, finding that naive prompting produced poor results, but domain-specific prompts combined with human oversight enabled LLMs to identify hazards with competence comparable to human experts. In the domain of consumer product safety, the work in [13] observed that while ChatGPT could enumerate a broad set of failure scenarios, it frequently provided weak or unsupported risk judgments. These studies highlight both the potential and the limitations of LLMs in structured hazard analysis. They underscore that while LLMs can assist in preliminary risk assessments or scenario generation, their effectiveness in deterministic, rule-bound tasks, such as those required in functional safety, depends on structured prompting, domain adaptation, and expert supervision.
Effective interaction with LLMs in risk analysis tasks often depends on structured, rule-based prompting. Without domain-specific guidance, LLMs may produce inconsistent or misleading outputs. The study in [11] employed format-constrained subtasks and predefined templates to improve output consistency. The work in [14] introduced a co-hazard analysis (CoHA) framework that combined iterative Q&A with ChatGPT and domain rules, enhancing the coverage and creativity of hazard identification. Similarly, the work in [15] further demonstrated that structured prompts significantly influenced LLM accuracy and interpretability on safety certification questions. However, they also noted that no single prompt structure performs optimally across all task types. These studies underscore that while LLMs offer potential for draft hazard identification and scenario generation, they fall short in deterministic, rule-constrained risk assessments. Addressing these limitations requires structured prompting, retrieval augmentation, and expert oversight to mitigate reasoning illusions and ensure alignment with formal safety standards. The work in [16] evaluates the integration of LLMs such as ChatGPT into a human-in-the-loop (HITL) framework for machinery functional safety risk analysis, adhering to ISO 12100. It demonstrates that expert oversight within the HITL framework effectively mitigates LLM limitations such as hallucinations, leading to complete agreement with ground truth across diverse industrial case studies. The study highlights significant gains in efficiency, accuracy, and usability, underscoring the transformative potential of generative AI in safety workflows when rigorous human validation is maintained. However, [16] does not provide a systematic experimental evaluation of LLMs on a comprehensive dataset.

2.3. LLM Capabilities and Limitations in Reasoning Tasks

Early studies showed that prompting LLMs with step-by-step CoT explanations significantly improves their performance on reasoning tasks, including arithmetic and logic problems [17]. Kojima et al. [18] further revealed that even simple zero-shot CoT cues like “Let’s think step by step” could unlock latent reasoning abilities in LLMs across benchmarks such as GSM8K and Big-Bench Hard. These results fueled optimism that, when guided correctly, LLMs might approximate general reasoning skills, with recent models like GPT-4 demonstrating notable problem-solving capabilities across varied domains.
Several recent studies critically examine the assumptions underlying CoT-driven reasoning claims. Saparov et al. [19] introduced the PrOntoQA benchmark and found that LLMs, while capable of generating valid local inference steps, fail at global proof strategies, revealing a tendency toward greedy, heuristic reasoning rather than systematic exploration. Schaeffer et al. [20] provided further evidence, showing that even logically invalid or irrelevant CoT traces can boost model performance on complex tasks, implying that token pattern familiarity rather than logical correctness drives success. These findings suggest that CoT benefits often arise from superficial token dynamics rather than authentic deductive reasoning, raising concerns about overestimating LLM reasoning competence in critical applications.
Multiple recent studies challenge the notion that LLM-generated CoT outputs reflect genuine reasoning. Kambhampati et al. [21] argue that labeling intermediate text as “thoughts” dangerously anthropomorphizes LLMs, masking the fact that CoT may merely improve output through surface token patterns rather than reasoning. Stechly et al. [22] demonstrate that models often produce correct answers despite incoherent or irrelevant intermediate steps, suggesting that CoT traces may result from training artifacts rather than causal reasoning. Furthermore, Chen et al. [23] show that LLMs frequently omit critical reasoning cues from their explanations and that even reward tuning fails to ensure faithful reasoning traces. Collectively, these works highlight that CoT outputs may not transparently reveal model decision-making and thus cannot be trusted for reliable auditing. This undermines the premise of using verbalized reasoning as a safeguard in high-stakes applications, raising critical concerns about the auditability and interpretability of LLM decisions.
Recent studies, such as [24], reveal that LLMs with CoT prompting exhibit a “reasoning cliff”, performing well on simple tasks but suffering abrupt accuracy collapse as task complexity increases. Surprisingly, non-CoT models sometimes outperform verbose CoT-augmented models on low-complexity tasks, while both fail on high-complexity problems due to brittle algorithmic reasoning and lack of compositional generalization. These findings challenge the assumption that longer reasoning traces correlate with genuine reasoning ability, underscoring critical limits of current LLM reasoning capabilities.
In summary, recent research offers a mixed view of LLM reasoning: while CoT and self-consistency improve benchmark scores, many studies suggest this reflects token patterning rather than genuine inference, leaving open the debate between emergent capability and engineered illusion.

2.4. Prompt Engineering, Retrieval-Augmented Generation, and Model Interpretability

One critical mitigation strategy for hallucinations and factual inaccuracies is retrieval-augmented generation (RAG), which grounds the model’s output in external knowledge sources. The work in [25] showed that RAG can significantly improve accuracy on knowledge-intensive QA by injecting relevant documents into the generation process. While RAG alone does not eliminate hallucinations entirely, as models may still misquote or misapply retrieved facts, it significantly enhances factual grounding. Recent applications in safety-critical domains demonstrate its value: the work in [26] integrated Qwen-2.5 with a curated fire safety regulation database to improve factuality in fire engineering queries. The study in [27] applied advanced RAG pipelines to toxicology assessments, achieving higher scientific fidelity through query rewriting and evidence grounding. In the automotive domain, the work in [28] developed the LASAR system, which combines scenario generation with catalog-based retrieval to guide LLMs in hazard analysis and risk assessment (HARA).
In safety-critical applications, interpretability is as vital as accuracy. Well-designed prompts can elicit stepwise reasoning, uncertainty estimates, and justifications from LLMs, aiding human validation and traceability. Studies conducted in [11,28] found that instructing LLMs to explain their severity ratings improved reviewer confidence and auditability. Further, the work in [15] showed that prompt-induced variations could swing model performance by over 13%. Yet, the benefits of added prompt complexity often depend on the task structure, indicating a need for both prompt-task matching and systematic evaluation.
Overall, prompt engineering enhances reliability and interpretability, but cannot ensure factual grounding or regulatory compliance. These limitations underscore the need for rule-based prompting and RAG to enforce alignment with formal safety standards.

2.5. The Role of Domain-Specific Benchmarking in Evaluating LLMs

Beyond task-specific studies, there is a broader recognition that evaluation methodologies for LLMs in high-stakes, rule-bound domains are underdeveloped. Traditional NLP benchmarks rarely capture the requirements of safety-critical decision-making. Recently, researchers have begun devising domain-specific benchmarks to fill this void. In the legal domain, the work in [29] introduced LegalBench, a suite of 162 tasks covering diverse types of legal reasoning to systematically measure LLM performance on jurisprudence problems. Their evaluations of dozens of models revealed substantial gaps between general LLM capabilities and the consistency needed for legal reasoning, reinforcing the need for tailored assessment in regulated domains. In medicine, the work presented in [30] proposed MedCalc-Bench with over 1000 cases requiring medical calculations (e.g., risk scores, dosage) to test LLMs as clinical calculators. Results showed current models often err on precise numeric reasoning in a clinical context, despite performing well on open medical Q&A, highlighting the importance of structured benchmarks for reliability. The safety community is also moving in this direction. The work in [31] presents HSE-Bench, focusing on Health, Safety and Environment compliance questions with multi-step legal reasoning; they find that today’s LLMs rely more on semantic pattern matching than true rule application in compliance assessments. Notably, even the best models’ “reasoning traces lack the systematic legal reasoning required for rigorous HSE compliance,” and performance degrades on complex multi-step scenarios. These domain-specific evaluations underscore a common theme: without structured benchmarks and protocols, it is difficult to gauge an LLM’s trustworthiness for safety-critical tasks. The work in [32] further demonstrates this in a clinical setting by benchmarking an open-source reasoning LLM against GPT-4 on 125 real patient cases. While the open model reached parity on final answers, the study stressed meeting strict regulatory criteria (e.g., explainability, auditable steps) as a key evaluation component. Together, these works illustrate the nascent but growing effort to establish rigorous evaluation frameworks for LLMs operating under domain constraints.
In summary, advances in AI-assisted risk assessment show promise, but persistent limitations remain for deterministic tasks such as PLr estimation. Structured prompting, RAG and HITL validation can improve performance in preliminary hazard analysis [11,16], yet critical studies on reasoning fidelity [21,22,33] indicate that CoT traces often reflect heuristic token continuations rather than genuine deduction. Domain-specific benchmarking efforts [31,32] further highlight the inadequacy of general LLMs for tasks requiring strict rule adherence.
Together, these insights frame the central question this study addresses: Can “so-called” reasoning-capable LLMs reliably perform deterministic risk classification, such as PLr estimation? To address this question, the study empirically evaluates six LLMs across six prompting strategies for PLr estimation, using a structured framework aligned with ISO 12100 and the ISO 13849-1 Annex A qualitative risk graph (cf. Figure 2).

3. Experimental Design

This section outlines the experimental setup devised to systematically evaluate reasoning-capable LLMs on deterministic PLr classification tasks. It describes the benchmark dataset derived from ISO 12100 and ISO 13849 standards, details the prompting strategies applied across six experimental conditions, and explains the selection of six LLMs. The section concludes with the evaluation framework used to analyze classification accuracy, reasoning behavior, computational performance, and error patterns.
The evaluation leverages LangGraph [34] as its core orchestration engine, enabling dynamic prompt routing and evaluation logic. Modular prompting strategies (rule-based, RAG, and hybrid) are implemented using LangChain chains [35] for seamless reconfiguration. High-throughput semantic retrieval of domain-specific hazard precedents for RAG-based scenarios is facilitated by a ChromaDB [36] vector database, primarily sourcing data from the open-source dataset [7,37]. These retrieved contexts are injected via prompt placeholders to assess analogical reasoning performance. This modular setup mirrors realistic deployment conditions while allowing systematic control over prompting strategies, model selection, and input variants.
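To make this orchestration concrete, the following minimal sketch shows a two-node LangGraph pipeline (retrieval followed by classification); the state fields and node bodies are illustrative placeholders, not the production graph.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

# Illustrative two-node orchestration sketch: a retrieval node populates the
# exemplar context, a classification node assigns the PLr via an LLM call.
class PLrState(TypedDict):
    description: str   # natural language hazard scenario
    rag_context: str   # retrieved exemplars (see Section 3.5)
    plr: str           # predicted Required Performance Level

def retrieve(state: PLrState) -> dict:
    # Placeholder for the ChromaDB-backed retrieval of Section 3.5.1.
    return {"rag_context": "[EX1] ..."}

def classify(state: PLrState) -> dict:
    # Placeholder for the LLM call under the configured prompting strategy.
    return {"plr": "b"}

graph = StateGraph(PLrState)
graph.add_node("retrieve", retrieve)
graph.add_node("classify", classify)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "classify")
graph.add_edge("classify", END)
app = graph.compile()
result = app.invoke({"description": "Operator reaches near a moving transmission.",
                     "rag_context": "", "plr": ""})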

3.1. Dataset Description

The experiments in this study utilize an open-source Industrial Machinery Functional Safety Hazard Scenario Dataset [37], designed to serve as a transparent, reproducible, and scientifically rigorous benchmark for empirical evaluation of automated risk assessment methods [7]. The dataset is used throughout this study as the standardized evaluation benchmark for comparing baseline, rule-based, CoT, and RAG methods for PLr determination across diverse industrial safety scenarios.

3.1.1. ISO 12100: Annex B and ISO 13849-1: Annex A Correlation

The dataset construction process is systematically aligned with ISO 12100 Annex B, which enumerates ten general hazard categories relevant to machinery functional safety. For each category, the dataset defines specific hazard origins and enumerates plausible potential consequences, refining the generic framework of ISO 12100 into a structured representation suitable for computational risk assessment. This mapping ensures that every scenario is rooted in established industrial safety standards.
Not all combinations of hazard origin and consequence are physically meaningful or relevant to real-world contexts. Therefore, every origin–consequence pair undergoes a rigorous plausibility assessment, incorporating physical laws, causal logic, and expert judgment from certified functional safety professionals. Only combinations that are physically possible and contextually credible are retained. For example, “entanglement” from “rotating elements” is included, while physically impossible pairings (such as “loss of balance” from “scraping surfaces”) are excluded.
For each plausible origin–consequence pair, scenarios are instantiated by systematically varying contextual parameters, including the following:
  • User type (e.g., operator, maintenance personnel).
  • Task (e.g., normal operation, cleaning, maintenance).
  • Operating environment (e.g., industrial).
  • ISO 13849-1 risk graph parameters: severity (S), frequency and/or duration of exposure (F), and possibility of avoidance (P).
Severity is classified as either “slight (normally reversible injury)” or “serious (normally irreversible injury or death)” (corresponding to S1/S2 in ISO 13849-1), frequency as “seldom-to-less-often/exposure time is short” or “frequent-to-continuous/exposure time is long” (F1/F2), and possibility as “possible under specific conditions” or “scarcely possible” (P1/P2). The PLr is automatically derived for each scenario by applying the ISO 13849-1 Annex A risk graph (cf. Figure 2) to the scenario’s S, F, and P values.
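Because the Annex A risk graph is a fixed mapping, this derivation can be expressed as a small lookup table; the following Python sketch encodes the standard S/F/P-to-PLr assignments and reproduces the example of Listing 1.

# ISO 13849-1 Annex A risk graph as a lookup table: (S, F, P) -> PLr.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a",
    ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b",
    ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c",
    ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d",
    ("S2", "F2", "P2"): "e",
}

def plr_from_sfp(s: str, f: str, p: str) -> str:
    """Return the required performance level for an (S, F, P) triplet."""
    return RISK_GRAPH[(s, f, p)]

# Example of Listing 1: slight severity, seldom exposure, scarcely avoidable.
assert plr_from_sfp("S1", "F1", "P2") == "b"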
This dataset, comprising 7800 machinery hazard scenarios, is automatically generated in a standardized JSON format. Each entry meticulously details hazard attributes, user context, environmental factors, and calculated PLr, along with a natural language description. The comprehensive distribution across ten hazard categories, including 2840 mechanical and 1640 electrical hazards, ensures a robust and reproducible foundation for benchmarking AI in safety-critical classification tasks.

3.1.2. Dataset Entry: Template and Example

Listing 1 illustrates a typical scenario entry in the dataset. This example describes an electrical hazard where maintenance personnel performing setup or programming tasks in an industrial environment may encounter an arc, which could lead to a burn. The scenario specifies key contextual parameters: user type (maintenance personnel), task (setup or programming), environment (industrial), and the risk graph inputs, severity (slight), frequency (seldom-to-less-often), and possibility of avoidance (scarcely possible). Based on these factors, the PLr is assigned as ‘b’ according to the ISO 13849-1: Annex A qualitative performance graph method.
Listing 1. Example of an electrical hazard description in the dataset [7,37].
In all experiments, the scenario’s PLr value, calculated using the ISO 13849-1 risk matrix, serves as the ground truth for evaluation. Each language model receives the scenario’s natural language description as input and is tasked with predicting the PLr. Model outputs are then compared to this reference value to compute predictive accuracy. For example, in Listing 1, only models predicting PLr = ‘b’ are counted as correct. This evaluation protocol enables systematic, transparent, and standard-compliant benchmarking of automated risk assessment methods across diverse safety-critical scenarios.
Further methodological details and rationale for the dataset construction are provided in [7]. The dataset is openly accessible to the research community and practitioners, serving as a standardized benchmark that accelerates scientific progress and enables rigorous comparison of AI-based risk assessment methods in machinery safety.

3.2. Evaluation Datasets

To probe both standard-aligned performance and real-world generalization, two dataset variants are evaluated that share the same schema and gold labels but differ in lexical phrasing.
  • Variant 1—Canonical ISO-style scenarios (N = 100): This variant anchors the evaluation in in-distribution phrasing closely mirroring ISO 12100 Annex B. Each case follows a fixed schema (hazard type, origin, potential consequence, task, environment) and reports explicit risk parameters, namely, severity (S), exposure frequency (F), and possibility of avoidance (P), together with a reference PLr. Because the textual descriptions are automatically generated from canonical fields while remaining faithful to the ISO taxonomy, Variant 1 can be characterized as a synthetic but controlled dataset that provides a standardized baseline under explicit terminology.
  • Variant 2—Functional safety engineer-authored scenarios with lexical shift (N = 100): This variant stress-tests generalization to field language. Scenarios were written by a functional safety engineer from industrial practice using the same schema and gold labels as Variant 1, but the free-text descriptions deliberately avoid literal ISO tokens for S/F/P. Instead, the factors are conveyed implicitly in operational prose, for example, "repetitive short stops near a moving transmission", "limited clearance and delayed stop reachability", and "hands inside a nip area during setup". This emulates how hazards are typically described in industrial workflows during safety assessments.
In short, Variant 1 serves as a structured baseline with standardized phrasing, while Variant 2 introduces deliberate lexical shift while preserving labels and structure. This design enables us to test both in-distribution performance (standard-aligned) and out-of-distribution robustness to field language, particularly important for analyzing the sensitivity of CoT and RAG prompting strategies.

3.3. Prompting Strategies for Risk Classification

The experimental evaluation employs six distinct prompting strategies for PLr determination, each reflecting progressively higher levels of structured input and domain knowledge integration. In all variants, the standard chat-based prompt format with system and user (i.e., human) roles, as defined in [38], is used. This format distinguishes between the "system" message, which establishes the model’s assumed role (such as a functional safety expert) and supplies any required guidance, domain rules, or reasoning instructions, and the "user" message, which presents the actual task input, typically the natural language hazard scenario for analysis. This separation of roles supports structured, reproducible prompt design and enables controlled assessment of how different prompt components influence LLM reasoning in deterministic risk classification tasks.
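As an illustration, such a system/user prompt pair can be assembled with LangChain roughly as follows; the wording here is a simplified stand-in for the actual templates used in the experiments.

from langchain_core.prompts import ChatPromptTemplate

# Simplified sketch of the system/user prompt split; the production prompt
# texts differ and follow the structure of Experiments I-VI below.
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a functional safety expert. Determine the Required Performance "
     "Level (PLr) of the given machinery hazard scenario per ISO 13849-1."),
    ("human",
     "Hazard scenario:\n{description}\n\nAnswer with one of: a, b, c, d, e."),
])

messages = prompt.format_messages(
    description="Maintenance personnel performing setup may encounter an "
                "electric arc, which could lead to a burn.")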
The six prompting strategies are detailed below. Note that the first four strategies do not involve retrieval, whereas the last two combine prompting with the RAG pipeline described in detail in Section 3.5.1.
  • Experiment I: Zero-shot Prompt
    In this baseline experiment, the reasoning model is given only the raw hazard scenario in natural language, without any additional guidance or rules beyond its role as a functional safety expert. This setup assesses the reasoning model’s inherent ability to determine the PLr based on its pre-trained knowledge and reasoning capabilities, focusing on systematic analysis of severity, frequency, and avoidance parameters.
  • Experiment II: Explicit ISO Rule Integration
Building on the zero-shot prompt, this experiment supplements the scenario with explicit PLr determination rules as specified in ISO 13849-1 Annex A (cf. Figure 2). This tests whether access to codified safety knowledge enables more accurate and consistent classification.
  • Experiment III: Chain-of-Thought (CoT)
    This experiment uses the chain-of-thought (CoT) approach by providing highly structured, explicit step-by-step instructions for the model’s reasoning process but without explicitly stating the ISO 13849-1: Annex A performance graph rules. This aims to guide the model through a precise and verifiable pathway for determining the PLr.
  • Experiment IV: Chain-of-Thought (CoT) with Rules
This experiment combines the structured step-by-step instructions of the CoT approach with explicit textual inclusion of the ISO 13849-1 Annex A performance graph rules (cf. Figure 2). The goal is to provide the model with the necessary information directly within the prompt, minimizing reliance on its pre-trained knowledge for the specific rules of PLr determination. This setup evaluates the model’s ability to precisely apply provided rules in conjunction with its reasoning process.
  • Experiment V: CoT with Rules and Retrieval-Augmented Examples
In addition to the scenario, rules, and explicit CoT instructions, representative historical hazard examples are retrieved from a curated database and included in the prompt. This evaluates the model’s ability to generalize from a precedent and improve classification accuracy through context enrichment. This experiment is referred to as COT_WITH_RULES_RAG in the paper.
  • Experiment VI: Rules with Retrieval-Augmented Examples
In addition to the scenario and rules, representative historical hazard examples are retrieved from a curated database and included in the prompt. This evaluates the model’s ability to generalize from precedent and improve classification accuracy through context enrichment. This experiment is referred to as WITH_RULES_RAG in this paper. It is included specifically to isolate the effect of CoT when combined with rules and retrieval: the only difference between Experiments V and VI is that Experiment VI omits the CoT instructions.
In the experimental evaluation, the six settings are applied consistently across six state-of-the-art models and two dataset variants, enabling direct comparison of the incremental benefit of each input enhancement. For clarity, these prompting strategies are referenced using the following shorthand macros (a configuration sketch follows the list):
  • ZERO_SHOT: Baseline condition where the model receives only the raw hazard scenario, without explicit rules or structured guidance.
  • WITH_RULES: Prompt includes explicit ISO 13849-1 Annex A rules, constraining the model to rule-based PLr determination.
  • PURE_CoT: Uses chain-of-thought (CoT) instructions to elicit step-by-step reasoning, but without explicit rule injection.
  • COT_WITH_RULES: Combines CoT instructions with explicit ISO rules, guiding the model through structured reasoning and rule application.
  • WITH_RULES_RAG: Augments rule-based prompting with retrieved hazard exemplars from a curated database, enabling case-based reasoning.
  • COT_WITH_RULES_RAG: Integrates CoT, explicit ISO rules, and retrieved exemplars, representing the most structured and enriched prompting setup.
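Conceptually, the six conditions differ only in which prompt components are injected. The hypothetical configuration table below (flag names are illustrative, not taken from the implementation) makes this factorial structure explicit:

# Illustrative mapping of each shorthand macro to its injected components.
STRATEGIES = {
    "ZERO_SHOT":          {"rules": False, "cot": False, "rag": False},
    "WITH_RULES":         {"rules": True,  "cot": False, "rag": False},
    "PURE_CoT":           {"rules": False, "cot": True,  "rag": False},
    "COT_WITH_RULES":     {"rules": True,  "cot": True,  "rag": False},
    "WITH_RULES_RAG":     {"rules": True,  "cot": False, "rag": True},
    "COT_WITH_RULES_RAG": {"rules": True,  "cot": True,  "rag": True},
}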

3.4. Prompt Placeholder Description

To ensure methodological rigor and enable systematic analysis, each experimental prompt utilizes well-defined placeholders corresponding to key information components. These placeholders not only enforce consistency in model evaluation, but also reflect distinct strategies for eliciting and constraining LLM behavior in a safety-critical classification task.
  • {description}: The central input for each prompt is a natural language description of a real-world hazard scenario, simulating industrial safety assessment tasks and enabling evaluation of the model’s capacity for context comprehension and risk mapping. An example description, including user type, task, hazard origin, and consequence, is shown in Listing 1.
  • {iso_rules_info}: This placeholder injects the structured decision logic codified in ISO 13849-1: Annex A, including parameter definitions and full risk graph mapping rules. It converts the task from open-ended inference to rule-constrained reasoning, enabling assessment of whether direct access to normative safety rules improves PLr classification fidelity and consistency. The rules ensure deterministic mapping of scenario descriptions into standardized parameters—severity (S1/S2), exposure frequency (F1/F2), and possibility of avoidance (P1/P2)—which together yield the PLr. An excerpt of this content is shown in Listing 2.
Listing 2. Deterministic rules codified under {iso_rules_info} for PLr inference based on ISO 13849-1 Annex A qualitative performance graph rules (cf. Figure 2).
  • {COT_step_by_step_instruction}: This placeholder, as detailed in Listing 3, formalizes the CoT prompting technique. It provides the LLM with explicit, step-by-step instructions for analyzing a hazard scenario, specifically guiding its reasoning on severity, frequency, and avoidance. By forcing this sequential decomposition and requiring intermediate justifications, the CoT prompt aims to enhance the model’s transparency and align its decision-making with structured human expert methodologies for PLr determination. This approach is crucial for improving explainability and validating AI reasoning in critical safety applications.
    Listing 3. Structured chain-of-thought instructions under {COT_step_by_step_instruction} prompt placeholder for PLr inference.
  • {rag_examples}:
This placeholder is populated by a retrieval module that supplies hazard scenarios (with ground-truth S/F/P and PLr) drawn from a curated database. Further details about the RAG implementation pipeline are provided in Section 3.5.1.

3.5. RAG Implementation

In the PLr experiment framework, RAG is employed to systematically test whether augmenting prompts with precedent hazard examples improves deterministic classification of PLr. The underlying rationale is that models may benefit from a small set of prior cases that are not only textually relevant but also structurally consistent with the ISO 13849-1 risk parameters S, F, and P. In addition, RAG is explicitly evaluated under lexical-shift conditions, using queries drawn from a companion corpus that is semantically consistent with ISO 13849-1 but avoids the literal S/F/P tokens (e.g., paraphrases such as “infrequent contact” for F1 or “avoidance is difficult” for P1), to test whether exemplar augmentation improves deterministic PLr classification when surface forms diverge from both the rules and the database phrasing. To this end, the framework implements RAG in two chain types:
  • WITH_RULES_RAG: Injects retrieved exemplars alongside the base hazard description and ISO rules.
  • COT_WITH_RULES_RAG: Integrates exemplars into a CoT prompt alongside base hazard descriptions and ISO rules.

3.5.1. Pipeline Implementation

A hybrid RAG pipeline is implemented: dense database retrieval is followed by symbolic filtering and deterministic packaging for prompting [25,39]. The orchestration is handled by the _search_similar_hazards function and proceeds as follows: at initialization, a hazard database object is created (from an existing index) and an optional validator chain is attached.
The RAG controls are (k, τ, M) = (rag_k, rag_sim_threshold, rag_top_m), a strict S/F/P gate require_sfp_exact (Boolean), and an optional drop_missing_labels flag. Defaults in the experimental runs are k = 20, τ = 0.30, M = 3.
Stage I: Semantic Database Search
Given a query scenario string q, hazard_db.search(q) is invoked to obtain a superset of candidates C0 (intentionally larger than k to avoid upstream pruning). This corresponds to the “semantic retriever” stage in RAG, where a vector- or database-backed similarity search returns top candidates for subsequent filtering.
Stage II: Hybrid Prefiltering and Ranking
Candidates are passed to _prefilter_and_rank with the following steps (a minimal code sketch follows the list):
  • Lexical overlap filter: Compute the Jaccard index on token sets, J(A, B) = |A ∩ B| / |A ∪ B|, for the query and each candidate; retain only those with J ≥ τ (default τ = 0.30).
  • S/F/P constraint (optional exact gate; disabled by default): Map hazards to (S, F, P) with S ∈ {S1, S2}, F ∈ {F1, F2}, P ∈ {P1, P2}. If require_sfp_exact = true, discard any candidate whose triplet does not exactly match the query’s; otherwise keep but log mismatches for diagnostics. In all main experiments, this gate was disabled (require_sfp_exact = false); enabling it is left for ablations.
  • Deduplication and top-M selection: Deduplicate by hazard identifier, rank primarily by J (semantic score may be used as a tie-breaker inside the helper), and truncate to the configured top M.
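A minimal sketch of this stage is given below; the function name mirrors _prefilter_and_rank from the text, while the candidate schema (text, id, sfp fields) and whitespace tokenization are simplifying assumptions.

# Sketch of the hybrid prefilter-and-rank stage (schema and tokenization are
# assumptions; the production helper may differ in detail).
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def prefilter_and_rank(query: str, candidates: list, tau: float = 0.30,
                       top_m: int = 3, require_sfp_exact: bool = False,
                       query_sfp: tuple = None) -> list:
    q_tokens = set(query.lower().split())
    kept, seen_ids = [], set()
    for cand in candidates:
        j = jaccard(q_tokens, set(cand["text"].lower().split()))
        if j < tau:
            continue                      # lexical overlap filter
        if require_sfp_exact and cand.get("sfp") != query_sfp:
            continue                      # optional strict S/F/P gate
        if cand["id"] in seen_ids:
            continue                      # deduplicate by hazard identifier
        seen_ids.add(cand["id"])
        kept.append({**cand, "jaccard": j})
    kept.sort(key=lambda c: c["jaccard"], reverse=True)  # rank primarily by J
    return kept[:top_m]                   # truncate to top-M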
Stage III: Evidence Packaging for Prompting
The retained set is compacted and serialized into concise snippets (hazard type, task, short description) using _ctx_snip to form a structured context block of the form
[EX1] HazardType | Task | description … [EXM] …
which is injected into the prompt as rag_context.
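A compact sketch of this packaging step follows; the _ctx_snip name comes from the text, but its body and the field names are assumptions.

# Assumed snippet/packaging helpers; field names are illustrative.
def ctx_snip(ex: dict) -> str:
    return f"{ex['hazard_type']} | {ex['task']} | {ex['text'][:160]}"

def build_rag_context(examples: list) -> str:
    return "\n".join(f"[EX{i + 1}] {ctx_snip(ex)}"
                     for i, ex in enumerate(examples))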
Under the survey taxonomy in the literature [25,39], the method employed can be called “advanced hybrid RAG”: a semantic database search followed by symbolic lexical screening and domain–structure (S/F/P) constraints, plus deterministic evidence selection/packaging. In the experiments, the default values are require_sfp_exact = false and drop_missing_labels = false. Vector similarity is used for over-retrieval, while the lexical Jaccard threshold τ acts as the gate.

3.5.2. Retrieval Index (ChromaDB)

A persistent ChromaDB collection of curated hazard scenarios is maintained solely for retrieval. In this study the index contains N ≈ 1020 records drawn from the larger 7800-scenario corpus. Each record stores a compact textual summary (hazard type, origin, consequence, user, task, environment, description) and metadata (ID, S/F/P, PLr). Documents are embedded with a standard sentence-embedding model and retrieved by vector similarity. At query time: (i) top-k candidates are over-retrieved (k = 20), (ii) a semantic gate s_sem ≥ 0.70 is applied, with s_sem = 1 − d computed from the index distance d, (iii) lexical Jaccard filtering J(A, B) ≥ 0.30 is performed, (iv) an optional S/F/P exact match is enforced (require_sfp_exact, default: false), and (v) deduplication retains the top M candidates by J (M = 3). For selected examples, the retrieval artifacts ID, S/F/P, PLr, s_sem, and J(A, B) are reported.
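Query-time stages (i)–(ii) can be sketched with the ChromaDB client as follows; the collection name, persistence path, and metadata field names are assumptions for illustration, and stages (iii)–(v) correspond to the prefilter sketch above.

import chromadb

# Illustrative retrieval sketch; collection/field names are assumptions.
client = chromadb.PersistentClient(path="./hazard_index")
collection = client.get_or_create_collection(name="hazard_scenarios")

def retrieve_candidates(query: str, k: int = 20, sem_min: float = 0.70) -> list:
    res = collection.query(query_texts=[query], n_results=k)  # over-retrieve (i)
    candidates = []
    for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0],
                               res["distances"][0]):
        s_sem = 1.0 - dist                     # semantic score from distance
        if s_sem >= sem_min:                   # semantic gate (ii)
            candidates.append({"text": doc, "id": meta["id"],
                               "sfp": (meta["S"], meta["F"], meta["P"]),
                               "plr": meta["PLr"], "s_sem": s_sem})
    return candidates  # stages (iii)-(v): Jaccard filter, dedup, top-M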
Thus, the RAG implementation combines semantic retrieval with structural safeguards to ensure both textual relevance and risk graph consistency. Candidate hazards are retrieved semantically, filtered with lexical Jaccard thresholds and optional S/F/P gating, then deduplicated and truncated to the top-M exemplars for prompt injection. Depending on the experimental condition, the retrieved rag_context is added either directly (WITH_RULES_RAG) or embedded in a structured reasoning sequence (COT_WITH_RULES_RAG). In both cases, the objective is to reinforce correct mapping from S/F/P parameters to PLr classification while systematically controlling the role of retrieved exemplars. Default parameters are listed in Table 1, and the overall hybrid pipeline is illustrated in Figure 3.

3.6. Model Selection and Configuration

Six production models from three major vendors plus a specialist reasoner are evaluated, spanning (i) dedicated reasoning stacks [OpenAI o-series: o3-mini, o4-mini; DeepSeek Reasoner], (ii) cost/latency-optimized “mini/flash” variants [Google Gemini 2.5 Flash, OpenAI GPT-5 mini], and (iii) a premium general model used as an upper-bound baseline [Anthropic Claude Opus 4.1].
  • Claude Opus 4.1 (Anthropic): Latest Claude 4.x release positioned for complex reasoning, coding, and agentic workflows [40].
  • DeepSeek Reasoner: Domain-agnostic reasoning model optimized for multi-step inference; API documentation lists unsupported decoding controls (see below) [41].
  • Gemini 2.5 Flash (Google): Cost/latency-optimized model with optional thinking budget and multimodal I/O [42,43,44].
  • GPT-5 mini (OpenAI): Compact GPT-5-family variant emphasizing speed and cost-efficiency for well-defined tasks [45,46].
  • o3-mini (OpenAI): Small o-series reasoning model targeting STEM/logic tasks at low cost/latency [47].
  • o4-mini (OpenAI): Newer small o-series model optimized for fast, effective reasoning (math, coding, vision) [48].
All models receive identical prompts per experiment (Section 3.3), with stop sequences, maximum output tokens, and structured fields aligned to avoid truncation or format bias.

3.6.1. Deterministic Decoding Policy

PLr assignment under ISO 13849-1 is a deterministic multi-class classification task. To eliminate sampling variance and ensure replicability, models are run under deterministic decoding wherever supported. Parameters are set to temperature = 0.0, top_p = 1.0, and top_k = 0 (greedy decoding). For reasoning stacks, temperature is documented as unsupported or ignored (OpenAI o3-mini, o4-mini; DeepSeek Reasoner [41,49]), with DeepSeek additionally listing unsupported controls (temperature, top_p, presence_penalty, frequency_penalty, logprobs/top_logprobs). Thus, decoding proceeded deterministically under provider defaults.
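A minimal sketch of this policy, assuming LangChain's ChatOpenAI wrapper for the OpenAI-hosted models; the model-name gate is illustrative, and analogous wrappers apply for the other vendors.

from langchain_openai import ChatOpenAI

# Sketch of the deterministic decoding policy: reasoning models reject or
# ignore sampling controls, so they run on provider defaults; other models
# receive greedy-decoding parameters.
REASONING_MODELS = {"o3-mini", "o4-mini"}

def make_model(name: str) -> ChatOpenAI:
    if name in REASONING_MODELS:
        return ChatOpenAI(model=name)  # temperature/top_p unsupported
    return ChatOpenAI(model=name, temperature=0.0, top_p=1.0)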

3.6.2. Cost Context

API pricing (as of 21 August 2025) spans an order of magnitude. Claude Opus 4.1 is the premium outlier at USD 15 per 1M input tokens and USD 75 per 1M output tokens [50]. OpenAI o-series minis are an order of magnitude cheaper at USD 1.10 in/USD 4.40 out [47,48]; GPT-5 mini is lower still at USD 0.25 in/USD 2.00 out [45]; Gemini 2.5 Flash is comparable (USD 0.30 in/USD 2.50 out, including “thinking” tokens) [51]; and DeepSeek Reasoner is aggressively priced at USD 0.55 in (cache miss; USD 0.14 cache hit) and USD 2.19 out [52]. Batch and caching features can further reduce effective cost [50,53].
These six models together span the accuracy–latency–cost frontier: a premium general model (Claude Opus 4.1), vendor reasoning stacks (OpenAI o-series; DeepSeek Reasoner), and high-throughput budget models (Gemini 2.5 Flash; GPT-5 mini). This setup enables cost-normalized, deterministic PLr classification performance to be compared under identical prompting and RAG conditions.
The aim is to derive actionable recommendations for real-life functional safety workflows by jointly analyzing (i) accuracy with 95% CIs, (ii) latency/throughput under deterministic decoding, and (iii) per-decision cost (input + output tokens), while documenting provider constraints on decoding controls for reasoning models.
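From the listed prices, per-decision cost reduces to simple arithmetic; the sketch below uses USD per 1M tokens at cache-miss rates, with illustrative model keys.

# Per-call cost estimate (USD per 1M tokens, input/output, cache-miss rates
# as of 21 August 2025; keys are illustrative).
PRICES = {
    "claude-opus-4.1":   (15.00, 75.00),
    "o3-mini":           (1.10, 4.40),
    "o4-mini":           (1.10, 4.40),
    "gpt-5-mini":        (0.25, 2.00),
    "gemini-2.5-flash":  (0.30, 2.50),
    "deepseek-reasoner": (0.55, 2.19),
}

def call_cost_usd(model: str, in_tokens: int, out_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return (in_tokens * p_in + out_tokens * p_out) / 1e6

# Example: a 1500-token prompt with a 300-token answer on GPT-5 mini
# costs call_cost_usd("gpt-5-mini", 1500, 300) = 0.000975 USD.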

4. Results and Analysis

  • Scope. Two datasets are evaluated:
    Variant 1: Canonical ISO-style scenarios (in-distribution reliability).
    Variant 2: Engineer-authored free-text scenarios (out-of-distribution robustness).
Six prompting strategies are tested across six model families (see Section 3.3 and Section 3.6), yielding 36 conditions per variant.
  • Reported metrics. For each (model, prompt) condition, the following are reported:
    Accuracy; Macro-/Micro-/Weighted-F1.
    Per-class metrics (including class E recall).
    Processing time.
  • Experimental protocol.
    Deterministic decoding: temperature = 0.0, top_p = 1.0, top_k = 0 (when available).
    Repeats: r = 5 independent runs per condition with identical inputs; for RAG, a fixed retrieval configuration and index.
    Aggregation: metrics computed per run and then averaged across runs.
  • Uncertainty quantification. Error bars are 95% t-intervals across runs ( r = 5 ) for accuracy and timing, and 95% class-stratified bootstrap Confidence Intervals (CIs) for Micro-/Macro-/Weighted-F1 and all per-class metrics.
  • Weighted-F1 (rationale).
    Weighted-F1 averages per-class F1 using class prevalence as weights; Macro-F1 gives equal weight to all classes; Micro-F1 (for single-label tasks) equals accuracy and can mask per-class precision/recall trade-offs.
    To avoid hiding minority-class failures, Weighted-F1 is always paired with Macro-F1, per-class metrics, and class E recall.
Together, these evaluations offer a comprehensive, multidimensional assessment of reasoning LLMs under deterministic, rule-constrained risk classification.
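For reproducibility, the following Python sketch illustrates both interval procedures under the stated protocol, assuming run-level accuracy scores and per-item gold/predicted labels are available; the resample count B = 2000 and the toy data are illustrative choices, and the helpers rely on NumPy, SciPy, and scikit-learn.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import f1_score

def t_interval(run_scores, conf=0.95):
    """95% Student t-interval across repeated runs (here r = 5)."""
    x = np.asarray(run_scores, dtype=float)
    r = len(x)
    half = stats.t.ppf(0.5 + conf / 2, df=r - 1) * x.std(ddof=1) / np.sqrt(r)
    return x.mean() - half, x.mean() + half

def stratified_bootstrap_f1(y_true, y_pred, average="macro", B=2000, seed=0):
    """Class-stratified bootstrap CI: resample item indices within each class
    so rare PLr classes keep their prevalence in every resample."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    by_class = [np.flatnonzero(y_true == c) for c in np.unique(y_true)]
    scores = []
    for _ in range(B):
        idx = np.concatenate([rng.choice(g, size=len(g), replace=True)
                              for g in by_class])
        scores.append(f1_score(y_true[idx], y_pred[idx], average=average))
    return np.percentile(scores, [2.5, 97.5])

# Toy example: accuracies from r = 5 runs, then one run's per-item labels.
print(t_interval([0.94, 0.95, 0.93, 0.96, 0.94]))
print(stratified_bootstrap_f1(list("aabbccddee"), list("aabbcdddee")))
```

The same bootstrap routine applies to Weighted-F1 (average="weighted") and to per-class metrics by restricting the score to one label.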

4.1. Results on Variant 1 (Canonical ISO-Style Scenarios, Non-RAG Prompts)

Four prompting strategies are evaluated without retrieval augmentation: ZERO_SHOT, PURE_CoT, WITH_RULES, and COT_WITH_RULES. Figures 4–14 summarize the results across six models.

4.1.1. Accuracy and Processing Time

Figure 4 presents accuracy with 95% t-intervals across repeated runs, quantifying the stability of model performance under identical conditions and ensuring multi-run statistical reliability. WITH _ RULES consistently achieves near-ceiling accuracy (≥ 0.92 ) across all six models, confirming the effectiveness of explicit rule-constrained prompting. By contrast, PURE _ CoT displays large instability, with accuracies ranging from 0.40 (Claude-opus-4-1) to 0.99 (GPT-5-mini), reinforcing that unconstrained reasoning yields erratic outcomes. COT _ WITH _ RULES partially mitigates this instability but does not reach the ceiling performance of WITH _ RULES . ZERO _ SHOT systematically underperforms, collapsing for o3-mini (0.45), which underscores its inability to generalize in safety-critical classification without structured priors.
Figure 5 and Figure 6 complement accuracy analysis with efficiency metrics. Both average processing time and total execution time are reported with 95% t-intervals, reflecting run-to-run variability rather than single-run artifacts. DeepSeek-Reasoner and Gemini-2.5-flash exhibit markedly higher latency than compact models such as o3-mini and o4-mini, highlighting the cost–accuracy trade-off in reasoning-optimized architectures. These results collectively strengthen the methodological rigor by combining deterministic performance evaluation with reproducibility and resource awareness.
Efficiency results show that WITH_RULES is not only the most accurate but also the most efficient strategy. PURE_CoT and COT_WITH_RULES incur higher latency due to step-by-step reasoning, while ZERO_SHOT is fast but unreliable. DeepSeek-Reasoner exhibits extreme latency (approx. 70 s per query), making it impractical despite strong accuracy. Gemini and Claude are efficient but highly sensitive to prompt structure.
The accuracy heatmap in Figure 7 highlights this hierarchy: WITH _ RULES is dominant (≥0.99), COT _ WITH _ RULES is consistently strong but slightly below, PURE _ CoT is volatile, and ZERO _ SHOT fails for compact models. This confirms that explicit ISO rule alignment is both necessary and sufficient for deterministic PLr classification.

4.1.2. Macro, Micro-F1 and Precision

To capture robustness across PLr classes, Figures 8–13 report Macro-F1, Micro-F1, Weighted-F1, and per-class metrics. WITH_RULES dominates in Macro- and Micro-F1, showing balanced handling of both frequent and rare classes. PURE_CoT inflates performance on common classes but collapses on PLr classes b and c, demonstrating that accuracy alone can obscure poor recall for low-frequency hazards. ZERO_SHOT fails severely for o3-mini, producing poor per-class recall. COT_WITH_RULES provides stability but remains weaker than rule-only prompts.
Per-class F1 (Figure 11) shows strong performance for PLr classes a and b across all rule-grounded prompts, while PLr classes c and d remain the main sources of variability. The o3-mini collapse under ZERO _ SHOT and PURE _ CoT reflects instability on mid-frequency hazards, in contrast to rule-based prompts which sustain balanced scores, including near-ceiling values for PLr class e.
Precision remains high for PLr classes a and b, but PURE_CoT and ZERO_SHOT exhibit wider variance in PLr classes c and d, indicating susceptibility to false positives (Figure 12). WITH_RULES consistently yields precise predictions across all classes, preserving safety-critical PLr class e without degradation.
Recall (Figure 13), for Variant 1 without RAG, highlights systematic weaknesses in ZERO_SHOT and PURE_CoT for PLr classes b and c, where under-detection is frequent. Rule-grounded prompts sustain recall near ceiling for all classes, ensuring deterministic coverage of PLr class e while reducing variance across models.
Together, Figures 11–13 demonstrate that explicit rule conditioning provides stable performance across both common and minority classes, counteracting the volatility of unconstrained prompting. The alignment of F1, precision, and recall results confirms that rule-grounded strategies not only maximize aggregate accuracy but also ensure class-level determinism, a prerequisite for safety-critical deployment where failures on rare hazards such as PLr class e cannot be tolerated.

4.1.3. Recall for PLr Class E

Figure 14 isolates recall for the most safety-critical class, PLr class e. WITH _ RULES achieves perfect recall across all models, while ZERO _ SHOT and PURE _ CoT miss high-risk hazards, an unacceptable failure in functional safety. While average accuracy may appear reasonable, only rule-based strategies reliably capture rare but critical hazards.

4.2. Results on Variant 1 (Canonical ISO-Style Scenarios, RAG-Based Prompts)

Because overall accuracy is ceiling-limited, class-sensitive metrics that reveal behavior under class imbalance are emphasized.

4.2.1. Macro-/Micro-F1

Figure 15 and Figure 16 show near-ceiling Macro- and Micro-F1 (approx. 0.98–1.00) for both WITH _ RULES _ RAG and COT _ WITH _ RULES _ RAG across six models. The one exception is o3-mini, where COT _ WITH _ RULES _ RAG collapses, while plain WITH _ RULES _ RAG remains approx. 0.99. These results highlight that free-form reasoning can degrade class-balanced performance even when mean accuracy appears high.

4.2.2. Per-Class Behavior

The per-class F1 panel (Figure 17) shows that PLr classes a–d are essentially solved for all models under both prompts. The o3-mini anomaly under COT _ WITH _ RULES _ RAG stems from broad degradation across PLr classes b–e, not a single-class artifact, indicating reasoning-step brittleness rather than dataset noise.

4.2.3. Safety-Critical Coverage (PLr Class E)

Figure 18 isolates recall for PLr class e. All models achieve near-perfect PLr class e recall under both prompts except o3-mini under COT _ WITH _ RULES _ RAG . This result shows that accuracy alone can mask critical failures; WITH _ RULES _ RAG is consistently safer on this dataset, while COT_WITH_RULES_RAG can destabilize a smaller model.

4.2.4. Weighted-F1

Figure 19 mirrors the macro/micro patterns: near-ceiling for all models and both prompts, with the same o3-mini degradation confined to COT _ WITH _ RULES _ RAG . This shows the finding is robust to prevalence weighting.
On in-distribution ISO phrasing, retrieval alone suffices; adding explicit chain-of-thought yields no systematic gains and can harm smaller models. Reporting Macro-/Micro-/Weighted-F1, per-class F1, and PLr class e recall directly addresses reviewer requests for metrics beyond accuracy and for safety-critical error visibility.

4.3. Results on Variant 2 (Engineer-Authored Scenarios, Non-RAG Prompts)

Variant 2 comprises free-text hazard descriptions authored by a functional safety engineer without canonical ISO phrasing. This dataset introduces lexical shift and compositional variability, and therefore tests out-of-distribution generalization and the stability of prompting strategies under realistic language.
Figures 20–22 present Variant 2 results, with Figure 20 showing model–prompt accuracy, Figure 21 the average processing time per sample, and Figure 22 the total execution time, each with 95% t-interval error bars across runs.

4.3.1. Overall Accuracy (With 95% t-Intervals Across Runs)

As seen in Figure 20, across models, WITH _ RULES is the most reliable strategy on Variant 2, typically yielding the highest or statistically indistinguishable accuracy relative to the best method per model. For strong base models (e.g., GPT-5-mini, o4-mini), PURE _ CoT occasionally approaches WITH _ RULES , but its 95% t-intervals are wider, indicating greater run-to-run variability. COT _ WITH _ RULES narrows this variability relative to PURE _ CoT but rarely exceeds the simpler WITH _ RULES strategy. ZERO _ SHOT is consistently least accurate, often clustering around 0.55–0.60 on several models, confirming that unconstrained prompting is brittle under lexical shift. Notably, DeepSeek-Reasoner and Gemini-2.5-flash display particularly wide intervals for CoT variants, highlighting instability under free-form language even when mean accuracy is competitive.

4.3.2. Latency and Efficiency (With 95% t-Intervals Across Runs)

From Figure 21 and Figure 22, one can observe that the latency patterns mirror the accuracy trade-offs. Reasoning-heavy prompts ( PURE _ CoT and COT _ WITH _ RULES ) incur the highest per-sample and total execution times, with the slowest stacks (e.g., DeepSeek-Reasoner, Gemini-2.5-flash) showing order-of-magnitude differences relative to compact models (o3-mini, o4-mini). WITH _ RULES typically achieves near-top accuracy with substantially lower latency than COT _ WITH _ RULES , offering a better accuracy–time Pareto point. ZERO _ SHOT is fastest but its accuracy deficits under lexical shift make it unsuitable for deterministic, auditable workflows.
On Variant 2, explicit rule conditioning ( WITH _ RULES ) is the most robust and efficient choice overall: it maintains high accuracy with narrower 95% t-intervals than PURE _ CoT , while avoiding the additional latency overheads of COT _ WITH _ RULES . These results reinforce that, under non-canonical phrasing, structured prompts that encode the ISO decision rules provide determinism and stability that purely “reasoning” styles do not.

4.3.3. Macro-F1 and Micro-F1

Figure 23 and Figure 24 demonstrate Macro- and Micro-F1 scores. Macro-F1 uncovers instability in minority PLr classes (a and b), where zero-shot and CoT perform inconsistently. Rule-based prompting markedly reduces this variance, producing balanced performance across classes. Micro-F1 tracks overall accuracy but confirms that robustness is only achieved with explicit rules.

4.3.4. Per-Class Performance

Figures 25–27 illustrate the per-class F1, precision, and recall for Variant 2 without RAG. The results highlight that PLr classes b and c are the most fragile categories: zero-shot and pure CoT prompting frequently underperform, leading to both false positives (precision loss) and severe recall drops. In contrast, rule-based prompting consistently stabilizes performance across all classes, maintaining near-ceiling recall for PLr class e and preventing over-prediction in PLr classes d and e. This per-class robustness is particularly important since functional safety certification requires determinism not only at the aggregate level but also within each PLr class.

4.3.5. Safety-Critical Recall (Class E) and Weighted-F1

Figure 28 and Figure 29 focus on PLr class e and Weighted-F1. Importantly, recall for PLr class e remains high across all settings, but zero-shot and pure CoT show variance, risking under-detection in critical cases. Weighted-F1 reflects these imbalances: Claude and DeepSeek with rules sustain scores above 0.9, whereas smaller models without rules fall below 0.7. These results demonstrate that determinism must be explicitly validated for safety-critical outputs.
Variant 2 demonstrates that lexical variation strongly stresses large language models. Zero-shot and pure CoT strategies are insufficient for reliable safety-critical classification, with accuracy drops of up to 30%. Rule-based prompting provides determinism and restores per-class balance, achieving near-Variant 1 performance even under distribution shift. This confirms that safety-compliant usage of LLMs requires explicit structural constraints rather than relying on emergent reasoning.
The evaluation on Variant 2 demonstrates that the benchmark is not limited to synthetic ISO-style phrasing but also covers practice-authored cases representative of real safety assessments. This ensures ecological validity, since functional safety engineers rarely describe hazards using literal ISO tokens. The consistent schema and gold labels ensure comparability to Variant 1, while the lexical shift tests out-of-distribution robustness.
Results show that rule-based prompting (with or without CoT) substantially outperforms zero-shot baselines, confirming that explicit formalization of S/F/P criteria is necessary for reliable PLr assignment in realistic industrial language. Importantly, not only mean accuracies but also Macro-/Micro-F1 and per-class breakdowns are reported, together with 95% CIs, thereby quantifying both central tendency and statistical uncertainty. This provides the level of rigor expected by certification auditors, who require reproducible evidence of deterministic behavior under linguistic variation. In sum, Variant 2 establishes sufficiency of the benchmark for assessing model robustness under real-world lexical variability, beyond synthetic ISO-aligned formulations.

4.4. Results on Variant 2 (Engineer-Authored Scenarios, RAG-Based Prompts)

Variant 2 further evaluates robustness under lexical shift but now introduces retrieval-augmented generation (RAG). Scenarios are free-text hazard descriptions authored by a functional safety engineer, without canonical ISO tokens, and prompts combine retrieval with explicit rules or CoT. This setting directly probes whether retrieval supports or undermines determinism when applied to non-standardized input phrasing.

4.4.1. Accuracy and Confidence Intervals

Figure 30 shows that RAG introduces heterogeneous effects. WITH _ RULES _ RAG achieves strong performance for Claude-opus and GPT-5-mini (>0.95 accuracy with narrow CIs), indicating that retrieved context reinforced rule-constrained reasoning. By contrast, DeepSeek Reasoner degraded substantially (mean ≈ 0.72 with wide intervals), suggesting retrieval noise or conflict with internal heuristics. o3-mini also suffered a drop below 0.75. These results highlight that while retrieval can complement robust models, it can destabilize others, emphasizing the need for model-specific RAG validation before deployment in safety-critical settings.

4.4.2. Latency and Efficiency

Figure 31 and Figure 32 confirm that COT _ WITH _ RULES _ RAG incurs substantial latency penalties, particularly for large reasoning-oriented models (DeepSeek and Gemini, >70 s per case, thousands of seconds total). In contrast, WITH _ RULES _ RAG reduces average and total runtimes significantly, while in some models maintaining or even improving accuracy. This suggests that retrieval-only prompting is a more computationally practical strategy, provided retrieval databases are curated to minimize semantic drift and contradictory context.
Variant 2 with RAG demonstrates a dual outcome: for compact and instruction-optimized models, RAG stabilizes accuracy while controlling runtime; for reasoning-specialized models, it amplifies error variability and latency. This shows that retrieval noise can override encoded rules and underscores the necessity of transparent error analysis when deploying RAG in safety-critical certification workflows.
Across all conditions, three systematic trends emerge.
  • First, Variant 1 (canonical ISO phrasing) represents the upper bound of model performance: rule-based prompts consistently achieved near-ceiling accuracy (>0.95) with narrow confidence intervals, underscoring that deterministic classification is feasible when phrasing is standardized.
  • Second, Variant 2 (lexical shift, non-RAG) revealed a marked degradation for zero-shot and unconstrained CoT, confirming that free-text hazard descriptions destabilize reasoning-oriented prompting. Rule-based strategies partially mitigated this drift but still showed variability in compact models.
  • Third, Variant 2 with RAG demonstrated that retrieval can both stabilize and destabilize performance: while Claude-Opus, GPT-5-mini, and o4-mini maintained robustness with WITH _ RULES _ RAG , DeepSeek Reasoner and o3-mini exhibited large confidence intervals and accuracy collapse, indicating sensitivity to retrieval noise. Latency results further confirmed that retrieval-heavy CoT prompts impose prohibitive computational costs, whereas lightweight retrieval ( WITH _ RULES _ RAG ) balances accuracy and efficiency.
Collectively, these findings validate the central claim that deterministic reliability in functional safety tasks depends not on emergent reasoning, but on strict rule-constrained prompting, carefully validated retrieval, and bounded lexical variability.
4.4.3. Macro- and Micro-F1

Figure 33-Top and Figure 33-Bottom show Macro- and Micro-F1 comparisons. Macro-F1 reveals systematic penalties for smaller models (o3-mini, DeepSeek) due to inconsistent handling of PLr classes c, d, and e. By contrast, Claude and GPT-5-mini retained balanced per-class treatment (Macro-F1 > 0.90). Micro-F1 followed overall accuracy trends, confirming that lexical robustness depends strongly on model scale and prompt scaffolding.

4.4.4. Per-Class Performance for Variant 2 with RAG

As shown in Figure 34, PLr classes a–c maintain high F1 under both WITH_RULES_RAG and COT_WITH_RULES_RAG, although PLr class c exhibits noticeable variance and degradation for smaller models under CoT+RAG. PLr classes d–e remain close to ceiling, with only minor drops in PLr class d for selected models, confirming that rule-grounded prompting stabilizes safety-critical coverage.
Figure 35 shows that precision for PLr classes a and b remains at ceiling across all models under both WITH_RULES_RAG and COT_WITH_RULES_RAG, indicating these categories are consistently separable. In contrast, PLr class c shows wider variance, especially for o3-mini and o4-mini, while PLr classes d and e maintain high precision overall, with occasional degradation in DeepSeek and o3-mini, underscoring the sensitivity of minority classes to retrieval noise.
As shown in Figure 36, recall under RAG remains near ceiling for PLr class e across all models and prompt types, confirming that catastrophic risk categories are reliably detected despite lexical variation. For PLr classes a–c (Figure 36a), recall is generally high but exhibits larger variance for DeepSeek and o3-mini, with noticeable drops under WITH_RULES_RAG. For PLr classes d and e (Figure 36b), stability is preserved for most models, but o3-mini again shows degradation in PLr class d, underscoring model-specific brittleness when retrieval is combined with reasoning.

4.4.5. Weighted-F1

Weighted-F1 (Figure 37) integrates per-class balance with label distribution. Results confirm the macro-F1 patterns: Claude and GPT-5-mini exceeded 0.95 , while smaller models suffered from skewed errors, reflecting limited resilience to non-canonical phrasing.
Variant 2 demonstrates that deterministic, rule-grounded prompting ( WITH _ RULES _ RAG , COT _ WITH _ RULES _ RAG ) ensures robustness to lexical variability, maintaining high recall for PLr class e and stable accuracy for Claude, GPT-5-mini, and o4-mini. The inclusion of per-class metrics and confidence intervals demonstrates that conclusions hold not only on aggregate accuracy but also on safety-critical subcategories under realistic linguistic conditions.
Compared with canonical ISO phrasing, lexical shift exposes large prompt–model interactions. Accuracy heatmaps and 95% CIs show that WITH_RULES_RAG substantially improves robustness for Claude (0.82 → 0.99; non-overlapping CIs, large error-rate reduction) and keeps o4-mini near ceiling (0.97 in both), but hurts DeepSeek (0.92 → 0.72; non-overlapping CIs) and o3-mini (0.80 → 0.70; partially overlapping CIs). GPT-5-mini remains high under both prompts (0.98 with COT_WITH_RULES_RAG vs. 0.94 with WITH_RULES_RAG), while Gemini-2.5-flash is stable (0.89 → 0.92). Thus, rule-grounded retrieval mitigates lexical variability for some architectures but is not uniformly beneficial.
Macro-F1 confirms these trends by penalizing class imbalance: Claude, GPT-5-mini, and o4-mini remain ≥0.90, whereas DeepSeek and o3-mini drop due to reduced recall on mid-frequency PLr classes c and d. Micro-F1 follows overall accuracy, indicating that failures concentrate on minority labels rather than reflecting widespread drift.
Weighted-F1 integrates label prevalence and mirrors the Macro-F1 picture: gains for Claude under WITH_RULES_RAG, small neutral changes for Gemini and GPT-5-mini, and degradations for DeepSeek and o3-mini. Per-class analyses show that PLr classes a and b are consistently easy, while classes c and d are the main sources of variance under lexical shift. Crucially for safety, recall of the catastrophic PLr class e remains near-perfect across models and prompts, with the only notable dip on DeepSeek under WITH_RULES_RAG; this demonstrates that the most safety-critical outcomes are preserved even when free text avoids ISO tokens.
Latency measurements establish deployability: WITH_RULES_RAG reduces average processing time markedly relative to COT_WITH_RULES_RAG (for example, DeepSeek 75 → 44 s; Gemini 78 → 4 s) while maintaining or improving accuracy for Claude and o4-mini; GPT-5-mini trades a small accuracy drop for a 2–3× speedup. Together with explicit CIs and per-class metrics, these results provide statistically grounded evidence that Variant 2 rigorously tests generalization to field language and that prompt design must be matched to model family to achieve robust, efficient performance.

4.5. Does RAG Confuse CoT? Quantification and Evidence

In this study, WITH _ RULES _ RAG denotes ISO rule-guided prompting with retrieval; COT _ WITH _ RULES _ RAG adds an explicit chain-of-thought layer on top of rules + retrieval. Two complementary outcomes are distinguished when comparing these prompts on the same inputs:
  • Confusion (RAG correct → CoT wrong): cases where WITH_RULES_RAG is correct but COT_WITH_RULES_RAG is wrong—indicating that adding CoT (given the same retrieved context) degrades the decision.
  • Rescue (RAG wrong → CoT correct): cases where WITH_RULES_RAG is wrong but COT_WITH_RULES_RAG is correct—indicating that CoT filters retrieval noise and restores rule-consistent reasoning.
These counts separate instances where retrieval destabilizes CoT from those where CoT stabilizes retrieval. Formally, on Variant 2,
ConfuseCount = |{ i : WITH_RULES_RAG(i) correct ∧ COT_WITH_RULES_RAG(i) wrong }|,
and symmetrically for RescueCount. Note that these counts include only flips in correctness between prompts; cases correct (or wrong) under both do not contribute.
Table 2 summarizes per-model accuracy on Variant 2 together with the corresponding confusion and rescue counts. See Appendix A.1 for the full per-model confusion/rescue tables with retrieved snippets and CoT traces.
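For clarity, the accounting can be expressed as a short Python sketch, assuming per-scenario predictions from both prompts are keyed by hazard ID; the function and variable names are illustrative.

```python
def confuse_rescue_counts(gold, rag_pred, cot_rag_pred):
    """Count correctness flips between WITH_RULES_RAG and COT_WITH_RULES_RAG.

    All three arguments map scenario ID -> PLr label ('a'..'e').
    Cases correct (or wrong) under both prompts do not contribute.
    """
    confusions, rescues = [], []
    for i, gt in gold.items():
        rag_ok = rag_pred[i] == gt
        cot_ok = cot_rag_pred[i] == gt
        if rag_ok and not cot_ok:
            confusions.append(i)   # CoT degraded a correct retrieval-based call
        elif not rag_ok and cot_ok:
            rescues.append(i)      # CoT filtered retrieval noise
    return confusions, rescues

# Toy example mirroring the Table 2 accounting:
gold = {"TH_005": "d", "ME_010": "d"}
conf, resc = confuse_rescue_counts(gold,
                                   {"TH_005": "d", "ME_010": "e"},
                                   {"TH_005": "e", "ME_010": "d"})
print(len(conf), len(resc))  # 1 confusion (TH_005), 1 rescue (ME_010)
```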
Model-Level Summary: DeepSeek and Claude incur more confusions than rescues, indicating that adding CoT tends to amplify misleading retrieval cues for these families; o4-mini shows a small mixed effect; GPT-5-mini is neutral (confusions balanced by rescues); and o3-mini benefits markedly (rescues ≫ confusions), suggesting that CoT stabilizes retrieval for compact models.

Mechanism (Observed Failure Mode)

Retrieval is not inherently “noise”; its effect depends on structural alignment with the target scenario. When neighbors are aligned in ( S , F , P ) semantics, CoT can rescue errors by reasserting rule-consistent assignments. When neighbors are partially inconsistent, CoT often internalizes their phrasing, producing the following:
  • P-inflation (P1 → P2): the dominant failure, frequently triggered by language implying avoidance is “scarcely possible,” which elevates PLr at the b → c and d → e boundaries.
  • F-drift (F1 → F2): a secondary effect from neighbors emphasizing “frequent/continuous” exposure and “long duration,” further pushing borderline classes upward.
Severity cues (S) are occasionally up-weighted but are rarely the deciding factor compared to P and F. Detailed exemplars (IDs, retrieved snippets, and CoT traces) are provided in Appendix A.1. At a high level, these results imply that retrieval should be governed by structural consistency (e.g., filters or down-weighting for conflicting ( S , F , P ) cues) and that combining CoT with RAG should be model-specific—enabled where RescueCount > ConfuseCount and avoided otherwise.
To clarify methodological choices and strengthen transparency, the following points are highlighted:
  • The tables explicitly indicate which retrieved neighbors contributed to misclassifications. For example, the phrase “scarcely possible” in neighbors systematically induced P1 → P2 upgrades, shifting PLr from b → c or d → e (see Appendix A.1).
  • “Noise” is defined as structurally inconsistent S/F/P cues present in retrieved neighbors. CoT sometimes internalized these cues (e.g., “frequent/continuous,” “long duration,” “scarcely possible”), inflating P or F relative to the ground-truth scenario. This mechanism is made explicit in the confusion traces.
  • Each confusion/rescue example specifies the exact S/F/P step where CoT diverged, verifying not only the final PLr prediction but also the correctness of intermediate reasoning.
  • All results were re-run under the most deterministic decoding exposed by providers (temperature = 0.0, top- p = 1.0 , top- k = 0 when available) with r = 5 independent repeats per condition. Figures report 95% Student t-intervals across runs to quantify between-run variability. Small residual variation (±3–5%) was observed, attributable to provider-side reasoning heuristics; reporting t-intervals ensures transparent, auditable comparisons across models and prompts.

4.6. Cross-Variant Synthesis: Rigorous Analysis, Critique, and Implications

In this section, results from both benchmark variants are synthesized to provide a rigorous analysis, critique, and set of implications. The discussion covers model-specific misclassification patterns, biases in class distribution, and reproducible failure modes under retrieval and reasoning strategies. Error analysis highlights how unsuitable neighbors and structural inconsistencies propagate through predictions, while quantitative summaries establish stability, variance, and safety-critical coverage. The findings are then translated into actionable guidance for industrial safety pipelines, with identified limitations informing directions for future work.

4.6.1. Model-Specific Misclassification Patterns

Claude and o4-mini are near-ceiling in Variant 1 (V1) and remain strong under Variant 2 (V2). Their residual errors under RAG (especially with CoT) concentrate on P-inflation (P1 → P2) in scenarios whose neighbors mention “scarcely possible” avoidance, yielding d → e upgrades.
GPT-5-mini is consistently robust: WITH _ RULES (V1) and WITH _ RULES _ RAG (V2) stay 0.94 with narrow CIs; CoT neither helps nor hurts materially. Gemini-2.5-flash is stable but conservative, with mild class-C overprediction when free text emphasizes frequency words.
DeepSeek-Reasoner is accurate on V1 but fragile under RAG in V2: retrieval cues with high-F/P wording are overweighted, producing b → c and d → e errors and a distinct drop in e-recall for WITH_RULES_RAG. o3-mini collapses without structured prompting in V1 (ZERO_SHOT/CoT) and exhibits the anomaly in V1+RAG where COT_WITH_RULES_RAG degrades Macro-F1; in V2, adding CoT to RAG rescues several cases (6 rescues vs. 1 confusion).

4.6.2. Prediction Bias and Class Distribution

Macro-/Weighted-F1 and per-class panels show that, without explicit rules, models are biased toward classes a and b, while under-recalling mid-frequency PLr classes c and d. Rule grounding removes most of this bias in both V1 and V2. Safety-critical PLr class e remains near-perfect under rule prompts across models and variants, with the notable exception of DeepSeek under WITH _ RULES _ RAG in V2 (dip in PLr class e-recall), demonstrating that RAG is not universally stabilizing.

4.6.3. Failure Modes in Reasoning

Two reproducible mechanisms were observed:
  • P-inflation: CoT integrates neighbor phrases like “scarcely possible,” upgrading P1 → P2 and shifting PLr upward (b → c, d → e).
  • F-drift: Frequency adjectives in neighbors (“frequent,” “continuous”) bias CoT toward F2 when the target is F1.
Both are amplified by RAG when retrieved neighbors are lexically similar but structurally inconsistent in S/F/P. Conversely, when neighbors are structurally aligned, RAG reduces variance (e.g., o3-mini rescues).

4.6.4. Error Analysis and Misclassification Insights

The confusion/rescue accounting (V2, RAG vs. CoT+RAG) disentangles retrieval effects by model: DeepSeek (3 confusions, 0 rescues) and Claude (8/1) are harmed by CoT on top of RAG; o3-mini (1/6) benefits; GPT-5-mini is neutral (2/2). Traces localize the internal mistake (S/F/P step) and the offending neighbor cue, exposing the underlying mechanism (e.g., P-inflation) rather than only final PLr flips. Mislabels cluster in domains with ambiguous exposure semantics (vibration, noise, minor burns), highlighting the need for retrieval filters that enforce S/F/P consistency over pure semantic similarity.

4.6.5. Summary of Key Quantitative Findings

  • V1, non-RAG:  WITH _ RULES dominates (often ≥0.99; tight CIs). PURE _ CoT is volatile; ZERO _ SHOT fails on compact models (o3-mini). COT _ WITH _ RULES stabilizes CoT but does not exceed rules-only. This shows that adding statistical intervals makes the performance stability explicit.
  • V1, with RAG: Both WITH _ RULES _ RAG and COT _ WITH _ RULES _ RAG are at (near) ceiling except the o3-mini collapse under CoT+RAG (Macro-F1 ≈ 0.60 with wide CI). Retrieval alone suffices on canonical phrasing. This confirms that the baseline WITH _ RULES _ RAG condition provides a strong reference point.
  • V2, non-RAG: Lexical shift penalizes ZERO _ SHOT /CoT (drops up to ∼30 percentage points); rule prompts restore balance and push Claude/DeepSeek to ≥0.95. This demonstrates ecological validity by testing performance on free-text safety descriptions.
  • V2, with RAG: Accuracy remains near-ceiling for Claude, o4-mini, GPT-5-mini; degrades for DeepSeek and o3-mini under plain RAG; adding CoT+Rules flips signs by model (confusion vs. rescue). This quantifies the role of retrieval noise in shaping outcomes.
  • Safety-critical coverage: PLr class e-recall nearly perfect with rules across conditions except DeepSeek WITH _ RULES _ RAG in V2. This confirms that minority and safety-critical classes were explicitly measured.
  • Latency: Rules-only is faster than CoT variants in V1; in V2, WITH _ RULES _ RAG yields large speedups over COT _ WITH _ RULES _ RAG (e.g., Gemini ∼78→4 s/case) with comparable or better accuracy for several models. This clarifies how accuracy and runtime trade-offs were evaluated.

4.6.6. Implications for Industrial Safety Pipelines (Actionable)

Adopt a structure-first pipeline: deterministically extract S, F, P with rule-grounded prompts; compute PLr via the ISO risk graph. Prefer WITH _ RULES for dataset variant V1 (canonical ISO-style scenarios, valuable as a controlled research benchmark but seldom used in practice) or WITH _ RULES _ RAG for dataset variant V2 (engineer-authored free-text scenarios, which capture the wording used in shop-floor risk assessments and form the basis of conformity documentation reviewed in audits), both without CoT unless the model family (e.g., o3-mini) shows net rescues. This operationalizes verification of intermediate reasoning steps.
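As a concrete reference for the structure-first step, a minimal Python sketch of the ISO 13849-1 risk graph lookup is given below; the function and table names are illustrative, while the (S, F, P) → PLr mapping follows the risk graph of ISO 13849-1 [2].

```python
# ISO 13849-1 risk graph: (S, F, P) -> required Performance Level PLr.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a", ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b", ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c", ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d", ("S2", "F2", "P2"): "e",
}

def plr(s: str, f: str, p: str) -> str:
    """Deterministic PLr assignment from validated S/F/P parameters."""
    return RISK_GRAPH[(s, f, p)]

# P-inflation illustration: upgrading P1 -> P2 flips d -> e at the boundary.
assert plr("S2", "F2", "P1") == "d" and plr("S2", "F2", "P2") == "e"
```

Keeping this lookup outside the model makes the final classification auditable: the LLM's only job is parameter extraction, and any PLr flip is traceable to a specific S, F, or P change.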
Model–prompt matching: For Claude/o4/GPT-5, use WITH _ RULES _ RAG (fast, robust). For DeepSeek, disable CoT when RAG is on; consider rules-only or stricter retrieval filtering. This highlights the importance of establishing a clear baseline for WITH _ RULES _ RAG . CoT increases output-token spend; enable it only when the measured accuracy gain offsets the cost increment under your budget and latency constraints.
Retrieval governance: enforce S/F/P structural consistency filters (reject neighbors that imply different P or F); penalize phrases that trigger P-inflation; prefer neighbors with identical task primitives. This illustrates how retrieval noise can be mitigated. Constrain M and keep exemplars terse: aggressive over-retrieval often yields diminishing accuracy returns while linearly increasing input-token cost.
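A hedged sketch of such a consistency filter is shown below; it assumes each retrieved neighbor carries structured (S, F, P) metadata alongside its text, and the trigger-phrase list and step tolerance are illustrative policy choices, not values prescribed by this study.

```python
# Sketch: admit retrieved neighbors only when their (S, F, P) coding is
# consistent with the provisional estimate; down-rank P-inflation triggers.
P_INFLATION_TRIGGERS = ("scarcely possible",
                        "frequent-to-continuous",
                        "long duration")

def admit_neighbor(neighbor: dict, provisional: dict, max_step: int = 0) -> bool:
    """neighbor/provisional: {'S': 1|2, 'F': 1|2, 'P': 1|2, 'text': str}.
    max_step = 0 enforces exact S/F/P agreement; 1 allows one-step tolerance."""
    return all(abs(neighbor[k] - provisional[k]) <= max_step
               for k in ("S", "F", "P"))

def rank_neighbors(neighbors: list, provisional: dict) -> list:
    """Drop structurally inconsistent neighbors, then sort the remainder so
    that snippets likely to induce P-inflation come last when P1 is implied."""
    admitted = [n for n in neighbors if admit_neighbor(n, provisional)]
    def penalty(n):
        if provisional["P"] != 1:
            return 0
        return sum(t in n["text"].lower() for t in P_INFLATION_TRIGGERS)
    return sorted(admitted, key=penalty)
```

In practice, such a filter sits between the vector index and the prompt assembler, so that only S/F/P-consistent exemplars ever reach the model.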
Safety gating: hard guardrails on PLr class e: if PLr class e-recall confidence or S/F/P agreement falls below thresholds, the case is routed to human review.
Operational KPIs: report Macro-/Micro-/Weighted-F1, per-class recall, and 95% CIs by default; track confusion/rescue counts to monitor RAG–CoT interactions in production. This ensures richer metrics and statistics are available for monitoring.
Cost governance under determinism: Track cost-per-correct decision alongside accuracy and latency. Prefer mini/flash models when their cost-per-correct decision is within a few percent of premium models under WITH _ RULES _ RAG , reserving premium models only for routed edge cases. Control token expenditure by
  • (i) limiting max_tokens and using stop sequences,
  • (ii) keeping the number of retrieved exemplars (M) small, and
  • (iii) exploiting provider-side prompt caching or batching (where available) to amortize static rule blocks (e.g., ISO rules, schema).

4.6.7. Threats to Validity (with Mitigations)

  • Sample size and balance: This study represents the first systematic evaluation of deterministic PLr classification with rule-grounded RAG, focused primarily on reasoning-capable LLMs. A pilot scale was adopted with N = 100 per variant (V1/V2). Per-class counts can be small; this is partially mitigated through the use of 95% confidence intervals and class-sensitive metrics. Future work should expand N and rebalance classes. This acknowledges dataset size limitations while outlining a clear mitigation path.
  • Ecological validity: Dataset variant V1 is canonical; dataset variant V2 uses engineer-authored free text (improves realism) but is from a single author and domain. Extend to multi-site, multi-author corpora and deliberately ambiguous/incomplete cases. This highlights the need to broaden coverage to capture real-world ambiguity.
  • Decoding determinism: All reported runs use temperature = 0.0 with fixed top-p/top-k; nevertheless, single-pass evaluations can mask variance, so future work will include multi-seed runs with significance tests. This ensures that determinism and variance are both addressed in evaluation.
  • RAG configuration opacity: Retrieval choices (k, similarity function, domain filters) influence outcomes. Confusion/rescue exemplars and S/F/P traces are now exposed; future work will ablate retrieval k, similarity metrics, and structural filters. This increases transparency about retrieval configurations.
  • Intermediate reasoning correctness: S/F/P steps were verified in error tables; broader audits should explicitly score S/F/P accuracy alongside PLr. This ensures that intermediate reasoning is evaluated in addition to the final classification outcome.

5. Conclusions

Deterministic risk classification in industrial functional safety demands transparent and reproducible methods. At the same time, the rapid rise of reasoning-capable LLMs has created both excitement and uncertainty: they are promoted as tools for structured decision-making, yet their suitability for safety-regulated workflows and audit-grade risk assessments requires validation through standards-aligned, extensive empirical evaluation. Clarifying this suitability is of direct interest not only to researchers but also to functional safety engineers and conformity assessment bodies, for whom reliable and auditable risk classification is a core requirement.
This study addresses that gap by providing the first systematic benchmark of structured prompting strategies for PLr estimation, applied to state-of-the-art reasoning-capable LLMs and comparing canonical ISO-style scenarios (Variant 1) with engineer-authored free-text descriptions (Variant 2). The results provide direct evidence and concrete guidance on the suitability of LLMs for deployment in functional safety workflows and regulatory conformity assessments.
Key findings are as follows:
  • Rule-grounded prompting ( WITH _ RULES , WITH _ RULES _ RAG ) consistently outperformed zero-shot and unconstrained CoT (RQ1). Variant 1 (ISO-style) reached ceiling-level accuracy, while Variant 2 (engineer-authored free text) required explicit rules to restore reliability under lexical variability.
  • Model scale was critical: Claude-opus, o4-mini, and GPT-5-mini remained stable, whereas o3-mini collapsed without structured prompting. DeepSeek-Reasoner, despite strong Variant 1 performance, degraded under retrieval noise in Variant 2, showing that RAG is not uniformly beneficial (RQ2).
  • Free-form CoT reasoning introduced volatility, increased latency (2–10×), and sometimes amplified retrieval inconsistencies (P-inflation, F-drift). Rules-only prompts were both most accurate and most efficient (RQ2).
  • Reasoning traces (S/F/P chains) often diverged from ISO-consistent logic, underscoring that CoT reflects token continuation rather than genuine reasoning. This highlights the risks of anthropomorphization, i.e., attributing human-like reasoning to the outputs of reasoning-capable LLMs. Determinism and correctness were achieved only when outputs were constrained by explicit ISO 13849-1 rules, which reduced the open-ended reasoning space to a well-defined decision graph and ensured reproducible PLr outcomes (RQ2, RQ3).
  • Across canonical and lexical-shift settings, rules were necessary and typically sufficient for deterministic PLr assignment. RAG functioned as a conditional accelerator that improved robustness and latency for some model families (Claude/o4/GPT-5) but introduced confusion in others (e.g., DeepSeek) unless structurally filtered (RQ3).
Implications: LLM-generated reasoning can create a misleading sense of reliability, as superficially coherent outputs may be mistaken for sound inference, leading to overconfidence in safety-regulated environments. Industrial deployment should therefore emphasize
  • (i) strict rule-based prompting with independent human validation, and
  • (ii) prompt–model matching with retrieval governance to mitigate systematic errors.
Future work should assess stability across evolving model versions, extend datasets with multi-annotator disagreement modeling, and incorporate adversarial and ambiguous cases. Retrieval ablations, classical baselines, and multi-seed significance testing will further strengthen audit-grade reproducibility and deployment evidence.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset utilized for the experiments, comprising 7800 hazard scenarios based on ISO 12100 Annex B and ISO 13849-1: Annex A, is openly available at https://github.com/piyenghar/hazardscenariosISO12100AnnexB, accessed on 4 July 2025. The specific prompts employed for the evaluation of the reasoning models are provided within the main body of this paper.

Acknowledgments

The author thanks the anonymous reviewers for their constructive feedback, which substantially improved the clarity, scope, and rigor of this paper.

Conflicts of Interest

Author Padma Iyenghar was employed by the company innotec GmbH-TÜV Austria Group. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Appendix A.1. Confusion and Rescue IDs per Model

Some examples of confusion and rescue cases per model are listed here and described in the tables below. The label GT in the tables represents the ground truth in the dataset (Variant 2).
  • DeepSeek-reasoner: confusion = {TH_005, NO_002, VI_004}, rescues = {}.
  • GPT-5-mini: confusion = {ME_004, VI_004}, rescues = {ME_006, VI_002}.
  • o3-mini: confusion = {ME_006}, rescues = {ME_010, EL_006, TH_004, NO_007, VI_001, VI_010}.
  • o4-mini: confusion = {ME_004, TH_005}, rescues = {VI_004}.
  • Claude-opus-4-1: confusion = {ME_004, ME_006, EL_002, EL_007, TH_001, TH_004, TH_009, VI_003}, rescues = {ME_005}.
  • Gemini-2.5-flash: confusion = {TH_005, NO_002, VI_004}, rescues = {ME_004}.
Table A1. DeepSeek-reasoner: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (S/F/P) | Key Retrieved Cues (top-k)
TH_005 | d | d | e | S2: irreversible burn/death; F2: frequent/prolonged; P2: “scarcely possible” (invoked due to splash + PPE inconsistency) ⇒ PLr = e. | “…cleaning tasks … heat sources…”; “…frequent-to-continuous exposure … long duration…”; “…avoidance is scarcely possible…”
NO_002 | b | b | c | S1: reversible hearing loss; F2: frequent/prolonged; P2: “avoidance scarcely possible without consistent PPE” ⇒ PLr = c. | “…normal operation … moving parts…”; “…frequent-to-continuous exposure … long duration…”; “…avoidance is scarcely possible…”
Note. PPE = Personal Protective Equipment. Under ISO 13849-1, the avoidance parameter P reflects the intrinsic possibility for a person to avoid the hazard (task geometry, speed, warning time, etc.). Variability in PPE use is not grounds to upgrade P1 → P2.
Table A2. GPT-5-mini: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (S/F/P → PLr) | Retrieved Cue (Salient Phrases)
ME_004 | d | d | e | S2 (irreversible injury/death), F2 (frequent/prolonged), P2 (“scarcely possible” due to warehouse blind spots) ⇒ PLr e. | Neighbors emphasize frequent-to-continuous exposure, serious injury or death, and scarcely possible avoidance, which together bias P: P1 → P2 (P-inflation) and push d → e.
VI_004 | b | b | c | S1 (slight, reversible joint pain), F2 (frequent/prolonged), P2 asserted (breaks/PPE framed as inconsistently used) ⇒ PLr c. | Neighbors mention vibrating equipment, frequent-to-continuous exposure, and “scarcely possible” avoidance; these cues elevate P despite scenario-consistent mitigations (breaks/PPE), yielding b → c.
Table A3. GPT-5-mini: rescue cases (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (S/F/P → PLr) | Retrieved Cue (Salient Phrases)
ME_006 | b | c | b | S1 (slight, reversible crushing), F2 (frequent/prolonged), P1 (avoidance possible with correct procedures) ⇒ PLr b. | Some neighbors contain “scarcely possible” wording that nudges P: P1 → P2 (error in WITH_RULES_RAG). CoT reasserts rule-consistent P1 given the scenario’s mitigations and similar S1 examples.
VI_002 | b | c | b | S1 (reversible, e.g., early HAVS), F2 (continuous/prolonged), P1 (avoidance possible via breaks) ⇒ PLr b. | Neighbors emphasize seldom/short exposure and slight reversible injury; WITH_RULES_RAG overweights separate cues claiming scarcely possible avoidance. CoT filters these and restores P1.
Table A4. o3-mini: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Condensed) | Key Retrieved Cue
ME_006 | b | b | c | S1 (slight), F2, P2 (procedures often disregarded) ⇒ PLr c (pred.) | “…operators performing cleaning/setup tasks… frequent-to-continuous exposure with long duration; possibility scarcely possible…”
Table A5. o3-mini: rescue cases (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Condensed) | Key Retrieved Cue
ME_010 | d | e | d | S2 (serious), F2, P1 (avoidance via checks/procedures) ⇒ PLr d | “…maintenance tasks; acceleration/deceleration; frequent-to-continuous exposure; serious injury or death…”
EL_006 | c | d | c | S2, F1, P1 (basic lockout/spacing) ⇒ PLr c | “…exposure to live electrical parts due to insufficient distance… short exposure; avoidance possible under conditions…”
TH_004 | b | c | b | S1 (minor burns), F2, P1 (procedural avoidance) ⇒ PLr b (pred.) | “…hot surfaces/heat sources; frequent-to-continuous exposure…”
NO_007 | d | e | d | S2, F1, P1 (controls allow avoidance) ⇒ PLr d (pred.) | “…shockwave/noise; tasks seldom-to-less-often with short exposure; avoidance possible under specific conditions…”
VI_001 | d | e | d | S2 (HAVS), F2, P1 (PPE/procedures enable avoidance) ⇒ PLr d | “…unbalanced rotating parts; frequent-to-continuous exposure; consequence framed as tiredness in some neighbors…”
VI_010 | d | e | d | S2, F2, P1 (breaks/controls) ⇒ PLr d | “…vibrating equipment during normal operations; frequent-to-continuous exposure; discomfort examples…”
Table A6. o4-mini: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Summary) | Retrieved Cue (Excerpt)
ME_004 | d | d | e | S2 (irreversible) + F2 (frequent/prolonged) + P2 (blind spots ⇒ avoidance scarcely possible) ⇒ PLr = e. | “In industrial environments, operators performing cleaning tasks may encounter moving elements that can lead to drawing-in or trapping. These tasks are characterized by frequent-to-continuous exposure with long duration, and the potential consequence is serious injury or death.”
TH_005 | d | d | e | Liquid metal splash: S2 + F2 + P2 (PPE inconsistency interpreted as scarce avoidance) ⇒ PLr = e. | “In industrial settings, operators performing cleaning tasks may encounter radiation from heat sources, which can lead to scald injuries. The tasks are characterized by frequent-to-continuous exposure with long duration, and the likelihood of occurrence is scarcely possible.”
Table A7. o4-mini: rescue cases (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Summary) | Retrieved Cue (Excerpt)
VI_004 | b | c | b | S1 (slight, reversible joint pain) + F2 (frequent) + P1 (avoidance possible with footwear) ⇒ PLr = b; aligns with similar EX2 (second retrieved neighbour). | “In industrial settings, operators performing normal operation tasks may encounter vibrating equipment, which can lead to discomfort. These tasks are characterized by frequent-to-continuous exposure with long duration, and the potential consequence is slight, normally reversible injuries.”
Table A8. Claude-opus-4-1: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Abridged) | Retrieved Cue (Salient Phrases)
ME_004 | d | d | e | S2 (irreversible injury/death) + F2 (frequent) + P2 (blind spots ⇒ avoidance scarcely possible) ⇒ PLr e. | “frequent-to-continuous exposure; long duration” (F2); “serious injury or death” (S2); generic trapping/drawing-in phrasing that nudges P2 escalation
TH_004 | b | b | c | S1 (minor reversible burns) + F2 (frequent/prolonged in kitchen) + P2 (distraction-prone, avoidance scarce) ⇒ PLr c. | “flames … discomfort” (S1); mixed frequency: “seldom-to-less-often … short exposure” (F1) vs. “scarcely possible” (P2); conflicting F/P cues bias the chain toward P2
Table A9. Claude-opus-4-1: rescue cases (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Abridged) | Retrieved Cue (Salient Phrases)
ME_005 | d | e | d | S2 (serious) + F2 (loading operations frequent/prolonged) + P1 (procedural avoidance feasible) ⇒ PLr d. | “frequent-to-continuous … long exposure” (F2); “serious injury or death” (S2); absence of explicit “scarcely possible” cue; CoT reasserts P1 per the ISO risk graph
Table A10. Gemini-2.5-flash: confusion cases (WITH_RULES_RAG correct; COT_WITH_RULES_RAG wrong).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Abridged) | Retrieved Cue (Salient Phrases)
TH_005 | d | d | e | S2 (irreversible burns/death) + F2 (frequent, long duration) + P2 (“scarcely possible” avoidance) ⇒ PLr e. | “frequent-to-continuous exposure; long duration” (F2); “scald injuries; serious injury or death” (S2); “likelihood scarcely possible” ⇒ CoT upgrades P to P2
NO_002 | b | b | c | S1 (reversible hearing loss) + F2 (frequent) + P2 (avoidance “scarcely possible” without consistent PPE) ⇒ PLr c. | “moving parts … permanent hearing loss” (pushes toward S2); “frequent-to-continuous exposure; long durations” (F2); “possibility scarcely possible” ⇒ CoT asserts P2
Note. The confusion mechanism here is P-inflation driven by P2-leaning neighbor phrases.
Table A11. Gemini-2.5-flash: rescue case (WITH_RULES_RAG wrong; COT_WITH_RULES_RAG correct).
Hazard | GT | WITH_RULES_RAG | COT_WITH_RULES_RAG | CoT Trace (Abridged) | Retrieved Cue (Salient Phrases)
ME_004 | d | e | d | S2 (irreversible injury) + F2 (frequent/prolonged) + P1 (avoidance possible under procedures) ⇒ PLr d. | “drawing-in/trapping; serious injury or death” (S2); “frequent-to-continuous exposure; long duration” (F2); CoT filters the P2-leaning cue and reinstates procedural P1
Note. Rescue mechanism: CoT + Rules counteracts a P2-leaning neighbor and re-anchors on rule-consistent P1, correcting e → d on ME_004.

Appendix A.2. Mechanism-Level Inferences (From Confusion/Rescue Tables)

Dominant Confounder: P-Inflation

Across models and variants, the most frequent failure mode is an upward shift in the possibility-of-avoidance parameter (P: P1 → P2). This P-inflation is systematically triggered by RAG neighbors containing phrases such as “scarcely possible (avoidance)” and “frequent-to-continuous exposure; long duration,” and by appeals to inconsistent PPE usage. Under ISO 13849-1, P reflects task-intrinsic avoidability (e.g., geometry, speed, warning time), not compliance variability; therefore, “PPE inconsistency” is not sufficient to set P = 2. In the confusion cases, CoT + Rules + RAG internalizes these cues and escalates PLr (typically b → c or d → e); the corresponding Rules (+RAG) baselines remain stable.

Secondary Confounder: F-Drift

A smaller but consistent effect is F-drift (F1 → F2) driven by retrieved snippets over-emphasizing frequency/duration (“frequent-to-continuous,” “long exposure”). This pushes borderline decisions across class boundaries and often co-occurs with P-inflation.

Severity Cues Are Rarely Decisive

Severity (S) sometimes drifts upward (e.g., “permanent hearing loss,” “serious injury or death”), but flips are usually explained by P (and secondarily F) rather than S. When S contributes, it amplifies a decision already biased by P or F.

Rescue Mechanism: Rule-Consistent Re-Anchoring of P (and F)

In rescue cases, CoT + Rules corrects RAG-induced errors by explicitly reasserting P1 (procedural avoidability: breaks, footwear, checks, spacing/lockout) and, where needed, restoring F1 (short/seldom exposure). This re-anchoring is most prominent for compact models (e.g., o3-mini: rescues ≫ confusions), mixed for GPT-5-mini/o4-mini, and uncommon for Claude/DeepSeek where CoT tends to overweight P2-leaning neighbors.

Where Flips Concentrate

Misclassifications cluster at adjacent boundaries b → c and d → e, exactly where a one-step change in P or F is sufficient to cross the ISO risk graph threshold. Safety-critical coverage (PLr class e) remains robust under rule-only prompting; degradations emerge primarily when CoT is combined with RAG and P-inflation occurs.

Actionable Mitigations

  • Structural retrieval governance: Admit neighbors whose (S, F, P) are consistent with the provisional decision (e.g., within one step), and down-rank snippets containing “scarcely possible” when task geometry/procedures imply P1.
  • Conflict-aware inference: If retrieved neighbors disagree on P or F, prefer rule-only aggregation or escalate d / e -boundary cases to human review.
  • Model–prompt matching: Default to WITH _ RULES / WITH _ RULES _ RAG for families where CoT+RAG confuses (e.g., Claude/DeepSeek/Gemini); enable COT _ WITH _ RULES _ RAG only where rescue > confusion is empirically observed (e.g., o3-mini).

Limitations and Scope

These inferences are drawn from r = 5 repeated runs per condition with 95% t-intervals; residual between-run variability reflects provider-side reasoning heuristics despite deterministic decoding. While the tables provide qualitative traces and retrieved-neighbor excerpts, an in-depth (S, F, P) accuracy audit on a larger dataset (N = 1000 or N = 2000) and retrieval ablations are planned as future work to quantify each mechanism’s marginal effect.

References

  1. ISO 12100:2010; Safety of Machinery: General Principles for Design: Risk Assessment and Risk Reduction. ISO: Geneva, Switzerland, 2010. Available online: https://www.iso.org/standard/51528.html (accessed on 31 January 2025).
  2. ISO 13849-1:2023; Safety of Machinery—Safety-Related Parts of Control Systems—Part 1: General Principles for Design. ISO: Geneva, Switzerland, 2023. Available online: https://www.iso.org/standard/73481.html (accessed on 5 May 2025).
  3. IFA Report 2/2017e Functional Safety of Machine Controls—Application of EN ISO; Deutsche Gesetzliche Unfallversicherung: Berlin, Germany, 2019.
  4. European Parliament and Council. Regulation (EU) 2023/1230 of the European Parliament and of the Council of 14 June 2023 on machinery and repealing Directive 2006/42/EC of the European Parliament and of the Council and Council Directive 73/361/EEC. Off. J. Eur. Union 2023, L 165, 1–102.
  5. Iyenghar, P.; Hu, Y.; Kieviet, M.; Pulvermüller, E.; Wübbelmann, J. AI-Based Assistant for Determining the Required Performance Level for a Safety Function. In Proceedings of the 48th Annual Conference of the IEEE Industrial Electronics Society (IECON 2022), Brussels, Belgium, 17–20 October 2022; pp. 1–6.
  6. Iyenghar, P.; Kieviet, M.; Pulvermüller, E.; Wübbelmann, J. A Chatbot Assistant for Reducing Risk in Machinery Design. In Proceedings of the 21st IEEE International Conference on Industrial Informatics (INDIN 2023), Lemgo, Germany, 18–20 July 2023; pp. 1–8.
  7. Iyenghar, P. On the Development and Application of a Structured Dataset for Data-Driven Risk Assessment in Industrial Functional Safety. In Proceedings of the 21st IEEE International Conference on Factory Communication Systems (WFCS 2025), Rostock, Germany, 10–13 June 2025; pp. 1–8.
  8. Iyenghar, P. Evaluating LLM Prompting Strategies for Industrial Functional Safety Risk Assessment. In Proceedings of the 8th IEEE International Conference on Industrial Cyber-Physical Systems (ICPS 2025), Emden, Germany, 12–15 May 2025; pp. 1–4.
  9. Gemini 2.0-Flash. Available online: https://deepmind.google/technologies/gemini/flash/ (accessed on 30 January 2025).
  10. Iyenghar, P.; Zimmer, C.; Gregorio, C. A Feasibility Study on Chain-of-Thought Prompting for LLM-Based OT Cybersecurity Risk Assessment. In Proceedings of the 8th IEEE International Conference on Industrial Cyber-Physical Systems (ICPS 2025), Emden, Germany, 12–15 May 2025; pp. 1–4.
  11. Nouri, M.; Karakostas, D.; Hummel, L.; Pretschner, A. Automating Automotive Hazard Analysis and Risk Assessment with Large Language Models: Opportunities and Limitations. arXiv 2024, arXiv:2401.07791.
  12. Qi, Z.; Wang, C.; Zhang, M.; Ma, Y.; Xie, B. Can ChatGPT Help with System Theoretic Process Analysis? A Pilot Study. In Proceedings of the 2025 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Sao Paulo, Brazil, 4–7 November 2025; pp. 1–7.
  13. Collier, D.; Vincent, K.; King, J.; Griffiths, D.; Marshall, Y.; Wronska, K. Evaluating Large Language Models for Consumer Product Safety Risk Assessment. Saf. Sci. 2024, 176, 107083.
  14. Diemert, E.; Weber, G. CoHA: Collaborating with ChatGPT for Hazard Analysis. In Proceedings of the 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), Dublin, Ireland, 16–20 April 2023; pp. 139–146.
  15. Sammour, M.; Kreahling, W.C.; Padgett, J.; Ammann, P. Performance of GPT-3.5 and GPT-4 on the Certified Safety Professional Exam: An Exploratory Study. Saf. Sci. 2024, 182, 108002.
  16. Iyenghar, P. Clever Hans in the Loop? A Critical Examination of ChatGPT in a Human-In-The-Loop Framework for Machinery Functional Safety Risk Analysis. Eng 2025, 6, 31.
  17. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Le, Q.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
  18. Kojima, T.; Gu, S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213.
  19. Saparov, A.; He, H. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
  20. Schaeffer, R.; Pistunova, K.; Khanna, S.; Consul, S.; Koyejo, S. Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting. arXiv 2023, arXiv:2307.10573.
  21. Kambhampati, S.; Stechly, K.; Valmeekam, K.; Saldyt, L.; Bhambri, S.; Palod, V.; Gundawar, A.; Samineni, S.R.; Kalwar, D.; Biswas, U. Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! arXiv 2025, arXiv:2504.09762.
  22. Stechly, K.; Valmeekam, K.; Gundawar, A.; Palod, V.; Kambhampati, S. Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens. arXiv 2025, arXiv:2505.13775.
  23. Chen, Y.; Benton, J.; Radhakrishnan, A.; Uesato, J.; Denison, C.; Schulman, J.; Somani, A.; Hase, P.; Wagner, M.; Roger, F.; et al. Reasoning Models Don’t Always Say What They Think. arXiv 2025, arXiv:2505.05410.
  24. Shojaee, P.; Mirzadeh, I.; Alizadeh, K.; Horton, M.; Bengio, S.; Farajtabar, M. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. arXiv 2025, arXiv:2506.06941.
  25. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 9459–9474. [Google Scholar]
  26. Xue, Z.; Wu, X.; Li, J.; Zhang, P.; Zhu, X. Improving Fire Safety Engineering with Retrieval-Augmented Large Language Models. Fire Technol. 2025, 61, 1281–1301. [Google Scholar]
  27. Meng, Y.; Jiang, F.; Qi, Z. Retrieval-Augmented Generation for Human Health Risk Assessment: A Case Study. In Proceedings of the 2025 International Conference on Artificial Intelligence in Toxicology (AITOX), Beijing, China, 15–18 October 2025; pp. 101–110. [Google Scholar]
  28. Hillen, T.; Eisenhauer, M. LASAR: LLM-Augmented Hazard Analysis for Automotive Risk Assessment. In Proceedings of the SAFECOMP, Florence, Italy, 17 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 143–154. [Google Scholar]
  29. Guha, N.; Hu, D.E.; Hendry, L.; Li, N.; Meng, L.; Nanda, S.; Nori, R.; Shardlow, M.; Shoberg, J.; Soni, A.; et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in LLMs. arXiv 2023, arXiv:2308.11462. [Google Scholar]
  30. Khandekar, N.; Shen, C.; Mian, Z.; Wang, Z.; Kim, J.; Sriram, A.; Hu, H.; Shah, N.; Patel, R. MedCalc-Bench: Evaluating Large Language Models for Medical Calculations. Adv. Neural Inf. Process. Syst. 2024, 37, 84730–84745. [Google Scholar]
  31. Wang, J.; Wang, M.; Zhou, Y.; Xing, Z.; Liu, Q.; Xu, X.; Zhang, W.; Zhu, L. LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements. arXiv 2025, arXiv:2505.22959. [Google Scholar] [CrossRef]
  32. Sandmann, S.; Hegselmann, S.; Fujarski, M.; Bickmann, L.; Wild, B.; Eils, R.; Varghese, J. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 2025; epub ahead of print. [Google Scholar] [CrossRef] [PubMed]
  33. Araya, R. Do Chains-of-Thoughts of Large Language Models Suffer from Hallucinations, Cognitive Biases, or Phobias in Bayesian Reasoning? arXiv 2025, arXiv:2503.15268. [Google Scholar]
  34. LangChain Inc. LangGraph: Agentic Workflows for LLM Applications. 2024. Available online: https://www.langchain.com/langgraph (accessed on 3 July 2025).
  35. Chase, H. LangChain: Building Applications with LLMs Through Composability. 2022. Available online: https://www.langchain.com (accessed on 3 July 2025).
  36. Chroma Team. Chroma: The AI-Native Open-Source Vector Database. 2023. Available online: https://www.trychroma.com (accessed on 3 July 2025).
  37. Iyenghar, P. Comprehensive Curated Dataset of Hazard Scenarios Systematically Generated Based on Annex B of ISO 12100 and PLr Assigned Based on ISO. GitHub Repository, 2025. Available online: https://github.com/piyenghar/hazardscenariosISO12100AnnexB (accessed on 4 July 2025).
  38. OpenAI. Chat Completions Format API. 2024. Available online: https://platform.openai.com/docs/guides/text (accessed on 21 August 2025).
  39. Zhao, P.; Zhang, H.; Yu, Q.; Wang, Z.; Geng, Y.; Fu, F.; Yang, L.; Zhang, W.; Jiang, J.; Cui, B. Retrieval-Augmented Generation for AI-Generated Content: A Survey. arXiv 2024, arXiv:2402.19473. [Google Scholar]
  40. Anthropic. Claude Opus 4.1. 2025. Available online: https://www.anthropic.com/news/claude-opus-4-1 (accessed on 21 August 2025).
  41. DeepSeek. Reasoning Model (Deepseek-Reasoner). 2025. Available online: https://api-docs.deepseek.com/guides/reasoning_model (accessed on 21 August 2025).
  42. Google. Gemini Models—Gemini API. 2025. Available online: https://ai.google.dev/gemini-api/docs/models (accessed on 21 August 2025).
  43. Google Cloud. Gemini 2.5 Flash—Vertex AI. 2025. Available online: https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash (accessed on 21 August 2025).
  44. Google. Gemini Thinking. 2025. Available online: https://ai.google.dev/gemini-api/docs/thinking (accessed on 21 August 2025).
  45. OpenAI. Model: GPT-5 mini—OpenAI API. 2025. Available online: https://platform.openai.com/docs/models/gpt-5-mini (accessed on 21 August 2025).
  46. OpenAI. Using GPT-5. 2025. Available online: https://platform.openai.com/docs/guides/latest-model (accessed on 21 August 2025).
  47. OpenAI. OpenAI o3-mini. 2025. Available online: https://openai.com/index/openai-o3-mini/ (accessed on 21 August 2025).
  48. OpenAI. Model: o4-mini—OpenAI API. 2025. Available online: https://platform.openai.com/docs/models/o4-mini (accessed on 21 August 2025).
  49. OpenAI. Reasoning Models—OpenAI API. 2025. Available online: https://platform.openai.com/docs/guides/reasoning (accessed on 21 August 2025).
  50. Anthropic. Pricing. Available online: https://www.anthropic.com/pricing (accessed on 21 August 2025).
  51. Google DeepMind & Google. Gemini Developer API Pricing. Available online: https://ai.google.dev/gemini-api/docs/pricing (accessed on 21 August 2025).
  52. DeepSeek. Pricing Details (USD). Available online: https://api-docs.deepseek.com/quick_start/pricing-details-usd (accessed on 21 August 2025).
  53. OpenAI. API Pricing. Available online: https://openai.com/api/pricing/ (accessed on 21 August 2025).
Figure 1. The iterative process of risk assessment and risk reduction [3].
Figure 2. Risk graph from ISO 13849-1: Annex A [2].
Figure 3. Hybrid RAG pipeline used in the PLr experiment framework: vector over-retrieval → semantic gate (≥0.70) → lexical Jaccard filter (≥0.30) → optional S/F/P gate → deduplicate/rank by J and keep top M → package as {rag_examples} and inject into the RAG chains.
Figure 4. Accuracy comparison with 95% t-intervals across runs for Variant 1 (non-RAG) across six models and four prompting strategies.
Figure 5. Average processing time with 95% t-intervals across runs for Variant 1 (non-RAG).
Figure 6. Total execution time with 95% t-intervals across runs for Variant 1 (non-RAG).
Figure 7. Average accuracy heatmap for Variant 1 across models and non-RAG prompts.
Figure 8. Macro-F1 comparison by model and non-RAG prompting strategy for Variant 1.
Figure 9. Micro-F1 comparison by model and non-RAG prompting strategy for Variant 1.
Figure 10. Weighted-F1 comparison by model and non-RAG prompting strategy for Variant 1.
Figure 11. Per-class F1 comparison by model and non-RAG prompting strategy for Variant 1. (a) PLr classes a–c; (b) PLr classes d–e.
Figure 12. Per-class Precision comparison by model and non-RAG prompting strategy for Variant 1. (a) PLr classes a–c; (b) PLr classes d–e.
Figure 13. Per-class Recall comparison by model and prompting strategy. (a) PLr classes a–c; (b) PLr classes d,e.
Figure 14. Recall for PLr class e (highest safety requirement) by model and non-RAG prompting strategy for Variant 1.
Figure 15. Macro-F1 for Variant 1 (RAG-based prompts).
Figure 16. Micro-F1 for Variant 1 (RAG-based prompts).
Figure 17. Per-class F1 for Variant 1 (RAG-based prompts). (a) PLr classes a–c; (b) PLr classes d,e.
Figure 18. Recall for safety-critical class E (PLr class e) on Variant 1.
Figure 19. Weighted-F1 for Variant 1 (RAG-based prompts).
Figure 20. Accuracy on Variant 2. Bars show the mean across repeated runs for each (model, prompt) pair and error bars show the 95% t-interval across runs.
Figure 21. Average processing time per sample on Variant 2. Bars are run means; error bars are 95% t-intervals across runs.
Figure 22. Total execution time on Variant 2. Bars are run means; error bars are 95% t-intervals across runs.
Figure 23. Macro-F1 comparison on Variant 2. Rules ensure stability across minority PLr classes.
Figure 24. Micro-F1 (accuracy-equivalent) comparison on Variant 2.
Figure 25. Per-class F1 score comparison for Variant 2 without RAG. (a) PLr classes a–c, which are most affected by lexical drift without rules; (b) PLr classes d,e, showing stability under rule-grounded prompts.
Figure 26. Per-class precision comparison for Variant 2 without RAG. (a) PLr classes a–c, where lexical drift introduces false positives in non-rule settings; (b) PLr classes d,e, where rule-based prompting prevents over-prediction of dominant classes and sustains precision on safety-critical cases.
Figure 27. Per-class recall comparison for Variant 2 without RAG. (a) PLr classes a–c, where zero-shot prompting produces severe drops in recall, especially for classes b and c; (b) PLr classes d,e, where rule-based prompting maintains near-ceiling recall and preserves safety-critical detection.
Figure 28. Recall for safety-critical PLr class e on Variant 2. Rules maintain stability across models.
Figure 29. Weighted-F1 comparison on Variant 2. Without rules, minority-class weighting exposes vulnerabilities.
Figure 30. Variant 2 with RAG: Accuracy across six models for WITH_RULES_RAG and COT_WITH_RULES_RAG. Bars show mean accuracy; error bars denote 95% t-intervals across repeated runs.
Figure 31. Variant 2 with RAG: average processing time per sample (mean with 95% t-intervals).
Figure 32. Variant 2 with RAG: total execution time (mean with 95% t-intervals).
Figure 33. Variant 2: Macro-F1 (Top) and Micro-F1 (Bottom) comparison across models.
Figure 34. Variant 2 with RAG: Per-class F1 across PLr classes. (a) PLr classes a–c; (b) PLr classes d,e.
Figure 35. Variant 2 with RAG: Per-class Precision across PLr classes. (a) PLr classes a–c; (b) PLr classes d,e.
Figure 36. Variant 2 with RAG: Per-class Recall across PLr classes. (a) PLr classes a–c; (b) PLr classes d,e.
Figure 37. Variant 2: Recall for PLr class e (Top) and Weighted-F1 across all classes (Bottom).
Table 1. RAG controls used in all runs unless stated otherwise.

Parameter | Default
Semantic over-retrieval (k) | 20 (take a larger superset before filtering)
Lexical Jaccard (τ) | keep if J(A, B) ≥ 0.30
S/F/P gate | require_sfp_exact = false
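
Combining these controls with the pipeline in Figure 3, the semantic-then-lexical gate can be sketched as follows; the candidate schema, helper names, the choice M = 3, and the assumption that each candidate carries a vector-store similarity score are illustrative, not the framework’s actual implementation:

```python
# Illustrative sketch of the hybrid retrieval gate (cf. Figure 3 and Table 1).
def jaccard(a: str, b: str) -> float:
    """Lexical Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| over token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def gate(query: str, candidates: list, sem_min=0.70, j_min=0.30, top_m=3):
    """Semantic gate -> lexical Jaccard filter -> dedupe/rank by J -> top M."""
    scored = [
        (jaccard(query, c["text"]), c)
        for c in candidates
        if c["semantic_score"] >= sem_min  # semantic gate on vector similarity
    ]
    scored = [(j, c) for j, c in scored if j >= j_min]  # lexical Jaccard filter
    scored.sort(key=lambda jc: jc[0], reverse=True)     # rank by J, descending
    seen, ranked = set(), []
    for j, c in scored:                                 # deduplicate by text
        if c["text"] not in seen:
            seen.add(c["text"])
            ranked.append(c)
    return ranked[:top_m]  # packaged downstream as {rag_examples}

examples = gate(
    "worker reaches into press during die change",
    [{"text": "operator reaches into press during tool change", "semantic_score": 0.82},
     {"text": "forklift battery charging area ventilation", "semantic_score": 0.55}],
)
# Only the first candidate survives: the second fails the semantic gate (0.55 < 0.70).
```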
Table 2. Per-model accuracy and confusion/rescue counts on Variant 2.

Model | Acc (RAG) | Acc (CoT + RAG) | ConfuseCount | RescueCount
DeepSeek-reasoner | 0.98 | 0.92 | 3 | 0
GPT-5-mini | 0.94 | 0.94 | 2 | 2
o3-mini | 0.72 | 0.82 | 1 | 6
o4-mini | 0.96 | 0.94 | 2 | 1
Claude-opus-4-1 | 0.96 | 0.82 | 8 | 1
Gemini-2.5-flash | 0.92 | 0.92 | 3 | 1