Automatic Fault Detection and Diagnosis in ROS-Based Robotic Systems Using Generative AI: A Systematic Literature Review

Cardoso, Marta; Arrais, Rafael; Sousa, Armando

doi:10.3390/app16115545


by Marta Cardoso ¹, Rafael Arrais ¹,² and Armando Sousa ¹,²,*

¹ Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
² Institute for Systems and Computer Engineering, Technology and Science, Campus of the Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5545; https://doi.org/10.3390/app16115545

Submission received: 31 March 2026 / Revised: 28 May 2026 / Accepted: 29 May 2026 / Published: 2 June 2026

(This article belongs to the Special Issue Trends and Prospects in Software Engineering)


Abstract

The increasing complexity and distributed nature of Robot Operating System (ROS)-based robotic systems require advanced Fault Detection and Diagnosis (FDD) approaches that operate autonomously with minimal human intervention. The goal of this systematic literature review is to investigate how observability-driven FDD can be automated in ROS-based robotic systems to minimise human effort. Through this lens, the review surfaces four recurring gaps that collectively limit observability-driven automation: rich telemetry sources—logs, traces, and metrics—exist in isolation and are rarely integrated into real-time detection pipelines or leveraged collectively to improve failure diagnostics; online monitoring enables automatic fault detection but depends heavily on predefined rules and expert configuration and interpretation; failure explanations are generated post hoc and rely heavily on logs; and systems remain largely reactive, lacking the continuous monitoring infrastructure needed to anticipate faults before they propagate. Although Large Language Models (LLMs) show considerable promise for automated fault explanation and natural language interaction with robotic systems, current implementations fall short of comprehensive, real-time monitoring that unifies logs, traces, metrics, and sensor streams with Artificial Intelligence (AI) reasoning. To address these gaps, this paper motivates hybrid architectures that combine observability-first design, runtime monitoring, static analysis, and agentic LLM-based reasoning, laying the groundwork for more proactive and autonomous fault management in ROS-based systems.

Keywords:

ROS 2; Fault Detection and Diagnosis (FDD); real-time monitoring; observability; telemetry; Large Language Models (LLMs); Model Context Protocol (MCP); Retrieval-Augmented Generation (RAG); Reasoning and Acting (ReAct); generative AI; agentic AI

1. Introduction

The rapid evolution of robotics has produced increasingly sophisticated systems that operate in dynamic, uncertain, and often safety-critical environments. As robots transition from simulated and controlled environments to real-world scenarios, ensuring system reliability and operational robustness has become essential [1,2]. In this context, automatic FDD is critical for maintaining system reliability and maintainability in ROS-based robotic systems.

ROS-based systems are inherently complex [1,2,3], comprising distributed nodes that communicate asynchronously via topics, services, and actions, and integrate heterogeneous hardware and software components operating at different temporal and computational scales. This heterogeneity and loose coupling complicate the identification of causal relationships between observed symptoms and underlying faults [4].

As Cyber-Physical Systems (CPSs), robots tightly couple software and hardware, allowing faults to originate in software, hardware, or their interactions and propagate unpredictably [4,5]. The detection of many faults relies heavily on human oracles as system-level symptoms can be weak or unclear, and expected behavior is difficult to specify in terms of observable physical effects [4]. During execution, ROS applications generate large volumes of distributed runtime data, necessitating dedicated monitoring frameworks to ensure reliability, support real-time safety verification, and provide system observability [6,7,8,9,10].

Interpreting runtime data effectively often requires specialized domain knowledge and substantial configuration effort. As systems grow in size and complexity, manual fault analysis and the configuration of monitoring systems become increasingly time-consuming [9,10].

The growing complexity of ROS 2 system configuration increases the likelihood of bugs and faults [2,4]. Consequently, Quality Assurance (QA) tools can help prevent faults by identifying potential errors without requiring system execution [4]. These tools enable static analysis of ROS code to detect architectural violations, deviations from coding standards, and potential runtime issues [11,12].

Traditional FDD methods struggle to address the heterogeneity and dynamic behavior of ROS-based robotic systems. Rule-based and model-based approaches lack adaptability and generalization, while classical Machine Learning (ML) techniques depend on extensive feature engineering and labeled datasets that are costly and difficult to obtain in practice [5]. Although hybrid methods mitigate some of these limitations, they remain constrained in scalability and flexibility.

Recent advances in Generative AI, particularly LLMs, offer promising capabilities for robotic fault diagnosis through automated reasoning over unstructured and semi-structured data, such as ROS logs, and by generating human-interpretable explanations of robot behavior and failures [13,14,15,16,17,18,19].

Effective diagnosis requires rich system context beyond traditional log-based monitoring [20,21]. Modern observability platforms (e.g., Datadog [22] and New Relic [23]) address this need by correlating logs, metrics, and traces, with OpenTelemetry [24] providing standardized telemetry collection. However, these tools are primarily designed for human-centric dashboards, which limits programmatic access by automated reasoning systems and real-time diagnostic pipelines. The Model Context Protocol (MCP) [25] helps bridge this gap by enabling real-time AI access to observability data, while Retrieval-Augmented Generation (RAG) allows LLMs to incorporate external knowledge sources, improving diagnostic accuracy and explainability [14,17].

While LLMs have been widely applied to robotic reasoning, planning, and Human-Robot Interaction (HRI) [26,27,28], their systematic use for real-time monitoring and FDD in ROS-based systems remains underexplored.

This paper presents a systematic analysis of observability-driven FDD and the integration of LLMs and agentic AI in ROS-based systems, identifying critical gaps that currently limit autonomous fault management. It further provides a structured categorization of relevant studies and shows that, despite the maturity of the underlying technologies, none of the 29 reviewed studies integrates logs, traces, and metrics with real-time agentic AI reasoning and automation.

This work provides a reference for both researchers and practitioners on how generative AI can be leveraged for automated FDD beyond traditional natural language interaction. It makes three primary contributions: (i) a taxonomy for classifying and positioning FDD contributions within the ROS ecosystem, (ii) an evidence-based gap analysis identifying open research directions in multimodal observability and agentic AI integration, and (iii) a design-oriented reference for developing ROS-based FDD systems. In this context, we introduce observability-first principles and a four-layer architectural perspective to support scalable system monitoring and reasoning. The focus is on architectural and software design aspects of integrating agentic AI into ROS-based FDD systems, emphasizing automation, modularity, and system-level observability. Human interaction, hardware-specific considerations, and real-time constraints are outside the scope of this work.

A key contribution of this study is the identification and synthesis of a critical research gap: existing approaches do not effectively integrate multimodal telemetry with real-time agentic AI reasoning for FDD in ROS systems. This gap motivates future research directions and underpins hybrid architectures, derived from the current review and presented in Section 6.1.

The remainder of the paper is organized as follows. Section 2 presents the review methodology. Section 3 presents the review results, comprising a bibliometric overview of the selected studies and a taxonomy of the identified research themes and approaches. Section 4 provides a collection of foundational concepts necessary for understanding FDD, building monitoring systems, and LLM-based methods. Section 5 synthesizes the literature on FDD approaches, monitoring frameworks, and LLM integration. Finally, Section 6 discusses the main findings and research gaps, and Section 7 concludes the paper.

2. Methodology

The study follows the PRISMA guidelines [29] to ensure a structured and transparent selection process. The main goal was to evaluate the state of automated FDD in ROS-based systems from an observability perspective. In particular, the paper explores how LLMs can reduce human effort in data collection, correlation, and interpretation.

The methodology is structured into four research cycles. The first examines the fundamentals of ROS, including system architecture, communication mechanisms, and native tools. The second reviews existing FDD approaches, considering fault types, levels of automation, scalability, and required human involvement. The third investigates monitoring infrastructures and observability techniques that support effective FDD in ROS-based systems. The fourth analyzes recent research on the use of LLMs in ROS environments to automate fault detection and explanation, aiming to reduce human operator workload and improve interpretability.

Two complementary analytical perspectives were applied to the curated corpus. Descriptive analysis provides qualitative synthesis, addressing what the field contains and how contributions are structured through thematic categorization, architectural classification (e.g., tool-augmented, ReAct, RAG-based), and synthesis of trends and gaps. Its outputs include the unified taxonomy (Table 3), the comparative summary of LLM-based frameworks (Table 6), and the cross-study categorization in Appendix A. Statistical analysis provides the quantitative counterpart, addressing how much and how often through publication distributions, proportions of works covering specific FDD subtasks, and trends across research phases (Table 5). Together, these perspectives provide both empirical characterization and interpretive synthesis.

2.1. Review Procedure

The execution of the literature review followed a structured, stepwise procedure aligned with PRISMA guidelines [29]. The procedure consisted of the following stages:

Formulation of the overall research objective and definition of the Research Questions (RQs).
Development of search queries and selection of relevant data sources.
Screening of retrieved records based on predefined inclusion and exclusion criteria.
Full-text assessment of eligible studies and structured data extraction.
Synthesis, aggregation, and thematic analysis of the extracted data to derive higher-level findings.

2.2. Research Question

The main RQ was formulated as: How can observability-driven FDD be automated in ROS-based robotic systems to minimize human effort?

This was further broken down into four sub-questions:

RQ1: What limitations do current FDD approaches in ROS present with respect to automation and human effort?
RQ2: How do existing monitoring and observability frameworks in ROS support FDD?
RQ3: How can LLMs enhance automated fault detection, diagnosis, and explanation in ROS-based systems?
RQ4: What gaps exist in integrating observability with AI-driven diagnosis?

2.3. Databases and Search Queries

The literature was primarily sourced from IEEE Xplore and SpringerLink, which provide comprehensive coverage of peer-reviewed research in robotics and ensure access to high-quality studies on ROS-based systems. The selection of these two digital libraries was deliberate: they index the dominant publication venues for ROS-related research, robotics software engineering, and the intersection of AI with robotic systems, including IEEE Robotics and Automation Letters, IROS, ICRA, the RoSE workshop series, and Springer’s TAROS proceedings and LNCS volumes. Broader venues were not used as primary sources to avoid introducing substantial deduplication overhead. The snowballing strategy described below was specifically designed to capture relevant work published outside the two primary sources. The implications of these decisions for coverage are further discussed as a threat to validity in Section 6.2.

To align with the RQs, three Boolean query groups were designed, each targeting a specific aspect of the study to ensure broad and systematic coverage of the domain. Across both digital libraries, consistent filters were applied to refine the results. These included publication years between 2015 and 2025, disciplines limited to Computer Science and Engineering, subject areas focused on Robotics and Robotic Engineering, English-language publications, and searches restricted to titles, abstracts, and keywords.

A two-stage screening procedure was applied, consisting of title screening followed by full-text review. Studies were included if they met at least one of the following Inclusion Criteria (IC):

IC1: Studies addressing monitoring, diagnosis, debugging, or observability in ROS-based systems.
IC2: Studies proposing frameworks, tools, or methodologies for ROS monitoring.
IC3: Studies focusing on fault detection, anomaly detection, or diagnosis in ROS.
IC4: Studies integrating LLMs, generative AI, or AI agents with ROS.

Studies were excluded if they met any of the following Exclusion Criteria (EC):

EC1: Studies not directly related to ROS 1 or ROS 2.
EC2: Studies lacking sufficient technical or methodological detail.
EC3: Studies whose primary contribution was a domain-specific solution (e.g., FDD tailored exclusively to a particular robot type or task context) without presenting a generalizable framework, method, or finding applicable to ROS-based systems broadly. Studies describing domain-specific deployments that also contributed broadly applicable monitoring frameworks, tool designs, or AI integration patterns were retained.

Query 1—FDD

(("Robot Operating System" OR ROS OR ROS 2) AND ("Fault detection and diagnosis" OR "FDD" OR "fault detection" OR "fault diagnosis"))

This query yielded 126 results from SpringerLink and 90 results from IEEE Xplore, totaling 216 initial records.

Query 2—Monitoring and Debugging

("Robot Operating System" OR ROS OR ROS 2) AND ("runtime monitoring" OR "real-time monitoring" OR "online monitoring" OR debugging OR "runtime verification")

This query yielded 329 results from SpringerLink and 215 results from IEEE Xplore, totaling 544 initial records.

Query 3—LLM and AI Integration

(("Robot Operating System" OR "ROS" OR "ROS 2") AND ("large language model" OR "large language models" OR LLM OR "AI agent" OR "AI agents" OR "Agentic AI" OR "AI powered agent"))

This query yielded 115 results from SpringerLink and 100 results from IEEE Xplore, totaling 215 initial records.

In addition to database searches, three supplementary strategies were employed:

Snowballing: Backward and forward citation tracking was performed on key papers identified during the initial screening phase to discover additional relevant studies. This resulted in the identification of 11 additional papers.
Official Documentation: Key papers from official ROS documentation and the https://ros.org/ website were reviewed to ensure that foundational and widely adopted tools and frameworks were included. This yielded two additional papers.
Consensus: The Consensus AI-assisted search platform (https://consensus.app) was used in an exploratory capacity during the early stages of the review. It was queried using topic-level phrases such as “fault detection ROS”, “runtime monitoring Robot Operating System”, “LLM robotic systems”, and “observability ROS diagnostics”. The purpose was to acquire broader familiarity with the research landscape, identify emerging terminology, and discover work at the intersection of ROS, FDD, observability, and generative AI. Consensus was not employed as a systematic, reproducible search strategy: its underlying index is not publicly auditable, result ranking is opaque, and the platform does not support Boolean query syntax equivalent to the structured queries used in IEEE Xplore and SpringerLink. Through this process, two studies were identified that had not been captured by the database searches or snowballing: González-Santamarta et al. [30] and Sobrín-Hidalgo et al. [17]. Both were subsequently validated against the predefined ICs and ECs before admission to the final corpus. Their Consensus-sourced origin is acknowledged as a non-reproducible element in Section 6.2.

Database searching yielded 975 records. Duplicate records, arising from the same paper appearing in multiple query results or in both IEEE Xplore and SpringerLink, were identified and removed prior to screening. After removing 20 duplicates (13 cross-query duplicates in SpringerLink, 7 cross-query duplicates in IEEE Xplore, and 0 cross-library duplicates), 955 unique database records remained. Together with the 15 additional records from the supplementary strategies, 970 records underwent title-based screening followed by full-text assessment against the defined inclusion and exclusion criteria. The deduplication step is reflected in the PRISMA flow diagram (Figure 1). The selection process, detailed in Figure 1, resulted in 29 studies included in the final review. Appendix A provides the complete list of included studies with metadata and their mapping to the taxonomy dimensions defined in Table 3.
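The record counts in this subsection can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the figures reported in the text (per-query hits, duplicates removed, and supplementary records); the dictionary keys and variable names are illustrative labels, not part of the review protocol.

```python
# Per-query hit counts as reported in the text: (SpringerLink, IEEE Xplore).
query_hits = {
    "Q1 FDD": (126, 90),
    "Q2 Monitoring and Debugging": (329, 215),
    "Q3 LLM and AI Integration": (115, 100),
}
# Duplicates removed, as reported in the text.
duplicates = {"springerlink": 13, "ieee_xplore": 7, "cross_library": 0}
# Records added by the supplementary strategies, as reported in the text.
supplementary = {"snowballing": 11, "official_docs": 2, "consensus": 2}

per_query_totals = {name: sum(hits) for name, hits in query_hits.items()}
database_total = sum(per_query_totals.values())
unique_database = database_total - sum(duplicates.values())
screened = unique_database + sum(supplementary.values())

for name, total in per_query_totals.items():
    print(f"{name}: {total}")
print("database records:", database_total)
print("unique database records:", unique_database)
print("records entering screening:", screened)
```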

2.4. Data Extraction

All studies that passed the full-text review were analyzed to extract information relevant to the RQs, including:

Monitoring and observability frameworks and tools.
Diagnostic techniques and automation levels.
AI-based integration approaches.
Limitations and future research directions.

The synthesized evidence and the findings of the extracted data are presented in Sections 3 to 6.

2.5. Study Quality Assessment

To provide a calibrated view of the evidence base, each included study was assigned a composite quality score based on the criteria defined in Table 1. The total score ranges from 0 to 10 and is mapped to a global quality rating as follows: Low (L) for scores below 4, Medium (M) for scores between 4 and 6, and High (H) for scores of 7 or above. These ratings are reported in Appendix A (Table A1) and are used solely to contextualise the strength of evidence supporting individual claims. They do not influence study inclusion, which was determined independently based on the ICs and ECs.
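The mapping from composite score to global rating can be written as a small function. This is a minimal sketch that assumes integer scores, since the text does not say how a value between 6 and 7 would be treated; the function name `quality_rating` is hypothetical.

```python
def quality_rating(score: int) -> str:
    """Map a composite quality score (0 to 10) to L, M, or H.

    Thresholds follow the text: below 4 is Low, 4 to 6 is Medium,
    7 or above is High. Integer scores are an assumption.
    """
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score < 4:
        return "L"
    if score <= 6:
        return "M"
    return "H"


for example_score in (2, 5, 8):
    print(example_score, quality_rating(example_score))
```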

Citation impact was normalised by publication age to account for the limited time recent studies have had to accumulate citations, which is particularly relevant in emerging research domains (see Table 2). Citation counts were retrieved from Semantic Scholar. For each age bracket, studies were ranked by citation count and categorised into High, Medium, and Low tiers based on percentile thresholds (top 25%, middle 50%, and bottom 25%, respectively).
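One way to implement the percentile-based tiering within a single age bracket is sketched below. The study identifiers and citation counts are made up for illustration, and the handling of ties and of bracket sizes that do not divide evenly (here, a midpoint rank position) is an assumption, as the text does not specify it.

```python
from typing import Dict


def citation_tiers(citations: Dict[str, int]) -> Dict[str, str]:
    """Rank the studies of one age bracket by citation count and assign tiers.

    Top 25% of ranks -> High, middle 50% -> Medium, bottom 25% -> Low.
    """
    ranked = sorted(citations, key=citations.get, reverse=True)
    n = len(ranked)
    tiers: Dict[str, str] = {}
    for rank, study in enumerate(ranked):
        # Midpoint position of this rank, measured from the top (0 to 1).
        position = (rank + 0.5) / n
        if position <= 0.25:
            tiers[study] = "High"
        elif position <= 0.75:
            tiers[study] = "Medium"
        else:
            tiers[study] = "Low"
    return tiers


# Hypothetical bracket with four studies.
bracket = {"study_a": 40, "study_b": 12, "study_c": 7, "study_d": 0}
print(citation_tiers(bracket))
```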

Studies lacking sufficient metadata to support citation analysis (e.g., missing DOI or unavailable indexing in the citation source) were assigned the lowest score for this criterion. Venue tier was determined using the Computing Research and Education Association of Australasia (CORE) conference ranking and the SCImago Journal Rank (SJR) quartile system, where CORE A*/A and SJR Q1 venues were classified as High, CORE B and SJR Q2 as Medium, and CORE C and SJR Q3–Q4 as Low.

3. Results

This section presents the results of the systematic literature review. It first introduces the unified taxonomy developed to characterize the included studies and then gives a bibliometric overview that examines temporal trends, publication venues, topic evolution, and thematic clustering across the 29 included studies. The full characterization of the corpus is provided in Appendix A.

3.1. Taxonomy

To characterize the studies included in this review, a unified taxonomy was developed by the authors, grounded in the included studies and presented in Table 3. An initial set of high-level dimensions (fault origin, detection/diagnosis approach, observability modalities, automation level, and AI integration) was derived from the corpus of studies. The taxonomy was considered stable when all included studies could be classified without ambiguity. Adequacy and completeness are demonstrated by the fact that the taxonomy fully characterizes the methods, modalities, and automation levels reported by every included study, as evidenced by the mapping in Appendix A (Table A1 and Table A2). The categories are representative of all major issues within the scope of this review, and all relevant modalities identified during the initial corpus analysis are represented.

3.2. Bibliometric Overview

This subsection provides a bibliometric characterization of the 29 included studies, covering temporal evolution, publication landscape, geographic distribution, topic evolution across research eras, and thematic clustering to identify cross-cutting patterns and non-obvious relationships.

3.2.1. Temporal Distribution

Figure 2 presents the distribution of included studies by publication year. The corpus spans 2017–2025, with 69% of studies (20 out of 29) published between 2023 and 2025. Three distinct phases are discernible: a foundation phase (2017–2018) producing early surveys and static-analysis work; a maturation phase (2020–2022) developing runtime verification frameworks, tracing infrastructure, and ROS 2 architectural studies; and an ongoing AI integration phase (2023–2025) characterized by LLM-based tools, agentic frameworks, and natural-language interfaces.

3.2.2. Publication Type and Venue Distribution

Table 4 summarizes publication type and publisher. Conference papers constitute the largest category (11 studies, 38%), followed by journal articles (7, 24%) and workshop papers (5, 17%). Four arXiv preprints (14%) reflect rapid dissemination in the LLM–robotics integration space, where work often appears before formal peer review. IEEE is the dominant publisher (14 studies, 48%), followed by arXiv (4) and Springer (4). Two studies appear in ACM-published proceedings [5,8]; these were recovered via snowballing from IEEE-indexed papers. Key venues include IEEE Robotics and Automation Letters, the RoSE workshop series at ICSE, and several IEEE flagship conferences (IROS, ICSME, IEEE Aerospace Conference).

3.2.3. Topic Evolution Across Research Phases

Table 5 maps the dominant thematic categories to the three research phases. The foundation phase is entirely devoted to traditional FDD and static analysis. The maturation phase shifts towards runtime observability infrastructure, with no LLM work present. The AI integration phase concentrates on LLM-based and agentic applications while retaining some runtime monitoring activity, reflecting an additive rather than substitutive pattern of development. Notably, no study in any phase combines multimodal telemetry with agentic AI reasoning, confirming the central gap identified in this review.

3.2.4. Thematic Clusters and Cross-Cutting Patterns

A keyword-based thematic clustering analysis was performed. Studies were grouped based on shared methodological approaches, technical infrastructure, and application domains. Five distinct clusters emerged:

Static Analysis and Code Quality (3 studies: [4,11,12]). These studies focus on pre-deployment fault prevention through code mining, architectural analysis, and bug characterization. They share an emphasis on software engineering practices and produce artifacts (e.g., bug taxonomies, code patterns) that could serve as knowledge bases for downstream diagnostic systems.
Runtime Verification and Formal Monitoring (5 studies: [6,7,8,9,10]). This cluster comprises formal and configuration-driven monitoring frameworks that verify runtime properties. A cross-cutting pattern is the trade-off between specification rigor and usability—formal approaches offer precision but require expertise, while configuration-based tools improve accessibility at the cost of coverage.
Observability Infrastructure (4 studies: [20,21,31,32]). Studies in this cluster develop low-level tracing, network monitoring, and anomaly detection capabilities. They provide the telemetry foundation upon which higher-level diagnostic reasoning could operate, yet remain disconnected from AI-based interpretation.
LLM–ROS Integration and Task Execution (7 studies: [13,14,18,19,33,34,35]). This cluster encompasses agentic frameworks that expose ROS primitives to LLMs via tool use, MCP, or Reasoning and Acting (ReAct) patterns. A notable gap is that these systems primarily target task execution and natural language interaction rather than systematic fault detection.
Explainability and Human Understanding (5 studies: [15,16,17,30,36]). These studies focus on generating human-readable explanations of robot behavior, supporting HRI and operator trust. They demonstrate LLMs’ potential for fault explanation but operate post hoc on logs rather than in real-time diagnostic loops.

A critical cross-cluster observation is the absence of work bridging Clusters 1–3 (static analysis, runtime monitoring, observability infrastructure) with Clusters 4–5 (LLM integration, explainability). This confirms the central gap motivating this review: existing systems either provide rich telemetry without AI interpretation or offer LLM-based reasoning without access to comprehensive, multimodal observability data.

4. Fundamental Concepts, Tools and Background

This section introduces the key concepts and technologies that form the foundation of this research, providing essential background on ROS, observability principles, FDD, and AI concepts to enable automated, context-aware diagnostics.

4.1. Robot Operating System

ROS, developed in 2007 by Willow Garage, is a widely adopted open-source robotics framework for rapid prototyping and collaborative development. ROS 1 [3] relies on a centralized master node architecture, which simplifies research applications but suffers from a single point of failure and limited scalability, constraining commercial deployment. To address these limitations, ROS 2 [1] was released in 2017, replacing the centralized master with a fully distributed, peer-to-peer architecture based on the Data Distribution Service (DDS). This design improves flexibility and scalability, allowing processes to run across multiple nodes and cores. Communication in ROS 2 is configurable through Quality of Service (QoS) settings, which enable developers to control reliability, timing, and message delivery behavior. However, this distributed architecture also increases configuration complexity and introduces multiple potential failure points [2].

ROS 2 maintains and extends the core communication patterns from ROS 1. Topics provide anonymous publish-subscribe messaging, supporting many-to-many communication and enabling system introspection. Services implement a request-response pattern and are now organized under nodes for improved observability. Actions offer a goal-oriented asynchronous interface with request, response, periodic feedback, and cancellation capabilities.

4.2. Native ROS Debugging Tools and Commands

Native ROS distributions provide fundamental Command-Line Interfaces (CLIs) for live introspection and logging. The rosout logging system captures runtime messages, while CLI utilities such as rosnode and rostopic (ROS 1) and ros2 node and ros2 topic (ROS 2) allow developers to inspect node status, echo topic data, and monitor message frequencies. For system-wide health checks, roswtf (ROS 1) and ros2 doctor (ROS 2) provide automated diagnostics to identify environment misconfigurations or network issues. Additionally, rosparam (ROS 1) and ros2 param (ROS 2) enable observation and modification of parameters on a running system, facilitating tuning without requiring node restarts.
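Checking a message frequency by hand with a CLI can, in principle, be turned into an automated rule. The following minimal sketch does not use any ROS API: it assumes that message arrival timestamps (in seconds) are fed in from elsewhere, for example from a subscription callback, and the class name `RateMonitor`, the window size, and the threshold are all hypothetical.

```python
from collections import deque
from typing import Optional


class RateMonitor:
    """Estimate a message rate over a sliding window and compare it to a minimum."""

    def __init__(self, min_hz: float, window: int = 10) -> None:
        self.min_hz = min_hz
        self.stamps = deque(maxlen=window)  # most recent arrival times

    def on_message(self, stamp: float) -> None:
        self.stamps.append(stamp)

    def rate_hz(self) -> Optional[float]:
        if len(self.stamps) < 2:
            return None  # not enough data to estimate a rate
        span = self.stamps[-1] - self.stamps[0]
        if span <= 0:
            return None
        return (len(self.stamps) - 1) / span

    def is_healthy(self) -> bool:
        rate = self.rate_hz()
        return rate is not None and rate >= self.min_hz


monitor = RateMonitor(min_hz=5.0)
for i in range(10):
    monitor.on_message(i * 0.1)  # simulated arrivals roughly every 0.1 s
print(monitor.rate_hz(), monitor.is_healthy())
```

A fixed threshold of this kind is exactly the sort of predefined rule that still needs expert configuration, which is the limitation discussed in this subsection.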

Visualization frameworks extend these capabilities by providing graphical insights into complex data. RViz serves as the primary 3D environment for visualizing sensor streams, robot models, and coordinate transforms. In addition, the rqt framework offers modular Graphical User Interface (GUI) plugins, such as rqt_graph to map the computation graph and rqt_console for advanced log filtering.

For post-mortem analysis, Rosbag serves as a fundamental tool. It records time-stamped message streams for offline replay, allowing developers to reproduce specific failure states and evaluate algorithm performance against consistent datasets.

While native ROS tools are powerful for manual introspection, they are inherently limited for automated FDD. Designed with a human-in-the-loop paradigm, they require significant domain expertise to configure and interpret correctly, making them impractical for autonomous monitoring systems that must operate without continuous human supervision.

4.3. FDD

FDD is a systematic process used to identify when a system deviates from its nominal behavior (Detection) and to pinpoint the specific component or root cause responsible for the failure (Diagnosis). According to the taxonomy established by Khalastchi et al. [5], traditional FDD techniques are categorized into three primary classes. Data-Driven Approaches represent model-free methods that use statistical analysis and pattern recognition to detect anomalies in system data, suitable for identifying unknown faults but often limited in real-time efficiency. Model-Based Approaches exploit analytical redundancy by comparing measured and expected outputs derived from system models, enabling rapid and accurate detection when high-fidelity models are available. Knowledge-Based Approaches rely on expert reasoning, causal relationships, or rule-based inference to associate observed symptoms with known faults, often serving as hybrid solutions that combine data-driven and model-based reasoning.
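The model-based idea of analytical redundancy can be illustrated with a residual check: the measured output is compared with the output predicted by a model, and a fault is flagged when the difference exceeds a threshold. The sketch below is a toy example under stated assumptions; the linear model, the threshold, and the sample values are hypothetical and not taken from any reviewed study.

```python
from typing import Callable, List, Sequence


def detect_faults(
    inputs: Sequence[float],
    measured: Sequence[float],
    model: Callable[[float], float],
    threshold: float,
) -> List[int]:
    """Return the sample indices where |measured - expected| exceeds the threshold."""
    faulty = []
    for k, (u, y) in enumerate(zip(inputs, measured)):
        residual = y - model(u)  # deviation from the model's prediction
        if abs(residual) > threshold:
            faulty.append(k)
    return faulty


def wheel_model(command: float) -> float:
    """Hypothetical nominal model: wheel speed is half the commanded value."""
    return 0.5 * command


commands = [1.0, 2.0, 2.0, 2.0]
speeds = [0.52, 0.98, 0.40, 1.01]  # the third sample deviates strongly
print(detect_faults(commands, speeds, wheel_model, threshold=0.2))
```

Detection quality here depends entirely on the fidelity of the model and on the choice of threshold, which mirrors the trade-off noted for model-based approaches.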

4.4. Fault Taxonomy

Following ISO/IEC/IEEE 24765 [37], a fault is the manifestation of an error in a system artifact. When activated, it can lead to an incorrect internal state. In contrast, a failure is the externally observable inability of the system or component to perform a required function within specified limits. In this context, Khalastchi et al. [5] identify three main categories of robotic faults. Hardware faults affect physical components and instruction execution. Software faults impact aspects such as perception, planning, and behavior execution. Interaction-related faults arise from internal malfunctions or external environmental disturbances.

4.5. Observability

Observability is the ability to infer a system’s internal state and behavior from its external outputs. In distributed architectures such as ROS, observability must be a deliberate design requirement that enables continuous insight into system health, performance, and component interactions.

Observability patterns support monitoring, visualization, and alerting, thereby simplifying fault detection and troubleshooting. Common patterns include Health Check Application Programming Interfaces (APIs) for reporting service status, Log Aggregation for centralized log analysis and alerting, Distributed Tracing for tracking requests across services, Exception Tracking for reporting and de-duplicating runtime failures, and Application Metrics for exposing quantitative system measurements [38]. The ROS-RVFT instrumentation guidelines I1 (lifecycle APIs) and I2 (logging and filtering APIs), proposed by Caldas et al. [10] for ROS systems, align with these patterns, emphasizing runtime instrumentation to improve visibility and diagnostics for effective FDD in distributed systems.
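As an illustration of the Health Check pattern, the sketch below keeps the last heartbeat time of each component and reports a status based on staleness. It is a plain Python toy rather than a ROS or web API; the class name `HealthRegistry`, the component names, and the timeout are hypothetical.

```python
from typing import Dict


class HealthRegistry:
    """Track component heartbeats and report which components have gone stale."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, component: str, now: float) -> None:
        self.last_seen[component] = now

    def report(self, now: float) -> Dict[str, str]:
        # A component is "ok" if its last heartbeat is recent enough.
        return {
            name: "ok" if now - seen <= self.timeout_s else "stale"
            for name, seen in self.last_seen.items()
        }


registry = HealthRegistry(timeout_s=2.0)
registry.heartbeat("planner", now=10.0)
registry.heartbeat("lidar_driver", now=7.0)
print(registry.report(now=10.5))
```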

Modern observability is structured around three primary dimensions: logs, metrics, and traces. Logs provide timestamped event records that capture execution context at component level. Metrics offer aggregated numerical measurements reflecting system health and performance. Traces deliver end-to-end execution paths that reveal inter-component dependencies.
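The three dimensions can be illustrated with a small Python sketch. It is not tied to any particular telemetry library; the `LogRecord` and `Span` types, their fields, and the node names are hypothetical. It shows how a metric can be aggregated from spans, how a trace exposes the upstream path of a symptom, and how logs attach event-level context to a span.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogRecord:
    timestamp: float
    node: str
    message: str


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    node: str
    start: float
    end: float


def mean_latency(spans: List[Span]) -> Dict[str, float]:
    """Metric: mean span duration per node."""
    per_node: Dict[str, List[float]] = {}
    for span in spans:
        per_node.setdefault(span.node, []).append(span.end - span.start)
    return {node: mean(values) for node, values in per_node.items()}


def causal_path(spans: List[Span], span_id: str) -> List[str]:
    """Trace: node names from the root span down to `span_id`."""
    by_id = {span.span_id: span for span in spans}
    path: List[str] = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(current.node)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


def logs_during(logs: List[LogRecord], span: Span) -> List[LogRecord]:
    """Logs: records emitted by the span's node inside its time window."""
    return [r for r in logs
            if r.node == span.node and span.start <= r.timestamp <= span.end]


spans = [
    Span("a", None, "camera", 0.0, 1.0),
    Span("b", "a", "detector", 0.25, 0.75),
    Span("c", "b", "planner", 0.5, 0.75),
]
logs = [LogRecord(0.6, "planner", "no valid path"),
        LogRecord(2.0, "planner", "idle")]
```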

Achieving consistent observability across heterogeneous distributed systems remains challenging due to fragmented instrumentation approaches and incompatible data formats. OpenTelemetry addresses these issues by providing a unified, vendor-neutral framework for collecting and exporting telemetry data—logs, metrics, and traces—enabling interoperability and portability across different observability backends [24].

4.6. LLM and Agentic AI

LLMs are neural networks trained on extensive text corpora for natural language understanding, generation, and reasoning. Traditional LLMs respond passively to prompts using pre-trained knowledge. Agentic AI systems, in contrast, operate autonomously, perform reasoning, take actions, and interact with the environment. AI Agents typically follow a feedback loop of context gathering, action execution, verification, and iteration, enabling multi-step tasks across multiple tools and data sources.
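The feedback loop described above can be sketched in a few lines of Python. The sketch is schematic: `choose_action` stands in for the LLM, and the tool names (`list_nodes`, `node_status`) are hypothetical placeholders rather than functions of any real ROS or agent library.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, str, Any]  # (tool name, argument, observation)


def run_agent(
    choose_action: Callable[[List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], Any]],
    is_done: Callable[[Any], bool],
    max_steps: int = 5,
) -> List[Step]:
    history: List[Step] = []
    for _ in range(max_steps):                       # iteration, bounded
        tool_name, argument = choose_action(history)  # context gathering
        observation = tools[tool_name](argument)      # action execution
        history.append((tool_name, argument, observation))
        if is_done(observation):                      # verification
            break
    return history


def scripted_policy(history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the LLM: list nodes, then inspect the first one."""
    if not history:
        return ("list_nodes", "")
    first_node = history[0][2][0]
    return ("node_status", first_node)


tools = {
    "list_nodes": lambda _arg: ["/camera", "/planner"],
    "node_status": lambda node: f"{node}: ok",
}
trace = run_agent(scripted_policy, tools,
                  is_done=lambda obs: isinstance(obs, str))
```

Bounding the loop with `max_steps` is a deliberate choice in this sketch: an agent that cannot verify its goal should stop and report rather than iterate indefinitely.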

Tool Use (Function Calling) enables LLMs to invoke predefined external functions (e.g., database queries, system commands, or ROS services), thereby translating natural language commands into actions while separating high-level reasoning from low-level execution. RAG improves LLM outputs by retrieving relevant knowledge from external databases, where documents are segmented, embedded as vectors, and semantically searched at runtime to provide context-aware information. MCP is an open-source standard that allows AI applications to interface with external systems through a standardized client-server architecture, enabling access to data sources, tools, and workflows, supporting real-time information retrieval and task execution.
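The retrieval step of RAG (segment, embed, search) can be sketched as follows. This is a simplified illustration: a bag-of-words vector stands in for a learned embedding model, and the knowledge text is invented for the example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk(text: str, size: int = 12) -> List[str]:
    """Segment a document into windows of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Assumption: token counts replace a learned embedding vector."""
    return Counter(re.findall(r"[a-z0-9_/]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


knowledge = chunk(
    "lidar driver timeout usually means the usb cable is loose or unpowered "
    "costmap not updating usually means the tf tree is missing a transform"
)
best_score, best_chunk = retrieve("why does the lidar driver timeout", knowledge, k=1)[0]
```

In a full pipeline, the retrieved chunk would be placed in the prompt alongside the user's question so that the model answers with that context available.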

5. Related Work

This section reviews current approaches to FDD, the role of observability in diagnostic effectiveness, monitoring approaches, and the application of AI-driven techniques for system monitoring and troubleshooting in ROS-based systems.

5.1. FDD in Robotic Systems

Controlled user experiments by Song et al. [20] show that augmenting textual logs with execution traces can improve diagnosis success rates by up to 58.33%, with a further 8.33% gain when robot trajectory data are included. These findings demonstrate that context-aware information substantially enhances diagnostic performance. Specifically, enriching failure analysis with execution traces and robot trajectories helps human operators reason about faults emerging from complex software–hardware–environment interactions. In addition, this work demonstrates that distributed tracing tools, such as Zipkin [39], can be practically applied to collect execution traces in ROS-based systems, enabling fine-grained reasoning about failure scenarios.

Beyond observability insights, the same study reports that roughly half of robotic faults originate from software implementation, followed by system configuration, hardware usage, and environment-related factors. This underscores that effective robotic fault diagnosis must address multiple dimensions: code logic, configuration parameters, hardware interfaces, and assumptions about the physical environment, rather than focusing solely on component-level failures, which are commonly detected through logs. Additional evidence is provided by the ROBUST dataset [4], which analyzes bugs from seven major ROS projects. The dataset characterizes diverse fault types, including build and deployment issues, runtime configuration faults associated with ROS’s distributed architecture, and concurrency-related faults arising from unsynchronized node communication. A notable finding is that a substantial portion of bugs exhibit weak or unclear system-level symptoms, relying on human oracles for their detection. This underscores the need for techniques that can detect latent or silently degrading faults.

Khalastchi et al. [5] provide a comprehensive survey of FDD approaches for robotic systems, covering model-based, data-driven, and knowledge-based techniques and their applicability across different robot types and operating conditions. An important conclusion is that no single paradigm is sufficient to address the heterogeneity and uncertainty inherent in robotic domains, and hybrid approaches are necessary to improve robustness and coverage [5]. However, hybrid approaches introduce non-trivial trade-offs, including increased computational overhead and a continued reliance on human experts to specify models, properties, or domain knowledge, which can limit scalability and automation for FDD.

Taken together, these studies indicate that effective FDD in robotics must prioritize software and configuration faults, exploit heterogeneous data sources (logs, traces, trajectories, and other runtime telemetry) to improve diagnosis, and adopt hybrid strategies to balance accuracy, coverage, and runtime overhead [4,5,20]. At the same time, the reliance on expert knowledge, the computational cost of sophisticated hybrid methods, and the prevalence of faults with weak or external symptoms remain open challenges.

Addressing RQ1, current FDD approaches in ROS-based systems exhibit three primary automation limitations. First, formal and configuration-driven methods rely on expert-specified models or rules, limiting coverage to anticipated fault scenarios. Second, data-driven techniques require labeled datasets that are costly to obtain and difficult to generalize across robot platforms and tasks. Third, all paradigms ultimately depend on human oracles for setup and interpretation, particularly given the prevalence of bugs with weak system-level symptoms [4]. Hybrid strategies mitigate some of these limitations but introduce additional configuration complexity and computational overhead, perpetuating rather than eliminating human-in-the-loop dependency.

5.2. ROS-Based Monitoring Frameworks

Research in ROS-based FDD comprises several frameworks that vary in formality, automation, and diagnostic capability, spanning RV, tracing mechanisms, anomaly detection, and static analysis.

In the context of FDD, RV tools represent an online monitoring approach that relies on formal rules to define the system’s correct behavior. ROSMonitoring [6] introduced a modular approach for monitoring inter-node communication via formal specifications, a concept further refined in ROSMonitoring 2.0 [7] to include service-level monitoring and improved real-time scalability. Despite their rigor, these frameworks face three primary challenges: the high complexity of manual formal specification, non-trivial computational overhead in resource-constrained environments, and an inability to detect unknown faults not explicitly defined. To lower the barrier to entry, recent research has transitioned towards configuration-driven monitoring. ROMoSu (ROS Monitoring Support) [8,9] enables practitioners to define monitoring scenarios through adaptable configurations rather than rigid formal models. However, this shift toward usability introduces a fundamental trade-off: simplified configurations may lack the rigor to identify subtle logic violations that formal specifications would capture. Furthermore, both formal and configurable approaches remain limited by their reliance on human oracles, requiring practitioners to pre-configure failure scenarios and manually interpret results. This leaves a significant gap in the area of autonomous and adaptive fault diagnosis.
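As an illustration of what such a monitor checks, the sketch below implements one hand-written property in plain Python: every message on a trigger topic must be followed by a message on a response topic within a deadline. It does not use the ROSMonitoring or ROMoSu specification languages; the topic names and the 0.5 s deadline are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResponseWithinMonitor:
    trigger: str
    response: str
    deadline: float
    violations: List[float] = field(default_factory=list)
    _pending: Optional[float] = field(default=None, init=False)

    def observe(self, timestamp: float, topic: str) -> None:
        # A pending trigger whose deadline has already passed is a violation.
        if self._pending is not None and timestamp - self._pending > self.deadline:
            self.violations.append(self._pending)
            self._pending = None
        if topic == self.trigger and self._pending is None:
            self._pending = timestamp
        elif topic == self.response:
            self._pending = None


monitor = ResponseWithinMonitor(trigger="/cmd_vel", response="/odom", deadline=0.5)
events = [(0.0, "/cmd_vel"), (0.2, "/odom"), (1.0, "/cmd_vel"), (1.8, "/odom")]
for stamp, topic in events:
    monitor.observe(stamp, topic)
```

The example also makes the limitation discussed above visible: the monitor reports only violations of the property someone wrote down, so any fault that does not disturb this request-response pattern passes unnoticed.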

Caldas et al. [10] suggest guidelines for developing ROS-based systems that go beyond pre-defined formal specifications or human-configured monitoring scenarios. Some guidelines focus on architectural and execution-level aspects that improve system observability and testability. For example, they advocate for fine-grained instrumentation, including APIs for logging, lifecycle management, fault injection, and component isolation. Additionally, they emphasize automated testing with fault injection and record-and-replay techniques for field experiments. These guidelines reduce reliance on human supervision and make systems more robust, helping them handle unexpected faults through observability and field-based testing.

To enable remote monitoring without requiring local ROS installation or direct network access, Ivanov et al. [36] proposed a browser-based visualization solution using ROS and ReactJS. Despite improved accessibility, the tool remains primarily observational, lacking automated verification or fault detection and relying on human operators for interpretation.

To address the performance and scalability limitations of traditional logging, Bédard et al. introduced ros2_tracing [21]. This tool provides low-overhead, fine-grained tracing of ROS 2 execution events, enabling detailed analysis of timing behavior and distributed interactions. Such capabilities are particularly valuable for diagnosing issues related to concurrency and synchronization, as well as real-time violations. At the network layer, Rivera et al. proposed ROS-FM [31], a fast monitoring framework that leverages extended Berkeley Packet Filters (eBPF) and eXpress Data Path (XDP) to inline-monitor ROS traffic. ROS-FM achieves significantly lower overhead compared to conventional monitoring tools and supports the enforcement of security and QoS policies through a modular architecture with distributed visualization. Despite their strengths, neither ros2_tracing nor ROS-FM alone is sufficient for comprehensive FDD. Instead, they serve as complementary observability and enforcement mechanisms that can support, but not replace, a complete detection and diagnostic solution.

Regarding ML techniques, Kang et al. [32] propose an offline anomaly detection method for ROS 2 that applies ML to trace data captured via ros2_tracing [21]. Their approach focuses on callback execution response time and invocation frequency without requiring modifications to node code. A key advantage of this data-driven technique is its ability to detect previously unknown faults; however, detection is limited to anomalies that manifest as deviations in callback timing patterns relative to the learned normal behavior. Notably, their method is strictly offline, limiting its ability to detect faults during execution.
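The idea of detecting deviations from learned callback timing can be illustrated with a deliberately simple stand-in. The sketch below uses a z-score against a baseline of normal callback durations; this is not the model used by Kang et al. [32], and the durations and threshold are invented for the example.

```python
from statistics import mean, stdev
from typing import List, Tuple


def fit_baseline(durations_ms: List[float]) -> Tuple[float, float]:
    """Learn the mean and sample standard deviation of normal callback durations."""
    return mean(durations_ms), stdev(durations_ms)


def find_anomalies(
    durations_ms: List[float],
    baseline: Tuple[float, float],
    z_threshold: float = 3.0,
) -> List[int]:
    """Return indices of callbacks whose duration deviates strongly from the baseline."""
    mu, sigma = baseline
    if sigma == 0:
        # Degenerate baseline: any deviation at all counts as anomalous.
        return [i for i, d in enumerate(durations_ms) if d != mu]
    return [i for i, d in enumerate(durations_ms)
            if abs(d - mu) / sigma > z_threshold]


baseline = fit_baseline([10.0, 10.5, 9.5, 10.2, 9.8])   # recorded normal run
suspect = find_anomalies([10.1, 9.9, 14.0, 10.3], baseline)
```

Like the approach it illustrates, this detector needs no fault labels, but it can only see faults that change callback timing.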

Complementing runtime monitoring approaches, static code analysis tools enable fault prevention by identifying potential errors before deployment. Data mining patterns in ROS [11] leverage historical bug repositories and code patterns to identify fault-prone structures, allowing developers to address reliability issues during development rather than at runtime. Similarly, HAROS (High-Assurance ROS) [12] provides comprehensive static analysis capabilities for ROS systems, including code quality metrics, architectural pattern detection, and automated extraction of behavioral models from source code. These approaches are particularly valuable given that many faults originate in software and configuration [4,5,10,20]. The main cause is not necessarily the complexity of robotics itself, but rather the gap between available software engineering practices and their adoption in the robotics community [4]. Additionally, static analysis outputs represent valuable context for downstream analyses and, ultimately, serve as context for LLM-based diagnostic systems.

Regarding RQ2, existing monitoring and observability frameworks support FDD through complementary but fragmented capabilities. Runtime verification frameworks (ROSMonitoring [6], ROMoSu [9]) provide formal or configuration-based fault detection, while ros2_tracing [21] and ROS-FM [31] offer low-overhead instrumentation at the execution and network layers. Static analysis tools [11,12] contribute pre-deployment fault prevention. However, these tools operate in isolation: logs, traces, and metrics are rarely integrated into unified diagnostic pipelines, and current frameworks emphasize human-centric dashboards over programmatic access for automated reasoning systems.

5.3. LLM and Agentic-AI-Based Tools

The integration of LLMs into robotics has enabled significant advances across multiple domains, particularly in robot intelligence and HRI. The natural language processing capabilities of LLMs facilitate efficient communication and collaboration with robots. Key advances include improvements in robot control, perception, decision-making, and path planning [26,27]. Regarding FDD, a few established frameworks have been documented.

Several frameworks utilize the ReAct paradigm to map natural language to ROS primitives. ROSA [13], developed by NASA, abstracts ROS operations into tool-enabled functions via the LangChain framework, specifically targeting mission operations where non-experts must interact with complex ROS 1 and ROS 2 systems. Similarly, OperateLLM [34] uses the rclpy library to enable LLMs to dynamically generate and execute code, thereby facilitating the creation of nodes and publishers on the fly. ROS-LLM [35] further extends this by exposing ROS Actions and Services as atomic tools, supporting the execution of complex action sequences through behavior trees and state machines. Moving beyond single-agent architectures, RAI [33] introduces a multi-agent framework built on the M-Agent model. This approach treats sensors and actuators as agent-specific capabilities, allowing for more flexible, embodied reasoning within ROS 2 environments. This modularity is essential for scaling LLM integration to heterogeneous robot swarms.

The emergence of MCP represents a shift toward standardized data interpretation. Fu et al. [18] introduce an MCP server for analyzing ROS bag files, enabling natural-language-driven interaction with robotic datasets via LLMs and Vision Language Models (VLMs)—multimodal models that combine language understanding with visual processing to interpret camera feeds, sensor visualizations, and other image-based data. The system provides domain-specific tooling organized into three categories: core data access, domain-specific analysis, and visualization. Bagel [19] also employs this protocol to facilitate natural-language-based log analysis, delivering metadata summaries, system diagnostics, and high-level insights based on runtime robotic data.

LLM-integrated fault management in ROS spans from proactive real-time monitoring to reactive post hoc explanation. ROS Help Desk [14] offers the most comprehensive proactive fault-detection capabilities. The system continuously monitors the /rosout topic, parsing log messages to detect exceptions and errors. Upon detection, these issues are automatically forwarded to the AI agent for interpretation and resolution generation. Beyond log monitoring, it integrates real-time analysis of sensor data through a specialized diagnostic node that processes multimodal sensor streams to identify anomalies. Additionally, it utilizes RAG to perform LLM-driven code reviews and provide immediate debugging suggestions when anomalies are detected. A limitation is that it ignores the impact of traces and metrics on the diagnostics it generates, which Song et al. [20] showed can improve diagnostic accuracy.
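The log-monitoring step can be sketched without any ROS dependency. The function below is not the ROS Help Desk implementation; it is a hypothetical triage step that filters records by severity and de-duplicates them before they would be handed to an agent. The numeric levels follow the ROS 2 `rcl_interfaces/msg/Log` constants (INFO = 20, ERROR = 40, FATAL = 50); the record layout and node names are assumptions.

```python
from typing import Dict, List, Tuple

INFO, ERROR, FATAL = 20, 40, 50  # ROS 2 log severity constants

Record = Tuple[int, str, str]  # (level, node name, message)


def triage(records: List[Record], min_level: int = ERROR) -> List[Dict[str, object]]:
    """Keep records at or above `min_level` and merge repeated messages."""
    grouped: Dict[Tuple[str, str], int] = {}
    for level, node, message in records:
        if level < min_level:
            continue
        key = (node, message)
        grouped[key] = grouped.get(key, 0) + 1
    # One diagnosis request per distinct (node, message), in first-seen order.
    return [{"node": node, "message": message, "count": count}
            for (node, message), count in grouped.items()]


rosout = [
    (INFO, "planner", "tick"),
    (ERROR, "lidar_driver", "timeout"),
    (ERROR, "lidar_driver", "timeout"),
    (FATAL, "controller", "joint limit exceeded"),
]
requests = triage(rosout)
```

De-duplicating before invoking the model keeps a repeating error from generating one LLM call per log line.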

ROSA [13] also provides diagnostic capabilities through its natural language interface, allowing users to query system status and understand robot behavior. It includes tools for system diagnostics and monitoring. However, it is primarily reactive, requiring users to enter queries; consequently, its automated fault-detection capability is limited. Similarly, the MCP-based tools discussed above [18,19] offer valuable post hoc insights but operate purely on recorded data, preventing real-time fault detection.

González-Santamarta et al. [30] explored whether LLMs can interpret raw logs generated by autonomous robots to understand robot behavior, with no preprocessing or prompt engineering. Later, Sobrín-Hidalgo et al. [17] employed RAG to generate context-aware explanations to improve trust in HRI. Both studies demonstrate that LLMs can be used to explain robot behavior; therefore, they also have potential applications in fault diagnosis and failure explanation. In this context, Scheltinga and Pek [15] focused on failure explanation in autonomous robot navigation scenarios by fine-tuning LLMs, demonstrating that Parameter-Efficient Fine-Tuning (PEFT) can transform raw ROS logs into human-readable narratives. This work was later extended by Scheltinga [16] to include personalization, allowing the depth of explanations to be adjusted based on the user’s expertise level. While these methods are valuable for fault diagnosis, their capabilities for fault detection remain limited.

Although LLM-based frameworks show promise for fault diagnosis and decision support, several persistent limitations hinder their scalability and reliability in real-world applications. LLM-based frameworks demand substantial GPU memory and processing capacity, with large models exceeding typical on-board hardware limits and introducing latency in time-critical robotic tasks [13,14]. It should be noted, however, that computational constraints vary substantially depending on deployment context: cloud-hosted LLMs shift the primary bottleneck from local GPU memory to API latency and operational cost, whereas locally deployed models face strict on-board memory and inference-speed constraints. Lightweight models can substantially reduce these requirements, though their diagnostic reasoning quality under domain-specific robotic failure scenarios has not been systematically benchmarked in the current review. LLMs can generate confident but incorrect diagnoses, which is especially problematic in safety-critical settings and requires continuous human validation [13,34]. Agents exhibit inconsistent spatial and embodiment reasoning [33] and are highly sensitive to prompt phrasing, necessitating expert prompt engineering and domain-specific tuning [35]. Existing systems are often restricted to specific domains (e.g., navigation or mobile robotics) [16,18] and are primarily evaluated using artificial fault injection, limiting confidence in real-world generalization [14].

Furthermore, alignment with ROS 2 timing requirements was not considered in the current review and remains an open constraint for practical integration. Control loop deadlines in ROS 2 typically range from 1 to 100 ms depending on the application. Achieving such constraints with current LLM-based approaches is not yet feasible, as even optimised deployments exhibit inference latencies on the order of hundreds of milliseconds to seconds.
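A short worked example shows the scale of the mismatch. The specific values are assumptions chosen from within the ranges stated above: an inference latency of 500 ms and a control period of 10 ms. The number of control periods that elapse during a single inference call is then:

```latex
N = \left\lceil \frac{T_{\text{inference}}}{T_{\text{loop}}} \right\rceil
  = \left\lceil \frac{500\ \text{ms}}{10\ \text{ms}} \right\rceil
  = 50
```

With the same assumed latency, a 1 ms loop would complete 500 periods and a 100 ms loop would complete 5, so under these assumptions the LLM call cannot sit inside the control loop at any point in the stated range.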

These limitations reintroduce the burden of human effort that LLM integration aims to eliminate.

Table 6 provides a comparative summary of LLM-based frameworks for ROS, highlighting their primary purposes, underlying technologies, and FDD capabilities. Notably, ROS Help Desk is the only framework that demonstrates proactive fault detection, whereas most systems operate reactively or focus exclusively on post hoc explanation. This underscores the gap between LLM capabilities for natural language interaction and their application to autonomous, real-time fault management.

Regarding RQ3, LLMs show substantial promise for automating fault diagnosis and explanation. The reviewed literature demonstrates that LLMs can interpret raw ROS logs [30], generate human-readable failure explanations [15,16], and enable natural language interaction with robotic systems [13]. Proactive fault detection through continuous log and sensor monitoring coupled with LLM-based diagnosis is also possible [14]. However, within the reviewed corpus, current LLM implementations rely predominantly on single-modality inputs (logs or sensor streams), ignoring the diagnostic value of traces and metrics. Furthermore, most systems require human queries rather than continuously monitoring for autonomous detection.

6. Discussion

Section 5 examined several studies addressing FDD, monitoring, and LLM integration in ROS-based systems. It reveals a persistent gap between observability capabilities and autonomous diagnostic reasoning—a gap that Generative AI is only beginning to address.

While empirical evidence demonstrates that correlating logs with execution traces improves diagnosis accuracy [20], current ROS monitoring tools rarely integrate these modalities systematically. Tools like ros2_tracing [21] provide low-overhead tracing infrastructure, yet they have not been widely adopted in higher-level diagnostic frameworks. Best-practice guidelines advocate treating observability as an architectural concern [10], yet few frameworks enforce structured logging, lifecycle hooks, or fault injection APIs. Systems lacking such instrumentation inherently limit downstream FDD effectiveness, regardless of analytical sophistication.

RV frameworks (ROSMonitoring [6], ROMoSu [9]) provide precise fault detection through formal specifications but require substantial domain knowledge and manual configuration. Coverage is limited to predefined properties, leaving unknown faults undetected—particularly problematic given that many ROS bugs exhibit weak or delayed symptoms [4]. While tools like ROMoSu reduce the formal specification burden, they still require practitioners to anticipate fault scenarios, perpetuating reliance on human oracles rather than enabling autonomous adaptation.

Current LLM–ROS systems (ROSA [13], OperateLLM [34], ROS-LLM [35]) excel at natural language control and task decomposition, but show limited evidence regarding monitoring capabilities. ROSA [13], Bagel [19], ROSBag MCP [18], and Scheltinga [16] cannot proactively detect faults but can offer reactive diagnostic capabilities when queried. Only ROS Help Desk [14] attempts real-time monitoring, yet it operates solely on logs and sensor streams, ignoring traces and metrics. Fine-tuning approaches [15,16] successfully transform raw logs into personalized, human-readable explanations, demonstrating LLMs’ potential for reducing cognitive load. Similarly, recent studies [17,30] show that LLMs can infer robot behaviors directly from log data, highlighting their capability to interpret and contextualize autonomous actions. However, these systems inherit the limitations of single-modality inputs.

Among the 29 studies reviewed, no system was identified that combines real-time multimodal observability with agentic LLM reasoning for comprehensive root-cause analysis. Existing implementations rely exclusively on ROS-native data sources (logs, topics, and sensor streams) without leveraging external telemetry collection infrastructures. While LLM–ROS frameworks excel at natural language control and task decomposition, they face critical constraints. Computational requirements are significant [13], hallucination risks require human verification [34], and prompt sensitivity demands expert tuning [35]. These limitations reintroduce the burden of human effort that LLM integration aims to eliminate.

The included studies span both ROS 1 and ROS 2. Importantly, both versions produce usable logs and telemetry data, and the core observability gap identified in this review (the absence of integrated multimodal FDD) applies to both. The impact of ROS version differences is more visible in the specific tools and APIs available: ROS 2’s lifecycle management APIs, DDS-based distributed architecture, and QoS configuration offer richer instrumentation hooks than ROS 1’s centralized master model. The reviewed tools primarily target ROS 2, suggesting that future rich observability–AI integration efforts will operate predominantly in that ecosystem.

A consideration for deployment contexts involving human operators is that the timing of diagnostic outputs matters. In scenarios where operators interact with a robot in real-time, a delayed but thorough LLM-generated root-cause analysis may be less useful than a rapid, approximate explanation. The trade-off between diagnostic depth and response latency is an open design challenge for LLM-based FDD in HRI settings [40].

RAG, ReAct-style reasoning, and MCP are increasingly being adopted across ROS-based systems, each addressing a distinct aspect of automated FDD—enabling agents to retrieve external knowledge, reason over system state, and access live data in a standardized way. Together, these capabilities form the foundational building blocks for fully automated diagnostic pipelines.

Achieving autonomous FDD requires hybrid architectures that unite four complementary layers: QA tools for proactive fault prevention, identifying software-level vulnerabilities, architectural violations, and fault-prone code patterns before deployment; observability infrastructure embedding structured logging, instrumentation hooks, and diagnostic APIs directly into ROS application architectures, coupled with telemetry pipelines for standardized collection of logs, traces, and metrics; runtime monitoring for real-time fault detection and identification; and agentic AI reasoning through LLMs augmented with RAG and MCP for autonomous hypothesis generation, tool selection, orchestration, and diagnostic explanation, capable of synthesizing insights from static analysis reports, runtime data, and historical fault patterns.

The proposed integration preserves the distinct diagnostic roles of each telemetry modality. Logs provide event-level context for fault localization and explanation, traces enable causal reasoning across distributed ROS nodes by linking symptoms to upstream causes, metrics support continuous health assessment and anomaly detection through aggregated signals such as latency, message loss, and CPU usage, and sensor streams capture deviations in physical system behavior for runtime monitoring and downstream triggering.

To mitigate bottlenecks from high-frequency telemetry and synchronous retrieval mechanisms such as RAG, architectures should consider selective processing strategies, including asynchronous aggregation of telemetry streams and event-driven activation upon anomaly detection. This reduces latency and context overhead by ensuring that only relevant or pre-processed information is forwarded to an agentic reasoning layer, without compromising diagnostic fidelity. A conceptual model was developed to capture the interplay of these strategies, highlighting how they collectively enhance system efficiency while preserving timely and accurate analysis (see Figure 3).

6.1. Research Directions

The gaps identified in this review suggest several avenues for future research. First, there is a need to unify currently fragmented telemetry sources—logs, traces, and metrics—into coherent observability pipelines that can support both human operators and automated reasoning systems. Second, programmatic interfaces that expose ROS system state to AI agents could reduce the brittleness of current shell-based or log-only approaches. Third, combining LLM-based reasoning with retrieval mechanisms over static analysis outputs and historical fault data may enable more context-aware diagnosis. Finally, shifting from reactive to proactive monitoring architectures, where fault detection triggers automated diagnostic workflows, represents a promising direction toward reducing human effort in fault management.

Unifying these telemetry sources requires understanding the diagnostic role of each modality. Logs provide event-level context to support fault localisation and explanation. Traces capture causal relationships across distributed nodes, enabling attribution of downstream symptoms to upstream root causes. Metrics offer aggregated time-series signals (e.g., latency, drop rates, CPU usage) for continuous health monitoring and threshold-based anomaly detection. Sensor streams feed runtime monitoring layers that detect deviations in physical behaviour and trigger anomalies.

From a design perspective, these directions motivate tighter coupling between observability pipelines and agentic reasoning components. An explicit interface contract should standardise telemetry-to-context translation to reduce brittleness arising from ad-hoc integrations. In addition, diagnostic workflows should be event-driven, decoupling continuous monitoring from reasoning by activating LLM-based agents only upon anomaly detection, thereby avoiding unnecessary inference overhead. Within this setting, selective tracing emerges as a key optimisation, where high-overhead distributed tracing is enabled only after anomaly detection to balance diagnostic depth with runtime efficiency in production deployments.
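The event-driven decoupling described above can be sketched as follows. The class, its field names, and the latency threshold are hypothetical; `diagnose` stands in for the LLM-based agent. Telemetry is buffered continuously at low cost, and the reasoning component is invoked only when a lightweight detector fires, receiving a bounded context window rather than the full stream.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional

Sample = Dict[str, float]
Context = Dict[str, object]


class EventDrivenDiagnoser:
    def __init__(
        self,
        is_anomalous: Callable[[Sample], bool],
        diagnose: Callable[[Context], str],
        window: int = 5,
    ) -> None:
        self._buffer: Deque[Sample] = deque(maxlen=window)  # recent telemetry only
        self._is_anomalous = is_anomalous
        self._diagnose = diagnose
        self.invocations = 0

    def ingest(self, sample: Sample) -> Optional[str]:
        self._buffer.append(sample)            # cheap, runs on every sample
        if not self._is_anomalous(sample):
            return None                        # no inference cost in the nominal case
        self.invocations += 1
        context: Context = {"trigger": sample, "recent": list(self._buffer)}
        return self._diagnose(context)         # expensive, runs only on anomalies


def stub_agent(context: Context) -> str:
    """Stand-in for the agentic reasoning layer."""
    recent: List[Sample] = context["recent"]  # type: ignore[assignment]
    return f"diagnosing with {len(recent)} recent samples"


diagnoser = EventDrivenDiagnoser(
    is_anomalous=lambda s: s["latency_ms"] > 50.0,  # assumed threshold
    diagnose=stub_agent,
    window=3,
)
results = [diagnoser.ingest({"latency_ms": v})
           for v in (10.0, 12.0, 11.0, 90.0, 10.0)]
```

The same trigger point is where selective tracing could be switched on, so that high-overhead instrumentation is paid for only after an anomaly has been observed.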

6.2. Threats to Validity

This section outlines the main threats to the validity of this study and the measures taken to mitigate them:

Construct validity: The three Boolean query groups may not capture all relevant terminology in the rapidly evolving fields of agentic AI and LLM-based robotics, where consistent terminology has not yet stabilised. Three complementary strategies—snowballing, official documentation review, and Consensus-assisted exploration—were employed to mitigate this risk. Query execution dates, exact filter configurations, and per-database result counts are documented in Appendix A to support replication.
Internal validity: Title screening and full-text assessment were performed using the predefined ICs and ECs. Ambiguous cases were discussed with the co-authors until consensus was reached, but inter-rater reliability was not formally quantified. This introduces a risk of subjective inclusion decisions, particularly for studies at the boundary of EC3 (domain-specific applications).
External validity: IEEE Xplore and SpringerLink were selected as primary sources because they index the dominant venues for ROS-related research, including IEEE Robotics and Automation Letters, IROS, ICRA, the RoSE workshop series, Springer TAROS, and LNCS volumes. Studies published in venues outside IEEE Xplore and SpringerLink that were not reachable through snowballing may have been missed. The 11 papers added via snowballing and the 2 identified through Consensus suggest that the supplementary strategies partially compensated for this gap.
All search queries, applied filters, and screening criteria are documented within this paper. The full list of included studies is provided in Appendix A, enabling third-party verification of inclusion decisions. Consensus introduces a non-reproducible exploratory element; however, the two papers it surfaced [17,30] were subsequently validated against the predefined ICs and ECs before admission to the corpus, and are clearly identified as Consensus-sourced in the methodology.
Timeliness: The database searches were executed in December 2025, representing a fixed snapshot of the literature at that point. The LLM–robotics integration field evolves rapidly, and studies published after December 2025 are outside this review’s scope. The rapid pace of development also means that some tools referenced may have evolved since inclusion.
Scope: This study focuses on software-level architectural and observational aspects of FDD in ROS-based systems. Hardware–software interaction fault modes and embodied mechanical intelligence approaches are explicitly outside the scope of this review and are identified as complementary directions for future work. Furthermore, inference latency under ROS 2 timing constraints is not systematically reported in the reviewed literature, revealing an open empirical gap that warrants dedicated benchmarking in future studies.

7. Conclusions

This paper examines the state of the art in automatic FDD for ROS-based robotic systems, with particular attention to emerging applications of Generative AI and LLMs. The review was guided by four RQs addressing automation limitations, observability frameworks, LLM integration, and gaps in AI-driven diagnosis.

For RQ1 (FDD limitations), current FDD approaches face significant automation barriers. RV frameworks (ROSMonitoring [6], ROMoSu [9]) require manual specification of formal properties or configuration scenarios, limiting coverage to predefined faults. Model-based and data-driven methods rely on fine-grained models or labeled datasets, which are costly to obtain [5]. Critically, all of these approaches rely heavily on expert knowledge for setup and interpretation, reinforcing human-in-the-loop dependency rather than achieving autonomous fault management. The ROBUST dataset [4] reveals that many ROS bugs exhibit weak symptoms that require human oracles for detection, which further highlights the automation challenge.

On the topic of monitoring and observability support (RQ2), existing monitoring frameworks provide valuable but fragmented support for FDD. Tools like ros2_tracing [21] enable low-overhead execution tracing, ROS-FM [31] provides network-level monitoring, and native ROS logging captures runtime events. However, these tools operate in isolation: logs, traces, and metrics are rarely integrated systematically. Song et al. [20] demonstrate that correlating logs with execution traces improves diagnosis accuracy, yet among the 29 studies reviewed, no framework was identified that systematically combines these modalities for automated diagnostic pipelines. Current open-source observability infrastructure emphasizes human-centric dashboards rather than programmatic access for autonomous diagnostic systems.
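As a rough illustration of what programmatic correlation of two modalities could look like, the sketch below attaches to each log record the trace events emitted by the same node within a short time window. It is not taken from Song et al. [20] or from any reviewed tool: the record types, field names, and the 50 ms window are assumptions made for illustration, and both streams are assumed to share one clock.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:          # hypothetical log schema
    stamp: float          # seconds, same clock as the trace events (assumption)
    node: str
    level: str
    message: str


@dataclass(frozen=True)
class TraceEvent:         # hypothetical trace schema
    stamp: float
    node: str
    name: str             # e.g. a callback start or end marker


def correlate(logs, traces, window=0.05):
    """Pair each log record with same-node trace events within +/- window seconds."""
    per_node = {}
    for ev in sorted(traces, key=lambda e: e.stamp):
        stamps, events = per_node.setdefault(ev.node, ([], []))
        stamps.append(ev.stamp)
        events.append(ev)

    pairs = []
    for log in logs:
        stamps, events = per_node.get(log.node, ([], []))
        lo = bisect_left(stamps, log.stamp - window)
        hi = bisect_right(stamps, log.stamp + window)
        pairs.append((log, events[lo:hi]))
    return pairs


logs = [LogRecord(1.00, "planner", "ERROR", "goal aborted")]
traces = [
    TraceEvent(0.97, "planner", "callback_start"),
    TraceEvent(1.02, "planner", "callback_end"),
    TraceEvent(1.01, "controller", "callback_start"),
    TraceEvent(2.00, "planner", "callback_start"),
]
for log, events in correlate(logs, traces):
    print(log.message, [e.name for e in events])
```

Returning plain pairs rather than rendering a dashboard keeps the result directly consumable by a downstream diagnostic component, which is the kind of programmatic access the paragraph above finds lacking.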

Regarding RQ3, LLMs show substantial promise for automating fault diagnosis and explanation. Recent work demonstrates that LLMs can interpret raw ROS logs [30], generate human-readable failure explanations [15,16], and enable natural language interaction with robotic systems [13]. ROS Help Desk [14] provides an example of proactive fault detection through continuous log and sensor monitoring coupled with LLM-based diagnosis. However, within the reviewed corpus, current LLM implementations rely predominantly on single-modality inputs (logs or sensor streams), ignoring the diagnostic value of traces and metrics. Furthermore, most systems operate reactively—requiring human queries—rather than continuously monitoring for autonomous detection.

As for RQ4, among the 29 studies reviewed, no system integrates multi-modal observability (logs, traces, metrics, sensor streams) with real-time LLM-based reasoning for autonomous root-cause analysis. Current systems rely on predefined rules rather than adaptive reasoning, generate post hoc explanations from isolated data sources, react to failures instead of proactively monitoring, and require substantial human expertise.

As robotic systems continue to grow in complexity, the development of intelligent, autonomous fault-management capabilities becomes critical. Future work should prioritize the development of integrated, end-to-end FDD systems that combine QA tools to prevent software-level faults before deployment, comprehensive runtime telemetry collection, and agentic LLM-based reasoning with rich contextual access. Such systems would enable robots not only to detect and diagnose their own failures—correlating runtime symptoms with code-level or physical root causes—but also to explain them in human-understandable terms and, ultimately, to autonomously recover from failures. In line with this direction, the authors are actively developing a system that implements the vision described above.

Author Contributions

M.C.: data curation, formal analysis, methodology, writing—original draft preparation, writing—review and editing, validation; R.A.: conceptualization, methodology, writing—review and editing, validation, funding acquisition; A.S.: conceptualization, methodology, writing—review and editing, validation. All authors have read and agreed to the published version of the manuscript.

Funding

This project has received funding from the European Union’s Horizon Europe research and innovation program under Grant Agreement 101120406.

Data Availability Statement

Not applicable.

Acknowledgments

During the preparation of this work, the authors used Claude Sonnet (https://www.coderabbit.ai) to improve readability. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. List and Categorization of Included Studies

Table A1 and Table A2 present the complete list of the 29 studies included in this review. Table A1 provides publication metadata, while Table A2 provides summaries and thematic categorization. Two additional background references on ROS architecture [1,3] are cited throughout the paper but were not identified through the search; they are not counted among the 29 included studies.

Search Execution Details: Database searches were executed in December 2025. Query 1 (FDD) returned 216 records (90 from IEEE Xplore, 126 from SpringerLink). Query 2 (Monitoring and Debugging) returned 544 records (215 from IEEE Xplore, 329 from SpringerLink). Query 3 (LLM and AI Integration) returned 215 records (100 from IEEE Xplore, 115 from SpringerLink). Total records from database searches: 975. Snowballing identified 11 additional papers. Official ROS documentation review contributed 2 papers. Consensus exploratory search contributed 2 papers. Grand total before screening: 990 records. Deduplication identified 20 duplicate records: 13 cross-query duplicates within SpringerLink (9 between Q1–Q2, 4 between Q2–Q3), 7 cross-query duplicates within IEEE Xplore (6 between Q1–Q2, 1 between Q2–Q3), and 0 cross-library duplicates. After deduplication, 970 unique database records remained. Combined with the 15 additional records identified through snowballing, official documentation review, and Consensus, 985 records underwent title-based screening.

Screening Summary: Title-based screening reduced 985 records to 36. Full-text assessment against ICs and ECs excluded 7 studies (4 violated EC1 by lacking direct ROS relevance; 2 violated EC2 by providing insufficient technical detail; 1 violated EC3 as a domain-specific solution). Five ambiguous cases at the EC3 boundary were resolved through co-author discussion. Final corpus: 29 studies.

Protocol Deviations: None. The review followed the pre-specified protocol without modification.

Category sub-tags: FDD (Data-Driven/Model-Based/Knowledge-Based/Hybrid); Monitoring (Runtime Verification/Configuration-Based/Network-Level/Anomaly Detection); Observability (Logging/Tracing/Metrics/Sensor Streams); LLM/Generative AI (Log Interpretation/Code Review/Fault Explanation/Task Execution); RAG/MCP (RAG/MCP/Fine-Tuning); Agentic AI (ReAct/Multi-Agent/Tool Use/Behavior Trees); Static Analysis/QA (Code Quality/Bug Mining/Architectural Analysis); Bugs/Fault Taxonomy (Bug Dataset/Fault Classification); ROS Architecture (ROS 1/ROS 2/DDS/Node Composition); HRI/Explainability (Failure Explanation/Natural Language Interface/Personalization). In Table A1 and Table A2, “Ag.” in the ROS column denotes version-agnostic studies that address general robotic systems without targeting a specific ROS version.

Table A1. Publication metadata of the 29 included studies with quality scores and citation counts (April 2026). Type: J = Journal, C = Conference, W = Workshop, P = Preprint, T. = Thesis, Tool = Software/Tool. Ag. = version-agnostic. Cit. = citation count (N/A: no DOI or S2 entry). QS = Quality Score; QR = Quality Rating (H/M/L) per criteria in Table 1 and Table 2.

Ref.	Title	Authors	Year	Venue	Type	Country	DOI	Cit.	QS	QR
[11]	Mining the Usage Patterns of ROS Primitives	Santos, A. et al.	2017	IROS	C	Portugal	https://doi.org/10.1109/IROS.2017.8206237	33	7	H
[5]	On Fault Detection and Diagnosis in Robotic Systems	Khalastchi, E., Kalech, M.	2018	ACM Comput. Surv.	J	Israel	https://doi.org/10.1145/3146389	215	9	H
[6]	ROSMonitoring: A Runtime Verification Framework for ROS	Ferrando, A. et al.	2020	TAROS (LNCS)	C	UK	https://doi.org/10.1007/978-3-030-63486-5_40	87	8	H
[31]	ROS-FM: Fast Monitoring for the Robotic Operating System	Rivera, S. et al.	2020	ICECCS	C	Luxembourg	https://doi.org/10.1109/ICECCS51672.2020.00029	25	6	M
[36]	Online Monitoring and Visualization with ROS and ReactJS	Ivanov, A. et al.	2021	SIBCON	C	Russia	https://doi.org/10.1109/SIBCON50419.2021.9438890	10	5	M
[12]	The High-Assurance ROS Framework	Santos, A. et al.	2021	RoSE @ ICSE	W	Portugal	https://doi.org/10.1109/RoSE52553.2021.00013	25	6	M
[8]	Towards Flexible Runtime Monitoring Support for ROS-based Applications	Stadler, M. et al.	2022	RoSE @ ICSE	W	Austria	https://doi.org/10.1145/3526071.3527515	6	6	M
[21]	ros2_tracing: Multipurpose Low-Overhead Framework for Real-Time Tracing of ROS 2	Bédard, C. et al.	2022	IEEE RA-L	J	Canada	https://doi.org/10.1109/LRA.2022.3174346	77	10	H
[20]	An Empirical Study on Fault Diagnosis in Robotic Systems	Song, X. et al.	2023	ICSME	C	China	https://doi.org/10.1109/ICSME58846.2023.00030	3	7	H
[9]	ROMoSu: Flexible Runtime Monitoring Support for ROS-based Applications	Stadler, M., Vierhauser, M.	2023	RoSE @ ICSE	W	Austria	https://doi.org/10.1109/RoSE59155.2023.00013	2	6	M
[26]	Large Language Models for Robotics: A Survey	Zeng, F. et al.	2023	arXiv	P	China	https://doi.org/10.48550/arXiv.2311.07226	291	6	M
[30]	Using Large Language Models for Interpreting Autonomous Robots Behaviors	González-Santamarta, M. Á. et al.	2023	HAIS (LNCS)	C	Spain	https://doi.org/10.1007/978-3-031-40725-3_45	19	7	H
[2]	Impact of ROS 2 Node Composition in Robotic Systems	Macenski, S. et al.	2023	IEEE RA-L	J	USA	https://doi.org/10.1109/LRA.2023.3279614	109	10	H
[4]	ROBUST: 221 Bugs in the Robot Operating System	Timperley, C. S. et al.	2024	Empir. Softw. Eng.	J	USA	https://doi.org/10.1007/s10664-024-10440-0	11	8	H
[7]	ROSMonitoring 2.0: Extending ROS Runtime Verification to Services and Ordered Topics	Saadat, M. G. et al.	2024	FMAS (EPTCS)	W	UK	https://doi.org/10.4204/EPTCS.411.3	4	5	M
[10]	Runtime Verification and Field-based Testing for ROS-based Robotic Systems	Caldas, R. et al.	2024	IEEE TSE	J	Sweden	https://doi.org/10.1109/TSE.2024.3444697	35	10	H
[28]	Advances in Large Language Models for Robotics	Qi, Z., Jing, X.	2024	ICMRA	C	China	https://doi.org/10.1109/ICMRA62519.2024.10809099	4	6	M
[15]	Explaining Robot Failures in ROS using Parameter-Efficient Fine-Tuning	Scheltinga, E., Pek, C.	2024	RSS Workshop	W	The Netherlands	N/A	N/A	5	M
[16]	Personalising Explanations for Robot Failures in ROS using PEFT	Scheltinga, E. M.	2024	TU Delft	T.	The Netherlands	N/A	N/A	3	L
[17]	Explaining Autonomy: Enhancing HRI through Explanation Generation with LLMs	Sobrín-Hidalgo, D. et al.	2024	arXiv	P	Spain	https://doi.org/10.48550/arXiv.2402.04206	28	6	M
[34]	OperateLLM: Integrating ROS Tools in Large Language Models	Raja, A., Bhethanabotla, A.	2024	ICoCET	C	USA	https://doi.org/10.1109/ICoCET63343.2024.10730448	4	6	M
[19]	Bagel	Extelligence-ai	2024	GitHub	Tool	–	N/A	N/A	3	L
[35]	ROS-LLM: A ROS Framework for Embodied AI with Task Feedback	Mower, C. E. et al.	2024	arXiv	P	UK	https://doi.org/10.48550/arXiv.2406.19741	54	6	M
[27]	Large Language Models for Robotics: Opportunities, Challenges, and Perspectives	Wang, J. et al.	2025	J. Autom. Intell.	J	China	https://doi.org/10.1016/j.jai.2024.12.003	381	9	H
[13]	Enabling Novel Mission Operations and Interactions with ROSA	Royce, R. et al.	2025	IEEE Aerosp. Conf.	C	USA	https://doi.org/10.1109/AERO63441.2025.11068426	23	9	H
[14]	ROS Help Desk: GenAI Powered Framework for ROS Error Diagnosis	Katuwandeniya, K. et al.	2025	arXiv	P	Australia	https://doi.org/10.48550/arXiv.2507.07846	N/A	3	L
[33]	RAI: Flexible Agent Framework for Embodied AI	Rachwał, K. et al.	2025	PAAMS	C	Poland	https://doi.org/10.1007/978-3-032-05925-3_16	6	7	H
[18]	ROSBag MCP Server: Analyzing Robot Data with LLMs	Fu, L. et al.	2025	RoboticCC	C	Italy	https://doi.org/10.1109/RoboticCC68732.2025.00025	1	6	M
[32]	Watch Your Callback: Offline Anomaly Detection Using ML in ROS 2	Kang, J. et al.	2025	IEEE Access	J	S. Korea	https://doi.org/10.1109/ACCESS.2025.3556864	6	7	H

Table A2. Summary and thematic categorization of the 29 included studies.

Ref.	Summary	Keywords	ROS	Categories & Sub-Tags
[11]	Data mining of ROS code repositories to identify fault-prone patterns.	ROS, static analysis, code patterns, fault prevention, software quality	1	Static Analysis/QA: Code Quality, Bug Mining
[5]	Comprehensive survey of FDD approaches across robotic system types.	fault detection, fault diagnosis, robotic systems, survey, model-based, data-driven	Ag.	FDD: Data-Driven, Model-Based, Knowledge-Based, Hybrid
[6]	Modular RV framework for monitoring inter-node communication via formal specifications.	ROS, runtime verification, formal specification, monitoring, safety-critical	1	Monitoring: Runtime Verification
[31]	Low-overhead network-level monitoring using eBPF and XDP for ROS traffic.	ROS, monitoring, eBPF, XDP, network security	1	Monitoring: Network-Level; Observability: Metrics
[36]	Browser-based remote visualization solution for ROS system monitoring.	ROS, online monitoring, visualization, ReactJS	1	Monitoring: Configuration-Based
[12]	Static analysis framework for code quality and architectural pattern detection.	ROS, static analysis, HAROS, code quality, architectural analysis	Both	Static Analysis/QA: Code Quality, Architectural Analysis
[8]	Configuration-driven runtime monitoring for ROS-based applications.	ROS, runtime monitoring, configuration, flexibility	Both	Monitoring: Runtime Verification, Configuration-Based
[21]	Low-overhead tracing framework for ROS 2 execution events and timing analysis.	ROS 2, tracing, LTTng, instrumentation, real-time, performance evaluation	2	Observability: Tracing
[20]	Empirical study showing traces and trajectories improve fault diagnosis accuracy.	ROS, fault diagnosis, tracing, empirical study	1	FDD: Hybrid; Observability: Logging, Tracing
[9]	Flexible configuration-driven monitoring replacing rigid formal specifications.	ROS, monitoring, configuration, runtime	Both	Monitoring: Runtime Verification, Configuration-Based
[26]	Survey of LLM applications in robot control, perception, and planning.	LLM, robotics, survey, planning, control	Ag.	LLM/Generative AI: Task Execution
[30]	Evaluates LLMs interpreting raw ROS 2 logs without prompt engineering.	LLM, ROS 2, log interpretation, autonomous robots, explainability	2	LLM/Gen. AI: Log Interpretation; HRI/Expl.: Failure Expl.
[2]	Benchmarks ROS 2 component node composition for performance optimization.	ROS 2, node composition, performance, benchmarking	2	ROS Architecture: ROS 2, Node Composition
[4]	Dataset characterizing 221 bugs from seven major ROS projects.	robotics, software bugs, dataset, Robot Operating System	Both	Bugs/Fault Taxonomy: Bug Dataset, Fault Classification
[7]	Extends ROSMonitoring with service monitoring and message ordering.	ROS, runtime verification, services, message ordering	Both	Monitoring: Runtime Verification
[10]	Guidelines for ROS observability, instrumentation, and field-based testing.	ROS 2, runtime verification, field testing, instrumentation, observability	2	Monitoring: Runtime Verif.; Observability: Logging, Tracing
[28]	Survey of recent LLM advances for robotic applications.	LLM, robotics, survey, advances	Ag.	LLM/Generative AI: Task Execution
[15]	PEFT transforms raw ROS logs into human-readable failure narratives.	ROS, LLM, PEFT, LoRA, failure explanation	2	LLM/Gen. AI: Fault Expl.; RAG/MCP: Fine-Tuning; HRI/Expl.: Failure Expl.
[16]	Extends failure explanations with personalization based on user expertise.	ROS, LLM, PEFT, personalization, explainability	2	LLM/Gen. AI: Fault Expl.; RAG/MCP: Fine-Tuning; HRI/Expl.: Personalization
[17]	Uses RAG with LLMs for context-aware robot behavior explanations.	robotics, HRI, explainability, LLM, RAG, autonomous robots	2	LLM/Gen. AI: Log Interp.; RAG/MCP: RAG; HRI/Expl.: Failure Expl.
[34]	Enables LLMs to dynamically generate and execute ROS nodes via rclpy.	ROS, LLM, ReAct, code generation, rclpy	2	Agentic AI: ReAct, Tool Use; LLM/Gen. AI: Task Execution
[19]	MCP-based tool for natural-language log analysis of ROS bag data.	ROS, MCP, bag files, LLM, diagnostics	2	LLM/Gen. AI: Log Interp.; RAG/MCP: MCP; Observability: Logging
[35]	Exposes ROS Actions and Services as LLM tools with behavior trees.	ROS, LLM, embodied AI, behavior trees, state machines, task feedback	2	Agentic AI: Tool Use, Behavior Trees; LLM/Gen. AI: Task Execution
[27]	Survey of LLM opportunities and challenges in robotic systems.	LLM, robotics, survey, challenges, opportunities	Ag.	LLM/Generative AI: Task Execution
[13]	NASA’s natural language interface for ROS via ReAct and LangChain.	ROS, ROSA, LLM, ReAct, LangChain, NASA	Both	Agentic AI: ReAct, Tool Use; HRI/Expl.: NL Interface
[14]	Proactive error detection via continuous log/sensor monitoring with LLM diagnosis.	ROS, LLM, RAG, error diagnosis, debugging, explainability, GenAI	2	FDD: Knowledge-Based; LLM/Gen. AI: Fault Expl., Code Review; RAG/MCP: RAG; Agentic AI: ReAct
[33]	Multi-agent framework treating sensors and actuators as agent capabilities.	ROS 2, multi-agent, embodied AI, RAG	2	Agentic AI: Multi-Agent; LLM/Gen. AI: Task Execution; RAG/MCP: RAG
[18]	MCP server enabling natural-language interaction with ROS bag files.	ROS, MCP, LLM, VLM, rosbag, agentic AI	2	RAG/MCP: MCP; LLM/Gen. AI: Log Interp.; Agentic AI: Tool Use
[32]	ML-based offline anomaly detection from ROS 2 callback trace data.	ROS 2, anomaly detection, unsupervised learning, callbacks, fault injection, tracing	2	FDD: Data-Driven; Monitoring: Anomaly Detection; Observability: Tracing

References

[1] Macenski, S.; Foote, T.; Gerkey, B.; Lalancette, C.; Woodall, W. Robot operating system 2: Design, architecture, and uses in the wild. Sci. Robot. 2022, 7, eabm6074.
[2] Macenski, S.; Soragna, A.; Carroll, M.; Ge, Z. Impact of ROS 2 node composition in robotic systems. IEEE Robot. Autom. Lett. 2023, 8, 3996–4003.
[3] Quigley, M.; Conley, K.; Gerkey, B.; Faust, J.; Foote, T.; Leibs, J.; Ng, A.; Wheeler, R. ROS: An open-source Robot Operating System. In Proceedings of the IEEE International Conference on Robotics and Automation Workshop on Open Source Software, Kobe, Japan, 12–17 May 2009.
[4] Timperley, C.S.; van der Hoorn, G.; Santos, A.; Deshpande, H.; Wąsowski, A. ROBUST: 221 bugs in the Robot Operating System. Empir. Softw. Eng. 2024, 29, 57.
[5] Khalastchi, E.; Kalech, M. On fault detection and diagnosis in robotic systems. ACM Comput. Surv. 2018, 51, 1–24.
[6] Ferrando, A.; Cardoso, R.C.; Fisher, M.; Ancona, D.; Franceschini, L.; Mascardi, V. ROSMonitoring: A runtime verification framework for ROS. In Proceedings of the Annual Conference Towards Autonomous Robotic Systems; Springer International Publishing: Cham, Switzerland, 2020; pp. 387–399.
[7] Saadat, M.G.; Ferrando, A.; Dennis, L.A.; Fisher, M. ROSMonitoring 2.0: Extending ROS runtime verification to services and ordered topics. Electron. Proc. Theor. Comput. Sci. (EPTCS) 2024, 411, 17–31.
[8] Stadler, M.; Vierhauser, M.; Cleland-Huang, J. Towards flexible runtime monitoring support for ROS-based applications. In Proceedings of the 4th International Workshop on Robotics Software Engineering (RoSE), Pittsburgh, PA, USA, 9 May 2022; pp. 43–46.
[9] Stadler, M.; Vierhauser, M. ROMoSu: Flexible runtime monitoring support for ROS-based applications. In Proceedings of the IEEE/ACM 5th International Workshop on Robotics Software Engineering (RoSE), Melbourne, Australia, 15 May 2023; pp. 53–60.
[10] Caldas, R.; García, J.A.P.; Schiopu, M.; Pelliccione, P.; Rodrigues, G.; Berger, T. Runtime verification and field-based testing for ROS-based robotic systems. IEEE Trans. Softw. Eng. 2024, 50, 2544–2567.
[11] Santos, A.; Cunha, A.; Macedo, N.; Arrais, R.; Dos Santos, F.N. Mining the usage patterns of ROS primitives. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 3855–3860.
[12] Santos, A.; Cunha, A.; Macedo, N. The high-assurance ROS framework. In Proceedings of the 2021 IEEE/ACM 3rd International Workshop on Robotics Software Engineering (RoSE), Madrid, Spain, 2 June 2021; pp. 37–40.
[13] Royce, R.; Kaufmann, M.; Becktor, J.; Moon, S.; Carpenter, K.; Pak, K.; Towler, A.; Thakker, R.; Khattak, S. Enabling novel mission operations and interactions with ROSA: The robot operating system agent. In Proceedings of the 2025 IEEE Aerospace Conference, Big Sky, MT, USA, 1–8 March 2025; pp. 1–16.
[14] Katuwandeniya, K.; Widhanapathirana, S.R.J. ROS Help Desk: GenAI powered, user-centric framework for ROS error diagnosis and debugging. arXiv 2025, arXiv:2507.07846.
[15] Scheltinga, E.; Pek, C. Explaining robot failures in ROS using parameter-efficient fine-tuning. In Proceedings of the Workshop on Robot Execution Failures and Failure Management Strategies, RSS 2024, Delft, The Netherlands, 15–19 July 2024; Available online: https://robot-failures.github.io/rss2024/papers/RobotFailuresRSS2024_paper_2.pdf (accessed on 18 December 2025).
[16] Scheltinga, E.M. Personalising Explanations for Robot Failures in Robot Operating System Using Parameter-Efficient Fine-Tuning. Master’s Thesis, Department of Mechanical Engineering, TU Delft, Delft, The Netherlands, 2024.
[17] Sobrín-Hidalgo, D.; González-Santamarta, M.A.; Guerrero-Higueras, Á.M.; Rodríguez-Lera, F.J.; Matellán-Olivera, V. Explaining autonomy: Enhancing human-robot interaction through explanation generation with large language models. arXiv 2024, arXiv:2402.04206.
[18] Fu, L.; Salimpour, S.; Militano, L.; Edelman, H.; Queralta, J.P.; Toffetti, G. ROSBag MCP Server: Analyzing robot data with LLMs for agentic embodied AI applications. In Proceedings of the 2025 International Conference on Robotic Computing and Communication (RoboticCC), Naples, Italy, 8–10 December 2025; pp. 70–77.
[19] Extelligence-ai/Bagel. GitHub, 2024. Available online: https://github.com/Extelligence-ai/bagel (accessed on 18 December 2025).
[20] Song, X.; Li, Y.; Dong, Z.; Liu, S.; Cao, J.; Peng, X. An empirical study on fault diagnosis in robotic systems. In Proceedings of the 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME), Bogotá, Colombia, 1–6 October 2023; pp. 207–219.
[21] Bédard, C.; Lütkebohle, I.; Dagenais, M. ros2_tracing: Multipurpose low-overhead framework for real-time tracing of ROS 2. IEEE Robot. Autom. Lett. 2022, 7, 6511–6518.
[22] Datadog. Modern Monitoring & Security. Datadog, Inc. Available online: https://www.datadoghq.com/ (accessed on 18 December 2025).
[23] New Relic. Observability Platform. New Relic, Inc. Available online: https://newrelic.com (accessed on 18 December 2025).
[24] OpenTelemetry. OpenTelemetry: Effective Observability Requires High-Quality Telemetry. Cloud Native Computing Foundation. Available online: https://opentelemetry.io/ (accessed on 18 December 2025).
[25] What Is the Model Context Protocol (MCP)? Model Context Protocol Documentation. Available online: https://modelcontextprotocol.io/docs/getting-started/intro (accessed on 18 December 2025).
[26] Zeng, F.; Gan, W.; Wang, Y.; Liu, N.; Yu, P.S. Large language models for robotics: A survey. arXiv 2023, arXiv:2311.07226.
[27] Wang, J.; Shi, E.; Hu, H.; Ma, C.; Liu, Y.; Wang, X.; Yao, Y.; Liu, X.; Ge, B.; Zhang, S. Large language models for robotics: Opportunities, challenges, and perspectives. J. Autom. Intell. 2025, 4, 52–64.
[28] Qi, Z.; Jing, X. Advances in large language models for robotics. In Proceedings of the 2024 7th International Conference on Mechatronics, Robotics and Automation (ICMRA), Wuhan, China, 20–22 September 2024; pp. 72–76.
[29] PRISMA. PRISMA 2020 Statement. 2021. Available online: https://www.prisma-statement.org/prisma-2020 (accessed on 18 December 2025).
[30] González-Santamarta, M.Á.; Fernández-Becerra, L.; Sobrín-Hidalgo, D.; Guerrero-Higueras, Á.M.; González, I.; Lera, F.J.R. Using large language models for interpreting autonomous robots behaviors. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems; Springer Nature: Cham, Switzerland, 2023; pp. 533–544.
[31] Rivera, S.; Iannillo, A.K.; Lagraa, S.; Joly, C.; State, R. ROS-FM: Fast monitoring for the robotic operating system (ROS). In Proceedings of the 2020 25th International Conference on Engineering of Complex Computer Systems (ICECCS), Singapore, 28–31 October 2020; pp. 187–196.
[32] Kang, J.; Kim, K.; Kwon, D. Watch your callback: Offline anomaly detection using machine learning in ROS 2. IEEE Access 2025, 13, 60763–60775.
[33] Rachwał, K.; Majek, M.; Boczek, B.; Dąbrowski, K.; Liberadzki, P.; Dąbrowski, A.; Ganzha, M. RAI: Flexible agent framework for embodied AI. In Proceedings of the International Conference on Practical Applications of Agents and Multi-Agent Systems; Springer Nature: Cham, Switzerland, 2025; pp. 195–206.
[34] Raja, A.; Bhethanabotla, A. OperateLLM: Integrating robot operating system (ROS) tools in large language models. In Proceedings of the 2024 IEEE 1st International Conference on Communication Engineering and Emerging Technologies (ICoCET), Kepala Batas, Malaysia, 2–3 September 2024; pp. 1–4.
[35] Mower, C.E.; Wan, Y.; Yu, H.; Grosnit, A.; Gonzalez-Billandon, J.; Zimmer, M.; Wang, J.; Zhang, X.; Zhao, Y.; Zhai, A.; et al. ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning. arXiv 2024, arXiv:2406.19741.
[36] Ivanov, A.; Zakiev, A.; Tsoy, T.; Hsia, K.H. Online monitoring and visualization with ROS and ReactJS. In Proceedings of the 2021 International Siberian Conference on Control and Communications (SIBCON), Kazan, Russia, 13–15 May 2021; pp. 1–4.
[37] ISO/IEC/IEEE 24765:2010; Systems and Software Engineering—Vocabulary. International Organization for Standardization; International Electrotechnical Commission; Institute of Electrical and Electronics Engineers; International Standard: Geneva, Switzerland, 2010. Available online: https://www.iso.org/standard/50518.html (accessed on 18 December 2025).
[38] Richardson, C. Microservices Patterns: With Examples in Java; Simon & Schuster: New York, NY, USA, 2018.
[39] Zipkin. Distributed Tracing System. Available online: https://zipkin.io/ (accessed on 18 December 2025).
[40] Bo, V.; Garrell, A.; Sanfeliu, A. Fast or accurate? How intention-recognition models shape human perception of a mobile robot. In Proceedings of the Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction, Scotland, UK, 16–19 March 2026; pp. 502–506.

Figure 1. PRISMA 2020 flow diagram showing the study selection process.

Figure 2. Distribution of the 29 included studies by publication year.

Figure 3. Conceptual architecture for FDD in ROS-based robotic systems. The ROS 2 runtime exposes nodes, topics, services, transforms, and action servers; the Telemetry collection pipeline instruments this runtime to aggregate logs, metrics, distributed traces, and sensor streams, which are consumed by the Runtime Monitoring layer for anomaly detection, threshold alerting, and runtime verification. Upon detecting an anomaly, the monitoring layer fires an event-driven Fault Trigger to the Agentic AI system, which can also be queried directly by the operator through natural language. To gather live diagnostic evidence, the agent issues Live Telemetry Queries to the pipeline and Live Robot State Queries directly to the ROS 2 runtime via CLI introspection. Static Analysis operates as a complementary mechanism: it raises Fault Triggers to initiate agent investigation and supplies evidence—code-level bug patterns, architectural violations, and known fault signatures—to support causal explanation of detected failures. Responses are returned to the operator upon query completion.
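The event-driven path in Figure 3 (monitoring layer, Fault Trigger, agent, evidence queries) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' system: every class and function name is hypothetical, the telemetry and robot-state queries are stand-in callables rather than real ROS 2 or telemetry back-end calls, and the LLM reasoning step is replaced by a stub that formats a string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FaultTrigger:                      # hypothetical event type
    source: str                          # "runtime_monitoring" or "static_analysis"
    description: str


@dataclass
class DiagnosticAgent:
    # Stand-ins for Live Telemetry Queries and Live Robot State Queries (assumptions).
    query_telemetry: Callable[[str], List[str]]
    query_robot_state: Callable[[], Dict[str, str]]
    # Placeholder for LLM-based reasoning; stubbed in the example below.
    reason: Callable[[FaultTrigger, List[str], Dict[str, str]], str]
    reports: List[str] = field(default_factory=list)

    def on_fault(self, trigger: FaultTrigger) -> str:
        evidence = self.query_telemetry(trigger.description)
        state = self.query_robot_state()
        report = self.reason(trigger, evidence, state)
        self.reports.append(report)
        return report


class ThresholdMonitor:
    """Fires a FaultTrigger when an observed metric exceeds a fixed limit."""

    def __init__(self, metric: str, limit: float,
                 on_trigger: Callable[[FaultTrigger], str]):
        self.metric = metric
        self.limit = limit
        self.on_trigger = on_trigger

    def observe(self, value: float) -> Optional[str]:
        if value <= self.limit:
            return None
        return self.on_trigger(FaultTrigger(
            source="runtime_monitoring",
            description=f"{self.metric}={value} exceeds {self.limit}",
        ))


agent = DiagnosticAgent(
    query_telemetry=lambda q: ["[WARN] controller: control loop missed deadline"],
    query_robot_state=lambda: {"/controller": "active"},
    reason=lambda t, ev, st: (
        f"{t.source}: {t.description}; "
        f"{len(ev)} log line(s); nodes={sorted(st)}"
    ),
)
monitor = ThresholdMonitor("callback_latency_ms", 20.0, agent.on_fault)
print(monitor.observe(35.5))
```

Passing the evidence providers in as callables keeps the agent independent of any particular telemetry store, so the same trigger path could serve both the runtime monitoring and the static analysis sources named in the caption.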

Table 1. Quality assessment criteria and scoring scheme.

Criterion	Category	Score
Peer-review status	Yes/No	2/0
Venue tier	High/Medium/Low	3/2/1
Age-normalised citations	High/Medium/Low	3/1/0
Publication recency	Recent (≥2022)/Not recent	2/1

Table 2. Normalised citation thresholds.

Age Bracket	Years	High (+3)	Medium (+1)	Low (+0)
Very recent (<1 yr)	2025	≥10	3–9	<3
Recent (1–2 yr)	2023–2024	≥25	8–24	<8
Established (3–4 yr)	2021–2022	≥50	15–49	<15
Mature (≥5 yr)	≤2020	≥80	20–79	<20
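The scoring scheme in Table 1 and Table 2 amounts to a sum of four components, which the following sketch makes explicit. The function and variable names are illustrative, the venue tier is taken as an input because its assignment rule is not given in these tables, and treating a missing citation count (N/A) as Low is an assumption.

```python
from typing import Optional

# Citation thresholds from Table 2 as (high, medium) lower bounds per age bracket.
THRESHOLDS = {
    "very_recent": (10, 3),    # 2025
    "recent": (25, 8),         # 2023-2024
    "established": (50, 15),   # 2021-2022
    "mature": (80, 20),        # 2020 and earlier
}


def age_bracket(year: int) -> str:
    if year >= 2025:
        return "very_recent"
    if year >= 2023:
        return "recent"
    if year >= 2021:
        return "established"
    return "mature"


def citation_points(citations: Optional[int], year: int) -> int:
    # Assumption: a missing citation count (N/A) is scored as Low.
    if citations is None:
        return 0
    high, medium = THRESHOLDS[age_bracket(year)]
    if citations >= high:
        return 3
    if citations >= medium:
        return 1
    return 0


def quality_score(peer_reviewed: bool, venue_tier: str,
                  citations: Optional[int], year: int) -> int:
    peer_points = 2 if peer_reviewed else 0
    venue_points = {"high": 3, "medium": 2, "low": 1}[venue_tier]
    recency_points = 2 if year >= 2022 else 1
    return (peer_points + venue_points
            + citation_points(citations, year) + recency_points)


print(quality_score(True, "high", 77, 2022))
print(quality_score(True, "high", 215, 2018))
```

With both venues assumed to be rated High, the two calls use the Table A1 inputs for [21] (77 citations, 2022) and [5] (215 citations, 2018) and give 10 and 9, which agrees with the QS column reported there.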

Table 3. Unified Taxonomy for FDD in Robotic Systems. The dimensions defined here structure the comparative analysis conducted throughout the current review.

Category	Key Elements and Definitions
Fault Origin	Hardware, Software, Interaction
Fault Type	Recoverable vs. Non-Recoverable
FDD Approach	Data-Driven, Model-Based, Knowledge-Based, Hybrid
Fault Diagnosis	Proactive: system-initiated; Reactive: user-initiated or triggered; Preventive: anticipates faults before occurrence; Corrective: responds after fault
Fault Detection	Online: real-time monitoring; Offline: post hoc analysis
Observability	Logs, Traces, Metrics, Sensor Readings
Monitoring	Alerts: automated notifications; Dashboards: visual inspection interfaces
Analysis Type	Dynamic: analysis during real-time execution; Static: analysis on recorded/historical data or source code
Verification	RV, Simulation-based, Field-based
AI Integration	LLMs, Agents
Automation	Manual: continuous human intervention required for both detection and diagnosis; Semi-automated: automated detection with human-guided diagnosis, or automated monitoring that requires significant expert configuration; Fully Automated: autonomous detection, diagnosis, and explanation without human input per fault event
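For readers who want to apply the taxonomy programmatically, a few of its dimensions can be encoded as enumerations with a small record type per study or tool. The sketch below covers only four dimensions of Table 3, and the type names, the `missing_signals` helper, and the example entry are all hypothetical rather than part of the review's tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Detection(Enum):
    ONLINE = "online"        # real-time monitoring
    OFFLINE = "offline"      # post hoc analysis


class Diagnosis(Enum):
    PROACTIVE = "proactive"  # system-initiated
    REACTIVE = "reactive"    # user-initiated or triggered


class Signal(Enum):
    LOGS = "logs"
    TRACES = "traces"
    METRICS = "metrics"
    SENSOR_READINGS = "sensor readings"


class Automation(Enum):
    MANUAL = "manual"
    SEMI_AUTOMATED = "semi-automated"
    FULLY_AUTOMATED = "fully automated"


@dataclass(frozen=True)
class Classification:
    name: str
    detection: Optional[Detection]      # None when the tool performs no detection
    diagnosis: Optional[Diagnosis]      # None when the tool performs no diagnosis
    observability: FrozenSet[Signal]
    automation: Optional[Automation] = None

    def missing_signals(self) -> FrozenSet[Signal]:
        """Observability signals from Table 3 that this entry does not use."""
        return frozenset(Signal) - self.observability


# Hypothetical entry, not one of the reviewed studies.
example = Classification(
    name="example_tool",
    detection=Detection.ONLINE,
    diagnosis=Diagnosis.REACTIVE,
    observability=frozenset({Signal.LOGS, Signal.SENSOR_READINGS}),
    automation=Automation.SEMI_AUTOMATED,
)
print(sorted(s.value for s in example.missing_signals()))
```

A helper such as `missing_signals` makes single-modality tools easy to spot when comparing entries across the observability dimension.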

Table 4. Distribution of the 29 included studies by publication type and publisher.

Publication Type	Count	Publisher	Count
Conference Paper	11	IEEE	14
Journal Article	7	arXiv	4
Workshop Paper	5	Springer	4
arXiv Preprint	4	ACM	2
Thesis	1	Elsevier	1
Software/Tool	1	Other	4

Table 5. Topic evolution across the three research phases. Numbers indicate the count of studies per thematic category within each era; categories are derived from the taxonomy in Table 3.

Thematic Category	Foundation (2017–2018)	Maturation (2020–2022)	AI Integration (2023–2025)
FDD/Bug Taxonomy	2	1	1
Static Analysis/QA	1	1	0
Runtime Monitoring/RV	0	4	1
Observability (Tracing/Metrics)	0	2	1
ROS Architecture	0	1	1
LLM/Agentic AI	0	0	13
LLM Survey	0	0	3
HRI/Explainability	0	0	2

Table 6. Summary of LLM and Agentic AI-Based Tools for ROS.

Framework	Ref.	Primary Purpose	Key Technologies	Detection Mode	Diagnosis Mode	Observability	Monitoring	Analysis Type	Fault Strategy
ROS-LLM	[35]	Natural language task execution	DeepSeek 7B; CoT; Few-shot; Tools	None	None	None	None	Dynamic	None
OperateLLM	[34]	Development-time ROS interaction	LLMs; ReAct; ROS Tools	None	None	None	None	Dynamic	None
RAI	[33]	Multi-agent embodied AI framework	LLMs; RAG; Multi-Agents; ROS Tools; LangChain	Online	None	Logs, Sensor Readings	None	Dynamic	Both
ROSA	[13]	Natural language interface for ROS operations	LLMs; ReAct; ROS Tools; LangChain	Online	Reactive	Logs, Sensors, Topics	Dashboards	Dynamic	Both
ROS Help Desk	[14]	Error detection and debugging support	LLMs; RAG; ReAct; LangChain; ROS Tools; Gradio	Online	Proactive	Logs, Sensors, Topics, Source Code	Dashboards	Dynamic	Both
ROSBag MCP	[18]	Rosbag data analysis via natural language	LLMs/VLMs; MCP; ROS Tools	Offline	Reactive	Logs, Sensors, Topics	Dashboards	Static	Corrective
Bagel	[19]	Analyze data to provide informed answers	LLMs, MCP	Online	Reactive	Logs, Sensors, Topics, Metadata	Dashboards	Static	Both
Scheltinga et al.	[15,16]	Failure explanation generation for navigation	LLMs; RAG; PEFT; Low-Rank Adaptation (LoRA)	None	Reactive	Logs	None	Static	Corrective
González-S. et al.	[30]	Robot behaviour based on log interpretation	Generic LLM	None	Reactive	Logs	None	Static	None
Sobrín-H. et al.	[17]	Robot behaviour based on log interpretation	LLMs; RAG	None	Reactive	Logs	None	Static	None
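
To make the comparison in Table 6 easier to query, the sketch below transcribes three of its columns (Detection Mode, Diagnosis Mode, Analysis Type) into a small Python structure and filters for tools that combine online detection with some form of diagnosis. The `Tool` class and the helper function are hypothetical names introduced only for this illustration; the values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    detection: str  # "Online", "Offline", or "None"
    diagnosis: str  # "Proactive", "Reactive", or "None"
    analysis: str   # "Dynamic" or "Static"


# Rows transcribed from Table 6 (only three of its columns are kept).
TOOLS = [
    Tool("ROS-LLM", "None", "None", "Dynamic"),
    Tool("OperateLLM", "None", "None", "Dynamic"),
    Tool("RAI", "Online", "None", "Dynamic"),
    Tool("ROSA", "Online", "Reactive", "Dynamic"),
    Tool("ROS Help Desk", "Online", "Proactive", "Dynamic"),
    Tool("ROSBag MCP", "Offline", "Reactive", "Static"),
    Tool("Bagel", "Online", "Reactive", "Static"),
    Tool("Scheltinga et al.", "None", "Reactive", "Static"),
    Tool("González-S. et al.", "None", "Reactive", "Static"),
    Tool("Sobrín-H. et al.", "None", "Reactive", "Static"),
]


def online_detect_and_diagnose(tools):
    """Return names of tools with online detection and any diagnosis mode."""
    return [t.name for t in tools
            if t.detection == "Online" and t.diagnosis != "None"]


print(online_detect_and_diagnose(TOOLS))
# → ['ROSA', 'ROS Help Desk', 'Bagel']
```

Under this reading of the table, only three of the ten entries pair online detection with diagnosis, and only one of those lists a proactive diagnosis mode.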
