ST-GraphRCA: A Root Cause Analysis Model for Spatio-Temporal Graph Propagation in IoT Edge Computing
Abstract
1. Introduction
- In edge networks, dynamic links and resource constraints induce random delays, jitter, and packet loss between sensors and edge data centers. This non-deterministic interference causes multi-source data from the same event to arrive asynchronously, imposing nonlinear time shifts across data streams. These shifts fundamentally undermine the assumption of temporal alignment and introduce significant errors into feature extraction methods that require precisely aligned inputs.
- A fundamental issue in edge networks is the lack of a unified topological view. When a fault occurs, the dynamic dependencies among edge service instances cause rapid propagation along cascading fault chains. Network retry operations further exacerbate this propagation, non-linearly amplifying the fault signals along their paths.
- A critical problem in edge environments is capturing low-frequency, high-risk faults in real time. This difficulty stems from limited computing resources and is further compounded by a general absence of labeled historical fault data.
- To address the time-series misalignment in multi-source sampled metrics caused by edge network uncertainty, a novel PCA-DTW hybrid feature extraction method is proposed. Without requiring prior time-series synchronization, this method compensates for random transmission delays and stretching deformations via nonlinear vector alignment, thereby improving multi-source feature extraction accuracy. The procedure is two-fold. First, metrics are categorized into a network group (throughput, RTT, retransmission rate) and a computation group (CPU utilization, memory usage, deadlock count, I/O). Second, Principal Component Analysis (PCA) is used to diagnose the primary cause of misalignment: network or computation delays. This diagnosis then guides the Dynamic Time Warping (DTW) alignment strategy: a backward shift (insertion) is applied when network factors dominate, while a forward shift (deletion/compression) is used when computational delays prevail.
- In edge environments, cascading faults can be rapidly amplified. To localize their root cause, we design a stream-based forward propagation graph grounded in the flow conservation principle. This graph models fault propagation through dynamic directed edges, which establish the primary anomaly path. Edge weights are dynamically adjusted to suppress interference from non-critical calls, such as heartbeats. Concurrently, the input and output anomaly quantities at each graph node are monitored to pinpoint the root cause source.
- To detect low-frequency, high-risk anomaly faults in real time, we design a topology-constrained high-utility mining algorithm. A reachability pruning mask is first constructed from the topology of the forward propagation graph. This mask focuses high-utility mining on topologically feasible pathways by eliminating topologically unreachable candidate patterns during their generation. Furthermore, the utility function is optimized to enhance the causal filtering capability of the algorithm. The key step is to suppress the utility values of passive victim nodes, which exhibit high anomaly values alongside zero net outflow. This ensures the algorithm more accurately identifies the true root causes of low-frequency, high-risk anomaly faults.
2. Related Work
- The principle of causal inference is to construct causal graphs from observational data to trace anomaly sources. Pham utilized the classic Peter-Clark (PC) causal discovery algorithm to construct static causal dependencies from observational data, evaluating and comparing 9 causal discovery methods and 21 root cause analysis methods. The experimental results show that no single causal inference method is universally applicable; algorithms must therefore be adaptable to different application scenarios [10]. The existing LiNGAM algorithm fails to exploit sparse structures and higher-order moment information. Harada proposed a method that addresses this limitation through a single statistical criterion combining the ICA log-likelihood with sparse penalty terms. However, edge networks suffer from transmission latency and jitter, making it difficult to satisfy the strict synchronization assumptions of LiNGAM [11]. To address the lack of causal interpretability in IoT anomaly detection, Gad combined the LiNGAM causal discovery algorithm with an interpretable Random Forest to identify causal relationships within network traffic data, thereby enhancing model interpretability in IoT scenarios. However, this method relies on extensive labeled data, making it difficult to adapt to complex and variable edge computing environments [12,13]. Ikram proposed hierarchical and local learning methods to reduce the computational complexity of root cause analysis. 
These methods learn only the relevant parts of the causal graph, reducing the number of conditional independence tests. However, cascading fault chains form easily in edge networks, so local learning may overlook global propagation paths [14]. To rapidly identify root cause metrics, Li proposed a method that transforms analysis into an intervention recognition task: it first constructs a causal Bayesian network using system architecture knowledge, then monitors changes in the probability distributions of variables conditioned on their parents to establish causal relationships. However, the method relies on a pre-defined system architecture and is therefore hard to adapt to dynamic changes in edge networks [15]. For variable-level root cause analysis of anomalies, Budhathoki proposed a method based on causal graphs and functional causal models, which uses counterfactual Shapley values to quantify each node's contribution to an anomaly and thereby achieve attribution at the variable level. However, dependency associations among edge service instances are typically dynamic, making it difficult to define an accurate global causal graph in advance [16]. Orchard proposed a Polytree algorithm to address small-sample root cause analysis and the lack of structural knowledge. It uses edge anomaly scores to perform root cause traversal, but its simple structural assumptions cannot capture complex cascading fault chains [17]. In summary, while causal inference methods improve the interpretability of anomaly root cause analysis, they face notable challenges in edge computing environments. First, uncertain edge network transmission induces jitter, which creates nonlinear time shifts in asynchronously collected data. Second, the pervasive dynamic dependencies among microservices further complicate the establishment of stable causal models. The resulting temporal misalignment renders traditional correlation calculations ineffective, so critical anomaly features can be missed.
- Root cause analysis based on Large Language Models (LLMs) uses semantic understanding to find root causes. Traditional methods not only rely on manual expertise but are also prone to incomplete variable sets and flawed causal assumptions. Tang introduced LLMs to address this problem, using them to parse system logs and construct dynamic causal graphs; the LLM captures causal relationships and dynamic features across temporal dimensions to achieve anomaly root cause localization [18]. Fine-tuning LLMs for root cause analysis is costly and resource-intensive. To address this, Zhang proposed a method using GPT-4 with in-context learning, which retrieves similar historical incidents and uses them as prompts to guide the model in localizing fault root causes. However, the method relies on the cloud-based GPT-4 model and therefore cannot meet the real-time and data-localization requirements of edge computing [19]. Fault analysis requires the generation of high-quality decision sequences; Ezukwoke addressed the fault analysis triplet generation task with a fine-tuned BERT-GPT2 model, improving the coherence of the generated sequences [20]. Li proposed the COCA model to handle incomplete fault reports submitted by users. It extracts diagnostic clues from code to reconstruct execution paths, assisting the LLM in root cause summarization and localization. However, the method lacks labeled historical data, so it struggles to capture low-frequency, high-risk fault events in real time [21]. Szandała evaluated the capability of LLMs to diagnose system faults using observational metrics within a chaos engineering framework. Benchmarking models such as GPT and Gemini on fault diagnosis tasks showed that few-shot prompting can improve accuracy; however, the inherent hallucination problem of LLMs makes them unsuitable for highly reliable, real-time edge systems [22]. Goel proposed the eARCO framework to address static prompts in LLM root cause analysis, automatically optimizing prompts and combining them with domain-adaptive Small Language Models (SLMs). However, SLMs still face computational bottlenecks on resource-constrained edge nodes [23]. Benchmark datasets for evaluating the ability of LLMs to localize software fault root causes have been lacking. Xu constructed the OpenRCA benchmark, which contains 335 fault cases and massive telemetry data; results show that even the best-performing large models resolve only 11% of complex faults [24]. Roy proposed the ReAct method for RCA agents to address their inability to dynamically collect diagnostic information such as logs, metrics, and database state; however, using event report data as additional input did not yield significant performance gains [25]. Static feature extraction methods struggle to capture dynamic root cause patterns in event sequences. Zhu introduced TraceLM, which uses context-embedded language models to capture temporal dynamic features directly from data and compares these features to localize the root cause. In edge environments, however, the method is hampered by non-linear time shifts in the sampled data [26]. Distributed Kubernetes containers face high complexity due to state consistency maintenance. Xiang proposed the SynergyRCA framework, which constructs state graphs to capture spatio-temporal dependencies; when a fault occurs, the LLM combines expert prompts with the dynamic graph to predict the fault root cause. However, edge networks lack a globally unified dynamic topology view, which limits this state-graph capture method [27]. In summary, LLM-based methods offer powerful semantic understanding for root cause analysis, but their large model size and high inference latency make them difficult to deploy in edge network scenarios.
- Deep learning-based methods use dependency graphs to extract topological features, capturing complex invocation relationships among software components. Qia addressed open graph anomaly detection by summarizing a set of widely used datasets [28]. Alsalman proposed the FusionNet model to improve IoT anomaly detection by combining Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). However, integrating multiple models increases inference latency and resource consumption at the edge, making it difficult to meet real-time requirements [29]. Lin proposed the RUN model to incorporate temporal features into root cause analysis, using a contrastive learning encoder to capture complex dependencies among microservices in cloud data centers. However, it ignores anomaly propagation between nodes, which can lead to cascading issues [30]. Separately, Deng addressed the challenge of capturing high-dimensional temporal features by combining structural learning with Graph Neural Networks (GNNs) and attention mechanisms. This improves the interpretability of anomaly detection and helps users infer root causes, but the method cannot capture low-frequency, high-risk fault events [31]. Steenwinckel proposed the FLAGS method to fuse data-driven and knowledge-driven root cause analysis. It integrates semantic knowledge with machine learning and uses user feedback for adaptive optimization, improving the interpretability of root cause localization and reducing knowledge-graph modeling costs [32]. Klein utilized knowledge graphs to derive component dependencies in distributed software and fed them into a Siamese Graph Convolutional Network (GCN) that diagnoses anomalies and achieves localization via graph pattern matching. However, static knowledge graphs struggle to adapt to the dynamic topologies of edge networks [33]. 
Nadim discovered state-event graphs from low-level continuous observational data, extracting highly accurate and trustworthy patterns from raw data, and finally generated causal models from event logs [34,35]. Separately, defining rules for Complex Event Processing (CEP) in IoT is challenging; Simsek proposed an automated framework that uses deep learning for rule extraction and includes data labeling and rule extraction components [36]. The root of misjudgment lies in the fact that passive nodes may exhibit higher anomaly metrics than the true fault source, a scenario in which single-feature analysis inherently fails to identify the correct origin.
- Time-series misalignment from physical factors. Existing methods often assume multi-sensor data is time-aligned or lack proper alignment strategies for different data types [11,12]. However, in edge systems, time shifts arise from specific physical causes. Failing to account for them during data alignment distorts the data, breaks the true relationships between features, and leads to missing important anomaly evidence.
- Uncaptured dynamics in cascading faults. For cascading faults, traditional dependency graphs map node connections but ignore the direction and strength of fault spread [30]. Passively affected nodes can show higher anomaly scores than the source node. This makes it difficult to identify the true root cause, which is characterized by high net fault output, using simple feature comparisons.
- Search space explosion and causal confusion. High-Utility Mining (HUM) is used on the edge to find rare, high-risk faults [37]. Standard HUM algorithms check all possible combinations without limits, causing an exponential growth in search space that overwhelms edge devices. Furthermore, standard utility measures cannot filter by cause. They give high scores to severely affected victim nodes, causing both missed real faults and false alarms for high-risk patterns.
3. Model Methodology
3.1. Model Structure
- PCA-DTW Hybrid Feature Extraction
- Various system operation and maintenance data transmitted via the edge network are received and aggregated to form a multi-dimensional metric matrix.
- The metric data are grouped into network and computation categories. Subsequently, environmental noise is filtered and principal components are extracted using an online PCA method [38].
- Based on the characteristics of the different groups, the DTW temporal alignment strategy is dynamically adjusted to achieve elastic alignment between observation sequences and reference vectors.
- Based on the DTW calculation, the minimum warping value at the end point of the cumulative cost matrix is taken as the node’s anomaly score, which completes the node feature extraction. By addressing the temporal misalignment caused by non-deterministic edge interference, this module significantly improves feature extraction accuracy.
- Cascading Fault Causal Inference
- A forward propagation graph is constructed. Node anomaly scores are obtained from the feature extraction module, and the weights of service invocation edges are updated by combining streaming trace data. The model maintains the directed edges of the forward propagation graph in real time to determine the propagation path of anomaly faults.
- The reference vectors are maintained based on the node anomaly scores extracted via PCA-DTW. The anomaly score serves as a real-time metric for quantifying the node anomaly status.
- For each node in the forward propagation graph, the net inflow energy is calculated from the weights of inflow edges and the corresponding node anomaly scores, while the net outflow energy is calculated from the weights of outflow edges and the corresponding node anomaly scores. Finally, the net anomaly outflow metric is calculated as the outflow energy minus the inflow energy. This metric is used to infer the root cause candidate node of a single propagating anomaly fault chain. As shown in Figure 1, the model is grounded in the principle of flow conservation, which allows it to quantify fault propagation intensity. Based on this quantification, it pinpoints active fault sources by their positive net outflow and differentiates them from passive victims, which exhibit negative or zero outflow.
- Anomaly Fault Analysis and Localization
- A reachability constraint matrix is first constructed from the topological connections of the forward propagation graph. This matrix then provides a definitive strategy for pruning unreachable paths during the mining process.
- To assess the likelihood of a root cause, we define a utility function for candidate nodes that show net outflow anomalies. This function is intentionally designed with two complementary parts. The first part is the internal utility, which is based on the inflow edge weights and the node’s anomaly score. The second part is the external utility, based on the outflow edge weights and the anomaly score. This two-part design allows for a detailed evaluation of fault influence in both the incoming and outgoing directions. The total utility, calculated as the sum of these two components, provides a single, consolidated measure for root cause identification.
- The process generates an initial set of candidate patterns. These patterns are then filtered using a constraint matrix. The purpose of this matrix is to eliminate any combinations that are topologically unreachable. This step ensures computational efficiency and helps prioritize the detection of low-frequency, high-risk anomaly sources. After this filtering, a recursive high-utility pattern mining algorithm operates within the constrained solution space. This algorithm performs real-time utility calculations for each candidate path. As a result, the final output is refined to include only the high-utility patterns that also possess the critical signature of high net outflow, which is indicative of a root cause.
- The screened high-utility anomaly node sequence is output, completing the root cause localization.
3.2. PCA-DTW Hybrid Feature Extraction
- Physical Semantic Grouping: Grouping is performed at the input stage based on the distinct physical semantics of the operation and maintenance data. As shown on the left side of Figure 2, the raw high-dimensional collected data are mapped into two semantic subspaces: the network-sensitive group (colored black; e.g., network throughput, RTT, retransmission rate) and the computation-sensitive group (colored white; e.g., CPU utilization, memory usage, deadlock count).
- Energy Attribution Diagnosis: First, within the time window, data from the network group (black) and the computation group (white) are projected into a low-dimensional space to extract the first principal component, and the sum of the absolute loading energies of each metric within the first principal component vector is calculated. Second, the loading energy is aggregated by group: the network group energy (labeled Sumnet in the figure) is the sum of the absolute weights of all network metrics within the principal component, while the computation group energy (labeled Sumcomp in the figure) is the sum of the absolute weights of all computation metrics.
- Gated DTW Alignment Algorithm: As illustrated on the right side of Figure 2, once the loading energies are acquired, the process enters the decision loop depicted on the right. By comparing the loading energy proportions of the network and computation groups, the physical root cause of the fault is determined in real time for the current window. This diagnostic result serves as a control signal for the DTW module. When the fault is determined to be network-dominated, the left branch is activated, reducing the DTW insertion penalty; this allows the algorithm to adapt to data lag (e.g., from network congestion) by applying a backward shift. Conversely, when the fault is determined to be computation-dominated, the right branch is activated, reducing the DTW deletion penalty; this enables the algorithm to handle data loss (e.g., from CPU deadlocks) via forward compression. Finally, the anomaly score is calculated by aligning the real-time vector with the reference vector.
- From a spatial perspective, the metrics used for anomaly judgment are grouped according to their distinct physical meanings. Let the standardized input matrix be denoted as $X$, containing the multi-dimensional sensor time-series data (CPU, memory, RTT, etc.), where N is the system metric dimension aggregated from edge nodes and T (initialized to 60 s) is the time-step length of the sliding window. Based on physical characteristics, the data are divided into the network group $X_{net}$ and the computation group $X_{comp}$.
- PCA is adopted for dimensionality reduction to improve the computational efficiency of the algorithm and to obtain the loading energies of each group. First, the covariance matrix of the standardized window is calculated:

$$\Sigma = \frac{1}{T - 1} X^{\top} X$$
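As a concrete illustration of the energy attribution step, the following sketch (Python/NumPy; the function name and index arguments are illustrative, not part of the paper) computes the per-group loading energies Sumnet and Sumcomp from the first principal component of a metric window:

```python
import numpy as np

def group_loading_energy(X, net_idx, comp_idx):
    """Sum of absolute first-PC loadings per semantic group.

    X        : (T, N) standardized metric window (rows = time steps).
    net_idx  : column indices of network-sensitive metrics.
    comp_idx : column indices of computation-sensitive metrics.
    """
    cov = np.cov(X, rowvar=False)               # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    pc1 = eigvecs[:, -1]                        # first principal component
    sum_net = float(np.abs(pc1[net_idx]).sum())     # Sumnet in Figure 2
    sum_comp = float(np.abs(pc1[comp_idx]).sum())   # Sumcomp in Figure 2
    return sum_net, sum_comp
```

When Sumnet exceeds Sumcomp, the window's misalignment is attributed to network delay and the insertion penalty of the subsequent DTW step is lowered; otherwise the deletion penalty is lowered.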
- From a temporal perspective, a gated DTW alignment method is designed to calculate node anomaly scores. Let the real-time feature sequence obtained within time T be denoted as $a = (a_1, \ldots, a_T)$, and the historically maintained forward reference sequence as $b = (b_1, \ldots, b_{T'})$, where $T'$ is the reference length. First, a local distance matrix D of size $T \times T'$ is constructed, where element $D_{i,j}$ represents the distance between the two points:

$$D_{i,j} = (a_i - b_j)^2$$
- Match: Corresponds to diagonal movement. This implies that time step i of the real-time sequence is perfectly aligned with time step j of the reference sequence. A smoothly operating edge node is indicated by three key conditions: minimal disparity in PCA loading energy between network and computation groups, no significant transmission jitter, and no sampling blockage. Together, these ensure strict synchronization between sensor data generation and processing.
- Compression: Corresponds to horizontal movement. This operation is activated by the computation group and implies that the real-time sequence skips certain segments of the reference sequence, which corresponds to sampling loss or blockage when edge devices fail to generate some data points due to CPU deadlocks or high load. Reducing the penalty allows DTW to automatically skip the corresponding missing segments in the reference, achieving elastic compression alignment of discontinuous data.
- Insertion: Corresponds to vertical movement. This operation is activated by the network group, implying that a single point in the real-time sequence maps to multiple points in the reference sequence, forming a stretching effect. By reducing the penalty, the system actively identifies this temporal distortion caused by environmental lag to distinguish it from genuine numerical anomalies and prevent misdiagnosis.
| Algorithm 1. PCA-DTW hybrid feature extraction. |
| Require: standardized window X, reference sequence, network/computation group indices |
| 1: Compute the covariance matrix |
| 2: Perform eigendecomposition and extract the first principal component |
| 3: Generate the projected sequence |
| 4: Calculate the semantic energy contributions Sumnet and Sumcomp |
| 5: Initialize penalty factors: if Sumnet dominates, reduce the insertion penalty; else if Sumcomp dominates, reduce the deletion penalty |
| 6: Initialize the cumulative cost matrix C |
| 7: for i = 1 to T do |
| 8: for j = 1 to T do |
| 9: update C(i, j) via the gated DTW recurrence |
| 10: return the normalized score based on C(T, T) |
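The gated alignment of Algorithm 1 can be sketched in executable form as follows (a minimal Python sketch; the penalty values and function names are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def gate_penalties(sum_net, sum_comp, low=0.1, high=1.0):
    """Map the PCA energy diagnosis to (insertion, deletion) penalties."""
    if sum_net > sum_comp:      # network-dominated: favor backward shift
        return low, high
    return high, low            # computation-dominated: favor compression

def gated_dtw_score(x, y, p_ins=1.0, p_del=1.0):
    """Terminal cumulative DTW cost with asymmetric move penalties.

    x : real-time feature sequence; y : reference sequence.
    p_ins penalizes vertical moves (insertion), p_del penalizes
    horizontal moves (deletion/compression); diagonal matches are free.
    """
    T, Tr = len(x), len(y)
    D = np.subtract.outer(x, y) ** 2            # local distance matrix
    C = np.full((T + 1, Tr + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, Tr + 1):
            C[i, j] = D[i - 1, j - 1] + min(
                C[i - 1, j - 1],           # match (diagonal)
                C[i - 1, j] + p_ins,       # insertion (stretching)
                C[i, j - 1] + p_del)       # compression (skipping)
    return float(C[T, Tr])
```

The terminal value serves as the node anomaly score before normalization; lowering a penalty makes the corresponding warping direction cheaper, so environmental lag is absorbed by the alignment rather than inflating the score.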
3.3. Cascading Fault Causal Inference
- Real-time Data Aggregation. As illustrated on the left side of Figure 4, real-time data streams containing invocation traces (Traces) and logs (Logs) are received from the edge. This data is then used to parse the dynamic invocation relationships among microservices. Simultaneously, anomaly scores output by the PCA-DTW hybrid feature extraction module are accepted; these scores quantify the current degree of anomaly for individual nodes.
- Construction of Dynamic Forward Propagation Graph. As illustrated in the center of Figure 4, the connection relationships between nodes are updated in real time utilizing streaming data, and node states are defined based on anomaly scores. The graph structure is dynamically updated from the trace stream. A new edge is created when a new invocation appears. Conversely, if an invocation is absent for an extended period, the corresponding edge is removed via a weight decay mechanism. Under normal conditions, the system maintains two key elements in real time. First, it updates node reference vectors based on the latest vector data. Second, it adjusts node edge weights based on the ongoing Trace stream activity.
- Causal Inference of Root Cause Candidate Nodes. As illustrated on the right side of Figure 4, the net inflow and outflow anomaly indices of nodes are first calculated from the anomaly scores and edge weights within the forward propagation graph. If the net outflow of a node exceeds a safety threshold, then not only is the node's own anomaly severity high, but the anomaly energy it propagates outward also significantly exceeds what it receives; consequently, it is determined to be an active fault source, as indicated by the red node in the figure. Conversely, if the net outflow does not exceed the threshold, the node is merely affected by upstream faults; it is determined to be a passive victim and is eliminated, as indicated by the gray node in the figure. Finally, the anomaly candidate nodes are output.
- (Node Set): Represents the edge services or components active within time window t. Each node in the graph stores a reference vector. This state vector corresponds to the normal reference vector extracted via PCA-DTW for the network and computation groups under normal conditions, as described in Section 3.2; it represents the feature set of the node during normal operation. The update algorithm is designed with a dual goal: it must adapt to inherent edge variations such as workload fluctuations and hardware aging, without allowing sudden fault data to corrupt the reference baseline. Assume that at time step t, the real-time observation vector of a node after PCA dimensionality reduction is available, together with the set of reference vectors maintained up to time t − 1. For any node $v$ in the forward propagation graph, its confidence is defined as:

$$c_v(t) = \mathrm{Sigmoid}\left(\frac{s_v(t) - \mu_v}{\sigma_v}\right)$$

where $s_v(t)$ is the node anomaly score of node $v$ (i.e., the terminal value of matrix C in Section 3.2), and $\mu_v$ and $\sigma_v$ are the rolling mean and standard deviation of historical anomaly scores, initialized to the values upon first entry. The Sigmoid function maps the score to the (0,1) interval, representing the confidence that the node is in an anomalous state. The reference is allowed to update only when the data confidence is below the safety threshold (set to 0.8); otherwise, the reference update is frozen. During the update, a smoothing factor (set to 0.02) is introduced, and an exponential moving average algorithm is employed to reduce the impact of low-frequency sporadic connections. This enables the model to track normal environmental drift while maintaining robustness against contamination from sudden fault data. The reference vector update algorithm is presented in Algorithm 2.
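Before the formal listing in Algorithm 2, the confidence-gated update can be sketched as follows (Python; the function name and list-based vector representation are illustrative, with the threshold 0.8 and smoothing factor 0.02 taken from the text):

```python
import math

THETA = 0.8   # safety threshold from the text
ETA = 0.02    # smoothing factor from the text

def update_reference(ref, obs, score, mu, sigma):
    """Confidence-gated EMA update of one node's reference vector.

    ref       : current reference vector (list of floats).
    obs       : real-time observation vector after PCA reduction.
    score     : node anomaly score (terminal value of matrix C).
    mu, sigma : rolling mean / std of historical anomaly scores.
    """
    # Sigmoid maps the standardized score to a (0, 1) confidence.
    conf = 1.0 / (1.0 + math.exp(-(score - mu) / max(sigma, 1e-9)))
    if conf < THETA:
        # Node looks normal: track slow environmental drift.
        return [(1 - ETA) * r + ETA * o for r, o in zip(ref, obs)]
    return ref  # likely anomalous: freeze the baseline
```

Freezing the baseline whenever the confidence crosses the threshold is what prevents sudden fault data from contaminating the reference.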
Algorithm 2. Stream-based Graph Reference Vector Maintenance.
Require: previous reference vectors, anomaly threshold, smoothing factor, anomaly scores
for each node in active microservices do
  compute the node confidence from its anomaly score
  if the confidence is below the anomaly threshold then
    update the reference vector via the exponential moving average
  else
    freeze the reference vector
return the updated reference vectors
- (Edge Set): Represents the forward invocation dependency relationships among services. A directed edge indicates that one service initiated a request that propagated to another. These edges are dynamically updated by parsing the relationship between the parent node ID and the node ID within the Trace stream in real time.
- (Weight Set): The edge weight represents the confidence intensity of the dependency relationship. During the continuous influx of trace streams, certain service invocations may be sporadic (such as heartbeat detection or one-off tasks); however, faults typically propagate along high-frequency dependency paths. Therefore, dynamic weighted updating based on streaming data is required. To quantify the gradation in dependency, the system employs an exponential moving average (EMA) algorithm for real-time edge weight updates. When a new trace data stream arrives and an invocation of edge $(v_i, v_j)$ is detected, the weight update formula is:

$$w_{ij}(t) = (1 - \alpha)\, w_{ij}(t - 1) + \alpha$$

where $\alpha$ is the EMA smoothing factor; edges whose invocations are no longer observed decay over time and are eventually removed.
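A minimal sketch of the edge-weight maintenance and the flow-conservation root-cause metric follows (Python; the smoothing and decay constants, and the exact weighting of scores in the two energy terms, are illustrative assumptions):

```python
ALPHA = 0.3   # EMA gain for edges observed in the current trace batch
DECAY = 0.95  # decay factor for edges absent from the batch

def update_edge_weights(weights, observed_calls):
    """EMA update of edge weights {(u, v): w} from one trace batch."""
    for edge in list(weights):
        if edge in observed_calls:
            weights[edge] = (1 - ALPHA) * weights[edge] + ALPHA
        else:
            weights[edge] *= DECAY       # sporadic calls fade out
    for edge in observed_calls - weights.keys():
        weights[edge] = ALPHA            # newly seen invocation
    return weights

def net_anomaly_outflow(node, weights, scores):
    """Net outflow energy minus inflow energy for one node.

    Outflow energy weighs the node's own anomaly score by its outgoing
    edge weights; inflow energy weighs upstream scores by incoming
    edge weights.
    """
    e_out = sum(w * scores[node] for (u, v), w in weights.items() if u == node)
    e_in = sum(w * scores[u] for (u, v), w in weights.items() if v == node)
    return e_out - e_in
```

A node whose net outflow exceeds the safety threshold is flagged as an active fault source; non-positive values mark passive victims.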
3.4. Anomaly Fault Analysis and Localization
- Definition of Causal Utility Function
- Design of the Topology Reachability Pruning Mask
- Localization of Anomaly Fault Root Causes
- When attempting to append a new node to the terminal of the current fault chain P, the mask is first queried. If the mask entry is 0, indicating that the two nodes are unreachable on the forward graph, the invalid branch is pruned.
- For a new pattern that passes validation, if its utility upper bound satisfies the threshold requirement, a projected database is constructed, i.e., a sub-dataset containing only the subsequent propagation paths of that pattern.
- The mining procedure is then invoked recursively, continuing the search for deeper root cause nodes within the reduced subspace until no higher-value patterns are generated, and finally outputting the Top-k candidate node set. The anomaly fault analysis and localization procedure is presented in Algorithm 3:
Algorithm 3. Anomaly fault analysis and localization.
Require: forward propagation graph, node anomaly scores, edge weights, utility threshold
Construct the reachability matrix M: M(i, j) = 1 if i can reach j in the forward graph, else 0.
Calculate the utility of each candidate node and sort candidates by TWU (transaction-weighted utility) order.
for each candidate node in the current projection do
  if the node is reachable from the chain terminal (M = 1) and its utility upper bound meets the threshold then
    extend the current pattern with the node
    if the pattern utility meets the threshold then add the pattern to the result set
    construct the projected database and recurse into the next level
return the Top-k nodes, sorted by score in descending order
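The pruning idea of Algorithm 3 can be sketched as follows (Python; the chain-expansion policy and the additive chain utility are simplifying assumptions, not the paper's exact utility bound):

```python
from collections import defaultdict, deque

def reachability_mask(edges, nodes):
    """mask[u] = set of nodes reachable from u on the forward graph (BFS)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    mask = {}
    for u in nodes:
        seen, queue = {u}, deque([u])
        while queue:
            w = queue.popleft()
            for nxt in adj[w]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        mask[u] = seen - {u}
    return mask

def mine_chains(mask, utility, start, threshold, chain=None, out=None):
    """Recursive chain search; reachability prunes infeasible extensions.

    utility  : per-node causal utility values.
    threshold: minimum total utility for a chain to be reported.
    """
    if chain is None:
        chain, out = [start], []
    total = sum(utility[n] for n in chain)
    if total >= threshold:
        out.append((list(chain), total))
    for nxt in sorted(mask[chain[-1]]):   # only reachable candidates
        if nxt not in chain:
            chain.append(nxt)
            mine_chains(mask, utility, start, threshold, chain, out)
            chain.pop()
    return out
```

Unreachable nodes are never generated as candidates at all, which is exactly the effect of the M(i, j) = 0 pruning rule above.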
4. Experiments
4.1. Experimental Setup and Datasets
- Service Node Fluctuation and Restart: This simulates changes in edge node computing power. We randomly trigger Pod rescheduling, which changes service instance IPs and physical locations.
- Elastic Scaling of Microservice Instances: Based on simulated traffic changes, the number of microservice instances dynamically adjusts between 1 and 5. This makes the service call dependency graph change in real time with t. Therefore, ST-GraphRCA must accurately track fault propagation even as the topology changes, with edges connecting and disconnecting.
- Mulan [41]: Mulan is an offline diagnosis method for microservice systems. It uses a log-specific large language model (LLM) to extract semantic features from logs. Through contrastive learning, these features are aligned with structured metric data in a unified latent space. This multimodal integration supports the construction of a service causal graph. However, it incurs greater computational overhead.
- InstantOps [42]: A defensive operations framework that integrates fault prediction with diagnosis through a unified representation fusing logs, metrics, and Kubernetes native events. A graph neural network captures the service call topology, a gated recurrent unit models time-series changes, and a permutation test measures each node's contribution to faults in order to locate the root cause.
- Chain-of-Event [43]: An event-graph-based reasoning model that abstracts all system anomalies as fine-grained events, learns transition probabilities between events from historical fault data, and builds a weighted event causal graph. It simulates the diagnosis process of SRE experts by searching the graph for the most probable fault propagation path, emphasizing result readability and expert trust.
- OCEAN [44]: A recent online algorithm designed for real-time data streams rather than offline batch processing. Dilated convolutional neural networks capture long-term history, a graph neural network updates the causal structure in real time, and a multi-factor attention mechanism adjusts the weights of logs and metrics based on data quality, addressing dynamic causal drift in microservice systems. This method represents a frontier direction in online diagnosis.
- Trace Tradition [45]: A set of traditional IT operations methods applied when system or service faults occur: analyzing logs, monitoring system performance, checking configurations, and carrying out related investigation activities to find the root cause.
4.2. Experimental Comparative Analysis
- Dataset A@1 (Mainly Injects Data Skew): Simulates computation skew problems caused by uneven IoT data distribution. This reflects real scenarios in heterogeneous IoT environments where vast differences in reporting frequencies among different sensor devices lead to severe load imbalances among edge computing nodes.
- Dataset A@2 (Mainly Injects CPU Resource Exhaustion): Simulates edge gateway overload caused by high-frequency concurrent computing. By injecting infinite loop invocations into Executor containers, this reproduces system paralysis scenarios caused by the exhaustion of computing resources when edge nodes process massive concurrent sensor data.
- Dataset A@3 (Mainly Injects Application Logic Errors): Simulates edge application logic crashes or data validation failures. By injecting erroneous data substitution at the Job/Application layer, this reproduces task execution failures on the edge side caused by dirty sensor data input or algorithmic logic defects.
- Dataset A@4 (Mainly Injects Memory Leak/Overflow): Simulates memory overflow caused by sensor data bursts. By injecting list duplication into computing tasks, this reproduces memory crash scenarios in resource-constrained edge devices when coping with sudden data floods.
- Dataset A@5 (Mainly Injects Network Latency): Simulates inter-device communication jitter in weak-network edge environments. By injecting API delays at the Pod network layer, this reproduces data transmission blocking and timeout issues between edge gateways and terminal devices under unstable network environments (this constitutes the core scenario for validating the DTW temporal alignment capability).
- Mulan: Although it exhibited extremely strong semantic understanding capabilities for logical faults such as A@3 (0.90) by utilizing a Large Language Model (LLM), its performance suffered a severe decline in the weak network environment of A@5 (0.65). This is attributed to its significant offline batch processing latency and computational overhead. Mulan relies on contrastive learning to construct causal graphs; the inference process is time-consuming and requires batch data. Consequently, it cannot achieve real-time response in edge scenarios with frequent network jitter, resulting in a capability to capture transient faults that is substantially weaker than the lightweight ST-GraphRCA.
- InstantOps: As a predictive model based on GNNs and GRUs, it performs excellently in conventional scenarios such as A@1 (0.95). However, its accuracy in A@5 (0.58) is even lower than that of certain traditional methods. This is attributed to its strong dependency on full Trace data. In edge environments characterized by weak networks and high packet loss, Trace links experience severe breakage, causing the spatial topological structure upon which the GNN relies to fragment. The model fails to effectively aggregate neighbor node information, thereby losing the topological basis for root cause inference.
- Chain-of-Event (CoE): Its explainable reasoning based on event graphs holds significant advantages in complex business logic faults such as A@3 (0.91). However, it performs mediocrely in terms of overall F1-Score (0.75) and in resource scenarios like A@2 (0.82). The core reasons are its cold start problem and dependency on historical data. Edge environment devices are heterogeneous and fault patterns are variable; it is difficult for CoE to cover all causal patterns using limited historical data. Consequently, when facing unseen resource contention or environmental interference, causal graph paths are missing, making it difficult to localize the root cause.
- OCEAN: Although OCEAN is designed as an online algorithm and performs excellently in long-term memory leak detection (A@4, 0.92), its accuracy is lower in CPU overload scenarios (A@2, 0.68) due to model complexity and resource contention. The multi-factor attention and dynamic graph updating in OCEAN pose a significant computational burden, constituting a major source of its overhead. In situations where the edge gateway CPU is already overloaded, the complex diagnostic model contends for resources with business processes. This not only causes slow operation but may also render the model completely ineffective due to termination by the system OOM (Out of Memory) Killer.
- Trace Tradition: The traditional Pearson Correlation Coefficient method achieves an F1-Score of only 0.24. This shows that simple statistical linear correlation analysis fails completely when facing complex non-linear cascading faults in microservices, especially on the edge side where trace data is incomplete.
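The failure of linear correlation under transmission-induced time shifts is easy to demonstrate. The sketch below (illustrative signals, not values from the dataset) computes a plain Pearson coefficient between a fault spike and the same spike observed two samples later at another node:

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# The same fault spike, arriving two sampling steps later at a second
# node due to network lag (illustrative, not traced from Dataset A).
x = [0, 0, 0, 1, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 1, 0, 0]
r = pearson(x, y)   # near zero despite identical waveforms
```

Although the two series describe the same event, the pointwise correlation is close to zero (here slightly negative), so a correlation-based tracer sees no relation at all; this is exactly the misalignment that DTW-style elastic matching is meant to absorb.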
4.3. Ablation Study
- w/o PCA-DTW: The PCA-DTW alignment module was removed; each node's deviation was computed directly against its reference vector without alignment. This variant tests the need for time-series alignment in weak edge networks.
- w/o Net-Flow: The flow-conservation-based net anomaly outflow metric was removed; only raw anomaly scores were used to build the forward-propagation graph for ranking. This variant tests whether the flow-conservation rule helps distinguish active root causes from passive victims.
- w/o Topo-Constraint: The topology reachability pruning mask in the mining algorithm was removed, and pattern mining was carried out under an unconstrained, fully connected assumption. This variant tests whether causal-subspace constraints help against noise.
- PCA-DTW is robust against network jitter: In Dataset A@5 (Network Latency), a clear drop was observed after the PCA-DTW module was removed: the F1-score fell from 0.89 to 0.61. This shows that under the random jitter and packet loss of edge networks, standard distance measures cannot handle nonlinear time misalignment and produce many false alarms, whereas PCA-DTW restores the true data relations through semantic grouping and adaptive alignment.
- Flow conservation reduces cascade fault errors: Dataset A@2 (CPU Exhaustion) simulates cascade overload caused by high concurrency. After the net outflow mechanism was removed (Net-Flow), performance dropped to an F1-score of 0.72. The main reason was the failure to separate root nodes from victim nodes in cascade faults. True root nodes slow down due to their own compute overload. Downstream nodes remain normal but become blocked while waiting for upstream responses. Both show high anomaly values in metrics. Without flow direction, root sources and passive victims cannot be separated. Many victim nodes are ranked as root causes. After flow conservation was added, net outflow values enabled clear separation.
- Topology constraints improve noise resistance: The Topo-Constraint variant showed acceptable results in simple cases such as A@1. However, the overall average score was lower than the full model (0.82 vs. 0.89). Without topology reachability limits, unrelated nodes were grouped into the same fault pattern. Non-causal noise was introduced. Top-k ranking accuracy was reduced.
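To make the PCA-DTW ablation result concrete, the sketch below implements a minimal dynamic-programming DTW with asymmetric step penalties. The penalty values 0.45 and 0.57 echo the insert/delete factors in the parameter table, but the exact way they enter the recursion here is an assumption for illustration, not the paper's formula:

```python
def dtw_distance(a, b, p_ins=0.45, p_del=0.57):
    """DTW with asymmetric step penalties: p_ins discounts vertical
    steps (insertions absorbing network-induced lag), p_del discounts
    horizontal steps (deletions compressing computation-induced loss).
    How the penalties enter the recursion is an illustrative choice."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(
                D[i - 1][j - 1],        # match
                D[i - 1][j] + p_ins,    # insertion step
                D[i][j - 1] + p_del,    # deletion step
            )
    return D[n][m]

# A CPU spike and its copy delayed by one sampling step.
a = [0, 0, 1, 3, 1, 0, 0]
b = [0, 1, 3, 1, 0, 0, 0]
d_euclid = sum(abs(x - y) for x, y in zip(a, b))  # pointwise distance: 6
d_dtw = dtw_distance(a, b)                        # far smaller after warping
```

The pointwise (Euclidean-style) distance between the two identical-but-shifted spikes is 6, while the warped distance is a fraction of that: exactly the false-alarm gap the ablation exposes on A@5.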
4.4. Efficiency Analysis
- Mulan: Latencies consistently exceed 1500 ms, a limitation inherent to its offline batch processing architecture. The computational burden of log embedding and contrastive learning in Large Language Models (LLMs) creates significant inference delays, rendering the model unsuitable for millisecond-level real-time demands.
- OCEAN: Despite being an online algorithm, OCEAN experiences a sharp latency spike to 1850 ms during CPU Overload (A@2) scenarios. This corroborates earlier analysis: on edge gateways with limited hardware, OCEAN’s complex dilated convolutions and multi-factor attention mechanisms trigger severe resource contention. Consequently, the diagnostic process itself stalls, leading to a drastic drop in throughput.
- InstantOps & Chain-of-Event: These models fall into the intermediate range of 600–900 ms. While faster than legacy trace methods, they remain 2–3× slower than ST-GraphRCA due to the computational overhead associated with GNN graph traversal and event chain searching.
- Trace Tradition: Exhibiting the highest latency (>2000 ms), this approach scales linearly with fault complexity, effectively confirming that full trace aggregation is computationally infeasible at the edge.
4.5. Parameter Sensitivity Analysis
4.5.1. Impact of DTW Penalty Factors on Performance
4.5.2. Impact of Graph Thresholds on Performance
- Impact of the Safety Threshold. The red solid line (lower x-axis) shows the trend as the safety threshold varies. The model's F1-Score is best (0.89) when the threshold is set to 0.8. Below 0.7, the model cannot distinguish active fault sources from passive victims, since both show positive outflow in this case, which raises the false positive rate. Above 0.9, the judgment is too strict and some weaker, early-stage faults are missed. The experiment shows that a safety threshold around 0.8 best separates active fault sources from passive victims.
- Impact of the Preset Utility Threshold. The blue dashed line (upper x-axis) shows the effect of varying the preset utility threshold in high-utility mining from 100 to 350. The F1-Score peaks at a threshold of 214. If the threshold is too low (below 150), pruning is not strong enough: many low-utility noise items (such as occasional non-causal calls) remain in the candidate set, disturbing the Top-k root-cause ranking and lowering precision. If the threshold is too high (above 250), pruning is too aggressive: some real fault patterns that are low-frequency but high-risk are wrongly removed, causing a clear drop in recall. The experiment shows that a threshold around 214 best balances pruning efficiency against the retention of fault patterns.
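The separation of active fault sources from passive victims discussed above can be sketched with a net-outflow calculation. The per-edge flow heuristic (`min` of the endpoint anomalies) and the normalization against which the 0.8 safety threshold is applied are both assumptions for illustration, not the paper's exact formulas:

```python
def net_outflow(edges, anomaly):
    """Net anomaly outflow per node on the forward propagation graph.
    Flow on edge (u, v) is approximated by min(anomaly[u], anomaly[v]);
    this heuristic is illustrative, not the paper's flow model."""
    out = {v: 0.0 for v in anomaly}
    inn = {v: 0.0 for v in anomaly}
    for u, v in edges:
        f = min(anomaly[u], anomaly[v])
        out[u] += f
        inn[v] += f
    return {v: out[v] - inn[v] for v in anomaly}

def active_roots(edges, anomaly, safety=0.8):
    """Flag nodes whose net outflow, normalized by the largest outflow,
    clears the safety threshold (assumed normalization)."""
    net = net_outflow(edges, anomaly)
    peak = max(net.values()) or 1.0
    return [v for v, x in net.items() if x / peak >= safety]

# Cascade: overloaded gateway A blocks B, which in turn blocks C.
edges = [("A", "B"), ("B", "C")]
anomaly = {"A": 0.9, "B": 0.8, "C": 0.7}
roots = active_roots(edges, anomaly)   # only A clears the threshold
```

All three nodes look anomalous in raw scores, but only A exports more anomaly than it receives; B and C have negative net outflow and are correctly treated as victims, matching the behavior the ablation attributes to the flow-conservation rule.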
4.5.3. Impact of PCA Component Selection (k) on Performance
5. Deployment Analysis for Application Scenarios
- Industrial IoT and Manufacturing Gateways: In smart factory production lines, edge gateways need to process real-time data from many PLCs and sensors. When a fault happens due to network issues or overload, ST-GraphRCA can be deployed directly on industrial edge gateways. Its low inference delay lets it quickly find the root cause before the fault spreads. This helps reduce unplanned downtime and keeps production running.
- Smart City Traffic Sensing Networks: In smart traffic systems, roadside units analyze vehicle and infrastructure data in real time. Urban edge networks often have random delays and packet loss. ST-GraphRCA uses its PCA-DTW elastic alignment to handle timing sync problems well. This allows accurate fault tracing across different monitoring nodes. As a result, traffic signal control systems remain stable.
- Remote Environment Monitoring and Resource-Limited Gateways: For sensor networks in remote or power-limited areas, ST-GraphRCA is very lightweight and requires only modest compute to run. Relying mainly on the first principal component keeps its computation simple, making it possible to embed the model into tiny edge terminals with very few resources. These terminals can then perform local, self-contained anomaly diagnosis.
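The first-principal-component reliance mentioned above can be sketched in a few lines of numpy. This helper and its example window are illustrative (the function name and data are hypothetical, and deviation-from-projection is an assumed scoring rule), but they show why keeping only PC1 (k = 1, as in the parameter table) is cheap enough for tiny terminals:

```python
import numpy as np

def pc1_deviation(window):
    """Deviation of each sample along the first principal component.
    window: (n_samples, n_metrics) array of recent edge metrics.
    A single SVD on a small window is the only non-trivial cost;
    this is a sketch, not the paper's implementation."""
    X = window - window.mean(axis=0)                  # centre the window
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # PC1 = first row of V^T
    return np.abs(X @ vt[0])          # |projection| = deviation from centre

# Three normal samples and one anomalous burst (illustrative values).
window = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [5.0, 6.0]])
dev = pc1_deviation(window)           # largest deviation at the burst row
```

On a 60-sample window with a handful of metrics, this is a single small SVD per update, which is why a PC1-only scorer fits on resource-constrained gateways.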
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ramya, R. Analysis and applications finding of wireless sensors and IoT devices with artificial intelligence/machine learning. In AIoT and Smart Sensing Technologies for Smart Devices; IGI Global: Hershey, PA, USA, 2024; pp. 77–102. [Google Scholar]
- Savaglio, C.; Mazzei, P.; Fortino, G. Edge intelligence for industrial IoT: Opportunities and limitations. Procedia Comput. Sci. 2024, 232, 397–405. [Google Scholar] [CrossRef]
- Bitam, T.; Yahiaoui, A.; Boubiche, D.E.; Martínez-Peláez, R.; Toral-Cruz, H.; Velarde-Alvarado, P. Artificial Intelligence of Things for Next-Generation Predictive Maintenance. Sensors 2025, 25, 7636. [Google Scholar] [CrossRef] [PubMed]
- Patel, Y.S.; Townend, P.; Singh, A.; Östberg, P.O. Modeling the Green Cloud Continuum: Integrating energy considerations into Cloud–Edge models. Clust. Comput. 2024, 27, 4095–4125. [Google Scholar] [CrossRef]
- Chen, Y.; Wu, C.; Zhang, F.; Lu, C.; Huang, Y.; Lu, H. Topology-aware Microservice Architecture in Edge Networks: Deployment Optimization and Implementation. IEEE Trans. Mob. Comput. 2025, 24, 6090–6105. [Google Scholar] [CrossRef]
- Faseeha, U.; Syed, H.J.; Samad, F.; Zehra, S.; Ahmed, H. Observability in Microservices: An In-Depth Exploration of Frameworks, Challenges, and Deployment Paradigms. IEEE Access 2025, 13, 72011–72039. [Google Scholar] [CrossRef]
- Santos-Fernandez, E.; Hoef, J.M.V.; Peterson, E.E.; McGree, J.; Villa, C.A.; Leigh, C.; Turner, R.; Roberts, C.; Mengersen, K. Unsupervised anomaly detection in spatio-temporal stream network sensor data. Water Resour. Res. 2024, 60, e2023WR035707. [Google Scholar] [CrossRef]
- Acquaah, Y.T.; Kaushik, R. Normal-only Anomaly detection in environmental sensors in CPS: A comprehensive review. IEEE Access 2024, 12, 191086–191107. [Google Scholar] [CrossRef]
- Shankar, V. Edge AI: A comprehensive survey of technologies, applications, and challenges. In Proceedings of the 2024 1st International Conference on Advanced Computing and Emerging Technologies (ACET), Ghaziabad, India, 23–24 August 2024; pp. 1–6. [Google Scholar]
- Pham, L.; Ha, H.; Zhang, H. Root cause analysis for microservice system based on causal inference: How far are we? In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), Sacramento, CA, USA, 27 October–1 November 2024; pp. 706–715. [Google Scholar]
- Harada, K.; Fujisawa, H. Sparse estimation of Linear Non-Gaussian Acyclic Model for Causal Discovery. Neurocomputing 2021, 459, 223–233. [Google Scholar] [CrossRef]
- Gad, I. TOCA-IoT: Threshold Optimization and Causal Analysis for IoT Network Anomaly Detection Based on Explainable Random Forest. Algorithms 2025, 18, 117. [Google Scholar] [CrossRef]
- Yu, G.; Chen, P.; Li, Y.; Chen, H.; Li, X.; Zheng, Z. Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), San Francisco, CA, USA, 3–9 December 2023; pp. 553–565. [Google Scholar]
- Ikram, A.; Chakraborty, S.; Mitra, S.; Saini, S.; Bagchi, S.; Kocaoglu, M. Root cause analysis of failures in microservices through causal discovery. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 31158–31170. [Google Scholar]
- Li, M.; Li, Z.; Yin, K.; Nie, X.; Zhang, W.; Sui, K.; Pei, D. Causal inference-based root cause analysis for online service systems with intervention recognition. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 3230–3240. [Google Scholar]
- Budhathoki, K.; Minorics, L.; Blöbaum, P.; Janzing, D. Causal structure-based root cause analysis of outliers. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 2357–2369. [Google Scholar]
- Orchard, W.R.; Okati, N.; Mejia, S.H.G.; Blöbaum, P.; Janzing, D. Root Cause Analysis of Outliers with Missing Structural Knowledge. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), San Diego, CA, USA, 30 November–7 December 2025. [Google Scholar]
- Tang, L.; Kou, E.; Wang, W.; Chen, Q. A Root Cause Analysis Framework for IoT Based on Dynamic Causal Graphs Assisted by LLMs. IEEE Internet Things J. 2025, 12, 34563–34581. [Google Scholar] [CrossRef]
- Zhang, X.; Ghosh, S.; Bansal, C.; Wang, R.; Ma, M.; Kang, Y.; Rajmohan, S. Automated root causing of cloud incidents using in-context learning with GPT-4. In Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE), Porto de Galinhas, Brazil, 15–19 July 2024; pp. 266–277. [Google Scholar]
- Ezukwoke, K.; Hoayek, A.; Batton-Hubert, M.; Boucher, X.; Gounet, P.; Adrian, J. Big GCVAE: Decision-making with adaptive transformer model for failure root cause analysis in semiconductor industry. J. Intell. Manuf. 2025, 36, 2423–2438. [Google Scholar] [CrossRef]
- Li, Y.; Wu, Y.; Liu, J.; Jiang, Z.; Chen, Z.; Yu, G.; Lyu, M.R. COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge. arXiv 2025, arXiv:2503.23051. [Google Scholar] [CrossRef]
- Szandała, T. AIOps for Reliability: Evaluating Large Language Models for Automated Root Cause Analysis in Chaos Engineering. In Proceedings of the International Conference on Computational Science, Malaga, Spain, 2–4 July 2024; Springer: Cham, Switzerland, 2024; pp. 323–336. [Google Scholar]
- Goel, D.; Magazine, R.; Ghosh, S.; Nambi, A.; Deshpande, P.; Zhang, X.; Bansal, C.; Rajmohan, S. eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization. arXiv 2025, arXiv:2504.11505. [Google Scholar] [CrossRef]
- Xu, J.; Zhang, Q.; Zhong, Z.; He, S.; Zhang, C.; Lin, Q.; Pei, D.; He, P.; Zhang, D.; Zhang, Q. OpenRCA: Can large language models locate the root cause of software failures? In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025. [Google Scholar]
- Roy, D.; Zhang, X.; Bhave, R.; Bansal, C.; Las-Casas, P.; Fonseca, R.; Rajmohan, S. Exploring llm-based agents for root cause analysis. In Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE), Porto de Galinhas, Brazil, 15–19 July 2024; pp. 208–219. [Google Scholar]
- Zhu, B. TraceLM: Temporal Root-Cause Analysis with Contextual Embedding Language Models. In Proceedings of the 2024 6th International Conference on Electronic Communication and Artificial Intelligence (ICECAI), Changsha, China, 12–14 July 2025; pp. 855–858. [Google Scholar]
- Xiang, Y.; Chen, C.P.; Zeng, L.; Yin, W.; Liu, X.; Li, H.; Xu, W. Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM. arXiv 2025, arXiv:2506.02490. [Google Scholar] [CrossRef]
- Qiao, H.; Tong, H.; An, B.; King, I.; Aggarwal, C.; Pang, G. Deep graph anomaly detection: A survey and new perspectives. IEEE Trans. Knowl. Data Eng. 2025, 37, 5106–5126. [Google Scholar] [CrossRef]
- Alsalman, D. A Comparative Study of Anomaly Detection Techniques for IoT Security Using Adaptive Machine Learning for IoT Threats. IEEE Access 2024, 12, 14719–14730. [Google Scholar] [CrossRef]
- Lin, C.M.; Chang, C.; Wang, W.Y.; Wang, K.D.; Peng, W.C. Root cause analysis in microservice using neural granger causal discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 206–213. [Google Scholar]
- Deng, A.; Hooi, B. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 4027–4035. [Google Scholar]
- Steenwinckel, B.; De Paepe, D.; Hautte, S.V.; Heyvaert, P.; Bentefrit, M.; Moens, P.; Dimou, A.; Van Den Bossche, B.; De Turck, F.; Van Hoecke, S.; et al. FLAGS: A methodology for adaptive anomaly detection and root cause analysis on sensor data streams by fusing expert knowledge with machine learning. Future Gener. Comput. Syst. 2021, 116, 30–48. [Google Scholar] [CrossRef]
- Klein, P.; Malburg, L.; Bergmann, R. Combining informed data-driven anomaly detection with knowledge graphs for root cause analysis in predictive maintenance. Eng. Appl. Artif. Intell. 2025, 145, 110152. [Google Scholar] [CrossRef]
- Nadim, K.; Ragab, A.; Ouali, M.S. Data-driven dynamic causality analysis of industrial systems using interpretable machine learning and process mining. J. Intell. Manuf. 2023, 34, 57–83. [Google Scholar] [CrossRef]
- Simsek, M.U.; Okay, F.Y.; Ozdemir, S. A deep learning based CEP rule extraction framework for IoT data. J. Supercomput. 2021, 77, 8563–8592. [Google Scholar] [CrossRef]
- Yu, W.; Zhang, J.; Liu, L.; Liu, Y.; Zhai, X.; Howlader, R.K. A Distributed Data-Driven and Machine Learning Method for High-Level Causal Analysis in Sustainable IoT Systems. IEEE Trans. Sustain. Comput. 2024, 10, 274–286. [Google Scholar] [CrossRef]
- Zida, S.; Fournier-Viger, P.; Lin, J.C.W.; Wu, C.W.; Tseng, V.S. EFIM: A highly efficient algorithm for high-utility itemset mining. In Proceedings of the Mexican International Conference on Artificial Intelligence, Cuernavaca, Mexico, 25–31 October 2015; pp. 530–546. [Google Scholar]
- Tan, J.; Yu, H.; Huang, J.; Xiao, J.; Zhao, F. Freepca: Integrating consistency information across long-short frames in training-free long video generation via principal component analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 27979–27988. [Google Scholar]
- Salimi, M. Vibration-based Damage Detection and Localization in Pipelines Using Data Analysis. Ph.D. Thesis, Concordia University, Montreal, QC, Canada, 2024. [Google Scholar]
- Hu, C.; Chen, Z.; Li, Y.; Yin, X. Performance degradation assessment of rolling bearing under vibration signal monitoring based on optimized variational mode decomposition and improved fuzzy support vector data description. J. Appl. Phys. 2024, 135, 221102. [Google Scholar] [CrossRef]
- Zheng, L.; Chen, Z.; He, J.; Chen, H. MULAN: Multi-modal causal structure learning and root cause analysis for microservice systems. In Proceedings of the ACM Web Conference 2024 (WWW ‘24), Singapore, 13–17 May 2024; pp. 4107–4116. [Google Scholar]
- Rouf, R.; Rasolroveicy, M.; Litoiu, M.; Nagar, S.; Mohapatra, P.; Gupta, P.; Watts, I. InstantOps: A joint approach to system failure prediction and root cause identification in microservices cloud-native applications. In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering (ICPE), London, UK, 7–11 May 2024; pp. 119–129. [Google Scholar]
- Yao, Z.; Pei, C.; Chen, W.; Wang, H.; Su, L.; Jiang, H.; Xie, Z.; Nie, X.; Pei, D. Chain-of-event: Interpretable root cause analysis for microservices through automatically learning weighted event causal graph. In Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE), Porto de Galinhas, Brazil, 15–19 July 2024; pp. 50–61. [Google Scholar]
- Zheng, L.; Chen, Z.; Chen, H.; He, J. OCEAN: Online Multi-modal Root Cause Analysis for Microservice Systems. arXiv 2024, arXiv:2410.10021. [Google Scholar]
- Liu, P.; Chen, Y.; Nie, X.; Zhu, J.; Zhang, S.; Sui, K.; Pei, D. Fluxrank: A widely-deployable framework to automatically localizing root cause machines for software service failure mitigation. In Proceedings of the 30th International Symposium on Software Reliability Engineering (ISSRE), Berlin, Germany, 28–31 October 2019; pp. 35–46. [Google Scholar]










| Category | Item | Specification |
|---|---|---|
| Hardware | Master Node (Edge Server) | Intel Xeon Gold 6248R @ 3.00 GHz, 64 GB RAM, 1 TB SSD |
| | Worker Nodes (×32) | Simulated via KVM: 2 vCPU, 4 GB RAM, 50 GB Disk |
| | Network | 1 Gbps LAN, bandwidth restricted to 100 Mbps via tc to mimic edge constraints |
| Software | Operating System | Ubuntu 20.04.6 LTS (Kernel 5.4.0) |
| | Orchestration | Kubernetes v1.24.0, Docker v20.10.12 |
| | Monitoring | Prometheus v2.35 (Metric Collection), Jaeger v1.34 (Trace Collection) |
| Implementation | Programming Language | Python 3.9.12 |
| | Key Libraries | scikit-learn 1.1.1 (PCA), dtaidistance 2.3.10 (DTW), networkx 2.8 (Graph), numpy 1.21.5 |
| Module | Parameter | Symbol | Value | Description |
|---|---|---|---|---|
| 3.2 | Sliding Window Size | | 60 s | Time steps for the sliding window. |
| | Penalty Factor (Insert) | | 0.45 | Reduced penalty for network-induced lag. |
| | Penalty Factor (Delete) | | 0.57 | Reduced penalty for computation-induced loss. |
| | PCA Components | k | 1 | Number of principal components retained. |
| 3.3 | Smoothing Factor | | 0.02 | Exponential moving average factor for weight updates. |
| | Safety Threshold | | 0.8 | Threshold for identifying active root causes. |
| | Abnormal Threshold | | 0.42 | Threshold for anomalies spreading downstream. |
| 3.4 | Minimum External Utility Threshold | | 0.05 | Prevents excessively small utility values and suppresses interference from non-causal noise. |
| | Utility Threshold | | 214 | Preset utility threshold. |
| Baselines | Learning Rate | | 1 × 10−3 | Applied to DL-based baselines (Mulan, InstantOps, OCEAN). |
| | Batch Size | B | 32 | Mini-batch size used for training baseline models. |
| | Training Epochs | E | 100 | Maximum training epochs with early stopping for baselines. |
| Fault Type | Injection Target | Implementation Method | IoT Scenario Mapping |
|---|---|---|---|
| CPU Exhaustion | Executor Container | Infinite loop invocation | Edge gateway overload caused by high-frequency concurrent computing |
| Memory Leak | Computing Task | List duplication | Sensor data bursts leading to Out of Memory (OOM) |
| Network Latency | PodNetwork Layer | API delay | Inter-device communication jitter under weak network conditions |
| Logic Error | Job/Application | Erroneous data substitution | Edge application logic crash or data validation failure |
| Algorithm | A@1 | A@2 | A@3 | A@4 | A@5 | F1-Score | MAR |
|---|---|---|---|---|---|---|---|
| ST-GraphRCA | 0.86 | 0.96 | 0.79 | 0.84 | 0.89 | 0.89 | 1.41 |
| Mulan | 0.92 | 0.88 | 0.9 | 0.82 | 0.65 | 0.81 | 1.55 |
| InstantOps | 0.95 | 0.91 | 0.85 | 0.89 | 0.58 | 0.78 | 1.82 |
| Chain-of-Event | 0.88 | 0.82 | 0.91 | 0.76 | 0.72 | 0.75 | 1.95 |
| OCEAN | 0.94 | 0.68 | 0.88 | 0.92 | 0.85 | 0.72 | 2.10 |
| Trace Tradition | 0.35 | 0.45 | 0.25 | 0.35 | 0.55 | 0.24 | 4.55 |
| Algorithm | A@1 | A@2 | A@3 | A@4 | A@5 | F1-Score |
|---|---|---|---|---|---|---|
| ST-GraphRCA | 0.86 | 0.96 | 0.79 | 0.84 | 0.89 | 0.89 |
| w/o PCA-DTW | 0.84 | 0.93 | 0.77 | 0.82 | 0.61 ⬇ | 0.79 |
| w/o Net-Flow | 0.81 | 0.72 ⬇ | 0.75 | 0.78 | 0.80 | 0.77 |
| w/o Topo-Constraint | 0.80 | 0.88 | 0.74 | 0.80 | 0.85 | 0.82 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Su, T.; Mo, R.; Gong, Y.; Wang, H. ST-GraphRCA: A Root Cause Analysis Model for Spatio-Temporal Graph Propagation in IoT Edge Computing. Sensors 2026, 26, 1474. https://doi.org/10.3390/s26051474

