DTS-MixNet: Dynamic Spatiotemporal Graph Mixed Network for Anomaly Detection in Multivariate Time Series

Tan, Chengxun; Hu, Jiayi; Li, Jian; Miao, Minmin; Hu, Wenjun; Wang, Shitong

doi:10.3390/bdcc9100245


by Chengxun Tan ¹, Jiayi Hu ², Jian Li ¹, Minmin Miao ¹,³,⁴, Wenjun Hu ¹,³,⁴,* and Shitong Wang ⁵

¹ School of Information Engineering, Huzhou University, Huzhou 313000, China

² School of Information Engineering, China University of Geosciences, Beijing 100083, China

³ Zhejiang-France Joint Laboratory for Digital Monitoring of Aquatic Resources and the Environment, Huzhou 313000, China

⁴ Huzhou Key Laboratory of Waters Robotics Technology, Huzhou University, Huzhou 313000, China

⁵ School of Artificial Intelligence & Computer Science, Jiangnan University, Wuxi 214122, China

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2025, 9(10), 245; https://doi.org/10.3390/bdcc9100245

Submission received: 25 July 2025 / Revised: 6 September 2025 / Accepted: 24 September 2025 / Published: 25 September 2025


Abstract

Anomaly detection in multivariate time series (MTS) remains challenging due to the presence of complex and dynamic spatiotemporal dependencies. To address this, we propose the Dynamic Spatiotemporal Graph Mixed Network (DTS-MixNet), which takes the data of a sliding window as input to predict the data at the next time step and determine its state. The model comprises five blocks. The Temporal Graph Structure Learner (TGSL) generates attention-weighted graphs from two types of neighbor relationships and multi-head-attention-based neighbor degrees. Then, the Cross-Temporal Dynamic Encoder (CTDE) aggregates the cross-temporal dependencies from the attention-weighted graphs and encodes them into a proxy multivariate sequence (PMS), which is fed into the proposed Cross-Variable Dynamic Encoder (CVDE). Subsequently, the CVDE captures the spatial relationships among sensors through multiple local spatial graphs and a global spatial graph, and produces a spatial graph sequence (SGS). Finally, the Spatiotemporal Mixer (TSM) mixes the PMS and the SGS to build a spatiotemporal mixed sequence (TSMS) for downstream tasks, e.g., classification or prediction. We evaluate on two industrial control datasets and discuss applicability to non-industrial multivariate time series. The experimental results on these benchmark datasets are encouraging for the proposed DTS-MixNet.

Keywords: multivariate time series; anomaly detection; graph neural networks; spatiotemporal dependencies; dynamic graph learning

1. Introduction

In the Industry 4.0 era, the complexity and criticality of infrastructures, such as industrial equipment, IoT systems, and transportation networks, have significantly increased. These infrastructures commonly generate vast amounts of multivariate time series (MTS) data, characterized by nonlinearity, nonstationarity, and inter-sensor coupling (also called inter-variable coupling). For example, steel production involves various types of sensors that measure temperature, density, oxygen consumption, pressure, and so on, and a change in one sensor’s measured value may propagate to others, leading to correlated anomalies. Hence, highly reliable system monitoring is essential.

Multivariate time series anomaly detection (AD-MTS), which aims to identify potential abnormal behaviors or time points requiring corrective actions, is thus a critical technique for ensuring system safety. Unlike traditional anomaly detection methods designed for static data, e.g., statistical [1,2,3] or distance-based [4,5] approaches, AD-MTS must address the inherent temporal dynamics and complex inter-variable interactions, which are together called spatiotemporal dependencies. To illustrate these dependencies, we employ the Secure Water Treatment (SWaT) dataset [6], an industrial control system benchmark in which each variable is denoted by a sensor tag (e.g., LIT101, a water level indicator in Tank 1; P102, a pump; AIT201, an analyzer; MV101, a motorized valve). As shown in Figure 1, anomalies are highlighted in red: (1) Temporal dependencies (including lag effects): the impact of anomalies can propagate over time, showing sequential and lagged effects, e.g., an initial change in LIT101 followed by deviations in P102 and AIT201. (2) Inter-sensor (spatial) dependencies (coupling): anomalies in certain sensors can trigger cascading effects in related sensors, e.g., a change in LIT101 leading to anomalies in MV101 and P101. Capturing such spatiotemporal dependencies is important for AD-MTS.

Recent advances in deep learning have shown promise in AD-MTS. Convolutional Neural Networks (CNNs) and Stacked Autoencoders (SAEs) [7,8,9] excel at extracting complex and nonlinear features. Generative Adversarial Networks (GANs) [10] can learn the distribution of non-anomalous data, and Long Short-Term Memory networks (LSTMs) [11] are widely used to capture long-range temporal dependencies. However, these methods largely focus on the internal patterns of individual time series and fail to adequately capture interactions among multiple series. To model inter-variable relationships, Graph Neural Networks (GNNs) have been introduced to AD-MTS. For instance, GDN [12] uses GNNs to predict non-anomalous behavior and detect deviations, while GReLeN [13] and DyGraphAD [14] explore learning dynamic graph structures to capture time-varying relationships. Despite the potential of GNN-based methods for modeling inter-sensor relationships, existing approaches still struggle to simultaneously and dynamically capture spatiotemporal dependencies. Specifically, most approaches lack the capability to dynamically capture evolving dependencies and over-time interaction among sensors, which are critical for identifying anomalies driven by complex coupling effects.

To address the aforementioned challenge, this work proposes DTS-MixNet for multivariate time series anomaly detection. It aims to adaptively learn the patterns that evolve along the temporal dimension and to capture the spatial relationships among sensors along the variable dimension; the two are then fused to improve the accuracy and robustness of anomaly detection. The architecture of the proposed DTS-MixNet, which contains five parts, is presented in Figure 2. The main contributions are summarized as follows:

We propose a novel deep learning architecture, called DTS-MixNet, for multivariate time series anomaly detection. It integrates dynamic graph learning across both the temporal dimension and the spatial (variable) dimension.
The Temporal Graph Structure Learner (TGSL) discovers evolving temporal dependencies by identifying across-time and adjacent-time neighbors and assigning attention-based edge weights via multi-head attention.
The framework further includes the Cross-Temporal Dynamic Encoder (CTDE) and the Cross-Variable Dynamic Encoder (CVDE). The CTDE aggregates temporal dependencies within the sliding window to generate a proxy multivariate sequence (PMS), whereas the CVDE integrates both local and global inter-sensor dependencies to construct a spatial graph sequence (SGS).
The Spatiotemporal Mixer (TSM) is designed to fuse the learned dynamic temporal features and the dynamic spatial graph structures. This enables the model to capture higher-order interaction patterns across both time and variables and leads to a more comprehensive representation for downstream tasks.

The rest of this article is organized as follows. Section 2 surveys related methods for time series anomaly detection. The proposed DTS-MixNet is introduced in Section 3. Section 4 discusses anomaly scores and complexity analysis. Experimental results are presented in Section 5. Finally, the conclusion is given in Section 6.

2. Related Works

In this section, we first review the current state of research in anomaly detection and analyze the developments in multivariate time series processing techniques. Since our work relies on Graph Neural Networks (GNNs), we also provide a summary related to GNNs.

2.1. Anomaly Detection

In the field of data-driven anomaly detection and diagnostics, Multivariate Statistical Process Monitoring (MSPM) methods have garnered significant attention and are considered some of the most widely studied and applied techniques. Among these, Principal Component Analysis (PCA), Partial Least Squares (PLS) and Canonical Variable Analysis (CVA) [15,16,17] are widely used for anomaly detection. These techniques are based on linear dimensionality reduction, detecting anomalies by monitoring deviations in key features within the reduced-dimensional space. However, these methods often overlook the nonlinear relationships and temporal dependencies inherent in the data, which may result in a failure to accurately capture potential anomalies in complex dynamic processes.

In addition, deep learning has been widely adopted in anomaly detection due to its powerful nonlinear modeling capabilities. Deep learning models can automatically extract complex features from data and handle large datasets, significantly enhancing the accuracy and robustness of anomaly detection. Notable methods include Autoencoders, Generative Adversarial Networks (GANs) and Convolutional Neural Networks (CNNs) [18,19,20,21]. Autoencoders identify anomalies by reconstructing input data and detecting reconstruction errors. GANs, through adversarial training between a generator and a discriminator, can generate samples that resemble the real data distribution, identifying anomalies by comparing the differences. CNNs excel at handling high-dimensional and spatiotemporal data, significantly improving anomaly detection performance in complex data scenarios. These deep learning approaches have advanced the field by addressing some of the limitations of traditional linear methods, particularly in capturing the intricate nonlinear relationships and temporal dynamics present in modern industrial and sensor-driven data.

2.2. Multivariate Time Series

Multivariate time series signals are sequences of multidimensional vector observations collected over time, generated by numerous physical and virtual sensors. To effectively process such data, researchers have published a large number of studies in recent years. Among traditional machine learning methods, the Autoregressive Model (AR) [22] is a statistical model used for time series analysis. It predicts future values based on past values, assuming that the current value is a linear combination of several previous values. By analyzing historical data and examining its correlation coefficients, an appropriate regression model can be constructed. The Autoregressive Integrated Moving Average Model (ARIMA) [23] is a classical statistical model used for time series forecasting. It predicts future values by modeling the differences between successive observations rather than the raw values themselves. Support Vector Regression (SVR) [24] is a powerful nonlinear regression method that can be used to analyze multivariate time series data. Although SVR was initially designed to solve univariate regression problems, it can be extended to handle multivariate time series data. However, traditional analytical tools like these often struggle to handle complex relationships in time series data, such as nonlinearity and inter-variable dependencies. This complexity can lead to less accurate predictions, especially when the relationships between variables are intricate or when the data exhibits nonlinear behavior. As a result, while these traditional methods have been foundational in time series analysis, they often fail to apply in complex application scenarios.

Due to the excellent performance of artificial neural networks in handling complex feature dependencies and nonlinear relationships, an increasing number of researchers have begun using artificial neural networks for modeling and analyzing multivariate time series. Recurrent Neural Networks (RNNs) [25] are a common approach in multivariate time series analysis, effectively handling time-dependent data by maintaining temporal dependencies through recurrent connections. However, RNNs suffer from the long-term dependency problem, where early information may be forgotten as the number of time steps increases. The advent of deep learning has led to the development of various neural networks based on Convolutional Neural Networks (CNNs) [26,27] and Transformers [28], which have shown significant advantages in modeling real-world time series data. One of the major limitations of these methods is their inability to effectively utilize long-term historical data. To address this issue, researchers have turned to Long Short-Term Memory networks (LSTMs) [29], an improved version of RNNs that introduces gating mechanisms to solve the long-term dependency problem. Gated Recurrent Units (GRUs) [30], a similar recurrent neural network architecture to LSTMs, can also be used to analyze multivariate time series data. GRUs have a simpler structure than LSTMs but still effectively capture long-term dependencies in time series. However, the relationships between variables in multivariate time series often exhibit graph-like structures. The aforementioned methods do not explicitly model the spatial relationships that exist between time series in non-Euclidean spaces, which limits their expressiveness. In real-world applications, the interactions between different time series are often complex and structured, and ignoring these spatial dependencies can result in models that fail to capture the full extent of the data’s underlying patterns. 
To overcome this limitation, recent approaches have incorporated Graph Neural Networks (GNNs) into time series analysis to capture complex graph-structured relationships among variables. This integration allows for more accurate and comprehensive modeling of multivariate time series data, taking into account both temporal dynamics and spatial interactions between variables.

In recent years, Graph Neural Networks (GNNs) [31] have emerged as a powerful tool for learning representations of non-Euclidean data, paving the way for modeling real-world time series data. GNNs can capture various complex relationships, such as inter-variable connections within multivariate sequences and temporal dependencies across time points. Given the inherent spatiotemporal dependencies in real-world scenarios, a series of studies have combined GNNs with temporal modeling frameworks to better capture these dependencies, showing promising results in multivariate time series anomaly detection. Among these, GNN-GRUAD [32], MGUAD [33] and DuoGAT [34] are some of the classic methods used for multivariate time series anomaly detection.

2.3. Graph Neural Network

Due to the widespread presence of graph-structured data in the real world—such as traffic networks, molecular structures, and social networks—there is a need to handle complex and diverse relationships within this data. Traditional deep learning models struggle to effectively process graph data in non-Euclidean spaces, where many learning tasks require dealing with intricate relational information between elements [35]. This challenge led to the emergence of Graph Neural Networks (GNNs). GNNs are specifically designed to model the spatial relationships within data and are well-suited for handling graph-structured data in non-Euclidean spaces, a task that traditional and other deep neural network-based methods find difficult. GNNs have distinct advantages and have been widely applied in various fields, including recommendation systems, social network analysis, bioinformatics, intelligent transportation, and time series anomaly detection. Their ability to explicitly model the spatial relationships in data makes them uniquely powerful for these applications.

The concept of Graph Neural Networks (GNNs) was first introduced by Gori et al. [36]. Early research focused on learning the representation of target nodes through an iterative process that propagated information from neighboring nodes using a recurrent neural architecture, continuing until a stable fixed point was reached. This process generally required substantial computation. To handle graph-structured data, researchers were inspired by convolutional networks and sought to redefine convolution for graphs. For instance, Bruna et al. [37] developed a variant of graph convolution based on Spectral Graph Theory, designing a learnable diagonal matrix filter. However, this variant of graph convolution was computationally inefficient, and the filters were not spatially localized. In response, Henaff et al. [38] attempted to localize the spectral filters spatially by introducing smoothness coefficients. Subsequently, Defferrard et al. [39] proposed ChebNet, which simplified the computation by approximating the filter up to the $K$-th order using the truncated expansion of Chebyshev polynomials. Later, Kipf and Welling [40] introduced Graph Convolutional Networks (GCNs), which leveraged the adjacency relationships between nodes to learn node representations, enabling end-to-end learning on graph data. As GNNs have evolved, there has been a surge in time series analysis methods based on GNNs. These methods explicitly model relationships across time and variables, paving the way for modeling real-world time series data. GNN-based approaches have proven particularly effective in capturing the complex dependencies in time series, including both temporal dependencies and spatial ones, making them well-suited for a wide range of applications, from anomaly detection to forecasting.

3. Methodology

3.1. Problem Statement

Let $X_{train} = \{x_{train}^{1}, x_{train}^{2}, \dots, x_{train}^{T}\} \in \mathbb{R}^{N \times T}$ denote the training data, which is acquired from $N$ sensors over $T$ equispaced time steps, and let $x_{train}^{t} \in \mathbb{R}^{N}$ denote the training data at time $t$. Let $X_{test} = \{x_{test}^{1}, x_{test}^{2}, \dots, x_{test}^{U}\}$ denote the test data over $U$ equispaced time steps, which is also acquired from the same $N$ sensors; our objective is to detect anomalies in the test data. In addition, a sliding window of size $c$ serves as a processing unit, and the data of a sliding window at time $t$ is denoted as follows:

$$S^{t} = \{x^{t - c}, x^{t - c + 1}, \dots, x^{t - 1}\}. \tag{1}$$

Note that adjacent sliding windows are allowed to overlap. $S^{t} \in \mathbb{R}^{N \times c}$ will be used to predict the values $\hat{x}^{t}$ of the $N$ sensors at time $t$. The deviation between the actual data $x^{t}$ and the predicted data $\hat{x}^{t}$ is treated as the loss in the training phase and as an anomaly score in the testing phase. We note that these deviations may exhibit temporal correlation due to the underlying dynamics of multivariate time series. Unlike classical control-chart methods that assume independent residuals, our approach calibrates anomaly thresholds on validation data, which inherently accounts for such correlations. How the deviation is used is explained in detail in Section 3.6 and Section 4.1. The notation used in this paper is listed in Table 1.

3.2. Temporal Graph Structure Learner

As mentioned above, multivariate time series typically exhibit evolving relationships in the temporal dimension. When all $N$ sensors are considered together, the data at each time step can be represented as a node. Consequently, sequential nodes form connections across time, involving both across-time and adjacent-time relationships. In graph theory, such relationships can be characterized through neighbor attribution, which consists of neighbor relationships and neighbor degrees. To capture these temporal patterns, we propose the Temporal Graph Structure Learner (TGSL), which is designed to model evolving dependencies. Specifically, it identifies and refines the candidate neighbors for each node, including both across-time and adjacent-time neighbors. The neighbor relationship is denoted as a binary-weighted graph $G(V, E, W)$, where $V$ is a vertex set, $E$ is an edge set and $W$ is a binary matrix. Then, the graph $G(V, E, W)$ is fed into a multi-head attention net with $Q$ heads, which ultimately generates $Q$ weighted graphs, called attention-weighted graphs and denoted $G(V, E, W^{1}), \dots, G(V, E, W^{Q})$, respectively. As a result, the dependencies of the multivariate time series are evaluated with $Q$ attention-weighted graphs, which reflect different attention perspectives. Since the adjacent sliding windows in our work are allowed to overlap, we consider, without loss of generality, the data in a single sliding window. We now describe the design of the proposed learner.

(1): Capture Neighbor Relationship

Without loss of generality, the data in a single sliding window is considered. Given the data of a sliding window at time $t$, i.e., $S^{t} = \{x^{t - c}, x^{t - c + 1}, \dots, x^{t - 1}\}$, if each sample is considered as a node, i.e., $x^{t - c - 1 + i}$ corresponds to $v_{i}$, then we obtain a node set $V = \{v_{1}, \dots, v_{c}\}$. Because the nodes may exhibit temporal transitivity or coupling, their neighbor relationships cannot simply be determined by Euclidean distance. To ensure consistent input dimensionality and enable efficient batch training, a fixed sliding window of length $c$ is adopted. We then define two learnable vectors, $lv_{1} \in \mathbb{R}^{c}$ and $lv_{2} \in \mathbb{R}^{c}$, which are continuously refined through iterative optimization and are used to represent the potential neighbor relationships as follows:

$$A = \mathrm{Softmax}(\mathrm{ReLU}(lv_{1} \times lv_{2})) \tag{2}$$

where $A \in \mathbb{R}^{c \times c}$ is a relationship matrix, the $\mathrm{ReLU}(\cdot)$ function ensures that the edge values are all non-negative, and the $\mathrm{Softmax}(\cdot)$ function guarantees that the weights corresponding to relevant time steps sum to 1. The larger $A_{ij}$ is, the closer the relationship between $v_{i}$ and $v_{j}$. Thus, the $K$ nearest nodes of $v_{i}$ can be obtained by

$$N_{acr}(v_{i}) = \underset{j \neq i,\ j = 1, \dots, c}{\mathrm{TopK}} (A_{ij}) \tag{3}$$

where $N_{acr}(v_{i})$ is the set of $v_{i}$’s $K$-nearest neighbors. In this paper, neighbors of this type, $N_{acr}(\cdot)$, are called across-time neighbors, to distinguish them from another type, called adjacent-time neighbors and denoted as $N_{adj}(\cdot)$:

$$N_{adj}(v_{i}) = \{v_{j} \mid |i - j| \leq 1,\ j = 1, \dots, c \text{ and } j \neq i\}. \tag{4}$$

Obviously, $N_{adj}(v_{i})$ is the set of points adjacent to $v_{i}$ (two for an interior node, one at either end of the window), and they have strong associations with the node $v_{i}$. Now, we combine these two types of neighbors to form the final nearest neighbors as follows:

$$N(v_{i}) = N_{acr}(v_{i}) \cup N_{adj}(v_{i}). \tag{5}$$
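Equations (2) to (5) can be traced with a small standard-library sketch. Several points are assumptions rather than statements from the paper: the product $lv_{1} \times lv_{2}$ is read as an outer product (which is what gives $A$ its $c \times c$ shape), the softmax is applied row by row, indices are 0-based, and all function names are hypothetical.

```python
import math
from typing import List, Set


def relationship_matrix(lv1: List[float], lv2: List[float]) -> List[List[float]]:
    """Eq. (2): row-wise softmax of ReLU applied to the outer product of lv1 and lv2."""
    A = []
    for a in lv1:
        row = [max(0.0, a * b) for b in lv2]      # ReLU of the outer-product entries
        peak = max(row)
        exps = [math.exp(r - peak) for r in row]  # subtract the max for numerical stability
        total = sum(exps)
        A.append([e / total for e in exps])
    return A


def across_time_neighbors(A: List[List[float]], i: int, k: int) -> Set[int]:
    """Eq. (3): indices j != i of the k largest entries in row i of A."""
    candidates = [j for j in range(len(A)) if j != i]
    candidates.sort(key=lambda j: A[i][j], reverse=True)
    return set(candidates[:k])


def adjacent_time_neighbors(i: int, c: int) -> Set[int]:
    """Eq. (4): the immediate predecessor and successor that lie inside the window."""
    return {j for j in (i - 1, i + 1) if 0 <= j < c}


def neighbors(A: List[List[float]], i: int, k: int) -> Set[int]:
    """Eq. (5): union of across-time and adjacent-time neighbors."""
    return across_time_neighbors(A, i, k) | adjacent_time_neighbors(i, len(A))
```

In training, `lv1` and `lv2` would be parameters updated by gradient descent; here they are plain lists so that the neighbor selection itself can be inspected.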

Next, the neighbor relationship is denoted as a binary-weighted graph $G(V, E, W)$, in which $V$ is a vertex set, $E$ is an edge set that contains an edge whenever the vertex $v_{j}$ is one of $v_{i}$’s neighbors, and $W \in \mathbb{R}^{c \times c}$ is a binary matrix whose entries are given by

$$W_{ij} = \begin{cases} 1, & \text{if } v_{j} \in N(v_{i}) \text{ or } v_{i} \in N(v_{j}), \\ 0, & \text{otherwise}. \end{cases} \tag{6}$$
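As a small illustration of Equation (6), the binary matrix can be built directly from the neighbor sets. The function name `binary_adjacency` and the dictionary-of-sets input format are hypothetical choices for this sketch; indices are 0-based.

```python
from typing import Dict, List, Set


def binary_adjacency(neighbor_sets: Dict[int, Set[int]], c: int) -> List[List[int]]:
    """Eq. (6): W[i][j] = 1 if v_j is in N(v_i) or v_i is in N(v_j), else 0."""
    W = [[0] * c for _ in range(c)]
    for i, nbrs in neighbor_sets.items():
        for j in nbrs:
            W[i][j] = 1
            W[j][i] = 1  # the "or" in Eq. (6) makes W symmetric
    return W
```

Because of the "or" condition, $W$ is symmetric even when the underlying neighbor sets are not: node 0 may list node 2 as a neighbor without node 2 listing node 0.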

(2): Evaluate Neighbor Degree

Although the binary-weighted graph $G(V, E, W)$ describes the neighbor relationship of nodes, it cannot reflect their neighbor degree. To reliably evaluate the neighbor degree, we introduce a Graph Attention Network [41] to adaptively optimize the edge weights. First, the binary-weighted graph $G(V, E, W)$ is fed into a multi-head attention net with $Q$ heads, and then we compute the compatibility score between neighbors to obtain their edge weights. Specifically, for the $q$-th attention head, a learnable mapping matrix $\Phi_{q} \in \mathbb{R}^{N \times N}$ transforms the input nodes into a new space to obtain more compact representations. Then, a learnable vector $a_{q} \in \mathbb{R}^{2N}$ is introduced to score the neighbor degree in the new space, which is formulated as:

$$\mathrm{Score}(v_{i}, v_{j}) = \begin{cases} \mathrm{LeakyReLU}\left(a_{q}^{T} (\Phi_{q} v_{i} \,\|\, \Phi_{q} v_{j})\right), & \text{if } v_{j} \in N(v_{i}) \text{ or } v_{i} \in N(v_{j}), \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

where ‘$\|$’ denotes vector concatenation, and $\mathrm{LeakyReLU}(\cdot)$ is an activation function. Next, the score is normalized to the edge weights as follows:

$$W_{ij}^{q} = \begin{cases} \dfrac{\exp(\mathrm{Score}(v_{i}, v_{j}))}{\sum_{v_{k} \in N(v_{i})} \exp(\mathrm{Score}(v_{i}, v_{k}))}, & \text{if } v_{j} \in N(v_{i}), \\ 0, & \text{otherwise}. \end{cases} \tag{8}$$
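The scoring and normalization of Equations (7) and (8) for a single head can be sketched as follows. This is an illustration under assumptions: the LeakyReLU negative slope of 0.2 is not given in the text, normalization follows Equation (8) and runs over $N(v_{i})$ only, and the names `attention_weights`, `matvec` and `leaky_relu` are hypothetical.

```python
import math
from typing import Dict, List, Set

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    """LeakyReLU; the slope value is an assumption."""
    return x if x > 0 else slope * x


def attention_weights(nodes: List[Vector], neighbor_sets: Dict[int, Set[int]],
                      phi: Matrix, a: Vector) -> Matrix:
    """Edge weights of one head: Eq. (7) scores, then the Eq. (8) softmax over N(v_i)."""
    c = len(nodes)
    projected = [matvec(phi, v) for v in nodes]  # Phi_q v_i for every node
    W = [[0.0] * c for _ in range(c)]
    for i in range(c):
        nbrs = sorted(neighbor_sets[i])
        if not nbrs:
            continue
        # a_q^T (Phi_q v_i || Phi_q v_j): list concatenation plays the role of "||"
        scores = [leaky_relu(sum(w * x for w, x in zip(a, projected[i] + projected[j])))
                  for j in nbrs]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        for j, e in zip(nbrs, exps):
            W[i][j] = e / total
    return W
```

Calling `attention_weights` once per head with different `phi` and `a` gives the $Q$ weight matrices $W^{1}, \dots, W^{Q}$.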

Through $Q$ different attention heads, we obtain $Q$ weighted graphs, denoted as

$$G(V, E, W^{q}), \quad q = 1, \dots, Q. \tag{9}$$

Here, $G(V, E, W^{q})$ is called an attention-weighted graph and $W^{q}$ is the $q$-th attention weight matrix. Note that each graph corresponds to one attention perspective and focuses on a distinct relationship based on its specific attention mechanism. So far, the temporal graph structure is represented by $Q$ attention-weighted graphs.

3.3. Cross-Temporal Dynamic Encoder

Obviously, the attention-weighted graph $G(V, E, W^{q})$ describes the $q$-th type of complex and spreading dependencies. To use all $Q$ types of dependencies, we propose the Cross-Temporal Dynamic Encoder (CTDE), which aggregates the dependencies from the $Q$ attention-weighted graphs and encodes all nodes into new representations.

All nodes are fed into an $L$-layer aggregator. Let $v_{i}^{(l)}$ denote the $l$-th layer output of the node $v_{i}$; then the output of each layer is computed as follows:

$$v_{i}^{(l)} = \sigma\left(\Psi_{0} v_{i} + \frac{1}{Q} \sum_{q = 1}^{Q} \sum_{v_{j} \in N(v_{i})} W_{ij}^{q} \Psi_{q} v_{j}^{(l - 1)}\right) \tag{10}$$

where $\sigma(\cdot)$ is an activation function, and $\Psi_{q} \in \mathbb{R}^{N \times N}$, $q = 0, \dots, Q$, is a learnable mapping matrix for effective dependency aggregation from the attention-weighted graph $G(V, E, W^{q})$. The output of the final layer, i.e., $v_{i}^{(L)}$, is the new representation of node $v_{i}$. As a result, a new sequence $v_{1}^{(L)}, \dots, v_{c}^{(L)}$ is obtained, which is called the Proxy Multivariate Sequence (PMS). For brevity, we use the matrix form to denote the PMS as follows:

$$P = (v_{1}^{(L)}, \dots, v_{c}^{(L)}) \tag{11}$$

where $P \in \mathbb{R}^{N \times c}$ is the new representation after aggregation, $N$ represents the feature dimension aggregated for each timestamp, and $c$ represents the number of timestamps.
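The layer update of Equation (10) and the assembly of $P$ in Equation (11) might be sketched as below. The text leaves several details open, so the following are assumptions: $\sigma$ is taken to be tanh, the initial state is $v_{i}^{(0)} = v_{i}$, the same matrices $\Psi_{0}, \dots, \Psi_{Q}$ are reused in every layer, and the self term uses the original node $v_{i}$ as written in Equation (10). The names `ctde_layer` and `ctde` are hypothetical.

```python
import math
from typing import Dict, List, Set

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def ctde_layer(inputs: List[Vector], previous: List[Vector],
               neighbor_sets: Dict[int, Set[int]], head_weights: List[Matrix],
               psi0: Matrix, psis: List[Matrix]) -> List[Vector]:
    """One layer of Eq. (10) with sigma = tanh (the activation is an assumption)."""
    Q = len(head_weights)
    outputs = []
    for i, v_i in enumerate(inputs):
        acc = matvec(psi0, v_i)  # self term Psi_0 v_i
        for q in range(Q):
            for j in neighbor_sets[i]:
                msg = matvec(psis[q], previous[j])  # Psi_q v_j^(l-1)
                w = head_weights[q][i][j] / Q       # W_ij^q averaged over the Q heads
                acc = [s + w * m for s, m in zip(acc, msg)]
        outputs.append([math.tanh(s) for s in acc])
    return outputs


def ctde(nodes: List[Vector], neighbor_sets: Dict[int, Set[int]],
         head_weights: List[Matrix], psi0: Matrix, psis: List[Matrix],
         num_layers: int) -> Matrix:
    """Stack L layers and return P as N rows of c columns (Eq. (11))."""
    hidden = nodes  # assumption: v_i^(0) = v_i
    for _ in range(num_layers):
        hidden = ctde_layer(nodes, hidden, neighbor_sets, head_weights, psi0, psis)
    n = len(nodes[0])
    return [[hidden[i][row] for i in range(len(nodes))] for row in range(n)]
```

The final transposition turns the list of $c$ node vectors into the $N \times c$ layout used for $P$, so that each row holds one sensor's aggregated sequence.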

3.4. Cross-Variable Dynamic Encoder

The Proxy Multivariate Sequence (PMS), generated by the proposed TGSL and CTDE, has captured the dependencies in the temporal dimension. However, multivariate time series data also contains rich spatial information, i.e., relationships among sensors or variables. To mine this information, we design a block called the Cross-Variable Dynamic Encoder (CVDE). The CVDE block uses the PMS to construct spatial graphs from both a local and a global perspective and combines them into a Spatial Graph Sequence (SGS).

To capture the spatial relationship among variables (sensors), the Proxy Multivariate Sequence (PMS) $P \in \mathbb{R}^{N \times c}$ is divided into $m$ non-overlapping localities with a fixed length $w = c / m$, which yields the local sequences $P^{i} \in \mathbb{R}^{N \times w}$, $i = 1, \dots, m$. Then, in each locality, we treat a variable (a sensor) as a node and establish a spatial graph, in which the Dynamic Time Warping (DTW) distance is employed to measure the similarity between variable features and build the adjacency matrix. Specifically, for the $i$-th locality $P^{i}$, the local spatial adjacency matrix $O^{i} \in \mathbb{R}^{N \times N}$ is computed by

$$O_{j,k}^{i} = \exp\left(- \frac{\mathrm{DTW}^{2}(P_{j}^{i}, P_{k}^{i})}{\alpha}\right) \tag{12}$$

where $P_{j}^{i} \in \mathbb{R}^{w}$ represents the feature of the $j$-th variable, i.e., the $j$-th row of $P^{i}$, $\mathrm{DTW}(\cdot, \cdot)$ is the DTW distance function between two sequences, and $\alpha$ is a hyperparameter that controls the scaling of DTW distances when constructing adjacency matrices, regulating how sharply or smoothly similarities between variables decay. The DTW metric provides robustness to local misalignments by allowing elastic matching along the time axis; see [42] for more details. Hence, the $m$ non-overlapping localities are described by $m$ local spatial graphs, which are represented by $m$ local adjacency matrices, i.e., $O^{i}$, $i = 1, \dots, m$.
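To make Equation (12) concrete, the sketch below computes a textbook dynamic-programming DTW distance and turns pairwise distances into an adjacency matrix. The paper does not state which DTW variant or local cost it uses, so the absolute-difference local cost and the absence of a warping-window constraint are assumptions; `dtw_distance` and `dtw_adjacency` are hypothetical names, and `scale` stands for $\alpha$.

```python
import math
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic-programming DTW with |x - y| as the local cost (an assumed variant)."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def dtw_adjacency(rows: List[Sequence[float]], scale: float) -> List[List[float]]:
    """Entry (j, k) = exp(-DTW(row_j, row_k)^2 / scale), as in Eq. (12)."""
    n = len(rows)
    adj = [[1.0] * n for _ in range(n)]  # DTW(x, x) = 0, so the diagonal is exp(0) = 1
    for j in range(n):
        for k in range(j + 1, n):
            value = math.exp(-dtw_distance(rows[j], rows[k]) ** 2 / scale)
            adj[j][k] = adj[k][j] = value  # a symmetric local cost gives a symmetric matrix
    return adj
```

Two rows that differ only by a time shift, such as `[0, 1, 2, 2]` and `[0, 0, 1, 2]`, have DTW distance 0 and therefore similarity 1, which is the robustness to misalignment mentioned above. The same function applied to the full rows of $P$ gives a global matrix.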

Obviously, a local adjacency matrix only reflects the spatial information of one locality. To capture the global spatial information between variables (sensors), we also use the DTW distance to define a global spatial adjacency matrix, denoted $O \in \mathbb{R}^{N \times N}$, which is formulated as

$$O_{j,k} = \exp\left(- \frac{\mathrm{DTW}^{2}(P_{j}, P_{k})}{\beta}\right) \tag{13}$$

where $P_{j} \in \mathbb{R}^{c}$ represents the feature of the $j$-th variable, i.e., the $j$-th row of $P$, and $\beta$ is a hyperparameter. To incorporate global spatial information into the local structures, each local adjacency matrix is combined with the global one, resulting in the following Spatial Graph Sequence (SGS):

$$D^{i} = \lambda O^{i} + (1 - \lambda) O, \quad \lambda \in [0, 1] \tag{14}$$

where $\lambda$ is a learnable parameter and $D^{i} \in \mathbb{R}^{N \times N}$ $(i = 1, \dots, m)$. For simplicity, let $D = \{D^{1}, \dots, D^{m}\}$ denote the Spatial Graph Sequence.
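The splitting into localities and the blend of Equation (14) are simple enough to show directly. In this sketch $\lambda$ is passed as a fixed number, whereas the paper treats it as learnable; the function names `split_localities` and `blend` are hypothetical, and $c$ is assumed to be divisible by $m$.

```python
from typing import List

Matrix = List[List[float]]


def split_localities(P: Matrix, m: int) -> List[Matrix]:
    """Cut the N x c matrix P into m non-overlapping N x w blocks with w = c / m."""
    c = len(P[0])
    if c % m != 0:
        raise ValueError("c must be divisible by m")
    w = c // m
    return [[row[i * w:(i + 1) * w] for row in P] for i in range(m)]


def blend(local: Matrix, global_: Matrix, lam: float) -> Matrix:
    """Eq. (14): D^i = lam * O^i + (1 - lam) * O with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [[lam * l + (1.0 - lam) * g for l, g in zip(lrow, grow)]
            for lrow, grow in zip(local, global_)]
```

With `lam = 1` each $D^{i}$ reduces to the purely local graph, and with `lam = 0` every locality shares the global graph, so the parameter sets how much each segment may deviate from the window-wide structure.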

3.5. Spatiotemporal Mixer

The Cross-Temporal Dynamic Encoder (CTDE) and the Cross-Variable Dynamic Encoder (CVDE) generate the Proxy Multivariate Sequence (PMS) and the Spatial Graph Sequence (SGS), respectively. To make full use of the complementary information from both sequences, we design the Spatiotemporal Mixer (TSM), which takes the PMS and the SGS as input and integrates them into a new sequence, referred to as the Spatiotemporal Mixed Sequence (TSMS). The TSMS will be used for downstream tasks, e.g., classification or prediction. The TSM consists of the following two steps.

(1): Generating Hidden Representations

A 2D convolutional layer, with a $1 \times 1$ convolutional kernel and $d$ channels, is used to extract a richer feature representation $I \in \mathbb{R}^{d \times N \times c}$ from the Proxy Multivariate Sequence $P \in \mathbb{R}^{N \times c}$. Here, the $1 \times 1$ kernel ensures that the convolution operation focuses on feature transformation, and the $d$ channels aim to obtain a higher-dimensional embedding. Note that the intrinsic aggregation information existing in $P$ may be decomposed in the higher-dimensional embedding space. Thus, it is necessary to re-aggregate the information of nearby points from the feature representation $I$. To this end, the feature representation $I$ is fed into the Dilated Inception Layer (DIL) to aggregate nearby points, which can use various kernel sizes to capture similarities between pairs of sequences; see [43] for more detail. As a result, a hidden representation $Z \in \mathbb{R}^{d \times N \times c}$ is generated.
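For a single-channel input, a $1 \times 1$ convolution with $d$ output channels reduces to one scale and one bias per channel applied to every entry of $P$. The sketch below shows only this channel-lifting step and does not implement the Dilated Inception Layer; `conv1x1` is a hypothetical name and the presence of a bias term is an assumption.

```python
from typing import List

Matrix = List[List[float]]
Tensor3 = List[Matrix]


def conv1x1(P: Matrix, weights: List[float], biases: List[float]) -> Tensor3:
    """Lift the N x c input to d channels: I[k][n][t] = weights[k] * P[n][t] + biases[k]."""
    return [[[wk * value + bk for value in row] for row in P]
            for wk, bk in zip(weights, biases)]
```

Because the kernel covers a single position, no information is mixed across sensors or time steps at this stage, which is why a later layer is needed to re-aggregate nearby points.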

(2): Mixing Temporal and Spatial Information

First, the hidden representation

Z

is divided into

m

non-overlapping segments, i.e.,

Z = {Z^{1}, \dots, Z^{m}}

and

Z^{i} \in R^{d \times N \times w} (w = c / m)

, along the temporal dimension. Then,

Z^{i}

and

D^{i}

correspond one-to-one. Now, we employ MixHop module for mixing temporal and spatial information, which is formulated as:

F^{i} = M i x H o p (Z^{i}, D^{i})

(15)

where

F^{i} \in R^{d \times N \times w}

represents the feature matrix of the

i

-th segment. The MixHop [44] function not only realizes the interaction of spatiotemporal signals but also captures higher-order neighbor relationships through multiple propagation orders. Next, we concatenate these feature matrices as follows:

M^{t - 1} = \mathrm{concat}(F^{1}, F^{2}, \dots, F^{m})

(16)

where

M^{t - 1} \in R^{d \times N \times c}

is the final representation of a sliding window at time

t

. Hence, over

T

time steps, we obtain the Spatiotemporal Mixed Sequence (TSMS), i.e.,

M = \{M^{t - 1}, \dots, M^{t + T - c}\} .
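The two TSM steps can be illustrated with a short NumPy sketch that segments a hidden representation, mixes each segment with a per-segment spatial graph, and concatenates the results as in Equations (15) and (16). This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `mixhop` and `tsm_mix`, the row-normalized adjacency with self-loops, the hop set {0, 1, 2}, and the averaging of hops (used here in place of the learned projections of MixHop [44]) are all illustrative choices.

```python
import numpy as np


def mixhop(z_seg, adj, hops=(0, 1, 2)):
    """Mix node features with several powers of a normalized adjacency.

    z_seg: (d, N, w) hidden features of one segment.
    adj:   (N, N) non-negative spatial graph for the same segment.
    Averaging the hops is an assumption that keeps the output shape (d, N, w).
    """
    n = adj.shape[0]
    a = adj + np.eye(n)                       # add self-loops
    a = a / a.sum(axis=1, keepdims=True)      # row-normalize
    outputs, a_power = [], np.eye(n)
    for j in range(max(hops) + 1):
        if j in hops:
            # apply the j-th adjacency power along the sensor axis
            outputs.append(np.einsum("nk,dkw->dnw", a_power, z_seg))
        a_power = a_power @ a
    return np.mean(outputs, axis=0)


def tsm_mix(z, graphs):
    """Split Z (d, N, c) into m segments, mix each one, then concatenate."""
    m = len(graphs)
    segments = np.split(z, m, axis=2)         # requires c to be divisible by m
    mixed = [mixhop(seg, g) for seg, g in zip(segments, graphs)]
    return np.concatenate(mixed, axis=2)


rng = np.random.default_rng(0)
d, n, c, m = 4, 5, 30, 6                      # c and m as in Section 5.4; d and n are toy values
z = rng.normal(size=(d, n, c))
graphs = [rng.random((n, n)) for _ in range(m)]
out = tsm_mix(z, graphs)
print(out.shape)
```

With an all-zero graph, the normalized adjacency reduces to the identity and the sketch returns its input unchanged, which is a convenient sanity check on the shapes.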

3.6. Neural Network Prediction

The function

f (•)

is composed of a series of linear layers in a neural network. These layers process the concatenated input

M^{t - 1}

, ultimately producing the predicted target values for each sensor at time

t

, i.e.,

{\hat{x}}^{t}

as follows:

{\hat{x}}^{t} = f (M^{t - 1}) .

(17)

It should be noted that our approach requires labeled data during training. The model is optimized by minimizing the prediction error with respect to ground-truth labels, ensuring that the learned representations are aligned with the actual system behavior. This design follows the supervised anomaly detection setting, as in benchmark datasets such as SWaT and WADI, where attack periods are explicitly annotated. Extending the framework to fully unsupervised scenarios without labels is an interesting direction for future work.

To train the model, we adopt the mean squared error (MSE) as the loss function. Although minimizing MSE is equivalent to assuming that the residuals follow an approximately Gaussian distribution, it remains one of the most widely used and effective objectives in multivariate time series modeling. This choice is motivated by its computational simplicity, stability in optimization, and interpretability as a measure of deviation between predicted and observed values. Moreover, prior studies on SWaT and WADI have also employed MSE under similar settings, which ensures comparability with existing baselines. While robust losses (e.g., L1 loss or Huber loss) could potentially relax the Gaussian assumption, we empirically found MSE sufficient to achieve strong performance in our experiments. The loss function is therefore defined as the deviation between the predicted output

{\hat{x}}^{t}

and the observed data

x^{t}

:

L = \frac{1}{T - c} \sum_{t = c + 1}^{T} {‖{\hat{x}}^{t} - x^{t}‖}^{2} .

(18)

Note that these deviations may exhibit temporal correlation due to the dynamics of multivariate time series. Unlike traditional control-chart methods that assume independent residuals, our approach calibrates anomaly thresholds on validation data, which implicitly accounts for such correlations.
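As a worked illustration of Equation (18), the sketch below evaluates the loss over all sliding windows of a toy series. It is only a sketch under assumptions: the `predict` callable stands in for the full pipeline ending in f(·), and the `persistence` predictor (repeat the last observation) is a hypothetical placeholder, not part of DTS-MixNet.

```python
import numpy as np


def sliding_window_mse(x, predict, c):
    """Mean squared L2 prediction error over all windows, as in Eq. (18).

    x:       (T, N) array of observations.
    predict: callable mapping a (c, N) window to an (N,) forecast.
    """
    T = x.shape[0]
    total = 0.0
    for t in range(c, T):              # 0-indexed t is time step t + 1 in the paper
        window = x[t - c:t]            # the c observations preceding x[t]
        x_hat = predict(window)
        total += float(np.sum((x_hat - x[t]) ** 2))
    return total / (T - c)


def persistence(window):
    """Hypothetical stand-in predictor: repeat the last observation."""
    return window[-1]


x = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
loss = sliding_window_mse(x, persistence, c=2)
print(loss)
```

Here each of the two windows contributes a squared error of 1 + 4 = 5, so the mean over T - c = 2 windows is 5.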

The pseudo-code of the training phase is summarized in Algorithm 1.

Algorithm 1: Pseudo-code of Training Phase

Input: Input signal $X_{train} = \{x_{train}^{1}, x_{train}^{2}, \dots, x_{train}^{T}\}$ , sliding window size $c$ , ground-truth data at time $t$ : $x^{t}$ .
Initialize model parameters
for epoch = 1 to max_epochs do
    total_loss $\leftarrow 0$
    for $t = c$ to $T$ do
        $S^{t} \leftarrow X_{train} [t - c : t]$
        $G \leftarrow \mathrm{TGSL} (S^{t})$
        $P \leftarrow \mathrm{CTDE} (G)$
        $D \leftarrow \mathrm{CVDE} (P)$
        $M \leftarrow \mathrm{TSM} (P, D)$
        ${\hat{x}}^{t} \leftarrow \mathrm{Neural\_Network} (M)$
        $L^{t} \leftarrow {‖{\hat{x}}^{t} - x^{t}‖}^{2}$
        total_loss $\leftarrow$ total_loss $+ L^{t}$
    end for
    Backpropagate the mean loss of Equation (18) and update model parameters using the Adam optimizer.
    Output: Epoch loss
end for
Return trained model parameters $M_{dts}$

4. Discussion

In this section, we discuss the key insights and findings from the proposed anomaly detection method. We focus on several important aspects, including anomaly scoring, time complexity, scalability, threshold calibration, and the challenges of deploying the model in dynamic, real-world environments. We also highlight potential optimizations and adaptations that can improve the model’s performance and applicability across various domains.

4.1. Anomaly Scoring

In the testing phase, the deviation between the observed data and the model prediction is used as the anomaly score. Because these deviations may exhibit temporal correlation, we do not assume independence as in classical control-chart methods. Instead, the detection threshold is calibrated on a held-out validation set, so that the correlation structure of the residuals is implicitly taken into account. At time

t

, the deviation for the

i

-th sensor is computed as

\mathrm{Err}_{i} (t) = |x_{i}^{t} - {\hat{x}}_{i}^{t}| .

(19)

Since the distribution of errors can be skewed, each sensor’s deviation is normalized in a robust manner:

α_{i} (t) = \frac{\mathrm{Err}_{i} (t) - {\hat{μ}}_{i}}{{\hat{σ}}_{i}}

(20)

where

{\hat{μ}}_{i}

and

{\hat{σ}}_{i}

denote the median and the interquartile range (IQR) of the error distribution for the

i

-th sensor, respectively. This construction is essentially a robust Z-score, making it less sensitive to skewness and outliers compared to variance-based normalization. The system-level anomaly score is then defined as the maximum normalized deviation across sensors:

A (t) = \max_{i} α_{i} (t) .

(21)

This heuristic highlights the strongest signal among all sensors, which works effectively in practice but does not explicitly capture cross-sensor correlations. Incorporating more rigorous multivariate control statistics, such as Hotelling’s

T^{2}

, is a promising direction for future work.
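Equations (19)–(21) can be sketched in a few lines of NumPy. The following is an illustration under assumptions rather than the reference implementation: the median and IQR are estimated from the same error series that is scored (they could equally be taken from a validation set), and the small constant `eps`, which guards against a zero IQR, is an addition not present in Equation (20).

```python
import numpy as np


def anomaly_scores(x_obs, x_pred, eps=1e-8):
    """Return the system-level score A(t) for each time step.

    x_obs, x_pred: (T, N) arrays of observed and predicted sensor values.
    """
    err = np.abs(x_obs - x_pred)                     # Eq. (19)
    med = np.median(err, axis=0)                     # per-sensor median
    q75, q25 = np.percentile(err, [75, 25], axis=0)  # per-sensor quartiles
    iqr = q75 - q25
    alpha = (err - med) / (iqr + eps)                # Eq. (20), eps is an added safeguard
    return alpha.max(axis=1)                         # Eq. (21)


x_obs = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [100.0, 1.0]])
x_pred = np.zeros_like(x_obs)
scores = anomaly_scores(x_obs, x_pred)
print(scores)
```

In this toy series the first sensor has median error 2 and IQR 2, so its last reading receives a normalized deviation of roughly 49 and dominates the maximum, while the constant second sensor contributes zero.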

The pseudo-code of the testing phase is summarized in Algorithm 2.

Algorithm 2: Pseudo-code of Testing Phase

Input: Input signal $X_{test} = \{x_{test}^{1}, x_{test}^{2}, \dots, x_{test}^{U}\}$ , sliding window size $c$ , ground-truth data at time $t$ : $x^{t}$ .
Output: Anomaly detection result $\hat{y} \in \{0, 1\}$ .
Initialize trained model parameters $M_{dts}$ . The detection threshold $δ$ is calibrated on a held-out validation set by sweeping candidate values and selecting the one that maximizes the F1-score.
for $t = c$ to $U$ do
    $S^{t} \leftarrow X_{test} [t - c : t]$
    $G \leftarrow \mathrm{TGSL} (S^{t})$
    $P \leftarrow \mathrm{CTDE} (G)$
    $D \leftarrow \mathrm{CVDE} (P)$
    $M \leftarrow \mathrm{TSM} (P, D)$
    ${\hat{x}}^{t} \leftarrow \mathrm{Neural\_Network} (M)$
    $\mathrm{Err} \leftarrow |x^{t} - {\hat{x}}^{t}|$
    $\hat{μ} \leftarrow \mathrm{MEDIAN} (\mathrm{Err})$
    $\hat{σ} \leftarrow \mathrm{IQR} (\mathrm{Err})$
    $a \leftarrow \frac{\mathrm{Err} - \hat{μ}}{\hat{σ}}$
    $A (t) \leftarrow \max (a)$
    ${\hat{y}}^{t} \leftarrow (A (t) > δ) \; ? \; 1 : 0$
    Append ${\hat{y}}^{t}$ to $\hat{y}$
end for
Return $\hat{y}$

4.2. Time Complexity Analysis

In this section, we analyze the time complexity of our proposed model during both the training phase and testing phase. The complexity is influenced by several factors, including the number of time steps

T

(for training) and

U

(for testing), the sliding window size

c

, and the number of sensors

N

. Below, we provide a detailed breakdown of the time complexity for each operation involved in both phases. Table 2 summarizes the time complexity of each operation in both the training and testing phases.

The time complexity of the training phase depends on several components. Each operation, including the Temporal Graph Structure Learner (TGSL), Cross-Temporal Dynamic Encoder (CTDE), Cross-Variable Dynamic Encoder (CVDE), Spatiotemporal Mixer (TSM), and the neural network prediction, involves processing data at each time step. Thus, the total time complexity for the training phase is

O ((L Q c + c^{2} + d c) \, T N^{2})

.

The testing phase follows a similar structure to the training phase, but without backpropagation. Each operation in the testing phase has the same complexity as in the training phase, but we only perform forward passes. Therefore, the total time complexity for the testing phase is

O ((L Q c + c^{2} + d c) \, U N^{2})

.

The time complexity analysis shows that the model’s complexity scales quadratically with the number of sensors

N

and linearly with the number of time steps in both the training and testing phases; its dependence on the sliding window size

c

is linear in the $L Q c$ and $d c$ terms and quadratic in the $c^{2}$ term. The training cost is higher because these passes are repeated over multiple epochs and include backpropagation, while the testing phase involves only forward passes, making it less computationally intensive. This analysis quantifies the computational demands of our approach and informs assessments of scalability. Nevertheless, practical limitations are more likely to arise from theoretical aspects (e.g., robustness to distribution shift and threshold calibration) than from computation; strengthening these aspects will be a focus of future work.
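To make the scaling behavior concrete, the snippet below evaluates the expression inside the O(·) bound for a few settings. It is a back-of-the-envelope illustration only: constant factors are ignored, Q = 4, c = 30, and d = 64 follow Section 5.4, and the values chosen for L, the number of time steps, and the number of sensors are hypothetical.

```python
def dominant_cost(L, Q, c, d, steps, n_sensors):
    """Evaluate (LQc + c^2 + dc) * steps * N^2, ignoring constant factors."""
    return (L * Q * c + c ** 2 + d * c) * steps * n_sensors ** 2


# L, steps, and n_sensors are hypothetical values chosen for illustration.
base = dominant_cost(L=2, Q=4, c=30, d=64, steps=1000, n_sensors=50)
more_sensors = dominant_cost(L=2, Q=4, c=30, d=64, steps=1000, n_sensors=100)
longer_window = dominant_cost(L=2, Q=4, c=60, d=64, steps=1000, n_sensors=50)

sensor_ratio = more_sensors / base    # doubling N
window_ratio = longer_window / base   # doubling c
print(sensor_ratio, round(window_ratio, 2))
```

Doubling the number of sensors multiplies the estimate by exactly four, whereas doubling the window length multiplies it by about 2.59 in this setting, i.e., more than a factor of two because of the c² term.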

4.3. Scalability and Optimization

The time–complexity analysis indicates that the dominant cost grows quadratically with the number of sensors

N

(due to variable–variable interactions) and approximately linearly with the window length

c

and the number of time steps in both training and testing. While this characterization clarifies computational demands, it also highlights potential challenges when scaling to hundreds or thousands of sensors in large IoT/industrial systems.

To mitigate these costs, the following strategies can be applied in practice or explored in future work; they target either the variable–variable

O (N^{2})

terms, the temporal mixing cost, or both.

Sparse/dilated temporal neighbors. In temporal modules, restrict attention to adjacent time steps plus at most $K_{t}$ across-time links (e.g., dilated offsets). This replaces dense $O (N^{2})$ interactions with $O (c N K_{t})$ .

Approximate similarity for CVDE. When CVDE relies on DTW, employ a Sakoe–Chiba band or soft-DTW with a small bandwidth $b$ , optionally preceded by piecewise aggregate approximation (PAA). This changes per-pair cost from $O (c^{2})$ to $O (c b)$ with $b ≪ c$ .

Low-rank/linear attention. Replace quadratic attention with kernelized/low-rank variants so a layer scales as $O (L Q c N r)$ with small rank $r$ , rather than $O (L Q {(c N)}^{2})$ .

Group-level mixing. Cluster sensors into $G$ functional groups (via long-term correlation/MI or domain taxonomy), perform mixing at the group level ( $O (G^{2}), G ≪ N$ ), then refine within groups.

Several of them are compatible with the current framework and can be combined; a systematic study of the accuracy–efficiency trade-offs is a promising direction for future work.
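Of the strategies listed above, the banded DTW approximation is the simplest to sketch. The following pure-Python function restricts the warping path to a Sakoe-Chiba band of half-width `band`, so only O(c b) cells are filled instead of O(c²). It is a minimal illustration under assumptions (equal-length univariate sequences, absolute difference as the local cost); the function name `dtw_band` is hypothetical and the code is not taken from the CVDE implementation.

```python
import math


def dtw_band(a, b, band):
    """DTW distance between equal-length sequences with |i - j| <= band."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - band)          # first column inside the band
        hi = min(m, i + band)          # last column inside the band
        for j in range(lo, hi + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # insertion
                                    cost[i][j - 1],      # deletion
                                    cost[i - 1][j - 1])  # match
    return cost[n][m]


x = [0.0, 0.0, 1.0, 2.0]
y = [0.0, 1.0, 2.0, 2.0]
narrow = dtw_band(x, y, band=0)   # diagonal alignment only
wide = dtw_band(x, y, band=1)     # allows a one-step shift
print(narrow, wide)
```

With `band=0` the two toy sequences are compared point by point and differ by 2, while `band=1` already allows the one-step shift that aligns them exactly, which illustrates the accuracy versus cost trade-off controlled by the bandwidth.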

4.4. Domain Applicability and Adaptation

Our experiments use industrial control datasets, where variables correspond to sensors/actuators with relatively regular sampling. Applying DTS-MixNet to non-industrial multivariate time series (e.g., healthcare, finance) raises distinct challenges: irregular sampling and missingness, non-stationarity/regime shifts, heterogeneous variable semantics, and privacy constraints. We outline adaptations within our framework:

Domain priors for spatial graphs. Augment dynamically learned graphs with prior structure (e.g., clinical ontologies, market sector/industry taxonomies) via Laplacian or edge-level regularization, yielding prior-regularized dynamic graphs that respect domain knowledge while remaining adaptive.
Shift-robust thresholding. Instead of a single fixed threshold, employ validation-based quantile calibration or conformal prediction for per-deployment calibration; update thresholds online with a small sliding validation buffer to accommodate regime changes.
Self-supervised pretraining. Pretrain TGSL/CTDE with masked forecasting/reconstruction on large heterogeneous MTS, then fine-tune on the target domain; this preserves the architecture and improves data efficiency under limited labels.

Although we report results only on ICS benchmarks, a more comprehensive evaluation on out-of-domain datasets (such as healthcare, finance, etc.) will be an important direction for future work.

4.5. Threshold Calibration and Adaptation

The fixed threshold

δ

used in our offline evaluation (selected on a held-out validation split and fixed for the test set) ensures comparability across methods. In dynamic deployments, however, normal behavior may drift over time, suggesting the use of adaptive calibration. Let

C_{t} = \{a_{t - W}, \dots, a_{t - 1}\}

denote a sliding calibration buffer of size

W

, which stores recent anomaly scores. We outline practical, model-agnostic strategies that are compatible with our scoring framework:

(1): Rolling quantile calibration (lightweight).

Populate

C_{t}

with predicted-normal windows (e.g.,

a_{t - i} \leq δ_{t - i} - γ, \quad \text{with margin } γ > 0

) and update the threshold as

δ_{t} = Q_{1 - α} (C_{t}) + κ

(22)

where

Q_{1 - α} (C_{t})

denotes the empirical

(1 - α)

-quantile of the scores in

C_{t}

and

κ \geq 0

is a safety margin. When

|C_{t}| < B

(cold start), fall back to the validation-tuned

δ

. Here,

B

denotes the minimum buffer size required for reliable recalibration, typically set between

50

and

200

.
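The rolling quantile rule of Equation (22), including the cold-start fallback and the margin γ for admitting scores to the buffer, can be sketched as follows. This is one possible realization under assumptions: the class name `RollingQuantileThreshold`, the order-statistic definition of the empirical quantile, and the toy parameter values are illustrative and not prescribed by the text.

```python
from collections import deque
import math


class RollingQuantileThreshold:
    """Sliding-buffer threshold delta_t = Q_{1-alpha}(C_t) + kappa."""

    def __init__(self, delta0, window, min_size, alpha, kappa=0.0, gamma=0.0):
        self.delta0 = delta0                  # validation-tuned fallback threshold
        self.buffer = deque(maxlen=window)    # calibration buffer C_t of size W
        self.min_size = min_size              # minimum buffer size B
        self.alpha = alpha
        self.kappa = kappa
        self.gamma = gamma

    def threshold(self):
        if len(self.buffer) < self.min_size:  # cold start
            return self.delta0
        ordered = sorted(self.buffer)
        # empirical (1 - alpha)-quantile taken as an order statistic
        k = max(0, math.ceil((1.0 - self.alpha) * len(ordered)) - 1)
        return ordered[k] + self.kappa

    def update(self, score):
        delta = self.threshold()
        is_anomaly = score > delta
        if score <= delta - self.gamma:       # keep only predicted-normal scores
            self.buffer.append(score)
        return is_anomaly, delta


calibrator = RollingQuantileThreshold(delta0=10.0, window=5, min_size=3,
                                      alpha=0.2, kappa=0.5)
results = [calibrator.update(s) for s in [1.0, 2.0, 3.0, 4.0]]
print(results)
```

In this toy run the first three scores fall under the fallback threshold and fill the buffer; the fourth score is then compared with the recalibrated threshold 3 + 0.5 = 3.5 and is flagged, so it is not added to the buffer.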

(2): Conformal prediction with a sliding calibration set.

Treat

a_{t}

as a nonconformity score and maintain a calibration set

C_{t}

of size

W

. At each time step, compute the conformal

p

-value as

p_{t} = \frac{1}{W + 1} (1 + \sum_{s \in C_{t}} 1 \{a_{s} \geq a_{t}\})

(23)

An anomaly is flagged if

p_{t} \leq α

. Here,

α \in (0, 1)

denotes the target significance level, which controls the tolerated false alarm rate when interpreting

p_{t}

. This procedure provides finite-sample error control under exchangeability assumptions, and effectively replaces a fixed threshold

δ

with a time-varying decision rule without changing the underlying model.
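The conformal p-value of Equation (23) is a direct count over the calibration set, as the short sketch below shows. The function name `conformal_p_value` and the toy scores are illustrative assumptions.

```python
def conformal_p_value(score, calibration):
    """p_t = (1 + #{s in C_t : a_s >= a_t}) / (W + 1), as in Eq. (23)."""
    w = len(calibration)
    at_least = sum(1 for s in calibration if s >= score)
    return (1 + at_least) / (w + 1)


calibration = [1.0, 2.0, 3.0, 4.0]         # toy calibration scores, W = 4
p_extreme = conformal_p_value(5.0, calibration)
p_typical = conformal_p_value(3.0, calibration)
alpha = 0.2
flags = [p <= alpha for p in (p_extreme, p_typical)]
print(p_extreme, p_typical, flags)
```

Because the smallest attainable p-value is 1/(W + 1), a significance level α below that value can never trigger an alarm, which gives a practical lower bound on the calibration set size for a desired α.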

4.6. Real-Time Deployment and Challenges

In real-world industrial applications, real-time deployment introduces several critical challenges that must be addressed to ensure the practical usability of anomaly detection models. These include issues related to inference latency, streaming data processing, and robustness to missing sensor values. Below, we outline these aspects and how they might be addressed within the context of DTS-MixNet.

Inference Latency: Real-time systems require that anomaly detection models make predictions with minimal latency to allow for timely intervention. While the current DTS-MixNet architecture may involve complex graph computations, which can be computationally intensive, we propose the following approaches to optimize latency: (i) model pruning, where less critical parts of the graph are simplified, (ii) batching inputs in streaming settings to reduce the overhead of graph-based computations, and (iii) parallelization techniques such as multi-threading or GPU acceleration to speed up inference times. We plan to assess these optimization strategies in future work to improve real-time deployment performance.

Handling Streaming Data: In real-world settings, anomaly detection must often be performed on streaming data, where new sensor readings are continuously fed into the system. Our current approach handles sliding windows of fixed length, but for a true real-time deployment, future work will focus on extending this model to work with online learning techniques. This would allow the model to update its parameters incrementally as new data arrives, ensuring that it remains adaptable to changes in the system over time.

Robustness to Missing Sensor Values: Missing sensor values are a common challenge in industrial systems due to sensor failures, communication issues, or temporary disconnects. To enhance the robustness of DTS-MixNet, we will explore imputation strategies (e.g., mean imputation,

k

-NN imputation, or even model-based imputation via auxiliary models). Additionally, we will investigate the possibility of masking missing values during the graph construction and learning process to prevent them from adversely affecting the model’s predictions.

Addressing these challenges is crucial for ensuring that DTS-MixNet can be deployed in dynamic, real-world environments. Future work will investigate these aspects in more detail, potentially through experiments and optimization techniques tailored for real-time performance.
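As a rough illustration of how streaming input and missing values might be handled together, the sketch below maintains a fixed-length window over incoming readings, fills missing entries with the running per-sensor mean, and returns a mask that downstream graph construction could use to ignore imputed values. Everything here is an assumption made for illustration: the class name `StreamingWindow`, the choice of running-mean imputation, and the mask format are not part of the current DTS-MixNet implementation.

```python
from collections import deque
import math


class StreamingWindow:
    """Fixed-length sliding window with mean imputation and a validity mask."""

    def __init__(self, c, n_sensors):
        self.rows = deque(maxlen=c)        # last c (imputed) readings
        self.masks = deque(maxlen=c)       # True where the value was observed
        self.sums = [0.0] * n_sensors
        self.counts = [0] * n_sensors

    def push(self, reading):
        """Add one reading (None or NaN marks a missing value)."""
        filled, mask = [], []
        for i, value in enumerate(reading):
            missing = value is None or math.isnan(value)
            if missing:
                # running mean of observed values; 0.0 before any observation
                mean = self.sums[i] / self.counts[i] if self.counts[i] else 0.0
                filled.append(mean)
            else:
                self.sums[i] += value
                self.counts[i] += 1
                filled.append(float(value))
            mask.append(not missing)
        self.rows.append(filled)
        self.masks.append(mask)
        if len(self.rows) < self.rows.maxlen:
            return None                    # window not yet full
        return list(self.rows), list(self.masks)


stream = StreamingWindow(c=2, n_sensors=2)
first = stream.push([1.0, 2.0])
second = stream.push([3.0, None])
third = stream.push([float("nan"), 4.0])
print(second, third)
```

Each call costs O(N) and the window is available as soon as c readings have arrived, so the model can be applied to every new time step without re-reading past data.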

5. Experiments

5.1. Dataset

This paper utilizes two real-world benchmark datasets to evaluate multivariate time series anomaly detection methods: SWaT [6] and WADI [45]. SWaT (Secure Water Treatment) and WADI (Water Distribution) are two widely used industrial control system (ICS) datasets for cybersecurity and anomaly detection research. They are primarily employed to detect and study potential attacks or anomalous behaviors in industrial systems.

SWaT is a model of a small-scale water treatment plant, simulating six stages of a real-world water treatment process, including chemical treatment, filtration, and purification. The plant is equipped with various sensors and actuators. The dataset consists of time series data from multiple sensors and actuators, capturing both non-anomalous operations and attack scenarios.
WADI is a model of a water distribution network that simulates the water distribution process in the real world, covering multiple stages from water storage to distribution. This system is more complex than SWaT and involves larger-scale operations. The WADI dataset records sensor readings and actuator states, with the data also being time series, encompassing both non-anomalous operations and intentionally injected attack behaviors.

Details of the two datasets are summarized in Table 3.

5.2. Baseline Methods

We compared the performance of our method with seven popular anomaly detection methods, including:

LSTM-VAE (Long Short-Term Memory Variational Autoencoder) [46]: This model combines Long Short-Term Memory (LSTM) networks with a Variational Autoencoder (VAE), primarily used for anomaly detection in time series data. The model leverages LSTM’s ability to capture temporal dependencies and VAE’s capability in generating and modeling complex data distributions.
DAGMM (Deep Autoencoding Gaussian Mixture Model) [47]: A deep learning method for anomaly detection that integrates an Autoencoder with a Gaussian Mixture Model (GMM). It is effective in detecting anomalies in high-dimensional data.
MAD-GAN (Multivariate Anomaly Detection using Generative Adversarial Networks) [48]: A GAN-based method for anomaly detection in multivariate time series. MAD-GAN uses a generator-discriminator architecture to learn the distribution of time series data, detecting anomalies based on the discriminator’s ability to distinguish between non-anomalous and anomalous data.
MTAD-GAT (Multivariate Time-series Anomaly Detection using Graph Attention Networks) [49]: This method uses Graph Attention Networks (GAT) for multivariate time series anomaly detection. It efficiently captures the complex dependencies between variables in time series data and detects anomalies by modeling dynamic changes over time.
GDN (Graph Deviation Network) [12]: A multivariate time series anomaly detection method based on Graph Neural Networks (GNN). GDN builds and learns a relational graph between different variables in time series data, capturing the dependencies between them to detect anomalies. It identifies anomalies by learning deviations from non-anomalous behavior using GNNs.
GRN (GRU-based Interpretable Multivariate Time Series Anomaly Detection Model) [50]: GRN is a deep learning model that combines GRU (Gated Recurrent Unit) and interpretability techniques, designed to handle multivariate time series data from multiple sensors in industrial control systems. The model learns the temporal dependencies within the time series to capture non-anomalous patterns and identifies anomalies by calculating the prediction error.
ECNU-GNN (Edge Conditional Node Update Graph Neural Network) [51]: This model is a graph neural network-based model for multivariate time series anomaly detection, which captures complex temporal and relational dependencies by conditionally updating node states. The model represents time series data as a graph, where each node corresponds to a time step or sensor, and edges represent the relationships between them. By conditionally updating node states based on edge features, ECNU-GNN can more accurately learn non-anomalous behavior patterns and effectively detect anomalies in systems, making it particularly useful for anomaly detection in complex environments like sensor networks and industrial control systems.

5.3. Evaluation Metrics

We evaluate the performance of our method and other baseline models using precision (Prec), recall (Rec), and F1-score (F1) on the test dataset for anomaly detection. These metrics are defined as follows:

Precision (Prec): Precision is the ratio of true anomalies correctly identified by the model to the total number of data points predicted as anomalies.

\mathrm{Prec} = \frac{TP}{TP + FP} \times 100 \%

(24)

It measures the model’s ability to avoid false positives, indicating how accurate the anomaly predictions are.

Recall (Rec): Recall is the ratio of true anomalies correctly identified by the model to the total number of actual anomalies in the dataset.

\mathrm{Rec} = \frac{TP}{TP + FN} \times 100 \%

(25)

It evaluates the model’s capability to detect as many true anomalies as possible, focusing on minimizing false negatives.

F1-Score (F1): The F1-score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall, especially when dealing with imbalanced data.

\mathrm{F1} = 2 \times \frac{\mathrm{Prec} \times \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}} \times 100 \%

(26)

A higher F1-score reflects an improved harmonic mean of precision and recall, indicating that the model is simultaneously achieving higher detection coverage and lower false alarms. Nonetheless, the F1-score has inherent limitations: it implicitly weights precision and recall equally and may be misleading in skewed or safety-critical scenarios. Therefore, while F1 provides a convenient summary metric for comparative evaluation, it should be interpreted alongside precision and recall rather than as a definitive indicator of performance.
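For concreteness, the three metrics can be computed from point-wise binary labels as in the sketch below. The function name and the toy label vectors are illustrative; zero denominators are mapped to 0, which is a convention assumed here rather than stated in the text.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (Prec, Rec, F1) in percent for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    prec = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    rec = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    # Prec and Rec are already percentages, so their harmonic mean is too.
    f1 = 2.0 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
prec, rec, f1 = precision_recall_f1(y_true, y_pred)
print(prec, rec, f1)
```

The toy example has three true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 75%.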

Moreover, while precision, recall, and F1-score are widely used in machine learning evaluations and facilitate direct comparison with prior anomaly detection studies, they do not fully capture the asymmetric importance of false alarms versus missed detections in industrial control systems. In statistical process monitoring, thresholds are often chosen to control the false alarm rate (FAR) or the in-control average run length (IC ARL). In our framework, the threshold

δ

is tuned on a validation set to maximize the F1-score, ensuring consistency with existing baselines. Extending the framework to incorporate FAR- or ARL-based thresholding criteria is an important direction for future research.

5.4. Experimental Setup

(1): Parameter Selection

Our model was trained end-to-end using the Adam optimizer with a learning rate set to

0.001

. Training was conducted for

25

epochs, and

10 %

of the training samples were reserved as a validation set to monitor convergence and prevent overfitting. The number of nearest across-time neighbors

K

was set to

10

for the SWaT dataset and

15

for the WADI dataset. The number of attention heads in the Temporal Graph Structure Learner (TGSL) was set to

Q = 4

. The parameter

α

, used for constructing the local graph, was set to

α = 1.5

for SWaT and WADI, while

β

, used for constructing the global graph, was set to

β = 1.5

for SWaT and WADI. The Dilated Inception Layer (DIL) within the Spatiotemporal Mixer (TSM) utilized a kernel size combination of

{2, 3, 5, 7}

. The hidden channel dimension

d

for intermediate representations was configured to

64

. For all datasets, the time window length

c

was set to

30

, the number of sub-blocks

m

was set to

6

, and the sub-block length

w

was set to

5

. This configuration ensures consistency across datasets while allowing the model to adapt to the unique characteristics of each one.

All hyperparameters were tuned on a held-out validation split using a grid search with a blocked temporal split (train

\to

validation

\to

test) to avoid temporal leakage. For each candidate configuration, the model was trained for up to

25

epochs with early stopping on the validation loss. After training, the anomaly detection threshold

δ

was calibrated on the validation set by sweeping candidate values and selecting the one that maximized the validation F1-score, balancing false positives and false negatives. The best configuration was then re-trained on the combined training and validation sets, and the fixed

δ

was applied for testing. This procedure ensures comparability with existing baselines while preventing information leakage.
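The threshold calibration step described above amounts to a one-dimensional sweep over candidate values of δ on the validation scores. The sketch below shows one way to do this; the function names, the use of the observed validation scores as the candidate set, and the toy data are assumptions made for illustration.

```python
def f1_from_predictions(labels, preds):
    """F1 as a fraction in [0, 1]; returns 0 when there are no true positives."""
    tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    return 2.0 * tp / (2 * tp + fp + fn) if tp else 0.0


def calibrate_threshold(scores, labels, candidates=None):
    """Return the (delta, F1) pair that maximizes validation F1."""
    if candidates is None:
        candidates = sorted(set(scores))      # assumed candidate grid
    best_delta, best_f1 = None, -1.0
    for delta in candidates:
        preds = [1 if s > delta else 0 for s in scores]   # same rule as A(t) > delta
        f1 = f1_from_predictions(labels, preds)
        if f1 > best_f1:                      # ties keep the smallest delta
            best_delta, best_f1 = delta, f1
    return best_delta, best_f1


val_scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]  # toy validation anomaly scores
val_labels = [0, 0, 0, 1, 1, 0]
delta, best_f1 = calibrate_threshold(val_scores, val_labels)
print(delta, best_f1)
```

On this toy validation set the sweep selects δ = 0.4, the smallest candidate that separates the two anomalous points from the normal ones; the selected value would then be kept fixed for the test set.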

The specific grid ranges explored for each hyperparameter are summarized in Table 4. We focus the sensitivity analysis on the sliding-window length

c

and the number of attention heads

Q

(see Section 5.9), as these parameters directly control the temporal receptive field and attention capacity of TGSL/CTDE. Other hyperparameters were fixed to the values yielding the best validation performance for fairness and computational efficiency.

(2): Baseline Setup.

All reproduced baselines were trained under a unified configuration:

25

epochs, window length

c = 30

, stride

s = 1

, and—where applicable—attention heads

Q = 4

. We used the Adam optimizer with a fixed learning rate

0.001

across reproduced methods. All models shared the same train/validation/test splits and Z-score normalization (statistics from training only). To avoid threshold-selection artifacts, the decision threshold

δ

for each method was tuned on a held-out validation split by maximizing F1 and then kept fixed for the test set.

5.5. Comparative Study

Table 5 reports the precision, recall, and F1-score of all compared models. For ECNU-GNN, the experimental results are taken from [51]. Notably, DTS-MixNet achieves the highest precision among the compared methods. Other graph-based methods, such as GDN, MTAD-GAT, and ECNU-GNN, also achieve commendable performance, underscoring the general effectiveness of graph structures in capturing inter-variable relationships within MTS, and DTS-MixNet achieves further improvements over them.

Overall, the results indicate measurable gains of DTS-MixNet over strong graph baselines on both datasets. While bootstrap confidence intervals could in principle provide additional statistical rigor, we follow the prevailing practice in anomaly detection benchmarks and report only point estimates for clarity and comparability.

5.6. Ablation Study

To quantitatively evaluate the impact and necessity of each core component within the proposed DTS-MixNet framework, we conducted comprehensive ablation experiments on the SWaT and WADI datasets. In these experiments, specific modules or mechanisms were systematically removed or simplified, and the resulting impact on anomaly detection precision was measured. The detailed settings are presented below, and the results are summarized in Table 6.

Temporal Modeling Settings:

The Temporal Graph Structure Learner (TGSL) and Cross-Temporal Dynamic Encoder (CTDE) are designed to capture intricate dependencies along the time axis. We evaluated their importance through the following ablation settings:

w/o Temporal Attention: The multi-head attention mechanism (Equations (7) and (8)) within TGSL was removed. Instead, the aggregation in CTDE (Equation (10)) used simpler, non-learned weights (e.g., uniform weights based on the binary graph $W$ from Equation (6)).
w/o Across-Time Neighbors: The learned across-time neighbors ( $N_{a c r}$ , Equation (3)) were excluded from TGSL. Only the adjacent-time neighbors ( $N_{a d j}$ , Equation (4)) were used to construct the temporal graph.
w/o Temporal Graph Modeling: The entire TGSL and CTDE pipeline was bypassed. The raw input sequence $S^{t}$ was directly fed into subsequent modules, isolating the contribution of dynamic temporal graph learning and encoding.

Spatial Modeling Settings:

The Cross-Variable Dynamic Encoder (CVDE) focuses on capturing relationships between sensors (variables). Its contribution was assessed via the following:

w/o Spatial Graphs: The local and global spatial graphs ( $O^{i}$ and $O$ , Equations (12)–(14)) were replaced by a single static graph (e.g., one built with a $k$ -nearest-neighbors approach), which was used for the spatial mixing in TSM.
w/o TSM Feature Enhancement: The initial feature enhancement steps within the TSM, specifically the convolution and the Dilated Inception Layer (DIL), were removed. Consequently, the Proxy Multivariate Sequence $P$ , after being segmented into $P^{i}$ , was fed directly into the MixHop component.

The results demonstrate a significant performance drop across all ablation scenarios, particularly when the entire dynamic temporal modeling is removed (w/o Temporal Graph Modeling) or when dynamic spatial graphs are replaced by static ones (w/o Spatial Graphs). This confirms that explicitly modeling dynamic temporal relationships (including both adjacent and weighted across-time dependencies) is crucial. Furthermore, the decrease in precision observed in the w/o Spatial Graphs setting highlights the critical importance of adaptively capturing inter-variable relationships based on the current window’s context using DTW (via CVDE), rather than relying solely on static assumptions. The removal of TSM feature enhancement (w/o TSM Feature Enhancement) also indicates the value of refining the temporal representations before spatiotemporal fusion.

5.7. Interpretability of Model

In this section, we explore the interpretability of our model, focusing on two critical aspects: the interpretability of sensor embeddings and the interpretability of attacks. These components provide valuable insights into how the model makes decisions, enabling us to understand and trust the results of the anomaly detection process.

(1): Interpretability of Embedding Vectors for Sensors

We present the t-SNE visualizations of sensor embeddings for both the SWaT and WADI datasets. These visualizations provide insight into how sensor embeddings, learned by our model, are organized in a low-dimensional space, reflecting the similarity of the sensors based on their measurements.

Figure 3a shows the t-SNE [52] representation of the learned sensor embeddings on SWaT colored by eight sensor classes. We observe that sensors such as HMI_FIT101, HMI_FIT201, and HMI_FIT301 form a distinct cluster, indicating that these sensors, which measure similar parameters in the SWaT system, are closely embedded in the low-dimensional space. This clustering highlights the model’s ability to capture the functional relationships between sensors. The inset further emphasizes this by zooming in on the cluster of HMI_FIT sensors, which are closely related.

Similarly, Figure 3b illustrates the t-SNE representation for the WADI dataset, which contains

12

sensor classes. Despite the increased complexity, clear clustering patterns are still evident. Sensors like 2_FIC_101_SP, 2_FIC_201_SP, and 2_FIC_301_SP form a compact cluster, demonstrating that the model effectively captures the functional relationships between sensors in the WADI system. The inset highlights this specific cluster of 2_FIC sensors, showing how closely they are embedded in relation to each other.

These visualizations confirm the effectiveness of the learned embeddings in capturing both temporal and spatial dependencies, as the model groups sensors that measure similar parameters together in the embedding space. The tight local clusters are a key feature, enhancing the interpretability of the sensor representations learned by the model.

(2): Interpretability of Attacks

The goal of this experiment is to reveal which sensors our model flags as most anomalous during attack periods and how these sensors interact with the rest of the network. Figure 4a,b show sensor graphs for the SWaT and WADI datasets. In these graphs, each node represents a sensor, and the edges between them represent functional dependencies. The red triangles indicate the sensors with the highest anomaly scores, which are the most likely to be affected by an attack or exhibit anomalous behavior.

In Figure 4a (SWaT), HMI_MV_101_STATUS has the highest anomaly score, suggesting that it is either the compromised sensor or closely related to an attacked one. The graph highlights the strong correlation between HMI_MV_101_STATUS and HMI_LIT_101_PV, which is physically plausible: HMI_LIT_101_PV measures and transmits information about the raw water volume and liquid level, which are controlled by the valve HMI_MV_101_STATUS. Similarly, in Figure 4b (WADI), 1_MV_001_STATUS shows the highest anomaly score, and it is strongly connected to 1_AIT_001_PV and 1_FIT_001_PV, suggesting a critical relationship between these sensors. The red triangle indicates that 1_MV_001_STATUS might be affected by an anomaly, which could propagate to related sensors.

These visualizations help identify critical sensors and their interdependencies, demonstrating the model’s ability to capture functional relationships and detect anomalies within the system.
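The reading of Figure 4 described above (pick the sensor with the highest anomaly score, then inspect its strongest graph neighbours) can be sketched as follows. This is an illustration only: the function name `most_anomalous_with_neighbors`, the dictionary layout, and all numeric scores and edge weights are hypothetical and are not taken from the model output.

```python
def most_anomalous_with_neighbors(scores, weights, k=2):
    """Return the top-scoring sensor and its k strongest graph neighbours.

    scores:  dict mapping sensor name -> anomaly score
    weights: dict mapping sensor name -> {neighbour name: edge weight}
    """
    top = max(scores, key=scores.get)
    ranked = sorted(weights.get(top, {}).items(),
                    key=lambda item: item[1], reverse=True)
    return top, [name for name, _ in ranked[:k]]

# Made-up scores and edge weights for three WADI sensors named in the text.
scores = {"1_MV_001_STATUS": 0.9, "1_AIT_001_PV": 0.4, "1_FIT_001_PV": 0.5}
weights = {"1_MV_001_STATUS": {"1_AIT_001_PV": 0.7, "1_FIT_001_PV": 0.8}}

top, neighbours = most_anomalous_with_neighbors(scores, weights)
print(top, neighbours)
```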

The purpose of this experiment is to evaluate the interpretability of our model in detecting anomalous behaviors associated with potential attacks. By analyzing the predicted sensor values against the observed ones, we aim to identify which sensors exhibit anomalous behavior during the attack periods, helping us better understand the model’s decision-making process. Figure 5a,b show the predicted and observed values for sensors from the SWaT and WADI datasets, respectively. The red-highlighted areas represent the regions identified as anomalies by our model.

Figure 5a illustrates the anomaly detection for sensors in the SWaT dataset. The top plot shows the FIT101 sensor, where the prediction (solid black line) deviates significantly from the observation (dashed green line) in the red-highlighted region, indicating an anomaly. Similarly, in the bottom plot, the MV101 sensor exhibits a clear discrepancy between the predicted and observed values in the same region, further confirming the detection of an anomaly. Figure 5b presents the anomaly detection results for the WADI dataset. The top plot shows 1_FIT_001_PV, where the predicted values diverge from the observed values in the red area, signaling an anomaly. In the bottom plot, the 1_MV_001_STATUS sensor also shows a similar deviation in the red-highlighted region, identifying it as another anomalous behavior.

The red-highlighted regions in both figures represent the times when the sensors exhibit unusual behavior, which is crucial for understanding system faults or potential attacks. These visualizations demonstrate how the model identifies and isolates anomalies by comparing predicted and observed sensor readings.
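The comparison of predicted and observed readings can be sketched for a single sensor as below. This is a simplified illustration, assuming a plain absolute-error rule with a fixed threshold; the model's actual anomaly scoring (Section 4.1) is not reproduced here, and the function name `flag_regions`, the threshold, and the toy readings are hypothetical.

```python
def flag_regions(predicted, observed, threshold):
    """Return inclusive (start, end) index pairs where |pred - obs| > threshold."""
    n = min(len(predicted), len(observed))
    regions, start = [], None
    for t in range(n):
        if abs(predicted[t] - observed[t]) > threshold:
            if start is None:
                start = t          # a new flagged region begins here
        elif start is not None:
            regions.append((start, t - 1))
            start = None
    if start is not None:          # region still open at the end of the series
        regions.append((start, n - 1))
    return regions

# Toy readings (made-up values): two stretches deviate from the prediction.
predicted = [1.0] * 8
observed = [1.0, 1.1, 3.0, 3.2, 1.0, 0.9, 4.0, 4.1]
print(flag_regions(predicted, observed, threshold=0.5))
```

Each returned pair corresponds to one contiguous highlighted band of the kind shown in Figure 5.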

(3): Practical Implications and Future Work

While the embedding and attack interpretability analysis provides important insights into how our model detects anomalies, further expansion of this section could be beneficial for practitioners in real-world applications. Based on our current findings, we outline practical directions for enhancing the interpretability and utility of DTS-MixNet:

Sensor Role Identification: By examining the clusters in the sensor embedding space, we can classify sensors based on their functional roles in the system. This could help practitioners understand which sensors are crucial for the overall system health, and prioritize them for
Anomaly Propagation Mapping: The attack attribution graphs (Figure 4 and Figure 5) can be extended to visualize how anomalies propagate across different layers or modules of the system. This can help identify the root causes of failures and how they spread across the network.
Real-time Anomaly Monitoring: To improve the practical utility of our model, we are exploring ways to incorporate these interpretability features into a real-time monitoring dashboard. This would allow operators to visualize sensor embeddings, track anomalies, and identify critical sensors in the context of their system, making anomaly detection more actionable and transparent.

Future work will focus on enhancing the interpretability of the model by providing not only visualizations but also more actionable insights. This could include tools such as attribution summaries, time–sensor heatmaps, and detailed analysis of anomaly propagation, which would make the anomaly detection process more understandable and useful for practitioners.

5.8. Correlation Heatmap Analysis

It is insightful to visualize the inherent complexity of inter-sensor relationships within real-world multivariate time series data. To this end, Figure 6 presents the correlation matrix heatmap calculated for the SWaT dataset. This visualization underscores the necessity for models capable of capturing complex cross-variable interactions, motivating design choices within DTS-MixNet, particularly the Cross-Variable Dynamic Encoder (CVDE).

The heatmap reveals several key characteristics indicative of complex system behavior:

Strong Correlations: Distinct blocks of strong positive (deep red) and negative (deep blue) correlations are clearly evident between various sensor pairs. For example, notable correlations exist within the AIT501-FIT503 sensor block. This signifies strong mutual influence or dependency between these sensors, highlighting that they cannot be effectively modeled in isolation.
Group Structures: Clusters of sensors often exhibit similar correlation patterns when compared to other groups. For instance, the sensors in the PIT501-P603 group demonstrate cohesive behavior in their correlations. These groupings may hint at underlying functional modules or subsystems within the monitored physical process.

This analysis of inherent data complexity complements the ablation study in Section 5.6. Having visually established the importance of capturing inter-sensor dependencies, we can relate it to the quantitative assessment of how the different components of DTS-MixNet contribute to this goal and affect overall anomaly detection performance.
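A correlation matrix of the kind visualized in Figure 6 can be computed as in the sketch below. It assumes the Pearson coefficient, which the text does not state explicitly; the function names and the toy series `s1`, `s2`, `s3` are hypothetical and carry no SWaT data.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def correlation_matrix(series):
    """Map each pair of sensor names to the correlation of their series."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names} for a in names}

# Toy series (made-up values): s2 rises with s1, s3 falls as s1 rises.
toy_series = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.0],
    "s3": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(toy_series)
print(round(corr["s1"]["s2"], 3), round(corr["s1"]["s3"], 3))
```

Values near +1 and -1 correspond to the deep red and deep blue cells of the heatmap, respectively.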

5.9. Sensitivity

In this experiment, we investigate the impact of two critical hyperparameters, the sliding window length $c$ and the number of attention heads $Q$, on the model performance, measured by the F1 score. We conducted experiments on two datasets, SWaT and WADI, to assess the sensitivity of the model to these hyperparameters.

(1): Sliding Window Length $c$

The sliding-window length $c$ controls the temporal receptive field: larger $c$ incorporates longer context, whereas smaller $c$ emphasizes recent dynamics with higher variance.

In Figure 7a, we show the F1 score variations for different sliding window lengths on the SWaT and WADI datasets. The selected window lengths are 10, 20, 30, 40, 50, and 60. To ensure fair comparison, the sub-block length $w$ was fixed at 5 in all settings, regardless of the value of $c$. The F1 score increases gradually as the sliding window length increases. Notably, when the window length reaches 30, there is a significant improvement in performance. Longer windows allow the model to capture deeper temporal dependencies, thus improving the F1 score.
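The roles of the window length $c$ and the sub-block length $w$ can be made concrete with a short sketch. It is an illustration under assumptions: each window of $c$ consecutive steps is paired with the step that follows it as the prediction target, and windows are stored time-major (a list of $c$ steps, each with $N$ readings) rather than as the $N \times c$ matrix of Table 1. The function names are hypothetical.

```python
def sliding_windows(series, c):
    """Yield (window, target): c consecutive time steps and the next step.

    series is a list of time steps, each a list of N sensor readings.
    """
    for t in range(c, len(series)):
        yield series[t - c:t], series[t]

def split_sub_blocks(window, w):
    """Split a window of length c into m = c / w consecutive sub-blocks."""
    if len(window) % w != 0:
        raise ValueError("window length must be a multiple of w")
    return [window[i:i + w] for i in range(0, len(window), w)]

# Toy series with N = 2 sensors and 40 time steps (made-up values).
series = [[float(t), 10.0 * t] for t in range(40)]
pairs = list(sliding_windows(series, c=30))
blocks = split_sub_blocks(pairs[0][0], w=5)
print(len(pairs), len(blocks), len(blocks[0]))
```

With $w = 5$, every window length in the tested range (10 to 60) splits evenly, giving between 2 and 12 sub-blocks per window.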

(2): Number of Attention Heads $Q$

$Q$ controls the number of parallel attention distributions used to aggregate information from neighbors. A larger $Q$ lets the model attend to multiple, complementary dependency patterns in parallel, but also increases computation and can introduce redundancy when different heads learn similar patterns (with fixed model width, each head receives fewer channels).

In Figure 7b, we vary $Q \in \{2, 3, 4, 5, 6, 7\}$. On SWaT, F1 improves from $Q = 2$ to $Q = 4$ and then saturates with small fluctuations, indicating diminishing returns beyond four heads under our protocol. On WADI, the curve is flatter: additional heads yield smaller gains, likely because higher variability reduces the benefit of adding overlapping heads. Overall, $Q = 4$ provides a reasonable trade-off between modeling diverse dependencies and computational cost.

6. Conclusions

This paper addressed the challenge of multivariate time series anomaly detection, focusing on the dynamic dependencies across both time and variables. We proposed the Dynamic Spatiotemporal Graph Mixed Network (DTS-MixNet), an effective framework designed to tackle this challenge. DTS-MixNet uniquely integrates dynamic graph learning for both temporal evolution (via TGSL and CTDE) and spatial interactions (via CVDE with DTW), followed by effective spatiotemporal fusion using a MixHop-based TSM.

Ablation studies further validated the significant contributions of its core components, confirming the importance of dynamically modeling and fusing both temporal and spatial information. While computationally more intensive than simpler models, DTS-MixNet effectively captures the intricate dynamics missed by conventional approaches. Future work could explore computational optimizations, enhanced interpretability, and applications in diverse real-world scenarios. In essence, DTS-MixNet provides a robust and effective approach for improving anomaly detection accuracy in complex, interconnected systems by adaptively learning and leveraging dynamic spatiotemporal dependencies.

Author Contributions

Conceptualization, C.T. and J.H.; methodology, C.T. and W.H.; validation, J.L.; formal analysis, C.T.; resources, M.M. and S.W.; data curation, J.L.; writing—original draft preparation, C.T. and J.H.; writing—review and editing, C.T., M.M., W.H. and S.W.; supervision, M.M., W.H. and S.W.; funding acquisition, W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by grants from the National Natural Science Foundation of China (Nos. 62101189 and U20A20228).

Data Availability Statement

The data presented in this study (SWaT and WADI) are available in the public domain at https://itrust.sutd.edu.sg/itrust-labs_datasets/ (accessed on 24 July 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Tax, D.M.J.; Duin, R.P.W. Support vector data description. Mach. Learn. 2004, 54, 45–66.
2. Kim, S.; Choi, Y.; Lee, M. Deep learning with support vector data description. Neurocomputing 2015, 165, 111–117.
3. Zhai, S.; Cheng, Y.; Lu, W.; Zhang, Z. Deep structured energy based models for anomaly detection. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1100–1109.
4. Angiulli, F.; Pizzuti, C. Fast Outlier Detection in High Dimensional Spaces. In Principles of Data Mining and Knowledge Discovery; Springer: Berlin/Heidelberg, Germany, 2002; pp. 15–27.
5. Keogh, E.; Lin, J.; Fu, A. HOT SAX: Efficiently finding the most unusual time series subsequence. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 27–30 November 2005.
6. Mathur, A.P.; Tippenhauer, N.O. SWaT: A water treatment testbed for research and training on ICS security. In Proceedings of the 2016 International Workshop on Cyber-Physical Systems for Smart Water Networks (CySWater), Vienna, Austria, 11 April 2016; pp. 31–36.
7. Choi, K.; Yi, J.; Park, C.; Yoon, S. Deep learning for anomaly detection in time-series data: Review, analysis, and guidelines. IEEE Access 2021, 9, 120043–120065.
8. Wen, T.; Keyes, R. Time series anomaly detection using convolutional neural networks and transfer learning. arXiv 2019, arXiv:1905.13628. Available online: https://arxiv.org/abs/1905.13628 (accessed on 13 July 2025).
9. An, J.; Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Spec. Lect. IE 2015, 2, 1–18.
10. Akcay, S.; Atapour-Abarghouei, A.; Breckon, T.P. GANomaly: Semi-supervised anomaly detection via adversarial training. In Computer Vision—ACCV 2018; Springer: Cham, Switzerland, 2019; pp. 622–637.
11. Lin, S.; Clark, R.; Birke, R.; Schönborn, S.; Trigoni, N.; Roberts, S. Anomaly detection for time series using VAE-LSTM hybrid model. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4322–4326.
12. Deng, A.; Hooi, B. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 4027–4035.
13. Zhang, W.; Zhang, C.; Tsung, F. GRELEN: Multivariate time series anomaly detection from the perspective of graph relational learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 2390–2397.
14. Chen, K.; Feng, M.; Wirjanto, T.S. Multivariate time series anomaly detection via dynamic graph forecasting. arXiv 2023, arXiv:2302.02051. Available online: https://arxiv.org/abs/2302.02051 (accessed on 13 July 2025).
15. Yin, S.; Ding, S.X.; Haghani, A.; Hao, H.; Zhang, P. A comparison study of basic data-driven fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process. J. Process Control 2012, 22, 1567–1581.
16. MacGregor, J.F.; Jaeckle, C.; Kiparissides, C.; Koutoudi, M. Process monitoring and diagnosis by multiblock PLS methods. AIChE J. 1994, 40, 826–838.
17. Russell, E.L.; Chiang, L.H.; Braatz, R.D. Fault detection in industrial processes using canonical variate analysis and dynamic principal component analysis. Chemom. Intell. Lab. Syst. 2000, 51, 81–93.
18. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Langs, G.; Schmidt-Erfurth, U. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 2019, 54, 30–44.
19. Zhou, C.; Paffenroth, R.C. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13 August 2017; pp. 665–674.
20. Jang, J.; Lee, H.H.; Park, J.A.; Kim, H. Unsupervised anomaly detection using generative adversarial networks in 1H-MRS of the brain. J. Magn. Reson. 2021, 325, 106936.
21. Staar, B.; Lütjen, M.; Freitag, M. Anomaly detection with convolutional neural networks for industrial surface inspection. Procedia CIRP 2019, 79, 484–489.
22. Zivot, E.; Wang, J. Vector autoregressive models for multivariate time series. In Modeling Financial Time Series with S-PLUS®; Springer: New York, NY, USA, 2006; pp. 385–429.
23. Box, G.E.; Pierce, D.A. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Am. Stat. Assoc. 1970, 65, 1509–1526.
24. Cao, L.J.; Tay, F.E.H. Support vector machine with adaptive parameters in financial time series forecasting. IEEE Trans. Neural Netw. 2003, 14, 1506–1518.
25. Connor, J.T.; Martin, R.D.; Atlas, L.E. Recurrent neural networks and robust time series prediction. IEEE Trans. Neural Netw. 1994, 5, 240–254.
26. Zhao, B.; Lu, H.; Chen, S.; Liu, J.; Wu, D. Convolutional neural networks for time series classification. J. Syst. Eng. Electron. 2017, 28, 162–169.
27. Borovykh, A.; Bohte, S.; Oosterlee, C.W. Conditional time series forecasting with convolutional neural networks. arXiv 2018, arXiv:1703.04691. Available online: https://arxiv.org/abs/1703.04691 (accessed on 16 July 2025).
28. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, Macao, China, 19–25 August 2023; pp. 6778–6786.
29. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
30. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
31. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24.
32. Zhang, R.; Zhou, Z.; Zuo, Y.; Cui, Y.; Zhang, Z. Multivariate time series anomaly detection based on graph neural network and grated neural network. In Proceedings of the International Conference on Cyber Security, Artificial Intelligence, and Digital Economy (CSAIDE 2023), Nanjing, China, 3–5 March 2023; Volume 12718, p. 127180R.
33. Xu, K.; Li, Y.; Li, Y.; Xu, L.; Li, R.; Dong, Z. Masked graph neural networks for unsupervised anomaly detection in multivariate time series. Sensors 2023, 23, 7552.
34. Lee, J.; Park, B.; Chae, D.K. DuoGAT: Dual time-oriented graph attention networks for accurate, efficient and explainable anomaly detection on time-series. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 1188–1197.
35. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81.
36. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2009, 20, 61–80.
37. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. Available online: https://arxiv.org/abs/1312.6203 (accessed on 16 July 2025).
38. Henaff, M.; Bruna, J.; LeCun, Y. Deep convolutional networks on graph-structured data. arXiv 2015, arXiv:1506.05163. Available online: https://arxiv.org/abs/1506.05163 (accessed on 16 July 2025).
39. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; p. 29.
40. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. Available online: https://arxiv.org/abs/1609.02907 (accessed on 16 July 2025).
41. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. Stat 2017, 1050, 10–48550.
42. Müller, M. Dynamic time warping. In Information Retrieval for Music and Motion; Springer: Berlin/Heidelberg, Germany, 2007; pp. 69–84.
43. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 753–763.
44. Abu-El-Haija, S.; Perozzi, B.; Kapoor, A.; Alipourfard, N.; Lerman, K.; Harutyunyan, H.; Steeg, G.V.; Galstyan, A. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 21–29.
45. Ahmed, C.M.; Palleti, V.R.; Mathur, A.P. WADI: A water distribution testbed for research in the design of secure cyber physical systems. In Proceedings of the 3rd International Workshop on Cyber-Physical Systems for Smart Water Networks, Pittsburgh, PA, USA, 21 April 2017; pp. 25–28.
46. Park, D.; Hoshi, Y.; Kemp, C.C. A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder. IEEE Robot. Autom. Lett. 2018, 3, 1544–1551.
47. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–19.
48. Li, D.; Chen, D.; Jin, B.; Shi, L.; Goh, J.; Ng, S.K. MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks. In Artificial Neural Networks and Machine Learning—ICANN 2019: Text and Time Series; Springer: Cham, Switzerland, 2019; pp. 703–716.
49. Zhao, H.; Wang, Y.; Duan, J.; Huang, C.; Cao, D.; Tong, Y.; Xu, B.; Bai, J.; Zhang, Q. Multivariate time-series anomaly detection via graph attention network. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020; pp. 841–850.
50. Tang, C.; Xu, L.; Yang, B.; Tang, Y.; Zhao, D. GRU-based interpretable multivariate time series anomaly detection in industrial control system. Comput. Secur. 2023, 127, 103094.
51. Jo, H.; Lee, S.W. Edge conditional node update graph neural network for multivariate time series anomaly detection. Inf. Sci. 2024, 679, 121062.
52. Maaten, L.V.D.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.

Figure 1. An illustrative example on SWaT dataset, where the red areas indicate anomalous events.

Figure 2. The structural diagram of our proposed framework.

Figure 3. t-SNE visualizations of sensor embeddings: (a) SWaT dataset; (b) WADI dataset.

Figure 4. Sensor graphs with anomaly detection: (a) SWaT dataset; (b) WADI dataset.

Figure 5. Anomaly detection results: (a) SWaT dataset; (b) WADI dataset.

Figure 6. Correlation matrix heatmap of SWaT dataset.

Figure 7. Sensitivity analysis of hyperparameters: (a) sliding window length $c$; (b) number of attention heads $Q$.

Table 1. The symbolic notations and their descriptions.

Notations	Descriptions	Notations	Descriptions
$X_{train} \in R^{N \times T}$	Training data sampled on N sensors and T time steps	$P \in R^{N \times c}$	Proxy Multivariate Sequence (PMS)
$X_{test} \in R^{N \times U}$	Test data sampled on N sensors and U time steps	$P^{i} |_{i = 1}^{m} \in R^{N \times w}$	The i-th locality with w = c/m
$S^{t} \in R^{N \times c}$	Sensor data at time t in a sliding window of size c	$P_{j}^{i} \in R^{w}$	The j-th row in the i-th locality
$x^{t} \in R^{N}$	Sensor data at time t	$O^{i} \in R^{N \times N}$	Local spatial adjacency matrix for the i-th locality
$\hat{x}^{t} \in R^{N}$	Predicted data at time t	$O \in R^{N \times N}$	Global spatial adjacency matrix
$G(V, E, W)$	Binary-weighted graph	$D^{i} \in R^{N \times N}$	Fused spatial adjacency matrix for the i-th locality
$G(V, E, W^{q}) |_{q = 1}^{Q}$	Attention-weighted graph under multi-head attention	$D$	Spatial Graph Sequence (SGS)
$lv_{1}, lv_{2} \in R^{c}$	Learnable vectors for nodes	$d$	Number of channels
$A \in R^{c \times c}$	Relationship matrix for measuring the time closeness	$I \in R^{d \times N \times c}$	Feature representation after $1 \times 1$ Conv in TSM
$K$	Number of nearest neighbors	$Z \in R^{d \times N \times c}$	Hidden representation in TSM
$N_{acr}(v_{i})$	Across-time neighbors of node $v_{i}$	$Z^{i} |_{i = 1}^{m} \in R^{d \times N \times w}$	The i-th segment of hidden representation Z
$N_{adj}(v_{i})$	Adjacent-time neighbors of node $v_{i}$	$F^{i} |_{i = 1}^{m} \in R^{d \times N \times w}$	The i-th feature matrix by mixing $D^{i}$ and $Z^{i}$
$N(v_{i})$	Neighbors of node $v_{i}$	$M^{t - 1} \in R^{d \times N \times c}$	Final representation of sliding window $S^{t}$
$Φ_{q}, Ψ_{q} \in R^{N \times N}$	Learnable mapping matrix for the q-th attention head	$α$, $β$	Hyperparameter scaling DTW distance
$a_{q} \in R^{2 N}$	Learnable vector for scoring neighbors in q-th attention head	$λ$	Learnable parameters

Table 2. Time complexity per module.

Module	Complexity Per Window	Description
TGSL	$O (c^{2} + Q c N^{2})$	Constructing the temporal graph structure
CTDE (L layers)	$O (L Q c N^{2})$	Encoding temporal dependencies
CVDE	$O (c^{2} N^{2})$	Encoding cross-variable dependencies
TSM and Predict	$O (d c N^{2})$	Mixing temporal and spatial information
Whole DTS-MixNet	$O ((L Q c + c^{2} + d c) N^{2})$	Total complexity of the entire model
Training Phase Total	$O ((L Q c + c^{2} + d c) T N^{2})$	Total complexity of the entire training
Testing Phase Total	$O ((L Q c + c^{2} + d c) U N^{2})$	Total complexity of the entire testing
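To give a feel for how the per-window bound in Table 2 scales with the number of sensors, the sketch below evaluates the term $(L Q c + c^{2} + d c) N^{2}$ with constants dropped. It is an order-of-magnitude illustration, not a runtime measurement. $N$ is taken from Table 3, while $c = 30$, $Q = 4$ and $d = 64$ are picked from the search ranges in Table 4 and $L = 2$ is a purely hypothetical layer count; the function name `per_window_cost` is also an assumption.

```python
def per_window_cost(N, c, Q, d, L):
    """Dominant per-window term (L*Q*c + c^2 + d*c) * N^2, constants dropped."""
    return (L * Q * c + c * c + d * c) * N * N

# N from Table 3; c, Q, d chosen from Table 4 ranges; L = 2 is hypothetical.
swat = per_window_cost(N=51, c=30, Q=4, d=64, L=2)
wadi = per_window_cost(N=127, c=30, Q=4, d=64, L=2)
print(swat, wadi, round(wadi / swat, 2))
```

With identical hyperparameters, the ratio between the two datasets depends only on $N^{2}$, that is $(127/51)^{2} \approx 6.2$, which is why the sensor count dominates the cost on WADI.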

Table 3. Statistics about two datasets.

Datasets	# Features	# Train	# Test	Anomalies
SWaT	51	47,520	44,991	12.22%
WADI	127	76,297	17,280	5.84%

Table 4. Hyperparameter search ranges.

Hyperparameter	Search Range
Window length $c$	$\{10, 20, 30, 40, 50, 60\}$
Attention heads $Q$	$\{2, 3, 4, 5, 6, 7\}$
Across-time neighbors $K$	SWaT: $\{8, 10, 12, 14\}$; WADI: $\{12, 15, 18, 21\}$
Hidden channels $d$	$\{32, 64, 128\}$
DTW scaling $α, β$	$\{0.5, 1.0, 1.5, 2.0\}$
Learning rate	$\{10^{-3}, 5 \times 10^{-4}, 10^{-4}\}$
DIL kernel set	$\{2, 3, 5, 7\}$

Table 5. The accuracy in terms of precision (%), recall (%) and F1-score.

	SWaT			WADI
Methods	Pre	Rec	F1	Pre	Rec	F1
LSTM-VAE [46]	96.24	59.91	0.7384	84.79	14.45	0.2469
DAGMM [47]	27.46	69.52	0.3936	54.44	26.99	0.3608
MAD-GAN [48]	96.58	57.50	0.7206	87.33	14.87	0.2541
MTAD-GAT [49]	95.15	61.87	0.7495	62.10	18.57	0.2645
GDN [12]	88.03	62.75	0.7280	66.25	31.10	0.4206
GRN [50]	98.37	61.09	0.7537	35.84	73.98	0.4828
ECNU-GNN [51]	98.45	68.91	0.8089	76.52	46.10	0.5730
DTS-MixNet (Ours)	98.59	69.58	0.8158	87.34	43.65	0.5821
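The relation between the columns of Table 5 can be illustrated with a small worked check. The sketch assumes the standard definition of F1 as the harmonic mean of precision and recall, $F1 = 2PR / (P + R)$, and applies it to the DTS-MixNet row only; the function name `f1_score` is hypothetical.

```python
def f1_score(precision_pct, recall_pct):
    """F1 as the harmonic mean of precision and recall given in percent."""
    p, r = precision_pct / 100.0, recall_pct / 100.0
    if p + r == 0:
        return 0.0  # convention when both precision and recall are zero
    return 2 * p * r / (p + r)

# DTS-MixNet row of Table 5: (precision %, recall %) per dataset.
print(round(f1_score(98.59, 69.58), 4))  # SWaT
print(round(f1_score(87.34, 43.65), 4))  # WADI
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall on WADI limits the F1 score there even though precision stays high.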

Table 6. The precision (%) after removing core components.

Settings	SWaT	WADI
DTS-MixNet	98.59	87.34
DTS-MixNet w/o. Temporal Attention	87.58	80.78
DTS-MixNet w/o. Across-Time Neighbors	93.14	84.34
DTS-MixNet w/o. Temporal Graph Modeling	83.29	72.36
DTS-MixNet w/o. Spatial Graphs	86.78	76.57
DTS-MixNet w/o. TSM Feature Enhancement	83.37	78.43

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tan, C.; Hu, J.; Li, J.; Miao, M.; Hu, W.; Wang, S. DTS-MixNet: Dynamic Spatiotemporal Graph Mixed Network for Anomaly Detection in Multivariate Time Series. Big Data Cogn. Comput. 2025, 9, 245. https://doi.org/10.3390/bdcc9100245
