1. Introduction
With the proliferation of intelligent surveillance systems, video anomaly detection (VAD) has become a cornerstone technology for enhancing public safety [
1,
2], optimizing traffic management [
3], and combating financial fraud [
4]. Unlike conventional computer vision tasks, VAD is inherently challenging due to the open-set nature of anomalies—abnormal events are rare, context-dependent, and semantically diverse [
5]. For instance, running is normal in a park but anomalous in a hospital hallway.
Despite the remarkable progress of deep learning in video anomaly detection [
6], most state-of-the-art approaches still rely on supervised or weakly supervised learning, which demands large-scale datasets with frame-level annotations [
7,
8,
9]. This reliance hinders scalability, as annotating anomalies is both labor-intensive and inherently subjective. Reconstruction-based methods often struggle with the complexity of high-dimensional video data, while prediction-based techniques fail to capture fine-grained spatiotemporal anomalies, such as subtle aggressive gestures. More importantly, existing methods frequently overlook human pose dynamics—an essential cue for detecting localized anomalies that cannot be effectively modeled using raw pixels or optical flow alone [
10].
Training an effective anomaly detector typically requires large-scale datasets with fine-grained, frame-level annotations—a process that is both time-consuming and resource-intensive. This limitation has motivated increasing interest in unsupervised VAD approaches, which can eliminate the need for manual labeling. Lara et al. [
11] leverage a temporal convolutional network to model video temporal structures and propose a method to constrain the large search space in the multiple instance learning (MIL) framework. Building on MIL with ranking, Xu et al. [
12] introduce a temporal enhancement network to learn motion-aware features. Doshi et al. [
13] design an anomaly regression network combined with two novel loss functions to detect anomalies online. De et al. [
14] propose a self-attentive integrated conditional random field to model short-range correlations. Wu et al. [
15] develop a multi-scale continuous perception refinement network to enhance MIL-based detection. Feng et al. [
16] present a multi-instance self-training framework that generates segment-level pseudo-labels by selectively sampling sparse video fragments. Although these unsupervised or weakly supervised methods have achieved promising results, they often remain sensitive to background clutter, lighting variations, and other irrelevant contextual factors—limiting their robustness in complex real-world scenarios.
To address these limitations, we propose UHPose-VAD, a novel unsupervised framework that leverages spatiotemporal graph learning on human poses for video anomaly detection. It is important to clarify the scope of “unsupervised” in our work. Following the common practice in video anomaly detection, we consider a method unsupervised if it does not utilize any video-level or frame-level abnormality labels during training. However, like most modern VAD approaches, our framework leverages a pre-trained pose estimator (trained on large-scale static image datasets like COCO) to obtain robust human keypoint representations. This pre-training provides a strong anatomical prior that is task-agnostic—it encodes general knowledge about human body structure, not information about what constitutes an “anomaly” in a specific surveillance scene. This distinction is crucial: our contribution lies in how we unsupervisedly learn the manifold of normal motion patterns from pose sequences, without ever seeing annotated anomalies.
Our approach is motivated by the premise that human skeleton sequences provide a robust, high-level representation that is invariant to appearance changes and explicitly encodes structural and dynamic information. The proposed framework begins with a pose extraction (PE) module that obtains human keypoints, which are tracked to form temporal pose graphs. A normalizing flow feature mapping (NFFM) module enriches these graphs with motion context. Subsequently, our core normalizing flow graph convolution (NFGC) module adaptively learns the graph connectivity to model the latent spatiotemporal distribution of normal poses. Finally, a Gaussian Estimation (GE) module fits this distribution, allowing anomalies to be detected as significant deviations.
Our main contributions are summarized as follows:
We propose a fully unsupervised VAD framework that integrates adaptive graph learning with flow-based density estimation in a novel way. Unlike prior works that treat graph construction and anomaly scoring separately, our framework jointly learns spatiotemporal dependencies and the normal motion manifold through end-to-end optimization, without requiring any manual annotations.
We propose a NFGC module that dynamically learns spatiotemporal dependencies among body joints. Unlike prior works using fixed skeletal graphs, our module fuses three complementary adjacency matrices—attention-based dynamic matrix, globally learnable matrix, and static anatomical prior—to capture both physical constraints and latent motion synergies (e.g., coordinated limb movements during running) that are critical for detecting subtle anomalies.
By coupling graph-based feature extraction with a Gaussian Mixture Model via normalizing flows, our method replaces heuristic anomaly scores (e.g., reconstruction error) with a rigorous likelihood-based metric, directly measuring deviations from the learned normal motion manifold. This probabilistic formulation provides more interpretable and theoretically grounded anomaly detection.
Extensive experiments on ShanghaiTech and UBnormal datasets demonstrate that UHPose-VAD achieves new state-of-the-art AUC scores of 86.1% and 69.4%, respectively. We also provide comprehensive ablation studies, robustness analysis under realistic corruptions, and model complexity analysis to validate the effectiveness and efficiency of our approach.
3. Methodology
We propose UHPose-VAD, an unsupervised framework for video anomaly detection that leverages human pose dynamics and spatiotemporal graph learning to model normal behavior patterns without requiring any annotated data. Specifically, our novelty lies in: (1) adaptively learning graph connectivity from pose sequences without any supervision, (2) coupling this with a flow-based invertible mapping to a latent space where a Gaussian mixture model captures multi-modal normal motion patterns, and (3) using a rigorous likelihood-based anomaly score instead of heuristic reconstruction errors. The framework is composed of four key components: pose extraction, normalizing flow feature mapping, normalizing flow graph convolution and Gaussian Estimation, as validated in
Figure 1.
3.1. Pose Extraction
As illustrated in
Figure 1, our approach operates under a fully unsupervised setting, where neither frame-level nor video-level anomaly labels are provided. To effectively model human motion patterns, we begin by extracting pose sequences from raw video inputs and constructing temporally aligned representations.
We first employ a region-based multi-person pose estimation framework on each frame of the video sequence V. Specifically, human instances are detected using bounding boxes, and their keypoint coordinates are estimated via symmetric spatial transformation networks combined with parameterized pose non-maximum suppression to eliminate redundant or overlapping detections. To improve robustness and spatial consistency, we further utilize pose-guided region proposal generators that adaptively sample diverse pose configurations during training.
After detecting human poses in each segment , we perform temporal tracking across consecutive frames to establish continuous pose trajectories throughout the video. A sliding window is applied to partition the video into overlapping clips of fixed length, ensuring consistent temporal granularity. Each pose sequence is then normalized to have zero mean and unit variance across joints and frames, which facilitates stable feature learning in subsequent modules.
To detect skeletons in each video frame independently, we utilize AlphaPose [
27] with the YOLOX [
28] detector. We then use PoseFlow [
29] to track the skeletons across a video. After that, we divide each pose sequence into fixed-length segments using a sliding window approach. Finally, we normalize the pose coordinates by the image size to ensure scale invariance. This preprocessing stage enables the extraction of motion-relevant structural features while filtering out irrelevant pixel-level variations such as lighting and background clutter.
We acknowledge that AlphaPose, YOLOX, and PoseFlow are trained on large-scale labeled datasets (e.g., COCO for pose estimation). However, this does not violate the unsupervised setting of our framework for two reasons. First, these components are used only as fixed feature extractors; they are not fine-tuned on our target datasets, and they never access any anomaly labels (frame-level or video-level). Their role is to provide a generic human anatomical prior—knowledge about what a human skeleton looks like—which is task-agnostic and does not encode dataset-specific normality or abnormality. Second, the entire training process of UHPose-VAD operates without any anomaly annotations; we only use unlabeled video clips to learn the distribution of normal motion patterns. Therefore, while the pose estimator itself is pretrained in a supervised manner, our overall pipeline remains fully unsupervised for the task of anomaly detection. This practice is standard in the VAD literature and does not compromise the fairness of comparison with other unsupervised methods.
3.2. Normalizing Flow Graph Convolution
The normalizing flow graph convolution (NFGC) module processes the pose features extracted by the preceding pose extraction module. As illustrated in
Figure 1, this process begins with normalizing flow feature mapping (NFFM), which transforms the input pose keypoints through a series of operations. Two successive
convolutional layers project the keypoints into a high-dimensional feature space, with the results fused back to the input via residual connections. An ActNorm layer is then applied to normalize the activations to zero mean and unit variance using trainable per-channel scale and offset parameters, thereby stabilizing training dynamics. Subsequently, a learned reversible
convolution, initialized as a random rotation matrix, permutes the channel order to enhance feature diversity while preserving information flow.
The features produced by the NFFM are then modeled by the core NFGC module. We construct an undirected spatiotemporal graph
, where each node
represents a joint
s at time step
t, with
joints per frame and
total frames. Each edge
L models the spatial or temporal relationship between pairs of joints. The graph convolution operation is formulated as:
where
denotes the neighborhood of node
, and
represents the learned connection weights.
To comprehensively capture spatiotemporal relationships, NFGC leverages three complementary adjacency matrices. The attention-based adjacency matrix
is a sample-specific, three-dimensional matrix computed using multiplicative attention. It is derived by embedding the input features twice using two sets of learned weights, followed by a transposed dot product and normalization to generate dynamic inter-joint dependencies. The global learnable adjacency matrix
is a sample-invariant, fully learnable matrix updated over
K iterations, capturing universal motion patterns consistent across all inputs. The static adjacency matrix
is a fixed matrix that reflects prior human skeletal connectivity, shared across all layers. Each of these matrices drives a separate spatio-temporal GCN layer with independent weights. Their outputs are fused using a weighted summation:
where
are trainable coefficients, thereby combining the strengths of dynamic patterns (
), global consistency (
), and anatomical priors (
).
The attention-based adjacency matrix
captures sample-specific, dynamic inter-joint dependencies. It is computed using a multiplicative attention mechanism:
where
is the input feature matrix (with
d denoting feature dimension), and
are learnable weight matrices that project the input into query and key spaces. The scaling factor
stabilizes gradients during training. This formulation allows
to dynamically assign high attention weights to joint pairs that exhibit strong motion correlation in the current sample, even if they are not directly connected in the human skeleton (e.g., left wrist and right ankle during walking). This adaptability is crucial for capturing diverse, context-dependent motion patterns.
Finally, the output features undergo a fusion and transformation step. They are split symmetrically along the channel dimension into two equal parts. One half is processed by an affine coupling layer (ACL), where it undergoes an invertible transformation for subsequent Jacobian matrix computation in probabilistic modeling. The other half is directly concatenated with features from an earlier stage, forming a residual connection that reinforces information flow across the network. This architectural design enables the NFGC to adaptively learn meaningful spatiotemporal structures from pose trajectories and effectively fuse them with normalizing flow–driven features, ultimately enhancing the model’s capacity to capture subtle and localized motion anomalies.
3.3. Gaussian Estimation
To probabilistically model the latent representations of human motions, we employ a Gaussian Mixture Model (GMM) framework. The core idea involves transforming the input data from the original space X to a structured latent space through an invertible mapping function f composed of the NFFM and NFGC modules. Specifically, a given pose sequence is mapped to a point in , where we model the density as a mixture of Gaussian components to capture the multi-modal nature of normal human motions.
The transformation is governed by the change of variables formula. The probability density in the original data space
X can be computed from the density in the latent space
and the Jacobian determinant of the transformation
f:
For a composition of
K invertible functions
, the log probability decomposes additively, facilitating tractable computation:
In the latent space
, we model the density as a Gaussian Mixture Model with
K components:
where
are mixture weights satisfying
,
are component means, and
I is the identity covariance matrix. The choice of identity covariance simplifies computation while maintaining sufficient flexibility to model complex distributions through the mixture, as the invertible transformation
f is responsible for mapping the data to a space where Gaussian components with unit covariance are appropriate.
The GMM parameters
(mixture weights) and
(component means) are jointly optimized with the network parameters via the negative log-likelihood loss
(Equation (
7)) using standard backpropagation. The identity covariance
I is assumed for computational efficiency and to encourage the invertible transformation
f to learn a disentangled latent representation where each component is isotropic. This design choice is common in normalizing flow literature and has been validated to be sufficient for density estimation when combined with a flexible flow [
21].
The model is trained by minimizing the negative log-likelihood of the data:
During inference, the anomaly score for a test instance is derived by computing
using Equation (
7), which directly measures the deviation from the learned manifold of normal motion patterns. Instances that yield low likelihood under the GMM are classified as anomalies. This probabilistic formulation provides a principled and interpretable approach to anomaly detection, as the likelihood score has a clear statistical interpretation.
4. Experiments
To comprehensively evaluate the effectiveness of the proposed UHPose-VAD framework, we conducted extensive experiments on two publicly available benchmarks: ShanghaiTech [
24] and UBnormal [
30]. This section details our implementation setup, describes the characteristics of the evaluation datasets, and presents a comparative analysis against state-of-the-art methods.
4.1. Implementation Details
The experiments were conducted on an Ubuntu 22.04.4 operating system with an NVIDIA A40 GPU for model training and testing. All experiments were implemented using the PyTorch 2.1.2 framework. Input video frames were uniformly resized to pixels and normalized to the range .
For human pose extraction, we employ AlphaPose [
27] with a YOLOX [
28] detector pre-trained on COCO. The detector confidence threshold is set to 0.3, and non-maximum suppression (NMS) with an IoU threshold of 0.6 is applied to eliminate redundant detections. For temporal tracking across frames, we use PoseFlow [
29] with the default parameters (tracking threshold = 0.2, matching window = 50 frames). When multiple people appear in a single frame, we select the person with the highest tracking confidence to maintain temporal consistency. However, if the tracking confidence of the currently selected person drops below a threshold (e.g., 0.5) and another person becomes more confident, we re-initialize tracking to the new person. This may cause identity switches across frames, but since we model each person independently and compute anomaly scores per frame, such switches do not negatively affect detection performance. For videos with no detected people, we use zero-padded pose features.
We partition each video into overlapping clips using a sliding window approach. Each clip contains frames with a stride of 4 frames between consecutive clips. The pose coordinates are normalized by the image dimensions to the range , followed by zero-mean and unit-variance normalization across the training set.
The NFFM module consists of two
convolutional layers with 64 and 128 channels, respectively, followed by batch normalization and ReLU activation. The NFGC module employs 4 graph convolutional layers with 128 hidden units each. The GMM component number is set to
based on validation performance. The GMM component means
are initialized using k-means clustering on the latent features of the training set after 50 epochs of pre-training, then jointly optimized with the network via negative log-likelihood. The trainable fusion coefficients
in Equation (
2) are initialized to
and optimized jointly with the network parameters.
The network was optimized using the Adam optimizer with an initial learning rate of , weight decay of , and batch size of 32. The learning rate was reduced by a factor of 0.5 when the validation loss plateaued for 10 consecutive epochs. Training was conducted for 200 epochs, with early stopping applied if no improvement was observed for 30 epochs.
For evaluation, we adopted the widely used frame-level Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) as our primary metric. Specifically, for each clip we compute the negative log-likelihood
(Equation (
7)) as its anomaly score. For each frame
f, the anomaly score
is defined as the average of the scores of all clips that cover
f:
where
is the set of clips containing frame
f.
4.2. Datasets
We follow the official training and test splits for both datasets. For ShanghaiTech [
24], the training set contains 274,515 normal frames from 13 scenes, and the test set contains 42,883 frames with both normal and anomalous events. For UBnormal [
30], we use the standard split with 29 scenes for training (normal videos only) and 15 scenes for testing (containing both normal and abnormal events). No frame-level or video-level anomaly labels are used during training.
As shown in
Figure 2, the ShanghaiTech dataset is a large-scale benchmark comprising real-world surveillance videos captured from various scenes on a university campus. It contains diverse normal activities (e.g., walking, waiting) and anomalous events (e.g., fighting, chasing, falling). Its challenging nature stems from complex backgrounds, varying lighting conditions, and diverse camera viewpoints, making it a standard and demanding testbed for evaluating video anomaly detection methods.
In contrast, the UBnormal dataset is a synthetically generated benchmark created using Cinema4D (Maxon Computer GmbH, Bad Homburg, Germany). It was designed to mitigate common issues in real-world datasets, such as annotation noise and severe class imbalance. UBnormal offers a controlled environment with a wide variety of structured abnormal events and diverse human motion patterns. It serves as a valuable benchmark for more rigorously assessing a model’s generalization capability and robustness to rare or unseen anomalies.
4.3. Comparison with State-of-the-Art Methods
We compare our proposed UHPose-VAD framework against a range of recent state-of-the-art methods on both datasets. The compared works span several years and represent a variety of technical approaches published in high-impact venues.
To ensure fair comparison, we follow the same evaluation protocol as prior works. All compared methods are evaluated using the official training/test splits of ShanghaiTech and UBnormal. Frame-level AUC is computed by comparing per-frame anomaly scores against ground truth frame-level annotations. For methods that report results on these benchmarks, we directly cite their published numbers. All results for our method are reported as mean ± standard deviation over 3 independent runs with different random seeds.
Table 1 presents a comprehensive comparison of our proposed UHPose-VAD against state-of-the-art methods on the challenging ShanghaiTech dataset. The results, organized chronologically from 2020 to 2025, reveal a clear trend of performance improvement over the years, with recent methods beginning to surpass the 80% AUC mark. Early approaches like MESDnet (73.2%) and scene (74.7%) established competitive baselines, while variational and dynamic modeling methods such as VABD (78.2%) and LLSH (77.6%) pushed performance further. The year 2023 witnessed remarkable breakthroughs: STG-NF [
21] achieved 85.9% AUC, setting a new record, while SSMTL++ [
31] also reached 82.9%. Subsequent methods such as Ristea et al. [
32] in 2024 obtained 84.7% AUC, yet remained below the 2023 record. While other works have been proposed in the following years (such as in 2025), our proposed UHPose-VAD framework significantly outperforms all these predecessors, achieving a new state-of-the-art AUC of 86.1%. This consistent superiority over a wide range of methodologies—including generative models, self-supervised learning, and transformer-based architectures—strongly validates the effectiveness of our core design: explicitly modeling fine-grained human pose dynamics within a probabilistic framework to capture anomalies in complex, real-world surveillance scenes.
The performance comparison on the UBnormal dataset, as summarized in
Table 2, further underscores the robustness and generalization capability of our method. UBnormal, with its synthetic yet highly diverse anomalies, presents a different set of challenges compared to ShanghaiTech. The performance trajectory here shows a notable jump from early MIL-based methods like MIL-Rank (54.1%) to more recent paradigms. Methods in 2021, such as RTFM [
45], pushed the AUC to 66.8%, which remained a strong benchmark for subsequent years. Notably, prior to our work, the strongest results were achieved by STG-NF [
21] with 71.8% AUC, followed by OCC-WS [
46] (67.4%) and STPrompt [
47] (63.9%). Our UHPose-VAD framework achieves a top AUC of 69.4%. This result is particularly significant as it demonstrates that our pose-centric and flow-integrated model generalizes effectively beyond real-world data to controlled synthetic environments. While this is 2.4% lower than STG-NF (71.8%) on UBnormal, we note that STG-NF uses a more complex normalizing flow architecture (which processes a larger feature dimension). On the real-world ShanghaiTech dataset, our method outperforms STG-NF (86.1% vs. 84.0%), suggesting that our design prioritizes robustness to pose noise over capacity for synthetic data.
4.4. Ablation Study
To quantitatively validate the contribution of each core component within the proposed UHPose-VAD framework, we conduct a series of ablation studies on both the ShanghaiTech and UBnormal datasets. The study is designed to isolate and evaluate the effects of the PE, NFFM, NFGC, and GE modules. The baseline performance is established by a simplified model, and components are incrementally added to assess their individual and collective impact on the final anomaly detection performance.
Table 3 shows the ablation study results conducted on the ShanghaiTech and UBnormal datasets to evaluate the contribution of each component in our proposed UHPose-VAD framework. The baseline model, which utilizes a simplified architecture without the specialized modules proposed in this work, yields AUC scores of 79.2% and 63.7% on the two datasets, respectively. The baseline model (first row in
Table 3) consists of a simple 2-layer MLP with 256 hidden units and ReLU activation. It takes the concatenated pose keypoints (normalized coordinates of all 18 joints across 16 frames, flattened into a 576-dimensional vector) as input and outputs a reconstruction error (MSE between input and output) as anomaly score. No graph convolution, no GMM, and no normalizing flow is used. All other rows incrementally add the proposed modules (PE: pose extraction, NFFM: normalizing flow feature mapping, NFGC: normalizing flow graph convolution, GE: Gaussian estimation). This baseline achieves 79.2% AUC on ShanghaiTech and 63.7% on UBnormal, demonstrating that even a simple pose-based model is effective, but our full model significantly improves upon it. Notably, configurations that omit the PE module (rows 5–7) consistently underperform compared to their pose-based counterparts, highlighting the indispensability of explicit pose dynamics for human-centric anomaly detection. The full model, incorporating all four components (PE, NFFM, NFGC, GE), achieves the highest AUC scores of 86.1% on ShanghaiTech and 69.4% on UBnormal, demonstrating that each module contributes uniquely and that their synergistic integration is crucial for achieving state-of-the-art performance.
Figure 3 presents a stacked bar chart that visualizes the incremental contributions of each module within the proposed UHPose-VAD framework to the overall anomaly detection performance on the ShanghaiTech dataset, measured in AUC (%). The x-axis represents the progressive integration of modules, starting from a baseline without any modules (AUC = 79.2%), followed by PE, then GE, then NFFM, then NFGC. The y-axis indicates the cumulative AUC achieved by the model.
Each bar is composed of five stacked components: The baseline (blue) corresponds to a simple model that uses only pose extraction (PE) without graph modeling or GMM, achieving an AUC of 79.2%. This baseline does not include NFFM, NFGC, or GE modules. The subsequent bars add components incrementally. The PE + GE contribution (orange) shows that adding the GE module significantly improves performance by +3.9%, highlighting the importance of probabilistic modeling in capturing the normal motion manifold. The NFFM contribution (green) indicates that the addition of normalizing flow features contributes an additional +0.7%, underscoring the benefit of integrating fine-grained motion cues. The NFGC contribution (red) demonstrates that introducing NFGC further improves AUC by +1.2%, validating that adaptive graph-based mechanisms enhance the modeling of spatiotemporal relationships. Finally, the synergy effect (purple) reflects an additional +1.1% improvement, indicating that the joint optimization of all modules leads to performance gains beyond their simple additive contributions. Overall, the full model achieves a final AUC of 86.1%, demonstrating the effectiveness of each individual component as well as their collaborative integration.
4.5. Ablation Study on Graph Adjacency Matrices
To understand the contribution of each adjacency matrix in the NFGC module, we conduct a fine-grained ablation study where we individually remove or retain each matrix type.
Table 4 reports the results on both datasets.
The results demonstrate that each matrix contributes uniquely to the overall performance. The full model achieves the highest AUC, confirming the complementary nature of the three adjacency matrices. Notably, removing (the attention-based dynamic matrix) causes a performance drop of 0.9% on ShanghaiTech, highlighting the importance of sample-specific connectivity for capturing diverse motion patterns. The global learnable matrix also plays a crucial role, with its removal leading to a 1.4% AUC decrease. Interestingly, the static anatomical prior provides a moderate contribution, and the model maintains reasonable performance even without it (−1.2%), suggesting that the learned matrices can partially compensate for anatomical priors while still benefiting from their regularization effect.
4.6. Qualitative Analysis
To provide intuitive insights into the temporal detection capability of our model, we present qualitative visualizations of anomaly scores on sequences containing anomalous events from both datasets. These curves demonstrate how UHPose-VAD responds to abnormal activities in the video stream and its precision in localizing such events temporally.
Figure 4 displays anomaly score curves for two representative sequences from the ShanghaiTech dataset. In both subfigures, the blue curve tracks the evolution of anomaly scores across frames, while red shaded regions indicate ground truth anomalous intervals. The results demonstrate UHPose-VAD’s effectiveness in real-world surveillance scenarios. In Sequence 1 (
Figure 4a), the model produces a sharp, well-defined peak that closely aligns with the anomalous event, accurately capturing its onset and duration. Similarly, in Sequence 2 (
Figure 4b), the anomaly score rises abruptly during the abnormal activity and quickly returns to baseline levels once normal behavior resumes. This pattern indicates the model’s sensitivity to deviations from learned normal patterns and its ability to temporally localize anomalies with precision.
Figure 5 presents anomaly detection results on the UBnormal dataset, which contains synthetically generated but challenging anomalous events. In Sequence 1 (
Figure 5a), the model demonstrates robust performance by maintaining low anomaly scores during normal periods while generating a prominent peak precisely aligned with the ground truth anomaly. Sequence 2 (
Figure 5b) shows a more complex scenario where the anomaly score exhibits multiple peaks corresponding to different phases of abnormal activity yet still maintains strong temporal alignment with the ground truth intervals.
Across both datasets, several consistent patterns emerge. First, UHPose-VAD maintains low anomaly scores during normal activity periods, indicating effective learning of regular human motion patterns. Second, the model produces sharp, well-defined peaks during anomalous events, demonstrating precise temporal localization capabilities. Third, the magnitude of anomaly scores correlates well with the severity of behavioral deviations, with more dramatic anomalies triggering higher peaks. These qualitative observations align with our quantitative results and validate the effectiveness of our pose-centric approach in capturing semantically meaningful anomalies across diverse scenarios.
Despite the strong overall performance, our method fails in certain scenarios.
Figure 6 shows an example from ShanghaiTech where a person is severely occluded by a pillar. AlphaPose produces incomplete keypoints (missing left arm and leg), causing the pose graph to lose critical structural information. Consequently, the anomaly score remains low even though an abnormal event (a sudden fall) occurs. This highlights the dependency of our framework on accurate pose estimation. Future work could incorporate temporal smoothing or multi-frame keypoint interpolation to mitigate such failures.
4.7. Robustness Analysis
To evaluate the robustness of our proposed UHPose-VAD framework against potential data corruptions in real-world deployment scenarios, we systematically introduce and analyze three types of noise perturbations on the input data. Spatial noise is introduced by adding independent Gaussian noise with a standard deviation of 0.2 to the upper half of each video frame, simulating partial occlusion. Temporal noise is simulated via a random frame-dropping strategy with a discard rate of 0.1, emulating unstable frame rates or intermittent sensor failures. Furthermore, a combined spatio-temporal noise condition is created by applying both perturbations simultaneously, representing a more complex and challenging degradation scenario. The performance of our model under these adverse conditions is quantitatively assessed and reported in
Table 5.
The robustness evaluation results in
Table 5 demonstrate the remarkable resilience of UHPose-VAD to various data corruptions. On the ShanghaiTech dataset, the model maintains high performance under spatial noise (85.6% AUC), showing only a marginal decrease compared to the clean data performance (86.1%). This minimal degradation underscores the advantage of our pose-driven approach, which focuses on structural human dynamics rather than raw pixel appearances, making it inherently robust to localized visual corruptions like partial occlusions. Similarly, temporal noise causes negligible performance drop (84.7% AUC), indicating that the graph-based temporal modeling effectively compensates for intermittent missing frames. Most notably, even under the challenging combined spatio-temporal noise, our method achieves 84.2% AUC on ShanghaiTech and 66.9% on UBnormal, exhibiting excellent stability. The consistent performance pattern observed on both datasets confirms that our framework maintains strong anomaly detection capability even in non-ideal conditions, highlighting its practical suitability for real-world surveillance applications where data quality cannot be guaranteed.
4.8. Computational Efficiency
We evaluate the model complexity and inference speed of UHPose-VAD. Our method achieves a compact architecture with only 2.6M parameters, significantly lower than most existing approaches. Moreover, UHPose-VAD runs at 32 FPS on a single GPU (NVIDIA A40 GPU), demonstrating its suitability for real-time anomaly detection applications. This efficiency stems from the lightweight GCN-based backbone and the effective fusion mechanism that avoids heavy computation overhead.
As shown in
Table 3, the model complexity increases moderately with each added component, and the full model remains lightweight (2.6M parameters) and real-time (32 FPS). The inference speed of all variants is sufficient for practical deployment.
4.9. Sensitivity Analysis
4.9.1. GMM Components
The number of Gaussian components
K in the GMM (Equation (
6)) is a key hyperparameter that determines the model’s capacity to capture multi-modal normal motion patterns. We evaluate the sensitivity of UHPose-VAD to different
K values on the ShanghaiTech validation set.
As shown in
Figure 7, the model achieves its best performance with
components. Using a single component (
) yields a lower AUC of 84.2%, confirming that normal human motions exhibit multi-modal characteristics that cannot be adequately captured by a single Gaussian distribution. Increasing
K beyond 3 leads to a slight performance degradation (
at
,
at
), which can be attributed to overfitting to the training data. Based on this analysis, we set
for all experiments.
We also evaluate the sensitivity of
K on the UBnormal dataset. As shown in
Figure 7b, the optimal
K remains 3, achieving 69.4% AUC, while
yields 66.7% and
yields 67.9%. This confirms that the multi-modal nature of normal motion patterns is consistent across datasets.
4.9.2. Graph Fusion Coefficients
To understand how the proposed UHPose-VAD balances the three adjacency matrices (
,
,
) and to verify whether the model truly favors adaptive connectivity over fixed physical constraints, we conduct a sensitivity analysis on the trainable coefficients
in Equation (
2). We fix two coefficients at their learned values while varying the third across the range
, and observe the resulting AUC on the ShanghaiTech validation set.
As shown in
Figure 8a, the model performance is relatively stable across different
values, with a broad plateau around the learned optimum (
). This indicates that the attention-based dynamic connectivity consistently contributes meaningful information. In contrast,
Figure 8b reveals that performance drops sharply when
deviates from its optimal range, highlighting the critical role of the global learnable matrix in capturing universal motion patterns. Interestingly,
Figure 8c shows that the model maintains reasonable performance even when
is reduced to zero, suggesting that while the anatomical prior
provides useful regularization, the model can partially compensate for its absence through the learned matrices. Overall, the learned coefficients converge to a balanced configuration (
), confirming that our adaptive fusion mechanism effectively combines the strengths of all three connectivity patterns.
5. Discussion
Our experimental results reveal several insights beyond the immediate performance comparisons. First, the sensitivity analysis on graph fusion coefficients (
Figure 8) shows that performance drops sharply when
deviates from its optimum, while the model remains relatively robust to variations in
and
. This suggests that the global learnable matrix
plays a more critical role than the attention-based
or the anatomical prior
in capturing universal motion patterns. Interestingly, even when
, the model maintains competitive AUC scores (85.2% on ShanghaiTech), indicating that the learned matrices can partially compensate for missing anatomical priors.
Second, the GMM sensitivity analysis (
Figure 7a) confirms that normal human motions exhibit multi-modal characteristics, as
significantly outperforms
. However,
leads to overfitting, suggesting that the intrinsic modes of normal motion in surveillance scenarios are limited.
Third, the robustness analysis (
Table 5) demonstrates that our pose-centric design is inherently resilient to spatial noise (e.g., partial occlusion), with only a 0.5% AUC drop on ShanghaiTech. This is a key advantage over pixel-based methods that would suffer more severely from such corruptions.
Nevertheless, the framework’s performance remains tied to pose estimation quality. In dense crowds or severe occlusions where AlphaPose fails, our method degrades gracefully but still suffers. Future work will explore (1) cross-frame temporal smoothing to filter erroneous pose detections and (2) extend the framework to model multi-person interactions via graph-based relational reasoning.