1. Introduction
Video anomaly detection strives to recognise unusual patterns or occurrences deviating from standard or expected sequences within video data. This task is of critical importance as it facilitates the flagging of non-typical events that could potentially signify threats, irregularities, or uncommon circumstances. The implications of such a capability span numerous fields, from enhancing security via surveillance applications, improving health outcomes by identifying abnormal behavioural or physiological patterns, optimising traffic control through the detection of anomalies or accidents, to augmenting industrial processes by early identification of potential machinery malfunctions.
The advent of deep learning has fundamentally transformed the field of video anomaly detection, introducing innovative approaches that tackle previously insurmountable challenges. These algorithms leverage their intrinsic ability to learn complex, hierarchical representations directly from raw data, enabling the extraction of intricate patterns and contexts necessary for detecting anomalies in intricate video sequences.
Figure 1 illustrates the architecture of our proposed MaskedSkipUNet. On the left, a sequence of input video frames is processed by a series of 2D convolutional encoding blocks. The feature maps are then passed through a symmetric decoding path on the right, typical of UNet structures. Crucially, the yellow blocks interleaved between encoder and decoder levels represent our novel MaskedConv3D layers. These layers operate on the temporal dimension of the feature maps and apply a structured mask to the centre kernel, enabling selective feature suppression and inference of normal patterns. By placing these layers at skip connection points, we modulate the flow of information and prevent the direct copying of anomalous content from encoder to decoder, a limitation present in conventional UNets.
Autoencoders (AEs) and Variational Autoencoders (VAEs) have significantly contributed to anomaly detection by encoding input data into a low-dimensional latent space and reconstructing it, allowing anomalies to be identified as deviations from expected reconstructions or latent representations. However, this approach often results in small details within images and videos not being reconstructed well, leading to false positives when these details are misidentified as anomalies [1].
In the recent panorama of video anomaly detection, UNets have emerged as a potent architecture, demonstrating remarkable efficacy [2,3,4,5,6,7,8,9]. Unlike autoencoders and VAEs, which struggle to retain fine details, UNets leverage their skip connections to carry detailed spatial information from the encoder to the decoder, effectively preserving small details that are crucial for accurate reconstruction [10]. Originally conceived for biomedical image segmentation, UNets utilise a symmetric expanding path that helps recover spatial information lost during the contraction phase, allowing the model to effectively extract both high-level semantic information and low-level spatial information. This capability is particularly critical in applications where preserving detailed spatial contexts is vital, such as anomaly detection in videos.
The primary differentiator between UNets and autoencoders lies in the manner of information propagation. While both architectures comprise an encoding (contracting) and decoding (expanding) phase, UNets introduce additional “skip connections” that bridge the encoding and decoding paths at equivalent levels of hierarchy [11]. These connections allow the network to carry detailed spatial information from the encoder directly to the decoder, thereby facilitating the preservation and reconstruction of intricate spatial patterns often lost in the bottleneck of traditional autoencoders.
However, UNets have a significant limitation: because their skip connections carry features from earlier layers, they can reconstruct anomalous samples as proficiently as normal samples, as is later demonstrated in Section 5.1 (Figure 7). This proficiency can obstruct anomaly detection and is often countered by building considerable additional complexity around the UNet.
In our research, we tackle the challenge of UNet’s anomaly reconstruction by implementing special convolution layers, the MaskedConv3D, between the skip connections. While several methodologies in the realm of anomaly detection focus on masking the image or data directly, our approach diverges significantly. Instead of operating on the image, we introduce masking at the convolutional layer level, resulting in a more nuanced and informed mechanism for anomaly identification. This strategic departure offers a unique lens through which anomaly detection can be more effectively understood and implemented. Our research presents several key innovations that distinguish it from existing methods:
Unlike traditional approaches that mask the image or data directly, often requiring multiple random masks to account for unknown anomaly locations, our method masks at the convolutional layer level. This avoids the risk of obscuring critical normal features or missing anomalous ones. Moreover, by masking the kernel instead of the image, our approach enables efficient inference without requiring multiple masked versions of the input, significantly reducing computational overhead [12,13].
Our research introduces MaskedConv3D layers that are strategically placed between the skip connections of the UNet architecture. By placing the MaskedConv3D layers between skip connections, we apply feature masking and feature inference on various granularities of detail. This addition tackles the challenge of UNet’s ability to reconstruct anomalies of various sizes [14,15].
By implementing masked convolutions between skip connections, our method infers normal spatio-temporal features to be passed to the decoder. Instead of passing forward the anomalies across the reconstruction process [6,7], using MaskedConv3D layers means our model needs to infer normal features from surrounding spatio-temporal features.
Section 2 of this paper reviews existing methods in video anomaly detection, highlighting their strengths and limitations. In Section 3, we introduce our proposed MaskedSkipUNet architecture and explain its unique components, such as the MaskedConv3D layer. Section 4 describes the three datasets used for evaluation: the Ped2, CUHK Avenue, and ShanghaiTech datasets. In Section 5, we present the performance of the MaskedSkipUNet, demonstrating its advantages over other frame-based methods. Finally, Section 6 provides an evaluation of the effectiveness of the MaskedSkipUNet architecture by comparing it across various permutations of architecture components. The source code is publicly available at https://github.com/demetrislappas/MaskedSkipUNet (accessed on 1 April 2025).
2. Related Work
Anomaly detection within video datasets presents a unique challenge, primarily due to the overwhelming prevalence of normal data. Consequently, training sets designated for anomaly detection are inherently biased towards normal samples [16,17,18]. To counteract this imbalance, certain strategies have been deployed, such as the generation of artificial anomalies. However, the efficacy of such methodologies remains questionable, given the inherent difficulty in simulating all potential real-world anomalies.
Unsupervised learning remains a favoured approach in this domain, typically involving the implementation of autoencoders (AEs) as a key mechanism [19,20,21,22,23,24,25,26]. AEs detect anomalies by evaluating the disparity between their initial input and the reconstructed output. They comprise an encoder, which maps normal data to a latent space, and a decoder, which recreates the original input.
The underlying rationale for this methodology is that normal data samples should exhibit minimal reconstruction errors when processed through an AE during inference. Given that the AE is unacquainted with anomalous samples, these samples are expected to yield substantial reconstruction errors when decoded.
Reconstruction remains a predominant method in numerous studies involving video anomaly detection [16,19,21,22,23,26]. Several approaches attempt to accommodate the temporal dimension by endeavouring to reconstruct complete sequences [21,23].
A significant challenge with AEs is their potential inability to reconstruct normal samples accurately enough to differentiate them from anomalies. Numerous techniques employ a UNet, an AE with skip connections between hidden layers that assist in reconstructing the original input from an earlier network layer [2,3,4,5,6,7,8,9]. Yet, skip connections present complications, as they make the decoder reliant not solely on the latent space for reconstruction but also on higher-dimensional features from the encoder’s earlier layers.
UNets might learn to reconstruct anomalous samples as proficiently as normal samples. This is attributed to the idea that a latent space is a fitting representation of data in a lower-dimensional space, allowing anomaly detection to occur in the latent space without the need for reconstruction during inference. This is often combated by introducing extreme complexity within the UNet’s architecture. RHCrackNet [27] exemplifies an advancement in UNet-based architectures, focusing on static image segmentation, particularly for pavement crack detection. It incorporates feature fusion and enhancement modules to dynamically aggregate and refine multiscale information. Additionally, it employs a non-local attention mechanism to extract long-range spatial dependencies, addressing challenges such as background interference and weak crack continuity. These enhancements contribute to improved segmentation accuracy and robustness in static image applications.
A more innovative application of AEs/UNets in video anomaly detection involves their use for predicting subsequent frames [2,3,4,5,6,7,8,9,20,22,25,28,29]. This technique hypothesises that a decoder, having learnt from an encoded sequence of normal frames, would struggle to generate the next frame from an encoded sequence of anomalous frames. The two-stream framework proposed in [29] employs context recovery with a spatio-temporal UNet for future frame prediction, combined with knowledge retrieval to enhance understanding of normality.
Several studies have aimed to capture the temporal dimension by incorporating Transformers into their architecture, which span a sequence of frames and apply an attention mechanism amongst them [2,8,25,30,31]. This approach appears plausible for capturing videos’ temporal characteristics, enabling an AE/UNet to detect features unobtainable from a single frame.
Optical flow has also seen extensive application in video anomaly detection for capturing temporal features, as both an input feature and a predicted output feature [3,20,22,32,33,34]. Optical flow captures the perceived motion of pixels between consecutive frames, making it highly sensitive to motion-related anomalies. The Dual-Stream Anomaly Detection Network (DS-ADN) introduced in [34] uses optical flow data in combination with RGB information to enhance anomaly detection in traffic surveillance scenarios, employing a Multi-Scale Attention Fusion Module to better capture anomaly characteristics.
Existing supervised learning models have been utilised to augment AE/UNet performance by comparing the hidden features of the original input and the reconstructed output within the same pre-trained model [9]. This methodology bears similarity to the application of Generative Adversarial Networks (GANs) [2,3,5,20,25,31,35]. GANs consist of two models, a generator and a discriminator, that function in opposition. The generator strives to produce samples similar to ground truth data to deceive the discriminator, while the discriminator works to discern whether a sample is authentic or synthetic. In the context of anomaly detection, the AE/UNet generally acts as the generator, while the discriminator aims to identify discrepancies between the AE’s/UNet’s original input and the reconstructed output. This interplay aids the AE/UNet in improving the quality of its reconstructions to make them indistinguishable from the original input. The attention-guided generator with dual discriminator GAN (A2D-GAN) [35] is an example of such an approach, utilising an encoder–decoder generator with self-attention and channel attention mechanisms, coupled with dual discriminators to enhance robustness against noise in real-time video anomaly detection.
A widely adopted strategy to prevent anomaly reconstruction is to include some form of memory within the hidden layers and/or latent space that references normal samples [4,6,19,23,26,33,36,37]. Memory modules in AEs/UNets concatenate linear combinations of learned memory vectors based on the input method. The underlying theory suggests that these modules will not be as efficient in forming linear combinations of anomalous vectors as they are with normal vectors in the latent space. This function is analogous to the role of the decoder, and one could argue that a more complex decoder might yield comparable results to a memory module.
An effective strategy for anomaly detection involves applying masking techniques to selectively obscure parts of the data, using the unmasked content to inpaint the masked areas, thereby detecting anomalies through discrepancies in reconstruction. In [12], the unsupervised SMAI method is introduced. It uses superpixel segmentation and an inpainting module to detect anomalies. During inference, frames are masked and inpainted, and discrepancies between the original and reconstructed areas identify anomalies. While effective for image-based anomaly detection, this approach is computationally intensive and time-consuming during inference. Additionally, the inpainting technique, primarily designed for images, poses challenges when extended to video anomaly detection.
Ristea et al. [14] use masked convolutional layers in their self-supervised predictive convolutional attentive block (SSPCAB). The centre area of the kernel is masked, and the convolution operation only considers input values from specific positions, ignoring others. While this is an innovative approach for feature extraction, it does not appear to utilise masked convolution layers to the best of their capabilities, because they are only used in the first layer and the masked convolution only utilises the corner pixels of a kernel. This limited usage restricts the network’s ability to fully exploit the advantages of masked convolution, particularly in capturing more nuanced spatio-temporal features and effectively preventing the reconstruction of anomalies. Furthermore, although [15] extends this approach with 3D masked convolutions to consider temporal dimensions, it still does not utilise these masked convolutional layers effectively across the entire network for anomaly detection. In our approach, we extend the use of MaskedConv3D layers throughout the UNet architecture, strategically inserting them between skip connections to infer normal spatio-temporal features across various granularities of detail and provide a more comprehensive anomaly detection solution.
3. Methodology
When dealing with video anomaly detection, classic UNet architectures can inadvertently become too proficient at reconstructing input features, including anomalies. This reconstruction efficiency of UNets, especially due to their skip connections, often leads to the undesired effect of anomalies being rendered indistinct from normal data. As a result, the anomalies can become less discernible, making detection more challenging.
In a traditional UNet setup, while the skip connections are crucial for preserving high-frequency details and mitigating the loss of information during the encoding process, they may also enable the accurate reconstruction of anomalies. This occurs because the detailed features from the encoder are directly carried over to the decoder, which may result in the anomalies being maintained in the output, thereby diminishing the model’s ability to differentiate between normal and anomalous patterns.
Incorporating the MaskedConv3D layer between these skip connections can alleviate these issues. The MaskedConv3D layer, considering the temporal dimension, works on a sequence of frames and applies a zero mask to a subspace of the centre kernel. This leaves the previous and next frames, as well as the boundary of the centre frame, intact, forming a sort of “hollow ball” around potential anomalies. This design forces the model to infer the reconstruction of this hollow space based on surrounding spatial and temporal information, rather than directly processing the anomalies themselves.
This approach is underpinned by the logic that the model should be able to reconstruct normal sequences using the information from the boundary and adjacent frames, while struggling to do the same for anomalous ones. Therefore, this method prevents anomalies from being inadvertently reconstructed and normalised, ensuring they stand out for detection.
3.1. MaskedConv2D
To facilitate a more tangible understanding of our methodology, we initially apply the concept of masked convolutional layers to a 2D image dataset as a proof of concept, before diving into the complexities of video anomaly detection with MaskedConv3D layers. Understanding the idea of a masked convolutional layer in 2D is easier prior to delving into the 3D variant. This approach underscores the versatility and adaptability of our technique.
A traditional Conv2D kernel, for instance a $7 \times 7$ one, is made up of 49 weights, denoted by $w_{i,j}$, where $i$ and $j$ are identifiers for the rows and columns of the kernel, respectively. This kernel moves over an image, and the weights operate as multipliers for each corresponding pixel, establishing a linear combination between the weights and pixels. The outcome of this calculation is then assigned to the central pixel position of the kernel, in our case the centre of the $7 \times 7$ window.
If we were to momentarily employ a single Conv2D layer to depict the entire reconstruction procedure, it would be evident that a Conv2D layer, when trained solely on normal samples, would be proficient at reconstructing any image with near precision. This would hold for both normal and anomalous samples, with a graphic depiction of this scenario visible on the left side of
Figure 2.
We could envision that most of the kernel weights would have values close to zero and the central weight would have a value approximating 1, implying negligible reliance on the surrounding pixels’ information.
However, if we apply a mask to some of the central pixels of the kernel, assigning them zero values (as we do in our MaskedConv2D layer, shown in
Figure 3), the central pixel would have to be deduced in the context of its surrounding pixels. In a two-dimensional space, these masked-out pixels form a sort of “ring”, a boundary within which the central pixel lies. The task of the MaskedConv2D layer then becomes figuring out the value of this central pixel based on the information from the pixels that exist outside this zero-value ring. This changes the operation of the layer from a simple linear combination of all pixels within the kernel to a process of spatial inference, where the values of the surrounding pixels provide cues to what the central pixel might be. Hence, a MaskedConv2D layer trained on normal data would find it difficult to infer anomalies from these boundary pixels and is more likely to substitute the anomaly with a pattern that resembles normal data. This phenomenon can be seen on the right side of
Figure 2.
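The effect described above can be sketched in plain Python, without any deep learning framework. The example assumes a 7 × 7 kernel whose central 3 × 3 region is zeroed; the size of the masked region (`HOLE`), the uniform weights, and the helper names are illustrative assumptions rather than the values or API of the published implementation.

```python
def make_center_mask(kernel_size, hole):
    """Return a kernel_size x kernel_size mask of 1.0s whose central
    hole x hole block is 0.0 (both sizes are assumed to be odd)."""
    c = kernel_size // 2
    r = hole // 2
    return [
        [0.0 if abs(i - c) <= r and abs(j - c) <= r else 1.0
         for j in range(kernel_size)]
        for i in range(kernel_size)
    ]


def masked_response(patch, weights, mask):
    """Linear combination of an image patch with the masked kernel weights."""
    k = len(mask)
    return sum(
        patch[i][j] * weights[i][j] * mask[i][j]
        for i in range(k)
        for j in range(k)
    )


K, HOLE = 7, 3  # hypothetical sizes chosen for illustration only
mask = make_center_mask(K, HOLE)
# Uniform weights over the 40 unmasked positions (stand-in for learned weights).
weights = [[1.0 / (K * K - HOLE * HOLE)] * K for _ in range(K)]

normal_patch = [[0.5] * K for _ in range(K)]
anomalous_patch = [row[:] for row in normal_patch]
anomalous_patch[K // 2][K // 2] = 1.0  # defect confined to the centre pixel

out_normal = masked_response(normal_patch, weights, mask)
out_anomalous = masked_response(anomalous_patch, weights, mask)
```

Because the masked weights contribute nothing to the sum, `out_normal` and `out_anomalous` coincide: the value written to the centre position is inferred only from the surrounding pixels, so a defect confined to the masked region cannot simply be copied through.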
This presents a perspective not typically accounted for in traditional autoencoders. It is generally thought that the decoder will find it difficult to reconstruct anomalous samples, but this is not always the case. The encoder maps patterns to the latent space, and the decoder rebuilds these patterns. Ideally, normal and anomalous patterns would differ significantly, but if there is significant overlap between normal and anomalous samples, autoencoders may fall short. MaskedConv2D layers introduce the idea of extrapolating a normal pattern from its neighbouring patterns.
It is worth noting a key differentiator of our method. Whereas other methodologies might mask the image input directly [38,39], leading to a potential loss of information, our MaskedConv2D layer masks the convolutional kernels. This preserves the original data while altering the convolution operation, pushing the architecture to focus on the surrounding context rather than directly altered data.
3.2. MaskedConv3D
Following the explanation provided in
Section 3.1, the concept of a MaskedConv3D layer should be relatively straightforward. The fundamental idea remains the same, but instead of establishing a boundary akin to a ring, as in 2D, a MaskedConv3D layer constructs a boundary resembling a hollow sphere in three dimensions. Extending the unique approach from MaskedConv2D, our 3D variant continues to mask the convolutional kernel instead of the volumetric video data, further differentiating our technique from mainstream practices. Given that the size of a Conv3D kernel is depicted as a three-dimensional array, encompassing temporal, height, and width dimensions, MaskedConv3D layers nullify only a subset of the temporal layers.
This approach utilises the information from previous and subsequent temporal frames to make educated guesses about the content of the current frame. To illustrate, let us assume a kernel size of $3 \times 3 \times 3$, which consists of 27 weights, denoted by $w_{t,i,j}$. In this case, $t$, $i$, and $j$ are identifiers for the temporal dimension, rows, and columns of the kernel, respectively. Among these, only the central weight (middle temporal slice, middle row, middle column) would be set to zero. This example can be seen in Figure 4.
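The 27-weight example can be sketched in the same framework-free style. The code assumes a single input and output channel and a kernel indexed as (time, row, column); the function names and the uniform weights are hypothetical and only serve to show the arithmetic.

```python
def make_mask_3d(size=3):
    """size x size x size mask of 1.0s with only the central weight zeroed."""
    c = size // 2
    return [
        [[0.0 if (t, i, j) == (c, c, c) else 1.0 for j in range(size)]
         for i in range(size)]
        for t in range(size)
    ]


def masked_conv3d_at(volume, weights, mask, t0, i0, j0):
    """Masked 3D convolution response centred on voxel (t0, i0, j0).

    volume is indexed as [time][row][column]; the caller must keep the
    kernel window inside the volume (no padding is applied here).
    """
    size = len(mask)
    c = size // 2
    total = 0.0
    for t in range(size):
        for i in range(size):
            for j in range(size):
                x = volume[t0 + t - c][i0 + i - c][j0 + j - c]
                total += x * weights[t][i][j] * mask[t][i][j]
    return total


mask3d = make_mask_3d(3)
active = sum(m for frame in mask3d for row in frame for m in row)  # 26 weights remain
weights3d = [[[1.0 / active] * 3 for _ in range(3)] for _ in range(3)]

# Three consecutive 3 x 3 feature maps with a constant "normal" value.
normal_clip = [[[0.2] * 3 for _ in range(3)] for _ in range(3)]
anomalous_clip = [[row[:] for row in frame] for frame in normal_clip]
anomalous_clip[1][1][1] = 0.9  # anomaly at the centre of the middle frame

resp_normal = masked_conv3d_at(normal_clip, weights3d, mask3d, 1, 1, 1)
resp_anomalous = masked_conv3d_at(anomalous_clip, weights3d, mask3d, 1, 1, 1)
```

Both responses are equal, since the centre voxel is multiplied by a zero weight: the output at that position depends only on the 26 spatio-temporal neighbours.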
The key is that, by leaving the central weights zeroed, the layer is required to draw on the three-dimensional spatial and temporal context it has at its disposal in order to infer the value at the central position. This is analogous to the way the 2D version leverages spatial context, but with the additional complexity of the temporal dimension, making the 3D version potentially more powerful for tasks involving video or other 3D data.
3.3. Architecture
Our approach employs a Conv2D UNet architecture, which is comprised of tailor-made ConvBlocks for the encoder and decoder, along with MaskedConv3D layers inserted amid the skip connections.
A ConvBlock is structured to include a multi-headed convolution, essentially meaning that it contains 3 Conv2D layers with “same” padding, each with its own kernel size. The outputs of these layers are then concatenated on the channel dimension. Subsequently, a batch normalisation is performed, followed by the application of a ReLU activation function. The output is then restructured to reduce the height and width dimensions, and passed through a final Conv2D layer (also with “same” padding). The entire process is capped off with an additional batch normalisation and another ReLU layer.
Figure 5 provides a visual representation of an encoder ConvBlock.
The UNet architecture is composed of 8 of these ConvBlocks, with 4 each dedicated to the Encoder and Decoder. Skip connections are implemented between the layers, each of which traverses a MaskedConv3D layer before being concatenated with its corresponding layer. The encoder and decoder, which consist of Conv2D layers, treat each frame as an individual image; the temporal dimension is only considered in the MaskedConv3D layers. It is worth noting that this architecture largely mirrors a traditional UNet, with the key distinction being the introduction of the MaskedConv3D layers in the skip connections, lending a unique capacity for contextual inference to the framework. The appropriate kernel sizes and mask dimensions for the MaskedConv3D layers were selected through empirical experimentation to balance temporal context and reconstruction quality. A visual of the architecture can be seen in Figure 1.
3.4. Rationale Behind the Architecture
In our UNet architecture, the encoder operates as a spatial feature extractor for each frame, unearthing a fresh layer of features at every ConvBlock. The MaskedConv3D layers, situated strategically between these ConvBlocks, serve a unique and vital role. As the features extracted by the encoder are passed through the skip connections, the MaskedConv3D layers replace these features, generating new ones by interpreting the context provided by the existing surrounding features. This stage is where the actual inference takes place and is also the primary site of novelty in our architecture.
Specifically, these MaskedConv3D layers contribute the key capability to detect anomalies—they do this by failing to accurately reconstruct the features from the surrounding context when presented with anomalous data. This failure is what signals the presence of an anomaly and differentiates the anomalous data from the normal.
Following this feature extraction and inference process, the decoder assumes its role in the framework, which is singularly focused on assembling these inferred features back together, ultimately yielding a reconstructed output.
In contrast to “Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection” [14], our proposed MaskedSkipUNet architecture uses MaskedConv3D layers not just as feature extractors, but as a mechanism to infer reconstruction based on surrounding spatio-temporal hierarchical features.
The MaskedSkipUNet architecture improves upon methods like “Superpixel Masking and Inpainting” [12]. By focusing the masking operation on the convolutional kernels rather than the frames themselves, MaskedSkipUNet sidesteps the computational burden of generating multiple masked copies of each frame. This kernel-based masking is not only computationally more efficient but also retains critical temporal and spatial information that could be lost when masking frames directly. Furthermore, the architecture incorporates a temporal dimension through its MaskedConv3D layers, an aspect that traditional inpainting methods are currently unable to address.
The MaskedConv3D layers in our model enable the model to leverage the temporal and spatial context surrounding anomalies, facilitating accurate detection and preserving normal patterns. By excluding the anomalies from direct reconstruction, the MaskedSkipUNet effectively distinguishes anomalies from normal frames, resulting in significantly improved anomaly detection accuracy. This difference in the application of masked convolutional layers underlines the unique approach of our proposed MaskedSkipUNet architecture in the field of video anomaly detection.
4. Datasets
In our experiments, we utilize both image and video datasets to demonstrate the adaptability and breadth of our methodology. Initially, experiments on the fairly simple MVTec dataset, which comprises high-resolution images, are used as proof of concept. Subsequently, to delve deeper into more intricate scenarios, we leverage three prominent video-based datasets: the Ped2 dataset, CUHK Avenue dataset, and the ShanghaiTech dataset. These video datasets, with their inherent complexity and diverse anomaly types, provide a comprehensive backdrop against which we rigorously evaluate our proposed model.
4.1. MVTec Dataset
The MVTec dataset [
40] is a comprehensive and widely-used benchmark for evaluating machine vision algorithms. Comprising high-resolution images across a diverse array of industrial object categories like cables, screws, and pills, it is meticulously curated to emulate real-world industrial scenarios. This design makes the MVTec dataset an invaluable resource for gauging the robustness and generalizability of anomaly detection algorithms.
4.2. Ped2 Dataset
The Ped2 dataset [
41] is renowned for its surveillance video clips, collected in a pedestrian area. The videos feature a range of anomalous events, such as people biking, skating, or moving in an irregular pattern. Each video has a frame resolution of
and varies in length, allowing for a diverse set of samples.
4.3. CUHK Avenue Dataset
The CUHK Avenue dataset [
42] is another significant surveillance video dataset collected from the avenue of the Chinese University of Hong Kong. The anomalies in this dataset typically consist of running people, loitering, and throwing objects, which are not common activities in the depicted setting. The videos have a frame resolution of
, and the number of frames varies per video.
4.4. ShanghaiTech Dataset
Lastly, the ShanghaiTech dataset [
43] is a recently developed, large-scale dataset featuring surveillance video clips from a variety of scenarios, both indoor and outdoor. The anomalies in this dataset are diverse, including burglary, climbing, and fighting. Videos have a resolution of
, with the number of frames varying per video.
5. Results
5.1. MVTec Dataset Findings
In our experimentation, we employed a specific subset of the MVTec dataset as our foundational testing ground, seeking to probe the effectiveness of masked convolutions embedded within skip connections. The structured yet challenging nature of this image dataset, especially juxtaposed against the intricacies of video anomaly detection, offered a controlled environment to rigorously ascertain the algorithm’s capability in anomaly recognition. With the MVTec dataset encompassing both defect-free and defective samples, it provided an ideal base to validate the efficiency and scalability of our method.
Figure 6 shows the Receiver Operating Characteristic (ROC) curves for the AutoEncoder, UNet, and MaskedSkipUNet, together with their AUC scores. The MaskedSkipUNet achieves the highest AUC at 93.8%, well ahead of the UNet at 86.6% but only marginally above the AutoEncoder at 92.9%.
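For readers less familiar with the metric, the area under the ROC curve equals the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one, with ties counted as one half. The following is a minimal sketch of that pairwise definition; the function name and the toy scores are illustrative and are not taken from the experiments reported here.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of anomalous (label 1) and normal (label 0) scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute an AUC")
    # Each (anomalous, normal) pair counts 1 if ranked correctly, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one anomalous sample is scored below one normal sample.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In this toy case three of the four anomalous/normal pairs are ordered correctly, giving an AUC of 0.75.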
A more granular, qualitative examination further elucidates our conviction in advancing this methodology for the more complex domain of video anomaly detection.
Figure 7 displays results from the hazelnut category of the MVTec dataset. Ground truth images are arrayed in the first column, displaying both regular and anomalous specimens. Successive columns illuminate the output produced by the AutoEncoder, UNet, and MaskedSkipUNet. The inherent differences in reconstructions, spanning from the blurring by the AutoEncoder to the precise detailing by the UNet are notable. The MaskedSkipUNet’s extrapolation of normal features is particularly compelling, as it seemingly leverages the contextual cues to supplant anomalous regions.
Figure 7.
Comparative analysis of anomaly reconstruction techniques on hazelnut samples from the MVTec dataset. The first column displays ground truth images, featuring both a normal and an anomalous sample. Subsequent columns show the output reconstructions from an AutoEncoder, UNet, and MaskedSkipUNet, respectively. The first row presents the ground truth and reconstructions of a normal sample, while the second row highlights the differences between the reconstructed and original ground truth of the normal sample. The third row showcases the ground truth and reconstructions of an anomalous sample, and the final row illustrates the discrepancies between the reconstructed and original ground truth of the anomalous sample. The figure demonstrates the varying capabilities of each algorithm in capturing details and reconstructing anomalies.
5.2. Video Anomaly Detection Results
In our evaluation, we follow the competitors as outlined in [37], and we additionally adopt a median window size of 17, as proposed in [28]. Since our method is not based on object detection, we compare it against other frame-based methods as in [37] and not against object detection-based methods [28,44,45,46]. Additionally, we assign an anomaly score to each frame in video anomaly detection by calculating the Euclidean distance at the pixel level between the original input and its reconstruction. We then divide this distance map into fixed-size patches and take the frame score to be the highest mean value among these patches.
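The scoring rule described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the `(H, W, C)` frame layout, the use of non-overlapping patches, and the default `patch=16` are all assumptions made for the example (the patch size is not given in the text), and the window of 17 is interpreted here as a temporal median filter over the per-frame scores.

```python
import numpy as np

def frame_score(frame, recon, patch=16):
    """Highest patch-wise mean of the per-pixel Euclidean reconstruction error.

    `frame` and `recon` are (H, W, C) arrays; `patch` is a placeholder size.
    """
    frame = np.asarray(frame, dtype=np.float64)
    recon = np.asarray(recon, dtype=np.float64)
    # Per-pixel Euclidean distance across channels -> (H, W) error map.
    err = np.sqrt(((frame - recon) ** 2).sum(axis=-1))
    h, w = err.shape
    # Crop so the map tiles exactly into non-overlapping patch x patch blocks.
    hc, wc = h - h % patch, w - w % patch
    blocks = err[:hc, :wc].reshape(hc // patch, patch, wc // patch, patch)
    # Mean error inside each patch, then the maximum over all patches.
    return float(blocks.mean(axis=(1, 3)).max())

def median_smooth(scores, window=17):
    """Temporal median filter over frame scores (edges padded by replication)."""
    scores = np.asarray(scores, dtype=np.float64)
    half = window // 2
    padded = np.pad(scores, half, mode="edge")
    return np.array([np.median(padded[i:i + window]) for i in range(len(scores))])
```

Taking the maximum over patch means, rather than the mean over the whole frame, keeps a small localised error from being averaged away by the large normal background.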
Frame-based methods focus on detecting anomalies at the individual frame level, which aligns with our approach. These methods are particularly suitable for surveillance video analysis, where temporal continuity may not always be guaranteed, and each frame’s independent analysis can provide valuable insights.
The competitors we consider include a variety of frame-based methods, each with its unique strengths and approaches to anomaly detection. These include methods leveraging memory-augmented deep autoencoders [36], contrastive learning [47], comprehensive regularisation in predictive networks [44], and more. By comparing our methodology with these established frame-based methods, we aim to demonstrate the robustness and effectiveness of our approach in detecting anomalies in surveillance videos.
Table 1 compares different video anomaly detection methods on the Ped2, Avenue, and ShanghaiTech datasets. Among frame-based methods, the MaskedSkipUNet matches the current state of the art on Ped2 and surpasses it on Avenue. Specifically, it achieves an accuracy of 98.4% on Ped2, matching the top-performing competitor, and 91.2% on Avenue, outperforming all other methods evaluated. Our method therefore attains the best result in the comparison on two of the three datasets. On ShanghaiTech we achieve 76.0% accuracy; the diverse and complex scenes of this dataset remain an ongoing challenge in the field of video anomaly detection.
While we acknowledge that object detection-based methods have been shown to attain better scores on these datasets, we argue that they may not be as effective in some practical use cases. Most object detection-based methods are only able to detect anomalies in objects they are trained to detect. This means they would not be able to predict anomalies in objects not detectable by the object detection models, limiting their applicability in real-world scenarios where anomalies could take many unpredictable forms.
6. Ablation Studies
We carried out a series of ablation studies to discern the individual contributions of various components in our proposed architecture, the MaskedSkipUNet. The following subsections detail the variants of the model (shown in Figure 8) and their respective performances (reported in Table 2).
6.1. Comparative Analysis of Baseline and Skip Connection Variants
We explored three foundational variants of our MaskedSkipUNet architecture to understand their individual contributions to anomaly detection. The AutoEncoder, which uses only the encoder and decoder without any skip connections or MaskedConv3D layers, served as our baseline and achieved an AUC score of 87.1%. Adding skip connections to the AutoEncoder resulted in the UNet model, which surprisingly showed a decreased AUC score of 75.3%. This suggests that UNets may be more effective at reconstructing anomalies, thereby reducing their anomaly detection capabilities. Lastly, the Conv3D SkipUNet variant, which replaces MaskedConv3D layers with standard Conv3D layers, yielded a notable AUC score of 95.2%.
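For readers less familiar with the metric, the AUC values quoted in these ablations can be read as the probability that a randomly chosen anomalous frame receives a higher anomaly score than a randomly chosen normal frame. The sketch below assumes that AUC here means the frame-level ROC AUC computed from binary frame labels, which the text does not state explicitly; the function name `roc_auc` is hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of anomalous vs. normal frame scores.

    `labels` are 1 for anomalous frames and 0 for normal frames.
    Uses O(n_pos * n_neg) memory, which is fine for a small illustration.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both normal and anomalous frames are required")
    # Compare every anomalous score with every normal score; ties count half.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

Under this reading, a model that reconstructs anomalies as faithfully as normal content produces overlapping score distributions for the two classes and hence a lower AUC.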
6.2. Impact of Dimensionality and Layer Types on Anomaly Detection
Two additional variants were examined to assess the impact of dimensionality and layer types on performance. The 2D MaskedSkipUNet, which replaces MaskedConv3D layers with MaskedConv2D layers, achieved an AUC score of 89.0%, which emphasises the importance of capturing the temporal dimension. On the other hand, the E/D Masks MaskedSkipUNet, which substitutes conventional Conv2D layers in both the encoder and decoder with MaskedConv2D layers, achieved a lower AUC score of 81.5%. This result indicates the effectiveness of standard Conv2D layers in feature extraction during both the encoding and decoding phases.
6.3. Incremental Benefits of Multiple MaskedConv3D Layers in MaskedSkipUNet
The “One Mask MaskedSkipUNet” incorporates a MaskedConv3D in the latent space without skip connections. Sequentially enhancing this setup, the “Two Mask MaskedSkipUNet” includes one MaskedConv3D skip connection and the “Three Mask MaskedSkipUNet” integrates two MaskedConv3D skip connections. Their respective AUC scores are reported in Table 2. Our primary contribution, the full-fledged “MaskedSkipUNet”, with a MaskedConv3D in the latent space and three MaskedConv3D skip connections, achieved the highest AUC score of all variants. This progression underscores the incremental value of each MaskedConv3D layer introduced.
6.4. Discussion
The results from our ablation studies illustrate the importance of each component in the MaskedSkipUNet architecture. In particular, the incorporation of MaskedConv3D layers, both in the latent space and as skip connections, proves crucial for optimising anomaly detection. While several methods mask the image directly and gauge the reconstruction error based on this altered input, our results highlight the advantage of masking at the convolution layer instead. This approach retains the original image data while forcing the model to infer each masked feature from its spatio-temporal context, which increases its sensitivity to anomalies, as reflected in the ablation results.
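To make the idea of masking at the convolution layer concrete, the following sketch applies a 3D kernel whose central weight is zeroed, so that each output value is computed only from the spatio-temporal neighbours of a position and never from the position itself. It is a simplified, single-channel NumPy illustration under stated assumptions: the "centre kernel" mask is interpreted as zeroing the single central weight, the kernel is fixed rather than learned, and the function names are hypothetical. It is not the authors' MaskedConv3D layer.

```python
import numpy as np

def centre_masked_kernel(kernel):
    """Return a copy of a (T, H, W) kernel with its central weight set to zero."""
    k = np.array(kernel, dtype=np.float64)
    t, h, w = k.shape
    k[t // 2, h // 2, w // 2] = 0.0
    return k

def masked_conv3d(volume, kernel):
    """'Valid' 3D cross-correlation of a (T, H, W) volume with a centre-masked kernel."""
    k = centre_masked_kernel(kernel)
    vol = np.asarray(volume, dtype=np.float64)
    # Windows have shape (T', H', W', t, h, w): one kernel-sized block per output.
    windows = np.lib.stride_tricks.sliding_window_view(vol, k.shape)
    # Weighted sum over each block; the centre voxel contributes nothing.
    return np.einsum("thwijk,ijk->thw", windows, k)
```

Because the centre voxel is excluded, the value at that position cannot simply be copied through the layer, which is the property that distinguishes this from masking pixels in the input image.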
Figure 9 visually compares the performance of each model variant on a frame from the Ped2 dataset, containing normal pedestrians and an anomalous cyclist. The AutoEncoder struggles to reconstruct both the cyclist and pedestrians, resulting in blurry outputs. The UNet achieves high-fidelity reconstructions for all subjects, including the anomaly, thereby failing to highlight it. Our proposed MaskedSkipUNet maintains sharp reconstructions of pedestrians while degrading the reconstruction of the cyclist, successfully isolating the anomaly. Other variants like the Conv3D SkipUNet and E/D Masks MaskedSkipUNet show partial effectiveness. Notably, as the number of MaskedConv3D layers increases from one to three, and ultimately to our final model with four, reconstructions of normal subjects improve, while the anomaly remains less visible, confirming the value of multiple masked layers.
7. Conclusions
In the evolving landscape of video anomaly detection, this research introduces a novel framework employing the innovative MaskedSkipUNet architecture. By strategically integrating MaskedConv3D layers within skip connections, we have enhanced the model’s prowess in identifying anomalies, particularly in complex surveillance footage. Our methodology’s robustness is evident from the empirical evaluations conducted on varied datasets, including prominent ones like Ped2, CUHK Avenue, and ShanghaiTech. Moreover, a preliminary exploration on a 2D image dataset establishes the versatility and adaptability of our approach as a precursor to its application on intricate video datasets.
Our findings shed light on the fundamental importance of masked convolutions, especially in preserving and emphasising spatial details, vital for distinguishing anomalies. The MaskedSkipUNet, while built upon existing frameworks like UNet, distinguishes itself by performing well in anomaly detection tasks where a conventional UNet reconstructs anomalies too faithfully to flag them.
Furthermore, our qualitative and quantitative results, supported by visualisations, underscore the merits of our proposed architecture. The ability of MaskedSkipUNet to extrapolate normal features, even in the presence of anomalous regions, sets it apart, enabling more accurate anomaly predictions and reconstructions.
However, our approach is not without limitations. One notable drawback arises when anomalies are large in size. Since the model relies on reconstructing masked regions using surrounding spatio-temporal context, it performs best when anomalies are small and the surrounding context is largely normal. In scenarios involving large-scale anomalies or scenes where much of the surrounding content is also abnormal, the model may lack sufficient normal reference to accurately infer a normal reconstruction. This constraint could reduce the model’s effectiveness in highly dynamic environments or in scenes where the anomaly encompasses a substantial portion of the frame. Additionally, the datasets we consider do not include variations in weather conditions. To ensure robustness in outdoor deployments across different climates, appropriate preprocessing techniques would need to be applied before using our model, as discussed in [60,61].
Looking forward, future research could explore augmenting the MaskedSkipUNet architecture with diffusion-based mechanisms. Specifically, the reconstructed output from the model could be treated as a slightly less noisy version of the original input and recursively passed back into the MaskedSkipUNet. Training the model to refine its reconstructions across multiple iterations would allow it to gradually suppress anomalies treated as noise and recover finer normal details. This iterative refinement process could significantly improve the model’s sensitivity to subtle and low-contrast anomalies, making it better suited for real-world applications that demand high precision.
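The recursive scheme suggested above can be outlined in a few lines. This is only a sketch of the control flow for the proposed future work, not something evaluated in this paper: `model` stands for any callable that maps an input to its reconstruction, and both the function name and the default number of steps are hypothetical.

```python
def iterative_refine(model, x, steps=3):
    """Feed each reconstruction back into the model and keep every iterate."""
    recon = x
    history = []
    for _ in range(steps):
        # Treat the current reconstruction as the next, slightly cleaner input.
        recon = model(recon)
        history.append(recon)
    return history
```

Keeping the intermediate iterates, rather than only the last one, would allow the anomaly score to be computed at any refinement depth.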
In conclusion, the MaskedSkipUNet represents a promising step forward in the domain of video anomaly detection. The foundation laid by this research paves the way for subsequent investigations and enhancements. As anomaly detection continues to hold great importance across numerous applications, from surveillance to industrial quality control, we expect the ideas presented in this work to remain relevant to this evolving discipline.