1. Introduction
Traffic anomaly detection is an important research topic in Artificial Intelligence (AI) and computer vision, as it directly affects road safety and urban mobility. Traffic anomalies such as accidents, vehicle breakdowns, or unusual pedestrian movements are unexpected events that can cause injuries or fatalities and create serious problems for emergency response and traffic management. According to the World Health Organization (WHO), road accidents cause around 1.19 million deaths and between 20 and 50 million injuries each year, with many of the injured suffering long-term disabilities [1]. These incidents not only have a high human cost but also place a heavy burden on healthcare systems and lead to major economic losses. This highlights the need for reliable and fast detection methods capable of identifying and anticipating such events.
Rapid emergency response is essential to minimize the impact of traffic anomalies. The golden hour principle in emergency medicine [2] stresses the importance of acting within the first 60 min after a severe injury to improve survival rates. However, most existing traffic monitoring systems rely on human operators monitoring multiple video feeds, which leads to delays due to fatigue, distractions, and cognitive overload. To overcome these limitations, AI-based anomaly detection systems have emerged as a crucial solution, significantly improving the speed and accuracy of anomaly identification.
In our previous work [3], we presented an intelligent system for detecting traffic anomalies using advanced deep learning techniques that predicted future frames from features extracted from the current ones. Although the method achieved promising results, it showed limitations in dynamic environments where vehicles and pedestrians appear and disappear at different speeds and directions, leading to noisy predictions and lower reliability.
To address this limitation, this paper presents an improved framework for real-time traffic anomaly detection that enhances both accuracy and efficiency. The proposed model introduces several key improvements:
Enhanced temporal context: Instead of analyzing a single frame, the model processes sequences of n consecutive frames, providing contextual knowledge to improve anomaly detection accuracy.
Temporal correlation modeling: Taking advantage of the n-frame context window, a Correlation Matrix is added to the system to enrich the temporal features extracted from the sequence.
Focused detection mechanism: The integration of an Interaction Field enables the model to prioritize dynamic elements, reducing the impact of static background noise and improving reconstruction reliability.
These improvements collectively contribute to a more practical and scalable AI-driven anomaly detection system, ensuring real-time performance while maintaining architectural simplicity. By minimizing computational overhead, the proposed approach offers a more efficient alternative to existing solutions, making it feasible for deployment on resource-constrained hardware and large-scale traffic monitoring infrastructures.
The remainder of this paper is organized as follows:
Section 2 reviews relevant literature on anomaly detection in traffic scenarios, highlighting key advancements and challenges.
Section 3 presents the proposed methodology, detailing the architectural modifications introduced to enhance anomaly detection performance.
Section 4 describes the experimental setup and discusses the obtained results. Finally,
Section 5 summarizes the conclusions and outlines potential directions for future improvements.
2. Related Work
Recent advances in anomaly detection for traffic scenarios have focused on using computer vision and artificial intelligence (AI) techniques to enhance accuracy and real-time responsiveness. Researchers have developed methods that tackle challenges such as distinguishing anomalies from normal behaviors, reducing false detections, and ensuring timely identification of critical events. Existing methodologies can be broadly categorized into feature-based approaches, generative models, and deep learning-based architectures, each offering different trade-offs between accuracy, computational cost, and scalability.
A deep pyramidal network has been proposed for anomaly detection, leveraging multi-scale pyramidal structures and perceptual loss to enhance feature learning [
4]. This approach encodes normal images into a low-dimensional latent space and reconstructs them, demonstrating strong performance on the MVTec dataset. However, its effectiveness in complex traffic environments remains limited due to challenges in handling intricate textures and dynamic scene variations. Another approach, SimpleNet, addressed anomaly detection by integrating a feature adaptor to minimize domain bias and an anomaly feature generator to synthesize anomalies within the feature space [
5]. Evaluations on the MVTec AD benchmark revealed its ability to bridge the gap between academic research and industrial applications. Generative Adversarial Networks (GANs) have also been explored for anomaly detection, as seen in ADGAN [
6]. This method searches image representations within the latent space of a GAN generator, iteratively refining both latent vectors and generator parameters. ADGAN has achieved competitive performance across various benchmarks, showcasing the potential of generative methods in anomaly detection tasks.
While significant progress has been made in image-based anomaly detection, addressing anomalies in video sequences presents additional complexities due to temporal dynamics. To improve anomaly detection and localization in surveillance videos, a method integrating topic models with a spatiotemporal classifier was introduced [
7]. This “projection model” enhances detection accuracy and provides precise spatiotemporal localization of anomalies. In addition, a real-world surveillance video dataset was proposed to encourage further advancements in anomaly detection research. Furthermore, spatio-temporal adversarial networks have been utilized to detect anomalies in video sequences [
8]. These models learn patterns corresponding to normal behaviors, enabling them to identify deviations indicative of anomalies. Experiments demonstrate competitive performance and interpretable visualizations of detected anomalies. A different approach on video anomaly detection was introduced through future frame prediction techniques [
9]. By enforcing spatial (appearance) and temporal (motion) constraints, this method ensures consistency between predicted and ground-truth frames. Notably, it was the first to explicitly incorporate a temporal constraint into video prediction tasks. Deep convolutional neural networks (CNNs) have also been applied to anomaly detection, focusing on the correlation between object appearances and their associated motions [
10]. This model is optimized for robustness against diverse anomalies, improving reliability in real-world applications. Unsupervised learning approaches have demonstrated strong potential in improving anomaly detection. One such approach leverages textual descriptions alongside a pre-trained vision-language model (CLIP) [
11]. This model achieves results comparable to weakly supervised techniques without requiring labeled data. Expanding on this trend of using large pre-trained models, AADC-Net was introduced as a multimodal network to specifically address data limitations and imbalances [
12]. This approach leverages both pretrained Large Language Models and Vision-Language Models to enhance understanding with minimal supervision. Furthermore, it integrates a DEtection TRansformer model for visual feature extraction, which notably eliminates the need for bounding box supervision. Also leveraging the cross-modal capabilities of CLIP, other research has focused on the limitations of visual-only approaches by developing a weakly supervised framework that incorporates audio-visual collaboration [
13]. This method introduces an efficient audio-visual fusion mechanism and a novel audio-visual prompt to adapt the frozen CLIP backbone without full retraining.
Another significant development is S3R, a self-supervised learning framework that models feature-level anomalies via sparse representations [
14]. By integrating Normal and De-Normal modules, S3R consistently outperforms existing methods in both one-class and weakly supervised settings. Addressing the dual challenges of high computational cost and the tendency of deep networks to overlook anomalies due to their strong generalization ability, another self-supervised approach was recently introduced [
15]. This method proposes Anomaly Distance Learning, which uses self-supervised signals to actively maximize the feature gap between normal and abnormal samples, and a Context-Aware Skip Connection to selectively preserve normal information. To further enhance temporal modeling in video anomaly detection, Spatiotemporal Long Short-Term Memory (ST-LSTM) combined with adversarial training was proposed [
16]. This approach effectively captures unified memory representations of spatial appearances and temporal variations, improving anomaly detection performance. Following this line of memory optimization, researchers have proposed a Multi-Memory-Augmented dual-flow network [
17]. This approach not only utilizes memory units to explicitly manage the diversity of normal patterns but also addresses the distance metric by introducing a novel “curvy metric”.
Addressing video anomaly detection from another perspective, several works have leveraged weakly supervised and few-shot learning approaches. Lu et al. proposed a few-shot learning approach designed to detect anomalies in previously unseen scenes with only limited frames for adaptation [
18]. The proposed method automatically learns robust spatiotemporal features, making it effective in adapting quickly to new scenarios. To reduce reliance on extensive manual annotations, a weakly supervised framework based on Multiple Instance Learning (MIL) has been developed [
19]. This method applies a ranking mechanism to video clips extracted via a Two-Stream Inflated 3D (I3D) network, enhancing detection accuracy. Traditional MIL-based ranking loss methods, while useful for anomaly classification, often fail to fully leverage the abundance of normal data, leading to susceptibility to misclassification errors. To overcome this limitation, a diffusion-based normality learning pretraining step has been introduced [
20]. This approach trains a Global–Local Feature Encoder (GLFE) exclusively on normal videos, capturing both short- and long-range temporal dependencies through a Transformer block and a pyramid of dilated convolutions. The method integrates a Co-Attention module for dynamic feature fusion and employs Multi-Sequence Contrastive loss to enhance anomaly discrimination. Further leveraging diffusion models, the approach in [
21] tackles the challenge of small abnormal objects—often missed by feature-level methods—by proposing a novel patch-based diffusion model engineered to capture fine-grained local information. Observing that anomalies manifest as deviations in both appearance and motion, this method also introduces innovative motion and appearance conditions that are integrated into the diffusion process.
A multimodal and multiscale weakly supervised method has also been introduced to improve anomaly detection in challenging video conditions such as blurring and occlusions [
22]. This framework extracts RGB and optical flow features using a pre-trained I3D network, filters redundancies with an Attention De-redundancy (AD) module, and models long- and short-term dependencies using a Multi-scale Feature Learning (MFL) module. Finally, an Adaptive Feature Fusion (AFF) module dynamically integrates the most relevant appearance and motion features.
Extending these concepts specifically to traffic scenarios, researchers have proposed tailored approaches to detect and respond promptly to traffic anomalies. Motion Interaction Field (MIF) has been employed to model vehicle interactions using Gaussian kernels, significantly improving anomaly detection and localization performance [
23]. This method, inspired by water wave dynamics, successfully captures key aspects of traffic interactions and outperforms traditional approaches. A multimodal large language model (MLLM)-based framework has been introduced to automate the detection of safety-critical events in driving videos [
24]. By leveraging multi-stage question-answering (QA) techniques, logical and visual reasoning capabilities are enhanced, leading to improved accuracy and detection efficiency. To improve response times in traffic anomaly detection, transfer learning combined with synthetic image generation has been employed [
25]. EfficientNetB1 and MobileNetV2 models have demonstrated notable performance, showcasing their potential in enhancing road safety. Another vision-based system optimized for anomaly detection in smart cities, I3D-CONVLSTM2D, integrates RGB and optical flow data using transfer learning [
26]. Designed for edge IoT devices, this system effectively addresses real-world challenges and dataset limitations. Deep representation modeling using denoising autoencoders has been explored to enhance real-time anomaly detection reliability [
27]. By learning normal vehicle behaviors, this approach improves the identification of unusual incidents while reducing false positives.
Several works have incorporated temporal information into neural networks to improve detection stability. For example, R-UNet [
28] integrates recurrent blocks and attention mechanisms for temporal dependency modeling, and DDoS-UNet [
29] adds temporal context for MRI super-resolution. These architectures demonstrate the advantages of temporal modeling, even outside the traffic domain.
Despite the significant progress, existing models still face challenges such as high false positive rates due to occlusions and poor adaptability to different environments. CNN-based methods often fail to fully capture temporal dynamics in traffic flow. To address these issues, this work proposes TGU-Net, a framework that introduces temporal correlation modeling and object-masked reconstruction to achieve more accurate and reliable anomaly detection in real-time traffic scenarios. The following section describes the proposed methodology in detail.
3. Methodology
The model extends our previous architecture [
3] depicted in
Figure 1 by incorporating temporal information and a focused detection strategy. The base architecture is built upon a convolutional U-Net [
30], where skip connections link encoder and decoder stages to preserve spatial details. While U-Net is commonly used for image segmentation, its ability to reconstruct high-quality images makes it suitable for predicting future frames in traffic scenes.
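For illustration, a minimal U-Net-style encoder–decoder with skip connections is sketched below in PyTorch; the depth, channel widths, and layer choices are illustrative and do not reproduce the exact configuration of the base architecture.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    """Two-level U-Net-style encoder-decoder with skip connections."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.enc1 = conv_block(in_ch, 64)
        self.enc2 = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(128, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)   # 128 skip channels + 128 upsampled
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)    # 64 skip channels + 64 upsampled
        self.head = nn.Conv2d(64, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)                  # full-resolution encoder features
        e2 = self.enc2(self.pool(e1))      # 1/2 resolution
        b = self.bottleneck(self.pool(e2)) # 1/4 resolution bottleneck
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)               # reconstructed frame

# Example forward pass on a dummy frame.
out = MiniUNet()(torch.rand(1, 3, 64, 64))
```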
The main improvements introduced in this work focus on better capturing temporal dependencies, refining spatial feature extraction, and improving the robustness of anomaly detection.
3.1. Data Acquisition
During early experiments, we observed that the quality of image reconstruction was limited, resulting in relatively low Peak Signal-to-Noise Ratio (PSNR) values. This was mainly due to the lack of large, high-resolution public datasets suitable for training deep generative models in realistic traffic environments. Although alternative anomaly detection datasets exist, many are characterized by low visual quality [
31], outdated simulation models [
32] or excessive noise and non-traffic specific anomalies [
33].
To overcome this limitation, we created a synthetic dataset (
https://github.com/Maarioo01/CARLAccident, accessed on 12 November 2025) using the CARLA simulator [34], designed for anomaly detection. For dataset collection, two specific locations from the built-in Town10 map were selected, as shown in
Figure 2.
Intersection scenario: A four-way intersection equipped with four cameras placed at different angles to cover all directions as illustrated in
Figure 3.
Curved road scenario: A single camera positioned on a curved road to capture vehicles approaching from both directions as shown in
Figure 4.
A total of five cameras were deployed in CARLA, each configured with a resolution of 1280 × 720 pixels and a 120° field of view (FoV), ensuring extensive coverage of the selected scenarios.
To generate data, a large number of dynamic agents, including cars, trucks, motorcycles, and pedestrians, were introduced using an automated simulation script. This setup ensured a natural traffic flow within the virtual environment, facilitating the collection of high-quality “regular driving” sequences.
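A minimal sketch of how one of these cameras could be configured through the CARLA Python API is shown below; the spawn coordinates and output paths are hypothetical placeholders, while the resolution and field of view match the values stated above.

```python
import carla

# Connect to a running CARLA server (default host and port assumed).
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Configure an RGB camera matching the paper's settings: 1280x720, 120 degree FoV.
blueprint_library = world.get_blueprint_library()
camera_bp = blueprint_library.find("sensor.camera.rgb")
camera_bp.set_attribute("image_size_x", "1280")
camera_bp.set_attribute("image_size_y", "720")
camera_bp.set_attribute("fov", "120")

# Placeholder pose overlooking an intersection in Town10 (illustrative values only).
camera_transform = carla.Transform(
    carla.Location(x=-47.0, y=20.0, z=8.0),
    carla.Rotation(pitch=-25.0, yaw=90.0),
)
camera = world.spawn_actor(camera_bp, camera_transform)

# Save every received frame to disk as part of a "regular driving" sequence.
camera.listen(lambda image: image.save_to_disk(f"out/{image.frame:06d}.png"))
```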
3.2. Proposed Architecture
The temporally enhanced framework proposed in this work is illustrated in Figure 5. Given a sequence I of n frames, the images are first resized to the network input resolution and processed by the encoder E to extract the main feature tensor f. In parallel, a Correlation Matrix is computed to capture pixel-level dependencies within I. This step is critical for preserving temporal context; without it, the system would interpret I as n independent frames rather than as a temporally coherent sequence. The Correlation Matrix is then flattened into a vector and fed into a Temporal Modulation Module. This module consists of two distinct linear layers that transform the flattened correlations into two vectors, γ (scale) and β (shift), both with dimension d. The latent space z is then generated by applying an Affine Transformation to the feature tensor f using these modulation factors:

z = γ ⊙ f + β,

where ⊙ denotes element-wise multiplication. This method ensures a more sophisticated and dynamic integration of temporal context into the visual features. By conditionally scaling and shifting the spatial features based on the history captured by the Correlation Matrix, the temporal dynamics are incorporated in a non-additive manner, maximizing fusion efficiency without increasing the dimensionality of the feature representation available to the decoder D.
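A minimal PyTorch sketch of this scale-and-shift modulation is given below; module and variable names, as well as the assumption that the flattened Correlation Matrix has n × n entries, are illustrative rather than the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class TemporalModulation(nn.Module):
    """Sketch of the Temporal Modulation Module described above.

    Two linear layers map the flattened Correlation Matrix to a scale
    vector and a shift vector of dimension d, which modulate the
    bottleneck feature tensor f channel-wise (z = gamma * f + beta).
    Names and shapes are illustrative, not the authors' exact code.
    """
    def __init__(self, n_frames: int, d: int = 1024):
        super().__init__()
        cm_dim = n_frames * n_frames          # flattened n x n correlation matrix
        self.to_scale = nn.Linear(cm_dim, d)  # produces gamma
        self.to_shift = nn.Linear(cm_dim, d)  # produces beta

    def forward(self, f: torch.Tensor, cm_flat: torch.Tensor) -> torch.Tensor:
        # f: (B, d, h, w) bottleneck features; cm_flat: (B, n*n) flattened correlations
        gamma = self.to_scale(cm_flat).unsqueeze(-1).unsqueeze(-1)  # (B, d, 1, 1)
        beta = self.to_shift(cm_flat).unsqueeze(-1).unsqueeze(-1)   # (B, d, 1, 1)
        return gamma * f + beta               # temporally modulated latent z

# Example with a batch of 2 sequences of 5 frames each.
tmm = TemporalModulation(n_frames=5, d=1024)
f = torch.rand(2, 1024, 8, 8)
cm_flat = torch.rand(2, 25)
z = tmm(f, cm_flat)   # (2, 1024, 8, 8)
```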
At the final stage, the predicted frame generated by our model must be compared with the ground-truth frame to evaluate the normality of the scene. However, in most anomaly cases, many background elements such as the road surface, buildings, traffic lights, or trees are not directly involved in the event. Moreover, since reconstructions of normal situations are never perfectly accurate at the pixel level, we introduced an additional step in the anomaly detection stage. After obtaining the prediction, the ground-truth frame is processed with YOLOv11 [35] as the object detector to identify the traffic-related elements of the scene and generate a mask M. This mask is then applied to both frames, producing their masked counterparts. Finally, the masked images are compared using PSNR, from which a similarity score is derived to determine whether the observed situation corresponds to a normal event or an anomaly.
3.3. Temporal Context
One major limitation of the initial approach [
3] was the lack of temporal knowledge in the input data used for frame reconstruction. The model was originally designed to predict the next frame based on a single input image, which proved insufficient for accurately capturing dynamic traffic scenes. Given that urban environments are characterized by constantly moving objects, such as vehicles and pedestrians suddenly entering the field of view, this limited temporal awareness resulted in frequent reconstruction errors.
This limitation was especially evident when the model attempted to reconstruct newly introduced objects, such as a car or pedestrian suddenly appearing in the scene. In these cases, significant reconstruction errors occurred because the model lacked prior context to predict the presence and shape of objects that had not been visible in the previous frame. The severity of these errors largely depended on the object’s position relative to the camera:
Distant objects: When an object appeared far from the camera as shown in
Figure 6a, perspective effects caused it to occupy a relatively small area in the image. Consequently, the reconstruction error was distributed over a limited number of pixels, leading to only a minor drop in PSNR.
Close objects: In contrast, when a new object entered the scene close to the camera as shown in
Figure 6b, it occupied a significantly larger portion of the image. This led to substantial reconstruction errors, resulting in a sharp drop in PSNR. Consequently, the system often misinterpreted this drop as an anomaly, increasing the estimated probability of anomaly and triggering a false positive detection.
Further analysis showed that this issue was most pronounced in the initial frames when a new object first entered the scene. At this stage, partial occlusion, caused by the camera’s perspective, limited the model’s ability to infer the complete structure of the object. However, as the object became fully visible, the reconstruction stabilized, and the PSNR returned to normal levels.
The primary cause of this issue was the model’s reliance on a single-frame input, which lacked sufficient temporal information for accurate predictions. To overcome this limitation, the training structure was modified to use a sequence of n consecutive frames instead of a single one. However, while increasing the input to a sequence of n frames enhances the temporal view, relying solely on image features from the encoder E is insufficient to capture frame-to-frame dependencies explicitly. Therefore, the Correlation Matrix is calculated to provide an explicit, low-dimensional representation of the pixel-level temporal consistency across the sequence. This information is then merged with the extracted feature tensor to construct an enriched latent space with stronger temporal awareness. The goal of this adjustment is to provide the model with a more comprehensive representation of scene dynamics, improving its ability to anticipate temporal variations while reducing the noise caused by newly appearing objects within the camera’s field of view.
However, this limitation was not fully overcome, as the unpredictable emergence of new objects continues to make perfect reconstruction a persistent challenge. This highlights an avenue for future work focused on improving the model’s capacity to anticipate and reconstruct previously unseen objects in complex traffic environments.
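As an illustration, the sketch below computes one plausible form of such a matrix, assuming pairwise Pearson correlations between the flattened frames of the input sequence; it is a hedged approximation of the pixel-level dependencies described above, not the authors’ exact formulation.

```python
import torch

def correlation_matrix(frames: torch.Tensor) -> torch.Tensor:
    """Pairwise correlation between the frames of a sequence.

    frames: tensor of shape (n, C, H, W) holding one input sequence.
    Returns an (n, n) matrix whose entry (i, j) is the Pearson
    correlation between the flattened pixels of frames i and j.
    """
    n = frames.shape[0]
    flat = frames.reshape(n, -1).float()                   # (n, C*H*W)
    flat = flat - flat.mean(dim=1, keepdim=True)           # zero-center each frame
    flat = flat / (flat.norm(dim=1, keepdim=True) + 1e-8)  # unit-normalize
    return flat @ flat.t()                                 # (n, n) correlations

# Example: a sequence of 5 RGB frames at 256x256 (illustrative sizes).
seq = torch.rand(5, 3, 256, 256)
cm = correlation_matrix(seq)        # (5, 5)
cm_flat = cm.flatten()              # fed to the Temporal Modulation Module
```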
3.4. Generative
The Generative part of our architecture, built on the U-Net encoder–decoder structure [30], benefits significantly from the enhanced latent space z. Our approach refines the temporal integration by moving beyond the original design [3], which relied only on image features propagated through skip connections. The key adjustment is the use of the Affine Transformation for the construction of z. By applying the temporal modulation factors γ and β derived from the Correlation Matrix to the bottleneck features f, we overcome the issue of insufficient temporal knowledge that plagued the single-frame baseline. This modulation allows the model to selectively emphasize elements within the visual features based on the observed temporal consistency of the scene, leading to more accurate predictions and improved robustness against reconstruction noise caused by dynamic objects. The decoder D then utilizes this temporally enriched z and the spatial details from the skip connections to reconstruct the subsequent frame.
3.5. Anomaly Detection
Once the predicted frame is generated, the anomaly detection stage is applied to evaluate the scene. In our previous work [3], anomaly evaluation was performed by directly comparing the predicted frame with the ground-truth frame to compute a similarity score. However, this initial approach exhibited several limitations. Changes between training and inference scenarios, including variations in objects, introduced noise into the reconstruction due to the inherent uncertainty of the generative model. Moreover, since the prediction cannot perfectly replicate the ground-truth pixels, the accumulation of minor reconstruction errors in irrelevant regions (i.e., buildings, road surface, distant objects) degraded the accuracy of anomaly detection, despite these areas not being crucial to distinguish normal from anomalous situations.

To mitigate these issues, we integrate a post-processing strategy that emphasizes the most relevant elements of the scene. We intentionally apply the object mask after the generative phase for two reasons: (1) it ensures the encoder E has access to the full image context to extract the most comprehensive features f, maximizing the reconstruction quality for all scenes, including those with subtle anomalies; (2) deriving the mask M from the ground-truth frame guarantees that the mask accurately highlights all present objects. Specifically, the ground-truth frame, which is free from reconstruction noise, is processed with a state-of-the-art object detector [35] to extract bounding boxes of traffic-related entities such as vehicles, pedestrians, and traffic lights. From these detections, a mask M is generated to highlight only the relevant elements. This mask is then applied to both the predicted and the ground-truth frames, thereby filtering out irrelevant regions that would otherwise introduce noise. Then, the PSNR is computed between the masked frames to determine whether the observed situation corresponds to a normal event or an anomaly. However, the performance of this system relies on the robustness of the object detector [35]. While its high precision proved sufficient for the diverse traffic behaviours and collisions generated in our dataset, potential failure to detect severely distorted or highly congested objects during real anomalies could lead to false negatives in the masked comparison. For future deployment in real scenarios, maintaining system reliability will require continuous monitoring and fine-tuning of the detection module with real data to ensure that all safety-relevant elements are consistently masked.
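A simplified sketch of this object-masked comparison is shown below; it assumes the Ultralytics YOLO interface with a generic YOLO11 checkpoint name, uses rectangular box masks, and restricts detections to common COCO traffic classes, all of which are illustrative choices rather than the exact pipeline.

```python
import numpy as np
from ultralytics import YOLO  # assumes the Ultralytics package and a YOLO11 checkpoint

# COCO indices for traffic-related classes: person, bicycle, car,
# motorcycle, bus, truck, traffic light (illustrative selection).
TRAFFIC_CLASSES = {0, 1, 2, 3, 5, 7, 9}
detector = YOLO("yolo11n.pt")  # placeholder checkpoint name

def masked_psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR between predicted and ground-truth frames, restricted to detected objects.

    pred, gt: (H, W, 3) arrays. The mask M is built from boxes detected on the
    ground-truth frame, mirroring the strategy described in Section 3.5.
    """
    result = detector(gt, verbose=False)[0]
    mask = np.zeros(gt.shape[:2], dtype=bool)
    boxes = result.boxes.xyxy.cpu().numpy().astype(int)
    classes = result.boxes.cls.cpu().numpy().astype(int)
    for (x1, y1, x2, y2), cls in zip(boxes, classes):
        if cls in TRAFFIC_CLASSES:
            mask[y1:y2, x1:x2] = True      # keep only traffic-related regions

    if not mask.any():                      # no relevant objects: fall back to full frame
        mask[:] = True

    diff = pred.astype(np.float64)[mask] - gt.astype(np.float64)[mask]
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```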
4. Experiments and Results
Building on TGU-Net, a series of experiments were conducted to evaluate the effectiveness of the proposed approach. The main goal was to analyze the impact of object-masked reconstruction, assessing whether filtering out static objects enhances anomaly detection, and to compare the performance of the added correlation matrix against the modifications previously explored in [3].
4.1. Generated Dataset
As mentioned in
Section 3.1, the lack of large, high-quality datasets for traffic anomaly detection motivated the creation of a CARLA-based synthetic dataset. Using the same simulator configuration described earlier, we collected both training and testing data under realistic traffic conditions. In total, 508,685 frames were collected for training, including 406,948 frames from the intersection scenario and 101,737 frames from the curved road. All training sequences represent normal traffic behaviors, without any anomalies.
To evaluate the model’s performance in detecting traffic anomalies, a separate dataset was generated with a combination of normal and anomalous events. The anomalies were created manually by introducing controlled collisions between vehicles under various conditions. This dataset includes
1270 anomalous frames, grouped into different sequences and covering various types of collisions, including frontal, lateral, and rear-end impacts. These sequences were manually generated to ensure a diverse range of scenarios.
2651 normal frames, also structured in distinct sequences, representing routine traffic behavior.
For preprocessing, the same pipeline was applied to both the training and testing datasets. First, the entire set of frames was grouped into sequences of n consecutive frames. Each sequence was then resized to a fixed resolution, providing the final input for the encoder E. Finally, the target to be predicted is defined as the frame immediately following the selected sequence.
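A minimal sketch of this sequence-building step is given below; the sequence length and target resolution are placeholders, since the exact values are not restated here.

```python
import torch
import torch.nn.functional as F

def build_sequences(frames: torch.Tensor, n: int, size: tuple[int, int]):
    """Group a video into (input sequence, target frame) training pairs.

    frames: (T, C, H, W) tensor with the ordered frames of one recording.
    n: number of consecutive frames per input sequence.
    size: (height, width) the frames are resized to; a placeholder value,
          as the actual resolution used in the paper is not restated here.
    """
    resized = F.interpolate(frames, size=size, mode="bilinear", align_corners=False)
    pairs = []
    for t in range(len(resized) - n):
        seq = resized[t : t + n]       # n consecutive input frames
        target = resized[t + n]        # the immediately following frame to predict
        pairs.append((seq, target))
    return pairs

# Example: 100 dummy 1280x720 RGB frames grouped into 5-frame sequences at 256x256.
video = torch.rand(100, 3, 720, 1280)
dataset = build_sequences(video, n=5, size=(256, 256))
```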
4.2. Implementation Details
With the training datasets prepared and organized, the next phase was training. The TGU-Net was developed and trained using PyTorch 2.0.0+cu118 on two NVIDIA RTX 4090 GPUs (Santa Clara, CA, USA) with 24 GB of VRAM each, using a workstation equipped with a 13th Gen Intel® Core™ i9-13900K (Chandler, AZ, USA) and 64 GB of RAM. The high memory capacity was essential for managing large batch sizes and high-resolution images, effectively minimizing memory-related bottlenecks. In alignment with the goal of computational efficiency, we measured the end-to-end inference time, encompassing the full pipeline: sequence loading, feature extraction, frame reconstruction, object detection, masking, and the final PSNR calculation. The average inference time achieved was ∼0.0785 s per sequence when running on a single NVIDIA RTX 4090 GPU. This performance confirms the framework’s capability for real-time deployment in traffic monitoring applications. The experiments were conducted using an architecture composed of an encoder–decoder module, with a middle feature representation
f whose dimension
d is fixed to 1024. Model training was performed using the Adam [
36] optimizer, combined with a learning rate scheduler to ensure stable convergence. The initial learning rate was set to 0.01 and decreased by 10% every two epochs. A batch size of 20 was adopted, and model training proceeded for a maximum of 100 epochs, with early stopping implemented to terminate the process when no further improvements were detected.
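The following sketch outlines this training configuration in PyTorch; the model, data, and the patience-based early-stopping rule are placeholders, as the exact stopping criterion is not specified.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data; in practice these are TGU-Net and the CARLA sequences.
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
dummy = TensorDataset(torch.rand(100, 3, 64, 64), torch.rand(100, 3, 64, 64))
train_loader = DataLoader(dummy, batch_size=20, shuffle=True)
criterion = torch.nn.MSELoss()  # the paper combines MSE with an SSIM term (see below)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Decrease the learning rate by 10% every two epochs (gamma = 0.9).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)

best_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    epoch_loss = 0.0
    for seq, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(seq), target)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step()

    # Generic patience-based early stopping (the exact criterion is not specified).
    if epoch_loss < best_loss:
        best_loss, bad_epochs = epoch_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```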
The loss function was defined as a combination of Mean Squared Error (MSE) and Structural Similarity Index Measure (SSIM), ensuring a balance between pixel-wise accuracy and perceptual quality:

L = λ1 · MSE(I, Î) + λ2 · (1 − SSIM(I, Î)),

with

MSE(I, Î) = (1/N) Σ_i (I_i − Î_i)²,

SSIM(I, Î) = [(2 μ_I μ_Î + c1)(2 σ_IÎ + c2)] / [(μ_I² + μ_Î² + c1)(σ_I² + σ_Î² + c2)],

where I and Î are the ground-truth and predicted images, N is the number of elements (pixels × channels), μ_I and μ_Î are their mean intensities, σ_I² and σ_Î² their variances, σ_IÎ the covariance, and c1 and c2 small stabilizing constants. The weights λ1 and λ2 control the relative contribution of the two terms.
While the MSE measures the average squared difference between corresponding pixels, indicating lower values for greater similarity, SSIM evaluates the perceptual quality of an image by comparing luminance, contrast, and structure, producing values between −1 and 1, where values closer to 1 indicate better reconstruction.
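A compact sketch of the combined loss is given below; it uses a simplified global SSIM (whole-image statistics rather than a sliding window) and placeholder weights, so it should be read as an approximation of the loss described above.

```python
import torch

def combined_loss(pred: torch.Tensor, target: torch.Tensor,
                  lam1: float = 1.0, lam2: float = 1.0,
                  c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """MSE + (1 - SSIM) loss over a batch of predicted/ground-truth frames.

    A simplified global SSIM (whole-image statistics instead of a sliding
    window) is used for brevity; lam1/lam2 are placeholder weights.
    Inputs are expected in the [0, 1] range, shape (B, C, H, W).
    """
    mse = torch.mean((pred - target) ** 2)

    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    )
    return lam1 * mse + lam2 * (1.0 - ssim)

# Example usage on dummy frames.
pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
loss = combined_loss(pred, target)
```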
In addition, the model’s performance was evaluated using several key metrics to assess its effectiveness and the impact of the proposed enhancements. Three main evaluation strategies were adopted: PSNR, accuracy, and a custom delay function.
In the first place, PSNR measures the quality of the reconstructed scenes by quantifying the difference between generated and ground-truth images. In the context of anomaly detection, PSNR is particularly effective, as it emphasizes the ratio between signal peak and noise, making it more resilient to localized outliers. In this work, PSNR values were also converted to confidence percentages to estimate the likelihood of an anomaly occurring. The PSNR is calculated as

PSNR = 10 · log₁₀(MAX² / MSE),

where MAX represents the maximum possible pixel value and MSE is the Mean Squared Error.
Subsequently, to complement PSNR, other classification metrics were used, including accuracy, precision, recall, and F1-score. These metrics offered a comprehensive view of the system’s ability to correctly distinguish between normal and anomalous sequences, based on a threshold applied to the confidence score derived from PSNR.
Finally, a custom delay function was implemented to evaluate the model’s responsiveness. This function measures the number of frames between the actual occurrence of an anomaly and the point at which the system detects it with 100% confidence.
4.3. Evaluation of PSNR Across Different Situations
Based on the evaluation metrics defined above, the experimental analysis was designed to systematically assess the effectiveness of the proposed TGU-Net framework. In this section, we present the results obtained from applying the model to diverse traffic scenarios, with the aim of quantifying the improvements introduced by the temporal and anomaly detection modules. As a first step, PSNR was evaluated across different test situations to provide a direct comparison between our proposed implementation, its ablated version without the anomaly detection stage, and the U-Net and U-NetR baselines from our previous work [
3]. The corresponding results are summarized in
Table 1.
As expected, anomalous scenes consistently yield lower PSNR values than normal ones, reflecting the model’s struggle to accurately reconstruct anomalies. However, the most significant insight is that the TGU-Net w/o mask not only achieves higher absolute PSNR values overall but also exhibits a much clearer separation between normal and anomalous scenarios. This distinction is critical for improving anomaly classification performance.
Nonetheless, its performance remains below that of the TGU-Net with the proposed anomaly detection part, underlining the critical role of the object-masking mechanism. This significant gap achieved by the final model reinforces its effectiveness in separating normal and anomalous frames, offering a more robust foundation for traffic anomaly detection systems.
To better visualize these differences, two scenarios were selected from the same camera viewpoint and environment, one depicting an anomaly and the other showing normal behavior. The corresponding frame sequences were processed using all the evaluated models. As shown in
Figure 7, each pair of lines represents the PSNR values obtained from a specific model when applied to the normal and anomalous scenes, respectively.
This visualization offers a clearer understanding of the effectiveness of TGU-Net in distinguishing between normal and anomalous scenes. Not only does it consistently yield the highest PSNR values, but it also maintains a well-defined separation between the two categories. Specifically, the large gap between the masked TGU-Net and the TGU-Net without mask demonstrates the efficacy of the masking strategy in suppressing background noise and achieving metric stability in normal scenes, a key justification for its superiority over the unmasked evaluation. In contrast, the models in [3] exhibit significantly lower PSNR values, with only minimal differentiation between normal and anomalous situations.
4.4. Assessment of Anomaly Detection Confidence
As outlined earlier, PSNR values are leveraged as a means of estimating the probability of an anomaly. To further analyze this aspect, an offline evaluation was conducted to examine the evolution of PSNR during anomalous sequences. Specifically, accident scenarios were inferred with the proposed TGU-Net and the corresponding PSNR values were tracked across time. These values were subsequently normalized into a percentage scale, so that the lowest observed PSNR was mapped to 100% anomaly confidence and the highest to 0%.
As illustrated in
Figure 8, this transformation facilitates the adjustment of stage limits, enabling the design of adaptive thresholds that can distinguish between three levels of risk: normal operation, potential risk, and emergency anomaly. This approach not only enhances interpretability but also provides the basis for deploying the system in real-time safety applications.
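The sketch below illustrates this PSNR-to-confidence normalization and a three-stage risk mapping; the warning and alarm thresholds are placeholders, since the system derives adaptive limits rather than fixed values.

```python
import numpy as np

def anomaly_confidence(psnr_values: np.ndarray) -> np.ndarray:
    """Map a sequence of PSNR values to anomaly confidence percentages.

    The lowest PSNR in the sequence becomes 100% confidence and the
    highest becomes 0%, as described in Section 4.4.
    """
    lo, hi = psnr_values.min(), psnr_values.max()
    return 100.0 * (hi - psnr_values) / (hi - lo + 1e-8)

def risk_level(confidence: float, warn: float = 50.0, alarm: float = 85.0) -> str:
    """Three-stage risk classification; the thresholds here are placeholders,
    not the adaptive limits used by the system."""
    if confidence >= alarm:
        return "emergency anomaly"
    if confidence >= warn:
        return "potential risk"
    return "normal operation"

# Example on a dummy PSNR trace around an accident.
psnr_seq = np.array([38.2, 37.9, 31.5, 24.1, 23.8, 30.0])
conf = anomaly_confidence(psnr_seq)
levels = [risk_level(c) for c in conf]
```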
4.5. Accuracy Evaluation of TGU-Net and Models in [3]
To further evaluate the model’s performance at different confidence levels used for anomaly classification, we analyzed its classification metrics, including accuracy, precision, recall, and F1 score. These metrics provide a comprehensive view of the model’s ability to correctly detect anomalies while minimizing false positives. The corresponding results are presented in
Table 2.
The baseline models achieve perfect recall, indicating that they successfully identify all anomalous cases. However, this comes at the cost of significantly lower precision due to a high number of false positives. These results imply that while both models are sensitive to detecting anomalies, they frequently misclassify normal scenes as anomalies, resulting in a high false alarm rate.
The TGU-Net w/o the mask already shows a notable improvement over the baseline approaches. It achieves higher accuracy and maintains perfect recall, successfully identifying all anomalous instances. Nevertheless, the presence of some false alarms indicates that residual noise in the scene, likely from static or irrelevant elements, can still cause occasional misclassification.
The final TGU-Net further enhances these results, achieving the highest accuracy and maintaining perfect recall, leading to a robust F1-score. This balance reflects the model’s strong capability in correctly identifying all anomalous cases while also minimizing false positives. These gains underscore the effectiveness of the object-masking strategy, which reduces background noise and allows the model to focus on dynamic, high-risk elements within the traffic scene, resulting in more consistent and trustworthy predictions.
4.6. Frame Delay Evaluation
Finally, to assess the timeliness of each model’s detection, we introduce a custom delay metric. This metric quantifies the reaction time of the model, measured in frames, from the onset of an anomalous event to its definitive detection. We formally define the delay D for a single anomalous event as

D = F_D − F_GT,

where F_GT (Ground Truth Frame) is the frame index where the anomalous event actually begins and F_D (Detection Frame) is the first frame index at which the model’s confidence score reaches its maximum value (100%), indicating a confident and complete detection.
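A short sketch of how this delay can be computed from per-frame confidence scores is shown below; variable names and the example values are illustrative.

```python
from typing import Sequence

def detection_delay(confidences: Sequence[float], gt_frame: int) -> int:
    """Delay D = F_D - F_GT between the true onset of an anomaly and the
    first frame where the model reaches 100% confidence.

    confidences: per-frame anomaly confidence (0-100) for one sequence.
    gt_frame: index F_GT where the anomalous event actually begins.
    Returns a negative value if the model anticipates the event.
    """
    detection_frame = next(
        (i for i, c in enumerate(confidences) if c >= 100.0), None
    )
    if detection_frame is None:
        raise ValueError("The model never reached full confidence.")
    return detection_frame - gt_frame

# Example: anomaly starts at frame 30; full confidence is reached at frame 29.
scores = [0.0] * 29 + [100.0] * 10
print(detection_delay(scores, gt_frame=30))   # -1 (detected one frame early)
```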
Table 3 presents a comparative analysis of the ground-truth frame (F_GT), the model’s detection frame (F_D), and the resulting delay (D) for the TGU-Net, its mask-free variant, and the two baseline architectures.
Models in [
3] demonstrate inconsistent behavior in early anomaly detection, often exhibiting significant delays. In multiple cases, they recognize anomalies well after the actual event, limiting their practicality in real-time applications.
The TGU-Net without the anomaly detection part shows notable improvement in early detection when compared to [
3]. In many scenarios, it successfully anticipates anomalies before they happen. However, its performance remains inconsistent in certain cases, occasionally failing to outperform the baseline models.
Within this context, the TGU-Net achieves a superior balance between early detection and prediction stability. It consistently reduces detection delays, often identifying anomalies at the exact moment they occur, or even slightly in advance, without losing precision. Moreover, it shows enhanced temporal stability, avoiding the fluctuations seen in the unmasked version, which occasionally introduced minor delays. As shown in
Figure 9, in the A-4-L-3 scenario all models converge around frame 30. However, TGU-Net and its unmasked variant reach full confidence earlier. Notably, the TGU-Net achieves 100% detection confidence precisely at the anomalous frame, as marked by the dotted horizontal line.
In summary, the TGU-Net demonstrates a good performance by enabling early and stable anomaly detection with sustained confidence. While the version without the anomaly detection part also surpasses baseline models in early detection, it falls short in terms of precision and consistency. These results underscore the performance of TGU-Net model in enhancing both detection stability and overall accuracy.