Review

Innovative Approaches to Traffic Anomaly Detection and Classification Using AI

1 Autonomous Mobility and Perception Lab (AMPL), Departamento de Ingeniería de Sistemas y Automática, Universidad Carlos III de Madrid, Av. de la Universidad, 30, 28911 Madrid, Spain
2 Technological Institute of Aragón, Calle Maria de Luna, 7-8, 50018 Zaragoza, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5520; https://doi.org/10.3390/app15105520
Submission received: 5 February 2025 / Revised: 6 May 2025 / Accepted: 9 May 2025 / Published: 15 May 2025

Abstract

Video anomaly detection plays a crucial role in intelligent transportation systems by enhancing urban mobility and safety. This review provides a comprehensive analysis of recent advancements in artificial intelligence methods applied to traffic anomaly detection, including convolutional and recurrent neural networks (CNNs and RNNs), autoencoders, Transformers, generative adversarial networks (GANs), and multimodal large language models (MLLMs). We compare their performance across real-world applications, highlighting patterns such as the superiority of Transformer-based models in temporal context understanding and the growing use of multimodal inputs for robust detection. Key challenges identified include dependence on large labeled datasets, high computational costs, and limited model interpretability. The review outlines how recent research is addressing these issues through semi-supervised learning, model compression techniques, and explainable AI. We conclude with future directions focusing on scalable, real-time, and interpretable solutions for practical deployment.

1. Introduction

As cities grow and transportation networks become more complex, there is a growing need for systems that can ensure safety and smooth mobility. A key part of these systems is to detect and manage traffic anomalies: events that disrupt normal traffic patterns and create risks. These anomalies range from small issues like illegal parking or jaywalking to serious problems like accidents or reckless driving. Identifying and addressing these issues is crucial to keeping roads safe and traffic flowing smoothly.
Recent advances in artificial intelligence, especially in video-based systems, have enabled real-time detection of traffic anomalies. These automated systems reduce the need for constant human supervision by quickly and accurately identifying unusual events. This not only reduces the workload for traffic managers but also improves the overall efficiency of urban transportation. As a result, cities become safer and more efficient places to live and travel.
These systems play a crucial role in urban management by maintaining continuous vigilance over urban environments. They are designed to quickly identify and alert authorities to unusual events. This proactive approach not only facilitates the timely resolution of such events but also aids in their prevention, ensuring a safer urban environment.
The utility of anomaly detection technologies extends beyond traffic and road safety, finding significant applications in sectors such as security, quality control in manufacturing, and patient health monitoring in medicine. Each of these applications shares a common goal of identifying irregular patterns to prevent and mitigate risks.
  • Security: Banks have used AI-based techniques to detect anomalies in card transactions for years. As presented in [1,2], systems are developed that learn to detect behavioral patterns common in fraud, so that these actions can be blocked and customer security improved.
  • Quality: The production and transportation of food and other goods has also made it necessary to create systems such as those in [3,4], capable of detecting anomalies in defective products for their correct disposal, as well as checking product quality before distribution and sale.
  • Medicine: Despite the great differences between individual patients, there are anomaly patterns detectable by systems such as those in [5,6], enabling specialists to prevent and respond to detected anomalies and safeguard patient health.
  • Sustainability: Using geographical data, applications such as those in [7,8] can detect polluting elements or areas with high concentrations of pollutants, enabling measures for their reduction.
Another major field for the application of automatic anomaly detection systems is mobility. With the extensive connectivity among urban centers such as cities, towns, and villages, a substantial volume of traffic moves between these locations, increasing the chances of abnormal incidents occurring on roads. These incidents can impede traffic flow and impact transportation, a crucial sector. Within mobility, traffic anomalies encompass various scenarios that disrupt normal traffic flow and endanger road users’ safety, ranging from minor infractions like illegally parked vehicles or jaywalking pedestrians to more severe situations such as speeding, erratic driving, or major accidents that pose significant risks of harm and fatality to vehicle occupants and pedestrians alike. Each type of anomaly presents distinct challenges and risks, underscoring the importance of early detection and management for ensuring safety and efficiency in urban transportation systems. Prompt identification and response to these anomalies are essential for enhancing road safety and urban quality of life.
Given the critical nature of these traffic anomalies, considerable emphasis has been placed on their detection and prevention to mitigate driving-related hazards as much as possible. However, it is crucial to acknowledge that driving behavior is heavily influenced by individual drivers, making it challenging to regulate and control solely through external interventions. Thus, several systems use these anomaly detections to manage traffic in congested areas (whether the congestion is momentary or sustained), as in the project [9]. A particular case is their application in autonomous vehicles, as seen in the work [10], in which the state of the vehicle is checked from various sensors to identify unusual behaviors. It is likewise necessary to use projects such as [11], in which vehicle sensor readings and the controller area network (CAN) communication protocol are used to predict future messages, detect anomalies, and ensure correct communication between vehicles, avoiding possible safety failures. Among these, various methods have been used to detect such events, such as computing the optical flow of objects in a sequence, using You Only Look Once (YOLO) [12] for detection and classification, pixel analysis, or more advanced technologies such as trained neural networks (NNs) or generative models.
The use of these techniques for the early detection of anomalies on the road is very important, since in accidents or other serious situations the response speed of the necessary services is decisive for the successful resolution of the problem. It also makes it easier for traffic management operators to identify problematic situations and act as quickly as possible, or to obtain much richer information on the situation from these systems to ensure its correct resolution. In this way, it is possible to improve the quality of life both in urban areas and in the connections between them, as well as to increase the safety of drivers in the event of any anomalous situation.
Another area of mobility where these anomaly detection systems would be very beneficial is the development of autonomous vehicles and interconnected vehicle systems, since they would provide greater safety in the use of these vehicles and improve their adaptation to irregular situations, always safeguarding the well-being of users.
To further contextualize these benefits, we present a comprehensive schematic in Figure 1 that outlines a common pipeline for traffic anomaly detection and how its integration could be implemented into a smart city framework. This diagram captures the end-to-end process—from data collection and preprocessing to detecting traffic anomalies and the system feedback—demonstrating how these models commonly function within real-time urban environments. It also illustrates how such systems can interact with broader smart infrastructure, including traffic control centers, emergency services, and autonomous vehicles, reinforcing the practical impact of these technologies in enhancing mobility safety, responsiveness, and adaptability.
This review explores the evolution of traffic anomaly detection methods, with a particular focus on deep learning techniques. To guide the reader through the wide range of approaches, we introduce a clear taxonomy of models and methodologies in Figure 2, which categorizes them based on factors such as learning paradigm (supervised vs. unsupervised), architecture, and modality (single vs. multimodal). This structured overview supports a step-by-step analysis of each method, which is further developed in the subsequent sections. By comparing their strengths, limitations, and application scenarios, the review highlights how these technologies contribute to smarter transportation systems, safer autonomous vehicles, and more efficient traffic management. The taxonomy also helps identify emerging trends and future research directions in this critical and rapidly evolving field.

2. Algorithms and Methods

The field of road anomaly detection has seen significant advancements employing a variety of methodologies to enhance accuracy, efficiency, and practicality. These techniques, driven by developments in artificial intelligence and computational science, aim to tackle the complexity and variability of traffic environments. This section reviews the key algorithms and methods, organized by their technological foundation, to highlight their unique contributions and collective impact on improving traffic safety and management.

2.1. Machine Learning

Machine learning has played a foundational role in traffic anomaly detection by enabling systems to learn patterns from data and identify deviations. These methods, particularly unsupervised learning techniques, are effective in handling the variability and unpredictability of traffic scenarios.
Trajectory analysis is a common approach in anomaly detection, where the movement patterns of objects are analyzed to identify irregularities. The work in [13] presented a method for detecting anomalous events in video sequences using trajectory analysis. The approach employs single-class support vector machines (SVMs) to group similar trajectories and identify deviations as anomalies. By smoothing trajectories and converting them into fixed-dimensional feature vectors, the model applies a Gaussian kernel for similarity analysis. The authors address the challenge of determining the parameter ν, which controls the proportion of outliers, with a novel hypervolume reduction technique to identify true outliers. Tests on synthetic and real-world datasets, including urban traffic surveillance videos, demonstrated robust anomaly detection and generalization to unseen patterns, as shown in Figure 3.
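To make the trajectory screening step concrete, the sketch below fits a one-class SVM on fixed-length trajectory feature vectors and flags deviating trajectories, in the spirit of [13]. The resampling routine, the RBF kernel, and the ν value are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal one-class SVM trajectory screening sketch: trajectories are resampled to a
# fixed number of points (here 16) to form fixed-dimensional feature vectors.
import numpy as np
from sklearn.svm import OneClassSVM

def to_feature_vector(trajectory: np.ndarray, n_points: int = 16) -> np.ndarray:
    """Resample an (N, 2) trajectory of (x, y) points into a fixed-length vector."""
    idx = np.linspace(0, len(trajectory) - 1, n_points)
    xs = np.interp(idx, np.arange(len(trajectory)), trajectory[:, 0])
    ys = np.interp(idx, np.arange(len(trajectory)), trajectory[:, 1])
    return np.concatenate([xs, ys])

# Normal training trajectories (vehicles roughly following the lane) and one detour.
rng = np.random.default_rng(0)
normal = [np.column_stack([np.linspace(0, 100, 50),
                           5 + rng.normal(0, 0.5, 50)]) for _ in range(200)]
detour = np.column_stack([np.linspace(0, 100, 50),
                          5 + 20 * np.sin(np.linspace(0, np.pi, 50))])

X_train = np.stack([to_feature_vector(t) for t in normal])
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)  # nu ~ expected outlier fraction
model.fit(X_train)

print(model.predict(to_feature_vector(detour)[None]))  # -1 -> flagged as anomalous
print(model.predict(X_train[:3]))                      # mostly +1 -> normal
```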
Clustering techniques help analyze trajectory patterns for trajectory-based anomaly detection. In [14], a study compared six trajectory distance measures and seven clustering methods across six datasets. As illustrated in Figure 4, the obtained results showed that the clustering method has minimal impact on quality, provided that complete, unsampled trajectories are used. Time-normalized distance measures like dynamic time warping (DTW) excel in datasets with varying dynamics, while dimensionality-reduction techniques like PCA perform better on long, overlapping trajectories. In cases where trajectory dynamics is a crucial factor for separation, time-normalized distances (DTW, LCSS, and PF) proved to be superior. For example, in the I5SIM3 dataset, which contained trajectories with different velocities within the same lane, DTW, LCSS, and PF distances achieved high accuracy in differentiating trajectories, while HU and PCA were unable to do so due to a loss of velocity information during resampling.
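As a brief, self-contained illustration of one of the compared measures, the following sketch implements a basic dynamic time warping distance and applies it to two trajectories covering the same path with different velocity profiles; it is a didactic O(NM) implementation, not the configuration evaluated in [14].

```python
# Basic DTW distance between two 2D trajectories (didactic implementation).
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW distance between trajectories a (N, 2) and b (M, 2)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Two vehicles on the same straight path: constant speed vs. accelerating.
t = np.linspace(0, 1, 60)
steady = np.column_stack([100 * t, np.zeros_like(t)])
accelerating = np.column_stack([100 * t ** 2, np.zeros_like(t)])

print(dtw_distance(steady, steady))        # 0.0
print(dtw_distance(steady, accelerating))  # > 0: the velocity profiles produce different point sequences
```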
In [16], a projection-based method for anomaly detection and localization, focusing on thematic models, was introduced. The approach combines spatio-temporal descriptors (HOG-HOF) with object location and size information to build a visual vocabulary. The algorithm operates in three stages: vocabulary creation, anomaly quantification, and localization. Compared to traditional probabilistic models, the proposed method improves both detection accuracy and spatio-temporal localization. By performing tests on three real-world surveillance datasets (Figure 5), the authors validated the effectiveness of the proposed methods.

2.2. Convolutional Neural Networks (CNNs)

Convolutional neural networks (CNNs) have been widely adopted for detecting traffic anomalies in surveillance videos. These models effectively analyze video sequences, learning normal traffic patterns and identifying deviations that signal unusual events. CNN-based systems, utilizing both supervised and unsupervised machine learning techniques, can detect a variety of anomalies, such as traffic accidents, congestion, and abnormal driver behavior. Their implementation has shown promising results in enhancing road safety and enabling smarter traffic management.
Waqas Sultani et al. proposed a novel approach for anomaly detection in real-world surveillance videos, addressing the challenge of identifying anomalies without prior knowledge of specific events [17]. The work highlighted the limitations of existing methods, which treat any deviation from normal patterns as anomalies, often leading to ambiguity due to contextual variability. As shown in Figure 6, the proposed system uses weakly labeled videos—only as normal or anomalous—within a multiple instance learning (MIL) framework. Video segments are treated as “instances” within a “bag”, enabling the algorithm to assign high anomaly scores to specific frames in anomalous videos. During testing, video segments are processed through a neural network, which calculates anomaly scores to detect unusual events. To evaluate their method, the authors introduced a large-scale anomaly detection dataset containing 1900 real-world surveillance videos spanning 128 h. This dataset, with 13 types of anomalies such as traffic accidents, thefts, and vandalism, is significantly larger and more diverse than existing datasets, providing a robust benchmark for future research. Tests demonstrated the effectiveness of the Deep MIL Ranking Model, which outperformed state-of-the-art methods in anomaly detection. Key improvements include reduced false positive rates and enhanced anomaly localization through sparsity and temporal smoothness constraints in the loss function. However, the authors noted challenges in action recognition due to high intra-class variability and low video resolution.
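The ranking objective at the core of this approach can be summarized in a few lines. The sketch below follows the loss described in [17]: a hinge ranking term between the highest-scoring segments of an anomalous and a normal bag, plus sparsity and temporal-smoothness terms on the anomalous bag. Tensor shapes and weight values are illustrative assumptions.

```python
# PyTorch sketch of a MIL ranking loss for weakly labeled video bags.
import torch

def mil_ranking_loss(scores_anom: torch.Tensor,
                     scores_norm: torch.Tensor,
                     lambda_smooth: float = 8e-5,
                     lambda_sparse: float = 8e-5) -> torch.Tensor:
    """scores_anom, scores_norm: (n_segments,) anomaly scores in [0, 1] for one bag each."""
    # The top segment of the anomalous bag should outscore the top segment of the normal bag.
    hinge = torch.relu(1.0 - scores_anom.max() + scores_norm.max())
    # Temporal smoothness: adjacent segments should have similar scores.
    smoothness = ((scores_anom[1:] - scores_anom[:-1]) ** 2).sum()
    # Sparsity: only a few segments in an anomalous video should score high.
    sparsity = scores_anom.sum()
    return hinge + lambda_smooth * smoothness + lambda_sparse * sparsity

# Example: 32 segments per bag, scores produced by some scoring network.
scores_anom = torch.rand(32, requires_grad=True)
scores_norm = torch.rand(32, requires_grad=True)
loss = mil_ranking_loss(scores_anom, scores_norm)
loss.backward()
```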
Based on the need for robust traffic anomaly detection, the work presented in [18] focused on a fast, unsupervised system designed for real-time performance without requiring prior training. Unlike supervised methods that rely on labeled datasets, this approach emphasizes computational efficiency while maintaining accuracy, making it particularly suited for intelligent transportation systems. The system comprises three modules:
  • Preprocessing module: Detects stationary objects using background modeling, road segmentation, and YOLO object detection.
  • Candidate selection module: Filters out misclassified stationary objects (e.g., road signs) using a nearest neighbor approach and K-means clustering. Objects forming dense clusters are considered normal, while sporadic outliers are flagged as anomalies (Figure 7).
  • Backtracking anomaly detection module: Computes a similarity statistic (SSim) between the current frame and previous frames to detect anomalies such as stopped vehicles.
In contrast to the dataset-centric approach highlighted earlier, this system was evaluated using the NVIDIA AI CITY Challenge test suite [19], achieving an F1 score of 0.5926 and an RMSE of 8.2386. The model processes frames in an average of 19 ms each, demonstrating its suitability for real-time applications.
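To illustrate the backtracking idea, the sketch below assumes the similarity statistic behaves like a structural-similarity measure computed over the candidate region across recent frames; the function name, window length, and threshold are assumptions, not values from [18].

```python
# If a region around a detected stationary object stays nearly identical across many
# past frames, it is likely a stopped vehicle rather than a passing one.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def is_stalled(region_history: list[np.ndarray], sim_threshold: float = 0.9) -> bool:
    """region_history: grayscale crops of the same region from the last N frames."""
    current = region_history[-1]
    sims = [ssim(current, past, data_range=255) for past in region_history[:-1]]
    return float(np.mean(sims)) > sim_threshold

# Synthetic example: a region that barely changes over 30 frames.
rng = np.random.default_rng(1)
base = rng.integers(0, 255, (64, 64)).astype(np.uint8)
history = [np.clip(base + rng.normal(0, 2, base.shape), 0, 255).astype(np.uint8)
           for _ in range(30)]
print(is_stalled(history))  # True -> flag the region as a possible stopped vehicle
```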
Alternatively, the work presented in [20] explores the application of video surveillance systems to detect traffic anomalies in smart cities. Unlike prior approaches that focus on real-time processing, this work highlights the long-term benefits of integrating video data into traffic management strategies, particularly in urban settings such as Kazan, Russia. This work targeted two specific traffic anomalies: illegal trajectories by vehicles or pedestrians and traffic congestion. To detect anomalies in trajectories, an unsupervised learning approach is employed, involving the extraction of object trajectories, the modeling of legitimate paths, and the identification of deviations as anomalies. For congestion detection, the authors propose four methods that compare real-time data—such as vehicle speed, travel time, and density—with historical records.
Object detection and tracking are critical to both tasks, with CNNs like R-CNN [21], SSD [22], and YOLO used for detection, and algorithms such as boosting [23], multiple instance learning (MIL) [24], kernelized correlation filter (KCF) [25], and minimum output sum of squared error (MOSSE) [26] employed for tracking. YOLO proved to be the most effective for detection, offering high speed and accuracy, while KCF and MOSSE excelled in tracking anomalies in low-quality video scenarios.
The proposed system was tested on datasets from intersections for trajectory anomaly detection and road cameras for congestion assessment. The results showed that the system accurately detected illegal trajectories and congestion, demonstrating practical applications in urban traffic management. However, challenges such as establishing lane-specific rules and addressing the limitations of unsupervised learning were noted.
Expanding on real-time traffic anomaly detection, this study [27] introduced a system that combines deep learning and decision trees to identify events such as accidents and stopped vehicles in CCTV footage. Unlike previous approaches, this method integrates adaptive preprocessing to handle diverse video conditions effectively. The system operates in two stages:
  • Vehicle detection: YOLOv5 [28] identifies vehicles within video frames.
  • Anomaly analysis: The system estimates the scene background by computing the median of randomly selected frames. A road mask delimits vehicular traffic zones, and a decision tree evaluates anomalies based on detection factors such as object size, detection probability, and overlap with the road mask. This process pinpoints the start and end points of anomalies.
Using the NVIDIA AI CITY CHALLENGE 2021 [29] dataset, the system achieved an F1 score of 0.8571 and an S4 score of 0.5686, demonstrating its effectiveness under varying conditions. To address challenges like occlusion, poor video quality, and lighting variability, preprocessing techniques—including classification by road type, weather, and time of day—were applied, enhancing background estimation accuracy.
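The background-estimation step described above lends itself to a short example. The sketch below computes a pixel-wise median over randomly sampled frames so that moving vehicles are suppressed and only the static scene remains; the sample count and file handling are illustrative assumptions.

```python
# Median-of-frames background estimation with OpenCV.
import cv2
import numpy as np

def estimate_background(video_path: str, n_samples: int = 50) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_ids = np.random.choice(total, size=min(n_samples, total), replace=False)
    frames = []
    for fid in frame_ids:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(fid))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    # The per-pixel median over time suppresses transient objects such as moving cars.
    return np.median(np.stack(frames), axis=0).astype(np.uint8)

# Hypothetical usage:
# background = estimate_background("cctv_clip.mp4")
# cv2.imwrite("background.png", background)
```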
Building on the versatility of computer vision frameworks, in [30], the authors proposed a system for extracting traffic information during mass events. Using YOLOv5 for vehicle detection and the Deep-SORT algorithm for tracking, the framework projects vehicle information—including ID, location, and timestamps—onto a scaled orthogonal map via homographic transformation.
The framework was evaluated using videos from both regular days and mass events at Texas A&M University, achieving a count error of 3.09% and an RSS of 5.07. Factors affecting performance, such as input size, lighting conditions, camera angles, and pixel size, were analyzed and optimized. Beyond vehicle counting, the system tracks turning patterns at intersections using a matrix that records vehicle movements between edges. Although the accuracy for this task ranged from 43% to 72%, the results aligned with observed traffic trends.
Positioned as a cost-effective alternative to traditional monitoring systems like traffic sensors and GPS, the framework offers adaptability to varying lighting conditions and compatibility with common surveillance cameras. Additionally, it utilizes open-source software and commodity hardware, significantly reducing implementation costs. The system is also easily generalizable, allowing users to configure survey areas and integrate new cameras by selecting four reference points.
Another CNN-based system for automatic traffic accident detection in surveillance videos has been presented in [31]. The methodology centers on training a CNN using the Vehicle Accident Image Dataset (VAID), comprising 1360 accident images collected under diverse environmental conditions and resolutions. Data augmentation techniques were employed to enhance the model’s generalization capability. The system was tested on 30 videos featuring both accident and normal traffic scenes. To mitigate label flickering during inference, a moving average prediction algorithm was implemented, smoothing classifications by averaging predictions over the last 128 frames.
Experimental results demonstrated an 80% accuracy rate in detecting accidents. However, challenges were observed in handling low-resolution videos, distant accidents, foggy conditions, and light reflections. Comparisons showed that integrating the moving average prediction algorithm significantly improved accuracy compared to using the CNN alone.
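The smoothing step used in [31] can be sketched as follows, assuming the per-frame CNN outputs an accident probability; the window of 128 frames follows the text, while the class names and threshold are illustrative assumptions.

```python
# Moving-average smoothing of per-frame predictions to suppress label flickering.
from collections import deque

import numpy as np

class MovingAveragePredictor:
    def __init__(self, window: int = 128, threshold: float = 0.5):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def update(self, accident_prob: float) -> str:
        """Add the CNN's per-frame probability and return the smoothed label."""
        self.history.append(accident_prob)
        return "accident" if np.mean(self.history) > self.threshold else "normal"

smoother = MovingAveragePredictor()
# A short burst of high scores is not enough to flip the smoothed decision.
for p in [0.1] * 100 + [0.9] * 10:
    label = smoother.update(p)
print(label)  # still "normal" despite the brief spike
```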
Addressing the need for efficient anomaly detection systems, in [32], Light-WVAD was introduced: a lightweight and effective model based on weakly supervised learning. Unlike fully supervised methods, Light-WVAD requires only video-level labels, specifying whether a video contains anomalies without frame-level annotations. The model comprises three innovative modules (Figure 8):
  • Multilevel temporal correlation attention module (MTA): Captures temporal relationships between video clips of varying duration.
  • Hourglass fully connected layer (HFC): Reduces parameters by half compared to conventional layers, maintaining performance while improving efficiency.
  • Adaptive instance selection strategy (AIS): Dynamically selects reliable instances with the highest anomaly scores for loss computation, addressing the uncertainty of weakly labeled data.
To enhance optimization, an antagonistic loss function is employed, ensuring normal instances score near zero while anomalous ones score near one.
Experimental results on the UCF-Crime [17] and ShanghaiTech [33] datasets demonstrated the model’s exceptional performance. Light-WVAD achieved an AUC of 95.9% on ShanghaiTech, outperforming most conventional methods with only 0.14 million parameters. On UCF-Crime, it achieved an AUC of 84.7%, ranking highest among weakly supervised lightweight models and third among all models.
Another work, Ref. [34], introduced a hybrid model for road accident detection using CCTV images. Based on EfficientNet-B7 as the core architecture, the system demonstrates both high accuracy and efficiency in identifying accidents. The model development encompasses three key components:
  • Data preparation: The Accident Detection From CCTV Footage dataset, consisting of 990 balanced images of accidents and non-accidents, is used. Preprocessing techniques such as data augmentation, normalization, and resizing are applied to optimize training.
  • Model architecture: EfficientNet-B7 is combined with additional Conv2D, Flatten, and Dense layers to create a hybrid model that extracts spatial features effectively, ensuring high performance.
  • Training and evaluation: The model achieves a training accuracy of 99.24%, a cross-validation accuracy of 94.98%, and an area under the curve (AUC) of 1.00, confirming its robustness in distinguishing accidents from non-accidents.
The model significantly impacts road safety by enabling real-time accident detection and facilitating timely interventions. Furthermore, its high interpretability enhances trust and transparency in its predictions. Future research is proposed to explore multimodal data integration, real-time connections with traffic management systems, and further improvements in model interpretability.
In [35], a deep learning model to predict traffic accident severity was presented. By addressing challenges in feature extraction from complex data and capturing temporal dependencies, the model integrates a one-dimensional convolutional neural network (1D CNN), a bidirectional long short-term memory network (BiLSTM), and attentional mechanisms. This architecture effectively extracts spatial features, captures temporal relationships, and assigns feature importance to optimize prediction accuracy (Figure 9).
The proposed model outperformed five standard methods—SVM, XGBoost, CNN 1D, DNN, and CNN-LSTM—in accuracy, recall, and F1 score. Component-wise analysis highlighted the individual contributions of each element, emphasizing their combined effectiveness. Exploratory data analysis revealed significant temporal patterns, such as the impact of speed limits and seasonal variations, underscoring the importance of spatial and temporal factors in accident severity prediction. This approach has broad implications for road safety. It enables the identification of risk factors, supports preventive interventions, and serves as an early warning system for emergency teams, improving response effectiveness. Future research will focus on validating the model across different regions and incorporating additional data sources, such as social networks and sensors, to further enhance its accuracy and applicability.
Expanding on unsupervised approaches for anomaly detection, this study [36] leverages deep autoencoders to identify anomalies in surveillance videos. Autoencoders encode input data into a compressed latent space and reconstruct it, learning normal patterns during training. Significant reconstruction errors in test frames signal anomalies, as unusual events deviate from the learned patterns. The proposed model incorporates the following:
  • Conv3D layers: Extract spatial and temporal features from video sequences.
  • ConvLSTM2D layers: Combine CNN and LSTM capabilities for improved spatio-temporal processing.
  • ConvTranspose3D layers: Enhance spatial resolution for better reconstruction accuracy.
To identify anomalies, a thresholding mechanism combines reconstruction error with kernel density estimation (KDE) in the intermediate latent space, enabling more precise anomaly detection. The model was evaluated on the UCSD Peds1 [37] and CUHK Avenue [38] datasets, achieving an AUC of 86.4% and 88.9%, respectively, demonstrating strong performance compared to current methods.
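A compact way to picture this dual criterion is sketched below: the trained autoencoder is abstracted into placeholder encode/reconstruct functions, a kernel density estimate is fitted on latent codes of normal clips, and the anomaly score combines reconstruction error with latent log-density. The weighting, bandwidth, and placeholder functions are assumptions, not values from [36].

```python
# Reconstruction error + latent KDE scoring sketch (autoencoder abstracted away).
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

def encode(clip: np.ndarray) -> np.ndarray:        # placeholder for the trained encoder
    return clip.reshape(-1)[:32]

def reconstruct(clip: np.ndarray) -> np.ndarray:   # placeholder for the trained decoder
    return clip + rng.normal(0, 0.01, clip.shape)

# Fit KDE on latent vectors of normal training clips.
normal_clips = [rng.normal(0, 1, (8, 16, 16)) for _ in range(100)]
kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(np.stack([encode(c) for c in normal_clips]))

def anomaly_score(clip: np.ndarray, alpha: float = 0.5) -> float:
    recon_error = float(np.mean((clip - reconstruct(clip)) ** 2))
    log_density = float(kde.score_samples(encode(clip)[None])[0])
    # High reconstruction error and low latent density both push the score up.
    return alpha * recon_error - (1 - alpha) * log_density

print(anomaly_score(normal_clips[0]))
```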
Another approach incorporating machine learning into anomaly detection was introduced in [39], in which a Siamese CNN was proposed to learn a distance function between pairs of video patches. Unlike methods using predefined distance functions, this approach optimizes feature vectors and distance metrics through training, enabling robust anomaly localization across diverse scenes.
The system shown in Figure 10 employs an exemplar-based nearest neighbor method to model normal activity with feature vectors (exemplars) for each spatial region. Test patches are compared to these exemplars, and patches significantly differing from all exemplars are flagged as anomalous. The Siamese CNN architecture includes twin convolutional branches with shared weights, processing similar and dissimilar pairs, and a classification pipeline minimizing cross-entropy loss.
The method was evaluated on the UCSD Ped1, Ped2 [37], and CUHK Avenue [38] datasets, outperforming or matching state-of-the-art methods. New evaluation criteria based on regions and tracking further improved the practical relevance of the results.

2.3. Generative Adversarial Networks (GANs)

Generative adversarial networks (GANs) have emerged as a promising approach for traffic anomaly detection, particularly in video surveillance contexts. By leveraging a competitive training process between a generator and a discriminator, GANs can effectively learn the distribution of normal traffic behavior. Once trained, they identify anomalies as instances the generator fails to reproduce accurately, which the discriminator then classifies as “fake”. Compared to traditional machine learning and deep learning techniques such as CNNs or autoencoders, GANs offer a unique advantage in modeling complex data distributions without requiring labeled anomaly data. Several studies have demonstrated the effectiveness of GAN-based methods in detecting rare anomalies in traffic videos. These findings suggest that GANs can complement or outperform conventional approaches in certain scenarios, particularly when anomalies are diverse and sparsely represented.
Leveraging the capabilities of GANs, the work presented in [40] introduces a novel method for video anomaly detection based on predicting future frames. This approach is based on the idea that normal traffic events are predictable, while anomalies deviate from these expected patterns and are, therefore, less predictable. Unlike traditional methods that focus on reconstructing training data, this technique identifies anomalies by comparing predicted video frames with their actual counterparts. A U-Net architecture [41] predicts the next frame, using spatial constraints to minimize appearance differences and temporal constraints to align motion via optical flow. An adversarial training module enhances the realism of generated frames, as illustrated in Figure 11.
Anomalies are detected using peak signal-to-noise ratio (PSNR) between predicted and actual frames, with normalized PSNR values serving as a regularity score for classification. The study is notable for introducing future frame prediction for anomaly detection, integrating motion constraints, and validating its approach on multiple public datasets (CUHK Avenue [38], UCSD Ped1 and Ped2 [37], and ShanghaiTech [33]), achieving superior area under the curve (AUC) results.
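The scoring step is simple enough to sketch directly. Following the description above, PSNR is computed between each predicted and actual frame and then min-max normalized over the video into a regularity score, so that low scores correspond to poorly predicted (anomalous) frames; the synthetic frames below are only for illustration.

```python
# PSNR-based regularity score for prediction-based anomaly detection.
import numpy as np

def psnr(pred: np.ndarray, real: np.ndarray, max_val: float = 1.0) -> float:
    mse = np.mean((pred.astype(np.float64) - real.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-12))

def regularity_scores(psnrs: np.ndarray) -> np.ndarray:
    """Min-max normalize per-frame PSNR to [0, 1]; lower means more anomalous."""
    return (psnrs - psnrs.min()) / (psnrs.max() - psnrs.min() + 1e-12)

# Example: 10 well-predicted frames and one badly predicted (anomalous) frame.
rng = np.random.default_rng(0)
reals = [rng.random((64, 64)) for _ in range(11)]
preds = [r + rng.normal(0, 0.01, r.shape) for r in reals]
preds[7] = rng.random((64, 64))  # the prediction fails on the anomalous frame

scores = regularity_scores(np.array([psnr(p, r) for p, r in zip(preds, reals)]))
print(scores.argmin())  # 7 -> the frame flagged as anomalous
```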
The spatio-temporal adversarial network (STAN) [42] is an advanced method for video anomaly detection that models “normality” in scenes by capturing spatio-temporal patterns. Its architecture consists of two key components: a bidirectional ConvLSTM-based generator, which creates intermediate frames from neighboring ones, and a 3D convolutional discriminator, which evaluates sequences for normality, as shown in Figure 12.
STAN employs adversarial training, where the generator learns to produce realistic frames, and the discriminator adapts to differentiate real from generated sequences. An “anomaly score” is calculated during testing using generator loss (measuring discrepancies between generated and actual frames) and discriminator loss (assessing sequence authenticity). Higher scores indicate a higher likelihood of anomalies.
Experiments on the UCSD Ped1 and Ped2 [37] and Avenue [38] datasets demonstrate STAN’s competitive performance, particularly in complex scenes with large objects and frequent occlusions. Visualization techniques further enhance interpretability by highlighting regions containing anomalies, as depicted in Figure 13.
The work presented in [43] proposes another GAN-based video anomaly detection method but with a dual discriminator. The main contribution lies in introducing a second discriminator specifically designed to analyze motion information, enhancing temporal continuity in predictions.
During training, the GAN generator learns to predict the next frame in video sequences using a dataset of normal events. The generator’s loss function combines intensity, gradient, motion, and adversarial losses to produce realistic frames. The motion discriminator examines optical flow between frames, ensuring smooth and consistent temporal transitions. During testing, the generator predicts frames, and the peak signal-to-noise ratio (PSNR) is used to measure the prediction quality. Frames with low PSNR values are flagged as anomalous, with regularity scores calculated for classification.
The method was evaluated on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets. The results show that the dual-discriminator GAN performs comparably or better than state-of-the-art methods, particularly excelling on the ShanghaiTech dataset due to its ability to handle complex motion scenarios.
TransAnomaly [44] introduces a novel method for video anomaly detection by integrating the U-Net architecture with the Video Vision Transformer (ViViT), as shown in Figure 14. This approach effectively leverages Transformers’ ability to model temporal sequences and global video context, addressing the challenge of detecting rare and unpredictable events.
Using a frame prediction-based strategy, TransAnomaly predicts the next frame in a video sequence. While normal frames are accurately predicted, anomalous frames result in higher prediction errors. ViViT enhances temporal and global context modeling, outperforming traditional CNN-based approaches. Integrated with U-Net, the model improves predictions of appearance and motion details.
The optimization process involves intensity, gradient, and difference loss functions to ensure accurate visual and motion predictions. A patch discriminator enhances the realism of frames during adversarial training. Anomalies are detected by calculating the PSNR between predicted and actual frames, with lower PSNR values indicating anomalies. To reduce noise, a sliding window approach calculates the PSNR in relevant frame regions.
Evaluated on the CUHK Avenue and UCSD Pedestrian datasets, TransAnomaly demonstrated improved performance over prior approaches. The work emphasized the importance of optimizing Transformer encoder depth and loss functions to maximize accuracy. Additionally, visualizations highlighted the model’s capability to localize anomalies effectively within frames.

2.4. Transformers

Transformers, renowned for their ability to capture long-range dependencies in sequential data, have emerged as a powerful methodology for detecting traffic anomalies in surveillance videos. These models can learn complex traffic patterns and vehicle behaviors and identify anomalies such as accidents or traffic violations. To further enhance performance, some approaches integrate Transformers with architectures like U-Net. By leveraging the strengths of Transformers, traffic anomaly detection systems can achieve greater accuracy and effectiveness, contributing to improved road safety.
The authors in [45] presented a method for anomaly detection in surveillance videos, using Transformers and an attention model to identify unusual events that deviate from normal patterns. To reduce reliance on frame-level labeling, the method employs a weakly supervised learning strategy, using video-level labels to generate frame-level anomaly scores. This is particularly advantageous in weakly supervised video anomaly detection (WSVAD), where distinguishing between normal and abnormal instances during training is challenging. Transformer-based video swin features are used for feature extraction, outperforming traditional CNN models such as I3D and C3D.
An attention layer with dilated convolution and self-attention captures both long- and short-range temporal dependencies, enhancing the model’s ability to identify anomalies accurately. For detecting anomalous segments, the robust temporal feature magnitude learning (RTFM) model is applied, using the L2 norm of temporal features to differentiate normal segments (low magnitudes) from anomalous ones (high magnitudes).
The presented work was evaluated on the ShanghaiTech Campus dataset, and the obtained results showed that the method achieved competitive performance, as indicated by a high area under the curve (AUC) score compared to existing approaches.
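The feature-magnitude intuition behind RTFM, as summarized above, can be sketched briefly: a video is scored by the mean L2 norm of its top-k snippet features, so a few high-magnitude anomalous snippets raise the video score. The value of k and the feature shapes below are illustrative assumptions.

```python
# Top-k feature-magnitude scoring sketch for snippet-level video features.
import torch

def topk_feature_magnitude(snippet_features: torch.Tensor, k: int = 3) -> torch.Tensor:
    """snippet_features: (n_snippets, feat_dim) temporal features of one video."""
    magnitudes = snippet_features.norm(p=2, dim=1)          # L2 norm per snippet
    topk = torch.topk(magnitudes, k=min(k, len(magnitudes))).values
    return topk.mean()

normal_video = torch.randn(32, 512) * 0.5
anomalous_video = normal_video.clone()
anomalous_video[10:13] *= 4.0                               # a few high-magnitude snippets

print(topk_feature_magnitude(normal_video).item())
print(topk_feature_magnitude(anomalous_video).item())       # noticeably larger
```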
In [46], a three-stage framework was introduced to detect anomalies in traffic video sequences by combining spatio-temporal feature extraction, segment-level anomaly detection, and video-level anomaly detection, enhancing accuracy and reliability.
In the first stage, spatio-temporal features are extracted using the ViViT neural network, a Transformer-based model effective in video analysis. ViViT leverages its attention mechanism to model long-term contextual information, addressing the limitations of traditional 3D convolution-based extractors. The pre-trained ViViT network produces “class tokens” for global segment features and “patch tokens” for local segment details. In the second stage, the extracted tokens are processed by two anomaly detectors. The segment-level detector analyzes “class tokens” using multi-instance learning with video-level labels, while the video-level detector assesses “patch token” similarities to identify abrupt changes or inconsistencies. The latter employs binary cross-entropy loss for training as a binary classification task. In the final stage, outputs from both detectors are fused to generate a composite anomaly score. This integration leverages the video-level detector’s stability to correct potential errors from the segment-level detector, ensuring more reliable anomaly detection. The presented framework was validated using the TAD dataset, which includes diverse traffic anomalies, and the framework demonstrated significant improvements in detection and localization. Ablation studies confirmed the effectiveness of each stage and the robustness of the fusion strategy, validating this approach as a powerful tool for traffic video anomaly detection.
In another work, presented in [47], the authors introduced CViT, an architecture that combines U-Net with vision Transformers (ViTs) to detect and localize anomalous behavior in videos, as illustrated in Figure 15. By integrating convolutional layers for local feature extraction with Transformer layers for capturing global relationships, CViT enhances the accuracy of anomaly detection in video sequences.
CViT operates by training solely on normal video sequences to learn typical behavioral patterns. An encoder-decoder structure, enhanced with CViT blocks on the encoder side, processes video frames. The encoder extracts both local and global features, while the decoder reconstructs the original frame. Anomalies are detected by calculating the difference between the original and reconstructed frames, with larger differences indicating higher likelihoods of anomalies. For frames flagged as anomalous, the localization module uses YOLO to identify objects and compare them with anomalous regions. Overlapping areas are highlighted with red boxes, pinpointing the locations of detected anomalies.
The model employs a composite loss function combining intensity, gradient, and structural similarity index measure (SSIM) losses to ensure high-quality reconstructions in both appearance and detail. Evaluations on proprietary data and benchmarks like UCSD, CUHK Avenue, and ShanghaiTech demonstrate that CViT outperforms state-of-the-art methods in detection accuracy and supports real-time processing, making it highly suitable for video surveillance applications.
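A composite loss of this kind can be sketched as follows, combining an intensity (MSE) term, a gradient-difference term, and an SSIM term. The SSIM here is a simplified global-statistics version for illustration, and the weights are assumptions rather than those used in [47].

```python
# Composite reconstruction loss sketch: intensity + gradient difference + SSIM.
import torch
import torch.nn.functional as F

def gradient_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Penalize differences between horizontal/vertical image gradients."""
    dx = lambda t: t[..., :, 1:] - t[..., :, :-1]
    dy = lambda t: t[..., 1:, :] - t[..., :-1, :]
    return (dx(pred) - dx(target)).abs().mean() + (dy(pred) - dy(target)).abs().mean()

def ssim_global(pred: torch.Tensor, target: torch.Tensor,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    # Simplified SSIM using global image statistics (no sliding window).
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    return ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))

def composite_loss(pred, target, w_int=1.0, w_grad=1.0, w_ssim=1.0):
    return (w_int * F.mse_loss(pred, target)
            + w_grad * gradient_loss(pred, target)
            + w_ssim * (1.0 - ssim_global(pred, target)))

pred = torch.rand(1, 1, 64, 64, requires_grad=True)
target = torch.rand(1, 1, 64, 64)
composite_loss(pred, target).backward()
```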
The paper [48] introduces Malta audio-visual anomaly detection (MAVAD), the first dataset specifically designed for anomaly detection in real-world traffic videos. MAVAD includes 764 videos spanning various weather and lighting conditions, with anomalies categorized into 11 classes.
The paper also proposes audio-visual anomaly cross-attention (AVACA), a novel method that integrates visual and audio features for anomaly detection. AVACA employs a dual-pathway architecture—one for visual processing and one for audio—fused using a Transformer-based cross-attention layer, as shown in Figure 16. This fusion enables more effective anomaly detection by combining audio and visual cues.
Experiments showed that incorporating audio embeddings improves detection performance by 5.2%, with the two Transformers enriching visual representations using audio context. This multimodal approach is particularly effective in scenarios where audio and visual information are closely linked. AVACA is trained with two loss functions: dynamic multiple instance learning loss (LDMIL), which separates anomalous from normal instances, and center loss (LC), which minimizes intra-class variation among normal instances. Additionally, the study evaluates image anonymization’s impact on AVACA, reporting an average performance reduction of just 1.7%, demonstrating its robustness even under such conditions.
The paper [49] presents a hybrid model that combines GANs with a Transformer model to improve traffic incident detection. The approach addresses challenges such as class imbalance, high false alarm rates, and the need for real-time detection. GANs are used to generate synthetic data, mitigating the dominance of “non-incident” events in the dataset, while the Transformer captures global contextual features and complex dependencies. The integration of GANs and Transformers enhances data balance and model adaptability to real-world variations. Experimental evaluations on traffic datasets such as PeMS, I-880 [50], Whitemud Drive [51], and NGSIM demonstrate superior performance, achieving high detection accuracy with low false alarm rates.
This paper [52] introduces MOVAD, a real-time video anomaly detection (VAD) architecture designed to enhance the safety of autonomous vehicles by enabling rapid detection of traffic anomalies. MOVAD operates through end-to-end training using only RGB frames, eliminating the need for optical flow or bounding boxes.
The architecture includes two main modules: the short-term memory module (STMM) and the long-term memory module (LTMM), as illustrated in Figure 17. The STMM, implemented with a video swin Transformer (VST), captures spatio-temporal correlations from recent frames, while the LTMM, based on an LSTM network, incorporates past context to enrich anomaly detection.
MOVAD was evaluated on the DoTA [53] dataset of accident videos, achieving an AUC of 82.17%. Ablation studies identified the optimal configuration as a four-frame temporal window for the STMM and a three-cell LSTM for the LTMM. The model demonstrated the ability to detect anomalies even in scenarios where participants are not visible, highlighting its robustness and versatility.

2.5. Multimodal Large Language Model (MLLM)

Large language models (LLMs) have gained importance for their versatility, leading to the development of multimodal large language models (MLLMs) capable of processing diverse data types, including images, videos, and text. MLLMs are being applied to traffic anomaly detection, where they analyze video sequences, interpret contextual information, and identify events such as dangerous driving behaviors or accidents. By integrating MLLMs into traffic surveillance systems, road safety can be enhanced, and traffic management optimized.
A novel approach to video anomaly detection is introduced in [54], using predefined text descriptions to represent both normal and abnormal situations. Unlike traditional methods that rely on labeled video data, this framework incorporates domain knowledge and large language models (LLMs), such as ChatGPT (chat.openai.com, accessed on 20 March 2023), to generate comprehensive text descriptions of events. The overall inference process is illustrated in Figure 18.
The detection process utilizes contrastive language–image pre-training (CLIP) [55], a vision–language model that calculates cosine similarity between image features and predefined text features. To adapt CLIP for this task, a conditional text similarity measure with trainable parameters is introduced. These parameters are optimized using triplet loss and regularization loss, removing the need for labeled data. Triplet loss maximizes the difference between normal and abnormal text descriptions by identifying the most similar and dissimilar examples, while regularization loss ensures pseudo-normal frames have low anomaly scores. Experiments on the ShanghaiTech and UCFcrime datasets demonstrate superior performance compared to unsupervised methods and results comparable to weakly supervised models. Additionally, the method is efficient and resilient to noisy text descriptions.
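The frame-versus-text scoring idea can be illustrated with off-the-shelf CLIP: a frame is compared against predefined normal and anomalous descriptions, and the probability mass on the anomalous prompts serves as an anomaly indicator. The prompts below are illustrative, and the trainable conditional similarity and the triplet/regularization losses of [54] are not reproduced here.

```python
# Zero-shot CLIP scoring of a frame against normal/anomalous text prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "vehicles driving normally on a road",           # normal description
    "a car crash on the road",                       # anomalous descriptions
    "a pedestrian walking in the middle of traffic",
]

frame = Image.new("RGB", (224, 224))  # stand-in for a real video frame
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

anomaly_score = probs[1:].sum().item()  # probability mass on the anomalous prompts
print(f"anomaly score: {anomaly_score:.3f}")
```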
AccidentBlip2 [56] is an MLLM designed for traffic accident detection using only vision. Building on the Blip2 architecture, AccidentBlip2 integrates key innovations to optimize its performance in this domain. A key component of AccidentBlip2 is the Motion Qformer module, which facilitates temporal inferences from multi-view images. Using a vision Transformer (ViT), the module extracts image features and fuses them in the Qformer, where cross-frame attention ensures efficient temporal information transfer.
The model extends beyond single-vehicle setups by incorporating a multi-vehicle system, addressing blind spots and enhancing accident detection across the environment. Each vehicle’s neural network detects accidents involving itself or other vehicles. Multi-view features extracted by a pre-trained ViT-14g model are processed by the Motion Qformer for autoregressive inference, as shown in Figure 19.
Evaluations on the DeepAccident [57] dataset show that AccidentBlip2 outperforms benchmarks like Video-LLaMA and V2XFormer. It achieves 66.5% accuracy in single-vehicle scenarios and 72.2% accuracy in multi-vehicle setups for detecting environmental accidents. In addition, the results highlighted that by using the Motion Qformer module and multi-vehicle perception, AccidentBlip2 demonstrates superior capability in analyzing multi-view temporal data, making it a robust solution for accident detection in complex traffic environments.
ViTA [58] is an algorithm designed to accelerate video-to-text conversion in retrieval-augmented generation (RAG) frameworks, addressing latency challenges in processing large video volumes. By leveraging vision–language models (VLMs), ViTA optimizes the relationship between output tokens and processing time to enhance efficiency without sacrificing accuracy.
The algorithm employs a hybrid approach, combining lightweight and heavyweight VLMs to balance speed and detail. In the first stage, a lightweight VLM (e.g., BLIP) generates a quick, basic description of the video content. This description serves as a prompt for a heavyweight VLM (e.g., InternLM-XComposer2), which extracts additional details while limiting token generation, as illustrated in Figure 20.
This two-stage strategy reduces latency by minimizing the output token count required from the heavyweight VLM. Experimental results on real-world datasets like StreetAware [59] and Tokyo MODI [60] show that ViTA achieves up to a 43% reduction in latency compared to systems relying only on heavyweight VLMs. Importantly, the algorithm maintains high query accuracy, ensuring information extraction without quality loss.
An innovative framework utilizing multimodal large language models (MLLMs) is proposed in [61] for real-time detection of critical road safety events. This approach integrates the logical and visual reasoning capabilities of models like Gemini-Pro-Vision 1.5 to analyze driving video sequences and extract key safety information, addressing limitations of traditional methods reliant on complex models and large datasets.
The framework employs a multi-stage question-and-answer (QA) process to guide MLLMs in identifying and classifying entities within video frames. Detected hazards are categorized using three dimensions: “What” (entity classification), “Which” (specific feature identification), and “Where” (location and distance of hazards). This structured methodology ensures the comprehensive assessment and rapid detection of safety-critical events.
Experiments conducted on the DRAMA [62] dataset, which focuses on driving hazard detection using natural language queries, demonstrated that few-shot learning achieved a 79% accuracy, outperforming other methods. The framework also excelled in tasks like scene classification, vehicle direction identification, and risk agent categorization, significantly surpassing previous visual reasoning models. The integration of multimodal data and dynamic context optimization further enhanced prediction reliability in driving scenarios.
The authors of [56] reviewed the application of MLLMs and vision large models (VLMs) in object detection for transportation systems, highlighting their potential to enhance safety, efficiency, and reliability. By integrating text, images, videos, and sensor data, MLLMs offer a comprehensive understanding of transportation environments, enabling context-aware object detection and few-shot learning to reduce reliance on large annotated datasets.
The study categorizes MLLM-based applications into three main tasks: perception and understanding, navigation and planning, and decision making and control. Real-world case studies demonstrate the versatility of MLLMs through examples such as road safety attribute extraction, safety-critical event detection, and visual reasoning in thermal imaging. These applications underscore the effectiveness of MLLMs in diverse transportation scenarios.
Despite their strong potential, MLLMs face notable limitations in traffic anomaly detection. They often struggle with complex visual scenes, such as occlusions or overlapping objects, and exhibit weaknesses in recognizing the fine details needed for accurate anomaly detection. A significant issue is the well-known phenomenon of “hallucination”, where the model generates outputs not grounded in the input data, leading to false detections.
MLLMs also face challenges in real-time inference, as their multimodal fusion and reasoning processes are computationally intensive, making them less suitable for deployment in live traffic environments compared to more streamlined CNN or GAN models. Additionally, many MLLM studies rely on curated datasets with well-lit, clean scenes, which limits their applicability to low-light or cluttered real-world settings.
Recent research points to future directions such as more efficient training strategies, optimized architectures for low-latency use, and integration with complementary sensor data to enhance performance in practical deployments.

CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario

In [63], the authors introduced CityLLaVA, a framework designed to optimize VLMs for road safety analysis in urban settings, focusing on insurance inspection and accident prevention. By enhancing the understanding and prediction of critical events in cities, CityLLaVA addresses key challenges in traffic safety. The framework utilizes the fine-tuning process shown in Figure 21, which incorporates four key strategies:
  • Bounding boxes optimize data preprocessing by selecting the best video views and focusing analysis on areas of interest, such as pedestrians and vehicles.
  • Question sequences and textual prompts guide the model to evaluate key elements like position, movement, and environment, adapting to various traffic scenarios.
  • Block expansion introduces additional decoder blocks, improving accuracy without significant overhead.
  • Increased prediction based on sequential questioning enhances accuracy by using previously obtained information in an ordered questioning approach.
A major innovation in CityLLaVA is the use of visual and textual prompts. Visual prompts, guided by bounding boxes, help align information in each frame, while textual prompts, tailored to road safety, extract relevant details more effectively. The framework uses block expansion to improve fine-tuning efficiency, outperforming methods like low-rank adaptation (LoRA) [66]. Sequential questioning further enhances predictions by incorporating prior context. CityLLaVA achieved first place in the AI City Challenge 2024 [19] in the traffic safety description and analysis category, demonstrating its potential as a powerful tool for urban road safety analysis.
VisionGPT [67] uses LLMs for real-time anomaly detection, enabling safe visual navigation for individuals with visual impairments. By combining the open-vocabulary object detection capabilities of Yolo-World with LLM intelligence, VisionGPT identifies obstacles and provides concise audio descriptions to inform users of potential hazards.
The system analyzes camera frames, dividing each image into four regions—left, right, front, and ground—and classifies objects by position to assist in hazard identification. Detailed information on each object, including classification, size, and location, is recorded, with alerts triggered for significant obstacles, particularly in the “ground” or “left/right” regions. VisionGPT’s integration of LLMs enhances adaptability to diverse contexts, with the study exploring the impact of prompt design for future improvements in accessibility. Experiments highlight the system’s high accuracy in anomaly detection, particularly under low-sensitivity settings, and its compatibility across platforms. VisionGPT achieves an end-to-end latency of 60 ms on mobile devices, ensuring fast and reliable feedback for users.
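The region-assignment step can be sketched with simple geometry on detector bounding boxes, as below; the region boundaries and example detections are assumptions for illustration.

```python
# Assign a detected obstacle to one of four regions (left, right, front, ground).
def assign_region(box: tuple[float, float, float, float],
                  frame_w: int, frame_h: int) -> str:
    """box = (x1, y1, x2, y2) in pixels."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    if cy > 0.75 * frame_h:            # low in the frame -> close to the user's feet
        return "ground"
    if cx < 0.33 * frame_w:
        return "left"
    if cx > 0.66 * frame_w:
        return "right"
    return "front"

# Example detections from an open-vocabulary detector (class label, bounding box).
detections = [("bicycle", (50, 200, 150, 400)), ("pothole", (280, 420, 360, 470))]
for label, box in detections:
    print(label, "->", assign_region(box, frame_w=640, frame_h=480))
```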
Finally, to conclude this section, Table 1 presents a structured comparative analysis of the most prominent AI-based approaches used in traffic anomaly detection. It outlines the key specifications, advantages, limitations, and real-world applications of each method, offering a clearer understanding of their practical relevance. This synthesis aims to assist researchers and practitioners in identifying suitable techniques based on their operational requirements and deployment contexts. The summarized insights also highlight ongoing challenges and research directions related to adaptability, interpretability, and computational efficiency.

3. Conclusions

This review examined recent advances in artificial intelligence methods for traffic anomaly detection and classification, emphasizing the crucial role these technologies play in enhancing urban mobility and safety. Various AI methodologies were analyzed and compared, including convolutional neural networks, Transformers, generative adversarial networks, autoencoders, and multimodal large language models, highlighting their strengths in addressing real-world scenarios. Unlike previous surveys [69,70,71,72], this paper specifically underscores the latest research in MLLMs, particularly their ability to integrate multimodal information (text, video, and images) for deeper contextual understanding of traffic behaviors.
Despite notable advancements, several key challenges remain. These include significant reliance on extensive labeled datasets, substantial computational costs, and ongoing issues related to model interpretability. Recent innovations such as semi-supervised learning, model compression, and explainable AI are progressively addressing these challenges, paving the way for more scalable, cost-effective, and transparent solutions.
Future research will prioritize the development of systems that are efficient, real-time capable, interpretable, and adaptable to diverse and dynamic traffic environments. Continued exploration into multimodal data integration and further improvements in computational efficiency are critical. Ultimately, advancements in these areas promise substantial enhancements in traffic management, road safety, and the creation of smarter, more responsive urban infrastructures.

Author Contributions

Conceptualization, B.P. and M.R.; Supervision, F.G. and A.A.-K.; Writing—original draft, B.P., M.R. and T.S.; Writing—review & editing, F.G. and A.A.-K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Spanish Government through the projects PID2021-128327OA-I00 and TED2021-129374A-I00, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
GAN: Generative Adversarial Network
LLM: Large Language Model
VLM: Visual Large Model
MLLM: Multimodal Large Language Model
CAN: Controller Area Network
YOLO: You Only Look Once
NN: Neural Network
SVM: Support Vector Machine
DTW: Dynamic Time Warping
LCSS: Longest Common Subsequence
PF: Procrustes Fitting
CCR: Correct Clustering Ratio
MIL: Multiple Instance Learning
RMSE: Root Mean Squared Error
SSD: Single Shot MultiBox Detector
KCF: Kernelized Correlation Filter
MOSSE: Minimum Output Sum of Squared Error
GPS: Global Positioning System
MTA: Multilevel Temporal Correlation Attention
HFC: Hourglass Fully Connected
AIS: Adaptive Instance Selection Strategy
AUC: Area Under the Curve
LSTM: Long Short-Term Memory
KDE: Kernel Density Estimation
PSNR: Peak Signal-to-Noise Ratio
STAN: Spatio-Temporal Adversarial Networks
ViT: Vision Transformer
ViViT: Video Vision Transformer
CViT: Convolution Vision Transformer
WSVAD: Weakly Supervised Video Anomaly Detection
RTFM: Robust Temporal Feature Magnitude Learning
SSIM: Structural Similarity Index Measure
MAVAD: Malta Audio-Visual Anomaly Detection
AVACA: Audio-Visual Anomaly Cross Attention
LDMIL: Dynamic Multiple Instance Learning Loss
LC: Center Loss
MOVAD: Memory-Augmented Online Video Anomaly Detection
VAD: Video Anomaly Detection
AVs: Autonomous Vehicles
STMM: Short-Term Memory Module
LTMM: Long-Term Memory Module
CLIP: Contrastive Language-Image Pre-Training
RAG: Retrieval-Augmented Generation

References

  1. Xiang, S.; Zhu, M.; Cheng, D.; Li, E.; Zhao, R.; Ouyang, Y.; Chen, L.; Zheng, Y. Semi-supervised credit card fraud detection via attribute-driven graph representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 14557–14565. [Google Scholar]
  2. Ram, P.; Gray, A.G. Fraud detection with density estimation trees. In Proceedings of the KDD 2017 Workshop on Anomaly Detection in Finance, PMLR, Halifax, NS, Canada, 14 August 2017; pp. 85–94. [Google Scholar]
  3. Hung, Y.H. Developing an Anomaly Detection System for Automatic Defective Products’ Inspection. Processes 2022, 10, 1476. [Google Scholar] [CrossRef]
  4. Vasafi, P.S.; Paquet-Durand, O.; Brettschneider, K.; Hinrichs, J.; Hitzmann, B. Anomaly detection during milk processing by autoencoder neural network based on near-infrared spectroscopy. J. Food Eng. 2021, 299, 110510. [Google Scholar] [CrossRef]
  5. Sedik, A.; Emara, H.M.; Hamad, A.; Shahin, E.M.; El-Hag, N.A.; Khalil, A.; Ibrahim, F.; Elsherbeny, Z.M.; Elreefy, M.; Zahran, O.; et al. Efficient anomaly detection from medical signals and images. Int. J. Speech Technol. 2019, 22, 739–767. [Google Scholar] [CrossRef]
  6. Tschuchnig, M.E.; Gadermayr, M. Anomaly detection in medical imaging-a mini review. In Proceedings of the Data Science—Analytics and Applications: Proceedings of the 4th International Data Science Conference–iDSC2021, Virtual, 20–21 October 2021; pp. 33–38. [Google Scholar]
  7. Dias, M.A.; Silva, E.A.d.; Azevedo, S.C.d.; Casaca, W.; Statella, T.; Negri, R.G. An incongruence-based anomaly detection strategy for analyzing water pollution in images from remote sensing. Remote Sens. 2019, 12, 43. [Google Scholar] [CrossRef]
  8. Wei, Y.; Jang-Jaccard, J.; Xu, W.; Sabrina, F.; Camtepe, S.; Boulic, M. LSTM-autoencoder-based anomaly detection for indoor air quality time-series data. IEEE Sensors J. 2023, 23, 3787–3800. [Google Scholar] [CrossRef]
  9. Bawaneh, M.; Simon, V. Anomaly detection in smart city traffic based on time series analysis. In Proceedings of the 2019 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 19–21 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  10. Zhang, M.; Chen, C.; Wo, T.; Xie, T.; Bhuiyan, M.Z.A.; Lin, X. SafeDrive: Online driving anomaly detection from large-scale vehicle data. IEEE Trans. Ind. Inform. 2017, 13, 2087–2096. [Google Scholar] [CrossRef]
  11. Cobilean, V.; Mavikumbure, H.S.; Wickramasinghe, C.S.; Varghese, B.J.; Pennington, T.; Manic, M. Anomaly Detection for In-Vehicle Communication Using Transformers. In Proceedings of the IECON 2023—49th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 16–19 October 2023. [Google Scholar]
  12. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  13. Piciarelli, C.; Micheloni, C.; Foresti, G.L. Trajectory-based anomalous event detection. IEEE Trans. Circuits Syst. Video Technol. 2008, 18, 1544–1554. [Google Scholar] [CrossRef]
  14. Morris, B.; Trivedi, M. Learning trajectory patterns by clustering: Experimental studies and comparative evaluation. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 312–319. [Google Scholar]
  15. Morris, B.T. Understanding Activity from Trajectory Patterns; University of California: San Diego, CA, USA, 2010. [Google Scholar]
  16. Pathak, D.; Sharang, A.; Mukerjee, A. Anomaly localization in topic-based analysis of surveillance videos. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 389–395. [Google Scholar]
  17. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488. [Google Scholar]
  18. Doshi, K.; Yilmaz, Y. Fast unsupervised anomaly detection in traffic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 624–625. [Google Scholar]
  19. Wang, S.; Anastasiu, D.C.; Tang, Z.; Chang, M.C.; Yao, Y.; Zheng, L.; Rahman, M.S.; Arya, M.S.; Sharma, A.; Chakraborty, P.; et al. The 8th AI City Challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 17–18 June 2024. [Google Scholar]
  20. Minnikhanov, R.; Dagaeva, M.; Anikin, I.; Bolshakov, T.; Makhmutova, A.; Mingulov, K. Detection of traffic anomalies for a safety system of smart city. In Proceedings of the CEUR Workshop Proceedings, Bergen, Norway, 17 July 2020; Volume 2667, pp. 337–342. [Google Scholar]
  21. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. pp. 21–37. [Google Scholar]
  23. Grabner, H.; Grabner, M.; Bischof, H. Real-time tracking via on-line boosting. In Proceedings of the British Machine Vision Conference, Edinburgh, UK, 4–7 September 2006. [Google Scholar]
  24. Babenko, B.; Yang, M.H.; Belongie, S. Visual tracking with online multiple instance learning. In Proceedings of the 2009 IEEE Conference on computer vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 983–990. [Google Scholar]
  25. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef]
  26. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550. [Google Scholar]
  27. Aboah, A. A vision-based system for traffic anomaly detection using deep learning and decision trees. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4207–4212. [Google Scholar]
  28. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; Xie, T.; Fang, J.; Imyhxy; et al. YOLOv5 by Ultralytics. 2020. Available online: https://zenodo.org/records/7347926 (accessed on 1 January 2020).
  29. Naphade, M.; Wang, S.; Anastasiu, D.C.; Tang, Z.; Chang, M.C.; Yang, X.; Yao, Y.; Zheng, L.; Chakraborty, P.; Lopez, C.E.; et al. The 5th AI City Challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Nashville, TN, USA, 19–25 June 2021. [Google Scholar]
  30. Pi, Y.; Duffield, N.; Behzadan, A.H.; Lomax, T. Visual recognition for urban traffic data retrieval and analysis in major events using convolutional neural networks. Comput. Urban Sci. 2022, 2, 2. [Google Scholar] [CrossRef]
  31. Khan, S.W.; Hafeez, Q.; Khalid, M.I.; Alroobaea, R.; Hussain, S.; Iqbal, J.; Almotiri, J.; Ullah, S.S. Anomaly detection in traffic surveillance videos using deep learning. Sensors 2022, 22, 6563. [Google Scholar] [CrossRef]
  32. Wang, Y.; Zhou, J.; Guan, J. A lightweight video anomaly detection model with weak supervision and adaptive instance selection. Neurocomputing 2025, 613, 128698. [Google Scholar] [CrossRef]
  33. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar]
  34. Singh, R.; Sharma, N.; Rajput, K.; Pokhariya, H.S. EfficientNet-B7 Enhanced Road Accident Detection Using CCTV Footage. In Proceedings of the 2024 Asia Pacific Conference on Innovation in Technology (APCIT), Mysore, India, 26–27 July 2024; pp. 1–6. [Google Scholar]
  35. Alhaek, F.; Liang, W.; Rajeh, T.M.; Javed, M.H.; Li, T. Learning spatial patterns and temporal dependencies for traffic accident severity prediction: A deep learning approach. Knowl.-Based Syst. 2024, 286, 111406. [Google Scholar] [CrossRef]
  36. Mishra, S.; Jabin, S. Anomaly detection in surveillance videos using deep autoencoder. Int. J. Inf. Technol. 2024, 16, 1111–1122. [Google Scholar] [CrossRef]
  37. Chan, A.B.; Vasconcelos, N. Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 909–926. [Google Scholar] [CrossRef] [PubMed]
  38. Lu, C.; Shi, J.; Jia, J. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2720–2727. [Google Scholar]
  39. Ramachandra, B.; Jones, M.; Vatsavai, R. Learning a distance function with a Siamese network to localize anomalies in videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2598–2607. [Google Scholar]
  40. Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6536–6545. [Google Scholar]
  41. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. pp. 234–241. [Google Scholar]
  42. Lee, S.; Kim, H.G.; Ro, Y.M. STAN: Spatio-temporal adversarial networks for abnormal event detection. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 1323–1327. [Google Scholar]
  43. Dong, F.; Zhang, Y.; Nie, X. Dual discriminator generative adversarial network for video anomaly detection. IEEE Access 2020, 8, 88170–88176. [Google Scholar] [CrossRef]
  44. Yuan, H.; Cai, Z.; Zhou, H.; Wang, Y.; Chen, X. Transanomaly: Video anomaly detection using video vision transformer. IEEE Access 2021, 9, 123977–123986. [Google Scholar] [CrossRef]
  45. Deshpande, K.; Punn, N.S.; Sonbhadra, S.K.; Agarwal, S. Anomaly detection in surveillance videos using transformer based attention model. In Proceedings of the International Conference on Neural Information Processing, IIT, Indore, India, 22–26 November 2022; pp. 199–211. [Google Scholar]
  46. Chen, J.; Wang, J.; Pu, J.; Zhang, R. A Three-Stage Anomaly Detection Framework for Traffic Videos. J. Adv. Transp. 2022, 2022, 9463559. [Google Scholar] [CrossRef]
  47. Roka, S.; Diwakar, M. Cvit: A convolution vision transformer for video abnormal behavior detection and localization. SN Comput. Sci. 2023, 4, 829. [Google Scholar] [CrossRef]
  48. Leporowski, B.; Bakhtiarnia, A.; Bonnici, N.; Muscat, A.; Zanella, L.; Wang, Y.; Iosifidis, A. Audio-Visual Dataset and Method for Anomaly Detection in Traffic Videos. arXiv 2023, arXiv:2305.15084. [Google Scholar]
  49. Lu, X.; Zhang, D.; Xiao, J. A Hybrid Model for Traffic Incident Detection based on Generative Adversarial Networks and Transformer Model. arXiv 2024, arXiv:2403.01147. [Google Scholar]
  50. Skabardonis, A.; Petty, K.F.; Bertini, R.L.; Varaiya, P.P.; Noeimi, H.; Rydzewski, D. I-880 field experiment: Analysis of incident data. Transp. Res. Rec. 1997, 1603, 72–79. [Google Scholar] [CrossRef]
  51. Zhang, W.; Xiong, L.; Ji, Q.; Liu, H.; Zhang, F.; Chen, H. Dissipative Structure Properties of Traffic Flow in Expressway Weaving Areas. Promet-Traffic Transp. 2024, 36, 717–732. [Google Scholar] [CrossRef]
  52. Rossi, L.; Bernuzzi, V.; Fontanini, T.; Bertozzi, M.; Prati, A. Memory-Augmented Online Video Anomaly Detection. In Proceedings of the ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 6590–6594. [Google Scholar]
  53. Yao, Y.; Wang, X.; Xu, M.; Pu, Z.; Wang, Y.; Atkins, E.; Crandall, D. DoTA: Unsupervised detection of traffic anomaly in driving videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 444–459. [Google Scholar] [CrossRef]
  54. Kim, J.; Yoon, S.; Choi, T.; Sull, S. Unsupervised video anomaly detection based on similarity with predefined text descriptions. Sensors 2023, 23, 6256. [Google Scholar] [CrossRef] [PubMed]
  55. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  56. Shao, Y.; Cai, H.; Long, X.; Lang, W.; Wang, Z.; Wu, H.; Wang, Y.; Yin, J.; Yang, Y.; Lv, Y.; et al. AccidentBlip2: Accident Detection with Multi-View MotionBlip2. arXiv 2024, arXiv:2404.12149. [Google Scholar]
  57. Wang, T.; Kim, S.; Wenxuan, J.; Xie, E.; Ge, C.; Chen, J.; Li, Z.; Luo, P. DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5599–5606. [Google Scholar] [CrossRef]
  58. Arefeen, M.A.; Debnath, B.; Uddin, M.Y.S.; Chakradhar, S. ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based Video Analysis System. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2266–2274. [Google Scholar]
  59. Piadyk, Y.; Rulff, J.; Brewer, E.; Hosseini, M.; Ozbay, K.; Sankaradas, M.; Chakradhar, S.; Silva, C. StreetAware: A High-Resolution Synchronized Multimodal Urban Scene Dataset. Sensors 2023, 23, 3710. [Google Scholar] [CrossRef]
  60. Kossmann, F.; Wu, Z.; Lai, E.; Tatbul, N.; Cao, L.; Kraska, T.; Madden, S. Extract-Transform-Load for Video Streams. Proc. VLDB Endow. 2023, 16, 2302–2315. [Google Scholar] [CrossRef]
  61. Abu Tami, M.; Ashqar, H.I.; Elhenawy, M.; Glaser, S.; Rakotonirainy, A. Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events. Vehicles 2024, 6, 1571–1590. [Google Scholar] [CrossRef]
  62. Malla, S.; Choi, C.; Dwivedi, I.; Choi, J.H.; Li, J. Drama: Joint risk localization and captioning in driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 1043–1052. [Google Scholar]
  63. Duan, Z.; Cheng, H.; Xu, D.; Wu, X.; Zhang, X.; Ye, X.; Xie, Z. Cityllava: Efficient fine-tuning for vlms in city scenario. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7180–7189. [Google Scholar]
  64. Liu, H.; Li, C.; Li, Y.; Li, B.; Zhang, Y.; Shen, S.; Lee, Y.J. Llava-Next: Improved Reasoning, OCR, and World Knowledge. 2024. Available online: https://llava-vl.github.io/blog/2024-01-30-llava-next/ (accessed on 1 January 2024).
  65. Wu, C.; Gan, Y.; Ge, Y.; Lu, Z.; Wang, J.; Feng, Y.; Luo, P.; Shan, Y. Llama pro: Progressive llama with block expansion. arXiv 2024, arXiv:2401.02415. [Google Scholar]
  66. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  67. Wang, H.; Qin, J.; Bastola, A.; Chen, X.; Suchanek, J.; Gong, Z.; Razi, A. VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation. arXiv 2024, arXiv:2403.12415. [Google Scholar]
  68. Ashqar, H.I.; Jaber, A.; Alhadidi, T.I.; Elhenawy, M. Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing. arXiv 2024, arXiv:2409.18286. [Google Scholar]
  69. Santhosh, K.K.; Dogra, D.P.; Roy, P.P. Anomaly detection in road traffic using visual surveillance: A survey. Acm Comput. Surv. (CSUR) 2020, 53, 1–26. [Google Scholar] [CrossRef]
  70. Bogdoll, D.; Nitsche, M.; Zöllner, J.M. Anomaly detection in autonomous driving: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4488–4499. [Google Scholar]
  71. Chew, J.V.L.; Asli, M.F. A survey on vehicular traffic flow anomaly detection using machine learning. In Proceedings of the ITM Web of Conferences, Marrakech, Morocco, 20–24 November 2024; Volume 63, p. 01023. [Google Scholar]
  72. El Manaa, I.; Benjelloun, F.; Sabri, M.A.; Yahyaouy, A.; Aarab, A. Road traffic anomaly detection: A survey. In Proceedings of the International Conference on Digital Technologies and Applications, Fez, Morocco, 28–30 January 2022; pp. 772–781. [Google Scholar]
Figure 1. Common architecture of anomaly detection systems in traffic surveillance and its possible integration into a smart city framework. The smart city provides the video sequences and data to the anomaly detection system, previously trained to detect those anomalies; the system then automatically warns the corresponding services, indicating the detected anomaly and its degree of seriousness, or manages the smart city's actuators accordingly.
Figure 2. Hierarchical graph presenting the methods of this review.
Figure 3. Proposed method applied to real-world data. (a) Top view of an urban road. (b) Normal trajectories. (c) Anomalous trajectories detected in the test set. From [13].
Figure 4. Different datasets: (a) I5SIM, (b) I5, (c) CROSS, (d) LABOMNI. From [15].
Figure 5. Anomalous frames identified and anomalous words localized by the algorithm. In the current implementation, the anomalous event is highlighted in the test documents, as shown above. From [16].
Figure 6. The flow diagram of the proposed anomaly detection approach. The model processes two types of videos: anomaly videos (positive bags) and normal videos (negative bags). Each video is split into 32 temporal segments, where each segment serves as an individual instance within a “bag” representation. These segments are passed through a pre-trained 3D convolutional neural network (C3D) to extract spatio-temporal features. The extracted features are then input to a fully connected neural network, which predicts an anomaly score for each segment. During training, the network applies a multiple instance learning (MIL) approach: it compares the segment with the highest anomaly score in the positive bag (highlighted in red) against the highest-scoring segment in the negative bag. A novel ranking loss function is used, incorporating sparsity and smoothness constraints, to encourage the model to identify anomalous segments accurately while maintaining temporal consistency. From [17].
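The ranking objective summarized in the caption of Figure 6 can be sketched in a few lines: a hinge loss separates the top-scoring segment of the anomalous bag from that of the normal bag, with sparsity and smoothness penalties on the anomalous bag. The margin of 1 and the λ weights below are illustrative choices rather than the exact hyperparameters of [17].

```python
import torch

def mil_ranking_loss(scores_pos, scores_neg, lambda_smooth=8e-5, lambda_sparse=8e-5):
    """Hinge ranking loss between the top-scoring segments of an anomalous (positive)
    and a normal (negative) video, plus temporal smoothness and sparsity terms on the
    positive bag. scores_* are 1-D tensors of per-segment anomaly scores in [0, 1]."""
    ranking = torch.clamp(1.0 - scores_pos.max() + scores_neg.max(), min=0.0)
    smoothness = ((scores_pos[1:] - scores_pos[:-1]) ** 2).sum()
    sparsity = scores_pos.sum()
    return ranking + lambda_smooth * smoothness + lambda_sparse * sparsity

# Example with 32 segment scores per video, matching the bag size in Figure 6.
pos = torch.rand(32, requires_grad=True)
neg = torch.rand(32, requires_grad=True)
loss = mil_ranking_loss(pos, neg)
loss.backward()
```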
Figure 7. The candidate selection stage of the proposed method. The blue dots represent objects of interest, whereas the red dots represent misclassified objects. From [18].
Figure 8. Overview of the Light-WVAD framework for weakly supervised video anomaly detection. The model uses a multiple instance learning (MIL) setup, where videos are split into 32 segments and grouped into positive (abnormal) or negative (normal) bags. Features are extracted using a pre-trained I3D network and processed by a multi-level temporal attention (MTA) module to capture time-based patterns. These are then passed through a fully connected layer to generate anomaly scores. An adaptive instance selection (AIS) method picks the most informative segments from each bag to calculate the final loss. From [32].
Figure 9. The structure of the proposed model for predicting the level of accident severity. From [35].
Figure 10. An illustration of the scenario where UCSD Ped2, ShanghaiTech, and CUHK Avenue are used as source datasets to learn a distance function from. Best viewed in color. From [39].
Figure 11. Overview of the video frame prediction network pipeline. The model takes a sequence of frames as input and generates the predicted next frame using a U-Net-based generator. The prediction is supervised through multiple loss functions: intensity loss and gradient loss (between the predicted next frame and the real next frame), and optical flow loss computed by a pre-trained FlowNet model. FlowNet estimates the optical flow between both the predicted and the real frames, enforcing motion consistency. A discriminator is further used to differentiate between real and predicted frames, following an adversarial training scheme. From [40].
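For prediction-based approaches such as the one in Figure 11, the anomaly score at test time is commonly derived from the PSNR between the predicted and real frames, normalized per video so that poorly predicted frames receive high scores. The sketch below illustrates only this scoring step and assumes frames normalized to [0, 1]; the generator, discriminator, and FlowNet losses themselves are omitted.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted and a real frame."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-8))

def anomaly_scores(psnr_per_frame):
    """Min-max normalize PSNR over a video and invert it: low PSNR (a poorly
    predicted frame) maps to a high anomaly score."""
    p = np.asarray(psnr_per_frame, dtype=np.float64)
    normalized = (p - p.min()) / (p.max() - p.min() + 1e-8)
    return 1.0 - normalized
```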
Figure 12. Training pipeline of spatio-temporal adversarial networks (STANs). The input frame sequence is first encoded by a spatial encoder to extract semantic features. These features are processed by a forward ConvLSTM and a combined ConvLSTM to model short- and long-term temporal dependencies. The output is decoded by a spatial decoder to generate the predicted frame. A spatio-temporal discriminator distinguishes between real and generated sequences to enforce realistic spatial structure and temporal consistency. From [42].
Figure 13. Localization and visualization results of the abnormal events on UCSD Ped1 (first row), UCSD Ped2 (second row), and Avenue (third row). (a) Real frame, (b) generated frame, (c) abnormality visualization by the generator, and (d) abnormality visualization by the discriminator. From [42].
Figure 14. The framework of the model. From [44].
Figure 15. Pipeline of the proposed approach for abnormal behavior detection. A simple pre-processing stage extracts the frames used to train the proposed model. The corresponding training feedback for updating parameters and optimizing the model to extract the best feature representation of the frames is also shown. Finally, the trained model is applied to different images to validate its output. From [47].
Figure 16. The AVACA architecture. From [48].
Figure 17. The online frame-level VAD architecture. f[t] is the frame at time t, x is the output of the reducer, NF is the number of frames input to the VST, and s[t] is the anomaly classification score of the frame f[t]. From [52].
Figure 18. Overall inference process of the proposed anomaly detector using text descriptions. From [54].
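A minimal version of the text-similarity scoring illustrated in Figure 18 can be put together with a public CLIP checkpoint: frame embeddings are compared against embeddings of predefined normal and anomalous descriptions, and the similarity gap acts as the anomaly score. The prompts below are illustrative placeholders rather than the descriptions used in [54].

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

normal_texts = ["cars driving normally on a road", "pedestrians walking on a sidewalk"]
anomalous_texts = ["a car crash on the road", "a vehicle driving against traffic"]

def frame_anomaly_score(frame: Image.Image) -> float:
    """Similarity gap between anomalous and normal text prompts for one frame."""
    inputs = processor(text=normal_texts + anomalous_texts, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the similarity of the frame to each text prompt.
    sims = outputs.logits_per_image.squeeze(0)
    normal_sim = sims[:len(normal_texts)].max()
    anomalous_sim = sims[len(normal_texts):].max()
    return (anomalous_sim - normal_sim).item()
```

Frames with a large positive score would then be flagged, with the decision threshold calibrated on normal footage.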
Figure 19. Architecture of MotionBlip2. From [56].
Figure 20. Overview of the workflow of our video-to-text conversion algorithm. Initially, a lightweight VLM is employed to generate a text description of a video clip. Subsequently, a heavyweight VLM is employed, utilizing the initial description as a prompt to generate additional information while limiting the output token count. From [58].
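The two-stage prompting strategy in Figure 20 can be summarized as a small wrapper. Here, light_vlm and heavy_vlm are hypothetical callables standing in for the lightweight and heavyweight models, since the actual interfaces depend on the models deployed; the prompts and token limit are illustrative assumptions.

```python
def describe_clip_two_stage(frames, light_vlm, heavy_vlm, max_tokens=60):
    """Two-stage video-to-text sketch: a lightweight VLM drafts a caption, and a
    heavier VLM refines it using the draft as a prompt while capping output length.
    Both callables are assumed to take (frames, prompt, max_tokens) and return text."""
    draft = light_vlm(frames, prompt="Describe the traffic scene briefly.",
                      max_tokens=max_tokens)
    refined = heavy_vlm(frames,
                        prompt=f"Initial description: {draft}. "
                               "Add any safety-relevant details that are missing.",
                        max_tokens=max_tokens)
    return refined
```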
Figure 21. Overview of CityLLaVA. The method is anchored on the pre-trained LLaVA-1.6-34B [64] equipped with block expansion [65], combining the textual prompt engineering and visual prompt engineering guided by bounding boxes. From [63].
Table 1. Summary of the various traffic anomaly detection methods based on insights from reviewed studies.

ML [13,14,16]
Specifications: events detected from trajectories; feature extraction; unsupervised learning; anomalies identified through trajectory analysis.
Advantages: automation; adaptable; detects subtle anomalies.
Disadvantages: sensitivity to parameters; need for large datasets; complex results; complex algorithms.
Applications: vehicle tracking; traffic flow analysis; early warning systems.

CNN [17,18,20,27,30,31,32,34,35,36,39]
Specifications: layered architecture; convolutional layers for local patterns, filters, and feature maps; pooling; spatio-temporal features; anomaly detection from deviations.
Advantages: high performance; robust; machine learning; versatility.
Disadvantages: need for large datasets; complexity; black-box behavior; performance depends on data quality.
Applications: surveillance video analysis; traffic signal control; accident detection.

GAN [40,42,43,44]
Specifications: GAN architecture; generator predicts normal sequences; constraints added to increase robustness; at inference, anomalies identified as large prediction differences.
Advantages: generation of realistic synthetic data for training; detection of a variety of anomalies; robustness.
Disadvantages: instability of the GAN architecture; large training datasets needed; complex evaluation.
Applications: rare event simulation; real-time anomaly prediction; data augmentation.

Transformers [45,46,47,48,49,52]
Specifications: self-attention mechanism analyzes relationships between elements; encoder–decoder architecture; global relationships between elements; fewer annotations required.
Advantages: capture complex relationships between events; superior performance; higher efficiency through parallelism; broad applicability.
Disadvantages: complexity; need for large datasets; complex interpretability.
Applications: multi-camera tracking; real-time traffic incident detection; predictive modeling.

MLLM [54,56,58,61,63,67,68]
Specifications: combination of LLM and VLM to analyze data; three components: modality encoder, LLM backbone, and interface connector; learning methods depend on training; integration with object detection systems to improve accuracy and contextual understanding.
Advantages: ability to understand context from multiple sources and perform deeper analysis; highly scalable; better performance; logical and visual reasoning.
Disadvantages: training and inference computationally expensive; limitations in understanding order and complex relationships; possible incorrect outputs or “hallucinations”; limited generalization.
Applications: multimodal urban surveillance; situation-aware traffic reporting; decision support systems.
