Article

Survey on Image-Based Vehicle Detection Methods

by Mortda A. A. Adam * and Jules R. Tapamo
School of Engineering, University of KwaZulu-Natal, Howard College Campus, Durban 4041, South Africa
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2025, 16(6), 303; https://doi.org/10.3390/wevj16060303
Submission received: 6 April 2025 / Revised: 23 May 2025 / Accepted: 25 May 2025 / Published: 29 May 2025
(This article belongs to the Special Issue Vehicle Safe Motion in Mixed Vehicle Technologies Environment)

Abstract

Vehicle detection is essential for real-world applications such as road surveillance, intelligent transportation systems, and autonomous driving, where high accuracy and real-time performance are critical. However, achieving robust detection remains challenging due to scene complexity, occlusion, scale variation, and varying lighting conditions. Over the past two decades, numerous studies have been proposed to address these issues. This study presents a comprehensive and structured survey of image-based vehicle detection methods, systematically comparing classical machine learning techniques based on handcrafted features with modern deep learning approaches. Deep learning methods are categorized into one-stage detectors (e.g., YOLO, SSD, FCOS, CenterNet), two-stage detectors (e.g., Faster R-CNN, Mask R-CNN), transformer-based detectors (e.g., DETR, Swin Transformer), and GAN-based methods, highlighting architectural trade-offs concerning speed, accuracy, and practical deployment. We analyze widely adopted performance metrics from recent studies, evaluate characteristics and limitations of popular vehicle detection datasets, and explicitly discuss technical challenges, including domain generalization, environmental variability, computational constraints, and annotation quality. The survey concludes by clearly identifying open research challenges and promising future directions, such as efficient edge deployment strategies, multimodal data fusion, transformer-based enhancements, and integration with Vehicle-to-Everything (V2X) communication systems.

1. Introduction

Real-world applications in domains such as automated parking management, traffic monitoring, and intelligent transportation systems (ITS) utilize vehicle detection to enhance operational efficiency and road safety. These technologies optimize traffic flow and support effective transport system management [1,2,3]. The need for reliable vehicle detection has grown due to increasing traffic congestion, rising accident rates, and the widespread deployment of surveillance infrastructure such as road-mounted cameras [4]. In addition to enabling real-time traffic analysis, vehicle detection supports public safety functions, including detecting traffic violations such as speeding, unauthorized vehicle movement, red-light infractions, and helmet non-compliance [5]. Beyond ITS, the rapid development of autonomous driving technologies has further intensified the demand for high-performance vehicle detection systems. These systems serve as critical components for perception and decision-making pipelines, enabling vehicles to accurately interpret dynamic road environments. For instance, integrated frameworks that combine vehicle detection with deep reinforcement learning have shown significant improvements in high-speed cruising safety and responsiveness in autonomous navigation [6]. Similarly, real-time detection models such as YOLOv4-5D and YOLO-EfficientDet hybrids have demonstrated improved accuracy and computational efficiency for real-world deployment in autonomous vehicles [7,8]. Dedicated solutions for tasks like traffic sign detection and general object localization have further contributed to reliable autonomous driving behavior [9,10].
While existing reviews have broadly addressed object detection and traffic surveillance, this survey adopts a more application-oriented perspective by categorizing vehicle detection methods according to their suitability for real-time operation, edge deployment, and integration within autonomous driving pipelines. It unifies classical and modern deep learning approaches, emphasizing speed, accuracy, and hardware efficiency trade-offs, and it highlights their relevance to emerging paradigms such as cooperative V2X systems and edge–cloud architectures. Recent research by Liang et al. [11] proposed a MAS-based hierarchical control architecture for connected and automated vehicles, underscoring the importance of accurate vehicle detection as a foundational layer for cooperative perception and decision-making. Special attention is given to the evolution of lightweight, high-throughput detection frameworks, including one-stage detectors, Transformer-based attention models, and recent architectural innovations. The survey also synthesizes evaluation metrics, dataset characteristics, and practical benchmarks to support informed model selection across diverse deployment scenarios.
An overview of the general workflow for image-based vehicle detection is shown in Figure 1, which outlines the key stages from dataset preprocessing to model training and evaluation. This logical framework contextualizes the typical operations performed during detection model development. In practice, a vehicle detection system processes visual input from onboard or roadside sensors to identify and classify vehicles. Figure 2 illustrates this process, where an object detection model takes a raw image and produces structured detections comprising bounding boxes, associated class labels, and confidence scores for detected vehicles. This output supports downstream applications such as traffic monitoring and autonomous decision-making. Recent research has introduced attention mechanisms, optimized deep learning architectures, and edge-based inference systems to improve detection performance under real-world conditions [12,13,14,15]. The growing complexity of traffic scenes—due to occlusion, varying vehicle sizes, diverse shapes, and lighting conditions—has made vision-based detection a central focus of intelligent transportation research. Camera-based detection is particularly popular among sensor modalities due to its cost-effectiveness and high-resolution data [16]. Convolutional Neural Networks (CNNs) and other deep learning models are widely adopted for their ability to learn robust visual features directly from raw input, thus offering significant advantages over traditional handcrafted methods. Recent studies have explored hybrid object detection architectures to combine the strengths of two-stage and one-stage detectors, aiming to balance detection accuracy and inference speed. Soviany et al. [17] proposed an optimization framework that systematically tunes trade-offs between single-stage and two-stage detectors. Moreover, the emergence of RT-DETR [18,19] demonstrates that end-to-end Transformer-based models with efficient attention can rival traditional detectors in real-time performance. These models utilize sparse query-based detection mechanisms that simplify post-processing and enhance scalability. Additionally, combining YOLOv4 with EfficientDet has been shown to boost both precision and inference speed for autonomous vehicle applications [8]. Future research could investigate such hybrid paradigms, including prompt-based Transformers and adaptive attention modules, to develop scalable models suitable for edge deployment in dynamic environments.
Earlier vehicle detection methods relied on classical machine learning techniques such as Scale-Invariant Feature Transform (SIFT), Haar-like features [20], and Histogram of Oriented Gradients (HOG), which require manual feature engineering [21]. Although effective in some scenarios, these methods generally suffer from high computational overhead and lack scalability for real-time use. Deep learning models, in contrast, can automatically learn multi-level features from large datasets, thereby improving accuracy and reducing the need for manual intervention [22].
Deep learning-based vehicle detection methods can be broadly categorized into one-stage and two-stage detectors. One-stage models such as the YOLO (You Only Look Once) series—from YOLOv1 to YOLOv12—directly predict bounding boxes and class labels in a single pass, making them ideal for real-time applications [23,24,25,26,27]. Recent anchor-free one-stage models such as FCOS and CenterNet have emerged to simplify detection pipelines and improve performance under occlusion and scale variation. In contrast, two-stage detectors such as R-CNN, Fast R-CNN, and Faster R-CNN first generate region proposals and then perform classification and refinement [28,29,30], typically achieving higher accuracy but at the cost of increased computation. This paper comprehensively reviews vehicle detection methods, tracing the evolution from classical machine learning techniques to modern deep learning architectures. It also surveys benchmark datasets, evaluation metrics (e.g., mAP, IoU, FPS), and recent performance results across multiple scenarios. Furthermore, the paper discusses key challenges in real-world deployment, including scene complexity, environmental variability, and edge-device limitations. It concludes with future directions such as V2X integration, multimodal fusion, and model generalization strategies.

2. Vehicle Detection Methods

Vehicle detection is the task of identifying the location and category of vehicles within an input image or video (see Figure 2). It is essential for numerous intelligent transportation applications, including traffic monitoring, driver assistance technologies, parking management, and autonomous driving systems. This section discusses the approaches used in the existing literature for vehicle detection; the following subsections provide details on each category.

2.1. Classical Vehicle Detection Methods

Classical vehicle detection methods, widely adopted prior to the deep learning era, are based on traditional machine learning pipelines that involve handcrafted feature extraction followed by classification. While these approaches were effective in early research, they have been largely supplanted by deep learning methods due to their limited scalability, poor generalization in complex environments, and high sensitivity to noise and occlusion. Figure 3 illustrates the general processing pipeline employed by classical vehicle detection systems.

2.1.1. Feature Extraction Techniques

  • Haar-like Features
    Haar-like features are simple rectangular features used to detect edges, lines, and other visual patterns by computing the difference in pixel intensities between adjacent regions [31]. These features are particularly effective for capturing structural cues in vehicle shapes. The Haar-like feature F is calculated as follows:
    F = S_W - S_B
    S_W = ∑_{(x,y) ∈ W} I(x, y),   S_B = ∑_{(x,y) ∈ B} I(x, y)
    where I(x, y) denotes the pixel intensity at coordinate (x, y), and W and B represent the white and black rectangular regions, respectively. These features are evaluated across different scales and positions using an integral image for efficient computation. They are often combined with classifiers such as AdaBoost in early vehicle detection systems. Figure 4 illustrates typical Haar-like patterns used in vehicle detection tasks.
  • HOG Feature Extraction
    In 2005, Dalal and Triggs [32] introduced the Histogram of Oriented Gradients (HOG), a descriptor that captures local object appearance and shape by encoding the distribution of intensity gradients or edge directions. HOG is particularly effective for detecting rigid objects such as vehicles, where edge and contour structures are prominent. The image is divided into small cells, and each cell’s histogram bins gradient orientations into predefined angle ranges (e.g., 0–180°), weighted by gradient magnitude. First, gradient magnitude and orientation are computed as follows:
    M = √(G_x^2 + G_y^2),   θ = arctan(G_y / G_x)
    where G_x and G_y are the horizontal and vertical gradients, respectively.
    The image is then divided into cells, and orientation histograms are computed for each. These histograms are normalized over larger spatial blocks to improve invariance to illumination and contrast. Block normalization of a feature vector v using the L2-norm is defined as follows:
    v_n = v / ||v||_2
    HOG features are widely used in vehicle detection systems due to their robustness in capturing structural information while being relatively efficient to compute.
  • LBP Feature Extraction
    Local Binary Patterns (LBP) encode local texture patterns by comparing each pixel to its surrounding neighbors, which makes the descriptor robust to lighting variations. However, LBP is generally less effective in scenes with complex backgrounds or non-uniform textures [33].
    The LBP descriptor is defined as follows:
    LBP_{P,R} = ∑_{p=0}^{P-1} s(i_p - i_c) · 2^p
    where i_c is the intensity of the center pixel, i_p denotes the intensity of the p-th neighboring pixel, P is the number of sampling points, and R is the radius of the neighborhood. The thresholding function s(x) is given by the following:
    s(x) = 1 if x ≥ 0, and 0 if x < 0
    The resulting binary values are combined into a single integer, forming a compact descriptor that captures local texture information. LBP is computationally efficient and suitable for real-time vehicle detection in well-lit environments.
  • SIFT Detection
    SIFT extracts local features from input images, ensuring invariance to scale and orientation. It identifies key points, such as edges and corners, using a Difference-of-Gaussian (DoG) filter across multiple scales, then refines these key points and computes descriptors based on gradient orientation histograms within their neighborhoods [34]. The resulting descriptors remain invariant to scale, rotation, and illumination changes. The initial step involves computing the Difference of Gaussians (DoG) to detect potential keypoints:
    DoG(x, y, σ) = (G(x, y, kσ) - G(x, y, σ)) ∗ I(x, y)
    where G(x, y, σ) denotes a Gaussian kernel with standard deviation σ, k is a scale multiplier, and I(x, y) is the input image. Following the DoG computation, SIFT performs keypoint localization, assigns orientations based on local gradient distributions, and generates a descriptor vector that encodes the spatial structure of gradients around each keypoint. These descriptors make SIFT particularly effective for detecting vehicles across varying scales, perspectives, and lighting conditions. While SIFT is computationally intensive, it remains influential for scale-invariant detection tasks. A short extraction sketch covering several of these descriptors follows this list.
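As a brief illustration of how these handcrafted descriptors are obtained in practice, the following sketch computes a two-rectangle Haar-like feature via an integral image and extracts HOG, LBP, and SIFT features using the OpenCV and scikit-image reference implementations. The image path, rectangle coordinates, and parameter values are illustrative placeholders (typical defaults), not settings taken from any cited study.

```python
import cv2
import numpy as np
from skimage.feature import hog, local_binary_pattern

# Hypothetical grayscale image patch of a candidate vehicle region.
gray = cv2.imread("vehicle_patch.png", cv2.IMREAD_GRAYSCALE)

# Haar-like two-rectangle feature F = S_W - S_B, computed with an integral image.
ii = cv2.integral(gray)  # (H+1, W+1) summed-area table

def rect_sum(x, y, w, h):
    """Sum of pixel intensities inside the rectangle with top-left corner (x, y)."""
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])

F = rect_sum(0, 0, 16, 16) - rect_sum(16, 0, 16, 16)  # white region minus black region

# HOG: per-cell orientation histograms, L2-normalized over 2x2 blocks.
hog_vec = hog(gray, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2), block_norm="L2")

# LBP: 8-neighbour pattern with radius 1, summarized as a normalized histogram.
lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=np.arange(0, 11), density=True)

# SIFT: scale-invariant keypoints and 128-D gradient-orientation descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
```

In a classical pipeline, vectors such as hog_vec or lbp_hist would then be passed to one of the classifiers described in the next subsection.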

2.1.2. Classical Classification Algorithms

  • Support Vector Machine (SVM)
    The core concept of SVM is to identify an optimal hyperplane in a multidimensional feature space that best separates data points belonging to different classes [35]. This hyperplane serves as a decision boundary, partitioning the feature space into regions predominantly associated with one class or another. In vehicle detection, SVM is commonly trained to distinguish between vehicles (positive class) and non-vehicles (negative class).
    The optimal hyperplane is chosen to maximize the margin—the distance between the hyperplane and the nearest data points from each class (support vectors). When data are not linearly separable, SVM applies the kernel trick to map the input data into a higher-dimensional space where linear separation becomes feasible [36]. This enables SVM to handle non-linear classification tasks effectively. Variants of SVM include LSVM (Least Squares SVM), NLSVM (Non-Linear SVM), SSVM (Structural SVM), and NSVM (Normalized SVM), each offering specific adaptations for different data characteristics. A linear decision function can be expressed as follows:
    f(I) = w · I + b
    where w is the weight vector, I is the input feature vector, and b is the bias term.
  • Adaptive Boosting (AdaBoost)
    AdaBoost is a powerful ensemble learning algorithm widely used in object detection, image classification, and other pattern recognition tasks [37]. The key idea is to combine multiple weak classifiers—each performing only slightly better than random guessing—into a single strong classifier with high accuracy. AdaBoost operates in a sequential manner, training each weak learner on the weighted version of the dataset. After each iteration, the weights of the misclassified examples are increased, forcing the next weak learner to focus more on the harder examples [38]. Correctly classified samples are assigned lower weights, reducing their influence in subsequent iterations. This adaptive re-weighting process ensures that the final ensemble is more focused on the difficult cases. Even if individual weak classifiers perform poorly, the combined model weighted by the accuracy of each learner can achieve excellent performance [39]. The sample weight update rule is the following:
    D_{t+1}(i) = D_t(i) · exp(-α_t y_i h_t(x_i)) / Z_t
    where D_t(i) is the weight of sample i at iteration t, y_i is the true label, h_t(x_i) is the prediction from the weak learner, and Z_t is a normalization factor.
    Each weak learner is assigned a weight based on its classification error:
    α_t = (1/2) ln((1 - ε_t) / ε_t)
    where ε_t is the error rate of the t-th weak classifier.
    The final strong classifier is defined as follows:
    H(x) = sign(∑_{t=1}^{T} α_t h_t(x))
    where H(x) is the final strong classifier, T is the total number of weak classifiers, α_t is the weight assigned to the t-th weak classifier based on its accuracy, h_t(x) is the prediction of the t-th weak classifier, and sign(·) is the sign function that returns +1 if the argument is positive and -1 otherwise. A compact implementation sketch of this boosting loop is given after this list.
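The following sketch implements this boosting loop directly from the equations above, using depth-1 decision trees from scikit-learn as weak learners. It is a minimal illustration on generic feature vectors (for instance, HOG descriptors labeled vehicle/non-vehicle as +1/-1), not a reproduction of any specific system cited below.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    """Minimal AdaBoost with decision stumps; y must contain labels in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                                    # initial weights D_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)  # weighted error rate eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)                  # learner weight alpha_t
        D = D * np.exp(-alpha * y * pred)                      # re-weighting rule for D_{t+1}(i)
        D = D / D.sum()                                        # normalization by Z_t
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Final strong classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
    agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(agg)

# Toy usage: 200 random 10-D feature vectors with a simple separable labeling rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost_train(X, y, T=25)
print((adaboost_predict(stumps, alphas, X) == y).mean())       # training accuracy
```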
Table 1 and Table 2 show some studies that used classical vehicle detection techniques. They provide information about the feature extraction methods and classifiers involved and highlight the performance achieved by each of the proposed systems.

2.1.3. Motion-Based Vehicle Detection

Motion-based vehicle detection focuses on the temporal dynamics of objects in video sequences, using changes in pixel intensity or motion vectors to identify vehicles. Unlike appearance-based methods that rely on static features such as edges, color, or texture, motion-based approaches leverage the relative movement between the camera and vehicles to segment and track objects of interest. These techniques are particularly useful in environments where appearance features are unreliable due to occlusion, poor lighting, or low resolution.
  • Techniques and Algorithms
    Several classical algorithms implement motion-based detection. Lefaix et al. [49] proposed a system that models dominant image motion, typically resulting from the ego-motion of a moving camera, and detects vehicles as motion outliers. This method also enables time-to-collision (TTC) estimation by analyzing the divergence of flow vectors. Some recent approaches explore the use of optical flow and epipolar geometry constraints to enhance motion sensitivity for detecting overtaking vehicles or lane changes. Techmer [50] introduced a real-time segmentation algorithm that tracks motion along pre-defined lane contours. This method simplifies the segmentation task by focusing only on relevant motion direction, making it both efficient and resistant to lighting variation.
  • Applications and Advantages
    Motion-based techniques are widely used in:
    - Speed estimation and violation detection in traffic surveillance [51].
    - Lightweight embedded systems with limited computational capacity.
    - Challenging environments such as nighttime driving or adverse weather conditions.
    These methods typically require fewer computational resources and are more resilient to occlusion and appearance variation than purely appearance-based approaches. For example, Jazayeri et al. [52] applied a Hidden Markov Model (HMM) to classify motion traces over time, achieving reliable vehicle detection in complex, cluttered scenes.
  • Challenges
    Despite their advantages, motion-based approaches have several limitations:
    - Susceptibility to false positives caused by camera shake or ego-motion instability.
    - Difficulty separating multiple vehicles moving in the same direction or at similar speeds.
    - Ineffectiveness in detecting stationary or parked vehicles.
Mitigating these issues often involves incorporating robust background modeling, such as a Mixture of Gaussians (MOG), or using contour-based tracking to stabilize the segmentation [51]. Deep learning models have since shown superior performance by learning hierarchical features directly from raw data, eliminating the need for manual feature design. Techniques such as CNNs and Transformers not only improve precision but also enhance robustness in diverse and complex environments. Therefore, while classical approaches are valuable for understanding the historical evolution of vehicle detection, they are no longer competitive for modern real-time applications.
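As a concrete example of this category, the sketch below combines OpenCV's Mixture-of-Gaussians background subtractor (MOG2) with simple contour filtering to localize moving vehicles in a video stream. The video path, area threshold, and filter sizes are scene-dependent placeholders; this is a minimal illustration of the background-modeling idea, not a complete surveillance system.

```python
import cv2

cap = cv2.VideoCapture("traffic.mp4")  # hypothetical roadside or dash-cam clip
# Mixture-of-Gaussians background model with shadow detection enabled.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                                  # foreground mask
    mask = cv2.medianBlur(mask, 5)                                  # suppress isolated noise
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)      # drop shadow pixels (value 127)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 800:                                # ignore small blobs
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("motion-based detection", frame)
    if cv2.waitKey(1) == 27:                                        # Esc to stop
        break

cap.release()
cv2.destroyAllWindows()
```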
Table 2. Categorization of classical vehicle detection methods by feature type and technique. Appearance-based methods focus on object texture and shape; motion-based methods rely on temporal changes; hybrid approaches combine multiple cues or heuristic rules.
Category | Representative Techniques | References
Appearance-Based | HOG, LBP, SIFT, Haar + AdaBoost, Viola-Jones | [40,41,43,44]
Motion-Based | Background subtraction, Optical flow, Contour tracking, HMM | [42,47,49,50,52]
Hybrid/Heuristic | Pixel-based analysis, Hough + Haar-like, Shadow filtering | [46,48]

2.1.4. Summary of Classical Vehicle Detection Methods

Classical vehicle detection methods are based on sequential pipelines that combine handcrafted feature extraction with traditional machine learning classifiers. These approaches are typically grouped into three categories: appearance-based techniques (e.g., HOG, LBP, SIFT), motion-based methods (e.g., background subtraction, optical flow), and heuristic or hybrid strategies that incorporate rule-based logic. While these methods laid the groundwork for automated vehicle detection, they suffer from several limitations in real-world applications. Their performance is often hindered by sensitivity to lighting changes, occlusion, and background complexity. Furthermore, the reliance on manually designed features results in limited adaptability to variations in vehicle size, orientation, and environmental conditions. In terms of computational efficiency and detection accuracy, classical approaches generally fall short compared to modern alternatives. Their constrained scalability and slower inference times make them less suitable for real-time deployment, especially in dynamic traffic scenarios. These limitations have driven a shift toward deep learning-based methods, which automatically learn hierarchical feature representations from data and offer greater robustness and generalization [53]. The following section explores these modern techniques in detail, highlighting their advantages and recent developments in vehicle detection.

2.2. Deep Learning-Based Vehicle Detection Methods

Deep learning-based vehicle detection methods have transformed vision-based systems by leveraging convolutional neural networks (CNNs) to automate feature extraction and object recognition. Deep learning, a subfield of machine learning inspired by the hierarchical processing of the human brain, learns abstract representations from raw data through multiple nonlinear layers [54]. Figure 5 illustrates a typical deep learning workflow for object detection.
While deep models require large annotated datasets and significant computational resources for training, they deliver strong generalization performance and fast inference once deployed. Several models have shown outstanding results in vehicle detection tasks [55,56,57,58].

2.2.1. One-Stage Detectors

One-stage detectors such as YOLO [23,24,25,26,27,59], SSD [60], and RetinaNet [61] perform detection in a single feed-forward pass. These models are designed for speed and efficiency, making them ideal for real-time vehicle detection. Their architecture typically includes a backbone for feature extraction, a neck for multi-scale feature fusion, and a detection head for simultaneous classification and localization. Several YOLO-based methods improved anchor box initialization by replacing the traditional k-means clustering with more effective variants. For example, in [62,63], the k-means++ algorithm was used to generate anchor boxes that better match the distribution of vehicle sizes in the training dataset, enhancing localization accuracy and convergence speed. Furthermore, ref. [64] introduced a refined method called Rk-means++, designed to accommodate class imbalance and varying object scales, resulting in improved multi-scale detection performance. These adaptive clustering strategies enable more accurate bounding box proposals tailored to diverse vehicle dimensions in real-world scenarios. Figure 6 depicts a general one-stage pipeline.
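To make the anchor clustering idea concrete, the sketch below groups ground-truth box dimensions with k-means++ initialization so that the cluster centres can serve as anchor boxes. The box data are randomly generated placeholders, and plain Euclidean distance is used for brevity, whereas YOLO-style anchor clustering typically uses 1 - IoU as the distance measure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder (width, height) pairs standing in for ground-truth vehicle boxes
# parsed from annotation files and scaled to the network input resolution.
rng = np.random.default_rng(0)
boxes = rng.uniform(low=[20, 15], high=[320, 240], size=(500, 2))

# Cluster box dimensions with k-means++ seeding; each centre becomes one anchor.
kmeans = KMeans(n_clusters=9, init="k-means++", n_init=10, random_state=0).fit(boxes)
anchors = kmeans.cluster_centers_[np.argsort(kmeans.cluster_centers_.prod(axis=1))]
print(np.round(anchors, 1))  # nine (width, height) anchors, sorted by area
```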
  • YOLO Series Algorithms
    Object detection in real-time applications requires high-speed, accurate algorithms. YOLO (You Only Look Once) was introduced to meet real-time processing demands, with YOLOv1 [23] pioneering the first one-stage object detection algorithm. This series, including YOLO versions 2 and 3 [24,25], opened new opportunities for vehicle detection. Significant improvements have continued with YOLOv4 [59], YOLOv5 [65], YOLOv6 [26], YOLOv7 [27], YOLOX [66], PP-YOLOE [67], and the latest versions of YOLO including YOLOv8, YOLOv9, and YOLOv10 [68,69,70]. The YOLO architecture typically comprises the following three key components:
    - Backbone: Responsible for extracting low-level and high-level visual features from the input image.
    - Neck: Connects the backbone to the head and enhances spatial and semantic information across different scales using feature fusion modules such as PANet, BiFPN, or path aggregation blocks.
    - Head: Generates output predictions, including object classification, bounding box regression, instance segmentation, or pose estimation. Non-maximum suppression (NMS) is applied as post-processing to remove redundant overlapping boxes.
    Over the years, the YOLO series has evolved significantly, from YOLOv1 to the most recent YOLOv12 [65,68,69,70,71,72], with progressive improvements in accuracy, speed, and computational efficiency. These advances have made YOLO models increasingly suitable for real-time applications on edge devices. Table 3 and Figure 7 summarize the architectural progression across different YOLO versions.
  • Single Shot MultiBox Detector (SSD)
    The SSD algorithm [60] employs a feed-forward convolutional network to generate a fixed set of bounding boxes and confidence scores for object classes within those boxes. A non-maximum suppression step is then applied to refine the final detections. The SSD algorithm comprises two main components: the first extracts feature maps using the VGG16 network [73], and the second uses convolutional filters for object detection. As a one-stage object detection algorithm, SSD achieved a mean average precision (mAP) of 76.8% on Pascal VOC 2007 at a speed of 19 frames per second (FPS), making it well-suited for real-time object detection due to its faster performance compared to two-stage algorithms. However, its accuracy is lower than that of two-stage detectors such as the Faster R-CNN series. Several lightweight backbone networks [74,75,76] have been developed to enhance the SSD algorithm’s backbone. For instance, Chen et al. [77] optimized the SSD model to improve vehicle detection speed, introducing an attention mechanism and bottom-up feature fusion using a deconvolution module and replacing VGG16 with MobileNetV2 [76] as the backbone for feature extraction. Similarly, Zhang et al. [78] proposed an improved SSD-based model that enhances vehicle detection performance, enabling real-time detection of various types of vehicles. Table 3 provides a summary of different one-stage approaches for vehicle detection tasks.
    SSD [60], originally built on VGG16, has benefited from lightweight backbones like MobileNet [76]. Enhancements such as attention modules and multi-scale feature layers have further improved detection accuracy [77,78]. Anchor-free methods like FCOS and CenterNet [79,80] remove the need for manually designed anchor boxes, simplifying training and significantly enhancing inference speed, an essential requirement for autonomous driving systems that rely on real-time perception. A minimal inference sketch for a one-stage detector is shown after this list.
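As a minimal illustration of one-stage inference, the sketch below runs a pretrained YOLO model on a single image and keeps only vehicle-class detections. It assumes the Ultralytics Python package and a COCO-pretrained checkpoint; the model file and image path are placeholders.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                          # small COCO-pretrained model
results = model("traffic_scene.jpg", conf=0.25)     # single forward pass

for box in results[0].boxes:                        # boxes, classes, confidences
    cls_name = model.names[int(box.cls)]
    if cls_name in {"car", "bus", "truck", "motorcycle"}:   # keep vehicle classes only
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{cls_name}: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), conf={float(box.conf):.2f}")
```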
While CNN-based one-stage detectors dominate real-time tasks, transformer-based models offer improvements in global context modeling, especially in complex urban scenes.

2.2.2. Transformer-Based Methods

Transformer-based architectures such as DETR [81], Deformable DETR [82], and Swin Transformer [83,84] offer global context modeling and strong performance in dense or cluttered scenes. These models leverage attention mechanisms to improve spatial reasoning, making them well-suited for complex urban traffic environments.
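For illustration, the sketch below runs a pretrained DETR model on one image, assuming the Hugging Face transformers implementation and the public facebook/detr-resnet-50 checkpoint; the image path and confidence threshold are placeholders, and the exact API may differ across library versions.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()

image = Image.open("urban_scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)                       # set-based predictions from object queries

# Convert the raw outputs into scored boxes in original image coordinates.
target_sizes = torch.tensor([image.size[::-1]])     # (height, width)
det = processor.post_process_object_detection(outputs, threshold=0.7,
                                               target_sizes=target_sizes)[0]
for score, label, box in zip(det["scores"], det["labels"], det["boxes"]):
    print(model.config.id2label[int(label)],
          [round(v, 1) for v in box.tolist()], round(float(score), 2))
```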

2.2.3. GAN-Based Methods

Generative Adversarial Networks (GANs) enhance vehicle detection under challenging visual conditions such as low light, occlusion, or limited data availability. Models like VAGAN [85], AugGAN [86], and VS-GAN [87] are used to augment datasets, simulate diverse conditions, or restore degraded images, thereby improving the robustness of vehicle detectors. Table 4 summarizes several prominent one-stage vehicle detection methods evaluated across different datasets.

2.2.4. Two-Stage Detectors

Two-stage detectors, including Faster R-CNN and Mask R-CNN, operate in two sequential steps: region proposal and object classification/regression [30,102]. Though more computationally intensive, they are favored for their high accuracy and reliability in scenarios where precision is paramount. Figure 8 depicts the two-stage detection pipeline.
Numerous studies have applied two-stage frameworks for vehicle detection across various datasets. Table 5 provides examples and their reported performance, and Table 6 shows a comparison of deep learning-based vehicle detection methods.
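As a minimal example of two-stage inference, the sketch below applies torchvision's COCO-pretrained Faster R-CNN to one image and filters the output to vehicle-like classes. The image path and score threshold are placeholders; the class indices follow the COCO label map used by torchvision's pretrained detection models.

```python
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = convert_image_dtype(read_image("intersection.jpg"), torch.float)  # CHW, values in [0, 1]
with torch.no_grad():
    prediction = model([img])[0]        # dict with 'boxes', 'labels', 'scores'

# COCO indices for vehicle-like categories: 3 = car, 4 = motorcycle, 6 = bus, 8 = truck.
vehicle_ids = {3, 4, 6, 8}
keep = [i for i, (lbl, s) in enumerate(zip(prediction["labels"], prediction["scores"]))
        if int(lbl) in vehicle_ids and float(s) > 0.5]
print(prediction["boxes"][keep])
```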

2.2.5. Summary of Deep Learning Vehicle Detection Methods

Deep learning-based vehicle detection methods span a diverse set of architectures, each tailored to specific performance requirements and deployment environments. One-stage detectors such as YOLO and SSD offer fast inference and are well-suited for real-time applications, particularly on resource-constrained devices. In contrast, two-stage detectors like Faster R-CNN and Mask R-CNN deliver higher detection accuracy, making them preferable in precision-critical tasks. Transformer-based models enhance global context modeling and spatial reasoning, improving performance in complex or cluttered scenes. GAN-enhanced approaches further boost robustness by augmenting training data or restoring degraded visual inputs. Together, these advances have enabled state-of-the-art vehicle detection systems that generalize well across diverse environments and operational conditions.

3. Application Areas

Vehicle detection serves as a core enabler for numerous modern applications, including traffic control, intelligent transportation systems (ITS), smart city infrastructure, surveillance, and industrial automation. Its significance lies in enabling real-time decision-making, adaptive traffic responses, and enhanced transportation safety and efficiency. In traffic management, vehicle detection supports congestion mitigation through dynamic signal control and optimized route planning. It also enables the timely detection of incidents such as accidents or stalled vehicles, improving emergency response times [88,109]. Detection systems deployed on roadside infrastructure or UAVs offer wide-area monitoring and real-time analytics, enhancing the capabilities of traffic surveillance platforms [110]. In law enforcement and surveillance, vehicle detection is critical for tasks such as automatic number plate recognition (ANPR), speed enforcement, and red-light violation detection. It is increasingly integrated into smart camera networks and tunnel monitoring systems to improve public safety [111,112]. Real-time detection algorithms, including YOLOv4, have demonstrated high accuracy in traffic offense monitoring applications [5]. Within ITS, vehicle detection underpins autonomous driving and advanced driver assistance systems (ADAS). Deep learning-based detectors enable reliable object recognition, allowing autonomous vehicles to perceive their surroundings, avoid collisions, and make safe navigation decisions [113]. These functions rely on real-time detection frameworks that maintain both high accuracy and low latency in dynamic environments. In the context of smart cities, vehicle detection supports intelligent parking management, automated toll collection, and real-time traffic data analytics for urban planning. The resulting data contribute to reduced emissions, improved public transport planning, and more efficient infrastructure use [114,115]. Security applications also benefit from vehicle detection, including restricted zone monitoring, suspicious vehicle tracking, and real-time surveillance in critical infrastructure and sensitive areas [116]. In industrial environments such as warehouses and manufacturing facilities, detection systems improve coordination and safety among autonomous mobile robots and automated guided vehicles (AGVs), ensuring efficient and collision-free operations. Despite its broad applicability, real-world deployment of vehicle detection still presents challenges. One-stage models like YOLO and SSD are commonly used in real-time applications such as UAV-based surveillance and traffic monitoring due to their low latency. In contrast, two-stage models like Faster R-CNN and Mask R-CNN deliver higher accuracy and are preferred in offline processing or safety-critical tasks. Detection performance may degrade under occlusion, low lighting, or adverse weather conditions, highlighting the need for robust and context-aware model selection. Future advancements are expected to focus on integrating multi-modal sensor data (e.g., camera, LiDAR, radar), optimizing models for edge deployment in resource-constrained environments, and adopting transformer-based architectures for improved spatiotemporal reasoning. Furthermore, progress in unsupervised and self-supervised learning could reduce dependence on large labeled datasets, accelerating adoption in emerging smart city and industrial applications.

4. Datasets for Vehicle Detection Models Design

Datasets play a critical role in developing and evaluating object detection models. They provide the training data needed for learning robust features and serve as benchmarks for comparing algorithmic performance [117]. Building high-performing vehicle detection systems requires access to large, diverse, and well-annotated datasets [118]. Several benchmark datasets, such as BDD100K, KITTI, and MS COCO, have been widely adopted to support the advancement of real-time vehicle detection models [62,77]. Table 7 summarizes key characteristics of commonly used datasets in this domain.
  • Pascal VOC [117]: A foundational dataset with 20 object categories, including vehicles like cars, buses, bicycles, and motorcycles. It was used in early versions of YOLO (v1 and v2) and supports tasks such as object detection, segmentation, and classification.
  • MS COCO [119]: A large-scale benchmark containing 91 object classes with five dedicated vehicle categories. It is widely used for training state-of-the-art models such as Faster R-CNN and the YOLO series.
  • BDD100K [120]: A diverse driving dataset with 100,000 video clips captured under varied conditions. It includes ten classes and supports multiple tasks, including detection, segmentation, and lane marking. It is suitable for complex real-world driving environments.
  • KITTI [121]: A widely used dataset for autonomous driving, providing both 2D and 3D annotations for object detection, tracking, and stereo vision. It includes real-world traffic scenes captured from a moving vehicle.
  • Boxy Vehicle Detection [122]: One of the largest public datasets for freeway vehicle detection, containing 200,000 annotated images. It covers various weather conditions, traffic densities, and vehicle types, supporting large-scale training and evaluation.
Despite the richness and scale of these datasets, several limitations can hinder the performance and generalizability of vehicle detection models. One common issue is class imbalance, where certain vehicle types (e.g., cars) are overrepresented compared to others (e.g., buses or motorcycles), leading to biased model predictions. Additionally, many datasets lack diversity in environmental conditions, such as nighttime, fog, rain, or occlusion, which are critical for robust real-world deployment. Limited annotation quality, resolution variability, and dataset-specific biases can further affect a model’s ability to generalize to unseen domains or cities. Addressing these challenges requires the development of more balanced, high-resolution, and context-rich datasets, along with techniques such as domain adaptation, data augmentation, and synthetic data generation, to improve model robustness across diverse operational scenarios. Table 8 shows some common dataset limitations and corresponding mitigation strategies.

5. Evaluation Metrics

Evaluation metrics are crucial for quantifying the performance of object detection algorithms. A core challenge in object detection is correctly associating predicted bounding boxes with their corresponding ground truth annotations, which significantly affects the calculated performance. Most evaluation protocols rely on Intersection over Union (IoU) to determine the quality of this association. Based on predefined IoU thresholds, standard metrics such as precision, recall, average precision (AP), and mean average precision (mAP) are computed to assess detection effectiveness. Several studies (e.g., [77,88]) have adopted these metrics to validate their vehicle detection frameworks. However, performance evaluation becomes increasingly difficult in complex scenarios involving occlusions, poor lighting, and dense traffic conditions. To address this, recent approaches such as Optimal Transport Assignment (OTA) and SimOTA [66,123] have reformulated label assignment as a global optimization problem, improving robustness and reducing localization errors in such challenging environments. The key metrics commonly used in object detection are described below:
  • Precision (P): Measures the proportion of correctly predicted positive detections:
    P = TP / (TP + FP)
    where TP is the number of true positives and FP the number of false positives.
  • Recall (R): Indicates the proportion of actual positives that are correctly detected:
    R = TP / (TP + FN)
    where FN denotes the number of false negatives.
  • Intersection over Union (IoU): Represents the ratio of the overlap between predicted and ground truth bounding boxes to their union:
    IoU = |B_p ∩ B_gt| / |B_p ∪ B_gt|
    where B_p is the predicted bounding box and B_gt is the ground truth box. A detection is considered correct if IoU exceeds a predefined threshold (e.g., 0.5).
  • Average Precision (AP): Quantifies the area under the precision-recall curve for a given class:
    AP = ∫_0^1 p(r) dr
    where p(r) is the precision at recall level r. It captures the trade-off between precision and recall across different confidence scores.
  • Mean Average Precision (mAP): The average of AP values over all object classes:
    mAP = (1/N) ∑_{i=1}^{N} AP_i
    where N is the number of classes and AP_i is the AP for class i. mAP is widely used to compare detection models comprehensively. A short computation sketch for IoU and AP is given after this list.
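The sketch below shows how these quantities are typically computed in practice: an IoU function for axis-aligned boxes and a VOC-style all-point AP over sorted detection scores; mAP is then simply the mean of per-class AP values. The toy inputs are placeholders for illustration.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(scores, is_tp, num_gt):
    """All-point AP for one class; is_tp[i] marks whether the i-th detection
    matched an unused ground-truth box at the chosen IoU threshold."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(1.0 - tp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
    # Take the precision envelope and integrate it over recall.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# Toy example: overlap of two boxes, then AP for three scored detections
# (two true positives) against two ground-truth vehicles.
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))                      # 0.143
print(average_precision([0.9, 0.8, 0.6], [True, False, True], num_gt=2))  # ~0.833
```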

Model Optimization for Extreme Real-World Conditions

While single-stage detectors such as the YOLO series are renowned for their real-time inference capability, their performance often deteriorates under challenging real-world conditions, including occlusion, low lighting, and dense traffic. To address these issues and maintain high-frame-rate performance with robust accuracy, several optimization strategies have been proposed in recent literature.
  • Lightweight Backbone Integration: Efficient architectures such as MobileNet and GhostNet are increasingly adopted to enhance computational efficiency without significantly compromising detection accuracy. Models like MobileNet-SSDv2 deliver fast inference on embedded platforms with low power consumption [124]. Similarly, GhostNet-based variants, including GhostNet-SSD [125] and GS-YoloNet [126], reduce redundant computation by generating ghost feature maps, preserving spatial richness.
  • Multi-Feature Fusion and Attention Mechanisms: Architectures such as SYGNet employ semantic-visual-guided GhostNet-YOLO to improve real-time detection in occluded or cluttered driving scenarios by fusing spatial and contextual information [127]. GS-YoloNet further enhances robustness using GhostShuffle and attention-based fusion techniques [126].
  • Hybrid Architecture Strategies: Combining YOLOv4 with EfficientDet modules has shown improved detection performance under lighting variation and visual clutter. This is achieved through BiFPN and compound scaling, which balance detection accuracy and computational efficiency [8].
  • Hardware-Aware Acceleration: TensorRT-optimized models such as YOLOv8-QSD demonstrate exceptional performance in low-light and small-object scenarios by leveraging quadrant spatial distribution encoding and depth-aware feature fusion [128].
  • Backbone Benchmarking for Deployment: Comparative studies highlight that lightweight backbones (e.g., MobileNet) offer favorable trade-offs over heavier ones (e.g., ConvNeXt) in terms of speed, energy efficiency, and accuracy for edge deployments in vehicle detection tasks [129].

6. Challenges and Future Research Directions

Despite considerable advancements in vehicle detection, several persistent challenges hinder reliable deployment in real-world environments. Key challenges and prospective research directions include the following:
  • Dataset and Annotation Quality: Accurate and diverse labeling is essential for training effective vehicle detection models. Labeling errors or imbalances—particularly in single-stage detectors—can lead to biased learning and degraded performance [78].
  • Complex Traffic Scenarios: Real-world traffic involves frequent occlusions, scale variations, and overlapping vehicles. Recent studies show that attention mechanisms—including spatial, channel-wise, and transformer-based—substantially improve robustness in these scenarios. Transformer models like RT-DETR and ViT variants capture long-range dependencies, while convolutional attention modules enhance feature discrimination in cluttered scenes [12,18,81,82].
  • Environmental Variability: Illumination changes, shadows, and adverse weather (e.g., fog, rain) remain major obstacles. Solutions include synthetic data augmentation, domain adaptation, and deep sensor fusion techniques, which improve detection reliability under degraded visual conditions [130,131].
  • Resource-Constrained Deployment: Real-time applications, particularly on UAVs or embedded systems, demand high efficiency. Lightweight models (e.g., YOLOv5, YOLOv6) combined with model compression techniques—such as pruning and quantization—are actively explored to address these constraints [88].
  • Generalization and Cross-Domain Adaptability: Many models experience performance drops when applied to unseen datasets or environments. Transfer learning, data augmentation, and continual learning are promising techniques to enhance model adaptability [114,115].
  • Integration with V2X Communication: The integration of detection systems with Vehicle-to-Everything (V2X) networks can enhance cooperative perception and situational awareness. Future work should explore detection architectures capable of leveraging shared sensor data across connected vehicles and infrastructure [110]. Such multi-agent cooperation frameworks, as discussed in [11], depend critically on the reliability and timeliness of vehicle detection outputs, underscoring the importance of accurate, low-latency perception in MAS-based driving environments.

7. Conclusions

This paper has presented a comprehensive survey of vehicle detection methods developed over the past two decades. We began by reviewing classical approaches rooted in traditional machine learning, highlighting widely used feature extraction techniques such as Haar-like features, HOG, and LBP, in combination with classifiers like SVM and AdaBoost. Representative studies were discussed in terms of detection tasks and performance, illustrating both the capabilities and limitations of early methods. We then examined the evolution toward deep learning-based approaches, including one-stage detectors (e.g., YOLO, SSD, FCOS, CenterNet) and two-stage frameworks (e.g., Faster R-CNN, Mask R-CNN). Emphasis was placed on architectural advancements, dataset usage, and performance trade-offs, especially regarding accuracy and real-time inference. Comparative tables (e.g., Table 1 and Table 5) offer practical insights into model selection and deployment considerations. In addition, we explored emerging directions such as transformer-based and GAN-enhanced detection models, integration with V2X communication systems, and edge–cloud deployment strategies. Major challenges, such as dense traffic scenes, adverse environmental conditions, limited annotations, and resource constraints, were also discussed. Looking ahead, future research should focus on enhancing model generalization, improving robustness via multi-modal and cooperative perception, advancing domain adaptation techniques, and enabling efficient real-time deployment on edge devices. This survey aims to serve as a valuable reference for researchers and practitioners seeking to understand the historical development, current state, and future potential of vehicle detection technologies.

Author Contributions

Conceptualization, M.A.A.A. and J.R.T.; methodology, M.A.A.A.; writing—original draft preparation, M.A.A.A.; writing—review and editing, J.R.T.; supervision, J.R.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Y.; Guo, Z.; Yu, H.; Zhao, Y. Real-time vehicle detection based on an improved YOLO model. Electronics 2020, 9, 583. [Google Scholar] [CrossRef]
  2. Kadim, Z.; Johari, K.M.; Samaon, D.F.; Li, Y.S.; Hon, H.W. Real-Time Deep-Learning Based Traffic Volume Count for High-Traffic Urban Arterial Roads. In Proceedings of the 2020 IEEE 10th Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 18–19 April 2020; pp. 53–58. [Google Scholar] [CrossRef]
  3. Liang, S.; Wu, H.; Zhen, L.; Hua, Q.; Garg, S.; Kaddoum, G.; Hassan, M.M.; Yu, K. Edge YOLO: Real-Time Intelligent Object Detection System Based on Edge-Cloud Cooperation in Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25345–25360. [Google Scholar] [CrossRef]
  4. Sun, Z.; Bebis, G.; Miller, R. On-road vehicle detection: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 694–711. [Google Scholar] [CrossRef]
  5. Shubho, F.H.; Iftekhar, F.; Hossain, E.; Siddique, S. Real-time traffic monitoring and traffic offense detection using YOLOv4 and OpenCV DNN. In Proceedings of the TENCON 2021—IEEE Region 10 Conference (TENCON), Auckland, New Zealand, 7–10 December 2021; pp. 46–51. [Google Scholar] [CrossRef]
  6. Liang, J.; Yang, K.; Tan, C.; Wang, J.; Yin, G. Enhancing High-Speed Cruising Performance of Autonomous Vehicles Through Integrated Deep Reinforcement Learning Framework. IEEE Trans. Intell. Transp. Syst. 2024, 26, 835–848. [Google Scholar] [CrossRef]
  7. Cai, Y.; Luan, T.; Gao, H.; Wang, H.; Chen, L.; Li, Y.; Sotelo, M.A.; Li, Z. YOLOv4-5D: An Effective and Efficient Object Detector for Autonomous Driving. IEEE Trans. Instrum. Meas. 2021, 70, 1–13. [Google Scholar] [CrossRef]
  8. Rajan, V.A.; Sakhamuri, S.; Nayaki, A.P.; Agarwal, S.; Aeron, A.; Lawanyashri, M. Optimizing Object Detection Efficiency for Autonomous Vehicles through the Integration of YOLOv4 and EfficientDet Algorithms. In Proceedings of the 2024 International Conference on Trends in Quantum Computing and Emerging Business Technologies (TQCEBT), Pune, India, 22–23 March 2024; pp. 1–5. [Google Scholar] [CrossRef]
  9. Qiu, L. Real-Time Traffic Sign Detection System for Autonomous Driving Based on YOLO Algorithm. In Proceedings of the 2024 Cross Strait Radio Science and Wireless Technology Conference (CSRSWTC), Fuzhou, China, 19–21 April 2024; pp. 1–4. [Google Scholar] [CrossRef]
  10. Sarda, A.; Dixit, S.; Bhan, A. Object Detection for Autonomous Driving Using YOLO (You Only Look Once) Algorithm. In Proceedings of the 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 March 2021; pp. 1370–1374. [Google Scholar] [CrossRef]
  11. Liang, J.; Li, Y.; Yin, G.; Xu, L.; Lu, Y.; Feng, J.; Shen, T.; Cai, G. A MAS-Based Hierarchical Architecture for the Cooperation Control of Connected and Automated Vehicles. IEEE Trans. Veh. Technol. 2023, 72, 1559–1573. [Google Scholar] [CrossRef]
  12. Wang, B.; Wu, J.; Li, Q.; Huang, Y. A high-precision vehicle detection and tracking method based on the attention mechanism. Sensors 2023, 23, 1223. [Google Scholar] [CrossRef]
  13. Ammar, A.I.; Almazroi, A.A.; Alsharif, M.H.; Alhabib, M.; Anwer, A.H. A multi-stage deep-learning-based vehicle and license plate recognition system with real-time edge inference. Electronics 2023, 12, 1164. [Google Scholar] [CrossRef]
  14. Berwo, M.; Dey, N.; Ashour, A.S. Deep learning techniques for vehicle detection and classification from images/videos: A survey. Int. J. Imaging Syst. Technol. 2023, 33, 124–145. [Google Scholar] [CrossRef]
  15. Ragab, M.; Mohamed, M.; Saeed, R.A. Improved deep learning-based vehicle detection for urban applications using remote sensing imagery. Remote Sens. 2023, 15, 2372. [Google Scholar] [CrossRef]
  16. Chen, J.; Xu, W.; Xu, H.; Lin, F.; Sun, Y.; Shi, X. Fast vehicle detection using a disparity projection method. IEEE Trans. Intell. Transp. Syst. 2018, 19, 2801–2813. [Google Scholar] [CrossRef]
  17. Soviany, P.; Ionescu, R.T. Frustratingly Easy Trade-off Optimization between Single-Stage and Two-Stage Deep Object Detectors. In Proceedings of the ECCV Workshops; Springer: Munich, Germany, 2018; pp. 552–568. [Google Scholar]
  18. Zhang, Y.; Liu, S.; Liu, Z.; Wang, X.; Wu, Z. RT-DETR: Real-Time DETR with Efficient Attention and Prompt Training. arXiv 2024, arXiv:2409.08475. [Google Scholar]
  19. Zhou, X.; Zhang, H.; Wang, B.; Zhang, Z.; Zhang, H.; Yang, M.; Lu, C. RT-DETR v3: Scalable and Real-Time End-to-End Object Detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  20. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. I-511–I-518. [Google Scholar] [CrossRef]
  21. Tian, Y.; Luo, P.; Wang, X.; Tang, X. Deep Learning Strong Parts for Pedestrian Detection. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1904–1912. [Google Scholar] [CrossRef]
  22. Premaratne, P.; Kadhim, I.J.; Blacklidge, R.; Lee, M. Comprehensive Review on Vehicle Detection, Classification and Counting on Highways. Neurocomputing 2023, 556, 126627. [Google Scholar] [CrossRef]
  23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  24. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  25. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  26. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Ke, Z. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  27. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  28. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  29. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  30. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  31. Han, S.; Han, Y.; Hahn, H. Vehicle detection method using Haar-like feature on real-time system. World Acad. Sci. Eng. Technol. 2009, 59, 455–459. [Google Scholar]
  32. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar] [CrossRef]
  33. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  34. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  35. Awad, M.; Khanna, R. Support vector machines for classification. In Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System Designers; Apress: Berkeley, CA, USA, 2015; pp. 39–66. [Google Scholar]
  36. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
  37. Rätsch, G.; Onoda, T.; Müller, K.-R. Soft margins for AdaBoost. Mach. Learn. 2001, 42, 287–320. [Google Scholar] [CrossRef]
  38. Joshi, A.J.; Porikli, F. Scene-adaptive human detection with incremental active learning. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2760–2763. [Google Scholar]
  39. Haselhoff, A.; Kummert, A.; Schneider, G. Radar-vision fusion for vehicle detection by means of improved Haar-like feature and AdaBoost approach. In Proceedings of the 2007 15th European Signal Processing Conference, Poznań, Poland, 3–7 September 2007; pp. 2070–2074. [Google Scholar]
  40. Moranduzzo, T.; Melgani, F. A SIFT-SVM method for detecting cars in UAV images. In Proceedings of the 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, Germany, 22–27 July 2012; pp. 6868–6871. [Google Scholar]
  41. Tang, Y.; Zhang, C.; Gu, R.; Li, P.; Yang, B. Vehicle detection and recognition for intelligent traffic surveillance system. Multimed. Tools Appl. 2017, 76, 5817–5832. [Google Scholar] [CrossRef]
  42. Ali, J.E.; Rahmat, R.A.O.K. Developing and validating a real-time video-based traffic counting and classification. J. Eng. Sci. Technol. 2017, 12, 3215–3225. [Google Scholar]
  43. Cao, X.; Wu, C.; Yan, P.; Li, X. Linear SVM classification using boosting HOG features for vehicle detection in low-altitude airborne videos. In Proceedings of the 2011 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 2421–2424. [Google Scholar]
  44. Xu, Y.; Yu, G.; Wang, Y.; Wu, X.; Ma, Y. A hybrid vehicle detection method based on Viola-Jones and HOG + SVM from UAV images. Sensors 2016, 16, 1325. [Google Scholar] [CrossRef]
  45. Niknejad, H.T.; Takeuchi, A.; Mita, S.; McAllester, D. On-road multivehicle tracking using deformable object model and particle filter with improved likelihood estimation. IEEE Trans. Intell. Transp. Syst. 2012, 13, 748–758. [Google Scholar] [CrossRef]
  46. Chávez-Aragón, A.; Laganiere, R.; Payeur, P. Vision-based detection and labelling of multiple vehicle parts. In Proceedings of the 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), Washington, DC, USA, 5–7 October 2011; pp. 1273–1278. [Google Scholar]
  47. Fawzy, N.; Obaya, M.; Ata, M.M.; Yousif, B.; Samra, A.S. A novel video processing algorithm for accurate vehicle detection and tracking. In Proceedings of the 2021 International Telecommunications Conference (ITC-Egypt), Alexandria, Egypt, 13–15 July 2021; pp. 1–4. [Google Scholar]
  48. Sun, Z.; Bebis, G.; Miller, R. On-road vehicle detection using Gabor filters and support vector machines. In Proceedings of the 2002 14th International Conference on Digital Signal Processing (DSP 2002), Santorini, Greece, 1–3 July 2002; Volume 2, pp. 1019–1022. [Google Scholar]
  49. Lefaix, G.; Marchand, E.; Bouthemy, P. Motion-based obstacle detection and tracking for car driving assistance. In Proceedings of the 16th International Conference on Pattern Recognition (ICPR), Québec City, QC, Canada, 11–15 August 2002; Volume 4, pp. 74–77. [Google Scholar] [CrossRef]
  50. Techmer, A. Real-time motion-based vehicle segmentation in traffic lanes. In Pattern Recognition (DAGM 2001); Springer: Berlin/Heidelberg, Germany, 2001; Volume 2191, pp. 202–207. [Google Scholar] [CrossRef]
  51. Musa, S.; Saad, M.H.M.; Ahmad, R.B.; Azemi, S.N.A.M. Motion-Based Vehicle Detection in Hsuehshan Tunnel Surveillance Video. Int. J. Image Graph. Signal Process. 2019, 11, 34–45. [Google Scholar] [CrossRef]
  52. Jazayeri, A.; Cai, H.; Zheng, J.Y.; Tuceryan, M. Vehicle detection and tracking in car video based on motion model. IEEE Trans. Intell. Transp. Syst. 2011, 12, 583–595. [Google Scholar] [CrossRef]
  53. Chauhan, N.K.; Singh, K. A review on conventional machine learning vs deep learning. In Proceedings of the 2018 International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India, 28–29 September 2018; pp. 347–352. [Google Scholar] [CrossRef]
  54. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  55. Ouyang, L.; Wang, H. Vehicle target detection in complex scenes based on YOLOv3 algorithm. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Hubei, China, 19–20 April 2019; Volume 569, p. 052018. [Google Scholar]
  56. Espinosa, J.E.; Velastin, S.A.; Branch, J.W. Vehicle detection using AlexNet and Faster R-CNN deep learning models: A comparative study. In Advances in Visual Informatics: 5th International Visual Informatics Conference, IVIC 2017, Bangi, Malaysia, 28–30 November 2017; Springer: Cham, Switzerland, 2017; pp. 3–15. [Google Scholar]
  57. Wu, Y.-Y.; Tsai, C.-M. Pedestrian, bike, motorcycle, and vehicle classification via deep learning: Deep belief network and small training set. In Proceedings of the 2016 International Conference on Applied System Innovation (ICASI), Okinawa, Japan, 28 May–1 June 2016; pp. 1–4. [Google Scholar]
  58. Taek Lee, J.; Chung, Y. Deep learning-based vehicle classification using an ensemble of local expert and global networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 47–52. [Google Scholar]
  59. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  60. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  61. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  62. Yang, K.; Gui, X. Research on real-time detection of road vehicle targets based on YOLOv4 improved algorithm. In Proceedings of the 2022 3rd International Conference on Electronic Communication and Artificial Intelligence (IWECAI), Zhuhai, China, 14–16 January 2022; pp. 243–246. [Google Scholar]
  63. Sang, J.; Wu, Z.; Guo, P.; Hu, H.; Xiang, H.; Zhang, Q.; Cai, B. An improved YOLOv2 for vehicle detection. Sensors 2018, 18, 4272. [Google Scholar] [CrossRef]
  64. Wu, Z.; Sang, J.; Zhang, Q.; Xiang, H.; Cai, B.; Xia, X. Multi-scale vehicle detection for foreground-background class imbalance with improved YOLOv2. Sensors 2019, 19, 3336. [Google Scholar] [CrossRef] [PubMed]
  65. Jocher, G. Ultralytics YOLOv5; Version 7.0; Ultralytics: Frederick, MD, USA, 2020; AGPL-3.0 License; Available online: https://github.com/ultralytics/yolov5 (accessed on 27 May 2025). [CrossRef]
  66. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  67. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An Evolved Version of YOLO. arXiv 2022, arXiv:2203.16250. [Google Scholar]
  68. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8; Version 8.0.0; AGPL-3.0 License; Available online: https://github.com/ultralytics/ultralytics (accessed on 1 April 2025).
  69. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  70. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  71. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  72. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
  73. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  74. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  75. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  76. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  77. Chen, Z.; Guo, H.; Yang, J.; Jiao, H.; Feng, Z.; Chen, L.; Gao, T. Fast vehicle detection algorithm in traffic scene based on improved SSD. Measurement 2022, 201, 111655. [Google Scholar] [CrossRef]
  78. Zhang, X.; Zhu, X. Vehicle detection in the aerial infrared images via an improved YOLOv3 network. In Proceedings of the 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), Wuxi, China, 19–21 July 2019; pp. 372–376. [Google Scholar]
  79. Zhou, T.; Bai, Y.; Feng, X.; Sun, L.; Fang, Q. A feature enhancement FCOS algorithm for dynamic traffic object detection. Connect. Sci. 2023, 35, 2321345. [Google Scholar] [CrossRef]
  80. Wang, Y.; Zhu, B.; Qiu, H.; Zhang, Z.; Peng, J. CenterNet-Auto: A multi-object visual detection algorithm for autonomous driving scenes based on improved CenterNet. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 743–755. [Google Scholar] [CrossRef]
  81. Fahad, I.A.; Arean, A.I.H.; Ahmed, N.S.; Hasan, M. Automatic Vehicle Detection using DETR: A Transformer-Based Approach for Navigating Treacherous Roads. arXiv 2025, arXiv:2502.17843. [Google Scholar]
  82. Zong, Z.; Xu, Y.; Guo, C. Real-Time Transformer-Based Object Detection for Autonomous Driving. arXiv 2024, arXiv:2502.17843. [Google Scholar] [CrossRef]
  83. Sun, Z.; Liu, C.; Qu, H.; Xie, G. A novel effective vehicle detection method based on Swin Transformer in hazy scenes. Mathematics 2022, 10, 2199. [Google Scholar] [CrossRef]
  84. Dong, X.; Shi, P.; Tang, Y.; Yang, L.; Yang, A.; Liang, T. Vehicle classification algorithm based on improved vision transformer. World Electr. Veh. J. 2024, 15, 344. [Google Scholar] [CrossRef]
  85. Wang, J.-G.; Wan, K.-W.; Yau, W.-Y.; Pang, C.H.; Lai, F.L. VAGAN: Vehicle-aware generative adversarial networks for vehicle detection in rain. In Proceedings of the 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV), Shenzhen, China, 13–15 December 2020; pp. 363–368. [Google Scholar]
  86. Lin, C.-T.; Huang, S.-W.; Wu, Y.-Y.; Lai, S.-H. GAN-based day-to-night image style transfer for nighttime vehicle detection. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1286–1295. [Google Scholar] [CrossRef]
  87. Zheng, K.; Wei, M.; Sun, G.; Anas, B.; Li, Y. Vehicle synthesis generative adversarial networks (VS-GANs) for improved object detection in remote sensing images. ISPRS Int. J. Geo-Inf. 2019, 8, 390. [Google Scholar] [CrossRef]
  88. Dong, X.; Yan, S.; Duan, C. A lightweight vehicles detection network model based on YOLOv5. Eng. Appl. Artif. Intell. 2022, 113, 104914. [Google Scholar] [CrossRef]
  89. Ruan, W.; Liu, Y. Lightweight detection method based on improved YOLOv4. In Proceedings of the 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), Changchun, China, 20–22 May 2022; pp. 46–49. [Google Scholar]
  90. Ding, X.; Yang, R. Vehicle and parking space detection based on improved YOLO network model. J. Phys. Conf. Ser. 2019, 1325, 012084. [Google Scholar] [CrossRef]
  91. Song, H.; Liang, H.; Li, H.; Dai, Z.; Yun, X. Vision-based vehicle detection and counting system using deep learning in highway scenes. Eur. Transp. Res. Rev. 2019, 11, 51. [Google Scholar] [CrossRef]
  92. Fan, Y.-C.; Yelamandala, C.M.; Chen, T.-W.; Huang, C.-J. Real-time object detection for LiDAR based on LS-R-YOLOv4 neural network. J. Sens. 2021, 2021, 5576262. [Google Scholar] [CrossRef]
  93. Uus, J.; Krilavicius, T. Detection of different types of vehicles from aerial imagery. In Proceedings of the IVUS, Kaunas, Lithuania, 25 April 2019; pp. 80–85. [Google Scholar]
  94. Huang, Y.-Q.; Zheng, J.-C.; Sun, S.-D.; Yang, C.-F.; Liu, J. Optimized YOLOv3 algorithm and its application in traffic flow detections. Appl. Sci. 2020, 10, 3079. [Google Scholar] [CrossRef]
  95. Karungaru, S.; Dongyang, L.; Terada, K. Vehicle detection and type classification based on CNN-SVM. Int. J. Mach. Learn. Comput. 2021, 11, 304–310. [Google Scholar] [CrossRef]
  96. Simony, M.; Milzy, S.; Amendey, K.; Gross, H.-M. Complex-YOLO: An euler-region-proposal for real-time 3D object detection on point clouds. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  97. Davis, T.; Nandana, K.V. Real-time 3D object detection on LiDAR point cloud using Complex-YOLO v4. Int. Res. J. Eng. Technol. 2022, 9, 716–721. [Google Scholar]
  98. Yang, Z.; Li, J.; Li, H. Real-time pedestrian and vehicle detection for autonomous driving. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 179–184. [Google Scholar]
  99. Xu, X.; Xiong, H.; Zhan, L.; Królczyk, G.; Stanislawski, R.; Gardoni, P.; Li, Z. A new deep model for detecting multiple moving targets in real traffic scenarios: Machine vision-based vehicles. Sensors 2022, 22, 3742. [Google Scholar] [CrossRef]
  100. Xiong, C.; Yu, A.; Yuan, S.; Gao, X. Vehicle detection algorithm based on lightweight YOLOX. Signal Image Video Process. 2023, 17, 1793–1800. [Google Scholar] [CrossRef]
  101. Li, X.; Liu, Y.; Zhao, Z.; Zhang, Y.; He, L. A deep learning approach of vehicle multitarget detection from traffic video. J. Adv. Transp. 2018, 2018, 7075814. [Google Scholar] [CrossRef]
  102. Yayla, R.; Albayrak, E.; Yuzgec, U. Vehicle detection from unmanned aerial images with deep Mask R-CNN. Comput. Sci. J. Mold. 2022, 89, 148–169. [Google Scholar] [CrossRef]
  103. Rujikietgumjorn, S.; Watcharapinchai, N. Vehicle detection with sub-class training using R-CNN for the UA-DETRAC benchmark. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–5. [Google Scholar]
  104. Hsu, S.-C.; Huang, C.-L.; Chuang, C.-H. Vehicle detection using simplified Fast R-CNN. In Proceedings of the 2018 International Workshop on Advanced Image Technology (IWAIT), Chiang Mai, Thailand, 7–9 January 2018; pp. 1–3. [Google Scholar]
  105. Ahmed, S.H.; Raza, M.; Mehdi, S.S.; Rehman, I.; Kazmi, M.; Qazi, S.A. Faster R-CNN based vehicle detection and counting framework for undisciplined traffic conditions. In Proceedings of the 2021 IEEE 18th International Conference on Smart Communities: Improving Quality of Life using ICT, IoT and AI (HONET), Karachi, Pakistan, 11–13 October 2021; pp. 173–178. [Google Scholar]
  106. Tu, C.; Du, S. A hierarchical RCNN for vehicle and vehicle license plate detection and recognition. Int. J. Electr. Comput. Eng. 2022, 12, 731. [Google Scholar] [CrossRef]
  107. Tsai, C.-C.; Tseng, C.-K.; Tang, H.-C.; Guo, J.-I. Vehicle detection and classification based on deep neural network for intelligent transportation applications. In Proceedings of the 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 12–15 November 2018; pp. 1605–1608. [Google Scholar]
  108. Nguyen, H. Improving Faster R-CNN framework for fast vehicle detection. Math. Probl. Eng. 2019, 2019, 3808064. [Google Scholar] [CrossRef]
  109. Azimjonov, J.; Özmen, A. A real-time vehicle detection and a novel vehicle tracking system for estimating and monitoring traffic flow on highways. Adv. Eng. Inform. 2021, 50, 101393. [Google Scholar] [CrossRef]
  110. Wang, L.; Zhong, H.; Ma, W.; Abdel-Aty, M.; Park, J. How many crashes can connected vehicle and automated vehicle technologies prevent: A meta-analysis. Accid. Anal. Prev. 2020, 136, 105299. [Google Scholar] [CrossRef]
  111. Trivedi, J.D.; Mandalapu, S.D.; Dave, D.H. Vision-based real-time vehicle detection and vehicle speed measurement using morphology and binary logical operation. J. Ind. Inf. Integr. 2022, 27, 100280. [Google Scholar] [CrossRef]
  112. Rios-Cabrera, R.; Tuytelaars, T.; Van Gool, L. Efficient Multi-Camera Vehicle Detection, Tracking, and Identification in a Tunnel Surveillance Application. Comput. Vis. Image Underst. 2012, 116, 742–753. [Google Scholar] [CrossRef]
  113. Muhammad, K.; Ullah, A.; Lloret, J.; Del Ser, J.; de Albuquerque, V.H.C. Deep Learning for Safe Autonomous Driving: Current Challenges and Future Directions. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4316–4336. [Google Scholar] [CrossRef]
  114. Boukerche, A.; Hou, Z. Object Detection Using Deep Learning Methods in Traffic Scenarios. ACM Comput. Surv. 2021, 54, 30. [Google Scholar] [CrossRef]
  115. Bakirci, E.; Altay, D.; Oztel, I.; Gungor, V.C. Vehicular mobility monitoring using remote sensing and deep learning on a UAV-based mobile computing platform. Drones 2025, 9, 11. [Google Scholar] [CrossRef]
  116. Papageorgiou, C.P.; Oren, M.; Poggio, T. A general framework for object detection. In Proceedings of the Sixth International Conference on Computer Vision (ICCV), Bombay, India, 4–7 January 1998; pp. 555–562. [Google Scholar] [CrossRef]
  117. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  118. Uijlings, J.R.R.; Van De Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
  119. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  120. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2636–2645. [Google Scholar]
  121. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  122. Behrendt, K. Boxy vehicle detection in large images. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  123. Ge, Z.; Liu, S.; Li, Z.; Yoshie, O.; Sun, J. OTA: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 303–312. [Google Scholar]
  124. Chiu, Y.-C.; Tsai, C.-Y.; Ruan, M.-D.; Shen, G.-Y.; Lee, T.-T. Mobilenet-SSDv2: An improved object detection model for embedded systems. In Proceedings of the 2020 International Conference on System Science and Engineering (ICSSE), Tainan, Taiwan, 31 August–3 September 2020; pp. 1–5. [Google Scholar] [CrossRef]
  125. Liu, J.; Cong, W.; Li, H. Vehicle detection method based on GhostNet-SSD. In Proceedings of the 2020 International Conference on Virtual Reality and Intelligent Systems (ICVRIS), Zhangjiajie, China, 18–19 July 2020; pp. 200–203. [Google Scholar] [CrossRef]
  126. Wei, F.; Wang, W. GS-YoloNet: A Lightweight Network for Detection, Tracking, and Distance Estimation on Highways. In Proceedings of the 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring), Antwerp, Belgium, 24–27 June 2024; pp. 1–6. [Google Scholar] [CrossRef]
  127. Wang, H.; Zhu, B.; Li, Y.; Gong, K.; Wen, Z.; Wang, S.; Dev, S. SYGNet: A SVD-YOLO Based GhostNet for Real-Time Driving Scene Parsing. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 2701–2705. [Google Scholar] [CrossRef]
  128. Wang, H.; Liu, C.; Cai, Y.; Chen, L.; Li, Y. YOLOv8-QSD: An Improved Small Object Detection Algorithm for Autonomous Vehicles Based on YOLOv8. IEEE Trans. Instrum. Meas. 2024, 73, 1–16. [Google Scholar] [CrossRef]
  129. Bihanda, Y.G.; Fatichah, C.; Yuniarti, A. Comparative Analysis of ConvNext and MobileNet on Traffic Vehicle Detection. In Proceedings of the 2023 IEEE 8th International Conference on Software Engineering and Computer Systems (ICSECS), Pahang, Malaysia, 19–21 September 2023; pp. 101–105. [Google Scholar] [CrossRef]
  130. Bijelic, M.; Gruber, T.; Mannan, F.; Kraus, F.; Ritter, W.; Dietmayer, K.; Heide, F. Seeing Through Fog Without Seeing Fog: Deep Sensor Fusion in the Absence of Labeled Training Data. Sensors 2020, 20, 2542. [Google Scholar] [CrossRef]
  131. Wu, B.; Zhou, X.; Zhao, S.; Yue, X.; Keutzer, K. SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4376–4382. [Google Scholar] [CrossRef]
Figure 1. Overview of the image-based vehicle detection process. The pipeline consists of four main stages: (1) input data acquisition from publicly available vehicle datasets such as KITTI or BDD-100K; (2) preprocessing, including frame/image extraction, resizing, normalization, and exploratory data analysis (EDA) with labeling; (3) dataset preparation, where data are split into training, validation, and testing sets; (4) model training and evaluation, which involves architecture selection (e.g., YOLOv5, Faster R-CNN), performance evaluation, and fine-tuning or hyperparameter tuning.
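As a concrete illustration of this four-stage pipeline, the sketch below runs architecture selection, training, and validation through the Ultralytics Python API [68]. It is only a minimal example under assumed conditions: the dataset YAML `vehicles.yaml` (pointing at pre-split train/validation/test folders with vehicle labels) and the hyperparameter values are hypothetical, and stages (1)-(3) are assumed to have been completed offline.

```python
# Minimal sketch of the Figure 1 pipeline with the Ultralytics API.
# "vehicles.yaml" and the hyperparameters are illustrative assumptions.
from ultralytics import YOLO

def run_pipeline():
    # Stages (1)-(3): data acquisition, preprocessing, and splitting are assumed
    # to be done offline and described by the dataset YAML (paths + class names).
    model = YOLO("yolov8n.pt")          # stage (4): architecture selection (pretrained weights)

    # Training with a few illustrative hyperparameters
    model.train(data="vehicles.yaml", epochs=50, imgsz=640, batch=16)

    # Evaluation on the validation split; metrics.box.map50 is mAP@0.5
    metrics = model.val()
    print("mAP@0.5:", metrics.box.map50)

if __name__ == "__main__":
    run_pipeline()
```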
Figure 2. Illustration of a vehicle detection system. The system receives an input image, processes it using an object detection algorithm (e.g., YOLO, Faster R-CNN), and produces an output with bounding boxes, class labels, and confidence scores.
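The minimal sketch below mirrors the input/output behavior shown in Figure 2 using a pretrained torchvision Faster R-CNN; the image file name is a placeholder, and the 0.5 confidence threshold is an arbitrary illustrative choice.

```python
# Sketch of the detector in Figure 2 using a pretrained torchvision Faster R-CNN
# (COCO weights); "street.jpg" is a hypothetical input image.
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = convert_image_dtype(read_image("street.jpg"), torch.float)  # CxHxW in [0, 1]
with torch.no_grad():
    pred = model([img])[0]            # dict with 'boxes', 'labels', 'scores'

keep = pred["scores"] > 0.5           # confidence threshold
for box, label, score in zip(pred["boxes"][keep], pred["labels"][keep], pred["scores"][keep]):
    print(label.item(), round(score.item(), 3), box.tolist())   # COCO label 3 = car
```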
Figure 3. Architectural pipeline illustrating classical machine learning methods used for vehicle detection.
Figure 4. Types of Haar-like features for vehicle detection. (A) Two-rectangle vertical edge; (B) Three-rectangle vertical line feature; (C) Three-rectangle horizontal line feature; (D) Four-rectangle checkerboard pattern; (E) Two-rectangle vertical edge feature; (F) Two-rectangle vertical edge feature.
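To make the rectangle patterns in Figure 4 concrete, the sketch below evaluates a two-rectangle vertical-edge feature (pattern A) from an integral image, which is how Haar-like features are computed in constant time per window. It is a generic illustration, not code from any of the surveyed systems; the window coordinates and random patch are stand-ins.

```python
# Evaluating a two-rectangle vertical-edge Haar-like feature from an integral image.
import numpy as np

def integral_image(gray):
    # Zero-padded integral image: ii[y, x] = sum of gray[:y, :x]
    ii = gray.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, y, x, h, w):
    # Sum of the h x w rectangle with top-left corner (y, x), using 4 lookups.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_vertical_edge(ii, y, x, h, w):
    # Left (white) half minus right (black) half of a two-rectangle feature.
    half = w // 2
    return rect_sum(ii, y, x, h, half) - rect_sum(ii, y, x + half, h, half)

gray = np.random.randint(0, 256, (64, 64)).astype(np.float64)   # stand-in image patch
ii = integral_image(gray)
print(haar_vertical_edge(ii, y=8, x=8, h=24, w=24))
```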
Figure 5. Deep learning workflow for vehicle/object detection. The pipeline includes preprocessing, feature extraction using a CNN backbone (e.g., ResNet, CSPDarknet), feature fusion (e.g., FPN, PANet), detection heads for bounding box regression and classification, and post-processing (e.g., Soft-NMS).
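The post-processing step named in Figure 5 can be illustrated with a small NumPy implementation of the Gaussian variant of Soft-NMS, in which overlapping boxes have their scores decayed rather than being suppressed outright. The sigma and score-threshold values below are illustrative defaults, not values prescribed by the surveyed works.

```python
# Gaussian Soft-NMS sketch: scores of overlapping boxes are decayed by exp(-IoU^2 / sigma).
import numpy as np

def iou(box, boxes):
    # box: (4,), boxes: (N, 4), both in [x1, y1, x2, y2] format
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    while len(boxes):
        i = scores.argmax()
        keep.append((boxes[i], scores[i]))            # keep the current best box
        ious = iou(boxes[i], np.delete(boxes, i, axis=0))
        scores = np.delete(scores, i) * np.exp(-(ious ** 2) / sigma)   # decay, not discard
        boxes = np.delete(boxes, i, axis=0)
        alive = scores > score_thresh                  # drop boxes whose score decayed away
        boxes, scores = boxes[alive], scores[alive]
    return keep
```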
Figure 6. One-stage object detection pipeline. Models like YOLOv3, SSD, and RetinaNet process input images using a convolutional backbone to directly predict bounding boxes and class probabilities.
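A hedged sketch of the single-pass decoding implied by Figure 6 is given below: per-box class scores are thresholded and the survivors passed through standard NMS via `torchvision.ops.nms`. The random tensors stand in for real network outputs, and the class layout (e.g., car/bus/truck plus background) is an assumption.

```python
# Decoding one-stage detector outputs: confidence filtering followed by NMS.
import torch
from torchvision.ops import nms

def decode(boxes, class_logits, conf_thresh=0.4, iou_thresh=0.5):
    # boxes: (N, 4) in xyxy format, class_logits: (N, C)
    scores, labels = class_logits.softmax(dim=-1).max(dim=-1)
    keep = scores > conf_thresh                      # confidence filtering
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = nms(boxes, scores, iou_thresh)            # class-agnostic NMS
    return boxes[keep], scores[keep], labels[keep]

# Random stand-ins for real predictions (100 boxes, 4 classes)
xy1 = torch.rand(100, 2) * 300
boxes = torch.cat([xy1, xy1 + torch.rand(100, 2) * 100 + 1], dim=1)   # valid xyxy boxes
logits = torch.randn(100, 4)
print(decode(boxes, logits)[0].shape)
```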
Figure 7. YOLO Versions Timeline. Highlights of YOLO model progression from v1 to v10, including derivative models such as YOLOX, PP-YOLO, and YOLO-NAS.
Figure 8. Two-stage detection pipeline: feature extraction, region proposal (RPN), ROI classification, and bounding box refinement.
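In practice, the two-stage pipeline of Figure 8 is often adapted to vehicle detection by reusing a pretrained backbone and RPN and replacing only the ROI box predictor. The sketch below shows this with torchvision's Faster R-CNN; the four-class setup (car, bus, truck, background) is an illustrative assumption rather than a configuration from any specific surveyed study.

```python
# Replacing the ROI box predictor of a pretrained Faster R-CNN with a vehicle-specific head.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 4   # 3 vehicle classes + background (illustrative)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# During training the model expects (images, targets), where each target holds
# 'boxes' (N, 4) and 'labels' (N,); it returns the RPN and ROI-head losses to optimize.
```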
Table 1. Summary of classical vehicle detection techniques. “Performance” refers to the primary evaluation metric (e.g., accuracy, precision, or mAP) reported in each respective study.
References | Methods | Tasks Performed | Performance
[40] | SIFT and SVM | Vehicle detection and recognition | 65%
[41] | Haar-like features and AdaBoost | Vehicle detection and recognition | 97%
[42] | Background subtraction, shadow removal, pixel analysis | Vehicle detection and counting | 95%
[43] | Boosting HOG features, SVM | Detect moving vehicles | 90%
[44] | Viola-Jones (V-J) and HOG + SVM | Vehicle detection | 88.5%
[45] | LSVM + HOG | Multi-vehicle detection and tracking | 97%
[46] | Hough transform algorithm, Haar-like features | Detection and labeling of multiple vehicles | 95%
[47] | Background subtraction and frame difference | Moving vehicle detection | 95.5%
[48] | Feature extraction and classification for rear-view vehicle detection | Vehicle detection | 94.8%
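For readers unfamiliar with the classical pipeline summarized in Table 1, the sketch below trains a HOG + linear SVM vehicle classifier in the spirit of [43,44], using the HOG descriptor of [32]. The patch arrays, window size, and SVM regularization constant are assumptions introduced only for illustration.

```python
# Illustrative HOG + linear SVM training, assuming fixed-size grayscale vehicle
# and background patches (e.g., 64x64 crops) prepared by the user.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patches):
    # patches: iterable of HxW grayscale arrays of identical size
    return np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2), block_norm="L2-Hys")
                     for p in patches])

def train_vehicle_classifier(pos_patches, neg_patches):
    X = np.vstack([hog_features(pos_patches), hog_features(neg_patches)])
    y = np.concatenate([np.ones(len(pos_patches)), np.zeros(len(neg_patches))])
    return LinearSVC(C=0.01).fit(X, y)     # linear SVM over HOG descriptors

# At detection time the classifier is slid over image windows (or applied to
# region proposals) and high-scoring windows are merged with NMS.
```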
Table 3. YOLO Series Algorithms for Real-Time Object Detection.
Reference | Year | Version | Backbone | Head
[23] | 2015 | YOLOv1 | Custom | Custom
[24] | 2016 | YOLOv2 | Darknet-19 | Custom
[25] | 2018 | YOLOv3 | Darknet-53 | Custom
[59] | 2020 | YOLOv4 | CSPDarknet53 | Custom
[65] | 2020 | YOLOv5 | CSPDarknet | Custom
[26] | 2022 | YOLOv6 | EfficientRep | Custom
[27] | 2022 | YOLOv7 | YOLOv7Backbone | Custom
[68] | 2023 | YOLOv8 | YOLOv8CSPDarknet | Custom
[69] | 2024 | YOLOv9 | YOLOv9Backbone | DEKRHead
[70] | 2024 | YOLOv10 | YOLOv10Backbone | YOLOv10Head
[71] | 2024 | YOLOv11 | C3k2 + C2PSA | YOLOv11Head
[72] | 2025 | YOLOv12 | R-ELAN + A2C2f | YOLOv12Head
Table 4. Examples of one-stage methods for vehicle detection.
References | Dataset | Method | Performance
[78] | UA-DETRAC | DP-SSD | 77.94%
[77] | KITTI | SSD + (MobileNet v2, channel attention, deconvolution) | 84.81%
[88] | MS COCO | YOLOv5s, C3Ghost, CBAM | 72.40%
[89] | PASCAL VOC 2007 + 2012 | YOLOv4, MobileNetv1, ECA attention | 90.29%
[62] | KITTI | YOLOv4, CBM + CSP, Kmeans++ | 93.22%
[63] | BIT-Vehicle, CompCars | YOLOv2, k-means++ | 94.78%
[64] | BIT-Vehicle | YOLOv2, Rk-means++, Focal Loss | 97.30%
[90] | PASCAL VOC, MS COCO, PKLot | YOLOv3 | 93.30%
[55] | PASCAL VOC 2007 + 2012 | YOLOv3 | 89.10%
[91] | Vehicle dataset | YOLOv3, ORB algorithm | 87.80%
[92] | KITTI, PASCAL VOC | YOLOv4 | 97.70%
[93] | MAFAT tournament | YOLOv3 | 81.72%
[94] | DETRAC | YOLOv3 with SPP module | 98.80%
[95] | BIT-Vehicle | YOLOv2-tiny | 82.20%
[96] | KITTI | Complex-YOLO, E-RPN | 67.70%
[97] | KITTI | Complex-YOLO V4 | 79.00%
[7] | KITTI, BDD | YOLOv4, CSPDarknet53_dcn, PAN++ | 70.10%
[98] | KITTI | YOLOv2 | 61.30%
[99] | KITTI | YOLOv4, CBAM, Soft-NMS, DIoU | 81.23%
[5] | PoribohonBD, Dhaka-AI | YOLOv4 | 87.19%
[100] | BIT-Vehicle | YOLOX | 99.21%
[101] | VOC2007 | YOLOv2 | 90.00%
Table 5. Examples of vehicle detection methods based on two-stage detectors.
Reference | Dataset | Method | Performance
[103] | UA-DETRAC | R-CNN, Transfer Learning | 93.43%
[104] | SHRP 2 NDS | Fast R-CNN | 89.2%
[102] | UAV Images | Mask R-CNN | 93%
[105] | Pakistan Dataset | Faster R-CNN | 82.14%
[106] | CIFAR-10 | R-CNN | 98.5%
[107] | PASCAL VOC | Optimized Faster R-CNN | 90%
[108] | KITTI, LSVH | Faster R-CNN, Soft-NMS | 89.2%
Table 6. Comparison of deep learning-based vehicle detection methods.
Method | Model Type | Application Scenario | Strengths
YOLOv5/YOLOv7 | One-Stage | Real-time/Edge | Speed, Efficiency
Faster R-CNN | Two-Stage | High-Precision Detection | Accuracy, Localization
DETR/Swin | Transformer | Complex Scenes | Context Modeling
VAGAN/AugGAN | GAN-Based | Low-Light/Scarce Data | Data Augmentation
Table 7. Summary of common datasets for vehicle detection tasks. “#Img/Vid” refers to the approximate number of labeled images or video sequences.
Dataset | Classes | Vehicle Types | Tasks | Annotation | #Img/Vid | Resolution
Pascal VOC | 20 | Car, Bus, Bicycle, Motorbike | Detection, Segmentation, Classification | Yes | ∼11,000 images | 384 × 480 to 500 × 375
MS COCO | 91 | Bicycle, Car, Motorcycle, Bus, Truck | Detection, Segmentation, Classification | Yes | ∼330,000 images | Varies (avg. 640 × 480)
BDD100K | 10 | Car, Bus, Truck, Bike, Motorcycle | Detection, Segmentation, Lane Marking, Image Tagging | Yes | 100,000 video clips (40 s each) | 1280 × 720
KITTI | 9 | Car, Van, Truck | Detection, Segmentation, Classification | Yes | ∼15,000 images, 200 videos | 1242 × 375
Boxy Vehicle | 1 | Passenger Cars, Trucks, Car Carriers, Motorcycles | Vehicle Detection in Freeway Driving | Yes | 200,000 images | 5 MP (typically 1920 × 1080)
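Before training on any of the datasets in Table 7, it is often useful to inspect their class balance, since imbalance is one of the limitations addressed in Table 8. The sketch below does this for a COCO-format annotation file; the file name and the set of vehicle class names are illustrative assumptions.

```python
# Counting per-class vehicle annotations in a COCO-format JSON file.
import json
from collections import Counter

with open("annotations.json") as f:          # hypothetical annotation export
    coco = json.load(f)

id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])

vehicle_classes = {"car", "bus", "truck", "motorcycle", "bicycle"}
for name, n in counts.most_common():
    if name in vehicle_classes:
        print(f"{name:12s} {n}")
```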
Table 8. Common dataset limitations and mitigation strategies in vehicle detection.
Limitation | Mitigation Strategy
Class imbalance (e.g., overrepresentation of cars) | Data resampling, synthetic data generation, loss reweighting
Poor diversity in weather/lighting | Data augmentation (e.g., rain, fog simulation), use of GANs
Low-quality or inconsistent annotations | Annotation refinement, human-in-the-loop labeling, weak supervision
Limited coverage of rare scenarios (e.g., accidents, occlusion) | Scenario simulation tools, collection of edge-case datasets
Domain-specific bias (e.g., specific cities or camera angles) | Domain adaptation, transfer learning, cross-dataset evaluation
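As one concrete example of the augmentation-based mitigations listed in Table 8, the sketch below applies brightness, fog, and rain perturbations while keeping bounding boxes consistent, using the Albumentations library. The library choice, transform probabilities, and placeholder image and boxes are assumptions rather than a recipe taken from the surveyed works.

```python
# Weather/lighting augmentation with bounding-box support (Albumentations).
import numpy as np
import albumentations as A

transform = A.Compose(
    [
        A.RandomBrightnessContrast(p=0.5),   # lighting variation
        A.RandomFog(p=0.3),                  # synthetic fog
        A.RandomRain(p=0.3),                 # synthetic rain
        A.HorizontalFlip(p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = np.zeros((720, 1280, 3), dtype=np.uint8)      # placeholder frame
bboxes = [(100, 200, 300, 400)]                       # one vehicle box, xyxy
labels = ["car"]

out = transform(image=image, bboxes=bboxes, labels=labels)
augmented_image, augmented_boxes = out["image"], out["bboxes"]
```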