1. Introduction
The ability to automatically detect and identify vehicles from images and video has become increasingly important in modern cities. Among the technologies in this domain, Vehicle Make and Model Recognition (VMMR) has garnered significant attention, with key applications in traffic monitoring, urban planning, environmental policy enforcement, and intelligent transportation systems. While contemporary systems are highly capable of recognizing vehicle makes, many still struggle to accurately identify specific models, differentiate between generations, or estimate a vehicle’s production period. This is a critical limitation, as achieving this level of fine-grained recognition is essential for advanced applications like emissions monitoring, law enforcement investigations, and regulatory compliance. For instance, vehicles from different production years often have vastly different performance and emission standards.
A key obstacle to fine-grained classification is the limited availability of richly annotated datasets. Although large-scale datasets like Stanford Cars and CompCars exist, they often lack the detail needed to distinguish between different versions of the same model. Moreover, many of these datasets focus on specific markets (most notably North America and China), which introduces a geographical bias. Consequently, when these models are applied to European roads, they tend to underperform, particularly with vehicles from makes like SEAT, Renault, or Peugeot, which are common in Europe but underrepresented in existing data.
Another important challenge arises from the mismatch between real-world imaging conditions and those represented in most datasets. In practical applications, vehicles are typically recorded from the front or rear by surveillance or traffic management cameras, as these positions are ideal for reading license plates or monitoring traffic flow. However, most available datasets favor side views, which, while more visually informative, are less representative of typical deployment environments. This discrepancy can lead to models that perform well under testing conditions but fail to generalize effectively in deployment.
Some systems address these challenges by incorporating additional technologies like license plate recognition, LiDAR, or magnetic sensors. While these can enhance accuracy, they introduce significant trade-offs. License plate recognition raises privacy and legal concerns, particularly under strict data protection laws. Meanwhile, LiDAR and magnetic sensors require significant infrastructure investments and may be unsuitable for large-scale urban deployment due to their cost and complexity.
To address these gaps, we propose a scalable, privacy-respecting system for fine-grained vehicle recognition in Europe that operates without license plate data. This study makes three key contributions:
EuroVMMR Dataset: A novel dataset of 84,732 images across 625 classes targeting the European market, providing generation-level annotations essential for emission enforcement.
YOLO11-Based Pipeline: A streamlined two-stage pipeline using the standard YOLO11 detection and classification models, fine-tuned on our custom dataset to provide a solution suitable for real-time deployment.
Real-World Validation: Validation on real European traffic footage achieving 80% accuracy, confirming that domain-specific training data outweighs model complexity for effective deployment.
2. State of the Art
Vehicle Make and Model Recognition (VMMR) has advanced significantly with the advent of deep learning, enabling more detailed classification across diverse vehicle types and real-world conditions. This section reviews the state of the art in three main areas: (1) Advanced Architectures and Mechanisms, (2) Dataset Contributions, and (3) Real-Time Detection and Open-World Scenarios.
2.1. Advanced Architectures and Mechanisms for VMMR
Recent advances in deep learning have led to architectures designed to distinguish between visually similar vehicle models by focusing on fine-grained visual differences. Semiromizadeh et al. introduced a 3D attention module integrated into convolutional neural networks (CNNs) that enhances feature extraction by focusing on critical vehicle details, which they tested on the Stanford Cars dataset [1]. Yang et al. used attention mechanisms targeted at specific parts like wheels and headlights, improving the discrimination of closely related models [2]. Wang et al. combined CNNs with temporal convolutional networks (TCNs) to capture spatial and temporal information, enabling analysis of vehicle behavior in traffic and thereby expanding the scope of VMMR [3]. To address challenges such as image noise and intra-class variation, Liu proposed the Progressive Multi-task Anti-Noise Learning (PMAL) framework, improving robustness on datasets including Stanford Cars, CompCars, and BIT-Vehicle [4].
Earlier work by Fang et al. developed coarse-to-fine convolutional neural network architectures that progressively refined recognition performance by focusing on fine-grained features [5]. Sochor et al. introduced BoxCars, which leverages 3D bounding box representations to encode spatial and geometric vehicle information, enhancing fine-grained classification [6]. Demonstrating the importance of perspective, Llorca et al. combined rear emblem features with appearance-based descriptors, a relevant technique for models analyzing the front or rear views commonly captured by traffic cameras [7].
Part-based recognition approaches have also proven effective. Biglari applied latent Support Vector Machines (SVM) and Histogram of Oriented Gradients (HOG) features to vehicle parts like headlights and grilles to improve classification accuracy [8]. Bularz et al. developed CNN architectures focused on rear-lamp patterns, demonstrating robustness in challenging lighting and occlusion conditions [9].
Beyond traditional CNNs, transformer-based architectures have recently set new benchmarks in computer vision and fine-grained recognition. Models like the Vision Transformer (ViT) [10] and its variants, such as the Swin Transformer [11], have demonstrated exceptional performance by capturing global relationships within an image through self-attention mechanisms. In the context of vehicle recognition, these models have been used to overcome the limitations of CNNs in modeling long-range dependencies, which is crucial for distinguishing between models with subtle but globally distributed feature differences [12]. For instance, recent studies show that transformer backbones can improve accuracy in challenging fine-grained tasks by focusing on a holistic representation of the vehicle rather than just localized features like headlights or grilles [13]. While computationally more intensive, their state-of-the-art performance makes them an important benchmark for future VMMR systems.
2.2. Dataset Contributions
The Stanford Cars dataset [14] provides over 16,000 images spanning 196 car classes, annotated by make, model, and production year across multiple viewpoints. The CompCars dataset [15] includes more than 136,000 images across 1716 car models, with detailed annotations of viewpoints, parts, and attributes to facilitate fine-grained recognition. The VehicleID dataset [16] comprises over 200,000 images collected from surveillance cameras, supporting vehicle re-identification research under realistic conditions. The Car-1000 dataset [17] features 1000 vehicle classes with frontal and rear images, emphasizing practical scenarios aligned with surveillance camera angles.
More recent datasets aim for broader geographic and contextual diversity. The Diverse Large-scale VMM (DVMM) dataset [18] includes 23 vehicle makes and 326 models focusing on European vehicles to enhance cross-region applicability. The Global License Plate Dataset [19] contains over five million images from 74 countries, richly annotated with license plate, make, color, and model information, enabling studies that integrate license plate recognition and vehicle classification. Semi-automatic annotation methods developed by Zwemer et al. [20] have improved dataset scalability and labeling consistency, facilitating the creation of large datasets necessary for effective model training.
2.3. Real-Time Detection and Open-World Scenarios
Real-time VMMR systems are crucial for traffic management and urban mobility applications. Manzoor et al. demonstrated a system capable of live vehicle classification with efficient processing suitable for deployment in traffic monitoring [21]. Maurya et al. further developed a real-time classification system that distinguishes between multiple vehicle classes and was tested under diverse traffic conditions [22].
A critical consideration for deploying VMMR in real-world traffic systems is the trade-off between model accuracy and computational efficiency. While large models achieve high performance, their resource requirements often make them unsuitable for deployment on edge devices like traffic cameras or roadside units. This has driven research into lightweight architectures designed for real-time processing. Models such as MobileNetV3 [23] and EfficientNetV2 [24] have been specifically engineered to minimize parameters and floating-point operations (FLOPs) while preserving competitive accuracy. In the vehicle recognition domain, lightweight variants of the YOLO family have become particularly popular, offering a robust balance between detection speed and precision, making them ideal for on-the-fly traffic analysis without requiring high-end GPU hardware [25]. Our work builds on this trend by leveraging a recent YOLO variant, aiming to provide a solution that is both accurate for fine-grained tasks and practical for large-scale, sustainable urban deployment.
Handling open-world scenarios, where new or unseen vehicle models appear during deployment, is an emerging challenge. Muñoz et al. proposed Veri-Car, an integrated system combining a YOLOv5-based license plate detector, a fine-tuned TrOCR model for plate recognition, and multi-similarity loss functions to adapt to novel classes dynamically [26]. Zhang et al. explored non-visual modalities for VMMR, employing magnetic fingerprint recognition combined with adversarial autoencoders and AdaBoostSVM, offering an alternative approach when visual data is insufficient or unavailable [27].
2.4. Motivation
Despite meaningful progress in Vehicle Make and Model Recognition (VMMR), many existing systems still face significant limitations when applied in real-world conditions, particularly in European settings. Our work is motivated by addressing several key challenges observed in prior research:
Geographical Bias in Datasets: Most widely used datasets, such as Stanford Cars, are largely focused on vehicles from the U.S. market. As a result, models trained on these datasets often struggle when applied to European roads, where the variety of makes, models, and design details can differ significantly.
Lack of Fine-Grained Classification: Many VMMR systems are limited to identifying a vehicle’s make but do not distinguish between specific models, their generations, or approximate production years. This level of detail is essential for several applications, including emissions estimation, regulatory compliance, vehicle taxation, fleet monitoring, market analysis, and fraud detection.
Limited Viewpoint Diversity: Existing datasets and methods often rely on side-view images, which offer more distinct visual cues. However, in practical applications, vehicles are usually captured from the front or rear by traffic or surveillance cameras. These angles present fewer distinguishing features, making it more difficult to recognize the exact model and production range.
Privacy Concerns Related to License Plate Use: Some approaches use license plate recognition or associated metadata to improve classification. While this can increase accuracy, it also raises significant privacy and legal concerns, especially in regions with strict data protection laws.
Dependence on Specialized Hardware: Certain methods rely on additional sensors, such as LiDAR or magnetic detectors, which can improve recognition but also increase both the cost and complexity of the system. This makes them less scalable for widespread public deployment.
3. Two-Stage VMMR Framework
Accurate vehicle make and model classification requires high-resolution images that capture the detailed features of individual cars. In real-world traffic scenes, vehicles appear at different distances and sizes, making it challenging to classify them directly from a full image. To address this, our system separates detection and classification into two distinct yet connected steps. First, the Vehicle Localization Module (VLM) locates and crops each vehicle to produce close-up images. Then, the Fine-Grained Classification Module (FGCM) uses these cropped images to identify the vehicle’s make, model, and generation. This two-stage pipeline provides the FGCM with clearer, more detailed inputs, enabling it to distinguish between similar models and different versions, even in multi-vehicle scenes.
Figure 1 presents the architecture of the framework.
3.1. Vehicle Localization Module (VLM)
The first stage of our pipeline is designed to locate vehicles within the input images. To accomplish this, we use YOLO (You Only Look Once), a widely adopted object detection model known for its balance of speed and accuracy. Its performance makes it suitable for applications that require real-time or near-real-time processing, such as traffic monitoring. YOLO processes the entire image in a single pass, simultaneously predicting both bounding boxes and object classes. Specifically, we use the YOLO11m (Medium) variant, configured with an input resolution of 640 × 640 pixels. The model’s improved architecture results in faster and more precise detections compared to previous versions. When compared to YOLOv8, YOLO11 achieves higher accuracy on standard benchmarks like the COCO dataset while using 22% fewer parameters (approximately 20.1 million). This reduction in model complexity helps improve computational efficiency, a critical factor for deployment on devices with limited resources.
The YOLO11 model we use is pretrained on the COCO dataset, which contains a variety of common street-level objects, including several vehicle classes such as cars, buses, trucks, and motorcycles. Since our focus is on traffic analysis, we limit the model’s detection to only these vehicle categories. This restriction simplifies the task for the detector by filtering out irrelevant objects. The output from the VLM consists of bounding boxes around all identified vehicles in each frame. These bounding boxes serve as inputs for the next stage of the pipeline, where the vehicles are classified by their make, model, and generation.
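The class restriction described above can be expressed as a simple post-filter on detector outputs. The sketch below is illustrative rather than our actual implementation: it assumes detections arrive as (x1, y1, x2, y2, conf, cls) tuples, and uses the standard COCO 80-class indexing, in which car, motorcycle, bus, and truck correspond to IDs 2, 3, 5, and 7.

```python
# COCO class IDs for the vehicle categories kept by the VLM
# (car=2, motorcycle=3, bus=5, truck=7 in the standard 80-class list).
VEHICLE_CLASSES = {2, 3, 5, 7}

def filter_vehicles(detections, keep=VEHICLE_CLASSES):
    """Keep only detections whose class ID is a vehicle category.

    Each detection is assumed to be a (x1, y1, x2, y2, conf, cls) tuple.
    """
    return [d for d in detections if d[5] in keep]

# Example: a car (cls 2), a person (cls 0), and a truck (cls 7);
# the person detection is dropped before the crops reach the FGCM.
dets = [(10, 20, 110, 80, 0.92, 2),
        (200, 30, 240, 150, 0.88, 0),
        (300, 40, 420, 160, 0.75, 7)]
vehicles = filter_vehicles(dets)
```

In practice this filtering is applied at detection time rather than as a separate pass, but the effect on the VLM output is the same.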
3.2. Fine-Grained Classification Module (FGCM)
Once vehicles are detected in a frame, the bounding boxes for each detected vehicle are passed to the Fine-Grained Classification Module. This module utilizes the YOLO11m-cls architecture, designed for single-label image classification tasks and accepting standard 224 × 224 pixel inputs. We selected this architecture specifically to optimize the trade-off between recognition granularity and operational speed. With approximately 20.1 million parameters, the model maintains a compact memory footprint and achieves high-throughput inference, significantly outperforming heavier state-of-the-art alternatives that often struggle to meet the low-latency requirements of live video processing. While the VLM identifies where objects are located, this classification model assigns a specific label to each vehicle, determining its make, model, and generation. To adapt the classifier, we use transfer learning through fine-tuning. We begin with a YOLO11 model pre-trained on ImageNet and fine-tune it on our specialized European vehicle dataset. This process allows the model to adjust its learned features to the specific visual details required to distinguish between similar vehicle generations.
3.3. Inference Protocol
To ensure the reproducibility of the framework, the inference pipeline follows a strict operational sequence defined in the system’s implementation. First, input video frames are processed by the Vehicle Localization Module (VLM) utilizing the YOLO11m architecture. To minimize false positives and resolve overlapping detections in multi-vehicle scenes, the detector applies a confidence threshold of 0.5 and a standard Non-Maximum Suppression (NMS) threshold of 0.7 (Intersection over Union). Valid objects are extracted directly from the bounding box coordinates without additional padding to strictly localize object features. These crops are passed to the Fine-Grained Classification Module (FGCM), which applies a confidence threshold of 0.5 to assign the make, model, and generation based on the highest Top-1 probability score. Finally, these classification labels are mapped back to the original frame coordinates using the stored bounding box IDs, enabling the visualization of the specific vehicle model and generation directly on the output video stream.
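The two thresholds in this protocol can be illustrated with a minimal stand-alone sketch. The production pipeline relies on YOLO11’s built-in post-processing; the pure-Python version below only demonstrates the logic of the 0.5 confidence cut and greedy NMS at IoU 0.7, with detections given as (x1, y1, x2, y2, conf) tuples.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(dets, conf_thr=0.5, iou_thr=0.7):
    """Greedy NMS: drop low-confidence boxes, then suppress overlaps.

    Thresholds mirror the protocol above (confidence 0.5, IoU 0.7).
    Detections are processed in descending confidence order; a box is
    kept only if it overlaps no already-kept box at >= iou_thr.
    """
    dets = sorted((d for d in dets if d[4] >= conf_thr),
                  key=lambda d: d[4], reverse=True)
    kept = []
    for d in dets:
        if all(iou(d[:4], k[:4]) < iou_thr for k in kept):
            kept.append(d)
    return kept
```

The surviving boxes are the crops handed to the FGCM, and their coordinates are what allows the predicted labels to be drawn back onto the original frame.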
4. EuroVMMR Dataset
Recognizing that high-quality and diverse data is essential for training accurate classification models, we built a custom dataset specifically designed for fine-grained vehicle recognition. This dataset includes a wide range of vehicle classes, where each class is defined by a unique combination of make, model, generation, and year, as illustrated in Figure 2. For instance, it includes detailed examples such as the Volvo 360c Concept (2018), the Škoda Enyaq iV First Generation (2020), and the Chevrolet Trax J600 (2023). By including different generations and variants of the same vehicle model, the dataset helps the model learn to distinguish the subtle differences in design and features that change over time.
To ensure the dataset is representative of real-world traffic conditions, we focused on collecting images of vehicles commonly seen on European roads. As a result, the dataset includes a strong presence of European makes such as Mercedes-Benz, Peugeot, Renault, Volkswagen, BMW, Opel, and Škoda; key Asian manufacturers like Kia, Nissan, and Hyundai; and several prominent American makes like Ford, Tesla, and Jeep. The dataset also includes commercial logistics vehicles (e.g., DAF, Scania) and modern electric platforms. This balanced mix ensures the model can handle the variety of vehicles typically found in European cities and highways.
The dataset contains 84,732 total images across 625 different vehicle classes, curated from publicly accessible web sources. The images were divided using a stratified 80/20 train/validation split and arranged in class-specific folders following YOLO classification standards. Vehicle images were collected from multiple angles—front, rear, side, and partial views—to reflect the real-world conditions in which ideal camera angles are not always possible. This variety helps improve the model’s predictive performance, even when vehicles are partially blocked or captured from unusual viewpoints. For privacy, any visible license plates in the images were blurred as a pre-processing step, and label quality was ensured via a multi-pass verification process.
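A stratified split of this kind shuffles and divides each class independently, so that all 625 classes are represented in both subsets. The helper below is a sketch of that procedure; the class-to-filename mapping, class name, and seed are illustrative.

```python
import random

def stratified_split(class_to_files, train_frac=0.8, seed=0):
    """Shuffle each class independently and split it train_frac/rest.

    `class_to_files` maps a class name (e.g. a hypothetical
    "skoda_octavia_mk3_2017") to its list of image filenames.
    Splitting per class keeps every class present in both the
    training and validation subsets.
    """
    rng = random.Random(seed)
    train, val = {}, {}
    for cls, files in class_to_files.items():
        files = sorted(files)      # deterministic base order
        rng.shuffle(files)         # seeded shuffle for reproducibility
        cut = int(len(files) * train_frac)
        train[cls], val[cls] = files[:cut], files[cut:]
    return train, val
```

The resulting per-class file lists correspond directly to the class-specific folder layout used by YOLO classification training.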
Figure 3 shows the distribution of images across vehicle classes within the dataset, while Figure 4 provides example images, highlighting the range of models and viewpoints represented.
Table 1 presents a comparison between our dataset and several established benchmarks in vehicle make and model recognition (VMMR). While datasets such as Stanford Cars, CompCars, and Car-1000 have advanced the field, they often face limitations when applied to real-world European scenarios. Most focus on U.S. or Chinese markets and often lack fine-grained annotations like vehicle generation or production details. Our dataset addresses these gaps by including detailed class labels (make, model, generation, and year) and a diverse range of vehicles commonly seen across Europe. It also offers broader viewpoint coverage (front, rear, and side), which better reflects the realistic deployment conditions where side views are not always available.
5. Model Training and Data Augmentation
The YOLO-based classification model was trained on our custom fine-grained vehicle dataset via transfer learning, using ImageNet-pretrained weights. Training proceeded for up to 200 epochs with an early stopping mechanism triggered after 10 consecutive epochs without an improvement in validation loss, helping to prevent overfitting. The training used a batch size of 6 and the auto optimizer (which selects an adaptive optimizer such as AdamW) with a weight decay of 0.0005. We employed a linear learning rate schedule with a 3-epoch warmup. The initial learning rate (lr0) was set to 0.0001, which decayed to a final value of 0.000001 (based on lrf: 0.01). The entire training process was performed on GPU hardware to efficiently handle the computational demands.
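The stated values are consistent with a linear decay with warmup, ending at lr0 × lrf = 0.0001 × 0.01 = 10⁻⁶. The function below is a rough sketch of such a schedule under that assumption, not a reproduction of the scheduler internals actually used.

```python
def learning_rate(epoch, epochs=200, lr0=1e-4, lrf=0.01, warmup=3):
    """Linear LR schedule with warmup, matching the stated values.

    Ramps linearly toward lr0 over the warmup epochs, then decays
    linearly so that the final epoch ends at lr0 * lrf = 1e-6.
    """
    if epoch < warmup:                               # linear warmup
        return lr0 * (epoch + 1) / warmup
    frac = (epoch - warmup) / (epochs - 1 - warmup)  # 0 -> 1 over decay
    return lr0 * ((1.0 - frac) * (1.0 - lrf) + lrf)
```

Early stopping (patience 10 on validation loss) operates on top of this schedule, so in practice training may terminate before the final decayed value is reached.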
To improve robustness and generalization, we employed a comprehensive data augmentation pipeline during training. For the initial 10 epochs, Mosaic augmentation was enabled to expose the model to diverse spatial arrangements by merging four images into a single composite. After this stage, Mosaic was disabled to allow the training to focus on individual images. Additional augmentations included RandAugment, which randomly applies a variety of transformations to further increase data diversity. We also incorporated horizontal flipping with a 50% probability, HSV adjustments (hue 0.015, saturation 0.7, value 0.4), scaling (up to ±50%), translation (up to ±10%), random erasing of up to 40% of an image region to encourage the model to learn from partial information, and a flipped Copy-Paste strategy to introduce mirrored object segments. These augmentations collectively helped the model handle real-world variability in lighting, occlusion, scale, and positioning. An overview of the augmentation methods is shown in Figure 5.
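Two of these augmentations, horizontal flipping (p = 0.5) and random erasing of up to 40% of the image area, can be illustrated on a toy image. The function below is a simplified stand-in for demonstration, not the training-time implementation.

```python
import random

def augment(img, rng, flip_p=0.5, erase_max=0.4):
    """Apply a horizontal flip and random erasing to a toy image.

    `img` is a list of pixel rows. The flip fires with probability
    flip_p, and random erasing blanks a rectangular patch covering up
    to erase_max of the image area, forcing the model to rely on
    partial information, as in the pipeline described above.
    """
    h, w = len(img), len(img[0])
    if rng.random() < flip_p:                    # horizontal flip
        img = [row[::-1] for row in img]
    area = rng.uniform(0.0, erase_max) * h * w   # patch area to erase
    eh = max(1, min(h, int(area ** 0.5)))        # patch height
    ew = max(1, min(w, int(area / eh)))          # patch width
    y = rng.randrange(h - eh + 1)
    x = rng.randrange(w - ew + 1)
    img = [row[:] for row in img]                # copy before mutating
    for r in range(y, y + eh):
        for c in range(x, x + ew):
            img[r][c] = 0                        # blanked pixel
    return img
```

Scaling, translation, HSV jitter, and Copy-Paste follow the same pattern of randomized, parameterized transforms applied independently per sample.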
Training Results
The fine-tuning of the YOLO11 classification model on our custom vehicle dataset demonstrated clear improvements throughout the training process. Over approximately 150 epochs, the model achieved consistent reductions in loss and significant gains in classification accuracy (Figure 6 and Figure 7). This steady progress reflects the model’s ability to adapt to the fine-grained distinctions required for vehicle recognition, even when faced with a large number of visually similar classes. These results show the model’s capacity to learn subtle differences across 625 vehicle classes.
The training loss began at a high initial value of over 6 and gradually decreased as the model optimized its weights, eventually reaching less than 0.7 by the end of training. The validation loss exhibited a similar downward trend, starting at around 5.5 and stabilizing slightly above 1.0. The close alignment between the training and validation loss curves indicates strong generalization to unseen data, while the parallel decline suggests stable learning and the effective use of data augmentation. This is particularly relevant for fine-grained classification, where subtle visual differences (such as headlight or grille design) distinguish vehicle classes.
The model’s classification accuracy improved substantially during training. Top-1 accuracy, which measures the model’s ability to correctly predict the exact class with its first choice, climbed from around 5% at the beginning to approximately 80% by the final epochs. This sharp increase highlights the model’s developing capability to make precise distinctions between different vehicle types as training progressed.
Top-5 accuracy reached nearly 99%, indicating that the correct vehicle class was almost always included within the model’s top five predictions. This metric is particularly important in real-world applications, where a high likelihood of a correct identification among the top few predictions can still provide valuable insights for tasks like emissions estimation, fleet analysis, and traffic monitoring.
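Top-1 and Top-5 accuracy are both instances of Top-k accuracy, the fraction of samples whose true class appears among the k highest-scoring predictions. The generic sketch below illustrates the metric; it is not the evaluation code used during training.

```python
def top_k_accuracy(probs, labels, k=5):
    """Fraction of samples whose true label is in the top-k scores.

    `probs` is a list of per-class score lists (one per sample) and
    `labels` the matching true class indices. Top-1 uses k=1 and
    Top-5 uses k=5, as reported above.
    """
    hits = 0
    for scores, label in zip(probs, labels):
        # Indices of the k highest-scoring classes for this sample.
        topk = sorted(range(len(scores)),
                      key=scores.__getitem__, reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```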
The reported metrics reflect the performance of the best model checkpoint obtained from a single optimized training run, selected based on the lowest validation loss.
The confusion matrix (Figure 7) demonstrates robust overall performance, with a prominent diagonal indicating that the model correctly identifies the vast majority of the 625 vehicle classes. A closer examination of these results reveals specific challenges inherent to fine-grained recognition. The primary source of error stems from inter-generational similarity, where misclassifications frequently occur between successive versions of the same model (e.g., distinguishing a Volkswagen Golf VI from a Golf VII). These errors are attributed to subtle ‘facelifts’ that involve only minor changes to lighting signatures or bumpers, which are difficult to resolve. A second cluster of errors arises from badge engineering, where the model struggles to differentiate vehicles sharing identical platforms and body structures, such as the Peugeot Partner and Citroën Berlingo. In these instances, the visual distinctions are often limited to badges or slight grille variations, making these specific pairs exceptionally difficult for the model to differentiate reliably.
6. Validation
To assess the real-world performance of our vehicle recognition system, we validated it using traffic camera footage captured from European roads. This evaluation tested the entire pipeline, combining both the detection and classification stages.
6.1. Detection Performance
In terms of detection, our model successfully identified and located all passing vehicles across the test footage. Using the YOLO11 architecture, it was able to efficiently track cars, buses, trucks, and motorcycles even in busy traffic scenarios, ensuring that no vehicles were missed during the analysis.
6.2. Classification Performance
For classification, the model achieved an accuracy of 80% in correctly identifying the make, model, and generation of vehicles. This level of accuracy is consistent with the demands of fine-grained recognition tasks, which require distinguishing between highly similar models and even different generations of the same vehicle. While factors like the distance of vehicles from the camera and subtle design differences introduced challenges, the model still maintained a strong performance. Vehicles captured from rear angles (common in traffic camera placements) were still classified correctly in most cases, indicating that the learned features generalize well to real-world traffic footage (Figure 8 and Figure 9).
6.3. Comparison with the Stanford Cars Model
To further evaluate our model’s effectiveness, we compared its performance with a YOLO11 classification model trained on the Stanford Cars dataset. When tested on the same European traffic footage (Table 2), the Stanford Cars model achieved significantly lower accuracy. This performance gap can be attributed to two main factors:
Limited Make Coverage: Many common European makes, such as Škoda, SEAT, Peugeot, and Opel, are not well represented in the Stanford Cars dataset, which is heavily U.S.-centric. This lack of coverage led to frequent misclassifications for European models.
Viewpoint Limitations: The Stanford Cars dataset primarily contains ideal side and front views, while our traffic footage often captures vehicles from the rear (especially when mounted on roadside poles or traffic lights). Our model, trained on more diverse perspectives, handled these rear-view images with much greater accuracy.
7. Conclusions
This work addresses a critical gap in existing Vehicle Make and Model Recognition (VMMR) systems: the lack of fine-grained, geographically relevant datasets that reflect the diversity of vehicles and real-world imaging conditions on European roads. To overcome these limitations, we developed a comprehensive dataset curated specifically for fine-grained recognition tasks in European contexts. The dataset includes 625 distinct vehicle classes, with annotations for make, model, generation, and approximate production year. It incorporates varied viewpoints (front, rear, and side) and captures a wide range of lighting conditions and real-world occlusions. Complementing this dataset, we proposed a specialized two-stage recognition system. By training this system on our new dataset, we validated our approach with an 80% accuracy rate in real-world testing, significantly outperforming U.S.-centric baselines tested on European roads.
Nevertheless, opportunities for refinement remain. While the dataset introduces substantial improvements in class diversity and viewpoint representation, certain vehicle classes (particularly trucks and buses) are still slightly underrepresented, which may limit the model’s ability to generalize across all traffic participants. Moreover, since the current dataset primarily consists of clear-weather footage, the system’s robustness in adverse scenarios requires further validation. In real-world deployments, environmental factors such as heavy rain, fog, or low-light nighttime conditions can significantly reduce image contrast and obscure fine-grained details, leading to lower confidence scores and increased misclassification rates. Incorporating additional samples from these classes, as well as from adverse weather conditions, nighttime scenes, and less common vehicle types, could further enhance the system’s robustness and applicability.