1. Introduction
Vehicle type recognition [1] is a crucial component of intelligent transportation systems [2], with broad applications in traffic control, traffic flow statistics [3], and traffic scheduling. In 2026, with the large-scale deployment of vehicle–road collaboration, smart cities, and autonomous driving technologies, high-precision vehicle type recognition has become an indispensable part of modern traffic systems. On the one hand, it supports refined smart city management, including real-time traffic monitoring, adaptive signal timing, congestion alleviation, and illegal vehicle detection, thereby greatly improving road utilization and management efficiency. On the other hand, it provides reliable environmental perception for autonomous driving, effectively reducing perception errors under complex conditions and enhancing driving safety. Therefore, developing efficient and accurate vehicle type recognition methods is of great practical value for both intelligent transportation and autonomous driving. Currently, vehicle type recognition algorithms can be divided into two categories: traditional methods and deep learning-based methods.
Traditional vehicle type recognition methods primarily rely on low-cost small sensors, LiDAR, manual feature descriptors, and machine learning algorithms. For instance, feature descriptors such as SIFT [4] and HOG [5] are widely utilized. Odat et al. [6] utilized geomagnetic sensors to capture the external contour features of vehicles for type classification. Putra et al. [7] employed Gaussian mixture models to extract the background images of vehicles and then classified pixel points to recognize vehicle types. Li et al. [8] matched data from multiple sensors to form a fused feature waveform for each vehicle and output the vehicle type recognition results.
With the increase in the number of vehicles, traditional methods struggle to achieve satisfactory speed and accuracy when faced with large vehicle datasets. The rapid development of artificial intelligence and hardware facilities such as GPUs has enabled parallel processing of computational data, enhancing the accuracy of vehicle type recognition. Consequently, deep learning-based vehicle type recognition methods have emerged. In real-world applications, vehicle recognition faces great challenges because traffic flow is highly random and dynamic. Traffic volume varies significantly at different times of day, on different days of the week, and in different months of the year. Traffic flow also differs considerably between working days, weekends, and holidays [9]. These temporal variations further increase the difficulty of stable and accurate vehicle recognition.
Deep learning-based vehicle type recognition algorithms are generally categorized into single-stage and two-stage methods. For two-stage algorithms, the RCNN series [10,11,12] employs selective search algorithms to generate candidate regions and then extracts features from each candidate region to produce recognition results. For instance, Ke et al. [13] proposed a data balancing strategy based on Faster-RCNN to enhance vehicle type recognition performance.
Although two-stage algorithms offer excellent accuracy, they struggle to meet real-time requirements on devices with limited computational resources. In contrast, single-stage algorithms such as the YOLO [14] series, SSD [15], and FCOS [16] directly extract features from the input image and output recognition results. This direct processing mechanism significantly improves their recognition speed. Meanwhile, they are gradually surpassing two-stage algorithms in terms of accuracy, making them more suitable for vehicle type recognition scenarios. For example, Song et al. [17] added Mamba modules to the backbone network, significantly reducing computational consumption, but the recognition capability also decreased. Kasper et al. [18] used YOLOv5 and thermal network cameras for heavy truck recognition, successfully identifying heavy trucks in winter rest areas and allowing real-time prediction of parking space occupancy rates. However, this model fails to effectively detect heavy trucks obscured by other vehicles, resulting in a low detection rate for such occluded targets. Sun et al. [19] utilized depthwise separable convolutions to reduce backbone network parameters and employed SENet to improve vehicle type recognition accuracy. Cao et al. [20] optimized the loss function and introduced weight regularization to develop a model for vehicle type recognition. This model also enabled the systematic implementation of traffic flow statistics.
Although the aforementioned methods hold significant importance for vehicle type recognition, they still exhibit certain limitations. First, different weather conditions easily lead to unclear edge contours of vehicle targets, which interferes with feature extraction in various directions. Second, although most methods improve the accuracy of vehicle type recognition, their processing of feature fusion is incomplete. They ignore the feature correlation between different hierarchical levels [19,20], and the mining of deep semantic information is insufficient. Finally, vehicles are dense in real traffic scenes, and vehicle overlap is prone to occur. The difficulty of effectively extracting multi-scale features in such scenes also restricts the recognition accuracy of targets [18]. Therefore, to address the issues of insufficient feature fusion and incomplete multi-scale information extraction in vehicle type recognition tasks, this paper proposes a vehicle type recognition network based on feature comparison and Mixture of Experts (MoE). First, we propose a Multi-scale Interleaving Fusion Module that utilizes multi-branch channels and interleaving transmission structures to capture multi-scale features. Second, we design a Feature Compare Enhancement Module to effectively fuse feature maps of different scales and distinguish feature intensity, enhancing feature expression capability. Finally, we construct a Mixture of Experts Feature Enhancement Module to capture specific details of vehicle features and obtain precise localization effects.
The main contributions of this paper are summarized as follows:
We propose a novel vehicle type recognition framework integrating feature comparison and the Mixture of Experts mechanism. The proposed framework overcomes the limitations of existing methods in feature fusion and target localization. It systematically integrates multi-scale feature extraction, dynamic feature enhancement, and adaptive expert selection mechanisms. This work provides a new technical pathway for high-precision, real-time vehicle type recognition in complex traffic scenarios.
We propose a Multi-scale Interleaving Fusion Module (MSIFM). By utilizing channel partitioning and interleaving transmission mechanisms, it effectively captures multi-scale features while reducing computational complexity. In this way, it solves the problem of insufficient multi-scale information fusion in existing methods.
We design a Feature Compare Enhancement Module (FCEM). This module introduces a discrimination mechanism for strongly and weakly correlated features. As a result, it can dynamically strengthen key features, which effectively alleviates the low information utilization in traditional fusion strategies such as simple concatenation or element-wise addition.
We construct a Mixture of Experts Feature Enhancement Module (MOEFEM). For the first time, the Mixture of Experts model is introduced into the vehicle type recognition task. Multiple expert units are leveraged to adaptively extract key detail features, significantly improving the localization capability for vehicle targets.
The rest of this paper is organized as follows.
Section 2 reviews related work on vehicle type recognition and Mixture of Experts (MoE) models.
Section 3 elaborates the overall framework and detailed designs of the proposed modules.
Section 4 presents experimental settings, ablation studies, comparative results, and visualization analysis.
Section 5 concludes the whole work and discusses future directions.
3. Methodology
The proposed model consists of an encoder, a decoder, and detection heads. The overall architecture is illustrated in Figure 1. The proposed framework is developed based on the YOLOv8 detection pipeline. We choose YOLOv5, YOLOv8, YOLOv10, and YOLOv11 as representative benchmarks to verify the advancement of our method.
In the encoder stage, we utilize the medium version of MobileNetV4 [35], which strikes a favorable balance between feature extraction capability and computational efficiency. From this backbone, we extract four multi-scale feature maps (160 × 160, 80 × 80, 40 × 40, and 20 × 20) as inputs to the decoder. These features provide a rich source of semantic information to guide the subsequent decoding process.
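The four resolutions listed above are what strides of 4, 8, 16, and 32 produce on a 640 × 640 input. The input size and the strides are assumptions here (the text does not state them); they are chosen because they reproduce the listed sizes exactly. A minimal check:

```python
# Assumed input resolution (not stated in the text).
INPUT_SIZE = 640
# Assumed downsampling strides of the four backbone stages fed to the decoder.
STRIDES = (4, 8, 16, 32)


def feature_map_sizes(input_size, strides):
    """Return the side length of each square feature map for a square input."""
    return [input_size // s for s in strides]


sizes = feature_map_sizes(INPUT_SIZE, STRIDES)
print(sizes)  # [160, 80, 40, 20]
```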
In the decoder stage, we first apply the Feature Compare Enhancement Module (FCEM) to fuse multi-scale features, mine deep semantic information, and restore image resolution. Then, the MOEFEM aggregates features from different hierarchical levels. It highlights vehicle targets through multi-level residual connections and feeds the features into the detection heads. Finally, the detection heads of YOLOv8 are utilized to output the recognition results.
3.1. Multi-Scale Interleaving Fusion Module
In real-world traffic environments, vehicle type recognition faces numerous challenges. Target vehicles appear at varying distances from cameras, and vehicles differ in inherent size. Consequently, vehicles of the same or different categories exhibit significantly different scales and appearance features. This diversity increases the difficulty of multi-scale feature extraction and fusion, adversely affecting recognition performance.
To address this issue, we design a Multi-scale Interleaving Fusion Module (MSIFM), as illustrated in Figure 2. Unlike traditional SPPF or FPN structures, this module achieves cross-scale information transmission through a channel interleaving mechanism. This design avoids feature redundancy and computational stacking.
First, a channel partitioning operation splits the initial input feature map equally into four branches along the channel dimension to reduce computational complexity. Meanwhile, a gradient structure is employed to propagate the feature flow, ensuring that each branch contains sufficient information.
Second, to effectively identify and select discriminative features that capture non-local interactions, adaptive max pooling is applied to the input features on three branches to generate multi-scale features. Depthwise separable convolution is then employed on each branch to further capture refined vehicle features.
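The saving from channel partitioning combined with depthwise separable convolution can be illustrated with simple parameter arithmetic. The sketch below is not the authors' implementation: the channel count `C = 256` and the 3 × 3 kernel are hypothetical, biases are ignored, and one depthwise separable convolution per branch is assumed.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dsconv_params(c_in, c_out, k):
    """Weight count of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    return k * k * c_in + c_in * c_out


C = 256          # hypothetical channel count of the input feature map
BRANCHES = 4     # the input is split equally into four branches
c_branch = C // BRANCHES

# One standard 3 x 3 convolution over all channels.
full = conv_params(C, C, 3)
# Four depthwise separable 3 x 3 convolutions, one per quarter-channel branch.
split_ds = BRANCHES * dsconv_params(c_branch, c_branch, 3)

print(full, split_ds)  # 589824 18688
```

Under these assumptions the branch-wise design uses roughly 3% of the weights of the single full convolution, which is the kind of reduction the channel-wise computation aims for.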
Finally, the four branches are concatenated along the channel dimension to enhance feature representation across channels. The result is added to the input feature to strengthen the output and improve feature transmission efficiency, with a 1 × 1 convolution used for channel adjustment. This enables the network to better learn complex feature representations and yields the final output of the module.
3.2. Feature Compare Enhancement Module
Typically, after obtaining the initial feature maps, deep semantic features are propagated to shallow layers and fused with them. This process enables the network to construct rich and expressive feature representations. However, most existing object detection algorithms merely concatenate or add features during fusion without further refinement. This shallow fusion strategy produces weakly correlated features, thereby hindering recognition accuracy.
To enhance feature fusion effectiveness, we propose a Feature Compare Enhancement Module (FCEM), as shown in Figure 3. Unlike traditional attention mechanisms such as CBAM and SE, our module adopts a feature comparison strategy that can clearly distinguish between strong and weak features, thereby achieving more precise feature enhancement.
First, features from the previous layer and encoder features are concatenated at the channel level, enriching the channel information. The concatenated features then undergo a channel shuffling operation. Different channels can thus interact and fuse more thoroughly, breaking down information isolation for more effective utilization of multi-channel features.
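Channel shuffling after concatenation can be sketched on channel labels alone. The version below follows the common ShuffleNet-style shuffle (view the channels as `groups` rows, transpose, flatten); the group count of 2 and the channel labels are assumptions for illustration, since the text does not specify the shuffle variant.

```python
def channel_shuffle(channels, groups):
    """ShuffleNet-style shuffle: view as (groups, n // groups), transpose, flatten."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = n // groups
    # Take the i-th channel of every group in turn, so the groups interleave.
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


# Hypothetical channel labels: four from the previous layer, four from the encoder.
previous = ["p0", "p1", "p2", "p3"]
encoder = ["e0", "e1", "e2", "e3"]

concatenated = previous + encoder
shuffled = channel_shuffle(concatenated, groups=2)
print(shuffled)  # ['p0', 'e0', 'p1', 'e1', 'p2', 'e2', 'p3', 'e3']
```

After the shuffle, an even split along the channel dimension gives each half a mix of previous-layer and encoder channels rather than one source each.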
Second, the feature map with rich multi-channel features is evenly divided along the channel dimension into two sub-tensors, which are fed into the Feature Compare module. The two branch features are combined to form global features. Subsequently, global average pooling and a sigmoid function are applied to generate the threshold weights.
Then, a 3 × 3 convolution is applied to both branches to obtain local features, followed by a sigmoid function to acquire branch-specific weights. These weights are then compared with the threshold weights. Specifically, features exceeding the threshold are classified as strongly correlated, while those below are weakly correlated, yielding two distinct feature sets. These feature sets demonstrate the model’s dynamic capability to identify and process key features, thereby improving the discrimination and effectiveness of vehicle feature expression.
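The comparison step can be sketched on a flat list of values. This is a simplified illustration, not the module itself: it assumes element-wise comparison, treats the convolution outputs as given logits, and zero-fills the positions that fall in the other set so both outputs keep the input length. The names `threshold_from_global` and `compare_features` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def threshold_from_global(global_features):
    """Average the global features, then squash with a sigmoid to get a threshold."""
    return sigmoid(sum(global_features) / len(global_features))


def compare_features(features, branch_logits, threshold):
    """Route each feature to the strong or weak set by comparing its weight with the threshold."""
    strong, weak = [], []
    for value, logit in zip(features, branch_logits):
        weight = sigmoid(logit)  # branch-specific weight in (0, 1)
        if weight > threshold:
            strong.append(value)
            weak.append(0.0)
        else:  # ties are treated as weak in this sketch
            strong.append(0.0)
            weak.append(value)
    return strong, weak


threshold = threshold_from_global([0.0, 0.0])  # sigmoid(0) = 0.5
strong, weak = compare_features([1.0, 2.0, 3.0, 4.0], [2.0, -1.0, 0.5, -3.0], threshold)
print(strong)  # [1.0, 0.0, 3.0, 0.0]
print(weak)    # [0.0, 2.0, 0.0, 4.0]
```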
Finally, for strongly correlated features, depthwise convolution is employed to further capture local features of each channel while reducing redundant computation. For weakly correlated features, a self-gating mechanism is utilized to dynamically adjust input features, helping the model select more relevant vehicle target information. Both types of features are then added to obtain enhanced feature representations.
3.3. Mixture of Experts Feature Enhancement Module
During network execution, shallow feature maps contain more fine-grained information, which makes them suitable for detecting smaller objects. In contrast, deeper layers encompass richer global context and higher-level semantic information, which are better suited for processing large targets. Therefore, effectively fusing multi-level information is crucial for accurately identifying vehicles of varying sizes.
Currently, mainstream YOLO series models typically adopt direct concatenation for multi-scale feature fusion to improve computational efficiency. This approach ignores the differences in importance among features at different levels and cannot adaptively adjust according to the scale, pose, and occlusion status of vehicle targets, resulting in limited feature utilization efficiency. In contrast, the proposed Mixture of Experts Feature Enhancement Module (MOEFEM) introduces a dynamic gating mechanism to learn adaptive weights and select the optimal expert units automatically. Different experts are specialized in capturing edge information, contour structure, and multi-scale details. This dynamic design effectively overcomes the limitations of fixed feature concatenation, enabling the network to focus on key vehicle regions and greatly improving feature representation and localization accuracy.
The structure of the Mixture of Experts Feature Enhancement Module (MOEFEM) is shown in Figure 4. We apply the MOEFEM to three feature maps of different scales. Each module receives features from two adjacent hierarchical levels, captures the most critical features, and selects optimal feature paths, enhancing the generalization capability for complex vehicle types.
First, the MOEFEM upsamples low-resolution feature maps and concatenates them with high-resolution feature maps. Through this operation, features from different hierarchical levels are fused more effectively, enhancing recognition capability for multi-scale vehicle targets.
Second, the fused features are fed into the Mixture of Experts (MoE) model. A 1 × 1 convolution is first employed to adjust the channel dimension of input features to 3, which is consistent with the number of expert units. Then, the Softmax activation function is applied along the channel dimension to generate three adaptive scalar weights, denoted as $w_1$, $w_2$, and $w_3$. These weights represent the relative importance of the three expert branches and are automatically learned in an end-to-end manner.
Subsequently, the input features are forwarded into three parallel expert units. Expert 1 and Expert 2 are constructed as vertical and horizontal attention units, respectively, which strengthen the capture of vehicle edge information along spatial dimensions. Expert 3 utilizes convolutions with diverse kernel sizes to extract multi-scale details, while residual connections and activation functions enhance feature propagation and alleviate gradient vanishing.
To fuse the expert outputs, the three scalar weights $w_1$, $w_2$, and $w_3$ are broadcast along the channel, height, and width dimensions to match the complete spatial and channel dimensions of the expert output features. The broadcast weights are then multiplied element-wise with the corresponding expert output features. Finally, the three weighted feature maps are summed element-wise to obtain the final enhanced feature. The computation is formulated as:
$$F_{out} = w_1 \odot E_1 + w_2 \odot E_2 + w_3 \odot E_3$$
where $w_1$, $w_2$, and $w_3$ denote the generated adjustment weights, $E_1$, $E_2$, and $E_3$ represent the output results of the three expert units, $\odot$ denotes element-wise multiplication with broadcasting, and $F_{out}$ is the final enhanced feature.
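The gating and fusion described above can be sketched with plain lists standing in for flattened feature maps. This is an illustrative sketch only: it assumes the three gate logits are already available as scalars (how the 1 × 1 convolution output is reduced to scalars is not specified in the text), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(gate_logits, expert_outputs):
    """Weighted element-wise sum of expert outputs.

    Each scalar weight is broadcast over every element of its expert's output.
    """
    weights = softmax(gate_logits)
    fused = [0.0] * len(expert_outputs[0])
    for weight, output in zip(weights, expert_outputs):
        for i, value in enumerate(output):
            fused[i] += weight * value
    return weights, fused


# Three hypothetical expert outputs, flattened to two elements each.
experts = [[3.0, 3.0], [6.0, 6.0], [9.0, 9.0]]
weights, fused = fuse_experts([0.0, 0.0, 0.0], experts)
print(weights)  # equal logits give equal weights of about 1/3
print(fused)    # each element is close to 6.0, the plain average
```

Because the weights come from a softmax, they are positive and sum to one, so the fused feature is a convex combination of the expert outputs.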
Through the above weight broadcasting and weighted fusion mechanism, the model can adaptively emphasize valuable expert features and suppress trivial information, significantly improving the representation ability and localization accuracy of vehicle targets.
3.4. Datasets and Experimental Settings
3.4.1. Datasets
The datasets used in this paper are the large-scale vehicle datasets UA-DETRAC [36] and BDD100K [37] for traffic surveillance scenarios. The UA-DETRAC dataset is complex in terms of its scene content. It consists of four vehicle categories, with a total of over 140,000 images. The training set contains 82,085 images, while the testing set contains 56,167 images.
The UA-DETRAC dataset is constructed by extracting individual frames from captured video data to form an image dataset. It is divided into four categories based on weather conditions: cloudy, sunny, rainy, and nighttime. The UA-DETRAC dataset is shown in Figure 5.
The BDD100K dataset comprises ten object categories: Person, Rider, Car, Bus, Truck, Bike, Motor, Train, Traffic light, and Traffic sign. Since our research focuses on vehicle type recognition, we manually eliminated the labels for Person, Rider, Train, Traffic light, and Traffic sign. Only vehicle target labels were retained. The original BDD100K dataset consists of 70,000 images for training, 10,000 images for validation, and 20,000 images for testing. However, as the test set lacks annotations, these 20,000 unlabelled images are excluded to enable more accurate evaluation of the model. The remaining annotated images are then repartitioned into new training, validation, and test sets at a ratio of 7:1:2. The BDD100K dataset is shown in Figure 6.
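With the 20,000 unlabelled test images excluded, 80,000 annotated images remain, so a 7:1:2 split gives 56,000, 8,000, and 16,000 images. The sketch below shows one way to perform such a split; the random shuffling and the fixed seed are assumptions, since the text does not describe how the repartitioning was done.

```python
import random


def split_7_1_2(items, seed=0):
    """Shuffle items reproducibly and split them 7:1:2 into train, val, and test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded shuffle (an assumption of this sketch)
    n = len(items)
    n_train = n * 7 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Image indices stand in for the 70,000 + 10,000 annotated images.
train, val, test = split_7_1_2(range(80_000))
print(len(train), len(val), len(test))  # 56000 8000 16000
```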
3.4.2. Training Configuration
The hardware environment of the experimental platform is shown in Table 1.
The initial learning rate is set to 0.001. SGD is selected as the optimizer. CIoU loss is employed as the loss function, which is widely used in the object detection field.
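For reference, the CIoU loss for a single pair of boxes can be sketched as follows. This follows the standard CIoU definition (IoU, normalized center distance, and an aspect-ratio term) and is not taken from the authors' code; the `(x1, y1, x2, y2)` box format and the `eps` stabilizer are assumptions of the sketch.

```python
import math


def ciou_loss(box_pred, box_gt, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter + eps
    iou = inter / union

    # Squared distance between the two box centers.
    rho2 = ((px1 + px2) - (gx1 + gx2)) ** 2 / 4 + ((py1 + py2) - (gy1 + gy2)) ** 2 / 4
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # close to 0 for identical boxes
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # above 1 for disjoint boxes
```

Unlike plain IoU loss, the center-distance term still provides a useful signal when the boxes do not overlap, which is why the second value exceeds 1.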
4. Experiments
4.1. Evaluation Metrics
In practical applications, vehicle type recognition must meet the dual requirements of recognition accuracy and processing speed. This ensures efficient and precise identification. Accordingly, GFLOPs, Params, FPS, and mAP are selected as comparison metrics to demonstrate the application value and performance advantages of the proposed algorithm.
GFLOPs denote the number of floating-point operations executed by the model in a single forward pass. The unit is billions of floating-point operations ($10^9$ FLOPs).
Params indicate the number of model parameters learned during training. They reflect the complexity of the model and the extent of resource consumption.
FPS (Frames Per Second) represents the number of images processed per unit time. A higher value indicates better processing speed.
mAP (mean Average Precision) is obtained by summing the AP values for each class and averaging them. The calculation is shown in Equation (16), where $N$ is the number of recognized classes and $\mathrm{AP}_i$ is the average precision of class $i$:
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i \tag{16}$$
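The two performance metrics that are computed from measurements can be sketched in a few lines. The class names and AP values below are made-up placeholders for illustration, not results from the paper, and the function names are hypothetical.

```python
def mean_average_precision(ap_per_class):
    """Average the per-class AP values over the number of classes."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


def frames_per_second(num_images, elapsed_seconds):
    """Images processed per second over a timed inference run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_images / elapsed_seconds


# Placeholder per-class AP values for a four-class dataset.
ap = {"class_a": 0.80, "class_b": 0.70, "class_c": 0.60, "class_d": 0.50}

print(round(mean_average_precision(ap), 4))
print(frames_per_second(500, 4.0))  # 125.0
```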
4.2. Ablation Experiment
To verify the impact of each module on the overall detection performance, ablation studies are conducted on the UA-DETRAC and BDD100K datasets. The MSIFM, FCEM, and MOEFEM are replaced with 3 × 3 convolutions, using the combination of MobileNetV4 and 3 × 3 convolutions as the baseline. Specifically, each 3 × 3 convolution is a sequential stack of 3 × 3 convolution, BatchNorm, and SiLU activation, with a stride of 1. The channel numbers of the four hierarchical stages are set to 512, 256, 128, and 64 from bottom to top. The MSIFM, FCEM, and MOEFEM are then incrementally added. The experimental results are presented in Table 2 and Table 3.
As shown in Table 2 and Table 3, all three modules contribute positive improvements to the model, with the FCEM achieving the most significant performance boost for the baseline. This verifies that efficient feature fusion plays a crucial role in enhancing vehicle type recognition performance. Notably, the MSIFM not only improves detection accuracy but also reduces computational complexity, which benefits from its effective capture of multi-scale features and the reduction in computation brought by channel-wise calculation. In addition, the design of the MOEFEM endows the model with a better localization capability for vehicle targets prior to final detection. Overall, the experimental results in Table 2 and Table 3 demonstrate that the proposed vehicle type recognition algorithm achieves a significant improvement in recognition accuracy compared with the baseline model, and that all the designed modules play an effective role in the proposed algorithm.
4.3. Comparative Experiments
To demonstrate the vehicle type recognition performance of the proposed algorithm in traffic scenes with different complexities, comparative experiments are conducted on the UA-DETRAC dataset and the BDD100K dataset. General object detection algorithms such as YOLOv5s, RT-DETR-L, YOLOv8s, YOLOv10s, and YOLOv11s are selected. Recent state-of-the-art improved models are also included for comparison.
As shown in Table 4, the designed vehicle type recognition algorithm outperforms other comparative algorithms in recognition performance. Compared with state-of-the-art models including RT-DETR-L, YOLOv3, YOLOv5s, YOLOv8s, YOLOv10s and YOLOv11s, the proposed algorithm achieves accuracy improvements of 4.3%, 0.9%, 3.9%, 3.1%, 3.5% and 2.2%, respectively. It also yields accuracy gains of 2.4%, 1.6%, 5.1%, 2.7% and 2.1% in comparison with the methods reported in [17,27,38,39,40], while demonstrating certain advantages in computational efficiency. As illustrated by the heatmap visualization results in Figure 7, compared with the state-of-the-art YOLOv11s detector, our proposed method demonstrates more precise and comprehensive attention to the complete edge contours of vehicle objects. In contrast, the YOLOv11s detector is susceptible to environmental disturbances, leading to inaccurate and unstable focus on vehicle regions. These improvements are attributed to the superior vehicle center-localization capability of the MOEFEM, as well as the ability of the FCEM to enhance strongly correlated features while suppressing irrelevant background interference.
Specifically, Dong et al. [27] introduced deep convolutional layers and various attention mechanisms, which significantly improved computational efficiency yet failed to achieve high detection accuracy. Zhang et al. [38] reduced the complexity of the detection model for better deployment on resource-constrained devices, but overlooked the impact of feature fusion on vehicle type recognition. Song et al. [17] integrated Mamba into the backbone network of YOLO, which greatly reduced the parameter count and enabled better information capture for vehicle targets. However, the feature extraction capability of the newly designed backbone network was compromised accordingly, leading to incomplete feature fusion of targets. In contrast, the proposed algorithm in this paper achieves a 5.1% improvement in recognition accuracy over the method in [17], making it more suitable for vehicle type recognition tasks.
Moreover, Zhang et al. [39] improved feature fusion to enhance the discrimination between background and targets, yet the weak feature extraction capability of the backbone network hindered the full utilization of feature information. On the contrary, this paper further strengthens feature representation by combining the strong feature extraction capability of the backbone network with the efficient feature fusion of the FCEM. Feng et al. [40] introduced hypergraphs into object detection and achieved promising recognition accuracy, but graph computation also drastically increased the computational burden. In comparison, the MSIFM maintains favorable computational efficiency while enhancing the multi-scale feature extraction capability, thus making it more applicable to vehicle type recognition tasks.
To further verify the generalization of the proposed method for vehicle type recognition across multiple scenarios, comparative analyses are conducted with various object detection algorithms on the BDD100K dataset. The BDD100K dataset features a larger data volume, more vehicle categories and more complex scenarios, thus imposing higher performance requirements on detection algorithms. The comparison results are presented in Table 5.
Table 5 demonstrates that the proposed method outperforms all comparative algorithms in both recognition accuracy and inference speed, achieving a favorable trade-off between real-time performance and detection efficacy. This result confirms the significant advantages of the proposed approach in terms of enhancing vehicle localization, preserving fine details and improving model robustness. Furthermore, the method maintains strong performance in complex scenarios involving diverse scenes and vehicle types, validating its excellent generalization capability.
4.4. Visualization Analysis
To demonstrate the vehicle type recognition performance of the proposed algorithm under various scenarios, comparative visualization results are presented under diverse scene conditions.
Figure 8 shows the recognition results of recent state-of-the-art algorithms and our method on UA-DETRAC.
Figure 9 shows the recognition results on the BDD100K dataset.
As can be seen from Figure 8, in terms of detection performance, the proposed algorithm achieves superior recognition of distant vehicles. Specifically, as indicated by the red-circled regions in the first and third rows of Figure 8e, despite the segmented targets and varying target sizes in these areas, the proposed algorithm still accurately identifies the vehicle targets, which verifies its effective capture of multi-scale features. Meanwhile, as shown in the second and third rows of Figure 8e, our algorithm yields higher confidence scores for vehicle targets facing the camera and incomplete vehicle images at the image edges. The above results demonstrate the strong capability of the proposed algorithm for vehicle type recognition in complex scenarios.
According to Figure 9, YOLOv11s suffers from missed detections in the case of overlapping vehicle targets, while other comparative algorithms also face the issue of low recognition accuracy. In contrast, the proposed method accomplishes vehicle type recognition tasks more effectively, which verifies its favorable generalization performance and great potential for practical application.
4.5. Heatmap Comparison Analysis
To more intuitively highlight the attention degree of the algorithm to key regions, detection heatmaps are generated using HiResCAM technology. Red areas represent high-attention regions, while yellow areas represent secondary-attention regions.
Figure 10 shows the recognition heatmaps of different algorithms on the UA-DETRAC dataset. It can be observed from the enlarged heatmaps that the proposed algorithm pays higher attention to the edge contours of vehicles and focuses more on the global features of vehicles compared with other comparison algorithms. Meanwhile, it is less disturbed by the background environment and can focus more on distant targets, demonstrating favorable recognition accuracy and outstanding target perception capability.
4.6. Limitations
Although the proposed vehicle type recognition network achieves competitive performance in terms of accuracy and efficiency on the UA-DETRAC and BDD100K datasets, several limitations still exist under extremely challenging conditions.
(1) First, the model experiences a significant performance drop under extreme weather conditions such as heavy snow and heavy rain. Severe environmental interference impairs feature extraction and reduces the discrimination between vehicle targets and the background. As a result, the FCEM cannot obtain sufficient effective features for contrast enhancement, and the MOEFEM also struggles to fully aggregate global and local features, leading to false detections or inaccurate bounding box regression, as illustrated in Figure 11.
(2) Second, although the model is lightweight, there is still room for optimization for edge deployment on low-cost embedded devices with constrained computing power and memory. The real-time inference speed and power consumption need to be further improved to better meet the requirements of real-world intelligent transportation edge devices.
In future work, we will address these limitations through the following approaches: (1) We will attempt to deploy the model on edge devices for real-world scenario testing. (2) We intend to prune the model to reduce the number of parameters. (3) We plan to collect and annotate data in more diverse environments, such as snowy and foggy conditions, to expand the dataset and improve the generalization ability of the model under various weather conditions.
Figure 11. Identification results under extreme weather conditions.