SBMEV: A Stacking-Based Meta-Ensemble Vehicle Classification Framework for Real-World Traffic Surveillance

Pateriya, Preeti; Trivedi, Ashutosh; Malhotra, Ruchika

doi:10.3390/app16010520

Open AccessArticle

SBMEV: A Stacking-Based Meta-Ensemble Vehicle Classification Framework for Real-World Traffic Surveillance

by

Preeti Pateriya

^1,*

,

Ashutosh Trivedi

¹

and

Ruchika Malhotra

²

¹

Department of Civil Engineering, Delhi Technological University, Delhi 110042, India

²

Department of Software Engineering, Delhi Technological University, Delhi 110042, India

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(1), 520; https://doi.org/10.3390/app16010520

Submission received: 6 December 2025 / Revised: 24 December 2025 / Accepted: 29 December 2025 / Published: 4 January 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

Developing vehicle classification remains a fundamental challenge for intelligent traffic management in the Indian urban environment, where traffic exhibits high heterogeneity, density and unpredictability. In the Indian subcontinent, vehicle movement is erratic, congestion is high, and vehicle types vary significantly. Conventional global benchmarks often fail to capture these complexities, highlighting the need for a region-specific dataset. To address this gap, the present study introduced the EAHVSD dataset, a novel real-world image collection comprising 10,864 vehicle images from four distinct classes, acquired from roadside surveillance cameras at multiple viewpoints and under varying conditions. This dataset is designed to support the development of an automatic traffic counter and classifier (ATCC) system. A comprehensive evaluation of eleven state-of-the-art deep learning models, namely VGG16, VGG19, MobileNetV2, Xception, AlexNet, ResNet50, ResNet152, DenseNet121, DenseNet201, InceptionV3, and NASNetMobile, was carried out. Among these, the highest accuracy result has been achieved by VGG-16, MobileNetV2, InceptionV3, DenseNet-121, and DenseNet-201. We developed a stacking-based meta-ensemble framework to leverage the complementary strengths of its components and overcome their individual limitations. In this approach, a meta-learner classifier integrates the predictions of the best-performing models, thereby improving robustness, scalability, and real-world adaptability. The proposed ensemble model achieved an overall classification accuracy of 96.04%, a Cohen’s Kappa of 0.93, and an AUC of 0.99, consistently outperforming the individual models and existing baselines. A comparative analysis with prior studies further validates the efficacy and reliability of the stacking-based meta-ensemble method. These findings position the proposed frameworks as a robust and scalable solution for efficient vehicle classification under practical surveillance constraints, with potential applications in intelligent transportation systems and traffic management.

Keywords:

vehicle classification; deep learning architectures; meta-ensemble techniques; hyperparameter optimisation; highway traffic surveillance; intelligent transportation systems

1. Introduction

Rapid urbanisation and exponential growth of vehicular traffic have led to increased demand for efficient and automated traffic monitoring systems. Intelligent Transportation Systems (ITS) aim to enhance traffic management, reduce congestion, and facilitate decision-making through advanced sensing, computing, and communication technologies. These systems play a crucial role in enhancing road safety, facilitating smart mobility, and promoting sustainable urban development. Among various ITS application domains, vision-based vehicle detection and vehicle classification are fundamental perception tasks for traffic analysis and monitoring. They form the basis for higher-level applications such as vehicle tracking, vehicle counting, license plate recognition, toll collection, autonomous driving, traffic signal control, traffic flow monitoring, and vehicle trajectory prediction [1,2,3,4,5,6,7,8,9,10,11].

Traditional traffic monitoring systems rely on intrusive sensing technologies, including piezoelectric strips, magnetic and inductive loops, acoustic sensors and RFID-based IoT systems. Although effective, these approaches are expensive, difficult to maintain and lack Scalability for large-scale deployment in smart cities. Consequently, computer vision-based approaches utilising roadside surveillance cameras and deep learning techniques have garnered significant attention due to their non-intrusive nature, scalability, and ability to operate in real time. Deep learning has demonstrated strong performance in image recognition tasks, particularly for vehicle classification by automatically learning hierarchical features from raw image data. However, real-world traffic scenes present significant challenges, including varying illumination conditions, cluttered backgrounds, dense traffic congestion, and high inter-class similarity, all of which degrade classification performance [12]. The effectiveness of vehicle classification models is strongly dependent on the availability of realistic and diverse datasets that capture different viewpoints, illumination conditions, and weather scenarios for various vehicle categories. Several benchmark datasets, namely JUIVCD [4], VMMR [13], CompCars [14], MIO-TCD, and BIT-Vehicle [15], have been widely used for vehicle detection and classification.

While these publicly available datasets are valuable resources, they often lack geographic diversity, balanced class distribution, annotations, a variety of classes, or capture a limited range of illumination and weather conditions. Additionally, they often fail to account for real-world complexities commonly observed in the Indian traffic scenarios, such as heavy congestion, frequent occlusions, and vehicles overlapping within a single frame. Recent studies have also explored sensor-based and learning-driven approaches. Road pavement vibration and supervised machine learning are used to detect and classify vehicles, demonstrating the feasibility of accelerometer-based sensing [16]. LiDAR sensors are used to assess vehicle classification stations to determine the accuracy of non-intrusive optical measurement [17]. Ensemble learning and ADASYN resampling with a cost-effective accelerator-based system to obtain an accuracy of 99.78%, again highlighting the possibility of using vibration signals to identify vehicles [5]. An autonomous vehicle control framework based on reinforcement learning has also been suggested for use in adverse lighting and weather conditions, which again emphasises the versatility of the deep learning method in ITS applications. These works demonstrate the potential of advanced ensemble and transformer-based models, conceptually aligning with our stack-based meta-ensemble framework for vehicle classification [18]. These limitations underscore the need for region-specific datasets and robust classification models that can effectively handle complex, real-world traffic environments [19,20,21]. To address these challenges, this study introduces a new proposed dataset, captured from real Indian highway surveillance cameras under diverse illumination and traffic conditions. It proposes a stacking-based meta-ensemble framework for robust vehicle classification. The effectiveness of the proposed approach in handling occlusions, illumination variation, congestion, and inter-class similarity is systematically validated through extensive experimental evaluation. The main contribution of this study is the development of a stacking-based multi-meta-ensemble framework for a vehicle classification system in intelligent transportation systems (ITS), with a focus on achieving robustness, scalability, and regional adaptability. The most significant contributions of this work are as follows:

A new dataset comprising 10,864 vehicle images is introduced from real-world traffic scenarios in urban highways in India. Detailed and structured annotations are provided to ensure reproducibility and enable rigorous evaluation of vehicle classification methods.
Eleven state-of-the-art deep learning architectures, namely VGG16, VGG19, MobileNetV2, Xception, AlexNet, ResNet50, ResNet152, DenseNet121, DenseNet201, InceptionV3, and NASNetMobile, were rigorously evaluated on the EAHVSD and JUIVCD datasets, providing an extensive performance analysis.
A stacking-based meta-ensemble learning strategy is employed for vehicle classification, integrating diverse base learners and combining their predictions through a multi-meta-learner. This approach enhances model diversity and achieves performance across multiple evaluation metrics, including accuracy, precision, recall, F1 Score, Cohen’s Kappa, and ROC AUC curve.
Hyperparameter tuning strategies were incorporated to optimise both the individual models and the proposed ensemble framework, resulting in improved classification efficiency and robustness.
The proposed ensemble framework was validated on both datasets, demonstrating superior performance compared to individual models and existing approaches in terms of accuracy and reliability.

2. Related Work

Recent studies have thoroughly investigated advanced vehicle classification approaches using deep learning, ensemble learning and hybrid methodologies to improve performance under real-world scenarios. This section provides an extensive analysis of significant studies categorised into three sections:

2.1. Vehicle Classification Using Deep Learning Techniques

Vehicle classification is a core task in intelligent transportation systems (ITS), enabling applications such as traffic monitoring, dynamic tolling, pavement load estimation, and regulatory enforcement [22]. In recent years, deep learning, particularly convolutional neural networks (CNNs), has become the dominant paradigm for vehicle classification, owing to its strong feature learning capabilities and robustness in complex traffic environments. State-of-the-art CNN architectures have been widely employed for both coarse-grained and fine-grained vehicle classification. For vehicle make and model recognition, DenseNet201 and ResNet50V2 augmented with channel and spatial attention mechanisms (CBAM) achieved accuracies of 93.51% on the Stanford Cars dataset and 99.03% on the CompCarsSV dataset, demonstrating their effectiveness in high-precision classification tasks [23]. Lightweight transfer learning approaches using DenseNet-based architectures have also been explored for broader vehicle categorisation (e.g., car, bus, truck), achieving up to 92% accuracy on benchmark datasets such as CIFAR-100 [24]. Several studies have focused explicitly on vehicle classification under real-world Indian traffic conditions, which are characterised by high density, heterogeneity, and unstructured layouts. Evaluations of CNN models, such as AlexNet, VGG16, and DenseNet, on Indian traffic datasets indicate that DenseNet variants consistently outperform others, achieving accuracies of up to 87.09%, owing to their efficient feature reuse and improved gradient flow [25]. Recently, attention-aided CNN models such as SimSANet, which integrates DenseNet201 with sequential multi-kernel attention, have further improved recognition performance on benchmark vehicle datasets while maintaining reduced training complexity [13]. Beyond CNNs, transformer-based architectures, including Vision Transformers (ViT) and hybrid transformer models, have been investigated for vehicle classification due to their ability to model long-range dependencies using global self-attention [26]. Additionally, self-supervised learning and advanced attention mechanisms have been explored to enhance representation learning under a limited labelled data scenario [27]. However, these approaches generally require large-scale pretraining and incur higher computational costs. In contrast, CNN-based models offer a favourable balance between accuracy and computational efficiency, making them more suitable for traffic surveillance systems. This motivates the adoption of CNN architectures as base learners in the proposed stacking-based ensemble framework, which is discussed in the subsequent subsection.

2.2. Ensemble Learning Approaches

Ensemble learning improves classification robustness and generalisability by combining predictions from multiple base models using strategies such as majority voting, weighted voting, and sum-rule aggregation [4]. These methods reduce the model variance and are particularly effective in complex real-world scenarios. In vehicle classification, ensemble frameworks consistently outperform individual models. A super learner integrating ResNet50, Xception, and DenseNet achieved accuracies of 97.94% on the MIO-TCD dataset and 97.62% on the BIT-Vehicle dataset [15]. Hybrid ensembles combining classical classifiers (KNN, SVM, MLP, RF) with handcrafted or learned features have also reported accuracies up to 98.65% on real-world datasets [28,29,30]. While probabilistic deep ensemble approaches achieved accuracies exceeding 99% on public benchmarks [31]. The effectiveness of ensemble learning extends beyond vehicle classification, demonstrating strong cross-domain generalisability. Ensemble-based AFDD systems using real operational HVAC sensor data have enabled reliable fault diagnosis and improved energy efficiency [32]. Similarly, vision-based infrastructure inspection tasks, such as UAV-based rebar counting, have benefited from ensemble-oriented deep learning and transformer-based detection models [33], further validating the robustness of ensemble methodologies.

2.3. Stacking-Based Meta Ensemble Approach

Among ensemble paradigms, bagging, boosting, and stacking are the most widely adopted techniques [34]. Unlike voting-based ensembles, stacking employs a hierarchical learning structure in which a meta-learner is trained to optimally integrate the outputs of multiple base models. This allows the meta-learner to capture complementary decision patterns and systematic errors among base learners. Stacking-based meta-ensemble models have shown strong performance in vehicle behaviour analysis and traffic-related prediction tasks. A Stacking-based meta-ensemble integrating RF, SVM, LSTM, and attention-based BiLSTM models achieved up to 98.25% accuracy in recognising lane change behaviour using naturalistic driving data [35]. Similarly, snapshot-stacked deep learning models have been applied to vehicle behaviour prediction using low-resolution sensor data, achieving high F1-scores across multiple behavioural tasks [36]. Other studies have employed stacking ensembles combining RF, SVM, and decision tree classifiers for steering behaviour detection using smartphone sensor data, reporting accuracies ranging from 92.9% to 100% [37]. These studies demonstrate that the stacking-based meta-ensemble technique offers superior predictive capabilities compared to conventional ensemble strategies. Motivated by these findings, the present study employs a stacking-based meta-ensemble framework for robust vehicle classification. Comparative analysis on previously published studies as shown in Table 1.

3. Methodology

This study introduces the EAHVSD dataset, developed through a comprehensive pipeline that involves vehicle video acquisition, frame extraction annotation, and data splitting into training, validation, and testing subsets. Although the dataset is region-specific, it captures realistic traffic density, vehicle heterogeneity and illumination variations commonly observed in urban Indian highway surveillance scenarios. These characteristics make the dataset well-suited for evaluating vehicle classification methods under practical operational conditions. While the current version does not claim nationwide or cross-climatic representativeness, it establishes a reliable foundation for future extensions involving additional cities, seasons and weather conditions. In parallel, a suite of state-of-the-art deep learning architectures, including VGGNet, MobileNet, DensNet, ResNet, AlexNet, and InceptionV3, were top-performing models. These models were integrated into a stacking-based meta-ensemble framework to leverage their complementary strengths. A meta-learner aggregates predictions from the base models, enhancing classification accuracy and generalisation. Experimental results demonstrate that the proposed ensemble model outperforms individual networks and existing benchmarks, as validated through extensive testing on the EAHVSD and JUIVCD datasets. The dataset depicts typical Indian urban highway traffic conditions, with images standardised to 1080 × 1920 pixels as shown in Figure 1.

3.1. Data Collection

The strength and applicability of the methodology are evaluated by using two complementary datasets in this research. The first EAHVSD dataset is a private data collection that captures real-world traffic scenarios in India. The second dataset, JUIVCD [4], encompasses a wide variety of vehicle types prevalent in India, making the given set of images more challenging and diverse. Both datasets exhibit significant class imbalance, posing a considerable challenge for real-world vehicle classification, and motivating an evaluation of model robustness under naturally imbalanced data distributions.

3.1.1. Proposed EAHVSD Dataset

High-resolution videos were collected from highways in Hyderabad, India, to capture real-world urban traffic scenarios, as shown in Figure 1. Data acquisition was conducted using fixed surveillance cameras under morning, daytime, and nighttime conditions from multiple viewing angles. 9339 raw images with a resolution of 1920 × 1080 were initially extracted from continuous video recording and manually annotated with bounding boxes around vehicles using the YOLO-Marker tool. Each extracted frame often contained multiple vehicles as well as redundant or partially occluded views across consecutive frames. Following annotation, vehicle-type cropping and quality filtering were performed, and only clearly visible and uniquely identifiable vehicle instances were retained. This process resulted in 5107 unique original vehicle images categorised into four classes: LCV, LMV, OSV, and Truck. To improve model generalisation, data augmentation was applied exclusively to the training subset, and all images were resized to 224 × 224 pixels. This resulted in a (7603 image) effective training dataset, while the validation dataset (1629 images) and testing dataset (1632 images) subsets consisted solely of original, non-augmented vehicle images. Consequently, the complete dataset used in this study comprises 10,864 images. The EAHVSD dataset is a private dataset designed to monitor traffic across four vehicle categories: LCV, LMV, OSV, and Truck. Table 2 summarises the road geometry, sensor specifications, and data acquisition details, while vehicle classes and labels are illustrated in Figure 2.

3.1.2. JUIVCD Dataset

The framework has been further evaluated using the JUIVCD public benchmark dataset, images across 12 indigenous vehicle classes:car_0, bus_1, bicycle_2, ambassador_3, van_4, motorized2wheeler_5, rickshaw_6, motorvan_7, truck_8, autorickshaw_9, toto_10, and minitruck_12 [4]. The dataset was captured for the Indian traffic scenario using mobile cameras, offering diverse and realistic perspectives. It reflects real-world road conditions, including dense traffic environments where frequent vehicle occlusion and overlap pose significant challenges for accurate classifications. Additionally, complexity arises from visual similarities between certain categories, such as the front views of rickshaws and motorised two-wheelers. The dataset has also been found to be class-imbalanced, with motorvan underrepresented. At the same time, categories such as ambassador, bus, van, car, rickshaw, autorickshaw, and motorised two-wheeler are comparatively well-balanced.

3.2. Data Annotation Process

The labelling object in the images using the YOLO-marker annotation tool was carried out to prepare the dataset for object classification. The YOLO-marker tool was used to load each image, and bounding boxes were manually drawn around target objects, with vehicle categories assigned to respective class labels. The tool stores the class ID and normalised bounding coordinates in a .txt file for each image object using the YOLO annotation format. These annotated files, together with the corresponding images, serve as essential training input for models to detect and classify objects in real-world surveillance scenarios. Multiple trained annotators performed the annotation process following a standard annotation guideline to ensure consistency. To quantitatively assess annotation reliability, inter-annotator agreement was evaluated using the Cohen’s kappa statistic on a randomly selected subset of the annotated images. The resulting Cohen’s kappa score indicated substantial agreement among annotators, confirming a high level of annotation consistency and reliability. Furthermore, different object detection frameworks can utilise the dataset annotations provided in both .txt and .xml formats. All annotations were generated using a standard labelling tool. Table 3 shows the detailed annotation information for sample frames. Algorithm one normalised coordinates

(x_{c e n t e r}, y_{c e n t e r}, w i d t h, h e i g h t)

relative to image dimensions are used to define bounding boxes, ensuring accurate object localisation across varied scenarios. Algorithm 1 describes the conversion of YOLO-formatted annotations into XML files for structured object localisation and dataset interoperability.

Algorithm 1 YOLO-to-XML Annotation Conversion

Require:: Image file I and corresponding YOLO annotation file (.txt)
Ensure:: XML annotation file
1:: Read image dimensions $(W, H)$ from I
2:: Read YOLO annotation file containing $(c l a s s_i d, x_{c e n t r e}^{n o r m}, y_{c e n t r e}^{n o r m}, w i d t h^{n o r m}, h e i g h t^{n o r m})$
3:: Convert normalised coordinates to absolute pixel values:
4:: $x_{c e n t r e} \leftarrow x_{c e n t r e}^{n o r m} \times W$
5:: $y_{c e n t r e} \leftarrow y_{c e n t r e}^{n o r m} \times H$
6:: $w \leftarrow w i d t h^{n o r m} \times W$
7:: $h \leftarrow h e i g h t^{n o r m} \times H$
8:: Compute bounding box coordinates:
9:: $x_{m i n} \leftarrow x_{c e n t r e} - \frac{w}{2}$
10:: $y_{m i n} \leftarrow y_{c e n t r e} - \frac{h}{2}$
11:: $x_{m a x} \leftarrow x_{c e n t r e} + \frac{w}{2}$
12:: $y_{m a x} \leftarrow y_{c e n t r e} + \frac{h}{2}$
13:: Map class identifiers:
14:: $0 \to$ LCV, $1 \to$ LMV, $2 \to$ OSV, $3 \to$ Truck
15:: Generate XML structure with image metadata and bounding box coordinates
16:: Save the generated XML file with the same filename as the input image

3.3. Data Augmentation

The Keras image data generator has been used to augment data, improving model generalisation and reducing overfitting. These methods artificially expanded the training dataset by modifying existing images, thereby exposing the model to a wider range of image representations. Table 4 employs the following augmentation methods.

3.4. Data Split

A hold-out validation strategy has been implemented, dividing the dataset into 70% for training, 15% for validation, and 15% for testing. This approach ensures a reliable evaluation of the model while maintaining the original class distribution across all subsets.

(a): Training dataset: The training set of the custom dataset comprises four folders: LCV_0, LMV_1, OSV_2, and Truck_3. The LCV_0 folder contains 1044 images, LMV_1 folder contains 4211 images, OSV_2 contains 1186 images, and Truck_3 contains 1162 images.
(b): Validation dataset: The validation set of the custom dataset comprises four folders: LCV_0, LMV_1, OSV_2, and Truck_3. The LCV_0 folder contains 224 images, the LMV_1 folder contains 902 images, the OSV_2 folder contains 254 images, and the Truck_3 folder contains 249 images. In total, the validation set consists of 1634 vehicle images.
(c): Testing dataset: The testing set of the custom dataset comprises four folders: LCV_0, LMV_1, OSV_2, and Truck_3. The LCV_0 folder contains 224 images, the LMV_1 folder contains 903 images, the OSV_2 folder contains 250 images, and the Truck_3 folder contains 255 images. In total, the testing set consists of 1632 vehicle images.

Figure 3 shows the vehicle class distribution and the corresponding image frequency within the training, validation and testing datasets. Although the dataset exhibits class imbalance, particularly for LCV and OSV categories, this distribution reflects realistic Indian highway traffic conditions and is consistently maintained across training, validation, and testing subsets.

3.5. Pre-Trained Deep Learning Models

These advancements have enhanced the performance of models in tasks such as image classification, paving the way for applications in areas like vehicle imaging. These advancements have enhanced the performance of models in functions such as image classification, paving the way for applications in areas like vehicle imaging. Thus, improving these architectures could expand the capabilities of artificial intelligence. It describes CNN architectures we considered for our study. This section describes the CNN architectures considered for our research. We have chosen VGGNet [38], MobileNet [39], Xception [40], AlexNet [41], ResNet [42], DenseNet [43], Inceptionv3 [44], and NasNetmobile [45] for our experiment to cover a wide spectrum of network designs. These architectures offer unique advantages and have been widely adopted in various computer vision tasks. By evaluating their performance, we aim to identify the most effective model for our specific application and contribute to the ongoing research in this field. Multiple modifications have been made to each network’s final layer to match the number of participants in the EAHVSD dataset and accommodate images of variable dimensions for preservation.

3.5.1. VGGNet

In 2014, the Visual Geometry Group (VGG) introduced deep learning convolutional networks pre-trained for image classification [38]. The VGG16 and VGG19 architectures process input images of size

224 \times 224 \times 3

through five convolutional blocks with ReLU activation and max-pooling, producing a

7 \times 7 \times 512

feature map that is flattened into a 25,088-dimensional vector. VGG19 differs from VGG16 by including an additional Conv2D layer in blocks 3 to 5. The VGG16 model employs a dense layer with 250 units and a dropout rate of 0.5, optimised using the Adam optimiser with a learning rate of 0.0001. In contrast, VGG19 uses the same dense layer with a dropout rate of 0.2 and a learning rate of 0.001. Both architectures conclude with four softmax-activated output units, corresponding to the four vehicle categories in the dataset.

3.5.2. MobileNetV2

MobileNetV2 is a highly efficient model designed for mobile and embedded vision applications [39]. The network processes input images of size

224 \times 224 \times 3

through depthwise separable convolutions with batch normalisation and the ReLU6 activation function, producing a

7 \times 7 \times 1280

feature map. This feature map is flattened into a 62,720-dimensional vector. A dense layer with 512 units and ReLU activation, followed by a dropout rate of 0.2, is employed to learn task-specific features. This is followed by a softmax output layer for the four vehicle classes. The Adam optimiser with a learning rate of 0.0001 is used to support convergence during training. The trained MobileNetV2 serves as a fixed feature extractor, while the added classification head customises the model for the specific vehicle classification task.

3.5.3. DenseNet

DenseNet features densely connected layers that enhance feature reuse and mitigate overfitting [43]. With access to all previous feature maps at each layer, DenseNet facilitates the transmission of gradients and provides implicit supervision across depths. For input images of size

224 \times 224 \times 3

, DenseNet121 produces a

7 \times 7 \times 1024

feature map, which is flattened into a 50,176-dimensional vector for vehicle classification. Similarly, DenseNet201 generates a

7 \times 7 \times 1920

feature map, which is flattened into a 94,080-dimensional vector. Both architectures employ a fully connected dense layer with 250 ReLU units, a dropout rate of 0.2, and the Adam optimiser with a learning rate of 0.001. The final softmax layer consists of four output units, corresponding to the four vehicle categories in the dataset.

3.5.4. InceptionV3

The Inception family of architectures was first introduced for large-scale image classification tasks and is recognised for its effectiveness and superior performance [44]. In this study, InceptionV3 processes input images of size

224 \times 224 \times 3

, producing a final

5 \times 5 \times 2048

feature map, which corresponds to 51,200 features when flattened. A fully connected layer with 250 ReLU units is employed, followed by a dropout layer with a rate of 0.2 to mitigate overfitting. The Adam optimiser with a learning rate of 0.0001 is used to optimise the model, ensuring stable and effective convergence. The final softmax layer comprises four output units, corresponding to the four vehicle categories in the dataset.

3.6. Meta-Learners in the Stacking-Based Meta-Ensemble Framework

The proposed stacking-based meta-ensemble architecture utilises convolutional neural networks (CNNs) as base learners, while traditional machine learning models serve as meta-learner classifiers. The CNN base model is responsible for learning a discriminative image-level representation and producing class probability vectors for each input vehicle image. These probability outputs are concatenated to form meta-features that capture complementary decision patterns across multiple CNN architectures, thereby facilitating robust ensemble learning. In this study, multiple traditional machine learning classifiers are investigated as meta-learners. These models do not operate directly on raw image data. Instead, they are trained exclusively on the CNN-generated probability vectors to learn an optimal fusion strategy for final vehicle classification.

(a): Logistic Regression (LR): LR is employed as a linear meta-learner that estimates class posterior probabilities by learning weighted combinations of CNN outputs. Its simplicity and interpretability provide a strong baseline for evaluating the effectiveness of the stacking framework.
(b): Random Forest (RF): RF serves as a non-linear meta-learner capable of modelling complex interactions among the CNN probability features. By aggregating decisions from multiple randomised trees, RF improve robustness and reduces overfitting at the meta level.
(c): Support Vector Machine (SVM): The SVM meta learner constructs optimal separating hyperplanes in the CNN-derived feature space. Kernel-based SVMs further enable non-linear decision boundaries, enhancing class discrimination in challenging scenarios.
(d): K-Nearest Neighbour (KNN): KNN is utilised as an instance-based meta learner that assigns class labels based on similarity between CNN probability vectors. This approach provides a non-parametric perspective on ensemble decision fusion.
(e): Multi-Layer Perceptron (MLP): The MLP meta-learner relationships among the CNN outputs through multiple fully connected layers, enabling more expressive ensemble modelling.
(f): XGBoost: XGBoost is employed as a powerful gradient-boosted meta-learner that iteratively refines ensemble predictions by minimising classification loss. Its regularisation mechanisms improve generalisation and stability of the stacking model.

It is important to emphasise that all meta-learners are trained on CNN-generated probability features are derived from the training set only, while model evaluation is performed on unseen validation and test sets without any data augmentation. This design ensures a fair, unbiased, and leakage-free assessment of the proposed stacking ensemble framework.

3.7. Stacking-Based Meta-Ensemble Learning Technique

Stacking-based meta-ensemble learning improves classification performance by integrating complementary strengths of multiple base learners through a secondary model, referred to as a meta-learner. The proposed framework is evaluated on the EAHVSD vehicle classification dataset, which is partitioned into training (70%), validation (15%), and testing (15%) subsets. Initially, eleven CNN architectures (VGG16, VGG19, MobileNetV2, Xception, ResNet50, ResNet152, DenseNet121, DenseNet201, and InceptionV3) are trained and evaluated independently. Based on validation performance, five CNNs models namely, VGG16, MobileNetV2, InceptionV3, DenseNet121, and DenseNet201, are selected as base learners for the ensemble.

Let

B_{k} (\cdot)

,

k \in {1, \dots, 5}

, denote the selected base CNN models. For a given input image x, each base learner produces a softmax probability vector

p_{k} (x) = B_{k} (x) \in R^{C},

(1)

where

C = 4

represents the number of vehicle classes. The outputs of the base learners are concatenated to form a stacked meta-feature vector

z (x) = [p_{1} (x), p_{2} (x), \dots, p_{5} (x)] \in R^{5 \times 4 = 20},

(2)

which serves as the input to the meta-learner.

To prevent data leakage, base CNNs are trained independently using the training subset, with validation data used only for performance monitoring and model selection. The ensemble stage is trained exclusively on prediction outputs from the training set, while the test subset remains completely unseen during training and is used only for final evaluation. Multiple machine-learning models, including Logistic Regression, Random Forest, Support Vector Machine, K-Nearest Neighbour, Multi-Layer Perceptron, and XGBoost, are explored at the meta-learning level, with Random Forest yielding the best validation performance. Figure 4 illustrates the ensemble workflow, and Algorithm 2 formally describes the complete stacking-based meta-ensemble training and inference procedure to ensure reproducibility.

Algorithm 2 Stacking-Based Meta-Ensemble Training

Require:: Training set $D_{t r a i n}$ , Validation set $D_{v a l}$
Require:: Base CNN models $B = {B_{1}, B_{2}, \dots, B_{M}}$
Require:: Number of classes C
Ensure:: Trained base models ${{\hat{B}}_{1}, {\hat{B}}_{2}, \dots, {\hat{B}}_{M}}$ and meta-learner $\hat{M}$
1:: Stage 1: Train Base Learners
2:: for $j = 1$ to M do
3:: Train base model $B_{j}$ on $D_{t r a i n}$ using categorical cross-entropy loss
4:: Freeze trained weights to obtain ${\hat{B}}_{j}$
5:: end for
6:: Stage 2: Meta-Feature Construction
7:: Initialize meta-feature matrix $Z \in R^{| D_{v a l} | \times (M \cdot C)}$
8:: for each sample $x_{i} \in D_{v a l}$ do
9:: for $j = 1$ to M do
10:: Obtain probability vector $p_{i}^{(j)} = {\hat{B}}_{j} (x_{i})$
11:: Concatenate $p_{i}^{(j)}$ to form meta-feature vector $z_{i}$
12:: end for
13:: Store $z_{i}$ in matrix $Z$
14:: end for
15:: Stage 3: Meta-Learner Training
16:: Train meta-learner $\hat{M}$ on $(Z, y_{v a l})$ using cross-entropy loss
17:: return ${{\hat{B}}_{1}, {\hat{B}}_{2}, \dots, {\hat{B}}_{M}}, \hat{M}$

4. Experimental Setup and Implementation

4.1. Hardware and Software Configuration

The experimental evaluation was primarily conducted in a cloud-based GPU environment using Google Colab Pro+, which provided access to an NVIDIA Tesla T4 GPU (22.5 GB VRAM) with dynamically allocated system memory (up to 53 GB) and cloud storage. Preliminary code verification and baseline testing were performed locally on an Intel Core i5 (7th Gen) system to ensure implementation consistency. All deep learning models were developed using Python 3.10 with TensorFlow 2. x and Keras as the primary frameworks, along with supporting libraries such as NumPy 2.0.2, Pandas 2.2.2, Scikit-learn 1.6.1, Matplotlib 3.10.0, Seaborn 0.13.2, OpenCV 4.12.0, and PyTorch 2.9.0. Reliable high-speed internet connectivity was maintained throughout cloud-based execution. Table 5 summarises the complete hardware and software configuration used for model training and evaluation.

4.2. Computational Efficiency Analysis

To assess the practical feasibility of the proposed ensemble framework, all experiments were conducted on a cloud-based GPU platform. Model training and test-time inference were performed using Google Colab Pro equipped with an NVIDIA Tesla T4 GPU. The proposed ensemble integrates five CNN architectures: VGG16, MobileNetV2, DenseNet121, DenseNet201, and InceptionV3, thereby combining lightweight and high-capacity models to balance computational efficiency and representational strength. The successful execution of the complete stacking-based meta ensemble on a GPU-enabled platform, together with the achieved classification accuracy, demonstrates the computational feasibility of the proposed framework under an offline evaluation setting. In practical ITS deployments, frame sampling is commonly adopted to reduce processing overhead, and effective vehicle classification does not require processing every video frame. Within this context, the observed inference throughput of the proposed ensemble (approximately 14 FPS) indicates acceptable runtime performance under the sampled frame condition. While the framework has not been validated in a live deployment, these results suggest its potential applicability to intelligent transportation systems using real-world surveillance data.

4.3. Training Configuration: Hyperparameters, Optimiser and Loss Function

Each pre-trained deep learning model was empirically fine-tuned through multiple training runs over a predefined hyperparameter search space to ensure stable learning and convergence. All models used input images of size

224 \times 224

, with batch sizes selected from

{16, 32, 64}

. The search space included dense layer sizes of

{250, 512}

neurons, activated using ReLU or SeLU functions, followed by a softmax output layer with four units for vehicle classification. Dropout rates were varied in

{0.1, 0.2, 0.3, 0.5}

to mitigate overfitting. Training was performed using the Adam optimiser with learning rates explored in

{0.01, 0.005, 0.001, 0.0005, 0.0001}

and categorical cross-entropy as the loss function. The number of training epochs ranged from 80 to 120, depending on the model complexity. All experiments were conducted using a fixed random seed of 42 to ensure reproducibility. Each hyperparameter configuration was evaluated on the validation subset using classification accuracy and macro-averaged F1-score as selection criteria. The final hyperparameter values reported in Table 6 correspond to the configuration yielding the best validation performance for each model.

4.4. Evaluation Metrics

A foundational understanding of machine learning methodologies was essential for practical model training and evaluation. Classifier performance was assessed using multiple evaluation metrics to provide a thorough and comparative analysis of model efficacy.

(a): Accuracy (A): Accuracy measures the proportion of correctly classified samples in the test set, providing an overall indication of model performance [46].
(b): Precision (P): Precision evaluates the ratio of correctly predicted positive samples to the total number of predicted positive samples. It reflects the model’s ability to minimise false positives [46].
(c): Recall (R): Recall represents the proportion of actual positive samples that are correctly identified. It indicates the model’s sensitivity and is particularly important in scenarios with class imbalance [46].
(d): F1-score (F1): The F1-score is the harmonic mean of Precision and Recall. It provides a balance metric for evaluating classifiers that takes both false positives and false negatives into account. [47].
(e): ROC–AUC Curve: The Receiver Operating Characteristic (ROC) curve illustrates the relationship between the True Positive Rate (sensitivity) and the False Positive Rate (1-specificity) across varying thresholds. The area under the curve (AUC) provides a scalar measure of discrimination capability, with higher values indicating superior performance [48].
(f): Cohen’s Kappa: Cohen’s kappa quantifies inter-annotator agreement for categorical classification while accounting for the possibility of agreement occurring by chance [49].

5. Evaluation and Result

This section presents the experimental results obtained from evaluating individual CNN architectures and the proposed stacking-based meta ensemble framework on the EAHVSD and JUIVCD datasets. Standard performance metrics, including accuracy, precision, recall, F1-score, and the area under the ROC-AUC curve, were used to evaluate model performance. Comparative analysis and interpretive discussion are provided to highlight the strengths and limitations of the respective models.

5.1. Results on the EAHVSD Dataset

The performance of the proposed deep learning model was evaluated using standard classification metrics, including accuracy [46], F1- score [47], and the confusion matrix [50]. Precision, recall, and F1-score were computed using weighted averaging, where each class contributes proportionally to its sample support. These metrics assessed the classification capabilities of each CNN architecture on the developed EAHVSD dataset. Among the evaluated models, InceptionV3 achieved the highest accuracy of 0.959, followed by DenseNet121 with 0.927 and DenseNet201 with 0.926. VGG16 (0.924) and MobileNetV2 (0.923) also performed reliably. NasNetMobile achieved an accuracy of 0.908, while Xception, VGG19, and AlexNet recorded accuracies of 0.885, 0.872, and 0.837, respectively. The lowest classification accuracies were observed for ResNet50 (0.636) and ResNet152 (0.652). Based on these results, the five top-performing base CNN models, namely VGG-16, MobileNetV2, InceptionV3, DenseNet-121, and DenseNet-201, were selected for ensemble learning. A comparison of test performance for the eleven pre-trained CNN architectures, in terms of accuracy and weighted-average precision, recall, and F1-score, is presented on the EAHVSD dataset in Table 7 and Figure 5. The comparative results indicate that architectures with enhanced feature depth and connectivity, such as InceptionV3 and DenseNet variants, achieve superior performance because they capture richer hierarchical and structural features from vehicle imagery. In contrast, the relatively lower scores of ResNet variants on the EAHVSD dataset likely stem from the dataset’s moderate size and class imbalance, where residual connections are less effective than dense feature reuse. These findings emphasise that model depth, skip connectivity, and feature reusability significantly influence performance for real-world traffic surveillance. The proposed ensemble approach further integrates the complementary strengths of these high-performing models to enhance robustness and generalisation.

5.2. Ensemble Model Results on the EAHVSD Dataset

The proposed stack-based ensemble approach, which combines five top-performing CNN architectures, namely, VGG16, MobileNetV2, InceptionV3, DenseNet121, and DenseNet201, has been evaluated on the EAHVSD test dataset. The ensemble model achieved a maximum precision of 0.99 for the class LMV_1 and a minimum of 0.85 for Truck_3. The highest recall (0.99) was recorded for LMV_1, while the lowest recall value of 0.87 was observed for the class Truck_3. An F1-score of 0.99 was observed for the LMV_1 class, representing the highest among all classes. The proposed ensemble model attained an overall classification accuracy of 0.9604 and demonstrated excellent discriminative ability with an AUC-ROC of 0.99. For individual models, VGG16 achieved a peak precision of 0.99 for class LMV_1, with class Truck_3 yielding the minimum precision of 0.79 The highest recall of 0.99 was observed for class LMV_1, whereas classes Truck_3 exhibited the lowest value of 0.80. MobileNetV2 reached the highest precision for class LMV_1 (1.00), and class OSV_2 as the least (0.73). The highest recall of 0.98 was attained by class LMV_1, whereas class LCV_0 had the lowest at 0.78. InceptionV3 attained the highest precision of 1.00 for class LMV_1, whereas class Truck_3 exhibited the lowest precision at 0.81. The recall performance peaked at 0.99 for class LMV_1, whereas class Truck_3 had the lowest value of 0.85. DenseNet121 achieved a top precision of 1.00 for class LMV_1, whereas class Truck_3 had the lowest at 0.73. The highest recall, 0.99, was attained by class LMV_1, while class LCV_0 had the lowest at 0.82. The model DenseNet201 demonstrated the highest precision, 1.00, for class LMV_1 and the lowest, 0.73, for class OSV_2. The highest recall score of 0.99 was attained by class LMV_1, whereas class Truck_3 had the lowest at 0.82. Table 8 summarises the test performance of the ensemble and individual models. Figure 6 presents the corresponding confusion matrices for individual and ensemble models on the EAHVSD dataset. Class-wise performance differences primarily arise from visual and structural similarity between certain vehicle categories and unequal sample distributions. The proposed framework partially mitigates these effects using axle-based structural cues.

5.3. Results on the Publicly Benchmarked JUIVCD Dataset Individual and Ensemble Model

The dataset is publicly available through the GitHub repository JUVCsi. To address class imbalance and improve model robustness, data augmentation techniques were applied to ensure a balanced distribution across all categories. This dataset was then used to evaluate the performance of eleven CNN models, namely VGG16, VGG19, MobileNetV2, Xception, AlexNet, ResNet50, ResNet152, DenseNet121, DenseNet201, InceptionV3, and NasNetMobile. Each model was trained and tested on the dataset to assess its ability to classify indigenous vehicle types under realistic conditions. Among the evaluated models, Xception achieved the highest accuracy of 0.955, followed by InceptionV3 (0.941), DenseNet201 (0.923), and DenseNet121 (0.921). MobileNetV2 attained an accuracy of 0.912, while NasNetMobile achieved 0.909. VGG19 and VGG16 recorded accuracies of 0.867 and 0.843, respectively. AlexNet achieved an accuracy of 0.786, whereas ResNet50 and ResNet152 demonstrated comparatively lower performances with accuracies of 0.606 and 0.557, respectively. A comparison of the test performance of eleven pre-trained CNN architectures, in terms of accuracy and weighted-average precision, recall, and F1-score, is presented on the JUIVCD dataset in Table 7 and Figure 7.

Table 9 presents the performance given by ensemble and individual models on the JUIVCD1 dataset. The stack-based ensemble technique selects five pre-trained DL models, namely, VGG16, MobileNetV2, Inceptionv3, DenseNet121, and DenseNet201. The ensemble approach yielded the highest accuracy, achieving a value of 0.9528. Figure 8 shows the confusion matrix of the individual and ensemble models illustrating the classification performance across all vehicle classes, on the JUIVCD dataset.

5.4. Comparison Study on Model and Ensemble Performance on the EAHVSD and JUIVCD Datasets

The comparative performance of the proposed ensemble model, comprising VGG16, MobileNetV2, InceptionV3, DenseNet121, and DenseNet201, was evaluated on the EAHVSD and JUIVCD datasets. The proposed model demonstrates robust performance across the evaluated datasets, reflecting its effectiveness under the real-world surveillance conditions considered in this study. The results reported correspond to the proposed stacking-based meta-ensemble framework, with Random Forest employed as the final meta-learner. Table 10 presents the performance of the ensemble and individual models in terms of accuracy, precision, recall, F1-score, AUC, and Cohen’s Kappa. Cohen’s kappa was used as a complementary evaluation metric to accuracy to measure agreement beyond chance in the multi-class setting. Unlike accuracy, kappa is sensitive to class imbalance and uneven distribution of predictions. Consequently, a model biased toward the majority classes may achieve high accuracy while exhibiting lower kappa values. All kappa scores reported in this study were computed consistently using predicted class labels and ground truth annotations. Figure 9 illustrates the graphical comparison of these results, showing that the ensemble model achieves consistently improved performance compared to individual models across multiple evaluation metrics. The robustness of the framework was further validated on the JUIVCD dataset, where similar performance trends were observed. DenseNet and Inception-based models achieved strong individual results, while the proposed ensemble attained an accuracy of 0.964 with an AUC-ROC of 0.99, respectively. These findings indicate stable ensemble performance under the evaluated environmental and camera conditions.

5.5. Ablation Study

The ablation analysis confirms the synergistic effect of the proposed stacking architecture. As reported in Table 11, varying the number of base learners in the ensemble results in a gradual change in performance, indicating that each CNN contributes complementary information to the final decision. Specifically, the 3-model ensemble (InceptionV3, DenseNet121, DenseNet201) achieves 94.73% and 94.01% accuracy on the EAHVSD and JUIVCD datasets, respectively, which further improves to 94.81% and 92.70% with the 4-model configuration. The complete 5-model majority voting ensemble achieves 94.0% and 93.0%, demonstrating diminishing returns with the addition of more models. In contrast, the proposed stacking-based meta-ensemble consistently outperforms all majority voting configurations, achieving peak accuracies of 96.04% on EAHVSD and 95.28% on JUIVCD. This improvement highlights the effectiveness of the learned fusion mechanism in balancing the bias–variance trade-off of individual CNNs, leading to superior robustness and generalisation across diverse traffic scenarios.

5.6. ROC-AUC Analysis of the Proposed Ensemble Model

AUC–ROC curves for individual models and the proposed ensemble model evaluated on the EAHVSD and JUIVCD datasets are shown in Figure 10. For the multi-class classification task, ROC–AUC values were computed using a one-vs-rest (OvR) strategy with macro-averaging across all vehicle classes. The ROC curve illustrates the trade-off between accurate positive and false positive rates, serving as a diagnostic metric for classification performance. Each coloured curve corresponds to a base model (VGG16, InceptionV3, MobileNetV2, DenseNet-121, and DenseNet-201), while the brown curve represents the ensemble model. The AUC, defined as the area under the ROC curve, quantifies probabilistic class separability, with values closer to one indicating better classification performance. On the EAHVSD dataset, all base models achieved high macro-averaged AUC values (0.99), while the ensemble model attained the highest score of 0.99. On the JUIVCD dataset, the ensemble model achieved a macro-averaged AUC ranging from 0.98 to 1.00, reflecting strong class separability and reliable performance under real-world evaluation conditions.

5.7. Comparison with Existing Studies

Table 12 summarises classification accuracies reported in the literature across different datasets and models. It is essential to note that many prior studies evaluate their methods on distinct datasets with different class definitions, traffic conditions, and acquisition setups, which limits the validity of direct numerical comparisons. For instance, Xception, InceptionV3, and DenseNet121 achieved 95.00% accuracy on the JUIVCD dataset [4], while a CNN combined with AdaBoost and SVM attained 99.50% accuracy on the CompCars+ dataset [14]. Similarly, ensemble-based approaches on large-scale datasets such as MIO-TCD and BIT-Vehicle reported accuracies of 97.94% and 97.62% [19], whereas ensemble broad learning systems achieved 94.63% and 91.23% [15]. A soft weighted-average ensemble on the KITTI dataset obtained 94.75% accuracy [51]. Other studies, including CNN-based models evaluated on Kaggle and Indian datasets, reported accuracies ranging from 70.97% to 87.09% [25], while DenseNet121 with transfer learning on CIFAR datasets achieved 94.75% [24]. Among these studies, a direct and dataset-consistent comparison is feasible only on the JUIVCD dataset. On this benchmark, the proposed stacking-based meta-ensemble achieved an accuracy of 95.28%, compared to the 95.00% reported in prior work [4]. While this improvement is numerically modest, it indicates that the proposed stacking-based meta ensemble can match and marginally exceed a strong baseline under identical dataset conditions, class definitions, and evaluation protocols, without performance degradation. For the newly introduced EAHVSD dataset, the proposed method achieved an accuracy of 96.04%. A direct comparison with existing studies on EAHVSD is infeasible due to the absence of existing benchmarks. The achieved accuracy is comparable to the state-of-the-art performance range reported in the literature. Overall, the experimental results demonstrate that the proposed stacking-based meta-ensemble achieves performance consistent with the upper range of reported accuracies in the literature, while offering robustness across heterogeneous traffic conditions typical of real-world Indian surveillance environments.

6. Threats to Validity

The proposed study is subject to several threats to validity. First, the dataset exhibits class imbalance, with minority vehicle categories, such as light commercial vehicles (LCVs) and oversized vehicles (OSVs), being underrepresented, which may affect the classification performance for these classes. This effect is partially mitigated by the proposed stacking-based ensemble framework, which integrates complementary decision boundaries from multiple base models to improve robustness under imbalanced conditions. Future work will explore class-aware strategies, including a weighted loss function and data rebalancing techniques such as synthetic minority over-sampling techniques (SMOTE). Second, the dataset consists of single-view images captured from fixed surveillance cameras, which may limit generalisation across varying viewpoints, camera heights and placements. The future extension of the dataset will incorporate multi-camera and multi-angle views, as well as broader environmental coverage, to enhance generalisation across diverse surveillance environments. Finally, although manual annotation was performed with care, minor inaccuracies or inconsistencies may still be present. Such labelling noise can influence training quality and evaluation reliability. Future research will investigate semi-automated annotation methods and assess inter-annotator reliability to further improve annotation consistency and dataset quality.

7. Discussion

The findings indicate that the stack-based meta-ensemble is more effective than the single CNN models in both the proposed and JUIVCD datasets. By integrating VGG16, MobileNetV2, InceptionV3, DenseNet121, and DenseNet201, the ensemble leverages complementary spatial and contextual feature representations, resulting in higher classification accuracy, a strong F1-score (ranging from 0.91 to 0.94 on EAHVSD and 0.81 to 0.92 on JUIVCD) and a consistently high AUC (ranging from 0.98 to 1.00) across both datasets. Notably, the performance gains are more pronounced for visually similar and underrepresented vehicle classes, where individual CNN models exhibit greater variability in performance. The ablation experiment also confirms the importance of each of the base learners by demonstrating that eliminating each of them results in a significant decrease in accuracy, F1-score, and AUC. Compared with existing state-of-the-art approaches, the proposed framework demonstrates superior generalisation in heterogeneous traffic environments, highlighting its suitability for real-world surveillance and ATCC applications.

8. Conclusions

There are numerous datasets available for vehicle identification, localisation and classification; however, only a limited number accurately reflect real-world traffic conditions in the Indian subcontinent. Such environments are characterised by dense heterogeneous traffic and multiple overlapping vehicles within a single frame, leading to significant visual congestion. To address this gap, the present study introduces a newly developed image dataset, namely EAHVSD, designed explicitly for vehicle classification under Indian road and traffic conditions. Using the EAHVSD dataset, eleven pre-trained deep learning models were systematically evaluated, including VGG16, VGG19, MobileNetV2, Xception, ResNet-50, ResNet-152, InceptionV3, DenseNet-121, and DenseNet-201. In addition, a stacking-based meta-ensemble framework incorporating a meta-learner was developed to combine the strengths of individual models and enhance classification performance. The proposed stacking-based meta-ensemble approach consistently outperformed all models, achieving an accuracy of 96.04% with an AUC of 0.99 on the EAHVSD dataset. To further validate the robustness and reproducibility of the proposed meta ensemble model, it was evaluated on the public benchmark JUVICD dataset, achieving an accuracy of 95.28% with an AUC of 1.00. These results demonstrate the effectiveness and generalisation capability of the proposed ensemble strategy across different datasets. The findings highlight the strong potential of the proposed framework for deployments in traffic monitoring systems and intelligent transportation applications. Beyond vehicle classification, the proposed stacking-based meta ensemble architecture is generic and can be extended to other vision-based monitoring tasks. Ensemble learning over multiple deep detectors, such as Faster R-CNN and YOLO-based architectures, improves robustness under challenging conditions, including scale variation, occlusion, and visual clutter. Future work may include a broader range of intelligent transportation tasks, such as traffic surveillance, object detection, tracking, and real-time analysis. These improvements would enhance scalability, thereby broadening the framework’s applicability.

Author Contributions

The authors declare their contributions to this work as follows: P.P. wrote the manuscript. A.T. and R.M. revised it. P.P., A.T., and R.M. conceptualised the study, interpreted the results, and approved the final version. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data and material from Efkon EFKON Pvt. Ltd., based in Gurugram, Haryana. This study created the Efkon ATCC Highway Vehicle Surveillance Dataset using real-world images of vehicles in Indian traffic situations captured by fixed and surveillance cameras. This dataset includes annotations for multiple vehicle classes and is designed for training and evaluating classification models. For reasonable academic requests, the corresponding author can make it available. At https://github.com/SayantanMaiti/JUVCsi (accessed on 24 December 2025), the second dataset is available for comparison.

Acknowledgments

The authors would like to thank EFKON Pvt. Ltd., Gurugram, Haryana, for providing access to data and necessary infrastructure support that facilitated the completion of this research.

Conflicts of Interest

The authors declare that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

ITS	Intelligent Transportation System
Faster R-CNN	Region-Based Convolutional Neural Network
YOLO	You Only Look Once
VANETs	Vehicular Ad Hoc Networks
VRe-ID	Vehicle Re-Identification
MFCC	Mel-Frequency Cepstral Coefficients
PHOG	Pyramid Histogram of Oriented Gradients
Bi-LSTM	Bidirectional Long Short-Term Memory
DenseNet	Densely Connected Convolutional Networks
InceptionV3	Deep CNN with Inception Modules
NASNetMobile	Neural Architecture Search Network for Mobile Devices
RetinaNet	Object Detection Model with Focal Loss
RCNet	Road Condition Classification Network
JUVCSID	Jamia University Vehicle Classification Surveillance Indian Dataset
NGSIM	Next-Generation Simulation Dataset
KITTI	Karlsruhe Institute for Technology and Informatics Dataset
GAN	Generative Adversarial Network
WMVE	Weighted Majority Voting Ensemble
CBAM	Convolutional Block Attention Module
AI	Artificial Intelligence
ML	Machine Learning
DL	Deep Learning
SVM	Support Vector Machine
DT	Decision Tree
RF	Random Forest
LBP	Local Binary Pattern
IoT	Internet of Things
MLP	Multi-Layer Perceptron
ROI	Region of Interest
SWA	Soft-Weighted-Average
GRU	Gated Recurrent Unit
CNN	Convolutional Neural Network
SSD	Single Shot Multi Box Detector
HOG	Histogram of Oriented Gradients
VGGNet	Visual Geometry Group Networks
ResNet	Residual Neural Network
LSTM	Long Short-Term Memory
AUC	Area Under the Curve
NB	Naive Bayes

References

Ambardekar, A.; Nicolescu, M.; Bebis, G.; Nicolescu, M. Vehicle classification framework: A comparative study. EURASIP J. Image Video Process. 2014, 2014, 29. [Google Scholar] [CrossRef]
Boukerche, A.; Siddiqui, A.J.; Mammeri, A. Automated vehicle detection and classification: Models, methods, and techniques. ACM Comput. Surv. (CSUR) 2017, 50, 1–39. [Google Scholar] [CrossRef]
Butt, M.A.; Khattak, A.M.; Shafique, S.; Hayat, B.; Abid, S.; Kim, K.-I.; Ayub, M.W.; Sajid, A.; Adnan, A. Convolutional neural network-based vehicle classification in adverse illuminous conditions for intelligent transportation systems. Complexity 2021, 2021, 6644861. [Google Scholar] [CrossRef]
Maity, S.; Saha, D.; Singh, P.K.; Sarkar, R. JUIVCDv1: Development of a still-image-based dataset for Indian vehicle classification. Multimed. Tools Appl. 2024, 83, 71379–71406. [Google Scholar] [CrossRef]
Pandey, A.D.; Kumar, B.; Parida, M.; Mudgal, A.; Chouksey, A.K.; Mishra, R. Vehicle classification using accelerometer signals and machine-learning techniques. J. Intell. Transp. Syst. 2025, 1–29. [Google Scholar] [CrossRef]
Chen, X.; Liu, Y.; Li, S. BML-YOLO: Multi-scale vehicle target detection method based on feature fusion. Signal Image Video Process. 2025, 19, 745. [Google Scholar] [CrossRef]
Yu, S.; Wu, Y.; Li, W.; Song, Z.; Zeng, W. A model for fine-grained vehicle classification based on deep learning. Neurocomputing 2017, 257, 97–103. [Google Scholar] [CrossRef]
Battiato, S.; Farinella, G.M.; Furnari, A.; Puglisi, G.; Snijders, A.; Spiekstra, J. An integrated system for vehicle tracking and classification. Expert Syst. Appl. 2015, 42, 7263–7275. [Google Scholar] [CrossRef]
Venkatasivarambabu, P.; Babu, R.K.; Jagan, B.; Rai, H.M.; Agarwal, N.; Agarwal, S. Vehicle tracking and classification for intelligent transportation systems using YOLOv5 and modified deep SORT with HRNN. Signal Image Video Process. 2025, 19, 1–12. [Google Scholar] [CrossRef]
Usama, M.; Anwar, H.; Anwar, S. Vehicle and license plate recognition with novel dataset for toll collection. Pattern Anal. Appl. 2025, 28, 57. [Google Scholar] [CrossRef]
Seo, A.; Jeon, H.; Son, Y. Robust prediction method for pedestrian trajectories in occluded video scenarios. Soft Comput. 2025, 29, 4449–4459. [Google Scholar] [CrossRef]
Aljebreen, M.; Alabduallah, B.; Mahgoub, H.; Allafi, R.; Hamza, M.A.; Ibrahim, S.S.; Yaseen, I.; Alsaid, M.I. Integrating IoT and honey badger algorithm-based ensemble learning for accurate vehicle detection and classification. Ain Shams Eng. J. 2023, 14, 102547. [Google Scholar] [CrossRef]
Gayen, S.; Maity, S.; Singh, P.K.; Sarkar, R. SimSANet: A simple sequential attention-aided deep neural network for vehicle make and model recognition. Neural Comput. Appl. 2025, 37, 319–339. [Google Scholar] [CrossRef]
Chen, W.; Sun, Q.; Wang, J.; Dong, J.-J.; Xu, C. A novel model based on AdaBoost and deep CNN for vehicle classification. IEEE Access 2018, 6, 60445–60455. [Google Scholar] [CrossRef]
Hedeya, M.A.; Eid, A.H.; Abdel-Kader, R.F. A super-learner ensemble of deep networks for vehicle-type classification. IEEE Access 2020, 8, 98266–98280. [Google Scholar] [CrossRef]
Stocker, M.; Silvonen, P.; Rönkkö, M.; Kolehmainen, M. Detection and classification of vehicles by measurement of road-pavement vibration and by means of supervised machine learning. J. Intell. Transp. Syst. 2016, 20, 125–137. [Google Scholar] [CrossRef]
Lee, H.; Coifman, B. Using LIDAR to validate the performance of vehicle classification stations. J. Intell. Transp. Syst. 2015, 19, 355–369. [Google Scholar] [CrossRef]
Pateriya, P.; Trivedi, A.; Malhotra, R. Transforming traffic management: Vehicle classification in smart transportation systems. In Proceedings of the International Conference on Structural Engineering and Construction Management, Angamaly, India, 5–7 June 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1011–1023. [Google Scholar]
Guo, L.; Li, R.; Jiang, B. An ensemble broad learning scheme for semisupervised vehicle type classification. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5287–5297. [Google Scholar] [CrossRef]
Liu, W.; Zhang, M.; Luo, Z.; Cai, Y. An ensemble deep learning method for vehicle type classification on visual traffic surveillance sensors. IEEE Access 2017, 5, 24417–24425. [Google Scholar] [CrossRef]
Liu, W.; Luo, Z.; Li, S. Improving deep ensemble vehicle classification by using selected adversarial samples. Knowl.-Based Syst. 2018, 160, 167–175. [Google Scholar] [CrossRef]
Pemila, M.; Pongiannan, R.K.; Narayanamoorthi, R.; Sweelem, E.A.; Hendawi, E.; El-Sebah, M.I.A. classification of vehicles using machine learning algorithm on the extensive dataset. IEEE Access 2024, 12, 98338–98351. [Google Scholar] [CrossRef]
Ghosh, T.; Gayen, S.; Maity, S.; Valenkova, D.; Sarkar, R. A feature fusion-based custom deep learning model for vehicle make and model recognition. In Proceedings of the 13th Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, 11–14 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–4. [Google Scholar]
Saputra, W.S.J.; Puspaningrum, E.Y.; Syahputra, W.F.; Sari, A.P.; Via, Y.V.; Idhom, M. Car classification based on image using transfer learning convolutional neural network. In Proceedings of the 2022 IEEE Information Technology International Seminar (ITIS), Surabaya, Indonesia, 19–21 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 324–327. [Google Scholar]
Kapaliya, S.; Swain, D.; Kaur, H.; Satapathy, S. An efficient deep learning based vehicle classification system for Indian vehicles. In Proceedings of the 2022 IEEE 2nd International Symposium on Sustainable Energy, Signal Processing and Cyber Security (iSSSC), Gunupur, India, 15–17 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5. [Google Scholar]
Wang, Y.; Deng, Y.; Zheng, Y.; Chattopadhyay, P.; Wang, L. Vision transformers for image classification: A comparative survey. Technologies 2025, 13, 32. [Google Scholar] [CrossRef]
Wang, Y.; Yin, Y.; Li, Y.; Qu, T.; Guo, Z.; Peng, M.; Jia, S.; Wang, Q.; Zhang, W.; Li, F. Classification of plant leaf disease recognition based on self-supervised learning. Agronomy 2024, 14, 500. [Google Scholar] [CrossRef]
Shvai, N.; Hasnat, A.; Meicler, A.; Nakib, A. Accurate classification for automatic vehicle-type recognition based on ensemble classifiers. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1288–1297. [Google Scholar] [CrossRef]
Zhang, B. Reliable classification of vehicle types based on cascade classifier ensembles. IEEE Trans. Intell. Transp. Syst. 2012, 14, 322–332. [Google Scholar] [CrossRef]
Zhang, H.; Fu, R. An ensemble learning–online semi-supervised approach for vehicle behavior recognition. IEEE Trans. Intell. Transp. Syst. 2021, 23, 10610–10626. [Google Scholar] [CrossRef]
Jagannathan, P.; Rajkumar, S.; Frnda, J.; Divakarachari, P.B.; Subramani, P. Moving vehicle detection and classification using gaussian mixture model and ensemble deep learning technique. Wirel. Commun. Mob. Comput. 2021, 2021, 5590894. [Google Scholar] [CrossRef]
Wang, S. Real operational labeled data of air handling units from office, auditorium, and hospital buildings. Sci. Data 2025, 12, 1481. [Google Scholar] [CrossRef] [PubMed]
Wang, S. Effectiveness of traditional augmentation methods for rebar counting using UAV imagery with Faster R-CNN and YOLOv10-based transformer architectures. Sci. Rep. 2025, 15, 33702. [Google Scholar] [CrossRef]
Sagi, O.; Rokach, L. Ensemble learning: A survey. WIREs Data Min. Knowl. Discov. 2018, 8, e1249. [Google Scholar] [CrossRef]
Zhang, H.; Guo, Y.; Wang, C.; Fu, R. Stacking-based ensemble learning method for the recognition of the preceding vehicle lane-changing manoeuvre: A naturalistic driving study on the highway. IET Intell. Transp. Syst. 2022, 16, 489–503. [Google Scholar] [CrossRef]
Khoshkangini, R.; Mashhadi, P.; Tegnered, D.; Lundström, J.; Rögnvaldsson, T. Predicting vehicle behaviour using multi-task ensemble learning. Expert Syst. Appl. 2023, 212, 118716. [Google Scholar] [CrossRef]
Yang, J.; Zhang, H.; Zhou, Y.; Guo, Z.; Lin, F. Improved DAB-DETR model for irregular traffic obstacles detection in vision-based driving environment perception scenario. Appl. Intell. 2025, 55, 541. [Google Scholar] [CrossRef]
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
Mascarenhas, S.; Agarwal, M. A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for image classification. In Proceedings of the 2021 International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications (CENTCON), Bengaluru, India, 19–21 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 96–99. [Google Scholar]
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
Naskinova, I. Transfer learning with NASNet-Mobile for Pneumonia X-ray classification. Asian-Eur. J. Math. 2023, 16, 2250240. [Google Scholar] [CrossRef]
Buckland, M.; Gey, F. The relationship between recall and precision. J. Am. Soc. Inf. Sci. 1994, 45, 12–19. [Google Scholar] [CrossRef]
Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef]
Malhotra, R.; Khan, K. OpTunedSMOTE: A novel model for automated hyperparameter tuning of SMOTE in software defect prediction. Intell. Data Anal. 2025, 29, 787–807. [Google Scholar] [CrossRef]
Artstein, R.; Poesio, M. Inter-coder agreement for computational linguistics. Comput. Linguist. 2008, 34, 555–596. [Google Scholar] [CrossRef]
Townsend, J.T. Theoretical analysis of an alphabetic confusion matrix. Percept. Psychophys. 1971, 9, 40–50. [Google Scholar] [CrossRef]
Wang, H.; Yu, Y.; Cai, Y.; Chen, X.; Chen, L.; Li, Y. Soft-weighted-average ensemble vehicle detection method based on single-stage and two-stage deep learning models. IEEE Trans. Intell. Veh. 2020, 6, 100–109. [Google Scholar] [CrossRef]

Figure 1. Representative of the real-world scenario images of the proposed (EAHVSD) dataset.

Figure 2. Representative vehicle class images included in the EAHVSD dataset: (a) LCV, (b) LMV, (c) OSV, (d) Truck.

Figure 3. Class-wise distribution of vehicle images across training, validation and testing datasets.

Figure 4. Workflow diagram representing the complete process pipeline of the proposed model.

Figure 5. Comparison of test performance of eleven pre-trained deep learning models on the EAHVSD dataset.

Figure 6. Confusion matrices of individual deep learning models and the proposed ensemble model on the EAHVSD dataset.

Figure 7. Comparison of test performance of eleven pre-trained deep learning models on the JUIVCD dataset.

Figure 8. Confusion matrices of individual deep learning models and the proposed ensemble model on the JUIVCD dataset.

Figure 9. A comparative analysis of F1-score, precision, recall and AUC is presented for each model, evaluated separately on the EAHVSD and JUIVCD datasets.

Figure 10. (a) ROC curve for the proposed model evaluated on the proposed EAHVSD dataset. (b) ROC curve for the proposed model evaluated on the publicly available JUIVCD dataset.

Table 1. Comparative analysis of representative vehicle classification studies highlighting datasets, models, performance, and limitations.

Dataset	Images	Classes	Model/Method	Accuracy	Key Limitation
CompCars + real-world images ¹	45,230	5	CNN + AdaBoost + SVM	99.50%	Evaluated on limited public benchmarks
BIT-Vehicle ²	9850	6	ResNet50, Xception, DenseNet + Super Learner (DL ensemble)	97.62%	Confusion among visually similar vehicle classes
BIT-Vehicle ²	64,000	8	Ensemble Broad Learning System (BLS-based)	91.23%	Increased training time due to ensemble structure
VINCI + Indian traffic images ^3,4	73,638	5	VGG-14, InceptionV3 + CatBoost	99.03%	Class imbalance and overlap between vehicle categories
MIO-TCD ⁵	648,959	11	ResNet50, Xception, DenseNet + Super Learner	97.94%	Marginal performance gains from data augmentation
EAHVSD (Proposed)	10,864	4	Stacking-based CNN Meta-Ensemble	96.04%	Single-view data; limited samples in LCV and OSV classes

¹ [14], ² [15,19], ³ [28], ⁴ [25], ⁵ [19].

Table 2. Dataset description of the proposed data.

Parameters	Description
Location of site	Hyderabad, India
Type of camera	Surveillance camera
Camera installation height (m)	7.2
Frame per second (FPS)	25
Video resolution (pixel size)	1920 × 1080
Road dimension (m)	Length: 56 m, Width: 20 m
Data collection sessions	9-December-2023 (Afternoon, 25 min) 24-January-2024(Afternoon 52 min) 25-January-2024 (Night, 10 min) 2-February-2024 (Morning, 48 min) 6-February-2024 (Morning, 29 min)
Condition diversity	Morning, afternoon, and night recordings ensure variation in illumination and traffic conditions
Total number of images	10,864 images

Table 3. Annotations for images of varied vehicle classes in a single frame.

Class Label	X_center	Y_center	Width	Height	Xmin	Ymin	Xmax	Ymax
LMV_1	0.309766	0.465278	0.107031	0.113889	492	441	697	564
LMV_1	0.647656	0.757639	0.142188	0.254167	1106	681	1381	956
Truck_3	0.633594	0.445833	0.173438	0.230556	1050	357	1383	606

Table 4. Data characteristics and augmentation techniques used.

Description	Details
Model architecture	Convolutional Neural Network (CNN)
Input image size	$224 \times 224 \times 3$
Normalisation	Pixel values rescaled to the range [0, 1] (Scaling: $1 / 255$ )
Data augmentation	Applied using Keras ImageDataGenerator
Augmentation types	Rescale: $1 / 255$ ; Rotation range: 20; Zoom range: 0.2; Horizontal flip: True; Width shift range: 0.1; Height shift range: 0.1; Shear range: 0.1; Brightness range: [0.8, 1.2]

Table 5. Hardware and software configuration used for experimental evaluation.

Category	Specification
Execution Environment	Cloud-based GPU platform (Google Colab Pro)
System RAM	53 GB (allocated)
GPU	NVIDIA Tesla T4 (22.5 GB VRAM)
Processor	Intel Core i5 (7th Gen) for local testing
Storage	∼235.7 GB cloud disk allocation
Programming Frameworks	Python 3.10, TensorFlow 2.x, Keras
Supporting Libraries	NumPy, Pandas, Scikit-learn, Matplotlib
Development Tools	Jupyter Notebook, Google Colab Pro+
Training Strategy	Training with fixed random seeds 42
Compute Usage	Approx. 1.41 compute units per hour (4–5 h run time)
Inference Execution	GPU-enabled test-time inference

Table 6. Hyperparameter values and training configuration used in pre-trained deep learning models.

Model	Batch Size	Dense Units	Dropout	Learning Rate	Epochs
VGG16	16	250	0.2	0.0001	100
VGG19	16	250	0.2	0.0001	80
MobileNetV2	32	512	0.2	0.0001	70
Xception	64	250	0.2	0.0001	100
AlexNet	64	250	0.3	0.0001	90
ResNet50	32	250	0.2	0.0001	100
ResNet152	32	512	0.2	0.0001	100
DenseNet121	16	250	0.3	0.0001	100
DenseNet201	16	250	0.2	0.001	100
InceptionV3	16	250	0.2	0.0001	120
NASNetMobile	32	250	0.3	0.0001	100

Table 7. Comparison of eleven pre-trained deep learning models on the EAHVSD and JUIVCD datasets in terms of test accuracy (A) and weighted-average precision (P), recall (R), and F1-score (F1).

Model	EAHVSD				JUIVCD
Model	A	P	R	F1	A	P	R	F1
VGG16	0.924	0.924	0.924	0.924	0.843	0.843	0.840	0.840
VGG19	0.872	0.878	0.872	0.872	0.867	0.858	0.858	0.858
MobileNetV2	0.923	0.919	0.919	0.919	0.912	0.916	0.916	0.916
Xception	0.885	0.883	0.885	0.882	0.955	0.954	0.954	0.954
AlexNet	0.837	0.839	0.837	0.836	0.786	0.779	0.779	0.779
ResNet50	0.636	0.620	0.636	0.564	0.606	0.573	0.573	0.573
ResNet152	0.652	0.563	0.652	0.590	0.557	0.516	0.516	0.516
DenseNet121	0.927	0.931	0.927	0.926	0.921	0.922	0.922	0.922
DenseNet201	0.926	0.930	0.926	0.927	0.923	0.922	0.922	0.922
InceptionV3	0.959	0.959	0.959	0.959	0.941	0.938	0.938	0.938
NASNetMobile	0.908	0.905	0.898	0.897	0.909	0.910	0.910	0.910

Table 8. Precision (P), Recall (R),and F1-score (F1) values for each vehicle class in the classification report of the ensemble and individual models on the EAHVSD dataset.

Vehicle Class	VGG16			MobileNetV2			InceptionV3			DenseNet121			DenseNet201			Proposed Ensemble Model
Vehicle Class	P	R	F1	P	R	F1	P	R	F1	P	R	F1	P	R	F1	P	R	F1
LCV_0	0.84	0.86	0.85	0.90	0.78	0.83	0.89	0.88	0.88	0.86	0.82	0.84	0.86	0.82	0.84	0.90	0.92	0.91
LMV_1	0.99	0.99	0.99	1.00	0.98	0.99	1.00	0.99	0.99	1.00	0.99	0.99	1.00	0.99	0.99	0.99	0.99	0.99
OSV_2	0.92	0.85	0.88	0.73	0.94	0.82	0.89	0.89	0.89	0.92	0.80	0.86	0.92	0.80	0.86	0.93	0.90	0.91
Truck_3	0.79	0.80	0.79	0.82	0.73	0.77	0.81	0.85	0.83	0.73	0.86	0.79	0.73	0.86	0.79	0.85	0.87	0.86

Table 9. Precision (P), Recall (R), and F1-score (F1) values for each vehicle class in the classification report of the ensemble and individual models on the JUIVCD dataset.

Vehicle Class	InceptionV3			MobileNetV2			VGG16			DenseNet201			DenseNet121			Ensemble Model
Vehicle Class	P	R	F1	P	R	F1	P	R	F1	P	R	F1	P	R	F1	P	R	F1
0_Car	0.91	1.00	0.95	0.68	1.00	0.81	0.59	1.00	0.74	0.60	1.00	0.75	0.73	1.00	0.84	0.64	1.00	0.78
1_Bus	0.90	0.97	0.94	0.85	0.97	0.90	0.75	0.91	0.82	0.93	0.94	0.93	0.74	0.98	0.84	0.94	0.97	0.96
2_Bicycle	1.00	0.78	0.87	0.83	0.91	0.87	0.91	0.64	0.75	0.97	0.94	0.96	0.85	0.93	0.89	0.96	0.95	0.96
3_Ambassador	1.00	0.88	0.93	0.97	0.72	0.83	1.00	0.58	0.74	1.00	0.38	0.55	0.99	0.73	0.84	1.00	0.66	0.80
4_Van	0.94	0.97	0.95	0.82	0.91	0.86	0.98	0.74	0.84	0.77	0.96	0.86	0.96	0.87	0.91	0.96	0.93	0.95
5_Motorized2W	0.98	1.00	0.99	0.97	0.94	0.96	0.88	0.97	0.92	0.99	0.99	0.99	0.97	0.95	0.96	0.99	0.99	0.99
6_Rickshaw	0.89	0.96	0.93	0.96	0.99	0.97	0.97	0.96	0.96	0.97	0.96	0.96	0.99	0.97	0.98	0.98	1.00	0.99
7_Motorvan	1.00	0.64	0.78	0.78	0.64	0.70	0.57	0.36	0.44	1.00	0.82	0.90	1.00	0.82	0.90	1.00	0.73	0.84
8_Truck	0.62	0.95	0.75	0.74	0.71	0.72	0.68	0.39	0.49	0.59	0.93	0.72	0.41	0.98	0.58	0.73	0.83	0.78
9_Autorickshaw	0.90	0.96	0.93	0.97	0.82	0.89	0.98	0.80	0.88	0.97	0.94	0.95	0.99	0.66	0.79	0.99	0.91	0.95
10_Toto	0.73	0.48	0.58	0.87	0.57	0.68	0.82	0.39	0.53	0.47	0.87	0.61	0.61	0.74	0.67	0.95	0.78	0.86
11_Minitruck	0.97	0.55	0.70	0.87	0.54	0.67	0.52	0.80	0.63	0.98	0.50	0.66	0.87	0.43	0.58	0.85	0.74	0.79

Table 10. Comparing the performance of individual models, and the proposed stack-based metal ensemble model on the EAHVSD and JUIVCD datasets.

Model	EAHVSD Dataset						JUIVCD Dataset
Model	A	P	R	F1	AUC	Cohen Kappa	A	P	R	F1	AUC	Cohen Kappa
VGG16	0.92	0.92	0.92	0.92	0.99	0.89	0.86	0.86	0.81	0.81	0.99	0.46
MobileNetV2	0.90	0.91	0.91	0.91	0.99	0.85	0.87	0.88	0.86	0.87	0.98	0.74
InceptionV3	0.93	0.94	0.94	0.94	0.99	0.89	0.92	0.93	0.92	0.92	1.00	0.64
DenseNet121	0.92	0.92	0.92	0.92	0.99	0.86	0.89	0.89	0.85	0.85	0.99	0.58
DenseNet201	0.91	0.92	0.92	0.92	0.99	0.87	0.89	0.89	0.83	0.84	0.99	0.42
Our Proposed Ensemble	0.96	0.94	0.94	0.94	0.99	0.93	0.95	0.93	0.91	0.91	1.00	0.89

Table 11. Ablation study of individual models and ensemble strategies on the EAHVSD and JUIVCD datasets. The checkmark (✓) indicates that the corresponding model is included in the configuration, while “–” denotes that the model is not used.

Configuration	VGG16	MobileNetV2	InceptionV3	DenseNet121	DenseNet201	EAHVSD Accuracy (%)	JUIVCD Accuracy (%)
Individual (VGG16)	✓	–	–	–	–	92.0	86.0
Individual (MobileNetV2)	–	✓	–	–	–	90.0	87.0
Individual (InceptionV3)	–	–	✓	–	–	93.0	92.0
Individual (DenseNet121)	–	–	–	✓	–	92.0	89.0
Individual (DenseNet201)	–	–	–	–	✓	91.0	89.0
3-Ensemble Model	–	✓	✓	–	✓	95.20	94.01
4- Ensemble Model	✓	✓	✓	–	✓	94.73	92.70
Majority Voting Ensemble	✓	✓	✓	✓	✓	94.20	93.83
Proposed Ensemble	✓	✓	✓	✓	✓	96.04	95.28

Table 12. Performance consistent with previously reported results in the literature.

Dataset	Model	Images	Classes	Performance	Study
JUIVCD	Xception, InceptionV3, DenseNet121	6335	12	95.00%	[4]
CompCars+	CNN + AdaBoost + SVM	45,230	5	99.50%	[14]
KITTI	Soft weighted-average ensemble	7518	4	94.75%	[51]
MIO-TCD	ResNet50, Xception, DenseNet + Super Learner	648,959	11	97.94%	[15]
BIT-Vehicle	ResNet50, Xception, DenseNet + Super Learner	9850	6	97.62%	[15]
MIO-TCD	ResNet50 + Ensemble Broad Learning System	4000	4	94.63%	[19]
BIT-Vehicle	Ensemble Broad Learning System	64,000	8	91.23%	[19]
Proposed EAHVSD	Stacking-based ensemble (VGG-16, MobileNetV2, InceptionV3, DenseNet-121, and DenseNet-201)	10,864	4	96.04%	–
JUIVCD	Stacking-based ensemble (VGG-16, MobileNetV2, InceptionV3, DenseNet-121, and DenseNet-201)	6335	6	95.28%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Pateriya, P.; Trivedi, A.; Malhotra, R. SBMEV: A Stacking-Based Meta-Ensemble Vehicle Classification Framework for Real-World Traffic Surveillance. Appl. Sci. 2026, 16, 520. https://doi.org/10.3390/app16010520

AMA Style

Pateriya P, Trivedi A, Malhotra R. SBMEV: A Stacking-Based Meta-Ensemble Vehicle Classification Framework for Real-World Traffic Surveillance. Applied Sciences. 2026; 16(1):520. https://doi.org/10.3390/app16010520

Chicago/Turabian Style

Pateriya, Preeti, Ashutosh Trivedi, and Ruchika Malhotra. 2026. "SBMEV: A Stacking-Based Meta-Ensemble Vehicle Classification Framework for Real-World Traffic Surveillance" Applied Sciences 16, no. 1: 520. https://doi.org/10.3390/app16010520

APA Style

Pateriya, P., Trivedi, A., & Malhotra, R. (2026). SBMEV: A Stacking-Based Meta-Ensemble Vehicle Classification Framework for Real-World Traffic Surveillance. Applied Sciences, 16(1), 520. https://doi.org/10.3390/app16010520

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

SBMEV: A Stacking-Based Meta-Ensemble Vehicle Classification Framework for Real-World Traffic Surveillance

Abstract

1. Introduction

2. Related Work

2.1. Vehicle Classification Using Deep Learning Techniques

2.2. Ensemble Learning Approaches

2.3. Stacking-Based Meta Ensemble Approach

3. Methodology

3.1. Data Collection

3.1.1. Proposed EAHVSD Dataset

3.1.2. JUIVCD Dataset

3.2. Data Annotation Process

3.3. Data Augmentation

3.4. Data Split

3.5. Pre-Trained Deep Learning Models

3.5.1. VGGNet

3.5.2. MobileNetV2

3.5.3. DenseNet

3.5.4. InceptionV3

3.6. Meta-Learners in the Stacking-Based Meta-Ensemble Framework

3.7. Stacking-Based Meta-Ensemble Learning Technique

4. Experimental Setup and Implementation

4.1. Hardware and Software Configuration

4.2. Computational Efficiency Analysis

4.3. Training Configuration: Hyperparameters, Optimiser and Loss Function

4.4. Evaluation Metrics

5. Evaluation and Result

5.1. Results on the EAHVSD Dataset

5.2. Ensemble Model Results on the EAHVSD Dataset

5.3. Results on the Publicly Benchmarked JUIVCD Dataset Individual and Ensemble Model

5.4. Comparison Study on Model and Ensemble Performance on the EAHVSD and JUIVCD Datasets

5.5. Ablation Study

5.6. ROC-AUC Analysis of the Proposed Ensemble Model

5.7. Comparison with Existing Studies

6. Threats to Validity

7. Discussion

8. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI