A Robust Framework for Object Detection in a Traffic Surveillance System

Abstract: Object recognition is the technique of specifying the location of various objects in images or videos. Numerous algorithms exist for recognizing objects, such as R-CNN, Fast R-CNN, Faster R-CNN, HOG, R-FCN, SSD, SPP-net, SVM, CNN, and YOLO, based on machine learning and deep learning techniques. Although these models have been employed for various object detection applications, tiny object detection still suffers from low precision. It is essential to develop a lightweight and robust object detection model that can detect tiny objects with high precision. In this study, we propose an enhanced YOLOv2 (You Only Look Once version 2) algorithm for object detection, i.e., vehicle detection and recognition in surveillance videos. We modified the base network of YOLOv2 by replacing it with DenseNet, which reduces the number of parameters. Our improved model employs DenseNet-201 for feature extraction, which extracts the most representative features from the images. Moreover, our proposed model is more compact due to the dense architecture of the base network. We chose DenseNet-201 as the base network because of the direct connections among all layers, which help to extract valuable information from the very first layer and pass it to the final layer. A dataset gathered from Kaggle and KITTI was used to train the proposed model, and we cross-validated its performance using the MS COCO and Pascal VOC datasets. Extensive experiments demonstrate that our algorithm outperforms existing vehicle detection approaches, with an average precision of 97.51%.


Introduction
Multimedia has penetrated many areas of life in the present era of rapidly emerging technologies. In daily life, many people use electronic devices for various video applications, e.g., animated videos, activity recognition, and movies. The use of cameras in surveillance systems has grown quickly over the last century. A surveillance system is a systematic method of monitoring behavior, actions, or other changing information. This results in massive data accumulation in the form of images and video clips, and extracting relevant information from this multimedia content can be a tiring task. The three essential phases in every surveillance system are object detection, tracking, and recognition.
Object identification and localization is the procedure of determining the location of objects in images and videos captured by surveillance cameras. The objects can be classified and detected in real time using various computer vision techniques [1]. Furthermore, the objects are marked with rectangular bounding boxes. Object detection based on ML (Machine Learning) and DL (Deep Learning) has numerous applications in industry and scientific research, for example face detection [2], text detection [3], pedestrian detection [4], logo recognition [5], object identification in video [6], vehicle detection [7], disease detection [8], medical imaging [9], and many more. However, vehicle detection in autonomous driving systems still faces numerous challenges: because existing approaches localize objects with low precision, they struggle to detect faraway vehicles, which appear small, as well as vehicles in blurry, night-time, and rainy conditions.
Traditional techniques based on machine learning have mostly been used for object detection: the object area is first computed using a sliding window, then various features are extracted, and finally traditional classifiers such as SVM (Support Vector Machine) are used to classify the objects. Although the results are satisfactory, these methods cannot detect and classify objects accurately because they ignore the underlying deep features. For example, a researcher used the Haar feature descriptor to extract linear, center, diagonal, and edge features before classifying the objects with a Support Vector Machine. Moreover, employing a hand-crafted feature descriptor requires human effort. As a result, researchers are concentrating their efforts on deep learning algorithms such as CNN, R-CNN, and YOLO, which have greatly improved object detection performance.
Existing deep learning methods for object detection are dedicated to simplifying the network and speeding up the detection process. Their results rely heavily on the accuracy of the generated proposals, for example when a Faster R-CNN approach is employed to detect and count vehicles. Although this technique can accelerate the detection process, it has lower detection accuracy than other traditional methods. Most importantly, these methods are incapable of detecting distant vehicles. We propose an improved YOLOv2 algorithm with DenseNet-201 as the base network for video surveillance systems to detect faraway vehicles that appear small.
The rest of the paper is structured as follows. Motivation is covered in Section 2, while related work is presented in Section 3. The problem statement is discussed in Section 4, the methodology in Section 5, and the proposed approach in Section 6. Section 7 presents the experiments, and Section 8 concludes the paper.

Motivation
The motivation for the proposed model is to investigate object detection (i.e., vehicle detection) in videos obtained from surveillance cameras using an improved YOLOv2 technique. Vehicles appear at various sizes in videos, and as a result, conventional approaches have a hard time detecting them precisely. An improved YOLOv2 algorithm is developed in this paper to cope with this challenge.
The key contributions of this investigation are as follows:
1. We propose an end-to-end trainable DL (Deep Learning) model, i.e., an improved YOLOv2, to detect tiny vehicles (e.g., Car, Bus, Truck) in video surveillance systems. Two benchmarks are used to train the proposed technique for vehicle detection, and the results show that our proposed algorithm outperforms existing methods.
2. We employ DenseNet-201 as the base network in YOLOv2; owing to the direct connections from all layers to the classification layer, it extracts the most representative features from the samples. Furthermore, the proposed algorithm localizes tiny objects with high precision.
3. For cross-validation of our proposed model, we used the MS COCO and Pascal VOC datasets. Our proposed model achieved better vehicle detection performance than existing techniques, which confirms that it is robust.
4. The proposed model is efficient and performs automated feature extraction faster than existing object detectors such as Faster R-CNN. Moreover, it predicts bounding box coordinates and class labels at the same time, and it is simple to employ for object detection in real-time videos.
5. The proposed model is lightweight and has fewer parameters than the original YOLOv2 model.

Literature Review
Employing computer vision techniques to accurately identify on-road vehicles is a challenging problem that has been an active research topic for the past two decades [10]. Traffic surveillance videos have an ever-changing background due to lighting effects. In addition, the exact size and location of vehicles are difficult to capture because of the simultaneous movement of vehicles on the road.
Recently, DL (Deep Learning) models have attracted the interest of numerous scholars, and a plethora of deep learning object detection algorithms have been introduced. In traditional machine learning object detection algorithms, manual feature extraction needs experts with years of experience in the associated domain. Deep learning models, in contrast, require a large amount of data to automatically learn features that capture the variation in the data, making them more representative. Moreover, the feature extraction performed by CNN layers in visual recognition is similar to the human visual mechanism. In recent years, deep learning-based detection algorithms have achieved reasonable real-time performance compared to traditional algorithms, while requiring a continuous increase in data volume and constant updates of device hardware, and have attained recognition worldwide. Owing to their better real-time accuracy and performance, deep learning vehicle detection algorithms have gradually developed in two directions in academia: one focuses on accuracy and the other on complexity.
For more than a decade, researchers have studied vehicle detection and recognition extensively. Previously, numerous handcrafted features were extracted for vehicle detection, which requires manual intervention. Haar [11], HOG [12], and LBP [13] were the three most commonly used feature descriptors. The classification framework was evaluated for vehicle detection and found to be effective, i.e., a large number of vehicles were detected. Additionally, the HOG feature in conjunction with the Support Vector Machine classifier is commonly used with great success in vehicle detection. Moreover, these features and classifiers have broad applications in vehicle detection tasks, and statistical techniques using vertical and horizontal edge features were introduced for detecting and tracking vehicles at night by locating the tail lights. Table 1 presents recent work on vehicle detection and classification.

Problem Statement
A vast number of techniques have been developed in past decades and have attracted much attention among researchers, as a result of developments and improvements in the domain of CV (computer vision) and its vast surveillance applications. As a result of these advances, the detection of objects in images is becoming increasingly important because it can benefit a vast number of applications, including human detection, face detection, vehicle detection, hammer detection, gun detection, knife detection, and many others. With the advancement of technology and the increased number of vehicles on the road around the world, the traffic system has become increasingly reliant on automatic vehicle recognition systems. Consequently, a vehicle detection and recognition system must perform well in terms of time complexity and accuracy.
The main aim of this research paper is to investigate the challenge of vehicle detection in surveillance videos using deep learning. Detecting vehicles is quite difficult because of the low quality of surveillance images, lack of background information, low lighting, and distance.
Hence, we devise a deep learning technique, i.e., Improved YOLOv2, that distinguishes between vehicles and non-vehicles in surveillance videos. From previous research studies, it can be deduced that current surveillance systems cannot efficiently tell us which kind of model should be used for which kind of images. Surveillance systems usually fail because they rely heavily on human operators, who have physical limitations in the form of lethargy or loss of attentiveness from monitoring several screens for long periods. These limitations can be eased by enhancing surveillance systems to automatically detect the various objects present in an image. These capabilities then enable surveillance systems to detect objects in various images. To have such capabilities, we need to devise a mechanism that can not only capture images but can also account for human emotion and behavior, i.e., a method such as object detection has to be introduced that can tell the different objects in an image apart.
Although various traditional methods are used to detect vehicles in surveillance videos, the main problem is that they are less accurate and also very expensive. Nowadays, researchers focus on using deep learning methods to recognize vehicles. In this research, the Improved YOLOv2 algorithm, a DL (Deep Learning) technique, is used to detect various vehicles (e.g., Car, Bus, Truck) observed by surveillance cameras.

Materials and Methods
Conventionally, deep learning uses numerous layers of nonlinear processing modules to obtain features. All layers are cascaded, and each takes the output of the previous layer as input. Many researchers have attempted to build deeper and larger networks to investigate the potential of deep learning. However, such networks face the exploding or vanishing gradient problem (VGP). As a result, researchers have built many different deep learning structures.
A range of deep learning structures has been proposed, such as AlexNet [25], ResNet [26], DenseNet [26], GoogLeNet [27], and VGGNet [28]. AlexNet, the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), is comparable to LeNet and uses ReLU non-linearity and max-pooling. In the 2014 ILSVRC, VGGNet came second, with deeper networks (19 layers) than AlexNet. To extract sparse correlating features in feature map stacks, GoogLeNet, the ILSVRC 2014 winner, uses 1 × 1 convolutions to reduce the dimensions of feature maps before the expensive convolutions, as well as parallel paths with different receptive field sizes. ResNet, the ILSVRC 2015 winner, proposes a 152-layer network with skip (shortcut) connections that bypass a minimum of 2 layers. In DenseNet, by contrast, each layer receives the outputs of all preceding layers as input, giving N(N + 1)/2 connections in N layers, whereas traditional convolutional networks with N layers have only N connections. DenseNet is capable of outperforming the state-of-the-art ResNet structure in the ImageNet classification test.
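The connection counts quoted above can be checked with a few lines of Python. This is a minimal sketch rather than part of the original method: the function name `count_dense_connections` is hypothetical, and it assumes that every layer is connected to the network input and to every earlier layer, as within a single dense block.

```python
def count_dense_connections(num_layers):
    """Count connections when each layer receives the input and all earlier outputs."""
    connections = 0
    for layer in range(1, num_layers + 1):
        # Layer `layer` has one incoming connection from the network input
        # plus one from each of the (layer - 1) layers before it.
        connections += layer
    return connections


num_layers = 5
dense = count_dense_connections(num_layers)      # densely connected layers
sequential = num_layers                          # plain chain: one connection per layer
closed_form = num_layers * (num_layers + 1) // 2 # N(N + 1)/2 from the text

print(dense, sequential, closed_form)
```

For five layers the dense pattern has 15 connections against 5 for a plain sequential chain, which is the N(N + 1)/2 versus N comparison made in the text.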
In this research, we use DenseNet-201 as the base network in YOLOv2 for vehicle detection (e.g., Car, Bus, Truck) because of its remarkable performance. However, before going into detail about DenseNet-201, the traditional convolutional neural network (CNN) is discussed first, followed by the distinctions between DenseNet and CNN.

Convolutional Neural Network (CNN)
A standard convolutional neural network (CNN) normally includes (i) a convolution (CONV) layer, (ii) a rectified linear unit (ReLU) layer, (iii) a pooling (POOL) layer, (iv) a fully connected (FC) layer, and (v) a softmax layer [29]. The functions of these layers are as follows. The convolution layer is the fundamental building block of a CNN: convolving the input with various kernels produces the feature maps, as presented in Figure 1. The convolution layer is followed by the ReLU nonlinear activation function, which is used to extract nonlinear features. The goal of the ReLU layer is to impart non-linearity to the network. It is mathematically defined in Equation (1).
The pooling layer spatially downsamples the feature maps to reduce the number of parameters, the memory footprint, and the network computation time. The pooling function is applied to each feature map; the most common pooling approaches are max pooling, as shown in Equation (2), and average pooling, as presented in Equation (3).
M denotes the pooling region, while R_k represents the total number of elements within the pooling region. The confidence scores are calculated by the fully connected layers and stored in a 1 × 1 × c volume, where c is the number of categories and each element represents a class score.
An individual neuron in the FC layer is linked to the neurons in the previous layer. In a typical CNN, all the layers are connected sequentially, as shown in Equation (4).
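To make the layer operations concrete, the following sketch applies ReLU and then 2 × 2 max and average pooling to a small feature map. It assumes the standard definitions (ReLU as max(0, x), pooling over non-overlapping windows) and that Equations (1) to (3) follow these forms; the helper names `relu` and `pool2d` and the example values are illustrative only.

```python
def relu(x):
    """Rectified linear unit: keep positive values, replace the rest with zero."""
    return max(0.0, x)


def pool2d(feature_map, size=2, mode="max"):
    """Apply non-overlapping size x size pooling to a 2D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Collect every element of the current pooling region.
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                pooled_row.append(max(region))
            else:
                # Average pooling divides by the number of elements in the region.
                pooled_row.append(sum(region) / len(region))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1.0, -2.0, 3.0, 0.0],
    [4.0, 5.0, -6.0, 2.0],
    [-1.0, 0.0, 2.0, 2.0],
    [3.0, 1.0, 4.0, -8.0],
]

activated = [[relu(v) for v in row] for row in feature_map]
max_pooled = pool2d(activated, mode="max")
avg_pooled = pool2d(activated, mode="avg")

print(max_pooled)
print(avg_pooled)
```

Both pooling variants shrink the 4 × 4 map to 2 × 2, which is the reduction in parameters and computation that the pooling layer is meant to provide.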
However, as the network grows deeper and larger, the gradient may explode or vanish. As a result, researchers have proposed various network architectures to solve the problem. ResNet, for example, changed this behavior by using a shortcut connection, as shown in Equation (5).
Rather than summing a layer's output feature maps with the incoming feature maps, DenseNet has direct connections among all layers, and each layer takes input from all previous layers. The expression is rewritten as Equation (6).
where r denotes the index of the layer, F denotes a non-linear function, and m_r denotes the output of the r-th layer.
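For reference, the three connectivity patterns described above are commonly written as follows. This is a sketch in the notation defined above (m_r, F_r), using the standard formulations from the ResNet and DenseNet literature; the assignment to Equations (4) to (6) is an assumption, and the square brackets denote concatenation of feature maps.

```latex
\begin{align}
  % Plain feed-forward CNN: layer r sees only the previous layer's output
  m_r &= F_r(m_{r-1}) \tag{4} \\
  % ResNet: the shortcut adds the layer input to the transformed output
  m_r &= F_r(m_{r-1}) + m_{r-1} \tag{5} \\
  % DenseNet: layer r takes the concatenation of all earlier outputs
  m_r &= F_r\left([m_0, m_1, \ldots, m_{r-1}]\right) \tag{6}
\end{align}
```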

DenseNet-201
Because succeeding layers can reuse features, DenseNet-201 uses a condensed network, which yields a highly parameter-efficient model, increases the diversity of the input to succeeding layers, and enhances performance. DenseNet-201 has performed admirably on a variety of datasets, including ImageNet [30] and CIFAR-100 [31]. To boost connectivity in the DenseNet-201 architecture, direct connections are introduced from every preceding layer to all subsequent layers, as shown in Figure 2. The advantages of DenseNet-201, which includes 201 convolutional layers, are reduced vanishing-gradient problems, strong feature propagation, feature reusability, and fewer parameters.
Let us assume that an image m_0 is fed into a neural network with R layers, where layer r applies a nonlinear transformation F_r(·) and r is the index of the layer. ResNet adds a skip connection to the traditional feed-forward network that bypasses the non-linear transformation with an identity function, as shown in Equation (7).
One advantage of ResNet is that the gradient can flow directly through the identity function between the initial layers and the final layer. In the dense network, by contrast, direct end-to-end connections are used to maximize the amount of information in each layer: the r-th layer receives the information of all preceding layers, as shown in Equation (8).
In DenseNet, the dense blocks are separated by transition layers, where downsampling takes place; each transition layer contains BN (batch normalization), a 1 × 1 convolutional layer (CONV), and an average pooling layer. The outputs of the transition layer are then passed on to the next dense block. We replaced every average-pooling layer with a 2 × 2 max-pooling layer for network utility. Batch normalization is performed before each of the convolutional layers, making the model less complex. The hyperparameter k denotes the network's growth rate, which enables DenseNet to produce state-of-the-art results. Pooling layers are eliminated, and the proposed detection layers are fully integrated and connected to the classification layers for detection. Even deeper network designs than the 201-layer network exist, such as DenseNet-264 [32]. Because we do not want an overly large network, the 201-layer structure is suitable for detecting vehicles. Because each layer has access to the feature maps of all preceding layers, which act as a global state of the network, DenseNet-201 performs well even with a small growth rate. Figure 3 exhibits the DenseNet-201 architecture.
DenseNet-201 is based on the transfer learning concept; it is 201 layers deep, has 20 million parameters, and has been pre-trained on more than one million images from the ImageNet dataset.
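The role of the growth rate k can be illustrated by tracking how the number of feature maps evolves through the dense blocks. The sketch below is not taken from the paper: it assumes the commonly published DenseNet-201 configuration (dense blocks of 6, 12, 48, and 32 layers, growth rate k = 32, 64 initial feature maps, and transition layers that halve the channel count), and the function names are hypothetical.

```python
def densenet_channels(block_config=(6, 12, 48, 32), growth_rate=32,
                      init_features=64, compression=0.5):
    """Return the number of feature maps at the output of each dense block."""
    channels = init_features
    trace = []
    for index, num_layers in enumerate(block_config):
        # Every layer in a dense block appends `growth_rate` new feature maps.
        channels += num_layers * growth_rate
        trace.append(channels)
        # A transition layer follows every block except the last one.
        if index != len(block_config) - 1:
            channels = int(channels * compression)
    return trace


def densenet_depth(block_config=(6, 12, 48, 32)):
    """Count weighted layers: two convolutions per dense layer, the stem
    convolution, one convolution per transition layer, and the classifier."""
    dense_convs = 2 * sum(block_config)
    transition_convs = len(block_config) - 1
    return dense_convs + 1 + transition_convs + 1


channels_per_block = densenet_channels()
depth = densenet_depth()
print(channels_per_block, depth)
```

Under these assumptions the last dense block outputs 1920 feature maps and the layer count comes to 201, which matches the name of the network. Each layer adds only k new feature maps, so the network stays compact despite its depth.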

YOLO (You Only Look Once) Theory
YOLO, an abbreviation of "You Only Look Once" [33], is an advanced one-stage algorithm for identifying objects in real time. The YOLO technique uses a CNN and treats object recognition as a regression problem. The CNN predicts multiple bounding boxes and class probabilities simultaneously. In contrast to Faster R-CNN, YOLO obtains location and category predictions without a region proposal network (RPN).

Working Principle of YOLO
First, the network splits the input image into an R × R grid. When the center point of an object lies in a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. Prob(Object) indicates whether an object of interest falls into the cell. The confidence C in YOLOv2 is defined in Equation (9).

Here, each grid cell predicts C conditional class probabilities, Pr(Class | Object), and Prob(Object) is the probability that the bounding box contains a vehicle object. If the object is present, Prob(Object) is equal to 1; otherwise, it is equal to 0.
Each bounding box has five components (x_0, y_0, wd, ht, confidence). The confidence score reflects how confident the model is that the predicted box contains an object and how accurate it believes the predicted box is. The (x_0, y_0) coordinates give the center of the box relative to the bounds of the grid cell, and these coordinate values lie between 0 and 1. The (wd, ht) box dimensions are the width and height of the bounding box relative to the whole image and are also normalized to lie between 0 and 1. The category probability p is calculated as shown in Equation (10).
The confidence score is zero if no object lies in that cell. Otherwise, the confidence score should equal the intersection over union (IoU) of the ground-truth and predicted boxes. Each grid cell makes B of these predictions, so there are a total of R × R × B × 5 outputs related to bounding box predictions. The last layer of the pre-trained CNN model predicts a tensor of size R × R × (B × 5 + C), where C is the number of classes.
If multiple objects fall in a single grid cell, we resolve the problem with the concept of anchor boxes. Anchor boxes enable YOLOv2 to identify several objects in a single grid cell. To this end, several anchor boxes are predefined, which adds one more dimension to the output labels. One object is then assigned to each anchor box. Figure 4 illustrates the framework of the YOLO methodology.
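The grid assignment, the IoU-based confidence, and the output tensor size described above can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, boxes are given as (center x, center y, width, height) normalized to the whole image for simplicity (the text expresses the center relative to the grid cell), and the values R = 7, B = 2, C = 3 are example choices.

```python
def responsible_cell(cx, cy, grid_size):
    """Return (row, col) of the grid cell that contains a normalized box center."""
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col


def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Overlap is clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def confidence(prob_object, predicted_box, truth_box):
    """Confidence as Prob(Object) times the IoU of predicted and true boxes."""
    return prob_object * iou(predicted_box, truth_box)


def output_size(grid_size, boxes_per_cell, num_classes):
    """Number of values in the R x R x (B * 5 + C) prediction tensor."""
    return grid_size * grid_size * (boxes_per_cell * 5 + num_classes)


truth = (0.5, 0.5, 0.4, 0.4)
predicted = (0.6, 0.5, 0.4, 0.4)

cell = responsible_cell(truth[0], truth[1], grid_size=7)
score_with_object = confidence(1, predicted, truth)
score_without_object = confidence(0, predicted, truth)
total_outputs = output_size(grid_size=7, boxes_per_cell=2, num_classes=3)

print(cell, score_with_object, score_without_object, total_outputs)
```

With these example values the object is assigned to the central cell of the 7 × 7 grid, the confidence is the IoU of the two boxes when an object is present and zero otherwise, and the prediction tensor holds 7 × 7 × (2 × 5 + 3) values.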

Loss Function
The loss is split into two parts: a localization loss for predicting bounding box offsets and a classification loss for predicting the conditional class probabilities. The sum of squared errors is used to compute both parts. Two scale parameters determine how much the loss from bounding box coordinate predictions should be increased (λ_coord) and how much the loss from confidence score predictions for boxes without objects should be decreased (λ_noobj). In this way, a weighted scheme balances the various types of losses. Generally, λ_coord is set to 5 and λ_noobj to 0.5. Otherwise, the losses would contribute unevenly to the overall loss, rendering certain losses ineffective for network training. The loss is given in Equation (11), where x_i and y_i represent the center coordinates, w_i and h_i refer to the width and height of the box, C_i represents the confidence of the box, and p_i(c) is the class probability related to the box of the i-th grid cell. The corresponding predictions of x_i, y_i, w_i, h_i, C_i, and p_i(c) are x̂_i, ŷ_i, ŵ_i, ĥ_i, Ĉ_i, and p̂_i(c); λ_coord is the weight of the coordinate loss, and λ_noobj is the weight of the confidence loss for bounding boxes without objects. S^2 indicates the S × S grid cells (the R × R grid introduced above), B indicates the number of boxes per grid cell, and an indicator term specifies whether an object falls in the j-th bounding box of the i-th grid cell. In Equation (11), the first term computes the coordinate loss, the second term the bounding box size loss, the third term the confidence loss for bounding boxes with objects, the fourth term the confidence loss for bounding boxes without objects, and the fifth term the class loss.
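For reference, the sum-of-squared-errors loss of the original YOLO formulation has the five-term structure described above. The block below is a sketch in the notation of this section and is assumed, not confirmed by the text, to correspond to Equation (11); the indicator symbols 1^obj_ij and 1^noobj_ij (object present or absent in the j-th box of the i-th cell) and the square roots on width and height follow the original YOLO paper.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbf{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbf{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
    + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbf{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbf{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbf{1}_{i}^{obj}
    \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

With λ_coord = 5 and λ_noobj = 0.5, localization errors are weighted ten times more heavily than confidence errors in the many cells that contain no object, which keeps those empty cells from dominating the gradient.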

Proposed Solution
The proposed network comprises (i) an input layer, (ii) a feature extraction network, and (iii) a detection network. The first stage resizes each input image to 224 × 224 pixels, after which the scaled data is passed to DenseNet-201 for feature extraction. As previously indicated, we replaced the YOLOv2 baseline network Darknet-19 with DenseNet-201 and its associated procedures, and we then adjusted the detection network accordingly. The complete structure of the proposed system is depicted in Figure 5.

Experimental Evaluation

Dataset
The dataset is the foundation for estimating any model's performance. Improving the recognition rate of the proposed model requires sufficient data for vehicle detection training. More training data can enhance the recognition and generalization rate as well as the robustness of the model, whereas overfitting may occur when the amount of data is insufficient. We used two datasets, the Kaggle vehicle dataset [34] and the KITTI dataset [35], for the training and testing of the model. Moreover, the MS COCO [36] and Pascal VOC [37] datasets were used to cross-validate the proposed model.

Kaggle Vehicle Dataset
The vehicle dataset available on Kaggle is used for experimental purposes. The dataset is split into two parts, i.e., a train set and a test set. It contains 22,852 training images and 5193 test images, for a total of 28,045 images. There are 17 classes (Ambulance, Car, Cart, Boat, Bus, Caterpillar, Helicopter, Barge, Bicycle, Segway, Limousine, Motorcycle, Tank, Taxi, Snowmobile, Truck, and Van). The class-wise distribution of the Kaggle dataset is presented in Table 2.

KITTI Dataset
The KITTI dataset is freely available and contains 80,256 labeled objects across numerous images. We utilized 7481 training images and 2000 test images. All of the images are in color and are stored as PNG files. There are 80 classes (Car, Bus, Truck, Train, Motorcycle, etc.). The class-wise distribution of the KITTI dataset is described in Table 3.

Pascal VOC Dataset
Pascal VOC contains 20 different classes (Vehicle: train, bicycle, boat, bus, airplane, etc.) and 9963 images with 24,640 annotated objects. For vehicle detection, we utilized samples of various classes from the Pascal VOC dataset. More precisely, we employed 800 images in total to evaluate our proposed classifier for the detection of vehicles.

Metrics
To analyze the performance of the proposed system, we utilized the metrics of Accuracy [9], Intersection over Union (IoU) [38], and mean Average Precision (mAP) [39].
Accuracy relies upon true positives (TP) [40], false positives (FP) [41], true negatives (TN), and false negatives (FN). The accuracy of the system indicates the proportion of images correctly classified by the proposed system, as given in Equation (12):

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}
$$

We employed mAP, i.e., the mean of the average precision values, to analyze the performance of our proposed detector. It is given in Equation (13), where $Q$ denotes the total number of test images and $AP(q)$ is the average precision obtained for the $q$-th of them:

$$
mAP = \frac{1}{Q} \sum_{q=1}^{Q} AP(q) \tag{13}
$$
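As a brief illustration of the three metrics named above, the following Python sketch computes accuracy from the four counts, IoU for a pair of boxes, and the mean of a list of average precision values. The function names and the corner-based box format `(x_min, y_min, x_max, y_max)` are assumptions made for this example and are not taken from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correct decisions: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def mean_average_precision(ap_values):
    """Arithmetic mean of a non-empty list of average precision values."""
    return sum(ap_values) / len(ap_values)
```

For example, two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.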

Environment
We performed the experiments on an NVIDIA GPU, i.e., a GeForce RTX 30 with 4 GB of memory. The operating system was Windows 10 with 16 GB of RAM. The experiments were performed using MATLAB 2021a. We trained our classifier for various categories of vehicles with 100 epochs and a learning rate of 0.0001.
The primary goal of this paper is to propose an accurate approach for the detection of vehicles. The experiments performed provide insight into the method's robustness and its capacity to run in real-time scenarios. To achieve a reliable vehicle detector, we proposed an Improved YOLOv2 that uses DenseNet-201 as the base network and employs a transfer learning (TL) mechanism. The proposed model builds on the outstanding performance of DenseNet on ImageNet classification tasks. Figure 7 shows the results of the proposed model for the detection of vehicles using the Kaggle Vehicle dataset.

Class-Wise Performance
The average precision (AP) for each vehicle class was used to measure recognition performance. The mean Average Precision (mAP) summarizes the average recognition performance, whereas the intersection over union (IoU) indicates the average localization performance. In object detection, mAP and IoU are significant measures for evaluating a model's performance. Table 4 shows that the proposed Improved YOLOv2 with DenseNet-201 has an mAP of 97.51% and an IoU of 97.06%. According to our findings, Improved YOLOv2 with DenseNet-201 worked well for both single and multiple vehicle identification. In our proposed method, the AP of Taxi and Van reaches 98.9%, while the remaining classes range from 94.5% to 98.8%. In terms of localization and recognition accuracy, our proposed technique surpassed the others.

Cross-Validation
The Pascal VOC and MS COCO datasets were employed for the cross-validation of the proposed model. For vehicle detection, we employed various samples from both datasets. Using DenseNet-201, we determined the mAP for each of the 20 classes in the PASCAL VOC dataset for Improved YOLOv2, and we achieved 81% mAP, which was approximately 2 percent higher than YOLOv2. Furthermore, our proposed model achieved promising results and outperformed other detectors, as shown in Table 5. For 1000 iterations, the training took around one hour. Fast R-CNN [42] attained 70% mAP, YOLOv2 [43] achieved 76.8%, and Faster R-CNN with ResNet [43] achieved 76.4% mAP. The highest mAP was 81%, attained by our proposed model, whereas the lowest mAP was 63.4%, attained by YOLO [33]. Moreover, SSD300 [44] and SSD500 [44] achieved 74.3% and 76.8% mAP, respectively. Faster R-CNN with VGG-16 [45] and Improved YOLOv3-Net [46] achieved 73.2% and 77.4% mAP, respectively. We conclude that our proposed algorithm outperforms the existing models because of its improved base network, DenseNet-201. Our base network retrieves the most relevant features, and due to the dense connections, information flows accurately through to the last layer. More precisely, the dense architecture makes our proposed model robust enough to perform accurate detection. The comparison plot is depicted in Figure 8.

| Model | mAP (%) |
|---|---|
| Fast R-CNN [42] | 70.0 |
| YOLOv2 [47] | 76.8 |
| Faster R-CNN ResNet [43] | 76.4 |
| YOLO [33] | 63.4 |
| SSD300 [44] | 74.3 |
| SSD500 [44] | 76.8 |
| Faster R-CNN VGG-16 [45] | 73.2 |
| Improved YOLOv3-Net [46] | 77.4 |
| Improved YOLOv2 (DenseNet-201), proposed model | 81.0 |

Comparison with Existing Models
To evaluate the performance of our proposed model, we conducted two separate experiments. In the first experiment, we employed Pascal VOC 2007 to train our detector for vehicle detection. We analyzed the effectiveness of the proposed technique and compared it with predominant techniques on the Pascal VOC 2007 dataset. We utilized samples from only three classes of the dataset: Bus, Car, and Truck. The proposed model performed significantly better than existing techniques. Training employed a batch size of 64 and a learning rate of 0.001, and evaluation used an IoU threshold of 0.50. Four network models were compared: Improved YOLOv2, YOLOv3, YOLOv3-Net, and our proposed model, Improved YOLOv2-Net-201. The statistics are shown in Table 6. The best mAP of 82.7% was achieved by Improved YOLOv2-Net-201 because of the dense architecture used as the base network in YOLOv2. Each layer receives data from all the preceding layers and passes it to all subsequent layers. More precisely, the classification layer has a direct connection with the previous layers, extracting the most valuable features for the detection of vehicles. Our proposed model is capable of effective vehicle detection and outperforms the existing techniques. The comparison plot is presented in Figure 9, which shows that the proposed model performs best among the existing models.
In the second experiment, the COCO dataset was used to train the detector for the detection of vehicles such as buses, cars, and trucks. The statistics are shown in Table 7. The best mAP of 75.1% was achieved by our proposed model, and the lowest mAP of 60% was attained by the original YOLOv2. Meanwhile, YOLOv3 and Improved YOLOv3 achieved 66.2% and 71.2% mAP, respectively. Our proposed model attained the best performance among the compared models.
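The role of the IoU threshold of 0.50 in these comparisons can be sketched as follows. A detection counts as a true positive when its overlap with a ground-truth box reaches the threshold, and the average precision for a class is then accumulated over the detections ranked by confidence. This is a simplified, non-interpolated AP and not necessarily the exact evaluation protocol used in the experiments. The function name `average_precision` and the input format are assumptions: `best_ious` is taken to hold, for each detection in descending confidence order, its IoU with the ground-truth box it was matched to, with each ground-truth box matched at most once.

```python
def average_precision(best_ious, num_ground_truth, iou_threshold=0.5):
    """Non-interpolated AP for one class from ranked detections.

    best_ious:        IoU of each detection with its matched ground-truth box,
                      sorted by descending detection confidence (assumed input).
    num_ground_truth: number of ground-truth boxes of this class.
    iou_threshold:    minimum IoU for a detection to count as a true positive.
    """
    if num_ground_truth == 0:
        return 0.0
    true_positives = 0
    precision_sum = 0.0
    for rank, overlap in enumerate(best_ious, start=1):
        if overlap >= iou_threshold:
            true_positives += 1
            # Precision measured at the rank of each true positive.
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth
```

Raising `iou_threshold` turns loosely localized detections into false positives, which is why reported mAP values are only comparable when the same threshold is used.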

Conclusions
In this study, a novel and robust system for vehicle detection is proposed using a deep neural network based on YOLOv2 (You Only Look Once). Our proposed technique uses DenseNet-201 as the feature extraction network, replacing Darknet-19 in the original YOLOv2. We employed two benchmarks, the Kaggle vehicle dataset and the KITTI dataset, using 70% of the data for training and 30% for testing of our proposed model. Moreover, we utilized samples from 17 classes exhibiting various vehicles such as buses, trucks, cars, carts, bikes, etc. We performed extensive experimentation to evaluate the performance of the proposed model and achieved better average precision for our model than existing techniques. Moreover, our proposed model is more compact and utilizes more representative features due to the dense connections among layers. More precisely, each layer in our proposed base network is directly connected with all previous layers up to the classification layer, and this mechanism ensures a good flow of information from the input layer to the last one. Furthermore, because of the compactness of the base network, our proposed model detects tiny vehicles with more precision and calculates bounding boxes more accurately than the original YOLOv2. We also performed cross-validation to determine the robustness of our proposed technique using two prominent datasets, Pascal VOC and COCO. Our proposed model attained excellent performance compared to state-of-the-art techniques, achieving 81% mAP. We believe that our proposed model is a robust and effective framework for the detection of vehicles such as cars, buses, and trucks. In the future, we aim to modify and fine-tune our model to attain better accuracy and mAP for vehicle detection along with classification. Moreover, we will try to utilize our framework for other object detection applications such as abnormal activity detection.
exhibits the DenseNet201 architecture:

Figure 4. The framework of the YOLO methodology.

Figure 5. The overall structure of the proposed model.

COCO
Common Objects in Context (COCO) is one of the most famous open-source datasets for object identification and segmentation. Microsoft sponsors the COCO dataset, which contains over 300,000 images and 90 object types. In recent years, semantic segmentation has become the industry standard for image semantics understanding. We employed only 500 images exhibiting various vehicles from the COCO dataset. Various training samples are presented in Figure 6.

Figure 6. Various samples for training.

Figure 7. Results of the proposed model's detection using the Kaggle Vehicle dataset.

Figure 8. Comparison of mAP between the different networks and the proposed network.

Figure 9. Comparison graph with existing models for Car, Bus, and Truck samples using the PASCAL VOC dataset.

Figure 10. Comparison graph with existing models for Car, Bus, and Truck samples using the COCO dataset.

Table 1. Summary of existing techniques for vehicle detection.

Table 2. Comprehensive overview of the Kaggle dataset.

Table 3. Class-wise distribution of the KITTI dataset.

Table 4. Class-wise performance over the Kaggle and KITTI datasets.

Table 5. Comparison of different network models using PASCAL VOC 2007.

Table 6. Comparison of the proposed network with existing models on the Pascal VOC dataset.