A Multi-Angle Appearance-Based Approach for Vehicle Type and Brand Recognition Utilizing Faster Regional Convolution Neural Networks

Vehicle type and brand information constitute a crucial element in intelligent transportation systems (ITSs). While numerous appearance-based classification methods have studied frontal view images of vehicles, the challenge of multi-pose and multi-angle vehicle distribution has largely been overlooked. This paper proposes an appearance-based classification approach for multi-angle vehicle information recognition, addressing the aforementioned issues. By utilizing faster regional convolution neural networks, this method automatically captures crucial features for vehicle type and brand identification, departing from traditional handcrafted feature extraction techniques. To extract rich and discriminative vehicle information, ZFNet and VGG16 are employed. Vehicle feature maps are then imported into the region proposal network and classification location refinement network, with the former generating candidate regions potentially containing vehicle targets on the feature map. Subsequently, the latter network refines vehicle locations and classifies vehicle types. Additionally, a comprehensive vehicle dataset, Car5_48, is constructed to evaluate the performance of the proposed method, encompassing multi-angle images across five vehicle types and 48 vehicle brands. The experimental results on this public dataset demonstrate the effectiveness of the proposed approach in accurately classifying vehicle types and brands.


Introduction
Vehicle information recognition is a fundamental problem in computer vision, with applications in intelligent transportation systems (ITSs). The primary objective is to accurately locate the image region that contains a vehicle and then identify the vehicle's specific make and model. Despite considerable effort, the effectiveness of existing solutions remains limited.
A key challenge in vehicle information recognition stems from the wide array of vehicle models and designs across different brands, coupled with the rapid variation in appearance as the viewing angle changes. This complexity demands a more robust approach to discerning and classifying vehicle information in diverse real-world scenarios.
To address these challenges and attain high recognition accuracy, traditional methods often rely on handcrafted features. Prominent among these are the Scale-Invariant Feature Transform (SIFT) [1], Histogram of Oriented Gradients (HOG) [2,3], Speeded-Up Robust Features (SURF) [4,5], Harris corners [6], and fused features [7]. Zhang et al. [8] leveraged the naive Bayesian classifier for vehicle type recognition. Although these traditional methods have performed well, they fall short in the task of vehicle type and brand recognition. Recently, deep learning methods have proven highly effective in image recognition and object detection tasks, including pedestrian detection [9] and visual processing [10], in addition to vehicle information recognition. This is largely attributed to the feature extraction capabilities of convolutional neural networks (CNNs). Luo et al. [11] introduced the first true CNN, LeNet, and applied it to handwritten digit recognition in 1998. Deeper CNNs, beginning with AlexNet, then delivered markedly better performance in image classification tasks. Successive architectures such as ZFNet, VGGNet, and ResNet further established CNNs as the standard choice for computer vision applications.
Deng et al. [12] leveraged CNNs to extract vehicle features, which were then classified into three types using Support Vector Machines (SVMs). Sang et al. identified six vehicle types from frontal views utilizing Faster R-CNN. Dong et al. [13] employed CNNs to extract features from vehicle frontal images, achieving an impressive accuracy of 92.89%. Huttunen et al. [14] compared deep neural networks with traditional methods, specifically evaluating the use of SIFT features in conjunction with SVM classification. Azam et al. [15] utilized convolutional neural networks (CNNs) to estimate the vehicle pose from four directions and successfully captured the regions. Chen et al. employed the rear-view image of the vehicle as the detection object. While these methods have achieved impressive classification accuracy in vehicle type recognition, they primarily rely on single-angle images of the vehicle, neglecting the multi-pose and multi-angle distribution characteristics of the vehicle.
Furthermore, some studies have addressed multi-angle vehicle detection. Specifically, Wang et al. [16] trained the detection model using images captured from seven different angles to facilitate the fine-grained classification of vehicles. Sochor et al. [17] integrated additional image information, such as the vehicle bounding box and direction, into the network for enhanced detection. The methodology proposed in this paper shares similarities with these approaches, offering a comparable solution to the problem of vehicle detection.
Considering the aforementioned challenges, one must address two primary problems: Firstly, the extraction of spatial features to effectively represent the multi-angle distribution of the vehicle. Secondly, the exploration and utilization of useful information to formulate the properties of the vehicle regions. To tackle these issues, this paper proposes an efficient method based on Faster R-CNN [18]. Additionally, a comprehensive vehicle database is established, encompassing five types and 48 brands captured from eight different angles under varied environmental conditions.
The proposed framework consists of three integral components: road vehicle video processing, a vehicle type recognition network, and a vehicle brand recognition network. Specifically, the road vehicle video processing model extracts individual frames from a given vehicle video and identifies the frame containing the vehicle. The vehicle type recognition network and the vehicle brand recognition network are then employed to determine the type and brand of the vehicle, respectively. The entire framework is end-to-end trainable, facilitating seamless integration and optimization.
The structure of this paper is as follows. Section 2 describes the datasets used in this study. Section 3 presents the architecture of the proposed network, outlining its key components and functionalities. Section 4 reports the experimental results, highlighting the performance and effectiveness of the proposed method. Finally, Section 5 draws conclusions based on the findings and discusses potential paths for future research.

Data Construction
This section describes where and how the data were obtained and how they were annotated.

Data Construction
Existing widely used datasets, such as Stanford Cars and BIT-Vehicle, cover a limited number of vehicle types, with uneven sample distributions among the types and a narrow range of shooting angles. The Stanford Cars dataset comprises 16,185 images across 196 categories, offering a diverse range of brands but a restricted variety of vehicle types. The BIT-Vehicle dataset, consisting of 9850 images, features six types of vehicles captured by the camera (buses, microbuses, minivans, sedans, SUVs, and trucks), but with a single shooting angle and background condition. These limitations in the sample distribution and shooting angles may impact the robustness and generalization capabilities of models trained on these datasets.

Data Annotation and Statistic
To mimic real-world scenarios, we constructed a multi-angle database. The samples in our self-built database were primarily sourced from the internet and vehicle videos, with varied lighting and background conditions. A CCD camera from Yuanda Vision Technology (YDV-C932A, produced by Shenzhen Great Vidoe Technology Co., Ltd., Shenzhen, China) was used to capture the training and testing vehicle driving videos. When capturing vehicle driving videos, a camera is placed in front of and on the side of the road, and the vehicle is driven straight ahead to obtain images from four angles: straight-ahead, left, left-front, and left-rear. When the vehicle turns around, images are obtained from four further angles: straight-behind, right, right-front, and right-rear. In addition, the shooting environment and weather conditions are made as diverse as possible in order to obtain vehicle samples under different lighting conditions. For manually captured vehicle samples, a large number of shooting locations are selected in order to diversify the background conditions. All images were manually annotated, and the samples were captured from eight different angles: front, behind, right-front, left-front, left-side, right-side, right-behind, and left-behind, as illustrated in the top view of Figure 1. The dataset comprises 10,800 multi-angle images, 8800 of which are cars and SUVs. Since the self-built dataset includes five types and 48 vehicle brands, it is named Car5_48. Specifically, the five types are cars, SUVs, buses, minivans, and minibuses, and the 48 brands include Volkswagen, BYD, BMW, Mercedes-Benz, Land Rover, Nissan, and others. Figure 2 displays a subset of the Car5_48 dataset, showing vehicles of various brands and models captured under the eight shooting angles.

Vehicle Information Recognition Model
In this section, we present the multi-angle vehicle type and brand recognition network along with the associated video processing techniques. The architecture of the network is depicted in Figure 3, with further details elaborated below.
The overall framework comprises three main components: a video processing model, followed by vehicle type recognition and vehicle brand recognition networks. The video processing model incorporates the Gaussian model and the background difference algorithm to process vehicle videos. Subsequently, the video is converted into image frames and passed on to the two recognition networks, which are capable of identifying five vehicle types and 48 vehicle brands.

Video Processing Model
The Gaussian Mixture Model (GMM) and the background subtraction algorithm are used to extract video frames, a combination that is particularly suitable for scenes with gradual changes in illumination and background. The video processing algorithm comprises four fundamental steps.
Step 1: The pixel value $x_t$ of the video image captured at the current time is compared with $K$ initial Gaussian distributions to determine the best match. The matching condition is defined by Equation (1). The Gaussian distributions can be represented as $P(x) = \{[w_i, \mu_i, \sigma_i]\}$, where $i$ ranges from 1 to $K$. Here, $K$ is set to 4, which offers relatively high performance at a moderate time cost. The weight of each distribution is denoted by $w_i$, while $\mu_i$ and $\sigma_i$ represent the mean and standard deviation of the Gaussian distribution, respectively. The parameters of the initial Gaussian distributions are chosen somewhat arbitrarily to make the subsequent network more robust. In practice, we verify the effectiveness of an initialization method experimentally: we train multiple models using different initialization methods and compare their training and prediction performance, and we consider an initialization method effective if the resulting model performs well in both. In our case, $\mu_i$ is assigned a random value between 0 and 255, $w_i$ is set to $1/K$, and $\sigma_i$ is a constant value of 6.
The lane pixels in a video sequence are modeled with a GMM that represents the background. The parameters of this GMM, namely the weights, means, and standard deviations, are updated according to Equations (2) through (4), which allows the background representation to adapt to changing lighting conditions and other environmental factors.
In these equations, $\alpha$ denotes the learning rate, which regulates how quickly the model adapts. $M_{k,t}$ is a binary match indicator: $M_{k,t} = 1$ for the matched distribution and $M_{k,t} = 0$ otherwise. $\rho$ is the second learning rate, which is updated according to Equation (5), and $\eta$ is the Gaussian probability density function. The $K$ Gaussian distributions are then sorted in descending order of $w_i/\sigma_i$, and the first model that satisfies Equation (6) is taken as the background, where $T = 0.7$ is the weight threshold.
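Equations (1) through (6) are not reproduced in this text. For orientation, the widely used Stauffer-Grimson adaptive mixture formulation, which matches the symbols defined above, can be written as follows. Whether the authors' equations agree with it term for term is an assumption; in particular, the factor 2.5 in the match test is the conventional value and is not stated in this text.

```latex
\begin{align}
|x_t - \mu_{i,t-1}| &\le 2.5\,\sigma_{i,t-1} \tag{1}\\
w_{k,t} &= (1-\alpha)\,w_{k,t-1} + \alpha\,M_{k,t} \tag{2}\\
\mu_{k,t} &= (1-\rho)\,\mu_{k,t-1} + \rho\,x_t \tag{3}\\
\sigma_{k,t}^{2} &= (1-\rho)\,\sigma_{k,t-1}^{2} + \rho\,(x_t-\mu_{k,t})^{2} \tag{4}\\
\rho &= \alpha\,\eta(x_t \mid \mu_{k},\sigma_{k}) \tag{5}\\
B &= \arg\min_{b}\Bigl(\sum_{k=1}^{b} w_{k} > T\Bigr) \tag{6}
\end{align}
```

Read this way, Equation (6) keeps the smallest number of top-ranked distributions whose cumulative weight exceeds $T$.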
Step 2: Once the background model is established, the foreground is obtained by subtracting the pixel value $B_t(x, y)$ of the background image from the pixel value $I_t(x, y)$ of each point in the current image.
Step 3: After the difference image is obtained, a predefined threshold specific to each video is used to determine whether a connected domain of foreground pixels meets the required vehicle area. If it does, the region is labeled in the video with its minimum bounding box.
Step 4: When the labeled rectangle intersects the preset yellow or green line, the current frame is automatically captured and saved.
The results of processing the video with this methodology are depicted in Figures 4 and 5. Figure 4 shows the processing outcome for the frontal image, while Figure 5 corresponds to the left-frontal image. Since the green and yellow lines for contact detection are fixed, the angles at which the vehicle images are captured from the video remain roughly consistent. This consistency aids the subsequent identification of vehicle information. Finally, the gathered vehicle images are saved and transmitted to the vehicle information detection and recognition network for further identification.
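Steps 1 through 4 can be sketched for grayscale frames as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Only K = 4, σ = 6, w = 1/K, the random initial means, and T = 0.7 come from the text. The class name `BackgroundGMM`, the learning rate `ALPHA`, the 2.5σ match test, the re-seeding of the weakest component when nothing matches, the variance floor, the difference threshold of 30, `MIN_AREA`, and `LINE_Y` are all hypothetical choices made for the example.

```python
import numpy as np

K, SIGMA0, T = 4, 6.0, 0.7   # values stated in the text
ALPHA, MATCH = 0.05, 2.5     # assumed: the text gives neither value


class BackgroundGMM:
    """Per-pixel mixture of K Gaussians over grayscale intensity."""

    def __init__(self, shape, seed=0):
        rng = np.random.default_rng(seed)
        self.w = np.full((K, *shape), 1.0 / K)               # w_i = 1/K
        self.mu = rng.uniform(0.0, 255.0, size=(K, *shape))  # random means
        self.var = np.full((K, *shape), SIGMA0 ** 2)         # sigma_i = 6

    def background(self):
        """Weighted mean of the top-ranked components selected by T."""
        order = np.argsort(-self.w / np.sqrt(self.var), axis=0)
        w = np.take_along_axis(self.w, order, axis=0)
        mu = np.take_along_axis(self.mu, order, axis=0)
        # keep a component while the cumulative weight before it is <= T
        keep = (np.cumsum(w, axis=0) - w) <= T
        return (w * mu * keep).sum(axis=0) / (w * keep).sum(axis=0)

    def update(self, frame):
        x = np.asarray(frame, dtype=float)[None]              # (1, H, W)
        d = np.abs(x - self.mu) / np.sqrt(self.var)           # distance in sigmas
        best = np.argmin(d, axis=0)[None]
        hit = np.take_along_axis(d, best, axis=0) <= MATCH
        # Assumed fallback: with no match, re-seed the lowest-weight component.
        weakest = np.argmin(self.w, axis=0)[None]
        idx = np.where(hit, best, weakest)
        mu_k = np.where(hit, np.take_along_axis(self.mu, idx, axis=0), x)
        var_k = np.where(hit, np.take_along_axis(self.var, idx, axis=0), SIGMA0 ** 2)
        eta = np.exp(-0.5 * (x - mu_k) ** 2 / var_k) / np.sqrt(2 * np.pi * var_k)
        rho = np.minimum(ALPHA * eta, 1.0)                    # second learning rate
        mu_new = (1 - rho) * mu_k + rho * x
        var_new = np.maximum((1 - rho) * var_k + rho * (x - mu_new) ** 2, 1.0)
        np.put_along_axis(self.mu, idx, mu_new, axis=0)
        np.put_along_axis(self.var, idx, var_new, axis=0)
        m = np.zeros_like(self.w)                             # match indicator M
        np.put_along_axis(m, idx, 1.0, axis=0)
        self.w = (1 - ALPHA) * self.w + ALPHA * m

    def apply(self, frame, diff_thresh=30.0):
        """Step 2: mask where |I_t - B_t| > diff_thresh, then update the model."""
        diff = np.abs(np.asarray(frame, dtype=float) - self.background())
        self.update(frame)
        return diff > diff_thresh


# Usage on synthetic data: learn an empty road, then show a bright 3 x 3 object.
model = BackgroundGMM((8, 8))
road = np.full((8, 8), 100.0)
for _ in range(200):
    model.apply(road)
frame = road.copy()
frame[2:5, 2:5] = 200.0
mask = model.apply(frame)

MIN_AREA = 4                                   # hypothetical area threshold (Step 3)
ys, xs = np.nonzero(mask)                      # single blob assumed, no labeling
box = (xs.min(), ys.min(), xs.max(), ys.max()) # minimum bounding box
LINE_Y = 3                                     # hypothetical trigger line row (Step 4)
capture = mask.sum() >= MIN_AREA and box[1] <= LINE_Y <= box[3]
```

A real pipeline would label connected components before measuring area, since several vehicles can appear in one frame; the sketch skips that step because the synthetic frame contains a single object.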

Vehicle Type and Brand Recognition Network
The architecture of the vehicle type and brand recognition network is composed of three primary components: a feature extraction network, a region proposal network (RPN), and a classification location refinement network. The latter includes the ROI (region of interest) and FC (fully connected) layers. This architecture is depicted in Figure 6. The acquired vehicle videos are processed using the model outlined in Section 3.1, where video frames are converted into images that serve as inputs for the recognition network. The CNN extracts distinctive features from the images, generating a feature map. Subsequently, the region proposal network identifies potential regions of interest on the feature map. Finally, the classification location refinement network refines the identification and localization of the vehicle within the image, outputting both the classification and the precise location of the vehicle.

Network
The ZFNet and VGG16 networks were employed for feature extraction, and the better network was selected by comparing their respective accuracies. The architecture of ZFNet is shown in Figure 7. Because the network is used only for feature extraction, the fully connected Layer 6 and Layer 7 are not required. The input image is 224 × 224 pixels, and after the 5th convolutional layer the resulting feature map has dimensions of 13 × 13 × 256.
The framework also uses VGG16 for feature extraction. As with ZFNet, only the convolutional part is used: the 13 convolutional layers perform feature extraction, while the pooling layer POOL5 and the three fully connected layers FC6, FC7, and FC8 are excluded. ReLU serves as the activation function, and max pooling is used in the pooling layers. For a 224 × 224 pixel input, the feature map produced by the 13-layer CNN has dimensions of 14 × 14 × 512. Furthermore, Figure 8 visualizes the features of the first two convolutional layers: Figure 8a,c show the feature maps of the first and second convolutional layers, respectively, while Figure 8b,d show the maps of the first and second pooling layers.
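The 14 × 14 × 512 figure for VGG16 can be checked with simple shape arithmetic. The sketch below assumes the standard VGG16 layout (3 × 3 convolutions with stride 1 and padding 1, and 2 × 2 max pooling with stride 2 after each block), which the text does not spell out; the function names are hypothetical. The corresponding check for ZFNet is omitted because it depends on padding conventions that the text does not state.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


# (number of 3x3 conv layers, output channels) per VGG16 block: assumed layout
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_feature_shape(size=224, keep_pool5=False):
    """Return (height, width, channels, conv_layer_count) of the feature map."""
    channels, n_conv = 3, 0
    for block_idx, (n_layers, out_ch) in enumerate(VGG16_BLOCKS):
        for _ in range(n_layers):
            size = conv_out(size, kernel=3, stride=1, pad=1)  # size preserved
            channels = out_ch
            n_conv += 1
        is_last = block_idx == len(VGG16_BLOCKS) - 1
        if keep_pool5 or not is_last:
            size = conv_out(size, kernel=2, stride=2)         # halves the size
    return size, size, channels, n_conv


shape_without_pool5 = vgg16_feature_shape(224)
shape_with_pool5 = vgg16_feature_shape(224, keep_pool5=True)
```

Dropping POOL5 leaves four pooling stages, so the overall stride is 16 and a 224-pixel side maps to 14; keeping POOL5 would halve it again to 7.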

Proposal Region Generation
The framework incorporates the region proposal network (RPN) to extract candidate regions from the feature maps. Because the RPN shares features with the classification location refinement network during training, the network's detection speed is considerably improved. The RPN is a fully convolutional network (FCN) and can therefore accept images of any scale as input. It takes the feature maps from the last layer of the feature extraction network as input and slides a 3 × 3 window over them to generate a feature vector with 256 dimensions (ZFNet) or 512 dimensions (VGG16). The fully connected layer and the bounding box regression layer then take this vector as their input; they perform classification (distinguishing foreground from background) and position prediction, respectively. Figure 9 provides a schematic diagram of the network.
The candidate regions are generated from anchors, a set of fixed-size reference windows with three scales {128 × 128, 256 × 256, 512 × 512 pixels} and three aspect ratios {1:1, 1:2, 2:1}. These anchors are centered on the 3 × 3 sliding window and serve as references for proposal region generation. The mapping between the anchors and the ground truth is then derived by calculating the central point and size of the anchors. Based on this, anchors are assigned positive labels (IoU > 0.7 with the ground truth) or negative labels (IoU < 0.3), enabling the RPN to learn whether an anchor contains an object.
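The anchor construction and IoU-based labeling described above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, aspect ratios are read as width:height, each anchor keeps the area of its scale, and anchors with an IoU between 0.3 and 0.7 are marked as ignored, which follows common Faster R-CNN practice but is not stated in the text.

```python
import numpy as np

SCALES = (128, 256, 512)     # anchor side lengths in pixels (from the text)
RATIOS = (1.0, 0.5, 2.0)     # width:height ratios 1:1, 1:2, 2:1 (from the text)


def make_anchors(cx, cy):
    """Return the 9 anchors centred at (cx, cy) as [x1, y1, x2, y2] rows."""
    boxes = []
    for s in SCALES:
        for r in RATIOS:
            w = s * np.sqrt(r)   # width/height = r while the area stays s * s
            h = s / np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)


def iou(boxes, gt):
    """Intersection over union of each box with one ground-truth box."""
    x1 = np.maximum(boxes[:, 0], gt[0])
    y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2])
    y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_b + area_g - inter)


def label_anchors(boxes, gt, pos=0.7, neg=0.3):
    """1 = positive, 0 = negative, -1 = ignored (assumed handling)."""
    overlap = iou(boxes, gt)
    labels = np.full(len(boxes), -1)
    labels[overlap > pos] = 1
    labels[overlap < neg] = 0
    return labels


# Usage: a 256 x 256 ground-truth box centred on the anchor position.
anchors = make_anchors(300.0, 300.0)
ground_truth = np.array([172.0, 172.0, 428.0, 428.0])
labels = label_anchors(anchors, ground_truth)
```

In this example only the 256 × 256 anchor with a 1:1 ratio overlaps the ground truth enough to be positive; the two elongated 256-scale anchors fall between the thresholds, and the remaining six are negative.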
During the training of the RPN, the parameters of the network layers shared with the feature extraction network (ZFNet/VGG16) can be directly utilized. For all the added layer parameters, we adopt a Gaussian distribution (0, 0.01) for random initialization and set the momentum to 0.9. The learning rate ε is set to 0.001, and the weight decay is specified as 0.0005. The loss function incorporates both cross-entropy loss and regression loss, ultimately yielding the final mixed loss function as shown in Equation (7).
where $i$ denotes the index of the anchor, and $p_i$ represents the predicted probability that anchor $i$ contains a target. For positive samples, $p_i^*$ is set to 1, and for non-positive samples, $p_i^*$ is set to 0. The term $t_i$ signifies the positional information of the proposal region, encompassing the central position coordinates $(t_x, t_y)$ along with the width $t_w$ and height $t_h$. Similarly, $t_i^*$ represents the ground-truth central point position coordinates $(t_x^*, t_y^*)$ and the corresponding width $t_w^*$ and height $t_h^*$. $L_{cls}$, which denotes the classification loss, is a logarithmic loss function as depicted in Equation (8). On the other hand, $L_{reg}$ signifies the positional regression loss and is expressed in Equation (9):
$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$
where $R$ denotes the robust loss function. The smooth $L_1$ function, shown in Equation (10), is adopted as this robust loss to enhance the stability of the network during training, thereby facilitating more robust and consistent learning.
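For reference, Equations (7), (8) and (10) can be written in the standard Faster R-CNN form, which is consistent with the symbol definitions above. This is a sketch under that assumption: the normalisation terms $N_{cls}$ and $N_{reg}$ and the balancing weight $\lambda$ come from the standard formulation and are not specified in this section.

```latex
\begin{align}
L(\{p_i\},\{t_i\}) &= \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*) \tag{7} \\
L_{cls}(p_i, p_i^*) &= -\log\left[ p_i^* p_i + (1 - p_i^*)(1 - p_i) \right] \tag{8} \\
\mathrm{smooth}_{L_1}(x) &=
  \begin{cases}
    0.5\,x^2, & |x| < 1 \\
    |x| - 0.5, & \text{otherwise}
  \end{cases} \tag{10}
\end{align}
```

The factor $p_i^*$ in the second term of Equation (7) means that the regression loss is only counted for positive anchors.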

Classification Location Refinement Network
The classification location refinement network comprises an ROI pooling layer, a fully connected layer, a classification layer, and a location refinement layer. The inputs to this network are the features extracted by the feature extraction network and the proposal regions generated by the RPN. The output provides the probability of the target classification and the precise positional information of the detected target. Due to variations in the size of the proposed regions, the ROI pooling layer is employed to uniformly sample these regions, which are subsequently forwarded to the fully connected layer. The detailed structural composition of the network is visually depicted in Figure 10.
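How an ROI pooling layer turns regions of different sizes into a fixed-size input can be illustrated with a small pure-Python sketch. It is an assumption-laden example rather than the paper's layer: the function name is hypothetical, the feature map is a single 2-D channel stored as nested lists, and the default 7 × 7 output grid is a commonly used value that this section does not state.

```python
def roi_max_pool(feature, roi, out_h=7, out_w=7):
    """Max-pool one ROI of a 2-D feature map into a fixed out_h x out_w grid.

    feature: list of rows (H x W) of numbers.
    roi: (x1, y1, x2, y2) in feature-map coordinates, inclusive integers.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Row range of bin i: floor for the start, ceiling for the end,
        # so every bin covers at least one row.
        r0 = y1 + (i * roi_h) // out_h
        r1 = max(r0 + 1, y1 + ((i + 1) * roi_h + out_h - 1) // out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = max(c0 + 1, x1 + ((j + 1) * roi_w + out_w - 1) // out_w)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the size of the input region, the output always has `out_h` rows and `out_w` columns, which is what allows the fully connected layer to follow.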

Classification Layer
The softmax function is commonly used in deep learning for multi-class classification problems. It maps a set of real values to a probability distribution in which each element lies between 0 and 1 and all elements sum to 1. The classification layer uses softmax to predict the category to which the region of interest (ROI) belongs. Given a total of $K$ categories, the output has dimension $K + 1$ ($K$ classes plus background), with each element giving the probability that the recognized object belongs to the corresponding class. The classification probability prediction for each ROI is denoted as $p = (p_0, p_1, \ldots, p_K)$, and only the top-1 probability is taken as the result of vehicle type and brand recognition. For a specific class $u$, the class loss function is formally expressed in Equation (11). Minimizing this loss during training drives the network toward accurate classification of the ROIs.
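The classification step can be sketched as follows. The example assumes that Equation (11) takes the usual Fast R-CNN form $L_{cls}(p, u) = -\log p_u$; that form, the function names and the convention that index 0 is the background class are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Map raw class scores to probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_loss(probs, u):
    """Log loss for the true class u: -log(p_u) (assumed form of Equation (11))."""
    return -math.log(probs[u])


def top1(probs):
    """Index of the most probable class (index 0 is taken to be background)."""
    return max(range(len(probs)), key=probs.__getitem__)
```

For an ROI with scores over $K + 1$ classes, `top1(softmax(scores))` gives the reported class, and `class_loss` is small when the probability assigned to the true class is close to 1.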

Position Refinement Layer
The proposed framework employs bounding box regression to refine the localization of objects. Considering $K$ classes, each associated with four positional parameters, the output is a $4 \times K$ dimensional array representing the refined translation and scaling parameters that determine the final output target. For a specific category $u$, where $0 \le u \le K$, the output translation and scaling parameters are expressed as $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$. These parameters signify the four translation and scaling values between the actual and predicted bounding boxes.
Supposing that for this category the ground-truth coordinates are marked in the image as $v = (v_x, v_y, v_w, v_h)$, and the corresponding predicted values are given by $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, the loss function for the position refinement network is formally defined in Equation (12). Minimizing this loss reduces the discrepancy between the predicted and ground-truth bounding box parameters, which improves the network's ability to localize objects accurately.
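To make the translation and scaling parameters concrete, the sketch below uses the parameterization that is standard in the R-CNN family: centre offsets normalised by the reference box size, and log-ratios for width and height. The text does not spell out its exact parameterization, so this form and the function names `encode` and `decode` are assumptions.

```python
import math


def encode(box, ref):
    """Offsets (tx, ty, tw, th) that move reference box `ref` onto `box`.

    Both boxes are (cx, cy, w, h): centre coordinates, width, height.
    """
    cx, cy, w, h = box
    rx, ry, rw, rh = ref
    return ((cx - rx) / rw, (cy - ry) / rh, math.log(w / rw), math.log(h / rh))


def decode(t, ref):
    """Apply offsets `t` to reference box `ref` and return the refined box."""
    tx, ty, tw, th = t
    rx, ry, rw, rh = ref
    return (rx + tx * rw, ry + ty * rh, rw * math.exp(tw), rh * math.exp(th))
```

Under this convention, `encode` produces regression targets of the kind denoted $v$, and `decode` turns predicted offsets $t^u$ back into a box in image coordinates.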
In the classification layer and position refinement process, we employ a multi-task loss function during training. This multi-task loss combines the class loss function specified in Equation (11) and the position refinement loss function defined in Equation (12), weighting them appropriately to derive the final multi-task loss function in Equation (13). By incorporating both classification and localization losses, the network parameters are jointly optimized for both tasks.
where $L(p, u, t^u, v)$ represents the multi-task loss function, with $\lambda$ being a hyperparameter that regulates the relative contribution of the two individual loss functions within the overall multi-task loss. Specifically, when the category corresponds to the foreground, the multi-task loss is a weighted sum of the softmax loss function and the bounding box loss function. Conversely, when the category pertains to the background, the multi-task loss reduces to the softmax loss function alone. Combining the losses in this way enables the model to balance classification accuracy against bounding box localization precision.
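The behaviour described above can be sketched for a single ROI. The example assumes the Fast R-CNN form $L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \ge 1] L_{loc}(t^u, v)$, with $L_{loc}$ a sum of smooth $L_1$ terms over the four box parameters and the foreground test applied to the class index $u$. The default $\lambda = 1$ and the function names are assumptions, not values taken from the paper.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(probs, u, t_u, v, lam=1.0):
    """Classification loss plus, for foreground classes only, weighted box loss.

    probs: softmax output over K + 1 classes (index 0 = background).
    u: class index of the ROI; t_u: predicted offsets for class u;
    v: ground-truth offsets; lam: balancing weight (value assumed).
    """
    l_cls = -math.log(probs[u])
    if u == 0:
        return l_cls                  # background: classification term only
    l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v))
    return l_cls + lam * l_loc
```

For a background ROI the box offsets are ignored entirely, which matches the statement that the loss reduces to the softmax term alone.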

Experimental Results Analysis
This section describes the evaluation procedure, including the parameter settings and the outcomes of the experiments. To assess the efficacy of the proposed model, we tested it on two widely used public datasets, Stanford Cars and BIT-Vehicle, along with Car5_48. These datasets offer a diverse array of samples, enabling a thorough examination of the model's performance and robustness. Through analysis of the experimental results, we aim to demonstrate the effectiveness of the proposed approach compared with existing state-of-the-art methods.

Experimental Results
In our experiments, we utilized two feature networks, ZFNet and VGG16, to train the Faster R-CNN model with the Car5_48 dataset. The maximum number of training iterations was set at 240,000 and 360,000. To enhance the model's generalization capabilities, we adopted the ImageNet dataset and employed a 10-fold cross-validation technique for pre-training the model. The experimental results for vehicle brand recognition are presented in Table 1. The average recognition rate, denoted as mAP, represents the average accuracy across the 48 vehicle brands, including the five specific brands listed. For the ZFNet network, the recognition rate increases from 86.14% to 92.40% as the maximum iteration number increases. In contrast, for the VGG16 network, the difference in recognition rates between the 240,000 and 360,000 maximum-iteration models is marginal, with respective rates of 93.92% and 94.03%. Furthermore, the fluctuation of the loss function tends to plateau, indicating minimal improvement beyond this point. Consequently, to maintain computational efficiency, we limit the maximum number of iterations to 360,000. Ultimately, the vehicle brand detection and recognition network utilizes the VGG16 model trained for 360,000 maximum iterations.
The results for vehicle type recognition are presented in Table 2. Under the VGG16 network trained for 360,000 maximum iterations, the highest average recognition rate achieved is 97.62%. Notably, the recognition rates for buses and trucks reach 98.86% and 99.56%, respectively, while the rates for other vehicle types are slightly lower. This can be attributed to the distinctiveness of bus and truck appearances, which facilitates easier classification. Balancing accuracy and computational efficiency, we opt to use the VGG16 network trained for 360,000 maximum iterations to perform vehicle type classification. This choice provides both a high level of accuracy and a reasonable training time.

Comparison of Single-Angle and Multi-Angle Models
To ascertain the impact of multi-angle images on the model's performance, this section undertakes a comparative analysis between models trained using single-angle and multi-angle images. Specifically, the VGG16 network was trained separately on each of the eight angles available in the Car5_48 dataset, namely: front (f), behind (b), right-front (rf), left-front (lf), left-side (ls), right-side (rs), right-behind (rb), and left-behind (lb). These ...

vehicle types, making them more challenging to distinguish. mAP2 represents the vehicle type accuracy across different datasets. Analysis of Table 4 reveals that the Car5_48 dataset exhibited the highest recognition rate (97.62%), followed by the BIT-Vehicle dataset (94.1%). The Stanford Cars dataset had the lowest recognition rate at 88.3%. The superior performance of the car recognition network on the Car5_48 dataset can be primarily attributed to the similarity in the test and training sample collection environments and angles. This consistency allowed for a more accurate and robust classification of vehicle types.
Table 5 presents a comparative analysis of the recognition results on the BIT-Vehicle dataset. The proposed method achieved a recognition accuracy of 94.10%, which is 1.21 percentage points higher than the accuracy reported by Dong Z et al. [13] and 2.80 percentage points higher than that of Sang Jun et al. [15]. It is noteworthy that Dong Z et al. [13] utilized a convolutional neural network (CNN) for feature extraction, but their approach was limited to recognizing vehicle images from a single angle, thus discarding valuable detailed information. Similarly, Sang Jun et al. [15] employed a method similar to the one used in this study, but their detection was restricted to the front of the vehicle, which also resulted in a loss of more comprehensive details and compromised the robustness of the model to changes in vehicle angle. In contrast, the proposed method leverages multi-angle images to capture richer and more detailed feature information, enhancing the overall accuracy and robustness of the vehicle recognition system.

| Methods | Recognition Angle | mAP |
| --- | --- | --- |
| Dong Z [13] | Single angle | 92.89% |
| Sang Jun [15] | Single angle | 91.30% |
| The method in this paper | Multi-angle | 94.10% |

Figure 11 displays a representative selection of identification outcomes. The vehicle identification results depicted in the figure cover a range of types, including a bus, microbus, SUV, sedan, minivan, and truck. Notably, the images were captured from diverse shooting angles, underscoring the robustness and adaptability of the identification system across various perspectives.

Conclusions
In this paper, a comprehensive multi-angle vehicle type and brand recognition method is constructed utilizing the Faster R-CNN framework. This approach addresses the challenges associated with the multi-pose and multi-angle distribution of vehicle information recognition. Furthermore, to address the limitations of single-shot data collections in conventional datasets, a comprehensive vehicle type and brand dataset from eight diverse angles, designated as Car5_48, was created. Experimental evaluations demonstrate that the Faster R-CNN, when applied to multi-angle recognition, surpasses current state-of-the-art methodologies and enhances the overall robustness of the framework. This research contributes to the advancement of vehicle recognition techniques.
displays a subset of the Car5_48 dataset, showcasing vehicle models of various brands and models captured under the eight shooting angles.

Figure 1. Schematic diagram of data collection angle.

Figure 4. Image intercepted of a direct vehicle.

Figure 5. Image intercepted of a left-front vehicle.

Figure 6. Architecture of the recognition network.
Figure 8a,c represent the feature maps of the first and second convolutional layers, respectively, while Figure 8b,d depict the maps of the first and second pooling layers.

Figure 8. Examples of feature maps from convolutional layers (a,c) and pooling layers (b,d).

Figure 9. Architecture of the region proposal network.

Figure 11. Example images of multi-angle vehicle brand position and recognition results.

Table 1. Vehicle brand recognition network results.

Table 2. Vehicle type recognition network results.

Table 5. Comparison of recognition accuracy of different methods.
