Blind-Spot Collision Detection System for Commercial Vehicles Using Multi Deep CNN Architecture

Buses and heavy vehicles have more blind spots than cars and other road vehicles due to their large sizes. Therefore, accidents caused by these heavy vehicles are more fatal and result in severe injuries to other road users. These possible blind-spot collisions can be identified early using vision-based object detection approaches. Yet, existing state-of-the-art vision-based object detection models rely heavily on a single feature descriptor for making decisions. In this research, the design of two convolutional neural networks (CNNs) based on high-level feature descriptors and their integration with faster R-CNN is proposed to detect blind-spot collisions for heavy vehicles. Moreover, a fusion approach is proposed that integrates two pre-trained networks (i.e., ResNet-50 and ResNet-101) to extract high-level features for blind-spot vehicle detection. The fusion of features significantly improves the performance of faster R-CNN and outperforms the existing state-of-the-art methods. Both approaches are validated on a self-recorded blind-spot vehicle detection dataset for buses and on the online LISA vehicle detection dataset. The two proposed approaches achieve false detection rates (FDRs) of 3.05% and 3.49% on the self-recorded dataset, making them suitable for real-time applications.


Introduction
Although bus accidents are relatively rare around the globe, approximately 60,000 buses are still involved in traffic accidents in the United States every year. These accidents lead to 14,000 non-fatal injuries and 300 fatal injuries [1]. Similarly, every year in Europe approximately 20,000 buses are involved in accidents that cause approximately 30,000 (fatal and non-fatal) injuries [2]. These accidents mostly occur due to thrill-seeking driving, speeding, fatigue, stress, and aggressive driver behaviors [3,4]. Accidents involving buses and other road users, such as pedestrians, bicyclists, motorcyclists, or car drivers and passengers, usually cause more severe injuries to these road users [5][6][7][8].
The collision detection systems of cars mostly focus on front and rear-end collision scenarios [9][10][11][12]. In addition, different drowsiness detection techniques have been proposed to detect car drivers' sleep deprivation and prevent possible collisions [13,14]. At the same time, buses operate in a complicated environment with a significant number of unintended hazards, such as pulling out from bus stops, passengers unloading, pedestrians crossing in front of buses, and bus stop structures [15][16][17]. Additionally, buses have higher chances of side collisions due to constrained spaces and maneuverability [15]. Especially at turns, researchers found that the task demand on bus drivers is very high [16,17].
Further, heavy vehicles and buses, which have more blind spots than cars and other road users in these environments, are at higher risk of collisions [18][19][20]. Many countries have initiated improvements to heavy vehicle and bus safety through the installation of additional mirrors. Yet, there are still some blind-spot areas where drivers cannot see other road users [21,22]. In addition, buses may have many passengers on board, and a significant number of on-board passenger incidents have been reported due to sudden braking or stopping [23]. These challenges may entail different collision detection requirements for public/transit buses than for cars. A blind-spot collision detection system can be designed for buses to predict impending collisions in their proximity and to reduce operational interruptions. It could provide adequate time for the driver to smoothly apply the brakes or take other precautionary measures to avoid imminent collision threats, as well as avoid injuries and trauma inside the bus.
Over the past few years, many types of collision detection techniques have been proposed [9,10,[24][25][26]. Among these, vision-based collision detection techniques provide reliable detection of vehicles across a large area [9,10,26], since cameras provide a wide field of view. Several vision-based blind-spot collision detection techniques for cars and other vehicles have been proposed [9][10][11][12][27][28][29]. In vision-based techniques, the position of the camera plays a significant role: depending on where the camera is installed, vision-based blind-spot collision detection systems are categorized as rear camera-based [11,30] or side camera-based systems [26][27][28][29]. Rear camera-based vision systems detect vehicles by acquiring images from a rear fish-eye camera. The major drawback of a rear fish-eye camera is that captured vehicles suffer from severe radial distortion, leading to huge differences in appearance at different positions [11].
In contrast, side camera-based vision systems have the camera installed directly at the bottom or next to the side mirrors that directly face the blind spot and detect the approaching vehicles. In these systems, the vehicle appearance drastically changes with its position; yet, it has the advantage of high resolution images for vehicle detection [11].
In vision-based blind-spot vehicle detection, deep convolutional neural network (CNN) models often achieve better performance [10,12] than conventional machine learning models (based on appearance, histogram of oriented gradients (HOG) features, etc.) [11,27,28]. This is because convolutional layers can extract and learn richer features from the raw RGB channels than hand-crafted descriptors such as HOG. However, blind-spot vehicle detection remains challenging on account of the large variations in appearance and structure, especially the ubiquitous occlusions that further increase intra-class variations.
Recently, deep learning techniques have proven to be a game changer in object detection. Many deep learning models have been proposed to detect objects of different types and sizes in images [31][32][33]. Among these models, two-stage object detectors show better accuracy than one-stage object detectors [34][35][36]. Therefore, a two-stage object detector such as faster R-CNN [31] seems more suitable for blind-spot vehicle detection. In faster R-CNN, a self-designed CNN or a pre-trained network (such as VGG16, ResNet-50, or ResNet-101) is used to extract a feature map [37,38]. These pre-trained networks are trained on large datasets and have been shown to outperform simple convolutional neural networks (CNNs). In medical applications, it has been reported that multi-CNNs perform much better in residual feature extraction and classification than single CNNs [39][40][41].
In this paper, we propose a novel blind-spot vehicle detection technique for commercial vehicles based on multiple convolutional neural networks (CNNs) and faster R-CNN. Two different CNN-based approaches/models with faster R-CNN as the object detector are proposed for blind-spot vehicle detection. In the first approach/model, two self-designed CNN networks are used to extract features, and their outputs are concatenated and fed to a third self-designed CNN. Next, faster R-CNN uses these high-level features for vehicle detection. In the second approach/model, two ResNet networks (ResNet-50 and ResNet-101) are concatenated with the self-designed CNN to extract features. Finally, these extracted features are fed to faster R-CNN for blind-spot vehicle detection. The scientific contributions of this research are as follows:
1. Design of two high-level CNN based feature descriptors for blind-spot vehicle detection for heavy vehicles;
2. Design of a fusion technique for different high-level feature descriptors and its integration with faster R-CNN, along with a performance comparison against existing state-of-the-art approaches;
3. Introduction of a fusion technique for pre-trained high-level feature descriptors for object detection applications.

Related Work
Recent deep convolutional neural network (CNN) based algorithms demonstrate extraordinary performance in various vision tasks [42][43][44][45]. Convolutional neural networks extract features from raw images through large-scale training, with high flexibility and generalization capabilities. The first CNN based object detection and classification system was presented in 2013 [46,47]. Since then, many deep learning-based object detection and classification models have been proposed, including the region-based convolutional neural network (R-CNN) [48], fast R-CNN [49], faster R-CNN [31], the single shot multibox detector (SSD) [50], R-FCN [51], you only look once (YOLO) [32], and YOLOv2 [33].
R-CNN models achieve promising detection performance and are a commonly employed paradigm for object detection [48]. They involve several essential steps: object region proposal generation with selective search (SS), CNN feature extraction, and classification and regression of the selected objects based on the obtained CNN features. However, training the network incurs large time and computation costs due to the repeated extraction of CNN features for thousands of object proposals [52].
In fast R-CNN [49], the feature extraction process is accelerated by sharing the forward pass computation. However, because region proposals are still generated by selective search (SS), it remains slow and requires significant computational capacity to train. In faster R-CNN [31], region proposal generation using SS was replaced by proposal generation using a CNN (the region proposal network). This reduces the computational burden and makes the network more efficient and faster than R-CNN and fast R-CNN.
YOLO [32] frames object detection as a regression problem over separate bounding boxes and associated class probabilities. In YOLO, a single CNN predicts the bounding boxes and the class probabilities for these boxes. It utilizes a custom network based on the GoogLeNet architecture. An improved model called YOLOv2 [33] achieves comparable results on standard tasks. YOLOv2 employs a new backbone called Darknet-19, which has 19 convolutional layers and 5 max-pooling layers and takes only 5.58 s to compute results. However, the YOLOv2 network still lacks some important elements: it has no residual blocks, no skip connections, and no up-sampling.
The YOLOv3 network is the advanced version of YOLOv2 and incorporates all of these important elements. YOLOv3 uses a 53-layer backbone trained on ImageNet; for object detection, 53 more layers are stacked onto it, giving a 106-layer fully convolutional architecture [53]. Recently, two newer versions of YOLO were introduced, named YOLOv4 and YOLOv5 [54,55]. Besides YOLO, there are other one-stage object detectors, such as SSD [50] and RetinaNet [34].
Recent studies show that two-stage object detectors obtain better accuracy than one-stage object detectors [34,35], making faster R-CNN a suitable candidate for blind-spot vehicle detection. However, in these object detectors, the overall system accuracy depends profoundly on the feature set obtained from the neural networks. Recent object detectors have also proposed collecting features from different stages of the neural network to improve performance [56,57]. In medical applications, it has been demonstrated that using multiple feature extractors can significantly improve system accuracy [39][40][41].
Thus, to increase system accuracy, this research proposes blind-spot vehicle detection approaches based on multiple CNN networks. Along with the fusion of self-designed convolutional neural networks, system performance is also investigated using a fusion approach for pre-trained convolutional neural networks.

Proposed Methodology
The proposed methodology comprises several steps, including pre-processing of datasets, anchor boxes estimation, data augmentation, and multi CNN network design, as shown in Figure 1.

Pre-Processing
For the self-recorded dataset, image labels were created using MATLAB 2019a "Ground Truth Labeller App", whereas for the online dataset, ground truths were provided with the image set. Next, images were resized to 224 × 224 × 3 to enhance the computation performance of the proposed deep neural networks.

Anchor Boxes Estimation
Anchor boxes are important parameters of deep learning object detectors. The shape, scale, and number of anchor boxes impact the efficiency and accuracy of the object detector. Figure 2 plots the aspect ratio against the box area for the self-recorded dataset. The plot reveals that many vehicles have a similar size and shape; however, vehicle shapes are still spread out, indicating the difficulty of choosing anchor boxes manually. Therefore, the clustering algorithm presented in [33] was used to estimate the anchor boxes. It groups similar boxes together using a meaningful distance metric.
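To make this estimation step concrete, the sketch below clusters ground-truth box sizes with a k-means that uses the 1 - IoU distance from [33]. This is a minimal Python illustration (the paper's implementation is in MATLAB), and all function names here are ours, not the authors':

```python
import random

def iou_wh(box, centroid):
    """IoU of two (width, height) boxes anchored at a shared corner."""
    w = min(box[0], centroid[0])
    h = min(box[1], centroid[1])
    inter = w * h
    return inter / (box[0] * box[1] + centroid[0] * centroid[1] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) boxes with the 1 - IoU distance of YOLOv2 [33]."""
    random.seed(seed)
    centroids = random.sample(boxes, k)
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU.
        clusters = [[] for _ in range(k)]
        for b in boxes:
            idx = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[idx].append(b)
        # Recompute centroids as the per-cluster mean width/height.
        new = []
        for i, c in enumerate(clusters):
            if not c:
                new.append(centroids[i])
            else:
                new.append((sum(b[0] for b in c) / len(c),
                            sum(b[1] for b in c) / len(c)))
        if new == centroids:   # converged
            break
        centroids = new
    return sorted(centroids)
```

With two clearly separated size clusters, the returned anchors are simply the mean width/height of each cluster.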

Data Augmentation
In this work, data augmentation is performed to minimize over-fitting and to improve the proposed network's robustness against noise. A random brightness augmentation technique is used to perturb the images: each image is randomly darkened or brightened, with darkening factors drawn from [0.5, 1.0] and brightening factors drawn from [1.0, 1.5].
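As a minimal Python illustration of this augmentation (the original pipeline is in MATLAB; the function name is ours), pixel intensities are scaled by a factor drawn from one of the two ranges and clipped to the valid range:

```python
import random

def augment_brightness(pixels, seed=None):
    """Scale pixel intensities by a random factor: darkening draws from
    [0.5, 1.0] and brightening from [1.0, 1.5] (the ranges used in the
    text), then clips the result to the valid [0, 255] range."""
    rng = random.Random(seed)
    if rng.random() < 0.5:
        factor = rng.uniform(0.5, 1.0)   # darken
    else:
        factor = rng.uniform(1.0, 1.5)   # brighten
    return [min(255.0, max(0.0, p * factor)) for p in pixels]
```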

Proposed CNNs and Their Integration with Faster R-CNN
Initially, the same images are fed to two different deep learning networks to extract high-level features. Subsequently, these high-level features are fed to another CNN architecture that combines and smooths them. Finally, faster R-CNN based object detection is performed to detect impending collisions. The layer-wise connection of the deep learning architectures and their integration with faster R-CNN is shown in Figure 3. Two types of high-level feature descriptors are investigated: (1) self-designed convolutional networks and (2) pre-trained convolutional networks, as shown in Figure 3. Additional details of these feature descriptors are given below.

Self-Designed High-Level Feature Descriptors
In the first approach, multiple self-designed convolutional neural networks are connected to the faster R-CNN network. The layer-wise connection of the two self-designed CNN networks (named DConNet and VeDConNet) is shown in Figure 4. Initially, DConNet and VeDConNet are used to extract deep features, and their outputs are provided to a third 2D CNN architecture for feature addition and smoothing. Both the DConNet and VeDConNet architectures consist of five convolutional blocks. In DConNet, each of the five blocks is composed of two 2D convolutional layers, each followed by a ReLU layer, with a max-pooling layer at the end of the block. In VeDConNet, the initial two blocks are similar to those of DConNet: they consist of two 2D convolutional layers, each followed by a ReLU activation function, with a max-pooling layer after the second ReLU. The other three blocks of VeDConNet comprise four convolutional layers, each followed by a ReLU layer, with a max-pooling layer after the fourth ReLU activation function.
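The block structure described above can be summarized programmatically. The following Python sketch builds symbolic layer lists only (no deep learning framework involved); the filter counts come from the implementation section later in the paper, and the tuple vocabulary and function names are ours:

```python
def conv_block(n_filters, n_convs):
    """One block: n_convs (conv 3x3 + ReLU) pairs followed by max-pooling."""
    layers = []
    for _ in range(n_convs):
        layers += [("conv3x3", n_filters), ("relu", None)]
    layers.append(("maxpool2x2", None))
    return layers

# Per-block filter counts, input to output (see "Network Implementation Details").
FILTERS = [64, 128, 256, 512, 512]

def dconnet():
    """DConNet: five blocks, two conv+ReLU pairs each."""
    layers = []
    for f in FILTERS:
        layers += conv_block(f, 2)
    return layers

def vedconnet():
    """VeDConNet: two 2-conv blocks, then three 4-conv blocks."""
    layers = []
    for i, f in enumerate(FILTERS):
        layers += conv_block(f, 2 if i < 2 else 4)
    return layers
```

Counting the entries confirms the text: DConNet has 10 convolutional layers and VeDConNet has 16, with 5 max-pooling layers each.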

Pre-Trained Feature Descriptors
In the second approach, two pre-trained convolutional networks (i.e., ResNet-101 and ResNet-50) are linked with the third CNN architecture, which is in turn connected to the faster R-CNN network for vehicle detection. The features are obtained from the ReLU Res4b22 layer of ResNet-101 and the ReLU 40 layer of ResNet-50, as shown in Figure 5.

Features Addition and Smoothness
The high-level features obtained from the two self-designed/pre-trained CNN architectures are added together through the addition layer, as shown in Figure 3. Let F1(x) and F2(x) be the outputs of the first and second deep neural networks; their addition H(x) is then given as:

H(x) = F1(x) + F2(x)

The addition layer is followed by a convolutional layer and a ReLU activation function for feature smoothing.
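A minimal Python sketch of the fusion, assuming the two feature maps are flattened to equal-length lists; the smoothing convolution that follows the addition in the actual architecture is omitted here for brevity, with only the ReLU kept:

```python
def fuse_features(f1, f2):
    """Element-wise addition H(x) = F1(x) + F2(x) of two feature maps
    (flattened to lists here), followed by a ReLU. The 3x3 smoothing
    convolution of the real architecture is intentionally omitted."""
    assert len(f1) == len(f2), "feature maps must share the same shape"
    return [max(0.0, a + b) for a, b in zip(f1, f2)]
```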

Integration with Faster R-CNN
As shown in Figure 3, faster R-CNN takes the high-level features from the ReLU layer to perform blind-spot vehicle detection. The obtained feature map is fed to the region proposal network (RPN) and the ROI pooling layer of faster R-CNN. The loss function of faster R-CNN can be divided into two parts, the R-CNN loss [49] and the RPN loss [31], both of which take the multi-task form:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where p_i is the predicted object probability of anchor (or proposal) i, p_i* is its ground-truth label, t_i and t_i* are the predicted and ground-truth box regression parameters, L_cls is a log loss, L_reg is a smooth L1 loss, and λ balances the two terms. A detailed description of the faster R-CNN architecture and these loss terms is given in references [31,49,58].

Results and Discussion
In this section, the vehicle detection using the proposed deep learning models is discussed in detail. We compared the performance of both approaches with each other and with the state-of-the-art benchmark approaches. This section also includes the dataset description along with the details of the proposed network implementation.

Dataset
A blind-spot collision dataset was recorded by attaching cameras to the side mirrors of a bus. The placement of the cameras is shown in Figure 6. The dataset was recorded in Ipoh and Seri Iskandar and along the Ipoh-Lumut highway in Perak, Malaysia. Ipoh is a city in northwestern Malaysia, whereas Seri Iskandar is located about 40 km southwest of Ipoh. Universiti Teknologi PETRONAS is also located in the new township of Seri Iskandar. Data were recorded in multiple round trips from Seri Iskandar to Ipoh under different lighting conditions. In addition, data were recorded in the cities of Ipoh and Seri Iskandar for dense traffic scenarios. Moreover, Malaysia has a tropical climate and rainfall remains high year-round, which allowed us to easily record data in different weather conditions. Finally, a set of 3000 images in which vehicles appeared in blind-spot areas was selected from the self-recorded dataset.
To the best of our knowledge, there is no publicly available online dataset for heavy vehicles. Therefore, a publicly available online dataset named "Laboratory for Intelligent and Safe Automobiles (LISA)" [59] for cars was used to validate the proposed method. In the LISA dataset, the camera was installed at the front of the car. The detailed description of both datasets is given in Table 1. Both datasets are divided randomly into 80% for training and 20% for testing.

Network Implementation Details
The proposed work was implemented on an Intel® Xeon(R) E-2124G CPU @ 3.40 GHz (installed memory 32 GB) with an NVIDIA Quadro P4000 (GP104GL) graphics card. MATLAB 2019a was used as the platform to investigate the proposed methodology.
In the first approach, both CNN based feature extraction architectures (i.e., DConNet and VeDConNet) have five blocks with N convolutional filters per block, where, from input to output, N = [64, 128, 256, 512, 512]. Moreover, after the addition layer, there was also a convolutional layer with a total of 512 filters. For all these convolutional layers, the filter size was 3 × 3, and ReLU was used as the activation function. The stride and pool size of the max-pooling layers were both 2 × 2.
In the second approach, for Resnet 101 and Resnet 50, standard weights were used. Moreover, after the addition layer, there was a convolution layer with a total of 512 filters and ReLU as an activation function.
In both approaches, we used an SGDM optimizer with a learning rate of 10^-3 and a momentum of 0.9. The batch size was set to 20 samples, and the verbose frequency was set to 20. Negative training samples were defined as samples that overlap the ground-truth boxes by 0 to 0.3 (IoU), whereas positive training samples overlap the ground-truth boxes by 0.6 to 1.0.
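The overlap-based sample selection can be sketched as follows. This is a Python illustration with our own helper names (the actual training uses MATLAB's faster R-CNN pipeline); proposals whose best overlap falls between the two ranges are simply ignored during training:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, gt_boxes,
                   pos_range=(0.6, 1.0), neg_range=(0.0, 0.3)):
    """Label a region proposal with the overlap thresholds from the text:
    positive if its best IoU lies in [0.6, 1.0], negative if in [0, 0.3],
    otherwise ignored."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if pos_range[0] <= best <= pos_range[1]:
        return "positive"
    if neg_range[0] <= best <= neg_range[1]:
        return "negative"
    return "ignored"
```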

Evaluation Metrics
The existing state-of-the-art approaches measure performance in terms of true positive rate (TPR), false detection rate (FDR), and frame rate [59][60][61][62]. Therefore, the same metrics are used to evaluate the proposed models. TPR (also known as sensitivity) measures the ability to correctly detect blind-spot vehicles. FDR is the fraction of false blind-spot vehicle detections among the total detections. The frame rate is defined as the total number of frames processed in one second [60]. If TP, FN, and FP represent the true positives, false negatives, and false positives, respectively, then TPR and FDR are given as:

TPR = TP / (TP + FN)

FDR = FP / (FP + TP)
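These two metrics can be computed directly from the detection counts, as in the short Python sketch below (function names are ours):

```python
def tpr(tp, fn):
    """True positive rate (sensitivity): TP / (TP + FN)."""
    return tp / (tp + fn)

def fdr(fp, tp):
    """False detection rate: FP / (FP + TP)."""
    return fp / (fp + tp)
```

For example, 95 correct detections with 5 missed vehicles gives a TPR of 0.95, and 3 false detections among 100 total detections gives an FDR of 0.03.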

Results Analysis
The proposed approaches/models successfully detected vehicles in both the self-recorded and online datasets. Sample blind-spot detections are shown in Figure 7, which shows that the proposed CNN based models detected different types of vehicles, including light and heavy vehicles and motorbikes, in different scenarios and lighting conditions. The proposed work recognized multiple vehicles simultaneously, as shown in Figure 7a,b. These figures also show the presence of shadows along with the vehicles, highlighting the significance of the proposed vehicle detection algorithm: it differentiated remarkably well between real vehicles and their shadows, which leads to a notable reduction in possible false detections.
Furthermore, Figure 7c,d show that the proposed technique detects a motorcyclist approaching and driving very close to the bus. A small mistake by the bus driver in such scenarios could lead to a fatal accident; therefore, blind-spot collision detection systems are very important for heavy vehicles.
Similarly, vehicle detection on the online LISA dataset [59] is shown in Figure 8, which shows that our models successfully detected all types of vehicles in different scenarios. Figure 8a,b show the detection of vehicles in dense scenarios; the proposed models reliably detected multiple vehicles simultaneously, even in the presence of vehicle shadows on the road. Figure 8c,d exhibit the detection of vehicles on a highway, and Figure 8e,f show the detection of vehicles in urban areas. In both figures, lane markers are present on the road and were successfully ignored by the proposed systems. Furthermore, Figure 8f shows a person crossing the road, which could lead to a false detection; however, our models identified the vehicle and successfully differentiated between the person and the vehicle. In the LISA dataset, labels were provided only for vehicles; therefore, the proposed model detected only the vehicle.

The true positive rate (TPR) and false detection rate (FDR) of the proposed approaches on the different sets of data are visualized in Figure 9, which shows that both approaches delivered reliable outcomes for the self-recorded as well as the online datasets. The TPR obtained from faster R-CNN with the pre-trained fused (ResNet-101 and ResNet-50) high-level feature descriptors is slightly higher than that of faster R-CNN with the proposed fused (DConNet and VeDConNet) feature descriptors. However, faster R-CNN with the proposed feature descriptors yields a lower FDR for the self-recorded dataset and a comparable FDR for the LISA-Urban dataset. The frame rate (frames per second) for each dataset and both approaches is given in Table 2, which shows that the first model has a comparatively better frame rate.
The pre-trained model (i.e., faster R-CNN with the high-level feature descriptors of ResNet-101 and ResNet-50) took more time to compute features than the model presented in the first approach (i.e., faster R-CNN with the high-level feature descriptors of DConNet and VeDConNet). Hence, the model presented in the first approach is better suited to vehicle detection scenarios where low computational time is required. Detailed comparisons of TPR, FDR, and frame rate between the existing state-of-the-art techniques and our proposed models are presented in Table 3. In addition, a graphical comparison of the true positive and false detection rates (i.e., TPR and FDR) of both models against the existing state-of-the-art approaches is given in Figure 10. From Table 3, it can be deduced that our models achieved significantly better results than the existing methods (both deep learning and machine learning models). The deep learning model presented by S. Roychowdhury et al. (2018) [61] achieved 100% and 98% TPR for the LISA-Urban and LISA-Sunny datasets, respectively. The proposed model (i.e., faster R-CNN with the high-level feature descriptors of DConNet and VeDConNet) achieved a higher TPR for the LISA-Sunny dataset and a very close TPR for the LISA-Urban dataset. Our model outperformed all the existing methods in terms of FDR: a very low false detection rate was obtained for all three online datasets (LISA-Dense, LISA-Sunny, and LISA-Urban) compared to the existing machine/deep learning techniques. Moreover, higher TPR values were obtained for all three LISA datasets compared to the existing machine learning techniques. Figure 10 shows that, for the first model, the FDR is less than 4% for all datasets, making it suitable for real-time applications. Further, the TPR values are almost constant across all datasets.
It shows that the model achieved a reliable result for all types of scenarios.

Discussion
The proposed approaches successfully detected different types of vehicles, such as motorcycles, cars, and trucks. In addition, both approaches proved reliable under the dense traffic conditions of the online LISA dataset. The fusion of pre-trained networks provided higher accuracy for both the self-recorded and online datasets than the first approach, in which two self-designed CNNs are used. However, the first approach achieved a higher frame rate than the second approach.
For the online datasets, both approaches obtained accuracy that is either higher than or comparable to the existing state-of-the-art approaches, as given in Table 3. For LISA-Dense, the highest TPR of 98.06% was obtained by the second proposed approach, followed by the first approach with 97.87%. Further, the machine learning approaches in [60], by R. K. Satzoda (2016) [62], and by S. Sivaraman (2010) [59] reported TPR values of 95.01%, 94.50%, and 95%, respectively. S. Roychowdhury et al. (2018) [61] did not report results for the LISA-Dense dataset. For LISA-Urban, the highest TPR was obtained by S. Roychowdhury et al. (2018) [61], followed by the proposed second approach, whereas the lowest TPR of 91.70% was obtained by S. Sivaraman [59].
Figure 10 shows that the fusion of features significantly improved the performance of faster R-CNN. A notable reduction in false detections was found for the online datasets compared to the deep learning [61] and machine learning approaches [59,60,62]. A system with a lower false detection rate will produce fewer false warnings and thus increase drivers' trust in the system. It has been found in the literature that collision warnings reduce the attention resources required to process the target correctly [63]. In addition, collision warnings facilitate the sensory processing of the target [64,65]. Finally, the results of our fusion technique are in line with the studies in [39][40][41].
With regard to the comparison between the two approaches, the model presented in the first approach obtained a lower FDR than the model presented in the second approach for the self-recorded and LISA-Urban datasets. In addition, the first model has a higher frame rate for all the datasets. For the remaining TPR and FDR values, the second model outperformed the first. Therefore, there is a slight trade-off between performance and computation time.

Conclusions and Future Work
In this research, we propose deep neural architectures for blind-spot vehicle detection for heavy vehicles. Two different models for feature extraction are used with the faster R-CNN network, and the high-level features obtained from both networks are fused together to improve network performance. The proposed models successfully detected blind-spot vehicles with reliable accuracy on both the self-recorded and publicly available datasets. Moreover, the fusion of feature extraction networks improved the results significantly, and a notable increase in performance was observed. In addition, we compared our fusion models with the state-of-the-art benchmark machine learning and deep learning approaches. Our proposed work outperformed all the existing approaches for vehicle detection in various scenarios, including dense traffic, urban surroundings, with and without pedestrians, shadows, and different weather conditions. The proposed model is applicable not only to buses but also to other heavy vehicles, such as trucks, trailers, and oil tankers. This research work is limited to the integration of only two convolutional neural networks with faster R-CNN. In the future, more than two convolutional neural networks may be integrated with faster R-CNN, and a parametric study of accuracy and frame rate may be performed.

Acknowledgments: We express our gratitude and acknowledgment to the Centre for Intelligent Signal and Imaging Research (CISIR) and the Electrical and Electronic Engineering Department, Universiti Teknologi PETRONAS (UTP), Malaysia.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: