A Nighttime Vehicle Detection Method with Attentive GAN for Accurate Classification and Regression

Vehicle detection plays a vital role in the design of Automatic Driving System (ADS), which has achieved remarkable improvements in recent years. However, vehicle detection in night scenes still has considerable challenges for the reason that the vehicle features are not obvious and are easily affected by complex road lighting or lights from vehicles. In this paper, a high-accuracy vehicle detection algorithm is proposed to detect vehicles in night scenes. Firstly, an improved Generative Adversarial Network (GAN), named Attentive GAN, is used to enhance the vehicle features of nighttime images. Then, with the purpose of achieving a higher detection accuracy, a multiple local regression is employed in the regression branch, which predicts multiple bounding box offsets. An improved Region of Interest (RoI) pooling method is used to get distinguishing features in a classification branch based on Faster Region-based Convolutional Neural Network (R-CNN). Cross entropy loss is introduced to improve the accuracy of classification branch. The proposed method is examined with the proposed dataset, which is composed of the selected nighttime images from BDD-100k dataset (Berkeley Diverse Driving Database, including 100,000 images). Compared with a series of state-of-the-art detectors, the experiments demonstrate that the proposed algorithm can effectively contribute to vehicle detection accuracy in nighttime.


Introduction
With the continuous increase in the number of vehicles on the road, traffic accidents occur frequently, and traffic safety problems are becoming more and more serious. With the aim of fundamentally alleviating road traffic pressure and reducing the occurrence of traffic accidents, Intelligent Transportation Systems (ITS) [1] was proposed. As a part of the ITS, the intelligent driving system [2] uses advanced information technology to provide road driving assistance to the driver to make up for the lack of driver response capabilities. Vehicle detection is the most critical part of ITS, which provides important support for solving traffic problems. A variety of important road traffic information can be obtained through vehicle detection process, such as vehicle location, vehicle type, and distance between vehicles, which can provide a basis for judging vehicle driving and help connectivity in urban road scenes [3,4]. Vehicle detection is an important part of Automatic Driving System (ADS), which has achieved great improvements in recent years. The current research on vehicle detection has very important theoretical significance and application value.
Vehicle detection usually detects vehicles accurately by extracting vehicle features. The vehicle characteristics are more obvious, and vehicle detection is more accurate in the daytime. However, in night conditions, vehicle detection experiences difficulties, owing to unclear vehicle features and the complicated lighting environment. In the night environment, the visibility of the road environment becomes poor, while the road condition information obtained by the driver is often inaccurate, which makes the situation very 1.
An improved GAN is used to acquire representative features, which is named Attentive GAN. U-Net with an attention module is used as the generator and the global and local discriminator is used to balance the local dark regions and overall dark area; 2.
To get accurate target localization, a multiple local regression is employed in the regression branch, which predicts multiple bounding box offset; 3.
For precise classification, an improved RoI pooling method is used in the classification branch which assigns different weights to different sampling points based on deformable RoI pooling.

Related Work
Vehicle detection and state prediction methods [11,12] are very important and suitable for all road sections. Therefore, researchers have put significant efforts on the design of a vehicle detection algorithm to improve the detection accuracy and help drivers gain road traffic information easily [13]. At present, vehicle detection methods [14][15][16] mainly involve two types: traditional vehicle detection methods and deep learning-based vehicle detection methods.
The method using vehicle light information as vehicle features to detect and recognize the acquired vehicle features is one of the most widely used traditional vehicle detection methods, which cannot eliminate the interference of road reflections and is inaccurate in detecting vehicles in two-way driving lanes [17]. Another traditional method is Deformable Parts Model (DPM) [18], which has strong robustness to the deformation of the target through the detection method of parts. Subsequently, Chen et al. [19] first used Gaussian filtering to enhance the contrast of the taillights, extracted the value of the C r channel in the YC b C r space, and selected the region of interest through an adaptive threshold, and then performed the fast Fourier transform, and used the AdaBoost classifier for training and classification. Kim J et al. [20] proposed improvements on the traditional HOG features and proposed πHOG features based on location information and light intensity. According to the πHOG feature, SVM [21], ELM [22], and KNN [23] were used to train and detect vehicles, respectively. Traditional detection methods need to extract artificial features which have strict requirements regarding the environment. The training of SVM classifiers requires prior knowledge of the characteristics of related vehicles, which is complicated in operation and difficult to apply to practical applications.
With the tremendous progress made in deep learning research, the use of convolutional neural network (CNN) [24,25] to solve image detection problems has become a major trend. Compared with conventional methods, CNN-based object detectors have gained significant improvements in vehicle detection. Deep learning-based vehicle detection approaches roughly fall into two main types of groups: one-stage detection approaches [26][27][28][29] and two-stage detection approaches [30][31][32][33]. The one-stage vehicle detection method does not Entropy 2021, 23, 1490 3 of 17 need to choose candidate regions, but directly converts the classification and positions of the target into a regression problem. It is a vehicle detection method with a simple network structure and real-time performance. Two-stage detection approaches need to generate region proposals and use these proposals to detect objects, which have higher accuracy. Focusing on improving the performance of vehicle detection, a significant amount of research is conducted on the basis of basic vehicle detection methods. Huang et al. [34] proposed a model called M-YOLO which uses the lightweight network MobileNetv2 as the feature extraction backbone network. K-means algorithm was used to cluster the dataset. For fast vehicle detection, Hoanh Nguyen [35] proposed an improved framework based on Faster R-CNN [36]. MobileNet architecture [37] was applied to build the base network in the original Faster R-CNN [36] framework. Hu et al. [38] combined multifeatured fusion and convolutional neural network to the vehicle detection method.
However, the detection methods mentioned above can get excellent performance on common vehicle detection, which are not properly suitable for nighttime vehicle detection. To deal with the detection problems in nighttime, GAN [39][40][41] has shown excellent performance on image processing recently. There is a limited number of research that apply the GAN network to nighttime vehicle detection, and some of them have limitations. Lin et al. [42] proposed a GAN-based data augmenter named AugGAN to expand the dataset for improving the performance of the nighttime vehicle detector. However, the processed image obtained by this nighttime vehicle detection method is noisy and has distortion. Based on CycleGAN, Shao et al. [43] combined features of night and day, and proposed a multi-scale feature fusion to enhance night vehicle feature detection. However, small vehicles in the remote distance are always discarded. To tackle the abovementioned problems on nighttime vehicle detection, the proposed method is introduced. The proposed method can eliminate the interference of road reflections, which is a difficult part in traditional vehicle detection methods. In addition, the proposed method has remarkable performance in most complex conditions, even in extremely dark, occluded, and dazzling scenes.

Method
In this part, the proposed method is described in detail. The overall framework of the proposed approach is depicted in Figure 1, which is based on the Faster R-CNN [44] framework. To eliminate the effect of weak environmental light or complex vehicle light at night, attentive GAN is introduced to get precise accuracy. We use multiple local regression, replacing the traditional box offset regression of Faster R-CNN, which will achieve more precise localization. In the classification module, an improved RoI pooling is used to accurately classify the vehicle, which extracts the features of different sub-regions of the candidate region, and then assigns adaptive weights to obtain discriminative features. In the following, the proposed method will be explained.

Attentive GAN Module
The Generative Adversarial Network (GAN) [39] is introduced into the method to improve the accuracy of nighttime vehicle detection, which contains a generator G and a discriminator D. An attentive U-Net [45] is used as the generator of GAN, while a global discriminator and local discriminator are used to improve the global light and enhance local regions.

Attentive GAN Module
The Generative Adversarial Network (GAN) [39] is introduced into the method to improve the accuracy of nighttime vehicle detection, which contains a generator G and a discriminator D. An attentive U-Net [45] is used as the generator of GAN, while a global discriminator and local discriminator are used to improve the global light and enhance local regions.

Attentive Generator
U-Net is used as the generator backbone for the reason that U-Net retains multi-scale context information. As shown in Figure 2a, U-Net is a U-shaped network which connects the encode layers to the decode layers. It helps information flow correctly from the encoder to the decoder. In order to deal with night image problems, an attention scheme is introduced into the GAN. Attention modules are added to each layer of encoder and decoder connections, which is shown in Figure 2b. With the purpose of balancing the dark regions and the bright regions in night image, IC, which is the illumination channel of the RGB image is normalized to [0, 1] and 1-IC is used as the attention map. The attention map is resized to different sizes, which can be used to multiply with the intermediate feature maps.

Attentive Generator
U-Net is used as the generator backbone for the reason that U-Net retains multi-scale context information. As shown in Figure 2a, U-Net is a U-shaped network which connects the encode layers to the decode layers. It helps information flow correctly from the encoder to the decoder. In order to deal with night image problems, an attention scheme is introduced into the GAN. Attention modules are added to each layer of encoder and decoder connections, which is shown in Figure 2b. With the purpose of balancing the dark regions and the bright regions in night image, IC, which is the illumination channel of the RGB image is normalized to [0, 1] and 1-IC is used as the attention map. The attention map is resized to different sizes, which can be used to multiply with the intermediate feature maps.

ttentive GAN Module
The Generative Adversarial Network (GAN) [39] is introduced into the method to ove the accuracy of nighttime vehicle detection, which contains a generator G and a iminator D. An attentive U-Net [45] is used as the generator of GAN, while a global iminator and local discriminator are used to improve the global light and enhance regions.

Attentive Generator
U-Net is used as the generator backbone for the reason that U-Net retains multi-scale xt information. As shown in Figure 2a, U-Net is a U-shaped network which connects ncode layers to the decode layers. It helps information flow correctly from the ento the decoder. In order to deal with night image problems, an attention scheme is duced into the GAN. Attention modules are added to each layer of encoder and deconnections, which is shown in Figure 2b. With the purpose of balancing the dark ns and the bright regions in night image, IC, which is the illumination channel of the image is normalized to [0, 1] and 1-IC is used as the attention map. The attention map ized to different sizes, which can be used to multiply with the intermediate feature .

Discriminator
Using the global discriminator alone cannot solve the problems of overexposure underexposure in the local areas of the night image. Therefore, on the basis of the g discriminator, a local discriminator is added to solve the problems of overexposure underexposure in local areas. PatchGAN, which retains a certain high resolution and

Discriminator
Using the global discriminator alone cannot solve the problems of overexposure and underexposure in the local areas of the night image. Therefore, on the basis of the global discriminator, a local discriminator is added to solve the problems of overexposure and underexposure in local areas. PatchGAN, which retains a certain high resolution and high detail retention of image information is used in both the local and global discriminator. A local discriminator extracts patches randomly from the generator output and normal images, and learns to distinguish them between fake and real. The relativistic discriminator structure [46] which can help the generator produce more realistic images is used in a global discriminator. The function of relativistic discriminator [46] can be represented: where C represents the discriminator network, x real denotes sampling from the real distribution and x f ake represents sampling from the fake distribution. sigmoid is the sigmoid function. However, the sigmoid function is replaced by the least-square GAN (LSGAN) [47] loss in this relativistic discriminator structure. Therefore, the loss functions are: In the local discriminator, the output image of generator and real image are all cropped into several patches. Original LSGAN is used as the local discriminator loss. The standard LSGAN [47] can be represented as:

Training Loss of Attentive GAN Module
In traditional visual tasks, a pre-trained VGG is always used to represent the feature distance between the generator output and the ground truth. In the proposed method, the feature distance between input night image and its output image is constrained, which can preserve the image content features of themselves. For this purpose, self-feature preserving loss [48] is used and it can be computed as: where I represents the input and G(I) represents the output. W i,j and H i,j are the dimensions of the feature maps. T i,j represents the feature map generated from the pre-trained VGG model. i is max pooling, while j indicates convolutional layers after i-th max pooling. Furthermore, similar feature preserving loss, L Local SFP , is used in the cropped local patches of local discriminator. In addition, after each feature map, an instance normalization layer [49] is used. Then, feature maps send to L Local SFP and L SFP with the aim of having steady training. Therefore, the training loss of attentive GAN module can be expressed as:

Vehicle Detection Module
The vehicle detection module consists of four parts: the backbone network used for feature extraction, region proposal network (RPN), classification sub-network and  [50], and RPN is used to generate RoI proposals. In traditional Faster R-CNN, after getting proposals and feature maps, RoI Pooling [44] or RoI Align [51] is used to get fix-sized feature maps, which will send to some full connected layers to obtain the classification score and bounding box. Recently, several researchers have divided the classification and regression into two separate branches: the classification branch and the bounding box regression branch, which is beneficial to accuracy vehicle detection. Therefore, the classification branch is implemented by an improved RoI Pooling and the regression branch is realized by a multiple regression. In the proposed method, the backbone network of the vehicle detection module is ResNet101 with FPN, which can improve the model performance. As shown in Figure 3, ResNet101 with FPN is used to extract multi-scale features from the input night vehicle images. Compared with traditional backbone network of Faster R-CNN, ResNet101 with FPN can get deeper and more accurate feature information which will benefit to the later detection.

Vehicle Detection Module
The vehicle detection module consists of four parts: the backbone netwo feature extraction, region proposal network (RPN), classification sub-network sion sub-network. The image features are extracted by ResNet101 with FPN [50 is used to generate RoI proposals. In traditional Faster R-CNN, after getting pro feature maps, RoI Pooling [44] or RoI Align [51] is used to get fix-sized fea which will send to some full connected layers to obtain the classification score a ing box. Recently, several researchers have divided the classification and regr two separate branches: the classification branch and the bounding box regress which is beneficial to accuracy vehicle detection. Therefore, the classification implemented by an improved RoI Pooling and the regression branch is realize tiple regression. In the proposed method, the backbone network of the vehicl module is ResNet101 with FPN, which can improve the model performance. A Figure 3, ResNet101 with FPN is used to extract multi-scale features from the vehicle images. Compared with traditional backbone network of Faster R-Net101 with FPN can get deeper and more accurate feature information which to the later detection.

Multiple Local Regression
In traditional Faster R-CNN, after getting proposals and feature maps, R or RoI Align is used to generate the fix-sized (k × k) feature map within the prop

Multiple Local Regression
In traditional Faster R-CNN, after getting proposals and feature maps, ROI pooling or RoI Align is used to generate the fix-sized (k × k) feature map within the proposal. Then, each feature map is sent to several fully connected layers, which will fall into two output layers: the bounding box regression branch and the classification branch. The branch of the bounding box regression is implemented by predicting the box offset (t x ,t y , t w ,t h ) of each proposal to get accuracy bounding box. The offset can be represented by: where x P , y P , w P , and h P are the center coordinates of proposal P and its width and height. Variables x G , y G , w G , h G are the center coordinates of ground truth box G and its width and height. While in the proposed approach, unlike the traditional Faster R-CNN which predicts one box offsets, we selectively replace traditional bounding-box regression with multiple local regression. The proposed regression predicts multiple box offsets of proposal. In the multiple local regression branch, the fix-sized (k × k) feature map, which can get multiple local box offsets, is treated as k 2 local features. As shown in Figure 1, the fix-sized (k × k) feature map is sent to a fully convolutional network to predict multiple box offsets. The multiple box offsets represent the distance of local feature p j at the position (x j , y j ) to the  Figure 4. The offsets ( j , t j , r j , b j ) at position j can be calculated as follows: where (x j , y j ) and (x , y t ) represent the bottom-right and the top-left of the ground truth box. w P and h P are the width and height of proposal P. These predicted multiple box offsets are applied to calculate the bottom-right and top-left corner positions of the predicted box. Then, boxes of local features prediction are average computed to get the final bounding box.
multiple box offsets represent the distance of local feature pj at the position ( j x , j y ) to bottom-right and top-left corner of the ground truth box as shown in Figure 4. The of ( j  , j t , j r , j b ) at position j can be calculated as follows:

Improved RoI Pooling
With the aim of getting more accurate vehicle classification results, the classifica sub-branch is on the basis of the deformable RoI pooling [52] and makes some extens

Improved RoI Pooling
With the aim of getting more accurate vehicle classification results, the classification sub-branch is on the basis of the deformable RoI pooling [52] and makes some extensions to improve the classification. Similar to the standard deformable RoI pooling layers, the classification module also includes standard RoI pooling, a fully connected layer, and offsets.
In deformable RoI pooling, RoI Align is used to generate fix-sized (k × k) pooled feature maps. Then, a fully connected layer is used to provide the normalized offsets of these feature maps. Different from the weighted strategy of the standard deformed RoI pooling, different weights are assigned to different sampling points obtained within the k × k feature maps. The weighted feature ∼ F of the candidate proposal can be calculated by.
where F is the RoI Align feature in primal sampling points, W(F) is computed from F using the convolution operations. Furthermore, is the Hadamard product. Several sampling points (s1, s2, s3, s4) and the computed weights (w1, w2, w3, w4). The weighted pooling (WP) process is shown in Figure 5. After ∼ F is obtained, average pooling is performed, and finally the weighted RoI feature is obtained. Then, the feature is sent to fully connected layers to gain the final classification score. sampling points (s1, s2, s3, s4) and the computed weights (w1, w2, w3, w4). Th pooling (WP) process is shown in Figure 5. After F  is obtained, average poo formed, and finally the weighted RoI feature is obtained. Then, the feature is s connected layers to gain the final classification score.

The Loss of Vehicle Detection Module
In vehicle detection module, the loss mainly contains two parts: the loss o and the loss of classification. Regression loss contains two components: RPN loss and the loss of multiple regression branch. Smooth L1 loss is chosen as R sion loss, which will improve the robustness of regression process.
1 L S loss pressed as:

Inter tion P G IoU
Union P G = − , where P represents the bounding box prediction and G indicates the bo ground truth.
The classification loss is also composed of the RPN classification loss and classification branch, and cross entropy loss is used. Cross entropy loss is repr

The Loss of Vehicle Detection Module
In vehicle detection module, the loss mainly contains two parts: the loss of regression and the loss of classification. Regression loss contains two components: RPN regression loss and the loss of multiple regression branch. Smooth L1 loss is chosen as RPN regression loss, which will improve the robustness of regression process. S L1 loss can be expressed as: where x = p i − p * i . p i is a vector representing the four parameterized coordinates of the predicted bounding box and p * i is the coordinate vector of ground truth box corresponding to positive anchor. IoULoss, which trains the location information as a whole, is used as the loss of multiple regression branch. Compared with smooth L1 loss, IoULoss contributes to more accurate training results, which is quite important for the final location. IoULoss can be calculated as: Inter sec tion(P, G) Union(P, G) , where P represents the bounding box prediction and G indicates the bounding box ground truth. The classification loss is also composed of the RPN classification loss and the loss of classification branch, and cross entropy loss is used. Cross entropy loss is represented as: where p i represents the true label value, and p * i is the predicted value. In RPN classification process, cross entropy loss is a binary cross entropy loss. Different from the loss of RPN classification, the loss of classification branch is cross entropy loss of multi-classifications.

Datasets and Implementation
In order to demonstrate the effectiveness of the proposed method, this paper conducts experiments on the selected nighttime vehicle dataset. The selected dataset is composed of the partial nighttime images of Berkeley Deep Driving (BDD-100k) [53] dataset. The BDD-100k dataset is a naturalistic driving dataset, which contains 100k high resolution images. Various scene types are included, such as city streets, residential areas, and highways. Furthermore, it also includes different kinds of images at different times of the day. A total of 8075 nighttime images are randomly selected from BDD-100k dataset to evaluate the proposed method, 6075 images are selected to train the models and test, while 2000 images are included in the test set. A brief demonstration of the dataset is shown in Figure 6. make it easier for us to conduct our experiment. For precise detection, attentive GA used for more detailed image information. ResNet101 with FPN is chosen as the back of Faster R-CNN. To simplify hyperparameter tuning of algorithm [54,55], the Stoch Gradient Descent (SGD) solver with momentum of 0.9, and the weight decay of 0.000 a single NVIDIA GeForce RTX 2080Ti GPU is used to optimize the network. The lear rate is initialized as 0.0025 for the first 20 epochs and decreased by 0.1 after 16th and epochs.  Images which are selected from BDD-100k are resized to 800 × 600. Our baseline models are all based on the framework of MMDetection, which is an object detection toolkit based on PyTorch. MMDetection contains dozens of state-of-the-art detectors, which make it easier for us to conduct our experiment. For precise detection, attentive GAN is used for more detailed image information. ResNet101 with FPN is chosen as the backbone of Faster R-CNN. To simplify hyperparameter tuning of algorithm [54,55], the Stochastic Gradient Descent (SGD) solver with momentum of 0.9, and the weight decay of 0.0001 on a single NVIDIA GeForce RTX 2080Ti GPU is used to optimize the network. The learning rate is initialized as 0.0025 for the first 20 epochs and decreased by 0.1 after 16th and 22th epochs.

The Results of Comparisons
We quantitatively analyze the experimental results of the proposed method on the selected BDD-100k dataset with state-of-the-art detection approaches, including two-stage detection approaches of Faster R-CNN, Cascade R-CNN [56], Mask R-CNN [51], and one-stage detection approaches of RetinaNet [57], and SSD [58]. These detection models are evaluated with the COCO metrics, including average precision (AP) over Intersection over Union (IoU) thresholds from 0.5 to 0.95, AP 50 (IoU threshold over 0.5), AP 75 (IoU threshold over 0.75). Furthermore, AP S , AP M, and AP L represent the results on small, medium, and large scales, respectively. As presented in Table 1, the proposed method obtains the greatest performance. As shown in Table 1, compared with two-stage detection approaches, one-stage detection approaches really have poor detection accuracy. SSD gets the least AP value and it has bad performance under different vehicle sizes compared with other detection approaches. RetinaNet, another one-stage detection approach, shows an improvement compared with SSD. Two-stage detection approaches show more significant improvement in detection accuracy compared with one-stage detection approaches. However, the most accurate result is made by the proposed method. As shown in Table 1, the value of AP S is quite lower than others, which reflects the problem that small target vehicles are not easy to detect. Compared with other approaches, the proposed method has notable AP S gains, which indicates that proposed method has better performance in small target vehicles.

Qualitative Analysis
To verify the effectiveness of the proposed method, visual comparisons are performed on the selected BDD-100k dataset. We choose some representative scene, and the detection results are presented in Figures 7-11. The green rectangles in the resulting image represent the object location and sizes of the vehicles detected by the specific method. The red rectangles show the detailed information of the specific location. accuracy. As presented in Figure 10, the interference from lights beside the viaduct causes incorrect detection results. RetinaNet, Faster R-CNN, and Cascade R-CNN make incorrect detection results on both sides of the viaduct, while Mask R-CNN and SSD make incorrect detection results on the left side of the viaduct. The proposed method shows the best performance, which only detects correct objects and has correct bounding box localizations. As shown in Figure 11, there are more error bounding box localizations in the detection result of the detection approaches because of the occlusion and interferential light. More right boxes are produced by the proposed method and more vehicles are detected. These results show that the proposed method achieves remarkable performance in the most complex conditions, even in extremely dark, occluded, and dazzling conditions. accuracy. As presented in Figure 10, the interference from lights beside the viaduct causes incorrect detection results. RetinaNet, Faster R-CNN, and Cascade R-CNN make incorrect detection results on both sides of the viaduct, while Mask R-CNN and SSD make incorrect detection results on the left side of the viaduct. The proposed method shows the best performance, which only detects correct objects and has correct bounding box localizations. As shown in Figure 11, there are more error bounding box localizations in the detection result of the detection approaches because of the occlusion and interferential light. More right boxes are produced by the proposed method and more vehicles are detected. These results show that the proposed method achieves remarkable performance in the most complex conditions, even in extremely dark, occluded, and dazzling conditions.

Ablation Study
An ablation experiment is conducted on the selected dataset to examine the contributions of different components to the overall network. All experiments use ResNet101 with FPN as the backbone. The baseline is Faster R-CNN. We gradually apply the Attentive GAN, Multiple Local Regression and Improved ROI Pooling to the baseline and present the performance in Table 2.
First, Attentive GAN is applied to the baseline network, which will be beneficial to obtain more distinct vehicle features. Clearer vehicle features will contribute to more accurate detection results. As shown in Table 2, the improvement over the baseline by adapting the Attentive GAN can be clearly observed. By enhancing the nighttime image features with the Attentive GAN, a 2.9% AP improvement is acquired, which illustrates the benefit of adding this component. Next, a Multiple Local Regression module is adopted to the method to get more accurate locations. It is observed that compared with the baseline, the value of AP is increased by 1.9%. Additionally, applying Improved RoI Pooling module to the baseline will get more discriminative features for classification. The detection result is shown in Table 2, a reasonable improvement of 0.9% AP is contributed. As analyzed above, different components all make their contributions to the improvement of the detection accuracy. One thing that is very noticeable is that applying the three modules to the baseline can achieve the best performance.

Conclusions
A nighttime vehicle detection method is introduced to obtain accurate vehicle detection in this work. Initially, Attentive GAN is proposed to improve the vehicle features, which contributes a lot to the accuracy of detection. Additionally, multiple local regression module is used to obtain more accurate object localization, while an improved RoI As shown in Figure 7, the image shows low brightness and the vehicles features are not obvious, which will be a challenge for vehicle detection. All detectors show their detective performance with different accuracy. As the red rectangle shows, there is a car which is hardly recognized. Other state-of-the-art detectors have missed detections, while our proposed method can detect and achieves accurate results. Vehicle detection under low brightness has great importance, which can help drivers make appropriate choices. Even in normal scenes that are not too dark, missed detection problems still occur at a distance and an example is shown in Figure 8. It is obvious that the missed vehicle at a distance is difficult to be recognized by Cascade R-CNN and Mask R-CNN. Mask R-CNN missed detections, while SSD made error detection. The proposed method can recognize the difficult vehicle and perform better detection performance. In addition, the surrounding lights can make it difficult to detect vehicles correctly. In Figure 9, partial dark and partial dazzling scenes are a challenge for the detection approaches. It is obvious that SSD has made an error detection at a partial dazzling region, while the other detection approaches give the right detection result. In a partial dark area, only the proposed method and SSD give the right detection results. However, the proposed method shows higher accuracy. As presented in Figure 10, the interference from lights beside the viaduct causes incorrect detection results. RetinaNet, Faster R-CNN, and Cascade R-CNN make incorrect detection results on both sides of the viaduct, while Mask R-CNN and SSD make incorrect detection results on the left side of the viaduct. The proposed method shows the best performance, which only detects correct objects and has correct bounding box localizations. As shown in Figure 11, there are more error bounding box localizations in the detection result of the detection approaches because of the occlusion and interferential light. More right boxes are produced by the proposed method and more vehicles are detected. These results show that the proposed method achieves remarkable performance in the most complex conditions, even in extremely dark, occluded, and dazzling conditions.

Ablation Study
An ablation experiment is conducted on the selected dataset to examine the contributions of different components to the overall network. All experiments use ResNet101 with FPN as the backbone. The baseline is Faster R-CNN. We gradually apply the Attentive GAN, Multiple Local Regression and Improved ROI Pooling to the baseline and present the performance in Table 2. First, Attentive GAN is applied to the baseline network, which will be beneficial to obtain more distinct vehicle features. Clearer vehicle features will contribute to more accurate detection results. As shown in Table 2, the improvement over the baseline by adapting the Attentive GAN can be clearly observed. By enhancing the nighttime image features with the Attentive GAN, a 2.9% AP improvement is acquired, which illustrates the benefit of adding this component. Next, a Multiple Local Regression module is adopted to the method to get more accurate locations. It is observed that compared with the baseline, the value of AP is increased by 1.9%. Additionally, applying Improved RoI Pooling module to the baseline will get more discriminative features for classification. The detection result is shown in Table 2, a reasonable improvement of 0.9% AP is contributed. As analyzed above, different components all make their contributions to the improvement of the detection accuracy. One thing that is very noticeable is that applying the three modules to the baseline can achieve the best performance.

Conclusions
A nighttime vehicle detection method is introduced to obtain accurate vehicle detection in this work. Initially, Attentive GAN is proposed to improve the vehicle features, which contributes a lot to the accuracy of detection. Additionally, multiple local regression module is used to obtain more accurate object localization, while an improved RoI pooling module is used to generate precise classification and higher detection confidence score. By integrating the three modules, the proposed method can effectively detect vehicles with a small size and partial occlusion. The comparison detection results between the proposed and other state-of-the-art detection methods indicate that the proposed method has more competitive performance than others.