Joint Semantic Intelligent Detection of Vehicle Color under Rainy Conditions

Color is an important feature of vehicles, and it plays a key role in intelligent traffic management and criminal investigation. Existing algorithms for vehicle color recognition are typically trained on data captured under good weather conditions and therefore lack robustness in outdoor visual tasks. Fine vehicle color recognition under rainy conditions remains a challenging problem. In this paper, an algorithm for jointly deraining and recognizing vehicle color (JADAR) is proposed, in which three decoder layers of UNet are embedded into RetinaNet-50 to obtain joint semantic fusion information. More precisely, the UNet subnet is used for deraining, and the feature maps of the recovered clean image and those extracted from the input image are cascaded into the Feature Pyramid Network (FPN) module to achieve joint semantic learning. The joint feature maps are then fed into the class and box subnets to classify and locate objects. The Rain Vehicle Color-24 dataset is used to train JADAR for vehicle color recognition under rainy conditions, and extensive experiments are conducted. Since the deraining and detection modules share the feature extraction layers, our algorithm maintains the test time of RetinaNet-50 while improving its robustness. On self-built and public real datasets, the mean average precision (mAP) of vehicle color recognition reaches 72.07%, surpassing both state-of-the-art vehicle color recognition algorithms and popular object detection algorithms.


Introduction
Vehicle information recognition has been applied in intelligent traffic management and criminal investigation. License plate, model, and color comprise the main vehicle information. Although license plate recognition is a commonly used vehicle information recognition technology [1], it faces many challenges in criminal investigation and intelligent traffic law enforcement, as license plates can easily be obscured (partially or fully), faked, or duplicated by criminals. Because vehicle color can still be identified despite partial occlusion or viewpoint changes, color recognition is widely applied in video surveillance [2], vehicle detection [3], vehicle tracking [4], automatic driving [5,6], criminal investigation [7], etc. All of these tasks inevitably encounter adverse weather, especially rain, which adversely affects the performance of object recognition and retrieval: rain significantly reduces scene contrast and visibility, compromising image quality. Many scholars have therefore studied how to improve object detection performance under rainy conditions.

The main contributions of this work are summarized as follows:

1. JADAR contains far fewer parameters than previous two-stage methods, since its subnets, UNet and RetinaNet, share the same feature extraction layers. This is of high practical value, as the size of outdoor mobile equipment can be substantially reduced.

2. In JADAR, the multi-scale fusion information obtained by cascading the feature maps of the original rainy image and the recovered image is fed into the subsequent class and box subnets. The resulting cross-domain multi-scale information is crucial for fine vehicle color recognition.

3. The joint processing of low-level and high-level tasks can be mutually beneficial. Embedding the image restoration module helps improve the performance of subsequent high-level tasks under severe weather; conversely, the performance of the high-level tasks, used as an evaluation metric, can in turn fine-tune the image restoration algorithm.

4. Comprehensive experiments show that the proposed method outperforms basic detection networks, two-stage networks, and transfer learning methods for the task of vehicle color recognition under rainy conditions. Furthermore, our training and testing times are shorter.

The rest of this paper is organized as follows. Related work is introduced in Section 2. JADAR is constructed in Section 3. Section 4 shows, quantitatively and qualitatively, that our method is superior to the state of the art. Section 5 concludes the paper.

Related Work
Research exists on vehicle color recognition under normal weather conditions and on object detection under adverse weather conditions; both lines of work are reviewed below.

Vehicle Color Recognition under Normal Weather Conditions
Vehicle color recognition methods generally fall into traditional model-based methods [8][9][10][11] and data-based deep learning methods [12][13][14][15][16][17][18]. Regarding traditional model-based methods, Chen et al. [8] train a linear support vector machine classifier on the region of interest (ROI) in vehicle images based on eight color types; Jeong et al. [9] adopt the multi-class AdaBoost algorithm to classify the color of front-of-vehicle images into seven types; and Dule et al. [11] train three classifiers (KNN, ANN, and SVM) on two ROIs (the smooth hood section and the semi-front of the vehicle).
Data-based methods have been receiving increasing attention for vehicle color recognition. Hu et al. [12] were the first to apply a convolutional neural network (CNN) with a spatial pyramid strategy to boost the accuracy of vehicle color recognition. Zhang et al. [15] proposed a lightweight CNN for vehicle color recognition, and further data-driven approaches include Fu et al. [16]. It is worth mentioning that there has been no research on vehicle color detection under bad weather conditions, which is the focus of this paper.

Object Detection under Adverse Weather Conditions
Chen et al. [33] embedded two domain adaptation modules into Faster RCNN to reduce the domain discrepancy at the image and instance levels. Sindagi et al. [31] proposed an unsupervised domain adaptation method to improve the generalization of object detection under hazy and rainy conditions. Style transfer is considered in [27], where the authors construct a cross-domain representation learning method combining domain diversification and multi-domain invariance. Huang et al. [41] combine dual subnet frameworks for object detection under foggy conditions. Except for the two-stage methods, the above methods do not pay special attention to rainy conditions, while two-stage methods such as [35][36][37][38][39][42] focus on image deraining rather than object detection. In other words, the joint task of deraining and object detection has not been taken seriously. Motivated by these considerations, we propose JADAR for joint semantic intelligent detection of vehicle color in rainy scenes.

Fusion Network Design
In this paper, a Joint Algorithm for Deraining And Recognition (JADAR) is designed for vehicle color recognition in inclement weather; it is based on RetinaNet-50, as shown in Figure 1. In Figure 1, O is the input rainy image, B is the corresponding clean background image, and B̂ and y_o are the outputs of the deraining and detection modules, respectively. To show the results clearly, we zoom in on the recognition result of each car in picture y_o as y_o^i (i = 1, 2, 3): y_o^1 is the enlarged result for the first car, whose color is recognized as silver-gray with a confidence of 0.91; y_o^2 is the enlarged result for the second car, whose color is recognized as black with a confidence of 0.58; and y_o^3 is the enlarged result for the third car, whose color is recognized as dark gray with a confidence of 0.81. The green, blue, purple, and orange boxes represent the feature extraction module, UNet-3, the information fusion module, and the class + box subnets, respectively. L_reg is the regression loss, using the smooth L1 loss; L_cls is the classification loss, using the focal loss; and the deraining loss is the MSE loss. JADAR is trained with the weighted sum of these three losses (see Equation (7)).
The JADAR framework is designed by embedding the three-layer decoder of UNet-3 [43] into the last three sub-blocks of the feature extraction module, as illustrated by the green-tinted box in Figure 1. The whole framework consists of four main modules: the image feature extraction module, the deraining module, the information fusion module, and the class + box subnets. The rain removal and feature extraction modules share three layers, avoiding extra computational burden; in fact, Section 4.5 shows that JADAR has the same testing time as RetinaNet-50. The last three feature maps and the corresponding recovered feature maps are cascaded together and then fed into their respective class + box subnets, which learn multi-scale joint semantic representations to improve object detection accuracy under rainy conditions. The feature fusion sub-module is illustrated in Figure 2.
The overall objective function is back-propagated to train the deraining module and recursively improve deraining performance. The object detection backbone uses three-scale class + box subnets to leverage the multi-scale fused color feature maps to classify 24 vehicle color types and locate the bounding boxes.
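To make the cascading scheme concrete, the following PyTorch sketch shows how a three-layer UNet-style decoder could share the last three backbone feature maps and how each pair (f_i, g_i) is cascaded before the class + box subnets. The channel widths (ResNet-50 C5/C4/C3), upsampling mode, and layer shapes are illustrative assumptions, not the exact JADAR configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionDecoderSketch(nn.Module):
    """Sketch of JADAR-style feature sharing: the last three backbone
    maps f1 (deepest) to f3 drive a 3-layer UNet-style decoder whose
    outputs g1..g3 are cascaded back with f1..f3 before the class/box
    subnets. Channel sizes are assumptions (ResNet-50 C5/C4/C3)."""

    def __init__(self, c1=2048, c2=1024, c3=512):
        super().__init__()
        self.dec1 = nn.Conv2d(c1, c2, 3, padding=1)        # f1 -> g1
        self.dec2 = nn.Conv2d(2 * c2, c3, 3, padding=1)    # [g1, f2] -> g2
        self.dec3 = nn.Conv2d(2 * c3, c3, 3, padding=1)    # [g2, f3] -> g3

    def forward(self, f1, f2, f3):
        g1 = F.relu(self.dec1(f1))                                   # at f1's scale
        up1 = F.interpolate(g1, size=f2.shape[-2:], mode="nearest")  # match f2
        g2 = F.relu(self.dec2(torch.cat([up1, f2], dim=1)))          # at f2's scale
        up2 = F.interpolate(g2, size=f3.shape[-2:], mode="nearest")  # match f3
        g3 = F.relu(self.dec3(torch.cat([up2, f3], dim=1)))          # at f3's scale
        # Joint semantic fusion: cascade each backbone map with its
        # recovered counterpart; these pairs feed the class/box subnets.
        fused = [torch.cat([f1, g1], dim=1),
                 torch.cat([f2, g2], dim=1),
                 torch.cat([f3, g3], dim=1)]
        return fused, g3   # g3 also feeds the MSE deraining loss
```

Producing a full-resolution derained image would require further upsampling of g3 to the input size; that head is omitted here for brevity.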
(Table: architecture and weights of the proposed network in detail.)

Model Formulation and Model Optimization
Let the physical mechanism of rainy image corruption be x = y + z, where x, y, and z denote the rainy image O, the recovered clean background image B, and the rain layer R, respectively. To tackle supervised vehicle detection by color in inclement weather, a joint network is proposed to learn a joint semantic representation from an input rainy image x. Let y denote the corresponding label of the rainy image x.
As shown by the green box in Figure 1, the last three feature maps f_1(x), f_2(x), f_3(x) are taken from the feature extraction sub-blocks of RetinaNet. First, f_1(x) is fed into the corresponding last layer of the UNet-3 decoder, which outputs g_1(x). Next, g_1(x) and f_2(x) are cascaded into the penultimate decoder layer of UNet-3, which outputs g_2(x); then g_2(x) and f_3(x) are cascaded into the last decoder layer of UNet-3, which outputs g_3(x). The output of the deraining module, y, is denoted by g_3(x). The mean square error (MSE) is used as the deraining objective L_der:

$$L_{der} = \frac{1}{n}\sum_{i=1}^{n}\left\| g_3(x_i) - y_i \right\|_2^2,$$

where n is the number of rainy images and y_i is the clean background of the i-th rainy image x_i.

Finally, f_1(x) and g_1(x), f_2(x) and g_2(x), and f_3(x) and g_3(x) are cascaded and input into class + box subnets of different scales, where joint semantic information is fused, 24 vehicle colors are classified, and bounding-box regression is performed; the final detection output is denoted y_o. The classification loss is the focal loss:

$$L_{cls} = -\sum_{i=1}^{C} \alpha_t \left(1 - p_{it}\right)^{\gamma} \log\left(p_{it}\right),$$

where α_t is a balancing factor for the uneven proportion of positive and negative examples of each vehicle color category, C = 24 is the number of vehicle color categories, γ ≥ 0 is a tunable focusing parameter (we take γ = 2.0 in Section 4, following [24]), and t ∈ {0, 1} indicates a negative or positive sample. Here, p_{i1} ∈ [0, 1] denotes the predicted probability of a positive sample of the i-th vehicle color class, and 1 − p_{i1} the predicted probability of a negative sample of category i ∈ {1, 2, ..., 24}; i.e.,

$$p_{it} = \begin{cases} p_{i1}, & t = 1,\\ 1 - p_{i1}, & t = 0. \end{cases}$$

The bounding-box regression loss is

$$L_{reg}(i) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L1}\left(t_j^i - t_j^{*i}\right),$$

with

$$\mathrm{smooth}_{L1}(u) = \begin{cases} 0.5u^2, & |u| < 1,\\ |u| - 0.5, & \text{otherwise}, \end{cases}$$

where (x, y) denotes the center coordinates of the bounding box, w and h denote its width and height, and t_i (t_i^*) represents the offset of the prediction box (ground-truth box). Here, L_reg(i) is the regression loss for the i-th image, and L_reg is the total regression loss over all images. The total loss function is then

$$L = L_{cls} + L_{reg} + \lambda L_{der}, \qquad (7)$$

where λ ∈ [0, 1] is a hyperparameter controlling the strength of the image deraining module's influence on rainy-weather detection performance. In our ablation experiments, λ = 0.5 yields the optimal detection mAP; see Section 4.3 for details.
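As a compact illustration of how these three terms could be combined in PyTorch, the sketch below implements a standard per-class focal term and the Equation (7) weighting; the function names and the reduction scheme are assumptions, and the detector's classification and regression heads are taken as given.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Standard focal loss, matching L_cls above: per class,
    FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    `logits`: raw class scores; `targets`: 0/1 labels of the same shape."""
    p = torch.sigmoid(logits)
    # BCE-with-logits (no reduction) equals -log(p_t) elementwise.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # p_{it}
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # alpha_t
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

def jadar_total_loss(l_cls, l_reg, derained, clean, lam=0.5):
    """Equation (7): L = L_cls + L_reg + lambda * L_der, where L_der is
    the MSE between the recovered image g3(x) and the clean background."""
    l_der = F.mse_loss(derained, clean)
    return l_cls + l_reg + lam * l_der
```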

Experimental Setup
Implementation Details. JADAR is trained end-to-end on the Rain Vehicle Color-24 image set with the Adam optimizer [44] to simultaneously learn image deraining, color classification, and object localization on the PyTorch platform. All experiments are run on the AutoDL platform with a Tesla P40. The hyper-parameters α and γ of the classification loss L_cls are set to 0.25 and 2, respectively. We divide Rain Vehicle Color-24 into training, validation, and testing sets at a ratio of 8:1:1. The batch size is 4, the number of epochs is 100, and the confidence threshold is 0.5. The learning rate is 10^{-4} for the first 50 epochs, 10^{-5} for the next 30 epochs, and 10^{-6} for the last 20 epochs.
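A minimal sketch of this optimizer and staged learning-rate schedule, assuming a stand-in model and omitting the data pipeline and loss computation:

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the JADAR network
optimizer = Adam(model.parameters(), lr=1e-4)
# 1e-4 for epochs 1-50, 1e-5 for epochs 51-80, 1e-6 for epochs 81-100.
scheduler = MultiStepLR(optimizer, milestones=[50, 80], gamma=0.1)

for epoch in range(100):
    # ... one pass over the Rain Vehicle Color-24 training set (batch
    # size 4), minimizing the Equation (7) loss, would go here ...
    scheduler.step()
```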
Evaluation Metrics. Object detection is generally evaluated with IoU (Intersection over Union) [21], Precision [45], Recall [45], AP (Average Precision) [18], mAP (mean Average Precision) [41], or other metrics. These concepts are well known, so we list the formulas only briefly:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|},$$

where A denotes the ground-truth bounding box of the object and B the predicted bounding box. Precision and Recall are defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

where TP is the number of true positives (correctly predicted as positive), FP the number of false positives (incorrectly predicted as positive), and FN the number of false negatives (positives that were missed).
AP is calculated by

$$AP = \int_0^1 p(r)\, dr,$$

where p is Precision and r is Recall.
The mAP (mean Average Precision) is the average of the per-category APs:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i,$$

where N is the number of categories.
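For reference, a small NumPy sketch of these metrics, assuming [x1, y1, x2, y2] box coordinates and precomputed precision-recall points:

```python
import numpy as np

def iou(a, b):
    """IoU = |A intersect B| / |A union B| for [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(precisions, recalls):
    """AP as the area under the precision-recall curve."""
    order = np.argsort(recalls)
    r = np.asarray(recalls, dtype=float)[order]
    p = np.asarray(precisions, dtype=float)[order]
    return float(np.trapz(p, r))

def mean_average_precision(ap_per_class):
    """mAP = mean of per-class APs (N = number of color categories)."""
    return float(np.mean(ap_per_class))
```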

Real Rain Vehicle Datasets: RID and RIS
Li et al. collected two real rainy vehicle image datasets, RID and RIS [38], for testing object detection. RID consists of rainy images captured by in-vehicle cameras while driving on rainy days, and RIS consists of rainy images captured by network traffic surveillance cameras in rainy weather. The two datasets differ in many respects: rainfall type, image quality, target size and angle, etc. They represent real-world application scenarios where deraining may be required. RID includes 2495 images, and its rain effect is closest to "raindrops" on the camera lens; RIS includes 2048 images, and its rain effect is closest to "rain and fog" (many cameras fog up when it rains, and lower resolution also increases the fog effect) [47]. Because the scenes of these two datasets are highly complex, they are challenging, and we choose them for testing to better illustrate the effectiveness of our proposed algorithm. Examples from the two datasets are given in Figure 4.

Ablation Study
To determine the optimal design of our framework, we train four variants on the Rain Vehicle Color-24 dataset: RetinaNet, JADAR1, JADAR2, and JADAR, which use loss weights λ = 0, 0.1, 1.0, and 0.5, respectively. Figure 5 shows that the testing mAP values of JADAR1, JADAR2, and JADAR differ from RetinaNet by +2.92%, −3.99%, and +4.3%, respectively, which indicates that joint semantic feature extraction helps improve vehicle color recognition under rainy weather. Referring to Table 1, when λ is 0.1, the rain removal module provides a weak assist to vehicle color recognition in rain; when λ is 1.0, it has the opposite effect; and when λ is 0.5, JADAR performs best, so we choose this value in our method.

Comparison with State-of-the-Art Methods
In this section, our proposed algorithm is compared with a vehicle color recognition method, object detection methods, two-stage methods combining rain removal with object detection, and transfer learning methods.
To evaluate vehicle color recognition performance, JADAR and SMNN-MSFF [18] are compared. Both are trained on the Rain Vehicle Color-24 training subset and tested on its test subset. The quantitative results are shown in the second column of Table 2: the mAP of our method reaches 72.07%, which is 23.49% higher than SMNN-MSFF. The qualitative results are shown in Figure 6. JADAR outperforms SMNN-MSFF under rainy conditions; for example, JADAR recognizes five vehicles, while SMNN-MSFF recognizes only three, and JADAR recognizes a white vehicle with a confidence of 0.79 versus 0.62 for SMNN-MSFF.
To compare object detection performance, JADAR, RetinaNet [24], Faster RCNN [19], SSD [20], and YOLO V3 [21] are compared qualitatively and evaluated quantitatively by mAP. In our experiments, the loss function and settings (scale, anchor or default box, backbone network, classifier, etc.) of each compared method remain unchanged from the original work, and all methods are trained on the Rain Vehicle Color-24 dataset and tested on its test set. The qualitative results for vehicle color recognition in rain are shown in Figures 7 and 8; as can be seen, the proposed JADAR outperforms the other models for fine vehicle color recognition. Quantitatively, Table 2 shows that JADAR is 11.42%, 22.19%, 5.74%, and 4.3% better than Faster RCNN, YOLO V3, SSD, and RetinaNet, respectively. To compare different joint approaches, three state-of-the-art rain removal methods (LPNet [35], PReNet [48], and RCDNet [49]) are chosen to first derain the images, after which RetinaNet is used to recognize vehicle colors; these two-stage pipelines are denoted LR, PR, and RR. Figures 9 and 10 give qualitative comparisons of JADAR and the three two-stage methods for vehicle color recognition under rainy weather; JADAR performs better than the other models. From Table 3, JADAR is 15.56%, 20.37%, and 2.06% higher than LR, PR, and RR, respectively.
To compare with transfer learning methods, two domain-adaptation methods, Da-faster [33] and ATF [50], are compared with JADAR. Here, VC-24 is the source domain and Rain Vehicle Color-24 is the target domain, and both are used to train the above algorithms. From the fifth and sixth columns of Table 3, our method is 25.95% and 9.14% better than Da-faster and ATF, respectively. The qualitative results in Figures 9 and 10 show that JADAR identifies more vehicles with higher confidence than the other two methods. Qualitative results on the real datasets RID and RIS are shown in Figures 11-14; as can be seen, the test results of JADAR on these real datasets are generally better than those of the other methods. In Figure 11, the JADAR and SSD algorithms correctly identify the two cars in the picture; YOLO V3 also identifies both cars but mistakes the black one for silver-gray, while the other three algorithms can hardly identify any vehicle. In Figure 12, the recognition results of Faster RCNN and SSD are better than the others', revealing a limitation of JADAR in recognizing small targets. In Figure 13, all algorithms identify the color of the vehicle in the image, though with different confidence values; ATF has the highest confidence for the blue vehicle, at 0.98. However, Figure 14 shows that only JADAR and ATF identify a certain white vehicle.

Inference Time
To compare the test times of all methods, all network models are tested on the testing subset with 1920 × 1080 input images. The test times are shown in Table 4. JADAR takes 1.7 s per image on a single Tesla P40 GPU, the same as RetinaNet, and it is 21.8, 1.1, 4.4, 0.8, and 0.9 s faster than LR, PR, RR, Da-faster, and ATF, respectively. Therefore, although JADAR has one more decoder module than RetinaNet, it maintains the original high detection speed.

Conclusions
In this paper, we study vehicle color recognition under rainy conditions and propose a joint semantic learning method, JADAR, designed by embedding UNet-3 into RetinaNet. The UNet module removes rain and restores the clean background image. The feature maps of the recovered background image and the rainy image are fed together into the class + box sub-module of RetinaNet to accurately extract the joint semantics of the vehicle color feature maps. JADAR outperforms other methods for fine vehicle color recognition under rainy as well as normal conditions. Extensive experimental results show that the mAP of the proposed method reaches 72.07% in identifying 24 colors under rainy conditions. Because our algorithm is trained on synthetic datasets, its generalization is not guaranteed. In the future, we plan to use semi-supervised or few-shot learning to further improve the generalization and practicality of the algorithm. As a further research topic, one can consider fusing overlap functions and fuzzy (rough) sets (see [51][52][53][54][55]) to develop the method of this paper.