Article

All-Weather Pedestrian Detection Based on Double-Stream Multispectral Network

Chih-Hsien Hsia, Hsiao-Chu Peng and Hung-Tse Chan *
1 Department of Computer Science and Information Engineering, National Ilan University, Yilan County 26047, Taiwan
2 Department of Business Administration, Chaoyang University of Technology, Taichung City 413310, Taiwan
3 Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei City 106335, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2023, 12(10), 2312; https://doi.org/10.3390/electronics12102312
Submission received: 17 March 2023 / Revised: 12 May 2023 / Accepted: 15 May 2023 / Published: 20 May 2023
(This article belongs to the Special Issue New Trends in Deep Learning for Computer Vision)

Abstract:
Recently, advanced driver assistance systems (ADAS) have attracted wide attention in pedestrian detection by using the multispectral data generated by multiple sensors. However, image-based sensors find it challenging to perform their tasks reliably under instabilities such as lighting changes, object shading, or adverse weather conditions. Considering the above, this study proposed a deep learning (DL) framework that exploits the different spectral information of RGB and thermal images to alleviate the problem of confusing light sources and to extract highly discriminative multimodal features through multispectral fusion. The proposed pedestrian detection method is built on a double-stream multispectral network (DSMN), which consists of a multispectral fusion and double-stream Yolo-based detector (MFDs-Yolo) and an improved illumination-aware network (i-IAN), a self-adaptive multispectral weight adjustment module used in the late fusion strategy to make the different modalities complementary. Experimental results on the public multispectral pedestrian detection datasets KAIST and FLIR demonstrate the good performance of this detection method, which even outperforms the most advanced methods under the miss rate (MR) (IoU@0.75) evaluation.

1. Introduction

Advanced driver assistance systems (ADAS) are important technical systems that detect the external environment and make appropriate judgments using vehicle sensors. Nowadays, it has become a trend for ADAS to evolve from assisted driving into automatic driving. Most current sensors in the automatic driving market are based on RGB images, radars [1], and lidars. Among these sensors, the most advantageous feature of image-based sensors is their low cost, which helps self-driving vendors control manufacturing costs. Additionally, the use of image sensors can not only enhance automatic driving but also reduce the rate of traffic accidents. Above all, the safety of pedestrians is paramount.
As a significant topic of ADAS, pedestrian detection has received much attention along with the development of the automatic driving industry. However, pedestrian detection still faces many challenges owing to light source intensity, object shading, and other changing conditions. For automatic driving vehicles (e.g., Tesla Model 3, Toyota Camry, Chevrolet Malibu, and Honda Accord), the American Automobile Association conducted a pedestrian knock-down test in daytime scenarios [2]. According to the results, when the test vehicle ran at 20 miles per hour and an adult suddenly crossed the road, there was a 40% chance of the vehicle avoiding the pedestrian; when the test vehicle drove at 20 miles per hour and suddenly encountered a child appearing between two vehicles, there was an 89% chance of the vehicle hitting the child; finally, when the test vehicle ran at 30 miles per hour, there was a 100% chance of the vehicle hitting the pedestrian. The knock-down probability in daytime scenarios is thus very high, and when driving at night in low light, self-driving vehicles face even more serious problems in detecting pedestrians.
In summary, the pedestrian detection task applied in ADAS requires methods that are robust to environmental factors such as light source intensity and that run in real time. However, the mono-spectral information provided by each type of image sensor has advantages and disadvantages [3]. For example, although the visible light image sensor provides high spatial resolution and high-definition texture details, it fails to provide useful information under low lighting or bad weather. Thermal image sensors can distinguish the target from the background based on radiation differences and retain a good image-capturing ability even in bad scenarios such as low light intensity, heavy rain, or thick smoke, as shown in Figure 1. Although different sensors can obtain multispectral information from the same scene at different wavelengths, the difficulty lies in effectively combining the mono-spectral information from different sensors so that each type of sensor is fully exploited.
Previous studies have suggested that deep learning (DL) detects pedestrians better than manual features (traditional methods). In previous years, most DL-based methods relied on mono-spectral images captured by single sensors. Moreover, those detection methods generally use CMOS optical image-capturing systems, which may suffer from issues such as low light intensity, lack of color information, and low signal-to-noise ratio. Many studies address anchor positioning and classification. With a multiscale convolutional neural network (MS-CNN) for ADAS detection, Wei et al. [4] improved the handling of scale changes of large targets, target shading, target classification, and target positioning. However, they did not propose solutions to the issues encountered by image sensors under unfavorable lighting conditions such as insufficient light, solar glare, and headlamp glare. Blin et al. [5] used polarimetric features to enhance the accuracy of object detection when the light was insufficient; however, their datasets ignored the strong reflections caused by light sources in heavy rain or snow. Based on thermal-like features extracted from visible light images, Kruthiventi et al. [6] could detect barely visible pedestrians under challenging light intensity; however, the method is still limited because it depends only on visible light images. Traditional mono-spectral methods cannot generalize across different experimental environments, equipment, and environmental light sources, and images can only be detected well under specific environments or lighting conditions.
In recent years, most studies have shown that the fusion of RGB and thermal images can improve pedestrian detection performance. Based on the histogram of oriented gradient (HOG) features of enhanced thermal images, Hwang et al. [7] put forward multispectral aggregated channel features (ACF) and a multispectral pedestrian dataset, KAIST, for training and testing. Their results suggested that the detection performance of the multispectral method was evidently better than that of mono-spectral methods. After the first large-scale multispectral pedestrian baseline dataset was established in [7], many studies further focused on optimal multispectral fusion models in addition to using convolutional neural networks to extract features. Wagner et al. [8] proposed a multispectral pedestrian detection method based on R-CNN to discuss the advantages and disadvantages of two multispectral fusion models; according to the results, late fusion is better than early fusion. Li et al. [9] put forward a unified integrated network framework that fuses the outputs of different branches (including segmentation and detection tasks) and significantly improves performance. Chen et al. [10] used TV minimization based on structure transfer to combine infrared and visible images, retaining the infrared intensity distribution and local appearance information; however, when the pedestrian targets are not prominent in the infrared images, the detection performance of their method is affected. Considering the consistency between modalities, Zhang et al. [11] proposed a module that integrates and refines RGB and thermal features. Besides feature refinement, it is equally important to extract highly discriminative multispectral features; Zhang et al. [12] revealed the importance of such features through a channel-level cross-modality interaction attention module.
Furthermore, Zhang et al. [13] employed a pixel-level attention module to focus on the localization guidance of inter-modality and intra-modality information. Although multispectral fusion can largely improve pedestrian detection performance, detecting pedestrians in low-resolution images is unavoidable. To solve this issue, Wolpert et al. [14] proposed an anchor-free framework and employed a new data augmentation method. Additionally, Nataprawira et al. [15] merged multispectral image channels and designed a compressed YoloV3 model, sacrificing minimal accuracy while greatly improving processing time. Even though many studies have discussed modality fusion, they have not fully investigated the complementarity of single modalities, let alone the situation in which representative pedestrian features cannot be extracted owing to excessive irrelevant background information.
In addition to modality fusion, the complementarity of modalities under different lighting conditions is also worth investigating. Therefore, studies [16,17] have proposed illumination perception modules to self-adaptively re-weight the different modalities. The reason is that the detection performance of RGB images is slightly better than that of thermal images in daytime or under sufficient lighting; however, at night or under insufficient lighting, because thermal sensing does not depend on visible light, the detection performance of thermal images is significantly better than that of RGB images. Moreover, Zhou et al. [18] discovered that multispectral pedestrian detection models suffer not only from illumination modality imbalance but also from feature modality imbalance; accordingly, they proposed two solutions, namely a differential modality perception fusion module and an illumination perception module. However, training illumination networks only on daytime or night scenarios cannot deliver good results in shadowed scenarios during the day or night. Most notably, there are still doubts about temperature labels, which measure temperature conditions that cannot be obtained from existing multispectral baseline datasets, making such training ineffective. Thus, Zhuang et al. [19] further examined temperature conditions and employed a lightweight illumination- and temperature-aware multispectral network (IT-MN), hoping to improve detection performance through temperature conditions.
The remainder of this paper is organized as follows: Section 2 proposes the MFDs-Yolo and i-IAN modules for pedestrian detection; Section 3 presents the experimental results and comparisons; and the final section draws the conclusions.

2. Methodology

In this paper, we adopted different spectral information to achieve pedestrian detection for multi-sensor applications in ADAS. Based on references [20,21], we used two independent Yolo-based detectors to extract features from and detect on two different spectra, thereby exploiting RGB and thermal image information. Moreover, to allocate the fusion weights of the two spectra and handle light source confusion in actual scenarios, we proposed an illumination-aware network that self-adaptively allocates fusion weights for scenarios with varying light source intensity. The main contributions of this paper are as follows: (1) an innovative multispectral method, MFDs-Yolo (a multispectral fusion and double-stream detector with Yolo-based detectors), redesigned based on YoloV4 within the double-stream multispectral network (DSMN), which adopts a late fusion strategy over RGB and thermal image information; and (2) a redesigned i-IAN module based on MB-Net within the DSMN framework, which can self-adaptively allocate fusion weights of RGB-SubNet and thermal-SubNet for scenarios of varied light source intensity, beyond simply sufficient light in daytime or limited light at night.

2.1. Proposed Double-Stream Multispectral Network (DSMN)

The overall framework of the proposed method is shown in Figure 2, which is mainly based on an extension of the YoloV4 framework. Specifically, it consists of an MFDs-Yolo and an i-IAN module. MFDs-Yolo consists of two YoloV4 streams, which generate individual detection results from the multispectral information provided by the RGB and thermal images. i-IAN uses the RGB image to estimate the lighting conditions of the environment and adaptively adjusts the weight parameters of the RGB and thermal images using the illumination perception method. The image weights and the outputs of MFDs-Yolo are fused to produce the final detection results.

2.2. A Multispectral Fusion and Double-Stream Detector with Yolo-Based Detectors (MFDs-Yolo)

To establish a real-time pedestrian detection model for practical scenarios, the choice of neural network architecture must consider both accuracy and computation speed. Inspired by YoloV4, this work proposed an improved network framework using multispectral fusion and a double-stream design to detect pedestrians in different scenarios.
The DSMN framework is made up of two Yolo-based subnetworks, RGB-SubNet and thermal-SubNet. Each generates independent detection results consisting of classification confidence scores and bounding box offsets for the objects. RGB and far-infrared (FIR) images are taken as the respective inputs. Finally, the detection results are obtained by re-weighting the outputs of the two subnetworks with the calculated fusion weights, as shown in Figure 3.
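To make the flow concrete, the following is a minimal PyTorch sketch of the DSMN inference path described above, assuming that each subnetwork returns its confidence scores and box offsets and that i-IAN returns a per-image RGB weight; the class and attribute names (DSMN, rgb_subnet, thermal_subnet, illumination_net) are illustrative placeholders rather than the authors' released code.

```python
import torch
import torch.nn as nn

class DSMN(nn.Module):
    """Double-stream multispectral network: two Yolo-based subnetworks
    re-weighted by an illumination-aware fusion weight (late/score fusion)."""

    def __init__(self, rgb_subnet: nn.Module, thermal_subnet: nn.Module,
                 illumination_net: nn.Module):
        super().__init__()
        self.rgb_subnet = rgb_subnet              # Yolo-based RGB-SubNet
        self.thermal_subnet = thermal_subnet      # Yolo-based thermal-SubNet
        self.illumination_net = illumination_net  # i-IAN weight estimator

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        # Each subnetwork outputs class confidence scores and box offsets,
        # assumed here to have shape (B, N, K+1) and (B, N, 4).
        s_rgb, t_rgb = self.rgb_subnet(rgb)
        s_th, t_th = self.thermal_subnet(thermal)
        # i-IAN estimates the RGB fusion weight from the visible image only.
        w_rgb = self.illumination_net(rgb).view(-1, 1, 1)  # (B, 1, 1) in [0, 1]
        w_th = 1.0 - w_rgb
        # Score fusion: re-weight the detections of the two streams.
        s_final = w_rgb * s_rgb + w_th * s_th
        t_final = w_rgb * t_rgb + w_th * t_th
        return s_final, t_final
```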
The structure of each Yolo-based subnetwork is similar to that of YoloV4 and can be divided into three parts: backbone, neck, and head. The backbone, acting as the preliminary feature extractor, uses CSPNet–CSPDarknet53, which adds CSP connections to each residual block of Darknet53. Therefore, before the input of each residual block, the feature map is split into two parts by the first convolution: one part remains intact, and the other passes through a 1 × 1 convolution layer and a pooling layer. The two parts are then joined together as the input of the next residual block. In addition, Darknet53 has 53 layers, each containing Conv2D, batch normalization, and Mish; using CSPNet can enhance the CNN's learning ability and maintain accuracy while staying lightweight and reducing computation costs. The neck consists of spatial pyramid pooling (SPP) and a path aggregation network (PAN); features are fused here to enrich the information fed to the head. The kernel sizes of the max-pooling layers in SPP are k = {1 × 1, 5 × 5, 9 × 9, 13 × 13}, and the feature maps from the different pooling kernels are concatenated as the output. Compared with a single pooling layer, SPP increases the receptive field of the backbone, and PAN connects feature maps through additional sampling paths to improve the overall computational efficiency of the model. The head, consisting of multiple convolution layers, is used for predicting the bounding boxes: it outputs tensors at three scales that are combined with the anchors to obtain the positions of the prediction boxes.
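As an illustration of the neck design just described, a minimal PyTorch sketch of the SPP block is given below, assuming the YoloV4-style layout (parallel max-pooling with kernels 1, 5, 9, and 13 at stride 1 followed by channel-wise concatenation); it is not the exact implementation used in this work.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pooling branches concatenated
    along the channel axis to enlarge the receptive field."""

    def __init__(self, kernel_sizes=(1, 5, 9, 13)):
        super().__init__()
        # Stride 1 with symmetric padding keeps the spatial size unchanged,
        # so the pooled maps can be concatenated channel-wise.
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
             for k in kernel_sizes]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([pool(x) for pool in self.pools], dim=1)

# Example: a (1, 512, 13, 13) feature map becomes (1, 2048, 13, 13).
```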
Our concept of multispectral fusion originated from the comparison of fusion frameworks carried out in halfway fusion [20] and IAF R-CNN [16], covering input fusion, early fusion, halfway fusion, late fusion, and score fusion. Input fusion, the most direct method, simply combines the multispectral images at the input layer; early fusion is performed after the first convolution layer and can continue training a pre-trained backbone; halfway fusion begins after the fourth convolution layer and reduces the dimensions with NIN; late fusion usually fuses the last fully connected layers of the spectral streams; and score fusion can be divided into cascade and non-cascade designs, the former of which passes the results to the subnetwork of the other modality and fuses them with equal weights, while the latter re-weights the prediction scores and bounding box regressions output by the two subnetworks. Finally, based on the experimental results, we employed the score fusion strategy, which had the lowest error rate, to fuse the two subnetworks.

2.3. An Improved Illumination-Aware Network (i-IAN)

Insufficient and confusing lighting is inevitable in actual road scenarios. Although a multispectral fusion method combining RGB and thermal images was already used, the two kinds of spectral information still cannot be fully exploited. To allocate different weights to MFDs-Yolo, the lighting intensity should be estimated so that pedestrians can be detected even in confusing lighting scenarios. This work estimated the lighting intensity from the RGB images, as thermal images cannot reveal luminosity or brightness information, whereas RGB images deliver better results for this purpose.
As shown in Figure 4, the overall framework of i-IAN includes input and feature extraction, illumination weighting, and weight fusion. For input and feature extraction, this work resized the input tensor to (56, 56, 3) to improve the efficiency of training and testing, followed by a series of convolution, fully connected, and max-pooling layers. The feature extractor consists of two convolution blocks and a fully connected part; each convolution block uses a 3 × 3 convolution followed by ReLU and a 2 × 2 max-pooling layer. The fully connected part passes the flattened features through two dense layers ((256,) and (2,)) with a dropout between them to mitigate overfitting. Finally, the softmax loss was used to obtain the illumination value iv, while the weights w_d and w_n were obtained through Equation (1). w_d and w_n represent the illumination values for day and night, and the network is optimized by minimizing the cross-entropy loss between the illumination values and the true labels.
$w = \frac{iv}{1 + \alpha \exp\left(-(iv - 0.5)\beta\right)}, \quad iv \in [0, 1]$   (1)
where α and β are both learnable parameters.
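For illustration, a minimal PyTorch sketch of this illumination branch is shown below, following the sizes given in the text (a 56 × 56 × 3 input, two convolution blocks of 3 × 3 convolution + ReLU + 2 × 2 max pooling, a 256-unit dense layer with dropout, and a 2-way softmax yielding the day/night illumination values). The channel widths (32 and 64) and the dropout rate are assumptions made for the sketch, not values reported by the authors.

```python
import torch
import torch.nn as nn

class IlluminationNet(nn.Module):
    """i-IAN-style feature extraction branch (sketch)."""

    def __init__(self, dropout: float = 0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 56x56 -> 28x28
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 14 * 14, 256), nn.ReLU(),
            nn.Dropout(dropout),                  # mitigates overfitting
            nn.Linear(256, 2),                    # day / night logits
        )

    def forward(self, rgb_small: torch.Tensor) -> torch.Tensor:
        # Returns (w_d, w_n): softmax illumination values for day and night.
        return torch.softmax(self.classifier(self.features(rgb_small)), dim=1)
```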
In regard to illumination weighting, to better allocate the weights, a high weight is assigned to RGB-SubNet and a low weight to thermal-SubNet under good lighting, so that the final detection results benefit from both kinds of spectra. Conversely, when the weight of thermal-SubNet dominates under bad lighting conditions, the detection results can still be improved, as the information provided by the RGB image is mostly noise. Considering the above, the initial weights w_d and w_n are re-calculated to obtain the RGB weight w_rgb:
$w_{rgb} = \frac{w_d - w_n}{2}\left(\alpha_w \left|w\right| + \gamma_w\right) + \frac{1}{2}$   (2)
where |w| ∈ [0, 1] denotes the output of another independent prediction branch with a single neuron at the end of the fully connected layers, and α_w and γ_w denote learnable parameters with initial values of 1 and 0, respectively. With 0.5 as the bias, a linear regression of |w| is used to re-parameterize w_d and w_n. When the light source is sufficient, the positive term is enlarged and w_rgb increases; conversely, when the light source is insufficient, the negative term dominates and w_rgb decreases. Moreover, owing to the complementarity of RGB and thermal images, the sum of the two modality weights was defined as 1, and the thermal weight w_thermal was obtained using Equation (3):
$w_{thermal} = 1 - w_{rgb}$   (3)
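A small sketch of the re-weighting in Equations (2) and (3) follows, assuming the reconstruction of Equation (2) given above; alpha_w and gamma_w are the learnable parameters (initialized to 1 and 0), and w_abs stands for the extra single-neuron prediction |w| ∈ [0, 1].

```python
import torch
import torch.nn as nn

class IlluminationGate(nn.Module):
    """Re-parameterizes the day/night illumination values into modality weights."""

    def __init__(self):
        super().__init__()
        self.alpha_w = nn.Parameter(torch.tensor(1.0))  # initial value 1
        self.gamma_w = nn.Parameter(torch.tensor(0.0))  # initial value 0

    def forward(self, w_d, w_n, w_abs):
        # Eq. (2): a linear regression of |w| scales the day-night difference,
        # with 0.5 as the bias so that balanced lighting gives w_rgb near 0.5.
        w_rgb = (w_d - w_n) / 2.0 * (self.alpha_w * w_abs + self.gamma_w) + 0.5
        # Eq. (3): the two modality weights are complementary.
        w_thermal = 1.0 - w_rgb
        return w_rgb, w_thermal

# Example: a confident daytime prediction (w_d = 0.9, w_n = 0.1, |w| = 1.0)
# yields w_rgb = 0.9 and w_thermal = 0.1 with the initial parameters.
```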
For weight fusion, the final prediction results are calculated using w_rgb and w_thermal. Each Yolo-based subnetwork produces two outputs: the confidence scores for K + 1 classes, s = (s_0, …, s_K), and the bounding box regression offsets for the K object classes, t = (t_1, …, t_K). Thus, given s_rgb and t_rgb from RGB-SubNet and s_thermal and t_thermal from thermal-SubNet, the final detection results are obtained using Equations (4) and (5):
$s_{final} = w_{rgb} \times s_{rgb} + w_{thermal} \times s_{thermal}$   (4)
$t_{final} = w_{rgb} \times t_{rgb} + w_{thermal} \times t_{thermal}$   (5)

3. Discussion

We adopted two multispectral pedestrian baseline datasets in the experiments to evaluate the effectiveness of the proposed methods.

3.1. Dataset and Evaluation Metrics

In the experiments of this study, we used two public multispectral pedestrian baseline datasets, KAIST [7] and FLIR [22], to evaluate the effectiveness of the proposed methods. Figure 5 shows example RGB and thermal images from both datasets; the pedestrians or objects captured by the different image sensors exhibit different characteristics. First, KAIST contains 95,328 RGB–thermal image pairs (640 × 480), which include 103,128 bounding boxes and 1182 distinct pedestrians. Because of defects in the original annotations, this work adopted the improved labeling put forward by Zhang et al. [23]. There were 8963 images in the training set, while the testing set was sampled every 20 frames, yielding 2252 RGB–thermal image pairs for testing, including 1455 daytime and 797 night images. The public FLIR dataset contains 10,228 RGB–thermal image pairs (640 × 512), 9214 of which have labeled bounding boxes. There are five categories in FLIR, namely people (28,151), bicycles (4457), cars (46,692), dogs (240), and others (2228). A total of 8862 frames were used for training, while the remaining 1366 frames were used for testing. However, due to the drawbacks of the original dataset, the re-aligned version put forward by Zhang et al. [11] was used in the experiments.
In this experiment, the log-average miss rate (MR) [7] and mean average precision (mAP) were employed as the pedestrian detection evaluation metrics. MR is obtained by averaging the miss rate at false positives per image (FPPI) points sampled within the range $[10^{-2}, 10^{0}]$. The lower the MR value, the better; for mAP, the opposite holds (higher is better).
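As a reference, the sketch below shows how the log-average miss rate can be computed from a miss rate versus FPPI curve, assuming nine log-spaced reference points in $[10^{-2}, 10^{0}]$ as in the standard Caltech/KAIST evaluation protocol; it is an illustrative implementation, not the official evaluation script.

```python
import numpy as np

def log_average_miss_rate(fppi: np.ndarray, miss_rate: np.ndarray) -> float:
    """fppi and miss_rate are matched curves sorted by increasing FPPI."""
    refs = np.logspace(-2.0, 0.0, num=9)           # 9 points in [1e-2, 1e0]
    samples = []
    for ref in refs:
        # Use the miss rate at the largest FPPI not exceeding the reference;
        # if none exists, fall back to the first point on the curve.
        idx = np.where(fppi <= ref)[0]
        samples.append(miss_rate[idx[-1]] if idx.size else miss_rate[0])
    # Average in log space, guarding against log(0).
    return float(np.exp(np.mean(np.log(np.maximum(samples, 1e-10)))))
```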

3.2. Implementation Details and Quantitative Study

In this work, the model was implemented in the PyTorch framework; the MFDs-Yolo module was modified from the PyTorch version of YoloV4, and the i-IAN module was modified from the TensorFlow version of MB-Net. For the resolution, the input images of both streams were resized to 416 × 416 through interpolation to meet the required input dimensions. During training, the Adam optimizer was used with an initial learning rate of 1 × 10^-3; the learning rate was then decayed by cosine annealing, and the model was trained for 200 epochs. Additionally, this paper employed focal loss [24] to train the model effectively and help it learn semantic information. Finally, without a pre-trained model, our network was trained from scratch, and all experiments were performed on a single NVIDIA GeForce RTX 3090 Ti GPU and an Intel i7-10400 CPU.
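The following sketch reflects the training configuration described above (Adam at 1 × 10^-3, cosine annealing over 200 epochs, and focal loss [24] for the classification term), assuming a model with the DSMN-style forward interface sketched earlier; the focal-loss hyperparameters, data loader, and target format are assumptions made for illustration rather than the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Binary focal loss [24]: down-weights easy examples via (1 - p_t)^gamma.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def train(model, loader, epochs=200, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    model.to(device).train()
    for _ in range(epochs):
        for rgb, thermal, cls_targets in loader:
            rgb, thermal = rgb.to(device), thermal.to(device)
            scores, _ = model(rgb, thermal)          # fused confidence scores
            loss = focal_loss(scores, cls_targets.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()  # cosine annealing of the learning rate per epoch
```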
Table 1 and Table 2 compare the results obtained by the proposed method and the most advanced methods. In Table 1, MR (IoU@0.5) and MR (IoU@0.75) were adopted for all scenarios, including daytime and night, to evaluate and compare the differences. Based on the results, the proposed method obtained the best results under the MR (IoU@0.75) evaluation, reaching 50.73% in all scenarios, 47.85% in daytime scenarios, and 53.76% in night scenarios. IoU@0.75 requires predictions to be closer to the ground truth, that is, more accurate, and our result is nearly 10 MR lower than the best-performing MBNet. Table 2 uses mAP for evaluation, and according to the results, the proposed method delivered the best performance with an mAP of 80.30%. Moreover, the per-category details of the mAP are given in Table 3, including AP, Precision, Recall, F1-score, and MR. It can be found that the AP value of people is lower than that of the other two categories; the KAIST dataset marks pedestrians in overlapping, occluded, or crowded regions as "People". According to the formulas of Precision and Recall:
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
It can be inferred from the Precision of people reaching 89.47% while its Recall is only 44.74% that FP < FN; that is, few samples that are not people are detected as people, while many samples that are people are missed. The reason is that occluded or crowded pedestrians tend to be detected as individual instances of person rather than as the grouped people class; as a result, the FN of people increases and its AP is underestimated compared with the other two categories. Table 4 compares the results obtained by the proposed method and other state-of-the-art methods. In addition to mAP, it also reports AP (%) and MR (%) for the three categories (person, bicycle, and car), and good results are achieved relative to recent studies. Bicycle performs better than person and car, even though the latter have relatively more data. This result stems from the way the dataset was re-aligned: taking the thermal image as the benchmark, the RGB image is aligned toward the thermal image, and if the aligned RGB image is smaller than the original image size, the edges are filled with 0, which affects the judgment of i-IAN and misleads the weight distribution of the multispectral double-stream network. In contrast, the images containing the bicycle category do not need to be zero-filled after alignment, i.e., there is no black frame on the image edges, so bicycle achieves the best performance among the three categories. It can be deduced that if RGB images were used as the benchmark and thermal images were aligned to the RGB images, i-IAN could obtain the complete illumination information of the images and would be expected to achieve better results.
Figure 6 illustrates the output results of the proposed method in KAIST and FLIR datasets. The upper and lower corners of each image show the detection results of RGB and thermal images, respectively.

4. Conclusions

Due to instabilities such as lighting changes and weather conditions, image sensors still face challenges in all-weather multispectral pedestrian detection. Considering the above, this study proposed a new DL network framework for multispectral pedestrian detection. We used multispectral fusion technology combined with an illumination perception method to adjust the multispectral image information adaptively and compensate for the shortcomings of mono-spectral methods. Based on the Yolo structure, a double-stream network, MFDs-Yolo, suitable for extracting multispectral information was redesigned, and a self-adaptive modality weight adjustment method, i-IAN, was used in the late fusion strategy of the framework. The framework can accurately identify road objects ahead in the harsh environment of low light at night, while retaining abundant color information during the day when light is sufficient. Further research plans include exploring the application of our DL network framework in real-time pedestrian detection systems for autonomous vehicles. To address the limitation of computing speed, cloud computing can play a crucial role: by offloading the computational burden to powerful cloud-based infrastructure, we can enhance the overall system efficiency and enable the efficient processing of large amounts of data and complex algorithms. The availability of future 5G technology and the emergence of the Internet of Vehicles (IoV) ecosystem will be instrumental in supporting the seamless integration of cloud computing into pedestrian detection systems. Considering the complementarity of modalities, highly discriminative multispectral features, and confusing illumination, this work was evaluated on two public multispectral pedestrian datasets, KAIST and FLIR, and achieved good detection accuracy along with a low MR; the results can be applied in self-driving image systems. Compared with the state-of-the-art detection methods, the MR was reduced by 9.36% and 6.9%, respectively.

Author Contributions

C.-H.H. carried out the studies and drafted the manuscript; C.-H.H., H.-C.P. and H.-T.C. participated in its design and helped to draft the manuscript; H.-C.P. and H.-T.C. conducted the experiments, performed the statistical analysis, and developed the methodology. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Ministry of Science and Technology, Taiwan, under grant no. MOST 111-2221-E-197-020-MY3.

Data Availability Statement

This study did not report any data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yeong, D.J.; Velasco-Hernandez, G.; Barry, J.; Walsh, J. Sensor and sensor fusion technology in autonomous vehicles: A review. Sensors 2021, 21, 2140. [Google Scholar] [CrossRef] [PubMed]
  2. AAA, Inc. Automatic Emergency Braking with Pedestrian. 2019. Available online: https://www.aaa.com/AAA/common/aar/files/Research-Report-Pedestrian-Detection.pdf (accessed on 15 January 2022).
  3. Shopovska, I.; Jovanov, L.; Philips, W. Deep visible and thermal image fusion for enhanced pedestrian visibility. Sensors 2019, 19, 3727. [Google Scholar] [CrossRef] [PubMed]
  4. Wei, J.; He, J.; Zhou, Y.; Chen, K.; Tang, Z.; Xiong, Z. Enhanced object detection with deep convolutional neural networks for advanced driving assistance. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1572–1583. [Google Scholar] [CrossRef]
  5. Blin, R.; Ainouz, S.; Canu, S.; Meriaudeau, F. A new multimodal RGB and polarimetric image dataset for road scenes analysis. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 867–876. [Google Scholar] [CrossRef]
  6. Kruthiventi, S.S.S.; Sahay, P.; Biswal, R. Low-light pedestrian detection from RGB images using multi-modal knowledge distillation. In Proceedings of the 2017 24th IEEE International Conference on Image Processing, Beijing, China, 17–20 September 2017; pp. 4207–4211. [Google Scholar]
  7. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
  8. Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral pedestrian detection using deep fusion convolutional neural networks. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 22–24 April 2016. [Google Scholar]
  9. Li, C.; Song, D.; Tong, R.; Tang, M. Multispectral pedestrian detection via simultaneous detection and segmentation. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018; pp. 1–12. [Google Scholar]
  10. Chen, X.; Liu, L.; Tan, X. Robust Pedestrian Detection Based on Multi-Spectral Image Fusion and Convolutional Neural Networks. Electronics 2022, 11, 1. [Google Scholar] [CrossRef]
  11. Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 276–280. [Google Scholar] [CrossRef]
  12. Zhang, L.; Liu, Z.; Zhang, S.; Yang, X.; Qiao, H.; Huang, K.; Hussain, A. Cross-modality interactive attention network for multispectral pedestrian detection. Inf. Fusion 2019, 50, 20–29. [Google Scholar] [CrossRef]
  13. Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Guided attentive feature fusion for multispectral pedestrian detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; pp. 72–80. [Google Scholar] [CrossRef]
  14. Wolpert, A.; Teutsch, M.; Sarfraz, M.S.; Stiefelhagen, R. Anchor-free small-scale multispectral pedestrian detection. In Proceedings of the British Machine Vision Conference, Virtual, 7–10 September 2020; pp. 1–14. [Google Scholar]
  15. Nataprawira, J.; Gu, Y.; Goncharenko, I.; Kamijo, S. Pedestrian detection using multispectral images and a deep neural network. Sensors 2021, 21, 2536. [Google Scholar] [CrossRef] [PubMed]
  16. Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit. 2019, 85, 161–171. [Google Scholar] [CrossRef]
  17. Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, Y. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inf. Fusion 2019, 50, 148–157. [Google Scholar] [CrossRef]
  18. Zhou, K.; Chen, L.; Cao, X. Improving multispectral pedestrian detection by addressing modality imbalance problems. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 787–803. [Google Scholar] [CrossRef]
  19. Zhuang, Y.; Pu, Z.; Hu, J.; Wang, Y. Illumination and temperature-aware multispectral networks for edge-computing-enabled pedestrian detection. IEEE Trans. Netw. Sci. Eng. 2021, 9, 1282–1295. [Google Scholar] [CrossRef]
  20. Liu, J.; Zhang, S.; Wang, S.; Metaxas, D. Multispectral deep neural networks for pedestrian detection. In Proceedings of the British Machine Vision Conference 2016, New York, NY, USA, 20 September 2016. [Google Scholar]
  21. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y. YoloV4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  22. FLIR Starter Thermal Dataset. Available online: https://www.flir.com/oem/adas/adas-dataset-form/ (accessed on 16 March 2023).
  23. Zhang, L.; Zhu, X.; Chen, X.; Yang, X.; Lei, Z.; Liu, Z. Weakly aligned cross-modal learning for multispectral pedestrian detection. In Proceedings of the IEEE/CVF International Conference Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  24. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [PubMed]
  25. Konig, D.; Adam, M.; Jarvers, C.; Layher, G.; Neumann, H.; Teutsch, M. Fully convolutional region proposal networks for multispectral person detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 243–250. [Google Scholar] [CrossRef]
  26. Zhang, L.; Liu, Z.; Chen, X.; Yang, X. The cross-modality disparity problem in multispectral pedestrian detection. arXiv 2019, arXiv:1901.02645. [Google Scholar]
  27. Ghose, D.; Desai, S.M.; Bhattacharya, S.; Chakraborty, D.; Fiterau, M.; Rahman, T. Pedestrian detection in thermal images using saliency maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 988–997. [Google Scholar]
  28. Kristo, M.; Ivasic-Kos, M.; Pobar, M. Thermal object detection in difficult weather conditions using YOLO. IEEE Access 2020, 8, 125459–125476. [Google Scholar] [CrossRef]
  29. Munir, F.; Azam, S.; Jeon, M. SSTN: Self-supervised domain adaptation thermal object detection for autonomous driving. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 206–213. [Google Scholar] [CrossRef]
  30. Dasgupta, K.; Das, A.; Das, S.; Bhattacharya, U.; Yogamani, S. Spatio-contextual deep network-based multimodal pedestrian detection for autonomous driving. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15940–15950. [Google Scholar] [CrossRef]
  31. Devaguptapu, C.; Akolekar, N.; Sharma, M.M.; Balasubramanian, V.N. Borrow from anywhere: Pseudo multi-modal object detection in thermal imagery. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  32. Kieu, M.; Bagdanov, A.D.; Bertini, M. Bottom-up and layerwise domain adaptation for pedestrian detection in thermal images. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–19. [Google Scholar] [CrossRef]
  33. Zuo, X.; Wang, Z.; Liu, Y.; Shen, J.; Wang, H. LGADet: Light-weight anchor-free multispectral pedestrian detection with mixed local and global attention. Neural Process. Lett. 2022, 1–18. [Google Scholar] [CrossRef]
Figure 1. Night scenario in the pedestrian dataset KAIST (red circles indicate pedestrians): (a) visible light image; (b) far-infrared (FIR) image.
Figure 2. Overview of the DSMN framework.
Figure 3. The DSMN framework with Yolo-based detectors.
Figure 4. Framework of i-IAN.
Figure 5. Images obtained from the public datasets. (a) KAIST RGB at day. (b) KAIST RGB at night. (c) KAIST thermal at day. (d) KAIST thermal at night. (e) FLIR RGB at day. (f) FLIR RGB at night. (g) FLIR thermal at day. (h) FLIR thermal at night.
Figure 6. Detection results on the public datasets (colored boxes indicate detected objects). (a) KAIST RGB at day. (b) KAIST RGB at night. (c) KAIST thermal at day. (d) KAIST thermal at night. (e) FLIR RGB at day. (f) FLIR RGB at night. (g) FLIR thermal at day. (h) FLIR thermal at night.
Table 1. The effectiveness comparison of the proposed method in the KAIST dataset with other methods.

| Methods | MR (IoU@0.5) All (%) | Day (%) | Night (%) | MR (IoU@0.75) All (%) | Day (%) | Night (%) | Platform | Speed (s) |
|---|---|---|---|---|---|---|---|---|
| ACF [7] | 47.32 | 42.57 | 56.17 | 88.79 | 87.70 | 91.22 | MATLAB | 7.73 |
| Proposed YOLOv3 [10] | 43.25 | 46.99 | 35.84 | – | – | – | 1080 Ti | – |
| Halfway fusion [20] | 25.75 | 24.88 | 26.59 | 81.29 | 78.43 | 86.80 | TITAN X | 0.43 |
| Fusion RPN+BF [25] | 18.29 | 19.57 | 16.27 | 72.97 | 68.14 | 81.35 | MATLAB | 0.80 |
| IAF R-CNN [16] | 15.73 | 14.55 | 18.26 | 75.50 | 72.34 | 81.12 | TITAN X | 0.21 |
| IATDNN + IASS [17] | 14.95 | 14.67 | 15.72 | 76.69 | 76.46 | 77.05 | TITAN X | 0.25 |
| RFA [26] | 14.61 | 16.78 | 10.21 | – | – | – | TITAN X | 0.08 |
| CIAN [12] | 14.12 | 14.77 | 11.13 | – | – | – | 1080 Ti | 0.08 |
| MSDS-RCNN [9] | 11.34 | 10.53 | 12.94 | 70.57 | 67.36 | 79.25 | TITAN X | 0.22 |
| AR-CNN [23] | 9.34 | 9.94 | 8.38 | 64.22 | 57.87 | 76.82 | 1080 Ti | 0.12 |
| MBNet [18] | 8.13 | 8.28 | 7.86 | 60.12 | 54.90 | 68.34 | 1080 Ti | 0.07 |
| This work | 14.33 | 13.34 | 22.36 | 50.76 | 47.85 | 53.76 | 3090 Ti | 0.76 |
Table 2. The mAP comparison of the proposed method in the KAIST dataset with other methods.

| Methods | mAP (%) |
|---|---|
| PiCA-Net [27] | 65.80 |
| R3Net [27] | 70.85 |
| tY [28] | 63.00 |
| SSTN101 [29] | 73.22 |
| MuFEm + ScoFA [30] | 78.03 |
| This work | 80.30 |
Table 3. The mAP details of the proposed method in the KAIST dataset.

| Category | mAP (%) | AP (%) | Precision (%) | Recall (%) | F1 Score | MR (%) |
|---|---|---|---|---|---|---|
| Person | 80.30 | 93.21 | 92.89 | 86.82 | 0.90 | 4.62 |
| People | | 64.29 | 89.47 | 44.74 | 0.60 | 28.94 |
| Cyclist | | 83.41 | 88.99 | 74.05 | 0.81 | 11.45 |
Table 4. The AP and mAP comparison on the FLIR dataset with other methods.

| Methods | mAP (%) | AP Person (%) | AP Bicycle (%) | AP Car (%) | MR Person (%) | MR Bicycle (%) | MR Car (%) |
|---|---|---|---|---|---|---|---|
| Baseline | 53.97 | 54.69 | 39.66 | 67.57 | – | – | – |
| MMTOD-CG [31] | 61.40 | 63.31 | 50.26 | 70.63 | – | – | – |
| MMTOD-UNIT [31] | 61.54 | 64.47 | 49.43 | 70.72 | – | – | – |
| TD(T, T) [32] | 71.40 | 75.50 | 51.90 | 86.90 | – | – | – |
| BU(AT, T) [32] | 73.10 | 76.10 | 56.10 | 87.00 | – | – | – |
| BU(LT, T) [32] | 73.20 | 75.60 | 57.40 | 86.50 | – | – | – |
| LGADet [33] | 74.52 | – | – | – | 28.43 | – | – |
| This work | 70.66 | 70.13 | 65.01 | 77.03 | 21.53 | 22.43 | 17.00 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

