Article

Research on Pedestrian Detection Method Based on Dual-Branch YOLOv8 Network of Visible Light and Infrared Images

College of Automobile and Traffic Engineering, Liaoning University of Technology, Jinzhou 121001, China
*
Author to whom correspondence should be addressed.
World Electr. Veh. J. 2026, 17(4), 177; https://doi.org/10.3390/wevj17040177
Submission received: 28 February 2026 / Revised: 19 March 2026 / Accepted: 25 March 2026 / Published: 26 March 2026
(This article belongs to the Section Vehicle and Transportation Systems)

Abstract

In complex traffic environments such as low light, strong glare, occlusion, and nighttime scenes, pedestrian detection systems that rely solely on a single visible light sensor suffer from low detection accuracy and poor robustness. Based on the YOLOv8 convolutional network, this paper adopts a dual-branch structure that processes visible light and infrared images simultaneously, fully utilizing feature information at different scales to effectively detect pedestrian targets in complex and changeable environments. To address the issues of insufficient interaction of modal feature information and fixed fusion weights, a cross-modal feature interaction and enhancement mechanism is introduced. A modal-channel interaction block (MCI-Block) is designed, in which residual connection structures and weight interaction achieve feature enhancement and filter out noise information. A dynamic weighted feature fusion strategy is also introduced to adaptively adjust the contribution ratio of the different modal features during fusion, aiming to enhance the discrimination of key pedestrian regions. The network designed in this paper was trained and tested on the visible light and infrared pedestrian detection datasets LLVIP and Kaist. In addition, the test results of the dual-branch model and the proposed model were further verified in actual traffic scenarios. The results show that the dual-branch YOLOv8 network for visible light and infrared images constructed in this paper reliably enhances the detection performance of pedestrian targets in complex traffic environments, including precision, recall, and mAP@0.5, thereby improving the robustness of pedestrian detection.

1. Introduction

In recent years, with the development of deep learning theory, pedestrian detection technology based on computer vision has been widely applied in intelligent driving tasks. Using video captured by sensors, the pedestrian targets in the images are detected and identified through appropriate algorithms. The identification results are then transmitted to the intelligent driving decision control system, enabling tasks such as automatic emergency braking and forward crossing warning to be accomplished.
The traditional pedestrian detection system relies solely on visible light sensors, which is insufficient to cope with the complex and ever-changing external environment, thereby affecting the performance of intelligent driving in real-world scenarios. Visible light sensors can capture information such as object edges, feature details, and background. However, due to their sensitivity to light, the visible light sensors are unable to effectively capture target information in strong glare, low light, and night scenes, thereby affecting the personal safety of vulnerable road users and the driving safety and comfort of intelligent driving systems. Infrared sensors can distinguish pedestrians from the surrounding environment in complex and changing traffic conditions. However, they have low resolution and lack information such as detailed textures. Moreover, in traffic environments with a high density of vehicles and pedestrians, the heat radiation range of vehicles is wide, which easily causes the background characteristics of the surrounding environment to be radiated, resulting in incorrect detection of pedestrian targets. Therefore, it is necessary to fully utilize the advantages of visible light sensors and infrared sensors, and at the same time introduce the dual-modal feature interaction and enhancement mechanism, in order to ensure effective detection of pedestrian targets in complex and changing environments.
A large number of scholars have conducted extensive research on pedestrian detection involving the fusion of visible light and infrared, focusing mainly on the fusion stage and the fusion strategies. Reference [1] designed different network structures for the early, middle and late fusion stages of the dual-branch features, and ultimately verified that middle-stage fusion achieved the best performance. Reference [2] utilized the Transformer self-attention fusion strategy to convert different modalities into temporal problems, enabling sufficient interaction of features between visible light and infrared, thereby enhancing the detection performance. Reference [3] introduced a dual cross-attention module and utilized an iterative interaction mechanism to optimize the fusion efficiency. Reference [4] utilized the dual nature of Transformer and channel attention mechanisms to guide the feature fusion of different modalities. Reference [5] utilized the Swin Transformer to design an attention-guided cross-modal module, which enables the extraction of local information and the complementary integration of cross-modal information. Reference [6] exploited the complementary nature among different modalities by adding cyclic fusion and refining the features of each modality in the network structure. Reference [7] proposed a fusion strategy that introduces learnable weighting parameters, enabling more efficient utilization of information integration among different modalities. Reference [8] proposed a detection framework that combines low-light enhancement with dual modalities in order to fully exploit the effective information among different modalities. Reference [9] addressed the issue that existing fusion strategies did not take lighting factors into account, and proposed a progressive fusion network based on light perception, PIAFusion, which improved the performance of multi-spectral pedestrian detection.
Reference [10] compared three fusion strategies to adapt to different lighting conditions in order to optimize the detection performance. Eventually, it was concluded that the mid-term fusion method yielded the best results, and the generated lighting perception fusion weights could best adapt to different lighting conditions. Reference [11] addressed the issue of insufficient detection performance of visible light images under low-light conditions and the problem of false detections caused by noise in multispectral fusion. It proposed the TFDnet perception strategy, which exploits the similarities between different features during the fusion stage, enhances the contrast between pedestrian features and background features, highlights pedestrian features and suppresses irrelevant information. Reference [12] focuses on the dynamic changes that occur in different modalities during multispectral pedestrian detection, proposing a stable multispectral pedestrian detection algorithm. By parameterizing different modalities of evidence, it fuses multi-branch evidence and introduces a modal enhancement module to improve the interaction effect. Reference [13] proposed the PedDet adaptive spectral optimization framework, which achieves significant improvement in the accuracy and robustness of multimodal pedestrian detection in complex lighting conditions through multi-scale spectral feature fusion and a mechanism for decoupling illumination-robust features.
The current mainstream methods still have the following limitations: (1) the interaction ability of modal features is insufficient, and the correlations between modalities cannot be fully exploited; (2) to address the issue of cross-modal correlation, some methods employ a hybrid architecture combining CNNs and Transformers, resulting in increased computational complexity and a significant surge in computational load; (3) the fusion weights are fixed and cannot be dynamically adjusted according to changes in the scene; furthermore, parameters such as weights require fine-tuning to achieve optimal performance. This lack of hyperparameter robustness makes the model susceptible to the effects of parameter settings in practical applications. In response to the above issues, this paper improves the convolutional network structure based on the YOLOv8 algorithm, expanding the main network into a dual-branch main network, enabling it to simultaneously process infrared and visible light information. Adding a modal-channel interaction block to the main network can effectively extract visible light and infrared features while suppressing noise information. A dynamic alignment feature fusion module is added between the main network and the neck. Based on the features, the contribution ratio of visible light and infrared is dynamically adjusted to enhance the robustness of pedestrian detection for infrared and visible light fusion.
The remainder of this paper is organized as follows: The first part analyzes the research progress in the field of pedestrian detection involving the fusion of visible light and infrared. The second part constructs a dual-modal fusion YOLOv8 network structure for visible light and infrared images. This structure can simultaneously process the information of visible light and infrared images and, through multiple convolutions, extracts the features of the detected targets. A modal-channel interaction block is designed to achieve feature fusion; residual connection structures and dynamic weight interaction are added to further enhance the features and filter out noise information. A dynamic alignment feature fusion module (DAFF) is designed, which adjusts the contribution ratios of visible light and infrared images through a dynamic weight fusion strategy to achieve refined feature fusion. The third part uses the public datasets LLVIP and Kaist to train and test the dual-modal visible light and infrared YOLOv8 pedestrian detection network, and the performance evaluation indicators of the detection are compared. Finally, the actual detection results are verified for scenarios with obstruction or heat source interference in the non-motor vehicle lane, common traffic scenarios involving strong light-induced glare or pedestrian flows at crosswalks, and scenarios of multiple people walking through the street at night. The fourth part summarizes the main research work and conclusions of the paper.

2. Design of Dual-Modal Fusion Network Structure for Pedestrian Detection

2.1. Dual-Branch Network Infrastructure

With the iterative updates of the YOLO algorithm [14], each version incorporates the most advanced deep learning methods of its era, including convolutional modules, attention modules, data augmentation, etc., resulting in significant improvements in the performance and efficiency of the current models [15]. Compared with previous versions, YOLOv8 is better at capturing real targets and has a lower rate of missed detections. Although its accuracy has slightly declined, its recall rate and overall performance are the best. Therefore, in this paper, YOLOv8 is selected as the benchmark model. The traditional YOLOv8, through an end-to-end network structure, can only handle a single data type. When visible light or infrared images are input, the final output is a single image that contains detection boxes and the corresponding category scores. To enable the simultaneous processing of visible light and infrared signals, the main network is expanded into a dual-branch structure, as shown in Figure 1.
As shown in Figure 1, the visible light image Fvis and the infrared image Fir can be simultaneously input into the YOLOv8 network. After multiple convolution feature extractions, the features of the two images are fused using a mid-level fusion method. Since the receptive fields obtained by each convolutional layer are different, in order to fully utilize the feature information of different scales, the visible light and infrared features are fused respectively in the P3, P4, and P5 feature extraction module layers, and finally the fused feature Ffusion is transmitted to the YOLOv8 neck and head network.
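The dual-branch data flow described above can be sketched as follows. This is an illustrative numpy sketch, not the authors' implementation: the backbone is reduced to a shape-only stand-in that mimics the P3/P4/P5 strides (8/16/32), and the learned mid-level fusion is replaced by a simple element-wise mean purely to show where F_vis and F_ir meet before the neck.

```python
import numpy as np

def extract_pyramid(img, channels=(64, 128, 256)):
    """Toy stand-in for a backbone branch: P3/P4/P5 maps at strides 8/16/32.

    A real YOLOv8 branch would produce learned features; here we only
    reproduce the shapes so the fusion points are visible.
    """
    h, w = img.shape[-2:]
    return {f"P{i + 3}": np.zeros((c, h // s, w // s))
            for i, (c, s) in enumerate(zip(channels, (8, 16, 32)))}

def mid_fusion(f_vis, f_ir):
    """Fuse corresponding pyramid levels (placeholder: element-wise mean)."""
    return {lvl: 0.5 * (f_vis[lvl] + f_ir[lvl]) for lvl in f_vis}

vis = np.random.rand(3, 640, 640)   # visible light image F_vis (RGB)
ir = np.random.rand(1, 640, 640)    # infrared image F_ir (single channel)

# F_fusion at each of P3, P4, P5 is what would be passed to the YOLOv8 neck.
f_fused = mid_fusion(extract_pyramid(vis), extract_pyramid(ir))
```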

2.2. Modal-Channel Interaction Block Design

During the feature extraction process, in addition to extracting the unique features of visible light and infrared, the common features should also be retained. However, these features contain noise information. Directly integrating them may result in feature redundancy or the loss of important features, ultimately leading to reduced detection efficiency and decreased accuracy. Therefore, a modal-channel interaction block (MCI-Block) was designed, and residual connection structures and weight interactions were added within the block to achieve feature enhancement and noise information filtering. The MCI-Block structure is shown in Figure 2.
The visible light and infrared features can be divided into common parts and unique parts. To enhance the consistency of feature structures, adaptive average pooling operations are performed on the input features separately. First, a difference operation is performed between the visible light and infrared features to extract the part in which the two differ; the resulting difference feature “diff” serves as the basis for generating weights, as shown in Formula (1).
$diff = F_{vis} - F_{ir}$ (1)
In Formula (1), $F_{vis}$ represents the visible light features after pooling and $F_{ir}$ represents the infrared features after pooling.
To address the issue of feature redundancy, the channel compression recovery module is utilized to reduce the feature dimension and decrease the computational load. The visible light and infrared features are respectively reduced in the number of channels through convolution, and activation functions are introduced, as shown in Formula (2) and Formula (3).
$F_{vis\_c} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(F_{vis})))$ (2)
$F_{ir\_c} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(F_{ir})))$ (3)
where the convolution kernel size of the Conv layer is 1 × 1; BN denotes batch normalization; ReLU denotes the activation function; F_vis_c and F_ir_c denote the visible light and infrared features after channel compression, respectively.
The channel recovery stage also uses convolution to restore the number of channels to the original stage. The formula is as follows:
$F_{vis\_r} = \mathrm{Conv}(F_{vis\_c})$ (4)
$F_{ir\_r} = \mathrm{Conv}(F_{ir\_c})$ (5)
where the convolution kernel size of the Conv layer is 1 × 1; F_vis_r and F_ir_r represent the visible light and infrared features after channel restoration, respectively. During the recovery stage, residual connections are introduced to prevent information loss. The difference part is processed through a 3 × 3 convolution to generate spatial attention weights, which guide the enhancement of internal features in the visible light and infrared branches. The restored features are multiplied by the generated weights to enhance the features of the regions with significant differences, as shown in Formulas (6) and (7).
$X_{vis} = F_{vis\_r} \otimes \mathrm{Sigmoid}(\mathrm{Conv}(diff))$ (6)
$X_{ir} = F_{ir\_r} \otimes \mathrm{Sigmoid}(\mathrm{Conv}(diff))$ (7)
where Sigmoid represents the activation function and the convolution kernel size of the Conv layer is 3 × 3.
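The weighting path of the MCI-Block (Formulas (1)–(7)) can be sketched in numpy as follows. This is a hedged sketch, not the paper's code: the 1 × 1 convolutions are approximated by random channel-mixing matrices, BN and pooling are omitted, and the 3 × 3 convolution before Sigmoid is dropped so only the data flow remains visible.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
F_vis = rng.random((C, H, W))   # pooled visible light features
F_ir = rng.random((C, H, W))    # pooled infrared features

diff = F_vis - F_ir             # Formula (1): difference feature

def compress_recover(F, r=2):
    """Channel compression (C -> C/r, ReLU) then recovery (C/r -> C), with residual.

    The 1x1 convolutions of Formulas (2)-(5) are stood in for by
    per-pixel channel-mixing matrices.
    """
    C = F.shape[0]
    W_down = rng.standard_normal((C // r, C)) / np.sqrt(C)
    W_up = rng.standard_normal((C, C // r)) / np.sqrt(C // r)
    F_c = relu(np.einsum("oc,chw->ohw", W_down, F))   # (2)/(3): compression
    F_r = np.einsum("oc,chw->ohw", W_up, F_c)         # (4)/(5): recovery
    return F + F_r                                    # residual connection

att = sigmoid(diff)                    # stand-in for Sigmoid(Conv3x3(diff))
X_vis = compress_recover(F_vis) * att  # Formula (6)
X_ir = compress_recover(F_ir) * att    # Formula (7)
```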
To reduce the sequence length and lower the computational complexity, the self-attention is calculated within a small window, and the sliding window method [5] is adopted to achieve the ability of global modeling. This paper employs the multi-scale sparse window attention (MS-SWA) to confine the global self-attention within each partitioned window, thereby reducing the computational complexity and ensuring that it grows linearly with the size of the feature map. After performing the MS-SWA operation, the expression ability of the detailed parts within the feature map can be effectively enhanced.
The input feature in the modal-channel interaction block has size $\chi \in \mathbb{R}^{B\times C\times H\times W}$. To ensure that the feature map can be evenly divided by the window size, a padding operation is performed, after which the padded feature map has size $\chi' \in \mathbb{R}^{B\times C\times H'\times W'}$. By dividing the global feature map into multiple non-overlapping small windows, the global feature map is transformed into a sequence of features for each local window. The attention scores are then normalized using Softmax to generate a weight matrix. Subsequently, the features from each head are concatenated along the channel dimension to restore the original number of channels, yielding the detail-enhanced visible light feature Xvis-att and infrared feature Xir-att. The enhanced features are transmitted to the local compensation module (LCM), where visible light and infrared are complementarily enhanced through local features.
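The padding and window-partition step that precedes the windowed self-attention can be sketched as follows (an assumption-laden illustration, not the paper's MS-SWA implementation: attention itself is omitted, and only the pad-to-multiple and non-overlapping partition into window-local token sequences are shown).

```python
import numpy as np

def window_partition(x, win=8):
    """Pad H and W up to multiples of `win`, then split the map into
    non-overlapping windows, each flattened to a token sequence.

    Input:  (C, H, W) feature map.
    Output: (num_windows, win*win, C) window-local token sequences, on
    which self-attention would be computed independently per window.
    """
    C, H, W = x.shape
    pad_h, pad_w = (-H) % win, (-W) % win
    x = np.pad(x, ((0, 0), (0, pad_h), (0, pad_w)))   # pad bottom/right
    Hp, Wp = H + pad_h, W + pad_w
    x = x.reshape(C, Hp // win, win, Wp // win, win)
    x = x.transpose(1, 3, 2, 4, 0)                    # windows first
    return x.reshape(-1, win * win, C)

# A 20x20 map is padded to 24x24, giving 3x3 = 9 windows of 64 tokens each.
tokens = window_partition(np.zeros((16, 20, 20)), win=8)
```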
The local detail features of visible light and infrared are extracted using grouped convolution and point-wise convolution, as shown in Formulas (8) and (9).
$X_{vis\_lcm} = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(\mathrm{Conv}_{3\times 3}(X_{vis}))))$ (8)
$X_{ir\_lcm} = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(\mathrm{Conv}_{3\times 3}(X_{ir}))))$ (9)
where SiLU represents the activation function; Conv1×1 and Conv3×3 represent pointwise convolution and grouped convolution, respectively.
The obtained visible light and infrared local features are concatenated and processed by the global cross-modal attention module (GCMA). They are respectively processed by 3 × 3 convolution and 1 × 1 convolution as well as activation functions to generate global attention weights, and then split along the channel dimension, as shown in Formulas (10) –(12).
$X_{fuse} = \mathrm{Concat}(X_{vis\_lcm}, X_{ir\_lcm}) \in \mathbb{R}^{B\times 2C\times H\times W}$ (10)
$W = \mathrm{Sigmoid}(\mathrm{Conv}_{1\times 1}(\mathrm{Conv}_{3\times 3}(X_{fuse})))$ (11)
$W_{vis}, W_{ir} = \mathrm{Split}(W, 2)$ (12)
where Xfuse represents the fused feature, Concat indicates feature concatenation, Wir represents the infrared weight, Wvis represents the visible light weight, and Split indicates splitting along the channel dimension into two sub-weights.
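The GCMA weight generation of Formulas (10)–(12) can be sketched as follows (a hedged numpy illustration: the Conv3×3 + Conv1×1 stack is approximated by a single random channel-mixing matrix, since only the concat → Sigmoid → split data flow is being demonstrated).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
C, H, W = 4, 8, 8
X_vis_lcm = rng.random((C, H, W))   # local visible light features from the LCM
X_ir_lcm = rng.random((C, H, W))    # local infrared features from the LCM

# Formula (10): concatenate along the channel axis -> 2C channels.
X_fuse = np.concatenate([X_vis_lcm, X_ir_lcm], axis=0)

# Formula (11): channel-mixing stand-in for Conv1x1(Conv3x3(.)), then Sigmoid.
mix = rng.standard_normal((2 * C, 2 * C)) / np.sqrt(2 * C)
W_all = sigmoid(np.einsum("oc,chw->ohw", mix, X_fuse))

# Formula (12): split the 2C-channel weight map into visible/infrared halves.
W_vis, W_ir = np.split(W_all, 2, axis=0)
```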
By using the global attention weights after splitting for compensation operations, the visible light features are used to compensate for the texture information of the infrared features, and the infrared features are used to compensate for the visible light target information. A unified learnable parameter γ is used to control the degree of compensation, with an initial value of 0.5. Through the above processing, the complementary enhancement of visible light and infrared features is achieved in both the channel and spatial domains, thereby improving the quality and performance of feature fusion, as shown in Formulas (13) and (14).
$X_{vis\_out} = X_{vis\_att} \oplus \gamma_1 (X_{vis\_lcm} \otimes W_{ir})$ (13)
$X_{ir\_out} = X_{ir\_att} \oplus \gamma_2 (X_{ir\_lcm} \otimes W_{vis})$ (14)
where Xvis-out and Xir-out respectively represent the output of visible light and the output of infrared features.
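The compensation step of Formulas (13) and (14) can be sketched as follows, assuming the fusion operators are element-wise addition and multiplication (the input tensors here are random placeholders standing in for the MS-SWA, LCM, and GCMA outputs).

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, W = 4, 8, 8
# Placeholder inputs: enhanced-detail features (MS-SWA outputs) ...
X_vis_att, X_ir_att = rng.random((C, H, W)), rng.random((C, H, W))
# ... local detail features (LCM outputs) ...
X_vis_lcm, X_ir_lcm = rng.random((C, H, W)), rng.random((C, H, W))
# ... and split global attention weights (GCMA outputs).
W_vis, W_ir = rng.random((C, H, W)), rng.random((C, H, W))

# gamma is learnable in the real model, initialized to 0.5 per the text.
gamma1 = gamma2 = 0.5

# Formula (13): the infrared weight compensates the visible light branch.
X_vis_out = X_vis_att + gamma1 * (X_vis_lcm * W_ir)
# Formula (14): the visible light weight compensates the infrared branch.
X_ir_out = X_ir_att + gamma2 * (X_ir_lcm * W_vis)
```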

2.3. Dynamic Alignment Feature Fusion Module

The design of the fusion strategy will determine the confidence and accuracy of the final detection results. If specific fusion strategies such as direct concatenation [16] or element addition [17] are adopted, the advantages of visible light and infrared images cannot be fully exploited. In low-light and complex traffic scenarios, due to the presence of numerous heat sources and the wide distribution range of local infrared vehicle features, it is easy to lose the infrared characteristics of pedestrians, resulting in missed detections and false detections. To prevent the occurrence of the above situation, a dynamic alignment feature fusion module (DAFF) was designed. It uses dynamic weight adjustment to determine the contribution ratio of visible light and infrared images, achieving refined fusion, improving the quality of the final fused features, and thereby enhancing the detection effect. The structure of DAFF is shown in Figure 3.
To achieve refined fusion of visible light and infrared features, a 1 × 1 convolution is applied to each modality to uniformly adjust their representation dimensions along the channel dimension. After alignment, the features are concatenated along the channel dimension. The fusion weights are extracted through a 3 × 3 convolution and activated by Sigmoid, after which the result is split along the channel dimension into two sub-weights that adjust the importance of the visible light and infrared features, respectively. Learnable parameters p1 and p2, with initial values of 0.5, are introduced to automatically adjust the modal fusion ratio and further improve the fusion quality, as shown in Formulas (15) and (16).
$W_{fuse} = \mathrm{Sigmoid}(\mathrm{Conv}(\mathrm{Concat}(X_{vis}, X_{ir})))$ (15)
$X_{out} = p_1 (X_{vis} \otimes W_1) \oplus p_2 (X_{ir} \otimes W_2)$ (16)
where Wfuse represents the weight for generating the spliced features; W1 and W2 respectively represent the weights for generating visible light and infrared features; p1 and p2 respectively represent the learnable parameters of the visible light and infrared features.
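The DAFF fusion of Formulas (15) and (16) can be sketched as follows (a hedged numpy illustration: the alignment and weight-extraction convolutions are approximated by one random channel-mixing matrix, and p1, p2, which are learnable in the real module, are fixed at their initial value of 0.5).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
C, H, W = 4, 8, 8
X_vis = rng.random((C, H, W))   # channel-aligned visible light features
X_ir = rng.random((C, H, W))    # channel-aligned infrared features

# Formula (15): concat, conv (stand-in: channel mixing), Sigmoid.
cat = np.concatenate([X_vis, X_ir], axis=0)
mix = rng.standard_normal((2 * C, 2 * C)) / np.sqrt(2 * C)
W_fuse = sigmoid(np.einsum("oc,chw->ohw", mix, cat))
W1, W2 = np.split(W_fuse, 2, axis=0)   # per-modality sub-weights

# Formula (16): learnable scalars rescale each weighted modality.
p1 = p2 = 0.5
X_out = p1 * (X_vis * W1) + p2 * (X_ir * W2)
```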

3. Pedestrian Detection Experiment and Result Analysis

3.1. Experimental Environment and Dataset Settings

The hardware platform for the pedestrian detection experiment is NVIDIA RTX 4090 (manufactured by NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB memory. The software environment is PyTorch 1.10.0, CUDA 11.3, and Python 3.8. The model is trained on the GPU using the Adam optimizer, with an initial learning rate of 0.001 and a final learning rate of 0.2. The batch size is set to 64, and the number of epochs is set to 100. The image size is uniformly set to 640 × 640. The selected datasets are the publicly available LLVIP [18] and Kaist [19], which are used for model training and validation. Both datasets contain infrared and visible light images that are pixel-aligned and include object detection annotations. LLVIP is a visible light and infrared paired pedestrian detection dataset. This dataset contains 15,488 pairs of visible light and infrared images, and most of them are low-light images. The dataset is divided into a training set of 12,025 pairs and a test set of 3463 pairs. Kaist is a widely used dataset for visible light and infrared pedestrian detection, containing 47,664 pairs of visible light and infrared images, captured in street and other regular traffic scenarios. Since the original dataset was derived from consecutive frames of videos and there were issues with poor annotation quality, a cleaned dataset was adopted. The training set consists of 7095 image pairs, and the test set consists of 2252 pairs.

3.2. Selection of Evaluation Indicators

During the experiments, standard performance indicators were used to assess the proposed detection method: precision (P), recall (R), and mean average precision (mAP); in this paper, the IoU threshold was set to 0.5, i.e., mAP@0.5.
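Under the standard definitions assumed here (a detection counts as a true positive when its IoU with an unmatched ground-truth box is at least 0.5; P = TP/(TP+FP), R = TP/(TP+FN)), the indicators can be computed as in this minimal sketch with hypothetical boxes and greedy matching.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def precision_recall(dets, gts, thr=0.5):
    """Greedily match detections to ground truth at the given IoU threshold."""
    matched, tp = set(), 0
    for d in dets:
        for i, g in enumerate(gts):
            if i not in matched and iou(d, g) >= thr:
                matched.add(i)
                tp += 1
                break
    return tp / len(dets), tp / len(gts)

# Hypothetical example: one true positive, one false positive, one miss.
p, r = precision_recall([(0, 0, 10, 10), (20, 20, 30, 30)],
                        [(1, 1, 10, 10), (50, 50, 60, 60)])
```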

3.3. Experimental Results and Analysis

To verify the detection performance of the improved model, the improved algorithm was compared with the original YOLOv8 model using visible light and infrared input separately, as well as with a YOLOv8 dual-branch variant that only performs element-wise addition. At the same time, ablation experiments were conducted on the designed modules to verify the effectiveness of each module, as shown in Table 1; the convergence of the training process is shown in Figure 4.
From Table 1 it can be seen that compared with the model that only uses YOLOv8 to detect infrared images, the improved model has achieved a 4.1% increase in recall rate on the LLVIP dataset and a 9.4% increase on the Kaist dataset. This indicates that the improved model has addressed the issue of missed detections while maintaining accuracy. Compared with using YOLOv8 to detect visible light images alone, the improved model has achieved significant improvements on both datasets. The accuracy has increased by 9.3% and 7.4% respectively, and the mAP@0.5 has increased by 11.7% and 22.8% respectively. The research results show that the improved model can effectively make up for the shortcomings of using only a single type of image in achieving target detection in low-light scenarios. The ablation experiments by adding MCI-Block and DAFF to the dual-branch model further verified the effectiveness of the designed modules in the pedestrian target detection task. As can be seen from Table 1, when each module is used independently, all the indicators have improved to varying degrees on both the LLVIP and Kaist datasets. When all the designed modules are integrated into the dual-branch backbone network, the model performance is higher than that of a single module. This indicates that under the mutual promotion effect of each module, the pedestrian detection ability that integrates visible light and infrared features can be effectively enhanced. From the comparison of training situations of different models in Figure 4, it can be seen that when the dual-branch network is equipped with different sub-modules simultaneously, all the models can eventually converge to a stable state. Moreover, the model in this paper can also reduce the fluctuations in the early training stage.
To further verify the testing effectiveness of the dual-branch model and the improved model (the model described in this paper) in actual traffic scenarios, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10 show the comparison detection results of the two models in different traffic scenarios.
Figure 5 shows the pedestrian detection results using two models (the dual-branch baseline model and the model in this paper) under the condition of non-motor vehicle lane obstruction in infrared (left) and visible light (right) images. In Figure 5a, due to the obstruction by trees, the baseline model failed to detect the pedestrian targets, while in Figure 5b, the pedestrian targets can be reliably detected.
Figure 6 shows the pedestrian detection results using two models (the dual-branch baseline and the model in this paper) under the condition where there is interference from the heat source of motor vehicles on the non-motorized vehicle lane (infrared on the left and visible light on the right). Due to the interference of vehicle heat sources and the obstruction of trees, the baseline model led to incorrect judgment results for pedestrian targets. By comparison, it can be seen that using the model in this paper can effectively detect pedestrian targets under occlusion conditions and distinguish similar heat sources, which can alleviate the situation of false detections of targets during the detection process.
Figure 7 shows the pedestrian detection results using two models under the condition of common nighttime traffic scenes with the glare interference caused by the headlights of oncoming vehicles. In Figure 7a, due to the glare caused by the oncoming vehicle headlights and the small size of the pedestrian target, the baseline model is unable to detect the pedestrian target, while the model proposed in this paper can effectively detect the pedestrian targets in both infrared and visible light images.
Figure 8 shows the pedestrian detection results using two models in a common nighttime traffic scene with human–vehicle interactions and dense crowd conditions. In Figure 8a, only a part of the pedestrian features are shown in the lower left corner. The baseline model failed to detect this part of the pedestrian targets. Using the model presented in this paper can effectively detect the pedestrian target located at the lower left corner in both infrared and visible light images.
Figure 9 shows the pedestrian detection results using the two models in the daytime traffic scenario when pedestrians are crossing the road. In Figure 9a, the baseline model misjudges the nearest pedestrian in the image because the pedestrian's legs are occluded. By using the model described in this paper, pedestrian targets can be reliably detected in both infrared and visible light images, effectively avoiding incorrect judgment results.
In order to further verify the detection performance of the model proposed in this paper, it was compared and verified with several representative pedestrian detection algorithms that have utilized the fusion of visible light and infrared in recent years, as shown in Table 2 and Table 3.
From Table 2, it can be seen that compared with the methods of ICAFusion [3], ADCNet [20], CSM [21], and PCMFNet [22] on the LLVIP dataset, the accuracy of this method has increased by 8.9%, 9.1%, 7.7%, and 3.5% respectively, the recall rate has increased by a maximum of 11.8% and a minimum of 7.8%, and the mAP@0.5 has increased by a maximum of 12.4% and a minimum of 4.5%, indicating that this model is superior to other models.
From Table 3, it can be seen that the method in this paper has improved the accuracy on the Kaist dataset by 2.7%, 9.6%, and 3.7% compared to BPP-YOLO [7], LDI-YOLOv8 [23], and Dual-YOLO [24], respectively. The recall rate has increased by 1.5%, 5.4%, and 4% respectively. Meanwhile, the mAP@0.5 has increased by 2.7%, 5.1%, and 2.8% respectively. This comparison clearly shows that the improved model has better detection performance, enhancing the detection accuracy and robustness.
ICAFusion focuses solely on mid-scale features, ADCNet emphasizes misalignment correction, BPP-YOLO employs single-layer mid-level fusion, and LDI-YOLOv8 fuses only at the neck. In contrast, our method fuses across layers P3–P5, achieving full-scale coverage of targets of various sizes, more comprehensive feature complementarity, and greater robustness for multi-scale object detection. Likewise, compared with the fixed dual attention mechanism in CSM, the progressive concatenation in PCMFNet, the scale-weighted attention in BPP-YOLO, and the fixed structural weighting in Dual-YOLO, the DAFF module in this paper uses dynamic weights and learnable parameters to adaptively balance the contributions of the different modalities, better addressing modal differences in complex scenes.

4. Closing Remarks

In intelligent transportation systems and autonomous driving scenarios, environmental detection serves as a critical foundation for safe decisions and route planning. As key participants in road traffic, the accuracy of pedestrian detection directly impacts vehicle safety. This paper addresses the issues of missed detections and false detections when using only visible light images to achieve pedestrian detection in traffic scenarios such as low light, strong glare, occlusion, and night. To address these problems, a pedestrian detection method based on a dual-branch YOLOv8 network using both visible light and infrared images is proposed. The main conclusions and achievements are as follows: (1) A dual-modal fusion YOLOv8 network structure based on visible light and infrared was constructed, which processed both visible light and infrared images simultaneously. A modal-channel interaction block (MCI-Block) was introduced in the dual-branch backbone network to achieve complementary enhancement of visible light and infrared features in the spatial and channel dimensions, while suppressing redundancy and noise information. (2) During the fusion stage, a dynamic alignment feature fusion module (DAFF) was designed. This module incorporates learnable parameters to adaptively adjust the weights of the two modalities, thereby improving the quality of the fused features and enhancing the detection performance. (3) Based on the LLVIP and Kaist pedestrian detection datasets, the detection results of the original YOLOv8 model, the YOLOv8 + Dual-branch model, the Dual-branch model + MCI-Block, the Dual-branch + DAFF model, and the model in this paper were compared. The experimental results show that the model established in this paper has achieved varying degrees of improvement in terms of accuracy, recall rate, mAP@0.5 and other indicators compared to the original YOLOv8 model and other multimodal models as well as mainstream research methods on the LLVIP and Kaist datasets. 
In addition, the dual-branch baseline model and the proposed model were further verified in actual traffic scenarios. The results show that the visible light and infrared dual-branch YOLOv8 network constructed in this paper can reliably detect pedestrian targets under complex lighting conditions, effectively improving the robustness of pedestrian detection and further enhancing the perception capabilities of autonomous driving systems in complex environments.
Although the method described in this paper achieves some performance improvements in the pedestrian detection task, it still has certain limitations. For example, because it employs a dual-branch architecture to process infrared and visible light images separately, it incurs a higher computational cost compared to traditional single-modality detection methods. Future work will focus on improving real-time performance to enable deployment on edge devices and evaluating the method in more diverse real-world and autonomous driving scenarios.

Author Contributions

Conceptualization, Z.H. and X.C.; methodology, Z.H.; software, Z.H.; validation, Z.H. and X.C.; formal analysis, X.C.; investigation, Z.H.; resources, Z.H.; data curation, Z.H.; writing—original draft preparation, Z.H.; writing—review and editing, X.C.; visualization, Z.H. and X.C.; supervision, X.C.; project administration, X.C.; funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of China under Grant 62373175 and 2024 Fundamental Research Funding of the Educational Department of Liaoning Province LJZZ232410154016.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cao, Z.; Yang, H.; Zhao, J.; Guo, S.; Li, L. Attention fusion for one-stage multispectral pedestrian detection. Sensors 2021, 21, 4184. [Google Scholar] [CrossRef] [PubMed]
  2. Fang, Q.; Han, D.; Wang, Z. Cross-modality fusion transformer for multispectral object detection. arXiv 2022, arXiv:2111.00273v4. [Google Scholar] [CrossRef]
  3. Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection. Pattern Recognit. 2024, 145, 109913. [Google Scholar] [CrossRef]
  4. Althoupety, A.; Wang, L.Y.; Feng, W.C.; Rekabdar, B. Daff: Dual attentive feature fusion for multispectral pedestrian detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 2997–3006. [Google Scholar]
  5. Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217. [Google Scholar] [CrossRef]
  6. Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; IEEE: New York, NY, USA, 2020; pp. 276–280. [Google Scholar]
  7. Cui, J.L.; Wang, H.; Zheng, H.; Hu, Z.H. Multimodal Target Detection Algorithm Based on BPP-YOLOv8. Infrared Technol. 2025, 12, 1–9. Available online: https://link.cnki.net/urlid/53.1053.TN.20240830.1105.002 (accessed on 19 March 2026).
  8. Li, Z.; Li, X.; Niu, Y.; Rong, C.; Wang, Y. Infrared and Visible Light Fusion for Object Detection with Low-light Enhancement. In Proceedings of the 2024 IEEE 7th International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 27–29 September 2024; IEEE: New York, NY, USA, 2024; pp. 120–124. [Google Scholar]
  9. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83, 79–92. [Google Scholar] [CrossRef]
  10. Yan, C.; Zhang, H.; Li, X.; Yang, Y.; Yuan, D. Cross-modality complementary information fusion for multispectral pedestrian detection. Neural Comput. Appl. 2023, 35, 10361–10386. [Google Scholar] [CrossRef]
  11. Zhang, X.; Zhang, X.; Wang, J.; Ying, J.; Sheng, Z.; Yu, H.; Li, C.; Shen, H.L. TFDet: Target-aware fusion for RGB-T pedestrian detection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 13276–13290. [Google Scholar] [CrossRef] [PubMed]
  12. Li, Q.; Zhang, C.; Hu, Q.; Zhu, P.; Fu, H.; Chen, L. Stabilizing multispectral pedestrian detection with evidential hybrid fusion. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 3017–3029. [Google Scholar] [CrossRef]
  13. Zhao, R.; Zhang, Z.; Xu, Y.; Yao, Y.; Huang, Y.; Zhang, W.; Song, Z.; Chen, X.; Zhao, Y. Peddet: Adaptive spectral optimization for multimodal pedestrian detection. arXiv 2025, arXiv:2502.14063v2. [Google Scholar] [CrossRef]
  14. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  15. He, X.J.; Song, X.N. Improved YOLOv4-Tiny Lightweight Target Detection Algorithm. J. Front. Comput. Sci. Technol. 2024, 18, 138–150. [Google Scholar]
  16. Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Trans. Instrum. Meas. 2021, 70, 1–13. [Google Scholar] [CrossRef]
  17. Ram Prabhakar, K.; Sai Srikar, V.; Venkatesh Babu, R. Deepfuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4714–4722. [Google Scholar]
  18. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 3496–3504. [Google Scholar]
  19. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
  20. He, M.Z.; Wu, Q.B.; Ngan, K.N.; Jiang, F.; Meng, F.M.; Xu, L.F. Misaligned RGB-infrared object detection via adaptive dual-discrepancy calibration. Remote Sens. 2023, 15, 4887. [Google Scholar] [CrossRef]
  21. Fang, Q.Y.; Wang, Z.K. Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery. Pattern Recognit. 2022, 130, 108786. [Google Scholar] [CrossRef]
  22. Gao, Q.; Zhang, C.; Shi, R.; Zhang, Y. Cross-modal Progressive Fusion Method for UAV Target Detection. Unmanned Syst. Technol. 2024, 7, 54–64. [Google Scholar]
  23. Cui, J.L.; Wang, H.; Zheng, H.; Hu, Z.H. A fusion image detection method based on LDI-YOLOv8. Microelectron. Comput. 2025, 42, 61–70. [Google Scholar]
  24. Bao, C.; Cao, J.; Hao, Q.; Cheng, Y.; Ning, Y.; Zhao, T. Dual-YOLO architecture from infrared and visible images for object detection. Sensors 2023, 23, 2934. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Dual-branch backbone network.
Figure 2. Modal-channel interaction block.
Figure 3. Dynamic alignment feature fusion module.
Figure 4. Comparison of training for different models: (a) shows the iterative changes in accuracy during the training of different models; (b) shows the iterative changes in recall; (c) shows the iterative changes in average precision mAP.
Figure 5. Comparison of pedestrian detection results of two models in the non-motorized vehicle lane scenario under conditions of occlusion. (a) Baseline model detection results of infrared (left) and visible light (right) under occlusion conditions. (b) Detection results of infrared (left) and visible light (right) using our model under occlusion conditions.
Figure 6. Comparison of pedestrian detection results of two models in non-motorized vehicle lane scenarios under the interference of vehicle heat sources. (a) Detection results of infrared (left) and visible light (right) using baseline models under the interference of vehicle heat sources. (b) Detection results of infrared (left) and visible light (right) using our model under the interference of vehicle heat sources.
Figure 7. Comparison of pedestrian detection results of two models under the condition of glare caused by opposite lighting in common traffic. (a) Detection results of infrared (left) and visible light (right) using baseline models under light-induced glare conditions. (b) Detection results of infrared (left) and visible light (right) using our model under light-induced glare conditions.
Figure 8. Comparison of pedestrian detection results of two models with a large crowd present at night intersections. (a) Detection results of infrared (left) and visible light (right) using baseline models with a large crowd present at night intersections. (b) Detection results of infrared (left) and visible light (right) using our model with a large crowd present at night intersections.
Figure 9. Comparison of pedestrian detection results of two models in the scenario where pedestrians cross the road during the day. (a) Detection results of infrared (left) and visible light (right) using baseline models in the scenario where pedestrians cross the road. (b) Detection results of infrared (left) and visible light (right) using our model in the scenario where pedestrians cross the road.
Figure 10. Comparison of pedestrian detection results of two models in a nighttime environment with pedestrian targets. (a) Detection results of infrared (left) and visible light (right) using baseline models in a nighttime environment with pedestrian targets. (b) Detection results of infrared (left) and visible light (right) using our model in a nighttime environment with pedestrian targets.
Table 1. Comparison of detection results for different models and input types.
| Model | Input Type | P (LLVIP) | P (Kaist) | R (LLVIP) | R (Kaist) | mAP@0.5 (LLVIP) | mAP@0.5 (Kaist) |
|---|---|---|---|---|---|---|---|
| YOLOv8 | ir | 0.925 | 0.724 | 0.863 | 0.613 | 0.950 | 0.681 |
| YOLOv8 | vis | 0.857 | 0.714 | 0.763 | 0.462 | 0.840 | 0.532 |
| Dual-branch | ir + vis | 0.931 | 0.744 | 0.866 | 0.616 | 0.940 | 0.710 |
| Dual-branch + MCI-Block | ir + vis | 0.933 | 0.776 | 0.871 | 0.661 | 0.941 | 0.734 |
| Dual-branch + DAFF | ir + vis | 0.947 | 0.769 | 0.888 | 0.658 | 0.941 | 0.713 |
| Ours | ir + vis | 0.950 | 0.788 | 0.904 | 0.707 | 0.957 | 0.760 |
Table 2. Comparison results of model detection performance based on the LLVIP dataset.
| Methods | P | R | mAP@0.5 |
|---|---|---|---|
| ICAFusion | 0.861 | 0.793 | 0.866 |
| ADCNet | 0.859 | 0.786 | 0.833 |
| CSM | 0.873 | 0.788 | 0.845 |
| PCMFNet | 0.915 | 0.826 | 0.912 |
| Our method | 0.950 | 0.904 | 0.957 |
Table 3. Comparison results of model detection performance based on the Kaist dataset.
| Methods | P | R | mAP@0.5 |
|---|---|---|---|
| BPP-YOLOv8 | 0.761 | 0.692 | 0.733 |
| LDI-YOLOv8 | 0.692 | 0.653 | 0.709 |
| Dual-YOLO | 0.751 | 0.667 | 0.732 |
| Our method | 0.788 | 0.707 | 0.760 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

He, Z.; Chen, X. Research on Pedestrian Detection Method Based on Dual-Branch YOLOv8 Network of Visible Light and Infrared Images. World Electr. Veh. J. 2026, 17, 177. https://doi.org/10.3390/wevj17040177
