1. Introduction
Dual-modality image detection is a key research area in computer vision that enhances the performance and robustness of object detection systems by integrating information from two perceptual modalities [1]. Traditional single-modality detection methods rely solely on information from one modality, making them susceptible to variations in lighting, occlusion, and environmental complexity, which can lead to reduced detection accuracy or increased false positive rates [2,3]. In contrast, dual-modality image detection overcomes these limitations, improving detection capabilities in complex scenarios [4].
Infrared images provide stable thermal signals in low-light, nighttime, or adverse weather conditions. Existing infrared detection frameworks, such as the slow-fast tubelet (SFT) [5] and AIR-Net [6], have demonstrated significant advantages in low-light or dark environments by optimizing infrared data processing and feature fusion. In contrast, visible light images offer rich detail and color information under normal lighting conditions. For instance, the Depth Attention Enhancement Module (DAEM) and the RGB-Depth Fusion Module (RDFM) introduced in the MDFN model effectively extract fine texture details from visible light images [7], addressing issues such as insufficient depth information and noise interference and thereby greatly improving detection performance. The fusion of infrared and visible light modalities significantly enhances the robustness of object detection. However, achieving an effective fusion of these two modalities is a complex challenge that requires establishing meaningful correlations and complementary relationships between them [8]. Because of the inherent differences between the infrared and visible light modalities, the feature extraction and fusion process must overcome these disparities and fully leverage the unique strengths of each modality. Ensuring feature complementarity between the two modalities is therefore critical for successful fusion [9].
By thoroughly investigating the complementary nature of information between different sensing modalities, the model’s ability to capture richer and more diverse target features can be enhanced, leading to more accurate target recognition and classification. This approach improves detection accuracy and effectively reduces false positives and missed detections, equipping the model with greater adaptability to complex and dynamic real-world scenarios [10]. Furthermore, designing an appropriate network architecture and loss function is crucial to optimizing the model’s learning capacity and generalization performance, ensuring stable results across diverse environments [9].
Significant advancements have been made in dual-modal image object detection in recent years. For instance, FusionNet [11] utilizes Convolutional Neural Networks (CNNs) as a foundational feature extractor and feeds the fused feature maps into the object detection network to enhance detection performance. Dual-YOLO [12] introduces an information fusion module and a fusion shuffling module, enabling the network to complete infrared images using visible light features, thus improving detection accuracy and robustness. Guan et al. [13] extract features from infrared and visible light images separately using deep convolutional neural networks and integrate these features with a convolutional fusion module. Kim et al. [14] employed a joint training approach to further enhance overall detection performance. Zhu et al. [15] designed a Modal Interaction Module based on the Transformer architecture, which fuses features from RGB and thermal infrared images and incorporates a Query Location Module for precise object localization. Ye et al. [16] proposed the Cross-Modality Fusion Transformer, which uses Transformer modules to effectively integrate information from both modalities, thereby improving the performance of dual-modal drone object detection.
However, current dual-modal object detection methods still exhibit four main limitations. First, the precise control and interpretation of dual-modal feature fusion remain challenging, often resulting in either information redundancy or insufficiency. As noted by Gao et al. [17], although deep learning has advanced image fusion, the absence of well-designed loss functions can lead to suboptimal fusion performance. Ataman et al. [18] also point out that balancing detail preservation against the elimination of information redundancy across different spectral features remains problematic. This reflects the limitations of current methods in extracting and fusing both deep and shallow features, particularly in ensuring the sufficiency and effectiveness of the fused information. Second, the loss of deep information in complex scenes significantly affects detection accuracy, especially during the forward propagation of fusion networks. Zhao et al. [19] mitigate this issue by employing a dual-scale decomposition mechanism that processes low-frequency base information and high-frequency details separately, thereby reducing deep information loss. Similarly, Wang et al. [20] enhance the retention of background details and highlight infrared features by introducing multi-scale information through an improved generative adversarial network. Third, extracting fine-grained features and detecting multi-scale objects in complex backgrounds remain difficult. Liu et al. [21] emphasize the advantages of multi-scale fusion strategies in addressing occlusion in complex scenes, while Bao et al. [22] highlight the critical role of multi-scale processing in dealing with intricate backgrounds. Lastly, the high parameter count of fusion networks results in large model sizes and increased computational costs, negatively impacting real-time performance. Nousias et al. [23] explore model compression and acceleration techniques to reduce computational burdens and meet real-time inference requirements. Additionally, Poeppel et al. [24] stress that optimizing computational resource consumption and enhancing inference speed are key to achieving efficient real-time detection.
We propose a real-time object detection method based on the dual-branch fusion of visible and infrared images to address the aforementioned challenges. This approach effectively overcomes common issues in feature fusion, such as feature disparity, information redundancy, and model optimization difficulties, enabling efficient and accurate real-time object detection. Experimental results demonstrate that the proposed dual-branch detection method significantly improves the detection accuracy of multi-scale objects in complex environments while maintaining a low parameter count. The main contributions of this paper are summarized as follows:
(1) Building on the outstanding performance of YOLOv8 [25] in real-time object detection, we developed a dual-branch network named IV-YOLO. One branch is dedicated to feature extraction from infrared images and the other to feature extraction from visible light images, and a small object detection layer is incorporated into the neck network. This design effectively addresses the background complexity and target occlusion inherent in single-modal image detection and enables the extraction of fine-grained features of small objects. IV-YOLO significantly improves detection accuracy for multi-scale objects in complex backgrounds while maintaining a low parameter count and high frame rate.
(2) We propose the Bi-Concat module, which employs a bidirectional pyramid structure for the weighted fusion of features extracted from infrared and visible light images at different levels. This approach effectively reduces redundant features and optimizes the fusion of features from the two modalities (a simplified sketch of this weighted fusion is given after this list).
(3) We designed the Shuffle-SPP structure, which uses Spatial Pyramid Pooling (SPP) to extract features at multiple scales, thereby enhancing the model’s ability to detect multi-scale objects. An efficient hierarchical aggregation mechanism then merges features from different levels, further improving feature representation, and the structure also emphasizes different parts of the features to enhance the precision of object details and localization. The multi-level feature fusion, shuffling, and pooling operations effectively minimize information loss and retain more critical details, improving overall network performance. A loss function is also introduced to enhance the focus on small objects and accelerate the network’s convergence.
(4) Our method achieves state-of-the-art results on the challenging KAIST multispectral pedestrian dataset [26] and the Drone Vehicle dataset [27]. Experiments on the FLIR multispectral object detection dataset [28] further validate the effectiveness and generalizability of the algorithm.
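For intuition, the weighted two-branch fusion described in contribution (2) can be sketched as a small PyTorch module. This is only an illustrative approximation based on the description above; the module name BiWeightedConcat, the normalized learnable weights, and the 1 × 1 projection are assumptions rather than the exact implementation detailed in Section 3.

```python
import torch
import torch.nn as nn

class BiWeightedConcat(nn.Module):
    """Illustrative weighted fusion of same-scale infrared and visible feature
    maps: each branch is scaled by a learnable, normalized weight before the
    concatenated result is projected back to the original channel count."""

    def __init__(self, channels: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))   # one weight per modality
        self.eps = 1e-4
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)           # normalize so the weights sum to ~1
        fused = torch.cat([w[0] * feat_ir, w[1] * feat_vis], dim=1)
        return self.act(self.proj(fused))

# Example: fusing two 256-channel, 40 x 40 feature maps from the two branches.
ir = torch.randn(1, 256, 40, 40)
vis = torch.randn(1, 256, 40, 40)
out = BiWeightedConcat(256)(ir, vis)           # -> shape (1, 256, 40, 40)
```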
The remainder of this paper is structured as follows: Section 2 describes related work pertinent to our network. In Section 3, we detail the network architecture and methodology. Section 4 presents our experimental details and results, comparing them with state-of-the-art networks to validate the effectiveness of our approach. In Section 5, we summarize the research content and experimental findings.
4. Results
In this section, we describe the implementation of the dual-branch infrared and visible light image detection network, including hardware and software configuration details. To validate the effectiveness of the proposed method, we conducted extensive experiments on several publicly available datasets, including the Drone Vehicle dataset, the KAIST dataset, and the FLIR pedestrian dataset. The experimental results demonstrate that our method performs exceptionally well on these datasets.
4.1. Dataset Introduction
4.1.1. Drone Vehicle Dataset
The Drone Vehicle dataset [27] is a publicly available resource specifically designed for drone-based object detection and classification tasks and is extensively used in traffic monitoring and intelligent transportation system research. Captured by drones, it provides high-resolution aerial images covering various traffic scenarios, including urban roads, highways, and parking lots. The images are meticulously annotated with positional information, bounding boxes, and class labels for vehicles such as cars, trucks, and buses. The dataset spans a range of environmental conditions from daytime to nighttime and includes both infrared and visible light images, totaling 15,532 image pairs (31,064 images in total) and 441,642 annotated instances. It also accounts for real-world challenges such as occlusion and scale variation.
4.1.2. FLIR Dataset
The FLIR dataset [28], released by FLIR Systems, is a publicly available resource widely used for infrared image object detection and pedestrian detection research. The dataset primarily consists of infrared images accompanied by corresponding visible light images, facilitating the exploration of multimodal fusion techniques. It covers a range of scenarios and environmental conditions, including daytime, nighttime, urban streets, and rural roads, to assess the robustness of detection algorithms under varying lighting conditions and complex backgrounds. The dataset includes over 10,000 pairs of 8-bit infrared images and 24-bit visible light images, encompassing targets such as people, vehicles, and bicycles. The resolution of the infrared images is 640 × 512 pixels, while the resolution of the visible light images ranges from 720 × 480 to 2048 × 1536 pixels.
4.1.3. KAIST Dataset
The KAIST dataset [26], released by the Korea Advanced Institute of Science and Technology, is a publicly available resource extensively used for multimodal object detection and tracking research. This dataset includes visible light and infrared images, facilitating the study of information fusion across different modalities to enhance detection performance. It spans various scenarios and environmental conditions, including daytime, nighttime, sunny, and rainy conditions, which helps evaluate the robustness of algorithms under diverse lighting and weather conditions. The dataset focuses on pedestrian detection and provides detailed annotation information, including pedestrians’ location, size, and category labels. It is divided into 12 subsets: sets 00 to 05 are used for training (sets 00 to 02 for daytime scenes and sets 03 to 05 for nighttime scenes), and sets 06 to 11 are used for testing (sets 06 to 08 for daytime scenes and sets 09 to 11 for nighttime scenes). The dataset comprises 95,328 image pairs at a resolution of 640 × 512 pixels, each pair consisting of a visible light image and an infrared image. The KAIST dataset encompasses a range of typical traffic scenarios, including campus, street, and rural environments, with 103,108 densely annotated objects.
In our experiments, we resized each visible image to 640 × 640.
Table 1 summarizes the dataset information used for training and testing.
4.2. Implementation Details
We implemented the code using the PyTorch (version 1.12.1) framework and conducted experiments on a workstation with an NVIDIA RTX 4090 GPU. Details of the experimental environment and parameter settings are provided in Table 2, while the hyperparameters for the datasets are listed in Table 3. To ensure training and testing accuracy, we maintained a consistent number of infrared and visible light images and performed data cleaning during the network training and testing phases.
During data augmentation, each image had a 50% chance of undergoing random horizontal flipping. Additionally, we employed mosaic operations to stitch multiple images into a single composite, enhancing the complexity and diversity of the training samples to improve the model’s adaptability to different scenes and viewpoints. The entire network was optimized using the AdamW optimizer, combined with weight decay, and trained for 300 epochs. The learning rate was set to 0.001, with a batch size of 32, weight decay of 0.0005, and momentum warm-up set to 0.8. These hyperparameters were chosen based on the specific challenges and characteristics of the datasets: a lower learning rate facilitates more precise model parameter adjustments for high-resolution images, weight decay reduces the risk of overfitting, a batch size of 32 balances memory usage and training stability, and momentum warm-up accelerates convergence in the early stages of training. The AdamW optimizer, which integrates the advantages of the Adam optimizer while effectively mitigating overfitting through weight decay, is particularly well-suited for handling complex multimodal datasets such as Drone Vehicle, FLIR, and KAIST, thereby providing more stable and efficient training and enhancing detection performance and accuracy.
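As a concrete illustration, the optimizer and augmentation settings listed above can be written as a short PyTorch-style configuration. This is only a sketch of the stated hyperparameters (learning rate 0.001, weight decay 0.0005, batch size 32, 300 epochs, 50% horizontal flip, momentum warm-up 0.8); the placeholder model and the handling of mosaic augmentation are assumptions, since those details live in the training pipeline itself.

```python
import torch
from torch.optim import AdamW
from torchvision import transforms

# Augmentation described above: 50% random horizontal flip. The mosaic
# augmentation that stitches several images together is assumed to be
# applied inside the detection dataloader rather than here.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

model = torch.nn.Conv2d(3, 16, 3)   # placeholder standing in for IV-YOLO
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)

epochs = 300
batch_size = 32
warmup_momentum = 0.8               # momentum warm-up used in the early epochs
```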
4.3. Evaluation Metrics
In this study, we evaluate the detection performance of the network using Precision, Recall, and mean Average Precision (mAP). Additionally, we assess the network’s efficiency by considering the number of parameters and Frames Per Second (FPS).
Precision and Recall are primarily used during the experiments to measure the network’s performance, with the calculations shown in Equations (14) and (15).
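With TP, FP, and FN as defined in the following paragraph, Equations (14) and (15) take their standard forms:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{14} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{15} \]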
Specifically, TP (True Positive) represents the number of correctly predicted positive samples, FP (False Positive) denotes the number of incorrectly predicted positive samples, and FN (False Negative) refers to the number of positive samples that were incorrectly predicted as negative. Average Precision (AP) is the area under the Precision–Recall curve, and the closer the AP value is to 1, the better the detection performance of the algorithm.
Mean Average Precision (mAP) is the average of the AP values across all classes, offering a balanced evaluation by combining Precision and Recall. In multi-class detection tasks, mAP is particularly important because it ensures good performance across all classes. Moreover, mAP is robust to class imbalance, making it widely used in evaluating multi-class object detection tasks. It is also a key metric in our experiments to assess detection accuracy. The formula for calculating mAP is as follows:
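\[ \text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \]

where N is the number of object classes and AP_i denotes the Average Precision of class i.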
The number of parameters and FPS are used to evaluate the efficiency of the model. The number of parameters refers to the total count of all learnable parameters, including weights and biases, while FPS measures how many image frames the network can process per second, an essential real-time performance indicator.
4.4. Analysis of Results
This section evaluates the IV-YOLO network on the three test datasets and compares its detection results with those of state-of-the-art methods.
4.4.1. Experiments on the Drone Vehicle Dataset
We conducted a series of experiments on the Drone Vehicle dataset to verify the proposed dual-branch object detection method’s capability in detecting small objects in complex environments. During the experiments, the dataset was preprocessed by cropping 100 pixels from each edge of the images to remove the white borders, resulting in images of size 640 × 512. The detection head was also replaced with an oriented bounding box (OBB) detection head.
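This border-removal step can be illustrated with a small Python snippet. The 840 × 712 input size is inferred from the stated numbers (removing 100 pixels from every edge yields 640 × 512), and the function name and file path are purely illustrative.

```python
from PIL import Image

def strip_border(path: str, margin: int = 100) -> Image.Image:
    """Crop `margin` pixels from every edge to remove the white border.
    For an 840 x 712 Drone Vehicle frame this yields the 640 x 512 image
    used in our experiments."""
    img = Image.open(path)
    w, h = img.size
    return img.crop((margin, margin, w - margin, h - margin))

# Hypothetical usage (the file name is illustrative):
# cropped = strip_border("drone_vehicle/val/00001_ir.jpg")
# print(cropped.size)   # -> (640, 512)
```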
In the Drone Vehicle dataset, the shapes of freight cars and vans are quite similar, and many existing detection methods often omit these two categories to avoid fine-grained classification errors. However, our experiments used the complete Drone Vehicle dataset to assess the network’s ability to extract and fuse fine-grained dual-modal features.
Table 4 presents a comparative analysis of our method’s performance against other networks.
Since many detection models operate on single-modal images, we evaluated these networks on both visible and infrared images. As shown in Table 4, YOLOv8 demonstrates outstanding performance in detection accuracy and speed in single-modal scenarios, so we chose it as the foundational framework for the IV-YOLO algorithm. Through our improvements, IV-YOLO achieved an accuracy of 74.6% on the Drone Vehicle dataset, outperforming all other networks. Notably, in fine-grained feature extraction and fusion, IV-YOLO achieved detection accuracies of 63.1% for freight cars and 53% for vans, significantly surpassing the other networks.
The results in Table 4 indicate that IV-YOLO effectively integrates dual-modal features to enable robust detection in complex environments and improves the accuracy of detecting visually similar objects through dual-branch feature fusion. However, the emphasis on fine-grained features led to a decrease in detection performance for visually similar categories. Additionally, in small object detection the network focuses more on extracting low-level features to capture fine details, since small objects require higher-resolution feature maps and precise local information. This focus may weaken performance on larger objects, which need a broader receptive field to capture the global context; an overemphasis on details can reduce the utilization of high-level semantic information and thus affect the detection of larger objects.
Figure 5 illustrates the visualized detection results on the Drone Vehicle dataset. The images are divided into six groups, each containing three rows of images. The visible light detection results are on the left side of each group, while the right side shows the infrared image detection results. The first row demonstrates that the IV-YOLO network is capable of robust target detection even in low-light conditions or when soft occlusion is present. The second row highlights the network’s ability to effectively extract fine-grained features, successfully distinguishing between visually similar objects. The third row shows the network’s robustness when processing scenes with dense targets.
4.4.2. Experiments on the FLIR Dataset
We conducted a series of experiments on the FLIR dataset to validate the effectiveness of the proposed method. We compared the performance of several detection algorithms, including SSD, YOLOv9, YOLOv10, and YOLOF, and evaluated them against the Dual-YOLO network, which has demonstrated outstanding performance on the FLIR dataset. The detailed results of these experiments are presented in Table 5.
As shown in Table 5, our network achieves the highest mAP value on this dataset, outperforming the other methods. Integrating multi-scale feature fusion and the triple upsampling operations in the neck significantly enhances our network’s ability to extract features from small objects, and the results show a noticeable improvement in Precision for hard-to-extract small objects such as bicycles. However, due to a slight reduction in global feature capture, the detection performance for larger objects, such as cars and pedestrians, is marginally lower than that of the Dual-YOLO network. Overall, the mAP results demonstrate that our network effectively extracts multi-scale features, particularly capturing fine details at smaller receptive field levels, and thus improves the detection of small targets. In addition, through the fusion module, our network effectively extracts and integrates both the shared features of the two modalities and their unique characteristics; the weighted fusion mechanism enables mutual enhancement and compensation between the modalities, leading to superior detection performance.
Figure 6 illustrates the visualization of object detection results on the FLIR dataset. The first row of images demonstrates that, regardless of changes in background lighting, our network accurately locates and detects objects by integrating features from both modalities. The second row shows that the network effectively uses the dual-branch structure to extract fine-grained features of objects at different scales. Combined with a specially designed loss function, it successfully detects each object, even in cases of occlusion and high target density. The third row highlights the network’s advantage in multi-scale feature extraction, particularly at smaller receptive field levels, where it captures more subtle features, thus significantly enhancing the detection of small targets.
4.4.3. Experiments Based on the KAIST Dataset
To further validate the effectiveness and robustness of the proposed IV-YOLO algorithm, we conducted experiments on the challenging KAIST dataset. Given the numerous blank and misaligned images in the KAIST dataset, we first performed data cleaning and then ran the object detection task on the processed dataset. Our results, compared against several popular methods, are presented in Table 6.
As shown in Table 6, our proposed network significantly outperforms the first three methods on the KAIST dataset, which exhibit notably lower accuracy. This discrepancy arises primarily because their training on the KAIST dataset did not include data preprocessing, so blank images and misaligned labels degraded their results. After preprocessing the dataset, we compared our network with the state-of-the-art Dual-YOLO and PearlGAN methods. The results, detailed in the last three rows of Table 6, demonstrate that our method excels in Precision, Recall, and mAP.
Figure 7 presents the visualized test results of our network on the KAIST dataset. The figure includes six groups, totaling twelve images. The first row illustrates the strong robustness of our network under varying scales and lighting conditions. The second row highlights the IV-YOLO network’s ability to accurately detect pedestrians on the street despite overlap and occlusion. The third row demonstrates that, through feature fusion, our network effectively identifies targets even in complex backgrounds.
4.5. Parameter Analysis
In this study, we evaluated the proposed model’s parameter efficiency and inference speed on the Drone Vehicle and FLIR datasets. All tests were conducted on an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), with model parameters measured in megabytes (MB) at 16-bit precision to ensure the accuracy of the experimental results. As shown in Table 7, where green indicates the optimal result on the corresponding dataset, the IV-YOLO network maintains high detection accuracy while keeping the parameter count low. Consequently, the model can be flexibly deployed on resource-constrained devices without compromising performance. When running on the NVIDIA RTX 3090, IV-YOLO achieved a real-time processing speed of up to 203 frames per second (FPS). This high FPS highlights the model’s ability to perform fast and accurate object detection with high-resolution inputs, making it particularly suitable for applications where both speed and quality are critical.
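For reference, throughput of this kind is typically measured with a simple timing loop such as the sketch below; the input resolution, iteration count, and warm-up length are placeholder choices rather than the exact benchmarking protocol used for the 203 FPS figure.

```python
import time
import torch

def measure_fps(model: torch.nn.Module, input_shape=(1, 3, 640, 640),
                iters: int = 200, device: str = "cuda") -> float:
    """Rough FPS estimate: average FP16 forward-pass time over `iters` runs."""
    model = model.to(device).half().eval()
    x = torch.randn(*input_shape, device=device, dtype=torch.half)
    with torch.no_grad():
        for _ in range(10):            # warm-up passes
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return iters / (time.time() - start)
```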
4.6. Ablation Study
We designed and conducted a series of ablation experiments to analyze the individual contributions of each module and component in the proposed model. By systematically removing key components from the network, we evaluated their impact on overall performance, thus clarifying the role of each module within the model. The experiments employed mAP@0.5 to assess model accuracy and mAP@0.5:0.95 as a comprehensive evaluation metric to eliminate the influence of varying IoU threshold settings on the results, ensuring the completeness and fairness of the performance assessment.
First, we conducted ablation experiments on the feature fusion module. The Shuffle-SPP component, designed to enhance the internal correlation between deep features and improve the model’s ability to recognize and localize objects, was initially removed. After re-running the experiments on the standard dataset without this component, the model’s mAP dropped by 1.5%. This result demonstrates that including Shuffle-SPP effectively improves the extraction of deep features.
Next, a similar ablation experiment was performed on the Bi-Fusion module. Compared with removing Shuffle-SPP, eliminating Bi-Fusion had a more significant impact on overall model performance, with the mAP decreasing to 82.9%. This marked decline indicates that the Bi-Fusion structure is critical in effectively merging features from visible and infrared images and is essential for improving the model’s performance. The results of the ablation experiments for the feature fusion module are presented in Table 8, where “✓” indicates that the network includes the module and “×” denotes that it is not utilized.
We also conducted experiments by removing an upsampling layer and a concatenation structure from the network. These components were specifically optimized for small object detection. Removing them resulted in significant performance fluctuations in small object detection and a noticeable decline in the model’s ability to detect high-similarity objects. To fully assess the contribution of these additional structures, we removed all three components and analyzed their combined effect on model performance. The final ablation results are illustrated in Figure 8 and Figure 9. In these figures, ‘A’ represents our IV-YOLO network, ‘N’ denotes the Shuffle-SPP structure, ‘F’ indicates the Bi-Fusion structure, and ‘E’ refers to the network optimized for small object detection.
These ablation experiments reveal the crucial roles of each module and component within the model. By examining the impact of each component individually, we validated their importance for specific tasks, further demonstrating the rationality and effectiveness of the model’s design. The experimental results indicate that these modules exhibit significant complementarity across different tasks and scenarios. The combined use of these modules maximizes the model’s performance, while the absence of any single module markedly undermines the overall effectiveness of the model.
5. Conclusions
This paper presents a dual-branch object detection network based on YOLOv8, named IV-YOLO, which integrates infrared and visible light images. The network is designed to effectively extract features from both modalities during the dual-branch feature extraction process and to perform object detection at multiple scales. Additionally, we developed a Bidirectional Pyramid Feature Fusion structure (Bi-Fusion) that integrates features from different modalities across multiple scales using weighted parameters, which effectively reduces the number of parameters, avoids feature redundancy, and enhances the network’s fusion performance. To further strengthen the network’s expressive capability, we introduced the Shuffle Attention Spatial Pyramid Pooling structure (Shuffle-SPP), which captures global contextual information across the channel and spatial dimensions, enabling the network to focus more accurately on the semantic and positional information of objects within deep features and significantly improving the model’s representational power. Finally, by incorporating the PIoU v2 loss function, we accelerated the convergence of detection boxes for targets of varying sizes, improved the fitting precision for small targets, and sped up the overall convergence of the network. Experimental results demonstrate that IV-YOLO achieves a mean Average Precision (mAP) of 74.6% on the Drone Vehicle dataset, 75.4% on the KAIST dataset, and 85.6% on the FLIR dataset. These results indicate that IV-YOLO excels in feature extraction, fusion, and object detection for infrared and visible light images. Furthermore, the network’s parameter count is only 4.31 M, significantly lower than that of comparable networks, showcasing its potential for deployment on mobile devices.
Despite its excellent performance in dual-modal object detection tasks, IV-YOLO has some limitations: (1) it requires further optimization of speed and memory usage before hardware deployment; (2) there is limited availability of dual-modal datasets with diverse scenes and targets, necessitating further research and validation of the method’s effectiveness; and (3) while the network enhances detection of small targets, future improvements are needed to strengthen its global perception capability. Although single-modal detection performs well for certain specific tasks, its limitations become evident in complex real-world applications. In contrast, dual-modal detection leverages the advantages of multiple data sources to provide richer feature information, significantly improving object detection accuracy and robustness. Therefore, dual-modal detection technology is undoubtedly a key development direction for future object detection research, offering strong support for addressing more complex detection tasks.