1. Introduction
In recent years, visible light remote sensing images have been widely used for object detection in automatic driving, intelligent medical care, urban planning, and military reconnaissance. However, several challenges remain, such as susceptibility to weather and lighting conditions, occlusion, and limited spectral information. Clouds and haze can scatter and absorb light, reducing the image quality and target visibility. Variations in lighting conditions, especially at night or under poor lighting, degrade image quality and make target details difficult to discern. Obstructions in the line of sight also hinder target identification. Moreover, a narrow electromagnetic band limits the ability of visible light to provide rich spectral information for accurate classification and recognition. These challenges highlight the need for algorithms with higher detection accuracy and robustness, particularly given the limited improvement in the resolution of visible light remote sensing images.
The majority of algorithms employed for target detection in visible light scenes can be broadly categorized into two types based on the detection steps involved, namely two-stage and single-stage detection algorithms. Two-stage detection algorithms first generate candidate regions and then perform fine-grained target detection; representative algorithms include Fast R-CNN [1] and Faster R-CNN [2]. Single-stage algorithms do not need candidate regions and directly predict the location and class of the target; representative algorithms include task-aligned one-stage object detection (TOOD) [3] and the you only look once (YOLO) series [4]. In addition, there has been related research on detection algorithms [5] based on the Transformer architecture [6]. The effectiveness of these algorithms in visible image target detection has motivated researchers to continue to focus on and improve them.
To further enhance the performance of object detection in visible light scenes, researchers have focused on various aspects of image processing, feature augmentation, backbone networks, and lightweight models. For example, a framework for more realistic remote sensing image super-resolution combined a joint nonlocal tensor model with Bayesian tensor factorization [7]. Subsequently, a model that leveraged super-resolution information to enhance target detection performance was constructed by merging target detection with super-resolved remote sensing images [8]. Cheng et al. focused on enhanced feature representation by proposing a dual attention feature enhancement (DAFE) module and a context feature enhancement (CFE) module, which were applied to Faster R-CNN using the publicly available DIOR and DOTA datasets [9]. Additionally, a more lightweight algorithm, LSKNet, leveraged a priori knowledge to extract contextual information using large convolutional kernels and a spatial selection mechanism as the backbone, yielding outstanding results on multiple datasets [10]. The similarly lightweight LVGG demonstrated good classification accuracy across various datasets [11]. It is also noteworthy that there has been some progress in the use of human–computer interaction in the realm of remote sensing imaging [12,13].
In addition to visible light images, target detection in infrared images has also received widespread attention. Based on the medical image segmentation network U-Net [14], Wu et al. developed UIU-Net by nesting U-Nets of different sizes, achieving state-of-the-art performance in infrared small target detection [15]. However, such algorithms inevitably have a large number of parameters and lengthy training times, so there is considerable interest in single-stage algorithms that offer faster detection. For instance, to address the elevated false alarm rate in infrared image target detection, YOLO-FIRI incorporates a shallow cross-stage partial connectivity module and an enhanced attention module, significantly improving the algorithm's accuracy on public datasets [16]. GT-YOLO improved upon the YOLO algorithm by fusing an attention mechanism, introducing SPD-Conv, which is better suited to small targets and low resolution, and incorporating Soft-NMS [17]; this algorithm was demonstrated to be effective on an infrared nearshore ship dataset. MGSFA-Net was developed considering the importance of the scattering features of SAR ships, combining segmentation and k-nearest neighbor algorithms to obtain local graph features [18]. Combinations of the YOLO and vision Transformer architectures have also produced positive outcomes [19].
It is evident from the research above that, when used separately, each of the two modalities has benefits and drawbacks. Visible light images offer higher resolution and rich detail and color information, which benefits target detection in remote sensing images, but their quality depends on illumination, so detection performance deteriorates under low light and other adverse external conditions. In contrast, infrared images excel in detecting thermal targets with high pixel values and can penetrate obstacles (e.g., detecting pedestrians obscured by stone pillars), even at night. However, infrared images typically exhibit low spatial resolution, resulting in unclear target details and edges. Furthermore, the varied and complex shapes and sizes of targets significantly increase the detection difficulty, and the low contrast between targets and backgrounds further diminishes visibility. Therefore, numerous studies have been conducted on fusing these modalities to obtain more information, aiding accurate identification and enabling all-weather operation.
Existing research divides inter-modal fusion into three categories: CNN-based, Transformer-based, and GAN-based methods. By level, fusion comprises pixel-level, feature-level, and decision-level approaches. The YOLOrs algorithm uses a new mid-level fusion architecture adapted to multimodal aerial images, which not only supports real-time target detection but also predicts the target orientation [20]. However, decision-level fusion increases the computational complexity of the model, since it runs the two modalities through the complete network structure individually before fusing them. SuperYOLO fuses images at the pixel level in front of the backbone network, with an encoder and decoder used as an auxiliary branch for super-resolving features of different classes [21]. These algorithms have performed at an advanced level on publicly available datasets. In addition to direct pixel-level fusion, dynamic and adaptive fusion have also been employed to increase multimodal detection performance [22,23]. Additionally, the PIAFusion network utilizes a cross-modal differential sensing module for fusion and considers the impact of illumination from various angles, emphasizing its significance in fusion algorithm design [24].
The CFT model utilizes a Transformer encoder for cross-modality fusion, achieving a high degree of intra-modal and inter-modal fusion, with significant advantages in detection on multiple datasets [25]. C2Former, which also utilizes a Transformer structure, addressed errors between the modalities and the lack of precision in the fusion while reducing dimensionality, yielding robust detection results [26]. A two-stage training strategy was also developed for fusion, training the same autoencoder structure on data from different modalities and then training a newly proposed cross-attention strategy and decoder to generate a combined image, with experimental findings affirming the method's efficacy [27]. The merging of multiscale and multimodal data has likewise demonstrated good performance in remote sensing [28]. To improve fusion quality, one study segmented the images of both modalities into a common region with feature decomposition [29]. ASFusion also involves decomposition, but it decomposes structural blocks, and the visible light image is adaptively and visually enhanced prior to decomposition [30]. In addition, generative adversarial networks (GANs) have received attention from researchers in the field of multimodal fusion [31]. The algorithms FusionGAN [32], MHW-GAN [33], and DDcGAN [34] employ a generator and a discriminator to make the fusion procedure more sophisticated.
Although infrared images are not as useful as visible images with respect to color, texture, and other features, they are very advantageous for detection in bad weather. Previous studies have shown that fusing the two modalities can yield comprehensive features and improve detection performance compared with using a single modality. Therefore, this paper proposes a new detection method, attentive and cross-differential fusion (ACDF)-YOLO, based on the fusion of visible and infrared images. ACDF-YOLO addresses the issues of inadequate multimodal fusion and poor target detection. The key contributions of this paper can be summarized as follows:
(1) Building on the YOLOv5 network configuration, we propose ACDF-YOLO, an algorithm capable of fusing visible and infrared images for detection. Combining the two modalities not only yields more detailed feature information but also addresses the restricted detection precision caused by inadequate features. The module was verified through comparison experiments with current advanced algorithms and through ablation experiments, and the overall detection performance after fusing the modalities was higher than that of the other algorithms.
(2) We propose an attention-based fusion module named efficient shuffle attention (ESA). Depthwise separable convolution and channel shuffle modules are added to the fusion process of infrared and visible images to obtain more fully fused features and to focus, under attention, on the correlation and importance between modalities, increasing the benefits of fusion. With this module, detection reached 77.11% mAP0.5 and 74.34% recall on the VEDAI dataset.
(3) We propose a cross-modal difference module (CDM) to represent the differences between different modalities, capturing the correlations and complementary features between modalities. This module also performs a de-redundancy operation on the information shared by the two modalities when they are fused, which provides richer and more accurate information for the subsequent detection. Based on differential fusion, the experimental mAP0.5 value increased to 78.1%, demonstrating the module’s usefulness.
The remainder of this paper is organized as follows: In Section 2, we discuss the related work. Section 3 describes the proposed ACDF module and the overall network structure. Sections 4 and 5 provide further details of the experiments to demonstrate the superiority of the algorithm from an experimental point of view. Section 6 summarizes this research.
3. Methodology
3.1. Overall Structure
YOLO-DCAM, for example, recognizes individual trees in UAV RGB images with rich edges and realistic tree characteristics, demonstrating the benefit of visible light and the efficiency of detection [42]. However, for all-weather detection, a visible light image cannot capture the edge information of individual trees in the middle of the night, or even reveal hidden individual trees in the image, reducing the overall detection effectiveness. In contrast, an infrared image can still perform well in highly exposed, unevenly lit, and dark scenes within a dataset, but it lacks rich information such as color, leading to incomplete features [43].
The architecture of the ACDF-YOLO network designed in this study is shown in Figure 1, comprising a multimodal target detection model that is both efficient and accurate. Based on the requirement for real-time performance and the consideration of a lightweight network structure, we employ the widely used and high-performance YOLO algorithm as our basic framework. Among the many YOLO versions, we adopt YOLOv5 as the backbone of our model because it has a small number of parameters. To avoid adding repetitive parameters, we do not use decision-level fusion in our fusion strategy but instead choose intermediate-layer fusion, which does not require additional parameters and is more effective than the direct addition used in previous work. At the same time, we add residual connections in the ACDF structure to reduce information loss during the convolution process and thereby fuse the two modalities of data.
The input layer, fusion stage, backbone, and detection head are the four main parts of ACDF-YOLO. To streamline the presentation and concentrate on the key parts of the structure, we omit many convolutional details in Figure 1. In the design of the fusion stage, we specifically considered the handling of the different modalities. Traditional fusion strategies tend to fuse the modalities after their respective backbone networks, or in the neck or even the head of the detector, increasing the number of model parameters while providing a limited improvement in detection accuracy. For this reason, we fuse the two modalities before the backbone network, avoiding an unnecessary increase in network branches.
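To make this placement concrete, the following minimal PyTorch sketch shows how an intermediate-layer fusion block can sit in front of a single shared backbone. The class name MidLevelFusionDetector and the injected fusion, backbone, and head modules are illustrative placeholders, not the released implementation.

```python
import torch
import torch.nn as nn

class MidLevelFusionDetector(nn.Module):
    """Sketch of pre-backbone (intermediate-layer) fusion: the visible and
    infrared inputs are merged by a fusion block before a single shared
    backbone, so no duplicate backbone or neck branch is required."""

    def __init__(self, fusion: nn.Module, backbone: nn.Module, head: nn.Module):
        super().__init__()
        self.fusion = fusion      # e.g., an ACDF-style block (sketched in Section 3.1)
        self.backbone = backbone  # YOLOv5-style CSP backbone (placeholder)
        self.head = head          # FPN/PAN neck plus detection head (placeholder)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        fused = self.fusion(rgb, ir)          # (B, C, H, W) fused feature map
        return self.head(self.backbone(fused))
```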
In the fusion phase, we use the designed ACDF module, which amalgamates two principal segments: an attention section and a CDM. After the visible and infrared images are input, they are preprocessed by the ESA mechanism, then multiplied element-by-element with the result of the convolutional processing, followed by residual concatenation to preserve the preliminary information. Subsequently, this information is fed into the CDM to further emphasize inter-modal differences, and the attention given to local features is enhanced by performing concatenation (stitching) operations in combination with the ECA mechanism. Through this series of operations, we obtain deeply fused and improved features. The entire procedure can be depicted using Equations (1)–(3), where Fv denotes the features obtained from the visible image after branching, Fi denotes the features of the infrared image after branching, and Ffuse denotes the fused features.
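As one hedged reading of this description and of Equations (1)–(3), the fusion stage could be composed as below. Whether the ESA weights are shared across branches, the exact residual form, and the final 1×1 channel-reduction convolution are assumptions; the ESA and CDM submodules are sketched in Sections 3.2 and 3.3.

```python
import torch
import torch.nn as nn

class ACDFBlock(nn.Module):
    """Hypothetical ACDF composition: per-branch ESA gating of convolved
    features with a residual connection, a cross-modal difference step, then
    ECA-style channel attention over the concatenated ("stitched") branches."""

    def __init__(self, channels: int, esa_v: nn.Module, esa_i: nn.Module,
                 cdm: nn.Module, eca: nn.Module):
        super().__init__()
        self.conv_v = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_i = nn.Conv2d(channels, channels, 3, padding=1)
        self.esa_v, self.esa_i = esa_v, esa_i   # one ESA per modality branch (assumed)
        self.cdm, self.eca = cdm, eca
        self.reduce = nn.Conv2d(2 * channels, channels, 1)  # assumed channel reduction

    def forward(self, fv: torch.Tensor, fi: torch.Tensor) -> torch.Tensor:
        # ESA output gates the convolved features; the residual keeps the raw input.
        fv = fv + self.esa_v(fv) * self.conv_v(fv)
        fi = fi + self.esa_i(fi) * self.conv_i(fi)
        # The cross-modal difference module refines both branches jointly.
        fv, fi = self.cdm(fv, fi)
        # Concatenate, apply channel attention, and reduce back to C channels.
        return self.reduce(self.eca(torch.cat([fv, fi], dim=1)))
```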
The backbone adopts the structure of YOLOv5, a choice based on the advantages of its cross-stage partial connectivity structure, which reduces the amount of computation and the number of parameters, without sacrificing the capacity to effectively extract features. Furthermore, the inclusion of a spatial pyramid pooling module improves the model’s multi-scale target detection performance by boosting the model’s capacity to recognize targets of various sizes. In the detection head part, we utilize a feature pyramid and path aggregation network structure to fuse low-level features with high-level semantic information, which further enhances the context awareness of the model, thus strengthening its reliability and precision in target detection.
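For reference, a simplified YOLOv5-style SPPF block (the common community form, with batch normalization and SiLU activations omitted for brevity; not code from this paper) illustrates how cascaded pooling aggregates multi-scale context cheaply.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Simplified spatial pyramid pooling (fast): three cascaded max-pool layers
    with the same kernel are concatenated with the input, giving an effective
    multi-scale receptive field at low cost."""

    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        return self.cv2(torch.cat([x, y1, y2, self.pool(y2)], dim=1))
```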
3.2. Attention Module
Various attention mechanisms have been explored in the investigation of multimodal fusion, considering both ease of operation and effectiveness. Prior to feature fusion, we incorporate our ESA mechanism into the respective branches of the two modalities. We note the advantage of ECA in improving the model's capability to represent the channel dimension: ECA empowers the model to concentrate on key channel characteristics by adaptively weighting the channels, enhancing both its representational efficiency and performance. Moreover, we considered that the ECA mechanism, although lightweight, requires some additional parameters to learn the attention weights between channels. Therefore, we modified it to obtain the ESA mechanism.
In ShuffleNet, effective feature extraction is possible because of the depthwise separable convolution and channel shuffle functions. Depthwise separable convolution reduces the computational requirements and performs feature extraction in the channel dimension, whereas the channel shuffle operation enhances the interaction between features, breaks the information isolation between channels, and improves the diversity and richness of features. Meanwhile, channel shuffle can reduce redundant and unnecessary features through the operations of shuffling and compressing channels, which improves the efficiency, helps generalize the model, and reduces the demand for computational resources. In addition, ESA enhances the model’s expressive ability in the channel dimension, allowing the model to extract key features more effectively and learn more discriminative feature representations from complex data inputs. Furthermore, depthwise separable convolution decreases the amount of processing and number of model parameters. With the newly added branch, the features are processed in the spatial dimension, and the inter-channel interactions are continuously enhanced in the form of blending. Using the new ESA formed by the addition of this branch is expected to further increase the network’s effectiveness and performance, while minimizing the computational complexity and parameter count.
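The two ShuffleNet-derived ingredients can be written down in a few lines of PyTorch. This is a generic sketch of depthwise separable convolution and channel shuffle; the group count, kernel size, and absence of normalization layers are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle: interleave channels across groups so
    that grouped/depthwise convolutions can exchange information."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(b, c, h, w)

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel spatial filtering) followed by a pointwise
    1x1 conv, which cuts parameters and FLOPs versus a dense convolution."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))
```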
The ESA structure is illustrated in Figure 2. The input is received from the upper layer and contains the channel and spatial details of the image. The input is first subjected to global average pooling, which compresses the spatial information into global features at the channel level; the kernel size of the subsequent convolution is computed using a formula, and the channel feature map is obtained with this one-dimensional convolution kernel. At the same time, the input is adjusted through a branch of depthwise separable convolution and channel shuffle to improve the information flow between different channel groups, encourage inter-feature cross-learning, and magnify the model's representational abilities; a summation operation is then performed on the two acquired features to gain a richer feature representation. Finally, the fused features of the two branches are multiplied element-by-element to avoid information loss during this series of operations.
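Under the stated assumptions (the summed branches re-weight the input, and the 1D kernel size follows the usual ECA rule of thumb), an ESA-style block might look like the sketch below. It reuses channel_shuffle and DepthwiseSeparableConv from the previous sketch and is only an approximation of Figure 2, not the authors' released code.

```python
import math
import torch
import torch.nn as nn

class EfficientShuffleAttention(nn.Module):
    """ESA-style sketch: an ECA-like channel branch (global average pooling +
    1D conv with an adaptively sized kernel) is summed with a depthwise
    separable + channel-shuffle branch, and the result re-weights the input."""

    def __init__(self, channels: int, groups: int = 4, gamma: int = 2, b: int = 1):
        super().__init__()
        # ECA rule of thumb: odd kernel size near log2(C)/gamma + b/gamma.
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.dwsep = DepthwiseSeparableConv(channels)  # from the previous sketch
        self.groups = groups                           # channels must be divisible by groups
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        # Channel branch: (B, C, 1, 1) attention from pooled channel statistics.
        w = self.gap(x).view(n, 1, c)
        w = self.conv1d(w).view(n, c, 1, 1)
        # Spatial branch: depthwise separable conv followed by channel shuffle.
        s = channel_shuffle(self.dwsep(x), self.groups)
        # Sum the branches (broadcast over H and W), squash, re-weight the input.
        return x * self.act(w + s)
```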
3.3. Cross-Modal Difference Module
After enhancing the features of each modality in the two-branch process, we introduce the CDM to make further improvements. The CDM determines and enhances the differences between features, which is a unique advantage for multimodal data processing. It determines differences between modalities through a series of explicit computations and operations, thus enhancing the model's sensitivity to important features. The CDM focuses on extracting and enhancing the differences between different input features, which is important for tasks that rely on the differences between modalities. The difference module not only extracts features that differ across modalities but also extracts features that are common to them, thus enhancing the model's generalization capabilities when it is confronted with different data.
The CDM structure is shown in Figure 3. We denote the features of the visible and infrared images by Fv and Fi, respectively. Using this structure, we first perform an element-wise subtraction between the infrared and visible light features to highlight the dissimilarity between the two modalities. In this way, the model can recognize features that are more significant in one modality and less significant in the other. Immediately following this, global average pooling is performed on the difference in both the X and Y directions (i.e., horizontally and vertically), which helps to extract global contextual information at a macroscopic level and reduces the computational load while preserving the discriminative features. The outcomes in both directions are then further processed through a shared convolution to extract comprehensive global information and avoid overfitting. The results are then summed and nonlinearly transformed by an activation function, which refines the feature information. A subsequent multiplication operation avoids the loss of valid information and re-weights the initial disparity features of the modalities, thus highlighting important feature regions. The final cross-sum operation combines the initial modal information with the processed modal disparity information to form the final fused features, namely Fv′ and Fi′.
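A plausible sketch of this pipeline is given below, under several explicitly labeled assumptions: the subtraction order (visible minus infrared), the use of mean pooling for the directional X/Y pooling, a 1×1 shared convolution, and the final "cross-sum" implemented as adding the re-weighted difference back to both branches. Figure 3 may differ in these details.

```python
import torch
import torch.nn as nn

class CrossModalDifferenceModule(nn.Module):
    """CDM-style sketch: attend to the inter-modal difference via directional
    (X/Y) global pooling and a shared convolution, then add the re-weighted
    difference back to each branch to form Fv' and Fi'."""

    def __init__(self, channels: int):
        super().__init__()
        self.shared = nn.Conv2d(channels, channels, 1, bias=False)  # shared 1x1 conv
        self.act = nn.Sigmoid()

    def forward(self, fv: torch.Tensor, fi: torch.Tensor):
        diff = fv - fi                                    # modality disparity (order assumed)
        gx = diff.mean(dim=3, keepdim=True)               # pool along X -> (B, C, H, 1)
        gy = diff.mean(dim=2, keepdim=True)               # pool along Y -> (B, C, 1, W)
        w = self.act(self.shared(gx) + self.shared(gy))   # shared conv, broadcast sum, activation
        refined = diff * w                                # re-weight the disparity features
        fv_out = fv + refined                             # "cross-sum": original modality info
        fi_out = fi + refined                             # plus the refined difference (assumed wiring)
        return fv_out, fi_out
```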
Overall, this module serves the following purposes: enhancing modal differences, which helps to capture features that are particularly important to the task; extracting global contextual information, which provides the model with a broader perspective to understand the content of the image; enhancing feature representation, which enhances the model’s capacity for generalization by learning more intricate feature representations; highlighting important features, which improves the accuracy of detection; and retaining the integrated information, which is useful for combining the initial features with the processed features, so that the final fused feature signature contains the inter-modal difference information and retains the distinct information of the respective modalities.
5. Results
To confirm the effectiveness of ACDF-YOLO in multimodal fusion target detection, we carried out numerous experiments using the LLVIP and VEDAI datasets, comparing several versions of the YOLO model (including YOLOv3, YOLOv4, and YOLOv5s) with existing multimodal fusion models (YOLOrs, SuperYOLO).
The experimental results on the LLVIP dataset are shown in Table 3, which lists the values obtained by the different methods for the different modalities, including precision, recall, mAP0.5, and mAP0.5:0.95. Most of the algorithms achieved an increase in precision and recall with the use of modal fusion, but some algorithms exhibited precision values for the infrared and visible images that were similar before and after fusion. Overall, fusion appeared to provide significant performance gains over unimodal visible images. These results demonstrate the effectiveness of multimodal fusion on the LLVIP dataset, especially for improving the detection accuracy of visible images. However, the relatively small performance gains for infrared images may have been due to the nature of the infrared images themselves in the target detection task.
To visualize the result of multimodal fusion for the detection of single-category data in the LLVIP dataset, we compare the detection results of several algorithms in Figure 4. Here, we selected three representative images for detection. The image in the first row has more uniform illumination, the second row shows pedestrian targets under dim lighting, and there is overlap between targets in the last row. The first column represents the ground truth, and the other columns represent the results of each algorithm. For the first row, YOLOv3 detected an unlabeled pedestrian target, whereas YOLOv5s and YOLOrs mislabeled the two pedestrians next to the stone column as three pedestrians; YOLOv4, SuperYOLO, and our algorithm detected the targets with improved confidence. For the detection of pedestrians in the second row, YOLOv3, YOLOv4, and YOLOv5s incorrectly treated the tree branch in the lower right corner and the crosswalk as pedestrians, with confidence levels above 0.5, whereas YOLOrs, SuperYOLO, and our algorithm correctly labeled the targets without misdetections. This also shows that the direct splicing of the two modalities by the YOLO series algorithms was not as effective as the fusion methods used in YOLOrs, SuperYOLO, or our algorithm. For the last row, only our algorithm correctly detected the target, whereas the other algorithms produced omissions and misdetections. Our algorithm could detect overlapping targets because the CDM obtained the fusion information more reliably.
The experimental results for the VEDAI dataset are listed in Table 4, containing the percentages for precision, recall, and the average detection accuracy at different thresholds for each model. In most cases, the detection performance was higher for visible images than for infrared images, which may be a result of the rich color and texture information in the visible images. In addition, multimodal fusion improved mAP0.5 and mAP0.5:0.95 for almost all models, suggesting that the infrared and visible light information effectively complemented each other and that their combination improved the overall detection performance.
Our model performed well in multimodal detection for the Car, Pickup, Camping, Other, and Boat categories, ranking first among all the models. For mAP0.5 and mAP0.5:0.95, our model achieved 78.10% and 47.88%, respectively, outperforming the other models. By combining an ESA mechanism and CDM, our model not only exhibited high performance in a single modality but also exhibited a significant performance enhancement in multimodal fusion scenarios, especially for the detection precision and recall. The experimental results also emphasize the key role of multimodal fusion techniques in enhancing the target detection performance of remote sensing images.
To visualize the effectiveness of our model on the VEDAI dataset, we compare the detection results of our algorithm with those of the other algorithms in Figure 5a–c, corresponding to three different images, where each column shows the detection results of one algorithm. The YOLOv3, YOLOv4, YOLOv5s, and YOLOrs algorithms produced misdetections in Figure 5a, which are marked with red arrows. Among them, YOLOv4 and YOLOv5s also left targets unlabeled, which are marked with black arrows. Both SuperYOLO and our algorithm detected the targets correctly, but our algorithm had a significantly higher detection confidence for the Camping van. For Figure 5b, several algorithms missed the Boat, which we mark with yellow arrows; YOLOrs and SuperYOLO also detected nonexistent targets, marked with black arrows. For Figure 5c, the unimproved YOLO algorithms all missed the target, whereas the algorithms incorporating multimodality all detected it, with subtle differences in detection confidence.
5.1. Ablation Experiment
To further demonstrate the efficacy of ESA and the CDM, we designed ablation experiments to show how each module influenced the model's overall performance. In the first set of experiments, we directly fed data from both modalities into the YOLOv5s network model. In the second set, we integrated the ESA mechanism into both branches of the model and spliced the branches after processing, where the splicing enhanced the representation of local features with the help of ECA. The third set of experiments incorporated the CDM on top of ESA, with the goal of boosting the model's performance by refining and enhancing the distinctions in features among the modalities. A comparison of the experimental results is shown in Table 5, revealing the contribution of each module. After applying ESA, the overall precision of the model decreased slightly, but the recall was significantly improved; despite the slight increase in training time, the mAP value also increased. After adding the CDM, the precision, recall, and mAP were all enhanced, while the training time was slightly reduced, indicating that the CDM optimized the training efficiency while improving the performance of the model. Although the complete model has more parameters, the increases in the parameter count and GFLOPs were within acceptable bounds.
To showcase the efficacy of our proposed algorithm in terms of feature visualization, we carried out additional visualization analyses. As illustrated in Figure 6, the first column depicts the model's input image, with the van and the camping area shown in white. The second column displays the original feature map, while the third column presents the heat maps generated after processing through the backbone networks of YOLOv5s and ACDF-YOLO, respectively. The region focused on by the YOLOv5s network lacks targets, whereas ACDF-YOLO clearly identified a broader range of targets, thereby demonstrating the effectiveness of our algorithm.
5.2. Choice of Attention
In our efforts to design efficient multimodal fusion modules, the selection and experimental study of different attention mechanisms became a critical aspect. After a series of experimental comparisons, we proposed the ESA mechanism combined with the existing ECA mechanism. We selected triplet attention [45], ECA, and the Gaussian context transformer (GCT) [46] attention mechanisms as the focus of our study and applied them to the two separate branches to experimentally validate which attention schemes promoted the model's performance.
The performance of the various attention mechanisms was compared and analyzed by examining the AP, precision, recall, and mAP values. Triplet attention considers both the spatial and channel dimensions, instead of limiting itself to feature weighting in a single dimension, which theoretically bolsters the model's ability to gather important information. However, based on our experimental results, triplet attention did not show the expected benefits in the context of this study. In contrast, GCT utilizes a Gaussian context transformer to directly map global contextual information onto attention activations, and is unique in that it achieves significant feature reinforcement without a learning process [46]. In the theoretical analysis, GCT was expected to provide more effective feature fusion, but the experimental results did not support this. Line graphs for the different attention mechanisms are shown in Figure 7, revealing their relative performance across the various detection categories and evaluation metrics. Although triplet attention and GCT performed well in some respects, our proposed ESA mechanism achieved optimal performance for most categories and key detection evaluation metrics. Notably, under more stringent detection threshold requirements, the ESA mechanism still provided a significant performance improvement, further validating its effectiveness and reliability.
Through this series of experimental validations and comparative analyses, we not only demonstrated that the introduction of the ESA mechanism maximally enhanced the multimodal fusion model, but also emphasized the importance of integrating spatial and channel information when designing advanced attention models. In summary, our experimental results supported the ESA mechanism as a key module for improving the performance of multimodal fusion models.
5.3. Differential Module Insertion Position
In this study, we also carefully designed a CDM and identified its optimal insertion position in the multimodal fusion framework. To determine the impact of the CDM insertion position on model performance, we conducted a series of experimental comparative analyses, which informed the design of the final ACDF module.
In this part of the study, we focus on the changes in model performance when the CDM was inserted at different locations under the same experimental conditions. Specifically, the CDM was inserted into the ESA (Figure 8, referred to as method ADFa) or into the branches before fusion (Figure 9, referred to as method ADFb). To compare the performance of these two structures with different insertion positions, we summarize the experimental results in Figure 10.
We found that ADFb outperformed ADFa and the ACDF module in the detection task for specific categories (e.g., Truck, Tractor, and Van). However, even though method ADFb led in detection accuracy for certain categories, its overall recall was relatively low, implying that it may have a high misdetection rate in practical applications. Considering that a low recall can severely limit the usability of a model in real-world scenarios, we attempted to find a balance where a relatively high recall can be maintained, while ensuring a high detection precision.
Based on the above considerations, we chose the ACDF module as the core component of our multimodal fusion model. The ACDF module not only excels in overall average detection accuracy but also achieves a better balance between accuracy and recall. Owing to this design, our model can provide high accuracy, while reducing the missed detection of targets, making it more suitable for application in a wide range of real-world scenarios.