1. Introduction
As a core technology in the field of computer vision, object detection not only enables automatic recognition and localization of objects but also provides a solid foundation for subsequent tasks such as decision analysis and behavior understanding. It is widely applied in domains including security surveillance, autonomous driving, intelligent manufacturing, and search and rescue, significantly enhancing the intelligence level and response speed of these systems [1]. Traditional object detection methods are limited by handcrafted features and geometric extraction, leading to inefficiency and insufficient accuracy [2]. With the rise of deep learning, the powerful feature extraction capabilities of multi-layer neural networks have significantly advanced the development of object detection technology [3,4].
Deep learning-based object detection algorithms are mainly divided into one-stage and two-stage methods. Two-stage methods split object detection into candidate region generation and candidate region classification [5]. Wu et al. [6] proposed a two-stage method that first uses Binarized Normed Gradients (BING) to obtain candidate regions and then applies Convolutional Neural Networks (CNNs) for detection; this work introduced CNNs into aircraft detection for the first time. Wu et al. [7] presented an improved Mask Region-based CNN (R-CNN) model, SCMask R-CNN, which performs object detection and segmentation in parallel and improves detection accuracy by 1% to 2%. Zhang et al. [8] combined GoogLeNet with R-CNN to effectively detect aircraft of various scales in remote sensing images. Liu et al. [9] combined corner clustering with CNNs to propose a new two-stage aircraft detection method that generates a small number of high-quality candidate regions, reducing the false alarm rate and improving detection accuracy and robustness. Kumar et al. [1] studied the detection performance of Faster R-CNN on military aircraft, exploring the impact of anchor ratios on detection accuracy. These methods offer high accuracy, but their complex structures demand substantial computing resources, resulting in slow inference and a lack of global information [10].
One-stage detectors directly regress bounding boxes from feature maps, achieving higher efficiency and lighter models [10]. Representative methods include You Only Look Once (YOLO) [11] and the Single Shot MultiBox Detector (SSD) [12]. The YOLO algorithm is widely used in aircraft detection owing to its concise network architecture and very fast inference speed, and subsequent iterations have further optimized the trade-off between speed and accuracy [13]. Ji et al. [14] designed a remote sensing image segmentation algorithm to address the large image sizes and significant variations in remote sensing images; combined with a YOLOv5 detector, it performs segmentation before detection on the original images and was validated on remote sensing datasets, effectively improving recognition accuracy. Bhavani et al. [15] used an improved YOLOv5 algorithm to enhance object detection accuracy. To address the difficulty and high cost of labeling spaceborne optical remote sensing images, Wang et al. [16] proposed an end-to-end lightweight aircraft detection framework, A Network Based on Circle Grayscale Characteristics (CGC-NET), which can effectively identify targets from a small number of samples. Cai et al. [17] proposed YOLOv8-GD to address the poor detection accuracy and weak generalization of detectors on Synthetic Aperture Radar (SAR) images with complex backgrounds, effectively improving aircraft detection accuracy in SAR imagery. Bakirci et al. [18] utilized the newly released YOLOv9 to detect aircraft in satellite images acquired from low Earth orbit, enhancing airport and aircraft security.
These object detection techniques have achieved remarkable results on single-modality remote sensing images. However, under complex environmental conditions it is difficult to accurately detect targets using single-modality imagery (visible light only, infrared only, etc.) [9,19]. To this end, researchers have introduced multi-modality image fusion into aircraft detection [8], particularly the fusion of infrared and visible images, aiming to improve detection accuracy through information complementarity. Several multispectral datasets, such as FLIR [20], LLVIP [21], and VEDAI [22], have also contributed to the advancement of this technology. Adrián et al. [23], addressing the problem that poor lighting and other harsh conditions at disaster sites limit vision, proposed an infrared-visible fusion object detection method based on YOLOv3; applied to human and vehicle targets, it achieved good performance in search and rescue scenarios. Dong et al. [24], focusing on the limited accuracy of single-modal vehicle detection in complex urban environments, proposed a feature-level infrared and visible image fusion technique designed to enhance vehicle detection accuracy during both daytime and nighttime, and demonstrated its practical application. Owing to dataset limitations, current multi-modality fusion detection primarily targets pedestrians and vehicles, with relatively limited research on aircraft.
In existing multimodal fusion object detection methods, feature-level fusion can be divided into early, mid, and late fusion according to the stage at which fusion occurs, and studies have shown that mid-fusion outperforms the other strategies [25,26,27,28]. Traditional fusion operations, such as channel concatenation, are simple but of limited effectiveness. Zhou et al. [29] proposed a Modality Balance Network (MBNet), which fuses features through Differential Modality-Aware Fusion (DMAF) and an illumination-aware feature alignment module; however, differential methods extract only the differential features between modalities and ignore other information. Subsequently, various methods have been proposed to enhance feature fusion, such as the dual-branch YOLOv3 brightness-weighted fusion network [30], the self-attention feature fusion module and feature interaction in YOLO-MS [31], fusion modules that adaptively adjust feature weights [28,32], and the Attention-based Feature Enhancement Fusion Module (AFEFM), which combines differential features with original features through cross-concatenation [33]. Although these methods have their merits, they often overlook the value of the information shared between modalities and mostly focus on local features, lacking interaction with distant features. Researchers have therefore attempted to combine Transformers with CNNs [19,34], which improves detection accuracy but significantly increases model complexity, hindering practical deployment [35]. Fang et al. [36] achieved efficient multimodal feature fusion with few parameters by distinguishing between common and differential features, but their feature separation method still leaves room for improvement.
Cosine similarity, as an efficient and stable similarity measure, is fast to compute and highly resilient to changes in image brightness and contrast [37]. It has been used to measure the similarity between texts or images [38]. Liu et al. [39] applied weighted cosine similarity to image retrieval. Li et al. [40,41,42] successfully applied the cosine similarity metric to image classification tasks. Yuan et al. [43] combined cosine similarity with feature redundancy analysis, removing redundant features through cosine transformation. Islam et al. [44] used cosine similarity to filter similar images in medical imaging. Ahmad et al. [45] replaced the dot product in convolution with sharpened cosine similarity, measuring similarity in direction and providing a new perspective for feature extraction. Wang et al. [46] proposed a detection network named YOLO-CS, which adopts cosine similarity in place of the traditional dot product to measure the similarity between cross-scale features, facilitating efficient fusion of multi-scale features; cosine similarity is also applied in the loss function, significantly enhancing the model’s ability to distinguish targets from background, and YOLO-CS achieves higher accuracy in UAV detection tasks. These studies collectively demonstrate the application potential of cosine similarity in object detection and image processing. Existing methods primarily apply cosine similarity to image classification, retrieval, and similar-image filtering, or use it to replace the dot product in convolution as a new feature extraction approach. However, when computing similarity between images, they typically flatten all pixels of the entire image into a one-dimensional vector, or compress the image channels and flatten the pixels of cropped or otherwise processed regions into one-dimensional vectors. Research that treats each feature channel as a unit, flattening the pixels of a single channel into a vector and computing the similarity between corresponding feature channels of different modalities, remains scarce.
To address the above issues, we adopt a lightweight dual-branch YOLOv8 architecture consisting of two parallel backbone networks and a unified head network. The two backbones extract information from the visible and infrared images, respectively, and their outputs are fused before being passed to the head for detection. To better achieve feature complementarity between the visible and infrared modalities, we devise an efficient feature fusion module, Cosine Similarity-based Image Feature Fusion (CSIFF). CSIFF first leverages cosine similarity in the Feature Splitting (FS) module to decompose the features of both modalities into similar and specific features. The Similar Feature Processing (SFP) module then filters the similar features to mitigate redundancy, while the Distinct Feature Processing (DFP) module enhances the specific features to preserve their uniqueness. By embedding the CSIFF module into different layers of the dual-branch YOLOv8 backbone, we achieve multi-scale feature fusion and construct an efficient and lightweight Multi-Modality YOLO Fusion Network (MMYFNet) for object detection, which is of great significance for improving detection accuracy.
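To make the split-process-fuse flow of CSIFF easier to follow, the snippet below gives a schematic sketch under stated assumptions: the FS step follows the channel-wise cosine similarity idea described above, but the similarity threshold, the use of the summed modality features as the split input, and the 1×1-convolution bodies standing in for SFP and DFP are placeholders of ours, not the actual module internals.

```python
# Schematic sketch of a CSIFF-style split-process-fuse flow in PyTorch.
# Only the overall structure follows the paper's description; the threshold
# and the SFP/DFP bodies below are simple stand-ins, not the real modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSIFFSketch(nn.Module):
    def __init__(self, channels: int, threshold: float = 0.5):
        super().__init__()
        self.threshold = threshold  # assumed cut-off between "similar" and "specific" channels
        self.sfp = nn.Conv2d(channels, channels, kernel_size=1)  # placeholder: filter similar features
        self.dfp = nn.Conv2d(channels, channels, kernel_size=1)  # placeholder: enhance specific features

    def forward(self, feat_vis: torch.Tensor, feat_ir: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat_vis.shape
        # FS: cosine similarity between corresponding channels of the two modalities
        sim = F.cosine_similarity(feat_vis.reshape(b, c, -1),
                                  feat_ir.reshape(b, c, -1), dim=-1)   # (B, C)
        mask = (sim > self.threshold).float().view(b, c, 1, 1)         # 1 = similar channel
        merged = feat_vis + feat_ir                                    # assumed split input
        similar_feats = merged * mask           # channels the modalities agree on
        specific_feats = merged * (1.0 - mask)  # channels carrying modality-specific cues
        # Process each group separately, then fuse by element-wise addition
        return self.sfp(similar_feats) + self.dfp(specific_feats)

# Usage: fuse a pair of 256-channel backbone feature maps
fused = CSIFFSketch(256)(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
print(fused.shape)  # torch.Size([1, 256, 40, 40])
```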
The main contributions of this paper are as follows:
We design a dual-branch YOLOv8 architecture, in which two parallel backbone networks extract features from the visible and infrared modalities, respectively. By integrating the CSIFF module into different layers of the backbone, we construct a novel multi-modal fusion object detection network, MMYFNet.
We propose a lightweight feature fusion module, CSIFF, which first partitions features into shared and specific features based on the cosine similarity between corresponding feature channels of the two modalities. The DFP and SFP modules then process the specific and similar features, respectively. Finally, the processed specific and shared features are added element-wise to obtain the fused features.
We introduce the FS module, which innovatively applies cosine similarity to compute the similarity between corresponding feature channels of different modalities. Unlike previous applications, it treats each feature channel as a computational unit, flattening the pixels of a single channel into a vector and comparing it with the vector from the corresponding channel of the other modality. This allows the feature channels of an image to be classified into similar and dissimilar groups, providing more detailed and accurate information for subsequent processing and broadening the scope of cosine similarity applications.
We carried out extensive ablation studies and comparative experiments on two public datasets, VEDAI and FLIR, to comprehensively validate the effectiveness of the proposed method. The results indicate that multi-modal feature fusion outperforms single-modal approaches and that the CSIFF module surpasses existing fusion techniques, achieving a superior balance between parameter efficiency and detection accuracy. Comparative evaluations against other state-of-the-art multi-modal fusion detection networks further substantiate the effectiveness of MMYFNet.
The remainder of this paper is organized as follows: Section 2 introduces the MMYFNet architecture and the implementation details of the CSIFF module; Section 3 validates the effectiveness of the CSIFF module through comparative experiments and visual analysis; Section 4 provides the discussion; and Section 5 concludes the paper.
4. Discussion
In complex environments, object detection faces multiple challenges such as illumination variations, occlusions, and severe weather, under which single-modality data struggles to comprehensively capture target features, limiting detection performance. To address this, researchers have introduced infrared and visible image fusion techniques, leveraging information complementarity to enhance detection accuracy. Early approaches focused on the differences between modalities through differential methods, and various feature interaction and fusion modules or networks were subsequently proposed. However, most of these methods concentrate on differential and local features, neglecting the value of common and long-range information.
Recently, the introduction of Transformers has significantly boosted the performance of fusion-based object detection networks, albeit accompanied by a sharp increase in model complexity. In response to the prevalent issues of high computational complexity and low information utilization in current methods, we propose the CSIFF module and the lightweight dual-branch feature fusion detection network MMYFNet. MMYFNet processes infrared and visible images in parallel through its dual-branch structure, while the CSIFF module significantly improves detection accuracy through feature partitioning, independent processing, and fusion.
Experimental results demonstrate that MMYFNet surpasses single-modality and baseline models in detection accuracy, and that the CSIFF module outperforms other fusion methods. Nevertheless, we also observe that although detection accuracy improves, the cosine similarity partitioning slightly reduces computation speed, limiting the model's potential for real-time applications to some extent. Additionally, our research relies primarily on the VEDAI and FLIR datasets, so the generalization ability of our approach needs further validation on other datasets. Moreover, while our network achieves good detection accuracy among lightweight models, its performance does not yet match that of more complex models incorporating Transformers.
Future research can focus on algorithm optimization to reduce computational complexity while exploring more efficient fusion strategies to fully leverage the complementarity of multi-modal data. Furthermore, applying our method to real-time object detection systems and evaluating its stability and robustness across different scenarios represents a promising direction for exploration.