Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network

Wang, Jinpeng; Su, Nan; Zhao, Chunhui; Yan, Yiming; Feng, Shou

doi:10.3390/rs16203904

Open AccessArticle

Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network

by

Jinpeng Wang

¹,

Nan Su

^1,2,*

,

Chunhui Zhao

^1,2,

Yiming Yan

^1,2

and

Shou Feng

^1,2

¹

College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China

²

Key Laboratory of Advanced Marine Communication and Information Technology, Ministry of Industry and Information Technology, Harbin Engineering University, Harbin 150001, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(20), 3904; https://doi.org/10.3390/rs16203904

Submission received: 21 August 2024 / Revised: 19 September 2024 / Accepted: 3 October 2024 / Published: 21 October 2024

(This article belongs to the Section Remote Sensing Image Processing)

Download

Browse Figures

Versions Notes

Abstract

With the simultaneous acquisition of the infrared and optical remote sensing images of the same target becoming increasingly easy, using multi-modal data for high-performance object detection has become a research focus. In remote sensing multi-modal data, infrared images lack color information, it is hard to detect difficult targets with low contrast, and optical images are easily affected by illuminance. One of the most effective ways to solve this problem is to integrate multi-modal images for high-performance object detection. The challenge of fusion object detection lies in how to fully integrate multi-modal image features with significant modal differences and avoid introducing interference information while taking advantage of complementary advantages. To solve these problems, a new multi-modal fusion object detection method is proposed. In this paper, the method is improved in terms of two aspects: firstly, a new dual-branch asymmetric attention backbone network (DAAB) is designed, which uses a semantic information supplement module (SISM) and a detail information supplement module (DISM) to supplement and enhance infrared and RGB image information, respectively. Secondly, we propose a feature fusion pyramid network (FFPN), which uses a Transformer-like strategy to carry out multi-modal feature fusion and suppress features that are not conducive to fusion during the fusion process. This method is a state-of-the-art process for both FLIR-aligned and DroneVehicle datasets. Experiments show that this method has strong competitiveness and generalization performance.

Keywords:

multi-modal fusion; object detection; asymmetric attention

1. Introduction

Object detection technology has been widely used in military and civilian fields, such as in intelligent monitoring, urban planning, precision agriculture, geographic information system (GIS) updating, and so on [1,2,3,4]. Optical images have significant advantages when detecting objects with rich color and discriminative color information, such as a red ball on green grass. However, RGB images are easily disturbed by light, smoke, and other factors, which greatly affects the detection performance.

Compared with optical images (RGB images), infrared images have outstanding advantages in poor illumination and smoke interference conditions. When the optical sensor has difficulty functioning, the observer can still obtain the semantic information, such as the position and shape of the target, from the infrared image through the heat information. This means that infrared images play a very important role in many applications, such as early warning systems and marine monitoring systems [5,6]. In the application scenario, the information from the RGB image and infrared image is complementary. Multi-modal data can be used for target detection at the same time, which can break through the accuracy limit of single-mode data. However, due to differences in imaging principles, infrared images often lack texture, target contour, and other details, and have strong background clutter and noise. However, the two modes have obvious modal differences, and the simple fusion of the features of the two modes may lead to the introduction of redundant information that is not good for detection, thus reducing the performance of network detection.

Object detection methods using multi-modal images can be divided into three types: image-level fusion [7], feature-level fusion, and decision-level fusion [8,9]. In the image-level fusion detection method, scholars carry out the feature matching and fusion of multi-modal images to obtain a fusion image with better visual effects and then use the fusion image for object detection. This method is of great significance to reconnaissance processes with human participation. However, for object detection algorithms, fusion images with better visual effects may not result in large performance improvements. They are limited by the consistency of target feature distribution, as targets with a more consistent distribution tend to be easier to detect. The decision level fusion algorithm involves the prediction of multi-modal images by the network and then mapping through a specific strategy to improve the detection performance. However, for difficult targets that need to utilize complementary information, this approach will be difficult. Research shows that, compared with image-level fusion and decision-level fusion, blending different modal features in the middle of the network tends to produce better multi-modal detection results [7,8,9,10,11]. Therefore, adopting feature-level fusion has become the default fusion strategy for multi-modal object detection using deep learning.

How to obtain high-quality fusion features is the core problem of feature level fusion detection methods. J’org Wagner et al. designed a fusion structure based on CNN, which was based on CaffeNet, to learn the parameters of the fusion network end-to-end and to obtain a feature representation of the multi-modal association [8]. Meng et al. used the calculated weights to weight the multi-modal features and then obtained the fusion features in the form of addition [12]. Based on the convolutional structure, Hong et al. designed a variety of more complex fusion structures to obtain better fusion features [13]. Some researchers used additional structures to obtain the illumination information of an image and carried out weighting and convolution operations for each modal feature [14,15]. The joint expression of multi-modal features is optimized by suppressing the interference of invisible objects in optical images, such as low illumination. However, these methods do not selectively enhance multi-modal information, and useful information in a fusion feature may be disturbed by a large amount of redundant information. Xie et al. obtained simple fusion features by concatenating and convolution operations in multi-modal features and then used simple fusion features for channel attention operations in order to selectively enhance each modal feature along channel dimensions [16]. On the basis of using the initial fusion features for attention operation, Zhang et al. carried out a self-attention operation for each mode feature, which further improved the performance of the algorithm [17]. However, these methods do not take into account the differences in the different modal information, and the features used for fusion are only selectively enhanced without the missing information being supplemented. At present, the mechanism of attention is becoming a hot topic in research. Zhang et al. proposed ResNeST [18], a backbone network based on a self-attention mechanism. Li et al. used the fusion features obtained by splicing to weight modal features [19] and then added branch features to obtain fusion features. Yang et al. proposed a gate convolution structure to obtain fusion features after weighting modal features [20]. However, effective information exists in different forms in infrared images and optical images, such as temperature information in infrared images and global details in optical images. Different forms of effective information are distributed differently in each dimension and using the same attention structures will reduce the learning efficiency of the network. In addition, information compensation in the process of feature extraction can reduce the difficulty of mining network target features. Both Li and Yang carried out feature fusion after feature extraction.

Recently, the Transformer structure has shown increasing competitiveness. Fang et al. introduced Transformer architecture into the feature fusion field with great success [21]. The features were changed into vector form; matrix multiplication was carried out; fusion features, including a self-attention component and a mutual attention component, were obtained; and attention operations were carried out on each modal component while information fusion was carried out. However, the structure of information fusion in this method is symmetrical, and the imaging characteristics of infrared and RGB images were not taken into account, so the lack of information required supplementation.

The multi-modal data are different and complementary. On the one hand, many of the above mentioned fusion methods do not consider the differences and directly fuse multi-modal features, which may lead to the introduction of features that are not conducive to detection, thus reducing the performance of object detection. On the other hand, although some methods use attention structures to enhance the target features, they only weight each modal feature separately and do not carry out complementary operations with direct information. In some methods, although the two above mentioned characteristics of multi-modal data are taken into account, the attention operation is designed to be symmetrical. In other words, the information supplement structure between the optical and infrared branches is the same, which makes it difficult to supplement the features that are lacking in each mode.

To solve these problems, we propose a new multi-modal object detection method based on a dual-branch asymmetric attention backbone and feature fusion pyramid network. Specifically, we designed a dual-branch asymmetric attention backbone network that enhances useful information while more directly supplementing the information that a mode lacks. In addition, we also propose a new Transformer-like feature fusion pyramid module to obtain better fusion feature representation.

In conclusion, the main contributions of this paper are as follows:

1. To realize high-performance object detection by using multi-modal image information, we designed a new dual-branch fusion object detection architecture. In the feature extraction stage, multi-modal information is supplemented by different modal features and selective feature fusion is carried out in the neck part. Finally, fusion features are used for object detection.

2. To make full use of the complementary advantages of multi-modal data, we designed a dual-branch asymmetric attention backbone network. To supplement the detailed information in infrared data, we designed DISM, which uses the optical branch to perform attention operations on the infrared branch in a large field of view, and at the same time causes the two modal data types to simply fuse. In order to supplement the semantic information of optical images, we designed SISM to enhance the target features of optical images and improve the object detection rate by using more adequate semantic features in infrared images.

3. To obtain better multi-modal fusion features, we designed a feature pyramid fusion module, which adaptively enhances multi-modal feature maps of different scales by using symmetric decoder structure. The module uses the transformed feature image to carry out static weight mapping, to learn attention weight efficiently, and to selectively enhance the multi-modal feature part that is conducive to fusion. Finally, the multimodal feature maps enhanced at different scales are superimposed to obtain a high-quality fusion feature pyramid.

The rest of this article is structured as follows. In Section 2, we describe the proposed network in detail. Section 3 presents our work and experimental results to verify the validity of our approach. Proposed improvements are discussed in Section 4. Finally, we summarize the research content in Section 5.

2. Method

2.1. Overall Network Architecture

In this paper, we propose a structure for high-precision object detection by combining infrared and optical image information. The overall flow chart of the proposed method is shown in Figure 1. The whole architecture is mainly composed of three parts: feature extraction, feature fusion, and detection. Due to the significant modal differences between optical and infrared images, a two-branch backbone network was designed to extract different modal features separately. In terms of the differences between modes, the semantic information in the infrared branch is used to enhance the optical features and the detailed information in the optical branch is used to enhance the infrared features. In order to further obtain excellent joint feature representation, a fusion structure of Transformer strategy is proposed to enhance useful information in the fusion stage and suppress redundant information that is not conducive to fusion. Finally, fusion features are used for object detection.

As shown in Figure 1, the network uses infrared and optical images as inputs to detect the network. In the process of the feature extraction of the two modal images, we used DISM and SISM alternately between several residual stages. After feature extraction, the method collected the calculation results of each stage of the network to form two feature pyramids corresponding to two modes. Then, a high-quality fusion feature pyramid was obtained using FPFM. Finally, fusion features were used for object detection. The overall calculation flow of the detection method proposed in this paper is shown in Algorithm 1.

Algorithm 1 The Proposed Detection Method

Require: I_rgb: an optical image; I_inf: an infrared image;
Ensure: Result: the result of detection
Build a dual-branch backbone as B_dual
Build feature fusion module as M_f
M_f = [M_f1,M_f2,M_f3,M_f4]
F_rgbs,F_infs = B_dual(I_rgb,I_inf)
F_fusions = []
for i in [0,1,2,3]:
F_fusion = M_f[i](F_rgbs[i],F_infs[i])
F_fusions.append(F_fusion)
Build a detector as D
Result = D(F_fusion)

2.2. Dual-Branch Asymmetric Attention Backbone Network

Optical images and infrared images have advantages and disadvantages, and in some challenging scenarios, their information is complementary. The backbone network is the basic information extractor of the object detection network. In some architectures, feature fusion is only carried out in the neck part after feature extraction and the focus of network training is placed on the fusion structure. However, this training process improves the backbone network much less than the fusion structure. Therefore, a new two-branch backbone network was designed to better extract optical and infrared image features. Due to different imaging principles, optical images tend to have more detailed information; however, infrared images are sensitive to temperature, and the edge between the target and the background is often more obvious and has more semantic information. Therefore, DISM and SISM are used in the backbone network to supplement the information of each mode image feature and improve the quality of each mode feature.

The main structure diagram of DAAN is shown in Figure 2. In general, details and texture information are more often extracted in the shallow layers of the network, while semantic information is more often extracted in the deep layers. During the design process of the method, we used DISM at a shallow level and SISM at a deeper level. At each stage, only one information supplement module (DISM or SISM) was used to establish a one-way information path in a local area to improve the convergence speed of the network. We think of DISM and SISM as a loop structure that appears in pairs, which can be plugged in and played at any stage. In the experiment, we also carried out an ablation experiment of the position of the loop structure. In BAAN, in order to focus on the proposed dual branch architecture and loop structure, we used the same residual structure as ResNet50 [22] and the network depth was also consistent with it.

2.2.1. Detail Information Supplement Module

Figure 3 shows the structures of DISM and SISM. We improved DISM on the basis of spatial attention, using optical image features to supplement the details of infrared image features. In order to supplement the details more directly, we broadcast the weight matrix dimensionally after the spatial attention operation to make it consistent with the input feature dimension, and then added it to the enhanced infrared feature. Such an operation can directly introduce the information obtained from the optical branch and further enhance the influence of the weight matrix on the feature.

The DISM is calculated as follows:

DISM (F^{r g b}, F^{i n f}) = B_{c} [W (F^{r g b})] + C_{1 \times 1} [W (F^{r g b}), C_{1 \times 1} (F^{i n f})] \cdot F^{i n f}

(1)

In the formula,

F^{r g b}

and

F^{i n f}

refer to the characteristics of the optical and infrared branches, respectively;

B_{c}

refers to the broadcast of the channel dimension;

C_{1 \times 1}

represents the scale-invariant convolution operation with a kernel size of 1; and the two internal parameters represent the concatenation before processing.

W

is the spatial principal component sampling matrix of the optical branch, which can be calculated as follows:

W (F^{r g b}) = C_{1 \times 1} [P o o l_{a v g}^{c} (F^{r g b}), P o o l_{\max}^{c} (F^{r g b})]

(2)

Pool refers to the pooling operation, and c refers to the operation on the channel dimension. After pooling, the channel dimension is 1. It is worth noting that we use the pooling operation to obtain the spatial principal component sampling matrix for the detailed information sources, and the convolution operation to obtain the spatial sampling matrix for the supplemented infrared branch features because we want the network to be able to adapt to the degree of missing infrared features; in other words, this design allows the network to more dynamically supplement the details to varying degrees.

2.2.2. Semantic Information Supplement Module

We improved SISM on the basis of channel attention and used infrared image to supplement the semantic information of optical image features. Since the target features of different feature branches are not necessarily similar (infrared images may focus on contours while optical images may focus on details), a direct feature supplement may interfere with optical branches. Therefore, in this module, we used a weighted method to enhance the semantic information of optical branch features. The SISM is calculated as follows:

SISM (F^{r g b}, F^{i n f}) = F^{r g b} + W^{'} (F^{r g b}, F^{i n f}) \cdot F^{r g b}

(3)

The output of the semantic information supplementary module is the addition of the optical branch feature enhanced by the multimodal feature and its pre-enhanced feature.

W^{'}

in the formula is the channel attention weight vector obtained based on two modal features, and its calculation formula is as follows:

W^{'} (F^{r g b}, F^{i n f}) = F C {C_{1 \times 1} [P o o l_{a v g}^{h, w} (F^{r g b}), P o o l_{a v g}^{h, w} (F^{i n f})]}

(4)

P o o l_{a v g}^{h, w}

refers to the pooling operation along the high and wide dimensions, and after the operation, the output feature tensor has both height and width dimensions of 1. FC refers to the fully connected layer.

2.3. Feature Fusion Pyramid Network

After obtaining the feature information of infrared images and optical images, in order to obtain high-quality joint feature expression, a new fused symmetric decoder structure with Transformer strategy was designed. FFPN uses multi-modal information to enhance the features of each scale of each modality point by point, which enhances the useful information and suppresses the interference information that is not conducive to fusion. The detailed structure of the FFPN is shown in Figure 4. In order to integrate multi-modal features, taking optical branches as an example, we used the query tensor of infrared branch features and the bond tensor of optical branches to calculate the weight matrix, and then mapped it to the optical features.

In FPFM, we first used the linear layer of channel dimension to obtain the linear mapping of multiple subspaces and used the obtained query tensor, bond tensor, and value tensor for the calculation. The C&C boxes in the figure represent concatenation and convolution operations, and E-Pro and E-Plus refer to the product of elements and the addition of elements. The calculation process of FPFM is as follows:

FFPN (F^{r g b}, F^{i n f}) = {\tilde{F}}^{r g b} + {\tilde{F}}^{i n f}

(5)

where

{\tilde{F}}^{r g b}

and

{\tilde{F}}^{i n f}

are weighted feature tensors, and the specific calculation process is as follows:

{\tilde{F}}^{r g b} = S i g [C_{1 \times 1} (Q^{i n f}, K^{r g b})] \cdot V^{r g b}

(6)

{\tilde{F}}^{i n f} = S i g [C_{1 \times 1} (Q^{r g b}, K^{i n f})] \cdot V^{i n f}

(7)

Sig in the above formulae represents the sigmoid function, which can map the feature value between 0 and 1, so that it can be used as the enhanced weight value of the feature point. Q, K, and V in the formula are all derived from feature mapping, and

Q^{r g b}

and

Q^{i n f}

are calculated in the same way. Taking optical branches as an example, they are calculated as follows:

Q^{r g b}, K^{r g b}, V^{r g b} = C_{1 \times 1}^{1} (F^{r g b}), C_{1 \times 1}^{2} (F^{r g b}), C_{1 \times 1}^{3} (F^{r g b})

(8)

Convolution symbols with different superscripts represent different convolution layers. By means of cross-modal feature enhancement, high-quality co-expression can be obtained and the detection performance can be further improved.

3. Results

This section describes the experimental dataset, the implementation details of the proposed method, and the relevant evaluation indicators. In addition, a large number of comparative tests have been carried out to verify the advancement and generalization performance of the proposed method.

3.1. Dataset Introduction

3.1.1. FLIR Dataset

The FLIR dataset is a publicly available dataset provided by Teledyne FLIR. The FLIR dataset provides data for both day and night scenarios, including 10 k pairs of RGB images and infrared images, but only the infrared images are labeled. These images are captured by an optical camera and a thermal imager mounted on the vehicle while driving. In order to enable the dataset to be used for multimodal object detection, Zhang et al. [23] removed unaligned image pairs from the dataset and only retained three categories, namely people, bicycles, and cars. Finally, the dataset contained a total of 5142 pairs of well-aligned visible light-thermal images with a resolution of 640 × 512, of which 4129 pairs were used as the training set and 1013 pairs were used as the test set. The figures show some samples from the FLIR-aligned datasets.

3.1.2. DroneVehicle Dataset

The DroneVehicle dataset is a large-scale vehicle detection dataset released by Tianjin University [24]. The images in the dataset are taken in different urban areas, including different types of urban roads, residential areas, parking lots, highways, etc., in day and night conditions. It features well-annotated RGB images and infrared images, including oriented object bounding boxes, object categories, and more. As a large-scale dataset with RGB and thermal infrared (RGBT) images, the benchmark enables the extensive evaluation and investigation of visual analysis algorithms on drone platforms. We used 17,990 for training on RGB-infrared images and 1469 pairs for validation. The entire dataset contains the following five categories of targets: cars, vans, trucks, buses, and vans. The figures show some samples from the DroneVehicle dataset.

3.2. Implementation Details

The experimental device uses an Intel(R) Xeon(R) Silver 4210R CPU and the operating system is Ubuntu 18.04. The proposed method was trained on an NVIDIA RTX 3090 GPU, and the scheme was implemented through Pytorch1.9.0 framework. We used 1024 images as the input, with a batch size of 4 and an initial learning rate of 0.008. The optimizer uses a stochastic gradient descent (SGD) algorithm with an attenuation weight of 0.0001 and a momentum of 0.937. In this paper, mAP50 was used to evaluate all models. Our experimental results on two publicly available datasets show that the proposed method leads in multimodal object detection tasks.

3.3. Evaluation Metrics

The object detection algorithm generally uses the accuracy, recall rate, and average precision to evaluate detection performance. The relevant calculation methods are as follows:

P recision = \frac{T P}{T P + F P}

(9)

R_{ecall} = \frac{T P}{T P + F N}

(10)

where

T P

(true positive) indicates a correctly detected positive sample, and

F P

(false positive) indicates an incorrectly detected positive sample. Similarly,

T N

and

F N

refer to negative samples that were correctly detected and false samples that were incorrectly detected. In the field of object detection,

I o U

is used as the standard to judge whether the sample is positive. The

I o U

is calculated as follows:

I o U = \frac{P \cap T}{P \cup T}

(11)

where

P

refers to the prediction target region and

T

is the real value region of the indicator note. mAP50 refers to determining the correct sample when the

I o U

is greater than 0.5.

After a certain type of prediction is arranged by confidence, the accuracy rate and recall rate are calculated one by one and the

P

-

R

curve is drawn. The

A P

is the area under the curve. The closer the value is to 1, the better the detection performance of the algorithm. The calculation process can be expressed as follows:

A P = \int_{r \in R e s u l t} P (R) d R

(12)

The mAP is the average of each category of

A P

and can comprehensively characterize the performance of the method on datasets.

3.4. Analysis of Results

To verify the validity and advance of the proposed method, we conducted a series of experiments on FLIR-aligned and DroneVehicle datasets. Specifically, we conducted comparative experiments with other advanced algorithms on both datasets to verify the advancement and generalization of this method. In addition, in order to verify the validity of the backbone structure and fusion structure proposed in this paper, we conducted ablation experiments on the FLIR dataset, including the location ablation experiments of the backbone structure, fusion structure, and loop attention structure in the backbone network.

3.4.1. Experiments on the FLIR-Aligned Dataset

(1): Comparison with other state-of-the-art methods

A series of comparison experiments were designed to verify the effectiveness of our fusion detection algorithm. In the comparative experiments, we chose current popular, advanced, and innovative algorithms, including multi-modal data design algorithms such as CFR, GAFF, and CFT. In addition, to evaluate the performance of our algorithm more comprehensively, we also compared it with popular single-modal object detection methods. These single-modal methods only use the data of one mode for target detection, ignoring the complementarity of multi-modal data. The experimental results are shown in Table 1, which illustrates the evaluation indexes of various methods. The superior performance and effectiveness of the proposed method on multi-modal data were verified by comparison and ablation experiments. The baseline aim of this paper was to use double branches for feature extraction, directly add each scale feature of the two branches, and then carry out target detection.

The Faster RCNN is a currently very popular two-stage target detection algorithm, and PSC is a new architecture that was proposed in recent years. However, due to the limitation of single-modal data information, their performances are lower than those of multi-modal target detection methods. CFR and YOLO-MS are both advanced new fusion target detection algorithms for visible-infrared data research, but our approach significantly outperforms these methods.

Figure 5 provides a comparison of the subjective detection effect of this research method and the baseline method. In the figure, the orange-red box marks the target detected by the method, while the green box is the real location of the actual target. By observing the samples in the first row, we can observe that the performance of this method is better than that of the baseline method in the low-light environment of the optical image and for the target visible in the infrared image. This demonstrates that this method has a higher detection accuracy when there is interference information in optical images.

Further observation of the samples in the second and third rows shows that the method has difficulty distinguishing some objects from the background in infrared images, but it still shows excellent detection performance in terms of the targets visible in visible images. This shows that the method still has strong detection ability and high adaptability under infrared image interference conditions.

In summary, the method proposed in this study is significantly better than the baseline method in terms of the subjective detection effect. In the case of interference in one mode of multi-mode data, this method can extract fusion features and suppress interference information adaptively, so as to maintain high detection accuracy.

(2): Ablation experiment

In order to verify the effectiveness of the modules proposed in this paper, we conducted ablation experiments on each module, and the experimental results are shown in Table 2.

Experiments show that both DAAB and FFPN proposed in this paper can effectively improve the detection performance of baseline. By realizing feature fusion in FFPN and combining with DAAB to selectively supplement and extract multi-modal features, the performance of the whole algorithm can be optimized. Although the scheme proposed in this paper increases the cost of computation and memory, it effectively gains performance improvement. In the future, we will further explore a better balance of time consumption and performance.

In addition, for the loop structure in the backbone network, we conducted a series of ablation experiments on the location of the structure, and the results are shown in Table 3. The initial position, the deepest position and the maximum inner distance of the loop structure in the backbone network were tested. It can be seen from the table that the loopback structure can effectively improve the detection performance of the network. When the loopback structure composed of DISM and SISM was applied in the last two stages, it resulted in a better level of improvement than its application in the first two stages. Adding this structure to both deep and shallow layers of the backbone network can maximize the detection performance of the network.

The D in the table indicates that the infrared branch feature is supplemented by the DISM at the corresponding position, and the S indicates that the optical branch feature is supplemented by the SISM. In addition, we conducted comparative experiments using pairs of D and S in one or more locations at the same time. It is worth noting that the combination of D and S used in the same location resulting in each mode separately complementing the other modal, rather than one mode being supplemented and then supplementing the other modal. Experiments show that the DAAB proposed in this paper can effectively supplement the missing target feature information of each mode branch and FFPN can obtain high-quality fusion features to improve the detection performance of the network.

3.4.2. Experiments on the DroneVehicle Dataset

To further verify the advanced nature of the proposed method, we conducted a comparative experiment with other advanced algorithms on a public dataset from a drone view perspective (Table 4).

We compared current advanced single-mode detection methods and multi-mode detection methods. TSFADet is currently the most advanced method on this dataset and achieved the highest accuracy after the concatenation operation. However, the performance of the method proposed in this paper is still in a leading position.

The subjective detection results of the proposed method on the DroneVehicle dataset are shown in Figure 6. The first row contains optical images, the second row comprises infrared images labeled with the locations of the targets, and the third row shows the visualizations of the detection results of our method.

In the first and second columns of the schematic, some objects are in a low detection state in the optical image, which is difficult for the object detection algorithm to detect, as even the human eye cannot distinguish them. In the third and fourth columns, some targets in the infrared images are difficult to identify with the naked eye and the optical image needs to supplement the color information to confirm the target position. The methods proposed in this paper maintain excellent detection performance in the above cases.

4. Discussion

In this paper, a new multimodal object detection method was proposed which uses an asymmetric attention structure to supplement missing detail information in infrared images and missing semantic information in optical images as well as a fusion module to obtain joint feature representation, so as to improve detection performance. However, in some application scenarios, it is difficult for users to combine the field of view of infrared and optical sensors individually, as there are coincidence parts and there are non-coincidence parts. How to deal with multi-modal data with a huge field-of-view deviation and how to undertake high-performance object detection is our future research direction.

5. Conclusions

Through the characteristics of remote sensing multimodal image data, we researched how to obtain high-quality fusion feature representation while suppressing interference information, and a new fusion target detection method was proposed.

Firstly, considering the differences between different modal data, a new dual-branch asymmetric attention backbone network was proposed to improve the reliability of the modal information. Specifically, the backbone network supplements infrared and RGB images with targeted information through DISM and SISM.

Secondly, in order to make better use of the complementarity of multi-modal data, a feature pyramid fusion module was designed, which uses the structure of a Transformer decoder to enhance the weight of another mode of information, so as to enhance useful information conducive to fusion while suppressing redundant information unfavorable to fusion. Finally, multi-level and multi-scale fusion features were obtained.

Numerous experiments conducted on FLIR-aligned and DroneVehicle data revealed that the proposed approach was able to achieve encouraging results. The backbone network and fusion structure included in this method can also be used in other network architectures to improve the multi-modal feature fusion performances of different algorithms. Experiments show that our method achieves SOTA results on both multi-modal datasets and has a considerable degree of robustness and generalization performance. The proposed method gives full play to the advantages of each mode to obtain high-quality fusion features. In the future, asymmetric attention structures could be further studied to use multi-modal images with misaligned targets for high-performance target detection.

Author Contributions

Conceptualization, J.W. and N.S.; methodology, J.W. and N.S.; software, J.W.; validation, Y.Y. and C.Z.; formal analysis, C.Z. and S.F.; data curation, Y.Y. and S.F.; writing—original draft preparation, J.W. and N.S.; writing—review and editing, N.S. and J.W.; funding acquisition, C.Z., Y.Y. and N.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China: No. 62271159, No. 62071136, No. 62002083, No. 61971153, No. 62371153; Excellent Youth Foundation of Heilongjiang Province of China: YQ2022F002; Fundamental Research Funds for the Central Universities: 3072024XX0805; Heilongjiang Province Key Research and Development Project: GA23B003; Key Laboratory of Target Cognition and Application Technology: 2023-CXPT-LC-005.

Data Availability Statement

The FLIR-aligned dataset was obtained from the website https://github.com/zhanghengdev/CFR, accessed on 10 May 2023. The DroneVehicle dataset was obtained from the website https://github.com/VisDrone/DroneVehicle, accessed on 5 January 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Cerutti-Maori, D.; Klare, J.; Brenner, A.R.; Ender, J.H.G. Wide-Area Traffic Monitoring With the SAR/GMTI System PAMIR. IEEE Trans. Geosci. Remote Sens. 2008, 46, 3019–3030. [Google Scholar] [CrossRef]
Li, Y.; Zhang, Y.; Huang, X.; Yuille, A.L. Deep networks under scene-level supervision for multi-class geospatial object detection from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2018, 146, 182–196. [Google Scholar] [CrossRef]
Han, J.; Zhang, D.; Cheng, G.; Guo, L.; Ren, J. Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3325–3337. [Google Scholar] [CrossRef]
Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
Park, J.; Chen, J.; Cho, Y.K.; Kang, D.Y.; Son, B.J. CNN-Based Person Detection Using Infrared Images for Night-Time Intrusion Warning Systems. Sensors 2019, 20, 34. [Google Scholar] [CrossRef] [PubMed]
Chen, Y.; Shin, H. Pedestrian Detection at Night in Infrared Images Using an Attention-Guided Encoder-Decoder Convolutional Neural Network. Appl. Sci. 2020, 10, 809. [Google Scholar] [CrossRef]
Zhou, K.; Chen, L.; Cao, X. Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral pedestrian detection using deep fusion convolutional neural networks. ESANN 2016, 587, 509–514. [Google Scholar]
Liu, J.; Zhang, S.; Wang, S.; Metaxas, D.N. Multispectral deep neural networks for pedestrian detection. arXiv 2016, arXiv:1611.02644. [Google Scholar]
Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-Aware Faster r-Cnn for Robust Multispectral Pedestrian Detection. Pattern Recognit. 2019, 85, 161–171. Available online: https://www.sciencedirect.com/science/article/pii/S0031320318303030 (accessed on 11 August 2023). [CrossRef]
Zhang, L.; Zhu, X.; Chen, X.; Yang, X.; Lei, Z.; Liu, Z. Weakly aligned cross-modal learning for multispectral pedestrian detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5126–5136. [Google Scholar]
Meng, S.; Liu, Y. Multimodal Feature Fusion YOLOv5 for RGB-T Object Detection. In Proceedings of the 2022 China Automation Congress (CAC), Xiamen, China, 25–27 November 2022; pp. 2333–2338. [Google Scholar] [CrossRef]
Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More diverse means better: Multimodal deep learning meets remote-sensing imagery classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 4340–4354. [Google Scholar] [CrossRef]
Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M.Y. Fusion of Multispectral Data through Illumination-Aware Deep Neural Networks for Pedestrian Detection. Inf. Fusion 2019, 50, 148–157. Available online: https://www.sciencedirect.com/science/article/pii/S1566253517308138 (accessed on 12 August 2023). [CrossRef]
Li, C.; Song, D.; Tong, R.; Tang, M. Multispectral pedestrian detection via simultaneous detection and segmentation. arXiv 2018, arXiv:1808.04818. [Google Scholar]
Xie, Y.; Zhang, L.; Yu, X.; Xie, W. YOLO-MS: Multispectral Object Detection via Feature Interaction and Self-Attention Guided Fusion. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 2132–2143. [Google Scholar] [CrossRef]
Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Guided Attentive Feature Fusion for Multispectral Pedestrian Detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 72–80. [Google Scholar] [CrossRef]
Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-Attention Networks. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 2735–2745. [Google Scholar] [CrossRef]
Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar] [CrossRef]
Yang, C.; An, Z.; Zhu, H.; Hu, X.; Xu, K.; Li, C.; Diao, B.; Xu, Y. Gated Convolutional Networks with Hybrid Connectivity for Image Classification. arXiv 2019, arXiv:1908.09699. [Google Scholar] [CrossRef]
Fang, Q.; Han, D.; Wang, Z. Cross-Modality Fusion Transformer for Multispectral Object Detection. arXiv 2021, arXiv:2111.00273. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
Zhang, H.; Fromont, E.; Lef, S.; Avignon, B. Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 276–280. [Google Scholar]
Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-Based RGB-Infrared Cross-Modality Vehicle Detection Via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Deep Active Learning from Multispectral Data Through Cross-Modality Prediction Inconsistency. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 449–453. [Google Scholar] [CrossRef]
Yu, Y.; Da, F. Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13354–13363. [Google Scholar]
Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2844–2853. [Google Scholar] [CrossRef]
Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-Infrared Object Detection by Reducing Cross-Modality Redundancy. Remote Sens. 2022, 14, 2020. [Google Scholar] [CrossRef]
Zhang, L.; Liu, Z.; Zhang, S.; Yang, X.; Qiao, H.; Huang, K.; Hussain, A. Crossmodality interactive attention network for multispectral pedestrian detection. Inf. Fusion 2019, 50, 20–29. [Google Scholar] [CrossRef]
Yuan, M.; Wang, Y.; Wei, X. Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection. arXiv 2022, arXiv:2209.13801. [Google Scholar]

Figure 1. The overall scheme of the proposed method. The input arrows for each cube are the variables involved in the calculation not the fixed calculation method. The green, orange, and yellow in the figure mark the calculation process for optical, infrared, and fusion features, respectively, as in the following text.

Figure 2. The main structure diagram of DAAN. The area highlighted by the red line can be seen as a loop structure. The part of the figure highlighted by the red line is a loop structure.

Figure 3. (a) DISM structure diagram; (b) SISM structure diagram. In the figure, C refers to the concatenation operation, the plus sign refers to the addition operation of the elements, and 1× 1 (3 × 3) refers to the convolution operation with a kernel size of 1 (3).

Figure 4. The main structure diagram of FFPN. The symbols in the figure are the same as described above.

Figure 5. Our method detects results on some typical images from the FLIR dataset. The prediction box is marked as orange-red.

Figure 6. Our method detects results on some typical images in the DroneVehicle dataset. The prediction box for the baseline method is marked in green, and the prediction box for the present method is marked in red.

Table 1. These are the performances of various advanced methods on the FLIR-aligned dataset, and the modal column indicates which mode the algorithm utilizes to achieve target detection.

Method	Modal	mAP
Faster RCNN [25]	R/I	63.60\75.30%
HalfwayFusion [23]	R + I	71.17%
MBNet [7]	R + I	71.30%
DALFusion [26]	R + I	72.11%
CFR [23]	R + I	72.39%
GAFF [17]	R + I	73.80%
YOLO-MS [16]	R + I	75.20%
CFT [21]	R + I	77.63%
MFF-YOLOv5 [12]	R + I	78.20%
Baseline	R + I	68.20%
Our method	R + I	78.73%

Table 2. The ablation experiments of DAAB and FFPN on FLIR-aligned datasets presented in this paper.

Method	mAP	FLOPs	MAC	Runtime
baseline	68.21%	186.98 G	64.84 M	0.045 s
base + DAAB	74.82%	281.42 G	131.71 M	0.058 s
base + FFPN	73.50%	291.99 G	109.45 M	0.061 s
our	78.73%	386.43 G	176.29 M	0.068 s

Table 3. These are the performances of various advanced methods on the FLIR dataset, and the modal field indicates which mode the algorithm utilizes to achieve target detection.

Module	Loop Position				mAP
Module	Stage1	Stage2	Stage3	Stage4	mAP
DISM: D SISM: S	×	×	×	×	68.20%
	D	S	×	×	72.22%
	×	×	D	S	74.21%
	D	×	×	S	73.94%
	×	×	×	D, S	72.15%
	D, S	×	×	×	71.82%
	D, S	D, S	D, S	D, S	73.76%
	D	S	D	S	74.83%

Table 4. These are the performance of various high-level methods on DroneVehicle datasets, and the modal columns indicate which mode the algorithm uses to achieve target detection.

Method	Modal	mAP
Faster RCNN [25]	R/I	54.06\60.27%
PSC [27]	R/I	56.23\63.69%
RoITransformer [28]	R/I	61.55\65.47%
UA-CMDet [24]	R + I	64.01%
RISNet [29]	R + I	66.40%
Halfway Fusion (OBB) [9]	R + I	68.19%
CIAN(OBB) [30]	R + I	70.23%
AR-CNN(OBB) [11]	R + I	71.58%
TSFADet [31]	R + I	73.06%
Cascade-TSFADet [31]	R + I	73.90%
Our method	R + I	75.17%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, J.; Su, N.; Zhao, C.; Yan, Y.; Feng, S. Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network. Remote Sens. 2024, 16, 3904. https://doi.org/10.3390/rs16203904

AMA Style

Wang J, Su N, Zhao C, Yan Y, Feng S. Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network. Remote Sensing. 2024; 16(20):3904. https://doi.org/10.3390/rs16203904

Chicago/Turabian Style

Wang, Jinpeng, Nan Su, Chunhui Zhao, Yiming Yan, and Shou Feng. 2024. "Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network" Remote Sensing 16, no. 20: 3904. https://doi.org/10.3390/rs16203904

APA Style

Wang, J., Su, N., Zhao, C., Yan, Y., & Feng, S. (2024). Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network. Remote Sensing, 16(20), 3904. https://doi.org/10.3390/rs16203904

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network

Abstract

1. Introduction

2. Method

2.1. Overall Network Architecture

2.2. Dual-Branch Asymmetric Attention Backbone Network

2.2.1. Detail Information Supplement Module

2.2.2. Semantic Information Supplement Module

2.3. Feature Fusion Pyramid Network

3. Results

3.1. Dataset Introduction

3.1.1. FLIR Dataset

3.1.2. DroneVehicle Dataset

3.2. Implementation Details

3.3. Evaluation Metrics

3.4. Analysis of Results

3.4.1. Experiments on the FLIR-Aligned Dataset

3.4.2. Experiments on the DroneVehicle Dataset

4. Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI