Object Detection in Multispectral Remote Sensing Images Based on Cross-Modal Cross-Attention

In complex environments, a single visible image is not sufficient to perceive the scene. This paper proposes a novel dual-stream real-time detector for target detection in extreme environments such as nighttime and fog, which efficiently utilises both visible and infrared images to achieve fast all-weather environment sensing (FAWDet). Firstly, to allow the network to process information from different modalities simultaneously, this paper extends the state-of-the-art end-to-end detector YOLOv8 by expanding its backbone into two parallel streams. Then, to avoid information loss as the network deepens, a cross-modal feature enhancement module is designed, which enhances the features of each modality through cross-modal attention mechanisms and improves the detection of small targets. In addition, to address the significant differences between modal features, this paper proposes a three-stage fusion strategy that optimises feature integration by fusing along the spatial, channel and overall dimensions. The cross-modal feature fusion module is trained end-to-end. Extensive experiments on two datasets validate that the proposed method achieves state-of-the-art performance in detecting small targets. The proposed cross-modal real-time detector not only demonstrates excellent stability and robust detection performance, but also provides a new solution for target detection in extreme environments.


Introduction
Object detection based on deep learning is one of the key techniques in the cross-application of machine vision and remote sensing technology [1,2]. Object detection in remote sensing images uses images acquired from satellites, drones and similar platforms to identify, classify and monitor features on the Earth's surface. These techniques have a wide range of applications in fields such as the military, agriculture, geological exploration, urban planning, and environmental monitoring [3].
Traditional target detection techniques [4][5][6] mainly use a sliding window over the image: they first identify candidate regions, then extract the relevant features and classify them using a support vector machine. These techniques suffer from high computational complexity and low adaptability, and with the development of deep learning they have been surpassed by deep learning-based target detection algorithms. Deep learning-based algorithms mainly deal with natural images and can be divided into two-stage and one-stage detection algorithms [7,8]. Two-stage detectors mainly include the Region-based Convolutional Neural Networks (RCNN) family [9][10][11][12], which divides detection into two stages: localisation and recognition. One-stage detection algorithms include Single Shot MultiBox Detector (SSD) [13], RetinaNet [14], and the You Only Look Once (YOLO) series [15][16][17][18]. One-stage detectors use regression to achieve target recognition and localisation, which removes the region proposal step of two-stage detectors and gives faster speed while maintaining high accuracy. However, these methods are designed for a single modality, and RGB images are susceptible to harsh conditions such as low-light scenes or fog, leading to poor detection by these algorithms.
This drawback can be overcome by introducing additional target information from another imaging modality [19,20]. Considering the robustness of IR cameras to illumination and weather changes, we additionally introduce thermal infrared (IR) imagery. IR images measure the temperature of the detected target, thus avoiding the effects of low illumination and foggy weather on detection accuracy [21,22]. When the target is in a scene with insufficient visible light, the target features cannot be extracted from the RGB image, while the corresponding IR image can still provide effective information about the object. Zhang et al. [23] designed QFDet to detect small figures in aerial imagery more efficiently by utilising the feature information from both RGB and thermal IR images. An et al. [24] proposed ECISNet, which improves detection accuracy by enhancing the feature representation capability between the RGB and thermal infrared modalities. Fusing the complementary RGB and thermal infrared modalities can further improve the perceptual capability and robustness of target detection algorithms. Meanwhile, challenging multispectral datasets such as DroneVehicle [25], VEDAI [26] and LLVIP [27] continue to promote the development of multispectral target detection.
In order to better solve the problem of environmental perception in complex environments, this paper proposes a fast all-weather detection algorithm that can simultaneously process complementary visible image information and infrared image information to achieve accurate environmental perception. Specifically, this paper makes the following contributions:
1. A dual-stream real-time detector is proposed to perform target detection using both visible and infrared images, with stable detection performance in extreme environments such as night and fog.
2. The process of network deepening inevitably brings information loss. In this paper, the features of each modality are filtered and enhanced by cross-modal attention, avoiding the information loss in the process of network deepening and improving the detection effect of the detector on weak targets.
3. The features of different modalities often possess large differences, and one-time fusion does not mix them well. In this paper, a three-stage fusion strategy is designed to fuse features of different modalities from three different perspectives: spatial, channel, and overall. It is worth noting that the cross-modal feature fusion module is end-to-end during training.
4. Extensive experiments on two datasets show that the method in this paper achieves SOTA performance in the detection of remotely sensed objects.
Figure 1 shows the overall architecture of the algorithm in this paper. This paper is organised as follows. In Section 2, we introduce the related work. Section 3 presents our method. Section 4 contains experiments. Finally, we conclude in Section 5.

Single Source Remote Sensing Object Detection
Traditional object detection algorithms rely on manual feature extraction. These algorithms are limited in terms of detection efficiency, detection accuracy, and equipment deployment, making them unsuitable for cross-application with remote sensing equipment. Most deep learning-based object detection algorithms use deep convolutional neural networks (DCNNs). For processing remote sensing images, Jiang et al. [28] proposed an optimised deep neural network in which a dual feature map extraction strategy and an interleaved localisation strategy were used to optimise the detection of small or narrow rectangular objects. Haroon et al. [29] proposed an adaptive single-pass deep multiscale target detection framework for detecting objects of multiple sizes and different classes in remotely sensed images. Gao et al. [30] proposed a two-stage model for vehicle detection using Fully Convolutional One-Stage Object Detection (FCOS), which is designed with a two-stage positive and negative sample mechanism and a two-step classification model. With the development of regression-based target detection technology, the YOLO series of algorithms has found a wide range of applications [31]. However, directly applying the original YOLO algorithms to remote sensing leads to low detection accuracy and large model size, so aerial target detection networks based on the YOLO architecture continue to be proposed. Ma et al. [32] proposed Light-YOLOv4 to address object detection on edge devices. Light-YOLOv4 applies a series of sparse training, pruning, knowledge distillation, and quantisation operations, which makes it more suitable for deployment on remote sensing platforms. Liu et al. [33] proposed CCH-YOLOX to solve the problems caused by dense object distribution and scale variation in aerial images. Deng et al. [34] proposed LAI-YOLOv5s, which improves detection efficiency by combining a Deep Feature Map Cross Path Fusion Network for feature fusion and a VoVNet module for enhanced feature extraction within the YOLOv5 architecture. Zhang et al. [35] proposed Vit-YOLO, which integrates multi-head self-attention blocks and BiFPN modules to enhance the detection of small objects. Hui et al. [36] proposed DSAA-YOLO, which uses the Super Resolution Data Augment (SRDA) strategy to enrich the dataset while maintaining data quality, designs the Dense Residual-based Super-Resolution (DRSR) module and the Information Alignment Feature Enhancement (IAFE) module to extract the original features of objects in remotely sensed images at higher quality, and finally designs the Multi-Object Golf Dynamic Anchor (MGDA) strategy to enhance effective target feature extraction and generate more accurate bounding boxes, which effectively improves detection accuracy. It can be seen that targeted network design effectively improves target detection efficiency under the condition of sufficient visible light.

Multimodal Remote Sensing Object Detection
The performance of object detection in remote sensing images can be further improved by combining multimodal technology with remote sensing technology. Fusing the information of multiple modalities is the core problem of target detection in multispectral remote sensing images, and multispectral fusion methods are categorised into three forms according to the stage at which fusion occurs [37]: pixel-level fusion, feature-level fusion and decision-level fusion. Pixel-level fusion fuses the modalities at the primary stage. It has a low fusion cost but is sensitive to noise, and because it fuses only at the pixel level it may not effectively utilise the high-level features and semantic information of the remote sensing image. Decision-level fusion fuses the detection results at the final stage. It can utilise the final detection results of each modality and effectively reduce the influence of low-level noise, but it requires effective decision rules to be formulated in advance and occupies a large amount of hardware resources due to the repeated computation on the different modal branches. Aerial image detection networks based on cross-modal fusion mainly use feature-level fusion, because it achieves a better balance between preserving detailed information and providing high-level semantic information, and is therefore more applicable to multimodal remote sensing image fusion. Feature-level fusion first feeds the images of the different modalities into parallel branches, which extract features from each modality independently, and then combines these features through attention or concatenation operations. Several works have conducted in-depth studies on RGB-IR multimodal object detection. Sun et al. [25] used both feature-level fusion and decision-level fusion. They constructed UA-CMDet, which reduces the detection bias caused by high-uncertainty targets by fusing the information from the visible and infrared modalities and quantifying the uncertainty of different targets using illumination estimation, thereby achieving vehicle detection in extreme scenarios. Fang et al. [38] proposed an effective cross-modal feature fusion method based on the self-attention mechanism, which improves the performance of multispectral target detection in remote sensing images by making full use of the information in the different modalities. Bao et al. [39] proposed Dual-YOLO, which improves the performance of multispectral target detection in remote sensing images by designing an Attention Fusion Module and a Fusion Shuffle Module to efficiently process and integrate features from both image types, and by introducing a Fusion Loss that accelerates network convergence during training by optimising the integration of infrared and visible features. Existing feature-level fusion methods do not deeply explore the characteristics of the information shared between modalities, and their backbone networks do not extract the features well. This study starts from the correlation and differences between modalities: it designs a feature enhancement module to enhance the features of each modality and a feature fusion module to fuse them, and it extends a state-of-the-art single-stream detection algorithm, achieving state-of-the-art multispectral detection performance.

Algorithm Overview
The YOLO algorithm has been widely used due to its high detection accuracy and very fast detection speed. In this paper, YOLOv8, which has the best performance, is selected as the baseline for expansion. The overall structure of YOLOv8 consists of a backbone, a neck and a detection head. The backbone is mainly used for feature extraction and consists of the CBS module, the C2f module and the SPPF module; the CBS module performs a convolution on the input, followed by batch normalisation, and finally applies the SiLU activation to obtain the output. The neck is inspired by PANet [40] and adopts the PAN-FPN structure; it fuses the features extracted from the backbone at different scales. The detection head processes the fused features to produce the final detection results.
In this process, a single backbone can extract the features of only one modality and cannot handle the features of the other. To allow a detector to process the features of different modalities at the same time, early researchers fused the images of the different modalities before feeding them into the detector, which does not take full advantage of the complementary modalities. Therefore, this paper extends the backbone into a dual-stream feature extraction backbone, and the overall structure is shown in Figure 2. Firstly, we connect two identical backbones in parallel. Then, the cross-modal feature enhancement module (FEM) is embedded between the corresponding layers of the two backbones to enhance the features of the different modalities. Finally, the cross-modal feature fusion module (FFM) is designed to fuse the features coming from the same stages of the two backbones. After feature extraction by the dual-stream backbone, the enhanced and fused features are fed into the neck network to fuse the features at different scales, and finally the fused features are fed into the detection head for regression prediction to obtain the final detection results.
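The data flow described above can be sketched in a few lines. This is only an illustrative skeleton: the function name `dual_stream_backbone` and all callables passed to it are hypothetical stand-ins, and it assumes that the FEM-enhanced features are what each stream passes on to its next stage and that an FFM is attached to every listed stage, neither of which is stated explicitly in the text.

```python
def dual_stream_backbone(vis, ir, vis_stages, ir_stages, fems, ffms):
    """Run two parallel backbones with per-stage enhancement and fusion.

    All arguments after `ir` are lists of hypothetical callables:
      vis_stages[i], ir_stages[i]: one backbone stage per modality
      fems[i]: (vis_feat, ir_feat) -> (enhanced_vis, enhanced_ir)
      ffms[i]: (vis_feat, ir_feat) -> fused feature handed to the neck
    """
    fused_outputs = []
    for vis_stage, ir_stage, fem, ffm in zip(vis_stages, ir_stages, fems, ffms):
        vis = vis_stage(vis)            # visible stream
        ir = ir_stage(ir)               # infrared stream
        vis, ir = fem(vis, ir)          # assumption: enhanced features feed the next stage
        fused_outputs.append(ffm(vis, ir))
    return fused_outputs                # multi-scale inputs for the neck
```

With toy stand-ins (doubling stages, identity enhancement, additive fusion) and scalar inputs 1 and 2, the two stages yield fused values 6 and 12, which makes the stage-by-stage flow easy to trace.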

Single-Modal Information Processing Module
The unimodal information processing module aims to efficiently process visible modal features along with infrared modal features, specifically using the C2f module which uses gradient shunt connections to enrich the information flow of the network.The feature maps are pooled using the SPPF module to achieve adaptively sized outputs.

C2f Module
The C2f module plays a key role in feature extraction and information flow optimisation; it is designed as an improvement of the Cross Stage Partial (CSP) [16] structure. The C2f module not only improves the efficiency of feature utilisation, but also increases the network's ability to process high-dimensional information while keeping the model lightweight. Specifically, the C2f module can be described as follows. The input feature $x$ is first passed through the convolutional layer $\mathrm{Conv}_1$, and the output is split into two parts along the channel dimension, denoted $\mathrm{split}_1(x)$ and $\mathrm{split}_2(x)$. Then $\mathrm{split}_2(x)$ is fed into a series of Bottleneck layers, each with batch normalisation, where $y$ denotes the input feature of a Bottleneck, BN denotes batch normalisation, and Conv denotes a convolution with a kernel size of 3. The outputs of the Bottlenecks are summed and then merged with $\mathrm{split}_1(x)$ along the channel dimension, where Concat denotes feature merging along the channel. Finally, the merged result is processed by the convolutional layer $\mathrm{Conv}_2$, which reduces the feature dimension using a convolution with a kernel size of 1.

SPPF Module
SPPF is an optimised form of spatial pyramid pooling (SPP) [41] that extracts multi-scale spatial features from the input feature maps while maintaining high computational efficiency. SPPF first passes the input feature map through a convolutional layer to reduce the channel dimensionality of the data while keeping the spatial dimensions of the feature map constant: $x' = \mathrm{Conv}(x)$, where Conv denotes a convolution with a kernel size of 1. Next, maximum pooling is applied successively to the feature map produced by the first convolutional layer, with the aim of enhancing the model's ability to perceive features at different scales without changing the spatial dimensions of the feature maps: $y_1 = \mathrm{Maxpool}(x')$, $y_2 = \mathrm{Maxpool}(y_1)$, $y_3 = \mathrm{Maxpool}(y_2)$, where $\mathrm{Maxpool}(\cdot)$ denotes maximum pooling. The feature map $x'$ is then concatenated with the three pooled feature maps $y_1$, $y_2$, $y_3$ along the channel dimension. Finally, the channel dimension of the feature map is adjusted by another convolutional layer with a kernel size of 1, so that the output $z$ fuses the features at each scale: $z = \mathrm{Conv}(\mathrm{Concat}(x', y_1, y_2, y_3))$ (8)
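As a rough illustration of the SPPF pooling chain, the following NumPy sketch applies a 1×1 convolution, three successive stride-1 max poolings, a channel concatenation and a final 1×1 convolution. The helper names, the pooling kernel size of 5 and the plain weight matrices are assumptions made for the example; batch normalisation and activation are omitted for brevity.

```python
import numpy as np

def maxpool2d_same(x, k=5):
    """Stride-1 max pooling, padded so H and W are unchanged. x: (C, H, W)."""
    x = np.asarray(x, dtype=float)
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    _, h, w = x.shape
    out = np.full_like(x, -np.inf)
    for dy in range(k):                      # slide over the k x k window offsets
        for dx in range(k):
            out = np.maximum(out, xp[:, dy:dy + h, dx:dx + w])
    return out

def conv1x1(x, weight):
    """1x1 convolution as pure channel mixing. weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def sppf(x, w_in, w_out, k=5):
    """x' = Conv(x); y1..y3 are chained poolings; z = Conv(Concat(x', y1, y2, y3))."""
    x1 = conv1x1(x, w_in)                    # reduce channels, keep H x W
    y1 = maxpool2d_same(x1, k)
    y2 = maxpool2d_same(y1, k)               # pooling a pooled map widens the receptive field
    y3 = maxpool2d_same(y2, k)
    return conv1x1(np.concatenate([x1, y1, y2, y3], axis=0), w_out)
```

Because the poolings are chained, `w_out` must accept four times the channel count produced by `w_in`, while the spatial size of the output matches that of the input.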

Cross-Modal Information Processing Module
The cross-modal information processing module aims to process visible modal features and infrared modal features at the same time, and we designed FEM and FFM to reduce the loss of effective feature information and improve the robustness of the model.

Cross-Modal Feature Enhancement Module
The features of different modalities can be divided into a differential-mode part and a common-mode part. During the forward propagation of multiple modalities, the loss of information in the common-mode part of one modality can be supplemented by the other modality, but the loss of differential-mode information is fatal. Therefore, a cross-modal feature enhancement module is designed in this paper to enhance the differential-mode part of the different modal features and to counter potential feature loss as the network deepens. Figure 3 presents the overall structure of the cross-modal feature enhancement module. Firstly, the differential-mode part of the infrared and visible modal features is extracted as $F_d = |F_{vis} - F_{inf}|$, where $F_d$ denotes the differential-mode part of the two modal features, $F_{vis}$ denotes the visible image features, $F_{inf}$ denotes the infrared image features, and $|\cdot|$ denotes the absolute value.
Next, the spatial distribution of the differential-mode features within the visible and infrared features is estimated as a map $M$. Here $\sigma$ denotes the Sigmoid function and ReLU denotes the ReLU activation function. Conv denotes a convolutional layer with a kernel size of 3, a stride of 1 and a padding of 1, with 2 input channels and 1 output channel, which is used to learn the spatial distribution of the differential-mode features. Concat denotes concatenation along the channel dimension. Avepool and Maxpool denote average pooling and maximum pooling along the channel, respectively, and are used for the initial feature extraction of the differential-mode features. Then, using $M$, the features of the two modalities are spatially enhanced to reinforce the importance of the differential-mode features, giving $F^{e}_{vis}$ and $F^{e}_{inf}$, the visible and infrared features after spatial enhancement, where ⊛ denotes element-wise multiplication in space. The spatial enhancement is immediately followed by channel enhancement. Unlike the spatial enhancement, the channel enhancement learns a channel distribution vector $W$ of the differential-mode features in the two modal features, where FC denotes a fully connected layer and GAP denotes global average pooling. In this process, the spatially enhanced features of the two modalities are first blended, then the blended features are compressed into a one-dimensional vector using global average pooling, followed by squeeze and excitation of the vector using two fully connected layers, where the number of channels is first squeezed to 1/16 of the input and then enlarged by a factor of 8. Finally, a Sigmoid function constrains the values of the channel vector to between 0 and 1. After obtaining the channel distribution vector, the features of the two modalities are further enhanced to obtain the final outputs $F^{e'}_{vis}$ and $F^{e'}_{inf}$, the enhanced visible and infrared features, where ⊗ denotes element-wise multiplication along the channel dimension.
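The enhancement steps can be illustrated with a small NumPy sketch. Several details are assumptions rather than facts from the text: the order of operations for the spatial map (Sigmoid applied after ReLU after the convolution), enhancement by plain multiplication without a residual connection, blending by channel concatenation, and a ReLU between the two fully connected layers. Under the concatenation assumption the channel arithmetic is consistent: 2C channels squeezed to 2C/16 and enlarged by 8 give C channels, matching each modality. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv3x3_2to1(x, kernel):
    """3x3 convolution, stride 1, padding 1, 2 input channels -> 1 output map."""
    _, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("c,chw->hw", kernel[:, dy, dx], xp[:, dy:dy + h, dx:dx + w])
    return out

def fem(f_vis, f_inf, kernel, fc1, fc2):
    """f_vis, f_inf: (C, H, W); kernel: (2, 3, 3); fc1: (2C/16, 2C); fc2: (C, 2C/16)."""
    f_d = np.abs(f_vis - f_inf)                              # differential-mode part
    pooled = np.stack([f_d.mean(axis=0), f_d.max(axis=0)])   # channel-wise avg and max pooling
    m = sigmoid(np.maximum(conv3x3_2to1(pooled, kernel), 0.0))  # assumed order: conv, ReLU, Sigmoid
    vis_s, inf_s = f_vis * m, f_inf * m                      # spatial enhancement
    gap = np.concatenate([vis_s, inf_s], axis=0).mean(axis=(1, 2))  # blend, then GAP -> (2C,)
    hidden = np.maximum(fc1 @ gap, 0.0)                      # squeeze to 2C/16 (ReLU is assumed)
    w = sigmoid(fc2 @ hidden)                                # enlarge by 8 -> (C,), values in (0, 1)
    return vis_s * w[:, None, None], inf_s * w[:, None, None], m, w
```

One consequence of this formulation is that identical visible and infrared inputs give a zero differential map, so the spatial map is constant at 0.5 and no location is favoured.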

Cross-Modal Feature Fusion Module
Features from different modalities possess different characteristics, and simple summation or channel concatenation cannot fuse them well. Therefore, in this paper, the cross-modal feature fusion module (FFM) is designed to fuse the features extracted from the two backbones at different stages, and Figure 4 shows the overall architecture of the FFM. Firstly, the initial fusion of the two modal features is performed by concatenation along the channel dimension. The initially fused features are then compressed into a vector $W_1$ using a global average pooling layer. Next, this vector is processed by two different squeeze-excitation branches to obtain $W_{vis}$ and $W_{inf}$, the channel weight vectors for the fusion of visible and infrared features, respectively; it is worth noting that the weights of the two squeeze-excitation branches are not shared. Then, the first fusion of the features from the two modalities is performed using these weights, giving $Fused_1$, the features after the first fusion. Unlike in the feature enhancement part, a spatial bias of the fused features towards either modality would lose information from the other modality, so the visible and infrared features should share the same spatial importance distribution. Therefore, in this paper we use the features after the first fusion to learn the spatial weight map $M$ used for fusion. Finally, both the spatial weight map and the channel weight vectors are used to obtain the final fusion feature $Fused$, where the role of Conv is to adjust the number of channels of the feature so that the fused features can be fed into the neck for multi-scale feature fusion.
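The channel-weighting stage of the fusion can be sketched as follows. This covers only the first fusion and rests on assumptions not confirmed by the text: each squeeze-excitation branch is modelled as two fully connected layers with a ReLU in between and a Sigmoid at the end, and the first fusion is taken to be a channel-weighted sum of the two modalities. The names `se_branch` and `ffm_channel_fusion` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_branch(v, fc1, fc2):
    """Squeeze-excitation on a vector: FC -> ReLU -> FC -> Sigmoid (assumed layout)."""
    return sigmoid(fc2 @ np.maximum(fc1 @ v, 0.0))

def ffm_channel_fusion(f_vis, f_inf, vis_branch, inf_branch):
    """f_vis, f_inf: (C, H, W); each branch is a (fc1, fc2) pair with its own weights."""
    w1 = np.concatenate([f_vis, f_inf], axis=0).mean(axis=(1, 2))  # concat, then GAP -> (2C,)
    w_vis = se_branch(w1, *vis_branch)      # channel weights for the visible features
    w_inf = se_branch(w1, *inf_branch)      # separate, unshared branch for infrared
    # Assumption: the first fusion is a channel-weighted sum of the two modalities.
    fused1 = w_vis[:, None, None] * f_vis + w_inf[:, None, None] * f_inf
    return fused1, w_vis, w_inf
```

Keeping the two branches unshared lets the module weight the same channel differently for each modality, for example favouring infrared channels at night.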

Loss Function
The detection head of the network uses a decoupled structure with two separate branches for target classification and bounding box regression, respectively. The classification task uses binary cross-entropy loss (BCE Loss), and the bounding box regression task uses distribution focal loss (DFL) [42] and CIoU [43].

Binary Cross-Entropy Loss
Binary Cross-Entropy Loss (BCE Loss) is a loss function for category classification. It is designed so that the model incurs a low loss when the prediction is correct and a high loss when the prediction is incorrect, driving the model to optimise in the direction of correct prediction. Its mathematical expression is $L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$, where $N$ is the sample size, $y_i$ is the true label of the $i$-th sample, and $p_i$ is the probability that the model predicts this sample to be a positive class. When the true label $y_i$ is 1, the loss function focuses on $\log(p_i)$, the logarithm of the probability that the model predicts a positive class. When the true label $y_i$ is 0, the loss function focuses on $\log(1 - p_i)$, the logarithm of the probability that the model predicts a negative class.
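As a small worked example, the binary cross-entropy described above, i.e. the negative mean over samples of $y_i \log(p_i) + (1 - y_i)\log(1 - p_i)$, can be computed in plain Python. The function name and the clamping constant `eps` are illustrative choices, added only to keep the logarithm finite.

```python
import math

def bce_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy over paired labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)      # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

For instance, predicting 0.5 for every sample gives a loss of log 2 regardless of the labels, while confident correct predictions drive the loss towards zero and confident wrong ones make it large.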

Bounding Box Regression Loss
Targets in remote sensing images usually exist in complex scenes, resulting in ambiguity and uncertainty in the true bounding box of the target. In this paper, we use Distribution Focal Loss (DFL) [42] together with CIoU [43] as the bounding box regression loss. DFL makes the model focus more on samples that perform poorly on the probability distribution by taking into account the difference between the probability distribution predicted by the model and that of the true labels. The formula for DFL is $\mathrm{DFL}(S_i, S_{i+1}) = -\left((y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1})\right)$, where $S_i$ and $S_{i+1}$ denote the probabilities of two consecutive positions predicted by the model, $y_i$ and $y_{i+1}$ denote two consecutive interval points in the discretised bounding box coordinates, and $y$ is the actual bounding box label position. CIoU enables the detection model to adjust the predicted boxes more accurately, not only maximising their overlap with the real boxes but also matching positions precisely and keeping shapes consistent, which improves the overall performance and accuracy of the detection. The CIoU loss is $L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$, where $\rho(b, b^{gt})$ is the distance between the centres of the predicted and real boxes, $c$ is the diagonal length of the smallest box enclosing both, and $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$, where $w$, $h$ and $w^{gt}$, $h^{gt}$ are the width and height of the predicted and real boxes, respectively. $\alpha$ is a weight parameter used to balance the effect of the aspect ratio, and it depends on the value of $v$.
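The two regression terms can be made concrete with a short sketch. It uses the standard published forms of DFL and CIoU, with the usual weight α = v / ((1 − IoU) + v); the corner-format boxes `(x1, y1, x2, y2)` and the function names are assumptions made for this example rather than details from the paper.

```python
import math

def dfl(s_i, s_i1, y_i, y_i1, y):
    """Distribution focal loss for a label y lying between grid points y_i and y_i1."""
    return -((y_i1 - y) * math.log(s_i) + (y - y_i) * math.log(s_i1))

def ciou_loss(box, gt):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with positive width and height."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1
    # Intersection over union
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    iou = inter / (w * h + gw * gh - inter)
    # Squared centre distance over squared diagonal of the enclosing box
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2
    # Aspect-ratio consistency term and its weight
    v = (4.0 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v
```

A perfectly matching box gives a CIoU loss of zero, while disjoint boxes are still penalised through the centre-distance term, which is what provides a useful gradient when there is no overlap.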

Experiment
To test the performance of the FAWDet proposed in this paper, we use the public datasets DroneVehicle [25] and VEDAI [26].

Datasets

DroneVehicle Dataset
The DroneVehicle dataset is a large-scale RGB-IR cross-modal target detection dataset captured by UAVs. It covers a wide range of scenarios from daytime to nighttime, such as urban roads, residential areas, and car parks, and consists of 28,439 RGB and infrared image pairs with annotations for 953,087 object instances. To help overcome problems such as the poor performance of RGB images in low-light conditions and the noise problem in infrared images caused by the lack of colour information, the DroneVehicle dataset provides a large number of cross-modal image pairs, offering an experimental basis for the study of cross-modal feature fusion, uncertainty management and target detection algorithms. Figure 5 shows information related to the labelling of objects in the DroneVehicle dataset.

VEDAI Dataset
The VEDAI dataset is designed for target recognition of small vehicles in aerial imagery and contains multi-spectral and multi-resolution images to simulate the complex environments of the real world. The VEDAI dataset consists of diverse backgrounds, such as urban roads and natural landscapes, with vehicle targets of varying orientation, some of which suffer from occlusion and specular reflection, providing a rich set of challenges for algorithm development. The VEDAI dataset is designed not only to enhance the understanding and application of small target detection techniques, but also to facilitate the advancement of related computer vision techniques in the field of aerial surveillance and reconnaissance. Figure 5 shows the information related to the labelling of objects in the VEDAI dataset.

Implementation Details
Training a neural network requires a large amount of computing power. In this study, we used a platform configured with an Intel Xeon-2690v4 CPU and an NVIDIA TESLA P100 GPU with 16 GB of dedicated video memory for the experiments. We use YOLOv8 as the main framework. The entire network is trained for 200 epochs with the weight decay set to 0.0005, the momentum set to 0.937, and the batch size set to 8. Training is performed at a resolution of 640 on the DroneVehicle dataset and 1024 on the VEDAI dataset, and mosaic augmentation is used throughout training, which greatly enriches the training data and increases the model's ability to handle complex scenes. The experimental environment and parameter settings are shown in Table 1.

Evaluation Indicators
In this study we used Precision (P), Recall (R), mAP0.5, mAP0.75, and mAP0.5:0.95 to evaluate the model. P represents the proportion of samples predicted to be positive that are truly positive, and a higher value of P means that the model is more accurate in its prediction of the positive class. R represents the proportion of truly positive samples that are correctly predicted to be positive, and an increase in R means that the model is better able to capture the positive samples. The mAP reflects the overall accuracy of the model across multiple categories and is one of the main evaluation indexes used in target detection tasks; an increase in mAP means that the model's detection performance across the categories has improved. mAP0.5 is the mAP value when the IoU (intersection over union) threshold is set to 0.5, and mAP0.75 correspondingly uses a threshold of 0.75. mAP0.5:0.95 is a more stringent index: it calculates the average mAP over IoU thresholds from 0.5 to 0.95 (with an interval of 0.05), which can more accurately evaluate the comprehensive performance of the model under different IoUs. The following formulas show how the different indicators are calculated: $\mathrm{Precision} = \frac{TP}{TP + FP}$ (22), $\mathrm{Recall} = \frac{TP}{TP + FN}$ (23), where True Positives (TP) represent the number of positive sample detection boxes correctly predicted, False Positives (FP) represent the number of negative samples incorrectly predicted as positive, False Negatives (FN) represent the number of positive samples incorrectly predicted as negative, and True Negatives (TN) represent the number of negative samples correctly predicted to be negative.
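A minimal sketch of these indicators, using hypothetical helper names, shows how precision and recall follow from the counts and how mAP0.5:0.95 averages over ten IoU thresholds. Computing the per-threshold mAP values themselves (matching boxes and integrating precision-recall curves) is outside the scope of this sketch and is assumed to be done elsewhere.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of truly positive samples that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP0.5:0.95
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def map_50_95(map_at_threshold):
    """Average mAP over the ten thresholds.

    map_at_threshold: dict mapping each IoU threshold to the mAP computed at it.
    """
    return sum(map_at_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)
```

For example, 8 true positives with 2 false positives give a precision of 0.8, and the same 8 true positives with 8 missed targets give a recall of 0.5.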

Ablation Experiment
In this section, we conduct a series of ablation experiments to analyse the contribution of each component of the proposed network. YOLOv8 is used as the base network, and the improvements are introduced one at a time: the cross-modal feature fusion module (FFM) and the cross-modal feature enhancement module (FEM). The model configurations are compared in terms of precision (P), recall (R), and mean average precision (mAP) at different IoU thresholds, computed over all dataset categories. M1 and M2 denote YOLOv8 applied to visible images only and to infrared images only, respectively, and M3 denotes our dual-stream YOLOv8. M4 and M5 denote the dual-stream YOLOv8 with FFM and with FEM, respectively, and M6 denotes the dual-stream YOLOv8 with both FFM and FEM. For both datasets, the mean values of the five metrics P, R, mAP0.5, mAP0.75, and mAP0.5:0.95 were quantitatively analysed. Red, blue and green indicate the best, second-best and third-best values, respectively. The ablation results on the DroneVehicle and VEDAI datasets are shown in Tables 2 and 3, respectively.

On DroneVehicle, the dual-stream YOLOv8 (M3) outperforms the single-stream models (M1 and M2): mAP0.5 rises from 0.717 (M1) and 0.804 (M2) to 0.825, showing that the dual-stream architecture effectively integrates visible and infrared image data. Introducing FFM further improves mAP0.5 from 0.825 to 0.839 (M4), indicating that FFM facilitates information interaction between the modalities and thus improves the robustness of the model. Adding FEM raises the precision from 0.834 (M3) to 0.931 (M5), which shows that FEM enhances the model's ability to recognise target details. When FFM and FEM are combined in the dual-stream YOLOv8 (M6), the model maintains a high precision (0.840), reaches a recall of 0.796, and achieves the best performance on the main metrics mAP0.5, mAP0.75, and mAP0.5:0.95. This demonstrates the effectiveness of the cross-modal information processing modules proposed in this paper. To give a comprehensive picture of the model's performance, its metrics at different confidence levels are shown in Figure 6.

On VEDAI, the dual-stream YOLOv8 (M3) outperforms the single-modal configurations (M1 and M2) in precision (0.798), recall (0.677), and average precision at multiple IoU thresholds (0.697 for mAP0.5 and 0.429 for mAP0.5:0.95), confirming that the dual-stream architecture fuses visible and infrared image features effectively. With FFM, the dual-stream YOLOv8 (M4) reaches 0.698 on mAP0.5 and 0.437 on mAP0.5:0.95, which highlights the efficacy of FFM in feature fusion. With FEM, the dual-stream YOLOv8 (M5) remains relatively stable on the other metrics, although its mAP0.75 decreases slightly to 0.529, indicating the effectiveness of FEM in enhancing cross-modal feature processing. When both FFM and FEM are integrated (M6), the model performs optimally on all evaluation metrics, in particular reaching the highest mAP0.5:0.95 of 0.439. M6 also performs well on precision (0.799) and mAP0.5 (0.701). These results show that the synergy of feature fusion and feature enhancement improves the accuracy and robustness of target detection in complex multimodal scenes. Figure 7 shows how the metrics of our model vary with the confidence level on the VEDAI dataset.

Module (FEM), achieves the optimal performance in all categories, and also confirms its high efficiency and applicability in real-world applications.

To visualise the performance of our model, the confusion matrices it generates on the two datasets are presented in Figure 9. The horizontal axis in Figure 9 indicates the true category of each labelled box, the vertical axis indicates the category predicted by our method, and the value in each cell indicates the probability of occurrence of that combination. The data show that our method classifies each category correctly, with a low false alarm rate and high stability.
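The layout of the confusion matrices described above can be illustrated with a small sketch. It assumes that each column (true category) is normalised to sum to 1, which is a common plotting convention but is not stated in the text; the function name, the class names and the toy labels are hypothetical and serve only as an illustration.

```python
from collections import Counter

def confusion_matrix(true_labels, pred_labels, classes):
    """Column-normalised confusion matrix.

    Rows are predicted categories and columns are true categories,
    matching the axes described for Figure 9. Column normalisation is
    an assumption, not something the paper states.
    """
    counts = Counter(zip(pred_labels, true_labels))
    matrix = []
    for pred in classes:
        row = []
        for true in classes:
            # Total number of labelled boxes whose true category is `true`.
            col_total = sum(counts[(p, true)] for p in classes)
            row.append(counts[(pred, true)] / col_total if col_total else 0.0)
        matrix.append(row)
    return matrix

# Hypothetical toy data: "background" stands for missed or spurious boxes.
classes = ["car", "truck", "background"]
true_labels = ["car", "car", "car", "truck", "truck", "background"]
pred_labels = ["car", "car", "truck", "truck", "background", "car"]

cm = confusion_matrix(true_labels, pred_labels, classes)
for name, row in zip(classes, cm):
    print(f"{name:>10}: " + "  ".join(f"{value:.2f}" for value in row))
```

With this convention, the diagonal entries give the fraction of each true category that is classified correctly, and the off-diagonal entries in the same column show where the remaining boxes go.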

Comparative Experiments on the VEDAI Dataset
On the VEDAI dataset we selected 8 major categories for comparison, such as Car, Truck and Boat, and evaluated the same comparison models as above. The results of the comparison experiment on the VEDAI dataset are shown in Table 5. To show the detection effect of our method on VEDAI more intuitively, we again chose five different scenarios for an inference comparison. The inference results of the different models are shown in Figure 10, and the proposed method still obtains satisfactory results for target detection in complex environments.

The comparison experiments on VEDAI show that our model achieves 0.692 on mAP0.5, outperforming CMAFF (0.68) and CMT (0.679). On mAP0.5:0.95, our model scores 0.437, ahead of CMAFF (0.426) and CMT (0.409), demonstrating its stability and robustness in high-precision target detection. The gap between the single-stream and dual-stream models is somewhat smaller on VEDAI than on DroneVehicle, but our model still achieves the best performance on most categories. In terms of category-specific detection, our model achieves mAP0.5 of 0.593, 0.91 and 0.931 on the Truck, Pickup and Tractor categories, respectively, which highlights its ability to identify different vehicle types accurately. At 51 frames per second, our model is slightly slower than some single-stream models, but this frame rate still represents a practical balance between accuracy and real-time performance, considering the complexity of the data it processes and the high accuracy required. The confusion matrix in Figure 9 confirms the stability of our method on VEDAI.
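A frame rate such as the one quoted above is typically obtained by timing repeated inference calls. The following is a minimal sketch of such a measurement, not the authors' benchmarking procedure: `measure_fps`, the `infer` callable and the dummy input pairs are hypothetical stand-ins for a real dual-stream detector and its visible/infrared inputs.

```python
import time

def measure_fps(infer, inputs, warmup=5):
    """Return frames per second of `infer` over `inputs`.

    `infer` is any callable that processes one (visible, infrared) pair.
    A few warm-up calls are made first so that one-off start-up costs
    are not counted.
    """
    for sample in inputs[:warmup]:
        infer(sample)
    start = time.perf_counter()
    for sample in inputs:
        infer(sample)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(inputs) / max(elapsed, 1e-12)

# Hypothetical stand-in for a detector: does a little arithmetic per pair.
calls = []
def dummy_infer(pair):
    visible, infrared = pair
    calls.append(1)
    return sum(visible) + sum(infrared)

frames = [(list(range(1000)), list(range(1000))) for _ in range(20)]
fps = measure_fps(dummy_infer, frames)
print(f"{fps:.1f} frames per second")
```

On a GPU, execution is asynchronous, so a real measurement would also need to synchronise the device before reading the clock.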

Conclusions
In this paper, a dual-stream real-time detector is proposed that uses visible and infrared images simultaneously and improves the stability of detection in extreme environments such as night and fog. To counteract the information loss that occurs as the network deepens, this paper designs a cross-modal feature enhancement module (FEM), which improves the detection of weak targets by enhancing the differential features between the modalities. To address the differences between modal features, this paper further designs a cross-modal feature fusion module (FFM) with a three-stage fusion strategy that optimises feature fusion along three dimensions: spatial, channel and overall. Extensive experiments on two datasets show that the proposed method achieves state-of-the-art performance in detecting weak targets. The dual-stream detector with embedded FEM and FFM has clear advantages in detection performance in extreme environments, and FEM and FFM can be applied to other detectors based on dual-stream feature extraction.

Figure 1. The overall flow of the proposed algorithm. FEM and FFM are the feature enhancement and feature fusion modules proposed in this paper.

Figure 2. Overall network structure. FEM denotes the feature enhancement module and FFM denotes the feature fusion module. CBS, C2f and SPPF are single-modal feature processing modules.

Figure 3. Overall FEM structure. Mixed attention is used to enhance the features of the different modalities.

Here, IoU is the intersection over union of the prediction box $b$ and the ground-truth box $b^{gt}$; $\rho(b, b^{gt})$ is the Euclidean distance between the centres of $b$ and $b^{gt}$; $c$ is the diagonal length of the smallest enclosing box containing $b$ and $b^{gt}$, used to normalise the centre distance; and $v$ measures the consistency of the aspect ratio, defined as

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$
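The quantities defined here (IoU, the centre distance, the enclosing-box diagonal $c$ and the aspect-ratio term $v$) match the terms of the standard Complete-IoU (CIoU) loss. The sketch below combines them under that assumption; the full loss expression and the trade-off weight `alpha` are not shown in this excerpt and are taken from the standard CIoU formulation, and the function name and box format `(x1, y1, x2, y2)` are illustrative choices.

```python
import math

def ciou_loss(box, box_gt, eps=1e-9):
    """Standard CIoU loss for two boxes given as (x1, y1, x2, y2).

    Assumes the usual form 1 - IoU + rho^2 / c^2 + alpha * v, which may
    differ in detail from the loss actually used in the paper.
    """
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1

    # Intersection over union of b and b_gt.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + w_gt * h_gt - inter
    iou = inter / (union + eps)

    # Squared Euclidean distance between the two box centres (rho^2).
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest enclosing box (c^2).
    enclose_w = max(x2, gx2) - min(x1, gx1)
    enclose_h = max(y2, gy2) - min(y1, gy1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency term v, as defined in the text.
    v = (4 / math.pi ** 2) * (
        math.atan(w_gt / (h_gt + eps)) - math.atan(w / (h + eps))
    ) ** 2
    # Trade-off weight from the standard CIoU formulation (assumed).
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))
apart = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))
print(same, apart)
```

For identical boxes the loss is essentially zero, while for disjoint boxes the normalised centre-distance term still provides a useful signal even though the IoU is zero.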

Figure 5. Visualisation of the training data distribution for the DroneVehicle and VEDAI datasets.

Figure 6. Indicators at different confidence levels on DroneVehicle.

Figure 7. Indicators at different confidence levels on VEDAI.

Figure 8. Comparison of detection results on the DroneVehicle dataset.

Figure 9. Confusion matrix generated by the model in this paper on different datasets.

Figure 10. Comparison of detection results on the VEDAI dataset.

Table 1. Environment and parameterisation of the experiment.

Table 2. Ablation findings on the DroneVehicle dataset.

Table 3. Ablation findings on the VEDAI dataset.