1. Introduction
In recent years, unmanned aerial vehicle (UAV) technology has rapidly evolved beyond military applications into civilian and research fields, finding widespread use in terrain mapping, logistics, precision agriculture, target tracking, and disaster relief. UAV imagery has contributed to datasets supporting research in hyperspectral anomaly detection [1], image detection [2], classification [3], remote sensing analysis [4], and visual data augmentation [5]. Despite its significance, small-target detection remains a major challenge. Insufficient lighting reduces contrast and detail in visible-light images, hindering feature extraction. Moreover, small-target identification is influenced by lighting conditions, target size, and background interference, making traditional detection methods less effective.
Convolutional Neural Networks (CNNs) have achieved significant advances in feature extraction [6], image classification [7], image denoising [8], object recognition [9], fault diagnosis [10], hyperspectral unmixing [11], super-resolution reconstruction [12], and change detection [13]. CNN-based target detection methods have therefore garnered increasing attention, with the YOLO series being widely applied across various fields due to its superior performance [14]. The initial version of YOLO employed a grid-based, single-shot inference approach, enabling fast detection but exhibiting lower accuracy for small targets. YOLOv2 introduced anchor boxes and K-means clustering, enhancing small-target detection. YOLOv3 incorporated multi-scale detection and adopted Darknet-53 as its backbone to improve accuracy. YOLOv4 integrated advanced optimization strategies to achieve a better trade-off between speed and accuracy, while YOLOv5 further refined the model and introduced the Spatial Pyramid Pooling-Fast (SPPF) structure for broader applicability. In 2021, YOLOX adopted an anchor-free architecture to simplify the model, while YOLOR introduced a multi-task network for both detection and segmentation. In 2022, YOLOv6 integrated the Re-parameterizable Visual Geometry Group (RepVGG) structure, and YOLOv7 further enhanced performance through architectural optimization and dynamic label assignment.
Balancing detection accuracy and model size remains a significant challenge in UAV-based target detection. To address this issue, researchers have proposed various enhancements to YOLO models. For instance, Lou et al. [15] improved small-target detection by refining downsampling and feature fusion, while Guo et al. [16] introduced the Convolutional 3D Network (C3D) structure to better preserve spatial and temporal information. Wang et al. [17] developed the Spatial–Temporal Complementary Network (STC) and incorporated global attention (GAM) to mitigate feature loss. Additionally, Wang et al. [18] integrated the BiFormer attention mechanism and the Feed-Forward Neural Block (FFNB) module, expanding detection scales to reduce missed detections. However, the accuracy of detecting certain small objects, such as bicycles, still requires further improvement. Currently, most object recognition methods rely on single-modality visible-light images [19,20], which perform poorly under low-light or nighttime conditions. Infrared imagery, with its advantages in nighttime environments, serves as a valuable complement to visible-light images, improving detection accuracy and robustness. Compared to traditional pixel-level and decision-level fusion, feature-level fusion directly integrates image features, thereby avoiding registration errors while enhancing information utilization and recognition efficiency. As a result, feature-level fusion has emerged as a widely studied approach in multimodal image processing.
With the increasing role of deep learning in computer vision, image feature fusion has emerged as a key research focus. In 2018, Li et al. [21] proposed a feature fusion method that employs weighted averaging and L1-norm fusion to generate multi-level fused images. Hwang et al. [22] introduced the KAIST dataset and developed a multimodal fusion method—Aggregate Channel Features combined with Temporal Features and Temporal Histogram of Oriented Gradients (ACF+T+THOG)—though its recognition performance remained suboptimal. In 2016, Feichtenhofer et al. [23] demonstrated that multi-stage fusion in deep learning enhances feature extraction. That same year, Wagner [24] found that late-stage CNN fusion yielded superior performance, whereas Liu [25] identified mid-stage fusion as the optimal strategy. In 2018, Li et al. [26] introduced a Multi-Scale and Dual-Stream Faster R-CNN (MSDS-RCNN) with mid-layer fusion for pedestrian detection, significantly improving accuracy.
To address lighting variations in image fusion, researchers have developed adaptive fusion methods. In 2019, Guan et al. [27] designed a light-sensing network incorporating dedicated sub-networks for daytime and nighttime, thereby enhancing adaptability by dynamically predicting lighting conditions. Li et al. [28] leveraged light intensity as a fusion weight, training a network to estimate illumination levels and adjust detection results accordingly. Zhang et al. [29] introduced channel-level attention mechanisms for multimodal fusion, leading to improved detection accuracy. In 2020, Zhang et al. [30] proposed a spectral feature unification module, enhancing feature consistency and overall detection performance. Subsequently, in 2021, Zhang et al. [31] integrated intra- and inter-modal attention mechanisms to adaptively weight multispectral features, achieving higher accuracy while maintaining low computational cost. More recently, Xiongxin et al. [32] (2023) developed a YOLOv5-based UAV human detection framework incorporating visible-infrared fusion. In 2024, Sangin et al. [33] introduced the Infrared and Visible Image Saliency-Aware Network (INSANet), an attention-based fusion network designed to capture global spectral relationships. Meanwhile, Qian et al. [34] proposed a hallucination-based domain adaptation approach for thermal imaging pedestrian detection, effectively integrating virtual visible-light and thermal images to enhance accuracy.
In summary, multimodal fusion in target detection presents several challenges, particularly in effectively determining optimal fusion stages and mitigating information loss. While numerous methods integrate visible and infrared images, they often fail to account for feature variations across different fusion stages, resulting in suboptimal performance—especially for small targets with sparse features that are susceptible to loss. Additionally, many studies do not adequately consider lighting conditions and modality differences. Lighting significantly impacts the quality of aerial imagery, yet numerous algorithms treat varying illumination conditions uniformly, overlooking their influence on detection performance. Furthermore, differences in color, texture, and brightness between visible and infrared images can lead to information omission or distortion, thereby reducing detection accuracy and overall robustness. To address these challenges, this paper proposes an improved UAV-based aerial target detection framework leveraging multimodal feature fusion. Specifically, we develop a YOLOv5-based fusion approach that integrates light sensing and cross-attention mechanisms to mitigate the limitations of visible-light-only detection, such as low accuracy and high false negative rates. The proposed method employs a bidirectional feature extraction network incorporating a cross-attention mechanism to enhance modality-specific feature representation. Additionally, a light-sensing weight module is designed to adaptively fuse features based on illumination conditions. The fused feature representations are subsequently fed into the detection network, enabling high-efficiency, all-weather target detection. The main contributions of this article are summarized as follows:
To address the impact of illumination variations on feature fusion, we introduce a dedicated neural network within the model to construct a light-sensing weight module. This module leverages illumination information to dynamically adjust fusion weights, significantly enhancing target detection accuracy under all-weather conditions.
To mitigate modality differences, we design a cross-attention module that processes feature maps generated by the dual-stream network. This module enables the model to capture correlations between different modalities and fully exploit cross-modal interactions, thereby improving the accuracy of target detection and recognition.
The proposed method is evaluated on publicly available datasets, including DroneVehicle, KAIST, and LLVIP, and compared against several state-of-the-art algorithms. The experimental results demonstrate that our approach exhibits superior generalization ability, accurately detecting targets even under conditions of indistinct features or occlusion, and achieving notable improvements in pedestrian target recognition.
The rest of this article is organized as follows.
Section 2 introduces the datasets used in the experiments and the performance evaluation metrics and provides a detailed description of the proposed method.
Section 3 includes the experimental details and results.
Section 4 discusses the experimental results.
Section 5 presents conclusions and possible directions for future work.
2. Materials and Methods
In this study, the experiments are conducted on publicly available multimodal datasets. During the training and testing phases, we utilize aligned visible-light and infrared image datasets, namely KAIST, LLVIP, and DroneVehicle. Specifically, the DroneVehicle dataset is employed for the ablation studies, while the LLVIP and KAIST datasets are used to evaluate the generalization capability of the proposed algorithm. To assess the overall performance of different methods, we adopt two primary evaluation metrics: the log-average miss rate ($MR^{-2}$) and average precision (AP). These metrics provide a comprehensive measure of detection accuracy and robustness. A detailed description of the datasets and evaluation metrics is provided in the following sections.
2.1. Definition of Small Targets
In different scenarios, the definition of small targets varies with the specific context; in academic research, however, the definition generally falls into two categories. The first is based on a relative scale, typically the ratio of the target to the image. Chen et al. [35] proposed such a definition, in which the relative area of a target instance (the ratio of the bounding box area to the image area) lies within the range of 0.08% to 0.58% for instances of the same category. The second category is based on an absolute scale, where small targets are defined by their absolute pixel size. A widely accepted standard in object detection comes from the Microsoft Common Objects in Context (MS COCO) dataset, which defines small targets as those with a resolution smaller than 32 × 32 pixels. In this paper, we likewise define small targets as those smaller than 32 × 32 pixels, and all three datasets described below contain images that meet this criterion. Furthermore, medium-sized targets are defined as those with an area between 32 × 32 and 96 × 96 pixels, and large targets as those with an area greater than 96 × 96 pixels.
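For concreteness, the following minimal Python sketch classifies a bounding box by the absolute-scale thresholds above (32 × 32 and 96 × 96 pixels); the function name and boundary handling are illustrative assumptions rather than part of any dataset toolkit.

```python
def scale_category(box_w: float, box_h: float) -> str:
    """Classify a detection box by absolute pixel area using MS COCO-style
    thresholds (32x32 for small, 96x96 for large)."""
    area = box_w * box_h
    if area < 32 * 32:
        return "small"
    if area <= 96 * 96:
        return "medium"
    return "large"

# Example: a 20x18-pixel vehicle in an aerial image counts as a small target.
print(scale_category(20, 18))  # -> "small"
```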
2.2. Multimodal Datasets
The DroneVehicle dataset [36], developed and annotated by Tianjin University, is a large-scale UAV-based vehicle dataset designed for aerial target detection. It comprises 28,439 pairs of aligned RGB-IR images, all categorized as vehicles and captured using UAV-mounted cameras. The dataset covers diverse regional scenes, including urban roads, residential areas, and highways, under both daytime and nighttime lighting conditions. To ensure comprehensive data variation, images were captured from three different altitudes: 80 m, 100 m, and 120 m, with camera angles set at 15°, 35°, and 45°. During the dataset calibration phase, affine transformations and region clipping were applied to crop and align RGB-IR image pairs, ensuring precise cross-modal correspondence. Sample images from the DroneVehicle dataset are shown in Figure 1, where the first row presents visible-light images and the second row displays infrared images.
The KAIST dataset, proposed by Hwang et al. [22], was collected using an onboard camera and includes images from two modalities that have undergone rigorous registration processing. The dataset features pedestrian images captured from diverse scenes, such as campus and urban roads, with image dimensions of 640 × 512. The training set comprises 8963 pairs of visible and infrared images, while the test set contains 2252 pairs. A subset of the KAIST dataset images is shown in Figure 2, with the first row displaying visible-light images and the second row displaying infrared images.
The LLVIP dataset, proposed by a research team from Beijing University of Posts and Telecommunications [37], was collected with binocular cameras and contains 16,836 pairs of carefully aligned images. As most of the images in this dataset were captured in nighttime environments, it is particularly well suited for testing algorithm performance in nighttime scenarios. A subset of the LLVIP dataset image pairs is shown in Figure 3, with the first row displaying visible-light images and the second row showing infrared images.
In this paper, for the sake of consistency, we will use the symbol RGB to represent visible-light images and the symbol IR to represent infrared images.
2.3. Evaluation Metrics
To comprehensively assess the performance of multimodal image fusion algorithms in object detection tasks, this study employs both subjective and objective analysis methods. For subjective analysis, human evaluators visually assess the detection results of selected image samples, providing an initial performance judgment. In contrast, objective analysis uses various evaluation metrics to quantitatively measure the algorithm’s performance on the test set, ensuring more accurate and reliable results. Specifically, for binary classification tasks, the predicted results of image targets are divided into positive and negative classes, which can be clearly represented through a confusion matrix. The confusion matrix compares the model’s predictions with the actual results, yielding four types of classification outcomes:
True Positive (TP): The number of samples that are actually positive and correctly identified as positive by the model;
True Negative (TN): The number of samples that are actually negative and correctly identified as negative by the model;
False Positive (FP): The number of samples that are actually negative but incorrectly identified as positive by the model;
False Negative (FN): The number of samples that are actually positive but incorrectly identified as negative by the model.
This combined approach of subjective visual assessment and objective quantitative analysis provides a comprehensive understanding of the performance of multimodal fusion techniques in object detection tasks.
$MR^{-2}$ and AP can be computed from these four classification outcomes.

$MR^{-2}$ is a metric introduced by Dollar et al. to measure the performance of multimodal pedestrian detection. It is derived from the MR-FPPI curve, where MR denotes the miss rate plotted on the vertical axis of the curve. MR is calculated as

$$\mathrm{MR} = \frac{FN}{TP + FN},$$

where $TP + FN$ is the total number of positive samples in the dataset and $FN$ is the number of positive samples incorrectly classified as negative by the model. FPPI (false positives per image) is calculated as

$$\mathrm{FPPI} = \frac{FP}{N_{img}},$$

where $FP$ is the number of false positive detections over all images and $N_{img}$ is the total number of images in the dataset.

$MR^{-2}$ is computed by selecting nine evenly spaced FPPI points on a logarithmic scale within the interval [0.01, 1.0] and averaging the corresponding miss rates (MRs) at these points. A decrease in $MR^{-2}$ therefore directly reflects a reduction in missed detections, indicating improved performance in target detection and recognition. This evaluation method not only provides an accurate quantification of model performance but also enables comparison of performance differences between algorithms.
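As a concrete illustration (not the authors' implementation), the sketch below estimates $MR^{-2}$ from a sampled MR-FPPI curve by averaging the miss rate at nine logarithmically spaced FPPI reference points in [0.01, 1.0]; the interpolation rule is an assumption, and the comment notes the log-space averaging used by common evaluation toolkits.

```python
import numpy as np

def log_average_miss_rate(mrs: np.ndarray, fppis: np.ndarray) -> float:
    """Approximate MR^-2 from an MR-FPPI curve.

    mrs, fppis: miss rate and false-positives-per-image values sampled along
    the detector's score thresholds (fppis assumed to be sorted increasing).
    """
    # Nine evenly spaced reference points on a log scale in [0.01, 1.0].
    ref_fppi = np.logspace(-2.0, 0.0, num=9)
    ref_mr = []
    for r in ref_fppi:
        # Miss rate of the closest operating point with FPPI <= reference
        # (a common convention; if none exists, the miss rate is 1.0).
        idx = np.where(fppis <= r)[0]
        ref_mr.append(mrs[idx[-1]] if len(idx) else 1.0)
    ref_mr = np.asarray(ref_mr)
    # The text averages the miss rates directly; reference toolkits often
    # average in log space: np.exp(np.mean(np.log(np.maximum(ref_mr, 1e-10)))).
    return float(np.mean(ref_mr))
```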
The calculation of AP is similar to that of $MR^{-2}$ and is likewise based on the P-R (precision-recall) curve. AP is the average of the precision values (P) over different recall levels (R), where precision is the proportion of true positive samples among all samples predicted as positive, i.e., $P = TP/(TP+FP)$. AP is calculated as

$$AP = \int_{0}^{1} P(R)\, dR,$$

where R represents recall, i.e., the proportion of actual positive samples that are correctly predicted as positive, calculated as

$$R = \frac{TP}{TP + FN}.$$

The higher the AP value, the better the performance of the algorithm in target detection and recognition. mAP (mean average precision) is a commonly used metric for evaluating object detection models; it is the average of the AP values across all categories:

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i,$$

where N is the number of categories and $AP_i$ is the average precision for the i-th category.
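The sketch below computes AP as the area under the precision-recall curve (all-point interpolation) and mAP as the per-class mean; the interpolation convention is an assumption, since benchmarks differ slightly in how they sample the curve.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the P-R curve (all-point interpolation), i.e. AP = integral of P over R."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Integrate P over R wherever recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class) -> float:
    """mAP = (1/N) * sum of per-class AP values."""
    return float(np.mean(ap_per_class))
```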
2.4. Overall Framework
The method primarily comprises a dual-stream backbone network, a light sensing module, a cross-attention module, and the Neck and Detection Head of the YOLOv5 network. The core function of the light sensing weight module is to extract light information from visible-light images by compressing their spatial features. It then calculates weight values for the two modalities using a normalization equation, dynamically adjusting their contribution ratios during the fusion process to avoid the early suppression of informative features under varying illumination conditions.
To reduce the risk of information loss, especially for small targets with sparse and low-resolution features, a cross-attention module is introduced at multiple feature extraction stages. Unlike conventional late fusion methods that merge features after significant downsampling, our approach integrates cross-attention in earlier layers, enabling information exchange between infrared and visible-light modalities before spatial resolution is significantly reduced. This proactive fusion strategy enhances fine-grained feature retention and preserves critical small target details.
The specific architecture is shown in Figure 4. In the dual-stream backbone network, visible-light and infrared images serve as inputs. First, the visible-light image is processed by the light sensing weight module to obtain the light weights. Both images then undergo four downsampling operations. After the first, second, and third downsampling steps, the feature maps from the two modalities enter the cross-attention module for feature enhancement. The first cross-attention module generates the infrared feature $G_{ir0}$ and the visible-light feature $G_{rgb0}$. $G_{rgb0}$ and $F_{rgb0}$ are added element-wise, and after convolution and downsampling the result is $F_{rgb1}$; the same operations are performed for the infrared branch. This process is repeated for the subsequent layers. Meanwhile, the feature pairs $F_{rgb1}$ and $F_{ir1}$, $F_{rgb2}$ and $F_{ir2}$, and the final $F_{rgb3}$ and $F_{ir3}$ are fused using the light weights produced by the light sensing weight module, yielding features $G_0$, $G_1$, and $G_2$, which are then fed into the detection head for classification and localization.
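The data flow of Figure 4 can be summarized in the following PyTorch-style sketch; the module names, number of stages, and weight ordering are simplifying assumptions intended only to show where the cross-attention exchange and the light-weighted fusion occur.

```python
import torch
import torch.nn as nn

class DualStreamBackbone(nn.Module):
    """Illustrative sketch of the fusion flow in Figure 4 (not the exact implementation)."""
    def __init__(self, stages_rgb, stages_ir, cross_attn, light_weight_net):
        super().__init__()
        self.stages_rgb = nn.ModuleList(stages_rgb)   # four downsampling stages (RGB branch)
        self.stages_ir = nn.ModuleList(stages_ir)     # four downsampling stages (IR branch)
        self.cross_attn = nn.ModuleList(cross_attn)   # cross-attention after stages 1-3
        self.light_weight_net = light_weight_net      # light sensing weight module

    def forward(self, rgb, ir):
        w_ir, w_rgb = self.light_weight_net(rgb)      # illumination-dependent fusion weights
        f_rgb, f_ir = self.stages_rgb[0](rgb), self.stages_ir[0](ir)
        fused = []
        for k in range(1, 4):
            g_rgb, g_ir = self.cross_attn[k - 1](f_rgb, f_ir)   # cross-modal enhancement
            f_rgb = self.stages_rgb[k](f_rgb + g_rgb)           # element-wise add, conv + downsample
            f_ir = self.stages_ir[k](f_ir + g_ir)
            fused.append(w_rgb * f_rgb + w_ir * f_ir)           # light-weighted fusion: G0, G1, G2
        return fused                                            # passed to the YOLOv5 neck/head
```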
2.5. Light Sensing Weight Module
Many existing image fusion algorithms tend to overlook the impact of varying lighting conditions on the fusion process. This oversight can lead to the loss of valuable feature information from both infrared and visible-light images, thereby reducing the performance of all-weather image fusion. In such cases, critical details may be lost, and important features may not be fully explored, affecting the overall quality and effectiveness of the fused image. To address this issue, a light sensing weight module is designed. This module utilizes a miniaturized neural network to obtain the corresponding weights for the visible and infrared modalities. It preserves more effective information from both image sources, enhances detail retention, and ensures stronger robustness under various lighting conditions [38]. The module architecture is shown in Figure 5.
The specific operation is as follows. Given a visible-light image $I_{rgb}$, the light sensing process is shown in Equation (7):

$$[W_D, W_N] = f_{LS}(I_{rgb}), \tag{7}$$

where $f_{LS}(\cdot)$ represents the light sensing weight module, and $W_D$ and $W_N$ represent the light distribution values for daytime and nighttime input images, respectively; both are non-negative scalars. The design of the light sensing weight module aims to effectively address small-target detection under varying lighting conditions, based on the theoretical framework of light modeling and multimodal image fusion. In image processing, variations in lighting conditions significantly affect the quality and informational content of RGB and IR images, especially in low-light or no-light environments. Under such conditions, infrared images provide more stable target information, whereas RGB images offer more detailed features under sufficient lighting. Therefore, a light sensing module that adaptively adjusts the contributions of RGB and IR images to target detection according to the lighting conditions is of great importance.
The specific steps of this module are as follows. First, a CNN processes the input image and extracts light-related features; the convolutional layers effectively compress spatial information and capture local lighting features, making the module sensitive to changes in scene brightness. Next, a global max pooling layer aggregates these lighting features, capturing the global lighting information of the entire image so that the model can make judgments based on overall lighting conditions. Then, two fully connected layers compute the lighting distribution values, with a ReLU activation applied to guarantee non-negative outputs and thus valid light weights. Finally, a normalization function converts the lighting distribution values into fusion weights, dynamically adjusting the weights of the RGB and IR images according to the environmental conditions. The weight allocation mechanism uses a simple normalization function, as shown in Equation (8):

$$w_{ir} = \frac{W_N}{W_D + W_N}, \qquad w_{rgb} = \frac{W_D}{W_D + W_N}, \tag{8}$$

where $w_{ir}$ and $w_{rgb}$ represent the respective weights of the infrared image and the visible-light image in the feature fusion process.
In well-lit environments, RGB images are given a higher weight, as they provide more visual information. In low-light or no-light conditions, the IR image is given a higher weight, enhancing the stability and reliability of target detection. This weight distribution mechanism ensures that the model can adaptively optimize the fusion ratio of different modalities under various lighting conditions, thereby improving detection accuracy.
The theoretical basis of this module comes from research on light perception and multimodal image fusion, particularly in low-light environments where IR images can supplement RGB images by compensating for potential information loss. Meanwhile, the light sensing mechanism dynamically adjusts the fusion ratio of features through an effective weighting process, improving the model’s robustness and accuracy in complex lighting conditions. Additionally, the normalization function used for weight distribution is a simple yet effective approach to dynamically adjust the influence of different modalities during the feature fusion process, enhancing the model’s adaptability to lighting changes. Through this light perception-based feature fusion, the model not only achieves significant improvements in accuracy but also handles small-target detection tasks under varying lighting conditions.
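A minimal PyTorch sketch of this pipeline (CNN feature extractor, global max pooling, two fully connected layers with ReLU, and the normalization of Equation (8)) is given below; the channel widths and layer counts are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightSensingWeight(nn.Module):
    """Sketch of the light sensing weight module: small CNN -> global max
    pooling -> two FC layers with ReLU -> weight normalization."""
    def __init__(self, in_ch: int = 3, hidden: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, 2)   # -> (W_D, W_N): day/night distribution values

    def forward(self, rgb: torch.Tensor):
        x = self.features(rgb)
        x = F.adaptive_max_pool2d(x, 1).flatten(1)       # global max pooling
        x = F.relu(self.fc1(x))
        w = F.relu(self.fc2(x))                          # ReLU keeps W_D, W_N non-negative
        w_day, w_night = w[:, 0], w[:, 1]
        denom = w_day + w_night + 1e-6                   # simple normalization (Eq. (8))
        w_ir = (w_night / denom).view(-1, 1, 1, 1)       # higher in low-light scenes
        w_rgb = (w_day / denom).view(-1, 1, 1, 1)        # higher in well-lit scenes
        return w_ir, w_rgb
```

In the dual-stream backbone, the returned $w_{ir}$ and $w_{rgb}$ are broadcast over the feature maps at each fusion point.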
2.6. Cross-Attention Module
To address the differences between modalities and effectively extract complementary features, this section introduces two key modules.
2.6.1. Modal Normalization
Since visible-light images and infrared images belong to different spectral bands, there is an inherent modal difference between them. Directly using features from one modality to calculate similarity scores with features from another modality is not an optimal solution. Therefore, inspired by [39], this section proposes the use of modal normalization for transformation, as illustrated in Figure 6. The figure demonstrates the modal normalization process, in which the mean and variance of one modality are used to reduce the modal differences when transforming the features of the other modality.
The transformation process is the same for both modalities, so we take the modality normalization of the visible-light features as an example. Let $F_{rgb}$ denote the input feature. The Z-score normalization equation [40] is used, where $\sigma_{rgb}$ and $\mu_{rgb}$ represent the standard deviation tensor and mean tensor of the visible-light features, respectively; the distribution of $F_{rgb}$ is remapped using Equation (9):

$$\hat{F}_{rgb} = \frac{F_{rgb} - \mu_{rgb}}{\sigma_{rgb}}. \tag{9}$$
$\mathrm{Conv}_{\gamma}$ and $\mathrm{Conv}_{\beta}$ represent 3 × 3 convolutions, which are used to predict two learnable parameter tensors $\gamma$ and $\beta$ from the features of the other modality, as shown in Equation (10):

$$\gamma = \mathrm{Conv}_{\gamma}(F_{ir}), \qquad \beta = \mathrm{Conv}_{\beta}(F_{ir}). \tag{10}$$

Finally, using Equation (11), the previously obtained parameters are recombined to obtain the normalized visible-light feature $\tilde{F}_{rgb}$ and, by the symmetric procedure, the normalized infrared feature $\tilde{F}_{ir}$:

$$\tilde{F}_{rgb} = \gamma \odot \hat{F}_{rgb} + \beta. \tag{11}$$
The association alignment phase following modality normalization allows for complementary refinement of feature representations. By aligning the features from both modalities, this phase improves the model’s ability to capture more precise and coherent information, which, in turn, enhances the overall detection performance of the model.
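One possible realization of this modal normalization, assuming each modality is Z-score normalized with its own statistics and then modulated by $\gamma$ and $\beta$ predicted from the other modality via 3 × 3 convolutions, is sketched below; the modulation direction and layer shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ModalNormalization(nn.Module):
    """Sketch: Z-score normalize one modality, then modulate it with
    gamma/beta predicted from the other modality (3x3 convolutions)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_gamma = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_beta = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_self: torch.Tensor, f_other: torch.Tensor) -> torch.Tensor:
        mu = f_self.mean(dim=(2, 3), keepdim=True)            # per-channel mean tensor
        sigma = f_self.std(dim=(2, 3), keepdim=True) + 1e-6   # per-channel std tensor
        f_hat = (f_self - mu) / sigma                         # Z-score normalization (Eq. (9))
        gamma = self.conv_gamma(f_other)                      # learnable scale from the other modality
        beta = self.conv_beta(f_other)                        # learnable shift from the other modality
        return gamma * f_hat + beta                           # recombination (Eq. (11))
```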
2.6.2. Attention Mechanism Module
The cross-attention mechanism has a broad theoretical foundation in multimodal learning and object detection tasks [41]. Multimodal learning aims to improve a model's detection capabilities by fusing data from different sources, such as RGB and infrared images. In complex environments, information from a single modality often fails to capture the full range of target features. The cross-attention mechanism establishes interactions between different modalities, enabling the model to extract complementary feature information from each modality, thereby enhancing small-object detection performance.
The theoretical foundation of the cross-attention mechanism originates from the self-attention mechanism and the attention mechanism in Transformer architectures. These techniques were initially applied in natural language processing tasks and later successfully adapted to image processing and visual tasks. In traditional attention mechanisms, the model automatically assigns weights to input features to learn important areas or information. In self-attention, the weights of features are dynamically calculated based on the correlations between them, while the cross-attention mechanism extends this idea by enabling information interaction and weighting between two modalities, ensuring that each modality’s features effectively complement the other.
Specifically, the cross-attention mechanism first calculates the similarity between different modalities, capturing the relationship between key features in one modality and related features in another. By weighting these relationships, the model learns the contribution of each modality to object detection, thus improving the fusion of features from both modalities. For example, in the fusion of RGB and infrared images, the cross-attention mechanism guides the model to focus on the target areas in the infrared image while benefiting from the rich visual information provided by the RGB image, thereby enhancing detection accuracy. Unlike traditional attention mechanisms, the cross-attention mechanism not only focuses on features within a single modality but also achieves feature fusion through interactive weighting between the two modalities, maximizing the utility of each modality’s information.
In multimodal object detection tasks, the cross-attention mechanism enhances the complementarity of information through interactions between different modalities. For instance, when features in one modality are limited or unclear, the other modality can compensate for these deficiencies, improving the model’s robustness and accuracy. The cross-attention mechanism adaptively adjusts the information fusion process based on the actual conditions of each modality, ensuring that the model maintains high detection accuracy under various environments and conditions.
The application of the cross-attention mechanism not only improves detection accuracy but also effectively addresses issues of information redundancy and loss in multimodal fusion. Through this mechanism, the model can focus on complementary features from different modalities, maximizing each modality’s advantages during fusion and ultimately improving overall performance.
Theoretically, the cross-attention mechanism is based on the idea of attention mechanisms, leveraging the complementarity between modalities to enhance object detection capabilities. Particularly in small-object detection tasks, the cross-attention mechanism can effectively capture subtle features, improving object recognition and localization. This mechanism has been widely applied in multimodal object detection and has proven effective, especially in complex scenes with significant lighting and viewpoint variations.
As shown in Figure 7, in the cross-attention module of this section, suppose the input features are the visible-light feature $\tilde{F}_{rgb}$ and the infrared feature $\tilde{F}_{ir}$. First, the input features are used to generate query (q), key (k), and value (v) descriptors. Within a single modality, the q, k, and v descriptors can be combined with themselves to obtain self-attention weights. In the bimodal model used in this section, the cross-attention module instead computes the dot product between the query vector $q_{rgb}$ derived from the visible-light features and the key vector $k_{ir}$ derived from the infrared features, resulting in a similarity matrix $S_{rgb \rightarrow ir}$. This similarity matrix is then used for soft attention fusion with the value vector $v_{ir}$ derived from the infrared features. The module can thus dynamically supplement features based on the attention weights and inject contextual feature information, making the complementary features more robust. The similarity matrices $S_{rgb \rightarrow ir}$ and $S_{ir \rightarrow rgb}$ are calculated in the cross-correlation alignment step, as shown in Equation (12):

$$S_{rgb \rightarrow ir} = \frac{\mathrm{MatMul}(q_{rgb}, k_{ir}^{T})}{\sqrt{d_k}}, \qquad S_{ir \rightarrow rgb} = \frac{\mathrm{MatMul}(q_{ir}, k_{rgb}^{T})}{\sqrt{d_k}}, \tag{12}$$

where MatMul represents matrix multiplication, T denotes the matrix transpose, and $\sqrt{d_k}$ is the scaling factor. A softmax layer is then applied to obtain the normalized cross-correlation matrices $A_{rgb \rightarrow ir}$ and $A_{ir \rightarrow rgb}$.
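A simplified PyTorch sketch of this bidirectional cross-attention exchange is given below; the 1 × 1 convolution projections and full spatial attention are assumptions made for clarity, and a practical implementation would typically use multi-head or windowed attention to control memory.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Sketch of bidirectional cross-attention: queries from one modality
    attend to keys/values of the other, scaled by sqrt(d_k)."""
    def __init__(self, channels: int):
        super().__init__()
        self.q_rgb = nn.Conv2d(channels, channels, 1)
        self.k_rgb = nn.Conv2d(channels, channels, 1)
        self.v_rgb = nn.Conv2d(channels, channels, 1)
        self.q_ir = nn.Conv2d(channels, channels, 1)
        self.k_ir = nn.Conv2d(channels, channels, 1)
        self.v_ir = nn.Conv2d(channels, channels, 1)

    @staticmethod
    def _attend(q, k, v):
        b, c, h, w = q.shape
        q = q.flatten(2).transpose(1, 2)                 # (B, HW, C)
        k = k.flatten(2)                                 # (B, C, HW)
        v = v.flatten(2).transpose(1, 2)                 # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)   # similarity, scaled by sqrt(d_k)
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor):
        # RGB queries attend to IR keys/values, and vice versa.
        g_rgb = self._attend(self.q_rgb(f_rgb), self.k_ir(f_ir), self.v_ir(f_ir))
        g_ir = self._attend(self.q_ir(f_ir), self.k_rgb(f_rgb), self.v_rgb(f_rgb))
        return g_rgb, g_ir
```

The outputs $G_{rgb}$ and $G_{ir}$ correspond to the enhanced features that are added element-wise to the backbone features described in Section 2.4.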
2.6.3. Loss Function and Optimization Method
The loss function of the YOLOv5 network mainly consists of three components: the bounding box regression loss ($L_{box}$), the objectness loss ($L_{obj}$), and the classification loss ($L_{cls}$) [14]. The detection loss can be represented by Equation (13):

$$L_{det} = \sum_{n=1}^{N} \sum_{i,j}^{S \times S} \sum_{b=1}^{B} \left( L_{box} + L_{obj} + L_{cls} \right), \tag{13}$$

where N represents the number of detection layers, B refers to the number of targets assigned to each anchor box, S × S is the number of grid cells into which the network divides the input, and i and j are the grid indices in the feature map. In this network model, the light sensing weight module is used to enhance the rationality of the fusion, which adds a fusion loss $L_{fuse}$ to the overall objective, as shown in Equation (14):

$$L_{total} = L_{det} + L_{fuse}. \tag{14}$$
The fusion loss function $L_{fuse}$ is itself composed of several parts. $L_{ls}$ denotes the light sensing loss, defined as follows:

$$L_{ls} = w_{ir} \cdot L_{int}^{ir} + w_{rgb} \cdot L_{int}^{rgb},$$

where $L_{int}^{ir}$ and $L_{int}^{rgb}$ represent the intensity losses for the infrared and visible-light images, respectively, and $w_{ir}$ and $w_{rgb}$ are the light sensing weights for the infrared and visible-light images defined in Equation (8). The intensity loss measures the pixel-level difference between the fused image and the source images; the intensity losses for the infrared and visible-light images are therefore defined as

$$L_{int}^{ir} = \frac{1}{HW}\left\| I_f - I_{ir} \right\|_1, \qquad L_{int}^{rgb} = \frac{1}{HW}\left\| I_f - I_{rgb} \right\|_1,$$

where H and W represent the height and width of the input image, respectively, $\|\cdot\|_1$ denotes the $\ell_1$ norm, and $I_f$ is the fused feature image. To ensure that the fused image maintains a light intensity consistent with the original images, the light sensing weights $w_{ir}$ and $w_{rgb}$ are used to control the light distribution in the fused image.
Although the light sensing loss can automatically adjust to the lighting conditions and preserve the light information of the original images, this alone is not sufficient to ensure an optimal intensity distribution in the fused image. Therefore, an additional auxiliary intensity loss is introduced for further optimization:

$$L_{aux} = \frac{1}{HW}\left\| I_f - \max(I_{ir}, I_{rgb}) \right\|_1,$$

where $\max(\cdot)$ denotes the element-wise maximum of the two source images.
Extensive experiments have shown that the best texture performance for infrared and visible-light images is obtained when the image textures are aggregated. Therefore, to maintain both an optimal brightness distribution and rich texture detail in the fused image, a texture loss is adopted, which aims to enrich the texture information of the fused image. The texture loss is defined as

$$L_{tex} = \frac{1}{HW}\left\| \left|\nabla I_f\right| - \max\left(\left|\nabla I_{ir}\right|, \left|\nabla I_{rgb}\right|\right) \right\|_1,$$

where ∇ represents the Sobel gradient operator, which measures image texture information by computing gradients. Finally, the light sensing loss $L_{ls}$, auxiliary intensity loss $L_{aux}$, and texture loss $L_{tex}$ are combined through weighting coefficients $\lambda_1$ and $\lambda_2$ to obtain the fusion loss $L_{fuse}$, expressed as

$$L_{fuse} = L_{ls} + \lambda_1 L_{aux} + \lambda_2 L_{tex}.$$
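The sketch below assembles the fusion loss terms described above for single-channel (intensity) inputs; the Sobel-based gradient, the element-wise maximum, and the placeholder weighting coefficients follow the formulation as reconstructed here and should be read as an illustrative assumption rather than the exact training code.

```python
import torch
import torch.nn.functional as F

# Sobel kernels for the gradient operator used in the texture loss.
_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def sobel_grad(img: torch.Tensor) -> torch.Tensor:
    """Gradient magnitude proxy |grad_x| + |grad_y| for a (B,1,H,W) image batch."""
    gx = F.conv2d(img, _SOBEL_X.to(img.device), padding=1)
    gy = F.conv2d(img, _SOBEL_Y.to(img.device), padding=1)
    return gx.abs() + gy.abs()

def fusion_loss(i_f, i_ir, i_rgb, w_ir, w_rgb, lam_aux=1.0, lam_tex=1.0):
    """Sketch of L_fuse = L_ls + lam_aux * L_aux + lam_tex * L_tex (weights are placeholders)."""
    hw = i_f.shape[-2] * i_f.shape[-1]
    l_int_ir = (i_f - i_ir).abs().sum(dim=(1, 2, 3)) / hw        # intensity loss vs. IR source
    l_int_rgb = (i_f - i_rgb).abs().sum(dim=(1, 2, 3)) / hw      # intensity loss vs. RGB source
    l_ls = (w_ir.flatten() * l_int_ir + w_rgb.flatten() * l_int_rgb).mean()
    l_aux = (i_f - torch.maximum(i_ir, i_rgb)).abs().mean()      # auxiliary intensity loss
    l_tex = (sobel_grad(i_f)
             - torch.maximum(sobel_grad(i_ir), sobel_grad(i_rgb))).abs().mean()
    return l_ls + lam_aux * l_aux + lam_tex * l_tex
```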
The network, under the constraints of light sensing loss and auxiliary intensity loss, dynamically adjusts the light intensity distribution of the generated image based on the input light scene, enabling it to better reflect the complex lighting conditions found in the real world. Additionally, the texture loss constraint helps preserve the texture details of the original image in the generated output. Consequently, the generated image not only excels in terms of light adjustment but also maintains high texture quality. As a result, the light sensing weight module enhances both the quality of image fusion and the network’s adaptability and stability in varying lighting conditions, providing a solid foundation for multimodal detection tasks.
4. Discussion
In response to the existing models that neglect the impact of lighting intensity on target recognition performance, this study introduces a lighting-aware weight module to address this issue. This module improves image fusion by considering the lighting intensity contribution of the image. Additionally, to ensure that each modality compensates for the weaknesses of the other before fusion, a cross-modal attention mechanism is introduced to extract and enhance complementary features.
First, through ablation experiments on the individual modules, we validate the key roles of the lighting-aware weight module and the cross-attention mechanism in the algorithm. The experimental results show that, on the DroneVehicle dataset, the baseline fusion network has a missed detection rate of 26.1% and an accuracy of 79.7%. After adding the lighting-aware weight module, the missed detection rate is reduced by 2.8% and the average accuracy increases by 1.9%. When only modality normalization and the cross-attention module are added, the missed detection rate drops by 3.4% and the average accuracy improves by 3.4%. Finally, after incorporating all modules, the missed detection rate decreases by 5.3% and the accuracy improves by 5.1%. These results demonstrate that the two modules not only effectively balance lighting intensity but also enhance target information by exploiting complementary features from both modalities, significantly improving detection accuracy and reducing the missed detection rate.
To further validate the performance of the algorithm, we applied it to the publicly available KAIST and LLVIP datasets for pedestrian detection and compared it with several current algorithms. The experimental results show that the algorithm exhibits strong generalization ability, accurately detecting targets even when the target features are unclear or occluded, and demonstrates superior pedestrian recognition performance.
5. Conclusions
This paper presents a method for small-target detection in UAV aerial images based on multimodal feature fusion. The method utilizes the cross-attention mechanism to enable mutual learning of complementary features between modalities, continuously enhancing target detection. The light sensing weight module learns the image's light intensity to generate light weights, ensuring that the fusion properly accounts for illumination and achieves effective feature integration. The method is analyzed and validated on multiple datasets, demonstrating its effectiveness and good generalization ability.
Although the proposed method demonstrates excellent detection performance, significantly improving accuracy and reducing the missed detection rate, the introduction of the attention modules and the lighting-aware weight module increases the model's complexity and lowers detection speed. Additionally, while the early fusion strategy benefits small-object detection, it may further reduce inference efficiency, potentially becoming a bottleneck in real-time applications that require fast responses. To balance accuracy and speed, future research will focus on optimizing these modules and exploring ways to reduce the model's parameters and computational cost while maintaining high accuracy. On one hand, lightweight neural network architectures can be designed to reduce redundant computation; on the other hand, more efficient attention mechanisms, such as adaptive attention modules, may help maintain low computational complexity while improving detection performance. Additionally, model distillation could serve as a further optimization approach, helping to compress the model and accelerate inference. Ultimately, the goal is to further enhance the detection speed and real-time response capability of the network to meet the requirements of application scenarios such as autonomous driving and security surveillance.