1. Introduction
The performance of traditional object detection systems that rely solely on visible-light images is often constrained by the inherent limitations of visible-light sensors. These systems experience significant degradation under conditions such as low illumination, adverse weather, and partial occlusions [
1]. To address this, infrared and visible-light imaging—two complementary modalities—can be fused to produce composite images that enhance robustness and accuracy in target perception [
2].
Recent advances in deep learning have significantly reshaped the field of object detection. Girshick et al. introduced the R-CNN framework [
3], which enabled end-to-end learning of deep visual features from region proposals, substantially improving detection performance over traditional hand-crafted approaches. Redmon et al. proposed YOLO, a one-stage detector that formulates object detection as a direct regression problem [
4], achieving real-time performance by predicting bounding boxes and class probabilities in a single forward pass. Simultaneously, the use of thermal imaging gained momentum. For instance, Hwang et al. presented the KAIST multispectral pedestrian dataset [
5], comprising aligned RGB and thermal image pairs, and demonstrated that incorporating thermal data improves detection under poor lighting. More recently, deep fusion networks have emerged that jointly process visible and infrared inputs. One notable example is M2FNet, a transformer-based two-stream model proposed by Jiang et al. [
6], which leverages cross-modal attention to fuse features across spectra, enhancing detection robustness in complex environments.
However, applying general-purpose detectors such as YOLOv8 directly to fused infrared–visible images is often suboptimal. Fused images exhibit specific challenges, including redundant background features, modality inconsistencies, and semantic dilution. These issues are particularly problematic for detecting small or occluded targets, which often lose discriminative features during the fusion process. Moreover, conventional detection frameworks are typically designed for unimodal data and are not optimized to address the heterogeneous and noisy nature of fused inputs, resulting in reduced accuracy and generalization capability.
To overcome these challenges, we propose an improved detection framework based on the YOLOv8 architecture. The main contributions of this work are summarized as follows:
An enhanced YOLOv8 model incorporating a dedicated small-object detection layer, which improves sensitivity to low-resolution or partially occluded targets.
Integration of the Global Attention Mechanism (GAM), which highlights salient spatial and channel-wise features to improve robustness under background clutter and occlusion.
A refined loss function based on Wise-IoU (WIoU), designed to improve bounding box regression in scenarios involving ambiguous boundaries or noisy annotations.
The remainder of this paper is organized as follows.
Section 2 reviews related work on image fusion and deep learning-based object detection.
Section 3 introduces the image fusion methodology.
Section 4 presents the proposed detection framework.
Section 5 discusses experimental results and performance evaluation.
Section 6 discusses the findings, and
Section 7 concludes the paper.
2. Related Work
The main objective of this study is to enhance multi-target detection performance on infrared–visible fused images by integrating attention mechanisms into a deep learning-based detection framework. To contextualize our work, we review related research from three perspectives: infrared–visible image fusion techniques, deep learning-based object detection methods, and attention mechanisms in visual detection tasks.
Infrared–visible (IR–VIS) image fusion methods have been extensively studied to combine complementary information from thermal and optical sensors. Ma et al. provide a comprehensive survey of IR–VIS fusion techniques [
7], highlighting the need to preserve both salient infrared targets and rich visible textures in the fused result. Recent deep learning approaches exemplify this trend: Liu et al. decomposed images into base and detail layers and used multi-layer CNNs with discrete cosine transforms to extract and fuse features for detail preservation [
8]. While this approach improves texture preservation, it lacks adaptive mechanisms to prioritize different spectral information based on scene content. Yang et al. further exploited CNNs by estimating saliency-weight maps and employing a multi-scale fusion strategy, explicitly retaining IR target luminance and visible background textures [
9]. Although this method shows promise in maintaining thermal targets, it does not adequately address the challenge of small-object detection in complex backgrounds. Generative adversarial models have also been applied: for example, Li et al. proposed MrFDDGAN, which uses multi-receptive-field convolutions and a dual-discriminator GAN to balance infrared intensity with color visible texture in the fusion results [
10]. Despite achieving visually appealing results, GAN-based methods often lack direct optimization for downstream detection tasks, potentially leading to suboptimal fusion for object detection applications.
In parallel, deep learning has revolutionized object detection. Early two-stage detectors like Faster R-CNN leverage a Region Proposal Network to generate candidate object regions [
11], while one-stage detectors such as YOLO and SSD reformulate detection as a single-shot regression task [
12]. These models achieve real-time performance and high accuracy on benchmarks, but their effectiveness diminishes significantly when applied to infrared–visible fused images due to domain gap issues. Subsequent advances improved robustness and speed: for example, Lin et al. introduced focal loss in RetinaNet to mitigate extreme foreground–background class imbalance in dense detection [
13]. Surveys of deep detectors note that modern architectures typically employ deep convolutional backbones and multi-scale feature fusion to detect objects at various scales. While focal loss addresses class imbalance, it does not specifically tackle the unique challenges posed by fused multispectral imagery, where small thermal targets may be overwhelmed by rich visible textures.
Finally, attention mechanisms have been increasingly integrated into object detection architectures to enhance feature representation. Channel attention modules, such as Squeeze-and-Excitation proposed by Hu et al., adaptively recalibrate feature responses along the channel dimension [
14]. However, these methods primarily focus on single-modality inputs and may not effectively handle the heterogeneous nature of fused multispectral features. Combined spatial–channel attention modules, including CBAM by Woo et al. and BAM by Park et al., sequentially generate attention maps to emphasize both spatial and channel-wise features [
15,
16]. These modules can be embedded into convolutional neural networks to focus on informative regions, thereby improving both classification and localization performance. More recently, transformer-based detectors have introduced global self-attention mechanisms. DETR, developed by Carion et al., formulates object detection as a set prediction task using an encoder–decoder transformer architecture, enabling the model to capture global context and reason about object relationships for direct bounding box prediction [
17]. Compared to sequential attention modules like CBAM and BAM, the Global Attention Mechanism (GAM) proposed by Liu et al. [
18] captures both channel and spatial dependencies in a unified manner, enabling stronger global context modeling. This design enhances the representation of small or occluded targets, which is especially beneficial in infrared–visible fused imagery.
Existing methods face key limitations: they lack attention mechanisms tailored to IR-VIS fused images, struggle with small-target detection due to scale and contrast variations, and perform poorly under noisy conditions. To address these issues, we propose a detection framework that integrates a global attention mechanism, adds a small-object detection layer, and refines the loss function to enhance accuracy and robustness.
Table 1 summarizes the strengths and limitations of existing approaches:
3. Dataset Creation
In this paper, we adopt a generative adversarial network (GAN)-based approach to perform image fusion between infrared and visible-light modalities. The goal is to exploit the complementary information contained in both modalities and generate fused images that are both perceptually informative and suitable for downstream tasks such as object detection.
3.1. Image Collection
The GAN-based image fusion algorithm used in this paper must be trained on a large set of paired infrared and visible-light images. Because the fusion algorithm in this section performs no preprocessing such as pixel registration, the infrared and visible-light images in the dataset must already be registered. The fusion dataset comes from two sources: image pairs obtained from the Internet, and image pairs generated using style transfer technology.
3.1.1. From the Web
The datasets used in this paper are mostly from existing open-source datasets on the Internet, including TNO [
19], Roadscene [
20], Multispectral [
M3FD [
22], LLVIP [
23], etc.
Table 2 provides an overview of these datasets, including image resolution, shooting environment, and scene category, and
Figure 1 shows the infrared and visible-light image data pairs collected in some dark environments in the LLVIP dataset.
3.1.2. Using Style Transfer Technology
Style transfer is an image generation technique, typically implemented with deep learning, that combines the semantic content of one image with the appearance style of another to produce a new image. Infrared and visible-light images differ substantially in visual characteristics, and visible-light images are far easier to acquire; we therefore leverage style transfer to generate pseudo-infrared images from visible-light inputs, creating additional training data to enhance model performance.
During training, the style transfer process typically involves the following steps:
Reference image selection: A content image and a style image are selected. The content image provides the structural and semantic information, while the style image defines the visual appearance.
Feature extraction: A pre-trained deep convolutional neural network extracts features from both images. Content features such as shapes and contours and style features such as textures and colors are captured from intermediate layers.
Loss function definition: Two losses are defined—content loss, which measures the difference in content features, and style loss, which evaluates the similarity of style features using Gram matrix correlations.
Optimization: The generated image is iteratively updated via gradient descent to minimize a weighted sum of the content and style losses.
Image generation: Upon convergence, the resulting image preserves the semantic structure of the content image while exhibiting the visual style of the style image.
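For concreteness, the content and style objectives from the steps above can be written down directly. Below is a minimal PyTorch-style sketch; the layer choices, the loss weights alpha and beta, and the helper names are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn.functional as F

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    # features: (B, C, H, W) activations from an intermediate layer of a pre-trained CNN
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    # Channel-to-channel correlations, normalized by the number of elements
    return flat @ flat.transpose(1, 2) / (c * h * w)

def content_loss(gen_feats, content_feats):
    # Penalizes differences in deep-layer activations (shapes, contours)
    return F.mse_loss(gen_feats, content_feats)

def style_loss(gen_feats, style_feats):
    # Penalizes differences between Gram matrices (textures, colors)
    return F.mse_loss(gram_matrix(gen_feats), gram_matrix(style_feats))

def total_loss(gen_c, content_c, gen_s_list, style_s_list, alpha=1.0, beta=1e3):
    # Weighted sum minimized by gradient descent on the generated image
    l_style = sum(style_loss(g, s) for g, s in zip(gen_s_list, style_s_list))
    return alpha * content_loss(gen_c, content_c) + beta * l_style
```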
We trained a Pix2Pix [
24] model to translate visible-light images into synthetic infrared-style images using a conditional GAN framework. After training, the generator produces thermal-like outputs that resemble real infrared images in appearance and structure. This approach addresses the lack of paired multimodal data and supports downstream tasks like target detection. As illustrated in
Figure 2, each row shows a visible image (left), a real infrared image (middle), and a generated infrared-style image (right), enabling direct comparison. The selected scenes such as foggy weather, grasslands, and urban areas, reflect typical conditions in civilian surveillance. The results show that Pix2Pix effectively captures thermal patterns and scene structure, proving useful for data augmentation in IR-VIS fusion.
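As a rough illustration of the conditional GAN objective underlying Pix2Pix, the sketch below shows the generator and discriminator loss terms (an adversarial term plus an L1 reconstruction term). The weight lambda_l1 = 100 follows the original Pix2Pix paper and is an assumption for our setting; the logits are assumed to come from a discriminator defined elsewhere.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def generator_loss(d_fake_logits, fake_ir, real_ir, lambda_l1=100.0):
    # Adversarial term: make the discriminator score (visible, generated-IR) pairs as real
    adv = bce(d_fake_logits, torch.ones_like(d_fake_logits))
    # Reconstruction term: keep the pseudo-infrared output close to the paired real infrared image
    return adv + lambda_l1 * l1(fake_ir, real_ir)

def discriminator_loss(d_real_logits, d_fake_logits):
    # Real (visible, real-IR) pairs should be scored 1, generated pairs 0
    real = bce(d_real_logits, torch.ones_like(d_real_logits))
    fake = bce(d_fake_logits, torch.zeros_like(d_fake_logits))
    return 0.5 * (real + fake)
```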
Using the aforementioned methods, a dataset for image fusion was constructed. It contains over 4000 pairs of visible and infrared images, featuring targets such as pedestrians, cars, and buses. The images cover various lighting and weather conditions, including daytime, dusk, and nighttime scenes, as well as diverse backgrounds such as forest and urban environments. The acquisition angles also vary, including vehicle-mounted frontal views, stationary frontal views, and overhead surveillance perspectives. Sample image pairs are shown in
Figure 3, where the first row shows infrared images and the second row shows visible-light images.
The fusion results produced from this dataset are evaluated in Section 3.3 using two complementary metrics, entropy and structural similarity, which together provide a targeted evaluation of information preservation and structural consistency.
3.2. Fusion Network Training
It is important to clarify that while the Pix2Pix model introduced earlier is used solely for data augmentation by generating synthetic infrared images, the following GAN architecture is independently designed to perform the core image fusion task, as described in this section.
This fusion network learns to integrate complementary features from paired visible and infrared images into a single, perceptually informative fused output. The architecture consists of a generator G, responsible for producing the fused images, and two discriminators, D_vis and D_ir, which evaluate whether the fused image preserves the semantic characteristics of the visible and infrared modalities, respectively.
The training process of this GAN follows the two-timescale update rule (TTUR) [
25]. Specifically, the learning rate of the discriminator is set to 0.002, and the generator’s learning rate is set to 0.001. The Adam optimizer is used for both components, and the batch size is set to 16 based on available memory resources.
The GAN training procedure is as follows: k_G, k_v, and k_i denote the training step counters for G, D_vis, and D_ir, respectively; N_max is the maximum number of training steps; and ε_G, ε_v, and ε_i are the stopping criteria for the corresponding losses.
Initialize the parameters of G, D_vis, and D_ir;
For each training iteration:
Train the discriminators D_vis and D_ir:
Randomly select a batch of infrared images I_ir and their corresponding visible images I_vis.
Generate fake (fused) data using the generator: I_f = G(I_ir, I_vis).
Minimize the loss L_Dv to update the parameters of D_vis.
Minimize the loss L_Di to update the parameters of D_ir.
If L_Dv > ε_v and k_v < N_max, repeat the D_vis update step and increment k_v.
If L_Di > ε_i and k_i < N_max, repeat the D_ir update step and increment k_i.
Train the generator G:
Randomly select a batch of infrared images I_ir and their corresponding visible images I_vis.
Generate fake (fused) data using the generator: I_f = G(I_ir, I_vis).
Minimize the loss L_G to update the parameters of G.
If L_G > ε_G and k_G < N_max, repeat the generator update step and increment k_G.
End the training when the maximum number of iterations or the stopping criteria are met.
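The sketch below illustrates one possible PyTorch implementation of this procedure with the TTUR settings given above (Adam, discriminator learning rate 0.002, generator learning rate 0.001, batch size 16). The tiny generator and discriminator modules and the binary cross-entropy terms are stand-ins for the actual fusion architecture and loss functions, not the exact network used here.

```python
import torch
import torch.nn as nn

class TinyG(nn.Module):
    """Stand-in for the fusion generator: concatenates IR and visible inputs, outputs one fused image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(2, 1, 3, padding=1)
    def forward(self, ir, vis):
        return torch.tanh(self.net(torch.cat([ir, vis], dim=1)))

class TinyD(nn.Module):
    """Stand-in for a discriminator: outputs one real/fake logit per image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2), nn.LeakyReLU(0.2),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))
    def forward(self, x):
        return self.net(x)

G, D_vis, D_ir = TinyG(), TinyD(), TinyD()
opt_g = torch.optim.Adam(G.parameters(), lr=0.001)       # generator: slower timescale (TTUR)
opt_dv = torch.optim.Adam(D_vis.parameters(), lr=0.002)  # discriminators: faster timescale
opt_di = torch.optim.Adam(D_ir.parameters(), lr=0.002)
bce = nn.BCEWithLogitsLoss()

ir, vis = torch.rand(16, 1, 64, 64), torch.rand(16, 1, 64, 64)  # one dummy batch of 16 pairs
for step in range(2):  # a couple of illustrative iterations
    # Discriminator updates: each accepts its own modality and rejects the fused output
    fused = G(ir, vis).detach()
    loss_dv = bce(D_vis(vis), torch.ones(16, 1)) + bce(D_vis(fused), torch.zeros(16, 1))
    loss_di = bce(D_ir(ir), torch.ones(16, 1)) + bce(D_ir(fused), torch.zeros(16, 1))
    opt_dv.zero_grad(); loss_dv.backward(); opt_dv.step()
    opt_di.zero_grad(); loss_di.backward(); opt_di.step()

    # Generator update: convince both discriminators that the fused image matches their modality
    fused = G(ir, vis)
    loss_g = bce(D_vis(fused), torch.ones(16, 1)) + bce(D_ir(fused), torch.ones(16, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```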
After 100 training epochs, the smoothed loss curves from the training process are shown in
Figure 4.
As the figure shows, the generator loss decreases steadily as training proceeds and eventually stabilizes, indicating that the generator progressively learns to produce more realistic fused samples and that training converges well.
3.3. Fusion Result
The image fusion result of the algorithm in this section is shown in
Figure 5, where the columns from left to right are the visible-light image, infrared image, and fused image.
In infrared and visible image fusion tasks, although existing studies often employ multiple objective evaluation metrics to comprehensively assess the fusion performance, this paper focuses on two representative indicators: Structural Similarity Index (SSIM) and entropy (EN).
EN measures the amount of information contained in an image. A higher entropy value indicates richer information content, suggesting that the fused image effectively preserves detailed features from both the infrared and visible source images. Entropy is defined as
EN = -\sum_{i=0}^{L-1} p_i \log_2 p_i
where L is the total number of gray levels in the image and p_i is the probability of the i-th gray level, i.e., the value of the normalized histogram of the image at that level.
This measure quantifies the information content in the fused image. A higher entropy value suggests that the image has better information retention from both infrared and visible sources.
SSIM evaluates the similarity between the fused image and the source images in terms of structure, luminance, and contrast. It serves as a perceptually relevant metric that reflects the structural fidelity of the fusion result. In infrared and visible image fusion tasks, SSIM is defined as follows:
SSIM_{A,B,F} = \frac{1}{2}\left(SSIM_{A,F} + SSIM_{B,F}\right)
Here, SSIM_{A,F} denotes the structural similarity between the source image A and the fused image F, and SSIM_{B,F} denotes the similarity between the source image B and the fused image F.
The SSIM between two images X and F is calculated as
SSIM_{X,F} = \frac{\left(2\mu_X \mu_F + C_1\right)\left(2\sigma_{XF} + C_2\right)}{\left(\mu_X^2 + \mu_F^2 + C_1\right)\left(\sigma_X^2 + \sigma_F^2 + C_2\right)}
In these equations, \mu_A, \mu_B, and \mu_F are the mean intensities of images A, B, and F, respectively; \sigma_A^2, \sigma_B^2, and \sigma_F^2 are the variances of the corresponding images; \sigma_{AF} and \sigma_{BF} denote the covariances between the image pairs (A, F) and (B, F), respectively; and C_1 and C_2 are small constants introduced to avoid instability when the denominators are close to zero. In this paper, both are set to 0 for simplification.
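As a practical note, both metrics are straightforward to compute. The sketch below uses NumPy for the entropy and scikit-image for the per-pair SSIM, averaging SSIM_{A,F} and SSIM_{B,F} as assumed in the fusion SSIM above; the library applies its default stabilizing constants rather than the zero constants adopted here, and the dummy images are placeholders.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def image_entropy(img, levels=256):
    # EN = -sum(p_i * log2 p_i) over the normalized gray-level histogram
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is well defined
    return float(-(p * np.log2(p)).sum())

def fusion_ssim(src_a, src_b, fused):
    # Average of the structural similarity between the fused image and each source image
    return 0.5 * (ssim(src_a, fused, data_range=255) + ssim(src_b, fused, data_range=255))

# Dummy uint8 images standing in for the infrared, visible, and fused images
a = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
b = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
f = ((a.astype(np.float32) + b) / 2).astype(np.uint8)
print(image_entropy(f), fusion_ssim(a, b, f))
```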
SSIM takes into account luminance, contrast, and structural similarity between images. A higher SSIM value indicates better fusion performance. To validate the effectiveness of the proposed fusion method, we compare it with baseline fusion approaches; the results are summarized in
Table 3. It can be observed that the proposed method achieves higher entropy, suggesting richer information content, while maintaining competitive structural similarity compared to DIDFuse [
26].
In general, the fused images produced by the algorithm in this section have an overall dark tone that highlights target contours. The fused image retains the information of the source images, such as the high-intensity vehicle and other target information from the infrared image and the environmental detail information from the visible-light image.
4. Architecture Improvements for Multi-Target Detection
This section introduces architectural enhancements to improve multi-target detection in infrared–visible fused images. The proposed improvements address challenges such as small-object visibility, occlusion, and feature redundancy. A dedicated detection layer is added to improve sensitivity to small targets, while a Global Attention Mechanism (GAM) helps the network focus on key spatial and channel features. In addition, a refined regression loss function, WIoU, is used to handle noisy labels and imprecise boundaries. These components work together to build a more accurate and robust detection framework for complex fusion scenarios.
4.1. Small-Target Detection Layer
In the original YOLOv8 model, the network computes feature maps at three different scales, which are used to predict targets of different sizes in the image.
The features extracted by the YOLOv8 backbone network are shown in
Figure 6.
The YOLOv8 backbone takes 640 × 640 images as input and generates three feature maps through 8×, 16×, and 32× downsampling, corresponding to resolutions of 80 × 80, 40 × 40, and 20 × 20. According to the Feature Pyramid Network (FPN) principle, deeper layers capture rich semantic information and are suitable for large-object detection, but repeated downsampling leads to loss of spatial details, making small-object detection more difficult. In contrast, shallow layers retain finer location information and are better for detecting small targets.
To address this, an additional feature map from a 4× downsampling layer is incorporated into the fusion process to enhance the network’s ability to detect small objects. The improved feature extraction structure is shown in
Figure 7.
To enhance small-target detection, a 4× downsampled feature map is introduced into the feature fusion network, generating a new 160 × 160 scale feature. Although this feature map has a smaller receptive field and less semantic information, it retains more spatial detail, making it beneficial for locating small objects.
Deep feature maps obtained through multiple downsamplings contain rich semantic information but may lose fine details and precise location cues. In contrast, shallow feature maps preserve detailed spatial information but are more susceptible to noise and lack semantic understanding. To leverage the strengths of both, the improved model adopts a feature fusion structure based on FPN and PAN, as shown in
Figure 8. FPN enhances semantic propagation from deep to shallow layers, while PAN improves the transfer of location information from shallow to deep layers. This fusion strategy enables the network to extract more robust and discriminative features across scales, effectively improving small-object detection performance.
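To illustrate how the extra 4×-downsampled level fits into a top-down fusion pathway, the sketch below adds a P2 (160 × 160) output to a generic FPN. The channel widths, 1 × 1 lateral convolutions, and nearest-neighbor upsampling are illustrative simplifications, not the exact YOLOv8 neck.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPNWithP2(nn.Module):
    """Top-down feature fusion that adds a high-resolution P2 (4x downsampling) output."""
    def __init__(self, c2=64, c3=128, c4=256, c5=512, out_ch=128):
        super().__init__()
        # 1x1 lateral convolutions project each backbone level to a common channel width
        self.lat2, self.lat3 = nn.Conv2d(c2, out_ch, 1), nn.Conv2d(c3, out_ch, 1)
        self.lat4, self.lat5 = nn.Conv2d(c4, out_ch, 1), nn.Conv2d(c5, out_ch, 1)
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in range(4)])

    def forward(self, c2, c3, c4, c5):
        # c2..c5: backbone features at 4x, 8x, 16x, and 32x downsampling
        p5 = self.lat5(c5)
        p4 = self.lat4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lat3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lat2(c2) + F.interpolate(p3, scale_factor=2, mode="nearest")  # new 160x160 level
        return [conv(p) for conv, p in zip(self.smooth, (p2, p3, p4, p5))]

# For a 640x640 input, the four levels are 160, 80, 40, and 20 pixels on a side
fpn = TopDownFPNWithP2()
feats = [torch.rand(1, c, s, s) for c, s in [(64, 160), (128, 80), (256, 40), (512, 20)]]
p2, p3, p4, p5 = fpn(*feats)  # p2 keeps the fine spatial detail used for small targets
```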
4.2. Integration of Global Attention Mechanism
Compared to CBAM and SE, GAM provides a more unified and expressive modeling of both inter-channel and spatial dependencies. While SE focuses solely on channel-wise attention and CBAM applies spatial and channel attention sequentially in a localized manner, GAM employs a deeper and more integrated fusion strategy that better preserves global contextual information. This is particularly advantageous for detecting small or occluded targets, where broad context awareness is essential. Moreover, unlike transformer-based attention mechanisms such as those used in DETR, which rely heavily on computationally intensive self-attention across tokenized inputs, GAM achieves strong global representation with lower computational overhead, making it more suitable for resource-constrained detection scenarios. Therefore, this paper integrates GAM into the detection framework to enhance feature representation and improve overall detection performance. The computation process of GAM is illustrated in
Figure 9.
The input feature map F_1 passes through two attention modules, the channel attention module M_c and the spatial attention module M_s, in sequence. F_1 first passes through M_c to obtain the intermediate feature F_2, and F_2 then passes through M_s to obtain the optimized feature F_3. The calculation process is shown in Equations (5) and (6):
F_2 = M_c(F_1) ⊛ F_1   (5)
F_3 = M_s(F_2) ⊛ F_2   (6)
where M_c and M_s denote the channel and spatial attention modules, ⊛ denotes element-wise multiplication, and F_1, F_2, and F_3 are the input, intermediate, and output feature maps, respectively.
The processing flow of the channel attention module on the input is shown in
Figure 10. The processing flow is as follows:
For the input feature map F_1, the dimensions are permuted so that the order changes from C × H × W to W × H × C;
A multi-layer perceptron (MLP) is used to introduce nonlinearity, allowing the model to learn adaptive weights that represent the complex relationships between dimensions. The MLP has an encoder–decoder structure with a reduction ratio of r;
Finally, the tensor is permuted back to C × H × W and multiplied element-wise with the original input feature map to obtain the final result.
The processing flow of the spatial attention module is shown in
Figure 11. First, the input feature map is processed by two 7 × 7 convolutional layers, which integrate and extract spatial information. These two convolutional layers capture features at different spatial scales in the input feature map, thereby enhancing the representation of spatial information. Although max pooling helps reduce network complexity, it also causes considerable information loss; unlike CBAM, the GAM spatial attention module therefore uses no pooling operations in order to retain more information. As a result, this module can significantly increase the number of parameters. To reduce the parameter count while maintaining effectiveness, group convolution with channel shuffle (random mixing of channels) is used.
4.3. Network Model Update
The GAM attention mechanism and small-object detection layer are integrated into the model to obtain the improved YOLOv8 model as shown in
Figure 12.
The added small-target detection layer retains the positional information of small targets and improves their detection. The GAM attention mechanism adaptively adjusts channel and spatial feature weights so that the network captures image features more effectively. In this work, the GAM module is added at the end of the backbone network.
4.4. Loss Function Improvement
In this paper, the primary modification is applied to the CIoU loss component within the regression loss of the YOLOv8 framework.
Intersection over Union (IoU) is often used to indicate the similarity between the model’s prediction results and the true annotations, as shown in
Figure 13.
The calculation of IoU is as follows: first compute the intersection area and the union area of the predicted region and the ground-truth region, then divide the intersection area by the union area to obtain a ratio between 0 and 1. The calculation formula of IoU is shown in Equation (7):
IoU = \frac{\left|B_{p} \cap B_{gt}\right|}{\left|B_{p} \cup B_{gt}\right|}   (7)
The closer the IoU is to 1, the closer the prediction is to the ground-truth region. When the IoU exceeds a chosen threshold, the predicted box is considered to match the ground-truth box and the prediction is counted as correct. The IoU loss is defined as
L_{IoU} = 1 - IoU
The IoU loss has several limitations. When the IoU loss equals 1, the predicted box and the ground-truth box do not overlap at all, but the loss gives no information about how far apart the two boxes are, so no useful gradient can be propagated for further training. In addition, two predictions with the same IoU loss only share the same ratio of intersection to union; the loss cannot distinguish how the two boxes actually overlap, as shown in
Figure 14.
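For reference, a minimal implementation of the IoU computation and the corresponding IoU loss is sketched below; the (x1, y1, x2, y2) box format and the epsilon term are assumptions for illustration.

```python
import torch

def box_iou(box1, box2, eps=1e-7):
    """IoU and IoU loss for axis-aligned boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # Intersection rectangle
    x1 = torch.max(box1[:, 0], box2[:, 0])
    y1 = torch.max(box1[:, 1], box2[:, 1])
    x2 = torch.min(box1[:, 2], box2[:, 2])
    y2 = torch.min(box1[:, 3], box2[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    # Union = area1 + area2 - intersection
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    iou = inter / (area1 + area2 - inter + eps)
    return iou, 1.0 - iou  # IoU and the corresponding IoU loss

pred = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
gt = torch.tensor([[20.0, 20.0, 60.0, 60.0]])
print(box_iou(pred, gt))  # partially overlapping boxes; fully disjoint boxes give IoU 0 and loss 1
```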
YOLOv8 uses CIoU loss, which is more sensitive to target size than IoU loss. The schematic diagram of CIoU is shown in
Figure 15. The distance between the center points of the predicted box and the ground-truth box is denoted as d, and the diagonal length of the smallest enclosing rectangle of the two boxes is denoted as c.
The calculation formula of the CIoU loss is
L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v}
Here, w^{gt} and w denote the widths of the ground-truth box and the predicted box, respectively, and h^{gt} and h denote their heights; \rho(b, b^{gt}) is the distance d between the centers of the two boxes, so \rho^2(b, b^{gt})/c^2 is a weighted distance penalty term; \alpha is the weight of v; and v measures the difference in aspect ratio between the predicted box and the ground-truth box, where w^{gt}/h^{gt} is the aspect ratio of the ground-truth box and w/h is that of the predicted box. The greater the difference between the two aspect ratios, the larger the value of v.
During the training of object detection networks, low-quality data, such as mislabeled bounding boxes or overlapping targets, can adversely affect model performance. CIoU loss penalizes predictions based on geometric discrepancies, including distance and aspect ratio. However, when trained on noisy data, this may result in overfitting and poor generalization. To mitigate this issue, we replace CIoU with WIoU [
27], an adaptive loss function that reduces penalties for high-overlap predictions on low-quality samples, thereby improving the model’s robustness and generalization capability.
Wise-IoU v1
The calculation equation of WIoU v1 is
L_{WIoUv1} = R_{WIoU} \, L_{IoU}, \qquad R_{WIoU} = \exp\left(\frac{\left(x - x_{gt}\right)^2 + \left(y - y_{gt}\right)^2}{\left(W_g^2 + H_g^2\right)^{*}}\right)   (10)
where (x, y) and (x_{gt}, y_{gt}) are the center coordinates of the predicted box and the ground-truth box, and W_g and H_g are the width and height of the smallest enclosing box of the two. From Equation (10), R_{WIoU} is the penalty term applied to L_{IoU}. For a normal anchor box, R_{WIoU} \in [1, e), so it enlarges L_{IoU}, although the enlargement coefficient differs from case to case. When x = x_{gt} and y = y_{gt}, R_{WIoU} = 1 and the penalty term reaches its minimum value; at the same time, a small L_{IoU} indicates that the predicted box and the ground-truth box essentially coincide and a good result has already been achieved. The penalty term therefore reduces the influence of the loss when the prediction is already very accurate. In addition, to prevent R_{WIoU} from producing gradients that hinder convergence, W_g and H_g are detached from the computation graph. The asterisk (*) in the denominator denotes this separation (detach) operation, and this interpretation remains consistent throughout the subsequent formulas.
Wise-IoU v2
WIoU v2 uses a monotonic focusing mechanism, which reduces the contribution of easy samples to the loss value. WIoU v2 introduces a gradient gain based on L^{*}_{IoU}:
L_{WIoUv2} = \left(L^{*}_{IoU}\right)^{\gamma} L_{WIoUv1}, \qquad \gamma > 0   (11)
The larger the IoU, the smaller L_{IoU}, and the smaller the gradient gain \left(L^{*}_{IoU}\right)^{\gamma}, which reduces the impact of the loss when the prediction is already good.
Equation (12) adds a normalization factor, the running mean \bar{L}_{IoU}, to Equation (11):
L_{WIoUv2} = \left(\frac{L^{*}_{IoU}}{\bar{L}_{IoU}}\right)^{\gamma} L_{WIoUv1}   (12)
This keeps the gradient gain r = \left(L^{*}_{IoU}/\bar{L}_{IoU}\right)^{\gamma} at a relatively high level and prevents it from becoming too small to contribute to training in the later stages.
Wise-IoU v3
WIoU v3 introduces an outlier degree \beta to indicate the quality of an anchor box. The calculation equation of the outlier degree is
\beta = \frac{L^{*}_{IoU}}{\bar{L}_{IoU}} \in [0, +\infty)   (13)
A small outlier degree indicates a high-quality prediction box. During model training, more attention should be paid to the bounding boxes that have a positive impact on prediction accuracy. WIoU v3 focuses on prediction boxes of better quality by controlling the gradient gain: for prediction boxes with a large outlier degree, WIoU v3 reduces the gradient gain, which limits the adverse effect of gradients generated by low-quality data on the model. In this way, the model pays more attention to prediction boxes of ordinary quality, improves performance in real scenarios, and effectively controls the impact of low-quality data. The calculation equation of WIoU v3 is
L_{WIoUv3} = r \, L_{WIoUv1}, \qquad r = \frac{\beta}{\delta \, \alpha^{\beta - \delta}}   (14)
where r is a non-monotonic focusing coefficient constructed from \beta, and \alpha and \delta are hyperparameters.
It should be noted that in Equation (14), the denominator \delta \alpha^{\beta - \delta} may approach zero for certain combinations of \beta, \alpha, and \delta, potentially leading to numerical instability. To mitigate this, we ensure that the denominator remains bounded below by a small non-zero positive constant, and we constrain the range of \beta during training to maintain numerical stability.
Moreover, the coefficient r exhibits non-monotonic behavior with respect to \beta, meaning that increasing \beta does not always lead to a monotonic increase or decrease in r. This design enables the loss function to dynamically adjust its focus on bounding boxes of different qualities, making it more adaptive to varying object detection scenarios.
In this paper, the original CIoU loss function in YOLOv8 is replaced with WIoU. Specifically, the bounding box regression branch of the detection head is modified to compute WIoU-based loss values during training. This allows YOLOv8 to benefit from WIoU’s adaptive penalty mechanism, thereby improving its robustness to noisy labels and enhancing detection accuracy, particularly in complex or low-quality data scenarios.
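To make the replacement concrete, the sketch below assembles Equations (10), (13), and (14) into a standalone WIoU v3 loss. The hyperparameters alpha = 1.9 and delta = 3 and the running-mean momentum are illustrative defaults taken as assumptions rather than the values tuned in this work, and the plain IoU computation is inlined for self-containment.

```python
import torch

class WIoUv3Loss:
    """Sketch of WIoU v3: L = r * R_WIoU * L_IoU with a non-monotonic focusing coefficient r."""
    def __init__(self, alpha=1.9, delta=3.0, momentum=0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.iou_loss_mean = 1.0  # running mean of L_IoU used to normalize the outlier degree

    def __call__(self, pred, gt, eps=1e-7):
        # Plain IoU and IoU loss (boxes in (x1, y1, x2, y2) format)
        xi1, yi1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
        xi2, yi2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
        inter = (xi2 - xi1).clamp(min=0) * (yi2 - yi1).clamp(min=0)
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
        l_iou = 1.0 - inter / (area_p + area_g - inter + eps)
        # Box centers and smallest enclosing box (W_g, H_g), detached from the graph (the "*" operation)
        pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
        gcx, gcy = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
        wg = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
        hg = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
        r_wiou = torch.exp(((pcx - gcx) ** 2 + (pcy - gcy) ** 2) /
                           (wg ** 2 + hg ** 2 + eps).detach())        # Equation (10)
        beta = l_iou.detach() / (self.iou_loss_mean + eps)            # Equation (13), outlier degree
        r = beta / (self.delta * self.alpha ** (beta - self.delta))   # Equation (14), focusing coefficient
        # Exponential running mean of L_IoU, used to normalize future outlier degrees
        self.iou_loss_mean = ((1 - self.momentum) * self.iou_loss_mean
                              + self.momentum * float(l_iou.mean()))
        return (r * r_wiou * l_iou).mean()

loss_fn = WIoUv3Loss()
pred = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
gt = torch.tensor([[20.0, 20.0, 60.0, 60.0]])
print(loss_fn(pred, gt))
```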
5. Experimental Results
This section trains the improved YOLOv8 target detection network, conducts ablation experiments on the small-target detection layer, the GAM attention module, and the WIoU loss function, and evaluates the results using the relevant metrics. The experiments show that the three optimization strategies proposed in this paper improve the performance of YOLOv8 on the detection task.
5.1. Detection Dataset Construction
Deep learning-based target detection requires a large amount of training data, and the quality of the dataset largely determines the final training result. The detection network in this section is trained to detect targets in the fused images produced in
Section 3, so the dataset in this section comes from the fusion results obtained in
Section 3. Some images in the dataset are shown in
Figure 16.
The targets to be detected in this dataset include pedestrians, cars, buses, street lights, motorcycles, and large trucks, all common objects in civilian monitoring. The image backgrounds include high-rise buildings, tunnels, underground parking lots, the sky, lawns, forests, and warehouses. The sampling angles cover vehicle-mounted frontal views, stationary frontal views, and overhead surveillance perspectives, matching the requirements of the application scenarios considered in this work.
Bounding boxes are manually labeled using LabelImg by drawing rectangles around the target objects, as shown in
Figure 17. In YOLO format, each line in the annotation file specifies the object class, the normalized center coordinates (x, y), and the normalized width and height of the bounding box.
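A small helper showing this conversion from LabelImg-style pixel rectangles to the normalized YOLO line format is given below; the class index and image size are illustrative.

```python
def to_yolo_line(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel bounding box to 'class x_center y_center width height' (all normalized)."""
    xc = (x_min + x_max) / 2.0 / img_w
    yc = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A pedestrian box drawn at pixels (100, 200)-(180, 420) in a 1280x720 fused image
print(to_yolo_line(0, 100, 200, 180, 420, 1280, 720))
```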
5.2. Experimental Evaluation Indicators
After training, the model is evaluated on a validation set. Common evaluation metrics for object detection include precision, recall, accuracy, and mean Average Precision (mAP). A confusion matrix is often used to assess model performance, where rows represent actual classes and columns represent predicted classes. It consists of four elements: TP (True Positives), FP (False Positives), FN (False Negatives), and TN (True Negatives), as shown in
Table 4.
The confusion matrix can be used to calculate the evaluation indicators of target detection such as precision, recall, and accuracy, and further evaluate the effect of the model.
Precision
The proportion of correctly predicted positive samples (TP) among all samples predicted as positive (TP + FP). The calculation equation is P = TP / (TP + FP).
Recall
The proportion of correctly predicted positive samples (TP) among all actual positive samples (TP + FN). The calculation formula is R = TP / (TP + FN).
Accuracy
The proportion of correctly predicted samples (TP + TN) among all samples (TP + FP + FN + TN). The calculation equation is Accuracy = (TP + TN) / (TP + FP + FN + TN).
When evaluating the model, a series of confidence thresholds is applied to divide the samples, yielding multiple precision–recall value pairs. Plotting these values with recall R on the horizontal axis and precision P on the vertical axis produces the P–R curve. The area enclosed by the curve and the coordinate axes is the Average Precision (AP), and the mean of the AP values over all categories in the dataset is the mAP.
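One common way to turn such a threshold sweep into an AP value is sketched below (monotonic precision smoothing followed by integration over recall); the exact interpolation used by a given YOLO toolchain may differ slightly, and the example P–R pairs are illustrative.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Area under the precision-recall curve obtained by sweeping confidence thresholds."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)[order], [1.0]))
    p = np.concatenate(([1.0], np.asarray(precisions, dtype=float)[order], [0.0]))
    # Make precision monotonically non-increasing before integrating (standard AP smoothing)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# Example P-R pairs from a confidence-threshold sweep; mAP averages AP over all classes
print(average_precision([0.95, 0.9, 0.8, 0.7], [0.2, 0.4, 0.6, 0.8]))
```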
5.3. Model Training
The dataset is divided into training, validation, and test sets in a 7:2:1 ratio. During model training, the batch size is set to 32, and the number of training epochs is set to 300. Stochastic Gradient Descent (SGD) is used as the optimizer, with an initial learning rate of 0.01. An early stopping mechanism is applied to prevent overfitting by evaluating model performance on the validation set after each epoch. If multiple consecutive evaluations show no significant improvement, training is halted to avoid overfitting. In this case, training stopped early at the 295th epoch. The loss variation curves are presented in
Figure 18, illustrating the following from left to right: Distribution Focal Loss (DFL) improves localization precision by modeling bounding box coordinates as discrete distributions; bounding box loss evaluates the regression accuracy, often using IoU-based metrics; classification loss measures the discrepancy between predicted and true class labels, typically via cross-entropy or focal loss.
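For reproducibility, the training configuration above roughly corresponds to the following Ultralytics-style call. The model YAML, dataset YAML, and patience value are hypothetical placeholders, and the argument names assume the standard Ultralytics training interface rather than a custom one.

```python
from ultralytics import YOLO

# Hypothetical files: a custom model definition with the P2 head and GAM, and the fused-image dataset
model = YOLO("yolov8-gam-p2.yaml")
model.train(
    data="fused_dataset.yaml",   # 7:2:1 train/val/test split defined in the dataset YAML
    epochs=300,
    batch=32,
    optimizer="SGD",
    lr0=0.01,                    # initial learning rate
    patience=50,                 # early stopping when validation metrics stop improving
)
```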
The changing curves of precision, recall, mAP50 and mAP50-95 during training are shown in
Figure 19. After approximately 200 epochs, the curves have essentially converged.
5.4. Comparison of Detection Results
To verify that the three optimization strategies (adding a small-target detection layer, adding the GAM attention mechanism, and improving the loss function) yield performance gains, this section compares the detection results of the improved network with those of the baseline network. The experimental results are shown in
Table 5.
Table 5 presents six experiments. The first uses the original YOLOv8 model as a baseline. In the second experiment, the default CIoU loss is replaced with WIoU, leading to slightly decreased precision but improved recall, F1-score, and both mAP and mAP50–95. Experiment 3 adds the GAM attention module while retaining WIoU; this results in a marginal increase in recall and F1-score, but a slight drop in precision and mAP metrics. Experiment 4 replaces attention with a small-target detection layer, significantly boosting precision, recall, and mAP, indicating the layer’s effectiveness in enhancing small-object detection. Experiment 5 combines all three components, WIoU, GAM attention, and the small-target detection layer, achieving the highest values in mAP, mAP50–95, recall, and F1-score, despite a modest trade-off in precision. The sixth experiment trains the original YOLOv5 model as a reference point, which underperforms compared to all YOLOv8 variants, particularly in recall and mAP50–95.
Although each component may only provide limited improvement when used alone, the full combination in Experiment 5 achieves the best overall performance, particularly in terms of mAP50–95, which is a more comprehensive indicator of localization accuracy across different IoU thresholds. This demonstrates that the proposed components, GAM attention, WIoU loss, and the small-object detection layer, are complementary and synergistic when integrated.
To visually illustrate the improvements, nine representative image groups from the test set were selected. For each group, the top image shows detection results from the baseline YOLOv8 model, while the bottom shows results from the improved model. As shown in
Figure 20, the proposed enhancements result in clearer and more accurate detections, especially for small and low-contrast targets.
As shown in the figure, the comparison includes a total of nine image groups, arranged in three rows with three groups per row and numbered from Group 1 to Group 9.
In Groups 1–6, the number of detectable targets is relatively small. In most cases, both the original and improved detection models successfully identify the key targets. However, in Groups 1, 2, and 4, the original model fails to detect targets located near the image edges and performs poorly on partially visible objects. Although the improved model enhances detection performance in these scenarios, it misses the motorcycle target in Group 2. Groups 7–9 contain a larger number of targets, many of which are occluded, small in size, and occupy only a small portion of the image. While both models detect most targets, some remain undetected. Nevertheless, the improved model consistently identifies more targets than the original.
Overall, the enhanced detection algorithm exhibits superior performance compared to the original model.
5.5. Effect of Image Fusion on Detection
In the previous experiments, the detection model was trained using the fused images generated in
Section 3. To further validate the effectiveness of the fusion algorithm, this subsection compares the performance of models trained on single-modal images—i.e., visible-light or infrared images—prior to fusion.
The experimental results, summarized in
Table 6, show that detection on fused images matches or outperforms detection on visible-light-only or infrared-only inputs across all evaluation metrics. While precision is essentially unchanged, recall and the mAP values, particularly mAP50-95, show a more noticeable increase. These metrics are especially important for detecting small or partially occluded targets, where fine-grained localization and high recall directly affect detection robustness.
In safety-critical domains such as autonomous driving or security surveillance, even a marginal gain in recall or localization performance can significantly enhance system robustness, especially when the risk of missing a small or partially hidden target may result in severe consequences. We further analyzed the detection outputs and found that fused images are more effective in identifying occluded or low-contrast targets. To illustrate these cases, qualitative comparisons have been included in
Figure 21. Seven groups of images were selected from visible-light images, infrared images, and fused images for detection effect comparison. In each group, from left to right, the detection results of visible-light images, infrared images, and fused images are shown.
The first and second image groups were captured in low-light nighttime monitoring environments. In the first group, the fused image successfully detects all individuals, whereas both the infrared and visible-light images fail to identify some of the people present. In the second group, the visible-light image misses individuals located near the image edges and in darker regions, while both the fused and infrared images correctly detect all targets.
The third, fourth, and fifth groups depict smoke-filled environments, where heavy smoke significantly degrades the quality of visible-light images, obscuring the targets behind it. In contrast, the fused images are able to successfully detect all targets that are missed by the visible-light modality.
The sixth and seventh groups correspond to high-brightness environments with clear visible-light imagery. In these scenarios, both the visible-light and fused images detect all individuals effectively, including those partially occluded by trees in the sixth group.
Overall, in low-illumination and smoke-obscured environments, fused images outperform single-modal images in target detection. While precision is similar between the fused and visible-light inputs, the recall and mAP50–95 of fused images are noticeably higher. Specifically, mAP50–95 improves from 0.600 to 0.613 and recall improves from 0.858 to 0.859. These metrics are particularly important in applications involving small or occluded targets, where comprehensive detection with high recall and fine-grained localization with high mAP50–95 are critical.
6. Discussion
This study confirms that integrating a small-target detection layer, Global Attention Mechanism (GAM), and WIoU-based loss into a YOLO-based framework can significantly enhance detection performance on fused infrared–visible imagery. The proposed approach demonstrates strong adaptability across diverse environments, highlighting the advantages of multimodal fusion in improving detection accuracy and robustness. GAM enhances global feature perception by reinforcing both spatial and channel dependencies, while the small-target layer helps retain fine-grained positional cues critical for detecting small or distant objects. The use of WIoU as a loss function promotes more stable training and better alignment with challenging label distributions. These components work synergistically to address the demands of real-world applications such as intelligent surveillance, autonomous navigation, and public safety monitoring.
While each enhancement contributes specific benefits, there are also trade-offs to consider. For instance, increased model complexity may introduce higher computational requirements, which should be evaluated based on the deployment scenario. The method is particularly suitable for tasks requiring reliable performance under varying lighting, weather, and terrain conditions. Moreover, although the current framework focuses on static image-based detection, it can be extended to video analysis or real-time processing with appropriate adaptations.
Another important factor affecting detection performance is the quality of the fused images. Imperfect fusion results—such as spatial misalignment between modalities or low contrast in critical regions—can introduce noise or distortions that impair feature extraction and classification accuracy. In our experiments, we observed that while the model maintains a degree of robustness, degraded fusion quality does impact the confidence and localization precision, especially for small or partially occluded targets. These findings suggest that ensuring high-quality image fusion is crucial for the downstream detection task. Future research could explore adaptive fusion strategies that incorporate alignment correction or contrast enhancement as preprocessing steps.
7. Conclusions
We investigate multi-target detection based on fused infrared and visible-light images, aiming to enhance the accuracy, robustness, and all-weather applicability of vision-based surveillance systems. To adapt to the specific characteristics of the target detection dataset, three key improvements are proposed to the YOLO-based detection framework: (1) a small-target detection layer to better preserve positional information and improve the recognition of small objects; (2) the integration of the Global Attention Mechanism (GAM) to enhance the model’s ability to focus on critical features; and (3) the replacement of CIoU with WIoU in the loss function to improve robustness against low-quality data and promote better model convergence.
To support this work, we constructed a custom dataset of over 4000 fused infrared–visible image pairs, using both web-crawled content and style-transfer-generated pseudo-infrared images.
Experimental results demonstrate that our improved model outperforms single-modality baselines in terms of both detection accuracy and robustness. Ablation studies confirm the individual contribution of each component, and the model maintains reliable performance in challenging conditions such as nighttime, fog, and urban occlusion.
This approach shows significant promise for real-world applications, particularly in domains where reliability under adverse conditions is critical, such as autonomous driving, urban surveillance, disaster response, and search-and-rescue.
In future work, we plan to explore transformer-based fusion architectures to enhance long-range feature dependency modeling, and to evaluate the generalization ability of our method on public benchmarks such as the FLIR or KAIST datasets. We also aim to extend our model to video-based detection and real-time streaming scenarios using embedded hardware such as Jetson Xavier, enabling practical deployment in field robotics and mobile surveillance systems.
Author Contributions
Conceptualization, H.F. and Z.Q.; methodology, Z.Q.; software, Z.Q. and H.F.; validation, H.F., Z.Q. and Y.G.; formal analysis, H.F.; investigation, H.F.; resources, H.F.; data curation, H.R.; writing—original draft preparation, Z.Q. and H.F.; writing—review and editing, Y.G. and H.R.; visualization, Z.Q.; supervision, Z.D. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Acknowledgments
We thank the members of our lab for their technical support and insightful suggestions.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Anis, M. Integrating Visible-Infrared Fusion and Deep Learning for Enhanced Object Detection in Complex Environments. 2024. Available online: https://www.researchgate.net/publication/384994474_Integrating_Visible-Infrared_Fusion_and_Deep_Learning_for_Enhanced_Object_Detection_in_Complex_Environments (accessed on 14 June 2025).
- Ma, W.; Wang, K.; Li, J.; Yang, S.X.; Li, J.; Song, L.; Li, Q. Infrared and Visible Image Fusion Technology and Application: A Review. Sensors 2023, 23, 599.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral Pedestrian Detection: Benchmark Dataset and Baseline. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1037–1045.
- Jiang, C.; Ren, H.; Yang, H.; Huo, H.; Zhu, P.; Yao, Z.; Li, J.; Sun, M.; Yang, S. M2FNet: Multi-Modal Fusion Network for Object Detection from Visible and Thermal Infrared Images. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103918.
- Ma, J.; Ma, Y.; Li, C. Infrared and Visible Image Fusion Methods and Applications: A Survey. Inf. Fusion 2019, 45, 153–178.
- Liu, Y.; Dong, L.; Ji, Y.; Xu, W. Infrared and Visible Image Fusion through Details Preservation. Sensors 2019, 19, 4556.
- Yang, C.; He, Y.; Sun, C.; Chen, B.; Cao, J.; Wang, Y.; Hao, Q. Multi-Scale Convolutional Neural Networks and Saliency Weight Maps for Infrared and Visible Image Fusion. J. Vis. Commun. Image Represent. 2024, 98, 104015.
- Li, J.; Li, B.; Jiang, Y.; Tian, L.; Cai, W. MrFDDGAN: Multi-Receptive Field Feature Transfer and Dual Discriminator-Driven Generative Adversarial Network for Infrared and Color Visible Image Fusion. IEEE Trans. Instrum. Meas. 2023, 72, 1–28.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2016; pp. 21–37.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 3–19.
- Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck Attention Module. arXiv 2018, arXiv:1807.06514.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020; pp. 213–229.
- Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561.
- Toet, A. The TNO Multiband Image Data Collection. Data Brief 2017, 15, 249–251.
- Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A Unified Unsupervised Image Fusion Network. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 502–518.
- Takumi, K.; Watanabe, K.; Ha, Q.; Tejero-De-Pablos, A.; Ushiku, Y.; Harada, T. Multispectral Object Detection for Autonomous Vehicles. In Proceedings of the Thematic Workshops of ACM Multimedia, Mountain View, CA, USA, 23–27 October 2017; ACM: New York, NY, USA, 2017; pp. 35–43.
- Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-Aware Dual Adversarial Learning and a Multi-Scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811.
- Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A Visible-Infrared Paired Dataset for Low-Light Vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 3496–3504.
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates: Red Hook, NY, USA, 2017; Volume 30, pp. 6626–6637.
- Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Li, P.; Zhang, J. DIDFuse: Deep Image Decomposition for Infrared and Visible Image Fusion. arXiv 2020, arXiv:2003.09210.
- Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051.
Figure 1.
LLVIP dark environment dataset.
Figure 2.
Style transfer results using Pix2Pix.
Figure 3.
Partial pictures.
Figure 4.
Generator and discriminator loss curves during training.
Figure 5.
Algorithm fusion results.
Figure 6.
Original YOLOv8 feature extraction model.
Figure 7.
Feature extraction model adding new scale features.
Figure 8.
Improved feature fusion network.
Figure 9.
GAM attention mechanism module.
Figure 10.
GAM channel attention module.
Figure 11.
GAM spatial attention module.
Figure 12.
Improved YOLOv8 network model.
Figure 14.
Same IoU, different overlap methods.
Figure 15.
CIoU schematic.
Figure 16.
Parts of fusion image dataset.
Figure 17.
Image annotation using LabelImg.
Figure 18.
Loss change curve.
Figure 19.
The network’s indicator changes on the validation set.
Figure 20.
Comparison of detection effects before and after improvement.
Figure 21.
Comparison of detection effects between fusion images and single-modality images.
Table 1.
Comparison of existing methods vs. proposed approach.
| Method Type | Current Capabilities | Small-Target Detection |
|---|---|---|
| Existing Methods | Limited robustness in occlusion and noise; no dedicated attention for fused modalities; ineffective for small-target detection | Limited |
| Proposed Method | GAM-enhanced multi-scale representation; small-object detection layer integrated; loss function optimized for fusion input | Enhanced |
Table 2.
Comparison of datasets.
| Dataset | TNO | Roadscene | Multispectral | M3FD | LLVIP |
|---|---|---|---|---|---|
| Number of Image Pairs | 251 | 221 | 2999 | 4200 | 15,488 |
| Image Resolution | 768 × 576 | 768 × 576 | 768 × 576 | 1024 × 768 | 1920 × 1080 |
| Color Visible Light | No | Yes | Yes | Yes | Yes |
| Shooting Method | Static Horizontal | Driving Process | Driving Process | Multi-view | Surveillance |
| Number of Targets | Few | Few | Many | Many | Numerous |
| Annotated Target Files | No | No | Yes | Yes | Yes |
Table 3.
Comparison of fusion methods based on EN and SSIM.
| Method | EN | SSIM |
|---|---|---|
| Laplace | 6.461 | 0.931 |
| Ours | 7.193 | 0.974 |
Table 4.
Confusion matrix.
| Sample Category | Positive | Negative |
|---|---|---|
| True | TP | TN |
| False | FP | FN |
Table 5.
Comparison of experimental results.
| No. | Base Network | Small-Target Detection | GAM Attention | WIoU Loss | Precision | Recall | mAP | mAP50-95 | F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | YOLOv8 | × | × | × | 0.872 | 0.806 | 0.858 | 0.599 | 0.838 |
| 2 | YOLOv8 | × | × | √ | 0.871 | 0.818 | 0.861 | 0.603 | 0.844 |
| 3 | YOLOv8 | × | √ | √ | 0.869 | 0.823 | 0.855 | 0.584 | 0.845 |
| 4 | YOLOv8 | √ | × | √ | 0.902 | 0.856 | 0.897 | 0.612 | 0.869 |
| 5 | YOLOv8 | √ | √ | √ | 0.88 | 0.858 | 0.901 | 0.613 | 0.878 |
| 6 | YOLOv5 | × | × | × | 0.866 | 0.555 | 0.84 | 0.555 | 0.676 |
Table 6.
Performance comparison of different datasets.
| No. | Dataset | Precision | Recall | mAP | mAP50-95 | Inference Speed (ms) |
|---|---|---|---|---|---|---|
| 1 | Fusion | 0.88 | 0.859 | 0.901 | 0.613 | 1.582 |
| 2 | Visible Light | 0.88 | 0.858 | 0.887 | 0.6 | 1.582 |
| 3 | Infrared | 0.866 | 0.84 | 0.861 | 0.583 | 1.582 |
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).