3.2. Construction of a Visible-Light Small Target Dataset
To address the scarcity of annotated data, we construct a dataset containing 843 visible-light images with small targets. An improved CycleGAN model is then employed to translate these visible-light images into the infrared domain, generating pseudo-infrared images to augment the training data for infrared small target detection tasks. The targets in our custom dataset include aircraft, ships, birds, humans, masts, parachutes, hot air balloons, buoys, and surfboards. The backgrounds feature diverse scenes such as the ocean, urban environments, clouds, sky, rivers, and mountains. Compared to publicly available datasets, our dataset includes a wider range of target categories and more complex, diverse background scenarios. Example images are shown in Figure 2.
To generate high-quality pseudo-infrared images, we modify the original CycleGAN architecture in three key aspects. First, we replace both the generator and discriminator backbones with a U-Net architecture. The encoder–decoder framework and skip connections of U-Net preserve spatial details and resolution, which is especially beneficial for infrared scenes containing small targets or strong background noise. Second, we incorporate the CBAM module. This attention mechanism enables the network to focus automatically on discriminative channels and spatial regions, thereby amplifying the response of small targets while suppressing background interference. Third, to improve the geometric consistency and semantic integrity of target regions in the pseudo-infrared images, we add an IoU loss term to the original adversarial loss and cycle-consistency loss of CycleGAN. By leveraging the available target annotations, this IoU loss provides explicit regional supervision, guiding the generator to maintain the location, contour, and semantic features of targets while performing style translation. This addition alleviates target semantic shifts that arise from the different physical characteristics of the two modalities. We define the overall loss function for the visible-to-infrared style translation task using the improved CycleGAN model as follows:
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{\mathrm{GAN}}(G) + \mathcal{L}_{\mathrm{GAN}}(F) + \lambda\,\mathcal{L}_{\mathrm{cyc}}(G, F) + \mathcal{L}_{\mathrm{IoU}} \tag{1}$$
where $\lambda$ is a hyperparameter used to balance the importance of the cycle-consistency loss. When the weight is too small, the cycle-consistency constraint becomes insufficient, potentially causing semantic drift in the generated images; when it is too large, it may interfere with adversarial learning and reduce the quality of the generated images. Therefore, $\lambda$ is empirically set to 5. $\mathcal{L}_{\mathrm{GAN}}(G)$, $\mathcal{L}_{\mathrm{GAN}}(F)$, $\mathcal{L}_{\mathrm{cyc}}(G, F)$, and $\mathcal{L}_{\mathrm{IoU}}$ denote the adversarial losses for the generators $G$ and $F$, the cycle-consistency loss, and the IoU loss, respectively. The loss for the generator $G$ is formulated as follows:
$$\mathcal{L}_{\mathrm{GAN}}(G) = -\,\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D_Y(G(x))\big] \tag{2}$$
where $x \sim p_{\mathrm{data}}(x)$ denotes a sample drawn from the visible-light domain $X$. The generator $G$ aims to map visible-light images to the infrared domain, producing synthetic infrared images $G(x)$. The discriminator $D_Y$ is trained to distinguish between real infrared images and generated ones; it outputs a probability in $[0, 1]$ indicating the likelihood that an input image is real. By maximizing $\mathbb{E}_{x}\big[\log D_Y(G(x))\big]$, the generator is encouraged to produce outputs that are statistically indistinguishable from real infrared images, thereby improving the fidelity of domain translation. Since the loss is defined with a negative sign, minimizing it encourages the generator to produce more realistic images.
Similarly, the loss for the reverse mapping is defined as follows:
$$\mathcal{L}_{\mathrm{GAN}}(F) = -\,\mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\big[\log D_X(F(y))\big] \tag{3}$$
where $y \sim p_{\mathrm{data}}(y)$ denotes a real sample from the target domain $Y$, and $F$ is the generator mapping target-domain images back to the source domain.
However, the above two losses cannot ensure semantic consistency between input and output images. Therefore, we introduce a cycle-consistency loss defined as follows:
$$\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big] \tag{4}$$
This loss contains two terms: the first ensures that a source-domain image mapped to the target domain and back, $F(G(x))$, closely resembles the original input $x$; the second enforces the same property for $y$ from the target domain via $G(F(y))$. Here $\lVert \cdot \rVert_1$ denotes the $\ell_1$ norm, which measures pixel-wise differences and helps preserve structural and edge details.
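To make the construction concrete, the following is a minimal PyTorch sketch of the adversarial and cycle-consistency terms in Equations (2)–(4). It is illustrative only: the module names G, F, D_X, D_Y and the batch tensors in the usage comment are placeholders rather than our actual implementation, and the discriminators are assumed to output probabilities.

```python
import torch
import torch.nn.functional as Fn


def gan_generator_loss(d_fake: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Equations (2)/(3): -E[log D(G(x))], assuming D outputs probabilities."""
    return -torch.log(d_fake.clamp(min=eps)).mean()


def cycle_consistency_loss(real_x, rec_x, real_y, rec_y) -> torch.Tensor:
    """Equation (4): L1 distance between inputs and their cycle reconstructions."""
    return Fn.l1_loss(rec_x, real_x) + Fn.l1_loss(rec_y, real_y)


# Hypothetical training-step usage (G: visible -> infrared, F: infrared -> visible):
#   fake_ir  = G(real_vis);  rec_vis = F(fake_ir)
#   fake_vis = F(real_ir);   rec_ir  = G(fake_vis)
#   loss_gan = gan_generator_loss(D_Y(fake_ir)) + gan_generator_loss(D_X(fake_vis))
#   loss_cyc = cycle_consistency_loss(real_vis, rec_vis, real_ir, rec_ir)
```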
The IoU loss is defined as follows:
$$\mathcal{L}_{\mathrm{IoU}} = 1 - \frac{\lvert \hat{Y} \cap Y \rvert}{\lvert \hat{Y} \cup Y \rvert} \tag{5}$$
In Equation (5), $\hat{Y} \cap Y$ denotes the intersection between the predicted mask $\hat{Y}$ and the ground-truth mask $Y$, and $\lvert \hat{Y} \cap Y \rvert$ represents the number of pixels in this intersection; $\lvert \hat{Y} \cup Y \rvert$ is defined analogously for the union.
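Because the network outputs are continuous, the set-based IoU in Equation (5) is usually evaluated in a differentiable ("soft") form during training. The sketch below shows one such relaxation; this relaxation is an assumption for illustration, as the text only gives the set-based definition.

```python
import torch


def soft_iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Equation (5): 1 - |pred ∩ target| / |pred ∪ target| on soft masks.

    pred:   predicted mask probabilities in [0, 1], shape (B, 1, H, W)
    target: binary ground-truth mask of the same shape
    """
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()
```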
As a result, the improved CycleGAN not only possesses strong style adaptation capabilities but also preserves critical structural information of small targets in infrared images more accurately, providing more reliable pseudo-sample support for subsequent semi-supervised detection tasks. The generated pseudo-infrared images more closely resemble real infrared images and exhibit better adaptability in target detection tasks. Examples of the generated pseudo-infrared small target images are shown in Figure 3.
3.3. Wavelet-Enhanced Channel Recalibration and Fusion (WECRF) Module
We design an efficient and lightweight wavelet-enhanced channel recalibration and fusion (WECRF) module by integrating channel and spatial attention, channel weight recalibration, and feature fusion mechanisms. This module helps the model perform more precise segmentation of targets. The detailed structure of the module is shown in Figure 1.
After entering this module, the input features $X$ first undergo a wavelet transformation, producing one low-frequency component and three high-frequency components, each with half the spatial size of the original feature map. These four components are then concatenated to form a new feature map with a reduced spatial size and a channel dimension four times larger than the original:
$$X_{w} = \mathrm{Concat}\big(\mathrm{WT}(X)\big) = \mathrm{Concat}\big(X_{LL}, X_{LH}, X_{HL}, X_{HH}\big) \tag{6}$$
where $\mathrm{WT}(\cdot)$ denotes the wavelet transform and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation.
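A minimal sketch of this decomposition is shown below, assuming a single-level Haar transform realized as a fixed depthwise convolution; the wavelet basis and implementation details are not specified in the text and are chosen here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HaarWaveletDown(nn.Module):
    """Single-level 2D Haar DWT: (B, C, H, W) -> (B, 4C, H/2, W/2).

    The LL, LH, HL, and HH sub-bands are concatenated along the channel
    axis, halving the spatial size and quadrupling the channel count
    (H and W are assumed to be even).
    """

    def __init__(self, channels: int):
        super().__init__()
        kernels = {
            "ll": [[0.5, 0.5], [0.5, 0.5]],
            "lh": [[0.5, 0.5], [-0.5, -0.5]],
            "hl": [[0.5, -0.5], [0.5, -0.5]],
            "hh": [[0.5, -0.5], [-0.5, 0.5]],
        }
        for name, w in kernels.items():
            # one 2x2 filter replicated per channel, applied as a depthwise conv
            filt = torch.tensor(w).view(1, 1, 2, 2).repeat(channels, 1, 1, 1)
            self.register_buffer(name, filt)
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bands = [
            F.conv2d(x, filt, stride=2, groups=self.channels)
            for filt in (self.ll, self.lh, self.hl, self.hh)
        ]
        return torch.cat(bands, dim=1)  # Concat(X_LL, X_LH, X_HL, X_HH)
```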
To fuse these otherwise isolated wavelet components, a channel weight recalibration operation is applied. Specifically, the channels of the input tensor are first divided into $G$ groups, and the channels are then reordered so that channels from different groups are interleaved:
$$X_{r} = \delta\big(\mathrm{CS}(X_{w})\big) \tag{7}$$
where $\mathrm{CS}(\cdot)$ denotes the channel group reordering operation and $\delta(\cdot)$ denotes the ReLU function. After interleaving, the channel dimension is flattened, and a new tensor is returned with a recalibrated channel arrangement. The main purpose of this operation is to promote inter-channel interaction and enhance feature mixing by rearranging the channel order. This is motivated by the observation that in deep neural networks, each channel in a feature map typically captures specific attributes, such as edges, textures, or semantic content. In grouped convolutions or other grouped operations, channels within different groups remain independent, which limits information exchange and may lead to overly homogeneous features. By allowing information to be exchanged across groups through channel interleaving, the model gains richer feature representations and improved expressive capability.
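The reordering described here can be realized as a ShuffleNet-style channel shuffle; a short sketch is given below, where the group count is the module hyperparameter $G$.

```python
import torch


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups: divide the C channels into `groups`
    groups, reorder so that adjacent channels come from different groups,
    then flatten the channel dimension back (C must be divisible by groups)."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave members of different groups
    return x.view(b, c, h, w)                 # flatten back to (B, C, H, W)
```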
After that, we combine channel attention and spatial attention mechanisms to construct a multi-attention component that enhances feature representations. Channel attention reweights channels according to their global importance, while spatial attention calculates the correlation of the local spatial context using group normalization, further enhancing the feature representation. By combining these two attention mechanisms, the network's ability to focus on important regions is significantly improved, enabling better capture of local details in infrared small targets. As shown in Figure 4, the multi-attention module first reshapes the input feature map $x$ by grouping its channels, resulting in multiple sub-feature maps. Each group is then split into two parts ($X_1$ and $X_2$) to compute channel attention and spatial attention separately.
For channel attention, global average pooling is first applied to $X_1$ to obtain global information for each group. This global information is then passed through a linear transformation $F_c(\cdot)$, implemented with learnable weights and biases, to calculate the importance of each channel. A sigmoid activation function $\sigma(\cdot)$ is used to generate the channel attention map, which is subsequently applied to $X_1$ to reweight the channel features.
When computing spatial attention, group normalization is first applied to $X_2$ to extract spatial attention information. This is then adjusted using learnable weights and biases, followed by a sigmoid activation function to generate the spatial attention map, which is applied to $X_2$ to refine the spatial features. Finally, the attention-weighted feature maps $X_1'$ and $X_2'$ are concatenated to form a new feature map:
$$X' = \mathrm{Concat}\big(X_1', X_2'\big) \tag{8}$$
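The grouped channel/spatial attention described above closely resembles shuffle attention (SA-Net). The sketch below follows that design; the group count, parameter shapes, and initialization are assumptions for illustration rather than the exact WECRF configuration.

```python
import torch
import torch.nn as nn


class GroupedDualAttention(nn.Module):
    """Channel + spatial attention over grouped sub-features (SA-Net-style sketch).

    `channels` must be divisible by 2 * groups.
    """

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)          # channels per half-branch
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # learnable scale/shift for the channel branch (after GAP)
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))
        # learnable scale/shift for the spatial branch (after GroupNorm)
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))
        self.gn = nn.GroupNorm(c, c)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = x.view(b * self.groups, -1, h, w)      # group the channels
        x1, x2 = x.chunk(2, dim=1)                 # X1 -> channel, X2 -> spatial
        # channel attention: GAP -> affine -> sigmoid -> reweight X1
        a_c = self.sigmoid(self.cw * self.avg_pool(x1) + self.cb)
        x1 = x1 * a_c
        # spatial attention: GroupNorm -> affine -> sigmoid -> reweight X2
        a_s = self.sigmoid(self.sw * self.gn(x2) + self.sb)
        x2 = x2 * a_s
        out = torch.cat([x1, x2], dim=1)           # X' = Concat(X1', X2')
        return out.view(b, c, h, w)
```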
To further exploit the relationship between channels and spatial locations, we perform an additional channel weight recalibration after the hybrid attention module. Because channels are interrelated and the features of different wavelet components complement one another at global and local scales, simple concatenation cannot capture these dependencies effectively. Moreover, target responses to high-frequency features vary across spatial positions. To address this, we group the channels and then reorder them, thus breaking the isolation of features within a single wavelet component (for example, low-frequency or high-frequency bands). After reordering, each channel contains information from multiple wavelet components, enabling cross-dimensional interaction among components. Although spatial attention has already refined the spatial distribution of features, it does not eliminate the separation of features within channels. Channel reordering promotes the cross-channel fusion of spatial information, further uncovering the intrinsic connections between channel and spatial domains. This operation mitigates feature isolation, strengthens inter-channel interaction, and reduces feature redundancy, thereby enhancing the feature-representation capability of the WRSSNet model.
3.4. Loss Function
Our semi-supervised loss consists of three components: the loss $\mathcal{L}_{s}$ computed from real infrared images, the loss $\mathcal{L}_{p}$ computed from pseudo-infrared images, and the consistency loss $\mathcal{L}_{c}$ derived from unlabeled infrared images. The definitions of these three loss terms are as follows:
$$\mathcal{L}_{s} = \mathcal{L}_{\mathrm{IoU}}(P_{l}, Y_{l}) + \mathcal{L}_{\mathrm{BCE}}(P_{l}, Y_{l}) \tag{9}$$
$$\mathcal{L}_{p} = \mathcal{L}_{\mathrm{IoU}}(P_{g}, Y_{g}) + \mathcal{L}_{\mathrm{BCE}}(P_{g}, Y_{g}) \tag{10}$$
$$\mathcal{L}_{c} = \mathcal{L}_{\mathrm{MSE}}(P_{u}, \tilde{P}_{u}) \tag{11}$$
where $P_{l}$, $P_{g}$, and $P_{u}$ denote the network predictions on real infrared, pseudo-infrared, and unlabeled infrared images, $Y_{l}$ and $Y_{g}$ are the corresponding annotations, and $\tilde{P}_{u}$ is the prediction re-obtained after data augmentation of the unlabeled input.
In the above equations, $\mathcal{L}_{\mathrm{IoU}}$, $\mathcal{L}_{\mathrm{BCE}}$, and $\mathcal{L}_{\mathrm{MSE}}$ represent the IoU loss, BCE loss, and MSE loss, respectively. Based on this, our total loss is defined as follows:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{s} + \alpha\,\mathcal{L}_{p} + \beta\,\mathcal{L}_{c} \tag{12}$$
where $\alpha$ and $\beta$ are weighting hyperparameters, which are empirically initialized to 0.5 and 0.1, respectively.
The BCE loss is suitable for pixel-wise binary segmentation tasks (e.g., distinguishing targets from the background in infrared target detection). As a convex function, it is theoretically easy to optimize. However, when used alone, it may be biased towards the background if the target occupies only a very small portion of the image; we therefore combine it with the IoU loss to address class imbalance. In semi-supervised or multi-task learning, MSE is often used to assess the consistency between pseudo-labels and ground truths, providing a consistency constraint for the model. This loss is more sensitive to deviations between predictions and targets, making it suitable for tasks requiring precise alignment. It also avoids gradient explosion issues when predictions approach boundary values (e.g., 0 or 1), and it integrates well with other loss functions. Therefore, we adopt a combination of the IoU, BCE, and MSE losses as our total loss. The definitions of the BCE and MSE losses are as follows:
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big] \tag{13}$$
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\big(p_i - y_i\big)^2 \tag{14}$$
where $N$ denotes the number of samples, $p_i$ is the model-predicted output, and $y_i$ is the target label. In the context of semi-supervised learning, they correspond to the pseudo-label predicted by the model and the re-predicted pseudo-label after data augmentation, respectively.
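Putting the pieces together, the following sketch assembles the semi-supervised objective of Equations (9)–(12) from BCE, soft IoU, and MSE terms. The exact composition of $\mathcal{L}_{s}$ and $\mathcal{L}_{p}$ (an unweighted BCE + IoU sum here) and the stop-gradient on the unaugmented prediction are illustrative assumptions; tensor names are placeholders.

```python
import torch
import torch.nn.functional as F


def bce_iou_term(pred_logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Supervised term used for L_s and L_p: BCE + soft IoU (Eqs. (9)-(10))."""
    prob = torch.sigmoid(pred_logits)
    bce = F.binary_cross_entropy_with_logits(pred_logits, mask)
    inter = (prob * mask).sum(dim=(1, 2, 3))
    union = (prob + mask - prob * mask).sum(dim=(1, 2, 3))
    iou = (1.0 - (inter + 1e-6) / (union + 1e-6)).mean()
    return bce + iou


def total_loss(pred_real, mask_real, pred_pseudo, mask_pseudo,
               pred_unlab, pred_unlab_aug, alpha=0.5, beta=0.1):
    """L_total = L_s + alpha * L_p + beta * L_c (Eq. (12)), alpha=0.5, beta=0.1."""
    l_s = bce_iou_term(pred_real, mask_real)        # real infrared images
    l_p = bce_iou_term(pred_pseudo, mask_pseudo)    # pseudo-infrared images
    # Consistency term (Eq. (11)): MSE between the prediction on an unlabeled
    # image and the re-prediction after data augmentation. Detaching the
    # unaugmented prediction (teacher side) is a common choice, not mandated
    # by the text.
    l_c = F.mse_loss(torch.sigmoid(pred_unlab_aug),
                     torch.sigmoid(pred_unlab).detach())
    return l_s + alpha * l_p + beta * l_c
```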