Article

A Semi-Supervised Single-Image Deraining Algorithm Based on the Integration of Wavelet Transform and Swin Transformer

1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2 Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4325; https://doi.org/10.3390/app15084325
Submission received: 6 March 2025 / Revised: 31 March 2025 / Accepted: 10 April 2025 / Published: 14 April 2025

Abstract

Rain is a common meteorological phenomenon that degrades the visual quality of outdoor images. Rain streaks severely blur image details and negatively impact subsequent computer vision tasks. Because authentic photographs of rainfall are difficult to acquire, most deraining methods are developed on synthetic samples; the inherent differences between synthetic and real data therefore lead to poor generalization in practical applications. This study proposes a semi-supervised single-image rain removal approach that combines the Transformer with the wavelet transform. It fully exploits the feature information of rainy images, addressing the tendency of current methods to focus on network-structure innovation while neglecting the characteristics of rain streaks. The algorithm leverages the directional properties of the wavelet transform to decompose rainy images into multi-scale components, with networks of varying sizes generating rain streak maps across different directions and scales. By combining supervised and unsupervised training in a semi-supervised framework, the model improves both deraining performance and generalization capability. Additionally, a residual detail recovery network restores fine-grained image details, further enhancing the deraining effect in real-world scenarios. Comprehensive experiments on multiple standard datasets show that the proposed approach outperforms state-of-the-art rain removal algorithms. Its superiority is further evidenced by PSNR and SSIM values of 34.86 dB and 0.961 on the Rain1200 synthetic dataset, and NIQE and PIQE values of 11.52 and 9.13 on the RealRain dataset.

1. Introduction

Images taken on rainy days usually contain a large number of rain streaks, which degrade image quality increasingly as the rainfall intensity grows. Many vision systems require high-quality input images, so image deraining has become an important research topic in computer vision, aiming to restore a clear background free from rain streaks. The task is divided into video deraining and single-image deraining [1,2,3,4,5,6]. The latter cannot exploit inter-frame differences and is therefore more difficult to study, giving it great academic and practical significance. Existing single-image rain removal algorithms are mainly classified into traditional methods [7,8,9,10,11,12] and deep learning methods [13,14]. The latter have made significant progress on synthetic datasets, but they mostly focus on architectural innovations, ignore the multi-scale features of rain streaks, and are trained on artificially synthesized rain scenes, resulting in poor generalization to real scenes. Although semi-supervised learning methods [15,16,17,18] improve performance on real images, they still face problems in rain streak recognition and background detail recovery.
To address these problems, and considering that rain streaks in rainy images have a clear vertical or oblique downward directionality, this paper proposes a semi-supervised single-image rain removal network based on the multi-scale wavelet transform. The network is divided into supervised and unsupervised branches that share the same parameters. The wavelet transform decomposes the rainy image into components of different scales, and networks of corresponding scales are designed for each component. These branches are co-trained on labeled synthetic images and unlabeled real images, which improves the rain removal effect and enhances the generalization ability of the model, so that the background detail information of the image is better restored while the rain is removed.

2. Related Work and Theory

2.1. Wavelet Transform

Wavelet transform (WT) simultaneously analyzes both time (or spatial) and frequency domains, offering localization and multi-resolution properties [19]. It separates high-frequency rain streaks from low-frequency backgrounds, enabling targeted rain removal without affecting the image structure.

2.2. Attention Mechanism

The attention mechanism dynamically adjusts weights to focus on important features. Channel attention highlights channels with rain streak features, while spatial attention prioritizes regions with rain, preserving background details. Combining both improves deraining by focusing on key features and enhancing image quality.

2.3. Theoretical Analysis

The combination of wavelet transform and attention mechanisms creates a synergistic effect in both frequency and spatial domains. Wavelet transform decomposes rain streak features into high-frequency subbands, while the attention mechanism applies adaptive weighting in the channel and spatial dimensions, allowing for rain streak separation and background preservation. This synergy enhances deraining performance by optimizing feature extraction, region-based weighting, and adaptive handling of various rain conditions, ultimately improving the model’s generalization ability while retaining image details and structure.

2.3.1. Working Mechanism of Wavelet Transform

  • Frequency domain analysis characteristics: as a time-frequency analysis tool, wavelet transform can analyze signals in both time (or space) and frequency domains. Unlike Fourier transform, it has localization and multi-resolution properties, allowing it to extract image details at different scales, which is crucial for detecting rain streaks in deraining tasks. Rain streaks, as high-frequency noise, are concentrated in certain areas of the image and are captured in high-frequency components (HL, HH). Wavelet transform breaks the image into different scales, enabling focused processing of rain streaks while preserving background details by separating components from high-frequency small scales to low-frequency large scales.
  • Correlation between high-frequency components and rain streaks: since rain streaks are typically vertical or diagonal, they manifest as high-frequency features in the HL and HH sub-bands during wavelet decomposition, making them easier to capture. In contrast, the low-frequency sub-bands (LL, LH) retain horizontal details and background information. Deraining targets high-frequency subbands to remove or suppress rain streaks. The image is then restored using the inverse wavelet transform, combining all subbands to reconstruct the rain-free image. Since low-frequency subbands remain unaffected, the image structure is preserved, ensuring clarity and retention of original details.
  • Wavelet coefficient thresholding in practical applications: after extracting high-frequency rain streak features with the wavelet transform, thresholding techniques can be applied to further suppress noise. Rain streaks, being sparse signals in the high-frequency components, can be extracted, and irrelevant noise suppressed, by setting an appropriate threshold. This leverages the sparsity and locality of the signal, effectively reducing interference with the image background and enhancing deraining performance (a minimal sketch follows this list).
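As an illustration of the decomposition and thresholding steps above, the following minimal sketch (assuming the PyWavelets package and a grayscale image stored as a NumPy array; the wavelet choice and threshold value are illustrative) splits an image into its low- and high-frequency sub-bands, soft-thresholds the high-frequency coefficients where rain streaks concentrate, and reconstructs the image with the inverse transform.

```python
import numpy as np
import pywt

def wavelet_threshold_derain(image: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Single-level Haar decomposition, soft-thresholding of the high-frequency
    sub-bands, and inverse reconstruction. `image` is a 2-D float array in [0, 1];
    the threshold value is illustrative."""
    # pywt returns the approximation (LL) and the (horizontal, vertical, diagonal)
    # detail coefficients, i.e., the high-frequency sub-bands discussed above.
    LL, (LH, HL, HH) = pywt.dwt2(image, 'haar')

    # Soft-threshold only the high-frequency sub-bands; LL is left untouched,
    # so the overall image structure is preserved.
    LH, HL, HH = (pywt.threshold(c, threshold, mode='soft') for c in (LH, HL, HH))

    # Reconstruct the image from the processed sub-bands.
    return pywt.idwt2((LL, (LH, HL, HH)), 'haar')

# Example usage with a random "image" standing in for a rainy photograph.
rainy = np.random.rand(256, 256).astype(np.float32)
derained = wavelet_threshold_derain(rainy)
print(derained.shape)  # (256, 256)
```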

2.3.2. Working Mechanism of Attentional Mechanisms

  • Channel and spatial attention mechanisms: the channel attention mechanism adaptively adjusts the weights of feature channels to focus on important rain streak information while suppressing irrelevant ones. It captures global information through average pooling and generates weights via convolution, prioritizing channels with stronger rain streak features and adjusting to different rain conditions. The spatial attention mechanism weights the significance of different image regions, creating an attention map that highlights areas with concentrated rain streaks. It combines global and local information through max-pooling and average-pooling, ensuring that rain streaks are processed in key regions while preserving details in others.
  • Synergistic effect of channel and spatial attention: the spatial and channel attention mechanisms apply weighting to the feature map from different dimensions, with channel attention focusing on feature channels and spatial attention on regions. Combining these mechanisms creates a synergistic effect in deraining tasks. Channel attention prioritizes rain-related information, while spatial attention focuses on important regions to ensure accurate rain streak removal and preserve image details. In a multi-branch network, both attention mechanisms work together to enhance feature gathering, improving feature map learning and ultimately enhancing the quality of the derained image.

2.3.3. Combined Advantages of Wavelet Transform and Attention Mechanisms

Combining the wavelet transform with the attention mechanism significantly improves the rain removal effect, mainly in the following aspects:
  • Fine feature extraction and rain streak attention: the wavelet transform captures multi-scale details and high-frequency features of the image, while the attention mechanism enhances the focus on these high-frequency rain streaks. This combination allows the model to effectively remove rain streaks while preserving background details, avoiding excessive smoothing.
  • Rain removal and detail recovery: in the process of removing rain streaks, especially in the high-frequency part, the background details may be affected. Wavelet transform provides an efficient decomposition, while the attention mechanism can help the model to recognize and recover the details lost in the rain removal process. For example, the spatial attention mechanism enables better retention of background information, while the channel attention mechanism helps to remove rain streaks without losing important image details.
  • Improved generalization and inter-domain adaptability: the attention mechanism’s adaptive weighting, combined with the wavelet transform, boosts the model’s performance across various complex rainy scenes. It also addresses inter-domain differences between synthetic and real data, improving the model’s robustness and stability in real-world conditions.

3. Methodology

3.1. Overall Network Framework

This paper proposes a semi-supervised single-image deraining network based on the multi-scale wavelet transform (MSWT-SSIDA), aimed at enhancing deraining performance on both real and synthetic images. The overall network framework is shown in Figure 1. The framework is divided into two parts, a supervised learning part and an unsupervised learning part, which share the same network parameters. The synthetic rainy-image dataset and the real rainy-image dataset are both input into the multi-scale wavelet transform-based deraining network (MSWT-DN) for training, which outputs the derained synthetic and real images. The key difference is that the synthetic derained images are supervised with clean synthetic images, while the real derained images are trained in an unsupervised manner. Finally, the original rainy images and the network's output rain-free images are subtracted to obtain rain streaks, and the KL loss between the two distributions is computed to progressively reduce the difference, improving the generalization capability of the network model.

3.2. MSWT-DN Structure

3.2.1. Introduction to Network Components

The MSWT-SSIDA architecture proposed in this paper combines several specially designed components, each of which has a unique contribution to the rain removal process. Despite the complexity of the architecture, each component works independently and collaborates in a complementary manner to improve the overall performance. The modular design ensures that each component can be optimized independently, allowing for greater flexibility and maintainability in the system. As shown in Figure 2, the entire network consists of four main modules:
RSENet: this component is responsible for extracting the rain traces, helping the network to focus on the most relevant features for processing by subsequent modules. By separating the high-frequency rain trace portion through wavelet transform, RSENet accurately generates an approximate rain trace map, which provides key inputs for the deraining process.
IMARM: feature fusion and reconstruction through multi-scale attention mechanism enhances the network’s ability to extract features at different scales and improves the robustness of the network, especially when dealing with complex backgrounds. The IMARM enhances the ability to recognize rain trace regions by capturing details at different scales.
U-Former: based on the Swin Transformer structure, it improves the feature extraction ability, especially when dealing with long-distance dependencies in the image, and further improves the accuracy of rain removal. The introduction of U-Former enables the network to effectively capture the global context information in the image, and improves the network’s ability to deal with a wide range of rain marks.
DRANet: removes rain traces while restoring the details of the image’s background through spatial and channel attention mechanisms, avoiding the loss of background information. DRANet ensures the restoration of the background details while suppressing the interference in the rain trace region through residual learning and attention mechanisms.
In summary, the modules divide the work among distinct functions and collaborate to form an efficient rain removal network. Although the MSWT-SSIDA architecture contains multiple modules, its modular design allows each module to be optimized independently, improving the efficiency and maintainability of the whole system.

3.2.2. Multi-Scale Wavelet Transform

An application of the wavelet transform to an image is shown in Figure 3. The input rain streak map is first filtered horizontally: convolution with a low-pass smoothing filter and a high-pass detail filter extracts the low-frequency component, L, and the high-frequency component, H, of the map. Filtering is then performed in the vertical direction to yield four components: the low-frequency component (LL), which contains the original low-frequency information of the image; the component obtained through horizontal high-pass and vertical low-pass filtering (HL), which contains the high-frequency information of the horizontal filtering and the low-frequency information of the vertical filtering; the component obtained through horizontal low-pass and vertical high-pass filtering (LH), which contains the low-frequency information of the horizontal filtering and the high-frequency information of the vertical filtering; and the diagonal high-frequency component (HH), which contains diagonal high-frequency information. Because rain streaks fall vertically or obliquely, they are primarily concentrated in the high-frequency components HL and HH, while the horizontal structural details of the image are more prominently reflected in the low-frequency components LL and LH.
In this paper, synthetic and real rainy-image data are input into MSWT-DN. The rainy images are first processed by RSENet to generate an approximate rain streak map, and the images are then subjected to the wavelet transform to obtain four frequency-domain components: $R_{LL}$, $R_{LH}$, $R_{HL}$, and $R_{HH}$. Since rain streaks typically have a vertical or diagonal directionality, they are mainly concentrated in the $R_{HL}$ and $R_{HH}$ components, while the horizontal structural details of the image are mostly contained in the low-frequency components $R_{LL}$ and $R_{LH}$. For the $R_{HL}$ and $R_{HH}$ components, we propose an improved multi-scale attention residual module (IMARM) and a U-shaped deraining network architecture (U-Former) to focus on learning the information in these two components. IMARM gathers shallow features from the $R_{HL}$ and $R_{HH}$ components, performs feature fusion, and reconstructs them to create the fused feature representation. The fused feature representation is then concatenated with the approximate rain streak features obtained from RSENet and input into the U-Former deraining subnetwork, which outputs the derained $R_{HL}$ and $R_{HH}$ components for both the synthetic and real datasets. The low-frequency components $R_{LL}$ and $R_{LH}$, which contain fewer rain streak features, are input into the proposed residual detail restoration network (DRANet), which integrates spatial and channel attention to restore the image's background details. Moreover, we introduce an L1 loss for the branches corresponding to the different components, which measures the difference between the network's four derained components and the corresponding components of the clean image. Finally, an inverse wavelet transform fuses the four derained components into the final rain-free image.
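To make this data flow concrete, the following PyTorch sketch writes out a single-level Haar transform explicitly and routes the sub-bands through placeholder branches; the single-convolution modules stand in for RSENet, the IMARM/U-Former branch, and DRANet, which are not reproduced here. It is an illustrative simplification, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt(x):
    """Single-level 2-D Haar transform of an (N, C, H, W) tensor.
    Returns LL, LH, HL, HH sub-bands of size (N, C, H/2, W/2); following the
    paper's convention, HL/HH carry the (mostly vertical) rain-streak detail."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    LL = (a + b + c + d) / 2          # low-frequency background
    LH = (a + b - c - d) / 2          # differences along height: horizontal structures
    HL = (a - b + c - d) / 2          # differences along width: vertical structures
    HH = (a - b - c + d) / 2          # diagonal detail
    return LL, LH, HL, HH

def haar_idwt(LL, LH, HL, HH):
    """Inverse of haar_dwt: reassembles the full-resolution tensor."""
    n, ch, h, w = LL.shape
    out = LL.new_zeros(n, ch, 2 * h, 2 * w)
    out[..., 0::2, 0::2] = (LL + HL + LH + HH) / 2
    out[..., 0::2, 1::2] = (LL - HL + LH - HH) / 2
    out[..., 1::2, 0::2] = (LL + HL - LH - HH) / 2
    out[..., 1::2, 1::2] = (LL - HL - LH + HH) / 2
    return out

class MSWTDerainSketch(nn.Module):
    """Illustrative routing of wavelet components through separate branches."""
    def __init__(self):
        super().__init__()
        # Hypothetical stand-ins for RSENet, IMARM + U-Former, and DRANet.
        self.rain_streak_net = nn.Conv2d(3, 3, 3, padding=1)   # approximate rain streak map
        self.high_freq_branch = nn.Conv2d(9, 6, 3, padding=1)  # processes HL, HH (+ streak map)
        self.detail_branch = nn.Conv2d(6, 6, 3, padding=1)     # restores LL, LH detail

    def forward(self, rainy):
        streak_map = self.rain_streak_net(rainy)
        streak_small = F.avg_pool2d(streak_map, 2)              # match sub-band resolution
        LL, LH, HL, HH = haar_dwt(rainy)
        # High-frequency branch sees the streak-dominated sub-bands plus the streak map.
        hf = self.high_freq_branch(torch.cat([HL, HH, streak_small], dim=1))
        HL_out, HH_out = hf[:, :3], hf[:, 3:]
        # Low-frequency branch restores background detail in LL and LH.
        lf = self.detail_branch(torch.cat([LL, LH], dim=1))
        LL_out, LH_out = lf[:, :3], lf[:, 3:]
        return haar_idwt(LL_out, LH_out, HL_out, HH_out)

# Example: one forward pass on a dummy batch.
model = MSWTDerainSketch()
print(model(torch.rand(1, 3, 128, 128)).shape)  # torch.Size([1, 3, 128, 128])
```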

3.2.3. Model Parameter Analysis

The MSWT-DN is a complex cascade structure consisting of several deep modules, such as RSENet, IMSAR, U-Former, and DRANet. Due to its depth and multi-module design, the model has a relatively large number of parameters, making parameter analysis essential. This analysis not only helps us understand the computational complexity of the model, but also uncovers potential issues. By examining the parameters, we can gain insight into the contribution of each module to the network’s complexity and evaluate the computational overhead of the entire network across different tasks. The number of parameters reflects both the model’s expressive power and computational cost, providing valuable insights for future model optimization and improvement. The total number of parameters in the MSWT-DN network is approximately 45 million, with the following detailed breakdown:
IMSAR: due to the introduction of multi-scale convolutions and attention mechanisms, this module is responsible for extracting rain streak features across multiple scales. As a result, it has a high number of parameters, accounting for the largest proportion of the network’s total parameters.
U-Former module: based on the Transformer architecture, U-Former is designed to enhance long-range dependencies and global feature modeling. While this module significantly improves the model’s performance, it also leads to a larger parameter count.
RSENet and DRANet modules: these two modules have relatively fewer parameters. RSENet focuses on extracting rain streaks, while DRANet specializes in restoring background details. Their parameter counts are more compact and focused on specific tasks.
Despite the high parameter count of the MSWT-DN model, it provides a significant advantage in feature extraction and representation, enabling the model to handle complex deraining tasks effectively. However, the large number of parameters also implies higher computational demands, which may affect its efficiency and scalability in real-world applications. Therefore, further optimization and adjustment of the model are necessary during deployment to balance performance with computational resource requirements, ensuring its efficient application in practical scenarios.

3.3. Rain Streak Extraction Network (RSENet)

In single-image deraining tasks, the ideal situation would be one in which the true rain streak map could be acquired: by subtracting the corresponding rain streaks from the rainy image, the rain-free image could be obtained directly. However, this only occurs under ideal conditions. If an approximate rain streak map can be obtained instead, the same subtraction can still significantly reduce the rain streaks in the rainy image, lowering the difficulty of subsequent tasks.
Currently, most datasets do not contain rain streak data. Namhyn et al. [20] proposed a method for obtaining an approximate rain streak map, as shown in Equation (1).
$R_{i,j}^{\max} = \max_{c \in \{R, G, B\}} R_{i,j,c}^{diff}$
where $i$ and $j$ represent pixel indices, $c$ indexes the color channels, $R^{diff}$ denotes the difference between the rainy image and the corresponding rain-free image, and $R^{\max}$ is the resulting approximate rain streak map, which is a grayscale image.
During the training process in this paper, RSENet optimizes the network parameters by comparing the differences between the generated approximate rain trace map A and the real rain trace map. Specifically, the goal of RSENet is to make the generated approximate rain trace map as close as possible to the real rain trace map by minimizing the loss function. The formula is derived as follows:
$L = \sum_{i,j} \left( I_{rainy}(i,j) - I_{clean}(i,j) - I_{rain}^{approx}(i,j) \right)^{2}$
where $L$ is the mean squared error loss; $I_{rainy}(i,j)$ is the value of pixel location $(i,j)$ in the rainy image; $I_{clean}(i,j)$ is the value of pixel location $(i,j)$ in the rain-free image; and $I_{rain}^{approx}(i,j)$ is the value of pixel location $(i,j)$ in the approximate rain trace map generated by RSENet.
In practice, the approximate rain trace map generated by RSENet can be used to remove rain traces from an image.
$I_{clean}^{approx} = I_{rainy} - I_{rain}^{approx}$
In the above equation, given a rainy image $I_{rainy}$, the rain-free image $I_{clean}^{approx}$ can be obtained by subtracting the generated approximate rain trace map $I_{rain}^{approx}$ from the rainy image.
This process is based on the assumption of “ideal rain removal”: if the real rain map is known, the rain map can be subtracted directly from the rain image to obtain the removed image. By generating an approximate rain map, the rain removal operation can be approximated without the real rain map. Although the rain removal effect of this method is not as perfect as in the ideal case, it can significantly reduce the rain traces in the image and reduce the difficulty of the subsequent tasks.
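A minimal NumPy sketch of Equations (1) and (3), assuming the rainy and clean images are float arrays of shape (H, W, 3) in [0, 1] (the variable names and the clamping to [0, 1] are illustrative choices):

```python
import numpy as np

def approx_rain_streak_map(rainy: np.ndarray, clean: np.ndarray) -> np.ndarray:
    """Equation (1): channel-wise difference, then the maximum over R, G, B,
    yielding a single-channel (grayscale) approximate rain streak map."""
    diff = rainy - clean                    # R^diff
    return diff.max(axis=-1)                # R^max, shape (H, W)

def subtract_streaks(rainy: np.ndarray, streak_map: np.ndarray) -> np.ndarray:
    """Equation (3): subtract the approximate streak map from the rainy image,
    with an extra clamp to keep values in the valid range."""
    return np.clip(rainy - streak_map[..., None], 0.0, 1.0)

rainy = np.random.rand(64, 64, 3).astype(np.float32)
clean = np.clip(rainy - 0.1, 0, 1)          # stand-in for a paired clean image
streaks = approx_rain_streak_map(rainy, clean)
derained = subtract_streaks(rainy, streaks)
print(streaks.shape, derained.shape)        # (64, 64) (64, 64, 3)
```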
The rain streak extraction network (RSENet) proposed in this study consists of several consecutive channel and spatial attention blocks (CSAB) and skip connections, as shown in Figure 4. The CSAB module is based on the residual network [21], but differs from it: in addition to two convolutional layers and ReLU functions, we add channel and spatial attention mechanisms (CAM and SAM) to better capture deep features from the image. The residual skip connections address problems such as vanishing or exploding gradients caused by deep network layers. The advantages of the proposed approach are demonstrated experimentally later in the paper.

3.3.1. Channel Attention Mechanism

The channel attention mechanism (CAM) [22,23] assigns varying weights to different channels predicated on the importance of the details in each channel. This makes it possible for the network to concentrate more on the details in important channels while suppressing irrelevant information in less important channels, as shown in Equation (4).
$M_c = \sigma\left( W_2 \otimes \mathrm{ReLU}\left( W_1 \otimes GAP(F_{CAM}^{i}) + y_1 \right) + y_2 \right)$
In this equation, $F_{CAM}^{i}$ represents the input feature map of the channel attention module, $\sigma$ and ReLU denote the sigmoid and ReLU activation functions, respectively, $GAP$ refers to global average pooling, $\otimes$ denotes convolution, and $W_i$ and $y_i$ denote the convolution matrices and bias vectors, respectively.
In the channel attention mechanism, the input feature map is first passed through global average pooling to capture the overall spatial features, converting the input of dimensions $C \times H \times W$ to $C \times 1 \times 1$, where $C$, $H$, and $W$ denote the channel count, height, and width of the input feature map, respectively. The channel attention map $M_c$ is then obtained through two convolution layers with ReLU and sigmoid activations.
$F_{CAM}^{0} = F_{CAM}^{i} \odot M_c$
After obtaining the channel attention map $M_c$, the final output $F_{CAM}^{0}$ is computed as the element-wise multiplication of $F_{CAM}^{i}$ and $M_c$, as shown in Equation (5), where $\odot$ represents the element-wise multiplication operation.
The final output of the rain streak extraction network (RSENet) proposed in this paper has the same size as the source image. However, after the wavelet transform, the dimensions of the original image are reduced to half of their size. Therefore, a convolution operation is required to process the output image from RSENet.

3.3.2. Spatial Attention Mechanism

The spatial attention mechanism (SAM) [24,25] assigns different weights to different regions of the input feature map, reflecting the relevance of each region. The input feature map $F \in \mathbb{R}^{C \times H \times W}$ undergoes max-pooling and average-pooling operations, and the results are concatenated before passing through a convolution layer to extract a single-channel feature map. A sigmoid activation function is then applied to this feature map to produce the spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$, as shown in Equation (6).
$M_s = \sigma\left( Conv\left( Concat\left[ MP(F_{SAM}^{i}),\, AP(F_{SAM}^{i}) \right] \right) \right)$
where $Conv$ denotes the convolution operation, $MP$ and $AP$ represent the max-pooling and average-pooling operations, respectively, and $Concat$ denotes concatenation along the channel dimension.
$F_{SAM}^{0} = F_{SAM}^{i} \odot M_s$
The final output is obtained through element-wise multiplication of $F_{SAM}^{i}$ and $M_s$, as shown in Equation (7).
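The channel attention of Equations (4) and (5), the spatial attention of Equations (6) and (7), and their combination into a CSAB-style residual block can be sketched in PyTorch as follows; the kernel sizes and the channel-reduction ratio are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Eqs. (4)-(5): global average pooling -> two 1x1 convolutions -> sigmoid weights."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # GAP: (N, C, H, W) -> (N, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.mlp(x)                            # element-wise re-weighting

class SpatialAttention(nn.Module):
    """Eqs. (6)-(7): concat max- and average-pooled maps -> convolution -> sigmoid."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        max_map, _ = x.max(dim=1, keepdim=True)           # MP over channels
        avg_map = x.mean(dim=1, keepdim=True)             # AP over channels
        attn = self.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * attn

class CSAB(nn.Module):
    """Channel and spatial attention block with a residual skip connection."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            ChannelAttention(channels),
            SpatialAttention(),
        )

    def forward(self, x):
        return x + self.body(x)                           # residual connection

# Example: one block applied to a dummy feature map.
feats = torch.rand(1, 32, 64, 64)
print(CSAB(32)(feats).shape)  # torch.Size([1, 32, 64, 64])
```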
In this study, a multi-branch residual network incorporating channel and spatial attention mechanisms is introduced into the deraining network. The proposed method demonstrates the following advantages in deraining performance, and the experimental findings are presented later in the paper.
Improvement in rain trace removal: in the rain trace removal task on the Rain1200 dataset, the peak signal-to-noise ratio (PSNR) of MSWT-SSIDA reached 34.86 dB, a 5.24 dB improvement over the network without the channel and spatial attention mechanisms. Additionally, the structural similarity index (SSIM) increased by 0.085, reaching 0.961.
Improvement in computational efficiency: under the same hardware conditions, RSENet’s inference speed is 20% faster than the model without attention mechanisms, reducing the processing time per image by approximately 15 milliseconds.
Subsequent experimental results show that the mechanisms of spatial and channel attention provide significant advantages in single-image deraining tasks, enabling better preservation of image details, improved deraining performance, and enhanced computational efficiency, while also increasing the stability of the model.

3.4. Improved Multi-Scale Attention Residual Module (IMSAR)

This paper designs an improved multi-scale attention residual (IMSAR) module to achieve feature fusion and reconstruction of images. Here, MARB indicates the multi-scale attention residual block, and LFFB represents the local feature fusion block.
The IMSAR first extracts shallow features from the vertical high-frequency components HL and HH, then applies multiple MARBs and a local feature fusion block (LFFB) to fuse the extracted characteristics, and finally performs image reconstruction, as shown in Figure 5.
Shallow features are extracted from the image using a 3 × 3 convolutional kernel, as shown in Equation (8):
$S = Conv_{3\times3}(X_{Hi}), \quad i \in \{L, H\}$
where $Conv_{3\times3}$ denotes the convolution operation with a 3 × 3 kernel, and $S$ represents the output of shallow feature extraction.
The LFFB concatenates the feature maps from multiple channels and accomplishes feature fusion with a 1 × 1 convolution. A final 3 × 3 convolutional layer performs image reconstruction to obtain the derained output of this branch, as shown in Equation (9).
$y_1 = X_{Hi} + Conv_{3\times3}(I)$
where $X_{Hi}$ represents the high-frequency components from the wavelet transform, $I$ is the feature-fusion output, and $y_1$ is the final high-frequency derained output of this branch.
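A condensed PyTorch sketch of this flow follows; the MARBs are replaced by plain convolutional blocks for brevity, and the channel widths and block count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IMSARSketch(nn.Module):
    """Shallow 3x3 feature extraction, stacked blocks, 1x1 local feature fusion,
    3x3 reconstruction, and a residual connection back to the input (Eqs. 8-9)."""
    def __init__(self, channels: int = 32, num_blocks: int = 3):
        super().__init__()
        self.shallow = nn.Conv2d(3, channels, 3, padding=1)       # Eq. (8)
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(num_blocks)                             # stand-ins for MARBs
        ])
        self.fuse = nn.Conv2d(channels * num_blocks, channels, 1)  # LFFB: concat + 1x1 conv
        self.reconstruct = nn.Conv2d(channels, 3, 3, padding=1)    # final 3x3 reconstruction

    def forward(self, x_h):
        s = self.shallow(x_h)
        outs, feat = [], s
        for block in self.blocks:
            feat = block(feat)
            outs.append(feat)
        fused = self.fuse(torch.cat(outs, dim=1))
        return x_h + self.reconstruct(fused)                       # Eq. (9): residual output

# Example on a dummy high-frequency component (e.g., HL) of size H/2 x W/2.
x_hl = torch.rand(1, 3, 64, 64)
print(IMSARSketch()(x_hl).shape)  # torch.Size([1, 3, 64, 64])
```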

3.5. U-Former Deraining Subnetwork

The U-Former proposed in this paper is a symmetric encoder–decoder structure. The encoder processes the input data, increasing the number of channels through stacked convolutional layers to extract image features, while the decoder uses transposed convolutions to reconstruct the image. As shown in Figure 6, the deraining subnetwork includes three encoders, three decoders, and a bottleneck module. The U-Former uses Swin Transformer blocks (STBs) to replace the 3 × 3 convolutional layers of the original UNet, with each STB containing two submodules. Each submodule includes two LayerNorm layers, one MLP layer, and a window-based multi-head self-attention module. Additionally, to prevent the extracted features from being confined to small windows, the self-attention in the second submodule uses shifted-window self-attention to allow feature information exchange between different windows.
After processing with the IMSAR, the feature-fused image is concatenated with the approximate rain streak map obtained from RSENet and input into the U-Former deraining network, producing derained images at multiple scales. Given an input component $X_{ip} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 3}$, the U-Former first fuses the approximate rain streak map from the RSENet module, and then uses two convolution layers and two ReLU functions (Conv3×3—ReLU—Conv3×3—ReLU) to extract shallow features $F_{sf}$. The shallow features $F_{sf}$ are then fed into the symmetric encoder–decoder structure to extract deeper features. Each encoder–decoder layer contains $N_i$ ($i = 1, 2, 3$) STBs. During the decoding phase, the output of each decoder layer combines the STB output with a skip connection from the corresponding encoder output. Finally, the feature information from the encoder–decoder structure is processed through two convolution layers and two ReLU functions to obtain $F_{op} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 3}$, which is then passed through a residual connection to learn residual information and produce the final output $X_{op}$.

3.5.1. Self-Attention Mechanism

The Swin Transformer [26] modifies the Vision Transformer [27] by using a sliding window mechanism instead of global self-attention, significantly reducing the computational complexity.
Compared to traditional attention models, self-attention can learn the relationships between elements of the input sequence, improving comprehension of the contextual information within the sequence. The query–key–value mechanism is the essential component of self-attention. The query, key, and value matrices are three distinct matrices created by mapping the input sequence. The weights between elements of the input sequence are computed by taking the dot product between the query and key representations; the results are scaled and processed by the softmax function, and the final output is obtained by a dot product with the value matrix. The self-attention computation is shown in Equation (10).
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V$
where $d_k$ denotes the dimension of the key. To capture more information, the Swin Transformer employs multi-head self-attention (MSA), an extended form of self-attention that enhances representation capacity by applying multiple attention heads to the input in parallel. Its formula is shown in Equation (11).
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_0, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$
where $h$ indicates the number of attention heads, $W_0$ is the output projection matrix, and $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the projection matrices.
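The scaled dot-product attention of Equation (10) and a simple multi-head wrapper in the spirit of Equation (11) can be sketched as follows (head count and dimensions are illustrative; this is not the Swin Transformer implementation itself).

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    """Eq. (10): softmax(Q K^T / sqrt(d_k)) V for tensors of shape (..., seq, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    """Eq. (11): h parallel attention heads whose outputs are concatenated and projected."""
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(dim, dim) for _ in range(3))
        self.out_proj = nn.Linear(dim, dim)                     # W_0

    def forward(self, x):
        n, seq, dim = x.shape
        # Project and split into heads: (n, heads, seq, head_dim).
        def split(t):
            return t.view(n, seq, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        heads = scaled_dot_product_attention(q, k, v)
        # Concatenate heads and apply the output projection.
        heads = heads.transpose(1, 2).reshape(n, seq, dim)
        return self.out_proj(heads)

# Example: 16 tokens (e.g., a 4x4 attention window) with 64-dimensional features.
tokens = torch.rand(2, 16, 64)
print(MultiHeadAttention()(tokens).shape)  # torch.Size([2, 16, 64])
```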

3.5.2. STB

As shown in Figure 7, each STB contains two submodules. The input $F_{sf}$ passes through the first LayerNorm layer to obtain the normalized feature map $F_{sf} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times C}$, which then enters the STB for self-attention computation using W-MSA.
$M_i = \mathrm{softmax}\!\left( \frac{Q_i K_i^{T}}{\sqrt{d_k}} \right), \qquad \mathrm{head}_i = M_i V_i$
The feature map $F_{sf}$ is linearly mapped to obtain $Q$, $K$, and $V$, and, following the multi-head attention calculation described in Section 3.5.1, $Q$, $K$, and $V$ are divided into $h$ projection heads. Self-attention is computed for each projection space, as shown in Equation (12).
The self-attention maps obtained from the $h$ projection heads are concatenated, followed by a LayerNorm layer and an MLP module. The features from different depths are fused via short connections to obtain the final output of the first submodule, $F_{out}$.
The input to the second submodule of the STB is the output $F_{out}$ from the first submodule. The processing steps are similar to those of the first submodule, except that when calculating window-based self-attention, a shifted-window scheme moves the window toward the lower-right, enabling information exchange between different windows.
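The window partitioning and the cyclic shift used by the second submodule can be sketched as follows; the window size and shift are illustrative, and the attention masking that real Swin Transformer implementations apply across the wrap-around boundary is omitted for brevity.

```python
import torch

def window_partition(x, window_size: int):
    """Split an (N, H, W, C) feature map into non-overlapping windows
    of shape (num_windows * N, window_size * window_size, C)."""
    n, h, w, c = x.shape
    x = x.view(n, h // window_size, window_size, w // window_size, window_size, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, c)

def shifted_windows(x, window_size: int = 4):
    """Cyclically roll the feature map so that the window partition is effectively
    displaced toward the lower-right (SW-MSA vs. W-MSA), letting tokens near former
    window borders fall into the same window."""
    shift = window_size // 2
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    return window_partition(shifted, window_size)

# Example: an 8x8 feature map with 32 channels, 4x4 windows.
feat = torch.rand(1, 8, 8, 32)
print(window_partition(feat, 4).shape)  # torch.Size([4, 16, 32])
print(shifted_windows(feat, 4).shape)   # torch.Size([4, 16, 32])
```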
The U-Former deraining subnetwork replaces the 3 × 3 convolutional layers in the original UNet with STBs, offering several advantages:
Enhanced feature representation: STBs outperform traditional 3 × 3 convolutional layers in handling both local and global information. Traditional convolutional layers, with limited receptive fields, may fail to capture long-range dependencies, while STBs with window-based attention can more effectively integrate distant information. This improves feature extraction, capturing finer details and structural information in images, leading to clearer and more realistic image reconstruction results for deraining tasks.
Improved long-range dependency and global information integration: rainy images are highly complex, requiring the model to handle long-range dependencies and integrate global information. Traditional 3 × 3 convolutional layers may fail to capture distant correlations in the image. STBs, through self-attention and shifted window attention, better address long-range dependencies, effectively integrating global information, thus improving deraining performance.
Reduced parameter count and computational complexity: although STBs are generally more complex than traditional convolutional layers, their attention-based design allows a reduction in the number of parameters and in computational complexity. This is crucial for efficiently processing large-scale image data in deraining tasks, as it balances high performance with improved processing speed.
Modular design and scalability: STBs offer a strong modular design, making it easier to integrate and extend them within various network structures. By using STBs in the U-Former, we not only enhance feature extraction capabilities, but also make it easier to adjust and expand the model to address different image deraining tasks. In contrast, traditional convolutional layers, while simple, may not offer the flexibility required to adapt to complex network structures.
In conclusion, applying the Swin Transformer module to the U-Former significantly improves the network’s feature extraction capabilities while maintaining efficiency and flexibility, enabling the model to achieve higher performance in deraining tasks and better meet the demands of complex image processing challenges.

3.6. Residual Detail Restoration Network (DRANet)

For the LL and LH components obtained through horizontal and vertical filtering, this paper proposes a residual detail restoration network (DRANet). The inputs to this network are the LL and LH components. The input $X_{rainy,L}$ first passes through a convolutional layer to extract shallow features $F_{sf}$, and these shallow features are then processed by several consecutive channel and spatial attention blocks (CSAB) for deep feature extraction. Finally, the network output is generated by fusing the shallow and deep features. The architecture of this network is shown in Figure 8.
Since the LL and LH components contain relatively little information, especially under the influence of rain, their details and background information are often lost. DRANet is designed to focus on these components and restore the occluded image details. The residual structure allows information to propagate through skip connections, which helps preserve the original background information and reduces information loss. By employing residual learning, DRANet can effectively learn the mapping from input to output, improving its ability to restore details. DRANet uses multiple convolutional layers to gradually extract and restore information, making it especially suitable for recovering background details in the LL and LH components. This step-by-step processing better addresses the need to restore both salient and subtle features. The lightweight design of DRANet allows it to maintain high restoration performance while reducing computational resource consumption, making it adaptable to efficiency requirements in practical applications.

3.7. Component Complementarity Analysis

The semi-supervised single-image rain removal algorithm based on multi-scale wavelet transform proposed in this paper integrates several specialized components, each of which contributes to the overall rain removal performance. The following is a formal derivation of how the rain removal effect can be enhanced by the complementary nature of each component.

3.7.1. Analysis of RSENet

The main goal of the RSENet is to generate an approximation of the rain trace map, which is a key feature to distinguish rain traces from background details. The rain trace extraction process can be considered as an extraction task of high-frequency features aimed at capturing the rain trace structure in the image. Wavelet transform is used as a preprocessing step to decompose the image into different frequency components, helping to separate the high-frequency component (rain traces) from the low-frequency component (background information). This is shown in Equation (13).
$I_{input} = I_{LF} + I_{HF}$
where $I_{input}$ is the input image, which is decomposed by the wavelet transform into low- and high-frequency components: $I_{LF}$ represents the low-frequency background and $I_{HF}$ represents the high-frequency rain traces. The RSENet utilizes the high-frequency component $I_{HF}$ to generate an approximate rain trace map $I_{rain}$, which serves as the basis for the subsequent deraining modules.

3.7.2. Analysis of IMSAR

The IMSAR is designed to fuse multi-scale features from different network levels, focusing on fine-grained details and global context. The module is able to prioritize the most important features at different scales by combining spatial attention and channel attention mechanisms, as shown in Equation (14).
$F_{attended} = F \odot A_{spatial} \odot A_{channel}$
where $\odot$ denotes element-wise multiplication. Given a feature map $F$, the attention mechanism computes spatial and channel attention weights $A_{spatial}$ and $A_{channel}$, such that important features (e.g., rain traces and details) are preserved and enhanced. The multi-scale approach allows the network to capture features at different resolutions, thus improving its robustness to various rain trace structures.

3.7.3. Analysis of U-Former

The U-Former subnetwork enhances the feature extraction capability of the network by utilizing a Transformer-based mechanism to capture long-distance dependencies in images. It also reduces the parameter and computational complexity compared to the traditional CNN architecture, as shown in Equation (15).
$F_{output} = \mathrm{Decoder}(\mathrm{Encoder}(F_{input}))$
Among them, the U-Former subnetwork uses a Transformer-based encoder–decoder structure to extract multi-scale features. The encoder captures the global context by focusing on long-range dependencies, and the decoder refines local details. This allows the model to better understand complex patterns, such as rain trails that span multiple regions. By combining the local details provided by IMARM with the global context provided by U-Former, the network is able to achieve better performance in the rain removal task.

3.7.4. Analysis of DRANet

DRANet improves the network’s ability to recognize rain marks while recovering background details by utilizing spatial attention and channel attention mechanisms. The network efficiently recovers the details of the image while removing rain marks through residual learning. DRANet utilizes residual learning and attention mechanism to recover the background details, as shown in Equation (16).
$I_{restored} = I_{input} - I_{rain} + \mathrm{Attention}(I_{input})$
where $I_{rain}$ is the estimated rain trace map, and the attention mechanism selectively recovers background details through spatial and channel features.
$I_{restored} = I_{input} - I_{rain} + \mathrm{Attention}(I_{input})$
In Equation (17), $\sigma$ denotes the activation function, and $W_1$, $W_2$ are the learned weights. In this way, the model is able to effectively recover the background details while removing the rain marks.
Overall, the key to the success of the network lies in the complementarity between the individual components. Each module plays a specific role: the RSENet extracts rain traces, helping the network to focus on the most relevant features, the IMARM enhances the feature extraction process through a multi-scale attention mechanism, ensuring that details and global contextual information are captured efficiently and effectively, and the U-Former improves the capture of long-range dependencies and reduces computational complexity through the structure of the Transformer model. DRANet ensures that the background details are recovered while removing the rain traces, avoiding the introduction of artifacts. These components complement each other, and each module plays a role in different tasks (rain trace extraction, feature enhancement, global dependency capture, detail recovery), thus improving the performance of the whole rain removal process.

3.8. Model Optimization and Deployment Strategies

Although each component of the MSWT-DN model can effectively complete the task, its overall complexity may limit its deployment in resource-constrained environments, especially in scenarios with limited computing power, such as edge devices or mobile platforms. To address this challenge and ensure the efficient operation of the model in these environments, the following optimization strategies can be adopted:
  • Model pruning: by removing redundant or unimportant parameters, model pruning can significantly reduce storage requirements and improve inference speed without significantly affecting performance. This makes the model more suitable for real-time deployment, especially in devices with limited computing resources.
  • Quantization: quantization reduces memory usage and computational overhead by reducing the precision of weights and activations (for example, converting floating-point numbers to low-precision integers). It is particularly suitable for devices with limited hardware capabilities, as it can speed up computation and save memory (a minimal sketch of pruning and quantization follows this list).
  • Adaptive computing: adaptive computing allows the model to dynamically adjust the amount of computation based on the complexity of the input. For simple inputs, the model uses fewer layers or modules, thereby reducing computational requirements; for complex inputs, the full model is used to maintain high accuracy, which helps optimize the use of computing resources.
  • Hardware acceleration: using GPU, TPU or AI-specific hardware (such as NPU) to accelerate the inference process can significantly improve computing efficiency, especially for real-time applications and resource-constrained devices.
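As an illustration of the first two strategies, the snippet below applies PyTorch's built-in L1 unstructured pruning and dynamic quantization to a small stand-in model; the model, sparsity level, and quantized head are illustrative, and applying these utilities to the full MSWT-DN would follow the same pattern.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small stand-in model; the real MSWT-DN modules would be treated the same way.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

# 1) Pruning: zero out the 30% of convolution weights with the smallest magnitude.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # make the pruning permanent

# 2) Dynamic quantization: convert Linear layers to int8 at inference time.
#    (Dynamic quantization targets nn.Linear/nn.LSTM; convolutions would need
#    static quantization with calibration, omitted here for brevity.)
head = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
quantized_head = torch.quantization.quantize_dynamic(head, {nn.Linear}, dtype=torch.qint8)

x = torch.rand(1, 3, 32, 32)
print(model(x).shape, quantized_head(model(x)).shape)
```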
In summary, although the MSWT-DN performs well in rainfall tasks, its large number of parameters and complexity pose challenges when deployed in resource-constrained environments. Through optimization strategies such as model pruning, quantization, adaptive computing, and hardware acceleration, its computing efficiency and deployment feasibility can be significantly improved, ensuring that the model runs efficiently in a variety of practical application scenarios.

3.9. Loss Function

3.9.1. Supervised Loss

The L1 loss function, also known as the mean absolute error loss function, represents the average of the sum of the absolute values of the errors between the predicted and true values of the network output. Based on the L1 loss function, for the different branches of the rain streak components, this paper proposes the following loss function, as shown in Equation (18).
$L_1(S_{gt}, S_{op}) = \sum_{i \in \{LL, LH, HL, HH\}} L_1(S_{gt,i}, S_{op,i})$
where $S_{gt,i}$ represents the wavelet components of the clean, rain-free synthetic image (the ground truth), and $S_{op,i}$ represents the wavelet components of the derained image output by the network.
To better transfer synthetic rain to real rain, assuming that the synthetic rain follows a Gaussian process, the corresponding expression is as follows:
$P_s \sim \mathcal{N}(\mu_s, \Sigma_s)$
where $P_s$ represents the synthetic rain streaks, and $\mu_s$ and $\Sigma_s$ represent the mean and variance of the Gaussian distribution of the synthetic rain streaks.
A classical least-squares loss is employed as the supervised loss to train the network, minimizing the least-squares error between the input synthetic rainy image $S_{ip}$ and the network output $S_{op}$, as shown in the equation below:
$L_{Rs} = \sum_{i \in \{LL, LH, HL, HH\}} \sum_{M} \left\| S_{ip,i} - S_{op,i} \right\|_{F}^{2}$
where $i$ indexes the different branches after the wavelet transform, $S_{ip}$ is the input synthetic rainy image, and $S_{op}$ is the output derained image.
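A minimal sketch of the supervised branch loss of Equations (18) and (20), assuming the four wavelet components of the ground-truth and network-output images are held in dictionaries keyed by sub-band name (an illustrative data layout):

```python
import torch
import torch.nn.functional as F

SUBBANDS = ("LL", "LH", "HL", "HH")

def supervised_wavelet_loss(gt: dict, out: dict, use_l1: bool = True) -> torch.Tensor:
    """Sum of per-sub-band losses between ground-truth and derained components.
    `use_l1=True` gives the L1 loss of Eq. (18); `False` gives the squared
    Frobenius-norm (least-squares) loss of Eq. (20)."""
    loss = 0.0
    for band in SUBBANDS:
        if use_l1:
            loss = loss + F.l1_loss(out[band], gt[band])
        else:
            loss = loss + torch.sum((out[band] - gt[band]) ** 2)
    return loss

# Example with dummy components of size H/2 x W/2.
gt = {b: torch.rand(1, 3, 64, 64) for b in SUBBANDS}
out = {b: torch.rand(1, 3, 64, 64) for b in SUBBANDS}
print(supervised_wavelet_loss(gt, out).item())
```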

3.9.2. Unsupervised Loss

Due to technical limitations, it is challenging to acquire clean background images matching real rainy images, and it is even harder to accurately extract the rain layer from real rainy images. Therefore, a parameterized distribution is needed to represent the random distribution of real rain streaks. Generally, if the model is sufficiently complex and its coefficients are set properly, a Gaussian mixture model (GMM) can represent any data distribution. Since rain occurs at various distances from the camera and rain streaks are typically formed by raindrops of varying sizes and intensities, the effects of these raindrops at different positions may cause the observed rain streaks to exhibit a multi-modal characteristic. The GMM is well suited to capturing this multi-modal distribution and represents the superposition of the various raindrop features. Thus, the rain streaks are approximated using a GMM, and the corresponding expression is as follows:
$P_r \approx \sum_{k=1}^{K} \pi_k \, \mathcal{N}(P_r \mid \mu_{rk}, \Sigma_{rk})$
where $\pi_k$ represents the Gaussian mixture coefficients, and $P_r$ represents the difference between the real rainy image and the network's derained output, which together are used to estimate the real rain streaks. $\mu_{rk}$ and $\Sigma_{rk}$ represent the mean and variance of each Gaussian distribution in the GMM, $K$ indicates the number of Gaussian distributions, and $N$ is the number of real rainy images.
$L_{GMM} = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(P_r \mid 0, \Sigma_{rk})$
For convenience, the mean of the Gaussian process is set to 0, and the expression is given in Equation (22) above.
Furthermore, the expectation maximization (EM) algorithm is implemented to solve this loss function, representing the actual distribution of rain streaks. The target function is constructed for the unsupervised part, and its gradient is backpropagated to adjust the network parameters. Thus, the Gaussian mixture model applied to real rain streaks can be represented as:
$L_{GMM} = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(R_{ip} - R_{op} \mid 0, \Sigma_{rk})$
where $\pi_k$ represents the Gaussian mixture coefficients, $K$ represents the number of Gaussian distributions, and $R_{ip}$ and $R_{op}$ represent the input real rainy image and the real derained image output by the network, respectively.
Since the Gaussian mixture model can represent any continuous distribution, to better fit the real rain streak distribution during training, the Kullback–Leibler (KL) divergence between the Gaussian distribution learned from synthetic rain streaks and the GMM learned from real rain streaks is minimized. This allows the synthetic rain to transfer to real rain. Because the KL divergence is asymmetric, we first compute the KL divergence between the Gaussian distribution of synthetic rain streaks and each component of the GMM of real rain streaks, and then select the component with the smallest KL divergence as the real rain model, ensuring that the GMM learned from the real data has one component resembling rain. This can be represented as:
$D_{KL}(G_s \,\|\, GMM_r) = \min_{k} D_{KL}(G_s \,\|\, G_{r}^{k})$
where $G_s$ represents the Gaussian distribution extracted from the synthetic rain layer, $GMM_r$ represents the Gaussian mixture model extracted from the real rain streaks, and $G_{r}^{k}$ represents the $k$-th component of the Gaussian mixture model.
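The idea behind Equations (21)–(24) can be sketched as follows: a Gaussian mixture is fitted to the residual between the real rainy input and the network output, and the mixture component closest in KL divergence to the Gaussian learned from synthetic rain is selected. The sketch uses scikit-learn's GaussianMixture on per-pixel residuals and the closed-form KL divergence between univariate Gaussians; it illustrates the selection rule and is not the authors' EM-based training code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def kl_gaussians(mu1, var1, mu2, var2):
    """Closed-form KL divergence D_KL(N(mu1, var1) || N(mu2, var2)) for 1-D Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Residual between a real rainy image and the derained output (dummy data here),
# flattened to per-pixel values; this plays the role of P_r = R_ip - R_op.
residual = np.random.randn(128 * 128, 1) * 0.05

# Fit a K-component GMM to the real-rain residual (Eq. (21)).
gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(residual)

# Gaussian learned from synthetic rain streaks (Eq. (19)); values are illustrative.
mu_s, var_s = 0.0, 0.04 ** 2

# Eq. (24): pick the mixture component with the smallest KL divergence from G_s.
kls = [kl_gaussians(mu_s, var_s, gmm.means_[k, 0], gmm.covariances_[k, 0])
       for k in range(gmm.n_components)]
print("per-component KL:", np.round(kls, 4), "-> selected component:", int(np.argmin(kls)))
```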

3.9.3. Total Loss Function

The total loss of the MSWT-SSIDA algorithm is the combination of the supervised loss and the unsupervised loss, and can be represented as:
$L = \sum_{i \in \{LL, LH, HL, HH\}} \sum_{M} \left\| S_{ip,i} - S_{op,i} \right\|_{F}^{2} + \alpha \sum_{k=1}^{K} \pi_k \, \mathcal{N}(R_{ip} - R_{op} \mid 0, \Sigma_{rk}) + \beta \, D_{KL}(G_s \,\|\, GMM_r)$
where α and β are hyperparameters.

4. Experiments

4.1. Experimental Environment and Parameter Configuration

The proposed semi-supervised deraining network model based on the multi-scale wavelet transform was implemented using Python 3.8, PyTorch 2.0.0, and Ubuntu 20.04. The model was trained and tested on an NVIDIA GeForce RTX 3090 GPU.
Smaller input sizes reduce the computational load, accelerating both training and inference. Therefore, image patches of size 128 × 128 were randomly cropped from the training dataset and used as network input. A smaller batch size (e.g., 16) consumes less GPU memory per iteration, making the training process more flexible, so the batch size was set to 16. A smaller learning rate helps the model gradually learn features and capture important information in the data without large, erratic adjustments in the early stages, thereby avoiding the learning of incorrect patterns. Consequently, the Adam optimizer was used to tune the parameters, with an initial learning rate of 1 × 10−4. All subsequent experiments were conducted under the same configuration. The model was trained for 200 epochs; to help it fine-tune the parameters without oscillation near the optimal solution, the learning rate was kept constant for the first 100 epochs and then halved every 25 epochs for the remaining 100 epochs. This approach improved model convergence and stability.
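The schedule described above maps directly onto PyTorch's optimizer and scheduler utilities. The sketch below uses a placeholder model and dummy data (the full MSWT-DN and the real dataloaders are outside this excerpt): the learning rate stays constant for the first 100 epochs and is then halved every 25 epochs.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader, TensorDataset

model = nn.Conv2d(3, 3, 3, padding=1)                      # placeholder for MSWT-DN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 1e-4
# Constant for epochs 0-99, then halved at epochs 100, 125, 150, and 175.
scheduler = MultiStepLR(optimizer, milestones=[100, 125, 150, 175], gamma=0.5)

# Dummy dataset of 128x128 crops with batch size 16, mirroring the configuration above.
data = TensorDataset(torch.rand(64, 3, 128, 128), torch.rand(64, 3, 128, 128))
loader = DataLoader(data, batch_size=16, shuffle=True)

for epoch in range(200):
    for rainy, clean in loader:
        optimizer.zero_grad()
        loss = nn.functional.l1_loss(model(rainy), clean)  # stand-in for the total loss
        loss.backward()
        optimizer.step()
    scheduler.step()
```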

4.2. Datasets

To ensure comprehensive experimentation, both synthetic and real-world datasets were used to train and evaluate the network model. For supervised training, two commonly used public rainy-image datasets, Rain1200 [28] and Rain1400 [29], were used for training and evaluation. The Rain1200 dataset, established by Zhang et al., contains a total of 13,200 images, including 12,000 training images and 1200 test images; a sample from the Rain1200 dataset is shown in Figure 9. The Rain1400 dataset, created by Fu et al., includes 14,000 synthetic images, with 12,600 training images and 1400 test images; a sample is shown in Figure 10. For unsupervised training, the unlabeled real rainy dataset RealRain and 1100 real rainy images collected from the internet were used. The RealRain dataset consists of 3584 training images and 448 test images; a sample is shown in Figure 11.

4.3. Evaluation Metrics

For image quality evaluation, the peak signal-to-noise ratio (PSNR) [30] and structural similarity index (SSIM) [31] are the most widely used metrics for synthetic rainy images. PSNR is typically calculated from the mean squared error (MSE) between the corresponding pixels of two images. For real rainy images, the Natural Image Quality Evaluator (NIQE) [32] and Perception-Based Image Quality Evaluator (PIQE) [33] are employed as no-reference image quality metrics.
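For the full-reference metrics, PSNR and SSIM can be computed with scikit-image as sketched below; NIQE and PIQE are no-reference metrics that require separate implementations and are omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Dummy derained result and ground truth; in practice these are the network output
# and the clean reference image, as float arrays in [0, 1].
ground_truth = np.random.rand(128, 128, 3).astype(np.float64)
derained = np.clip(ground_truth + np.random.normal(0, 0.02, ground_truth.shape), 0, 1)

psnr = peak_signal_noise_ratio(ground_truth, derained, data_range=1.0)
ssim = structural_similarity(ground_truth, derained, data_range=1.0, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```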

4.4. Ablation Study

To confirm the efficacy and superiority of the proposed MSWT-SSIDA network structure and loss function, ablation tests were carried out to evaluate the effects of different factors, including the network branches, modules, and loss functions.

4.4.1. Ablation of Network Structure

In this investigation, we proposed a semi-supervised arrangement that uses supervised and unsupervised branches to jointly train the model on synthetic and real training images. The network structure of the two branches is identical, and the supervised and unsupervised loss functions are combined to constrain the network training. To confirm the efficacy of this supervised/unsupervised branch structure, we removed the unsupervised branch and conducted an experiment using only the synthetic dataset for training; the network was optimized solely on the labeled synthetic data, without utilizing information from real data. Table 1 displays the outcomes of the experiment. “Input” refers to the untreated real rainy images, “MSWT-DN” refers to the model trained only on the synthetic Rain1200 dataset, and “MSWT-SSIDA” refers to the model trained with the proposed method on both Rain1200 and real data.
In Table 1, by comparing the NIQE and PIQE values, it can be observed that removing the unsupervised branch significantly degraded the deraining performance, particularly when handling actual data, where the generalization ability was poor. This illustrates the efficacy of the proposed semi-supervised approach.

4.4.2. Ablation of Network Modules

This section investigates the effect of each module on the deraining performance of the network through ablation experiments. The experiments were conducted by training on the synthetic Rain1200 dataset and evaluating the PSNR and SSIM values on its test set, as well as training on the real RealRain dataset and evaluating the NIQE and PIQE values on its test set. The effectiveness of the network modules was demonstrated through these evaluations. Table 2 displays the relevant ablation results.
To confirm the wavelet transform's role, we removed the wavelet transform module from the network. The approximate rain streak map generated by RSENet and the input rainy image were passed directly through the IMARM, U-Former, and DRANet subnetworks, while keeping other conditions constant during training and testing. As shown in Table 2, removing the wavelet transform resulted in a decrease of 1.51 dB in PSNR and 0.010 in SSIM compared to the proposed MSWT-SSIDA algorithm. This indicates that the wavelet transform is crucial for better localizing the rain streaks in rainy images and improving the overall deraining performance.
Similarly, to confirm the effectiveness of the RSENet module for rain streak extraction, the IMARM for feature fusion and reconstruction, and the U-Former module for enhanced feature extraction, experiments were conducted by removing each module in turn while keeping all other conditions unchanged. After removing the RSENet module, PSNR and SSIM decreased by 0.35 dB and 0.003, respectively, showing that RSENet noticeably enhances the network's deraining capability. Likewise, removing the IMARM decreased PSNR and SSIM by 0.35 dB and 0.003, respectively, indicating that IMARM also improves deraining performance. In the ablation where the U-Former subnetwork was replaced by the original U-Net 3 × 3 convolutional layers, with other conditions unchanged, PSNR and SSIM dropped by 0.26 dB and 0.008 compared with the network using the proposed U-Former module, indicating that the U-Former module performs better in the deraining task.
Finally, Table 2 also reports the ablation of the network without channel and spatial attention mechanisms. On the Rain1200 dataset, the proposed MSWT-SSIDA algorithm improved PSNR by 5.24 dB and SSIM by 0.085 (reaching 0.961) relative to the network without these attention mechanisms, further demonstrating that introducing channel and spatial attention effectively improves the network's deraining capability.
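As a generic illustration of the kind of mechanism ablated here, the sketch below combines SE-style channel attention with a 7 × 7 spatial attention map; the layer sizes and ordering are assumptions and do not reproduce the paper's CSAB exactly.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Generic channel-then-spatial attention block (illustrative, not the paper's CSAB)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)                     # re-weight feature channels
        avg_map = torch.mean(x, dim=1, keepdim=True)     # per-pixel average over channels
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # per-pixel maximum over channels
        attn = self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        return x * attn                                  # re-weight spatial positions

# Usage: out = ChannelSpatialAttention(64)(torch.randn(1, 64, 128, 128))
```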

4.4.3. Ablation of Loss Function

To validate that the two loss terms in the unsupervised branch both contribute to improving model performance, ablation experiments were carried out on the loss functions; the results are shown in Table 3. In Table 3, Q1 denotes using only the GMM loss $L_{GMM}$ in the unsupervised branch, Q2 denotes using only the KL divergence loss $D_{KL}$, and Q3 denotes the full unsupervised loss of the proposed method, which combines $L_{GMM}$ and $D_{KL}$. As Table 3 shows, both loss terms enhance the deraining performance to varying degrees.
In summary, with these loss functions the network not only uses labeled synthetic data for supervised training but also captures the distribution of real data in the unsupervised branch. Compared with existing algorithms, the method proposed in this paper therefore achieves better generalization and practicality.
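The exact formulations of $L_{GMM}$ and $D_{KL}$ are defined earlier in the paper and are not reproduced here; purely as an illustration of the pattern, the sketch below combines a Gaussian-mixture negative log-likelihood over the estimated rain layer with a toy KL-divergence regularizer. All mixture parameters, the reference distribution, and the weighting are placeholder assumptions.

```python
import math
import torch
import torch.nn.functional as F

def gmm_nll(rain_residual, means, log_stds, weights):
    """Negative log-likelihood of rain-layer values under a 1-D Gaussian mixture
    (placeholder parameters; the paper's actual GMM term may differ)."""
    x = rain_residual.reshape(-1, 1)                                   # (N, 1)
    log_probs = (torch.log(weights)
                 - log_stds
                 - 0.5 * math.log(2 * math.pi)
                 - 0.5 * ((x - means) / log_stds.exp()) ** 2)          # (N, K)
    return -torch.logsumexp(log_probs, dim=1).mean()

def unsupervised_loss(rainy_real, derained_real, lambda_kl=0.1):
    rain_residual = rainy_real - derained_real                         # estimated rain layer
    # Illustrative 3-component mixture (in practice learned or pre-fitted).
    means = torch.tensor([[0.0, 0.1, 0.3]])
    log_stds = torch.tensor([[-2.0, -1.5, -1.0]])
    weights = torch.tensor([[0.6, 0.3, 0.1]])
    loss_gmm = gmm_nll(rain_residual, means, log_stds, weights)
    # Toy KL term: a softmax-normalized distribution over a sample of residual
    # values pulled toward a uniform reference (a stand-in for the paper's D_KL).
    p = torch.softmax(rain_residual.flatten()[:1024], dim=0)
    q = torch.full_like(p, 1.0 / p.numel())
    loss_kl = F.kl_div(p.log(), q, reduction="batchmean")
    return loss_gmm + lambda_kl * loss_kl

# Usage with placeholder tensors:
# loss = unsupervised_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```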

4.5. Comparison Experiments

4.5.1. Comparative Experiments with FFT and DCT

To verify the advantages of the wavelet transform in the deraining task, we designed comparison experiments against two commonly used frequency-domain methods, the fast Fourier transform (FFT) and the discrete cosine transform (DCT). These experiments further validate the effectiveness of the wavelet transform for removing high-frequency rain streaks and recovering background details.
1. Experimental design
This section compares three methods: the multi-scale wavelet transform (MSWT), the fast Fourier transform (FFT), and the discrete cosine transform (DCT). Experiments used the Rain1200 and Rain1400 synthetic datasets for training and the RealRain dataset for unsupervised testing. For fairness, all methods were trained with the same deraining network architecture and consistent hyperparameters (learning rate, batch size). Each method transforms the rainy image into the frequency domain before it is processed by the deraining network (a small sketch of the three front-end transforms is given at the end of this subsection).
For evaluation, PSNR, SSIM, NIQE, and PIQE were used: PSNR and SSIM assess quality against clean references on the synthetic datasets, while NIQE and PIQE assess no-reference quality on real rainy images. The experiments compare the effectiveness and limitations of each method on synthetic and real datasets.
2. Experimental results and conclusions
The quantitative comparison of the multi-scale wavelet transform (MSWT), fast Fourier transform (FFT), and discrete cosine transform (DCT) on the Rain1200, Rain1400, and RealRain datasets is shown in Table 4, which lists the PSNR, SSIM, NIQE, and PIQE values of each method on each dataset.
As can be seen from Table 4, the wavelet transform outperforms FFT and DCT in PSNR and SSIM on both the Rain1200 and Rain1400 datasets. On the RealRain dataset, it also clearly outperforms FFT and DCT in NIQE (11.52) and PIQE (9.13), indicating that the wavelet transform removes rain streaks while better preserving background details and producing higher-quality images. Compared with FFT and DCT, the wavelet transform therefore yields a clear improvement in deraining quality, background detail recovery, and overall image quality, especially on real rainy images.
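For concreteness, the sketch below shows one plausible way the three frequency-domain front-ends could be produced from a rainy image before it enters the shared deraining network; the library calls are standard NumPy/SciPy/PyWavelets, while the choice of channels fed to the network is an assumption.

```python
import numpy as np
import pywt
from numpy.fft import fft2
from scipy.fftpack import dct

rainy = np.random.rand(256, 256).astype(np.float64)   # placeholder rainy image

# FFT front-end: complex spectrum, commonly split into magnitude and phase channels.
spectrum = fft2(rainy)
fft_channels = np.stack([np.abs(spectrum), np.angle(spectrum)], axis=0)

# DCT front-end: 2-D separable type-II DCT applied along both axes.
dct_coeffs = dct(dct(rainy, axis=0, norm="ortho"), axis=1, norm="ortho")

# MSWT front-end: one DWT level gives an approximation plus three directional
# sub-bands, which can be stacked as input channels for the deraining network.
LL, (LH, HL, HH) = pywt.dwt2(rainy, "haar")
mswt_channels = np.stack([LL, LH, HL, HH], axis=0)

print(fft_channels.shape, dct_coeffs.shape, mswt_channels.shape)
# (2, 256, 256) (256, 256) (4, 128, 128)
```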

4.5.2. Results on Synthetic Datasets

To confirm the effectiveness and advancement of the proposed algorithm, we compared it against several state-of-the-art deraining algorithms, including the supervised methods DARAGNet [34], DSMNet [35], IDT [36], and RCHNet [37], and the semi-supervised methods MOSS [15], LSNet [17], and SSID_KD [38]. For a fair comparison, all compared algorithms were retrained and tested on the synthetic Rain1200 and Rain1400 datasets, and the PSNR and SSIM values were recomputed in MATLAB; larger values indicate better results.
1. Quantitative Comparison
The quantitative results on the Rain1200 and Rain1400 datasets are displayed in Table 5.
From Table 5, it is evident that on the Rain1200 dataset the proposed method outperforms the semi-supervised SSID_KD algorithm, with a PSNR gain of 0.33 dB and an SSIM gain of 0.017, achieving the best values. On the Rain1400 dataset, the proposed algorithm improves PSNR and SSIM over most other algorithms and is second only to the supervised RCHNet in both metrics.
2. Qualitative Comparison
Figure 12 shows the deraining results on the Rain1200 dataset, where the rain streaks are relatively simple. Most algorithms removed the rain streaks effectively, but some left residual streaks or distortions: DARAGNet, DSMNet, and IDT left streaks around edge details, while RCHNet introduced slight color distortions in the background. The semi-supervised MOSS and LSNet removed the rain but blurred the background, whereas SSID_KD and the proposed algorithm removed the rain while preserving background details.
The deraining results on the Rain1400 dataset are shown in Figure 13. On this more complex dataset, with stronger rain streaks and more varied rain scenes, the supervised algorithms DARAGNet, DSMNet, and IDT leave small residual rain streaks in the background, while MOSS shows some background artifacts. The semi-supervised LSNet and SSID_KD also retain residual rain streaks that degrade visual quality to varying degrees. In contrast, both RCHNet and the proposed algorithm deliver better deraining results while preserving background details.

4.5.3. Results on Real Datasets

Additional tests were performed on the real RealRain dataset to evaluate the proposed algorithm's practical applicability. Because clean reference images corresponding to real rainy images are unavailable, the no-reference NIQE and PIQE metrics were used to assess the deraining performance of the different methods; lower values indicate better results.
1. Quantitative Comparison
Table 6 reports the quantitative results on the RealRain dataset, with the best and second-best scores marked in bold and underlined, respectively.
From Table 6, the proposed algorithm achieves the best values, with an NIQE of 11.52 and a PIQE of 9.13, while SSID_KD achieves the second-best values, clearly demonstrating the superior deraining performance of the proposed algorithm in real-world applications.
2. Qualitative Comparison
Figure 14 presents the qualitative results of different deraining algorithms on the RealRain dataset. From Figure 14, it is evident that DARAGNet and DSMNet leave some residual rain streaks and produce blurry restored backgrounds. IDT and RCHNet eliminate most rain streaks but over-smooth the background, reducing image quality and still leaving some residual streaks. MOSS and LSNet suppress most rain streaks, but some large streaks are still not handled effectively. Compared with the other methods, SSID_KD and the proposed method effectively identify and remove rain streaks, and as seen in Figure 14, the proposed method leaves the fewest residual rain streaks while preserving good image details, achieving excellent deraining results.

5. Limitations and Discussion

Although the MSWT-DN model achieves strong performance in deraining tasks, its performance is still limited under extreme rainfall and complex backgrounds. In scenes with heavy rain or cluttered backgrounds, the model may struggle to distinguish rain streaks from background details, reducing the deraining effect. In addition, the model's high computational complexity, stemming from its multi-module design and large number of parameters, limits real-time deployment in resource-constrained environments such as edge devices and mobile platforms.
These limitations make the model challenging to apply in scenarios with severe weather and strict computational requirements. To improve the broad applicability of MSWT-DN in practice, future research could focus on the following directions:
  • Enhance robustness under extreme rainfall: the current model's performance drops significantly under heavy rainfall. Future work could introduce more targeted rainfall features or design specialized modules to improve performance in severe weather. Incorporating temporal information (e.g., temporal modeling across video frames) may also improve performance in dynamic, real-time applications.
  • Optimize computational efficiency: the model's computational overhead limits deployment on low-resource devices. Techniques such as model pruning, quantization, and knowledge distillation could reduce the parameter count and accelerate inference, improving efficiency in real-time applications. Adaptive computation, which adjusts the amount of computation according to input complexity, could further reduce unnecessary cost.
  • Multimodal data fusion: fusing multimodal data (e.g., depth maps and infrared images) could provide richer contextual information, helping to separate rain streaks from background elements more accurately, especially in complex backgrounds and severe weather, and further improving robustness.

6. Conclusions

This study presents a semi-supervised single-image deraining algorithm based on the multi-scale wavelet transform (MSWT-SSIDA), aimed at addressing the key problem of rain streak removal and, in particular, improving the algorithm's generalization to real-world scenarios. By incorporating the multi-scale wavelet transform, the algorithm decomposes rainy images into multiple components at different scales, allowing networks of different sizes to better identify diverse rain streak and raindrop types; this enables the model to capture richer rain features and improves deraining performance. The semi-supervised learning framework further strengthens the model by combining supervised learning on synthetic data with unsupervised learning on real data, enabling better identification of complex rain streaks and improving deraining performance in real-world settings. Experimental results show that the MSWT-SSIDA algorithm outperforms existing deraining techniques across multiple benchmark datasets, especially on real-world rainy images, demonstrating its adaptability and robustness. In addition, the residual detail restoration network (DRANet) effectively restores image details while removing rain streaks, preserving background detail, and improving image clarity. Overall, the algorithm performs strongly in both synthetic and real-world scenarios, confirming its effectiveness and superiority.

Author Contributions

X.L.: Conceptualization; formal analysis; funding acquisition; investigation; writing—review and editing. Y.H.: conceptualization; formal analysis; investigation; methodology; validation; writing—original draft; writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61462055, with financial support provided by Xin Yan and Xiaoyan Liu (funding amount: 500 CNY).

Data Availability Statement

The datasets used in this study (Rain1200, Rain1400, and RealRain) are publicly available for research purposes; some of the experimental datasets can be accessed at https://github.com/hezhangsprinter/DID-MDN (accessed on 9 April 2025) and https://xueyangfu.github.io/projects/cvpr2017.html (accessed on 9 April 2025). Some of the data generated in this study are not publicly available due to privacy and ethical considerations. The training and test set data are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ma, L.; Liu, R.; Zhang, X.; Zhong, W.; Fan, X. Video deraining via temporal aggregation-and-guidance. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar]
  2. Yan, W.; Tan, R.T.; Yang, W.; Dai, D. Self-aligned video deraining with transmission-depth consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 11966–11976. [Google Scholar]
  3. Kulkarni, A.; Patil, P.W.; Murala, S. Progressive subtractive recurrent lightweight network for video deraining. IEEE Signal Process. Lett. 2021, 29, 229–233. [Google Scholar] [CrossRef]
  4. Li, M.; Cao, X.; Zhao, Q.; Zhang, L.; Meng, D. Online rain/snow removal from surveillance videos. IEEE Trans. Image Process. 2021, 30, 2029–2044. [Google Scholar] [CrossRef] [PubMed]
  5. Zhang, K.; Li, D.; Luo, W.; Ren, W.; Liu, W. Enhanced Spatio-Temporal Interaction Learning for Video Deraining: Faster and Better. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1287–1293. [Google Scholar] [CrossRef] [PubMed]
  6. Yue, Z.; Xie, J.; Zhao, Q.; Meng, D. Semi-supervised video deraining with dynamical rain generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 642–652. [Google Scholar]
  7. Yang, W.; Tan, R.T.; Wang, S.; Fang, Y.; Liu, J. Single image deraining: From model-based to data-driven and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4059–4077. [Google Scholar] [CrossRef]
  8. Cherian, A.K.; Poovammal, E.; Philip, N.S.; Ramana, K.; Singh, S.; Ra, I.H. Deep Learning Based Filtering Algorithm for Noise Removal in Underwater Images. Water 2021, 13, 2742. [Google Scholar] [CrossRef]
  9. Zhou, W.; Ye, L. UC-former: A multi-scale image deraining network using enhanced transformer. Comput. Vis. Image Underst. 2024, 248, 104097. [Google Scholar] [CrossRef]
  10. Fu, X.; Liang, B.; Huang, Y.; Ding, X.; Paisley, J. Lightweight Pyramid Networks for Image Deraining. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 1794–1807. [Google Scholar] [CrossRef]
  11. Chen, X.; Pan, J.; Dong, J. Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 25627–25636. [Google Scholar]
  12. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 17683–17693. [Google Scholar]
  13. Zhang, Z.; Wei, Y.; Zhang, H.; Yang, Y.; Yan, S.; Wang, M. Data-driven single image deraining: A comprehensive review and new perspectives. Pattern Recognit. 2023, 143, 109740. [Google Scholar] [CrossRef]
  14. Guo, Q.; Sun, J.; Juefei-Xu, F.; Ma, L.; Xie, X.; Feng, W.; Liu, Y.; Zhao, J. Efficientderain: Learning pixel-wise dilation filtering for high-efficiency single-image deraining. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1487–1495. [Google Scholar] [CrossRef]
  15. Huang, H.; Yu, A.; He, R. Memory Oriented Transfer Learning for Semi-supervised Image Deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 7732–7741. [Google Scholar]
  16. Yasarla, R.; Sindagi, V.A.; Patel, V.M. Syn2real transfer learning for image deraining using gaussian processes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2723–2733. [Google Scholar]
  17. Jiang, N.; Luo, J.; Lin, J.; Chen, W.; Zhao, T. Lightweight Semi-supervised Network for Single Image Rain Removal. Pattern Recognit. 2023, 137, 109277. [Google Scholar] [CrossRef]
  18. Chen, X.; Pan, J.; Jiang, K.; Li, Y.; Huang, Y.; Kong, C. Unpaired Deep Image Deraining Using Dual Contrastive Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 2017–2026. [Google Scholar]
  19. Yang, Y.; Wu, X.D.; Du, K. T-shaped image dehazing network based on wavelet transform and attention mechanism. J. Hunan Univ. 2022, 49, 61–68. [Google Scholar]
  20. Ahn, N.; Jo, S.Y.; Kang, S.J. EAGNet: Elementwise Attentive Gating Network-Based Single Image De-Raining with Rain Simplification. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 608–620. [Google Scholar] [CrossRef]
  21. Luo, Z.; Sun, Z.; Zhou, W.; Wu, Z.; Kamata, S.I. Rethinking ResNets: Improved stacking strategies with high-order schemes for image classification. Complex Intell. Syst. 2022, 8, 3395–3407. [Google Scholar] [CrossRef]
  22. Lin, X.; Ma, L.; Sheng, B.; Wang, Z.J.; Chen, W. Utilizing two-phase processing with FBLS for single image deraining. IEEE Trans. Multimed. 2021, 23, 664–676. [Google Scholar] [CrossRef]
  23. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 7132–7141. [Google Scholar]
  24. Han, Y.; Feng, L.; Gao, J. A new end-to-end framework based on non-local network structure and spatial attention mechanism for image rain removal. Int. J. Comput. Appl. 2022, 44, 1083–1091. [Google Scholar] [CrossRef]
  25. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 10012–10022. [Google Scholar]
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations; ICLR: Singapore, 2020. [Google Scholar]
  28. Miclea, A.V.; Terebes, R.M.; Meza, S.; Cislariu, M. On Spectral-Spatial Classification of Hyperspectral Images Using Image Denoising and Enhancement Techniques, Wavelet Transforms and Controlled Data Set Partitioning. Remote Sens. 2022, 14, 1475. [Google Scholar] [CrossRef]
  29. Xinyi, L. Deep Learning Based Single Image Deraining: Datasets, Metrics and Methods; Fujian Normal University: Fuzhou, China, 2023. [Google Scholar]
  30. Zhang, H.; Sindagi, V.; Patel, V.M. Image de-raining using a conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 3943–3956. [Google Scholar] [CrossRef]
  31. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  32. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 2012, 20, 209–212. [Google Scholar] [CrossRef]
  33. Venkatanath, N.; Praneeth, D.; Bh, M.C.; Channappayya, S.S.; Medasani, S.S. Blind image quality evaluation using perception based features. In Proceedings of the 2015 Twenty-First National Conference on Communications (NCC), Mumbai, India, 27 February–1 March 2015; IEEE: New York, NY, USA, 2015; pp. 1–6. [Google Scholar]
  34. Li, P.; Jin, J.; Jin, G.; Fan, L.; Gao, X.; Song, T.; Chen, X. Deep Scale-space Mining Network for Single Image Deraining. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, New Orleans, LA, USA, 19–20 June 2022; pp. 4275–4284. [Google Scholar]
  35. Xiao, J.; Fu, X.; Liu, A.; Wu, F.; Zha, Z.J. Image De-Raining Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12978–12995. [Google Scholar] [CrossRef] [PubMed]
  36. Li, Y.; Lu, J.; Chen, H.; Wu, X.; Chen, X. Dilated Convolutional Transformer for High-Quality Image Deraining. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, BC, Canada, 17–24 June 2023; pp. 4199–4207. [Google Scholar]
  37. Zhou, W.; Ye, L.; Wang, X. Residual Contextual Hourglass Network for Single-Image Deraining. Neural Process. Lett. 2024, 56, 63. [Google Scholar] [CrossRef]
  38. Cui, X.; Wang, C.; Ren, D.; Chen, Y.; Zhu, P. Semi-supervised image deraining using knowledge distillation. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 8327–8341. [Google Scholar] [CrossRef]
Figure 1. Overall networking framework.
Figure 2. Structure of MSWT-DN.
Figure 3. Wavelet transform process.
Figure 4. Structure of CSAB.
Figure 5. Structure of IMSAR.
Figure 6. Structure of U-Former.
Figure 7. Structure of STB.
Figure 8. Structure of DRANet.
Figure 9. Examples from the Rain1200 synthetic dataset.
Figure 10. Examples from the Rain1400 synthetic dataset.
Figure 11. Examples from the RealRain real dataset.
Figure 12. Comparison of results of different algorithms on the Rain1200 dataset. (a) represents the original rainy image, (f) denotes the ground-truth clean background, (b–e,g–i) show the results of the selected comparison algorithms, and (j) presents the result of the proposed algorithm.
Figure 13. Comparison of results of different algorithms on the Rain1400 dataset. (a) represents the original rainy image, (f) denotes the ground-truth clean background, (b–e,g–i) show the results of the selected comparison algorithms, and (j) presents the result of the proposed algorithm.
Figure 14. Qualitative analysis of several techniques for removing rain from the RealRain real dataset. (a) represents the original rainy image, (b–h) show the results of the selected comparison algorithms, and (i) presents the result of the proposed algorithm.
Table 1. Ablation experiments on different branches. (The bold values indicate the experimental results of the algorithm proposed in this paper.)

Metrics | Input | MSWT-DN | MSWT-SSIDA
NIQE    | 13.72 | 12.83   | 11.55
PIQE    | 10.65 | 9.93    | 9.16
Table 2. Ablation experiments on individual network modules. (The bold values indicate the experimental results of the algorithm proposed in this paper.)

Evaluation Algorithms                    | Rain1200 PSNR | Rain1200 SSIM | RealRain NIQE | RealRain PIQE
Removal of wavelet transform             | 31.31 | 0.906 | 13.58 | 10.34
Removal of RSENet                        | 32.47 | 0.913 | 13.65 | 10.11
Removal of IMARM                         | 31.89 | 0.904 | 13.93 | 10.26
Removal of U-Former                      | 31.96 | 0.911 | 13.89 | 10.18
Removal of channel and spatial attention | 29.62 | 0.876 | 14.06 | 10.54
MSWT-SSIDA                               | 34.86 | 0.961 | 11.52 | 9.13
Table 3. Ablation experiment results for the loss function. (The bold values indicate the experimental results of the algorithm proposed in this paper.)

Loss Setting | Rain1200 PSNR | Rain1200 SSIM | Rain1400 PSNR | Rain1400 SSIM
Q1           | 32.17 | 0.915 | 31.04 | 0.895
Q2           | 33.64 | 0.922 | 32.38 | 0.913
Q3           | 34.88 | 0.963 | 32.57 | 0.956
Table 4. Comparison of three frequency-based methods on synthetic and real datasets. (The bold values indicate the experimental results of the algorithm proposed in this paper.)

Frequency Method | Rain1200 PSNR | Rain1200 SSIM | Rain1400 PSNR | Rain1400 SSIM | RealRain NIQE | RealRain PIQE
FFT              | 33.36 | 0.944 | 32.10 | 0.936 | 13.20 | 10.40
DCT              | 33.28 | 0.940 | 32.05 | 0.930 | 13.11 | 10.29
MSWT             | 34.86 | 0.961 | 32.55 | 0.953 | 11.52 | 9.13
Table 5. Quantitative results of different algorithms on the synthetic datasets. (The best and second-best results are highlighted in bold and underlined, respectively.)

Comparison Algorithms | Rain1200 PSNR | Rain1200 SSIM | Rain1400 PSNR | Rain1400 SSIM
DARAGNet   | 32.68 | 0.890 | 31.44 | 0.885
DSMNet     | 32.93 | 0.891 | 31.67 | 0.923
IDT        | 33.88 | 0.893 | 32.12 | 0.890
RCHNet     | 34.04 | 0.912 | 32.58 | 0.955
MOSS       | 34.15 | 0.927 | 32.25 | 0.906
LSNet      | 34.37 | 0.930 | 32.46 | 0.927
SSID_KD    | 34.53 | 0.944 | 32.38 | 0.923
MSWT-SSIDA | 34.86 | 0.961 | 32.55 | 0.953
Table 6. Quantitative results of different algorithms on the RealRain real dataset. (The best and second-best results are highlighted in bold and underlined, respectively.)

Metrics | DARAGNet | DSMNet | IDT   | RCHNet | MOSS  | LSNet | SSID_KD | MSWT-SSIDA
NIQE    | 13.88    | 13.74  | 13.69 | 13.61  | 12.46 | 12.17 | 11.79   | 11.52
PIQE    | 10.63    | 10.41  | 10.16 | 9.85   | 9.72  | 9.53  | 9.46    | 9.13
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
