Article

A Hybrid Content-Aware Network for Single Image Deraining

by Guoqiang Chai, Rui Yang, Jin Ge and Yulei Chen
1 School of Computer Science and Artificial Intelligence, Shanxi Normal University, Taiyuan 030000, China
2 School of Physics and Electronic Engineering, Shanxi Normal University, Taiyuan 030000, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Computers 2025, 14(7), 262; https://doi.org/10.3390/computers14070262
Submission received: 16 May 2025 / Revised: 16 June 2025 / Accepted: 2 July 2025 / Published: 4 July 2025
(This article belongs to the Special Issue Machine Learning Applications in Pattern Recognition)

Abstract

Rain streaks degrade the quality of optical images and seriously affect the effectiveness of subsequent vision-based algorithms. Although convolutional neural networks (CNNs) and the self-attention mechanism (SA) have shown great success in single image deraining, unresolved issues remain regarding deraining performance and the large computational load. This work coordinates and exploits the complementary advantages of CNN and SA and proposes a hybrid content-aware deraining network (CAD) to reduce complexity and generate high-quality results. Specifically, we construct the CADBlock, which includes a content-aware convolution and attention mixer module (CAMM) and a multi-scale double-gated feed-forward module (MDFM). In CAMM, the attention mechanism is used for intricate windows to generate abundant features, and simple convolution is used for plain windows to reduce computational costs. In MDFM, multi-scale spatial features are fused in a double-gated manner to preserve local detail features and enhance image restoration capabilities. Furthermore, a four-token contextual attention module (FTCA) is introduced to explore the content information among neighbor keys to improve the representation ability. Both qualitative and quantitative validations on synthetic and real-world rain images demonstrate that the proposed CAD achieves competitive deraining performance.

1. Introduction

Due to the influence of inclement weather, images and videos often suffer from poor visibility or degraded quality of the captured content. Images or videos captured in complicated rainy environments usually degrade the performance of subsequent vision tasks (such as target detection [1,2], segmentation [3], and so on), significantly reducing their value. Therefore, obtaining clear images by removing rain streaks is an important preliminary step. The temporal information contained in videos can help to better identify and remove streaks; single-image deraining, however, faces greater challenges and difficulties due to the lack of temporal information. Thus, it is of great interest to propose effective approaches for recovering high-quality rain-free content from single images.
Currently, most studies consider that the rain image can be represented with the following equation:
$$O = B + R,$$
where O denotes the rain image, B denotes the clean background, and R denotes the rain streak layer. Image deraining aims to recover the clean background B from rain image O.
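To make the formulation concrete, the following minimal sketch composes a toy rainy image under the additive model and recovers the background when the streak layer is known; the arrays and streak pattern are illustrative placeholders and are not taken from any dataset used in this paper.

```python
import numpy as np

# Toy illustration of the additive rain model O = B + R.
H, W = 64, 64
B = 0.4 * np.random.rand(H, W, 3).astype(np.float32)   # clean background, kept in [0, 0.4]
R = np.zeros_like(B)
R[::7, :, :] = 0.5                                      # synthetic "streaks" on every 7th row
O = B + R                                               # observed rainy image

# Deraining aims to recover B from O alone; with R known the recovery is exact:
B_hat = O - R
print(np.abs(B_hat - B).max())                          # 0.0
```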
Rain removal methods can be divided into two categories: model-driven methods and data-driven methods. Model-driven methods are mainly based on dictionary learning [4], low-rank representation [5], Gaussian mixture models [6], and other models. These models can only handle rain streaks with specific physical characteristics and typically require many iterative optimization steps to determine the optimal solution, which limits their practical applications. With the rapid development of computer performance and deep learning frameworks, data-driven methods have gradually become the mainstream of rain removal [7]. Thanks to the powerful representation ability of convolutional neural networks, deep learning-based deraining methods have achieved remarkable performance. For example, Fu et al. [8] completed the rain streak removal task by introducing an end-to-end residual CNN. Zhang et al. [9] proposed a multi-stream density-aware deraining network that automatically determines the rain-density information. Zheng et al. [10] separated rain streaks from the background image in a coarse-to-fine fashion by exploiting a recurrent residual multiscale architecture. Chang et al. [11] presented a direction-aware network that endows the model with a discriminative representation of different rain streaks by using the line-pattern image edges for rain streak modeling. Yao et al. [12] used a filter to decompose the image into high- and low-frequency parts and then applied a CNN to the high-frequency part to achieve rain removal. However, the intrinsic characteristics of the convolution operation, i.e., local receptive fields and independence from input content, hinder the model's capacity to eliminate long-range rain degradation. Subsequently, self-attention mechanisms were proposed as a substitute to resolve the above-mentioned shortcomings.
The transformer is a typical self-attention model; it can relate features over long distances thanks to its global computation and thus alleviates the limitations of convolution operations in the field of computer vision. However, self-attention also has non-negligible disadvantages, such as computation that grows quadratically with image size, which seriously limits its applicability. Therefore, many efforts have been made to reduce these computational burdens [13,14].
As noted above, a rain image is composed of a rain streak layer and a clean background, and different contents vary in the complexity they pose for deraining. For example, a flat area (e.g., sky or land) is naturally easier to process than textured regions, so different regions of the background have different complexities. Inspired by this, we divide rain image regions into two categories: harder regions and simpler regions. In this paper, smaller modules are used to process the simpler regions and larger modules are used to treat the more complex regions, achieving a balance between deraining performance and model complexity.
In light of the drawbacks and complementarity of CNN and self-attention, and the observation that different regions suit different networks, a hybrid content-aware deraining network (CAD) is proposed to address single image deraining. The core of CAD is a series of CADBlocks, each of which includes two components: the content-aware convolution and attention mixer module (CAMM) and the multi-scale double-gated feed-forward module (MDFM). In practice, CAMM is an organic combination of convolution and attention, applying the attention mechanism to intricate windows and simple convolution to plain windows to reduce computational costs. After CAMM, MDFM blends multi-scale spatial features in a double-gated manner so that the network preserves local detail features and enhances image restoration capabilities.
Furthermore, since the rain streak layer and the background are highly interlaced and the rain streaks seriously corrupt the content, a rain streak detection module (RSDM), which utilizes a four-token contextual attention module (FTCA) and a 3 × 3 convolution, is developed to process global information and extract the rain streaks. The attention map generated by RSDM is fed into the subsequent feature maps to encourage the network to pay more attention to the rain areas. Specifically, the supervisory information of RSDM used in training is generated by comparing the rain image with the clean background.
The main contributions made in this paper can be summarized as follows:
(1)
A hybrid content-aware deraining (CAD) network is proposed to generate high-quality deraining results. It incorporates a content-aware convolution and attention mixer module (CAMM), which applies simple convolution to plain areas and an attention mechanism to complex areas, and a multi-scale double-gated feed-forward module (MDFM), which preserves local detail features and enhances image restoration capabilities.
(2)
An attention model named four-token contextual attention (FTCA) is designed to explore the rich context information among neighbor keys. The proposed FTCA applies four tokens, rather than the usual three tokens in transformers, to process the input features and exploit the introduced token to alleviate computational costs and enhance the expressiveness by aggregating global information.
(3)
Qualitative and quantitative evaluations both on synthetic and real-world datasets have verified that the proposed method not only exhibits excellent superiority and effectiveness in image deraining compared to other methods, but also maintains higher operational efficiency.
The rest of the paper is organized as follows: Section 2 introduces related work on single image deraining and on accelerating and lightweight frameworks. Section 3 elaborates on the proposed method. Section 4 presents the experiments and results on the synthetic and real-world datasets. Finally, the conclusion and discussion are given in Section 5.

2. Related Works

2.1. Single Image Deraining

Removing rain streaks from a single image has long been a challenging problem. Most traditional methods tend to explore handcrafted prior knowledge of images, determining the rain-dominant regions [15] or decomposing rain images into high-frequency and low-frequency layers [16] to remove rain streaks. CNN-based methods [17,18] achieve better results by exploiting rain streak direction [19] and density [9]. Subsequently, strategies such as residual connections [10], multi-scale features [20], and progressive fusion [21] were introduced to enhance the ability to restore clear images. Following this, transformers [22,23,24] have made headway in image deraining tasks and performed better than CNN-based models. Recently, improvements to the convolutional feed-forward network (FFN) [22,25] and the addition of CNN elements have been applied to enhance transformer-based approaches. However, these models struggle to strike a balance between efficiency and effectiveness. In contrast, the hybrid content-aware network proposed in this work achieves a balance between deraining performance and computational efficiency.

2.2. Accelerating and Lightweight Framework

Accelerating framework. The complexity of deep learning models has continuously increased to achieve more satisfying visual results in low-level vision tasks, making the practical application of these models harder than before. Recently, content-aware routing has been proposed to deal with cropped patches by using different models of different complexities to accelerate inference in visual tasks. Kong et al. [26] divided the input into three categories, i.e., complex, medium, and simple, with each category processed by a corresponding network.
Lightweight framework. The computational complexity of CNNs [27] has also increased with the gradual expansion of model size, and numerous strategies have been applied to reduce complexity for lightweight inference. For example, efficient information fusion structures [28,29] were proposed to reduce both parameters and calculations, and simplified information distillation procedures [30,31] were introduced to obtain real-time inference on mobile devices. Moreover, various powerful token mixers have also been introduced for lightweight designs; for example, both window-based SA [32] and large kernel convolution [33] achieved state-of-the-art performance. Inspired by these approaches, content-aware routing and token mixers are integrated to design the proposed CAD, which applies complex operators to informative areas but simple operators to plain areas.

3. Method

3.1. Backbone Pipeline

Given a rain image $O \in \mathbb{R}^{H \times W \times 3}$, the shallow feature $X_S \in \mathbb{R}^{H \times W \times C}$ is first extracted by a 3 × 3 convolution, while the attention map $X_{att} \in \mathbb{R}^{H \times W \times 1}$ is simultaneously computed by the RSDM. Subsequently, the feature $X_S$ is passed through a feature refinement block (FRB) and a series of CAD groups (CADGs) for deep feature extraction. Each CAD group consists of a stack of a varying number of CAD blocks (CADBs). In a CAD block, the CAMM is employed for local window modeling, involving only convolution and attention mechanisms. Then, the MDFM is utilized to integrate information at different ranges and scales to further strengthen the features for local details. Moreover, the attention map $X_{att}$ is multiplied with the output of each CAD block to emphasize the rain regions. Skip connections [34] are also employed to fuse the shallow feature $X_S$ and the deep feature $X_D$. Finally, a 3 × 3 convolution and pixel-shuffle operations [35] are used to process the fused features and generate the final reconstructed derained image $Y \in \mathbb{R}^{H \times W \times 3}$. The overall workflow of the content-aware deraining (CAD) framework is illustrated in Figure 1.
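The data flow described above can be summarized in the following simplified sketch; the RSDM, FRB, and CAD groups are replaced by plain convolutional stand-ins, and the pixel-shuffle reconstruction is omitted, so only the routing of the shallow feature, the rain attention map, and the skip connection follows Section 3.1.

```python
import torch
import torch.nn as nn

class CADPipelineSketch(nn.Module):
    """Simplified data flow of the CAD backbone (Figure 1), with stand-in modules."""
    def __init__(self, channels: int = 60, num_groups: int = 4):
        super().__init__()
        self.shallow = nn.Conv2d(3, channels, 3, padding=1)
        self.rsdm = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())  # stand-in for FTCA + conv
        self.frb = nn.Conv2d(channels, channels, 3, padding=1)                  # stand-in for FRB
        self.groups = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_groups)
        )                                                                        # stand-ins for CAD groups
        self.reconstruct = nn.Conv2d(channels, 3, 3, padding=1)                 # pixel-shuffle step omitted

    def forward(self, rain: torch.Tensor) -> torch.Tensor:
        x_s = self.shallow(rain)                 # shallow feature X_S
        x_att = self.rsdm(rain)                  # rain-streak attention map X_att
        x = self.frb(x_s)
        for g in self.groups:
            x = g(x) * x_att                     # emphasize rain regions after each group
        x_d = x + x_s                            # skip connection fusing shallow and deep features
        return self.reconstruct(x_d)

derained = CADPipelineSketch()(torch.rand(1, 3, 64, 64))
print(derained.shape)  # torch.Size([1, 3, 64, 64])
```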

3.2. Four-Token Contextual Attention Module

Conventional self-attention calculates each query-key pair, which not only results in excessive computational costs but also ignores the rich context information among neighbor keys. A new attention model named four-token contextual attention (FTCA) is proposed to alleviate the above problems.
Different from the original transformer, an additional token $A$, which can be seen as a key or a query at different places, is introduced in FTCA to solve the first problem. As illustrated in Figure 2, token $A$ is first used as the query to compute $M_1$ against $K$, and simultaneously used as the key to compute $M_2$ against $Q$. Subsequently, the attention matrix $M \in \mathbb{R}^{HW \times n^2}$, instead of the regular attention map of size $HW \times HW$, is obtained by combining $M_1$ and $M_2$ with one 3 × 3 convolution. Finally, the feature map $Y$ is obtained from the attention matrix $M$ and the pooled value $V$. For the second problem, $K$ is computed by a group convolution that connects all neighbor keys within a grid, and a skip connection then adds $K$ to the feature map $Y$. Given the input $X \in \mathbb{R}^{HW \times C}$, the proposed FTCA can be written as:
$$
\begin{aligned}
Q &= \mathrm{Conv}_{1\times1}(X) \in \mathbb{R}^{HW \times C}, &
K &= \mathrm{Conv}_{k\times k}(X) \in \mathbb{R}^{HW \times C},\\
V &= \mathrm{Pool}\big(\mathrm{Conv}_{1\times1}(X)\big) \in \mathbb{R}^{n^2 \times C}, &
A &= \mathrm{Pool}\big(\mathrm{Conv}_{1\times1}(X)\big) \in \mathbb{R}^{n^2 \times C},\\
M_1 &= \mathrm{Softmax}\big(A K^{T}\big) \in \mathbb{R}^{n^2 \times HW}, &
M_2 &= \mathrm{Softmax}\big(Q A^{T}\big) \in \mathbb{R}^{HW \times n^2},\\
M &= \mathrm{Conv}_{3\times3}\big(\mathrm{cat}(M_1^{T}, M_2)\big) \in \mathbb{R}^{HW \times n^2}, &
Y &= M V + K \in \mathbb{R}^{HW \times C},
\end{aligned}
$$
where n is the pooling size and $\mathrm{Conv}_{k\times k}$ represents a k × k group convolution. It should be noted that the rain streak detection module (RSDM) applies FTCA to explore the rich context information among neighbor keys, followed by a 3 × 3 convolution operation to obtain a rain streak attention map, prompting the subsequent modules to focus on the rain areas.
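A minimal PyTorch sketch of FTCA is given below, assuming the shapes stated above (Q and K on the full H×W grid, V and A pooled to an n×n grid). The handling of the transposes in the fusion step and the omission of the deformable-window variant are our simplifications, so this illustrates the four-token idea rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FTCASketch(nn.Module):
    """Illustrative sketch of four-token contextual attention (FTCA, Section 3.2)."""
    def __init__(self, channels: int, pool_size: int = 6, k: int = 3, groups: int = 4):
        super().__init__()
        self.n = pool_size
        self.q_proj = nn.Conv2d(channels, channels, 1)
        self.k_proj = nn.Conv2d(channels, channels, k, padding=k // 2, groups=groups)  # contextual keys
        self.v_proj = nn.Conv2d(channels, channels, 1)
        self.a_proj = nn.Conv2d(channels, channels, 1)
        self.mix = nn.Conv2d(2 * self.n * self.n, self.n * self.n, 3, padding=1)       # fuses M1 and M2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q_proj(x).flatten(2).transpose(1, 2)                                  # (B, HW, C)
        k = self.k_proj(x).flatten(2).transpose(1, 2)                                  # (B, HW, C)
        v = F.adaptive_avg_pool2d(self.v_proj(x), self.n).flatten(2).transpose(1, 2)   # (B, n*n, C)
        a = F.adaptive_avg_pool2d(self.a_proj(x), self.n).flatten(2).transpose(1, 2)   # (B, n*n, C)

        m1 = torch.softmax(a @ k.transpose(1, 2), dim=-1)      # (B, n*n, HW): A as query, K as key
        m2 = torch.softmax(q @ a.transpose(1, 2), dim=-1)      # (B, HW, n*n): Q as query, A as key
        maps = torch.cat([m1.transpose(1, 2), m2], dim=-1)     # (B, HW, 2*n*n)
        maps = maps.transpose(1, 2).reshape(b, 2 * self.n * self.n, h, w)
        m = self.mix(maps).reshape(b, self.n * self.n, h * w).transpose(1, 2)  # (B, HW, n*n)

        y = m @ v + k                                           # (B, HW, C), skip connection with K
        return y.transpose(1, 2).reshape(b, c, h, w)

out = FTCASketch(channels=16)(torch.rand(1, 16, 24, 24))
print(out.shape)  # torch.Size([1, 16, 24, 24])
```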

3.3. Feature Refinement Block

The shallow feature $X_S \in \mathbb{R}^{H \times W \times C}$ is extracted by a 3 × 3 convolution; however, this simple convolution operation may not extract information effectively from rain images and can lose crucial details. Therefore, a feature refinement block (FRB), which extracts features in the spatial and channel dimensions simultaneously, is introduced to better learn the structural and textural details of rain images. As illustrated in Figure 3, the input $X_S$ is first processed by a global average pooling branch and a convolutional branch to achieve channel and spatial processing, respectively. Then, the channel output $X_1$ and the spatial output $X_2$ are each multiplied with the input $X_S$ and added together to obtain the fused feature $X_3$. Finally, a skip connection joins the processed output $X_3$ and the input $X_S$ to preserve abundant crucial details. Given the input $X_S$, the detailed calculation process of FRB can be presented as:
$$
\begin{aligned}
X_1 &= \sigma\big(\mathrm{Conv}(\phi(\mathrm{Conv}(\mathrm{GAP}(X_S))))\big),\\
X_2 &= \sigma\big(\mathrm{Conv}(\phi(\mathrm{Conv}(\mathrm{Conv}(X_S))))\big),\\
X_3 &= X_S \otimes X_1 + X_S \otimes X_2,\\
X_{FRB} &= \phi\big(\mathrm{Conv}(X_3)\big) + \mathrm{Conv}(X_S),
\end{aligned}
$$
where $X_{FRB}$ refers to the output feature and GAP(·) denotes the global average pooling operation; ⊗ is element-wise multiplication; and σ(·) and φ(·) denote the Sigmoid and ReLU activation functions, respectively. The mathematical notation nomenclature is listed in Table A1 of Appendix A.
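The following sketch illustrates the FRB computation; the kernel sizes and the number of stacked convolutions in each branch are assumptions where the text leaves them unspecified.

```python
import torch
import torch.nn as nn

class FRBSketch(nn.Module):
    """Sketch of the feature refinement block (FRB, Section 3.3).

    A channel branch (global average pooling + two convolutions) and a spatial
    branch (stacked convolutions) each produce a sigmoid gate; both gates
    re-weight the input and the results are fused with a residual path.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # GAP(X_S)
            nn.Conv2d(channels, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid(),
        )
        self.fuse = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.skip = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x_s: torch.Tensor) -> torch.Tensor:
        x1 = self.channel_gate(x_s)            # channel attention, broadcast over H x W
        x2 = self.spatial_gate(x_s)            # spatial attention
        x3 = x_s * x1 + x_s * x2               # X_3 = X_S (x) X_1 + X_S (x) X_2
        return self.fuse(x3) + self.skip(x_s)  # X_FRB

print(FRBSketch(60)(torch.rand(1, 60, 32, 32)).shape)  # torch.Size([1, 60, 32, 32])
```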

3.4. Content-Aware Deraining Block

Similar to the paradigm of the transformer block [14], the CAD block integrates content-aware routing into the token mixer, which uses the attention mechanism for intricate areas and simple convolution for plain areas, followed by a multi-scale double-gated feed-forward module (MDFM) to enhance the features for local details. Specifically, the CAD block employs the content-aware convolution and attention mixer module (CAMM) to achieve local window modeling, so the computational cost is greatly reduced compared to the conventional self-attention mechanism. In addition, MDFM is introduced for local feature processing, where a gating mechanism and a multi-scale view are combined to enhance the local processing capability. The process of the CAD block is depicted in Figure 4. Given the feature $X_i^{j-1} \in \mathbb{R}^{H \times W \times C}$ as the input of the j-th CAD block, the procedure can be expressed as:
$$
X_i^{j-\frac{1}{2}} = X_i^{j-1} + \mathrm{BN}\big(\mathrm{CAMM}(X_i^{j-1})\big), \qquad
X_i^{j} = X_i^{j-\frac{1}{2}} + \mathrm{BN}\big(\mathrm{MDFM}(X_i^{j-\frac{1}{2}})\big),
$$
where BN(·) denotes the BatchNorm layer; $X_i^{j-\frac{1}{2}}$ and $X_i^{j}$ denote the outputs of the CAMM and MDFM branches, respectively; and i indexes the CAD group to which the block belongs.
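The residual wiring of a CAD block can be sketched as follows, with CAMM and MDFM passed in as arbitrary callables; only the two BatchNorm residual branches of the equation above are fixed here.

```python
import torch
import torch.nn as nn

class CADBlockSketch(nn.Module):
    """Residual layout of a CAD block: X = X + BN(CAMM(X)), then X = X + BN(MDFM(X))."""
    def __init__(self, channels: int, camm: nn.Module, mdfm: nn.Module):
        super().__init__()
        self.camm, self.mdfm = camm, mdfm
        self.bn1, self.bn2 = nn.BatchNorm2d(channels), nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.bn1(self.camm(x))   # token-mixing branch
        x = x + self.bn2(self.mdfm(x))   # feed-forward branch
        return x

# Stand-in mixers just to show the wiring:
block = CADBlockSketch(60, nn.Conv2d(60, 60, 3, padding=1), nn.Conv2d(60, 60, 1))
print(block(torch.rand(1, 60, 16, 16)).shape)  # torch.Size([1, 60, 16, 16])
```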
Content-aware convolution and attention mixer module (CAMM). As depicted in Figure 5, the proposed CAMM exploits a predictor [36] to generate a mixer mask, an offset map, and an attention map. The offset map is generated by a convolution operation, which greatly reduces the parameters compared to the fully connected layer in [36]. In addition, different from the attention map in [36], which only contains spatial information, the attention map in the proposed CAMM integrates spatial and channel information through a split operation. In practice, we evenly divide the input into two groups along the channel dimension, one group performing spatial operations and the other performing channel operations, and then multiply the outputs of the two groups to obtain a simple attention map. Specifically, the sorting mask is arranged in descending order; the first K indices represent areas with complex information, while the remaining indices represent plain areas. For the plain areas, the attention map is used to implement simple attention as a lightweight operation. For the informative complex areas, the offset map is applied to modulate the key in FTCA to endow the module with a dynamic property, allowing the deformable window of FTCA to adaptively include more useful information and enhance the representational capacity. Finally, the plain areas and the informative complex areas are integrated by a 1 × 1 convolution and an FTCA with a fixed window. Given the input $X_i^{j-1}$, the CAMM is formulated as:
$$
\begin{aligned}
mask,\ offset,\ A_{map} &= \mathrm{Predictor}\big(X_i^{j-1}\big),\\
X_{plain},\ X_{complex} &= \mathrm{Split}\big(X_i^{j-1},\ mask\big),\\
X_{plain} &= X_{plain} \otimes A_{map},\\
X_{complex} &= \mathrm{DFTCA}\big(X_{complex},\ X_{complex} + offset\big),\\
X_i^{j-\frac{1}{2}} &= \mathrm{Conv}\big(X_{plain} + X_{complex}\big),\\
X_i^{j-\frac{1}{2}} &= \mathrm{FFTCA}\big(X_i^{j-\frac{1}{2}},\ X_i^{j-\frac{1}{2}}\big),
\end{aligned}
$$
where the first input of FTCA provides Q, V, and A, and the second input provides K; ⊗ is element-wise multiplication. Here, DFTCA denotes the FTCA with a deformable window and FFTCA denotes the FTCA with a fixed window.
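The content-aware routing idea behind CAMM can be illustrated with the simplified sketch below: windows are ranked by a cheap complexity proxy (per-window variance, standing in for the learned predictor), the top fraction is sent through an expensive operator and the rest through a cheap one. The offset map, the learned attention map, and the final FFTCA fusion are omitted, so this is only a sketch of the routing, not the module itself.

```python
import torch
import torch.nn as nn

def route_windows(x: torch.Tensor, win: int, ratio: float,
                  complex_op, plain_op) -> torch.Tensor:
    """Route non-overlapping win x win windows to an expensive or a cheap operator."""
    b, c, h, w = x.shape
    # (B, N, C, win, win), N windows per image
    wins = (x.reshape(b, c, h // win, win, w // win, win)
             .permute(0, 2, 4, 1, 3, 5)
             .reshape(b, -1, c, win, win))
    score = wins.var(dim=(2, 3, 4))                       # (B, N) complexity proxy
    k = max(1, int(ratio * wins.shape[1]))
    top = score.topk(k, dim=1).indices                    # indices of "complex" windows

    out = plain_op(wins.reshape(-1, c, win, win)).reshape_as(wins)
    for bi in range(b):                                   # overwrite complex windows
        out[bi, top[bi]] = complex_op(wins[bi, top[bi]])

    # fold windows back to (B, C, H, W)
    return (out.reshape(b, h // win, w // win, c, win, win)
               .permute(0, 3, 1, 4, 2, 5)
               .reshape(b, c, h, w))

x = torch.rand(2, 16, 64, 64)
cheap = nn.Conv2d(16, 16, 1)                              # stand-in for the plain-window path
costly = nn.Conv2d(16, 16, 3, padding=1)                  # stand-in for the DFTCA path
print(route_windows(x, win=16, ratio=0.5, complex_op=costly, plain_op=cheap).shape)
```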
Multi-scale double-gated feed-forward module (MDFM). The regular feed-forward network (FFN) extracts and fuses feature information from different locations by operating on each pixel location separately and identically: one 1 × 1 convolution expands the input channel dimension and another 1 × 1 convolution reduces the channels back to the original dimension. Previous studies typically leverage the standard FFN to bolster local context [25]. To focus on enriching features with contextual information, Zamir et al. [22] added a gating mechanism and depth-wise convolutions to the standard FFN, forming the gated-dconv feed-forward network (GDFN). Based on GDFN, Sun et al. [25] conceived a dual-scale gated feed-forward module, which uses a 5 × 5 convolution and a 3 × 3 convolution to extract multi-scale information. Motivated by these, we propose a multi-scale double-gated feed-forward module (MDFM), which integrates information of different ranges and scales into the transmission process by using two gating mechanisms and three convolutions with different kernel sizes. The process of MDFM is depicted in Figure 6. Given the input $X_{in} \in \mathbb{R}^{H \times W \times C}$, the proposed MDFM is formulated as:
$$
\begin{aligned}
X_1,\ X_2,\ X_3 &= \mathrm{Split}\big(\mathrm{Conv}_{1\times1}(X_{in})\big),\\
\hat{X}_1,\ \hat{X}_2,\ \hat{X}_3 &= \mathrm{Conv}_{3\times3}(X_1),\ \mathrm{Conv}_{5\times5}(X_2),\ \mathrm{Conv}_{7\times7}(X_3),\\
X_{out} &= \mathrm{Conv}_{1\times1}\big(\gamma(\hat{X}_1) \otimes \gamma(\hat{X}_2) \otimes \hat{X}_3\big),
\end{aligned}
$$
where Split(·) indicates the channel-wise split operation, $\mathrm{Conv}_{N\times N}$ represents an N × N depth-wise convolution, ⊗ is element-wise multiplication, and γ(·) denotes the GELU non-linearity. Overall, MDFM processes spatially neighboring pixel information via the depth-wise convolutions and acquires complementary fine details via the gating mechanism, which allows each level to focus on the other levels.
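A compact sketch of MDFM is shown below; the channel expansion factor and the exact placement of the two GELU gates are our assumptions based on the description and Figure 6.

```python
import torch
import torch.nn as nn

class MDFMSketch(nn.Module):
    """Sketch of the multi-scale double-gated feed-forward module (Section 3.4)."""
    def __init__(self, channels: int, expansion: int = 3):
        super().__init__()
        hidden = channels * expansion                 # must be divisible by 3
        self.expand = nn.Conv2d(channels, hidden, 1)
        c = hidden // 3
        self.dw3 = nn.Conv2d(c, c, 3, padding=1, groups=c)   # depth-wise, 3x3
        self.dw5 = nn.Conv2d(c, c, 5, padding=2, groups=c)   # depth-wise, 5x5
        self.dw7 = nn.Conv2d(c, c, 7, padding=3, groups=c)   # depth-wise, 7x7
        self.project = nn.Conv2d(c, channels, 1)
        self.gelu = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2, x3 = self.expand(x).chunk(3, dim=1)   # Split(Conv1x1(X_in))
        x1, x2, x3 = self.dw3(x1), self.dw5(x2), self.dw7(x3)
        gated = self.gelu(x1) * self.gelu(x2) * x3    # double-gated multi-scale fusion (assumed placement)
        return self.project(gated)

print(MDFMSketch(60)(torch.rand(1, 60, 32, 32)).shape)  # torch.Size([1, 60, 32, 32])
```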

3.5. Loss Function

To better enhance the structure details, a hybrid loss function is applied to optimize the proposed CAD.
The MSE loss function is adopted to reduce the pixel-level difference between the output image of the proposed network and the corresponding background image, expressed as follows:
$$L_{mse} = \mathbb{E}_x\big[\lVert C(x) - y \rVert^2\big],$$
where x is the input image, y denotes the ground truth, and C(·) is the output generated by the proposed CAD.
The SSIM loss function is adopted to assess the structure-level similarity difference between the two types of images. It can be expressed as:
$$L_{ssim} = 1 - \mathrm{SSIM}\big(C(x),\ y\big),$$
where SSIM(·) denotes the structural similarity between the two images; the more similar the images are, the closer the SSIM value is to 1, and hence the closer this loss term is to 0.
The attention loss function is utilized to enhance the focus on the important rain streak areas, which can be formulated as:
$$L_{att} = \mathbb{E}_x\big[\lVert \mathrm{RSDM}(x) - M(x) \rVert^2\big],$$
where RSDM(·) is the attention map predicted by the RSDM and M(·) denotes a binary map of the rain streaks.
The perceptual loss function is used to minimize the feature distance between the input image and the target image to achieve better image quality that is more in line with the human eye’s perception. The function can be indicated as:
$$L_{per} = L_{mse}\big(V_i(C(x)),\ V_i(y)\big),$$
where $V_i(\cdot)$ represents the feature map computed by the i-th layer of the pretrained VGG-16 network.
The Laplace edge loss function is adopted to capture the texture information related to high-frequency structure to improve the images’ detail representation and it can be expressed as:
$$L_{edge} = L_{mse}\big(K(C(x)),\ K(y)\big),$$
where K(·) denotes convolution with the Laplace operator.
Above all, the overall loss function can be formulated as:
$$L_{loss} = \lambda_1 L_{mse} + \lambda_2 L_{ssim} + \lambda_3 L_{att} + \lambda_4 L_{per} + \lambda_5 L_{edge},$$
where $\lambda_1$–$\lambda_5$ are weights used to balance the different loss terms.
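The following sketch assembles a simplified version of the hybrid loss. The SSIM term uses a uniform window instead of the usual Gaussian one, the VGG-16 perceptual term is omitted to keep the example self-contained, and the lambda weights are placeholders rather than the values used in the paper.

```python
import torch
import torch.nn.functional as F

def ssim_index(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean SSIM with an 11x11 uniform window (a simplification of the Gaussian-window SSIM)."""
    mu_x, mu_y = (F.avg_pool2d(t, 11, 1, 5) for t in (x, y))
    var_x = F.avg_pool2d(x * x, 11, 1, 5) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 11, 1, 5) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 11, 1, 5) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

LAPLACE = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def laplace_edges(img):
    """Per-channel Laplacian response used by the edge loss."""
    k = LAPLACE.repeat(img.shape[1], 1, 1, 1)
    return F.conv2d(img, k, padding=1, groups=img.shape[1])

def hybrid_loss(pred, target, att_pred, att_gt, lambdas=(1.0, 1.0, 0.1, 0.1)):
    """MSE + SSIM + attention + edge terms; perceptual (VGG-16) term omitted here."""
    l_mse = F.mse_loss(pred, target)
    l_ssim = 1.0 - ssim_index(pred, target)
    l_att = F.mse_loss(att_pred, att_gt)
    l_edge = F.mse_loss(laplace_edges(pred), laplace_edges(target))
    w = lambdas
    return w[0] * l_mse + w[1] * l_ssim + w[2] * l_att + w[3] * l_edge

pred, gt = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
att_p, att_g = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(hybrid_loss(pred, gt, att_p, att_g).item())
```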

4. Experiments and Discussions

4.1. Datasets and Implementation Details

Datasets. Experiments on two frequently used synthetic rain datasets are conducted to effectively assess the proposed deraining method. Rain100L [37] and Rain100H [37] both contain 1800 pairs of images for training and 100 pairs of images for testing. The rain streaks in these two sets of images differ in density and direction. In addition, the real-world datasets provided by Zhang et al. [38] and Yang et al. [37] are used to verify the effectiveness of the proposed method in practical applications.
Metrics. PSNR [39] and SSIM [40] are applied to evaluate the deraining results processed by the proposed method and other compared methods.
$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{MAX^2}{MSE}\right),$$
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$
where x and y denote the predicted rain-removed image and the corresponding ground truth, respectively; MSE denotes the mean squared error and MAX denotes the maximum possible pixel value; $\mu_x$ and $\mu_y$ are the means, $\sigma_x^2$ and $\sigma_y^2$ the variances, and $\sigma_{xy}$ the covariance of x and y; and $c_1$ and $c_2$ are small constants.
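For reference, PSNR can be computed directly from its definition as in the short sketch below; the images and noise here are synthetic placeholders.

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR as defined above; both images are assumed to share dtype and value range."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

a = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)                      # "ground truth"
b = np.clip(a.astype(np.int16) + np.random.randint(-5, 6, a.shape), 0, 255).astype(np.uint8)  # "prediction"
print(round(psnr(b, a), 2))
```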
Implementation details. The number of CAD groups is set to 4, the number of CAD blocks in each CAD group to [4,4,6,6], and the initial channel number to 60. The window size of the attention model FTCA is set to 16. Furthermore, the Adam optimizer is used with an initial learning rate of 1 × 10−3, decayed by a factor of 0.2 every 30 epochs until the 150th epoch. The entire CAD network is implemented in PyTorch 1.12.1 with an Intel i9 CPU and a 4090 GPU. The proposed CAD is compared with two prior-based methods (GMM [4] and DSC [6]), six CNN-based methods (PReNet [41], VRGNet [42], MPRNet [43], SAPNet [44], HINet [45], and DRAN [11]), two transformer-based methods (IDT [24] and DCT [46]), and one hybrid-based method (ELF [47]).
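The optimizer schedule described above corresponds to the following PyTorch setup; the model and the training loop body are placeholders, and only the Adam learning rate and the step decay mirror the stated settings.

```python
import torch
import torch.nn as nn

# Adam with an initial learning rate of 1e-3, decayed by a factor of 0.2
# every 30 epochs for 150 epochs.
model = nn.Conv2d(3, 3, 3, padding=1)          # stand-in for the CAD network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.2)

for epoch in range(150):
    # ... one pass over the training pairs would go here ...
    optimizer.step()        # placeholder step so the scheduler is exercised
    scheduler.step()
print(scheduler.get_last_lr())  # [1e-3 * 0.2**5] after 150 epochs
```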

4.2. Comparison with State-of-the-Art

Synthetic datasets. Table 1 shows the quantitative comparison on the Rain100L and Rain100H datasets. Because the codes of DRAN [11] and DCT [46] are not open source, their Params and FLOPs cannot be calculated, and we directly report the experimental results from their papers. As can be seen, the proposed CAD achieves the highest PSNR and SSIM, as well as the shortest inference time. For Rain100L, the proposed CAD surpasses the second-best method DCT by 1.111 dB in PSNR. For Rain100H, the proposed CAD surpasses the CNN-based method DRAN by 0.157 dB in PSNR. In contrast to the transformer-based DCT, although the proposed CAD has only a small advantage of 0.067 dB in PSNR on Rain100H, its SSIM is significantly higher. Compared with the hybrid method ELF, which simply combines CNN and a transformer to process the entire image, the proposed CAD utilizes a CNN-based predictor to generate offset maps and attention maps; the former provide deformable windows for the transformer, and the latter endow the convolution with dynamic characteristics for simple regions. These improvements make the deraining results of CAD on Rain100H superior to ELF by 0.399 dB in PSNR. In addition, the proposed model achieves obvious advantages in Params and FLOPs among the methods involving attention mechanisms. Limited by the structure of self-attention, the proposed CAD has higher Params and FLOPs than some CNN-based methods, such as PReNet and VRGNet, but its PSNR is higher by 4.037 dB and 2.761 dB, respectively, so the rain removal effect is greatly improved.
Furthermore, Figure 7 presents the visual comparison results and the corresponding SSIM and PSNR on two synthetic images. As can be seen, GMM and DSC fail to produce satisfactory rain removal images and the results contain many residual rain streaks. SAPNet and VRGNet remove most rain streaks but introduce unpleasant blurry artifacts. For example, there is a clear blur at the intersection of two mountains in the first image, and the sail ropes of the sailboat are ghosted and blurred in the second image. MPRNet and HINet cause excessive texture removal and detail loss. For instance, in the first image, the mountain is excessively smooth, while in the second image, the sail ropes of the sailboat have disappeared. In contrast, the proposed CAD can preserve more image detail and achieve better visual results.
Real-world datasets. Experiments on real-world rain datasets are further conducted to demonstrate the generalization ability of the proposed method. However, the deraining performance on these images can only be evaluated visually, as the corresponding rain-free images are difficult to obtain. The deraining results of all the involved methods are shown in Figure 8. One can see that the proposed CAD removes the rain streaks cleanly and produces more credible content, whereas the other competitive methods fail to eliminate the apparent rain streaks. Specifically, the background images generated by GMM are very blurry in both images. DSC cannot eliminate the rain streaks in the first image, and the tree trunk in the second image exhibits significant artifacts. The backgrounds produced by SAPNet and VRGNet in the first image are unclear, and the results of MPRNet and HINet contain heavy shadows. The results of SAPNet, VRGNet, MPRNet, and HINet on the second image are similar, i.e., there are residual rain streaks and artificial marks on the tree trunk. The satisfactory visual deraining results of the proposed CAD prove that it has good generalization ability.
In addition, a survey on user satisfaction is conducted to further demonstrate the deraining effect of CAD. We selected eight images from the real-world deraining results and arranged them randomly. Then, 100 participants were asked to score the derained images from 1 to 10 according to quality, with 10 representing the best effect. The average scores of the seven methods are shown in Table 2; the proposed CAD obtained the highest score, confirming that its deraining results are more in line with human visual perception.

4.3. Ablation Study

In this section, ablation studies are conducted to validate the main components of the proposed method, as well as the effects of different window sizes and pooling sizes in FTCA. To this end, five variants of CAD are trained on Rain100H with the strategy described in Section 4.1. In addition, three different window sizes and four pool sizes in FTCA are explored.
Effectiveness of different blocks. Based on the proposed CAD, studies that remove or replace individual modules are conducted to validate the effectiveness of the different blocks. Specifically, the multi-scale double-gated feed-forward module (MDFM), the feature refinement block (FRB), the four-token contextual attention (FTCA) in the rain streak detection module (RSDM), the DFTCA with deformable windows, the FFTCA with fixed windows, and the content-aware convolution and attention mixer module (CAMM) are studied. It should be noted that a gated-dconv feed-forward network (GDFN) is applied to replace the MDFM and conventional self-attention (SA) is used to substitute FTCA. As shown in Table 3, comparing the proposed CAD with model (a), MDFM achieves a 0.604 dB performance gain, since the three scales and two gating mechanisms in MDFM help the model handle spatially local structures effectively. From model (b), model (c), and model (d), the FTCA at different positions contributes 0.192 dB, 0.104 dB, and 0.110 dB in PSNR, respectively, which shows the superiority of FTCA. In contrast with model (e), the final model shows that the FRB provides a favorable gain of 0.245 dB, since the FRB can extract more detailed rain streaks. Models (f) and (g) utilize convolution and self-attention, respectively, to replace CAMM. Model (f) results in a decrease of 0.446 dB in PSNR and model (g) in a decrease of 0.745 dB, demonstrating the effectiveness of CAMM in rain removal.
Effectiveness of window size in FTCA. In the CAD block, window-based attention is used to process the complex areas to reduce computational costs. Table 4 illustrates the influence of different window sizes on the deraining results. From Table 4, one can observe that the model with an 8 × 8 window saves only 0.006 G FLOPs but loses 0.265 dB in PSNR compared with the 16 × 16 window. The 16 × 16 window not only gains 0.126 dB in PSNR but also saves 0.051 G FLOPs compared with the 32 × 32 window. Therefore, a window size of 16 × 16 is used to achieve a balance between performance and computation.
Effectiveness of pool size in FTCA. The pooling operation used to generate the token A and the value V in FTCA plays an important role in reducing parameters, so a reasonable pool size n is important to balance performance and complexity. Table 5 shows the results for different pool sizes. Specifically, pool size 5 saves 0.078 G FLOPs compared with pool size 6 but loses 0.184 dB in PSNR. Pool size 6 obtains better results than pool size 7, with 0.100 G fewer FLOPs and an improvement of 0.111 dB in PSNR. Pool size 8 increases FLOPs by 0.259 G and reduces PSNR by 0.105 dB compared with pool size 6. Large pooling sizes lead to excessively high FLOPs, while small pooling sizes degrade the model's performance. As a result, a pooling size of 6 is adopted in the proposed CAD.

5. Conclusions

In this paper, an effective hybrid network based on image content has been proposed for image deraining tasks. By analyzing the drawbacks of CNN and self-attention and exploiting the observation that different regions suit different networks, a content-aware convolution and attention mixer module (CAMM) is introduced, which uses an attention mechanism for intricate windows and simple convolution for plain windows. To enhance image restoration capabilities, the multi-scale double-gated feed-forward module (MDFM) is proposed to guide the network to better preserve local detail features. Furthermore, a four-token contextual attention module (FTCA) is introduced to explore the rich context information among neighbor keys and enhance the representation ability. Extensive experiments on both synthetic and real-world datasets have demonstrated the superiority and effectiveness of the proposed CAD framework.
Furthermore, benefiting from its lightweight design and high efficiency, the proposed CAD can be applied to smartphone cameras, autonomous driving, drone detection, and other scenarios to achieve real-time rain removal and improve visual effects. In addition to rain, harsh weather conditions such as snow and fog can also greatly affect people’s visual perception. Therefore, we will conduct experiments in a future study to verify the effectiveness of the proposed CAD in snow and fog removal and make improvements to enhance the practical value of CAD in real-world applications.

Author Contributions

Conceptualization, methodology, and writing—original draft preparation, G.C.; implementation, validation, investigation, resources, writing—review and editing, R.Y.; framework, formal analysis, visualization, J.G.; supervision, direction and planning, review, funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62201333; Postgraduate Education Innovation Program of Shanxi Province, grant number 2024XSY58; and the Basic Research Program of Shanxi Province, grant number 202203021222220.

Data Availability Statement

The online experimental datasets in this paper are available at https://github.com/YRui106/CADNetwork, accessed on 1 April 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1 lists all the mathematical notations used in the formulas in this paper to help readers understand their meanings.
Table A1. Mathematical notation nomenclature.

| Math Notation | Meaning |
| ⊗ | Element-wise multiplication |
| σ(·) | Sigmoid activation function |
| φ(·) | ReLU activation function |
| γ(·) | GELU non-linearity function |

References

  1. Jiang, X.; Liu, T.; Song, T.; Cen, Q. Optimized Marine Target Detection in Remote Sensing Images with Attention Mechanism and Multi-Scale Feature Fusion. Information 2025, 16, 332. [Google Scholar] [CrossRef]
  2. Liu, J.; Qi, J.; Zhong, P.; Jiang, J. A Hyperspectral Nonlinear Unmixing Network for Nearshore Underwater Target Detection. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 22–24 March 2024; pp. 2174–2178. [Google Scholar]
  3. Alvarado-Robles, G.; Espinosa-Vizcaino, I.; Manriquez-Padilla, G.; Saucedo-Dorantes, J. SDKU-Net: A Novel Architecture with Dynamic Kernels and Optimizer Switching for Enhanced Shadow Detection in Remote Sensing. Computers 2025, 14, 80. [Google Scholar] [CrossRef]
  4. Li, Y. Rain Streak Removal Using Layer Priors. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2276–3033. [Google Scholar]
  5. Chen, Y.L.; Hsu, C.T. A Generalized Low-Rank Appearance Model for Spatio-temporally Correlated Rain Streaks. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 1968–1975. [Google Scholar]
  6. Luo, Y.; Xu, Y.; Ji, H. Removing Rain from a Single Image via Discriminative Sparse Coding. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3397–3405. [Google Scholar]
  7. Yang, W.; Tan, R.T.; Wang, S.; Fang, Y.; Liu, J. Single Image Deraining: From Model-Based to Data-Driven and Beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4059–4077. [Google Scholar] [CrossRef]
  8. Fu, X.; Huang, J.; Zeng, D.; Huang, Y.; Paisley, J. Removing Rain from Single Images via a Deep Detail Network. In Proceedings of the 2017 IEEE Conference on Computer Vision & Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1715–1723. [Google Scholar]
  9. Zhang, H.; Patel, V.M. Density-Aware Single Image De-raining Using a Multi-stream Dense Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 695–704. [Google Scholar]
  10. Zheng, Y.; Yu, X.; Liu, M.; Zhang, S. Single-Image Deraining via Recurrent Residual Multiscale Networks. IEEE Trans. Neural Networks Learn. Syst. 2022, 3, 1310–1323. [Google Scholar] [CrossRef]
  11. Chang, Y.; Chen, M.; Yu, C.; Li, Y.; Chen, L.; Yan, L. Direction and Residual Awareness Curriculum Learning Network for Rain Streaks Removal. IEEE Trans. Neural Networks Learn. Syst. 2024, 35, 8414–8428. [Google Scholar] [CrossRef]
  12. Yao, Y.J.; Shi, Z.M.; Hu, H.W.; Li, J.; Wang, G.C.; Liu, L.T. GSDerainNet: A Deep Network Architecture Based on a Gaussian Shannon Filter for Single Image Deraining. Remote Sens. 2023, 15, 4825. [Google Scholar] [CrossRef]
  13. Zhang, S.; Liu, H.; Lin, S.; He, K. You Only Need Less Attention at Each Stage in Vision Transformers. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 23–28 June 2024; pp. 6057–6066. [Google Scholar]
  14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  15. Zhu, L.; Fu, C.W.; Lischinski, D.; Heng, P.A. Joint Bi-layer Optimization for Single-Image Rain Streak Removal. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2545–2553. [Google Scholar]
  16. Kang, L.W.; Lin, C.W.; Fu, Y.H. Automatic Single-Image-Based Rain Streaks Removal via Image Decomposition. IEEE Trans. Image Process. 2011, 21, 1742–1755. [Google Scholar] [CrossRef]
  17. Zheng, C.; Jiang, J.; Ying, W.; Wu, S.B. Single Image Deraining via Feature-based Deep Convolutional Neural Network. arXiv 2023, arXiv:2305.02100. [Google Scholar]
  18. Chen, X.; Pan, J.; Jiang, K.; Li, Y.; Huang, Y.; Kong, C.; Dai, L.; Fan, Z. Unpaired Deep Image Deraining Using Dual Contrastive Learning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2007–2016. [Google Scholar]
  19. Sivaanpu, A.; Thanikasalam, K. A Dual CNN Architecture for Single Image Raindrop and Rain Streak Removal. In Proceedings of the 2022 7th International Conference on Information Technology Research (ICITR), Moratuwa, Sri Lanka, 7–9 December 2022; pp. 1–6. [Google Scholar]
  20. Chen, H.; Chen, X.; Lu, J.; Li, Y. Rethinking Multi-Scale Representations in Deep Deraining Transformer. Proc. AAAI Conf. Artif. Intell. 2024, 38, 1046–1053. [Google Scholar] [CrossRef]
  21. Ragini, T.; Prakash, K. Progressive Multi-scale Deraining Network. In Proceedings of the 2022 IEEE International Symposium on Smart Electronic Systems (iSES), Warangal, India, 18–22 December 2022; pp. 231–235. [Google Scholar]
  22. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 5718–5729. [Google Scholar]
  23. Wang, Z.; Cun, X.; Bao, J.; Liu, J. Uformer: A General U-Shaped Transformer for Image Restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 17662–17672. [Google Scholar]
  24. Xiao, J.; Fu, X.; Liu, A.; Wu, F.; Zha, Z.J. Image De-Raining Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12978–12995. [Google Scholar] [CrossRef]
  25. Sun, S.; Ren, W.; Gao, X.; Wang, R.; Cao, X. Restoring Images in Adverse Weather Conditions via Histogram Transformer. arXiv 2024, arXiv:2407.10172. [Google Scholar]
  26. Kong, X.; Zhao, H.; Qiao, Y.; Dong, C. ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12011–12020. [Google Scholar]
  27. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans Pattern Anal Mach Intell. 2016, 38, 295–307. [Google Scholar] [CrossRef]
  28. Hui, Z.; Wang, X.; Gao, X. Fast and Accurate Single Image Super-Resolution via Information Distillation Network. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 723–731. [Google Scholar]
  29. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight Image Super-Resolution with Information Multi-distillation Network. arXiv 2019, arXiv:1909.11856. [Google Scholar]
  30. Wang, Y. Edge-enhanced Feature Distillation Network for Efficient Super-Resolution. arXiv 2022, arXiv:2204.08759. [Google Scholar]
  31. Kong, F.; Li, M.; Liu, S.; Liu, D.; He, J.; Bai, Y.; Chen, F.; Fu, L. Residual Local Feature Network for Efficient Super-Resolution. arXiv 2022, arXiv:2205.07514. [Google Scholar]
  32. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. arXiv 2021, arXiv:2108.10257. [Google Scholar]
  33. Wang, Y.; Li, Y.; Wang, G.; Liu, X. Multi-scale Attention Network for Single Image Super-Resolution. arXiv 2024, arXiv:2209.14145. [Google Scholar]
  34. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  35. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  36. Wang, Y.; Liu, Y.; Zhao, S.; Li, J.; Zhang, L. CAMixerSR: Only Details Need More “Attention”. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 25837–25846. [Google Scholar]
  37. Yang, W.; Tan, R.T.; Feng, J.; Liu, J.; Guo, Z.; Yan, S. Deep Joint Rain Detection and Removal from a Single Image. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1685–1694. [Google Scholar]
  38. Zhang, H.; Sindagi, V.; Patel, V.M. Image De-Raining Using a Conditional Generative Adversarial Network. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 3943–3956. [Google Scholar] [CrossRef]
  39. Huynh-Thu, Q.; Ghanbari, M. Scope of validity of PSNR in image/video quality assessment. Electron. Lett. 2008, 44, 800–801. [Google Scholar] [CrossRef]
  40. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  41. Ren, D.; Zuo, W.; Hu, Q.; Zhu, P.; Meng, D. Progressive Image Deraining Networks: A Better and Simpler Baseline. arXiv 2019, arXiv:1901.09221. [Google Scholar]
  42. Wang, H.; Yue, Z.; Xie, Q.; Zhao, Q.; Zheng, Y.; Meng, D. From Rain Generation to Rain Removal. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14786–14796. [Google Scholar]
  43. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Multi-Stage Progressive Image Restoration. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14816–14826. [Google Scholar]
  44. Zheng, S.; Lu, C.; Wu, Y.; Gupta, G. SAPNet: Segmentation-Aware Progressive Network for Perceptual Contrastive Deraining. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 3–8 January 2022; pp. 52–62. [Google Scholar]
  45. Chen, L.; Lu, X.; Zhang, J.; Chu, X.; Chen, C. HINet: Half Instance Normalization Network for Image Restoration. arXiv 2021, arXiv:2105.06086. [Google Scholar]
  46. Li, Y.; Lu, J.; Chen, H.; Wu, X.; Chen, X. Dilated Convolutional Transformer for High-Quality Image Deraining. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 4199–4207. [Google Scholar]
  47. Jiang, K.; Wang, Z.; Chen, C.; Wang, Z.; Cui, L.; Lin, C.-W. Magic ELF: Image Deraining Meets Association Learning and Transformer. arXiv 2022, arXiv:2207.10455. [Google Scholar]
Figure 1. Illustration of the proposed CAD framework for single image deraining.
Figure 2. Four-token contextual attention module (FTCA).
Figure 3. Feature refinement block (FRB).
Figure 4. Content-aware deraining block (CADB).
Figure 5. Content-aware convolution and attention mixer module (CAMM).
Figure 6. Multi-scale double-gated feed-forward module (MDFM).
Figure 7. Visual comparison of derained results on Rain100H (the 1st image) and Rain100L (the 2nd image) datasets.
Figure 8. Visual results of derained results on a real-world dataset.
Table 1. Comparison results of PSNR and SSIM on the Rain100L and Rain100H datasets.

| Category | Method | Rain100L (PSNR/SSIM) | Rain100H (PSNR/SSIM) | Params [M] | FLOPs [G] | Time [s] |
| Prior-based | GMM [4] | 26.945/0.844 | 16.317/0.431 | - | - | - |
| Prior-based | DSC [6] | 27.271/0.837 | 15.579/0.396 | - | - | - |
| CNN-based | PReNet [41] | 32.420/0.950 | 26.770/0.858 | 0.169 | 16.576 | 0.163 |
| CNN-based | VRGNet [42] | 36.662/0.973 | 30.077/0.887 | 0.169 | 16.576 | 0.120 |
| CNN-based | MPRNet [43] | 36.687/0.967 | 30.427/0.891 | 3.637 | 137.163 | 0.207 |
| CNN-based | SAPNet [44] | 32.291/0.951 | 28.046/0.867 | 0.182 | 165.938 | 0.198 |
| CNN-based | HINet [45] | 37.591/0.971 | 30.649/0.894 | 88.671 | 42.629 | 0.698 |
| CNN-based | DRAN [11] | 37.830/0.980 | 30.650/0.900 | - | - | - |
| Transformer-based | IDT [24] | 37.010/0.970 | 29.950/0.898 | 16.390 | 58.440 | 0.164 |
| Transformer-based | DCT [46] | 38.190/0.970 | 30.740/0.890 | - | - | - |
| Hybrid-based | ELF [47] | 36.672/0.968 | 30.480/0.896 | 1.532 | 66.390 | 0.125 |
| Hybrid-based | CAD (ours) | 39.301/0.984 | 30.807/0.915 | 1.196 | 32.956 | 0.109 |
Table 2. Average scores of user satisfaction.

| Methods | GMM | DSC | SAPNet | VRGNet | MPRNet | HINet | CAD |
| Scores | 4.11 | 3.52 | 6.27 | 7.14 | 8.17 | 8.56 | 9.25 |
Table 3. Ablation study on different blocks on Rain100H (each variant removes or replaces the listed component; configurations follow the description in the text).

| Model | PSNR | SSIM |
| model (a): GDFN replaces MDFM | 30.203 | 0.907 |
| model (b): SA replaces FTCA in RSDM | 30.615 | 0.912 |
| model (c): SA replaces DFTCA | 30.703 | 0.913 |
| model (d): SA replaces FFTCA | 30.697 | 0.913 |
| model (e): without FRB | 30.562 | 0.910 |
| model (f): convolution replaces CAMM | 30.361 | 0.907 |
| model (g): self-attention replaces CAMM | 30.062 | 0.906 |
| CAD (ours) | 30.807 | 0.915 |
Table 4. Ablation study on window size on Rain100H.

| Window Size | Params [M] | FLOPs [G] | PSNR | SSIM |
| 8 × 8 | 1.124 | 32.950 | 30.542 | 0.908 |
| 16 × 16 | 1.196 | 32.956 | 30.807 | 0.915 |
| 32 × 32 | 1.771 | 33.007 | 30.681 | 0.911 |
Table 5. Ablation study on pool size on Rain100H.

| Pool Size | Params [M] | FLOPs [G] | PSNR | SSIM |
| 5 | 1.195 | 32.878 | 30.623 | 0.912 |
| 6 | 1.196 | 32.956 | 30.807 | 0.915 |
| 7 | 1.198 | 33.056 | 30.696 | 0.912 |
| 8 | 1.201 | 33.215 | 30.702 | 0.912 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
