A Novel Residual Dense Pyramid Network for Image Dehazing

Recently, convolutional neural networks (CNNs) based on the encoder-decoder structure have been successfully applied to image dehazing. However, these CNN-based dehazing methods have two limitations. First, the models are large, with an enormous number of parameters, which not only consumes much GPU memory but also makes them hard to train from scratch. Second, the models ignore the structural information at different resolutions of intermediate layers and therefore cannot capture informative texture and edge information for dehazing simply by stacking more layers. In this paper, we propose a light-weight end-to-end network named the residual dense pyramid network (RDPN) to address these problems. To fully exploit the structural information at different resolutions of intermediate layers, a new residual dense pyramid (RDP) is proposed as a building block. By introducing a dense information fusion layer and a residual learning module, the RDP can maximize the information flow and extract local features. Furthermore, the RDP learns the structural information from intermediate layers via a multiscale pyramid fusion mechanism. To reduce the number of network parameters and to ease the training process, we use one RDP in the encoder and two RDPs in the decoder, followed by a multilevel pyramid pooling layer that incorporates global context features before the final result is estimated. Extensive experimental results on a synthetic dataset and real-world images demonstrate that the new RDPN achieves favourable performance compared with some state-of-the-art methods, e.g., the recent densely connected pyramid dehazing network, the all-in-one dehazing network, the enhanced pix2pix dehazing network, pixel-based alpha blending, artificial multi-exposure image fusions and the genetic programming estimator, in terms of accuracy, run time and number of parameters. Specifically, RDPN outperforms all of the above methods in terms of PSNR by at least 4.25 dB. The run time of the proposed method is 0.021 s, and the number of parameters is 1,534,799, only 6% of that used by the densely connected pyramid dehazing network.


Introduction
The images taken on hazy days inevitably lose colour fidelity and intensity contrast, since floating particles in the atmosphere such as water droplets and dust particles absorb or scatter the light reflected

• We propose a new end-to-end residual dense pyramid network (RDPN) based on the encoder-decoder architecture, which achieves high performance in image dehazing.
• We propose the residual dense pyramid (RDP) as the basic building module, which not only effectively boosts network performance by improving the information flow via dense connections and the residual learning mechanism, but also learns structural features at different resolutions from all the layers of the encoder and decoder.
• By using one RDP in the encoder and two RDPs in the decoder, the light-weight RDPN contains far fewer network parameters (only 6% of that used by DCPDN [24]) and is much faster than existing CNN-based methods (run time is reduced to 0.021 s).
• To enhance the generalization ability of the RDPN, both indoor and outdoor images are collected to generate a new synthetic dataset for training. Extensive experimental results demonstrate that our light-weight RDPN achieves competitive results compared with heavy-weight network models.

Network Structure
Inspired by the flexibility of the encoder-decoder network that can produce compelling results for image denoising, super-resolution and image harmonization, we explored the effective design of encoder-decoder in image dehazing. In most existing encoder-decoder modules, the dense block is employed as the basic building model and stacked layer-by-layer in a greedy fashion to construct the network architecture for feature transformation.
Consider Figure 3a as an example: three dense blocks with their down-sampling and transition blocks from dense-net121 [28] are used to build the encoder, and the symmetrical dense blocks with corresponding deconvolutions form the decoder. Although this design utilizes the dense information flow to extract features with smaller sizes and transform them back to a haze-free image, it totally neglects multi-scale structural information, which has been demonstrated to be effective in traditional dehazing methods [24,25]. Given a real hazy image as input, the dehazing result has a halo effect (see the magnified detail in Figure 3a). Based on this model, Zhang et al. added a multilevel pyramid pooling block (MPPB) with pooling sizes 1/32, 1/16, 1/8 and 1/4 at the end of the decoder (see Figure 3b) and denoted this network as PDCN. The global structural information at different scales was then used to estimate the final result [24]. However, this scheme only takes the multi-scale information of the last layer into account, and the high-level global information from intermediate layers is not considered at all. As a result, the halo effect is not completely removed (see the white shadow in the magnified detail of Figure 3b). To make fuller use of long-range global information, some work, e.g., [27], adds an MPPB to each layer of the decoder (see Figure 3c) to address the drawback of the model in Figure 3b. The improved model can remove the halo effect and enhance fine features, but the multi-scale information of the bottom-up pathway in the encoder is not explored, and the colours in the dehazing result shift from the real colours. For example, the magnified detail in Figure 3c has taken on a reddish tint. In our proposed approach, we also adopted the MPPB in the encoder. To verify the effectiveness of the MPPB, only one dense block from dense-net121 [28] and one MPPB from PDCN [24] were used to build the encoder.
Similarly, two MPPBs from PDCN [24] and two dense blocks with deconvolutions from dense-net121 [28] were adopted for constructing the decoder (see Figure 3d). In the corresponding output, we found not only that the halo effect was removed completely, but also the scene colours were much closer to the ones in the real world (see the output and magnified detail in Figure 3d). To understand this better, we further visualize the feature map at the bottleneck of the encoder-decoder framework. Figure 3e displays the intermediate feature map of Figure 3c. Meanwhile, Figure 3f shows the intermediate feature map of Figure 3d. By comparing these two results, we found that even one MPPB in the encoder allowed the network to obtain more global structural information than stacked dense blocks. For example, the edge and contour information is richer in Figure 3f than that in Figure 3e.  Based on the above discussion, we propose the residual dense pyramid (RDP) as the basic building module, which includes a dense block, residual learning and an MPPB. By building the encoder and decoder with RDP, the novel residual dense pyramid network (RDPN) can learn and fuse structural information from different resolutions at all layers. Generally speaking, RDPN takes a hazy image as input and predicts its corresponding dehazed result as output. As shown in Figure 1, the architecture of the RDPN mainly consists of five parts: a shallow feature extraction layer (SFEL), an encoder, a decoder, a multilevel pyramid pooling layer (MPPL) and a global information fusion layer (GIFL).
Simultaneously, the skip connection with the same filter size is used to ease the training of the RDPN. The specific operations of these five parts are described in the next five subsections.
Shallow feature extraction layer (SFEL): As reported in many previous approaches [29], the low-level features such as contours and edges extracted in shallower stages usually have a smaller feature size and provide rich and detailed global information for deeper stages, which contain high-level features. Therefore, the SFEL is necessary for the encoder-decoder network. In our work, the SFEL also acts as a transition layer, which enables features with different spatial sizes to be captured gradually and avoids sharply shrinking the input to a smaller spatial size. In our design, we used a 1 × 1 convolution layer with a stride of two to extract the shallow features $S_s$. At half of the input image size, the shallow features not only preserve the primary contours and edges for deeper stages, but also suppress noise and unimportant details. The related operation can be expressed as:

$$S_s = f_{SFEL}(S), \tag{2}$$

where $S$ is the input image, $f_{SFEL}(\cdot)$ denotes the 1 × 1 convolution operation of the SFEL, and $S_s$ is the output, which also serves as the input to the subsequent encoder.

Encoder: The encoder further extracts a set of feature maps from $S_s$. Inspired by dense connection [28] and residual learning [30], the proposed pyramid fusion can fully learn spatial information at different resolutions. In this paper, we propose the residual dense pyramid (RDP) as the basic building module and employ one RDP to construct the encoder. For each RDP, a 1 × 1 convolution layer is also used to capture all extracted features of the RDP and to perform down-sampling at the same time. The output of the encoder can be formulated as:

$$S_{en} = f_{Conv1}(f_{RDP}(S_s)), \tag{3}$$

where $f_{RDP}$ denotes the composite function of our RDP, which comprises dense information fusion, multiscale pyramid fusion and residual learning, $f_{Conv1}$ denotes the convolution operation following the RDP, and $S_{en}$ is the output. More details of the RDP are given in Section 2.2.
Decoder: After the features are extracted by the encoder, the decoder is utilized to restore the image content and reconstruct the dehazed image. In the proposed decoder, two RDPs are used. Similar to the encoder, each RDP is followed by a 3 × 3 deconvolution layer that refines the mapped features and performs up-sampling. The decoder function can be described as follows:

$$S_{de} = f_{DeConv2}(f_{RDP}(f_{DeConv1}(f_{RDP}(S_{en})))), \tag{4}$$

where $f_{DeConv1}$ and $f_{DeConv2}$ denote the two 3 × 3 deconvolution functions and $S_{de}$ is the final output of the decoder network.

Multilevel pyramid pooling layer (MPPL): Similar to [24], where features at different scales in an image are utilized for image dehazing, we also adopted a multilevel pyramid pooling layer to ensure that features from a hierarchical global context prior, containing information from different scales, are embedded in the resulting image. Here, four pooling operations with pyramid sizes 1/4, 1/8, 1/16 and 1/32 are employed to obtain multilevel features. Subsequently, these pooled features are up-sampled to the size of the input by nearest neighbour interpolation and concatenated with the original input to capture more global context information. The above operation can be expressed as:

$$S_p = [f_u(S_{de}^{1/4}), f_u(S_{de}^{1/8}), f_u(S_{de}^{1/16}), f_u(S_{de}^{1/32}), S_{de}], \tag{5}$$

where $S_{de}^{1/4}$, $S_{de}^{1/8}$, $S_{de}^{1/16}$ and $S_{de}^{1/32}$ are the pooling results of $S_{de}$ with pyramid sizes 1/4, 1/8, 1/16 and 1/32, respectively, $f_u$ is the up-sampling operation and $S_p$ is the output of the MPPL.
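The pooling, up-sampling and concatenation steps of the MPPL can be sketched on a single-channel toy feature map. This is only an illustration under stated assumptions, not the authors' implementation: it assumes that "pyramid size 1/k" means non-overlapping k × k average pooling (the text does not name the pooling type), it works on one 2D map instead of a multi-channel tensor, and the function names `avg_pool`, `upsample_nearest` and `mppl` are hypothetical.

```python
def avg_pool(fmap, k):
    """Non-overlapping k x k average pooling of a 2D list (H, W divisible by k)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            sum(fmap[i * k + di][j * k + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(w // k)
        ]
        for i in range(h // k)
    ]


def upsample_nearest(fmap, k):
    """Nearest-neighbour up-sampling of a 2D list by an integer factor k."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[i // k][j // k] for j in range(w * k)] for i in range(h * k)]


def mppl(s_de, factors=(4, 8, 16, 32)):
    """Return [f_u(S^{1/4}), f_u(S^{1/8}), f_u(S^{1/16}), f_u(S^{1/32}), S_de].

    Each entry plays the role of one group of channels in the concatenation.
    """
    branches = [upsample_nearest(avg_pool(s_de, k), k) for k in factors]
    return branches + [s_de]


# Toy 32 x 32 decoder output (assumed size, chosen so that 1/32 pooling is valid).
s_de = [[float(i + j) for j in range(32)] for i in range(32)]
s_p = mppl(s_de)
print(len(s_p), len(s_p[0]), len(s_p[0][0]))  # prints: 5 32 32
```

Every branch is restored to the resolution of the MPPL input before concatenation, so the coarsest branch (1/32) carries a single global average while the finest branch (1/4) keeps local averages; the final convolution can then mix these context levels per pixel.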
Global information fusion layer (GIFL): The extracted global hierarchical features from the MPPL were further fused in the GIFL. In particular, we placed a 3 × 3 convolution layer in the last stage of the RDPN. The reconstructed result $S_g$ from the final convolution function $f_{GIFL}$ is given by:

$$S_g = f_{GIFL}(S_p). \tag{6}$$

Residual Dense Pyramid
Motivated by the performance of the residual dense block (RDB) [29], which combines the advantages of the dense block and the residual block, we propose a new compact block named RDP. Unlike the existing RDB, the RDP also includes multiscale pyramid fusion, which enables it to learn local context information and to explore spatial relations. The proposed RDP is shown in Figure 2, and its components are discussed below.
Dense information fusion design (DIF): Based on the observation that dense connections can maximize the information flow, we adopted dense connections as the basic structure in the RDP. As displayed in Figure 2, $S_d$ and $S_o$ denote the input and output features, respectively. Red arrows indicate the dense connections between the six convolution layers. Suppose that $S_d$ has $G_0$ feature maps and that the growth rate of the dense connections is $G$; then the concatenation of $S_d$ with the outputs of the six dense layers has $G_0 + 6G$ feature maps. Although a higher growth rate $G$ can introduce more local features, it also makes the network hard to train. Hence, it is necessary to reduce the number of features. Inspired by the RDB, a 1 × 1 convolution layer (shown in grey in Figure 2) was added to control and fuse the information flow. The overall structure of our proposed DIF is specified as:

$$S_d^l = f_D([S_d, S_d^0, \ldots, S_d^{l-1}]), \quad l = 0, \ldots, 5, \qquad S_f = f_{DIF}([S_d, S_d^0, \ldots, S_d^5]). \tag{7}$$

Here, $[S_d, S_d^0, \ldots, S_d^{l-1}]$ denotes the concatenation of the input $S_d$ and the features produced by the preceding dense layers $0, \ldots, l-1$. $f_D$ denotes the 3 × 3 convolution function of the $l$th layer, which takes $G_0 + l \times G$ input feature maps and produces $G$ feature maps. $f_{DIF}$ denotes a 1 × 1 convolution function that fuses the original input $S_d$ and the outputs $S_d^0, \ldots, S_d^5$ of the six dense connection layers into $S_f$ with $G$ feature maps.
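The channel bookkeeping of the DIF can be checked with a few lines of arithmetic. The sketch below is illustrative only: it indexes the dense layers from 0, so that layer l receives the block input plus the l earlier outputs, and it assumes G_0 = 64 purely for the example (the actual input channel counts are given in Table 2, which is not reproduced here). The helper names are hypothetical, and each convolution is assumed to carry a bias term.

```python
def dif_channels(g0, g, num_layers=6):
    """Input channel count of each dense layer and of the 1 x 1 fusion layer.

    Layers are indexed 0 .. num_layers - 1; layer l sees the block input (g0
    maps) concatenated with the l earlier outputs (g maps each).
    """
    layer_inputs = [g0 + l * g for l in range(num_layers)]
    fusion_input = g0 + num_layers * g
    return layer_inputs, fusion_input


def conv_params(c_in, c_out, k):
    """Number of weights plus biases of a k x k convolution (bias assumed)."""
    return c_in * c_out * k * k + c_out


G0, G = 64, 32  # G = 32 is the growth rate reported in the paper; G0 = 64 is assumed
layer_inputs, fusion_input = dif_channels(G0, G)
dense_params = sum(conv_params(c, G, 3) for c in layer_inputs)  # six 3 x 3 layers
fusion_params = conv_params(fusion_input, G, 1)                 # 1 x 1 fusion to G maps

print(layer_inputs)   # input channels of dense layers 0..5
print(fusion_input)   # channels entering the 1 x 1 fusion layer
print(dense_params, fusion_params)
```

Under these assumptions the 1 × 1 fusion layer is cheap compared with the 3 × 3 dense layers, which is consistent with its role of reducing the concatenated features back to G maps before the pyramid fusion.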
Multiscale pyramid fusion (MPF): Even though the DIF improves the information flow to a large extent, the features from the DIF still lose spatial relations. Local context information at different scales helps to explore spatial information at different resolutions. To address this problem efficiently, we used multiscale pyramid fusion (MPF). The MPF is realized by four pooling operations and five convolution operations, as illustrated in Figure 2. In particular, we first pooled the feature maps from the DIF at four different scales: 1/2, 1/4, 1/8 and 1/16. A single 1 × 1 convolution layer was introduced in the last stage for learning context information. The operation of the MPF can be defined as:

$$S_m = f_{MPF}([S_f^{1/2}, S_f^{1/4}, S_f^{1/8}, S_f^{1/16}, S_f]), \tag{8}$$

where $[S_f^{1/2}, S_f^{1/4}, S_f^{1/8}, S_f^{1/16}, S_f]$ represents the concatenation of the pooling results and the original input, $f_{MPF}$ denotes the final 1 × 1 convolution and $S_m$ is the final output of the MPF.

Residual learning (RL): In order to enhance the representation ability of the RDP and to achieve better performance, we introduced the residual learning mechanism in the RDP before the final output. The final output $S_r$ of the RDP is defined as:

$$S_r = S_m + S_d. \tag{9}$$

The motivation for this design is that the final RL ensures that the RDP makes full use of the advantages offered by the DIF and the MPF, enabling high-quality estimation of dehazed images.

Loss Function
Since earlier works demonstrated that the Euclidean loss (L 2 ) easily leads to colour distortion or halo artefacts in the dehazed images [24], the works in [24,25] attempted to solve this problem by adding some edge preserving information in the loss, such as the combination of three losses, including the feature edge loss L F , the gradient loss L G and standard (L 2 ) loss function. For fair comparison, we adopted the same combined loss function for learning the parameters of the proposed RDPN. Considering that the combined loss function used in [24] was designed for the U net, PDCN and GAN jointly, we only adopted part of the loss function that is used in PDCN.
The combined loss is:

$$L = L_2 + \lambda L_G + \beta L_F, \tag{10}$$

where $\lambda$ and $\beta$ are weighting coefficients for the loss terms $L_G$ and $L_F$, respectively. Let $I_i$, $i = 1, 2, \ldots, N$, and $J_i$, $i = 1, 2, \ldots, N$, represent the set of hazy images and the set of corresponding ground truths, respectively. Then, $L_2$ is defined as:

$$L_2 = \frac{1}{N} \sum_{i=1}^{N} \| f(I_i; \Theta) - J_i \|_2^2, \tag{11}$$

where $f$ represents the proposed dehazing network and $\Theta$ denotes the parameters in $f$. The gradient loss $L_G$ is defined using gradient operations in the horizontal and vertical directions:

$$L_G = \frac{1}{N} \sum_{i=1}^{N} \left( \| G_h(f(I_i; \Theta)) - G_h(J_i) \|_2 + \| G_v(f(I_i; \Theta)) - G_v(J_i) \|_2 \right), \tag{12}$$

where $G_v$ and $G_h$ are the vertical and horizontal gradient operators, respectively. Such a loss function allows us to preserve fine details and to remove artefacts. The feature edge loss $L_F$ is defined based on the edge information extracted from a pre-trained VGG-16 network. This design aims to make the reconstructed image approximate the ground truth from the perspective of the feature edge. It is defined as:

$$L_F = \frac{1}{N} \sum_{i=1}^{N} \left( \| f_{v1}(f(I_i; \Theta)) - f_{v1}(J_i) \|_2 + \| f_{v2}(f(I_i; \Theta)) - f_{v2}(J_i) \|_2 \right), \tag{13}$$

where $f_{v1}$ and $f_{v2}$ are the edge features extracted from the first and second layers of the pre-trained VGG-16 network [31]. Experiments in super-resolution, image dehazing and other related fields [24,25] provide sufficient evidence of the effectiveness of this loss function.
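How the three loss terms combine can be illustrated on tiny single-channel images. The sketch below is a simplification under explicit assumptions rather than the training code: gradients are plain forward differences, every term uses the mean squared difference, and the VGG-16 feature edge term is left as an optional callable (`feature_loss`, a hypothetical hook) because it requires a pre-trained network. The defaults `lam=2.0` and `beta=0.8` are the empirical values reported in the implementation details.

```python
def l2_loss(pred, target):
    """Mean squared difference between two 2D lists of equal shape."""
    n = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2 for row_p, row_t in zip(pred, target) for p, t in zip(row_p, row_t)
    ) / n


def grad_h(img):
    """Horizontal forward differences (assumed form of the operator G_h)."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in img]


def grad_v(img):
    """Vertical forward differences (assumed form of the operator G_v)."""
    return [[img[i + 1][j] - img[i][j] for j in range(len(img[0]))] for i in range(len(img) - 1)]


def gradient_loss(pred, target):
    """Sum of the horizontal and vertical gradient discrepancies."""
    return l2_loss(grad_h(pred), grad_h(target)) + l2_loss(grad_v(pred), grad_v(target))


def combined_loss(pred, target, lam=2.0, beta=0.8, feature_loss=None):
    """L = L_2 + lam * L_G + beta * L_F; L_F is skipped if no callable is given."""
    total = l2_loss(pred, target) + lam * gradient_loss(pred, target)
    if feature_loss is not None:
        total += beta * feature_loss(pred, target)
    return total


target = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[v + 0.5 for v in row] for row in target]  # constant brightness offset
print(combined_loss(target, target))   # identical images give zero loss
print(combined_loss(shifted, target))  # only the L_2 term reacts to a constant offset
```

A constant offset leaves the gradients unchanged, so only the pixel-wise term penalizes it, whereas blurred or displaced edges are penalized by the gradient term; this is the complementary behaviour the combined loss relies on.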

Discussions
Difference from DCPDN: Inspired by the densely connected pyramid dehazing network (DCPDN), we also propose a novel end-to-end RDPN for image dehazing. However, it is worth noting that the two models differ markedly, above all in size. DCPDN uses a U network, a pyramid densely connected network (PDCN) and a GAN to jointly estimate the atmospheric light, the transmission map and the dehazing result [24]. Hence, the size of its network model was 268 MB, and the number of network parameters was 12,446,386. In contrast, our proposed network used one RDP in the encoder and two RDPs in the decoder, which not only saved GPU memory, but also improved computational efficiency. Using the same PyTorch platform as DCPDN, the size of the proposed model was 4.3 MB and the number of network parameters was only 787,707, about 6% of that used in DCPDN.
Difference from PDCN: The PDCN in DCPDN was designed based on the encoder-decoder structure for estimating the transmission map [24]. Compared with our proposed RDPN, there are two main differences. First, PDCN uses dense blocks as the basic building blocks, and stacking dense blocks cannot capture informative texture and structural information for dehazing. In contrast, we used RDP to construct our RDPN, which not only combines the advantages of dense blocks and residual blocks, but also adds the multiscale pyramid fusion mechanism in the RDP for learning structural information at different resolutions. Second, PDCN explores the structural information by applying the multilevel pyramid pooling block (MPPB) at the end of the decoder, but ignores the structural information from intermediate layers. Our RDPN not only learns structural information from each layer of the network by using RDP, but also learns global context information by placing a multilevel pyramid pooling layer (MPPL) at the end of the network.
Difference from RDB: Due to the convincing advantages of the residual dense block (RDB) [29], we propose RDP based on RDB. However, there are three main differences between them. First, RDB was designed for image super-resolution, while our proposed RDP was designed to realize image dehazing. Second, RDB was proposed by combining the advantages of dense blocks and residual blocks, while our proposed RDP added the multiscale pyramid fusion between dense information fusion and residual learning for fully learning the spatial information at different resolutions. Third, existing methods stacked RDBs for extracting features with a fixed scale [29], while we embedded RDP into the encoder-decoder architecture and added a convolution or deconvolution operation following each RDP to realize the down-sampling or the up-sampling operation.

Implementation Details
The detailed architecture and parameter settings of the RDPN are provided in Table 1, where each RDP in the encoder and decoder has the same setting except for the filter number, which depends on the number of output channels of the preceding layer; these are shown in Table 2. Each convolutional layer in the DIF was followed by a rectified linear unit (ReLU) for improving training efficiency and for adding non-linearity. The growth rate G was 32. Adam was selected as the optimization algorithm with a learning rate of $2 \times 10^{-3}$ for training the model. The batch size was set to two. Empirical values of λ and β were used, which were 2 and 0.8, respectively. All the training images were resized to 512 × 512, and the output of the corresponding clear image had three channels (red, green and blue). The RDPN was trained for 3,200,000 iterations.

Experimental Results
In this section, we further investigate the effectiveness of RDPN. We first introduce our large dataset, which contained both a synthetic dataset and real hazy images for training and testing. Then, we compare our method with several state-of-the-art methods in terms of visual results and accuracy. Finally, a series of analyses and discussion related to the performance, run time and limitations of the RDPN are given.

Datasets
Although there are some existing training datasets, the amount of synthetic hazy images contained in them is enormous. For example, the RESIDE dataset [32] contains 313,950 synthetic outdoor images. Directly using existing datasets for training our model would cost too much training time. Besides, it is also not fair to compare our trained model with other dehazing models that were trained on 4000∼10,000 synthetic images [24]. Therefore, we created our dataset including both indoor and outdoor images. Similar to [24], 1000 depth images from the NYU depth dataset [33] were selected randomly for generating 4000 indoor training images and 400 testing images via Equation (1), with random atmospheric light A ∈ {0.5, 1} and scattering coefficient β ∈ {0.4, 0.6}. In addition, from the RESIDE dataset [32], another 4000 synthetic training images and an extra 400 test images with β in {0.04, 0.06, 0.08, 0.1, 0.12, 0.16, 0.2} and A in {0.8, 0.85, 0.9, 1} were chosen randomly as the outdoor images. Hence, we had 8000 training images and 800 testing images in total, including 400 indoor images denoted as the indoor testing dataset and 400 outdoor images denoted as the outdoor testing dataset.
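The synthesis step can be sketched as follows. Equation (1) is not reproduced in this section, so the sketch assumes it is the standard atmospheric scattering model, I(x) = J(x) t(x) + A (1 - t(x)) with transmission t(x) = exp(-β d(x)); the function name `synthesize_hazy` and the toy values are hypothetical, and a single intensity channel in [0, 1] is used for brevity.

```python
import math


def synthesize_hazy(clear, depth, airlight, beta):
    """Apply I = J * t + A * (1 - t) with t = exp(-beta * d) to every pixel.

    clear, depth: 2D lists of equal shape (intensities in [0, 1], depth >= 0).
    airlight: global atmospheric light A; beta: scattering coefficient.
    """
    hazy = []
    for row_j, row_d in zip(clear, depth):
        out_row = []
        for j, d in zip(row_j, row_d):
            t = math.exp(-beta * d)  # transmission decays with scene depth
            out_row.append(j * t + airlight * (1.0 - t))
        hazy.append(out_row)
    return hazy


# Toy example: one near pixel (depth 0) and one far pixel (depth 5).
clear = [[0.2, 0.2]]
depth = [[0.0, 5.0]]
hazy = synthesize_hazy(clear, depth, airlight=0.8, beta=0.6)
print(hazy)
```

Pixels at zero depth keep their clear intensity, while distant pixels converge to the atmospheric light, which is why a larger β or a larger depth produces a denser haze in the synthetic images.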

Comparison with Existing Dehazing Methods
In this section, we first compare our model on synthetic datasets (indoor testing dataset and outdoor testing dataset) with six state-of-the-art dehazing methods, including DCP [6], NLP [7], CAP [11], AODN [19], Dehazenet [17] and DCPDN [24]. Two commonly used quality metrics, PSNR and SSIM, are used to evaluate the dehazing results. All the PSNR/SSIM measures are reported in Table 3. Compared with the other dehazing methods, our proposed RDPN had a higher PSNR and SSIM. In Figure 4, three samples with magnified details from the synthetic dataset were selected for visual comparison. Among them, Figure 4a,c,e are the original hazy images, and Figure 4b,d,f are the magnified details of the regions enclosed in red rectangles in the corresponding hazy samples. The corresponding ground truths of Figure 4 are shown in Figure 5. Meanwhile, Figure 6a-f display the dehazing results of DCP [6], NLP [7], CAP [11], AODN [19], Dehazenet [17] and DCPDN [24], respectively, and the corresponding magnified details are shown in the second, fourth and sixth rows. It can be seen that even though existing methods could remove haze from the original images to some extent, their results tended to be either over dehazed or under dehazed. For example, the results of DCP [6] (Figure 6a) and CAP [11] (Figure 6c) were over dehazed and had some colour distortions compared with the ground truths (Figure 5), e.g., the towel, building and sky region in the magnified images from the second, fourth and sixth rows in Figure 6a,c. In the dehazed results of NLP [7], there were haze residuals and artefacts, which could be observed in the road and tree in the third and fifth rows of Figure 6b. These improperly dehazed results were probably due to the invalid assumptions of the priors used in the above methods.
AODN and Dehazenet, which estimate the dehazing result and the transmission map with neural networks, respectively, could overcome the limitations of the hand-crafted prior based methods, e.g., DCP, NLP and CAP. However, the results shown in Figure 6d,e still contained some hazy residuals. DCPDN, which uses a GAN to optimize the dehazing result estimated by neural networks, obtained clearer results (Figure 6f) than the other methods. Unfortunately, upon detailed inspection, this method produced noticeable colour shifts, e.g., the towel in the second row and the buildings in the sixth row. In contrast, our method worked better than the others and generated clearer images with less colour distortion. The results displayed in Figure 6g are visually closest to the ground truth shown in Figure 5. The PSNR/SSIM measures shown under each image also demonstrate the favourable performance of the proposed method.

RESIDE [32], a recently released dehazing benchmark, was also adopted for further evaluating the performance of RDPN. The sub-dataset SOTS [32] in RESIDE, which contains 500 indoor images and 500 outdoor images with different haze concentrations, was used for testing the performance of different dehazing algorithms. The quantitative results of our model and of seven additional state-of-the-art methods tested on SOTS are displayed in Tables 4 and 5, where the quantitative values of some methods were collected from [20,21,23]. From Table 4, we can see that our model ranked third among popular dehazing methods on the indoor images of SOTS, behind only GridDN [23] and EPDN [20]. Meanwhile, our model ranked second on the outdoor images of SOTS, as shown in Table 5. It is noteworthy that GridDN consists of a pre-processing module, a dehazing module and a post-processing module for image dehazing. Hence, GridDN had the most competitive performance.
In contrast, our model removed haze with one RDP in the encoder and two RDPs in the decoder, with a much simpler architecture and far fewer network parameters. Further, our method even outperformed some recent methods, e.g., RIGAN, GPE, AMEF and GFN. The corresponding dehazing results of two samples from the SOTS dataset are displayed in Figure 7. As can be seen, our results (see Figure 7g) were closest to the ground truth (see Figure 7h), while the results by other methods were either over dehazed or under dehazed (see Figure 7b-f).

Figure 6. Dehazing results of three samples in the synthetic dataset. (a) DCP [6]; (b) NLP [7]; (c) CAP [11]; (d) AODN [19]; (e) Dehazenet [17]; (f) DCPDN [24]; (g) ours.

Table 4. Quantitative comparisons on the indoor images of the SOTS in terms of PSNR/SSIM. Red, green and blue indicate the best, second best and third best performance, respectively. GPE, genetic programming estimator; PWAB, pixel-wise alpha blending method; AMEF, artificial multi-exposure image fusion; GFN, gated fusion network; GridDN, GridDehazeNet; EPDN, enhanced pix2pix dehazing network; RIGAN, generative adversarial networks with residual inception module.

Figure 7. Dehazing results of two samples from the SOTS dataset. (a) Input; (b) GPE [15]; (c) PWAB [13]; (d) AMEF [14]; (e) GFN [22]; (f) EPDN [20]; (g) ours; (h) ground truths.
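For reference, the PSNR values used throughout these comparisons follow the standard definition, PSNR = 10 log10(MAX^2 / MSE). The sketch below is a minimal single-channel version for intensities in [0, 1]; the function name `psnr` and the toy inputs are illustrative, and the evaluation scripts actually used for the tables are not described in the text.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two 2D lists of equal shape."""
    n = len(pred) * len(pred[0])
    mse = sum(
        (p - t) ** 2 for row_p, row_t in zip(pred, target) for p, t in zip(row_p, row_t)
    ) / n
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. about 20 dB.
restored = [[0.5, 0.5], [0.5, 0.5]]
ground_truth = [[0.6, 0.6], [0.6, 0.6]]
print(round(psnr(restored, ground_truth), 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of 4.25 dB corresponds to an MSE that is lower by a factor of 10^0.425, roughly 2.66.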

Testing on Real Images
To verify the generalization ability of our model, we further tested RDPN on challenging images provided by previous methods [22,24]. Visual dehazing results produced by RDPN and six state-of-the-art methods are displayed in Figure 8.
The first, second and fourth rows of Figure 8a show three original real-world images. Figure 8b-g display the corresponding results of DCP [6], NLP [7], CAP [11], AODN [19], Dehazenet [17] and DCPDN [24], respectively. Our results are given in Figure 8h. The magnified details of two images using different methods are shown in the third and fifth rows. The results of DCP [6], NLP [7] and CAP [11] in Figure 8b-d were over dehazed, as indicated by their colour distortion and blocking artefacts. The results of AODN [19] and Dehazenet [17] displayed in Figure 8e,f still had some remaining haze, and some details were missing in the magnified regions shown in the third and fifth rows. DCPDN [24] could produce clearer images with strong contrast (see Figure 8g), but part of the buildings in the first, second and third rows was not recovered. In particular, the tops of the buildings in the magnified inset shown in the third row of Figure 8g were missing. Furthermore, the magnified region in the fifth row was over dehazed. In contrast, our method could remove haze with visually appealing results in all cases.

Analysis and Discussion
We further analyse and discuss the validity of our RDPN with different network architectures and parameters. Besides, we also discuss the runtime performance and limitations of RDPN.

Figure 8. Dehazing results on real-world images downloaded from the Internet. (a) Input; (b) DCP [6]; (c) NLP [7]; (d) CAP [11]; (e) AODN [19]; (f) Dehazenet [17]; (g) DCPDN [24]; (h) ours. The third row shows the magnified view of the highlighted windows in the second row. The fifth row shows the magnified view of the highlighted windows in the fourth row.

Different RDP Number
Since the proposed neural network was constructed from RDPs, we first investigated the effect of the number of RDPs in the encoder and decoder. To determine the effect of the RDPN's depth, we trained the network with three different settings: one RDP in the encoder and two RDPs in the decoder (denoted as D = 1), two RDPs in the encoder and three RDPs in the decoder (denoted as D = 2) and three RDPs in the encoder and four RDPs in the decoder (denoted as D = 3). The quantitative comparisons of these three settings are shown in Table 6. The PSNR and SSIM values of D = 1 were higher than those of D = 2 and D = 3, which demonstrates that, contrary to common belief, stacking more RDPs in the encoder and the decoder does not lead to better performance. Therefore, we used D = 1 as our basic network setting. Based on the RDPN with D = 1, we also investigated the effect of the number of dense convolution layers C and the growth rate G in the DIF. Because the default settings of C and G in the DIF of RDPN were six and 32, respectively, the settings C = 5, C = 7, G = 16 and G = 64 were adopted for further testing. From Table 6, we can see that they produced suboptimal results compared with those of C = 6 and G = 32. In general, RDPN was quite robust to different configurations and parameter settings, as the SSIM results in Table 6 ranged between 0.9708 and 0.9752 for both the indoor testing dataset and the outdoor testing dataset. In particular, RDPN with D = 1, C = 6 and G = 32 attained the best performance among the evaluated configurations.

Analysis of the RDP Structure
The proposed RDP for image dehazing is a significant contribution of this paper. To verify its effectiveness, we compared the RDPN with several variants of RDPs. RDP w/o R indicates that the RDP module does not contain the residual learning in the MPF. DRP represents that the residual learning in the original RDP is moved from the end of MPF to the end of DIF. For fair comparison, these network structures (RDPN using RDP w/o R and RDPN using DRP) were the same as the proposed RDPN, except for using different building modules, e.g., RDP w/o R and DRP. As reported in Table 7, the RDPN with the proposed RDP outperformed other models on all datasets, which demonstrated that the design of RDP could take advantage of the dense connection, multi-scale pyramid fusion and residual learning in the best combination.

Different RDP Placement
In our work, we built RDPN with the proposed RDP so that the contextual and structural information from all the layers could be used to obtain more robust features. A question worth asking is how the RDP improves the performance of the model. To investigate the effects of using RDP in different layers of the RDPN, we further compared three variant models, namely, RN, RN w/o MPPL and RDPN-decoder. RN denotes the model with all RDPs in RDPN replaced with dense blocks (the DIF in the RDP), and RN w/o MPPL denotes the RN model without the MPPL; they correspond to the schematics in Figure 3b and Figure 3a, respectively. RDPN-decoder denotes the RDPN model in which the RDP in the encoder is replaced with a dense block (the DIF in the RDP), as in the schematic shown in Figure 3c. The results reported in Table 7 demonstrate that the performances of these models were inferior to that of the proposed RDPN. This means that the contextual and structural information collected from the RDPs in all the layers contributed to image dehazing. Besides, the performance difference between RDPN-decoder and RDPN was obvious. With the help of the RDP inserted in the encoder of RDPN, the output of the encoder could capture more global structural information in the encoding stage and could generate better results with higher PSNR and SSIM values.

Effectiveness of SFEL
To demonstrate the effectiveness of SFEL in the proposed RDPN, we removed the SFEL from the RDPN and set the stride of the convolution behind the RDP to four to keep a symmetrical feature size in the encoder and decoder. The result of the corresponding model, named RDPN w/o SFEL, is shown in the second row of Table 7. As can be seen, removing SFEL degraded the performance of the original RDPN. This indicates that SFEL was an important transition layer, which provided rich global information at half of the input image size for the deeper stages that collect high-level information. Removing SFEL led directly to the loss of useful global information, because the input feature was shrunk by a factor of four in a single step.
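The effect of the stride change on feature size can be illustrated with the standard output-size formula for a convolution, out = floor((in + 2p - k) / s) + 1. The following is a minimal sketch, not the authors' code: the 512-pixel input, the 3x3 kernel, the padding of 1, and the assumption that SFEL and the following convolution each use a stride of 2 are illustrative choices that are not stated in this section.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

input_size = 512  # hypothetical input width/height

# Assumed path with SFEL: two stride-2 steps, with a half-size map in between.
after_sfel = conv_output_size(input_size, stride=2)
after_next_conv = conv_output_size(after_sfel, stride=2)

# Ablated path (RDPN w/o SFEL): one stride-4 convolution, no half-size map.
without_sfel = conv_output_size(input_size, stride=4)

print(after_sfel, after_next_conv, without_sfel)
```

Under these assumptions both paths end at the same quarter-size feature map, but only the first exposes an intermediate half-size feature map, which is the representation the text credits to SFEL.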

The Impact of Regulation Coefficients in the Loss Function
In this work, we adopted the combined loss function of [24] to learn the parameters of the proposed RDPN. That is, the feature edge loss L_F and the gradient loss L_G were combined with the standard L_2 loss. In Equation (10), L_F and L_G are multiplied by the corresponding weighting coefficients λ and β. To verify the robustness of this loss function, Table 8 lists the results for different settings of λ and β. As can be seen, with λ = 2 and β = 0.8, the network obtained the highest PSNR and SSIM values. The other settings lowered the performance only slightly. Hence, the combined loss function was robust to the choice of coefficients.

Because our network contains only three RDPs and has significantly fewer parameters than other heavy-weight dehazing models, two questions arise: how fast can the proposed method dehaze an image, and how many fewer parameters does it contain than other methods? In this section, we compare the average run time and number of parameters of the RDPN with those of several state-of-the-art methods on a computer (Intel Xeon(R) CPU E5-2637 3.5 GHz). The results are provided in Table 9. In addition, the accuracy of the different methods, i.e., the average PSNR obtained by testing them on 500 outdoor images of the public SOTS dataset, is also given in Table 9 for a comprehensive comparison. We observed that our method ranked second in run time and third in the number of parameters, behind only PIDN and AODN. PIDN had fewer parameters, which could be attributed to its use of a recurrent structure to share the parameters in the network. AODN had far fewer parameters and a much shorter run time, because AODN only used five convolutional layers to build the network. However, the design of AODN also led to poor accuracy.
From the last row of Table 9, we can see that our method outperformed AODN and most dehazing methods, e.g., DCPDN, PWAB, AMEF and GPE, on the outdoor images of the SOTS dataset in terms of PSNR by at least 4.25 dB, second only to PIDN [20]. Hence, our method was much more efficient in comparison (the run time was reduced to 0.021 s) and could produce better results with fewer parameters (the number of parameters was 1,534,799, only 6% of that used by DCPDN) than these state-of-the-art methods.

Table 9. Comparison of the proposed RDPN with other state-of-the-art methods in terms of run time, number of parameters and accuracy. PIDN was re-trained with our synthetic hazy images. Average PSNR values and run time are reported on the outdoor images of the public SOTS dataset. Red, green and blue indicate the best, second best and third best results, respectively.
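For readers who wish to reproduce this kind of comparison, the two reported quantities can be measured as sketched below. This is a minimal sketch rather than the evaluation code used in the paper: `dehaze` is a hypothetical callable standing in for any dehazing model, images are assumed to be floating-point arrays scaled to [0, 1], and PSNR is the standard definition 10 log10(MAX^2 / MSE).

```python
import time
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

def evaluate(dehaze, pairs):
    """Return (average PSNR in dB, average run time in s) over (hazy, clear) pairs."""
    scores, times = [], []
    for hazy, clear in pairs:
        start = time.perf_counter()
        restored = dehaze(hazy)  # only the model call is timed
        times.append(time.perf_counter() - start)
        scores.append(psnr(clear, restored))
    return float(np.mean(scores)), float(np.mean(times))

# Toy usage with synthetic arrays and an identity "model" (not real data).
clear = np.full((8, 8, 3), 0.1)
hazy = np.zeros((8, 8, 3))
avg_psnr, avg_time = evaluate(lambda image: image, [(hazy, clear)] * 3)
print(avg_psnr, avg_time)
```

Wall-clock timings depend on the hardware and on image size, so run times measured this way are only comparable when all methods are run on the same machine and the same images.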

Limitations
The outdoor training images were taken during the daytime and synthesized with white fog. Therefore, the model did not generalize to images taken in the evening or at night with strong grey smog. Figure 9 shows that the RDPN was not able to produce a clear image (see Figure 9a) for a night-time image (see Figure 9b). This was probably because the training dataset did not contain similar images, so the RDPN model failed to learn the corresponding mapping function. We plan to address this problem by adding a more comprehensive set of outdoor hazy images, taken at different times of day, to the training dataset.

Conclusions
In this paper, we presented a novel end-to-end residual dense pyramid network (RDPN) based on the encoder-decoder architecture for image dehazing, where the proposed residual dense pyramid (RDP) served as the basic building module. The RDP used multiscale pyramid fusion (MPF) to learn spatial information, leading to effective information fusion. With one RDP in the encoder and two RDPs in the decoder, the proposed framework also adopted a pyramid pooling module to capture global context information at different scales before the final mapping. Extensive experiments showed that the average PSNR of the proposed RDPN was 26.82 dB, which outperformed most state-of-the-art methods, e.g., the recent densely connected pyramid dehazing network, all-in-one dehazing network, enhanced pix2pix dehazing network, pixel-based alpha blending, artificial multi-exposure image fusions and genetic programming estimator, by at least 4.25 dB. In addition, the run time of the RDPN was reduced to 0.021 s, and the number of parameters in the network was 1,534,799, which was only 6% of that used by the densely connected pyramid dehazing network. Hence, the RDPN achieved superior performance over these state-of-the-art methods with a significantly smaller model size and far fewer network parameters.