Image Deblurring Using Multi-Stream Bottom-Top-Bottom Attention Network and Global Information-Based Fusion and Reconstruction Network

Image deblurring has been a challenging ill-posed problem in computer vision. Gaussian blur is a common model for image and signal degradation. Deep learning-based deblurring methods have attracted much attention due to their advantages over the traditional methods relying on hand-designed features. However, the existing deep learning-based deblurring techniques still cannot perform well in restoring the fine details and reconstructing the sharp edges. To address this issue, we have designed an effective end-to-end deep learning-based non-blind image deblurring algorithm. In the proposed method, a multi-stream bottom-top-bottom attention network (MBANet) with the encoder-to-decoder structure is designed to integrate low-level cues and high-level semantic information, which can facilitate extracting image features more effectively and improve the computational efficiency of the network. Moreover, the MBANet adopts a coarse-to-fine multi-scale strategy to process the input images to improve image deblurring performance. Furthermore, the global information-based fusion and reconstruction network is proposed to fuse multi-scale output maps to improve the global spatial information and recurrently refine the output deblurred image. The experiments were done on the public GoPro dataset and the realistic and dynamic scenes (REDS) dataset to evaluate the effectiveness and robustness of the proposed method. The experimental results show that the proposed method generally outperforms some traditional deblurring methods and deep learning-based state-of-the-art deblurring methods such as scale-recurrent network (SRN) and denoising prior driven deep neural network (DPDNN) in terms of quantitative indexes such as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), as well as human visual assessment.


Introduction
Image blurring is one of the major reasons for image quality degradation. It poses a great challenge for understanding and analyzing the high-frequency components in the images [1]. Therefore, deblurring has been a hot research field in image restoration [2][3][4][5][6]. Image deblurring is the process of inferring the latent sharp images in the absence of degradation model information. The blur types can be roughly divided into two categories. The first one is motion blur. For example, camera shake during exposure will blur the captured image. The second one is defocus blur produced by such factors as atmospheric turbulence and aberrations in the optical system. In most situations of practical interest,

1. We proposed a novel MBANet by combining the advantages of encoder-decoder structure and attention mechanisms to extract multi-scale and multi-level features effectively, thereby providing a solid foundation for recovering the fine details and reconstructing the sharp edges from the blurry images.
2. The proposed global information-based fusion network can merge the channel-wise and spatial information encoded in the bottom-top data stream at each scale effectively. Moreover, the proposed reconstruction network based on the fused global information can gradually refine the outputs of the GIFNet to finally produce the satisfactory deblurred images.
3. A new loss function was designed by combining the perceptual loss with the mean squared error to accelerate the gradient descent of the network, thereby leading to effective network training.
The remainder of the paper is organized as follows. In Section 2, we present the details of MBANet and the GIFRNet. In Section 3, the experimental results are provided. Finally, conclusions and future research directions are given in Section 4.

The Structure of the Proposed Networks
The architecture of the proposed networks is illustrated in Figure 1. As shown in Figure 1, the multi-scale images generated by the resizing operation are first fed into the MBANet, which combines a modified encoder-decoder ResBlock network with an attention mechanism. Then, the backward and forward information is integrated through skip connections. Based on the global information, the squeeze-and-excitation network (SENet) [33] is introduced here to adaptively recalibrate the channel-wise features, highlight the important features in the encoder, and pass them to the decoder. Following the decoder, the resizing operation is applied again to ensure that the generated multi-scale maps have the same resolution as the original input image. Finally, the GIFRNet is utilized to integrate the global feature information at each scale and refine the output deblurred image.

Figure 1. The flowchart of the proposed method. Each colored box is considered to be a feature block, and the arrows between blocks represent the information stream.

MBANet
Well-established multi-scale networks follow two main types of structure. The first processes images at different scales separately, which leaves the information shared between different scales unused. The second cascades the scales and shares the weights among them. Integrating features by cascading makes network training easier and more stable, but its effectiveness greatly depends on the accuracy at the first scale. In this work, we proposed a parallel bottom-top-bottom network structure and introduced spatial and channel attention mechanisms. As opposed to other multi-scale methods, our method uses multi-scale images as the inputs of the network, so that the errors introduced at the first scale are not transmitted to the other scales. To reduce training difficulty and increase stability, the network weights are shared across different scales.
The proposed end-to-end MBANet can be viewed as a regression network. From the perspective of network structure, there are several advantages. Firstly, the network is deep enough to generate a large receptive field for extracting different levels of features. Secondly, the combination of spatial and channel-wise information can facilitate the utilization of the global information to highlight the important features. Finally, the network combines the low-level cues and the high-level semantic information to improve the fineness of the deblurred image.

Structure of Bottom-Top-Bottom Attention Network
We chose the encoder-decoder model as the basic structure of the bottom-top-bottom [34] attention network and modified it to realize accurate feature extraction and combination. The encoder-decoder network [35] has a symmetric structure in which the input images are progressively transformed into feature maps with more channels and smaller sizes, and then the feature maps are transformed back to the resolution of the original input images. The skip connections between corresponding feature maps combine the forward information, i.e., the extracted features in the encoder. The encoder is implemented using a series of convolutional (Conv) layers, and the decoder contains several deconvolutional (Deconv) layers to further increase the network depth. However, the direct utilization of the existing encoder-decoder structure cannot ensure satisfactory performance. The reason is that deblurring tasks require more convolutional layers in the encoder/decoder blocks, which makes the network converge slowly. Moreover, it is difficult to obtain the spatial information for reconstruction from the small middle feature maps. Hence, we transformed the encoder-decoder model into an encoder-decoder ResBlock network, as shown in Figure 2, which acts as our bottom-top backbone structure. Each backbone structure comprises one convolutional layer with a stride of two followed by several ResBlocks. The number of kernels in the convolutional layers is doubled for each successive backbone structure. Each ResBlock includes two convolutional layers with the same number of kernels. The network structure of the decoder is symmetric to that of the encoder. Therefore, the resolution of the output images of the transformed encoder-decoder ResBlock network is the same as that of the original input images.
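The shape bookkeeping implied by this backbone can be illustrated with a small sketch. It only tracks feature-map sizes: each encoder stage halves the resolution with a stride-2 convolution and doubles the number of kernels, and the decoder mirrors it. The function name, the base width of 32 kernels, and the choice of three stages are assumptions made for illustration and are not taken from the paper.

```python
def encoder_decoder_shapes(height, width, base_channels=32, num_stages=3,
                           out_channels=3):
    """Track (name, channels, height, width) through a symmetric encoder-decoder.

    base_channels and num_stages are illustrative assumptions.
    """
    if height % (2 ** num_stages) or width % (2 ** num_stages):
        raise ValueError("height and width must be divisible by 2**num_stages")

    shapes = []
    channels, h, w = base_channels, height, width

    # Encoder: a stride-2 convolution halves H and W at every stage, and the
    # kernel count doubles from one stage to the next. The ResBlocks that
    # follow the convolution keep the shape unchanged.
    for stage in range(num_stages):
        if stage > 0:
            channels *= 2
        h, w = h // 2, w // 2
        shapes.append((f"enc{stage + 1}", channels, h, w))

    # Decoder: mirror of the encoder. Each deconvolution doubles H and W;
    # the last stage maps back to the image channels.
    for stage in range(num_stages):
        h, w = h * 2, w * 2
        is_last = stage == num_stages - 1
        channels = out_channels if is_last else channels // 2
        shapes.append((f"dec{stage + 1}", channels, h, w))

    return shapes
```

For a 256 × 256 training patch, the last entry returned is the decoder output at the input resolution, which matches the statement that the output resolution equals that of the original input.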
It is important to produce a large receptive field for extracting more feature information while keeping network training simple. The encoder and the decoder are connected by the SENet, which realizes the information exchange between the channels, enhances the useful features, and suppresses the features that are not important for the current task, thereby facilitating extraction of the key information in the feature maps at different channels with a reduction ratio of 64. The core operations of the SENet are squeeze and excitation. The squeeze operation encodes the spatial features on a channel into a global feature, thereby allowing information from the global receptive field to be used by different layers. The aggregation is followed by an excitation operation, which adopts a simple self-gating mechanism to produce a collection of per-channel modulation weights. The learned channel weights are multiplied by the original input features to generate the channel feature $X_c^n$. It is given by

$$Z_c^n = \frac{1}{W_{feature} \times H_{feature}} \sum_{i=1}^{W_{feature}} \sum_{j=1}^{H_{feature}} M_c^n(i, j),$$

$$S^n = \sigma\left(W_2\, \theta\left(W_1 Z^n\right)\right),$$

$$X_c^n = S_c^n \cdot M_c^n, \qquad (4)$$

where $Z_c^n$ represents the global spatial information of the n-th scale image at the c-th channel, and $Z^n$ collects $Z_c^n$ over all channels; $M_c^n$ is the feature map produced by the encoder; $W_{feature}$ and $H_{feature}$ are the width and height of $M_c^n$, respectively; $W_1 \in \mathbb{R}^{\frac{c}{r_{se}} \times c}$ and $W_2 \in \mathbb{R}^{c \times \frac{c}{r_{se}}}$ are the full connection operations with $r_{se}$ denoting the channel reduction ratio; $\theta$ means rectified linear unit (ReLU) activation [36]; $\sigma$ denotes the sigmoid activation function; and $S_c^n$ is the channel-wise weight, i.e., the c-th entry of $S^n$. From Equation (4), it can be seen that the channel-wise multiplication between $S_c^n$ and $M_c^n$ generates $X_c^n$.

Considering that Gaussian blur is sensitive to scale, we duplicated the designed bottom-top-bottom attention network in parallel. Specifically, the multi-scale versions of the input blurry images generated by the resizing operation are first encoded in the bottom-top stream. Then, for each input image, each scaled version $B_n$ is processed in the same way as the first scale, from the bottom layer to the top layer through a series of residual blocks, and a multi-stream output feature map $O_n$ (n = 1, 2, . . . , Ns, with Ns denoting the number of scales) with the same resolution as $B_n$ is produced. Finally, these maps are resized to the resolution of the original input image for the next network (see Figure 1).
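The squeeze, excitation, and scaling steps of the SENet described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `se_recalibrate` and the nested-list data layout are assumptions, and the bias terms of the fully connected layers are omitted.

```python
import math


def se_recalibrate(feature, w1, w2):
    """Channel recalibration in the squeeze-and-excitation style.

    feature: list of C channels, each an H x W list of rows.
    w1: (C / r_se) x C weight matrix, w2: C x (C / r_se) weight matrix.
    Returns the rescaled feature and the per-channel weights.
    """
    # Squeeze: global average pooling turns each channel into one scalar.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]

    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
         for row in w2]

    # Scale: multiply every value of channel c by its weight s[c].
    x = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, feature)]
    return x, s
```

With all weights set to zero the sigmoid returns 0.5 for every channel, so the output is simply the input halved; trained weights would instead emphasize some channels and suppress others.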

GIFRNet
The GIFRNet consists of two sub-networks: GIFNet and GIRNet. The GIFNet integrates the maps generated by the MBANet to produce the feature map $R_{final}$. Then, the GIRNet gradually refines $R_{final}$ to produce the final deblurred image $I_{final}$.

GIFNet
The detailed structure of the GIFNet is shown in Figure 3a. The dense spatial information of the source image is combined with the multi-stream output maps, which are produced by the multi-stream network with the multi-scale structures, to further increase the correlation of the concatenated maps. The original image B and the multi-stream output maps $O_1, O_2, \ldots, O_{Ns}$ are first concatenated into a feature map $F_0$ with 3Ns + 3 channels. Then we add the global context block (GCBlock) [37], which helps establish a connection between two pixels separated by a certain distance to capture the spatial information in the feature images, thereby facilitating the fusion of the multi-stream information with the original image. The architecture of the GCBlock is shown in Figure 3c. It includes the context modeling module and the transform module. The context modeling module groups the features of all positions together via weighted averaging to extract the global information. To reduce the computational complexity, the channel compression transform is adopted. Furthermore, the layer normalization (LayerNorm) normalizes the mean and variance of all summed inputs to the neurons in one layer to improve the training speed, accuracy, and robustness of the model. The output features are fed to a series of convolutional and ReLU layers. In the convolutional layers, the kernel size is set to 3 × 3, and the channel numbers are set to 64, 128, 64, and 3, respectively. A merged map $R_{final}$ with the same resolution as the original image is produced as the final output. In the formulation of the GIFNet, cat denotes the concatenation operator; $F_1$ denotes the feature information produced by the GCBlock with the channel compression ratio set to 2; $F_t$ denotes the feature image generated at the t-th (t = 2, 3, . . . , T) convolutional layer; and $W_t$ and $b_t$ represent the convolutional filter and bias of the t-th convolutional layer, respectively. The GIFNet fully utilizes the dense spatial information of the original image and nonlinearly fuses the multi-stream output maps. ...
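The GCBlock steps described above (context modeling by weighted averaging, channel compression, LayerNorm, and fusion with the input) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `gc_block` and the flattened data layout are hypothetical, the 1 × 1 convolutions are written as plain matrix products without biases, and LayerNorm is applied without learnable scale and shift.

```python
import math


def gc_block(x, w_k, w_v1, w_v2, eps=1e-5):
    """Simplified global context block.

    x: list of C channels, each a flat list of N = H * W positions.
    w_k: C attention weights, w_v1: (C / r) x C, w_v2: C x (C / r).
    """
    num_ch, num_pos = len(x), len(x[0])

    # Context modeling: one logit per position, softmax over all positions.
    logits = [sum(w_k[c] * x[c][p] for c in range(num_ch)) for p in range(num_pos)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Weighted average of the features over all positions (global context).
    context = [sum(attn[p] * x[c][p] for p in range(num_pos)) for c in range(num_ch)]

    # Transform: compress channels by r, LayerNorm, ReLU, restore channels.
    hidden = [sum(w * v for w, v in zip(row, context)) for row in w_v1]
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    hidden = [max(0.0, (h - mean) / math.sqrt(var + eps)) for h in hidden]
    delta = [sum(w * v for w, v in zip(row, hidden)) for row in w_v2]

    # Fusion: broadcast the per-channel context term to every position.
    return [[x[c][p] + delta[c] for p in range(num_pos)] for c in range(num_ch)]
```

Because the context term is a single vector added to every position, the block costs far less than pairwise attention between all positions, which is consistent with the stated goal of reducing computational complexity.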

In Figure 3, (C, H, W) denotes a feature map with C channels, height H, and width W; r denotes the channel compression ratio; ⊗ represents matrix multiplication; and ⊕ means broadcast element-wise addition.

GIRNet
Although the GIFNet can integrate the feature maps at different scales effectively and improve spatial correlation by introducing the source image, errors may be introduced into the restored image. To address this issue, we introduced the GIRNet to further refine the output deblurred image. Figure 3b shows the network structure of the GIRNet, which has the same structure as the GIFNet but with different parameters. At each iteration, we feed both the original image and the input map through the GIRNet to produce the refined map, which in turn serves as the input map at the next iteration. In the expression of the final restored image $I_{final}$, $R_{final}$ is the preceding integrated map, which serves as the first input map. The operator in this expression denotes a recursive composition function. $W_m$ and $b_m$ represent the convolutional filter and the bias of the GIRNet, respectively. The final restored image is produced by continuously correcting its previous errors until the last iteration is completed.
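The recursive refinement described above can be sketched as a simple loop. The names `recurrent_refine` and `refine_step` are hypothetical; `refine_step` stands in for one pass through the GIRNet. Reusing the same callable at every iteration, and therefore the same parameters, is an assumption that is consistent with the recursive description but is not confirmed by the text.

```python
def recurrent_refine(blurry, fused, refine_step, num_iters=3):
    """Repeatedly refine a fused map, feeding each output back as input.

    blurry: the original blurry image.
    fused: the fused map produced by the fusion network (first input map).
    refine_step: hypothetical callable (blurry, current_map) -> refined_map.
    num_iters: number of refinement iterations (Tr in the text).
    """
    current = fused
    for _ in range(num_iters):
        # Both the original image and the current map are fed in at each step.
        current = refine_step(blurry, current)
    return current
```

A usage example with a toy step function: `recurrent_refine([0.0], [8.0], lambda b, r: [0.5 * (bi + ri) for bi, ri in zip(b, r)])` applies the step three times in sequence.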

Loss
We used Euclidean loss and perceptual loss [38] as the total loss to measure the difference between the network output and the ground truth. Here, the Euclidean loss is the classic mean square error (MSE) loss. Unlike the pixel-level MSE loss, the perceptual loss utilizes high-dimensional features obtained from high-performance convolutional neural networks and can help restore more details for the deblurred images. In the proposed total loss, λ is the parameter to balance $L_{MSE}$ and $L_c$, and $\phi_{k,l}$ denotes the feature extractor for extracting the image features at the l-th convolutional layer (after activation) before the k-th pooling layer in the VGG16 network pre-trained on the ImageNet dataset. The features extracted by $\phi_{k,l}$ are used to measure the difference between the high-level semantic information of images $I_{final}$ and $I_o$. The Adam optimizer [39] is adopted to minimize the cost function.
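A minimal sketch of such a combined loss is given below. It assumes that λ scales the perceptual term and that both terms are mean squared differences; the exact normalization used in the paper is not recoverable from the text. The argument `feature_fn` is a hypothetical stand-in for the pre-trained VGG16 feature extractor, and images are represented as flat lists of values for simplicity.

```python
def mean_squared_error(a, b):
    """Mean of squared element-wise differences between two flat lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(output, target, feature_fn, lam=1.0):
    """Pixel-level MSE plus lam times a feature-space (perceptual) MSE.

    feature_fn is a hypothetical feature extractor; in the paper this role
    is played by a layer of a pre-trained VGG16 network.
    """
    pixel_term = mean_squared_error(output, target)
    perceptual_term = mean_squared_error(feature_fn(output), feature_fn(target))
    return pixel_term + lam * perceptual_term
```

The pixel term penalizes intensity errors directly, while the feature term penalizes differences in the structure that the extractor responds to, which is why the combination is expected to preserve more detail than MSE alone.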

Model Training
To train the proposed networks, we used the realistic and dynamic scenes (REDS) dataset [40], in which each image was of size 720 × 1280. It was recorded by the GoPro HERO6 Black camera at 120 fps. There were 300 sequences in total, and each one included 100 pairs of sharp and blurry frames. We randomly selected 1120 sharp images as the ground truth for training and synthesized the blurry images by convolving the sharp images with Gaussian blur kernels of size 17 × 17. The standard deviation of the blur kernels was varied from 1.6 to 2.4 in steps of 0.2 to generate 5 blur kernels. The blurry images were sampled and randomly cropped into 256 × 256 image patches as training images. The ground-truth patches were generated in the same way. In total, 1,120,000 patches were generated for training.
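The five blur kernels described above can be generated as in the following sketch, which builds normalized, isotropic 17 × 17 Gaussian kernels for standard deviations 1.6 to 2.4 in steps of 0.2. The function name `gaussian_kernel` is an assumption, and the way the kernels are applied to the images (for example, the border handling during convolution) is not specified in the text and is therefore left out.

```python
import math


def gaussian_kernel(size=17, sigma=1.6):
    """Return a size x size isotropic Gaussian kernel that sums to one."""
    half = size // 2
    weights = [[math.exp(-(i * i + j * j) / (2.0 * sigma * sigma))
                for j in range(-half, half + 1)]
               for i in range(-half, half + 1)]
    total = sum(sum(row) for row in weights)
    # Normalize so that blurring preserves the mean image intensity.
    return [[w / total for w in row] for row in weights]


# Standard deviations 1.6, 1.8, 2.0, 2.2, 2.4 as described in the text.
sigmas = [round(1.6 + 0.2 * k, 1) for k in range(5)]
kernels = [gaussian_kernel(17, s) for s in sigmas]
```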
We implemented the proposed networks with TensorFlow 1.9.0. The proposed MBANet and GIFRNet were jointly trained on an Ubuntu 16.04 machine with an Intel I7-6950 CPU and 96 GB of RAM. An NVIDIA 1080Ti GPU with CUDA 10.0 was used for acceleration. For the proposed networks, the initial learning rate was set to 0.0001, and it was exponentially decayed to 1 × 10^−6 at 1000 epochs using power 0.3. At each iteration, we fixed the batch size to be 10. All trainable variables were initialized by Xavier [41]. The training took approximately 1.5 days.
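One plausible reading of this schedule is sketched below. Because the description mentions a power of 0.3, the sketch assumes the polynomial-decay form that TensorFlow 1.x provides as `tf.train.polynomial_decay`, evaluated per epoch; whether the authors decayed per epoch or per training step, and whether this is the exact schedule they used, is not stated in the text.

```python
def decayed_learning_rate(epoch, initial=1e-4, final=1e-6,
                          decay_epochs=1000, power=0.3):
    """Polynomial decay from `initial` to `final` over `decay_epochs` epochs.

    The per-epoch granularity and the polynomial form are assumptions.
    """
    # Clamp so the rate stays at `final` once the decay horizon is reached.
    progress = min(epoch, decay_epochs) / decay_epochs
    return (initial - final) * (1.0 - progress) ** power + final
```

With a power below one, the rate stays close to its initial value for most of training and drops sharply only near the end of the decay horizon.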

Experimental Results and Discussion
In this section, we carried out experiments on the chosen REDS dataset and GoPro dataset [42] to determine the key parameters in the proposed networks and evaluate their performance. The testing data were generated in the same way as the training data. For each type of dataset, 10 images were randomly selected as testing images. To evaluate the deblurring performance of the proposed networks, we adopted such indexes as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [43], which are defined as follows: where W and H are the width and height, respectively, of the deblurred image $I_{final}$. δ
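For reference, the standard definition of PSNR can be sketched as follows. The peak value of 255 assumes 8-bit images, which is not stated in the text, and images are represented as flat lists of pixel values for simplicity; the function name `psnr` is likewise an illustrative choice.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over all W x H pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

A higher PSNR indicates a smaller mean squared error relative to the dynamic range, so it rewards pixel-level fidelity, whereas SSIM compares local structure.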

Parameter Setting
The parameters in the proposed networks mainly include the number of scales (Ns) in the MBANet, the iterative times (Tr) in the GIRNet, the regularization parameter λ for balancing the content loss and the perceptual loss, and the parameters k and l of the feature layer related to the perceptual loss. In the loss function of the proposed networks, the content loss and the perceptual loss are of the same order of magnitude. Following the selection method of parameters in the loss function in other deep learning models for image deblurring, we set λ = 1 and k = l = 3. Here, we discuss how to determine the two parameters Ns and Tr.
Sensors 2020, 20, 3724

Number of Scales
As for the number of scales in the MBANet, too large an Ns value brings a heavy computational burden, while too small a value is disadvantageous for extracting the effective features from the input images. Here, we trained the proposed model with three different numbers of scales (two, three, and four) to analyze the contribution of Ns. The PSNR and SSIM of the three models on the REDS dataset and GoPro dataset are shown in Figures 4 and 5, respectively. Clearly, the three-scale model and the two-scale model could only perform well on the REDS dataset and GoPro dataset, respectively. However, the four-scale model could obtain better or competitive results on both the REDS dataset and GoPro dataset in terms of PSNR and SSIM. Therefore, we chose four scales as a compromise.

Iterative Times
For the iterative times Tr in the GIRNet, we found that too large a Tr greatly increased the training time of the proposed networks, while too small a Tr could not ensure satisfactory deblurring performance. To demonstrate the influence of Tr, we trained the proposed model using three different iterative times. The quantitative results are shown in Figures 6 and 7. From Figures 6 and 7, we can see that the proposed model with Tr = 2 performed the worst on the REDS dataset, and our model with Tr = 1 performed the worst on the GoPro dataset. By comparison, our model with Tr = 3 could achieve satisfactory performance on the two datasets. Based on the above analysis, we chose Tr = 3 in the proposed networks.

Comparison with Different Network Structures
To verify the effectiveness of the structure in the proposed networks, we removed the GCBlock from the GIFNet and GIRNet. The resultant networks were named the fusion network (FNet) and the reconstruction network (RNet), respectively. Here, we compare three models, i.e., MBANet with FNet (MBANet + FNet), MBANet with FNet and RNet (MBANet + FNet + RNet), and MBANet with GIFNet and GIRNet (MBANet + GIFNet + GIRNet). We again used ten images chosen from the REDS dataset as the sharp images for testing. The quantitative results of PSNR and SSIM for the three models are shown in Figure 8. It can be seen from Figure 8 that MBANet + GIFNet + GIRNet achieved the best performance among the three models, which indicates that the adoption of the attention mechanism and the reconstruction network structure improved the deblurring performance effectively.

Comparison with State-of-the-Art Deblurring Methods
To demonstrate the advantage of the proposed method, we compared it with five state-of-the-art image deblurring algorithms: three well-known model-based deblurring methods, namely EPLL [14], NCSR [13], and ECP [20], and two deep learning methods, namely SRN [21] and DPDNN [31]. For the compared algorithms, we first followed the parameter settings suggested by the authors and then tuned them to ensure the best results for the corresponding testing images based on the comprehensive consideration of PSNR and SSIM as well as subjective human vision. For fairness, unless noted otherwise, all experiments were conducted for the evaluated methods on the same dataset with the same parameter configuration.

Comparison on REDS Dataset
Experiments were done on the color images in the REDS dataset. Tables 1 and 2 list the PSNR and SSIM values of all evaluated methods. From Table 1, we can see that the deep learning methods outperformed the ECP, EPLL, and NCSR methods by providing significantly higher PSNR. Compared with the SRN and DPDNN methods, the proposed algorithm, on average, provided PSNR improvements by 0.42 dB and 0.27 dB, respectively. Table 2 shows that the SSIM values of the conventional model-based methods were lower than those of the deep learning methods. Meanwhile, the proposed method could provide slightly higher SSIM than the SRN and DPDNN methods. Figures 9-16 show the deblurred results and the corresponding difference images of all evaluated methods. Here, we used the red boxes to indicate where the proposed method was visually superior to other state-of-the-art methods. As shown in Figures 9 and 11, it can be seen that our proposed method could recover the thin lines in Figure 9 and the boundaries of stone steps in Figure 11 better than the other compared methods by providing more complete and sharper edges for them. It is shown in Figure 12 that the multi-scale deblurring approaches such as the proposed method and the SRN method gained an obvious advantage over other methods in recovering small-scale details such as the vertical thin iron rod marked with one red box. For example, the DPDNN method damaged the thin iron rod so seriously that it was almost invisible in the deblurred image while it was preserved significantly better by the SRN method and the proposed method. Furthermore, the proposed method outperformed the SRN method by providing a sharper restored result for the complex structures marked with another red box. As seen in Figure 14, the textures were so complicated and densely arranged that their restoration was very challenging. Clearly, the other compared methods led to the loss of textures in the deblurred images to some extent. 
However, the proposed method could still recover the textures effectively. Moreover, Figure 15 shows that in the case of large deviations of blur kernels, our method could preserve such image details as the number on the car more effectively than other evaluated methods. Furthermore, from the difference images between the original image and the deblurred images shown in Figures 10, 13 and 16, we can see that the traditional methods performed worse than the deep learning-based methods because there was too

Comparison with State-of-the-Art Deblurring Methods
To demonstrate the advantage of the proposed method, we compared it with five state-of-the-art image deblurring algorithms: three well-known model-based methods, namely EPLL [14], NCSR [13], and ECP [20], and two deep learning methods, namely SRN [21] and DPDNN [31]. For the compared algorithms, we first followed the parameter settings suggested by the authors and then tuned them to obtain the best results on the corresponding test images, considering PSNR, SSIM, and subjective visual quality together. For fairness, unless noted otherwise, all evaluated methods were tested on the same dataset with the same parameter configuration.
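For reference, the PSNR index used throughout this comparison can be computed as in the following minimal sketch. It assumes 8-bit images (peak value 255) stored as flat pixel sequences; the function name `psnr` and the toy pixel values are illustrative assumptions and are not taken from our implementation or datasets.

```python
import math


def psnr(reference, restored, peak=255.0):
    """PSNR in dB between two equally sized images given as flat pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and contain the same number of pixels")
    # Mean squared error between the sharp reference and the restored image.
    mse = sum((r - d) ** 2 for r, d in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy data (hypothetical values, for illustration only).
sharp = [52, 55, 61, 59, 79, 61, 76, 61]
deblurred = [54, 55, 60, 59, 77, 62, 76, 60]
print(f"PSNR = {psnr(sharp, deblurred):.2f} dB")
```

A higher PSNR means a smaller mean squared error relative to the sharp image, so a gain of a few tenths of a dB corresponds to a modest but measurable reduction in pixel-wise error.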

Comparison on REDS Dataset
Experiments were done on the color images in the REDS dataset. Tables 1 and 2 list the PSNR and SSIM values of all evaluated methods. Table 1 shows that the deep learning methods outperformed the ECP, EPLL, and NCSR methods by providing significantly higher PSNR. Compared with the SRN and DPDNN methods, the proposed algorithm improved PSNR by 0.42 dB and 0.27 dB on average, respectively. Table 2 shows that the SSIM values of the conventional model-based methods were lower than those of the deep learning methods, and that the proposed method provided slightly higher SSIM than the SRN and DPDNN methods.
Figures 9-16 show the deblurred results and the corresponding difference images of all evaluated methods. The red boxes indicate where the proposed method was visually superior to the other state-of-the-art methods. As shown in Figures 9 and 11, the proposed method recovered the thin lines in Figure 9 and the boundaries of the stone steps in Figure 11 better than the other compared methods, producing more complete and sharper edges. Figure 12 shows that the multi-scale deblurring approaches such as the proposed method and the SRN method had an obvious advantage over the other methods in recovering small-scale details such as the thin vertical iron rod marked with one red box. For example, the DPDNN method damaged the thin iron rod so seriously that it was almost invisible in the deblurred image, whereas it was preserved significantly better by the SRN method and the proposed method. Furthermore, the proposed method outperformed the SRN method by providing a sharper restored result for the complex structures marked with another red box. In Figure 14, the textures are so complicated and densely arranged that their restoration is very challenging. The other compared methods lost textures in the deblurred images to some extent, whereas the proposed method could still recover them effectively. Moreover, Figure 15 shows that for blur kernels with large standard deviations, our method preserved image details such as the number on the car more effectively than the other evaluated methods. Finally, the difference images between the original image and the deblurred images in Figures 10, 13 and 16 show that the traditional methods performed worse than the deep learning-based methods because too much structural information remained in their difference images. Compared with the SRN and DPDNN methods, the proposed method left less obvious structural information in its difference images, which further supports its superiority.
To further verify the robustness of the proposed method, we selected four levels of Gaussian blur kernels with standard deviations ranging from 1.7 to 2.3 in steps of 0.2. The images blurred with these kernels were not used for model training. The PSNR and SSIM values of all evaluated methods are listed in Tables 3 and 4. The deep learning-based methods still performed better than the traditional methods. Compared with the SRN and DPDNN methods, the proposed method provided slightly higher SSIM values and improved PSNR by 0.52 dB and 0.6 dB on average, respectively. This comparison demonstrates the good robustness of the proposed method.
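The four blur levels used in this robustness test can be generated as in the sketch below. It builds normalized 2-D Gaussian kernels for standard deviations 1.7, 1.9, 2.1 and 2.3. The kernel support (truncation at about three standard deviations) and the helper name `gaussian_kernel` are assumptions made for illustration; the kernel size is not specified in this section.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Return a normalized (2r+1) x (2r+1) Gaussian kernel as a list of rows."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if radius is None:
        # Assumption: truncate the kernel at about 3 standard deviations.
        radius = math.ceil(3 * sigma)
    # 1-D Gaussian samples, normalized to sum to 1.
    one_d = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(one_d)
    one_d = [v / total for v in one_d]
    # The 2-D Gaussian is separable, so the kernel is the outer product of the 1-D profile.
    return [[a * b for b in one_d] for a in one_d]


# Four levels from 1.7 to 2.3 in steps of 0.2.
test_sigmas = [round(1.7 + 0.2 * i, 1) for i in range(4)]
for sigma in test_sigmas:
    kernel = gaussian_kernel(sigma)
    print(f"sigma={sigma}: {len(kernel)}x{len(kernel)} kernel, sum={sum(map(sum, kernel)):.6f}")
```

Convolving a sharp image with each kernel yields one blurred test set per level; because the kernels sum to one, the blur does not change the mean image brightness.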

Comparison with State-of-the-Art Methods on the GoPro Dataset
To further evaluate the performance of the proposed method, we compared it with the other deblurring methods on the GoPro dataset, which is different from the REDS dataset. The quantitative results computed on 10 images selected from this dataset are listed in Tables 5 and 6. The results show that the deep learning-based algorithms still performed better than the traditional algorithms in PSNR and SSIM. The SRN performed worse than the DPDNN and the proposed method. Although the proposed method provided slightly lower PSNR and SSIM than the DPDNN when the standard deviation was 2.4, it produced higher average PSNR and SSIM than the latter. Figure 17 shows the visual comparison of the deblurred images for the three deep learning-based deblurring methods on the GoPro dataset, where the standard deviation of the Gaussian blur kernel in the test image was 2.2. From Figure 17, it can be seen that the SRN and the DPDNN produced unwanted artefacts, to different extents, around the horizontal lines in the red box. By comparison, our method generated a deblurred result more similar to the sharp image without introducing artefacts, owing to the combination of the features from the original image with the global context information. The visual comparison thus demonstrates the superiority of the proposed method over the other deep learning-based methods in recovering image details.
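As a reminder of what the SSIM index in Tables 5 and 6 measures, the following sketch evaluates the SSIM formula over a single window covering the whole image. This is a simplification stated as an assumption: SSIM is commonly computed over local sliding windows and then averaged, which this sketch does not do. The constants follow the usual choice K1 = 0.01 and K2 = 0.03 with a peak value of 255, and the function name `global_ssim` and the toy data are hypothetical.

```python
def global_ssim(x, y, peak=255.0):
    """Single-window SSIM between two equally sized images given as flat pixel sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("images must have the same size and at least two pixels")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased sample variances and covariance.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    # Stabilizing constants with the commonly used K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * peak) ** 2
    c2 = (0.03 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy data (hypothetical values, for illustration only).
sharp = [52, 55, 61, 59, 79, 61, 76, 61]
deblurred = [54, 55, 60, 59, 77, 62, 76, 60]
print(f"SSIM = {global_ssim(sharp, deblurred):.4f}")
```

The index equals 1 only when the two images are identical, and it decreases as the mean, contrast, or structure of the restored image departs from that of the sharp image.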

Discussion
In this work, we proposed a novel multi-stream bottom-top-bottom attention network as well as a global information-based fusion and reconstruction network for restoring Gaussian blurred images. The above experimental results show that the proposed method has an advantage over the other compared methods in terms of deblurring performance and robustness. However, this work has some limitations. Firstly, the proposed networks are only used for the deblurring of Gaussian blurred images. Future work will focus on extending our method to other kinds of blur such as motion blur and hybrid blur. Such an extension will involve modifying the network structure, for example by introducing dense blocks. Secondly, the training of the proposed networks is very time-consuming. The training time could be shortened by simplifying the network structure. Finally, the loss function we used is relatively simple, so there may be room for further improvement of the deblurred results. An effective structure-preserving loss term could be explored and added to address this problem.

Conclusions
In this paper, we proposed a novel multi-stream bottom-top-bottom attention network as well as a global information-based fusion and reconstruction network for restoring Gaussian blurred images. Distinctively, the proposed network combines high-level semantics with low-level image features and channel-wise information to generate the deblurred images. In addition, the multi-stream mechanism takes full advantage of multi-scale information, and sharing the parameters across streams makes this structure easier to train. Additionally, the reconstruction network can further refine the restored image by incorporating the global context information. Experimental results show that the proposed method can provide better deblurring performance than other state-of-the-art algorithms both quantitatively and qualitatively.