1. Introduction
The image Super-Resolution (SR) problem has been drawing the attention of Internet of Things (IoT) researchers and artificial intelligence (AI) companies for a decade. Due to the restrictions on capturing high-resolution images in image-based applications such as the IoT, SR techniques are essential for enhancing the visual quality of low-resolution images. An acquired low-resolution image (from image/video communications and edge IoT sensors) contains degradations such as noise and blur. A super-resolution algorithm is required to enlarge the low-resolution image and reconstruct a high-resolution image while reducing the negative effects of the degradation [1,2,3].
Single-Image Super-Resolution (SISR), which is considered an ill-posed problem, is the process of reconstructing a High-Resolution (HR) image from a Low-Resolution (LR) image and effectively improving the quality of captured images. Compared to conventional image-enhancement methods, deep-learning-based image enhancement is used in various computer vision applications such as haze visibility enhancement [
4], environment-aware imagery [
5] and video [
6] for IoT services. In recent years, learning-based algorithms have demonstrated impressive performance compared to conventional SR methods when learning the LR-to-HR mapping. In particular, Convolutional Neural Networks (CNNs) have been trained in a supervised manner to learn the abstract feature representation of an LR patch and the corresponding HR patch. Following this concept, Dong et al. [7] demonstrated a Super-Resolution Convolutional Neural Network (SRCNN) architecture that learns an end-to-end nonlinear mapping from an interpolated LR patch to an HR patch in a three-layer network. This CNN model significantly improved the SR results in terms of Peak Signal-to-Noise Ratio (PSNR) compared to conventional SR algorithms. However, the SRCNN suffered from poor perceptual quality, noise amplification effects, and weakness in reconstructing image detail and recovering high-frequency details (tiny edges and lines). To improve the deep-learning-based SR algorithm, researchers have proposed various network architectures and learning strategies, such as designing deeper networks, proposing different network topologies, modifying upsampling frameworks, and adopting attention mechanisms.
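For illustration, a minimal PyTorch sketch of this three-layer topology follows; the kernel sizes 9-1-5 and widths 64/32 correspond to the base setting reported for SRCNN, and this is a sketch rather than the authors' original implementation:

```python
# Minimal PyTorch sketch of the SRCNN topology described above: three
# convolution layers applied to a bicubically pre-upsampled LR image.
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # nonlinear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):  # x: bicubically upsampled LR image
        return self.body(x)
```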
Following the SRCNN concept, the Very Deep SR network (VDSR) [8] and the Deeply Recursive Convolutional Network (DRCN) [9] used pre-upsampling 20-layer network architectures and obtained superior performance over the previous model. Network architectures became deeper as the capability of CNN models and learning strategies improved. Inspired by the Residual Network (ResNet) [10], several effective SR architectures [11,12,13,14,15] have used the residual block strategy. Multi-scale Deep Super-Resolution (MDSR) [16] and Enhanced Deep Super-Resolution (EDSR) [16], proposed by Lim et al., are two modified versions of the residual block architecture with a post-upsampling framework that demonstrate significant improvements in reconstructing HR images. MDSR is a deep network with simplified residual blocks, while the EDSR architecture is considered a wide network. Although residual block architectures have improved accuracy and image quality compared to SRCNN, they suffer from limitations such as weakness in reconstructing small details and inaccurate structure reconstruction, due to the difficulty of learning the feature mapping in the post-upsampling approach.
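As an illustration of this simplified residual design, a minimal PyTorch sketch of an EDSR-style block follows; the feature width is illustrative, and the 0.1 residual scaling is the value reported for the wide EDSR variant:

```python
# Sketch of an EDSR-style simplified residual block: two 3x3 convolutions,
# no batch normalization, with the residual scaled before the skip addition.
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, n_feats=64, res_scale=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.conv2 = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.res_scale = res_scale

    def forward(self, x):
        res = self.conv2(self.act(self.conv1(x)))
        return x + res * self.res_scale  # residual skip connection
```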
The first SR model with a progressive upsampling framework is the Laplacian Pyramid Super-Resolution Network (LapSRN) [17]. In this SR model, the HR image is reconstructed from the sub-band residuals of a high-dimensional image in a progressive procedure. Although the progressive LapSRN reduced the learning difficulty, its network structure is ineffective in reconstructing high-frequency detail. This problem is caused by weaknesses in projecting the high-frequency information from the early layers.
To improve the SR model’s high-frequency information and representation ability, Zhang et al. [
18] proposed a very deep Residual Channel Attention Network (RCAN). Due to the robustness of the residual architecture in the SR field, the RCAN model used a deep (over 400 layers) Residual-in-Residual (RIR) block architecture with short- and long-skip connections to directly transfer the high-frequency details of the image to the final output. The Channel Attention (CA) mechanism is also used to re-scale the features across channels. However, this model's lack of global information, due to convolutions operating over local regions, leads to weakness in reconstructing sophisticated structures (holes and lattice textures) faithfully to the ground-truth image and limits further enhancement of perceptual quality [
19].
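For reference, a minimal sketch of such a CA block follows (a squeeze-and-excitation-style gate; the reduction ratio of 16 matches the value reported for RCAN, while the channel count is illustrative):

```python
# Sketch of a channel attention (CA) block: global average pooling produces a
# channel descriptor, and a two-layer 1x1-conv bottleneck with sigmoid gating
# rescales each feature channel.
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global statistics
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(self.pool(x))  # excite: rescale channels
```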
However, designing an image super-resolution algorithm with less network complexity, while maintaining the representation ability of the super-resolution model to reconstruct the tiny details of the output image, remains a challenge.
Moreover, most of the enhancements only consider the architecture of SR models, and the effect of the objective function has not gained much attention. In terms of the pixel-wise objective function in the SR algorithm, early models such as SRCNN [7], the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [20], the Memory Network for image restoration (MemNet) [13], and Deep Back-Projection Networks (DBPN) [21] used the $L_2$ loss function. Later SR models, such as LapSRN [17], EDSR [16], SRFBN [22], Meta-RDN [23], RCAN [18], and DRLN [24], used the $L_1$ loss function in their models and improved model convergence and representation performance.
Although the trend of network depth from the first model, SRCNN [7], to DRLN [24] reveals that increasing the depth (network complexity) eventually improves network performance, these SR models face some limitations.
(1) The super-resolved images have a weakness in recovering high-frequency information. This limitation leads to reconstructing inaccurate SR images and a lack of capability to produce sophisticated structures such as lines, edges, and tiny shapes. (2) Most existing CNN-based SR models employ post-upsampling operations. Although the post-upsampling framework performs the most computations in a low-dimensional space and reduces the computation complexity, it increases the learning difficulties of the SR model on larger scale factors. (3) Despite the importance of the loss function in reconstructing the SR image, the effect of CNN’s loss function on the SR issue has not received considerable attention. A progressive upsampling architecture with an effective loss function is essential.
A practical way to develop a robust SR network is to use a gradual upsampling framework that contains fewer network layers at low dimensions and then uses more convolution layers after the upsampling modules. Rather than using a very deep architecture with a post-upsampling framework that applies numerous convolution layers in low-dimensional space and then suddenly upsamples at the end of the model, we use the progressive upsampling framework. The progressive upsampling concept has been proven to improve the robustness of the SR model in recovering high-frequency information, reduce learning difficulties, and produce promising results under multiple degradations [25]. Specifically, we propose a progressive upsampling framework that stacks effective simplified Residual-in-Residual Dense Blocks (RRDB) in the low-dimensional space before the first upsampling module. Then, an RDB is used after each upsampling layer to explore the feature maps in the higher-dimensional space. Additionally, Depth-Wise (DW) bottleneck projections are used to ease the flow of the high-frequency details of the early CNN layers into our network's progressive upsampling modules.
Moreover, a fused loss function that combines the $L_1$ and Multi-Scale Structural Similarity Index Measure (MS-SSIM) objective functions is proposed as the most effective loss function for our SR model.
In summary, our main contributions are listed as follows:
(1) Propose a simplified RRDB structure with depth-wise bottleneck projections to map the discriminative high-frequency details to each stage of the upsampling layers of our network, which improves network convergence in the training phase and maintains the representation ability.
(2) Employ the progressive upsampling framework for our architecture to reduce the learning difficulties of the model in larger scale factors due to the progressively upsampling procedure.
(3) Introduce a novel fusion objective function by combining the $L_1$ and MS-SSIM loss functions to improve the representative capability of our model.
The remainder of this article is organized as follows.
Section 2 briefly reviews the relevant works related to the proposed method.
Section 3 details the proposed model’s architecture and fused loss function. The implementation details, datasets, and experimental results are demonstrated in
Section 4, which also discusses the relationships between state-of-the-art models and our own. Finally, the conclusion is given in
Section 5.
2. Related Works
The past decade has witnessed incredible development of SISR using the deep-learning approach. Among the various aspects of SR developments, the network architecture, upsampling framework, and learning objective function are considered the essential aspects of any SR structure that directly contributes to the SR model representation capability [
26].
As pioneering research, Dong et al. [7] used the CNN approach to introduce the SRCNN model, which could learn the mapping from an LR image to an HR image in an end-to-end learning-based approach. This first CNN-based model achieved superior performance in reconstructing HR images from LR images compared to previous conventional SR algorithms. The SRCNN model consists of a shallow three-layer convolutional network that uses a pre-upsampling framework. This means the LR image is first enlarged by bicubic interpolation and then fed to the network as the input image. This single-path SR model, without any skip connections, directly learns an end-to-end mapping between the original HR image and the bicubic-interpolated input. The SRCNN model uses the $L_2$ objective function. Despite the improvements of SRCNN over classic models, the resultant SR images are blurry and noisy because of the very shallow network architecture. In comprehensive follow-up research, Dong et al. [20] investigated the effect of network depth on the SR model. Although deeper networks improved model performance, the perceptual quality was still unsatisfactory and required further improvement through modifications to the network architecture, the upsampling framework, and the objective function.
Here, we review related SR research on network architectures, different residual connections, upsampling frameworks, and objective functions.
2.1. Network Architecture Review
Following SRCNN [7], the SR field has witnessed a variety of network architectures that aim to improve SR model performance and enhance the reconstructed image quality by designing deeper networks. However, deeper architectures make gradient vanishing an issue during training. Some strategies, such as residual architectures, are feasible solutions to this limitation. Inspired by the Residual Network (ResNet) concept [
10] for image recognition models, a vast range of SR network architectures used residual learning strategy [
9,
11,
13,
14,
16,
18,
21,
27,
28,
29,
30,
31,
32]. In contrast to single-path conventional network structures, the residual learning concept uses different variants of residual connections, such as residual projection, and short- and long-skip connections in network architecture to prevent gradient-vanishing problems and make it possible to design an incredibly deep network [
10,
31].
Since the topology of the residual blocks has a direct effect on the performance of the networks, several structures of the residual blocks, such as residual blocks in PixelCNN [
33], projected convolution (PConv) [
31], gated convolution blocks in advanced PixelCNN [
34], and PixelCNN++ [
35] were designed and explored in the deep-learning image-reconstruction models.
Van Oord et al. [34] stacked a three-layer convolutional structure consisting of $1\times1$, $3\times3$, and $1\times1$ convolutions with a nonlinear activation function. This residual approach (the bottleneck) improves computational efficiency but, because of its single-branch topology, had less effect on improving model convergence. PixelCNN [33] used two channel branches to improve the convergence of the model, applying sigmoid and hyperbolic tangent operations after the first convolution layer and then stacking them into a $1\times1$ convolution layer. The $1\times1$ convolution layer maintains computational efficiency. However, the hyperbolic tangent technique shows a limitation in mapping various features. Salimans et al. [35] used the idea of PixelCNN but replaced the hyperbolic tangent branch with identity mapping. Since the size of the features in this model is constant, it is not very effective at mapping the high-frequency details of the features that improve the reconstruction capability of an SR model. The residual projection connection can address this limitation.
Fan et al. proposed a progressive residual network, the Balanced Two-Stage Residual Network (BTSRN) [31]. The feature maps of the low-dimensional stage are upsampled and fed into the higher-dimensional stages through a variant of the residual block known as the residual projection connection. The residual projection of this SR model consists of a two-layer projected convolution (PConv) structure, including a $1\times1$ convolution layer for feature-map projection followed by a $3\times3$ convolution layer with a rectified linear activation function (ReLU). Despite the robustness of the model in reconstructing high-frequency detail in super-resolved images, its high computation cost is an important issue.
Although studies have shown that deeper CNN architectures lead to superior performance, the large number of network layers in very deep architectures is an important obstacle to network convergence during training. Besides the various training strategies for deep networks, reducing the network layers and the computational cost is a feasible solution. Based on the residual block architecture in SR models, Lim et al. proposed the EDSR model [16] by removing the unnecessary Batch Normalization (BN) layers of each residual block and the activation functions outside the residual blocks while expanding the depth of the network architecture. Although the EDSR model [16] reduced the network layers, it still has many network parameters and a high computation cost.
Xiao et al. [
36] proposed a lightweight model (LAINet) using a novel residual architecture known as the dual-path residual approach to increase the diversity of the reconstructed features. Specifically, the LAINet [
36] model is split into two branches (dual-path residual), and the extracted features are produced according to the different homogeneous functions of each branch. This lightweight model lacks reconstruction capability to recover the edges, due to limitations in combining the hierarchical features in its architecture.
Motivated by the DenseNet [
37] architecture in the image classification field, various SR models based on the dense connection concept have been proposed. The main advantage of dense architecture is combining hierarchical features along the entire network to produce richer feature representations. Tong et al. [
38] proposed the SRDenseNet model, which used dense connections between the SR network layers. These multiple skip connections improve the information flow from low-level features to the high-level layers before the final image reconstruction and avoid the vanishing-gradient problem. The Super-Resolution Feedback Network [
22] (SRFBN) also used the dense skip connection and feedback projection to enrich the perceptual quality of the SR result.
Following the simplified residual block concept of the EDSR [16] model and the dense architecture of SRDenseNet [38], Zhang et al. [27] designed an effective Residual Dense Network (RDN) by combining residual skip connections with dense connections, and proposed a deeper network architecture with a CA mechanism. The residual connections in RDN are categorized into global and local skip connections. In the local connection, the input of each block is forwarded to all RDB layers and added to the block output; the local fusion approach reduces the dimension with a $1\times1$ convolution in each RDB. The global connection combines multiple RDB outputs and, via a $1\times1$ convolution, performs global residual learning in the model. These local and global residual connections improved the results compared to SRDenseNet and helped stabilize the network during training. However, RDN has a convergence issue in training that reduces the capability of the model to recover acceptable SR images.
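A minimal PyTorch sketch of one RDB with local feature fusion, as described above, follows; the growth rate and number of layers are illustrative choices:

```python
# Sketch of a residual dense block (RDB): each 3x3 conv receives the
# concatenation of all previous feature maps (dense connections); a 1x1 conv
# performs local feature fusion before the local residual addition.
import torch
import torch.nn as nn

class RDB(nn.Module):
    def __init__(self, channels=64, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(n_layers)
        )
        self.fuse = nn.Conv2d(channels + n_layers * growth, channels, 1)  # 1x1 local fusion

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))  # local residual learning
```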
Although the simplified version of the residual block in these models partly decreased the network layers, training such very deep architectures containing millions of network parameters remains challenging. Due to the effectiveness of the Residual-in-Residual Dense Block (RRDB) structure in facilitating training and maintaining the perceptual quality of the reconstructed image, several successful models such as ESRGAN [
15], RCAN [
18], Deformable Non-local Network (DNLN) [
39], and Densely Residual Laplacian Network (DRLN) [
24] have employed this concept in their networks.
2.2. Upsampling Framework Review
In addition to the network structure, any SR model’s upsampling framework is highly important in generating reconstructed images [
26]. Although the existing SR architectures differ broadly, the upsampling framework can be categorized into three main types: pre-upsampling, post-upsampling, and progressive upsampling frameworks, as shown in
Figure 1.
The first and most straightforward upsampling framework is the pre-upsample that Dong et al. [
7,
20] first adopted with the SRCNN model. As shown in
Figure 1a, the LR image is first enlarged to a coarse HR image by bicubic interpolation, and a CNN is then applied to the coarse HR image to refine the reconstructed HR image. Although this predefined upsampling significantly reduces the learning difficulty, it increases the computational cost and often produces blurring and noise-amplification artifacts [
9,
11,
12,
13,
26].
The post-upsampling model is an alternative framework that addresses the pre-upsampling limitations, and FSRCNN [
20] is the pioneer of this framework. As shown in
Figure 1b, the LR image is fed into the network without increasing the resolution, and most computations are performed in the low-dimensional space. Then, at the tail of the network, the upsampling procedure is applied to the image. Using this upsampling approach improves computational efficiency while reducing the spatial complexity of the SR model. Although this framework has been considered one of the most mainstream upsampling strategies [
14,
15,
16,
24,
29,
38,
40], it increases the learning difficulty of the SR model. As a result of performing upsampling in only one stage at the end of the architecture, learning difficulties increase, especially for larger scaling factors (e.g., 4, 8). Due to the learning difficulties of this upsampling framework, some models such as RCAN [
18], SAN [
41], and DPAN [
42] use the channel attention and non-local attention mechanisms for re-scaling the channel-wise features in low-dimensional space to improve the learning ability of the model [
26].
To address the post-processing drawback, a progressive upsampling strategy was employed [
17,
31,
43,
44]. The topology of the progressive upsampling framework is demonstrated in
Figure 1c.
Specifically, this framework comprises several stages of upsampling based on a cascade approach and progressively reconstructs HR images to reduce the learning difficulties at larger scale factors. To transfer low-level features into the higher-level layers in the later stages of this framework, the projection connection, which is considered a variant of the residual connection [10], is used [
21,
22,
31,
45].
Laplacian pyramid SR network [
17] (LapSRN) and balanced two-stage residual network [
31] (BTSRN) are the earlier progressive approaches. Other SR models, such as MS-LapSRN [
43], Progressive SR (ProSR) [
44] and Progressive convolutional Super-resolution [
25] (PCSR), also used this framework and achieved higher performance compared to the other upsampling frameworks. In the progressive upsampling frameworks of both LapSRN and MS-LapSRN, the upsampled image of each stage is fed to the subsequent convolutional modules of the network, whereas the ProSR model maintains the main information stream and generates the intermediate upsampled outputs with individual convolution branches. Motivated by the progressive upsampling framework of LapSRN [
17], Xiao et al. [
25] proposed the PCSR model by applying a dense architecture under the progressive upsampling framework (multi-stage upsampling) to the blind SR problem and examined the results under multiple degradations such as blur and noise. PCSR [25] has proven that this framework produces promising results under multiple degradations due to the progressive estimation of the images' high-frequency details according to the outputs of the previous stages. In addition to these three main upsampling frameworks, Haris et al. in D-DBPN [21] and Li et al. in SRFBN [22] used upsampling and downsampling modules in their models, supported by deep back-projection connections. The idea behind this iterative up-and-down upsampling framework is to use the mutual dependency of LR and HR pairs to improve the learning ability of the SR model. However, the network complexity of this framework is an important obstacle to efficient execution time.
2.3. Objective Function Review
The loss function measures the pixel-wise difference (error) between the HR image and the corresponding reconstructed image, and consequently guides SR model optimization. The loss function of the SR model mainly includes the $L_1$ loss and the $L_2$ loss, which are known as the Mean Absolute Error (MAE) and Mean Squared Error (MSE), respectively [
46]. Due to the high correlation between the pixel-wise loss and the PSNR definition, the $L_2$ loss function became the most broadly used loss function in SR models such as SRCNN [7], DRCN [9], FSRCNN [20], SRResNet [15], MemNet [13], SRDenseNet [38], and DBPN [23], given by

$$\mathcal{L}_{L_2}(P) = \frac{1}{N} \sum_{p \in P} \left( I_{SR}(p) - I_{HR}(p) \right)^2,$$

where $p$ is the pixel index, $P$ denotes the patch, $N$ is the number of pixels in $P$, and $I_{SR}(p)$ and $I_{HR}(p)$ represent the values of the pixels in the SR patch and the corresponding HR patch, respectively. Since the $L_2$ loss penalizes large errors, it properly preserves the sharp edges of the image while showing more tolerance to minor errors, regardless of the underlying structure of the reconstructed image. Although the $L_2$ loss is considered the most broadly applied cost function in the SR field, it suffers from independent Gaussian noise, especially in the smooth regions of the image [26,46].
To improve on the $L_2$ limitations, EDSR [16], RDN [27], CARN [30], MSRN [47], RCAN [18], RNAN [48], Meta-RDN [23], SAN [41], SRFBN [22], and DRLN [24] used the $L_1$ loss function. The $L_1$ loss can be written as

$$\mathcal{L}_{L_1}(P) = \frac{1}{N} \sum_{p \in P} \left| I_{SR}(p) - I_{HR}(p) \right|,$$

where $p$ is the pixel index, $P$ denotes the patch, and $I_{SR}(p)$ and $I_{HR}(p)$ denote the values of the pixels in the SR patch and the corresponding HR patch, respectively. In contrast to the $L_2$ loss function, the $L_1$ loss does not over-penalize large errors and provides less independent noise and a smoother result than the $L_2$ loss. The weakness of the $L_1$ loss is a relatively slower convergence speed without the residual block. Ahn et al. [30] mitigated the slower convergence speed using a ResNet [10] architecture model.
Although the $L_1$ loss function demonstrates more visually pleasing results than the $L_2$ loss, its result is still not optimal. Besides the $L_1$ loss, Lai et al. [17] used a variant of the $L_1$ loss function known as the Charbonnier loss function, given by

$$\mathcal{L}_{Char}(P) = \frac{1}{N} \sum_{p \in P} \sqrt{ r(p)^2 + \varepsilon^2 }, \qquad r(p) = I_{SR}(p) - I_{HR}(p),$$

where $p$ represents the pixel index, $P$ shows the patch, $r$ denotes the residual between the SR and HR patches, $\varepsilon$ is a small constant (set to $10^{-3}$ in [17]), and $I_{SR}(p)$ and $I_{HR}(p)$ represent the values of the pixels in the SR patch and the corresponding HR patch, respectively. This variant of the $L_1$ loss (the Charbonnier loss) is also not optimal and shows degradation in the edge areas of the image [26].
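For reference, the three pixel-wise losses reviewed above can be written in a few lines of PyTorch; the Charbonnier $\varepsilon$ follows the LapSRN setting:

```python
# Sketch of the three pixel-wise losses discussed above, for tensors of
# identical shape; eps follows the Charbonnier setting of LapSRN.
import torch

def l2_loss(sr, hr):
    return ((sr - hr) ** 2).mean()            # MSE

def l1_loss(sr, hr):
    return (sr - hr).abs().mean()             # MAE

def charbonnier_loss(sr, hr, eps=1e-3):
    return torch.sqrt((sr - hr) ** 2 + eps ** 2).mean()
```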
Despite the importance of loss function in the learning process of neural networks, the loss function has attracted less attention in the SR field. Using a type of loss function in the SR model that correlates with the Human Visual System (HVS) improved the reconstructed quality of the image [
Some learning-based image-restoration models use MS-SSIM, such as underwater image restoration [50] and image dehazing, for which pixel-wise losses are not optimal; these models successfully increased their performance using this loss function. Since the MS-SSIM loss operates based on the HVS (luminance, contrast, and structure), it shows a noticeable improvement in the perceptual quality of the results of image-restoration models.
3. Proposed Method
In this research, we propose a novel objective function for a Single-Image Super-Resolution (SISR) model by fusing the MS-SSIM and $L_1$ losses. In addition, we adopt the progressive upsampling strategy for our network architecture. Moreover, based on the effectiveness of Depth-Wise convolution, we design a Depth-Wise Bottleneck Projection connection to bypass the high-frequency details of the early layers through the multi-step prediction network and improve the convergence of the model. In the following sections, we describe our model architecture and then explain the details of the proposed fusion objective function used to train our SR model as a combination of the $L_1$ loss and the Multi-Scale SSIM loss (MS-SSIM + $L_1$).
As demonstrated in
Figure 2, our Progressive Multi-Residual Fusion (PMRF) network consists of the residual dense block (RDB) architecture under a three-stage progressively upsampling framework. At the end of each stage, the image is enlarged by a scale factor of two with the upsampling module. The output of each stage contains high-frequency detail of that stage, and thus it can be propagated to subsequent stages. The progressive prediction based on this approach that uses the generated image of the previous stage produces more accurate SR results. As shown in
Figure 2, the multi-level residual dense topology called Residual-in-Residual Dense Blocks (RRDB) is employed in the first stage of our progressively upsampling framework. By contrast, the other stages use the RDB architecture and projection approach to transfer details of the early layer to the upsampling modules at the end of each stage.
3.1. Network Architecture
The residual network demonstrates outstanding performance in obtaining high-level features from low-level features, especially in SR problems [
11,
13,
15]. To improve the performance and computation cost of the residual network, a combination of the residual architecture [
16] and dense connections [
38] under the three-stage progressively upsampling framework is employed. The combination of the multi-level residual network and the dense connections architecture (RRDB) [
15,
39] is used in the first stage of our progressive framework model. The residual dense block architecture [
27] is applied in the second and third stages, while the Batch Normalization (BN) layers are removed [
15]. The BN layer uses the mean and variance for normalizing feature maps during the training and testing phases of the model. In the training phase, BN operates based on the mean and variance of every batch, while in the testing phase, BN performs based on the mean and the variance of the whole training dataset [
16]. The problems of unpleasant visual artifacts and inconsistent performance have appeared once the statistics of testing and training datasets are different [
51]. The simplified structure of the residual block was introduced to tackle unpleasant visual artifacts and maintain stable training of the SR model.
The simplified structure of the residual network demonstrated in
Figure 3 has been proven to increase the performance of computer vision tasks such as deblurring while dramatically decreasing the computational complexity and memory usage [
15,
16,
20]. The dense convolutional architecture (DenseNet) [
37] aims to connect each layer of the network to every other layer in a feed-forward manner to increase the information flow between layers in the network, as illustrated in
Figure 4.
This means that the feature maps of all previous layers are used as the inputs in every single layer. Subsequently, the yielded feature maps of each layer are used as inputs into all further layers. According to research evidence [
13,
21,
52], using more network layers and connections led to increasing information flow between layers and, consequently, a superior performance model. Combining the dense block approach and the simplified residual block creates the RDB architecture.
Figure 4 shows the multi-level residual dense block network (RRDB) used in the first stage of our upsampling framework.
$\beta$ is the scaling parameter of the residual architecture, in the range 0 to 1. The residual scaling parameter is multiplied by the residual output before it is added to the main block, as demonstrated in Figure 2 and Figure 4. According to previous studies [15,39], $\beta$ = 0.2 is the optimum value for the residual scaling parameter. The pixel shuffle [53] upsampling model is used as the upsampling module at each stage of our progressive upsampling network. Following the progressive strategy, using the dense block after the upsampling module in the second and third stages improves our model's reconstruction capability (multi-step prediction) more effectively due to the use of prior images across scales. However, increasing the size of the feature maps after upsampling at each stage unavoidably increases the processing time.
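Putting these pieces together, the following is a skeleton of the three-stage progressive pipeline in PyTorch, reusing the RDB sketch from Section 2.1; the block counts and widths are illustrative, and the DW bottleneck projections of Section 3.2 are omitted here for brevity:

```python
# Skeleton of the three-stage progressive upsampling pipeline: a stack of
# RRDBs in LR space, then a x2 pixel-shuffle module per stage, each followed
# by an RDB that refines features in the enlarged space.
# Assumes the RDB class sketched in Section 2.1.
import torch.nn as nn

class RRDB(nn.Module):
    # residual-in-residual wrapper: three RDBs with a scaled skip (beta = 0.2)
    def __init__(self, channels=64, beta=0.2):
        super().__init__()
        self.blocks = nn.Sequential(RDB(channels), RDB(channels), RDB(channels))
        self.beta = beta

    def forward(self, x):
        return x + self.beta * self.blocks(x)

def upsample_x2(channels):
    # sub-pixel (pixel shuffle) x2 upsampling module
    return nn.Sequential(nn.Conv2d(channels, channels * 4, 3, padding=1),
                         nn.PixelShuffle(2))

class ProgressiveSR(nn.Module):
    def __init__(self, channels=64, n_rrdb=8):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.stage1 = nn.Sequential(*[RRDB(channels) for _ in range(n_rrdb)])
        self.up1, self.stage2 = upsample_x2(channels), RDB(channels)
        self.up2, self.stage3 = upsample_x2(channels), RDB(channels)
        self.up3 = upsample_x2(channels)
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):                 # x: LR image -> x8 SR image
        f = self.head(x)
        f = self.stage2(self.up1(self.stage1(f)))  # x2, refined by an RDB
        f = self.stage3(self.up2(f))               # x4, refined by an RDB
        f = self.up3(f)                            # x8
        return self.tail(f)
```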
In addition, the Depth-Wise Bottleneck Projection approach, which conveys the high-frequency information of extracted features from the early layer into each upsampling stage, is explained in the next section.
3.2. Residual Bottleneck Projection
The residual block concept has been presented in many CNN-based image SR models [
28,
30,
47,
48,
53]. The residual concept prevents gradient vanishing in the training phase and makes it feasible to design deeper network architecture. The residual projection is considered to be a variant of a residual block, which changes the dimension of the features. In our architecture, the feature maps of the early layers at the low-dimensional stage are upsampled by the Bicubic interpolation method and fed into the higher-dimensional stages using the residual projection method. Multiple settings of the residual projection blocks are explored and demonstrated in
Figure 5, including Residual Projection, Bottleneck Projection, and Depth-Wise Bottleneck Projection.
The residual projection architecture [10] consists of two convolutions of size $3\times3$, as demonstrated in Figure 5a, followed by a nonlinear activation. The other topology stacks $1\times1$, $3\times3$, and $1\times1$ convolution layers and is known as the “bottleneck” building block [10,54], displayed in Figure 5b. The first $1\times1$ convolution layer reduces the dimension of the feature maps from 256 to 64. The second convolution layer, of size $3\times3$, is used for computation, and ultimately the feature dimension is restored to 256 using the last $1\times1$ convolution layer.
Our model uses an efficient bottleneck block structure using the Depth-Wise (DW) convolution layer called a Depth-Wise Bottleneck Projection block. In contrast to normal convolution in the Bottleneck Projection, the DW convolution disentangles spatial interactions such as height and width from the channel interactions [
45]. Then, the convolutions are computed over each channel separately, and the result of each separated channel is stacked together [
45].
Since projection aims to map the high-frequency information of low-level features to every stage of our progressive framework, the DW Bottleneck Projection method demonstrates a more effective result due to the different channel-wise convolution operations [
54]. Moreover, as demonstrated in the results section, the network convergence of the DW Bottleneck Projection approach is improved compared to the residual projection and the regular Bottleneck Projection approaches [55]. As demonstrated in Figure 5c, the first layer contains a $1\times1$ convolution layer to reduce the dimensions of the feature maps; this dimension reduction is known as the bottleneck concept. The second layer includes a $3\times3$ Depth-Wise (DW) convolution operation. The main idea behind DW convolution is to replace a normal convolution with a special convolution that implements more effective and lighter filtering by employing a single convolutional operation per input channel and then stacking the results back together [45]. The third layer is a $1\times1$ convolution, known as a point-wise convolution, which constructs new features by calculating linear combinations of the input channels [
21,
26,
54,
55]. The projected features increase the flow of low-level information into the progressive upsampling modules and improve the reconstruction capability of the model.
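A minimal PyTorch sketch of this projection block follows (channel counts are illustrative; the bicubic enlargement of the early features is shown with F.interpolate):

```python
# Sketch of the Depth-Wise Bottleneck Projection: bicubically upsample early
# features to the target stage resolution, reduce channels with a 1x1 conv,
# filter per channel with a 3x3 depth-wise conv (groups = channels), and
# rebuild channel interactions with a 1x1 point-wise conv.
import torch.nn as nn
import torch.nn.functional as F

class DWBottleneckProjection(nn.Module):
    def __init__(self, channels=64, reduced=16):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, 1)                        # bottleneck
        self.dw = nn.Conv2d(reduced, reduced, 3, padding=1, groups=reduced)  # depth-wise
        self.pw = nn.Conv2d(reduced, channels, 1)                            # point-wise

    def forward(self, early_feats, scale=2):
        x = F.interpolate(early_feats, scale_factor=scale, mode='bicubic',
                          align_corners=False)
        return self.pw(self.dw(self.reduce(x)))
```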
3.3. Objective Function
The objective function measures the pixel-wise difference (error) between the reconstructed patch of the image and the corresponding ground-truth (GT) patch. To compute an error function [26], the loss for a patch $P$ can be written as

$$\mathcal{L}(P) = \frac{1}{N} \sum_{p \in P} e(p),$$

where $p$ is the index of the pixels and $e(p)$ denotes the error measured between the values of the pixels at $p$. To obtain a smoother result for the SR model, the $L_1$ loss function performs better than the $L_2$ loss. However, both the $L_1$ and $L_2$ losses correlate inadequately with image quality as perceived by human observers [46]. Using a loss function that correlates with the HVS is a feasible solution. The sensitivity of the HVS depends on the local contrast, luminance, and structure of the reconstructed image [46]. To make the network learning strategy follow the HVS, which perceives the image by attending to contrast, luminance, and structure, the SSIM loss function is suggested.
Let us assume $x$ and $y$ are two patches of the GT and reconstructed SR images, respectively. Then, let $\mu_x$ and $\sigma_x^2$ be the mean and variance of $x$, respectively, and let $\sigma_{xy}$ denote the covariance of $x$ and $y$. Then, $\mu_x$ and $\sigma_x$ can be viewed as estimates of the luminance and contrast of $x$, while $\sigma_{xy}$ measures the tendency of $x$ and $y$ to vary together and thus defines the structural similarity between $x$ and $y$. According to [46], the luminance, contrast, and structure evaluations are defined as follows:

$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3},$$

where $C_1$, $C_2$, and $C_3$ are small constants defined by

$$C_1 = (K_1 L)^2, \qquad C_2 = (K_2 L)^2, \qquad C_3 = C_2 / 2,$$

where $L$ denotes the dynamic range of the pixel values ($L = 255$ for 8 bits/pixel images), and $K_1$ and $K_2$ denote two scalar constants set to $K_1$ = 0.01 and $K_2$ = 0.03.
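For example, for 8-bit images ($L$ = 255) these constants evaluate to

$$C_1 = (0.01 \times 255)^2 \approx 6.50, \qquad C_2 = (0.03 \times 255)^2 \approx 58.52, \qquad C_3 = C_2 / 2 \approx 29.26.$$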
According to [56], the general form of the SSIM between the GT and SR patches is described as follows:

$$\mathrm{SSIM}(x, y) = \left[ l(x, y) \right]^{\alpha} \cdot \left[ c(x, y) \right]^{\beta} \cdot \left[ s(x, y) \right]^{\gamma},$$

where $\alpha$, $\beta$, and $\gamma$ are parameters that express the relative importance of the three components and are set to $\alpha$ = $\beta$ = $\gamma$ = 1. According to [56], the SSIM index at a pixel $p$ can then be written as

$$\mathrm{SSIM}(p) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} = l(p) \cdot cs(p),$$

where the means and standard deviations now depend on the pixel $p$ at which they are computed; they are calculated with a Gaussian filter $G_{\sigma_G}$ of standard deviation $\sigma_G$. Therefore, the SSIM loss function [50] can be written as

$$\mathcal{L}^{\mathrm{SSIM}}(P) = \frac{1}{N} \sum_{p \in P} \left( 1 - \mathrm{SSIM}(p) \right).$$
In the SSIM loss above, the calculation of $\mathrm{SSIM}(p)$ needs to look at a neighborhood of pixel $p$ as large as the support of $G_{\sigma_G}$. According to [46], the computation of $\mathrm{SSIM}(p)$ and its derivatives is impossible in some patch regions; hence, the loss is evaluated at the center pixel $\tilde{p}$ of the patch. The derivative at $\tilde{p}$ with respect to any other pixel $q$ in a patch $P$ can be defined as

$$\frac{\partial \mathcal{L}^{\mathrm{SSIM}}(P)}{\partial x(q)} = -\frac{\partial\, \mathrm{SSIM}(\tilde{p})}{\partial x(q)} = -\left( \frac{\partial l(\tilde{p})}{\partial x(q)} \cdot cs(\tilde{p}) + l(\tilde{p}) \cdot \frac{\partial cs(\tilde{p})}{\partial x(q)} \right),$$

where $l(\tilde{p})$ and $cs(\tilde{p})$ are the first and second terms of the SSIM index above, and their derivatives are

$$\frac{\partial l(\tilde{p})}{\partial x(q)} = 2 \cdot G_{\sigma_G}(q - \tilde{p}) \cdot \left( \frac{\mu_y - \mu_x \cdot l(\tilde{p})}{\mu_x^2 + \mu_y^2 + C_1} \right)$$

and

$$\frac{\partial cs(\tilde{p})}{\partial x(q)} = \frac{2}{\sigma_x^2 + \sigma_y^2 + C_2} \cdot G_{\sigma_G}(q - \tilde{p}) \cdot \left[ (y(q) - \mu_y) - cs(\tilde{p}) \cdot (x(q) - \mu_x) \right],$$

where $G_{\sigma_G}(q - \tilde{p})$ denotes the Gaussian coefficient associated with pixel $q$.
As mentioned above, the quality of the reconstructed (SR) image depends on $\sigma_G$. For instance, a large value of $\sigma_G$ tends to preserve noise at the edges. In contrast, a small value of $\sigma_G$ leads to unpleasant artifacts because it reduces the network's ability to reconstruct the local structure of the image. Using the multi-scale structure of SSIM (MS-SSIM), which is designed according to a dyadic pyramid of $M$ resolution levels, is a feasible solution to this SSIM limitation.
Figure 6 illustrates the MS-SSIM diagram. Based on [56], it is defined as

$$\text{MS-SSIM}(p) = \left[ l_M(p) \right]^{\alpha} \cdot \prod_{j=1}^{M} \left[ cs_j(p) \right]^{\beta_j},$$

where $l_M$ denotes the luminance term and $cs_j$ denotes the combined contrast and similarity term. As observed in the diagram, the GT and SR patches are taken as inputs. A low-pass filter and a downsampling operation by a factor of 2 are applied to the inputs iteratively. The input patches are indexed as the first scale (scale 1), while the highest-order scale, scale $M$, is obtained after $M - 1$ iterations.
The diagram and the equation show that the luminance comparison is calculated only at scale $M$ and is defined as $l_M(p)$. By contrast, the structure and contrast comparisons are computed at the $j$-th scale and are combined into the term $cs_j(p)$. We set $\alpha$ = $\beta_j$ = 1. According to [46], the final loss for a patch $P$ with its center pixel $\tilde{p}$ is defined as

$$\mathcal{L}^{\text{MS-SSIM}}(P) = 1 - \text{MS-SSIM}(\tilde{p}).$$
Based on [46], the derivative of the MS-SSIM loss function can be described as

$$\frac{\partial \mathcal{L}^{\text{MS-SSIM}}(P)}{\partial x(q)} = -\left( \frac{\partial l_M(\tilde{p})}{\partial x(q)} + l_M(\tilde{p}) \cdot \sum_{j=1}^{M} \frac{1}{cs_j(\tilde{p})} \cdot \frac{\partial cs_j(\tilde{p})}{\partial x(q)} \right) \cdot \prod_{j=1}^{M} cs_j(\tilde{p}).$$
The MS-SSIM loss produces a smoother SR image than the $L_1$ loss, and it also preserves the image's contrast in high-frequency regions better than the $L_1$ loss function. On the other hand, the $L_1$ loss preserves the edges and is very sensitive to sharp intensity changes. To reconstruct the best result with our SR model, a mix of the MS-SSIM loss and the $L_1$ loss function is proposed:

$$\mathcal{L}^{\mathrm{Mix}} = \alpha \cdot \mathcal{L}^{\text{MS-SSIM}} + (1 - \alpha) \cdot G_{\sigma_G^M} \cdot \mathcal{L}^{L_1},$$

where point-wise multiplication is applied between $G_{\sigma_G^M}$ and $\mathcal{L}^{L_1}$. The best performance of $\mathcal{L}^{\mathrm{Mix}}$ is obtained by setting $\alpha$ to 0.8. Experiments with different $\alpha$ weights in $\mathcal{L}^{\mathrm{Mix}}$ are demonstrated in the next section.
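For illustration, the following is a minimal PyTorch sketch of this mixed objective (the paper's own implementation used Keras with a TensorFlow back end). Inputs are assumed to be (N, C, H, W) tensors in [0, 1]; the window size, $\sigma$, and the number of scales $M$ are illustrative choices, and the Gaussian weighting $G_{\sigma_G^M}$ of the $L_1$ term is simplified to a plain $L_1$ here:

```python
# Sketch of the mixed MS-SSIM + L1 objective. ssim_terms computes the
# luminance (l) and contrast-structure (cs) maps with Gaussian-filtered local
# statistics; ms_ssim collects cs over M dyadic scales and uses l only at the
# coarsest scale, approximating the per-pixel product with scale-wise means.
import torch
import torch.nn.functional as F

def gaussian_window(size=11, sigma=1.5):
    coords = torch.arange(size, dtype=torch.float32) - size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = (g / g.sum()).unsqueeze(0)
    return (g.t() @ g).view(1, 1, size, size)

def ssim_terms(x, y, win, C1=0.01 ** 2, C2=0.03 ** 2):  # constants for L = 1
    c, pad = x.shape[1], win.shape[-1] // 2
    w = win.to(x.device).expand(c, 1, -1, -1)
    mu_x = F.conv2d(x, w, padding=pad, groups=c)
    mu_y = F.conv2d(y, w, padding=pad, groups=c)
    var_x = F.conv2d(x * x, w, padding=pad, groups=c) - mu_x ** 2
    var_y = F.conv2d(y * y, w, padding=pad, groups=c) - mu_y ** 2
    cov = F.conv2d(x * y, w, padding=pad, groups=c) - mu_x * mu_y
    l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)   # luminance
    cs = (2 * cov + C2) / (var_x + var_y + C2)                  # contrast-structure
    return l, cs

def ms_ssim(x, y, win, M=3):
    cs_means = []
    for j in range(M):
        l, cs = ssim_terms(x, y, win)
        cs_means.append(cs.mean())
        if j < M - 1:                          # low-pass + downsample by 2
            x, y = F.avg_pool2d(x, 2), F.avg_pool2d(y, 2)
    return l.mean() * torch.stack(cs_means).prod()  # luminance only at scale M

def mix_loss(sr, hr, alpha=0.8):
    win = gaussian_window()
    return alpha * (1 - ms_ssim(sr, hr, win)) + (1 - alpha) * (sr - hr).abs().mean()
```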
4. Experiments
The motivation for designing our PMRF model stems from the need to produce an SR image as similar as possible to the HR image, reconstructing details such as holes and minor lines, content sharpness, and texture diversity.
We conducted several examinations to validate the performance of our SR model. First, we describe the experimental setting of the proposed model. Second, we assess the effect of different projection approaches in the training phase. Third, we explore the effect of several objective functions on the reconstructed images. The comparison of different $\alpha$ weights in the mix of the MS-SSIM loss and the $L_1$ loss function is demonstrated in the fourth subsection, and the performance comparison of our model with different projection approaches and objective functions in the fifth. Moreover, we compare and evaluate our SR images against several selected representative SR methods through comparative analysis. Additionally, we compare the network parameters and execution time of our model with those of selected SR models. Finally, a learning difficulty analysis and a noise degradation analysis for the different objective functions of the proposed model are presented.
4.1. Experimental Setting
This section explains the experimental settings of the datasets used in the training and testing phases, the training details, and the evaluation metrics.
4.1.1. Dataset
We used the DIV2K dataset [
57] to train our SR model. The DIV2K dataset contains 800 high-quality (2K resolution) images used for training purposes. In the testing phase, we compared the performance of our model on five benchmark datasets including Set5 [
58], Set14 [
59], BSD100 [
60], Manga109 [
61], and Urban100 [
62].
Table 1 represents information regarding the training and testing benchmark dataset in this research.
4.1.2. Training Details
In each training batch of our model, random LR patches in RGB mode with a size of 48 × 48 are extracted as the inputs together with the corresponding HR patches. The ADAM optimizer [63] with settings of $\beta_1$ = 0.9, $\beta_2$ = 0.999, and $\epsilon = 10^{-8}$ is used for training our model. The minibatch size is set to 16. The Python 3.5 programming language under the Keras 2.2.4 framework [
64] with TensorFlow 1.5 as the back end was used to implement our SR model, and it was trained on a Titan Xp GPU with 24 GB of memory. The learning rate of our proposed model is set to $10^{-4}$.
4.1.3. Evaluation Metrics
The PSNR and SSIM [
56] evaluations are computed on the Y channel of the transformed YCbCr space to measure the quality of the SR results.
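As an illustration of this protocol, a short NumPy sketch follows, assuming 8-bit RGB inputs and the ITU-R BT.601 luma coefficients commonly used in SR evaluation code:

```python
# Sketch of the evaluation protocol: convert RGB to the Y (luma) channel of
# YCbCr using ITU-R BT.601 coefficients, then compute PSNR on Y.
# Inputs are uint8 RGB arrays (H, W, 3) of identical shape.
import numpy as np

def rgb_to_y(img):
    img = img.astype(np.float64)
    return (65.481 * img[..., 0] + 128.553 * img[..., 1]
            + 24.966 * img[..., 2]) / 255.0 + 16.0

def psnr_y(sr, hr):
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)
```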
4.2. Effects of Projection
The effect of different projection approaches, including Residual Projection, Bottleneck Projection, and Depth-Wise Bottleneck Projection in the training phase of our model, are compared in this section.
Figure 7 and
Figure 8 demonstrate the graphs of average training loss and PSNR (dB), respectively, on 800 training epochs under the same training dataset (DIV2K).
As observed in the graphs of
Figure 7 and
Figure 8, the Bottleneck Projection (blue) and the Depth-Wise Bottleneck Projection (red) show superior convergence performance over the residual projection approach (yellow). Although the performances of the Bottleneck Projection and the Depth-Wise Bottleneck Projection are very close, the Depth-Wise Bottleneck approach demonstrates smoother and better convergence in both the average loss (
Figure 7) and average PSNR (dB) (
Figure 8) in the training mode.
4.3. Comparison of Different Objective Functions
The effects of different objective functions, including the $L_1$ loss, $L_2$ loss, MS-SSIM loss, $L_1$ + MS-SSIM loss, and $L_2$ + MS-SSIM loss, are compared. The visual comparison of our model with different objective functions is shown in Figure 9 and Figure 10. To better visually distinguish the results of these objective functions, we used images with different textural structures at different scale factors. The quantitative comparisons of the benchmark datasets among these objective functions at different scales are presented in
Table 2,
Table 3 and
Table 4.
Figure 9 displays the results of our SR model with different objective functions on the “Monarch” image from the Set14 [59] dataset. Since the $L_1$ loss function penalizes smaller errors more than the $L_2$ loss function, the result of $L_1$ is smoother and sharper than that of $L_2$, while the high-frequency details and minor features in the regions connecting the edges vanish. This means that despite the smoothness of the $L_1$ result, it shows weakness in reconstructing the minor details of the image faithfully to the original image. Although the MS-SSIM result demonstrates a sharper image with more details than the pixel-wise losses, it shows weakness in reconstructing the edges exactly as in the original image. The mix of $L_1$ and MS-SSIM produces more realistic results than the other loss functions: it produces a sharp image while more minor details are preserved around the edges. The PSNR (dB) and SSIM evaluations also demonstrate the superior performance of the proposed loss function.
Figure 10 illustrates the results of our SR model with different objective functions on “Image-92” from the Urban100 [62] dataset at a larger scale factor. These results compare the effect of each objective function in reconstructing vertical and horizontal lines over a constant surface. As observed in Figure 10, the $L_2$ objective function performs better at reconstructing the vertical and horizontal lines than the other non-mixed objective functions. However, the lack of smoothness due to over-amplified noise (greater errors are penalized by $L_2$) makes it a non-pleasing image. The mixed $L_1$ and MS-SSIM loss function shows the best performance in producing a sharp image, while also detecting all the lines similarly to the original image. The PSNR (dB) and SSIM evaluations also demonstrate the best performance compared to the other objective functions.
Comparing the generated results of the different objective functions in Figure 9 and Figure 10, the combination of MS-SSIM and $L_1$ demonstrates the best performance in quantitative evaluations and perceptual quality. The MS-SSIM objective function operates on the visible structures of the image (luminance, contrast, and structure), while the $L_1$ objective function puts more emphasis on the pixel-wise differences between the GT and the SR image. Thus, combining them produces better perceptual quality for the human viewer and more appealing SR results than the other objective functions.
The quantitative performance comparisons include PSNR (dB) and SSIM on the benchmark datasets of Set5 [
58], Set14 [
59], BSD100 [
60], Urban100 [
62] and Manga109 [
61] at ×2, ×4, and ×8 scales, as demonstrated in
Table 2,
Table 3 and
Table 4. The red numbers indicate the best performance, and the blue ones show the second best.
As observed in
Table 2,
Table 3 and
Table 4 for the ×2, ×4, and ×8 scales, the best performances (PSNR and SSIM) among the different objective functions belong to the mixed $L_1$ and MS-SSIM objective function. Noticeably, the second best (blue) belongs to the other fused objective function concept (MS-SSIM + $L_2$).
4.4. Comparison of Different α Weights in the Mix of MS-SSIM Loss and L1 Loss Function
The influence of different $\alpha$ weights used to fuse the MS-SSIM loss and the $L_1$ loss function is compared in this section. The quantitative performance includes PSNR (dB) and SSIM with different $\alpha$ weights on the benchmark datasets of Set5 [
weight on the benchmark datasets of Set5 [
58], Set14 [
59], BSD100 [
60], Urban100 [
62] and Manga109 [
61] at ×4 and ×8 scales, as shown in
Table 5 and
Table 6, respectively.
According to
Table 5 and
Table 6, the best $\alpha$ weight for gaining the highest PSNR and SSIM is $\alpha$ = 0.8.
4.5. Performance Comparison of Our Model with Different Objective Functions and Projection Approaches
Table 7 shows the performance investigation of our progressive upsampling SR model (PSNR value at scale four on the Set5 [
58] dataset) using different objective functions and different projection approaches.
The DW Bottleneck Projection performs best in mapping the discriminative high-frequency details compared to the Residual and Bottleneck Projection approaches for all objective functions. Additionally, the proposed fused objective function demonstrates a noticeable improvement in accuracy compared to the common objective functions ($L_1$ or $L_2$) used in SR models. The best PSNR value (32.70 dB) belongs to the $L_1$ + MS-SSIM objective function with the DW Bottleneck Projection.
4.6. Comparison with Other Super-Resolution Methods
Here, we compare our PMRF model with state-of-the-art SR methods, including visual and quantitative comparisons at ×2, ×4, and ×8 scales. In the visual comparison, we compare the results of the SRCNN [
7], VDSR [
8], LapSRN [
17], MemNet [
13], MS-LapSRN [
44], EDSR [
16] and RCAN [
18] models with the results of our model at ×4 and ×8 scales on the BSD100 [
60], Manga109 [
61] and Urban100 [
62] datasets.
In
Figure 11, we show visual comparisons at the ×4 scale for image “3096” of the BSD100 [
60] dataset. We observe that the compared SR models show weakness in reconstructing the sharp image with small details and suffer blurry artifacts. By contrast, our PMRF model reduces the blurring effect and reconstructs a better perceptual quality image due to the effectiveness of the proposed objective function.
In
Figure 12, we display visual comparisons at the ×4 scale for the image “GakuenNoise”, which belongs to the Manga109 [
61] dataset. Other models show weakness in representing the lattice’s circular shapes. Some models suffer from blurring artifacts, while in RCAN [
18] and EDSR [
16], the reconstructed lattice shapes are not similar to the original HR image. On the other hand, the result of our PMRF model represents better performance in recovering the circular lattice details due to using the progressive upsampling framework to reconstruct the detail progressively.
In
Figure 13, we demonstrate visual comparisons at the ×4 scale for image “Img-12”, which belongs to the Urban100 [
62] dataset. In contrast to the other SR models, our PMRF model performs better in reconstructing the parallel lines because of its robustness in mapping the high-frequency detail (edges and lines).
Figure 14 compares other SR results on image “302008” of the BSD100 [
60] dataset. Due to the large scale factor, the Bicubic method’s result has lost the HR image’s correct structure. Reconstructing the wrong structure because of a very large scale factor also occurs in some other models such as SRCNN [
7], VDSR [
8] and LapSRN [
17]. Our PMRF model performs better in recovering the original structure of black lines than the other state-of-the-art models, which lack smoothness, blurring artifacts, and the capability to recover tiny line connections. Notably, by reducing the learning difficulties in a progressively upsampling procedure, our model and MS-LapSRN [
44] effectively recover the edge detail.
Figure 15 shows visual comparisons at the ×8 scale for image “Img-096”, which belongs to the Urban100 [
62] dataset. The progressive upsampling framework-based models such as LapSRN [
17] and MS-LapSRN [
44] show robustness in reconstructing the parallel lines at this large scale factor. However, these models demonstrate a lack of smoothness. Although the RCAN [
18] model recovered the parallel lines, it did not produce a sharp result. This weakness in RCAN [
18] is caused by a lack of global information in the CA mechanism, which shows quality degradation at larger scales. The proposed progressive model produces a superior SR image, recovering the parallel lines more effectively without blur or halo effects around the lines, due to the effectiveness of the proposed fused objective function and the multi-stage enlarging.
The quantitative results comparison using the PSNR (dB) and SSIM evaluations at ×2, ×4, and ×8 scales on the Set5 [
58], Set14 [
59], BSD100 [
60], Manga109 [
61], and Urban100 [
62] datasets are illustrated in
Table 8,
Table 9 and
Table 10. For the quantitative comparisons, we used 13 models including Bicubic, SRCNN [
7], FSRCNN [
20], VDSR [
8], LapSRN [
17], MemNet [
13], EDSR [
16], SRMDNF [
65], D-DBPN [
21], PAN [
66], LAINet [
36], RDN [
27] and SRFBN [
22]. The results of other models are cited from their papers. The red numbers indicate the best performance, and the blue ones demonstrate the second best.
In contrast to the ×4 scale in Table 9 and the ×8 scale in Table 10, our results at the ×2 scale shown in Table 8 are slightly lower than those of RDN [27] and SRFBN-S [22]. Since our model has a progressive upsampling framework, at a scale factor of ×2 it acts similarly to a post-upsampling framework, although it has less network depth than the other post-upsampling-based models at this scale.
Compared with the other SR models at the ×4 and ×8 scales, our PMRF shows the best PSNR (dB) and SSIM results on all examined datasets. These results indicate that our progressive upsampling framework with the proposed fused objective function achieves superior performance over the other SR models at larger scale factors (×4 and ×8). The comparisons of network parameters and execution time are demonstrated in the following section.
4.7. Comparative Analysis
The performance of the SR model is evaluated using objective measures including PSNR and SSIM. In the comparative analysis, we compare the performance of our proposed model with different state-of-the-art algorithms over five benchmark datasets (Set5 [
58], Set14 [
59], BSD100 [
60], Manga109 [
61], and Urban100 [
62]) at the ×4 and ×8 scales.
Figure 16 and
Figure 17 compare the PSNR and SSIM at the ×4 scale over the five benchmark datasets, respectively. The proposed model (PMRF) is the most effective SR model regarding the PSNR and SSIM of the super-resolved images on all benchmark datasets compared to SRFBN-S [
22], RDN [
27], SRMDNF [
65], EDSR [
16], and SRCNN [
7]. The best improvements over the other models regarding PSNR and SSIM belong to the performance of our model on the Urban100 dataset. Compared to the SRFBN-S model, our model (PMRF) improved the PSNR by 0.28 dB and the SSIM by around 0.01.
Figure 18 and
Figure 19 compare the PSNR and SSIM at the ×8 scale over the five benchmark datasets, respectively. According to the bar graphs, our model is the most effective SR model on all benchmark datasets compared to D-DBPN [
21], RDN [
27], SRMDNF [
65], EDSR [
16], and SRCNN [
7]. Only the PSNR of the RDN model on the Set14 dataset is 0.01 dB higher than that of our model. However, regarding SSIM, our model shows robustness compared to the other models.
4.8. Model Size Analysis
We represent comparisons of model size and performance in this section. For these comparisons, we used nine state-of-the-art models including SRCNN [
7], FSRCNN [
20], VDSR [
8], LapSRN [
17], MemNet [
13], EDSR [
16], D-DBPN [
21], MDSR [
16] and RCAN [
18]. These models have been implemented on a Titan Xp GPU with 24 GB Memory.
Figure 20 compares the performance and number of parameters on the Set5 [
58] dataset at a scale of ×4.
According to this graph, our PMRF model gains the highest PSNR (32.7 dB), and the number of parameters in our model is less than that of the RCAN [18] model, which achieves the second-best PSNR at this scale. Our model, with a progressive upsampling framework and the proposed fused objective function, achieves an acceptable trade-off between accuracy and parameter efficiency.
Figure 21 compares the performance and the execution time on the Set5 [
58] dataset. According to this graph, our PMRF model gains the highest PSNR (32.7 dB) while its execution time is faster than EDSR [
16], RCAN [
18], MDSR [
16], and MemNet [
13].
4.9. Learning Difficulty Analysis
The effect of the upsampling framework on the learning difficulty of the SR model is demonstrated in this section.
Figure 22 shows the average PSNR (dB) per epoch in the training of our SR model with progressive upsampling and post-upsampling frameworks. To compare fairly, we trained both models under the same hyperparameters. As observed in the graphs of
Figure 22, the progressive upsampling (blue) represents superior convergence performance compared to the post-upsampling (red) framework. According to the figure, the PSNR of the progressive upsampling (blue) graph at the initial stage of training is higher than the red one and grows rapidly with fewer fluctuations compared to the red graph.
4.10. Noise Degradation Analysis
The evaluation of the different objective functions, including $L_1$, $L_2$, MS-SSIM, $L_1$ + MS-SSIM, and $L_2$ + MS-SSIM, on degraded images of the Set14 [59] dataset is shown in Table 11. Gaussian degradation with a kernel width of 0.5 and a noise level of 15 is used for this evaluation. The red numbers indicate the best performance, and the blue ones demonstrate the second best.
The performance of the $L_1$ and $L_2$ objective functions (PSNR and SSIM) is weaker than that of MS-SSIM and the combinations of MS-SSIM with $L_1$ and $L_2$. Due to the greater sensitivity of $L_2$ to noise, it shows the weakest performance. The best performances belong to the fusion objective function approaches. The highest PSNR and SSIM of MS-SSIM with $L_1$ demonstrate the robustness of the proposed fusion objective function against noise degradation.
The list of abbreviations used in this article is tabulated in
Table 12.
5. Conclusions
This research proposes a novel fusion objective function, fusing the $L_1$ and Multi-Scale SSIM loss functions, for the single-image super-resolution model to improve the accuracy and perceptual quality of the resultant images. Moreover, we designed a novel Progressive Multi-Residual Fusion architecture (PMRF) that uses Residual-in-Residual Dense Blocks (RRDB) under the progressive upsampling framework. Additionally, the Depth-Wise (DW) Bottleneck Projection approach was applied to bypass the high-frequency components of the early-layer features into every stage of the upsampling modules, which improved the training convergence of our model. Quantitative and qualitative evaluations were conducted on five benchmark datasets (Set5, Set14, BSD100, Urban100, and Manga109) at ×2, ×4, and ×8 scales. The proposed fused objective function ($L_1$ and MS-SSIM) improved the perceptual quality and accuracy (PSNR/SSIM). Additionally, the fused objective function demonstrates noticeable robustness against noise degradation compared to the conventional objective functions ($L_1$ and $L_2$). The proposed Depth-Wise Bottleneck Projection improved the convergence of our model by mapping high-frequency detail to each stage of upsampling. Due to the progressive estimation of the high-frequency components of images based on the outputs of previous stages in our progressive framework, the learning difficulty of the model is reduced, and the resultant images show effectiveness in recovering complex textures. Moreover, the experiments on execution time and the number of parameters reveal an acceptable trade-off between parameter efficiency and accuracy. The performance of the proposed model at the ×2 scale is slightly lower than that of two other models (RDN and SRFBN-S); at this scale, the progressive framework acts similarly to a post-upsampling framework while having less network depth than other post-upsampling models.
In the future, we would like to implement our proposed SR model in the real-time application of high-definition video and explore the super-resolution model in larger scale factors in real-time image and video communications in edge IoT devices. The proposed SR algorithm is helpful in image-based smart IoT ecosystems, and in video and image communication systems to enhance the perceptual quality of the captured images and video frames in real-time applications.