Our single image super-resolution model is a two-stage hourglass network composed of a memory-friendly encoder and a recovery decoder. We employ a deconvolution layer that achieves 2×, 3× and 4× super-resolution, yet our methodology can be extended to higher upsampling factors. As mentioned above, we bootstrap the super-resolution process with a multi-task co-optimizing method that focuses on different inherent patterns of an image, such as local luminance, contrast and structure, as well as fitting the ground-truth data distribution. Meanwhile, we adopt a cross-scale training strategy to improve performance further. Our memory-friendly encoder progressively extracts high-level feature maps while decreasing the channel dimension, thus avoiding overwhelming memory overheads.
3.1. EMTCM
Inspired by [28], we adopt a CNN-based method as our basic approach. However, supervised models based on traditional image SR, such as SRCNN, require the input image to be interpolated to the desired size beforehand, so the model learns its mapping in a high-dimensional space, which is both time- and computation-consuming. Therefore, we design an end-to-end EMTCM to address this weakness. As shown in Figure 1, our overall network is a two-stage hourglass architecture, including a memory-friendly encoder and a recovery decoder. Moreover, we employ a number of residual blocks with long skip connections to mitigate the limited local receptive field of CNNs and enhance the feature-representation capability of the model.
A Memory-Friendly Encoder: as shown in Figure 1, our memory-friendly encoder consists of two stages: a Coarse Extractor $G$ and a Memory-Friendly Module $F$. Each stage stacks several convolution layers. The Coarse Extractor $G$ is a coarse network with a simple structure that can restore a coarse HR image. Specifically, the module $G$ takes the original LR image $I_{LR}$ as input without interpolation, performs a series of convolutions and extracts coarse feature maps. These coarse feature maps are treated as high-dimensional feature vectors, which leads to overwhelming memory overheads. The high-dimensional feature vectors can be formulated as

$V_G = G(I_{LR}),$ (1)

where $G(\cdot)$ denotes the coarse extractor operation, which consists of a series of convolution layers, $I_{LR}$ is the degraded low-resolution image and $V_G$ denotes the high-dimensional feature vectors extracted by $G$, serving as the input to the memory-friendly module.
As aforementioned, our EMTCM model feeds the original LR image through $G$ with a sufficient number of convolution layers, producing high-dimensional vectors; these convolution operators result in a prohibitive memory footprint for EMTCM. Therefore, we apply a memory-friendly module $F$ that gradually extracts higher-level feature maps while decreasing the channel dimension. The low-dimensional feature vectors can be formulated as

$V_F = F(V_G),$ (2)

where $V_F$ is given by the function $F(\cdot)$, which takes $V_G$ in (1) as input and gradually extracts higher-level information while decreasing the channel dimension.
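To make the encoder concrete, here is a minimal PyTorch sketch of the two stages described above; all layer counts and channel widths (256 down to 64) are illustrative assumptions, not the actual EMTCM configuration.

```python
import torch
import torch.nn as nn

class CoarseExtractorG(nn.Module):
    """Coarse extractor G: lifts the raw LR image (no interpolation)
    to high-dimensional feature maps V_G. Widths are hypothetical."""
    def __init__(self, in_ch=3, feat_ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)  # V_G

class MemoryFriendlyF(nn.Module):
    """Memory-friendly module F: gradually shrinks the channel
    dimension to keep memory usage manageable."""
    def __init__(self, feat_ch=256, out_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(feat_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, v_g):
        return self.body(v_g)  # V_F

lr = torch.randn(1, 3, 32, 32)            # a 32x32 LR patch
v_f = MemoryFriendlyF()(CoarseExtractorG()(lr))
assert v_f.shape == (1, 64, 32, 32)       # spatial size kept, channels reduced
```

Note how the spatial resolution of the input is never inflated by interpolation, which is the first of the two memory-saving efforts.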
As aforementioned, even though we adopt an end-to-end architecture for the overall network, the LR image is mapped to high-dimensional feature vectors, which leads to prohibitive memory usage. Therefore, we make two coherent efforts to avoid this notorious problem. On the one hand, a naive option directly takes the raw pixels, without any interpolation, as input, but this alone cannot sufficiently reduce the memory and computation. On the other hand, we build a memory-friendly module to fix the weakness of the first effort. Through the mutual collaboration between the end-to-end network and the memory-friendly module, EMTCM avoids prohibitive memory and computation.
Recovery Decoder: as shown in Figure 1 (right), we propose a recovery decoder, following the memory-friendly encoder, which includes a mapping module $M$, a recovery module $R$ and a deconvolution layer $De$. In the mapping module, we stack a sufficient number of convolution layers to capture more context information from the LR input, so as to overcome the inherently local receptive field of the convolution operation. However, this could cause loss of feature resolution and fine details, as well as gradient vanishing or exploding. A parallel way to effectively address these issues is to apply residual blocks with long skip connections. By doing so, we can capture more high-level information to guide the network toward super-resolved results, while also relieving the inherent weakness of CNNs. We define the $M$ function as

$V_M = M(V_F),$ (3)

where $V_M$ is given by the function $M(\cdot)$ that maps $V_F$ in (2) to $V_M$. $V_M$ carries the high-level information and is the foundation of our multi-task co-training strategy.
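A residual block with a long skip connection, as used in the mapping module, can be sketched as follows; the block depth, width and choice of ReLU are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block: the identity shortcut eases gradient flow."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))  # local skip

class MappingM(nn.Module):
    """Mapping module M: stacked residual blocks plus a long skip
    connection from input to output (hypothetical depth)."""
    def __init__(self, ch=64, n_blocks=4):
        super().__init__()
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])

    def forward(self, v_f):
        return v_f + self.blocks(v_f)  # long skip connection

v_f = torch.randn(1, 64, 32, 32)
v_m = MappingM()(v_f)
assert v_m.shape == v_f.shape  # M preserves the feature-map shape
```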
Next, we apply a recovery module $R$ after the $M$ module. In the $R$ module, we aim to increase the channel dimension of the low-dimensional feature vectors. Although the channel dimension of the high-dimensional feature vectors is reduced for the sake of computational efficiency, generating the HR image directly from low-dimensional feature vectors would yield poor final quality. Therefore, we apply the recovery module to boost performance further. It is defined as

$V_R = R(V_M),$ (4)

where $R$ denotes the recovery module function and $V_R$ is obtained by the function $R(\cdot)$ that recovers the resolution of the feature maps. $V_R$ is significant for attaining the final visual super-resolution images.
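Under the same illustrative assumptions as before, the recovery module simply steps the channel dimension back up before upsampling; the 64 → 128 → 256 progression is an example choice, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical recovery module R: widen the channel dimension again so
# the upsampler has enough capacity to reconstruct fine detail.
recovery_R = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
)

v_m = torch.randn(1, 64, 32, 32)
v_r = recovery_R(v_m)
assert v_r.shape == (1, 256, 32, 32)  # channels recovered, spatial size kept
```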
The last module is the upsampling operation, a learning-based transposed convolution layer $De$, also known as a deconvolution layer, which performs a transformation opposite to a normal convolution, i.e., it predicts a possible input from feature maps of the output size of a convolutional layer. Specifically, it improves the image resolution by expanding the feature maps with inserted zero values and then performing convolution. Since the transposed convolution layer can enlarge the image size in an end-to-end manner while maintaining a connectivity pattern compatible with vanilla convolution, we adopt the deconvolution layer as our learning-based upsampling method, whose output is directly the reconstructed HR image. The deconvolution layer learns a set of upsampling kernels for the input feature maps. These kernels are diverse and meaningful; if we forced them to be identical, the parameters would be used inefficiently (equivalent to summing the input feature maps into one). The final result is expressed as

$\hat{X} = De(V_R),$ (5)

where $De(\cdot)$ denotes the transposed convolution operation and $\hat{X}$ is the final output of the EMTCM model.
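The shape arithmetic of the transposed convolution is easy to verify in isolation; the kernel-size-4 / stride-2 / padding-1 combination below is a common choice for 2× upsampling, not necessarily the one used in EMTCM.

```python
import torch
import torch.nn as nn

# Transposed convolution ("deconv") as a learned 2x upsampler.
# Output size: (H - 1) * stride - 2 * padding + kernel = 31*2 - 2 + 4 = 64.
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=3,
                            kernel_size=4, stride=2, padding=1)

v_r = torch.randn(1, 64, 32, 32)      # feature maps from the recovery module
sr = deconv(v_r)
assert sr.shape == (1, 3, 64, 64)     # 2x super-resolved RGB output
```

For 3× or 4× targets one would adjust the stride (and kernel/padding) accordingly, or stack several such layers.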
3.2. Multi-Task Co-Optimization Strategy
Image super-resolution in the SISR domain currently benefits from the collaboration between CNN-based models and a pixel-wise loss. However, this collaboration is not an ideal partnership, because both components have disadvantages. In the span of just a couple of years, neural networks have been employed for virtually every computer vision and image processing task known to the research community. Much research has focused on defining new architectures better suited to specific problems, and a large effort has also been made to understand the inner mechanisms of neural networks and their intrinsic limitations. However, the loss function, an effective driver of a network's learning, has attracted little attention within the SISR research community: most CNN-based methods impose a pixel-wise loss. Note that the pixel-wise loss and the Peak Signal-to-Noise Ratio (PSNR) do not correlate well with human perception of image quality: minimizing it is a single task that merely optimizes PSNR. Here, we introduce a multi-task co-optimizing strategy to fix this weakness. Interestingly, adding the multi-task co-optimizing strategy improves performance, which makes it natural to incorporate it into EMTCM, where it may help capture more useful and meaningful information. Specifically, we construct a multi-task objective for super-resolution, instead of a single pixel-wise loss. This makes EMTCM focus on the inherent patterns of images, including local luminance, contrast, structure and fitting the ground-truth data distribution.
HVS task: the human visual system correlates well with the inherent patterns of images, including local luminance, contrast and structure. Moreover, an SSIM-based loss can force the overall network to focus on these inherent patterns and is proven effective in recovering high-fidelity images. Hence we introduce the SSIM loss to generate realistic images. We define the SSIM loss as

$L_{SSIM}(\theta) = 1 - \mathrm{SSIM}(X, \hat{X}),$ (6)

where the network EMTCM is parameterized by $\theta$, $\mathrm{SSIM}(X, \hat{X})$ is a spatial similarity map between $X$ and $\hat{X}$, $X$ is the ground-truth HR image and $\hat{X}$ is the final output of the network.
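As a sanity check on the loss definition, a global (single-window) SSIM and the corresponding loss can be written in a few lines of NumPy; real implementations compute SSIM over local windows and average the resulting map, so this is only a simplified sketch.

```python
import numpy as np

# Stabilizing constants for images scaled to [0, 1] (standard SSIM choices).
C1, C2 = (0.01 * 1.0) ** 2, (0.03 * 1.0) ** 2

def ssim_global(x, y):
    """Single-window SSIM combining luminance, contrast and structure."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

def ssim_loss(x, y):
    return 1.0 - ssim_global(x, y)  # L_SSIM = 1 - SSIM

img = np.random.rand(32, 32)
assert abs(ssim_loss(img, img)) < 1e-9  # identical images -> zero loss
```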
Fitting the ground-truth data distribution task: the other co-optimized task is to fit the ground-truth data distribution. In order to fix the weakness of the SSIM loss, we introduce a cross-entropy loss as a fine-tuning strategy to fit the distribution of the HR images. We define the cross-entropy loss as

$L_{CE} = -\frac{1}{n} \sum_{i=1}^{n} p(X_i) \log q(\hat{X}_i),$ (7)

where $p(X_i)$ denotes the ground-truth probability distribution of the image, $q(\hat{X}_i)$ stands for the output probability distribution produced by EMTCM based on $X$, and $n$ stands for the batch size of the training data.
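The batch-averaged cross-entropy between two distributions can be sketched as below; the names `p` and `q` follow the definition above, and the small `eps` only guards the logarithm.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Mean cross-entropy over a batch: -1/n * sum_i p_i . log q_i."""
    return float(-np.mean(np.sum(p * np.log(q + eps), axis=1)))

p = np.array([[1.0, 0.0], [0.0, 1.0]])  # ground-truth distributions
q = np.array([[0.9, 0.1], [0.2, 0.8]])  # model output distributions
loss = cross_entropy(p, q)
assert abs(loss - 0.16425) < 1e-3       # -(ln 0.9 + ln 0.8) / 2
```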
Multi-task co-optimizing strategy: the combination of the two losses achieves the goal of multi-task co-optimization and improves performance further. Based on the two tasks and losses introduced above, we define the overall loss of our EMTCM super-resolution model as

$L(\theta) = L_{SSIM}(\theta) + \lambda L_{CE}(\theta),$ (8)

where $L(\theta)$ is the multi-task co-optimizing overall loss of EMTCM and $\lambda$ is a trade-off parameter that balances the overall objective in order to fine-tune the model. We set the value of $\lambda$ empirically.
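Combining the two loss terms is then a one-liner; the loss values and trade-off weight in the example are made up for illustration, not the paper's settings.

```python
def total_loss(l_ssim, l_ce, lam):
    """Overall multi-task objective: L = L_SSIM + lam * L_CE."""
    return l_ssim + lam * l_ce

# Example with hypothetical loss values and a small trade-off weight.
assert abs(total_loss(0.2, 0.5, 0.1) - 0.25) < 1e-12
```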