Video Super-Resolution Based on Generative Adversarial Network and Edge Enhancement

Abstract: With the help of deep neural networks, video super-resolution (VSR) has made huge breakthroughs. However, these deep learning-based methods are rarely deployed in specific real-world situations. In addition, the training sets may not be suitable, because many methods simply assume that, under ideal circumstances, the low-resolution (LR) datasets are degraded from the high-resolution (HR) datasets in a fixed manner. In this paper, we propose a model based on the Generative Adversarial Network (GAN) and edge enhancement to perform super-resolution (SR) reconstruction for LR and blurry videos, such as closed-circuit television (CCTV) footage. The adversarial loss allows the discriminator to be trained to distinguish SR frames from ground-truth (GT) frames, which helps to produce realistic and highly detailed results. The edge enhancement function uses a Laplacian edge module to enhance the edges of the intermediate result, which further improves the final results. In addition, we add a perceptual loss to the loss function to obtain a better visual experience. We also trained the network on different datasets. Extensive experiments show that our method has advantages on the Vid4 dataset and other LR videos.


Introduction
Super-resolution (SR) aims to reconstruct high-resolution (HR) images or videos from their low-resolution (LR) versions, which is a classic problem in computer vision. It not only pursues enlargement of the physical size but also recovers high-frequency details to ensure clarity. Classical algorithms have existed for decades and can be divided into the following categories: methods based on patches [1], edges [2], sparse coding [3], prediction [4], and statistics [5]. These methods have a lower computational cost than deep learning methods, but their recovery performance is also very limited. With the popularity of deep learning, convolutional neural networks have been widely applied and have led to a dramatic leap in SR.
This field can be divided into two parts: single image super-resolution (SISR) and video super-resolution (VSR). The former exploits the spatial correlation within a single frame, while the latter additionally uses inter-frame temporal correlation. Digital video processing technology covers many fields, such as passive video forgery detection [6][7][8]. In this article, we focus on videos with low resolution and blurry quality. To obtain HR data, the most direct way is to use HR cameras. However, due to production process and engineering cost considerations, high-resolution cameras are not used for shooting in many cases, such as CCTV. Urban CCTV is helpful for security. However, in order to ensure the long-term stable operation of the recording equipment and an appropriate frame rate for dynamic scenes, such products often sacrifice resolution to some extent: one gigabyte can store less than half an hour of 1080p video, and a system that can only record for a short time loses the point of monitoring. We can instead improve the quality of CCTV footage through SR to obtain more useful information. In addition, video SR is also used in the HR reconstruction of old movies and TV shows, such as Farewell My Concubine. Similar applications exist in remote sensing and medical imaging. Moreover, SR also helps to improve the performance of other computer vision tasks, such as semantic segmentation [9]. Therefore, obtaining HR data through super-resolution (SR) technology has many practical applications and demands.
On the one hand, choosing a proper VSR algorithm is crucial. VSR was once divided into a large number of single multi-frame SR subtasks [10,11], which resulted in inevitable flickering artifacts and expensive computation. In our work, as with mainstream algorithms, we use the previously reconstructed high-resolution results to super-resolve the subsequent frames. Since the above methods ignore human perception, some SR reconstruction results are still unsatisfactory. Therefore, the Generative Adversarial Network (GAN) was introduced into the SR field. A GAN, which contains a generator (G) and a discriminator (D), is a popular deep learning-based model: G and D compete with each other during training so that the generated data become as similar to the real data as possible. Goodfellow et al. [12] proposed the GAN in 2014. Since then, GANs have been applied to various computer vision problems, including SR. For example, the GAN for image SR (SRGAN) [13] uses an adversarial loss and a perceptual loss to recover photo-realistic textures from LR images. This type of network excels at reconstructing high-frequency details and can restore more realistic textures. However, it also has limitations: a GAN will introduce noise and misplace some details. Later, the combination of SR with other image enhancement methods [14,15] attracted attention. SR belongs to the broad field of image enhancement, and both aim to improve human perception. When SR increases the physical size, it inevitably causes artifacts such as blur, which can be alleviated by combining it with other enhancement methods. Adding an edge enhancement module after the initial SR greatly helps to improve image quality.
On the other hand, deep learning-based methods are data-driven. Specifically, training requires a large amount of paired LR-HR data, which determines the reconstruction ability of the network to a certain extent. Generally, LR frames are degraded from a continuous set of HR frames by linear down-sampling (for example, bicubic degradation) or by additionally adding noise, formalized as (1) or (2):

y = (x ⊗ k) ↓s, (1)
y = (x ⊗ k) ↓s + n, (2)

where ⊗ represents the convolution operation, k represents the blur kernel, ↓s represents the down-sampling operation with scale factor s, and n represents additive noise [16,17]. The network is then used to learn the mapping between the low-resolution image y and the high-resolution image x. However, the degradation process is more complicated or even unknown in the real world. Recently, many studies have addressed this issue [18][19][20][21][22]. In addition, the dataset may not match the actual LR scene; for example, the dataset may contain landscapes while characters need to be reconstructed. In this article, we train the network on different training datasets and test it on different testing datasets.
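To make the degradation concrete, the following minimal NumPy/SciPy sketch synthesizes an LR frame following (2); the blur width sigma and the noise level noise_std are illustrative assumptions rather than values taken from this paper:

import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, scale=4, sigma=1.5, noise_std=0.0):
    # Blur with a Gaussian kernel k, down-sample by `scale` (the ↓s
    # operation), and optionally add Gaussian noise n, as in (1)/(2).
    blurred = gaussian_filter(hr.astype(np.float32), sigma=(sigma, sigma, 0))
    lr = blurred[::scale, ::scale]  # naive s-fold decimation
    if noise_std > 0:
        lr = lr + np.random.normal(0.0, noise_std, lr.shape)
    return np.clip(lr, 0, 255).astype(np.uint8)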
Our main contributions in this paper can be summarized as follows:
1. We propose an end-to-end GAN-based network for VSR, which focuses on videos with low resolution and blurry quality.
2. The Laplacian edge module, which can enhance edges while suppressing noise, is added to the generator after SR to meet the needs of human perception.
3. We trained and tested our method on different datasets. Extensive experiments demonstrate the superiority of our method.

Related Works
While SR is a classical task, our review in this section focuses on deep learning-based methods for SISR and VSR.

Single Image Super-Resolution (SISR)
Given that Y is the low-resolution image, F(Y) is the reconstructed image, and X is the corresponding ground-truth HR image, the goal of SISR is to make F(Y) and X as similar as possible. Dong et al. [23] proposed a deep convolutional network for image SR (SRCNN), which introduced convolutional neural networks into the SR field for the first time. Subsequently, to accelerate inference, the same team proposed the fast SR convolutional neural network (FSRCNN) [24], which has a compact hourglass-shaped structure. Shi et al. [25] proposed a novel sub-pixel convolutional layer to replace the deconvolution layer, which significantly reduces the training complexity. The above approaches are based on linear networks, and their structures are relatively simple. However, as network depth increased, over-parameterization appeared. To address this, recursive networks [26,27] performed well by reusing weights. On the one hand, a deeper network performs better; on the other hand, deeper networks are also more prone to exploding gradients. To resolve this contradiction, Kim et al. [28] proposed learning only the residuals, since the low-frequency information carried by the LR image is similar to that of the HR image. A very deep residual channel attention network (RCAN) [29] was proposed for high-precision image SR. Owing to the sparsity of residual images, the convergence speed is accelerated. Afterwards, many frameworks based on residual learning were proposed [30,31].
With the development of deep neural networks, excellent architectures are constantly being introduced into this field. The methods of [32,33] are based on the densely connected convolutional network (DenseNet) [34]; they make full use of low-level features by introducing dense skip connections. GANs were also adapted for SISR in SRGAN [13]. These methods propose a perceptual loss function in order to recover photo-realistic textures from LR images; perceptual satisfaction is their main target.
Recently, more categories of SR have appeared, such as blind SR [20][21][22] and unsupervised SR [35]. Moreover, the development of SISR has tended toward practical use. Google announced the Super Res Zoom technology [36], which focuses on solving the problem that images taken by handheld devices are not sharp enough. Dong et al. proposed [37,38], which combine SR with mersisters. Qian et al. proposed the Trinity Enhancement Network (TENet) [15], which can solve multiple problems at the same time. Deng proposed an algorithm named SR by Neural Texture Transfer (SRNTT) [39], which performs SR in a reference-based way. Recently, a large number of SR methods for specific objects have emerged, such as hyperspectral SISR [40], face SR [41], and so on.

Video Super-Resolution (VSR)
In addition to information in a single frame, VSR has inter-frame temporal correlation. Therefore, both accuracy and consistency need to be considered at the same time. For this purpose, VSR usually has two unavoidable steps: motion compensation and SR restoration.
At the very beginning, VSR was divided into a large number of independent multi-frame SR subtasks [10,11]. These focused on obtaining high-quality reconstruction results for each single frame, but the individually generated high-resolution frames lack temporal coherence, resulting in unpleasant flickering artifacts. These methods did not make full use of temporal information.
Afterwards, adding optical flow networks to VSR for motion estimation became popular. Taking the efficient sub-pixel convolutional neural network (ESPCN) [25] as a reference, Caballero et al. [42] proposed video ESPCN (VESPCN), which consists of spatio-temporal sub-pixel convolution networks and optical flow networks. Specifically, VESPCN learns motion compensation with the former and improves accuracy in real time with the latter. Sajjadi et al. [43] proposed frame-recurrent video super-resolution (FRVSR), which repeatedly uses previously estimated SR frames to recover subsequent frames. In addition to reusing reconstructed HR frames, frame and feature-context video super-resolution (FFCVSR) [44] was proposed to repeatedly exploit the features of the previous frame. Likewise, Wang et al. [45] proposed learning for video super-resolution through HR optical flow estimation (SOF-VSR), which innovatively reconstructs high-resolution optical flow instead of estimating the optical flow among low-resolution frames, improving the accuracy of motion compensation. Chu et al. proposed the Temporally Coherent GAN (TecoGAN) [46], whose architecture is based on the GAN. It not only uses optical flow networks but also introduces novel loss functions to improve temporal consistency. Furthermore, thanks to its feature-space losses, the proposed approach improves perceptual quality in VSR.
The addition of an optical flow network does improve the experimental results, but it also increases the computational and memory cost. Moreover, the final performance heavily depends on the accuracy of the optical flow prediction: inaccurate optical flow causes artifacts, which propagate to the reconstructed HR video frames. Therefore, several studies have removed explicit motion compensation. Unlike previous works, video super-resolution via residual learning (EVSR) [47] estimates motion compensation between frames automatically, without explicit motion compensation modules. GANet [48] integrates motion estimation and frame recovery into one step by utilizing a self-attention network to merge local features into global features. Jo et al. [49] introduced a novel framework with dynamic upsampling filters (DUF): instead of explicitly estimating the motion compensation between LR frames, DUF implicitly utilizes the motion information to generate suitable up-sampling filters. In [50], a new method to ensure temporal consistency is proposed: instead of using optical flow, it uses deformable convolution to track traceable points via a pyramid, cascading and deformable (PCD) module. Tian et al. [51] proposed a temporally deformable alignment network (TDAN), which aligns adaptively at the feature level.

Methods
In this paper, we aim to learn the non-linear mapping between the input LR frames and the final HR frames. Our framework is based on the GAN, and our main work is to improve the generator. As illustrated in Figure 1, the generator mainly consists of two parts: one produces the intermediate SR results [46], and the other performs edge enhancement [14], which makes the final results clearer. LR videos are usually blurry and accompanied by noise; in addition, the GAN inevitably introduces noise. Therefore, enhancing edges while suppressing noise greatly improves perceptual quality. Instead of judging the realism of spatial detail only, the discriminator judges temporal changes as well. Moreover, in order to obtain good objective indicators while preserving perceptual quality, we add a pre-trained Visual Geometry Group (VGG) network to compare the difference between the final results and the GT on several specific feature layers.
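As a hedged illustration of this VGG-based term, the sketch below compares SR and GT activations on selected feature layers; feature_extractor and layer_weights are hypothetical placeholders for the pre-trained VGG sub-network and the per-layer weights, which this paper does not specify:

import tensorflow as tf

def perceptual_loss(sr, gt, feature_extractor, layer_weights):
    # feature_extractor: maps an image batch to a list of VGG feature maps
    # (construction from pre-trained weights omitted; hypothetical helper).
    sr_feats = feature_extractor(sr)
    gt_feats = feature_extractor(gt)
    loss = 0.0
    for w, f_sr, f_gt in zip(layer_weights, sr_feats, gt_feats):
        loss += w * tf.reduce_mean(tf.square(f_sr - f_gt))
    return loss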

TecoGAN
The network is based on the GAN. The generator G is divided into two parts. The first part is the optical flow network F, which estimates the motion compensation v_t between two adjacent low-resolution input frames; v_t is then up-sampled four times to obtain V_t. Afterwards, the previous SR frame X_(t-1)^SR is warped with V_t and fed, together with the current LR frame, into the second part, which produces the current SR frame. These steps can be summarized as:

X_t^SR = G(X_t^LR, W(X_(t-1)^SR, V_t)),

where W denotes the warping operation. The design of the adversarial network and the loss function is the main innovation. The adversarial network, called a spatio-temporal discriminator, discriminates not only spatial details but also temporal information. It receives two sets of inputs, consisting of the generated results and the GT; each set includes temporal information in addition to spatial details. In this way, the discriminator can automatically balance spatial and temporal information to avoid inconsistent sharpness or overly smooth results. TecoGAN also has a novel loss function named the ping-pong loss. The optical flow network receives two low-resolution groups: the first group contains n consecutive frames, and the second group is the reverse sequence of the first. It is therefore possible to obtain both a forward result X_t^(SR,f) and a backward result X_t^(SR,b) for each frame; theoretically, the two are identical. Therefore, the ping-pong loss is:

L_pp = Σ_(t=1)^(n-1) ||X_t^(SR,f) − X_t^(SR,b)||_2.
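A minimal sketch of this ping-pong term, assuming the backward SR sequence has already been re-reversed so that index t matches between the two lists (this data layout is our assumption):

import tensorflow as tf

def ping_pong_loss(forward_frames, backward_frames):
    # forward_frames / backward_frames: lists of [B, H, W, C] SR tensors for
    # the forward and reversed passes; the mean squared difference stands in
    # for the L2 norm used above.
    loss = 0.0
    for f, b in zip(forward_frames, backward_frames):
        loss += tf.reduce_mean(tf.square(f - b))
    return loss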

EEGAN
The network is a GAN for SISR. The main innovation of this method lies in its generator, which divides the output into an intermediate result I_base and a final result I*_edge. The intermediate result I_base is generated by a topologically shaped network, whose dense block D is regarded as the basic module for feature extraction and fusion. Unlike traditional dense blocks, these blocks can share and fuse feature maps extracted from multiple previous convolutional layers in both horizontal and vertical directions. Therefore, the number of link nodes is approximately twice that of the original dense block, thereby achieving a richer variety of fine feature expressions.
The final result is the edge enhancement of the intermediate result.
Taking into account that edge enhancement also amplifies noise, a mask branch is used to learn an image mask that detects and removes isolated noise, i.e., the false edges generated during edge extraction. Subsequently, the enhanced edge map is projected into the HR space through a sub-pixel convolution operation. According to [14], the edge enhancement can be written as follows:

I*_edge = PS(M(F(E)) ⊙ F(E)),

where:
1. E denotes the edge map extracted from the intermediate result I_base;
2. ⊙ denotes element-wise multiplication;
3. F(·) denotes the dense block above, used for feature extraction and fusion;
4. M(·) represents the mask branch, which removes the false edges caused by noise;
5. PS(·) denotes the sub-pixel convolution, which up-samples the edge maps into HR space.
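The expression above can be read as the following sketch, where dense_block and mask_branch stand for the sub-networks of [14] (their internals are omitted), and the channel count of the masked features is assumed to be scale² times the output channels so that sub-pixel up-sampling applies:

import tensorflow as tf

def enhance_edges(edge_map, dense_block, mask_branch, scale=4):
    feats = dense_block(edge_map)                # F(E): feature extraction/fusion
    mask = mask_branch(feats)                    # M(.): suppresses false edges
    masked = mask * feats                        # element-wise gating
    return tf.nn.depth_to_space(masked, scale)   # PS(.): sub-pixel up-sampling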

Our Method
Referring to the generator of TecoGAN, we construct the intermediate SR result X_(t,base)^SR, which corresponds to the final output of the generator in TecoGAN. As we all know, the picture quality of low-resolution videos is always blurry. In view of this characteristic, we perform edge enhancement on X_(t,base)^SR, which significantly sharpens the edges of subtitles and the outlines of objects, thereby improving the overall picture quality. For the subsequent edge enhancement part, we refer to the edge-enhanced GAN (EEGAN) [14]. First, the edge map X_(t,edge)^SR of X_(t,base)^SR is extracted with the Laplacian operator:

X_(t,edge)^SR = X_(t,base)^SR ⊗ k_L,

where ⊗ is the convolution operation and k_L is the Laplacian kernel. However, low-resolution videos, such as CCTV footage, are accompanied by inevitable noise due to the limitations of shooting and production technology; therefore, the edge map obtained at this stage contains some false edges caused by noise, and the GAN inevitably introduces additional noise. In order to extract purer and more effective edges, we follow EEGAN to refine and strengthen X_(t,edge)^SR. The specific structure is shown in Figure 2.
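For illustration, a channel-wise Laplacian extraction might look as follows; the 3x3 kernel is one common variant, chosen here as an assumption since the paper does not state which discrete Laplacian is used:

import numpy as np
from scipy.ndimage import convolve

K_LAPLACIAN = np.array([[0,  1, 0],
                        [1, -4, 1],
                        [0,  1, 0]], dtype=np.float32)

def laplacian_edges(frame):
    # X_edge = X_base ⊗ k_L, applied independently to each color channel.
    return np.stack([convolve(frame[..., c].astype(np.float32), K_LAPLACIAN)
                     for c in range(frame.shape[-1])], axis=-1)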
X_(t,edge)^SR is first converted to low-resolution space in order to reduce the computational cost. After a few convolutional layers, the dense block from EEGAN [14] is used for feature extraction to obtain more refined edges. Meanwhile, we learn a noise mask through a mask branch to eliminate noise and artifacts, obtaining the refined and enhanced edge X_(t,edge*)^SR.

Figure 2. Our edge enhancement module and partial results. (a) The edge enhancement module, illustrated on an example image; the dense block and noise mask follow the design in edge-enhanced GAN (EEGAN) [14]. (b) Ground truth (left) and a partial enlarged view of the result (right).

In order to ensure the temporal continuity of the reconstructed video, we add the ping-pong loss from TecoGAN [46] to our framework. We input two groups of consecutive video frames, where the second group is the reverse sequence of the first, and obtain forward and backward results. The content loss is computed over both generator outputs:

L_content = ||X_(t,base)^SR − X_t^HR||_2 + α ||X_(t,final)^SR − X_t^HR||_2,

where X_t^HR is the GT and α is the weight. Specifically, α changes according to a fixed rule during training: as the training steps increase, the model becomes more and more accurate, while the difference between the intermediate result and the final result grows. Based on this, α is set to 10 at the beginning and is increased with the training step. See the experimental section for the specific parameters.
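A short sketch of this two-output content term, under our reading of the text (mean squared error as the distance):

import tensorflow as tf

def content_loss(sr_base, sr_final, gt, alpha):
    # alpha starts at 10 and grows during training (see Section 4).
    base_term = tf.reduce_mean(tf.square(sr_base - gt))
    final_term = tf.reduce_mean(tf.square(sr_final - gt))
    return base_term + alpha * final_term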
Apart from the losses above, we retain the other loss functions of TecoGAN.
In addition, we train the model in two steps: we first train a simplified network and then train the complete network on that basis. In the intermediate model, we only train the generator, and the loss function is simplified: only the content loss is retained, with its weight unchanged. This step is equivalent to parameter initialization for the subsequent model. Since the full framework and loss functions are more complicated, training the complete network directly makes it difficult to find accurate network parameters, or it takes a long time. A pre-trained simplified network helps to find the approximate range of the final parameters. We then initialize the complete network with the parameters of the simplified network and fine-tune the framework. In this step, α increases with the training steps, while the learning rate decays with the training steps.

Experiments
In this section, we first give the training details. Secondly, we describe a comparative study. Then, the evaluation metrics are illustrated. Finally, we provide a qualitative analysis and quantitative evaluation of the experimental results.

Training Details
We perform the experiments using Python 3.6 and TensorFlow-GPU 1.10.0 in PyCharm 2019.1.3 (Community Edition). The computer used for the experiments has a 3.6 GHz CPU and an NVIDIA GeForce GTX 1080Ti GPU. See Table 1 for more details. The dataset used for training was downloaded from Vimeo; we obtained the video download links from TecoGAN [46]. The Vimeo Terms of Service are followed, and all used videos are available on Vimeo with the download option. Specifically, we downloaded 25 high-resolution videos. In order to learn fine motion compensation, we selected 276 scenes, each of which contains 120 frames without shot switching. The resolution of each scene is not uniformly specified, but the width or height must be larger than 400. Imitating the low resolution, fuzziness, and noise of the target videos, we apply a Gaussian blur kernel followed by four-times down-sampling. See Table 2 for more details. The initial α is fixed at 10. Moreover, we use the decay function provided in TensorFlow to dynamically adjust the learning rate and α:

decayed = initial × decay_rate^(global_step / decay_step).

For the learning rate, the decay_rate is 0.9 and the decay_step is 28 K. For α, the decay_rate is 1.1 and the decay_step is 50 K. The intermediate model is trained for 600 K iterations, while the final model is trained for 1200 K. We use Adam with a momentum of 0.9 for optimization, with the weight decay following the same schedule as the learning rate. We also recorded the performance of the model in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as the iterations progressed when training the final model. See Table 3 and Figure 3 for more details. Table 4 shows the details of Vid4. During the test, we removed the first and last two frames. Specifically, the final data are the average over 155 frames, including 37 frames of calendar, 30 frames of city, 45 frames of foliage, and 43 frames of walk.
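The schedules can be reproduced with TensorFlow 1.x's exponential_decay; the initial learning rate below is an assumed placeholder, since the text does not state it:

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# Learning rate decays (rate 0.9 every 28 K steps); alpha grows (rate 1.1 every 50 K).
learning_rate = tf.train.exponential_decay(
    learning_rate=1e-4,  # assumed initial value, not given in the text
    global_step=global_step, decay_steps=28000, decay_rate=0.9)
alpha = tf.train.exponential_decay(
    learning_rate=10.0,  # initial alpha = 10; a "decay" rate > 1 makes it grow
    global_step=global_step, decay_steps=50000, decay_rate=1.1)
optimizer = tf.train.AdamOptimizer(learning_rate, beta1=0.9)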

Comparative Study
We tried different decay methods, different loss functions, different datasets, and different edge enhancement modules, and compared the final results.
For the decay methods, we compared two schemes: both start from 10, but one decreases exponentially at a rate of 0.9 while the other increases exponentially at a rate of 1.1, with all other factors kept the same. The experimental results in Figures 4 and 5 show that the incremental approach is better. The testing samples are the same as above, comprising 155 frames. As the number of iterations increases, the model becomes more and more accurate, and the gap between the intermediate result and the final result grows, so an increasing α fits the training better.

For the loss function, we tried either computing every loss term proportionally on the two generator outputs X_(t,base)^SR and X_(t,final)^SR (loss function A) or doing so for the content loss only (loss function B). The latter performs better. In Figure 6, we can see that loss function A causes a more obvious mosaic phenomenon. Applying the content loss to both generator outputs with different weights helps to lock the final result into a more accurate range at the beginning and to keep it in a reasonable range later; the other loss functions only need to use the final result X_(t,final)^SR.

For the datasets, we tried down-sampling either high-resolution videos downloaded randomly from Vimeo or high-resolution restored versions of film and television dramas from around 2000. Models trained on different datasets perform differently in different scenes: the former performed better on Vid4, while the latter performed better on the film and television scenes, as shown in Figure 7. The experiment shows that models trained on different training datasets adapt to different scenarios.
For the edge enhancement modules, we compared simple edge enhancement with the design described in Section 3.3, and the experiments confirmed our theory. As shown in Figure 8, when edge enhancement considers both denoising and strengthening, the result is better; simple edge enhancement leads to noise and blurred edges.

Evaluation
Following the mainstream practice in the SR field, we calculate the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on the Y channel of the YCbCr space, where Y refers to the luminance component, Cb to the blue chrominance component, and Cr to the red chrominance component. PSNR is an objective standard for evaluating images:

PSNR = 10 · log10(MAX² / MSE),

where MSE is the mean square error between the original image and the SR frame, and MAX is the maximum value of the image color; for example, an 8-bit sampling point gives MAX = 255.
SSIM is an index measuring the similarity of two images:

SSIM(x, y) = ((2 μ_x μ_y + c1)(2 σ_xy + c2)) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2)),

where x is the SR frame, y is the GT, μ_x and μ_y are the mean values, σ_x and σ_y are the standard deviations, σ_xy is the covariance of x and y, and c1 and c2 are small constants that stabilize the division. We use the built-in compare_ssim function of the skimage module for the calculation. SSIM is a number between 0 and 1; the larger it is, the smaller the gap between the result frame and the GT, i.e., the better the image quality. When the two images are exactly the same, SSIM is 1.
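A minimal evaluation sketch on the Y channel, using the compare_ssim function mentioned above (compare_psnr from the same module is our addition for symmetry; BT.601 luminance coefficients are assumed):

import numpy as np
from skimage.measure import compare_psnr, compare_ssim

def rgb_to_y(img):
    # Luminance (Y) of YCbCr, ITU-R BT.601 coefficients, 8-bit range.
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def evaluate(sr, gt):
    y_sr = rgb_to_y(sr.astype(np.float64))
    y_gt = rgb_to_y(gt.astype(np.float64))
    return (compare_psnr(y_gt, y_sr, data_range=255),
            compare_ssim(y_gt, y_sr, data_range=255))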
We compare the proposed method on the Vid4 dataset with other SR algorithms: video super-resolution with convolutional neural network (VSRNet) [52], VESPCN [42], SOF-VSR [45], FRVSR [43], and TecoGAN [46]. Table 4 shows the details of Vid4. During the test, we removed the first and last two frames. Table 5 shows that our network achieves the best average PSNR and SSIM on the Vid4 dataset. Figures 9 and 10 also show the superiority of our method in qualitative results. Compared with TecoGAN [46], the results of our method are closer to the GT; the results of TecoGAN contain more noise, and distortion is more obvious in some details. (In Table 5, VESPCN denotes video ESPCN [42], VSRNet denotes video super-resolution with convolutional neural network [52], SOF-VSR denotes learning for video super-resolution through HR optical flow estimation [45], FRVSR denotes frame-recurrent video super-resolution [43], and TecoGAN denotes the temporally coherent generative adversarial network [46]. The data of VSRNet, VESPCN, and FRVSR are quoted directly from their papers.)
We also tested our method on other low-resolution scenes from daily life. Table 6 shows the details of the data. During the test, we removed the first and last two frames. Table 7 shows that our network achieves the best average PSNR and SSIM on film and television scenes. Figure 11 also shows the superiority of our method in qualitative results.

Conclusion
In this article, we proposed an end-to-end SR method for LR video, which can be used to improve the image quality of urban CCTV. Extensive experiments have shown that our method can improve the resolution of videos while meeting perceptual needs. These LR videos are usually blurry and inevitably accompanied by noise; the edge enhancement module we added successfully enhances the edges without amplifying the noise. At the same time, we conducted many comparative experiments, which show that models trained on different training datasets perform significantly differently in different scenarios. We have demonstrated that our method is superior to other methods on different test datasets.
In the future, we will consider optimization of the training dataset. In this article, we down-sampled HR frames to obtain the dataset. The down-sampling process simulates the degradation of LR data as closely as possible, but the same effect cannot be guaranteed; therefore, the trained model is only best suited to LR scenarios that meet the specific degradation conditions. Based on this, we will try to eliminate the process of manually down-sampling HR frames to obtain LR frames. Specifically, we will directly use consecutive frames of an original video as inputs and the corresponding consecutive frames of its HR restored version as targets.

Conflicts of Interest:
The authors declare no conflict of interest.