1. Introduction
The last two decades have witnessed outstanding achievements in science and technology. Expensive devices and spectacular media content once enjoyed only by a minority are now available to almost every individual. In particular, the camera has become part of everyday life, embedded in portable devices such as smartphones, tablets, and laptops. Furthermore, the multimedia content generated by cameras is digitized, stored, shared, and transmitted rapidly and globally. Today, the number of viewers watching over-the-top (OTT) media services such as Netflix, Roku, Disney+, YouTube, and Amazon Prime Video is rapidly increasing alongside the ongoing development of video streaming services.
Meanwhile, multimedia consumers prefer to watch higher-resolution videos because of their more vivid and realistic effects. The typical video resolution on streaming services is Full HD (1920 × 1080) or 4K (3840 × 2160), and it has recently reached 8K (7680 × 4320). The progress in video resolution is significant, and sooner or later the current maximum resolution could be superseded by an even higher-resolution format. However, high-resolution video requires a large amount of storage and network bandwidth, as shown in Table 1, and a longer transfer time to clients over a network, inevitably resulting in lower-resolution video being delivered to viewers depending on network conditions. To overcome the storage and bandwidth problems, streaming service providers have been developing various methods, such as Per-Title Encoding, Adaptive Bitrate Streaming (ABR), Decentralized Content Delivery, Dynamic Optimizer, and Content Delivery Networks (CDN).
In addition, the abundant pre-existing low-resolution video content also needs to be delivered to OTT consumers. However, the quality of old low-resolution video is not satisfactory when played on Ultra-High-Definition (UHD) displays at home. Accordingly, if low-resolution video received on higher-resolution display devices, including smartphones, can be carefully converted to high resolution, consumers can enjoy the benefits of high-speed video streaming as well as better-quality video than before.
The aforementioned network bandwidth, storage, and video quality issues can be addressed with Video Super-Resolution (VSR) technology, which reconstructs high-resolution video from lower-resolution video by exploiting various features in one or more sequential frames [2]. VSR started from conventional computer vision techniques, including static interpolation methods such as nearest-neighbor, bilinear, and bicubic filtering, and has shown significant progress by adopting Convolutional Neural Networks (CNNs). A CNN is a type of deep neural network particularly well suited to image processing tasks. The key idea behind a CNN is its convolutional layers, which scan the input image with a small filter (also called a kernel or weights) and apply the same transformation at each location. By stacking multiple convolutional layers, a CNN can learn increasingly complex features of the image, such as edges, textures, and patterns. Many researchers have demonstrated that CNN-based Super-Resolution methods produce clearer, higher-resolution output than traditional interpolation techniques [3,4,5]. The latest VSR research has demonstrated meaningful advances in both the quality of super-resolved video and conversion speed.
The four principal streams of the related research are: (1) Recurrent Frame-based VSR Networks (FRVSR, RBPN, RRN) [6,7,8]; (2) Spatio-Temporal VSR Networks (SOF-VSR, STVSR, TDAN, TOFlow, TDVSR-L) [9,10,11,12,13]; (3) Generative Adversarial Network (GAN)-based SR Networks [14,15,16,17]; and (4) Video Compression-informed VSR Networks (FAST, COMISR, CDVSR, CIAF) [18,19,20,21].
This paper proposes a method that utilizes information from the compressed video to achieve a lightweight VSR model applicable to video streaming services without seriously degrading the quality of the super-resolved video, namely Compression-informed Lightweight VSR (CILVSR), shown briefly in Figure 1. This study aims to utilize additional information acquired during video decoding to improve the performance of the VSR model in terms of speed and model size. The information covers slice type (nearly equivalent to frame type in this context), macroblock type, group of pictures, and motion vectors.
Table 2 presents the slice types of the H.264 video codec standard [22]. The slice type of an Intra-frame is either I (Intra) or SI (Switching I), and the slice type of an Inter-frame is P (Predictive), SP (Switching P), or B (Bi-directional Predictive).
The period for inserting Intra-frames, i.e., the group of pictures (GOP) or the gap between Intra-frames shown in Figure 2, is determined by the encoding configuration. A small GOP size implies a sufficiently high target bitrate; this configuration is chosen when frequent scene changes occur in a video or when the content requires more detail and less block noise after decoding. In the case of H.264, the start frame is designated as an Instantaneous Decoder Refresh (IDR) frame of Intra-frame type, and once a frame is encoded with this type, all decoder state, such as the reference picture buffer, reference frame number, and picture order count, is reset [22].
The proposed VSR model is composed of two main networks: an Intra-frame-based network and an Inter-frame-based network. The Intra-frame-based network utilizes the periodic Intra-frames in compressed video and is trained without any dependency on consecutive frames. Because Intra-frames carry a significant amount of information in compressed video, much like a single still image, this network can be treated as a single-image SR network, which is beneficial for implementing a lightweight VSR model. Meanwhile, the Inter-frame-based network presented in this study uses two consecutive frames for training to exploit the temporal relation between Inter-frames. In this Inter-frame-based training, the motion compensation process exploits the motion vectors, the macroblock types, and the completely decoded previous frame as a reference frame. Furthermore, the integration of the two models yields a simple and adaptable model that utilizes the intact information from the original compressed video.
The contributions of this research are as follows:
The VSR model in this paper consumes fewer computational resources for inference, without significantly damaging video quality, by adopting a smaller number of reference frames than other Spatio-Temporal VSR models;
To extend the availability of the VSR model under poor network conditions, the proposed model is separable by frame type;
The proposed VSR model is appropriate for real-time video streaming services because it reuses information already available from the video decoder.
3. Methodology
The proposed method, CILVSR, consists of two main parts to obtain super-resolved frames from a low-resolution video: Intra-frame upsampling and Inter-frame upsampling.
Firstly, after decoding an encoded LR video, the GOP and frame type information are acquired. Based on this information, frames are classified in advance as Intra-frames or Inter-frames. The super-resolved results of all frames in a GOP, $\hat{S}_{GOP}$ in Equation (1), are the union of the estimated high-resolution Intra-frame, $\hat{I}^{HR}_{Intra}$, and the estimated high-resolution Inter-frames, $\hat{I}^{HR}_{Inter,t}$:

$$\hat{S}_{GOP} = \big\{\hat{I}^{HR}_{Intra}\big\} \cup \big\{\hat{I}^{HR}_{Inter,t} \,\big|\, t = 1, \ldots, N-1\big\}, \tag{1}$$

where $N$ is the number of frames in a GOP.
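To make the frame routing concrete, the following is a minimal sketch of how a decoded GOP could be dispatched to the two networks. The helper names `intra_net`, `inter_net`, and `mv_loader` are hypothetical placeholders for the modules described in the rest of this section, not the paper's actual implementation:

```python
def super_resolve_gop(frames, frame_types, intra_net, inter_net, mv_loader):
    """Route each decoded LR frame of a GOP to the Intra- or Inter-frame path.

    frames: list of decoded LR frames; frame_types: decoder-reported types
    ('I' for Intra, 'P'/'B' for Inter). All helper callables are hypothetical.
    """
    outputs = []
    reference = None
    for frame, ftype in zip(frames, frame_types):
        if ftype == 'I':
            # Intra path: single-image SR (SRCNN + Laplacian enhancement).
            sr = intra_net(frame)
        else:
            # Inter path: decoder motion vectors + previous frame as reference.
            mv = mv_loader(frame)
            sr = inter_net(frame, reference, mv)
        reference = frame  # single previous-frame referencing only
        outputs.append(sr)
    return outputs
```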
The upsampling of the low-resolution Intra-frame, $I^{LR}_{Intra}$, can be considered a single-image super-resolution (SISR) task. In general, SISR training and inference require fewer computing resources than a VSR model that uses multiple frames. Extracting the Intra-frame from the encoded video and handling it with a separate SISR module reduces the overall burden of VSR model training. In this paper, SRCNN [3] with Laplacian enhancement is used to obtain super-resolved Intra-frames, $\hat{I}^{HR}_{Intra}$, from compressed low-resolution video. Laplacian enhancement is adopted to restore high-frequency details that are attenuated during video encoding [36]. A Laplacian-enhanced frame is produced using a Gaussian blur kernel $G(\cdot,\cdot)$:

$$I_{Lap} = \tilde{I}^{HR} + \big(\tilde{I}^{HR} - G(\tilde{I}^{HR}, \sigma)\big), \tag{2}$$

where $\sigma$ is the width of the Gaussian kernel and $\tilde{I}^{HR}$ is an intermediate HR frame.
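As a minimal sketch of Equation (2), assuming the unsharp-masking form above, the enhancement can be computed with an off-the-shelf Gaussian filter; the sigma value here is an illustrative assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def laplacian_enhance(frame: np.ndarray, sigma: float = 1.5) -> np.ndarray:
    """Add the high-frequency residual (frame minus its Gaussian blur, i.e. an
    approximate Laplacian band) back onto an intermediate HR frame."""
    x = frame.astype(np.float32)
    blurred = gaussian_filter(x, sigma=sigma)   # G(x, sigma) in Equation (2)
    return np.clip(x + (x - blurred), 0.0, 255.0)
```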
Meanwhile, in video compression, the relation between Inter-frames is described through predicted frames (P frames) or bi-directionally predicted frames (B frames). P frames refer to previous frames for compression/decompression and contain two types of macroblock: intra macroblocks (I_* in Table 3) and predicted macroblocks (P_* in Table 3). Table 3 shows the macroblock types for P frames in H.264. A macroblock is the unit of pixels for video compression, and its basic size in the H.264 standard is 16 × 16 pixels. Each macroblock can be partitioned into sub-blocks such as 16 × 8, 8 × 16, 8 × 8, or 4 × 4 to achieve better compression, as shown in Figure 3.
The first step of Inter-frame encoding is to find the motion vectors of similar pixel blocks between frames; the second step is the bitwise compression of the residuals and the estimated motion vectors, which are encoded with mb_type values from 0 to 4 in Table 3. Through this Inter-frame compression process, temporal redundancy between the frames in a GOP is eliminated: when the same macroblock exists in consecutive frames, it can be compressed and represented after decoding by a reference macroblock and motion vectors, instead of encoding every macroblock in each frame.
In the case of the JM (Joint Model) reference software [52] for H.264 encoding, the Unsymmetrical Multi-Hexagon Search (UMHexagonS) and Center-Biased Fractional Pel Search (CBFPS) methods are used for high-speed motion estimation. This search process is the most time-consuming module in video encoding because it requires numerous mathematical calculations. To achieve a high compression ratio for Inter-frames, finding accurate motion vectors is crucial in the video encoder.
Meanwhile, optical flow is a long-established technique for finding the motion patterns between two frames caused by object movement or lighting change [53]. To achieve high VSR performance, many researchers adopt optical flow to capture the Spatio-Temporal correlation between frames, and a substantial body of evaluation results demonstrates its effectiveness for VSR. Inspired by the SOF-VSR model [9] and VESPCN [54], which utilize an optical flow network for video upscaling, the proposed method instead uses the motion vectors of the compressed low-resolution video as one form of input for training on Inter-frames.
Figure 4 shows the overall proposed architecture. The architecture is composed of two main networks: SRCNN with Laplacian enhancement for Intra-frame upsampling (Table 4) [3], and a Spatio-Temporal ESPCN [54] with Laplacian enhancement for Inter-frame upsampling (Table 5). In Figure 4, $I^{LR}_{ref}$ and $I^{LR}_t$ are decoded low-resolution frames: $I^{LR}_{ref}$ denotes a low-resolution reference frame, used as the reference for obtaining motion vectors, and $I^{LR}_t$ is the low-resolution current frame to be super-resolved. $I^{LR}_t$ is classified as an Intra-frame or Inter-frame according to the frame type information from the video decoder. The other information required by the proposed model, such as the GOP size, macroblock type, motion vectors, and the reference frame number of each macroblock, can also be extracted from the video decoder. In general, this information is already transferred to the display module before video streaming playback begins; in other words, there is no additional burden in acquiring it for VSR model training or inference.
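For illustration, side information of this kind can be exported by common decoders. The sketch below uses PyAV with FFmpeg's `+export_mvs` flag to read frame types and motion vectors; the exact side-data API shown is an assumption about PyAV, not part of the paper:

```python
import av  # PyAV bindings to FFmpeg (assumed available)

container = av.open("input_lr.mp4")
stream = container.streams.video[0]
stream.codec_context.options = {"flags2": "+export_mvs"}  # ask decoder to export MVs

for frame in container.decode(stream):
    ftype = frame.pict_type.name                   # 'I', 'P', or 'B'
    mv_sd = frame.side_data.get("MOTION_VECTORS")  # None for Intra frames
    mvs = mv_sd.to_ndarray() if mv_sd is not None else None
    print(ftype, 0 if mvs is None else len(mvs))
```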
General video compression standards support multiple reference frames and bi-directional referencing to obtain motion vectors, but the proposed method considers only single previous-frame referencing in order to implement the most lightweight VSR model.
The motion vector loader in Figure 4 fetches the motion vector elements from the decompressed low-resolution video according to the macroblock types. Unlike optical flow, which represents every pixel in a frame, one motion vector pair (mvx, mvy) can represent anything from a 4 × 4 pixel partition to a 16 × 16 pixel partition in the case of H.264 [22], as shown in Figure 3. For example, if a 16 × 16 macroblock is partitioned into four 8 × 8 units, four motion vector pairs express the movement of that macroblock, as in the upper-right partition of Figure 3. Furthermore, if an 8 × 8 sub-macroblock partition is further divided into 4 × 4 blocks, the total number of motion vector pairs in a macroblock is 16:
$$\big\{(mv_x^{m}, mv_y^{m})\big\}_{m=1}^{M} = F_{MVL}\big(MV^{LR}, mb_{type}\big), \tag{3}$$

where $F_{MVL}$ is the motion vector loader, $MV^{LR}$ denotes the motion vector values from the low-resolution video, and $mb_{type}$ is the macroblock type of each macroblock. The maximum number of motion vector pairs, $M$, is 16 in the case of H.264, as mentioned above.
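A minimal sketch of the loader $F_{MVL}$ follows: it folds the per-partition motion vectors into a dense field at 4-pixel granularity, the finest H.264 partition size. The partition tuple layout is a hypothetical representation of the decoder output:

```python
import numpy as np

def fold_motion_vectors(partitions, height, width, unit=4):
    """Expand per-partition MVs into a dense (H/unit, W/unit, 2) motion field.

    partitions: iterable of ((x, y), (w, h), (mvx, mvy)), where (x, y) is the
    partition's top-left corner in pixels and (w, h) its size (16x16 .. 4x4).
    """
    field = np.zeros((height // unit, width // unit, 2), dtype=np.float32)
    for (x, y), (w, h), (mvx, mvy) in partitions:
        # Every `unit`-sized cell covered by this partition shares one MV pair.
        field[y // unit:(y + h) // unit, x // unit:(x + w) // unit] = (mvx, mvy)
    return field
```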
The subsequent process is motion compensation using the reference frame and the folded low-resolution motion vectors. Similar to the motion compensation module in [54], motion-compensated frames are acquired by warping LR grids with the motion vectors extracted from the decoder:

$$I^{LR}_{MC} = W\big(I^{LR}_{ref}, MV^{LR}\big), \tag{4}$$

where $W$ is a warping module that applies bilinear-interpolation-based grid sampling to the reference frame $I^{LR}_{ref}$ using the motion vectors $MV^{LR}$ of the low-resolution frame.
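A minimal PyTorch sketch of the warping module $W$ in Equation (4), using bilinear `grid_sample` over a pixel grid displaced by the motion field; the tensor layouts are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def warp(reference: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a reference frame (N, C, H, W) by a motion field (N, 2, H, W),
    where flow[:, 0] is the x-displacement and flow[:, 1] the y-displacement."""
    n, _, h, w = reference.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(reference.device)  # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                          # displaced sample positions
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(reference, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```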
The output frame $I^{LR}_{MC}$ from the motion compensation block and the current low-resolution frame $I^{LR}_t$ are fed into the multi-frame-based ESPCN of [54]. As Table 5 shows in detail, the 9L-E3-MC network in [54] is adopted; it consists of eight 3 × 3 convolutional layers, a pixel shuffle layer, and one 1 × 1 convolutional layer:

$$\hat{I}^{HR}_{Inter} = f_{ST}\big(I^{LR}_{MC}, I^{LR}_t; \theta_{ST}\big), \tag{5}$$

where $\theta_{ST}$ is the set of parameters of the Spatio-Temporal ESPCN layers.
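A sketch of a 9L-E3-MC-style network consistent with the description above (eight 3 × 3 convolutions, a pixel shuffle, one 1 × 1 convolution), assuming single-channel luma input; the channel widths are illustrative assumptions rather than the exact values in Table 5:

```python
import torch
import torch.nn as nn

class SpatioTemporalESPCN(nn.Module):
    """Motion-compensated frame + current LR frame -> super-resolved frame."""

    def __init__(self, scale: int = 4, channels: int = 64):
        super().__init__()
        body = [nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(6):  # six hidden 3x3 conv layers
            body += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        body += [nn.Conv2d(channels, scale * scale, 3, padding=1)]  # 8th 3x3 conv
        self.body = nn.Sequential(*body)
        self.shuffle = nn.PixelShuffle(scale)       # rearrange channels into HR pixels
        self.fuse = nn.Conv2d(1, 1, kernel_size=1)  # final 1x1 conv

    def forward(self, mc_frame: torch.Tensor, cur_frame: torch.Tensor) -> torch.Tensor:
        x = torch.cat((mc_frame, cur_frame), dim=1)  # (N, 2, H, W)
        return self.fuse(self.shuffle(self.body(x)))
```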
To train on Inter-frames with the multi-frame-based ESPCN, an MSE loss with Laplacian enhancement and a Huber loss are adopted, similarly to [54]. The Huber loss constrains the motion vector values during training, just as it regularizes optical flows:

$$\mathcal{L}(\theta) = \big\|Lap\big(I^{HR}_{Inter}\big) - Lap\big(\hat{I}^{HR}_{Inter}\big)\big\|_2^2 + \lambda_{MC}\,\big\|I^{LR}_t - I^{LR}_{MC}\big\|_2^2 + \lambda_{H}\,\mathcal{H}\big(\partial MV^{LR}\big), \tag{6}$$

where $\theta$ denotes the model parameters, $\lambda_{MC}$ is the coefficient for the motion compensation module, $\lambda_{H}$ is the coefficient for the Huber loss, and $I^{HR}_{Inter}$ is the ground truth of the Inter-frame. Similarly, to achieve a lightweight VSR model, the proposed model, unlike SOF-VSR [9] or VESPCN [54], utilizes the motion vectors of two decoded frames directly instead of computing optical flows from three frames.
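A sketch of Equation (6) as it might be computed in PyTorch; the coefficients, the Gaussian parameters of the Laplacian term, and the finite-difference Huber smoothness penalty are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def lap(x: torch.Tensor) -> torch.Tensor:
    # Unsharp-mask Laplacian enhancement, cf. Equation (2); sigma is assumed.
    return x + (x - gaussian_blur(x, kernel_size=5, sigma=1.5))

def inter_frame_loss(sr, gt, cur_lr, mc_lr, mv, lam_mc=0.01, lam_h=0.01):
    content = F.mse_loss(lap(sr), lap(gt))  # Laplacian-enhanced MSE term
    mc = F.mse_loss(mc_lr, cur_lr)          # motion-compensation consistency term
    # Huber penalty on spatial differences of the motion field (N, 2, H, W).
    dx = mv[..., :, 1:] - mv[..., :, :-1]
    dy = mv[..., 1:, :] - mv[..., :-1, :]
    smooth = F.huber_loss(dx, torch.zeros_like(dx)) + \
             F.huber_loss(dy, torch.zeros_like(dy))
    return content + lam_mc * mc + lam_h * smooth
```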
Meanwhile, for training the SRCNN model with Laplacian enhancement, an MSE loss is used, and the learning rate of the first two layers is $10^{-4}$. This value is 10 times larger than that of the last layer, following the suggestion of the original SRCNN model [3].
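The layer-wise learning rates can be realized with optimizer parameter groups; the 9-1-5 kernel sizes below follow the original SRCNN design and are a sketch rather than the exact configuration in Table 4:

```python
import torch.nn as nn
import torch.optim as optim

# SRCNN-style three-layer network (9-1-5 kernels, 64/32 filters) on luma input.
srcnn = nn.Sequential(
    nn.Conv2d(1, 64, 9, padding=4), nn.ReLU(inplace=True),  # patch extraction
    nn.Conv2d(64, 32, 1),           nn.ReLU(inplace=True),  # non-linear mapping
    nn.Conv2d(32, 1, 5, padding=2),                         # reconstruction
)

# First two conv layers train at 1e-4; the last layer at 1e-5 (10x smaller).
optimizer = optim.SGD([
    {"params": list(srcnn[0].parameters()) + list(srcnn[2].parameters()), "lr": 1e-4},
    {"params": srcnn[4].parameters(), "lr": 1e-5},
])
```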