Inter-Frame Based Interpolation for Top–Bottom Packed Frame of 3D Video

The frame-compatible packing of 3D content is a feasible approach to achieve compatibility with the existing monocular broadcasting system. To deliver better 3D quality, the packed 3D frames are expanded to full size at the decoder. This paper presents and evaluates an interpolation technique that enhances the quality of the enlarged half-vertical-resolution left and right stereo views in top–bottom frame-compatible packing. To this end, for each row segment the encoder estimates the most appropriate of fourteen available interpolation modes, which exploit the correlation between the left and right stereoscopic views as well as between the current and adjacent frames of each view. Based on the information received from the encoder, the interpolation scheme at the decoder can select the most appropriate available original data to find the missing values of the discarded row segments. The proposed method outperformed state-of-the-art interpolation methods in subjective visual quality and in PSNR and SSIM by about 11%, with an execution time of about 12% in the comparisons.


Introduction
Recently, with the increasing popularity of 3D content, 3DTV broadcasting services have become a reality. However, compared with conventional digital TV (DTV) broadcasting services, 3DTV needs double the transmission bandwidth for the same video resolution, since both the left and right frames of the 3D video are required to perceive 3D content [1,2]. To utilize the existing transmission infrastructure, including the H.264/AVC compression scheme, compatibility of 3D video with the conventional DTV transmission format is strongly required [3,4]. The existing broadcasting network can be used by converting the 3D video format into a single DTV frame. For example, a top-bottom packed frame obtained after horizontal line sub-sampling can be treated as a single frame of existing DTV. Other packing methods include the side-by-side and interleaved formats [5]. The existing 3D frame-compatible packing formats are shown in Figure 1.
By using these packing methods, the existing transmission infrastructure can be reused for stereoscopic 3DTV, similar to [6]. However, the reduction in spatial resolution caused by the decimation in frame-compatible packing is a critical problem when the user wants to experience the video at full quality. One solution is to apply a suitable interpolation method at the decoder to recover the original resolution. Interpolation has been a problem in image processing for many years. Conventional interpolation methods can be used to estimate the unknown pixels of the upsampled image. The simple bilinear and bicubic interpolations use linear and non-linear correlations of the known neighboring pixels, while more sophisticated iterative methods, such as [7,8], use higher-order correlations. NEDI6 [9], an updated version of NEDI4 [10], uses six original pixel values in the low-resolution image to estimate the coefficients. Computation time is the main concern for real-time applications. In [11], side information based on horizontal and vertical lines is sent to the decoder for adaptive linear interpolation of line-pruned images, which proved effective for horizontal and vertical edges. For stereo images, the relation between the left and right images can serve as an essential cue for interpolation to full resolution, which conventional methods have not considered. The pixel-based stereo matching method [12] produces the horizontal disparities between the pixels of the left image and their corresponding pixels in the right image, and similarly from the right frame to the left frame. Because a disparity value must be estimated for every pixel by searching for its matching point, the computational burden and limited accuracy of this method motivate the search for new methods.
To reduce the computation time of pixel-based stereo matching, block-based stereo matching is proposed in the patent [13]. For each matched block, each pixel of the left image (or of the right image) in the block is assigned the value of the corresponding pixel in the other image. This makes the interpolated pixels less accurate because of block mismatches. Another issue of stereo matching is the accuracy of disparity estimation in occlusion regions, where there are no matched regions between the right and left frames. For top-bottom packing, the disparity estimated for each deleted line can be exploited to determine an appropriate interpolation mode [14]. Although this method significantly improves the video quality, it does not consider inter-frame relations to improve the quality of the interpolated 3D video sequences.
In this paper, an effective interpolation technique for unpacking the top-bottom format of stereo sequences is proposed and evaluated. Specifically, the primary concern of our proposal is to use the available values of the previous and subsequent frames of the current view to preserve the quality of the interpolated stereo images. To this end, the modes of [14] are extended with modes that use the previous and subsequent frames of the current frame, exploiting the high inter-frame correlation. The best of the proposed interpolation modes is estimated at the encoder (sending side) and sent to the decoder to yield the best interpolation results. The proposed method can achieve better running time and subjective quality than the evaluated interpolation methods.
The paper is organized as follows. The frame-compatible top-bottom packing is presented in Section 2. The proposed inter-frame based interpolation is detailed in Section 3. The proposed optimal interpolation strategy is evaluated in Section 4. The conclusions and future work are given in Section 5.

Frame-Compatible Top-Bottom Packing
The proposed optimal-mode-based interpolation uses the top-bottom packing scheme for broadcasting 3D content (see Figure 1). The packing scheme has a configuration similar to that of [14], with the addition that the same scheme is applied to the current frame and the adjacent frames (previous and subsequent) of one view of the stereo sequence. Specifically, one of every two consecutive horizontal rows is discarded to halve the height of the right and left images, and a similar process is applied to the current and previous frames, as shown in Figure 1. To use the remaining lines of each image effectively, the vertically downsized right image I r is offset by one line with respect to the left image I l, and the current frame of one view is likewise offset by one line with respect to its previous frame. Then, the sub-sampled left and right views are combined into a single frame within the stereoscopic stream for compression and transmission.
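The row decimation and stacking described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the choice that the left view keeps even rows while the right view keeps odd rows is an assumption (the text only states that the two views are offset by one line).

```python
import numpy as np

def pack_top_bottom(left, right):
    """Pack two equal-size views into one top-bottom frame.

    Assumption: the left view keeps rows 0, 2, 4, ... and the right view
    keeps rows 1, 3, 5, ... (one-line offset); the parity is illustrative.
    """
    if left.shape != right.shape:
        raise ValueError("views must have the same shape")
    if left.shape[0] % 2:
        raise ValueError("frame height must be even")
    top = left[0::2]       # retained rows of the left view
    bottom = right[1::2]   # retained rows of the right view, offset by one line
    return np.vstack([top, bottom])

def unpack_top_bottom(packed):
    """Split a packed frame into sparse full-height views.

    Discarded rows are left as zeros; the boolean masks mark which rows
    of each view carry original data.
    """
    half = packed.shape[0] // 2
    height = 2 * half
    left = np.zeros((height,) + packed.shape[1:], dtype=packed.dtype)
    right = np.zeros_like(left)
    left[0::2] = packed[:half]
    right[1::2] = packed[half:]
    known_left = np.zeros(height, dtype=bool)
    known_left[0::2] = True
    return left, right, known_left, ~known_left
```

With this offset, every row position of the scene is retained in exactly one of the two views, which is what allows a discarded row of one view to be recovered from the other.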

Proposed Inter-Frame Based Interpolation
In this paper, the idea of optimally exploiting inter-frame correlations at the encoder to support the interpolation of the top-bottom packed stereo views at the decoder is implemented. The first motivation is that the pixels to be interpolated on a discarded horizontal line of one frame of the current view can be recovered from the corresponding pixels that are not discarded in the frame of the other view. This is a characteristic of well-rectified stereo images and results from the epipolar constraint. To obtain the best perception of 3D effects, the epipolar constraint states that the pixel in the right image corresponding to a considered pixel in the left image lies on the same row. The second motivation is that the inter-frame correlation is significantly high for frames of a single view. As a result, the current frame's information can exist in the adjacent frames, such as the previous or subsequent frames. Figure 2 gives an overview of the transmission system. Conventional stereo matching finds the matched pixel (or matched block) using a matching criterion such as the sum of absolute differences (SAD) or the sum of squared differences (SSD) [15,16], in order to interpolate the right frame from the available pixels of the left frame (and vice versa). In this paper, combinations of the original available data, namely the retained pixels of the corresponding left frame and the pixels of the previous and subsequent right frames, are evaluated at the encoder and exploited at the decoder, as shown in Figure 3. To this end, the half-vertical-resolution right and left views are first interpolated to full resolution with bilinear interpolation in a preprocessing step. Each deleted row is divided into S segments of length L, similar to the method in [14].
All pixels within a segment share the same interpolation mode. The best mode for each horizontal segment is found by the estimation scheme at the encoder. Note that among the 14 modes in Table 1, mode 1 is a function of the parallax-compensated left view data $(x^{l}_{i,k^{*}} \rightarrow x^{l}_{i,k^{*}+L})$, modes 2 and 3 are functions of the adjacent right frame data $(x^{r-1}_{i,l^{*}} \rightarrow x^{r-1}_{i,l^{*}+L})$ and $(x^{r+1}_{i,g^{*}} \rightarrow x^{r+1}_{i,g^{*}+L})$, and mode 4 combines the parallax-compensated left image with the motion estimation from the previous and next frames. Note that $x^{l}_{i}$ and $x^{r}_{i}$ are the full-resolution left and right frames expanded by bilinear interpolation, and $x^{r-1}$ and $x^{r+1}$ are the full-resolution bilinear expansions of the previous and subsequent right frames, respectively. The horizontal parallax $k^{*}$ used in the interpolation modes can be found as in Equation (1).
Moreover, the motion displacement $l^{*}$ between the segment of the current right frame and the previous right frame can be found as in Equation (2). Similarly, Equation (3) is used to find the motion displacement $g^{*}$ between the segment of the current right frame and the next right frame. In these equations, the search range is a set of horizontal shifts, $\Delta$ is the first pixel of the considered segment, and $m_{d}$ is the maximum horizontal disparity. Based on the original pixels and the pre-calculated candidate interpolation modes, the best interpolation mode $b^{*}_{r}(i, s)$ for each segment of a deleted row can be calculated as in Equation (4), where $x^{r}_{i,\Delta+j}$ is the pixel in the original right frame at time $t$ and $x^{r}_{i,\Delta+j}(m)$ is the pixel interpolated by mode number $m$ of the proposed interpolation modes. Note that, once estimated, the optimal mode is sent to the decoder to improve the accuracy of the interpolation.
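The encoder-side estimation described above can be illustrated with a short sketch. It is not the paper's Equations (1)-(4): the SAD criterion for the shift search, the squared-error criterion for the mode decision, the symmetric search range, and the names `best_shift` and `best_mode` are all assumptions made for illustration.

```python
import numpy as np

def best_shift(target_seg, ref_row, start, max_shift):
    """Return the horizontal shift in [-max_shift, max_shift] with minimal SAD.

    target_seg: 1-D segment of length L starting at column `start`.
    ref_row:    1-D reference row (other view, or previous/next frame).
    Assumption: SAD cost and a symmetric search range.
    """
    seg = np.asarray(target_seg, dtype=np.float64)
    ref = np.asarray(ref_row, dtype=np.float64)
    length = len(seg)
    best, best_cost = 0, np.inf
    for k in range(-max_shift, max_shift + 1):
        lo = start + k
        if lo < 0 or lo + length > len(ref):
            continue  # shifted window would leave the row
        cost = np.abs(seg - ref[lo:lo + length]).sum()
        if cost < best_cost:
            best, best_cost = k, cost
    return best

def best_mode(original_seg, candidates):
    """Pick the mode whose interpolated segment is closest to the original.

    candidates: dict mapping a mode number to its interpolated segment.
    Assumption: sum of squared errors as the distortion measure.
    """
    ref = np.asarray(original_seg, dtype=np.float64)
    errors = {m: float(((ref - np.asarray(c, dtype=np.float64)) ** 2).sum())
              for m, c in candidates.items()}
    return min(errors, key=errors.get)
```

The same `best_shift` routine would serve for the parallax against the other view and for the motion displacement against the previous or next frame; only the reference row changes. Only the winning mode index per segment needs to be signaled to the decoder.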

Experimental Results
The experiments were run in MATLAB R2016a on a machine with an Intel® Core™ i5 CPU, 4 GB of RAM, and the 64-bit Windows 10 operating system. Six stereoscopic sequences, named Book, Car, Door, Horse, Moabit, and Bullinger, were downloaded from the open-source database [17] and used to evaluate the efficiency of the proposed method. For a fair comparison, we selected the same 3D sequences as the 3D-based frame-compatible interpolation in reference [14]. Each tested sequence has 150 frames. The 3D frames of the tested sequences are down-sampled for top-bottom packing. The conventional interpolation methods, including bilinear, NEDI6 [9], the pixel-based matching method [12], the block-based matching patent [13], and the method of [14], are implemented and compared with the proposed method in terms of subjective quality and numerical results.
Note that, in the bilinear method, the interpolated pixel $p_{i}$ can be estimated by Equation (5), where $x_{l}$, $x_{r}$, $x_{t}$, $x_{b}$ are the left, right, top, and bottom undiscarded pixel values, and $c_{l}$, $c_{r}$, $c_{t}$, $c_{b}$ are the corresponding distances from the interpolated pixel. In NEDI6 [9], $I_{o}^{d}$ is the downsampled version of the original image $I_{o}$, which is expanded to $I_{i}$ of size $m \times n$ with a zero-valued row at each $k$-th row of $I_{o}$; the rows retained after downsampling are mapped to the odd rows of the expanded frame $I_{i}$ by $I_{i}(i, 2j-1) = I_{o}(i, j)$, and the discarded rows are then derived from the retained rows using the sixth-order coefficients $h_{6}$, as in Equation (7). The pixel-based matching method [12] and the block-based matching patent [13] derive the unknown discarded pixels by searching for the minimal SAD value at the pixel level or at the block level, respectively. The SAD algorithm has the advantage of computational efficiency. The SAD, as in Equation (6), finds the disparity $disp_{ab}$ between windows of corresponding pixels in the left image $I_{l}$ and the right image $I_{r}$. The optimal-mode-based method [14] considers only the current left and right frames to derive the optimal interpolation modes.
For evaluation, each measure was first averaged over one top-bottom frame and then over all 150 frames of each tested video sequence. The resulting peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and multiscale structural similarity index measure (MS-SSIM) are reported in Tables 2-4, respectively. Note that PSNR measures the distortion of the lossy interpolated images. On the other hand, SSIM and MS-SSIM are objective image quality measures that correlate more closely with the human visual system.
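For row-decimated frames, the bilinear baseline reduces in practice to a vertical, distance-weighted average, because every pixel on a discarded row is missing and only the rows above and below carry original data. The sketch below illustrates that special case. It is not a transcription of Equation (5): the function name, the mask argument, and the border handling (copying the nearest retained row) are assumptions.

```python
import numpy as np

def fill_missing_rows(sparse, known_rows):
    """Fill discarded rows from the nearest retained rows above and below.

    sparse:     2-D array whose discarded rows hold placeholder values.
    known_rows: boolean sequence, True where the row holds original data.
    Assumption: weights are inversely proportional to the row distance,
    and border rows copy the single nearest retained row.
    """
    known = np.asarray(known_rows, dtype=bool)
    out = np.asarray(sparse, dtype=np.float64).copy()
    known_idx = np.flatnonzero(known)
    if known_idx.size == 0:
        raise ValueError("at least one retained row is required")
    for i in np.flatnonzero(~known):
        above = known_idx[known_idx < i]
        below = known_idx[known_idx > i]
        if above.size == 0:
            out[i] = out[below[0]]      # top border: copy the row below
        elif below.size == 0:
            out[i] = out[above[-1]]     # bottom border: copy the row above
        else:
            t, b = above[-1], below[0]
            w_t, w_b = b - i, i - t     # the nearer row gets the larger weight
            out[i] = (w_t * out[t] + w_b * out[b]) / (w_t + w_b)
    return out
```

For the one-in-two decimation used in top-bottom packing, both weights equal 1, so each interior missing row is simply the mean of its two neighbors.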
Given the original image $I_{o}$ and the interpolated image $I_{i}$ of size $M \times N$, with mean $\mu$ and standard deviation $\sigma$, Equations (8) and (10) were used to compute the PSNR and SSIM, whilst MS-SSIM is implemented by applying SSIM at multiple scales. The MSE is calculated as in Equation (9), and $c_{1}$, $c_{2}$ are constants. The proposed method achieves better quality when applied to sequences with complex textures and many corners, such as the Alt-Moabit sequence.
For sequences whose left and right frames contain occlusions, the proposed method yielded better subjective results than the stereo-based matching methods [12,13].
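As an illustration of these measures, the sketch below computes MSE, PSNR, and a single-window SSIM using their standard definitions. It is a simplified stand-in, not the evaluation code behind Tables 2-4: the 8-bit peak value of 255, the constants $c_1 = (0.01 \cdot 255)^2$ and $c_2 = (0.03 \cdot 255)^2$, and the use of one global window instead of local sliding windows are assumptions.

```python
import numpy as np

def mse(ref, test):
    """Mean squared error between two images of the same size."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    return float(np.mean((ref - test) ** 2))

def psnr(ref, test, peak=255.0):
    """PSNR in dB; `peak` is the maximum pixel value (255 assumed for 8-bit)."""
    err = mse(ref, test)
    if err == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / err))

def global_ssim(ref, test, peak=255.0, k1=0.01, k2=0.03):
    """SSIM evaluated once over the whole image.

    Assumption: a single global window; common implementations instead
    average SSIM over local (often Gaussian-weighted) windows.
    """
    x = np.asarray(ref, dtype=np.float64)
    y = np.asarray(test, dtype=np.float64)
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)
```

Per-sequence scores as described in the text would then be obtained by averaging these per-frame values over all frames of a sequence.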
We applied the ANOVA Fisher's least significant difference (LSD) procedure, a multi-comparison post hoc statistical test, to check which pairs of means differ significantly among the mean PSNR, SSIM, and MS-SSIM of the different interpolation methods. The numerical data are shown in Figure 4 for PSNR, Figure 5 for SSIM, and Figure 6 for MS-SSIM. As can be observed, the performance of the method [14], which is the second-best method, is considerably lower than that of the proposed optimal-mode-based interpolation technique. Even though the execution time of the proposed method is slightly higher than that of bilinear interpolation and [14], the proposed method outperformed all evaluated methods in PSNR, SSIM, and MS-SSIM by significant margins and with comparable standard deviations.
For subjective comparison, zoomed-in regions of the interpolated frames are shown in Figures 7-10 for the Moabit, Car, Horse, and Book-Arrive sequences, respectively. The results show that the proposed method outperforms the others, especially at thin horizontal lines. This improvement arises because pixels deleted from one view still exist in the other view, so their values can be almost fully recovered by interpolation based on feature-based matching. Figure 11 shows zoomed-in results of the tested methods at a vertical edge. The proposed method fixes the zigzag artifacts generated by incorrectly estimated disparity.
The execution times of the encoder and decoder processes of all tested methods are compared in Table 5. Although the bilinear method yields the smallest times, its visual quality is the worst among the tested methods. The proposed method takes more time than the bilinear method and much less time than pixel-based stereo matching and NEDI6. Note that deriving the optimal mode for each considered segment at the encoder accounts for most of the execution time of both the method [14] and the proposed method.
Figures 7-10 (captions only partially recovered). Panel labels: (d) NEDI6 [9]; (e) pixel-based matching [12]; (f) patent [13]; (g) paper [14]; and (h) proposed method. In one of these figures the labels are shifted by one letter: … [9]; (d) pixel-based matching [12]; (e) patent [13]; (f) paper [14]; and (g) proposed method.
Figure 11. Zigzag artifact at the vertical edge caused by an incorrectly estimated disparity: (a) patent [13]; (b) paper [14]; and (c) proposed method after fixing the artifact.
Furthermore, similar to [18], we render the depth image from the left and right frames interpolated by the proposed method and compare it with that of the method [4]. As shown in Figure 12, the proposed interpolation technique produces less noise and a higher PSNR for the depth images rendered from the interpolated left and right frames.

Discussion and Conclusions
Since a fast interpolation method is needed to obtain the intermediate full-size resolution at the decoder, we applied the bilinear method, whose output can then be refined by the optimal modes sent by the encoder. Single-image super-resolution methods, such as the tested ones, yielded lower quality than the proposed method. A deep-learning-based method using a single image might provide better quality, but it requires heavy computation and advanced hardware to store and load the trained models, which is not always available at the encoder of DTV.
In this paper, the matched regions between the right and left frames and between the adjacent frames of the same view were exploited for the problem of interpolating frame-compatible top-bottom packed video to full resolution. The correlation of pixels between the views is the right cue to derive appropriate values for the pixels to be interpolated. The experimental section shows clearly that the proposed method yields a PSNR that is 1-2 dB higher, and visually better interpolated images, compared with the other methods. Future work will focus on (1) reducing the mode options; (2) extending the method to online interpolation; (3) extending the proposed approach to other frame-compatible packing systems; (4) exploring the interpolation modes with AI-aided techniques and testing the method on real hardware devices; and (5) image quality assessment, which must be studied intensively to provide appropriate tools for evaluating the interpolated, restored images [19][20][21][22][23]. The evaluation method described in Recommendation ITU-R BT.500-14 [24] is also left for future work, since it requires setting up an evaluation environment with professional equipment and observers.