Article

Neural Network-Based Atlas Enhancement in MPEG Immersive Video

1 Department of Computer Engineering, Dong-A University, Busan 49315, Republic of Korea
2 Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(19), 3110; https://doi.org/10.3390/math13193110
Submission received: 4 August 2025 / Revised: 20 September 2025 / Accepted: 26 September 2025 / Published: 29 September 2025
(This article belongs to the Special Issue Coding Theory and the Impact of AI)

Abstract

Recently, the demand for immersive videos has surged with the expansion of virtual reality, augmented reality, and metaverse technologies. The Moving Picture Experts Group (MPEG) has developed MPEG immersive video (MIV) as an international standard to efficiently transmit large-volume immersive videos. The MIV encoder generates atlas videos to convert extensive multi-view videos into low-bitrate formats. When these atlas videos are compressed using conventional video codecs, compression artifacts often appear in the reconstructed atlas videos. To address this issue, this study proposes a feature-extraction-based convolutional neural network (FECNN) to reduce the compression artifacts introduced during MIV atlas video transmission. The proposed FECNN uses quantization parameter (QP) maps and depth information as inputs and consists of shallow feature extraction (SFE) blocks and deep feature extraction (DFE) blocks to utilize layered feature characteristics. Compared to the existing MIV, the proposed method improves the Bjontegaard delta bit-rate (BDBR) by −4.12% and −6.96% in the basic and additional views, respectively.

1. Introduction

Recently, the expansion of virtual reality, augmented reality, and the metaverse with multimodal interactions has highlighted the need for efficient immersive video streaming services. In a typical immersive video acquisition setup, multi-view texture and depth videos are obtained from various viewpoints using an omnidirectional light-field camera comprising an array of micro lenses. These videos require massive amounts of raw data, including multiple texture videos captured from diverse perspectives and depth videos representing the distances between objects and the cameras. Consequently, powerful video codecs are required to transmit immersive videos within the constrained network bandwidth. As shown in Figure 1, the number of two-dimensional (2D) video codecs required for immersive videos is determined by the number of input multi-view texture and depth videos, resulting in high complexity. The Moving Picture Experts Group (MPEG) has attempted to resolve this complexity issue through MPEG immersive video (MIV) standardization [1]. MPEG released the test model for MIV (TMIV) in 2020 [2], which serves as reference software for compressing the atlas videos derived from multi-view texture and depth videos. Currently, TMIV uses either high efficiency video coding (HEVC) [3] or versatile video coding (VVC) [4].
Figure 2 shows the overall architecture of TMIV, in which the multi-view texture and depth videos constituting the input are fed into the TMIV encoder. For each texture and depth video, the TMIV encoder generates texture and depth atlas videos for the basic and additional views. Among the atlas videos, the texture and depth basic views are representative videos generated from the multiple texture and depth videos. In contrast, the additional view is composed of rectangular patches obtained from the residual after removing the regions that overlap the basic view. These basic and additional views exhibit distinct characteristics compared to natural videos, as illustrated in Figure 3. Figure 3a shows a texture atlas used as the basic view, where large coherent regions are packed to minimize patch boundaries. Figure 3b shows the texture additional view, which gathers residual patches not assigned to the basic view. Figure 3c shows a depth atlas used as the basic view, which exhibits smooth layers with strong gradients at object boundaries. Figure 3d shows the depth additional view, which contains sparse patches for the remaining depth regions.
Therefore, this study proposes a neural network-based post-processing filter to significantly reduce the compression artifacts of atlas videos. We present a feature extraction-based convolutional neural network (FECNN) that consists of shallow feature extraction (SFE) and deep feature extraction (DFE) blocks to improve the quality of texture atlas videos. The proposed method boosts the FECNN’s performance by utilizing the depth atlas and quantization parameter (QP) information as additional inputs. Additionally, various ablation studies were conducted to evaluate the efficiency of the proposed design and its effect on network performance. The performance of the proposed method was evaluated on the test sequences provided by MIV and measured using the Bjontegaard delta bit-rate (BDBR) as the performance metric. The primary contributions of this study are summarized as follows:
  • The proposed FECNN newly deploys the depth atlas and QP information as inputs to enhance coding performance as well as visual quality. To the best of our knowledge, this is the first study in the literature that uses such a neural network to enhance the atlas quality of MIV.
  • To develop the FECNN, we designed SFE and DFE blocks that improve the visual quality of the texture atlas compared with the existing TMIV. Specifically, the network provides noticeable visual improvements around object and patch boundaries in the atlas videos.
  • The proposed FECNN improved BDBR by −4.12% and −6.96% on average in the basic and additional views, respectively.
The remainder of this paper is organized as follows: Section 2 reviews previous coding methods for immersive video and various post-processing studies for artifact reduction. Section 3 describes the proposed FECNN in terms of network architecture and training. Finally, experimental results and conclusions are given in Section 4 and Section 5, respectively.

2. Related Works

2.1. Previous Coding Methods for Immersive Video

To improve the coding performance of immersive video, several studies have investigated new view arrangements and enhanced the accuracy of the input depth videos. For instance, Jeong et al. proposed a method that assigned different QP values according to the varying characteristics of the basic and additional views of atlas videos [5]. Milovanovic et al. reduced the amount of transmitted depth data by leveraging the depth characteristics already contained in the texture view [6]. Lee et al. designed a multi-view streaming system that employed group-based immersive video, deriving weights from these groups and transmitting bitstreams according to the given viewports [7]. Garus et al. proposed the depth recovery concept and implemented a decoder-side depth recovery method that exploited motion information in the texture bitstream [8]. Jeong et al. combined texture and depth information to generate a single frame, which was then compressed into subpicture bitstreams to form a unified bitstream [9]. Mieloch et al. proposed the following two techniques to improve depth map accuracy: (i) an input depth map assistance (IDMA) technique that modifies the depth map using a global multi-view optimization method and (ii) an extended IDMA technique that enhances the quality of the estimated depth map by reprojecting the input depth map onto different views [10]. Lim et al. extended the dynamic depth range using a patchwise min-max linear scaling method to transmit depth information more accurately [11]. Dziembowski et al. applied two methods to improve the coding performance of immersive video: the first removed the constant components of the luma and chroma constituents of each patch in the atlas, and the second adjusted the dynamic range of the depth atlas to match the quality of the input depth map [12]. Lee et al. proposed a high-bit-depth geometry representation for MIV that preserves sub-bit details by assigning the post-quantization residual bits to the chroma channels and adds tailored preprocessing for the YUV 4:2:0 format [13]. Oh et al. proposed an efficient atlas coding strategy for object-based MIV that detects unused areas in the atlas and reduces the atlas resolution during encoding to lower computation and improve coding efficiency [14]. In contrast to the aforementioned studies, the proposed method deploys a neural network to enhance both the visual quality of atlas videos and the coding performance of immersive video.

2.2. Artifact Reduction of Video Coding

Several studies on visual quality enhancement have utilized neural networks to remove the compression artifacts caused by lossy coding schemes. To improve the quality of HEVC I-frames, Dai et al. developed a deep neural network-based in-loop filter known as the variable-filter-size residue-learning convolutional neural network (VRCNN) [15]. Yu et al. proposed a high-frequency guided CNN to enhance the quality of the chroma components using high-frequency information from the input luma component [16]. Hoang et al. investigated a deep recursive residual network with block information (B-DRRN) to provide better quality for compressed frames using coding block information [17]. B-DRRN includes additional network branches to utilize block information and reduces network complexity by applying recursive residual structures and weight-sharing techniques. Wang et al. designed a unified single-model solution, a CNN-based in-loop filter applicable to video coding [18]. They also proposed an attention-based dual-scale CNN to remove artifacts from reconstructed videos using QP and coding unit (CU) partitioning information. Zhao et al. proposed a variable-filter-size residue-learning CNN with batch normalization (VRCNN-BN) [19]. VRCNN-BN was designed as an end-to-end model and applied separately to the luma and chroma components of the video. Das et al. developed a CNN-based post-processing framework that utilizes QP information to enhance the visual quality of reconstructed videos [20]. Zhang et al. proposed a weakly connected dense attention neural network (WCDANN) to remove visual artifacts based on a residual learning scheme [21]. Santamaria et al. proposed a content-adaptive CNN post-filter with per-picture activation signaled via Supplemental Enhancement Information (SEI), built upon the neural network-based post-filter (NNPF) [22] architecture. The filter minimizing the sum of squared differences (SSD) is selected among four candidates, enabling decoder-side operation without bitstream changes [23]. Although these methods have demonstrated high performance in reducing compression artifacts in natural videos, their applicability to texture and depth atlas videos is limited because atlases exhibit distinct characteristics compared with natural videos. Thus, neural network-based algorithms suitable for atlas videos are required.

3. Proposed Method

The MIV common test conditions (CTC) specify four QP values for each test sequence [24]. Table 1 shows that each test sequence has four rate points (RP) corresponding to the texture and depth atlas videos. Because depth atlas videos are compressed with lower QP values, the reconstructed depth atlas videos contain only minor compression artifacts. Conversely, texture atlas videos encoded with a maximum QP of 51 exhibit significant compression artifacts, which degrade the final rendered views.
Figure 4 shows the integrated MIV framework with the proposed FECNN. The roles of the blocks in Figure 4 are as follows: the multi-view videos provide the texture and depth inputs, and the TMIV encoder converts the multi-view inputs into atlases and metadata while removing redundancy, with the texture atlas carrying patches for the basic and additional views. The VVC encoder and decoder compress and reconstruct the atlases with their QP maps, and FECNN enhances the decoded texture atlases using depth and QP information to reduce artifacts without blurring edges. From a modeling standpoint, decoded atlases can be regarded as clean signals perturbed by quantization noise, whose magnitude depends on the QP, and by boundary-localized artifacts that arise near patch or object boundaries. To address these distortions, we guide the restoration using depth, which provides a geometry-aware cue that preserves object boundaries, and we use the QP maps as a spatial confidence signal that allocates stronger correction to neighborhoods encoded with high QP while keeping edge structures intact. In this study, we deploy VVC as the video codec and apply FECNN as a decoder-side post-processing filter to the reconstructed texture atlases for both basic and additional views. FECNN enhances the texture atlas using SFE and DFE blocks; SFE extracts boundary- and patch-aware low-level features, and DFE aggregates broader spatial context to suppress blocking and ringing without blurring edges.
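As a compact way to express this view, the decoded atlas and its restoration can be written as follows; the notation below is ours and is not part of the MIV or VVC specifications.

```latex
% Hedged sketch of the implied observation/restoration model (notation ours):
%   \hat{A}   : decoded texture atlas,  A : uncompressed texture atlas
%   n_q(QP)   : quantization noise that grows with the QP
%   n_b       : boundary-localized artifacts near patch/object boundaries
%   D, M_{QP} : decoded depth atlas and QP maps used as guidance
\hat{A} = A + n_q(\mathrm{QP}) + n_b , \qquad
\tilde{A} = f_{\theta}\!\left(\hat{A},\, D,\, M_{QP}\right) \approx A
```

Here $f_{\theta}$ denotes the proposed FECNN with trainable parameters $\theta$.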

3.1. Network Architecture

Figure 5 shows the architecture of the proposed neural network, comprising the input layer, SFE, DFE, and output layer. The input layer feeds the following four inputs: the texture atlas, the depth atlas, the texture QP map, and the depth QP map. The texture and depth atlas videos are the reconstructed video sequences from the video codecs of Figure 4, with the latter being half the width and height of the former. To generate the texture and depth QP maps, the texture and depth QP values ($QP_{texture}$, $QP_{depth}$) used in the atlas video encoding process are normalized to the range from zero to one by dividing by the maximum QP value ($QP_{max}$); according to the VVC specification, $QP_{max}$ is 63. The texture and depth QP maps are then filled with the normalized values and have the same sizes as the input texture and depth atlas videos, respectively.
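As a minimal illustration of this step, the following PyTorch-style sketch builds a constant, normalized QP map matching the atlas size; the function name, tensor layout, and the example QP and resolution values are our assumptions for illustration only.

```python
import torch

def make_qp_map(qp_value: int, height: int, width: int, qp_max: int = 63) -> torch.Tensor:
    """Build a constant QP map normalized to [0, 1] (QP / QP_max) with the same size as the atlas."""
    return torch.full((1, 1, height, width), qp_value / qp_max)

# Example: a texture atlas encoded with QP 51 and its half-resolution depth atlas encoded with QP 27.
qp_map_texture = make_qp_map(51, 4608, 1920)  # same spatial size as the texture atlas
qp_map_depth = make_qp_map(27, 2304, 960)     # same spatial size as the depth atlas
```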
In the input layer, output feature maps were extracted using a 3 × 3 filter to account for the spatial correlations of the texture and depth atlas videos. Because the input QP maps do not represent spatial information, their feature maps were extracted using a 1 × 1 filter. The convolutional operation of FECNN between the output feature maps ($F_i$) and the previous feature maps ($F_{i-1}$) is expressed in Equation (1):
$$F_i = \delta_i \left( W_i \otimes F_{i-1} + B_i \right) \quad (1)$$
where $\delta_i(\cdot)$, $W_i$, $B_i$, and $\otimes$ represent the parametric rectified linear unit (PReLU) as the activation function, the filter weights, the biases, and the convolutional operation, respectively. In this study, a convolutional operation is denoted as Conv (filter size, channel depth of the generated feature maps).
Atlases are piecewise smooth with discontinuities at patch and object boundaries, and the artifacts in these regions are amplified at high QP. The SFE uses two branches to learn complementary texture-driven and depth-driven features so that boundary structures are preserved. The DFE aggregates global context to mitigate blocking and ringing without blurring edges. We use separable 1 × 3 and 3 × 1 filters to approximate 3 × 3 responses with fewer parameters, and we use PixelShuffle to bring half-resolution depth features to the texture resolution by channel-to-space rearrangement, which avoids interpolation blur near boundaries.
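To make these two design choices concrete, the following PyTorch sketch shows a separable 1 × 3 / 3 × 1 convolution pair that approximates a 3 × 3 receptive field and a PixelShuffle-based ×2 upsampler for half-resolution depth features; the channel widths are illustrative assumptions rather than the exact values used in FECNN.

```python
import torch.nn as nn

class SeparableConv(nn.Module):
    """Approximate a 3x3 response with a 1x3 convolution followed by a 3x1 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_1x3 = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.conv_3x1 = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.act = nn.PReLU(channels)

    def forward(self, x):
        return self.act(self.conv_3x1(self.conv_1x3(x)))

# PixelShuffle x2: expand channels by r^2 and rearrange channels to space, lifting
# half-resolution depth features to the texture-atlas resolution without interpolation blur.
depth_upsampler = nn.Sequential(
    nn.Conv2d(32, 32 * 4, kernel_size=3, padding=1),
    nn.PixelShuffle(upscale_factor=2),  # (N, 128, H, W) -> (N, 32, 2H, 2W)
)
```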
The SFE layer was designed to extract features from the concatenated texture, depth, and QP feature maps. In the SFE layers, the output feature maps derived from the depth information were upsampled by a factor of two using the pixel shuffle method [25] and concatenated with those derived from the texture information. Figure 6a shows an SFE block structure comprising four convolution layers and one skip connection [26]. After intermediate features were extracted using 1 × 1 and 3 × 3 filters, a combination of 1 × 3 and 3 × 1 filters was used to achieve performance comparable to that of a 3 × 3 filter with lower complexity. In addition, the feature maps extracted using the 1 × 1 filter were used as the skip connection to carry the initial input information.
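A plausible rendering of the SFE block described above is sketched below; the exact channel width and the placement of activations are assumptions taken from the general description, since the precise values are given only in Figure 6a.

```python
import torch.nn as nn

class SFEBlock(nn.Module):
    """Sketch of a shallow feature-extraction block: four convolutions (1x1, 3x3, 1x3, 3x1)
    with the 1x1 output reused as the skip connection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv_1x1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_1x3 = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.conv_3x1 = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.act = nn.PReLU(channels)

    def forward(self, x):
        f1 = self.act(self.conv_1x1(x))          # 1x1 features, reused as the skip path
        f2 = self.act(self.conv_3x3(f1))         # 3x3 intermediate features
        f3 = self.conv_3x1(self.conv_1x3(f2))    # separable 1x3 + 3x1 pair (~3x3 response)
        return f3 + f1                           # skip connection from the 1x1 features
```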
Figure 6b shows the DFE block structure, comprising six convolution layers and one skip connection. The input feature maps of the DFE were passed through a 3 × 3 convolution layer, whose output was split into two branches for deep feature extraction. Here, deep features refer to features that closely resemble the texture information. Subsequently, the output feature maps from the two branches were concatenated, and the channel depth was reduced to alleviate the memory burden of the DFE. The feature maps with the reduced channel depth were then processed by two convolution layers with 1 × 3 and 3 × 1 filters. As in the SFE block, the feature maps extracted by the initial 3 × 3 filter were used as the skip connection.
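Analogously, a hedged sketch of the DFE block is given below; the kernel sizes inside the two branches and the channel widths are assumptions, as the text specifies only the overall layout of six convolutions and one skip connection.

```python
import torch
import torch.nn as nn

class DFEBlock(nn.Module):
    """Sketch of a deep feature-extraction block: an initial 3x3 convolution whose output feeds
    two branches, concatenation, channel reduction, a 1x3/3x1 pair, and a skip connection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch_a = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch_b = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)  # shrink after concatenation
        self.conv_1x3 = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.conv_3x1 = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.act = nn.PReLU(channels)

    def forward(self, x):
        f0 = self.act(self.conv_in(x))                 # initial 3x3 features, reused as skip
        fa = self.act(self.branch_a(f0))               # branch 1
        fb = self.act(self.branch_b(f0))               # branch 2
        f1 = self.reduce(torch.cat([fa, fb], dim=1))   # concatenate and reduce channel depth
        f2 = self.conv_3x1(self.conv_1x3(f1))          # separable 1x3 + 3x1 pair
        return f2 + f0                                 # skip connection
```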

3.2. Network Training

The proposed FECNN was trained on an immersive video dataset provided by MIV. As shown in Table 2, the dataset consists of 21 video sequences [27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44], and the eight sequences labeled “optional” were used as the training dataset. The video sequences in the training dataset were compressed under the CTC configurations and subsequently converted to the NPY data type for network training. The training data were extracted as non-overlapping patches of size 256 × 256 from the texture atlas videos and 128 × 128 from the depth atlas videos, yielding a total of 451,904 patches. For additional views, we masked out gray background regions and sampled patches only from valid atlas pixels.
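A minimal sketch of this non-overlapping patch extraction, including the background masking used for additional views, is shown below; the validity threshold and array layout are assumptions, since the paper states only that gray background regions were masked out.

```python
from typing import Optional
import numpy as np

def extract_patches(frame: np.ndarray, patch: int,
                    valid_mask: Optional[np.ndarray] = None,
                    min_valid_ratio: float = 0.5) -> np.ndarray:
    """Cut a luma frame (H, W) into non-overlapping patch x patch tiles; optionally keep only
    tiles whose fraction of valid (non-background) pixels exceeds min_valid_ratio."""
    h, w = frame.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            if valid_mask is not None and valid_mask[y:y + patch, x:x + patch].mean() < min_valid_ratio:
                continue  # skip tiles dominated by the gray background of additional views
            tiles.append(frame[y:y + patch, x:x + patch])
    return np.stack(tiles) if tiles else np.empty((0, patch, patch), dtype=frame.dtype)

# Texture atlases are tiled with 256x256 patches; depth atlases (half resolution) with 128x128.
```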
The proposed network was trained to minimize the mean square error between the original and reconstructed videos with the loss function as in Equation (2):
$$L(\theta) = \frac{1}{N} \sum_{i=0}^{N-1} \left\| O_i - Y_i \right\|_2^2 \quad (2)$$
where $\theta$, $N$, $O_i$, and $Y_i$ denote the network parameters (i.e., the filter weights), the batch size, and the original and reconstructed videos, respectively. Table 3 shows the hyper-parameters used for training the FECNN. The proposed network was trained with the L2 loss, a batch size of 32, and the Adam optimizer [45] with a learning rate of $10^{-4}$ over 50 epochs, where the learning rate was reduced by a factor of 10 every 20 epochs. To calculate the moving averages of the gradients during training, the hyper-parameters $\beta_1$ and $\beta_2$ of the Adam optimizer were set to 0.9 and 0.999, respectively.
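The training configuration in Table 3 corresponds roughly to the following PyTorch loop; the model constructor and the batch layout of the data loader (decoded texture, depth, QP maps, original) are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

def train_fecnn(model: nn.Module, train_loader, epochs: int = 50) -> nn.Module:
    """Training-loop sketch matching Table 3; the FECNN model and data loader are supplied by the caller."""
    criterion = nn.MSELoss()  # L2 loss of Equation (2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)  # lr/10 every 20 epochs
    for _ in range(epochs):
        for decoded, depth, qp_maps, original in train_loader:  # batches of 32 patches
            loss = criterion(model(decoded, depth, qp_maps), original)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```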

4. Experimental Results

4.1. Experimental Environments

All training and test procedures were performed on an Intel Xeon Gold 5218 CPU (16 cores @ 2.30 GHz) with 258 GB of RAM and three NVIDIA Tesla V100 GPUs. To evaluate the performance of the proposed method, we used the “mandatory” video sequences from the MIV CTC dataset. These sequences are the essential videos that must be used to evaluate the coding performance of MIV; they comprise eight sequences from classes B, D, E, J, L, and W. The video encoding and decoding processes were performed using the software listed in Table 4.

4.2. Performance Measurements

The Bjontegaard delta bit-rate (BDBR) is used to evaluate coding performance [46]. In general, a BDBR increase of 1% corresponds to a decrease of 0.05 dB in the BD-peak signal-to-noise ratio (BD-PSNR). Thus, a higher BDBR indicates coding loss. This metric is valuable for comparing the coding performance of different video codecs as it considers both bit-rate and video quality. By quantifying the difference in bit-rate needed to achieve the same video quality, BDBR objectively evaluates coding performance. Equation (3) calculates the weighted average BDBR of the Y, U, and V color components in the test sequences recommended by MIV CTC.
$$BDBR_{YUV} = \frac{6 \times BDBR_Y + BDBR_U + BDBR_V}{8} \quad (3)$$
Here, $BDBR_Y$, $BDBR_U$, and $BDBR_V$ denote the BDBR of the Y, U, and V color components of the texture atlas, respectively. Table 5 lists the BDBR performance for the basic and additional views of the texture atlas. Since FECNN operates as a decoder-side post-filter, the bitstream and bitrate remain unchanged; the reported BD-rate reductions therefore stem from decreased distortion at identical rate points. Compared with TMIV as the anchor, the proposed method achieved average BD-rate reductions of 4.12% and 6.96% in the basic and additional views, respectively. In particular, the Fan sequence exhibited a significant BDBR reduction in both the basic and additional views. Because the additional views are composed of combinations of patches, resulting in numerous patch boundaries, compression artifacts primarily occur at the boundaries of the additional view. Consistent with our modeling, additional views exhibit larger gains than basic views because they typically contain a higher density of patch/object boundaries where boundary-localized distortions accumulate. Sequences with stronger boundary activity or high-frequency textures show the most pronounced improvements. Thus, the proposed method achieved good performance in artifact removal at these boundaries as well as BDBR improvement. Existing CNN-based post-filters [20,22] apply single-stream sequential architectures from conventional natural video coding to reconstructed frames; these methods do not sufficiently consider the many patch and object boundaries present in atlases. Consistent with this difference, the method of ref. [20] attains modest, content-dependent gains by using QP information, and these gains tend to be larger on additional views than on basic views, yet they remain smaller than those achieved by the proposed method. In contrast, the method of ref. [22] uses only the texture input and shows limited or inconsistent improvements, with little or no gain in some cases. Collectively, these outcomes indicate that designs that explicitly reflect atlas characteristics are advantageous.
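For reference, Equation (3) amounts to the following small helper; the example input values are hypothetical.

```python
def bdbr_yuv(bdbr_y: float, bdbr_u: float, bdbr_v: float) -> float:
    """Weighted YUV BD-rate of Equation (3): luma weighted 6x, chroma 1x each, divided by 8."""
    return (6 * bdbr_y + bdbr_u + bdbr_v) / 8

# Example with hypothetical per-component BD-rates: bdbr_yuv(-4.0, -2.0, -2.0) == -3.5
```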
In addition to the BD-rate, we report the structural similarity index measure (SSIM) as a perceptual indicator on the reconstructed texture luma, averaged over frames for each sequence and reported separately for the basic and additional views. As shown in Table 6, the proposed method attains higher SSIM than ref. [20] on all sequences and on both view types. The average gains are +0.0033 on the basic views and +0.0028 on the additional views, with the largest improvements on the Fan and Cadillac additional views. Higher SSIM indicates better preservation of structural content near patch and object boundaries.
Table 7 summarizes the model size, FLOPs, and inference time measured on an RTX 4070 Ti at the native atlas resolution. The proposed model processes four inputs per frame, namely two maps at 1920 × 4608 for the texture atlas and its QP map and two maps at 960 × 2304 for the depth atlas and its QP map. The proposed model uses 5.01 MB of parameters and 31.62 T FLOPs, both smaller than ref. [20] with 8.45 MB and 39.19 T. Its inference time is 22.86 s, while ref. [20] records 14.36 s. This latency difference arises because FLOPs count arithmetic operations only, whereas wall-clock time is also affected by the memory traffic from multiple inputs and the movement of intermediate feature maps. The method of ref. [22] has the smallest complexity and the fastest runtime, and its coding gains are correspondingly limited. Taken together, these results indicate that the proposed method achieves substantially better performance at a suitable level of complexity compared with generic post-filters for natural content.
Both BD-rates are computed on the rendered views: Y-PSNR measures pixel-by-pixel distortion on the luma channel, whereas immersive video (IV)-PSNR [47] uses a local search window with global color-difference compensation and is less sensitive to small spatial shifts. Table 8 shows that our post-filter attains an average BD-rate of −0.9% on Y-PSNR but +3.4% on IV-PSNR. This gap mainly occurs because the network sharpens texture edges while the depth atlas, which drives view warping, is unchanged. As a result, small mis-registrations around depth discontinuities persist after rendering, and the local matching in IV-PSNR cannot fully compensate for them, which can lower the score even when pixel-wise distortion decreases. The impact is further diluted because most synthesized pixels come from interior regions, so boundary-focused gains contribute less to the rendered average. To better align atlas enhancement with downstream quality, a renderer-aware integrated network that jointly processes the texture and depth atlases and enforces cross-modal consistency at patch and object boundaries will be explored in future work.

4.3. Visual Comparisons

Figure 7 and Figure 8 show visual quality comparisons between the VVC-compressed content and the proposed method at the maximum QP value. For a detailed comparison, the figures show zoomed-in views of the areas marked by red boxes. Each row presents the ground truth, the reconstructed result of TMIV, and that of the proposed method. The proposed network effectively reduced the artifacts occurring between objects and at the patch boundaries of the atlas videos. In Figure 7a, the boxed region around the sphere shoulder and the distant chess pieces exhibits reduced ringing and smoother curvature, and the silhouettes remain sharp. As shown in Figure 7b, the inset over the fan grille shows thin curved lines that stay continuous and straight, with bending and staircase artifacts suppressed. As depicted in Figure 8a, the boxed patch junction indicates that the visible seam and brightness jump disappear and texture continuity is restored across the packing cut. As shown in Figure 8b, under a high QP setting, the boxed region near a patch boundary shows reduced blocking and ringing while the diagonal object edge remains crisp. Overall, the proposed method proves effective for atlas videos by directly targeting patch seams and object boundaries with depth guidance and QP adaptation.

4.4. Ablation Studies

To investigate the optimal network architecture of FECNN, a variety of ablation studies were conducted as follows:
  • Performance analysis according to the use of chroma components as an input
  • Determination of a suitable up-sampling method
  • Optimal number of SFE and DFE blocks
As shown in Table 9, excluding chroma yields better averages. Without chroma, the averages are −3.73% on basic views and −6.39% on additional views. With chroma, the averages are −2.89% and −5.92%. Since adding chroma increases memory use and inference time due to extra inputs, the final model does not include chroma.
Table 10 compares transposed convolution and PixelShuffle. PixelShuffle achieves slightly better averages, −3.80% on basic views and −6.42% on additional views, while transposed convolution achieves −3.73% and −6.39%. PixelShuffle also performs channel-to-space rearrangement that avoids interpolation blur and checkerboard artifacts near patch or object boundaries, so we adopt PixelShuffle.
We swept the number of blocks while keeping other settings fixed. As reported in Table 11, performance improves up to 4 SFE and 5 DFE, reaching −4.12% on basic views and −6.96% on additional views, and then plateaus or slightly regresses. We therefore adopt 4 SFE/5 DFE as the final configuration and use this setting in all main experiments.

5. Conclusions

This study proposes an FECNN to enhance both visual quality and coding performance compared to the conventional TMIV as an anchor. The proposed network employs newly designed SFE blocks that extract boundary- and patch-aware features from the texture atlas, depth atlas, and QP maps, followed by DFE blocks that aggregate wide spatial context conditioned on depth and QP to restore texture atlas details while suppressing blocking and ringing near patch and object boundaries. A separable convolution method is deployed to reduce the complexity of the FECNN. The performance of FECNN was objectively evaluated under the MIV CTC environments and subjectively verified through visual quality comparisons. A variety of ablation studies were conducted to determine the optimal FECNN structure. Experimental results show that the proposed method improves the BDBR by −4.12% and −6.96% on average in the basic and additional views of the texture atlas, respectively. Compared with conventional TMIV decoding and a generic CNN post-filter, the proposed method reduces blocking, ringing, and patch seams near object boundaries through atlas-aware depth guidance and QP-adaptive restoration.
For future work, we will extend the decoder-side restoration to a joint model that enhances both the texture atlas and the depth atlas. Our observation is that improving only the texture atlas can leave inconsistencies with the depth atlas, which limits rendering quality near patch and object boundaries. The planned model will share features across modalities and use geometry-aware constraints to enforce consistency between texture edges and depth discontinuities while remaining QP-adaptive. We will also study temporal consistency and lightweight modules so that the method remains real-time capable and preserves full compatibility with TMIV and the VVC bitstream.

Author Contributions

Conceptualization, T.L. and D.J.; methodology, T.L. and D.J.; software, T.L.; formal analysis, D.J.; investigation, T.L. and D.J.; resources, D.J.; data curation, T.L.; writing—original draft preparation, T.L.; writing—review and editing, D.J., K.Y. and W.-S.C.; visualization, T.L.; supervision, D.J.; project administration, D.J.; funding acquisition, D.J., K.Y. and W.-S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. For further inquiries, please contact the corresponding author(s).

Acknowledgments

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2022-0-00022, RS-2022-II220022, Development of immersive video spatial computing technology for ultra-realistic metaverse services).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Boyce, J.M.; Doré, R.; Dziembowski, A.; Fleureau, J.; Jung, J.; Kroon, B.; Salahieh, B.; Vadakital, V.K.M.; Yu, L. MPEG immersive video coding standard. Proc. IEEE 2021, 109, 1521–1536. [Google Scholar] [CrossRef]
  2. Dziembowski, A.; Lee, G. Test model 17 for MPEG immersive video. ISO/IEC JTC 1/SC 29/ WG 04, Document N0376. In Proceedings of the 147th MPEG Meeting, Geneva, Switzerland, 15–19 July 2023. [Google Scholar]
  3. Sullivan, G.J.; Ohm, J.-R.; Han, W.-J.; Wiegand, T. Overview of the HEVC Standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  4. Bross, B.; Wang, Y.-K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.-R. Overview of the VVC standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
  5. Jeong, J.; Lee, S.; Ryu, E. Delta QP Allocation for MPEG Immersive Video. In Proceedings of the 13th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 19–21 October 2022; pp. 568–573. [Google Scholar]
  6. Milovanovic, M.; Henry, F.; Cagnazzo, M. Depth patch Selection for Decoder-Side Depth Estimation in MPEG Immersive Video. In Proceedings of the Picture Coding Symposium (PCS), San Jose, CA, USA, 7–9 December 2022; pp. 343–347. [Google Scholar]
  7. Lee, S.; Jeong, J.; Ryu, E. Group-Based Adaptive Rendering System for 6DoF Immersive Video Streaming. IEEE Access 2022, 10, 102691–102700. [Google Scholar] [CrossRef]
  8. Garus, P.; Henry, F.; Maugey, T.; Guillemot, C. Motion Compensation-based Low-complexity Decoder Side Depth Estimation for MPEG Immersive Video. In Proceedings of the 24th International Workshop on Multimedia Signal Processing (MMSP), Shanghai, China, 26–28 September 2022; pp. 1–6. [Google Scholar]
  9. Jeong, J.; Lee, S.; Ryu, E. VVC Subpicture-Based Frame Packing for MPEG Immersive Video. IEEE Access 2022, 10, 103781–103792. [Google Scholar] [CrossRef]
  10. Mieloch, D.; Dziembowski, A.; Kloska, D.; Szydelko, B.; Jeong, J.; Lee, G. A New Approach to Decoder-Side Depth Estimation in Immersive Video Transmission. IEEE Trans. Broadcast. 2023, 69, 951–965. [Google Scholar] [CrossRef]
  11. Lim, S.; Kim, H.; Kim, Y. Adaptive Patch-Wise Depth Range Linear Scaling Method for MPEG Immersive Video Coding. IEEE Access 2023, 11, 133440–133450. [Google Scholar] [CrossRef]
  12. Dziembowski, A.; Mieloch, D.; Jeong, J.; Lee, G. Immersive Video Postprocessing for Efficient Video Coding. IEEE Trans. Circuit Syst. Video Technol. 2023, 33, 4349–4361. [Google Scholar] [CrossRef]
  13. Lee, Y.; Oh, K.; Lee, G.; Oh, B. High-Bit-Depth Geometry Representation and Compression MPEG Immersive Video System. IEEE Access 2024, 12, 189064–189072. [Google Scholar] [CrossRef]
  14. Oh, J.; Li, X.; Oh, K.; Lee, G.; Jang, E. Efficient Atlas Coding Strategy using cropping for Object-based MPEG Immersive Video. In Proceedings of the IEEE International Conference on Advanced Communications Technology (ICACT), Pyeong Chang, Republic of Korea, 16–19 February 2025; pp. 253–260. [Google Scholar]
  15. Dai, Y.; Liu, D.; Wu, F. A Convolutional Neural Network Approach for Post-processing in HEVC Intra Coding. In Proceedings of the Multimedia Modeling (MMM), Reykjavik, Iceland, 4–6 January 2017; pp. 28–39. [Google Scholar]
  16. Yu, L.; Chang, W.; Liu, Q.; Gabbouj, M. High-frequency guided CNN for video compression artifacts reduction. In Proceedings of the IEEE International Conference on Visual Communications and Image Processing (VCIP), Suzhou, China, 13–16 December 2022; pp. 1–5. [Google Scholar]
  17. Hoang, T.M.; Zhou, J. B-DRRN: A Block Information Constrained Deep Recursive Residual Network for Video Compression Artifacts Reduction. In Proceedings of the Picture Coding Symposium (PCS), Ningbo, China, 12–15 November 2019; pp. 1–5. [Google Scholar]
  18. Wang, M.; Wan, S.; Gong, H.; Ma, M. Attention-Based Dual-Scale CNN In-Loop Filter for Versatile Video Coding. IEEE Access 2019, 7, 145214–145226. [Google Scholar] [CrossRef]
  19. Zhao, H.; He, M.; Teng, G.; Shang, X.; Wang, G.; Feng, Y. A CNN-Based Post-Processing Algorithm for Video Coding Efficiency Improvement. IEEE Access 2019, 8, 920–929. [Google Scholar] [CrossRef]
  20. Das, T.; Choi, K.; Choi, J. High Quality Video Frames From VVC: A Deep Neural Network Approach. IEEE Access 2023, 11, 54254–54264. [Google Scholar] [CrossRef]
  21. Zhang, H.; Jung, C.; Zou, D.; Li, M. WCDANN: A Lightweight CNN Post-Processing Filter for VVC-based Video Compression. IEEE Access 2023, 11, 83400–83413. [Google Scholar] [CrossRef]
  22. Wang, H.; Chen, J.; Reuze, K.; Kotra, M.A.; Karczewicz, M. EE1-related: Neural Network-based in-loop filter with constrained computational complexity, JVET of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29, document JVET-W0131. In Proceedings of the 23rd JVET Meeting, Teleconference, 7–16 July 2021. [Google Scholar]
  23. Santamaria, M.; Yang, R.; Cricri, F.; Lainema, J.; Zhang, H.; Youvalari, G.R.; Hannuksela, M.M. EE1-1.11: Content-adaptive post-filter, JVET of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29, document JVET-AC0055. In Proceedings of the 29th JVET Meeting, Teleconference, 11–20 January 2023. [Google Scholar]
  24. Dziembowski, A.; Kroon, B.; Jung, J. Common test conditions for MPEG immersive video, ISO/IEC JTC 1/SC 29/ WG 04, document N0372. In Proceedings of the 143rd MPEG Meeting, Geneva, Switzerland, 17–21 July 2023. [Google Scholar]
  25. Shi, W.; Caballero, J.; Huszar, F.; Totz, J.; Aitken, A.-P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Kroon, B. 3DoF+ Test Sequence ClassroomVideo. ISO/IEC JTC 1/SC 29/WG 11, document M42415. In Proceedings of the 122nd MPEG Meeting, San Diego, CA, USA, 16–20 April 2018. [Google Scholar]
  28. Doré, R. Technicolor 3DoF+ Test Materials. ISO/IEC JTC 1/SC 29/WG 11, document M42349. In Proceedings of the 122nd MPEG Meeting, San Diego, CA, USA, 16–20 April 2018. [Google Scholar]
  29. Ilola, L.; Vadakital, V.K.M.; Roimela, K.; Keränen, J. New Test Content for Immersive Video-Nokia Chess. ISO/IEC JTC 1/SC 29/WG 11, document M50787. In Proceedings of the 128th MPEG Meeting, Geneva, Switzerland, 7–11 October 2019. [Google Scholar]
  30. Schenini, J.; Ilolla, L.; Vadakital, V.K.M. A new computer graphics scene, Guitarist, suitable for MIV edition-2. ISO/IEC JTC 1/SC 29/WG 04, document M58080. In Proceedings of the 136th MPEG Meeting, Teleconference, 11–15 October 2021. [Google Scholar]
  31. Boissonade, P.; Jung, J. Proposition of New Sequences for Windowed-6DoF Experiments on Compression, Synthesis and Depth Estimation. ISO/IEC JTC 1/SC 29/WG 11, document M43318, ISO/IEC JTC 1/SC 29/WG 11. In Proceedings of the 123rd MPEG Meeting, Ljubljana, Slovenia, 16–20 July 2018. [Google Scholar]
  32. Doré, R.; Briand, G.; Thudor, F. New Cadillac Content Proposal for Advanced MIV v2 Investigations. ISO/IEC JTC 1/SC 29/WG 04, document M57186. In Proceedings of the 135th MPEG Meeting, Teleconference, 12–16 July 2021. [Google Scholar]
  33. Doré, R.; Briand, G. Interdigital Mirror Content Proposal for Advanced MIV Investigations on Reflection. ISO/IEC JTC 1/SC 29/WG 11, document M55710. In Proceedings of the 133rd MPEG Meeting, Teleconference, 11–15 January 2021. [Google Scholar]
  34. Doré, R.; Briand, G.; Thudor, F. Interdigital Fan Content Proposal for MIV. ISO/IEC JTC 1/SC 29/WG 11, document M54732. In Proceedings of the 131st MPEG Meeting, Teleconference, 29 June–3 July 2020. [Google Scholar]
  35. Doré, R.; Briand, G.; Thudor, F. Interdigital Group Content Proposal. ISO/IEC JTC 1/SC 29/WG 11, document M54731. In Proceedings of the 131st MPEG Meeting, Teleconference, 29 June–3 July 2020. [Google Scholar]
  36. Thudor, F.; Doré, R. Dancing sequence for verification tests, ISO/IEC JTC 1/SC 29/WG 04, document M57751. In Proceedings of the 136th MPEG Meeting, Teleconference, 11–15 October 2021. [Google Scholar]
  37. Doyen, D.; Langlois, T.; Vandame, B.; Babon, F.; Boisson, G.; Sabater, N.; Gendrot, R.; Schubert, A. Light Field Content from 16-camera Rig. ISO/IEC JTC 1/SC 29/WG 11, document M40010. In Proceedings of the 117th MPEG Meeting, Geneva, Switzerland, 16–20 January 2017. [Google Scholar]
  38. Tapie, T.; Schubert, A.; Gendrot, R.; Briand, G.; Thudor, F.; Dore, R. Breakfast new natural content proposal for MIV. ISO/IEC JTC 1/SC 29/WG 04, document M56730. In Proceedings of the 134th MPEG Meeting, Teleconference, 26–30 April 2021. [Google Scholar]
  39. Tapie, T.; Schubert, A.; Gendrot, R.; Briand, G.; Thudor, F.; Dore, R. Barn new natural content proposal for MIV. ISO/IEC JTC 1/SC 29/WG 04, document M56632. In Proceedings of the 134th MPEG Meeting, Teleconference, 26–30 April 2021. [Google Scholar]
  40. Salahieh, B.; Marvar, B.; Nentedem, M.-M.; Kumar, A.; Popovic, V.; Seshadrinathan, K.; Nestares, O.; Boyce, J. Kermit Test Sequence for Windowed 6DoF Activities. ISO/IEC JTC 1/SC 29/WG 11, document M43748. In Proceedings of the 123rd MPEG Meeting, Ljubljana, Slovenia, 16–20 July 2018. [Google Scholar]
  41. Mieloch, D.; Dziembowski, A.; Domanski, M. Natural Outdoor Test Sequences. ISO/IEC JTC 1/SC 29/WG 11, document M51598. In Proceedings of the 129th MPEG Meeting, Brussels, Belgium, 13–17 January 2020. [Google Scholar]
  42. Domanski, M.; Dziembowski, A.; Grzelka, A.; Mieloch, D.; Stankiewicz, O.; Wegner, K. Multiview Test Video Sequences for Free Navigation Exploration Obtained Using Pairs of Cameras. ISO/IEC JTC 1/SC 29/WG 11, document M38247. In Proceedings of the 115th MPEG Meeting, Geneva, Switzerland, 30 May–3 June 2016. [Google Scholar]
  43. Bai, Y.; Li, S.; Yu, L. Test results of CBAbasketball sequence and challenges. ISO/IEC JTC 1/SC 29/WG 04, document M59558. In Proceedings of the 138th MPEG Meeting, Teleconference, 25–29 April 2022. [Google Scholar]
  44. Mieloch, D.; Dziembowski, A.; Szydelko, B.; Kloska, D.; Grzelka, A.; Stankowski, J.; Domanski, M.; Lee, G.; Jeong, J. New natural content-MartialArts. ISO/IEC JTC 1/SC 29/WG 04, document M61949. In Proceedings of the 141st MPEG Meeting, Teleconference, 16–20 January 2023. [Google Scholar]
  45. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference for Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–15. [Google Scholar]
  46. Bjontegaard, G. Calculation of average PSNR differences between RD curves. ITU-T SG 16 Q6 VCEG, document VCEG-M33. In Proceedings of the 13th VCEG Meeting, Austin, TX, USA, 2–4 April 2001. [Google Scholar]
  47. Dziembowski, A.; Mieloch, D.; Stankowski, J.; Grzelka, A. IV-PSNR—The Objective Quality Metric for Immersive Video Applications. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7575–7591. [Google Scholar] [CrossRef]
Figure 1. Conventional coding scheme for multi-view videos.
Figure 2. Framework of TMIV.
Figure 3. Examples of texture and depth atlas videos as basic and additional views: (a) Texture atlas as basic view; (b) Texture atlas as additional view; (c) Depth atlas as basic view; (d) Depth atlas as additional view.
Figure 4. Framework of TMIV with the proposed method.
Figure 5. Overall architecture of the proposed FECNN.
Figure 6. The structures of the feature extraction blocks in the proposed network: (a) SFE block; (b) DFE block.
Figure 7. Visual comparisons on basic view with RP4. (a) Chess sequence; (b) Fan sequence.
Figure 8. Visual comparisons on additional view with RP4. (a) Chess sequence; (b) Fan sequence.
Table 1. QP values for atlas encoding in MIV CTC.

Class | Sequences | Texture QP (RP1/RP2/RP3/RP4) | Depth QP (RP1/RP2/RP3/RP4)
A01 | Classroom Video | 26/30/38/51 | 7/10/16/27
B01 | Museum | 29/40/47/51 | 9/18/23/27
B02 | Chess | 18/27/35/45 | 1/7/14/22
B03 | Guitarist | 22/24/29/39 | 3/5/9/17
C01 | Hijack | 19/24/34/49 | 1/5/13/25
C02 | Cyberpunk | 21/24/29/39 | 3/5/9/17
J01 | Kitchen | 18/26/33/41 | 1/7/12/19
J02 | Cadillac | 22/31/41/51 | 3/11/19/27
J03 | Mirror | 26/33/42/51 | 7/12/19/27
J04 | Fan | 32/38/45/51 | 11/16/22/27
W01 | Group | 26/31/37/46 | 7/11/15/23
W02 | Dancing | 20/24/28/40 | 2/5/8/18
D01 | Painter | 24/32/43/51 | 5/11/20/27
D02 | Breakfast | 25/30/35/43 | 6/10/14/20
D03 | Barn | 25/30/35/42 | 6/10/14/19
E01 | Frog | 29/34/40/46 | 9/13/18/23
E02 | Carpark | 23/28/37/47 | 4/8/15/23
E03 | Street | 21/25/32/41 | 3/6/11/19
L01 | Fencing | 23/28/39/51 | 4/8/17/27
L02 | CBABasketball | 24/27/31/43 | 5/7/11/20
L03 | MartialArts | 24/27/31/43 | 5/7/11/20
Table 2. MIV test sequences on CTC.

Content Categories | Class | Sequences | Resolution | Number of Source Views
Computer generated | A | Classroom Video | 4096 × 2048 | 15
Computer generated | B | Museum | 2048 × 2048 | 24
Computer generated | B | Chess | 2048 × 2048 | 10
Computer generated | B | Guitarist | 2048 × 2048 | 46
Computer generated | C | Hijack | 4096 × 2048 | 10
Computer generated | C | Cyberpunk | 2048 × 2048 | 10
Computer generated | J | Kitchen | 1920 × 1080 | 25
Computer generated | J | Cadillac | 1920 × 1080 | 15
Computer generated | J | Mirror | 1920 × 1080 | 15
Computer generated | J | Fan | 1920 × 1080 | 15
Computer generated | W | Group | 1920 × 1080 | 21
Computer generated | W | Dancing | 1920 × 1080 | 24
Natural | D | Painter | 2048 × 2048 | 16
Natural | D | Breakfast | 1920 × 1080 | 15
Natural | D | Barn | 1920 × 1080 | 15
Natural | E | Frog | 1920 × 1080 | 13
Natural | E | Carpark | 1920 × 1088 | 9
Natural | E | Street | 1920 × 1088 | 9
Natural | L | Fencing | 1920 × 1088 | 10
Natural | L | CBABasketball | 2048 × 1088 | 34
Natural | L | Martial Arts | 1920 × 1080 | 15
Table 3. Hyper-parameters of the proposed method.

Hyper-Parameters | Options
Loss function | Mean Squared Error (MSE)
Optimizer | Adam
Number of epochs | 50
Batch size | 32
Learning rate | $10^{-4}$ to $10^{-6}$
Activation function | PReLU
Table 4. Software and library versions used in our experiments.

Software | Version
TMIV | 17.0
VVenC | 1.7.0
VVdeC | 1.6.0
PyTorch | 1.14.0
CUDA | 11.6
Table 5. Comparison of coding performance between the proposed method and CNN-based post-filters. All values are BDBR-YUV (basic / additional views).

Class | Sequences | Proposed Method | Ref. [20] | Ref. [22]
B02 | Chess | −2.51% / −4.85% | −1.56% / −3.96% | 1.74% / 0.89%
B03 | Guitarist | −1.98% / −3.37% | −0.25% / −2.54% | 0.68% / 0.20%
J02 | Cadillac | −3.13% / −7.25% | −1.32% / −4.29% | 0.21% / 0.11%
J04 | Fan | −8.88% / −13.19% | −1.25% / −5.40% | −0.83% / −1.02%
W01 | Group | −2.15% / −6.74% | −0.12% / −4.31% | −0.55% / −0.46%
D01 | Painter | −4.96% / −9.37% | −1.33% / −6.59% | −0.40% / −0.18%
E01 | Frog | −4.10% / −4.88% | −2.77% / −4.06% | −0.82% / −0.81%
L02 | CBABasketball | −5.24% / −6.01% | −4.36% / −5.20% | 0.07% / −0.19%
— | Average | −4.12% / −6.96% | −1.62% / −4.54% | 0.01% / −0.18%
Table 6. Comparison of SSIM between the proposed method and CNN-based post-filters (basic / additional views).

Class | Sequences | Proposed Method | Ref. [20]
B02 | Chess | 0.9744 / 0.9636 | 0.9741 / 0.9628
B03 | Guitarist | 0.9704 / 0.9689 | 0.9699 / 0.9685
J02 | Cadillac | 0.9540 / 0.9421 | 0.9494 / 0.9365
J04 | Fan | 0.8796 / 0.9226 | 0.8683 / 0.9149
W01 | Group | 0.8785 / 0.8829 | 0.8763 / 0.8804
D01 | Painter | 0.9136 / 0.9204 | 0.9084 / 0.9176
E01 | Frog | 0.8674 / 0.8434 | 0.8652 / 0.8415
L02 | CBABasketball | 0.9641 / 0.9587 | 0.9636 / 0.9580
— | Average | 0.9252 / 0.9253 | 0.9219 / 0.9225
Table 7. Comparison of the computational complexity of the networks.

Methods | Parameters | FLOPs | Inference Time
Proposed Method | 5.01 MB | 31.62 T | 22,861.44 ms
Ref. [20] | 8.45 MB | 39.19 T | 14,364.18 ms
Ref. [22] | 0.34 MB | 1.54 T | 422.27 ms
Table 8. Comparison of BD-rate on rendered views using Y-PSNR and IV-PSNR for the MIV CTC. Negative values indicate bitrate reduction at equal quality.

Class | Sequences | BD-Rate (Y-PSNR) | BD-Rate (IV-PSNR)
B02 | Chess | −0.7% | 2.7%
B03 | Guitarist | −2.9% | 5.1%
J02 | Cadillac | −0.9% | 3.5%
J04 | Fan | 3.3% | 2.7%
W01 | Group | 0.3% | 1.3%
D01 | Painter | −3.2% | 4.4%
E01 | Frog | −1.9% | 5.1%
L02 | CBABasketball | −1.4% | 2.3%
— | Average | −0.9% | 3.4%
Table 9. Coding performance between w/ chroma and w/o chroma. All values are BDBR-YUV (basic / additional views).

Class | Sequences | w/ Chroma | w/o Chroma
B02 | Chess | −1.64% / −3.41% | −1.46% / −3.90%
B03 | Guitarist | −0.98% / −2.63% | −1.71% / −3.14%
J02 | Cadillac | −1.28% / −5.27% | −3.36% / −6.63%
J04 | Fan | −6.91% / −12.95% | −8.18% / −12.36%
W01 | Group | −1.85% / −5.93% | −1.96% / −6.45%
D01 | Painter | −3.19% / −7.82% | −4.71% / −8.47%
E01 | Frog | −3.14% / −4.49% | −3.81% / −4.73%
L02 | CBABasketball | −4.13% / −4.87% | −4.64% / −5.44%
— | Average | −2.89% / −5.92% | −3.73% / −6.39%
Table 10. Coding performance between de-convolution and pixel shuffle. All values are BDBR-YUV (basic / additional views).

Class | Sequences | De-Convolution | Pixel Shuffle
B02 | Chess | −1.46% / −3.90% | −2.01% / −4.19%
B03 | Guitarist | −1.71% / −3.14% | −1.64% / −3.00%
J02 | Cadillac | −3.36% / −6.63% | −3.05% / −6.56%
J04 | Fan | −8.18% / −12.36% | −8.55% / −12.58%
W01 | Group | −1.96% / −6.45% | −1.80% / −6.38%
D01 | Painter | −4.71% / −8.47% | −4.92% / −8.57%
E01 | Frog | −3.81% / −4.73% | −3.81% / −4.63%
L02 | CBABasketball | −4.64% / −5.44% | −4.62% / −5.43%
— | Average | −3.73% / −6.39% | −3.80% / −6.42%
Table 11. Comparisons of the number of SFE and DFE blocks. The best performance is achieved with 4 SFE and 5 DFE blocks.

SFE Blocks | DFE Blocks | Basic (BDBR-YUV) | Additional (BDBR-YUV)
3 | 3 | −3.80% | −6.42%
4 | 3 | −3.79% | −6.51%
5 | 3 | −3.93% | −6.67%
3 | 4 | −3.93% | −6.71%
3 | 5 | −3.49% | −6.42%
4 | 4 | −4.05% | −6.78%
4 | 5 | −4.12% | −6.96%
5 | 4 | −3.82% | −6.64%
5 | 5 | −3.75% | −6.56%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

