1. Introduction
Immersive video technologies [1] are continuously advancing, driven by the increasing demand for realistic and interactive multimedia experiences. These technologies implement various degrees of freedom, e.g., 3DoF, 3DoF+, and 6DoF [2,3]. The process of generating immersive video (Figure 1) requires using a multicamera system to register a 3D scene. Recent developments in immersive video have explored multiple configurations of multicamera systems, including linear, planar, and spherical camera arrangements.
The other component crucial for creating immersive video is the three-dimensional geometry of the registered scene, which can be obtained either from depth cameras or estimated using dedicated software. Since using depth cameras introduces a problem of interference between sensors [4], in this paper we assume that depth information is obtained through depth estimation.
Practical systems of immersive video [
5] can enhance multicamera setups with a virtual camera placed among the real cameras, enabling dynamic and adaptive view synthesis to reflect the movement of the user in a 3D scene. In other words, a virtual camera is a viewport rendered by the system at the request of the viewer [
6]. To represent all of this data, the multiview video plus depth (MVD [
7]) format is usually used, as shown in
Figure 2. It allows separate encoding of real views and their corresponding depth maps using any available video encoder (HEVC, VVC [
8,
9]). Because traditional encoders were not developed for data such as depth maps (as they do not resemble naturally captured videos), alternative solutions have been explored (MV-HEVC, 3D-HEVC [
10]). Unfortunately, experimental data show that their versatility is highly limited [
11].
The latest approach to immersive video compression is the MPEG immersive video (MIV) standard [12], created by the ISO/IEC MPEG Video Coding working group. MIV utilizes multiview pre-processing and post-processing (Figure 3), making the resulting videos more efficient to compress with traditional video encoders. Additionally, MIV supports a range of profiles dedicated to different coding scenarios.
The MIV Main profile groups input views into a set of videos called atlases (usually four), which are shown in
Figure 4. Each atlas is encoded independently. Input views, which are views captured by the multicamera system, are divided into basic and additional views. Basic views contain the most information about a scene and are usually packed into the first atlas. Additional views contain a large amount of redundant information, which is removed by the pruning process. This process creates small fragments (called patches) of the remaining view information. Patches are packed into the remaining space left in the first two atlases. The third and fourth atlases contain depth information corresponding to the first two atlases.
Another approach is called the MIV decoder-side depth estimation (DSDE) coding profile. Similarly to the MIV Main, this profile also utilizes atlases (
Figure 4) consisting of multiple input views per atlas, but in this approach the atlases do not contain depth information. As depth is required to perform virtual view synthesis, it is estimated from the decoded views on the decoder side. In this scenario, the transmitted views should be chosen in such a way as to minimize mutual redundancy while maximizing coverage of scene content. The view selection process for the DSDE profile is further complicated by an intrinsic trade-off: increasing the number of transmitted views improves the accuracy of the depth estimates and, consequently, the quality of synthesized views, whereas a poorly chosen small subset can cause the depth estimator to fail, particularly in regions affected by occlusions. Hence, it is unclear whether the view selection method used for the MIV Main profile will remain effective for the DSDE profile, or whether selection criteria tailored to decoder-side depth estimation are required.
This paper addresses the challenge of optimal input view selection for the MIV DSDE profile, proposing a method that adapts to scene characteristics. We analyze how to obtain the highest possible depth-estimation quality when only a limited number of views can be transmitted, while simultaneously selecting them in a way that mitigates the negative impact of scene occlusions on the estimation process. The proposed adaptive approach aims to maximize coding efficiency and visual quality across diverse immersive video content.
In
Section 2, we review the state of the art in view selection, outline the issues that may arise during the view selection process, and provide a detailed analysis of how view selection operates in the MIV Main profile. In
Section 3, we describe our proposed view selection method for the DSDE profile, which explicitly accounts for occlusions present in the scene.
Section 4 presents an overview of the experiments, and
Section 5 reports and analyzes the experimental results.
Section 6 contains conclusions and future work.
2. View Selection for Immersive Video
View selection for immersive video is a complex process that requires careful consideration of the specific application scenario. A variety of factors influence the choice of input views in the DSDE scenario: the selected cameras determine not only the accuracy of estimated depth maps but also the quality of the synthesized virtual views [
13], and therefore, ultimately, the overall quality of the immersive experience. Key challenges to consider include scene occlusions, finite image resolution, non-Lambertian surfaces, camera layout and type (e.g., linear, omnidirectional), and the decoding hardware’s computational limits. In the following sections, we survey several view selection strategies designed to address these diverse problems and use cases.
2.1. View Selection Optimized for Virtual View Synthesis
This section describes the problem of selecting the best input views for the virtual view synthesis process in free viewpoint television (FTV) systems. As such systems require real-time synthesis, it is essential to limit the number of input views in order to maintain reasonable computational time. Adding more views to view synthesis [
14,
15,
16] increases the quality of virtual views; nevertheless, it also increases computational time [
6]. Therefore, an effective view selection method should aim to balance the quality of synthesized virtual views with the need to minimize computational complexity.
In the simplest implementations, a virtual view is synthesized from two manually selected real views [
17]. The method described in [
13] deals with the problem of choosing two views that will guarantee the highest quality of the synthesized virtual view. To achieve this, three main view synthesis challenges were taken into account:
Occlusions: Gaps occur where real views do not overlap; choosing cameras closest to the virtual viewpoint minimizes these holes.
Finite resolution: Objects appear at different scales across views; projecting from the views where each object is largest preserves geometric continuity.
Non-Lambertian reflectance: Surface brightness varies with angle; using the two cameras nearest the virtual position ensures more consistent lighting.
Through addressing the above challenges, it was shown that the best quality of the synthesized virtual view can be obtained when using two neighboring real views (nearest left and right), and this is true for any camera arrangement. However, this research considers view selection for virtual view synthesis in a scenario where all of the real views are available. In the scenario analyzed for the purpose of this article, we have only a subset of real views available at the decoder-side for depth estimation and virtual view synthesis. Consequently, it is not always possible to select the nearest-right and nearest-left views, and it is therefore necessary to investigate which selection strategy can provide a quality level closest to the scenario described in [
13].
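The nearest-left and nearest-right rule, and the way it degrades when only a subset of views reaches the decoder, can be sketched as follows. This is a minimal illustration for a one-dimensional rig; the function name `nearest_neighbors` and the use of scalar camera positions are assumptions made here, not part of [13].

```python
from bisect import bisect_left

def nearest_neighbors(available_positions, virtual_position):
    """Return the (left, right) camera positions closest to a virtual viewpoint.

    Positions are scalar coordinates along a 1D rig (an assumption of this
    sketch). A side with no available camera is reported as None.
    """
    positions = sorted(available_positions)
    i = bisect_left(positions, virtual_position)
    left = positions[i - 1] if i > 0 else None
    right = positions[i] if i < len(positions) else None
    return left, right
```

With all cameras available, the two returned views are the immediate neighbors of the virtual viewpoint. With a transmitted subset, the returned views may be much farther away, or missing on one side, which is the situation the DSDE scenario has to cope with.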
2.2. View Selection Optimized for Transmission
This section presents the problem of optimal selection of input views transmitted within the video bitstream [
18] in immersive video systems for the MIV Main profile. In this scenario, only a subset of views available at the encoder side can be transmitted to the decoder. Selecting proper real views for transmission is crucial to achieving good quality on the decoder side. The method described in the previous section is unsuitable for this scenario, as it would require knowing which view was selected by the user prior to transmission, which is not possible.
The most straightforward approach to this scenario is using multiview simulcast coding. Unfortunately, it results in high bitrate and pixelrate [12], making this approach impractical for an immersive video system.
Another approach is to extend the method described in the previous section. View selection must then be performed in a way that guarantees the highest average quality of the synthesized views independently of the user’s viewpoint. For this purpose, Ref. [18] presents the results of a simulation of a simple practical immersive video system with the assumption of a reasonable pixelrate [12] and number of cameras [19]. The results showed that the best quality in the transmission scenario is achieved when cameras are distributed evenly; however, for omnidirectional content, input views should be taken from the horizontal axis rather than the vertical one. The authors confirmed this through subjective tests. As noted in Section 1, the MIV DSDE profile differs materially from the MIV Main profile. While [18] describes a view selection strategy tailored to MIV Main, we contend that the DSDE scenario requires a different approach; our proposed method is presented in Section 3.
2.3. Impact of Camera Arrangement on Depth Estimation
The methods of view selection described in the previous sections did not consider one crucial process used in the creation of immersive video: depth estimation. When cameras are too far apart, fewer pixels are captured by at least two views, which negatively influences depth estimation, since any scene point must be visible in two or more images to have its depth reliably calculated, whereas occluded regions can only be interpolated or extrapolated. Moreover, larger baselines amplify lighting and reflectance discrepancies across views, complicating depth matching between views and degrading both depth maps and synthesized views.
There are other efficient camera setups [
20,
21,
22,
23,
24], but these techniques require more input information, e.g., geometry of objects, which cannot be predicted in practical immersive video systems.
Stereopair grouping has been proposed to address these issues without increasing the number of cameras [
25]. By organizing cameras into closely spaced pairs, each pair shares nearly identical viewpoints and lighting conditions, minimizing intra-pair occlusions and ensuring most scene points are visible to at least two cameras. However, the short baseline intrinsic to each pair limits depth precision. The solution is a hierarchical approach: Use long-baseline pairs (drawn from different stereo pairs) for precise depth estimation where possible, and rely on dense, short-baseline pairs to fill in occluded or poorly observed regions.
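The trade-off between baseline length and depth precision mentioned above can be made explicit with the standard relation for a rectified pinhole stereo pair. This is a textbook sketch rather than a formula taken from [25]; the symbols $f$ (focal length in pixels), $B$ (baseline), $\delta$ (disparity in pixels), and $Z$ (depth) are introduced here only for illustration.

```latex
Z = \frac{f\,B}{\delta}
\qquad\Longrightarrow\qquad
|\Delta Z| \approx \frac{Z^{2}}{f\,B}\,|\Delta\delta|
```

For a fixed disparity error $|\Delta\delta|$, the depth error shrinks in proportion to $1/B$, which is why long-baseline combinations are preferred for precision, while short-baseline pairs mainly serve to keep scene points visible in at least two views.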
To present experimental results, the authors of [
25] decided to use the PSNR metric because of its simplicity [
26,
27]. In the results, PSNR gains are split into baseline-adjustment and occlusion reduction. While long baselines improve depth, uniform layouts encounter a problem of a large number of occlusions in complex scenes [
28,
29], which causes horizontal displacements in virtual views. In typical two-step view-synthesis pipelines [
30,
31], occluded regions are inpainted and have lower quality. Empirical results indicated that when occlusions exceed roughly 20–25% of the scene, stereo-pair arrangements dramatically outperform uniform layouts, striking the optimal balance between depth accuracy and coverage in challenging immersive-video scenarios.
The experiments in [
25] were conducted using the DERS [
29] depth estimation algorithm, which is now outdated. In the present work, we employ IVDE [
32], the current reference software for depth estimation, and perform experiments both on the sequences used in [
25] and on new test sequences created specifically for research on immersive video [
33].
3. The Proposal—Transmitted View Selection for DSDE
In immersive video systems with decoder-side depth estimation (DSDE), only a subset of available input views can be transmitted within the bitstream due to bitrate and pixelrate constraints [
12]. Usually, a uniform selection of views is used [
34] for limiting the number of transmitted views. However, such a strategy does not guarantee the highest quality on the decoder side.
The goal of the proposed method is to maximize the quality of synthesized virtual views while preserving the number of transmitted views and the overall bitrate and pixelrate.
The method relies on the analysis of occluded regions across input views. Occlusions are detected by reprojecting 3D points between views and identifying pixels that are visible in only one view (i.e., pixels for which depth cannot be estimated). In the proposed approach, occlusion detection is implemented using the reprojection module already available in the TMIV v16.0 encoder [35]. First, a fixed number of evenly spaced input views is chosen (four in this paper), and each of them is reprojected to the remaining ones. For each pixel, the reprojected depth is compared with the original depth in the target view. A pixel is considered “visible” in the target view if the reprojected depth does not exceed the original depth (i.e., it is not occluded). The percentage of pixels visible in fewer than two views is then accumulated. This provides a direct estimate of the occlusion ratio, i.e., the proportion of regions where depth cannot be reliably estimated.
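The visibility test and the accumulation step described above can be sketched as follows. This is not the TMIV reprojection module: the reprojection itself is assumed to have been done already, and the names `visible` and `occlusion_ratio`, the array layout, and the convention that larger depth values mean farther points are all assumptions of this illustration.

```python
import numpy as np

def visible(reprojected_depth, target_depth, tolerance=0.0):
    """Per-pixel visibility test: a reprojected point counts as visible when
    it is not behind the surface stored in the target view.

    Assumes depth is a distance (larger = farther); `tolerance` is a
    hypothetical slack parameter, not taken from the paper.
    """
    return np.asarray(reprojected_depth) <= np.asarray(target_depth) + tolerance

def occlusion_ratio(visibility_per_view):
    """Fraction of pixels seen by fewer than two views.

    `visibility_per_view` holds one boolean array per analyzed source view,
    shaped (n_other_views, height, width): entry [k, y, x] says whether
    source pixel (y, x) is visible in the k-th other view.
    """
    occluded = 0
    total = 0
    for vis in visibility_per_view:
        vis = np.asarray(vis, dtype=bool)
        seen_by = 1 + vis.sum(axis=0)  # the source view itself counts once
        occluded += int(np.count_nonzero(seen_by < 2))
        total += seen_by.size
    return occluded / total
```

Averaging over all analyzed source views is one plausible reading of "accumulated"; a per-view maximum would be an equally simple alternative.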
To avoid fluctuations of the selected views within a single group of pictures (GOP)—which would significantly decrease video compression efficiency—the analysis is performed once per GOP (i.e., on the first frame of the GOP) and the resulting view selection remains fixed for the entire GOP. Because the underlying TMIV module is already highly optimized, the computation introduces only a negligible overhead. A detailed analysis of the processing time is provided in
Section 5.3.
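The once-per-GOP scheduling can be illustrated with a short sketch. The callable `analyze` is a hypothetical stand-in for the occlusion analysis and view selection; it is not a TMIV function.

```python
def per_gop_selection(n_frames, gop_size, analyze):
    """Yield (frame_index, selection) pairs.

    `analyze(frame_index)` is called only for the first frame of each GOP;
    its result is reused unchanged for the remaining frames of that GOP.
    """
    selection = None
    for frame_index in range(n_frames):
        if frame_index % gop_size == 0:
            selection = analyze(frame_index)
        yield frame_index, selection
```

Keeping the selection fixed inside a GOP means that the atlas layout does not change between frames that the video encoder predicts from one another.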
If a sequence contains a high ratio of occlusions (inestimable areas), the algorithm selects multiple camera stereopairs instead of uniformly arranged single cameras (see Figure 5). As shown in [25], such an approach reduces the number of non-estimated regions, thus increasing the overall quality of synthesized views. On the other hand, when the quantity of occlusions is lower, uniformly distributed views are chosen to maximize coverage and spatial diversity.
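The switching rule can be sketched for a one-dimensional rig with four transmitted views. The default threshold of 0.10 follows the approximate 10% value reported in Section 5; the function name `select_views`, the `pair_gap` parameter, and the choice to anchor the two stereopairs at the ends of the rig are assumptions of this illustration.

```python
def select_views(n_cameras, occlusion_ratio, threshold=0.10, pair_gap=1):
    """Pick four camera indices from a 1D rig of `n_cameras` cameras.

    High occlusion ratio -> two stereopairs (assumed here to sit at both
    ends of the rig, `pair_gap` cameras apart within each pair).
    Low occlusion ratio  -> four approximately evenly spaced cameras.
    """
    if n_cameras < 4:
        raise ValueError("at least four cameras are required")
    last = n_cameras - 1
    if occlusion_ratio > threshold:
        if pair_gap < 1 or 2 * pair_gap >= last:
            raise ValueError("pair_gap does not fit into the rig")
        return [0, pair_gap, last - pair_gap, last]
    return [round(i * last / 3) for i in range(4)]
```

For example, with ten cameras an occlusion ratio of 0.05 gives the evenly spaced indices 0, 3, 6, 9, while a ratio of 0.20 gives the paired indices 0, 1, 8, 9.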
The proposed approach is straightforward to implement within an MIV encoder (e.g., in the Test Model for MPEG immersive video reference software, TMIV [
35]) and does not require any changes on the decoder side.
Moreover, by selecting views captured by stereopairs instead of evenly distributed cameras, the method can also reduce the total bitrate of an immersive video, as similar neighboring views packed into a single atlas may be compressed more efficiently (especially when screen-content coding tools are used by a video encoder [
34]).
4. Overview of Experiments
To comprehensively evaluate the effectiveness of the proposed view selection method, two complementary experiments were conducted:
“MIV experiment”—evaluation using modern immersive video content.
“Supplementary experiment”—evaluation using legacy, classical FTV multiview sequences.
Both experiments employed exactly the same processing pipeline (defined in MIV common test conditions, MIV CTC [
33]), using the same software:
TMIV reference software [
35] for creating atlases, view selection, and view synthesis on the decoder side.
VVenC + VVdeC [36] for VVC [9] atlas encoding and decoding.
IVDE reference software [
32] for decoder-side depth estimation.
All the parameter settings except one were aligned with MIV CTC. The only intentional change—applied consistently in both experiments—was restricting the system to transmit only four views in total (all packed into a single atlas). This design choice reflects the goal of evaluating the influence of camera pairing and spacing in the most controlled and interpretable setting. By focusing on a minimal, one-dimensional camera layout (cameras placed along a line or an arc), the experiment isolates the effect of horizontal baselines on video coding, depth estimation, and synthesized view quality, without confounding factors introduced by multidimensional camera rigs or additional transmitted views.
The four-view configuration also represents the simplest non-trivial case in which different pairing strategies meaningfully affect view synthesis. More complex scenarios (e.g., multi-atlas setups or two-dimensional camera arrangements) are natural extensions of this study, but addressing them would require analyzing several factors at once. Our intention was to start with a clean, analyzable scenario, establishing conclusions that can later be generalized to higher-dimensional arrangements and a higher number of transmitted views.
Both experiments differ solely in the test sequences, allowing for assessing whether the proposed methodology is consistent across datasets with different capture characteristics, resolutions, and scene geometries.
4.1. MIV Experiment
The main experiment was conducted on the modern immersive video content from the MIV CTC [
33]. The MIV CTC test set contains 21 miscellaneous multiview sequences. However, only six of them satisfy the requirements of this study, i.e., multicamera setups with cameras arranged approximately along a line or along an arc. Such setups are essential for analyzing how different horizontal baselines influence immersive video processing in the DSDE scenario.
To systematically evaluate the influence of camera spacing, a uniformity coefficient $u$ was introduced:
$$u = \frac{b}{d},$$
where $b$ is the baseline of each stereopair, and $d$ is the distance between two stereopairs (i.e., the distance between two middle cameras within the selected four), cf. Figure 6.
This parameter quantifies the uniformity of camera placement across the camera setup, where a coefficient equal to 1 corresponds to a perfectly uniform distribution (with equal distances between all neighboring cameras), and smaller values indicate non-uniform layouts with two stereopairs of cameras. A uniformity coefficient greater than 1 corresponds to a situation where the distance between two camera pairs is smaller than the baseline of each stereopair (i.e., the arrangement with a stereopair in the middle and a single camera on each side).
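Reading the coefficient as the ratio of the stereopair baseline to the distance between the two middle cameras (the interpretation implied by the three cases described above), it can be computed from four camera positions as sketched below. The function name and the averaging of the two pair baselines are assumptions of this sketch.

```python
def uniformity_coefficient(positions):
    """Ratio of stereopair baseline to the distance between the two pairs.

    `positions` are the 1D coordinates of the four selected cameras.
    The left and right pair baselines are averaged in case they differ.
    """
    p = sorted(positions)
    if len(p) != 4:
        raise ValueError("exactly four camera positions are expected")
    baseline = ((p[1] - p[0]) + (p[3] - p[2])) / 2
    pair_distance = p[2] - p[1]
    return baseline / pair_distance
```

Equally spaced cameras give a value of 1, two tight pairs at the ends of the rig give a value below 1, and a tight middle pair with one camera on each side gives a value above 1.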
The evaluation was based on the IV-PSNR metric, which measures the fidelity of the synthesized virtual views compared to reference ones, taking into account typical immersive video characteristics [
37]. To assess the rate-distortion performance, BD-IVPSNR values were computed. The authors chose BD-IVPSNR instead of typical BD-rates because the RD-curves for different configurations did not always overlap. Therefore, the BD-IVPSNR metric ensured consistent comparison of the results.
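A Bjøntegaard-delta quality value of this kind is commonly obtained by fitting each RD curve with a cubic polynomial in the logarithm of the bitrate and averaging the vertical gap between the two fits over the shared bitrate range. The sketch below follows that common recipe with IV-PSNR values as input; whether the evaluation in this paper used exactly this fitting procedure is an assumption, and `bd_psnr` is a name chosen here.

```python
import numpy as np

def bd_psnr(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average quality difference (test minus reference) in dB.

    Each curve needs at least four RD points. Curves are fitted with a
    cubic polynomial of quality versus log10(bitrate) and compared over
    the bitrate interval covered by both curves.
    """
    log_ref = np.log10(np.asarray(rates_ref, dtype=float))
    log_test = np.log10(np.asarray(rates_test, dtype=float))
    fit_ref = np.polyfit(log_ref, np.asarray(psnr_ref, dtype=float), 3)
    fit_test = np.polyfit(log_test, np.asarray(psnr_test, dtype=float), 3)

    lo = max(log_ref.min(), log_test.min())
    hi = min(log_ref.max(), log_test.max())
    if hi <= lo:
        raise ValueError("the two curves share no bitrate range")

    def area(fit):
        # Definite integral of the fitted polynomial over [lo, hi].
        antiderivative = np.polyint(fit)
        return np.polyval(antiderivative, hi) - np.polyval(antiderivative, lo)

    return float((area(fit_test) - area(fit_ref)) / (hi - lo))
```

A positive result means that the tested configuration delivers higher quality than the reference at the same bitrate, averaged over the common range.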
Moreover, for each test sequence, occlusion maps were also generated in order to assess the proportion of scene areas not visible in multiple input views. This information was used to analyze how camera arrangement affects depth estimation and view synthesis quality for different levels of scene complexity.
The obtained results and their interpretation are discussed in
Section 5, where the relationship between occlusion percentage, camera arrangement, and the quality of synthesized virtual views is analyzed in detail.
4.2. Supplementary Experiment
To validate the generality of the conclusions beyond modern immersive datasets, a supplementary experiment was performed using several classical FTV sequences: five BigBuckBunny sequences [
38] (BBB Butterfly Arc, BBB Flowers Arc, BBB Rabbit Arc, BBB Butterfly Linear, and BBB Rabbit Linear), Bee [
39], and three sequences from Nagoya University [
40] (Champagne, Dog, and Pantomime). Although these sequences are of lower resolution and are no longer used in standardization activities, they provide diverse content characteristics and dozens of input views, allowing for analysis of multiple uniformity coefficients.
Moreover, the choice of sequences creates the possibility to partially compare the results of this research with the results presented in [
25], where the authors analyzed the influence of camera pairing on depth estimation and view synthesis quality, but without the use of any video compression.
Crucially, the same experimental pipeline was used: TMIV + IVDE + VVenC/VVdeC, following MIV CTC guidelines and the same four-view, single-atlas constraint. This allowed a direct comparison of trends observed across datasets.
5. Experimental Results
5.1. MIV Experiment
Figure 7 presents a quality increase caused by camera pairing in comparison with uniform view distribution. Gain in quality is clearly visible for CBABasketball and MartialArts sequences, as well as the Frog and Fencing sequences, indicating that the camera pairing approach provides a noticeable advantage in scenes with complex geometry and frequent occlusions. The percentage of occlusions in each test sequence is presented in
Table 1.
The existence of the relationship between the arrangement of selected real views and the amount of occlusions in the scene is further confirmed by the rate–distortion curves presented in
Figure 8. For the sequences with more occlusions, the proposed view selection strategy consistently achieves higher IV-PSNR for a given bitrate.
Because the RD-curves for different configurations did not overlap, BD-rates could not be computed; therefore, in order to assess the average quality improvement across bitrates, the BD-IVPSNR values were used (
Figure 9).
Figure 9 summarizes the BD-IVPSNR gains as a function of the uniformity coefficient. The results confirm that pairing of the cameras (with a reasonable baseline—uniformity coefficient in the range [0.2, 0.7]) leads to a measurable quality increase for most of the tested sequences, especially for those with more occlusions.
For scenes with very limited occlusions (e.g., Carpark, Street), both the IV-PSNR gains and BD-IVPSNR gains caused by camera pairing remain negligible. In such scenes, all the cameras observe nearly the same content, and an area with non-estimable depth is small for all camera arrangements. As a result, camera pairing does not introduce additional geometric cues that would noticeably improve depth estimation or view synthesis. On the other hand, uniform camera distribution—as described in
Section 2.1 and [
13]—minimizes problems with finite resolution of depth maps and existence of non-Lambertian reflections in the scene. Therefore, for scenes with limited occlusions, the proposed view selection method purposely selects views distributed uniformly.
The results demonstrate that the proposed view selection method brings clear benefits when the proportion of occluded regions exceeds approximately 10% (cf. Table 1).
In such cases, pair-wise camera grouping improves the reconstruction of disoccluded areas and reduces geometric distortions in synthesized views. On the other hand, for sequences with minimal occlusions (e.g., Carpark, Street), a uniform camera distribution remains sufficient, and camera pairing provides no significant quality gains (
Table 2 and
Figure 10 and
Figure 11).
Overall, the experimental results confirm that the spatial arrangement of cameras has a measurable effect on the final quality of content presented to the user of an immersive video system. As presented, the pairwise view selection optimally balances occlusion handling and the precision of depth estimation, making it particularly suitable for complex scenes where occluded areas are a significant part of the visible content.
5.2. Supplementary Experiment
Figure 12 presents a quality increase caused by camera pairing in comparison with uniform view distribution. All sequences used in the supplementary experiment contain enough input views to provide results for ten different uniformity coefficients (including a coefficient equal to 1, representing uniform camera arrangement). Among all test sequences, there are three for which camera pairing always introduces quality loss: BBB Rabbit Arc, BBB Butterfly Arc, and Dog. As presented in
Table 3, these three sequences are characterized by the lowest percentage of occlusions. For the remaining sequences, in which the occlusion ratio is higher, camera pairing with a sufficiently large baseline (uniformity coefficient > 0.5) provides gains in terms of the quality of synthesized views.
To more clearly illustrate the relationship between occlusions (
Table 3) and the effectiveness of camera pairing (
Figure 12),
Figure 13 presents a scatterplot combining occlusion ratio and IV-PSNR gain. For all sequences, the results obtained for the arrangement with a uniformity coefficient of 0.7 are shown. The results confirm the outcomes from the MIV experiment: camera pairing yields consistent quality benefits when the proportion of occluded regions exceeds approximately 10%. In opposite cases, a uniform camera arrangement outperforms the pair-wise camera layout in terms of virtual view quality.
An additional observation concerns the three sequences with the highest occlusion ratios (above 30%), for which the quality gains from camera pairing are present but smaller than could be expected given the strong advantage of pair-wise camera setups. This behavior is consistent with the limitations of current immersive video pipelines: when the scene becomes overly complicated (the proportion of occluded areas becomes very large), the accuracy of depth estimation and view synthesis is constrained by the lack of information. In such cases, camera pairing cannot fully compensate for the severe geometric ambiguity, and the quality gains are restricted by incomplete scene information. Nevertheless, even in such challenging scenarios, the pair-wise camera layout still outperforms the uniform arrangement, indicating that the proposed approach remains beneficial across all tested content difficulty levels.
It is important to highlight that the 10% threshold—however consistent in both presented experiments—is smaller than a similar threshold reported in [
25], where it exceeded 20%. The difference in the occlusion threshold used in this article, when compared with research conducted in [
25], originates from the evolution of processing tools used for depth estimation and view synthesis. The study conducted in [
25] relied on the DERS [
29] algorithm for depth estimation and VSRS [
41] software for virtual view synthesis. For the purpose of this article, the authors used IVDE v7.0 [
32] depth estimation software and TMIV’s view weighting synthesizer (VWS) [
35] for virtual view synthesis. Both IVDE and VWS were developed as successors to DERS and VSRS, and they contain improvements such as inter-view and temporal consistency (IVDE) and the ability to synthesize views based on more than two input views (VWS). As a result, the newer software provides more robust depth estimation, improved handling of challenging or weakly textured regions, and significantly higher-quality virtual view synthesis. These improvements reduce the sensitivity of the system to missing information and therefore shift the effective threshold at which stereo-pair grouping becomes advantageous.
To illustrate the evolution of immersive video processing pipeline efficiency, we computed PSNR values for all tested sequences and uniformity coefficients, keeping the same methodology as in [25] (PSNR for the luma component only, averaged over all virtual views). The obtained results are presented in Table 4.
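The luma-only PSNR averaged over virtual views can be sketched as below. The function names, the 8-bit default, and the averaging of per-view PSNR values (rather than of per-view MSE) are assumptions of this illustration.

```python
import numpy as np

def luma_psnr(reference, synthesized, bit_depth=8):
    """PSNR in dB between two luma planes of equal size."""
    ref = np.asarray(reference, dtype=np.float64)
    syn = np.asarray(synthesized, dtype=np.float64)
    mse = np.mean((ref - syn) ** 2)
    if mse == 0:
        return float("inf")
    peak = 2 ** bit_depth - 1
    return float(10.0 * np.log10(peak ** 2 / mse))

def mean_luma_psnr(view_pairs, bit_depth=8):
    """Average luma PSNR over (reference, synthesized) virtual-view pairs."""
    return float(np.mean([luma_psnr(r, s, bit_depth) for r, s in view_pairs]))
```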
Table 5 contains differences between results gathered in
Table 4 and the results reported in [
25]. As presented, for most of the content, the quality is significantly higher when using the modern immersive video processing pipeline (TMIV + IVDE) than with the use of legacy depth estimation and view synthesis software (DERS + VSRS). The only exceptions are the Bee and Dog sequences (and Pantomime for several uniformity coefficients), where the multiview-based synthesis used in VWS performed worse than the simple two-view synthesis used in the VSRS algorithm.
Taken together, the results in Table 4 and Table 5, and Table 2 from [25], illustrate how the substantial progress in depth estimation and view synthesis achieved over the past decade fundamentally changed the system’s sensitivity to occluded content, fully justifying the lower empirical occlusion threshold observed in this work.
5.3. Computational Overhead
The coding pipeline of the proposal follows the standard TMIV workflow, with all modules and settings unchanged, except for the addition of the occlusion analysis step. The proposed occlusion analysis is executed using the effective and fast reprojection module already implemented within the TMIV v16.0 software [
35]. To compute the occlusion ratio, four evenly spaced input views are selected, and each view is reprojected onto the remaining ones to determine the proportion of pixels that are visible by fewer than two cameras (i.e., for which the depth is not estimable).
To maintain encoding efficiency, the occlusion analysis is performed once per GOP (only for the first frame of GOP), and the resulting view selection persists for the entire GOP. Such an approach additionally decreases the computational overhead introduced by the proposed approach.
As presented in
Table 6, the time required for performing the introduced occlusion analysis is negligible when compared to the total TMIV encoding time. In both experiments, the proposal increased the computational time by less than 0.5%. Moreover, the proposal does not change the decoding time, which is crucial in any practical video system.
The presented results demonstrate that the proposed method is computationally efficient and practical for real-world immersive video encoding, providing an efficient and reliable view selection with minimal impact on processing time.
6. Conclusions
In this paper, we propose an adaptive view selection method for the MIV DSDE profile that dynamically switches between a uniform camera layout and a layout in which cameras are grouped into stereo pairs. This decision is made based on an analysis of the occlusion level in the scene, performed at the encoder.
Our experimental results, obtained within the TMIV reference software, quantitatively validate this approach. We demonstrate a clear correlation between the percentage of occluded areas and the optimal camera layout. A key finding is the identification of a decision threshold: for scenes with occlusion levels exceeding approximately 10%, the stereo pair grouping strategy yields a significant and measurable gain in quality (up to 2 dB BD-IVPSNR). Below this threshold, the traditional uniform layout is sufficient or even more effective; thus, the uniform layout is automatically chosen by the proposed method.
However, it should be emphasized that this threshold is not a universal constant, but rather a consequence of the performance of the depth estimation, view synthesis, and video encoding algorithms used in an immersive video pipeline. In the modern MIV-based system, where TMIV with its efficient view weighting synthesizer [
35], together with IVDE [
32] for decoder-side depth estimation, is used, the threshold is equal to approximately 10%. Historically, when legacy state-of-the-art depth estimation and view synthesis were used (DERS [
29] and VSRS [
41], respectively), the effective threshold was significantly higher, reaching 20–25% for comparable multiview setups [
25]. This evolution reflects the continuous improvement of geometric reconstruction of 3D scenes using modern immersive video processing techniques. Therefore, the empirical threshold reported in this paper should be interpreted as typical for modern MIV DSDE pipelines.
Overall, the presented results confirm that the spatial arrangement of transmitted views has a measurable impact on the quality of video watched by a viewer in a DSDE immersive video system. The proposed adaptive method effectively balances occlusion handling and geometric accuracy, making it suitable for practical use.
For future work, while this study confirmed the benefit of switching to stereo pairs, further investigation could focus on automating the selection of the optimal baseline (represented by the “uniformity coefficient”) for those pairs, potentially adapting it dynamically based on scene geometry. Furthermore, exploring alternative or combined scene analysis metrics beyond a simple occlusion percentage could lead to an even more robust and fine-grained decision model for view selection.