Article
Peer-Review Record

Content-Adaptive Bitrate Ladder Estimation in High-Efficiency Video Coding Utilizing Spatiotemporal Resolutions

Electronics 2024, 13(20), 4049; https://doi.org/10.3390/electronics13204049
by Jelena Šuljug * and Snježana Rimac-Drlje
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 15 August 2024 / Revised: 28 September 2024 / Accepted: 10 October 2024 / Published: 15 October 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

One of the key contributions of this work is the development of two neural network models: one to estimate the TR and BR values representing the switching points to a higher SR, and the other to estimate SSIM values based on SI, TI, TR, and SR. However, some key terms are not clearly defined in the paper. For example, the calculation of Spatial Information (SI) and Temporal Information (TI) presented in Figure 1 is unclear. Additionally, the details provided about the flowchart in Figure 3 are insufficient. It would be helpful to clarify the following:

 

       1.    What is the underlying neural network architecture?

       2.    Why was a two-layer feedforward network chosen instead of a convolutional neural network (CNN)?

       3.    What loss function was used during the training process?

       4.    In Table 2, how is “performance” calculated, and what metric is used?

       5.    In line 342, why did the training halt at epoch 17? Was it simply because the gradient value was close to zero?

       6.    What is the “validation check” mentioned in line 351?

       7.    In line 353, the statement “the neural network demonstrated effective learning and strong performance” is not supported by the explanations provided in the paper.

 

Overall, the paper needs to be revised to clearly define the terms and metrics, as well as the equations used in the training and testing process. Without these clarifications, readers may find it challenging to see the evidence supporting the contributions claimed by the authors.

Comments on the Quality of English Language

The quality of the English Language is acceptable!

Author Response

  1. The calculation of Spatial Information (SI) and Temporal Information (TI) presented in Figure 1 is unclear.

 

Response: Spatial Information (SI) and Temporal Information (TI) were calculated using established metrics for measuring video complexity. The Spatial Information metric represents the degree of spatial variation in a video frame and is computed using the variance of pixel intensities, preceded by Sobel edge detection. The Temporal Information metric measures motion between successive frames, using frame differences. SI and TI are computed as time averages using equations that are defined in ITU-T P.910 (10/2023) guidelines [11]. (Lines 228-234)
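For illustration, the SI/TI computation described above can be sketched as follows. This is a minimal NumPy sketch on greyscale frames, not the authors' implementation; the `si_ti` function name is ours, and the use of time averages (rather than the per-sequence maxima also defined in ITU-T P.910) follows the description given in the response.

```python
import numpy as np

def _sobel_magnitude(frame):
    """Gradient magnitude of a greyscale frame via 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = frame.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            gx += kx[i, j] * frame[i:i + h - 2, j:j + w - 2]
            gy += ky[i, j] * frame[i:i + h - 2, j:j + w - 2]
    return np.hypot(gx, gy)

def si_ti(frames):
    """SI: spatial std-dev of Sobel-filtered frames; TI: std-dev of
    successive frame differences; both averaged over time (per P.910)."""
    si = [float(np.std(_sobel_magnitude(f))) for f in frames]
    ti = [float(np.std(b - a)) for a, b in zip(frames, frames[1:])]
    return float(np.mean(si)), float(np.mean(ti))
```

A static sequence yields TI = 0, while a frame containing a sharp edge yields a positive SI, matching the intuition that SI captures spatial detail and TI captures motion.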

 

  2. Additionally, the details provided about the flowchart in Figure 3 are insufficient.

 

Response: The flowchart explanation in lines 316-336 gives only an overview of the process, which is explained in detail in lines 233-313, 340-347, 409-452, and 490-504. The structure of the paper has been refined to improve coherence, ensuring that detailed explanations are now more closely aligned with the flowchart. This enhances the logical flow and clarity of the methodology, allowing readers to better understand the step-by-step process and how each element of the approach is interconnected.

 

  3. What is the underlying neural network architecture?

 

Response: The neural network architecture used in this study is a two-layer feedforward network with sigmoid hidden neurons and linear output neurons. The first layer, or hidden layer, consists of neurons with a log-sigmoid activation function, which introduces nonlinearity, allowing the network to learn complex, nonlinear relationships between the input and output vectors. The second layer, or output layer, employs a linear activation function, which is typical for function fitting or nonlinear regression problems, as it ensures that the network can produce continuous output values. This architecture enables the approximation of any function with a finite number of discontinuities, provided there are sufficient neurons in the hidden layer. The use of nonlinear activation in the hidden layer combined with a linear output layer ensures that the network can generalize effectively, making it suitable for applications such as video quality prediction and bitrate ladder estimation. (Lines 365-378)
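A rough sketch of the forward pass for such an architecture is given below. This is an illustration, not the authors' actual implementation; the hidden-layer width of 10, the class name, and the random initialization are our own illustrative choices. The `mse` function shows the mean-squared-error loss discussed later in this response letter.

```python
import numpy as np

def logsig(x):
    """Log-sigmoid activation used in the hidden layer."""
    return 1.0 / (1.0 + np.exp(-x))

class TwoLayerFitNet:
    """Two-layer feedforward net: log-sigmoid hidden layer, linear output."""
    def __init__(self, n_in, n_hidden=10, n_out=1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = logsig(self.W1 @ x + self.b1)  # nonlinear hidden layer
        return self.W2 @ h + self.b2       # linear output (regression)

def mse(pred, target):
    """Mean squared error, the training loss for this regression task."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))
```

The linear output layer is what makes continuous targets such as SSIM or bitrate values reachable; a sigmoid output would bound predictions to (0, 1).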

 

  4. Why was a two-layer feedforward network chosen instead of a convolutional neural network (CNN)?

 

Response: The reason for not using a convolutional neural network (CNN) is that our input data consists of extracted video metrics (SI, TI, etc.) rather than pixel-level image data, making a CNN less suitable. Feedforward networks are more appropriate for this type of input, which is numerical and derived from the statistical features of the video.

 

  5. What loss function was used during the training process?

 

Response: The loss function used during training was a mean squared error, which is a common choice for regression tasks where the goal is to minimize the difference between predicted and actual values. (Lines 374-378)

 

  6. In Table 2, how is “performance” calculated, and what metric is used?

 

Response: The performance metric in Table 2 refers to the mean squared error (MSE) of the network during training. This metric was used to evaluate how well the model's predictions matched the actual values. The lower the MSE, the better the model's performance. (Lines 356-359 and 369-370)

 

  7. In line 342, why did the training halt at epoch 17? Was it simply because the gradient value was close to zero?

 

Response: The training stopped early at epoch 17 due to a validation check mechanism, which monitors the model's performance on a separate validation dataset. When the validation error stops improving, early stopping is triggered to prevent overfitting. In this case, the gradient had decreased sufficiently, and further training would likely lead to overfitting. (Lines 381-392)

 

  8. What is the “validation check” mentioned in line 351?

 

Response: A portion of the dataset (15% in our case) was set aside from the training data and utilized for validation purposes. This process is essential for monitoring the model's generalization ability, ensuring that the neural network does not overfit to the training data and performs well on unseen examples. Training proceeds until the stopping criterion is satisfied, specifically when the validation error (the MSE that the NN model achieves on the validation dataset in a given training epoch) exceeds or equals the minimum validation error achieved in previous iterations. A validation check is counted each time the validation error fails to improve; if no improvement is observed after several consecutive epochs, training halts. This prevents overfitting, where the model might perform well on training data but poorly on unseen data. (Lines 381-392)
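The validation-check rule described above can be sketched as follows. This is a simplified illustration: the patience value `max_fail=6` matches MATLAB's default for feedforward training but is an assumption on our part, and the error sequence in the usage note is synthetic.

```python
def stopping_epoch(val_errors, max_fail=6):
    """Return the epoch at which training halts: each epoch whose
    validation MSE fails to improve on the best seen so far counts as
    one validation check; max_fail consecutive checks stop training."""
    best, fails = float("inf"), 0
    for epoch, err in enumerate(val_errors, start=1):
        if err < best:
            best, fails = err, 0   # improvement resets the check counter
        else:
            fails += 1             # one validation check
            if fails >= max_fail:
                return epoch
    return len(val_errors)
```

With a synthetic validation-error curve that improves for 11 epochs and then plateaus, this rule stops training at epoch 17, illustrating how an early halt of the kind discussed above arises.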

 

  9. In line 353, the statement “the neural network demonstrated effective learning and strong performance” is not supported by the explanations provided in the paper.

 

Response: Thank you for your feedback. We acknowledge that the statement “the neural network demonstrated effective learning and strong performance” could be seen as overly optimistic given the current explanation. To address this, we have revised the statement to provide a more measured assessment of the model's performance, highlighting its satisfactory progress rather than claiming it is perfect. While the NN showed substantial improvement and convergence, the early stopping mechanism may have prevented the model from reaching its full potential, but it was necessary to avoid overfitting. The model performed well in terms of reducing the error and stabilizing the training process, though further training might have led to marginal improvements. We believe the model still provides meaningful results and generalizes well to unseen data, but we agree that there is room for further refinement, which we aim to explore in future work.

 

Reviewer 2 Report

Comments and Suggestions for Authors

1.      The term “optimization” has to be used carefully. It seems to us that using NN and regression cannot be viewed as an optimization approach.

2.      Sec. 2 is “Related Work.” Sec. 3 is “Test Setup.” Sec. 4 is “Results.” The structure makes the readers feel that there is no proposed work.

3.      The structure of the NN should be explicitly shown and explained.

4.      We know that there are few high-resolution test videos and that encoding takes some time. However, the volume of data in this research doesn’t seem large enough for training a NN that can be claimed as being “general.” This may explain some strange values in Table 3. We wonder whether certain rule-based methods can already work very well.

5.      In Figs. 6 and 8, the trends of linear regression are not very convincing. In addition, the explanations of Figs. 8 and 9 are short. If a figure is shown, its details or meaning should be described.

6.      We do not usually see the commands (as shown on Page 6) in a paper.

7.      There is no comparison with existing work. It is hard to evaluate the contribution of this work.

Author Response

  1. The term “optimization” has to be used carefully. It seems to us that using NN and regression cannot be viewed as an optimization approach.

Response: While it is true that neural networks and regression methods are not traditionally seen as optimization approaches, the term was used in a broader context to describe the process of refining model performance to achieve better prediction of bitrate ladders. We will revise the text to clarify that optimization refers to improving predictive accuracy rather than formal mathematical optimization. (Lines 83-85)

  2. Sec. 2 is “Related Work.” Sec. 3 is “Test Setup.” Sec. 4 is “Results.” The structure makes the readers feel that there is no proposed work.

Response: The structure of the manuscript has been revised in order to emphasize the proposed work.

  3. The structure of the NN should be explicitly shown and explained.

Response: We added a detailed description of the neural network structure in the revised manuscript, including a diagram. The architecture is a two-layer feedforward network: a hidden layer of 10 neurons with a log-sigmoid activation function, which introduces nonlinearity and allows the network to learn complex, nonlinear relationships between the input and output vectors, and an output layer with a linear activation function, which is typical for function fitting or nonlinear regression problems, as it ensures that the network can produce continuous output values. This architecture enables the approximation of any function with a finite number of discontinuities, provided there are sufficient neurons in the hidden layer, and ensures that the network can generalize effectively, making it suitable for applications such as video quality prediction and bitrate ladder estimation. (Lines 365-378)

 

  4. We know that there are few high-resolution test videos and that encoding takes some time. However, the volume of data in this research doesn’t seem large enough for training a NN that can be claimed as being “general.” This may explain some strange values in Table 3. We wonder whether certain rule-based methods can already work very well.

Response: While it is true that the dataset size is limited, the use of data augmentation helped increase the training data volume. We acknowledge that larger datasets could potentially improve generalization and we plan to expand the dataset in future work to test that hypothesis. For 4K resolution at 120 fps, there are no established rule-based methods, as there is a lack of research addressing resolution changes at such high levels. Even for lower resolutions, there are insufficient results to establish rules for adjusting both spatial and temporal resolutions. In our previous research, we developed mathematical models for selecting optimal parameters and determining the number of representations for video encoding and segmentation. This was based on the spatial and temporal activity of video content and applied to the H.264 encoder, using objective metrics like the Structural Similarity Index Measure (SSIM) along with Spatial Information (SI) and Temporal Information (TI) to quantify video spatial and temporal activity (https://www.mdpi.com/2079-9292/10/15/1843). However, that study did not account for changes in temporal resolution. In our preliminary work, presented at the ELMAR 2024 conference, which incorporates temporal resolution, we concluded that for low-complexity videos, reducing temporal resolution from 120 fps to 25 fps before adjusting spatial resolution yielded the best results. For medium- and high-complexity content, the spatial resolution should be reduced first. Existing methods typically rely on content-dependent bitrate ladder selection, which requires intensive encoding across all chosen spatial and temporal resolutions, as well as across a wide range of bitrates.

  5. In Figs. 7 and 9, the trends of linear regression are not very convincing. In addition, the explanations of Figs. 9 and 10 are short. If a figure is shown, its details or meaning should be described.

Response: Thank you for your comment. We acknowledge that the linear regression trends in Figures 7 and 9 may not appear fully convincing at first glance. This can be attributed to the inherent complexity and variability of video data, particularly in the context of spatiotemporal features and bitrate estimations. Also, the output of the neural network is continuous, and the linear regression would be more apparent if the outputs were rounded to the discrete frame rate values (25, 30, 60, and 120 fps) used in the initial dataset. However, the models were designed with video streaming applications in mind, where the primary goal is not necessarily achieving perfect regression fits but rather providing robust and efficient predictions that optimize bitrate ladders and ensure good video quality under varying network conditions. Despite the variability in the regression trends, the models demonstrate strong performance in reducing MSE across the training, validation, and test datasets. This ensures that the predictions are sufficiently accurate for the practical requirements of video streaming, such as selecting appropriate bitrate-resolution combinations and ensuring smooth adaptive streaming. The neural networks have shown the ability to generalize well to unseen data and maintain reasonable predictive power, even though the regression lines are not perfectly linear. Thus, while the regression trends might suggest some variability, the models remain suitable for the intended purpose of adaptive video streaming, where real-time responsiveness and efficiency are more critical than achieving exact linear trends. We also agree that a more detailed explanation is needed; the comments for Figures 9 and 10 have therefore been revised. (Lines 506-591)
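The rounding mentioned above, mapping a continuous network output to the nearest of the discrete frame rates used in the dataset, can be sketched in one line (an illustrative helper; the function name is ours, not from the paper):

```python
def snap_to_frame_rate(predicted_fps, allowed=(25, 30, 60, 120)):
    """Map a continuous TR prediction to the nearest discrete frame rate."""
    return min(allowed, key=lambda r: abs(r - predicted_fps))
```

For example, a prediction of 27 fps snaps to 25 fps, while 100 fps snaps to 120 fps.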

  6. We do not usually see the commands (as shown on Page 6) in a paper.

Response: Thank you for your comment. We included the specific commands in the paper to ensure that our work is fully replicable by other researchers. Reproducibility is a key principle in scientific research, and by providing detailed command lines for the encoding and scaling processes, we aim to enable other researchers to precisely replicate our experimental setup and results. Given the technical nature of video processing and encoding, these commands ensure transparency and allow for consistent application of the methods we used. We understand this level of detail is not commonly seen in many papers, but we believe it is crucial for fostering reproducibility and advancing research in this domain.

  7. There is no comparison with existing work. It is hard to evaluate the contribution of this work.

Response: Thank you for pointing out the need for a more explicit comparison with existing work. In the Related Work section, we have discussed various approaches developed by other researchers and industries in the area of bitrate ladder optimization and video streaming. For instance, methods like Netflix’s per-title encoding optimization and Bitmovin’s content-agnostic models are referenced as established approaches that focus primarily on spatial resolution and bitrate optimization, without integrating temporal resolution. In contrast, our contribution introduces a novel approach by optimizing bitrate ladders using both spatial and temporal resolutions in conjunction, which is rarely addressed in the current literature. While existing methods, such as Netflix’s optimization approach, rely on extensive computational processes for every possible combination of spatial and bitrate parameters, our approach leverages NN to estimate switching points, significantly reducing the computational demands. Furthermore, compared to methods that depend solely on ML or polynomial regression for content-adaptive encoding, our work incorporates high temporal resolutions (up to 120 fps), expanding the model’s applicability to modern high-frame-rate video formats, which is a key advancement over most previous works. The only research that incorporates TR as a parameter is presented in [32]. In this work, the authors estimate the convex hull utilizing both PSNR and VMAF as objective metrics. However, their approach does not employ ML or DL techniques for bitrate ladder estimation, which results in a higher demand for computational resources and is not appropriate for comparison. Considering there is a lack of related work that includes data needed for comparison purposes, we added an additional layer of model verification with the Lips video sequence that was not included in the training of initial NN.

Reviewer 3 Report

Comments and Suggestions for Authors

In this manuscript, the authors propose a bitrate gradient optimization method based on the spatiotemporal characteristics of video sequences and the complexity of video content. The authors design a neural network (NN) model for data augmentation to reduce the computational complexity and make it suitable for real-time video streaming. The authors use video sequences with different content complexities for efficient video coding, thus capturing data at multiple spatial and temporal resolutions, and the optimal temporal resolution and bitrate values are estimated by NN model training and data augmentation, which is used as a switching point to a higher spatial resolution. The experimental results show that the method can effectively simplify the process of constructing bitrate gradients and provide practical real-time video streaming solutions for video sequences with specific complexities. In general, this is a valuable work that can be considered for publication in Electronics. Some issues in the manuscript are as follows:

1. It can be noticed that the spatial information and temporal information of the eight videos selected by the authors are more concentrated in smaller magnitudes. Will this affect the development and evaluation of the model? It is suggested that the authors discuss this in the manuscript.

2. The sizes of the captions and fonts on the horizontal and vertical axes of the images in the manuscript are not uniform. Please adjust them.

3. It is suggested that the authors add the predictive inference speed of the proposed NN model to the manuscript.

4. It is suggested that the authors compare the proposed method with the current state-of-the-art methods in the manuscript to highlight its advantages.

5. How does the NN model perform when targeting degradation problems (e.g., noise, jitter) in video quality? Do the authors have plans to better handle these problems? Please discuss this in the manuscript.

Author Response

  1. It can be noticed that the spatial information and temporal information of the eight videos selected by the authors are more concentrated in smaller magnitudes. Will this affect the development and evaluation of the model? It is suggested that the authors discuss this in the manuscript.

Response: Thank you for your observation. The concentration of SI and TI values in a smaller range is indeed a characteristic of the selected video sequences. While this may result in a narrower distribution of data, potentially introducing some bias in the model’s training, we have taken steps to mitigate this effect through data augmentation. This augmentation process increased the diversity of the dataset, allowing the model to generalize better across a broader range of spatial and temporal complexities. Moreover, the neural network was trained to recognize subtle distinctions in video complexity, even within these concentrated initial SI and TI ranges. The inclusion of augmented data helps ensure that the model learns to handle varying degrees of spatial and temporal variation, rather than being overly influenced by the original data's smaller magnitude distribution. Thus, while the initial SI and TI values may appear concentrated, the augmentation and training processes enable the model to perform robustly across a wider range of video complexities. (Lines 330-348)

 

  2. The sizes of the captions and fonts on the horizontal and vertical axes of the images in the manuscript are not uniform. Please adjust them.

Response: We have ensured that the font sizes and axis labels are uniform and properly formatted in the final version.

  3. It is suggested that the authors add the predictive inference speed of the proposed NN model to the manuscript.

Response: Thank you for your suggestion. We have measured the predictive inference speed of the first proposed neural network (NN) model and can confirm that it is fast enough to be used in real-time video streaming applications. During testing, the average inference time per sample was 0.0284 seconds (or 28.4 milliseconds), which comfortably meets the requirements for real-time video streaming. For instance, typical streaming applications, such as those operating at 30 frames per second (fps), require that each frame be processed in under 33 milliseconds. Given the inference speed we observed, our model is well within these operational constraints, ensuring that it can efficiently handle real-time bitrate estimations for adaptive video streaming. Although there remains room for optimization, the current predictive inference speed demonstrates that the neural network is capable of making accurate and timely predictions without introducing delays that would disrupt real-time video streaming. We will include this information in the manuscript to clarify the NN's suitability for real-time applications. (Lines 394-400)
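The per-sample timing measurement described above can be reproduced with a sketch like the following. This is illustrative only: the stand-in `predict` function and the input dimensionality are our assumptions, not the actual trained model.

```python
import time
import numpy as np

def mean_inference_time(predict, samples, warmup=5):
    """Average wall-clock time per prediction, after a few warm-up runs."""
    for x in samples[:warmup]:
        predict(x)                     # warm-up calls, excluded from timing
    t0 = time.perf_counter()
    for x in samples:
        predict(x)
    return (time.perf_counter() - t0) / len(samples)

# Stand-in for the trained NN: a fixed random linear map over 4 inputs.
_W = np.random.default_rng(0).normal(size=(1, 4))
predict = lambda x: _W @ x
```

Averaging over many samples (and discarding warm-up runs) gives a more stable per-sample figure than timing a single call, which is how a value such as 28.4 ms per sample is meaningfully compared against a real-time budget like 33 ms per frame at 30 fps.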

  4. It is suggested that the authors compare the proposed method with the current state-of-the-art methods in the manuscript to highlight its advantages.

 

Response: Thank you for pointing out the need for a more explicit comparison with existing work. In the Related Work section, we have discussed various approaches developed by other researchers and industries in the area of bitrate ladder optimization and video streaming. For instance, methods like Netflix’s per-title encoding optimization and Bitmovin’s content-agnostic models are referenced as established approaches that focus primarily on spatial resolution and bitrate optimization, without integrating temporal resolution. In contrast, our contribution introduces a novel approach by optimizing bitrate ladders using both spatial and temporal resolutions in conjunction, which is rarely addressed in the current literature. While existing methods, such as Netflix’s optimization approach, rely on extensive computational processes for every possible combination of spatial and bitrate parameters, our approach leverages NN to estimate switching points, significantly reducing the computational demands. Furthermore, compared to methods that depend solely on ML or polynomial regression for content-adaptive encoding, our work incorporates high temporal resolutions (up to 120 fps), expanding the model’s applicability to modern high-frame-rate video formats, which is a key advancement over most previous works. The only research that incorporates TR as a parameter is presented in [32]. In this work, the authors estimate the convex hull utilizing both PSNR and VMAF as objective metrics. However, their approach does not employ ML or DL techniques for bitrate ladder estimation, which results in a higher demand for computational resources and is not appropriate for comparison. Considering there is a lack of related work that includes data needed for comparison purposes, we added an additional layer of model verification with the Lips video sequence that was not included in the training of initial NN.

 

  5. How does the NN model perform when targeting degradation problems (e.g., noise, jitter) in video quality? Do the authors have plans to better handle these problems? Please discuss this in the manuscript.

Response: Thank you for your comment. Our NN model is primarily focused on determining the optimal encoding parameters before streaming occurs, which inherently targets video quality issues such as noise during the pre-streaming phase. Degradation is mainly introduced by compression and the down/up-sampling process, and our model incorporates SSIM as a key metric, which is highly sensitive to noise and visual distortions. Therefore, we believe the model adequately addresses noise-related degradation by optimizing for SSIM during encoding parameter selection, ensuring that the encoded video retains high visual quality even if noise is present in the source.

However, our model does not currently target jitter, as jitter is a network-induced issue that occurs during transmission and after the video has been encoded. Since the model is designed to determine encoding parameters before streaming, it cannot predict or manage network-related issues such as jitter. These problems are better handled by adaptive streaming protocols like DASH or HLS, which adjust video quality in real time based on network conditions. Jitter can also occur as a result of lowering the frame rate. Since the entire method is based on the Structural Similarity Index, the effectiveness of the neural network in the presence of jitter depends on how well SSIM captures temporal degradation. An improvement to the NN model could be achieved by replacing SSIM with subjective Mean Opinion Scores (MOS), as subjective assessments may better account for perceived temporal artifacts. This is one of the directions we intend to explore in future research. We have included this clarification in the manuscript to highlight how noise is accounted for, but jitter is outside the scope of our current model. (Lines 315-319)

 

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

All our previous comments have been reflected in the revised version.

I support the acceptance of this write-up.

Reviewer 2 Report

Comments and Suggestions for Authors

The revision is good. Please proofread the paper to prepare the final version.

Reviewer 3 Report

Comments and Suggestions for Authors

I'm glad to see that the issues I raised have been addressed.
