Impact of Scene Content on High Resolution Video Quality

This paper deals with the impact of content on perceived video quality, evaluated using the subjective Absolute Category Rating (ACR) method. The assessment was conducted on eight types of video sequences with diverse content obtained from the SJTU dataset. The sequences were encoded at 5 different constant bitrates in two widely used video compression standards, H.264/AVC and H.265/HEVC, at Full HD and Ultra HD resolutions, which yielded 160 annotated video sequences. The length of the Group of Pictures (GOP) was set to half the framerate value, as is typical for video intended for transmission over a noisy communication channel. The evaluation was performed in two laboratories: one at the University of Zilina and the other at the VSB-Technical University in Ostrava. The results acquired in both laboratories showed a high correlation. Although the sequences with low Spatial Information (SI) and Temporal Information (TI) values reached a better Mean Opinion Score (MOS) than the sequences with higher SI and TI values, these two parameters are not sufficient for scene description, and this domain should be the subject of further research. The evaluation results led us to the conclusion that it is unnecessary to use the H.265/HEVC codec for compression of Full HD sequences and that the compression efficiency of the H.265 codec at the Ultra HD resolution reaches the compression efficiency of both codecs at the Full HD resolution. This paper also includes recommendations for the minimum bitrate thresholds at which the video sequences at both resolutions retain good and fair subjectively perceived quality.


Introduction
In recent years, the number of surveillance and data collection cameras located both indoors and outdoors has been constantly increasing. The popularity of home security cameras is also growing as even high-quality models become more affordable. Typical applications of surveillance cameras include public safety, protection of facilities against theft or vandalism, remote video monitoring, traffic surveillance, weather monitoring, and more special cases, such as animal monitoring or data collection for statistical or marketing purposes. Today, due to the pandemic situation, face recognition with and without a protective mask is also becoming a point of interest for researchers in cooperation with technology companies [1][2][3]. It is important to realize that each such sensor produces a tremendous amount of data to be subsequently transmitted over the network or further processed, which calls for effective video compression. Furthermore, whether the image or video is presented to a human viewer or to a machine learning algorithm (most often for classification or segmentation), the best results are achieved when the image is of the highest achievable quality. This implies one common goal for distributors, communication service providers, and broadcasting companies: to set the compression parameters so that the perceived video quality is maximal while the bandwidth requirements are minimal. This challenge leads to increased interest in the analysis of video content, followed by the individual setting of compression parameters for video sequences with different types of scene content.

Dataset description
For our measurements, we used the dataset from the Media Lab of the Shanghai Jiao Tong University [34]. We selected eight sequences with various scene content from this database, illustrated in Figure 1 and classified according to their Spatial Information (SI) and Temporal Information (TI). SI quantifies the amount of spatial detail in an image and is higher for more spatially complex scenes, while TI quantifies the amount of temporal change in a video sequence and is higher for high-motion sequences [58]. The spatial perceptual information is based on the Sobel filter and is given by

$$SI = \max_{time}\{\mathrm{std}_{space}[\mathrm{Sobel}(F_n)]\},$$

where $F_n$ stands for the $n$-th video frame, $\mathrm{std}_{space}$ for the standard deviation over the pixels in each Sobel-filtered frame, and $\max_{time}$ for the maximum over all frames of the sequence. The temporal information is computed as

$$TI = \max_{time}\{\mathrm{std}_{space}[M_n(i, j)]\},$$

where $M_n(i, j)$ is the difference between the pixels at the same position in two consecutive frames, i.e.,

$$M_n(i, j) = F_n(i, j) - F_{n-1}(i, j),$$

where $F_n(i, j)$ is the pixel at the $i$-th row and $j$-th column of the $n$-th frame in time [58]. Both of these parameters were calculated for each sequence using the Mitsu tool [59] and are plotted in Figure 2. The general specification of the dataset is given in Table 1, and the content of the individual sequences is briefly described in Table 2.
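The SI and TI definitions above can be illustrated with a minimal pure-Python sketch. It is not the Mitsu tool used in this paper: it assumes that each frame is given as a list of rows of luma values, that the Sobel response is the gradient magnitude evaluated on interior pixels only, and that the population standard deviation is used. The function names are hypothetical.

```python
import math
from statistics import pstdev

# 3x3 Sobel kernels for the horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(frame):
    """Return the gradient magnitudes of the interior pixels of one luma frame."""
    rows, cols = len(frame), len(frame[0])
    out = []
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            gx = gy = 0.0
            for di in range(3):
                for dj in range(3):
                    p = frame[i + di - 1][j + dj - 1]
                    gx += SOBEL_X[di][dj] * p
                    gy += SOBEL_Y[di][dj] * p
            out.append(math.hypot(gx, gy))
    return out


def spatial_information(frames):
    """SI: maximum over frames of the spatial std of the Sobel-filtered frame."""
    return max(pstdev(sobel_magnitude(f)) for f in frames)


def temporal_information(frames):
    """TI: maximum over consecutive frame pairs of the spatial std of M_n."""
    stds = []
    for prev, cur in zip(frames, frames[1:]):
        diff = [c - p
                for row_p, row_c in zip(prev, cur)
                for p, c in zip(row_p, row_c)]
        stds.append(pstdev(diff))
    return max(stds)
```

A completely flat, static clip gives SI = 0 and TI = 0, whereas any local change between two frames raises TI; a uniform brightness shift does not, because the standard deviation of a constant difference image is zero.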

Test Sequence Description
Bund Nightscape is a video sequence portraying a view from above of a night city crossed by a busy road next to the river. The time-lapse video is captured from a high angle with a steady camera as one extreme long shot. The scene is relatively static, except for the accelerated movement of cars driving on the road, people passing by, flags waving in the wind, and flashing lights.
Marathon is a video sequence picturing a large group of people in colorful apparel running a race on an asphalt road on a rainy day. The sequence was filmed from a bird's-eye perspective with almost no camera movement as a very long shot. The scene is rather dynamic, given that almost the entire frame is filled with running marathon participants and raindrops falling on the wet road.
Campfire Party is a night-time video sequence depicting a group of people posing for a photograph behind a large campfire. The long shot is captured by a stationary camera, which zooms in slightly at the end of the video. The motion in the scene is caused mainly by the flickering fire in the foreground and a woman who briefly runs out of and back into the shot.
Runners is a video sequence that captures athletes running on a tree-lined road in cloudy weather. The racers in the very long shot are approaching the stationary camera, which is positioned approximately at their eye level. The scene contains a considerable amount of motion caused by the rushing contestants and by the wind in the treetops.
Construction Field is a very still video sequence capturing construction equipment in the middle of a building site during excavation work. A hand-held camera was used to film the very long shot from a high angle. The only moving objects in the scene are an excavator digging a foundation pit and people slowly walking in the background.
Tall Buildings is a video sequence portraying the tallest skyscrapers and busy intersections in Shanghai, with a grand river in the background. The video was captured from a bird's-eye view using a camera that slowly pans to take a panoramic extreme long shot. The movement in the scene is primarily a result of the panning motion of the camera and partially of the cars driving fast in the far distance.

Fountains is a video sequence focused on several fountains in the center of a housing estate with multiple trees and apartment buildings in the background. The video is captured by a static camera as a long shot. All the motion in the scene can be attributed to water gushing from the fountain jets and droplets evaporating into the air.
Wood is a video sequence picturing a tall forest during a sunny autumn day. The video was filmed from a low angle as a long shot with a camera performing a moderately fast panning motion. All the movement in the scene can be attributed to the camera pan and the resulting change in the angle of the sunlight rays incident on the lens.

Dataset Preparation
In our research, we decided to explore the quality of 8-bit video sequences at two commonly used resolutions, i.e., Full HD (FHD) and Ultra HD (UHD), in the typical chroma-subsampled YUV 4:2:0 format. Because the original sequences were uncompressed Ultra HD videos in the YUV 4:4:4 color format with 10-bit depth, we had to convert them to the appropriate formats. Therefore, all test sequences were first chroma subsampled from YUV 4:4:4 to YUV 4:2:0, and the bit depth was reduced from 10 to 8 bits per channel. As we wanted to assess Full HD in addition to Ultra HD, the resolution also had to be reduced for the Full HD versions. All these conversion steps were performed with the FFmpeg tool [61]. Consequently, two uncompressed test sequences were generated (Figure 3) for each type of content, which adds up to 16 videos. We call them the source video sequences (SRCs) in the rest of this paper.
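The conversion steps described above could be scripted roughly as follows. This is only a sketch of one possible FFmpeg invocation, not the exact command line used in this study: the file names are hypothetical, the source is assumed to be a raw planar 10-bit 4:4:4 file of 3840x2160 pixels at 30 fps, and the default scaling algorithm of FFmpeg is left unchanged.

```python
def build_conversion_command(src, dst, width, height, fps=30):
    """Build an FFmpeg argument list converting a raw UHD 4:4:4 10-bit
    source into a raw 8-bit 4:2:0 file at the requested resolution."""
    cmd = [
        "ffmpeg",
        "-f", "rawvideo",                 # headerless input, so describe it
        "-pixel_format", "yuv444p10le",   # assumed source pixel format
        "-video_size", "3840x2160",       # assumed source resolution
        "-framerate", str(fps),
        "-i", src,
    ]
    if (width, height) != (3840, 2160):
        # Downscale only when a lower resolution (e.g., Full HD) is requested.
        cmd += ["-vf", f"scale={width}:{height}"]
    # Chroma subsampling to 4:2:0 and bit-depth reduction to 8 bits.
    cmd += ["-pix_fmt", "yuv420p", "-f", "rawvideo", dst]
    return cmd


# Hypothetical file names for one content type.
uhd_cmd = build_conversion_command("marathon_src.yuv", "marathon_uhd.yuv", 3840, 2160)
fhd_cmd = build_conversion_command("marathon_src.yuv", "marathon_fhd.yuv", 1920, 1080)
```

The resulting lists can be passed to `subprocess.run`; building them as lists rather than as one shell string avoids quoting problems with file names.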

Coding Process
All these source video sequences (SRCs) were afterwards encoded with both compression standards to be evaluated, i.e., H.264/AVC and H.265/HEVC. As the quality-restricting parameter, we decided to use a constant bitrate. We selected 5 target bitrates, 1, 3, 5, 10, and 15 Mbps, based on our previous research [62], which showed that the efficiency of codecs grows nonlinearly with increasing bitrate. We limited the number of bitrates to 5 as a compromise between the complexity and time requirements of subjective testing and the precision of the measurements. For the purposes of our research, we decided to use the Group of Pictures (GOP) length typical for video intended for transfer over a noisy communication channel. This GOP length is derived from the framerate of the video sequences and is commonly set to half of the framerate value. Accordingly, given that the test video sequences had a framerate of 30 fps (frames per second), we chose a GOP length of 15 frames, i.e., M = 3, N = 15. The first number, M, expresses the distance between two anchor frames (I or P), and the second number, N, stands for the distance between two key frames (I).
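As a quick check of the numbers above, the following sketch enumerates the encoding combinations and prints the frame-type pattern of one GOP. It assumes the classic fixed pattern in display order, in which every M-th frame is an anchor and the frames in between are B frames; real encoders may place frames adaptively. The names `encoding_grid` and `gop_pattern` are illustrative only.

```python
from itertools import product

CONTENTS = [
    "Bund Nightscape", "Marathon", "Campfire Party", "Runners",
    "Construction Field", "Tall Buildings", "Fountains", "Wood",
]
RESOLUTIONS = ["FHD", "UHD"]
CODECS = ["H.264/AVC", "H.265/HEVC"]
BITRATES_MBPS = [1, 3, 5, 10, 15]


def encoding_grid():
    """All (content, resolution, codec, bitrate) combinations to encode."""
    return list(product(CONTENTS, RESOLUTIONS, CODECS, BITRATES_MBPS))


def gop_pattern(n=15, m=3):
    """Frame types of one GOP in display order for key-frame distance n
    and anchor-frame distance m (assumed fixed I/P/B structure)."""
    types = []
    for k in range(n):
        if k == 0:
            types.append("I")   # key frame opens the GOP
        elif k % m == 0:
            types.append("P")   # anchor frame every m frames
        else:
            types.append("B")   # bidirectional frames in between
    return "".join(types)


print(len(encoding_grid()))  # 8 contents x 2 resolutions x 2 codecs x 5 bitrates
print(gop_pattern())
```

With 8 contents, 2 resolutions, 2 codecs, and 5 bitrates, the grid contains the 160 sequences mentioned in the abstract, and at 30 fps a 15-frame GOP places a key frame every half second.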

Subjective Quality Assessment
During the subjective testing, all created processed video sequences (PVSs) were shown to people of different ages and genders, who evaluated their quality. We decided to use the Absolute Category Rating (ACR) method [58,63], which belongs to the category of Single Stimulus (SS) subjective video quality assessment techniques. In this method, the degraded sequences are presented to the observers one at a time, and the observers rate the quality of each sequence on a five-level grading scale, where 1 indicates bad quality and 5 stands for excellent quality. The measurement was conducted separately in two laboratories: one at the University of Zilina (UNIZA) and the other at the VŠB-Technical University in Ostrava (VŠB). The video sequences were presented on three types of displays (Table 3), depending on the resolution of the test sequences, under normal indoor illumination conditions.

Table 3. Types of used displays (columns: Type of Assessment, Type of Display).

Thirty participants, mostly students, were involved in the testing in each laboratory. All of them were naive observers, which means they had no expertise in the image artefacts that may be introduced by the system under test. Naturally, they were thoroughly acquainted with the method of assessment, types of impairment, grading scale, sequence, and timing, as required by Reference [58]. The distribution of men and women who took part in the tests, as well as the average age of all observers, is shown in Table 4. The course of the entire subjective assessment process is represented in Figure 4.

Statistical Analysis and Presentation of the Results
After performing the subjective tests, we processed all collected results statistically; for each test sequence, codec, and resolution, the Mean Opinion Score (MOS) and the 95 percent Confidence Interval (CI) were calculated in accordance with Reference [64] and plotted in graphs, as shown below. The presentation of the results is divided into five parts. In the first part, the results obtained from the two laboratories, i.e., UNIZA and VŠB, are cross-compared using the Pearson correlation coefficient (PCC) and the Root Mean Square Error (RMSE). In the second part, the impact of the bitrate on the perceived video quality depending on the scene content is plotted. The third part deals with the Analysis of Variance (ANOVA), which was applied to the acquired data. In the fourth part, the impact of the bitrate on the perceived video quality in terms of the used codec and resolution is presented. Finally, in the fifth part, the minimum bitrate thresholds at which a video sequence should be encoded to reach a certain quality are determined.
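For illustration, the MOS and its confidence interval for a single test condition can be computed as in the sketch below. It assumes the common normal approximation, in which the half-width of the 95% interval is 1.96 times the sample standard deviation divided by the square root of the number of observers; the exact procedure of Reference [64] may differ in detail, and the ratings used here are invented.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (MOS, CI half-width) for the ACR ratings of one test condition.

    z = 1.96 corresponds to a 95% interval under a normal approximation.
    """
    m = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return m, half_width


# Hypothetical ratings from five observers on the 1-5 ACR scale.
mos, ci = mos_with_ci([3, 4, 4, 5, 4])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The interval narrows with the square root of the number of observers, which is one reason a panel of thirty participants per laboratory gives reasonably tight error bars.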

Correlation between the Results from Individual Laboratories
To compare the MOS values obtained from both laboratories, i.e., from UNIZA and VŠB, and to quantify their agreement, the Pearson correlation coefficient (PCC) and the Root Mean Square Error (RMSE) were calculated. All computations were done for both codecs and resolutions, as well as for all test sequences. The results are plotted in Figures 5 and 6 and listed in Table 5. As we can see from Figures 5 and 6, as well as from Table 5, there is a high correlation between the results from both laboratories. The lowest correlation was reached by the combination of the Full HD resolution and the H.264 codec. This is most likely due to the different displays used in the assessments; at the UNIZA laboratory, a Full HD display was used, while, at the VŠB laboratory, an Ultra HD display was used. Conversely, the highest correlation was achieved by the video sequences encoded with H.264 at the UHD resolution.

Figure 7 shows the impact of the bitrate on the perceived video quality (defined by the MOS with the associated CI). This figure contains eight graphs, one for each combination of codec, resolution, and laboratory where the evaluation was conducted. Sequences with different scene content are color-coded in the graphs; each curve represents the MOS values for a given test sequence. Figure 8 shows the average MOS values obtained from the UNIZA and VŠB laboratories.
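The inter-laboratory comparison relies on two standard measures, which can be sketched in a few lines. The code below is a generic implementation of the PCC and RMSE, and the MOS values in the usage example are invented for illustration; they are not results from this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(x, y):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical MOS values of the same five PVSs rated in two laboratories.
lab_a = [1.8, 2.9, 3.6, 4.1, 4.4]
lab_b = [1.6, 3.0, 3.4, 4.2, 4.5]
print(pearson(lab_a, lab_b), rmse(lab_a, lab_b))
```

The two measures are complementary: the PCC captures whether both laboratories rank and scale the sequences similarly, while the RMSE also exposes a constant offset between them that the PCC alone would hide.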

Impact of Bitrate on Video Quality Depending on Scene Content
It is apparent from the graphs that the sequences with the lowest SI and TI values, such as "Bund Nightscape" and "Construction Field", reached the best MOS values. Conversely, the observers rated the sequences situated in the middle of the SI-TI diagram, such as "Marathon" or "Runners", as having the worst quality. Interesting cases are the "Campfire Party" and "Fountains" sequences. "Campfire Party" contains a lot of movement (high TI values) but not many details (low SI values) and reached a low MOS value, while the "Fountains" sequence lies near the "Bund Nightscape" and "Construction Field" sequences in the diagram, meaning it has both low TI and low SI values, yet it also scored low on the MOS scale. A special case is the "Wood" sequence, which is situated in the upper right corner of the SI-TI diagram. Nevertheless, its quality was perceived as similar to that of the "Fountains" and "Runners" sequences. All these differences are more pronounced:
• at low bitrates (with increasing bitrate, the perceived quality rises and approaches that of the sequences with low SI-TI values),
• at the Ultra HD resolution rather than at the Full HD resolution, and
• with the H.265 codec rather than with the H.264 codec.
Based on these results, we can state that the compression efficiency and the related video quality depend on the content of the sequences. However, describing a sequence only by its spatial and temporal information is not sufficient, and this topic should be the subject of further research. We suggest that other parameters be used to describe the scene, such as the luminance and contrast or the colors occurring in the scene. In addition, psychological factors should be considered. Based on the results, we can also state that the temporal information has a greater impact on the perceived quality than the number of objects defined by the spatial information.

Analysis of Variance
To verify the findings that stemmed from the graphical representation of the subjective evaluation results, ANOVA was applied to the data [65]. A three-way ANOVA was used to compare the significance and influence of the individual sequence parameters on the resulting perceived video quality. The interaction between three independent variables, bitrate (X1), content (scene type) (X2), and resolution (X3) in Table 6 or compression standard (X3) in Table 7, was examined, with video quality being considered the dependent variable. Tables 6 and 7 depict the three-way ANOVA matrices. The F-value, also called the F-ratio, is calculated as the between-group mean square (the variance attributable to the differences among the group means) divided by the within-group mean square (Mean Squared Error). A greater F-value indicates a more significant variation. ANOVA also yields the p-value, i.e., the probability of obtaining a result at least as extreme as the observed one if the examined source of variation had no effect. A source of variation is regarded as statistically insignificant when its p-value is higher than a given alpha level, commonly set to 0.05.
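The F-ratio described above can be made concrete with a deliberately simplified example. The sketch below computes a one-way F-ratio for a single factor, not the three-way ANOVA reported in Tables 6 and 7, and it does not compute the p-value, which would additionally require the F distribution. The groups are invented ratings.

```python
def f_ratio(groups):
    """One-way ANOVA F-ratio: between-group mean square divided by the
    within-group mean square (Mean Squared Error)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (gm - grand_mean) ** 2
                     for g, gm in zip(groups, group_means))
    # Variation of the individual ratings around their own group mean.
    ss_within = sum((v - gm) ** 2
                    for g, gm in zip(groups, group_means) for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical ratings grouped by one factor, e.g., three bitrates.
print(f_ratio([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

When the group means move further apart while the spread inside each group stays the same, the F-ratio grows, which is why the factor with the largest F-value in an ANOVA table is the one with the strongest influence on the ratings.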
Based on the analysis of the tables, the following conclusions can be drawn. Table 6 indicates that for the H.265 encoded sequences, the effect of resolution can be ignored, since this variable was found to be statistically insignificant. In contrast, for the H.264 codec, resolution is statistically significant and is the second most important parameter determining the subjectively perceived quality. For both codecs, a change in bitrate has the largest effect on the subjective MOS. According to Table 7, the impact of the compression format on the perceived quality is statistically insignificant for the Full HD video sequences. However, that is not the case for the Ultra HD resolution, where the deployed codec is the second most influential variable. As in Table 6, the bitrate has the greatest effect on the subjective video quality assessment results. All remaining ANOVA test results in both tables can be regarded as statistically significant based on their p-values.

Figure 9 shows the impact of the bitrate on the perceived video quality (defined by the MOS with the associated CI), plotted separately for each type of video sequence. This figure contains eight graphs, one per test sequence, which show the impact of the used codec and resolution on the perceived quality of the given sequence; each curve represents the MOS values averaged over both laboratories for a given codec and resolution. In Figure 10, the MOS values averaged over both laboratories and all test sequences are plotted for each codec and resolution.
We can draw several conclusions from Figures 9 and 10. Firstly, it is apparent that the H.265 compression standard yields better quality than the H.264 codec. This is a generally known fact, and we expected it. What is interesting and important, however, is that the efficiency difference between these two codecs is negligible for the Full HD video sequences. Therefore, it is unnecessary to use the H.265 compression standard at this resolution, as the observers will not see any notable differences. The use of the H.265 codec is relevant only for videos at the Ultra HD resolution, particularly at low bitrates. This is because the quality of the H.264 encoded video sequences increases with rising bitrate up to the point where it reaches or even surpasses the perceived quality of the H.265 sequences. Secondly, the compression efficiency of the H.265 compression standard at the Ultra HD resolution reaches the compression efficiency of both codecs at the Full HD resolution.
The conclusions drawn from the Analysis of Variance (ANOVA) coincide with those drawn from the graphical representation of the subjective quality evaluation results. These findings could be beneficial for visual media content providers and broadcasting companies, as they indicate how to adjust the video compression parameters to improve the perceived quality. The perceived video quality grows fastest with an increase in bitrate. Specifically, the quality increases most rapidly until the bitrate reaches approximately 5 Mbps. The analyses also revealed which combination of resolution and compression format should be used so that viewers perceive the resulting quality of the visual content as being as good as possible.

Minimum Bitrate Thresholds Suggestions
Finally, Figure 10 shows the minimum bitrate thresholds at which the video sequences should be encoded to achieve good (4) or fair (3) quality. These quality thresholds are based on the MOS values of the ACR method and are important for setting the bitrate of each codec so that a certain quality is maintained. Table 8 lists these minimum bitrates.
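One simple way to read such a threshold off a MOS-versus-bitrate curve is linear interpolation between the measured bitrates, sketched below. This is an assumption made for illustration: the text does not state how the values in Table 8 were derived, and the MOS values in the example are invented.

```python
def min_bitrate_for_mos(bitrates, mos, target):
    """Smallest bitrate at which the linearly interpolated MOS curve
    reaches the target, or None if the target is never reached."""
    points = sorted(zip(bitrates, mos))
    if points[0][1] >= target:
        # Already at or above the target at the lowest tested bitrate.
        return points[0][0]
    for (b0, m0), (b1, m1) in zip(points, points[1:]):
        if m0 < target <= m1:
            # Interpolate inside the first segment that crosses the target.
            return b0 + (target - m0) * (b1 - b0) / (m1 - m0)
    return None


# Hypothetical averaged MOS values at the five tested bitrates (Mbps).
bitrates = [1, 3, 5, 10, 15]
mos = [1.5, 2.5, 3.5, 4.0, 4.5]
print(min_bitrate_for_mos(bitrates, mos, 3))  # fair quality
print(min_bitrate_for_mos(bitrates, mos, 4))  # good quality
```

Returning `None` when the target is never reached keeps the caller from mistaking the highest tested bitrate for a valid threshold.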

Conclusions
This paper dealt with the impact of content on the perceived video quality, evaluated using the subjective Absolute Category Rating (ACR) method. Eight types of video sequences with various scene content were evaluated. Two widely used video compression standards, H.264/AVC and H.265/HEVC, were tested in combination with the Full HD and Ultra HD resolutions. In the coding process, we selected 5 bitrates based on our previous research, which showed that the efficiency of codecs grows nonlinearly with increasing bitrate. The number of bitrates was a compromise between the complexity and time requirements of subjective testing and the precision of the measurements. In total, we created an annotated database containing 160 different video sequences coded at constant bitrates with the GOP length set to half of the framerate value, which is typical for video intended for transfer over a noisy communication channel. The perceived quality of the sequences was evaluated using the subjective ACR method. The assessment was conducted in two laboratories: one at the University of Zilina and the other at the VSB-Technical University in Ostrava. First, we calculated the correlation of the MOS values between both laboratories using the Pearson correlation coefficient (PCC) and the Root Mean Square Error (RMSE). The correlation proved to be high. After that, we described the impact of the bitrate on video quality depending on the scene content defined by the Spatial Information (SI) and Temporal Information (TI). The results showed that even though the sequences with low SI and TI values reach a better MOS than the sequences with higher SI and TI values, these two parameters are not sufficient for scene description, and this domain should be the subject of further research. Subsequently, we described the impact of the bitrate on video quality depending on the codec and resolution.
Based on the results, we concluded that employing the H.265 codec for compression of Full HD sequences is unnecessary, as the observers did not perceive any significant differences. Furthermore, we found that the compression efficiency of the H.265 codec at the Ultra HD resolution reaches the compression efficiency of both codecs at the Full HD resolution. We also applied ANOVA to verify the findings that stemmed from the graphical representation of the subjective evaluation results. Finally, we determined the minimum bitrate thresholds at which the video sequences at both resolutions retain good and fair subjectively perceived quality.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.