Figure 1.
Speed versus mean Spearman’s Rank-order Correlation Coefficient (SROCC), computed across 100 train–test combinations on the LIVE-Qualcomm videos (1080p resolution). Speed is measured by running each method on a CPU.
Figure 2.
Our NR-VQA method consists of three main steps: frame sampling, multi-level feature extraction, and video quality estimation. In the frame sampling step, a set of K frames is sampled from the whole video. In the multi-level feature extraction step, these K frames are encoded in terms of both quality and semantic features. The video quality estimation block aggregates the frame-level feature vectors into a single video-level feature vector using temporal statistic pooling and then estimates the quality score with a Support Vector Regressor.
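The aggregation step named in this caption can be illustrated with a minimal sketch. The caption does not say which statistics the temporal pooling uses, so the choice of per-dimension mean and standard deviation below is an assumption, and the function name `temporal_statistic_pooling` is hypothetical. The Support Vector Regressor that would consume the pooled vector is omitted.

```python
from statistics import fmean, pstdev


def temporal_statistic_pooling(frame_features):
    """Collapse K frame-level vectors of length D into one vector of length 2*D.

    Assumption: the pooled statistics are the per-dimension mean and
    (population) standard deviation; the source caption does not list them.
    """
    if not frame_features:
        raise ValueError("at least one frame-level feature vector is required")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frame-level vectors must have the same length")
    # Transpose so that each tuple holds one feature dimension over time.
    per_dimension = list(zip(*frame_features))
    means = [fmean(values) for values in per_dimension]
    stds = [pstdev(values) for values in per_dimension]
    return means + stds


# Three sampled frames, each described by a 2-dimensional feature vector.
frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
video_vector = temporal_statistic_pooling(frames)
print(video_vector)
```

Whatever statistics are used, the pooled vector has a fixed length regardless of K, which is what allows a standard regressor to be trained on videos of different durations.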
Figure 3.
Definition of Extractor-Q. Given an image of any size, Extractor-Q estimates the overall quality score as well as the scores for eight quality attributes: brightness, colorfulness, contrast, graininess, luminance, noisiness, sharpness, and color saturation. It consists of a MobileNet-v2 followed by a Global Average Pooling (GAP) layer and one fully connected (FC) layer for each of the predicted scores.
Figure 4.
Sample frames of the video contents contained in the four considered databases: (a) CVD2014, (b) KoNViD-1k, (c) LIVE-Qualcomm, and (d) LIVE-VQC.
Figure 5.
(Best viewed in color and magnified.) Example predictions of the proposed method. (a) Two videos whose quality is under- or overestimated. (b) Two videos with a very low error between the MOS and the predicted quality score. For each video, the 15 frames sampled by the proposed method, the MOS, and the predicted quality score are shown.
Figure 6.
Scatter plots of the predicted scores versus MOS for the four considered databases: (a) CVD2014, (b) KoNViD-1k, (c) LIVE-Qualcomm, and (d) LIVE-VQC.
Figure 7.
Floating point operations (FLOPs) estimated for each method on videos with different numbers of frames and frame resolutions.
Figure 8.
Mean PLCC (a) and SROCC (b) across 100 train–test random splits of the four considered datasets with respect to the number of sampled frames.
Figure 9.
Computation time in seconds, as a function of the number of sampled frames, for four videos selected from the considered databases: (a) on CPU; (b) on GPU. A label of the form {xxx}frs@{yyy}p denotes a video of xxx frames at yyyp resolution.
Figure 10.
Mean PLCC (a) and SROCC (b) computed across 100 train–test combinations on the considered databases for different sampling step values in the proposed frame sampling algorithm.
Figure 11.
Mean PLCC (a) and SROCC (b) computed across 100 train–test combinations on the considered databases using different frame size values in the proposed frame sampling algorithm.
Figure 12.
(Best viewed in color and magnified.) Linear sampling versus the proposed frame sampling. For two example videos, the 15 frames sampled linearly are compared with the 15 frames selected by the proposed sampling algorithm.
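For reference, the linear baseline in this comparison can be sketched as picking evenly spaced frame indices. The captions do not define linear sampling precisely, so the even spacing that includes the first and last frame is an assumption, and the function name `linear_sample_indices` is hypothetical. The proposed sampling algorithm is not described in this section and is not reproduced here.

```python
def linear_sample_indices(num_frames, k=15):
    """Return k evenly spaced frame indices in [0, num_frames - 1].

    Assumption: "linear sampling" means uniform spacing that includes the
    first and last frame; the source captions do not spell this out.
    """
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    if k >= num_frames:
        # Fewer frames than requested: keep every frame.
        return list(range(num_frames))
    if k == 1:
        return [0]
    # Integer arithmetic keeps the result deterministic and strictly increasing.
    return [(i * (num_frames - 1)) // (k - 1) for i in range(k)]


# A 240-frame video, matching one of the clip lengths listed in Table 4.
indices = linear_sample_indices(240, k=15)
print(indices)
```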
Table 1.
Overview of the publicly available databases for in-the-wild video quality assessment. In the Device types row, “DSLR” stands for digital single-lens reflex.
| Attribute/Database | CVD2014 [18] | KoNViD-1k [6] | LIVE-Qualcomm [8] | LIVE-VQC [7] |
|---|---|---|---|---|
| Year | 2014 | 2017 | 2017 | 2018 |
| No. of sequences | 234 | 1200 | 208 | 585 |
| No. of scenes | 5 | 1200 | 54 | 585 |
| No. of devices | 78 | N/A | 8 | 101 |
| Device types | smartphone and DSLR | DSLR | smartphone | smartphone |
| Distortion type | generic | generic | specific | generic |
| Duration | 10–25 s | 8 s | 15 s | 10 s |
| Resolution | VGA and 720p | 540p | 1080p | various |
| Frame rate (fps) | 10–31 | 30 | 30 | N/A |
| Format | various | MPEG-4 | YUV | N/A |
| Ratings per video | 27–33 | 50 | 39 | >200 |
| MOS range | −6.50 to 93.38 | 1.22 to 4.64 | 16.56 to 73.64 | 0 to 100 |
Table 2.
Mean Pearson’s Linear Correlation Coefficient (PLCC), Spearman’s Rank-order Correlation Coefficient (SROCC), and Root Mean Square Error (RMSE) across 100 train–test combinations on the four considered databases. In each column, the best and second-best values are marked in boldface and underlined, respectively.
| Method | CVD2014 PLCC ↑ | CVD2014 SROCC ↑ | CVD2014 RMSE ↓ | KoNViD-1k PLCC ↑ | KoNViD-1k SROCC ↑ | KoNViD-1k RMSE ↓ |
|---|---|---|---|---|---|---|
| NIQE [12] | | | | | | |
| BRISQUE [13] | | | | | | |
| V-CORNIA [17] | | | | | | |
| V-BLIINDS [16] | | | | | | |
| HIGRADE [15] | | | | | | |
| TLVQM [4] | | | | | | |
| VSFA [9] | | | | | | |
| QSA-VQM [5] | | | | | | |
| Proposed | | | | | | |

| Method | LIVE-Qualcomm PLCC ↑ | LIVE-Qualcomm SROCC ↑ | LIVE-Qualcomm RMSE ↓ | LIVE-VQC PLCC ↑ | LIVE-VQC SROCC ↑ | LIVE-VQC RMSE ↓ |
|---|---|---|---|---|---|---|
| NIQE [12] | | | | | | |
| BRISQUE [13] | | | | | | |
| V-CORNIA [17] | | | | | | |
| V-BLIINDS [16] | | | | | | |
| HIGRADE [15] | | | | | | |
| TLVQM [4] | | | | | | |
| VSFA [9] | | | | | | |
| QSA-VQM [5] | | | | | | |
| Proposed | | | | | | |
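The three metrics reported in Table 2 can be computed for a single train–test split as in the sketch below, which uses only the Python standard library. The function names and the toy MOS and prediction values are illustrative assumptions, not data from the paper. Ties are handled with average ranks, which is the usual convention for Spearman's coefficient.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's Linear Correlation Coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's Rank-order Correlation Coefficient (SROCC): PLCC of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(predicted, target):
    """Root Mean Square Error between predictions and ground truth."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))


# Toy values for illustration only (not results from the paper).
mos = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.9, 3.2, 5.0]
print(pearson(predicted, mos), spearman(predicted, mos), rmse(predicted, mos))
```

To mirror the protocol in the caption, these functions would be evaluated on the test portion of each of the 100 train–test combinations and the results averaged.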
Table 3.
SROCC in the cross-dataset setup. In each column, the best and second-best values are marked in boldface and underlined, respectively.
| Training | CVD2014 | CVD2014 | CVD2014 | KoNViD-1k | KoNViD-1k | KoNViD-1k |
|---|---|---|---|---|---|---|
| Testing | LIVE-Qualcomm | KoNViD-1k | LIVE-VQC | CVD2014 | LIVE-Qualcomm | LIVE-VQC |
| TLVQM [4] | | | | | | |
| VSFA [9] | | | | | | |
| QSA-VQM [5] | | | | | | |
| Proposed | | | | | | |

| Training | LIVE-Qualcomm | LIVE-Qualcomm | LIVE-Qualcomm | LIVE-VQC | LIVE-VQC | LIVE-VQC |
|---|---|---|---|---|---|---|
| Testing | CVD2014 | KoNViD-1k | LIVE-VQC | CVD2014 | LIVE-Qualcomm | KoNViD-1k |
| TLVQM [4] | | | | | | |
| VSFA [9] | | | | | | |
| QSA-VQM [5] | | | | | | |
| Proposed | | | | | | |
Table 4.
Computation time comparison in seconds for four videos selected from the considered databases. A label of the form {xxx}frs@{yyy}p denotes a video of xxx frames at yyyp resolution.
| Mode | Method | 240frs@540p | 364frs@480p | 467frs@720p | 450frs@1080p |
|---|---|---|---|---|---|
| CPU | V-BLIINDS [16] | 382.06 | 361.39 | 1391.00 | 3037.30 |
| CPU | QSA-VQM [5] | 281.21 | 265.13 | 900.72 | 2012.61 |
| CPU | VSFA [9] | 269.84 | 249.21 | 936.84 | 2081.84 |
| CPU | V-CORNIA [17] | 225.22 | 325.57 | 494.24 | 616.48 |
| CPU | TLVQM [4] | 50.73 | 46.32 | 136.89 | 401.44 |
| CPU | NIQE [12] | 45.65 | 41.97 | 155.90 | 351.83 |
| CPU | BRISQUE [13] | 12.69 | 12.34 | 41.22 | 79.81 |
| CPU | Proposed | 8.43 | 6.24 | 16.29 | 37.68 |
| GPU | QSA-VQM [5] | 9.70 | 9.15 | 25.79 | 55.27 |
| GPU | VSFA [9] | 8.85 | 7.55 | 27.63 | 58.48 |
| GPU | Proposed | 0.69 | 0.85 | 1.71 | 2.43 |
Table 5.
Mean PLCC, SROCC, and RMSE across 100 train–test combinations on the four considered databases using different sampling methods. In each column, the best and second-best values are marked in boldface and underlined, respectively.
| Sampling method | CVD2014 PLCC ↑ | CVD2014 SROCC ↑ | CVD2014 RMSE ↓ | KoNViD-1k PLCC ↑ | KoNViD-1k SROCC ↑ | KoNViD-1k RMSE ↓ |
|---|---|---|---|---|---|---|
| All frames | | | | | | |
| Linear sampling (15frs) | | | | | | |
| MAE sampling (15frs) | | | | | | |
| Proposed sampling (15frs) | | | | | | |

| Sampling method | LIVE-Qualcomm PLCC ↑ | LIVE-Qualcomm SROCC ↑ | LIVE-Qualcomm RMSE ↓ | LIVE-VQC PLCC ↑ | LIVE-VQC SROCC ↑ | LIVE-VQC RMSE ↓ |
|---|---|---|---|---|---|---|
| All frames | | | | | | |
| Linear sampling (15frs) | | | | | | |
| MAE sampling (15frs) | | | | | | |
| Proposed sampling (15frs) | | | | | | |
Table 6.
Mean PLCC, SROCC, and RMSE across 100 train–test combinations on the four considered databases using the logits or the features obtained from Extractor-Q and Extractor-S. In each column, the best value is marked in boldface.
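| Input | CVD2014 PLCC ↑ | CVD2014 SROCC ↑ | CVD2014 RMSE ↓ | KoNViD-1k PLCC ↑ | KoNViD-1k SROCC ↑ | KoNViD-1k RMSE ↓ |
|---|---|---|---|---|---|---|
| Logits | | | | | | |
| Features (proposed) | | | | | | |

| Input | LIVE-Qualcomm PLCC ↑ | LIVE-Qualcomm SROCC ↑ | LIVE-Qualcomm RMSE ↓ | LIVE-VQC PLCC ↑ | LIVE-VQC SROCC ↑ | LIVE-VQC RMSE ↓ |
|---|---|---|---|---|---|---|
| Logits | | | | | | |
| Features (proposed) | | | | | | |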