Hybrid No-Reference Quality Assessment for Surveillance Images

Abstract: Intelligent video surveillance (IVS) technology is widely used in various security systems. However, quality degradation in surveillance images (SIs) may affect its performance on vision-based tasks, making it difficult for the IVS system to extract valid information from SIs. In this paper, we propose a hybrid no-reference image quality assessment (NR IQA) model for SIs that can help to identify undesired distortions and provide useful guidelines for IVS technology. Specifically, we first extract two main types of quality-aware features: low-level visual features related to various distortions, and high-level semantic information, which is extracted by a state-of-the-art (SOTA) vision transformer backbone. Then, we fuse these two kinds of features into the final quality-aware feature vector, which is mapped into the quality index through the feature regression module. Our experimental results on two surveillance content quality databases demonstrate that the proposed model achieves the best performance among the compared SOTA NR IQA metrics.


Introduction
With the increasing demand for public security and the rapid development of computer vision technologies and digital products, intelligent video surveillance (IVS) technology has become a hot topic [1]. IVS technology mainly adopts algorithms related to computer vision tasks such as recognition, detection, and tracking in order to understand the content of surveillance videos and automatically perform monitoring or control tasks, which can greatly reduce the burden on human attention [2][3][4][5]. Hence, IVS technology has been widely applied in security systems and deployed in various scenarios. However, surveillance images (SIs) usually suffer from different types and degrees of quality degradation during acquisition and transmission. Specifically, poor physical conditions (smoke, fog, insufficient illumination, etc.), in-capture distortions (noise, blur, etc.), and compression distortions are the main causes of quality degradation in SIs [6][7][8][9]. Distortions in SIs may affect the performance of subsequent high-level tasks, making it difficult for IVS technology to extract valid information from the SIs. As shown in Figure 1, SIs may suffer from uneven illumination or motion blur distortion, making it difficult for both observers and computers to recognize objects. Therefore, it is necessary to consider SI quality assessment (SIQA) in the design of IVS technology. On the one hand, IVS systems can adopt an SIQA method to predict the quality level of the SIs and filter out low-quality SIs. On the other hand, IVS systems can employ an SIQA method to detect and identify different types of degradation and apply appropriate quality enhancement processing to improve the quality of the SIs [10]. Both strategies can help to improve the performance of IVS systems on vision-based tasks.

In the past two decades, image quality assessment (IQA) has gained popularity in the field of image processing [11]. Depending on whether a human is involved or not, IQA can be divided into subjective IQA and objective IQA [12][13][14]. Because human eyes are generally the final receiver of the images, subjective IQA is the most reliable way to assess the quality of images. In recent years, many popular IQA databases have been proposed, such as LIVE [15], TID2008 [16], TID2013 [17], and CSIQ [18], which are used to train and validate objective IQA methods. Despite their high reliability, subjective IQA methods require considerable time and labour, and as a result are not suitable for real applications. Therefore, objective IQA methods, which can automatically predict the quality of images, have attracted much attention from researchers and have been widely used in various real-world applications [19]. According to the availability of reference information, objective IQA can be divided into full-reference IQA (FR IQA), reduced-reference IQA (RR IQA), and no-reference IQA (NR IQA) [20]. FR IQA utilizes the whole reference information, while RR IQA adopts partial reference information. The reference signal is not used in NR IQA metrics. In reality, reference images are not available to an IVS system. Therefore, in this paper we mainly discuss NR IQA methods, as these are more suitable for real applications; however, the absence of a reference image makes them more challenging.
Related Work

IQA Databases
IQA databases are divided into traditional and emerging databases based on the image content type and underlying application [11]. Traditional databases are generally composed of a few high-quality pristine images and many distorted images, which are corrupted by typical distortion types such as JPEG and JPEG 2000 compression, white noise, and blur. LIVE [15] contains 29 reference images and 779 distorted images generated by five common types of distortion. TID2013 [17] consists of 25 pristine images and 3000 distorted images corrupted by 24 distortion types and five distortion levels. CSIQ [18] includes 30 reference images and 866 distorted images generated by six distortion types. These traditional databases sometimes cannot cover the content types and distortion types of certain specific IQA problems. Hence, emerging databases [21][22][23] have been proposed for specific IQA applications, such as 3D image, screen content image, and virtual reality image databases. Because SIs have more complicated content and distortions than traditional images, Zhu et al. [7] constructed a surveillance image quality database (SIQD) including 500 in-the-wild SIs with various scenarios, resolutions, and illumination conditions, then performed a study on the subjective quality assessment of these SIs with different degrees of quality. For surveillance video quality assessment, Beghdadi et al. [8] established the Video Surveillance Quality Assessment Dataset (VSQuAD), which includes 36 reference surveillance videos and 1576 distorted videos generated by nine distortion types.

NR IQA Metrics
With the goal of predicting human visual perception without any information from the original reference image, many NR IQA metrics have been proposed in recent years [24]. According to distortion type, NR IQA metrics can be divided into general-purpose algorithms and distortion-specific algorithms [11]. General-purpose metrics usually have general quality features designed to describe all types of distortions, while distortion-specific metrics use relevant features designed for a specific IQA problem. General-purpose NR IQA methods can be categorized into three types, namely, natural scene statistics (NSS)-based metrics, learning-based metrics, and human visual system (HVS)-based metrics [25].
The motivation behind NSS-based methods is that high-quality natural scene pictures tend to follow certain statistical properties, and quality degradation can be identified where there is a departure from these statistics. NSS-based methods usually contain three common stages, namely, feature extraction, NSS modeling, and feature regression. Saad et al. [26] designed the BLIINDS (blind image integrity notator using DCT statistics) index based on the NSS of the discrete cosine transform (DCT) domain. BLIINDS-II [27] adopts the generalized Gaussian distribution (GGD) to model the NSS of the DCT coefficients and then obtains the quality-aware features through the GGD model parameters. BRISQUE [28] (blind image spatial quality evaluator) and DIIVINE [29] apply NSS in the spatial domain to develop their algorithms. GMLF [30] was developed based on the joint statistics of the gradient magnitude (GM) map and the Laplacian of Gaussian (LOG) response.
With the rapid development of machine learning techniques, a large number of learning-based NR IQA metrics have been developed in the last few years [31][32][33]. CORNIA [34] is based on an unsupervised feature learning framework that uses raw image patches as local descriptors and uses soft-assignment for encoding. Xu et al. [35] developed an NR IQA method based on high-order statistics aggregation (HOSA). Zhang et al. [36] designed a deep bilinear convolutional neural network (CNN)-based NR IQA model for both synthetic and authentic distortions by conceptually modeling them as two-factor variations followed by bilinear pooling. HyperIQA [37] was developed based on a self-adaptive hypernetwork architecture, and introduces a multi-scale local distortion-aware module to capture complex distortions.
The working mechanism of the HVS provides strong prior knowledge for the design of quality-aware features [38][39][40]. Zhai et al. [41] developed a psychovisual quality measure based on the free energy principle. Gu et al. [42] designed an NR free energy-based robust metric (NFERM) combining spatial NSS features, free energy-based features, and HVS-inspired features such as structural information and gradient magnitude. Gu et al. [35] proposed a six-step blind metric (SISBLIM) for quality assessment of both singly and multiply distorted images by systematically incorporating the single quality prediction of each emerging distortion type and the joint effects of different distortion sources.
If the distortion process is known in advance, distortion-specific NR IQA methods are preferred due to their higher robustness and accuracy [43][44][45]. JPEG compression, JPEG 2000 compression, and blur/noise are the most widely studied distortion types. Based on the observation that pixel values change abruptly across a block boundary while remaining nearly unchanged along the boundary, Lee et al. [46] designed an NR IQA metric for JPEG images by measuring the strength of blocking artifacts. Sheikh et al. [47] developed an NSS-based metric for JPEG 2000 compression based on the assumption that the compression process can disturb nonlinear dependencies in natural scenes. Narvekar and Karam [48] adopted a probabilistic model to predict the probability of detecting blur at the image edges, then obtained the blur estimation by pooling the cumulative probability of blur detection (CPBD). To the best of our knowledge, there are few studies on the quality assessment of SIs, and existing general-purpose metrics encounter difficulty when handling the complicated content and distortion types present in SIs. Therefore, there is an urgent need to design an effective NR IQA metric for SIs.

Contributions
In order to address the SIQA problem, we propose a novel NR IQA model for SIs which is able to predict the quality level or the distortion type and level of SIs, helping to improve the performance of IVS systems on high-level tasks. The proposed model is composed of three modules, namely, a feature extraction module, a feature fusion module, and a feature regression module. First, based on the assumption that the human perception of SIs is influenced by both low-level visual properties and high-level semantic information, we extract the following two types of quality-aware features: low-level visual features related to various distortion types (noise, blur, and structure damage), and high-level semantic features extracted by the transformer backbone. Second, the feature fusion module concatenates the distortion features and semantic features into a final quality-aware feature representation. Finally, the quality-aware feature representation is mapped into a final quality score or a distortion type and level assessment in the feature regression module. Our experimental results show that the proposed NR IQA model outperforms the compared state-of-the-art NR IQA metrics on two surveillance content quality databases.

Structure
The rest of this study is organized as follows. Section 2 introduces the proposed NR IQA model for SIs in detail. Section 3 presents the experimental results and discussion, including the benchmark databases, experimental setup, IQA competitors, evaluation criteria, performance discussion, statistical tests, and ablation study. Finally, our conclusions are presented in Section 4.

Proposed Method
The framework of the proposed method is clearly shown in Figure 2, which includes the feature extraction module, the feature fusion module, and the feature regression module.

Feature Extraction
SIs can contain various types of distortion, such as noise, blur, and structure damage, which inevitably harm the perceived quality. Moreover, semantic information can influence human judgment as well [49]. Therefore, in order to fully investigate the information that affects human perception of SIs, we propose to extract features from both the distortion and semantic elements. The distortion features are extracted using classic image quality descriptors, while the semantic information is collected with the assistance of the high-performance Swin Transformer (ST) backbone [50].

Preliminaries
To better analyze the distortions of SIs, we first apply a local normalization process, which is a common practice in IQA research. Given an SI $I$, the illumination map can be computed using the maximum of the RGB channels:

$$L(i,j) = \max_{c \in \{R,G,B\}} I_c(i,j),$$

where $L$ denotes the illumination map and $i$ and $j$ represent the pixel indexes of the SI. Then, the local mean and variance maps can be derived as follows:

$$\mu_L(i,j) = \sum_{k}\sum_{l} w_{k,l}\, L(i+k,\, j+l),$$

$$\sigma_L(i,j) = \sqrt{\sum_{k}\sum_{l} w_{k,l}\, \left[L(i+k,\, j+l) - \mu_L(i,j)\right]^2},$$

where $w$ is a local Gaussian weighting window, $\mu_L$ represents the local mean map, and $\sigma_L$ represents the local variance map.
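The preprocessing described above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the window radius, the Gaussian sigma, the border handling (index clamping), and the use of a weighted standard deviation for the spread map (as in BRISQUE-style normalization) are choices made here, not values given in the text, and the function names are hypothetical.

```python
import math

def illumination_map(rgb):
    """Per-pixel maximum over the R, G, B channels of an H x W x 3 nested list."""
    return [[max(pixel) for pixel in row] for row in rgb]

def gaussian_window(radius=1, sigma=1.0):
    """Normalized (2r+1) x (2r+1) Gaussian weights (radius and sigma are assumptions)."""
    offsets = range(-radius, radius + 1)
    w = {(k, l): math.exp(-(k * k + l * l) / (2.0 * sigma * sigma))
         for k in offsets for l in offsets}
    total = sum(w.values())
    return {kl: v / total for kl, v in w.items()}

def local_stats(L, window):
    """Weighted local mean map and local spread map of an illumination map L."""
    h, wd = len(L), len(L[0])

    def at(i, j):
        # Clamp indexes at the borders (an assumption; the text does not specify).
        return L[min(max(i, 0), h - 1)][min(max(j, 0), wd - 1)]

    mu = [[sum(wt * at(i + k, j + l) for (k, l), wt in window.items())
           for j in range(wd)] for i in range(h)]
    sigma = [[math.sqrt(sum(wt * (at(i + k, j + l) - mu[i][j]) ** 2
                            for (k, l), wt in window.items()))
              for j in range(wd)] for i in range(h)]
    return mu, sigma
```

On a perfectly flat image the mean map equals the pixel value and the spread map is zero, which is a convenient sanity check before applying the maps to select low-light and flat regions.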

Distortion Feature
Noise Estimation: Due to the limitations of camera devices and the insufficient amount of light in dark environments, SIs can be severely degraded by noise distortions. Thus, estimating the level of noise is important for predicting the quality levels of SIs. Inspired by the task of assessing quality in low-light conditions, we propose using the noise descriptors in [51] to evaluate the level of noise in SIs. Specifically, two traditional noise estimation maps are calculated using the Gaussian filter and the median filter, which eliminate Gaussian noise [52] and salt-and-pepper noise [53], respectively. Then, the noise level can be described by the difference between the images before and after denoising. An example is exhibited in Figure 3. Given a single SI illumination map $L$, the noise difference maps can be derived as follows:

$$M_n = \left| L - F_n(L) \right|,$$

where $M_n$ indicates the noise difference maps, $n \in \{\mathrm{gaussian}, \mathrm{median}\}$, and $F_n$ represents the denoising function for Gaussian filters (with kernel size 7 × 7) and median filters (with kernel size 3 × 3). In common situations, the low-light and flat regions are more easily affected by noise; thus, we compute the final noise level by pooling the noise difference maps in the low-light and flat regions:

$$D_n = E(M_n) = \frac{1}{T_R} \sum_{(i,j) \in R} M_n(i,j),$$

where $D_n$ indicates the estimated noise levels for Gaussian noise and salt-and-pepper noise, $E(\cdot)$ indicates the average operation, $T_R$ denotes the number of pixels in the flat and low-light region set $R$, and the set $R$ contains all pixels of the SI with local mean and variance values smaller than the average local mean and variance.

Blur Description: Blur is a significant factor in the quality assessment of SIs. Limited by the resolution of camera devices and influenced by the compression of the transmission systems, the texture and details in the SIs may be lost. However, the texture and details are usually vital for identifying the objects and understanding the content of the SIs. Therefore, we propose including blur features as distortion features. As shown in Figure 4, gradient features are employed, as they are highly correlated with high-frequency information and have previously been used to describe sharpness [54,55]. Given a single SI illumination map $L$, we use the Sobel gradient operator to obtain the gradient maps:

$$G_L = \sqrt{(L \otimes S_x)^2 + (L \otimes S_y)^2},$$

where $G_L$ indicates the gradient magnitude map of the SI illumination map, the operator $\otimes$ represents the convolution operation, and $S_x$ and $S_y$ are the horizontal and vertical Sobel operators, respectively, which can be described as follows:

$$S_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}.$$

With the computed gradient magnitude maps, the blur measurement can be obtained via average pooling:

$$D_b = E(G_L),$$

where $D_b$ is the blur measurement level.

Structure Damage: The structure is the outline of the main object in an SI. In this sense, structure damage is caused by low visibility of the major content objects [56]. To quantify the extent of structure damage, we utilize the piecewise smooth image approximation (PSIA) proposed in [57] to generate structure maps. In the PSIA formulation, $SM$ represents the structure map, $\Omega$ is the image domain, $K$ denotes the edge set, $\int_K d\sigma$ represents the total edge length, $P$ indicates the pixel, and the coefficients $\alpha$ and $\beta$ are positive regularization constants. Figure 5 presents an example of the PSIA results. Similarly, with the computed structure map we can obtain the structure damage descriptor using average pooling:

$$D_s = E(SM),$$

where $D_s$ is the structure damage descriptor.
Summing Up: The process described above results in two noise features $D_n$ ($n \in \{\mathrm{gaussian}, \mathrm{median}\}$), one blur description feature $D_b$, and one structure damage feature $D_s$, which together form the distortion feature vector $DF \in \mathbb{R}^{1 \times 4}$.
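As a rough illustration of how two of these descriptors could be computed, the following dependency-free Python sketch implements the median-filter noise difference map and the Sobel-based blur measurement with average pooling. It is a simplified sketch, not the authors' implementation: the absolute difference, the border replication, and the helper names are assumptions, and the 7 × 7 Gaussian filter and the PSIA structure map are omitted for brevity.

```python
import math
from statistics import median

def _at(img, i, j):
    """Pixel lookup with border replication (an assumption)."""
    h, w = len(img), len(img[0])
    return img[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

def median_filter3(img):
    """3 x 3 median filter, used here as the salt-and-pepper denoiser."""
    return [[median(_at(img, i + k, j + l) for k in (-1, 0, 1) for l in (-1, 0, 1))
             for j in range(len(img[0]))] for i in range(len(img))]

def noise_difference(img, denoise):
    """Absolute difference between the map before and after denoising."""
    den = denoise(img)
    return [[abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(img, den)]

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def gradient_magnitude(img):
    """Sobel gradient magnitude. The kernels are applied by correlation;
    flipping them for true convolution only changes the sign of each response,
    so the magnitude is identical."""
    def response(kernel, i, j):
        return sum(kernel[k + 1][l + 1] * _at(img, i + k, j + l)
                   for k in (-1, 0, 1) for l in (-1, 0, 1))
    return [[math.hypot(response(SOBEL_X, i, j), response(SOBEL_Y, i, j))
             for j in range(len(img[0]))] for i in range(len(img))]

def mean_pool(values, mask=None):
    """Average of a map, optionally restricted to pixels where mask is truthy
    (for example, the flat and low-light region set R)."""
    kept = [v for i, row in enumerate(values) for j, v in enumerate(row)
            if mask is None or mask[i][j]]
    return sum(kept) / len(kept) if kept else 0.0
```

For example, `mean_pool(noise_difference(L, median_filter3), mask)` would give a median-based noise level over a region mask, and `mean_pool(gradient_magnitude(L))` a blur measurement; a low average gradient indicates few sharp edges.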

Semantic Feature Extraction
Previous works have shown that semantic features are highly correlated with quality assessment. Different semantic contents have diverse impacts on human tolerance for different types of distortion [58,59]. For example, humans find blur distortions on flat and texture-free targets such as plain ocean and smooth walls more acceptable. However, blur distortions on objects that are rich in texture, such as rough rocks and complex plants, can be hard to endure. Considering the huge success of the Swin Transformer [50], we use the Swin Transformer-tiny (ST-t) here as the semantic feature extraction backbone. In addition, as visual information is normally perceived hierarchically from low-level to high-level [55], we employ the hierarchical ST-t for feature extraction:

$$\gamma_k(x) = \mathrm{AP}(F_k(x)), \quad k \in \{1, 2, 3, 4\},$$

$$SF = \gamma_1(x) \oplus \gamma_2(x) \oplus \gamma_3(x) \oplus \gamma_4(x),$$

where $F_k(x)$ denotes the features from the $k$-th stage, $\mathrm{AP}(\cdot)$ stands for the average pooling operation, $\gamma_k(x)$ denotes the pooled results from the $k$-th stage, and $\oplus$ indicates the concatenation operation. Then, we can obtain the semantic features $SF \in \mathbb{R}^{1 \times N_{ST\text{-}t}}$, where $N_{ST\text{-}t}$ represents the number of output channels of the hierarchical ST-t backbone. Specifically, the dimensions of the feature maps of ST-t's four stages are 784 × 192, 196 × 384, 49 × 768, and 49 × 768. After average pooling, the dimensions become 784 × 1, 196 × 1, 49 × 1, and 49 × 1. After concatenation, the number of output channels $N_{ST\text{-}t}$ of the hierarchical ST-t is 784 + 196 + 49 + 49 = 1078.
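The dimension bookkeeping above can be checked with a small sketch. It uses dummy stage outputs with the shapes quoted in the text rather than a real Swin Transformer, and it assumes that the average pooling runs over the channel axis of each stage (which is what the stated 784 × 192 to 784 × 1 reduction implies). The function names are hypothetical.

```python
def average_pool_tokens(feature_map):
    """Average each token (row) over its channels: tokens x channels -> tokens values."""
    return [sum(token) / len(token) for token in feature_map]

def hierarchical_features(stage_maps):
    """Pool every stage and concatenate the pooled vectors in stage order."""
    pooled = []
    for feature_map in stage_maps:
        pooled.extend(average_pool_tokens(feature_map))
    return pooled

# Stage shapes (tokens, channels) as quoted in the text.
STAGE_SHAPES = [(784, 192), (196, 384), (49, 768), (49, 768)]

# Dummy constant features standing in for real backbone outputs.
dummy_stages = [[[0.5] * channels for _ in range(tokens)]
                for tokens, channels in STAGE_SHAPES]

semantic_features = hierarchical_features(dummy_stages)
print(len(semantic_features))  # 784 + 196 + 49 + 49
```

Concatenating this vector with the four distortion features would then yield a fused vector of length 4 + 1078 = 1082.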

Feature Fusion
In order to actively relate the quality-aware information between the distortion features and semantic features, we first concatenate the features to form one feature vector:

$$F = DF \oplus SF,$$

where $F$ represents the final quality-aware feature vector and $\oplus$ indicates the concatenation operation.

Feature Regression
There are several tasks in the quality assessment of SIs, including detection of distortion types, identification of the severity level of each detected distortion, and prediction of the overall quality score.In this paper, we design a corresponding feature regression module for each task.

Classification of Distortion Types and Levels
Supposing that the number of distortion types is $D_{type}$ and the number of levels (including the distortion-free level) of each distortion type is $D_{level}$, we adopt $D_{type}$ detection branches (DBs), each of which detects one specific distortion type and estimates the severity level of that distortion type. Specifically, each DB consists of two fully-connected layers containing 128 and $D_{level}$ neurons, respectively. Then, the final quality-aware feature vector $F$ is run through the different DBs to obtain the severity level of each distortion type, as follows:

$$P_i = \mathrm{DB}_i(F), \quad i = 1, \ldots, D_{type},$$

where the dimension of the predicted vector $P_i$ is $D_{level}$, which corresponds to the probability of each severity level for the $i$-th distortion type. We employ the cross-entropy loss as the loss function for the identification task of each distortion type:

$$\mathcal{L}_i = \mathrm{CE}(P_i, G_i),$$

where $\mathrm{CE}(\cdot)$ refers to the cross-entropy loss function and $G_i$ is the ground-truth label of the severity level for the $i$-th distortion type. Then, we sum the loss functions of all the distortion types to obtain the final loss function:

$$\mathcal{L} = \sum_{i=1}^{D_{type}} \mathcal{L}_i.$$
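A minimal sketch of the summed per-branch cross-entropy is given below. It assumes that each detection branch emits raw scores (logits) that are turned into probabilities with a softmax, which the text does not state explicitly; the function names and the example values are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one branch's severity-level scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_level):
    """Negative log-probability assigned to the ground-truth severity level."""
    return -math.log(softmax(logits)[target_level])

def total_classification_loss(branch_logits, target_levels):
    """Sum of the cross-entropy losses over all distortion-type branches."""
    return sum(cross_entropy(logits, target)
               for logits, target in zip(branch_logits, target_levels))

# Two hypothetical distortion types, four severity levels each (level 0 = distortion-free).
loss = total_classification_loss(
    [[2.0, 0.1, -1.0, 0.3], [0.0, 0.0, 0.0, 0.0]],
    [0, 2],
)
```

With uniform scores, as in the second branch above, the per-branch loss equals log(4), the value expected from a classifier that has learned nothing about four equally likely levels.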

Regression of the Quality Score
With the obtained final quality-aware feature vector $F$, a two-stage fully-connected layer is applied to regress the features into quality scores:

$$Q = \mathrm{FC}(F),$$

where $\mathrm{FC}(\cdot)$ stands for the fully connected layers and $Q$ represents the regressed quality scores. For quality assessment tasks, it is necessary to pay attention to the accuracy of the predicted quality levels. Furthermore, the correctness of the quality rankings should also be considered [49]. Therefore, the loss function employed in this paper includes two parts: the mean squared error (MSE) and the rank error. The MSE loss is employed in order to force the predicted quality values to be close to the quality labels, and can be computed as follows:

$$\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(Q_i - Q'_i\right)^2,$$

where $Q_i$ represents the predicted quality value, $Q'_i$ is the quality label of the SI, and $n$ is the size of the mini-batch. The rank loss helps the model to distinguish small quality differences when the SIs have quite similar quality labels. For this purpose, we employ the differentiable rank function described in [60] to approximate the rank loss for each pair of SIs in a mini-batch, where $i$ and $j$ are the corresponding indexes of the two SIs. The rank loss $\mathcal{L}_{rank}$ is derived from these pairwise terms. Then, the loss function can be calculated as the weighted sum of the MSE loss and the rank loss:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{MSE} + \lambda_2 \mathcal{L}_{rank},$$

where $\lambda_1$ and $\lambda_2$ define the weights of the MSE loss and the rank loss, respectively.
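The combined objective can be illustrated with the sketch below. The MSE part follows the description directly. The pairwise rank term, however, is an assumption: it uses a hinge-style formulation that is common in the IQA literature and is not confirmed to be the exact function of [60]. The default weights of 1.0 are likewise placeholders, since the text does not report the values of the two weights.

```python
def mse_loss(pred, label):
    """Mean squared error between predicted scores and quality labels."""
    return sum((p - q) ** 2 for p, q in zip(pred, label)) / len(pred)

def pairwise_rank_loss(pred, label):
    """Assumed hinge-style rank loss averaged over all ordered pairs (i, j).

    A pair contributes zero when the predicted difference has the same sign as,
    and is at least as large as, the label difference.
    """
    n = len(pred)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sign = 1.0 if label[i] >= label[j] else -1.0
            margin = abs(label[i] - label[j])
            total += max(0.0, margin - sign * (pred[i] - pred[j]))
    return total / (n * n)

def total_loss(pred, label, lam_mse=1.0, lam_rank=1.0):
    """Weighted sum of the MSE loss and the rank loss (weights are placeholders)."""
    return lam_mse * mse_loss(pred, label) + lam_rank * pairwise_rank_loss(pred, label)
```

Under this formulation, predictions that reproduce the labels exactly incur zero loss, while predictions in reversed order are penalized by both terms, which is the behavior the rank term is meant to add on top of the MSE.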

Benchmark Databases
We mainly validated our method on the SI Quality Database (SIQD) [7] and the Video Surveillance Quality Assessment Database (VSQuAD) [8]. The SIQD database contains 500 in-the-wild SIs that are diverse in terms of both content and distortions. The main objects in the SIs in the SIQD database include humans and vehicles, and the database covers a wide resolution range, from 352 × 288 to 1920 × 1080. The VSQuAD database contains 964 single-distortion-affected and 612 multiple-distortion-affected surveillance videos (SVs) generated from 36 reference SVs. The distortions include defocus blur, haze, low-light conditions, motion blur, rain, smoke, uneven illumination, and compression artifacts. Each SV lasts for 10 s. Because we propose an IQA method for SIs, we extract ten frames from each SV (one frame per second) as the representative SIs for that SV. Thus, the extracted SIs share the distortion type and level labels of the source SV.
Additionally, we conducted a subjective experiment to gather the quality labels using the SIQD database.Several human participants were invited to judge the quality of the SIs in a well-controlled environment, and their mean opinion scores were recorded as the ground truth for the SIs.For the VSQuAD database, distortions were manually introduced to the surveillance videos, and the type and strength levels of the added distortions were recorded for use as the ground truth.

Experimental Setup
The employed hierarchical ST-t [50] backbone was initialized with the weights pretrained on the ImageNet database [61] for semantic feature extraction. The SIs were first resized to a resolution of 256 × 256 and then randomly cropped into patches with a resolution of 224 × 224 as the inputs. The Adam optimizer [62] was utilized, with the initial learning rate set to 1 × 10^−4. The learning rate decayed by a ratio of 0.95 every five epochs. The default number of training epochs was set to 50. If the training loss did not decrease for ten epochs, the training process was ended. Furthermore, we employed a five-fold cross validation strategy. We split the SIQD database into five groups, with each group containing 100 SIs. For each group, we trained the model on the remaining four groups and used the held-out group as the testing set. This process was repeated five times to ensure that each group was used as the testing set exactly once. Then, the average performance was recorded as the final performance of the model. A similar five-fold cross validation strategy was applied to the VSQuAD database.
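The learning-rate schedule and the five-fold protocol described above can be sketched as follows. Two details are assumptions rather than facts from the text: the decay is modeled as a staircase (the rate changes only at multiples of five epochs), and the images are shuffled with a fixed seed before being split into groups. The function names are hypothetical.

```python
import random

def learning_rate(epoch, base_lr=1e-4, decay=0.95, step=5):
    """Staircase decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

def five_fold_splits(n_items=500, n_folds=5, seed=0):
    """Yield (train_indexes, test_indexes) so each group is the test set exactly once."""
    indexes = list(range(n_items))
    random.Random(seed).shuffle(indexes)  # shuffling is an assumption
    size = n_items // n_folds
    folds = [indexes[k * size:(k + 1) * size] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for m, fold in enumerate(folds) if m != k for i in fold]
        yield train, test

for fold_id, (train, test) in enumerate(five_fold_splits()):
    print(fold_id, len(train), len(test))  # 400 training and 100 testing SIs per fold
```

The reported score for a model would then be the mean of the per-fold test results.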

IQA Competitors
To fully validate the effectiveness of the proposed method, several mainstream IQA methods were selected for comparison, which can be categorized into two types: handcrafted-based methods and deep-learning-based methods.
It is worth mentioning here that the compared methods were all retrained using the default experimental setup.

Classification of the Distortion Types and Levels
To assess the detection of distortion types and the identification of the severity level of each detected distortion, we utilize the Accuracy and F1 score to evaluate the predictive performance of different quality assessment metrics. Specifically, the following four evaluation metrics are used:

• Accu_type: The ratio of correctly predicted observations to the total observations for distortion detection.
• F1_type: The harmonic mean of Precision and Recall for distortion detection.
• Accu_both: The ratio of correctly predicted observations to the total observations for distortion detection with severity level identification.
• F1_both: The harmonic mean of Precision and Recall for distortion detection with severity level identification.
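For concreteness, the accuracy and F1 score for a single binary detection task (distortion present or absent) can be computed as in the sketch below. How the scores are aggregated across distortion types and severity levels is not specified in the text, so only the binary case is shown; the function names are illustrative.

```python
def accuracy(pred, truth):
    """Fraction of observations whose predicted label equals the true label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def f1_score(pred, truth, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    tp = sum(p == positive and t == positive for p, t in zip(pred, truth))
    fp = sum(p == positive and t != positive for p, t in zip(pred, truth))
    fn = sum(p != positive and t == positive for p, t in zip(pred, truth))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 1 = distortion detected, 0 = not detected (toy labels).
predicted = [1, 1, 0, 0]
actual = [1, 0, 1, 0]
print(accuracy(predicted, actual), f1_score(predicted, actual))
```

Unlike accuracy, the F1 score ignores true negatives, so it stays informative when most SIs do not contain the distortion being detected.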

Regression of the Quality Score
Here, four criteria are used to evaluate the performance of the quality assessment models: the Spearman Rank Order Correlation Coefficient (SRCC), the Pearson Linear Correlation Coefficient (PLCC), the Kendall Rank Correlation Coefficient (KRCC), and the Root Mean Squared Error (RMSE). These four statistical indexes describe different aspects of the performance of IQA models. To be more specific, SRCC and KRCC both reflect the prediction monotonicity, while PLCC and RMSE reflect the prediction linearity and prediction accuracy, respectively. The calculation equations are as follows:

• Spearman rank order correlation coefficient (SRCC):

$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)},$$

where $d_i$ represents the difference between the $i$-th image's ranks in the subjective evaluations and the predicted scores, while $N$ is the number of testing images. SRCC is used to measure the prediction monotonicity. The value of SRCC lies between −1 and 1; the closer the value is to 1, the better the result predicted by the model.

• Pearson linear correlation coefficient (PLCC):

$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (s_i - \bar{s})(p_i - \bar{p})}{\sqrt{\sum_{i=1}^{N} (s_i - \bar{s})^2 \sum_{i=1}^{N} (p_i - \bar{p})^2}},$$

where $s_i$ and $p_i$ represent the $i$-th image's subjective score and predicted score, while $\bar{s}$ and $\bar{p}$ are the means of all $s_i$ and $p_i$. PLCC can be used to estimate the linearity and consistency of prediction. The value of PLCC lies between −1 and 1, with values closer to 1 being better.

• Kendall rank order correlation coefficient (KRCC):

$$\mathrm{KRCC} = \frac{N_c - N_d}{\frac{1}{2} N (N - 1)},$$

where $N_c$ and $N_d$ represent the numbers of concordant and discordant pairs in the testing data. Similar to SRCC, KRCC can be used to measure the monotonicity. The value of KRCC lies between −1 and 1, with values closer to 1 being better.

• Root mean square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (s_i - p_i)^2}.$$

RMSE is used to evaluate prediction accuracy. The RMSE value is non-negative; a smaller value indicates higher accuracy of the model.

Before computing the criteria values, we utilize a five-parameter logistic regression function to fit the predicted scores to the scale of the quality labels:

$$\hat{y} = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + e^{\beta_2 (y - \beta_3)}} \right) + \beta_4 y + \beta_5,$$

where $\{\beta_i \mid i = 1, 2, \ldots, 5\}$ are the parameters to be fitted, $y$ represents the predicted scores, and $\hat{y}$ represents the mapped scores.
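The four criteria and the logistic mapping can be written compactly in plain Python, as sketched below. Two choices here are assumptions: SRCC is computed as the Pearson correlation of average ranks, which matches the usual closed-form expression when there are no ties, and KRCC simply ignores tied pairs in the numerator. Fitting the five logistic parameters requires a nonlinear least-squares routine and is not shown; only the mapping function itself is given.

```python
import math

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average
        pos = end + 1
    return result

def plcc(s, p):
    """Pearson linear correlation between subjective and predicted scores."""
    ms, mp = sum(s) / len(s), sum(p) / len(p)
    num = sum((a - ms) * (b - mp) for a, b in zip(s, p))
    den = math.sqrt(sum((a - ms) ** 2 for a in s) * sum((b - mp) ** 2 for b in p))
    return num / den

def srcc(s, p):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return plcc(ranks(s), ranks(p))

def krcc(s, p):
    """Kendall correlation from concordant and discordant pair counts."""
    n, nc, nd = len(s), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (s[i] - s[j]) * (p[i] - p[j])
            if product > 0:
                nc += 1
            elif product < 0:
                nd += 1
    return (nc - nd) / (n * (n - 1) / 2)

def rmse(s, p):
    """Root mean squared error between subjective and predicted scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, p)) / len(s))

def logistic5(y, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping applied to a predicted score y."""
    return b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (y - b3)))) + b4 * y + b5
```

A perfectly monotonic but nonlinear predictor would reach SRCC and KRCC of 1 while PLCC stays below 1, which is why the logistic mapping is applied before the linearity and accuracy criteria are computed.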

Performance Discussion
The experimental performance results on the SIQD and VSQuAD databases are shown in Tables 1 and 2, from which we can draw several conclusions: (a) the deep-learning-based methods achieve much better performance than the handcrafted-based methods, indicating that the semantic information extracted by the CNN or vision transformer backbone is very important for the quality prediction of SIs; (b) the proposed NR IQA method performs the best on both the SIQD database and the VSQuAD database compared with the other NR IQA metrics, which demonstrates the effectiveness of the proposed NR IQA method for SIs; (c) the proposed model outperforms all the compared deep-learning-based methods, indicating that the low-level visual features related to distortions in the SIs serve as a vital complement to the deep features for the quality assessment of SIs, from which it can be concluded that it is necessary to design relevant features for specific IQA problems; (d) the task of identifying the severity level of each distortion is relatively difficult compared to the task of detecting the distortion type, which suggests that quality assessment models are less sensitive to the distortion level.

Statistical Test
To further validate the effectiveness of the proposed method, we carried out statistical significance tests following the procedure suggested in [64]. In this subsection, these statistical tests are used to compare the relations between the predicted results and the subjective labels. The null hypothesis of the t-test is that the residuals of two quality metrics are drawn from the same distribution and are statistically indistinguishable at the 95% confidence level. The statistical significance test results are shown in Figure 6. From the figure, it can be seen that the proposed method is significantly superior to nine compared methods on the SIQD database and eleven compared methods on the VSQuAD database, indicating that the proposed method has a better ability to detect and evaluate distortions in SIs.

Ablation Study
To further investigate the respective contributions of different types of features, we performed an ablation experiment to compare the distortion features, semantic features and hybrid (distortion and semantic) features.
The results of the ablation experiment are listed in Tables 3 and 4. First, it can be seen that the hybrid features perform better than either the distortion features or the semantic features alone. Second, the contribution of the distortion features is inferior to that of the semantic features, meaning that the semantic features are more important in the quality assessment of SIs. Finally, the semantic features alone perform worse than all the compared deep learning NR IQA metrics, which may be explained by the resize operation causing a loss of texture information, which could in turn affect the performance.

Conclusions
To tackle the challenge of SIQA and provide more useful guidelines for surveillance systems, in this paper we propose a hybrid no-reference image quality assessment method. The features are mainly extracted from the distortion and semantic aspects. Specifically, the distortion features are extracted using hand-crafted noise, blur, and structure descriptors. We employ Swin Transformer-tiny as the backbone for semantic feature extraction, in light of its great success as a vision transformer. Afterwards, the hybrid features are concatenated and regressed into quality values with the assistance of fully-connected layers. The proposed method is validated on the SI Quality Database (SIQD) and the Video Surveillance Quality Assessment Database (VSQuAD). Finally, we evaluate several similar methods and compare them to our proposed method by assessing the correlation between their predicted scores and quality labels and by measuring their accuracy when predicting distortion types and levels. From the experimental results, we find that the proposed method outperforms all the compared methods, revealing its strong ability to solve the SIQA problem.

Figure 1. Examples of quality degradations in SIs: (a) the SI suffers from uneven illumination distortion; (b) the SI suffers from motion blur distortion.

Figure 2. The framework of the proposed method.

Figure 3. Examples of noisy images and denoised gray images. The noise in the low-light and flat regions is reduced after Gaussian and median filtering.

Figure 4. Example of an original SI and the corresponding Sobel-operated SI.

Figure 6. Statistical test results of the proposed method and compared methods on the SIQD and VSQuAD databases: (a) statistical test results on the SIQD database and (b) results on the VSQuAD database. Black/white blocks mean that the method in that row is statistically worse/better than the one in the corresponding column. A gray block means that the method in the row and that in the column are statistically indistinguishable. The methods denoted by A-L are in the same order as in Tables 1 and 2.

Table 1. Performance results on the SIQD database.

Table 2. Performance results on the VSQuAD database.

Table 3. Ablation study results on the SIQD database; DF represents distortion features and SF indicates semantic features. The default experimental setup and quality regression mechanism are maintained.

Table 4. Ablation study results on the VSQuAD database; DF represents distortion features and SF indicates semantic features. The default experimental setup and quality regression mechanism are maintained.