Dual-Anchor Metric Learning for Blind Image Quality Assessment of Screen Content Images

Abstract: The natural scene statistics (NSS) are destroyed by the artificial portion of screen content images (SCIs), and obtaining an accurate statistical model is impractical because of the variable composition of the artificial and natural parts of SCIs. To resolve this problem, this paper presents a dual-anchor metric learning (DAML) method, inspired by metric learning, that obtains discriminative statistical features, identifies complex distortions, and predicts SCI quality. First, two Gaussian mixed models (GMMs) with prior data are constructed from natural and artificial image databases as the target anchors of the statistical model, which enhances the metric discrimination of the mapping between the feature representation and quality degradation by conditional probability analysis. Then, the distances of the high-order statistics are softly aggregated to conduct metric learning between the local features and the clusters of each target statistical model. Based on empirical analysis and experimental verification, only variance differences are used as quality-aware features to balance complexity and effectiveness. Finally, the mapping model between the target distances and subjective quality is obtained by support vector regression. To validate the performance of DAML, multiple experiments are carried out on three public databases: SIQAD, SCD, and SCID. PLCC, SRCC, and RMSE are employed to compare the subjective and objective ratings, which estimate the prediction accuracy, monotonicity, and consistency, respectively. The PLCC and RMSE of the method reached 0.9136 and 0.7993, respectively. The results confirm the good performance of the proposed method.


Introduction
The screen content image (SCI) is an important medium for human-computer interaction that can offer people a high standard of comfort and high-quality visual experiences. Thus, SCIs are extensively used in remote desktops, cloud computing, video games, multiscreen interaction, and other fields [1][2][3][4]. However, noise is inevitably introduced during image acquisition, transmission, and storage, which degrades SCI quality and impairs the visual experience [5][6][7]. Thus, a reliable quality estimation of SCIs plays a critical role in guiding the optimization of processing systems. Currently, image quality assessment (IQA) methods can be classified into three categories based on the availability of reference image information: full-reference (FR), reduced-reference (RR), and no-reference or blind (NR). Because the reference version of authentically distorted images is not available in most cases, constructing an effective blind image quality assessment (BIQA) method for SCIs has important research significance and practical application value.

Related Work
Many BIQA methods have progressed markedly in recent decades when analyzing natural images. However, these methods are not suitable for SCIs, as demonstrated in existing studies. The main reason is that the inherent characteristics of SCIs are quite different from those of natural images [8,9]. More specifically, SCIs are arbitrarily composed of natural and artificial parts via splicing or overlapping. The natural part is similar to natural images, containing rich and complex brightness and color distribution, but the artificial part is generally just the opposite. Therefore, the perception preferences exhibit a marked difference from the natural images. For this problem, some prior studies have been carried out in this field from different perspectives and can be roughly categorized into feature-inspired methods and neural network-based methods. The former methods, as the name implies, construct quality-aware features in the grayscale domain by fully considering the perceptual properties of SCIs for a certain aspect and then learning the mapping model between the obtained features and subjective quality to predict the distorted image quality. Gu et al. constructed 13 and 4 types of perceptual features to characterize image quality by analyzing the degradation mechanisms of structure, brightness, and so on [10,11]. Min et al. extracted and integrated the multiscale corner and edge features of SCIs [12]. Lu et al. extracted the orientation and structure features based on the orientation selectivity mechanism [13]. Fang et al. incorporated statistical brightness and texture features inspired by the human visual system [14]. Zheng et al. used the variance of the local standard deviation as a local feature and the hybrid region-based property as a global feature [15]. Fang et al. resorted to photometric invariant chromatic descriptors and local ternary pattern operators to measure the statistical features of the color and texture of SCIs, respectively [16]. 
Considering the redundancy of the spatial domain, some efforts have been devoted to representing these artificial feature vectors with more compact representations via sparse representation. Yang et al. characterized the local texture property of SCIs with the oriented gradient histogram and then represented these texture features using sparse coding [17]. Zhou et al. constructed the local and global dictionaries to achieve a fused quality representation for distorted SCIs [18]. Shao extracted quality-aware features by conducting local and global sparse representations for the corresponding regions [19]. Wu et al. leveraged sparse representation to extract the local structural feature and the global brightness feature [20]. Bai et al. learned content-specific codebooks to generate effective micro features [9] and further combined the macro features based on the Bernoulli law of large numbers for quality prediction [21]. In brief, these artificial features in the spatial or sparse domain can intuitively describe the content variations within each SCI, such as brightness, texture, and shape, and demonstrate moderate performance in legacy benchmark databases. However, limited by visual mechanisms and subjective knowledge, these features only focus on specific distortion types and cannot be authentically effective in revealing the essence of real-world distortions for SCIs.
Differing from feature-inspired methods, neural network-based methods make full use of end-to-end characteristics to capture the high-level features of SCIs, which can more efficiently characterize advanced semantic information by imitating human visual perception. Chen et al. designed a naturalization module composed of an upsampling layer and a convolutional layer for the quality prediction of SCIs [22]. Jiang et al. proposed a novel quadratic optimized model to optimize a deep convolutional neural network for SCIs [23]. Yue et al. designed a convolutional neural network for SCIs with the entire image instead of image patches as inputs [24]. Jiang et al. modified the convolutional neural network by treating image patches differently according to their contents [25]. Yang et al. proposed a multitask distortion-learning network by combining the distortion types and degree as prior knowledge to predict SCI quality [26]. Then, Yang et al. designed an AdaBoosting backpropagation neural network by integrating the contour and edge information with L-moment distribution estimation [27]. These high-level features are more adaptable to complex and specific tasks but lack intuition and interpretability due to the neural network's characteristics. Moreover, because their performance often depends on the design of the network structure and the scale of the database, it is typically difficult to obtain an optimal model with good stability. Such models are also prone to underfitting or overfitting.
In conclusion, current methods primarily focus on feature extraction and neural network structure and rarely attempt to describe the statistical characteristics of SCIs, because the artificial part of SCIs destroys the natural scene statistics (NSS) features [28], which are widely used for BIQA of natural images and achieve very good effectiveness [29,30]. Bai et al. designed a lognormal pooling scheme to enhance the effectiveness of feature aggregation by analyzing the particularity of the statistical distribution of sparse codes [21]. Chen et al. introduced the correlation penalization between different feature dimensions, leading to features with lower ranks and higher diversity [31]. Yang et al. extracted the quality-aware features from the textual region and pictorial region [32]. Thus, finding a reliable statistical model that can efficiently discriminate the intrinsic quality variations is still a marked challenge that must be overcome.

Contributions
To fill these gaps in knowledge, the dual-anchor metric learning (DAML) method is designed to evaluate the quality of distorted SCIs more accurately in this study. Considering that the NSS can easily be destroyed by the artificial portions of an SCI, it is difficult and impractical to obtain an accurate statistical model of SCIs. Inspired by metric learning, we do not deliberately seek an accurate statistical model of SCIs but rather construct a distance function to measure the similarity or difference degree with the available models and then apply the distance to identify the complex mixtures of distortions of SCIs. First, two available statistical models with prior data are constructed as the target anchors of the statistical model from two uncorrelated pristine databases. Then, the differences in the second-order statistics are softly aggregated between the local features and clusters of each target statistical model. Finally, the differences are used to predict the distorted image quality via support vector regression. Compared with other studies reported in the literature, the main contributions of this paper are summarized as follows:

• Metric learning is used to characterize the statistical features of SCIs, providing a new direction for the establishment of statistical feature models of complex scenes. Considering the variable composition of SCIs, statistical features cannot be accurately represented with a single statistical model but can be more reliably characterized by the measured distance to some available statistical models, as inspired by metric learning. In this paper, the dual-anchor and variance differences contribute to the multi-aspect analysis of complex mixtures of SCI distortions, avoiding the dependence on some specific distortion types, and experimental results with three public SCI databases confirm the effectiveness of the proposed method.

• The performance of metric learning is directly determined by the anchor point and metric function. Most existing studies focused on generating a single statistical model with only one dataset, based on the assumption that each distortion follows a uniform distribution. However, this strategy fails to describe the statistical characteristics of SCIs due to the intricate content, variable composition, and composite mixtures of multiple distortions. Thus, we resort to a dual-anchor statistical model as the anchor point for SCIs in this study. First, two Gaussian mixed models (GMMs) with different characteristics are generated from representative datasets with unrelated images, and then both are used as the positive and negative anchor points. Specifically, the GMM is used as the statistical model of the anchor points for more informative scene representation, because the GMM is a linear combination of multiple Gaussian distribution functions and fully incorporates prior knowledge, which is theoretically suitable for the description of complex scene distributions. Meanwhile, the measured distances of high-order statistics are used as a metric function for efficient distance calculation, and only the variance differences are used as the quality-aware features in this study to balance complexity and effectiveness, based on empirical analysis and experimental verification.
• Both color and brightness information are combined via tensor decomposition to avoid information loss and streamline feature extraction. As mentioned above, existing methods primarily generate features in the grayscale domain and generally ignore color information. With tensor decomposition, the brightness and color information are fused in the principal component without losing the primary texture details. This component is therefore used as the carrier to train the models and extract features in this paper, which has a positive effect on performance.
The remainder of this paper is organized as follows. In Section 2, the motivation and methodology of the proposed method are described in detail. Section 3 shows the experimental results and compares the performances with the state-of-the-art methods. Finally, Section 4 concludes the paper.

Materials and Methods
Considering that the artificial portion of SCIs destroys the NSS feature of natural scenes, we construct a dual-anchor metrics function to measure the high-order statistical differences with the existing statistical models inspired by metric learning and then apply them to identify the complex mixtures of distortions of SCIs. The flowchart of the proposed BIQA method is shown in Figure 1. Obviously, the proposed method involves two stages: offline model training and online quality prediction, which will elaborate the motivation and methodology of the anchor point and metrics function, respectively. Specifically, the training stage involves anchor location and model learning, which are implemented offline with two collected pristine image datasets and will end once the two target GMMs have been trained. For the test SCI, only the testing stage is involved, and the quality prediction consists of two steps: feature generation and quality regression. Among them, feature generation softly aggregates the high-order statistical differences between the clusters of local features and the generated dual-anchor statistical models. Then, quality regression is performed via support vector regression (SVR) based on the combined statistical differences.

Offline Model Training
For metric learning, the distance function can be expressed as a set of points with the following relations: the sample points are similar or dissimilar anchors, and the metric function is optimal for distance calculation [33]. Thus, the performance of the distance function is directly determined by both the anchor point and metrics function. In this subsection, the influence of anchor points on the model reliability will be described in detail through two steps: anchor location and model learning.

Anchor Location
The core of metric learning is to predict the probability of subjective qualities for each image by calculating the similarity or difference between the learned statistical models. Thus, the models must be sensitive to the position in the feature space, and choosing an appropriate anchor can effectively improve the discrimination and expressiveness of the features, making it easier for the models to identify the degree of image distortion. For example, the statistical model of NSS features has been demonstrated to be stable and mature for natural images, and it has been mapped to predict the visual quality scores with efficient performance. However, for SCIs, artificial components, such as computer graphics and document contents, destroy these statistical features of natural scenes. To date, it is still impractical to obtain an accurate statistical model due to the variable composition of the artificial and natural parts in SCIs.
Assuming that two distortions of SCIs follow the distribution, as shown in Figure 2 with different colors and numbers, obviously, the metric accuracy of the distribution for each distortion is different when the distortion is projected on different axes. Taking the distribution of (1) in purple as an example, the performance of the statistical difference is markedly better when it is projected onto the vertical axis than when it is projected onto the horizontal axis. However, the opposite is true for the distribution of (2) in orange. These results indicate that using only a single anchor for the metric method is not sufficient to represent the specific characteristics of SCIs due to their intricate content, variable composition, and composite mixtures of multiple distortions. Thus, a more efficient method should be designed to convey authentically distorted image quality. As shown in Figure 2, a naive idea is to design some independent anchors and further employ the mutual constraints between these anchors to make the quality mapping of metric learning more robust. Obviously, the number of anchor points directly affects the robustness and complexity of the method. Thus, because SCIs are arbitrarily composed of artificial and natural portions, their image quality will be reduced with increasing noise intensity and types. The natural and artificial portions exhibit different statistical features from each other that are unrelated. Hence, two representative subsets, with the collected pristine natural images and artificial images shown in Figure 3, were built to characterize the extreme content characteristics of SCIs in two opposite directions and were then used to train the unrelated statistical models. Subsequently, both models were used as the positive and negative anchor points. The experimental results in Section 3 can verify the superiority of this dual-anchor statistical model.  
The natural image dataset had a total of 90 images collected from TID [34] and LIVE [35] public datasets, and the artificial image dataset had a total of 100 document content images, where all pictures were obtained by manual screenshots. Considering the pristine natural image dataset as an example, the raw image was preprocessed first with tensor decomposition and other feature enhancement techniques, and then the constructed feature vector was used for subsequent model training. The specific process is described as follows.
First, tensor decomposition was employed to account for the color property, which had not been considered in previous studies on BIQA of SCIs. As a form of higher-order principal component analysis, Tucker tensor decomposition can decompose a tensor $\chi \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ into a core tensor $\varsigma \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ multiplied (or transformed) by a group of matrices along each mode [36]. Specifically, a data cube of the RGB image can be converted into a three-order tensor as follows:

$$\chi = \varsigma \times_1 Y^{(1)} \times_2 Y^{(2)} \times_3 Y^{(3)}$$

where $\chi \in \mathbb{R}^{I_1 \times I_2 \times I_3}$; $I_1$, $I_2$, and $I_3$ are the sizes of the red, green, and blue channels of the raw image, respectively, and $Y^{(1)}$, $Y^{(2)}$, and $Y^{(3)}$ are the factor matrices with the same sizes as each channel, which are typically orthogonal. As mentioned in our previous study [21], we can draw the following conclusions. $Y^{(1)}$, as the principal component, basically preserves the texture details and brightness range. Meanwhile, the brightness property and color information are seamlessly combined. Thus, the principal component is adopted as the carrier of subsequent model training.
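The idea of fusing brightness and color into one component can be illustrated with a small NumPy sketch. It is only one plausible reading, not the authors' implementation: it assumes the "principal component" is the image obtained by projecting the three color channels onto the leading factor vector of the color mode, computed with an SVD of the pixels-by-channels matrix. The function name `principal_color_component` is hypothetical.

```python
import numpy as np

def principal_color_component(rgb):
    """Project an H x W x 3 image onto its leading colour-mode factor vector.

    Assumption: this stands in for the paper's principal component Y(1);
    the exact construction used by the authors is not given in the text.
    """
    rgb = np.asarray(rgb, dtype=float)
    h, w, c = rgb.shape
    # Pixels x channels matrix (transpose of the mode-3 unfolding of the tensor).
    unfolded = rgb.reshape(h * w, c)
    # Right singular vectors span the colour mode; the first one carries most energy.
    _, _, vt = np.linalg.svd(unfolded, full_matrices=False)
    v = vt[0]
    # Singular vectors are defined up to sign; fix it so the weights are mostly positive.
    if v.sum() < 0:
        v = -v
    return (unfolded @ v).reshape(h, w)
```

For an image whose three channels are identical, the leading vector weights the channels equally, so the result is the gray image scaled by the square root of three.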
For the principal component, the raw patches, which are n × n in the grayscale domain, are all normalized with a divisive normalization transform to imitate the early nonlinear processing in the human visual system, reduce data redundancy, and maintain data consistency [37,38]:

$$\hat{p}(i,j) = \frac{p(i,j) - \alpha}{\beta + \gamma}$$

where $p(i,j)$ and $\hat{p}(i,j)$ are the raw and normalized patches of the principal component $Y^{(1)}$, respectively, $(i,j)$ are the indices over the entire image, $\alpha$ and $\beta$ are the local mean and standard deviation of each patch, respectively, and $\gamma$ is a constant to prevent instability, which is set to 10 empirically in this paper. In addition, a whitening process is used to eliminate the linear correlations of each patch [39]. Finally, the global feature vector is constructed with these normalized image patches $\hat{p}(i,j)$ to implement the subsequent model training.
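The normalization step can be sketched as follows. This is a minimal illustration, assuming the usual form of divisive normalization (subtract the patch mean, divide by the patch standard deviation plus the constant) and, as a further assumption not stated in the text, non-overlapping tiling of the image into patches. The whitening step is omitted, and both function names are hypothetical.

```python
import numpy as np

def divisive_normalize(patch, gamma=10.0):
    """Subtract the patch mean and divide by (patch std + gamma)."""
    patch = np.asarray(patch, dtype=float)
    alpha = patch.mean()   # local mean of the patch
    beta = patch.std()     # local standard deviation of the patch
    return (patch - alpha) / (beta + gamma)

def normalized_patches(image, n=7, gamma=10.0):
    """Tile a 2-D image into non-overlapping n x n patches (an assumption)
    and return them as rows of an (N, n*n) array after normalization."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape[0] // n, image.shape[1] // n
    out = []
    for r in range(rows):
        for c in range(cols):
            block = image[r * n:(r + 1) * n, c * n:(c + 1) * n]
            out.append(divisive_normalize(block, gamma).ravel())
    return np.array(out)
```

The constant in the denominator keeps flat patches, whose standard deviation is zero, from causing a division by zero; such patches simply map to all zeros.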

Model Learning
After the anchor location mentioned above, how to construct two appropriate statistical models from two pristine databases, which are used as the target dual-anchor statistical models, must be determined. Selecting a model type is still a particular challenge for anchor points for each database.
As reported in the literature [40], Xu et al. presented a BIQA method for natural images based on high-order statistics aggregation (HOSA) with a small codebook, which calculated the differences of high-order statistics between the local features and corresponding clusters as the quality-aware image representation. In essence, this method is a simplified distance metric learning with a statistical model. Specifically, the codebook is equivalent to constructing a statistical model as an anchor point, and these statistics differences (i.e., mean, variance, and skewness) correspond to the distance measures of different orders. Each distortion pattern is characterized by a different kind of cluster, and this relative relationship varies with the distortion level. Therefore, the HOSA can measure the quality of the natural images more effectively.
However, the HOSA limit factors are more obvious for synthetic SCIs, one of which is the generality problem of the statistical model. For SCIs, the NSS feature of natural images is destroyed by the artificial portion, and no particularly reliable statistical model has been found to date due to the combined diversity of SCIs. If HOSA is directly transplanted to SCIs with only a single model (i.e., one anchor point), it does not exhibit effective performance compared with natural images, considering the varied and unpredictable distribution for the SCIs, as shown in Figure 2. Additionally, the statistical model of HOSA is constructed with a small codebook that contains only 100 codewords, which is relatively simple and suitable for natural scenes. However, the universality and robustness of this model seem to be marginally insufficient to reveal the statistical characteristics of SCIs due to the intricate content, variable composition, and composite mixtures of multiple distortions. Currently, the Gaussian mixed model (GMM) has been widely used to solve the situation where the data in the same set contain multiple different distributions, and it has achieved remarkable successes in many image processing tasks [41]. Compared with the limited codewords, the typical character is that the GMM is a linear combination of multiple Gaussian distribution functions which can theoretically fit any type of distribution by setting the cluster property. Therefore, the GMM was adopted as the target model to enhance the universality and robustness in this paper.
Meanwhile, HOSA lacks the effective guide provided by a priori information. For ill-conditioned problems, the core paradigm is to introduce a priori information to achieve the goal of discovering hidden and meaningful knowledge from limited data [42]. Hence, the a priori information must be applied reasonably to overcome shortages of limited feature information in the BIQA domain of SCIs and thus enhance the generalization and sensitivity of feature representation for SCIs. Two available GMMs with priors are constructed as the final statistical models in this paper, and the model learning process is illustrated as follows.
For the natural image dataset, we considered these normalized image patches $\hat{p}(i,j)$ as local features and chose the VLFeat open-source library to implement GMM training [43]. For each image, N normalized patches are extracted such that $X = [\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_N] \in \mathbb{R}^{D \times N}$ (D = n × n), where each column corresponds to one patch. Therefore, the constructed GMM for X can be described as

$$P_N(X \mid \rho, \mu, \sigma^2) = \sum_{k=1}^{K} \rho_k \, \varphi(X \mid \mu_k, \sigma_k^2)$$

where $P_N$ is the probability density function generated with the natural image dataset, $\rho_k$, $\mu_k$, and $\sigma_k^2$ are the prior, mean, and covariance of the kth cluster in the GMM, respectively, $\rho_k \geq 0$, $\sum_{k=1}^{K} \rho_k = 1$, and $\varphi$ is the Gaussian probability density function of each cluster. In addition, K clusters of the GMM were constructed to capture various distortion characteristics. Similarly, the probability density function generated with the artificial image dataset is expressed as $P_A$. Note that this process is performed offline, and both GMMs ($P_N$ and $P_A$) can be applied directly as the target dual-anchor statistical model for feature learning of the test image without subsequent updates.
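To make the role of the priors, means, and variances concrete, the sketch below evaluates the log-density of one local feature under a GMM. It assumes diagonal covariances (one variance per dimension and cluster), which is an assumption of this example rather than something the text states, and it does not reproduce the VLFeat training itself. The function name `gmm_log_density` is hypothetical.

```python
import numpy as np

def gmm_log_density(x, priors, means, variances):
    """Log of sum_k prior_k * N(x | mean_k, diag(var_k)) for one D-dim feature.

    priors: (K,), means: (K, D), variances: (K, D). Diagonal covariance is assumed.
    """
    x = np.asarray(x, dtype=float)
    priors = np.asarray(priors, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff2 = (x[None, :] - means) ** 2
    # Per-cluster Gaussian log-density, summed over the D independent dimensions.
    log_comp = -0.5 * (np.log(2.0 * np.pi * variances) + diff2 / variances).sum(axis=1)
    a = np.log(priors) + log_comp
    # Log-sum-exp for numerical stability.
    m = a.max()
    return float(m + np.log(np.exp(a - m).sum()))
```

Working in the log domain matters here because a 49-dimensional patch can have a density far below floating-point range in any single cluster.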

Online Quality Prediction
With the constructed GMMs (P N and P A ), the quality of testing SCIs could be predicted online with the following two steps: feature generation and quality regression. For feature generation, because the metric method will directly affect the accuracy of the quality prediction for each distortion beside the anchor points, the variance difference was selected as the target metric method through theoretical and empirical analysis in this study, considering the characteristics of the SCIs. Subsequently, SVR was performed to calculate the final quality score based on the combined statistical differences.

Feature Generation
In this subsection, we follow the line of HOSA to aggregate the statistical distances between the local features and clusters of the target dual-anchor statistical models (i.e., the two GMMs). Meanwhile, to tackle HOSA's deficiency for SCIs, the a priori information of dual-anchor GMMs was extra extracted and used in feature generation, and only the second-order statistical differences were calculated as the quality-aware features to benefit the balance of complexity and effectiveness.
Here, the target dual-anchor statistical model consists of two GMMs ($P_N$ and $P_A$), and both GMMs are processed in the same way. For each local feature $x_i$ of the test SCI, the r nearest clusters rNN($x_i$) are selected by Euclidean distance. Soft assignment with kernel similarity weights alleviates the uncertainty and plausibility problems in the cluster selection of the GMM without introducing a large quantization error. In this paper, r is set to five based on the authors' experience.
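The soft assignment described above can be sketched as follows. The code assumes a Gaussian kernel of the form exp(-b * squared distance) and normalizes the weights per cluster; the bandwidth `bandwidth` is a hypothetical parameter whose value is not given in the text, and the function name `soft_assign` is likewise an assumption.

```python
import numpy as np

def soft_assign(features, centers, r=5, bandwidth=0.05):
    """Gaussian-kernel soft-assignment weights over the r nearest clusters.

    features: (N, D) local features, centers: (K, D) cluster means.
    Returns w of shape (N, K); each column that receives any feature sums to one.
    """
    features = np.asarray(features, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance between every feature and every cluster centre.
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Keep only the r nearest clusters of each feature.
    nearest = np.argsort(d2, axis=1)[:, :r]
    mask = np.zeros(d2.shape, dtype=bool)
    np.put_along_axis(mask, nearest, True, axis=1)
    w = np.where(mask, np.exp(-bandwidth * d2), 0.0)
    # Normalize per cluster so the weights assigned to each cluster sum to one.
    col = w.sum(axis=0, keepdims=True)
    return np.divide(w, col, out=np.zeros_like(w), where=col > 0)
```

Assigning each feature to several nearby clusters rather than only the nearest one is what keeps the quantization error small when a feature lies between clusters.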
Then, statistical distances of different orders were calculated with each prior to further measure the degradation degree of the distorted image. These distances are the residuals between the soft-weighted mean, variance, skewness, and kurtosis of the local features assigned to cluster k and those of cluster k in the constructed GMM $P_N$ (or $P_A$). Here, $\hat{\mu}_k^d$ and $\mu_k^d$ are the means of the dth dimension in cluster k for the local features and the target GMM $P_N$ (or $P_A$), respectively, $\rho_k^d$ is the prior of each feature in the GMM, the superscript d denotes the dth dimension of a vector, and $\omega_{ik}$ denotes the Gaussian kernel similarity weight between local feature $x_i$ and cluster k. The sum of the weights for each cluster is one. Likewise, $(\hat{\sigma}^2)_k^d$ and $(\sigma^2)_k^d$ are the variances of the dth dimension in cluster k for the local features and the target GMM $P_N$ (or $P_A$), respectively. Similarly, $\hat{\gamma}_k^d$ and $\gamma_k^d$ are the skewness of the dth dimension, and $\hat{\kappa}_k^d$ and $\kappa_k^d$ are the kurtosis of the dth dimension. Each statistical distance of a different order can characterize diverse image features. However, only the second-order statistical distances (i.e., variance differences) are employed to predict image quality in this study for the following reasons. For natural images, HOSA, which considers the mean, variance, and skewness, has demonstrated highly competitive performance with high-frequency information such as texture and details. Compared with natural images, SCIs generally have rich, complex artificial parts and fewer, simpler brightness or color variations and structures. In image processing, the variance, which can characterize the texture and edge properties of scenes, has been widely investigated and exhibits excellent comprehensive performance [40].
Considering that the image statistics aggregation method can describe the approximate location of an image's local features in each cluster, and each distortion pattern is characterized by a different kind of cluster, the image quality will be more dramatically varied as the strength of the relative relationship increases. To avoid excessive complexity, it is intuitively obvious that the variance is an effective indicator of statistical characteristics for SCIs with larger artificial portions. The experimental results in the next section further validate the analysis compared with some combinations of different orders.
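A variance-difference feature of this kind can be sketched as below. Two things are assumed rather than taken from the text: the soft-weighted variance follows the HOSA-style definition (weighted second moment minus squared weighted mean, with per-cluster weights summing to one), and the cluster prior enters as a simple multiplicative weight on the residual. The function name `variance_difference` is hypothetical.

```python
import numpy as np

def variance_difference(X, w, gmm_var, prior):
    """Prior-weighted residual between soft-weighted and model variances.

    X: (N, D) local features; w: (N, K) soft weights whose columns sum to one;
    gmm_var: (K, D) model variances; prior: (K,) cluster priors.
    Assumption: the prior scales the residual multiplicatively.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    mu_hat = w.T @ X                      # soft-weighted mean per cluster, (K, D)
    second = w.T @ (X ** 2)               # soft-weighted second moment, (K, D)
    var_hat = second - mu_hat ** 2        # valid because each column of w sums to one
    residual = var_hat - np.asarray(gmm_var, dtype=float)
    return np.asarray(prior, dtype=float)[:, None] * residual
```

With K clusters and D-dimensional patches, the result has K × D entries per anchor, which matches the dimensionality of one subfeature.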
More specifically, we denote the second-order statistical differences with GMMs $P_N$ and $P_A$ as $v_k^N$ and $v_k^A$, respectively. Then, both second-order statistical differences are concatenated into a single long quality-aware feature:

$$F = [v_k^N, v_k^A], \quad k = 1, \ldots, K.$$

Furthermore, SCIs share some similar contents and similar subjective opinion scores, and these similarities increase image feature similarity, severely decrease the contribution of other important dimensions, and reduce overall feature effectiveness. Hence, elementwise signed power normalization was adopted on the aggregated features to alleviate the corruption caused by these similarities [44]. Specifically, each second-order local feature $f$ is normalized as follows:

$$\hat{f} = \mathrm{sign}(f)\,|f|^{\lambda}$$

where $\lambda$ is the parameter that controls the inhibition degree on the frequent components, which was set to 0.2 in this study. Finally, the entire set of quality-aware features used for quality regression can be denoted by

$$\hat{F} = [\hat{V}_N, \hat{V}_A]$$

where $\hat{V}_N$ and $\hat{V}_A$ are the normalized second-order subfeatures with the GMMs $P_N$ and $P_A$, respectively.
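The concatenation and signed power normalization can be sketched in a few lines. The function names are hypothetical, and the flattening order of the two subfeatures is an assumption of this example.

```python
import numpy as np

def signed_power_normalize(f, lam=0.2):
    """Elementwise sign(f) * |f|**lam, which compresses large magnitudes."""
    f = np.asarray(f, dtype=float)
    return np.sign(f) * np.abs(f) ** lam

def dual_anchor_feature(v_natural, v_artificial, lam=0.2):
    """Normalize and concatenate the two (K, D) variance-difference arrays."""
    parts = [np.asarray(v, dtype=float).ravel() for v in (v_natural, v_artificial)]
    return signed_power_normalize(np.concatenate(parts), lam)
```

With an exponent below one, entries with large magnitude are shrunk far more than small ones, so a few dominant dimensions no longer swamp the rest of the vector.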

Quality Regression
After feature generation, SVR was employed to learn a mapping function from normalized features to subjective quality scores for training SCIs [45]. Then, the quality score of the test SCI can be predicted with the pretrained regression model in the testing stage. Here, SVR with a radial basis function kernel was adopted by using the LIBSVM package with the default parameters [46].
In this study, the patch size was set to 7 × 7 (D = 49), and the cluster number K was set to 100 based on the authors' experience, so that the quality-aware representation has D × K = 4900 dimensions for each subfeature (i.e., $\hat{V}$) and D × K × 2 = 9800 dimensions in total (i.e., $\hat{F}$) for each test SCI. The practical effect of each feature vector is examined in detail in the next section.

Experimental Protocol
In this section, thorough experiments are conducted to demonstrate the effectiveness of the proposed method with three public SCI databases: the screen content image quality assessment database (SIQAD) [8], screen content database (SCD) [47], and screen content image database (SCID) [48]. A brief introduction of these datasets is shown in Table 1. Specifically, in digital images, Gaussian noise (GN) mainly originates from poor lighting or sensor noise during acquisition, Gaussian blur (GB) results from an image blur filter that uses a normal distribution to calculate the transformation of each pixel, motion blur (MB) is the apparent blurring of dragging traces caused by fast-moving objects, contrast change (CC) easily causes brightness and saturation distortion, JPEG and J2K represent the distortions caused by JPEG and JPEG2000 encoding, and HEVC denotes the distortion introduced by HEVC encoding. The "type" and "level" in the table indicate the distortion category and distortion level, respectively.
Meanwhile, Pearson's linear correlation coefficient (PLCC), Spearman's rank order correlation coefficient (SRCC), and the root mean squared error (RMSE) are employed to compare the subjective and objective ratings, which estimate the prediction accuracy, monotonicity, and consistency, respectively. Higher values for the SRCC and PLCC and a lower value for the RMSE indicate a better quality prediction metric. In addition, a five-parameter nonlinear logistic function was employed to nonlinearly regress the quality ratings into a common range as follows [49]:

$$f(x) = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + \exp\left(\beta_2 (x - \beta_3)\right)} \right) + \beta_4 x + \beta_5$$

where $\beta_i$, $i \in \{1, \ldots, 5\}$, are the parameters to be fitted and $x$ and $f(x)$ denote the raw predicted score and the corresponding mapped score, respectively. Additionally, each database was randomly divided into training and testing subsets 1000 times, with 80% as the training dataset and the remainder as the testing dataset, and the median result was adopted as the final performance.
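The evaluation criteria can be written out in a short NumPy sketch. It covers the logistic mapping and the three criteria only; fitting the five parameters (for example with a nonlinear least-squares routine) is left out, and the simple rank computation in `srcc` assumes there are no tied scores. All function names are hypothetical.

```python
import numpy as np

def logistic5(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping from raw predictions to the subjective scale."""
    x = np.asarray(x, dtype=float)
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def plcc(a, b):
    """Pearson linear correlation coefficient."""
    return float(np.corrcoef(a, b)[0, 1])

def srcc(a, b):
    """Spearman rank correlation; this simple ranking assumes no tied values."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return plcc(rank(a), rank(b))

def rmse(a, b):
    """Root mean squared error between mapped predictions and subjective scores."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

In a typical protocol, PLCC and RMSE are computed after the logistic mapping, whereas SRCC depends only on rank order and is unaffected by any monotonic mapping.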

Performance Comparison on the Overall Database
Here, we compare the proposed DAML with the following state-of-the-art FR-IQA and NR-IQA methods. Specifically, the FR methods include five classic methods built for natural images (PSNR, SSIM [50], FSIM [51], VSI [52], and VIF [53]) and five top methods built for SCIs (SVQI [54], SQE [55], EFGD [56], SRCNN [57], and QODCNN [23]). The NR methods include 10 feature-inspired methods (SIQE [11], OSM [13], NRLT [14], HRFF [15], PQSC [16], TFSR [17], LGFL [18], CLGF [20], CSC [9], and MTD [21]) and 5 neural network-based methods (PICNN [22], IGMCNN [24], SIQA-DF [25], MtDl [26], and ABPNN [27]). Note that, for fairness, the results were cited from the literature, except for those of the classic methods, and "/" indicates that a value is not available in the following tables. Table 2 shows the experimental results of the FR methods on the SIQAD, SCD, and SCID, where the top three results in each case are highlighted in boldface. From this table, we can make the following observations. First, the classic FR methods for natural images could not be directly transferred to SCIs because they do not consider the peculiar perceptual properties of SCIs. Second, the performance of the top FR-IQA methods for SCIs was markedly better because their targeted features or network structures were constructed for specific distortions in the SCI databases, and the original reference could also provide more accurate and reliable feature information. However, these methods are also strongly limited, because the reference is difficult or impossible to obtain in most cases. In contrast, we resorted to metric learning to extract the discriminative statistical features of SCIs and achieved results comparable to those of the top FR methods for SCIs. Table 3 shows the experimental results of the NR methods on the SIQAD, SCD, and SCID, in which the results were primarily concentrated in the SIQAD and SCID in terms of test images and distortion types.
From this table, we can see that most feature-inspired NR-IQA methods, such as SIQE, OSM, NRLT, HRFF, TFSR, LGFL, and CLGF, exhibited worse performance than the FR-IQA methods above. In addition, the gap between the proposed method and the two strong algorithms PQSC and MTD was small. The results of the proposed algorithm were very close to those of CSC and MTD on the SIQAD and SCID databases, respectively, and its indicators on the SCD database were even better. The main reason for the weaker performance of most feature-inspired methods is that, limited by the research progress of visual perception and the attention preference of designers, these handcrafted features show excessive subjectivity and independence from each other, which makes it difficult to accurately characterize and measure the intrinsic quality variations of SCIs when reference information is lacking. Due to the diversity of SCI content, it is necessary to explore a more unified and complete theoretical system to reduce the loss of important information and the strong subjective preference for particular distortion types. Neural network-based methods, such as SIQA-DF and MtDl, showed performance comparable to that of the FR methods because their high-level features are more adaptable to complex and specific tasks; however, these features lack intuition and interpretability, and the characteristics of neural networks can easily lead to overfitting. In addition, Table 3 shows that the proposed method can effectively describe the distribution characteristics of SCIs by constructing a distance function to measure the degree of similarity or difference with respect to two available uncorrelated statistical models. Finally, the proposed method achieved excellent PLCC performance compared with the feature-inspired methods and obtained competitive performance that was comparable to that of the neural network-based methods.

Performance Comparison of the Individual Distortion Type
To verify the performance for individual distortion types, we compared the proposed DAML with other state-of-the-art methods on three SCI databases. Specifically, Tables 4-6 show the experimental results of the PLCC, SRCC, and RMSE, respectively, and the top three metrics are highlighted in boldface. Note that the variances were calculated to describe the magnitude of the fluctuation across distortion types, and a lower value indicates better prediction consistency. From these tables, it is clear that most existing methods showed marked preferences for specific distortion types, particularly TFSR, LGFL, and CLGF. For example, CLGF handled the GB distortion with a PLCC of 0.9082, but its PLCC was only 0.5575 for the LSC distortion. Similarly, LGFL handled the GB distortion with an SRCC of 0.8940, but its SRCC was only 0.4870 for the CC distortion. The primary reason for this result is that the quality-aware features extracted by existing methods are subjective, independent, and limited by visual mechanisms and subjective knowledge. Thus, they merely reflect the quality degradation characteristics of some parts and cannot authentically and effectively describe the essence of real-world distortions for SCIs. In contrast, the proposed method combines metric learning and probability distribution to construct the discriminative statistical feature, identify complex distortions, and predict SCI image quality from a global perspective. Thus, the proposed method exhibited better generalization performance across different distortion types, producing variances that were orders of magnitude lower than those of other methods, as shown in Tables 4-6. In particular, the proposed model handled most distortion types (i.e., GN, CC, JPEG, J2K, and LSC) more accurately and remained competitive on the other types (i.e., GB and MB).
Additionally, Figure 4 presents similar results on the SCD and SCID for different distortion types. Thus, these results suggest that the proposed DAML can more precisely and steadily describe various degradations from the perspective of the statistical characteristics and distributions of SCIs, which further verifies the effectiveness and robustness of the proposed method. The data and results are shown in Tables 4-6; because SIQAD is a commonly used dataset, its detailed evaluation data are listed. SCD mainly tests coding distortion, so relatively few data are given for it, while SCID is a relatively large dataset. The "/" in the tables indicates that the corresponding article did not report detailed results and that no code was available to reproduce and calculate the relevant indicators.
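The fluctuation measure mentioned above (the variance of a criterion across distortion types) can be sketched as follows. This is a hedged illustration: the source does not state whether the population or the sample variance was used, so the population variance is assumed here, and the function name `fluctuation_variance` is hypothetical.

```python
import statistics


def fluctuation_variance(per_type_scores):
    """Population variance of one criterion across distortion types.

    per_type_scores maps a distortion type (e.g., "GB") to the PLCC, SRCC,
    or RMSE obtained for that type. Using the population variance is an
    assumption of this sketch.
    """
    return statistics.pvariance(list(per_type_scores.values()))


# Call shown with only the two CLGF PLCC values quoted in the text; a full
# table row would contain one entry per distortion type.
clgf_partial = {"GB": 0.9082, "LSC": 0.5575}
spread = fluctuation_variance(clgf_partial)
```

A method whose scores are nearly equal across distortion types yields a value close to zero, which is the behavior interpreted above as better prediction consistency.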

Cross-Database Validation
In this subsection, cross-database validation is conducted to verify the generalizability of the proposed DAML. Because SIQAD and SCID were the representative and largest databases, respectively, and both of them contained six distortion types (GN, GB, MB, CC, JPEG, and J2K), each database was used in turn as the training database while the other served as the testing database. Similar to the practice of Mittal et al. [39] and Ye et al. [58], the DAML was trained on one database with these six distortion types, and the other was used to test the performance of the trained model. Meanwhile, the median performance is reported in this paper. Note that all samples of both databases were adopted for model training and testing, which could reduce dependence on the scale of the database and further verify the generalizability of the proposed method [21]. Table 7 shows the cross-database results for each type of distortion, in which (a) means that the model was trained with SIQAD and tested with SCID, and (b) means the opposite. From this table, we can make the following observations. First, the two cross-database performances were similar to each other, which indicates that the proposed model generalized well regardless of database size and complexity. Second, the cross-database performance was marginally worse than the in-database performance, which is also a common problem for existing methods. The primary reason for this result is that the variable image compositions and distortion intensities of SCIs lead to different fusion rules, which generate complex degradation mechanisms and statistical properties for each SCI database and in turn degrade the performance of each method.
Third, although the cross-database performance of the proposed model decreased, it remained satisfactory and stable for most distortion types, being competitive with the FR methods in Table 2 and better than most of the feature-inspired NR methods in Table 3. Note that the performance on the J2K type was lower than that on the other distortion types because J2K is a complex composite compression distortion.
In addition, the cross-database performance of the proposed method was marginally worse than that of the neural network-based methods listed in Table 3 but is still noteworthy considering its interpretability. Thus, the cross-database results demonstrate that the proposed method achieved good prediction accuracy, stability, and generalization.

Ablation Study
To further verify the effectiveness of the proposed DAML, comparative experiments were conducted on three SCI databases to examine the factors that influence its performance. More specifically, these factors primarily include the anchor type, K value, and feature type. Among them, the anchor type and K value are defined based on the anchor location and model learning during offline model training, respectively, and the feature type is defined during feature generation of online quality prediction. In this study, the sensitivity of each factor is discussed with different settings, and then comparative experiments are performed to validate the influence of the parameter setting.
For metric learning, the type and number of anchor points are the most important factors to be considered first. Because the deficiency of a single anchor point was illustrated in detail in Section 2.1, it is not repeated here, and two anchors were used as the default in this study. Considering the anchor type, a naive idea is to use two unrelated image types as the positive and negative anchor points to characterize extreme content characteristics of SCIs in two opposite directions. Intuitively, there are two appropriate anchor types for SCIs, defined in terms of distortion intensity and content composition. For distortion intensity, the reference images and distorted images can be adopted as the target anchors, which can directly describe the condition of quality distortion. For content composition, natural images and document images are suitable choices to directly describe the characteristics of the content composition in SCIs. Table 8 shows the comparison of the prediction performances with different anchor types. Performances were markedly improved with both anchor types, which confirms the feasibility of the dual-anchor strategy. Meanwhile, the performance of the content composition was marginally better than that of the distortion intensity. The primary reason for this result is that distortion intensities exist for both natural images and SCIs, whereas the most distinctive aspect of SCIs compared with natural images lies in the arbitrary composition and random combination of different contents. With the content composition as the anchor type, model learning becomes the next bottleneck of performance improvement for metric learning. Considering the characteristics of the content composition, the GMM was adopted as the target model in this study because it can handle a set that contains multiple different distributions.
However, for the GMM, the value of K, which denotes the number of clusters, directly influenced the trade-off between performance and complexity. Table 9 shows the comparison of the prediction performances with different values of K. The performances differed only marginally across the K values, and thus we set K equal to 100 as the default according to the actual results in Table 9. For online quality prediction, the selection of the feature type is a critical step for feature generation and directly affects the efficiency of quality regression. In this study, we conducted experiments with different feature type combinations on three SCI databases and compared the results with HOSA on the SIQAD, which are shown in Tables 10 and 11, respectively. Note that the feature types used in this study include the first-order (mean), second-order (variance), third-order (skewness), and fourth-order (kurtosis) statistics, as well as their combinations. In the two tables, "M.", "V.", "S.", and "K." denote the abbreviations for the mean, variance, skewness, and kurtosis statistics, respectively. Table 11 shows that all feature types were sensitive to image degradation to some extent, but the sensitivity of each type was different. In particular, after the optimization of the dual-anchor strategy, the performance of a single feature type (i.e., variance) was better than that of the feature combinations, which could effectively enhance the efficiency of quality regression.
Meanwhile, compared with HOSA, which was built for natural images, the proposed method achieved better performance on the SIQAD for the following reasons: (1) the dual-anchor strategy makes the quality mapping of metric learning more robust to the varied content and distortions of SCIs; (2) the GMM can theoretically fit any type of distribution, which is particularly suitable for SCIs that contain multiple different distributions; and (3) the introduction of a priori information can further uncover hidden and meaningful knowledge from limited data.
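For reference, the four statistic orders compared in Tables 10 and 11 can be computed for a single set of values as in the following sketch. It uses population central moments with standardized skewness and kurtosis; these normalization choices and the function name `order_statistics` are assumptions, and the sketch shows only the plain statistics, not the soft-aggregated differences to the anchor clusters used by the method.

```python
import math


def order_statistics(values):
    """Mean, variance, skewness, and kurtosis of a list of numbers.

    Population central moments are used, and skewness and kurtosis are
    standardized by the standard deviation (assumptions of this sketch).
    The keys follow the abbreviations used in Tables 10 and 11.
    """
    n = len(values)
    mean = sum(values) / n

    def central(p):
        # p-th central moment about the mean
        return sum((v - mean) ** p for v in values) / n

    var = central(2)
    std = math.sqrt(var)
    skew = central(3) / std ** 3 if std > 0.0 else 0.0
    kurt = central(4) / var ** 2 if var > 0.0 else 0.0
    return {"M.": mean, "V.": var, "S.": skew, "K.": kurt}
```

Keeping only the "V." entry corresponds to the variance-only setting that performed best in the ablation, and it reduces the feature length to a quarter of that of the full four-statistic combination.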

Conclusions
This paper presented a dual-anchor metric learning method for the blind image quality assessment of screen content images (SCIs). Inspired by metric learning, the statistical distances between the local features and the clusters of the target dual-anchor model were used to represent the statistical features and then to predict the quality of distorted SCIs. The target dual-anchor statistical model consisted of two Gaussian mixed models generated from unrelated pristine databases to avoid dependence on specific distortion types. The high-order statistical differences were further optimized, which enhanced the effectiveness of quality-aware feature extraction. On three public SCI databases, the experimental results verified the prediction accuracy and generalizability of the proposed method for individual distortion types compared with state-of-the-art blind image quality assessment methods for SCIs.