1. Introduction
The perceptual quality assessment of visual media has drawn considerable attention in recent years, owing to the millions of images and videos captured and shared daily on social media websites such as Facebook, Twitter and Instagram. Large-scale video streaming services such as YouTube, Netflix and Hulu contribute heavily to Internet traffic, which continues to expand rapidly as consumer demand for content increases. Reliable assessment of picture quality by large groups of human subjects is an inconvenient, time-consuming task that is very difficult to organize at scale. Thus, objective no-reference (NR) image quality assessment (IQA) models, which do not require any additional information beyond the input image, are often deployed in such settings to predict visual quality automatically and accurately as perceived by an average human subject. These models have also been successfully used to perceptually optimize the image capture process to improve the quality of the acquired visual signals. In addition, “quality-aware” perceptual strategies are used to compress visual media to deliver high-quality content to consumers over constrained network bandwidths [
1].
Many NR IQA algorithms have been proposed recently, which, for increased clarity, we broadly classify into three categories. (1) Distortion-specific approaches include algorithms that predict the quality of images afflicted by one or more known distortion types such as blockiness [
2], ringing [
3] and blur [
4,
5] artifacts. These models are difficult to generalize to other distortion types. (2) Purely data-driven approaches involve the extraction of low-level image features such as color and texture statistics [
6], which are then mapped to subjective image quality scores using regression. More recently, deep learners have been trained to learn large sets of low-level image features, which are then used to feed classical regressors that map the features to subjective quality space [
7]. The general framework of convolutional neural network-based IQA models involves feeding a pre-processed patch to convolutional layers, which are often followed by pooling layers. The learned features are then fed to a combination of fully-connected layers followed by non-linear activation and dropout layers [
8,
9]. (3) Natural scene statistics (NSS)-based approaches leverage statistical models of natural images and quantify the severity of distortion by measuring the degree of “unnaturalness” caused by the presence of distortions. The perceptual image quality is measured as a distance of the distorted image from the subspace of natural images [
10,
11,
12,
13].
A number of techniques have been devised for general-purpose NR IQA. The generalized Renyi entropy and the normalized pseudo-Wigner distribution have been used to measure the directionality (anisotropy) of the variance of expected image entropy as a predictor of image quality [
14]. NSS-based models have been designed to extract quality-aware features under natural image models in the wavelet [
13], spatial [
12] and discrete cosine transform (DCT) domains [
15], achieving high correlations with human opinion scores.
The divisive normalization transform (DNT), which is used to model the nonlinear response properties of sensory neurons, forms an integral component in the density estimation of natural images [
16]. A commonly-used parametric form of DNT is:

y_i = x_i / (b + Σ_j w_j x_j^2)^(1/2),   (1)

where x denotes a natural image signal that has been processed with a linear bandpass filter and (b, {w_j}) are parameters that can be optimized on an ensemble of natural image data. As shown in [
17], when bandpass natural images are subjected to DNT with optimized parameters, they become Gaussianized with reduced spatial dependencies. The underlying Gaussian scale mixture (GSM) [
18] model of the marginal and joint statistics of natural (photographic) image wavelet coefficients also implies a similar normalization (division by the square root of the local mixing multiplier) of neighboring coefficients. In our recent work, we developed a generalized Gaussian scale mixture (GGSM) model of the wavelet coefficients of photographic images, including distorted ones [
19]. This new model factors a local cluster of wavelet coefficients into a product of a generalized Gaussian vector and a positive mixing multiplier. The GGSM model supports the hypothesis that the normalized wavelet-filtered coefficients of distorted images follow a generalized Gaussian behavior, devolving into a Gaussian when distortion is not present. A related approach was adopted in [
20], where a finite generalized Gaussian mixture model (GGMM) was used as a prior when modeling image patches in an image restoration task.
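To make the gain-control behavior of divisive normalization concrete, the following is a minimal numpy sketch of the parametric DNT discussed above, applied to a single bandpass coefficient and its pooled neighborhood. The function and argument names are ours, and the parameter values are illustrative only; in practice b and the weights w_j would be optimized over an ensemble of natural image data, as the text notes.

```python
import numpy as np

def dnt(center, neighborhood, weights, b=0.1):
    """Divisively normalize the center bandpass coefficient by an
    energy measure pooled over its neighborhood:
        y = x_c / sqrt(b + sum_j w_j * x_j**2).
    The constant b and the weights w_j are the free parameters of
    the transform."""
    neighborhood = np.asarray(neighborhood, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return center / np.sqrt(b + np.sum(weights * neighborhood ** 2))
```

Note that with b = 0 the response is invariant to a global rescaling of the input, which is the local gain-control property that makes the transform useful for contrast normalization.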
Here, we build on the above ideas and propose a generalized Gaussian-based local contrast estimator, which we use in conjunction with a multivariate density estimator to extract perceptual quality-rich features in the spatial domain.
NSS models have been well studied on an increasing variety of natural imaging modalities, including visible light (VL), long wavelength infrared (LWIR) [
21], fused VL and LWIR [
22] and X-ray images [
23]. This kind of statistical modeling of diverse imaging modalities has led to the development of new and interesting applications and is of significance to the design of visual interpretation algorithms. In a like vein, here we explore the effectiveness and versatility of multivariate generalized contrast normalization (MVGCN) by deploying it in applications arising in two different imaging modalities: blind quality assessment (QA) of VL images and the prediction of technician detection-task performance on distorted X-ray images.
The rest of the paper is organized as follows. In
Section 2, we describe the generalized contrast normalization technique, which forms the core of MVGCN. We detail the multivariate statistical image model in
Section 3 and analyze the effects of distortions on the estimated parameters of the multivariate model.
Section 4 describes the first application, where the MVGCN model is used to drive an NR IQA algorithm. The second application is explained in
Section 5, whereby MVGCN features are used to predict the detection-task performance of trained bomb experts on X-ray images. Finally,
Section 6 concludes the paper with possible ideas for future work.
2. Generalized Contrast Normalization
It is well established in the vision science and image quality literature that processing a natural scene by a linear bandpass operation followed by non-linear local contrast normalization has a decorrelating and Gaussianizing effect on the resulting image coefficients [
24,
25,
26]. This kind of processing of visual data mirrors efficient representations computed by neuronal processing that takes place along the early visual pathway. These statistical models of natural (photographic) images have been used effectively in applications ranging from low-level tasks such as image denoising [
27,
28,
29] and image restoration [
30,
31], to higher-level processes such as face recognition [
32,
33], object detection [
34,
35] and segmentation [
36,
37].
A number of NSS-based IQA algorithms [
12,
13] operate under the hypothesis that the divisively normalized bandpass responses of a pristine image follow Gaussian behavior and that the presence of distortion renders an image statistically unnatural, whereby the characteristic underlying Gaussianity is lost [
17], as depicted in
Figure 1, where Gaussianity is a poor fit to the distribution of bandpass, divisively normalized coefficients of a JPEG2000 (JP2K)-compressed image. Here, we propose a way of collectively modeling both pristine and distorted images, using a generalized contrast normalization approach that is based on the premise that the divisively normalized bandpass coefficients of both distorted and undistorted images follow a generalized Gaussian distribution. We refer to the processed coefficients as mean subtracted generalized contrast normalized (MSGCN) coefficients.
Given an M × N grayscale image of intensity I(i, j), the MSGCN coefficients Î(i, j) are computed as:

Î(i, j) = (I(i, j) − μ(i, j)) / (σ(i, j) + C),   (2)

where μ(i, j) and σ(i, j) are the local weighted mean (the maximum likelihood estimate of the mean of a generalized Gaussian density is given by the minimizer of the weighted sum of |I − μ|^s over the window; optimizing over each block of size K × L of an image is computationally expensive; thus, we instead use the sample mean of the coefficients given by (
3), as used in [
12].):

μ(i, j) = Σ_{k=−K..K} Σ_{l=−L..L} w_{k,l} I(i + k, j + l)   (3)

and local contrast fields defined as:

σ(i, j) = [Σ_{k=−K..K} Σ_{l=−L..L} w_{k,l} |I(i + k, j + l) − μ(i, j)|^s]^(1/s),   (4)

where i ∈ {1, 2, …, M}, j ∈ {1, 2, …, N} are spatial indices and w = {w_{k,l}} is a 2D isotropic Gaussian kernel normalized to unit volume with K = L = 3 and truncated to three standard deviations.
C in (2) and a similar stabilizing constant inside (4) are small positive constants used to prevent instabilities. The shape parameter s is estimated using the popular moment-matching technique detailed in [
38]. The generalized Gaussian corresponds to a Gaussian density function when s = 2 and a Laplacian density function when s = 1. MSGCN coefficients behave in a similar manner against different distortion types as do mean subtracted contrast normalized (MSCN) coefficients that are generated under the Gaussian model assumption (s = 2) [
12]. Distortions such as white noise tend to increase the variance of MSGCN coefficients, while distortions such as compression and blur, which increase correlations, tend to reduce variance. The MSGCN model is more generic than the MSCN model and provides an elegant approach to study the statistics of distorted images.
3. Multivariate Image Statistics
In this section, we use the aforementioned MSGCN coefficients to develop a multivariate NSS model and a way to extract quality-rich features. The generalized contrast normalization (GCN) transform is a form of local gain control mechanism that accounts for the non-linear properties of neurons, resulting from the pooled activity of neighboring sensory neurons [
39]. These kinds of perceptually-relevant transformations account for the contrast masking effect, which plays an important role in distortion perception [
39]. Although the GCN transform, as with other DNTs, reduces redundancies in visual data, the normalized coefficients of natural images may still exhibit dependencies in some form (depending on the image content), as depicted in
Figure 2. Distortions such as compression, upscaling and blur, which reduce the complexity of an image and induce artificial correlations, tend to affect the MSGCN coefficients in a pronounced way. Increased statistical interdependencies are observed between neighboring coefficients as distortion strength increases.
3.1. The Multivariate Generalized Gaussian Distribution
Once the MSGCN map of an input image is computed using (
2), a 5D multivariate generalized Gaussian (MVGG) distribution is used to model the joint distribution of five neighboring coefficients as illustrated in
Figure 3.
MVGG distributions have been extensively studied in the literature [
41,
42,
43,
44]. We utilize the Kotz-type distribution [
41], which is a form of zero-mean multivariate elliptical distribution defined as:

p(x) = [s Γ(d/2)] / [π^(d/2) Γ(d/(2s)) 2^(d/(2s)) |Σ|^(1/2)] exp(−(1/2) (xᵀ Σ⁻¹ x)^s),   (5)

where s is a shape parameter that determines the exponential fall-off of the distribution (the higher s, the lower the fall-off rate), Σ is the scale parameter (matrix) that controls the spread of the coefficients along different dimensions, d is the dimension of x and Γ(·) is the gamma function:

Γ(z) = ∫₀^∞ t^(z−1) e^(−t) dt,  z > 0.   (6)

The MVGG distribution becomes a multivariate Laplace distribution when s = 1/2, a multivariate Gaussian distribution when s = 1 and a multivariate uniform distribution as s → ∞.
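The Kotz-type density above is straightforward to implement and sanity check. The following sketch (function and variable names are ours) evaluates the density directly; at s = 1 it reproduces the zero-mean multivariate Gaussian density, consistent with the special cases just listed.

```python
import numpy as np
from math import gamma, pi

def kotz_mvgg_pdf(x, Sigma, s):
    """Zero-mean Kotz-type MVGG density:
    p(x) = s*Gamma(d/2) / (pi^(d/2) * Gamma(d/(2s)) * 2^(d/(2s)) * |Sigma|^(1/2))
           * exp(-0.5 * (x^T Sigma^{-1} x)^s)."""
    x = np.asarray(x, dtype=float)
    d = x.size
    quad = float(x @ np.linalg.solve(Sigma, x))  # Mahalanobis quadratic form
    norm = (s * gamma(d / 2.0)
            / (pi ** (d / 2.0) * gamma(d / (2.0 * s))
               * 2.0 ** (d / (2.0 * s)) * np.sqrt(np.linalg.det(Sigma))))
    return norm * np.exp(-0.5 * quad ** s)
```

Evaluating the tails also illustrates the role of s: for the same scale matrix, s = 1/2 places more mass far from the origin than s = 1, i.e., smaller s gives peakier, heavier-tailed fits.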
This form of MVGG distribution has also been used in a reduced-reference IQA framework [
45], in an RGB color texture model [
46] of the joint statistics of color-image wavelet coefficients, a generalized Gaussian scale mixture (GGSM) model of the conditioned density of a GGSM vector [
19] and in a no-reference IQA algorithm [
15] to model the joint empirical distribution of extracted DCT features and subjective scores, where a bivariate version of the MVGG is used. The moment-matching scheme [
41] used to estimate the shape and scale parameters of an MVGG is detailed in
Appendix A.
3.2. Analysis of the Shape Parameter of the MVGG Distribution
We next analyze how the shape parameter of the MVGG distribution varies when modeling the joint distribution of adjacent MSGCN coefficients of natural, photographic images from two widely used databases: the Waterloo exploration database [
47] and the Berkeley Segmentation Database (BSD) [
48]. The Waterloo exploration IQA database contains 4744 pristine natural images reflecting a great diversity of real-world content. The Berkeley Segmentation Database was designed to support research on image segmentation and contains 300 training images and 200 test images. In our analysis, we only used ostensibly pristine images to generate MSGCN response maps, toward modeling a five-dimensional joint empirical distribution of neighboring MSGCN coefficients using an MVGG density.
Figure 4b plots a histogram of the estimated shape parameter values of the MVGG model. The shape parameter peaked at around the same value (s ≈ 1) on both databases, suggesting that the joint distribution of MSGCN coefficients of the pristine images may be reliably modeled as a multivariate Gaussian. This outcome may be viewed as a multivariate extension of the well-established Gaussian property of univariate normalized bandpass coefficients [
18,
24,
25,
26]. There are, however, a few samples within the studied collection of natural images where the estimated shape parameter deviated from s = 1. For example, a few images from the Waterloo exploration database, e.g., those shown in
Figure 4a, contain predominantly flat, highly correlated regions that yielded peakier MVGG fits where s < 1. Cloudless sky regions (upper left of
Figure 4a) are bereft of any objects and cause this effect. The lower two images of
Figure 4a have large saturated over/under-exposed areas and may be viewed as substantially distorted. Overall, undistorted non-sky images of this type are rare. Conversely, the images shown in
Figure 4c are each almost entirely comprised of heavily textured regions, with less peaky fits (s > 1). These kinds of images are also unusual.
3.3. Effect of Distortions on the Shape Parameter
Having established the relevance of the shape parameter of the MVGG and values it assumes on pristine images, we next examine how it behaves in the presence of distortions. In this experiment, we degraded 1000 pristine images from the Waterloo exploration database using three common distortions: JPEG compression, Gaussian blur and additive white Gaussian noise (AWGN), each applied at ten different levels. We then followed a similar modeling procedure as that described in the previous subsection: we fit the 5D empirical joint distribution of MSGCN coefficients of the distorted images with an MVGG distribution.
Figure 5 depicts the way the shape parameter characteristically varies in the presence of the different degradation types and levels. Gaussian blur (
Figure 5a) and JPEG (
Figure 5c) degradations lead to peaky, heavy-tailed MVGG fits and reduced values of
s. This effect becomes more pronounced with increasing distortion strength. Conversely, AWGN (
Figure 5b) degradations increase the randomness and entropy of an image, leading to larger values of
s.
The presence of some degradations causes the distributions of distorted MSGCN coefficients to deviate from multivariate Gaussian behavior. To better understand this effect, we computed the Kullback–Leibler (KL) divergences between the empirical bivariate joint distribution of vertically adjacent MSGCN coefficients and its multivariate Gaussian fit, which are shown in
Figure 6. Computing the KL divergence between an empirical 5D joint distribution and its 5D Gaussian fit is computationally expensive for a large sample size; thus, we instead computed it for the bivariate joint distribution of immediate vertical neighbors. As shown in
Figure 6b, increases of the AWGN standard deviation produced a slight decrease in the KL divergence, indicating that the joint distribution of the MSGCN coefficients becomes more similar to Gaussian, which is not unexpected given that the AWGN is Gaussian. Degradations such as blur and JPEG compression, which result in peakier MVGG fits, caused larger KL divergences, which increase with increasing distortion levels.
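The KL computation just described can be sketched as follows. This is our illustrative reimplementation (function and parameter names are ours): histogram the vertically adjacent coefficient pairs, fit a zero-mean Gaussian via the sample covariance (MSGCN coefficients are approximately zero-mean), discretize the fit on the same grid, and accumulate p·log(p/q) over occupied bins.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_empirical_vs_gaussian_fit(pairs, bins=64):
    """KL divergence D(empirical || Gaussian fit) for 2D coefficient pairs."""
    pairs = np.asarray(pairs, dtype=float)
    hist, xedges, yedges = np.histogram2d(pairs[:, 0], pairs[:, 1], bins=bins)
    p = hist / hist.sum()
    # evaluate the Gaussian fit at bin centers, renormalized over the grid
    xc = 0.5 * (xedges[:-1] + xedges[1:])
    yc = 0.5 * (yedges[:-1] + yedges[1:])
    gx, gy = np.meshgrid(xc, yc, indexing="ij")
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
    cov = np.cov(pairs, rowvar=False)
    q = multivariate_normal(mean=[0.0, 0.0], cov=cov).pdf(grid).reshape(p.shape)
    q = q / q.sum()
    mask = (p > 0) & (q > 0)
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

Consistent with the analysis above, pairs drawn from a heavier-tailed (e.g., Laplacian-like) distribution yield a larger divergence from their Gaussian fit than Gaussian pairs do.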
3.4. Feature Extraction
Given that the MVGG model can be used to characterize distorted image statistical behavior well, we can build feature-driven image quality prediction tools. As a first set of “quality-aware” features, we compute the estimated shape parameter s and the five eigenvalues of the estimated covariance (scale) matrix of the MVGG distribution. The premise behind the choice of these features is that the joint distribution of neighboring MSGCN coefficients of pristine images follows a multivariate Gaussian distribution, but the presence of distortion causes deviations from Gaussianity. Since each distortion affects the coefficient distributions in a characteristic manner, it is possible to predict the type and perceptual severity of distortions and, hence, the perceived image quality.
As shown in
Figure 2, even after the application of the GCN transform, the MSGCN responses remain correlated on images degraded by correlation-inducing distortions such as compression and blur. Such distortions lead to more polarized eigenvalues of the estimated covariance matrix than do other distortions (such as AWGN). In order to demonstrate the effect of distortions on the eigenvalues, we use the ratio of the minimum and maximum eigenvalues (λ_min/λ_max) of the estimated scale matrix Σ from the best 2D MVGG fit to the vertically adjacent MSGCN coefficients. We also fit a 5D MVGG to the five neighboring coefficients (as shown in
Figure 3).
Figure 7 shows boxplots of the ratio λ_min/λ_max over all images from the LIVE database [
40], classified by distortion type. The pattern of variation of the eigenvalues of the estimated covariance matrix in the presence of different distortion types is indicative of the rich perceptual quality information captured by the eigenvalues.
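As a simplified sketch of this eigenvalue feature (our own helper; the text fits a 2D MVGG, whereas here the scale matrix is approximated by the sample covariance for brevity):

```python
import numpy as np

def eigenvalue_ratio(coeff_map):
    """lambda_min / lambda_max of the 2x2 covariance of vertically
    adjacent coefficient pairs of a normalized coefficient map."""
    top = coeff_map[:-1, :].ravel()
    bottom = coeff_map[1:, :].ravel()
    cov = np.cov(np.stack([top, bottom]))      # 2x2 sample covariance
    eig = np.linalg.eigvalsh(cov)              # ascending eigenvalues
    return eig[0] / eig[-1]
```

White-noise-like maps give a ratio near 1, while vertically correlated (e.g., blurred) maps polarize the eigenvalues and drive the ratio toward 0, mirroring the behavior seen in Figure 7.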
The pairwise products of adjacent MSGCN coefficients, like those of MSCN coefficients, also exhibit statistical regularities on natural, photographic images. We follow a similar modeling approach as that described in [
12] and use a zero-mode asymmetric generalized Gaussian distribution (AGGD) to fit the pairwise products along four directions, whose density is defined as [
12]:

f(x; ν, σ_l^2, σ_r^2) =
  ν / ((β_l + β_r) Γ(1/ν)) exp(−(−x/β_l)^ν),  x < 0
  ν / ((β_l + β_r) Γ(1/ν)) exp(−(x/β_r)^ν),   x ≥ 0

where:

β_l = σ_l √(Γ(1/ν) / Γ(3/ν))

and:

β_r = σ_r √(Γ(1/ν) / Γ(3/ν)).

The AGGD parameters (ν, σ_l^2, σ_r^2)
are estimated using the moment-matching technique described in [
49]. In addition to (ν, σ_l^2, σ_r^2), the AGGD mean

η = (β_r − β_l) Γ(2/ν) / Γ(1/ν)

yields a fourth quality-aware feature. Extracting these four parameters along the four orientations (H, V, D1, D2) given by:

H(i, j) = Î(i, j) Î(i, j + 1)
V(i, j) = Î(i, j) Î(i + 1, j)
D1(i, j) = Î(i, j) Î(i + 1, j + 1)
D2(i, j) = Î(i, j) Î(i + 1, j − 1),

where i and j are spatial indices, yields a total of 16 features.
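The four directional paired products can be computed with simple array slicing; a minimal sketch (our own helper, with the normalized coefficient map denoted m):

```python
import numpy as np

def paired_products(m):
    """Pairwise products of adjacent normalized coefficients along the
    four orientations: horizontal, vertical, main diagonal and
    secondary diagonal."""
    H = m[:, :-1] * m[:, 1:]        # I(i, j) * I(i, j+1)
    V = m[:-1, :] * m[1:, :]        # I(i, j) * I(i+1, j)
    D1 = m[:-1, :-1] * m[1:, 1:]    # I(i, j) * I(i+1, j+1)
    D2 = m[:-1, 1:] * m[1:, :-1]    # I(i, j) * I(i+1, j-1)
    return H, V, D1, D2
```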
In order to capture even higher-order correlations caused by complex distortions, we model the joint paired-product response map along the four directions (H, V, D1, D2) using a four-dimensional MVGG distribution. The eigenvalues of the estimated covariance matrix of the 4D MVGG density are extracted as an additional set of four quality-relevant features. Since all of the above features (6 + 16 + 4 = 26 in total) are extracted at two scales, a total of 52 perceptually-relevant quality-aware MVGCN features are computed. A brief summary of all of these features and their methods of computation is laid out in
Table 1. In subsequent sections, we study the effectiveness of the MVGCN features by applying them to multiple image quality-relevant tasks.
4. Quality Assessment of Visible Light Images
In order to demonstrate the quality-rich feature extraction capabilities of the MVGCN model, we applied its features to the blind image quality assessment task. We compared the performance of MVGCN against a number of well-known NR IQA algorithms, such as SSEQ [
50], CORNIA [
51], CNN-IQA [
8], BLIINDS [
15], NIQE [
10], BRISQUE [
12] and DIIVINE (to be consistent with other algorithms, we utilized the single step framework of DIIVINE, directly mapping the features to MOS/differential mean opinion scores (DMOS) while skipping the distortion identification stage) [
13] (all of which are publicly available) and two full reference (FR) IQA algorithms: PSNR and MS-SSIM [
52]. We conducted our experiments on four widely-used IQA databases, namely: LIVE [
40], TID08 [
53], CSIQ [
54] and LIVE in the Wild Challenge [
55]. In all of the experiments, each model was trained on 80% of the database, while the other 20% was used for testing. A support vector regressor (SVR) was used with a radial basis function (RBF) kernel to map quality features to the DMOS after determining its parameters using five-fold cross-validation on the training set. The train-test splits were carried out in a manner that ensured the training and test sets would not share reference images, so that the performances of the models would reflect their ability to learn distortions, without bias from overfitting on image content. A total of 100 such splits was performed, and the median Spearman’s rank ordered correlation coefficient (SROCC) and Pearson’s linear correlation coefficient (PLCC) computed between the predicted quality scores and the DMOS are reported in
Table 2. The overall results reported in
Table 2 were computed by first applying Fisher’s z-transformation [
56] given by:

z = (1/2) ln((1 + r) / (1 − r)),

then averaging the transformed correlation scores for each method across each database and finally applying the inverse Fisher’s z-transform.
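The aggregation step described above can be sketched in a few lines (our own helper):

```python
import numpy as np

def fisher_average(correlations):
    """Average correlation scores via Fisher's z-transform:
    z = 0.5 * ln((1 + r) / (1 - r)) = artanh(r); average the z values,
    then apply the inverse transform tanh."""
    z = np.arctanh(np.asarray(correlations, dtype=float))
    return float(np.tanh(np.mean(z)))
```

For example, averaging correlations of 0.0 and 0.8 this way yields 0.5 rather than the arithmetic mean 0.4, since the transform stretches values near ±1 before averaging.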
Learning-based algorithms that involve a training stage to learn optimal parameters are sometimes susceptible to overfitting, especially when trained and tested on the same database, due to similar modeling of distortions, similar experimental conditions and other factors. A key requirement of NR IQA algorithms is the ability to generalize well to other datasets. To demonstrate the generalization capabilities, we trained the NR IQA models on one entire database and evaluated their performance on common distortion types from other databases, including JPEG2000 (JP2K) and JPEG compression, Gaussian blur and AWGN.
Table 3 reports the database-independence performance of MVGCN, while
Table 4 compares its aggregate performance against other NR IQA models across four leading IQA databases. We used the non-parametric Wilcoxon rank-sum test to conduct the statistical significance analysis (reported in
Table 5) between different algorithms across multiple databases. As can be noted from the tables, MVGCN performed better than several leading NSS-based NR IQA algorithms and competed well against CORNIA [
51], which uses raw image patches in an unsupervised manner to learn a dictionary of local descriptors. CORNIA extracts a 20,000D feature vector and is much more computationally expensive than MVGCN, as shown in the time complexity analysis results reported in
Table 6. Although CNN-IQA performed better than other models on the CSIQ and TID08 databases, it failed to deliver comparable performance on the LIVE Challenge database, which consists of authentic real-world distortions. This raises questions about the practical applicability of such models and limits their use in real-world scenarios.
5. Predicting Detection Performance on X-Ray Images
In previous work, we studied the natural scene statistics (NSS) of X-ray images and found that the NSS modeling paradigm applies quite well to X-ray image data, although the model is somewhat different from that of visible light (VL) images [
23,
57]. There, we used a nominal set of X-ray NSS features along with standardized objective image quality indicators (IQIs) to analyze the relationship between X-ray image quality and the task performance of professional bomb technicians who were asked to detect and identify a collection of diverse potential threat objects.
To analyze the effects of image quality on task performance, we conducted a human task performance study in which professional bomb technicians were asked to detect and identify improvised explosive device (IED) components in X-ray images that we created, degraded and presented to them in an interactive viewing environment [
58]. Certain commercial equipment, instruments, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the U.S. government, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose. The degradations included spatially correlated noise, reduced spatial resolution and combinations of these. The NIST-LIVE database of ground truth judgments of bomb experts was then used to evaluate the predictive performance of the objective X-ray image quality features. More details regarding the task performance study protocols can be found in [
59].
Given that the MVGCN model provides a powerful NSS-based perceptual image quality feature extractor, we examined its performance against other NSS-based models and also against conventional IEEE/ANSI N42.55 [
60] metrics. We hypothesized that the presence of degradations would change the characteristic statistical properties of the MSGCN coefficients of X-ray images, which would allow the MVGCN model to better capture degradations and would better correlate with the outcomes of expert detection and identification tasks conducted on degraded X-ray images.
The models used for comparison are the quality inspectors of X-ray images (QUIX) model [
57], the IEEE/ANSI N42.55 standard [
60] and combinations of these. QUIX features are a set of simple and efficient NSS-based perceptual quality features that accurately predict human task performance. In [
57], QUIX considers only horizontal and vertical correlations while extracting features denoted as ‘
pp’ features. In order to be consistent and to enable a fair comparison against QUIX, we developed a reduced-feature version of MVGCN, which we refer to as MVGCN-X-ray, that does not include the products of diagonal coefficients as part of the paired-product modeling and the corresponding MVGG fits. A summary of the MVGCN-X-ray features and the feature extraction procedure is given in
Table 7.
Image quality indicators (IQIs) are a set of standard objective image quality metrics defined in IEEE/ANSI N42.55 [
60]. These IQIs are determined by analysis of images of a standard test object under test conditions. In our analysis, we used eight IQIs, including “steel penetration”, “spatial resolution”, “organic material detection”, “dynamic range”, “noise” and three other descriptive features that are extracted from the spectral distribution of the measured modulation transfer function (MTF), noise equivalent quanta (NEQ) and noise power spectrum (NPS).
Given that CORNIA is among the top performing IQA algorithms, albeit much more computationally expensive, as observed in the previous application, we compared its time complexity against MVGCN on X-ray images. CORNIA required about 50-times more time than MVGCN-X-ray did (as reported in
Table 8) to extract features from high spatial-resolution (the size of the studied X-ray images is of the order of
pixels) X-ray images.
To evaluate performance, we divided the NIST-LIVE database on the basis of component and clutter combinations. The component categories include the IED components “power source”, “detonator”, “load”, “switch” and “metal pipe”, each of which was labeled by the professional bomb technicians as found or not found in a given image. Here, we consider the task of measuring how accurately objective image quality models can predict the detection performance of the experts. We further divided each category into four clutter types: clutter (laptop), shielding (steel plate), clutter with shielding and no clutter. Clutter/shielding was added to some images to make the detection task more challenging.
We then devised a binary classification framework whereby features were mapped to a binary variable indicating whether the component was successfully identified by an expert. We used a logistic regression model to be consistent with [
57]. The data from each component-clutter category were divided into an 80% training set, used to learn the logistic function parameters, which were then used to predict on the remaining 20% test set. We used a similar performance evaluation methodology as that followed in [
57]: we generated random disjoint train-test splits and computed the median log loss and area-under-the-ROC-curve (AUC) scores over 1000 iterations (reported in
Table 9). A smaller value of log loss and a larger value of AUC indicate superior classification performance, implying better correlation with human judgments.
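One train-test split of this evaluation protocol can be sketched as follows; the feature matrix and label names are illustrative placeholders, not those of the NIST-LIVE release, and scikit-learn stands in for whatever logistic regression implementation was actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_split(features, detected, seed=0):
    """One random 80/20 split: map quality features to the binary
    detected/not-detected label with logistic regression, then score
    the held-out set with log loss and AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, detected, test_size=0.2, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    prob = clf.predict_proba(X_te)[:, 1]
    return log_loss(y_te, prob), roc_auc_score(y_te, prob)
```

Per the protocol above, this would be repeated over many random seeds and the median log loss and AUC reported.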
We also demonstrated in [
57] that QUIX features and IQIs supply complementary information, which when combined into a single predictor performed better than either of them in isolation. Under a similar premise, we augmented MVGCN-X-ray features with IQIs to obtain similar benefits in performance. As shown in
Table 9, the combination of MVGCN-X-ray with IQIs yielded better performance than any of the other features in isolation, while competing well against the combination of QUIX and IQIs. The improvement in performance of the combination can be attributed to the capture of different levels of distortion-sensitive higher-order correlations by the MVGCN-X-ray features and by complementary X-ray image quality information supplied by IQIs.
6. Conclusions
We designed a multivariate approach to NR IQA that uses generalized contrast normalization, a form of DNT that is better suited to modeling degraded image coefficients. We investigated the effects of degradations on the estimated shape parameter and on the eigenvalues of the estimated covariance matrix of the MVGG fit to the joint distribution of neighboring MSGCN coefficients. Further, we demonstrated applications of the MVGCN model to the blind QA of visible light images and to the prediction of threat object detection and identification by trained experts on degraded X-ray images, achieving near state-of-the-art performance in both applications.
There are a number of possible future directions. It is of interest to utilize the MVGCN model to design a spatio-temporal model of normalized bandpass video coefficients for video QA. The aforementioned multivariate modeling approach may also be extensible to other NSS models that utilize univariate parametric distributions of bandpass image coefficients. Furthermore, studying the statistics of other imaging modalities, such as millimeter-wave, computed tomography (CT) and multi-view X-ray images, is also a potential direction of future exploration.