In this section, we propose a full-reference quality index that can automatically assess the perceived quality of reconstructed background images. The proposed Reconstructed Background Quality Index (RBQI) uses a probability summation model to combine visual characteristics at multiple scales and quantify the deterioration in the perceived quality of the reconstructed background image due to the presence of any residual foreground objects or unnaturalness that may be introduced by the background reconstruction algorithm. The motivation for RBQI comes from the fact that the quality of a reconstructed background image depends on two factors, namely: (1) the amount of residual foreground remaining in the reconstructed background image, and (2) the unnaturalness (visible artifacts) introduced by the background reconstruction algorithm.
4.1. Structure Difference Map ($SDM$)
An image can be decomposed into three different components: luminance, contrast and structure, which are local features computed at every pixel location of the image as described in [29]. By comparing these components, the similarity between two images can be calculated [19,29]. A reconstructed background image is formed by mosaicing together parts of different input images. For such an image to appear natural, it is important that structural continuity be maintained. Preservation of the local luminance from the reference background image is of low relevance as long as this structural continuity is maintained. Any sudden variation in the local luminance across the reconstructed background image manifests itself as a contrast or structure deviation from the reference image. Thus, we consider only contrast and structure for comparing the reference and reconstructed background images, while leaving out the luminance component. These contrast and structure differences between the reference and the reconstructed background images, calculated at each pixel, give us the ‘contrast-structure similarity map’, referred to as ‘structure map’ for short in the rest of the paper.
First, the structure similarity between the reference and the reconstructed background images, referred to as the Structure Index ($SI$), is calculated at each pixel location $(x, y)$ using [29]:

$$SI(x, y) = \frac{2\,\sigma_{ri}(x, y) + C}{\sigma_r^2(x, y) + \sigma_i^2(x, y) + C}, \quad (1)$$

where $r$ is the reference background image, $i$ is the reconstructed background image, and $\sigma_{ri}(x, y)$ is the cross-correlation between image patches centered at location $(x, y)$ in the reference and reconstructed background images. $\sigma_r(x, y)$ and $\sigma_i(x, y)$ are the standard deviations computed using pixel values in a patch centered at location $(x, y)$ in the reference and reconstructed background image, respectively. $C$ is a small constant to avoid instability and is calculated as $C = (K L_{max})^2$, where $K$ is set to 0.03 and $L_{max}$ is the maximum possible value of the pixel intensity (255 in this case) [29]. A higher $SI$ value indicates higher similarity between the pixels in the reference and reconstructed background images.
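For concreteness, a minimal NumPy sketch of Equation (1) is given below. It assumes grayscale inputs in [0, 255]; the function name, the 7 × 7 patch size, and the use of SciPy's `uniform_filter` are illustrative choices on our part, not part of the proposed method.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def structure_index(ref, rec, patch=7, K=0.03, L_max=255.0):
    """Per-pixel contrast-structure similarity, Eq. (1)."""
    ref = ref.astype(np.float64)
    rec = rec.astype(np.float64)
    C = (K * L_max) ** 2
    mu_r, mu_i = uniform_filter(ref, patch), uniform_filter(rec, patch)
    # Local variances and cross-correlation over the patch centered at each pixel.
    var_r = uniform_filter(ref * ref, patch) - mu_r ** 2
    var_i = uniform_filter(rec * rec, patch) - mu_i ** 2
    cov_ri = uniform_filter(ref * rec, patch) - mu_r * mu_i
    return (2.0 * cov_ri + C) / (var_r + var_i + C)
```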
Background scenes often contain pseudo-stationary content such as waving trees, escalators, and local and global illumination changes. Even though these pseudo-stationary pixels belong to the background, they are likely to be classified as foreground pixels because of the presence of motion. For this reason, pseudo-stationary backgrounds pose an additional challenge for quality assessment algorithms. Since just comparing co-located pixel neighborhoods in the two considered images is not sufficient in the presence of such dynamic backgrounds, our algorithm uses a search window of size $w \times w$ centered at the current pixel $(x, y)$ in the reconstructed image, where $w$ is an odd value. The $SI$ is calculated between the pixel at location $(x, y)$ in the reference image and the $w^2$ pixels within the $w \times w$ search window centered at pixel $(x, y)$ in the reconstructed image. The resulting $SI$ matrix is of size $w \times w$. The modified Equation (1) to calculate $SI$ for every pixel location $(u, v)$ in the $w \times w$ window centered at $(x, y)$ is given as:

$$SI(x, y, u, v) = \frac{2\,\sigma_{ri}(x, y, u, v) + C}{\sigma_r^2(x, y) + \sigma_i^2(u, v) + C}, \quad (2)$$

where $\sigma_{ri}(x, y, u, v)$ is the cross-correlation between the patch centered at $(x, y)$ in the reference image and the patch centered at $(u, v)$ in the reconstructed image. The maximum value of the $SI$ matrix is taken to be the final $SI$ value for the pixel at location $(x, y)$ as given below:

$$SI(x, y) = \max_{(u, v)} SI(x, y, u, v). \quad (3)$$

The $SI$ map takes on values in $[-1, 1]$.
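A sketch of the search-window extension in Equations (2) and (3) follows, reusing the patch statistics from the previous sketch. `np.roll` is used for brevity, so border pixels wrap around rather than being handled as a careful implementation would.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def windowed_structure_index(ref, rec, patch=7, w=5, K=0.03, L_max=255.0):
    """Eqs. (2)-(3): max contrast-structure similarity over a w x w search window."""
    assert w % 2 == 1, "the search window size w must be odd"
    ref, rec = ref.astype(np.float64), rec.astype(np.float64)
    C = (K * L_max) ** 2
    mu_r = uniform_filter(ref, patch)
    var_r = uniform_filter(ref * ref, patch) - mu_r ** 2
    best = np.full(ref.shape, -1.0)  # SI values lie in [-1, 1]
    for du in range(-(w // 2), w // 2 + 1):
        for dv in range(-(w // 2), w // 2 + 1):
            # Align the patch centered at (x + du, y + dv) in the reconstruction
            # with the patch centered at (x, y) in the reference.
            shifted = np.roll(rec, (-du, -dv), axis=(0, 1))
            mu_i = uniform_filter(shifted, patch)
            var_i = uniform_filter(shifted * shifted, patch) - mu_i ** 2
            cov_ri = uniform_filter(ref * shifted, patch) - mu_r * mu_i
            si = (2.0 * cov_ri + C) / (var_r + var_i + C)
            best = np.maximum(best, si)  # Eq. (3): keep the best match
    return best
```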
In the proposed method, the $SI$ map is computed at $L$ different scales, denoted as $SI_1, SI_2, \ldots, SI_L$. The $SI$ maps generated at three different scales for the background image shown in Figure 4b and reconstructed using the method of [12] are shown in Figure 6. The darker regions in these images indicate larger structure differences between the reference and the reconstructed background images, while the lighter regions indicate higher similarities. From Figure 6c, it can also be seen that the computed $SI$ maps show the structure distortions while being robust to the escalator motion in the background.
The structure difference map $SDM_l$ is calculated using the $SI$ map at each scale $l$ as follows:

$$SDM_l(x, y) = \frac{1 - SI_l(x, y)}{2}. \quad (4)$$

$SDM_l$ takes on values between [0, 1], where the value of 0 corresponds to no difference while 1 corresponds to the largest difference.
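Equation (4) and the multi-scale decomposition can then be sketched as follows; the 2 × 2 mean-pooling pyramid is an assumption on our part, as the paper's exact downsampling filter may differ.

```python
import numpy as np

def structure_difference_map(si_map):
    """Eq. (4): map SI values in [-1, 1] to differences in [0, 1]."""
    return (1.0 - np.asarray(si_map, dtype=np.float64)) / 2.0

def halve(img):
    """Hypothetical dyadic downsampling (2 x 2 mean pooling) between scales."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# SDM_l for l = 1..L, using windowed_structure_index from the previous sketch:
#   maps = []
#   for _ in range(L):
#       maps.append(structure_difference_map(windowed_structure_index(ref, rec)))
#       ref, rec = halve(ref), halve(rec)
```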
4.3. Computation of the Reconstructed Background Quality Index (RBQI) Based on Probability Summation
As indicated previously, the reference and reconstructed background images are each decomposed into a multi-scale pyramid with $L$ levels. Structure difference maps $SDM_l$ and color difference maps $CDM_l$ are computed at every level $l$ as described in Equations (4) and (5), respectively. These difference maps are pooled together within each scale and later across all scales using a probability summation model [55] to give the final RBQI.
The probability summation model as described in [55] considers an ensemble of independent difference detectors at every pixel location in the image. These detectors predict the probability of perceiving the difference between the reference and the reconstructed background images at the corresponding pixel location based on its neighborhood characteristics in the reference image. Using this model, the probability of the structure difference detector signaling the presence of a structure difference at pixel location $(x, y)$ at level $l$ can be modeled as an exponential of the form:

$$P_{SD}^{l}(x, y) = 1 - \exp\left(-\left|\frac{SDM_l(x, y)}{\alpha_S^l(x, y)}\right|^{\beta_S}\right), \quad (6)$$

where $\beta_S$ is a parameter chosen to increase the correspondence of RBQI with the experimentally determined MOS scores on a training dataset as described in Section 5.2, and $\alpha_S^l(x, y)$ is a parameter whose value depends upon the texture characteristics of the neighborhood centered at $(x, y)$ in the reference image. The value of $\alpha_S^l(x, y)$ is chosen to take into account that differences in structure are less perceptible in textured areas as compared to non-textured areas and that the perception of these differences depends on the scale $l$.
In order to determine the value of $\alpha_S^l(x, y)$, every pixel in the reference background image at scale $l$ is classified as textured or non-textured using the technique in [56]. This method first calculates the local variance at each pixel using a 3 × 3 window centered around it. Based on the computed variances, a pixel is classified as edge, texture or uniform. By considering the number of edge, texture and uniform pixels in the 8 × 8 neighborhood of the pixel, it is further classified into one of six types: uniform, uniform/texture, texture, edge/texture, medium edge and strong edge. For our application, we label the pixels classified as ‘texture’ and ‘edge/texture’ as ‘textured’ pixels and we label the rest as ‘non-textured’ pixels.
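A rough sketch of this labeling is given below. The variance thresholds and the majority rule are hypothetical placeholders; the actual class boundaries are those of [56].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def texture_flag(ref, var_lo=25.0, var_hi=400.0):
    """Per-pixel textured flag T(x, y) in the spirit of [56] (thresholds are
    placeholders, not the values from [56])."""
    ref = ref.astype(np.float64)
    # Local variance over the 3 x 3 window centered at each pixel.
    local_var = uniform_filter(ref * ref, 3) - uniform_filter(ref, 3) ** 2
    # Pixel-level classes: 0 = uniform, 1 = texture, 2 = edge.
    pixel_class = np.digitize(local_var, [var_lo, var_hi])
    # Label a pixel 'textured' if texture-like pixels dominate its 8 x 8 neighborhood.
    texture_ratio = uniform_filter((pixel_class == 1).astype(np.float64), 8)
    return texture_ratio > 0.5
```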
Let $T(x, y)$ be the flag indicating that a pixel is textured. Thus, the values of $\alpha_S^l(x, y)$ can be expressed as:

$$\alpha_S^l(x, y) = \begin{cases} a, & \text{if } T(x, y) = 1, \\ 1, & \text{otherwise.} \end{cases} \quad (7)$$

When $T(x, y) = 1$, the value of $a$ should be large enough such that $P_{SD}^{l}(x, y) \approx 0$. In our implementation, we chose a sufficiently large value of $a$. Thus, in our current implementation, $\alpha_S^l(x, y)$ takes on the form of a binary function that can be replaced with a computationally efficient model obtained by replacing the division by $\alpha_S^l(x, y)$ in Equation (6) with a multiplication by the weight $w(x, y) = 1/\alpha_S^l(x, y)$. In the remainder of the paper, we keep the notation in Equation (6) to accommodate a more generalized adaptation model based on local image characteristics in textured areas.
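Equations (6) and (7) then amount to a few lines. The values `a=100.0` and `beta_s=2.0` below are placeholders, since the paper chooses $a$ only to saturate the exponential and fits $\beta_S$ on training MOS scores (Section 5.2).

```python
import numpy as np

def alpha_structure(textured, a=100.0):
    """Eq. (7): a large alpha in textured areas suppresses detection there."""
    return np.where(textured, a, 1.0)

def p_structure_detect(sdm, alpha, beta_s=2.0):
    """Eq. (6): probability that the structure difference detector fires."""
    return 1.0 - np.exp(-np.abs(sdm / alpha) ** beta_s)
```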
Similarly, the probability of the color difference detector signaling the presence of a color difference at pixel location $(x, y)$ at level $l$ can be modeled as:

$$P_{CD}^{l}(x, y) = 1 - \exp\left(-\left|\frac{CDM_l(x, y)}{\alpha_C^l(x, y)}\right|^{\beta_C}\right), \quad (8)$$

where $\beta_C$ is found in a similar way to $\beta_S$, and $\alpha_C^l(x, y)$ corresponds to the Adaptive Just Noticeable Color Difference (AJNCD) calculated at every pixel $(x, y)$ in the CIELAB color space as given in [57]:

$$\alpha_C^l(x, y) = JNCD_{Lab} \cdot S_C(x, y) \cdot S_L(x, y), \quad (9)$$
where $a(x, y)$ and $b(x, y)$ correspond, respectively, to the $a$ and $b$ color values of the pixel located at $(x, y)$ in the Lab color space, $JNCD_{Lab}$ is set to 2.3 [58], $\bar{L}(x, y)$ is the mean background luminance of the pixel at $(x, y)$, and $\Delta L_{max}(x, y)$ is the maximum luminance gradient across pixel $(x, y)$. In Equation (9), $S_C(x, y)$ is the scaling factor for the chroma components and is given by [57]:

$$S_C(x, y) = 1 + 0.045\,\sqrt{a(x, y)^2 + b(x, y)^2}. \quad (10)$$
$S_L(x, y)$ is the scaling factor that simulates the local luminance texture masking and is given by:

$$S_L(x, y) = 1 + \lambda\,\frac{\Delta L_{max}(x, y)}{\bar{L}(x, y)}, \quad (11)$$

where $\lambda$ is the weighting factor as described in [57]. Thus, $\alpha_C^l(x, y)$ varies at every pixel location based on the distance between the chroma values and the texture masking properties of its neighborhood.
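A sketch of the AJNCD threshold of Equations (9)-(11), as reconstructed above, is given below; the 5 × 5 neighborhood and the default `lam` are assumptions on our part, and the exact scaling functions should be taken from [57].

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter, uniform_filter

def ajncd(lab, jncd=2.3, lam=0.5):
    """Per-pixel threshold alpha_C, Eqs. (9)-(11); `lab` is an (H, W, 3) CIELAB image."""
    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    s_c = 1.0 + 0.045 * np.sqrt(a ** 2 + b ** 2)       # Eq. (10), chroma scaling
    mean_L = uniform_filter(L, 5)                      # mean background luminance
    # Maximum luminance gradient across the pixel (5 x 5 neighborhood).
    grad = np.maximum(maximum_filter(L, 5) - L, L - minimum_filter(L, 5))
    s_l = 1.0 + lam * grad / np.maximum(mean_L, 1.0)   # Eq. (11), texture masking
    return jncd * s_c * s_l
```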
A pixel $(x, y)$ at the $l$-th level is said to have no distortion if and only if neither the structure difference detector nor the color difference detector at location $(x, y)$ signals the presence of a difference. Thus, the probability of detecting no difference between the reference and reconstructed background images at pixel $(x, y)$ and level $l$ can be written as:

$$P_{ND}^{l}(x, y) = \left(1 - P_{SD}^{l}(x, y)\right)\left(1 - P_{CD}^{l}(x, y)\right). \quad (12)$$

Substituting Equations (6) and (8) for $P_{SD}^{l}(x, y)$ and $P_{CD}^{l}(x, y)$, respectively, in Equation (12), we get:

$$P_{ND}^{l}(x, y) = \exp\left(-\left[d_S^l(x, y) + d_C^l(x, y)\right]\right), \quad (13)$$

where

$$d_S^l(x, y) = \left|\frac{SDM_l(x, y)}{\alpha_S^l(x, y)}\right|^{\beta_S} \quad (14)$$

and

$$d_C^l(x, y) = \left|\frac{CDM_l(x, y)}{\alpha_C^l(x, y)}\right|^{\beta_C}. \quad (15)$$
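The per-pixel combination of the two detectors (Equations (12)-(15)) can be sketched as follows; as before, the beta values are placeholders fit on training data.

```python
import numpy as np

def per_pixel_exponents(sdm, cdm, alpha_s, alpha_c, beta_s=2.0, beta_c=2.0):
    """Eqs. (14)-(15): per-pixel exponents d_S and d_C."""
    d_s = np.abs(sdm / alpha_s) ** beta_s
    d_c = np.abs(cdm / alpha_c) ** beta_c
    return d_s, d_c

def p_no_detection(d_s, d_c):
    """Eqs. (12)-(13): probability that neither detector fires at a pixel."""
    return np.exp(-(d_s + d_c))
```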
A less localized probability of difference detection can be computed by adopting the “probability summation” hypothesis [55], which pools the localized detection probabilities over a region $R$. The probability summation hypothesis is based on the following two assumptions:

Assumption 1. A structure difference is detected in the region of interest $R$ if and only if at least one detector in $R$ signals the presence of a difference, i.e., if and only if at least one of the differences is greater than the threshold and, therefore, considered to be visible. Similarly, a color difference is detected in region $R$ if and only if at least one of the differences is above the corresponding threshold.

Assumption 2. The probabilities of detection are independent; i.e., the probability that a particular detector will signal the presence of a difference is independent of the probability that any other detector will. This simplified approximation model is commonly used in the psychophysics literature [55,59] and was found to work well in practice in terms of correlation with human judgement in quantifying perceived visual distortions [60,61].

Then, the probability of no difference detection over the region $R$ is given by:

$$P_{ND}^{l}(R) = \prod_{(x, y) \in R} P_{ND}^{l}(x, y). \quad (16)$$

Substituting Equation (13) in the above equation gives:

$$P_{ND}^{l}(R) = \exp\left(-\left[D_S^l(R) + D_C^l(R)\right]\right), \quad (17)$$

where

$$D_S^l(R) = \sum_{(x, y) \in R} d_S^l(x, y) \quad (18)$$

and

$$D_C^l(R) = \sum_{(x, y) \in R} d_C^l(x, y). \quad (19)$$

In the human visual system, the highest visual acuity is limited to the size of the foveal region, which covers approximately 2° of visual angle. In our work, we consider the image regions $R$ as foveal regions approximated by non-overlapping image blocks.
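Pooling over a foveal block thus reduces to summing the per-pixel exponents within the block (Equations (18) and (19)); the 32 × 32 block size below is a hypothetical stand-in for the foveal-region size.

```python
import numpy as np

def block_sums(d, block=32):
    """Eqs. (18)-(19): sum per-pixel exponents over non-overlapping blocks R."""
    h, w = d.shape[0] // block * block, d.shape[1] // block * block
    v = d[:h, :w].reshape(h // block, block, w // block, block)
    return v.sum(axis=(1, 3))  # one entry per foveal region R
```

Since Equation (17) is an exponential of this sum, the per-region no-detection probabilities are simply `np.exp(-(block_sums(d_s) + block_sums(d_c)))`.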
The probability of no distortion detection over the $l$-th level is obtained by pooling the no detection probabilities over all the regions $R$ in level $l$ and is given by:

$$P_{ND}^{l} = \prod_{R} P_{ND}^{l}(R), \quad (20)$$

or

$$P_{ND}^{l} = \exp\left(-\left[D_S^l + D_C^l\right]\right), \quad (21)$$

where

$$D_S^l = \sum_{R} D_S^l(R) \quad (22)$$

and

$$D_C^l = \sum_{R} D_C^l(R). \quad (23)$$
Similarly, we adopt a “probability summation” hypothesis to pool the detection probabilities across scales. It should be noted that the Human Visual System (HVS)-dependent parameters $\alpha_S^l(x, y)$ and $\alpha_C^l(x, y)$ that are included in Equations (14) and (15), respectively, account for the varying sensitivity of the HVS at varying scales. The final probability of detecting no distortion in a reconstructed background image $i$ is obtained when no distortion is detected at any scale and is computed by pooling the no detection probabilities $P_{ND}^{l}$ over all scales $l$, $l = 1, \ldots, L$, as follows:

$$P_{ND}^{i} = \prod_{l=1}^{L} P_{ND}^{l}, \quad (24)$$

or

$$P_{ND}^{i} = \exp\left(-\left[D_S + D_C\right]\right), \quad (25)$$

where

$$D_S = \sum_{l=1}^{L} D_S^l \quad (26)$$

and

$$D_C = \sum_{l=1}^{L} D_C^l, \quad (27)$$

where $D_S^l$ and $D_C^l$ are given by Equations (22) and (23), respectively. From Equations (26) and (27), it can be seen that $D_S$ and $D_C$ take the form of a Minkowski metric with exponent $\beta_S$ and $\beta_C$, respectively.
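Because the model multiplies exponentials, pooling across regions (Equations (22) and (23)) and across scales (Equations (26) and (27)) collapses into summing the exponents; a sketch, assuming lists of per-level exponent maps:

```python
import numpy as np

def pooled_distortion(d_s_per_level, d_c_per_level):
    """Eqs. (22)-(23) and (26)-(27): region and scale pooling as plain sums."""
    D_s = sum(float(np.sum(d)) for d in d_s_per_level)
    D_c = sum(float(np.sum(d)) for d in d_c_per_level)
    return D_s, D_c  # P_ND = exp(-(D_s + D_c)), Eq. (25)
```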
By substituting the values of $D_S$ and $D_C$ given by Equations (26) and (27), together with Equations (22), (23), (18) and (19), in Equation (25) and simplifying, we get:

$$P_{ND}^{i} = e^{-D}, \quad (28)$$
where

$$D = \sum_{l=1}^{L} \sum_{R} \sum_{(x, y) \in R} \left[d_S^l(x, y) + d_C^l(x, y)\right]. \quad (29)$$

In Equation (29), $d_S^l(x, y)$ and $d_C^l(x, y)$ are given by Equations (14) and (15), respectively. Thus, the probability of detecting a difference between the reference image and a reconstructed background image $i$ is given as:

$$P_{D}^{i} = 1 - P_{ND}^{i} = 1 - e^{-D}. \quad (30)$$
As can be seen from Equation (30), a lower value of $D$ results in a lower probability of difference detection $P_{D}^{i}$, while a higher value results in a higher probability of difference detection. Therefore, $D$ can be used to assess the perceived quality of the reconstructed background image, with a lower value of $D$ corresponding to a higher perceived quality.
The final Reconstructed Background Quality Index (RBQI) for a reconstructed background image is calculated using the logarithm of $D$ as follows:

$$\mathrm{RBQI} = \log_{10}(1 + D). \quad (31)$$

As $D$ increases, the value of RBQI increases, implying more perceived distortion and thus a lower quality of the reconstructed background image. The logarithmic mapping models the saturation effect, i.e., beyond a certain point, the maximum annoyance level is reached and more distortion does not affect the perceived quality.
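An end-to-end sketch of the final score, under the form of Equation (31) reconstructed above, chains the previous snippets; it takes the lists of per-level exponent maps produced by `per_pixel_exponents`.

```python
import numpy as np

def rbqi_from_exponents(d_s_per_level, d_c_per_level):
    """Eqs. (28)-(31): pooled distortion D, detection probability, and RBQI."""
    D = sum(float(np.sum(d)) for d in d_s_per_level + d_c_per_level)
    p_detect = 1.0 - np.exp(-D)  # Eq. (30), returned for inspection
    return np.log10(1.0 + D), p_detect
```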