In this section, we propose a full-reference quality index that can automatically assess the perceived quality of reconstructed background images. The proposed Reconstructed Background Quality Index (RBQI) uses a probability summation model to combine visual characteristics at multiple scales and quantify the deterioration in the perceived quality of the reconstructed background image due to the presence of any residual foreground objects or unnaturalness that may be introduced by the background reconstruction algorithm. The motivation for RBQI comes from the fact that the quality of a reconstructed background image depends on two factors, namely:

#### 4.1. Structure Difference Map (${d}_{s}$)

An image can be decomposed into three components: luminance, contrast and structure, which are local features computed at every pixel location $(x,y)$ of the image as described in [29]. By comparing these components, the similarity between two images can be calculated [19,29]. A reconstructed background image is formed by mosaicing together parts of different input images. For such an image to appear natural, it is important that structural continuity be maintained. Preservation of the local luminance from the reference background image is of low relevance as long as this structural continuity is maintained. Any sudden variation in the local luminance across the reconstructed background image manifests itself as a contrast or structure deviation from the reference image. Thus, we consider only contrast and structure when comparing the reference and reconstructed background images, leaving out the luminance component. These contrast and structure differences between the reference and the reconstructed background images, calculated at each pixel, give us the ‘contrast-structure similarity map’, referred to as the ‘structure map’ for short in the rest of the paper.

First, the structure similarity between the reference and the reconstructed background image, referred to as the Structure Index ($SI$), is calculated at each pixel location $(x,y)$ using [29]:

$$SI(x,y)=\frac{2{\sigma}_{{r}_{(x,y)}{i}_{(x,y)}}+C}{{\sigma}_{{r}_{(x,y)}}^{2}+{\sigma}_{{i}_{(x,y)}}^{2}+C},\qquad (1)$$

where $r$ is the reference background image, $i$ is the reconstructed background image, and ${\sigma}_{{r}_{(x,y)}{i}_{(x,y)}}$ is the cross-correlation between image patches centered at location $(x,y)$ in the reference and reconstructed background images. ${\sigma}_{{r}_{(x,y)}}$ and ${\sigma}_{{i}_{(x,y)}}$ are the standard deviations computed using the pixel values in a patch centered at location $(x,y)$ in the reference and reconstructed background image, respectively. $C$ is a small constant used to avoid instability and is calculated as $C={(K\cdot {I}_{max})}^{2}$, where $K$ is set to $0.03$ and ${I}_{max}$ is the maximum possible pixel intensity value (255 in this case) [29]. A higher $SI$ value indicates higher similarity between the pixels in the reference and reconstructed background images.
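As an illustration, the per-pixel contrast-structure comparison can be sketched as follows. This is a minimal sketch, not the paper's implementation: the patch size (7 × 7) and the plain square window are assumptions, since the text only fixes $K = 0.03$ and ${I}_{max} = 255$.

```python
import numpy as np

def structure_index(ref, rec, patch=7, K=0.03, I_max=255.0):
    """Contrast-structure similarity (SI) at every pixel.
    Sketch: patch = 7 and truncated border patches are assumptions;
    the paper only fixes K = 0.03 and I_max = 255."""
    ref = ref.astype(np.float64)
    rec = rec.astype(np.float64)
    C = (K * I_max) ** 2  # stability constant
    h, w = ref.shape
    r = patch // 2
    si = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            ys, ye = max(0, y - r), min(h, y + r + 1)
            xs, xe = max(0, x - r), min(w, x + r + 1)
            pr = ref[ys:ye, xs:xe].ravel()
            pi = rec[ys:ye, xs:xe].ravel()
            cov = ((pr - pr.mean()) * (pi - pi.mean())).mean()
            si[y, x] = (2 * cov + C) / (pr.var() + pi.var() + C)
    return si
```

With identical inputs the map is 1 everywhere, and the ratio is bounded above by 1 since $2{\sigma}_{ri}\le {\sigma}_{r}^{2}+{\sigma}_{i}^{2}$.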

Background scenes often contain pseudo-stationary objects, such as waving trees and escalators, as well as local and global illumination changes. Even though these pseudo-stationary pixels belong to the background, the presence of motion makes them likely to be classified as foreground pixels. For this reason, pseudo-stationary backgrounds pose an additional challenge for quality assessment algorithms. Since just comparing co-located pixel neighborhoods in the two considered images is not sufficient in the presence of such dynamic backgrounds, our algorithm uses a search window of size $nhood\times nhood$ centered at the current pixel $(x,y)$ in the reconstructed image, where $nhood$ is an odd value. The $SI$ is calculated between the pixel at location $(x,y)$ in the reference image and the ${\left(nhood\right)}^{2}$ pixels within the $nhood\times nhood$ search window centered at pixel $(x,y)$ in the reconstructed image. The resulting $SI$ matrix is of size $nhood\times nhood$. The modified Equation (1) to calculate $SI$ for every pixel location in the $nhood\times nhood$ window centered at $(x,y)$ is given as:

$$S{I}_{(x,y)}(u,v)=\frac{2{\sigma}_{{r}_{(x,y)}{i}_{(u,v)}}+C}{{\sigma}_{{r}_{(x,y)}}^{2}+{\sigma}_{{i}_{(u,v)}}^{2}+C},\qquad (2)$$

where $(u,v)$ ranges over the pixel locations within the $nhood\times nhood$ window centered at $(x,y)$ in the reconstructed image. The maximum value of the $SI$ matrix is taken to be the final $SI$ value for the pixel at location $(x,y)$ as given below:

$$SI(x,y)=\max_{(u,v)}\ S{I}_{(x,y)}(u,v).\qquad (3)$$

The $SI$ map takes on values between [−1, 1].
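The windowed search can be sketched as below. The `nhood` and `patch` values are illustrative choices only, and the border handling (truncated patches, skipping size-mismatched candidates) is an assumption of this sketch.

```python
import numpy as np

def si_dynamic(ref, rec, nhood=5, patch=7, K=0.03, I_max=255.0):
    """SI with a search window to tolerate pseudo-stationary motion:
    the reference patch at (x, y) is compared against every patch inside
    the nhood x nhood window of the reconstructed image and the maximum
    similarity is kept. nhood = 5 and patch = 7 are illustrative."""
    ref = ref.astype(np.float64)
    rec = rec.astype(np.float64)
    C = (K * I_max) ** 2
    h, w = ref.shape
    pr_, nr_ = patch // 2, nhood // 2

    def patch_at(img, y, x):
        return img[max(0, y - pr_):min(h, y + pr_ + 1),
                   max(0, x - pr_):min(w, x + pr_ + 1)]

    si = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            pr = patch_at(ref, y, x).ravel()
            best = -1.0  # SI is lower-bounded by -1
            for v in range(max(0, y - nr_), min(h, y + nr_ + 1)):
                for u in range(max(0, x - nr_), min(w, x + nr_ + 1)):
                    pi = patch_at(rec, v, u).ravel()
                    if pi.size != pr.size:
                        continue  # sketch: skip mismatched border patches
                    cov = ((pr - pr.mean()) * (pi - pi.mean())).mean()
                    val = (2 * cov + C) / (pr.var() + pi.var() + C)
                    best = max(best, val)
            si[y, x] = best
    return si
```

For content that has merely shifted by a pixel or two (escalator steps, swaying leaves), the matching patch is found inside the window, so the maximum stays near 1 instead of flagging a false structure difference.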

In the proposed method, the $SI$ map is computed at $L$ different scales, denoted as $S{I}_{l}(x,y)$, $l=0,\dots ,L-1$. The $SI$ maps generated at three different scales for the background image shown in Figure 4b, reconstructed using the method of [12], are shown in Figure 6. The darker regions in these images indicate larger structure differences between the reference and the reconstructed background images, while the lighter regions indicate higher similarities. From Figure 6c, it can also be seen that the computed $SI$ maps capture the structure distortions while remaining robust to the escalator motion in the background.

The structure difference map is calculated using the $SI$ map at each scale $l$ as follows:

$${d}_{s,l}(x,y)=\frac{1-S{I}_{l}(x,y)}{2}.\qquad (4)$$

${d}_{s,l}$ takes on values in [0, 1], where 0 corresponds to no difference and 1 corresponds to the largest difference.
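Assuming the linear rescaling that maps the $SI$ range [−1, 1] onto [0, 1] with 0 meaning no difference, the conversion is a one-liner:

```python
import numpy as np

def structure_difference(si_map):
    """Map SI in [-1, 1] to a structure difference d_s in [0, 1],
    assuming the linear rescaling d_s = (1 - SI) / 2: 0 means no
    difference, 1 the largest difference."""
    return (1.0 - np.asarray(si_map, dtype=np.float64)) / 2.0
```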

#### 4.3. Computation of the Reconstructed Background Quality Index (RBQI) Based on Probability Summation

As indicated previously, the reference and reconstructed background images are each decomposed into a multi-scale pyramid with $L$ levels. Structure difference maps ${d}_{s,l}$ and color difference maps ${d}_{c,l}$ are computed at every level $l=0,\dots ,L-1$ as described in Equations (4) and (5), respectively. These difference maps are pooled together within each scale and then across all scales using a probability summation model [55] to give the final RBQI.

The probability summation model described in [55] considers an ensemble of independent difference detectors at every pixel location in the image. These detectors predict the probability of perceiving the difference between the reference and the reconstructed background images at the corresponding pixel location based on its neighborhood characteristics in the reference image. Using this model, the probability of the structure difference detector signaling the presence of a structure difference at pixel location $(x,y)$ at level $l$ can be modeled as an exponential of the form:

$${P}_{D,s,l}(x,y)=1-\exp \left[-{\left(\frac{{d}_{s,l}(x,y)}{{\alpha}_{s,l}(x,y)}\right)}^{{\beta}_{s}}\right],\qquad (6)$$

where ${\beta}_{s}$ is a parameter chosen to increase the correspondence of RBQI with the experimentally determined MOS scores on a training dataset, as described in Section 5.2, and ${\alpha}_{s,l}(x,y)$ is a parameter whose value depends on the texture characteristics of the neighborhood centered at $(x,y)$ in the reference image. The value of ${\alpha}_{s,l}(x,y)$ is chosen to account for the fact that differences in structure are less perceptible in textured areas than in non-textured areas, and that the perception of these differences depends on the scale $l$.

In order to determine the value of ${\alpha}_{s,l}$, every pixel in the reference background image at scale $l$ is classified as textured or non-textured using the technique in [56]. This method first calculates the local variance at each pixel using a 3 × 3 window centered around it. Based on the computed variances, a pixel is classified as edge, texture or uniform. By considering the number of edge, texture and uniform pixels in the 8 × 8 neighborhood of the pixel, it is further classified into one of six types: uniform, uniform/texture, texture, edge/texture, medium edge and strong edge. For our application, we label the pixels classified as ‘texture’ and ‘edge/texture’ as ‘textured’ pixels and label the rest as ‘non-textured’ pixels.
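A heavily simplified stand-in for this classification (3 × 3 variance thresholding only, without the six-class refinement of [56]) might look like the sketch below; the threshold value is an illustrative assumption.

```python
import numpy as np

def texture_flag(img, var_thresh=100.0):
    """Simplified stand-in for the pixel classification of [56]:
    f_tex = 1 for pixels whose 3x3 local variance exceeds a threshold.
    The real method refines this into six classes (uniform ... strong
    edge); var_thresh = 100 is illustrative, not from the paper."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    f = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            patch = img[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            f[y, x] = 1 if patch.var() > var_thresh else 0
    return f
```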

Let ${f}_{tex,l}(x,y)=1$ be the flag indicating that a pixel is textured. The values of ${\alpha}_{s,l}(x,y)$ can then be expressed as:

$${\alpha}_{s,l}(x,y)=a\cdot {f}_{tex,l}(x,y)+\left(1-{f}_{tex,l}(x,y)\right).\qquad (7)$$

When ${f}_{tex,l}(x,y)=1$, the value of $a$ should be large enough that ${P}_{D,s,l}(x,y)\to 0$; in our implementation, we chose $a=1000.0$. Thus, in our current implementation, ${\alpha}_{s,l}(x,y)$ takes the form of a binary function that can be replaced with a computationally efficient model obtained by replacing division by ${\alpha}_{s,l}(x,y)$ in Equation (6) with multiplication by the weight ${w}_{s,l}(x,y)=1/{\alpha}_{s,l}(x,y)\approx (1-{f}_{tex,l}(x,y))$. In the remainder of the paper, we keep the notation of Equation (6) to accommodate a more generalized adaptation model based on local image characteristics in textured areas.
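The detection probability of Equation (6) with the binary texture weight ${w}_{s,l}=1-{f}_{tex,l}$ can be sketched as follows; the value of `beta` is a placeholder for the trained ${\beta}_{s}$, which the paper fits to MOS scores.

```python
import numpy as np

def detection_probability(d, f_tex, beta=3.0):
    """Per-pixel probability of detecting a structure difference,
    Equation (6), using the efficient weight form w = 1 - f_tex so
    that textured pixels contribute zero detection probability.
    beta = 3.0 is a placeholder for the trained beta_s."""
    d = np.asarray(d, dtype=np.float64)
    w = 1.0 - np.asarray(f_tex, dtype=np.float64)
    return 1.0 - np.exp(-((w * d) ** beta))
```

Textured pixels (`f_tex = 1`) yield probability 0, matching the intended masking: structure differences hidden by texture are never counted as detections.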

Similarly, the probability of the color difference detector signaling the presence of a color difference at pixel location $(x,y)$ at level $l$ can be modeled as:

$${P}_{D,c,l}(x,y)=1-\exp \left[-{\left(\frac{{d}_{c,l}(x,y)}{{\alpha}_{c,l}(x,y)}\right)}^{{\beta}_{c}}\right],\qquad (8)$$

where ${\beta}_{c}$ is found in a similar way to ${\beta}_{s}$, and ${\alpha}_{c,l}(x,y)$ corresponds to the Adaptive Just Noticeable Color Difference (AJNCD) calculated at every pixel $(x,y)$ in the $Lab$ color space as given in [57]:

$${\alpha}_{c,l}(x,y)=JNC{D}_{Lab}\cdot {s}_{C}\left({a}_{l}(x,y),{b}_{l}(x,y)\right)\cdot {s}_{L}\left(E\left({L}_{l}\right),\Delta L\right),\qquad (9)$$

where ${a}_{l}(x,y)$ and ${b}_{l}(x,y)$ correspond, respectively, to the $a$ and $b$ color values of the pixel located at $(x,y)$ in the $Lab$ color space, $JNC{D}_{Lab}$ is set to 2.3 [58], $E\left({L}_{l}\right)$ is the mean background luminance of the pixel at $(x,y)$, and $\Delta L$ is the maximum luminance gradient across pixel $(x,y)$. In Equation (9), ${s}_{C}$ is the scaling factor for the chroma components and is given by [57]:

$${s}_{C}=1+0.045\sqrt{{a}_{l}^{2}(x,y)+{b}_{l}^{2}(x,y)},\qquad (10)$$

and ${s}_{L}$ is the scaling factor that simulates local luminance texture masking and is given by:

$${s}_{L}=1+\rho \left(E\left({L}_{l}\right)\right)\cdot \Delta L,\qquad (11)$$

where $\rho \left(E\left({L}_{l}\right)\right)$ is a weighting factor as described in [57]. Thus, ${\alpha}_{c,l}$ varies at every pixel location based on the distance between the chroma values and the texture masking properties of its neighborhood.

A pixel $(x,y)$ at the $l$-th level is said to have no distortion if and only if neither the structure difference detector nor the color difference detector at location $(x,y)$ signals the presence of a difference. Thus, the probability of detecting no difference between the reference and reconstructed background images at pixel $(x,y)$ and level $l$ can be written as:

$${P}_{ND,l}(x,y)=\left[1-{P}_{D,s,l}(x,y)\right]\cdot \left[1-{P}_{D,c,l}(x,y)\right].\qquad (12)$$

Substituting Equations (6) and (8) for ${P}_{D,s,l}$ and ${P}_{D,c,l}$, respectively, in Equation (12), we get:

$${P}_{ND,l}(x,y)=\exp \left[-\left({D}_{s,l}(x,y)+{D}_{c,l}(x,y)\right)\right],\qquad (13)$$

where

$${D}_{s,l}(x,y)={\left(\frac{{d}_{s,l}(x,y)}{{\alpha}_{s,l}(x,y)}\right)}^{{\beta}_{s}}\qquad (14)$$

and

$${D}_{c,l}(x,y)={\left(\frac{{d}_{c,l}(x,y)}{{\alpha}_{c,l}(x,y)}\right)}^{{\beta}_{c}}.\qquad (15)$$

A less localized probability of difference detection can be computed by adopting the “probability summation” hypothesis [55], which pools the localized detection probabilities over a region $R$. The probability summation hypothesis is based on the following two assumptions:

**Assumption** **1.** A structure difference is detected in the region of interest R if and only if at least one detector in R signals the presence of a difference, i.e., if and only if at least one of the differences ${d}_{s,l}(x,y)$ is greater than the threshold ${\alpha}_{s}$ and, therefore, considered to be visible. Similarly, a color difference is detected in region R if and only if at least one of the differences ${d}_{c,l}(x,y)$ is above ${\alpha}_{c}$.

**Assumption** **2.** The probabilities of detection are independent; i.e., the probability that a particular detector will signal the presence of a difference is independent of the probability that any other detector will. This simplified approximation model is commonly used in the psychophysics literature [55,59] and was found to work well in practice in terms of correlation with human judgement in quantifying perceived visual distortions [60,61].

Then, the probability of no difference detection over the region $R$ is given by:

$${P}_{ND,l}(R)=\prod_{(x,y)\in R}{P}_{ND,l}(x,y).\qquad (16)$$

Substituting Equation (12) in the above equation gives:

$${P}_{ND,l}(R)=\exp \left[-\left({D}_{s,l}(R)+{D}_{c,l}(R)\right)\right],\qquad (17)$$

where

$${D}_{s,l}(R)=\sum_{(x,y)\in R}{D}_{s,l}(x,y)\qquad (18)$$

and

$${D}_{c,l}(R)=\sum_{(x,y)\in R}{D}_{c,l}(x,y).\qquad (19)$$

In the human visual system, the highest visual acuity is limited to the size of the foveal region, which covers approximately ${2}^{\circ}$ of visual angle. In our work, we consider the image regions R as foveal regions approximated by $8\times 8$ non-overlapping image blocks.
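Under probability summation, pooling the per-pixel probabilities over a region reduces to summing the per-pixel exponents ${D}_{s,l}(x,y)+{D}_{c,l}(x,y)$ within that region. A sketch of the 8 × 8 non-overlapping block pooling:

```python
import numpy as np

def pool_regions(D_pix, block=8):
    """Pool a map of per-pixel exponents D_{s,l}(x,y) + D_{c,l}(x,y)
    over non-overlapping block x block foveal regions: under
    probability summation, each region's exponent is simply the sum
    of the per-pixel exponents inside it."""
    D_pix = np.asarray(D_pix, dtype=np.float64)
    h, w = D_pix.shape
    sums = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            sums.append(D_pix[y:y + block, x:x + block].sum())
    return np.array(sums)  # one exponent per region R
```

The per-region no-detection probability is then `np.exp(-pool_regions(D_pix))`, element-wise.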

The probability of no distortion detection over the $l$-th level is obtained by pooling the no detection probabilities over all the regions $R$ in level $l$ and is given by:

$${P}_{ND}\left(l\right)=\prod_{R}{P}_{ND,l}(R),\qquad (20)$$

or

$${P}_{ND}\left(l\right)=\exp \left[-\left({D}_{s}\left(l\right)+{D}_{c}\left(l\right)\right)\right],\qquad (21)$$

where

$${D}_{s}\left(l\right)=\sum_{R}{D}_{s,l}(R)\qquad (22)$$

and

$${D}_{c}\left(l\right)=\sum_{R}{D}_{c,l}(R).\qquad (23)$$

Similarly, we adopt the “probability summation” hypothesis to pool the detection probability across scales. It should be noted that the Human Visual System (HVS) dependent parameters ${\alpha}_{s,l}$ and ${\alpha}_{c,l}$ included in Equations (14) and (15), respectively, account for the varying sensitivity of the HVS at different scales. The final probability of detecting no distortion in a reconstructed background image $i$ is obtained when no distortion is detected at any scale and is computed by pooling the no detection probabilities ${P}_{ND}\left(l\right)$ over all scales $l$, $l=0,\dots ,L-1$, as follows:

$${P}_{ND}\left(i\right)=\prod_{l=0}^{L-1}{P}_{ND}\left(l\right),\qquad (24)$$

or

$${P}_{ND}\left(i\right)=\exp \left[-\left({D}_{s}+{D}_{c}\right)\right],\qquad (25)$$

where

$${D}_{s}=\sum_{l=0}^{L-1}{D}_{s}\left(l\right)\qquad (26)$$

and

$${D}_{c}=\sum_{l=0}^{L-1}{D}_{c}\left(l\right),\qquad (27)$$

where ${D}_{s}\left(l\right)$ and ${D}_{c}\left(l\right)$ are given by Equations (22) and (23), respectively. From Equations (26) and (27), it can be seen that ${D}_{s}$ and ${D}_{c}$ take the form of a Minkowski metric with exponents ${\beta}_{s}$ and ${\beta}_{c}$, respectively.

By substituting the values of ${D}_{s}$, ${D}_{c}$, ${D}_{s}\left(l\right)$, ${D}_{c}\left(l\right)$, ${D}_{s,l}\left(R\right)$ and ${D}_{c,l}\left(R\right)$ in Equation (25) and simplifying, we get:

$${P}_{ND}\left(i\right)={e}^{-D},\qquad (28)$$

where

$$D=\sum_{l=0}^{L-1}\sum_{R}\sum_{(x,y)\in R}\left[{D}_{s,l}(x,y)+{D}_{c,l}(x,y)\right].\qquad (29)$$

In Equation (29), ${D}_{s,l}(x,y)$ and ${D}_{c,l}(x,y)$ are given by Equations (14) and (15), respectively. Thus, the probability of detecting a difference between the reference image and a reconstructed background image $i$ is given as:

$${P}_{D}\left(i\right)=1-{P}_{ND}\left(i\right)=1-{e}^{-D}.\qquad (30)$$

As can be seen from Equation (30), a lower value of $D$ results in a lower probability of difference detection ${P}_{D}\left(i\right)$, while a higher value results in a higher probability of difference detection. Therefore, $D$ can be used to assess the perceived quality of the reconstructed background image, with a lower value of $D$ corresponding to a higher perceived quality.

The final Reconstructed Background Quality Index (RBQI) for a reconstructed background image is calculated using the logarithm of $D$ as follows:

$$RBQI=\log \left(1+D\right).\qquad (31)$$

As D increases, the value of RBQI increases implying more perceived distortion and thus lower quality of the reconstructed background image. The logarithmic mapping models the saturation effect, i.e., beyond a certain point, the maximum annoyance level is reached and more distortion does not affect the quality.
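The final pooling across regions and scales, the detection probability of Equation (30), and the logarithmic mapping can be sketched as follows. The exact $\log_{10}(1+D)$ form is an assumption of this sketch, consistent with "the logarithm of $D$" and with RBQI increasing in $D$.

```python
import numpy as np

def rbqi_from_exponents(per_scale_exponents):
    """Combine per-scale arrays of pooled exponents into the final
    index: D sums every exponent across regions and scales, the
    detection probability is 1 - exp(-D) (Equation (30)), and RBQI
    applies a saturating logarithm. log10(1 + D) is an assumed form,
    chosen so RBQI grows with D and saturates for large D."""
    D = sum(float(np.sum(e)) for e in per_scale_exponents)
    p_detect = 1.0 - np.exp(-D)  # probability of detecting a difference
    rbqi = np.log10(1.0 + D)     # assumed saturating mapping
    return D, p_detect, rbqi
```

A distortion-free reconstruction yields $D=0$, hence a detection probability of 0 and an RBQI of 0; additional distortion past the saturation knee barely moves the index, mirroring the maximum-annoyance effect described above.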