Full Reference Objective Quality Assessment for Reconstructed Background Images

With an increased interest in applications that require a clean background image, such as video surveillance, object tracking, street view imaging and location-based services on web-based maps, multiple algorithms have been developed to reconstruct a background image from cluttered scenes. Traditionally, statistical measures and existing image quality techniques have been applied for evaluating the quality of the reconstructed background images. Though these quality assessment methods have been widely used in the past, their performance in evaluating the perceived quality of the reconstructed background image has not been verified. In this work, we discuss the shortcomings in existing metrics and propose a full reference Reconstructed Background image Quality Index (RBQI) that combines color and structural information at multiple scales using a probability summation model to predict the perceived quality in the reconstructed background image given a reference image. To compare the performance of the proposed quality index with existing image quality assessment measures, we construct two different datasets consisting of reconstructed background images and corresponding subjective scores. The quality assessment measures are evaluated by correlating their objective scores with human subjective ratings. The correlation results show that the proposed RBQI outperforms all the existing approaches. Additionally, the constructed datasets and the corresponding subjective scores provide a benchmark to evaluate the performance of future metrics that are developed to evaluate the perceived quality of reconstructed background images.

view imaging and location-based services on web-based maps [5], [6], and texturing 3D models obtained from multiple photographs or videos [7]. But acquiring a clean photograph of a scene is seldom possible. There are always some unwanted objects occluding the background of interest.
The technique of acquiring a clean background image by removing the occlusions using frames from a video or multiple views of a scene is known as background reconstruction or background initialization. Many algorithms have been proposed for initializing the background images from videos, for example, [8]-[14], and also from multiple images, such as [15]-[17].
Background initialization or reconstruction is crippled by multiple challenges. The pseudostationary background (e.g., waving trees, waves in water, etc.) poses additional challenges in separating the moving foreground objects from the relatively stationary background pixels. The illumination conditions can vary across the images thus changing the global characteristics of each image. The illumination changes cause local phenomena such as shadows, reflections and shading, which change the local characteristics of the background across the images or frames in a video. Finally, the removal of foreground objects from the scene creates holes in the background that need to be filled in with pixels that maintain the continuity of the background texture and structures in the recovered image. Thus the background reconstruction algorithms can be characterized by two main tasks: 1) foreground detection, in which the foreground is separated from the background by classifying pixels as foreground or background; 2) background recovery, in which the holes formed due to foreground removal are filled.
The performance of a background extraction algorithm depends on two factors: 1) its ability to detect the foreground objects in the scene and completely eliminate them; and 2) the perceived quality of the reconstructed background image. Traditional statistical techniques such as Peak Signal to Noise Ratio (PSNR), Average Gray-level Error (AGE), total number of error pixels (EPs), percentage of EPs (pEP), number of Clustered Error Pixels (CEPs) and percentage of CEPs (pCEPs) [18] quantify, to a certain extent, the ability of the algorithm to remove foreground objects from a scene, but they do not give an indication of the perceived quality of the generated background image. On the other hand, the existing Image Quality Assessment (IQA) techniques such as the Multi-scale Structural Similarity metric (MS-SSIM) [19] and the Color image Quality Measure (CQM) [20], used by the authors in [21] to compare different background reconstruction algorithms, are not designed to identify any residual foreground objects in the scene. The lack of a quality metric that can reliably assess the performance of background reconstruction algorithms by quantifying both aspects of a reconstructed background image motivated the development of the proposed Reconstructed Background Quality Index (RBQI). RBQI uses the contrast, structure and color information to determine the presence of any residual foreground objects in the reconstructed background image as compared to the reference background image and to detect any unnaturalness introduced by the reconstruction algorithm that affects the perceived quality of the reconstructed background image. This paper also presents two datasets that are constructed to assess the performance of the proposed as well as popular existing objective quality assessment methods in predicting the perceived visual quality of the reconstructed background images.
The datasets consist of reconstructed background images generated using different background reconstruction algorithms in the literature along with the corresponding subjective ratings. Some of the existing datasets such as video surveillance datasets (Wallflower [22], I2R [23]), background subtraction datasets (UCSD [24], CMU [25]) and object tracking evaluation dataset ("Performance Evaluation of Tracking and Surveillance (PETS)") are not suited for this application as they do not provide reconstructed background images but just the foreground masks as ground-truth. The more recent database "Scene Background Modeling Net" (SBMNet) [26] is targeted at comparing the performance of the background initialization algorithms but it does not provide any subjective ratings for the reconstructed background images. Hence the SBMNet database [26] is not suited for benchmarking the performance of objective background visual quality assessment.
The datasets proposed in this work are the first and currently the only datasets that can be used for benchmarking existing and future metrics developed to assess the quality of reconstructed background images.
The rest of the paper is organized as follows. In Section II we highlight the limitations of existing popular assessment methods [27]. We introduce the new benchmarking datasets in Section III along with the details of the subjective tests. In Section IV, we propose a new index that makes use of a probability summation model to combine structure and color characteristics at multiple scales for quantifying the perceived quality in reconstructed background images. Performance evaluation results for the existing and proposed objective visual quality assessment methods are presented in Section V for reconstructed background images. Finally, we conclude the paper in Section VI and also provide directions for future research.

II. EXISTING FULL REFERENCE BACKGROUND QUALITY ASSESSMENT TECHNIQUES AND THEIR LIMITATIONS
Existing background reconstruction quality metrics can be classified into two categories: statistical and image quality assessment (IQA) techniques, depending on the type of features used for measuring the similarity between the reconstructed background image and reference background image.

A. Statistical Techniques
Statistical techniques use intensity values at co-located pixels in the reference and reconstructed background images to measure the similarity. Popular statistical techniques [18] that have been traditionally used for judging the performance of background initialization algorithms are briefly explained here.
(i) Average Gray-level Error (AGE): AGE is calculated as the average of the absolute differences between the gray levels of the co-located pixels in the reference and reconstructed background images.
(ii) Error Pixels (EP): EP gives the total number of error pixels. A pixel is classified as an error pixel if the absolute difference between the corresponding pixels in the reference and reconstructed background images is greater than an empirically selected threshold τ.
(iii) Percentage Error Pixels (pEP): Percentage of the error pixels, calculated as EP/N, where N is the total number of pixels in the image.
(iv) Clustered Error Pixels (CEP): CEP gives the total number of clustered error pixels. A clustered error pixel is defined as an error pixel whose 4-connected neighbors are also classified as error pixels.
(v) Percentage Clustered Error Pixels (pCEP): Percentage of the clustered error pixels, calculated as CEP/N, where N is the total number of pixels in the image.
Though these techniques have been used to judge the quality of the reconstructed background images, their performance has not been previously evaluated. As we show in Section V and as noted by the authors in [27], the statistical techniques were found to not correlate well with the subjective quality scores.
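As an illustration only, the five statistical measures listed above can be sketched in a few lines of Python. The function name `error_statistics`, the default threshold `tau=20`, the list-of-rows image representation and the treatment of border pixels (never counted as clustered) are assumptions made for this sketch and are not taken from [18].

```python
def error_statistics(ref, rec, tau=20):
    """Return AGE, EP, pEP, CEP and pCEP for two equally sized gray-level images.

    ref, rec: lists of rows of gray levels.
    tau: error-pixel threshold (hypothetical default, chosen for illustration).
    """
    h, w = len(ref), len(ref[0])
    n = h * w
    diff = [[abs(ref[y][x] - rec[y][x]) for x in range(w)] for y in range(h)]
    age = sum(map(sum, diff)) / n                    # average absolute gray-level error
    err = [[d > tau for d in row] for row in diff]   # error-pixel mask
    ep = sum(sum(row) for row in err)

    def is_clustered(y, x):
        # An error pixel whose four 4-connected neighbours are all error pixels.
        # Assumption: border pixels lack a full neighbourhood and are not counted.
        if not err[y][x] or y in (0, h - 1) or x in (0, w - 1):
            return False
        return err[y - 1][x] and err[y + 1][x] and err[y][x - 1] and err[y][x + 1]

    cep = sum(is_clustered(y, x) for y in range(h) for x in range(w))
    return {"AGE": age, "EP": ep, "pEP": ep / n, "CEP": cep, "pCEP": cep / n}


# Toy example: a 3x3 residual "foreground" blob in a flat 5x5 background.
ref = [[100] * 5 for _ in range(5)]
rec = [row[:] for row in ref]
for y in range(1, 4):
    for x in range(1, 4):
        rec[y][x] = 160
print(error_statistics(ref, rec))
```

In this toy case nine pixels exceed the threshold but only the centre of the blob has four erroneous neighbours, which shows how CEP emphasizes spatially coherent errors over isolated ones.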

B. Image Quality Assessment
The existing Full Reference Image Quality Assessment (FR-IQA) techniques use perceptually inspired features for measuring the similarity between two images. Though these techniques have been shown to work reasonably well while assessing images affected by distortions such as blur, compression artifacts and noise, they have not been designed for assessing the quality of reconstructed background images. In [21], popular FR-IQA techniques including Peak Signal to Noise Ratio (PSNR), Multi-scale Structural Similarity metric (MS-SSIM) [19] and Color image Quality Measure (CQM) [20] were adopted for objectively comparing the performance of the different background reconstruction algorithms; however, no performance evaluation was carried out to support the choice of these techniques. Other popular IQA techniques include Structural Similarity Index (SSIM) [28], visual signal-to-noise ratio (VSNR) [29], visual information fidelity (VIF) [30], pixel-based VIF (VIFP) [30], universal quality index (UQI) [31], information fidelity criterion (IFC) [32], noise quality measure (NQM) [33], weighted signal-to-noise ratio (WSNR) [34], feature similarity index (FSIM) [35], FSIM with color (FSIMc) [35], spectral residual based similarity (SR-SIM) [36] and saliency-based SSIM (SalSSIM) [37]. The suitability of these techniques for evaluating the quality of reconstructed background images remains unexplored.
As the first contribution of this paper we present two benchmarking datasets that can be used for comparing the performance of different techniques in objectively assessing the perceived quality of the reconstructed background images. These datasets contain reconstructed background images along with their subjective ratings, details of which are discussed in Section III-A. When the statistical and IQA techniques were tested on these datasets, none of the techniques were found to correlate well with the subjective scores as discussed in Section V. This motivated our second contribution, the objective Reconstructed Background Quality Index (RBQI) that is shown to outperform all the existing techniques in assessing the perceived visual quality of reconstructed background images.

A. Databases
In this section we present two different datasets constructed as part of this work to serve as benchmarks for comparing existing and future techniques developed for assessing the quality of   background image that is free of any foreground objects is also captured for every scene. Figure 1 shows the reference images corresponding to each of the eight different scenes in this database.
Each of the image sequences is used as input to twelve different background reconstruction algorithms [8]- [17]. The 144 (8 × 12) background images generated by these algorithms along with the corresponding reference images for the scene are then used for the subjective evaluation.
Each of the scenes poses a different challenge for the background reconstruction algorithms. For example, "Street" and "Wall" are outdoor sequences with textured backgrounds while "Hall" is an indoor sequence with a textured background. The "WetFloor" sequence challenges the underlying principle of many background reconstruction algorithms with water appearing as a low-contrast foreground object. The "Escalator" sequence has large motion in the background due to the moving escalator, while "Park" has smaller motion in the background due to waving trees. The "Illumination" sequence exhibits changing light sources, directions and intensities while the "Building" sequence has changing reflections in the background. Broadly, the dataset contains two categories based on the scene characteristics: (i) Static, the scenes for which all the pixels in the background are stationary; and (ii) Dynamic, the scenes for which there are non-stationary background pixels (e.g., moving escalator, waving trees, varying reflections). Four out of the eight scenes in the ReBaQ dataset are categorized as Static and the remaining four are categorized as Dynamic scenes. The reference background images corresponding to the static scenes are shown in Figure 1
(g) Very Long category contains sequences each with more than 3,500 images; (h) Very Short category contains sequences with a limited number of images (fewer than 20). The authors of SBMNet [26] provide reference background images for only 13 scenes out of the 79 scenes.
There is at least one scene corresponding to each category with reference background image available. We use only these 13 scenes for which the reference background images are provided.  the categories from SBMNet [26] in brackets. Background images that were reconstructed by 14 algorithms submitted to SBMC [12], [16], [39]- [48] corresponding to the selected 13 scenes were used in this work for conducting subjective tests. As a result, a total of 182 (13 × 14) reconstructed background images along with their corresponding subjective scores form the S-ReBaQ dataset.

B. Subjective Evaluation
The subjective ratings are obtained by asking the human subjects to rate the similarity of the reconstructed background images to the reference background images. The subjects had to  Figure 4, if the image quality was rated as excellent but the foreground object visibility was rated 1 (all visible), the reconstructed background quality cannot be scored to be very high. The background reconstruction quality scores, referred to as raw scores in the rest of the paper, are used for calculating the Mean Opinion Score (MOS).
We adopted a double-stimulus technique in which the reference and the reconstructed background images were presented side-by-side [49] to each subject as shown in Figure 3   Though the same testing strategy and set up was used for the ReBaQ and S-ReBaQ datasets described in Section III-A, the tests for each dataset were conducted in separate sessions.
As discussed in [27], the subjective experiments were carried out on a 23-inch Alienware monitor with a resolution of 1920x1080. Before the experiment, the monitor was reset to its factory settings. The setup was placed in a laboratory under normal office illumination conditions.
Subjects were asked to sit at a viewing distance of 2.5 times the monitor height. Since the number of participating subjects was less than 20 for each of the datasets, the raw scores obtained by subjective evaluation were screened using the procedure in ITU-R BT.500-13.

[Figure: (a) Four out of eight images from the input sequence "Escalator". (b) Background images reconstructed by different algorithms and corresponding MOS scores: [11], MOS=1.5882; [12], MOS=2.2353; [13], MOS=2.2941; [16], MOS=4.1176.]

An image can be decomposed into three different components: luminance, contrast and structure [28]. By comparing these components, similarity between two images can be calculated [19], [28]. A reconstructed background image is formed by mosaicing together parts of different input images; hence, preservation of the local luminance from the reference background image is of low relevance as long as the structure continuity is maintained. Any sudden variation in the local luminance across the reconstructed background image manifests itself as contrast or structure deviation from the reference image. Thus, in our application we consider only contrast and structure for comparing the reference and reconstructed background images while leaving out the luminance component. These contrast and structure differences between the reference and the reconstructed background images, calculated at each pixel, give us the 'contrast-structure difference map' referred to as 'structure map' for short in the rest of the paper.
First the structure similarity between the reference and the reconstructed background image, referred to as Structure Index (SI), is calculated using [28]:

$$SI(x, y) = \frac{2\sigma_{r(x,y)\,i(x,y)} + C}{\sigma_{r(x,y)}^2 + \sigma_{i(x,y)}^2 + C} \tag{1}$$

where $r$ is the reference background image, $i$ is the reconstructed background image, and $\sigma_{r(x,y)}$ and $\sigma_{i(x,y)}$ are the standard deviations of the reference and reconstructed background images, respectively, at location $(x, y)$. $\sigma_{r(x,y)\,i(x,y)}$ is the cross-correlation between the reference and reconstructed background images at location $(x, y)$. $C$ is a small constant to avoid instability and is calculated as $C = (K \cdot l)^2$, where $K$ is set to 0.03 and $l$ is the maximum possible value of the pixel intensity (255 in this case) [28]. A higher SI value indicates higher similarity between the pixels in the reference and reconstructed background images.
The background scenes often contain pseudo-stationary elements such as waving trees, escalators, and local and global illumination changes. Even though these pseudo-stationary pixels belong to the background, because of the presence of motion, they are likely to be classified as foreground pixels. For this reason the pseudo-stationary backgrounds pose an additional challenge for the quality assessment algorithms. Just comparing co-located pixel neighborhoods in the two considered images is not sufficient in the presence of such dynamic backgrounds; our algorithm therefore uses a search window of size nhood × nhood centered at the current pixel $(x, y)$ in the reconstructed image, where nhood is an odd value. The SI is calculated between the pixel at location $(x, y)$ in the reference image and the $(nhood)^2$ pixels within the nhood × nhood search window centered at pixel $(x, y)$ in the reconstructed image. The resulting SI matrix is of size nhood × nhood. The modified Equation (1) to calculate SI for every pixel location in the nhood × nhood window centered at $(x, y)$ is given as:

$$SI_{(x,y)}(m, n) = \frac{2\sigma_{r(x,y)\,i(m,n)} + C}{\sigma_{r(x,y)}^2 + \sigma_{i(m,n)}^2 + C} \tag{2}$$

where
$m = x - (nhood - 1)/2 : x + (nhood - 1)/2$,
$n = y - (nhood - 1)/2 : y + (nhood - 1)/2$.

The maximum value of the SI matrix is taken to be the final SI value for the pixel at location $(x, y)$ as given below:

$$SI(x, y) = \max_{(m,n)} SI_{(x,y)}(m, n) \tag{3}$$

The SI map takes on values in $[-1, 1]$. In the proposed method, the SI map is computed at $L$ different scales denoted as $SI_l(x, y)$, $l = 0, ..., L - 1$. The quality maps generated at three different scales for the background image shown in Figure 5(b) and reconstructed using the method in [12] are shown in Figure 7. The darker regions in these images indicate larger structure differences between the reference and the reconstructed background images while the lighter regions indicate higher similarities.

[Figure: structure map for the background image shown in Figure 5(b) and reconstructed using the method in [12]. The darker regions indicate larger structure differences between the reference and the reconstructed background images.]
The structure difference map is calculated using the SI map at each scale $l$ as follows:

$$d_{s,l}(x, y) = \frac{1 - SI_l(x, y)}{2} \tag{4}$$

$d_{s,l}$ takes on values in $[0, 1]$, where a value of 0 corresponds to no difference and a value of 1 corresponds to the largest difference.
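The neighborhood search described above can be illustrated with the following minimal single-scale Python sketch. It makes several assumptions that are not stated in the text: local statistics are computed over an unweighted square patch of half-width `half` (SSIM implementations commonly use a Gaussian window instead), candidate positions whose patch leaves the image are skipped, and SI is mapped linearly from [-1, 1] to [0, 1] as (1 - SI)/2, which is consistent with the stated ranges. All function names are hypothetical.

```python
def patch(img, y, x, half):
    """Flattened (2*half+1)^2 patch centred at row y, column x; None if it leaves the image."""
    h, w = len(img), len(img[0])
    if y - half < 0 or x - half < 0 or y + half >= h or x + half >= w:
        return None
    return [img[j][i] for j in range(y - half, y + half + 1)
            for i in range(x - half, x + half + 1)]


def structure_index(p, q, c=(0.03 * 255) ** 2):
    """Contrast-structure similarity of two equally sized patches; C = (K*l)^2."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    var_p = sum((a - mp) ** 2 for a in p) / n
    var_q = sum((b - mq) ** 2 for b in q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q)) / n
    return (2 * cov + c) / (var_p + var_q + c)


def structure_difference(ref, rec, y, x, nhood=3, half=1):
    """d_s at (y, x): best SI over an nhood x nhood search window in rec, mapped to [0, 1]."""
    p = patch(ref, y, x, half)
    if p is None:
        raise ValueError("reference patch does not fit inside the image")
    r = (nhood - 1) // 2
    best = -1.0
    for n in range(y - r, y + r + 1):        # rows of the search window
        for m in range(x - r, x + r + 1):    # columns of the search window
            q = patch(rec, n, m, half)
            if q is not None:
                best = max(best, structure_index(p, q))
    return (1.0 - best) / 2.0


# A vertical edge that is displaced by one pixel in the reconstructed image,
# mimicking a small pseudo-stationary shift in the background.
ref = [[200 if x >= 3 else 50 for x in range(7)] for _ in range(7)]
rec = [[200 if x >= 4 else 50 for x in range(7)] for _ in range(7)]
print(structure_difference(ref, rec, 3, 3, nhood=1))  # co-located comparison only
print(structure_difference(ref, rec, 3, 3, nhood=3))  # with neighbourhood search
```

With nhood = 1 the displaced edge is reported as a clear structure difference, whereas the 3 × 3 search finds the matching patch one pixel away and the difference vanishes, which is the behaviour the search window is intended to provide for dynamic backgrounds.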

B. Color Distance (d c )
The $d_{s,l}$ map is vulnerable to failures while detecting differences in areas of background images with no textures or no structural information and/or with objects of the same luminance but different color. Hence we incorporate the color information at every scale while calculating the RBQI. The reference and the reconstructed images are converted to the Lab color space and filtered using a lowpass Gaussian filter. The color difference between the filtered reference and reconstructed background images at each scale $l$ is then calculated as the Euclidean distance between the values of co-located pixels as follows:

$$d_{c,l}(x, y) = \sqrt{(L_r(x, y) - L_i(x, y))^2 + (a_r(x, y) - a_i(x, y))^2 + (b_r(x, y) - b_i(x, y))^2} \tag{5}$$

In (5), the scale index $l$ was dropped from the notation of the Lab color space components for convenience.
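A minimal sketch of the per-pixel color distance follows. It assumes the two images have already been converted to the Lab color space and lowpass filtered, and that each image is stored as a list of rows of (L, a, b) tuples; the function name is hypothetical and the color conversion and Gaussian filtering steps are deliberately left out.

```python
import math


def color_distance_map(lab_ref, lab_rec):
    """Per-pixel Euclidean distance between two Lab images of identical size."""
    return [[math.dist(p, q) for p, q in zip(row_ref, row_rec)]
            for row_ref, row_rec in zip(lab_ref, lab_rec)]


# One row, two pixels: the first differs in L and a, the second is identical.
lab_ref = [[(50.0, 0.0, 0.0), (70.0, 10.0, -5.0)]]
lab_rec = [[(53.0, 4.0, 0.0), (70.0, 10.0, -5.0)]]
print(color_distance_map(lab_ref, lab_rec))
```

Because the distance is taken in Lab rather than on gray levels, two regions with the same luminance but different chroma still produce a non-zero entry, which is the case the structure map alone cannot capture.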

C. Computation of the Reconstructed Background Quality Index (RBQI) based on Probability Summation
As indicated previously, the reference and reconstructed background images are decomposed each into a multi-scale pyramid with L levels. Structure difference maps d s,l and color difference maps d c,l are computed at every level l = 0, ..., L − 1 as described in Equations (4) and (5), respectively. These difference maps are pooled together within the scale and later across all scales using a probability summation model [52] to give the final RBQI.
The probability summation model as described in [52] considers an ensemble of independent difference detectors at every pixel location in the image. These detectors predict the probability of perceiving the difference between the reference and the reconstructed background images at the corresponding pixel location based on its neighborhood characteristics in the reference image. Using this model, the probability of the structure difference detector signaling the presence of a structure difference at pixel location $(x, y)$ at level $l$ can be modeled as an exponential of the form:

$$P_{D,s,l}(x, y) = 1 - \exp\left(-\left|\frac{d_{s,l}(x, y)}{\alpha_{s,l}(x, y)}\right|^{\beta_s}\right) \tag{6}$$

where $\beta_s$ is a parameter chosen to increase the correspondence of RBQI with experimentally determined MOS scores on a training dataset and $\alpha_{s,l}(x, y)$ is a parameter whose value depends upon the texture characteristics of the neighborhood centered at $(x, y)$ in the reference image. The value of $\alpha_{s,l}(x, y)$ is chosen to take into account that differences in structure are less perceptible in textured areas as compared to non-textured areas.
In order to determine the value of α, every pixel in the reference image is classified as textured or non-textured using the technique in [53]. This method first calculates the local variance at each pixel using a 3x3 window centered around it. Based on the computed variances a pixel is classified as edge, texture or uniform. By considering the number of edge, texture and uniform pixels in the 8x8 neighborhood of the pixel, it is further classified into one of the six types: uniform, uniform/texture, texture, edge/texture, medium edge and strong edge. For our application we label the pixels classified as 'texture' and 'edge/texture' as 'textured' pixels and we label the rest as 'non-textured' pixels.
Let $f_{tex}(x, y) = 1$ be the flag indicating that the pixel at $(x, y)$ is textured. The values of $\alpha_{s,l}(x, y)$ can then be expressed as a function of this flag (Equation (7)), with a constant $a$ assigned to textured pixels. In our implementation we chose the value of $a = 1000.0$, resulting in a value of $P_{D,s,l}$ close to zero when a pixel is classified as textured.
Similarly, the probability of the color difference detector signaling the presence of a color difference at pixel location $(x, y)$ at level $l$ can be modeled as:

$$P_{D,c,l}(x, y) = 1 - \exp\left(-\left|\frac{d_{c,l}(x, y)}{\alpha_{c,l}(x, y)}\right|^{\beta_c}\right) \tag{8}$$

where $\beta_c$ is found in a similar way to $\beta_s$ and $\alpha_{c,l}(x, y)$ corresponds to the Adaptive Just Noticeable Color Difference (AJNCD) calculated at every pixel $(x, y)$ in the Lab color space as given in [54]:

$$\alpha_{c,l}(x, y) = JNCD_{Lab} \cdot s_L(E(L_l(x, y)), \Delta L_l(x, y)) \cdot s_C(a_l(x, y), b_l(x, y)) \tag{9}$$

where $JNCD_{Lab}$ is set to 2.3 [55], $E(L_l)$ is the mean background luminance of the pixel at $(x, y)$ and $\Delta L$ is the maximum luminance gradient across pixel $(x, y)$. In Equation (9), $s_C$ is the scaling factor used for adjusting the dimension of the ellipsoid along the chroma axis; it is given in [54] (Equation (10)) as a function of $a_l(x, y)$ and $b_l(x, y)$, which correspond to the a and b color values of the pixel located at $(x, y)$ in the Lab color space, respectively. $s_L$ is the scaling factor that simulates the local luminance texture masking; it is given in [54] (Equation (11)) in terms of $\rho(E(L_l))$, the weighting factor described in [54]. Thus, $\alpha_{c,l}$ varies at every pixel location based on the distance between the chroma values and texture masking properties of its neighborhood.
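The behaviour of the two detectors can be explored numerically with the short sketch below. It assumes the exponential psychometric form 1 - exp(-|d/α|^β) for both detectors. The helper `structure_alpha` and in particular its value `alpha_flat = 1.0` for non-textured pixels are assumptions made for illustration, and the color threshold is taken as the bare JNCD value of 2.3 with both scaling factors set to 1.

```python
import math


def detection_probability(d, alpha, beta):
    """1 - exp(-|d / alpha| ** beta), using expm1 to stay accurate for tiny values."""
    return -math.expm1(-abs(d / alpha) ** beta)


def structure_alpha(textured, a=1000.0, alpha_flat=1.0):
    """Texture-dependent threshold; alpha_flat is a hypothetical non-textured value."""
    return a if textured else alpha_flat


beta = 3.5
d_s = 0.5  # the same structure difference observed in two different neighbourhoods
print(detection_probability(d_s, structure_alpha(False), beta))  # non-textured area
print(detection_probability(d_s, structure_alpha(True), beta))   # textured area
print(detection_probability(2.3, 2.3, beta))                     # colour difference at threshold
```

The second value is many orders of magnitude smaller than the first, reflecting the masking of structure differences in textured areas, while a color difference equal to its threshold is detected with probability 1 - 1/e regardless of β.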
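A pixel $(x, y)$ at the $l$-th level is said to have no distortion if and only if neither the structure difference detector nor the color difference detector at location $(x, y)$ signals the presence of a difference. Thus, the probability of detecting no difference between the reference and reconstructed background images at pixel $(x, y)$ and level $l$ can be written as:

$$P_{ND,l}(x, y) = \left(1 - P_{D,s,l}(x, y)\right) \cdot \left(1 - P_{D,c,l}(x, y)\right) \tag{12}$$

Substituting Equation (6) and Equation (8) for $P_{D,s,l}$ and $P_{D,c,l}$, respectively, in the above equation, we get:

$$P_{ND,l}(x, y) = \exp\left(-\left|\frac{d_{s,l}(x, y)}{\alpha_{s,l}(x, y)}\right|^{\beta_s}\right) \cdot \exp\left(-\left|\frac{d_{c,l}(x, y)}{\alpha_{c,l}(x, y)}\right|^{\beta_c}\right) \tag{13}$$

A less localized probability of difference detection can be computed by adopting the probability summation hypothesis, which pools the localized detection probabilities over a region $R$ [52]. The probability summation hypothesis is based on the following two assumptions: 1) no difference is detected if none of the detectors in the region $R$ senses the presence of distortion, and 2) the probabilities of detection at all locations in the region $R$ are independent. Then the probability of no difference detection over the region $R$ is given by:

$$P_{ND,l}(R) = \prod_{(x,y) \in R} P_{ND,l}(x, y) \tag{14}$$

Substituting Equation (12) in the above equation gives:

$$P_{ND,l}(R) = \exp\left(-D_{s,l}(R)^{\beta_s}\right) \cdot \exp\left(-D_{c,l}(R)^{\beta_c}\right) \tag{15}$$

where

$$D_{s,l}(R) = \left[\sum_{(x,y) \in R} \left|\frac{d_{s,l}(x, y)}{\alpha_{s,l}(x, y)}\right|^{\beta_s}\right]^{1/\beta_s} \tag{16}$$

$$D_{c,l}(R) = \left[\sum_{(x,y) \in R} \left|\frac{d_{c,l}(x, y)}{\alpha_{c,l}(x, y)}\right|^{\beta_c}\right]^{1/\beta_c} \tag{17}$$

In the human visual system, the highest visual acuity is limited to the size of the foveal region, which covers approximately 2° of visual angle. In our work, we consider the image regions $R$ as foveal regions approximated by 8 × 8 non-overlapping image blocks.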
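The probability of no distortion detection over the $l$-th level is obtained by pooling the no detection probabilities over all the regions $R$ and is given by:

$$P_{ND}(l) = \prod_{R} P_{ND,l}(R) \tag{18}$$

or

$$P_{ND}(l) = \exp\left(-D_s(l)^{\beta_s}\right) \cdot \exp\left(-D_c(l)^{\beta_c}\right) \tag{19}$$

where

$$D_s(l) = \left[\sum_{R} D_{s,l}(R)^{\beta_s}\right]^{1/\beta_s} \tag{20}$$

$$D_c(l) = \left[\sum_{R} D_{c,l}(R)^{\beta_c}\right]^{1/\beta_c} \tag{21}$$

Thus the final probability of detecting no distortion in a reconstructed background image $i$ is obtained by pooling the no detection probabilities $P_{ND}(l)$ over all scales $l$, $l = 0, ..., L - 1$, as follows:

$$P_{ND}(i) = \prod_{l=0}^{L-1} P_{ND}(l) \tag{22}$$

or

$$P_{ND}(i) = \exp\left(-D_s^{\beta_s}\right) \cdot \exp\left(-D_c^{\beta_c}\right) \tag{23}$$

where

$$D_s = \left[\sum_{l=0}^{L-1} D_s(l)^{\beta_s}\right]^{1/\beta_s} \tag{24}$$

$$D_c = \left[\sum_{l=0}^{L-1} D_c(l)^{\beta_c}\right]^{1/\beta_c} \tag{25}$$

From Equation (24), it can be seen that $D_s$ and $D_c$ take the form of a Minkowski metric with exponent $\beta_s$ and $\beta_c$, respectively.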
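By substituting the values $D_s$, $D_c$, $D_s(l)$, $D_c(l)$, $D_{s,l}(R)$ and $D_{c,l}(R)$ in Equation (23) and simplifying, we get:

$$P_{ND}(i) = \exp(-D) \tag{26}$$

where

$$D = \sum_{l=0}^{L-1} \sum_{R} \sum_{(x,y) \in R} \left[\left|\frac{d_{s,l}(x, y)}{\alpha_{s,l}(x, y)}\right|^{\beta_s} + \left|\frac{d_{c,l}(x, y)}{\alpha_{c,l}(x, y)}\right|^{\beta_c}\right] \tag{27}$$

Thus the probability of detecting a difference between the reference image and a reconstructed background image $i$ is given as:

$$P_D(i) = 1 - P_{ND}(i) = 1 - \exp(-D) \tag{28}$$

As can be seen from Equation (28), a lower value of $D$ results in a lower probability of difference detection $P_D(i)$ while a higher value results in a higher probability of difference detection.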
Therefore, D can be used to assess the perceived quality in the reconstructed background image, with a lower value of D corresponding to a higher perceived quality.
The final Reconstructed Background Quality Index (RBQI) for a reconstructed background image is calculated using the logarithm of $D$ (Equation (29)). As $D$ increases, the value of RBQI increases, implying more perceived distortion and thus lower quality of the reconstructed background image. The logarithmic mapping models the saturation effect, i.e., beyond a certain point the maximum annoyance level is reached and more distortion does not affect the quality.
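The equivalence between pooling by probability summation and the single summed quantity D can be checked numerically with the sketch below. It assumes exponential detection probabilities of the form 1 - exp(-|d/α|^β) at every pixel. Because the same exponents are used within regions and across scales, the grouping of pixels into regions does not change D, so each scale is represented simply as a flat list of hypothetical (d_s, α_s, d_c, α_c) tuples. The function names and the numerical values are illustrative only, and the final logarithmic mapping to RBQI is not reproduced here.

```python
import math


def pooled_difference(levels, beta_s=3.5, beta_c=3.5):
    """D: sum over scales and pixels of |d_s/alpha_s|^beta_s + |d_c/alpha_c|^beta_c."""
    return sum(abs(ds / a_s) ** beta_s + abs(dc / a_c) ** beta_c
               for level in levels
               for ds, a_s, dc, a_c in level)


def no_detection_product(levels, beta_s=3.5, beta_c=3.5):
    """Direct product of the per-pixel no-detection probabilities over all scales."""
    p = 1.0
    for level in levels:
        for ds, a_s, dc, a_c in level:
            p *= math.exp(-abs(ds / a_s) ** beta_s) * math.exp(-abs(dc / a_c) ** beta_c)
    return p


# Two scales with made-up per-pixel values (d_s, alpha_s, d_c, alpha_c).
levels = [
    [(0.5, 1.0, 2.3, 2.3), (0.1, 1000.0, 0.0, 2.3)],  # finest scale
    [(0.2, 1.0, 1.0, 2.3)],                           # coarser scale
]
D = pooled_difference(levels)
print(D, math.exp(-D), no_detection_product(levels))
print("probability of detecting a difference:", 1.0 - math.exp(-D))
```

The two no-detection values agree to machine precision, which is why the full pyramid of per-pixel detectors never has to be multiplied out explicitly: accumulating the exponents is sufficient.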

V. RESULTS
In this section we analyze the performance of RBQI in terms of its ability to predict the subjective ratings for the perceived quality of reconstructed background images. We evaluate the performance of the proposed quality index in terms of its prediction accuracy, prediction monotonicity and prediction consistency and provide comparisons with the existing statistical and IQA techniques. In our implementation, we set nhood = 17, L = 3 and β s = β c = 3.5. We also evaluate the performance of RBQI for different scales and neighborhood search windows.
We conduct a series of hypothesis tests based on the prediction residuals (errors in predictions) after nonlinear regression. These tests help in making statistically meaningful conclusions on the index's performance.
We use the two databases ReBaQ and S-ReBaQ described in Section III-A to quantify and compare the performance of RBQI. For performance evaluation, we employ the three most commonly used metrics: (i) Spearman rank-order correlation coefficient (SROCC); (ii) Pearson correlation coefficient (PCC); and (iii) root mean squared error (RMSE). A 4-parameter regression function [56], referred to as Equation (30), is applied to the IQA metrics to provide a non-linear mapping between the objective scores and the subjective mean opinion scores (MOS). In (30), $M_i$ denotes the predicted quality for the $i$th image, $MOS_{p_i}$ denotes the quality score after fitting, and $\gamma_n$, $n = 1, 2, ..., 4$, are the regression model parameters. Figure 8 shows the scatter plots of MOS versus the prediction scores using the proposed technique along with the corresponding fitting curve calculated using (30). Tables I and II show  [21] to compare the performance of the algorithms is shown to perform very poorly on all three datasets and hence is not a good choice for evaluating the quality of reconstructed background images and thus is not suitable for comparing the performance of background reconstruction algorithms.
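For reference, the three evaluation criteria can be computed as in the following dependency-free sketch. The function names are hypothetical, ties are handled with average ranks, and the nonlinear regression step that precedes the Pearson and RMSE computations is omitted, so the inputs to `pearson` and `rmse` are assumed to be already-mapped scores. The numerical values are made up for illustration and are not results from the datasets.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (prediction accuracy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(v):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(v)), key=lambda k: v[k])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rank-order correlation coefficient (prediction monotonicity)."""
    return pearson(ranks(x), ranks(y))


def rmse(pred, target):
    """Root mean squared error between fitted scores and MOS."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


# Made-up objective scores (higher = more distortion) and MOS (higher = better).
scores = [0.1, 0.4, 0.9, 1.6, 2.5]
mos = [4.8, 4.1, 3.0, 2.2, 1.1]
print(spearman(scores, mos), pearson(scores, mos))
print(rmse([4.5, 4.0, 3.1, 2.0, 1.3], mos))
```

Since a distortion-type index decreases as MOS increases, a well-behaved index yields a rank correlation close to -1 before fitting; correlation magnitudes are what is typically compared across metrics.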

A. Performance Comparison
The P-value is the probability of obtaining a correlation as large as the observed value by random chance when the variables are in fact independent. If the P-value is less than 0.05, the correlation is considered significant. The P-values ($P_{PCC}$ and $P_{SROCC}$) reported in Tables I and II indicate that most of the correlation scores are statistically significant.

B. Model Parameter Selection
The proposed quality index accepts four parameters: 1) nhood, dimensions of the window centered around the current pixel for calculating the d s ; 2) L, number of multi-scale levels;  3) β s , used in the calculation of P D,s,l (x, y) in Equation (6); and 4) β c , used in the calculation of P D,c,l (x, y) in Equation (8). In Table III, we evaluate our algorithm with different values for the parameters. These simulations were run only on the ReBaQ dataset. for all our experiments. Table III(b) gives performance results for different number of scales. As a tradeoff between the computation complexity and prediction accuracy we chose the number of scales to be L = 3. The probability summation model parameters β s and β c were found such that they maximized the correlation between RBQI and MOS scores on a training dataset consisting of randomly selected images from the ReBaQ dataset. Values β s = β c = 3.5 were found to correlate well with the subjective tests.
These parameters remained unchanged for the experiments conducted on the S-ReBaQ dataset to obtain the values in Table II.

VI. CONCLUSION
In this paper we addressed the problem of quality evaluation of reconstructed background images. We first proposed two different datasets for benchmarking the performance of existing and future techniques proposed to evaluate the quality of reconstructed background images.
Then we proposed the first full-reference Reconstructed Background Quality Index (RBQI) to objectively measure the perceived quality of the reconstructed background images.
The RBQI uses the probability summation model to combine visual characteristics at multiple scales to quantify the deterioration in the perceived quality of the reconstructed background image due to the presence of any foreground objects or unnaturalness that may be introduced by the background reconstruction algorithm. The use of a neighborhood search window while calculating the contrast and structure differences provides a further boost in performance in the presence of pseudo-stationary backgrounds while not affecting the performance on scenes with static backgrounds. The probability summation model penalizes only the perceived differences between the reference and reconstructed background images while the unperceived differences do not affect the RBQI, thereby giving better correlation with the subjective scores. Experimental results on the benchmarking datasets showed that the proposed measure outperformed all the existing statistical and IQA techniques in estimating the perceived quality of reconstructed background images.