Hybrid Fusion-Based Background Segmentation in Multispectral Polarimetric Imagery

Abstract: Multispectral Polarimetric Imagery (MSPI) contains significant information about an object's distribution, shape, shading, texture and roughness features, which can distinguish between foreground and background in a complex scene. As spectral signatures are limited to material properties, Background Segmentation (BS) is a difficult task when there are shadows, illumination changes and clutter in a scene. In this work, we propose a two-fold BS approach: multiband image fusion and polarimetric BS. Firstly, considering that the background in a scene is polarized by nature, the spectral reflectance, correlations and textural features of MSPI are calculated and analyzed to demonstrate the significance of fusion. After that, integrating Principal Component Analysis (PCA) with the Fast Fourier Transform (FFT), a hybrid fusion technique is proposed to show the effectiveness of multiband fusion. Secondly, utilizing the Stokes vector, polarimetric components are calculated to separate a complex scene's background from its foreground by constructing four significant foreground masks. An intensity-invariant mask is built by differentiating between the median-filtered versions of the unpolarized and polarized images. A strongly unpolarized foreground mask is also constructed in two different ways, through analyzing the Angle of Linear Polarization (AoLP) and the Degree of Linear Polarization (DoLP). Moreover, a strongly polarized mask and a strong light intensity mask are calculated based on the azimuth angle and the total light intensity. Finally, all these masks are combined, and a morphological operation is applied to segment the final background area of a scene. The proposed two-fold BS algorithm is evaluated using distinct statistical measurements and compared with the well-known fusion and BS methods highlighted in this paper. The experimental results demonstrate that the proposed hybrid fusion method significantly improves multiband fusion quality.
Furthermore, the proposed polarimetric BS approach also improves the mean accuracy, geometric mean and F1-score to 0.95, 0.93 and 0.97, respectively, for scenes in the MSPI dataset compared with those obtained from the methods in the literature considered in this paper. Future work will investigate mixed polarized and unpolarized BS in the MSPI dataset with specular reflection.


Introduction
The emerging significance of Multispectral Polarimetric Imagery (MSPI) has been actively pursued in diverse applications over the last few decades. Using advanced communication tools and techniques, multispectral images are currently composed of several sub-bands from both the visible and near infrared (NIR) wavelengths [1], with the latter material-dependent and capable of penetrating deeper into materials. Potential applications include imaging systems that perform image denoising [2], image dehazing [3] and semantic segmentation [4]. Multispectral imaging is also commonly reported in the literature for enhancing color reproduction [5] and illuminant estimation [6]. In this paper, the literature related to BS is reviewed in terms of multispectral fusion and polarimetric BS.

Fusion
In general, multispectral images contain both spectral and spatial redundancies. The former is defined by a high correlation among multiple spectra and the latter by the correlation among neighboring pixels. Multispectral fusion aims to reduce both these redundancies while destroying as little as possible of the original image's information and is usually performed by transform compression. Individual techniques and their performances related to visible and infrared image fusion are discussed in recent literature [26][27][28]. Although, in a lossless image compression technique, images remain almost identical to their original versions, a lossy one achieves a high compression ratio. An example is PCA [29][30][31][32][33], which is based on dimensionality reduction and is calculated according to the covariance matrix (C):

C_ij = (1/(P − 1)) (X_i − X̄_i)^T (X_j − X̄_j), i, j = 1, ..., N

where N is the number of spectral bands, X_n is a vector containing the P pixel intensities of the n-th band image and X̄_n is the mean of X_n. The eigenvectors and eigenvalues are then computed from C, after which the PCA scores are calculated and multiplied by the original pixel intensities of the band images to obtain the first PC as a fused image [28]. Through applying PCA, Bavirisetti [34] proposed edge-preserving Fourth-order Partial Differential Equations (FPDEs) to fuse multisensor images. Despite the advantage of energy compaction of the decorrelated spectral bands, the dependence on the data to calculate C among the spectral bands is a disadvantage of PCA. Due to the energy balance and registration problems existing in a multispectral data acquisition system, the appearances of objects demonstrate low correlations in different spectral bands [35]. However, multispectral pixel-level and transform domain-based fusion can effectively present an overall dataset.
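As an illustration, the first-PC weighting behind PCA fusion can be sketched in a few lines of NumPy. This is a minimal sketch with our own function names; taking the absolute, normalized entries of the leading eigenvector as fusion weights is our simplifying assumption, not a detail stated in the paper.

```python
import numpy as np

def pca_fuse(bands):
    """Fuse a list of same-sized band images into one image using the
    first principal component of the band covariance matrix."""
    X = np.stack([b.ravel().astype(float) for b in bands])   # N x P data matrix
    Xc = X - X.mean(axis=1, keepdims=True)                   # center each band
    C = np.cov(Xc)                                           # N x N covariance among bands
    eigvals, eigvecs = np.linalg.eigh(C)
    w = np.abs(eigvecs[:, -1])                               # first PC (largest eigenvalue)
    w = w / w.sum()                                          # normalize to fusion weights
    return np.tensordot(w, np.stack([b.astype(float) for b in bands]), axes=1)
```

Each pixel of the fused image is thus a convex combination of the band pixels, weighted by how much each band contributes to the dominant direction of variation.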
The traditional pixel-level fusion average (AVG) method [36] contains the information of multiple spectra and, although it is simple and fast, each band's image participates equally in the fused image without its information content being considered. In contrast, transform domain-based fusion methods first decompose the source images into a sequence of sub-band ones through a specific mathematical transform structure and then apply some fusion rules to determine the so-called sub-band coefficients [3]. Finally, a fused image is formed using the corresponding inverse transform. Mathematical transforms well known as Multiscale Geometric Analyses (MGAs) include the Discrete Wavelet Transform (DWT) [37][38][39], the Shearlet Transform (SHT) and the Discrete Cosine Transform-based Laplacian Pyramid (DCT-LP). The DWT generates four different coefficients: approximation, horizontal detail, vertical detail and diagonal detail, with a mean-maximum rule applied to fuse those of multiple spectra. Finally, an inverse transform is applied on all the final coefficients [27]. Although the DWT provides a higher signal-to-noise ratio (SNR) and better temporal resolution than other transforms, its performance depends on the number of decomposition levels.
Remote Sens. 2020, 12, 1776
In the SHT [20], two significantly different images (A and B) are selected from the spectral bands and further decomposed into a few low- and many high-frequency coefficients (L_iA and H_iA, respectively). The mean of the low frequencies is the final low-frequency approximation coefficient (Equation (3)) and, by applying an area-based feature selection method (Equations (4) and (5)), the final high-frequency detail coefficients are calculated. Eventually, a fused image is obtained using an inverse SHT as follows.
L(x, y) = (L_A(x, y) + L_B(x, y)) / 2   (3)

where S is the weighted energy of the local window (Ω) calculated as:

S(x, y) = f(x, y)^2 · (1/Card(Ω)) Σ_{(m,n)∈Ω} f(m, n)^2   (5)

where Card(Ω) denotes the cardinality of the window (Ω), calculated by counting the number of distinct elements in the (m, n) coordinates. The advantage of the SHT is that it includes multiscale, shift-invariant and multidirectional image decompositions, but its computational complexity is significant.
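The area-based selection of Equations (4) and (5) can be sketched as follows, assuming the common rule that the high-frequency coefficient with the larger local window energy is retained. The function names and the 3 × 3 window are our own choices, not values from the paper.

```python
import numpy as np

def local_energy(f, win=3):
    """Mean squared value of f over a win x win neighborhood (Eq. (5) style)."""
    pad = win // 2
    fp = np.pad(f.astype(float) ** 2, pad, mode="reflect")
    # average the squared values over all offsets of the window
    S = sum(fp[i:i + f.shape[0], j:j + f.shape[1]]
            for i in range(win) for j in range(win)) / (win * win)
    return S

def select_high_freq(HA, HB, win=3):
    """Area-based selection: keep the coefficient whose local energy is larger."""
    SA, SB = local_energy(HA, win), local_energy(HB, win)
    return np.where(SA >= SB, HA, HB)
```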
In the DCT-LP [40], firstly, a Gaussian Pyramid (GP) is constructed with a reduced size of a band's image and the DCT and inverse DCT are applied on it, with an LP built by expanding the size of the previous image and then applying the DCT and inverse DCT on it. After constructing a k-level LP for the individual images (P_i^k → g_i^k, l_i^0, l_i^1, ..., l_i^(k−1), where i = 0, 1, ..., N), the fusion rule at the k-th level is

g_f^k = (1/N) Σ_i g_i^k   (6)

and, at the k − 1 to 0 levels, the fused approximation at level j is obtained by adding the fused detail coefficients l_f^j to the expanded coarser level E(g_f^(j+1)), where E(·) is an expand function and j = 0, 1, ..., N. Despite its signal extension capability, the DCT suffers from poor image quality.
Therefore, considering the pros and cons of individual methods, we propose a holistic hybrid fusion approach for the lossy compression of multispectral images with high spatial resolutions by integrating an FFT with PCA.

Segmentation
Recent literature on multispectral polarized BS has focused mainly on binary (or foreground-background) segmentation [41] after fusion [42]. Without considering some constraints or assumptions, BS in multiband images seems to be very difficult due to an object appearing in some spectra exhibiting a low correlation [35]. Approaches based on monocular segmentation usually rely on the hypotheses of visual saliency (e.g., an object appearing in a scene is highly focused) or human oversight to achieve good results [43,44]. Given the additional assumptions that can be made about the motion of an object or scene, the same issue in the temporal domain (i.e., for image sequences) is easier to tackle. There are methods for motion clustering that depend on partitioning the optical flow or trajectory points to identify image regions that behave differently from their surroundings [45]. Also, the close correlation between motion partitioning and the segmentation of video objects has gained attention in the last decade [46,47].
Foreground-background separation is an easy task if the object of interest is available in multiple spectra. An example of the assumption of visual appearance is where common foreground objects are presented and the background demonstrates low correlations in multiple spectra [48][49][50]. In contrast, methods for mutual segmentation usually assume that the same object is observed from multiple viewpoints and optimize the geometric consistency of the extracted foreground area [51][52][53]. Earlier mutual segmentation approaches focus on single-spectrum imagery [53][54][55] and, with the registration problem solved using depth cameras [56,57], offer a range-based solution for detecting objects in a scene [51,52,58].
Currently, researchers are concentrating on combining spatial, spectral and polarimetric information to effectively predict the background and foreground in a scene without creating any background model. In some studies, color space mapping [13,20,29] is applied on fused multispectral polarimetric components and then fuzzy C-means clustering is used to separate objects from the background. Lu [59] applied PCA on different polarized images and then used a clustering technique to segment objects. These methods do not require any previous background model to segment the foreground/object from the background and are robust to illumination conditions. The proposed BS method creates some significant foreground masks from different polarimetric components and combines them to segment a complex background from a scene.

Analysis of the MSPI Dataset
Multispectral polarimetric data contain rich sets of information for discriminating foreground objects from the background in a complex scene.

Description of the Dataset
The multispectral polarimetric image dataset used in this paper is obtained from a publicly available one [60], which contains polarimetric and multispectral images from visible and near infrared (NIR) bands in the range of 420-1000 nm. The imaging system generates 10 different scenes in six spectral channels at four different polarimetric filter orientation angles (0°, 45°, 90°, 135°). Of them, three scenes in which the background is polarized in nature are selected for our purpose. Samples of the dataset at the 0° polarimetric orientation (I_0) are shown in Figure 1, where the liquid scene is composed of reflecting objects, such as water in a bottle and a doll in a jar, the food scene of an apple and a banana, and the leaf scene of real and fake leaves with a mixed background.



Analysis of Spectral Reflectance
The camera response (ρ_i) of the i-th channel is described by [1] as:

ρ_i = ∫ E(λ) R(λ) O(λ) C_i(λ) dλ

where E(λ) is the spectral emission of the illuminant, R(λ) the reflectance, O(λ) the global transmittance of all the optical elements and C_i(λ) the spectral sensitivity of the i-th channel. After calculating the measurement matrix (M), the spectral reflectance (R) is recovered by inverting the linear system ρ = MR. The mean spectral reflectance of an area demonstrates variations in information among multiple bands. It is also predicted that the foreground areas of different scenes, such as the liquid (L-Bottle, L-Jar), food (F-Apple, F-Banana) and leaf (LF-Real, LF-Fake) ones, exhibit higher spectral reflectance than their backgrounds, as shown in Figure 2.

Analysis of the Multiband Dissimilarity Matrix
In MSPI, the data redundancies that can occur in spectral bands denote that one band's data can be partly or fully predicted from those of the others. Multiband dissimilarity is calculated in two different ways, by the Pearson correlation and Euclidean distance. We consider three different MSPI scenes to compute their dissimilarity levels in multiple spectra.

Pearson Correlation
Higher correlations among multiple spectra demonstrate highly redundant information. To measure the inherent correlation between two images (A_mn and B_mn), the Pearson correlation coefficient (r) is calculated as:

r = Σ_m Σ_n (A_mn − Ā)(B_mn − B̄) / sqrt( (Σ_m Σ_n (A_mn − Ā)^2) (Σ_m Σ_n (B_mn − B̄)^2) )

where Ā and B̄ are the mean pixel intensities of A_mn and B_mn, respectively.
The correlation matrices in the MSPI scenes due to the multiple spectra and multiple polarimetric orientations are presented in Figure 3a-c [61]. As can be seen, band 6 in the liquid and leaf scenes is not correlated with the other bands, while band 6 in the food scene is correlated with only band 5 above a threshold of 0.90. The bands not strongly correlated are indicated in green.
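The pairwise band correlations behind correlation matrices such as those of Figure 3a-c can be computed as follows (a minimal sketch; the function names are our own):

```python
import numpy as np

def pearson_corr(A, B):
    """Pearson correlation coefficient r between two equal-sized images."""
    a = A.ravel().astype(float) - A.mean()
    b = B.ravel().astype(float) - B.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

def band_correlation_matrix(bands):
    """Symmetric matrix of pairwise correlations among spectral bands."""
    n = len(bands)
    R = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            R[i, j] = R[j, i] = pearson_corr(bands[i], bands[j])
    return R
```

Thresholding this matrix (e.g., at 0.90, as in the text) then flags the weakly correlated bands.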


Euclidean Distance
To measure the distances among multiple spectra, the classical method used is the Euclidean distance which, between two images (A_mn and B_mn), is calculated by:

d(A, B) = sqrt( Σ_m Σ_n (A_mn − B_mn)^2 )   (11)

and then normalized in the range (0-1). The distance matrices in the MSPI scenes due to the multiple spectra and different polarimetric orientations are presented in Figure 3d-f. As can be seen, band 6 is the most dissimilar to the others in the liquid and leaf scenes, while, in the food scene, bands 5 and 6 demonstrate similar information but are dissimilar to the others. The bands with large distances are indicated in green.
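Similarly, a normalized distance matrix in the style of Figure 3d-f can be sketched as follows (names are our own; normalizing by the maximum pairwise distance is our reading of "normalized in the range (0-1)"):

```python
import numpy as np

def euclidean_distance(A, B):
    """Euclidean distance between two equal-sized images (Eq. (11))."""
    d = A.astype(float) - B.astype(float)
    return np.sqrt((d * d).sum())

def band_distance_matrix(bands):
    """Pairwise distances among bands, normalized to the range 0-1."""
    n = len(bands)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = euclidean_distance(bands[i], bands[j])
    if D.max() > 0:
        D = D / D.max()   # scale so the largest distance is 1
    return D
```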

Analysis of Multiband Textural Features
In MSPI, textural features describe the local spatial variations in intensity as well as the homogeneity of images. They are divided into the two broad categories of first- and second-order statistical features, with the former dependent mainly on a histogram analysis of gray-level image representations. Considering z_i as the gray value of the i-th pixel, P(z_i) is the normalized histogram and N the number of distinctive gray levels, with the first-order histogram defined as [50]:

P(z_i) = n(z_i) / M

where n(z_i) is the number of pixels with gray value z_i and M is the total number of pixels. The first-order statistical features are defined as follows [62].

1. Mean is a measure of the average intensity of the distribution.
2. Standard deviation is a measure of the spreading of the distribution from the mean value; it is used to sharpen edges, as the intensity level changes by a large value at the edge of an image.
3. Energy is a measure of the homogeneity of the histogram.
4. Skewness is a measure of the degree of the histogram's asymmetry around the mean.
5. Kurtosis is a measure of the histogram's sharpness, that is, whether the data are peaked or flat relative to a normal distribution.
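The five first-order features above can be computed directly from the normalized histogram. This is a sketch with our own function name and a 256-level default; the paper does not specify these details.

```python
import numpy as np

def first_order_features(img, levels=256):
    """First-order statistical features from the normalized gray-level histogram."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()                      # P(z_i), the normalized histogram
    z = np.arange(levels, dtype=float)
    mean = (z * p).sum()
    var = ((z - mean) ** 2 * p).sum()
    std = np.sqrt(var)
    energy = (p ** 2).sum()                    # homogeneity of the histogram
    skew = ((z - mean) ** 3 * p).sum() / std ** 3 if std else 0.0
    kurt = ((z - mean) ** 4 * p).sum() / std ** 4 if std else 0.0
    return {"mean": mean, "std": std, "energy": energy,
            "skewness": skew, "kurtosis": kurt}
```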
The second-order statistics or Gray-Level Co-occurrence Matrix (GLCM) define a square matrix whose entries represent the probability of the gray value (g_1) being moved through a fixed spatial location relationship (distance and direction) to another gray value (g_2). Assuming that f(i_1, j_1) is a 2D gray-scale image, where S is a set of pixel pairs with a certain spatial relationship in the region and P the GLCM, this is expressed as [63]:

P(g_1, g_2) = #{ ((i_1, j_1), (i_2, j_2)) ∈ S | f(i_1, j_1) = g_1, f(i_2, j_2) = g_2 } / #S

where # is used to denote the quantity (number of elements). The second-order statistical features are defined as follows.

1. Contrast is a measure of the local variations present in an image, that is, it reflects the depth and smoothness of the image's textural structure.
2. Correlation is a measure of the gray-level linear dependence between the pixels at specified positions relative to each other, that is, it reflects the similarity of an image's texture in the horizontal or vertical direction.
3. Angular Second Moment or Energy is a measure of the global homogeneity of an image.
4. Homogeneity is a measure of the local homogeneity of an image.
5. Entropy is a measure of a histogram's uniformity, that is, it reflects the complexity of the textural distribution.
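The GLCM and the five second-order features above can be sketched as follows. This is a naive implementation; the offset (1, 0) and the number of gray levels are our own defaults, not parameters stated in the paper.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Gray-Level Co-occurrence Matrix for the offset (dy, dx), normalized."""
    P = np.zeros((levels, levels))
    h, w = img.shape
    for i in range(h - dy):
        for j in range(w - dx):
            P[img[i, j], img[i + dy, j + dx]] += 1
    return P / P.sum()

def glcm_features(P):
    """Second-order (GLCM) features: contrast, correlation, energy,
    homogeneity and entropy."""
    n = P.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    mu_i, mu_j = (i * P).sum(), (j * P).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * P).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * P).sum())
    contrast = ((i - j) ** 2 * P).sum()
    corr = (((i - mu_i) * (j - mu_j) * P).sum() / (sd_i * sd_j)
            if sd_i and sd_j else 0.0)
    energy = (P ** 2).sum()                                  # angular second moment
    homogeneity = (P / (1.0 + np.abs(i - j))).sum()
    entropy = -(P[P > 0] * np.log2(P[P > 0])).sum()
    return {"contrast": contrast, "correlation": corr, "energy": energy,
            "homogeneity": homogeneity, "entropy": entropy}
```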
The MSPI's textural features are presented in Figure 4. As can be seen, the kurtosis, skewness, energy and entropy values demonstrate high variations in the bands in the liquid, food and leaf scenes. Therefore, it is predicted that the information in MSPI is not inherently correlated among the bands.

Proposed Two-Fold BS
Our approach can be described as two-fold as it uses a low-level fusion of multiple spectra and polarimetric component-based BS.

Overall Framework and Algorithm
As, in our proposed method, the multiple spectra of each polarimetric filter orientation are fused to solve intricate situations and benefit from the complementarity of the information among multiple bands, it consists of multiband polarimetric fusion and BS. The whole process is decomposed into five main steps: in the first, multiband polarimetric data are captured and analyzed; in the second, a combination of an FFT and PCA is applied on multiple spectra to calculate the fused image of each polarimetric filter orientation; in the third, the polarimetric components are calculated; in the fourth, four different significant masks are obtained; and, in the last, all of the foreground masks are combined and a morphological operation is applied to generate the final background mask. Figure 5 demonstrates this proposed two-fold BS framework.

The aim of this research is the joint use of spatial, spectral and polarimetric information to segment a polarized background from a complex scene, with Algorithm 1 describing the steps for multiband fusion and BS. In it, the significance of fusion is analyzed by calculating the spectral reflectance, correlation and textural features, and then the proposed fusion algorithm (Algorithm 2) is applied to determine the polarimetric components. Finally, Algorithm 3 is used to segment the background from a complex scene.

Hybrid Fusion Framework and Algorithm
An analysis of MSPI in terms of spectral reflectance, correlation and textural features can predict the significance of multiband pixel-level fusion. The proposed fusion method is based on 2D neighborhood information, such as gradients or edge magnitudes and orientations, with the source images the multiple spectral ones (Im_i) and the fused image (Im_f) based on the different polarimetric filter orientations. This method is hybrid in nature as it combines two different techniques, an FFT and PCA, as shown in Figure 6. Firstly, by applying a Gaussian low-pass filter, each spectral image is decomposed into low- and high-frequency components (LPF_i and HPF_i, respectively). Then, a Discrete Fourier Transform (DFT_i) of each band is calculated with the help of the FFT. Multiplying the DFT_i by the LPF_i and HPF_i, the final FFTs (FLP_i and FHP_i) are obtained. Then, applying an inverse FFT, the results are converted to the spatial domain as the final low and high frequencies of each band:

LP_i = FFT^{-1}(FLP_i), HP_i = FFT^{-1}(FHP_i)

Then, the covariance matrix, eigenvectors and eigenvalues of each low- and high-frequency component are calculated and the transformed image reconstructed by determining their first PCs. Finally, these components are combined to obtain the fused image; Figure 6 illustrates the framework of this method. MSPI often presents complementary information about the foreground and background in a scene that is efficiently combined by pixel-level fusion. The process for polarimetric orientation-wise multiband fused imagery is described in Algorithm 2.

Requires: MSPI Scene Dataset
Ensures: Polarimetric Orientation-wise Fused Imagery
1: for all polarimetric orientations k do
2:   for all bands i do
3:     Create a Gaussian Low-Pass Filter (LPF_i) and a Gaussian High-Pass Filter (HPF_i)
4:     Calculate the Discrete Fourier Transform DFT_i via the FFT
5:     Multiply DFT_i with LPF_i and HPF_i
6:     Convert the results to the spatial domain by applying the inverse FFT, producing the band-wise final results LP_k,i and HP_k,i
7:   end for
8:   Calculate the covariance and eigenvectors of LP_k and HP_k
9:   Calculate the first Principal Components of LP_k and HP_k
10:  Calculate the fused imagery by adding both Principal Components
11: end for
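The steps above can be sketched in NumPy for a single polarimetric orientation. This is a minimal sketch with our own names; the Gaussian cutoff d0 and the absolute-weight first-PC reconstruction are our assumptions, not parameters given in the paper.

```python
import numpy as np

def gaussian_lpf(shape, d0=30.0):
    """Frequency-domain Gaussian low-pass filter with cutoff d0."""
    h, w = shape
    u = np.arange(h) - h / 2
    v = np.arange(w) - w / 2
    D2 = u[:, None] ** 2 + v[None, :] ** 2   # squared distance from center
    return np.exp(-D2 / (2.0 * d0 ** 2))

def first_pc_image(components):
    """Combine equal-sized components (one per band) with weights from the
    first principal component of their covariance matrix."""
    X = np.stack([c.ravel() for c in components])
    C = np.cov(X - X.mean(axis=1, keepdims=True))
    _, vecs = np.linalg.eigh(C)
    w = np.abs(vecs[:, -1])
    w /= w.sum()
    return np.tensordot(w, np.stack(components), axes=1)

def hybrid_fuse(bands, d0=30.0):
    """FFT + PCA hybrid fusion of one orientation's spectral bands."""
    lows, highs = [], []
    for b in bands:
        F = np.fft.fftshift(np.fft.fft2(b.astype(float)))
        LPF = gaussian_lpf(b.shape, d0)
        HPF = 1.0 - LPF
        lows.append(np.real(np.fft.ifft2(np.fft.ifftshift(F * LPF))))
        highs.append(np.real(np.fft.ifft2(np.fft.ifftshift(F * HPF))))
    # fuse low and high frequencies separately, then add (steps 8-10)
    return first_pc_image(lows) + first_pc_image(highs)
```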

Calculation of Polarimetric Components
The fused images calculated at the different polarimetric orientations (0°, 45°, 90°, 135°) are labeled I_0, I_45, I_90 and I_135, respectively, to determine their polarimetric components. The Stokes parameters (S_0-S_2) [64] describe the linear polarization characteristics using a three-element vector (S), as shown in Equation (27), where S_0 represents the total intensity of light, S_1 the difference between the horizontal and vertical polarizations and S_2 the difference between the linear +45° and −45° ones:

S = [S_0, S_1, S_2]^T = [I_0 + I_90, I_0 − I_90, I_45 − I_135]^T   (27)
The DoLP is a measure of the proportion of the linearly polarized light relative to the light's total intensity, the AoLP the orientation of the major axis of the polarization ellipse, which represents the polarizing angle where the intensity should be the strongest, and the Unpol a measure of the unpolarized light according to the dataset's author. The DoLP, AoLP and Unpol are derived from the Stokes vector, respectively, as:

DoLP = sqrt(S_1^2 + S_2^2) / S_0, AoLP = (1/2) arctan(S_2 / S_1), Unpol = S_0 − sqrt(S_1^2 + S_2^2)
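Under the standard three-element Stokes formulation, these components can be computed as follows (a sketch; the small eps guard against division by zero is our own addition):

```python
import numpy as np

def polarimetric_components(I0, I45, I90, I135, eps=1e-12):
    """Stokes parameters and derived polarimetric components from fused
    images at the four polarizer orientations."""
    S0 = I0 + I90                       # total intensity
    S1 = I0 - I90                       # horizontal vs. vertical polarization
    S2 = I45 - I135                     # +45 deg vs. -45 deg polarization
    lin = np.sqrt(S1 ** 2 + S2 ** 2)    # linearly polarized intensity
    DoLP = lin / (S0 + eps)             # degree of linear polarization
    AoLP = 0.5 * np.arctan2(S2, S1)     # angle of linear polarization
    Unpol = S0 - lin                    # unpolarized residual
    return S0, S1, S2, DoLP, AoLP, Unpol
```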

Proposed BS Algorithm
The proposed BS algorithm, which involves five different steps for generating four significant foreground masks through processing the polarimetric components, is presented in Algorithm 3.

3: Construct an intensity-invariant mask by differentiating the median-filtered versions of the unpolarized and polarized imagery
4: Calculate a strongly unpolarized foreground mask in two different ways utilizing the AoLP and DoLP
5: Calculate a strongly polarized foreground mask
6: Calculate a strong light intensity mask based on the azimuth angle and S_0
7: Combine the masks from steps 3-6 and apply a morphological operation to segment the total background area of a scene
8: end if

•
Step 1. Construction of intensity-invariant mask The proportion of the unpolarized to total intensity is defined as the degree of unpolarized intensity (Equation (31)). A DoLP modulation coefficient can be used to stretch the contrast and, thereby, the separability of the foreground/background areas in a polarimetric scene [29]. This function should be continuous in the field of the DoLP and monotonically increase with it. A logarithmic function is a natural choice for this purpose, as pointed out by Shannon [65], and the DoLP modulation coefficient (M DoLP ) is defined in Equation (32). The differentiation between the median filtering versions of the degree of the unpolarized image (D Unpol ) and M DoLP in Equation (33) produces an intensity-invariant mask as: • Step 2. Construction of strongly unpolarized mask An indirect unpolarized mask is calculated through a DoLP analysis. Firstly, due to the dark offset and spatial nonuniformity of light, a DoLP image may contain noise. Therefore, a FFT with a Gaussian low-pass filter is applied to smooth it and reduce the noise level (Equation (34), where D(u, v) is the distance from a point (u, v ) to the center of its frequency in the DoLP, and D 0 is the cutoff frequency) and the FFT of the DoLP image with padding obtained by Equation (35). The transformed image is multiplied by the filter and then an inverse FFT applied to obtain a noise-free smooth image (Equation (36)). The calculated low-pass filtered image of the DoLP (Smooth) is multiplied by the D Unpol to obtain a lower intensity (lower global threshold) of the polarized image and then binarized to calculate a noise-free smooth unpolarized mask (Equation (37)).
A direct unpolarized mask is calculated using an AoLP textural analysis. A desultory value of the AoLP denotes a rough surface that indicates the unpolarized part of a scene and a continuous one a smooth surface, which indicates the polarized part of a scene [20]. To segment the unpolarized part, firstly, a textural image is created by applying entropy filtering around a 9-by-9 neighborhood of the AoLP. The aim is to obtain an output image (Mask DirUnp ) in which each output pixel contains the entropy value of the 9-by-9 neighborhood around the corresponding pixel in the input image of the AoLP as: A final unpolarized mask is generated through a bit-AND operation. Because the contrast, illumination and energy levels are not uniform among the scenes in the bands, some portions of an individual scene gain advantages during this process, with some considering an AoLP textural analysis and others a DoLP one. Applying this operation on both sides obtains an exact unpolarized mask as: • Step 3. Calculation of strongly polarized mask Based on Otsu's method for calculating the global threshold, the DoLP image is converted to a binary one to obtain a strongly polarized mask of a scene as: • Step 4. Calculation of strong light intensity mask The inclination or azimuth angle (β) of the polarized light is calculated. The electric field vector (E) is divided into two orthogonal plane waves (E x and E y ) in the direction of the x-and y-axes, respectively. The calculations of E x and E y depend on S 0 , S 1 and S 2 , as in Equations (41) and (42), and the azimuth angle is defined in Equation (43). Finally, as the β is calculated from the DoLP, a noise-free version of it is determined using Equations (41)- (43) and renamed S Azimuth .
The total intensity image (S_0) is divided by the logarithmic function of β to calculate the area of the maximum strong light intensity. The purpose of using this natural logarithmic function of β is to stretch the contrast and, thus, the separability of the strong light intensity areas in polarimetric images.

• Step 5. Combination of foreground masks and application of morphological operation

All the foreground areas, namely the intensity-invariant, strongly unpolarized, strongly polarized and strong light intensity masks, are calculated and fused to obtain the final foreground mask (Equation (45)). Then, a morphological structuring operation based on a dilation followed by an erosion is applied to the combined mask to boost the segmentation performance. The final background area in the corresponding scene is denoted in black, and it can be seen that some object parts are mixed with the background due to the presence of polarization.
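Step 5 can be sketched as below. The fusion of the four masks is shown here as a simple union, though the paper's exact combination rule is given by Equation (45), and `scipy.ndimage` stands in for the MATLAB morphological operators.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def combine_masks(masks, struct_size=3):
    """Fuse the foreground masks and apply a dilation followed by an
    erosion (a morphological closing) to clean up the combined result."""
    fg = np.logical_or.reduce(masks)    # union of all foreground masks (assumed rule)
    struct = np.ones((struct_size, struct_size), dtype=bool)
    closed = binary_erosion(binary_dilation(fg, struct), struct)
    return closed                       # True = foreground, False = background
```

The closing fills small holes inside the foreground without growing its overall boundary, which is why the dilation is applied before the erosion.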

Experimental Results
In this section, performance evaluations and comparisons of the proposed two-fold BS and other approaches using different metrics in terms of multiband fusion and polarimetric BS are discussed. Also, computational time analyses of individual methods are conducted.

Selection of Fusion Metric
Currently, the quality of a fused image can be quantitatively evaluated using the following metrics [66]: the Mean Absolute Percentage Error (MAPE), Peak Signal-to-Noise Ratio (PSNR), Pearson Correlation Coefficient (PCOR) and Mutual Information (MI). The MAPE is a measure of the closeness between the reference and fused images, while the PSNR computes the peak signal-to-noise ratio between them. A lower MAPE value and a higher PSNR value indicate a better quality of the reconstructed or fused image. The PCOR computes the linear correlation coefficient between the reference and fused images, while the MI indicates their mutual dependence, with higher values of both indicating a better correlation. Considering two images A and B, the mathematical evaluation measures of the aforementioned metrics are as follows.
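Under the usual definitions of these four metrics (the paper's exact normalizations follow [66]), a NumPy sketch is:

```python
import numpy as np

def fusion_metrics(ref, fused, bins=64):
    """Illustrative MAPE, PSNR, PCOR and MI between a reference and a
    fused image; definitions may differ slightly from [66] in scaling."""
    ref = ref.astype(float); fused = fused.astype(float)
    mape = 100.0 * np.mean(np.abs((ref - fused) / np.maximum(np.abs(ref), 1e-12)))
    mse = np.mean((ref - fused) ** 2)
    psnr = 10.0 * np.log10(ref.max() ** 2 / mse) if mse > 0 else np.inf
    pcor = np.corrcoef(ref.ravel(), fused.ravel())[0, 1]
    # Mutual information from the joint histogram of the two images.
    joint, _, _ = np.histogram2d(ref.ravel(), fused.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True); py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))
    return {"MAPE": mape, "PSNR": psnr, "PCOR": pcor, "MI": mi}
```

For a perfect fusion (fused equal to reference), MAPE is zero, PSNR diverges, PCOR is one and MI reduces to the image's own entropy.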

Observation of Fusion Quality
To evaluate the fusion performance, the median image across the bands is considered the reference one and the fused image the resultant one. The former contains the central values of the bands and tends to retain more detailed information. Table 1 demonstrates that the average PSNR value of the proposed method is lower than those of the other methods; however, its average MAPE value is significantly better than those of some references. The average metric values for the three scenes demonstrate that the fusion quality of our proposed hybrid fusion-based approach is superior in terms of its higher PCOR and MI compared with those in the current literature.

Comparison of Performances of Fusion Methods
The textural features of the fused images are presented in Figure 7. It can be seen that the kurtosis, skewness and energy of the liquid, food and leaf scenes have lower values for the proposed hybrid fusion method than for the DWT [37], PCA [33], AVG [36], DCT-LP [40], SHT [20] and FPDE [34] methods while, in contrast, its entropy is higher than those of the DWT, PCA and AVG methods. Therefore, it is considered that, overall, the proposed hybrid pixel-level fusion method generates the maximum amount of information among multiple spectra.

Visualization of Fusion Performance
It is obvious that multiband fusion methods demonstrate better BS accuracy than direct processing [67,68]. The multiband fusion results shown in Figure 8 demonstrate that the proposed hybrid fusion approach has an advantage over the other methods as it appears strongly at the edge of an object, which will enable foreground objects to be further separated from a scene.

Calculation of Polarimetric Component
Based on the different fusion methods (DWT [37], PCA [33], AVG [36], DCT-LP [40], SHT [20], FPDE [34] and the proposed one), the polarimetric components S_0 and DoLP are calculated and shown in Figure 9. Analyzing the DoLP values, it is predicted that, in the scenes, the backgrounds are polarized, whereas the foregrounds have mixed polarized and unpolarized intensities.

Figure 9. Polarimetric components calculated using different fusion methods.
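Assuming a standard four-orientation polarimetric acquisition (0°, 45°, 90°, 135°), which this section does not spell out, the Stokes-based components can be sketched as:

```python
import numpy as np

def stokes_components(i0, i45, i90, i135):
    """Linear Stokes parameters from intensity images at four polarizer
    orientations (a common acquisition scheme; the paper's set-up may differ)."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)          # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / np.maximum(s0, 1e-12)
    aolp = 0.5 * np.arctan2(s2, s1)             # Angle of Linear Polarization
    return s0, dolp, aolp
```

For fully horizontally polarized light (i0 = 1, i90 = 0, i45 = i135 = 0.5), this gives a DoLP of one and an AoLP of zero, as expected.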

Selection of BS Metric
The BS method is evaluated at the pixel level of a binarized scene in which the foreground and background are white and black, respectively. Its performance can be divided into four pixel-wise classification results: true positives (T_p), which are correctly detected foreground pixels; false positives (F_p), that is, background pixels incorrectly detected as foreground; true negatives (T_n), which are correctly detected background pixels; and false negatives (F_n), that is, foreground pixels incorrectly detected as background. The binary classification metrics used in this paper are accuracy, specificity, sensitivity, Geometric mean (G-mean), precision, recall and the F1-score. Accuracy is measured as the proportion of true results obtained, either T_n or T_p; specificity (a T_n fraction) as the proportion of actual negatives predicted as negatives; sensitivity (a T_p fraction) as the proportion of actual positives predicted as positives; the G-mean as the root of the product of specificity and sensitivity; precision as the number of detected foreground pixels that are actually foreground; recall as the number of foreground pixels detected from the actual foreground (recall and sensitivity are equivalent); and the F1-score (a boundary F1 measure) as the harmonic mean of precision and recall, which measures how closely the predicted boundary of an object matches its ground truth and is an overall indicator of the performance of binary segmentation. The mathematical evaluation measures of the aforementioned metrics follow [69], for example:

Specificity (SP) = T_n / (T_n + F_p)   (51)

Geometric Mean (GM) = √(Specificity × Sensitivity)   (53)

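These pixel-wise metrics can be computed directly from the four classification counts; a minimal NumPy sketch (guarding against empty denominators) is:

```python
import numpy as np

def bs_metrics(pred, gt):
    """Pixel-wise BS metrics from a binary prediction and ground truth
    (foreground = True/white, background = False/black), following [69]."""
    tp = np.sum(pred & gt)      # correctly detected foreground pixels
    fp = np.sum(pred & ~gt)     # background detected as foreground
    tn = np.sum(~pred & ~gt)    # correctly detected background pixels
    fn = np.sum(~pred & gt)     # foreground detected as background
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / max(tn + fp, 1)
    sensitivity = tp / max(tp + fn, 1)          # identical to recall
    gmean = np.sqrt(specificity * sensitivity)
    precision = tp / max(tp + fp, 1)
    f1 = 2 * precision * sensitivity / max(precision + sensitivity, 1e-12)
    return dict(accuracy=accuracy, specificity=specificity,
                sensitivity=sensitivity, gmean=gmean,
                precision=precision, f1=f1)
```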
Generation of Ground Truth
Each ground truth is generated manually by an expert, covering the maximum possible background among the bands. In Figure 10, for the scenes' binary ground truths, the black and white pixels indicate their backgrounds and foregrounds, respectively.

Comparison of BS Accuracy: Direct vs. Fusion
In Table 2, the performances of BS in terms of the metrics for the direct and fusion-based approaches are compared, where B-1 to B-6 denote Bands 1 to 6, respectively, in the range of 400-1000 nm. As can be seen, the mean of each individual evaluation metric is significantly higher for a fusion-based BS method than for a direct band-wise one. Also, it can be ascertained that the mean accuracy, G-mean and F1-score are 0.88, 0.86 and 0.91, respectively, for the former, and 0.74, 0.67 and 0.80 for the latter.

Visualization of Performance of BS
The polarimetric BS errors are shown in Figure 11, in which purple and green denote the error areas of the segmentation methods present in the ground truths' black and white portions, respectively. These errors are lower for the fusion-based methods than for the direct band-wise ones, with our proposed hybrid fusion-based polarimetric BS approach superior to all the others.

Figure 11. Direct vs. fusion-based segmentation errors of three scenes in the MSPI dataset (direct: Bands 1-6; fusion-based: DWT [37], PCA [33], AVG [36], DCT-LP [40], SHT [20], FPDE [34] and the proposed method).

It is worth mentioning that the performances of these existing BS methods are not exactly comparable, as each reports its accuracy for a specific MSPI database. Also, the recognition accuracy values obtained from the fusion methods and color-mapping techniques used for segmentation vary.

By integrating an FFT with PCA for a fusion approach and then generating foreground masks for the segmentation task, the proposed BS method obtains higher accuracy, G-mean and F1-score values for the liquid (0.97, 0.97 and 0.98, respectively), food (0.96, 0.94 and 0.97, respectively) and leaf (0.89, 0.87 and 0.92, respectively) scenes in the MSPI dataset than those in the existing literature on foreground separation methods, as presented in Table 3.

Table 3. Comparison of performances of BS methods for three scenes in the MSPI dataset.

Figure 12 shows the average performances of individual methods for scenes in the MSPI dataset. It is clear that the proposed method performs better than the others, with a mean accuracy, G-mean and F1-score of 0.95, 0.93 and 0.97, respectively.

Figure 12. Average performances of individual methods for scenes in the MSPI dataset.

Figure 13 presents the error areas (purple and green) obtained from the individual methods for different scenes in the MSPI dataset. As can be seen, the proposed system reports fewer BS errors than the others. In the liquid scene, although the background is completely separated by the proposed method, there are some errors due to weak polarization in the jar cap's area. In the food (apple and banana) and leaf scenes, although their foregrounds are completely separated by the proposed method, there are some errors due to strong unpolarization in their background areas.

Computational Time Analysis
The experiments are carried out using MATLAB on a desktop with 16 GB of RAM and a 3.4 GHz Intel Core i7 CPU. Table 4 presents the running times (in seconds) of the individual methods for the multiband fusion and polarimetric BS approaches. As can be seen, the proposed system incurs slightly longer computational times than the PCA-based multiband fusion method but shorter ones for the polarimetric BS, with its shortest mean running time demonstrating its effectiveness in terms of BS in MSPI.

Conclusions
In this paper, a two-fold BS approach involving multiband fusion followed by polarimetric BS is proposed for MSPI. Combining complementary information from spectral and polarimetric cues in the visible and NIR ranges improves a foreground's contrast and details. This framework and algorithm demonstrate the significance of multiband fusion through analyzing spectral reflectance, correlation and textural features. The proposed hybrid fusion method first decomposes each band image into low and high frequencies using an FFT. Then a PCA is performed, with the first PC of each frequency computed and combined with the others to obtain a final fused image. The proposed fusion algorithm predicts better fusion quality with fewer errors in terms of the MAPE, PSNR, PCOR and MI than the DWT, PCA, DCT-LP, SHT and FPDE techniques. After fusion, the polarimetric components are calculated through a Stokes vector analysis. Finally, four significant foreground masks are generated and combined with the polarimetric components to segment the complex background of a scene. The proposed BS algorithm is compared with four baseline approaches based on fusion, color mapping and fuzzy C-means clustering using an MSPI dataset. The experimental results illustrate the validity and efficiency of the proposed BS method based on diverse performance evaluation metrics. They also demonstrate that it significantly improves the mean accuracy, G-mean and F1-score to 0.95, 0.93 and 0.97, respectively, for three scenes in the MSPI dataset compared with those in the existing literature referenced in this paper.
As an extension of this work, we will investigate an advanced BS method in which the background and foreground of a scene are mixed with polarized and unpolarized intensities. We will also address issues related to the specular reflection of the foreground area in a scene, which may mistakenly be detected as background. As specular reflection in an area of strong light intensity can destroy the shape of an actual object, developing an algorithm for detecting and reconstructing specular reflections in the MSPI dataset will be explored.