Singular Spectrum Analysis for Background Initialization with Spatio-Temporal RGB Color Channel Data

In video processing, background initialization aims to obtain a scene without foreground objects. Recently, the background initialization problem has attracted the attention of researchers because of its real-world applications, such as video segmentation, computational photography, video surveillance, etc. However, the background initialization problem is still challenging because of the complex variations in illumination, intermittent motion, camera jitter, shadow, etc. This paper proposes a novel and effective background initialization method using singular spectrum analysis. Firstly, we extract the video’s color frames and split them into RGB color channels. Next, RGB color channels of the video are saved as color channel spatio-temporal data. After decomposing the color channel spatio-temporal data by singular spectrum analysis, we obtain the stable and dynamic components using different eigentriple groups. Our study indicates that the stable component contains a background image and the dynamic component includes the foreground image. Finally, the color background image is reconstructed by merging RGB color channel images obtained by reshaping the stable component data. Experimental results on the public scene background initialization databases show that our proposed method achieves a good color background image compared with state-of-the-art methods.


Introduction
Scene background initialization is a basic low-level process in video-processing applications, such as video segmentation [1], video compression [2], computational photography [3], and video surveillance [4,5] (e.g., tracking, counting). Background initialization is also known as background estimation, background reconstruction, and background generation. The task of background initialization can be described as follows: given a video, we need to construct a model that describes the clear background image despite the continued presence of moving objects. The background image may be valid for the entire video or updated over time if the background configuration changes due to illumination change or the displacement of background objects. Figure 1a shows frames from the HighwayII sequence of the scene background initialization (SBI) database [6]. There is an appearance of moving objects in each frame, particularly cars. These frames are the input data of the background initialization model, as described in Figure 1b. Using the proposed background initialization model, we can eliminate the appearance of moving objects to obtain a clean background, which is also known as the closest-to-ground-truth background, as shown in Figure 1c. During the past two decades, many methods [1,[7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25] were proposed for the background initialization task. In general, these techniques can be classified into four main categories: pixel-based methods [1,[7][8][9][10][11], iterative-based methods [12][13][14][15], low-rank/sparse data separation methods [16][17][18][19][20][21][22], and deep learning-based methods [23][24][25][26][27].
The first subcategory includes pixel-based methods, where each pixel is processed individually over time. Chiu et al. [1] obtained the background by clustering the pixels. Pixels obtained from each location along its time axis are clustered according to their intensity variations. The pixel corresponding to the cluster whose maximum probability is greater than a time-varying threshold is extracted as a background pixel. Maddalena and Petrosino [7] used a temporal median to compute the background pixel as the median of the pixels at the same position across all the image sequences. The most well-known method is the mixture of Gaussians (MoG) proposed by Stauffer [9]. The background is modeled probabilistically at each pixel location by fitting a MoG to the observed pixel values in a recent temporal window. The MoG decides whether each pixel is classified as background or foreground. More recently, Laugraud et al. [10] presented a method called LaBGen, which combines a pixel-wise temporal median filter and a patch-selection mechanism based on motion detection. In each frame, a background subtraction algorithm determines whether each pixel in the video belongs to the foreground or background. Tian et al. [11] introduced the block-level background modeling (BBM) algorithm to obtain video-coding background components. The BBM algorithm uses the residual gradient as the temporal information to distinguish the background blocks. BBM is used to consider the boundary difference, and the pixel smoothness process is handled using a weighted average of pixel temporal values.
The second subcategory includes iterative-based methods [12][13][14][15]. These methods usually consist of two stages. In the first stage, they detect static regions considered reference backgrounds. The background model is iteratively completed in the second stage based on suitable spatial consistency criteria. Hsiao and Leou [12] performed background initialization and foreground segmentation based on motion estimation and the computation of the correlation coefficient. Each block of the current frame is classified into four categories: background, still object, illumination change, and moving entity, which are exploited in the background updating phase. The static blocks, such as "background" and "illumination change", are selected as the reference, and the remaining blocks are suitably used for the iterative completion of the background model. In [13], Torre and Black applied robust principal component analysis (RPCA) to separate the background and foreground and detect outliers from video or image data. Firstly, the number of bases that preserve 55% of the data energy is calculated using standard PCA. Then, based on the obtained number of bases, RPCA is used to minimize the energy function until convergence to obtain the weight matrix. Finally, the weight matrix is used to detect outliers. Reitberger and Sauer [14] proposed a background-determining model based on an iterative singular value decomposition via singular vectors spanning a subspace of the image space. The method has a fast processing speed and can be applied in real-time applications, but it has difficulty handling challenges such as intermittent motion. Recently, based on the long-term stability of the background and the short-term changes of the foreground, Chen et al. [15] adopted a Bayesian framework to classify the background and foreground.
The third subcategory includes low-rank/sparse data separation methods. The background information is considered low-rank information, and the remainder of the data represents both noises and moving objects. One of the first attempts to initialize the background in this subcategory was introduced by Candes et al. [16]. They perfectly separated a given video into a low-rank matrix and a sparse matrix by solving a very convenient convex program called principal component pursuit (PCP). However, PCP has several disadvantages for real-world videos, such as its time consumption and computational complexity. To overcome the limitations of the PCP method, many studies were proposed, such as Javed et al. [17] and Zhou et al. [18], which work well in specific environments. Ye et al. [19] presented a motion-assisted matrix restoration (MAMR) model for background-foreground separation of a video. In the MAMR model, the sparse matrix contains the foreground objects, and the low-rank matrix includes the background. A dense motion field is calculated and mapped into a weighting matrix for each frame, which indicates the likelihood that each pixel belongs to the background. In [20], Grosek and Kutz introduced the video dynamic mode decomposition (DMD) method for foreground and background separation. The DMD method decomposes video data into different dynamic modes, which are associated with Fourier frequencies. The frequencies near the origin do not change from frame to frame. Thus they are considered background components, and the terms with Fourier frequencies bounded away from the origin are foreground components. In [21], non-negative matrix factorization (NMF) was used to approximate a non-negative matrix A to a product of two non-negative, low-rank factor matrices W and H, where W contains background components and H contains foreground components. More recently, Kajo et al. 
[22] introduced a spatio-temporal, slice-based, singular value decomposition (SVD) method that organizes videos as tensors and seeks to separate them into different components. Each of these components, namely the moving object and the background, is represented by a few distinct significant eigenvalues. However, this proposal can be time-consuming to process over an ample space. Besides, it still has some limitations in complex scenes, such as illumination variation, short videos, and clutter.
The fourth subcategory includes deep learning-based methods [23][24][25][26][27]. These methods use the effectiveness of deep learning models to automatically learn the background model. Ramirez-Quintana and Chacon-Murguia [23], based on self-organizing maps (SOMs) and cellular neural networks (CNNs), proposed a self-adaptive system named SOM-CNN. This system includes two neural network architectures, called retinotopic SOM (RESOM) and neighbor threshold CNN (NTCNN), for video and motion analysis. The system can work with typical and complex scenarios in real time. Zhao et al. [24] proposed a background modeling method called the stacked multilayer self-organizing map background model (SMSOM-BM). This model can learn the background model of challenging scenarios and automatically determine most network parameters by considering every pixel and spatial consistency at each layer. Halfaoui et al. [25] proposed a CNN-based method to estimate the background component. This method is effective for challenges such as dynamic backgrounds, illumination variation, and clutter. Yang et al. [26] proposed a deep neural network for background modeling. First, they used temporal encoding to sample multiple frames from the original sequential images with variable intervals; then, they used a fully convolutional network to extract temporal and spatial information from the frames. In the work by Gregorio et al. [27], the authors introduced a background initialization approach based on weightless neural networks. Each pixel is associated with an artificial weightless neural network that learns the pixel's most frequent values. This method is useful for processing long-term and live videos.
In the real world, background initialization still faces many challenges, such as lighting changes, the foreground occupying most of the frames, the automatic adjustment of the video camera, and objects moving heterogeneously (sometimes stationary, sometimes moving). To address these issues, we propose a novel method belonging to the low-rank/sparse data separation category, named background initialization with singular spectrum analysis (BISSA). Firstly, the input image sequence is reorganized into a spatio-temporal data type useful for background-foreground separation tasks. Secondly, an adaptive background initialization algorithm for image sequences based on SSA is proposed. Finally, to evaluate the effectiveness of our method, we compare our approach with some state-of-the-art techniques through experiments on the SBI database [6]. The experimental results show that our proposed method is more accurate and easier to apply in real-world applications.
The rest of the paper is organized as follows: Section 2 describes an overview of the SSA algorithm. Section 3 presents our proposed method. Finally, experimental results and discussion are summarized in Section 4, while conclusions and future work are represented in Section 5.

Singular Spectrum Analysis
In recent years, singular spectrum analysis (SSA) [28][29][30] has emerged as a powerful non-parametric tool for analyzing and predicting time series data. This method aims to decompose the input data into a sum of different meaningful components, where these components can be grouped and merged based on their common properties to compose subsequent components. These grouped components indicate different groups of features of the original time series data. Currently, many researchers apply SSA in different areas, such as biomedical diagnostic tests [31], climatology [32], economics [33,34], signal processing [35], etc. A flowchart of SSA, consisting of the substages of decomposition and reconstruction, is shown in Figure 2.

As can be seen in Figure 2, the basic SSA algorithm consists of two isolated stages: a decomposition stage and a reconstruction stage. In the first stage, the embedding and singular value decomposition steps are applied for the decomposition.
In the last stage, the eigentriple grouping and diagonal averaging steps are used to reconstruct the time series. For example, given a non-zero time series X = (f_1, f_2, ..., f_K) of length K, let W denote the window length, where 1 < W < K, and let L = K − W + 1. The SSA algorithm is described below.

Stage 1: Decomposition

Step 1: Embedding. Embedding is a standard procedure in time series analysis. It can be regarded as a mapping that transfers a one-dimensional time series into a multidimensional series. By selecting a large window size, more information about the basic pattern of the time series is captured. The trajectory matrix F of the original time series X is the W × L Hankel matrix

F = \begin{pmatrix} f_1 & f_2 & \cdots & f_L \\ f_2 & f_3 & \cdots & f_{L+1} \\ \vdots & \vdots & \ddots & \vdots \\ f_W & f_{W+1} & \cdots & f_K \end{pmatrix},

where the rows and columns of F are subseries of the original time series.
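For a one-dimensional series, the embedding step above can be sketched in a few lines of NumPy. This is a minimal illustration; the function name `embed` and the toy series are ours, not from the paper:

```python
import numpy as np

def embed(x, W):
    """Build the W x L Hankel trajectory matrix of a 1-D series x, L = K - W + 1."""
    K = len(x)
    L = K - W + 1
    # Row i holds the subseries x[i], ..., x[i+L-1]; anti-diagonals are constant.
    return np.array([x[i:i + L] for i in range(W)])

x = np.arange(1.0, 7.0)   # toy series f_1, ..., f_6, so K = 6
F = embed(x, W=3)         # 3 x 4 trajectory matrix
```

Note that every anti-diagonal of `F` carries one sample of the series, which is exactly the Hankel structure exploited by the diagonal averaging step later.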
Step 2: Singular value decomposition (SVD). This step computes the SVD of the trajectory matrix F of size W × L. By using SVD, the matrix F can be decomposed into the product of three matrices: an orthogonal matrix U of size W × r, a diagonal matrix Σ of size r × r, and the transpose of another orthogonal matrix V of size r × L, where r is the rank of matrix F. In general, the SVD of the trajectory matrix F can be written as

F = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T = F_1 + F_2 + \cdots + F_r,

where σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0 are the singular values of F, u_i and v_i are the i-th left and right singular vectors, and each F_i = σ_i u_i v_i^T is a rank-1 elementary matrix. The triple (σ_i, u_i, v_i) is called the i-th eigentriple.
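As a quick numerical check of this step (toy data, not from the paper), the SVD splits the trajectory matrix into rank-1 eigentriple matrices whose sum recovers F:

```python
import numpy as np

# Hankel trajectory matrix of the linear toy series 1..6 (W = 3, L = 4).
F = np.array([[1., 2., 3., 4.],
              [2., 3., 4., 5.],
              [3., 4., 5., 6.]])
U, s, Vt = np.linalg.svd(F, full_matrices=False)
r = int(np.linalg.matrix_rank(F))                 # a linear series yields rank 2
# Each eigentriple gives a rank-1 elementary matrix F_i = sigma_i * u_i v_i^T.
F_elem = [s[i] * np.outer(U[:, i], Vt[i]) for i in range(r)]
recon = sum(F_elem)                               # sum of eigentriples recovers F
```

The low rank here (2 instead of 3) already hints at why grouping only a few leading eigentriples can capture the dominant structure of the data.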

Stage 2: Reconstruction
Step 3: Eigentriple grouping. This step can be used to analyze and determine the physical behavior of each component in the time series data. The purpose of eigentriple grouping is to gather components based on their common properties. The rank-1 matrices obtained from the SVD of the trajectory matrix F can be selected and gathered together; correctly clustered groups reflect different criteria of the original time series data. The grouping procedure separates the set of r eigentriples into m (m ≤ r) disjoint subsets G_1, G_2, ..., G_m, where each grouped matrix F_{G_j} is the sum of the elementary matrices whose indices belong to G_j:

F_{G_j} = \sum_{i \in G_j} F_i, \quad j = 1, 2, \ldots, m.

The process of selecting the sets G_1, G_2, ..., G_m is called eigentriple grouping.
Step 4: Diagonal averaging. The final step performs diagonal averaging on the matrices F_{G_j}, where j = 1, 2, ..., m. This step converts each grouped matrix F_{G_j} back into a one-dimensional time series. In particular, where F_{G_j} is a trajectory matrix grouped in Step 3 with elements g_{pq}, the element f_{kj}, k = 1, 2, ..., K, of the time series S_j is computed as the average of all elements on the k-th anti-diagonal of F_{G_j}:

f_{kj} = \frac{1}{|A_k|} \sum_{(p,q) \in A_k} g_{pq}, \quad A_k = \{(p,q) : p + q = k + 1,\ 1 \le p \le W,\ 1 \le q \le L\}.

The result of reconstructing each grouped trajectory matrix through the diagonal averaging process is a time series of length K represented by S_j = (f_{1j}, f_{2j}, ..., f_{Kj}).
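Diagonal averaging can be sketched directly from the formula above (a minimal NumPy version; the function name `diag_average` is ours):

```python
import numpy as np

def diag_average(Y):
    """Map a W x L matrix back to a length-K series (K = W + L - 1) by
    averaging the elements on each anti-diagonal."""
    W, L = Y.shape
    out = np.zeros(W + L - 1)
    cnt = np.zeros(W + L - 1)
    for p in range(W):
        for q in range(L):
            out[p + q] += Y[p, q]   # entry (p, q) lies on anti-diagonal p + q
            cnt[p + q] += 1
    return out / cnt

# Applied to an exact Hankel matrix, diagonal averaging returns the series itself.
Y = np.array([[1., 2., 3., 4.],
              [2., 3., 4., 5.],
              [3., 4., 5., 6.]])
series = diag_average(Y)
```

For a grouped matrix that is not exactly Hankel, the same averaging produces the closest series-like reconstruction of that component.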

Background Initialization Using Singular Spectrum Analysis
Generally, background-foreground separation can be regarded as a matrix separation problem [16,[36][37][38][39][40]. We can separate a video into two group components: one component that contains stable information and a remaining component that holds dynamic information. Constructing these components can be based on an eigentriple or a group of eigentriples. The background data (almost stable and highly correlated between frames) is contained in the stable component, and the dynamic component usually represents the foreground data (moving objects or noise). The matrix separation problem can be unified in a more general framework formulated as follows [16,[36][37][38][39][40]]:

X = S + \varepsilon_D,

where X is the input video data, S indicates the stable component, and ε_D represents the dynamic component. These components are achieved by reconstructing one or a group of eigentriples of the trajectory matrix F. As a result, the stable and dynamic components are calculated as

S = \sum_{i=1}^{\tau} F_i, \quad \varepsilon_D = \sum_{i=\tau+1}^{r} F_i,

where 1 ≤ τ ≤ r and r is the rank of F. In this study, video X is stored in three matrices as spatio-temporal data M^(C). Trajectory matrices F^(C) are constructed based on these spatio-temporal data matrices, where C ∈ {R, G, B} represents the R, G, and B color channels. More details on how to construct video X as spatio-temporal data used as the input for our background initialization system are introduced in the following subsections.
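The split X = S + ε_D can be illustrated on synthetic data (the matrix values and the choice τ = 1 below are illustrative assumptions, not from the paper): columns of X share a fixed "background" profile plus small per-frame noise, and the first eigentriple recovers that profile.

```python
import numpy as np

# Toy separation X = S + eps_D with tau = 1 stable eigentriple.
rng = np.random.default_rng(0)
background = np.outer(np.linspace(10.0, 20.0, 50), np.ones(8))  # 50 pixels, 8 frames
X = background + 0.1 * rng.standard_normal((50, 8))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
tau = 1
S = (U[:, :tau] * s[:tau]) @ Vt[:tau]      # stable component (first tau eigentriples)
eps_D = X - S                              # dynamic component (remaining eigentriples)
rel_err = np.linalg.norm(S - background) / np.linalg.norm(background)
```

The stable component reproduces the shared background profile almost exactly, while the per-frame noise lands in ε_D.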

Storing a Video as Spatio-Temporal Data
A fundamental problem in mathematics is how to arrange data so that they reveal the most critical information. By organizing the given data correctly, we can solve our problem. In this section, we introduce a way of rearranging the input video data to solve the problem of separating the background and moving objects. Spatio-temporal data [40] is a data type that contains both the space and time characteristics of the original data: spatial refers to space, and temporal relates to time. Spatio-temporal data analysis discovers patterns and knowledge from such data. A video can be considered a dynamic system with evolving frames, where each frame presents the system's state. In this study, by flattening the color frames of a video into columns of matrices, we obtain spatio-temporal data.
A grayscale video is three-dimensional (3D) input data, indexed by the frame height (m), width (n), and time (k), with k frames in the video, as shown in Figure 3a. By reshaping each frame into a column of size a × 1 of a matrix of size a × k (where a = m × n), as shown in Figure 3b, we obtain the spatio-temporal data matrix. In this matrix, the correlation between pixels located at the same or neighboring positions in adjacent frames is preserved over time. Additionally, the video is mapped from 3D space into two-dimensional (2D) space, thereby reducing the computational complexity. Without loss of generality, X is assumed to be an original color video consisting of k frames with a resolution of a. To display multichannel images in the RGB space, 24 bits, with 8 bits for each color channel, are used. Firstly, each color frame is separated into three color channels, namely R, G, and B. Secondly, we flatten each color channel frame into one vector and arrange the vectors side by side to form a spatio-temporal data matrix, called the color channel spatio-temporal matrix. Finally, we obtain three spatio-temporal data matrices corresponding to the three color channels. The process of flattening the video's frames into the color channel spatio-temporal data matrices is summarized in Algorithm 1, as follows:


Algorithm 1. Construct the Three-Color Channel Spatio-Temporal Data Matrices of the Video
Input: X is a color video consisting of k color frames, where each frame has a resolution of a = m × n.
Output: Three color channel spatio-temporal data matrices, M^(C), C ∈ {R, G, B}, of the video.
1. Split each color frame of X into its three color channel images, C ∈ {R, G, B}.
2. m_i^(C) ← Reshape the i-th color channel image into a column vector of size a × 1.
3. M^(C) ← Arrange the column vectors m_i^(C) side by side to form the color channel spatio-temporal data matrices.
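Algorithm 1 amounts to a channel split followed by a reshape. A minimal NumPy sketch (the function name `build_channel_matrices` and the assumed `(k, m, n, 3)` frame layout are ours):

```python
import numpy as np

def build_channel_matrices(video):
    """Sketch of Algorithm 1. `video` is assumed to be a (k, m, n, 3) RGB array;
    returns three a x k spatio-temporal matrices (a = m * n), one per channel."""
    k, m, n, _ = video.shape
    M = {}
    for idx, C in enumerate("RGB"):
        # Steps 1-2: take channel C of every frame and flatten it to an a-vector;
        # step 3: stack the k vectors as the columns of an a x k matrix.
        M[C] = video[:, :, :, idx].reshape(k, m * n).T
    return M

video = np.zeros((5, 4, 6, 3), dtype=np.uint8)   # 5 frames of 4 x 6 RGB
video[:, :, :, 0] = 255                          # every frame is pure red
M = build_channel_matrices(video)
```

Each column of `M[C]` is one flattened frame, and each row traces one pixel position through time, matching the description above.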

Singular Spectrum Analysis for Background Initialization
This section presents the central part of our background initialization method using SSA in detail. We introduce how to apply SSA to the background initialization task, given that X is a color video sequence of k frames, where each frame has a resolution of a = m × n. Firstly, by using Algorithm 1, as discussed in Section 3.1, we receive three color channel spatio-temporal data matrices M^(C) of size a × k, with C ∈ {R, G, B} representing the R, G, or B color channel, which can be written as M^(C) = (m_1^(C), m_2^(C), ..., m_k^(C)).

Embedding: We construct the trajectory matrices F^(C) from the color channel spatio-temporal data matrices M^(C) by the embedding operator T. The dimensions of the matrices F^(C) are determined by two window lengths, W_a and W_k, where 1 ≤ W_a ≤ a, 1 ≤ W_k ≤ k, and 1 < W_a W_k < ak; then L_a = (a − W_a + 1) and L_k = (k − W_k + 1). The input 2D matrix M^(C) is organized into the block-Hankel matrix F^(C) of size (W_a W_k × L_a L_k) as follows:

F^{(C)} = \begin{pmatrix} T_1 & T_2 & \cdots & T_{L_k} \\ T_2 & T_3 & \cdots & T_{L_k+1} \\ \vdots & \vdots & \ddots & \vdots \\ T_{W_k} & T_{W_k+1} & \cdots & T_k \end{pmatrix},

where each block T_j is a W_a × L_a Hankel matrix composed from the j-th column of the color channel spatio-temporal data matrix M^(C).

Decomposition: We perform SVD on the trajectory matrices F^(C) to obtain sets of rank-1 matrices.
F^{(C)} = U^{(C)} \Sigma^{(C)} (V^{(C)})^T = \sum_{i=1}^{r} \sigma_i^{(C)} u_i^{(C)} (v_i^{(C)})^T,

where U^(C) = [u_1^(C), ..., u_r^(C)] and V^(C) = [v_1^(C), ..., v_r^(C)] contain the left and right singular vectors, Σ^(C) = diag(σ_1^(C), ..., σ_r^(C)), and r is the rank of F^(C).

Grouping: The rank-one matrices are merged following general criteria; aggregating the rank-one matrices yields the grouped matrices in N (N ≤ r) groups:

F^{(C)} = F^{(C)}_{G_1} + F^{(C)}_{G_2} + \cdots + F^{(C)}_{G_N}, \quad F^{(C)}_{G_j} = \sum_{i \in G_j} \sigma_i^{(C)} u_i^{(C)} (v_i^{(C)})^T.

Return to the object decomposition: The grouped matrices are transformed back to the form of the input object by applying T^{-1} based on the diagonal averaging method, as described in Equation (4), yielding components S_m^(C), where m = 1, 2, 3, ..., N.
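The two-window embedding can be sketched as a Hankel-block-Hankel construction. This is one standard layout consistent with the stated dimensions; the exact ordering used in the paper may differ, and the function name `embed2d` is ours:

```python
import numpy as np

def embed2d(M, Wa, Wk):
    """Sketch of the 2D embedding operator T: build a Hankel-block-Hankel
    trajectory matrix of an a x k matrix M, of size (Wa*Wk) x (La*Lk)."""
    a, k = M.shape
    La, Lk = a - Wa + 1, k - Wk + 1
    F = np.empty((Wa * Wk, La * Lk))
    for p in range(Wk):                 # block row p uses blocks T_{p+1}..T_{p+Lk}
        for q in range(Lk):
            col = M[:, p + q]           # block T_{p+q+1} is built from column p+q
            T = np.array([col[i:i + La] for i in range(Wa)])
            F[p * Wa:(p + 1) * Wa, q * La:(q + 1) * La] = T
    return F

M = np.arange(12.0).reshape(3, 4)       # toy 3 x 4 spatio-temporal matrix
F = embed2d(M, Wa=2, Wk=2)              # (2*2) x (2*3) trajectory matrix
```

Blocks on the same block anti-diagonal coincide, mirroring the Hankel property of the 1-D case at the block level.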

Grouping of Eigentriples
This section analyzes and determines the specific meaning of an eigentriple or a group of eigentriples in video data. The first step is to set a window length. The algorithm proposed in this study separates the set of eigentriples into two groups, as described in Equation (6). Both groups reconstruct output data, resulting in two reconstructed output components for the given input data, so we set all window lengths to 2 in this study. We selected three video sequences, namely Board, CaVignal, and IBMtest2, for analysis and observation. This experiment considers window length sizes of 2, 4, and 10, respectively. Figure 4 presents the plots of eigenvectors from the trajectory matrices of the three videos, as we analyzed the data with different window length sizes. As shown in Figure 4, from top to bottom, the blue line represents the first component, and the other colored lines indicate the remaining components. We can see that the first eigenvector is always constant over time. The eigenvalue represents the magnitude of the data, and the eigenvector indicates the direction of the data. Therefore, the first eigenvector represents the data component that is unchanged over time. This is referred to as the stable component (S), representing the background in the video. For that reason, we reconstruct the background based on this first eigentriple, and the dynamic component (ε_D) is obtained from the remaining eigentriples.

In summary, we split the set of indices {1, 2, ..., r} into two groups, namely a stable component and a dynamic component. The result of this step is the representation

F^{(C)} = S^{(C)} + \varepsilon_D^{(C)}, \quad S^{(C)} = F_1^{(C)}, \quad \varepsilon_D^{(C)} = \sum_{i=2}^{r} F_i^{(C)},

where r is the rank of the trajectory matrix.

Proposed Method
From the arguments presented above, by using the first eigentriple of the color channel spatio-temporal data matrices, we can construct an effective background image of the given video, while the remaining eigentriples are used to construct the foreground. The main part of the BISSA method is summarized in Algorithm 2.

Algorithm 2. Initialize Background Using SSA
Input: Color channel spatio-temporal data matrices M^(C) of video X.
Output: Background and foreground images corresponding to each input color frame.
1. Construct the trajectory matrix F^(C).
2. Decompose the trajectory matrix F^(C) into a sum of rank-one elementary matrices, where r^(C) = rank(F^(C)).
3. Construct the background component S^(C) from the first eigentriple group (i = 1).
4. Construct the foreground component ε_D^(C) from the remaining eigentriple groups.
5. Perform diagonal averaging on S^(C).
6. Perform diagonal averaging on ε_D^(C).
7. Reshape the first column of S^(C) to a matrix of size m × n.
8. Reshape the columns of ε_D^(C) to matrices of size m × n.
9. Merge the three color channels of S^(C) to obtain the background image of the video.
10. Merge the three color channels of ε_D^(C) to obtain the foreground images ε_bg,i corresponding to the i-th frame.

Given a video X consisting of k color frames, where each frame has a resolution of a = m × n, applying Algorithm 1 yields the three color channel spatio-temporal data matrices corresponding to the three color channels. Each matrix contains k columns and a rows: each column corresponds to one frame, and each row contains the k values of the same pixel position across the video. By applying Algorithm 2 to the three color channel spatio-temporal data matrices separately, we process and exploit the relevance of all frames over time. In our proposed method, we use the eigentriples of the color channel spatio-temporal data matrix to construct two groups of matrices. The first group is constructed using only the first eigentriple and contains the background information of the video; the second group is built from the remaining eigentriples and contains the foreground information. To obtain the background images, we reshape each column of the first reconstructed matrix to a matrix of size m × n; the color background image is then obtained by merging the three color channels. Moreover, the k obtained color background images are identical; therefore, only the first column of the first matrix is used to reconstruct the background image, which saves processing time. Experimental results that support these arguments are discussed in more detail in the next section.
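The core of Algorithm 2 can be sketched in a few lines of NumPy. The sketch below is a deliberate simplification of our own devising: it applies the first-eigentriple idea directly to the channel spatio-temporal matrix M^(C) via an SVD, omitting the trajectory-matrix embedding and diagonal-averaging steps, and the name `bissa_background` is illustrative only.

```python
import numpy as np

def bissa_background(frames):
    """Rank-1 (first-eigentriple) background sketch for an RGB video.

    frames: float array of shape (k, m, n, 3) holding k RGB frames.
    Returns an (m, n, 3) background image.
    """
    k, m, n, _ = frames.shape
    background = np.empty((m, n, 3))
    for c in range(3):  # process the R, G, B channels separately
        # Color channel spatio-temporal matrix: one flattened frame per column.
        M = frames[:, :, :, c].reshape(k, m * n).T  # shape (m*n, k)
        # The first eigentriple of M carries the stable (background) component.
        u, s, vt = np.linalg.svd(M, full_matrices=False)
        S = s[0] * np.outer(u[:, 0], vt[0])
        # All k columns of S are (near-)identical, so reshape only the first.
        background[:, :, c] = S[:, 0].reshape(m, n)
    return background

# Toy video: a fixed random background with a brief 2x2 foreground patch.
rng = np.random.default_rng(0)
bg = rng.uniform(50.0, 100.0, size=(8, 8, 3))
video = np.stack([bg.copy() for _ in range(20)])
video[5, 2:4, 2:4, :] = 255.0  # the foreground appears only in frame 5
est = bissa_background(video)
```

On this toy input, the rank-one approximation suppresses the transient patch and returns an estimate close to the true background, which is exactly the behavior the first eigentriple provides in the full method.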

Experimental Results and Discussion
In this section, to demonstrate the effectiveness of our proposed method, experiments on the background initialization problem are conducted on the most popular benchmark database, the SBI [6] database. We also compare our background initialization performance with several state-of-the-art methods, such as median [7], RPCA [13], dynamic mode decomposition (DMD) [20], non-negative matrix factorization (NMF) [21], and background estimation by WiSARD (Wilkes, Stonham and Aleksander Recognition Device, BEWiS) [27]. Finally, to assess the accuracy of the obtained background images against the ground truth images, we use several measurement metrics: the structural similarity index (SSIM) [41], the feature similarity index for image quality assessment (FSIM) [42], the peak signal-to-noise ratio (PSNR) [43], the average gray-level error (AGE), and the percentage of error pixels (pEPs) [44].

SBI Database
This database was introduced by L. Maddalena at the workshop on scene background modeling and initialization in 2016. The SBI database [6] consists of 14 different image sequences, namely Board, Candela_m1.10, CAVIAR1, CAVIAR2, CaVignal, Foliage, Hall&Monitor, HighwayI, HighwayII, HumanBody2, IBMtest2, People&Foliage, Snellen, and Toscana, as shown in Figure 5a. These sequences are composed of 6 to 740 frames, and their dimensions vary from 144 × 144 to 800 × 600. Each image sequence is accompanied by a ground truth background image, as shown in Figure 5b. SBI was designed to evaluate existing and future background initialization algorithms, and its sequences cover the different challenges of the background initialization task. The first is the clutter challenge, where objects almost entirely cover the background, as in the Board, People&Foliage, Foliage, and Snellen sequences. The HighwayI and HighwayII sequences contain many constantly moving cars and pose challenges such as shadows and camera jitter. The Candela_m1.10 sequence presents a scenario where a man appears with his bag and leaves the scene with nothing. The CaVignal sequence is challenging because the man appears and holds his position in more than 60% of the frames before leaving. Several sequences involve intermittent motion, such as Candela_m1.10, CAVIAR1, CAVIAR2, CaVignal, Hall&Monitor, and People&Foliage. The HumanBody2 and IBMtest2 sequences contain basic challenges. Finally, the Toscana sequence consists of only 6 frames, which presents a short-video challenge. Our goal is to propose a background estimation method that obtains the factual background of a given video.


Evaluation and Result
Our paper proposes an efficient method for the background initialization task. As discussed in Section 3, given a color video sequence X of k frames f_i, i = 1, 2, . . . , k, with a resolution of a = m × n, the background initialization algorithm based on SSA proceeds as follows. Firstly, the color frames are split into three color channels, f_i^(C), C ∈ {R, G, B}, and each channel is flattened into a column vector of the color channel spatio-temporal data matrix M^(C). These matrices contain both the spatial and temporal characteristic information of the original video: in the M^(C) matrices, the correlation between pixels located at the same position in adjacent frames is preserved over time. Next, the M^(C) matrices are decomposed by SSA to find the eigentriples for constructing the stable and dynamic components. The stable component, containing the background information, is computed by grouping the first eigentriple, and the remaining eigentriples construct the dynamic component. Finally, by reshaping an arbitrary column of S^(C) to a matrix of size m × n and then merging the three color channels, we obtain the corresponding background image of the video. Similarly, by reshaping the columns of ε_D^(C) to matrices of size m × n and then merging the three color channels, we obtain a sequence of foreground images corresponding to each video frame. Figure 6 displays the background and foreground images obtained for frames of the HighwayI, IBMtest2, CAVIAR2, and HighwayII sequences using our proposed method: the background images are presented in the second row, and the corresponding foreground images are illustrated in the third row. These videos contain several challenges, such as intermittent motion, shadows, camera jitter, and the basic challenge. Using our proposed method, for each video containing k frames, we can obtain k background images and k foreground images.
This study focused on the background initialization task; however, we also obtained impressive foreground results, as shown in the third row of Figure 6. As can be seen, all moving objects are eliminated from the original frame's image. However, all the background images are the same, as shown in the second row of Figure 6. Therefore, we only need to construct a video's background image by reshaping one column of the stable component to an image to reduce processing time.
Our proposed algorithm can obtain both background and foreground images of a video; however, we focus on handling background initialization in this paper. To show the effectiveness of BISSA, we compare the proposed method with several existing methods: median [7], RPCA [13], DMD [20], NMF [21], and BEWiS [27]. Figure 7 shows the ground truth background images and the background images obtained by the different methods on the 14 video sequences of the SBI dataset. The Toscana sequence includes only six frames, representing the challenge of very short videos; therefore, the convergence criterion is not met by RPCA on this sequence, which means we cannot compute the matrix that contains the weighting of each pixel in the training data. Figure 7a presents the ground truth background images of the 14 video sequences. Figure 7b-g illustrate the background images obtained by BEWiS, median, RPCA, DMD, NMF, and our proposed method, respectively. As can be seen, our proposed method obtains a clear background image in most cases, such as the Board, Candela_m1.10, CAVIAR1, CAVIAR2, Hall&Monitor, HighwayI, HighwayII, HumanBody2, IBMtest2, and Snellen image sequences.

For the CaVignal image sequence, the obtained background image is not as expected because the man appears and holds his position in more than 60% of the frames before leaving, much like the People&Foliage video sequence, in which the people and trees appear in 338 out of 341 frames. For the Toscana video sequence, the results are also below expectation because there are too few frames (only six) and the object occupies the majority of the scene. In summary, our proposal achieves positive results on the basic, intermittent motion, camera jitter, and shadows challenges, but struggles slightly in handling cluttered and very short video sequences. Even so, the obtained results still compare favorably with other methods, such as RPCA, DMD, and NMF.

To assess the accuracy of the obtained background images against the ground truth images, we use five measurement metrics, SSIM [41], FSIM [42], PSNR [43], AGE, and pEPs [44], to measure the similarity between two images. These image-to-image metrics measure the visual correctness of an estimated background image against a ground truth background image, and they exploit different aspects of image quality evaluation, leading to a comprehensive evaluation of the obtained results. Table 1 summarizes the range of values and the preference of each metric. As can be seen in Table 1, for the SSIM, FSIM, and PSNR metrics, higher values demonstrate a higher similarity between the two images; on the contrary, for the AGE (average gray-level error, range [0-255]) and pEPs (percentage of error pixels, range [0-1]) metrics, lower values indicate a higher similarity between the obtained backgrounds and the ground truth images.

Table 2 summarizes the quantitative results, with the best value for each metric highlighted in bold. As shown in Table 2, the BISSA method achieves high performance on most videos under the SSIM, FSIM, and PSNR metrics. With the pEPs metric, our proposed method performs well on most videos except the Board, CaVignal, and Snellen sequences; BEWiS is the best performer on Foliage, but BISSA is still better than the RPCA, DMD, median, and NMF methods. With the AGE metric, our proposed method performs well on most videos except Candela_m1.10, CAVIAR1, and HumanBody2, though BISSA is still better than the remaining methods. Under the PSNR metric, our proposed method also performs well on most videos except Candela_m1.10 and Snellen.
In most videos, the backgrounds obtained by our proposed method are very similar to the ground truths.
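For reference, the pixel-level metrics above are straightforward to compute. The sketch below gives minimal implementations of AGE, pEPs, and PSNR under our own naming; note that the pEPs threshold of 20 gray levels is the value commonly used in SBI-style evaluations and should be treated here as an assumption rather than a quotation of [44].

```python
import numpy as np

def to_gray(img):
    """Luminance of an RGB image with values in [0, 255] (BT.601 weights)."""
    return img @ np.array([0.299, 0.587, 0.114])

def age(estimated, ground_truth):
    """Average gray-level error: mean absolute gray-level difference."""
    return np.abs(to_gray(estimated) - to_gray(ground_truth)).mean()

def peps(estimated, ground_truth, tau=20):
    """Percentage of error pixels: fraction whose gray-level error exceeds tau."""
    err = np.abs(to_gray(estimated) - to_gray(ground_truth))
    return (err > tau).mean()

def psnr(estimated, ground_truth, peak=255.0):
    """Peak signal-to-noise ratio in dB over all color values."""
    diff = np.asarray(estimated, float) - np.asarray(ground_truth, float)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

gt = np.full((4, 4, 3), 100.0)   # 4x4 ground truth background
est = gt.copy()
est[0, 0] = 200.0                # one badly wrong pixel out of 16
print(round(age(est, gt), 2))    # 6.25: a 100-level error averaged over 16 pixels
print(peps(est, gt))             # 0.0625: 1 of 16 pixels exceeds the threshold
```

The example shows why the two error metrics complement each other: AGE spreads a localized error over the whole image, while pEPs counts how many pixels are visibly wrong.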

Conclusions
This study proposes an effective background initialization algorithm for image sequences. By storing the color frame sequences of a video in color channel spatio-temporal data matrices, we preserve the correlation between pixels located at the same position in adjacent frames over time. Next, the SSA method is applied to these spatio-temporal data, and the stable component is constructed using the first eigentriple, which is the component that holds the color background image. In addition, encouraging results for the foreground component were obtained based on the remaining eigentriples. The experimental results on the most popular public scene background initialization database demonstrate our proposed method's effectiveness. The obtained background image is compared with the ground truth background image using the five most common metrics: SSIM, FSIM, PSNR, AGE, and pEPs. The results show that our study achieved positive results, especially in dealing with the basic, clutter, intermittent motion, camera jitter, and intense shadow challenges. In addition, the results also show that our proposed method achieves a good color background image when compared with state-of-the-art techniques, such as BEWiS, median, RPCA, DMD, and NMF. However, videos recorded with few frames (less than 20 frames) and intermittent object motion challenges (such as the CaVignal sequence) remain open problems. Moreover, computing the background from the first eigentriple alone yields a good estimate of the background but is not optimal for obtaining a reasonable estimate of the foreground. In the future, we will continue to improve our method to achieve a better background and to accurately detect moving objects in video.