Motion Segmentation Based on Model Selection in Permutation Space for RGB Sensors

Motion segmentation is aimed at segmenting the feature point trajectories belonging to independently moving objects. Using the affine camera model, the motion segmentation problem can be viewed as a subspace clustering problem—clustering data points drawn from a union of low-dimensional subspaces. In this paper, we propose a solution for motion segmentation that uses a multi-model fitting technique. We propose a data grouping method and a model selection strategy for obtaining more distinguishable data point permutation preferences, which significantly improves the clustering. We perform extensive testing on the Hopkins 155 dataset and two real-world datasets. The experimental results illustrate that the proposed method can deal with incomplete trajectories and the perspective effect, comparing favorably with the current state of the art.


Introduction
Motion segmentation is aimed at segmenting objects with different motions in a video and has become an essential issue for many computer vision applications, such as visual odometry and video segmentation. A review of motion segmentation can be found in Zappella et al. [1].
In this paper, we propose a robust solution that addresses the issue of motion segmentation. In the case of affine cameras, the trajectories of a rigidly moving object lie in a linear subspace of at most four dimensions, and the trajectories of different objects lie in different subspaces [2,3]. Thus, motion segmentation is equivalent to the clustering of the data into subspaces.
Based on subspace clustering, motion segmentation algorithms can be classified into four categories [4,5]: algebraic methods [6][7][8], statistical methods [9][10][11][12], iterative methods [13][14][15], and spectral clustering methods [16][17][18][19][20][21][22][23]. The methods in the first three categories require the dimension and number of subspaces as prior information and are sensitive to initial values and noise. The spectral clustering methods are effective at data clustering but cannot handle outliers and noise, and often require post-processing. In recent years, some motion segmentation algorithms based on deep learning have been proposed [24][25][26][27], which usually obtain more accurate segmentation results. However, the results of deep-learning-based methods depend strongly on semantic segmentation. Therefore, deep learning methods require a sufficient number of samples and may fail without concrete semantic information. This is different from the motion segmentation problem we consider: some objects in actual data carry no semantic information. Our proposed solution for motion segmentation does not depend on semantic information.
Recently, many multi-model fitting methods have been developed to solve the problem of motion segmentation [28][29][30][31][32][33][34]. These methods first generate model hypotheses by sampling, and then estimate the model parameters by analyzing the point-to-hypothesis preferences. J-linkage [28] constructs preference sets of the points in a conceptual space through selected inlier thresholds, and then clusters the points according to the Jaccard distance between their preference sets. The main contributions of this paper are as follows:
1. We propose a data grouping method, which defines the similarity between data points, and introduce the LSH tool in the processing of the similarity to group the data points;
2. We propose a model selection approach that combines energy minimization and the geometric robust information criterion (GRIC) to optimize the model set obtained by the data grouping;
3. No prior knowledge, such as the number of motions, is needed, as this can be automatically estimated through the model selection.
The structure of this paper is as follows. In Sections 2 and 3, we describe the proposed motion segmentation algorithm in detail. The data grouping process is presented in Section 2, and Section 3 introduces the model selection approach. The experimental results are presented in Section 4. Finally, we draw conclusions in Section 5.

Data Grouping in Permutation Space
Before we describe our method, we first briefly review the basic formulation setup in motion segmentation.
Under the affine projection model, it is assumed that an F-frame image sequence is extracted from the video. The image sequence is then preprocessed by a feature point extraction algorithm, such as scale-invariant feature transform (SIFT) or speeded-up robust features (SURF), to obtain N tracked feature points (x_fn, y_fn). The 3D coordinates (X_n, Y_n, Z_n), n = 1, ..., N, of the tracked points are then projected to 2D coordinates by Equation (1) [1]:

[x_fn; y_fn] = A_f [X_n, Y_n, Z_n, 1]^T, (1)

where A_f = [R_{2×3} | T_{2×1}] ∈ R^{2×4} is the affine motion matrix of the f-th frame. The input of the motion segmentation problem under the affine projection model is the trajectory matrix composed of the 2D coordinates of the N tracked feature points over all F frames. Stacking Equation (1) over the frames, we can write it in the form W_{2F×N} = A_{2F×4} S_{4×N}, where W is the trajectory matrix, A stacks the per-frame affine motion matrices, and S holds the homogeneous 3D coordinates [49]. Clearly, rank(W) = rank(A_{2F×4} S_{4×N}) ≤ 4. That is, in the affine projection model, the N trajectories from m rigid motions all lie in a union of m linear subspaces of dimension at most four in R^{2F}, and trajectories from a single rigid motion lie in the same subspace. Therefore, the motion segmentation problem can be solved by clustering the data into subspaces.
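As a quick illustrative check of this factorization (a minimal sketch, not from the paper: the motion matrices and 3D points are random), one can build a synthetic trajectory matrix for a single rigid object and verify that its rank does not exceed four:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 10, 30                      # frames, tracked points on one rigid object

# Homogeneous 3D coordinates of the rigid object: S is 4 x N.
S = np.vstack([rng.normal(size=(3, N)), np.ones((1, N))])

# One 2x4 affine motion matrix per frame, stacked into A (2F x 4).
A = np.vstack([rng.normal(size=(2, 4)) for _ in range(F)])

W = A @ S                          # trajectory matrix, 2F x N
print(np.linalg.matrix_rank(W))    # at most 4
```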

Preference Analysis
As stated in [50], the probability of two points having arisen from the same model can be estimated from residual sorting information. Therefore, given the data point set X = {x_i}, i = 1, ..., N, the proposed method starts by shifting the data points to the permutation space. More specifically, in the manner of random sampling, a large number of hypotheses {θ_j}, j = 1, ..., M, are first generated from X. The residuals of the data points are then computed and stored in the N × M matrix R, where the rows represent the N points and the columns represent the M hypotheses. Therefore, for data point x_i, its absolute residuals to all M hypotheses form the vector r_i = [r_{i,1}, r_{i,2}, ..., r_{i,M}]. The preference of x_i is then the permutation τ_i that sorts r_i in ascending order, i.e., r_{i,τ_i(1)} ≤ r_{i,τ_i(2)} ≤ ... ≤ r_{i,τ_i(M)}. The "coincidence rate" between two preferences τ_i and τ_j is obtained as f(τ_i, τ_j) = |τ_i^k ∩ τ_j^k| / k, where |τ_i^k ∩ τ_j^k| is the number of identical elements in the top-k prefixes of the two permutations. Generally speaking, in model fitting problems we are interested in the top k permutation preferences rather than a full ranking, and we set k = M/10. A larger coincidence rate indicates that data points x_i and x_j are more similar.
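The preference computation above can be sketched as follows (an illustrative sketch: the residual matrix here is random rather than coming from real sampled hypotheses, and the function names are ours):

```python
import numpy as np

def top_k_preferences(R, k):
    """Rows of R are points, columns are hypotheses; return the top-k
    preference set of each point (indices of its k smallest residuals)."""
    order = np.argsort(np.abs(R), axis=1)      # permutation sorting residuals ascending
    return [set(row[:k]) for row in order]

def coincidence_rate(pref_i, pref_j, k):
    """Fraction of hypotheses shared by two top-k preference sets, in [0, 1]."""
    return len(pref_i & pref_j) / k

rng = np.random.default_rng(1)
R = rng.random((5, 100))                       # N = 5 points, M = 100 hypotheses
k = R.shape[1] // 10                           # the paper sets k = M/10
prefs = top_k_preferences(R, k)
print(coincidence_rate(prefs[0], prefs[1], k))
```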
In order to better express the observation that points sharing the same preference may belong to the same structure, we use a positive semi-definite kernel matrix S ∈ [0, 1]^{N×N} to define the similarity between x_i and x_j in terms of the distance ε(i, j) = 1 − f(τ_i, τ_j) between x_i and x_j.

Data Grouping by Locality-Sensitive Hashing (LSH)
As stated in Section 2.1, the similarity matrix S ∈ [0, 1]^{N×N} measures the degree of similarity between data points. We therefore use the rows of S to redefine each point: the value on the diagonal is 1, and the point x_i is expressed as a similarity permutation vector ŝ_i = [s(i,1), s(i,2), ..., s(i,N)]. We define a concept of "dual similarity": if the similarity permutation vectors ŝ_i and ŝ_j are similar, the data points x_i and x_j have a high probability of belonging to the same motion model. That is to say, grouping the data points by this similarity is a feasible solution.
Locality-sensitive hashing (LSH) is an approximate nearest neighbor search tool, as stated in [51], and hashes high-dimensional points into buckets based on locality, where points of high similarity are hashed into the same bucket. However, if we directly use LSH to hash the data points by Euclidean distance, as in [51], the points in an identical bucket will most likely belong to different motion models. We therefore use the similarity matrix instead of the original data, which is equivalent to applying a preference constraint to the data points, so that the points in an identical bucket have a high probability of belonging to the same motion model. We adopt the same p-stable LSH as in [51] to process the Euclidean distance between the similarity permutation vectors, completing the initial grouping of the original data points. P-stable LSH is a locality-sensitive hashing method based on p-stable distributions, which calculates the hash values h_1 and h_2 of the eigenvectors v_1 and v_2 of the similarity permutation vectors ŝ_i and ŝ_j, respectively. Since the hash function is locally sensitive, the closer the two eigenvectors v_1 and v_2 are, the larger the probability that the hash values h_1 and h_2 map to the same bucket, and vice versa.
The p-stable LSH hash function is defined as h(v) = ⌊(a · v + b)/w⌋, where ⌊·⌋ is the round-down function, each entry in the vector a is chosen independently from a p-stable distribution, w is a constant greater than 0, and b is a real number chosen uniformly from the range [0, w]. For a detailed description of p-stable LSH, see [52]. In order to prevent the existence of small clusters (with fewer data points than the minimal sample set (MSS)), we first choose a small number of high-density buckets, as in [51], which contain a significant portion of the data. Because points with high similarity have a high probability of being assigned to the same bucket, these buckets can be used to represent the initial model clusters C = {c_1, c_2, ..., c_t}. These models are then spread to the rest of the data points in a top-down fashion, i.e., we map each data point to its closest model. Finally, we obtain data clusters containing all the data points, as shown in Figure 1, where points in an identical cluster belong to the same motion model. Therefore, this approach can provide a good initialization for the iterative process in the model selection.
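A minimal sketch of one such hash function, following the standard p-stable (Gaussian, p = 2) construction; the class name and parameter values are illustrative, not from the paper:

```python
import numpy as np

class PStableLSH:
    """One p-stable hash function h(v) = floor((a . v + b) / w)."""
    def __init__(self, dim, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=dim)   # entries drawn from a 2-stable (Gaussian) distribution
        self.b = rng.uniform(0.0, w)    # offset drawn uniformly from [0, w]
        self.w = w

    def __call__(self, v):
        return int(np.floor((self.a @ v + self.b) / self.w))

h = PStableLSH(dim=3)
v1 = np.array([1.0, 2.0, 3.0])
v2 = v1 + 1e-3                          # nearby vectors...
print(h(v1), h(v2))                     # ...usually land in the same bucket
```

In practice several such functions are concatenated into a compound key, and several hash tables are queried, to trade off precision against recall.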
Figure 1. Top (a-c): Ground truth for the checkerboard sequence 2RT3RCT_B, the traffic sequence cars9, and the articulated sequence people2, respectively, where the data points belonging to different motion models are labeled with different colors. Bottom (d-f): The corresponding data point grouping results. We obtain many data clusters, and points in the same cluster almost always belong to the same motion model.


Model Selection
The number of models in the initial model set C = {c 1 , c 2 , · · · , c t } obtained by LSH is redundant, so we use a strategy combining energy minimization and the GRIC criterion to select the model that best fits the data.
Firstly, with random sampling in c_i, the MSS contains almost no outliers, and the generated hypothesis is more likely to be a good fit to the data. We then use energy minimization to select the hypothesis that best fits each cluster.
We adopt an energy E composed of a data energy E_d and a smoothness energy E_s to measure the quality of the fitting:

E = E_d + E_s. (9)

The data term E_d penalizes inaccuracies induced by the point-to-model assignment, and is generally defined as E_d = Σ_i D(x_i, θ_{f_i}), where D is a distance function between point x_i and the model hypothesis assigned to it by the labeling f. If we let N denote the set of all neighboring data point pairs, the smoothness energy is E_s = Σ_{(i,j)∈N} V(f_i, f_j), which penalizes f_i ≠ f_j for points in a neighborhood. The minimization of Equation (9) can be optimized effectively with the α-expansion algorithm [53].

After the initial selection by energy minimization, we obtain t redundant models, and then select n (n ≤ t) models that best explain the input data using GRIC. GRIC is a model selection algorithm that establishes a scoring mechanism to rate each model, allowing us to select the model with the lowest score. The GRIC criterion can robustly select the motion model and detect the presence of outliers, and is defined as follows:

GRIC = Σ_i ρ(e_i²) + λ_1 d n + λ_2 k. (10)

The first term is the error function, which is defined according to the Huber function [54] as ρ(e_i²) = min(e_i²/σ², 2(r − d)), where e_i represents the residual of the point, and (r − d) is the codimension of the r-dimensional points fitted by a manifold of dimension d. It can be seen that the error function represents the goodness of fit. The term (λ_1 d n + λ_2 k) in Equation (10) is a penalty on the complexity of the model: λ_1 d n penalizes the dimensionality of the model, where the greater the dimension of the model, the greater the penalty; λ_2 k penalizes the number of parameters, so that models with more parameters are penalized more heavily [41]. Therefore, the model GRIC selects is the one with the highest information content but the least complexity. In addition, we set the penalty factors λ_1 and λ_2 as λ_1 = log(4) ≈ 1.4 and λ_2 = log(4n), where n is the number of data points and k is the number of parameters of the fitted model.
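A sketch of the GRIC score under these definitions (the saturation level 2(r − d) for the robust error and the example numbers are our illustrative assumptions):

```python
import numpy as np

def gric(residuals, sigma, d, r, k, n):
    """GRIC score: robust error term plus penalties on model dimension (d)
    and number of parameters (k), for n points of dimension r."""
    lam1 = np.log(r)                      # with r = 4, lam1 = log(4) ~ 1.39
    lam2 = np.log(r * n)                  # = log(4n) for r = 4
    e2 = (np.asarray(residuals) / sigma) ** 2
    rho = np.minimum(e2, 2.0 * (r - d))   # saturate the cost for outliers
    return rho.sum() + lam1 * d * n + lam2 * k

res = np.array([0.1, 0.2, 0.15, 5.0])     # the last point acts as an outlier
print(gric(res, sigma=0.5, d=3, r=4, k=12, n=4))
```

The lowest-scoring model is kept; note that a larger residual can only add up to the saturated value 2(r − d), so a few outliers cannot dominate the choice.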
The energy minimization and GRIC are conducted alternately and continuously until the model set is almost unchanged. Figure 2 shows the model selection results on the 2RT3RCT_B sequence by the proposed model selection approach. As can be seen from Figure 2c, the selected models are very similar to the real model.
V(f_i, f_j) is derived from the Potts model: V(f_i, f_j) = 1 if f_i ≠ f_j and 0 otherwise, which penalizes neighboring points that receive different labels.

Model Clustering
Through the model selection, we obtain the number of models and the data point permutation preference information represented by the residual matrix R. The similarity matrix S of the data points is derived from the residual matrix R according to the steps in Section 2.1, which can express the data point permutation preferences well. Since permutation preferences for the points have been proven to be able to distinguish inliers belonging to different models ("model" refers to subspace in motion segmentation) [55,56], bottom-up linkage clustering is adopted in the permutation space for clustering the points. Therefore, points with similar permutation preferences can be sampled to generate good hypotheses, and good hypotheses can make the permutation preferences more distinguishable, thereby improving the clustering.
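The bottom-up linkage step can be sketched as follows (a naive O(N³) illustration on a toy similarity matrix, not the paper's implementation; real code would use an optimized hierarchical clustering routine):

```python
import numpy as np

def linkage_cluster(S, n_clusters):
    """Greedy bottom-up (single-linkage style) clustering on a similarity matrix S:
    repeatedly merge the two clusters containing the most similar cross pair."""
    clusters = [{i} for i in range(S.shape[0])]
    while len(clusters) > n_clusters:
        best, pair = -1.0, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(S[i, j] for i in clusters[a] for j in clusters[b])
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        clusters[a] |= clusters.pop(b)
    return clusters

# Two obvious groups in the similarity matrix:
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.2],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
print(linkage_cluster(S, 2))   # -> [{0, 1}, {2, 3}]
```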
We present the detailed steps in Algorithm 1.

Experiments
To test the performance of the proposed method, we carried out motion segmentation experiments on the Hopkins 155 dataset [57] and two real-world datasets. We evaluated the performance in terms of the classification error [57].
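The classification error used throughout is, in essence, the misclassification rate minimized over label permutations, since cluster labels are only defined up to a relabeling. A small sketch (hypothetical labels; brute force over permutations, so only suitable for a handful of motions):

```python
import numpy as np
from itertools import permutations

def classification_error(pred, gt):
    """Fraction of misclassified points, minimized over relabelings of pred."""
    labels = sorted(set(gt))
    best = 1.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        err = np.mean([mapping[p] != g for p, g in zip(pred, gt)])
        best = min(best, err)
    return best

gt   = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]               # same grouping, labels swapped
print(classification_error(pred, gt))   # -> 0.0
```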

Results of the Hopkins 155 Dataset
The Hopkins 155 dataset contains 155 video sequences, where 120 of the videos have two motions and 35 of the videos have three motions. In addition, it contains complex motion scenes with many noise points and isolated points. The sequences can be roughly divided into three categories: Checkerboard sequences, traffic sequences, and articulated sequences.
We compare the proposed method with the state-of-the-art approaches of random sample consensus (RANSAC) [9], generalized principal component analysis (GPCA) [6], local subspace affinity (LSA) [17], agglomerative lossy compression (ALC) [12], the sparse subspace clustering algorithm (SSC) [5], J-linkage [28], and T-linkage [30]. The average and median classification errors of the different scenes are listed in Tables 1 and 2, and the average and median classification errors of the other methods are obtained from [19,30]. Note that, in order to obtain satisfactory results, our method only requires tuning one parameter (the permutation length), which is far fewer than many other state-of-the-art methods. We can make the following observations from the two tables. The RANSAC, GPCA, LSA, and ALC methods have high classification errors across the entire experiment. Meanwhile, the SSC method always performs well; even on the challenging articulated sequences, its classification error is only 1.42% for three motions and 0.62% for two motions. However, the proposed method performs the best among all the methods on the checkerboard and traffic sequences, obtaining the lowest classification error. The classification error is significantly reduced, about 12 times lower than the best result previously reported, by SSC. On the articulated sequences, it scores second-best, fairly close to the SSC algorithm, and its classification error is still much lower than that of the other methods. Moreover, most of the existing methods do not perform well on the articulated sequences. This is because the motions in the two_cranes video sequence are very complex and partially dependent on each other (as shown in Figure 3). We can observe from Figure 3a that the number of tracking points is only 94, making it impossible to generate sufficient hypotheses for good permutation preferences.
Figure 3b shows the segmentation result for three motions, with a classification error of 3.29%. Figure 3c-e shows the segmentation results for two motions, with classification errors of 5.13%, 3.9%, and 4.17%, respectively. Figure 4g,h shows the articulated video sequences. It is very difficult for many methods to correctly segment motion models that are close in the spatial domain, because they rely on spatial constraints between data points, such as in sampling and clustering. On the contrary, since we group the data points based on similarities in the feature space, instead of grouping them by Euclidean distance directly in the Euclidean space, the spatial constraint is not so important for motion model grouping. Therefore, our method can well segment motion models that are spatially close.

Results of the Real-World Dataset
The Hopkins155 dataset has some limitations, such as limited depth reliefs and dominant camera rotations. Taking into account these limitations, it is not appropriate to use this dataset as a benchmark for investigating motion segmentation capability in the wild [58]. Real-world sequences contain real challenges, such as missing data, unknown number of motions, and perspective effects [48]. For this reason, we also evaluated the proposed method on the real-world datasets: The MTPV62 dataset [48] and the KITTI 3D Motion Segmentation Benchmark (KT3DMoSeg) [58].
The MTPV62 dataset comprises 62 video sequences, of which 50 are from Hopkins 155. Another 12 video sequences have heavy occlusions, of which four video sequences are from [54] and another eight video sequences are provided by [48]. Of the 62 video sequences, 26 contain two motions, 36 contain three motions, 12 suffer from seriously missing data, and nine have strong perspective effects. The KT3DMoSeg dataset is a more challenging dataset because it contains strong perspectives and strong forward translations. All sequences of KT3DMoSeg involve strong perspective effects in the background, but the foreground moving objects often have limited depth reliefs [58].
We compare the performance of the proposed method with seven state-of-the-art methods: ALC, GPCA, LSA, SSC, TPV [48], LRR [59], and MSSC [60]. The quantitative results are presented in Table 3. All the classification errors of the seven methods were obtained from [58]. We use Chen's matrix completion approach [61] to handle missing data. Some qualitative results are presented in Figures 5 and 6.
We make the following observations from Table 3. First, we achieved good performance on the Hopkins 50 clips. However, the average classification error on the Missing Data 12 clips is a little high. As seen in Figure 5f, incorrect segmentation on the Raffles sequence results in the high classification error on the MTPV62 dataset; the classification error on the Raffles sequence is as high as 31.33%. The reason is that the distribution of the inliers between the foreground and background is extremely unbalanced, and the background is very complicated. In addition, there are only seven points belonging to the foreground, which makes it difficult to sample an all-inlier minimal set and seriously impacts the quality of the preferences. Secondly, we obtained the best average classification error on the KT3DMoSeg dataset. However, the segmentation accuracy could be further increased considering the complexity of KT3DMoSeg. Many background objects in Figure 6 have noncompact shapes, so the background is often separated, and segmentation at the junction of the foreground and background is very difficult; the most obvious case is Figure 6d. In addition, we adopt a single geometric model in handling the motion segmentation problem. However, the comparison in [58] shows that the performance of multi-view approaches is consistently better than that of a single geometric model. Sometimes subspace overlap occurs with a single geometric model; as presented in Figure 6e,f, some foreground objects are incorrectly segmented into the background.

Conclusions
In this paper we have proposed a robust subspace clustering method that applies multi-model fitting to the problem of motion segmentation. We first transformed the data into permutation space and then defined a similarity matrix based on data point permutation preferences and used this in grouping and clustering the data points. Then, we used a model selection strategy that combines energy minimization and the GRIC information criterion to select the best model, which can generate more distinguishable permutation preferences for the data points, thereby obtaining better clustering results. In the experiments undertaken in this study, the proposed method can deal with incomplete trajectories and perspective effect, achieving state-of-the-art performance in motion segmentation.