RoiSeg: An Effective Moving Object Segmentation Approach Based on Region-of-Interest with Unsupervised Learning

Abstract: Traditional video object segmentation often suffers from low detection speed and inaccurate results due to the jitter caused by pan-and-tilt or hand-held devices. Deep neural networks (DNNs) have been widely adopted to address these problems; however, they rely on large amounts of annotated data and high-performance computing units. Therefore, DNNs are not suitable for some special scenarios (e.g., no prior knowledge or limited computing ability). In this paper, we propose RoiSeg, an effective moving object segmentation approach based on Region-of-Interest (ROI), which utilizes unsupervised learning to achieve automatic segmentation of moving objects. Specifically, we first hypothesize that the central n × n pixels of images act as the ROI to represent the features of the segmented moving object. Second, we pool the ROI to a central point of the foreground to simplify the segmentation problem into an ROI-based classification problem. Third, we implement a trajectory-based classifier and an online updating mechanism to address the classification problem and compensate for class imbalance, respectively. We conduct extensive experiments to evaluate the performance of RoiSeg, and the experimental results demonstrate that RoiSeg is more accurate and faster than other segmentation algorithms. Moreover, RoiSeg not only effectively handles ambient lighting changes, fog, and salt-and-pepper noise, but also copes well with camera jitter and windy scenes.


Introduction
Many researchers have proposed efficient solutions to foreground detection in video object segmentation. Among these solutions, deep neural network (DNN) methods achieve impressive accuracy. However, DNN architectures need large datasets and long training times to reach this accuracy, which makes them unsuitable for special scenarios without enough training samples (e.g., detection of air-dropped objects in military operations) or with strict time requirements (e.g., the interception of an incoming shell). Moreover, these DNN-based methods also require high-performance computing units, which is too expensive for ordinary users. Background subtraction and frame difference are commonly adopted to solve video object segmentation problems [1,2]. Several challenges exist in background subtraction and frame difference, including various illumination changes, camera jitter, dynamic backgrounds, camouflage, shadows, bootstrapping and video noise [3,4]. Although many useful background modeling algorithms have been designed, their performance is limited by their complexity, for example, background subtraction and the modeling of a scene based on each pixel of each frame [5]. Moreover, the accuracy of these algorithms is to some extent affected by wind noise or camera jitter [6].
To deal with these challenges, we propose RoiSeg, an effective object segmentation approach based on Region-of-Interest (ROI), which utilizes unsupervised learning to achieve automatic segmentation of moving objects. RoiSeg hypothesizes that the central n × n pixels of images act as the ROI reflecting the features of the moving object; the classification of all pixels is thereby turned into the classification of ROI central points. In the field of classification, supervised learning methods usually provide better accuracy than unsupervised methods; however, they inevitably need more annotated data, which increases the workload of computing units [7]. To address this problem, RoiSeg adopts an ROI-based automatic generation method to produce training samples in an unsupervised manner. Moreover, RoiSeg also implements an online sample classifier to compensate for the imbalance between different classes.
We highlight our main contributions as follows:
• We propose RoiSeg, an effective object segmentation approach based on ROI, which utilizes unsupervised learning to achieve automatic segmentation of moving objects. RoiSeg not only effectively handles ambient lighting changes, fog, and salt-and-pepper noise, but also copes well with camera jitter and windy scenes.
• We hypothesize the central n × n pixels as the ROI and simplify foreground segmentation into an ROI-based classification problem. In addition, we propose an automatic generation method to produce training samples and implement an online sample classifier to compensate for the imbalance between different classes.
• We conduct extensive experiments to evaluate the performance of RoiSeg, and the experimental results demonstrate that RoiSeg is more accurate and faster than other segmentation algorithms.
The rest of this paper is organized as follows. Section 2 presents a review of related works. The description of RoiSeg is demonstrated in Section 3. The comparison experiments are given in Section 4. Finally, the conclusion is drawn in Section 5.

Related Work
Video segmentation has attracted great attention, and many researchers have proposed DNN methods to solve this problem due to their impressive performance in this field. However, DNNs are obviously not suitable for scenarios with few or no training samples. Background subtraction, a crucial step in video object segmentation, has attracted great attention in the last two decades. Its main idea is to build a background model from a fixed number of frames. This model can be designed by different methods, such as statistical, fuzzy, or neuro-inspired ones. Among these, statistical methods have been intensively studied and widely used in various applications [8][9][10][11]. For example, Xue et al. developed a message passing algorithm termed offline denoising-based turbo message passing, which subtracts the background with a lower mean squared error and better visual quality for both offline and online compressed video background subtraction [12]. Stauffer and Grimson implemented a parametric probabilistic background model [13]. In this model, the color distribution of each pixel, updated through an online expectation-maximization algorithm, was represented by a sum of weighted Gaussian distributions defined in a given color space: the Gaussian Mixture Model (GMM). Culibrk et al. adopted a neural network to determine whether each pixel of the image belongs to the foreground or the background [14]. Yu et al. established a spatio-color model of both foreground and background, which used Expectation Maximization (EM) to track the parameters of the GMM [15]. Gallego et al. used EM in the same way but modeled the foreground and background at the region level and pixel level, respectively [16]. Cuevas and Garcia proposed an algorithm for foreground extraction and background updating using fuzzy functions, modeling both foreground and background in a non-parametric way [17].
These algorithms mainly perform foreground detection on each pixel of a frame and may mistakenly segment parts of the background as foreground, resulting in lower accuracy than DNN methods. However, they can provide real-time results to meet time-sensitive tasks. In addition, it is useful to first segment an ROI by frame difference before clustering and classification. In this paper, we propose RoiSeg, an effective ROI-based object segmentation method, which utilizes unsupervised learning to improve the accuracy of foreground segmentation while ensuring real-time performance. RoiSeg relies on two crucial design components, namely clustering and classification.
Cluster analysis is a statistical multivariate analysis technique and a common method of unsupervised machine learning [18]. It divides a set of data points into several classes, with the data points in each cluster being very similar and the data points in different clusters being very different [19]. K-means is an excellent clustering method based on segmentation. It iteratively calculates the distance from each point to the K cluster centers, so that K clusters can be found in a given data set [20]. Seiffert et al. presented an efficient initial seed selection method, RDBI, to improve the performance of the K-means filtering method by locating the seed points at dense, well-separated areas of the dataset [21]. Nidheesh et al. presented an improved, density-based version of K-means, the key idea of which is to select as the initial centroids data points that belong to dense regions and are adequately separated in feature space [22]. The Gaussian mixture model (GMM) is a classic statistical model, in which samples are generated by a Gaussian mixture distribution and the expectation maximization (EM) algorithm is used to update the parameters of the model [23]. Unlike traditional methods of cluster analysis based on heuristic or distance-based procedures, finite mixture modeling provides a formal statistical framework on which to base the clustering procedure [24]. Theoretically, a GMM can fit any data set as long as it has enough components, but the relationship between the number of modes and the number of components in the mixture is very complex, so determining the number of components is particularly important. In this paper, we only use two unsupervised clustering algorithms: the GMM and K-means.
The popular Naive Bayesian classifier performs well in dealing with discrete data [25]. Naive Bayes can perform surprisingly well in classification tasks even when the probability estimates it produces are themselves inaccurate. In recent years, many scholars have studied Naive Bayesian classifiers and suggested several algorithms to improve their predictive accuracy [26][27][28]. However, classifiers trained with imbalanced data tend to generate results with a high true negative rate and a low true positive rate. In data mining and machine learning, it is difficult to establish an effective classifier from imbalanced data [29]. Therefore, many scholars have proposed various methods to compensate for it [30]. The common methods are as follows: algorithm-level methods, data-level methods, cost-sensitive methods, and ensembles of classifiers [31].
The threshold method and the one-class learning method are the most efficient algorithm-level solutions; the former sets different thresholds at different learning stages for different types of data, whereas the latter uses data of a single class to train the classifier. Data-level solutions preprocess the collected imbalanced training data set by either downsampling or oversampling strategies. Gustavo showed that resampling solutions can effectively solve the class imbalance problem and optimize classifier performance [32]. In particular, preprocessing the imbalanced data before constructing the classifier is simple and efficient, because the advantage of the data-level solution is to make the sampling and classifier-training processes independent [33]. Data preprocessing is based on resampling the imbalanced data: oversampling approaches increase the number of samples in the minority class, while downsampling approaches reduce the number of samples in the majority class [34]. Common resampling methods include the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS). RUS [21] performs similarly to SMOTE but is based on a downsampling process in which some examples are removed from the majority class. Lin et al. presented a clustering-based undersampling, which clusters the majority class and uses the resulting cluster representatives in place of the full majority class, yielding a balanced training set [33].
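As an illustration of the data-level strategies discussed above, here is a minimal random-undersampling (RUS) sketch; the function name and the list-based interface are ours, not from the cited works:

```python
import random

def random_undersample(majority, minority, seed=0):
    """RUS sketch: randomly drop majority-class samples until the two
    classes are the same size, then return both balanced classes."""
    rnd = random.Random(seed)
    return rnd.sample(majority, len(minority)), minority
```

SMOTE would instead grow the minority class by interpolating between neighboring minority samples; RUS is the cheaper of the two because it only discards data.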

Design of RoiSeg
In this section, we describe the design of RoiSeg. As shown in Figure 1, RoiSeg consists of three modules, namely, ROI-central-point generation and feature extraction, automatic training-sample generation, and an online sample classifier. The purpose of RoiSeg is to classify the foreground through the ROI central points. In the first module, frame difference and Canny edge detection are used to transform the per-pixel background modeling problem into an ROI-central-point-based classification problem, which greatly reduces the amount of data to process. We then extract the features of the ROI central points and provide them to the automatic training-sample generation. In the second module, the features extracted from the ROI central points are turned into training samples using ROI-central-point-based sample clustering and a proposed trajectory-based class classifier. In the third module, we observe that the training samples are class imbalanced; a K-means oversampling method is proposed to solve the class imbalance problem, and an online training-sample updating scheme is employed to compensate for the weaknesses of the Bayesian classifier.

ROI-Central-Point Generation
We use the imbalance degree η proposed in [35] to quantify the imbalance between foreground and background, as shown in Table 1:

η = sum(B) / sum(F),

where sum(F) and sum(B) are the sums of foreground and background pixels, respectively. We compute η on several subsets of the BMC database plus two self-captured sequences ("My_video1", "My_video2"). The results reveal that foreground and background are heavily imbalanced; η = ∞ marks frames in which the number of background pixels is far greater than the number of foreground pixels (no foreground is present).

As shown in Figure 2, the moving targets often include both foreground and background [13]. The frame difference computes the difference between the current frame and the previous frame in the video sequence and then segments the moving targets. Suppose we have obtained the foreground frame shown in Figure 2c. There is a significant change in pixel values between the current and previous frames at the position of the moving target. If the moving target in the current frame is copied to the corresponding position in the previous frame, a new previous frame is obtained, and the moving target no longer appears when the new previous frame is subtracted from the current frame, as shown in Figure 3. Following this principle, suppose we have classified the foreground and background. The background of the current frame is copied to the corresponding position in the previous frame to obtain a new previous frame. Foreground detection is then achieved by subtracting the new previous frame from the current frame, as shown in Figure 3. When the moving target is detected by the frame difference (Figure 3b), the Canny algorithm is used to detect the contours of the moving target, and the bounding boxes of the contours are obtained (Figure 3c). We call the region of the bounding boxes the Region of Interest (ROI). Therefore, the classification of foreground and background can be regarded as the classification of the bounding boxes.
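The copy-and-subtract principle above can be sketched as follows. This is a minimal NumPy illustration; the function names, the uint8 grayscale frames, and the difference threshold of 25 are our assumptions, not values from the paper:

```python
import numpy as np

def frame_difference_mask(prev, curr, thresh=25):
    """Binary change mask between two grayscale uint8 frames."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > thresh

def copy_region_to_previous(prev, curr, mask):
    """Copy the masked pixels of the current frame into the previous one.
    Copying the moving target makes it vanish from the next difference;
    copying the classified background instead leaves only the foreground
    motion when the new previous frame is subtracted from the current one."""
    new_prev = prev.copy()
    new_prev[mask] = curr[mask]
    return new_prev
```

In the full pipeline, the change mask would be refined by Canny edge detection and contour bounding boxes before any region is copied back.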
Furthermore, we use the center of the bounding box (the ROI central point) to represent the bounding box, so that foreground detection is transformed into an ROI-central-point-based classification problem. Figure 3c shows that the areas of the bounding boxes for noise are much smaller than those of the foreground, because the bounding boxes of the vehicles and pedestrians we care about are often larger than those of other moving targets [36,37]. Based on this assumption, a bounding-box-area-based noise filter is proposed to remove the bounding boxes whose area is below a preset threshold. Here, the threshold is set to 0.1%. In this paper, we use 12 frame sequences as the experimental test set. Ten of them are from the BMC dataset [38], and two are self-captured high-resolution crowd-walking videos taken with a top-view camera, in which jitter was generated by shaking the camera. The description of the experimental test set is shown in Table 2, and the filtering thresholds of the 12 videos are shown in Table 3. For the video test sets "112", "122", "212", "222", "312", "322", "412", and "422", we find that after filtering with the preset threshold, all foregrounds are recognized, as shown in Figure 4. There are two reasons for this: first, the frame difference suppresses ambient lighting changes well, so moving cast shadows and fog are not detected. Second, the areas of the bounding boxes caused by salt-and-pepper noise and the like are usually far smaller than those of the foreground, such as cars and pedestrians. However, for the video test sets "512", "522", "My_video1", and "My_video2", noise produced by wind, camera jitter, and other dynamic background factors dominates. The areas of the bounding boxes of such noise are random, usually varying with the strength of the wind and the magnitude of the jitter.
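The bounding-box-area-based noise filter can be sketched as below. We assume, as one plausible reading, that the 0.1% threshold is relative to the frame area; the `(x, y, w, h)` box format and function name are our own choices:

```python
def filter_boxes_by_area(boxes, frame_w, frame_h, ratio=0.001):
    """Drop bounding boxes whose area is below `ratio` of the frame area.
    `boxes` is a list of (x, y, w, h) tuples; 0.1% mirrors the preset
    threshold described in the text (assumed to be frame-relative here)."""
    min_area = ratio * frame_w * frame_h
    return [(x, y, w, h) for (x, y, w, h) in boxes if w * h >= min_area]
```

On a 640 × 480 BMC frame this keeps only boxes of roughly 300 pixels or more, which removes most salt-and-pepper blobs while retaining vehicles and pedestrians.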
When the area of a bounding box is larger than the threshold, the noise cannot be removed by area-based filtering. The filtering results are shown in Figure 4. This filtering removes a large amount of noise, which is important for achieving high classification accuracy in the performance comparison experiment (Table 4). For the test sets "112", "122", "212", "222", "312", "322", "412", and "422", we successfully obtain the foreground after filtering alone, so we do not need the second and third modules to classify foreground and background. This is why, in the experiment classifying foreground- and background-based ROI central points, we do not use these eight test sets and instead only use "512", "522", "My_video1", and "My_video2".

Automatic Training-Sample Generation
In Section 3.2, we obtained input samples, but these are unlabeled original samples and cannot be used directly as training samples for the classifier. To obtain labeled training samples, we propose an ROI-central-point clustering method and a class detector.

ROI Pooling and Feature Extraction
For those videos in which the noise cannot be removed by the aforementioned area-based filter, further feature extraction methods are required to enable noise reduction.
Let Z^t = {z_i^t} be a frame at time t, where z_i^t denotes the i-th pixel of Z^t. We choose the color feature (r_i, g_i, b_i) and the coordinates (x_i, y_i) of z_i^t as the spatio-color feature space. For the ROI central point, the 5-tuple vector z_i^t = (r_i^t, g_i^t, b_i^t, x_i^t, y_i^t) in this feature space is selected as the classification-learning feature. In Section 3.1, we used the center of the bounding box as the ROI central point. We also process the ROI central point with mean-pooling. Experiments show that using the pooled ROI central point (x_i^t, y_i^t) as a learning feature provides good foreground detection results, as shown in the "proposed method" columns of Table 4. The advantage of this method is that it reduces the computational load while guaranteeing good classification accuracy.
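A minimal sketch of the 5-tuple feature with mean-pooling follows. This is our own helper, not the paper's code: it assumes an RGB frame as an H×W×3 array and a bounding box `(x, y, w, h)`, pools the color over the whole box, and takes the box center as the pooled coordinate:

```python
import numpy as np

def roi_feature(frame, box):
    """Mean-pool an ROI into the 5-tuple (r, g, b, x, y) used as the
    learning feature: average color over the box plus its center point."""
    x, y, w, h = box
    patch = frame[y:y + h, x:x + w].reshape(-1, 3).astype(float)
    r, g, b = patch.mean(axis=0)          # mean-pooled color channels
    return (r, g, b, x + w / 2.0, y + h / 2.0)
```

Each detected bounding box thus collapses to a single 5-dimensional point, which is what makes the subsequent clustering and classification cheap.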

ROI Central Point Based Sample Clustering
We noticed that background noise is often bound to a specific area, while in adjacent frames a moving target's coordinates are also similar. Whether foreground or background, the same type of target has similar color features R, G, and B. In this section, we therefore cluster ROI central points with similar characteristics. A GMM is a function that estimates a probability density as a mixture of Gaussian components. It has excellent performance in clustering, and its general form is

p(x) = Σ_{j=1}^{k} α_j φ(x | μ_j, Σ_j),

where the weighting coefficients satisfy α_j ≥ 0 and Σ_{j=1}^{k} α_j = 1. The j-th component (j = 1, ..., k) is the Gaussian density

φ(x | μ_j, Σ_j) = (2π)^(−d/2) |Σ_j|^(−1/2) exp(−(x − μ_j)^T Σ_j^(−1) (x − μ_j) / 2),

where μ_j and Σ_j are the mean vector and covariance matrix of the j-th component. We choose z_i^t = (r_i^t, g_i^t, b_i^t, x_i^t, y_i^t) as the input x_i of the GMM. All ROI central points in the first l frames (here we set l = 5) are used as input samples for the GMM. For example, with a 3-component GMM, the ROI central points are clustered into 3 similar clusters, the red, blue and yellow sets shown in Figure 5. However, the number of components determines the clustering accuracy of the GMM. We choose precision (P), recall (R), and F-measure (F) for performance evaluation:

P = TP / (TP + FP), R = TP / (TP + FN), F = 2PR / (P + R),

where TP, FN and FP are the numbers of true positive, false negative, and false positive pixels, respectively. Figure 6 shows that F increases as the number of components increases; however, when this number reaches 6, the growth of F slows. Figure 7 shows that FPS drops as the number of components increases. The reasons are as follows: as the number of GMM components grows, the data-fitting ability of the GMM is gradually enhanced, so F increases; meanwhile, the computational load also increases, which reduces FPS; finally, too many components make it easy for the GMM to over-fit the data. We therefore set the number of components to 6.
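The three evaluation metrics follow the standard definitions and can be computed directly from the pixel counts:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from pixel counts, as used when
    selecting the number of GMM components."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f
```

For example, 90 true positives with 10 false positives and 10 false negatives give P = R = F = 0.9, roughly the F range the component sweep settles into.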
Figure 6 demonstrates that F is between 0.87 and 0.91, which means that we can use the pooled ROI central point as the input of the GMM. Figure 5 shows that in the 1st, 2nd, and 4th clips, the red set represents the foreground, while the blue and yellow sets represent the background. In the 3rd clip, pedestrians were incorrectly detected as another cluster, unlike in the 1st, 2nd, and 4th clips. The reason is that a GMM, being a pure clustering method, cannot by itself distinguish foreground from background.

Trajectory Based Class Classifier
Therefore, we propose a trajectory-based classifier to separate foreground from background. Suppose G^{t−s+1}, ..., G^t are the clustering results for the first s frames, where G^t denotes the clustering result of the t-th frame and contains k clusters. Let g_m^t = {x_{m1}^t, x_{m2}^t, ..., x_{m,num}^t} denote the m-th cluster (m = 1, ..., k), where each x_{mn}^t = (r_i^t, g_i^t, b_i^t, x_i^t, y_i^t) is a pooled ROI central point and num is the total number of ROI central points in g_m^t. Clusters in the current frame G^t are matched one-by-one with clusters in the previous frame G^{t−1}; if the matching succeeds, they are considered the same type of target. The proposed trajectory-based classifier works as follows:
(1) Calculate the mean feature ḡ_m^t of each cluster, i.e., the average of all ROI central points in the cluster: ḡ_m^t = (1/num) Σ_{n=1}^{num} x_{mn}^t. This is done for every cluster in each frame.
(2) Find the same class in adjacent frames. If a cluster g^{t−1} in the previous frame and a cluster g^t in the current frame are the same type of target, their mean features are similar; if not, the mean features usually differ significantly. We sort the clusters of the previous and current frames by their mean features in ascending order; clusters at the same position in the two orderings are taken to be the same type of target, because the mean feature of the same target type changes very little between adjacent frames. Comparing the two sorted lists in this way matches the same classes of foreground objects at low computational cost. We then use this method to match the same type of target across the first s frames. For example, Figure 8 shows the matching result of the first 4 frames. As a result, we corrected the classification of Figure 5: the pedestrians are classified as red, the chairs in the middle as blue, and the chairs on the right as yellow, as shown in Figure 8.
(3) Label positive and negative samples with trajectories. We still need to identify each class as foreground or background.
When we obtain a class of targets in the first s frames, its displacement can be calculated. The principle is as follows: denote the position coordinates of the mean feature of the same cluster in the first s frames as [(x̄^{t−s+1}, ȳ^{t−s+1}), (x̄^{t−s+2}, ȳ^{t−s+2}), ..., (x̄^t, ȳ^t)]. We then compute the moving distance of each cluster across frames t−s+1, t−s+2, ..., t. We assume that foreground displacements keep increasing in a certain direction across adjacent frames, while the background remains still or jitters in uncertain directions. If the moving distance of a cluster keeps increasing over the first s frames, it is classified as foreground (X = 0); otherwise it is classified as background (X = 1),
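A minimal sketch of this displacement rule follows. The formulation is our own: a cluster whose distance from its starting position keeps growing over the first s frames is labelled foreground, anything that stays put or jitters is background:

```python
def classify_by_trajectory(centroids):
    """Label one cluster from its mean-feature positions over s frames.
    `centroids` is [(x_bar, y_bar), ...], one entry per frame."""
    x0, y0 = centroids[0]
    # Distance of each later centroid from the starting position.
    dists = [((x - x0) ** 2 + (y - y0) ** 2) ** 0.5 for x, y in centroids[1:]]
    # Foreground: displacement strictly increases frame over frame.
    increasing = all(b > a for a, b in zip(dists, dists[1:]))
    return "foreground" if dists and increasing and dists[-1] > 0 else "background"
```

A jittering background cluster fails the monotonicity test because its distance from the start oscillates rather than growing.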
where X = 0 means foreground, marked as X_F, and X = 1 means background, marked as X_B. After the foreground and background classes are categorized, the original samples can be used as labeled training samples, represented as X = {X_F, X_B}, v ∈ {0, 1}. In Figure 9, the red and yellow sets represent the foreground and background, respectively. In this method, the choice of s has a great impact on the accuracy of foreground and background detection: if s is too small, background elements are easily misjudged as foreground; if s is too large, part of the foreground may be misjudged as background. Therefore, we experiment on the video test sets to decide s. We define the precision of foreground and background detection as

c = sum(True) / sum(Frame),

where sum(Frame) is the total number of test frames and sum(True) is the number of correctly judged test frames. Table 5 shows that when s = 6, c is highest in most cases and can almost reach 0.9.

Imbalance Compensation
The training samples are class imbalanced, as shown in Table 6. To better train the online classifier, we need to resample the training set to obtain balanced data. A K-means oversampling method is adopted to compensate for this imbalance. The method is as follows: (1) Calculate the total number of samples in the foreground, NF, and in the background, NB, respectively. Then, calculate the difference between them: K = |NB − NF|.
(2) Use the K-means method to preprocess minority classes to get K clusters, calculate the mean of each cluster, then use the means as new minority-class samples. The new samples are added to each training sample so we get a set of balanced training samples. The imbalance degrees are different in Tables 1 and 6, because we use ROI-based area filtering, mentioned in Section 3.2.
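The two steps above can be sketched as follows. This is a pure-Python illustration with a tiny K-means of our own; a real implementation would use a library clusterer:

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Tiny K-means (pure Python) over equal-length feature tuples."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # New center = dimension-wise mean; keep the old one if a cluster empties.
        centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

def oversample_minority(majority, minority):
    """K-means oversampling: K = |NB - NF| cluster means of the minority
    class are added as new synthetic minority samples."""
    k = abs(len(majority) - len(minority))
    if k == 0:
        return minority
    k = min(k, len(minority))   # cannot seed more clusters than points
    return minority + kmeans(minority, k)
```

Because the synthetic samples are cluster means, they stay inside the minority class's regions of the feature space rather than being arbitrary duplicates.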

Online Sample Updating
We use the new balanced training samples as training sets for the Naive Bayesian classifier. We use the ROI center points of the latest frame as the test set. After the Naive Bayesian classifies the test set, we get foreground and background represented as follows: Thus, we can get background bounding boxes and copy the pixels in the bounding boxes to the previous frame. Using the frame difference algorithm to process the new previous frame and the current frame, the foreground can then be segmented. To update the trained samples online, the newly classified foreground and background are added to the training samples. Then, we train the Naive Bayesian classifier with the updated training samples so that we can detect the foreground online.
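A minimal sketch of the online updating idea follows. This is our own toy Gaussian Naive Bayes, not the paper's exact classifier; it simply re-estimates per-class statistics from the accumulated samples at prediction time, so newly classified foreground and background can be absorbed by calling `update` again:

```python
import math
from collections import defaultdict

class OnlineGaussianNB:
    """Toy Naive Bayes over feature tuples such as (r, g, b, x, y),
    with labels 0 (foreground) and 1 (background)."""
    def __init__(self):
        self.data = defaultdict(list)

    def update(self, samples, label):
        """Absorb newly labelled samples into the training pool."""
        self.data[label].extend(samples)

    def _log_likelihood(self, x, label):
        pts = self.data[label]
        ll = math.log(len(pts))                 # unnormalized class prior
        for d in range(len(x)):
            vals = [p[d] for p in pts]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-6
            ll += -0.5 * math.log(2 * math.pi * var) - (x[d] - mu) ** 2 / (2 * var)
        return ll

    def predict(self, x):
        return max(self.data, key=lambda c: self._log_likelihood(x, c))
```

Each frame's classified ROI central points would be fed back through `update` before the next frame is predicted, which is the online compensation loop described above.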
The experimental results of training the classifier with imbalanced training sets are shown in Figure 10, with the precision results in Figure 11a. The experimental results of training the classifier with the balanced and online-updated training sets are shown in Figure 12, with the precision in Figure 11b. Red represents the background and yellow the foreground. Figure 11a,b demonstrate that P increased by 22.7%, R increased by 2.1%, and F increased by 23.4%. This shows that the proposed K-means oversampling method and the online updating of training samples effectively compensate for the weaknesses of the classifier, especially in terms of P and F.

Evaluation
In this section, we compare RoiSeg with traditional foreground segmentation algorithms; CNN-based foreground segmentation algorithms are excluded because they are not suitable for these scenarios with strict real-time requirements. Sobral tested and compared 29 background subtraction algorithms and recommended five of the best, namely DPWrenGABGS, MixtureOfGaussianV1BGS, MultiLayerBGS, PixelBasedAdaptiveSegmenter and LBAdaptiveSOM [38]. In this paper, we compare these five algorithms with our proposed method; the experimental hardware is a Lenovo desktop with an Intel(R) Core(TM) i5-4590 CPU @ 3.3 GHz and 8 GB RAM, running 64-bit Windows 10. Because the foreground detected by frame difference contains apertures (holes), we manually filled some of them in order to evaluate our algorithm using P, R and F. As the size of the self-captured sequences "My_video1" and "My_video2" was 1280 × 720 and that of the sequences provided by the BMC database was 640 × 480, the FPS we report is the average over all test sequences.
In Section 3.2, for the video sequences "112", "122", "212", "222", "312", "322", "412", and "422", we found that after filtering with the preset threshold, all foregrounds were recognized. Thus, we did not need a classifier to distinguish foreground and background. The results of running the five algorithms and the proposed RoiSeg on these eight sequences are shown in Table 4 and Figure 13. The average P, R and F of the five algorithms and of RoiSeg were computed, as shown in Table 4. The proposed RoiSeg achieved the best R and the best F on some sequences. For the video sequences "112" and "122", which were noise-free, and "212" and "222", which contained salt-and-pepper noise, the proposed noise filter method did not perform best; however, R and F still reached over 0.9. For the video sequences "312" and "322", which contained moving cast shadows, and "412" and "422", which were foggy, the proposed noise filter method performed best. A non-optimal P indicated that our method produced more false positives than the other algorithms. The reason is that the foreground detected by frame difference has ghosts, and the increased false positives were mainly located at the boundaries of moving objects, which is not harmful for visual observation. From Table 4, we found that the proposed noise filter method had the highest FPS. This is due to the effectiveness of the proposed ROI-based noise filter, which mainly focuses on the ROI instead of the full frame.
For the video sequences "My_video1", "My_video2", "512", and "522", with wind and camera jitter, the noise could not be removed by filtering. Thus, we needed a classifier to distinguish foreground and background. The results of applying the five algorithms and the proposed RoiSeg algorithm are shown in Table 4 and Figure 14. From Figure 14, we can observe that RoiSeg produced better visual results than the five algorithms. Furthermore, Table 4 shows that RoiSeg achieved the best F, which proves that our method has the best overall performance. After the clustering and online classification proposed in Sections 3.3 and 3.4, the FPS of our method decreases; however, Table 4 shows that it still outperformed the other approaches. We also evaluated the performance of RoiSeg on different datasets with the average pixel error rate (APFPER) and intersection over union (IoU) metrics [39]. APFPER measures the number of misclassified pixels, and IoU computes the overlap between the estimated and ground-truth segmentations to evaluate segmentation performance. We compared RoiSeg with state-of-the-art unsupervised learning methods on the FBMS dataset, as shown in Figure 15 and Table 7. It is observed that image saliency methods, which only exploit information within individual frames, can produce unsatisfactory results, with some frames even missing the foreground objects, mainly because the temporal correlation that conveys the target information across the image sequence is not taken into consideration [40]. In contrast, foreground segmentation methods based on motion perform better than the image saliency methods [41][42][43]. RoiSeg estimated the target object with accurate boundaries in more cluttered backgrounds in real time and split video objects in a completely unsupervised manner.
We also conducted experiments to compare the performance of RoiSeg and other segmentation methods on a different dataset (SegTrack), as shown in Tables 8 and 9. In Table 8, References [41][42][43]45,[47][48][49][50] are unsupervised learning methods, while [39,51] are supervised learning methods. The results demonstrate that RoiSeg can meet the requirements of most tasks, although its performance is not as good as that of state-of-the-art segmentation methods. In Table 9, References [41,44,45,48,52] are unsupervised learning methods, while [46,50,51] are supervised learning methods. Among them, ref [46] utilized CNN methods and VOC 2011 [53] for pre-training and testing, respectively. Table 9 shows that the IoU evaluation of RoiSeg was similar to that of APFPER and that CNN-based approaches have an absolute advantage in supervised segmentation tasks, but rely on large amounts of data. We also conducted extensive experiments to evaluate real-time performance, as shown in Table 10; RoiSeg achieved better performance in terms of real-time operation. In summary, RoiSeg produced the expected results on some video sequences compared with the best-performing unsupervised learning methods. There is still a gap compared with the supervised and CNN methods, but RoiSeg is better in real-time operation, with an average processing time of 45 ms.

Conclusions and Future Work
In this paper, we propose RoiSeg, an effective object segmentation method, which consists of three modules: ROI-central-point generation and feature extraction, automatic training-sample generation, and an online sample classifier. RoiSeg can be applied to scenarios where datasets are difficult to obtain and high real-time performance is required. We also conduct extensive experiments, and the results demonstrate that RoiSeg runs at 95.84 frames per second, which is better than the other algorithms, with a classification accuracy of 92.4%. Future work falls into two categories. First, to better detect stopped objects, we plan to introduce Kalman filtering to predict the state of the stopping target at the next time step. Second, we will try to design a deep neural network algorithm to study foreground segmentation in long-term scenarios.
Author Contributions: Conceptualization, Z.Z. and Z.P.; data collection, Z.T.; analysis and interpretation of results, Z.Z. and F.G.; validation, Z.Z. and F.G.; writing-original draft preparation, Z.Z. and F.G.; writing-review and editing, F.G. All authors have read and agreed to the published version of the manuscript.

Informed Consent Statement: Not applicable.
Data Availability Statement: BMC (Background Models Challenge) provides videos for testing our background subtraction algorithm. For more description of the BMC dataset and how to use it, please refer to this website: http://backgroundmodelschallenge.eu/#evaluation; (accessed on 2 January 2022). The Freiburg-Berkeley Motion Segmentation Dataset (FBMS) has a total of 720 annotated frames. FBMS-59 comes with a split into a training set and a test set. For more description of the motion segmentation dataset and how to use the evaluation code, please refer to this website: https://lmb.informatik.uni-freiburg.de/resources/datasets/moseg.en.html; (accessed on 2 January 2022). SegTrack is a video segmentation dataset. For more description of the SegTrack dataset, please refer to this website: https://web.engr.oregonstate.edu/~lif/SegTrack2/dataset.html; (accessed on 2 January 2022).

Conflicts of Interest:
The authors declare no conflict of interest.