U 2 -ONet: A Two-Level Nested Octave U-Structure Network with a Multi-Scale Attention Mechanism for Moving Object Segmentation

Abstract: Most scenes in practical applications are dynamic scenes containing moving objects, so accurately segmenting moving objects is crucial for many computer vision applications. In order to efficiently segment all the moving objects in the scene, regardless of whether the object has a predefined semantic label, we propose a two-level nested octave U-structure network with a multi-scale attention mechanism, called U 2 -ONet. U 2 -ONet takes two RGB frames, the optical flow between these frames, and the instance segmentation of the frames as inputs. Each stage of U 2 -ONet is filled with the newly designed octave residual U-block (ORSU block) to enhance the ability to obtain more contextual information at different scales while reducing the spatial redundancy of the feature maps. In order to efficiently train the multi-scale deep network, we introduce a hierarchical training supervision strategy that calculates the loss at each level while adding a knowledge-matching loss to keep the optimization consistent. The experimental results show that the proposed U 2 -ONet method can achieve a state-of-the-art performance on several general moving object segmentation datasets.


Introduction
Moving object segmentation is a critical technology in computer vision, and directly affects downstream tasks such as object tracking, visual simultaneous localization and mapping (SLAM), image recognition, etc. Being able to accurately segment moving objects from a video sequence can greatly improve the performance of many visual tasks, such as dynamic visual SLAM [1][2][3][4], visual object tracking [5], dynamic object obstacle avoidance, autonomous navigation [6], autonomous vehicles [7], human activity analysis [8], video surveillance [9,10], and dynamic object modeling [11]. For example, in an autonomous driving scene, the segmentation of moving objects can help the vehicle to understand the surrounding motion information, which is the basis for collision avoidance, braking, and smooth maneuvering. Most of the current methods are designed to segment N predefined classes in the training set. However, in a practical environment, many applications, such as autonomous driving and intelligent robots, need to achieve robust perception in the open world. These applications must discover and segment never-before-seen moving objects in the new environment, regardless of whether they are associated with a particular semantic class.
The segmentation of the different motions in dynamic scenes has been studied for decades. The traditional methods of motion segmentation use powerful geometric constraints to cluster the points in the scene into a model parameter instance, thereby segmenting the moving objects into different motions [12,13], which is called multi-motion segmentation. This kind of method realizes the motion segmentation of feature points instead of working pixel by pixel. However, since the results of these methods are strongly dependent on the motion model, they are not robust enough in complex scenes. In addition, these methods can only segment the more salient moving objects and can only fit a small number of motion models in a scene simultaneously. With the development of deep learning, instance/semantic segmentation and object detection in videos have been well studied [14][15][16][17]. These methods are used to segment specific labeled object categories in annotated data, so that the main focus is on predefined semantic category segmentation through appearance rather than segmentation of all the moving objects. Meanwhile, these methods are not able to segment new objects that have not been labeled in the training data. More recent approaches combine instance/semantic segmentation results with motion information from optical flow to segment moving object instances in dynamic scenes, as in [18][19][20][21]. These methods [19,20] can segment never-before-seen objects that have not been predefined in the training set based on their motion.
However, the networks used in these methods are usually not deep enough due to the fact that deeper networks usually lead to increased computational burden and greater training difficulty. The use of deeper architectures has proved successful in many artificial intelligence tasks. Therefore, in this paper, we use a much deeper architecture, which can greatly improve the effectiveness of the moving object segmentation. In order to avoid the increase in spatial redundancy of the feature maps, the increase in computational burden and memory cost, and the greater difficulty of training supervision, we integrate octave convolution (OctConv) [22] to improve the residual U-block (RSU block) in [23] and propose the novel octave residual U-block (ORSU block). We take advantage of OctConv to reduce the spatial redundancy and further improve the accuracy. We also propose a hierarchical training supervision strategy to improve the training effect of the deep network optimization in order to improve the segmentation accuracy.
In this paper, we propose a novel two-level nested U-structure network with a multi-scale attention mechanism to learn to segment pixels belonging to foreground moving objects from the background, called U 2 -ONet, whose inputs consist of two RGB frames, the optical flow between the pair of RGB frames, and the instance segmentation of the frames. We combine the convolutional block attention module (CBAM) [24] and OctConv [22] with the U 2 -Net [23] network originally used for salient object detection to propose this network for moving object segmentation. Due to the continuation of U 2 -Net's main architecture, the proposed U 2 -ONet is a nested U-structure network that is designed without using any pre-trained backbones from image classification for training. On the bottom level, the novel octave residual U-block (ORSU block) is proposed, which is based on the residual U-block (RSU block) in [23], and uses octave convolution (OctConv) [22], factorizing the mixed feature maps by their frequencies, instead of vanilla convolution. With the advantages of OctConv and the structure of U-blocks, the ORSU blocks can extract intra-stage multi-scale features without degrading the feature map resolution, at a smaller computational cost than the RSU. At the top level, there is a U-Net-like structure, in which each stage is filled by an ORSU block and each scale contains an attention block. By adding attention blocks at different scales, we introduce spatial and channel attention into the network and eliminate aliasing effects. For the training strategy, we propose a hierarchical training supervision strategy instead of using the standard top-most supervised training or a deeply supervised training scheme.
We calculate the loss at each level and add a probability-matching loss called the Kullback-Leibler divergence loss (KLloss) to promote the supervision interactions among the different levels in order to guarantee a more robust optimization process and better representation ability. An illustration of U 2 -ONet is provided in Figure 1.
In summary, this work makes the following key contributions:
1. We propose U 2 -ONet, a two-level nested U-structure network with a multi-scale attention mechanism, to efficiently segment all the moving object instances in a dynamic scene, regardless of whether they are associated with a particular semantic class.
2. We propose the novel octave residual U-block (ORSU block) with octave convolution to fill each stage of U 2 -ONet. The ORSU blocks extract intra-stage multi-scale features while adopting more efficient inter-frequency information exchange, as well as reducing the spatial redundancy and memory cost in the convolutional neural network (CNN).
3. We propose a hierarchical training supervision strategy that calculates both the standard binary cross-entropy loss (BCEloss) and the KLloss at each level, and uses the implicit gradient constraint of the KLloss to enhance knowledge sharing in order to improve the training of this deep network.
4. In the task of moving object segmentation, the results prove that U 2 -ONet is efficient and that the hierarchical training supervision strategy improves the accuracy of the deep network. The experimental results show that the proposed U 2 -ONet achieves a state-of-the-art performance on several challenging datasets, which include camouflaged objects, tiny objects, and fast camera motion.

Figure 1. Illustration of the two-level nested octave U-structure network with a multi-scale attention mechanism (U 2 -ONet).

Video Foreground Segmentation
Video foreground segmentation is focused on classifying every pixel in a video as either foreground or background. Early methods [25][26][27] relied on heuristics in the optical flow field, such as spatial edges and temporal motion boundaries in [27], to identify moving objects. With the introduction of a standard benchmark, the Densely Annotated Video Segmentation (DAVIS) 2016 dataset [28], there has been much related research on video object segmentation [29][30][31][32][33][34][35]. Some methods [29,[31][32][33][34] only complete segmentation of the foreground objects and the background, without segmenting individual instances. Among the instance-level methods, the attentive graph neural network (AGNN) method [30] is based on a novel neural network, and the collaborative video object segmentation using the foreground-background integration (CFBI) method [35] imposes the feature embedding from both the foreground and background to perform the matching process from both the pixel and instance levels. However, video object segmentation usually involves segmenting the most salient and critical objects in the video, not just moving objects. The proposed U 2 -ONet method focuses on motion information and is used to segment all the moving objects in the scene, regardless of whether they are salient.

Instance Segmentation
Instance segmentation not only needs to assign class labels to pixels, but also to segment individual object instances in the images. Methods based on R-CNN (Region-CNN) [36] are popular and widely used at present. Mask-RCNN [14] uses the object bounding box obtained by Faster RCNN [37] to distinguish each instance, and then segments the instances in each bounding box. PANet [38] improves Mask-RCNN by adding bottom-up path augmentation that enhances the entire feature hierarchy with accurate localization signals in earlier layers of the network. Subsequently, BshapeNet [39] utilizes an extended framework by adding a bounding box mask branch that provides additional information about the object positions and coordinates to Faster RCNN to enhance the performance of the instance segmentation. More recently, a few novel contour-based approaches for real-time instance segmentation have been proposed [40,41]. The Deep Snake method [40] uses circular convolution for feature learning on the contours, and uses a two-stage pipeline including initial contour proposal and contour deformation for the instance segmentation. The Poly-YOLO algorithm [41] increases the detection accuracy of YOLOv3 and realizes instance segmentation using tight polygon-based contours. More recent approaches consider the instance segmentation problem as a pixel-wise labeling problem by learning pixel embeddings. The method proposed in [42] unrolls mean shift clustering as a neural network, and the method proposed in [43] introduces a new loss function optimizing the intersection over union of each object's mask to cluster pixels into instances. EmbedMask [44] is built on top of the one-stage detection models and applies proposal embedding and pixel embedding for instance segmentation. The method proposed in [45] uses a deep neural network to assign each pixel an embedding vector and groups pixels into instances based on object-aware embedding. 
However, most of these methods focus on appearance cues and can only segment objects that have been labeled as a specific category in a training set. In the proposed approach, we leverage the semantic instance masks from Mask-RCNN with motion cues from optical flow to segment moving object instances, whether or not they are associated with a particular semantic category.

Motion Segmentation
Multi-motion segmentation methods [12,13,[46][47][48] based on geometric constraints cluster points of the same motion into a motion model parameter instance to segment the multiple motion models of the scene, which can be utilized to discover new objects based on their motion. This kind of method obtains results at the feature-point level, instead of pixel by pixel, and its application conditions and scenarios are limited. For example, these methods only segment the more salient moving objects, the number of motion models that can be segmented is limited, and they come with a high computational cost. Some deep-learning-based methods segment foreground moving object regions from the scene. The method proposed in [9] uses an analysis-based radial basis function network for motion detection in variable-bit-rate video streams. The work in [49] proposes a novel generative adversarial network (GAN) based on unsupervised training and a dataset containing images of outdoor all-day illumination changes for detecting moving objects. However, these methods cannot segment each moving object instance. More recent approaches have used optical-flow-based methods for instance-level moving object segmentation, including a hierarchical motion segmentation system that combines geometric knowledge with a modern CNN for appearance modeling [18], a novel pixel-trajectory recurrent neural network to cluster foreground pixels in videos into different objects [19], a two-stream architecture to separately process motion and appearance [20], a new submodular optimization process to achieve trajectory clustering [50], and a statistical-inference-based method for the combination of motion and semantic cues [21]. In comparison, we propose a two-level nested U-structure deep network with octave convolution to segment each moving object instance while reducing the spatial redundancy and memory cost in the CNN.

Method
Firstly, the overall structure of the network is introduced, including the network inputs. Next, the design of the proposed ORSU block is introduced, and the structure of U 2 -ONet is described. Then, the hierarchical training supervision strategy and the training loss are described. The post-processing used to obtain the instance-level moving object segmentation results is introduced at the end of this section.

Overall Structure
The proposed approach takes video frames, the instance segmentation of the frames, and the optical flow between pairs of frames as inputs, which are concatenated in the channel dimension and fed through U 2 -ONet. We use the well-known FlowNet2 [51] and Mask-RCNN [14] methods, respectively, to obtain the optical flow and instance segmentation results used as inputs, relying on the publicly available pre-trained models of FlowNet2 and Mask-RCNN. The FlowNet2 and Mask-RCNN networks only provide input data and do not participate in training. Before the optical flow is fed through the network, we apply normalization to further highlight the moving objects. U 2 -ONet is built with the ORSU blocks based on octave convolution and the multi-scale attention mechanism based on the convolutional block attention module (CBAM) [24]. Inspired by octave convolution [22] and U 2 -Net [23], the octave residual U-block (ORSU block) is designed to capture intra-stage multi-scale features while reducing the spatial redundancy and computational cost in the CNN. For the motion segmentation map obtained from U 2 -ONet, post-processing combines it with the instance segmentation results to obtain the instance-level moving object segmentation results. The contours of the motion segmentation map are extracted, and each closed motion contour is used to determine whether each semantically labeled instance is moving and to find new moving instances.
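As a concrete illustration, the input preparation described above can be sketched as follows. The exact channel ordering, the min-max normalization of the flow, and the use of a single-channel instance map are our own assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def prepare_input(frame_t, frame_t1, flow, instance_map):
    """Stack the network inputs along the channel axis.
    frame_t, frame_t1: (H, W, 3) RGB frames; flow: (H, W, 2);
    instance_map: (H, W) instance labels (assumed single-channel)."""
    # Normalize each flow channel to [0, 1] to highlight moving regions
    # (an illustrative choice of normalization).
    flow = flow.astype(np.float32)
    fmin = flow.min(axis=(0, 1), keepdims=True)
    fmax = flow.max(axis=(0, 1), keepdims=True)
    flow_norm = (flow - fmin) / (fmax - fmin + 1e-8)
    # Concatenate: RGB(t), RGB(t+1), 2-channel flow, 1-channel instance map.
    return np.concatenate(
        [frame_t, frame_t1, flow_norm, instance_map[..., None]], axis=-1)

H, W = 4, 5
x = prepare_input(np.zeros((H, W, 3)), np.zeros((H, W, 3)),
                  np.random.randn(H, W, 2), np.ones((H, W)))
print(x.shape)  # (4, 5, 9)
```

The resulting 9-channel tensor is what a single forward pass would consume; a real pipeline would obtain the flow from FlowNet2 and the instance map from Mask-RCNN.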

ORSU Blocks
Inspired by U 2 -Net [23], we propose the novel octave residual U-block (ORSU block) in order to make good use of both local and global contextual information to improve the segmentation effect. As shown in Figure 2, ORSU-L(C in , M, C out ) follows the main structure of the RSU block in U 2 -Net [23]. The proposed ORSU block is therefore composed of three main parts:
1. An input convolutional layer, which uses octave convolution (OctConv) for the local feature extraction instead of vanilla convolution. Compared with RSU blocks, ORSU blocks using OctConv further reduce the computation and memory consumption while boosting the accuracy of the segmentation. This layer transforms the input feature map x into an intermediate feature map F 1 (x) with C out output channels.
2. A U-Net-like symmetric encoder-decoder structure with a height of L, which is deeper with a larger value of L. It takes F 1 (x) from the input convolutional layer as input and learns to extract and encode the multi-scale contextual information µ(F 1 (x)), where µ denotes the U-Net-like structure, as shown in Figure 2.
3. A residual connection for fusing the local features and the multi-scale features through the summation H ORSU (x) = µ(F 1 (x)) + F 1 (x).
Like the RSU block, the ORSU block can capture intra-stage multi-scale features without degradation of the high-resolution features. The main difference between the design of the ORSU and RSU blocks is that the ORSU block replaces vanilla convolution with octave convolution (OctConv). CNNs have achieved outstanding results in many computer vision tasks. However, behind the high accuracy, there is a lot of spatial redundancy that cannot be ignored [22]. As with the decomposition of the spatial frequency components of natural images, OctConv decomposes the output feature maps of a convolutional layer into high- and low-frequency feature maps stored in different groups (see Figure 3).
Therefore, through the information sharing between neighboring locations, the spatial resolution of the low-frequency group can be safely reduced, which also reduces the spatial redundancy. In addition, OctConv performs the corresponding (low-frequency) convolution on the low-frequency information, effectively enlarging the receptive field in the pixel space. The use of OctConv therefore empowers the network to further reduce the computational and memory overheads while retaining the designed advantages of the RSU block. A computational cost comparison between the proposed ORSU block and the RSU block is provided in Table 1.

Figure 3. Illustration of octave convolution [22]. f(X; W) denotes the convolution function with weight parameters W, pool(X, 2) indicates spatial average pooling with kernel size 2 × 2 and stride 2, and upsample(X) indicates an up-sampling operation by a factor of 2.
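The saving that comes from storing the low-frequency group at half resolution can be estimated with a back-of-the-envelope calculation. The formula below counts only feature-map storage and is our own simplification for illustration, not a figure from the paper or from [22]:

```python
def octconv_memory_ratio(alpha):
    """Relative feature-map storage of octave vs. vanilla convolution.
    A fraction `alpha` of the channels is kept at half spatial resolution,
    i.e. 1/4 of the pixels, so storage scales as (1 - alpha) + alpha / 4.
    This ignores the convolution weights themselves."""
    return (1.0 - alpha) + alpha / 4.0

for a in (0.25, 0.5, 0.75):
    print(f"alpha={a}: {octconv_memory_ratio(a):.3f}x feature-map memory")
```

With half of the channels assigned to the low-frequency group (alpha = 0.5), the feature maps occupy roughly 62.5% of the vanilla-convolution footprint, which is consistent in spirit with the memory drop reported in Table 1.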

U 2 -ONet
Inspired by U 2 -Net [23], we propose the novel U 2 -ONet, where the exponent denotes the two levels of the nested U-structure. As shown in Figure 1, each stage of U 2 -ONet is filled by a well-configured ORSU block, and there are 11 stages that form a large U-structure. In general, U 2 -ONet consists of three main parts:
1. The six-stage encoder. Detailed configurations are presented in Table 2. The number "L" behind "ORSU-" denotes the height of the block. C in , M, and C out represent the input channels, middle channels, and output channels of each block, respectively. A larger value of L is used to capture more large-scale information from feature maps with larger height and width. In both the En_5 and En_6 stages, ORSU-4F blocks are used, which are the dilated version of the ORSU blocks using dilated convolution (see Figure 1), because the resolution of the feature maps in these two stages is relatively low.
2. The five-stage decoder, which has a structure similar to the symmetrical encoder stages (see Figure 1 and Table 2). The input of each decoder stage is the concatenation of the upsampled feature map from the previous stage and the feature map from the symmetric encoder stage.
3. A multi-scale attention mechanism attached to the decoder stages and the last encoder stage. At each level of the network, we add an attention module including channel and spatial attention mechanisms to eliminate the aliasing effect that would otherwise require 3 × 3 convolution, inspired by [24] (see Figure 4) and [52]. The channel attention mechanism assigns different significances to the channels of the feature map, and the spatial attention mechanism discovers which parts of the feature map are more important, so that the saliency of the moving objects in the spatial dimension is enhanced.
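The channel-and-spatial attention module can be sketched roughly as follows. This is a simplified, NumPy-only stand-in for CBAM: the shared-MLP channel branch follows the spirit of [24], but the 7 × 7 convolution of the real spatial branch is replaced here by a plain average of the pooled maps, purely to keep the sketch short:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_like_attention(feat, w1, w2):
    """Simplified CBAM-style attention over a (C, H, W) feature map.
    w1, w2 are the shared-MLP weights for the channel branch."""
    # Channel attention: shared MLP on global average- and max-pooled vectors.
    avg = feat.mean(axis=(1, 2))                 # (C,)
    mx = feat.max(axis=(1, 2))                   # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # two-layer MLP with ReLU
    ca = sigmoid(mlp(avg) + mlp(mx))             # (C,) channel weights
    feat = feat * ca[:, None, None]
    # Spatial attention: channel-wise average and max maps (7x7 conv omitted).
    sa = sigmoid((feat.mean(axis=0) + feat.max(axis=0)) / 2.0)  # (H, W)
    return feat * sa[None, :, :]

rng = np.random.default_rng(0)
C, H, W = 8, 6, 6
w1 = rng.standard_normal((C // 2, C)) * 0.1   # reduction ratio 2 (illustrative)
w2 = rng.standard_normal((C, C // 2)) * 0.1
out = cbam_like_attention(rng.standard_normal((C, H, W)), w1, w2)
print(out.shape)  # (8, 6, 6)
```

The channel weights re-scale whole feature channels while the spatial map re-weights positions, which matches the paper's description of assigning significance per channel and per location.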
Compared to U 2 -Net for salient object detection, we maintain a deep architecture with high resolution for moving object segmentation while further enhancing the effect, reducing the computational and memory costs (see Tables 1 and 3).

Training Supervision Strategy
Generally speaking, standard top-most supervised training is not a problem for relatively shallow networks. However, for extremely deep networks, training will slow down, fail to converge, or converge to a local optimum due to the vanishing gradient problem during gradient back-propagation. The deeply supervised network (DSN) [53] was proposed to alleviate the optimization difficulties caused by gradient flows through long chains. However, it is still susceptible to problems, including interference with the hierarchical representation generation process and inconsistency of the optimization goals. In the training process, we therefore used a hierarchical training supervision strategy instead of the standard top-most supervised training or a deep supervision scheme. For each level, we used both the standard binary cross-entropy loss (BCEloss) and the Kullback-Leibler divergence loss (KLloss), inspired by [54], to calculate the loss. By adding a pairwise probability prediction-matching loss (KLloss) between any two levels, we promote multi-level interaction between the different levels. The optimization objectives of the losses in the different levels are consistent, thus ensuring the robustness and generalization performance of the model. The ablation study in Section 4.1.3 proves the effectiveness of the hierarchical training supervision strategy. The binary cross-entropy loss is defined as

l_{bce} = -\sum_{(i,j)}^{(M,N)} \left[ G_{(i,j)} \log S_{(i,j)} + (1 - G_{(i,j)}) \log (1 - S_{(i,j)}) \right],

and the Kullback-Leibler divergence loss is defined as

l_{kl} = \sum_{(i,j)}^{(M,N)} G_{(i,j)} \log \frac{G_{(i,j)}}{S_{(i,j)}},

where (i, j) are the pixel coordinates and (M, N) are the height and width of the image. G (i,j) and S (i,j) denote the pixel values of the ground truth and the predicted moving object segmentation result, respectively. The proposed training loss function is defined as

l = \sum_{k=1}^{K} \left( l_{bce}^{(k)} + l_{kl}^{(k)} \right),

where l_{bce}^{(k)} and l_{kl}^{(k)} denote the BCEloss and KLloss calculated at the k-th supervised level and K is the number of levels at which the loss is calculated.
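A minimal sketch of the per-level loss computation follows, assuming equal weights for all levels and for the two loss terms (the weighting is our simplification for illustration, not a detail stated by the paper):

```python
import numpy as np

EPS = 1e-7  # numerical floor to keep the logarithms finite

def bce_loss(g, s):
    """Binary cross-entropy summed over pixels.
    g: ground truth in {0, 1}; s: predicted probabilities."""
    s = np.clip(s, EPS, 1.0 - EPS)
    return -np.sum(g * np.log(s) + (1.0 - g) * np.log(1.0 - s))

def kl_loss(g, s):
    """Kullback-Leibler term summed over pixels; 0 * log(0/x) is taken as 0."""
    s = np.clip(s, EPS, 1.0 - EPS)
    g_safe = np.clip(g, EPS, 1.0)
    return np.sum(np.where(g > 0, g * np.log(g_safe / s), 0.0))

def hierarchical_loss(ground_truth, level_predictions):
    """Sum BCEloss + KLloss over every supervised level."""
    return sum(bce_loss(ground_truth, s) + kl_loss(ground_truth, s)
               for s in level_predictions)

g = np.array([[1.0, 0.0], [0.0, 1.0]])
preds = [np.full((2, 2), 0.5),            # an uninformative level
         np.array([[0.9, 0.1], [0.1, 0.9]])]  # a confident level
print(hierarchical_loss(g, preds))  # ≈ 4.791
```

Note how the confident level contributes a much smaller loss than the uninformative one, which is the signal that drives all supervised levels toward the same target.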

Post-Processing
The output of the network gives the result of the moving object segmentation, in which the foreground moving objects are separated from the background. However, the instance-level moving object segmentation result cannot be obtained directly. In order to obtain instance-level results, the semantic instance label mask from Mask-RCNN is fused with the contour extraction results of the motion segmentation map. The geometric contours of the motion segmentation map can improve the quality of the semantic instance mask boundaries, determine whether an instance object is moving, and find new moving objects that are not associated with a particular semantic class. Meanwhile, the semantic instance mask can provide the category labels of some moving objects and accurate boundaries to distinguish overlapping objects in the motion segmentation map.
The contour extraction method follows the approach proposed in [55], which utilizes topological structural analysis of digitized binary images to obtain the multiple closed contours of the motion segmentation map. For each motion contour C i , we calculate the overlap of each semantic instance mask m j and C i to associate m j and C i . Only if this overlap is greater than a threshold-in our experiments, 80% · |m j |, where |m j | denotes the number of pixels belonging to the mask m j -is m j associated with C i . Finally, we obtain the instance-level moving object segmentation result and segment new objects according to the number of semantic instance masks associated with each C i (see Algorithm 1).
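The overlap test can be sketched with boolean-mask arithmetic as follows (`associate`, `region`, and the toy masks are our own illustrative names; in practice the contour interiors would come from the contour extraction of [55]):

```python
import numpy as np

def associate(instance_mask, contour_region, thresh=0.8):
    """Associate a semantic instance mask m_j with a motion contour C_i
    when their overlap exceeds thresh * |m_j|.
    Both arguments are boolean (H, W) arrays; contour_region is the
    filled interior of the motion contour."""
    overlap = np.logical_and(instance_mask, contour_region).sum()
    return overlap > thresh * instance_mask.sum()

region = np.zeros((6, 6), dtype=bool); region[1:5, 1:5] = True  # contour interior
m1 = np.zeros((6, 6), dtype=bool); m1[2:4, 2:4] = True          # fully inside
m2 = np.zeros((6, 6), dtype=bool); m2[0:2, 0:2] = True          # mostly outside
print(associate(m1, region), associate(m2, region))  # True False
```

Normalizing by |m_j| rather than by the contour area means a small instance fully inside a large motion region is still associated, which matches the threshold definition above.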

Algorithm 1 Instance-level moving object segmentation
Require: Each motion contour C i ; each instance semantic mask m j ; the results of the semantic instance masks associated with C i ; judgment threshold t (t was usually 200 in our experiments).
Ensure: Each moving object instance and its mask.
1: for each motion contour C i ∈ the current motion segmentation map do
2:   if the number of semantic instance masks associated with the motion contour C i > 1 then
3:     for each semantic instance mask m j associated with the motion contour C i do
4:       m j is output as the mask for a moving object instance.
5:       the number of moving object instances ← the number of moving object instances + 1.
6:     end for
7:   end if
8:   if the number of semantic instance masks associated with the motion contour C i == 1 then
9:     The area contained in motion contour C i is assigned as the mask for the associated semantic instance m j .
10:     m j is output as the mask for a moving object instance.
11:     the number of moving object instances ← the number of moving object instances + 1.
12:   end if
13:   if the number of semantic instance masks associated with the motion contour C i < 1 then
14:     if the length of the motion contour C i > t then
15:       The area contained in motion contour C i is output as the mask for a new moving object instance.
16:       the number of moving object instances ← the number of moving object instances + 1.
17:     end if
18:   end if
19: end for
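Algorithm 1 can be sketched in a few lines of Python (the mask contents here are stand-in strings so that the control flow is easy to follow; a real implementation would carry pixel masks):

```python
def segment_instances(contours, associations, contour_lengths, t=200):
    """Sketch of Algorithm 1. `associations[i]` lists the semantic instance
    masks associated with contour C_i; `contour_lengths[i]` is the length
    of C_i. Returns (instance_masks, count)."""
    instances = []
    for i, contour in enumerate(contours):
        masks = associations[i]
        if len(masks) > 1:
            # Several instances share one motion contour: keep each semantic
            # mask so that overlapping moving objects stay separated.
            instances.extend(masks)
        elif len(masks) == 1:
            # One instance: the contour interior refines the mask boundary.
            instances.append(contour)
        elif contour_lengths[i] > t:
            # No semantic label: a sufficiently long contour is output as a
            # new, never-before-seen moving object.
            instances.append(contour)
    return instances, len(instances)

contours = ["C0", "C1", "C2", "C3"]
associations = [["m0", "m1"], ["m2"], [], []]
lengths = [300, 150, 250, 50]
masks, n = segment_instances(contours, associations, lengths)
print(masks, n)  # ['m0', 'm1', 'C1', 'C2'] 4
```

C3 is dropped because its contour is shorter than the threshold t, which filters out small motion-segmentation noise without a supporting semantic mask.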

Experiments
Datasets: We evaluated the proposed method on several commonly used benchmark datasets: the Freiburg-Berkeley Motion Segmentation (FBMS) dataset [56], the Densely Annotated Video Segmentation (DAVIS) dataset [28,57], the YouTube Video Object Segmentation (YTVOS) dataset [58], and the extended KittiMoSeg dataset [59] proposed in FuseMODNet [60]. For the FBMS, we evaluated on the test set using the model trained from the training set. However, the FBMS shows a large number of annotation errors. We therefore used a corrected version of the dataset linked from the original dataset's website [61]. For the DAVIS dataset, DAVIS 2016 [28] is made up of 50 sequences containing instance segmentation masks for only the moving objects. Unlike DAVIS 2016, the DAVIS 2017 dataset [57] contains sequences providing instance-level masks for both moving and static objects, but not all of its sequences are suitable for our model. Therefore, we trained the proposed model on DAVIS 2016 and used a subset of the DAVIS 2017 dataset called DAVIS-Moving, as defined in [20], for the evaluation. For the YTVOS dataset containing both labeled static and moving objects, we also used the YTVOS-Moving dataset introduced by [20], which selects sequences where all the moving objects are labeled. For the extended KittiMoSeg dataset, there are many images that do not contain moving objects, for which there is no label, or for which some labels are ambiguous, as shown in Figure 5, where the (static) background is wrongly segmented into objects and the cars are labeled roughly with a square area. We manually selected 5315 images for training and 2116 images for evaluation from the extended KittiMoSeg dataset, where the moving objects are accurately labeled.
Figure 5. Some ambiguous labels in the extended KittiMoSeg dataset. As in (a) and (b), some moving objects are labeled roughly with a square area, and some regions of the background are also wrongly labeled as moving objects.

Implementation Details:
We trained the proposed network from scratch, and all of the convolutional layers were initialized as in [62]. Stochastic gradient descent (SGD) with an initial learning rate of 4 × 10 −2 was used for the optimization, with momentum = 0.9 and weight_decay = 0.0001. We trained for 20 epochs using a batch size of 4. Both the training and testing were conducted on a single NVIDIA Tesla V100 GPU with 16 GB of memory, using PyTorch 1.1.0 and Python 3.7. The precision (P), recall (R), and F-measure (F), as defined in [56], as well as the mean intersection over union (IoU), were used as the evaluation metrics.
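The four evaluation metrics can be computed from binary masks as follows. This is the common formulation of precision, recall, F-measure, and IoU; the paper follows the exact definitions of [56], so treat this as an illustrative sketch:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Precision, recall, F-measure, and IoU for boolean (H, W) masks."""
    tp = np.logical_and(pred, gt).sum()    # predicted moving, truly moving
    fp = np.logical_and(pred, ~gt).sum()   # predicted moving, actually static
    fn = np.logical_and(~pred, gt).sum()   # missed moving pixels
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return p, r, f, iou

gt = np.zeros((4, 4), dtype=bool); gt[1:3, 1:3] = True      # 4 true pixels
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:4] = True  # 6 predicted pixels
print(segmentation_metrics(pred, gt))  # precision 2/3, recall 1.0, F 0.8, IoU 2/3
```

In the toy example the prediction covers all ground-truth pixels (recall 1.0) but spills over by two pixels, which lowers the precision and IoU, mirroring how over-segmentation shows up in the tables below.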

ORSU Block Structure
An ablation study on the blocks was undertaken to verify the effectiveness of the proposed ORSU block structure. The attention mechanism was removed from the proposed network, and the ORSU blocks were then replaced with the RSU blocks from U 2 -Net to obtain the network called U 2 -Net 6bk (−a). U 2 -Net 6bk (−a) is a network for moving object segmentation that uses the backbone of U 2 -Net and calculates the BCEloss and KLloss at six levels, as in our design. The results are shown in Tables 1 and 3. After replacing the RSU blocks with the proposed ORSU blocks, the memory usage drops by 25.69 MB and the computational cost falls by nearly 40%. For both the video foreground segmentation and multi-object motion segmentation, the network with ORSU blocks improves the precision by over 2.4%, the recall by about 1.0%, the F-measure by over 1.55%, and the IoU by over 1.51%. At the same time, it can be noted that the increase in the evaluation metrics for the multi-object motion segmentation is higher than that for the video foreground segmentation. It is worth noting that the improvement in precision is the most obvious, indicating that the ORSU blocks help the network to better learn motion information and more accurately segment moving objects. This proves that the designed ORSU blocks are superior to the RSU blocks in motion segmentation tasks.

Attention Mechanism
As mentioned above, the addition of the attention mechanism introduces spatial and channel attention, making the moving objects more salient in the segmentation result. This ablation study was conducted to validate the effectiveness of adding the attention mechanism. U 2 -ONet 6bk (-a) without the attention mechanism is compared with the complete network, called U 2 -ONet 6bk . Table 3 shows that adding the attention mechanism improves the precision by about 1%, the recall by over 2.37%, the F-measure by over 1.64%, and the IoU by over 0.87%. Differing from the ablation study for the blocks, the increase in the evaluation metrics after adding the attention mechanism is higher for the video foreground segmentation than for the multi-object motion segmentation. Meanwhile, it can be noted that the improvement in recall is the most obvious, indicating that the designed multi-scale attention mechanism helps the network to discover more moving objects. This is achieved by introducing global contextual information and capturing the spatial details around moving objects, enhancing the saliency of the moving objects in the spatial dimension. Here, video foreground segmentation denotes segmenting the scene into static background and foreground moving objects without distinguishing object instances, while multi-object motion segmentation denotes segmenting each moving object instance in the scene.

Training Supervision
In order to prove the effectiveness of the multi-level loss calculation strategy, we evaluated the U 2 -ONet bk network when calculating the BCEloss and KLloss at from one to six levels. Table 4 shows that the overall effect of the network improves as the number of supervised levels increases from two to four. From four to six levels, the overall effect first declines slightly and then increases. In summary, the proposed multi-level loss calculation training strategy further improves the effect of the deep network, but more levels do not always yield a better effect. For the proposed network, it works best when using four or six levels.
To further demonstrate the benefit of calculating both the BCEloss and KLloss, multiple ablation experiments were conducted. From one to six levels, we compared the network calculating only the BCEloss, called U 2 -ONet b , with the network calculating both the BCEloss and KLloss, called U 2 -ONet bk . The results are listed in Table 4. When calculating the loss over three and four levels, the addition of KLloss improves the network's performance in general, although the improvement in each metric is less pronounced than in the previous ablation studies. Overall, we consider that adding KLloss has the potential to improve a multi-level network to a certain extent. At the same time, KLloss improves the precision, allowing the network to segment moving objects more accurately, which is what some practical applications need. In Table 4, the best results are highlighted in red, with the second best in blue.
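To make the hierarchical supervision concrete, the sketch below combines a per-level BCEloss against the ground truth with a KLloss that keeps each side-output map consistent with the fused prediction, with predictions treated as per-pixel Bernoulli probabilities. The weighting and the direction of the KL term are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def bce(pred, gt, eps=1e-7):
    # binary cross-entropy over probability maps
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(gt * np.log(pred) + (1 - gt) * np.log(1 - pred))

def kl(p, q, eps=1e-7):
    # per-pixel Bernoulli KL divergence between a side output p and the fused map q,
    # encouraging knowledge matching across levels
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

def hierarchical_loss(side_outputs, fused, gt, w_kl=1.0):
    # BCE supervises every level against the ground truth; KL keeps each
    # side output consistent with the fused (final) prediction
    loss = bce(fused, gt)
    for s in side_outputs:
        loss += bce(s, gt) + w_kl * kl(s, fused)
    return loss

gt = np.ones((4, 4))
fused = np.full((4, 4), 0.8)
sides = [np.full((4, 4), 0.7), np.full((4, 4), 0.8)]
total = hierarchical_loss(sides, fused, gt)
```

When a side output already agrees with the fused map, its KL term vanishes and only the BCE terms remain, which is consistent with the observation above that KLloss mainly tightens the optimization rather than dominating it.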

Comparison with Prior Work
Official FBMS: The proposed method was evaluated against prior work on the standard FBMS test set using the model trained on the FBMS training set. The input image size was (512, 640). Since some of the compared methods only provide metrics for the multi-object motion segmentation task, only the metrics for this task are compared. To indicate the accuracy of the segmented object count, the ∆Obj metric, as defined in [19], was also included. As shown in Table 5, the proposed model performs the best in recall. In terms of recall and F-measure, it outperforms CCG [18], OBV [19], and STB [50] by over 16.5% and 6.4%, respectively. The qualitative results are shown in Figure 6. In Table 5, the best results are highlighted in red, with the second best in blue.
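For reference, the pixel-level metrics reported throughout these comparisons can be computed from binary masks as follows (a standard formulation, not code from the paper; ∆Obj, being an instance-count difference, is omitted here):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Precision, recall, F-measure, and IoU for boolean masks of equal shape."""
    tp = np.logical_and(pred, gt).sum()       # true positives
    fp = np.logical_and(pred, ~gt).sum()      # false positives
    fn = np.logical_and(~pred, gt).sum()      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f_measure, iou

pred = np.array([[True, True], [False, False]])
gt = np.array([[True, False], [True, False]])
p, r, f, iou = segmentation_metrics(pred, gt)
```

On this toy example, one true positive, one false positive, and one false negative give precision = recall = F-measure = 0.5 and IoU = 1/3.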

DAVIS and YTVOS:
The proposed method was further evaluated on the DAVIS-Moving dataset using the model trained on DAVIS 2016, and on the YTVOS-Moving test set using the model trained on the YTVOS-Moving training set, as defined in [20]. The input image size for both datasets was (512, 896). The results are listed in Table 6. For DAVIS-Moving, the proposed approach outperforms TSA [20] by about 4.8% in precision and about 0.8% in F-measure. Unlike the FBMS and DAVIS datasets, the YTVOS-Moving dataset contains many moving objects that are difficult to segment, such as snakes, octopuses, and camouflaged objects; the metrics for YTVOS-Moving are therefore much lower than those for the previous datasets. Nevertheless, on YTVOS-Moving, the proposed method still outperforms TSA by about 4.2% in recall and about 1.6% in F-measure. The qualitative results are shown in Figures 7 and 8.

Table 6. Results for the Densely Annotated Video Segmentation (DAVIS)-Moving and YouTube Video Object Segmentation (YTVOS)-Moving datasets, as defined in [20].

Extended KittiMoSeg: Finally, the proposed approach was evaluated on the extended KittiMoSeg dataset using our own split of training and test sets; the model trained on the training set was evaluated on the test set. The input image size for this dataset was (384, 1280). The extended KittiMoSeg dataset is based on subsets of the Kitti dataset and targets real autonomous driving scenarios. It therefore includes continuously and fast-moving cameras and multiple fast-moving vehicles of different sizes, which pose different challenges to the generic segmentation of moving objects. As the annotations in this dataset are all binary, i.e., only the static background and moving objects are segmented, without distinguishing object instances, only the results for the video foreground segmentation are compared.
Since the complete FuseMODNet combines RGB and LiDAR (Light Detection and Ranging) data, the proposed approach is compared with FuseMODNet using RGB and rgbFlow, without LiDAR. As shown in Table 7, the proposed method significantly outperforms FuseMODNet, by 13.29% in IoU. Since MODNet uses the unexpanded KittiMoSeg, which includes only about 1950 frames, the model trained on the KittiMoSeg training set was used to evaluate on the KittiMoSeg test set. As shown in Table 8, the proposed method outperforms MODNet in all metrics; in terms of precision and IoU, it outperforms MODNet by 11.9% and 10.26%, respectively. It can therefore be concluded that the proposed approach has good application prospects in the field of autonomous driving. The qualitative results are shown in Figure 9. In Tables 7 and 8, the best results are highlighted in bold.

New robot dataset: In order to prove that the proposed method can discover moving objects that were not considered in the trained model, qualitative experiments were conducted on our own dataset, called the new robot dataset. This dataset features a mobile robot without semantic labels (see Figure 10). The models trained on the DAVIS dataset were used for testing, without any training on the new robot dataset. For quantitative evaluation, we acquired a 900-frame sequence and manually provided 2D ground-truth annotations for the masks of the moving objects. The qualitative and quantitative results are shown in Figure 11 and Table 9. Our method was compared with TSA [20], the only compared method with open-source code. TSA was likewise not trained on this dataset; we used TSA's open-source model directly for evaluation, and this model cannot segment the moving object without semantic labels in the new robot dataset.
It can be seen that the proposed method segments the moving object without semantic labels in the new robot dataset well, proving its effectiveness. In Table 9, "-" indicates that the method cannot segment this moving object, so the metric cannot be obtained.

Conclusions
In this paper, we proposed a two-level nested U-structure network with a multi-scale attention mechanism, called U 2 -ONet, for moving object segmentation. Each stage of U 2 -ONet is filled with the newly designed octave residual U-blocks (ORSU blocks) based on octave convolution, which enable U 2 -ONet to capture both local and global information at high resolution while reducing the spatial redundancy and computational burden of the CNN. We also designed a hierarchical training supervision strategy that calculates both the BCEloss and KLloss at all six levels to improve the effectiveness of the deep network. The experimental results obtained on several general moving object segmentation datasets show that the proposed approach achieves a state-of-the-art performance. Even on challenging datasets, such as YTVOS-Moving (which includes camouflaged objects and tiny objects) and extended KittiMoSeg (which includes fast camera motion and non-salient moving cars), the proposed method still performs well. Experiments on our own new robot dataset also proved that this approach has the ability to segment new objects.
In the near future, we will further leverage the anti-noise ability of octave convolution in combination with the multi-scale feature extraction of the ORSU blocks. We will also introduce a network for image dehazing and deraining in order to enhance the images before they are fed into the network. Finally, we will attempt to enable U 2 -ONet to maintain a good performance under extremely complex conditions, including rain, fog, snow, and motion blur, as well as in high-noise scenarios.
Author Contributions: B.L. guided the algorithm design. C.W. wrote the paper and designed all the experiments. C.L. designed the whole framework. J.L. helped organize the paper. X.S., Y.W. and Y.G. provided advice for the preparation of the paper. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because they have not yet been organized.