In generalized single object tracking (all subsequent references to visual object tracking refer to single object tracking), the class of the tracked object is not restricted: the tracker is initialized only with the target given in the first frame, and the target's position is estimated in every subsequent frame. Single object tracking has long been a difficult and challenging task in computer vision, since trackers may drift in later frames under unpredictable conditions such as illumination changes, occlusions, and deformations. In the past, the VOT datasets used the minimum bounding rectangle of the tracked target as the ground truth; in the latest dataset, VOT2020 [1], the mask of the tracked target is instead provided as the ground truth, enabling the evaluation of algorithms that combine visual object tracking (VOT) and visual object segmentation (VOS). Visual object segmentation is likewise a fundamental and difficult task in computer vision, whose goal is the pixel-level classification of targets [2,3,4,5]. Although visual object tracking and visual object segmentation have different objectives, the two tasks are inextricably related. In visual object segmentation, the target mask is given in the first frame and the mask must be predicted in each subsequent frame; compared with visual object tracking, only the initialization of the target changes. This means that once the binary mask of the target is obtained, the goal of object segmentation is naturally achieved, and the position of the tracking target can also be inferred from the mask, so the goal of target tracking is accomplished as well. Moreover, the position derived from the mask is more accurate, which reduces the chance of tracking drift. Conversely, object tracking also benefits object segmentation: the tracker quickly detects the approximate position of the target and provides rough location information for the segmentation step, which then refines it into the target mask. This mitigates the effects of fast-moving targets, complex backgrounds, and similar distractor objects, and improves both the speed and the accuracy of segmentation [6].
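As a concrete illustration of recovering the target's position from a binary mask, the following minimal sketch (our own illustration, not the paper's implementation; the function name and NumPy formulation are assumptions) derives an axis-aligned bounding box from the mask's foreground pixels:

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Derive an axis-aligned bounding box (x, y, w, h) from a binary mask.

    mask: 2D array where nonzero pixels belong to the target.
    Returns None if the mask contains no foreground pixels.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # empty mask: target lost or fully occluded
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)
```

Because the box is computed from the predicted pixels themselves, it adapts to the target's deformation in every frame, which is why a mask-derived position is tighter than one regressed directly as a rectangle.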
Recently, many tracking algorithms [7,8,9] have adopted tracking-by-detection as their framework. These algorithms treat the tracked object as foreground and everything else as background, first detecting the location of the object and then tracking it; in other words, tracking is cast as a binary classification problem, i.e., deciding whether a region is foreground or background. Tracking-by-detection consists of two parts, feature extraction and a detector: the manually labeled sample in the first frame is used to train the detector, which is then applied iteratively for tracking. Li et al. [10] proposed SiamRPN by combining a Region Proposal Network (RPN) with a Siamese network, using the RPN to detect regions of interest and then performing foreground-background classification and box regression to obtain the target bounding box. To increase robustness, Zhu et al. [11] proposed DaSiamRPN, which trains the network with better-curated data. Li et al. [12] proposed SiamRPN++, which builds on SiamRPN by replacing the backbone network and fusing multilayer features to detect regions of interest.
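To make the Siamese matching step concrete, the sketch below shows the depth-wise cross-correlation commonly used in this family of trackers: the first-frame template features act as a per-channel kernel slid over the search-region features, producing a response map whose peaks the RPN heads then classify and regress. This is a generic illustration of the operation, not the cited authors' code, and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
    """Depth-wise cross-correlation between search and template features.

    search:   (B, C, Hs, Ws) features of the search region
    template: (B, C, Ht, Wt) features of the exemplar (first-frame target)
    returns:  (B, C, Hs-Ht+1, Ws-Wt+1) response map; peaks mark likely
              foreground locations.
    """
    b, c, h, w = search.shape
    x = search.reshape(1, b * c, h, w)                # fold batch into channels
    kernel = template.reshape(b * c, 1, *template.shape[2:])
    resp = F.conv2d(x, kernel, groups=b * c)          # per-channel correlation
    return resp.reshape(b, c, resp.shape[-2], resp.shape[-1])

# Example: a 31x31 search map correlated with a 7x7 template map.
resp = depthwise_xcorr(torch.randn(2, 256, 31, 31), torch.randn(2, 256, 7, 7))
assert resp.shape == (2, 256, 25, 25)
```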
Most datasets use a rectangular box to label samples, as shown in Figure 1a: the object inside the red rectangular box is the tracking target, i.e., the foreground region. This simple representation helps to detect and track the object quickly, but most objects in nature are non-rigid, so labeling them with rectangular boxes inevitably includes background regions. The foreground area in Figure 1a contains the target black swan together with a large portion of the background (i.e., water, walls, and grass), which introduces background distractions when training the detector and degrades tracking; such distractions persist throughout tracking and can severely harm performance. To overcome these problems, segmentation-based tracking algorithms have been proposed, which integrate some form of segmentation into the tracking process. Their training data must be labeled with masks, as shown in Figure 1b. This annotation scheme is the same as in video segmentation; it increases the cost of data annotation and the computational cost of inference, but it introduces no background distractions and describes the shape of the target accurately. The segmentation results of segmentation-based trackers have a direct impact on the tracking results. Therefore, to achieve better segmentation and thereby improve tracking, in this paper we propose combining segmentation and object tracking in a joint framework and design a cross-stage and cross-resolution refinement pathway built on skip connections.
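The distraction admitted by a rectangular label can be quantified directly: given a ground-truth mask, the fraction of pixels inside the target's tight bounding box that are actually background measures how much clutter the box feeds to the detector as "foreground". The following minimal sketch (our own illustration; `background_fraction` is a hypothetical helper, not from the paper) computes this fraction:

```python
import numpy as np

def background_fraction(mask: np.ndarray) -> float:
    """Fraction of background pixels inside the target's tight bounding box.

    For a box-like rigid object this is near 0; for non-rigid objects
    (e.g., the black swan of Figure 1a) it can be large, meaning a
    rectangular label includes many background pixels in the 'foreground'.
    """
    ys, xs = np.nonzero(mask)
    box = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return 1.0 - float(np.count_nonzero(box)) / box.size

# Example: a thin diagonal target in a 100x100 frame.
print(background_fraction(np.eye(100, dtype=np.uint8)))  # 0.99
```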
A number of studies have shown that segmentation-based trackers can achieve satisfactory results. Segmentation-based object tracking algorithms are divided into bottom-up algorithms (also called generative methods) [13] and top-down algorithms [14]. Bottom-up algorithms treat segmentation and tracking as two separate tasks, which can effectively handle the tracking of non-rigidly deforming objects, but they also expose a serious problem: once distractor objects enter the segmented foreground region, the distraction may persist indefinitely and, at worst, corrupt the final tracking result. To avoid this problem, some researchers perform segmentation and tracking simultaneously, which is called the top-down approach. Visual object segmentation provides accuracy for visual object tracking, while visual object tracking provides reliable semantic information for visual object segmentation. Top-down methods make full use of the relationship between VOS and VOT and greatly improve the effectiveness of both. For example, Yao et al. [15] presented a hybrid semantics-aware tracking algorithm that uses semantic information to provide reliable guidance for tracking the target. Semantic information is a high-level feature that specifies the class to which the target belongs, so background interference can be avoided.
In this paper, in order to enhance the segmentation quality of segmentation-based object tracking methods, we extend MMS with skip connections and propose the CSCR module, which fuses features from different levels to achieve better segmentation. Our design is inspired by U-Net [16] and U-Net++ [17], two representative works in medical image segmentation. An important reason for their success is the skip connection: both U-Net and U-Net++ are typical encoder–decoder structures in which the image information is heavily compressed in the middle, and a series of deconvolution or upsampling operations then produces the final segmentation result. The deconvolution or upsampling process must fill in many details essentially from nothing, and on its own it lacks sufficient auxiliary information. The advantage of skip connections is that feature information at the corresponding scale is injected into the upsampling or deconvolution process, providing multiscale, multilevel information for the subsequent segmentation, and thus, a finer segmentation result can be obtained.
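As a minimal sketch of this skip-connection idea (a generic U-Net-style decoder stage, not the proposed CSCR module; the channel counts and layer choices are assumptions), a decoder step upsamples the compressed features and concatenates the encoder features of the matching resolution before convolving:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipDecoderBlock(nn.Module):
    """One U-Net-style decoder stage: upsample, concat skip features, convolve."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # Upsample the compressed decoder features to the skip's resolution,
        # then inject same-scale encoder information via concatenation, so the
        # decoder does not have to hallucinate detail "from nothing".
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                          align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

# Example: fuse 1/16-resolution decoder features with 1/8-resolution
# encoder features.
block = SkipDecoderBlock(in_ch=256, skip_ch=128, out_ch=128)
out = block(torch.randn(1, 256, 16, 16), torch.randn(1, 128, 32, 32))
assert out.shape == (1, 128, 32, 32)
```

The concatenation is what distinguishes this from a plain decoder: the encoder features carry the spatial detail lost to downsampling, which is exactly the auxiliary information the upsampling path otherwise lacks.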