Article

Prompt Self-Correction for SAM2 Zero-Shot Video Object Segmentation

1 Department of Intelligent Electronics and Computer Engineering, Chonnam National University, Gwangju 61186, Republic of Korea
2 Graduate School of Information, Production, and System, Waseda University, Kitakyushu 169-8050, Japan
3 Research Center, AISEED Inc., Gwangju 61186, Republic of Korea
4 Chonnam National University R&BD Foundation, G5-AICT Research Center, Gwangju 61186, Republic of Korea
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(18), 3602; https://doi.org/10.3390/electronics14183602
Submission received: 27 August 2025 / Revised: 9 September 2025 / Accepted: 9 September 2025 / Published: 10 September 2025
(This article belongs to the Collection Image and Video Analysis and Understanding)

Abstract

Foundation models, exemplified by the Segment Anything Model (SAM), have revolutionized object segmentation with their impressive zero-shot capabilities. The recent SAM2 extended these abilities to the video domain, utilizing an object pointer and memory attention to maintain temporal segment consistency. However, a critical limitation of SAM2 is its vulnerability to error accumulation, where an initial incorrect mask can propagate through subsequent frames and lead to tracking failure. To address this, we propose a novel method that actively monitors the temporal segment consistency of masks by evaluating the distance between object pointers across frames. When a potential error is detected via a sharp increase in this distance, our method triggers a particle-filter-based re-inference module. This framework models the object's motion to predict a corrected bounding box, effectively guiding the model to recover a valid mask and preventing error propagation. Extensive zero-shot evaluations on DAVIS, LVOS v2, and YouTube-VOS, together with qualitative results, show that the proposed parameter-free procedure consistently improves temporal coherence, raising mean IoU by 0.1 on DAVIS, by 0.13 on the LVOS v2 train split and 0.05 on the LVOS v2 validation split, and by 0.02 on YouTube-VOS, thereby offering a simple and effective route to more robust video object segmentation with SAM2.

1. Introduction

The Segment Anything Model (SAM) [1], widely known as a general-purpose foundation model [2] for object recognition and segmentation tasks, is a promptable architecture that generates segmentation masks on the fly in response to various types of user prompts. As its name suggests, SAM is class-agnostic: it integrates and processes various types of inputs (e.g., points, boxes, masks) [3,4] in a zero-shot [5] manner and generates accurate segmentation masks even for unseen objects, which makes it comparable to parametric methods that are pre-trained on fixed image classes.
The follow-up model, SAM2 [6], was proposed as a segmentation model with high generality and accuracy for various visual prompts on dynamic video frames. It inherits the structure of SAM, whose fixed image embedding and multi-mask decoding provide fast zero-shot segmentation, and aims to generate consistent masks for various input prompts by pre-training on a large-scale video object segmentation dataset. Notably, its memory attention module propagates previously predicted mask information to the current frame, eliminating the need for cumbersome per-frame prompting.
Despite its impressive performance in video object segmentation (VOS) tasks, SAM2 has a major limitation: if an incorrect mask is predicted in the current frame due to object variability or occlusion, this error can accumulate in subsequent frames and continuously distort the predictions. Recently, various methods have attempted to post-correct erroneous prompts or masks, or to refine predictions with an auxiliary network, as in [7,8,9]. However, these approaches typically require additional training or computation and lack a principled, a priori criterion for deciding when correction should be invoked, leaving consistency across frames uncertain. Moreover, they generally overlook SAM2's object pointer, a native signal that can indicate frame-to-frame inconsistency and thus provide a lightweight basis for targeted intervention.
We address the problem in two stages: recognition of an incorrect mask and post-correction. During video segmentation, we first recognize whether the current frame's mask has been perturbed. To this end, we leverage SAM2's object pointer, a learnable token that cross-attends to image and prompt embeddings and thus summarizes object-specific context without reconstructing the full mask. Our framework computes the exponential moving average (EMA) of the inter-frame object pointer distance and uses this statistic to assess mask consistency. When the EMA indicates inconsistency, the frame is flagged for post-correction. This pointer-driven criterion provides a lightweight and adaptive signal for recognizing when intervention is warranted.
For post-correction, we integrate a particle filter into SAM2 and use it to construct the re-inference prompt. The filter tracks the object via its bounding box and is updated only when the object score from SAM2 indicates that the target is present, which avoids spurious state updates. When re-inference is triggered, the particle filter predicts a corrected bounding box that serves as the prompt to re-query SAM2, yielding a recovered mask. This design couples a principled, training-free inconsistency recognizer with an efficient motion prior, enabling targeted re-inference that improves temporal stability while keeping computation modest.
The key contributions of the idea can be summarized as follows:
  • We propose a mask consistency evaluation method that adaptively determines validity by thresholding the exponential moving average of the distance between the object pointers in consecutive frames (Section 3.1 and Section 3.2).
  • We propose a SAM2-based framework that integrates a particle filter-based object tracker into the inference loop, updating it whenever the object score indicates presence. When the mask consistency evaluation triggers re-inference, the framework uses the particle filter's motion prediction to construct the prompt and re-query SAM2, thereby recovering valid masks (Section 3.3).
  • Extensive experiments on DAVIS, the LVOS v2 train and validation splits and the YouTube-VOS validation set show that our framework yields small but consistent improvements over baseline SAM2 without using any learnable parameters (Section 4.3).

2. Related Works

This section summarizes how SAM2 generates the object pointers we use to evaluate mask consistency, and reviews prior work on the VOS task and on conventional filter-based tracking approaches.

2.1. Deep Learning-Based Video Object Segmentation

Early VOS research mainly focused on learning pixel-level similarity with CNNs to segment objects. The pioneering study by S. Caelles et al. [10] proposed the concept of one-shot VOS: based on a pre-trained CNN that learns appearance information, it segments objects across the entire video sequence using only the mask of the first frame, and it influenced many subsequent studies. Later, with the advent of the Vision Transformer (ViT) [11], the VOS field gradually shifted to Transformer-based architectures [12,13] that model spatio-temporal features more effectively. Z. Liu et al. [14] extended the Swin Transformer, successful in the image domain, to the video domain, proposing shifted windows in the spatio-temporal dimension to efficiently model temporal context and spatial information simultaneously; this work is a representative case of how effective the Transformer architecture is in video understanding tasks. These CNN- and Transformer-based models achieved high segmentation accuracy on specific video datasets but tended to overfit the training data [15], resulting in poor generalization to new domains or unseen objects.
Recently, the Segment Anything Model, a foundation model pre-trained on a large dataset, has emerged and presented a new paradigm in the VOS field. SAM has demonstrated the ability to precisely segment various objects without separate fine-tuning, based on its excellent zero-shot generalization performance. Accordingly, follow-up studies are actively being conducted to apply SAM to VOS by attaching additional lightweight modules or by combining mask propagation and tracking mechanisms in the inference process. H. K. Cheng et al. [16] presented the idea of decoupling the segmentation and tracking functions to apply the powerful segmentation capability of SAM to video, and proposed an efficient pipeline in which a separate tracker manages and propagates the masks generated by SAM.

2.2. Filter-Based Object Tracking in Deep Learning

Meanwhile, in the field of object tracking [17], traditional probability-based filtering methodologies such as the Kalman filter [18] and the particle filter [19] have long been used. These filter-based algorithms assume a motion model of the object and predict and compensate for changes in its state over time, such as position or velocity, thereby providing robust tracking even under temporary occlusion or appearance changes.
In the era of deep learning [20], these filter-based tracking techniques have been further developed by combining them with the powerful feature extraction capabilities of deep learning models. Many studies have combined CNN [21] or Transformer-based visual feature extractors with filter-based mechanisms [22] to complementarily utilize appearance and motion information, thereby greatly improving the accuracy and stability of tracking. A. Bewley et al. [23] used the Kalman filter to predict linear movements from the results of deep learning-based object detectors, and implemented fast and efficient multi-object tracking with simple matching based on the Intersection over Union (IoU). By proposing a simple yet efficient multi-object real-time tracking framework that combines object detection results with a Kalman filter, they demonstrated how the Kalman filter can be used for predicting object states and for data association in the era of deep learning. N. Wojke et al. [24] improved this work by adding CNN-based appearance features in addition to predicting motion with a Kalman filter, thereby greatly improving tracking performance in complex situations where objects are occluded or interacting. L. Bertinetto et al. [25] proposed a method to learn a similarity map between a target object and a candidate region using a Siamese network, which opened up the possibility of combining deep learning-based trackers with Kalman filters or other post-processing techniques to improve the stability of tracking results.
While previous studies [23,24,25] have focused on enhancing the segmentation performance of the foundation model in VOS, this study focuses more on maintaining the continuity and stability of tracking the target object across the whole video sequence. In this paper, we propose a new approach that supplements unstable inference results and enhances temporal segment consistency while preserving the strong generalization [26] performance of foundation models.

3. Methods

In this section, we present a mask consistency evaluation module driven by object pointers, together with a particle filter-based framework for object tracking and re-inference. Figure 1 presents the overall architecture of the proposed framework. Our framework decides whether to update the particle filter based on SAM2's object score and mask consistency. When this score indicates confident presence and the mask is consistent, as shown by the green path in Figure 1, the mask decoder outputs the current mask and we update both the particle filter and the EMA statistics. During inference, if the object is present but the mask is inconsistent, as shown by the red path in Figure 1, we suspend updates to the particle filter and the EMA and run re-inference. During re-inference we form the prompt from the location predicted by the particle filter. After re-inference we choose between the first-pass mask and the re-inferred mask by selecting the one that agrees more closely with the mask from the previous frame.

3.1. Object Pointer in SAM2

In SAM2, the object pointer is defined as the output embedding of the single mask token after the transformer and serves as a compact representation of the target object's identity. The decoder forms a single token stream by combining a learned table of mask tokens $E_{\mathrm{mask}} \in \mathbb{R}^{M \times D}$ with the IoU token, the optional object score token, and the prompt tokens, and processes this stream together with image features using a two-way attention transformer. After the attention updates, the mask token slice $H_{\mathrm{mask}} \in \mathbb{R}^{B \times M \times D}$ is produced, and the default pointer is taken as the single-token slice in $\mathbb{R}^{B \times 1 \times D}$, with the option to use the multi-mask tokens when desired. During video inference, the pointer is maintained alongside memory embeddings accumulated from previous frames and participates in memory attention at the current step, injecting information about the object mask from past frames into the current image representation. The final mask is not chosen by the pointer itself; rather, each mask token $h_m$ is mapped by a small MLP to hypernetwork [27] coefficients $w_m$, which act on the upscaled feature map $F$ to produce logits $L$, and the final selection is decided by the predicted quality and a stability-based fallback rule. Under this mechanism, the object pointer can be regarded as indicating the mask to be produced.
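To make this decoding path concrete, the toy sketch below mirrors the hypernetwork-style mapping described above; the shapes, module names, and the choice of the first token as the pointer are illustrative assumptions rather than SAM2's actual implementation.

```python
import torch
import torch.nn as nn

B, M, D, C, H, W = 1, 4, 256, 32, 64, 64           # illustrative shapes only
mask_tokens = torch.randn(B, M, D)                  # H_mask after two-way attention
feat = torch.randn(B, C, H, W)                      # upscaled image feature map F

# Small MLP mapping each mask token h_m to hypernetwork coefficients w_m
hyper_mlp = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, C))
w = hyper_mlp(mask_tokens)                          # (B, M, C)

# Coefficients act on the feature map to produce per-token mask logits L
logits = torch.einsum("bmc,bchw->bmhw", w, feat)    # (B, M, H, W)

# The object pointer is the (selected) mask-token embedding itself
obj_pointer = mask_tokens[:, 0:1, :]                # (B, 1, D)
```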
Figure 2 illustrates the relationship between the object pointer and the generated mask on a representative DAVIS clip. The upper row visualizes SAM2 masks for frames 57–64, and the lower panel plots the L2 distance between successive pointers. When the mask undergoes a pronounced deformation between frames 57 and 58, the pointer-distance curve jumps from 2 to 16 and then quickly returns to a low plateau once the segmentation stabilizes. Similar transient spikes co-occur with moments of degraded temporal consistency, whereas smooth, low-amplitude variations dominate when the same object is segmented reliably.
These dynamics yield a simple but powerful proxy. Because the object pointer is derived from the selected mask token, its temporal similarity reflects the stability of the underlying mask without requiring pixel space comparisons. When the target persists with similar shape and appearance, consecutive pointers remain close in embedding space, exhibiting high cosine similarity and small L2 change. When the target disappears, is occluded, or the hypothesis switches, pointer similarity drops sharply. Therefore, we measure temporal segment consistency directly in pointer space, using pointer-to-pointer distance as an adaptive surrogate for mask agreement across frames.
To quantify temporal stability, we use an exponential moving average of the inter-frame pointer distance. When this distance remains low, memory writes and tracker updates proceed as usual. When it increases rapidly above a threshold while the object score indicates presence, the particle filter guided re-inference is triggered and the mask is replaced only if the re-inferred hypothesis aligns more strongly with recent history. This pointer-derived criterion links recognition and correction without pixel-level overlap calculations, reducing sensitivity to transient geometric distortions and improving long-range identity preservation.
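For concreteness, the pointer-space signal described above can be computed as in the following minimal sketch, assuming the per-frame object pointers have been collected into a single tensor; the function names are ours.

```python
import torch
import torch.nn.functional as F

def pointer_l2_distances(pointers: torch.Tensor) -> torch.Tensor:
    """Inter-frame L2 distances between consecutive object pointers.
    pointers: (T, D) stack of per-frame pointer embeddings -> (T-1,)."""
    return (pointers[1:] - pointers[:-1]).norm(p=2, dim=-1)

def pointer_cosine_similarity(pointers: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between consecutive pointers; drops sharply on identity switches."""
    return F.cosine_similarity(pointers[1:], pointers[:-1], dim=-1)
```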

3.2. Mask Consistency Evaluation

Empirically, the Euclidean ($L_2$) distance between the object pointers in consecutive frames correlates strongly with the semantic stability of the corresponding masks. Large deviations of the mask yield large pointer distances, whereas minor shape changes produce only small pointer distances. This observation motivates the consistency criterion developed in this section. Leveraging this fact allows us to evaluate mask consistency without explicitly decoding the mask, thereby saving substantial computation. Let the network-predicted pointer for frame $t$ be $p_t \in \mathbb{R}^D$. The instantaneous displacement is defined as follows:
$$ d_t = \lVert p_t - p_{t-1} \rVert_2 = \sqrt{\sum_{i=1}^{D} \left( p_{t,i} - p_{t-1,i} \right)^2 }. \qquad (1) $$
During the first $N$ frames, we simply use the mean of the collected distances as an initial scale estimate:
$$ d_{\mathrm{EMA}}(N) = \frac{1}{N} \sum_{k=1}^{N} d_k. \qquad (2) $$
For $t > N$ the scale estimate is updated online via
$$ d_{\mathrm{EMA}}(t) = (1 - \alpha)\, d_{\mathrm{EMA}}(t-1) + \alpha\, d_t, \qquad \alpha \in (0, 1), \qquad (3) $$
where the smoothing coefficient $\alpha$ controls the reactivity–robustness trade-off. During the initial frames, until the warm-up phase completes, we fix the threshold at $\theta_0 = 17$. After the warm-up, we derive a per-frame threshold $\theta_t$ from the current EMA value:
$$ \theta_t = \begin{cases} \theta_0, & t \le N, \\ d_{\mathrm{EMA}}(t), & t > N. \end{cases} \qquad (4) $$
Re-inference is invoked when the object is present yet the mask undergoes an abrupt change. This condition uses the EMA threshold $\theta_t$ and the object score derived from SAM2's IoU token. The threshold $\theta_t$ is computed from the EMA of the object pointer distances over previous frames, and an abrupt change is detected when the current distance exceeds this threshold scaled by the ratio $\gamma$, that is, $d_t > \gamma\,\theta_t$. Because the object pointer distance is valid only when the object is present in adjacent frames, we require the object score to exceed the cutoff in both frames, $s_{t-1} > \kappa$ and $s_t > \kappa$. The re-inference trigger is summarized in Equation (5). The specific values of the EMA threshold ratio $\gamma$ and the object score threshold $\kappa$ were determined empirically through the ablation study in Section 4.4.
$$ \mathrm{ReInfer}_t = (s_{t-1} > \kappa) \,\land\, (s_t > \kappa) \,\land\, (d_t > \gamma\,\theta_t). \qquad (5) $$
After re-inference, the mask selection in Equation (15) chooses between the re-inferred mask $M_t^{\mathrm{re}}$ and the baseline mask $M_t^{\mathrm{base}}$. If $M_t^{\mathrm{re}}$ is accepted, the newly measured distance $d_t$ is used to update $d_{\mathrm{EMA}}(t)$ and $\theta_t$ as above.
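The recognition logic of Equations (1)–(5) can be summarized in a short, training-free sketch. The smoothing coefficient value below is an illustrative placeholder (the paper does not report it), and the class name is ours; the rollback mirrors the revert step of Algorithm 1 (lines 18–19).

```python
class PointerConsistencyMonitor:
    """EMA-based mask-consistency trigger over inter-frame object-pointer
    distances (Equations (1)-(5)). alpha = 0.3 is an assumed placeholder."""

    def __init__(self, warmup_n=5, alpha=0.3, gamma=1.5, theta0=15.0, kappa=0.0):
        self.n, self.alpha, self.gamma = warmup_n, alpha, gamma
        self.theta0, self.kappa = theta0, kappa
        self.t = 0
        self.warmup_dists = []
        self.d_ema, self.theta = None, theta0
        self._prev = (None, theta0)           # for rollback when the base mask is kept

    def step(self, d_t, s_prev, s_cur):
        """Return ReInfer_t for the current frame."""
        self.t += 1
        self._prev = (self.d_ema, self.theta)
        if self.t <= self.n:                  # warm-up: running mean, fixed threshold
            self.warmup_dists.append(d_t)
            self.d_ema = sum(self.warmup_dists) / len(self.warmup_dists)
            self.theta = self.theta0
            trigger = d_t > self.theta
        else:                                 # online EMA update, adaptive threshold
            self.d_ema = (1 - self.alpha) * self.d_ema + self.alpha * d_t
            self.theta = self.d_ema
            trigger = d_t > self.gamma * self.theta
        return (s_prev > self.kappa) and (s_cur > self.kappa) and trigger

    def rollback(self):
        """Revert EMA/threshold when the re-inferred mask is rejected."""
        self.d_ema, self.theta = self._prev
```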

3.3. Object Tracking and Re-Inference Based on Particle Filter

3.3.1. Object Tracking Using the Particle Filter

Given the binary mask of the current frame, $\hat{M}_t = \{ (y, x) \mid \mathrm{sigmoid}(\phi_t(y, x)) > 0.5 \}$, the bounding rectangle is defined as follows:
$$ x_0 = \min_{(y,x) \in \hat{M}_t} x, \qquad x_1 = \max_{(y,x) \in \hat{M}_t} x, \qquad (6) $$
$$ y_0 = \min_{(y,x) \in \hat{M}_t} y, \qquad y_1 = \max_{(y,x) \in \hat{M}_t} y. \qquad (7) $$
We write the resulting box as $o_t = [c_x, c_y, w_t^{\mathrm{box}}, h_t^{\mathrm{box}}] \in \mathbb{R}^4$ with
$$ c_x = \frac{x_0 + x_1}{2}, \qquad c_y = \frac{y_0 + y_1}{2}, \qquad w_t^{\mathrm{box}} = x_1 - x_0, \qquad h_t^{\mathrm{box}} = y_1 - y_0. \qquad (8) $$
Using the box as the state vector $z_t = [c_x, c_y, w_t^{\mathrm{box}}, h_t^{\mathrm{box}}]$, we maintain $P$ particles $\{ z_t^{(p)} \}_{p=1}^{P}$ and their weights $\{ \pi_t^{(p)} \}_{p=1}^{P}$, where $\pi_t^{(p)} \ge 0$ and $\sum_p \pi_t^{(p)} = 1$.
  • Initialization ($t = 1$):
    $$ z_1^{(p)} \sim \mathcal{N}\!\left( o_1,\ \mathrm{diag}\!\left(0.1\,w_1^{\mathrm{box}},\, 0.1\,h_1^{\mathrm{box}},\, 0.1\,w_1^{\mathrm{box}},\, 0.1\,h_1^{\mathrm{box}}\right)^{2} \right), \qquad \pi_1^{(p)} = \frac{1}{P}. \qquad (9) $$
  • Prediction:
    $$ z_{t|t-1}^{(p)} = z_{t-1}^{(p)} + \epsilon_t^{(p)}, \qquad \epsilon_t^{(p)} \sim \mathcal{N}\!\left(0,\ \mathrm{diag}(\sigma_t)^{2}\right), \qquad \sigma_t = 0.05\,\big[\, w_{t-1}^{\mathrm{box}},\ h_{t-1}^{\mathrm{box}},\ w_{t-1}^{\mathrm{box}},\ h_{t-1}^{\mathrm{box}} \,\big]. \qquad (10) $$
  • Update:
    $$ \tilde{\pi}_t^{(p)} = \mathrm{IoU}\!\left( z_{t|t-1}^{(p)},\, o_t \right) + \varepsilon, \qquad \pi_t^{(p)} = \frac{\tilde{\pi}_t^{(p)}}{\sum_q \tilde{\pi}_t^{(q)}}. \qquad (11) $$
  • Particle-filter confidence:
    $$ \rho_t = 1 - \frac{H(\pi_t)}{\log_2 P}, \qquad H(\pi_t) = -\sum_p \pi_t^{(p)} \log_2 \pi_t^{(p)}. \qquad (12) $$
  • Resampling: draw indices $I \sim \mathrm{Multinomial}(P, \pi_t)$ and set
    $$ z_t^{(p)} = z_{t|t-1}^{(I_p)}, \qquad \pi_t^{(p)} = \frac{1}{P}. \qquad (13) $$
  • State estimate:
    $$ \hat{z}_t = \sum_{p=1}^{P} \pi_t^{(p)}\, z_t^{(p)}. \qquad (14) $$
Here, $w_t^{\mathrm{box}}$ and $h_t^{\mathrm{box}}$ denote the width and height of the bounding-box component of $z_t$; the symbol $\pi_t^{(p)}$ is reserved exclusively for particle weights and does not conflict with the box width $w_t^{\mathrm{box}}$.
The particle filter infers object presence from the SAM2 object score. When the object is judged to be present, the framework updates the particles based on the mask mIoU. Through these updates the particles capture the motion of the target, and this motion estimate is used to form the prompt when re-inference is triggered.
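A minimal NumPy sketch of this box particle filter follows; it implements the initialization, prediction, IoU-weighted update, entropy-based confidence, resampling, and state-estimation steps listed above. The class and function names are ours, and reading the prediction noise as per-dimension (diagonal) noise is an assumption.

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Axis-aligned box o_t = [cx, cy, w, h] from a binary mask (Equations (6)-(8))."""
    ys, xs = np.nonzero(mask)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    return np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0, x1 - x0, y1 - y0], dtype=float)

def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two [cx, cy, w, h] boxes."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

class BoxParticleFilter:
    def __init__(self, box0: np.ndarray, n_particles: int = 40, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.P = n_particles
        std0 = 0.1 * np.array([box0[2], box0[3], box0[2], box0[3]])
        self.z = box0 + self.rng.normal(size=(self.P, 4)) * std0          # initialization
        self.w = np.full(self.P, 1.0 / self.P)
        self.prev_box = box0.copy()

    def predict(self):
        """Random-walk motion model with box-scaled noise (prediction step)."""
        std = 0.05 * np.array([self.prev_box[2], self.prev_box[3],
                               self.prev_box[2], self.prev_box[3]])
        self.z = self.z + self.rng.normal(size=(self.P, 4)) * std

    def update(self, obs_box: np.ndarray, eps: float = 1e-6):
        """IoU-based weighting, entropy confidence, and multinomial resampling."""
        w = np.array([box_iou(z, obs_box) for z in self.z]) + eps          # update step
        self.w = w / w.sum()
        entropy = -np.sum(self.w * np.log2(self.w))
        self.confidence = 1.0 - entropy / np.log2(self.P)                  # rho_t
        idx = self.rng.choice(self.P, size=self.P, p=self.w)               # resampling
        self.z = self.z[idx]
        self.w = np.full(self.P, 1.0 / self.P)
        self.prev_box = obs_box.copy()

    def estimate(self) -> np.ndarray:
        """Weighted state estimate (uniform weights after resampling)."""
        return (self.w[:, None] * self.z).sum(axis=0)
```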

3.3.2. Re-Inference with Particle Filter Prediction-Based Prompt

If $\mathrm{ReInfer}_t$ is true, our framework replaces the previous prompt with the observation box extracted from the current mask, the particle-filter box that predicts the object motion, and a set of negative points sampled outside both boxes to suppress background regions. Negative points are drawn from a uniform distribution over all spatial locations that lie outside both the current bounding box and the particle-filter bounding box. These elements constitute the prompt used for re-inference at time $t$, assembled as {box_coords, box_labels, point_coords, point_labels} and forwarded to the SAM2 head for a single re-inference pass. The two boxes give complementary spatial hypotheses, one reflecting current appearance and the other captured motion, while the negative points sharpen object boundaries, allowing drift to be corrected without any additional training. Consequently, given the re-inferred mask $M_t^{\mathrm{re}}$ and the base tracker output $M_t^{\mathrm{base}}$, the final mask is chosen as the one that maximizes its agreement with the previous frame:
$$ M_t = \operatorname*{arg\,max}_{A \in \{ M_t^{\mathrm{re}},\, M_t^{\mathrm{base}} \}} \mathrm{mIoU}\!\left( M_{t-1},\, A \right). \qquad (15) $$
Our framework selects the final output mask as the candidate with the highest mIoU against the previously accepted mask, reflecting the observation that object shape and position change little over short time intervals [28,29]. This rule encourages gradual evolution of the segmentation rather than abrupt switches, which improves temporal coherence and reduces drift. The selected mask M t is then used to update the object pointer and to resample the particle-filter state for the next frame.
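The prompt assembly and the selection rule of Equation (15) can be sketched as follows. The dictionary keys match the prompt fields named above, but the label conventions (1 for a positive box, 0 for a negative point) follow the public SAM interface and are our assumption; the helper names are hypothetical.

```python
import numpy as np

def box_to_xyxy(box):
    cx, cy, w, h = box
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def sample_negative_points(obs_xyxy, pf_xyxy, img_h, img_w, k=3, rng=None, max_tries=1000):
    """Uniformly sample up to k points lying outside both boxes (background suppressors)."""
    rng = rng or np.random.default_rng(0)
    pts = []
    for _ in range(max_tries):
        if len(pts) == k:
            break
        x, y = rng.uniform(0, img_w), rng.uniform(0, img_h)
        inside = lambda b: b[0] <= x <= b[2] and b[1] <= y <= b[3]
        if not inside(obs_xyxy) and not inside(pf_xyxy):
            pts.append((x, y))
    return np.array(pts)

def build_reinference_prompt(obs_box, pf_box, img_h, img_w, k_neg=3):
    """Dual-box prompt (observation + particle-filter prediction) plus negative points."""
    obs_xyxy, pf_xyxy = box_to_xyxy(obs_box), box_to_xyxy(pf_box)
    neg = sample_negative_points(obs_xyxy, pf_xyxy, img_h, img_w, k_neg)
    return {
        "box_coords": np.stack([obs_xyxy, pf_xyxy]),
        "box_labels": np.array([1, 1]),                 # assumed positive-box labels
        "point_coords": neg,
        "point_labels": np.zeros(len(neg), dtype=int),  # 0 = negative (background) point
    }

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def select_final_mask(m_re, m_base, m_prev):
    """Keep whichever candidate agrees more with the previous frame's mask (Equation (15))."""
    return m_re if mask_iou(m_prev, m_re) >= mask_iou(m_prev, m_base) else m_base
```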
Algorithm 1 details the proposed mask consistency evaluation and the particle-filter-based object tracking and re-inference mechanism. Our correction scheme assesses mask consistency solely through the lightweight object pointer produced by SAM2, entirely avoiding full-mask operations. Moreover, the object tracking and subsequent re-inference steps are implemented without learnable parameters or additional training, thereby preserving the foundation model's inherent generalization capability while markedly improving temporal stability in object tracking.
Algorithm 1 Prompt Self-Correction and Re-inference
Require: Base mask $M_t^{\mathrm{base}}$, previous mask $M_{t-1}$, current pointer $p_t$, previous pointer $p_{t-1}$, object scores $s_{t-1}, s_t$, particle set $\{ z_{t-1}^{(p)}, \pi_{t-1}^{(p)} \}_{p=1}^{P}$, EMA $d_{\mathrm{EMA}}(t-1)$, threshold $\theta_{t-1}$, frame index $t$
Ensure: Final mask $M_t$, updated particle set, updated $d_{\mathrm{EMA}}(t)$, $\theta_t$
Hyper-parameters: warm-up $N$, smoothing $\alpha$, EMA threshold $\gamma$, warm-up threshold $\theta_0$, object score cutoff $\kappa$, negatives $K$
 1: $d_t \leftarrow \lVert p_t - p_{t-1} \rVert_2$
 2: if $t \le N$ then
 3:     $d_{\mathrm{EMA}}(t) \leftarrow \frac{1}{t} \sum_{k=1}^{t} d_k$,  $\theta_t \leftarrow \theta_0$
 4: else
 5:     $d_{\mathrm{EMA}}(t) \leftarrow (1 - \alpha)\, d_{\mathrm{EMA}}(t-1) + \alpha\, d_t$
 6:     $\theta_t \leftarrow \gamma\, d_{\mathrm{EMA}}(t)$
 7: $\mathrm{ReInfer}_t \leftarrow (s_{t-1} > \kappa) \land (s_t > \kappa) \land (d_t > \theta_t)$
 8: if $s_t > 0$ then
 9:     Predict particles with noise $\sigma_t$ (Equation (10))
10:     Update weights via IoU (Equation (11))
11:     Resample and compute state $\hat{z}_t$ (Equation (13))
12: else
13:     Retain previous particle set, set $\hat{z}_t \leftarrow o_t$
14: if $\mathrm{ReInfer}_t$ then
15:     Build box prompt $\{ o_t, \hat{z}_t \}$ + $K$ negatives
16:     Run SAM2 once $\rightarrow M_t^{\mathrm{re}}$
17:     $M_t \leftarrow \operatorname*{arg\,max}_{A \in \{ M_t^{\mathrm{re}}, M_t^{\mathrm{base}} \}} \mathrm{mIoU}(M_{t-1}, A)$
18:     if $M_t = M_t^{\mathrm{base}}$ then
19:         $d_{\mathrm{EMA}}(t) \leftarrow d_{\mathrm{EMA}}(t-1)$,  $\theta_t \leftarrow \theta_{t-1}$
20: else
21:     $M_t \leftarrow M_t^{\mathrm{base}}$
  return $M_t$, particle set, $d_{\mathrm{EMA}}(t)$, $\theta_t$
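Putting the pieces together, the per-frame logic of Algorithm 1 might be wired up as in the sketch below. It reuses the hypothetical helpers introduced earlier and assumes a wrapper sam2_segment(frame, prompt) that returns (mask, object_score, pointer); such a wrapper is not part of the released SAM2 API and is shown only to illustrate the control flow.

```python
import numpy as np

def process_frame(frame, state, sam2_segment):
    """One step of Algorithm 1. `state` is assumed to hold the previous pointer,
    object score and mask plus the monitor and particle filter, all initialized
    from the prompted first frame."""
    m_base, s_t, p_t = sam2_segment(frame, prompt=None)        # first (base) SAM2 pass
    d_t = float(np.linalg.norm(p_t - state["p_prev"]))         # pointer displacement
    reinfer = state["monitor"].step(d_t, state["s_prev"], s_t)

    if s_t > 0:                                                # object judged present
        state["pf"].predict()
        state["pf"].update(mask_to_box(m_base))

    m_t = m_base
    if reinfer:
        prompt = build_reinference_prompt(mask_to_box(m_base), state["pf"].estimate(),
                                          *m_base.shape, k_neg=3)
        m_re, _, _ = sam2_segment(frame, prompt=prompt)        # single re-inference pass
        m_t = select_final_mask(m_re, m_base, state["m_prev"])
        if m_t is m_base:                                      # re-inferred mask rejected
            state["monitor"].rollback()                        # keep previous EMA / threshold

    state.update(p_prev=p_t, s_prev=s_t, m_prev=m_t)
    return m_t
```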

4. Results

4.1. Datasets

DAVIS (Densely Annotated VIdeo Segmentation) [30] is one of the earliest and most influential benchmarks for VOS. The 2017 multi-object release expands the original corpus to 150 full-HD sequences, providing 10,459 densely annotated frames and a total of 376 object instances together with per-frame pixel-accurate masks. LVOS v2 (Large-scale Long-term Video Object Segmentation) [31], published in 2024, is designed to stress-test temporal generalization. It contains high-resolution (720p) clips sampled at 6 fps, summing to 296,401 annotated frames and 407,945 pixel-accurate masks. YouTube-VOS [32] remains the largest publicly available VOS dataset, comprising 3859 high-resolution clips with more than 232,000 instance masks.

4.2. Settings

All experiments were conducted on the validation splits of three public VOS benchmarks (LVOS v2, DAVIS, and YouTube-VOS) and on the LVOS v2 train split, using a zero-shot protocol in which the ground-truth mask of the first frame of each video was converted into a prompt and no further fine-tuning was applied. Two prompt types were examined: a single dot located at the mask centroid (dot prompt) and a bounding box tightly enclosing the mask (box prompt). LVOS v2 validation was evaluated with the dot prompt, whereas DAVIS and YouTube-VOS were evaluated with the bounding-box prompt. Performance was quantified using mean Intersection-over-Union (mIoU). To update the particle-filter motion prior and the exponential moving-average (EMA) distance threshold only when the target was reliably present, the SAM2 object score had to exceed the cutoff $\kappa$ before computing the pointer distance, updating the EMA, or resampling particles. During the first five frames a fixed distance threshold of $\theta_0 = 15$ was employed, after which the particle filter tracked 40 particles. Hyperparameters for the adaptive EMA threshold and for the number and type of re-inference prompts were selected in a separate ablation study and then kept fixed for the main evaluation. For determinism, the random seed was set to 0. All experiments used the SAM2-Tiny backbone and were run on a single NVIDIA A100 GPU.
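For reference, the fixed settings above and the ablation choices of Section 4.4 can be collected into a single configuration sketch; values not reported in the paper (e.g., the EMA smoothing coefficient) are intentionally omitted, and the key names are ours.

```python
# Hedged summary of the reported evaluation settings (Sections 4.2 and 4.4).
EVAL_CONFIG = {
    "backbone": "SAM2-Tiny",
    "prompt_type": {"LVOS_v2_val": "dot", "DAVIS": "box", "YouTube-VOS": "box"},
    "warmup_frames": 5,          # frames using the fixed distance threshold
    "theta0": 15.0,              # fixed warm-up threshold
    "gamma_ema": 1.5,            # EMA threshold ratio (best in Table 2)
    "object_score_cutoff": 0.0,  # kappa (best in Table 5)
    "num_particles": 40,         # best accuracy in Table 4
    "reinference_prompt": "dual box + 3 negative points",  # best in Table 6
    "metric": "mIoU",
    "seed": 0,
    "hardware": "single NVIDIA A100 GPU",
}
```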

4.3. Experiment Result Analysis

Extensive evaluations on three public VOS benchmarks demonstrate the effectiveness of our proposed method. When the framework is applied on top of the SAM2 backbone, mIoU rises by +0.10 percentage points on DAVIS, a standard benchmark dataset for VOS. Detailed experimental results are presented in Table 1. On the LVOS v2 train and validation sets (averaged over all videos), which assess long-term temporal consistency, mIoU increases by +0.13 and +0.05, respectively, and on the YouTube-VOS validation set it increases by +0.02 (Table 1).
Although the performance improvement in mIoU is marginal, consistent enhancements were observed across the train split (420 videos) and validation set (140 videos) of the LVOS dataset, which includes long video sequences of over 2000 frames, as well as on the YouTube-VOS and DAVIS datasets. The proposed framework improves performance without additional training, thereby preserving the generalization ability of the foundation model.
Complementing these accuracy gains, the runtime analysis in Table 1 shows that, while our model’s enhanced accuracy entails a modest decrease in frames per second (fps), the processing speed remains at a highly interactive level suitable for practical applications. Specifically, our framework operates at 26.84 fps on DAVIS and maintains a robust 16.90–17 fps even on the complex, long-sequence LVOS dataset. This demonstrates a favorable trade-off, where a significant gain in segmentation quality is achieved while preserving a sufficiently high frame rate for interactive tasks, affirming the usability of our approach.
In addition to mIoU, the temporal consistency of the proposed framework was evaluated on the DAVIS dataset by computing the mIoU between masks of adjacent frames. Figure 3 presents the increase in temporal consistency for DAVIS videos when the proposed framework is applied to the SAM2 baseline. The results indicate that the proposed framework generally improves temporal consistency.
Figure 4 presents qualitative comparisons on LVOS and DAVIS. Green boxes denote cases in which re-inference succeeds, whereas red boxes indicate failures. Success is most pronounced when adjacent frames provide reliable context, i.e., when their masks are close to the ground truth.
In most examples, applying re-inference yields higher-quality masks than the SAM2 baseline. The accepted mask is then used to update the EMA of the inter-frame pointer distance, which affects the subsequent thresholding and update steps. With this corrected EMA, later frames are processed more reliably, and we observe a gradual improvement in the predicted masks over the sequence.
The second example video shows a case where the man is occluded by a balloon. In this sequence, re-inference on the second frame improves the predicted mask and brings it closer to the ground truth. This suggests that the trigger detects transient inconsistency caused by occlusion and recovers the object without additional training. The correction stabilizes the track and reduces drift in the following frames.
In contrast, in challenging sequences such as the fourth example (zebra), where the baseline has already drifted substantially from the ground truth, the selection rule may favor an inferior mask. This behavior arises because the final mask is chosen by maximizing agreement with the previous frame (mIoU), which can reinforce temporally consistent yet suboptimal hypotheses when the preceding mask is erroneous.

4.4. Ablation Study

The ablation study on the DAVIS dataset evaluates the design factors of the proposed framework, including the EMA threshold ratio that governs the re-inference decision, the object tracking strategy, the object score cutoff, the number of particles in the particle filter, and the prompt type for re-inference.
Table 2 shows that the EMA threshold ratio, $\gamma_{\mathrm{EMA}}$, influences performance on the DAVIS validation set. Without the EMA the model attains 90.25% mIoU. Setting $\gamma_{\mathrm{EMA}} = 1.5$ raises the score to 90.35%, indicating that a more responsive trigger corrects drift promptly. Higher values (2.0 and 2.5) reduce the gain to 90.31% because re-inference is invoked less often and some errors persist. A ratio of 1.5 therefore offers the best performance. Additional experiments on the hyperparameter $\theta_0$ yielded performance changes below 0.001 mIoU; therefore, detailed results are omitted.
Table 3 confirms that incorporating an explicit tracking module into the re-prompting stage consistently improves segmentation quality. Re-prompting without any temporal tracker yields 90.25 mIoU, whereas adding a Kalman filter raises the score to 90.27, and substituting the Kalman filter with a particle filter further lifts performance to 90.35. These results indicate, first, that introducing a motion-aware prompt is beneficial in itself, and second, that the sample-based particle filter, maintaining multiple weighted box hypotheses and updating them with the non-linear IoU likelihood, produces a more accurate and robust motion estimate than the single-Gaussian Kalman filter, thereby delivering the highest overall accuracy.
Table 4 shows a clear speed–accuracy trade-off as the particle count increases. Moving from 20 to 40 particles yields a small gain in mIoU from 90.33 to 90.35 with a modest drop in throughput from 27.43 to 26.84 fps. Beyond 40 particles the accuracy does not improve and in fact declines slightly to 90.30 at 256 particles, while the average fps continues to fall to 23.82. This pattern suggests that the particle filter is already well specified with a modest budget and that additional samples bring little benefit while increasing computational cost. A practical choice is 40 particles when accuracy is the priority and 20 particles for speed-sensitive settings, since the latter preserves most of the accuracy at the highest throughput.
In Table 5, the object score cutoff $\kappa$ specifies the confidence that the object is present at the time of re-inference. A larger object score indicates that SAM2 is more certain that the object exists in the current frame. The ablation shows that initiating re-inference at the earliest positive presence detection, namely by setting $\kappa = 0$, which corresponds to the onset of the mask decoder's object existence decision, yields the highest overall mIoU.
The ablation in Table 6 indicates that prompt design, rather than tracking alone, drives most of the gains. Relative to the no re-inference baseline at 90.25 mIoU, point-only prompts are fragile: a single random point slightly degrades performance (90.23), while adding a few points recovers modest improvements that saturate around three to five points (90.33) and do not further improve with ten points (90.32). A single bounding box yields only a minor benefit (90.28), and adding tracking to the box provides a negligible increment (90.29), suggesting that box localization without boundary sharpening is insufficient. The best results arise from the dual-box prompt augmented with a small number of negative points, peaking at three negatives (90.35), with diminishing or slightly adverse effects as negatives increase to five (90.34) or ten (90.31). These trends support the interpretation that complementary spatial hypotheses from the dual boxes capture motion and extent, while a modest number of negative points sharpens boundaries. Excessive negatives constrain the mask and can clip true object regions. Overall, the improvements are small but consistent, and the dual-box plus three negatives setting offers the most favorable accuracy–robustness trade-off.

5. Conclusions

This work targeted a central failure mode of SAM2 in video object segmentation. When a per-frame mask is corrupted by appearance change or occlusion, the error can bias subsequent predictions and erode temporal consistency. We addressed this with a selective strategy composed of recognition and correction. Recognition operates in pointer space by monitoring the exponential moving average of the inter-frame object pointer distance, and it is gated by the SAM2 object existence score so that action is taken only when the target is present. When inconsistency is recognized, we invoke correction through a particle filter that predicts a corrected bounding box and turns it into a prompt for SAM2. The approach is training-free, lightweight, and integrates cleanly into the inference loop while avoiding pixel-level overlap computations.
Experiments across DAVIS, the LVOS v2 train and validation splits, and the YouTube-VOS validation set show small but consistent gains over baseline SAM2 while preserving interactive throughput and the zero-shot character of the foundation model. Qualitative results indicate that the trigger captures transient failures such as brief occlusions or abrupt appearance shifts, and that the particle-filter-guided prompt helps the model recover a valid mask and maintain identity over time. In practice, the rule that selects the candidate with the highest mIoU against the previously accepted mask encourages smooth evolution of the segmentation rather than abrupt switches, which improves coherence and reduces drift.
The study suggests several directions for improvement. The current selection can prefer a locally consistent but suboptimal hypothesis when the recent reference has already drifted, so future work can consider multi-frame consensus or uncertainty-aware agreement. Prompt construction can adapt more flexibly to scene context beyond boxes and a small number of negative points, and re-detection cues can help in long sequences with heavy occlusions or identity switches. More broadly, the same recognition-and-correction principle can be applied to other promptable segmentation systems, offering a general recipe for test-time stabilization without additional learning.

Author Contributions

Conceptualization, J.L.; methodology, J.L. and J.-H.B.; software, J.L.; validation, J.-Y.K. and G.-H.Y.; formal analysis, L.H.A. and D.T.V.; investigation, D.T.V. and Z.U.R.; resources, H.L.; data curation, H.L. and Z.U.R.; writing—original draft preparation, L.H.A.; writing—review and editing, J.L. and J.-Y.K.; visualization, J.L. and J.-H.B.; supervision, G.-H.Y. and J.-Y.K.; project administration, G.-H.Y. and J.-Y.K.; funding acquisition, J.-Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET) through the Open-field Smart Agriculture Utilization Model Development Program, funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA) (RS2025-02307408,50). This work was also partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (RS-2021-II212068, Artificial Intelligence Innovation Hub, 50), and by the Artificial Intelligence Industrial Convergence Cluster Development Project funded by the Ministry of Science and ICT (MSIT, Korea) and Gwangju Metropolitan City.

Data Availability Statement

All data are publicly available DAVIS: https://davischallenge.org/davis2017/code.html (accessed on 8 September 2025); LVOS v2: https://github.com/LingyiHongfd/LVOS (accessed on 8 September 2025); Youtube-VOS: https://youtube-vos.org/dataset/vos/ (accessed on 8 September 2025).

Conflicts of Interest

Authors Dang Thanh Vu and HeonZoo Lee are employed by AISeed Inc. However, the company did not influence the design, execution, data interpretation, writing, or funding of the manuscript. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.; Lo, W.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  2. Awais, M.; Raza, M.; Chen, W.; Wang, X.; Tao, D. Foundation Models Defining a New Era in Vision: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2245–2264. [Google Scholar] [CrossRef]
  3. Ye, M.; Zhang, J.; Liu, J.; Liu, C.; Yin, B.; Liu, C.; Du, B.; Tao, D. Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 1431–1447. [Google Scholar] [CrossRef] [PubMed]
  4. Zhang, Y.; Wang, H.; Liu, J.; Chen, X.; Li, K. EVF-SAM: Early Vision–Language Fusion for Text-Prompted Segment Anything Model. arXiv 2024, arXiv:2406.20076. [Google Scholar]
  5. Bucher, M.; Valada, A.; Navab, N.; Tombari, F. Zero-Shot Semantic Segmentation. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  6. Ravi, N.; Gabeur, V.; Hu, Y.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714. [Google Scholar] [PubMed]
  7. Lin, Y.; Sun, H.; Wang, J.; Zhang, Q.; Li, X. SamRefiner: Taming Segment Anything Model for Universal Mask Refinement. arXiv 2025, arXiv:2502.06756. [Google Scholar]
  8. Shimaya, T.; Saiko, M. Sam-Correction: Fully Adaptive Label Noise Reduction for Medical Image Segmentation. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024. [Google Scholar]
  9. Zhou, C.; Ning, K.; Shen, Q.; Zhou, S.; Yu, Z.; Wang, H. SAM-SP: Self-Prompting Makes SAM Great Again. arXiv 2024, arXiv:2408.12364. [Google Scholar]
  10. Caelles, S.; Maninis, K.K.; Pont-Tuset, J.; Leal-Taixé, L.; Cremers, D.; Van Gool, L. One-Shot Video Object Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 221–230. [Google Scholar] [CrossRef]
  11. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  12. Karim, R.; Wildes, R.P. Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability. arXiv 2023, arXiv:2310.12296. [Google Scholar] [CrossRef]
  13. Mei, J.; Wang, M.; Lin, Y.; Yuan, Y.; Liu, Y. TransVOS: Video Object Segmentation with Transformers. arXiv 2021, arXiv:2106.00588. [Google Scholar] [CrossRef]
  14. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
  15. Jabbar, H.; Khan, R.Z. Methods to Avoid Over-Fitting and Under-Fitting in Supervised Machine Learning: A Comparative Study. Comp. Sci. Commun. Instrum. Devices 2015, 70, 978–981. [Google Scholar]
  16. Cheng, H.; Oh, S.; Price, B.; Schwing, A.; Lee, J. Tracking Anything with Decoupled Video Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 1316–1326. [Google Scholar]
  17. Yilmaz, A.; Javed, O.; Shah, M. Object Tracking: A Survey. ACM Comput. Surv. (CSUR) 2006, 38, 13-es. [Google Scholar] [CrossRef]
  18. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. Trans. ASME—J. Basic Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
  19. Gordon, N.J.; Salmond, D.J.; Smith, A.F.M. Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation. IEE Proc. F—Radar Signal Process. 1993, 140, 107–113. [Google Scholar] [CrossRef]
  20. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. arXiv 2012, arXiv:1212.5701. [Google Scholar] [CrossRef]
  22. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1958–1974. [Google Scholar] [CrossRef] [PubMed]
  23. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  24. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  25. Bertinetto, L.; Valmadre, J.; Henriques, J.; Vedaldi, A.; Torr, P. Fully Convolutional Siamese Networks for Object Tracking. In Proceedings of the Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; pp. 850–865. [Google Scholar]
  26. Rafi, T.H.; Mahjabin, R.; Ghosh, E.; Ko, Y.-W.; Lee, J.-G. Domain Generalization for Semantic Segmentation: A Survey. Artif. Intell. Rev. 2024, 57, 247. [Google Scholar] [CrossRef]
  27. Ha, D.; Dai, A.M.; Le, Q.V. HyperNetworks. arXiv 2016, arXiv:1609.09106. [Google Scholar] [PubMed]
  28. Zhang, Y.; Borse, S.; Cai, H.; Wang, Y.; Bi, N.; Jiang, X.; Porikli, F. Perceptual Consistency in Video Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022. [Google Scholar]
  29. Varghese, S.; Bayzidi, Y.; Bär, A.; Kapoor, N.; Lahiri, S.; Schneider, J.D.; Schmidt, N.; Schlicht, P.; Hüger, F.; Fingscheidt, T. Unsupervised Temporal Consistency Metric for Video Segmentation in Highly-Automated Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  30. Pont-Tuset, J.; Perazzi, F.; Caelles, S.; Arbeláez, P.; Sorkine-Hornung, A.; Van Gool, L. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv 2017, arXiv:1704.00675. [Google Scholar]
  31. Hong, L.; Liu, Z.; Chen, W.; Tan, C.; Feng, Y.; Zhou, X.; Guo, P.; Li, J.; Chen, Z.; Gao, S.; et al. LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation. arXiv 2024, arXiv:2404.19326. [Google Scholar]
  32. Yang, L.; Fan, Y.; Xu, N. The 4th Large-Scale Video Object Segmentation Challenge—Video Object Segmentation Track; Technical Report; CVPR: New Orleans, LA, USA, 2022. [Google Scholar]
Figure 1. Proposed framework that combines object pointer driven mask consistency evaluation with particle filter object tracking and re-inference.
Figure 2. Visualization of relationship between the object pointer and generated mask of SAM2 for specific video frames from the DAVIS dataset by cosine similarity matrix (left) and distance graph (right). The red frame marks around frames 57–64.
Figure 3. Temporal consistency changes across DAVIS videos, with green bars showing gains over SAM2 and red bars showing losses.
Figure 4. Qualitative comparison of VOS results (LVOS2 and DAVIS datasets). Green boxes indicate re-inference success and red boxes indicate failure.
Table 1. Video object segmentation performance (mIoU %) and frames per second (fps). LVOS v2 is split into train and validation.

Model | DAVIS mIoU | DAVIS fps | LVOS v2 Train mIoU | LVOS v2 Train fps | LVOS v2 Val mIoU | LVOS v2 Val fps | YouTube-VOS mIoU | YouTube-VOS fps
SAM2 | 90.25 | 27.78 | 92.08 | 18.33 | 68.47 | 18.12 | 82.66 | 15.36
Ours | 90.35 | 26.84 | 92.21 | 16.90 | 68.52 | 17.00 | 82.68 | 15.13
Table 2. Ablation on DAVIS mIoU (%). Effect of EMA threshold ratio γ_EMA.

γ_EMA | mIoU (%)
Without EMA | 90.25
1.5 | 90.35
2.0 | 90.31
2.5 | 90.31
Table 3. Ablation on DAVIS mIoU (%). Effect of tracking method.

Tracking Method | mIoU (%)
Without tracking | 90.25
Kalman filter | 90.27
Particle filter | 90.35
Table 4. Ablation on DAVIS. Effect of number of particles on mIoU and average frames per second (fps).

Particles | mIoU | Avg fps
20 | 90.33 | 27.43
40 | 90.35 | 26.84
60 | 90.32 | 26.39
128 | 90.29 | 26.35
256 | 90.30 | 23.82
Table 5. Ablation on DAVIS. Effect of object-score threshold κ.

Object Score κ | mIoU
0.0 | 0.9035
0.1 | 0.9034
0.2 | 0.9030
0.3 | 0.9030
0.4 | 0.9030
0.5 | 0.9030
0.6 | 0.9030
Table 6. Ablation on DAVIS. Effect of re-inference prompt type.

Re-Inference Prompt Type | mIoU (%)
Without re-inference | 90.25
1 point (random) | 90.23
3 points (random) | 90.33
5 points (random) | 90.33
10 points (random) | 90.32
Bounding box (without tracking) | 90.28
Bounding box (with tracking) | 90.29
Dual box + 1 negative point | 90.32
Dual box + 3 negative points | 90.35
Dual box + 5 negative points | 90.34
Dual box + 10 negative points | 90.31
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
