Article

PMG-SAM: Boosting Auto-Segmentation of SAM with Pre-Mask Guidance

1 School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 School of Electronic and Computer Engineering, Peking University, Beijing 100871, China
3 School of Computer Science and Technology, Donghua University, Shanghai 201620, China
4 Leiden Institute of Advanced Computer Science, Leiden University, EZ 2311 Leiden, The Netherlands
* Author to whom correspondence should be addressed.
Sensors 2026, 26(2), 365; https://doi.org/10.3390/s26020365
Submission received: 17 November 2025 / Revised: 24 December 2025 / Accepted: 1 January 2026 / Published: 6 January 2026
(This article belongs to the Section Intelligent Sensors)

Abstract

The Segment Anything Model (SAM), a foundational vision model, struggles with fully automatic segmentation of specific objects. Its “segment everything” mode, reliant on a grid-based prompt strategy, suffers from localization blindness and computational redundancy, leading to poor performance on tasks like Dichotomous Image Segmentation (DIS). To address this, we propose PMG-SAM, a framework that introduces a Pre-Mask Guided paradigm for automatic targeted segmentation. Our method employs a dual-branch encoder to generate a coarse global Pre-Mask, which then acts as a dense internal prompt to guide the segmentation decoder. A key component, our proposed Dense Residual Fusion Module (DRFM), iteratively co-refines multi-scale features to significantly enhance the Pre-Mask’s quality. Extensive experiments on challenging DIS and Camouflaged Object Segmentation (COS) tasks validate our approach. On the DIS-TE2 benchmark, PMG-SAM boosts the maximal F-measure from SAM’s 0.283 to 0.815. Notably, our fully automatic model’s performance surpasses even the ground-truth bounding box prompted modes of SAM and SAM2, while using only 22.9 M trainable parameters (58.8% of SAM2-Tiny). PMG-SAM thus presents an efficient and accurate paradigm for resolving the localization bottleneck of large vision models in prompt-free scenarios.

1. Introduction

The paradigm of building large-scale foundation models, first established in Natural Language Processing (NLP), has now firmly taken root in Computer Vision, aiming to create single, powerful models capable of addressing a wide array of visual tasks. Among the myriad of visual tasks, image segmentation, which involves fine-grained, pixel-level understanding, stands as a cornerstone for in-depth scene perception. A class-agnostic segmentation model capable of segmenting any object in any image is not only a long-sought goal in academia but also a core technology driving critical applications such as autonomous driving and medical image analysis. The advent of the Segment Anything Model (SAM) [1] in 2023, with its remarkable zero-shot generalization capabilities, has propelled the concept of a class-agnostic segmentation model to new heights.
The remarkable power of SAM stems from its interactive, prompt-based design, allowing it to segment virtually any object specified by user-provided cues such as points, boxes, or masks. However, a significant gap exists between its “segment everything” automatic mode and the needs of many real-world applications that require fully automatic segmentation of specific objects of interest. The automatic mode of SAM employs an exhaustive strategy, placing a dense grid of points across the image without any localization guidance. While enabling prompt-free operation, this design philosophy leads to two critical deficiencies: first, localization blindness, where the model fails to identify and prioritize salient objects, resulting in fragmented and semantically incoherent masks (Figure 1), and second, computational redundancy, expending resources on irrelevant background regions. Consequently, its performance on targeted segmentation tasks like Dichotomous Image Segmentation (DIS) [2] is severely compromised. As research [3,4] shows, the automatic mode’s performance is drastically inferior to its prompt-guided counterpart, even when the latter is provided with a simple ground-truth bounding box (GT-Bbox). This highlights a core bottleneck: SAM’s original automatic design lacks the intrinsic capability for targeted, autonomous localization.
To overcome these limitations, existing research has predominantly explored two directions: enhancing precision through more sophisticated prompt engineering or user interaction [5,6], or adapting the model to specific domains via adapters or fine-tuning [7,8]. The former, however, compromises the model’s autonomy, while the latter risks impairing its valuable generalization ability.
Diverging from these approaches, our work addresses the problem at a more fundamental level. Instead of generating sparse external prompts or performing post hoc refinement, we propose to redesign the information flow within the automatic segmentation process itself. We introduce an internal, end-to-end guidance mechanism that operates without manual prompts. This mechanism first performs a coarse global analysis to efficiently generate a preliminary mask identifying salient regions, and then leverages this mask to guide a fine, detailed segmentation within those specific areas.
To this end, we propose a Pre-Mask Guided framework for SAM (PMG-SAM). The central concept is inspired by human cognitive processes: first, a lightweight, dual-branch encoder rapidly generates a global Pre-Mask that identifies potential objects of interest within the image. Subsequently, this Pre-Mask is utilized as a high-quality internal prompt to guide a powerful decoder in performing fine-grained contour segmentation. This coarse-to-fine mechanism replaces the exhaustive search with globally informed guidance, addressing the position sensitivity issue at its source and opening a new avenue for designing efficient, class-agnostic segmentation models. In summary, the contributions of this paper are fourfold as follows:
  • We propose a “locate-and-refine” paradigm implemented through Pre-Mask Guidance. This approach shifts SAM’s automatic mode from an exhaustive grid-based search to a targeted process, directly addressing the core issue of localization blindness by providing a dense, global prior before segmentation.
  • We present an effective implementation of this paradigm, featuring a dual-branch encoder that leverages the complementary strengths of Transformer and CNN architectures. A novel Dense Residual Fusion Module (DRFM) is designed to synergistically fuse these features, generating a high-quality Pre-Mask that is crucial for the guidance mechanism.
  • To enhance boundary details in the “refine” stage, we integrate a high-resolution feature injection pathway. This pathway preserves crucial spatial information from the encoder, leading to more precise segmentation contours.
  • We conduct extensive experiments demonstrating the effectiveness and efficiency of our approach. Our fully automatic model not only significantly outperforms the standard auto-modes of SAM/SAM2 but also surpasses their GT-Bbox prompted modes. This is achieved with only 22.9 M trainable parameters, showcasing a superior balance of performance and efficiency and validating the power of our proposed paradigm.

2. Related Work

2.1. Visual Backbones for Segmentation

The performance of segmentation models is heavily reliant on the quality of features extracted by their visual backbone. Modern architectures often face a trade-off between capturing fine-grained local details and robust global context.
Hierarchical Vision Transformers, such as Hiera [9], represent a significant advancement in this area. Their strength is twofold. Architecturally, by employing window-based local attention in their early stages, they excel at preserving high-frequency spatial information like edges and textures. Furthermore, their pre-training under the masked autoencoder (MAE) framework [10] is crucial. This self-supervised strategy forces the model to reconstruct randomly masked image patches, compelling it to learn rich and robust local representations without relying on extensive labeled data. This combination of an efficient architecture and a powerful pre-training scheme makes them ideal for tasks requiring precise boundary delineation.
On the other hand, CNN-based architectures, particularly those with a U-shaped structure like U-Net [11], have long been the standard for semantic segmentation. Their design of progressive downsampling to capture semantic information followed by upsampling to recover spatial resolution makes them powerful at abstracting the global context of an image. U2-Net [12] further enhances this paradigm with its nested ReSidual U-blocks (RSUs), enabling even deeper feature abstraction across multiple scales. Our work leverages the complementary nature of these two architectural paradigms by designing a novel fusion mechanism to combine their respective strengths.

2.2. Advances in Improving SAM

Research on improving SAM has predominantly advanced along four key directions: enhancing its automation, broadening its domain adaptability, refining its output precision, and fostering synergy with other foundation models.
Automated Prompt Generation. The core interactive mechanism of SAM relies on manual prompts, which limits its efficiency in fully automated applications. Consequently, a significant line of research focuses on developing techniques for automatic prompt generation. A mainstream approach involves training an auxiliary network to predict prompt cues like points or boxes [13,14], or refining an initial set of prompts into a more representative sparse point collection [15]. Another strategy leverages the semantic understanding of multimodal large models, such as CLIP [16], to generate guiding signals like pseudo-points, pseudo-masks, textual, or auditory prompts [17,18], providing SAM with initial localization information. However, these methods typically generate sparse prompts (points or boxes) or depend on implicit guidance from external models, limiting their ability to provide a global, structurally rich prior for the segmentation target.
SAM Adaptation and Enhancement. To adapt SAM for specialized domains like medical imaging, remote sensing, and video, research has proceeded along two main avenues: Parameter-Efficient Fine-Tuning (PEFT) and architectural modification. PEFT methods, such as Adapters and Low-Rank Adaptation (LoRA), aim to adjust SAM’s parameters with minimal computational overhead [19,20]. Some studies have also designed novel fine-tuning mechanisms, like prompt-bridging, to balance the optimization between the encoder and decoder [21]. Architectural enhancements, on the other hand, focus on tailoring the model to specific data types. This includes introducing temporal modeling modules for video [22,23,24] and integrating state-space models like Mamba [25] to capture domain-specific spatiotemporal dependencies. These works concentrate on adjusting the model’s internal parameters or structure to accommodate new tasks or data modalities.
Output Quality Refinement. To address the coarse boundaries in SAM’s output masks, research has primarily adopted post-processing refinement strategies. This involves introducing an additional network to polish the initial masks generated by SAM [26]. Alternatively, some approaches focus on decoder optimization by fusing multi-scale features or improving feature aggregation methods to enhance the detail quality of the native output [19,20,27]. Such methods are characteristically post hoc, concentrating on correcting or enhancing the segmentation result after it has been generated.
Collaboration with Other Foundation Models. Beyond single-model adaptation, a body of work explores the deep integration of SAM with other large models, notably CLIP, at a systemic level. This includes constructing explicit pipelines, such as using CLIP for localization followed by SAM for segmentation [17], deconstructing SAM and coupling it with a CLIP encoder to create an efficient single-stage open-vocabulary segmentation architecture [15], or utilizing SAM for contextual segmentation while introducing uncertainty modeling to improve robustness [28]. These efforts demonstrate the potential of SAM as a versatile component within more complex, multimodal systems.

2.3. Positioning of Our Work

Our work carves a distinct and novel path within this research landscape. While existing research has largely focused on two avenues—either enhancing the promptable mode through automated prompt generation or adapting SAM’s internal architecture for specific domains—our work addresses a more fundamental issue: the inherent limitation of the automatic mode for targeted segmentation tasks.
Unlike automated prompt generation methods that aim to replace manual clicks with predicted sparse cues like points or boxes, our Pre-Mask Guided paradigm introduces a new, fully automatic pipeline. It provides a dense, explicit mask rich in spatial and structural priors, shifting the operational model from a “prompt-and-segment” process to an end-to-end “locate-and-refine” workflow. This approach fundamentally solves the localization blindness of the grid-based segment-everything approach. In contrast to methods that fine-tune SAM’s parameters or add post-processing steps to refine coarse masks, our framework operates at the input level of the decoder. By furnishing a high-quality shape prior before the main segmentation process begins, we preemptively guide the model towards a precise solution. This makes our approach a foundational, architectural innovation rather than a domain-specific adaptation or a corrective afterthought. In essence, while other methods focus on what to prompt SAM with, our work redefines how SAM’s automatic mode should fundamentally operate, offering a blueprint for general-purpose automatic segmentation models that are both efficient and target-aware and that can serve as versatile components of complex vision systems.

3. Methods

The native prompt-free mode of SAM is hampered by two fundamental limitations: a localization sensitivity defect and a redundant computation bottleneck. To address these challenges head-on, we propose PMG-SAM, a framework that replaces SAM’s exhaustive grid-based search with a targeted, cognition-inspired “locate-and-refine” paradigm. This section first revisits the architecture and limitations of SAM in Section 3.1, before detailing the design and components of our proposed solution in Section 3.2.

3.1. Preliminaries: SAM

3.1.1. Architectural Framework

SAM comprises three core components: an image encoder, a prompt encoder, and a mask decoder.
The image encoder, accounting for the majority of model parameters, employs a Vision Transformer (ViT) [29] pre-trained via Masked Autoencoding (MAE) [10] to process high-resolution inputs, with each image undergoing single-pass feature extraction.
The prompt encoder is a core component that establishes SAM as a multimodal architecture, processing two distinct prompt categories: dense and sparse. Dense prompts, which consist of mask inputs, are processed by a convolutional encoder and their embeddings are added element-wise to the image embeddings. To handle cases where no mask is provided, the model incorporates a learnable “no-mask” embedding. This special token serves as a placeholder to signify the absence of a mask, ensuring the architectural consistency of the model. In contrast, sparse prompts encompass three modalities: points and boxes, which are encoded as positional embeddings combined with modality-specific learnable embeddings, and free-form text, which is processed with CLIP’s text encoder.
The mask decoder integrates image features and prompt embeddings through bidirectional cross-attention and self-attention mechanisms, subsequently employing a multilayer perceptron (MLP) to project output tokens to a dynamic linear classifier that computes foreground probability masks at each spatial location.

3.1.2. Prompt-Free Segmentation

SAM implements prompt-free segmentation through a two-stage cascade beginning with grid sampling, where a default N × N grid (N = 32) is overlaid on the input image and processed at each grid point to generate candidate object masks, with repetition on two cropped high-resolution regions using denser point sampling. This is followed by mask refinement through a three-phase filtering mechanism: (1) elimination of edge-affected masks from cropped regions; (2) application of Non-Maximum Suppression (NMS) for local-global mask merging; and (3) implementation of quality control via three criteria—retention of masks with predicted IoU scores ≥ 88.0, stability verification through soft-mask thresholding at τ = {−1, +1} with predictions kept only when IoU(M_{−1}, M_{+1}) ≥ 95.0, and rejection of masks covering ≥ 95% of the image area as non-informative.
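For reference, this grid-prompted automatic mode can be reproduced with the official segment-anything package. The sketch below is illustrative only: the checkpoint path and image file are placeholders, and the threshold values simply mirror the filtering criteria described above.

```python
# Illustrative sketch of SAM's grid-prompted "segment everything" mode using the
# official segment-anything package; checkpoint path and image file are placeholders.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # the default 32 x 32 prompt grid
    pred_iou_thresh=0.88,         # keep masks with predicted IoU >= 0.88
    stability_score_thresh=0.95,  # stability check via soft-mask thresholding
    crop_n_layers=1,              # rerun on cropped high-resolution regions
    box_nms_thresh=0.7,           # NMS for local-global mask merging
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'predicted_iou', ...
```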
This paradigm reveals two critical limitations. First, it suffers from a localization sensitivity defect, where grid-based dense prompts are vulnerable to semantic fragmentation, leading to erroneous multi-fragment segmentation of single objects. This observation is corroborated by the work of Sun et al. [5], which demonstrates that native geometric prompts of SAM struggle in complex scenes precisely due to a lack of semantic guidance. Second, the paradigm introduces a redundant computation bottleneck. The per-point cross-attention in the ViT decoder, with its O ( N 2 ) complexity, leads to a computational explosion when high-resolution grids are employed, violating linear complexity expectations. The drive to mitigate such computational inefficiencies is a key motivation in related research, such as the model merging approach proposed by Wang et al. [30] to create a more efficient, unified model.

3.2. Architecture

To overcome the inherent limitations of SAM’s automatic mode, namely its localization blindness and computational inefficiency, we introduce Pre-Mask Guided SAM (PMG-SAM). Our approach is inspired by the “coarse-to-fine” cognitive mechanism characteristic of human vision. The overall architecture of PMG-SAM, illustrated in Figure 2, is streamlined and comprises two primary components: a novel image encoder and an enhanced mask decoder. To operationalize our proposed ‘locate-and-refine’ paradigm, the architecture of PMG-SAM is conceptually divided into two functional stages: a Pre-Mask Generator, comprising our dual-branch encoder and fusion modules, which performs the ‘locate’ step; and a Guided Mask Decoder, which executes the ‘refine’ step by leveraging the generated Pre-Mask for precise segmentation.
The core innovation of our framework lies in the Pre-Mask guided paradigm. Specifically, we replace SAM’s original image encoder, which relies on a grid of point prompts for automatic mask generation, with our specialized encoder. This new encoder is designed to automatically generate a high-quality, dense Pre-Mask from the input image. The Pre-Mask serves as a global, structural prompt, providing a strong prior about the locations and shapes of potential objects. This Pre-Mask is then fed into our enhanced mask decoder to guide the segmentation process. Crucially, to further boost precision, the decoder is also strategically augmented with high-resolution features extracted by our new encoder. This dual-input design—leveraging both the high-level semantic guidance of the Pre-Mask and the edge detail from the features—enables the production of highly accurate and detailed segmentation masks. By replacing an exhaustive strategy with a learnable, content-aware guidance mechanism, PMG-SAM fundamentally shifts the operational process from exhaustive search to targeted refinement.
To effectively capture the full spectrum of visual information required for high-quality segmentation, our feature extractor is designed with a dual-pathway architecture. This design is motivated by the observation that different architectural paradigms excel at capturing distinct types of features: hierarchical transformers are superior for fine-grained local details, while U-Net-like structures are powerful for abstracting global semantic context. For the local feature pathway, we adopt Hiera as our backbone, replacing the original ViT from SAM. The choice of Hiera is deliberate, owing to its powerful combination of architectural design and pre-training strategy. Its hierarchical structure with local attention is inherently suited for capturing fine-grained details, a capability significantly amplified by its MAE-based pre-training. This training regime compels the model to develop a profound understanding of local patterns and textures, which aligns with findings from [31] on the importance of local features for precise segmentation. Concurrently, for the global feature pathway, we employ a U2-Net-like architecture. Its deeply supervised, nested U-structure is highly adept at progressive semantic abstraction, yielding robust global feature representations that capture the overall structure and context of target objects.
The core novelty of our approach lies not in the individual backbones, but in their synergistic fusion. To combine Hiera’s local acuity with the U2-Net pathway’s global understanding, we introduce our Dense Residual Fusion Module (DRFM). This module systematically integrates feature maps from both pathways. The process involves three key stages: extraction of N 1 corresponding feature maps from both architectures, where N 1 is set to six by default; pairwise feature fusion with DRFM (detailed in Section 3.3); and processing of the fused features through FeatRefiner (detailed in Section 3.4).

3.3. Dense Residual Fusion Module

Dense Residual Fusion Module (DRFM) is designed to effectively merge the strengths of the Hiera and U2-Net feature streams. We adopt a layer-wise fusion strategy, where features from corresponding layers of the two models are fused pairwise. This approach is rooted in the belief that integrating local, detail-oriented features (from Hiera) with multi-scale semantic features (from U2-Net) at the same hierarchical level ensures meaningful feature interaction. In contrast to a global fusion approach, this targeted strategy prevents the mixing of features from disparate semantic levels, which could lead to information misalignment and redundant interference. Furthermore, by confining interactions to corresponding layers, we significantly reduce computational overhead.
As depicted in Figure 3, the operational flow of DRFM designates a feature map from a U2-Net layer (U2-Net Feature) as the primary backbone for fusion. This feature is progressively refined by passing through a cascade of N 2 Stabilized Residual Blocks (SRBs), where N 2 is set to three by default. Within each SRB, the corresponding feature map from the Hiera model (H-Feature) is integrated. Finally, the output from the SRB cascade is scaled and added back to the original U2-Net Feature via a long-range skip connection. As previously discussed, this process enriches the robust global context from U2-Net with high-fidelity local details from Hiera. Consequently, the fused features facilitate the generation of a pre-mask that possesses both precise localization and complete edge information, thereby providing superior guidance for the mask decoder to extract a highly refined mask of the object of interest. The overall fusion process of DRFM is summarized in Algorithm 1.
Algorithm 1: The Overall Pipeline of DRFM.
The internal architecture of the Stabilized Residual Block (SRB) is also illustrated in Figure 3. An SRB takes an input feature F 0 and the corresponding H-Feature H. It performs M iterations of residual fusion, with M defaulting to four in our implementation. A key characteristic of this process is its dense connectivity: the new feature at each iteration is derived from a residual connection involving all preceding features in the block as well as the H-Feature H. This mechanism is termed a “Dense Residual”. The total accumulated residual from these iterations is then scaled and added to the original input F 0 , followed by a Group Normalization layer to produce the final output Y. This scaling of the residual value, applied in both the overarching DRFM structure and within each SRB, serves as a crucial stabilization technique to prevent gradient explosion during training. The internal fusion process of the SRB is detailed in Algorithm 2.
Algorithm 2: The details of SRB.
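Since the algorithm figures are not reproduced here, the following PyTorch sketch reconstructs the DRFM/SRB pipeline from the description above. It is a minimal interpretation, not the authors’ code: the 3 × 3 convolutions, the learnable scale initialization, the group count of the normalization layer, and the assumption that the Hiera and U2-Net features at a given level are already aligned in channels and resolution are ours.

```python
# Minimal PyTorch sketch of DRFM (Algorithm 1) and SRB (Algorithm 2) as described
# in the text. Layer choices and channel alignment are assumptions for illustration.
import torch
import torch.nn as nn


class SRB(nn.Module):
    """Stabilized Residual Block: M iterations of dense residual fusion."""

    def __init__(self, channels: int, num_iters: int = 4, groups: int = 8):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(num_iters)]
        )
        self.scale = nn.Parameter(torch.tensor(0.1))  # residual scaling for stability
        self.norm = nn.GroupNorm(groups, channels)    # channels must be divisible by groups

    def forward(self, f0: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        feats = [f0]
        for conv in self.convs:
            # each new feature depends on all preceding features plus the
            # Hiera feature h ("dense residual" connectivity)
            feats.append(conv(torch.relu(sum(feats) + h)))
        residual = sum(feats[1:])                      # accumulated residual
        return self.norm(f0 + self.scale * residual)   # scaled skip + GroupNorm


class DRFM(nn.Module):
    """Dense Residual Fusion Module: cascade of SRBs over a U2-Net feature."""

    def __init__(self, channels: int, num_srbs: int = 3):
        super().__init__()
        self.srbs = nn.ModuleList([SRB(channels) for _ in range(num_srbs)])
        self.scale = nn.Parameter(torch.tensor(0.1))   # long-range residual scale

    def forward(self, u2_feat: torch.Tensor, h_feat: torch.Tensor) -> torch.Tensor:
        x = u2_feat
        for srb in self.srbs:
            x = srb(x, h_feat)                         # inject the Hiera feature in every SRB
        return u2_feat + self.scale * x                # long-range skip connection
```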

3.4. FeatRefiner

The FeatRefiner module is designed with the explicit purpose of processing the multi-scale, hierarchical feature maps generated by the preceding DRFM fusion stages. Its primary function is to distill these diverse representations into a single, cohesive Pre-Mask. The refinement process commences by upsampling all fused feature maps from the different layers to a uniform spatial resolution. This ensures spatial alignment before integration. Subsequently, the resized feature maps are concatenated along the channel axis, creating a high-dimensional composite tensor that aggregates rich information from all semantic levels. To conclude the process, a lightweight 1 × 1 convolutional layer is applied to this concatenated tensor. This operation serves the dual purpose of compressing the channel-wise information and effectively extracting the most salient features from the aggregated representation. The final output of this module is the Pre-Mask, denoted as M pre R B × 1 × H × W , which provides a comprehensive initial localization of the object of interest to guide the subsequent mask decoder.
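The following PyTorch sketch illustrates this upsample–concatenate–compress flow; the bilinear interpolation mode and the list-based interface are assumptions rather than details taken from the paper.

```python
# Minimal sketch of FeatRefiner: align the fused multi-scale features spatially,
# concatenate them, and compress to a single-channel Pre-Mask.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatRefiner(nn.Module):
    def __init__(self, in_channels: list[int], out_size: tuple[int, int]):
        super().__init__()
        self.out_size = out_size
        # 1x1 convolution compresses the concatenated channels to one Pre-Mask channel
        self.proj = nn.Conv2d(sum(in_channels), 1, kernel_size=1)

    def forward(self, fused_feats: list[torch.Tensor]) -> torch.Tensor:
        # upsample every fused feature map to a common spatial resolution
        aligned = [
            F.interpolate(f, size=self.out_size, mode="bilinear", align_corners=False)
            for f in fused_feats
        ]
        x = torch.cat(aligned, dim=1)  # aggregate information from all semantic levels
        return self.proj(x)            # Pre-Mask of shape (B, 1, H, W)
```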

3.5. Mask Decoder

The outputs from the image encoder are subsequently processed by our enhanced mask decoder. This decoder is architecturally defined by a cascade of N 3 Bidirectional Attention Interaction Modules (BAIMs), which iteratively refine the representations. As illustrated in the overall framework (Figure 2), upon passing through the BAIM stack, the refined image embeddings undergo upsampling with transposed convolutions. Critically, to enhance the generation of high-resolution details, we inject fine-grained feature maps from the initial two stages of the Hiera encoder directly into these transposed convolution layers. Unlike standard ViT features which may lose high-frequency details, the early stages of Hiera, benefiting from its local attention mechanism and MAE pre-training, preserve rich spatial texture and edge information. These high-resolution features are rich in detailed edge information, proving vital for producing crisp and accurate segmentation boundaries. Finally, an MLP maps the output tokens to a dynamic linear classifier, and the foreground probability for each pixel location is computed with a dot product to yield the final mask.
The internal architecture of the BAIM, detailed in Figure 4, is meticulously designed to facilitate a comprehensive, bidirectional information flow. The input Pre-Mask is first partitioned into a grid of non-overlapping patches of size P × P (where P = 16 by default) and linearly embedded to form a sequence of P² mask tokens. The processing pipeline within a single BAIM block commences with a self-attention layer applied to these mask tokens, enabling them to model their internal spatial dependencies. Following this intra-mask modeling, the refined tokens serve as queries in a Mask-to-Image (M2I) cross-attention mechanism, attending to the image features to gather contextually relevant visual information. Each token is then independently updated by a pointwise MLP block. The bidirectional interaction culminates in an Image-to-Mask (I2M) cross-attention layer. In this crucial step, the roles are inverted: the image features now act as queries to attend to the updated mask tokens, allowing the image representation itself to be refined based on the focused guidance from the mask. This complete cycle of self-attention, dual cross-attention, and MLP-based updates constitutes one interaction block, and its outputs are passed to the subsequent BAIM for further iterative refinement, ensuring the guidance from the Pre-Mask is fully leveraged.
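A minimal PyTorch sketch of one BAIM block is given below. It follows only the attention ordering described above (self-attention, mask-to-image cross-attention, pointwise MLP, image-to-mask cross-attention); the pre-norm placement, head count, and MLP width are illustrative assumptions.

```python
# Minimal sketch of one Bidirectional Attention Interaction Module (BAIM).
# Normalization placement, head count, and MLP width are assumptions.
import torch
import torch.nn as nn


class BAIM(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.m2i_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.i2m_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, mask_tokens: torch.Tensor, image_feats: torch.Tensor):
        # mask_tokens: (B, N_mask, C) from the patch-embedded Pre-Mask
        # image_feats: (B, N_img, C) flattened image embeddings
        # 1) intra-mask modeling via self-attention
        t = self.norms[0](mask_tokens)
        mask_tokens = mask_tokens + self.self_attn(t, t, t)[0]
        # 2) mask-to-image cross-attention: mask tokens query the image
        t = self.norms[1](mask_tokens)
        mask_tokens = mask_tokens + self.m2i_attn(t, image_feats, image_feats)[0]
        # 3) pointwise MLP update of each token
        mask_tokens = mask_tokens + self.mlp(self.norms[2](mask_tokens))
        # 4) image-to-mask cross-attention: image features query the mask tokens
        q = self.norms[3](image_feats)
        image_feats = image_feats + self.i2m_attn(q, mask_tokens, mask_tokens)[0]
        return mask_tokens, image_feats
```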

4. Experiment

In this section, we present a comprehensive set of experiments to empirically validate our central thesis: that the Pre-Mask Guided “locate-and-refine” paradigm represents a superior architectural solution for automatic segmentation compared to SAM’s native “segment everything” approach. We conduct our evaluation on two distinct tasks that represent the opposing extremes of visual perceptibility: Dichotomous Image Segmentation (DIS), which is characterized by high saliency but complex structure, and Camouflaged Object Segmentation (COS), which is characterized by low saliency and texture ambiguity. Validating on these extremes ensures the model’s robustness across the full spectrum of segmentation challenges. We begin by introducing the datasets and evaluation metrics employed for each task. Subsequently, we describe the implementation details, including the experimental environment, model configuration, and the specific training and inference procedures. We then report and analyze the quantitative results on the DIS5K [2], COD10K [32], and NC4K [33] benchmark datasets, comparing our method against other approaches. Finally, we conduct extensive ablation studies to investigate the impact of individual components on the overall performance.
SAM-B, SAM-L, and SAM-H represent ViT-B, ViT-L, and ViT-H model types of SAM, respectively. SAM2-T, SAM2-B+, and SAM2-L represent Hiera-Tiny, Hiera-Base+, and Hiera-Large model types of SAM2, respectively.

4.1. Datasets and Evaluation Metrics

4.1.1. DIS Task: Dataset and Metrics

Our experiments use the DIS-5K dataset, the first large-scale benchmark specifically designed for high-resolution (2K, 4K, and beyond) binary image segmentation. The dataset consists of 5470 meticulously annotated images, organized into 22 groups across 225 categories. These images include diverse objects camouflaged in complex backgrounds, salient entities, and structurally dense targets. Each image was manually annotated at the pixel level, with an average labeling time of 30 min per image, extending up to 10 h for particularly complex instances. According to the official partition, the dataset is divided into 3000 training images (DIS-TR), 470 validation images (DIS-VD), and 2000 test images. The test set is further divided into four subsets (DIS-TE1 to DIS-TE4), each containing 500 images, representing ascending difficulty levels based on the product of structural complexity (IPQ) and boundary complexity (P_num).
Evaluation uses six established metrics: maximal F-measure ( F β max ) [34], weighted F-measure ( F β ω ) [35], Mean Absolute Error ( M ) [36], Structural measure ( S α ) [37], mean Enhanced alignment measure ( E ϕ m ) [38], and Human Correction Efforts ( H C E γ ) [2]. Arrows indicate the preferred direction (↑: higher is better, ↓: lower is better).
The maximal F-measure evaluates the optimal trade-off between precision and recall. It calculates the F-measure scores across varying thresholds (τ) and selects the maximum value. Following the standard setting in saliency detection, we set β² = 0.3 to emphasize precision. It can be expressed as
F_{\beta}^{\max} = \max_{\tau} \frac{(1+\beta^{2}) \cdot \mathrm{Precision}(\tau) \cdot \mathrm{Recall}(\tau)}{\beta^{2} \cdot \mathrm{Precision}(\tau) + \mathrm{Recall}(\tau)}.
The weighted F-measure (F_β^ω) utilizes a weighted precision (Precision^ω) and a weighted recall (Recall^ω) to address the flaw that standard measures treat all pixels equally. It considers spatial dependence and pixel importance. In our evaluation, we set β² = 1. It can be expressed as
F_{\beta}^{\omega} = \frac{(1+\beta^{2}) \cdot \mathrm{Precision}^{\omega} \cdot \mathrm{Recall}^{\omega}}{\beta^{2} \cdot \mathrm{Precision}^{\omega} + \mathrm{Recall}^{\omega}}.
The Mean Absolute Error quantifies the global accuracy of predictions by computing the average absolute difference between the predicted map (Pred) and the ground-truth map (GT) across all pixels. A lower MAE indicates higher consistency between predictions and ground truths. It can be expressed as
\mathrm{MAE} = \frac{1}{N} \sum_{i,j} \left| \mathrm{Pred}(i,j) - \mathrm{GT}(i,j) \right|.
The Structural measure (S-measure) simultaneously evaluates the region-aware structural similarity (S_r) and the object-aware structural similarity (S_o) between the prediction and the ground truth, with the balance parameter α set to 0.5. It can be expressed as
S_{\alpha} = \alpha \cdot S_{o} + (1-\alpha) \cdot S_{r}.
The mean Enhanced alignment measure calculates the average E-value (which integrates pixel-level and region-level errors) across multiple thresholds (τ_t), reflecting error distribution characteristics at both local and global scales. It can be expressed as
E_{\phi}^{m} = \frac{1}{T} \sum_{t=1}^{T} E(\tau_{t}).
The Human Correction Efforts (HCE) metric is designed to quantify the barriers between model predictions and real-world applications. Unlike standard metrics that measure the geometric gap (e.g., IoU), HCE approximates the human effort required to correct the faulty regions (False Positives and False Negatives) in a segmentation mask. Specifically, it estimates the number of mouse-clicking operations needed for correction, including dominant point selection for boundary refinement and region selection for area fixing. A lower HCE_γ value indicates a reduced manual revision workload, signifying that the model’s output satisfies high-accuracy requirements with fewer human interventions. It can be expressed as
\mathrm{HCE}_{\gamma} = \frac{1}{K} \sum_{k=1}^{K} \left( \mathrm{FP}_{\mathrm{points}}(k) + \mathrm{FP}_{\mathrm{indep}}(k) + \mathrm{FN}_{\mathrm{points}}(k) + \mathrm{FN}_{\mathrm{indep}}(k) \right).
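For concreteness, a minimal NumPy sketch of two of these metrics (MAE and the maximal F-measure) for a single prediction is given below; `pred` and `gt` are float maps in [0, 1], and the threshold sweep granularity is an assumption.

```python
# Minimal NumPy sketch of MAE and maximal F-measure for a single prediction.
import numpy as np


def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    # average absolute difference over all pixels
    return float(np.mean(np.abs(pred - gt)))


def max_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3,
                  num_thresholds: int = 255) -> float:
    # sweep thresholds tau and keep the best F-measure (beta^2 = 0.3 by default)
    gt_bin = gt > 0.5
    best = 0.0
    for tau in np.linspace(0.0, 1.0, num_thresholds):
        pred_bin = pred > tau
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best
```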

4.1.2. COS Task: Datasets and Metrics

For the COS task, our experiments are based on the COD10K and NC4K datasets. COD10K, the first large-scale benchmark for camouflaged object detection, spans 78 sub-classes and 10 super-classes and captures diverse camouflage scenarios in natural environments. We follow the official split of 3040 images for training and 2026 for testing. The NC4K dataset, containing 4121 images of camouflaged objects in nature, serves as a supplementary benchmark to validate the generalization capability of the models.
To evaluate segmentation performance, we adopt the standard COCO-style metrics: Average Precision (AP), AP50, and AP75.
Average Precision (AP) is the primary metric, calculated as the mean of the AP values over multiple IoU (Intersection over Union) thresholds (from 0.5 to 0.95 with a step of 0.05). It provides a comprehensive measure of instance segmentation quality. AP50 and AP75 are variants of AP calculated at single, fixed IoU thresholds of 0.5 and 0.75, respectively. AP50 evaluates basic detection and localization accuracy, while AP75 imposes a stricter criterion for more precise mask predictions.

4.2. Experiment Settings

Our proposed PMG-SAM is an enhanced architecture based on SAM. To enhance the multi-scale feature representation, we employ a Feature Pyramid Network (FPN) [39] to process the features extracted by the Hiera encoder before they are fed into the mask decoder. Our performance is benchmarked against numerous leading models, including SAM2 [40], a recent advancement for both image and video segmentation. All experiments were conducted on a system running Ubuntu 20.04.6, equipped with a single NVIDIA Tesla A100-PCIE-40GB GPU (NVIDIA, Santa Clara, CA, USA), and built upon a stack of PyTorch 2.5.1, CUDA 11.8, and Segment-Anything 1.0.

4.3. Training and Inference Procedure

To leverage powerful prior knowledge, accelerate training, ensure computational efficiency, and prevent overfitting on limited downstream data, the image encoder components—Hiera and U2-Net—are initialized with pre-trained weights and remain fully frozen throughout training. The Hiera-Base+ model was pre-trained on the ImageNet-1K dataset [41] using the MAE self-supervised learning framework. The U2-Net component uses weights pre-trained on the DUTS dataset [42]. The total loss function is a weighted sum of three standard segmentation losses: Binary Cross-Entropy (BCE), Dice, and IoU loss. The total loss L_all is computed as:
\mathcal{L}_{\mathrm{all}} = w_{1} \mathcal{L}_{\mathrm{BCE}} + w_{2} \mathcal{L}_{\mathrm{Dice}} + w_{3} \mathcal{L}_{\mathrm{IoU}},
where the weights w_1, w_2, and w_3 are set to 1, 1, and 10, respectively. Following the original SAM, the number of BAIMs, N 3, is set to 2. The batch size is set to 4. For data preprocessing, input images undergo a series of augmentations, including random cropping, random horizontal flipping, random rotation between −30° and 30°, and random color jittering. Subsequently, all images are resized and padded to a fixed resolution of 1024 × 1024 to comply with the input requirements of the Hiera encoder.
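A minimal PyTorch sketch of this weighted loss is shown below; `logits` are the raw model outputs, `target` is the binary ground-truth mask, and the smoothing constant is an assumption.

```python
# Minimal sketch of the combined BCE + Dice + IoU training loss with weights 1:1:10.
import torch
import torch.nn.functional as F


def segmentation_loss(logits: torch.Tensor, target: torch.Tensor,
                      w_bce: float = 1.0, w_dice: float = 1.0, w_iou: float = 10.0,
                      eps: float = 1e-6) -> torch.Tensor:
    prob = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, target)

    # soft Dice loss over the spatial dimensions
    inter = (prob * target).sum(dim=(-2, -1))
    dice = 1 - (2 * inter + eps) / (prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1)) + eps)

    # soft IoU loss over the spatial dimensions
    union = prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1)) - inter
    iou = 1 - (inter + eps) / (union + eps)

    return w_bce * bce + w_dice * dice.mean() + w_iou * iou.mean()
```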
For the DIS task, the model is trained for 300 epochs. We use the AdamW optimizer [43] with an initial learning rate of 1 × 10⁻³, a weight decay of 0.1, and a momentum of 0.9. A learning rate warm-up period is applied for the first 10 epochs, followed by a decay schedule where the learning rate is reduced by 40% every 40 epochs. An early stopping mechanism is in place, terminating the training if the validation loss does not improve for 40 consecutive epochs.
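The following sketch shows one way to realize this schedule with AdamW and a LambdaLR scheduler; `model` is a placeholder for the trainable PMG-SAM components, and the linear warm-up shape is an assumption.

```python
# Sketch of the DIS-task schedule: AdamW, 10-epoch warm-up, then a 40% learning-rate
# reduction every 40 epochs. `model` is a placeholder for the trainable modules.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def lr_lambda(epoch: int) -> float:
    if epoch < 10:                       # warm-up over the first 10 epochs
        return (epoch + 1) / 10
    return 0.6 ** ((epoch - 10) // 40)   # multiply the learning rate by 0.6 every 40 epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() once per epoch after the optimizer updates
```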
For the COS task, we fine-tune the model using the best-performing weights obtained from the DIS task. We conduct two separate fine-tuning processes: one on the COD10K training set and another on the NC4K training set. The NC4K dataset is first partitioned into training (60%) and testing (40%) sets. In both fine-tuning pipelines, the respective training set is further split into an 85% training subset and a 15% validation subset. The initial learning rate is set to a lower value of 1 × 10⁻⁵. We employ a learning rate scheduler that reduces the learning rate by half if the validation loss plateaus for 5 consecutive epochs, with a minimum learning rate of 1 × 10⁻⁷. The optimizer remains AdamW with a weight decay of 0.1. The model is trained for a total of 90 epochs, with a 10-epoch warm-up and an early stopping patience of 20 epochs.
During inference, the 1024 × 1024 input image is passed through PMG-SAM to generate a binary segmentation map. For the DIS task, the inference is fully end-to-end and requires no post-processing. For the COS task, a specific post-processing pipeline is employed to separate potentially overlapping or adjacent objects. First, during training, the multiple instance masks from the COD10K ground truth are merged into a single binary mask, guiding the model to learn the general concept of a “camouflaged object”. At inference time, we apply Connected Component Analysis (CCA) [44] to the model’s binary output to separate the binary mask into individual object proposals. The Hungarian algorithm [45] is then used to perform one-to-one matching between the predicted instances and the ground-truth masks. Finally, the matched predictions are saved in the standard COCO JSON format to enable evaluation with the A P metrics.
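A minimal sketch of this post-processing chain (CCA followed by Hungarian matching) is given below; the function names and the use of negative IoU as the assignment cost are illustrative choices, and exporting to the COCO JSON format is omitted.

```python
# Minimal sketch of the COS post-processing: connected-component analysis splits the
# binary output into instance proposals, then the Hungarian algorithm matches them
# one-to-one against ground-truth masks by IoU.
import cv2
import numpy as np
from scipy.optimize import linear_sum_assignment


def split_instances(binary_mask: np.ndarray) -> list[np.ndarray]:
    """Connected Component Analysis on a {0,1} mask; returns one mask per component."""
    num_labels, labels = cv2.connectedComponents(binary_mask.astype(np.uint8))
    return [(labels == i).astype(np.uint8) for i in range(1, num_labels)]


def match_instances(preds: list[np.ndarray], gts: list[np.ndarray]) -> list[tuple[int, int]]:
    """One-to-one matching between predicted and ground-truth instances by IoU."""
    cost = np.zeros((len(preds), len(gts)))
    for i, p in enumerate(preds):
        for j, g in enumerate(gts):
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum() + 1e-8
            cost[i, j] = -inter / union          # negative IoU as assignment cost
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))
```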

4.4. Results on DIS Task

We begin our analysis with the DIS task, which serves as the primary benchmark to evaluate the core segmentation quality of our model against baselines and specialized methods. The fine-grained and complex nature of objects in the DIS5K dataset provides an ideal testbed to assess the efficacy of our approach.

4.4.1. Efficiency and Complexity Analysis

Table 1 presents a comprehensive comparison of model complexity and inference efficiency. A key advantage of our approach is its superior balance between parameter efficiency and practical speed.
First, regarding parameter efficiency, PMG-SAM requires only 22.9 M trainable parameters, making it significantly more lightweight to train than even the smallest SAM2-T model ( 38.9 M).
Second, we address the concern regarding computational cost. Although our total FLOPs ( 1116.2 G) are relatively high due to the utilization of powerful frozen backbones (Hiera-B+ and U2-Net), our method achieves the highest inference speed of 4.85 FPS. This result reveals a critical insight: the standard “automatic mode” of SAM and SAM2 relies on a dense grid-prompting strategy, which suffers from severe computational redundancy and slows down inference. In contrast, our “locate-and-refine” paradigm generates a global prior in a single pass, avoiding exhaustive grid search. Thus, despite higher theoretical FLOPs per pass, our actual wall-clock inference time is significantly lower.
Finally, regarding memory consumption, PMG-SAM operates with a peak memory of approximately 3420 MB. This is comparable to the base-sized models and significantly lower than the large variants (e.g., SAM-H requires 5731 MB), ensuring deployability on standard GPUs.

4.4.2. Quantitative Results

We extensively evaluated PMG-SAM against baseline models (SAM, SAM2) and specialized methods including HRNet [46], STDC [47], IS-Net [2], and SINetV2 [48] across DIS-VD and DIS-TE1–4 datasets. Quantitative results on DIS5K validation and test sets (Table 2) reveal consistent improvements across all metrics.
Table 2 presents the quantitative comparison of our method against both baseline and task-specific models on the DIS5K dataset. PMG-SAM demonstrates a comprehensive improvement across multiple evaluation metrics on both the validation and test sets. Notably, the HCE value of PMG-SAM is substantially lower than that of the baseline models, which not only signifies a numerical superiority but also reflects the targeted design of our architecture to address key segmentation challenges. Our model not only achieves significant gains over the baselines but also rivals and even surpasses the performance of methods specifically designed for the DIS task.

4.4.3. Analysis Against Baseline Models

Our analysis against baseline models is structured to answer two key questions: (1) How effective is our framework at overcoming the limitations of SAM’s original “segment everything” automatic mode? (2) Is our approach merely an automated prompter, or does it represent a fundamentally more powerful segmentation system?
As shown in Table 2, when compared to the standard Auto mode of all SAM and SAM2 variants, PMG-SAM demonstrates a transformative leap in performance. For instance, on the challenging DIS-TE2 set, our model achieves a maximal F-measure of 0.815 , a stark contrast to the 0.283 of SAM-H and 0.442 of SAM2-L. This massive improvement across all metrics confirms that our Pre-Mask Guided paradigm effectively solves the localization blindness and fragmentation issues inherent in the grid-based approach, enabling precise and coherent segmentation of target objects without human intervention.
Crucially, to explicitly validate whether the localization failure of SAM stems from a lack of domain knowledge, we compared PMG-SAM against a fine-tuned version of SAM-H (the largest variant). As shown in Table 2, although fine-tuning improves the performance of SAM-H in automatic mode (raising F_β^max from 0.283 to 0.378 on DIS-VD), it still lags significantly behind our PMG-SAM (F_β^max of 0.791). This substantial gap confirms that parameter optimization alone cannot resolve the inherent “localization blindness” of the grid-search strategy. In contrast, our Pre-Mask paradigm provides the necessary structural prior, achieving superior performance with significantly fewer trainable parameters.
To further investigate the effectiveness of our paradigm, we compare our fully automatic method against baseline models guided by ground-truth bounding boxes (GT-Bbox). This represents an ideal scenario for prompt-based methods. As shown in the tables, our PMG-SAM consistently outperforms these perfectly prompted baselines on challenging test sets (e.g., DIS-TE2, TE3, TE4). This finding is particularly insightful. It suggests that the dense, structural information provided by our internally generated Pre-Mask offers a richer and more effective guidance signal to the decoder than a sparse bounding box. This validates that our ‘locate-and-refine’ approach creates a more capable segmentation pipeline, going beyond simple prompt automation.
In summary, these comparisons, combined with the model size analysis in Table 1, robustly demonstrate that PMG-SAM’s performance gains stem from a superior and more efficient architectural paradigm, not merely model scaling or automated prompting. It charts a new path for creating truly automatic and highly accurate general-purpose segmentation models.

4.4.4. Analysis Against Specialized Methods

When compared with other classic DIS methods, our model also exhibits strong competitiveness. Commendably, PMG-SAM’s performance matches and in some metrics exceeds that of IS-Net, the established baseline for the DIS task, showcasing its powerful capabilities. Nevertheless, the Maximum F-measure on certain test sets still has room for improvement compared to IS-Net, which provides a clear direction for our future work. Therefore, while IS-Net stands as a strong and classic baseline specifically designed for the DIS task, our PMG-SAM demonstrates highly competitive performance. This is noteworthy because PMG-SAM is not a task-specific model but rather a general-purpose paradigm for enhancing foundation models. Its strong performance on this challenging benchmark showcases the effectiveness and potential for broader applicability of our approach.

4.5. Results on COS Task

To rigorously evaluate the zero-shot transfer capability, transfer learning efficacy, and domain generalization capacity of PMG-SAM, we conducted comprehensive experiments against the baseline SAM alongside state-of-the-art models including SAM2, Mask R-CNN [49], PointSup [50], Tokencut [51], Cutler [52], and TPNet [53]. Benchmarking was performed on the COD10K test set and NC4K dataset.
As shown in Table 3, we first assess the zero-shot transfer capability, a key feature of the SAM series. By directly testing our best DIS-trained model on COD10K and NC4K, we find that PMG-SAM surpasses all SAM2 variants and the unsupervised task-specific method on the COD10K test set. On NC4K, it even matches or exceeds some weakly supervised methods. Although a performance gap with the original SAM-H remains, these results demonstrate that PMG-SAM learns effective and transferable representations from the DIS task, and its generalization ability is stronger than methods that do not rely on any labels.
Next, we performed transfer learning by fine-tuning the best DIS model on the COD10K training set and the NC4K training set. As seen in the “Transfer Learning” section of Table 3, the A P score on COD10K shows a remarkable surge from 12.9 to 30.0 , closely approaching the performance of the fine-tuned SAM-H ( 33.7 ).
Finally, we tested the domain generalization of the fine-tuned model on the COD10K testing set and the NC4K testing set. The results for domain generalization—evaluated by testing the model on a dataset it was not trained on (e.g., testing the COD10K-trained model on the NC4K test set)—are particularly striking. The A P score reaches 40.3 , and the stricter A P 75 metric hits 41.6 , surpassing all compared SAM and SAM2 variants, and even the fully supervised Mask R-CNN baseline.
Furthermore, we included a comparison with the fine-tuned SAM-H. On the COD10K dataset, the fine-tuned SAM-H achieves the highest AP of 39.2, surpassing our transfer learning result. This is expected given SAM-H’s massive parameter count compared to our model. However, on the NC4K dataset, our PMG-SAM in the ‘Transfer Learning’ setting achieves an AP of 39.7, outperforming the fine-tuned SAM-H. This result highlights a key advantage of our architecture: the “locate-and-refine” paradigm learns a robust, class-agnostic notion of objectness that generalizes better to unseen distributions.
Collectively, these three sets of experiments demonstrate that our model, pre-trained on the DIS dataset, acquires a powerful foundational segmentation capability. The fact that its performance can be elevated to a state-of-the-art level on new domains with only minimal fine-tuning underscores its excellent domain generalization ability.
This robust performance across diverse settings provides strong evidence that our “locate-and-refine” paradigm is not a narrow, task-specific trick. Instead, it endows the model with a powerful and generalizable foundational segmentation capability. The fact that this capability can be efficiently transferred to new domains to achieve good results positions PMG-SAM as a new and effective blueprint for building the next generation of versatile, fully automatic segmentation models.

Qualitative Analysis

To provide an intuitive understanding of our model’s capabilities, we present a series of visual comparisons in Figure 5 for the DIS task and Figure 6 for the COS task. As shown in Figure 5, PMG-SAM exhibits a marked superiority over baseline models across various challenging scenarios in the DIS5K dataset. First, in scenes with complex backgrounds or multiple objects (e.g., columns 1 and 2), our method accurately segments the contours of the bag and wind turbine, whereas the baseline fails to identify the objects completely. Second, our model demonstrates a clear advantage in edge handling. For regions with intricate details, such as the lightning in column 3, PMG-SAM captures the fine edges with high fidelity, while the baseline suffers from significant omissions. Third, our method shows stronger background suppression. In column 5, where the background contains multiple distracting elements, PMG-SAM remains focused on the primary target, unlike the baseline, which is disturbed by background noise. Finally, for small or low-contrast objects, such as the faint support structure in column 8, our model successfully identifies and segments it, a task where the baseline fails. In summary, these qualitative results corroborate our quantitative findings, proving that PMG-SAM is a more robust and precise solution for fully automatic segmentation, particularly in complex scenes.
The qualitative results for the COS task in Figure 6 further underscore the advantages of our method. When dealing with challenging camouflaged targets, PMG-SAM consistently outperforms SAM and SAM2. For instance, it successfully preserves the slender legs of the crab (column 6), which are entirely missed by the baselines. Similarly, it produces a single, coherent mask for the sea snake (column 3), while the baseline outputs are fragmented. These examples highlight the superior ability of our model to understand global context and perceive boundaries, enabling it to generate far more accurate and structurally complete masks. This provides compelling visual evidence for the effectiveness and generalization capability of our Pre-Mask-Guided segmentation mechanism.
To further demonstrate the robust localization capability of PMG-SAM across diverse visual domains, we provide additional qualitative results on medical (Kvasir-SEG) and industrial defect (MVTec AD) datasets in Appendix B. These zero-shot inference results confirm that our Pre-Mask mechanism can effectively locate salient targets even in domains completely unseen during training.

4.6. Limitation and Future Work

To systematically analyze the boundaries of PMG-SAM, we evaluated the model under extreme scenarios. As illustrated in Figure 7, we categorize the primary failure cases into four distinct types, revealing different underlying limitations:
First, Instance Distinctions (Column a): In scenarios with overlapping instances (e.g., the shrimp), the model correctly identifies the salient region but merges adjacent objects into a single connected component. This explicitly exposes a fundamental limitation of our current architecture: since the Pre-Mask paradigm focuses on generating a high-quality binary prior, it relies on post-processing (CCA) rather than a genuine end-to-end mechanism to distinguish instances. Consequently, without instance-specific queries or a separate mask head, the model struggles to separate topologically connected instances.
To strictly quantify this limitation, we filtered the COD10K and NC4K test sets to create specific ‘Overlapping Subsets.’ Statistical analysis reveals that overlapping instances are relatively rare, accounting for only 6.61 % ( 134 / 2026 ) of COD10K and 5.68 % ( 234 / 4121 ) of NC4K. However, on these specific subsets, the performance of PMG-SAM drops significantly. As shown in the supplementary analysis, the model achieves an AP of only 0.060 on the COD10K overlapping subset and 0.098 on the NC4K overlapping subset. Compared to the overall performance (AP 30–40%), this drastic degradation confirms that the non-end-to-end reliance on CCA is insufficient for complex instance separation, marking a clear boundary of our current architecture.
Second, Environmental Constraints (Column b): Under low illumination conditions (e.g., the black cat), the reduced contrast gradient between the object and the background impedes the Pre-Mask Generator’s ability to capture precise boundaries, leading to noisy and overflowing edges.
Third, Structural and Scale Challenges (Columns c and d): For objects with complex topologies (e.g., the transmission tower) or tiny scales (e.g., the frog), the downsampling operations in the visual encoder inevitably result in the loss of high-frequency spatial details. This causes fine grid structures to disappear and tiny objects to be missed or blurred in the final mask.
Fourth, Texture Ambiguity (Column e): When dealing with strong camouflage where the foreground texture is statistically nearly identical to the background (e.g., the snake), the model suffers from “feature confusion,” resulting in severe fragmentation of the segmentation map.
These findings point to two clear directions for future work: enhancing the high-resolution feature preservation in the encoder to handle complex structures and tiny objects, and replacing the post hoc CCA with an end-to-end instance-aware mechanism to fundamentally resolve the limitation in separating overlapping targets.

4.7. Ablation Study

In this section, we conduct a series of ablation studies to dissect the core mechanisms of PMG-SAM and rigorously evaluate the individual contributions of its key components.

4.7.1. Experimental Design

All ablation studies are conducted on the combined DIS-TE1-4 test sets. We use the full suite of six metrics ( S α , F β max , F β ω , E ϕ m , M , HCE γ ) for a comprehensive evaluation. All experiments share the same training environment and hyperparameters to ensure a fair comparison.

4.7.2. Analysis of Key Components of PMG-SAM

Table 4 presents comprehensive ablation study results to verify the effectiveness of each key component in our PMG-SAM. The study is organized into two main groups: (a) evaluating the effectiveness of our proposed Dense Residual Fusion Module (DRFM), and (b) examining the impact of introducing high-resolution features from different sources.
(a) Feature Fusion Module: In this group, we validate the necessity of our carefully designed DRFM by comparing it with no feature fusion and a simple residual fusion alternative. The results reveal a critical insight: naive feature fusion is detrimental. The Residual Fusion variant, which performs simple element-wise addition followed by a Conv-BN-ReLU block (see Figure 8), achieves significantly worse performance ( F β max : 0.7329 ) than both our DRFM-equipped model ( 0.7974 ) and even the baseline with no fusion at all ( 0.7771 ).
This strongly suggests that due to the vast architectural and feature distribution differences between Hiera and U2-Net, direct addition introduces conflicting information and noise, corrupting the original feature representations. In contrast, our DRFM, with its sophisticated structure, successfully aligns and enhances these heterogeneous features, leading to consistent improvements across all metrics.
A related insight, elaborated in group (b) below, is that the quality and relevance of high-resolution features matter more than their mere presence: selectively injecting high-fidelity features (from Hiera) proves to be the optimal strategy for boundary refinement.
(b) High-Resolution Feature Strategy: This group identifies the optimal strategy for supplementing the mask decoder with high-resolution features. We examine three approaches: introducing no high-resolution features, only from first two stages of U2-Net, or from both Hiera and U2-Net backbones. The results offer another profound insight: not all high-resolution features are beneficial.
Introducing only U2-Net’s features degrades performance (F_β^max: 0.7409) compared to the baseline without any high-resolution features (0.7584), indicating that shallow U2-Net features may contain excessive background noise or task-irrelevant details. The variant using features from both backbones (0.7769) performs better, suggesting that the high quality of the Hiera features can partially offset the negative impact of the U2-Net features.
This result strongly corroborates our architectural analysis in Section 3.2: U2-Net features are rich in semantic context but noisy for fine details, whereas Hiera features are structurally precise. Therefore, selectively injecting high-fidelity features (from Hiera) is the optimal strategy for boundary refinement, validating the necessity of this specific dual-backbone design.
Full Model Performance: Our final PMG-SAM model combines DRFM with high-resolution features from Hiera only and achieves the best overall performance across all metrics, yielding the highest F_β^max (0.7974), F_β^ω (0.7363), S_α (0.8272), and E_φ^m (0.8883), together with the lowest M (0.0488) and H_γ (659.3610). These results validate both the individual effectiveness of our two core innovations and their synergy when combined: a well-designed fusion module paired with a selective feature-injection strategy is the key to PMG-SAM’s strong performance.

4.7.3. Hyperparameter Sensitivity Analysis

To verify the rationality and robustness of the proposed framework, we conducted comprehensive sensitivity analyses on critical hyperparameters. All experiments were performed on the DIS-TE(1-4) dataset. The results are visualized in Figure 9. For detailed numerical results across all DIS5K datasets, please refer to Table A1 in Appendix A.
(1) Structure of DRFM (N_2 and M): As shown in Figure 9a,b, we analyzed the number of Stabilized Residual Blocks (N_2) and residual fusion iterations (M). The model achieves peak performance at N_2 = 3 and M = 5. Reducing these parameters limits the network’s capacity to capture fine-grained details, while increasing them introduces redundancy without performance gains. Note that N_1 is fixed at 6 by the inherent structure of the frozen backbones.
(2) Configuration of BAIM (N_3): We further investigated the optimal number of BAIM modules (N_3) and the necessity of the bidirectional mechanism. As illustrated in Figure 9c, the bidirectional configuration with N_3 = 2 significantly outperforms the unidirectional counterpart (Uni-dir) as well as the other settings (N_3 = 1 or 3). This confirms that two bidirectional interaction stages are sufficient to align multi-modal features effectively.
(3) Loss Function Weights: Finally, we evaluated the weighting of the loss function L_total = w_1·L_bce + w_2·L_dice + w_3·L_iou. Since the IoU loss is numerically smaller than the BCE and Dice losses, a larger weight is typically required to balance the gradients. We tested different values of w_3 (5, 10, 15) while keeping w_1 = w_2 = 1. Figure 9d shows that the 1:1:10 setting yields the best convergence and segmentation accuracy, validating our default configuration; a minimal sketch of this weighted loss is given after this list.
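As a concrete illustration, the following is a minimal sketch of the combined objective with the default 1:1:10 weighting; the exact Dice and IoU formulations and the batch reduction are reasonable assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits: torch.Tensor, target: torch.Tensor,
                      w_bce: float = 1.0, w_dice: float = 1.0, w_iou: float = 10.0,
                      eps: float = 1e-6) -> torch.Tensor:
    """Weighted combination L_total = w1*BCE + w2*Dice + w3*IoU (default 1:1:10).
    `logits` are raw network outputs; `target` is a float binary mask of the same shape."""
    prob = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, target)

    inter = (prob * target).sum(dim=(-2, -1))
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1)) + eps)

    union = (prob + target - prob * target).sum(dim=(-2, -1))
    iou = 1.0 - (inter + eps) / (union + eps)

    return w_bce * bce + w_dice * dice.mean() + w_iou * iou.mean()
```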

4.7.4. Analysis of Prior Guidance Quality and Error Propagation

To validate the rationale behind our dual-branch encoder, we conducted a quantitative analysis of the intermediate Pre-Mask generated by the U2-Net branch.
Superiority of Structural Prior. We compared the Pre-Mask against semantic priors, specifically CLIPSeg [54], using the official weights and a generic prompt. As shown in Table 5, CLIPSeg yields an extremely low IoU of 0.024 , indicating that semantic-based models struggle to capture the intricate boundary details required for DIS tasks. In contrast, our Pre-Mask achieves an IoU of 0.382 and an S-measure of 0.629 , verifying that the U2-Net branch provides superior, shape-aware structural guidance.
Refinement and Error Correction. Furthermore, our Final-Mask achieves a remarkable performance leap, boosting the IoU to 0.651 and reducing MAE to 0.047 . To investigate the error propagation mechanism, we visualized the correlation between Pre-Mask and Final-Mask quality in Figure 10. The plot reveals a Pearson correlation of 0.455 . Notably, a dense cluster of data points appears in the upper-left region (where Pre-Mask IoU < 0.2 but Final-Mask IoU > 0.6 ). This demonstrates a powerful error correction mechanism: even when the prior guidance is erroneous or fails to locate the target, the subsequent decoder effectively leverages image features to autonomously recover the correct segmentation, preventing error propagation.
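The analysis in Figure 10 can be reproduced with a short script along the following lines; the IoU threshold of 0.5 and the use of SciPy’s Pearson correlation are assumptions about the evaluation protocol rather than details taken from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

def binary_iou(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5) -> float:
    """IoU between a thresholded prediction map and a binary ground truth."""
    p, g = pred >= thresh, gt >= thresh
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return float(inter) / (union + 1e-8)

def error_propagation_analysis(pre_masks, final_masks, gts):
    """Collect (Pre-Mask IoU, Final-Mask IoU) pairs and their Pearson correlation,
    mirroring the scatter analysis plotted in Figure 10."""
    pre_ious = np.array([binary_iou(p, g) for p, g in zip(pre_masks, gts)])
    final_ious = np.array([binary_iou(f, g) for f, g in zip(final_masks, gts)])
    r, _ = pearsonr(pre_ious, final_ious)
    return pre_ious, final_ious, r
```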

5. Conclusions

In this paper, we presented PMG-SAM, a novel framework that represents a paradigm shift in automatic segmentation. Our work is motivated by the fundamental limitations of SAM’s “segment everything” mode: its localization blindness in complex scenes and the computational inefficiency of its grid-based prompting. To overcome these challenges, we proposed a new “locate-and-refine” architecture.
This new paradigm is operationalized by a Pre-Mask Generator, which performs the critical “locate” step. It leverages a synergistic dual-branch encoder and a novel Dense Residual Fusion Module (DRFM) to produce a high-quality, dense Pre-Mask that provides a strong global prior for the target. This internal guidance signal is then passed to an enhanced mask decoder, which executes the “refine” step, augmented by high-resolution features to ensure precise boundary delineation. This two-stage process replaces SAM’s exhaustive search with intelligent, targeted refinement, achieving both high accuracy and efficiency.
The effectiveness of this paradigm is empirically validated through extensive experiments on both salient and camouflaged targets. Our parameter-efficient PMG-SAM not only drastically outperforms the automatic modes of SAM and SAM2 but also surpasses their performance when provided with perfect ground-truth bounding box prompts. This key result highlights that our dense, internal guidance is a more powerful mechanism than sparse, external prompting. Furthermore, on the COS task, PMG-SAM demonstrates strong transfer learning and domain generalization capabilities, achieving competitive performance after minimal fine-tuning. Our ablation studies further confirmed that each component of our design is crucial to the framework’s success.
In summary, PMG-SAM and its underlying “locate-and-refine” paradigm offer an effective and efficient solution to the inherent localization bottleneck of SAM’s automatic mode. Our work provides a valuable blueprint for developing future foundation models that are not only powerful and versatile but also truly and intelligently automatic in prompt-free scenarios.

Author Contributions

Conceptualization, A.W., Y.G., Z.F. and M.S.L.; methodology, J.G. and X.J.; software, J.G.; validation, J.G.; formal analysis, J.G.; investigation, J.G.; resources, X.J.; data curation, J.G.; writing—original draft preparation, J.G.; writing—review and editing, X.J.; visualization, J.G.; supervision, X.J.; project administration, X.J.; funding acquisition, X.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Natural Science Foundation grant number 25ZR1401148 and the National Natural Science Foundation of China grant number 62201338.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experiments conducted in this research are based on the DIS5K, COD10K, and NC4K datasets, which are publicly available benchmark datasets. The DIS5K dataset can be accessed at https://xuebinqin.github.io/dis/index.html. The COD10K dataset can be accessed at https://drive.google.com/file/d/1YGa3v-MiXy-3MMJDkidLXPt0KQwygt-Z/view (for data) and https://drive.google.com/drive/folders/1Yvz63C8c7LOHFRgm06viUM9XupARRPif (for annotations). The NC4K dataset can be accessed at https://drive.google.com/file/d/1eK_oi-N4Rmo6IIxUNbYHBiNWuDDLGr_k/view (for data) and https://drive.google.com/drive/folders/1LyK7tl2QVZBFiNaWI_n0ZVa0QiwF2B8e (for annotations). All datasets were last accessed on 16 November 2025.

Acknowledgments

We express our sincere gratitude to all those involved in this project and to the providers of open source datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Hyperparameter Sensitivity Results

Table A1. Detailed quantitative results of hyperparameter sensitivity analysis on all datasets and metrics. Lines containing ⋆ represent the default settings in this document.
Category | Setting | DIS-VD | DIS-TE1 | DIS-TE2 | DIS-TE3 | DIS-TE4 | DIS-TE(1-4); each dataset block reports, in order, F_β^max, F_β^ω, M, S_α, E_φ^m, and H_γ.
DRFM Blocks ( N 2 ) N 2 = 2 0.75980.69340.05960.80690.86447170.72930.63680.06110.78350.82481290.790.71960.05530.82450.87372620.80330.7410.05080.83070.89844830.76690.71540.06470.80310.882218060.77110.70320.0580.81040.8698670
N 2 = 3 0.79050.72620.05190.82180.88417070.76790.68750.04560.80960.85761230.81470.75090.04590.84020.89282490.82610.76890.04480.84510.91164720.78310.73820.05910.81410.891317930.79740.73630.04880.82720.8883659
N 2 = 4 0.76110.69460.05990.80650.86567310.73450.64380.05830.78770.82881270.79130.72470.05320.82760.87632610.80170.73840.05050.83150.89684900.7660.71650.06550.80380.877918350.77270.70580.05690.81270.87678
DRFM Iterations (M) M = 3 0.76630.69210.06110.8110.86137350.73150.6240.0620.78420.81591350.78460.71330.05730.82670.86782670.80380.73790.05230.8360.89444970.76850.71550.06520.8090.879218450.77120.69770.05920.8140.8643686
M = 4 0.74780.64540.07720.77820.82377400.70260.57740.0810.75010.7761320.77360.67750.07060.80060.84022640.79140.66980.06370.80980.88634980.75560.68060.08050.78340.847518650.75570.65880.07390.7860.8325690
M = 5 0.79050.72620.05190.82180.88417070.76790.68750.04560.80960.85761230.81470.75090.04590.84020.89282490.82610.76890.04480.84510.91164720.78310.73820.05910.81410.891317930.79740.73630.04880.82720.8883659
BAIM ConfigurationUni-dir0.74440.6820.06720.79580.85397260.71950.63040.06390.77740.81781300.77430.70750.06090.81360.86362660.78890.72710.05580.82110.88894920.75420.70530.06990.79330.871918460.75820.69260.06260.80130.8605683
N 3 = 1 0.7630.70170.05680.81150.87387130.73360.64580.05670.78750.83121280.79530.73110.05080.83160.88332570.80610.74620.04870.83420.90564810.76710.71860.06390.80480.881917940.77490.71040.0550.81450.8755665
N 3 = 2 0.79050.72620.05190.82180.88417070.76790.68750.04560.80960.85761230.81470.75090.04590.84020.89282490.82610.76890.04480.84510.91164720.78310.73820.05910.81410.891317930.79740.73630.04880.82720.8883659
N 3 = 3 0.74240.63780.07970.77350.81427390.68570.56110.08620.74160.7591310.76780.66650.07470.79350.83072650.78180.68560.06880.80080.8514970.75090.67350.08320.7790.840718930.74650.64670.07820.77870.8203696
Loss Weights ( w 3 ) w 3 = 5 0.73620.61880.08990.76360.79797470.6870.53670.0960.72910.73761340.76540.64730.0850.78350.81362650.78610.67290.07490.79770.84175050.75620.66210.08750.77880.833718930.74670.62980.08590.77230.8067699
w 3 = 10 0.79050.72620.05190.82180.88417070.76790.68750.04560.80960.85761230.81470.75090.04590.84020.89282490.82610.76890.04480.84510.91164720.78310.73820.05910.81410.891317930.79740.73630.04880.82720.8883659
w 3 = 15 0.76720.69540.06150.80210.85947210.73490.64570.060.78270.82391260.79150.71850.05720.81940.8682580.8090.74150.05160.82790.89394810.77980.72450.06310.80440.882617970.77880.70750.0580.80860.8671666

Appendix B. Qualitative Analysis on Out-of-Distribution Domains

To further validate the generalization capability of the proposed model on a broader set of tasks, we conducted zero-shot inference on two distinct out-of-distribution domains: medical imaging and industrial defect detection. It is important to note that our model was trained exclusively on the DIS dataset and was not fine-tuned on any medical or industrial data.
Figure A1. Qualitative visualization of zero-shot generalization on unseen domains.
The PMG-SAM model, trained solely on the DIS dataset, is directly applied to (a) the Kvasir-SEG dataset [55] for polyp segmentation and (b) the MVTec AD dataset [56] for industrial defect detection. In (a), the model accurately delineates the polyp boundaries despite low contrast and blurred edges. In (b), it successfully identifies diverse defects and separates the target from complex textures. These results demonstrate that the proposed Pre-Mask mechanism learns a robust, class-agnostic representation of “objectness” that generalizes effectively to diverse downstream tasks without specific adaptation.
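As an illustration of how such zero-shot inference can be run, the sketch below applies a trained model to a single out-of-domain image without any adaptation. The `zero_shot_segment` helper, the bare `model(x)` call, and the 0.5 threshold are hypothetical placeholders for however the trained DIS5K checkpoint is actually packaged; only the overall procedure (resize, forward pass, threshold, no fine-tuning) reflects the experiment described above.

```python
import torch
import numpy as np
from PIL import Image

def zero_shot_segment(model: torch.nn.Module, image_path: str, size: int = 1024) -> np.ndarray:
    """Run a trained segmentation model on one unseen-domain image with no fine-tuning.
    The model interface (a single logit map returned by model(x)) is assumed."""
    image = Image.open(image_path).convert("RGB").resize((size, size))
    x = torch.from_numpy(np.asarray(image)).float().permute(2, 0, 1).unsqueeze(0) / 255.0
    with torch.no_grad():
        logits = model(x)                     # assumed to return a logit map for the target
    mask = (torch.sigmoid(logits) > 0.5).squeeze().cpu().numpy().astype(np.uint8)
    return mask
```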

References

  1. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
  2. Qin, X.; Dai, H.; Hu, X.; Fan, D.P.; Shao, L.; Van Gool, L. Highly accurate dichotomous image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  3. Pei, J.; Zhou, Z.; Zhang, T. Evaluation study on SAM 2 for class-agnostic instance-level segmentation. arXiv 2024, arXiv:2409.02567. [Google Scholar]
  4. Lian, S.; Li, H. Evaluation of Segment Anything Model 2: The Role of SAM2 in the Underwater Environment. arXiv 2024, arXiv:2408.02924. [Google Scholar] [CrossRef]
  5. Sun, Y.; Chen, J.; Zhang, S.; Zhang, X.; Chen, Q.; Zhang, G.; Ding, E.; Wang, J.; Li, Z. VRP-SAM: SAM with visual reference prompt. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  6. Xie, Z.; Guan, B.; Jiang, W.; Yi, M.; Ding, Y.; Lu, H.; Zhang, L. PA-SAM: Prompt adapter SAM for high-quality image segmentation. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024. [Google Scholar]
  7. Gowda, S.N.; Clifton, D.A. CC-SAM: SAM with cross-feature attention and context for ultrasound image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  8. Wu, J.; Wang, Z.; Hong, M.; Ji, W.; Fu, H.; Xu, Y.; Xu, M.; Jin, Y. Medical SAM adapter: Adapting segment anything model for medical image segmentation. Med Image Anal. 2025, 102, 103547. [Google Scholar] [CrossRef]
  9. Ryali, C.; Hu, Y.T.; Bolya, D.; Wei, C.; Fan, H.; Huang, P.Y.; Aggarwal, V.; Chowdhury, A.; Poursaeed, O.; Hoffman, J.; et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In Proceedings of the International Conference on Machine Learning (ICML). PMLR, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  10. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015. [Google Scholar]
  12. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  13. Yang, X.; Duan, S.; Wang, N.; Gao, X. Pro2SAM: Mask prompt to SAM with grid points for weakly supervised object localization. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  14. Liao, J.; Jiang, S.; Chen, M.; Sun, C. SAM-YOLO: An improved small object detection model for vehicle detection. Eur. J. Artif. Intell. 2025, 38, 279–295. [Google Scholar] [CrossRef]
  15. Lee, M.; Cho, S.; Lee, J.; Yang, S.; Choi, H.; Kim, I.J.; Lee, S. Effective SAM combination for open-vocabulary semantic segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  16. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Virtual, 18–24 July 2021. [Google Scholar]
  17. Li, S.; Cao, J.; Ye, P.; Ding, Y.; Tu, C.; Chen, T. ClipSAM: CLIP and SAM collaboration for zero-shot anomaly segmentation. Neurocomputing 2025, 618, 129122. [Google Scholar] [CrossRef]
  18. Ito, K. Feature design for bridging SAM and CLIP toward referring image segmentation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025. [Google Scholar]
  19. Li, C.; Khanduri, P.; Qiang, Y.; Sultan, R.I.; Chetty, I.; Zhu, D. AutoProSAM: Automated prompting SAM for 3D multi-organ segmentation. arXiv 2023, arXiv:2308.14936. [Google Scholar]
  20. Shan, Z.; Liu, Y.; Zhou, L.; Yan, C.; Wang, H.; Xie, X. Ros-sam: High-quality interactive segmentation for remote sensing moving object. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  21. Xiao, A.; Xuan, W.; Qi, H.; Xing, Y.; Ren, R.; Zhang, X.; Shao, L.; Lu, S. Cat-sam: Conditional tuning for few-shot adaptation of segment anything model. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  22. Li, Z.; Tucker, R.; Cole, F.; Wang, Q.; Jin, L.; Ye, V.; Kanazawa, A.; Holynski, A.; Snavely, N. MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  23. Mei, H.; Zhang, P.; Shou, M.Z. SAM-I2V: Upgrading SAM to support promptable video segmentation with less than 0.2% training cost. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  24. Radman, A.; Laaksonen, J. TSAM: Temporal SAM augmented with multimodal prompts for referring audio-visual segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  25. Dutta, T.K.; Majhi, S.; Nayak, D.R.; Jha, D. SAM-Mamba: Mamba guided SAM architecture for generalized zero-shot polyp segmentation. arXiv 2024, arXiv:2412.08482. [Google Scholar]
  26. Liu, X.; Fu, K.; Zhao, Q. Promoting segment anything model towards highly accurate dichotomous image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
  27. Yu, C.; Liu, T.; Li, A.; Qu, X.; Wu, C.; Liu, L.; Hu, X. SAM-REF: Introducing image-prompt synergy during interaction for detail enhancement in the Segment Anything Model. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  28. Sheng, D.; Chen, D.; Tan, Z.; Liu, Q.; Chu, Q.; Gong, T.; Liu, B.; Han, J.; Tu, W.; Xu, S.; et al. UNICL-SAM: Uncertainty-driven in-context segmentation with part prototype discovery. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  30. Wang, H.; Vasu, P.K.A.; Faghri, F.; Vemulapalli, R.; Farajtabar, M.; Mehta, S.; Rastegari, M.; Tuzel, O.; Pouransari, H. Sam-clip: Merging vision foundation models towards semantic and spatial understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  31. Pan, X.; Ye, T.; Xia, Z.; Song, S.; Huang, G. Slide-Transformer: Hierarchical vision transformer with local self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  32. Fan, D.P.; Ji, G.P.; Sun, G.; Cheng, M.M.; Shen, J.; Shao, L. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  33. Lv, Y.; Zhang, J.; Dai, Y.; Li, A.; Liu, B.; Barnes, N.; Fan, D.P. Simultaneously localize, segment and rank the camouflaged objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021. [Google Scholar]
  34. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  35. Margolin, R.; Zelnik-Manor, L.; Tal, A. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  36. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  37. Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  38. Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. Int. J. Comput. Vis. (IJCV) 2021, 129, 3101–3119. [Google Scholar]
  39. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  40. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. Sam 2: Segment anything in images and videos. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24 April 2025. [Google Scholar]
  41. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  42. Wang, L.; Lu, H.; Wang, Y.; Feng, M.; Wang, D.; Yin, B.; Ruan, X. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  43. Loshchilov, I.; Hutter, F. Fixing weight decay regularization in Adam. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  44. Ma, N.; Bailey, D.G.; Johnston, C.T. Optimised single pass connected components analysis. In Proceedings of the 2008 International Conference on Field-Programmable Technology (FPT), Taipei, Taiwan, 8–10 December 2008. [Google Scholar]
  45. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  46. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  47. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021. [Google Scholar]
  48. Fan, D.P.; Ji, G.P.; Cheng, M.M.; Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6024–6042. [Google Scholar] [CrossRef] [PubMed]
  49. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  50. Cheng, B.; Parkhi, O.; Kirillov, A. Pointly-supervised instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  51. Wang, Y.; Shen, X.; Yuan, Y.; Du, Y.; Li, M.; Hu, S.X.; Crowley, J.L.; Vaufreydaz, D. Tokencut: Segmenting objects in images and videos with self-supervised transformer and normalized cut. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15790–15801. [Google Scholar] [CrossRef] [PubMed]
  52. Wang, X.; Girdhar, R.; Yu, S.X.; Misra, I. Cut and learn for unsupervised object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  53. He, Z.; Xia, C.; Qiao, S.; Li, J. Text-prompt camouflaged instance segmentation with graduated camouflage learning. In Proceedings of the 32nd ACM International Conference on Multimedia (ACM MM), Melbourne, VIC, Australia, 28 October–1 November 2024. [Google Scholar]
  54. Lüddecke, T.; Ecker, A. Image Segmentation Using Text and Image Prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7086–7096. [Google Scholar]
  55. Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Halvorsen, P.; de Lange, T.; Johansen, D.; Johansen, H.D. Kvasir-seg: A segmented polyp dataset. In Proceedings of the MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, Republic of Korea, 5–8 January 2020; Proceedings, Part II 26, pp. 451–462. [Google Scholar]
  56. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9592–9600. [Google Scholar]
Figure 1. Examples of fragmented segmentation and semantic incoherence in automatic mode of SAM.
Figure 2. Overview of the PMG-SAM pipeline.
Figure 3. Schematic of the DRFM structure. Details of the fusion extraction block are provided in Algorithm 2.
Figure 4. Bidirectional Attention Interaction Module (BAIM).
Figure 5. Qualitative comparisons on DIS5K dataset for dichotomous image segmentation.
Figure 6. Qualitative comparisons on COD10K and NC4K datasets for Camouflaged Object Segmentation.
Figure 7. Visualization of typical failure cases categorized by challenge type. From left to right: (a) Overlapping Instances, where adjacent objects are merged; (b) Low Illumination, where low contrast leads to boundary leakage; (c) Complex Topology, where fine structural grids are lost; (d) Tiny Objects, where small targets are missed; and (e) Complex Texture, where foreground-background similarity causes fragmentation.
Figure 8. Architecture of the “Residual Fusion” variant used in ablation study (see Table 4).
Figure 9. Sensitivity analysis of hyperparameters and structural components on the DIS-TE(1-4) dataset using Weighted F-measure ( F β ω ). (a) Ablation on the number of Stabilized Residual Blocks ( N 2 ). (b) Ablation on the number of residual fusion iterations (M). (c) Performance comparison between the unidirectional attention mechanism (Uni-dir) and the proposed Bidirectional Attention Interaction Module (BAIM) with varying module counts ( N 3 ). (d) Analysis of the weight ratio for the IoU loss ( w 3 ) within the total loss function. The red stars (⋆) and the red bar denote the selected optimal settings adopted in our final model.
Figure 10. Error propagation analysis showing the correlation between Pre-Mask IoU and Final-Mask IoU on the DIS-VD dataset. The red dashed line represents the identity function ( y = x ). Points significantly above the line (especially in the low Pre-Mask IoU region) indicate the model’s robust capability to correct coarse priors and recover fine-grained details.
Table 1. Comparison of model complexity and inference efficiency. Metrics include Trainable Parameters (M), Total Parameters (M), FLOPs (G), Inference Speed (FPS), and Peak Memory (MB). All models were tested on a single NVIDIA Tesla A100 GPU with an input resolution of 1024 × 1024. Note that SAM and SAM2 variants were evaluated in their official automatic segmentation mode, which significantly impacts their speed due to the grid-based prompting strategy. The best results are highlighted in bold.
Method | Trainable Params (M) | Total Params (M) | FLOPs (G) | FPS | Peak Memory (MB)
SAM-B [1] | 93.7 | 93.7 | 486.4 | 0.0072 | 2763.1
SAM-L [1] | 312.0 | 312.0 | 1493.9 | 0.0073 | 4393.5
SAM-H [1] | 641.0 | 641.0 | 2982.2 | 0.0070 | 5731.1
SAM2-T [40] | 38.9 | 38.9 | 103.0 | 1.0179 | 2366.0
SAM2-B+ [40] | 80.8 | 80.8 | 264.5 | 0.4647 | 2771.2
SAM2-L [40] | 224.0 | 224.0 | 810.5 | 0.5347 | 3321.7
PMG-SAM (Ours) | 22.9 | 136.0 | 1116.2 | 4.8500 | 3420.0
Table 2. Quantitative comparisons of different methods. In the ‘Mode’ column, ‘Auto’, ‘GT-Box’, and ‘Full-Sup’ are abbreviations for Automatic, GT-Bbox, and Fully supervised modes. The best results are highlighted in bold.
Methods | Mode | DIS-VD | DIS-TE1 | DIS-TE2 | DIS-TE3 | DIS-TE4 | DIS-TE(1-4); each dataset block reports, in order, F_β^max, F_β^ω, M, S_α, E_φ^m, and H_γ.
SINetV2 [48]Full-Sup0.6650.5840.1100.7270.798-0.6440.5580.0940.7270.791-0.7000.6180.0990.7530.823-0.7300.6410.0960.7660.849-0.6990.6160.1130.7440.824-0.6930.6080.1010.7470.822-
HRNet [46]Full-Sup0.7260.6410.0950.7670.824-0.6680.5790.0880.7420.797-0.7470.6640.0870.7840.840-0.7840.7000.0800.8050.869-0.7720.6870.0920.7920.854-0.7430.6580.0870.7810.840-
STDC [47]Full-Sup0.6960.6130.1030.7400.817-0.6480.5620.0900.7230.798-0.7200.6360.0920.7590.834-0.7450.6620.0900.7710.855-0.7310.6520.1020.7620.841-0.7100.6280.0940.7540.832-
IS-Net [2]Full-Sup0.7910.7170.0740.8130.85611160.7400.6620.0740.7870.8201490.7990.7280.0700.8230.8583400.8300.7580.0640.8360.8836870.8270.7530.0720.8300.87028880.7990.7260.0700.8190.8581016
SAM-B [1]Auto0.2150.1320.2580.3980.39214450.2350.1760.2230.4390.4422090.2100.1260.2680.3880.3694500.2200.1200.2700.3860.3738900.2330.1180.2980.3660.39536240.2240.1350.2650.3950.3951293
SAM-L [1]Auto0.2780.2310.3250.4010.46214020.3650.3110.2680.4810.5312240.2860.2270.3310.3970.4414640.2200.1710.3450.3620.4439050.2540.2130.3450.3790.46735280.2810.2300.3220.4040.4711280
SAM-H [1]Auto0.2830.2410.3440.3950.47514170.4020.3520.2610.5050.5552230.2830.2280.3490.3860.4494710.2350.1900.3510.3680.4539050.2720.2330.3370.3940.49135020.2980.2510.3250.4130.4871275
SAM2-T [40]Auto0.3060.2090.1690.4710.40714170.3520.2530.1420.5060.4501890.3110.2040.1680.4680.3944430.3080.2030.1690.4700.3918770.2680.1790.1920.4450.38236130.3100.2100.1680.4720.4041280
SAM2-B+ [40]Auto0.4280.3110.1560.5150.47713820.4980.3810.1170.5660.5391950.4270.2950.1550.5090.4484440.3910.2650.1590.4940.4378800.3810.2770.1790.4880.46535090.4240.3050.1530.5140.4721257
SAM2-L [40]Auto0.4200.3070.1570.5140.47813850.4940.3820.1170.5700.5501960.4420.3100.1470.5180.4644440.3900.2660.1570.4970.4378770.3850.2790.1770.4910.46435210.4280.3090.1500.5190.4791259
SAM-B [1]GT-Box0.6710.6230.1500.6810.77415540.7470.7030.1050.7540.8292860.6870.6350.1430.6920.7845900.6240.5730.1710.6470.74510800.5580.5200.2240.5880.69936670.6540.6080.1610.6700.7641405
SAM-L [1]GT-Box0.7390.6980.1170.7390.81714600.7830.7460.0910.7870.8522550.7660.7180.1070.7560.8315510.6870.6340.1430.6960.77810210.6130.5760.1910.6390.73435330.7120.6680.1330.7200.7991340
SAM-H [1]GT-Box0.6870.6520.1510.7000.78314680.7550.7210.1060.7660.8332440.7080.6660.1410.7130.7915430.6290.5830.1760.6540.7489970.5760.5450.2180.6110.70735530.4860.4390.2350.5520.6301325
SAM2-T [40]GT-Box0.7390.7020.1070.7480.83016460.7910.7560.0800.7980.8633460.7520.7080.0960.7600.8386980.6980.6530.1260.7150.80712030.6220.5870.1790.6520.74837660.7160.6760.1200.7310.8141503
SAM2-B+ [40]GT-Box0.7650.7310.1040.7660.84015600.8340.8050.0690.8290.8883130.7750.7340.1020.7700.8426420.7140.6710.1350.7190.80611690.6330.6010.1880.6570.74136770.7390.7030.1240.7440.8191450
SAM2-L [40]GT-Box0.7430.7070.1070.7520.81915330.8280.7960.0680.8240.8793050.7480.7020.1030.7500.8146250.6780.6300.1390.6980.76511270.6030.5690.1870.6390.71936790.7140.6740.1240.7280.7941434
SAM-HFine-tuned0.3780.3160.1670.5030.7989070.3250.2670.1300.5080.8251760.3640.3000.1660.4980.7983650.3690.3040.1810.4890.7856630.3830.3240.1920.4880.77822270.3600.2990.1670.4960.797858
PMG-SAM (Ours)Auto0.7910.7260.0520.8220.8847070.7680.6880.0460.8100.8581230.8150.7510.0460.8400.8932490.8260.7690.0450.8450.9124720.7830.7380.0590.8140.89117930.7970.7360.0490.8270.888659
Table 3. Quantitative comparisons of different methods on COD10K and NC4K. Our method is evaluated in three settings: Zero-shot (trained on DIS5K only), Transfer Learning (fine-tuned on target dataset), and Domain Generalization (trained on one COS dataset, tested on another). Bold/Underline indicate the best/second-best results.
Methods | Mode | COD10K AP | COD10K AP50 | COD10K AP75 | NC4K AP | NC4K AP50 | NC4K AP75
Mask R-CNN [49] | Fully supervised | 28.7 | 60.1 | 25.7 | 36.1 | 68.9 | 33.5
PointSup [50] | Point-supervised | 17.9 | 44.1 | 11.9 | 19.1 | 47.6 | 11.6
Tokencut [51] | Unsupervised | 2.6 | 6.5 | 2.0 | 3.5 | 8.3 | 2.5
Cutler [52] | Unsupervised | 11.7 | 29.1 | 7.3 | 15.5 | 37.9 | 10.5
TPNet [53] | Text-prompt | 18.3 | 41.8 | 14.3 | 21.4 | 48.3 | 16.6
SAM-B [1] | Automatic | 7.6 | 12.3 | 8.2 | 5.7 | 8.8 | 6.3
SAM-L [1] | Automatic | 29.5 | 45.3 | 32.3 | 26.2 | 38.8 | 29.5
SAM-H [1] | Automatic | 33.7 | 51.2 | 37.7 | 33.1 | 47.9 | 37.6
SAM2-T [40] | Automatic | 3.1 | 4.0 | 3.4 | 3.4 | 4.2 | 3.8
SAM2-B+ [40] | Automatic | 11.7 | 15.7 | 13.1 | 8.9 | 11.0 | 9.8
SAM2-L [40] | Automatic | 10.6 | 13.2 | 12.1 | 8.8 | 10.3 | 9.6
SAM-H | Fine-tuned | 39.2 | 58.3 | 43.5 | 37.6 | 53.5 | 41.5
PMG-SAM (Ours) | Zero-shot | 12.9 | 31.2 | 8.8 | 21.4 | 45.5 | 18.0
PMG-SAM (Ours) | Transfer Learning | 30.0 | 55.8 | 28.6 | 39.7 | 64.6 | 41.5
PMG-SAM (Ours) | Domain Generalization | 29.9 | 53.6 | 29.3 | 40.3 | 66.5 | 41.6
Table 4. Ablation Study on Key Components of PMG-SAM. The best results are highlighted in bold.
Model Variant (evaluated on DIS-TE(1-4)) | F_β^max | F_β^ω | M | S_α | E_φ^m | H_γ
(a) Feature Fusion Module
No Feature Fusion | 0.7771 | 0.7029 | 0.0577 | 0.8135 | 0.8671 | 664.3980
Residual Fusion | 0.7329 | 0.6271 | 0.0859 | 0.7648 | 0.8002 | 704.6665
(b) High-Resolution Features
No High-Res Features | 0.7584 | 0.6917 | 0.0595 | 0.8025 | 0.8661 | 772.5660
Only U2-Net | 0.7409 | 0.6679 | 0.0667 | 0.7909 | 0.8476 | 764.1910
Both Backbones | 0.7769 | 0.7014 | 0.0601 | 0.8071 | 0.8621 | 697.3505
PMG-SAM (Ours) | 0.7974 | 0.7363 | 0.0488 | 0.8272 | 0.8883 | 659.3610
Table 5. Quantitative evaluation of the Pre-Mask quality compared to CLIP-based priors and the final output on the DIS-VD dataset. The generic prompt “a photo of a salient object” was used for CLIPSeg. The results demonstrate that our structural prior significantly outperforms semantic priors, and our full model achieves substantial refinement over the prior. The best results are highlighted in bold.
Method | IoU ↑ | F_β^max | MAE ↓ | S_α
CLIPSeg [54] | 0.024 | 0.215 | 0.179 | 0.420
Pre-Mask (Ours) | 0.382 | 0.515 | 0.117 | 0.629
Final-Mask (Ours) | 0.651 | 0.795 | 0.047 | 0.813
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
