Mathematics
  • Article
  • Open Access

7 September 2025

Prototype-Based Two-Stage Few-Shot Instance Segmentation with Flexible Novel Class Adaptation

1 College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
2 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
3 School of Computer, Hunan University of Technology, Zhuzhou 412007, China
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Structural Networks for Image Application

Abstract

Few-shot instance segmentation (FSIS) addresses instance segmentation when labeled data for novel classes is scarce. Existing methods, however, face notable constraints in flexibly expanding to novel classes and in managing memory overhead: the workflow for integrating novel classes is inflexible, and retaining class exemplars during both training and inference incurs considerable memory consumption. To overcome these challenges, this study introduces a framework built on a two-stage "base training-novel class fine-tuning" paradigm that learns discriminative instance-level embedding representations. Concretely, instance embeddings are aggregated into class prototypes, and storing embedding vectors rather than images inherently mitigates memory overload. A Region of Interest (RoI)-level cosine similarity matching mechanism enables flexible addition of novel classes without supplementary training and without access to historical data. Experiments show that the approach significantly outperforms state-of-the-art techniques on mainstream benchmarks. More importantly, its memory-efficient design enables, for the first time, joint evaluation of FSIS performance across all classes of the COCO dataset. Visualized examples (colored masks and class labels for objects across diverse scenes) further confirm the effectiveness of the method in complex real-world settings.

1. Introduction

In the field of computer vision, instance segmentation emerges as a pivotal task [1,2], demanding the simultaneous identification of individual objects and precise delineation of their pixel-level boundaries within images [3,4]. Its practical significance spans diverse domains: in autonomous driving, it enables accurate segmentation of pedestrians, vehicles, and traffic signs to ensure safe navigation; in medical imaging, it facilitates precise detection and segmentation of tumors, organs, or cellular structures, supporting diagnostic decisions and treatment planning; and in industrial automation, it powers object identification and sorting in manufacturing processes [5]. However, traditional instance segmentation models heavily rely on large-scale, fully annotated datasets for training. Acquiring such datasets is often expensive, time-consuming, or even infeasible due to privacy restrictions, which has catalyzed research into few-shot learning for instance segmentation. In this paradigm, models are required to segment novel object classes using only a minimal number of labeled samples [6]. Furthermore, in dynamic environments where new object classes continuously emerge, the capability to incrementally learn these classes while retaining previously acquired knowledge—known as incremental few-shot instance segmentation—becomes critical for real-time system adaptation and long-term usability.
Over the past few years, significant progress has been made in few-shot instance segmentation, with many methods building on foundational frameworks such as Mask R-CNN [7,8,9]. Some approaches leverage meta-learning to adapt to new classes by learning generalizable features during meta-training on base classes. Nevertheless, existing methods face notable challenges: performance tends to degrade when novel classes exhibit significant intra-class variance [10,11,12], and in incremental learning scenarios, catastrophic forgetting occurs, where models lose proficiency in recognizing previously learned classes as they adapt to new ones. While knowledge distillation has been explored as a potential solution for few-shot learning, effectively handling the incremental addition of new classes while preserving prior knowledge remains an unresolved issue.
Driven by these limitations, this study aims to develop a framework that enables efficient adaptation to novel classes with few samples while preserving prior knowledge. Addressing the unresolved challenges of intra-class [2] variance in novel classes and catastrophic forgetting during incremental learning is central to this endeavor. Enabling accurate incremental few-shot [13] instance segmentation can significantly enhance the adaptability and robustness of computer vision systems in real-world applications, for example, allowing industrial inspection systems to learn new defect classes without the need to retrain on entire historical datasets. Our approach adopts a two-stage training process [8]. In the first stage, base training is conducted on a large-scale base-class dataset using the Mask2Former architecture. The model is optimized to learn generalizable features through a loss function that combines classification and mask losses [14]. In the second stage, novel fine-tuning, the CNN backbone and pixel decoder are frozen to preserve the knowledge acquired from base classes, while the projection layer, cosine similarity classifier, and novel object queries are fine-tuned. To prevent forgetting, local POD knowledge distillation [7] is employed to align the multi-scale features of the novel-adapted model with those of the base model [15]. For the incremental addition of new classes, class representatives for novel classes are computed from the embeddings of few samples and appended to the weight matrix of the cosine similarity classifier, enabling immediate recognition of new classes without retraining the entire model.
The contributions of this work are as follows:
1. We propose a novel framework that integrates a cosine similarity classifier, novel object queries, and local POD distillation, effectively addressing the challenges of intra-class variance in novel classes and catastrophic forgetting in incremental learning. The cosine similarity classifier enhances discriminability by focusing on angular relationships in the feature space, while novel object queries improve adaptation to new classes, and local POD distillation preserves base-class knowledge.
2. We design an efficient two-stage training process (base training and novel fine-tuning) that enables the model to learn generalizable base-class features and adapt to novel classes with few samples, all while preserving prior knowledge. This balance between plasticity and stability is critical for incremental learning.
3. We introduce a computationally efficient method for incremental class addition, where new classes can be integrated by simply appending their class representatives to the cosine similarity classifier. This eliminates the need for retraining, making the framework suitable for real-world scenarios where new classes emerge continuously.
The remainder of this paper is organized as follows: Section 2 reviews related work in few-shot instance segmentation and incremental learning. Section 3 details the proposed framework, including the overall architecture, two-stage training process, and key technical components. Section 4 describes the experimental setup, including datasets, evaluation metrics, and implementation details. Finally, Section 5 concludes the study and outlines future research directions.

3. Methodology

This section details our proposed framework for incremental few-shot instance segmentation, which builds on the Mask2Former architecture with critical modifications to enable flexible learning of novel classes while preserving base class knowledge. We first outline the overall architecture, followed by a detailed description of the two-stage training process and key technical components.

3.1. Framework Overview

Our framework retains the core structure of Mask2Former, which consists of a CNN backbone, pixel decoder, and transformer decoder. To adapt it to incremental few-shot scenarios, three key modifications are introduced. Specifically, we replace the original fully connected classification head with a cosine similarity classifier [9]. This replacement aims to enhance the discriminability between the embeddings of base and novel classes, enabling more precise differentiation in the context of few-shot learning. Additionally, a dedicated set of novel object queries is added within the transformer decoder. These queries are distinct from the base object queries that are trained on base classes, and their role is to capture features that are specific to novel classes, facilitating the model’s ability to learn about these new categories. Moreover, local POD knowledge distillation is applied [12]. During the fine-tuning process, this technique performs feature alignment between the base and novel models, which is crucial for mitigating catastrophic forgetting—a common issue where learning new knowledge causes the model to lose previously acquired information about base classes. Overall, the framework operates through two sequential stages. First, base training is carried out on large-scale base-class data. This stage is designed to enable the model to learn generalizable features that can serve as a foundation [58]. Second, novel fine-tuning is performed on limited novel class samples. The goal of this stage is to adapt the model to new classes while preserving the prior knowledge gained from the base training stage (Figure 1).
Figure 1. This flowchart illustrates a two-stage process: “base training” which involves a series of modules like a backbone network, predictor, and pixel decoder for initial training, and “novel fine-tuning” that utilizes a transformer decoder along with existing components to adapt to new data.
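To make the cosine similarity classification head described in this overview concrete, the following minimal PyTorch sketch shows one way such a head could be implemented; the class name, the learnable temperature, and all parameter names are our own illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineSimilarityClassifier(nn.Module):
    """Replaces a fully connected head with cosine similarity scoring.

    Logits are scaled cosine similarities between L2-normalized query
    embeddings and L2-normalized per-class weight vectors (class
    representatives), so classification depends only on angular
    relationships in the embedding space.
    """

    def __init__(self, embed_dim: int, num_classes: int, scale: float = 20.0):
        super().__init__()
        # One weight row per class; rows can later be appended for novel classes.
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.scale = scale  # temperature applied to the cosine similarities

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (num_queries, embed_dim)
        q = F.normalize(embeddings, dim=-1)
        w = F.normalize(self.weight, dim=-1)
        return self.scale * q @ w.t()  # (num_queries, num_classes)
```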

3.2. Base Training Stage

In the base training stage, the overarching objective is to establish a robust feature extraction foundation via end-to-end training of the model on base-class data [13]. This process entails the joint optimization of core architectural components, namely the CNN backbone, pixel decoder, transformer decoder [59], projection layer, and cosine similarity classifier. A pivotal element in directing the learning dynamics is the total loss function [60], which integrates classification and mask losses to enforce comprehensive supervision during model parameter updates.
The total loss for the base training phase, denoted $\mathcal{L}_{\mathrm{base}}$, is constructed through the additive combination of two fundamental loss components: the classification loss ($\mathcal{L}_{\mathrm{cls}}$) and the mask loss ($\mathcal{L}_{\mathrm{mask}}$). Formally, this relationship is expressed by the following equation:
$$\mathcal{L}_{\mathrm{base}} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{mask}}$$
Within this formulation, $\mathcal{L}_{\mathrm{cls}}$ leverages a sigmoid focal loss. This choice is strategically motivated by the loss's inherent capability to mitigate class imbalance within the base-class distribution. By down-weighting the loss contributions from overrepresented categories and up-weighting those from underrepresented classes, the sigmoid focal loss ensures the model allocates discriminative attention across all class entities, thereby facilitating balanced feature learning.
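For reference, a minimal sketch of a standard sigmoid focal loss is given below; the alpha and gamma values are common defaults from the focal loss literature, not values reported in this paper.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
    """Standard sigmoid focal loss; alpha/gamma defaults are common choices.

    logits, targets: tensors of shape (N, num_classes), targets in {0, 1}.
    """
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)  # probability of the true label
    loss = ce * (1 - p_t) ** gamma                      # down-weight easy examples
    if alpha >= 0:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    return loss.mean()
```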
For the mask loss ($\mathcal{L}_{\mathrm{mask}}$), it is formulated as a weighted linear combination of the binary cross-entropy (BCE) loss ($\mathcal{L}_{\mathrm{ce}}$) and the dice loss ($\mathcal{L}_{\mathrm{dice}}$). Mathematically, this is defined as
$$\mathcal{L}_{\mathrm{mask}} = \lambda_{\mathrm{ce}} \mathcal{L}_{\mathrm{ce}} + \lambda_{\mathrm{dice}} \mathcal{L}_{\mathrm{dice}}$$
Here, $\lambda_{\mathrm{ce}}$ and $\lambda_{\mathrm{dice}}$ represent hyperparameters, empirically set to 1.0 and 0.5, respectively. The BCE loss component captures pixel-level classification discrepancies, while the dice loss emphasizes the overlap integrity between predicted and ground-truth masks. Their synergistic combination, modulated by the specified weight coefficients, strikes an optimal balance to enforce precise pixel-level segmentation accuracy.
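A compact sketch of this weighted BCE-plus-dice mask loss, using the stated weights $\lambda_{\mathrm{ce}} = 1.0$ and $\lambda_{\mathrm{dice}} = 0.5$, might look as follows; the smoothing constant eps is an assumption not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, target_masks, eps: float = 1.0):
    """Soft dice loss over flattened per-instance masks."""
    pred = torch.sigmoid(pred_logits).flatten(1)
    target = target_masks.flatten(1)
    numerator = 2 * (pred * target).sum(-1)
    denominator = pred.sum(-1) + target.sum(-1)
    return (1 - (numerator + eps) / (denominator + eps)).mean()

def mask_loss(pred_logits, target_masks,
              lambda_ce: float = 1.0, lambda_dice: float = 0.5):
    """L_mask = lambda_ce * BCE + lambda_dice * dice, with the weights from the text."""
    l_ce = F.binary_cross_entropy_with_logits(pred_logits, target_masks)
    l_dice = dice_loss(pred_logits, target_masks)
    return lambda_ce * l_ce + lambda_dice * l_dice
```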
Through this integrated training regimen [61], the base training stage yields a critical output: a base model instantiation. This model comprises pre-trained base object queries, a projection layer engineered to output multi-scale feature representations (critical for capturing contextual information across varying spatial resolutions), and a cosine similarity classifier [62]. The classifier is initialized with base-class representatives, which are derived from the learned feature distributions of the base classes. This foundational configuration establishes the necessary prerequisites for subsequent stages, wherein the model adapts to novel classes while preserving the knowledge accretion pertaining to base categories.

3.3. Novel Fine-Tuning Stage

The second stage adapts the base model to novel classes using limited samples (K shots per class) while freezing the CNN backbone and pixel decoder to preserve base class knowledge. Three components are fine-tuned: the projection layer, cosine similarity classifier, and newly added novel object queries.
We introduce a set of novel object queries in the transformer decoder, which are distinct from base object queries, to focus on learning novel class characteristics. These queries are initialized randomly and optimized during fine-tuning. To prevent forgetting base-class features, we align multi-scale features from the novel model (student) with the base model (teacher) using local POD (pooled output distance) distillation. First, the projection layer of both models outputs multi-scale feature sets $F^b = \{f_1^b, f_2^b, f_3^b, f_4^b\}$ (base) and $F^n = \{f_1^n, f_2^n, f_3^n, f_4^n\}$ (novel), where each $f_i^b, f_i^n \in \mathbb{R}^{H/2^i \times W/2^i \times 256}$. For each feature map $f_i$, we compute local POD embeddings across scales $\{\tfrac{1}{2^m}\}_{m=0}^{S}$. At scale $\tfrac{1}{2^m}$, $f_i$ is divided into $2^m \times 2^m$ subregions, and each subregion's embedding is computed as the concatenation of its width- and height-pooled features. Then, we align student and teacher local embeddings using an L2 loss:
$$\mathcal{L}_{\mathrm{kd}} = \frac{1}{4} \sum_{i=1}^{4} \left\lVert \Psi(f_i^n) - \Psi(f_i^b) \right\rVert_2$$
where $\Psi(f_i)$ denotes the concatenation of the local POD embeddings of $f_i$ across all scales.
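The sketch below illustrates one plausible reading of the local POD embedding $\Psi$ and the resulting distillation loss; the number of scales S and the exact pooling operator are assumptions, since they are not fully specified above.

```python
import torch

def local_pod_embedding(feat: torch.Tensor, num_scales: int = 3) -> torch.Tensor:
    """Concatenate width- and height-pooled statistics of every subregion
    at scales 1, 1/2, ..., 1/2^S (our reading of local POD; S is assumed).

    feat: (B, C, H, W) feature map from the projection layer.
    """
    embeddings = []
    b, c, h, w = feat.shape
    for m in range(num_scales + 1):
        n = 2 ** m  # an n x n grid of subregions at this scale
        for i in range(n):
            for j in range(n):
                sub = feat[..., i * h // n:(i + 1) * h // n,
                              j * w // n:(j + 1) * w // n]
                pooled_w = sub.mean(dim=2).flatten(1)  # pool over height -> width profile
                pooled_h = sub.mean(dim=3).flatten(1)  # pool over width  -> height profile
                embeddings.append(torch.cat([pooled_w, pooled_h], dim=1))
    return torch.cat(embeddings, dim=1)

def local_pod_loss(student_feats, teacher_feats, num_scales: int = 3):
    """L_kd: mean L2 distance between student and teacher local POD embeddings
    over the four multi-scale feature maps."""
    losses = [
        torch.norm(local_pod_embedding(fs, num_scales) -
                   local_pod_embedding(ft, num_scales), p=2, dim=1).mean()
        for fs, ft in zip(student_feats, teacher_feats)
    ]
    return sum(losses) / len(losses)
```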
The loss function combines classification, mask, and distillation losses to balance novel class adaptation and base class retention:
$$\mathcal{L}_{\mathrm{novel}} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{mask}} + \lambda_{\mathrm{kd}} \mathcal{L}_{\mathrm{kd}}$$
where $\mathcal{L}_{\mathrm{cls}}$ and $\mathcal{L}_{\mathrm{mask}}$ are defined as in the base stage, and the distillation term is weighted by $\lambda_{\mathrm{kd}} = 0.1$.

3.4. Incremental Class Addition

To achieve incremental addition of new novel classes, we leverage the fine-tuned model to compute class representatives. For each new class with K shots, we first extract instance embeddings from the projection layer. These embeddings capture the feature representations of the novel class instances. Then, we compute the class representative as the normalized mean of these embeddings, defined by the following formula:
$$w_{\mathrm{new}} = \frac{1}{K} \sum_{i=1}^{K} \frac{z_i}{\lVert z_i \rVert}$$
where $z_i$ denotes the embedding of the $i$-th shot. This normalization step ensures that the class representative is scaled appropriately, emphasizing the angular relationships in the feature space. Finally, we append the computed $w_{\mathrm{new}}$ to the weight matrix of the cosine similarity classifier. By doing so, the model gains the ability to recognize the new class immediately, without the need for retraining. This process efficiently extends the model's capacity to handle novel classes incrementally, maintaining computational efficiency while enhancing the model's adaptability to evolving classification tasks.
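A minimal sketch of this incremental addition step, assuming the cosine classifier exposes its weight matrix as in the earlier sketch, could be:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def add_novel_class(classifier, support_embeddings: torch.Tensor):
    """Append w_new = (1/K) * sum_i z_i / ||z_i|| to the classifier weights.

    classifier: assumed to be the CosineSimilarityClassifier sketched earlier.
    support_embeddings: (K, embed_dim) projection-layer embeddings of the K
    support instances of the new class.
    """
    w_new = F.normalize(support_embeddings, dim=-1).mean(dim=0, keepdim=True)
    classifier.weight = torch.nn.Parameter(
        torch.cat([classifier.weight.data, w_new], dim=0)
    )
```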
This two-stage approach capitalizes on the universal segmentation prowess inherent in Mask2Former, bolsters discriminative power through cosine similarity-based classification, and safeguards prior knowledge via local POD distillation. In so doing, it facilitates efficient and accurate incremental few-shot instance segmentation, surmounting the challenges of knowledge retention and novel class adaptation in dynamic learning scenarios.

4. Experiment

This section systematically validates the proposed incremental few-shot instance segmentation framework through a series of controlled experiments. We design evaluations to assess performance across diverse datasets, few-shot scenarios, and against state-of-the-art baselines, while also dissecting the impact of core components and parameters.

4.1. Experiment Setup

To rigorously validate the performance of our proposed incremental few-shot instance segmentation framework, we design a series of experiments following standardized protocols. The setup is structured into four key components—datasets, baselines, implementation details, and experimental environment—as explained below.

4.2. Datasets

We utilize two benchmark datasets to evaluate the framework’s generalization capability in both intra-dataset and cross-dataset scenarios.
COCO Dataset [46]: As the primary evaluation benchmark, COCO contains 80 object classes with dense annotations. Following the widely adopted split in FSIS research [10,45], we divide the classes into 60 base classes (e.g., “airplane”, “bus”, “dining table”) and 20 novel classes (e.g., “bear”, “zebra”, “umbrella”), where the novel classes overlap with the VOC dataset to ensure consistency with existing evaluations. The dataset includes 80,000 training images (1.5 million instances), 35,000 validation images [63], and 5000 test images, covering diverse scenes such as urban streets, rural landscapes, and indoor spaces. This diversity allows us to assess the model’s performance across varying object scales (from small objects like “cup” to large ones like “truck”), occlusion levels, and lighting conditions.
COCO2VOC Cross-Dataset Scenario: To test cross-dataset generalization—an essential property for real-world applications—we train the model on COCO base classes and evaluate it on the union of VOC2007 and VOC2012 validation sets [12]. VOC contains 20 classes, with 10 overlapping with COCO’s novel classes (e.g., “cat”, “cow”) and 10 being VOC-exclusive (e.g., “potted plant”, “sofa”). This setup introduces distribution shifts in image resolution (VOC images are generally smaller), annotation density (VOC has sparser instance labels), and object appearance, challenging the model’s adaptability to unseen data characteristics.

4.3. Baselines

We compare our framework against four representative methods to highlight its advantages in incremental few-shot learning.
MRCN+ft-full [14]: A non-incremental baseline based on Mask R-CNN. For each new task, the entire model is retrained on the combined data of base and novel classes. This approach represents the performance upper bound when computational resources are unrestricted. However, it is highly susceptible to catastrophic forgetting, where the model forgets previously learned base-class knowledge while adapting to new classes. Additionally, the high retraining costs, both in terms of time and computational resources, render it impractical for dynamic real-world scenarios where new classes emerge continuously.
Meta R-CNN [10]: A meta-learning approach that employs episodic training. During meta-training on base classes, it learns transferable knowledge. Each episode mimics a few-shot task, allowing the model to quickly adapt to novel classes through a class-specific classifier. While it demonstrates effectiveness in few-shot generalization [64], it has a significant drawback in incremental learning. For every new class, the meta-learner needs to be re-initialized, and it lacks the ability to retain knowledge of previously added novel classes, making it unsuitable for scenarios requiring continuous learning.
iMTFA [45]: The first dedicated incremental FSIS method. It uses an instance feature extractor (IFE) to generate class-agnostic embeddings. Novel class representatives, which are the mean of support embeddings, are stored in a cosine similarity classifier, enabling incremental class addition without the need for full retraining. Nevertheless, its class-agnostic mask predictor limits the segmentation precision for novel classes. Moreover, aside from embedding alignment [65], it lacks explicit and robust mechanisms to effectively mitigate catastrophic forgetting, especially when dealing with a large number of incremental class additions.
ONCE [66]: ONCE is the first detector designed to address the incremental few-shot detection (IFSD) problem. It adapts the efficient CentreNet detector [67] to the few-shot learning scenario.

4.4. Implementation Details

Evaluation Scenarios: We evaluate under 1-shot, 5-shot, and 10-shot settings, where each novel class is provided with K = 1 , K = 5 , or K = 10 labeled samples, respectively. These settings simulate varying data scarcity levels; 1-shot tests extreme few-shot learning, 5-shot represents typical low-resource scenarios, and 10-shot serves as a near-saturated case. For each scenario, support samples are randomly selected from the validation split, with no overlap with base-class training data to avoid data leakage.
Training Protocols: The model is trained on COCO base classes for 368,750 iterations with a batch size of 8 (4 images per GPU), using the Adam optimizer with an initial learning rate of $7 \times 10^{-5}$. Novel Fine-tuning: For each shot setting, the CNN backbone and pixel decoder are frozen, while the projection layer, cosine similarity classifier, and novel object queries are fine-tuned for 3000 (1-shot), 5000 (5-shot), or 7000 (10-shot) iterations. The learning rate is reduced to $5 \times 10^{-5}$ to prevent overfitting to scarce novel data.
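As an illustration of the freezing scheme described above, the sketch below builds the fine-tuning optimizer; the module and parameter names (backbone, pixel_decoder, projection, cosine_classifier, novel_queries) are hypothetical placeholders that would depend on the actual implementation.

```python
import torch

def build_finetune_optimizer(model, lr: float = 5e-5):
    """Freeze the CNN backbone and pixel decoder, then optimize only the
    projection layer, cosine classifier, and novel object queries."""
    for module in (model.backbone, model.pixel_decoder):
        for p in module.parameters():
            p.requires_grad_(False)

    trainable = [
        p for name, p in model.named_parameters()
        if p.requires_grad and any(
            key in name for key in ("projection", "cosine_classifier", "novel_queries")
        )
    ]
    return torch.optim.Adam(trainable, lr=lr)
```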
Evaluation Metrics: Following COCO standards [46], we report AP (average precision) averaged over IoU thresholds [0.5:0.05:0.95], measuring overall segmentation accuracy, and AP50 (precision at an IoU threshold of 0.5), which emphasizes coarse localization and is critical for real-time applications.
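In practice, these COCO-style metrics can be computed with pycocotools; the sketch below shows a typical evaluation call for segmentation results (file paths are placeholders).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def evaluate_segmentation(gt_json: str, results_json: str):
    """Return COCO-style AP and AP50 for instance segmentation predictions."""
    coco_gt = COCO(gt_json)                 # ground-truth annotations
    coco_dt = coco_gt.loadRes(results_json) # model predictions in COCO results format
    evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()                   # prints AP @[.5:.95], AP50, etc.
    return evaluator.stats[0], evaluator.stats[1]  # (AP, AP50)
```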
Metrics are reported for three categories: All Classes (base + novel), Base Classes (to quantify forgetting), and Novel Classes (to assess few-shot generalization). Statistical Validation: Each experiment is repeated 10 times with different random seeds to account for variability in support sample selection. Results are presented as mean ± standard deviation, with statistical significance (p < 0.05) verified via paired t-tests against baselines.
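A small sketch of the statistical validation step, using SciPy's paired t-test over the 10 seeded runs, is shown below; the array contents are placeholders.

```python
import numpy as np
from scipy import stats

def summarize_runs(ours: np.ndarray, baseline: np.ndarray, alpha: float = 0.05):
    """Report mean ± std over seeded runs and a paired t-test against a
    baseline evaluated on the same support-sample draws."""
    t_stat, p_value = stats.ttest_rel(ours, baseline)
    return {
        "ours": f"{ours.mean():.2f} ± {ours.std(ddof=1):.2f}",
        "baseline": f"{baseline.mean():.2f} ± {baseline.std(ddof=1):.2f}",
        "p_value": p_value,
        "significant": p_value < alpha,
    }
```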
Qualitative Analysis: We visualize segmentation results for key cases, including successful novel class segmentation, failure modes (e.g., small objects, heavy occlusion), and cross-dataset transfers. These visualizations complement quantitative metrics by revealing the model’s behavior in complex scenarios.

4.5. Experimental Environment

The framework is implemented in PyTorch 1.10.1 with torchvision 0.11.2. Training and inference are conducted on a workstation equipped with the following.
- GPUs: Two NVIDIA RTX 3090 (24 GB GDDR6X each), enabling parallel processing and large-batch training.
- CPU: Intel Xeon W-1290 (10 cores, 3.2 GHz) with 32 GB DDR4 RAM, supporting data preprocessing and model deployment.
- Software: CUDA 11.3, cuDNN 8.2, and Python 3.8. Dependencies include OpenCV 4.5.3 (image augmentation), matplotlib 3.5.1 (visualization), and scikit-learn 1.0.2 (statistical analysis).
To optimize resource usage, we employ mixed-precision training (FP16) to reduce memory consumption by 40% without performance loss, and gradient accumulation (4 steps) to simulate larger batch sizes when GPU memory is constrained.
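A minimal sketch of such an FP16 training step with 4-step gradient accumulation, using the PyTorch AMP API, is given below; model, loader, and criterion are placeholders for the actual training objects.

```python
import torch

def train_epoch_amp(model, loader, optimizer, criterion, accum_steps: int = 4):
    """Illustrative mixed-precision loop with gradient accumulation,
    matching the resource-saving setup described above."""
    scaler = torch.cuda.amp.GradScaler()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):
        with torch.cuda.amp.autocast():
            outputs = model(images)
            loss = criterion(outputs, targets) / accum_steps  # scale for accumulation
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)   # unscale gradients and update weights
            scaler.update()
            optimizer.zero_grad()
```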

4.6. Comparative Experiments

This section presents a comparative analysis of our framework against state-of-the-art methods on COCO and the cross-dataset COCO2VOC scenario. On COCO, our method outperforms the baselines across shot settings. At 5-shot, it achieves 6.75 AP, exceeding iMTFA and Meta R-CNN (2.80 AP); at 10-shot, its 14.47 AP50 is 45.3% higher than iMTFA's, highlighting stronger coarse localization. In the COCO2VOC cross-dataset evaluation, our method shows robust generalization, reaching 3.61 AP/5.75 AP50 (1-shot), 9.17 AP/14.39 AP50 (5-shot), and 12.65 AP/19.75 AP50 (10-shot), outperforming iMTFA in every setting. This superiority stems from combining cosine similarity classification (enhanced discriminability) with local POD distillation (mitigated forgetting), unlike iMTFA's class-agnostic heads, which limit precision. The framework thus balances adaptability and knowledge retention for incremental few-shot instance segmentation. Detailed results are shown in Table 1, Table 2, Table 3 and Table 4.
Table 1. Performance metrics of detection and segmentation tasks with varied settings.
Table 2. COCO dataset: comprehensive performance comparison across methods.
Table 3. COCO2VOC: This table compares the AP and AP50 of methods using 1-shot, 5-shot, and 10-shot settings in the COCO2VOC cross-dataset scenario.
Table 4. FSIS results on the COCO dataset.

4.7. Ablation Studies

To validate the core components, we conduct ablations under the COCO 5-shot setting. Removing the novel object queries reduces AP by 1.82 (6.75 → 4.93), confirming their role in capturing novel class-specific features. Disabling local POD distillation causes a 2.15 AP drop (6.75 → 4.60) and a 3.21 AP50 drop (10.23 → 7.02), highlighting its efficacy in preserving base-class knowledge. Replacing the cosine similarity classifier with a linear classifier decreases AP by 1.54 (6.75 → 5.21), as angular similarity better captures inter-class differences in few-shot scenarios. These results confirm the necessity of each component and underscore their synergistic contribution to incremental few-shot segmentation performance, as shown in Table 5.
Table 5. Results of the ablation experiment. ✓ means to select.

4.8. Parameter Sensitivity

To analyze the influence of critical hyperparameters on our framework's performance, we conduct systematic experiments under the COCO 5-shot setting, focusing on three key parameters. First, for the learning rate during novel fine-tuning, we test $3 \times 10^{-5}$, $5 \times 10^{-5}$, and $7 \times 10^{-5}$ and find that $5 \times 10^{-5}$ yields the best average precision (AP) of 6.75: it balances adaptation to novel classes with retention of base-class knowledge, whereas lower rates risk insufficient adaptation and higher rates can disrupt learned representations. Second, regarding the number of novel object queries, evaluating configurations of 20, 40, and 60 queries shows that performance peaks at 40 queries (6.75 AP); fewer queries (20) cause underfitting (5.92 AP) due to inadequate capture of novel class nuances, while more queries (60) lead to overfitting (6.21 AP) as the model learns sample idiosyncrasies. Third, for the distillation weight $\lambda_{\mathrm{kd}}$, assessing 0.05, 0.1, and 0.3 reveals that 0.1 achieves the best trade-off (6.75 AP); a higher weight (0.3) suppresses novel class learning (5.89 AP) by over-emphasizing base-knowledge preservation, and a lower weight (0.05) increases catastrophic forgetting (5.47 AP) due to insufficient distillation. Collectively, these experiments validate the robustness of our selected hyperparameters and offer guidance for tuning the framework in other experimental and application scenarios (Figure 2).
Figure 2. These are visualized instances showing various objects (like a bus, bear, deer, etc.) in different scenes, each with a colored mask and a label indicating the object category.

5. Conclusions

This study introduces an incremental few-shot instance segmentation framework addressing key challenges in adaptability and knowledge retention. Leveraging a Mask2Former backbone with cosine similarity classification and local POD distillation, our method achieves state-of-the-art performance across COCO and cross-dataset COCO2VOC evaluations. On COCO, we outperform baselines like iMTFA and Meta R-CNN across 1-shot, 5-shot, and 10-shot settings—we achieve a 28.8% AP gain with the 5-shot setting and 45.3% AP50 improvement with the 10-shot setting. Cross-dataset results confirm robust generalization to unseen domains, with metrics surpassing incremental paradigms. Ablation studies validate core components; novel object queries capture class-specific features, local POD distillation mitigates forgetting, and cosine similarity classification enhances discriminability. Parameter sensitivity analyses confirm the robustness of hyperparameter choices, guiding future tuning. This framework advances incremental few-shot segmentation by balancing plasticity (adapting to novel classes) and stability (retaining base knowledge), providing a scalable solution for real-world applications requiring lifelong learning and cross-domain generalization.

Author Contributions

Conceptualization, Q.Z., Y.Z. and P.X.; Methodology, Q.Z.; Software, P.X. and M.Y.; Validation, Q.Z. and M.Y.; Formal analysis, Y.Z.; Investigation, Y.Z. and P.X.; Resources, C.Z.; Data curation, Y.Z. and M.Y.; Writing—original draft, Q.Z.; Writing—review & editing, Q.Z., P.X., L.Z. and C.Z.; Supervision, L.Z.; Project administration, C.Z.; Funding acquisition, L.Z. and C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China under Grant 62472161, Grant 62202163, Grant 62072166, and Grant 62372150; the Natural Science Foundation of Hunan Province under Grant 2022JJ30231 and Grant 2023JJ30169; the Hunan Provincial Teaching Reform Research Project for Ordinary Institutions of Higher Learning under Grant HNJG-20230396; and the Hunan Provincial Degree and Postgraduate Teaching Reform Research Project under Grant 2023JGYB140.

Data Availability Statement

The data presented in this study are openly available in [iMTFA] at [https://github.com/danganea/iMTFA].

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to the Data Availability Statement. This change does not affect the scientific content of the article.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  2. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  3. Tian, C.; Zheng, M.; Li, B.; Zhang, Y.; Zhang, S.; Zhang, D. Perceptive self-supervised learning network for noisy image watermark removal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7069–7079. [Google Scholar] [CrossRef]
  4. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. arXiv 2016, arXiv:1606.04797. [Google Scholar] [CrossRef]
  5. Tian, C.; Song, M.; Fan, X.; Zheng, X.; Zhang, B.; Zhang, D. A Tree-guided CNN for image super-resolution. IEEE Trans. Consum. Electron. 2025, 71, 3631–3640. [Google Scholar] [CrossRef]
  6. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  7. Zhang, C.; Wang, Y.; Zhu, L.; Song, J.; Yin, H. Multi-graph heterogeneous interaction fusion for social recommendation. ACM Trans. Inf. Syst. (TOIS) 2021, 40, 1–26. [Google Scholar] [CrossRef]
  8. Zhu, L.; Wu, R.; Zhu, X.; Zhang, C.; Wu, L.; Zhang, S.; Li, X. Bi-direction label-guided semantic enhancement for cross-modal hashing. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 3983–3999. [Google Scholar] [CrossRef]
  9. Zhu, L.; Zhang, C.; Song, J.; Zhang, S.; Tian, C.; Zhu, X. Deep multigraph hierarchical enhanced semantic representation for cross-modal retrieval. IEEE MultiMedia 2022, 29, 17–26. [Google Scholar] [CrossRef]
  10. Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. arXiv 2020, arXiv:2003.06957. [Google Scholar] [CrossRef]
  11. Yan, X.; Chen, Z.; Xu, A.; Wang, X.; Liang, X.; Lin, L. Meta R-CNN: Towards general solver for instance-level low-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9577–9586. [Google Scholar]
  12. Michaelis, C.; Ustyuzhaninov, I.; Bethge, M.; Ecker, A.S. One-shot instance segmentation. arXiv 2018, arXiv:1811.11507. [Google Scholar]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
  14. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  15. Zhu, L.; Wu, R.; Liu, D.; Zhang, C.; Wu, L.; Zhang, Y.; Zhang, S. Textual semantics enhancement adversarial hashing for cross-modal retrieval. Knowl.-Based Syst. 2025, 317, 113303. [Google Scholar] [CrossRef]
  16. Tian, C.; Zheng, M.; Lin, C.-W.; Li, Z.; Zhang, D. Heterogeneous window transformer for image denoising. IEEE Trans. Syst. Man. Cybern. Syst. 2024, 54, 6621–6632. [Google Scholar] [CrossRef]
  17. Chen, L.-C.; Wang, H.; Qiao, S. Scaling wide residual networks for panoptic segmentation. arXiv 2020, arXiv:2011.11675. [Google Scholar]
  18. Du, X.; Zoph, B.; Hung, W.-C.; Lin, T.-Y. Simple training strategies and model scaling for object detection. arXiv 2021, arXiv:2107.00057. [Google Scholar] [CrossRef]
  19. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  20. Tian, C.; Liu, K.; Zhang, B.; Huang, Z.; Lin, C.-W.; Zhang, D. A Dynamic Transformer Network for Vehicle Detection. IEEE Trans. Consum. Electron. 2025, 71, 2387–2394. [Google Scholar] [CrossRef]
  21. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  22. Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-shot object detection via feature reweighting. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8420–8429. [Google Scholar]
  23. Tian, C.; Zheng, M.; Jiao, T.; Zuo, W.; Zhang, Y.; Lin, C.-W. A self-supervised CNN for image watermark removal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7566–7576. [Google Scholar] [CrossRef]
  24. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  25. Fan, Z.; Yu, J.-G.; Liang, Z.; Ou, J.; Gao, C.; Xia, G.-S.; Li, Y. FGN: Fully guided network for few-shot instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9172–9181. [Google Scholar]
  26. Kirillov, A.; Wu, Y.; He, K.; Girshick, R. PointRend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  27. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT++: Better Real-Time Instance Segmentation. Ph.D. Thesis, University of California, Berkeley, CA, USA, 2019. [Google Scholar]
  28. Arbeláez, P.; Pont-Tuset, J.; Barron, J.T.; Marques, F.; Malik, J. Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  29. Bao, H.; Dong, L.; Wei, F. BEiT: BERT pretraining of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
  30. Zhu, L.; Cai, L.; Song, J.; Zhu, X.; Zhang, C.; Zhang, S. MSSPQ: Multiple semantic structure-preserving quantization for cross-modal retrieval. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 631–638. [Google Scholar]
  31. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  32. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the ECCV 2018 15th European Conference, Munich, Germany, 8–14 September 2018. [Google Scholar]
  33. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  34. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  35. Cheng, B.; Collins, M.D.; Zhu, Y.; Liu, T.; Huang, T.S.; Adam, H.; Chen, L.-C. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  36. Cheng, B.; Girshick, R.; Dollár, P.; Berg, A.C.; Kirillov, A. Boundary iou: Improving object-centric image segmentation evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  37. Cheng, B.; Parkhi, O.; Kirillov, A. Pointly-supervised instance segmentation. arXiv 2021, arXiv:2104.06404. [Google Scholar]
  38. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  39. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  40. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  41. Zhu, L.; Yu, W.; Zhu, X.; Zhang, C.; Li, Y.; Zhang, S. MvHAAN: Multi-view hierarchical attention adversarial network for person re-identification. World Wide Web 2024, 27, 59. [Google Scholar] [CrossRef]
  42. Everingham, M.; Eslami, S.M.A.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  43. Fang, Y.; Yang, S.; Wang, X.; Li, Y.; Fang, C.; Shan, Y.; Feng, B.; Liu, W. Instances as queries. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  44. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  45. Ganea, D.A.; Boom, B.; Poppe, R. Incremental few-shot instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1185–1194. [Google Scholar]
  46. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  47. Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast convergence of detr with spatially modulated co-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  49. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  50. Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  51. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.-Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  52. Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollár, P. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  53. Kirillov, A.; Levinkov, E.; Andres, B.; Savchynskyy, B.; Rother, C. InstanceCut: From edges to instances with multicut. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  54. Gidaris, S.; Komodakis, N. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4367–4375. [Google Scholar]
  55. Li, Y.; Zhao, H.; Qi, X.; Chen, Y.; Qi, L.; Wang, L.; Li, Z.; Sun, J.; Jia, J. Fully convolutional networks for panoptic segmentation with point-based supervision. arXiv 2021, arXiv:2108.07682. [Google Scholar] [CrossRef] [PubMed]
  56. Tian, C.; Zhang, X.; Liang, X.; Li, B.; Sun, Y.; Zhang, S. Knowledge distillation with fast CNN for license plate detection. IEEE Trans. Intell. Vehicles 2023. [Google Scholar] [CrossRef]
  57. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  58. Zhu, L.; Zhang, C.; Song, J.; Liu, L.; Zhang, S.; Li, Y. Multi-graph based hierarchical semantic fusion for cross-modal representation. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; IEEE Computer Society: Washington, DC, USA, 2021; pp. 1–6. [Google Scholar]
  59. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2019, arXiv:1711.05101. [Google Scholar] [CrossRef]
  60. Huang, S.; Lu, Z.; Cheng, R.; He, C. Fapn: Feature-aligned pyramid network for dense image prediction. arXiv 2021, arXiv:2108.07058. [Google Scholar]
  61. Li, Z.; Wang, W.; Xie, E.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Lu, T.; Luo, P. Panoptic segformer. arXiv 2021, arXiv:2109.03814. [Google Scholar]
  62. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
  63. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  64. Neuhold, G.; Ollmann, T.; Rota Bulo, S.; Kontschieder, P. The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  65. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  66. Perez-Rua, J.M.; Zhu, X.; Hospedales, T.M.; Xiang, T. Incremental few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13846–13855. [Google Scholar]
  67. Cheng, B.; Schwing, A.G.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Online, 6–14 December 2021. [Google Scholar]