Abstract
Instance segmentation is pivotal in remote sensing image (RSI) analysis, aiding in many downstream tasks. However, annotating images with pixel-wise annotations is time-consuming and laborious. Despite some progress in automatic annotation, the performance of existing methods still needs improvement due to the high precision requirements for pixel-level annotation and the complexity of RSIs. With the support of large-scale data, some foundational models have made significant progress in semantic understanding and generalization capabilities. In this paper, we delve deep into the potential of the foundational models in automatic annotation and propose a training-free automatic annotation method called DiffuPrompter, achieving pixel-level automatic annotation of RSIs. Extensive experimental results indicate that the proposed method can provide reliable pseudo-labels, significantly reducing the annotation costs of the segmentation task. Additionally, the cross-domain validation experiments confirm the powerful effectiveness of large-scale pseudo-data in improving model generalization performance.
1. Introduction
With the development of deep learning, the interpretation of RSIs has also made significant progress [1]. Instance segmentation is a crucial part of remote sensing interpretation. However, it is also a data-intensive task that requires a large amount of pixel-level annotation, which is labor-intensive and expensive and limits the development of the task. To reduce annotation costs, some scholars have introduced the automatic image annotation (AIA) task, which has become an integral component of computer vision [2]. Although the scarcity of pixel-level annotated datasets is arguably more severe in the remote sensing domain, previous AIA methods have mainly focused on natural images. The perspective effect in natural images often results in distinct foreground and background elements. Therefore, some methods use the attention maps of classification networks to locate the foreground region and consider the image category as the foreground class to generate pseudo-masks for objects [3,4,5]. In contrast, the top-down perspective and complex image content of RSIs diminish the effectiveness of these AIA methods.
Recently, the potential of big data has been further explored, leading to the emergence of many foundational models trained on large-scale datasets, such as the Stable Diffusion Model (SDM) [6] and the Segment Anything Model (SAM) [7], which have contributed significantly to downstream tasks [8]. Although foundational models have become plentiful, none are tailored to RSIs, which limits their application in RSI tasks. The main goal of this paper is not to create a foundational model tailored to RSIs but to investigate the applicability of existing foundational models to pixel-level AIA in RSIs.
With the development of text-guided generation models such as SDM, a pixel-level AIA technical route based on synthetic images has emerged for natural images [9,10]. These methods utilize generative models to synthesize data and generate masks for objects in the synthetic images via vision–text alignment knowledge during generation [11]. Generation-based methods rest on two assumptions: (1) the synthetic images are realistic enough to avoid domain shift issues between training and testing sets, and (2) the vision–text alignment knowledge can guide the generation of sufficiently accurate object masks. While these assumptions hold for natural images, they do not necessarily apply to RSIs. Generative models trained on natural images cannot synthesize RSIs realistically enough, and the complex and diverse scenes in remote sensing images seriously interfere with vision–text alignment, making it difficult to segment objects accurately. Some scholars incorporate the SAM model into instance segmentation methods for more accurate results [12]. After training on over one billion masks, the SAM model has demonstrated an outstanding ability to segment anything. However, SAM is a class-agnostic segmentation method and requires prior positional cues, such as points and bounding boxes, to segment target objects. These limitations prevent SAM from being directly applied to the instance segmentation task.
As shown in Figure 1, we present a new insight: mask annotations for authentic images can be obtained automatically using pre-trained foundational models, reducing the annotation cost of the instance segmentation task. Based on this insight, we propose a training-free prompt generation method called DiffuPrompter, which transforms SAM from a category-agnostic into a category-aware segmentation method to label RSIs automatically. Specifically, DiffuPrompter leverages the text-concept grounding capability of a pre-trained diffusion model to provide coarse localization results for target objects. These localization results are then used as visual segmentation prompts for SAM, enabling precise segmentation of the target objects. We tested various automatic annotation methods on remote sensing datasets, and the experimental results validate the superiority of DiffuPrompter, which achieves 27.3% and 15.4% AP on the NWPU and iSAID datasets, respectively. Furthermore, the cross-domain study demonstrates the positive impact of pseudo-labels on model generalization performance, providing a valuable reference for future research.
Figure 1.
Classification images with pixel-level annotations labeled by DiffuPrompter.
The main contributions of this paper can be summarized as follows:
- We present the novel insight that it is possible to automatically obtain the mask annotation of authentic images using off-the-shelf foundational models.
- We propose a training-free prompt generation method, DiffuPrompter, that transforms SAM from a class-agnostic segmenter to a class-aware segmenter to label RSIs automatically.
- We tested several automatic annotation methods on remote sensing datasets, and the extensive results validated the superiority of the proposed DiffuPrompter while proving the positive impact of pseudo-labels on enhancing model generalization performance. The results may provide a reference for future work.
2. Theory and Methods
DiffuPrompter utilizes the pre-trained SDM to explore generating semantically explicit prompts for SAM, enabling it to generate masks for specified remote sensing objects automatically. Section 2.1 introduces the working principles of SDM and SAM. In Section 2.2.1, we introduce how to realize the grounding of textual concepts into input images, and we discuss noise suppression in Section 2.2.2. Section 2.2.3 introduces how to prompt SAM to segment specific objects.
2.1. Preliminary Knowledge
2.1.1. Overview of SDM
SDM [6] is derived from a perceptual compression model consisting of an encoder $\mathcal{E}$, a U-Net, and a decoder $\mathcal{D}$. Specifically, when we input an image $x \in \mathbb{R}^{H \times W \times 3}$, the encoder encodes $x$ into a latent representation $z = \mathcal{E}(x)$, and the decoder reconstructs the image from $z$, i.e., $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$. The encoder downsamples the image according to the sampling factor $f = H/h = W/w$, where $f$ has different values in different layers of the encoder and U-Net, namely $f = 2^{m}$, where $m \in \mathbb{N}$.
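The encode/decode round trip described above can be reproduced with publicly released SDM components. The following is a minimal sketch, assuming the diffusers library and a public Stable Diffusion v1.5 VAE; the model identifier and image path are placeholders rather than the exact configuration used in this paper.

```python
# Minimal sketch of the SDM encode/decode round trip with a public Stable Diffusion VAE.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
vae.eval()

def to_tensor(img):
    # HWC uint8 -> 1x3xHxW float in [-1, 1], the range expected by the SD autoencoder
    x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
    return x.permute(2, 0, 1).unsqueeze(0)

with torch.no_grad():
    x = to_tensor(Image.open("rsi_patch.png").convert("RGB").resize((512, 512)))
    z = vae.encode(x).latent_dist.sample()   # latent z: 1 x 4 x 64 x 64, i.e., f = 8
    x_rec = vae.decode(z).sample             # reconstructed image: 1 x 3 x 512 x 512
```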
The training process of SDM consists of a forward diffusion stage and a backward denoising stage. In the forward diffusion stage, SDM adds noise to $z$ for $T$ steps until $z$ is completely replaced by Gaussian noise $z_T \sim \mathcal{N}(0, \mathbf{I})$. In the backward denoising stage, the U-Net $\epsilon_\theta$ learns to gradually remove the noise, conditioned on the text, to recover $z$. Finally, $z$ is decoded into an image by $\mathcal{D}$. SDM achieves semantic mapping between visual and textual inputs through the cross-attention mechanism in the U-Net [13]. To pre-process the conditioning text prompt $y$, SDM introduces a domain-specific encoder $\tau_\theta$ that projects $y$ to an intermediate representation $\tau_\theta(y)$, which is then mapped to the intermediate layers of the U-Net via cross-attention layers as follows:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \quad Q = W_Q^{(i)} \varphi_i(z_t), \; K = W_K^{(i)} \tau_\theta(y), \; V = W_V^{(i)} \tau_\theta(y)$  (1)

Here, $\varphi_i(z_t)$ refers to the intermediate embedding of the U-Net, and $W_Q^{(i)}$, $W_K^{(i)}$, and $W_V^{(i)}$ are learnable projection matrices [13,14].
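To make Equation (1) concrete, the following toy sketch computes single-head cross-attention between flattened U-Net features and text-token embeddings; the shapes and random projection matrices are illustrative assumptions, not the actual SDM configuration.

```python
# Toy sketch of Equation (1): single-head cross-attention between image features and text tokens.
import torch

def cross_attention(phi_z, tau_y, W_Q, W_K, W_V):
    # phi_z: (hw, d_img) intermediate U-Net features; tau_y: (n_tokens, d_txt) text embedding
    Q = phi_z @ W_Q                                            # (hw, d)
    K = tau_y @ W_K                                            # (n_tokens, d)
    V = tau_y @ W_V                                            # (n_tokens, d)
    A = torch.softmax(Q @ K.T / Q.shape[-1] ** 0.5, dim=-1)    # (hw, n_tokens) attention map
    return A @ V, A                                            # attended features and the map

hw, d_img, d_txt, d, n_tok = 256, 320, 768, 64, 8              # e.g., a 16 x 16 latent grid, 8 tokens
out, attn = cross_attention(torch.randn(hw, d_img), torch.randn(n_tok, d_txt),
                            torch.randn(d_img, d), torch.randn(d_txt, d), torch.randn(d_txt, d))
print(out.shape, attn.shape)                                   # torch.Size([256, 64]) torch.Size([256, 8])
```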
2.1.2. Overview of SAM
SAM is an interactive segmentation approach predicated on provided prompts such as points and bounding boxes. The mask-generation process can be expressed as follows:

$M = \Phi_{\mathrm{dec}}\!\left(F, \Phi_{\mathrm{prompt}}\!\left(P, T_m\right), T_o\right)$  (2)

where $F$ represents the latent representation of the input image; $P$ denotes the interactive prompts, including points and bounding boxes; $T_m$ signifies the mask prompt tokens, which come from the previous prediction iteration; the prompt encoder $\Phi_{\mathrm{prompt}}$ encodes the prompts into sparse and dense embeddings; $T_o$ are the pre-inserted learnable tokens representing four different mask filters and their corresponding IoU predictions; $\Phi_{\mathrm{dec}}$ is the mask decoder; and $M$ denotes the predicted masks. The primary objective of DiffuPrompter is to provide $P$ (points and bounding boxes) for SAM to segment the target object.
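As a point of reference, the snippet below shows how box and point prompts are passed to SAM through the official segment-anything interface; the checkpoint path, image file, and prompt coordinates are placeholder assumptions, and DiffuPrompter instead derives its prompts from the attention maps described in Section 2.2.

```python
# Reference sketch of prompting SAM with a box and a foreground point.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("rsi_patch.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

box = np.array([120, 80, 340, 260])      # x1, y1, x2, y2 (placeholder values)
point = np.array([[230, 170]])           # a single foreground point (placeholder)
masks, ious, _ = predictor.predict(
    point_coords=point,
    point_labels=np.array([1]),          # 1 marks a foreground point
    box=box,
    multimask_output=True,               # SAM returns several candidate masks
)
best = masks[np.argmax(ious)]            # here: keep SAM's highest-scoring candidate
```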
2.2. Proposed Method
2.2.1. Textual Concept Grounding
Upon further exploration of the SDM training process, we discovered that, during training, the U-Net restores the latent representation of the input image step by step based on the text description. Equation (1) illustrates that textual concepts are injected into the latent representation through the cross-attention of the cross-modal spatial transformer module in the U-Net.
Based on this observation, we constructed a text semantic grounding pipeline centered on the cross-attention map. This pipeline uses classification datasets as its data source and grounds image categories into the images. As illustrated in Figure 2, given an image from a classification dataset, the textual description is obtained by inserting the corresponding class name into the ‘Photo of a [category]’ template. Then, the cross-attention layer grounds each text token in the template into the visual space through the cross-attention map as follows:

$A^{s} = \mathrm{reshape}\!\left(\mathrm{softmax}\!\left(\frac{Q_{s} K_{s}^{\top}}{\sqrt{d}}\right)\right) \in \mathbb{R}^{h_{s} \times w_{s} \times N_{y}}$  (3)

where $A^{s}$ denotes the re-shaped attention map of the $s$-th cross-attention layer, $Q_{s}$ and $K_{s}$ are the corresponding query and key features computed as in Equation (1), and $N_{y}$ is the number of text tokens. For the $j$-th text token, e.g., ‘airplane’ in Figure 3a, the corresponding slice $A^{s}_{j}$ shows the visual location of the $j$-th token in the image.
Figure 2.
Pipeline for our method with the prompt ‘Photo of a stadium’. DiffuPrompter mainly includes three steps: (1) Organize the object name into the template and use it as a text prompt. (2) Object mask proposal generation. (3) The denoising strategy is applied to refine the proposals.
Figure 3.
Cross-attention maps of SDM. Text prompt: ‘Photo of an airplane’.
We propose integrating the grounding results at different resolutions to enhance accuracy and robustness. The cross-attention pyramid $\{A^{s}\}_{s=1}^{S}$ is obtained by applying Equation (3) to different layers of the U-Net, where $A^{s}$ denotes the attention map from the $s$-th layer. We extract attention maps at four resolutions in this paper, as shown in Figure 3b. Then, we aggregate the multi-scale grounding results in the cross-attention pyramid by calculating the average map as follows:

$\hat{A} = \frac{1}{S} \sum_{s=1}^{S} \mathrm{Resize}\!\left(A^{s}\right)$  (4)

where $S$ represents the total number of layers (i.e., four for the U-Net) and $\mathrm{Resize}(\cdot)$ upsamples each map to a common resolution. Finally, the attention maps are transformed into probability maps through a normalization layer to facilitate the subsequent binarization.
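A minimal sketch of the aggregation in Equation (4) is given below: per-layer attention maps for one token are upsampled to a common resolution, averaged, and normalized. The toy map resolutions and the choice of min-max normalization are assumptions for illustration.

```python
# Minimal sketch of Equation (4): aggregate a cross-attention pyramid into one probability map.
import torch
import torch.nn.functional as F

def aggregate_pyramid(maps, out_size=(64, 64)):
    # maps: list of (h_s, w_s) attention maps A^s_j from S cross-attention layers
    resized = [F.interpolate(m[None, None], size=out_size, mode="bilinear",
                             align_corners=False)[0, 0] for m in maps]
    avg = torch.stack(resized).mean(dim=0)                     # average over the S layers
    return (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)  # probability-like map in [0, 1]

pyramid = [torch.rand(8, 8), torch.rand(16, 16), torch.rand(32, 32), torch.rand(64, 64)]
prob_map = aggregate_pyramid(pyramid)                          # (64, 64) aggregated map
```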
2.2.2. Denoise by Noise
Figure 3a indicates that the highlighted regions in the cross-attention map correlate with the regions where the corresponding input token is present. However, the maps are significantly noisy. Figure 4 compares the cross-attention maps of natural and remote-sensing images. Figure 4a shows a precise attention map for the ‘horse’ region. In contrast, the attention map in Figure 4b for an RSI only roughly indicates the ‘playground’ region, showing a very weak correlation with the regions a human would pick out as meaningful. Therefore, it is necessary to denoise the cross-attention maps of SDM before using them to localize remote sensing objects. However, the noise points tend to be localized and highly random in their distribution, so removing them precisely poses a significant challenge.
Figure 4.
The cross-attention maps of natural and remote-sensing images.
Inspired by [15], we propose a Loop-Sampling Averaging Denoising (LSAD) strategy to suppress noise interference. In LSAD, we model the observed cross-attention map as a combination of a noise-free attention map and additive noise, as follows:

$g(x, y) = f(x, y) + n(x, y)$  (5)

where $g(x, y)$ represents the value of the captured cross-attention map at coordinates $(x, y)$, $f(x, y)$ signifies the value of the noise-free map, and $n(x, y)$ denotes the value of the noise. The denoising process is the procedure of approximating $f(x, y)$ from the known $g(x, y)$. For multiple cross-attention maps of the same input image and token, $f(x, y)$ remains constant, while $n(x, y)$ is random. Thus, the mean of $N$ maps of the same image can be represented as follows:

$\bar{g}(x, y) = \frac{1}{N} \sum_{l=1}^{N} g_{l}(x, y) = f(x, y) + \frac{1}{N} \sum_{l=1}^{N} n_{l}(x, y)$  (6)

where $\bar{g}(x, y)$ denotes the mean value of the maps at coordinates $(x, y)$, and $N$ is the total number of maps considered. As the noise is random and uncorrelated, the expectation of its mean approximates zero, i.e., $E\!\left[\frac{1}{N} \sum_{l=1}^{N} n_{l}(x, y)\right] \approx 0$. Therefore, the expected mean and variance of the averaged cross-attention maps can be expressed as follows:

$E\!\left[\bar{g}(x, y)\right] = f(x, y)$  (7)

$\sigma_{\bar{g}(x, y)}^{2} = \frac{1}{N}\, \sigma_{n(x, y)}^{2}, \qquad \sigma_{\bar{g}(x, y)} = \frac{1}{\sqrt{N}}\, \sigma_{n(x, y)}$  (8)

where $\sigma$ represents the standard deviation. Equation (7) shows that the expected mean of multiple cross-attention maps is the noise-free map. However, some disturbance remains, and its intensity is determined by the standard deviation; the essence of denoising is reducing this standard deviation. Equation (8) indicates that, by increasing the value of $N$, i.e., increasing the number of averaged maps, the noise can be suppressed effectively.
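The variance-reduction argument in Equations (7) and (8) can be verified numerically. The short NumPy check below averages N noisy copies of a fixed map and compares the residual noise level with the predicted 1/sqrt(N) decay; the map and noise level are arbitrary.

```python
# Numerical check of Equations (7) and (8): averaging N noisy copies shrinks the noise by ~1/sqrt(N).
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((64, 64))          # stand-in for the noise-free attention map f(x, y)
sigma = 0.2                       # assumed noise standard deviation

for N in (1, 10, 50, 100):
    g = f + rng.normal(0.0, sigma, size=(N, 64, 64))   # N noisy observations g_l(x, y)
    residual = g.mean(axis=0) - f                      # averaged map minus the true map
    print(f"N={N:4d}  measured std={residual.std():.4f}  predicted={sigma / np.sqrt(N):.4f}")
```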
The main challenge of applying Equation (8) to denoise the cross-attention maps lies in obtaining multiple noisy observations, i.e., introducing random noise into the maps. As mentioned in Section 2.1.1, the SDM is trained to construct a clear image from Gaussian noise by removing the noise step by step. The noise contributes to generation diversity [16], implying that the attention maps are variable. Therefore, injecting Gaussian noise into the latent embedding of the input image results in multiple noisy cross-attention maps. Fortunately, the forward diffusion process in SDM is exactly such a noise-addition process. Therefore, given an image, we perform the forward diffusion process iteratively, preserving the cross-attention pyramid maps in each loop. Subsequently, we apply LSAD to them as follows:

$\tilde{A}(x, y) = \frac{1}{N} \sum_{l=1}^{N} \hat{A}^{(l)}(x, y)$  (9)

where $\hat{A}^{(l)}$ denotes the aggregated attention map obtained in the $l$-th sampling iteration and $N$ is the number of sampling iterations. Figure 5 illustrates the workflow of the LSAD algorithm.
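The sketch below outlines the LSAD loop of Equation (9) under the assumption of a diffusers-style U-Net and noise scheduler; capture_cross_attention is a hypothetical helper standing in for whatever hook extracts the aggregated cross-attention map, and is not an existing API.

```python
# Sketch of the LSAD loop in Equation (9), assuming a diffusers-style U-Net and scheduler.
import torch

@torch.no_grad()
def lsad(unet, scheduler, z0, text_emb, token_idx, t=40, n_samples=50):
    timestep = torch.tensor([t])
    maps = []
    for _ in range(n_samples):
        noise = torch.randn_like(z0)                       # fresh Gaussian noise each loop
        zt = scheduler.add_noise(z0, noise, timestep)      # forward diffusion to step t
        _ = unet(zt, timestep, encoder_hidden_states=text_emb)
        maps.append(capture_cross_attention(unet, token_idx))  # hypothetical hook helper
    return torch.stack(maps).mean(dim=0)                   # averaged (denoised) map
```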
Figure 5.
Flow chart of LSAD.
Figure 6 visualizes the performance of LSAD on natural and remote sensing images. It can be observed that there is no noticeable noise in the cross-attention map of the natural image. LSAD does not show a significant enhancement effect on the cross-attention map. In contrast, there is much noise in the cross-attention map of the RSI, making it challenging to locate the target object accurately. After LSAD processing, the noise is effectively suppressed, making the highlighted areas more meaningful.
Figure 6.
Visualization of denoising effects with different sampling times; t = 40 was applied to each sampling process. (a) visualizes the denoising effect on a natural image; (b) visualizes the denoising effect on a remote sensing image.
2.2.3. Prompt for SAM
Given the normalized average attention map $\tilde{A}_{j}$ for the $j$-th text token (e.g., ‘airplane’), we extract the target object region by binarizing the map with a threshold $\tau$ and refining the result with DenseCRF [17], as shown in Figure 3c:

$B_{j} = \mathrm{DenseCRF}\!\left(\mathbb{1}\!\left[\tilde{A}_{j} > \tau\right]\right)$  (10)

As shown in Figure 7, we take the minimum bounding box and the centroid of each region in $B_{j}$ with a value of 1 as the box and point prompts for SAM. SAM then outputs a list of candidate masks based on these prompts, and we select the candidate with the highest IoU with the binarized attention mask $B_{j}$ as the final segmentation result. If the selected mask contains multiple closed regions, it is considered to contain multiple objects, such as the boat, airplane, tennis court, and storage tank in Figure 7; if it contains a single closed region, we consider there to be a single object, such as the playground. At this point, we have constructed a training-free, pixel-level AIA pipeline based on SDM and SAM.
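Putting the pieces together, the following sketch illustrates the prompt-and-select step: threshold the averaged attention map, derive a box and centroid for each connected region, query SAM, and keep the candidate mask that best overlaps the attention-derived region. It assumes the attention map has already been resized to image resolution and that predictor is a SamPredictor set on the image; DenseCRF refinement is replaced by a simple morphological opening for brevity.

```python
# Sketch of the prompt-and-select step: binarize, build prompts per region, query SAM, pick by IoU.
import numpy as np
from scipy import ndimage

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    return inter / (np.logical_or(a, b).sum() + 1e-8)

def prompt_and_segment(prob_map, predictor, tau=0.4):
    binary = ndimage.binary_opening(prob_map > tau)        # threshold as in Equation (10)
    labels, n_regions = ndimage.label(binary)
    results = []
    for idx in range(1, n_regions + 1):
        region = labels == idx
        ys, xs = np.nonzero(region)
        box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])   # minimum bounding box
        cy, cx = ndimage.center_of_mass(region)                    # centroid point prompt
        masks, _, _ = predictor.predict(
            point_coords=np.array([[cx, cy]]), point_labels=np.array([1]),
            box=box, multimask_output=True)
        # keep the SAM candidate that best matches the attention-derived region
        results.append(masks[int(np.argmax([mask_iou(m, region) for m in masks]))])
    return results
```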
Figure 7.
Visualization of the DiffuPrompter mask generation process: (a) original image, (b) cross-attention map, (c) binarized map, (d) box and point prompts, (e) segmentation result.
3. Results
3.1. Datasets
iSAID: iSAID [18] is a large-scale remote sensing instance segmentation dataset inherited from DOTA [19]. The widths of its images range from 800 to 13,000 pixels, and we split the images into patches for training and testing. It contains 15 classes of instances across 2806 images: ship, storage tank, baseball diamond, tennis court, basketball court, playground, bridge, large vehicle, small vehicle, helicopter, swimming pool, roundabout, soccer ball field, plane, and harbor.
NWPU VHR-10: NWPU VHR-10 [20] is another widely used dataset for object detection of RSIs. It has 800 high-resolution images, among which 650 are positive and 150 are negative, without any objects of interest. This dataset contains annotations of 10 object categories: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle.
Classification Dataset: We selected target images corresponding to the segmentation datasets from 11 classification datasets: UC Merced Land Use Dataset [21], WHU-RS19 [22,23], RSSCN7 [24], RS_C11 [25], NWPU-RESISC45 [26], AID [26], RSD46-WHU [27,28], PatternNet [29], OPTIMAL-31 [30], CLRS [31], and DLR Munich Vehicle [32]. Ultimately, we collected 9300 RSIs across 12 categories with 0.3∼3 m resolution: airplane, ship, storage tank, baseball diamond, swimming pool, tennis court, basketball court, roundabout, ground track field, harbor, bridge, and vehicle. The corresponding classes are selected when testing on the iSAID and NWPU datasets, and all images are resized to a uniform size.
3.2. Evaluation Metrics
We adopted the commonly used mean average precision (mAP) metric to evaluate the performance of the proposed method. A predicted instance mask is considered a true positive when its intersection-over-union (IoU) with a ground-truth mask is above a threshold and its predicted category matches the label. In this study, we employ AP, AP50, AP75, APS, APM, and APL for evaluation. AP refers to the metric averaged across all 10 IoU thresholds (0.50:0.05:0.95) and all categories; a larger AP value denotes more accurate predicted instance masks and superior instance segmentation performance. AP50 is calculated under an IoU threshold of 0.50, while AP75 is a stricter metric calculated under an IoU threshold of 0.75; given the same AP50, a higher AP75 indicates more accurate instance masks. APL is computed for large targets (area > 96² pixels), APM for medium targets (32² < area < 96² pixels), and APS for small targets (area < 32² pixels).
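For clarity, the toy example below spells out the matching rule that underlies these metrics: a predicted mask is a true positive at a given threshold only if its IoU with a same-class ground-truth mask exceeds that threshold. The masks and sizes are arbitrary.

```python
# Worked example of the matching rule behind AP, using two toy rectangular masks.
import numpy as np

def mask_iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return inter / (np.logical_or(pred, gt).sum() + 1e-8)

gt = np.zeros((100, 100), bool)
gt[20:80, 20:80] = True           # 60 x 60 ground-truth mask
pred = np.zeros((100, 100), bool)
pred[25:80, 25:80] = True         # slightly shifted 55 x 55 prediction

iou = mask_iou(pred, gt)          # approximately 0.84
for thr in (0.50, 0.75):
    print(f"IoU = {iou:.2f}, true positive at IoU > {thr}: {iou > thr}")
```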
3.3. Implementation Details
In this paper, we do not train any parameters of the SDM and SAM. Due to the lack of information about vehicle sizes in the remote sensing classification datasets, we merged ‘small vehicle’ and ‘large vehicle’ into one category, i.e., ‘vehicle’, when testing on iSAID. Additionally, since the ‘helicopter’ and ‘soccer ball field’ categories do not exist in the classification datasets, we did not generate pseudo-labels for them. Finally, we collected 9300 RSIs. The pseudo-label categories annotated in this paper and their abbreviations are: AI-airplane, SH-ship, ST-storage tank, BD-baseball diamond, TC-tennis court, BC-basketball court, RA-roundabout, PL-playground (ground track field), SP-swimming pool, HA-harbor, BR-bridge, and VE-vehicle. Mask R-CNN [33], Cascade R-CNN [34], and Mask2Former [34] were used as baselines to evaluate our method. Eight Tesla V100 GPUs were used to generate the pseudo-labels, which took approximately 96 h.
3.4. Qualitative Experiments
Visualizations in Figure 7 depict the intermediate results of DiffuPrompter when generating pseudo-labels. It can be observed that the proposed method can accurately segment any number of target objects, significantly increasing the number of positive samples in the pseudo-labels. In the Stable Diffusion model, the time step controls the noise intensity, which affects the results of DiffuPrompter. Figure 8 visualizes the cross-attention maps at different time steps and shows that the cross-attention map at time step 40 is the clearest while maintaining the structure of the target object. The reason may be that, at t = 40, the artificially introduced noise intensity is close to the inherent noise intensity of the cross-attention map, so LSAD can effectively suppress the noise without losing the contours of the target objects to excessive noise. Thus, we use the cross-attention maps at time step 40 in all experiments in this paper.
Figure 8.
Visualization of cross-attention maps at different time steps. The number of sampling iterations $N$ in LSAD is set to 50.
3.5. Ablation Study
We also performed an extensive ablation analysis to better understand the effectiveness of each proposed module in our DiffuPrompter.
3.5.1. Comparison with Attention Map under Different Thresholds
Figure 3c illustrates the impact of different thresholds on the binary map. It is evident that the threshold $\tau$ significantly impacts the prompt quality for SAM. Table 1 quantitatively compares the segmentation performance under different binarization thresholds. The term “cross-attention” means using the binarized attention map directly as pseudo-labels to train the segmentation model. The experimental results indicate that setting the threshold to 0.4 provides the best guidance for the model to segment target objects in both methods. Therefore, the value of $\tau$ in subsequent experiments is set to 0.4. Additionally, the performance of DiffuPrompter far exceeds that of the attention map across all thresholds. This can be attributed to the superior segmentation ability of SAM, which provides accurate pseudo-masks for the segmentation model.
Table 1.
The performance of Mask R-CNN trained by pseudo-labels generated by DiffuPrompter vs. cross-attention with different thresholds.
3.5.2. Sampling Times
Table 2 provides the ablation study on the number of sampling iterations in LSAD. The results show that the segmentation performance improves significantly as the number of sampling iterations increases and stabilizes after reaching 50 iterations. We ultimately adopted 50 iterations as the universal sampling count to balance performance and computational cost.
Table 2.
The performance of Mask R-CNN trained on pure synthesis data under different loop sampling times.
3.6. Segmentation Performance Comparison
NWPU: Table 3 presents the instance segmentation results on NWPU. The baseline segmentation methods trained on the data labeled by DiffuPrompter reach approximately half of the performance achieved with purely real data (e.g., for the mask AP of Mask R-CNN). Additionally, further fine-tuning on 600 real images (a 25% reduction in manual annotation) achieves performance comparable to training on purely real data, e.g., 55.6% mask AP after fine-tuning vs. 58.3% mask AP when training on purely real data with Mask R-CNN; 59.1% vs. 59.8% with Cascade R-CNN; and 60.9% vs. 61.3% with Mask2Former.
Table 3.
The performance of Mask R-CNN and Cascade R-CNN on the NWPU. ‘P’ and ‘R’ refer to ‘Pseudo’ and ‘Real’.
iSAID: Table 4 presents the results on iSAID. iSAID is more challenging than NWPU, as it includes more object categories and more complex backgrounds. Even in the absence of pseudo-labels for the helicopter and soccer ball field categories, the combination of DiffuPrompter pseudo-labels and iSAID still yields competitive mask AP results for Mask R-CNN, Cascade Mask R-CNN, and Mask2Former when training on 9300 pseudo-labeled images and 2200 real images (saving 21.4% of the manual annotation effort).
Table 4.
The performance of Mask R-CNN and Cascade R-CNN on iSAID. Mask AP is computed over all 15 classes. ‘P’ and ‘R’ refer to ‘Pseudo’ and ‘Real’.
3.7. Domain Generalization
Table 5 delivers the results of cross-domain validation, which can evaluate the generalization performance. We tested the performance on overlapping categories under the two datasets. The results indicate that DiffuPrompter plays a prominent role in domain generalization, e.g., 22.5% AP with DiffuPrompter and NWPU vs. 17.9% AP with NWPU on the iSAID test set, and 50.3% AP with DiffuPrompter and iSAID vs. 47.2% AP with iSAID on the NWPU test set.
Table 5.
Performance of domain generalization between different datasets. Mask R-CNN with ResNet50 is used as the baseline.
3.8. Comparison with the State of the Art
Figure 9 visualizes the output results of two advanced AIA methods and our DiffuPrompter. CAM originates from classification models and focuses more on discriminative regions while neglecting details. Consequently, the masks it generates often fail to cover the object entirely. DiffuMask, a recently proposed advanced algorithm, also leverages diffusion models to generate pseudo-masks for objects. It optimizes attention maps in U-Net through noise learning and uses them as pseudo-labels. However, DiffuMask is designed to generate pseudo-labels for synthetic images, which limits its annotation ability for authentic images. From Figure 9, it is evident that the optimized attention map is still easily disturbed by the complex background in RSIs. Moreover, the additional training process significantly increases its annotation cost compared to DiffuPrompter. In contrast, the proposed DiffuPrompter can accurately locate and label masks for target objects without training.
Figure 9.
Some examples of pseudo-labels generated by different methods.
Table 6 quantitatively compares the performance of the proposed method with other state-of-the-art algorithms. The results in Table 6 show that the proposed method significantly outperforms the baselines in terms of accuracy. However, there is still room for improvement in annotation speed.
Table 6.
Comparison of different AIA methods. The results are from segmentation methods trained on pseudo-labels constructed by these AIA methods. Seconds/im represents the time consumed by labeling each image with one V100 GPU.
Table 7 compares DiffuPrompter with some advanced weakly supervised methods, where the supervision types are image-level labels and box-level labels. Mask R-CNN and Cascade Mask R-CNN trained with pseudo-labels generated by DiffuPrompter are included for comparison. The results show that both models significantly outperform methods based on image-level supervision, e.g., 29.9% AP vs. 13.3% AP with BESTIE on NWPU, and achieve performance comparable to methods based on box-level supervision, e.g., 29.9% AP vs. 29.8% AP with MGWI-Net on NWPU. In conclusion, DiffuPrompter delivers better performance with less manual labor than existing advanced methods.
Table 7.
Performance comparison with some weakly supervised methods on the NWPU and iSAID datasets. ‘Sup’ refers to the supervision type.
Table 8 compares the class-wise performance of the model trained with DiffuPrompter pseudo-labels against several advanced supervised methods trained on ground truth. The results indicate that most categories reach nearly half the performance of the supervised methods. However, the performance on “bridge” and “vehicle” is significantly lower than that of the supervised methods. We hypothesize that this is due to the significant scale difference between the pseudo-labels and the segmentation dataset for these two categories, so the pseudo-labels do not adequately guide the segmentation of these objects in the target dataset. Therefore, addressing the cross-dataset issues between pseudo-labels and the target dataset is worth further investigation.
Table 8.
Comparison of class-wise results with advanced supervision methods on NWPU dataset.
4. Discussion
To obtain more accurate prompts, we fine-tuned SDM on a combination of several RSI caption datasets (RSICD [44], UCM-captions [45], and Sydney-captions [45]). However, the cross-attention maps of the fine-tuned SDM, as shown in Figure 10, cannot provide precise guidance for SAM. The reason may be that the image caption datasets for RSIs are too small in scale and insufficient to convey accurate text–visual correspondence to the SDM model. Therefore, the SDM used in this paper is not fine-tuned. How to fine-tune SDM on remote sensing datasets so that it gains a better understanding of remote sensing objects is worth exploring.
Figure 10.
Cross-attention map with fine-tuned SDM.
Another factor affecting performance is domain discrepancy. The significant differences between the classification datasets and the target detection datasets mean that the knowledge provided by the pseudo-labels cannot be fully exploited on the target dataset. Therefore, addressing the cross-domain issue between pseudo-labels and target data is also a valuable research direction.
In this paper, we determined the hyperparameter values based on the overall performance of the segmentation algorithm. However, the optimal hyperparameters vary slightly across different categories. Table 9 presents the category-level results under different parameters, indicating that designing adaptive hyperparameters could further improve the quality of the pseudo-labels. Therefore, developing an adaptive pseudo-labeling algorithm tailored to each category is a promising direction for future research.
Table 9.
Class-wise results of Mask R-CNN on the NWPU VHR-10 test set with different values of the threshold $\tau$ and time step t.
5. Conclusions
This paper introduces a novel insight that demonstrates the possibility of automatically annotating object masks for RSIs by leveraging off-the-shelf foundational models. Building on the SDM and SAM models, we propose an AIA method, namely DiffuPrompter, capable of leveraging the text semantic grounding knowledge in SDM to generate semantically precise SAM prompts, enabling it to acquire instance masks autonomously. We comprehensively test the effectiveness of the proposed method on two general datasets. We first evaluate the efficacy of each component in DiffuPrompter through ablation studies. Then, the cross-domain validation experiments confirm the significant effectiveness of large-scale pseudo-data in improving model generalization performance. Finally, we compare our method with other state-of-the-art algorithms, and the results demonstrate the superiority of our proposed method over existing ones.
Author Contributions
Conceptualization, H.L.; theory and methodology, H.L.; software (Python 3.9), H.L. and H.P.; visualization, H.L.; formal analysis, H.L.; writing—original draft preparation, H.L.; writing—review and editing, H.L.; supervision, Y.W.; project administration, Y.W. and W.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the Open Project Program Foundation of the Key Laboratory of Opto-Electronics Information Processing, Chinese Academy of Sciences (OEIP-O-202002), the National Nature Science Foundation of China (grant No. 61871106 and No. 61370152), and the Key R & D projects of Liaoning Province, China (grant No. 2020JH2/10100029).
Data Availability Statement
The original contributions presented in the study are included in the article, and further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Chen, K.; Zou, Z.; Shi, Z. Building extraction from remote sensing images with sparse token transformers. Remote Sens. 2021, 13, 4441. [Google Scholar] [CrossRef]
- Cheng, Q.; Zhang, Q.; Fu, P.; Tu, C.; Li, S. A survey and analysis on automatic image annotation. Pattern Recognit. 2018, 79, 242–259. [Google Scholar] [CrossRef]
- Wu, T.; Huang, J.; Gao, G.; Wei, X.; Wei, X.; Luo, X.; Liu, C.H. Embedded discriminative attention mechanism for weakly supervised semantic segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16765–16774. [Google Scholar] [CrossRef]
- Xu, L.; Ouyang, W.; Bennamoun, M.; Boussaid, F.; Sohel, F.; Xu, D. Leveraging auxiliary tasks with affinity learning for weakly supervised semantic segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6984–6993. [Google Scholar] [CrossRef]
- Ru, L.; Zhan, Y.; Yu, B.; Du, B. Learning affinity from attention: End-to-end weakly-supervised semantic segmentation with transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16846–16855. [Google Scholar] [CrossRef]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar] [CrossRef]
- Chen, J.; Chen, H.; Chen, K.; Zhang, Y.; Zou, Z.; Shi, Z. Diffusion models for imperceptible and transferable adversarial attack. arXiv 2023, arXiv:2305.08192. [Google Scholar]
- Zhang, Y.; Ling, H.; Gao, J.; Yin, K.; Lafleche, J.F.; Barriuso, A.; Torralba, A.; Fidler, S. Datasetgan: Efficient labeled data factory with minimal human effort. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10145–10155. [Google Scholar]
- Li, D.; Ling, H.; Kim, S.W.; Kreis, K.; Fidler, S.; Torralba, A. BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 21330–21340. [Google Scholar] [CrossRef]
- Wu, W.; Zhao, Y.; Shou, M.Z.; Zhou, H.; Shen, C. Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1206–1217. [Google Scholar] [CrossRef]
- Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701117. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef] [PubMed]
- Jaegle, A.; Gimeno, F.; Brock, A.; Vinyals, O.; Zisserman, A.; Carreira, J. Perceiver: General perception with iterative attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 4651–4664. [Google Scholar] [CrossRef]
- Smilkov, D.; Thorat, N.; Kim, B.; Viégas, F.; Wattenberg, M. Smoothgrad: Removing noise by adding noise. arXiv 2017, arXiv:1706.03825. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar] [CrossRef]
- Krähenbühl, P.; Koltun, V. Efficient inference in fully connected crfs with gaussian edge potentials. Adv. Neural Inf. Process. Syst. 2011, 24, 109–117. Available online: https://dl.acm.org/doi/10.5555/2986459.2986472 (accessed on 22 April 2024).
- Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. iSAID: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 28–37. Available online: https://api.semanticscholar.org/CorpusID:170079084 (accessed on 22 April 2024).
- Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
- Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
- Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar] [CrossRef]
- Xia, G.S.; Yang, W.; Delon, J.; Gousseau, Y.; Sun, H.; Maître, H. Structural high-resolution satellite image indexing. In Proceedings of the ISPRS TC VII Symposium-100 Years ISPRS, Vienna, Austria, 5–7 July 2010; Volume 38, pp. 298–303. Available online: https://api.semanticscholar.org/CorpusID:18018842 (accessed on 22 April 2024).
- Dai, D.; Yang, W. Satellite Image Classification via Two-Layer Sparse Coding with Biased Image Representation. IEEE Geosci. Remote Sens. Lett. 2011, 8, 173–176. [Google Scholar] [CrossRef]
- Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep learning based feature selection for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325. [Google Scholar] [CrossRef]
- Zhao, L.; Tang, P.; Huo, L. Feature significance-based multibag-of-visual-words model for remote sensing image scene classification. J. Appl. Remote Sens. 2016, 10, 035004. [Google Scholar] [CrossRef]
- Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
- Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
- Xiao, Z.; Long, Y.; Li, D.; Wei, C.; Tang, G.; Liu, J. High-resolution remote sensing image retrieval based on CNNs from a dimensional perspective. Remote Sens. 2017, 9, 725. [Google Scholar] [CrossRef]
- Zhou, W.; Newsam, S.; Li, C.; Shao, Z. PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS J. Photogramm. Remote Sens. 2018, 145, 197–209. [Google Scholar] [CrossRef]
- Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1155–1167. [Google Scholar] [CrossRef]
- Li, H.; Jiang, H.; Gu, X.; Peng, J.; Li, W.; Hong, L.; Tao, C. CLRS: Continual learning benchmark for remote sensing image scene classification. Sensors 2020, 20, 1226. [Google Scholar] [CrossRef]
- Liu, K.; Mattyus, G. Fast Multiclass Vehicle Detection on Aerial Images. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1938–1942. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar] [CrossRef]
- Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar] [CrossRef]
- Kolesnikov, A.; Lampert, C.H. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 695–711. [Google Scholar]
- Ahn, J.; Kwak, S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4981–4990. [Google Scholar] [CrossRef]
- Kim, B.; Yoo, Y.; Rhee, C.E.; Kim, J. Beyond semantic to instance segmentation: Weakly-supervised instance segmentation via semantic knowledge transfer and self-refinement. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4278–4287. [Google Scholar] [CrossRef]
- Dai, J.; He, K.; Sun, J. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1635–1643. [Google Scholar] [CrossRef]
- Chen, M.; Zhang, Y.; Chen, E.; Hu, Y.; Xie, Y.; Pan, Z. Meta-Knowledge Guided Weakly Supervised Instance Segmentation for Optical and SAR Image Interpretation. Remote Sens. 2023, 15, 2357. [Google Scholar] [CrossRef]
- Kirillov, A.; Wu, Y.; He, K.; Girshick, R. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9799–9808. [Google Scholar]
- Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. Blendmask: Top-down meets bottom-up for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8573–8581. [Google Scholar]
- Liu, Y.; Li, H.; Hu, C.; Luo, S.; Luo, Y.; Chen, C.W. Learning to aggregate multi-scale context for instance segmentation in remote sensing images. IEEE Trans. Neural Netw. Learn. Syst. 2024, 1–15, (Early Access). [Google Scholar] [CrossRef] [PubMed]
- Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195. [Google Scholar] [CrossRef]
- Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (Cits), Kunming, China, 6–8 July 2016; pp. 1–5. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).