Review

Deep Learning Methods for Semantic Segmentation in Remote Sensing with Small Data: A Survey

1 School of Geospatial Information, PLA Strategic Support Force Information Engineering University, Zhengzhou 450001, China
2 School of Surveying, Mapping and Geoinformation, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(20), 4987; https://doi.org/10.3390/rs15204987
Submission received: 4 September 2023 / Revised: 11 October 2023 / Accepted: 13 October 2023 / Published: 16 October 2023

Abstract:
The annotations used during the training process are crucial for the inference results of remote sensing images (RSIs) obtained with a deep learning framework. Unlabeled RSIs can be obtained relatively easily; however, pixel-level annotation necessitates a high level of expertise and experience. Consequently, small sample training methods have attracted widespread attention, as they help alleviate the reliance of current deep learning methods on large amounts of high-quality labeled data. Research on small sample learning is still in its infancy owing to the unique challenges faced when completing semantic segmentation tasks with RSIs. To better understand and stimulate future research on semantic segmentation with small data, we summarize the supervised learning methods and the challenges they face. We also review the currently popular approaches for learning with small data to help elucidate how a limited number of samples can be used efficiently to address semantic segmentation issues in RSIs. The main methods discussed are self-supervised learning, semi-supervised learning, weakly supervised learning, and few-shot methods. Solutions to cross-domain challenges are also discussed. Furthermore, multi-modal methods, prior knowledge constrained methods, and future research directions for optimizing deep learning models for various downstream tasks in relation to RSIs are identified.

1. Introduction

Driven by advances in sensor technology, fine-resolution remote sensing images (RSIs) are increasingly being captured worldwide [1]. Techniques for perceiving and understanding the ground surfaces in these RSIs have greatly contributed to the development of urban planning, 3D reconstruction of cities, land surveying, disaster detection, traffic planning, and other fields [2]. Excellent methods for understanding RSIs are thus essential to the advancement of society.
Automated approaches are well adapted to the current situation, and consequently, the automatic understanding of RSIs has been a long-standing research goal of the computer vision community [3]. The use of deep learning-based supervised feature methods to accomplish automated remote sensing image understanding (RSIU) has facilitated further developments in this field. For example, LeCun et al. [4] successfully used convolutional neural networks (CNNs) for classification and provided solutions that led to the development of deep learning strategies in a variety of fields, such as target classification, semantic segmentation, and object detection. Numerous excellent CNNs have since been proposed that can effectively accomplish downstream work, such as NIN [5], ResNet [6], UNet [7], VGG [8], FCN [9], FPN [10], PSPNet [11], SegNet [12], DenseNet [13], UperNet [14], GoogLeNet [15], DeepLab [16], DeepLabv1 [17], DeepLabv2 [18], DeepLabv3+ [19], Geoseg [20], HRNet [21], MAPNet [22], EfficientNet [23], and FarSeg [24]. To date, models based on self-attention, such as Swin Transformer [25], MobileViT [26], ConvNeXt [27], and VAN [28], have achieved good results and have been fully developed for semantic segmentation tasks; detailed information is given in Table 1. With the further development of sensors, an increasingly large number of RSIs are being obtained. This has made it increasingly challenging to accomplish the RSIU task, as labeling numerous images on a per-pixel basis is arduous. While our understanding of deep learning methods has been developing, the current supervised methods used to accomplish deep learning tasks are not suitable for all requirements, and improvements in this area are needed.
Accomplishing RSIU tasks with human vision is a natural process. However, the effectiveness of using numerous handcrafted labels for deep learning in RSIU tasks based on human understanding requires further validation. Handcrafted labels are a form of knowledge, and the more knowledge learned, the better the model will perform [29,30,31]. The heavy reliance on this knowledge has severely hampered further model developments. Furthermore, the vast amount of remote sensing data is ultimately richer in information than can be depicted by handcrafted labels [29]. Additionally, it should be noted that human vision comprehends RSIs in a holistic and federated manner, unconstrained by specific tasks or datasets. Training base models in an unsupervised and task-independent manner is an important solution to the current bottleneck problem [29,32]. Consequently, exploring general representations drawn from the samples themselves through self-supervised learning, semi-supervised learning, and few-shot methods has attracted extensive attention. Moreover, using a limited number of handcrafted labels or inexact labels to accomplish semantic segmentation tasks could effectively alleviate the heavy reliance on labels. In this survey, we refer to self-supervised learning, semi-supervised learning, inexact annotation supervised learning, and few-shot methods collectively as weakly supervised methods.
Furthermore, the development of the small sample set methods described above has provided solutions to help alleviate the cross-domain challenges faced by deep learning methods. The exploration of additional prior knowledge and the utilization of multimodal data can help fully utilize the information contained in RSI to better accomplish downstream tasks such as the semantic segmentation of RSI. Thus, a brief description of these methods is given in this survey and commonly used RSI datasets are summarized in the review, as shown in Table 2.
This paper addresses the above issues and is divided into the following sub-topics:
(1) Self-supervised learning obtains features from the unlabeled samples themselves. The commonly used contrastive learning methods and masked image modeling (MIM) methods have achieved great results on semantic segmentation tasks based on RSIs.
(2) Semi-supervised learning is performed by training a model with partially labeled samples. The core idea is to assign pseudo-labels to unlabeled samples [33], and the commonly used methods are generally based on the low-density separation and smoothing assumptions [34].
(3) Weakly supervised learning is considered a suitable method to ease issues related to insufficient labeled samples, as it simplifies the labeled data. The pixel-level semantic segmentation task is accomplished by using point annotation, random-walk or graffiti-based annotation, bounding box annotation, image-level annotation, and noisy labels.
(4) Domain adaptation methods have been proposed to improve the generalization ability of the models. However, the style, category, and resolution of the feature targets in RSIs differ remarkably between the source and target domains. The methods commonly used to solve these problems can be roughly categorized into discrepancy-based, adversarial-based, and pseudo label-based strategies.
(5) The few-shot method uses only a small amount of data and learns their features. For example, representative features are learned to accomplish the semantic segmentation task using meta-learning methods. Alternatively, data-augmentation strategies are also frequently used in few-shot learning.
Surveys of self-supervised learning have been previously conducted [35,36,37,38], semi-supervised learning has been summarized by Ahfock et al. [34], and few-shot learning has been surveyed by Sun et al. [39]. In this investigation, we aim to further summarize the use of a limited number of samples to accomplish deep learning semantic segmentation tasks based on RSIs:
(1) We specifically focus on RSIs to summarize the deep learning methods that are used to accomplish semantic segmentation.
(2) This survey comprehensively studies the application of deep learning methods in the semantic segmentation of RSIs, including self-supervised learning, semi-supervised learning, weakly supervised learning, cross-domain methodologies, and few-shot learning methods.
(3) The foundational approaches in supervised learning with small data and the currently popular, recently proposed methods are also described.
(4) The crucial issues in the field of deep learning, including multi-modal fusion methods and prior-knowledge-constrained approaches, are summarized.
Table 2. This table organizes datasets commonly used for RSI semantic segmentation.
Types: Multi Classification
• Proba-V [40] https://kelvins.esa.int/proba-v-super-resolution/data/ (1 October 2023)
• UC-Merced [41] http://weegee.vision.ucmerced.edu/datasets/landuse.html (1 October 2023)
• WHU-OPT-SAR [42] https://github.com/AmberHen/WHU-OPT-SAR-dataset.git (29 September 2023)
• NWPU-RESISC45 [43] https://gcheng-nwpu.github.io/#Datasets (29 September 2023)
• WHU-RS19 [44]
• RSSCN7 [45] https://link.zhihu.com/?target=https%3A//hyper.ai/datasets/5440 (29 September 2023)
• SIRI-WHU [46] http://www.lmars.whu.edu.cn/prof_web/zhongyanfei/e-code.html (29 September 2023)
• IEEE GRSS DFC http://www.grss-ieee.org/community/technical-committees/data-fusion/data-fusion-contest/ (14 October 2023)
• RSC11 (Scene Classification with Recurrent Attention of VHR Remote Sensing Images) (1 October 2023)
• SAT-4 and SAT-6 airborne [47] https://drive.google.com/uc?id=0B0Fef71_vt3PUkZ4YVZ5WWNvZWs&export=download (1 October 2023)
• ISAID [48] https://captain-whu.github.io/iSAID/index.html (29 September 2023)
• Cityscapes [49] https://www.cityscapes-dataset.com/ (29 September 2023)
• RSD46-WHU [50] http://www.lmars.whu.edu.cn/prof_web/xiaozhifeng/dataset.html (29 September 2023)
• Gaofen image dataset (GID) [51] http://captain.whu.edu.cn/GID/ (1 October 2023)
Types: Aerial Image Segmentation
• https://zenodo.org/record/1154821#.XH6HtygzbIU (1 October 2023)
• 38-Cloud dataset [52] https://github.com/SorourMo/38-Cloud-A-Cloud-Segmentation-Dataset (1 October 2023)
• Aeroscapes [53] https://github.com/ishann/aeroscapes (1 October 2023)
• OPTIMAL-31 [54] http://crabwq.github.io/ (13 October 2023)
• CLRS [55] https://www.kaggle.com/c/widsdatathon2019/data (1 October 2023)
• WiDS Datathon 2019 https://www.kaggle.com/c/widsdatathon2019 (1 October 2023)
• SenseEarth Classify https://rs.sensetime.com/competition/index.html#/info (29 September 2023)
• TG1HRSSC http://www.msadc.cn/main/setsubDetail?id=1369487569196158978 (29 September 2023)
• SIRI WHU google [56] http://www.lmars.whu.edu.cn/prof_web/zhongyanfei/e-code.html (29 September 2023)
• RSI CB [57] https://github.com/lehaifeng/RSI-CB (29 September 2023)
• Brazilian Coffee Scenes Dataset [58] http://patreo.dcc.ufmg.br/2017/11/12/brazilian-coffee-scenes-dataset/ (29 September 2023)
• Luxcarta dataset [59]
• Synthetic Synscapes street-view [60] https://hyper.ai/datasets/16890 (1 October 2023)
• Yellow River dataset https://ieeexplore.ieee.org/document/9121326 (1 October 2023)
• Sardinia dataset https://ieeexplore.ieee.org/document/9121326 (1 October 2023)
• De Gaulle airport dataset https://ieeexplore.ieee.org/document/9121326 (1 October 2023)
• Ottawa dataset https://ieeexplore.ieee.org/document/9121326 (1 October 2023)
• Mexico dataset https://ieeexplore.ieee.org/document/9121326 (1 October 2023)
• Argentina, Singapore and Haiti Dataset https://ieeexplore.ieee.org/document/8960415
• PatternNet dataset [61] https://link.zhihu.com/?target=https%3A//sites.google.com/view/zhouwx/dataset (1 October 2023)
Types: Multi Classification
• Dynamic World (Sentinel-1) [62]
• LoveDA dataset [63] https://link.zhihu.com/?target=https%3A//github.com/Junjue-Wang/LoveDA (1 October 2023)
• ISPRS Vaihingen http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html (1 October 2023)
• ISPRS Potsdam https://www2.isprs.org/commissions/comm2/wg4/benchmark/2d-sem-label-potsdam/ (14 October 2023)
• CVPR DGLC [64]
• TOV-RS dataset [29] https://github.com/GeoX-Lab/G-RSIM/tree/main/TOV_v1 (14 October 2023)
• Drone UAVid dataset https://uavid.nl/ (14 October 2023)
• Zurich Summer [65] https://sites.google.com/site/michelevolpiresearch/data/zurich-dataset (14 October 2023)
• Evlab-SS [66] http://earthvisionlab.whu.edu.cn/zm/SemanticSegmentation/index.html (14 October 2023)
• RIT-18 [67] https://github.com/rmkemker/RIT-18 (14 October 2023)
• Sen12MS dataset [68] https://link.zhihu.com/?target=https%3A//mediatum.ub.tum.de/1474000 (1 October 2023)
• Levir-CS dataset [69] https://github.com/permanentCH5/GeoInfoNet (1 October 2023)
• BigEarthNet [70] http://bigearth.net/ (1 October 2023)
• EuroSAT [71] https://hyper.ai/datasets/16778 (1 October 2023)
• Urban Drone Dataset (UDD) [72] https://github.com/MarcWong/UDD (1 October 2023)
• Semantic Drone Dataset [71] https://www.tugraz.at/index.php?id=22387 (1 October 2023)
Types: Point label
• FROM-GLC10 [73] http://data.ess.tsinghua.edu.cn (1 October 2023)
• SemCity Toulouse [74] https://isprs-annals.copernicus.org/articles/V-5-2020/109/2020/ (1 October 2023)
• ICOADS data [75]
• PASCAL VOC 2012 dataset [76,77] http://vision.stanford.edu/whats_the_point/ (14 October 2023)
Types: Scribble label
• PASCAL VOC 2012 dataset [78] https://www.microsoft.com/en-us/research/?from=https
• PASCAL CONTEXT dataset [78] https://cs.stanford.edu/~roozbeh/pascal-context/ (14 October 2023)
• PASCAL Semantic Boundary dataset [79]
Types: Box label
• PASCAL VOC 2007 [80,81]
• ImageNet [81,82]
To better survey recent research on deep learning methods for the semantic segmentation of remote sensing images, the keywords "semantic segmentation", "deep learning methods", and "remote sensing images" were used to search journal databases such as Elsevier and the Institute of Electrical and Electronics Engineers (IEEE). The number of relevant semantic segmentation papers based on RSIs published during 2018–2023 was counted, as shown in Figure 1.
The investigation is structured as follows: Section 2 briefly reviews single-modal supervised learning methods for semantic segmentation, introduces the semantic segmentation tasks involving multimodal data, and describes the use of prior information. Section 3 reviews commonly used contrastive learning and masked image modeling methods for semantic segmentation and identifies the popular methods based on RSIs. Section 4 introduces the semi-supervised approaches, the commonly used assumptions, and methods for semantic segmentation tasks based on these assumptions. Weakly supervised learning is introduced in Section 5, together with models that utilize sparse annotation labels for weakly supervised segmentation tasks. Domain-adaptation-based semantic segmentation methods are presented in Section 6 and are categorized into discrepancy-based, adversarial-based, and pseudo label-based strategies according to the commonly used methods. Section 7 introduces the use of the few-shot method for the semantic segmentation task and summarizes the two types of methods used in few-shot learning with RSIs. Section 8 provides an outlook on the future development of deep learning methods for RSI semantic segmentation with small data, and finally, Section 9 summarizes the main elements of this survey. An overview of this study is shown in Figure 2.

2. Supervised Learning Methods for RSI Semantic Segmentation

Large-scale variations and imbalances between background and target objects are major challenges in the semantic segmentation of RSIs. The large intra-class and small inter-class variations amongst objects in high-resolution RSIs complicate the extraction process required for spectral and geometric features. High-resolution (HR) or very high-resolution (VHR), light detection and ranging (LiDAR), or synthetic aperture radar (SAR) images can provide ample information about the observed landscape from various physical and material perspectives. The extracted features such as the spectral, textural, and structural information from the RSI can also assist in the semantic segmentation tasks used in deep learning frameworks. Consequently, single-modal methods, multi-modal methods, and the addition of prior knowledge are commonly used to solve the above problems.

2.1. Single-Modal Methods

Workman et al. [83] proposed using an exemplar of high-resolution labels to guide the training process when using low-resolution labels in supervised learning; region aggregation approaches were also used to improve image resolution. Shao et al. [84] constructed the BRRNet network to address the problem of incomplete extraction and inaccurate localization of building boundaries. The detection of salient objects in optical RSIs, however, is difficult due to the extreme complexity of their scale and shape [85]; to alleviate this problem, semantic and contextual attention modules have been constructed. Over-segmentation and inaccurate boundary segmentation also arise in lake semantic segmentation tasks, and Zhong et al. [86] proposed a transformer-based noise-canceling network model to help mitigate the over-segmentation phenomenon.

2.2. Multi-Modal Fusion Methods

With the increasing number of RSIs acquired by different types of sensors, the use of multi-modal approaches for semantic segmentation has attracted widespread attention. As multispectral imagery, LiDAR, and SAR data are commonly used in remote sensing, a large amount of research has focused on developing methods to fuse these data types. Supplementing elevation information with LiDAR data has been used to alleviate the object shape changes caused by perspective problems, and SAR data have been used to obtain weather-independent images. However, due to the large differences in the resolution and content of the different modal data, embedding them within a reasonable representation remains a significant challenge [87].
Pan et al. [88] incorporated LiDAR images to better address the issue of building height variations compared with using 2D data alone. Previous studies [89,90] have also proposed gated residual refinement networks fused with LiDAR data to learn multi-level features. The red, green, and near-infrared (NIR) bands of the multi-spectral images and the LiDAR-derived normalized digital surface model (nDSM) were used as input data to help improve the semantic segmentation results.
Li et al. [91] accomplished semantic segmentation of land cover classifications for merged optical and SAR images. In addition, a sequential model-based optimization (SMBO) approach was also proposed, which could accomplish the optimal combination of different modalities by using a multi-modal fusion architecture search network. Kang et al. [92] proposed using the construction of parallel network structures and aggregation modules to fuse the data from different modalities.

2.3. Prior-Knowledge-Constrained Methods

Many of the proposed deep learning frameworks have achieved great inference results, but it remains difficult to pinpoint object boundaries, especially in low-contrast regions. To achieve this, a high-level abstraction of an object’s contours is required. However, constructing models of shape patterns directly from images is challenging [93]. Lower-level features contain more detailed geometric, color, and textural object information than higher-level features. Low-level features can thus be used to guide the model training process and enhance the semantic segmentation results. The addition of prior knowledge can help guide the model to focus on confusing features and thus further improve semantic segmentation inference results, as seen with the geometric information in the nDSM, which is highly correlated with land cover classes [94].
Xu et al. [95] proposed a contour vibration network (CVNet) for the automatic delineation of building boundaries, and a contour vibration equation was calculated to improve the accuracy of building contour detection.
The discriminative feature network (DFN) [96] includes an edge network in which semantic boundary information is used as a supervised signal. A previous study [97] constructed a constraint module to train the edge features extracted by the Sobel operator, with the aim of addressing the issue of inter-class ambiguity. Quan et al. [98] used multi-scale edge features obtained with the Difference of Gaussians (DoG) approach to improve the edge-extraction results. These methods improved the semantic segmentation accuracy and the robustness of the results.
These supervised approaches have achieved great results for the semantic segmentation of RSIs. However, obtaining high-quality, manually annotated labels is tedious and time-consuming. At the same time, using handcrafted label information as prior knowledge to guide the training process tends to give the model an induction bias. Consequently, small-sample methods that use a limited amount of labeled data, or data without annotations, to learn image features and improve generalization capabilities have attracted increasing attention [35].

3. Self-Supervised Learning Methods for RSI Semantic Segmentation

Self-supervised learning (SSL) utilizes pretext tasks, generating labels or supervision from unlabeled samples, to learn general visual representations; the pretext tasks used in self-supervised methods usually include affine transformations, jigsaw transformations, and rotations [36,99]. Numerous studies have found that the features learned from unlabeled data through deep convolutional neural networks (DCNNs) can help achieve excellent inference results for semantic segmentation tasks [36,100,101]. The commonly used SSL methods are contrastive learning (CL) and MIM methods. Contrastive learning pulls similar images closer together and pushes different images apart to aid image representation learning [102], whereas the MIM method, which has been widely reported, reconstructs the missing parts of images using masks [103,104,105,106]. Thus, the application and development of self-supervised methods for the semantic segmentation of RSIs has attracted widespread attention [37,107], and the technical flow diagram is shown in Figure 3.
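To make the contrastive branch of SSL concrete, the following minimal sketch implements an InfoNCE-style loss of the kind used by SimCLR-like methods; the batch size, embedding dimension, and temperature are illustrative assumptions rather than values from any cited work.

```python
# Minimal InfoNCE-style contrastive loss (SimCLR-like); a sketch, not any
# paper's reference implementation. Shapes and temperature are assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature       # (N, N) pairwise cosine similarities
    targets = torch.arange(z1.size(0))       # matching views sit on the diagonal
    # Pull each image's two views together; push all other pairs apart.
    return F.cross_entropy(logits, targets)

# Usage: encode two random augmentations of a batch, then compute the loss.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce_loss(z1, z2)
```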

3.1. Commonly Used Self-Supervised Learning Models for Semantic Segmentation

3.1.1. Contrastive Learning Method

Contrastive learning is a popular self-supervised learning method for learning general invariant features. It closes the gap between supervised and unsupervised learning and is suitable for deep learning-based semantic segmentation [109].

Popular Contrastive Learning Models for Semantic Segmentation

SimCLR [110] is a self-supervised method that achieves experimental results comparable to those of supervised learning, and a Siamese network has been used to accomplish the CL task. However, it requires learning feature representations from numerous negative samples drawn from large batches of input images at one time. MoCo [111] provides a queue to store a large number of negative samples, which alleviates the need for large amounts of input data when learning representations. In a memory bank that stores all previously encoded features, the features become inconsistent because they are produced by encoders from different training steps; momentum-updated weights were therefore adopted to solve this difficulty when learning positive sample pairs. BYOL and SimSiam [112,113] use asymmetric structures to reduce the reliance on negative samples, and SimSiam replaces momentum updates by stopping the gradient updates.
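As a rough illustration of the MoCo-style mechanics described above, this sketch shows the momentum (EMA) update of the key encoder and a FIFO queue of negative keys; the toy encoders, momentum coefficient, and queue size are hypothetical.

```python
# MoCo-style momentum update and negative queue; a hedged sketch under
# assumed module shapes, not the official implementation.
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(encoder_q: nn.Module, encoder_k: nn.Module, m: float = 0.999):
    """Key encoder follows an exponential moving average of the query encoder,
    keeping the encoded keys consistent across training steps."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue(queue: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """FIFO negative-sample queue: drop the oldest keys, append the newest."""
    return torch.cat([queue[keys.size(0):], keys], dim=0)

# Usage with toy encoders and a queue of 1024 negatives of dimension 16.
encoder_q, encoder_k = nn.Linear(32, 16), nn.Linear(32, 16)
encoder_k.load_state_dict(encoder_q.state_dict())   # start identical
queue = torch.randn(1024, 16)
momentum_update(encoder_q, encoder_k)
queue = enqueue(queue, encoder_k(torch.randn(8, 32)))
```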

Contrastive Learning Method based on RSIs for Semantic Segmentation

The following points are commonly considered when conducting semantic segmentation tasks for RSIs using contrastive learning models:
  • The background is rich and complex: High-spatial-resolution RSIs have different imaging conditions, seasonal variations, topographic variations, ground reflectance, imaging angles, lighting conditions, geographical locations, time stamps, and sensors. These factors mean that RSIs are data-rich [114].
  • Global and local features are important: Overall differences occur in RSIs due to factors such as season, weather, and the sensor used. Consequently, it is important to focus on their global characteristics. To achieve pixel-level semantic segmentation of RSIs, it is necessary to obtain the local features [115].
  • Spatial location information is important: RSIs have multiple objects, and the spatial location information between different objects is the most important for semantic segmentation tasks [116].
  • Multi-modal data are required: RSIs acquired by different satellite- and aerial-imaging platforms have different resolutions, spectral bands, and revisitation rates. However, current remote sensing applications only use a fraction of the available multi-sensor, multi-channel RSIs. The methods using fused RSIs obtained by multi-sensor arrays have attracted large amounts of attention [117,118].
Due to the inherent complexity of RSIs, significant domain differences exist between them and natural images, as shown in Figure 4. Many methods have been used to construct appropriate models based on RSIs. Self-supervised learning methods can also be suitably adapted to RSIs in complex imaging environments. For example, Ref. [119] demonstrated that the self-supervised approach is more conducive to improving the robustness of the models, as well as having better capabilities to deal with label and input corruption. The self-supervised approach can achieve better results when the sample distribution is imbalanced. There are numerous ways by which to further combine the features of RSIs using the methods that are currently popular and were previously described.
In-domain visual general representations for RSIs using self-supervised and supervised methods were proposed in [120]. Building on SimCLR and the characteristics of RSIs, images obtained from the same geographical location by different remote sensors show variations; consequently, such images were used as positive samples in the training process [121].
Fusing different bands and multi-modal information by combining their reflection characteristics using the BYOL was proposed by [122]. Chen et al. [123] adopted the knowledge distillation method in their training process, generating pseudo-labels from unlabeled images, and labels with high confidence were used for subsequent training based on the BYOL. Location information is important for RSIs, but the location information of the target object changes when RSIs are fed into the CNN. Consequently, ref. [116] introduced the method of adding an index at pixel positions to alleviate these problems. A method based on BYOL was also proposed to accomplish the multi-classification task. Furthermore, two branches were constructed to learn both pixel-level and image-level data.
The self-supervised methods utilized in remote sensing often fail to yield both separable global semantic information and perceptible local semantic information, which is suboptimal for remote sensing. A vision transformer was thus used as the backbone of the self-supervised learning method and combined with MoCo v2 and BYOL. Concurrently, lighter tricks for tuning the hyper-parameters were used to achieve better experimental results [124].
The richness of objects in remote sensing images and the complexity of the background data result in confusion and uneven distributions of the positive and negative samples. A method addressing false negative samples (FNS) in the CL model was proposed [125] to solve this problem; rough approximation and accurate calibration of the FNS were used to mitigate the theoretical indecisiveness caused by the absence of definable criteria for FNS in self-supervised learning. Significant differences occur between global and local features in RSIs, and most methods cannot take both into account. Li et al. [115] proposed a CL module matching global styles and local features. In addition, an image transformation method was proposed for the CL task so that both global and local information are considered during the training process. More meaningful representative features are obtained by maintaining the properties of images, as shown by a study [126] that proposed a multi-task network architecture and used an image reconstruction regularization constraint to improve the inference results of RSIs; the network was trained on the image semantic segmentation task together with two decoders responsible for image reconstruction. A CL model was constructed using RSIs from different sensors to complete the feature learning tasks [127]. Wang et al. [128] found that dense pixel predictions achieved better results in semantic segmentation tasks and optimized a pairwise contrastive similarity loss at the pixel level for the two input images. Wang et al. [129] used the DINO [130] method to extract features from two enhanced views of an input image. A vision transformer (ViT) model integrating SAR and optical image information was adopted, with random masking of one modality used as a data-enhancement method. By combining the characteristics of multispectral (MS) and multi-modal data, a multi-modal fusion method with random fusion was used to reconcile the differences among multi-modal features in the remote sensing field, and complementary information was exploited to mine image features. A self-supervised learning approach based on the contrastive learning of natural RSIs and abstracted images of vertical scenes was proposed in [131], which explores the role of self-supervision in remote sensing by learning representations of three million locations. A Bi-LSTM architecture [132] was proposed that uses both contrastive and generative self-supervised methods; concurrently, masked self-attention mechanisms were used to obtain the multi-modal signal, and combined reconstruction losses were used to facilitate multi-modal fusion.
Numerous studies have used inpainting, augmented transformation prediction, and CutPaste pretexts to complete the contrastive learning required for RSIs. Li et al. [133] proposed a triple Siamese network to obtain the visual representation of RSIs; the inpainting task, augmentation transform prediction (ATP) task, and contrastive learning task were adopted in the Siamese network, respectively, to improve low-level and high-level feature learning. The representations learned from pretext tasks should be consistent with the image transformation; thus, Misra et al. [101] proposed a pretext-invariant representation learning method, which improved the semantic quality of image representation learning. Li et al. [134] proposed a two-phase framework adapted to self-supervised learning that trains on normal data to obtain deeper representations; the processed normal data were then used to learn anomalous features using a CutPaste method, which crops image patches and pastes them at random locations. Tang et al. [135] proposed adding a local mutual information module to the CL approach to enhance local consistency, which largely enhanced the feature representations of each pixel and the generalization of the model. Tian et al. [136] proposed setting up a linear predictor computed directly from input statistics rather than trained by gradient descent to improve the experimental results.

3.1.2. Masked Image Modeling Method

The masked image modeling (MIM) method works by masking some content and then using the visible and contextual information to predict the masked content. Such methods have the potential to provide a robust self-supervised learning objective that is rich in visual information.

Popular Masked Image Modeling Method for Semantic Segmentation

BEiT [137] learned representations by reconstructing visual tokens and used a pre-trained discrete variational autoencoder to encode the masked patches. A previous study [138] proposed an objective function to help avoid data collapse during training; the cross-correlation matrix of the outputs from two identical networks was driven toward the identity matrix to reduce redundancy. MAE [139] proposed an asymmetric encode–decode structure to reconstruct the masked images. SimMIM [140] switched to a simple decoder learning method based on the MAE method. The Vector-Quantized Knowledge Distillation (VQ-KD) algorithm was proposed in BEiT v2 [141] to complete self-supervised representation learning, and masked image modeling was raised from the pixel level to the semantic level. Moreover, a multimodal foundation model, BEiT v3 [142], achieves excellent transfer performance on both vision and vision–language tasks.
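The sketch below illustrates the random patch masking at the heart of MAE-style methods under assumed tensor shapes; it is a simplified rendering of the idea, not the reference implementation.

```python
# MAE-style random masking of patch tokens; shapes and ratio are assumptions.
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (N, L, D) patch tokens. Returns the visible tokens passed to
    the encoder and a (N, L) mask where 1 marks a masked (to-be-predicted)
    patch, mirroring the asymmetric encode-decode idea."""
    N, L, D = patches.shape
    n_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L)                      # one random score per patch
    ids_keep = noise.argsort(dim=1)[:, :n_keep]   # lowest scores are kept
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(N, L)
    mask.scatter_(1, ids_keep, 0.0)               # 0 = visible, 1 = masked
    return visible, mask

visible, mask = random_mask(torch.randn(2, 196, 768))
```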

Masked Image Modeling Method Based on RSIs for Semantic Segmentation

As RSIs are highly structured, He et al. [143] proposed adding structural similarity (SSIM) to the loss function to improve the performance of the semantic segmentation task. Simultaneously, the method used salient-region masking rather than random masking, and the predicted masked regions were used as an intrinsic representation of the pretext task when learning the images. RSIs span a wider range of orientations than natural images, the objects are small and dense, and the backgrounds of the RSIs are intricate. Based on this, Sun et al. [103] proposed RingMo, which uses an incomplete masking strategy to improve the capture of small objects. Similar studies have been conducted [144] that simultaneously reconstructed complementary masked-visible-region views and incorporated a global semantic distillation strategy (GSD) to ensure that the salient areas of small objects are not lost. Muhtar et al. [99] constructed a knowledge distillation network that inputs data-enhanced images to the teacher network and masked images to the student network for updated learning; this method aims to accomplish the SSL task using MIM. A framework that combines the CL and MIM methods has also been proposed.

3.2. Self-Supervised Methods with Prior Knowledge Constraint

In the field of self-supervision, attaching some object features as constraints can provide a supervised signal while training the network model, effectively improving the inference results for the downstream tasks, as shown in Figure 5.
A significant distinction between RSIs and other images is the heterogeneity of the landscape, which may impact inference results. It was thus proposed [145] that urban and rural GDP areas in satellite images should be used as prior knowledge to guide the pre-training process so that downstream tasks are completed more fairly. Reasonable use of surface reflectance, backscatter, and derivatives such as the normalized difference vegetation index (NDVI) could improve the extraction results when accomplishing land cover semantic segmentation tasks [146]. A knowledge-distillation network [147] was proposed to complete the semantic segmentation task, in which the teacher network used normalized difference water index (NDWI) metrics to complete watershed detection, and the student network was guided to learn the watershed features using a deep learning model. Different objects in RSIs have different spectral reflectance characteristics; thus, the selection of the bands fed to a deep learning model substantially affects feature extraction. Fan et al. [148] proposed a self-attention mechanism to improve the focus on important channels, which could be combined with remote sensing spectral indices to mitigate edge information loss in mangrove semantic segmentation tasks. Xie et al. [149] used an unsupervised approach to mitigate the problems faced by RSIs through autonomous image composites and proposed a framework to adapt to the cloud masking problem; the Difference in the Spatio-Temporal Dynamics of Events (DISTANCE) method was proposed as a prior for the training process to address the difficulty of having no labeled data during training. Materials and textures are important features in RSIs, and effectively depicting them can expand semantic information. Akiva et al. [3] proposed a material and texture representation learning method that can be used to align RSIs across different temporal spaces; a spatial approach for illumination and image-angle invariance was also applied to obtain consistent learning of the materials and textures. Hays et al. [150] proposed that geographic location could be used as auxiliary information to determine the properties of objects in RSIs, such as the style of buildings. Li et al. [151] adopted geographical knowledge as a supervisory signal during representation learning and introduced it into the pre-training process for RSIs. A spatio-temporal model [109] was constructed by utilizing spatially aligned images as temporal positive pairs in contrastive learning.
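For concreteness, the spectral indices mentioned above are simple band ratios; the sketch below computes NDVI and NDWI, and the idea of stacking or thresholding them as a prior signal is an illustrative assumption rather than the exact scheme of the cited works.

```python
# NDVI/NDWI computation; the downstream usage noted in the comments is a
# hedged illustration, not the pipeline of any specific cited method.
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)

def ndwi(green: np.ndarray, nir: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized difference water index: (Green - NIR) / (Green + NIR)."""
    return (green - nir) / (green + nir + eps)

# Such indices can be stacked with the raw bands as extra input channels, or
# thresholded (e.g., ndwi(...) > 0 as a coarse water prior for a teacher
# network) to provide a weak supervisory signal during pre-training.
```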

4. Semi-Supervised Learning Methods for RSI Semantic Segmentation

Semi-supervised learning can acquire accurate category labels from vast numbers of unlabeled samples with the use of a limited number of labeled samples, as shown in Figure 6. Popular semi-supervised methods are based on the low-density separation, cluster, smoothness, and manifold assumptions, which have great practical value [34,152,153]:
  • Low-density separation assumption: The boundary of the two different classes is the low-density region [154,155].
  • Smoothness assumption: The sample distribution is uneven; in high-density regions, if the input feature vectors x_1 and x_2 are close, the corresponding outputs y_1 and y_2 are also close [156,157].
  • Cluster assumption: If two data points are in the same cluster, they probably belong to the same class [158,159].
  • Manifold assumption: High-dimensional data lie on a low-dimensional manifold [160,161].
Figure 6. The figure describes the technical flow diagram of the semi-supervised methods [162].

Popular Semi-Supervised Learning Models for Semantic Segmentation

Entropy regularization and consistency regularization are two commonly used methods in semi-supervised learning and are closely related to the low-density separation and smoothness assumptions [34,163,164]. The use of consistency regularization for unlabeled samples is important for semi-supervised research [165]. The semi-supervised learning method can effectively reduce the amount of labeled data needed when extracting forest and vegetation features from RSIs [166]: two networks simultaneously predicted the unlabeled samples, and pseudo-labeled samples with high prediction consistency were added to the training set. Li et al. [167] proposed a semi-supervised approach for the building extraction task and integrated consistency training on the features and outputs into the model; the features were more apparent at intermediate, low-density encoded feature representations than at other layers, and additional perturbations were applied to the intermediate features. A consistency regularization method was used to enhance the consistency of outputs from samples subjected to random transformations and perturbations [168]. A network model based on knowledge distillation used gradient descent to adjust the parameters of the student network, while the teacher network was adjusted using an exponential moving average (EMA) during the training process. A method using Siamese networks in a multi-task model was proposed to complete the localization and damage assessment of buildings, and consistency regularization using an iteratively perturbed dual mean teacher was utilized to improve the training process and enhance performance [169]. A semi-supervised method based on knowledge distillation networks with transform consistency regularization (TCR) enhanced the consistency of inference results after grid shuffle, CutMix, and affine transformations were applied to the images [170].
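A minimal mean-teacher-style sketch of the EMA update and a consistency term is given below; the decay value and the MSE-on-softmax form are common choices assumed here, not necessarily those of the cited methods.

```python
# Mean-teacher consistency regularization; a hedged sketch with assumed
# hyper-parameters, not the exact formulation of any cited work.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, decay: float = 0.99):
    """Teacher weights follow an exponential moving average of the student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(decay).add_(p_s.data, alpha=1.0 - decay)

def consistency_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between predictions on two perturbed views of
    the same unlabeled image (the smoothness assumption in action)."""
    return F.mse_loss(student_logits.softmax(dim=1),
                      teacher_logits.softmax(dim=1).detach())
```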
Data-augmentation methods are popular semi-supervised learning methods and are closely related to the manifold assumption [34]. The commonly used data-augmentation methods add random perturbations or transformations, such as flip or rotation. Color jitter and random noise are also added to combine the characteristics of RSIs [165].
Additionally, the crux of semi-supervised semantic segmentation is the assignment of pseudo-labels to unlabeled images [33]. Consequently, many popular approaches enhance the quality of their pseudo-labels to improve semantic segmentation inference results based on semi-supervised learning. Wang et al. [171] proposed utilizing a GAN network in which the discriminator acquired high-confidence characteristics from the unlabeled samples and provided a supervised signal to the generator. Desai et al. [172] proposed using a GAN network for semi-supervised learning and introduced active learning into the network model to train on highly representative samples and the output of high-confidence images; the method achieved a marked improvement in accuracy. A semi-supervised learning method was proposed [173] that adopted pixel-level and region-level contrastive loss functions to enhance correlation learning across different images and was designed to enhance label quality and category separability. A GAN network [174] was constructed based on semi-supervised learning for the semantic segmentation of aquaculture regions; confidence maps generated by the discriminator, together with prediction maps produced by the generator on the unlabeled data, were used as pseudo-labels in the training process. A previous study [175] proposed adding confidence discrimination to GAN networks to improve the quality of pseudo-labels. A GAN network adopting residual networks and dilated convolutions in the generator was used in semi-supervised learning [176], and the flow alignment module (FAM) was used to learn the semantic information of adjacent feature maps, allowing the discriminator and generator to complete the training of regularization for pixel-level inference. Wang et al. [177] proposed using the categories from different datasets as supervised signals and optimizing the semantic segmentation process with unlabeled samples; multi-task learning models with shared weights for different tasks were used to enhance the generalization ability of the adversarial learning network.
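One common pseudo-labeling recipe keeps only confidently predicted pixels; the FixMatch-style weak/strong-view sketch below assumes an illustrative threshold and ignore index and is not the procedure of any specific cited GAN-based method.

```python
# Confidence-thresholded pseudo-labels for unlabeled pixels; a hedged,
# FixMatch-style sketch with assumed threshold and ignore index.
import torch
import torch.nn.functional as F

def pseudo_label_loss(weak_logits: torch.Tensor,
                      strong_logits: torch.Tensor,
                      threshold: float = 0.9) -> torch.Tensor:
    """weak/strong_logits: (N, C, H, W) predictions on weakly and strongly
    augmented views of the same unlabeled images."""
    probs = weak_logits.softmax(dim=1)
    conf, pseudo = probs.max(dim=1)     # per-pixel confidence and class
    pseudo[conf < threshold] = 255      # low-confidence pixels are ignored
    return F.cross_entropy(strong_logits, pseudo, ignore_index=255)

loss = pseudo_label_loss(torch.randn(2, 6, 64, 64),
                         torch.randn(2, 6, 64, 64), threshold=0.2)
```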

Semi-Supervised Learning Models Based on RSIs for Semantic Segmentation

High-precision, high-quality annotated samples are difficult to acquire for semantic segmentation tasks such as land classification and building extraction in RSIs, but unannotated samples are readily available. Thus, semi-supervised approaches that use a limited number of annotated samples together with a large number of unlabeled samples to accomplish semantic segmentation in remote sensing have attracted widespread attention [170]. Unmanned aerial vehicles could be used as flying IoT devices capturing RSIs and their elevation maps to enhance semantic segmentation inference results through bi-directional LSTM analysis [178]. Complex scenes in RSIs with different lighting and imaging angles can lead to boundary blurring in semantic segmentation results; thus, channel-weighted multiscale features (CMFs) and a boundary attention module (BAM) were used in semi-supervised learning [179] to address the issue of inaccurate edge information in RSIs based on a limited number of annotated samples. Furthermore, one study showed that incorporating the inpainting and jigsaw puzzle pretext tasks into semi-supervised learning could enhance the semantic segmentation inference results [180]. SAR has unique imaging characteristics and can serve as complementary data for semantic segmentation tasks; one study [181] proposed using texture and statistical features as the nodes of a graph based on the data distribution characteristics of RSIs, and a semi-supervised approach was used to complete the instance segmentation of ships. When only a limited amount of labeled data is available, effective solutions include expanding the training samples by self-training and re-training the classifier during the training process, or introducing a teacher–student network structure into a self-training approach to accomplish the semantic segmentation tasks [182,183].

5. Weakly Supervised Learning Methods for RSI Semantic Segmentation

Semi-supervised or self-supervised methods can accomplish semantic segmentation by using a limited number of artificial annotated samples or none. Weakly supervised learning methods are trained to perform semantic segmentation tasks of dense pixel predictions using inexpensive, easily available, and sparsely annotated samples and have been shown to have great potential to alleviate the problem of the lack of labeled data, as shown in Figure 7.

5.1. Weakly Supervised Learning Models

Commonly used methods for weakly supervised label annotations can be categorized into the following groups and as shown in Figure 8:
  • Point annotation;
  • Graffiti-based or random-walk annotation;
  • Bounding boxes annotation;
  • Image-level annotation.
Figure 8. The annotation methods used in weakly supervised learning, including point supervision, scribble supervision, bounding box supervision, and the obtained class activation map (CAM) supervision.
A previous study [184] also proposed using dense, noisy labels to accomplish semantic segmentation.

5.1.1. Point Annotation

Lenczner et al. [185] proposed using the DNN model to accomplish a task based on a weakly supervised learning method by exploring the sparse point annotation, and three different regularization methods were used for the pseudo-labels. Zhu et al. [186] proposed a weakly supervised learning method for sea-fog semantic segmentation tasks that used an optimized pseudo-labeling approach based on point annotation. Lu et al. [187] proposed using point annotation to complete the water feature extraction task by identifying the likeness between adjacent pixels of water. The RSIs were resampled using a neighborhood sampler, feature aggregation was adopted in the network, and recursive operations were used to enhance the semantic segmentation by refining the features. A weakly supervised learning approach [188] was also proposed to accomplish the land classification task by using point annotation. A support vector machine (SVM) was used to generate initial seeds, and it assigned labels to pixels with high confidence. Due to the use of an approach with high-accuracy seed points and low-accuracy surrounding points, conditional random fields (CRFs) were used to update the seeds.
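Point supervision is often trained with a partial cross-entropy that simply ignores all unannotated pixels; the sketch below assumes this common formulation, which may differ in detail from the cited methods.

```python
# Partial cross-entropy for point-annotated segmentation; a hedged sketch of
# a common formulation, not the exact loss of the cited works.
import torch
import torch.nn.functional as F

def partial_cross_entropy(logits: torch.Tensor, point_labels: torch.Tensor,
                          ignore_index: int = 255) -> torch.Tensor:
    """logits: (N, C, H, W); point_labels: (N, H, W) holding a class id at the
    handful of annotated pixels and `ignore_index` everywhere else, so the
    loss is computed only where point annotations exist."""
    return F.cross_entropy(logits, point_labels, ignore_index=ignore_index)

# Usage: a label map that is almost entirely "ignore" except a few points.
labels = torch.full((1, 64, 64), 255, dtype=torch.long)
labels[0, 10, 12], labels[0, 40, 33] = 1, 3            # two annotated points
loss = partial_cross_entropy(torch.randn(1, 6, 64, 64), labels)
```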

5.1.2. Graffiti-Based or Random-Walk Annotation

The weakly supervised semantic segmentation task in [189] was accomplished by using random-walk annotation. Because the positions of the labels within the pixels are uncertain, a differentiable term accounting for the uncertainty caused by random-walk labels was introduced into the loss function during training. Lin et al. [78] proposed an interactive approach to construct scribble-supervised labels; additionally, a graph model was used to propagate information from the scribble labels to unknown pixels. Furthermore, graphs were built using super-pixel images in the training process, and a uniform loss function that was continuously optimized via interactions was formulated. Wei et al. [190] proposed using a graffiti-based approach to complete weak supervision. To propagate from labeled graffiti data to unlabeled pixels, pixel propagation was accomplished using the buffer features of the road as well as the color and spatial properties of the super-pixels, and global dependency was built on the super-pixel graph nodes.

5.1.3. Bounding Box Annotation

A semantic segmentation task based on weakly supervised learning was accomplished by using bounding boxes [81]. The masks were assigned to specified candidate regions using a combination of multiple scales, and the masked labels were updated. A weakly supervised detection method using bounding boxes to accomplish edge extraction was proposed [191]. Rafique et al. [192] proposed using weakly supervised learning methods for the segmentation task based on bounding boxes annotation by modeling the probabilistic mask by means of the bivariate Gaussian distribution. A loss function was also proposed to represent the building information within the bounding box. Using bounding boxes for the semantic segmentation of the multi-class objects in weakly supervised learning poses challenges when distinguishing between foreground and background boundary features due to the absence of boundary constraints. Thus, the study in [193] proposed a background-aware pooling (BAP) method that aggregated the attention on foreground information and used noise-aware loss (NAL) to improve the pseudo-label quality.
The two-step weakly supervised semantic segmentation framework based on image-level labeling is widely used: pseudo-labels, usually class activation maps (CAMs), are first generated from the annotation labels, and the generated pseudo-labels and original images are then used to train the semantic segmentation network [194].

5.1.4. Image-Level Annotation

Zhou et al. [195] proposed using CAMs to obtain the approximate locations of objects in images, using the CycleGAN network to generate pseudo-samples and finally accomplishing pixel-level segmentation. Xie et al. [196] presented a weakly supervised learning method for defect detection based on image-level labeling. The main tasks could be divided into target classification and target localization; the classification task was performed using a global mean-max pooling class activation map (GAM-CAM), and the localization task was performed using full convolutional channel attention (FCCA) based on CAMs. A previous study [197] also reported conducting semantic segmentation tasks using weakly supervised learning, but first processed the target objects from different classes in a video using heatmaps. Furthermore, Yan et al. [198] proposed the image-level label-based generation of high-quality CAMs for weakly supervised building extraction tasks; the quality of the CAM was further improved by building two modules, multiscale generation (MSG) and superpixel refinement (SR). Li et al. [194] proposed a weakly supervised approach based on the obtained building extraction results; pseudo-masks were generated from image-level labels based on CAMs, and the model was trained with spatial contextual features using the CRF-loss.
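For reference, the sketch below computes a class activation map in the spirit of the original CAM formulation (final convolutional features weighted by the classifier weights that follow global average pooling); the shapes and min-max normalization are assumptions.

```python
# CAM computation; a hedged sketch of the classic formulation with assumed
# shapes, not the code of any cited building-extraction method.
import torch

def class_activation_map(features: torch.Tensor, fc_weight: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    """features: (C, H, W) from the last conv layer; fc_weight: (num_classes, C)
    from the linear classifier after global average pooling. The CAM is the
    class-specific weighted sum of the feature maps."""
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], features)
    cam = torch.relu(cam)
    return cam / (cam.max() + 1e-6)     # normalize to [0, 1] for thresholding

# Thresholding the normalized CAM (e.g., cam > 0.4) yields a coarse
# pseudo-mask for the two-step weakly supervised pipeline described above.
cam = class_activation_map(torch.randn(512, 16, 16), torch.randn(10, 512), class_idx=3)
```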
Meanwhile, He et al. [199] proposed an interesting method to address the uncertainty in ground-truth descriptions caused by the differences between vector labels and raster images during model training. Simply stated, a unified learning framework was proposed that continuously adjusts the learning parameters and infers the distribution of the real vectors, and a vector label refinement method was designed. Combinations of the above weakly supervised annotations have also been introduced, such as a unified approach using image-level labels, object-edge labels, and partial labels for pixel-level semantic segmentation tasks [200]. Weakly supervised semantic segmentation tasks were accomplished by using visual measurements to manually annotate road routes without the need for precise pixel-level annotation [201]. OpenStreetMap (OSM) was used as the baseline label source, and the center lines provided by OSM were used to accomplish semantic segmentation based on a weakly supervised method [202]. Introducing machine learning methods into weakly supervised learning can effectively improve segmentation results; for example, one could train a network model using the results produced by an SVM and then apply a CRF to the obtained segmentation results [203]. The nDSM is a LiDAR-based surface model covering an area of interest, and its elevation information can also be used for the semantic segmentation of trees. Thus, the nDSM was introduced for tree classification [204], and the areas overlapping with the spectral images were used to generate annotated samples for the encoder–decoder model training process. Different tree species were identified using the NDVI computed from the near-infrared, red, and green data together with the nDSM values.

5.2. Weakly Supervised Learning Methods with Prior Knowledge Constraint

To make better use of labels containing only partial annotations for the semantic segmentation task, weakly supervised learning models can achieve better inference results by introducing factors such as texture, color, and background information as prior knowledge. A method assigning point supervision to each class to complete weakly supervised learning was proposed [77], and prior knowledge about the probability of the object's class was attached to keep the training from settling into a local minimum. A weakly supervised learning approach with iterative clustering and epitomizing based on the affected textures and colors, using deep image priors as a post-processing method, was utilized [205] and provided an effective solution. The deep mask obtained after image activation was used as prior knowledge and fused with the shallow location information from the network training process to determine the class of the mask [206]. Background information could also be used as prior knowledge [207]: a graph was constructed based on the training data, and a Random Forest classifier was used to extract a semantic texton forest (STF) from the super-pixel semantic segmentation results. Finally, a CRF optimization algorithm was used to improve the experimental results.

6. Domain Adaptation Methods for RSI Semantic Segmentation

Due to the diversity of the sensors, there are significant disparities between the source and target domains, including variations in style and category. This creates a huge challenge for domain adaptation methods that utilize deep learning. The current approaches [208,209] to domain adaptation based on RSIs can be broadly categorized as follows, and the technical flow diagram is shown in Figure 9.
  • Discrepancy-based;
  • Adversarial-based;
  • Pseudo label-based.
Figure 9. The figure describes the technical flow diagram of the domain adaptation methods. The training model can be designed according to adversarial training or self-training methods, and the loss_T term takes the corresponding loss function to complete the training phase [210].

6.1. Discrepancy-Based

Many domain adaptation methods have been proposed to reduce the differences between domains. For example, Lu et al. proposed a category-complementary approach based on the category differences between the source and target domains [211] and constructed a classifier-complement module that aligns categories across different source domains. The study in [212] attempted to demonstrate, from a theoretical perspective, that manually transforming and expanding an image does not alter its semantic characteristics; it formulated the expansion process as a model of latent changes and divided the latent representation of the changed image into content groups and style groups. Cheng et al. [213] argued that domain adaptation carried out in the source domain and in the target domain are complementary; they constructed a dual-path learning framework to address inconsistent image appearance across domains and proposed two interactive modules that promote learning between the dual paths to help align the source and target domains. Because fine-grained maps are unavailable in many places, the study in [214] proposed a dynamic point-to-area cooperative learning model that combines building quality and tree information to learn such heterogeneous features and better exploit information from unlabeled areas. Another study [215] addressed conversion between datasets of different styles by proposing cycle-consistent adversarial domain adaptation (CyCADA), which feeds different pairs of RSIs into the network to reduce the differences between datasets and alleviate cloud detection issues. Banerjee et al. [216] proposed a graph-theoretic cross-domain cluster-mapping algorithm to address the issue that different RSI datasets contain different numbers of categories. Moreover, Cermelli et al. [217] proposed an incremental weakly supervised learning method that updates the network model incrementally when new classes are added, specifying new loss functions and initializing the classifier parameters to accomplish the semantic segmentation task.
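A canonical loss for this family, not tied to any of the specific works above, is the maximum mean discrepancy (MMD), which penalizes the distance between source and target feature distributions; a minimal PyTorch sketch with an RBF kernel follows:

```python
import torch

def mmd_rbf(f_src, f_tgt, sigma=1.0):
    """Squared MMD between source/target feature batches of shape (N, D)."""
    def k(a, b):
        # RBF kernel over pairwise squared Euclidean distances
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(f_src, f_src).mean() + k(f_tgt, f_tgt).mean() - 2 * k(f_src, f_tgt).mean()

# Typical usage: total loss = segmentation loss on labeled source images
# plus a weighted MMD term on pooled encoder features from both domains, e.g.
#   loss = ce(pred_src, y_src) + lam * mmd_rbf(feat_src, feat_tgt)
```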

6.2. Adversarial-Based

Generative adversarial networks (GANs) are utilized to promote domain confusion between the source and target domains and to alleviate the category mismatch between them [208]. Aerial and satellite images capture different geographical locations, terrain, and weather conditions, which gives rise to these cross-domain challenges. Iqbal et al. [218] proposed a GAN-based domain-adaptive approach that achieves semantic segmentation of built-up areas with weak, image-level data. Xie et al. [219] proposed a Cycle-GAN model that uses cycle consistency and adversarial losses as supervision signals to accomplish transfer learning from the source domain to the target domain. Wen et al. [220] proposed using labels to complete the semantic segmentation task through weakly supervised learning and transfer learning, preserving the relative position index of the maximum weights during GAN training. Another use of GANs [221] adopted a Pareto-based ranking method to reduce the difference between the source and target domains, so that class differences could be shared between the two domains while unknown classes were detected in the target domain. Zhao et al. [222] proposed incorporating LiDAR data into GANs to decrease the discrepancy between the source and target domains by reducing the per-pixel spectral shift; they also utilized a modified fractional differential mask (FrDM) method to extract spatial–spectral information. A semi-supervised approach [223] used two different discriminators to capture land-class boundary features and to regularly push the transferable features away from the original land-class boundaries. To address the large differences in feature scales in RSIs, Deng et al. [224] proposed a scale-aware framework, employing both a conventional feature discriminator and a scale discriminator, to accomplish land cover classification across locations and scales.
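The sketch below illustrates the generic adversarial alignment recipe underlying such methods, with a fully convolutional discriminator applied to segmentation softmax maps; the architecture and loss weighting are illustrative choices, not a reconstruction of any cited model:

```python
import torch
import torch.nn as nn

class DomainDiscriminator(nn.Module):
    """Patch-wise discriminator over (N, C, H, W) segmentation probability maps."""
    def __init__(self, n_classes, ndf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_classes, ndf, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ndf, ndf * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 2, 1, 4, stride=2, padding=1),   # per-patch domain logit
        )

    def forward(self, x):
        return self.net(x)

# One alternating step (sketch):
#   p_src, p_tgt = seg(x_src).softmax(1), seg(x_tgt).softmax(1)
#   d_loss = bce(disc(p_src.detach()), ONES) + bce(disc(p_tgt.detach()), ZEROS)
#   g_loss = ce(seg(x_src), y_src) + lam * bce(disc(p_tgt), ONES)  # fool the discriminator
```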

6.3. Pseudo Label-Based

Generating pseudo-labels with a model trained on the source domain can improve the generalization ability of the model with fewer computational resources [225]; a basic image segmentation model has also been used with an aggregation strategy to construct spatially independent and semantically consistent pseudo-labels, and pseudo-labels that capture the similarity between the source and target domains in the global context were used to accomplish domain adaptation. Gao et al. [209] proposed a multipath network model and a pseudo-label generation strategy that integrates the expert knowledge of these paths to better solve the cross-domain training problem when the source and target domains and categories are asymmetric. The large differences between source domains, target domains, and intra-domain samples make cross-domain adaptation difficult; thus, the study in [226] abstracts the relationships between cross-domain and inner classes into prototypes and constructs a masked consistency learning task to improve the quality of the pseudo-labels, updating the prototypes to complete unsupervised cross-domain learning. The complexity of geographical locations and the variety of remote sensing sensors can hinder the acquisition of training data. To narrow the discrepancy between the source and target domains, Li et al. [227] introduced a weakly supervised transfer-invariant constraint connecting the source and target domains and a pseudo-label constraint for adaptive mining of the pseudo-labels, used rotational consistency features to construct weakly supervised rotation-consistency constraints, and dynamically optimized the weights of these constraints.
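At its simplest, this family relies on confidence-thresholded self-training; the following sketch (function name and the 0.9 threshold are illustrative assumptions) shows how target-domain pseudo-labels can be produced:

```python
import torch

@torch.no_grad()
def make_pseudo_labels(model, x_tgt, threshold=0.9, ignore_index=255):
    """Confidence-thresholded pseudo-labels for unlabeled target images."""
    probs = model(x_tgt).softmax(dim=1)      # (N, C, H, W) class probabilities
    conf, labels = probs.max(dim=1)          # per-pixel confidence and class
    labels[conf < threshold] = ignore_index  # drop unreliable pixels from the loss
    return labels                            # retrain on the target with these labels
```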

7. Few-Shot Learning Methods for RSI Semantic Segmentation

Few-shot learning segments new classes in RSIs from only small amounts of data [228], as shown in Figure 10. The commonly used few-shot methods can be classified into data augmentation, the addition of prior knowledge to small data, and meta-learning. One intuitive solution is to generate new samples by augmenting the data, for example, by transforming images, modeling exemplars, using simulation techniques to place objects on remote sensing image backgrounds, modeling scenes, and using deep networks to generate new samples. Using limited samples to learn prior knowledge that guides model learning is another approach to few-shot learning; these popular methods can be categorized as transfer learning and metric learning [39]. Meta-learning learns new tasks from small data, and a deep learning model trained by a meta-learner can learn across different tasks [229].

7.1. Data Augmentation Method

Data augmentation methods can be utilized to alleviate overfitting during the training process and enhance model generalization [39]. Commonly used data augmentation methods can be summarized as data-warping-based methods, simulation-based methods, and deep generative models. Common data-warping-based methods are rotation, translation, and flip transformations [231], as well as pixel–block pair (PBP) generation [232], stacked sample generation [233], mixup [234], and randomly adding Gaussian noise to the spectral dimension [235], among others. The rapid and controlled generation of samples using simulation techniques can be used to expand datasets, while approaches that use deep models to generate samples [39] are also popular data augmentation methods.
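A minimal sketch of several of these warping-based transforms is given below (NumPy, assuming channel-last images and 2D label masks; all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def warp(img, label):
    """Random 90-degree rotation and horizontal flip applied jointly to image and mask."""
    k = int(rng.integers(4))
    img, label = np.rot90(img, k, axes=(0, 1)), np.rot90(label, k, axes=(0, 1))
    if rng.random() < 0.5:
        img, label = img[:, ::-1], label[:, ::-1]
    return img.copy(), label.copy()

def mixup(x1, x2, alpha=0.2):
    """Mixup [234]: convex combination of two inputs with lambda ~ Beta(alpha, alpha);
    the returned lambda weights the two labels in the loss."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

def spectral_noise(img, sigma=0.01):
    """Gaussian noise added independently across the spectral dimension [235]."""
    return img + rng.normal(0.0, sigma, img.shape)
```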

7.2. Prior-Knowledge-Based Models

A brief summary of the methods used to learn prior knowledge is as follows:
  • Transfer learning: Transfer learning methods aim to recognize categories in the target domain by transferring knowledge from the source domain [39,236].
  • Metric learning: Metric learning is the method of learning a distance function over data samples. Given two labeled samples $(x_1, y_1)$ and $(x_2, y_2)$, the distance between them is computed with the learned distance function; a label is then assigned to a query sample $x_3$ by comparing the distances $d(x_1, x_3)$ and $d(x_2, x_3)$ [237].
The prototype network used in few-shot learning can be acquired through metric learning and meta-learning. A study [238] was based on the idea that there exists an embedding space in which samples of each class cluster around a prototype representation; an extension of the prototype network that incorporates meta-learning for zero-shot learning was also suggested.

7.2.1. Transfer Learning

Transfer learning improves the semantic segmentation inference results in target domains by using a limited number of labels and transferring knowledge from relevant source domains [39]. Zhang et al. [239] proposed pairing source and target data in a network designed with a two-stage structure, using spectral normalization in the discriminator; a domain adaptation module was added to enhance the semantic segmentation task under few-shot learning. The feature-wise transformation module (FTM) [114] uses a simple affine transformation to transfer data from the source domain to the target domain, and the method offers high portability for completing land classification tasks with few-shot learning. To address the adaptation problem for RSIs, Tuia et al. [240] proposed transforming data from the source domain to the target domain with a non-linear transformation to ease the image-matching challenge of vectorization. Another study [241] proposed a unified model for small-data learning that can facilitate arbitrary dense pixel tasks. Concurrently, work on fine-tuning transfer learning methods for specific tasks is also ongoing.
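As a rough illustration of the affine, feature-wise transformation idea behind FTM [114], the module below applies a learnable channel-wise scale and shift to feature maps; this is our own simplification, not the published FTM design:

```python
import torch
import torch.nn as nn

class FeatureWiseAffine(nn.Module):
    """Channel-wise affine transform on (N, C, H, W) feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))   # scale
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # shift

    def forward(self, feat):
        # Re-style source-domain features toward the target domain.
        return self.gamma * feat + self.beta
```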

7.2.2. Metric Learning

Learning features together with an appropriate distance measurement is important [242]. Mao et al. [243] proposed a bidirectional feature globalization (BFG) method to compute the similarity between point features and prototype vectors and obtain both global and local features of the image. Wang et al. [244] proposed a metric learning method in which the model learns the prototype representation of a specific class and completes the image segmentation task by matching each pixel to the learned prototype. A pyramid structure [245] was designed to integrate feature maps of different scales; furthermore, a metric was generated for use with a wider range of datasets, as different classes have different distribution characteristics. Tang et al. [246] proposed a prototype network combining spatial–spectral information and used a one-dimensional convolutional network to learn the spatial–spectral metric space. Zhang et al. [247] proposed a metric-based approach to complete the semantic segmentation task with few-shot learning; a depth map with complementary information was added to the RSI to construct a two-branch network model that learns class-specific prototype representations in different embedding spaces. A previous study [242] proposed a dual-prototype learning method to reduce the intra-class distances in the prototype space and improve the discrimination capacity of the prototypes. Jiang et al. [248] proposed generating a prototype representation of each class based on its features; the model matched features with the prototypes during training, and a non-parametric, metric learning-based loss function was used. Another study [249] proposed adding an attention module for texture features to the prototype model to highlight the unique characteristics of forest objects. Wang et al. [250] found that many current few-shot methods for semantic segmentation focus on the background and foreground regions of the image, which is unfavorable for the practical application of semantic segmentation; meanwhile, a prototype queuing network was proposed to accomplish a multi-category semantic segmentation task. Finally, Holistic Prototype Activation (HPA) modeling was proposed [251]: by deriving a prior over existing categories, objects of irrelevant categories are filtered out, and interactive feature reweighting and multilevel feature aggregation are used in the decoding process to mitigate the tendency of models trained on few samples to overfit to the existing categories.
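The common core of these prototype-matching methods can be sketched as masked average pooling followed by nearest-prototype pixel labeling; the snippet below is an illustrative single-prototype version, not the implementation of any cited work:

```python
import torch
import torch.nn.functional as F

def prototype_similarity(support_feats, support_masks, query_feats):
    """support_feats: (S, D, H, W); support_masks: (S, H, W) in {0, 1};
    query_feats: (D, H, W). Returns per-pixel similarity to the class prototype."""
    # Masked average pooling: one foreground prototype per episode.
    fg = (support_feats * support_masks.unsqueeze(1)).sum(dim=(0, 2, 3))
    proto = fg / support_masks.sum().clamp(min=1)                    # (D,)
    # Cosine similarity between every query pixel and the prototype;
    # threshold it, or compare against a background prototype, to label pixels.
    return F.cosine_similarity(query_feats, proto[:, None, None], dim=0)
```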

7.3. Meta Learning

Meta-learning methods allow the improved transfer of existing knowledge to new classes [229,252]. Tian et al. [253] noted that most few-shot methods can only learn one feature representation from a single image, which makes many semantic segmentation tasks challenging; consequently, their study combined global and local feature branches within a meta-learning approach to accomplish few-shot semantic segmentation. When learning quickly from a small number of samples, it is important to determine initial model parameters that maximize performance during subsequent training and to consistently and efficiently absorb information from larger samples; the study in [229] proposed a task-agnostic model that uses meta-learning to accomplish various types of downstream tasks. Embedding spatial data in the training phase provides a meaningful solution, but the unknown spatial extent poses a huge challenge; Li et al. [254] proposed a spatial moderator to generalize learned spatial patterns and spatialized network structures from the original region to a new region, and introduced model-agnostic meta-learning (MAML) methods to address the issue that deep learning models are limited to individual partitions or locations. Furthermore, Wu et al. [252] introduced the concept of meta-classes: a meta-class is meta-information shared between all classes, which makes it possible to accomplish the semantic segmentation task with a limited number of samples. Because the performance of the embedding network in meta-learning is limited by the restricted number of available samples, Chen et al. [255] proposed a more generalized embedding network in a robust self-supervised learning-based approach to achieve classification with few-shot learning. Chen et al. [256] proposed a multi-perspective few-shot semantic segmentation task on 3D data, together with the concept of meta-classes based on a limited number of samples.
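For reference, one MAML-style meta-training step can be sketched as an inner adaptation on the support set followed by an outer loss on the query set; the code below assumes PyTorch ≥ 2.0 (for torch.func.functional_call) and is a generic illustration, not any cited model:

```python
import torch
from torch.func import functional_call  # PyTorch >= 2.0

def maml_step(model, loss_fn, support, query, inner_lr=0.01):
    """One MAML meta-step; second-order terms are kept via create_graph."""
    (x_s, y_s), (x_q, y_q) = support, query
    params = dict(model.named_parameters())
    # Inner loop: one gradient step on the few-shot support set.
    inner_loss = loss_fn(functional_call(model, params, (x_s,)), y_s)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
    # Outer loop: evaluate the adapted parameters on the query set;
    # backpropagate this loss with the meta-optimizer.
    return loss_fn(functional_call(model, adapted, (x_q,)), y_q)
```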

7.4. Other Few-Shot Learning Methods

Other methods have also been proposed to accomplish few-shot learning tasks. Obtaining pixel-level annotations for remote sensing data such as 3D point cloud scenes, multispectral images (MSIs), hyperspectral images (HSIs), and SAR requires specialist knowledge and extensive experience, and several few-shot methods exploit the particular characteristics of RSIs. Zhao et al. [257] proposed an attention-aware multi-prototype transductive inference method for the semantic segmentation of 3D point cloud data. Rao et al. [258] proposed using the three-dimensional neighborhood of each pixel to explore both spatial and spectral information, constructing a spatial–spectral relation network (SS-RN) to ensure stable feature extraction and mitigate the large intra-class and small inter-class variations of objects in RSIs. Kemker et al. [259] proposed a semi-supervised, few-shot approach that accomplishes semantic segmentation tasks with a self-taught feature-learning framework.

8. Outlook for the Future

Small sample set methods have attracted widespread attention because they can accomplish tasks with limited labels and have excellent generalization ability. Previous supervised deep learning methods face limitations and challenges, whereas humans can learn new tasks quickly from few examples. We therefore reviewed small sample set methods that do not rely entirely on handcrafted labels and summarized the advantages and disadvantages of deep learning methods with small data in Table 3. Looking ahead, promising directions include pre-training foundation models more efficiently on large-scale data, alleviating small-data methods such as few-shot learning by expanding datasets, better accomplishing cross-domain tasks, and making use of world models.

8.1. Foundation Models

Foundation models have emerged as a crucial direction for future development in the field of computer vision. By pre-training a foundation model and fine-tuning it with a limited number of samples, inference results with strong generalization capabilities can be achieved. The study in [260] designed models to perform target image segmentation with zero samples and constructed the largest segmentation dataset to date. Complete transfer learning becomes possible with pre-training methods, while scale further enhances the ability of foundation models. The concept of scale relies on three main elements: advancements in computer hardware, the development of transformer architectures, and the increased availability of training data [261]. Models pre-trained with large-scale self-supervision can be adapted to a variety of downstream tasks; renowned examples such as BERT [262], GPT-3 [263], CLIP [264], BEiT-3 [142], and RingMo [103] have experienced significant development. The challenges at hand are to scale up such models while minimizing computational intensity [261] and to explore the use of existing foundation models for domain-specific downstream tasks [2].

8.2. Cross-Domain Learning

Cross-domain learning, also known as inductive learning, involves learning inductive biases across datasets with different embedding spaces, and effectively learning inductive biases remains a challenging problem [107]. Methods for domain adaptation can be broadly classified into three categories: sample-based, feature-based, and inference-based. Sample-based approaches assign weights to samples so that the source data better represent the target domain, while feature-based approaches map, project, and represent features to facilitate cross-domain learning at the feature level [265]; inference-based methods incorporate adaptive procedures into the parameter estimation process. Future research directions include methods closely related to cross-domain problems, such as continual learning [266]. Additionally, solutions such as incorporating low-level feature constraints into cross-domain problems or imposing constraints on high-level semantic features could also be beneficial [267].

8.3. Data Augmentation

Data augmentation is essential for reducing a model's dependence on training data and improving its performance, as it can be used to generate additional data. The quality, quantity, and diversity of the data, as well as the performance of the model, greatly influence the results of deep learning-based downstream tasks [268]. Currently, generative AI relies on a variety of techniques, ranging from model architectures and self-supervised pre-training to generative modeling approaches [269], which provide promising solutions for data augmentation; one example is artificial intelligence-generated content (AIGC), a family of methods used to generate, modify, and manipulate valuable and diverse data [270]. As the amount of data grows, models are becoming more complex, and the patterns of data distributions are becoming broader and more comprehensive. Consequently, one of the main advancements in AIGC is the ability to train more complex generative models on larger datasets, which allows the generation of higher-quality and more realistic data; it can also exploit larger foundation model architectures and a wider range of computational resources [271]. For the semantic segmentation of RSIs, AIGC methods can generate large amounts of synthetic data that effectively ease overfitting during training. Meanwhile, the diffusion model systematically destroys the data distribution during forward iterations and learns a tractable reverse diffusion process that generates high-quality sample data [272]. It has been shown that the quality of samples generated by diffusion models can exceed that of earlier generative models [273].
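For reference, the forward (noising) step of a DDPM-style diffusion model admits the closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; a minimal sketch (variable names are generic, not from [272]) follows:

```python
import torch

def forward_diffuse(x0, t, betas):
    """Sample x_t from the DDPM forward process at integer timestep t.
    x0: clean image batch; betas: (T,) noise schedule."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]        # cumulative alpha_bar_t
    eps = torch.randn_like(x0)                              # eps ~ N(0, I)
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    return xt, eps  # the reverse model is trained to predict eps from (xt, t)
```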

8.4. World Model

Knowledge acquired by humans through unsupervised learning or small amounts of interactive learning is not limited to specific tasks; such knowledge is often referred to as common sense. Common sense can be viewed as a collection of world models built on our understanding of plausible and implausible consequences, and it can thus be used to predict, plan, and avoid errors in unknown situations [274]. Combined with the proposed lifelong learning models, a world model can help overcome catastrophic forgetting so that a model achieves better inference accuracy in new remote sensing scenarios with large spatio-temporal differences. Moreover, knowledge recall enables a model to continue learning in new scenarios, which is better adapted to the needs of deep learning models at this stage [275].

9. Conclusions and Discussion

In this survey, we have briefly summarized the commonly used network models based on supervised learning methods and reviewed learning methods with small data for semantic segmentation of RSIs. The effectiveness of multi-modal fusion approaches and of incorporating prior knowledge and guidance to improve semantic segmentation by fully exploiting the features of RSIs has been highlighted. Additionally, the currently popular self-supervised learning approaches are discussed; contrastive learning and masked image modeling approaches are explored, along with the integration of prior knowledge. Furthermore, we have summarized the currently popular semi-supervised methods, presented weakly supervised learning methods based on sparse annotations, and discussed commonly used domain adaptation methods in the remote sensing field, as well as the utilization of few-shot methods for semantic segmentation in remote sensing. Moreover, the opportunities and challenges of using small sample set methods in the era of foundation models are analyzed, and AIGC methods for data augmentation are introduced to enhance the inference results of semantic segmentation. Deep domain adaptation methods and world models, which have attracted widespread attention and are closely related to deep learning methods with small data, are also identified and discussed.

Author Contributions

Methodology, A.Y. and Y.Q.; Investigation, R.Y., W.G. and X.W.; Resources, D.H., H.Z. and J.C.; Writing—original draft preparation, A.Y. and Y.Q.; Writing—review and editing, A.Y. and Y.Q.; Visualization, D.H., H.Z. and J.C.; Supervision, Q.H. and P.H. All authors have read and agreed to the final published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grants 42101458, 42171456, 42130112, 41901285, and 42277478.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote. Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  2. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model. arXiv 2023, arXiv:2306.16269. [Google Scholar]
  3. Akiva, P.; Purri, M.; Leotta, M. Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8193–8205. [Google Scholar] [CrossRef]
  4. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  5. Lin, M.; Chen, Q.; Yan, S. Network In Network. arXiv 2013, arXiv:1312.4400. [Google Scholar]
  6. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar] [CrossRef]
  7. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  8. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  9. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  10. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. arXiv 2017, arXiv:1612.03144. [Google Scholar]
  11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. arXiv 2017, arXiv:1612.01105. [Google Scholar]
  12. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv 2016, arXiv:1511.00561. [Google Scholar] [CrossRef]
  13. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2018, arXiv:1608.06993. [Google Scholar]
  14. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified Perceptual Parsing for Scene Understanding. arXiv 2018, arXiv:1807.10221. [Google Scholar]
  15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  16. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.P.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  17. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv 2017, arXiv:1606.00915. [Google Scholar] [CrossRef]
  18. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  19. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 833–851. [Google Scholar]
  20. Wu, G.; Guo, Z.; Shao, X.; Shibasaki, R. GEOSEG: A computer vision package for automatic building segmentation and outline extraction. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 158–161. [Google Scholar]
  21. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. arXiv 2019, arXiv:cs.CV/1902.09212. [Google Scholar]
  22. Zhu, Q.; Liao, C.; Hu, H.; Mei, X.; Li, H. MAP-Net: Multiple Attending Path Neural Network for Building Footprint Extraction From Remote Sensed Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6169–6181. [Google Scholar] [CrossRef]
  23. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2020, arXiv:1905.11946. [Google Scholar]
  24. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A. Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery. arXiv 2020, arXiv:2011.09766. [Google Scholar]
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
  26. Mehta, S.; Rastegari, M. MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer. arXiv 2022, arXiv:2110.02178. [Google Scholar]
  27. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545. [Google Scholar]
  28. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual Attention Network. arXiv 2022, arXiv:2202.09741. [Google Scholar] [CrossRef]
  29. Tao, C.; Qia, J.; Zhang, G.; Zhu, Q.; Lu, W.; Li, H. TOV: The Original Vision Model for Optical Remote Sensing Image Understanding via Self-Supervised Learning. arXiv 2022, arXiv:2204.04716. [Google Scholar] [CrossRef]
  30. Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. arXiv 2017, arXiv:1707.02968. [Google Scholar]
  31. He, K.; Girshick, R.; Dollár, P. Rethinking ImageNet Pre-Training. arXiv 2018, arXiv:1811.08883. [Google Scholar]
  32. Shao, J.; Chen, S.; Li, Y.; Wang, K.; Yin, Z.; He, Y.; Teng, J.; Sun, Q.; Gao, M.; Liu, J.; et al. INTERN: A New Learning Paradigm Towards General Vision. arXiv 2022, arXiv:2111.08687. [Google Scholar]
  33. Wang, Y.; Wang, H.; Shen, Y.; Fei, J.; Li, W.; Jin, G.; Wu, L.; Zhao, R.; Le, X. Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4238–4247. [Google Scholar] [CrossRef]
  34. Ahfock, D.; McLachlan, G.J. Semi-supervised learning of classifiers from a statistical perspective: A brief review. Econom. Stat. 2023, 26, 124–138. [Google Scholar] [CrossRef]
  35. Tao, C.; Qi, J.; Guo, M.; Zhu, Q.; Li, H. Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–26. [Google Scholar] [CrossRef]
  36. Jing, L.; Tian, Y. Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4037–4058. [Google Scholar] [CrossRef]
  37. Wang, Y.; Albrecht, C.M.; Braham, N.A.A.; Mou, L.; Zhu, X.X. Self-Supervised Learning in Remote Sensing: A review. IEEE Geosci. Remote Sens. Mag. 2022, 10, 213–247. [Google Scholar] [CrossRef]
  38. Ericsson, L.; Gouk, H.; Loy, C.C.; Hospedales, T.M. Self-Supervised Representation Learning: Introduction, advances, and challenges. IEEE Signal Process. Mag. 2022, 39, 42–62. [Google Scholar] [CrossRef]
  39. Sun, X.; Wang, B.; Wang, Z.; Li, H.; Li, H.; Fu, K. Research Progress on Few-Shot Learning for Remote Sensing Image Interpretation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2387–2402. [Google Scholar] [CrossRef]
  40. Märtens, M.; Izzo, D.; Krzic, A.; Cox, D. Super-resolution of PROBA-V images using convolutional neural networks. Astrodynamics 2019, 3, 387–402. [Google Scholar] [CrossRef]
  41. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
  42. Li, X.; Zhang, G.; Cui, H.; Hou, S.; Wang, S.; Li, X.; Chen, Y.; Li, Z.; Zhang, L. MCANet: A joint semantic segmentation framework of optical and SAR images for land use classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102638. [Google Scholar] [CrossRef]
  43. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  44. Dai, D.; Yang, W. Satellite Image Classification via Two-Layer Sparse Coding With Biased Image Representation. IEEE Geosci. Remote Sens. Lett. 2011, 8, 173–176. [Google Scholar] [CrossRef]
  45. Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep Learning Based Feature Selection for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325. [Google Scholar] [CrossRef]
  46. Zhao, B.; Zhong, Y.; Xia, G.S.; Zhang, L. Dirichlet-derived multiple topic scene classification model for high spatial resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2015, 54, 2108–2123. [Google Scholar] [CrossRef]
  47. Basu, S.; Ganguly, S.; Mukhopadhyay, S.; DiBiano, R.; Karki, M.; Nemani, R. Deepsat: A learning framework for satellite imagery. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, 3–6 November 2015; pp. 1–10. [Google Scholar]
  48. Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 28–37. [Google Scholar]
  49. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3213–3223. [Google Scholar]
  50. Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
  51. Tong, X.; Xia, G.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Learning transferable deep models for land-use classification with high-resolution remote sensing images. arXiv 2018, arXiv:1807.05713. [Google Scholar]
  52. Mohajerani, S.; Krammer, T.A.; Saeedi, P. A Cloud Detection Algorithm for Remote Sensing Images Using Fully Convolutional Neural Networks. In Proceedings of the 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), Vancouver, BC, Canada, 29–31 August 2018; pp. 1–5. [Google Scholar] [CrossRef]
  53. Nigam, I.; Huang, C.; Ramanan, D. Ensemble knowledge transfer for semantic segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1499–1508. [Google Scholar]
  54. Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1155–1167. [Google Scholar] [CrossRef]
  55. Li, H.; Jiang, H.; Gu, X.; Peng, J.; Li, W.; Hong, L.; Tao, C. CLRS: Continual Learning Benchmark for Remote Sensing Image Scene Classification. Sensors 2020, 20, 1226. [Google Scholar] [CrossRef] [PubMed]
  56. Zhong, Y.; Zhu, Q.; Zhang, L. Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6207–6222. [Google Scholar] [CrossRef]
  57. Li, H.; Dou, X.; Tao, C.; Wu, Z.; Chen, J.; Peng, J.; Deng, M.; Zhao, L. RSI-CB: A Large-Scale Remote Sensing Image Classification Benchmark Using Crowdsourced Data. Sensors 2020, 20, 1594. [Google Scholar] [CrossRef] [PubMed]
  58. Penatti, O.A.; Nogueira, K.; Dos Santos, J.A. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 44–51. [Google Scholar]
  59. Tasar, O.; Happy, S.L.; Tarabalka, Y.; Alliez, P. ColorMapGAN: Unsupervised Domain Adaptation for Semantic Segmentation Using Color Mapping Generative Adversarial Networks. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7178–7193. [Google Scholar] [CrossRef]
  60. Wrenninge, M.; Unger, J. Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing. arXiv 2018, arXiv:1810.08705. [Google Scholar]
  61. Zhou, W.; Newsam, S.; Li, C.; Shao, Z. PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS J. Photogramm. Remote Sens. 2018, 145, 197–209. [Google Scholar] [CrossRef]
  62. Brown, C.F.; Brumby, S.P.; Guzder-Williams, B.; Birch, T.; Hyde, S.B.; Mazzariello, J.; Czerwinski, W.; Pasquarella, V.J.; Haertel, R.; Ilyushchenko, S.; et al. Dynamic World, Near real-time global 10 m land use land cover mapping. Sci. Data 2022, 9, 251. [Google Scholar] [CrossRef]
  63. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. arXiv 2022, arXiv:2110.08733. [Google Scholar]
  64. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. DeepGlobe 2018: A Challenge to Parse the Earth through Satellite Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–17209. [Google Scholar] [CrossRef]
  65. Volpi, M.; Ferrari, V. Semantic segmentation of urban scenes by learning local class interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  66. Zhang, M.; Hu, X.; Zhao, L.; Lv, Y.; Luo, M.; Pang, S. Learning dual multi-scale manifold ranking for semantic segmentation of high-resolution images. Remote Sens. 2017, 9, 500. [Google Scholar] [CrossRef]
  67. Kemker, R.; Salvaggio, C.; Kanan, C. Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS J. Photogramm. Remote. Sens. 2018, 145, 60–77. [Google Scholar] [CrossRef]
  68. Schmitt, M.; Hughes, L.H.; Qiu, C.; Zhu, X.X. SEN12MS—A Curated Dataset of Georeferenced Multi-Spectral Sentinel-1/2 Imagery for Deep Learning and Data Fusion. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, IV-2/W7, 153–160. [Google Scholar] [CrossRef]
  69. Wu, X.; Shi, Z.; Zou, Z. A geographic information-driven method and a new large scale dataset for remote sensing cloud/snow detection. ISPRS J. Photogramm. Remote Sens. 2021, 174, 87–104. [Google Scholar] [CrossRef]
  70. Sumbul, G.; Charfuelan, M.; Demir, B.; Markl, V. Bigearthnet: A large-scale benchmark archive for remote sensing image understanding. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 5901–5904. [Google Scholar]
  71. Helber, P.; Bischke, B.; Dengel, A.; Borth, D. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2217–2226. [Google Scholar] [CrossRef]
  72. Chen, Y.; Wang, Y.; Lu, P.; Chen, Y.; Wang, G. Large-scale structure from motion with semantic constraints of aerial images. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Guangzhou, China, 23–26 November 2018; Springer: Cham, Switzerland, 2018; pp. 347–359. [Google Scholar]
  73. Gong, P.; Liu, H.; Zhang, M.; Li, C.; Wang, J.; Huang, H.; Clinton, N.; Ji, L.; Li, W.; Bai, Y.; et al. Stable classification with limited sample: Transferring a 30-m resolution sample set collected in 2015 to mapping 10-m resolution global land cover in 2017. Sci. Bull. 2019, 64, 370–373. [Google Scholar] [CrossRef] [PubMed]
  74. Roscher, R.; Volpi, M.; Mallet, C.; Drees, L.; Wegner, J.D. Semcity Toulouse: A Benchmark for Building Instance Segmentation in Satellite Images. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, V-5-2020, 109–116. [Google Scholar] [CrossRef]
  75. Freeman, E.; Woodruff, S.D.; Worley, S.J.; Lubker, S.J.; Kent, E.C.; Angel, W.E.; Berry, D.I.; Brohan, P.; Eastman, R.; Gates, L.; et al. ICOADS Release 3.0: A major update to the historical marine climate record. Int. J. Climatol. 2017, 37, 2211–2232. [Google Scholar] [CrossRef]
  76. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  77. Bearman, A.; Russakovsky, O.; Ferrari, V.; Fei-Fei, L. What’s the point: Semantic segmentation with point supervision. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII 14; Springer: Cham, Switzerland, 2016; pp. 549–565. [Google Scholar]
  78. Lin, D.; Dai, J.; Jia, J.; He, K.; Sun, J. ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3159–3167. [Google Scholar] [CrossRef]
  79. Hariharan, B.; Arbeláez, P.; Bourdev, L.; Maji, S.; Malik, J. Semantic contours from inverse detectors. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 991–998. [Google Scholar] [CrossRef]
  80. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  81. Dai, J.; He, K.; Sun, J. BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA, 7–13 December 2015; pp. 1635–1643. [Google Scholar] [CrossRef]
  82. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  83. Workman, S.; Hadzic, A.; Rafique, M.U. Handling Image and Label Resolution Mismatch in Remote Sensing. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–7 January 2023; pp. 3698–3707. [Google Scholar] [CrossRef]
  84. Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A Fully Convolutional Neural Network for Automatic Building Extraction from High-Resolution Remote Sensing Images. Remote Sens. 2020, 12, 1050. [Google Scholar] [CrossRef]
  85. Lin, Y.; Sun, H.; Liu, N.; Bian, Y.; Cen, J.; Zhou, H. Attention Guided Network for Salient Object Detection in Optical Remote Sensing Images. arXiv 2022, arXiv:2207.01755. [Google Scholar]
  86. Zhong, H.F.; Sun, Q.; Sun, H.M.; Jia, R.S. NT-Net: A Semantic Segmentation Network for Extracting Lake Water Bodies From Optical Remote Sensing Images Based on Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  87. Yao, J.; Hong, D.; Gao, L.; Chanussot, J. Multimodal Remote Sensing Benchmark Datasets for Land Cover Classification. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 4807–4810. [Google Scholar] [CrossRef]
  88. Pan, X.; Gao, L.; Marinoni, A.; Zhang, B.; Yang, F.; Gamba, P. Semantic Labeling of High Resolution Aerial Imagery and LiDAR Data with Fine Segmentation Network. Remote Sens. 2018, 10, 743. [Google Scholar] [CrossRef]
  89. Deng, W.; Shi, Q.; Li, J. Attention-Gate-Based Encoder–Decoder Network for Automatical Building Extraction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2611–2620. [Google Scholar] [CrossRef]
  90. Huang, J.; Zhang, X.; Xin, Q.; Sun, Y.; Zhang, P. Automatic building extraction from high-resolution aerial images and LiDAR data using gated residual refinement network. ISPRS J. Photogramm. Remote Sens. 2019, 151, 91–105. [Google Scholar] [CrossRef]
  91. Li, X.; Lei, L.; Kuang, G. Multi-Modal Fusion Architecture Search for Land Cover Classification Using Heterogeneous Remote Sensing Images. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 5997–6000. [Google Scholar] [CrossRef]
  92. Kang, J.; Wang, Z.; Zhu, R.; Xia, J.; Sun, X.; Fernandez-Beltran, R.; Plaza, A. DisOptNet: Distilling Semantic Knowledge From Optical Images for Weather-Independent Building Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  93. Ding, L.; Tang, H.; Liu, Y.; Shi, Y.; Zhu, X.X.; Bruzzone, L. Adversarial Shape Learning for Building Extraction in VHR Remote Sensing Images. IEEE Trans. Image Process. 2022, 31, 678–690. [Google Scholar] [CrossRef]
  94. Xiong, Z.; Chen, S.; Wang, Y.; Mou, L.; Zhu, X.X. GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data. arXiv 2023, arXiv:2305.14914. [Google Scholar]
  95. Xu, Z.; Xu, C.; Cui, Z.; Zheng, X.; Yang, J. CVNet: Contour Vibration Network for Building Extraction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1373–1381. [Google Scholar] [CrossRef]
  96. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Learning a Discriminative Feature Network for Semantic Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1857–1866. [Google Scholar] [CrossRef]
  97. Liao, C.; Hu, H.; Li, H.; Ge, X.; Chen, M.; Li, C.; Zhu, Q. Joint Learning of Contour and Structure for Boundary-Preserved Building Extraction. Remote Sens. 2021, 13, 1049. [Google Scholar] [CrossRef]
  98. Quan, Y.; Yu, A.; Cao, X.; Qiu, C.; Zhang, X.; Liu, B.; He, P. Building Extraction from Remote Sensing Images with DoG as Prior Constraint. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6559–6570. [Google Scholar] [CrossRef]
  99. Muhtar, D.; Zhang, X.; Xiao, P.; Li, Z.; Gu, F. CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17. [Google Scholar] [CrossRef]
  100. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv 2020, arXiv:2003.04297. [Google Scholar]
  101. Misra, I.; van der Maaten, L. Self-Supervised Learning of Pretext-Invariant Representations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6706–6716. [Google Scholar] [CrossRef]
  102. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A Survey on Contrastive Self-Supervised Learning. Technologies 2021, 9, 2. [Google Scholar] [CrossRef]
  103. Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A Remote Sensing Foundation Model with Masked Image Modeling. IEEE Trans. Geosci. Remote. Sens. 2022, 61, 1–22. [Google Scholar] [CrossRef]
  104. Yuan, Y.; Lin, L.; Liu, Q.; Hang, R.; Zhou, Z.G. SITS-Former: A pre-trained spatio-spectral-temporal representation model for Sentinel-2 time series classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102651. [Google Scholar] [CrossRef]
  105. Cong, Y.; Khanna, S.; Meng, C.; Liu, P.; Rozi, E.; He, Y.; Burke, M.; Lobell, D.B.; Ermon, S. SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery. arXiv 2023, arXiv:2207.08051. [Google Scholar]
  106. Scheibenreif, L.; Hanna, J.; Mommert, M.; Borth, D. Self-supervised Vision Transformers for Land-cover Segmentation and Classification. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 18–24 June 2022; pp. 1421–1430. [Google Scholar] [CrossRef]
  107. Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J.; Tang, J. Self-Supervised Learning: Generative or Contrastive. IEEE Trans. Knowl. Data Eng. 2023, 35, 857–876. [Google Scholar] [CrossRef]
  108. Wang, J. Self-Supervised Learning. 2021. Available online: https://zhuanlan.zhihu.com/ (accessed on 11 October 2023).
  109. Ayush, K.; Uzkent, B.; Meng, C.; Tanmay, K.; Burke, M.; Lobell, D.; Ermon, S. Geography-Aware Self-Supervised Learning. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 10161–10170. [Google Scholar] [CrossRef]
  110. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G.E. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar]
  111. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735. [Google Scholar] [CrossRef]
  112. Grill, J.B.; Strub, F.; Altch’e, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.Á.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. arXiv 2020, arXiv:2006.07733. [Google Scholar]
  113. Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15745–15753. [Google Scholar] [CrossRef]
  114. Chen, Q.; Chen, Z.; Luo, W. Feature Transformation for Cross-domain Few-shot Remote Sensing Scene Classification. arXiv 2022, arXiv:2203.02270. [Google Scholar]
  115. Li, H.; Li, Y.; Zhang, G.; Liu, R.; Huang, H.; Zhu, Q.; Tao, C. Global and Local Contrastive Self-Supervised Learning for Semantic Segmentation of HR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  116. Muhtar, D.; Zhang, X.; Xiao, P. Index Your Position: A Novel Self-Supervised Learning Method for Remote Sensing Images Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  117. Rustowicz, R.M.; Cheong, R.; Wang, L.; Ermon, S.; Burke, M.; Lobell, D. Semantic segmentation of crop type in Africa: A novel dataset and analysis of deep learning methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 75–82. [Google Scholar]
  118. Reiche, J.; Hamunyela, E.; Verbesselt, J.; Hoekman, D.; Herold, M. Improving near-real time deforestation monitoring in tropical dry forests by combining dense Sentinel-1 time series with Landsat and ALOS-2 PALSAR-2. Remote Sens. Environ. 2018, 204, 147–161. [Google Scholar] [CrossRef]
  119. Hendrycks, D.; Mazeika, M.; Kadavath, S.; Song, D. Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. arXiv 2019, arXiv:1906.12340. [Google Scholar]
  120. Ghanbarzade, A.; Soleimani, D.H. Supervised and Contrastive Self-Supervised In-Domain Representation Learning for Dense Prediction Problems in Remote Sensing. arXiv 2023, arXiv:2301.12541. [Google Scholar]
  121. Jain, U.; Wilson, A.; Gulshan, V. Multimodal Contrastive Learning for Remote Sensing Tasks. arXiv 2022, arXiv:2209.02329. [Google Scholar]
  122. Jain, P.; Schoen-Phelan, B.; Ross, R. Self-Supervised Learning for Invariant Representations From Multi-Spectral and SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7797–7808. [Google Scholar] [CrossRef]
  123. Chen, D.Y.; Peng, L.; Zhang, W.Y.; Wang, Y.D.; Yang, L.N. Research on Self-Supervised Building Information Extraction with High-Resolution Remote Sensing Images for Photovoltaic Potential Evaluation. Remote Sens. 2022, 14, 5350. [Google Scholar] [CrossRef]
  124. Xie, Z.; Lin, Y.; Yao, Z.; Zhang, Z.; Dai, Q.; Cao, Y.; Hu, H. Self-Supervised Learning with Swin Transformers. arXiv 2021, arXiv:2105.04553. [Google Scholar]
  125. Zhang, Z.; Wang, X.; Mei, X.; Tao, C.; Li, H. FALSE: False Negative Samples Aware Contrastive Learning for Semantic Segmentation of High-Resolution Remote Sensing Image. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  126. Papadomanolaki, M.; Karantzalos, K.; Vakalopoulou, M. A Multi-Task Deep Learning Framework Coupling Semantic Segmentation and Image Reconstruction for Very High Resolution Imagery. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1069–1072. [Google Scholar] [CrossRef]
  127. Swope, A.M.; Rudelis, X.H.; Story, K.T. Representation Learning for Remote Sensing: An Unsupervised Sensor Fusion Approach. arXiv 2021, arXiv:2108.05094. [Google Scholar]
  128. Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. Dense Contrastive Learning for Self-Supervised Visual Pre-Training. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3023–3032. [Google Scholar] [CrossRef]
  129. Wang, Y.; Albrecht, C.M.; Zhu, X.X. Self-Supervised Vision Transformers for Joint SAR-Optical Representation Learning. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 139–142. [Google Scholar] [CrossRef]
  130. Caron, M.; Touvron, H.; Misra, I.; Jegou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 9630–9640. [Google Scholar] [CrossRef]
  131. Seneviratne, S.; Nice, K.A.; Wijnands, J.S.; Stevenson, M.; Thompson, J. Self-Supervision, Remote Sensing and Abstraction: Representation Learning Across 3 Million Locations. In Proceedings of the 2021 Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, Australia, 29 November–1 December 2021; pp. 01–08. [Google Scholar] [CrossRef]
  132. Chen, Y.; Zhao, M.; Bruzzone, L. Incomplete Multimodal Learning for Remote Sensing Data Fusion. arXiv 2023, arXiv:2304.11381. [Google Scholar]
  133. Li, W.; Chen, H.; Shi, Z. Semantic Segmentation of Remote Sensing Images With Self-Supervised Multitask Representation Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6438–6450. [Google Scholar] [CrossRef]
  134. Li, C.L.; Sohn, K.; Yoon, J.; Pfister, T. CutPaste: Self-Supervised Learning for Anomaly Detection and Localization. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9659–9669. [Google Scholar] [CrossRef]
  135. Tang, M.; Georgiou, K.; Qi, H.; Champion, C.; Bosch, M. Semantic Segmentation in Aerial Imagery Using Multi-level Contrastive Learning with Local Consistency. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikola, HI, USA, 2–7 January 2023; pp. 3787–3796. [Google Scholar] [CrossRef]
  136. Tian, Y.; Chen, X.; Ganguli, S. Understanding self-supervised Learning Dynamics without Contrastive Pairs. arXiv 2021, arXiv:2102.06810. [Google Scholar]
  137. Bao, H.; Dong, L.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
  138. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. arXiv 2021, arXiv:2103.03230. [Google Scholar]
  139. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15979–15988. [Google Scholar] [CrossRef]
  140. Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. SimMIM: A Simple Framework for Masked Image Modeling. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9643–9653. [Google Scholar] [CrossRef]
  141. Peng, Z.; Dong, L.; Bao, H.; Ye, Q.; Wei, F. BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers. arXiv 2022, arXiv:2208.06366. [Google Scholar]
  142. Wang, W.; Bao, H.; Dong, L.; Bjorck, J.; Peng, Z.; Liu, Q.; Aggarwal, K.; Mohammed, O.K.; Singhal, S.; Som, S.; et al. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks. arXiv 2022, arXiv:2208.10442. [Google Scholar]
  143. He, S.; Li, Q.; Liu, Y.; Wang, W. Semantic Segmentation of Remote Sensing Images with Self-Supervised Semantic-Aware Inpainting. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  144. Wang, X.; Zhang, Y.; Zhang, Z.; Luo, Q.; Yang, J. GSC-MIM: Global semantic integrated self-distilled complementary masked image model for remote sensing images scene classification. Front. Ecol. Evol. 2022, 10, 1083801. [Google Scholar] [CrossRef]
  145. Zhang, M.; Chunara, R. Fair contrastive pre-training for geographic image segmentation. arXiv 2023, arXiv:2211.08672. [Google Scholar]
  146. Deus, D. Integration of ALOS PALSAR and Landsat Data for Land Cover and Forest Mapping in Northern Tanzania. Land 2016, 5, 43. [Google Scholar] [CrossRef]
  147. Peña, F.J.; Hübinger, C.; Payberah, A.H.; Jaramillo, F. DeepAqua: Self-Supervised Semantic Segmentation of Wetlands from SAR Images using Knowledge Distillation. arXiv 2023, arXiv:2305.01698. [Google Scholar]
  148. Fan, Y.; Zeng, Q.; Mei, Z.; Hu, W. Semantic Segmentation for Mangrove Using Spectral Indices and Self-Attention Mechanism. In Proceedings of the 2022 7th International Conference on Signal and Image Processing (ICSIP), Suzhou, China, 20–22 July 2022; pp. 436–441. [Google Scholar] [CrossRef]
  149. Xie, Y.; Li, Z.; Bao, H.; Jia, X.; Xu, D.; Zhou, X.; Skakun, S. Auto-CM: Unsupervised deep learning for satellite imagery composition and cloud masking using spatio-temporal dynamics. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 14575–14583. [Google Scholar]
  150. Hays, J.; Efros, A.A. Large-Scale Image Geolocalization. In Multimodal Location Estimation of Videos and Images; Choi, J., Friedland, G., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 41–62. [Google Scholar] [CrossRef]
  151. Li, W.; Chen, K.; Chen, H.; Shi, Z. Geographical Knowledge-Driven Representation Learning for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  152. Song, Z.; Yang, X.; Xu, Z.; King, I. Graph-Based Semi-Supervised Learning: A Comprehensive Review. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–21. [Google Scholar] [CrossRef]
  153. Aromal, M.A.; Rasool, A. Semi Supervised Learning Using Graph Data Structure—A Review. In Proceedings of the 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; pp. 894–899. [Google Scholar] [CrossRef]
  154. Ouali, Y.; Hudelot, C.; Tami, M. Semi-Supervised Semantic Segmentation with Cross-Consistency Training. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12671–12681. [Google Scholar] [CrossRef]
  155. Wang, C.; Tang, X.; Li, L.; Tian, B.; Zhou, Y.; Shi, J. IDN: Inner-class dense neighbours for semi-supervised learning-based remote sensing scene classification. Remote Sens. Lett. 2023, 14, 80–90. [Google Scholar] [CrossRef]
  156. Tong, R.; Li, P.; Gao, L.; Lang, X.; Miao, A.; Shen, X. A Novel Ellipsoidal Semisupervised Extreme Learning Machine Algorithm and Its Application in Wind Turbine Blade Icing Fault Detection. IEEE Trans. Instrum. Meas. 2022, 71, 1–16. [Google Scholar] [CrossRef]
  157. Im, D.J.; Taylor, G.W. Semisupervised Hyperspectral Image Classification via Neighborhood Graph Learning. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1913–1917. [Google Scholar] [CrossRef]
  158. Von Kügelgen, J.; Mey, A.; Loog, M.; Schölkopf, B. Semi-Supervised Learning, Causality and the Conditional Cluster Assumption. arXiv 2020, arXiv:1905.12081. [Google Scholar]
  159. Wang, Y.; Meng, Y.; Fu, Z.; Xue, H. Towards safe semi-supervised classification: Adjusted cluster assumption via clustering. Neural Process. Lett. 2017, 46, 1031–1042. [Google Scholar] [CrossRef]
  160. Zhang, W.; Feng, X.; Chen, Y. A Manifold Laplacian Regularized Semi-Supervised Sparse Image Classification Method with a Variant Trace Lasso Norm. IEEE Access 2020, 8, 97361–97369. [Google Scholar] [CrossRef]
  161. Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Label Propagation for Deep Semi-Supervised Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5065–5074. [Google Scholar] [CrossRef]
162. All in One Article! A Comprehensive Survey of Weakly Supervised Semantic/Instance/Panoptic Segmentation. 2023. Available online: https://developer.aliyun.com/article/1142964 (accessed on 14 October 2023).
  163. Grandvalet, Y.; Bengio, Y. Semi-supervised learning by entropy minimization. Adv. Neural Inf. Process. Syst. 2004, 17, 529–536. [Google Scholar]
164. Belkin, M.; Niyogi, P.; Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 2006, 7, 2399–2434. [Google Scholar]
  165. Wang, J.; Ding, H.Q.C.; Chen, S.; He, C.; Luo, B. Semi-Supervised Remote Sensing Image Semantic Segmentation via Consistency Regularization and Average Update of Pseudo-Label. Remote Sens. 2020, 12, 3603. [Google Scholar] [CrossRef]
  166. Li, L.; Zhang, W.; Zhang, X.; Emam, M.; Jing, W. Semi-Supervised Remote Sensing Image Semantic Segmentation Method Based on Deep Learning. Electronics 2023, 12, 348. [Google Scholar] [CrossRef]
  167. Li, Q.; Shi, Y.; Zhu, X.X. Semi-Supervised Building Footprint Generation with Feature and Output Consistency Training. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  168. Zhang, B.; Zhang, Y.; Li, Y.; Wan, Y.; Wen, F. Semi-Supervised Semantic Segmentation Network via Learning Consistency for Remote Sensing Land-Cover Classification. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, V-2-2020, 609–615. [Google Scholar] [CrossRef]
  169. He, Y.; Wang, J.; Liao, C.; Zhou, X.; Shan, B. MS4D-Net: Multitask-Based Semi-Supervised Semantic Segmentation Framework with Perturbed Dual Mean Teachers for Building Damage Assessment from High-Resolution Remote Sensing Imagery. Remote Sens. 2023, 15, 478. [Google Scholar] [CrossRef]
  170. Zhang, B.; Zhang, Y.; Li, Y.; Wan, Y.; Guo, H.; Zheng, Z.; Yang, K. Semi-Supervised Deep learning via Transformation Consistency Regularization for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 5782–5796. [Google Scholar] [CrossRef]
  171. Wang, J.; Zhao, J.; Sun, H.; Lu, X.; Huang, J.; Wang, S.; Fang, G. Satellite Remote Sensing Identification of Discolored Standing Trees for Pine Wilt Disease Based on Semi-Supervised Deep Learning. Remote Sens. 2022, 14, 936. [Google Scholar] [CrossRef]
  172. Desai, S.; Ghose, D. Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 1485–1495. [Google Scholar] [CrossRef]
  173. Zhang, L.; Lu, W.; Zhang, J.; Wang, H. A Semisupervised Convolution Neural Network for Partial Unlabeled Remote-Sensing Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  174. Liang, C.; Cheng, B.; Xiao, B.; He, C.; Liu, X.; Jia, N.; Chen, J. Semi-/Weakly-Supervised Semantic Segmentation Method and Its Application for Coastal Aquaculture Areas Based on Multi-Source Remote Sensing Images—Taking the Fujian Coastal Area (Mainly Sanduo) as an Example. Remote Sens. 2021, 13, 1083. [Google Scholar] [CrossRef]
  175. Kerdegari, H.; Razaak, M.; Argyriou, V.; Remagnino, P. Urban scene segmentation using semi-supervised GAN. In Proceedings of the Image and Signal Processing for Remote Sensing XXV, Strasbourg, France, 9–11 September 2019; Bruzzone, L., Bovolo, F., Eds.; International Society for Optics and Photonics, SPIE: San Francisco, CA, USA, 2019; Volume 11155, p. 111551H. [Google Scholar] [CrossRef]
176. Nie, W.; Gou, P.; Liu, Y.; Zhou, T.; Xu, N.; Wang, P.; Du, Q. A semi-supervised image segmentation method based on generative adversarial network. In Proceedings of the 2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 17–19 June 2022; Volume 10, pp. 1217–1223. [Google Scholar] [CrossRef]
  177. Wang, Y.; Tsai, Y.H.; Hung, W.C.; Ding, W.; Liu, S.; Yang, M.H. Semi-supervised Multi-task Learning for Semantics and Depth. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 2663–2672. [Google Scholar]
  178. Chakravarthy, A.S.; Sinha, S.; Narang, P.; Mandal, M.; Chamola, V.; Yu, F.R. DroneSegNet: Robust Aerial Semantic Segmentation for UAV-Based IoT Applications. IEEE Trans. Veh. Technol. 2022, 71, 4277–4286. [Google Scholar] [CrossRef]
  179. Sun, X.; Shi, A.; Huang, H.; Mayer, H. BAS4Net: Boundary-aware semi-supervised semantic segmentation network for very high resolution remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5398–5413. [Google Scholar] [CrossRef]
  180. Castillo-Navarro, J.; Saux, B.L.; Boulch, A.; Lefèvre, S. On Auxiliary Losses for Semi-Supervised Semantic Segmentation. In Proceedings of the ECML PKDD 2020: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Ghent, Belgium, 14–18 September 2020. [Google Scholar]
  181. Chendeb El Rai, M.; Giraldo, J.H.; Al-Saad, M.; Darweech, M.; Bouwmans, T. SemiSegSAR: A Semi-Supervised Segmentation Algorithm for Ship SAR Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  182. Li, F.; Clausi, D.A.; Xu, L.; Wong, A. ST-IRGS: A Region-Based Self-Training Algorithm Applied to Hyperspectral Image Classification and Segmentation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3–16. [Google Scholar] [CrossRef]
  183. Xie, D.; Yang, R.; Qiao, Y.; Zhang, J. Intelligent Identification of Landslide Based on Deep Semi-supervised Learning. In Proceedings of the 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; pp. 264–269. [Google Scholar] [CrossRef]
  184. Schmitt, M.; Prexl, J.; Ebel, P.; Liebel, L.; Zhu, X.X. Weakly Supervised Semantic Segmentation of Satellite Images for Land Cover Mapping—Challenges and Opportunities. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, V-3-2020, 795–802. [Google Scholar] [CrossRef]
  185. Lenczner, G.; Chan-Hon-Tong, A.; Luminari, N.; Le Saux, B. Weakly-Supervised Continual Learning for Class-Incremental Segmentation. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 4843–4846. [Google Scholar] [CrossRef]
  186. Zhu, X.; Xu, M.; Wu, M.; Zhang, C.; Zhang, B. Annotating Only at Definite Pixels: A Novel Weakly Supervised Semantic Segmentation Method for Sea Fog Recognition. In Proceedings of the 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP), Suzhou, China, 13–16 December 2022; pp. 1–5. [Google Scholar] [CrossRef]
  187. Lu, M.; Fang, L.; Li, M.; Zhang, B.; Zhang, Y.; Ghamisi, P. NFANet: A Novel Method for Weakly Supervised Water Extraction from High-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  188. Zhang, W.; Tang, P.; Corpetti, T.; Zhao, L. WTS: A Weakly towards Strongly Supervised Learning Framework for Remote Sensing Land Cover Classification Using Segmentation Models. Remote Sens. 2021, 13, 394. [Google Scholar] [CrossRef]
  189. Vernaza, P.; Chandraker, M. Learning Random-Walk Label Propagation for Weakly-Supervised Semantic Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2953–2961. [Google Scholar] [CrossRef]
  190. Wei, Y.; Ji, S. Scribble-Based Weakly Supervised Deep Learning for Road Surface Extraction From Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  191. Khoreva, A.; Benenson, R.; Omran, M.; Hein, M.; Schiele, B. Weakly Supervised Object Boundaries. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 183–192. [Google Scholar] [CrossRef]
  192. Rafique, M.U.; Jacobs, N. Weakly Supervised Building Segmentation from Aerial Images. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019. [Google Scholar]
  193. Oh, Y.; Kim, B.; Ham, B. Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  194. Li, Z.; Zhang, X.; Xiao, P.; Zheng, Z. On the Effectiveness of Weakly Supervised Semantic Segmentation for Building Extraction From High-Resolution Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3266–3281. [Google Scholar] [CrossRef]
  195. Zhou, Y.; Wang, H.; Yang, R.; Yao, G.; Xu, Q.; Zhang, X. A Novel Weakly Supervised Remote Sensing Landslide Semantic Segmentation Method: Combining CAM and cycleGAN Algorithms. Remote Sens. 2022, 14, 3650. [Google Scholar] [CrossRef]
  196. Xie, H.; Lin, S.F. A Weakly Supervised Defect Detection Based on Dual Path Networks and GMA-CAM. In Proceedings of the International Conference on Image and Graphics, Haikou, China, 6–8 August 2021. [Google Scholar]
  197. Saleh, F.S.; Aliakbarian, M.S.; Salzmann, M.; Petersson, L.; Alvarez, J.M. Bringing Background into the Foreground: Making All Classes Equal in Weakly-supervised Video Semantic Segmentation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  198. Yan, X.; Shen, L.; Wang, J.; Deng, X.; Li, Z. MSG-SR-Net: A Weakly Supervised Network Integrating Multiscale Generation and Superpixel Refinement for Building Extraction From High-Resolution Remotely Sensed Imageries. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1012–1023. [Google Scholar] [CrossRef]
  199. He, W.; Jiang, Z.; Kriby, M.; Xie, Y.; Jia, X.; Yan, D.; Zhou, Y. Quantifying and Reducing Registration Uncertainty of Spatial Vector Labels on Earth Imagery. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 14–18 August 2022; KDD ’22, pp. 554–564. [Google Scholar] [CrossRef]
  200. Xu, J.; Schwing, A.G.; Urtasun, R. Learning to segment under various forms of weak supervision. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3781–3790. [Google Scholar] [CrossRef]
  201. Xia, W.; Zhong, N.; Geng, D.; Luo, L. A weakly supervised road extraction approach via deep convolutional nets based image segmentation. In Proceedings of the 2017 International Workshop on Remote Sensing with Intelligent Processing (RSIP), Shanghai, China, 19–21 May 2017. [Google Scholar]
  202. Mazhar, S.; Sun, G.; Bilal, A.; Hassan, B.; Li, Y.; Zhang, J.; Lin, Y.; Khan, A.; Ahmed, R.; Hassan, T. AUnet: A Deep Learning Framework for Surface Water Channel Mapping Using Large-Coverage Remote Sensing Images and Sparse Scribble Annotations from OSM Data. Remote Sens. 2022, 14, 3283. [Google Scholar] [CrossRef]
  203. Moliner, E.; Romero, L.S.; Vilaplana, V. Weakly Supervised Semantic Segmentation For Remote Sensing Hyperspectral Imaging. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar]
  204. Dvořák, J.; Potůčková, M.; Treml, V. Weakly supervised learning for treeline ecotone classification based on aerial orthoimages and an ancillary dsm. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, V-3-2022, 33–38. [Google Scholar] [CrossRef]
  205. Robinson, C.; Malkin, K.; Hu, L.; Dilkina, B.; Jojic, N. Weakly Supervised Semantic Segmentation in the 2020 IEEE GRSS Data Fusion Contest. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020. [Google Scholar]
  206. Saleh, F.; Aliakbarian, M.S.; Salzmann, M.; Petersson, L.; Alvarez, J.M.; Gould, S. Incorporating Network Built-in Priors in Weakly-supervised Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1382–1396. [Google Scholar] [CrossRef]
  207. Han, Z.; Xiao, Z.; Yu, M. Weakly supervised semantic segmentation using fore-background priors. SPIE 2017, 10420, 1049–1056. [Google Scholar]
208. Li, W.; Li, F.; Luo, Y.; Wang, P.; Sun, J. Deep Domain Adaptive Object Detection: A Survey. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, Australia, 1–4 December 2020; pp. 1808–1813. [Google Scholar] [CrossRef]
  209. Gao, K.; Yu, A.; You, X.; Guo, W.; Li, K.; Huang, N. Integrating Multiple Sources Knowledge for Class Asymmetry Domain Adaptation Segmentation of Remote Sensing Images. arXiv 2023, arXiv:2305.09893. [Google Scholar]
  210. Hoyer, L.; Dai, D.; Wang, H.; Van Gool, L. MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 11721–11732. [Google Scholar] [CrossRef]
  211. Lu, X.; Gong, T.; Zheng, X. Multisource Compensation Network for Remote Sensing Cross-Domain Scene Classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2504–2515. [Google Scholar] [CrossRef]
  212. Von Kügelgen, J.; Sharma, Y.; Gresele, L.; Brendel, W.; Schölkopf, B.; Besserve, M.; Locatello, F. Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. arXiv 2022, arXiv:2106.04619. [Google Scholar]
213. Cheng, Y.; Wei, F.; Bao, J.; Chen, D.; Wen, F.; Zhang, W. Dual Path Learning for Domain Adaptation of Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9062–9071. [Google Scholar] [CrossRef]
214. Li, Z.; Xie, Y.; Jia, X.; Stuart, K.; Delaire, C.; Skakun, S. Point-to-Region Co-learning for Poverty Mapping at High Resolution Using Satellite Imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
  215. Mateo-García, G.; Laparra, V.; López-Puigdollers, D.; Gómez-Chova, L. Cross-Sensor Adversarial Domain Adaptation of Landsat-8 and Proba-V Images for Cloud Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 747–761. [Google Scholar] [CrossRef]
  216. Banerjee, B.; Bovolo, F.; Bhattacharya, A.; Bruzzone, L.; Chaudhuri, S.; Buddhiraju, K.M. A Novel Graph-Matching-Based Approach for Domain Adaptation in Classification of Remote Sensing Image Pair. IEEE Trans. Geosci. Remote Sens. 2015, 53, 4045–4062. [Google Scholar] [CrossRef]
  217. Cermelli, F.; Mancini, M.; Buló, S.R.; Ricci, E.; Caputo, B. Modeling the Background for Incremental and Weakly-Supervised Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 10099–10113. [Google Scholar] [CrossRef] [PubMed]
  218. Iqbal, J.; Ali, M. Weakly-supervised domain adaptation for built-up region segmentation in aerial and satellite imagery. ISPRS J. Photogramm. Remote Sens. 2020, 167, 263–275. [Google Scholar] [CrossRef]
  219. Xie, X.; Chen, J.; Li, Y.; Shen, L.; Ma, K.; Zheng, Y. Self-Supervised CycleGAN for Object-Preserving Image-to-Image Domain Adaptation. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 498–513. [Google Scholar]
  220. Wen, S.; Tian, W.; Zhang, H.; Fan, S.; Li, X. Semantic Segmentation Using a GAN and a Weakly Supervised Method Based on Deep Transfer Learning. IEEE Access 2020, 8, 176480–176494. [Google Scholar] [CrossRef]
  221. Adayel, R.; Bazi, Y.; Alhichri, H.; Alajlan, N. Deep Open-Set Domain Adaptation for Cross-Scene Classification based on Adversarial Learning and Pareto Ranking. Remote Sens. 2020, 12, 1716. [Google Scholar] [CrossRef]
  222. Zhao, X.; Zhang, M.; Tao, R.; Li, W.; Liao, W.; Philips, W. Cross-Domain Classification of Multisource Remote Sensing Data Using Fractional Fusion and Spatial-Spectral Domain Adaptation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5721–5733. [Google Scholar] [CrossRef]
  223. Teng, W.; Wang, N.; Shi, H.; Liu, Y.; Wang, J. Classifier-Constrained Deep Adversarial Domain Adaptation for Cross-Domain Semisupervised Classification in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2020, 17, 789–793. [Google Scholar] [CrossRef]
  224. Deng, X.; Zhu, Y.; Tian, Y.; Newsam, S. Scale Aware Adaptation for Land-Cover Classification in Remote Sensing Imagery. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Virtual Conference, 5–9 January 2021; pp. 2159–2168. [Google Scholar] [CrossRef]
  225. Iqbal, J.; Ali, M. MLSL: Multi-Level Self-Supervised Learning for Domain Adaptation with Spatially Independent and Semantically Consistent Labeling. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1853–1862. [Google Scholar] [CrossRef]
  226. Gao, K.; Yu, A.; You, X.; Qiu, C.; Liu, B. Prototype and Context-Enhanced Learning for Unsupervised Domain Adaptation Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  227. Li, Y.; Shi, T.; Zhang, Y.; Chen, W.; Wang, Z.; Li, H. Learning deep semantic segmentation network under multiple weakly-supervised constraints for cross-domain remote sensing image semantic segmentation. ISPRS J. Photogramm. Remote Sens. 2021, 175, 20–33. [Google Scholar] [CrossRef]
  228. Chen, Y.; Wei, C.; Wang, D.; Ji, C.; Li, B. Semi-supervised contrastive learning for few-shot segmentation of remote sensing images. Remote Sens. 2022, 14, 4254. [Google Scholar] [CrossRef]
  229. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  230. Few-Shot Learning. 2021. Available online: https://blog.csdn.net/weixin_44211968/article/details/121314757 (accessed on 11 October 2023).
231. Yu, X.; Wu, X.; Luo, C.; Ren, P. Deep learning in remote sensing scene classification: A data augmentation enhanced convolutional neural network framework. GISci. Remote Sens. 2017, 54, 741–758. [Google Scholar] [CrossRef]
  232. Li, W.; Chen, C.; Zhang, M.; Li, H.; Du, Q. Data Augmentation for Hyperspectral Image Classification with Deep CNN. IEEE Geosci. Remote Sens. Lett. 2019, 16, 593–597. [Google Scholar] [CrossRef]
  233. Chen, X.; Kamata, S.I.; Zhou, W. Hyperspectral Image Classification Based on Multi-stage Vision Transformer with Stacked Samples. In Proceedings of the TENCON 2021—2021 IEEE Region 10 Conference (TENCON), Auckland, New Zealand, 7–10 December 2021; pp. 441–446. [Google Scholar] [CrossRef]
  234. Huang, L.; Chen, Y. Dual-Path Siamese CNN for Hyperspectral Image Classification With Limited Training Samples. IEEE Geosci. Remote Sens. Lett. 2021, 18, 518–522. [Google Scholar] [CrossRef]
  235. Ramirez Rochac, J.F.; Zhang, N.; Thompson, L.; Oladunni, T. A Data Augmentation-Assisted Deep Learning Model for High Dimensional and Highly Imbalanced Hyperspectral Imaging Data. In Proceedings of the 2019 9th International Conference on Information Science and Technology (ICIST), Kopaonik, Serbia, 10–13 March 2019; pp. 362–367. [Google Scholar] [CrossRef]
  236. Lv, F.; Liu, H.; Wang, Y.; Zhao, J.; Yang, G. Learning Unbiased Zero-Shot Semantic Segmentation Networks via Transductive Transfer. IEEE Signal Process. Lett. 2020, 27, 1640–1644. [Google Scholar] [CrossRef]
  237. Parnami, A.; Lee, M. Learning from Few Examples: A Summary of Approaches to Few-Shot Learning. arXiv 2022, arXiv:2203.04291. [Google Scholar]
  238. Koch, G.R. Siamese Neural Networks for One-Shot Image Recognition. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2015. [Google Scholar]
239. Zhang, J.; Chen, Z.; Huang, J.; Zhuang, J.; Zhang, D. Few-Shot Domain Adaptation for Semantic Segmentation. In Proceedings of the ACM Turing Celebration Conference—China (ACM TURC ’19), Chengdu, China, 17–19 May 2019; Association for Computing Machinery: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
  240. Tuia, D.; Munoz-Mari, J.; Gomez-Chova, L.; Malo, J. Graph Matching for Adaptation in Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2013, 51, 329–341. [Google Scholar] [CrossRef]
  241. Kim, D.; Kim, J.; Cho, S.; Luo, C.; Hong, S. Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching. arXiv 2023, arXiv:2303.14969. [Google Scholar]
  242. Kwon, H.; Jeong, S.; Kim, S.; Sohn, K. Dual Prototypical Contrastive Learning for Few-Shot Semantic Segmentation. arXiv 2021, arXiv:2111.04982. [Google Scholar]
243. Mao, Y.; Guo, Z.; Lu, X.; Yuan, Z.; Guo, H. Bidirectional Feature Globalization for Few-shot Semantic Segmentation of 3D Point Cloud Scenes. In Proceedings of the 2022 International Conference on 3D Vision (3DV), Prague, Czech Republic, 12–15 September 2022; pp. 505–514. [Google Scholar] [CrossRef]
  244. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
  245. Wang, B.; Wang, Z.; Sun, X.; Wang, H.; Fu, K. DMML-Net: Deep Metametric Learning for Few-Shot Geographic Object Segmentation in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  246. Tang, H.; Li, Y.; Han, X.; Huang, Q.; Xie, W. A Spatial–Spectral Prototypical Network for Hyperspectral Remote Sensing Image. IEEE Geosci. Remote Sens. Lett. 2020, 17, 167–171. [Google Scholar] [CrossRef]
  247. Zhang, Y.; Sidibé, D.; Morel, O.; Meriaudeau, F. Incorporating Depth Information into Few-Shot Semantic Segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3582–3588. [Google Scholar] [CrossRef]
  248. Jiang, X.; Zhou, N.; Li, X. Few-Shot Segmentation of Remote Sensing Images Using Deep Metric Learning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  249. P, G.; Verma, U. Texture based Prototypical Network for Few-Shot Semantic Segmentation of Forest Cover: Generalizing for Different Geographical Regions. arXiv 2022, arXiv:2203.15687. [Google Scholar]
  250. Wang, Z.; Jiang, Z.; Yuan, Y. Prototype Queue Learning for Multi-Class Few-Shot Semantic Segmentation. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1721–1725. [Google Scholar] [CrossRef]
  251. Cheng, G.; Lang, C.; Han, J. Holistic Prototype Activation for Few-Shot Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4650–4666. [Google Scholar] [CrossRef]
252. Wu, Z.; Shi, X.; Lin, G.; Cai, J. Learning Meta-class Memory for Few-Shot Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 497–506. [Google Scholar] [CrossRef]
  253. Tian, P.; Wu, Z.; Qi, L.; Wang, L.; Shi, Y.; Gao, Y. Differentiable meta-learning model for few-shot semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12087–12094. [Google Scholar]
  254. Xie, Y.; Chen, W.; He, E.; Jia, X.; Bao, H.; Zhou, X.; Ghosh, R.; Ravirathinam, P. Harnessing heterogeneity in space with statistically guided meta-learning. Knowl. Inf. Syst. 2023, 65, 2699–2729. [Google Scholar] [CrossRef]
  255. Chen, D.; Chen, Y.; Li, Y.; Mao, F.; He, Y.; Xue, H. Self-Supervised Learning for Few-Shot Image Classification. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1745–1749. [Google Scholar] [CrossRef]
  256. Chen, X.; Zhang, C.; Lin, G.; Han, J. Compositional prototype network with multi-view comparision for few-shot point cloud semantic segmentation. arXiv 2020, arXiv:2012.14255. [Google Scholar]
  257. Zhao, N.; Chua, T.S.; Lee, G.H. Few-shot 3d point cloud semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8873–8882. [Google Scholar]
  258. Rao, M.; Tang, P.; Zhang, Z. Spatial–Spectral Relation Network for Hyperspectral Image Classification with Limited Training Samples. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 5086–5100. [Google Scholar] [CrossRef]
  259. Kemker, R.; Luu, R.; Kanan, C. Low-Shot Learning for the Semantic Segmentation of Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6214–6223. [Google Scholar] [CrossRef]
  260. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
  261. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2022, arXiv:2108.07258. [Google Scholar]
  262. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  263. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  264. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
  265. Kouw, W.M.; Loog, M. A Review of Domain Adaptation without Target Labels. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 766–785. [Google Scholar] [CrossRef]
  266. Toldo, M.; Maracani, A.; Michieli, U.; Zanuttigh, P. Unsupervised Domain Adaptation in Semantic Segmentation: A Review. Technologies 2020, 8, 35. [Google Scholar] [CrossRef]
  267. Zhao, S.; Yue, X.; Zhang, S.; Li, B.; Zhao, H.; Wu, B.; Krishna, R.; Gonzalez, J.E.; Sangiovanni-Vincentelli, A.L.; Seshia, S.A.; et al. A Review of Single-Source Deep Unsupervised Visual Domain Adaptation. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 473–493. [Google Scholar] [CrossRef]
  268. Maharana, K.; Mondal, S.; Nemade, B. A review: Data pre-processing and data augmentation techniques. Glob. Transitions Proc. 2022, 3, 91–99. [Google Scholar] [CrossRef]
  269. Zhang, C.; Zhang, C.; Zheng, S.; Qiao, Y.; Li, C.; Zhang, M.; Dam, S.K.; Thwal, C.M.; Tun, Y.L.; Huy, L.L.; et al. A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need? arXiv 2023, arXiv:2303.11717. [Google Scholar]
270. Xu, M.; Du, H.; Niyato, D.; Kang, J.; Xiong, Z.; Mao, S.; Han, Z.; Jamalipour, A.; Kim, D.I.; Shen, X.; et al. Unleashing the Power of Edge-Cloud Generative AI in Mobile Networks: A Survey of AIGC Services. arXiv 2023, arXiv:2303.16129. [Google Scholar]
  271. Cao, Y.; Li, S.; Liu, Y.; Yan, Z.; Dai, Y.; Yu, P.S.; Sun, L. A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT. arXiv 2023, arXiv:2303.04226. [Google Scholar]
  272. Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv 2015, arXiv:1503.03585. [Google Scholar]
  273. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. arXiv 2021, arXiv:2105.05233. [Google Scholar]
274. LeCun, Y. A Path Towards Autonomous Machine Intelligence, Version 0.9.2, 2022-06-27. Open Review 2022, 62. [Google Scholar]
  275. Ye, D.; Peng, J.; Li, H.; Bruzzone, L. Better Memorization, Better Recall: A Lifelong Learning Framework for Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
Figure 1. Number of papers published from 2018 to 2023 that address the semantic segmentation of RSIs with deep learning methods.
Figure 2. Overview of the content of this study. This survey summarizes small-sample approaches to semantic segmentation, including self-supervised, semi-supervised, and weakly supervised methods; few-shot methods; and models for domain adaptation. It covers commonly used methods and their classifications, popular modules, and methods with prior knowledge constraints.
Figure 3. Technical flow diagram of the self-supervised methods [108].
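To make the pre-training stage of this flow concrete, the sketch below implements a SimCLR-style NT-Xent contrastive loss in PyTorch. It is a minimal illustration, not the implementation of any specific method surveyed here; the encoder, projection head, and augmentation pipeline that would produce z1 and z2 are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent contrastive loss over two augmented views.

    z1, z2: (N, D) projected embeddings of the same N images under two
    random augmentations. Matching rows form positive pairs; all other
    rows in the batch serve as negatives.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2N, D), unit norm
    sim = torch.matmul(z, z.T) / temperature              # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))                     # exclude self-similarity
    # Row i's positive sits at row i+N (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

In a pre-training loop, z1 and z2 would be the projected embeddings of two independent augmentations of the same mini-batch; after pre-training, the encoder is fine-tuned on the small labeled set.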
Figure 4. Differences in resolution, spectral bands, texture, and color among remote sensing images acquired with different sensors, imaging angles, and ground reflectance. These data clearly differ from natural images.
Figure 5. Self-supervised learning methods used for semantic segmentation, which are detailed below.
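One branch of this taxonomy, masked image modeling (e.g., BEiT [137], MAE [139], SimMIM [140]), reduces to a simple recipe: mask a large fraction of image patches, then regress the missing pixels. The skeleton below is a hedged, SimMIM-flavored sketch with a stand-in patch embedding and a tiny transformer backbone; it illustrates the masking-and-loss idea only, not any cited architecture.

```python
import torch
import torch.nn as nn

class TinyMIM(nn.Module):
    """Minimal masked-image-modeling skeleton (illustrative only)."""
    def __init__(self, dim: int, patch_pixels: int):
        super().__init__()
        self.embed = nn.Linear(patch_pixels, dim)            # stand-in patch embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, patch_pixels)             # pixel-regression head

    def forward(self, patches: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
        # patches: (B, N, P) flattened image patches
        b, n, _ = patches.shape
        tokens = self.embed(patches)
        mask = torch.rand(b, n, device=patches.device) < mask_ratio   # True = masked
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand(b, n, -1), tokens)
        pred = self.head(self.backbone(tokens))
        # Reconstruction loss is computed on masked patches only.
        return ((pred - patches) ** 2)[mask].mean()
```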
Figure 7. Technical flow diagram of the weakly supervised methods [162].
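A recurring building block in this weakly supervised flow is the class activation map (CAM), which several of the methods above (e.g., [194,195,196]) use to turn image-level labels into coarse localization seeds. The function below is a generic, illustrative CAM computation for a classifier ending in global average pooling followed by a linear layer; the feature extractor producing `features` is assumed, and the function is not taken from any cited work.

```python
import torch

@torch.no_grad()
def class_activation_map(features: torch.Tensor,
                         fc_weight: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    """Compute a CAM from the last conv features of a GAP+linear classifier.

    features:  (C, H, W) feature maps before global average pooling
    fc_weight: (num_classes, C) weight matrix of the final linear layer
    Returns an (H, W) map, min-max normalized to [0, 1].
    """
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], features)
    cam = cam.clamp(min=0)                                  # keep positive evidence only
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```

Thresholding such a map yields pseudo-masks that can seed a segmentation network, as in the weakly-towards-strongly supervised framework of [188].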
Figure 10. Technical flow diagram of the few-shot learning methods [230].
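Many of the prototype-based few-shot segmentation methods cited here (e.g., PANet [244] and its remote sensing adaptations [245,248]) share one core step: masked average pooling of support features into class prototypes, followed by nearest-prototype labeling of query pixels. The sketch below shows one such episode under the assumption that feature maps and masks share the same resolution; it is illustrative rather than a faithful re-implementation of any cited model.

```python
import torch
import torch.nn.functional as F

def prototype_segment(support_feat: torch.Tensor, support_mask: torch.Tensor,
                      query_feat: torch.Tensor, num_classes: int) -> torch.Tensor:
    """One few-shot segmentation episode via masked average pooling.

    support_feat: (S, C, H, W) support features; support_mask: (S, H, W) int labels
    query_feat:   (Q, C, H, W) query features
    Returns (Q, num_classes, H, W) cosine-similarity logits.
    """
    protos = []
    for k in range(num_classes):
        m = (support_mask == k).unsqueeze(1).float()          # (S, 1, H, W)
        # Average features over all pixels of class k across the support set.
        protos.append((support_feat * m).sum(dim=(0, 2, 3)) / (m.sum() + 1e-8))
    protos = F.normalize(torch.stack(protos), dim=1)          # (K, C) unit prototypes
    q = F.normalize(query_feat, dim=1)                        # (Q, C, H, W)
    # Cosine similarity between every query pixel and each prototype.
    return torch.einsum('kc,qchw->qkhw', protos, q)
```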
Table 1. Commonly used deep network models proposed from 2015 to 2023. These methods reflect solutions to the challenges deep learning faced at different stages of its development: deepening networks without overfitting, reducing the memory footprint, refining the network architecture, and designing models for specific downstream tasks. The studies mentioned above are cited in Section 1 and Section 8.
Time | Model | Method
2015 | ResNet | The first residual network; an effective strategy for deepening the network hierarchy.
2015 | VGG | Deepens networks using small convolutional kernels.
2015 | GoogLeNet | Balances complexity and performance at greater network depth.
2015 | DeepLabv1 | Adds a CRF for post-processing to refine pixel-level predictions.
2015 | FCN | Replaces fully connected layers with convolutional layers so the network can perform semantic segmentation.
2015 | UNet | Fuses shallow and deep features through skip connections.
2016 | SegNet | Passes the encoder's pooling indices to the decoder to reduce computation.
2016 | DeepLabv2 | Adds atrous convolution during upsampling to expand the receptive field.
2017 | FPN | Builds a pyramid structure that fully exploits multi-scale features and fuses feature maps of multiple resolutions.
2017 | PSPNet | Makes the most of multi-scale information to obtain rich semantic context.
2017 | DeepLabv3 | Introduces the ASPP module to fully capture contextual background information.
2018 | DenseNet | Reduces overfitting through dense connections that link each feature map to all subsequent ones.
2018 | DeepLabv3+ | Designs a simple and effective decoder structure with a particular focus on edge information.
2018 | GeoSeg | Encapsulates popular networks and documents the advantages and disadvantages of different models for modification.
2019 | UPerNet | A commonly used decoder structure that extracts global features and texture structure well.
2019 | HRNet | Progressively adds high-to-low-resolution sub-networks, forming a model in which multiple stages repeatedly fuse scales.
2019 | MAPNet | Constructs parallel branches and integrates features of multiple resolutions.
2020 | EfficientNet | Jointly scales depth, width, and resolution to achieve better feature extraction.
2020 | FarSeg | Addresses the imbalance between background and foreground by modeling the foreground to extract target objects while suppressing background information.
2021 | Swin Transformer | Computes self-attention within local windows and passes information between them via shifted windows.
2021 | MobileViT | A lightweight transformer structure that controls the number of tokens for acceleration.
2022 | VAN | Combines the advantages of CNNs and Transformers, attending to both detailed and global information.
2022 | ConvNeXt | A CNN that improves on ResNet and achieves very good semantic segmentation results.
2023 | SAM | Pre-training on large-scale datasets further improves the predictive power of the method.
2023 | CLIP | Adds textual information to help the model learn semantics and better accomplish semantic segmentation.
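To make two recurring ideas from Table 1 concrete, the sketch below pairs a ResNet-style residual block with a UNet-style skip connection. Both are generic textbook forms written in PyTorch, not the exact blocks of the cited models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """ResNet-style block: output = F(x) + x, which eases the optimization of deep networks."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + x)   # identity shortcut

def unet_skip(decoder_feat: torch.Tensor, encoder_feat: torch.Tensor) -> torch.Tensor:
    """UNet-style skip: upsample the deep feature and concatenate the shallow one."""
    up = F.interpolate(decoder_feat, size=encoder_feat.shape[-2:],
                       mode='bilinear', align_corners=False)
    return torch.cat([up, encoder_feat], dim=1)
```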
Table 3. Advantages and disadvantages of deep learning methods for semantic segmentation in remote sensing with small data.
Methods | Annotation Effort | Cross-Domain Generalization Capability/Performance | Training Phase | Essentials of Model Learning
Self-supervised | Low (the pre-training phase needs no labeled samples; a small labeled set is required for fine-tuning) | Better generalization capability through fine-tuning | Fine-tuning is a necessary step | Inputs a certain number of images per mini-batch
Semi-supervised | Low (partially labeled samples are required) | No cross-domain generalization capability | Fine-tuning is commonly used to obtain better quality | Inputs a certain number of images per mini-batch
Weakly supervised | Low (incomplete, inexact, and inaccurate labels are commonly used) | No cross-domain generalization capability | Fine-tuning is commonly used to achieve better quality | Inputs a certain number of images per mini-batch
Few-shot | Low (a large number of labeled samples are required up front) | Better generalization capability through fine-tuning | Fine-tuning is a necessary step | Trains by task or episode
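As a concrete reading of the semi-supervised row in Table 3, the sketch below mixes a supervised loss on a labeled mini-batch with a confidence-filtered pseudo-label loss on an unlabeled one, a common self-training pattern. The model, confidence threshold, and loss weighting are illustrative assumptions, not a specific method from this survey.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(model, x_l, y_l, x_u, conf_thresh=0.9, unlabeled_weight=1.0):
    """One semi-supervised segmentation step with confidence-filtered pseudo-labels.

    x_l: (B, 3, H, W) labeled images; y_l: (B, H, W) ground-truth class masks
    x_u: (B, 3, H, W) unlabeled images
    """
    sup_loss = F.cross_entropy(model(x_l), y_l)

    with torch.no_grad():
        probs = F.softmax(model(x_u), dim=1)             # (B, K, H, W)
        conf, pseudo = probs.max(dim=1)                  # per-pixel confidence and label
        pseudo[conf < conf_thresh] = 255                 # mark low-confidence pixels as ignored

    # Training on its own confident predictions pushes the model toward
    # low-entropy, consistent decisions on unlabeled data.
    unsup_loss = F.cross_entropy(model(x_u), pseudo, ignore_index=255)
    return sup_loss + unlabeled_weight * unsup_loss
```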