Article

Combining Satellite Image Standardization and Self-Supervised Learning to Improve Building Segmentation Accuracy

1 Graduate School of Life and Environmental Sciences, University of Tsukuba, 1-1-1 Tennoudai, Tsukuba 305-8572, Ibaraki, Japan
2 Faculty of Life and Environmental Sciences, University of Tsukuba, 1-1-1 Tennoudai, Tsukuba 305-8572, Ibaraki, Japan
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(18), 3182; https://doi.org/10.3390/rs17183182
Submission received: 22 July 2025 / Revised: 12 September 2025 / Accepted: 12 September 2025 / Published: 14 September 2025


Highlights

What are the main findings?
  • Performing atmospheric correction before pan-sharpening improves the accuracy of building segmentation.
  • The two pretext tasks are specifically designed to consider building features in satellite imagery.
What is the implication of the main finding?
  • The newly developed multi-task SSL network performs better than existing SSL methods in building segmentation.
  • The proposed method works effectively when labeled satellite data are limited and can run on a personal computer.

Abstract

Many research fields, such as urban planning, urban climate, and environmental assessment, require information on the distribution of buildings. In this study, we used U-Net to segment buildings from WorldView-3 imagery. To improve the accuracy of building segmentation, we undertook two endeavors. First, we investigated the optimal order of atmospheric correction (AC) and panchromatic sharpening (pan-sharpening) and found that performing AC before pan-sharpening results in higher building segmentation accuracy than performing it after pan-sharpening, increasing the average IoU by 9.4%. Second, we developed a new multi-task self-supervised learning (SSL) network to pre-train a VGG19 backbone using 21 unlabeled WorldView images. The new multi-task SSL network includes two pretext tasks specifically designed to take into account the characteristics of buildings in satellite imagery (size, distribution pattern, multispectral information, etc.). Performance evaluation shows that U-Net combined with an SSL pre-trained VGG19 backbone improves building segmentation accuracy by 15.3% compared to U-Net combined with a VGG19 backbone trained from scratch. Comparative analysis also shows that the new multi-task SSL network outperforms other existing SSL methods, improving building segmentation accuracy by 3.5–13.7%. Moreover, the proposed method significantly saves computational costs and can effectively work on a personal computer.

1. Introduction

Buildings are important infrastructure for human activities, and understanding their precise distribution can contribute to research in areas such as urban planning, urban climate, and environmental assessment [1,2,3]. It has been demonstrated that combining high-spatial resolution satellite imagery with deep learning approaches enables fast and accurate extraction of buildings [4]. However, two challenges remain in this task.
First, since each satellite image has different atmospheric effects, the reflectance spectra of buildings obtained from different satellite images may differ significantly, which brings great challenges to the process of building segmentation [5,6]. Furthermore, satellite imagery with a panchromatic (PAN) band, such as WorldView-3 imagery, typically requires a panchromatic sharpening (pan-sharpening) process to generate images that provide both clear textural information and sufficient spectral information before building semantic segmentation can be performed [7]. Therefore, the atmospheric correction (AC) and pan-sharpening processes are two important steps of imagery standardization before building segmentation. These processes remove/mitigate environmental influences and allow building segmentation according to differences in land cover type only, thereby increasing the generalizability of the developed model and the reliability of the generated building maps. However, some studies prefer to perform AC before pan-sharpening [8,9,10], while other studies prefer to perform AC after pan-sharpening [11,12]. Considering that atmospheric effects may be different for the PAN band and multispectral (MS) bands, it is necessary to clarify how the order of AC and pan-sharpening processes affects the accuracy of building segmentation.
The second challenge is the amount of labeled satellite data available to train deep learning-based models for building segmentation. Since satellite images are often taken under different conditions (e.g., different atmospheric effects, different locations with different land uses and covers, different seasons, etc.), a large number of labeled satellite images is required to train deep learning-based models for more accurate building segmentation. Preparing such labels is a very time-consuming and labor-intensive task. To address the issue of limited manually labeled satellite images, self-supervised learning (SSL) approaches have recently been proposed to extract building features from unlabeled satellite images through single or multiple pretext tasks (pre-training a backbone), and their usefulness has also been confirmed [13,14,15]. Compared with other pre-training approaches, there are two main advantages of using SSL: (1) SSL allows us to continuously add new unlabeled satellite images, avoiding the problem of a lack of satellite images in existing datasets (e.g., ImageNet is primarily composed of photographs) and the impact of different object distribution patterns between satellite images and photographs (no dominant objects vs. the presence of dominant objects such as dogs, cats, etc.); (2) SSL can fully capture the characteristics of multiple satellite image bands, not just the red, green, and blue bands (RGB). Therefore, the SSL approach is also considered in this study.
Generally, there are two main categories of pretext tasks in SSL: generative learning methods and contrastive learning methods [16,17]. The former can learn low-level features (e.g., color, texture, shape, etc.) by minimizing the reconstruction loss between the reconstructed image and the original input image [16,18,19]. The latter can learn high-level features (e.g., semantic categories, structure of objects, part-whole relationship of objects, etc.) by measuring the contrastive loss between positive pairs or by considering the contrastive loss of positive and negative pairs simultaneously [20,21].
Generative learning methods widely use image inpainting as their main pretext task, allowing the model to infer artificially damaged regions from the surrounding undamaged pixels (the texture of the remaining buildings, etc.) [19]. To implement image inpainting, most studies use a block mask to generate a damaged input image. For example, Li et al. proposed to mask approximately 30% of an unmanned aerial vehicle (UAV) image (256 × 256-pixel, spatial resolution 0.08 m) using a single large block [22]. Chen et al. proposed to use multiple small blocks to mask 5% of a Google image (512 × 512-pixel, spatial resolution 1 m) [23]. However, there are two problems with applying the above block mask strategies to WorldView-3 imagery (spatial resolution 0.3 m after pan-sharpening). First, when 30% of a 512 × 512-pixel WorldView-3 image is masked in one large block, the block size reaches more than 7000 m², which is usually much larger than the average size of a building in Japan (about 100 m², Ministry of Internal Affairs and Communications 2023 [24]). Previous studies have pointed out that completely masking a building may result in the building being lost, so that its features cannot be learned during the inpainting process [25]. Second, if the block mask ratio is too small (e.g., 5% as used in [23]), more training epochs will be required to effectively learn features across the entire 512 × 512-pixel WorldView-3 image, which will increase the computational cost. Therefore, to improve building segmentation accuracy, a new image inpainting pretext task specifically designed to consider the characteristics of buildings in pan-sharpened WorldView-3 imagery (e.g., size, distribution pattern) is needed.
Another representative generative learning method is colorful image colorization (CIC), which converts an input color image into a grayscale image and trains a model to reproduce its color version [18]. However, this method only works for three-band imagery (i.e., RGB imagery) and is not applicable to multi-band satellite imagery. On the other hand, previous studies have demonstrated that multi-band images provide more information and more reliable cues than RGB images and can more effectively distinguish building materials from the background [26,27,28,29,30]. Therefore, it is desirable to develop a new pretext task that can fully learn spectral characteristics from multi-band satellite images, which will enable the segmentation model to better distinguish different building materials from the surrounding background and improve the segmentation accuracy.
Consequently, the objectives of this study are as follows:
(1) To clarify how the order of the AC and pan-sharpening processes affects the accuracy of building segmentation and recommend the most effective standardization method.
(2) To propose two new pretext tasks that take into account the features of objects distributed in WorldView-3 images.
(3) To design a novel multi-task SSL network that combines generative learning methods (using two new pretext tasks) and contrastive learning methods to improve the accuracy of building segmentation.
Through this research, we hope to address the limitations of existing methods (inconsistent image standardization order and pretext task designs unsuited to WorldView-3 imagery) and to develop a robust method that can be used for urban growth monitoring, land use analysis, and disaster assessment under conditions of limited labeled satellite imagery.

2. Materials

In this study, multiband WorldView series satellite images were collected, as shown in Table 1. Dataset I contains three WorldView-3 images (located at Sakaiminato, Sanuki, and Tsukuba, Japan, respectively, Figure 1) with four MS bands (450–510 nm, 510–580 nm, 630–690 nm, and 770–895 nm) and one PAN band (450–800 nm). Buildings in each image were manually labeled using ArcGIS (v10.8.1), resulting in 21,951, 14,828, and 18,389 objects, respectively. Images were then cropped into patches of 512 × 512 pixels, leaving a 20% overlap between adjacent patches to avoid the “stitching problem” [31]. As a result, 1632, 1683, and 1681 patches were obtained in Sakaiminato, Sanuki, and Tsukuba regions, respectively.
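For illustration, the patch cropping described above can be sketched in Python as follows (a minimal sketch, not the code used in this study; the function name crop_patches is hypothetical, and border pixels that do not fill a complete patch are simply ignored):

```python
import numpy as np

def crop_patches(image: np.ndarray, patch_size: int = 512, overlap: float = 0.2):
    """Tile an (H, W, C) image into patch_size x patch_size patches with the given overlap ratio."""
    stride = int(patch_size * (1 - overlap))  # 20% overlap -> 409-pixel step for 512-pixel patches
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches
```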
Dataset I was used for two purposes: (1) to investigate the optimal combination of AC and pan-sharpening processes, and (2) to evaluate the performance of the newly designed multi-task SSL network. For these purposes, we split the patches of two of the three regions (Sakaiminato and Sanuki, Sakaiminato and Tsukuba, or Sanuki and Tsukuba) into training and validation datasets in a ratio of 8:2 and used the patches of the remaining region as the test dataset. Furthermore, to increase the diversity of the training dataset, all patches in the training dataset were augmented by geometric transformations (Figure 2, [32]), including offline image augmentation (i.e., horizontal flipping, vertical flipping, and both) and online image augmentation (i.e., random scale cropping and the 4-image mosaicking from [33]). The average test accuracy of the above three combinations was considered as the final result.
Datasets II, III, and IV were used to pre-train the VGG19 backbone through SSL. As all collected images were raw data (digital numbers), AC and pan-sharpening processes were performed to generate surface reflectance MS images with a spatial resolution of 30 cm. The same four bands as those in Dataset I were extracted and applied for further analysis. All 21 images in Datasets II, III, and IV were then cropped into patches of 512 × 512 pixels, resulting in a total of 85,636 patches for pre-training the VGG19 backbone via SSL. To reduce the risk of overfitting the VGG19 backbone, we kept 5% of the patches as a validation dataset, considering the maximum utilization of patches for training. Additionally, offline image augmentation (i.e., horizontal flipping, vertical flipping, and both) was also performed to increase the diversity of the training dataset.
Table 1. Details of the WorldView (WV) datasets used in this study. A patch is defined as a small region of 512 × 512 pixels cropped from an image. MS: Multispectral, PAN: Panchromatic. This study: image purchased from Maxar Technologies Inc., Westminster, CO, USA.
Dataset No. | Satellite Type | No. of Images | No. of Bands (MS / PAN) | Resolution (m) (MS / PAN) | No. of Patches | No. of Objects | Source
Dataset I | WV-3 | 3 | 4 / 1 | 1.2 / 0.3 | 1632 (Sakaiminato, Japan); 1683 (Sanuki, Japan); 1681 (Tsukuba, Japan) | 21,951; 14,828; 18,389 | This study
Dataset II | WV-3 | 9 | 4 / 1 | 1.2 / 0.3 | 9593 (9 regions in Japan) | - | This study
Dataset III | WV-3 | 1 | 8 / 1 | 1.2 / 0.3 | 7392 (Washington, D.C., USA) | - | [34]
Dataset IV | WV-2 | 2 | 8 / 1 | 2 / 0.5 | 68,651 (11 regions outside Japan, WV-2 and WV-3 combined) | - | [35]
Dataset IV (cont.) | WV-3 | 9 | 8 / 1 | 1.2 / 0.5 | (included above) | - | [35]

3. Methods

In this study, we use U-Net with a VGG19 backbone to segment buildings. This choice is based on previous work showing that U-Net with a pre-trained VGG19 backbone achieves higher building segmentation accuracy compared to the combinations of U-Net and other backbones (e.g., DarkNet19, ResNet50, DarkNet53) [36]. Furthermore, the main objective of this study is not to investigate the performance of existing backbones, but to design a multi-task SSL network that can properly consider the building distribution characteristics of WorldView-3 images for building segmentation. This is another reason why we only used VGG19 as the backbone in this study.
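As a reference for readers who wish to reproduce this setup, such a model can be instantiated, for example, with the segmentation_models_pytorch package (a sketch under the assumption that this third-party library is used; the paper itself does not specify an implementation):

```python
import segmentation_models_pytorch as smp

# U-Net with a VGG19 encoder; 4 input channels for the WorldView-3 MS bands, 1 building class.
model = smp.Unet(
    encoder_name="vgg19",
    encoder_weights=None,  # the encoder weights come from the SSL pre-training step instead
    in_channels=4,
    classes=1,
)
```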

3.1. Satellite Image Standardization

We designed three experiments to investigate the most effective method for standardizing satellite imagery. They are as follows: (a) only pan-sharpening, (b) first performing AC on the MS and PAN bands, respectively, and then performing pan-sharpening, and (c) vice versa. For AC, two widely used algorithms, the Second Simulation of the Satellite Signal in the Solar Spectrum (6S) [37,38] and the Fast Line-of-Sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) [39], were selected [40,41,42,43].
In pan-sharpening, the component substitution-based Gram–Schmidt method [44] and the color-based Nearest Neighbor Diffusion method (NNDiffuse) [45] are representative and widely used methods [46]. NNDiffuse performs pan-sharpening by assuming that the spectrum of each pixel in the pan-sharpened image is a weighted linear mixture of the spectra of its neighboring superpixels. By treating each spectrum as the smallest operational unit rather than processing each band individually, NNDiffuse can preserve both spatial detail and spectral fidelity [47]. In contrast, the Gram–Schmidt method often results in changes in spectral characteristics after fusion, resulting in degradation of color information [48,49]. Therefore, in this study, we only consider the NNDiffuse method for pan-sharpening.
Based on the above considerations, a total of five image standardization methods were tested in this study (Figure 3). Details are as follows:
(1) only pan-sharpening is performed, no AC is performed;
(2) AC is first performed on the MS and PAN bands using 6S, and then pan-sharpening is performed;
(3) AC is first performed only on the MS bands using FLAASH (FLAASH cannot be applied to the PAN band), and then pan-sharpening is performed;
(4) pan-sharpening is first performed, and then AC is performed on the pan-sharpened MS bands using 6S;
(5) pan-sharpening is first performed, and then AC is performed on the pan-sharpened MS bands using FLAASH.

3.2. Designing a Multi-Task SSL Network to Pre-Train the VGG19 Backbone

Figure 4 shows the newly designed multi-task SSL network in this study. The new multi-task SSL network includes three pretext tasks: (1) image inpainting (new), (2) image spectrum recovery (new), and (3) contrastive learning (as in previous studies, e.g., [22,23]). Note that image inpainting (Pretext Task 1) and image spectrum recovery (Pretext Task 2) share the same decoder structure (Appendix A, Table A1), since they are two similar tasks in which both are treated as recovering a 3D image matrix (channel, height, and width). Details are as follows.

3.2.1. New Pretext Task 1: Image Inpainting Using Multiple Small Blocks

The image inpainting task allows the VGG19 backbone to learn the textural features of objects by restoring masked areas in the input image [19]. To consider the building distribution characteristics of WorldView-3 images, we propose to use multiple small blocks in Pretext Task 1. Specifically, to avoid the problem of buildings being completely masked, we set the size of each small block to 32 × 32 pixels (approximately 92.16 m²), taking into account the average building area (approximately 100 m²). We also conducted an experiment to verify that this choice was appropriate (Table A2 in Appendix A). On the other hand, to balance the masked and remaining areas, the proportion of small blocks in the image was set to 0.3, as this value was found in tests to yield the best performance in downstream segmentation tasks (Table A3 in Appendix A). Furthermore, these small blocks were randomly distributed within each image without overlapping each other (Figure 4, Pretext Task 1).
Generally, there are three widely used image transformation methods to damage the masked image within each small block (32 × 32 pixels): image rotation (0, 90, 180, or 270 degrees), image flipping (horizontal, vertical, horizontal and vertical, or unchanged), and image color jittering (brightness, contrast, saturation, and hue). We tested three image damage strategies to determine which strategy is most effective at learning building texture features. They are as follows: (1) each small block masked area uses a different combination of all three transformation methods (SB1); (2) all small block masked areas use the same combination of all three transformation methods (SB2); (3) all small block masked areas are simply set to zero (SB3).
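The recommended SB3 strategy can be sketched as follows (a minimal PyTorch sketch, not the authors' code; the function name mask_small_blocks is hypothetical, and placing blocks on a 32-pixel grid is a simplification used here to guarantee non-overlap):

```python
import torch

def mask_small_blocks(img: torch.Tensor, block: int = 32, ratio: float = 0.3) -> torch.Tensor:
    """SB3-style damage: zero out randomly placed, non-overlapping block x block squares
    covering ~`ratio` of a (C, H, W) patch."""
    _, h, w = img.shape
    n_blocks = int(ratio * h * w / (block * block))   # e.g., 0.3 * 512 * 512 / 1024 = 76 blocks
    cells = [(r, c) for r in range(h // block) for c in range(w // block)]
    chosen = torch.randperm(len(cells))[:n_blocks].tolist()
    damaged = img.clone()
    for i in chosen:
        r, c = cells[i]
        damaged[:, r * block:(r + 1) * block, c * block:(c + 1) * block] = 0.0
    return damaged
```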
The loss function for image inpainting ($L_{inpaint}$) is defined as follows:
$$L_{inpaint} = \frac{1}{n}\sum_{i=1}^{n}\left(I_i - I_i''\right)\cdot\left(I_i - I_i''\right)$$
where n is the number of images (512 × 512 pixels) in one batch (n = 8), $I_i$ is the original image, $I_i'$ is the image with the pixels in multiple small blocks transformed, and $I_i''$ is the image with the pixels in multiple small blocks restored (i = 1, 2, …, n).
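Reading the equation above as a mean squared reconstruction error between the original and restored images, the loss can be sketched as follows (our interpretation of the element-wise product, not the authors' code):

```python
import torch

def inpaint_loss(original: torch.Tensor, restored: torch.Tensor) -> torch.Tensor:
    """Mean of the element-wise squared difference between the original and restored patches."""
    diff = original - restored
    return (diff * diff).mean()
```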

3.2.2. New Pretext Task 2: Image Spectrum Recovery

Unlike RGB imagery, WorldView-3 imagery has multiple bands, which can provide more information to distinguish buildings from other land covers. To take full advantage of this property, we propose a novel pretext task that enables the VGG19 backbone to learn the spectral features of each land cover (especially buildings) by recovering damaged bands in the input image. We performed four experiments to impair the band information, trying to find the most effective way to enable the VGG19 backbone to properly learn the spectral features of each land cover. Details of the four experiments are as follows: (1) randomly select one band and set the value of the selected band to zero (R1Z); (2) randomly select two bands, average their values, and replace the values of the two selected bands with the average value (R2A); (3) randomly select three bands, average their values, and replace the values of the three selected bands with the average value (R3A); (4) average all four bands and replace the values of the four bands with the average value (4A).
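The 4A strategy, for example, can be sketched as follows (a minimal sketch; the function name damage_bands_4a is hypothetical):

```python
import torch

def damage_bands_4a(img: torch.Tensor) -> torch.Tensor:
    """4A damage: replace all four MS bands of a (4, H, W) patch with their per-pixel mean,
    so that the backbone has to recover the band-wise spectra."""
    mean_band = img.mean(dim=0, keepdim=True)  # (1, H, W)
    return mean_band.expand_as(img).clone()    # every band set to the same mean value
```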
The loss function for image spectrum recovery ($L_{spectrum}$) is defined as follows:
$$L_{spectrum} = \frac{1}{n}\sum_{s=1}^{n}\left(I_s - I_s''\right)\cdot\left(I_s - I_s''\right)$$
where n is the number of images (512 × 512 pixels) in one batch (n = 8), $I_s$ is the original image, $I_s'$ is the image containing the faulty band information, and $I_s''$ is the image containing the restored band information (s = 1, 2, …, n).

3.2.3. Pretext Task 3: Contrastive Learning

For Pretext Task 3, we used SimCLR, the most representative contrastive learning method, which learns high-level features through data augmentation and a large number of positive and negative sample pairs [22,50,51]. The cosine similarity (sim) between positive and negative pairs was calculated using the method described in [50]:
$$sim(I, I') = \frac{\varphi^{T}(I)\cdot\varphi(I')}{\left\|\varphi(I)\right\|_{2}\cdot\left\|\varphi(I')\right\|_{2}}$$
where $I$ is the original image, $I'$ is the transformed image, and $\varphi(I)$ and $\varphi(I')$ represent the image features extracted by the VGG19 backbone, respectively. $\left\|\cdot\right\|_{2}$ denotes the $l_2$ norm.
The loss function for contrastive learning ($L_{contrastive}$) is defined as follows:
$$L_{contrastive} = -\frac{1}{n}\sum_{j=1}^{n}\left[\log\frac{\exp\left(sim(I_j, I_j')\right)}{\sum_{k\neq j}\exp\left(sim(I_j, I_k)\right)} + \log\frac{\exp\left(sim(I_j', I_j)\right)}{\sum_{k\neq j}\exp\left(sim(I_j', I_k')\right)}\right]$$
where n is the number of images (512 × 512 pixels) in one batch (n = 8), $(I_j, I_j')$ is the positive pair, and $(I_j, I_k)$ and $(I_j', I_k')$ are the negative pairs.
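A batch-wise sketch of this loss, in the usual SimCLR (NT-Xent) form and without a temperature term, is given below (our sketch; it treats all other samples in the batch as negatives):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z: torch.Tensor, z_aug: torch.Tensor) -> torch.Tensor:
    """SimCLR-style loss for backbone features z (n, d) and their augmented views z_aug (n, d)."""
    n = z.size(0)
    feats = F.normalize(torch.cat([z, z_aug], dim=0), dim=1)  # (2n, d) unit-norm features
    sim = feats @ feats.t()                                   # pairwise cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float("-inf"))                # exclude self-similarities
    # the positive of sample i is its augmented view (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```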

3.3. U-Net with SSL Pre-Trained VGG19 Backbone for Building Segmentation

Figure 5 shows the flowchart of the proposed method for building segmentation. All input WorldView images are first processed using the most effective standardization method obtained from Section 3.1. The U-Net with SSL pre-trained VGG19 backbone in Section 3.2 is then fine-tuned using Dataset I. The output is a predicted building distribution map. Dice loss [52] was used as the loss function for building segmentation.
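The Dice loss used for fine-tuning can be written compactly as follows (a standard soft-Dice sketch, assuming sigmoid probabilities as input):

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss for binary building masks; pred holds probabilities in [0, 1]."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```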
Implementation details for pre-training the VGG19 backbone are as follows: batch size = 8, total number of training epochs = 20, initial learning rate = 0.0001 and decreased to 95% every 2 epochs, and total loss function of SSL = 6 × $L_{inpaint}$ + 13 × $L_{spectrum}$ + $L_{contrastive}$ (the weights for each loss function were set according to the range of loss values observed when each pretext task was used individually, ensuring that no single pretext task dominates the overall SSL loss). Training is stopped just before the loss on the validation dataset starts to increase. Additionally, a random seed was set to ensure experimental reproducibility.
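The weighted total SSL objective described above then reduces to a simple weighted sum (sketch only; the function name total_ssl_loss is hypothetical):

```python
def total_ssl_loss(loss_inpaint, loss_spectrum, loss_contrastive,
                   w_inpaint: float = 6.0, w_spectrum: float = 13.0, w_contrastive: float = 1.0):
    """Weighted sum of the three pretext losses; the weights follow the values given in the text
    so that no single pretext task dominates the overall SSL loss."""
    return w_inpaint * loss_inpaint + w_spectrum * loss_spectrum + w_contrastive * loss_contrastive
```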
Note that a U-Net with a non-pretrained VGG19 backbone was used to explore the optimal combination of AC and pan-sharpening processes. Implementation details for training this U-Net are as follows: batch size = 16, total number of training epochs = 200, initial learning rate = 0.001 and decreased to 90% every 10 epochs. Training is stopped just before the loss on the validation dataset starts to increase. A random seed was also set to ensure experimental reproducibility. These implementation details are also used to fine-tune the U-Net with the SSL pre-trained VGG19 backbone.

3.4. Accuracy Assessment

In this study, the Bhattacharyya Distance ($D_B$) [53] was used to calculate the similarity between standardized satellite images, as shown in the following equation:
$$D_B(H_1, H_2) = -\ln BC(H_1, H_2) = -\ln\sum_{i}\sqrt{H_1(i)\,H_2(i)}$$
where $H_1$ and $H_2$ are the histograms of the two images being compared, $H_1(i)$ and $H_2(i)$ represent the probability of occurrence of the i-th value, and $BC(H_1, H_2)$ represents the Bhattacharyya coefficient.
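For two single-band images, the DB computation can be sketched as follows (our NumPy sketch; the number of histogram bins is an assumption):

```python
import numpy as np

def bhattacharyya_distance(img1: np.ndarray, img2: np.ndarray, bins: int = 256) -> float:
    """Bhattacharyya distance between the normalized histograms of two images."""
    lo = float(min(img1.min(), img2.min()))
    hi = float(max(img1.max(), img2.max()))
    h1, _ = np.histogram(img1, bins=bins, range=(lo, hi))
    h2, _ = np.histogram(img2, bins=bins, range=(lo, hi))
    p1 = h1 / h1.sum()
    p2 = h2 / h2.sum()
    bc = np.sum(np.sqrt(p1 * p2))  # Bhattacharyya coefficient
    return float(-np.log(bc))
```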
Furthermore, the results of the building segmentation are evaluated using widely used pixel-level metrics such as Intersection over Union (IoU) and Overall Accuracy (OA), as shown in the following equations:
$$IoU = \frac{TP}{TP + FP + FN}$$
$$OA = \frac{TP + TN}{TP + FP + TN + FN}$$
where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.
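Both metrics follow directly from the pixel-level confusion counts, for example (a minimal sketch for binary masks):

```python
import numpy as np

def iou_and_oa(pred: np.ndarray, truth: np.ndarray):
    """Pixel-level IoU and Overall Accuracy for binary building masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + fp + tn + fn)
    return iou, oa
```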

4. Results

4.1. The Most Effective Method for Standardizing Satellite Imagery

Table 2 shows the histogram similarities (i.e., DB values) between each pair of the three images from different regions using different image standardization methods. Overall, the histogram similarity between two atmospherically corrected WorldView-3 images was higher than that between the images without AC, with the average DB value decreasing from 0.400 (method 1) to 0.153–0.378 (methods 2–5). Among them, the images atmospherically corrected using 6S (methods 2 and 4, average DB = 0.153–0.252) show higher similarity than those corrected using FLAASH (methods 3 and 5, average DB = 0.369–0.378). Furthermore, performing AC with 6S before pan-sharpening (method 2) shows the highest histogram similarity (average DB = 0.153).
Table 3 shows the accuracy evaluation results of building segmentation using each image standardization method. Similar to the histogram similarity results, method 2 (performing AC with 6S before pan-sharpening) performed best with the highest average IoU of 0.575 and average OA of 0.943, followed by method 4 (average IoU = 0.531, average OA = 0.935) and methods 3 and 5 (average IoU = 0.420–0.424, average OA = 0.889–0.890). Method 1 (only pan-sharpening, no AC) has the worst performance (average IoU = 0.395, average OA = 0.862). Furthermore, method 2 shows the second smallest accuracy variance between the use of different training and testing datasets, demonstrating its robustness.
The above results not only highlighted the need for AC but also demonstrated the importance of the order of AC and pan-sharpening.

4.2. Building Segmentation Results Using U-Net with SSL Pre-Trained VGG19 Backbone

4.2.1. Performance of Two New Pretext Tasks in SSL

To clarify the effectiveness of each new pretext task in the multi-task SSL network, we first tested the building segmentation performance using U-Net, which applies a VGG19 backbone pre-trained on only one pretext task.
Table 4 shows the results of building segmentation using the SSL network with Pretext Task 1 only (Image Inpainting). We found that the simplest image damage strategy (SB3 in Section 3.2.1), applied to regions masked with multiple small blocks (i.e., method 4), exhibited the best performance (average IoU = 0.628, average OA = 0.953). However, applying a complex image damage strategy to regions masked with multiple small blocks resulted in poorer performance (method 2, average IoU = 0.591, average OA = 0.946) compared to the use of a single large block (method 1, average IoU = 0.602, average OA = 0.949). Therefore, method 4 (SB3) in Table 4 is recommended for Pretext Task 1.
Table 5 shows the results of building segmentation using the SSL network with Pretext Task 2 only (CIC used in previous studies or Image Spectrum Recovery proposed in this study). We found that image spectrum recovery method 5 (4A in Section 3.2.2) performed best with an average IoU of 0.652 and an average OA of 0.958. Compared to the CIC method (method 1), the building segmentation accuracy was improved by 0.004 in average IoU. However, the other image spectrum recovery methods (methods 2–4) performed worse than the CIC method (average IoU = 0.593–0.606 vs. 0.648, average OA = 0.947–0.949 vs. 0.957). Therefore, method 5 (4A) in Table 5 is recommended for Pretext Task 2.
Figure 6 shows some examples that demonstrate the effectiveness of the two recommended new pretext tasks compared to the pretext tasks used in previous studies [18,22]. A visual comparison between Figure 6e (IoU = 0.246) and Figure 6i (IoU = 0.905), and between Figure 6f (IoU = 0.827) and Figure 6j (IoU = 0.872), shows that the new Pretext Task 1 successfully removes the road that was misclassified by the image inpainting pretext task of [22] (see the red pixels in the red circle in Figure 6e) and properly detects some small buildings that were missed by the image inpainting pretext task of [22] (see the green pixels in the red circle in Figure 6f). Similarly, a visual comparison of Figure 6g (IoU = 0.330) and Figure 6k (IoU = 0.946), and of Figure 6h (IoU = 0.525) and Figure 6l (IoU = 0.881), reveals that the new Pretext Task 2 successfully removes some farmland (see the red circle in Figure 6k) and properly detects almost all buildings with green rooftops (see the red circle in Figure 6l). Both were misclassified by the CIC method of [18] (see the red pixels in the red circle in Figure 6g and the green pixels in the red circle in Figure 6h). These examples highlight the advantages of the two new pretext tasks proposed in this study.

4.2.2. Performance of the New Multi-Task SSL Network

Table 6 shows the ablation experiments to clarify the importance of each pretext task in the proposed method. For reference, we also show the building segmentation accuracy using U-Net with a VGG19 backbone trained from scratch (i.e., without SSL; Method 1 in the table). It is not surprising that the U-Net combined with SSL performed better than without SSL (average IoU = 0.628–0.663 vs. average IoU = 0.575). Looking at the performance of U-Net combined with SSL, we can see a steady improvement in building segmentation accuracy with each additional pretext task. Overall, the combination of U-Net and the new multi-task SSL network (including all three pretext tasks) improves building segmentation accuracy by 15.3% on average IoU compared to U-Net without SSL. These results demonstrate not only the importance of SSL but also the necessity of each pretext task.

4.3. Comparison with Other SSL Approaches

The new multi-task SSL network proposed in this study was compared with four state-of-the-art SSL methods, including SimCLR [50], MoCo v2 [51], BYOL [55], and PGSSL [23]. The results are shown in Table 7. U-Net combined with the new multi-task SSL network performed the best with the highest average IoU (0.663) and average OA (0.958) values, followed by U-Net combined with PGSSL (average IoU = 0.640, average OA = 0.955), SimCLR (average IoU = 0.615, average OA = 0.951), BYOL (average IoU = 0.589, average OA = 0.947), and MoCo v2 (average IoU = 0.583, average OA = 0.946).
Figure 7 shows some examples that demonstrate the advantages of the proposed building segmentation method (i.e., the combination of U-Net and the new SSL pre-trained VGG19 backbone; the segmentation accuracy values for each SSL approach are listed in Table A4 in Appendix A). Misclassification of bare ground as buildings (see the red pixels in the red circle at the top of Figure 7) is observed in all four compared SSL methods, but is successfully removed by our method (Figure 7f, top). Another problem with the four compared SSL methods is the misclassification of roads as buildings (see the red pixels in the red circles in the second row of Figure 7); these misclassified roads are not seen in the second row of Figure 7f (using the method proposed in this study). Moreover, all four compared SSL methods struggled to detect an entire building composed of several small dense objects (e.g., a greenhouse, see the green pixels in the red circle in the third row of Figure 7), while this problem was successfully solved by our method (Figure 7f, bottom).

5. Discussion

Previous studies have identified AC and pan-sharpening as two important steps when using satellite imagery for object segmentation [12,56,57]. However, the optimal order of AC and pan-sharpening remains unclear. In this study, we clarified this issue by conducting three experiments. The results clearly show that performing AC before pan-sharpening results in higher building segmentation accuracy than after pan-sharpening (Table 2 and Table 3). This is likely due to the different atmospheric effects on each MS band and the PAN band [58]. It is considered that performing pan-sharpening before AC may change the spectral characteristics of each land cover type and reduce the building segmentation accuracy. Furthermore, the results also show that using 6S for AC before pan-sharpening performed better than using FLAASH for AC. This is mainly because FLAASH cannot perform AC on the PAN band, which results in inconsistent reflectance values between the MS bands (surface reflectance) and the PAN band (top-of-atmosphere reflectance) when pan-sharpening was performed.
SSL is a powerful approach to solve the problem of lack of labeled training data, especially when using regularly archived satellite data for building segmentation [50,51]. A good pretext task is key to enable the deep learning backbone to automatically and properly learn the features of an object [14,15,16]. In this study, we designed two new pretext tasks that enable the VGG19 backbone to learn the texture and spectral features of both foreground (buildings) and background (other land covers) objects.
First, taking into account the spatial resolution of WorldView-3 images (0.3 m after pan-sharpening) and the average size of buildings, we proposed replacing a single large block with multiple small blocks for image inpainting (Pretext Task 1 in Figure 4). This change avoided losing information of small objects during pre-training of the VGG19 backbone [25] and improved the building segmentation accuracy (Table 4 and Figure 6). For example, the misclassification of the road in Figure 6e and the missed detection of the small house in Figure 6f are due to the use of a single large block, which are improved in this study by the small block design (see the red circles in Figure 6i,j). Furthermore, in this study, the total area of multiple small blocks was set to 30% of each 512 × 512-pixel patch. This value lies between two previous studies (75% in [59]; 5% in [23]) that also use multiple blocks for image inpainting in SSL. However, both the 75% and 5% values are considered inappropriate for segmenting buildings from pan-sharpened WorldView-3 imagery. This is because a masking ratio that is too large may result in missing information of many small objects, while a masking ratio that is too small will require more epochs to learn the features of the entire 512 × 512-pixel patch, thereby increasing computational cost. In addition, it should be noted that even when using multiple small blocks, if the image transformation strategy is too complex (e.g., image inpainting method 2 in Table 4), the features of the same object in the same image will have large differences, making it difficult for the VGG19 backbone to learn effective features for each object [59].
Second, considering that WorldView-3 images have multiple bands, we proposed to have the VGG19 backbone learn the multi-spectral features of each land cover (Pretext Task 2 in Figure 4), rather than only learning spectral information from RGB images through CIC. In general, CIC is a suitable approach for RGB images (or photographs) [18]. However, satellite imagery such as WorldView-3 contains extra spectral information that can help to distinguish buildings from other land covers [26,27,60]. This is probably the reason why method 1 (CIC) in Table 5 has worse building segmentation accuracy than method 5 (4A). For example, the incorrectly classified farmlands in Figure 6g and the undetected buildings (with green rooftops) in Figure 6h are due to spectral/color similarity with buildings/vegetation, and both cases were improved by the image spectrum recovery task designed in this study (see the red circles in Figure 6k,l). Furthermore, the best performance of method 5 (4A) in Table 5 may be due to the low number of epochs used in this study (i.e., 20), which was limited by the use of a personal computer. To achieve the same building segmentation accuracy as method 5 (4A), methods 2–4 (R1Z, R2A, and R3A in Table 5) need to run more epochs, which increases the computational cost.
The comparison results show that the new multi-task SSL network proposed in this study outperforms other widely used SSL methods (Table 7). This is mainly because we designed two new pretext tasks depending on the features of objects in the WorldView images (e.g., size, number, and reflectance spectra of buildings) as well as the characteristics of the WorldView images themselves (e.g., spatial resolution and multiple bands). Moreover, the new multi-task SSL network integrates generative and contrastive learning methods, allowing it to learn both high- and low-level features of WorldView-3 imagery. In contrast, SimCLR, MoCo v2, and BYOL were originally developed based on ImageNet (i.e., photos), and therefore did not fully take into account the features of objects in satellite images [50,51,55]. Furthermore, they all only used contrastive learning methods that mainly focus on learning high-level features of satellite images. PGSSL was also developed for building segmentation, but because it uses a very small masking ratio (5% of each 512 × 512-pixel patch), it would require a large computational cost to achieve the same building segmentation accuracy as the new multi-task SSL network, making it difficult to effectively work on a personal computer.
There are some limitations when applying the multi-task SSL network proposed in this study. First, in Pretext Task 1, the size of each small block was set based on the average building area in Japan and the spatial resolution of pan-sharpened WorldView-3 imagery (0.3 m). Therefore, this size needs to be adjusted when applying the proposed method to satellite images with different spatial resolutions or to regions with different average building areas. Second, in this study, we only tested the combination of U-Net with the VGG19 backbone and the proposed multi-task SSL network. Replacing VGG19 with other backbones such as ResNet [61] or DarkNet [62], or replacing U-Net with other convolutional neural networks (e.g., HRNet [63] and ABCNet [64]) or Transformer-based networks with attention mechanisms (e.g., SegFormer [65] and UNetFormer [66]), could further improve the building segmentation accuracy. Third, all experimental images were acquired during the summer, and half of them (12 images) were acquired from the Japanese region. Therefore, the robustness and generalizability of the proposed SSL network need to be further improved using WV images from different seasons and regions. Fourth, the proposed SSL network is currently only applicable when the pre-training data are generated from sensors with the same spatial resolution. To increase the number of available pre-training and fine-tuning data, future research could consider multi-scale feature fusion techniques (e.g., the atrous spatial pyramid pooling module [67], adaptive spatial feature fusion [68]) or introduce a resolution adaptation mechanism that dynamically adjusts the block size according to the spatial resolution of the input image. Fifth, the spectrum recovery method developed in this study does not fully consider the influence of shadows on the spectral characteristics of buildings. To solve this problem, shadow removal methods [69,70] could be applied first to minimize the influence of shadows before learning spectral features. All these remaining challenges need to be thoroughly investigated in future research.

6. Conclusions

In this study, we first conducted several comparative experiments using three labeled WorldView-3 images and revealed the optimal order of AC and pan-sharpening (i.e., performing AC with 6S before pan-sharpening with NNDiffuse), which improves building segmentation accuracy by 8.3–45.6% compared to other image standardization methods. We then designed two new pretext tasks that specifically consider the features of objects distributed in WorldView-3 images (e.g., size, distribution pattern, multispectral information, etc.), integrated them into the multi-task SSL network, and pre-trained the VGG19 backbone of U-Net with unlabeled WorldView images. This effort further improved building segmentation accuracy by 15.3% compared to a VGG19 backbone trained from scratch. Finally, we compared the new multi-task SSL network with other existing SSL methods and found that it performed the best. Another advantage of the proposed method is that it has low computational cost and can effectively work on a personal computer. By continuing to address the remaining issues, the proposed method can be applied to other satellite images and other regions, providing useful information in fields such as urban planning.

Author Contributions

Conceptualization, H.Z.; Data curation, H.Z.; Formal analysis, H.Z.; Funding acquisition, B.M.; Methodology, H.Z.; Resources, B.M.; Validation, H.Z.; Writing—original draft, H.Z.; Writing—review and editing, B.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Grants-in-Aid for Scientific Research of MEXT from Japan (No. 24K01009).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We acknowledge all the reviewers and editors for their constructive suggestions on this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Decoder structure of the new multi-task SSL network. Interpolate: Up-sampling using bilinear interpolation with a scale factor of 2. Conv: Convolutional layer. BN: Batch normalization layer. ReLU: Rectified Linear Unit (ReLU) activation function. Addition: Element-wise addition between the feature map from the decoder and the corresponding encoder layer (skip connection). Linear: Fully connected layer (linear transformation). Average Pooling: Adaptive average pooling to reduce feature maps to a single vector. "-": no input required.
Pretext Task | No. of Layers | Layer | Input Channels | Output Channels | Output Size | Kernel Size | Stride | Padding
Image Inpainting | 1 | Interpolate | - | - | - | - | - | -
 | 2 | Conv | 512 | 512 | 32 × 32 | 3 × 3 | 1 | 1
 | 3 | BN | 512 | - | - | - | - | -
 | 4 | ReLU | - | - | - | - | - | -
 | 5 | Addition | - | - | - | - | - | -
 | 6 | Interpolate | - | - | - | - | - | -
 | 7 | Conv | 512 | 256 | 64 × 64 | 3 × 3 | 1 | 1
 | 8 | BN | 256 | - | - | - | - | -
 | 9 | ReLU | - | - | - | - | - | -
 | 10 | Addition | - | - | - | - | - | -
 | 11 | Interpolate | - | - | - | - | - | -
 | 12 | Conv | 256 | 128 | 128 × 128 | 3 × 3 | 1 | 1
 | 13 | BN | 128 | - | - | - | - | -
 | 14 | ReLU | - | - | - | - | - | -
 | 15 | Addition | - | - | - | - | - | -
 | 16 | Interpolate | - | - | - | - | - | -
 | 17 | Conv | 128 | 64 | 256 × 256 | 3 × 3 | 1 | 1
 | 18 | BN | 64 | - | - | - | - | -
 | 19 | ReLU | - | - | - | - | - | -
 | 20 | Addition | - | - | - | - | - | -
 | 21 | Interpolate | - | - | - | - | - | -
 | 22 | Conv | 64 | 32 | 512 × 512 | 3 × 3 | 1 | 1
 | 23 | BN | 32 | - | - | - | - | -
 | 24 | ReLU | - | - | - | - | - | -
 | 25 | Conv | 32 | 4 | 512 × 512 | 3 × 3 | 1 | 1
Spectrum Recovery | 1–25 | Identical to the Image Inpainting decoder above (layers 1–25) | | | | | |
Contrastive Learning | 1 | Average Pooling | - | - | 1 × 1 | - | - | -
 | 2 | Linear | 512 | 1024 | - | - | - | -
 | 3 | ReLU | - | - | - | - | - | -
 | 4 | Linear | 1024 | 512 | - | - | - | -
Table A2. Accuracy of building segmentation using the SSL network with only Pretext Task 1 under different small block sizes. (A): use the images from Sakaiminato and Sanuki as the training dataset, and the image from Tsukuba as the test dataset; (B): use the images from Sakaiminato and Tsukuba as the training dataset, and the image from Sanuki as the test dataset; (C): use the images from Tsukuba and Sanuki as the training dataset, and the image from Sakaiminato as the test dataset. The highest IoU and OA values are highlighted in bold.
No. | Method | Size of Small Blocks | Mean IoU | Mean OA | (A) IoU | (A) OA | (B) IoU | (B) OA | (C) IoU | (C) OA
1 | Inpainting with small blocks | 16 × 16 | 0.625 | 0.953 | 0.648 | 0.961 | 0.636 | 0.959 | 0.591 | 0.938
2 | Inpainting with small blocks | 32 × 32 | 0.628 | 0.953 | 0.642 | 0.959 | 0.646 | 0.959 | 0.596 | 0.942
3 | Inpainting with small blocks | 64 × 64 | 0.617 | 0.951 | 0.655 | 0.962 | 0.595 | 0.949 | 0.600 | 0.943
Table A3. Accuracy of building segmentation using the new SSL network with different proportions of small blocks used in Pretext Task 1. (A): use the images from Sakaiminato and Sanuki as the training dataset, and the image from Tsukuba as the test dataset; (B): use the images from Sakaiminato and Tsukuba as the training dataset, and the image from Sanuki as the test dataset; (C): use the images from Tsukuba and Sanuki as the training dataset, and the image from Sakaiminato as the test dataset. The highest IoU and OA values are highlighted in bold.
No. | Method | Proportion of Small Blocks in Pretext Task 1 | Mean IoU | Mean OA | (A) IoU | (A) OA | (B) IoU | (B) OA | (C) IoU | (C) OA
1 | Multiple task-based SSL network (This study) | 0.05 | 0.595 | 0.946 | 0.571 | 0.945 | 0.620 | 0.955 | 0.594 | 0.940
2 | | 0.15 | 0.642 | 0.955 | 0.661 | 0.963 | 0.663 | 0.962 | 0.603 | 0.942
3 | | 0.30 | 0.663 | 0.958 | 0.692 | 0.968 | 0.678 | 0.964 | 0.620 | 0.943
4 | | 0.40 | 0.630 | 0.953 | 0.655 | 0.962 | 0.609 | 0.952 | 0.624 | 0.946
5 | | 0.50 | 0.626 | 0.952 | 0.638 | 0.958 | 0.623 | 0.954 | 0.615 | 0.943
6 | | 0.60 | 0.621 | 0.950 | 0.655 | 0.962 | 0.582 | 0.946 | 0.625 | 0.944
7 | | 0.75 | 0.641 | 0.955 | 0.651 | 0.960 | 0.652 | 0.961 | 0.621 | 0.945
Table A4. Accuracy of building segmentation (IoU) in regional areas using different SSL approaches. The highest IoU values are highlighted in bold.
Position in Figure 7 | SimCLR | MoCo v2 | BYOL | PGSSL | This Study
Top | 0.744 | 0.874 | 0.780 | 0.886 | 0.910
Middle | 0.754 | 0.774 | 0.752 | 0.790 | 0.834
Bottom | 0.906 | 0.941 | 0.905 | 0.878 | 0.992

References

  1. He, C.; Liu, Y.; Wang, D.; Liu, S.; Yu, L.; Ren, Y. Automatic Extraction of Bare Soil Land from High-Resolution Remote Sensing Images Based on Semantic Segmentation with Deep Learning. Remote Sens. 2023, 15, 1646. [Google Scholar] [CrossRef]
  2. Huang, X.; Zhang, L. An Adaptive Mean-Shift Analysis Approach for Object Extraction and Classification from Urban Hyperspectral Imagery. IEEE Trans. Geosci. Remote Sens. 2008, 46, 4173–4185. [Google Scholar] [CrossRef]
  3. Bai, H.; Li, Z.; Guo, H.; Chen, H.; Luo, P. Urban Green Space Planning Based on Remote Sensing and Geographic Information Systems. Remote Sens. 2022, 14, 4213. [Google Scholar] [CrossRef]
  4. Ahmadi, S.; Zoej, M.J.V.; Ebadi, H.; Moghaddam, H.A.; Mohammadzadeh, A. Automatic Urban Building Boundary Extraction from High Resolution Aerial Images Using an Innovative Model of Active Contours. Int. J. Appl. Earth Obs. Geoinf. 2010, 12, 150–157. [Google Scholar] [CrossRef]
  5. Zhang, X.; Gao, K.; Wang, J.; Hu, Z.; Wang, H.; Wang, P.; Zhao, X.; Li, W. Self-Supervised Learning with Deep Clustering for Target Detection in Hyperspectral Images with Insufficient Spectral Variation Prior. Int. J. Appl. Earth Obs. Geoinf. 2023, 122, 103405. [Google Scholar] [CrossRef]
  6. Manolakis, D.; Marden, D.; Shaw, G.A. Hyperspectral Image Processing for Automatic Target Detection Applications. Linc. Lab. J. 2003, 14, 79–116. [Google Scholar]
  7. Cui, Y.; Liu, P.; Ma, Y.; Chen, L.; Xu, M.; Guo, X. Pansharpening via Predictive Filtering with Element-Wise Feature Mixing. ISPRS J. Photogramm. Remote Sens. 2025, 219, 22–37. [Google Scholar] [CrossRef]
  8. Li, D.; Ke, Y.; Gong, H.; Li, X. Object-Based Urban Tree Species Classification Using Bi-Temporal WorldView-2 and WorldView-3 Images. Remote Sens. 2015, 7, 16917–16937. [Google Scholar] [CrossRef]
  9. Wang, D.; Qiu, P.; Wan, B.; Cao, Z.; Zhang, Q. Mapping α- and β-Diversity of Mangrove Forests with Multispectral and Hyperspectral Images. Remote Sens. Environ. 2022, 275, 113021. [Google Scholar] [CrossRef]
  10. Luo, Q.; Li, Z.; Huang, Z.; Abulaiti, Y.; Yang, Q.; Yu, S. Retrieval of Mangrove Leaf Area Index and Its Response to Typhoon Based on WorldView-3 Image. Remote Sens. Appl. Soc. Environ. 2023, 30, 100931. [Google Scholar] [CrossRef]
  11. Liu, X.; Frey, J.; Denter, M.; Zielewska-Büttner, K.; Still, N.; Koch, B. Mapping Standing Dead Trees in Temperate Montane Forests Using a Pixel- and Object-Based Image Fusion Method and Stereo WorldView-3 Imagery. Ecol. Indic. 2021, 133, 108438. [Google Scholar] [CrossRef]
  12. Yao, Y.; Wang, S. Effects of Atmospheric Correction and Image Enhancement on Effective Plastic Greenhouse Segments Based on a Semi-Automatic Extraction Method. ISPRS Int. J. Geo-Inf. 2022, 11, 585. [Google Scholar] [CrossRef]
  13. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv 2019, arXiv:1911.05722. [Google Scholar]
  14. Wang, Y.; Albrecht, C.M.; Braham, N.A.A.; Mou, L.; Zhu, X.X. Self-Supervised Learning in Remote Sensing: A Review. IEEE Geosci. Remote Sens. Mag. 2022, 10, 213–247. [Google Scholar] [CrossRef]
  15. Balestriero, R.; Ibrahim, M.; Sobal, V.; Morcos, A.; Shekhar, S.; Goldstein, T.; Bordes, F.; Bardes, A.; Mialon, G.; Tian, Y.; et al. A Cookbook of Self-Supervised Learning. arXiv 2023, arXiv:2304.12210. [Google Scholar] [CrossRef]
  16. Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J.; Tang, J. Self-Supervised Learning: Generative or Contrastive. IEEE Trans. Knowl. Data Eng. 2023, 35, 857–876. [Google Scholar] [CrossRef]
  17. Tang, Y.; Yang, Y.; Sun, G. Generative and Contrastive Graph Representation Learning with Message Passing. Neural Netw. 2025, 185, 107224. [Google Scholar] [CrossRef] [PubMed]
  18. Zhang, R.; Isola, P.; Efros, A.A. Colorful Image Colorization. arXiv 2016, arXiv:1603.08511. [Google Scholar] [CrossRef]
  19. Pathak, D.; Krähenbühl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context Encoders: Feature Learning by Inpainting. arXiv 2016, arXiv:1604.07379. [Google Scholar] [CrossRef]
  20. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. arXiv 2020, arXiv:2006.09882. [Google Scholar]
  21. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. arXiv 2021, arXiv:2103.03230. [Google Scholar] [CrossRef]
  22. Li, W.; Chen, H.; Shi, Z. Semantic Segmentation of Remote Sensing Images with Self-Supervised Multitask Representation Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6438–6450. [Google Scholar] [CrossRef]
  23. Chen, D.Y.; Peng, L.; Zhang, W.Y.; Wang, Y.D.; Yang, L.N. Research on Self-Supervised Building Information Extraction with High-Resolution Remote Sensing Images for Photovoltaic Potential Evaluation. Remote Sens. 2022, 14, 5350. [Google Scholar] [CrossRef]
  24. Ministry of Land, Infrastructure, Transport and Tourism of Japan. Housing Economy Related Data. 2024. Available online: https://www.mlit.go.jp/statistics/details/t-jutaku-2_tk_000002.html (accessed on 4 September 2025).
  25. Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Ashukha, A.; Silvestrov, A.; Kong, N.; Goka, H.; Park, K.; Lempitsky, V.; et al. Resolution-Robust Large Mask Inpainting with Fourier Convolutions. arXiv 2021, arXiv:2109.07161. [Google Scholar] [CrossRef]
  26. Bigdeli, S.; Süsstrunk, S. Deep Semantic Segmentation Using Nir as Extra Physical Information. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 2439–2443. [Google Scholar] [CrossRef]
  27. Cai, Y.; Fan, L.; Zhang, C. Semantic Segmentation of Multispectral Images via Linear Compression of Bands: An Experiment Using RIT-18. Remote Sens. 2022, 14, 2673. [Google Scholar] [CrossRef]
  28. Singhal, U.; Yu, S.X.; Steck, Z.; Kangas, S.; Reite, A.A. Multi-Spectral Image Classification with Ultra-Lean Complex-Valued Models. arXiv 2022, arXiv:2211.11797. [Google Scholar]
  29. Yang, J.; Li, P.; He, Y. A Multi-Band Approach to Unsupervised Scale Parameter Selection for Multi-Scale Image Segmentation. ISPRS J. Photogramm. Remote Sens. 2014, 94, 13–24. [Google Scholar] [CrossRef]
  30. Johnson, B.; Xie, Z. Unsupervised Image Segmentation Evaluation and Refinement Using a Multi-Scale Approach. ISPRS J. Photogramm. Remote Sens. 2011, 66, 473–483. [Google Scholar] [CrossRef]
  31. Huang, B.; Collins, M.L.; Bradbury, K.; Malof, M.J. Deep Convolutional Segmentation of Remote Sensing Imagery: A Simple and Efficient Alternative to Stitching Output Labels. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 6899–6902. [Google Scholar] [CrossRef]
  32. Karila, K.; Matikainen, L.; Karjalainen, M.; Puttonen, E.; Chen, Y.; Hyyppä, J. Automatic Labelling for Semantic Segmentation of VHR Satellite Images: Application of Airborne Laser Scanner Data and Object-Based Image Analysis. ISPRS Open J. Photogramm. Remote Sens. 2023, 9, 100046. [Google Scholar] [CrossRef]
  33. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  34. Bosch, M.; Kurtz, Z.; Hagstrom, S.; Brown, M. A multiple view stereo benchmark for satellite imagery. In Proceedings of the 2016 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 18–20 October 2016; pp. 1–9. [Google Scholar] [CrossRef]
  35. Myron, B.; Hirsh, G.; Kevin, F.; Andrea, L.; Sean, W.; Shea, H.; Marc, B.; Scott, A. Large-Scale Public Lidar and Satellite Image Data Set for Urban Semantic Labeling. In Proceedings of the Laser Radar Technology and Applications XXIII, Orlando, FL, USA, 17–18 April 2018; pp. 154–167. [Google Scholar] [CrossRef]
  36. Yang, J.; Matsushita, B.; Zhang, H. Improving Building Rooftop Segmentation Accuracy through the Optimization of UNet Basic Elements and Image Foreground-Background Balance. ISPRS J. Photogramm. Remote Sens. 2023, 201, 123–137. [Google Scholar] [CrossRef]
  37. Vermote, E.F.; Tanré, D.; Luc Deuzé, J.; Herman, M.; Morcrette, J.J. Second Simulation of the Satellite Signal in the Solar Spectrum, 6S: An Overview. IEEE Trans. Geosci. Remote Sens. 1997, 35, 675–686. [Google Scholar] [CrossRef]
  38. Wilson, R.T. Py6S: A Python Interface to the 6S Radiative Transfer Model. Comput. Geosci. 2013, 51, 166–171. [Google Scholar] [CrossRef]
  39. Adler-Golden, S.M.; Matthew, M.W.; Bernstein, L.S.; Levine, R.Y.; Berk, A.; Richtsmeier, S.C.; Acharya, P.K.; Anderson, G.P.; Felde, G.; Gardner, J.; et al. Atmospheric Correction for Shortwave Spectral Imagery Based on MODTRAN4. Proc. SPIE 1999, 3753, 61–69. [Google Scholar] [CrossRef]
  40. Guo, Y.; Zeng, F. Atmospheric Correction Comparison of SPOT-5 Image Based on Model FLAASH and Model QUAC. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, 39, 7–11. [Google Scholar] [CrossRef]
  41. Nguyen, H.C.; Jung, J.; Lee, J.; Choi, S.U.; Hong, S.Y.; Heo, J. Optimal Atmospheric Correction for Above-Ground Forest Biomass Estimation with the ETM+ Remote Sensor. Sensors 2015, 15, 18865–18886. [Google Scholar] [CrossRef]
  42. Marcello, J.; Eugenio, F.; Perdomo, U.; Medina, A. Assessment of Atmospheric Algorithms to Retrieve Vegetation in Natural Protected Areas Using Multispectral High Resolution Imagery. Sensors 2016, 16, 1624. [Google Scholar] [CrossRef] [PubMed]
  43. Yang, M.; Hu, Y.; Tian, H.; Khan, F.A.; Liu, Q.; Goes, J.I.; Gomes, H.D.R.; Kim, W. Atmospheric Correction of Airborne Hyperspectral CASI Data Using Polymer, 6S and FLAASH. Remote Sens. 2021, 13, 5062. [Google Scholar] [CrossRef]
  44. Laben, C.A.; Brower, B.V. Process for Enhancing the Spatial Resolution of Multispectral Imagery Using Pan-Sharpening. U.S. Patent 6,011,875, 4 January 2000. [Google Scholar]
  45. Sun, W.; Chen, B.; Messinger, D.W. Nearest-Neighbor Diffusion-Based Pan-Sharpening Algorithm for Spectral Images. Opt. Eng. 2014, 53, 013107. [Google Scholar] [CrossRef]
  46. Yilmaz, C.S.; Yilmaz, V.; Gungor, O. A Theoretical and Practical Survey of Image Fusion Methods for Multispectral Pansharpening. Inform. Fusion. 2022, 79, 1–43. [Google Scholar] [CrossRef]
  47. NV5 Geospatial. NNDiffuse Pan-Sharpening. NV5 Geospatial Software Documentation. 2025. Available online: https://www.nv5geospatialsoftware.com/docs/nndiffusepansharpening.html (accessed on 4 September 2025).
  48. Yilmaz, C.S.; Yilmaz, V.; Gungor, O.; Shan, J. Metaheuristic Pansharpening Based on Symbiotic Organisms Search Optimization. ISPRS J. Photogramm. Remote Sens. 2019, 158, 167–187. [Google Scholar] [CrossRef]
  49. Yilmaz, V.; Yilmaz, C.S.; Güngör, O.; Shan, J. A Genetic Algorithm Solution to the Gram-Schmidt Image Fusion. Int. J. Remote Sens. 2020, 41, 1458–1485. [Google Scholar] [CrossRef]
  50. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar] [CrossRef]
  51. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv 2020, arXiv:2003.04297. [Google Scholar] [CrossRef]
  52. Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; Li, J. Dice Loss for Data-Imbalanced NLP Tasks. arXiv 2020, arXiv:1911.02855. [Google Scholar] [CrossRef]
  53. Bhattacharyya, A. On a Measure of Divergence between Two Multinomial Populations. Sankhya Indian J. Stat. 1946, 7, 401–406. [Google Scholar]
  54. Schanda, J. Chapter 4: CIE Color Difference Metrics. In Colorimetry: Understanding the CIE System, 1st ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2007. [Google Scholar] [CrossRef]
  55. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. arXiv 2020, arXiv:2006.07733. [Google Scholar] [CrossRef]
  56. Cheng, T.; Ji, X.; Yang, G.; Zheng, H.; Ma, J.; Yao, X.; Zhu, Y.; Cao, W. DESTIN: A New Method for Delineating the Boundaries of Crop Fields by Fusing Spatial and Temporal Information from WorldView and Planet Satellite Imagery. Comput. Electron. Agric. 2020, 178, 105787. [Google Scholar] [CrossRef]
  57. Nininahazwe, F.; Varin, M.; Théau, J. Mapping Common and Glossy Buckthorns (Frangula Alnus and Rhamnus Cathartica) Using Multi-Date Satellite Imagery WorldView-3, GeoEye-1 and SPOT-7. Int. J. Digit. Earth. 2023, 16, 31–42. [Google Scholar] [CrossRef]
  58. Gao, B.C.; Montes, M.J.; Davis, C.O.; Goetz, A.F.H. Atmospheric Correction Algorithms for Hyperspectral Remote Sensing Data of Land and Ocean. Remote Sens. Environ. 2009, 113, S17–S24. [Google Scholar] [CrossRef]
  59. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. arXiv 2021, arXiv:2111.06377. [Google Scholar] [CrossRef]
  60. Baek, W.K.; Lee, M.J.; Jung, H.S. Land Cover Classification From RGB and NIR Satellite Images Using Modified U-Net Model. IEEE Access 2024, 12, 69445–69455. [Google Scholar] [CrossRef]
  61. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  62. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  63. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
  64. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remotely Sensed Imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  65. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar] [CrossRef]
  66. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  67. Chen, L.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  68. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar] [CrossRef]
  69. Yang, J.; Zhao, Z.; Yang, J. A shadow removal method for high resolution remote sensing image. Geomat. Inf. Sci. Wuhan Univ. 2008, 33, 17–20. [Google Scholar]
  70. Guo, J.; Tian, Q.; Wu, Y. Study on multispectral detecting shadow areas and a theoretical model of removing shadows from remote sensing images. J. Remote Sens. 2006, 2, 151–159. [Google Scholar] [CrossRef]
Figure 1. Manually labeled building segmentation dataset (Dataset I). (a) Sakaiminato, (b) Sanuki, (c) Tsukuba, (d) locations of the three regions, (e) example of the labeled patches from the image of Sakaiminato, (f) example of the labeled patches from the image of Sanuki, (g,h) examples of the labeled patches from the image of Tsukuba, (e-1–h-1) image patches of (e–h), (e-2–h-2) corresponding image patches with the building labels overlaid. The yellow polygons represent the building labels.
Figure 2. An illustration of the original image and its corresponding augmented images. (a) original image, (b) horizontally flipped image, (c) vertically flipped image, (d) horizontally and vertically flipped image, (e) random scale cropped image, (f) 4-image mosaicked image.
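For reference, the augmentations illustrated in Figure 2 can be reproduced with a few lines of NumPy. The sketch below is illustrative only: the patch size, the scale range of the random-scale crop, and the simple 2 × 2 tiling used for the 4-image mosaic are assumptions, not parameters reported in this paper.

```python
import numpy as np

def flips(img):
    """Horizontal, vertical, and combined flips of an (H, W, C) patch (Figure 2b-d)."""
    return img[:, ::-1], img[::-1, :], img[::-1, ::-1]

def random_scale_crop(img, out_size=512, scale_range=(0.5, 1.0), rng=np.random):
    """Crop a random sub-window and resize it back to out_size by nearest neighbour (Figure 2e)."""
    h, w, _ = img.shape
    s = rng.uniform(*scale_range)
    ch, cw = max(1, int(h * s)), max(1, int(w * s))
    y0 = rng.randint(0, h - ch + 1)
    x0 = rng.randint(0, w - cw + 1)
    crop = img[y0:y0 + ch, x0:x0 + cw]
    rows = np.arange(out_size) * ch // out_size   # nearest-neighbour row indices
    cols = np.arange(out_size) * cw // out_size   # nearest-neighbour column indices
    return crop[rows][:, cols]

def mosaic4(patches):
    """Tile four equally sized patches into a 2 x 2 mosaic (Figure 2f)."""
    a, b, c, d = patches
    top = np.concatenate([a, b], axis=1)
    bottom = np.concatenate([c, d], axis=1)
    return np.concatenate([top, bottom], axis=0)
```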
Figure 3. Flowchart of the five image standardization methods. All digital numbers for the MS and PAN bands were radiometrically corrected before performing the image standardization. Note that the input dimensions = 4245 (16,980) pixels × 4244 (16,976) pixels × 4 (1) bands and the output dimensions = 16,980 pixels × 16,976 pixels × 4 bands.
Figure 4. The multi-task SSL network designed in this study. I_i represents an image masked with multiple small blocks from the original image. I_s represents a spectrum-damaged image derived from the original image. I_j and I_k represent transformed images from the corresponding two different original images. Note the input dimensions: ➀ 512 pixels × 512 pixels × 4 bands, ➁ 512 pixels × 512 pixels × 4 bands, ➂ 512 pixels × 512 pixels × 4 bands × 2 patches; and the output dimensions: ➀ 512 pixels × 512 pixels × 4 bands, ➁ 512 pixels × 512 pixels × 4 bands.
Figure 5. Flowchart of the proposed method for building segmentation.
Figure 6. Examples of building segmentation using different single pretext tasks in SSL. (ad) Original WorldView-3 images, (e,f) using the image inpainting pretext task of [22] (the use of a single large block), (g,h) using the pretext task of [18] (Colorful Image Colorization), (i,j) using the new Pretext Task 1 proposed in this study (SB3 in Table 4), and (k,l) using the new Pretext Task 2 proposed in this study (4A in Table 5). White pixels represent the “True Positive”, black pixels represent the “True Negative”, red pixels represent the “False Positive”, and green pixels represent the “False Negative”.
Figure 7. Examples of building segmentation using U-Net in combination with different SSL methods. (a) Original WorldView-3 images, (b) using SimCLR as SSL, (c) using MoCo v2 as SSL, (d) using BYOL as SSL, (e) using PGSSL as SSL, (f) using the new multi-task SSL network as SSL (this study). White pixels represent the “True Positive”, black pixels represent the “True Negative”, red pixels represent the “False Positive”, and green pixels represent the “False Negative”.
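The pixel-level color coding used in Figures 6 and 7 can be generated directly from a predicted mask and a reference mask. A minimal sketch (the function name is ours) is:

```python
import numpy as np

def error_map(pred, truth):
    """Color-coded comparison of a predicted and a reference building mask.
    pred, truth: boolean (H, W) arrays -> (H, W, 3) uint8 RGB image."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)  # true negatives stay black
    rgb[pred & truth] = (255, 255, 255)                 # true positives: white
    rgb[pred & ~truth] = (255, 0, 0)                    # false positives: red
    rgb[~pred & truth] = (0, 255, 0)                    # false negatives: green
    return rgb
```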
Table 2. Histogram similarities between two of the three images of different regions using different image standardization methods. DB: Bhattacharyya Distance; image (a): Sakaiminato; image (b): Sanuki; image (c): Tsukuba. The lowest DB values are highlighted in bold and represent the highest histogram similarity.
No. | Method              | Mean DB | DB (a)–(b) | DB (b)–(c) | DB (c)–(a)
1   | Only NNDiffuse      | 0.400   | 0.459      | 0.486      | 0.257
2   | 6S + NNDiffuse      | 0.153   | 0.140      | 0.113      | 0.207
3   | FLAASH + NNDiffuse  | 0.369   | 0.295      | 0.489      | 0.322
4   | NNDiffuse + 6S      | 0.252   | 0.225      | 0.274      | 0.258
5   | NNDiffuse + FLAASH  | 0.378   | 0.389      | 0.464      | 0.279
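The Bhattacharyya distance (DB) reported in Table 2 compares the normalized histograms of two images; lower values indicate more similar distributions. A minimal per-band sketch (the bin count and the reflectance value range are assumptions, and how the bands are combined into a single score is not detailed here) is:

```python
import numpy as np

def bhattacharyya_distance(band_x, band_y, bins=256, value_range=(0.0, 1.0)):
    """D_B = -ln( sum_i sqrt(p_i * q_i) ) between the normalized histograms of two bands."""
    p, _ = np.histogram(band_x.ravel(), bins=bins, range=value_range)
    q, _ = np.histogram(band_y.ravel(), bins=bins, range=value_range)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sqrt(p * q).sum()   # Bhattacharyya coefficient, 1 = identical histograms
    return -np.log(bc)
```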
Table 3. Accuracy evaluation results of building segmentation using five image standardization methods. (A): uses the images from Sakaiminato and Sanuki as the training dataset, and the image from Tsukuba as the test dataset; (B): uses the images from Sakaiminato and Tsukuba as the training dataset, and the image from Sanuki as the test dataset; (C): uses the images from Tsukuba and Sanuki as the training dataset, and the image from Sakaiminato as the test dataset. The highest IoU, OA and variance values are highlighted in bold, and the second highest IoU, OA, and variance values are underlined.
No. | Method              | Mean IoU | Mean OA | (A) IoU | (A) OA | (B) IoU | (B) OA | (C) IoU | (C) OA | Variance IoU | Variance OA
1   | Only NNDiffuse      | 0.395    | 0.862   | 0.371   | 0.883  | 0.508   | 0.940  | 0.304   | 0.762  | 5.38 × 10−3  | 1.66 × 10−3
2   | 6S + NNDiffuse      | 0.575    | 0.943   | 0.505   | 0.930  | 0.622   | 0.956  | 0.599   | 0.942  | 3.46 × 10−3  | 1.76 × 10−4
3   | FLAASH + NNDiffuse  | 0.420    | 0.890   | 0.324   | 0.865  | 0.465   | 0.918  | 0.470   | 0.886  | 5.22 × 10−3  | 7.03 × 10−4
4   | NNDiffuse + 6S      | 0.531    | 0.935   | 0.502   | 0.929  | 0.514   | 0.938  | 0.576   | 0.937  | 2.07 × 10−4  | 2.22 × 10−5
5   | NNDiffuse + FLAASH  | 0.424    | 0.889   | 0.385   | 0.885  | 0.511   | 0.940  | 0.378   | 0.843  | 4.14 × 10−3  | 9.41 × 10−4
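The IoU and OA values reported in Tables 3–7 are the standard pixel-wise metrics computed from the predicted and reference building masks; a minimal sketch:

```python
import numpy as np

def iou_and_oa(pred, truth):
    """Pixel-wise Intersection over Union (building class) and Overall Accuracy."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    iou = intersection / union if union else 1.0   # empty masks count as a perfect match
    oa = (pred == truth).mean()
    return iou, oa
```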
Table 4. Accuracy of building segmentation using the SSL network with Pretext Task 1 only. The three image transformation methods are image rotation, image flipping, and image color jittering. LB: The large block masked area uses all three transformation methods; SB1: Each small block masked area uses a different combination of the three transformation methods; SB2: All small block masked areas use the same combination of the three transformation methods; SB3: All small block masked areas are simply set to zero. (A): uses the images from Sakaiminato and Sanuki as the training dataset, and the image from Tsukuba as the test dataset; (B): uses the images from Sakaiminato and Tsukuba as the training dataset, and the image from Sanuki as the test dataset; (C): uses the images from Tsukuba and Sanuki as the training dataset, and the image from Sakaiminato as the test dataset. The highest IoU and OA values are highlighted in bold.
No. | Image Inpainting Method   | Image Damage Strategy | Mean IoU | Mean OA | (A) IoU | (A) OA | (B) IoU | (B) OA | (C) IoU | (C) OA
1   | One Large Block           | LB                    | 0.602    | 0.949   | 0.596   | 0.951  | 0.610   | 0.955  | 0.598   | 0.943
2   | Small Blocks (this study) | SB1                   | 0.591    | 0.946   | 0.543   | 0.940  | 0.613   | 0.954  | 0.617   | 0.944
3   | Small Blocks (this study) | SB2                   | 0.611    | 0.950   | 0.637   | 0.959  | 0.611   | 0.953  | 0.584   | 0.939
4   | Small Blocks (this study) | SB3                   | 0.628    | 0.953   | 0.642   | 0.959  | 0.646   | 0.959  | 0.596   | 0.942
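The image damage strategies in Table 4 mask parts of the input patch, and the SSL network is trained to reconstruct the original patch. A minimal sketch of the SB3 strategy is given below; the block size and the number of blocks are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def mask_small_blocks(img, block=32, n_blocks=16, rng=np.random):
    """SB3: zero out several small square blocks of an (H, W, C) patch."""
    damaged = img.copy()
    h, w = img.shape[:2]
    for _ in range(n_blocks):
        y0 = rng.randint(0, h - block + 1)
        x0 = rng.randint(0, w - block + 1)
        damaged[y0:y0 + block, x0:x0 + block] = 0
    return damaged  # the SSL network learns to reconstruct the original patch from this input
```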
Table 5. Accuracy of building segmentation using SSL network with the Pretext Task 2 only. CIC: Colorful Image Colorization; ISR: Image Spectrum Recovery; RGB: Red-Green-Blue color space; CIELAB: International Commission on Illumination (CIE) 1976 (L, a, b) color space [54]; R1Z: Randomly select one band and set the value of the selected band to zero; R2A: Randomly select two bands, average their values, and replace the values of the two selected bands with the average value; R3A: Randomly select three bands, average their values, and replace the values of the three selected bands with the average value; 4A: Average all four bands and replace the values of the four bands with the average value. (A): use the images from Sakaiminato and Sanuki as the training dataset, and the image from Tsukuba as the test dataset; (B): use the images from Sakaiminato and Tsukuba as the training dataset, and the image from Sanuki as the test dataset; (C): use the images from Tsukuba and Sanuki as the training dataset, and the image from Sakaiminato as the test dataset. The highest IoU and OA values are highlighted in bold.
No. | Method           | Damage Strategy | Mean IoU | Mean OA | (A) IoU | (A) OA | (B) IoU | (B) OA | (C) IoU | (C) OA
1   | CIC              | RGB to CIELAB   | 0.648    | 0.957   | 0.664   | 0.965  | 0.670   | 0.965  | 0.611   | 0.942
2   | ISR (this study) | R1Z             | 0.605    | 0.949   | 0.602   | 0.953  | 0.624   | 0.955  | 0.589   | 0.938
3   | ISR (this study) | R2A             | 0.593    | 0.947   | 0.614   | 0.954  | 0.596   | 0.950  | 0.570   | 0.937
4   | ISR (this study) | R3A             | 0.606    | 0.949   | 0.588   | 0.950  | 0.635   | 0.957  | 0.594   | 0.941
5   | ISR (this study) | 4A              | 0.652    | 0.958   | 0.668   | 0.964  | 0.684   | 0.966  | 0.604   | 0.943
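The ISR damage strategies defined in the Table 5 caption can be written compactly as follows. This sketch only mirrors the caption's definitions of R1Z, R2A, R3A, and 4A on a four-band patch; the training step that recovers the original spectra is omitted.

```python
import numpy as np

def damage_spectrum(img, strategy="4A", rng=np.random):
    """Spectrally damage an (H, W, 4) patch following the strategies in the Table 5 caption."""
    damaged = img.astype(np.float32).copy()
    if strategy == "R1Z":                        # set one randomly selected band to zero
        damaged[..., rng.randint(4)] = 0.0
    elif strategy in ("R2A", "R3A"):             # replace 2 or 3 random bands by their mean
        k = 2 if strategy == "R2A" else 3
        bands = rng.choice(4, size=k, replace=False)
        damaged[..., bands] = damaged[..., bands].mean(axis=-1, keepdims=True)
    else:                                        # "4A": replace all four bands by their mean
        damaged[...] = damaged.mean(axis=-1, keepdims=True)
    return damaged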
Table 6. Ablation experiments on the pretext tasks of the multi-task SSL network developed in this study. Pretext Task 1: image inpainting using multiple small blocks with the image damage strategy SB3; Pretext Task 2: image spectrum recovery with the damage strategy 4A; Pretext Task 3: SimCLR from [50]. (A): uses the images from Sakaiminato and Sanuki as the training dataset, and the image from Tsukuba as the test dataset; (B): uses the images from Sakaiminato and Tsukuba as the training dataset, and the image from Sanuki as the test dataset; (C): uses the images from Tsukuba and Sanuki as the training dataset, and the image from Sakaiminato as the test dataset. The highest IoU and OA values are highlighted in bold.
No. | Pretext Task 1 | Pretext Task 2 | Pretext Task 3 | Mean IoU | Mean OA | (A) IoU | (A) OA | (B) IoU | (B) OA | (C) IoU | (C) OA
1   | ×              | ×              | ×              | 0.575    | 0.943   | 0.505   | 0.930  | 0.622   | 0.956  | 0.597   | 0.942
2   | ✓              | ×              | ×              | 0.628    | 0.953   | 0.642   | 0.959  | 0.646   | 0.959  | 0.596   | 0.942
3   | ✓              | ✓              | ×              | 0.636    | 0.954   | 0.661   | 0.963  | 0.645   | 0.958  | 0.604   | 0.940
4   | ✓              | ✓              | ✓              | 0.663    | 0.958   | 0.692   | 0.968  | 0.678   | 0.964  | 0.620   | 0.943
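Table 6 adds the pretext tasks one by one; when several tasks are active, their losses must be combined into a single training objective. A hypothetical equal-weight combination is sketched below; the paper's actual weighting scheme is not given in this section.

```python
def multi_task_loss(l_inpaint, l_spectrum, l_contrastive, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of the three pretext-task losses (equal weights assumed)."""
    w1, w2, w3 = weights
    return w1 * l_inpaint + w2 * l_spectrum + w3 * l_contrastive
```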
Table 7. Comparison of building segmentation accuracy using different SSL methods. (A): use the images from Sakaiminato and Sanuki as the training dataset, and the image from Tsukuba as the test dataset; (B): use the images from Sakaiminato and Tsukuba as the training dataset, and the image from Sanuki as the test dataset; (C): use the images from Tsukuba and Sanuki as the training dataset, and the image from Sakaiminato as the test dataset. The highest IoU and OA values are highlighted in bold.
No. | SSL Type             | Method     | Mean IoU | Mean OA | (A) IoU | (A) OA | (B) IoU | (B) OA | (C) IoU | (C) OA
1   | Contrastive Learning | SimCLR     | 0.615    | 0.951   | 0.669   | 0.965  | 0.582   | 0.948  | 0.594   | 0.940
2   | Contrastive Learning | MoCo v2    | 0.583    | 0.946   | 0.602   | 0.952  | 0.572   | 0.946  | 0.576   | 0.939
3   | Contrastive Learning | BYOL       | 0.589    | 0.947   | 0.569   | 0.945  | 0.593   | 0.952  | 0.606   | 0.943
4   | Multiple Task-based  | PGSSL      | 0.640    | 0.955   | 0.608   | 0.953  | 0.659   | 0.962  | 0.653   | 0.951
5   | Multiple Task-based  | This study | 0.663    | 0.958   | 0.692   | 0.968  | 0.678   | 0.964  | 0.620   | 0.943
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
