Article

Impact of Data Modality and Batch Normalization Layers on Very High-Resolution Impervious Surface Mapping Using DeepLabv3+ and U-Net Under Regional Cross-City and Cross-Season Domain Shifts

by
Jan-Philipp Langenkamp
* and
Andreas Rienow
Interdisciplinary Geographic Information Sciences, Institute of Geography, Ruhr-Universität Bochum, 44801 Bochum, Germany
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(9), 1433; https://doi.org/10.3390/rs18091433
Submission received: 27 March 2026 / Revised: 23 April 2026 / Accepted: 28 April 2026 / Published: 4 May 2026

Highlights

What are the main findings?
  • The model performances of U-Net and DeepLabv3+ are significantly impacted by cross-city and cross-season shifts when used for impervious surface mapping with very high-resolution remote sensing datasets.
  • The use of multimodal datasets (e.g., spectral and height information) for model training and the integration of AdaBN increase the robustness of both U-Net and DeepLabv3+ to cross-city and cross-season shifts. Slightly higher performance increases are seen with multimodal model versions than with AdaBN integration in spectral model versions.
  • The combination of using multimodal datasets for model training and AdaBN integration produced the most robust model performances in cross-city and cross-season experiments.
What are the implications of the main findings?
  • When domain adaptation techniques are unavailable, combining multimodal datasets and AdaBN can present a practical solution for increasing model robustness in impervious surface mapping under cross-city and cross-season shifts.
  • Reducing the covariate shift in cross-city and cross-season scenarios using AdaBN can considerably improve U-Net and DeepLabv3+ predictions for the impervious surfaces class.

Abstract

Urban planning, climatology, and hydrology require continuous and spatially explicit information about impervious surfaces. Semantic segmentation using very high-resolution remote sensing data has improved the performance of their detection. However, semantic segmentation models (SSMs) suffer from domain shifts when applied across cities or seasons. While domain adaptation (DA) techniques exist, the current literature provides little information on the level of sensitivity expected for baseline SSMs in mapping impervious surfaces in such scenarios. This study evaluates how data modality (e.g., spectral or height information) and adaptive batch normalization (AdaBN) affect the robustness of SSMs in cross-city and cross-season scenarios. The Potsdam and Vaihingen benchmark datasets were used, and their classes were merged into impervious surfaces, buildings, and background. The impervious surface class was found to be the most sensitive to cross-domain shifts. Both multimodal datasets and AdaBN increased model robustness, although the gain from AdaBN was 3.46 percentage points lower in mean intersection over union (mIoU) than the gain from multimodal datasets. The combination of multimodal datasets and AdaBN exhibited the best results throughout the experiments, increasing mIoU by an additional 10.06 percentage points compared to the multimodal model versions. When DA techniques are unavailable, using multimodal datasets in combination with AdaBN offers a practical approach for cross-domain scenarios in impervious surface mapping.

1. Introduction

The expansion of impervious surfaces is linked to altered hydrological processes [1,2,3], the surface urban heat island effect [2], or loss of biodiversity [4]. Accordingly, impervious surfaces, ground covered by artificial building materials such as asphalt or concrete, are recognized as a crucial environmental indicator [5]. Therefore, continuous, spatially explicit mapping of impervious surfaces is required to aid scientific analyses in disciplines such as urban planning, climatology, and hydrology, as well as to support policymakers in achieving the Sustainable Development Goals [6,7].
The use of deep learning (DL) methods in remote sensing (RS), particularly using semantic segmentation models (SSMs) based on convolutional neural networks (CNNs), has made the classification of detailed urban scenes using very high-resolution (VHR) imagery in submeter ranges more feasible and accurate [8,9,10]. In this context, SSMs automatically annotate pixels of an image with labels [11]. Over the last decade, a wide variety of SSMs have been developed. Some of these models, such as the U-Net [12] or the DeepLabv3+ [13], are widely used and recognized as baseline SSMs within the RS literature due to their competitiveness [14,15]. Typically, training and testing SSMs involve using independent datasets of images and corresponding ground truth (GT) masks that nevertheless come from similar data distributions. However, the generalization of SSMs is challenged when predictions are made in cities or seasons other than those on which they were initially trained [16,17,18,19,20], ultimately affecting the goal of consistency in mapping approaches. This degradation in model performance is caused by domain shifts, which occur when the initial (source) data distribution used for training the model differs from the unknown (target) data distribution used for evaluation [16,21]. RS data is affected by dynamic radiometric changes, which can be caused by various factors, for example, different acquisition conditions and systems, changes to the land surface, and differences in phenology or environmental conditions [16,22,23]. This is particularly relevant when working with VHR RS data, since it is more likely to capture radiometric changes that are not related to actual changes on the land surface due to the increased spatial resolution compared to satellite RS data [22]. Ultimately, these circumstances limit the practicality of continuous monitoring of impervious surfaces using VHR RS data, particularly those acquired from airborne sources.
In the RS literature, domain shifts are usually addressed using domain adaptation (DA) techniques, aiming to reduce the impact of the unknown target data on the model prediction [16,21]. Such techniques typically involve complex training strategies or architectural changes in baseline SSMs [18,19,20,24,25,26,27]. However, in many real-world scenarios, practitioners may lack the resources or the time to implement such techniques. Therefore, prior to the implementation of DA solutions, it is important to quantify the severity of performance degradation expected and the initial sensitivity of model architectures in cross-domain scenarios. Additionally, it is of practical relevance to understand how different data modalities (i.e., spectral or height information) and the observed differences in training and testing data distribution (covariate shift) [28] contribute to model robustness. For example, Qin et al. [17] evaluate the generalization of U-Net, DeepLabv3+, Feature Pyramid Network, HRNet, and the Random Forest besides several DA approaches on VHR RS datasets from Omaha, Jacksonville, Haiti, London and Shanghai. In their cross-city and cross-sensor scenarios, they report a significant performance drop of up to 35% in mean intersection over union (mIoU) when the models were transferred from the source data to the target data. Qin et al. [17] highlight that combining spectral data and above-ground-level (AGL) height information reduced performance drops, with the Random Forest achieving the highest increase of 12% in mIoU. In contrast, DA methods are reported to reduce performance drops by up to 10% in mIoU. Another study by Petrich et al. [29] analyzes the generalization of Support Vector Machines (SVMs), SegNet, and SegNet lite trained on multimodal VHR RS datasets. 
The experiments are based on the Vaihingen and Potsdam benchmark datasets from the International Society for Photogrammetry and Remote Sensing (ISPRS) and on the Urban 3D Challenge dataset from the United States Special Operations Command (USSOCOM). In their cross-city and cross-season scenarios, they emphasize that combining spectral data with height information from a normalized digital surface model (nDSM) improves the accuracy of predictions on target data by an average of 14.40% when Vaihingen-trained models are transferred to the Potsdam dataset. Petrich et al. [29] highlight that adding height information significantly impacts the prediction of the target data, whereas the effect on the source data is less pronounced. Although studies exist that evaluate the impact of data modality on initial robustness of SSMs in cross-domain scenarios, there is a lack of studies that focus specifically on mapping impervious surfaces while quantifying class-specific sensitivities using VHR RS data.
In deep neural networks (DNNs), recalibrating the normalization layers to the target data is a more lightweight approach to DA [30,31]. Within a DNN, normalization layers focus on statistically adjusting the activations of neurons [32]. One prominent normalization technique is batch normalization (BN), which was proposed by Ioffe and Szegedy [33], normalizing activations by considering mini-batches of input data fed into a DNN instead of updating on individual input data examples. This technique is reported to effectively improve training speed or model generalization [32] and, thus, represents an integral part of DNN architectures [30,32]. Based on this concept, Li et al. [31] propose adaptive batch normalization (AdaBN), building upon the hypothesis that the statistics (mean and variance) calculated within the BN layers are linked to the source domain on which the network was initially trained. By utilizing AdaBN, these statistics are recalibrated to the target domain data, reducing the covariate shift [31]. Since AdaBN does not require retraining or finetuning and operates only on unlabeled images from the target domain, this concept is described as a practical approach to DA [30,31]. Benson and Ecker [34] analyze the effect of multi-domain AdaBN in building damage detection using a two-stream ResNet50 and a Dual-HRNet trained on VHR satellite data. The authors emphasize that multi-domain AdaBN improves model performance robustness when facing domain shifts, while highlighting inconsistent performance patterns between domains. In a related conference paper, Wang et al. [35] report that AdaBN improves model performance by an average of 13.58 percentage points in mIoU compared to the DeepLabv3 baseline in cross-sensor scenarios, using the ISPRS Potsdam aerial dataset and the DFCTrack1 satellite dataset in land use and land cover classification. 
The authors specifically point out that BN statistics played an important role in their experiments involving cross-sensor shifts. Although the importance of BN statistics in cross-domain scenarios has been recognized, there is a lack of studies that systematically compare the impact of AdaBN against the impact of data modality on model generalization in cross-city and cross-season scenarios in impervious surfaces mapping using VHR RS datasets.
To fill this gap, we therefore conduct a systematic comparative analysis, quantifying the initial robustness of baseline SSMs trained on spectral (R, G, NIR) and multimodal datasets (R, G, NIR, nDSM) while analyzing modality-specific impacts of AdaBN in impervious surface mapping under cross-city and cross-season shifts. To achieve this, we use the Potsdam and Vaihingen VHR RS benchmark datasets provided by the ISPRS, reclassifying the initial six GT classes to the three classes: impervious surfaces, buildings, and background. For comparisons, we train spectral and multimodal versions of the U-Net and the DeepLabv3+ with and without adaptations using AdaBN. In the experimental setup, the Potsdam dataset is the source domain on which the initial models were trained. The Vaihingen dataset is the target domain, representing the cross-city and cross-season domain shift scenario. We, thus, address the following research questions:
  • What is the relative impact of input data modality (spectral vs. multimodal) and AdaBN on SSM performance in impervious surface mapping under cross-city and cross-season scenarios?
  • How consistent are the effects of input data modality and AdaBN across different SSM architectures in impervious surface mapping under cross-city and cross-season scenarios?

2. Materials and Methods

2.1. Study Area and Datasets

We conduct experiments on the Potsdam and Vaihingen datasets provided by the ISPRS [36]. These VHR RS datasets are a widely used, publicly available data source in the RS literature, for example, providing benchmarks to evaluate model architectures [14,15]. Both datasets represent urban scenes in the cities of Potsdam and Vaihingen an der Enz, which are located in northeast and southwest Germany, respectively (Figure 1). As the capital of the federal state of Brandenburg, Potsdam has a population of 188,288 and an area of 187.68 km² [37]. In contrast, Vaihingen an der Enz is a medium-sized city in the Stuttgart metropolitan region of the federal state of Baden-Württemberg. Vaihingen comprises 29,907 inhabitants and an area of 73.40 km² [38]. While Potsdam’s cityscape is characterized by a denser urban fabric with large building complexes, Vaihingen exhibits a more fragmented and small-scale settlement structure [36]. Furthermore, the Potsdam dataset was acquired under leaf-off conditions in spring, whereas the Vaihingen dataset was acquired under leaf-on conditions in summer.
The ISPRS Vaihingen dataset contains true orthophoto (TOP) tiles with near-infrared (NIR), red (R), and green (G) bands; digital surface model (DSM) data; and manually annotated GT masks [36,39]. Additional nDSM data matching this dataset is provided by Gerke [40]. The ground truth masks hold the class labels: (1) impervious surfaces, (2) building, (3) low vegetation, (4) tree, (5) car, and (6) clutter/background. The ISPRS Potsdam dataset is comparable to the ISPRS Vaihingen dataset. It contains TOP tiles with R, G, blue (B), and NIR bands, as well as DSM and nDSM data and manually annotated GT masks [36,41]. Table 1 documents the characteristics of the dataset and the data that were used in this study from the ISPRS Potsdam and Vaihingen datasets.

2.2. Dataset Preparation

In impervious surface mapping, leaf-off conditions are considered to represent more optimal mapping conditions due to the reduced obscuring effects caused by vegetation, such as tree canopies [42,43,44]. Since the Potsdam dataset represents leaf-off conditions, this dataset is used as the source data, while the Vaihingen dataset, representing leaf-on conditions, serves as the target data. The Potsdam tiles are divided into three sets. The first set is for training, the second set is for validation during training, and the third set is for testing the final model’s performance. The Vaihingen tiles are divided into two sets. The first set is for adaptation to the target data, providing a more comparative, unbiased setting, and the second set is for testing the final model performances. Figure 2 illustrates the sets used within the remainder of this study.
This study focused on the initial tile splits of the Vaihingen and Potsdam training and validation sets, which were provided by the ISPRS [39,41]. For the Potsdam dataset, the initial validation set was equivalent to the test set. Four tiles were separated from the initial training set to serve as the validation set in this study. For the Vaihingen dataset, the initial training set represents the adaptation set, while the initial validation set served as the testing set. Table 2 documents the tiles in the corresponding sets alongside their identifiers.
The datasets differ in spatial resolution. We resampled the Potsdam dataset (TOP, nDSM) from 5 cm to 9 cm using nearest-neighbour interpolation to match the spatial resolution of the Vaihingen dataset. The TOP and nDSM data of each dataset represent 8-bit integer values. Both datasets were rescaled to a range of [0, 1] by dividing each dataset by 255.
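The resampling and rescaling described above can be sketched as follows. This is a minimal illustration, assuming tiles are held as NumPy arrays of shape (H, W, C); the function names and the index-mapping formulation of nearest-neighbour interpolation are assumptions for illustration and not the authors' implementation.

```python
import numpy as np


def resample_nearest(tile, src_res=0.05, dst_res=0.09):
    """Nearest-neighbour resampling of an (H, W, C) tile from src_res to dst_res (metres)."""
    h, w = tile.shape[:2]
    scale = src_res / dst_res  # < 1 when moving to a coarser resolution
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    # Map each output pixel centre back to the nearest input pixel index.
    rows = np.minimum(((np.arange(new_h) + 0.5) / scale).astype(int), h - 1)
    cols = np.minimum(((np.arange(new_w) + 0.5) / scale).astype(int), w - 1)
    return tile[rows[:, None], cols[None, :]]


def rescale_8bit(tile):
    """Rescale 8-bit integer values to the range [0, 1]."""
    return tile.astype(np.float32) / 255.0
```

Nearest-neighbour interpolation only copies existing pixel values, so no new intensity values are introduced by the resampling step.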
To focus on impervious surface mapping, we reclassified the original six classes within the GT masks into the three classes: impervious surfaces, buildings, and background. The car class was combined with the initial impervious surface class because most of the car objects in the datasets were found in this class. The initial low vegetation, tree, and clutter/background classes were combined for the background class, representing pervious land covers. The initial building class was left unchanged. Table 3 documents the reclassification process for the initial ISPRS class labels.
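The reclassification can be expressed as a lookup table. The sketch below assumes the six original classes have already been converted to integer indices 0 to 5 in the order listed in Section 2.1; this index order and the constant names are illustrative assumptions, since the released GT masks are colour-coded.

```python
import numpy as np

# Assumed integer encoding of the original classes (illustrative only):
# 0 impervious surfaces, 1 building, 2 low vegetation, 3 tree, 4 car, 5 clutter/background
IMPERVIOUS, BUILDING, BACKGROUND = 0, 1, 2

# Position i holds the new class for original class i.
LOOKUP = np.array(
    [IMPERVIOUS, BUILDING, BACKGROUND, BACKGROUND, IMPERVIOUS, BACKGROUND],
    dtype=np.uint8,
)


def reclassify(mask):
    """Map a six-class integer mask to the three classes used in this study."""
    return LOOKUP[mask]
```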
For the experiments, we created spectral (NIR, R, and G) and multimodal (NIR, R, G, and nDSM) datasets for each city. Training, validation, and adaptation tiles were processed into non-overlapping patches of size 512 × 512 pixels. For the Potsdam dataset, this resulted in 1152 patches for training and 384 patches for validation. For the Vaihingen dataset, 387 patches were created, serving as the adaptation set. The test tiles for both datasets retained their original shape and were processed during model inference. Model performances reported throughout this study refer to the test sets for the corresponding city.
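A minimal sketch of the non-overlapping patching step is given below. How tile borders that do not fill a complete patch were handled is not stated in the text; dropping them here is an assumption, as is the function name.

```python
import numpy as np


def extract_patches(tile, patch=512):
    """Split an (H, W, C) tile into non-overlapping patch x patch windows.

    Assumption: incomplete windows at the right and bottom borders are dropped.
    """
    h, w = tile.shape[:2]
    patches = [
        tile[r:r + patch, c:c + patch]
        for r in range(0, h - patch + 1, patch)
        for c in range(0, w - patch + 1, patch)
    ]
    return np.stack(patches)
```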

2.3. Model Architectures

We utilized two encoder–decoder CNN-based model architectures, namely the U-Net [12] and the DeepLabv3+ [13]. This choice was motivated by their popularity as baseline models in the RS literature [15,17], as well as for the DeepLabv3+ within the DA literature [18,19,24,26]. Thus, analyzing the initial transferability of both models under cross-city and cross-season domain shifts may inform a variety of studies. We report their main architectural design in the following sections.

2.3.1. U-Net

The U-Net was developed by Ronneberger et al. [12]. Its design is composed of an encoder and a decoder part, which are linked through skip connections. The encoder is responsible for extracting features from the input data, which are processed by the decoder to generate segmentation masks of the input image. In this process, the skip connections aim to keep the spatial detail from the feature maps of earlier stages. The encoder and decoder have a similar structure, each comprising four convolutional blocks in which two 3 × 3 convolution layers are each followed by a rectified linear unit (ReLU) activation. However, while the encoder downsamples using 2 × 2 max pooling layers, the decoder upsamples using 2 × 2 up-convolution layers. The number of feature channels in the encoder is doubled with each block (from 64 to 512 features) and correspondingly reduced in the decoder. The bottleneck captures the most abstract features and consists of two 3 × 3 convolution layers with ReLU activations, connecting the encoder and decoder. For the final segmentation mask, a 1 × 1 convolution is used, adjusting the number of channels to the required output.

2.3.2. DeepLabv3+

The DeepLabv3+ was created by Chen et al. [13]. The model architecture comprises an encoder–decoder structure and an atrous spatial pyramid pooling (ASPP) module, aiming to strengthen contextual information and detailed object boundaries. A CNN backbone generates feature maps inside the encoder, from which the last feature maps are transferred to the ASPP module. Inside the ASPP module, atrous convolutions are used with different dilation rates to obtain multi-scale contextual information. The ASPP output is upsampled to align its spatial resolution with that of the feature maps from earlier layers and is then concatenated with these feature maps to refine the object boundaries. Subsequent convolutional layers process the concatenated features, and the result is upsampled to the original input image size to produce the final segmentation output.

2.4. Training Configurations

To ensure a comparable experimental setup, the same hyperparameters and training settings were used throughout the experiments. Additionally, to increase comparability between the U-Net and DeepLabv3+ model architecture implementations, we used the Segmentation Models PyTorch (SMP) library [45] (version: 0.5.0) with PyTorch (version: 2.9.1+cu128) in Python (version: 3.12.12). For both U-Net and DeepLabv3+, we utilized ResNet34 as the backbone model without pretrained weights to focus on initial model transferability. Data augmentation was implemented using horizontal flips, vertical flips, and random rotation by 90 degrees, each with a probability of 50%. Photometric data augmentations were not considered in order to focus on the raw domain shifts observed in the data. Each model was trained for 40 epochs using the AdamW optimizer [46] with a learning rate of 1 × 10⁻³. Additionally, a cosine annealing learning rate scheduler was utilized to reduce the learning rate throughout model training to a final learning rate of 1 × 10⁻⁵ [47]. The batch size was set to 16. Early stopping was triggered when the validation loss had not decreased for 10 consecutive epochs. As the loss function, we utilized the cross entropy (CE) loss, which is one of the most widely applied loss functions in DL [48].
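The learning rate schedule and the stopping rule described above can be illustrated in plain Python. This is a sketch under stated assumptions: the scheduler is assumed to step once per epoch over the full 40 epochs, and `cosine_annealing_lr` and `should_stop` are hypothetical helper names, not part of the authors' training code or of any library.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=40, lr_max=1e-3, lr_min=1e-5):
    """Learning rate after `epoch` completed epochs under cosine annealing."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)


def should_stop(val_losses, patience=10):
    """True if the last `patience` epochs did not improve on the earlier best loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


# One value per epoch boundary, from the initial to the final learning rate.
schedule = [cosine_annealing_lr(e) for e in range(41)]
```

Under these assumptions the schedule starts at 1 × 10⁻³, passes through the midpoint of the two rates at epoch 20, and ends at 1 × 10⁻⁵.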

2.5. Model Inference

We divided model inference into two categories. The first category is conventional inference, which uses the initially trained models without adaptations. The second category is an adjusted inference for which models were adjusted using AdaBN prior to the inference step. For the remainder of this study, these categories will be referred to as “conventional” and “AdaBN”.
BN normalizes the activation outputs of a neural network layer using the activations produced for each mini-batch ($B$) of training data [30,32,33]. For each feature, the mean ($\mu_B$) and variance ($\sigma_B^2$) are calculated across the mini-batch. Each individual activation ($x_i$) is then normalized by subtracting the feature’s mini-batch mean and dividing by the feature’s mini-batch standard deviation, while adding a constant ($\epsilon$) for numerical stability (1) [33]. After normalization, the scale ($\gamma$) and shift ($\beta$) parameters are applied to the normalized activations, which are learned by the model during training (2) [33].
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{1}$$
$$y_i = \gamma \times \hat{x}_i + \beta \tag{2}$$
At inference time, the BN layers normalize the test data using the fixed running estimates of the mean and variance, which were accumulated during training. Therefore, these learned statistics are representative of the training data [31]. With AdaBN, these statistics are updated based on the target data, recalculating the mean and variance [31].
We adopt the concept of AdaBN by updating the BN layer running statistics using the target data represented by the Vaihingen adaptation set. For each adaptation, we created a separate copy of each individual source-trained model so that the BN layer statistics always reflected the initial source-trained statistics prior to applying AdaBN. Then, the input images from the adaptation set, without any GT masks, are passed through the SSM in training mode without gradient updates, allowing the running mean and variance to adapt per mini-batch. To evaluate the impact of AdaBN on the encoder and decoder, this workflow was reproduced. This time, however, either the encoder or the decoder was set to training mode, while the other was set to evaluation mode. The same batch size of 16 was applied during this process as was used during initial model training. Additionally, random shuffling was disabled in the data loading pipeline to allow for more accurate model comparisons.
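The recalibration idea can be illustrated with a toy NumPy layer. This is a simplified sketch and not the authors' PyTorch/SMP workflow: it operates on inputs of shape (N, F) rather than on convolutional feature maps, the momentum of 0.1 and the use of the biased batch variance are assumptions, and `ToyBatchNorm` and `adapt_bn` are hypothetical names.

```python
import copy

import numpy as np


class ToyBatchNorm:
    """Per-feature batch normalization over inputs of shape (N, F)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)   # scale, left untouched by AdaBN
        self.beta = np.zeros(num_features)   # shift, left untouched by AdaBN
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum             # assumed value
        self.eps = eps
        self.training = True

    def __call__(self, x):
        if self.training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            # Running estimates are updated as exponential moving averages.
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta


def adapt_bn(layer, target_batches):
    """Return a copy of `layer` whose running statistics follow the target data."""
    adapted = copy.deepcopy(layer)  # the source-trained layer stays unchanged
    adapted.training = True
    for batch in target_batches:    # forward passes only, no labels, no gradients
        adapted(batch)
    adapted.training = False
    return adapted
```

Only the running mean and variance change during adaptation; the learned scale and shift parameters remain those of the source-trained layer, which mirrors the separation between statistics and learned weights described above.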
Inference was performed on the testing tiles of either the Potsdam or the Vaihingen dataset. First, the tiles were patched using a sliding window approach with overlapping patches and a step size of 128 pixels. For each patch, class probability maps were created. The class probability maps were aggregated to match their initial positions and the final tile sizes. The predicted class probabilities from all overlapping patches were summed pixel-wise. Throughout this process, the number of times each pixel was overlapped was tracked. Then, the final class probability mask was normalized by dividing it by the count of pixel overlap. The final segmentation mask was obtained by assigning each pixel to the class with the highest aggregated probability.
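The aggregation of overlapping patch predictions can be sketched as follows. The patch size of 512 pixels at inference, the handling of tile borders, and the callable `predict_fn` (a stand-in for a trained model that returns per-pixel class probabilities of shape (h, w, classes)) are assumptions for illustration.

```python
import numpy as np


def window_starts(size, patch, step):
    """Start offsets that cover [0, size), with the last window flush to the border."""
    if size <= patch:
        return [0]
    starts = list(range(0, size - patch + 1, step))
    if starts[-1] != size - patch:
        starts.append(size - patch)
    return starts


def sliding_window_predict(tile, predict_fn, num_classes=3, patch=512, step=128):
    """Average overlapping class probability maps and return the per-pixel argmax."""
    h, w = tile.shape[:2]
    prob_sum = np.zeros((h, w, num_classes), dtype=np.float64)
    counts = np.zeros((h, w, 1), dtype=np.float64)
    for r in window_starts(h, patch, step):
        for c in window_starts(w, patch, step):
            window = tile[r:r + patch, c:c + patch]
            prob_sum[r:r + patch, c:c + patch] += predict_fn(window)  # pixel-wise sum
            counts[r:r + patch, c:c + patch] += 1                     # overlap count
    mean_prob = prob_sum / counts  # normalize by the number of overlaps
    return mean_prob.argmax(axis=-1)
```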

2.6. Evaluation

To evaluate the experiments, class-specific and overall metrics were calculated. For this, the intersection over union (IoU), F1 score, precision, and recall were computed from class-specific true positives (TPs), false positives (FPs), and false negatives (FNs). Precision (3) measures the share of TPs among all positive model predictions, indicating the correctness of the model for the target class [49]. Recall (4) measures the share of TPs among all actual positive cases, that is, TPs and FNs combined. Therefore, this metric indicates how many of the actual positive cases were identified by the model [49]. In addition, F1 (5) is the harmonic mean of precision and recall, so a high score requires both to be high [49]. The IoU (6) is a standard metric in semantic segmentation, quantifying the degree of overlap between the ground truth and the predicted outcome, for which higher scores indicate better performance [49]. For an overall overview of model performances, we averaged class-wise metrics over the number of classes (denoted with C within the formulas) as described in (7). Unless stated otherwise, the metric results reported in the remainder of this study are the mean metric results calculated on either the Potsdam or the Vaihingen test set. The IoU is the exception: IoU refers to the class-specific result, and mean IoU (mIoU) refers to the class-averaged result. The metrics are defined as follows:
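$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c} \tag{3}$$
$$\mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c} \tag{4}$$
$$\mathrm{F1}_c = \frac{2 \times \mathrm{Precision}_c \times \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c} \tag{5}$$
$$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c} \tag{6}$$
$$\mathrm{mMetric} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{Metric}_c \tag{7}$$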
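The metric definitions (3) to (7) translate directly into code. The following is a minimal sketch assuming integer label arrays of equal shape; returning 0.0 when a denominator is zero is an assumed convention, and the function names are illustrative.

```python
import numpy as np


def class_metrics(y_true, y_pred, num_classes=3):
    """Per-class precision, recall, F1, and IoU from TP, FP, and FN counts."""
    results = {}
    for c in range(num_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        results[c] = {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
    return results


def mean_metric(per_class, name):
    """Average a class-wise metric over all classes, as in Equation (7)."""
    return sum(m[name] for m in per_class.values()) / len(per_class)
```

Because the IoU denominator contains both FPs and FNs, a class's IoU can never exceed its precision or its recall.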

3. Results

In this study, four separate models were trained, and 12 inferences were performed, as indicated in Table 4. The source data test is defined as Potsdam to Potsdam (P2P), and the target data test is Potsdam to Vaihingen (P2V).

3.1. Potsdam to Potsdam (P2P)

3.1.1. Overall Results

As shown in Table 5, the multimodal model versions outperformed the spectral model versions by 3.63 percentage points in mIoU when comparing the mean model performances. Among the spectral model versions, the U-Net achieved an mIoU of 80.50%, which was 0.70 percentage points higher than the DeepLabv3+’s mIoU of 79.80%. The same was true for the multimodal model versions, where the U-Net (mIoU = 83.85%) outperformed DeepLabv3+ (mIoU = 83.70%), albeit by an even smaller margin of 0.15 percentage points.

3.1.2. Class-Specific Results

As illustrated in Figure 3, among all models, the impervious surface class exhibited the lowest IoU and F1 scores. The spectral DeepLabv3+ exhibited the lowest IoU with 76.14%, whereas the multimodal U-Net achieved the highest IoU with 79.37% for the impervious surface class. The best class performance, among all models, was observed for the buildings class. Again, the spectral DeepLabv3+ achieved the lowest score of 82.05%, whereas the multimodal U-Net exhibited the highest IoU with 90.37% for the building class.
As shown in Table 6, the IoU improved the most for the building class when the difference between the multimodal and spectral performance results is compared. The DeepLabv3+ improved the most by 7.89 percentage points, compared to 7.65 for the U-Net. The second-highest improvement was observed for the impervious surface class, showing increases in IoU by 3.01 percentage points for DeepLabv3+ and 2.33 for U-Net. In contrast, the background class increased only marginally when the multimodal dataset was used.

3.2. Potsdam to Vaihingen (P2V)

3.2.1. Overall Results

Table 7 shows the model performances on the P2V test set. Overall, all model performances were affected by the cross-domain shift, as indicated by considerably lower metric scores compared to P2P.
The mean model performance results show that the multimodal model versions (mIoU = 64.76%) outperformed the spectral model versions (mIoU = 47.22%) by 17.54 percentage points in mIoU. This increase was 3.46 percentage points lower when AdaBN was combined with spectral model versions, showing a mean model performance of 61.30% mIoU. The highest mIoU and F1 scores were observed when multimodal model versions were combined with AdaBN. When comparing the mean model performance, these model versions achieved an mIoU of 74.82%. This mIoU score was 10.06 percentage points higher than for the multimodal model versions. Relating this to the spectral model versions in P2P (mIoU = 80.15%), this was 5.33 percentage points lower.
When spectral models were compared, it was observed that DeepLabv3+ exhibited a higher mIoU score with 49.60% than the U-Net with 44.84%. Among the multimodal model versions, U-Net outperformed DeepLabv3+ with an mIoU of 69.27% to 60.25%, respectively. U-Net also outperformed DeepLabv3+ when the spectral datasets and AdaBN were combined (mIoU: 62.66% to 59.94%). This pattern was also observed for the combination of multimodal datasets and AdaBN (mIoU: 75.72% to 73.92%). Therefore, the U-Net outperformed the DeepLabv3+ in all cases, except for the spectral model versions.
Furthermore, the difference in mIoU between the two architectures for the same data type (∆mIoU) was greater when AdaBN was not applied. Without AdaBN, multimodal model versions showed a ∆mIoU of ±9.02 and spectral model versions of ±4.76. When AdaBN was applied, multimodal model versions showed a ∆mIoU of ±1.80 and spectral model versions of ±2.72.
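The mean values and differences reported in this section follow directly from the per-model mIoU scores stated above, as the short calculation below shows. The dictionary layout and variable names are illustrative; the numbers are those given in the text.

```python
# mIoU (%) on the P2V test set, as reported in the text above.
miou = {
    ("spectral", "conventional"): {"U-Net": 44.84, "DeepLabv3+": 49.60},
    ("multimodal", "conventional"): {"U-Net": 69.27, "DeepLabv3+": 60.25},
    ("spectral", "AdaBN"): {"U-Net": 62.66, "DeepLabv3+": 59.94},
    ("multimodal", "AdaBN"): {"U-Net": 75.72, "DeepLabv3+": 73.92},
}

# Mean over both architectures and absolute gap between them, per configuration.
mean = {k: round(sum(v.values()) / len(v), 2) for k, v in miou.items()}
gap = {k: round(abs(v["U-Net"] - v["DeepLabv3+"]), 2) for k, v in miou.items()}

baseline = mean[("spectral", "conventional")]                               # 47.22
modality_gain = round(mean[("multimodal", "conventional")] - baseline, 2)  # 17.54
adabn_gain = round(mean[("spectral", "AdaBN")] - baseline, 2)              # 14.08
gain_difference = round(modality_gain - adabn_gain, 2)                      # 3.46
combined_extra = round(
    mean[("multimodal", "AdaBN")] - mean[("multimodal", "conventional")], 2
)                                                                           # 10.06
```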

3.2.2. Conventional—Class-Specific Results

Figure 4 shows the class-specific metric scores for conventional inference on the P2V test set. The impervious surface class was the worst-performing class for all trained models. This class was severely impacted for the spectral models, showing IoU scores of 27.23% for DeepLabv3+ and 18.78% for U-Net. A low IoU score of 43.33% was also observed for the impervious surface class for the multimodal DeepLabv3+, which was considerably lower than the 65.12% achieved by the multimodal U-Net. For the spectral model versions, the best-performing class was the background class, followed by the building class. This pattern was also observed for the multimodal U-Net version. The multimodal DeepLabv3+ was the exception to this pattern, with the building class performing best, followed by the background class.
Different patterns can be observed between the model architectures regarding class-specific improvements when comparing the multimodal and spectral model versions in Table 8. The greatest improvement for DeepLabv3+ was observed for the buildings class, with an increase in IoU of 22.91 percentage points. The second-highest increase in IoU was observed for the impervious surfaces class with 16.10 percentage points, whereas the IoU of the background class decreased by 7.05 percentage points.
U-Net, on the other hand, exhibited its greatest improvement for the impervious surface class with 46.34 percentage points, stabilizing the considerably low performance observed for its spectral version. The second greatest improvement was achieved for the building class of 21.16 percentage points in IoU, followed by an increase of 5.79 percentage points in IoU for the background class.

3.2.3. AdaBN—Class-Specific Results

As shown in Figure 5, when AdaBN was applied, the spectral model versions exhibited higher class-specific metric scores than without AdaBN (cf. Figure 4). With AdaBN, the impervious surface class showed the lowest IoU scores, followed by the building and background classes. One exception was the spectral DeepLabv3+, which showed the lowest IoU score for the building class with 49.94%.
Table 9 shows the class-specific improvements in IoU for spectral and multimodal model versions with and without the application of AdaBN. These improvements represent the difference between the AdaBN (cf. Figure 5) and the conventional (cf. Figure 4) IoU results for the same class, data type, and model architecture.
U-Net achieved the greatest improvements in IoU when AdaBN was applied to the spectral model versions. The impervious surfaces class increased in IoU by 37.24, the background class by 6.86, and the building class by 9.36 percentage points. The DeepLabv3+, on the other hand, increased in IoU for the impervious surface class by 28.73, for the background class by 0.44, and for the building class by 1.86 percentage points. Therefore, for the spectral model versions, AdaBN had the greatest impact on the impervious surfaces class and the least impact on the background class.
When comparing the multimodal model versions, AdaBN further improved class-specific IoU scores. These improvements were greater for the multimodal DeepLabv3+ than for the multimodal U-Net. For DeepLabv3+, the impervious surface class increased by 25.99, the background by 12.44, and the buildings by 2.56 percentage points. In contrast, U-Net improved the most for the buildings class with 8.32, followed by the impervious surface class with 6.20, and the background class with 4.84 percentage points. Therefore, when AdaBN was applied to the multimodal model versions, the improvement patterns differed considerably between the two architectures.

3.2.4. Ablation Study: AdaBN in Encoder vs. Decoder

Figure 6 shows the model performances when AdaBN was applied to the entire model architecture, the encoder, or the decoder. For both multimodal and spectral DeepLabv3+, the highest IoU scores were observed when AdaBN was applied solely to the encoder of the models (74.34% for multimodal; 60.38% for spectral), followed by applying AdaBN to the entire model (73.92% for multimodal; 59.94% for spectral). U-Net exhibited the greatest IoU scores when AdaBN was applied to the entire model architecture (75.72% for multimodal; 62.66% for spectral). However, these scores were only marginally greater than those obtained when applying AdaBN solely to the encoder (75.68% for multimodal; 62.63% for spectral). Therefore, the results show that the decoder plays a subordinate role in the impact of AdaBN within both model architectures.
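To make the encoder-only versus full-model comparison concrete, the following is a minimal pure-Python sketch of the AdaBN idea: the stored BN statistics (mean and variance) of selected layers are replaced by statistics estimated on target-domain data, while the learned affine parameters stay fixed. It is a toy stand-in and not the implementation used in this study. The class `ToyBatchNorm`, the function `adapt_bn`, the layer names, and all numeric values are hypothetical and chosen only for illustration.

```python
from statistics import fmean, pvariance


class ToyBatchNorm:
    """Single-channel batch-norm layer in inference mode (toy stand-in)."""

    def __init__(self, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
        self.mean, self.var = mean, var                      # stored statistics
        self.gamma, self.beta, self.eps = gamma, beta, eps   # learned affine parameters

    def __call__(self, xs):
        scale = self.gamma / (self.var + self.eps) ** 0.5
        return [scale * (x - self.mean) + self.beta for x in xs]


def adapt_bn(layers, target_batch, scope=("encoder", "decoder")):
    """Replace BN statistics with target-domain statistics for layers in `scope`.

    `layers` maps layer names to ToyBatchNorm objects in forward order.
    Layers outside `scope` keep their source statistics; gamma and beta
    are never modified.
    """
    xs = list(target_batch)
    for name, bn in layers.items():
        if name.startswith(scope):
            bn.mean, bn.var = fmean(xs), pvariance(xs)
        xs = bn(xs)  # propagate so that later layers see the adapted activations
    return xs


def make_model():
    # Hypothetical source-domain statistics, for illustration only
    return {"encoder.bn": ToyBatchNorm(10.0, 4.0), "decoder.bn": ToyBatchNorm(0.0, 1.0)}


target = [20.0, 22.0, 24.0, 26.0]  # toy target-domain activations

encoder_only = make_model()
adapt_bn(encoder_only, target, scope=("encoder",))

full = make_model()
adapt_bn(full, target, scope=("encoder", "decoder"))

for name in ("encoder.bn", "decoder.bn"):
    print(name, encoder_only[name].mean, encoder_only[name].var,
          full[name].mean, full[name].var)
```

In this toy setting, adapting the encoder already re-centers the activations that reach the decoder, so adapting the decoder as well changes its statistics only slightly. This mirrors, in a simplified way, the small differences between encoder-only and full-model adaptation reported above, without claiming to explain them.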

3.3. Visual Assessment

3.3.1. Potsdam to Potsdam (P2P)

Figure 7 illustrates that all segmentation models produced robust results within the P2P task. Overall, however, the multimodal model versions produced cleaner results with sharper object boundaries, particularly regarding the building class. Regarding Example A, both spectral and multimodal DeepLabv3+ and U-Net also segmented smaller buildings not referenced within the GT map. In cases of ambiguity caused by tree canopies or green roofs, the multimodal model versions produced more complete building objects than the spectral model versions, as highlighted within the middle and bottom bounding boxes. In Example B, the spectral DeepLabv3+ produced more accurate results than the spectral U-Net, creating fewer misclassifications, particularly for buildings with high albedo, as highlighted within the bottom bounding box. Furthermore, both the spectral DeepLabv3+ and U-Net models struggle to correctly classify a low-albedo impervious surface, as highlighted within the central bounding box. Additionally, this surface is incorrectly labeled as background in the GT map. This suggests that such labeling errors may have affected the training process. Similarly, in the top bounding box, both models incorrectly classify the car park as an impervious surface, despite it being labeled as a building in the GT map. For the multimodal model versions, misclassified areas were considerably reduced, particularly for U-Net. However, both models struggled with ambiguous impervious surfaces highlighted within the central bounding box.

3.3.2. Potsdam to Vaihingen (P2V)

Figure 8 illustrates the severity of the performance drops within the P2V task. This is particularly evident in the case of the spectral DeepLabv3+ and spectral U-Net in Examples A and B, which overpredicted the building and background classes. Nevertheless, the spectral DeepLabv3+ performed slightly better in this comparison, especially for the impervious surface class. The multimodal versions of both the DeepLabv3+ and the U-Net are considerably more robust than the spectral model versions, as can be seen in Examples A and B. Additionally, all spectral model versions profited from applying AdaBN. The spectral U-Net improved considerably in its ability to correctly segment impervious surfaces, as indicated in Example A. However, when AdaBN was applied to the spectral versions of DeepLabv3+ and U-Net, buildings with lower albedo were falsely labeled as impervious surfaces, as highlighted within the central box. Using AdaBN, the object boundaries were improved for the multimodal model versions, reducing confusion with the building class, as depicted in Example A.
As shown in Example B, the performances of the spectral DeepLabv3+ and U-Net were comparable, which was also true when AdaBN was applied. Both models misclassified buildings with low-albedo roofs in their AdaBN versions. The multimodal DeepLabv3+ and U-Net versions produced considerably better prediction maps than their spectral equivalents. However, they misclassified buildings with high-albedo roofs, as shown in the bottom box. The quantitative results contrasted with the visual appearance of the prediction maps generated by the multimodal DeepLabv3+. The multimodal DeepLabv3+ visually appeared to generate better prediction maps than its version with AdaBN applied. This is because the AdaBN version incorrectly classified buildings with low-albedo roofs, as indicated by the central bounding box. This was not the case for the multimodal U-Net with AdaBN, which produced the best overall prediction map among the models. Additionally, when comparing the spectral and multimodal versions of DeepLabv3+, it can be observed that the spectral version tends to overpredict the building class. In contrast, the multimodal version tends to overpredict the background class considerably.

4. Discussion

4.1. Impact of Data Modality on Cross-City and Cross-Season Shift

In the P2P scenario, the results show that data modality affected model performance only marginally. On average, the difference in performance between the multimodal and spectral model versions was 3.43 percentage points in mIoU, favoring the multimodal models. The performance difference between model architectures was similarly small: the difference in mIoU between the spectral model versions was 0.7 percentage points, whereas the difference between the multimodal model versions was 0.15 percentage points. In both cases, U-Net marginally outperformed DeepLabv3+. Therefore, we agree with Petrich et al. [29] that, for in-domain tasks, relying solely on spectral datasets may be sufficient.
In the P2V scenario, results show that the combination of spectral and height information considerably reduced the performance drops observed in spectral model versions. On average, multimodal model versions increased in mIoU by 17.54 percentage points when compared to spectral model versions. This suggests that the nDSM data increases the robustness of impervious surface SSMs when confronted with cross-city and cross-season domain shifts. One possible explanation is that the nDSM represents a more generalizable feature representation, which is less influenced by domain shifts compared to the spectral dataset [17]. Our results align with the findings of Qin et al. [17] and Petrich et al. [29], who reported that the integration of nDSM data helped increase the robustness of model performances under domain shifts. For example, we can compare our results with those of Petrich et al. [29], who reported an average increase in accuracy of 14.40% for the V2P task when additional nDSM data was added. Conceptually, our findings match the description of Chen et al. [25] that spectral and textural cues are considered more domain-specific than height information, while adding to Bruzzone and Bovolo [22] and Chen et al. [23] that spectral data is subject to radiometric noise.
The greatest class-specific increases in IoU were found for the building and the impervious surface classes when nDSM data were added. For the multimodal DeepLabv3+, the building class improved the most. For the multimodal U-Net, however, this was the impervious surface class, directly followed by the building class. This partly met our expectations, as we had anticipated that the building class would benefit the most from adding height information from the nDSM. However, when the performance of the two models was compared for the building class, it was found that they both benefited by a similar margin from the additional nDSM. Since both models also improved considerably for the impervious surface class, the results suggest that additional height information may increase class separability. This is in line with Chen et al. [25], who emphasize that ambiguity between classes may be reduced when utilizing height information.

4.2. Impact of AdaBN on Cross-City and Cross-Season Shift

In the P2V scenario, performance drops observed in spectral model versions were reduced by using AdaBN. The results showed that AdaBN increased the mean model performance of the spectral model versions by 14.08 percentage points in mIoU, which was 3.46 percentage points lower than using multimodal datasets. The combination of AdaBN with multimodal model versions exhibited the best results, further increasing mIoU on average by 10.06 percentage points compared to multimodal model versions alone. The results suggest that AdaBN increases robustness against cross-city and cross-season domain shifts without requiring additional data or model finetuning. Moreover, its effect is additive to multimodal datasets. One possible explanation is that the covariate shifts between the Potsdam and Vaihingen datasets considerably affected the transferability of the spectral models within the experiments, leading to consistent improvements between data modalities. Additionally, the multimodal dataset outperforms AdaBN, since the data is directly integrated into model training. This increases the amount of data available for the task, as well as the amount of information (e.g., semantics) used during training. Our results are in line with Li et al. [31], Benson and Ecker [34], and Wang et al. [35], reporting robustness improvements to domain shifts using AdaBN. Specifically, our observed increases in mIoU are comparable to the 13.58 percentage-point average increase outlined by Wang et al. [35] for AdaBN against the DeepLabv3 baseline in cross-sensor scenarios.
Moreover, when AdaBN was applied, the greatest class-specific improvement was observed for the impervious surface class regardless of the input data type used within model training. For the spectral model versions, the IoU scores observed for the impervious surface class increased by 28.73 and 37.24 percentage points for DeepLabv3+ and U-Net, respectively. These drastic increases, however, are a result of the severe performance drops observed for this class within the P2V task. The results suggest that the impervious surface class was the most sensitive class to the cross-city and cross-season domain shift. One possible explanation could be that the covariate shifts between the source and the target data were more severe for impervious surfaces than for the other classes. This may have caused the models to overfit to their city- and season-specific source data, which was not generalizable enough for the target data. For example, the roads in the Vaihingen images were partly obscured by leafy tree canopies, unlike the roads in the Potsdam images, which represent leaf-off conditions. Additionally, the shadows cast onto the roads showed different patterns, resulting in longer, lighter shadows in the Potsdam images compared to the Vaihingen images (cf. Figure 7 and Figure 8). Thus, in our experiments, vegetation occlusions and shadow patterns affected model generalization. However, season-specific overfitting could potentially be more severe when facing more contrasting seasonal changes, such as leaf-on conditions in summer and snow cover conditions in winter, than when facing defoliation in autumn.
Furthermore, our observations revealed that AdaBN had a greater impact on the encoder than on the decoder for both the DeepLabv3+ and U-Net models. For the DeepLabv3+, applying AdaBN solely to the encoder generated an even greater mIoU score than its application to the entire model. This suggests that the encoder plays a crucial part in model performance in cross-domain scenarios. One potential explanation for this finding is that the encoder statistics are more domain-sensitive than the decoder statistics. This may be due to the encoder extracting more domain-specific low-level features, while the decoder is more class-discriminative. Specifically, when the input images differ during inference, the features extracted by the encoder may not align with the feature representation expected by the decoder, which was learned during training. Therefore, using AdaBN to align the encoder features in the early stages of the model may be more beneficial, making these features easier for the decoder to process in later stages.

4.3. Differences Between Model Architectures

In our experiments, U-Net outperformed DeepLabv3+ in all P2P and P2V comparisons except for the spectral models in P2V. This result suggests that U-Net was more robust than DeepLabv3+ in both in-domain and cross-domain settings. One possible explanation could be that DeepLabv3+ overfitted more severely to the characteristics of the Potsdam image dataset compared to the U-Net, particularly in P2V scenarios. More specifically, this may be due to the ASPP module and the additional dilated convolutions in the DeepLabv3+ architecture. The additional contextual information they capture may be more representative of the source domain, thereby influencing cross-domain performance.
In addition, the background class performance of DeepLabv3+ unexpectedly decreased when nDSM data was added, which was not observed for U-Net. Although the exact reason remains unclear, one possible explanation could be that DeepLabv3+ may not only overfit to the source data but also to specific characteristics of the input data in our experiments. Supporting this, we observed that the spectral DeepLabv3+ tended to overpredict the building class. In contrast, the multimodal DeepLabv3+ was more accurate at predicting the building class due to the nDSM. However, it tended to overpredict the background class rather than the impervious surfaces class, possibly due to the class imbalance present within the training data. This contrasting behavior of the model versions potentially led to the observed decrease in IoU for the background class. In contrast to U-Net, DeepLabv3+ uses dilated convolutions and the ASPP module to increase spatial context. Therefore, if this spatial context information differs due to domain shifts in the input data, it may lead to more varying patterns of prediction uncertainty when comparing different data modalities.
Our findings contrast with those of Qin et al. [17] and Petrich et al. [29], who highlighted inconsistency between best-performing models on the source and on the target data. However, instead of claiming that U-Net outperformed DeepLabv3+, we more conservatively report that the choice of model architecture influenced robustness to cross-domain shifts in our experiments. This is because the training setup (training data, model architecture, and loss and optimization functions) may influence the models to learn specific characteristics instead of others (inductive bias), leading to the observed results [50].
Furthermore, the results show that model performance differences between DeepLabv3+ and U-Net decreased in the P2V scenario when AdaBN was applied, regardless of data modality. This finding suggests that AdaBN may increase the comparability between model architectures in cross-domain scenarios. This may be explained by AdaBN reducing the error component related to the mismatch of the BN statistics due to the cross-domain shifts, isolating more underlying influences, such as the inductive bias between the model architectures.
In addition, data modality and AdaBN had rather different impacts on both DeepLabv3+ and U-Net. The results show that applying AdaBN to spectral datasets had a consistent effect on class-specific improvements of both model architectures, which was not the case for the usage of multimodal datasets. These results suggest that using AdaBN leads to more predictable, class-specific improvements than using different data modalities during training. This can be explained by the inherent stochasticity and inductive biases present during training, increasing the probability of observing differences between model outcomes. However, this may be less pronounced if only the BN layer statistics are adapted to the target data in AdaBN, predominantly addressing covariate shift [31].

4.4. Limitations and Future Research

This study is limited to a comparison of DeepLabv3+ and U-Net, two baseline CNN-based encoder–decoder networks, and their initial capability to handle cross-season and cross-region domain shifts. Furthermore, to provide a more comparable experimental setting, the SMP library’s [45] model implementation was used. This leads to the assumption that the experimental results may differ when the original model architecture implementations are used. However, given their widespread use in the RS [15,17] and DA literature [18,19,24,26], we argue that our results can provide insights into a multitude of semantic segmentation approaches informing cross-city and cross-season domain shifts. Additionally, unlike standard approaches in RS, we did not use pretrained ImageNet weights to study initial model transferability in cross-domain scenarios. However, we anticipate that pretrained weights would considerably increase the robustness of the models in the P2V scenario, particularly for the spectral model versions.
We used fixed hyperparameter settings throughout the experiments. This approach likely biased the trained models, as the predefined values were not tuned individually, for example, with automated hyperparameter tuning techniques [51]. In return, it allows us to attribute observed performance differences more confidently to the model architecture rather than to differences in model training or hyperparameter selection. However, creating similar training environments for model comparison is complicated [51], not least because of inductive bias.
Furthermore, AdaBN was used as a straightforward technique to address domain shifts, despite being a relatively old method in the chronology of the DL literature. Given its benefits for both spectral and multimodal model versions in our results and its practicality, as it requires no finetuning, we argue that this method is still relevant for supporting consistent impervious surface mapping. AdaBN is particularly useful in real-world settings, as nDSM data are often unavailable or subject to unaligned acquisition dates, which limits workflow automation. Moreover, since AdaBN has been shown to be effective with a small number of examples, it is also suitable for cases with limited data in the target domain [31]. As Li et al. [31] report, the performance gain is expected to plateau as the number of images increases. Therefore, in our case, a smaller adaptation set may have produced similar results. However, it should be noted that AdaBN predominantly addresses covariate shifts [31], limiting its effectiveness when inconsistencies between features and labels are present (concept shift) or when the distribution of label information differs (prior probability shift) [28]. Therefore, AdaBN cannot replace comprehensive DA approaches [31], but it can provide support in cross-domain scenarios.
Moreover, data quality and data pre-processing may have influenced experimental results. As visualized in Figure 7 and reported by Audebert et al. [9] and Marmanis et al. [10], annotation errors are present within the GT datasets. Besides these annotation errors, Marmanis et al. [10] and Gerke [40] also report errors within the nDSM data for the Vaihingen dataset. Marmanis et al. [10] investigated these errors and found that they refer to missing industrial buildings. In their experiments, correcting for these errors improved the overall accuracy by 0.9 percentage points. These buildings occupy 3.1% of the area in image “area12”, 9.3% in image “area31”, and 10.0% in image “area33” [10], which are part of our test set. Therefore, these errors may have systematically affected the performance of the building class, as well as our reported overall performances for the P2V tests. However, these issues were not resolved in this study due to the substantial manual effort and time required to address them properly and to ensure the comparability of future studies. Moreover, we used nearest-neighbor resampling instead of bilinear or bicubic interpolation to align the spatial resolution of the Potsdam dataset with that of the Vaihingen dataset. This approach may have caused aliasing or influenced the spectral characteristics of the spectral input data for model training, which potentially affected model performances. Furthermore, we reclassified the initial Potsdam and Vaihingen GT labels. Specifically, we chose to add the car class to the impervious surface class. This could have contributed to label noise because cars were also located on permeable surfaces and, ideally, should then have been added to the background class. This aspect was not quantified in this study and should be considered when interpreting our experimental results or in future work. Therefore, we acknowledge that these issues may have affected the experimental results.
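The label reclassification described above, and one way its effect could be quantified in future work, can be sketched as follows. This is an illustrative Python example under stated assumptions: only the assignment of the car class to the impervious surface class is taken from the text. The remaining assignments in `ISPRS_TO_TARGET`, the function names, and the toy label tile are hypothetical.

```python
# Mapping from the six ISPRS benchmark classes to the three classes used here.
# Only "car -> impervious_surface" is stated in the text; the other entries
# are assumptions made for this illustration.
ISPRS_TO_TARGET = {
    "impervious_surfaces": "impervious_surface",
    "car": "impervious_surface",
    "building": "building",
    "low_vegetation": "background",
    "tree": "background",
    "clutter": "background",
}


def reclassify(label_map, mapping=ISPRS_TO_TARGET):
    """Apply a class mapping to a 2D grid of class names."""
    return [[mapping[c] for c in row] for row in label_map]


def class_fraction(label_map, source_class="car"):
    """Share of pixels carrying `source_class`.

    This bounds the share of labels that would change if that class were
    assigned to a different target class.
    """
    flat = [c for row in label_map for c in row]
    return flat.count(source_class) / len(flat)


# Toy 2 x 4 label tile (hypothetical)
tile = [
    ["car", "tree", "building", "car"],
    ["impervious_surfaces", "clutter", "low_vegetation", "tree"],
]
print(reclassify(tile)[0])
print(class_fraction(tile, "car"))
```

Computing the car-pixel share per image would give an upper bound on the label noise introduced by this design choice, since only cars standing on permeable surfaces are affected.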
We focused on the geographical locations of Potsdam and Vaihingen, covering leaf-off and leaf-on conditions. One of the key reasons for choosing this study area was the availability of comparable VHR datasets with high-quality GT data. Due to the lack of comparable and high-quality VHR datasets for impervious surface mapping between regions, cross-domain comparisons using VHR datasets are currently limited to more regional comparisons with limited seasonal coverage.
Therefore, future research could involve generating additional VHR datasets in different geographical locations around the globe with greater variety within the investigated classes. Moreover, our experimental design could be extended to focus more broadly on different model architectures, particularly in more varied geographical locations and seasonal settings. For example, this might involve the comparison of the initial transferability of CNN- and Transformer-based models (e.g., SegFormer [52], DBRSNet [53]) to newer domain adaptation networks (e.g., DAFormer [27], HighDAN [20]) on cross-city and cross-season domain shifts for very high-resolution impervious surface mapping. Moreover, the impact of additional data modalities on cross-domain shifts could be studied, for example, by explicitly integrating spectral indices such as the normalized difference vegetation index (NDVI) [54] or the perpendicular impervious surface index (PISI) [55]. Ultimately, these research directions could contribute to the further understanding of the relationship between domain shifts and the underlying factors on SSMs in impervious surface mapping.
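As an illustration of how a spectral index could be added as an explicit input channel, the following Python sketch computes the NDVI, (NIR − R) / (NIR + R), and appends it to each pixel. The band order (R, G, NIR), the function names, and the toy reflectance values are assumptions made for this example and are not part of the experiments reported here.

```python
def ndvi(nir, red):
    """NDVI = (NIR - R) / (NIR + R); returns 0.0 where both bands are zero."""
    denom = nir + red
    return (nir - red) / denom if denom != 0 else 0.0


def append_ndvi(pixels):
    """Append NDVI as a fourth channel to (R, G, NIR) pixel tuples."""
    return [(r, g, nir, ndvi(nir, r)) for (r, g, nir) in pixels]


# Toy reflectance values (hypothetical): vegetation-like, asphalt-like, no signal
pixels = [(0.05, 0.08, 0.45), (0.20, 0.20, 0.22), (0.0, 0.0, 0.0)]
stacked = append_ndvi(pixels)
for px in stacked:
    print(px)
```

Because the NDVI is derived from bands that are already part of the spectral input, such a channel adds no new measurements. Whether an explicit index improves cross-domain robustness would need to be tested empirically.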

5. Conclusions

This study provides a systematic evaluation of how data modality and BN layer recalibration using AdaBN affect the robustness of SSMs for impervious surface mapping using VHR RS datasets under cross-city and cross-season shifts. Since these shifts cause a degradation in the performance of SSMs [16,17,18,19,20], evaluating such lightweight adaptation strategies could inform more practical impervious surface mapping approaches. For this evaluation, both spectral (R, G, and NIR) and multimodal (R, G, NIR, and nDSM) versions of the U-Net and DeepLabv3+ models were trained using the ISPRS Potsdam dataset as the source domain. They were then tested using the source domain and the ISPRS Vaihingen dataset as the target domain, with and without prior adaptation using AdaBN.
The results show that cross-city and cross-season domain shifts significantly affect impervious surface mapping using VHR datasets. All the trained models exhibited performance degradation, which is consistent with previous studies. Moreover, the experiments demonstrate that the choice of model architecture can affect the initial robustness observed in cross-domain scenarios.
Regarding our first research question, we conclude that both multimodal datasets and AdaBN increase the robustness of SSMs in impervious surface mapping under cross-city and cross-season scenarios. Specifically, the impervious surface class was found to be the most sensitive to cross-domain shifts, regardless of data modality. The experiments demonstrate that the performance increase using AdaBN was, on average, 3.46 percentage points lower in terms of mIoU than using multimodal datasets. Moreover, the most robust results were obtained when combining multimodal datasets and AdaBN, further improving the performance of multimodal model versions by 10.06 percentage points in mIoU on average. This confirms an additive effect of improvements by adding height information and AdaBN to spectral datasets.
For the second research question, we conclude that the impact of data modality and AdaBN on the robustness of the U-Net and DeepLabv3+ models does not exhibit consistent patterns across domain shifts. The results show that integrating additional height information or the AdaBN approach leads to different class-specific improvement patterns. While AdaBN integration yields consistent patterns for both model architectures, favoring impervious surfaces, the usage of multimodal datasets exhibits more inconsistent patterns, potentially due to stochasticity and inductive bias during training.
Our findings indicate that the covariate shifts observed between source and target data and BN layers play an important role in impervious surface mapping using VHR RS datasets in cross-city and cross-season generalization. While AdaBN addresses this shift directly within the trained model, adding nDSM data during training may provide more generalizable features, which, in such scenarios, may help in creating more holistic models. Therefore, when more comprehensive DA approaches are unavailable, we recommend using multimodal datasets in combination with AdaBN to improve SSM robustness across cities and seasons.
Further research into cross-domain shifts in VHR urban remote sensing requires comparable VHR datasets from different geographical locations. Such datasets are a prerequisite for sufficient testing grounds, for example, for comparing CNN-based and Transformer-based SSMs.

Author Contributions

Conceptualization, J.-P.L.; methodology, J.-P.L.; software, J.-P.L.; validation, J.-P.L.; formal analysis, J.-P.L.; investigation, J.-P.L.; resources, A.R.; data curation, J.-P.L.; writing—original draft preparation, J.-P.L.; writing—review and editing, J.-P.L., A.R.; visualization, J.-P.L.; supervision, A.R.; project administration, A.R.; funding acquisition, A.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The ISPRS Potsdam and Vaihingen datasets are accessible under the following link: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/default.aspx (accessed on 16 March 2026). The nDSM data from the ISPRS Vaihingen dataset is accessible under the following link: https://www.researchgate.net/publication/270450634_normalized_DSM_heights_encoded_in_dm_see_report (accessed on 16 March 2026). Other data related to the results of this study are available upon reasonable request.

Acknowledgments

We thank the Institute of Geography at the Ruhr-Universität Bochum for providing the facilities and resources necessary for conducting this study. We would like to thank the International Society for Photogrammetry and Remote Sensing (ISPRS) WG III/4 and the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) for providing the Potsdam and Vaihingen benchmark datasets. We thank Stefanie Steinbach and Torben Dedring for their valuable feedback and professional discussions. We utilized DeepL (Translator and Write) for the purposes of optimizing translations and sentence structure to enhance the readability of our manuscript (https://www.deepl.com/de/translator; https://www.deepl.com/de/write (accessed on 16 March 2026)).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AdaBN: Adaptive Batch Normalization;
AGL: Above Ground Level;
ASPP: Atrous Spatial Pyramid Pooling;
B: Blue;
BN: Batch Normalization;
CE: Cross-Entropy Loss;
CNN: Convolutional Neural Network;
DA: Domain Adaptation;
DL: Deep Learning;
DNN: Deep Neural Network;
DSM: Digital Surface Model;
DTM: Digital Terrain Model;
G: Green;
GT: Ground Truth;
IoU: Intersection Over Union;
ISPRS: International Society for Photogrammetry and Remote Sensing;
mIoU: Mean Intersection Over Union;
nDSM: Normalized Digital Surface Model;
NDVI: Normalized Difference Vegetation Index;
NIR: Near-Infrared;
P2P: Transfer Task Potsdam to Potsdam;
P2V: Transfer Task Potsdam to Vaihingen;
PISI: Perpendicular Impervious Surface Index;
R: Red;
ReLU: Rectified Linear Unit;
RS: Remote Sensing;
SSM: Semantic Segmentation Model;
SVM: Support Vector Machine;
TOP: True Orthophotos;
USSOCOM: United States Special Operations Command;
VHR: Very High Resolution.

References

  1. Shuster, W.D.; Bonta, J.; Thurston, H.; Warnemuende, E.; Smith, D.R. Impacts of impervious surface on watershed hydrology: A review. Urban Water J. 2005, 2, 263–275.
  2. Chithra, S.V.; Nair, M.H.; Amarnath, A.; Anjana, N.S. Impacts of impervious surfaces on the environment. Int. J. Eng. Sci. Invent. 2015, 4, 27–31.
  3. Strohbach, M.W.; Döring, A.O.; Möck, M.; Sedrez, M.; Mumm, O.; Schneider, A.-K.; Weber, S.; Schröder, B. The “hidden urbanization”: Trends of impervious surface in low-density housing developments and resulting impacts on the water balance. Front. Environ. Sci. 2019, 7, 29.
  4. Seto, K.C.; Güneralp, B.; Hutyra, L.R. Global forecasts of urban expansion to 2030 and direct impacts on biodiversity and carbon pools. Proc. Natl. Acad. Sci. USA 2012, 109, 16083–16088.
  5. Arnold, C.L., Jr.; Gibbons, C.J. Impervious surface coverage: The emergence of a key environmental indicator. J. Am. Plan. Assoc. 1996, 62, 243–258.
  6. Peroni, F.; Pappalardo, S.E.; Facchinelli, F.; Crescini, E.; Munafò, M.; Hodgson, M.E.; de Marchi, M. How to map soil sealing, land take and impervious surfaces? A systematic review. Environ. Res. Lett. 2022, 17, 53005.
  7. Weng, Q. Remote sensing of impervious surfaces in the urban areas: Requirements, methods, and trends. Remote Sens. Environ. 2012, 117, 34–49.
  8. Sherrah, J. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv 2016, arXiv:1606.02585.
  9. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32.
  10. Marmanis, D.; Schindler, K.; Wegner, J.D.; Galliani, S.; Datcu, M.; Stilla, U. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 2018, 135, 158–172.
  11. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv 2017, arXiv:1704.06857.
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241.
  13. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision—ECCV 2018: 15th European Conference Proceedings, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 833–851.
  14. Huang, L.; Jiang, B.; Lv, S.; Liu, Y.; Fu, Y. Deep-Learning-Based Semantic Segmentation of Remote Sensing Images: A Survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8370–8396.
  15. Lv, J.; Shen, Q.; Lv, M.; Li, Y.; Shi, L.; Zhang, P. Deep learning-based semantic segmentation of remote sensing images: A review. Front. Ecol. Evol. 2023, 11, 1201125.
  16. Tuia, D.; Persello, C.; Bruzzone, L. Domain Adaptation for the Classification of Remote Sensing Data: An Overview of Recent Advances. IEEE Geosci. Remote Sens. Mag. 2016, 4, 41–57.
  17. Qin, R.; Zhang, G.; Tang, Y. On the Transferability of Semantic Segmentation for Very-High-Resolution Remote Sensing Data of Multi-City Environments. Photogramm. Eng. Remote Sens. 2025, 91, 517–528.
  18. Ni, H.; Liu, Q.; Guan, H.; Tang, H.; Chanussot, J. Category-Level Assignment for Cross-Domain Semantic Segmentation in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608416.
  19. Zhang, B.; Chen, T.; Wang, B. Curriculum-Style Local-to-Global Adaptation for Cross-Domain Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5611412.
  20. Hong, D.; Zhang, B.; Li, H.; Li, Y.; Yao, J.; Li, C.; Werner, M.; Chanussot, J.; Zipf, A.; Zhu, X.X. Cross-city matters: A multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks. Remote Sens. Environ. 2023, 299, 113856.
  21. Zhou, K.; Liu, Z.; Qiao, Y.; Xiang, T.; Loy, C.C. Domain Generalization: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4396–4415.
  22. Bruzzone, L.; Bovolo, F. A Novel Framework for the Design of Change-Detection Systems for Very-High-Resolution Remote Sensing Images. Proc. IEEE 2013, 101, 609–630.
  23. Chen, J.; Hou, D.; He, C.; Liu, Y.; Guo, Y.; Yang, B. Change Detection with Cross-Domain Remote Sensing Images: A Systematic Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11563–11582.
  24. Bai, L.; Du, S.; Zhang, X.; Wang, H.; Liu, B.; Ouyang, S. Domain Adaptation for Remote Sensing Image Semantic Segmentation: An Integrated Approach of Contrastive Learning and Adversarial Learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5628313.
  25. Chen, H.; Zhang, H.; Yang, G.; Li, S.; Zhang, L. A Mutual Information Domain Adaptation Network for Remotely Sensed Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5537316.
  26. Sun, Z.; Guo, P.; Li, Z.; Chen, X.; Liu, X. Elevation-Aware Domain Adaptation for Sematic Segmentation of Aerial Images. Remote Sens. 2025, 17, 2529.
  27. Hoyer, L.; Dai, D.; van Gool, L. DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 9914–9925.
  28. Y, G.D.; Nair, N.G.; Satpathy, P.; Christopher, J. Covariate Shift: A Review and Analysis on Classifiers. In 2019 Global Conference for Advancement in Technology (GCAT), Bangalore, India, 18–20 October 2019; IEEE: New York, NY, USA, 2019; pp. 1–6.
  29. Petrich, J.; Sander, R.; Bradley, E.; Dawood, A.; Hough, S. On the Importance of 3D Surface Information for Remote Sensing Classification Tasks. Data Sci. J. 2021, 20, 20.
  30. Liang, J.; He, R.; Tan, T. A Comprehensive Survey on Test-Time Adaptation Under Distribution Shifts. Int. J. Comput. Vis. 2025, 133, 31–64.
  31. Li, Y.; Wang, N.; Shi, J.; Liu, J.; Hou, X. Revisiting Batch Normalization for Practical Domain Adaptation. arXiv 2016, arXiv:1603.04779v4.
  32. Huang, L.; Qin, J.; Zhou, Y.; Zhu, F.; Liu, L.; Shao, L. Normalization Techniques in Training DNNs: Methodology, Analysis and Application. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10173–10196.
  33. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167v3.
  34. Benson, V.; Ecker, A. Assessing out-of-domain generalization for robust building damage detection. arXiv 2020, arXiv:2011.10328.
  35. Wang, J.; Zhong, Y.; Zheng, Z.; Ma, A. Sensor-Specific Adversarial Network for Transferable Land-Cover Classification. In IGARSS 2021—2021 IEEE International Geoscience and Remote Sensing Symposium, Brussels, Belgium, 12–16 July 2021; Miralles, D., Persello, C., Beenen, K., Eds.; IEEE: Piscataway, NJ, USA, 2021; pp. 7943–7946.
  36. ISPRS. 2D Semantic Labeling Contest. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/semantic-labeling.aspx (accessed on 10 February 2026).
  37. Landeshauptstadt Potsdam. Statistische Grunddaten zur Landeshauptstadt Potsdam. Available online: https://www.potsdam.de/de/statistische-grunddaten-zur-landeshauptstadt-potsdam (accessed on 27 February 2026).
  38. Stadt Vaihingen an der Enz. Vaihingen an der Enz in Zahlen. Available online: https://www.vaihingen.de/unsere-stadt/stadt-vaihingen/zahlen-fakten (accessed on 27 February 2026).
  39. Cramer, M. The DGPF-Test on Digital Airborne Camera Evaluation Overview and Test Design. Photogramm.-Fernerkund.-Geoinf. 2010, 2, 73–82.
  40. Gerke, M. Technical Report: Use of the Stair Vision Library Within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen); ITC, University of Twente: Enschede, The Netherlands, 2014.
  41. ISPRS. 2D Semantic Labeling Contest—Potsdam. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 10 February 2026).
  42. van der Linden, S.; Hostert, P. The influence of urban structures on impervious surface maps from airborne hyperspectral data. Remote Sens. Environ. 2009, 113, 2298–2305.
  43. Cai, C.; Li, P.; Jin, H. Extraction of Urban Impervious Surface Using Two-Season WorldView-2 Images: A Comparison. Photogramm. Eng. Remote Sens. 2016, 82, 335–349.
  44. Langenkamp, J.-P.; Rienow, A. Exploring the Use of Orthophotos in Google Earth Engine for Very High-Resolution Mapping of Impervious Surfaces: A Data Fusion Approach in Wuppertal, Germany. Remote Sens. 2023, 15, 1818.
  45. Iakubovskii, P. Segmentation Models Pytorch; GitHub Repository; GitHub: San Francisco, CA, USA, 2019; Available online: https://github.com/qubvel-org/segmentation_models.pytorch (accessed on 27 April 2026).
  46. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
  47. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983v5.
  48. Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In ICML’23: Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA, 23–29 July 2023; JMLR.org: Norfolk, MA, USA, 2023; pp. 23803–23828.
  49. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061.
  50. Geirhos, R.; Jacobsen, J.-H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2020, 2, 665–673.
  51. Goodfellow, I.J.; Mirza, M.; Xiao, D.; Courville, A.; Bengio, Y. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. arXiv 2013, arXiv:1312.6211v3.
  52. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In NIPS’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Wortman Vaughan, J., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 12077–12090.
  53. Ji, Y.; Shi, W.; Lei, J.; Ding, J. DBRSNet: A dual-branch remote sensing image segmentation model based on feature interaction and multi-scale feature fusion. Sci. Rep. 2025, 15, 27786.
  54. Tucker, C.J. Red and photographic infrared linear combinations for monitoring vegetation. Remote Sens. Environ. 1979, 8, 127–150.
  55. Tian, Y.; Chen, H.; Song, Q.; Zheng, K. A Novel Index for Impervious Surface Area Mapping: Development and Validation. Remote Sens. 2018, 10, 1521.
Figure 1. Study area of Potsdam and Vaihingen an der Enz located in Germany. Excerpts of the corresponding ISPRS datasets [36,39] are visualized and represented by false-color images and GT labels.
Figure 2. Data splits of the Potsdam and Vaihingen datasets to create training, validation, testing, and adaptation sets.
Figure 3. Overview of the class-specific model performances of DeepLabv3+ and U-Net, for both spectral and multimodal versions, on the P2P test set using conventional inference.
Figure 4. Overview of the class-specific model performances of DeepLabv3+ and U-Net, for both spectral and multimodal versions, on the P2V test set using conventional inference.
Figure 5. Overview of the class-specific model performances for DeepLabv3+ and U-Net, for both spectral and multimodal versions on the P2V test set using AdaBN prior to inference.
Figure 6. Metric scores observed when applying AdaBN to the entire model, the encoder, and the decoder for DeepLabv3+ and U-Net for both spectral and multimodal datasets on the P2V test dataset.
Figure 7. Qualitative results of DeepLabv3+ and U-Net for both spectral and multimodal datasets on the P2P test dataset. Visualized are excerpts from the tiles with the identifiers “3_13” (Example A) and “6_14” (Example B). Red bounding boxes highlight the areas described within the manuscript text.
Figure 8. Qualitative results of DeepLabv3+ and U-Net for both spectral and multimodal datasets, with and without the use of AdaBN, on the P2V test dataset. Visualized are excerpts from the tiles with the identifiers “area12” (Example A) and “area29” (Example B). Red bounding boxes highlight the areas described within the manuscript text.
Table 1. Overview of the Potsdam and Vaihingen dataset characteristics and usage within this study.

| Dataset | Number of Tiles | Data Used | Vegetation Condition | Spatial Resolution | References |
|---|---|---|---|---|---|
| Potsdam | 38 | R, G, NIR, nDSM * | Leaf-Off | 5 cm | [36,41] |
| Vaihingen | 33 | R, G, NIR, nDSM | Leaf-On | 9 cm | [36,39] |

* nDSM files with file endings “normalized_lastools” were used.
Table 2. Training, validation, adaptation, and test sets used within this study categorized by their tile identifiers for the Potsdam and Vaihingen datasets.

| Dataset | Purpose | Tile Identifier |
|---|---|---|
| Potsdam (Source) | Training | 2_10, 2_11, 3_10, 3_11, 4_10, 4_11, 5_10, 5_11, 6_7, 6_8, 6_9, 6_10, 6_11, 7_7, 7_8, 7_9, 7_10, 7_11 |
| Potsdam (Source) | Validation | 2_12, 3_12, 4_12, 5_12, 6_12, 7_12 |
| Potsdam (Source) | Testing | 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, 7_13 |
| Vaihingen (Target) | Adaptation | area1, area3, area5, area7, area11, area13, area15, area17, area21, area23, area26, area28, area30, area32, area34, area37 |
| Vaihingen (Target) | Testing | area2, area4, area6, area8, area10, area12, area14, area16, area20, area22, area24, area27, area29, area31, area33, area35, area38 |
Table 3. Overview of the reclassification of the initial ISPRS GT class labels.

| Original Classes | New Class |
|---|---|
| Impervious surfaces (1), car (5) | Impervious surfaces |
| Building (2) | Building |
| Low vegetation (3), tree (4), clutter/background (6) | Background |
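The reclassification in Table 3 amounts to a lookup from the six original ISPRS class IDs to three merged classes. The following is a minimal sketch of such a lookup, not the authors' preprocessing code; the integer codes chosen for the new classes (0, 1, 2) and the helper name `reclassify` are assumptions made for illustration.

```python
# Merged class codes: these integer values are an assumption, not from the article.
BACKGROUND, IMPERVIOUS, BUILDING = 0, 1, 2

# Original ISPRS class IDs as numbered in Table 3 -> merged class.
RECLASS = {
    1: IMPERVIOUS,  # impervious surfaces
    2: BUILDING,    # building
    3: BACKGROUND,  # low vegetation
    4: BACKGROUND,  # tree
    5: IMPERVIOUS,  # car
    6: BACKGROUND,  # clutter/background
}

def reclassify(label_tile):
    """Return a new 2D list in which each original class ID is replaced by its merged class."""
    return [[RECLASS[class_id] for class_id in row] for row in label_tile]

# Tiny hypothetical label tile containing all six original classes.
tile = [[1, 5, 2],
        [3, 4, 6]]
merged = reclassify(tile)
print(merged)
```

For full-size label rasters, the same mapping would typically be applied through an array lookup table rather than nested lists.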
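Table 4. Overview of the performed inference scenarios.

| Task | Data Type | Inference | Model |
|---|---|---|---|
| P2P (Source) | Spectral (R, G, NIR) | Conventional | U-Net |
| P2P (Source) | Spectral (R, G, NIR) | Conventional | DeepLabv3+ |
| P2P (Source) | Multimodal (R, G, NIR, nDSM) | Conventional | U-Net |
| P2P (Source) | Multimodal (R, G, NIR, nDSM) | Conventional | DeepLabv3+ |
| P2V (Transfer) | Spectral (R, G, NIR) | Conventional | U-Net |
| P2V (Transfer) | Spectral (R, G, NIR) | Conventional | DeepLabv3+ |
| P2V (Transfer) | Spectral (R, G, NIR) | AdaBN | U-Net |
| P2V (Transfer) | Spectral (R, G, NIR) | AdaBN | DeepLabv3+ |
| P2V (Transfer) | Multimodal (R, G, NIR, nDSM) | Conventional | U-Net |
| P2V (Transfer) | Multimodal (R, G, NIR, nDSM) | Conventional | DeepLabv3+ |
| P2V (Transfer) | Multimodal (R, G, NIR, nDSM) | AdaBN | U-Net |
| P2V (Transfer) | Multimodal (R, G, NIR, nDSM) | AdaBN | DeepLabv3+ |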
Table 5. Overview of the DeepLabv3+ and U-Net model performances for both spectral and multimodal versions on the P2P test set. The ∆mIoU is calculated as the difference between models of the same data type. Mean model performance represents the average of the two models within a data type. The highest scores between models are highlighted in bold, and the lowest scores are highlighted with an underline.

| Data Type | Model | mIoU [%] | F1 [%] | Precision [%] | Recall [%] | ∆mIoU [%] |
|---|---|---|---|---|---|---|
| Spectral | DeepLabv3+ | 79.80 | 88.66 | 88.75 | 88.72 | ±0.7 |
| Spectral | U-Net | 80.50 | 89.08 | 89.48 | 88.85 | |
| Spectral | Mean Model Performance | 80.15 | 88.87 | 89.12 | 88.79 | - |
| Multimodal | DeepLabv3+ | 83.70 | 90.99 | 91.04 | 91.07 | ±0.15 |
| Multimodal | U-Net | 83.85 | 91.08 | 91.09 | 91.17 | |
| Multimodal | Mean Model Performance | 83.78 | 91.04 | 91.07 | 91.12 | - |
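The two derived quantities in Table 5 can be recomputed from the reported mIoU values. The sketch below only illustrates the arithmetic described in the caption; the dictionary layout and the helper name `summarize` are assumptions. It yields a mean of 80.15 and a difference of 0.70 for the spectral models, and a mean of 83.775 (reported as 83.78) and a difference of 0.15 for the multimodal models.

```python
# mIoU values [%] copied from Table 5 (P2P test set).
miou = {
    "spectral": {"DeepLabv3+": 79.80, "U-Net": 80.50},
    "multimodal": {"DeepLabv3+": 83.70, "U-Net": 83.85},
}

def summarize(scores):
    """Return (mean of the two model scores, absolute difference between them)."""
    first, second = scores.values()
    mean = round((first + second) / 2, 3)
    delta = round(abs(first - second), 2)
    return mean, delta

summary = {data_type: summarize(scores) for data_type, scores in miou.items()}
for data_type, (mean, delta) in summary.items():
    print(f"{data_type}: mean mIoU = {mean:.3f}, delta mIoU = {delta:.2f}")
```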
Table 6. Class-specific improvements in IoU for both DeepLabv3+ and U-Net on the P2P test set. Improvements in IoU are calculated by subtracting the class-specific IoU results of the multimodal and spectral models documented in Figure 3. Results are sorted in descending order, starting with the greatest improvement in IoU.

| Model | Class | IoU ↑ * |
|---|---|---|
| DeepLabv3+ | Buildings | 7.89 |
| DeepLabv3+ | Impervious Surfaces | 3.01 |
| DeepLabv3+ | Background | 0.80 |
| U-Net | Buildings | 7.65 |
| U-Net | Impervious Surfaces | 2.33 |
| U-Net | Background | 0.09 |

* IoU ↑ = IoU_multimodal − IoU_spectral.
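As a rough illustration of the quantity in the footnote, the sketch below computes per-class IoU as TP / (TP + FP + FN) from a confusion matrix and then takes the multimodal-minus-spectral difference in percentage points. The confusion matrices are hypothetical toy counts, not data from the article, and the helper name `class_iou` is an assumption.

```python
def class_iou(confusion):
    """Per-class IoU = TP / (TP + FP + FN).

    `confusion` is a square matrix with ground-truth classes as rows
    and predicted classes as columns.
    """
    n = len(confusion)
    ious = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # class k pixels predicted as something else
        fp = sum(confusion[r][k] for r in range(n)) - tp   # other pixels predicted as class k
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious

# Hypothetical pixel counts for three classes (illustration only).
cm_spectral = [[80, 15, 5], [20, 70, 10], [5, 5, 90]]
cm_multimodal = [[85, 10, 5], [10, 85, 5], [5, 5, 90]]

iou_spectral = class_iou(cm_spectral)
iou_multimodal = class_iou(cm_multimodal)

# IoU gain per class in percentage points: IoU_multimodal - IoU_spectral.
gain = [100 * (m - s) for m, s in zip(iou_multimodal, iou_spectral)]
miou_spectral = sum(iou_spectral) / len(iou_spectral)
print([round(g, 2) for g in gain], round(100 * miou_spectral, 2))
```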
Table 7. Overview of the DeepLabv3+ and U-Net model performances for both spectral and multimodal versions on the P2V test set. The ∆mIoU is calculated as the difference between models of the same data type. Mean model performance represents the average of the two models within a data type. The highest scores between models are highlighted in bold, and the lowest scores are highlighted with an underline.

| Data Type | Model | mIoU [%] | F1 [%] | Precision [%] | Recall [%] | ∆mIoU [%] |
|---|---|---|---|---|---|---|
| Spectral | DeepLabv3+ | 49.60 | 63.74 | 73.85 | 67.23 | ±4.76 |
| Spectral | U-Net | 44.84 | 58.68 | 70.33 | 63.88 | |
| Spectral | Mean Model Performance | 47.22 | 61.21 | 72.09 | 65.56 | - |
| Spectral (AdaBN) | DeepLabv3+ | 59.94 | 74.13 | 74.92 | 74.72 | ±2.72 |
| Spectral (AdaBN) | U-Net | 62.66 | 76.45 | 76.80 | 77.15 | |
| Spectral (AdaBN) | Mean Model Performance | 61.30 | 75.29 | 75.86 | 75.94 | - |
| Multimodal | DeepLabv3+ | 60.25 | 73.58 | 82.01 | 73.73 | ±9.02 |
| Multimodal | U-Net | 69.27 | 81.39 | 87.34 | 79.40 | |
| Multimodal | Mean Model Performance | 64.76 | 77.49 | 84.68 | 76.57 | - |
| Multimodal (AdaBN) | DeepLabv3+ | 73.92 | 84.68 | 84.80 | 85.33 | ±1.8 |
| Multimodal (AdaBN) | U-Net | 75.72 | 85.91 | 85.94 | 86.20 | |
| Multimodal (AdaBN) | Mean Model Performance | 74.82 | 85.30 | 85.37 | 85.77 | - |
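The AdaBN rows in Table 7 refer to adapting batch normalization statistics to the target domain before inference. The toy sketch below illustrates the general idea as described by Li et al. [31], assuming that AdaBN replaces the stored mean and variance of a batch normalization layer with statistics computed on target-domain data while leaving the learned scale and shift unchanged. It uses a single hypothetical channel in pure Python and is not the authors' implementation; the class `ToyBatchNorm` and the function `adapt_bn` are illustrative names.

```python
import math
from statistics import fmean, pvariance

class ToyBatchNorm:
    """Single-channel batch normalization layer in inference mode (illustration only)."""

    def __init__(self, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
        self.mean, self.var = mean, var      # stored (running) statistics
        self.gamma, self.beta = gamma, beta  # learned scale and shift
        self.eps = eps

    def __call__(self, values):
        scale = self.gamma / math.sqrt(self.var + self.eps)
        return [scale * (v - self.mean) + self.beta for v in values]

def adapt_bn(layer, target_activations):
    """AdaBN-style step: overwrite the stored statistics with target-domain ones.

    gamma and beta, learned on the source domain, are left untouched.
    """
    layer.mean = fmean(target_activations)
    layer.var = pvariance(target_activations)
    return layer

# Hypothetical values: statistics stored from the source domain, activations from a shifted target domain.
bn = ToyBatchNorm(mean=0.0, var=1.0)
target = [4.0, 5.0, 6.0, 5.5, 4.5]

before = bn(target)   # normalized with source statistics, mean stays far from zero
adapt_bn(bn, target)
after = bn(target)    # normalized with target statistics, mean is approximately zero
print(round(fmean(before), 3), round(fmean(after), 3))
```

In a real network, the statistics of every batch normalization layer would be re-estimated by passing target-domain tiles (here, the adaptation set in Table 2) through the model, since each layer's input depends on the layers before it.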
Table 8. Class-specific improvements in IoU for both DeepLabv3+ and U-Net on the P2V test set. Improvements in IoU are calculated by subtracting the class-specific IoU results of the multimodal and spectral models documented in Figure 4. Results are sorted in descending order, starting with the greatest improvement in IoU.

| Model | Class | IoU ↑ * |
|---|---|---|
| DeepLabv3+ | Buildings | 22.91 |
| DeepLabv3+ | Impervious Surfaces | 16.10 |
| DeepLabv3+ | Background | −7.05 |
| U-Net | Impervious Surfaces | 46.34 |
| U-Net | Buildings | 21.16 |
| U-Net | Background | 5.79 |

* IoU ↑ = IoU_multimodal − IoU_spectral.
Table 9. Class-specific improvements in IoU for both DeepLabv3+ and U-Net on the P2V test set. Improvements in IoU are calculated by subtracting the class-specific IoU results of the spectral/multimodal dataset with AdaBN applied and the initial spectral/multimodal models documented in Figure 5 and Figure 4. Results are sorted in descending order, starting with the greatest improvement in IoU.

| Model | Data Type | Class | IoU ↑ * |
|---|---|---|---|
| DeepLabv3+ | Spectral | Impervious Surfaces | 28.73 |
| DeepLabv3+ | Spectral | Buildings | 1.86 |
| DeepLabv3+ | Spectral | Background | 0.44 |
| DeepLabv3+ | Multimodal | Impervious Surfaces | 25.99 |
| DeepLabv3+ | Multimodal | Background | 12.44 |
| DeepLabv3+ | Multimodal | Buildings | 2.56 |
| U-Net | Spectral | Impervious Surfaces | 37.24 |
| U-Net | Spectral | Buildings | 9.36 |
| U-Net | Spectral | Background | 6.86 |
| U-Net | Multimodal | Buildings | 8.32 |
| U-Net | Multimodal | Impervious Surfaces | 6.20 |
| U-Net | Multimodal | Background | 4.84 |

* IoU ↑ = IoU_AdaBN,spectral − IoU_spectral; IoU ↑ = IoU_AdaBN,multimodal − IoU_multimodal.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Langenkamp, J.-P.; Rienow, A. Impact of Data Modality and Batch Normalization Layers on Very High-Resolution Impervious Surface Mapping Using DeepLabv3+ and U-Net Under Regional Cross-City and Cross-Season Domain Shifts. Remote Sens. 2026, 18, 1433. https://doi.org/10.3390/rs18091433
