1. Introduction
The expansion of impervious surfaces is linked to altered hydrological processes [1,2,3], the surface urban heat island effect [2], and loss of biodiversity [4]. Accordingly, impervious surfaces, i.e., ground covered by artificial building materials such as asphalt or concrete, are recognized as a crucial environmental indicator [5]. Therefore, continuous, spatially explicit mapping of impervious surfaces is required to aid scientific analyses in disciplines such as urban planning, climatology, and hydrology, as well as to support policymakers in achieving the Sustainable Development Goals [6,7].
The use of deep learning (DL) methods in remote sensing (RS), particularly semantic segmentation models (SSMs) based on convolutional neural networks (CNNs), has made the classification of detailed urban scenes using very high-resolution (VHR) imagery in submeter ranges more feasible and accurate [8,9,10]. In this context, SSMs automatically annotate each pixel of an image with a class label [11]. Over the last decade, a wide variety of SSMs have been developed. Some of these models, such as the U-Net [12] or the DeepLabv3+ [13], are widely used and recognized as baseline SSMs within the RS literature due to their competitiveness [14,15]. Typically, SSMs are trained and tested on independent datasets of images and corresponding ground truth (GT) masks that are nevertheless drawn from similar data distributions. However, the generalization of SSMs is challenged when predictions are made in cities or seasons other than those on which they were initially trained [16,17,18,19,20], ultimately affecting the goal of consistency in mapping approaches. This degradation in model performance is attributed to domain shifts, which occur when the initial (source) data distribution used for training the model differs from the unknown (target) data distribution used for evaluation [16,21]. RS data is affected by dynamic radiometric changes, which can be caused by various factors, for example, different acquisition conditions and systems, changes to the land surface, and differences in phenology or environmental conditions [16,22,23]. This is particularly relevant when working with VHR RS data, since its increased spatial resolution compared to satellite RS data makes it more likely to capture radiometric changes that are not related to actual changes on the land surface [22]. Ultimately, these circumstances limit the practicality of continuous monitoring of impervious surfaces using VHR RS data, particularly those acquired from airborne sources.
In the RS literature, domain shifts are usually addressed using domain adaptation (DA) techniques, aiming to reduce the impact of the unknown target data on the model prediction [16,21]. Such techniques typically involve complex training strategies or architectural changes in baseline SSMs [18,19,20,24,25,26,27]. However, in many real-world scenarios, practitioners may lack the resources or the time to implement such techniques. Therefore, prior to the implementation of DA solutions, it is important to quantify the severity of the expected performance degradation and the initial sensitivity of model architectures in cross-domain scenarios. Additionally, it is of practical relevance to understand how different data modalities (i.e., spectral or height information) and the observed differences in training and testing data distribution (covariate shift) [28] contribute to model robustness. For example, Qin et al. [17] evaluate the generalization of U-Net, DeepLabv3+, Feature Pyramid Network, HRNet, and Random Forest alongside several DA approaches on VHR RS datasets from Omaha, Jacksonville, Haiti, London, and Shanghai. In their cross-city and cross-sensor scenarios, they report a significant performance drop of up to 35% in mean intersection over union (mIoU) when the models were transferred from the source data to the target data. Qin et al. [17] highlight that combining spectral data and above-ground-level (AGL) height information reduced performance drops, with the Random Forest achieving the highest increase of 12% in mIoU. In contrast, DA methods are reported to reduce performance drops by up to 10% in mIoU. Another study by Petrich et al. [29] analyzes the generalization of Support Vector Machines (SVMs), SegNet, and SegNet lite trained on multimodal VHR RS datasets. The experiments are based on the Vaihingen and Potsdam benchmark datasets from the International Society for Photogrammetry and Remote Sensing (ISPRS) and on the Urban 3D Challenge dataset from the United States Special Operations Command (USSOCOM). In their cross-city and cross-season scenarios, they emphasize that combining spectral data with height information from a normalized digital surface model (nDSM) improves the accuracy of predictions on target data by an average of 14.40% when Vaihingen-trained models are transferred to the Potsdam dataset. Petrich et al. [29] highlight that adding height information significantly impacts the prediction of the target data, whereas the effect on the source data is less pronounced. Although studies exist that evaluate the impact of data modality on the initial robustness of SSMs in cross-domain scenarios, there is a lack of studies that focus specifically on mapping impervious surfaces while quantifying class-specific sensitivities using VHR RS data.
In deep neural networks (DNNs), recalibrating the normalization layers to the target data is a more lightweight approach to DA [30,31]. Within a DNN, normalization layers statistically adjust the activations of neurons [32]. One prominent normalization technique is batch normalization (BN), proposed by Ioffe and Szegedy [33], which normalizes activations using statistics computed over mini-batches of input data fed into a DNN rather than over individual input examples. This technique is reported to effectively improve training speed and model generalization [32] and, thus, represents an integral part of DNN architectures [30,32]. Based on this concept, Li et al. [31] propose adaptive batch normalization (AdaBN), building upon the hypothesis that the statistics (mean and variance) calculated within the BN layers are linked to the source domain on which the network was initially trained. By utilizing AdaBN, these statistics are recalibrated to the target domain data, reducing the covariate shift [31]. Since AdaBN does not require retraining or finetuning and operates only on unlabeled images from the target domain, this concept is described as a practical approach to DA [30,31]. Benson and Ecker [34] analyze the effect of multi-domain AdaBN in building damage detection using a two-stream ResNet50 and a Dual-HRNet trained on VHR satellite data. The authors emphasize that multi-domain AdaBN improves model robustness when facing domain shifts, while highlighting inconsistent performance patterns between domains. In a related conference paper, Wang et al. [35] report that AdaBN improves model performance by an average of 13.58 percentage points in mIoU compared to the DeepLabv3 baseline in cross-sensor scenarios, using the ISPRS Potsdam aerial dataset and the DFCTrack1 satellite dataset in land use and land cover classification. The authors specifically point out that BN statistics played an important role in their experiments involving cross-sensor shifts. Although the importance of BN statistics in cross-domain scenarios has been recognized, there is a lack of studies that systematically compare the impact of AdaBN against the impact of data modality on model generalization in cross-city and cross-season scenarios in impervious surface mapping using VHR RS datasets.
To fill this gap, we conduct a systematic comparative analysis, quantifying the initial robustness of baseline SSMs trained on spectral (R, G, NIR) and multimodal datasets (R, G, NIR, nDSM) while analyzing modality-specific impacts of AdaBN in impervious surface mapping under cross-city and cross-season shifts. To achieve this, we use the Potsdam and Vaihingen VHR RS benchmark datasets provided by the ISPRS, reclassifying the initial six GT classes into three classes: impervious surfaces, buildings, and background. For comparisons, we train spectral and multimodal versions of the U-Net and the DeepLabv3+ with and without adaptations using AdaBN. In the experimental setup, the Potsdam dataset is the source domain on which the initial models are trained. The Vaihingen dataset is the target domain, representing the cross-city and cross-season domain shift scenario. We thus address the following research questions:
What is the relative impact of input data modality (spectral vs. multimodal) and AdaBN on SSM performance in impervious surface mapping under cross-city and cross-season scenarios?
How consistent are the effects of input data modality and AdaBN across different SSM architectures in impervious surface mapping under cross-city and cross-season scenarios?
2. Materials and Methods
2.1. Study Area and Datasets
We conduct experiments on the Potsdam and Vaihingen datasets provided by the ISPRS [36]. These VHR RS datasets are a widely used, publicly available data source in the RS literature, for example, providing benchmarks to evaluate model architectures [14,15]. The datasets represent urban scenes in the cities of Potsdam and Vaihingen an der Enz, which are located in northeast and southwest Germany, respectively (Figure 1). As the capital of the federal state of Brandenburg, Potsdam has a population of 188,288 and an area of 187.68 km² [37]. In contrast, Vaihingen an der Enz is a medium-sized city in the Stuttgart metropolitan region of the federal state of Baden-Württemberg. Vaihingen comprises 29,907 inhabitants and an area of 73.40 km² [38]. While Potsdam’s cityscape is characterized by a denser urban fabric with large building complexes, Vaihingen exhibits a more fragmented and small-scale settlement structure [36]. Furthermore, the Potsdam dataset represents leaf-off conditions in spring, whereas the Vaihingen dataset was acquired under leaf-on conditions in summer.
The ISPRS Vaihingen dataset contains true orthophoto (TOP) tiles with near-infrared (NIR), red (R), and green (G) bands; digital surface model (DSM) data; and manually annotated GT masks [36,39]. Additional nDSM data matching this dataset is provided by Gerke [40]. The GT masks hold the class labels: (1) impervious surfaces, (2) building, (3) low vegetation, (4) tree, (5) car, and (6) clutter/background. The ISPRS Potsdam dataset is comparable to the ISPRS Vaihingen dataset. It contains TOP tiles with R, G, blue (B), and NIR bands, as well as DSM and nDSM data and manually annotated GT masks [36,41].
Table 1 documents the characteristics of the dataset and the data that were used in this study from the ISPRS Potsdam and Vaihingen datasets.
2.2. Dataset Preparation
In impervious surface mapping, leaf-off conditions are considered to provide better mapping conditions because vegetation, such as tree canopies, obscures less of the surface [42,43,44]. Since the Potsdam dataset represents leaf-off conditions, this dataset is used as the source data, while the Vaihingen dataset, representing leaf-on conditions, serves as the target data. The Potsdam tiles are divided into three sets. The first set is for training, the second set is for validation during training, and the third set is for testing the final model’s performance. The Vaihingen tiles are divided into two sets. The first set is for adaptation to the target data, providing a more comparable, unbiased setting, and the second set is for testing the final model performance.
Figure 2 illustrates the sets used within the remainder of this study.
This study builds on the initial tile splits of the Vaihingen and Potsdam training and validation sets, which were provided by the ISPRS [39,41]. For the Potsdam dataset, the initial validation set serves as the test set, and four tiles were separated from the initial training set to serve as the validation set in this study. For the Vaihingen dataset, the initial training set serves as the adaptation set, while the initial validation set serves as the test set.
Table 2 documents the tiles in the corresponding sets alongside their identifiers.
The datasets differ in spatial resolution. We resampled the Potsdam dataset (TOP, nDSM) from 5 cm to 9 cm using nearest-neighbour interpolation to match the spatial resolution of the Vaihingen dataset. The TOP and nDSM data of each dataset represent 8-bit integer values. Both datasets were rescaled to a range of [0, 1] by dividing each dataset by 255.
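The resampling and rescaling steps described above can be illustrated with a minimal, one-dimensional sketch. The function names, the pixel-centre convention used to select the nearest source pixel, and the sample values are assumptions made for illustration; the original preprocessing code is not given in the text and may use a different library or convention.

```python
def resample_row_nearest(row, src_res_cm=5, dst_res_cm=9):
    """Resample one row of pixels to a coarser grid with nearest-neighbour lookup."""
    # Number of target pixels that fit into the extent covered by the source row.
    n_dst = (len(row) * src_res_cm) // dst_res_cm
    out = []
    for j in range(n_dst):
        # Centre of target pixel j, expressed as a source pixel index.
        # Integer arithmetic avoids floating-point rounding at pixel borders.
        src = ((2 * j + 1) * dst_res_cm) // (2 * src_res_cm)
        out.append(row[min(src, len(row) - 1)])
    return out


def rescale_8bit(row):
    """Map 8-bit integer values to the range [0, 1] by dividing by 255."""
    return [value / 255 for value in row]


row_5cm = list(range(0, 250, 25))        # ten hypothetical 8-bit values at 5 cm
row_9cm = resample_row_nearest(row_5cm)  # five values at 9 cm
scaled = rescale_8bit(row_9cm)
```

Nearest-neighbour lookup only copies existing values, so no new intermediate values are introduced into the TOP or nDSM data.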
To focus on impervious surface mapping, we reclassified the original six classes within the GT masks into the three classes: impervious surfaces, buildings, and background. The car class was combined with the initial impervious surface class because most of the car objects in the datasets were found in this class. The initial low vegetation, tree, and clutter/background classes were combined for the background class, representing pervious land covers. The initial building class was left unchanged.
Table 3 documents the reclassification process for the initial ISPRS class labels.
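The reclassification can be sketched as a simple lookup. The source label IDs follow the enumeration (1) to (6) given in Section 2.1, whereas the target IDs 0, 1, and 2 and the assumption that the GT masks are already available as integer label arrays are illustrative choices, not details taken from the study.

```python
# Target class IDs are hypothetical; only the grouping follows the text.
IMPERVIOUS, BUILDING, BACKGROUND = 0, 1, 2

RECLASS = {
    1: IMPERVIOUS,  # impervious surfaces
    2: BUILDING,    # building (left unchanged)
    3: BACKGROUND,  # low vegetation
    4: BACKGROUND,  # tree
    5: IMPERVIOUS,  # car, merged into impervious surfaces
    6: BACKGROUND,  # clutter/background
}


def reclassify(mask):
    """Map a 2-D mask of original label IDs to the three target classes."""
    return [[RECLASS[label] for label in row] for row in mask]


mask = [[1, 5, 2],
        [3, 4, 6]]
new_mask = reclassify(mask)
```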
For the experiments, we created spectral (NIR, R, and G) and multimodal (NIR, R, G, and nDSM) datasets for each city. Training, validation, and adaptation tiles were processed into non-overlapping patches of size 512 × 512 pixels. For the Potsdam dataset, this resulted in 1152 patches for training and 384 patches for validation. For the Vaihingen dataset, 387 patches were created, serving as the adaptation set. The test tiles for both datasets retained their original shape and were processed during model inference. Model performances reported throughout this study refer to the test sets for the corresponding city.
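As an illustration of non-overlapping patching, the following sketch enumerates patch origins on a tile and cuts the patches out. How incomplete patches at the tile border were handled is not stated in the text; dropping them here is an assumption, as are the function names.

```python
def patch_origins(height, width, patch=512):
    """Top-left corners of non-overlapping patches that fit fully inside a tile."""
    return [(r, c)
            for r in range(0, height - patch + 1, patch)
            for c in range(0, width - patch + 1, patch)]


def extract_patches(tile, patch):
    """Cut a 2-D tile (list of rows) into non-overlapping square patches."""
    height, width = len(tile), len(tile[0])
    return [[row[c:c + patch] for row in tile[r:r + patch]]
            for r, c in patch_origins(height, width, patch)]


# Toy example: a 4 x 5 tile cut into 2 x 2 patches; the last column is dropped.
tile = [[10 * r + c for c in range(5)] for r in range(4)]
patches = extract_patches(tile, patch=2)
```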
2.3. Model Architectures
We utilized two encoder–decoder CNN-based model architectures, namely the U-Net [12] and the DeepLabv3+ [13]. This choice was motivated by their popularity as baseline models in the RS literature [15,17], as well as for the DeepLabv3+ within the DA literature [18,19,24,26]. Thus, analyzing the initial transferability of both models under cross-city and cross-season domain shifts may inform a variety of studies. We report their main architectural design in the following sections.
2.3.1. U-Net
The U-Net was developed by Ronneberger et al. [12]. Its design is composed of an encoder and a decoder part, which are linked through skip connections. The encoder is responsible for extracting features from the input data, which are processed by the decoder to generate segmentation masks of the input image. In this process, the skip connections aim to keep the spatial detail from the feature maps of earlier stages. The encoder and decoder have a similar structure, each comprising four convolutional blocks, in which two 3 × 3 convolution layers are followed by a rectified linear unit (ReLU) activation. However, while the encoder downsamples using 2 × 2 max pooling layers, the decoder upsamples using 2 × 2 up-conv layers. The number of feature channels in the encoder is doubled with each block (from 64 to 512 features) and correspondingly reduced in the decoder. The bottleneck captures the most abstract features and consists of two 3 × 3 convolution layers with ReLU activations, connecting the encoder and decoder. For the final segmentation mask, a 1 × 1 convolution is used, adjusting the number of channels to the required output.
2.3.2. DeepLabv3+
The DeepLabv3+ was created by Chen et al. [13]. The model architecture comprises an encoder–decoder structure and an atrous spatial pyramid pooling (ASPP) module, aiming to strengthen contextual information and detailed object boundaries. A CNN backbone generates feature maps inside the encoder, from which the last feature maps are transferred to the ASPP module. Inside the ASPP module, atrous convolutions are used with different dilation rates to obtain multi-scale contextual information. The ASPP output is then upsampled to match the spatial resolution of feature maps from earlier layers and concatenated with these feature maps to refine the object boundaries. Subsequent convolutional layers process the concatenated features, and the result is upsampled to the original input image size to produce the final segmentation output.
2.4. Training Configurations
To ensure a comparable experimental setup, the same hyperparameters and training settings were used throughout the experiments. Additionally, to increase comparability between the U-Net and DeepLabv3+ model architecture implementations, we used the Segmentation Models PyTorch (SMP) library [45] (version: 0.5.0) with PyTorch (version: 2.9.1+cu128) in Python (version: 3.12.12). For both U-Net and DeepLabv3+, we utilized ResNet34 as the backbone model without the use of pretrained weights to focus on initial model transferability. Data augmentation was implemented using horizontal and vertical flips and random rotation by 90 degrees, each with a probability of 50%. Photometric data augmentations were not considered to focus on the raw domain shifts observed in the data. Each model was trained for 40 epochs using the AdamW optimizer [46] with a learning rate of 1 × 10⁻³. Additionally, the cosine annealing learning rate scheduler was utilized to reduce the learning rate throughout model training to a final learning rate of 1 × 10⁻⁵ [47]. The batch size was set to 16. Early stopping was triggered when the validation loss did not decrease for 10 consecutive epochs. As the loss function, we utilized the cross entropy (CE) loss, one of the most widely applied loss functions in DL [48].
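The learning rate schedule and the early stopping rule can be sketched in plain Python as follows. This is a minimal illustration using the closed-form cosine annealing formula; stepping the schedule once per epoch over all 40 epochs and the class and function names are assumptions, since the text does not specify these details.

```python
import math


def cosine_annealing_lr(epoch, lr_max=1e-3, lr_min=1e-5, total_epochs=40):
    """Closed-form cosine annealing from lr_max (epoch 0) to lr_min (final epoch)."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = math.inf
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


schedule = [cosine_annealing_lr(epoch) for epoch in range(41)]
```

The schedule decreases monotonically, so the largest updates occur early in training and the final epochs fine-tune with a learning rate close to the minimum.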
2.5. Model Inference
We divided model inference into two categories. The first category is conventional inference, which uses the initially trained models without adaptations. The second category is an adjusted inference for which models were adjusted using AdaBN prior to the inference step. For the remainder of this study, these categories will be referred to as “conventional” and “AdaBN”.
BN normalizes the activation outputs of a neural network layer using the activations produced for each mini-batch (B) of training data [30,32,33]. For each feature, the mean ($\mu_B$) and variance ($\sigma_B^2$) are calculated across the mini-batch. Each individual activation ($x_i$) is then normalized by subtracting the feature’s mini-batch mean and dividing by the feature’s mini-batch standard deviation, while adding a constant $\epsilon$ for numerical stability (1) [33]. After normalization, the scale ($\gamma$) and shift ($\beta$) parameters are applied to the normalized activations, which are learned by the model during training (2) [33].

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad (1)$$

$$y_i = \gamma \hat{x}_i + \beta \qquad (2)$$
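The normalization and the subsequent scale and shift can be sketched for a single feature as follows. The function name and the value of the stability constant are assumptions for illustration; the sketch covers the training-time computation on one mini-batch only.

```python
import math


def batch_norm_feature(activations, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply BN to the activations of one feature across a mini-batch."""
    n = len(activations)
    mean = sum(activations) / n
    # Mini-batch variance (population form, dividing by n).
    var = sum((x - mean) ** 2 for x in activations) / n
    # Normalize, with eps added under the square root for numerical stability.
    normalized = [(x - mean) / math.sqrt(var + eps) for x in activations]
    # Apply the learned scale (gamma) and shift (beta).
    return [gamma * x_hat + beta for x_hat in normalized]


batch = [1.0, 2.0, 3.0, 4.0]
plain = batch_norm_feature(batch)
scaled = batch_norm_feature(batch, gamma=2.0, beta=1.0)
```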
At inference time, the BN layers normalize the test data using the fixed running estimates of the mean and variance, which were accumulated during training. Therefore, these learned statistics are representative of the training data [31]. With AdaBN, these statistics are updated based on the target data, recalculating the mean and variance [31].
We adopt the concept of AdaBN by updating the BN layer running statistics using the target data represented by the Vaihingen adaptation set. For each adaptation, we created a separate copy of each individual source-trained model so that the BN layer statistics always reflected the initial source-trained statistics prior to applying AdaBN. Then, the input images from the adaptation set, without any GT masks, are passed through the SSM in training mode without gradient updates, allowing the running mean and variance to adapt per mini-batch. To evaluate the impact of AdaBN on the encoder and decoder, this workflow was reproduced. This time, however, either the encoder or the decoder was set to training mode, while the other was set to evaluation mode. The same batch size of 16 was applied during this process as was used during initial model training. Additionally, random shuffling was disabled in the data loading pipeline to allow for more accurate model comparisons.
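The recalibration of running statistics can be illustrated for a single BN feature with the following framework-free sketch. It mimics what happens when unlabeled target batches pass through a BN layer in training mode: the running mean and variance are updated as exponential moving averages while no weights change. The class and function names, the momentum of 0.1, and the use of the unbiased batch variance are assumptions that follow common framework conventions and are not stated in the text.

```python
import copy


class BNRunningStats:
    """Running mean and variance of one BN feature (hypothetical helper)."""

    def __init__(self, mean, var, momentum=0.1):
        self.mean = mean
        self.var = var
        self.momentum = momentum

    def update(self, batch):
        """Move the running statistics towards those of one mini-batch."""
        n = len(batch)
        batch_mean = sum(batch) / n
        batch_var = sum((x - batch_mean) ** 2 for x in batch) / (n - 1)
        m = self.momentum
        self.mean = (1 - m) * self.mean + m * batch_mean
        self.var = (1 - m) * self.var + m * batch_var


def adapt_to_target(source_stats, target_batches):
    """Return recalibrated statistics; the source statistics stay untouched."""
    adapted = copy.deepcopy(source_stats)  # separate copy per adaptation
    for batch in target_batches:           # fixed order, no labels needed
        adapted.update(batch)
    return adapted


source = BNRunningStats(mean=0.0, var=1.0)   # toy source-domain statistics
target_batches = [[10.0, 12.0]] * 200        # toy target-domain activations
adapted = adapt_to_target(source, target_batches)
```

Working on a deep copy mirrors the described workflow, in which each adaptation starts from the unchanged source-trained statistics.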
Inference was performed on the testing tiles of either the Potsdam or the Vaihingen dataset. First, the tiles were patched using a sliding window approach with overlapping patches and a step size of 128 pixels. For each patch, class probability maps were created. The class probability maps were aggregated to match their initial positions and the final tile sizes. The predicted class probabilities from all overlapping patches were summed pixel-wise. Throughout this process, the number of times each pixel was overlapped was tracked. Then, the final class probability mask was normalized by dividing it by the count of pixel overlap. The final segmentation mask was obtained by assigning each pixel to the class with the highest aggregated probability.
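The overlap-averaging procedure can be sketched as follows. The sketch uses toy sizes instead of the 128-pixel step, a stand-in `predict` callable in place of the trained model, and an extra window aligned to the tile border when the step does not divide the tile evenly; these are illustrative assumptions, as the text does not describe border handling.

```python
def window_starts(length, patch, step):
    """Start offsets of windows of size `patch` that together cover [0, length)."""
    starts = list(range(0, length - patch + 1, step))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # assumption: align a last window to the border
    return starts


def aggregate_predictions(height, width, n_classes, patch, step, predict):
    """Sum class probabilities of overlapping patches, average, and take the argmax."""
    prob_sum = [[[0.0] * n_classes for _ in range(width)] for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for r0 in window_starts(height, patch, step):
        for c0 in window_starts(width, patch, step):
            probs = predict(r0, c0)  # patch x patch x n_classes probabilities
            for dr in range(patch):
                for dc in range(patch):
                    count[r0 + dr][c0 + dc] += 1
                    for k in range(n_classes):
                        prob_sum[r0 + dr][c0 + dc][k] += probs[dr][dc][k]
    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            mean = [p / count[r][c] for p in prob_sum[r][c]]  # normalize by overlap
            row.append(max(range(n_classes), key=mean.__getitem__))
        labels.append(row)
    return labels, count


PATCH, STEP, N_CLASSES = 4, 2, 3  # toy sizes for illustration


def constant_model(r0, c0):
    """Stand-in for model inference on the patch with top-left corner (r0, c0)."""
    return [[[0.2, 0.7, 0.1] for _ in range(PATCH)] for _ in range(PATCH)]


labels, count = aggregate_predictions(6, 6, N_CLASSES, PATCH, STEP, constant_model)
```

Averaging over overlapping windows reduces the influence of patch-border effects, since each pixel is predicted in several different patch contexts.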
2.6. Evaluation
To evaluate the experiments, class-specific and overall metrics were calculated. For this, the intersection over union (IoU), F1 score, precision, and recall were computed from class-specific true positives (TPs), false positives (FPs), and false negatives (FNs). Precision (3) measures the share of TPs among all positive model predictions, indicating the correctness of the model for the target class [49]. Recall (4) measures the share of TPs among all actual positive cases, i.e., the TPs and the FNs. Therefore, this metric indicates how many of the actual positive cases were identified by the model [49]. In addition, F1 (5) is the harmonic mean of precision and recall, with higher scores indicating more balanced model performance [49]. The IoU (6) is a standard metric in semantic segmentation, quantifying the degree of overlap between the ground truth and the predicted outcome, for which higher scores indicate better performance [49]. For an overall overview of model performances, we averaged class-wise metrics over the number of classes (denoted with C within the formulas) as described in (7). Unless stated otherwise, the metric results are equivalent to the mean metric results calculated either on the Potsdam or Vaihingen test set for the remainder of this study. We further distinguish between the IoU, which denotes the class-specific result, and the mean IoU (mIoU), which denotes the class-averaged result. The metrics are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$

$$\text{IoU} = \frac{TP}{TP + FP + FN} \qquad (6)$$

$$\text{Metric}_{\text{mean}} = \frac{1}{C} \sum_{c=1}^{C} \text{Metric}_c \qquad (7)$$
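As a worked illustration of these metrics, the following sketch computes them from class-specific TP, FP, and FN counts and averages them over classes. The function names, the example counts, and the convention of returning 0 when a denominator is zero are assumptions for illustration.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, F1, and IoU for one class from pixel counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


def mean_metric(per_class, name):
    """Average one metric over all C classes (e.g., mIoU for name='iou')."""
    return sum(m[name] for m in per_class) / len(per_class)


# Hypothetical counts for two classes.
per_class = [class_metrics(tp=8, fp=2, fn=2),
             class_metrics(tp=5, fp=5, fn=0)]
miou = mean_metric(per_class, "iou")
```

For the first class, precision, recall, and F1 all equal 0.8, whereas the IoU is 8/12, which illustrates that the IoU penalizes FPs and FNs more strongly than F1 for the same counts.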
4. Discussion
4.1. Impact of Data Modality on Cross-City and Cross-Season Shift
In the P2P scenario, the results show that data modality affected model performance only marginally. On average, the difference in performance between the multimodal and spectral model versions was 3.43 percentage points in mIoU, favoring the multimodal models. The performance differences between model architectures were similarly small: the difference in mIoU between the spectral model versions was 0.7 percentage points, whereas the difference between the multimodal model versions was 0.15 percentage points. In both cases, U-Net marginally outperformed DeepLabv3+. Therefore, we agree with Petrich et al. [29] that, for in-domain tasks, relying solely on spectral datasets may be sufficient.
In the P2V scenario, the results show that the combination of spectral and height information considerably reduced the performance drops observed in spectral model versions. On average, multimodal model versions achieved an mIoU that was 17.54 percentage points higher than that of the spectral model versions. This suggests that the nDSM data increases the robustness of impervious surface SSMs when confronted with cross-city and cross-season domain shifts. One possible explanation is that the nDSM represents a more generalizable feature representation, which is less influenced by domain shifts compared to the spectral dataset [17]. Our results align with the findings of Qin et al. [17] and Petrich et al. [29], who reported that the integration of nDSM data helped increase the robustness of model performances under domain shifts. For example, Petrich et al. [29] reported an average increase in accuracy of 14.40% for the V2P task when additional nDSM data was added. Conceptually, our findings match the description of Chen et al. [25] that spectral and textural cues are considered more domain-specific than height information, and they support the observation of Bruzzone and Bovolo [22] and Chen et al. [23] that spectral data is subject to radiometric noise.
When nDSM data were added, the greatest class-specific increases in IoU were found for the building and the impervious surface classes. For the multimodal DeepLabv3+, the building class improved the most. For the multimodal U-Net, however, this was the impervious surface class, directly followed by the building class. This partly met our expectations, as we had anticipated that the building class would benefit the most from adding height information from the nDSM. However, when the performance of the two models was compared for the building class, both were found to benefit by a similar margin from the additional nDSM. Since both models also improved considerably for the impervious surface class, the results suggest that additional height information may increase class separability. This is in line with Chen et al. [25], who emphasize that ambiguity between classes may be reduced when utilizing height information.
4.2. Impact of AdaBN on Cross-City and Cross-Season Shift
In the P2V scenario, performance drops observed in spectral model versions were reduced by using AdaBN. The results showed that AdaBN increased the mean model performance of the spectral model versions by 14.08 percentage points in mIoU, which was 3.46 percentage points lower than the increase achieved by using multimodal datasets. The combination of AdaBN with multimodal model versions exhibited the best results, further increasing mIoU on average by 10.06 percentage points compared to multimodal model versions alone. The results suggest that AdaBN increases robustness against cross-city and cross-season domain shifts without requiring additional data or model finetuning. Moreover, its effect adds to that of multimodal datasets. One possible explanation is that the covariate shifts between the Potsdam and Vaihingen datasets considerably affected the transferability of the models within the experiments, leading to consistent improvements across data modalities. Additionally, the multimodal dataset outperforms AdaBN, since the height data is directly integrated into model training. This increases the amount of data available for the task, as well as the amount of information (e.g., semantics) used during training. Our results are in line with Li et al. [31], Benson and Ecker [34], and Wang et al. [35], who report robustness improvements to domain shifts using AdaBN. Specifically, our observed increases in mIoU are comparable to the 13.58 percentage-point average increase outlined by Wang et al. [35] for AdaBN against the DeepLabv3 baseline in cross-sensor scenarios.
Moreover, when AdaBN was applied, the greatest class-specific improvement was observed for the impervious surface class regardless of the input data type used within model training. For the spectral model versions, the IoU scores observed for the impervious surface class increased by 28.73 and 37.24 percentage points for DeepLabv3+ and U-Net, respectively. These drastic increases, however, are a result of the severe performance drops observed for this class within the P2V task. The results suggest that the impervious surface class was the most sensitive class to the cross-city and cross-season domain shift. One possible explanation could be that the covariate shifts between the source and the target data were more severe for impervious surfaces than for the other classes. This may have caused the models to overfit to their city- and season-specific source data, which was not generalizable enough for the target data. For example, the roads in the Vaihingen images were partly obscured by leafy tree canopies, unlike the roads in the Potsdam images, which represent leaf-off conditions. Additionally, the shadows cast onto the roads showed different patterns, with longer, lighter shadows in the Potsdam images compared to the Vaihingen images (cf. Figure 7 and Figure 8). Thus, in our experiments, vegetation occlusions and shadow patterns affected model generalization. However, season-specific overfitting could potentially be more severe when facing more contrasting seasonal changes, such as leaf-on conditions in summer and snow cover conditions in winter, than when facing defoliation in autumn.
Furthermore, our observations revealed that AdaBN had a greater impact on the encoder than on the decoder for both the DeepLabv3+ and U-Net models. For the DeepLabv3+, applying AdaBN solely to the encoder generated an even greater mIoU score than its application to the entire model. This suggests that the encoder plays a crucial part in model performance in cross-domain scenarios. One potential explanation for this finding is that the encoder statistics are more domain-sensitive than those of the decoder. This may be due to the encoder extracting more domain-specific low-level features, while the decoder is more class-discriminative. Specifically, when the input images at inference differ from those seen during training, the features extracted by the encoder may not align with the feature representation expected by the decoder, which was learned during training. Therefore, using AdaBN to align the encoder features in the early stages of the model may be more beneficial, making the features easier for the decoder to process in later stages.
4.3. Differences Between Model Architectures
In our experiments, U-Net consistently outperformed DeepLabv3+ in all P2P and P2V comparisons except for the spectral models in P2V. This result suggests that U-Net was more robust than DeepLabv3+ in both in-domain and cross-domain settings. One possible explanation could be that DeepLabv3+ overfitted more severely to the characteristics of the Potsdam image dataset compared to the U-Net, particularly in P2V scenarios. More specifically, this may be due to the ASPP module and the additional dilated convolutions in the DeepLabv3+ architecture. This additional contextual information may be more representative of the source domain, influencing its cross-domain performance.
In addition, the background class performance of DeepLabv3+ unexpectedly decreased when nDSM data was added, which was not observed for U-Net. Although the exact reason remains unclear, one possible explanation could be that DeepLabv3+ may not only overfit to the source data but also to specific characteristics of the input data in our experiments. Supporting this, we observed that the spectral DeepLabv3+ tended to overpredict the building class. In contrast, the multimodal DeepLabv3+ was more accurate at predicting the building class due to the nDSM. However, it tended to overpredict the background class rather than the impervious surfaces class, possibly due to the class imbalance present within the training data. This contrasting behavior of the model versions potentially led to the observed decrease in IoU for the background class. In contrast to U-Net, DeepLabv3+ uses dilated convolutions and the ASPP module to increase spatial context. Therefore, if this spatial context information differs due to domain shifts in the input data, it may lead to more varying patterns of prediction uncertainty when comparing different data modalities.
Our findings contrast with those of Qin et al. [17] and Petrich et al. [29], who highlighted inconsistency between best-performing models on the source and on the target data. However, instead of claiming that U-Net outperformed DeepLabv3+, we more conservatively report that the choice of model architecture influenced robustness to cross-domain shifts in our experiments. This is because the training setup (training data, model architecture, and loss and optimization functions) may influence the models to learn specific characteristics instead of others (inductive bias), leading to the observed results [50].
Furthermore, the results show that model performance differences between DeepLabv3+ and U-Net decreased in the P2V scenario when AdaBN was applied, regardless of data modality. This finding suggests that AdaBN may increase the comparability between model architectures in cross-domain scenarios. This may be explained by AdaBN reducing the error component related to the mismatch of the BN statistics due to the cross-domain shifts, isolating more underlying influences, such as the inductive bias between the model architectures.
In addition, data modality and AdaBN had rather different impacts on both DeepLabv3+ and U-Net. The results show that applying AdaBN to spectral datasets had a consistent effect on class-specific improvements of both model architectures, which was not the case for the usage of multimodal datasets. These results suggest that using AdaBN leads to more predictable, class-specific improvements than using different data modalities during training. This can be explained by the inherent stochasticity and inductive biases present during training, increasing the probability of observing differences between model outcomes. However, this may be less pronounced if only the BN layer statistics are adapted to the target data in AdaBN, predominantly addressing covariate shift [
31].
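The recalibration of BN statistics discussed above can be sketched for a single feature channel. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the class `BatchNorm1Channel`, the function `adapt_bn`, and the activation values are hypothetical, and in a full network the statistics would be re-estimated for every BN layer from target-domain images.

```python
import math
import statistics


class BatchNorm1Channel:
    """Inference-mode batch normalization for a single feature channel."""

    def __init__(self, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
        self.mean, self.var = mean, var      # stored (running) statistics
        self.gamma, self.beta = gamma, beta  # learned affine parameters
        self.eps = eps

    def __call__(self, values):
        scale = self.gamma / math.sqrt(self.var + self.eps)
        return [scale * (v - self.mean) + self.beta for v in values]


def adapt_bn(layer, target_values):
    """AdaBN-style step: replace the stored statistics with target-domain ones.

    The learned parameters gamma and beta are left unchanged.
    """
    layer.mean = statistics.fmean(target_values)
    layer.var = statistics.pvariance(target_values)
    return layer


# Hypothetical activations of one channel in the source and target domains.
source = [0.9, 1.0, 1.1, 1.2, 0.8]
target = [2.9, 3.1, 3.0, 3.3, 2.7]  # shifted and more dispersed than the source

bn = BatchNorm1Channel(statistics.fmean(source), statistics.pvariance(source))
before = bn(target)                   # normalized with source statistics
after = adapt_bn(bn, target)(target)  # normalized with target statistics

print("mean before:", round(statistics.fmean(before), 2))
print("mean after: ", round(statistics.fmean(after), 2))
```

With the source statistics, the target activations end up far from zero mean. After the adaptation step they are approximately standardized again, while the learned weights stay as trained, which is why no finetuning or target labels are needed for this step.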
4.4. Limitations and Future Research
This study is limited to a comparison of DeepLabv3+ and U-Net, two baseline CNN-based encoder–decoder networks, and their initial capability to handle cross-season and cross-region domain shifts. Furthermore, to provide a more comparable experimental setting, we used the model implementations from the SMP library [
45]. Consequently, the experimental results may differ when the original model architecture implementations are used. However, given the widespread use of both models in the RS [
15,
17] and DA literature [
18,
19,
24,
26], we argue that our results provide insights relevant to a multitude of semantic segmentation approaches facing cross-city and cross-season domain shifts. Additionally, unlike standard approaches in RS, we did not use pretrained ImageNet weights, so that we could study the initial model transferability in cross-domain scenarios. However, we anticipate that pretrained weights would considerably increase the robustness of the models in the P2V scenario, particularly for the spectral model versions.
We used fixed hyperparameter settings throughout the experiments. This approach likely biased the trained models because the predefined values were not individually tuned, for example, with automated hyperparameter tuning techniques [
51]. In return, fixed settings allow us to attribute observed performance differences more confidently to the model architecture rather than to differences in model training or hyperparameter selection. However, creating similar training environments for model comparison is complicated [
51], not least because of inductive bias.
Furthermore, AdaBN was used as a straightforward technique to address domain shifts, despite being a relatively old method in the chronology of the DL literature. Given that AdaBN improved both spectral and multimodal model versions in our results and can be applied without finetuning, we argue that it remains relevant for supporting consistent impervious surface mapping. AdaBN is particularly useful in real-world settings, as nDSM data are often unavailable or subject to unaligned acquisition dates, which limits workflow automation. Moreover, since AdaBN has been shown to be effective with a small number of examples, it is also suitable for cases with limited data in the target domain [
31]. As Li et al. [
31] report, the performance gain is expected to plateau as the number of target images increases. Therefore, in our case, a smaller adaptation set may have produced similar results. However, it should be noted that AdaBN predominantly addresses covariate shifts [
31], limiting its effectiveness when inconsistencies between features and labels are present (concept shift) or when the distribution of label information differs (prior probability shift) [
28]. Therefore, AdaBN cannot replace comprehensive DA approaches [
31], but it can provide support in cross-domain scenarios.
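The three shift types mentioned above can be written in terms of the distributions of inputs $x$ and labels $y$ in the source ($S$) and target ($T$) domains. The notation below is introduced here for illustration and follows a common textbook formulation; it is not taken verbatim from the cited works.

```latex
\begin{align*}
\text{Covariate shift:} \quad & P_S(x) \neq P_T(x), \quad P_S(y \mid x) = P_T(y \mid x) \\
\text{Prior probability shift:} \quad & P_S(y) \neq P_T(y), \quad P_S(x \mid y) = P_T(x \mid y) \\
\text{Concept shift:} \quad & P_S(y \mid x) \neq P_T(y \mid x), \quad P_S(x) = P_T(x)
\end{align*}
```

Under this formulation, re-estimating BN statistics on target inputs only acts on $P(x)$, which is consistent with AdaBN being limited when $P(y)$ or $P(y \mid x)$ differs between domains.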
Moreover, data quality and data pre-processing may have influenced experimental results. As visualized in
Figure 7 and reported by Audebert et al. [
9] and Marmanis et al. [
10], annotation errors are present within the GT datasets. Besides these annotation errors, Marmanis et al. [
10] report errors within the nDSM data for the Vaihingen dataset, which were first pointed out by Gerke [
40]. Marmanis et al. [
10] investigated these errors within the nDSM of the Vaihingen dataset and found that they refer to missing industrial buildings. In their experiments, correcting for these errors improved the overall accuracy by 0.9 percentage points. These buildings occupy 3.1% of the area in image “area12”, 9.3% in image “area31”, and 10.0% in image “area33” [
10], all of which are part of our testing set. Therefore, these errors may have systematically affected the performance of the building class, as well as our reported overall performances for the P2V tests. However, we did not correct these issues, both because of the substantial manual effort and time required to address them properly and to preserve comparability with future studies. Moreover, we used nearest-neighbor resampling instead of bilinear or bicubic interpolation to align the spatial resolution of the Potsdam dataset with that of the Vaihingen dataset. This approach may have caused aliasing or altered the spectral characteristics of the input data for model training, which potentially affected model performance. Furthermore, we reclassified the initial Potsdam and Vaihingen GT labels. Specifically, we chose to add the car class to the impervious surface class. This could have contributed to label noise because cars were also located on permeable surfaces and, ideally, should have been added to the background class. This aspect was not quantified in this study and should be considered when interpreting our experimental results and in future work. Taken together, we acknowledge that these issues may have affected the experimental results.
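The trade-off between the resampling methods mentioned above can be illustrated with a minimal 1-D sketch. The functions `resample_nearest` and `resample_linear`, the pixel values, and the resampling factor are hypothetical and do not reproduce the pre-processing pipeline of this study.

```python
def resample_nearest(values, new_len):
    """Resample a 1-D signal by picking the nearest source sample."""
    scale = len(values) / new_len
    out = []
    for i in range(new_len):
        src = min(int((i + 0.5) * scale), len(values) - 1)  # nearest source index
        out.append(values[src])
    return out


def resample_linear(values, new_len):
    """Resample a 1-D signal by linear interpolation between neighbors."""
    scale = len(values) / new_len
    out = []
    for i in range(new_len):
        # Position of the output sample center in source coordinates.
        pos = min(max((i + 0.5) * scale - 0.5, 0.0), len(values) - 1.0)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        w = pos - lo
        out.append((1 - w) * values[lo] + w * values[hi])
    return out


# Hypothetical profile: a bright, one-pixel-wide line on dark ground.
row = [10, 10, 10, 200, 10, 10, 10, 10, 10]

nearest = resample_nearest(row, 5)  # keeps only original values
linear = resample_linear(row, 5)    # creates mixed values

print("nearest:", nearest)
print("linear: ", [round(v, 1) for v in linear])
```

In this toy case, nearest-neighbor resampling returns only values that exist in the input but drops the thin bright line entirely, which is one form of aliasing. Linear interpolation retains a trace of the line but introduces a mixed value that never occurred in the original data.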
We focused on the geographical locations of Potsdam and Vaihingen, covering leaf-off and leaf-on conditions. One of the key reasons for choosing this study area was the availability of comparable VHR datasets with high-quality GT data. Due to the lack of comparable and high-quality VHR datasets for impervious surface mapping between regions, cross-domain comparisons using VHR datasets are currently limited to more regional comparisons with limited seasonal coverage.
Therefore, future research could involve generating additional VHR datasets in different geographical locations around the globe with greater variety within the investigated classes. Moreover, our experimental design could be extended to focus more broadly on different model architectures, particularly in more varied geographical locations and seasonal settings. For example, this might involve the comparison of the initial transferability of CNN- and Transformer-based models (e.g., SegFormer [
52], DBRSNet [
53]) to newer domain adaptation networks (e.g., DAFormer [
27], HighDAN [
20]) on cross-city and cross-season domain shifts for very high-resolution impervious surface mapping. Moreover, the impact of additional data modalities on cross-domain shifts could be studied, for example, by explicitly integrating spectral indices such as the normalized difference vegetation index (NDVI) [
54] or the perpendicular impervious surface index (PISI) [
55]. Ultimately, these research directions could contribute to a better understanding of how domain shifts and their underlying factors affect SSMs in impervious surface mapping.
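As an illustration of how a spectral index could be integrated as an additional input channel, the following minimal sketch computes the NDVI from the red and NIR bands and appends it to each pixel. The helper `add_ndvi_channel` and the reflectance values are hypothetical, and this step was not part of our experiments.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - R) / (NIR + R)."""
    denom = nir + red
    return (nir - red) / denom if denom != 0 else 0.0


def add_ndvi_channel(pixels):
    """Append the NDVI as a fourth value to each (R, G, NIR) pixel tuple."""
    return [(r, g, nir, ndvi(nir, r)) for r, g, nir in pixels]


# Hypothetical reflectance values in [0, 1].
pixels = [
    (0.05, 0.08, 0.50),  # vegetation-like: high NIR, low red
    (0.30, 0.30, 0.32),  # asphalt-like: nearly flat spectrum
]

augmented = add_ndvi_channel(pixels)
for r, g, nir, index in augmented:
    print(f"R={r:.2f} G={g:.2f} NIR={nir:.2f} NDVI={index:.3f}")
```

Because the index is a ratio of band differences, it is less sensitive to overall brightness changes than the raw bands, which is one reason such indices may be worth testing under cross-domain shifts.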
5. Conclusions
This study provides a systematic evaluation of how data modality and BN layer recalibration using AdaBN affect the robustness of SSMs for impervious surface mapping using VHR RS datasets under cross-city and cross-season shifts. Since these shifts cause a degradation in the performance of SSMs [
16,
17,
18,
19,
20], evaluating such lightweight adaptation strategies could inform more practical impervious surface mapping approaches. For this evaluation, both spectral (R, G, and NIR) and multimodal (R, G, NIR, and nDSM) versions of the U-Net and DeepLabv3+ models were trained using the ISPRS Potsdam dataset as the source domain. They were then tested using the source domain and the ISPRS Vaihingen dataset as the target domain, with and without prior adaptation using AdaBN.
The results show that cross-city and cross-season domain shifts significantly affect impervious surface mapping using VHR datasets. All the trained models exhibited performance degradation, which is consistent with previous studies. Moreover, the experiments demonstrate that the choice of model architecture can affect the initial robustness observed in cross-domain scenarios.
Regarding our first research question, we conclude that both multimodal datasets and AdaBN increase the robustness of SSMs in impervious surface mapping under cross-city and cross-season scenarios. Specifically, the impervious surface class was found to be the most sensitive to cross-domain shifts, regardless of data modality. The experiments demonstrate that the performance increase using AdaBN was, on average, 3.46 percentage points lower in terms of mIoU than that achieved using multimodal datasets. Moreover, the most robust results were obtained when combining multimodal datasets and AdaBN, which further improved the performance of multimodal model versions by 10.06 percentage points in mIoU on average. This indicates an additive effect of improvements when height information and AdaBN are both added to spectral datasets.
For the second research question, we conclude that the impact of data modality and AdaBN on the robustness of the U-Net and DeepLabv3+ models does not exhibit consistent patterns across domain shifts. The results show that integrating additional height information or the AdaBN approach leads to different class-specific improvement patterns. While AdaBN integration yields consistent patterns for both model architectures, favoring impervious surfaces, the usage of multimodal datasets exhibits more inconsistent patterns, potentially due to stochasticity and inductive bias during training.
Our findings indicate that both the covariate shifts observed between source and target data and the BN layers play an important role in the cross-city and cross-season generalization of impervious surface mapping using VHR RS datasets. While AdaBN addresses this shift directly within the trained model, adding nDSM data during training may provide more generalizable features, which, in such scenarios, may help in creating more holistic models. Therefore, when more comprehensive DA approaches are unavailable, we recommend using multimodal datasets in combination with AdaBN to improve SSM robustness across cities and seasons.
To enable further research into cross-domain shifts in VHR urban remote sensing, comparable VHR datasets from different geographical locations are required. Such datasets are a prerequisite for sufficient testing grounds, for example, for comparing CNN-based and Transformer-based SSMs.