A Robust Deep Learning Ensemble Framework for Waterbody Detection Using High-Resolution X-Band SAR Under Data-Constrained Conditions

Choi, Soyeon; Kim, Seung Hee; Nghiem, Son V.; Kafatos, Menas; Choi, Minha; Kim, Jinsoo; Lee, Yangwon

doi:10.3390/rs18020301

Open Access Article

by Soyeon Choi ¹, Seung Hee Kim ², Son V. Nghiem ³, Menas Kafatos ², Minha Choi ⁴, Jinsoo Kim ¹ and Yangwon Lee ¹,*

¹ Major of Geomatics Engineering, Division of Earth Environmental System Sciences, Pukyong National University, Busan 48513, Republic of Korea

² Institute for Earth, Computing, Human and Observing (ECHO), Chapman University, Orange, CA 92866, USA

³ NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA

⁴ Department of Water Resources, Graduate School of Water Resources, Sungkyunkwan University, Suwon 16419, Republic of Korea

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(2), 301; https://doi.org/10.3390/rs18020301

Submission received: 19 October 2025 / Revised: 31 December 2025 / Accepted: 14 January 2026 / Published: 16 January 2026

(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?

The performance of waterbody segmentation using deep learning models can be improved by incorporating land cover maps and topography information, such as slope and the Height Above Nearest Drainage (HAND), alongside high-resolution X-band SAR images.
An ensemble of deep learning models provided moderate Intersection over Union (IoU) gains over the best single model but offered critical operational advantages in Precision–Recall balance and prediction consistency.

What are the implications of the main findings?

Multi-modal data integration for waterbody detection: Our results suggest that incorporating auxiliary geospatial layers (e.g., topography and land cover), when available and sufficiently reliable, can effectively reduce sensitivity to ambiguity arising from any single sensor modality in waterbody detection, particularly in complex terrain such as that of the Republic of Korea.
Ensemble modeling for operational stability: For applications where consistent performance is critical (e.g., flood risk assessment), combining complementary architectures can help mitigate model-specific inductive biases. Although the quantitative gains in this study were moderate, ensemble aggregation improved the Precision–Recall balance and prediction consistency when adopting Optimized Weights via systematic grid search.

Abstract

Accurate delineation of inland waterbodies is critical for applications such as hydrological monitoring, disaster preparedness and response, and environmental management. While optical satellite imagery is hindered by cloud cover or low-light conditions, Synthetic Aperture Radar (SAR) provides consistent surface observations regardless of weather or illumination. This study introduces a deep learning-based ensemble framework for precise inland waterbody detection using high-resolution X-band Capella SAR imagery. To improve the discrimination of water from spectrally similar non-water surfaces (e.g., roads and urban structures), an 8-channel input configuration was developed by incorporating auxiliary geospatial features such as height above nearest drainage (HAND), slope, and land cover classification. Four advanced deep learning segmentation models, namely Proportional–Integral–Derivative Network (PIDNet), Mask2Former, Swin Transformer, and Kernel Network (K-Net), were systematically evaluated via cross-validation. Their outputs were combined using a weighted average ensemble strategy. The proposed ensemble model achieved an Intersection over Union (IoU) of 0.9422 and an F1-score of 0.9703 in blind testing, indicating high accuracy. While the ensemble gains over the best single model (IoU: 0.9371) were moderate, the enhanced operational reliability through balanced Precision–Recall performance provides significant practical value for flood and water resource monitoring with high-resolution SAR imagery, particularly under data-constrained commercial satellite platforms.

Keywords:

waterbody detection; Synthetic Aperture Radar (SAR); Capella; deep learning; semantic segmentation; ensemble

1. Introduction

Inland waterbodies, including rivers, lakes, and reservoirs, are vital components of the Earth’s hydrological cycle and critical resources for ecosystems, human livelihoods, and economic development. Monitoring the spatiotemporal dynamics of surface water is essential for environmental protection, water resource planning, and disaster mitigation [1,2].

Satellite remote sensing has proven to be a powerful tool for large-scale surface water mapping. Numerous studies have utilized optical sensors, such as Landsat and Sentinel-2, to detect surface water [3,4,5]. While these sensors provide valuable information under clear-sky conditions, they are constrained by cloud cover, atmospheric interference, and the inability to acquire nighttime data [6].

In contrast, Synthetic Aperture Radar (SAR), an active remote sensing technique, emits microwave signals and records their backscatter from the Earth’s surface. This capability allows consistent observation regardless of atmospheric or illumination conditions [7]. SAR imagery is especially effective in disaster scenarios such as floods, where optical imagery is often obscured by clouds or precipitation, enabling reliable detection of inundated areas [8,9,10]. SAR-based surface water detection methods have evolved through three major approaches. Initial efforts focused on thresholding techniques that leverage the low backscatter values of smooth water surfaces due to specular reflection [11,12,13]. However, these methods struggle under complex environmental conditions, such as wind-induced surface roughness, vegetated wetlands, and impervious surfaces like roads, which can also yield low backscatter values.

To address these limitations, machine learning algorithms such as Support Vector Machine (SVM) and Random Forest (RF) were introduced. These methods utilize diverse features (backscatter intensity, incidence angle, texture metrics, and polarimetric data) to enhance classification performance [14,15]. Although more adaptable than thresholding methods, they are still prone to overfitting and may perform poorly in heterogeneous landscapes. Since the 2020s, deep learning techniques, particularly Convolutional Neural Networks (CNNs), have brought significant advances to satellite image analysis. CNNs are capable of learning hierarchical spatial features from imagery, outperforming traditional machine learning in segmentation tasks [16,17,18]. Encoder–decoder CNN architectures [19] excel in waterbody segmentation by extracting multiscale features while preserving object boundaries. Recently, Transformer-based architectures have been increasingly applied to remote sensing tasks [20], employing self-attention mechanisms to model long-range pixel dependencies. These models complement the localized receptive fields of CNNs, improving performance in complex scenes with ambiguous boundaries [21,22]. Transformer models offer context-aware learning at both regional and global scales and, in recent studies, have proven effective relative to conventional CNN-based architectures in delineating waterbodies against complex backgrounds [23,24,25].

Even with the use of advanced machine learning techniques, SAR-based waterbody detection remains challenging due to the variability in backscatter caused by physical (e.g., water depth, roughness), environmental (e.g., vegetation, topography), and meteorological (e.g., wind, rainfall) conditions [26]. While a single deep learning model may achieve adequate performance, its generalization capability requires further enhancement to cope with diverse conditions. Ensemble approaches can mitigate biases and variance associated with individual models, thereby enhancing performance in complex scenes [27]. For example, Sharma and Saharia (2025) [28] proposed DeepSARFlood, an ensemble of U-Net++, MaxViT-based U-Net, and Swin Transformer models using probabilistic soft voting on Sentinel-1 data. Hosseiny et al. (2021) [29] introduced WetNet, which combines a 2D-CNN for spatial features from Sentinel-1, a 3D-CNN for multitemporal Sentinel-2 data, and an RNN for temporal modeling via Bayesian averaging. Paul and Ganju (2021) [30] developed a semi-supervised ensemble of U-Net variants for the ETCI 2021 global flood dataset.

Despite these developments, existing methods still struggle to distinguish artificial structures with water-like backscatter (e.g., smooth asphalt roads, large parking lots, airport runways, and radar shadow areas behind buildings or terrain features) and delineate complex boundaries [31,32,33]. Such ambiguities are particularly problematic in regions with complex topography and significant hydro-geographic variability, such as the Korean Peninsula [34]. Another critical constraint is the spatial resolution of the data. While Sentinel-1 provides wide-swath coverage and frequent revisits, its 10 m resolution is often insufficient for accurately delineating small or narrow waterbodies [35]. In contrast, higher-resolution X-band SAR sensors, such as TerraSAR-X and COSMO-SkyMed, offer finer spatial detail, enabling more precise detection of features [36,37,38,39]. Recent advances in microsatellite SAR technology have produced commercial constellations operated by companies such as Capella Space [40] and ICEYE [41]. In particular, Capella satellites provide submeter resolution (up to 0.5 m), short revisit times of 2–5 h, and flexible acquisition tasking, enhancing spatiotemporal resolution and enabling effective monitoring of inland waterbodies [42,43].

This study presents an ensemble learning framework that leverages high-resolution Capella SAR imagery, along with auxiliary geospatial features such as height above nearest drainage (HAND), slope, and land cover, to enhance inland waterbody detection. Compared to previous SAR-based approaches, the proposed method focuses on enhancing segmentation stability by compensating for individual model limitations through ensemble aggregation, offering potential applications in flood mapping, drought monitoring, and water resource management.

2. Materials

2.1. The Capella SAR System

The Capella SAR system is a high-resolution X-band synthetic aperture radar constellation (Table 1). It delivers imagery with ground resolutions down to 0.5 m and supports on-demand tasking across multiple satellites, offering both temporal and spatial flexibility. As of 2024, the Capella SAR system consists of eight satellites and is planned to expand into a 30-satellite constellation [44]. The system’s bandwidth is configurable between 500 and 700 MHz, and its ground resolution, swath width, and noise-equivalent sigma zero (NESZ) vary with look angle. Users may adjust transmission bandwidth and pulse repetition frequency (PRF) to meet specific application requirements [45].

Capella SAR operates in three imaging modes: Spotlight (Spot), Sliding Spotlight (Site), and Stripmap. We used Stripmap data with HH polarization, in which the antenna beam remains fixed relative to the satellite’s motion, enabling continuous, wide-area coverage. Compared to Spotlight and Sliding Spotlight modes, Stripmap offers the greatest spatial extent while maintaining consistent image quality and resolution.

Capella’s Stripmap mode supports three standard SAR product types: Single Look Complex (SLC), Geocoded Ellipsoid Corrected (GEC), and Geocoded Terrain Corrected (GEO). SLC products retain both amplitude and phase information, delivering range-compressed, focused imagery in slant-range geometry. These datasets are georeferenced using precise orbit parameters and range-Doppler projection algorithms. GEC products include only amplitude data. After range compression, detection, focusing, and multilooking, which enhances radiometric resolution, the imagery is resampled and projected onto the mean scene-center height of the World Geodetic System 1984 (WGS84) ellipsoid. GEO products incorporate all GEC processing steps plus a terrain correction using a high-resolution DEM, yielding the highest geometric accuracy and geolocation precision [45,46]. Table 2 summarizes the coverage and resolution specifications for each product type [46].

We used five Capella SAR scenes acquired over three regions in the Republic of Korea: Ulsan, Hongseong, and Pohang (Figure 1; Table 3). All data were collected in Stripmap mode and encompass a variety of inland waterbody types, including rivers, reservoirs, canals, and small ponds. The imagery was captured by three Capella satellites (Capella-6, Capella-7, and Capella-8) between June and September 2022.

All acquisitions were performed in the X-band (9.65 GHz) with ground range resolutions of 1.43–1.79 m and azimuth resolutions of 1.37–1.43 m. Pixel spacing was uniformly 0.8 m × 0.8 m. Incidence angles varied from 32.0° to 41.8°, and look angles ranged from 29.1° to 37.9°. Except for scene PH0912, which was acquired in ascending orbit, all scenes were collected in descending orbit.

NESZ quantifies image quality as the backscatter coefficient at which the signal-to-noise ratio equals 0 dB [47]. In general, a lower NESZ indicates higher image quality, although it depends on factors such as satellite altitude, incidence angle, antenna pattern, system losses (transmission, reception, atmospheric attenuation, rainfall), signal bandwidth, and duty cycle. The NESZ values in our experiment ranged from −18.9 dB to −13.61 dB, meeting the standard quality criteria for Capella Stripmap products (−21 dB to −14 dB) [46].

Figure 2 shows a close-up comparison of overlapping regions in the PH0912, PH0916, and PH0925 SAR scenes, which were acquired at different times. Although all three scenes cover the same area, they exhibit noticeable variations in backscatter and noise levels due to differences in acquisition date, incidence angle, orbit direction, and NESZ. Notably, PH0912 has a higher peak NESZ (−16.17 dB) than PH0916 (−18.90 dB) and PH0925 (−17.36 dB), indicating lower sensitivity to weak backscattering signals and greater susceptibility to system noise in the PH0912 acquisition.

2.2. SAR Data Preprocessing

2.2.1. Standard SAR Intensity Pre-Processing

To enable the use of Capella SAR data for deep learning model training, a structured preprocessing pipeline was applied. SAR imagery poses inherent challenges when directly used as input for deep learning due to its unique characteristics. First, speckle noise from the coherent nature of radar systems degrades image quality and obscures object boundaries and fine details, negatively impacting feature extraction and classification performance [32,48]. Second, SAR images typically exhibit a broader dynamic range than optical imagery, characterized by non-linear and widely dispersed pixel value distributions. This necessitates effective normalization to ensure training stability [49,50]. In the Capella SAR scenes used in this study, pixel values ranged from 0 to approximately 20,000, with substantial variability across images. Such non-uniform scales hinder consistent inter-image analysis and complicate the standardization of model inputs.

To address these challenges and enhance input consistency, a three-stage preprocessing strategy was employed:

(1): Dynamic range compression via logarithmic transformation;
(2): Outlier mitigation through percentile-based clipping;
(3): Min–max normalization to scale values to a standardized [0, 1] range.

The first step involved applying a logarithmic transformation to compress the dynamic range. DN (Digital Number) values were transformed using the formula 10 × log₁₀(DN) to a logarithmic scale. This process enhances the visibility of low-amplitude signals and improves overall contrast while preserving structural details. As shown in Figure 3b, the histogram of the log-transformed data approximates a more typical distribution compared to the raw input.

The second step addressed extreme values by clipping the data to the 2nd–98th percentile range. Pixel values below the 2nd percentile and above the 98th percentile were capped at their respective thresholds, reducing the influence of noise and outliers [49,51]. This step improved inter-image consistency and contributed to more stable normalization.

In the final step, min–max normalization was applied, linearly mapping the clipped pixel values to the [0, 1] range. The minimum and maximum values served as the scaling parameters, making the data suitable for deep learning input.
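The three stages above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `preprocess_sar`, the small floor `eps` that guards against log10(0) for zero-valued DN, and the choice to compute percentiles per scene are all assumptions made here.

```python
import numpy as np

def preprocess_sar(dn, low_pct=2.0, high_pct=98.0, eps=1e-6):
    """Log-compress, percentile-clip, and min-max scale one SAR band."""
    dn = np.asarray(dn, dtype=np.float64)
    # Stage 1: 10 * log10(DN). eps is an assumed floor so DN = 0 stays finite.
    db = 10.0 * np.log10(np.maximum(dn, eps))
    # Stage 2: cap values at the 2nd and 98th percentiles of this scene.
    lo, hi = np.percentile(db, [low_pct, high_pct])
    clipped = np.clip(db, lo, hi)
    # Stage 3: min-max scaling; after clipping, min = lo and max = hi.
    span = hi - lo
    if span == 0:
        return np.zeros_like(clipped)  # constant image: nothing to scale
    return (clipped - lo) / span
```

Because clipping happens before scaling, the extreme 2% on each side of the distribution saturates at exactly 0 or 1, so a handful of very bright scatterers cannot compress the rest of the range.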

Ground truth masks for surface waterbodies were manually generated from Capella images with the aid of multisource auxiliary data, including land cover maps, 25 cm resolution aerial imagery, and Sentinel-2 satellite images (Figure 3). The specific Sentinel-2 acquisition dates—3 July, 22 August, 9 September, 21 September, and 1 October 2022—correspond to the Capella scenes listed in Table 3 and were acquired within a few days of the corresponding Capella acquisitions (mean |Δt| = 5.8 days).

Although the land cover map provided contextual information, a pixel-wise comparison revealed only moderate agreement between its water class and manual annotations (IoU: 0.69). This justified manual refinement to account for temporal water extent variations and SAR-specific geometric effects.

Figure 4 presents the impact of each preprocessing stage using histograms and statistical analysis. The original SAR data showed a broad, exponentially skewed distribution reaching values as high as 20,000. After logarithmic transformation, the distribution became approximately normal; percentile clipping eliminated extreme outliers, and min–max normalization standardized all values to the [0, 1] range. This consistent preprocessing pipeline ensured uniform input scaling across SAR scenes acquired under varying conditions, a crucial step for stable model training and improved generalization performance.

2.2.2. Incidence Angle Corrected Pre-Processing

SAR backscatter is strongly dependent on radar incidence angle, producing range-dependent radiometric gradients [33]. In this study, the five Capella scenes have scene-center incidence angles ranging from 32.0° to 41.8° (Table 3). To mitigate inter-scene radiometric inconsistencies arising from these differences, an incidence-angle-corrected gamma-nought (γ⁰) band was generated and combined with the intensity band in the model inputs.

Gamma nought (γ⁰) is a radiometric quantity that compensates the incidence-angle dependence of the backscatter coefficient (σ⁰) and is defined as follows [52]:

γ^{0} = \frac{σ^{0}}{\cos(θ)}    (1)

where θ denotes the scene-center incidence angle for each Capella GEO scene. Digital numbers (DN) were first radiometrically calibrated to linear σ⁰ using the vendor-provided scaling factor contained in the auxiliary metadata; γ⁰ was then computed via Equation (1). The resulting γ⁰ band underwent the same downstream preprocessing described in Section 2.2.1 (logarithmic transform using 10 log₁₀, 2nd–98th percentile clipping, and min–max normalization to [0, 1]) and was stacked with the intensity band to form a two-channel input. Comparative results between the single-band configuration (intensity only) and the dual-band configuration (intensity + γ⁰) for waterbody detection are reported in Section 4.1.
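As a small illustration of Equation (1), the sketch below converts an already calibrated linear σ⁰ array to γ⁰ using a single scene-center incidence angle. The function name `gamma_nought` is hypothetical, and the DN-to-σ⁰ calibration step is deliberately left out because its exact form depends on the vendor metadata.

```python
import numpy as np

def gamma_nought(sigma0, incidence_deg):
    """Apply Equation (1): gamma0 = sigma0 / cos(theta), theta in degrees."""
    sigma0 = np.asarray(sigma0, dtype=np.float64)  # linear power, not dB
    theta = np.deg2rad(incidence_deg)
    return sigma0 / np.cos(theta)
```

Since cos(θ) is below 1 for the 32.0° to 41.8° angles used here, γ⁰ is always larger than σ⁰, and the scaling is stronger for scenes acquired at steeper incidence angles.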

2.3. Auxiliary Data Preprocessing

To improve the performance of SAR-based surface water detection, auxiliary input channels incorporating terrain and land cover information were employed. Unlike rule-based masking techniques, these auxiliary datasets were integrated to enable the deep learning model to learn complex relationships between SAR backscatter and topographic or environmental characteristics. This approach is intended to reduce misclassification in diverse terrain types and urban settings, thereby enhancing the model’s flexibility and generalizability. Geospatial features derived from DEM and land cover maps were combined with SAR backscatter data to provide contextual information to the model. This integrated strategy helps mitigate the limitations of relying solely on SAR data and establishes a robust foundation for consistent waterbody detection, even in complex urban–rural transition zones [53,54].

2.3.1. Topography Data

A 5 m resolution DEM, provided by the National Geographic Information Institute (NGII), was used as an auxiliary input to represent terrain characteristics. The DEM, which encodes surface elevation in raster format, was processed to derive two topographic indices relevant to surface water detection: HAND and slope.

When using SAR imagery alone, non-water surfaces such as roads, parking lots, and flat concrete structures often exhibit low backscatter values (typically below −20 dB) which can lead to misclassification as waterbodies [55,56]. To address this issue, geospatial information was incorporated to enable the model to learn more contextual relationships between SAR backscatter and topographic context. For example, areas with low backscatter but steep slopes are unlikely to represent water, whereas regions with relatively higher backscatter but low slope and low HAND values may still correspond to turbid or sediment-laden water surfaces. Thus, integrating DEM-derived features enhances delineation accuracy and reduces false positives in complex environments.

In the DEM preprocessing step, both HAND and slope were computed from the raw elevation data. The HAND index quantifies the relative elevation difference between each grid cell and the nearest hydrologically connected drainage channel and is calculated as follows [57]:

HAND(x, y) = H(x, y) - H_{d}(x, y)    (2)

where H(x, y) is the elevation at a given point, and H_{d}(x, y) is the elevation at the nearest drainage cell along the hydrological flow path. Drainage connectivity was determined using flow direction data derived from the DEM. We used the Terrain Analysis Using Digital Elevation Models (TauDEM) algorithm to extract flow direction and delineate the drainage network, which were then used to compute HAND values. Cells with low HAND values, indicating minimal elevation difference from nearby drainage channels, are more likely to correspond to actual waterbodies, while those with higher HAND values are less likely to be misclassified as such [58].

Slope, representing the steepness of terrain, was calculated using the following formula:

Slope = \arctan\left(\sqrt{\left(\frac{dz}{dx}\right)^{2} + \left(\frac{dz}{dy}\right)^{2}}\right) \times \frac{180}{π}    (3)

where dz/dx and dz/dy denote the elevation gradients in the x and y directions, respectively. Since surface water typically accumulates in flat or gently sloped areas, regions with low slope values have a higher probability of being waterbodies. Therefore, both HAND and slope were jointly used to support the waterbody segmentation.
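Equation (3) can be evaluated on a gridded DEM with finite differences. The sketch below is one possible implementation and rests on assumptions: the function name `slope_degrees` is hypothetical, and the gradients come from `np.gradient` (central differences), whereas GIS packages often use a 3 × 3 kernel; the text does not say which estimator was used. The default cell size matches the 5 m DEM.

```python
import numpy as np

def slope_degrees(dem, cell_size=5.0):
    """Slope in degrees from a DEM, following Equation (3)."""
    dem = np.asarray(dem, dtype=np.float64)
    # np.gradient returns the derivative along rows (y) first, then columns (x).
    dz_dy, dz_dx = np.gradient(dem, cell_size)
    # hypot gives sqrt(dz_dx**2 + dz_dy**2); arctan converts rise/run to an angle.
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
```

A surface that rises 5 m per 5 m cell in one direction yields 45°, while a flat surface yields 0°.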

2.3.2. Land Cover Maps

We used a detailed land cover map provided by the Korea Ministry of Climate, Energy and Environment (MCEE) (https://aid.mcee.go.kr/). The MCEE land cover map, produced at a 1:5000 scale using high-resolution imagery (1 m or finer), originally comprised 41 detailed classes. We reclassified them into seven broader categories: (1) urban/built-up area, (2) agricultural area, (3) forest, (4) grassland, (5) wetland, (6) water, and (7) barren land. Of these seven classes, only four—urban/built-up area, agricultural area, forest, and grassland—were selected as model inputs. Three classes were excluded for the following reasons: water directly overlaps with the target waterbody class; wetland exhibits similar scattering characteristics to water due to inundated vegetation; and barren land, along with paved surfaces, dry soil, and radar shadow, displays low backscatter signals resembling water, potentially causing overestimation in water detection.

Each selected class was encoded as a separate binary channel (0 or 1), forming a one-hot encoded multi-channel input. This format prevents the model from misinterpreting land cover classes as ordinal or metrically related variables, a common issue when using single-channel integer labels [57,59], and enables independent learning of spatial and contextual patterns for each class. Figure 5 illustrates the SAR image of the study area (top left), the corresponding high-resolution optical reference image (bottom left), and the land cover classification layers. The right panel (channels 1–7) displays each land cover class as one-hot encoded channels: (1) urban/built-up area (dark red), (2) agricultural area (yellow), (3) forest (dark green), (4) grassland (light green), (5) wetland (purple), (6) water (blue), and (7) barren land (cyan).
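The one-hot encoding described above might look like the following sketch. The integer codes in `SELECTED` and the function name are hypothetical; the actual class codes of the reclassified map are not given in the text.

```python
import numpy as np

# Hypothetical integer codes for the four classes kept as model inputs.
SELECTED = {"urban": 1, "agriculture": 2, "forest": 3, "grassland": 4}

def one_hot_land_cover(class_map, codes=tuple(SELECTED.values())):
    """Return a (len(codes), H, W) array with one binary channel per class."""
    class_map = np.asarray(class_map)
    # Pixels of excluded classes (wetland, water, barren) are 0 in every channel.
    return np.stack([(class_map == c).astype(np.float32) for c in codes], axis=0)
```

For example, with a water code of 6, `one_hot_land_cover([[1, 2], [6, 4]])` produces four 2 × 2 channels in which the water pixel is zero everywhere, so the model receives no direct hint about the target class from this layer.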

Incorporating land cover information alongside SAR backscatter data improves waterbody detection performance across diverse environmental conditions, especially in cases where backscatter signals alone are ambiguous [53]. Rather than applying uniform masking of artificial surfaces (e.g., urban areas), the selected land cover types were provided as separate input layers. This approach enables the model to learn the interactions between SAR backscattering behavior and land cover context in a more flexible and data-driven manner.

When utilizing land cover data, a trade-off between spatial resolution and temporal synchronization is a critical consideration (Table 4). The MCEE land cover map employed in this study provides high geometric precision at approximately 1 m resolution, which is advantageous for identifying fine-scale urban structures and reducing false positives in water detection. In contrast, near-real-time products such as Dynamic World [60] offer 10 m resolution but provide updates within 2–5 days of acquisition, enabling the capture of water distribution synchronized with SAR imagery (Figure 6).

Beyond spatial and temporal characteristics, classification accuracy and production methodology also significantly influence mapping quality. The MCEE land cover map is produced through manual interpretation and field verification, ensuring high label reliability for the study region. Conversely, the Dynamic World land cover map is a global automated product that provides probability-based outputs optimized for global consistency rather than localized precision. Given that this study utilizes 1 m resolution SAR imagery, the national land cover map was selected to maximize spatial compatibility. Nonetheless, the Dynamic World land cover map remains a viable alternative for applications requiring near-real-time monitoring.

2.4. Waterbody Labeling

The ground truth masks were not derived from pre-existing datasets such as Global Surface Water (GSW); instead, they were manually generated based on Capella SAR imagery with reference to multiple auxiliary datasets, including land cover maps, 25 cm high-resolution aerial photographs, and Sentinel-2 imagery. To resolve the inherent ambiguities of waterbody boundaries in X-band SAR imagery, a three-step labeling protocol with rigorous exclusion criteria was established:

Step 1: Identification of Water Candidates: Regions exhibiting low backscatter values due to specular reflection were identified as potential water body candidates.
Step 2: Exclusion of Ambiguous Regions: To eliminate potential noise, radar shadows were removed by cross-referencing DEM-based slope information. Furthermore, considering the limited vegetation penetration depth of the X-band, the detection targets were strictly confined to open water.
Step 3: Cross-validation and Final Labeling: Final validation was performed using the Normalized Difference Water Index (NDWI) and high-resolution aerial imagery. A conservative labeling approach was adopted, where only pixels confirmed as water in both SAR and optical imagery were selected to minimize false positives.

3. Methods

3.1. Overview

The overall research methodology is shown in Figure 7. We preprocessed five Capella SAR scenes and split each scene into 512 × 512-pixel patches: 159 from HS0904, 84 from PH0912, 266 from PH0916, 63 from US0629, and 113 from PH0925, covering rivers, tributaries, reservoirs, and small ponds. HS0904 was held out for blind testing, while the other four scenes were used for 5-fold cross-validation for training and validation.
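The patch extraction step can be sketched as follows. This is an assumed scheme, since the text does not state whether patches overlap, how scene edges are handled, or whether patches were filtered by content: here tiles do not overlap and partial tiles at the right and bottom edges are dropped. The function name `tile_scene` is hypothetical.

```python
import numpy as np

def tile_scene(scene, patch=512):
    """Split an (H, W) or (C, H, W) array into non-overlapping square tiles."""
    scene = np.asarray(scene)
    h, w = scene.shape[-2:]
    tiles = []
    # Step by one patch size; incomplete tiles at the edges are skipped.
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tiles.append(scene[..., top:top + patch, left:left + patch])
    return tiles
```

Applying the same function to the image stack and to the label mask keeps each input tile aligned with its ground truth tile.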

To improve generalization, we applied image data augmentation (geometric transformations, random noise injection, mosaic generation, and Gaussian noise) to the training set of each fold, resulting in 2933, 2940, 2772, 3017, and 3066 samples per fold, respectively. We trained the final models on eight-channel inputs. During cross-validation, we combined each model’s probability maps using a class-wise weighted average to produce the final waterbody segmentation.
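A weighted-average ensemble with a simple grid search over the weights could be sketched as below. Several details are assumptions rather than the paper's procedure: the sketch fuses only the water-class probability map instead of a full class-wise average, uses a 0.5 decision threshold, scores candidates by IoU on a validation mask, and searches weights on a regular grid that sums to 1. All function names are hypothetical.

```python
import itertools
import numpy as np

def weighted_ensemble(prob_maps, weights, threshold=0.5):
    """Fuse per-model water probabilities (each H x W) by a weighted average."""
    probs = np.stack([np.asarray(p, dtype=np.float64) for p in prob_maps], axis=0)
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                          # normalise so the weights sum to 1
    fused = np.tensordot(w, probs, axes=1)   # contracts the model axis -> (H, W)
    return fused, fused >= threshold

def iou(pred, truth):
    """Intersection over Union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0                           # both masks empty: treat as perfect
    return float(np.logical_and(pred, truth).sum() / union)

def grid_search_weights(prob_maps, truth, steps=10):
    """Try every weight tuple on a 1/steps grid and keep the best IoU."""
    best_w, best_score = None, -1.0
    for combo in itertools.product(range(steps + 1), repeat=len(prob_maps)):
        if sum(combo) != steps:              # keep only tuples that sum to 1
            continue
        w = tuple(c / steps for c in combo)
        _, mask = weighted_ensemble(prob_maps, w)
        score = iou(mask, truth)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

With four models and `steps=10` the search evaluates 286 weight tuples, which is cheap compared with training, so an exhaustive grid is a reasonable choice at this scale.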

3.2. Input Channel Configuration

We evaluated the impact of varying input channel types and quantities on the performance of a SAR-based waterbody detection model. Input configurations ranged from a single SAR band to an 8-channel setup incorporating both topographic and land cover information. Table 5 summarizes the tested configurations and their respective components.

Experiment 1 served as the baseline configuration, using only the HH-polarized SAR image to test the feasibility of detecting waterbodies from backscatter data alone.

Experiment 2 added the Gamma (γ⁰) channel to the HH-polarized data, providing normalized backscatter coefficients to assess the effect of enhanced SAR information (2 channels: HH + Gamma).

Experiment 3 added HAND and slope data to the SAR image to assess the contribution of topographic features.

Experiment 7 included SAR, HAND, slope, and a subset of land cover classes. Specifically, water, wetland, and barren land were excluded from the original seven-category land cover dataset, resulting in a 7-channel input. This configuration reduces confusion from spectrally similar classes.

Experiment 8 represented the most comprehensive input setup, combining both HH-polarized and Gamma (γ⁰) channels from SAR data, HAND, slope, and the same four one-hot encoded land cover classes, totaling 8 channels. This configuration was designed to maximize detection accuracy by allowing the model to learn complex spatial relationships between land cover characteristics and the presence of surface water.
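Assembling the Experiment 8 input amounts to stacking the layers along a channel axis. The sketch below assumes all layers have already been resampled to the same grid and scaled; the channel order and the function name `build_input` are illustrative choices, not taken from the paper.

```python
import numpy as np

def build_input(hh, gamma0, hand, slope, land_cover):
    """Stack 2 SAR + 2 terrain + 4 land cover layers into an (8, H, W) array."""
    single = np.stack([hh, gamma0, hand, slope], axis=0)   # (4, H, W)
    land_cover = np.asarray(land_cover)
    # The four one-hot land cover channels must share the same H x W grid.
    if land_cover.shape != (4,) + single.shape[1:]:
        raise ValueError("land_cover must have shape (4, H, W)")
    return np.concatenate([single, land_cover], axis=0).astype(np.float32)
```

Dropping `gamma0` from the first stack would give the 7-channel layout of Experiment 7, which makes the two configurations easy to compare with the same pipeline.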

3.3. Segmentation Models and Ensemble Framework

We employed four segmentation models for SAR-based waterbody detection: PIDNet, Swin Transformer, Mask2Former, and K-Net. Each model possesses distinct architectural characteristics and offers complementary strengths that enhance overall detection performance. A brief overview of the core architecture and functionalities of each model is provided below.

3.3.1. PIDNet

PIDNet is a CNN-based segmentation model inspired by proportional–integral–derivative (PID) control theory, designed to achieve both real-time performance and high accuracy [61]. The architecture comprises three parallel branches: (1) the P-branch preserves high-resolution spatial features; (2) the I-branch captures global contextual information through progressive downsampling; and (3) the D-branch focuses on boundary refinement. These features are integrated through a Dual Aggregation Module (DAM), which fuses spatial detail and semantic context to enable precise segmentation.

3.3.2. Mask2Former

Mask2Former is a Transformer-based segmentation model that unifies semantic, instance, and panoptic segmentation tasks within a single framework [62]. It employs a masked attention mechanism, in which attention is restricted to predicted mask regions rather than applied across the entire image, thereby improving computational efficiency. The model processes multi-scale features through Transformer decoder layers at multiple resolutions and incorporates several architectural optimizations, including reordered self- and cross-attention, learnable query embeddings, and the removal of dropout layers, to enhance performance, particularly for small or complex objects.

3.3.3. Swin Transformer

Swin Transformer is a hierarchical vision Transformer that serves as a backbone for multi-resolution feature extraction [63]. It partitions the input image into non-overlapping windows and performs self-attention within each window to reduce computational complexity. To maintain contextual continuity across windows, a shifted window mechanism is applied between layers. This design allows the model to effectively combine the local inductive biases of CNNs with the long-range dependency modeling capabilities of Transformers.
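The window partitioning and shifting steps can be sketched in a few lines of NumPy. This is only an illustration of the indexing: the function names are hypothetical, the self-attention computation itself is omitted, and so is the attention mask that the actual model uses to handle regions wrapped around by the cyclic shift.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping (window, window, C) tiles."""
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, c)

def shifted_window_partition(x, window):
    """Cyclically shift the map by half a window, then partition it."""
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)

feat = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)
regular = window_partition(feat, 4)          # 4 windows of 4 x 4
shifted = shifted_window_partition(feat, 4)  # windows straddle the old borders
```

The first shifted window covers rows and columns 2 to 5 of the original map, so it mixes pixels from all four regular windows. This is how information crosses window boundaries between successive layers.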

3.3.4. K-Net

K-Net is a hybrid segmentation model that integrates principles from both CNNs and Transformers [64]. It performs semantic, instance, and panoptic segmentation using a set of learnable kernels, each responsible for generating a mask corresponding to a specific class or object instance. These kernels are initialized randomly and iteratively refined through a kernel update mechanism that adapts them to object-specific regions. The model employs a bipartite matching strategy to accommodate varying numbers of objects and operates entirely through mask-based learning, without relying on bounding box supervision or non-maximum suppression (NMS).

3.3.5. Model Optimization

All models were trained using 512 × 512 pixel inputs with an 8-channel configuration for the sake of convenience (Table 6). PIDNet was implemented with its original backbone, while Swin Transformer, Mask2Former, and K-Net were built on the Swin-Large backbone [58] (embedding dimension: 192; depth: [2, 2, 18, 2]). The Swin-Large configuration was selected based on its hierarchical architecture, which generates multi-scale feature maps through progressive patch merging. This architecture has been demonstrated as effective for dense prediction tasks, including semantic segmentation [63]. To accommodate graphics processing unit (GPU) memory constraints, the batch size was set to 4 for all models.

All models were optimized using the AdamW (Adaptive Moment Estimation with Weight Decay) optimizer, with initial learning rates set as follows: PIDNet, 1.0 × 10⁻⁴; Mask2Former, 2.0 × 10⁻⁴; Swin Transformer, 2.0 × 10⁻⁴; and K-Net, 2.0 × 10⁻⁴. Loss functions were selected based on model architecture: PIDNet employed a combination of Cross-Entropy, Online Hard Example Mining (OHEM), and boundary loss; Mask2Former used Cross-Entropy and Dice loss; and both Swin Transformer and K-Net used Cross-Entropy loss. Training was conducted for 30,000 iterations for PIDNet and 20,000 iterations for the other models.

3.3.6. Weighted Average Ensemble

To integrate predictions from multiple architectures and mitigate individual model biases, a weighted probability averaging ensemble strategy was employed, as illustrated in Figure 8. Four semantic segmentation models (PIDNet, Mask2Former, Swin Transformer, and K-Net) were trained independently, each generating a pixel-wise probability map P_{i}(x, y) ∈ [0, 1] for water detection. The ensemble probability is computed as

P_{ensemble}(x, y) = \sum_{i=1}^{4} w_{i} \cdot P_{i}(x, y)

(4)

where w = (w_{1}, w_{2}, w_{3}, w_{4}) satisfies \sum_{i=1}^{4} w_{i} = 1 and w_{i} \geq 0. Optimal weights were determined through a systematic grid search on the validation set. Weight combinations were evaluated at 0.05 intervals across the valid range (0.0 to 1.0), with each combination assessed by computing water class IoU. The combination yielding the highest IoU was selected as w*. Final binary predictions were obtained by thresholding the ensemble probability at 0.5. In this way, each model's contribution to the ensemble is determined by validation performance rather than fixed in advance.
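The weighted averaging of Equation (4) and the grid search over weights can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation. The function names, the use of `>=` at the 0.5 threshold, the tie-breaking rule (first best combination wins), and the noise levels used to fake four model outputs are all assumptions.

```python
import itertools
import numpy as np

def water_iou(pred, truth):
    """IoU of the water class for two boolean masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union else 1.0

def simplex_grid(n_models, step=0.05):
    """Yield non-negative weight vectors on a `step` grid that sum to 1."""
    units = round(1.0 / step)  # integer units avoid floating-point drift
    for combo in itertools.product(range(units + 1), repeat=n_models - 1):
        rest = units - sum(combo)
        if rest >= 0:
            yield np.array(combo + (rest,), dtype=np.float64) / units

def grid_search_weights(probs, truth, step=0.05, threshold=0.5):
    """probs: (n_models, H, W) water probabilities; truth: (H, W) boolean mask."""
    best_w, best_iou = None, -1.0
    for w in simplex_grid(probs.shape[0], step):
        ensemble = np.tensordot(w, probs, axes=1)  # sum_i w_i * P_i(x, y)
        iou = water_iou(ensemble >= threshold, truth)
        if iou > best_iou:
            best_w, best_iou = w, iou
    return best_w, best_iou

# Synthetic stand-ins for four model outputs on one validation patch.
rng = np.random.default_rng(0)
truth = rng.random((32, 32)) > 0.7
noise_levels = (0.45, 0.20, 0.30, 0.25)  # placeholder error levels per model
probs = np.stack([np.clip(truth + rng.normal(0, s, truth.shape), 0, 1)
                  for s in noise_levels])
weights, iou = grid_search_weights(probs, truth)
```

With four models and a 0.05 step, the grid contains 1771 admissible weight vectors, so exhaustive evaluation is inexpensive. Because the grid includes the four single-model corners, the selected combination can never score below the best individual model on the data used for the search.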

3.4. Model Performance Evaluation

To quantitatively evaluate the performance of the SAR-based waterbody detection models, we employed several standard evaluation metrics commonly used in binary classification tasks. The classification objective was to distinguish between waterbodies and non-waterbodies. All evaluation metrics were derived from the following four fundamental quantities:

TP: Number of waterbody pixels correctly classified as waterbodies;
TN: Number of non-waterbody pixels correctly classified as non-waterbodies;
FP: Number of non-waterbody pixels incorrectly classified as waterbodies;
FN: Number of waterbody pixels incorrectly classified as non-waterbodies.

Intersection over Union (IoU) is the representative measure for performance evaluation in computer vision. It quantifies the degree of overlap between the predicted and actual waterbody regions, that is, the ratio of the intersection to the union of the predicted and ground truth areas, as defined in Equation (5):

\mathrm{IoU} = \frac{TP}{TP + FP + FN}

(5)

Accuracy represents the overall proportion of correctly classified pixels and is calculated using Equation (6):

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}

(6)

Precision indicates the proportion of true waterbody pixels among all pixels predicted as waterbodies. It reflects the model’s ability to avoid false alarms and is given by Equation (7):

\mathrm{Precision} = \frac{TP}{TP + FP}

(7)

Recall measures the proportion of correctly identified waterbody pixels among all actual waterbody pixels. It reflects the model’s capability to avoid missed detection and is computed as shown in Equation (8):

\mathrm{Recall} = \frac{TP}{TP + FN}

(8)

The F1-score is the harmonic mean of Precision and Recall, providing a balanced measure to evaluate the tendency of over- or under-estimation. It is defined as

\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}

(9)

Among these, water IoU and F1-score were used as the primary indicators of model performance in this study, given their robustness in evaluating both the spatial accuracy and class balance of segmentation results [17,18].
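The metric definitions above can be computed directly from a pair of binary masks, as in the following sketch. The function names are hypothetical, and the code assumes at least one predicted and one actual water pixel so that no denominator is zero.

```python
import numpy as np

def confusion_counts(pred, truth):
    """Return (TP, TN, FP, FN) for boolean water masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, tn, fp, fn

def segmentation_metrics(tp, tn, fp, fn):
    """Water-class metrics from the four confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "iou": tp / (tp + fp + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Toy 2 x 4 example: 3 true water pixels, 3 predicted, 2 in common.
truth = np.array([[1, 1, 0, 0], [1, 0, 0, 0]])
pred = np.array([[1, 0, 0, 0], [1, 1, 0, 0]])
metrics = segmentation_metrics(*confusion_counts(pred, truth))
print(metrics)
```

In this toy case the accuracy is 0.75 while the IoU is only 0.5, because the four correctly classified background pixels raise accuracy but do not enter the IoU. The same effect makes accuracy less informative when water pixels are a small minority.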

4. Results

4.1. Input Channel Configurations

To assess the contribution of different input channels and determine the best combination, five input configurations were examined (Table 7). This involved training and validation using a random split of the dataset (80% training, 20% validation) with the Mask2Former model. Experiment 1 (HH-polarized SAR only) served as the baseline, while subsequent experiments progressively incorporated incidence angle correction (Experiment 2), topographic features (Experiment 3), land cover types (Experiment 7), and their combination (Experiment 8). Quantitative metrics and qualitative comparisons (Figure 9) demonstrate that multi-source data integration improves detection accuracy, with Water IoU increasing from 0.915 (Experiment 1) to 0.955 (Experiment 8).

The baseline configuration (Experiment 1) achieved a Water IoU of 0.915 and an F1-score of 0.956. False positives were primarily observed in paved roads, shadow areas, and low vegetation, while false negatives occurred at narrow stream boundaries (Figure 9). Experiment 2, which added incidence-angle-corrected Gamma (γ⁰), showed marginal improvement (IoU: 0.917, F1: 0.957). While the standalone quantitative gain was moderate, radiometric normalization establishes a consistent physical baseline, ensuring that backscatter variations are driven by surface properties rather than acquisition geometry.

Integration of topographic features (Slope and HAND) in Experiment 3 yielded substantial performance gains, achieving a Water IoU of 0.934 and an F1-score of 0.966. HAND effectively distinguished waterbodies from non-water surfaces at similar elevations, while Slope reduced false positives on flat artificial surfaces. Experiment 7, incorporating selective land cover classes, further improved Water IoU to 0.952, primarily by reducing false positives in spectrally ambiguous urban-rural transition zones.

Experiment 8, combining all input channels, achieved the highest overall performance (Water IoU: 0.955, F1-score: 0.977). Although the improvement over Experiment 7 was marginal (ΔIoU: +0.003), Experiment 8 was selected as the final configuration to ensure operational robustness under varying acquisition geometries—a critical factor for consistent waterbody monitoring across diverse SAR scenes—while maintaining acceptable computational cost. This decision prioritizes detection reliability over computational efficiency given the importance of accurate boundary delineation in flood mapping applications.

Compared to the baseline, the optimal configuration (Experiment 8) improved Water IoU by approximately 4.4% (from 0.915 to 0.955) and the F1-score by 2.2% (from 0.956 to 0.977). Ablation analysis reveals complementary roles: topographic features primarily mitigate false negatives at waterbody boundaries, land cover information suppresses false positives in urban areas, and radiometric normalization provides a stable physical basis for feature integration. Qualitative analysis (Figure 9) visually confirms that this multi-source integration significantly enhances detection robustness beyond the capabilities of SAR intensity alone. Moreover, the 8-channel probability map yields significantly more distinguishable and well-defined patterns, particularly at water–land boundaries. In contrast, the 1-channel model exhibits indistinct and ambiguous transitions in these regions (Figure 10).

4.2. Performance Comparison Among Deep Learning Models

After establishing the optimal input configuration, this section compares and analyzes the waterbody segmentation performance of four deep learning models: PIDNet, Mask2Former, Swin Transformer, and K-Net. These models were selected to evaluate diverse architectural strategies, combining efficient CNN-based designs (PIDNet) with Transformer-based approaches (Mask2Former, Swin Transformer, K-Net) capable of capturing long-range dependencies.

To establish a rigorous comparative context with existing Capella SAR research, U-Net++ was additionally evaluated as a baseline reference. This architecture was selected following its successful application by Popien et al. [43] for flood extent mapping using Capella X-band Stripmap imagery, thereby enabling a direct benchmark against prior work on the identical sensor platform. All models were trained using the optimal input channels identified in Experiment 8. The dataset distribution for 5-fold cross-validation is detailed in Table 8.

Table 9 summarizes the quantitative cross-validation performance (mean ± standard deviation), while Figure 11 visually illustrates the distribution of Water IoU and F1-scores across the five folds. The baseline model, U-Net++, achieved a Water IoU of 0.915 (±0.016) and an F1-score of 0.955 (±0.009). The proposed Transformer-based models generally outperformed this baseline, with K-Net achieving the highest performance at a Water IoU of 0.932 (±0.014) and an F1-score of 0.964 (±0.010). Mask2Former followed with a Water IoU of 0.928 (±0.018) and demonstrated particular strength in minimizing false negatives, recording a Recall of 0.958 (±0.019).

Performance differences were observed according to architecture type. Transformer-based models (K-Net, Mask2Former, and Swin Transformer) recorded IoU values in the range of 0.920–0.932, whereas CNN-based models (U-Net++ and PIDNet) achieved 0.915–0.916. While Transformer models showed slightly higher performance (+1.6% IoU relative to the CNN baseline), the overlapping inter-quartile ranges shown in Figure 11 and the standard deviations in Table 9 suggest that this difference reflects the combined influence of architectural design, model capacity, and training configuration rather than the superiority of self-attention mechanisms alone.

Despite all models achieving Accuracy above 0.991, clear distinctions emerged in Water IoU and F1-score. This demonstrates that IoU and F1-score serve as more discriminative performance measures than Accuracy for waterbody segmentation tasks with class imbalance, consistent with observations in the literature. Given K-Net’s consistently high performance and the complementary strengths of each architecture, the following section presents the final waterbody segmentation performance obtained through a weighted average ensemble of these models.

4.3. Blind Test Using Ensemble Model

To assess practical applicability and generalization capability, a blind test was conducted on the HS0904 scene (strictly excluded from training and validation). For comprehensive benchmarking, two reference baselines were evaluated: (1) traditional Otsu thresholding with topographic refinement [43] and (2) U-Net++ [65], a deep learning baseline applied to Capella SAR flood mapping.

4.3.1. Baseline Comparison and Ensemble Performance

Table 10 summarizes the quantitative performance on the blind test set. Even after refinement with the same HAND and Slope masks, the Otsu-based method achieved a Water IoU of only 0.6313, highlighting the limitations of intensity-based thresholding in complex terrain environments. U-Net++ demonstrated substantially improved performance (Water IoU: 0.8551), validating the effectiveness of deep learning approaches.

Among the four candidate models trained on Fold 3, Swin Transformer achieved the best single-model performance (Water IoU: 0.9371), followed by K-Net (0.9247), Mask2Former (0.9155), and PIDNet (0.8921). Notably, while K-Net ranked first during cross-validation (Water IoU: 0.932), Swin Transformer demonstrated superior generalization on the unseen test images.

The ensemble model achieved the highest overall performance (IoU: 0.9422), representing a moderate improvement over Swin Transformer (ΔIoU: +0.0051, 0.54%) but a substantial gain over U-Net++ (ΔIoU: +0.0871, 10.19%). The ensemble also achieved optimal Precision–Recall balance (Precision: 0.9833, Recall: 0.9575, F1-score: 0.9703), effectively compensating for individual model biases.
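The reported figures can be cross-checked with a short calculation. Since Precision = TP/(TP + FP) and Recall = TP/(TP + FN), it follows that 1/IoU = 1/Precision + 1/Recall − 1, so both the F1-score and the IoU are fixed by Precision and Recall. The sketch below uses only the values quoted above; the helper name `gains` is hypothetical.

```python
# Values quoted in the text for the ensemble on the blind test set.
precision, recall = 0.9833, 0.9575

# F1 is the harmonic mean of Precision and Recall.
f1 = 2 * precision * recall / (precision + recall)

# IoU implied by Precision and Recall: 1/IoU = 1/P + 1/R - 1.
iou = precision * recall / (precision + recall - precision * recall)

def gains(new, old):
    """Return (absolute difference, relative difference in percent)."""
    return new - old, 100 * (new - old) / old

abs_swin, rel_swin = gains(0.9422, 0.9371)  # ensemble vs. Swin Transformer
abs_unet, rel_unet = gains(0.9422, 0.8551)  # ensemble vs. U-Net++

print(f"F1 = {f1:.4f}, IoU = {iou:.4f}")
print(f"vs Swin:    +{abs_swin:.4f} ({rel_swin:.2f}%)")
print(f"vs U-Net++: +{abs_unet:.4f} ({rel_unet:.2f}%)")
```

The recomputed F1-score and IoU agree with the reported 0.9703 and 0.9422 to within the rounding of the four-digit inputs. The percentages in the text are relative gains, that is, the IoU difference divided by the IoU of the model being compared against.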

Figure 12 presents visual validation on the blind test data. The Otsu baseline produced extensive false positives (green) in smooth surfaces and barren land, demonstrating the limitations of threshold-based approaches. All deep learning models showed substantially improved performance with visually similar results. The ensemble achieved marginally more consistent performance across diverse waterbody geometries, with quantitative differences detailed in Table 10.

4.3.2. Evaluation of Ensemble Strategies

To validate the weighted averaging approach, Table 11 compares four ensemble configurations: (1) Best Single Model (Swin Transformer), (2) Equal Weights [0.25, 0.25, 0.25, 0.25], (3) Automated Weights derived via Softmax normalization of validation IoU scores, and (4) Optimized Weights obtained through systematic grid search [PIDNet = 0.05, Swin = 0.40, Mask2Former = 0.20, K-Net = 0.35]. The Optimized Weights configuration achieved the highest performance (IoU: 0.9422), marginally outperforming both Equal Weights (0.9408) and Automated Weights (0.9408). The difference arises because the grid search explores weight combinations within the given range to maximize the final performance of the ensemble, whereas Softmax normalization fixes each weight as a function of that model's individual IoU score alone.
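A small sketch of Softmax-derived weights helps explain why the Automated Weights and Equal Weights configurations can score identically. The inputs below are the blind-test IoU values quoted in Section 4.3.1, used here only as stand-ins because the validation IoU scores are not listed in this passage. The unit temperature and the function name `softmax_weights` are also assumptions.

```python
import numpy as np

def softmax_weights(scores, temperature=1.0):
    """Turn per-model scores into weights that are positive and sum to 1."""
    z = np.asarray(scores, dtype=np.float64) / temperature
    z = z - z.max()  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Stand-in scores, in the order PIDNet, Swin Transformer, Mask2Former, K-Net.
scores = [0.8921, 0.9371, 0.9155, 0.9247]
w = softmax_weights(scores)
print(np.round(w, 4))
```

Because the scores differ by only a few hundredths, the exponentials are nearly equal and every weight lands within about 0.01 of 0.25. Under these assumptions the Softmax weighting is almost indistinguishable from equal weighting, whereas the grid search is free to move far from 0.25 (for example, 0.05 for PIDNet).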

5. Discussions

5.1. Ensemble Approach

5.1.1. Ensemble for Constrained Data

A notable aspect of this work is achieving robust performance (blind test Water IoU: 0.9422, F1-score: 0.9703) with a relatively small-scale training dataset—a practical constraint that reflects the operational reality of commercial SAR data acquisition. Unlike operational satellite missions such as Sentinel-1 (freely available, global coverage, 6-day revisit), commercial high-resolution SAR platforms like Capella operate on a tasking basis with associated costs, limiting the volume of available training data. Our training dataset comprises 526 patches (512 × 512 pixels) from two geographic regions in Republic of Korea over a four-month period (June–September 2022). However, the 1 m spatial resolution of Capella imagery provides significantly richer within-scene spatial detail, and our framework compensates for limited sample diversity through three strategic mechanisms: (1) multi-source feature integration (topographic and land cover data), (2) extensive data augmentation, and (3) ensemble aggregation of architecturally diverse models. This approach demonstrates that high-quality waterbody detection is achievable even under data-constrained conditions typical of commercial SAR platforms.

Previous research on SAR-based waterbody detection has demonstrated diverse approaches across different sensor platforms and application domains. At the Sentinel-1 scale, Peña et al. (2024) [66] introduced DeepAqua, achieving robust performance for wetland water segmentation through self-supervised learning without manual annotations, demonstrating the potential of unsupervised approaches for large-scale monitoring. For high-resolution Capella X-band SAR, Popien et al. (2023) [43] proposed a UNet++-based CNN method for urban flood mapping, Das et al. (2023) [67] presented a flood depth estimation technique combining SAR imagery and DEMs, and Jensen et al. (2022) [68] introduced a multi-sensor approach integrating X-band and C-band SAR data. While these studies have significantly contributed to their respective applications, they primarily relied on limited auxiliary information within a single model framework.

Furthermore, there has been a lack of systematic research evaluating the robustness of deep learning models under varying topographic and land cover conditions, as well as the use of ensemble strategies to enhance model performance. This study addresses these gaps through systematic comparison of four advanced segmentation models (PIDNet, Mask2Former, Swin Transformer, K-Net) and introduces a generalized waterbody detection framework for Capella SAR imagery. The multi-source integration and ensemble strategies described above enable robust performance across diverse environmental conditions.

Our ensemble strategy, which uses Optimized Weights obtained via grid search, was found to be particularly appropriate for the data-constrained conditions typical of Capella imagery. The Softmax-based Automated Weights configuration derives each weight directly from the individual model's performance (e.g., IoU scores); hence, the weights are mathematically fixed once those scores are determined. In contrast, grid search explores various combinations of weights within a defined range. Because its goal is to maximize the final performance of the ensemble model, it does not necessarily assign the highest weight to the best-performing individual model. It is noteworthy that in the Optimized Weights configuration, the weights for the Swin Transformer (0.40) and K-Net (0.35) are considerably higher than that of Mask2Former (0.20), despite the three models having similar individual IoU values (0.9371, 0.9247, and 0.9155, respectively). This suggests that, although Mask2Former's individual performance is comparable, its prediction errors were not sufficiently independent of, or complementary to, the error patterns of the other models. Put differently, weighting the predictions of the Swin Transformer and K-Net more heavily led to the most effective reduction in the overall ensemble error. The Optimized Weights scheme was therefore successful because it was not constrained by the IoU ranking of the individual models. Instead, it found the combination of weights that most effectively offset the misclassification patterns (False Positives/Negatives) produced by each model on the test data. Both Optimized Weights (grid search) and Automated Weights (Softmax normalization) can be incorporated into our framework for a more extensible system.

5.1.2. Inter-Model Disagreement

A key advantage of the ensemble approach in semantic segmentation is the ability to examine inter-model disagreement and compensate for individual model limitations by blending their probability maps. As shown in Figure 12, the ensemble effectively minimizes false positive (FP) and false negative (FN) pixels, even though individual models may possess unique errors and inherent differences. We visualized this disagreement by calculating the pixel-wise standard deviation of the water probability values across the four models. As illustrated in Figure 13, while the standard deviation is notably higher around land–water boundaries, the ensemble process successfully reconciles these discrepancies, resulting in highly accurate segmentation with negligible FP and FN occurrences.
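The disagreement map described above amounts to a standard deviation taken across the model axis. The sketch below uses a toy row of three pixels (open water, a boundary pixel, and land) with made-up probabilities. The function name and the use of the population standard deviation (NumPy's default `ddof=0`) are assumptions, as the text does not specify the estimator.

```python
import numpy as np

def disagreement_map(probs):
    """probs: (n_models, H, W) water probabilities -> (H, W) std across models."""
    return np.std(probs, axis=0)

# Four models, one row of three pixels: water interior, boundary, land.
probs = np.array([
    [[0.97, 0.2, 0.03]],
    [[0.95, 0.8, 0.05]],
    [[0.96, 0.4, 0.04]],
    [[0.96, 0.6, 0.04]],
])
std = disagreement_map(probs)
mean = probs.mean(axis=0)  # equal-weight ensemble, for comparison
print(np.round(std, 4))
```

The models agree closely on the interior and land pixels, so the standard deviation there stays below 0.01. At the boundary pixel they range from 0.2 to 0.8, which produces a much larger value. High-disagreement pixels of this kind are where the choice of ensemble weights has the most influence on the final label.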

5.2. Analyses by Case

To evaluate the practical applicability and generalization capability of the proposed ensemble framework, waterbody detection performance was visualized over the entire blind test dataset, which had not been used during training or validation. Figure 14 overlays the ensemble model’s detection results (in green) on the original SAR image. The full scene was divided into four sections (marked by red boxes), and three regions of interest (ROIs) were selected as Case 1 (Estuarine Area), Case 2 (Mountainous Terrain), and Case 3 (Urban Area), respectively, for qualitative evaluation under diverse environmental conditions. Figure 15, Figure 16 and Figure 17 present detailed visualizations for Cases 1, 2, and 3, respectively, including the original SAR image, land cover map, optical reference image, the ensemble model’s probability map, and the final binary detection map.

5.2.1. Case 1: Estuarine Area

Case 1 in Figure 15 presents the detection results for an estuarine area characterized by broad waterbodies and a heterogeneous mix of land cover types. The land cover map for this region reveals a complex distribution of barren land and wetland classes in addition to water. From a SAR backscatter perspective, barren areas with low surface roughness and variable soil moisture can exhibit backscatter values similar to water, while wetlands may produce ambiguous signals depending on vegetation structure and inundation conditions. X-band SAR’s short wavelength results in surface scattering dominance, limiting backscatter contrast among smooth water, wet barren land, and inundated wetlands. Traditional SAR thresholding approaches may struggle to distinguish these areas, potentially misclassifying barren or wetland zones as water, or failing to detect waterbodies with spectral characteristics similar to non-water classes.

The ensemble model addresses this limitation by integrating topographic information (Slope and HAND) alongside land cover. HAND distinguishes waterbodies through low elevation values, separating them from topographically elevated barren areas. Slope differentiates flat water surfaces from terrain with relief. As illustrated in the probability map (Figure 15d) and binary detection map (Figure 15e), the ensemble model captured broad water areas while minimizing misclassifications in regions where barren land and wetlands are interspersed along waterbody boundaries. Visual comparison with the optical reference (Figure 15c) and the detection overlay on SAR images (Figure 15f, green) confirms accurate water delineation in this complex estuarine environment.

5.2.2. Case 2: Mountainous Terrain

Case 2 in Figure 16 presents the detection results for waterbodies located in mountainous terrain. In such environments, numerous regions exhibit very low SAR backscatter due to steep slopes, aspect angle effects, and radar shadowing. These low-backscatter shadow areas resemble actual waterbodies, potentially causing false positives. Several dark shadow regions can be observed in non-water mountainous areas in the SAR image (Figure 16a).

The ensemble model addresses this ambiguity by integrating topographic information. HAND distinguishes actual waterbodies, located near drainage networks with low elevation values, from shadowed mountain slopes at higher elevations. Slope separates flat water surfaces from steep shadowed terrain. The probability map (Figure 16d) correctly assigns low probabilities (purple) to most shadowed areas, while actual water areas display high probabilities (red-white). The binary detection map (Figure 16e) accurately identifies waterbodies while excluding most shadowed terrain. Comparison with the optical base map (Figure 16c) and the detection overlay on SAR images (Figure 16f, green) confirms accurate water delineation in this mountainous environment.

5.2.3. Case 3: Urban Area

Case 3 in Figure 17 presents the waterbody detection results within a complex urban environment. Urban areas are among the most challenging settings for SAR analysis due to severe layover and shadowing from building facades, combined with diverse artificial structures exhibiting irregular surface characteristics. Areas classified as barren land, including construction zones and sports fields, often exhibit low SAR backscatter similar to water, increasing the potential for false detections.

The ensemble model integrates topographic information to reduce such ambiguities. HAND distinguishes waterbodies near drainage networks from low-lying barren areas, while Slope differentiates flat artificial surfaces from water. The probability map (Figure 17d) shows high probabilities for actual waterbodies (orange-red) and low probabilities for most barren areas (purple). The binary detection map (Figure 17e) correctly identifies primary waterbodies (white) while excluding most non-water areas (black). However, some false positives remain in barren land regions where SAR backscatter and topographic characteristics resemble water conditions. Comparison with the optical base map (Figure 17c) and the detection overlay (Figure 17f, green) confirms reasonable performance in this complex urban environment.

5.3. Implications, Limitations and Future Work

In this study, waterbody segmentation performance improved when topography and land cover layers were integrated with X-band SAR. By providing complementary physical context beyond backscatter intensity, multi-modal inputs make the segmentation more robust and less susceptible to the noise or ambiguity inherent in relying on a single sensor type. This result supports the use of auxiliary geospatial information to mitigate SAR-specific ambiguities and strengthen model robustness, provided that the additional inputs maintain sufficient quality and coverage.

Furthermore, ensemble aggregation across multiple architectures improved prediction consistency and the precision–recall balance, mitigating architecture-specific errors observed in individual models. Although the quantitative gain was moderate (approximately +0.5% IoU over the best single model in our evaluation), the ensemble provided stronger operational stability (Precision: 0.9833, Recall: 0.9575). Our ensemble strategy was found to be appropriate for Capella images under data-constrained conditions, primarily because the Optimized Weights derived via grid search can set the weights to maximize the ensemble model’s final performance. For applications requiring reliable water mapping (e.g., flood risk assessment), ensembles represent a practical approach to improving robustness across heterogeneous conditions.

This study also has several limitations. The evaluation was conducted on data from the Republic of Korea acquired during June–September 2022. While the study area encompasses diverse environments including estuarine areas, mountainous terrain, and urban regions, validation across different geographic regions and seasonal conditions would strengthen confidence in broader applicability. Additionally, false positives persist in areas where SAR backscatter and topographic characteristics resemble water conditions, such as construction sites and sports fields on low-lying terrain. The topographic and land cover data may not be temporally aligned with SAR acquisitions, and the impact of such discrepancies on model performance requires further investigation.

Future research should focus on validating the framework across diverse geographic and climatic conditions, exploring multi-temporal analysis to reduce false positives in ambiguous areas, and evaluating computational efficiency for operational deployment. Despite these limitations, the proposed framework demonstrates practical utility for high-resolution SAR-based waterbody detection and provides a foundation for further development.

6. Conclusions

This study developed and evaluated a framework for waterbody detection using high-resolution X-band Capella SAR images integrated with topographic information (Slope and HAND) and land cover data. Four deep learning segmentation models (PIDNet, Mask2Former, Swin Transformer, and K-Net) were systematically compared under consistent experimental conditions, and a weighted average ensemble strategy was employed to effectively compensate for individual model limitations. The ensemble achieved a Water IoU of 0.9422 and an F1-score of 0.9703 on the blind test set, moderately outperforming the best single model (Swin Transformer: IoU 0.9371) while substantially surpassing the U-Net++ baseline (IoU 0.8551). Qualitative analysis confirmed accurate delineation in challenging environments, including mountainous shadows, estuarine areas, and urban settings. The integration of multi-source data with deep learning ensemble frameworks demonstrates practical utility for high-resolution SAR-based waterbody detection, particularly under data-constrained commercial satellite platforms.

The proposed methodology enhances the applicability of Capella SAR data for disaster response, water resource monitoring, and environmental change detection. Future work should focus on validating the framework across diverse geographic conditions, exploring multi-temporal analysis to reduce false positives, and evaluating computational efficiency for operational deployment.

Author Contributions

Conceptualization, S.C., S.H.K., S.V.N., M.K., M.C., J.K. and Y.L.; methodology, S.C., S.H.K., S.V.N., M.K., M.C., J.K. and Y.L.; data curation, S.C.; formal analysis, S.C.; writing—original draft preparation, S.C.; writing—review and editing, S.H.K., S.V.N., M.K., M.C., J.K. and Y.L.; project administration, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant (2021-MOIS 37-002) of the Intelligent Technology Development Program on Disaster Response and Emergency Management funded by the Ministry of Interior and Safety (MOIS, Korea). This work was carried out with the support of the “Cooperative Research Program for Agriculture Science and Technology Development (Project No. PJ0162342025)” by the Rural Development Administration, Republic of Korea. This research was supported by the Regional Innovation System & Education (RISE) program through the Institute for Regional Innovation System & Education in Busan Metropolitan City, funded by the Ministry of Education (MOE) and the Busan Metropolitan City, Republic of Korea (2025-RISE-02-001-009). This work was supported by the National Research Foundation (NRF), Korea, under project BK21 FOUR. The research carried out at the Jet Propulsion Laboratory, California Institute of Technology, is supported under a contract with NASA (80NM0018D0004).

Data Availability Statement

Data and model will be made available upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Sheffield, J.; Wood, E.F.; Pan, M.; Beck, H.; Coccia, G.; Serrat-Capdevila, A.; Verbist, K.J.W.R.R. Satellite remote sensing for water resources management: Potential for supporting sustainable development in data-poor regions. Water Resour. Res. 2018, 54, 9724–9758.
2. Rango, A. Application of remote sensing methods to hydrology and water resources. Hydrol. Sci. J. 1994, 39, 309–320.
3. Li, B.; Liu, K.; Wang, M.; Wang, Y.; He, Q.; Zhuang, L.; Zhu, W. High-spatiotemporal-resolution dynamic water monitoring using LightGBM model and Sentinel-2 MSI data. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103278.
4. Liu, S.; Wu, Y.; Zhang, G.; Lin, N.; Liu, Z. Comparing water indices for Landsat data for automated surface water body extraction under complex ground background: A case study in Jilin Province. Remote Sens. 2023, 15, 1678.
5. Yang, X.; Chen, Y.; Wang, J. Combined use of Sentinel-2 and Landsat 8 to monitor water surface area dynamics using Google Earth Engine. Remote Sens. Lett. 2020, 11, 687–696.
6. Markert, K.N.; Chishtie, F.; Anderson, E.R.; Saah, D.; Griffin, R.E. On the merging of optical and SAR satellite imagery for surface water mapping applications. Results Phys. 2018, 9, 275–277.
7. Musa, Z.N.; Popescu, I.; Mynett, A. A review of applications of satellite SAR, optical, altimetry and DEM data for surface water modelling, mapping and parameter estimation. Hydrol. Earth Syst. Sci. 2015, 19, 3755–3769.
8. Uddin, K.; Matin, M.A.; Meyer, F.J. Operational flood mapping using multi-temporal Sentinel-1 SAR images: A case study from Bangladesh. Remote Sens. 2019, 11, 1581.
9. Grimaldi, S.; Xu, J.; Li, Y.; Pauwels, V.R.; Walker, J.P. Flood mapping under vegetation using single SAR acquisitions. Remote Sens. Environ. 2020, 237, 111582.
10. Pulvirenti, L.; Pierdicca, N.; Chini, M.; Guerriero, L. An algorithm for operational flood mapping from Synthetic Aperture Radar (SAR) data using fuzzy logic. Nat. Hazards Earth Syst. Sci. 2011, 11, 529–540.
11. Cutler, P.J.; Schwartzkopf, W.C.; Koehler, F.W. Robust automated thresholding of SAR imagery for open-water detection. In Proceedings of the 2015 IEEE Radar Conference (RadarCon), Arlington, VA, USA, 11–15 May 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 310–315.
12. Wang, Z.; Zhang, R.; Zhang, Q.; Zhu, Y.; Huang, B.; Lu, Z. An automatic thresholding method for waterbody detection from SAR image. In Proceedings of the 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Chongqing, China, 11–13 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
13. Chen, S.; Huang, W.; Chen, Y.; Feng, M. An adaptive thresholding approach toward rapid flood coverage extraction from Sentinel-1 SAR imagery. Remote Sens. 2021, 13, 4899.
14. Bangira, T.; Alfieri, S.M.; Menenti, M.; Van Niekerk, A. Comparing thresholding with machine learning classifiers for mapping complex water. Remote Sens. 2019, 11, 1351.
15. Navale, A.; Haldar, D. Evaluation of machine learning algorithms to Sentinel SAR data. Spat. Inf. Res. 2020, 28, 345–355.
16. Pech-May, F.; Aquino-Santos, R.; Delgadillo-Partida, J. Sentinel-1 SAR images and deep learning for waterbody mapping. Remote Sens. 2023, 15, 3009.
17. Bereczky, M.; Wieland, M.; Krullikowski, C.; Martinis, S.; Plank, S. Sentinel-1-based water and flood mapping: Benchmarking convolutional neural networks against an operational rule-based processing chain. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2023–2036.
18. Tavus, B.; Can, R.; Kocaman, S. A CNN-based flood mapping approach using sentinel-1 data. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, 3, 549–556.
19. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015. Proceedings, Part III 18. pp. 234–241.
20. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41.
21. Saleh, T.; Weng, X.; Holail, S.; Hao, C.; Xia, G.S. DAM-Net: Flood detection from SAR imagery using differential attention metric-based vision transformers. ISPRS J. Photogramm. Remote Sens. 2024, 212, 440–453.
22. Zhang, Y.; Lu, H.; Ma, G.; Zhao, H.; Xie, D.; Geng, S.; Tian, W.; Sian, K.T.C.L.K. MU-net: Embedding MixFormer into unet to extract water bodies from remote sensing images. Remote Sens. 2023, 15, 3559.
23. Ma, D.; Jiang, L.; Li, J.; Shi, Y. Water index and Swin Transformer Ensemble (WISTE) for water body extraction from multispectral remote sensing images. GIScience Remote Sens. 2023, 60, 2251704.
24. Chen, B.; Zou, X.; Zhang, Y.; Li, J.; Li, K.; Xing, J.; Tao, P. LEFormer: A hybrid CNN-transformer architecture for accurate lake extraction from remote sensing imagery. In Proceedings of the ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 5710–5714.
25. Zhou, Y.; Yang, K.; Ma, F.; Hu, W.; Zhang, F. Water–land segmentation via structure-aware CNN–transformer network on large-scale SAR data. IEEE Sens. J. 2022, 23, 1408–1422.
26. Tsyganskaya, V.; Martinis, S.; Marzahn, P.; Ludwig, R. SAR-based detection of flooded vegetation–a review of characteristics and approaches. Int. J. Remote Sens. 2018, 39, 2255–2293.
27. Ganaie, M.A.; Hu, M.; Malik, A.K.; Tanveer, M.; Suganthan, P.N. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151.
28. Sharma, N.K.; Saharia, M. DeepSARFlood: Rapid and Automated SAR-based flood inundation mapping using Vision Transformer-based Deep Ensembles with uncertainty estimates. Sci. Remote Sens. 2025, 11, 100203.
29. Hosseiny, B.; Mahdianpari, M.; Brisco, B.; Mohammadimanesh, F.; Salehi, B. WetNet: A spatial–temporal ensemble deep learning model for wetland classification using Sentinel-1 and Sentinel-2. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14.
30. Paul, S.; Ganju, S. Flood segmentation on Sentinel-1 SAR imagery with semi-supervised learning. arXiv 2021, arXiv:2107.08369.
31. Hong, S.; Jang, H.; Kim, N.; Sohn, H.G. Water area extraction using RADARSAT SAR imagery combined with Landsat imagery and terrain information. Sensors 2015, 15, 6652–6667.
32. Hughes, L.H.; Marcos, D.; Lobry, S.; Tuia, D.; Schmitt, M. A deep learning framework for matching of SAR and optical imagery. ISPRS J. Photogramm. Remote Sens. 2020, 169, 166–179.
33. Najem, S.; Baghdadi, N.; Bazzi, H.; Zribi, M. Incidence angle normalization of C-band radar backscattering coefficient over agricultural surfaces using dynamic cosine method. Remote Sens. 2024, 16, 3838.
34. Azam, M.; Park, H.; Kim, J. Spatial and Temporal Trend Analysis of Precipitation and Drought in South Korea. Water 2018, 10, 765.
35. Jiang, C.; Zhang, H.; Wang, C.; Ge, J.; Wu, F. Water surface mapping from Sentinel-1 imagery based on attention-UNet3+: A case study of Poyang Lake region. Remote Sens. 2022, 14, 4708.
36. Hahmann, T.; Roth, A.; Martinis, S.; Twele, A.; Gruber, A. Automatic extraction of waterbodies from TerraSAR-X data. In Proceedings of the IGARSS 2008–2008 IEEE International Geoscience and Remote Sensing Symposium, Boston, MA, USA, 7–11 July 2008; IEEE: Piscataway, NJ, USA, 2008; Volume 3, pp. III-103–III-106.
37. Martinis, S.; Kersten, J.; Twele, A. A fully automated TerraSAR-X based flood service. ISPRS J. Photogramm. Remote Sens. 2015, 104, 203–212.
38. Yayong, S.; Shifeng, H.; Jiren, L.; Xiaotao, L.; Jianwei, M.; Hui, W. Monitoring seasonal changes in the water surface areas of Poyang Lake using COSMO-SkyMed time series data in PR China. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 7180–7183.
39. Tong, X.; Luo, X.; Liu, S.; Xie, H.; Chao, W.; Liu, S.; Liu, S.; Makhinov, A.N.; Makhinova, A.F.; Jiang, Y. An approach for flood monitoring by the combined use of Landsat 8 optical imagery and COSMO-SkyMed radar imagery. ISPRS J. Photogramm. Remote Sens. 2018, 136, 144–153.
40. Castelletti, D.; Farquharson, G.; Stringham, C.; Duersch, M.; Eddy, D. Capella space first operational SAR satellite. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1483–1486.
41. Ignatenko, V.; Laurila, P.; Radius, A.; Lamentowski, L.; Antropov, O.; Muff, D. ICEYE Microsatellite SAR Constellation Status Update: Evaluation of first commercial imaging modes. In Proceedings of the IGARSS 2020–2020 IEEE International Geoscience and Remote Sensing Symposium, Virtual, 26 September–2 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 3581–3584.
42. Yague-Martinez, N.; Leach, N.R.; Dasgupta, A.; Tellman, E.; Brown, J.S. Towards frequent flood mapping with the Capella SAR system. The 2021 Eastern Australia floods case. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 6174–6177.
43. Popien, P.; D’Hondt, O.; Sunkara, V.; Chakrabarti, S. Deep Learning Based Urban Flood Mapping From High Resolution Capella Space Sar Imagery. In Proceedings of the IGARSS 2023–2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1384–1387.
44. Stringham, C.; Farquharson, G.; Castelletti, D.; Quist, E.; Riggi, L.; Eddy, D.; Soenen, S. The capella X-band SAR constellation for rapid imaging. In Proceedings of the IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9248–9251.
45. Albinet, C. Technical Note for Capella Data Assessment; ESA Earthnet Data Assessment Pilot; European Space Agency: Paris, France, 2022; p. 28. Available online: https://earth.esa.int/eogateway/documents/20142/37627/Technical%20Note%20for%20Capella%20Data%20Assessment.pdf (accessed on 5 January 2026).
46. Capella Space. Capella SAR System Performance v2.0; White Paper; Capella Space: San Francisco, CA, USA, 2020; p. 18. Available online: https://geokom.ba/wp-content/uploads/2020/12/Capella_Space_SAR_System_Performance.pdf (accessed on 5 January 2026).
47. Younis, M.; Huber, S.; Patyuchenko, A.; Bordoni, F.; Krieger, G. Performance comparison of reflector-and planar-antenna based digital beam-forming SAR. Int. J. Antennas Propag. 2009, 2009, 614931.
48. Wang, X.; Liu, J.; Zhang, S.; Deng, Q.; Wang, Z.; Li, Y.; Fan, J. Detection of oil spill using SAR imagery based on AlexNet model. Comput. Intell. Neurosci. 2021, 2021, 4812979.
49. Rousso, R.; Katz, N.; Sharon, G.; Glizerin, Y.; Kosman, E.; Shuster, A. Automatic recognition of oil spills using neural networks and classic image processing. Water 2022, 14, 1127.
50. Singha, S.; Bellerby, T.J.; Trieschmann, O. Satellite oil spill detection using artificial neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 2355–2363.
51. Yu, Q.; Liu, W.; Gonçalves, W.N.; Junior, J.M.; Li, J. Spatial Resolution Enhancement for Large-Scale Land Cover Mapping via Weakly Supervised Deep Learning. Photogramm. Eng. Remote Sens. 2021, 87, 405–412.
52. Small, D. Flattening gamma: Radiometric terrain correction for SAR imagery. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3081–3093.
53. Wang, Z.; Zhang, C.; Atkinson, P.M. Combining SAR images with land cover products for rapid urban flood mapping. Front. Environ. Sci. 2022, 10, 973192.
54. Li, Z.; Demir, I. U-net-based semantic classification for flood extent extraction using SAR imagery and GEE platform: A case study for 2019 central US flooding. Sci. Total Environ. 2023, 869, 161757.
55. Islam, M.T.; Meng, Q. An exploratory study of Sentinel-1 SAR for rapid urban flood mapping on Google Earth Engine. Int. J. Appl. Earth Obs. Geoinf. 2022, 113, 103002.
56. Mason, D.C.; Bevington, J.; Dance, S.L.; Revilla-Romero, B.; Smith, R.; Vetra-Carvalho, S.; Cloke, H.L. Improving urban flood mapping by merging synthetic aperture radar-derived flood footprints with flood hazard maps. Water 2021, 13, 1577.
57. Engen, M.; Sandø, E.; Sjølander, B.L.O.; Arenberg, S.; Gupta, R.; Goodwin, M. Farm-scale crop yield prediction from multi-temporal data using deep hybrid neural networks. Agronomy 2021, 11, 2576.
58. Johary, R.; Révillion, C.; Catry, T.; Alexandre, C.; Mouquet, P.; Rakotoniaina, S.; Pennober, G.; Rakotondraompiana, S. Detection of large-scale floods using Google Earth Engine and Google Colab. Remote Sens. 2023, 15, 5368.
59. Deng, K.; Hu, X.; Zhang, Z.; Su, B.; Feng, C.; Zhan, Y.; Wang, X.; Duan, Y. Cross-modal change detection using historical land use maps and current remote sensing images. ISPRS J. Photogramm. Remote Sens. 2024, 218, 114–132.
60. Brown, C.F.; Brumby, S.P.; Guzder-Williams, B.; Birch, T.; Hyde, S.B.; Mazzariello, J.; Czerwinski, W.; Pasquarella, V.J.; Haertel, R.; Ilyushchenko, S.; et al. Dynamic World, Near real-time global 10 m land use land cover mapping. Sci. Data 2022, 9, 251.
61. Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A real-time semantic segmentation network inspired by PID controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19529–19539.
62. Cheng, B.; Choudhuri, A.; Misra, I.; Kirillov, A.; Girdhar, R.; Schwing, A.G. Mask2former for video instance segmentation. arXiv 2021, arXiv:2112.10764.
63. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
64. Zhang, W.; Pang, J.; Chen, K.; Loy, C.C. K-net: Towards unified image segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 10326–10338.
65. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis, Granada, Spain, 20 September 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–11.
66. Peña, F.J.; Hübinger, C.; Payberah, A.H.; Jaramillo, F. DeepAqua: Semantic segmentation of wetland water surfaces with SAR imagery using deep neural networks without manually annotated data. Int. J. Appl. Earth Obs. Geoinf. 2024, 126, 103624.
67. Das, P.; Jensen, K.; De, S.; Ganguly, A.R. Flood Depth Estimation Using Synthetic Aperture Radar (SAR) Imagery and Topography: A Case Study of the 2021 and 2022 Floods in Hawkesbury Valley, Australia. In Proceedings of the IGARSS 2023–2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2402–2405.
68. Jensen, K.; De, S.; Hughes, L.; Yalla, G. Flood Monitoring with X-Band and C-Band SAR: A Case Study of the 2021 British Columbia Floods. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 5535–5538.

Figure 1. Geographic locations and spatial extents of the Capella SAR scenes used in this study.

Figure 2. Comparison of noise characteristics across SAR images of the same location acquired on different dates.

Figure 3. Multi-source data integration for manual ground truth generation: (a) High-resolution optical image (base map), (b) Sentinel-2 false color composite (NIR-Red-Green), (c) Land Cover Map, and (d) Sentinel-2 normalized difference water index (NDWI).

Figure 4. Histogram analysis of the SAR image preprocessing workflow: (a) Raw SAR backscatter, (b) Logarithmic compression, (c) Percentile-based dynamic range clipping, and (d) Normalization (0–1 scaling).
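The three steps named in this caption (logarithmic compression, percentile-based clipping, and 0–1 scaling) can be sketched as follows. This is a minimal illustration, not the authors' code: the decibel conversion, the 2nd/98th percentile bounds, the `eps` guard, and the function name `preprocess_sar` are all assumptions made for the example.

```python
import numpy as np

def preprocess_sar(intensity, low_pct=2.0, high_pct=98.0, eps=1e-10):
    """Log-compress, percentile-clip, and rescale a SAR intensity image to [0, 1]."""
    # (b) Logarithmic compression to decibels; eps guards against log(0).
    db = 10.0 * np.log10(np.maximum(intensity, eps))
    # (c) Clip the dynamic range to the chosen percentiles (assumed values).
    lo, hi = np.percentile(db, [low_pct, high_pct])
    clipped = np.clip(db, lo, hi)
    # (d) Min-max normalization so the clipped range maps onto 0-1.
    return (clipped - lo) / (hi - lo)

# Synthetic, speckle-like intensities stand in for a real SAR patch.
rng = np.random.default_rng(0)
raw = rng.gamma(shape=1.0, scale=0.05, size=(64, 64))
scaled = preprocess_sar(raw)
```

Clipping before scaling keeps a few very bright scatterers from compressing the rest of the histogram into a narrow band near zero.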

Figure 5. SAR image, reference optical image, and land cover classification represented as seven independent binary (one-hot encoded) channels. The left column displays the original SAR image (top) and the corresponding reference optical image (bottom) of the study area. Panels (1–7) on the right illustrate the land cover classification layers for the major categories.

Figure 6. Spatial resolution comparison between land cover datasets. The Dynamic World land cover map (left; 10 m resolution) features near-real-time probability-based estimates, whereas the Korea Ministry of Climate, Energy and Environment (MCEE) land cover map (right; 1 m resolution) provides superior geometric precision validated by manual interpretation.

Figure 7. Overall workflow of the deep learning-based ensemble framework for waterbody segmentation using Capella SAR images in the Republic of Korea. In the DEM preprocessing module, the color gradients for slope and HAND represent the relative magnitude of the values, ranging from low (blue) to high (red).

Figure 8. Workflow of the weighted average ensemble used in this study.

Figure 9. Comparative waterbody detection results based on input channel configurations on the Blind Test Data (HS0904). The five rows (a–e) present representative sub-scenes. Columns show (from left) Original SAR (HH), Optical Base map, Land Cover Map (LULC), Ground Truth (G.T.), and the segmentation results from Experiment 1 (SAR only) and Experiment 8 (All Features). Error visualization is overlaid on the results: White = True Positive (Correct Detection), Green = False Positive (Non-Water classified as Water), Red = False Negative (Water missed), Black = True Negative (Correctly Non-Water).
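The error color coding described in this caption can be reproduced from a binary prediction and a binary ground-truth mask. The sketch below is illustrative only; the function name `error_overlay` and the exact RGB values are assumptions chosen to match the legend (white, green, red, black).

```python
import numpy as np

# RGB colors (0-255) following the caption's legend.
COLORS = {
    "TP": (255, 255, 255),  # white: water correctly detected
    "FP": (0, 255, 0),      # green: non-water classified as water
    "FN": (255, 0, 0),      # red: water missed
    "TN": (0, 0, 0),        # black: non-water correctly rejected
}

def error_overlay(pred, truth):
    """Return an (H, W, 3) uint8 image that color-codes each pixel's outcome."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)  # true negatives stay black
    rgb[pred & truth] = COLORS["TP"]
    rgb[pred & ~truth] = COLORS["FP"]
    rgb[~pred & truth] = COLORS["FN"]
    return rgb

# 2 x 2 toy masks covering all four outcomes.
pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
overlay = error_overlay(pred, truth)
```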

Figure 10. Comparison of probability maps between the 1-channel and 8-channel segmentation models. While the 1-channel probability map exhibits ambiguous patterns, particularly at water–land boundaries, the 8-channel model provides relatively distinguishable and sharp patterns in those regions. The color gradient illustrates the probability of water presence, ranging from cyan (low probability) to orange (high probability).

Figure 11. Cross-validation performance comparison of segmentation models. (a) Water IoU and (b) F1 score distributions across 5-fold cross-validation for U-Net++, PIDNet, Swin Transformer, Mask2Former, and K-Net. Box plots show median, quartiles, and range. Individual fold results are displayed as colored points (Fold 1–5: red, blue, green, purple, orange).

Figure 12. Waterbody segmentation performance comparison on blind test data. Rows (a–e) show representative scenes from the HS0904 test set. Columns display (from left): original SAR, land cover map (LULC), optical base map, ground truth (G.T.), and segmentation results from Otsu, U-Net++, PIDNet, Mask2Former, Swin Transformer, K-Net, and the proposed Ensemble model. Error visualization: white = true positive, green = false positive, red = false negative, black = true negative.

Figure 13. Visualization of inter-model disagreement among the four segmentation models (PIDNet, Swin Transformer, Mask2Former, and K-Net). The disagreement is quantified using the standard deviation of the water probability values across the four models. Notably, higher standard deviation values are observed primarily along land–water boundaries. In the disagreement maps, the color gradient represents the magnitude of the standard deviation; purple denotes low disagreement (high consistency), while orange signifies high disagreement (low consistency).
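The disagreement measure in this caption, the per-pixel standard deviation of water probability across the four models, can be sketched as below. The use of the population standard deviation (NumPy's default, `ddof=0`) and the toy probability values are assumptions; the caption does not state which estimator was used.

```python
import numpy as np

def disagreement_map(prob_maps):
    """Per-pixel standard deviation of water probability across models."""
    stack = np.stack(prob_maps, axis=0)  # shape: (n_models, H, W)
    return stack.std(axis=0)             # population std (ddof=0), an assumption

# Toy 1 x 2 probability maps for four hypothetical models:
# the first pixel is unanimous, the second is contested.
maps = [
    np.array([[0.9, 0.2]]),
    np.array([[0.9, 0.8]]),
    np.array([[0.9, 0.4]]),
    np.array([[0.9, 0.6]]),
]
disagreement = disagreement_map(maps)
```

In this toy case the unanimous pixel scores zero and the contested pixel scores about 0.22, mirroring the pattern of low values inside water or land and high values along boundaries.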

Figure 14. Waterbody mapping using Capella SAR (HS0904 image). Full Stripmap ensemble output with zoomed-in analyses of three representative regions of interest (ROIs).

Figure 15. Waterbody detection results for Case 1 (estuarine area) using the ensemble model. (a) Original Capella X-band SAR image showing the complex estuarine environment; (b) Land cover classification map; (c) Optical base map; (d) Ensemble model probability map displaying confidence levels for waterbody detection; (e) Binary detection results with error overlay (white: true positive, green: false positive, red: false negative, black: true negative); (f) Overlay of detection results on the SAR image.

Figure 16. Waterbody detection results for Case 2 (mountainous area) using the ensemble model. (a) Original Capella X-band SAR image displaying complex mountainous terrain; (b) Land cover classification map; (c) Optical base map; (d) Ensemble model probability map showing confidence levels for waterbody detection; (e) Binary detection results with error overlay (white: true positive, green: false positive, red: false negative, black: true negative); (f) Overlay of detection results on the SAR image.

Figure 17. Waterbody detection results for Case 3 (urban area) using the ensemble model. (a) Original Capella X-band SAR image revealing the complex urban environment; (b) Land cover classification map; (c) Optical base map; (d) Ensemble model probability map showing confidence levels for waterbody detection; (e) Binary detection results with error overlay (white: true positive, green: false positive, red: false negative, black: true negative); (f) Overlay of detection results on the SAR image.

Table 1. Characteristics of the Capella SAR system (data sourced from ref. [45]).

Parameter	Value
Frequency Band	X-band (9.4–9.9 GHz)
Imaging Bandwidth	500–700 MHz
Imaging Modes	Spotlight
	Sliding Spotlight
	Stripmap
Imaging Polarizations	Single-Pol HH and VV
Imaging Orbit Directions	Ascending and Descending
Imaging Look Directions	Left and Right
Accessible Imaging Latitudes	SSO ¹ 97° Orbital Plane: +87.4°N to −87.4°S
	MIO ² 53° Orbital Plane: +58.3°N to −58.3°S
	MIO 45° Orbital Plane: +48.9°N to −48.9°S
Look Angle Ranges	25–50° (Standard Products)
	Up to 15–50° (Extended Products)
	Up to 5–50° (Custom Products)
Transmit Power	600 Watt
Acquisition Direction	Left and Right sides

¹ SSO: Sun-Synchronous Orbit; ² MIO: Medium Inclination Orbit.

Table 2. Specifications of Capella Stripmap SAR image products [46].

Image Product	Imaging Mode	Nominal Scene Size	Azimuth Resolution	Slant Range Resolution	Ground Range Resolution	Look Angle Range
SLC	Stripmap	5–10 km	1.2 m	0.75 m	NA	25–45°
GEC/GEO	Stripmap	5–10 km	1.2 m	NA	1.1–1.6 m	25–45°

Table 3. Summary of SAR acquisition parameters and sensor characteristics for the Capella images used in this study.

Image ID		US0629	HS0904	PH0912	PH0916	PH0925
Image size		13,154 × 28,246	25,865 × 26,585	25,680 × 26,748	12,950 × 28,188	12,992 × 28,168
Acquisition date		29 June 2022	4 September 2022	12 September 2022	16 September 2022	25 September 2022
Satellite		Capella-7	Capella-6	Capella-6	Capella-8	Capella-8
Frequency		X-band (9.65 GHz)
Resolution	Ground range	1.59 m	1.59 m	1.79 m	1.60 m	1.43 m
Resolution	Azimuth	1.41 m	1.41 m	1.37 m	1.43 m	1.43 m
Pixel spacing		0.8 m × 0.8 m
Angle	Incidence	36.80°	36.75°	32.00°	36.60°	41.80°
Angle	Look	33.50°	33.54°	29.10°	33.40°	37.90°
Orbit		Desc.	Desc.	Asc.	Desc.	Desc.
NESZ peak (dB)		−17.71	−13.61	−16.17	−18.90	−17.36

Table 4. Comparison of key characteristics between the Dynamic World land cover map and the Korea Ministry of Climate, Energy and Environment (MCEE) land cover map.

Feature	Dynamic World Map [60]	MCEE Land Cover Map
Data Source	Sentinel-2 MSI (Optical Satellite)	Aerial Orthophotos
Data Type and Resolution	Raster (10 m)	Vector (Effective res. ≈ 1 m)
Update Frequency	Near Real-time (Available within 2–5 days of acquisition)	Periodic (1-year to multi-year update cycles)
Geographic Coverage	Global	Republic of Korea (National)
Classification Scheme	9 Land Use/Land Cover (LULC) classes	41 Detailed classes (Reclassified to 7)
Key Advantages	Temporal Synchronization: Aligns with SAR acquisition time, capturing dynamic changes (e.g., floods, seasonal water)	Geometric Precision: High boundary accuracy for static features (e.g., buildings, roads) due to vector format
Limitations	Coarser Resolution: Limited capacity to resolve small-scale objects; susceptible to cloud cover	Temporal Discrepancy: Potential misalignment with SAR imagery due to update latency (time lag)

Table 5. Input channel configurations for each experiment based on input type.

Experiment ID	Input Channels	Descriptions	No. of Channels
Exp. 1	HH		1
Exp. 2	HH, Gamma		2
Exp. 3	HH, Slope, and HAND		3
Exp. 7	HH, Slope, HAND, and Land cover	Land cover layers (urban, agriculture, forest, grassland)	7
Exp. 8	HH, Slope, HAND, Land cover, and Gamma	All channels	8
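The channel configurations in Table 5 amount to stacking co-registered rasters along a channel axis. The sketch below shows one way to assemble the 7-channel (Exp. 7) and 8-channel (Exp. 8) inputs. The channel order, the land cover class ids, and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical class ids for the four land cover layers in Table 5:
# urban, agriculture, forest, grassland.
LAND_COVER_CLASSES = (1, 2, 3, 4)

def one_hot_land_cover(class_map, class_ids=LAND_COVER_CLASSES):
    """Turn a class-id raster into one binary channel per class."""
    return np.stack([(class_map == c).astype(np.float32) for c in class_ids], axis=0)

def build_input(hh, slope, hand, class_map, gamma=None):
    """Stack HH, slope, HAND, and land cover; append gamma when provided."""
    layers = [hh[None], slope[None], hand[None], one_hot_land_cover(class_map)]
    if gamma is not None:
        layers.append(gamma[None])
    return np.concatenate(layers, axis=0).astype(np.float32)

# Toy 4 x 4 rasters; class id 0 stands for any class outside the four layers.
h, w = 4, 4
hh = np.full((h, w), 0.3)
slope = np.zeros((h, w))
hand = np.ones((h, w))
gamma = np.full((h, w), 0.25)
class_map = np.arange(h * w).reshape(h, w) % 5

x_exp7 = build_input(hh, slope, hand, class_map)         # 7 channels
x_exp8 = build_input(hh, slope, hand, class_map, gamma)  # 8 channels
```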

Table 6. Key settings for the segmentation models in Experiment 8.

Setting	PIDNet	Mask2Former	Swin Transformer	K-Net
Input size	512 × 512
Input channels	8 (HH, Gamma, HAND, Slope, and land cover with four classes)
Backbone	PIDNet	Swin-Large	Swin-Large	Swin-Large
Batch size	4
Iterations	30,000	20,000	20,000	20,000
Learning rate	1.0 × 10⁻⁴	2.0 × 10⁻⁴	2.0 × 10⁻⁴	2.0 × 10⁻⁴
Optimizer	AdamW
Loss Function	CrossEntropy + OHEM + Boundary	CrossEntropy + Dice	CrossEntropy	CrossEntropy

Table 7. Performance of waterbody segmentation by input channel configuration. Mask2Former was used for training and validation. Best results are shown in bold.

Experiment ID	Exp. 1	Exp. 2	Exp. 3	Exp. 7	Exp. 8
Water IoU	0.9149	0.9166	0.9341	0.9521	0.9550
Accuracy	0.9859	0.9863	0.9891	0.9921	0.9926
F1-score	0.9556	0.9565	0.9659	0.9755	0.9770
Recall	0.9728	0.9800	0.9812	0.9785	0.9810
Precision	0.9389	0.9342	0.9511	0.9725	0.9730

Table 8. Dataset distribution and patch counts for 5-fold cross-validation.

No. of Patches	Scenes Used	Fold 1	Fold 2	Fold 3	Fold 4	Fold 5
Training set (w/Augmentation)	PH0912 PH0916 US0629 PH0925	2933	2940	2772	3017	3066
Validation set (w/o Augmentation)	PH0912 PH0916 US0629 PH0925	107	106	130	95	88
Test set (w/o Augmentation)	HS0904	159	159	159	159	159

Table 9. Summary of cross-validation performance metrics (mean ± standard deviation) for the five segmentation models.

Model	Water IoU	Accuracy	F1-Score	Precision	Recall
U-Net++	0.915 ± 0.016	0.991 ± 0.003	0.955 ± 0.009	0.962 ± 0.013	0.949 ± 0.016
PIDNet	0.916 ± 0.024	0.991 ± 0.003	0.956 ± 0.013	0.965 ± 0.013	0.947 ± 0.020
Swin Transformer	0.920 ± 0.026	0.991 ± 0.002	0.958 ± 0.015	0.966 ± 0.010	0.950 ± 0.021
Mask2Former	0.928 ± 0.018	0.992 ± 0.002	0.962 ± 0.010	0.967 ± 0.015	0.958 ± 0.019
K-Net	0.932 ± 0.014	0.993 ± 0.002	0.964 ± 0.010	0.968 ± 0.017	0.959 ± 0.020

Table 10. Segmentation performance metrics on the unseen HS0904 test set. Results are presented for reference methods (Otsu, U-Net++), candidate models, and the weighted ensemble model. Best results in bold.

Model	Water IoU	Accuracy	F1-Score	Precision	Recall
Otsu	0.6313	0.9016	0.7740	0.7240	0.8313
U-Net++	0.8551	0.9687	0.9219	0.9333	0.9108
PIDNet	0.8921	0.9779	0.9430	0.9902	0.9001
Swin Transformer	0.9371	0.9871	0.9675	0.9916	0.9446
Mask2Former	0.9155	0.9828	0.9559	0.9930	0.9215
K-Net	0.9247	0.9845	0.9609	0.9851	0.9378
Ensemble	0.9422	0.9881	0.9703	0.9833	0.9575
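For readers who want to reproduce the columns of Table 10, the sketch below computes the five metrics from pixel-level confusion counts, assuming the standard definitions (the function name and the example counts are illustrative). For a single class, IoU and F1 are linked by IoU = F1 / (2 − F1), which gives a quick consistency check on the reported Ensemble row.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Standard binary segmentation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "iou": tp / (tp + fp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * precision * recall / (precision + recall),
        "precision": precision,
        "recall": recall,
    }

# Illustrative counts, not taken from the paper.
example = segmentation_metrics(tp=90, fp=5, fn=10, tn=895)

# Consistency check using the Ensemble precision and recall from Table 10.
p, r = 0.9833, 0.9575
f1_from_pr = 2 * p * r / (p + r)           # close to the reported 0.9703
iou_from_f1 = f1_from_pr / (2 - f1_from_pr)  # close to the reported 0.9422
```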

Table 11. Performance comparison of ensemble weighting strategies on the HS0904 blind test set.

Model	Weights *	Water IoU	Accuracy	F1-Score	Precision	Recall
Best single	NA	0.9371	0.9871	0.9675	0.9916	0.9446
Equal Weights	[0.25, 0.25, 0.25, 0.25]	0.9408	0.9878	0.9695	0.9861	0.9534
Automated Weights (Softmax Normalization)	[0.2478, 0.2512, 0.2506, 0.2504]	0.9408	0.9879	0.9695	0.9861	0.9535
Optimized Weights (Grid Search)	[0.05, 0.40, 0.20, 0.35]	0.9422	0.9881	0.9703	0.9833	0.9575

* The order is K-Net, Swin Transformer, PIDNet, Mask2Former.
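A weighted average ensemble of the kind compared in Table 11 can be sketched as a weighted sum of per-model water probability maps followed by a threshold. The code below is a minimal illustration under stated assumptions: the 0.5 decision threshold, the toy probabilities, and both function names are hypothetical, and the scores that were fed to the softmax normalization are not given in this section, so `softmax_weights` only shows the general form.

```python
import numpy as np

def weighted_ensemble(prob_maps, weights, threshold=0.5):
    """Fuse per-model water probabilities by a weighted average, then threshold."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # renormalize so the weights sum to 1
    stack = np.stack(prob_maps, axis=0)      # shape: (n_models, H, W)
    fused = np.tensordot(w, stack, axes=1)   # shape: (H, W)
    return fused, fused >= threshold

def softmax_weights(scores):
    """Softmax-normalize per-model scores into ensemble weights."""
    s = np.asarray(scores, dtype=float)
    e = np.exp(s - s.max())                  # subtract the max for numerical stability
    return e / e.sum()

# One-pixel toy maps in the Table 11 order: K-Net, Swin Transformer, PIDNet, Mask2Former.
maps = [np.array([[0.9]]), np.array([[0.3]]), np.array([[0.6]]), np.array([[0.4]])]

fused_eq, mask_eq = weighted_ensemble(maps, [0.25, 0.25, 0.25, 0.25])
fused_opt, mask_opt = weighted_ensemble(maps, [0.05, 0.40, 0.20, 0.35])
```

With these toy values, equal weights give a fused probability of 0.55 (water), while the grid-search weights from Table 11 give 0.425 (non-water), showing how down-weighting one confident model can flip a borderline pixel. Softmax over scores that differ only slightly yields nearly uniform weights, which is consistent with the automated weights in Table 11 all lying close to 0.25.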

Share and Cite

MDPI and ACS Style

Choi, S.; Kim, S.H.; Nghiem, S.V.; Kafatos, M.; Choi, M.; Kim, J.; Lee, Y. A Robust Deep Learning Ensemble Framework for Waterbody Detection Using High-Resolution X-Band SAR Under Data-Constrained Conditions. Remote Sens. 2026, 18, 301. https://doi.org/10.3390/rs18020301
