SGCAD: A SAR-Guided Confidence-Gated Distillation Framework of Optical and SAR Images for Water-Enhanced Land-Cover Semantic Segmentation

Ma, Junjie; Wang, Zhiyi; Yuan, Yanyi; Hu, Fengming

doi:10.3390/rs18060962

Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(6), 962; https://doi.org/10.3390/rs18060962

Submission received: 6 February 2026 / Revised: 16 March 2026 / Accepted: 19 March 2026 / Published: 23 March 2026

(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?

A SAR-only HRNet teacher provides reliable structural priors for water-body segmentation.
A class-aware confidence-gated distillation strategy selectively enhances water prediction on reliable pixels.

What are the implications of the main findings?

LightMCANet and SEGM improve cross-modal fusion and boundary continuity for slender land-cover structures.
SGCAD significantly improves water-body IoU while maintaining stable performance on the remaining classes.

Abstract

Multimodal fusion of synthetic aperture radar (SAR) and optical imagery is widely used in Earth observation for applications such as land-cover mapping, surface-water mapping (including post-event flood mapping under near-synchronous acquisitions), and land-use inventory. Optical images provide rich spectral and texture cues, whereas SAR offers all-weather structural information that is complementary but heterogeneous. In practice, this heterogeneity often introduces fusion conflicts in multi-class segmentation, causing critical categories such as water bodies to be under-optimized. To address this issue, this paper presents a SAR-guided class-aware knowledge distillation (SGCAD) method for multimodal semantic segmentation. First, a SAR-only HRNet is trained as a water-expert teacher to learn discriminative backscattering and boundary priors for water extraction. Second, a lightweight multimodal student model (LightMCANet) is optimized using a class-aware distillation strategy that transfers teacher knowledge only within high-confidence water regions, thereby suppressing noisy supervision and reducing interference with other classes. Third, a SAR edge guidance module (SEGM) is introduced in the decoder to enhance boundary continuity for slender structures such as water bodies and roads. Overall, SGCAD improves targeted category learning while maintaining stable performance across the remaining classes. Experiments on a self-built dataset from GF-1 optical and LuTan-1 SAR imagery demonstrate higher overall accuracy and more coherent water/road predictions than representative baselines. Future work will extend the proposed distillation scheme to additional categories and broader geographic scenes.

Keywords: synthetic aperture radar (SAR); multimodal semantic segmentation; knowledge distillation; LightMCANet; HRNet

1. Introduction

High-resolution Earth observation has rapidly increased the demand for pixel-level interpretation of land-cover and land-use patterns. Semantic segmentation has become a core technique for flood mapping and large-area land-cover inventory [1,2]. With the progress of deep learning, remote sensing segmentation has gradually shifted from handcrafted features to end-to-end representation learning with stronger backbones and multi-scale context aggregation [3,4]. Despite these advances, practical scenes still exhibit (i) large intra-class appearance variations and acquisition-induced changes, and (ii) severe inter-class confusion (e.g., shadow vs. water), which often lead to unstable optimization and noisy decision boundaries [5,6]. Moreover, class imbalance and boundary ambiguity are especially harmful for thin or fragmented objects such as rivers and roads [7,8]. Consequently, achieving both high accuracy and robust boundary delineation remains challenging in large-area mapping pipelines.

Optical and synthetic aperture radar (SAR) imagery are highly complementary. Optical images provide rich spectral and textural cues for fine-grained semantic discrimination, whereas SAR enables all-weather, day-and-night imaging and offers geometric and structural information linked to microwave scattering [9,10]. Such SAR-derived structural cues have also been exploited in other terrain understanding settings (e.g., SAR altimeter delay-Doppler-image-based terrain classification), suggesting the potential of SAR to contribute stable physical priors [11]. This complementarity has motivated extensive efforts on optical–SAR fusion for land-cover classification and segmentation, including dual-stream encoders and multi-level feature aggregation [12,13]. However, the heterogeneity between optical and SAR remains a fundamental obstacle: SAR radiometry is affected by speckle, incidence angle, and scattering mechanisms, while optical radiometry depends on illumination and atmospheric conditions. These factors yield a pronounced cross-modal domain gap and may trigger fusion conflicts when features are combined indiscriminately [14]. In addition, imperfect co-registration and geometric distortions can further degrade multimodal fusion performance [15].

Existing optical–SAR fusion strategies can be broadly summarized as image-level fusion/translation, feature-level fusion, and attention-based interaction. Image fusion (IF) aims to merge multimodal inputs into a single representation for downstream segmentation, whereas image translation (IT) learns a mapping between domains (e.g., SAR→optical) to support training or enhancement under poor optical conditions [16,17]. Feature-level fusion designs typically adopt dual-branch encoders and fuse multi-level features via concatenation or adaptive fusion modules, but they may overlook fine-grained inter-modal correspondences and introduce redundancy [18]. To explicitly model cross-modal relations, attention and cross-attention have been increasingly employed to align semantic focus and structural cues [19,20]. More recently, correction–interaction–fusion architectures attempt to mutually rectify modal representations before exchanging information across spatial and channel dimensions, improving robustness under heterogeneous inputs [21,22]. Beyond fusion design, multimodal segmentation is also hindered by data and preprocessing constraints. In many scenarios, modalities have inconsistent spatial resolutions or sampling footprints, which necessitates iterative alignment and cross-modal conditioning to avoid performance loss [23]. Furthermore, real-world systems may suffer from missing modalities at inference time (e.g., cloud-contaminated optical), prompting incomplete multimodal learning that remains robust under partial inputs [24]. Relatedly, SAR time-series research has shown that exploiting temporal coherence can reduce reliance on strong priors and improve robustness under non-ideal conditions, as demonstrated in multi-temporal InSAR phase unwrapping and spatio-temporal urban change detection [25,26]. These observations further motivate designing multimodal segmentation methods that can cope with heterogeneous reliability and uncertainty.

Water-body segmentation is a particularly important yet challenging case. Water mapping is crucial for hydrological analysis and disaster response, and a large literature has studied water extraction in optical multispectral imagery [27]. In SAR imagery, water often exhibits low backscatter and sharp shorelines, providing a complementary cue for delineation [28,29]. Nevertheless, both modalities have failure modes: water extraction from optical imagery is sensitive to cloud, haze, and shadow, while water detection in SAR imagery can be confounded by wind-roughened surfaces and speckle. In multimodal multi-class segmentation, water further competes with visually similar categories (e.g., shadowed urban regions), and fusion conflicts may fragment boundaries and reduce IoU. Boundary-aware learning is therefore critical for thin or elongated structures, and classical edge supervision (e.g., holistically nested edge learning) provides useful motivation for refining semantic boundaries with low-level gradients [30].

Although water surfaces are often highly separable in SAR due to low backscatter and clear shoreline contrast, we observe that multi-class multimodal fusion can still under-optimize the water category. This is not a contradiction, but a consequence of the learning objective and heterogeneous uncertainty across modalities. In particular, (i) optical imagery introduces ambiguous water-like patterns (e.g., cloud shadow, terrain shadow, and dark roofs), while SAR responses are affected by speckle and acquisition geometry; (ii) imperfect co-registration and geometric distortions cause local misalignment between SAR-derived water boundaries and optical textures; and (iii) in a six-class softmax setting, water logits compete with other dominant categories, so gradients from large-area classes may dominate optimization and shift decision boundaries away from the physically consistent SAR cue. As a result, indiscriminate fusion may create conflicting feature evidence near shorelines and mixed pixels, leading to fragmented water masks and reduced IoU.

Knowledge distillation (KD) offers a principled mechanism to transfer knowledge from a strong teacher to a compact student [31]. For semantic segmentation, structured KD explores pixel relations and global consistency [32], while cross-image relational distillation models dataset-level semantic structure [33]. Texture-focused distillation further emphasizes low-level structural and statistical cues that are critical for boundary quality [34]. Feature augmentation and proxy losses have also been proposed to overcome capacity gaps and enrich the distillation signal [35]. In remote sensing, KD is increasingly used to balance accuracy and efficiency and to distill complementary inductive biases from different architectures [36,37]. However, most existing KD schemes still apply global supervision, which can propagate teacher bias to irrelevant regions or classes, an especially risky behavior in multimodal settings with spatially varying uncertainty. Hence, a class-aware and region-selective distillation paradigm is desirable to bring targeted improvements without destabilizing other categories.

Despite progress in optical–SAR fusion and segmentation distillation, most existing methods optimize global objectives and treat all classes and regions similarly. However, in heterogeneous multimodal scenes, modality reliability is spatially varying and certain categories, especially water, may be systematically under-optimized due to fusion conflicts and multi-class competition. Therefore, a category-focused and region-selective learning strategy is needed to strengthen the decision boundary of the bottleneck class without destabilizing other categories.

In this work, we propose a SAR-guided class-aware knowledge distillation framework for multimodal semantic segmentation, targeting the persistent weakness of the water class in optical–SAR six-class land-cover mapping. Our key idea is to train a SAR-based water-expert teacher using an HRNet-style high-resolution segmentation architecture and transfer its structural prior to a lightweight multimodal student via class-aware distillation with confidence-aware gating [38]. From a broader architectural perspective, the design of the proposed framework is also related to representative encoder–decoder and multi-scale segmentation models, including SegNet, U-Net, PSPNet, FPN, and V-Net, as well as dual-attention fusion and recent self-supervised or masked-autoencoder representation learning for remote sensing [39,40,41,42,43,44,45,46]. Specifically, the teacher is specialized for binary water segmentation to exploit the strong separability of water in SAR imagery, while the student performs six-class segmentation on optical–SAR pairs using efficient cross-modal interaction and boundary-aware decoding. To mitigate modal conflicts and class competition in multimodal learning, the distillation loss is applied selectively on pixels where the teacher predictions are sufficiently reliable, with an emphasis on high-confidence water regions to avoid noisy knowledge transfer.

The main contributions of this work are threefold. (1) Problem-driven formulation: We identify that, in optical–SAR multi-class segmentation, water-body prediction can become a persistent bottleneck due to local cross-modal conflicts and softmax competition, even though SAR alone is highly discriminative for water. (2) Method novelty: Different from conventional global fusion training or global distillation, we propose SGCAD, a SAR-guided, class-aware, confidence-gated distillation scheme that transfers teacher knowledge only for the water class and only on reliable pixels. This design mitigates noisy or negative transfer and preserves other classes. (3) Architecture for efficiency and boundaries: We develop a lightweight optical–SAR student with efficient cross-modal interaction and SAR edge-guided boundary refinement, yielding improved water and road delineation with practical efficiency.

The remainder of this paper is organized as follows. Section 2 presents the materials and methods, including the dataset description, problem formulation, and the proposed framework with implementation details. Section 3 reports the experimental results and ablation studies. Section 4 provides further discussion on limitations, failure cases, and practical implications. Finally, Section 5 concludes the paper and outlines future directions.

2. Materials and Methods

2.1. Materials: Study Area and Dataset

Study Areas and Data Sources. We constructed a self-built multimodal dataset over two representative regions in Zhejiang Province, China, including Yuhang District, Hangzhou City, and Sanmen County, Taizhou City. These two areas cover diverse land-cover patterns, such as urban built-up areas, croplands, rivers/lakes, forests, and rural settlements, enabling robust evaluation under complex scene compositions.

The optical imagery was acquired from Gaofen-1 (GF-1) with RGB bands at 2 m ground sampling distance (GSD). The SAR imagery was acquired from the LuTan-1 satellite with 5 m GSD. All data (optical/SAR/label) were exported as georeferenced GeoTIFFs and reprojected into a common projected CRS (WGS84 Transverse Mercator, central meridian 120°E), ensuring consistent map geometry.

Detailed statistics regarding dataset partitioning and patch settings are summarized in Table 1, including the number of patches, patch size, and uniform spatial resolution for training, validation, and test subsets.

Land-Cover Categories and Annotation. Visual examples of the optical images, SAR images, and corresponding pixel-level land-cover annotations in the study areas are presented in Figure 1, along with the color coding scheme for the six land-cover classes.

Pixel-wise annotations include six land-cover classes: farmland, city, village, water body, forest, and road. These categories contain both large homogeneous regions (e.g., farmland/forest) and slender structures (e.g., roads and water boundaries), which are sensitive to boundary fragmentation under multimodal fusion conflicts.

2.1.1. Co-Registration and Radiometric Preprocessing

To obtain co-registered multimodal pairs with physically comparable SAR backscatter, we applied a standard SAR radiometric preprocessing pipeline before patch generation. First, LuTan-1 SAR intensity was radiometrically calibrated to the backscatter coefficient $\sigma^{0}$ using the calibration parameters provided in the product metadata. Second, to reduce the dependence on acquisition geometry, $\sigma^{0}$ was converted to $\gamma^{0}$ via incidence angle normalization,

$$\gamma^{0} = \frac{\sigma^{0}}{\cos(\theta)}, \tag{1}$$

where $\theta$ denotes the incidence angle. Third, DEM-assisted radiometric terrain correction (RTC) was performed to compensate for terrain-induced radiometric distortions and to geocode SAR measurements into a map geometry consistent with the optical reference. During RTC, the local incidence angle $\theta_{loc}$ was derived and used for local incidence angle normalization to mitigate slope-related backscatter variations. After radiometric correction, SAR images were despeckled using a Lee filter with a $7 \times 7$ window and converted to log-intensity (dB). Finally, we applied robust min–max normalization by clipping values to the 2nd and 98th percentiles and scaling them to $[0, 1]$.
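The radiometric steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the processing chain used for the dataset: calibration, RTC, and the Lee filter are omitted, the function names are hypothetical, and the small `eps` floor before the logarithm is an assumption added to avoid `log(0)`.

```python
import numpy as np

def sigma0_to_gamma0(sigma0, theta_rad):
    """Incidence-angle normalization, Eq. (1): gamma0 = sigma0 / cos(theta)."""
    return sigma0 / np.cos(theta_rad)

def to_db(power, eps=1e-10):
    """Linear backscatter to log-intensity (dB); eps is an assumed floor."""
    return 10.0 * np.log10(np.maximum(power, eps))

def robust_minmax(x, lo_pct=2.0, hi_pct=98.0):
    """Clip to the given percentiles and rescale to [0, 1]."""
    lo, hi = np.percentile(x, [lo_pct, hi_pct])
    if hi <= lo:  # constant image: avoid division by zero
        return np.zeros_like(x, dtype=np.float64)
    return (np.clip(x, lo, hi) - lo) / (hi - lo)

# Synthetic stand-ins for a calibrated sigma0 image and its incidence angle.
rng = np.random.default_rng(0)
sigma0 = rng.gamma(shape=1.0, scale=0.05, size=(64, 64))
theta = np.deg2rad(np.full((64, 64), 35.0))

sar_norm = robust_minmax(to_db(sigma0_to_gamma0(sigma0, theta)))
```

Clipping before rescaling keeps a handful of very bright scatterers from compressing the rest of the dynamic range, which is presumably why a percentile-based range is preferred over a plain min–max.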

2.1.2. Resampling and Label Handling

After RTC-based geocoding, SAR images were geometrically aligned to the optical reference using georeferencing information. Both modalities were then resampled to a unified 5 m grid for training: (i) optical RGB was downsampled from 2 m to 5 m using bilinear interpolation; (ii) SAR was resampled onto the same 5 m grid (bilinear interpolation); (iii) the label map was resampled using nearest-neighbor interpolation to preserve discrete class IDs.
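The reason for treating the label map differently can be illustrated with a small nearest-neighbor resampler. The sketch below assumes the labels start on the 2 m optical grid (the source does not state the original label resolution) and ignores georeferencing entirely; `resample_nearest` is a hypothetical helper, not a function from any GIS library.

```python
import numpy as np

def resample_nearest(label, src_gsd=2.0, dst_gsd=5.0):
    """Nearest-neighbour resampling of a label map onto a coarser grid."""
    h, w = label.shape
    out_h = int(h * src_gsd // dst_gsd)
    out_w = int(w * src_gsd // dst_gsd)
    # Centre of each output pixel, expressed in source-pixel indices.
    rows = np.floor((np.arange(out_h) + 0.5) * dst_gsd / src_gsd).astype(int)
    cols = np.floor((np.arange(out_w) + 0.5) * dst_gsd / src_gsd).astype(int)
    return label[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
label = rng.integers(0, 6, size=(10, 10)).astype(np.uint8)  # 20 m x 20 m at 2 m GSD
coarse = resample_nearest(label)                             # 4 x 4 at 5 m GSD
```

Because every output pixel copies one input pixel, the result contains only class IDs that were already present, whereas bilinear interpolation would average neighboring IDs into values that correspond to no class.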

2.1.3. Patch Generation with Leakage-Free Splitting

We adopted a “split-first, crop-later” strategy to avoid spatial leakage. Specifically, the aligned region-wide mosaics were first partitioned into large tiles of size $4096 \times 4096$. These large tiles were then split into training/validation/test subsets at the tile level (0.70:0.15:0.15). Finally, each large tile was cropped into $256 \times 256$ patches, forming the base dataset for model training and evaluation.

2.1.4. Train/Val/Test Selection

To avoid spatial leakage, the train/val/test split is performed at the 4096 × 4096 tile level before cropping. Specifically, large tiles are randomly assigned to train/val/test with a ratio of 0.70/0.15/0.15, and all 256 × 256 patches cropped from one tile inherit the same split. Therefore, no spatial overlap exists across splits at either the tile level or the patch level. The targeted re-sampling strategy (road/village/city) is applied only to the training subset, while validation and test sets remain unchanged to ensure fair evaluation.
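The split-first, crop-later procedure can be sketched as follows. The tile identifiers, the random seed, and the use of non-overlapping base crops are assumptions made for illustration; the source specifies only the tile size, the patch size, and the split ratio.

```python
import numpy as np

def split_tiles(tile_ids, ratios=(0.70, 0.15, 0.15), seed=0):
    """Randomly assign whole tiles to train/val/test before any cropping."""
    rng = np.random.default_rng(seed)
    ids = list(tile_ids)
    order = rng.permutation(len(ids))
    n_train = int(round(ratios[0] * len(ids)))
    n_val = int(round(ratios[1] * len(ids)))
    split = {}
    for rank, idx in enumerate(order):
        if rank < n_train:
            split[ids[idx]] = "train"
        elif rank < n_train + n_val:
            split[ids[idx]] = "val"
        else:
            split[ids[idx]] = "test"
    return split

def crop_patches(tile, patch=256):
    """Yield non-overlapping crops; every crop inherits the split of its tile."""
    h, w = tile.shape[:2]
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            yield (r, c), tile[r:r + patch, c:c + patch]

tile_ids = [f"tile_{i:02d}" for i in range(20)]      # hypothetical tile names
split = split_tiles(tile_ids)

demo_tile = np.zeros((4096, 4096), dtype=np.uint8)   # placeholder tile content
patches = list(crop_patches(demo_tile))              # 16 x 16 = 256 patches
```

Since the split label is attached to the tile and never to an individual patch, two patches from the same neighborhood cannot end up on opposite sides of the train/test boundary.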

2.1.5. Targeted Re-Sampling for Rare Classes (Training Only)

To mitigate class imbalance and strengthen slender or minority categories, we further performed targeted re-sampling on the training set, focusing on road/village/city. An overlapping sliding window (window size 256, stride 128) scanned label patches to identify candidate windows with sufficient target pixels, filtered by (a) a minimum pixel count threshold (≥10 pixels) and (b) a minimum ratio threshold (≥3%). Selected patches were merged into the training set as additional samples, while validation/test subsets remained unchanged for fair evaluation.
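A possible implementation of the window scan is sketched below. Several details are assumptions: both thresholds are required to hold simultaneously, the ratio is computed over non-ignored pixels, and the target class IDs (city = 1, village = 2, road = 5) are inferred from the class order given in the problem formulation.

```python
import numpy as np

TARGET_IDS = (1, 2, 5)  # city, village, road (inferred from the stated class order)

def select_windows(label, window=256, stride=128, min_pixels=10,
                   min_ratio=0.03, target_ids=TARGET_IDS, ignore=255):
    """Return top-left corners of windows that contain enough target pixels."""
    keep = []
    h, w = label.shape
    for r in range(0, h - window + 1, stride):
        for c in range(0, w - window + 1, stride):
            win = label[r:r + window, c:c + window]
            n_target = int(np.isin(win, target_ids).sum())
            n_valid = int((win != ignore).sum())
            ratio = n_target / n_valid if n_valid else 0.0
            if n_target >= min_pixels and ratio >= min_ratio:
                keep.append((r, c))
    return keep

# Toy label map: farmland everywhere except a 100 x 100 block of road pixels.
label = np.zeros((512, 512), dtype=np.uint8)
label[100:200, 100:200] = 5
selected = select_windows(label)
```

In this toy case the four windows overlapping the road block are kept and the five windows that miss it are discarded.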

Optical RGB patches were linearly scaled to $[0, 1]$ and then normalized using the same robust percentile strategy as SAR for consistent dynamic range control: values were clipped to the 2nd and 98th percentiles and rescaled to $[0, 1]$. All modalities (optical, SAR) and label maps were stored as GeoTIFF/PNG patches with a fixed patch size of $256 \times 256$. The label value 255 denotes ignored pixels (outside the valid study area), and these are excluded from loss computation and metric evaluation.

2.2. Problem Formulation

Given a co-registered optical–SAR pair $(I_{opt}, I_{sar})$, where $I_{opt} \in \mathbb{R}^{3 \times H \times W}$ and $I_{sar} \in \mathbb{R}^{1 \times H \times W}$, the goal is to predict a six-class segmentation map $Y \in \{0, 1, 2, 3, 4, 5, 255\}^{H \times W}$. Here, class IDs 0–5 correspond to the six land-cover categories (farmland, city, village, water body, forest, and road), and the value 255 denotes ignored pixels (e.g., areas outside the study region) that are excluded from loss computation and metric evaluation. To enhance the water-body category, a SAR-only teacher model produces a water probability map $P_{T} \in [0, 1]^{H \times W}$ for class-aware distillation.

2.3. Overview of the Proposed SGCAD Framework

The overall architecture of the proposed SAR-guided class-aware distillation (SGCAD) framework is illustrated in Figure 2, which systematically integrates a SAR-only teacher stream, a multimodal student stream, and a class-aware distillation module to address multimodal fusion conflicts for water-body segmentation.

Our method adopts a two-stage teacher–student training paradigm: (1) Stage 1 trains the multimodal student network for six-class land-cover segmentation using standard cross-entropy loss; (2) Stage 2 fine-tunes the student with SAR-guided class-aware distillation, where the SAR-only teacher provides supervision only for high-confidence water regions to prevent the propagation of noisy global distillation signals.

Notably, our objective is not to rectify flawed fusion designs through post-hoc corrections, but to address a structural limitation of global multimodal optimization objectives: when optical and SAR modalities exhibit local disagreements, the student network receives conflicting supervision signals for the water class in a multi-class segmentation setting. By introducing a SAR-specialized water-body expert and implementing class-aware, confidence-gated distillation, we inject physically consistent SAR-derived priors into the water class decision boundary while avoiding unnecessary global constraints on other land-cover classes.

For clarity, the teacher stream in Figure 2 corresponds to the SAR-only HRNet architecture detailed in Figure 3, while the student stream refers to the LightMCANet structure presented in Figure 4. The SEGM modules embedded in the teacher and student streams are implemented as described in Figure 3 and Figure 4, respectively.

Although the use of HRNet as a feature extractor is not novel, our work extends this architecture with two key innovations tailored to multimodal remote sensing segmentation: (1) a novel class-aware distillation framework that restricts knowledge distillation to the water class, thereby mitigating class imbalance and multimodal conflict issues; (2) the integration of a SAR-guided edge refinement module (SEGM) to enhance boundary precision for slender land-cover structures (e.g., roads and water bodies). These contributions are specifically designed to address the unique challenges of multimodal optical–SAR fusion and improve segmentation performance in heterogeneous land-cover environments.

2.4. Student Network: LightMCANet for Multimodal Six-Class Segmentation

The architecture of the lightweight multimodal student network (LightMCANet) designed for six-class optical–SAR land-cover segmentation is illustrated in Figure 4, which details the pseudo-Siamese encoder, cross-modal attention module, FPN-style decoder, and SAR edge guidance module (SEGM) integrated into the network.

Pseudo-Siamese Encoder and Cross-Modal Interaction. LightMCANet adopts a pseudo-Siamese encoder to separately extract modality-specific features from optical and SAR inputs, alleviating early fusion conflicts caused by the heterogeneous characteristics of optical and SAR data. To model high-order cross-modal correlations while reducing the quadratic computational complexity of standard attention mechanisms, a lightweight cross-modal attention module (LightMCAM) is proposed. In LightMCAM, adaptive pooling is first applied to compress SAR features before constructing key and value matrices, balancing fusion performance and computational efficiency. The mathematical formulation of LightMCAM is defined as follows:

$$Q = W_{q}(F_{opt}) \tag{2}$$

$$K = W_{k}(P(F_{sar})), \quad V = W_{v}(P(F_{sar})) \tag{3}$$

$$F_{att} = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V \tag{4}$$

where $Q$ denotes the query feature derived from the optical modality-specific feature $F_{opt}$ via the learnable linear projection matrix $W_{q}$; $K$ (key feature) and $V$ (value feature) are generated by applying adaptive pooling $P(\cdot)$ to the SAR modality-specific feature $F_{sar}$ (for dimension reduction and computational efficiency) followed by linear projections using $W_{k}$ and $W_{v}$, respectively; $K^{\top}$ represents the transpose of the key feature matrix; $d_{k}$ is the dimension of the key feature $K$, used to scale the attention scores so that the softmax does not saturate and gradients do not vanish; $\mathrm{Softmax}(\cdot)$ normalizes the attention weights to the range $[0, 1]$ so that each query's weights form a valid probability distribution; and $F_{att}$ is the final cross-modal attention fusion feature that integrates complementary information from optical and SAR modalities.
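Equations (2)–(4) can be made concrete with a NumPy sketch. The pooling type (average), the pooled grid size (8 × 8), the feature dimensions, and the random projection matrices are all assumptions chosen for illustration; the source does not specify them in this section.

```python
import numpy as np

def adaptive_avg_pool(feat, out_hw):
    """Average-pool (C, H, W) to (C, out_hw, out_hw); H and W must divide evenly."""
    c, h, w = feat.shape
    return feat.reshape(c, out_hw, h // out_hw, out_hw, w // out_hw).mean(axis=(2, 4))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def light_cross_attention(f_opt, f_sar, w_q, w_k, w_v, pooled_hw=8):
    """Eqs. (2)-(4): optical queries attend to pooled SAR keys and values."""
    c, h, w = f_opt.shape
    q = f_opt.reshape(c, h * w).T @ w_q                                # (HW, d_k)
    sar_tokens = adaptive_avg_pool(f_sar, pooled_hw).reshape(c, -1).T  # (S, C)
    k = sar_tokens @ w_k                                               # (S, d_k)
    v = sar_tokens @ w_v                                               # (S, d_v)
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)             # (HW, S)
    return (attn @ v).T.reshape(-1, h, w), attn

rng = np.random.default_rng(0)
f_opt = rng.normal(size=(16, 32, 32))
f_sar = rng.normal(size=(16, 32, 32))
w_q, w_k, w_v = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
fused, attn = light_cross_attention(f_opt, f_sar, w_q, w_k, w_v)
```

With a 32 × 32 feature map, the attention matrix has 1024 × 64 entries instead of the 1024 × 1024 that unpooled keys would require, which is the source of the efficiency gain described above.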

FPN Decoder and SAR Edge Guidance Module (SEGM). An FPN-style decoder is employed to aggregate multi-scale features from the encoder and recover fine-grained spatial details for accurate segmentation. To enhance boundary continuity for slender land-cover structures (e.g., roads and water boundaries), a SAR edge guidance module (SEGM) is integrated into the decoder stage. The SEGM generates an edge-aware gating mask from shallow SAR features (which preserve high-resolution edge information) to amplify boundary responses in the decoder features. The mathematical formulation of SEGM is defined as:

$$G = \sigma(F_{conv}(F_{sar}^{1})), \quad F_{refined} = F_{dec} \odot (1 + G), \tag{5}$$

where $G$ is the SAR edge-guided gating mask; $\sigma(\cdot)$ denotes the Sigmoid activation function that maps convolution outputs to the range $[0, 1]$ for gating control; $F_{conv}(\cdot)$ represents a stack of convolutional layers designed to extract edge information from $F_{sar}^{1}$ (shallow SAR feature from the first encoder layer, which is rich in gradient and boundary cues); $F_{dec}$ is the original fusion feature of the FPN decoder; $\odot$ denotes element-wise multiplication (Hadamard product); and $F_{refined}$ is the decoder feature optimized by SEGM to enhance the continuity and sharpness of land-cover boundaries (especially for slender structures such as water shorelines and roads).
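The gating in Eq. (5) is residual: because $G$ lies in $(0, 1)$, each decoder activation is scaled by a factor between 1 and 2 and is never suppressed. A minimal sketch follows, in which the convolution stack $F_{conv}$ is reduced to a single 1 × 1 convolution producing a one-channel gate; both simplifications are assumptions, as are the tensor shapes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def segm_refine(f_dec, f_sar1, w_edge, b_edge=0.0):
    """Eq. (5) with F_conv approximated by one 1x1 convolution (assumption).

    f_dec:  (C, H, W) decoder feature
    f_sar1: (C_s, H, W) shallow SAR feature at the same spatial size
    w_edge: (C_s,) weights of the 1x1 convolution
    """
    gate = sigmoid(np.tensordot(w_edge, f_sar1, axes=(0, 0)) + b_edge)  # (H, W)
    return f_dec * (1.0 + gate)[None, :, :], gate

rng = np.random.default_rng(0)
f_dec = rng.normal(size=(8, 16, 16))
f_sar1 = rng.normal(size=(4, 16, 16))
w_edge = rng.normal(size=4)
f_refined, gate = segm_refine(f_dec, f_sar1, w_edge)
```

Using $1 + G$ rather than $G$ alone means that a gate close to zero leaves the decoder feature unchanged instead of zeroing it, so regions without SAR edges are passed through.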

2.5. Teacher Network: HRNet Water-Body Expert (SAR-Only)

A high-resolution network (HRNet)-based segmentation model is employed as a SAR-only water-body expert teacher [28] to provide reliable water-body priors for subsequent class-aware distillation. Unlike the multimodal student network that must address cross-modal fusion conflicts between optical and SAR data, the teacher network focuses exclusively on a single, physically consistent cue: the strong backscattering contrast of water bodies in SAR imagery, which remains stable under varying illumination conditions (e.g., cloud cover, shadow) and weather scenarios (e.g., rain, fog). HRNet is selected as the backbone for the teacher network due to its ability to maintain high-resolution feature representations throughout the network, enabling accurate shoreline delineation and reducing boundary fragmentation for slender water-body structures.

The architecture of the SAR-specific HRNet teacher network designed for binary water-body segmentation is illustrated in Figure 3, which details the multi-resolution feature extraction branches, SAR edge guidance module (SEGM), and classification head integrated into the network.

2.5.1. Input and Label Construction

Given a co-registered SAR intensity patch $I_{sar} \in \mathbb{R}^{H \times W}$ (where $H$ and $W$ denote the height and width of the patch, respectively), we feed it into HRNet after lightweight channel adaptation to match the network’s input interface. Specifically, the 8-bit SAR intensity (pixel values ranging from 0 to 255) is first normalized to the range $[0, 1]$, followed by channel replication to convert the single-channel SAR data into a three-channel format. This process is formulated as:

$$\tilde{I}_{sar} = \frac{I_{sar}}{255}, \quad X = \mathrm{Rep}(\tilde{I}_{sar}) \in \mathbb{R}^{3 \times H \times W}, \tag{6}$$

where $\tilde{I}_{sar}$ represents the normalized single-channel SAR image; $\mathrm{Rep}(\cdot)$ denotes the channel replication operator; and $X$ is the final three-channel input feature of the HRNet teacher. For supervision, the original six-class land-cover label map $Y \in \{0, \dots, 5, 255\}^{H \times W}$ (class IDs 0–5 correspond to farmland, city, village, water body, forest, road; 255 denotes ignored pixels) is converted into a binary water mask $Y^{w} \in \{0, 1, 255\}^{H \times W}$, where:

$$Y^{w}(x) = \begin{cases} 1, & Y(x) = c_{w}, \\ 0, & Y(x) \neq c_{w} \land Y(x) \neq 255, \\ 255, & Y(x) = 255. \end{cases} \tag{7}$$

Here, $x$ denotes the pixel coordinate; $c_{w}$ is the class index of the water body in our annotation system (set to 3); and $\land$ represents the logical “AND” operator. The binary mask $Y^{w}$ enables the teacher to focus exclusively on water-body segmentation without interference from other land-cover categories.
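Equations (6) and (7) translate directly into array operations. The following sketch uses hypothetical function names and a toy 2 × 3 label map; only the water index 3 and the ignore value 255 are taken from the text.

```python
import numpy as np

WATER_ID = 3   # c_w in Eq. (7)
IGNORE = 255

def make_teacher_input(sar_u8):
    """Eq. (6): scale 8-bit SAR to [0, 1] and replicate to three channels."""
    sar = sar_u8.astype(np.float32) / 255.0
    return np.repeat(sar[None, :, :], 3, axis=0)     # (3, H, W)

def make_water_mask(label):
    """Eq. (7): 1 = water, 0 = any other valid class, 255 = ignored."""
    mask = np.zeros_like(label, dtype=np.uint8)
    mask[label == WATER_ID] = 1
    mask[label == IGNORE] = IGNORE
    return mask

label = np.array([[0, 3, 255],
                  [5, 3, 1]], dtype=np.uint8)
water_mask = make_water_mask(label)

sar_u8 = np.array([[0, 128, 255],
                   [64, 32, 16]], dtype=np.uint8)
x_in = make_teacher_input(sar_u8)
```

Keeping 255 in the binary mask, rather than folding ignored pixels into the non-water class, lets the same valid-pixel mask be reused when the loss is computed.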

2.5.2. HRNet High-Resolution Fusion

HRNet [28] maintains parallel multi-resolution feature streams throughout the network and repeatedly exchanges information across resolutions to preserve high-resolution details. Let $\{F_{b}^{(s)}\}_{b = 1}^{B}$ denote the feature maps at stage $s$ from $B$ parallel branches (sorted from high to low resolution). The cross-resolution fusion process for the $b$-th branch is defined as:

$$\hat{F}_{b}^{(s)} = \sum_{j = 1}^{B} \phi_{j \to b}(F_{j}^{(s)}), \tag{8}$$

where $\hat{F}_{b}^{(s)}$ is the fused feature of the $b$-th branch at stage $s$; $\phi_{j \to b}(\cdot)$ denotes the resolution alignment operator, which includes upsampling/downsampling (to match the resolution of the $b$-th branch) followed by convolution and summation fusion; and $F_{j}^{(s)}$ is the original feature of the $j$-th branch at stage $s$. After multi-scale fusion, the final segmentation head outputs per-pixel logits $Z_{T} \in \mathbb{R}^{2 \times H \times W}$ (two channels corresponding to non-water and water classes) and the teacher’s water probability map $P_{T} \in [0, 1]^{H \times W}$:

$$P_{T}(x) = \mathrm{Softmax}(Z_{T}(x))_{1}, \tag{9}$$

where $\mathrm{Softmax}(\cdot)$ is the normalization function to convert logits into probabilities; the subscript 1 indicates selecting the probability value of the “water” channel (the second channel in the two-channel logit tensor).
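The sketch below illustrates the shape bookkeeping behind Eq. (8) and the channel selection in Eq. (9). It is a deliberate simplification: the convolutions inside $\phi_{j \to b}$ are omitted, all branches are assumed to share the same channel count, and resolutions are aligned with nearest-neighbor upsampling or average pooling. None of these choices should be read as the HRNet implementation.

```python
import numpy as np

def resize_to(feat, hw):
    """Align a (C, H, H) map to hw x hw by nearest upsampling or average pooling."""
    c, h, _ = feat.shape
    if h == hw:
        return feat
    if h < hw:
        f = hw // h
        return feat.repeat(f, axis=1).repeat(f, axis=2)
    f = h // hw
    return feat.reshape(c, hw, f, hw, f).mean(axis=(2, 4))

def fuse_branch(feats, b):
    """Eq. (8) for branch b, with the convolutions in phi omitted (assumption)."""
    hw = feats[b].shape[1]
    return sum(resize_to(f, hw) for f in feats)

def water_probability(logits):
    """Eq. (9): softmax over the two channels of Z_T, keeping channel 1 (water)."""
    z = logits - logits.max(axis=0, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e[1] / e.sum(axis=0)

feats = [np.ones((4, 16, 16)), np.ones((4, 8, 8)), np.ones((4, 4, 4))]
fused_high = fuse_branch(feats, 0)
fused_low = fuse_branch(feats, 2)

logits = np.zeros((2, 4, 4))
logits[1, :2] = 2.0          # top half: water logit exceeds non-water by 2
p_t = water_probability(logits)
```

For two classes the softmax reduces to a sigmoid of the logit difference, so `p_t` equals `1 / (1 + exp(-(z_1 - z_0)))` at every pixel.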

2.5.3. Optimization Objective (Pixel-Wise Supervision with Ignored Pixels)

The teacher model is optimized using cross-entropy loss, with ignored pixels (labeled 255) excluded from loss computation. First, we define a valid-pixel indicator $M(x) = \mathbb{I}(Y^{w}(x) \neq 255)$, where $\mathbb{I}(\cdot)$ is the indicator function (returning 1 if the condition holds, 0 otherwise). The primary cross-entropy loss is formulated as:

$$\mathcal{L}_{CE} = -\sum_{x} M(x) \sum_{k \in \{0, 1\}} \mathbb{I}(Y^{w}(x) = k) \log\left(\mathrm{Softmax}(Z_{T}(x))_{k}\right), \tag{10}$$

where $k \in \{0, 1\}$ represents the binary classification labels (0 = non-water, 1 = water); $\log(\cdot)$ denotes the natural logarithm; and the outer sum $\sum_{x}$ iterates over all pixels in the patch. To alleviate potential class imbalance between water and non-water pixels, we optionally adopt inverse-frequency class weights. Let $n_{k}$ be the total number of pixels of class $k$ in the training set; the weight for class $k$ is defined as:

$$w_{k} = \frac{n_{0} + n_{1}}{2 n_{k}}, \quad k \in \{0, 1\}, \tag{11}$$

where $n_{0}$ and $n_{1}$ are the pixel counts of non-water and water classes, respectively; the denominator $2 n_{k}$ ensures the weights are normalized to balance the contribution of each class.
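A small numerical sketch of Eqs. (10) and (11) follows. How the class weights enter the loss is not spelled out in the text; here each pixel's term is multiplied by the weight of its ground-truth class, which is a common convention and should be read as an assumption. For brevity the pixel counts are taken from a single toy mask rather than from a whole training set.

```python
import numpy as np

IGNORE = 255

def inverse_frequency_weights(mask):
    """Eq. (11): w_k = (n_0 + n_1) / (2 n_k), counted over valid pixels."""
    n0 = int((mask == 0).sum())
    n1 = int((mask == 1).sum())
    return np.array([(n0 + n1) / (2.0 * n0), (n0 + n1) / (2.0 * n1)])

def masked_cross_entropy(logits, mask, class_weights=None):
    """Eq. (10) summed over valid pixels, with optional per-class weights."""
    z = logits - logits.max(axis=0, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=0, keepdims=True))  # log-softmax
    valid = mask != IGNORE
    target = np.where(valid, mask, 0).astype(np.int64)  # dummy class where ignored
    picked = np.take_along_axis(log_prob, target[None], axis=0)[0]
    w = np.ones(2) if class_weights is None else np.asarray(class_weights)
    return float(-(valid * w[target] * picked).sum())

mask = np.array([[1, 0, 0, 255],
                 [0, 0, 0, 0]], dtype=np.uint8)
logits = np.zeros((2, 2, 4))                 # uninformative teacher: p = 0.5
weights = inverse_frequency_weights(mask)    # n_0 = 6, n_1 = 1
loss_plain = masked_cross_entropy(logits, mask)
loss_weighted = masked_cross_entropy(logits, mask, weights)
```

With these weights the single water pixel contributes as much as the six non-water pixels combined, while the total weight still sums to the number of valid pixels, so the two losses coincide for an uninformative prediction.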

2.5.4. (Optional) SAR-Guided Edge-Aware Regularization (Shoreline Emphasis)

To further encode SAR geometric cues and sharpen water shorelines, we introduce a lightweight edge-aware regularizer that leverages inherent SAR gradient information without requiring additional manual annotations. First, we compute the gradient magnitude map of the normalized SAR image (e.g., using the Sobel operator) and normalize it to the range

[0, 1]

:

E_{sar} = Norm ({∥\nabla {\tilde{I}}_{sar}∥}_{2}) \in {[0, 1]}^{H \times W},

(12)

where ∇ denotes the gradient operator; {∥ \cdot ∥}_{2} is the L2 norm used to compute the gradient magnitude; and Norm (\cdot) is the min-max normalization function. Meanwhile, we derive a binary boundary target from the water mask Y^{w} via morphological gradient:

B^{w} = I (|\nabla I (Y^{w} = 1)| > 0),

(13)

where I (Y^{w} = 1) converts the binary water mask into a logical tensor; the inner ∇ computes the morphological gradient to extract water boundary pixels; and the outer I (\cdot) converts gradient values greater than 0 into binary boundary labels (1 = boundary pixel, 0 = non-boundary pixel). We then weight shoreline pixels using E_{sar} and apply an edge-aware binary cross-entropy (BCE) loss:

L_{edge} = - \sum_{x} M (x) (1 + η E_{sar} (x)) [B^{w} (x) \log P_{T} (x) + (1 - B^{w} (x)) \log (1 - P_{T} (x))],

(14)

where η is a hyperparameter controlling the strength of SAR edge emphasis (set to 0.5 in our experiments); E_{sar} (x) denotes the normalized gradient magnitude at pixel x, which assigns higher weights to edge regions in SAR images; and the remaining terms follow the definition of standard BCE loss.
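
Equations (12)–(14) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the 3 × 3 Sobel kernels with edge-replicated padding, the 3 × 3 structuring element for the morphological gradient (dilation minus erosion), the probability clipping, and all function names are choices made here for the example.

```python
import numpy as np


def sobel_magnitude(img):
    """L2 gradient magnitude from 3x3 Sobel kernels (edge-replicated padding)."""
    p = np.pad(img.astype(float), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.sqrt(gx ** 2 + gy ** 2)


def min_max(a, eps=1e-8):
    """Norm(.) in Equation (12): rescale to [0, 1]."""
    return (a - a.min()) / (a.max() - a.min() + eps)


def boundary_target(water_mask):
    """Equation (13): morphological gradient (dilation minus erosion) > 0."""
    m = water_mask.astype(bool)
    p = np.pad(m, 1, mode="edge")
    h, w = m.shape
    windows = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    dilated = np.logical_or.reduce(windows)
    eroded = np.logical_and.reduce(windows)
    return (dilated & ~eroded).astype(float)


def edge_loss(p_water, boundary, e_sar, valid, eta=0.5, eps=1e-7):
    """Equation (14): BCE weighted by (1 + eta * E_sar) on valid pixels."""
    p = np.clip(p_water, eps, 1 - eps)  # clipping avoids log(0)
    bce = -(boundary * np.log(p) + (1 - boundary) * np.log(1 - p))
    return float((valid * (1 + eta * e_sar) * bce).sum())
```

With a 3 × 3 structuring element the boundary target is two pixels wide (one on each side of the shoreline); a different element would change the width.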

Finally, the total optimization objective of the teacher model is the weighted sum of the primary cross-entropy loss and the edge-aware regularization loss:

L_{T} = L_{CE} + λ_{edge} L_{edge},

(15)

where λ_{edge} is the weight coefficient of the edge-aware loss (set to 0.3 in our experiments to balance the two loss terms; set to 0 if edge regularization is disabled).

2.6. SAR-Guided Class-Aware Knowledge Distillation

This section describes the proposed SAR-guided class-aware knowledge distillation (SGCAD) strategy implemented in Stage 2 of the training pipeline. Unlike conventional global distillation methods that apply supervision to all pixels and land-cover classes, SGCAD transfers knowledge exclusively for the water-body class and only on reliable pixels selected via a confidence-gated mask. This targeted design mitigates noisy supervision signals and avoids negative transfer to other land-cover categories, directly addressing the core challenge of water-body under-optimization in multimodal optical–SAR fusion.

Teacher/Student Probabilities for the Water Class. The SAR-only teacher network (HRNet) is specialized for binary water-body segmentation and outputs per-pixel binary logits. For notational simplicity, we denote the teacher’s effective water logit as z_{T} \in R^{H \times W} and compute the corresponding water probability map as:

P_{T} (x, y) = σ (z_{T} (x, y)) \in [0, 1],

(16)

where σ (\cdot) denotes the sigmoid activation function (mapping logits to the probability range [0, 1]); (x, y) represents the 2D pixel coordinate; and H and W are the height and width of the input patch, respectively. Note that for binary classification, a two-logit Softmax formulation is equivalent to applying Sigmoid to an effective binary logit; thus, we use the Sigmoid notation above for convenience and consistency in the distillation formulation.
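
The Softmax–Sigmoid equivalence noted above follows in one line. Writing the two teacher logits at a pixel as Z_{T,0} (non-water) and Z_{T,1} (water), a notation introduced here only for illustration, the water probability of Equation (9) becomes:

```latex
\mathrm{Softmax}(Z_T)_1
  = \frac{e^{Z_{T,1}}}{e^{Z_{T,0}} + e^{Z_{T,1}}}
  = \frac{1}{1 + e^{-(Z_{T,1} - Z_{T,0})}}
  = \sigma\left(Z_{T,1} - Z_{T,0}\right)
```

Under that notation, the effective water logit in Equation (16) can be read as z_{T} = Z_{T,1} - Z_{T,0}.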

The multimodal student network (LightMCANet) performs six-class land-cover segmentation, predicting a six-class logit tensor z_{S} \in R^{6 \times H \times W} (one channel per land-cover class) and the softmax-normalized probability distribution:

P_{S}^{c} (x, y) = Softmax {(z_{S} (x, y))}_{c}, c \in \{0, \dots, 5\},

(17)

where Softmax (\cdot) normalizes logits across the six-class dimension so that the probabilities sum to 1; c denotes the class index (0 = farmland, 1 = city, 2 = village, 3 = water body, 4 = forest, 5 = road); and the student’s probability for the water-body class is denoted as P_{S}^{w} (x, y) with w = 3 (i.e., c = 3).

Confidence-Gated Distillation Mask (Selective-GT Strategy). Instead of applying distillation globally across all pixels, SGCAD activates distillation supervision only on pixels where the teacher’s prediction is sufficiently reliable, thereby avoiding uncertain regions (e.g., mixed pixels, complex shorelines) that may introduce noisy supervision. To achieve this, two sets of reliable pixels are constructed to gate the distillation loss:

Reliable water pixels: Pixels predicted as water by the teacher with high confidence and consistent with the ground-truth water label. This set transfers the teacher’s structural prior for water-body delineation while excluding ambiguous or mixed-pixel regions.
Reliable non-water pixels: Pixels predicted as non-water by the teacher with high confidence and consistent with the ground truth. This set prevents over-expansion of the water-body region and reduces the propagation of teacher errors to other land-cover classes.

The final distillation mask is defined as the union of these two sets, and the distillation loss is computed exclusively within this masked region to ensure supervision is only applied to high-confidence, ground-truth-consistent pixels.

2.6.1. High-Confidence Water Pixels

M_{hi} (x, y) = I (P_{T} (x, y) > τ_{hi}) \cdot I (Y_{gt} (x, y) = w),

(18)

where I (\cdot) is the indicator function (returning 1 if the condition holds, 0 otherwise); τ_{hi} is the high-confidence threshold for the teacher’s water prediction (set to 0.95 based on validation experiments); Y_{gt} (x, y) is the ground-truth label at pixel (x, y); and w denotes the water-body class (index 3).

2.6.2. Low-Water-Probability (Confident Non-Water) Pixels

M_{lo} (x, y) = I (P_{T} (x, y) < τ_{lo}) \cdot I (Y_{gt} (x, y) \neq w),

(19)

where τ_{lo} is the low threshold on the teacher’s water probability (set to 0.15, so that only pixels the teacher confidently rejects as water are included). The final distillation mask is the union of M_{hi} and M_{lo}:

M (x, y) = max (M_{hi} (x, y), M_{lo} (x, y)),

(20)

where max (\cdot) denotes the element-wise maximum operation, which realizes the mask union. This selective design ensures only ground-truth-consistent pixels participate in distillation, avoiding the propagation of teacher prediction errors to the student network.
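
The mask construction of Equations (18)–(20) can be illustrated with a few lines of NumPy. The function name and array layout are hypothetical; the thresholds and the water index follow the values given in the text.

```python
import numpy as np

WATER = 3  # water-body class index in the six-class scheme


def distillation_mask(p_teacher, y_gt, tau_hi=0.95, tau_lo=0.15, water=WATER):
    """Return M = max(M_hi, M_lo) as a float array of zeros and ones.

    p_teacher: (H, W) teacher water probabilities in [0, 1].
    y_gt:      (H, W) integer ground-truth labels.
    """
    m_hi = (p_teacher > tau_hi) & (y_gt == water)  # Equation (18)
    m_lo = (p_teacher < tau_lo) & (y_gt != water)  # Equation (19)
    return np.maximum(m_hi, m_lo).astype(np.float32)  # Equation (20)
```

A pixel where the teacher is confident about water but the ground truth says otherwise falls into neither set, so it is excluded from distillation.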

Masked BCE Distillation Loss. To align the student’s water-body probability predictions with the teacher’s reliable SAR-derived priors, a masked binary cross-entropy (BCE) loss is adopted, which restricts supervision to the distillation mask region defined above:

L_{KD} = \frac{1}{| M |} \sum_{x, y} M (x, y) \cdot BCE (P_{S}^{w} (x, y), P_{T} (x, y)),

(21)

where | M | = \sum_{x, y} M (x, y) is the total number of valid pixels in the distillation mask (normalizing the loss to avoid scale bias from varying mask sizes); and BCE (\cdot) is the binary cross-entropy loss function, which measures the discrepancy between the student’s water-body probability P_{S}^{w} (x, y) and the teacher’s water-body probability P_{T} (x, y).
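
A minimal NumPy sketch of Equation (21) is given below. It assumes that the teacher probability acts as a soft target and the student probability as the prediction inside the BCE term, and that an empty mask contributes zero loss; neither detail is spelled out in the text, and the function names are hypothetical.

```python
import numpy as np


def softmax(z, axis=0):
    """Numerically stable softmax along the class axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def masked_bce_kd(student_logits, p_teacher, mask, water=3, eps=1e-7):
    """Masked BCE between student water probability and teacher probability.

    student_logits: (6, H, W); p_teacher, mask: (H, W).
    """
    p_s = np.clip(softmax(student_logits, axis=0)[water], eps, 1 - eps)
    bce = -(p_teacher * np.log(p_s) + (1 - p_teacher) * np.log(1 - p_s))
    n_active = mask.sum()  # |M|
    if n_active == 0:
        return 0.0  # assumption: no reliable pixels means no KD signal
    return float((mask * bce).sum() / n_active)
```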

Overall Objective with Warm-Up Scheduling. The segmentation loss in Stage 2 retains the same formulation as Stage 1, combining weighted cross-entropy (wCE), Dice loss, and Lovász loss to address class imbalance and improve the segmentation of slender structures (e.g., roads, water shorelines):

L_{seg} = α L_{wCE} + β L_{Dice} + γ L_{Lovasz},

(22)

where α = 0.6, β = 0.2, and γ = 0.2 are the weight coefficients for the three loss terms (determined via validation); L_{wCE} is the weighted cross-entropy loss with inverse-frequency class weights to mitigate class imbalance; L_{Dice} enhances the segmentation of small or slender targets; and L_{Lovasz} directly optimizes the intersection-over-union (IoU) metric for improved boundary precision.
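
To make the weighting in Equation (22) concrete, the sketch below combines three precomputed loss terms and includes one common soft Dice formulation. The exact Dice variant used by the authors is not stated, so the class-averaged form and the smoothing constant are assumptions; the Lovász term is taken as an input because its implementation is beyond the scope of this example.

```python
import numpy as np


def soft_dice_loss(probs, labels, num_classes=6, eps=1e-6):
    """Class-averaged soft Dice loss; probs: (C, H, W), labels: (H, W)."""
    losses = []
    for c in range(num_classes):
        target = (labels == c).astype(float)
        intersection = (probs[c] * target).sum()
        denominator = probs[c].sum() + target.sum()
        losses.append(1.0 - (2.0 * intersection + eps) / (denominator + eps))
    return float(np.mean(losses))


def segmentation_loss(l_wce, l_dice, l_lovasz, alpha=0.6, beta=0.2, gamma=0.2):
    """Equation (22): weighted sum of the three loss terms."""
    return alpha * l_wce + beta * l_dice + gamma * l_lovasz
```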

To ensure the student network first retains its basic multimodal segmentation ability before adapting to the teacher’s SAR-derived priors, a warm-up strategy is implemented that disables distillation during the initial training epochs:

g (e) = I (e > E_{w}), E_{w} = 10,

(23)

where g (e) is the warm-up gating function; e is the training epoch; and E_{w} = 10 denotes the warm-up period (distillation is disabled for the first 10 epochs of Stage 2). The total loss objective for Stage 2 training is:

L = L_{seg} + λ g (e) L_{KD},

(24)

where λ = 0.005 is the distillation loss weight (set to a small value to prevent over-distillation and preserve the student’s multimodal representation ability for non-water classes).
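
Equations (23) and (24) amount to a step gate on the distillation term. The sketch below assumes that epochs are counted from 1, so that epochs 1 to 10 train with the segmentation loss only; the function names are hypothetical.

```python
def warmup_gate(epoch: int, warmup_epochs: int = 10) -> int:
    """Equation (23): g(e) = 1 if e > E_w, else 0 (epochs assumed 1-indexed)."""
    return 1 if epoch > warmup_epochs else 0


def stage2_loss(l_seg: float, l_kd: float, epoch: int,
                lam: float = 0.005, warmup_epochs: int = 10) -> float:
    """Equation (24): L = L_seg + lambda * g(e) * L_KD."""
    return l_seg + lam * warmup_gate(epoch, warmup_epochs) * l_kd
```

With L_{seg} = 1.0 and L_{KD} = 2.0, the total loss is 1.0 at epoch 5 and 1.01 at epoch 11.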

KD Region Ratio (Reporting Protocol). To quantify the scope of distillation supervision across training epochs, the KD region ratio is defined as the percentage of distillation-active pixels (within the mask) relative to the total number of pixels in each training mini-batch:

ρ_{KD} (e) = \frac{1}{N} \sum_{n = 1}^{N} \frac{| M_{n} (e) |}{H W} \times 100\%,

(25)

where N is the number of training mini-batches in epoch e; M_{n} (e) is the distillation mask of the n-th mini-batch; | M_{n} (e) | is the number of valid pixels in M_{n} (e); and H W is the total number of pixels per input patch. The epoch-average ratio {\bar{ρ}}_{KD} = \frac{1}{E} \sum_{e = 1}^{E} ρ_{KD} (e) (over E = 80 total Stage 2 epochs) is reported to provide transparency on the scale of distillation supervision.
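
The reporting protocol of Equation (25) can be sketched as below. One assumption is made explicit: each mask is normalized by its own element count, which equals H W for a single patch and B H W if a mask stacks B patches, so the ratio stays within 0–100% in either case. The function name is hypothetical.

```python
import numpy as np


def kd_region_ratio(masks):
    """Percentage of distillation-active pixels, averaged over the masks of one epoch.

    masks: iterable of binary arrays, each (H, W) or (B, H, W).
    """
    ratios = [m.sum() / m.size for m in masks]
    return 100.0 * float(np.mean(ratios))
```

The epoch-average ratio is then the mean of this value over the Stage 2 epochs.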

All key hyperparameters for the SGCAD distillation strategy (confidence thresholds, loss weights, warm-up configuration, and evaluation metrics) are summarized in Table 2, with parameter values determined based on validation performance and domain-specific prior knowledge of SAR water-body segmentation (see Section 4 for detailed parameter selection rationale). The high confidence threshold τ_{hi} = 0.95 was chosen to ensure only highly reliable water pixels were used for distillation. The low confidence threshold τ_{lo} = 0.15 was selected to exclude ambiguous regions while retaining sufficient non-water supervision. The distillation weight λ = 0.005 was set small to avoid over-distillation and preserve the student’s multimodal representation ability. The warm-up strategy of 10 epochs was used to stabilize the student model before distillation.

2.7. Training Protocol and Implementation Details

Stage 1: Multimodal Six-Class Training. The LightMCANet student network was trained with a batch size of 48 on a single NVIDIA GPU for 150 epochs using the AdamW optimizer. The optimization setup included a base learning rate of 1 \times 10^{-3}, weight decay of 1 \times 10^{-4}, and a polynomial learning rate (PolyLR) schedule with a power of 0.9. The segmentation loss follows Equation (26), combining weighted cross-entropy (wCE), Dice loss, and Lovász loss with (α, β, γ) = (0.6, 0.2, 0.2), values selected to mitigate class imbalance and improve boundary-sensitive categories.
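
For reference, the polynomial schedule is commonly written as lr = base_lr × (1 − t / T)^{power}. The sketch below uses that standard form; whether the authors step the schedule per epoch or per iteration, and whether a minimum learning rate is applied, is not stated, so those details are assumptions.

```python
def poly_lr(base_lr: float, step: int, total_steps: int, power: float = 0.9) -> float:
    """Standard polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power
```

With base_lr = 1e-3 and 150 steps, the rate starts at 1e-3 and decays monotonically to 0 at the final step.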

Teacher Training: SAR Water Binary Segmentation. The HRNet teacher network was trained exclusively on SAR data for binary water-body segmentation (water vs. non-water) for 150 epochs with a batch size of 44. The AdamW optimizer was used with a base learning rate of 1 \times 10^{-3} and weight decay of 1 \times 10^{-4}, and the same PolyLR schedule (power = 0.9) as Stage 1 was applied to ensure consistent learning rate decay across teacher and student training.

Stage 2: Distillation Fine-Tuning. Stage 2 fine-tunes the pre-trained Stage 1 student network using the proposed SAR-guided class-aware knowledge distillation (SGCAD) strategy for 80 epochs. The base learning rate is reduced to 1 \times 10^{-4} for fine-tuning (to avoid overwriting pre-trained features), and a warm-up strategy is implemented where knowledge distillation is disabled for the first 10 epochs. The distillation loss is weighted by λ = 0.005 to balance segmentation performance and distillation supervision, with confidence thresholds set to 0.95 (high-confidence water pixels) and 0.15 (high-confidence non-water pixels) for mask construction. The distillation strategy employed is bce_selective_gt (selective binary cross-entropy loss constrained to ground-truth-consistent pixels).

Key training hyperparameters for the teacher network and student network (Stage 1 and Stage 2) are summarized in Table 3, including optimizer type, learning rate settings, training epochs, batch size, and weight decay, all standardized to ensure reproducibility and fair comparison across models. Training hyperparameters were chosen to ensure stable convergence and consistent training across all models. A learning rate of 1 \times 10^{-3} was used for initial training, while a reduced rate of 1 \times 10^{-4} was adopted in Stage 2 to fine-tune the model gently. Batch sizes were set based on GPU memory constraints while maintaining training stability. The number of epochs (150 for initial training, 80 for distillation) was determined via early stopping on the validation set to avoid overfitting.

Data Augmentation and Inference. Synchronized data augmentation was applied to both optical and SAR modalities to ensure spatial alignment and enhance model generalization. The augmentation strategy included random horizontal/vertical flips, random rotations within the range of [-5^{\circ}, 15^{\circ}], and random cropping/resizing operations, all designed to simulate real-world variations in remote sensing imagery. Detailed specifications of the data augmentation operations are provided in Table 4.
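
The key point of synchronized augmentation is that one random draw drives the transform for the optical patch, the SAR patch, and the label map together. The sketch below shows this for the flip operations only; the 0.5 flip probability, the (C, H, W) layout, and the function name are assumptions, and rotation and cropping are omitted.

```python
import numpy as np


def synchronized_flip(optical, sar, label, rng):
    """Apply the same random flips to optical (C, H, W), SAR (C, H, W), and label (H, W)."""
    if rng.random() < 0.5:  # horizontal flip: reverse the width axis
        optical, sar, label = optical[..., ::-1], sar[..., ::-1], label[..., ::-1]
    if rng.random() < 0.5:  # vertical flip: reverse the height axis
        optical, sar, label = (optical[..., ::-1, :], sar[..., ::-1, :],
                               label[..., ::-1, :])
    # Copy so the outputs are contiguous arrays rather than reversed views.
    return optical.copy(), sar.copy(), label.copy()
```

A typical call would pass `rng = np.random.default_rng(42)` so that the augmentation stream is reproducible.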

Model Evaluation and Metrics. During training and validation, key evaluation metrics were recorded to quantify segmentation performance: overall accuracy (OA), kappa coefficient (a measure of agreement corrected for chance), mean intersection-over-union (mIoU), and per-class IoU (with particular focus on the water-body class). For class-wise performance analysis, user’s accuracy (UA) was also computed to evaluate the precision of individual land-cover class predictions, with emphasis on water-body IoU improvements in Stage 2 attributable to the distillation process.

Early Stopping and Monitoring. Early stopping was implemented to prevent overfitting, with the stopping criterion based on the Kappa coefficient and a combined evaluation score (weighted OA + mIoU) on the validation set during Stage 2. Additionally, water-body IoU was monitored independently to track the impact of distillation on the target class, and a fusion evaluation mode was enabled in Stage 2 to compare predictions from the HRNet teacher and LightMCANet student networks for consistency analysis.

Quantitative evaluation results for the student network (Stage 1 and Stage 2) and HRNet teacher network are summarized in Table 5, demonstrating the performance improvements achieved via the SGCAD distillation strategy—particularly for the water-body class.

All models were implemented in PyTorch 2.0 (Python 3.9) and trained on a single NVIDIA RTX 3090 GPU (24 GB VRAM) to ensure computational reproducibility. A fixed random seed (42) was set for Python, NumPy, and PyTorch to eliminate randomness in training and evaluation. All experiments were conducted on the fixed train/validation/test data split described in Section 2.1, and final performance metrics were computed using the best-performing model checkpoint (selected based on validation mIoU) to avoid overfitting to the test set.
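
A seeding helper of the kind described above might look as follows. This is a generic sketch, not the authors' script: the function name is hypothetical, and PyTorch is imported optionally so that the example also runs without it. Note that fixed seeds alone do not guarantee bit-identical GPU results, since some CUDA kernels are non-deterministic.

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and (if available) PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional dependency in this sketch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
```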

Optimization and Loss Functions

Student Stage 1 (six-class multimodal segmentation). The LightMCANet student network was trained for 150 epochs using the AdamW optimizer with an initial learning rate of 1 \times 10^{-3} and weight decay of 1 \times 10^{-4}. A PolyLR schedule with power 0.9 was applied to decay the learning rate over training. The segmentation loss objective is defined as:

L_{seg} = α L_{wCE} + β L_{Dice} + γ L_{Lovasz},

(26)

where (α, β, γ) = (0.6, 0.2, 0.2), and the weighted cross-entropy (wCE) loss uses inverse-frequency class weights computed on the training set to mitigate class imbalance.

Teacher (SAR-only, binary water-body segmentation). The HRNet teacher network was trained for 150 epochs using the AdamW optimizer with a learning rate of 1 \times 10^{-3} and the same PolyLR schedule (power = 0.9) as the student network. Pixel-wise cross-entropy loss was used for binary supervision (water vs. non-water), with ignored pixels (label = 255, e.g., out-of-study-area regions) excluded from loss computation.

Student Stage 2 (distillation fine-tuning). Stage 2 fine-tunes the Stage 1 checkpoint for 80 epochs with a reduced base learning rate of 1 \times 10^{-4} to preserve pre-trained multimodal features. The total loss objective for Stage 2 is L = L_{seg} + λ g (e) L_{KD}, where λ = 0.005 is the distillation loss weight, g (e) denotes a warm-up gating function (distillation disabled for the first 10 epochs), and L_{KD} is the masked binary cross-entropy (BCE) loss applied exclusively on confidence-gated pixels (defined in Equations (18)–(20)) to ensure targeted distillation for the water-body class.

2.8. Evaluation Metrics

We report overall accuracy (OA), kappa coefficient, mean intersection-over-union (mIoU), and per-class IoU. For class-wise analysis, user’s accuracy (UA) is provided, with particular focus on water-body improvements.
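
All of these metrics can be derived from a single confusion matrix. The sketch below uses the standard definitions; the row/column convention (rows are reference classes, columns are predictions) and the function name are choices made for this example, and classes with no reference or predicted pixels would need a guard against division by zero.

```python
import numpy as np


def segmentation_metrics(conf):
    """conf[i, j]: number of pixels with reference class i predicted as class j."""
    conf = conf.astype(float)
    total = conf.sum()
    tp = np.diag(conf)
    ref = conf.sum(axis=1)    # reference (ground-truth) totals per class
    pred = conf.sum(axis=0)   # predicted totals per class
    oa = tp.sum() / total
    expected = (ref * pred).sum() / total ** 2  # chance agreement
    kappa = (oa - expected) / (1 - expected)
    iou = tp / (ref + pred - tp)
    ua = tp / pred            # user's accuracy, i.e., per-class precision
    return {"OA": oa, "kappa": kappa, "IoU": iou, "mIoU": iou.mean(), "UA": ua}
```

For the two-class matrix [[8, 2], [1, 9]], this gives OA = 0.85, kappa = 0.70, and per-class IoU of 8/11 and 0.75.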

3. Results

This section presents quantitative and qualitative results on the self-built GF-1 optical and LuTan-1 SAR multimodal dataset for six-class semantic segmentation (farmland, city, village, water body, forest, and road). We first report the teacher model performance for SAR-only water segmentation, then compare the multimodal student with and without SAR-guided class-aware distillation, and finally provide ablation and discussion to clarify the contribution of each component.

3.1. Evaluation Protocol and Metrics Summary

We report overall accuracy (OA), kappa coefficient, mean intersection-over-union (mIoU), and per-class IoU for the six-class task. For class-wise assessment, we additionally report user’s accuracy (UA) and focus on the improvement of water bodies and roads. All results are reported on the fixed split described in Section 2.1. Unless otherwise stated, we report single-run results with a fixed random seed, and select the best checkpoint based on the validation performance before evaluating once on the held-out test set.

3.2. Teacher Performance: SAR-Only Water-Body Segmentation

To ensure the provision of reliable water-body priors for subsequent class-aware distillation, the HRNet-based teacher network was trained on SAR imagery for binary water-body segmentation (water vs. non-water). Specifically, in this binary segmentation task, “non-water” encompasses two categories: (1) the five remaining land-cover classes (farmland, city, village, forest, road) from the original six-class labeling system (defined as “background” in Figure 5), and (2) ignored pixels labeled 255 (defined as “others” in Figure 5, corresponding to areas outside the valid six-class segmentation scope). Quantitative performance metrics of the teacher model on the training and validation subsets are summarized in Table 6, demonstrating that SAR backscattering contrast serves as a stable and discriminative cue for water-body delineation.

Qualitative visualization results of the SAR-only teacher model for binary water-body segmentation are presented in Figure 5, which further validate the model’s ability to generate coherent water-body regions and sharp shoreline boundaries—key prerequisites for providing reliable priors for subsequent confidence-gated distillation.

3.3. Main Comparison on Multimodal Six-Class Segmentation

The proposed SAR-guided class-aware distillation (SGCAD) method is compared with a comprehensive set of representative baselines, including single-modal optical-only models, early/late fusion multimodal models, and attention-based multimodal fusion networks. Quantitative performance comparisons on the independent test set are summarized in Table 7, which demonstrates that SGCAD consistently improves segmentation accuracy for the water-body category while maintaining or enhancing overall segmentation quality across all other land-cover classes.

To ensure a fair and unbiased comparison, all baseline models were trained under identical experimental conditions: the same data split (Section 2.1), input spatial resolution (5 m), and data augmentation protocol (Section 2.7). For each baseline, hyperparameters were initialized using the recommended values from the original publications or official implementations, followed by lightweight validation-based tuning within a constrained search space (e.g., learning rate: 10^{-4} to 10^{-3}; weight decay: 10^{-5} to 10^{-4}) to select the optimal configuration. All models were trained for the same number of epochs (150), and the best-performing checkpoint (based on validation mIoU) was selected for final test set evaluation to avoid overfitting and ensure reproducibility.

Furthermore, to explicitly demonstrate that the proposed method alleviates the degradation of water separability introduced by indiscriminate multimodal fusion, we present a focused comparison of the water-body intersection-over-union (IoU) across three key configurations on the test set (Table 8).

As demonstrated in Table 8, utilizing SAR imagery alone yields a highly competitive water IoU of 0.845, confirming that SAR backscatter provides strong and stable structural cues for water extraction. However, when optical and SAR data are indiscriminately fused in the baseline multimodal network, the water IoU drops sharply to 0.520 (as detailed further in Section 3.4, Table 9). This degradation highlights the adverse effects of cross-modal conflicts and multi-class softmax competition, where ambiguous optical features (e.g., shadows or dark surfaces) interfere with the reliable SAR priors. By introducing the SAR-guided class-aware distillation, the proposed SGCAD method largely resolves this bottleneck, raising the water IoU to 0.827. This comparison provides direct quantitative evidence that SGCAD mitigates the degradation of water separability introduced by direct multimodal fusion while leveraging the strengths of both modalities.
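
The three water IoU values quoted above can be put in proportion with a short calculation (the symbols Δ are introduced here only for illustration):

```latex
\Delta_{\text{drop}} = 0.845 - 0.520 = 0.325, \qquad
\Delta_{\text{gain}} = 0.827 - 0.520 = 0.307, \qquad
\frac{\Delta_{\text{gain}}}{\Delta_{\text{drop}}} = \frac{0.307}{0.325} \approx 0.945
```

That is, distillation recovers roughly 94% of the water IoU lost under direct fusion, leaving a residual gap of 0.845 - 0.827 = 0.018 relative to the SAR-only configuration.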

Qualitative segmentation results for the multimodal six-class segmentation task are visualized in Figure 6, which provides direct visual evidence of the performance improvements achieved by the proposed SGCAD method—particularly for water-body regions and boundary continuity—while preserving segmentation quality for other land-cover categories.

3.4. Water-Body Improvement Analysis

We further analyze the water-body category because it is the major bottleneck in multimodal optical–SAR fusion. Compared with the LightMCANet student baseline, the proposed SGCAD yields a clear and substantial gain in water-body IoU, which indicates that the SAR-only teacher successfully transfers reliable structural priors to the multimodal student. Meanwhile, segmentation performance for the remaining land-cover classes remains stable with only minor fluctuations, suggesting that the proposed confidence-gated distillation strategy effectively prevents over-distillation and negative transfer to non-target classes.

To quantitatively validate the above observations, the per-class IoU and mean IoU (mIoU) of the LightMCANet baseline and the proposed SGCAD are compared in Table 9. The results confirm that SGCAD significantly improves the water-body class while maintaining or slightly enhancing performance for all other categories.

3.5. Ablation Study

We conduct ablation experiments to disentangle the contribution of each component and their interaction effects in the proposed framework. We adopt a full factorial design over three binary factors: (i) SGCAD distillation (on/off, i.e., Stage 2 confidence-gated distillation enabled/disabled), (ii) LightMCAM cross-modal interaction (on/off), and (iii) SEGM boundary refinement (on/off). This results in 2^3 = 8 model variants, allowing us to verify whether the improvements are additive or synergistic.
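
The eight configurations of the factorial design can be enumerated mechanically, which is a convenient way to make sure no combination is skipped when launching runs. The factor names follow the text; the dictionary layout and function name are illustrative only.

```python
from itertools import product

FACTORS = ("SGCAD", "LightMCAM", "SEGM")


def factorial_variants():
    """Return all on/off combinations of the three binary factors."""
    return [dict(zip(FACTORS, flags))
            for flags in product((False, True), repeat=len(FACTORS))]
```

The list starts with all three factors off (the plain Stage 1 baseline without LightMCAM or SEGM) and ends with all three on (the full SGCAD model).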

3.5.1. Full Factorial Ablation: (SGCAD on/off) × (LightMCAM on/off) × (SEGM on/off)

Table 10 summarizes the factorial ablation results. For SGCAD = off, we report the performance of Stage 1 supervised training only. For SGCAD = on, we fine-tune the corresponding Stage 1 checkpoint with Stage 2 distillation under the same teacher and identical distillation hyperparameters. We report overall metrics (OA, kappa, mIoU) and the IoU of two boundary-sensitive categories (water and road), which are the primary targets of our method.

Overall, the factorial results show that (i) LightMCAM consistently improves multimodal representation learning by enabling more effective cross-modal interaction, which benefits water delineation under optical–SAR heterogeneity; (ii) SEGM mainly contributes to boundary continuity and improves slender structures (notably road IoU); and (iii) enabling SGCAD (Stage 2 distillation) yields the largest gain on the water class while maintaining stable overall performance. Moreover, comparing the gains across different combinations suggests that the components are not purely additive: the benefit of SEGM is more pronounced when SGCAD is enabled (i.e., sharper water boundaries provide more reliable edge cues for refinement), indicating a synergistic interaction.

3.5.2. Confidence-Gating Ablation

To explicitly demonstrate the effect of confidence gating in SGCAD, we further compare different distillation strategies under the same student architecture (LightMCAM = on, SEGM = on). Table 11 contrasts: (a) no distillation (Stage 1 only), (b) global distillation without gating (distillation applied on all pixels), and (c, d) confidence-gated distillation using the masks defined in Equations (18)–(20). This ablation isolates the contribution of confidence gating from that of distillation itself.

Compared with global distillation, confidence gating restricts knowledge transfer to reliable pixels (high-confidence water and confident non-water regions consistent with the ground truth), which effectively suppresses noisy supervision and prevents over-distillation to other classes. As a result, the gated variants achieve higher overall performance and more stable cross-class behavior while preserving the strong water improvement.

Qualitative ablation results. We also provide qualitative comparisons to visualize the role of the two architectural factors. Figure 7 compares the full student (LightMCAM = on, SEGM = on, SGCAD = off) with its variant without LightMCAM. As highlighted by the red circles, removing LightMCAM causes noticeable misclassification of water regions, where water is confused with surrounding farmland or built-up areas, indicating the necessity of explicit cross-modal interaction.

Figure 8 compares the full student (LightMCAM = on, SEGM = on, SGCAD = off) with its variant without SEGM. Without edge guidance, predicted boundaries of water bodies become blurred and fragmented, and small spurious regions appear near shorelines and roads, as marked by red circles. In contrast, the full student produces cleaner and more continuous boundaries, confirming that SEGM effectively exploits SAR structural cues to suppress boundary noise and improve delineation of slender objects.

4. Discussion

This section discusses the experimental findings of SGCAD from the perspective of multimodal fusion theory and previous studies. We interpret why SAR-guided class-aware distillation improves water-body delineation, analyze the trade-offs across categories, and highlight limitations and future directions for large-scale applications.

4.1. Why SAR-Guided Class-Aware Distillation Improves Water Segmentation

A consistent observation in our experiments is that the water-body category benefits the most from the proposed teacher–student paradigm. This can be attributed to the intrinsic imaging characteristics of SAR, in which water surfaces generally exhibit low backscattering intensity and homogeneous spatial patterns in satellite imagery, providing stable and consistent cues for water-body extraction regardless of illumination and atmospheric conditions. Consequently, a SAR-only teacher trained on such multi-temporal satellite SAR images can serve as a reliable “water expert” that provides strong structural priors for shoreline delineation. In contrast, multimodal fusion networks may suffer from local modality conflicts when optical textures and SAR scattering patterns are inconsistent, leading to fragmented boundaries and spurious holes in water masks. By transferring the teacher’s water probability in a selective manner, the student learns to enhance water-related decision boundaries while still leveraging optical information for other land-cover classes.

The observed degradation of water segmentation under multimodal fusion indicates a limitation of indiscriminate fusion with global supervision, rather than a limitation of using multimodal data itself. When optical and SAR provide inconsistent evidence at local regions (due to shadows, speckle, geometric misalignment, or imaging timing differences), enforcing a global multi-class objective encourages the network to compromise between modalities, which can weaken the separability of classes that are otherwise well distinguished in a single modality. Our method fundamentally changes this optimization behavior by (i) decoupling water learning via a SAR-only expert teacher, and (ii) restricting distillation to reliable pixels through confidence gating, thereby preventing noisy or misaligned regions from dominating the water decision boundary.

4.2. Role of Confidence Gating: Reducing Noisy Distillation and Preventing Over-Constraint

Unlike dense distillation that enforces teacher supervision on all pixels, SGCAD applies the distillation loss only to pixels where the teacher prediction is sufficiently reliable, with a focus on high-confidence water regions. This design is important because teacher outputs can be uncertain around mixed pixels, complex shorelines, and SAR shadow/layover areas, where indiscriminate distillation may propagate teacher errors to the student. Such noisy transfer can over-constrain the six-class softmax student and potentially exacerbate class competition, leading to degradations in non-water categories. Our ablation results consistently indicate that confidence gating stabilizes training and improves water-body delineation. Moreover, the small distillation weight λ further reduces the risk of over-distillation, allowing the student to retain its multimodal representation while benefiting from the teacher’s structural prior.

We selected the main SGCAD hyperparameters (distillation weight λ and confidence thresholds τ_{hi}, τ_{lo}) based on validation performance. In practice, λ was searched in a small range to avoid over-distillation (e.g., [0.002, 0.01]), and the confidence thresholds were chosen to balance reliability and coverage of the distillation mask. The final values reported in Table 2 were used for all experiments.

4.3. Boundary Continuity and Slender Structures: Effect of SEGM

A second noticeable improvement appears in slender structures and boundaries (e.g., roads and shorelines). This is consistent with the hypothesis that shallow SAR features contain strong gradient cues that are less affected by illumination variations. SEGM exploits such cues to refine decoder responses, which helps reduce boundary fragmentation and improves continuity. Qualitative comparisons (Figure 6) show that the student baseline may produce discontinuous road segments or jagged shorelines, whereas SGCAD yields more coherent boundaries. This supports the view that boundary-aware refinement remains valuable even in attention-based fusion backbones, especially for high-resolution remote sensing segmentation where object shapes are elongated and thin.

4.4. Comparison with Existing Multimodal Fusion and Distillation Strategies

Prior multimodal fusion research has explored early fusion (stacking), late fusion, and cross-attention interaction for optical–SAR semantic segmentation. While these methods improve overall performance, they often lack a mechanism to explicitly protect a critical category that suffers from modal conflicts or class imbalance. Our results suggest that “category-focused” guidance can be more effective than uniformly strengthening all classes, particularly when one class (water) has a strong modality-specific signature in SAR. Compared with generic distillation methods that use global logits or features, the proposed class-aware and confidence-gated distillation is better aligned with the structured nature of semantic segmentation and the heterogeneity of optical–SAR inputs.

4.5. Error Analysis and Failure Cases

Although the proposed distillation strategy substantially improves water-body delineation, a few challenging scenarios remain. First, mixed pixels and gradual transitions around paddy fields, wetlands, and shadowed regions can blur the boundary between farmland and water. In these areas, SAR may exhibit locally reduced backscatter that partially overlaps with the water-like response, increasing ambiguity for both teacher and student. Second, complex urban scenes may contain dark roofs, narrow water channels, or strong specular reflections, which can occasionally induce water-like appearances and lead to false positives, especially under certain viewing geometries or incidence-angle conditions. Third, label noise and annotation uncertainty near class boundaries—particularly for slender structures and fine shoreline details—can cap the attainable IoU even when the prediction is visually reasonable. In the revised manuscript, we will provide representative qualitative examples covering the above cases and discuss the underlying causes and potential mitigation strategies.

4.6. Generalization, Practical Implications, and Future Directions

The proposed framework is suitable for practical land-cover mapping and monitoring because it exploits the complementary strengths of optical and SAR imagery and improves robustness for water-related classes that are critical for flood assessment and water-resource management. A natural extension is to generalize the “expert teacher” paradigm to other categories that exhibit modality-specific signatures, such as roads benefiting from SAR geometric cues and linear structures, or buildings characterized by strong double-bounce responses. Another promising direction is to incorporate semi-supervised or self-supervised representation learning for the teacher model to reduce dependence on densely annotated masks while maintaining reliable pseudo-label quality. From a deployment perspective, evaluating computational cost (e.g., parameter count, FLOPs, and inference throughput) alongside accuracy would provide a clearer view of the efficiency–performance trade-off for large-scale inference and time-sensitive applications.

A key practical condition of our current framework is the availability of co-registered optical and SAR observations acquired within a sufficiently small time lag. For rapidly evolving events such as active flooding, a temporal mismatch between SAR and optical acquisitions may introduce semantic inconsistency (e.g., water extent changes) and potentially bias the fused prediction. In addition, optical imagery is unavailable at night and may be missing under heavy cloud cover, which limits strictly synchronous optical–SAR deployment. In this study, we focus on land-cover and surface-water mapping scenarios where paired optical–SAR observations are near-synchronous and the scene semantics are relatively stable during acquisition. As future work, we will extend the framework toward incomplete multimodal settings by enabling robust inference under missing optical input (e.g., SAR-only fallback and modality-dropout training), and by incorporating temporal pairing strategies (time-window matching) to mitigate the impact of acquisition time gaps in dynamic scenes.

5. Conclusions

This paper presents a SAR-guided class-aware knowledge distillation (SGCAD) framework for multimodal semantic segmentation using optical and SAR image pairs. Motivated by the observation that water-body segmentation remains a bottleneck under heterogeneous multimodal fusion, we train a SAR-only HRNet as a water expert and distill its structural knowledge into a lightweight multimodal student (LightMCANet) through a confidence-gated, class-aware strategy. The main conclusions can be summarized as follows. First, a SAR-only water expert provides reliable structural priors for shoreline delineation, which effectively enhances the multimodal student’s water-body prediction under modal conflicts. Second, the proposed confidence-gated class-aware distillation mechanism restricts supervision to high-confidence water regions, reducing noisy distillation and preventing performance degradation in non-water classes. Third, the SAR edge guidance module (SEGM) refines boundary responses and improves continuity for slender structures (e.g., roads and water boundaries), yielding better qualitative and quantitative results. On the self-built GF-1 (optical) and LuTan-1 (SAR) dataset, SGCAD improves over representative baselines: compared with the LightMCANet student baseline, water IoU rises from 0.520 to 0.827 (+30.7 percentage points; Table 9).

The proposed framework provides a feasible paradigm for targeted category optimization in heterogeneous multimodal segmentation, highlighting the value of modality-specific expert knowledge and region-selective distillation. Future work will explore three directions: (i) extending the expert–teacher distillation paradigm to multiple category experts, leveraging modality-specific advantages for more under-optimized classes; (ii) improving robustness under complex scenarios such as SAR shadow/layover and mixed-pixel conditions, addressing remaining failure cases; (iii) leveraging self-supervised or semi-supervised learning to reduce annotation cost, further enhancing the framework’s generalization ability for large-scale land-cover mapping.

Author Contributions

Conceptualization, J.M.; methodology, J.M.; software, J.M.; validation, J.M. and Y.Y.; formal analysis, J.M. and Y.Y.; investigation, J.M.; data curation, J.M. and Z.W.; writing—original draft preparation, J.M.; writing—review and editing, Y.Y. and F.H.; visualization, J.M.; supervision, F.H. Z.W. contributed to collaborative data collection and preprocessing. Y.Y. contributed to literature review and supplementary experiments. F.H. provided overall academic guidance and manuscript supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the 2024 Shanghai Collaborative Innovation Project (XTCX-KJ-2024-13), in part by the Young Elite Scientists Sponsorship Program (YESS20240549), and in part by the General Program of the National Natural Science Foundation of China (62571139).

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request. The self-built annotations and processed patch data can be shared for research purposes, subject to permission and data organization status. The original remote sensing imagery may be subject to data access restrictions from the corresponding providers.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations and notation conventions are used in this manuscript:

SAR	Synthetic Aperture Radar
SGCAD	SAR-Guided Class-Aware Knowledge Distillation
SEGM	SAR Edge Guidance Module
LightMCAM	Lightweight Cross-Modal Attention Module
LightMCANet	Lightweight Multimodal Cross-Attention Network
HRNet	High-Resolution Network
FPN	Feature Pyramid Network
KD	Knowledge Distillation
GF-1	Gaofen-1
GSD	Ground Sampling Distance
RTC	Radiometric Terrain Correction
DEM	Digital Elevation Model
RGB	Red, Green, Blue
OA	Overall Accuracy
IoU	Intersection-over-Union
mIoU	Mean Intersection-over-Union
UA	User’s Accuracy
CE	Cross-Entropy
BCE	Binary Cross-Entropy
wCE	Weighted Cross-Entropy
PolyLR	Polynomial Learning Rate
$I_{sar}, I_{opt}$	SAR and optical input patches
$\tilde{I}_{sar}$	Normalized single-channel SAR image
$Rep(\cdot)$	Channel replication operator
$Y^{w}(x)$	Binary water mask
$c_{w}$	Water class index
∧	Logical “AND” operator
$ϕ_{j \to b}$	Resolution alignment operator
$I(\cdot)$	Indicator function
$η, λ_{edge}, λ$	Weight hyperparameters
$σ(\cdot)$	Sigmoid activation function
$z_{S}, z_{T}$	Student and teacher logit tensors
$τ_{hi}, τ_{lo}$	High and low confidence thresholds
$M(x, y)$	Final distillation mask
$M_{hi}, M_{lo}$	High-confidence water mask and low-confidence mask
$\|M\|$	Number of valid pixels in the distillation mask
$ρ_{KD}(e)$	KD region ratio
$Q, K, V$	Query, Key, and Value features
$F_{opt}, F_{sar}$	Optical and SAR modality-specific features
$W_{q}, W_{k}, W_{v}$	Learnable linear projection matrices
$P(\cdot)$	Adaptive pooling operator
⊙	Hadamard product (element-wise multiplication)
$F_{sar}^{1}$	Shallow SAR feature from the first encoder layer
$F_{refined}$	Decoder feature optimized by SEGM
$F_{conv}(\cdot)$	Stack of convolutional layers
$x, y$	2D pixel coordinates

Figure 1. Visualization of multimodal data and land-cover labels for the Yuhang (first row) and Sanmen (second row) study areas. The optical images (first column) display natural-color composites, SAR images (second column) show radar backscatter intensity, and label images (third column) depict pixel-wise land-cover annotations. The bottom row provides a color legend for the six land-cover classes: farmland (light green), city (golden), village (coral), water body (blue), forest (dark green), and road (yellow). Backgrounds in optical and SAR images are black, indicating regions outside the study areas, while the label image has a light gray background.

Figure 2. Overall framework of the proposed SGCAD. The framework consists of three core components: (1) a SAR-only teacher stream (blue), implemented by an HRNet-based water-body expert, which takes single-modal SAR input and outputs a water probability map for distillation (detailed in Figure 3); (2) a multimodal student stream (green), implemented by LightMCANet, which processes co-registered optical and SAR inputs for six-class land-cover segmentation (detailed in Figure 4); and (3) a class-aware distillation module (red), which applies confidence-gated supervision exclusively on high-reliability pixels of the water class to avoid noisy global distillation. The SAR edge guidance module (SEGM) is integrated into both streams: the teacher-side SEGM is described in Figure 3, and the student-side SEGM is detailed in Figure 4. A consistent color scheme is maintained across all figures (blue: teacher stream, green: student stream, red: distillation mechanism) for clarity.

Figure 3. Architecture of the SAR-specific HRNet teacher for water-body segmentation. Given a single-channel SAR input patch $I_{sar}$, the stem layer and Stage 1 extract high-resolution features at 1/4 spatial scale. Stages 2 to 4 maintain parallel multi-resolution feature branches (1/4, 1/8, 1/16, 1/32 spatial scales) with repeated inter-branch feature exchange, which preserves fine-grained shoreline details while capturing broader contextual information. A SAR edge guidance module (SEGM) extracts shallow SAR features to predict an edge map $G_{edge}$, then refines the top-stream high-resolution feature $F_{HR}$ via edge-aware gating (i.e., $F_{refined} = F_{HR} ⊙ (1 + G_{edge})$). Finally, a multi-scale fusion and classification head aggregates features from all resolution branches and outputs a binary water-body segmentation map (water vs. non-water).

Figure 4. Student architecture (LightMCANet) for six-class optical–SAR land-cover segmentation. The student network employs a pseudo-Siamese encoder to extract modality-specific features from optical RGB and SAR inputs, mitigating early fusion conflicts between heterogeneous modalities. Cross-modal interaction is implemented by the lightweight cross-modal attention module (LightMCAM) to enable efficient optical–SAR feature fusion with reduced computational complexity. Multi-scale feature aggregation is performed by a feature pyramid network (FPN)-style decoder to recover spatial details. The SAR edge guidance module (SEGM) leverages shallow SAR features (rich in edge cues) to generate an edge-aware gating map, which amplifies boundary responses in decoder features and improves the continuity of slender land-cover structures (e.g., water boundaries and roads).

Figure 5. Qualitative examples of the SAR-only HRNet teacher model for binary water-body segmentation. From left to right, the panels display: (a) the preprocessed and normalized input SAR patch; (b) the ground-truth binary water-body mask (with ignore regions labeled 255 excluded from loss calculation); and (c) the segmentation prediction from the SAR-only teacher model. Note: In the context of this binary segmentation and Figure 5, “background” refers to the five core land-cover classes excluding water (farmland, city, village, forest, road), while “others” refers to ignored pixels labeled 255 (outside the valid six-class segmentation scope); “non-water” in the binary task includes both “background” and “others”. These examples demonstrate that the teacher model generates spatially coherent water-body regions with sharp, well-defined shorelines, confirming its suitability as a reliable knowledge provider for the subsequent confidence-gated distillation process.

Figure 6. Qualitative comparison of multimodal six-class land-cover segmentation results on the test set. From left to right, the panels show: (a) optical RGB image (natural-color composite), (b) SAR backscatter intensity image (dB scale), (c) ground-truth land-cover label, (d) DeepLabv3+ (optical-only) prediction, (e) DeepLabv3+ (optical+SAR) prediction, (f) MCANet (fusion baseline) prediction, (g) LightMCANet (student) prediction, and (h) the proposed SGCAD prediction. The SGCAD method generates spatially coherent water-body regions with sharp, clean boundaries, while maintaining or improving segmentation accuracy for other land-cover categories (farmland, city, village, forest, road). Note: In the binary water-body segmentation context (relevant to the teacher network training), “non-water” refers to all regions except water bodies, including the five aforementioned land-cover classes (defined as “background” in Figure 5) and ignored pixels labeled 255 (defined as “others” in Figure 5). Best viewed in color at full resolution.

Figure 7. Qualitative ablation results for LightMCAM. From left to right: (a) optical RGB image, (b) SAR image, (c) ground-truth label, (d) prediction of the full student, (e) prediction without LightMCAM. Red circles highlight typical misclassification regions caused by removing cross-modal interaction.

Figure 8. Qualitative ablation results for SEGM. From left to right: (a) optical RGB image, (b) SAR image, (c) ground-truth label, (d) prediction of the full student, (e) prediction without SEGM. Red circles indicate boundary degradation and noisy predictions caused by removing SAR edge guidance.

Table 1. Dataset statistics and split configuration.

Split	#Patches	Patch Size	Resolution
Train (base)	12,696	$256 \times 256$	5 m
Val	1589	$256 \times 256$	5 m
Test	1589	$256 \times 256$	5 m
Total (base)	15,874	-	-

Table 2. Key hyperparameters for the SAR-guided class-aware distillation (SGCAD) strategy in Stage 2. Parameters include confidence thresholds for distillation mask construction, warm-up epochs, distillation loss weight, and evaluation metrics. All values are determined via validation experiments and domain-specific prior knowledge (see Section 2.6 for the detailed rationale).

Parameter	Value
Distillation strategy	`bce_selective_gt`
Teacher output (binary water probability)	$P_{T} = σ (z_{T})$ (binary case; equivalent to a 2-logit Softmax)
Student target (water class probability)	$P_{S}^{w} = \mathrm{Softmax}(z_{S})_{w}$ with $w = 3$
High confidence threshold $τ_{hi}$	0.95
Low confidence threshold $τ_{lo}$	0.15
Warm-up epochs $E_{w}$ (KD disabled)	10
Distillation loss weight $λ$	0.005
Teacher water threshold (fusion evaluation)	0.5
Average KD region ratio	$\bar{ρ}_{KD}$ (Equation (25))

Table 3. Training hyperparameters for the HRNet teacher network and LightMCANet student network (Stage 1 and Stage 2). All hyperparameters are consistent across experiments unless otherwise specified, ensuring reproducibility and fair comparison.

Model/Stage	Optimizer	Base LR	LR Schedule	Epochs	Batch Size	Weight Decay
Teacher (HRNet, SAR binary water)	AdamW	$1 \times 10^{-3}$	PolyLR (power = 0.9)	150	44	$1 \times 10^{-4}$
Student Stage 1 (6-class multimodal)	AdamW	$1 \times 10^{-3}$	PolyLR (power = 0.9)	150	48	$1 \times 10^{-4}$
Student Stage 2 (KD fine-tuning)	AdamW	$1 \times 10^{-4}$	Warm-up + PolyLR	80	48	$1 \times 10^{-4}$
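
For readers unfamiliar with the PolyLR schedule listed in Table 3, the commonly used polynomial decay is $lr(t) = lr_{base} \cdot (1 - t/T)^{power}$. The sketch below evaluates it once per epoch with the teacher settings from the table; whether the schedule was stepped per epoch or per iteration is not stated, so the per-epoch stepping here is an assumption, and `poly_lr` is a hypothetical helper.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power

# Teacher settings from Table 3: base LR 1e-3, 150 epochs, power 0.9.
schedule = [poly_lr(1e-3, epoch, 150) for epoch in range(150)]
for epoch in (0, 50, 100, 149):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.6f}")
```

With a power below 1, the decay is slightly slower than linear early on and steeper near the end of training.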

Table 4. Data augmentation operations applied to multimodal optical-SAR training data. All operations are synchronized across modalities to maintain spatial alignment.

Augmentation Operation	Details
Flip	Random horizontal/vertical flip (probability = 0.5)
Rotation	Random rotation within $[-5^{\circ}, 15^{\circ}]$ (bilinear interpolation)
Crop/Resize	Random scale (0.8–1.2×) followed by center crop to $256 \times 256$
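
Synchronizing augmentations across modalities means that one random decision is drawn per sample and applied identically to the optical patch, the SAR patch, and the label. A minimal sketch for the flip operation is given below; it assumes horizontal and vertical flips are sampled independently with probability 0.5 each, which the table does not spell out, and `synchronized_flip` is a hypothetical helper.

```python
import numpy as np

def synchronized_flip(optical, sar, label, rng, p=0.5):
    """Flip optical (H, W, 3), SAR (H, W) and label (H, W) with shared random draws."""
    if rng.random() < p:  # horizontal flip: reverse the width axis of all three
        optical, sar, label = optical[:, ::-1], sar[:, ::-1], label[:, ::-1]
    if rng.random() < p:  # vertical flip: reverse the height axis of all three
        optical, sar, label = optical[::-1], sar[::-1], label[::-1]
    return optical, sar, label

# Index-coded toy patch: every pixel value identifies its original position.
rng = np.random.default_rng(1)
label = np.arange(16).reshape(4, 4)
sar = label.astype(float)
optical = np.stack([label] * 3, axis=-1)

opt_aug, sar_aug, lab_aug = synchronized_flip(optical, sar, label, rng)
# Pixel correspondence is preserved after augmentation.
print(np.array_equal(opt_aug[..., 0], lab_aug), np.array_equal(sar_aug, lab_aug))
```

Drawing the random numbers once and reusing them for all inputs is what keeps the pixel-wise alignment between optical, SAR, and label intact.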

Table 5. Quantitative evaluation metrics for the student network (Stage 1 and Stage 2) and HRNet teacher network on the validation set. Metrics include overall accuracy (OA), kappa coefficient, mean IoU (mIoU), and water-body IoU (class index 3, water body).

Metric	Student (Stage 1)	Teacher (HRNet)	Student (Stage 2)
Overall Accuracy (OA)	0.848	0.902	0.876
Kappa Coefficient	0.782	0.873	0.826
Mean IoU (mIoU)	0.675	0.825	0.792
Water IoU (Class 3)	0.740	0.875	0.850

Table 6. Quantitative performance of the SAR-only HRNet teacher model for binary water-body segmentation. Metrics include intersection over union (IoU), F1-score, precision, and recall (all reported as percentages) for the training and validation subsets. Note: Binary segmentation labels are defined as “water” (target class) and “non-water” (including the five other core land-cover classes and ignored pixels labeled 255).

Split	IoU (%)	F1 (%)	Precision (%)	Recall (%)
Train	87.93	93.19	93.82	92.41
Validation	84.25	91.45	92.21	89.87

Table 7. Quantitative comparison of six-class land-cover segmentation performance on the independent test set. Metrics include per-class accuracy for farmland, city, village, water body, forest, and road classes, as well as mean intersection over union (mIoU), kappa coefficient, and overall accuracy (OA). All values are reported as dimensionless coefficients (range: 0–1).

Method	Farm	City	Village	Water	Forest	Road	mIoU	Kappa	OA
DeepLabv3+ (Optical)	0.759	0.775	0.839	0.793	0.658	0.668	0.608	0.726	0.749
DeepLabv3+ (Opt+SAR)	0.793	0.817	0.775	0.806	0.802	0.718	0.533	0.685	0.785
MCANet/Fusion baseline	0.759	0.723	0.800	0.657	0.859	0.796	0.584	0.725	0.766
LightMCANet (Student)	0.788	0.792	0.836	0.730	0.853	0.835	0.604	0.748	0.806
Ours (SGCAD)	0.799	0.881	0.814	0.974	0.912	0.862	0.657	0.778	0.874

Table 8. Comparison of water-body IoU across single-modal, direct fusion, and the proposed SGCAD configurations on the test set.

Configuration	Water IoU
Single-SAR (HRNet Teacher)	0.875
Multimodal Fusion Baseline (LightMCANet Stage 1)	0.740
Proposed SGCAD (LightMCANet Stage 2)	0.850

Table 9. Per-class intersection-over-union (IoU) and mean IoU (mIoU) comparison between the LightMCANet student baseline and the proposed SGCAD method. The proposed approach significantly improves water-body segmentation while keeping other classes stable.

Method	Farm	City	Village	Water	Forest	Road	mIoU
LightMCANet (Student)	0.666	0.740	0.524	0.520	0.824	0.348	0.604
Ours (SGCAD)	0.711	0.749	0.571	0.827	0.841	0.378	0.680
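
As a quick consistency check on Table 9, the mIoU column can be recomputed as the unweighted mean of the six per-class IoU values, and the water gain as a simple difference. The snippet only re-derives numbers already in the table; the variable names are illustrative.

```python
# Per-class IoU values copied from Table 9.
student = {"Farm": 0.666, "City": 0.740, "Village": 0.524,
           "Water": 0.520, "Forest": 0.824, "Road": 0.348}
sgcad = {"Farm": 0.711, "City": 0.749, "Village": 0.571,
         "Water": 0.827, "Forest": 0.841, "Road": 0.378}

def mean_iou(per_class):
    """Unweighted mean of the per-class IoU values."""
    return sum(per_class.values()) / len(per_class)

miou_student = mean_iou(student)
miou_sgcad = mean_iou(sgcad)
water_gain = sgcad["Water"] - student["Water"]

print(f"student mIoU = {miou_student:.4f}")
print(f"SGCAD mIoU   = {miou_sgcad:.4f}")
print(f"water IoU gain = {100 * water_gain:.1f} percentage points")
```

The recomputed means agree with the reported mIoU values of 0.604 and 0.680 to within rounding, and the water gain corresponds to the 30.7-point improvement.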

Table 10. Full factorial ablation over three binary factors: (SGCAD on/off) × (LightMCAM on/off) × (SEGM on/off).

SGCAD	LightMCAM	SEGM	OA	Kappa	mIoU	Water IoU	Road IoU
0	0	0	0.709	0.596	0.447	0.477	0.268
0	0	1	0.765	0.673	0.521	0.505	0.294
0	1	0	0.752	0.657	0.502	0.496	0.278
0	1	1	0.806	0.748	0.604	0.520	0.348
1	0	0	0.765	0.694	0.540	0.531	0.314
1	0	1	0.783	0.725	0.584	0.665	0.316
1	1	0	0.826	0.732	0.660	0.810	0.340
1	1	1	0.874	0.788	0.680	0.827	0.378
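
One way to read a full factorial table such as Table 10 is to average a metric over the four configurations in which a factor is on and subtract the average over the four in which it is off. The sketch below does this with the rows transcribed from Table 10. The helper name `main_effect` is hypothetical, and this averaging is a descriptive summary added here for illustration; it is not an analysis reported in the paper and implies nothing about statistical significance.

```python
# Rows transcribed from Table 10:
# (SGCAD, LightMCAM, SEGM, OA, Kappa, mIoU, Water IoU, Road IoU)
ROWS = [
    (0, 0, 0, 0.709, 0.596, 0.447, 0.477, 0.268),
    (0, 0, 1, 0.765, 0.673, 0.521, 0.505, 0.294),
    (0, 1, 0, 0.752, 0.657, 0.502, 0.496, 0.278),
    (0, 1, 1, 0.806, 0.748, 0.604, 0.520, 0.348),
    (1, 0, 0, 0.765, 0.694, 0.540, 0.531, 0.314),
    (1, 0, 1, 0.783, 0.725, 0.584, 0.665, 0.316),
    (1, 1, 0, 0.826, 0.732, 0.660, 0.810, 0.340),
    (1, 1, 1, 0.874, 0.788, 0.680, 0.827, 0.378),
]
FACTORS = {"SGCAD": 0, "LightMCAM": 1, "SEGM": 2}
METRICS = {"OA": 3, "Kappa": 4, "mIoU": 5, "Water IoU": 6, "Road IoU": 7}

def main_effect(factor, metric, rows=ROWS):
    """Mean of `metric` with `factor` on minus mean with `factor` off."""
    f, m = FACTORS[factor], METRICS[metric]
    on = [r[m] for r in rows if r[f] == 1]
    off = [r[m] for r in rows if r[f] == 0]
    return sum(on) / len(on) - sum(off) / len(off)

if __name__ == "__main__":
    for factor in FACTORS:
        for metric in ("mIoU", "Water IoU", "Road IoU"):
            print(f"{factor:10s} {metric:10s} {main_effect(factor, metric):+.4f}")
```

Because the factors interact (for example, the water IoU gain from SGCAD is much larger when LightMCAM is on), these averages should be read alongside the individual rows rather than in place of them.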

Table 11. Ablation on confidence gating in SGCAD under the full student architecture (LightMCAM = on, SEGM = on).

Distillation Strategy	OA	Kappa	mIoU	Water IoU	Road IoU	$\rho_{KD}$ (%)
No KD (Stage 1 only; SGCAD = off)	0.806	0.748	0.604	0.520	0.348	0
Global KD (no gating; all pixels)	0.852	0.768	0.650	0.805	0.355	100
Gated KD ($M_{hi}$ only)	0.866	0.780	0.668	0.815	0.372	6.5
Gated KD ($M_{hi} \cup M_{lo}$; ours)	0.874	0.788	0.680	0.827	0.378	10.2
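
The $\rho_{KD}$ column reports the share of pixels that receive a distillation signal under each strategy. The sketch below illustrates how such a coverage figure and a mask-restricted distillation term could be computed. Everything in it is an assumption made for illustration: the function names, the binary cross-entropy form of the loss, and the thresholds 0.95 and 0.05 used to build the toy masks are hypothetical, and the actual definitions of $M_{hi}$ and $M_{lo}$ are those given in Section 2.6, which this sketch does not reproduce.

```python
import numpy as np

def kd_coverage(mask):
    """Percentage of pixels inside the distillation mask."""
    return 100.0 * np.asarray(mask, dtype=bool).mean()

def gated_kd_loss(student_prob, teacher_prob, mask, eps=1e-7):
    """Binary cross-entropy between student and teacher water probabilities,
    averaged over gated pixels only (assumed loss form)."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0                      # no gated pixels, no KD signal
    s = np.clip(np.asarray(student_prob, dtype=np.float64)[mask], eps, 1.0 - eps)
    t = np.asarray(teacher_prob, dtype=np.float64)[mask]
    bce = -(t * np.log(s) + (1.0 - t) * np.log(1.0 - s))
    return float(bce.mean())

# Toy 2x2 teacher output (water probability per pixel).
teacher = np.array([[0.98, 0.60],
                    [0.03, 0.50]])
m_hi = teacher >= 0.95                  # hypothetical high-confidence mask
m_lo = teacher <= 0.05                  # hypothetical second mask
m_all = np.ones_like(teacher, dtype=bool)   # global KD, no gating

coverage = {
    "global": kd_coverage(m_all),
    "hi_only": kd_coverage(m_hi),
    "hi_or_lo": kd_coverage(m_hi | m_lo),
}
```

Under this reading, global KD corresponds to a mask of all ones (coverage 100), and each gated variant restricts the average to a small subset of pixels, which is consistent with the 6.5 and 10.2 values in the table.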

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ma, J.; Wang, Z.; Yuan, Y.; Hu, F. SGCAD: A SAR-Guided Confidence-Gated Distillation Framework of Optical and SAR Images for Water-Enhanced Land-Cover Semantic Segmentation. Remote Sens. 2026, 18, 962. https://doi.org/10.3390/rs18060962

