1. Introduction
High-resolution Earth observation has rapidly increased the demand for pixel-level interpretation of land-cover and land-use patterns. Semantic segmentation has become a core technique for flood mapping and large-area land-cover inventory [1, 2]. With the progress of deep learning, remote sensing segmentation has gradually shifted from handcrafted features to end-to-end representation learning with stronger backbones and multi-scale context aggregation [3, 4]. Despite these advances, practical scenes still exhibit (i) large intra-class appearance variations and acquisition-induced changes, and (ii) severe inter-class confusion (e.g., shadow vs. water), which often lead to unstable optimization and noisy decision boundaries [5, 6]. Moreover, class imbalance and boundary ambiguity are especially harmful for thin or fragmented objects such as rivers and roads [7, 8]. Consequently, achieving both high accuracy and robust boundary delineation remains challenging in large-area mapping pipelines.
Optical and synthetic aperture radar (SAR) imagery are highly complementary. Optical images provide rich spectral and textural cues for fine-grained semantic discrimination, whereas SAR enables all-weather, day-and-night imaging and offers geometric and structural information linked to microwave scattering [9, 10]. Such SAR-derived structural cues have also been exploited in other terrain understanding settings (e.g., SAR altimeter delay-Doppler-image-based terrain classification), suggesting the potential of SAR to contribute stable physical priors [11]. This complementarity has motivated extensive efforts on optical–SAR fusion for land-cover classification and segmentation, including dual-stream encoders and multi-level feature aggregation [12, 13]. However, the heterogeneity between optical and SAR remains a fundamental obstacle: SAR radiometry is affected by speckle, incidence angle, and scattering mechanisms, while optical radiometry depends on illumination and atmospheric conditions. These factors yield a pronounced cross-modal domain gap and may trigger fusion conflicts when features are combined indiscriminately [14]. In addition, imperfect co-registration and geometric distortions can further degrade multimodal fusion performance [15].
Existing optical–SAR fusion strategies can be broadly summarized as image-level fusion/translation, feature-level fusion, and attention-based interaction. Image fusion (IF) aims to merge multimodal inputs into a single representation for downstream segmentation, whereas image translation (IT) learns a mapping between domains (e.g., SAR→optical) to support training or enhancement under poor optical conditions [16, 17]. Feature-level fusion designs typically adopt dual-branch encoders and fuse multi-level features via concatenation or adaptive fusion modules, but they may overlook fine-grained inter-modal correspondences and introduce redundancy [18]. To explicitly model cross-modal relations, attention and cross-attention have been increasingly employed to align semantic focus and structural cues [19, 20]. More recently, correction–interaction–fusion architectures attempt to mutually rectify modal representations before exchanging information across spatial and channel dimensions, improving robustness under heterogeneous inputs [21, 22]. Beyond fusion design, multimodal segmentation is also hindered by data and preprocessing constraints. In many scenarios, modalities have inconsistent spatial resolutions or sampling footprints, which necessitates iterative alignment and cross-modal conditioning to avoid performance loss [23]. Furthermore, real-world systems may suffer from missing modalities at inference time (e.g., cloud-contaminated optical), prompting incomplete multimodal learning that remains robust under partial inputs [24]. Relatedly, SAR time-series research has shown that exploiting temporal coherence can reduce reliance on strong priors and improve robustness under non-ideal conditions, as demonstrated in multi-temporal InSAR phase unwrapping and spatio-temporal urban change detection [25, 26]. These observations further motivate designing multimodal segmentation methods that can cope with heterogeneous reliability and uncertainty.
Water-body segmentation is a particularly important yet challenging case. Water mapping is crucial for hydrological analysis and disaster response, and a large literature has studied water extraction in optical multispectral imagery [27]. In SAR imagery, water often exhibits low backscatter and sharp shorelines, providing a complementary cue for delineation [28, 29]. Nevertheless, both modalities have failure modes: optical water is sensitive to cloud/haze and shadow, while SAR water can be confused by wind-roughened surfaces and speckle. In multimodal multi-class segmentation, water further competes with visually similar categories (e.g., shadowed urban regions), and fusion conflicts may fragment boundaries and reduce IoU. Boundary-aware learning is therefore critical for thin or elongated structures, and classical edge supervision (e.g., holistically nested edge learning) provides useful motivation for refining semantic boundaries with low-level gradients [30].
Although water surfaces are often highly separable in SAR due to low backscatter and clear shoreline contrast, we observe that multi-class multimodal fusion can still under-optimize the water category. This is not a contradiction, but a consequence of the learning objective and heterogeneous uncertainty across modalities. In particular, (i) optical imagery introduces ambiguous water-like patterns (e.g., cloud shadow, terrain shadow, and dark roofs), while SAR responses are affected by speckle and acquisition geometry; (ii) imperfect co-registration and geometric distortions cause local misalignment between SAR-derived water boundaries and optical textures; and (iii) in a six-class softmax setting, water logits compete with other dominant categories, so gradients from large-area classes may dominate optimization and shift decision boundaries away from the physically consistent SAR cue. As a result, indiscriminate fusion may create conflicting feature evidence near shorelines and mixed pixels, leading to fragmented water masks and reduced IoU.
Knowledge distillation (KD) offers a principled mechanism to transfer knowledge from a strong teacher to a compact student [31]. For semantic segmentation, structured KD explores pixel relations and global consistency [32], while cross-image relational distillation models dataset-level semantic structure [33]. Texture-focused distillation further emphasizes low-level structural and statistical cues that are critical for boundary quality [34]. Feature augmentation and proxy losses have also been proposed to overcome capacity gaps and enrich the distillation signal [35]. In remote sensing, KD is increasingly used to balance accuracy and efficiency and to distill complementary inductive biases from different architectures [36, 37]. However, most existing KD schemes still apply global supervision, which can propagate teacher bias to irrelevant regions or classes, an especially risky behavior in multimodal settings with spatially varying uncertainty. Hence, a class-aware and region-selective distillation paradigm is desirable to bring targeted improvements without destabilizing other categories.
Despite progress in optical–SAR fusion and segmentation distillation, most existing methods optimize global objectives and treat all classes and regions similarly. However, in heterogeneous multimodal scenes, modality reliability is spatially varying and certain categories, especially water, may be systematically under-optimized due to fusion conflicts and multi-class competition. Therefore, a category-focused and region-selective learning strategy is needed to strengthen the decision boundary of the bottleneck class without destabilizing other categories.
In this work, we propose a SAR-guided class-aware knowledge distillation framework for multimodal semantic segmentation, targeting the persistent weakness of the water class in optical–SAR six-class land-cover mapping. Our key idea is to train a SAR-based water-expert teacher using an HRNet-style high-resolution segmentation architecture and transfer its structural prior to a lightweight multimodal student via class-aware distillation with confidence-aware gating [38]. From a broader architectural perspective, the design of the proposed framework is also related to representative encoder–decoder and multi-scale segmentation models, including SegNet, U-Net, PSPNet, FPN, and V-Net, as well as dual-attention fusion and recent self-supervised or masked-autoencoder representation learning for remote sensing [39, 40, 41, 42, 43, 44, 45, 46]. Specifically, the teacher is specialized for binary water segmentation to exploit the strong separability of water in SAR imagery, while the student performs six-class segmentation on optical–SAR pairs using efficient cross-modal interaction and boundary-aware decoding. To mitigate modal conflicts and class competition in multimodal learning, the distillation loss is applied selectively on pixels where the teacher predictions are sufficiently reliable, with an emphasis on high-confidence water regions to avoid noisy knowledge transfer.
The main contributions of this work are threefold. (1) Problem-driven formulation: We identify that, in optical–SAR multi-class segmentation, water-body prediction can become a persistent bottleneck due to local cross-modal conflicts and softmax competition, even though SAR alone is highly discriminative for water. (2) Method novelty: Different from conventional global fusion training or global distillation, we propose SGCAD, a SAR-guided, class-aware, confidence-gated distillation scheme that transfers teacher knowledge only for the water class and only on reliable pixels. This design mitigates noisy or negative transfer and preserves other classes. (3) Architecture for efficiency and boundaries: We develop a lightweight optical–SAR student with efficient cross-modal interaction and SAR edge-guided boundary refinement, yielding improved water and road delineation with practical efficiency.
The remainder of this paper is organized as follows. Section 2 presents the materials and methods, including the dataset description, problem formulation, and the proposed framework with implementation details. Section 3 reports the experimental results and ablation studies. Section 4 provides further discussion on limitations, failure cases, and practical implications. Finally, Section 5 concludes the paper and outlines future directions.
2. Materials and Methods
2.1. Materials: Study Area and Dataset
Study Areas and Data Sources. We constructed a self-built multimodal dataset over two representative regions in Zhejiang Province, China, including Yuhang District, Hangzhou City, and Sanmen County, Taizhou City. These two areas cover diverse land-cover patterns, such as urban built-up areas, croplands, rivers/lakes, forests, and rural settlements, enabling robust evaluation under complex scene compositions.
The optical imagery was acquired from Gaofen-1 (GF-1) with RGB bands at 2 m ground sampling distance (GSD). The SAR imagery was acquired from the LuTan-1 satellite with 5 m GSD. All data (optical/SAR/label) were exported as georeferenced GeoTIFFs and reprojected into a common projected CRS (WGS84 Transverse Mercator, central meridian 120°E), ensuring consistent map geometry.
Detailed statistics regarding dataset partitioning and patch settings are summarized in Table 1, including the number of patches, patch size, and uniform spatial resolution for training, validation, and test subsets.
Land-Cover Categories and Annotation. Visual examples of the optical images, SAR images, and corresponding pixel-level land-cover annotations in the study areas are presented in Figure 1, along with the color coding scheme for the six land-cover classes.
Pixel-wise annotations include six land-cover classes: farmland, city, village, water body, forest, and road. These categories contain both large homogeneous regions (e.g., farmland/forest) and slender structures (e.g., roads and water boundaries), which are sensitive to boundary fragmentation under multimodal fusion conflicts.
2.1.1. Co-Registration and Radiometric Preprocessing
To obtain co-registered multimodal pairs with physically comparable SAR backscatter, we applied a standard SAR radiometric preprocessing pipeline before patch generation. First, LuTan-1 SAR intensity was radiometrically calibrated to the backscatter coefficient $\sigma^0$ using the calibration parameters provided in the product metadata. Second, to reduce the dependence on acquisition geometry, $\sigma^0$ was converted to $\gamma^0$ via incidence angle normalization,
$$\gamma^0 = \frac{\sigma^0}{\cos\theta},$$
where $\theta$ denotes the incidence angle. Third, DEM-assisted radiometric terrain correction (RTC) was performed to compensate for terrain-induced radiometric distortions and to geocode SAR measurements into a map geometry consistent with the optical reference. During RTC, the local incidence angle $\theta_{loc}$ was derived and used for local incidence angle normalization to mitigate slope-related backscatter variations. After radiometric correction, SAR images were despeckled using a windowed Lee filter and converted to log-intensity (dB). Finally, we applied robust min–max normalization by clipping values to the 2nd and 98th percentiles and scaling them to $[0, 1]$.
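The radiometric steps described above can be illustrated with a minimal NumPy sketch. It covers only the incidence angle normalization, the dB conversion, and the percentile-based rescaling; calibration, RTC, and Lee filtering are omitted. The function names, the `eps` guard, the synthetic inputs, and the $[0, 1]$ output range are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def sigma0_to_gamma0(sigma0, theta_deg):
    """Incidence angle normalization: gamma0 = sigma0 / cos(theta)."""
    return sigma0 / np.cos(np.deg2rad(theta_deg))

def to_db(intensity, eps=1e-6):
    """Convert linear intensity to decibels; eps (an assumption) avoids log(0)."""
    return 10.0 * np.log10(np.maximum(intensity, eps))

def robust_minmax(x, lo_pct=2.0, hi_pct=98.0):
    """Clip to the given percentiles and rescale to [0, 1] (assumed target range)."""
    lo, hi = np.percentile(x, [lo_pct, hi_pct])
    if hi <= lo:  # constant image: nothing to stretch
        return np.zeros_like(x, dtype=np.float64)
    return (np.clip(x, lo, hi) - lo) / (hi - lo)

# Synthetic stand-ins for a calibrated backscatter patch and its incidence angle map.
rng = np.random.default_rng(0)
sigma0 = rng.gamma(shape=1.0, scale=0.05, size=(64, 64))
theta = np.full((64, 64), 35.0)  # degrees

sar_norm = robust_minmax(to_db(sigma0_to_gamma0(sigma0, theta)))
```

Clipping before rescaling keeps a few extreme scatterers (for example, corner reflections from buildings) from compressing the dynamic range of the remaining pixels.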
2.1.2. Resampling and Label Handling
After RTC-based geocoding, SAR images were geometrically aligned to the optical reference using georeferencing information. Both modalities were then resampled to a unified 5 m grid for training: (i) optical RGB was downsampled from 2 m to 5 m using bilinear interpolation; (ii) SAR was resampled onto the same 5 m grid (bilinear interpolation); (iii) the label map was resampled using nearest-neighbor interpolation to preserve discrete class IDs.
2.1.3. Patch Generation with Leakage-Free Splitting
We adopted a “split-first, crop-later” strategy to avoid spatial leakage. Specifically, the aligned region-wide mosaics were first partitioned into large tiles of size 4096 × 4096. These large tiles were then split into training/validation/test subsets at the tile level (0.70:0.15:0.15). Finally, each large tile was cropped into 256 × 256 patches, forming the base dataset for model training and evaluation.
2.1.4. Train/Val/Test Selection
To avoid spatial leakage, the train/val/test split is performed at the 4096 × 4096 tile level before cropping. Specifically, large tiles are randomly assigned to train/val/test with a ratio of 0.70/0.15/0.15, and all 256 × 256 patches cropped from one tile inherit the same split. Therefore, no spatial overlap exists across splits at either the tile level or the patch level. The targeted re-sampling strategy (road/village/city) is applied only to the training subset, while validation and test sets remain unchanged to ensure fair evaluation.
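The tile-level splitting described above can be sketched in a few lines of standard-library Python. The function names, the fixed seed, and the non-overlapping crop grid are assumptions for illustration; the source does not state the stride used for the base crops.

```python
import random

def split_tiles(tile_ids, ratios=(0.70, 0.15, 0.15), seed=0):
    """Randomly assign whole tiles to train/val/test before any cropping."""
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

def patch_origins(tile_size=4096, patch_size=256):
    """Top-left corners of non-overlapping patches inside one tile (assumed grid)."""
    steps = range(0, tile_size - patch_size + 1, patch_size)
    return [(row, col) for row in steps for col in steps]

splits = split_tiles(range(20))
# Every patch inherits the split of the tile it was cropped from.
patches = {
    name: [(tile, origin) for tile in tiles for origin in patch_origins()]
    for name, tiles in splits.items()
}
```

Because the assignment happens per tile, two patches from the same tile can never end up in different subsets, which is the property the leakage-free protocol relies on.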
2.1.5. Targeted Re-Sampling for Rare Classes (Training Only)
To mitigate class imbalance and strengthen slender or minority categories, we further performed targeted re-sampling on the training set, focusing on road/village/city. An overlapping sliding window (window size 256, stride 128) scanned label patches to identify candidate windows with sufficient target pixels, filtered by (a) a minimum pixel count threshold (≥10 pixels) and (b) a minimum ratio threshold (≥3%). Selected patches were merged into the training set as additional samples, while validation/test subsets remained unchanged for fair evaluation.
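A minimal sketch of the window selection described above is given below. It assumes that both thresholds must hold simultaneously and that the ratio is measured against the full window area; neither detail is stated explicitly in the text. The function name and the toy label map are illustrative only.

```python
import numpy as np

def select_windows(label, target_ids, window=256, stride=128,
                   min_pixels=10, min_ratio=0.03):
    """Return top-left corners of windows that contain enough target-class pixels."""
    height, width = label.shape
    picks = []
    for row in range(0, height - window + 1, stride):
        for col in range(0, width - window + 1, stride):
            patch = label[row:row + window, col:col + window]
            n_target = int(np.isin(patch, target_ids).sum())
            # Assumption: a window must pass both the count and the ratio test.
            if n_target >= min_pixels and n_target / patch.size >= min_ratio:
                picks.append((row, col))
    return picks

# Toy label map: farmland (0) everywhere, with a short road (5) stripe.
label = np.zeros((512, 512), dtype=np.uint8)
label[0:256, 100:110] = 5
extra = select_windows(label, target_ids=(1, 2, 5))  # city, village, road
```

In this toy case only the window at the top-left corner holds enough road pixels (2560 of 65,536, about 3.9%) to be kept.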
Optical RGB patches were linearly scaled to $[0, 1]$ and then normalized using the same robust percentile strategy as SAR for consistent dynamic range control: values were clipped to the 2nd and 98th percentiles and rescaled to $[0, 1]$. All modalities (optical, SAR) and label maps were stored as GeoTIFF/PNG patches with a fixed patch size of 256 × 256. The label value 255 denotes ignored pixels (outside the valid study area), and these are excluded from loss computation and metric evaluation.
2.2. Problem Formulation
Given a co-registered optical–SAR pair $(I_{opt}, I_{sar})$, where $I_{opt} \in \mathbb{R}^{H \times W \times 3}$ and $I_{sar} \in \mathbb{R}^{H \times W \times 1}$, the goal is to predict a six-class segmentation map $\hat{Y} \in \{0, \dots, 5\}^{H \times W}$. Here, class IDs 0–5 correspond to the six land-cover categories (farmland, city, village, water body, forest, and road), and the value 255 denotes ignored pixels (e.g., areas outside the study region) that are excluded from loss computation and metric evaluation. To enhance the water-body category, a SAR-only teacher model produces a water probability map $P_T \in [0, 1]^{H \times W}$ for class-aware distillation.
2.3. Overview of the Proposed SGCAD Framework
The overall architecture of the proposed SAR-guided class-aware distillation (SGCAD) framework is illustrated in Figure 2, which systematically integrates a SAR-only teacher stream, a multimodal student stream, and a class-aware distillation module to address multimodal fusion conflicts for water-body segmentation.
Our method adopts a two-stage teacher–student training paradigm: (1) Stage 1 trains the multimodal student network for six-class land-cover segmentation using the supervised segmentation loss only (weighted cross-entropy combined with Dice and Lovász terms, as detailed in Section 2.6); (2) Stage 2 fine-tunes the student with SAR-guided class-aware distillation, where the SAR-only teacher provides supervision only for high-confidence water regions to prevent the propagation of noisy global distillation signals.
Notably, our objective is not to rectify flawed fusion designs through post-hoc corrections, but to address a structural limitation of global multimodal optimization objectives: when optical and SAR modalities exhibit local disagreements, the student network receives conflicting supervision signals for the water class in a multi-class segmentation setting. By introducing a SAR-specialized water-body expert and implementing class-aware, confidence-gated distillation, we inject physically consistent SAR-derived priors into the water class decision boundary while avoiding unnecessary global constraints on other land-cover classes.
For clarity, the teacher stream in Figure 2 corresponds to the SAR-only HRNet architecture detailed in Figure 3, while the student stream refers to the LightMCANet structure presented in Figure 4. The SEGM modules embedded in the teacher and student streams are implemented as described in Figure 3 and Figure 4, respectively.
Although the use of HRNet as a feature extractor is not novel, our work extends this architecture with two key innovations tailored to multimodal remote sensing segmentation: (1) a novel class-aware distillation framework that restricts knowledge distillation to the water class, thereby mitigating class imbalance and multimodal conflict issues; (2) the integration of a SAR-guided edge refinement module (SEGM) to enhance boundary precision for slender land-cover structures (e.g., roads and water bodies). These contributions are specifically designed to address the unique challenges of multimodal optical–SAR fusion and improve segmentation performance in heterogeneous land-cover environments.
2.4. Student Network: LightMCANet for Multimodal Six-Class Segmentation
The architecture of the lightweight multimodal student network (LightMCANet) designed for six-class optical–SAR land-cover segmentation is illustrated in Figure 4, which details the pseudo-Siamese encoder, cross-modal attention module, FPN-style decoder, and SAR edge guidance module (SEGM) integrated into the network.
Pseudo-Siamese Encoder and Cross-Modal Interaction. LightMCANet adopts a pseudo-Siamese encoder to separately extract modality-specific features from optical and SAR inputs, effectively alleviating early fusion conflicts caused by the heterogeneous characteristics of optical and SAR data. To model high-order cross-modal correlations while reducing the quadratic computational complexity of standard attention mechanisms, a lightweight cross-modal attention module (LightMCAM) is proposed. In LightMCAM, adaptive pooling is first applied to compress SAR features before constructing key and value matrices, balancing fusion performance and computational efficiency. The mathematical formulation of LightMCAM is defined as follows:
$$Q = F_{opt} W_Q, \quad K = \mathcal{P}(F_{sar})\, W_K, \quad V = \mathcal{P}(F_{sar})\, W_V,$$
$$F_{att} = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$
where $Q$ denotes the query feature derived from the optical modality-specific feature $F_{opt}$ via the learnable linear projection matrix $W_Q$; $K$ (key feature) and $V$ (value feature) are generated by applying adaptive pooling $\mathcal{P}(\cdot)$ to the SAR modality-specific feature $F_{sar}$ (for dimension reduction and computational efficiency) followed by linear projections using $W_K$ and $W_V$, respectively; $K^{\top}$ represents the transpose of the key feature matrix; $d_k$ is the dimension of the key feature $K$, which is used to scale attention scores and avoid gradient vanishing issues; $\mathrm{Softmax}(\cdot)$ normalizes the attention weights to the range $[0, 1]$ to ensure a valid probability distribution; and $F_{att}$ is the final cross-modal attention fusion feature that integrates complementary information from optical and SAR modalities.
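The pooled cross-attention described above can be sketched in NumPy as follows. Average pooling, the pooled grid size, the single attention head, and all function and argument names are assumptions chosen for illustration; the actual LightMCAM configuration may differ.

```python
import numpy as np

def adaptive_avg_pool(feat, out_hw):
    """Average-pool an (H, W, C) map to (out_hw, out_hw, C); H and W must divide evenly."""
    h, w, c = feat.shape
    return feat.reshape(out_hw, h // out_hw, out_hw, w // out_hw, c).mean(axis=(1, 3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def light_cross_attention(f_opt, f_sar, w_q, w_k, w_v, pooled_hw=4):
    """Optical queries attend to pooled SAR keys/values (single head)."""
    h, w, c = f_opt.shape
    q = f_opt.reshape(h * w, c) @ w_q                                # (HW, d)
    sar_tokens = adaptive_avg_pool(f_sar, pooled_hw).reshape(-1, c)  # (p*p, C)
    k = sar_tokens @ w_k                                             # (p*p, d)
    v = sar_tokens @ w_v                                             # (p*p, d)
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))                   # (HW, p*p)
    return (attn @ v).reshape(h, w, -1), attn

rng = np.random.default_rng(0)
f_opt = rng.normal(size=(16, 16, 8))
f_sar = rng.normal(size=(16, 16, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
fused, attn = light_cross_attention(f_opt, f_sar, w_q, w_k, w_v)
```

With pooling, the attention matrix has shape (HW, p²) instead of (HW, HW), which is where the reduction in cost relative to full cross-attention comes from.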
FPN Decoder and SAR Edge Guidance Module (SEGM). An FPN-style decoder is employed to aggregate multi-scale features from the encoder and recover fine-grained spatial details for accurate segmentation. To enhance boundary continuity for slender land-cover structures (e.g., roads and water boundaries), a SAR edge guidance module (SEGM) is integrated into the decoder stage. The SEGM generates an edge-aware gating mask from shallow SAR features (which preserve high-resolution edge information) to amplify boundary responses in the decoder features. The mathematical formulation of SEGM is defined as:
$$G = \sigma\left(\mathrm{Conv}\left(F_{sar}^{(1)}\right)\right), \quad F_{out} = F_{fpn} \odot G,$$
where $G$ is the SAR edge-guided gating mask; $\sigma(\cdot)$ denotes the Sigmoid activation function that maps convolution outputs to the range $(0, 1)$ for gating control; $\mathrm{Conv}(\cdot)$ represents a stack of convolutional layers designed to extract edge information from $F_{sar}^{(1)}$ (shallow SAR feature from the first encoder layer, which is rich in gradient and boundary cues); $F_{fpn}$ is the original fusion feature of the FPN decoder; ⊙ denotes element-wise multiplication (Hadamard product); and $F_{out}$ is the decoder feature optimized by SEGM to enhance the continuity and sharpness of land-cover boundaries (especially for slender structures such as water shorelines and roads).
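The gating idea behind SEGM can be illustrated with a short NumPy sketch. A single 1×1 convolution (a per-pixel linear map) stands in for the convolutional stack, and the gate is assumed to multiply the decoder feature directly; both choices, as well as the names used, are simplifying assumptions rather than the exact module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def segm_gate(f_dec, f_sar_shallow, w_edge, b_edge=0.0):
    """Gate decoder features with a mask computed from shallow SAR features.

    f_dec:         (H, W, C) decoder feature
    f_sar_shallow: (H, W, Cs) shallow SAR feature at the same resolution
    w_edge:        (Cs, 1) weights of a 1x1 conv standing in for the conv stack
    """
    gate = sigmoid(f_sar_shallow @ w_edge + b_edge)  # (H, W, 1), values in (0, 1)
    return f_dec * gate, gate                        # gate broadcasts over channels

f_dec = np.ones((4, 4, 8))
f_sar_shallow = np.zeros((4, 4, 2))
refined, gate = segm_gate(f_dec, f_sar_shallow, w_edge=np.ones((2, 1)))
```

Because the gate has a single channel, the same spatial weighting is applied to every decoder channel, so pixels with strong SAR edge evidence keep more of their response than flat regions.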
2.5. Teacher Network: HRNet Water-Body Expert (SAR-Only)
A high-resolution network (HRNet)-based segmentation model is employed as a SAR-only water-body expert teacher [28] to provide reliable water-body priors for subsequent class-aware distillation. Unlike the multimodal student network that must address cross-modal fusion conflicts between optical and SAR data, the teacher network focuses exclusively on a single, physically consistent cue: the strong backscattering contrast of water bodies in SAR imagery, which remains stable under varying illumination conditions (e.g., cloud cover, shadow) and weather scenarios (e.g., rain, fog). HRNet is selected as the backbone for the teacher network due to its ability to maintain high-resolution feature representations throughout the network, enabling accurate shoreline delineation and reducing boundary fragmentation for slender water-body structures.
The architecture of the SAR-specific HRNet teacher network designed for binary water-body segmentation is illustrated in Figure 3, which details the multi-resolution feature extraction branches, SAR edge guidance module (SEGM), and classification head integrated into the network.
2.5.1. Input and Label Construction
Given a co-registered SAR intensity patch $I_{sar} \in \mathbb{R}^{H \times W \times 1}$ (where $H$ and $W$ denote the height and width of the patch, respectively), we feed it into HRNet after lightweight channel adaptation to match the network’s input interface. Specifically, the 8-bit SAR intensity (pixel values ranging from 0 to 255) is first normalized to the range $[0, 1]$, followed by channel replication to convert the single-channel SAR data into a three-channel format. This process is formulated as:
$$\tilde{I}_{sar} = \frac{I_{sar}}{255}, \quad X = \mathrm{Rep}_3\left(\tilde{I}_{sar}\right) \in \mathbb{R}^{H \times W \times 3},$$
where $\tilde{I}_{sar}$ represents the normalized single-channel SAR image; $\mathrm{Rep}_3(\cdot)$ denotes the channel replication operator; and $X$ is the final three-channel input feature of the HRNet teacher. For supervision, the original six-class land-cover label map $Y$ (class IDs 0–5 correspond to farmland, city, village, water body, forest, road; 255 denotes ignored pixels) is converted into a binary water mask $Y_w$, where:
$$Y_w(x) = \mathbb{1}\left[\,Y(x) = c_w \wedge Y(x) \neq 255\,\right].$$
Here, $x$ denotes the pixel coordinate; $c_w$ is the class index of the water body in our annotation system (set to 3); and ∧ represents the logical “AND” operator. The binary mask $Y_w$ enables the teacher to focus exclusively on water-body segmentation without interference from other land-cover categories.
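The input adaptation and label conversion described above can be sketched as follows. Keeping the value 255 in the binary mask, so that ignored pixels can still be excluded from the teacher loss, is an assumption of this sketch, as are the function and constant names.

```python
import numpy as np

WATER_ID = 3     # water-body class index in the six-class labels
IGNORE_ID = 255  # pixels outside the valid study area

def teacher_input(sar_u8):
    """Scale an 8-bit (H, W) SAR patch to [0, 1] and replicate it to three channels."""
    sar = sar_u8.astype(np.float32) / 255.0
    return np.repeat(sar[..., None], 3, axis=-1)  # (H, W, 3)

def water_mask(label):
    """1 = water, 0 = any other valid class; ignored pixels keep the value 255."""
    mask = (label == WATER_ID).astype(np.uint8)
    mask[label == IGNORE_ID] = IGNORE_ID  # assumption: carry the ignore flag along
    return mask

label = np.array([[0, 3], [255, 5]], dtype=np.uint8)
binary = water_mask(label)
x_in = teacher_input(np.array([[0, 255]], dtype=np.uint8))
```

Channel replication adds no information; it only lets a backbone with a three-channel input interface consume single-channel SAR without modifying its first layer.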
2.5.2. HRNet High-Resolution Fusion
HRNet [28] maintains parallel multi-resolution feature streams throughout the network and repeatedly exchanges information across resolutions to preserve high-resolution details. Let $\{F_s^{(b)}\}_{b=1}^{B}$ denote the feature maps at stage $s$ from $B$ parallel branches (sorted from high to low resolution). The cross-resolution fusion process for the $b$-th branch is defined as:
$$\tilde{F}_s^{(b)} = \sum_{j=1}^{B} \mathcal{T}_{j \to b}\left(F_s^{(j)}\right),$$
where $\tilde{F}_s^{(b)}$ is the fused feature of the $b$-th branch at stage $s$; $\mathcal{T}_{j \to b}(\cdot)$ denotes the resolution alignment operator, which includes upsampling/downsampling (to match the resolution of the $b$-th branch) followed by convolution and summation fusion; and $F_s^{(j)}$ is the original feature of the $j$-th branch at stage $s$. After multi-scale fusion, the final segmentation head outputs per-pixel logits $Z_T \in \mathbb{R}^{H \times W \times 2}$ (two channels corresponding to non-water and water classes) and the teacher’s water probability map $P_T \in [0, 1]^{H \times W}$:
$$P_T = \mathrm{Softmax}\left(Z_T\right)_1,$$
where $\mathrm{Softmax}(\cdot)$ is the normalization function to convert logits into probabilities; the subscript 1 indicates selecting the probability value of the “water” channel (the second channel in the two-channel logit tensor).
2.5.3. Optimization Objective (Pixel-Wise Supervision with Ignored Pixels)
The teacher model is optimized using cross-entropy loss, with ignored pixels (labeled 255) excluded from loss computation. First, we define a valid-pixel indicator $M_v(x) = \mathbb{1}[Y(x) \neq 255]$, where $\mathbb{1}[\cdot]$ is the indicator function (returning 1 if the condition holds, 0 otherwise). The primary cross-entropy loss is formulated as:
$$\mathcal{L}_{CE} = -\frac{1}{\sum_{x} M_v(x)} \sum_{x} M_v(x) \sum_{k \in \{0, 1\}} \mathbb{1}\left[Y_w(x) = k\right] \log P_T^{(k)}(x),$$
where $k \in \{0, 1\}$ represents the binary classification labels (0 = non-water, 1 = water); $P_T^{(k)}(x)$ is the teacher's predicted probability of class $k$ at pixel $x$; $\log$ denotes the natural logarithm; and the outer sum $\sum_{x}$ iterates over all pixels in the patch. To alleviate potential class imbalance between water and non-water pixels, we optionally adopt inverse-frequency class weights. Let $N_k$ be the total number of pixels of class $k$ in the training set; the weight for class $k$ is defined as:
$$w_k = \frac{1 / N_k}{1 / N_0 + 1 / N_1},$$
where $N_0$ and $N_1$ are the pixel counts of non-water and water classes, respectively; the denominator $1/N_0 + 1/N_1$ ensures the weights are normalized to balance the contribution of each class.
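A minimal NumPy sketch of the masked cross-entropy with optional inverse-frequency weights is given below. Normalizing the inverse frequencies so that they sum to one, and averaging the weighted loss over the number of valid pixels, are assumptions of this sketch; the function names and the `eps` clamp are illustrative.

```python
import numpy as np

def inverse_freq_weights(n_nonwater, n_water):
    """Inverse-frequency weights for (non-water, water), normalized to sum to 1."""
    inv = np.array([1.0 / n_nonwater, 1.0 / n_water])
    return inv / inv.sum()

def masked_ce(prob_water, target, weights=None, ignore_id=255, eps=1e-7):
    """Binary cross-entropy averaged over valid pixels only.

    prob_water: predicted water probability per pixel
    target:     0 = non-water, 1 = water, ignore_id = excluded pixel
    """
    valid = target != ignore_id
    is_water = target == 1
    p = np.clip(prob_water, eps, 1.0 - eps)  # avoid log(0)
    nll = -np.where(is_water, np.log(p), np.log(1.0 - p))
    if weights is not None:
        nll = nll * np.where(is_water, weights[1], weights[0])
    return float(nll[valid].mean())

weights = inverse_freq_weights(n_nonwater=900, n_water=100)
loss = masked_ce(np.array([0.9, 0.1, 0.5]), np.array([1, 0, 255]))
```

In the example, the third pixel carries the ignore label and therefore does not contribute to the loss, regardless of the probability predicted there.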
2.5.4. (Optional) SAR-Guided Edge-Aware Regularization (Shoreline Emphasis)
To further encode SAR geometric cues and sharpen water shorelines, we introduce a lightweight edge-aware regularizer that leverages inherent SAR gradient information without requiring additional manual annotations. First, we compute the gradient magnitude map of the normalized SAR image (e.g., using the Sobel operator) and normalize it to the range $[0, 1]$:
$$E = \mathrm{Norm}\left(\left\lVert \nabla \tilde{I}_{sar} \right\rVert_2\right),$$
where ∇ denotes the gradient operator; $\lVert \cdot \rVert_2$ is the L2 norm to compute the gradient magnitude; and $\mathrm{Norm}(\cdot)$ is the min-max normalization function. Meanwhile, we derive a binary boundary target from the water mask $Y_w$ via morphological gradient:
$$B = \mathbb{1}\left[\nabla\left(\mathrm{bool}(Y_w)\right) > 0\right],$$
where $\mathrm{bool}(\cdot)$ converts the binary water mask into a logical tensor; the inner ∇ computes the morphological gradient to extract water boundary pixels; and the outer $\mathbb{1}[\cdot > 0]$ converts gradient values greater than 0 into binary boundary labels (1 = boundary pixel, 0 = non-boundary pixel). We then weight shoreline pixels using
the normalized SAR gradient magnitude $E$ and apply an edge-aware binary cross-entropy (BCE) loss $\mathcal{L}_{edge}$, in which the per-pixel BCE term is weighted according to the SAR gradient magnitude. Here, $\alpha$ is a hyperparameter controlling the strength of SAR edge emphasis (set to 0.5 in our experiments); $E(x)$ denotes the normalized gradient magnitude at pixel $x$, which assigns higher weights to edge regions in SAR images; and the remaining terms follow the definition of standard BCE loss.
Finally, the total optimization objective of the teacher model is the weighted sum of the primary cross-entropy loss and the edge-aware regularization loss:
$$\mathcal{L}_{T} = \mathcal{L}_{CE} + \lambda_{edge}\, \mathcal{L}_{edge},$$
where $\lambda_{edge}$ is the weight coefficient of the edge-aware loss (set to 0.3 in our experiments to balance the two loss terms; set to 0 if edge regularization is disabled).
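The two ingredients of the edge-aware regularizer, the normalized SAR gradient map and the boundary target derived from the water mask, can be sketched with NumPy alone. The 3×3 Sobel kernels, the 3×3 structuring element, and the edge-replicating padding are assumptions of this sketch; the loss weighting itself is not reproduced here.

```python
import numpy as np

def sobel_magnitude(img):
    """Min-max normalized Sobel gradient magnitude of a 2D image."""
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    mag = np.hypot(gx, gy)
    span = mag.max() - mag.min()
    return (mag - mag.min()) / span if span > 0 else np.zeros_like(mag)

def boundary_target(mask):
    """Morphological gradient (3x3 dilation minus erosion) of a binary mask."""
    m = np.pad(mask.astype(bool), 1, mode="edge")
    h, w = mask.shape
    shifts = [m[r:r + h, c:c + w] for r in range(3) for c in range(3)]
    dilated = np.logical_or.reduce(shifts)
    eroded = np.logical_and.reduce(shifts)
    return (dilated & ~eroded).astype(np.uint8)

# Toy scene: water occupies the left half of an 8 x 8 patch.
water = np.zeros((8, 8), dtype=np.uint8)
water[:, :4] = 1
edge_map = sobel_magnitude(water)
boundary = boundary_target(water)
```

On this toy input both maps respond on the two pixel columns that straddle the shoreline, which is the region such a regularizer is meant to emphasize.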
2.6. SAR-Guided Class-Aware Knowledge Distillation
This section describes the proposed SAR-guided class-aware knowledge distillation (SGCAD) strategy implemented in Stage 2 of the training pipeline. Unlike conventional global distillation methods that apply supervision to all pixels and land-cover classes, SGCAD transfers knowledge exclusively for the water-body class and only on reliable pixels selected via a confidence-gated mask. This targeted design mitigates noisy supervision signals and avoids negative transfer to other land-cover categories, directly addressing the core challenge of water-body under-optimization in multimodal optical–SAR fusion.
Teacher/Student Probabilities for the Water Class. The SAR-only teacher network (HRNet) is specialized for binary water-body segmentation and outputs per-pixel binary logits. For notational simplicity, we denote the teacher’s effective water logit as $z_T(x)$ and compute the corresponding water probability map as:
$$P_T(x) = \sigma\left(z_T(x)\right),$$
where $\sigma(\cdot)$ denotes the sigmoid activation function (mapping logits to the probability range $(0, 1)$); $x = (i, j)$ represents the 2D pixel coordinate with $1 \le i \le H$ and $1 \le j \le W$; and $H$ and $W$ are the height and width of the input patch, respectively. Note that for binary classification, a two-logit Softmax formulation is equivalent to applying Sigmoid to an effective binary logit; thus, we use the Sigmoid notation above for convenience and consistency in the distillation formulation.
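The equivalence between the two-logit Softmax and the Sigmoid form can be verified in one line. Writing the teacher's two logits at a pixel as $z_0$ (non-water) and $z_1$ (water), symbols introduced here only for illustration, the water probability is:

```latex
\begin{aligned}
\mathrm{Softmax}(z_0, z_1)_1
  &= \frac{e^{z_1}}{e^{z_0} + e^{z_1}}
   = \frac{1}{1 + e^{-(z_1 - z_0)}}
   = \sigma(z_1 - z_0).
\end{aligned}
```

Under this notation the effective water logit is simply the difference $z_1 - z_0$ of the two channels.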
The multimodal student network (LightMCANet) performs six-class land-cover segmentation, predicting a six-class logit tensor $Z_S \in \mathbb{R}^{H \times W \times 6}$ (one channel per land-cover class) and the softmax-normalized probability distribution:
$$P_S^{(c)}(x) = \mathrm{Softmax}\left(Z_S(x)\right)_c,$$
where $\mathrm{Softmax}(\cdot)$ normalizes logits across the six-class dimension to ensure the sum of probabilities equals 1; $c$ denotes the class index (0 = farmland, 1 = city, 2 = village, 3 = water body, 4 = forest, 5 = road); the student’s probability for the water-body class is denoted as $P_S^{(w)}(x)$ with $w = 3$ (i.e., $P_S^{(w)}(x) = P_S^{(3)}(x)$).
Confidence-Gated Distillation Mask (Selective-GT Strategy). Instead of applying distillation globally across all pixels, SGCAD activates distillation supervision only on pixels where the teacher’s prediction is sufficiently reliable, thereby avoiding uncertain regions (e.g., mixed pixels, complex shorelines) that may introduce noisy supervision. To achieve this, two sets of reliable pixels are constructed to gate the distillation loss:
Reliable water pixels: Pixels predicted as water by the teacher with high confidence and consistent with the ground-truth water label. This set transfers the teacher’s structural prior for water-body delineation while excluding ambiguous or mixed-pixel regions.
Reliable non-water pixels: Pixels predicted as non-water by the teacher with high confidence and consistent with the ground truth. This set prevents over-expansion of the water-body region and reduces the propagation of teacher errors to other land-cover classes.
The final distillation mask is defined as the union of these two sets, and the distillation loss is computed exclusively within this masked region to ensure supervision is only applied to high-confidence, ground-truth-consistent pixels.
2.6.1. (1) High-Confidence Water Pixels
$$M_{hi}(x) = \mathbb{1}\left[\,P_T(x) \ge \tau_{hi} \wedge Y(x) = w\,\right],$$
where $\mathbb{1}[\cdot]$ is the indicator function (returning 1 if the condition holds, 0 otherwise); $\tau_{hi}$ is the high-confidence threshold for the teacher’s water prediction (set to 0.95 based on validation experiments); $Y(x)$ is the ground-truth label at pixel $x$; and $w$ denotes the water-body class (index 3).
2.6.2. (2) Low-Confidence (Non-Water) Pixels
$$M_{lo}(x) = \mathbb{1}\left[\,P_T(x) \le \tau_{lo} \wedge Y(x) \neq w\,\right],$$
where $\tau_{lo}$ is the low threshold on the teacher’s water probability (set to 0.15 to ensure only highly confident non-water pixels are included). The final distillation mask is the union of $M_{hi}$ and $M_{lo}$:
$$M_{kd}(x) = \max\left(M_{hi}(x),\, M_{lo}(x)\right),$$
where $\max(\cdot, \cdot)$ denotes the element-wise maximum operation to realize mask union. This selective design ensures only ground-truth-consistent pixels participate in distillation, avoiding the propagation of teacher prediction errors to the student network.
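The construction of the confidence-gated mask can be sketched in NumPy as follows. The thresholds follow the values stated in the text; the explicit exclusion of ignored pixels (label 255) from the non-water set and the use of inclusive comparisons are assumptions of this sketch.

```python
import numpy as np

def kd_mask(p_teacher, label, water_id=3, tau_hi=0.95, tau_lo=0.15, ignore_id=255):
    """Union of reliable-water and reliable-non-water pixels as a 0/1 float mask."""
    reliable_water = (p_teacher >= tau_hi) & (label == water_id)
    # Assumption: ignored pixels never count as reliable non-water.
    reliable_other = (p_teacher <= tau_lo) & (label != water_id) & (label != ignore_id)
    return np.maximum(reliable_water, reliable_other).astype(np.float32)

p_teacher = np.array([0.99, 0.99, 0.05, 0.05, 0.50, 0.05])
label = np.array([3, 0, 0, 3, 3, 255])
mask = kd_mask(p_teacher, label)
```

Only the first and third pixels survive: the teacher is confident there and agrees with the ground truth, whereas the remaining pixels are contradicted by the label, uncertain, or ignored.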
Masked BCE Distillation Loss. To align the student's water-body probability predictions with the teacher's reliable SAR-derived priors, a masked binary cross-entropy (BCE) loss is adopted, which restricts supervision to the distillation mask $M$ defined above:
$$\mathcal{L}_{kd} = \frac{1}{|M|} \sum_{i} M(i)\, \mathrm{BCE}\!\left(p^{S}_{i},\, p^{T}_{i}\right) \tag{21}$$
where $|M| = \sum_{i} M(i)$ is the total number of valid pixels in the distillation mask (normalizing the loss to avoid scale bias from varying mask sizes); and $\mathrm{BCE}(\cdot,\cdot)$ is the binary cross-entropy loss function that quantifies the discrepancy between the student's water-body probability $p^{S}_{i}$ and the teacher's water-body probability $p^{T}_{i}$.
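As a hedged illustration of the masked loss, the following NumPy sketch averages a soft-target binary cross-entropy over mask-active pixels only. The function name, the clipping constant `eps`, and the choice to return zero for an empty mask are assumptions of this sketch rather than details confirmed by the paper.

```python
import numpy as np

def masked_bce_distillation(student_prob, teacher_prob, mask, eps=1e-7):
    """Masked BCE between student and teacher water probabilities.

    All inputs are arrays of identical shape; mask holds 0/1 values.
    The teacher probability acts as a soft target.
    """
    s = np.clip(student_prob, eps, 1.0 - eps)  # avoid log(0)
    bce = -(teacher_prob * np.log(s) + (1.0 - teacher_prob) * np.log(1.0 - s))
    n_valid = mask.sum()
    if n_valid == 0:
        # Assumption: an empty mask contributes no distillation signal.
        return 0.0
    # Normalize by the number of active pixels, not by the image size.
    return float((bce * mask).sum() / n_valid)

student = np.array([[0.5, 0.9]])
teacher = np.array([[0.5, 0.1]])
active = np.array([[1.0, 0.0]])
loss = masked_bce_distillation(student, teacher, active)
```

Because the second pixel is masked out, its large student-teacher disagreement does not affect the result, and the loss equals the BCE of the first pixel alone.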
Overall Objective with Warm-Up Scheduling. The segmentation loss in Stage 2 retains the same formulation as Stage 1, combining weighted cross-entropy (wCE), Dice loss, and Lovász loss to address class imbalance and improve the segmentation of slender structures (e.g., roads, water shorelines):
$$\mathcal{L}_{seg} = \lambda_{1}\, \mathcal{L}_{wCE} + \lambda_{2}\, \mathcal{L}_{Dice} + \lambda_{3}\, \mathcal{L}_{Lov} \tag{22}$$
where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the weight coefficients for the three loss terms (determined via validation); $\mathcal{L}_{wCE}$ is the weighted cross-entropy loss with inverse-frequency class weights to mitigate class imbalance; $\mathcal{L}_{Dice}$ enhances the segmentation of small or slender targets; and $\mathcal{L}_{Lov}$ optimizes a tractable surrogate of the intersection-over-union (IoU) metric for improved boundary precision.
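To make the composite objective concrete, the sketch below implements a common soft Dice formulation and the weighted sum of the three terms in NumPy. The exact Dice variant and the loss weights used by the authors are not stated in this section, so the smoothing constant `eps`, the function names, and the default weights are placeholders. The Lovász term is passed in as a precomputed value rather than implemented here.

```python
import numpy as np

def soft_dice_loss(probs, labels, eps=1e-6):
    """Soft Dice loss averaged over classes (one common formulation).

    probs:  (C, H, W) per-class probabilities (e.g., softmax output).
    labels: (H, W) integer class indices in [0, C).
    """
    num_classes = probs.shape[0]
    one_hot = np.moveaxis(np.eye(num_classes)[labels], -1, 0)  # (C, H, W)
    intersection = (probs * one_hot).sum(axis=(1, 2))
    cardinality = probs.sum(axis=(1, 2)) + one_hot.sum(axis=(1, 2))
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return float(1.0 - dice.mean())

def segmentation_loss(l_wce, l_dice, l_lovasz, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the default weights are placeholders."""
    w1, w2, w3 = weights
    return w1 * l_wce + w2 * l_dice + w3 * l_lovasz

labels = np.array([[0, 1], [1, 0]])
perfect = np.moveaxis(np.eye(2)[labels], -1, 0)  # prediction equal to the labels
uniform = np.full((2, 2, 2), 0.5)                # maximally uncertain prediction
dice_perfect = soft_dice_loss(perfect, labels)
dice_uniform = soft_dice_loss(uniform, labels)
```

A perfect prediction drives the Dice term to zero, while the uniform prediction yields roughly 0.5, which shows why the term rewards overlap on small classes independently of their pixel count.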
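To ensure the student network first retains its basic multimodal segmentation ability before adapting to the teacher's SAR-derived priors, a warm-up strategy is implemented that disables distillation during the initial training epochs:
$$g(e) = \begin{cases} 0, & e \le E_{w} \\ 1, & e > E_{w} \end{cases} \tag{23}$$
where $g(e)$ is the warm-up gating function; $e$ is the (1-indexed) training epoch; and $E_{w}$ denotes the warm-up period (distillation is disabled for the first 10 epochs of Stage 2). The total loss objective for Stage 2 training is:
$$\mathcal{L}_{total} = \mathcal{L}_{seg} + \lambda_{kd}\, g(e)\, \mathcal{L}_{kd} \tag{24}$$
where $\mathcal{L}_{seg}$ is the segmentation loss, $\mathcal{L}_{kd}$ is the masked BCE distillation loss, and $\lambda_{kd}$ is the distillation loss weight (set to a small value to prevent over-distillation and preserve the student's multimodal representation ability for non-water classes).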
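The warm-up gating and the Stage 2 objective can be sketched in a few lines of plain Python. Epochs are assumed to be counted from 1, and the distillation weight used in the usage lines (0.1) is a placeholder, since the actual value is not given in this section.

```python
def warmup_gate(epoch, warmup_epochs=10):
    """Return 0.0 during warm-up (epochs 1..warmup_epochs), 1.0 afterwards."""
    return 0.0 if epoch <= warmup_epochs else 1.0

def stage2_total_loss(seg_loss, kd_loss, epoch, kd_weight, warmup_epochs=10):
    """Segmentation loss plus the gated, weighted distillation loss."""
    return seg_loss + kd_weight * warmup_gate(epoch, warmup_epochs) * kd_loss

# kd_weight=0.1 is a hypothetical value used only for this example.
during_warmup = stage2_total_loss(1.0, 2.0, epoch=10, kd_weight=0.1)
after_warmup = stage2_total_loss(1.0, 2.0, epoch=11, kd_weight=0.1)
```

With this hard gate, the distillation term switches on abruptly at epoch 11; a small distillation weight limits the size of that jump in the total loss.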
KD Region Ratio (Reporting Protocol). To quantify the scope of distillation supervision across training epochs, the KD region ratio is defined as the percentage of distillation-active pixels (within the mask) relative to the total number of pixels in each training mini-batch:
$$r_{e} = \frac{1}{N} \sum_{n=1}^{N} \frac{|M_{n}|}{P} \times 100\% \tag{25}$$
where $N$ is the number of training mini-batches in epoch $e$; $M_{n}$ is the distillation mask of the $n$-th mini-batch; $|M_{n}|$ is the number of valid pixels in $M_{n}$; and $P$ is the total number of pixels over which $M_{n}$ is defined (the pixels of the input patches in that mini-batch). The epoch-average ratio $\bar{r} = \frac{1}{E} \sum_{e=1}^{E} r_{e}$ (over $E$ total Stage 2 epochs) is reported to provide transparency on the scale of distillation supervision.
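A minimal sketch of the reporting protocol, assuming each mini-batch mask is available as a binary NumPy array covering every pixel in that mini-batch (the function name and this data layout are illustrative assumptions):

```python
import numpy as np

def kd_region_ratio(batch_masks):
    """Mean percentage of distillation-active pixels over the mini-batches of one epoch."""
    per_batch = [float(np.sum(m)) / np.size(m) for m in batch_masks]
    return 100.0 * sum(per_batch) / len(per_batch)

# Two toy mini-batches: half of the pixels active, then none active.
masks = [np.array([[1, 0], [1, 0]]), np.zeros((2, 2))]
ratio = kd_region_ratio(masks)
```

Averaging the per-batch ratios, rather than pooling all pixels, matches the per-mini-batch definition in the text; the two coincide when every mini-batch has the same number of pixels.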
All key hyperparameters for the SGCAD distillation strategy, including confidence thresholds, loss weights, warm-up configuration, and evaluation metrics, are summarized in Table 2, with parameter values determined based on validation performance and domain-specific prior knowledge of SAR water-body segmentation (see Section 4 for detailed parameter selection rationale). The high-confidence threshold $\tau_{h} = 0.95$ was chosen to ensure that only highly reliable water pixels are used for distillation. The low-confidence threshold $\tau_{l} = 0.15$ was selected to exclude ambiguous regions while retaining sufficient non-water supervision. The distillation weight $\lambda_{kd}$ was set small to avoid over-distillation and preserve the student's multimodal representation ability. The 10-epoch warm-up stabilizes the student model before distillation begins.
2.7. Training Protocol and Implementation Details
Stage 1: Multimodal Six-Class Training. The LightMCANet student network was trained with a batch size of 48 on a single NVIDIA GPU for 150 epochs using the AdamW optimizer. The optimization setup used the base learning rate and weight decay listed in Table 3, together with a polynomial learning rate (PolyLR) schedule with a power of 0.9. The segmentation loss follows Equation (26), combining weighted cross-entropy (wCE), Dice loss, and Lovász loss with weights $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$, which are selected to mitigate class imbalance and improve boundary-sensitive categories.
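For reference, a polynomial learning-rate schedule of the kind named above is commonly written as the base rate multiplied by $(1 - t/T)^{p}$. The sketch below uses that standard form with power 0.9; whether the authors step the schedule per iteration or per epoch is not stated, and the base rate of 1.0 in the usage lines is a placeholder chosen only to expose the decay factor.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power

# base_lr=1.0 is a placeholder, so the returned values are pure decay factors.
start = poly_lr(1.0, step=0, total_steps=150)
middle = poly_lr(1.0, step=75, total_steps=150)
end = poly_lr(1.0, step=150, total_steps=150)
```

With a power below 1, the rate stays slightly above a linear ramp for most of training (about 0.54 of the base rate at the midpoint) and then falls to zero at the final step.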
Teacher Training: SAR Water Binary Segmentation. The HRNet teacher network was trained exclusively on SAR data for binary water-body segmentation (water vs. non-water) for 150 epochs with a batch size of 44. The AdamW optimizer was used with the base learning rate and weight decay given in Table 3, and the same PolyLR schedule (power = 0.9) as Stage 1 was applied to ensure consistent learning rate decay across teacher and student training.
Stage 2: Distillation Fine-Tuning. Stage 2 fine-tunes the pre-trained Stage 1 student network using the proposed SAR-guided class-aware knowledge distillation (SGCAD) strategy for 80 epochs. The base learning rate is reduced relative to Stage 1 (see Table 3) to avoid overwriting pre-trained features, and a warm-up strategy is implemented in which knowledge distillation is disabled for the first 10 epochs. The distillation loss is weighted by $\lambda_{kd}$ to balance segmentation performance and distillation supervision, with confidence thresholds set to 0.95 (high-confidence water pixels) and 0.15 (high-confidence non-water pixels) for mask construction. The distillation strategy employed is bce_selective_gt (selective binary cross-entropy loss constrained to ground-truth-consistent pixels).
Key training hyperparameters for the teacher network and student network (Stage 1 and Stage 2) are summarized in Table 3, including optimizer type, learning rate settings, training epochs, batch size, and weight decay, all standardized to ensure reproducibility and fair comparison across models. Training hyperparameters were chosen to ensure stable convergence and consistent training across all models. A higher base learning rate was used for initial training, while a reduced rate was adopted in Stage 2 to fine-tune the model gently. Batch sizes were set based on GPU memory constraints while maintaining training stability. The number of epochs (150 for initial training, 80 for distillation) was determined via early stopping on the validation set to avoid overfitting.
Data Augmentation and Inference. Synchronized data augmentation was applied to both optical and SAR modalities to ensure spatial alignment and enhance model generalization. The augmentation strategy included random horizontal/vertical flips, random rotations within a bounded angular range, and random cropping/resizing operations, all designed to simulate real-world variations in remote sensing imagery. Detailed specifications of the data augmentation operations are provided in Table 4.
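The key point of synchronized augmentation is that one set of random draws is shared by the optical patch, the SAR patch, and the label map. The NumPy sketch below illustrates this with flips and rotations by multiples of 90 degrees. The 90-degree rotations are a stand-in for the paper's rotation range, and the function name, band counts, and 0.5 flip probability are assumptions made for illustration.

```python
import numpy as np

def sync_augment(optical, sar, label, rng):
    """Apply identical random flips and a k*90-degree rotation to all inputs.

    optical: (C_opt, H, W), sar: (C_sar, H, W), label: (H, W).
    The spatial axes are always the last two, so one transform fits all arrays.
    """
    arrays = [optical, sar, label]
    if rng.random() < 0.5:                       # horizontal flip
        arrays = [np.flip(a, axis=-1) for a in arrays]
    if rng.random() < 0.5:                       # vertical flip
        arrays = [np.flip(a, axis=-2) for a in arrays]
    k = int(rng.integers(0, 4))                  # rotate by k * 90 degrees
    arrays = [np.rot90(a, k, axes=(-2, -1)) for a in arrays]
    return tuple(np.ascontiguousarray(a) for a in arrays)

rng = np.random.default_rng(0)
index_map = np.arange(16).reshape(4, 4)          # each pixel carries a unique id
optical = np.stack([index_map] * 3)              # stand-in for a 3-band optical patch
sar = index_map[None]                            # stand-in for a 1-band SAR patch
opt_aug, sar_aug, lab_aug = sync_augment(optical, sar, index_map, rng)
```

Using a pixel-id map as the toy input makes alignment easy to verify: after augmentation every band of both modalities must still equal the transformed label map.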
Model Evaluation and Metrics. During training and validation, key evaluation metrics were recorded to quantify segmentation performance: overall accuracy (OA), kappa coefficient (a measure of agreement corrected for chance), mean intersection-over-union (mIoU), and per-class IoU (with particular focus on the water-body class). For class-wise performance analysis, user’s accuracy (UA) was also computed to evaluate the precision of individual land-cover class predictions, with emphasis on water-body IoU improvements in Stage 2 attributable to the distillation process.
Early Stopping and Monitoring. Early stopping was implemented to prevent overfitting, with the stopping criterion based on the Kappa coefficient and a combined evaluation score (weighted OA + mIoU) on the validation set during Stage 2. Additionally, water-body IoU was monitored independently to track the impact of distillation on the target class, and a fusion evaluation mode was enabled in Stage 2 to compare predictions from the HRNet teacher and LightMCANet student networks for consistency analysis.
Quantitative evaluation results for the student network (Stage 1 and Stage 2) and HRNet teacher network are summarized in Table 5, demonstrating the performance improvements achieved via the SGCAD distillation strategy, particularly for the water-body class.
All models were implemented in PyTorch 2.0 (Python 3.9) and trained on a single NVIDIA RTX 3090 GPU (24 GB VRAM) to ensure computational reproducibility. A fixed random seed (42) was set for Python, NumPy, and PyTorch to control randomness in training and evaluation. All experiments were conducted on the fixed train/validation/test data split described in Section 2.1, and final performance metrics were computed using the best-performing model checkpoint (selected based on validation mIoU) to avoid overfitting to the test set.
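Seeding the three random number generators mentioned above might look like the following sketch. The helper name `set_seed` is hypothetical, and the PyTorch calls are wrapped in a guarded import so the snippet also runs where PyTorch is not installed. Fixing seeds alone does not guarantee bit-identical GPU results.

```python
import random
import numpy as np

def set_seed(seed=42):
    """Seed Python, NumPy, and (if available) PyTorch random generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; only Python and NumPy are seeded

set_seed(42)
first_draw = (random.random(), float(np.random.rand()))
set_seed(42)
second_draw = (random.random(), float(np.random.rand()))
```

Re-seeding reproduces the same sequence of draws, which is what makes data shuffling and augmentation repeatable across runs.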
Optimization and Loss Functions
Student Stage 1 (six-class multimodal segmentation). The LightMCANet student network was trained for 150 epochs using the AdamW optimizer with the initial learning rate and weight decay listed in Table 3. A PolyLR schedule with power 0.9 was applied to decay the learning rate over training. The segmentation loss objective is defined as:
$$\mathcal{L}_{seg} = \lambda_{1}\, \mathcal{L}_{wCE} + \lambda_{2}\, \mathcal{L}_{Dice} + \lambda_{3}\, \mathcal{L}_{Lov} \tag{26}$$
where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the loss weights, and the weighted cross-entropy (wCE) loss uses inverse-frequency class weights computed on the training set to mitigate class imbalance.
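Inverse-frequency class weights can be computed from the training label maps as sketched below. The normalization (rescaling so the weights average to 1), the handling of the ignore label 255 for the student's labels, and the function name are assumptions of this sketch. The paper states only that the weights are inverse-frequency and computed on the training set.

```python
import numpy as np

def inverse_frequency_weights(label_maps, num_classes, ignore_index=255):
    """Class weights proportional to 1 / pixel frequency, rescaled to mean 1."""
    counts = np.zeros(num_classes, dtype=np.float64)
    for labels in label_maps:
        valid = labels[labels != ignore_index]      # 1-D array of valid labels
        counts += np.bincount(valid, minlength=num_classes)[:num_classes]
    freq = counts / counts.sum()
    # Classes absent from the training set would receive a very large weight.
    weights = 1.0 / np.maximum(freq, 1e-12)
    return weights / weights.sum() * num_classes

# Toy label map: class 0 covers 3 pixels, class 1 covers 1, one pixel is ignored.
toy_labels = [np.array([[0, 0, 0], [1, 255, 255]])[:, :2],
              np.array([[0, 255]])]
class_weights = inverse_frequency_weights(toy_labels, num_classes=2)
```

In the toy example class 0 is three times as frequent as class 1, so it receives one third of the weight (0.5 versus 1.5 after rescaling).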
Teacher (SAR-only, binary water-body segmentation). The HRNet teacher network was trained for 150 epochs using the AdamW optimizer with the learning rate listed in Table 3 and the same PolyLR schedule (power = 0.9) as the student network. Pixel-wise cross-entropy loss was used for binary supervision (water vs. non-water), with ignored pixels (label = 255, e.g., out-of-study-area regions) excluded from loss computation.
Student Stage 2 (distillation fine-tuning). Stage 2 fine-tunes the Stage 1 checkpoint for 80 epochs with a reduced base learning rate (see Table 3) to preserve pre-trained multimodal features. The total loss objective for Stage 2 is $\mathcal{L}_{total} = \mathcal{L}_{seg} + \lambda_{kd}\, g(e)\, \mathcal{L}_{kd}$, where $\lambda_{kd}$ is the distillation loss weight, $g(e)$ denotes a warm-up gating function (distillation disabled for the first 10 epochs), and $\mathcal{L}_{kd}$ is the masked binary cross-entropy (BCE) loss applied exclusively on confidence-gated pixels (defined in Equations (18)–(20)) to ensure targeted distillation for the water-body class.
2.8. Evaluation Metrics
We report overall accuracy (OA), kappa coefficient, mean intersection-over-union (mIoU), and per-class IoU. For class-wise analysis, user’s accuracy (UA) is provided, with particular focus on water-body improvements.
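All of the reported metrics can be derived from a single confusion matrix. The sketch below uses the standard definitions of OA, Cohen's kappa, per-class IoU, and user's accuracy. The function name, the row/column convention, and the NaN handling for absent classes are assumptions of this sketch.

```python
import numpy as np

def segmentation_metrics(conf):
    """Metrics from a confusion matrix with conf[i, j] = pixels of true class i predicted as j."""
    conf = np.asarray(conf, dtype=np.float64)
    total = conf.sum()
    tp = np.diag(conf)
    n_true = conf.sum(axis=1)                    # reference pixels per class
    n_pred = conf.sum(axis=0)                    # predicted pixels per class
    oa = tp.sum() / total                        # overall accuracy
    expected = (n_true * n_pred).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)   # chance-corrected agreement
    union = n_true + n_pred - tp
    iou = np.where(union > 0, tp / np.maximum(union, 1.0), np.nan)
    ua = np.where(n_pred > 0, tp / np.maximum(n_pred, 1.0), np.nan)  # per-class precision
    return {"OA": float(oa), "kappa": float(kappa),
            "mIoU": float(np.nanmean(iou)), "IoU": iou, "UA": ua}

# Toy two-class confusion matrix (rows: reference, columns: prediction).
metrics = segmentation_metrics([[40, 10],
                                [5, 45]])
```

For the six-class setting described in the paper, the water-body IoU would be read from `metrics["IoU"][3]`, given that water is class index 3.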
4. Discussion
This section discusses the experimental findings of SGCAD from the perspective of multimodal fusion theory and previous studies. We interpret why SAR-guided class-aware distillation improves water-body delineation, analyze the trade-offs across categories, and highlight limitations and future directions for large-scale applications.
4.1. Why SAR-Guided Class-Aware Distillation Improves Water Segmentation
A consistent observation in our experiments is that the water-body category benefits the most from the proposed teacher–student paradigm. This can be attributed to the intrinsic imaging characteristics of SAR, in which water surfaces generally exhibit low backscattering intensity and homogeneous spatial patterns in satellite imagery, providing stable and consistent cues for water-body extraction regardless of illumination and atmospheric conditions. Consequently, a SAR-only teacher trained on such multi-temporal satellite SAR images can serve as a reliable “water expert” that provides strong structural priors for shoreline delineation. In contrast, multimodal fusion networks may suffer from local modality conflicts when optical textures and SAR scattering patterns are inconsistent, leading to fragmented boundaries and spurious holes in water masks. By transferring the teacher’s water probability in a selective manner, the student learns to enhance water-related decision boundaries while still leveraging optical information for other land-cover classes.
The observed degradation of water segmentation under multimodal fusion indicates a limitation of indiscriminate fusion with global supervision, rather than a limitation of using multimodal data itself. When optical and SAR provide inconsistent evidence at local regions (due to shadows, speckle, geometric misalignment, or imaging timing differences), enforcing a global multi-class objective encourages the network to compromise between modalities, which can weaken the separability of classes that are otherwise well distinguished in a single modality. Our method fundamentally changes this optimization behavior by (i) decoupling water learning via a SAR-only expert teacher, and (ii) restricting distillation to reliable pixels through confidence gating, thereby preventing noisy or misaligned regions from dominating the water decision boundary.
4.2. Role of Confidence Gating: Reducing Noisy Distillation and Preventing Over-Constraint
Unlike dense distillation that enforces teacher supervision on all pixels, SGCAD applies the distillation loss only to pixels where the teacher prediction is sufficiently reliable, with a focus on high-confidence water regions. This design is important because teacher outputs can be uncertain around mixed pixels, complex shorelines, and SAR shadow/layover areas, where indiscriminate distillation may propagate teacher errors to the student. Such noisy transfer can over-constrain the six-class softmax student and potentially exacerbate class competition, leading to degradations in non-water categories. Our ablation results consistently indicate that confidence gating stabilizes training and improves water-body delineation. Moreover, the small distillation weight further reduces the risk of over-distillation, allowing the student to retain its multimodal representation while benefiting from the teacher’s structural prior.
We selected the main SGCAD hyperparameters (distillation weight $\lambda_{kd}$ and confidence thresholds $\tau_{h}$ and $\tau_{l}$) based on validation performance. In practice, $\lambda_{kd}$ was searched over a small range of values to avoid over-distillation, and the confidence thresholds were chosen to balance reliability and coverage of the distillation mask. The final values reported in Table 2 were used for all experiments.
4.3. Boundary Continuity and Slender Structures: Effect of SEGM
A second noticeable improvement appears in slender structures and boundaries (e.g., roads and shorelines). This is consistent with the hypothesis that shallow SAR features contain strong gradient cues that are less affected by illumination variations. SEGM exploits such cues to refine decoder responses, which helps reduce boundary fragmentation and improves continuity. Qualitative comparisons (Figure 6) show that the student baseline may produce discontinuous road segments or jagged shorelines, whereas SGCAD yields more coherent boundaries. This supports the view that boundary-aware refinement remains valuable even in attention-based fusion backbones, especially for high-resolution remote sensing segmentation where object shapes are elongated and thin.
4.4. Comparison with Existing Multimodal Fusion and Distillation Strategies
Prior multimodal fusion research has explored early fusion (stacking), late fusion, and cross-attention interaction for optical–SAR semantic segmentation. While these methods improve overall performance, they often lack a mechanism to explicitly protect a critical category that suffers from modal conflicts or class imbalance. Our results suggest that “category-focused” guidance can be more effective than uniformly strengthening all classes, particularly when one class (water) has a strong modality-specific signature in SAR. Compared with generic distillation methods that use global logits or features, the proposed class-aware and confidence-gated distillation is better aligned with the structured nature of semantic segmentation and the heterogeneity of optical–SAR inputs.
4.5. Error Analysis and Failure Cases
Although the proposed distillation strategy substantially improves water-body delineation, a few challenging scenarios remain. First, mixed pixels and gradual transitions around paddy fields, wetlands, and shadowed regions can blur the boundary between farmland and water. In these areas, SAR may exhibit locally reduced backscatter that partially overlaps with the water-like response, increasing ambiguity for both teacher and student. Second, complex urban scenes may contain dark roofs, narrow water channels, or strong specular reflections, which can occasionally induce water-like appearances and lead to false positives, especially under certain viewing geometries or incidence-angle conditions. Third, label noise and annotation uncertainty near class boundaries—particularly for slender structures and fine shoreline details—can cap the attainable IoU even when the prediction is visually reasonable. In the revised manuscript, we will provide representative qualitative examples covering the above cases and discuss the underlying causes and potential mitigation strategies.
4.6. Generalization, Practical Implications, and Future Directions
The proposed framework is suitable for practical land-cover mapping and monitoring because it exploits the complementary strengths of optical and SAR imagery and improves robustness for water-related classes that are critical for flood assessment and water-resource management. A natural extension is to generalize the “expert teacher” paradigm to other categories that exhibit modality-specific signatures, such as roads benefiting from SAR geometric cues and linear structures, or buildings characterized by strong double-bounce responses. Another promising direction is to incorporate semi-supervised or self-supervised representation learning for the teacher model to reduce dependence on densely annotated masks while maintaining reliable pseudo-label quality. From a deployment perspective, evaluating computational cost (e.g., parameter count, FLOPs, and inference throughput) alongside accuracy would provide a clearer view of the efficiency–performance trade-off for large-scale inference and time-sensitive applications.
A key practical condition of our current framework is the availability of co-registered optical and SAR observations acquired within a sufficiently small time lag. For rapidly evolving events such as active flooding, a temporal mismatch between SAR and optical acquisitions may introduce semantic inconsistency (e.g., water extent changes) and potentially bias the fused prediction. In addition, optical imagery is unavailable at night and may be missing under heavy cloud cover, which limits strictly synchronous optical–SAR deployment. In this study, we focus on land-cover and surface-water mapping scenarios where paired optical–SAR observations are near-synchronous and the scene semantics are relatively stable during acquisition. As future work, we will extend the framework toward incomplete multimodal settings by enabling robust inference under missing optical input (e.g., SAR-only fallback and modality-dropout training), and by incorporating temporal pairing strategies (time-window matching) to mitigate the impact of acquisition time gaps in dynamic scenes.