Gated Lightweight CNN-Transformer Fusion for Real-Time Flood Segmentation on Satellite Internet Terminals Under Triple-Disruption Emergency Conditions

Nie, Yungui; Shi, Zhiguo; Li, Jianing; Ge, HuiLing

doi:10.3390/rs18091418


¹ School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
² Beijing Key Laboratory of AI Metrology Technology for Future Cities, Beijing 100083, China
³ Xiong An New Area Key Laboratory of Large Model Scenario Application Technology, Beijing 100083, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(9), 1418; https://doi.org/10.3390/rs18091418

Submission received: 28 February 2026 / Revised: 30 March 2026 / Accepted: 28 April 2026 / Published: 3 May 2026

(This article belongs to the Special Issue Advances in Earth Observation to Improve Flood Disaster Monitoring and Management (Second Edition))


Highlights

What are the main findings?

A Dynamic Gated Fusion Mechanism (DGFM) has been proposed to adaptively balance CNN local features and Transformer global context for SAR flood segmentation. The gated, lightweight CNN–Transformer hybrid enables real-time inference (~0.22 s/frame) on CPU-only satellite terminals and achieves an mIoU of 0.5814 and an F1 score of 0.7028.
Quality-aware training effectively leverages weak-label data from the Sen1Floods11 dataset, reducing reliance on costly manual annotations while maintaining competitive segmentation performance and cross-region generalization (precision = 0.7481).

What are the implications of the main findings?

The lightweight, CPU-only design enables deployable flood mapping on resource-constrained satellite internet terminals during triple-disruption emergencies involving network outages, power cuts and blocked roads.
The DGFM-based fusion approach provides a systematic solution for fusing local and global features in SAR flood segmentation, facilitating reliable real-time disaster response.

Abstract

During flood disasters, on-site operations often face the “triple disruption” of network outages, power cuts and blocked roads. This renders terrestrial cellular infrastructure inoperable and disrupts communication links. Satellite internet can partially restore emergency communications thanks to its wide-area coverage and resistance to ground damage. However, limited computing power, memory and unstable bandwidth at the terminal prevent cloud-based flood segmentation from providing near-real-time situational awareness. This paper therefore proposes a lightweight semantic flood segmentation framework for emergency terminals that uses satellite internet. This comprises a parallel dual-branch design with a lightweight U-Net-style convolutional neural network (CNN) branch for local boundary details and a compact Transformer branch for global context. A dynamic gated fusion mechanism (DGFM) balances local texture and global information adaptively. Experiments on the public synthetic aperture radar (SAR) dataset Sen1Floods11 demonstrate that the hybrid architecture strikes a balance between accuracy and inference efficiency. The proposed method combines gated fusion with quality-aware training. Compared to a lightweight CNN baseline and state-of-the-art segmentation models using the same protocol, the proposed configuration (Hybrid-Gated with Quality-Aware Training) achieves the highest mean intersection over union and F1 score among the compared fusion variants, while maintaining competitive false alarm and risk-sensitive performance under deployment constraints. This aligns with the preferences of emergency decision makers. The framework provides a deployable perception module for emergency systems supported by low-orbit satellites and terrestrial networks under triple-disruption conditions.

Keywords:

CNN–Transformer fusion; dynamic gated fusion; flood segmentation; satellite internet emergency terminals

1. Introduction

Floods are among the most devastating natural hazards globally, causing substantial economic losses and casualties every year [1]. Accurate and timely flood extent mapping is essential for effective emergency response, yet traditional ground-based monitoring is frequently impractical in the event of extreme disasters. Synthetic Aperture Radar (SAR) has emerged as the primary data source for flood monitoring due to its day-and-night, all-weather imaging capabilities [2,3], enabling inundation information to be captured regardless of meteorological conditions. However, conventional SAR flood segmentation relies heavily on centralised cloud computing, which requires the transmission of large amounts of raw SAR data. This approach fails to meet the demand for near-real-time situational awareness in emergency response scenarios [4].

An even greater challenge arises in “triple-disruption” emergency scenarios, which are characterized by network outages, power cuts, and road blockages [5,6]. In these scenarios, terrestrial communication infrastructure and cloud-based processing pipelines are completely paralyzed. Satellite internet has become an indispensable backup for critical communications thanks to its wide-area coverage and resistance to ground damage [7]. Nevertheless, satellite internet terminals face resource constraints, including limited computing power, restricted memory, and unstable bandwidth [8,9]. These constraints preclude cloud-centric processing, necessitating a paradigm shift toward on-device, terminal-side segmentation. This approach generates decision-ready flood masks directly on satellite terminals without relying on cloud resources. The core application problem that emerges is how to develop a lightweight, real-time, and reliable SAR flood segmentation framework tailored to resource-constrained satellite terminals under triple-disruption conditions.

Despite significant progress in deep learning-based flood segmentation, existing methods remain insufficient to address this core problem due to three technical gaps. First, CNN-based architectures typified by U-Net [10] and its derivatives [11] preserve local spatial details effectively, but their inherently local receptive fields can lead to misclassification of SAR confounders, such as terrain shadows and dark urban roofs, that exhibit radiometric characteristics similar to water bodies [12]. Second, Transformer models [13] and efficient variants such as Swin Transformer [14] improve long-range contextual modeling [15], yet often impose computational overhead that is difficult to accommodate on resource-constrained terminals [16,17], while potentially weakening fine-grained boundary refinement. Third, existing CNN-Transformer hybrids [18,19,20] are mostly designed for high-performance cloud environments [21] and commonly rely on static fusion strategies (e.g., concatenation or averaging), which are less adaptive to the diverse flood morphologies in SAR imagery. In addition, general-purpose edge inference methods [22,23], typically developed for RGB benchmarks or mobile-GPU settings, do not fully account for dual-channel SAR characteristics or strict CPU-only deployment constraints on satellite internet terminals [24]. Moreover, the integration of adaptive fusion with risk-sensitive training for satellite-terminal flood response under triple-disruption conditions remains limited [25].

To address these gaps, this paper presents a lightweight parallel CNN-Transformer fusion framework optimized for satellite internet terminals in emergency scenarios. The framework adopts a dual-branch design: a lightweight U-Net-style CNN branch uses depthwise separable convolutions (DSC) to extract local boundary details with low computational overhead, while a compact Transformer branch captures global contextual dependencies to suppress SAR-specific misclassifications. A sample-level DGFM is introduced to adaptively balance the two branches. To align model optimization with emergency decision preferences, we further incorporate quality-aware training on Sen1Floods11 [26], prioritizing high-confidence labels and optimizing risk-sensitive metrics such as F0.5 and false alarm rate (FAR).

The main contributions of this paper are summarized as follows.

(1): We develop a terminal-oriented parallel segmentation framework that explicitly balances computational efficiency with segmentation accuracy. By optimizing the architecture for terminal-side CPU constraints, the model enables real-time situational awareness within the narrow decision windows required for emergency response.
(2): We introduce a sample-level DGFM combined with a quality-aware training protocol. This approach adaptively integrates local and global features through a lightweight MLP, significantly improving performance on risk-sensitive metrics such as F0.5 and FAR.
(3): We provide an in-depth analysis of terminal deployment trade-offs, focusing on the relationship between model complexity and actual CPU inference latency [24]. These empirical findings offer practical guidance for deploying deep learning solutions over constrained satellite-ground emergency links.

2. Related Work

2.1. SAR-Specific Flood Segmentation Methods

Due to its characteristic imaging capability, SAR imagery enables flood monitoring under all-weather and day-and-night conditions [2]. Early approaches to mapping flood extent relied largely on backscatter intensity analysis, change detection, and thresholding schemes [27]. With the advent of deep learning, fully convolutional networks and U-Net-like encoder-decoder architectures have become prevalent for end-to-end, pixel-level segmentation. These architectures leverage skip connections to recover spatial details [10,28]. U-Net variants have demonstrated robust performance in SAR flood mapping benchmarks such as Sen1Floods11, where models benefit from the ability to preserve fine-grained boundaries [11,29]. The use of Sentinel-1 for flood detection in arid regions has also been studied [30].

Despite these advances, SAR flood segmentation remains challenging due to radar-specific ambiguities in SAR scenes. Speckle noise and radar-dependent backscatter can cause confusion between water and non-water, particularly when non-water regions (e.g., terrain shadows and dark urban roofs) exhibit radiometric similarity to inundated areas [12]. While CNN-based methods are effective at extracting local spatial details, they are governed by limited receptive fields. This can reduce their robustness to confounding patterns whose discriminative evidence may be contextual rather than purely local.

Transformer-based approaches address this limitation by improving long-range contextual modeling. Efficient Transformer designs such as Swin Transformer improve global dependency capture while maintaining computational feasibility [14]. Hybrid designs that integrate Transformers with CNN backbones have also been explored for flash-flood detection [15]. However, Transformer components often introduce substantial computational overhead and may offer weaker boundary refinement when deployed under strict terminal constraints [16,17]. Moreover, many existing SAR flood segmentation studies primarily optimize generic segmentation accuracy rather than decision-critical objectives under emergency conditions, where false alarms and missed detections can have asymmetric operational costs [25].

2.2. CNN-Transformer Hybrid Fusion Strategies

CNN-Transformer hybrids have been widely investigated as a way to balance the preservation of local details with the understanding of global context [18,19,20]. Many existing works differ primarily in how features are fused and integrated. Static fusion strategies remain prevalent. For example, TransUNet incorporates Transformer modules into a CNN bottleneck to capture global semantics [31], and SegFormer [32] uses a hierarchical Transformer encoder with a lightweight decoder. CMTFNet combines local CNN features with multi-scale Transformer representations [33]. However, many hybrid designs rely on fixed fusion patterns, such as concatenation or summation, that do not change the relative contribution of local versus global evidence as scene ambiguity varies [20].

This limitation is particularly relevant for SAR flood segmentation because the importance of local boundary versus global contextual cues depends on flood morphology and confounding conditions (e.g., fragmented inundation versus open water regions and strong shadow-related ambiguity). Furthermore, hybrid architectures are often developed and evaluated in high-performance computing settings. In addition, the adaptation of hybrid fusion to CPU-level terminal deployment, where latency and memory budgets are limited, remains insufficiently addressed [21].

Although attention-based and gated mechanisms have been introduced to improve adaptivity in other vision tasks, including gated transformer decoders for instance segmentation [34], their use in combination with emergency-oriented reliability objectives for satellite-terminal scenarios remains limited. In particular, integrating adaptive fusion with risk-sensitive training that considers false-alarm costs has not been demonstrated in a structured way for targeted deployment settings [25].

2.3. Deployment-Oriented Edge Inference for Satellite-Terminal Scenarios

Edge inference is critical for timely disaster response because it reduces latency and bandwidth pressure by performing inference closer to where the data is acquired [35,36,37]. Considerable progress has been made in designing lightweight models, including depthwise separable convolutions and operator-efficient architectures [22], as well as compression techniques such as quantization and pruning [23]. Efficient vision transformers and hybrid lightweight designs (e.g., MobileViT [38] and TinyViT [39]) have also been proposed to reduce memory and computational overhead on resource-constrained devices.

However, directly transferring such edge designs to satellite-terminal SAR flood segmentation is non-trivial. Many lightweight edge methods have been developed and validated on RGB benchmarks or GPU-oriented platforms. Flood mapping has also been explored using color cues in UAV imagery [40], which differs from dual-channel SAR inputs and speckle-driven ambiguity under strict CPU-only constraints [24]. Additionally, satellite terminal deployment introduces strict CPU-only inference constraints and link constraints, including limited memory capacity and unstable bandwidth. These constraints are not always considered jointly with the segmentation objective [41,42].

Satellite internet and edge–satellite workflows have been investigated for crisis communications and emergency pipelines. These studies often emphasize system architecture and data preprocessing to reduce the uplink burden while leaving the design and reliability of the segmentation model under terminal constraints at a conceptual level [24,43]. Importantly, as noted in emergency-related literature, there is a lack of sufficient studies integrating adaptive fusion mechanisms with risk-sensitive training for triple-disruption constraints, where false alarms incur significant operational costs [25]. Consequently, there is limited evidence for methods that simultaneously achieve scene-adaptive fusion, terminal efficiency, and reliability aligned with false-alarm penalties.

Overall, existing studies have made significant progress in three areas: local detail preservation in CNN-based SAR segmentation, global context modeling in Transformer-based or hybrid architectures, and lightweight inference design for edge deployment. However, these approaches are rarely integrated into a single objective for satellite terminal emergency operations. Current methods rarely address scene-adaptive local–global fusion, CPU-constrained deployment on dual-channel SAR inputs, and reliability evaluation under asymmetric false-alarm costs simultaneously. Therefore, our study adopts a terminal-oriented parallel CNN-Transformer design with dynamic gated fusion and quality-aware training, and evaluates it with both efficiency-oriented and decision-critical metrics.

3. Methodology

This section provides a detailed description of the architecture and key modules of the proposed lightweight, parallel CNN-Transformer fusion framework. After outlining the overall structure, the Dynamic Gated Fusion Module (DGFM) and the risk-sensitive optimisation strategy are introduced.

3.1. Overall Structure of the Framework

As shown in Figure 1, the proposed framework adopts a parallel asymmetric encoder-decoder paradigm to achieve real-time situational awareness on satellite internet terminals. Given an input SAR image patch $X \in \mathbb{R}^{H \times W \times C}$ (where H = W = 256, and C = 2 corresponds to the VV and VH dual channels), the model extracts local spatial features and global contextual dependencies simultaneously through two distinct branches.

Specifically, the CNN branch uses a lightweight U-Net-like encoder comprising four stages of DSC. To reduce computational complexity while preserving high-resolution boundary information, the channel configuration is set to 24–48–96–192. Fusion is performed at the decoder stage: the local feature map used for fusion is $g \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 96}$, i.e., at one-quarter spatial resolution with 96 channels.

In parallel, a compact Transformer branch is employed to capture long-range contextual dependencies. The Transformer branch applies convolutional patch embedding with patch size P = 16 and stride P (non-overlapping partitions), producing a sequence of patch tokens that are processed by a two-block (L = 2) Transformer encoder with multi-head self-attention (MHSA) and hidden dimension D = 128. The output of the Transformer encoder is further reshaped to a global feature vector $t \in \mathbb{R}^{1 \times 128}$, representing the semantic scene topology.

To combine these multi-stream features, the local texture feature $g$ and the global context feature $t$ are adaptively integrated through the proposed DGFM module. The fused feature $f$ is then transferred to a minimalist segmentation head to generate the final binary prediction map $P$. The operation of the framework is expressed as follows:

$$g = \mathrm{CNN}(X) \tag{1}$$

$$t = \mathrm{Transformer}(X) \tag{2}$$

$$f = \mathrm{DGFM}(g, t) \tag{3}$$

$$P = \mathrm{Pred.Head}(f) \tag{4}$$

3.2. Multi-Stream Feature Extraction Branches

(1) Lightweight CNN Branch ($F_{cnn}$): This branch is designed to handle the “local consistency” of water bodies in SAR imagery. To minimise the number of parameters, DSCs are used as the primary building block. Each stage consists of a 3 × 3 DSC followed by Batch Normalization (BN) and ReLU activation. The channel capacity expands from 24 to 192 across four stages, ensuring that the spatial skip connections can effectively pass low-level boundary information to the decoder. This branch specifically addresses the “boundary leakage” problem common in SAR segmentation, where water-land edges are blurred by speckle noise.
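To make the parameter saving of DSC concrete, the following sketch counts convolution weights for the 24–48–96–192 channel configuration. It is only an illustration under stated assumptions: one 3 × 3 convolution per stage, no bias terms, BN parameters ignored, and a first stage that maps the C = 2 input channels to 24. The function names are hypothetical and do not come from the authors' code.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a dense k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def dsc_params(c_in, c_out, k=3):
    """Weights of a depthwise k x k conv plus a pointwise 1 x 1 conv (bias omitted)."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 channel mixing
    return depthwise + pointwise


# Assumed stage layout: 2 -> 24 -> 48 -> 96 -> 192 channels.
channels = [2, 24, 48, 96, 192]
stage_pairs = list(zip(channels[:-1], channels[1:]))

standard_total = sum(standard_conv_params(a, b) for a, b in stage_pairs)
dsc_total = sum(dsc_params(a, b) for a, b in stage_pairs)
reduction = standard_total / dsc_total

print(standard_total, dsc_total, round(reduction, 2))
```

Under these assumptions the DSC encoder needs roughly an eighth of the convolution weights of a dense 3 × 3 encoder with the same widths, which is the kind of saving the lightweight branch relies on.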

(2) Compact Transformer Branch ($F_{trans}$): Recognizing that CNNs often misclassify terrain shadows as water due to their limited receptive fields, the Transformer branch is introduced to provide global scene topology. We utilize a patch embedding layer to mitigate the loss of fine-grained spatial information during tokenization. The branch consists of L = 2 Transformer blocks with MHSA. For a latent dimension d = 128, the self-attention mechanism is formulated as in [13], where $d_k = d / \mathrm{heads}$ ($d_k = 32$) denotes the dimension of the query and key vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{5}$$
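Equation (5) can be illustrated with a small NumPy sketch of multi-head self-attention over patch tokens. With d = 128 and $d_k = 32$, the head count works out to 4. The random weights, the omission of the output projection, and the function names are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, heads):
    """Apply Eq. (5) independently per head; the output projection is omitted."""
    n, d = tokens.shape
    d_k = d // heads

    def split(x):
        # (n, d) -> (heads, n, d_k)
        return x.reshape(n, heads, d_k).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, n, n)
    attn = softmax(scores, axis=-1)                    # each row sums to 1
    out = attn @ v                                     # (heads, n, d_k)
    return out.transpose(1, 0, 2).reshape(n, d), attn


rng = np.random.default_rng(0)
d, heads, n = 128, 4, 256                              # d_k = 128 / 4 = 32
tokens = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out, attn = multi_head_self_attention(tokens, w_q, w_k, w_v, heads)
print(out.shape, attn.shape)
```

The attention tensor has one N × N matrix per head, so its size grows quadratically with the number of tokens; this is why the token count matters for CPU-only inference.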

After patch embedding, multi-head self-attention operates on patch tokens rather than on raw pixels. With $H = W = 256$ and $P = 16$, the embedding yields a $16 \times 16$ token grid with $N = (H / P) \times (W / P) = 256$ tokens. Each token summarizes a non-overlapping $P \times P$ region of the input patch. MHSA performs global mixing over this token set, providing long-range context at patch-level resolution, while fine spatial detail is primarily carried by the CNN branch.
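The token arithmetic above can be checked with a short sketch. A convolution whose kernel size and stride both equal P is equivalent, up to the bias, to a linear map applied to each flattened P × P × C patch, so the example below uses a reshape followed by a matrix product. The embedding matrix is random and the helper name `patchify` is hypothetical.

```python
import numpy as np


def patchify(x, p):
    """Split an (H, W, C) array into non-overlapping p x p patches, flattened row-wise."""
    h, w, c = x.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by the patch size"
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # (H/p, W/p, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)


H = W = 256
P, C, D = 16, 2, 128

rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, C))                 # stand-in for a VV/VH patch
patches = patchify(x, P)                           # (256, 512)
w_embed = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ w_embed                         # (256, 128): N tokens of width D

print(patches.shape, tokens.shape)
```

With N = 256 tokens, each attention head works on a 256 × 256 score matrix, which stays small compared with attending over the 65,536 pixels of the input patch.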

In this parallel design, the CNN pathway retains high-resolution spatial structure through depthwise separable convolutions and skip connections, while the Transformer pathway summarizes long-range dependencies into a compact global descriptor. The Transformer branch primarily encodes scene-level context that disambiguates flood-related backscatter patterns rather than recovering fine geometry, which is delegated to the CNN branch. Thus, a dense token map is not adopted here. Under CPU and memory constraints on satellite terminals, this compact representation limits attention-related overhead, keeping the hybrid model suitable for near-real-time inference.

3.3. Dynamic Gated Fusion Module

To couple the local and global features more effectively, the DGFM module is developed to replace traditional static fusion strategies such as concatenation or summation. Inspired by the need for sample-adaptive representation, a learnable gated weight $w$ is dynamically generated according to the content of the current input sample. The structure of the DGFM is shown in Figure 2.

The procedures for the feature fusion with the DGFM are summarized as follows:

(1) Global Summary Extraction: To capture the overall distribution of local spatial features, a global average pooling (GAP) operation is conducted on $g$ to obtain a summary vector. Before gating, both the CNN summary and the Transformer output $t$ are projected to the same dimension $d = 128$; we denote the projected summary as $v \in \mathbb{R}^{128}$ and the global feature as $t \in \mathbb{R}^{128}$.

(2) Joint Descriptor Formation: The projected summary $v$ and the global context feature $t$ are concatenated to form a comprehensive descriptor $\xi = \mathrm{Concat}(v, t) \in \mathbb{R}^{256}$ (see Equation (7)). This descriptor represents the global-local joint characteristics of the scene.

(3) Gated Weight Generation: The joint descriptor $\xi$ is passed through a lightweight two-layer MLP (hidden dimension 128). A Sigmoid function is applied to generate the final gated scalar weight $w$. This weight controls the fusion ratio, effectively acting as a “content-aware” switch.

(4) Adaptive Feature Refinement: The final fused feature $f$ is calculated via a convex combination of the local and global features, where the global feature $t$ is broadcast to match the spatial dimensions of $g$.

The operation of the DGFM module is expressed as follows:

$$v = \mathrm{GAP}(g) \tag{6}$$

$$\xi = \mathrm{Concat}(v, t) \tag{7}$$

$$w = \sigma(\mathrm{MLP}(\xi)) \tag{8}$$

$$f = (1 - w) \cdot g + w \cdot \mathrm{Broadcast}(t) \tag{9}$$

where $\sigma$ is the Sigmoid function. The global descriptor is broadcast to the spatial resolution of $g$ so that scene-level context modulates the CNN feature map uniformly, while local structure remains encoded in $g$. By analyzing the statistical distribution of $w$, it is found that for scenes with complex textures (e.g., urban shadows), the model assigns a smaller $w$ to prioritize the CNN branch. Conversely, for large-scale inundation, a larger $w$ is generated to emphasize the Transformer’s global context.
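The fusion steps in Equations (6)–(9) can be sketched in NumPy as follows. Several details are not specified in the text and are assumptions here: the projection of the pooled summary to 128 dimensions is taken to be linear, the MLP uses a ReLU hidden activation, and a 1 × 1 projection lifts $g$ from 96 to 128 channels so that the convex combination in Equation (9) is dimensionally consistent. All weights are random and the parameter names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dgfm(g, t, params):
    """Fuse a local map g of shape (h, w, c_g) with a global vector t of shape (d,)."""
    # Assumption: 1x1 projection of g to d channels so Eq. (9) is well defined.
    g_proj = g @ params["w_g"]                         # (h, w, d)
    # Eq. (6) plus the projection described in the text: pooled summary in R^d.
    v = g.mean(axis=(0, 1)) @ params["w_v"]            # (d,)
    # Eq. (7): joint descriptor in R^{2d}.
    xi = np.concatenate([v, t])
    # Eq. (8): two-layer MLP (ReLU is an assumption) followed by a Sigmoid.
    hidden = np.maximum(0.0, xi @ params["w1"] + params["b1"])
    w = float(sigmoid(hidden @ params["w2"] + params["b2"]))
    # Eq. (9): convex combination, with t broadcast over the spatial grid.
    f = (1.0 - w) * g_proj + w * t[None, None, :]
    return f, w


rng = np.random.default_rng(0)
c_g, d, hidden_dim = 96, 128, 128
params = {
    "w_g": rng.standard_normal((c_g, d)) * 0.05,
    "w_v": rng.standard_normal((c_g, d)) * 0.05,
    "w1": rng.standard_normal((2 * d, hidden_dim)) * 0.05,
    "b1": np.zeros(hidden_dim),
    "w2": rng.standard_normal(hidden_dim) * 0.05,
    "b2": 0.0,
}
g = rng.standard_normal((64, 64, c_g))                 # H/4 x W/4 x 96 for a 256 x 256 patch
t = rng.standard_normal(d)
f, w = dgfm(g, t, params)
print(f.shape, round(w, 3))
```

Because $w$ is a single scalar per sample, the gate adds only the cost of one pooling operation and a small MLP, which fits the sample-level design described above.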

3.4. Interpretability of the Gated Weight

We empirically show that the gated weight w behaves as an “uncertainty proxy”. Correlation analysis on the Sen1Floods11 test set shows that w maintains a Spearman’s rank correlation of −0.75 with the cost-sensitive risk metric (Risk = 2FP + FN).

As demonstrated in Figure 3 and Figure 4, the distribution of $w$ across regions (e.g., Sri Lanka: 0.57 ± 0.10; USA: 0.60 ± 0.11) illustrates that the gate is stable across different geographic environments, which supports its use as an interpretable signal for terminal-side deployment.
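A correlation analysis of this kind can be reproduced with a few lines of NumPy. The sketch below computes Spearman's rank correlation as the Pearson correlation of average ranks and pairs it with the risk term Risk = 2FP + FN. The five gate values and error counts are made-up toy numbers chosen only to exercise the code; they are not results from the paper.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, assigning tied entries their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def risk(fp, fn, c_fp=2.0, c_fn=1.0):
    """Cost-sensitive risk with the weights used in the text (2FP + FN)."""
    return c_fp * np.asarray(fp, dtype=float) + c_fn * np.asarray(fn, dtype=float)


# Toy per-sample values, for illustration only.
gate_w = [0.45, 0.52, 0.58, 0.61, 0.70]
fp = [120, 90, 40, 100, 30]
fn = [200, 150, 80, 60, 50]
rho = spearman(gate_w, risk(fp, fn))
print(round(rho, 3))
```

A negative rho, as reported in the text, means that samples with a larger gate value tend to have a lower cost-sensitive risk.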

3.5. Segmentation Head and Training Strategy

To map the fused features onto the final mask efficiently, a minimalist segmentation head is designed. It consists of a $3 \times 3$ refinement layer and a $1 \times 1$ classifier, followed by a single bilinear upsampling operation. This design minimizes parameters and computational latency, motivated by our observation that the front-end parallel encoders already generate highly discriminative features. To optimize the framework under the constraints of class imbalance and label noise, a composite loss function $L_{total}$ is adopted:

$$L_{total} = L_{BCE} + \lambda L_{Dice} \tag{10}$$

where $\lambda = 0.5$. $L_{BCE}$ provides stable pixel-level gradients, while $L_{Dice}$ optimizes the regional overlap to enhance the detection of small-scale flood patches.
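As a rough illustration of Equation (10), the sketch below combines a binary cross-entropy term with a soft Dice term using λ = 0.5. The exact Dice formulation is not given in the text, so the soft Dice with a small smoothing constant used here is an assumption, as are the function names.

```python
import numpy as np


def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)))


def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss (assumed form): 1 - 2|P.T| / (|P| + |T|), with smoothing."""
    inter = np.sum(prob * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(prob) + np.sum(target) + eps))


def total_loss(prob, target, lam=0.5):
    """Eq. (10): L_total = L_BCE + lambda * L_Dice."""
    return bce_loss(prob, target) + lam * dice_loss(prob, target)


rng = np.random.default_rng(0)
target = (rng.random((256, 256)) > 0.9).astype(np.float64)   # sparse flood mask
prob = np.clip(0.8 * target + 0.1 + 0.05 * rng.standard_normal(target.shape), 0.0, 1.0)

print(round(total_loss(prob, target), 4))
```

The Dice term is normalized by the predicted and true flood areas, so small flood patches still contribute a meaningful gradient even when background pixels dominate the BCE term.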

To leverage both weakly and manually labeled samples, we further use a quality-aware sample-wise weighting strategy. For each training sample $i$, a label-quality tag $q_i \in \{\mathrm{weak}, \mathrm{hand}\}$ is assigned, with fixed weights

$$w_i = \begin{cases} w_{weak}, & q_i = \mathrm{weak} \\ 1.0, & q_i = \mathrm{hand} \end{cases} \tag{11}$$

and $w_{weak} = 0.7$ in all experiments. Let $l_i$ denote the per-sample base loss. The mini-batch objective is:

$$L_{QA} = \frac{\sum_{i} w_i l_i}{\sum_{i} w_i} \tag{12}$$

This design down-weights potentially noisy weak labels while still exploiting their diversity for representation learning. The training process uses an Adam optimizer with a learning rate of $10^{-4}$ and a cosine decay schedule. To accommodate the limited memory of satellite terminals, a micro-batch size of 2 with gradient accumulation is implemented. The default model hyperparameters are listed in Table 1. Training settings are summarized in Table 2.
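The quality-aware objective in Equations (11) and (12) amounts to a weighted mean of per-sample losses. A minimal sketch follows, with $w_{weak} = 0.7$ as stated in the text; the function names and the example loss values are illustrative assumptions.

```python
def quality_weights(tags, w_weak=0.7):
    """Eq. (11): fixed weight per label-quality tag ('weak' or 'hand')."""
    weights = []
    for q in tags:
        if q == "weak":
            weights.append(w_weak)
        elif q == "hand":
            weights.append(1.0)
        else:
            raise ValueError(f"unknown label-quality tag: {q!r}")
    return weights


def quality_aware_loss(per_sample_losses, tags, w_weak=0.7):
    """Eq. (12): weighted mean of per-sample losses over the mini-batch."""
    weights = quality_weights(tags, w_weak)
    weighted_sum = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted_sum / sum(weights)


# A micro-batch of 2 with one weak and one hand-labeled sample (toy loss values).
losses = [0.8, 0.4]
tags = ["weak", "hand"]
loss_qa = quality_aware_loss(losses, tags)
print(round(loss_qa, 4))
```

Normalizing by the sum of weights keeps the loss scale comparable across mini-batches with different weak-to-hand ratios; a batch of hand-labeled samples alone reduces to the ordinary mean.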

4. Experiments and Results

In this section, the performance of the proposed parallel CNN-Transformer fusion framework is evaluated. We first describe the dataset and experimental setup, followed by a series of comparative and ablation studies. Finally, the model is stress-tested under the “triple-disruption” scenario to verify its practical utility.

4.1. Dataset and Experimental Setup

The experiments in this paper use the publicly available Sen1Floods11 SAR flood benchmark dataset [26], which consists of Sentinel-1 imagery with VV/VH dual-polarization channels. The dataset is a popular benchmark for deep learning-based flood segmentation using Sentinel-1 data and includes two subsets: “Weakly Labeled” and “Hand Labeled”. The “Hand Labeled” subset is manually annotated by experts to a high standard, ensuring consistent inter-annotator agreement. To leverage all available data, the training set incorporates samples from both the “Weakly Labeled” and “Hand Labeled” subsets, enabling the model to benefit from diverse scenarios. According to its source subset in Sen1Floods11, each training sample is tagged as weak or hand. During training, weakly and hand-labeled samples are shuffled together into the same mini-batches without additional oversampling or undersampling. Quality awareness is introduced through sample-wise loss weighting, as defined in Section 3.5. To ensure rigorous and reliable evaluation, the validation and test sets comprise only high-quality “Hand Labeled” patches. This strategy, coupled with our quality-aware training protocol, enables the model to learn effectively from noisy weak labels while providing an objective assessment of the expert-verified ground truth. To ensure the model can generalise and avoid data leakage, patches from the same original image appear in only one set. We employ data augmentation techniques during the training phase [29], including random horizontal and vertical flipping (with a probability of 0.5), 90° rotation, and ±15% radiometric jitter of the backscatter values (to mimic intensity variation for SAR data).
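The augmentation pipeline described above can be sketched as follows. The text does not specify how the rotation angle is drawn or how the radiometric jitter is applied, so this example assumes a rotation by a random multiple of 90° and a single multiplicative factor per sample drawn uniformly from [0.85, 1.15]. The function name and signature are hypothetical.

```python
import numpy as np


def augment(image, mask, rng, flip_p=0.5, jitter=0.15):
    """Apply identical geometric transforms to image (H, W, 2) and mask (H, W)."""
    if rng.random() < flip_p:                          # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < flip_p:                          # vertical flip
        image, mask = image[::-1], mask[::-1]
    k = int(rng.integers(0, 4))                        # assumed: k * 90 degree rotation
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    # Assumed form of the radiometric jitter: one multiplicative factor per sample,
    # applied to the backscatter channels only (the mask is left unchanged).
    scale = 1.0 + rng.uniform(-jitter, jitter)
    return np.ascontiguousarray(image * scale), np.ascontiguousarray(mask)


rng = np.random.default_rng(0)
mask = (rng.random((256, 256)) > 0.7).astype(np.float64)
image = np.stack([mask, 1.0 - mask], axis=-1)          # toy VV/VH stand-in tied to the mask
aug_image, aug_mask = augment(image, mask, rng, jitter=0.0)
print(aug_image.shape, aug_mask.shape)
```

Applying the same flips and rotation to both arrays keeps every pixel aligned with its label, while the jitter touches only the input intensities.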

This paper uses the standard evaluation metrics for semantic segmentation tasks: mean intersection over union (mIoU), F1 score, precision, and recall [44]. To accurately reflect edge deployment scenarios, all inference speed tests are performed on a CPU platform and report the average processing latency for a single image (batch size = 1). Training uses the Adam optimizer [45] with an initial learning rate of 10⁻⁴, combined with cosine annealing scheduling [46] and an early stopping strategy [47]. The specific hyperparameters are shown in Table 2.
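For reference, a cosine annealing schedule starting from the stated initial learning rate of 10⁻⁴ can be written in a few lines. The total number of steps and the minimum learning rate are not given in this section, so the values below (100 steps, floor of 0) are placeholders.

```python
import math


def cosine_lr(step, total_steps, lr_init=1e-4, lr_min=0.0):
    """Cosine annealing from lr_init (step 0) down to lr_min (step total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


total_steps = 100                                      # placeholder value
schedule = [cosine_lr(s, total_steps) for s in range(total_steps + 1)]
print(schedule[0], schedule[50], schedule[100])
```

The rate decays slowly at first, fastest around the midpoint, and flattens again near the end, which tends to pair well with early stopping on a validation metric.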

4.2. Evaluation Metrics

To quantitatively assess the segmentation performance, we adopt standard pixel-level metrics: Mean Intersection over Union (mIoU), F1-score, Precision, and Recall. Furthermore, to reflect asymmetric error costs in emergency-oriented assessment, we use a cost-sensitive risk term computed as a linear combination of false positives and false negatives with user-specified weights $C_{FP}$ and $C_{FN}$ [48] (see Equation (16)). We also report FAR = FP/(FP + TN).

These metrics are defined as:

$$\mathrm{IoU} = \frac{TP}{TP + FN + FP} \tag{13}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{14}$$

$$\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{15}$$

$$\mathrm{Risk} = C_{FP} \times FP + C_{FN} \times FN \tag{16}$$

where $C_{FP}$ and $C_{FN}$ represent the cost weights for false positives and false negatives, respectively.
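The metrics in Equations (13)–(16), together with FAR, can be computed directly from pixel counts. The sketch below uses $C_{FP} = 2$ and $C_{FN} = 1$ as defaults, matching the Risk = 2FP + FN form used in Section 3.4. It also includes the standard F-beta definition with β = 0.5, which is not written out in this section and is added here as an assumption about how F0.5 is computed. The counts in the example are toy values.

```python
def segmentation_metrics(tp, fp, fn, tn, c_fp=2.0, c_fn=1.0, beta=0.5):
    """Pixel-level metrics from confusion counts for the flood class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    iou = tp / (tp + fn + fp) if (tp + fn + fp) else 0.0            # Eq. (13)
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0         # Eq. (15)
    b2 = beta * beta
    fb_denom = b2 * precision + recall
    f_beta = (1.0 + b2) * precision * recall / fb_denom if fb_denom else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0                      # false alarm rate
    risk = c_fp * fp + c_fn * fn                                    # Eq. (16)
    return {
        "iou": iou, "precision": precision, "recall": recall,
        "f1": f1, "f_beta": f_beta, "far": far, "risk": risk,
    }


# Toy confusion counts, for illustration only.
m = segmentation_metrics(tp=80, fp=20, fn=40, tn=860)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because β = 0.5 weights precision more heavily than recall, F0.5 and FAR both penalize false alarms, in line with the asymmetric costs described above.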

4.3. Comparative Experiments with Baseline Methods

To verify the contribution of the parallel architecture and the dynamic gated fusion module (DGFM), several baseline configurations are compared. The results on the Sen1Floods11 test set are summarized in Table 3.

As demonstrated in Table 3, the Hybrid-gated (ours) configuration, combined with quality-aware training through weak-label reweighting, achieves the highest mIoU (0.5814) and F1-score (0.7028). Notably, this configuration yields the lowest FAR (0.0173) and the minimum Risk (2658.42). This suggests that the DGFM effectively leverages the local consistency from the CNN branch and the global context from the Transformer branch, particularly in mitigating “boundary leakage” and urban shadow misclassifications.

As shown in Figure 5, the training and validation losses steadily decrease and converge under the proposed training protocol.

Figure 6 shows a statistical comparison of the mIoU and F1 score for different baseline architectures.

The results align with the operational requirements for high-precision situational awareness in emergency scenarios. The Hybrid-gated (ours) model dynamically balances multi-scale features to minimise cost-sensitive risks and prevent the wastage of limited rescue resources caused by false reports, all while maintaining a robust detection rate. Compared to standard fusion strategies (e.g., averaging and concatenation), our gating mechanism offers greater adaptability to the variability in SAR imagery backscatter. Section 4.5 provides a detailed ablation analysis of the internal gating structures and the impact of the quality-aware training protocol.

4.4. Comparison with State-of-the-Art Methods

To comprehensively evaluate the framework, we compare it with several representative SOTA models, including UNet++, DeepLabV3+, and SegFormer-B0. All models are trained from scratch with in_channels = 2 to avoid domain bias from RGB pre-training. The ResNet-18 backbone follows the standard residual network [49].

As shown in Table 4, the Hybrid-gated (ours) model achieves the highest mIoU (0.5814) among the compared SOTA baselines. It is important to note that the absolute mIoU values in SAR tasks are lower than those in optical RGB benchmarks due to speckle noise and lack of spectral information. However, our model offers an operating point that prioritizes accuracy with competitive risk-sensitive behavior under the same protocol. In terms of efficiency, our latency (0.2179 s/frame) is competitive with UNet++ (0.2227 s/frame) while providing significantly higher precision.

The proposed hybrid-gated model outperforms the UNet++ (ResNet18) model in terms of mIoU. Some SOTA baselines (e.g., FPN and MAnet) achieve lower FAR and risk in risk-sensitive metrics, indicating different operational trade-offs. Compared to SegFormer-B0, our model achieves a higher mIoU (0.5814 vs. 0.5678) and a comparable, though slightly higher, FAR (0.0173 vs. 0.01), indicating a different trade-off prioritizing overall accuracy. While absolute mIoU values in SAR flood mapping are typically lower than those in optical benchmarks due to speckle noise and ambiguous backscatter, our method offers a practical balance of accuracy and reliability under terminal deployment constraints. Figure 7 presents the qualitative visualization of the results.

4.5. Structural Ablation and Generalization Assessment

To verify the specific contributions of the gating mechanism and the training protocol, we first compare different structural variants. To make a fair comparison, the “Hybrid-gated (sample-wise)” and “Hybrid-gated (ours)” models have the same architecture, data split, optimizer, augmentation, and training schedule. The only difference is whether quality-aware sample weighting is enabled. As shown in Table 3, the sample-wise gated variant outperforms the channel-wise variant (mIoU of 0.5123 vs. 0.4523). This indicates that sample-adaptive weight assignment is better suited to handling the diverse backscatter characteristics of SAR imagery. Furthermore, introducing the quality-aware training (Hybrid-gated, ours) substantially improves performance, raising the mIoU from 0.5123 to 0.5814. These results suggest that the current protocol benefits from the complementary contributions of quality-aware training and fusion architecture.

To further distinguish the effects of architecture from the effects of the training protocol, we conduct an additional factorized 2 × 2 comparison under the same split, optimizer, augmentation, and training schedule. Specifically, we compare the Hybrid-concat and Hybrid-gated models, each with standard and quality-aware training. The corresponding results are summarized in Table 5. While Table 3 reports on a broader range of architectural variants, Table 5 provides a controlled 2 × 2 decomposition of architectural versus training attributions.

Under standard training, the Hybrid-concat model produces a stronger baseline than the Hybrid-gated model. However, with quality-aware reweighting, the Hybrid-gated model improves substantially and becomes the best overall model, suggesting that DGFM is more sensitive to label quality under the current protocol.
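The 2 × 2 comparison can be read as simple effects plus an interaction term. The short sketch below recomputes these differences from the mIoU values in Table 5. The variable names and the difference-in-differences framing are introduced here for illustration and are not part of the original analysis.

```python
# mIoU values copied from Table 5 (fusion architecture x training protocol).
miou = {
    ("concat", "standard"): 0.5622,
    ("concat", "quality_aware"): 0.5662,
    ("gated", "standard"): 0.5123,
    ("gated", "quality_aware"): 0.5814,
}


def simple_effects(table):
    """Return the simple effects of each factor and their interaction
    (difference-in-differences) for a 2 x 2 table of scores."""
    qa_gain_concat = table[("concat", "quality_aware")] - table[("concat", "standard")]
    qa_gain_gated = table[("gated", "quality_aware")] - table[("gated", "standard")]
    gate_gain_standard = table[("gated", "standard")] - table[("concat", "standard")]
    gate_gain_qa = table[("gated", "quality_aware")] - table[("concat", "quality_aware")]
    return {
        "qa_gain_concat": qa_gain_concat,
        "qa_gain_gated": qa_gain_gated,
        "gate_gain_standard": gate_gain_standard,
        "gate_gain_qa": gate_gain_qa,
        "interaction": qa_gain_gated - qa_gain_concat,
    }


effects = simple_effects(miou)
for name, value in effects.items():
    print(f"{name}: {value:+.4f}")
```

With these numbers, quality-aware training adds about 0.004 mIoU for Hybrid-concat but about 0.069 for Hybrid-gated, so most of the gain sits in the interaction term (about 0.065).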

To isolate the effects of Transformer depth (L) and patch size (P) from quality-aware reweighting, Table 6 reports a controlled ablation under the standard training objective (no QA). The split, optimizer, augmentation, and schedule are held fixed, and only L and P vary.

Under this protocol, L = 1 and P = 8 achieve the highest mIoU and F1 scores among the listed depth and patch settings. However, L = 3 reduces mIoU, which is consistent with the increased optimization difficulty under speckle-dominated SAR data and limited training diversity. Patch size balances the spatial support of each token with the number of tokens and the cost of global self-attention.
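The token-count side of this trade-off can be made concrete with a small calculation. The sketch below assumes ViT-style non-overlapping square patches on the 256 × 256 input from Table 1 and counts only the pairwise (quadratic) self-attention term. The function name and the choice of P = 16 as the reference are illustrative.

```python
def token_stats(image_size: int, patch_size: int, reference_patch: int = 16):
    """Token count for non-overlapping square patches, plus the pairwise
    attention cost relative to a reference patch size (cost ~ tokens**2)."""
    if image_size % patch_size or image_size % reference_patch:
        raise ValueError("image_size must be divisible by both patch sizes")
    tokens = (image_size // patch_size) ** 2
    ref_tokens = (image_size // reference_patch) ** 2
    return tokens, (tokens / ref_tokens) ** 2


for p in (8, 16):
    n, rel = token_stats(256, p)
    print(f"P={p:>2}: tokens={n:>4}, relative attention cost={rel:g}x")
```

Under these assumptions, halving the patch size from 16 to 8 quadruples the number of tokens (256 to 1024) and raises the pairwise attention term by a factor of 16, while each token covers a quarter of the area.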

The main experiments in Table 3, Table 4 and Table 5 employ quality-aware training. Therefore, we retrained the Hybrid-gated model under the same protocol, with an identical data split, optimizer, augmentation, and schedule, varying one structural factor at a time relative to the default backbone. With quality-aware reweighting, L = 2 and P = 16 achieve an mIoU of 0.5814 and an F1 score of 0.7028, compared with an mIoU of 0.4800 and an F1 score of 0.5826 for L = 1 and P = 16, and an mIoU of 0.5445 and an F1 score of 0.6633 for L = 2 and P = 8. Thus, we adopt L = 2 and P = 16 as the default Transformer depth and patch size for all quality-aware configurations reported in this paper.

Finally, we conducted a cross-region generalization assessment using an unseen-region split. The model was trained on eight regions (Bolivia, Colombia, Ghana, India, the Mekong region, Nigeria, Pakistan, and Paraguay) and evaluated on two regions that were not used for training: Sri Lanka and the USA. To avoid bias from a single metric, we report the full set of metrics on the unseen test split: mIoU = 0.5228, F1 = 0.6382, precision = 0.7481, recall = 0.6721, F0.5 = 0.6668, FAR = 0.0252, risk = 3440.62, and latency = 0.1480 s/frame. These results suggest promising cross-region transferability under the current protocol. However, a more systematic per-region and cross-method comparison is necessary.
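A leave-region-out split of this kind can be expressed in a few lines. The sketch below is illustrative only: the sample representation as (region, chip_id) pairs, the chip identifiers, and the exact region strings are assumptions, not the file layout used in the experiments.

```python
# Region names follow the split described in the text; the exact strings
# used in the dataset files are an assumption.
TRAIN_REGIONS = frozenset(
    {"Bolivia", "Colombia", "Ghana", "India", "Mekong", "Nigeria", "Pakistan", "Paraguay"}
)
TEST_REGIONS = frozenset({"Sri Lanka", "USA"})


def split_by_region(samples, train_regions=TRAIN_REGIONS, test_regions=TEST_REGIONS):
    """Split (region, chip_id) pairs so that no region appears on both sides."""
    if train_regions & test_regions:
        raise ValueError("train and test regions must be disjoint")
    train = [s for s in samples if s[0] in train_regions]
    test = [s for s in samples if s[0] in test_regions]
    return train, test


# Hypothetical chip identifiers, for demonstration only.
samples = [
    ("Bolivia", "chip_001"),
    ("USA", "chip_002"),
    ("Sri Lanka", "chip_003"),
    ("Ghana", "chip_004"),
]
train, test = split_by_region(samples)
print(len(train), "training chips;", len(test), "held-out chips")
```

Splitting by region rather than by chip keeps every held-out scene geographically disjoint from the training data, which is the property the unseen-region assessment relies on.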

4.6. Efficiency Analysis and Communication Impact

To validate the feasibility of deploying the proposed parallel framework on resource-constrained satellite internet terminals, Table 7 provides a quantitative analysis of the computational complexity and resource consumption of all models. This includes the number of parameters, the number of floating-point operations (FLOPs), the peak memory usage, and the CPU inference latency.

Integrating the DGFM introduces only marginal computational overhead. The Hybrid-gated model achieves a CPU latency of 0.2179 s/frame (Table 7), which is no higher than that of Hybrid-avg (0.2501 s/frame) or Hybrid-concat (0.2576 s/frame), supporting the DGFM’s lightweight design.

Furthermore, on-site terminal-side segmentation substantially improves communication efficiency on satellite emergency links. Specifically, instead of the raw SAR image patch, which has a payload of 256 KB, the terminal transmits either an 8 KB binary flood segmentation mask or vectorised flood boundary polygons, which are even more compact. The mask alone gives a 32-fold reduction in payload (about 1.5 orders of magnitude), which drastically alleviates uplink bandwidth pressure on satellite internet links. This is particularly critical for flood emergency response under triple disruption, when terrestrial communication infrastructure is paralysed.
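The payload figures can be checked with a short calculation. The sketch below assumes that the 256 KB raw patch corresponds to 256 × 256 pixels, two channels, and 16 bits per value, and that the mask is packed at one bit per pixel. The text states only the two payload sizes, so this storage layout and the helper names are assumptions.

```python
def raw_patch_bytes(height, width, channels, bytes_per_value):
    """Size in bytes of an uncompressed image patch."""
    return height * width * channels * bytes_per_value


def pack_mask(mask_rows):
    """Pack a binary mask (rows of 0/1) into bytes, 8 pixels per byte, MSB first."""
    bits = [bit for row in mask_rows for bit in row]
    packed = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | (1 if bit else 0)
        byte <<= 8 - len(chunk)  # zero-pad a final partial byte
        packed.append(byte)
    return bytes(packed)


H = W = 256
# Assumed storage layout: 2 channels (VV, VH) at 2 bytes per value.
raw = raw_patch_bytes(H, W, channels=2, bytes_per_value=2)
# Synthetic checkerboard mask, used only to exercise the packing routine.
mask = [[(x + y) % 2 for x in range(W)] for y in range(H)]
packed = pack_mask(mask)

print(f"raw patch: {raw // 1024} KB, packed mask: {len(packed) // 1024} KB, "
      f"reduction: {raw // len(packed)}x ({100 * len(packed) / raw:.3f}% of raw)")
```

Under these assumptions the bit-packed mask occupies 8 KB, or 3.125% of the raw patch, before any further entropy coding or vectorisation.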

4.7. Stress Testing Under Emergency Constraints

In order to evaluate the model’s performance in the event of a “triple-disruption”, we conducted a latency-based stress test and a cost-sensitive risk analysis. The results of the latency stress test are summarized in Table 8.

Although some lighter models, such as PAN, meet a 100 ms CPU inference latency budget, this comes with a higher FAR (0.0193 for PAN versus 0.0173 for our model; Table 4). In contrast, our Hybrid-gated model satisfies a 250 ms per-patch budget while providing more reliable segmentation masks.
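A latency-budget check of this kind can be sketched as follows. The warm-up count, the number of timed runs, the use of the median as the summary statistic, and the `dummy_infer` stand-in are all assumptions made for illustration. They do not describe the measurement protocol behind Table 8.

```python
import statistics
import time


def measure_latency(infer, patch, warmup=3, runs=20):
    """Return per-call wall-clock latencies (seconds) after a few warm-up calls."""
    for _ in range(warmup):
        infer(patch)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(patch)
        samples.append(time.perf_counter() - start)
    return samples


def within_budget(samples, budget_s):
    """True when the median per-patch latency fits the budget."""
    return statistics.median(samples) <= budget_s


def dummy_infer(patch):
    # Hypothetical stand-in for the segmentation model: thresholds each pixel.
    return [[1 if value < -15.0 else 0 for value in row] for row in patch]


patch = [[-20.0 + (x % 10) for x in range(256)] for _ in range(256)]
samples = measure_latency(dummy_infer, patch)
print(f"median latency: {statistics.median(samples) * 1000:.2f} ms, "
      f"fits 250 ms budget: {within_budget(samples, 0.250)}")
```

Reporting a median over repeated runs, rather than a single timing, reduces the influence of scheduler noise on a shared CPU.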

As shown in Table 9 and Figure 8, our proposed Hybrid-gated model achieves the lowest risk scores among the baseline and hybrid fusion variants as the penalty ratio for false alarms (C_FP:C_FN) increases. Specifically, under the 2:1 and 3:1 ratios, which simulate emergency scenarios in which situational awareness is prioritised to prevent the misallocation of limited rescue resources, our model significantly outperforms the baselines. This suggests that the DGFM effectively suppresses false positives induced by urban shadows and SAR speckle noise, thus aligning with the “precision-first” requirement of emergency response under “triple-disruption” constraints, where every rescue sortie must be based on credible evidence.
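The cost-sensitive risk can be illustrated with a small example. Following the note under Table 3, the per-image risk is taken as C_FP × FP + C_FN × FN and then averaged over images. The toy masks and function names below are hypothetical and serve only to show how the ranking of two predictions changes with the penalty ratio.

```python
def image_risk(pred, truth, c_fp=2.0, c_fn=1.0):
    """Cost-sensitive risk of one binary mask: c_fp * FP + c_fn * FN."""
    fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return c_fp * fp + c_fn * fn


def mean_risk(preds, truths, c_fp=2.0, c_fn=1.0):
    """Mean of the per-image risks over a set of masks."""
    risks = [image_risk(p, t, c_fp, c_fn) for p, t in zip(preds, truths)]
    return sum(risks) / len(risks)


# Toy 1 x 4 masks: one prediction over-predicts flood, the other under-predicts.
truth = [[1, 1, 0, 0]]
over_prediction = [[1, 1, 1, 0]]   # one false positive
under_prediction = [[1, 0, 0, 0]]  # one false negative

for ratio in (1.0, 2.0, 3.0):
    r_over = image_risk(over_prediction, truth, c_fp=ratio, c_fn=1.0)
    r_under = image_risk(under_prediction, truth, c_fp=ratio, c_fn=1.0)
    print(f"C_FP:C_FN = {ratio:g}:1 -> over-prediction {r_over:g}, under-prediction {r_under:g}")
```

At a 1:1 ratio the two toy predictions are tied. As C_FP grows, the over-predicting mask is penalised more, which is why models with low FAR rank better under the 2:1 and 3:1 settings.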

5. Discussion

To further elucidate the significance of the proposed parallel CNN-Transformer architecture and the DGFM, we conduct a qualitative analysis of the feature representations and fixed-gate comparisons, and an assessment of operational efficiency.

5.1. Dual-Branch Feature Synergy Analysis

The core motivation for adopting a parallel dual-branch architecture lies in the complementary nature of local spatial textures and global semantic dependencies. To investigate the individual influence of each branch, we examine the feature visualization results under different configurations.

As observed in our qualitative studies, when the Transformer branch is deactivated, the model primarily relies on the lightweight CNN’s local receptive field. While this preserves high-resolution boundary details, it frequently leads to false positives in complex backscatter regions, such as urban shadows or mountainous terrain, which exhibit radiometric characteristics similar to water bodies in SAR imagery. Conversely, when the CNN branch is removed, the Transformer successfully captures long-range context to suppress these false alarms; however, the resulting flood masks often suffer from “boundary erosion” and lack fine-grained spatial precision.

By employing the DGFM, the proposed framework yields an empirically favorable combination of the two branches. The CNN branch appears to support precise boundary delineation, while the Transformer branch appears to provide scene-level context that is useful for suppressing some false alarms. The gated weight, w, acts as an adaptive balancing signal that reconciles local and global feature maps based on content. The controlled fixed-gate comparisons in Table 5 (Rows 5–7) further support the benefit of sample-adaptive gating. Replacing the learned gating weights with fixed constants (e.g., w = 0.5) is associated with a noticeable decline in model performance, with the mIoU dropping from 0.5814 to 0.5598. This pattern suggests that a single fixed fusion weight may fail to fully accommodate the diversity of backscatter appearances in our data. By contrast, learned gating allows branch contributions to vary with each sample.
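The difference between a learned, sample-wise gate and a fixed gate can be sketched in a few lines. The code below is a simplified illustration, not the DGFM itself: it assumes a convex combination w · F_cnn + (1 − w) · F_trans (which branch w multiplies is an assumption), and it replaces the gated MLP with a single hypothetical linear layer over globally pooled features.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_weight(cnn_feat, trans_feat, weights, bias):
    """Scalar gate in (0, 1) computed from the mean of each branch's features.

    A single linear layer stands in for the gated MLP (a simplification)."""
    pooled_cnn = sum(cnn_feat) / len(cnn_feat)
    pooled_trans = sum(trans_feat) / len(trans_feat)
    logit = weights[0] * pooled_cnn + weights[1] * pooled_trans + bias
    return sigmoid(logit)


def fuse(cnn_feat, trans_feat, w):
    """Convex combination of the two branches with gate weight w (assumed form)."""
    return [w * c + (1.0 - w) * t for c, t in zip(cnn_feat, trans_feat)]


# Hypothetical flattened feature vectors and gate parameters for one sample.
cnn_feat = [0.9, 0.1, 0.4, 0.8]
trans_feat = [0.2, 0.3, 0.5, 0.1]

w_learned = gate_weight(cnn_feat, trans_feat, weights=(1.5, -0.5), bias=0.0)
fused_learned = fuse(cnn_feat, trans_feat, w_learned)  # weight depends on the sample
fused_fixed = fuse(cnn_feat, trans_feat, 0.5)          # same weight for every sample

print(f"sample-wise w = {w_learned:.3f}")
print("learned-gate fusion:", [round(v, 3) for v in fused_learned])
print("fixed-gate fusion:  ", [round(v, 3) for v in fused_fixed])
```

In this form, a fixed w applies the same mixing ratio to every input, whereas the gate output changes whenever the pooled branch statistics change.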

5.2. Model Efficiency and Deployment Feasibility

In the context of satellite internet terminals, computational efficiency is as critical as segmentation performance. To evaluate the practical deployment feasibility of the proposed framework, we conduct a multi-dimensional trade-off analysis. Table 10 provides a summary of the accuracy, latency, and risk profiles for representative baseline and SOTA models.

The comparative data in Table 10 indicate that while the Transformer-only model offers an exceptionally low latency, its low segmentation accuracy and high false alarm rate render it insufficient for reliable emergency mapping. In contrast, the proposed Hybrid-gated model achieves the highest mIoU and F1-score. Although its CPU latency is higher than that of the ultra-lightweight SegFormer-B0, it remains within the 250 ms per-patch budget required for near-real-time situational awareness.

Most importantly, our model maintains a competitive risk profile of 2658.42 while striking a balance between detection precision and computational overhead within terminal constraints. Combined with the complexity analysis in Table 7, which shows a modest parameter count of approximately 4.2 M, these results indicate that the proposed parallel fusion strategy is a strong accuracy-oriented operating point under edge-side hardware constraints rather than a universal optimum for every latency- or risk-sensitive preference. Although power consumption is not measured here, our lightweight, CPU-only design is favourable under power-limited conditions (e.g., battery or generator during outages).

5.3. Significance in “Triple-Disruption” Scenarios

The practical value of this framework is most evident in scenarios characterized by the disruption of power, communication, and road networks. Traditional flood mapping pipelines rely on backhauling large volumes of raw SAR data to centralized cloud servers, a process that is often paralyzed during major disasters.

By performing intelligent segmentation directly at the satellite terminal, we transform the “data-intensive” task into a “knowledge-light” task. Transmitting a binary flood mask or vectorized boundary data requires less than 5% of the bandwidth needed for raw image transmission (8 KB versus 256 KB per patch, about 3.1%; Section 4.6). This edge-side processing capability matters most in the “triple-disruption” scenario, when power, communication, and road networks fail together and backhauling large-scale raw SAR data to centralized cloud servers becomes infeasible. The reduced payload is potentially valuable under low-bandwidth satellite links when terrestrial infrastructure is compromised. In the present study, however, this deployment argument is supported by CPU latency and payload analysis rather than by field validation during an actual disaster. The argument is consistent with the stress-test results in Section 4.7, where the proposed model satisfies a 250 ms per-patch CPU latency budget while maintaining the lowest cost-sensitive risk among the baseline and hybrid variants.

5.4. Limitations and Practical Validation

This work has been primarily validated using the Sen1Floods11 benchmark. Therefore, the reported performance should be interpreted in light of the data distribution defined by the benchmark, which includes scene statistics, label characteristics and typical SAR urban-shadow patterns. Although our cross-region evaluation and the stability of the learned gate responses suggest limited-to-moderate generalisation, the current results may still be impacted by domain shifts in unseen acquisition configurations and flood morphologies.

From a modelling perspective, the Transformer branch is intentionally designed to meet strict CPU-side latency and memory constraints. This leads to the aggressive compression of contextual information into a scene-level representation prior to fusion. Consequently, subtle local inundation structures (e.g., narrow ribbons or fragmented patches) and fine-grained ambiguities may not be fully captured. A qualitative inspection of the predictions on the Sen1Floods11 test set reveals two recurring failure modes in our protocol: (1) occasional missed detections for very narrow channels and highly fragmented inundation when local contrast is weak and (2) residual false positives in shadowed regions and other land covers whose VV/VH backscatter resembles open water. These behaviors are consistent with the SAR ambiguity and compact contextual encoding discussed above. Accordingly, failure modes may be more pronounced in dense urban settings with strong shadow/double-bounce confusion, and in conditions that differ substantially from the training distribution. Extending the Transformer pathway with lightweight, spatialized context is left for future work.

Finally, while end-to-end field deployment was not conducted, operational reliability relevant to emergency response was rigorously examined through deployment-oriented evaluations. These evaluations included CPU latency under stress-test budgets, cost-sensitive risk analysis, and communication payload considerations. These assessments provide evidence of the robustness of the algorithms under strict operational constraints. Future work will extend these evaluations to additional benchmarks, such as UrbanSARFloods [51], and pursue more grounded validations, including shadow-mode testing and operator feedback, to characterize real-world reliability in triple-disruption environments.

6. Conclusions

This study presents a lightweight, parallel CNN-Transformer fusion framework for semantic segmentation of flood zones in dual-channel SAR imagery. The proposed architecture adopts a parallel dual-branch design to simultaneously capture both low-level spatial details and global contextual dependencies. These are then adaptively integrated via the DGFM, which is combined with a quality-aware training protocol.

Experimental results on the Sen1Floods11 dataset demonstrate the effectiveness and versatility of the model. Using the same evaluation protocol, the framework achieves the highest mIoU with strong F1 performance. Furthermore, the factorized ablations and fixed-gate comparisons in Table 5 support the contributions of dual-branch fusion and sample-adaptive gating to our results. Qualitative checks also indicate fewer SAR-specific errors, such as shadow misclassification and boundary leakage, in our experiments.

Although the cross-region tests on unseen areas indicate promising robustness, the inherent complexity of SAR acquisition settings and diverse flood morphologies present ongoing challenges. To improve the framework’s practical applicability and ease the transition to end-to-end field deployment, future research will expand these evaluations to broader benchmarks, such as the UrbanSARFloods dataset, to better account for complex urban backscatter characteristics. Additionally, we plan to conduct more grounded validations through shadow-mode testing in active disaster zones to assess the system’s reliability for “triple-disruption” emergency response.

Author Contributions

Methodology, Y.N.; validation, J.L. and H.G.; formal analysis, J.L. and H.G.; investigation, Y.N.; writing—original draft, Y.N.; supervision, Z.S.; project administration, Z.S.; funding acquisition, Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Major Science and Technology Support Program of Hebei Province [Grant No. 252X0303D] and the Science and Technology Project of Xiong’an New Area [Grant No. XA2026110904K].

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Tellman, B.; Sullivan, J.A.; Kuhn, C.; Kettner, A.J.; Doyle, C.S.; Brakenridge, G.R.; Erickson, T.A.; Slayback, D.A. Satellite imaging reveals increased proportion of population exposed to floods. Nature 2021, 596, 80–86.
2. Amitrano, D.; Di Martino, G.; Di Simone, A.; Imperatore, P. Flood detection with SAR: A review of techniques and datasets. Remote Sens. 2024, 16, 656.
3. Zhao, J.; Li, M.; Li, Y.; Matgen, P.; Chini, M. Urban flood mapping using satellite synthetic aperture radar data: A review of characteristics, approaches, and datasets. IEEE Geosci. Remote Sens. Mag. 2024, 13, 237–268.
4. Mateo-Garcia, G.; Veitch-Michaelis, J.; Smith, L.; Oprea, S.V.; Schumann, G.; Gal, Y.; Baydin, A.G.; Backes, D. Towards global flood mapping onboard low cost satellites with machine learning. Sci. Rep. 2021, 11, 7249.
5. Wang, B.; Shi, H.; Qiu, Y.; Deng, N.; Nock, D.; Shen, X.; Wang, Z.; Wang, Y. Unequal power outages induced by natural disasters. Nat. Commun. 2025, 16, 8947.
6. Qiu, J.; Yang, X.; Zheng, Z.; Tarolli, P. High-resolution mapping of China’s flooded croplands. Sci. Bull. 2025, 70, 1165–1173.
7. Haghshenas, S.S.; Astarita, V.; Haghshenas, S.S.; Martino, G.; Guido, G. The potential of satellite internet technologies for crisis management during urban evacuation: A case study of starlink in Italy. Information 2025, 16, 840.
8. Kopidaki, C.; Tsagkatakis, G.; Tsakalides, P. Federated learning for remote sensing image classification using sparse image representations. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5.
9. Xiong, X.; Zhang, X.; Jiang, W.; Liu, T.; Liu, Y.; Liu, L. Lightweight dual-stream SAR–ATR framework based on an attention mechanism-guided heterogeneous graph network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 537–556.
10. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
11. Wu, X.; Zhang, Z.; Xiong, S.; Zhang, W.; Tang, J.; Li, Z.; An, B.; Li, R. A near-real-time flood detection method based on deep learning and SAR images. Remote Sens. 2023, 15, 2046.
12. Wang, J.; Wang, S.; Wang, F.; Zhou, Y.; Wang, Z.; Ji, J.; Xiong, Y.; Zhao, Q. FWENet: A deep convolutional neural network for flood water body extraction based on SAR images. Int. J. Digit. Earth 2022, 15, 345–361.
13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Nice, France, 2017; Available online: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 13 February 2026).
14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. Available online: https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper (accessed on 13 February 2026).
15. Noori, A.M.; Ziboon, A.R.T.; AL-Hameedawi, A.N. Deep-learning integration of CNN–transformer and U-net for Bi-temporal SAR flash-flood detection. Appl. Sci. 2025, 15, 7770.
16. Wu, C.; Wang, Z.; Xia, J.; Zhang, F. CFP-SwinT: Building damage classification during floods using cross fusion pyramid swin-T. Int. J. Digit. Earth 2025, 18, 2523490.
17. Basit, A.; Siddique, M.A.; Bhatti, M.K.; Sarfraz, M.S. Comparison of CNNs and vision transformers-based hybrid models using gradient profile loss for classification of oil spills in SAR images. Remote Sens. 2022, 14, 2085.
18. Xiang, X.; Gong, W.; Li, S.; Chen, J.; Ren, T. TCNet: Multiscale fusion of transformer and CNN for semantic segmentation of remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 3123–3136.
19. Wu, H.; Zeng, Z.; Huang, P.; Yu, X.; Zhang, M. CCTNet: CNN and cross-shaped transformer hybrid network for remote sensing image semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19986–19997.
20. Huang, Y.; Jiao, D.; Huang, X.; Tang, T.; Gui, G. A hybrid CNN-transformer network for object detection in optical remote sensing images: Integrating local and global feature fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 241–254.
21. Eliopoulos, N.J.; Jajal, P.; Davis, J.C.; Liu, G.; Thiravathukal, G.K.; Lu, Y.-H. Pruning one more token is enough: Leveraging latency-workload non-linearities for vision transformers on the edge. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; pp. 7153–7162.
22. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
23. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Nice, France, 2015.
24. Wang, Y.; Yang, C.; Lan, S.; Zhu, L.; Zhang, Y. End-edge-cloud collaborative computing for deep learning: A comprehensive survey. IEEE Commun. Surv. Tutor. 2024, 26, 2647–2683.
25. Misra, A.; White, K.; Nsutezo, S.F.; Straka, W.; Lavista, J. Mapping global floods with 10 years of satellite radar data. Nat. Commun. 2025, 16, 5762.
26. Bonafilia, D.; Tellman, B.; Anderson, T.; Issenberg, E. Sen1Floods11: A georeferenced dataset to train and test deep learning flood algorithms for sentinel-1. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 210–211.
27. Long, S.; Fatoyinbo, T.E.; Policelli, F. Flood extent mapping for Namibia using change detection and thresholding with SAR. Environ. Res. Lett. 2014, 9, 35002.
28. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
29. Pech-May, F.; Aquino-Santos, R.; Álvarez-Cárdenas, O.; Arandia, J.L.; Rios-Toledo, G. Segmentation and visualization of flooded areas through sentinel-1 images and U-net. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8996–9008.
30. Garg, S.; Dasgupta, A.; Motagh, M.; Martinis, S.; Selvakumaran, S. Unlocking the full potential of Sentinel-1 for flood detection in arid regions. Remote Sens. Environ. 2024, 315, 114417.
31. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
32. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Nice, France, 2021; pp. 12077–12090.
33. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and multiscale transformer fusion network for remote-sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12.
34. Lin, C.-W.; Lin, Y.; Zhou, S.; Zhu, L. Gateinst: Instance segmentation with multi-scale gated-enhanced queries in transformer decoder. Multimed. Syst. 2024, 30, 252.
35. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv 2016, arXiv:1510.00149.
36. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713.
37. Huang, M.; Luo, J.; Ding, C.; Wei, Z.; Huang, S.; Yu, H. An integer-only and group-vector systolic accelerator for efficiently mapping vision transformer on edge. IEEE Trans. Circuits Syst. I 2023, 70, 5289–5301.
38. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv 2021, arXiv:2110.02178.
39. Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. TinyViT: Fast pretraining distillation for small vision transformers. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2022; Volume 13681, pp. 68–85.
40. Simantiris, G.; Panagiotakis, C. Unsupervised color-based flood segmentation in UAV imagery. Remote Sens. 2024, 16, 2126.
41. Wang, Z.; Wang, X.; Li, G. An extremely lightweight U-net with soft fusion for flood detection using multi-source satellite images. In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 6454–6457.
42. Xu, T.; Huang, C.; Chen, C. Spatial-channel knowledge distillation for flood mapping in synthetic aperture radar images. Knowl.-Based Syst. 2025, 328, 114177.
43. Wang, Z.; Wang, X.; Li, G.; Wu, W.; Liu, Y.; Song, Z.; Song, H. Historical information fusion of dense multi-source satellite image time series for flood extent mapping. Inf. Fusion 2024, 109, 102445.
44. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
45. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2017, arXiv:1412.6980.
46. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2017, arXiv:1608.03983.
47. Prechelt, L. Early stopping—But when? In Neural Networks: Tricks of the Trade; Orr, G.B., Müller, K.-R., Eds.; Springer: Berlin, Germany, 1998; Volume 1524, pp. 55–69.
48. Elkan, C. The foundations of cost-sensitive learning. In Proceedings of the International Joint Conference on Artificial Intelligence; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2001; Volume 17, pp. 973–978.
49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Nevada, USA, 27–30 June 2016; pp. 770–778.
50. Katiyar, V.; Tamkuan, N.; Nagai, M. Near-real-time flood mapping using off-the-shelf models with SAR imagery and deep learning. Remote Sens. 2021, 13, 2334.
51. Zhao, J.; Xiong, Z.; Zhu, X. UrbanSARFloods: Sentinel-1 SLC-based benchmark dataset for urban and open-area flood mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 17–18 June 2024.

Figure 1. Overall architecture of the proposed parallel CNN-Transformer fusion framework.

Figure 2. Structure of the proposed Dynamic Gated Fusion Module (DGFM).

Figure 3. Statistical distribution of the learned gate weight w across different flood scenarios.

Figure 4. Comparative analysis of gate weights in Sri Lanka and USA regions. The orange line denotes the median gate weight in each region.

Figure 5. Evolution of training and validation loss for the proposed Hybrid-gated model.

Figure 6. Statistical comparison of mIoU and F1-score across different baseline architectures.

Figure 7. Qualitative visualization of flood segmentation on the Sen1Floods11 dataset (white indicates flood regions, while black indicates non-flood regions). In the Overlay column, green indicates ground-truth flood regions, while red indicates predicted flood regions.

Figure 8. Comparative analysis of cost-sensitive risk under different C_FP:C_FN ratios.

Table 1. Default model hyperparameters.

Component	Parameters
Input	256 × 256 × 2
CNN base channels	24
Transformer	L = 2, heads = 4, d = 128, patch_size = 16
Gated MLP hidden dimension	128
Dropout	0.1

Table 2. Training Hyperparameter Settings.

Parameter	Value
Optimizer	Adam
Learning rate	$10^{-4}$
Weight decay	$10^{-5}$
Batch size	2
Gradient accumulation steps	2
Maximum epochs	40
Early stopping patience	10
Learning rate scheduling	Cosine annealing
Loss function	$L = L_{BCE} + 0.5 L_{Dice}$
Gradient clipping	1.0
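
The combined objective in Table 2, L = L_BCE + 0.5 L_Dice, can be sketched in plain Python as follows. The soft-Dice form, the smoothing constant, and the probability clamp are common choices assumed here, since the table gives only the weighting. The quality-aware sample weights are omitted because their definition is not restated in this section.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels (probabilities clamped by eps)."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * intersection + smooth) / (sum_p + sum_y + smooth)."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets, dice_weight=0.5):
    """L = L_BCE + dice_weight * L_Dice, with dice_weight = 0.5 as in Table 2."""
    return bce_loss(probs, targets) + dice_weight * dice_loss(probs, targets)


# Toy flattened patch: predicted flood probabilities and binary labels.
targets = [1, 1, 0, 0]
good = [0.9, 0.8, 0.1, 0.2]
poor = [0.4, 0.3, 0.6, 0.7]
print(f"good prediction: {combined_loss(good, targets):.4f}")
print(f"poor prediction: {combined_loss(poor, targets):.4f}")
```

The cross-entropy term scores every pixel independently, while the Dice term depends on the overlap of the whole mask, which is less sensitive to the large non-flood background.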

Table 3. Performance Comparison of Different Models on the Sen1Floods11 Test Set (bold text represents the best result).

Model	mIoU	F1	Precision	Recall	F0.5	FAR	Risk	Latency (s/frame)
CNN-only	0.4951	0.6177	0.5922	0.8284	0.5950	0.0262	4247.94	0.2286
Transformer-only	0.3770	0.4953	0.5256	0.5286	0.5012	0.0319	4194.49	0.0048
Hybrid-avg	0.5499	0.6696	0.7635	0.7168	0.7022	0.0386	4687.58	0.2501
Hybrid-concat	0.5622	0.6857	0.7129	0.7643	0.6955	0.0247	3530.82	0.2576
Hybrid-gated (sample-wise)	0.5123	0.6416	0.7941	0.6543	0.6967	0.0177	3283.36	0.2490
Hybrid-gated (channel)	0.4523	0.5633	0.7881	0.5585	0.6292	0.0310	4518.11	0.1193
Hybrid-gated (Lovász)	0.5363	0.6642	0.7187	0.7535	0.6816	0.0267	3852.44	0.1233
Hybrid-gated (ours)	0.5814	0.7028	0.7562	0.7493	0.7174	0.0173	2658.42	0.2179
Hybrid-gated (spatial)	0.5356	0.6663	0.7007	0.7334	0.6771	0.0257	3477.02	0.1249

Note: Hybrid-gated (sample-wise) = same gate without quality-aware training; both share the same architectural backbone; the latency difference may be due to run-to-run variance. Precision, Recall, and F1 are macro-averaged (mean over test images). Risk is the mean of per-image cost-sensitive risk (2 × FP + FN) over the test set.
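
Because the metrics are averaged per image, the reported F1 need not equal the harmonic mean of the reported precision and recall. The sketch below illustrates this with two hypothetical images. The (TP, FP, FN) counts and the zero-division convention are assumptions made for the example.

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-beta for one image (0.0 when a ratio is undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1.0 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


def macro_average(per_image_counts, beta=1.0):
    """Mean of per-image precision, recall, and F-beta."""
    rows = [prf(tp, fp, fn, beta) for tp, fp, fn in per_image_counts]
    n = len(rows)
    return tuple(sum(row[i] for row in rows) / n for i in range(3))


# Two hypothetical images as (TP, FP, FN) pixel counts.
counts = [(90, 10, 10), (10, 0, 90)]
macro_p, macro_r, macro_f1 = macro_average(counts)
harmonic_of_means = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"macro precision = {macro_p:.4f}, macro recall = {macro_r:.4f}")
print(f"macro F1 = {macro_f1:.4f}, harmonic mean of macro P and R = {harmonic_of_means:.4f}")
```

The same effect explains why, for Hybrid-gated (ours), the reported F1 of 0.7028 is lower than the harmonic mean of the reported precision (0.7562) and recall (0.7493), which is about 0.7527.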

Table 4. Quantitative Comparison with State-of-the-Art Segmentation Models (bold text represents the best result).

Model	mIoU	F1	Precision	Recall	F0.5	FAR	Risk	Latency (s/frame)
UNet (ResNet18)	0.4480	0.5602	0.5268	0.8735	0.5324	0.0244	4069.32	0.1031
UNet++ (ResNet18)	0.5032	0.6206	0.5801	0.8701	0.5899	0.0364	5187.26	0.2227
UNet (S1-weak) [50]	0.3490	-	-	-	-	-	-	-
UNet (TL) [50]	0.4940	-	-	-	-	-	-	-
DeepLabV3+ (ResNet18)	0.3417	0.4504	0.4348	0.8472	0.4220	0.0454	7115.65	0.0731
DeepLabV3+ (MobileNetV2)	0.3961	0.5257	0.6050	0.6747	0.5465	0.0111	3719.42	0.0554
FPN (ResNet18)	0.5265	0.6567	0.6838	0.6985	0.6675	0.0079	1625.20	0.0772
PAN (ResNet18)	0.4335	0.5594	0.5446	0.7749	0.5350	0.0193	3322.17	0.0588
MAnet (ResNet18)	0.5773	0.6969	0.7243	0.8044	0.6991	0.0081	1629.27	0.1025
Linknet (ResNet18)	0.4833	0.5986	0.5804	0.8516	0.5797	0.0170	2851.56	0.0597
SegFormer-B0 (MiT-B0)	0.5678	0.6868	0.7249	0.7797	0.6915	0.0100	2505.63	0.0674
FCNN (S1-weak) [26]	0.3092	-	-	-	-	-	-	-
Hybrid-concat	0.5622	0.6857	0.7129	0.7643	0.6955	0.0247	3530.82	0.2576
Hybrid-gated (ours)	0.5814	0.7028	0.7562	0.7493	0.7174	0.0173	2658.42	0.2179

Note: Hybrid-gated (ours) in this table is the same configuration as in Table 3 (Hybrid-gated with quality-aware training). Hyphens (-) indicate that the metrics were not published in those works.

Table 5. Factorized ablation of the fusion architecture and training protocol in a unified setting, including fixed-gate interventions under quality-aware training.

Fusion Architecture	Training Protocol	mIoU	F1	Precision	Recall	F0.5	FAR	Risk	Latency (s/frame)
Hybrid-concat	Standard	0.5622	0.6857	0.7129	0.7643	0.6955	0.0247	3530.82	0.2576
Hybrid-concat	Quality-aware	0.5662	0.6852	0.6843	0.8059	0.6782	0.0244	3406.37	0.1910
Hybrid-gated	Standard	0.5123	0.6416	0.7941	0.6543	0.6967	0.0177	3283.36	0.2490
Hybrid-gated	Quality-aware (ours)	0.5814	0.7028	0.7562	0.7493	0.7174	0.0173	2658.42	0.2179
Hybrid-gated (fixed w = 0.2)	Quality-aware	0.5725	0.6976	0.6893	0.7979	0.6870	0.0231	3284.82	0.1848
Hybrid-gated (fixed w = 0.5)	Quality-aware	0.5598	0.6840	0.6536	0.8240	0.6604	0.0271	3525.74	0.1744
Hybrid-gated (fixed w = 0.8)	Quality-aware	0.5423	0.6655	0.7808	0.6747	0.7085	0.0227	3002.96	0.1606

Note: Unless stated otherwise in the main text, all seven configurations use the same data split, optimizer, augmentation, and training schedule. The first four rows constitute a controlled 2 × 2 factorization of fusion architecture (concat vs. gated) and training protocol (standard vs. quality-aware). Hybrid-gated (Standard) corresponds to Hybrid-gated (sample-wise) in Table 3. Rows 5–7 are controlled interventions that replace the learned MLP gate in DGFM with a constant fusion weight, w, while keeping the hybrid-gated architecture and quality-aware weighting identical to those of row 4. Latency is measured in independent CPU runs and may fluctuate from run to run.
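
The fixed-gate intervention in rows 5–7 can be pictured with the toy sketch below, in which a single scalar w blends the two branch outputs. Two points are assumptions made only for illustration: that w multiplies the CNN branch and 1 − w the Transformer branch, and that the learned gate is a sigmoid over pooled branch features. The function `toy_gate` is a hypothetical stand-in and not the DGFM gate itself.

```python
import math


def fuse(cnn_feat, trans_feat, w):
    """Convex blend of two equal-length feature vectors.

    ASSUMPTION: w weights the CNN branch and (1 - w) the Transformer branch.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if len(cnn_feat) != len(trans_feat):
        raise ValueError("feature vectors must have equal length")
    return [w * c + (1.0 - w) * t for c, t in zip(cnn_feat, trans_feat)]


def toy_gate(cnn_feat, trans_feat, weights=(0.0, 0.0), bias=0.0):
    """Hypothetical sample-wise gate: sigmoid of a linear score computed
    from the mean-pooled features of each branch."""
    pooled = (sum(cnn_feat) / len(cnn_feat), sum(trans_feat) / len(trans_feat))
    score = sum(wi * pi for wi, pi in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-score))


cnn = [1.0, 2.0]
trans = [3.0, 4.0]

fixed = fuse(cnn, trans, 0.5)                    # constant w, as in rows 5-7
learned = fuse(cnn, trans, toy_gate(cnn, trans))  # input-dependent w
```

With a constant w the blend is the same for every input, whereas a gate of the second kind can shift the balance per sample. That input dependence is the property the fixed-w rows are designed to remove.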

Table 6. Ablation on Transformer depth and patch size under the Standard protocol (provided for sensitivity analysis).

Variant	mIoU	F1	Precision	Recall	Latency (s/frame)	Description
Transformer L = 1	0.5431	0.6640	0.7347	0.7404	0.1149	Strongest mIoU/F1 among depths in this table; shallower global mixing.
Transformer L = 2 (paper default depth)	0.5123	0.6416	0.7941	0.6543	0.1169	Depth used in the hybrid architecture; under standard loss, mIoU/F1 are below L = 1.
Transformer L = 3	0.4946	0.6351	0.6874	0.7261	0.1184	Deepest setting; mIoU decreases, likely due to optimization under speckle and data limits.
Patch size = 8	0.5669	0.6871	0.7993	0.7063	0.1338	Strongest mIoU/F1 among patch sizes in this table; finer tokens and higher attention cost.
Patch size = 16 (paper default patch)	0.5123	0.6416	0.7941	0.6543	0.1254	Patch size used in the hybrid architecture; under standard loss, mIoU/F1 are below P = 8.
Patch size = 32	0.5183	0.6364	0.8224	0.6349	0.1170	Coarser patches; fewer tokens, different precision–recall trade-off.

Note: All variants in Table 6 are evaluated under the Standard training protocol (without quality-aware reweighting) to isolate the effects of Transformer depth and patch size.

Table 7. Analysis of Computational Complexity and Resource Consumption.

Model	Params (M)	FLOPs (G)	Peak Memory (MB)	CPU Latency (s/frame)
CNN-only	3.1	2.8	45	0.2286
Transformer-only	3.8	1.2	38	0.0048
Hybrid-avg	4.2	4.0	52	0.2501
Hybrid-concat	4.2	4.2	55	0.2576
Hybrid-gated	4.2	4.1	54	0.2179

Table 8. Results of the Latency Stress Test under Different CPU Budgets.

Latency Budget	Models Within Budget
≤100 ms	Transformer-only, DeepLabV3+ (MobileNetV2), PAN, DeepLabV3+ (ResNet18), FPN, Linknet, SegFormer-B0 (MiT-B0)
≤200 ms	All of the above; UNet (ResNet18), MAnet (ResNet18)
≤250 ms	All of the above; UNet++ (ResNet18), CNN-only, Hybrid-gated (ours)
≤260 ms	All of the above; Hybrid-avg, Hybrid-concat
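
The grouping in Table 8 follows directly from the per-frame CPU latencies reported in Tables 3 and 4. The sketch below reproduces it by filtering those latencies against each budget; the helper names are illustrative.

```python
# CPU latencies in seconds per frame, copied from Tables 3 and 4.
LATENCY_S = {
    "CNN-only": 0.2286,
    "Transformer-only": 0.0048,
    "Hybrid-avg": 0.2501,
    "Hybrid-concat": 0.2576,
    "Hybrid-gated (ours)": 0.2179,
    "UNet (ResNet18)": 0.1031,
    "UNet++ (ResNet18)": 0.2227,
    "DeepLabV3+ (ResNet18)": 0.0731,
    "DeepLabV3+ (MobileNetV2)": 0.0554,
    "FPN (ResNet18)": 0.0772,
    "PAN (ResNet18)": 0.0588,
    "MAnet (ResNet18)": 0.1025,
    "Linknet (ResNet18)": 0.0597,
    "SegFormer-B0 (MiT-B0)": 0.0674,
}

BUDGETS_MS = [100, 200, 250, 260]


def models_within(budget_ms):
    """Names of all models whose latency fits the given budget."""
    limit_s = budget_ms / 1000.0
    return {name for name, lat in LATENCY_S.items() if lat <= limit_s}


def newly_admitted(budget_ms, previous_ms):
    """Models that fit this budget but not the previous, tighter one."""
    return models_within(budget_ms) - models_within(previous_ms)


for prev, cur in zip([0] + BUDGETS_MS[:-1], BUDGETS_MS):
    print(f"<= {cur} ms:", sorted(newly_admitted(cur, prev)))
```

The proposed model first becomes admissible at the 250 ms budget, while seven of the compared models already fit within 100 ms.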

Table 9. Cost-Sensitive Risk Evaluation under Varying Cost Ratios.

Model	FP	FN	Risk (1:1)	Risk (2:1)	Risk (3:1)
CNN-only	138,482	79,863	218,345	356,827	495,309
Transformer-only	66,943	310,562	377,505	444,448	511,391
Hybrid-avg	171,020	51,717	222,737	393,757	564,777
Hybrid-concat	120,545	55,499	176,044	296,589	417,134
Hybrid-gated (ours)	87,459	48,388	135,847	223,306	310,765

Note: Risk = C_FP × FP + C_FN × FN. The C_FP : C_FN ratios considered are 1:1 (neutral), 2:1 (emergency setting, with a higher penalty on false alarms), and 3:1 (strong false-alarm penalty). FP and FN are summed over the entire test set.
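
The risk columns in Table 9 can be recomputed from the FP and FN totals with the formula in the note. The short sketch below does so for each cost ratio and reports the lowest-risk model; the variable and function names are illustrative.

```python
# (FP, FN) totals over the test set, copied from Table 9.
COUNTS = {
    "CNN-only": (138_482, 79_863),
    "Transformer-only": (66_943, 310_562),
    "Hybrid-avg": (171_020, 51_717),
    "Hybrid-concat": (120_545, 55_499),
    "Hybrid-gated (ours)": (87_459, 48_388),
}


def risk(fp, fn, c_fp, c_fn=1):
    """Cost-sensitive risk: C_FP * FP + C_FN * FN."""
    return c_fp * fp + c_fn * fn


def risk_table(c_fp):
    """Risk of every model for a C_FP : 1 cost ratio."""
    return {name: risk(fp, fn, c_fp) for name, (fp, fn) in COUNTS.items()}


for c_fp in (1, 2, 3):
    table = risk_table(c_fp)
    best = min(table, key=table.get)
    print(f"{c_fp}:1 -> lowest risk: {best} ({table[best]:,})")
```

For all three ratios the lowest total is obtained by Hybrid-gated (ours), with 135,847, 223,306, and 310,765, matching the last row of the table.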

Table 10. Summary of Accuracy–Latency–Risk Trade-offs for Representative Models.

Model	mIoU	F1	FAR	Risk	CPU Latency (s/frame)
Transformer-only	0.3770	0.4953	0.0319	4194.49	0.0048
CNN-only	0.4951	0.6177	0.0262	4247.94	0.2286
FPN (ResNet18)	0.5265	0.6567	0.0079	1625.20	0.0772
SegFormer-B0 (MiT-B0)	0.5678	0.6868	0.0100	2505.63	0.0674
Hybrid-gated (ours)	0.5814	0.7028	0.0173	2658.42	0.2179
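
One way to read the trade-offs in Table 10 is to check which models are not outperformed on every criterion at once. The sketch below applies a simple dominance test over three of the columns (higher mIoU, lower Risk, lower CPU latency). The choice of these three criteria and the function names are assumptions for illustration; this analysis is not part of the paper.

```python
# (mIoU, Risk, CPU latency in s/frame), copied from Table 10.
MODELS = {
    "Transformer-only": (0.3770, 4194.49, 0.0048),
    "CNN-only": (0.4951, 4247.94, 0.2286),
    "FPN (ResNet18)": (0.5265, 1625.20, 0.0772),
    "SegFormer-B0 (MiT-B0)": (0.5678, 2505.63, 0.0674),
    "Hybrid-gated (ours)": (0.5814, 0.0 + 2658.42, 0.2179),
}


def dominates(a, b):
    """True if a is at least as good as b on all criteria and strictly
    better on at least one (mIoU higher; risk and latency lower)."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def non_dominated(models):
    """Names of models that no other model dominates."""
    return [
        name
        for name, scores in models.items()
        if not any(
            dominates(other, scores)
            for other_name, other in models.items()
            if other_name != name
        )
    ]


front = non_dominated(MODELS)
print(front)
```

Under these three criteria only CNN-only is dominated (by FPN). The remaining four models each lead on at least one axis: Transformer-only on latency, FPN on risk, Hybrid-gated (ours) on mIoU, and SegFormer-B0 as an intermediate compromise.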


Share and Cite

MDPI and ACS Style

Nie, Y.; Shi, Z.; Li, J.; Ge, H. Gated Lightweight CNN-Transformer Fusion for Real-Time Flood Segmentation on Satellite Internet Terminals Under Triple-Disruption Emergency Conditions. Remote Sens. 2026, 18, 1418. https://doi.org/10.3390/rs18091418

