In this section, the performance of the proposed parallel CNN-Transformer fusion framework is evaluated. We first describe the dataset and experimental setup, followed by a series of comparative and ablation studies. Finally, the model is stress-tested under the “triple-disruption” scenario to verify its practical utility.
4.1. Dataset and Experimental Setup
The experiments use the publicly available Sen1Floods11 SAR flood benchmark dataset [26], which consists of Sentinel-1 imagery with VV/VH dual-polarization channels. The dataset is a widely used benchmark for deep learning-based flood segmentation and includes two subsets: “Weakly Labeled” and “Hand Labeled”. The “Hand Labeled” subset is manually annotated by experts to a high standard, ensuring consistent inter-annotator agreement. To leverage all available data, the training set incorporates samples from both subsets, enabling the model to benefit from diverse scenarios. Each training sample is tagged as weak or hand according to its source subset in Sen1Floods11. During training, weakly and hand-labeled samples are shuffled together into the same mini-batches without additional oversampling or undersampling. Quality awareness is introduced through sample-wise loss weighting, as defined in Section 3.5. To ensure rigorous and reliable evaluation, the validation and test sets comprise only high-quality “Hand Labeled” patches. This strategy, coupled with our quality-aware training protocol, enables the model to learn effectively from noisy weak labels while being assessed objectively against expert-verified ground truth. To avoid data leakage and obtain a fair estimate of generalisation, patches from the same original image appear in only one set. We employ data augmentation during the training phase [29], including random horizontal and vertical flipping (each with a probability of 0.5), 90° rotation, and ±15% radiometric jitter of the backscatter values (to mimic intensity variation in SAR data).
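The augmentation steps listed above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the function names are hypothetical, the jitter is assumed to be a single multiplicative factor per sample drawn uniformly from [0.85, 1.15], and the rotation is assumed to be a random multiple of 90°. Geometric transforms are applied identically to both polarization channels and the label mask, whereas the jitter touches only the backscatter values.

```python
import random

def hflip(img):
    """Mirror a 2D list left-right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror a 2D list top-bottom."""
    return [row[:] for row in img[::-1]]

def rot90(img):
    """Rotate a 2D list by 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]

def augment(channels, mask, rng, jitter=0.15):
    """Augment one sample: `channels` is [VV, VH], `mask` is the label map."""
    planes = [*channels, mask]
    if rng.random() < 0.5:                      # horizontal flip, p = 0.5
        planes = [hflip(p) for p in planes]
    if rng.random() < 0.5:                      # vertical flip, p = 0.5
        planes = [vflip(p) for p in planes]
    for _ in range(rng.randrange(4)):           # assumed: 0 to 3 quarter turns
        planes = [rot90(p) for p in planes]
    *channels, mask = planes
    # Assumed: multiplicative radiometric jitter, applied to imagery only.
    scale = rng.uniform(1.0 - jitter, 1.0 + jitter)
    channels = [[[v * scale for v in row] for row in ch] for ch in channels]
    return channels, mask
```

Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible across runs.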
This paper uses the standard evaluation metrics for semantic segmentation tasks: mean intersection over union (mIoU), F1 score, precision, and recall [44]. To accurately reflect edge deployment scenarios, all inference speed tests are run on a CPU platform, and we report the average processing latency for a single image (batch size = 1). Training uses the Adam optimizer [45] with an initial learning rate of 10⁻⁴, combined with cosine annealing scheduling [46] and an early stopping strategy [47]. The specific hyperparameters are shown in Table 2.
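For readers unfamiliar with the schedule, the following Python sketch shows the standard cosine annealing formula together with a simple patience-based early stopping rule. Only the initial learning rate of 10⁻⁴ comes from the text. The minimum learning rate, the number of epochs, the patience, and the names `cosine_lr` and `EarlyStopping` are illustrative assumptions, and the values actually used are those in Table 2.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=1e-4, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (no warm restarts)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # hypothetical value, not from the paper
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The learning rate starts at `base_lr`, reaches half of it midway through training, and decays to `min_lr` at the final epoch.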
4.3. Comparative Experiments with Baseline Methods
To verify the contribution of the parallel architecture and the dynamic gated fusion module (DGFM), we compare several baseline configurations. The results on the Sen1Floods11 test set are summarized in Table 3.
As demonstrated in Table 3, the Hybrid-gated (ours) configuration, combined with quality-aware training through weak-label reweighting, achieves the highest mIoU of 0.5814 and an F1-score of 0.7028. Notably, this configuration also yields the lowest FAR (0.0173) and the lowest risk (2658.42) among the compared configurations. This suggests that the DGFM effectively leverages the local consistency from the CNN branch and the global context from the Transformer branch, particularly in mitigating “boundary leakage” and urban shadow misclassifications.
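To make the reported quantities concrete, the sketch below derives precision, recall, F1, mIoU, and FAR from binary confusion counts. The definitions are assumptions made for illustration: mIoU is taken as the mean of the flood and background IoU, and FAR as FP / (FP + TN). The paper's own definitions, given in an earlier section, may differ, for example in how values are averaged across images.

```python
def confusion(pred, truth):
    """Count TP, FP, FN, TN for flat binary sequences (1 = flood)."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def _ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0

def metrics(tp, fp, fn, tn):
    """Segmentation metrics under the assumed definitions stated above."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    iou_flood = _ratio(tp, tp + fp + fn)
    iou_background = _ratio(tn, tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "miou": 0.5 * (iou_flood + iou_background),
        "far": _ratio(fp, fp + tn),   # assumed: false alarms over all negatives
    }
```

For example, `metrics(*confusion(pred, truth))` can be evaluated per patch and then averaged, or the counts can be pooled over the whole test set first. The two conventions generally give different numbers.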
As shown in Figure 5, the training and validation losses steadily decrease and converge under the proposed training protocol.
Figure 6 shows a statistical comparison of the mIoU and F1 score for different baseline architectures.
The results align with the operational requirements for high-precision situational awareness in emergency scenarios. The Hybrid-gated (ours) model dynamically balances multi-scale features to minimise cost-sensitive risk and to reduce the waste of limited rescue resources caused by false alarms, all while maintaining a robust detection rate. Compared to standard fusion strategies (e.g., averaging and concatenation), our gating mechanism offers greater adaptability to the variability in SAR imagery backscatter. Section 4.5 provides a detailed ablation analysis of the internal gating structures and the impact of the quality-aware training protocol.
4.4. Comparison with State-of-the-Art Methods
To comprehensively evaluate the framework, we compare it with several representative SOTA models, including UNet++, DeepLabV3+, and SegFormer-B0. All models are trained from scratch with in_channels = 2 to avoid domain bias from RGB pre-training. The ResNet-18 backbone follows the standard residual network [49].
As shown in Table 4, the Hybrid-gated (ours) model outperforms all SOTA baselines in overall segmentation quality (mIoU 0.5814). It is important to note that the absolute mIoU values in SAR tasks are lower than those in optical RGB benchmarks due to speckle noise and lack of spectral information. However, our model offers an operating point that prioritizes accuracy with competitive risk-sensitive behavior under the same protocol. In terms of efficiency, our latency (0.2179 s/frame) is competitive with UNet++ (0.2227 s/frame) while providing significantly higher precision.
The proposed hybrid-gated model outperforms the UNet++ (ResNet18) model in terms of mIoU. Some SOTA baselines (e.g., FPN and MAnet) achieve lower FAR and risk in the risk-sensitive metrics, indicating different operational trade-offs. Compared to SegFormer-B0, our model achieves a higher mIoU (0.5814 vs. 0.5678) at the cost of a higher FAR (0.0173 vs. 0.01), a trade-off that prioritizes overall accuracy. Overall, our method offers a practical balance of accuracy and reliability under terminal deployment constraints.
Figure 7 presents the qualitative visualization of the results.
4.5. Structural Ablation and Generalization Assessment
To verify the specific contributions of the gating mechanism and the training protocol, we first compare different structural variants. To make a fair comparison, the “Hybrid-gated (sample-wise)” and “Hybrid-gated (ours)” models have the same architecture, data split, optimizer, augmentation, and training schedule. The only difference is whether quality-aware sample weighting is enabled. As shown in Table 3, the sample-wise gated variant outperforms the channel-wise variant, with mIoUs of 0.5123 and 0.4523, respectively. This confirms that sample-adaptive weight assignment is better suited to handling the diverse backscatter characteristics of SAR imagery. Furthermore, introducing the quality-aware training (Hybrid-gated, ours) substantially improves performance, elevating the mIoU from 0.5123 to 0.5814. These results suggest that the current protocol benefits from the complementary contributions of quality-aware training and fusion architecture.
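The distinction between sample-wise and channel-wise gating can be sketched as follows. This is a generic illustration of gated fusion, not the DGFM itself. It assumes the fused feature is a convex combination g · F_cnn + (1 − g) · F_trans with g = sigmoid(z), where the gate logit z is a single scalar per sample in the sample-wise case and one scalar per channel in the channel-wise case. How the logits are produced is left out, and the function name `fuse` is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fuse(f_cnn, f_trans, gate_logits):
    """Blend two feature maps given as lists of per-channel value lists.

    gate_logits: a single number (sample-wise gate shared by all channels)
    or a list with one logit per channel (channel-wise gate).
    """
    n_channels = len(f_cnn)
    if isinstance(gate_logits, (int, float)):
        gate_logits = [gate_logits] * n_channels   # broadcast the sample gate
    gates = [sigmoid(z) for z in gate_logits]
    return [
        [g * a + (1.0 - g) * b for a, b in zip(ch_cnn, ch_trans)]
        for g, ch_cnn, ch_trans in zip(gates, f_cnn, f_trans)
    ]
```

With a logit of 0 the gate equals 0.5 and the fusion reduces to plain averaging, so a learned gate can be read as an input-dependent departure from the averaging baseline.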
To further distinguish the effects of architecture from the effects of the training protocol, we conduct an additional factorized 2 × 2 comparison under the same split, optimizer, augmentation, and training schedule. Specifically, we compare the Hybrid-concat and Hybrid-gated models, each with standard and quality-aware training. The corresponding results are summarized in Table 5. While Table 3 reports on a broader range of architectural variants, Table 5 provides a controlled 2 × 2 decomposition of architectural versus training attributions.
Under standard training, the Hybrid-concat model produces a stronger baseline than the Hybrid-gated model. However, with quality-aware reweighting, the Hybrid-gated model improves substantially and becomes the best overall model, suggesting that DGFM is more sensitive to label quality under the current protocol.
To isolate the effects of Transformer depth (L) and patch size (P) from quality-aware reweighting, Table 6 reports a controlled ablation under the standard training objective (no QA). All runs share the same split, optimizer, augmentation, and schedule; only L and P vary.
Under this protocol, L = 1 and P = 8 achieve the highest mIoU and F1 scores among the listed depth and patch settings. However, L = 3 reduces mIoU, which is consistent with the increased optimization difficulty under speckle-dominated SAR data and limited training diversity. Patch size balances the spatial support of each token with the number of tokens and the cost of global self-attention.
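The trade-off governed by the patch size can be made concrete with a short calculation. The sketch below counts tokens for a given input size and patch size and uses the number of token pairs as a proxy for the cost of global self-attention. The 256 × 256 input used in the example is an assumption for illustration, since the resolution at which the Transformer branch tokenizes its input is not stated in this section.

```python
def token_count(height, width, patch):
    """Number of non-overlapping patch tokens for an input of the given size."""
    if height % patch or width % patch:
        raise ValueError("input size must be divisible by the patch size")
    return (height // patch) * (width // patch)

def attention_pairs(n_tokens):
    """Token pairs scored by global self-attention (quadratic in tokens)."""
    return n_tokens * n_tokens

# Assumed 256 x 256 input, comparing the two patch sizes used in the ablation.
for p in (8, 16):
    n = token_count(256, 256, p)
    print(f"P = {p}: {n} tokens, {attention_pairs(n)} attention pairs")
```

Under this assumption, halving the patch size from 16 to 8 quadruples the token count (256 to 1024) and multiplies the pairwise attention cost by 16, while each token covers a quarter of the spatial area.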
The main experiments in Table 3, Table 4 and Table 5 employ quality-aware training. Therefore, we retrained the hybrid-gated model under the same protocol, with an identical data split, optimizer, augmentation, and schedule, while varying one structural factor relative to the default backbone at a time. With quality-aware reweighting, L = 2 and P = 16 achieve an mIoU of 0.5814 and an F1 score of 0.7028. This is compared to an mIoU of 0.4800 and an F1 score of 0.5826 with L = 1 and P = 16, and an mIoU of 0.5445 and an F1 score of 0.6633 with L = 2 and P = 8. Thus, we adopt L = 2 and P = 16 as the default Transformer depth and patch size for all quality-aware configurations reported in this paper.
Finally, we conducted a cross-region generalization assessment using an unseen-region split. The model was trained on eight regions (Bolivia, Colombia, Ghana, India, the Mekong region, Nigeria, Pakistan, and Paraguay) and evaluated on two regions that were not used for training: Sri Lanka and the USA. To avoid bias from a single metric, we report the full set of metrics on the unseen test split: mIoU = 0.5228, F1 = 0.6382, precision = 0.7481, recall = 0.6721, F0.5 = 0.6668, FAR = 0.0252, risk = 3440.62, and latency = 0.1480 s/frame. These results suggest promising cross-region transferability under the current protocol. However, a more systematic per-region and cross-method comparison is necessary.
4.6. Efficiency Analysis and Communication Impact
To validate the feasibility of deploying the proposed parallel framework on resource-constrained satellite internet terminals, Table 7 provides a quantitative analysis of the computational complexity and resource consumption of all models. This includes the number of parameters, the number of floating-point operations (FLOPs), the peak memory usage, and the CPU inference latency.
Integrating the DGFM introduces only marginal computational overhead. The Hybrid-gated model achieves a CPU latency of 0.2179 s/frame (Table 7), comparable to Hybrid-avg (0.2501 s/frame) and Hybrid-concat (0.2576 s/frame), verifying the DGFM’s lightweight design.
Furthermore, performing segmentation on the terminal itself significantly improves communication efficiency on satellite emergency links. Specifically, instead of the raw SAR image patch, which has a payload of 256 KB, the terminal transmits either an 8 KB binary flood segmentation mask or vectorised flood boundary polygons, which are more compact still. This 32-fold reduction in payload (about 1.5 orders of magnitude) substantially alleviates uplink bandwidth pressure on satellite internet links, which is particularly critical for flood emergency response in the event of triple disruption, when terrestrial communication infrastructure is paralysed.
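The payload figures can be checked with a short bit-packing sketch. The 256 KB raw payload and the 8 KB mask size are taken from the text. The 256 × 256 patch size at one bit per pixel is an assumption that happens to reproduce the 8 KB figure (256 × 256 / 8 = 8192 bytes), and the function name `pack_mask` and the MSB-first row-major layout are illustrative choices.

```python
def pack_mask(mask):
    """Pack a 2D 0/1 mask into bytes: 8 pixels per byte, row-major, MSB first."""
    bits = [1 if v else 0 for row in mask for v in row]
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        byte <<= 8 - len(chunk)        # zero-pad a trailing partial byte
        out.append(byte)
    return bytes(out)

RAW_PATCH_BYTES = 256 * 1024           # raw SAR patch payload stated in the text

# Assumed 256 x 256 patch, one bit per pixel.
mask = [[0] * 256 for _ in range(256)]
packed = pack_mask(mask)
print(len(packed), RAW_PATCH_BYTES // len(packed))
```

Under the stated assumption the packed mask occupies 8192 bytes, which gives the 32-fold reduction relative to the raw patch before any entropy coding or vectorisation is applied.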
4.7. Stress Testing Under Emergency Constraints
To evaluate the model’s performance under a “triple-disruption” scenario, we conducted a latency-based stress test and a cost-sensitive risk analysis. The results of the latency stress test are summarized in Table 8.
Although lighter models such as PAN meet a 100 ms CPU inference latency budget, they have a high FAR. In contrast, our hybrid-gated model satisfies a 250 ms per-patch budget while providing highly reliable segmentation masks.
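A latency budget check of this kind can be sketched with the standard library alone. The harness below times single-sample calls (mirroring batch size = 1) and reports the mean. The warm-up count, the number of runs, and the names `mean_latency` and `within_budget` are assumptions, and `infer` stands for any callable that wraps a model's forward pass.

```python
import statistics
import time

def mean_latency(infer, sample, warmup=3, runs=20):
    """Average wall-clock seconds per call of `infer(sample)`."""
    for _ in range(warmup):            # discard cold-start effects
        infer(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

def within_budget(latency_s, budget_ms):
    """True if a latency in seconds fits a budget given in milliseconds."""
    return latency_s * 1000.0 <= budget_ms
```

For instance, the reported 0.2179 s/frame satisfies `within_budget(0.2179, 250)` but not `within_budget(0.2179, 100)`, which matches the two budgets discussed above.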
As shown in Table 9 and Figure 8, among the baseline and hybrid fusion variants in Table 9, our proposed hybrid-gated model achieves the lowest risk scores as the penalty ratio for false alarms (C_FP:C_FN) increases. Specifically, under the 2:1 and 3:1 ratios, which simulate emergency scenarios in which situational awareness is prioritised to prevent the misallocation of limited rescue resources, our model significantly outperforms the baselines. This shows that the DGFM effectively suppresses false positives induced by urban shadows and SAR speckle noise, thus aligning with the “precision-first” requirement of emergency response under “triple-disruption” constraints, where every rescue sortie must be based on credible evidence.
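The effect of the penalty ratio can be illustrated with a simple linear cost model. This is an assumption made for illustration: risk is taken as C_FP · FP + C_FN · FN, whereas the paper's exact risk definition appears in an earlier section and may differ. The pixel counts for the two models below are invented for the example and are not results from Table 9.

```python
def risk(fp, fn, c_fp, c_fn=1.0):
    """Linear cost-sensitive risk: weighted sum of false positives and misses."""
    return c_fp * fp + c_fn * fn

# Hypothetical error counts, not taken from the paper.
models = {
    "precision-oriented": {"fp": 100, "fn": 500},
    "recall-oriented": {"fp": 300, "fn": 250},
}

def best_model(c_fp, c_fn=1.0):
    """Name of the model with the lowest risk at a given penalty ratio."""
    return min(models, key=lambda name: risk(**models[name], c_fp=c_fp, c_fn=c_fn))

for ratio in (1.0, 2.0, 3.0):
    print(f"C_FP:C_FN = {ratio:.0f}:1 -> {best_model(ratio)}")
```

In this toy setting the recall-oriented model has the lower risk at 1:1 (550 vs. 600), but the precision-oriented model wins at 2:1 (700 vs. 850) and 3:1 (800 vs. 1150), showing how a model with fewer false alarms is increasingly favoured as false positives become more expensive.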