1. Introduction
Mangroves possess remarkable carbon sequestration capacity and provide unique ecological services, but their sustainability has been severely challenged by extensive human activities and environmental changes. Mangrove ecosystems have experienced extensive degradation, as large areas in Southeast Asia have been cleared for aquaculture and coastal development. In regions such as West Africa and South America, pollution and rising sea levels driven by climate change have further accelerated their decline, with research showing that global mangrove cover has declined by nearly one-third over the past fifty years. These challenges highlight the urgency of effective monitoring and conservation of mangroves.
Compared with traditional machine learning approaches, deep neural networks (DNNs) exhibit superior feature extraction capabilities and higher object recognition accuracy. Consequently, DNN-based remote sensing technologies have emerged as a promising means for large-scale and long-term monitoring of mangrove ecosystems, facilitating timely and accurate assessments of their spatial distribution and ecological status. Nevertheless, DNNs still suffer from several inherent limitations. The lack of interpretability obscures the understanding of internal decision-making mechanisms [1], while the performance of DNNs remains highly sensitive to external perturbations [2,3]. In mangrove monitoring tasks, the inevitable noise arising from atmospheric interference, sensor imperfections, and environmental variability can blur canopy boundaries and fine structural details, thereby amplifying model vulnerability. Furthermore, the spectral similarity between mangroves and other vegetation, as well as their spatial overlap with water bodies, increases the likelihood of misclassification. Without adequate robustness, DNNs are prone to overfitting noisy data and losing generalization capability, which undermines the reliability of ecological monitoring and assessment. Therefore, enhancing the noise resistance of DNN-based models is essential for achieving stable and practical mangrove detection in real-world remote sensing applications.
Fortunately, the neural ordinary differential equation (NODE) [4,5] offers a new research approach for improving the robustness of DNNs. NODE characterizes DNNs from the perspective of dynamical systems, extending the discrete stacking paradigm of traditional DNNs to a continuous dynamical evolution process. Unlike DNNs with explicit layer structures, NODE implicitly establishes a mapping between input and output, which has been observed to yield higher nonlinearity, clearer dynamical characteristics, and stronger fitting capabilities. By evolving feature embeddings along the NODE trajectory, NODE mitigates the impact of input noise or perturbations and thereby demonstrates enhanced robustness. In addition, unlike the black-box intermediate hidden layers of a traditional DNN, the evolution trajectory of the NODE internal state reflects the trend of feature changes.
However, employing a full NODE architecture often incurs substantial computational overhead and slows inference. This is because continuous-time modeling requires the ODE solver to call the neural network repeatedly, while training gradients are computed through the computationally expensive adjoint method. To balance robustness and efficiency, we selectively incorporate NODE into key components of the network to perform dynamic modeling over time, as shown in Figure 1. Specifically, we use NODE to implement the original convolution branches of atrous spatial pyramid pooling (ASPP) [6]. Furthermore, we draw on the modeling idea of NODE to transform the classic Squeeze-and-Excitation block (SE-Block) [7], proposing the NODE-SE-Block and embedding it into the segmentation network.
In the complex field monitoring scenarios of mangroves, remote sensing models must not only defend against adversarial human intervention but also cope with random natural environmental degradation. This study reveals a key synergistic mechanism: adversarial training [8] using known human-induced perturbation samples can significantly improve the robustness of the NODE architecture to non-specific environmental noise. We call this mechanism Synergistic Adversarial Training (SAT), as shown in Figure 2. The underlying mechanism is that adversarial training effectively regularizes the model's continuous vector field, thereby constructing a more stable feature evolution trajectory.
By optimizing the network against extremal directional perturbations, the training process enforces a smoother decision manifold that satisfies a higher degree of Lipschitz continuity. This rigorous hardening steers the model away from volatile, non-robust features and toward the invariant structural essence of the mangrove canopy. Consequently, while natural interferences like atmospheric haze or sensor stripes differ in origin from adversarial attacks, they fall within the robust subspace established during the defense against FGSM. This cross-robustness effect ensures that the proposed methodology remains dependable in complex ecological environments, bridging the gap between theoretical security and real-world reliability.
In summary, the main contributions of this paper are as follows:
We developed and integrated diverse NODE modules into critical network components, including a novel plug-and-play NODE-SE-Block that leverages continuous dynamics within channel attention mechanisms to adaptively optimize feature weights and bolster model resilience.
We propose an adversarial training framework that uses a deliberate malicious attack as the benchmark for maximum perturbation. By forcing the model to remain invariant under these extreme gradients, the framework stabilizes feature trajectories and smooths the decision manifold, effectively extending robustness from deliberate attacks to natural noise.
Extensive experiments on mangrove remote sensing datasets demonstrate that the proposed method significantly enhances robustness against human-induced adverse disturbances and natural environmental degradation.
2. Related Works
Remote sensing monitoring of mangroves is crucial for ecological conservation and environmental assessment, and semantic segmentation using DNNs plays a central role in this field. However, in practical applications, noise interference and environmental variations pose significant challenges to model stability and robustness. The following subsections review current research methods and techniques for improving model robustness in mangrove remote sensing tasks.
2.1. Remote Sensing of Mangrove
Mangrove mapping and semantic segmentation have been studied across a range of sensor types and methodological paradigms. Early reviews summarized the advantages and limitations of optical, hyperspectral and SAR sensors for mangrove monitoring and emphasized challenges including mixed pixels, tidal inundation and species-level confusion [9]. Recent research on DNNs has focused on pixel-level segmentation using U-Net [10] and its hybrids [11] on multisource high-resolution imagery, showing marked gains over traditional pixel- or object-based classifiers when sufficient labeled data are available. Concurrently, large-scale benchmark efforts and dataset releases [12,13] have enabled systematic comparisons of various architectures for global-scale mangrove mapping [14]. In addition to improving the accuracy of semantic segmentation, related robustness research is also valuable.
In recent years, the robustness of remote sensing semantic segmentation models against noise and perturbations has attracted increasing attention, although systematic studies specifically targeting mangrove ecosystems remain scarce [15]. Existing studies mainly focus on reducing the impact of common disturbances such as clouds, shadows, and tidal variations, often through preprocessing [16], spectral unmixing [17], or multi-temporal data fusion [18]. With the development of deep learning, several works introduced adversarial training [19,20] and data augmentation strategies [21] to improve segmentation robustness against synthetic noise or perturbations. A recent method specifically targets robustness and boundary precision in complex tidal and mixed-background scenes by combining local CNN feature extractors with transformer-style global context modules or dual-backbone fusion designs, and reports state-of-the-art gains on regional benchmarks [22].
2.2. Sensitivity of Deep Learning Networks
Although deep neural networks have achieved remarkable success in computer vision and remote sensing tasks, their sensitivity to perturbations remains a significant challenge. Existing studies have shown that even small perturbations may noticeably affect model predictions and reduce robustness in real-world environments.
Recent research has demonstrated this vulnerability across different application scenarios. For example, Ref. [23] showed that stochastic computing architectures may weaken traditional defense mechanisms and allow adversarial perturbations to bypass protection strategies. In medical imaging, Ref. [24] proposed a semi-supervised generative adversarial framework for myocarditis diagnosis and highlighted the sensitivity of deep learning models to perturbation distributions and optimization instability. In addition, Ref. [25] demonstrated that even low-epsilon perturbations can successfully mislead online image stream classifiers.
These studies indicate that deep learning models are generally sensitive to adversarial perturbations and environmental variations, highlighting the importance of robustness research for remote sensing semantic segmentation models in practical mangrove monitoring applications.
2.3. Neural Ordinary Differential Equations
In recent studies, NODEs have been proposed as a continuous form of deep networks. The basic idea is to regard the residual block as a discrete solver of an ODE, define the state evolution equation, and use a numerical integrator to compute the network output [5]. NODE has been gradually introduced into computer vision tasks [26,27], where the traditional discrete hierarchy is replaced with continuous-time modeling to achieve more flexible feature evolution and higher robustness. In image classification, NODE has been used to replace the convolutional layer in the residual network, reducing the number of parameters while maintaining or even improving performance [28]. Existing studies have combined NODE with Gaussian processes to enhance model robustness and uncertainty modeling, while also incorporating numerical methods to improve their performance against adversarial attacks and out-of-distribution samples [29]. These works show that the ODE architecture has broad application potential in visual tasks.
2.4. Defense Mechanisms Against Perturbations
Adversarial training has been established as a primary paradigm for empirical defense. It is the most widely used empirical defense and improves robustness by jointly training on clean and adversarial examples. Early studies employed single-step FGSM [30] for inner maximization, achieving robustness against simple attacks. However, later works showed that models trained with single-step methods remain vulnerable to stronger multi-step attacks. Madry et al. [31] formalized multi-step PGD-based adversarial training as a strong baseline against first-order adversaries. To reduce computational cost, efficient variants such as Free Adversarial Training [32] and Fast Adversarial Training [33] were proposed. Ensemble Adversarial Training [34] further enhances robustness by leveraging adversarial examples transferred from multiple models.
Regularization-based methods enhance model robustness by imposing objective-level constraints. Beyond adversarial training, these defenses improve robustness by constraining the model's local sensitivity. Input gradient regularization penalizes the norm of the loss gradient with respect to the input, $\|\nabla_x \mathcal{L}\|$, to smooth responses to small perturbations [35]. Adversarial Logit Pairing (ALP) [36] enforces consistency between clean and adversarial logits, while TRADES [37] explicitly formulates the accuracy–robustness tradeoff through a decoupled training objective. Related approaches, including Jacobian regularization, label smoothing [38], and margin-based training [39], further promote local linearity and confidence calibration.
Input-transformation defenses use randomization and preprocessing to mitigate adversarial effects by disrupting gradient-based attacks [40,41]. Typical methods include random resizing or padding, noise injection, and preprocessing operations such as JPEG compression and denoising. While effective against simple adversaries, these techniques may suffer from gradient masking when used alone, leading to overestimated robustness without careful evaluation.
Architecture-level robustness focuses on suppressing perturbation amplification within network structures through structural and dynamical design. Lipschitz constraints [42] bound layer-wise sensitivity, while structural designs such as feature denoising [43], non-local blocks [44], and normalization or residual connections improve stability during adversarial training. More recently, Neural ODEs model forward propagation as continuous dynamical systems, implicitly encouraging smoothness and stability [45]. Existing analyses on stability and perturbation propagation support their robustness potential, which aligns with the design motivation of our proposed ODE-based architecture.
In summary, existing research on mangrove remote sensing has achieved remarkable progress in segmentation accuracy through advances in deep learning and multisource data integration. However, the robustness of these models under real-world noise, environmental disturbances, and adversarial perturbations remains underexplored.
NODE offers inherent stability advantages by modeling feature evolution as a continuous dynamic process [4,29]. This property suggests that NODE-based architectures can naturally resist small input perturbations and suppress gradient amplification during propagation.
Motivated by these insights, we propose a NODE-based mangrove remote sensing semantic segmentation framework. To further enhance robustness, we integrate a collaborative adversarial training strategy that jointly optimizes clean and perturbed samples. The combination of continuous-time feature modeling and adversarial supervision enables the proposed framework to achieve improved robustness against both human-induced perturbations and natural image degradation, while maintaining competitive segmentation accuracy and computational efficiency. As a result, our method offers a robust and practical solution for real-world mangrove segmentation applications.
3. Methodology
Traditional deep networks typically transform features layer by layer through discretely stacked network layers, meaning that feature representations evolve in a discrete form along the network depth direction. While this design has been successful in practice, the abrupt mapping between discrete layers can easily lead to feature discontinuities, gradient amplification, and sensitivity to input perturbations, especially in the presence of noise or adversarial perturbations.
To alleviate the aforementioned problems, NODEs reinterpret deep networks as continuous-time dynamical systems. Specifically, given input features $h(0)$, their evolution within the network is no longer defined by a series of discrete layers, but is modeled through the following ordinary differential equation:
$$\frac{dh(t)}{dt} = f\big(h(t),\, t,\, \theta\big),$$
where $f$ represents the dynamic function parameterized by the neural network, $\theta$ is a learnable parameter, and $t$ represents a continuous depth or time variable.
In this modeling approach, the feature representation evolves continuously along the time dimension, and the network output $h(M)$ is obtained by numerically integrating the above differential equation over the interval $[0, M]$, where $M$ represents the length of continuous depth or time.
From the perspective of representation learning, NODEs replace the traditional discrete layer stacking with a continuous evolution process, making feature updates smoother and effectively suppressing abrupt amplification and the accumulation of high-frequency noise. At the same time, since feature evolution is constrained by differential equations, its dynamic behavior is easier to analyze in theory and naturally possesses a certain degree of stability and Lipschitz continuity, which provides favorable conditions for improving the robustness of the model.
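The relation between discrete residual stacking and the continuous formulation can be illustrated with a forward Euler discretization. The sketch below writes $h(t)$ for the feature state, $f$ for the learned dynamics with parameters $\theta$, and $M$ for the integration length; the step count $N$, the step size $\Delta t = M/N$, and the grid points $t_k$ are notation introduced here for illustration only:

```latex
\begin{aligned}
h(t_{k+1}) &\approx h(t_k) + \Delta t \, f\big(h(t_k),\, t_k,\, \theta\big),
  \qquad t_k = k\,\Delta t,\quad k = 0, 1, \dots, N-1, \\
h(M) &= h(0) + \int_{0}^{M} f\big(h(t),\, t,\, \theta\big)\, dt .
\end{aligned}
```

With $\Delta t = 1$, the first line has the form of a residual update $h_{k+1} = h_k + f(h_k)$, so a stack of $N$ residual blocks can be read as $N$ Euler steps. Refining the grid approaches the integral in the second line, provided $f$ is Lipschitz continuous in $h$.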
To comprehensively enhance the robustness and continuity of feature learning, we introduce NODE modeling into three representative components of the encoder: Vanilla Convolution, Dilated Convolution, and SE Convolution, corresponding respectively to local feature extraction, multi-scale contextual representation, and channel-wise feature recalibration.
3.1. Vanilla Convolution with NODE
Conventional convolution gradually increases the receptive field by stacking discrete layers, initially capturing only local information. NODE Convolution (NODE Conv), on the other hand, models the convolution operation as a dynamical system in continuous time, representing the evolution of features over time.
We apply NODE Conv to the bottleneck layer of the network, which is named BottleneckNODE, for compression, learning, and recovery of image features, as shown in Figure 3. In BottleneckNODE, channel compression is first performed using a $1 \times 1$ convolution. Subsequently, a NODE Conv layer is used to model continuous dynamic feature changes. Finally, a $1 \times 1$ convolution is performed to restore the number of channels. In the NODE Conv part, the evolution of the feature map $h(t)$ is described by the ordinary differential equation above, and $f$ represents a nonlinear transformation consisting of convolution, batch normalization, and ReLU activation.
The output features are obtained by solving the NODE and integrating it over the time interval $[0, M]$. NODE Conv aggregates richer multi-scale contextual information within a single convolution branch, thereby better capturing long-range dependencies.
Since intermediate layers contain a large number of feature representations, directly applying NODE modeling throughout the entire encoder would incur prohibitive computational overhead. To strike a balance between efficiency and robustness, we adopt a hierarchical design in which standard convolutional layers are first employed to efficiently extract low-level and mid-level image features. NODE Conv is then introduced only at the final stage of the encoder to integrate multi-scale contextual information and capture long-range dependencies.
By modeling feature transformation as a smooth dynamical system, NODE-based convolution enhances robustness against noise and perturbations while preserving expressive power. Following this design principle, all NODE-enhanced modules in our framework are applied after sufficient feature extraction by conventional convolutions, and the same strategy is consistently adopted throughout the network. Therefore, the subsequent NODE-based modules share an identical adaptation philosophy and are not repeatedly elaborated in later sections.
3.2. Dilated Convolution with NODE
Dilated Convolution can effectively enlarge the receptive field without increasing the number of parameters or computational cost, thereby incorporating richer contextual information during feature extraction. The process of dilated convolution can also be dynamically modeled by a NODE, as shown in Figure 4.
The application of dilated convolution in the ASPP structure reflects its advantages. In the conventional ASPP structure, atrous convolution branches with different dilation rates can effectively capture multi-scale contextual information. However, the convolution operation is essentially a discrete feature mapping, which makes it difficult to fully model the continuous evolution of features across scales. To address this limitation, we introduce NODE into ASPP for dynamic modeling, namely NODE-ASPP, to achieve smoother and more robust multi-scale feature representation.
Compared to the discrete convolution operation of standard ASPP, the state change process of each dilation branch can be represented as:
$$h_i(M) = h_i(0) + \int_{0}^{M} f_{r_i}\big(h_i(t),\, t,\, \theta_i\big)\, dt, \qquad i = 1, \dots, n.$$
In each branch, we define a dynamic system based on the dilation rate $r_i$:
$$\frac{dh_i(t)}{dt} = f_{r_i}\big(h_i(t),\, t,\, \theta_i\big).$$
Finally, the output features of all integrated branches are concatenated, channel-compressed, and fused using a 1 × 1 convolution to obtain the feature representation of NODE-ASPP-Block:
$$Y = \mathrm{Conv}_{1 \times 1}\Big(\mathrm{Concat}\big(h_1(M), \dots, h_n(M)\big)\Big).$$
NODE-ASPP-Block enables continuous convolution evolution of each dilation branch, achieving stronger multi-scale feature modeling and robustness through a continuously changing multi-scale receptive field.
3.3. SE Convolution with NODE
In traditional channel-wise attention mechanisms, channel weights are typically generated by a two-layer fully connected network, lacking dynamics and continuity. To address this, we designed the NODE-SE-Block, which introduces NODE modeling to enable attention weights to vary continuously as input features evolve, as shown in Figure 5.
Similar to the traditional approach, the input features $X \in \mathbb{R}^{C \times H \times W}$ are transformed into a channel descriptor vector through global average pooling:
$$z_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j), \qquad c = 1, \dots, C.$$
We introduce a dynamic modeling strategy by formulating the channel descriptor as the initial state of an ordinary differential equation:
$$\frac{dh(t)}{dt} = g\big(h(t),\, t,\, \theta_g\big), \qquad h(0) = z.$$
Subsequently, the hidden representation of the channel descriptor evolves over time. The attention weight representation for each channel can be obtained by integrating the corresponding differential equation:
$$h(M) = h(0) + \int_{0}^{M} g\big(h(t),\, t,\, \theta_g\big)\, dt.$$
The function $g$ denotes the parameterized dynamics governing channel-wise feature evolution. In practice, $g$ is implemented as a lightweight two-layer fully connected network with an intermediate nonlinear activation, which enables flexible yet stable modeling of dynamic channel interactions while introducing negligible computational overhead.
Finally, the feature weights of each channel are reassembled in channel order and a nonlinear activation function $\sigma$ is applied to obtain the dynamic channel attention weights:
$$s = \sigma\big(h(M)\big).$$
The obtained attention weights are used to rescale the original features:
$$\tilde{X}_c = s_c \cdot X_c, \qquad c = 1, \dots, C.$$
This design enables the generation of dynamic channel attention, where the attention weights are continuously evolved via NODE modeling rather than being fixed by static parameters.
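The squeeze, evolve, gate, and rescale steps described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the tanh hidden activation, the sigmoid gate, the forward Euler integrator, the integration length of 1.0, and all weights and helper names (`global_avg_pool`, `make_dynamics`, `node_se`) are assumptions made for the example.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def global_avg_pool(x: FeatureMap) -> Vector:
    """Squeeze: one scalar descriptor per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def matvec(m: List[Vector], v: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def make_dynamics(w1: List[Vector], w2: List[Vector]) -> Callable[[Vector, float], Vector]:
    """Two-layer fully connected dynamics with a tanh in between (assumed)."""
    def dynamics(h: Vector, t: float) -> Vector:
        hidden = [math.tanh(v) for v in matvec(w1, h)]
        return matvec(w2, hidden)
    return dynamics


def integrate_euler(f, h0: Vector, t_end: float, steps: int) -> Vector:
    """Forward Euler integration of dh/dt = f(h, t) from 0 to t_end."""
    h, dt = list(h0), t_end / steps
    for k in range(steps):
        dh = f(h, k * dt)
        h = [hi + dt * di for hi, di in zip(h, dh)]
    return h


def node_se(x: FeatureMap, f, t_end: float = 1.0, steps: int = 4) -> Tuple[FeatureMap, Vector]:
    """Return the rescaled feature map and the channel attention weights."""
    z = global_avg_pool(x)                            # initial state h(0)
    h_end = integrate_euler(f, z, t_end, steps)       # evolved descriptor
    s = [1.0 / (1.0 + math.exp(-v)) for v in h_end]   # sigmoid gate (assumed)
    y = [[[s[c] * v for v in row] for row in ch] for c, ch in enumerate(x)]
    return y, s


# Toy input: 2 channels of size 2 x 2, with hypothetical weights.
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 2.0]]]
f = make_dynamics(w1=[[0.5, -0.5]], w2=[[0.3], [-0.2]])
y, s = node_se(x, f)
```

Because the gate is computed from an integrated trajectory rather than a single feed-forward pass, a small change in the pooled descriptor moves the attention weights along a continuous path, which is the property the NODE-SE-Block is designed to exploit.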
3.4. Synergistic Adversarial Training
In complex mangrove field monitoring scenarios, remote sensing segmentation models are exposed not only to human-induced perturbations but also to diverse and unpredictable natural degradations, such as atmospheric haze, sensor noise, and stripe artifacts. These natural disturbances are typically stochastic and difficult to model explicitly, which poses a significant challenge for robust deployment in real-world ecological environments.
To address this issue, we reveal a key synergistic mechanism between adversarial training and NODE architectures. Specifically, adversarial training using artificial perturbations can significantly enhance the robustness of NODE-based models against non-adversarial environmental noise. We refer to this mechanism as Synergistic Adversarial Training (SAT).
3.4.1. Robust Optimization Perspective
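To formalize the proposed synergistic mechanism, let $F_\theta: \mathcal{X} \rightarrow \mathcal{Y}$ denote the remote sensing segmentation model parameterized by $\theta$, where $\mathcal{X}$ and $\mathcal{Y}$ represent the input image space and the semantic label space, respectively. Let $\mathcal{L}$ denote the task-specific loss function, such as cross-entropy. Standard empirical risk minimization (ERM) optimizes the model by minimizing the expected loss over the training distribution $\mathcal{D}$:
$$\min_{\theta}\; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \mathcal{L}\big(F_\theta(x),\, y\big) \Big],$$
where $(x, y)$ are image-label pairs sampled from $\mathcal{D}$.
SAT extends this framework into a robust min-max optimization problem, which can be viewed as a zero-sum game between the model and an adversary:
$$\min_{\theta}\; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\delta \in \mathcal{S}} \mathcal{L}\big(F_\theta(x + \delta),\, y\big) \Big],$$
where $\delta$ represents a bounded additive perturbation constrained within an $\ell_p$-norm ball $\mathcal{S} = \{\delta : \|\delta\|_p \le \epsilon\}$ with a radius $\epsilon$.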
Crucially, in the context of mangrove monitoring, any natural environmental degradation $\delta_{\mathrm{nat}}$ that satisfies $\|\delta_{\mathrm{nat}}\|_p \le \epsilon$ is contained within the same perturbation set. Consequently, by explicitly optimizing against the most damaging adversarial directions $\delta^{*}$, the model implicitly develops a robustness margin that covers stochastic and non-adversarial noise encountered in real-world ecological scenarios.
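The containment argument can be written out in one line. The notation here is assumed for illustration: $F_\theta$ is the segmentation model, $\mathcal{L}$ the loss, $\epsilon$ the perturbation radius, and $\delta_{\mathrm{nat}}$ a natural degradation expressed as an additive perturbation.

```latex
\|\delta_{\mathrm{nat}}\|_p \le \epsilon
\;\Longrightarrow\;
\mathcal{L}\big(F_\theta(x + \delta_{\mathrm{nat}}),\, y\big)
\;\le\;
\max_{\|\delta\|_p \le \epsilon} \mathcal{L}\big(F_\theta(x + \delta),\, y\big).
```

Taking the expectation over $(x, y)$ shows that the risk under such natural degradations is bounded by the robust risk that the min-max objective minimizes. Two caveats apply: the bound says nothing about degradations whose norm exceeds $\epsilon$, and practical attacks such as FGSM or PGD only approximate the inner maximum from below, so the guarantee is empirical rather than exact.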
3.4.2. Synergy with Continuous Dynamics
In NODE architectures, the feature evolution is conceptualized as a continuous-time dynamical system defined by:
$$\frac{dh(t)}{dt} = f\big(h(t),\, t,\, \theta\big),$$
where $h(t)$ denotes the hidden state at time $t$. When adversarial training is integrated with NODE-based encoders, the optimization process acts as a form of implicit regularization on the continuous vector field $f$. By forcing the model to remain invariant under extreme input perturbations, the learned dynamics are encouraged to satisfy a Lipschitz condition with a lower Lipschitz constant $K$:
$$\big\| f(h_1, t, \theta) - f(h_2, t, \theta) \big\| \le K\, \| h_1 - h_2 \|,$$
where $h_1$ and $h_2$ represent hidden state representations. A smaller $K$ indicates that the vector field is smoother, which effectively suppresses the amplification of perturbations during the integration process. As a result, the NODE-based encoder captures the invariant structural characteristics of mangrove canopies rather than fragile, noise-sensitive patterns, resulting in more reliable feature trajectories $h(t)$ across varying environmental conditions.
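The role of the Lipschitz constant can be made concrete with Grönwall's inequality. As a sketch, assume two trajectories $h_1(t)$ and $h_2(t)$ of the same vector field $f$, started from a clean input $x$ and a perturbed input $x + \delta$, and assume $f$ is $K$-Lipschitz in the state over the integration interval $[0, M]$ (the symbols $\Delta(t)$ and $M$ are notation for this illustration):

```latex
\begin{aligned}
\Delta(t) &:= h_1(t) - h_2(t), \qquad \|\Delta(0)\| = \|\delta\|, \\
\|\Delta(t)\| &\le \|\delta\| + \int_{0}^{t} \big\| f(h_1(s), s, \theta) - f(h_2(s), s, \theta) \big\| \, ds
  \;\le\; \|\delta\| + K \int_{0}^{t} \|\Delta(s)\| \, ds, \\
\|\Delta(M)\| &\le e^{K M}\, \|\delta\| .
\end{aligned}
```

The worst-case amplification of an input perturbation therefore grows exponentially in $K$, which is why even a moderate reduction of $K$ during training can tighten the bound on the output deviation considerably.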
Although adversarial perturbations and natural environmental degradations arise from different mechanisms, both can be viewed as bounded perturbations in the input space. By explicitly optimizing the model with respect to worst-case adversarial directions, SAT enlarges the robustness margin around clean samples. Natural perturbations, which typically correspond to non-optimal directions within this margin, are therefore effectively handled by the trained model. For generating adversarial examples, we selected three representative methods as noise sources: Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and Momentum Iterative FGSM (MI-FGSM). By combining these three methods, we can cover attacks ranging from single-step fast attacks to multi-step, more powerful attacks with momentum, thereby improving the model’s generalization ability to different perturbation patterns.
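The three attack families can be compared on a toy problem. The sketch below applies the standard FGSM, PGD, and MI-FGSM update rules under an $\ell_\infty$ budget to a hypothetical logistic model whose input gradient is available in closed form; the weights, the sample, the budget `eps`, and the step size `alpha` are illustrative assumptions, and PGD is shown without a random start for brevity. In the segmentation setting the gradient would come from automatic differentiation instead.

```python
import math
from typing import List

Vector = List[float]

W: Vector = [1.5, -2.0, 0.5]  # toy model weights (hypothetical)
B = 0.1                       # toy model bias (hypothetical)


def sign(v: float) -> float:
    return float((v > 0) - (v < 0))


def predict(x: Vector) -> float:
    """Probability of the positive class under a logistic model."""
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))


def loss(x: Vector, y: int) -> float:
    """Binary cross-entropy for a single sample."""
    p = predict(x)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def grad_x(x: Vector, y: int) -> Vector:
    """Analytic gradient of the loss with respect to the input."""
    p = predict(x)
    return [(p - y) * w for w in W]


def clip_to_ball(x_adv: Vector, x: Vector, eps: float) -> Vector:
    """Project onto the l-infinity ball of radius eps around x."""
    return [min(max(a, xi - eps), xi + eps) for a, xi in zip(x_adv, x)]


def fgsm(x: Vector, y: int, eps: float) -> Vector:
    g = grad_x(x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def pgd(x: Vector, y: int, eps: float, alpha: float, steps: int) -> Vector:
    x_adv = list(x)
    for _ in range(steps):
        g = grad_x(x_adv, y)
        x_adv = [a + alpha * sign(gi) for a, gi in zip(x_adv, g)]
        x_adv = clip_to_ball(x_adv, x, eps)
    return x_adv


def mi_fgsm(x: Vector, y: int, eps: float, alpha: float, steps: int, mu: float = 1.0) -> Vector:
    x_adv, momentum = list(x), [0.0] * len(x)
    for _ in range(steps):
        g = grad_x(x_adv, y)
        l1 = sum(abs(gi) for gi in g) or 1.0  # avoid division by zero
        momentum = [mu * m + gi / l1 for m, gi in zip(momentum, g)]
        x_adv = [a + alpha * sign(m) for a, m in zip(x_adv, momentum)]
        x_adv = clip_to_ball(x_adv, x, eps)
    return x_adv


x0, y0 = [0.2, -0.1, 0.4], 1
eps, alpha, steps = 0.05, 0.01, 10  # eps and alpha are placeholders
adv = {
    "FGSM": fgsm(x0, y0, eps),
    "PGD": pgd(x0, y0, eps, alpha, steps),
    "MI-FGSM": mi_fgsm(x0, y0, eps, alpha, steps),
}
```

All three perturbed inputs stay inside the same $\epsilon$-ball; the attacks differ only in how they search that ball for a high-loss point, which is what makes them interchangeable noise sources during training.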
4. Experimental Section
4.1. Dataset
This experiment utilized a publicly available mangrove segmentation dataset (https://aistudio.baidu.com/datasetdetail/74400/0, accessed on 16 March 2026), which includes high-resolution remote sensing images paired with their corresponding pixel-level annotations. The dataset contains a total of 10,000 samples, evenly distributed with 5000 positive samples and 5000 negative samples. The dataset provides four-channel RGBA images, with additional channels representing the spectral characteristics and vegetation index of mangroves. The dataset covers complex scenes such as coastal zones, water bodies, farmland, and woodlands, and features blurred class boundaries and strong background interference, providing a good benchmark for research on intelligent mangrove identification and robust segmentation methods.
The annotations in the dataset use different pixel values to distinguish mangrove from non-mangrove areas. In this experiment, the dataset was divided into training, validation, and test sets in an 8:1:1 ratio, used for model training, hyperparameter tuning, and performance evaluation, respectively.
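An 8:1:1 partition of this kind can be sketched as follows. The shuffling, the fixed seed, and the function name `split_8_1_1` are assumptions for illustration; the text does not state how samples were assigned to each subset.

```python
import random
from typing import List, Sequence, Tuple


def split_8_1_1(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle once, then cut into 80% / 10% / 10% partitions."""
    order = list(items)
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    n_train = len(order) * 8 // 10      # integer arithmetic avoids rounding drift
    n_val = len(order) // 10
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test


# Sample indices stand in for the 10,000 image/annotation pairs.
train_ids, val_ids, test_ids = split_8_1_1(range(10000))
```

Splitting indices rather than files keeps each image paired with its annotation and makes the partition easy to store alongside the experiment configuration.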
4.2. Evaluation Indicators
In this study, we use mean Intersection over Union (mIoU) as the main segmentation performance evaluation metric. Specifically, mIoU is obtained by computing the intersection over union (IoU) between the predicted result and the ground-truth annotated region for each category and averaging over all categories. It is defined as follows:
$$\mathrm{mIoU} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FP_k + FN_k},$$
where $TP_k$, $FP_k$, and $FN_k$ represent the number of true positive, false positive, and false negative pixels of the $k$th class, respectively, and $K$ is the total number of classes.
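The metric can be computed directly from per-pixel labels. The following is a minimal sketch on flattened label sequences; the function name `mean_iou`, the label coding (0 for non-mangrove, 1 for mangrove), and the convention of skipping classes absent from both prediction and ground truth are assumptions of this example.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes from flattened per-pixel label sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for k in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        fp = sum(1 for p, t in zip(pred, target) if p == k and t != k)
        fn = sum(1 for p, t in zip(pred, target) if p != k and t == k)
        denom = tp + fp + fn
        if denom == 0:
            # Class absent from both prediction and ground truth: skip it
            # (an assumption; other conventions count it as IoU = 1).
            continue
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Six pixels: class 0 = non-mangrove, class 1 = mangrove (coding assumed).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case each class has two true positives, one false positive, and one false negative, so both per-class IoU values are 2/4 and the mean is 0.5.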
4.3. Implementation Details
In our experiments, adversarial training is employed to evaluate the robustness of the proposed model in the mangrove segmentation task. The overall pipeline is as follows.
First, the input image is resized and fed into the residual encoder to extract both deep and shallow feature representations. These features are subsequently refined through multiple enhancement modules, fused with shallow features, and finally passed to the decoder to generate the segmentation output.
To further improve robustness, adversarial examples are constructed online during training. Specifically, perturbations are generated based on the current model using three representative attack methods, namely FGSM, MI-FGSM, and PGD, and then added to the original images. For the experiments on human-induced adversarial perturbations, adversarial examples are generated using a single, fixed attack method throughout the training process. Conversely, to build generalized robustness against natural perturbations, we adopt a random per-batch approach in which the attack method is drawn at random for each batch. Clean samples and their corresponding adversarial counterparts are combined in a strict 1:1 ratio to form each training batch. The same perturbation budget is used for all attacks. For MI-FGSM and PGD, the number of iterations is set to 10 with a fixed step size. During training, the model is jointly supervised using both clean and adversarial samples, thereby improving its robustness against noise and adversarial perturbations.
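The per-batch attack selection and the 1:1 mixing of clean and adversarial samples can be sketched as follows. This is an illustration under stated assumptions: the helper names (`build_mixed_batch`, `make_stub`) are hypothetical, the attack functions are stubs that shift pixel values so the example runs without a model, and `EPS` is a placeholder rather than the budget used in the experiments.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[float]                     # flattened image, for illustration
Sample = Tuple[Image, int]              # (image, label)
Attack = Callable[[Image, int], Image]  # returns a perturbed image


def build_mixed_batch(
    clean: Sequence[Sample],
    attacks: Dict[str, Attack],
    rng: random.Random,
) -> Tuple[List[Sample], str]:
    """Return the clean samples plus one adversarial copy of each (1:1)."""
    name = rng.choice(sorted(attacks))  # one attack drawn per batch
    attack = attacks[name]
    adversarial = [(attack(x, y), y) for x, y in clean]
    return list(clean) + adversarial, name


EPS = 0.03  # placeholder budget, not the value used in the paper


def make_stub(direction: float) -> Attack:
    # Stand-in for a gradient-based attack: shifts every pixel by
    # direction * EPS so that the sketch is self-contained.
    return lambda x, y: [v + direction * EPS for v in x]


attacks = {
    "FGSM": make_stub(1.0),
    "MI-FGSM": make_stub(-1.0),
    "PGD": make_stub(1.0),
}
clean_batch = [([0.1, 0.2], 0), ([0.5, 0.4], 1)]
batch, chosen = build_mixed_batch(clean_batch, attacks, random.Random(0))
```

For the single-attack experiments, the same function can be called with a dictionary that holds only one entry, so both training modes share one code path.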
In addition, the proposed architecture integrates an ODE-based feature evolution module within the encoder. The continuous dynamics are numerically solved using a fixed-step fourth-order Runge–Kutta (RK4) solver with a single-step discretization of the integration interval. This results in a constant number of function evaluations (NFE = 4) for each forward pass.
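The count of four function evaluations follows directly from the structure of the classical RK4 scheme. The sketch below performs one RK4 step and counts calls to the dynamics; the toy dynamics $dh/dt = -h$, the interval length of 1.0, and the names `rk4_step` and `CountingDynamics` are assumptions for illustration, standing in for the learned network and the actual integration interval.

```python
from typing import Callable, List

State = List[float]


def rk4_step(f: Callable[[State, float], State], h: State, t: float, dt: float) -> State:
    """Advance the state h by one classical RK4 step of size dt."""
    def axpy(a: float, x: State, y: State) -> State:
        # Returns y + a * x, element-wise.
        return [yi + a * xi for xi, yi in zip(x, y)]

    k1 = f(h, t)
    k2 = f(axpy(dt / 2, k1, h), t + dt / 2)
    k3 = f(axpy(dt / 2, k2, h), t + dt / 2)
    k4 = f(axpy(dt, k3, h), t + dt)
    return [
        hi + dt / 6 * (a + 2 * b + 2 * c + d)
        for hi, a, b, c, d in zip(h, k1, k2, k3, k4)
    ]


class CountingDynamics:
    """Wraps a dynamics function and counts how often it is evaluated."""

    def __init__(self, f: Callable[[State, float], State]) -> None:
        self.f = f
        self.nfe = 0

    def __call__(self, h: State, t: float) -> State:
        self.nfe += 1
        return self.f(h, t)


def decay(h: State, t: float) -> State:
    # Toy stand-in for the learned dynamics: dh/dt = -h.
    return [-x for x in h]


dynamics = CountingDynamics(decay)
h_end = rk4_step(dynamics, [1.0, 2.0], t=0.0, dt=1.0)
# dynamics.nfe is 4 after the single step.
```

Because the step count is fixed, the forward cost is known in advance, in contrast to adaptive solvers whose number of evaluations depends on the input.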
For optimization, the Adam optimizer is adopted, and the initial learning rate is gradually adjusted during training. The model is trained for up to 1000 epochs, and an early stopping strategy based on validation loss is applied to prevent overfitting once convergence is detected. Mini-batch training is used throughout the optimization process. All experiments are conducted on an NVIDIA GeForce TITAN RTX GPU. The implementation was developed using Python 3.10 and PyTorch 2.1.0, accelerated by CUDA 12.1 and cuDNN 8.9.
4.4. Comparisons
To provide a consistent experimental setting, we adopt a unified segmentation network composed of a ResNet 50 backbone and a shared decoder. All NODE-based improvements are incorporated into this framework by modifying the corresponding components, while keeping the overall architecture consistent.
NODE Conv: Convolutional layers are reformulated as NODE Conv operations for temporal feature modeling.
NODE ASPP: An ODE solver is introduced into the ASPP module.
NODE SE: Channel attention weights are dynamically modeled using NODE.
To ensure a fair comparison of robustness across different architectures, the same training strategy and hyperparameters are applied to several representative segmentation networks. Model performance is evaluated on a unified test set using multiple metrics, with the results reported in Table 1, comparing different networks under clean and perturbed inputs with varying attack strengths.
To quantitatively evaluate the robustness gains of our methodology, this section employs traditional discrete architectural modules as comparative baselines. Specifically, the standard Squeeze-and-Excitation block (SE) and the standard Atrous Spatial Pyramid Pooling module (ASPP) serve as the benchmark configurations against their respective NODE-enhanced counterparts. This experimental setup is designed to directly measure the performance improvement achieved by substituting discrete feature recalibration and spatial pooling with our proposed continuous-time NODE dynamics when the network is subjected to human-induced adversarial perturbations.
We further include DeepLabv3 as a state-of-the-art segmentation baseline for comparison. By leveraging atrous convolution and the ASPP module, DeepLabv3 is able to capture rich multi-scale contextual information without significantly increasing computational cost. This provides a strong and representative benchmark for evaluating the effectiveness of the proposed approach.
4.4.1. Analysis of Structural Robustness Metrics
To quantitatively evaluate the structural stability of the proposed modules, we measured the empirical local Lipschitz constant and Jacobian spectral norm using random input testing and power iteration methods. As shown in
Table 1, the evaluation reveals significant differences in how continuous-time modeling affects various network components.
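The two metrics can be estimated along the following lines. This is a minimal NumPy sketch under assumptions: the perturbation radius, the number of trials, and the function names are hypothetical, and the Jacobian is accessed through user-supplied products J v and J^T u rather than through any particular autograd API. Note that random testing only yields a lower bound on the true local Lipschitz constant.

```python
import numpy as np


def empirical_local_lipschitz(f, x, radius=1e-2, trials=256, seed=0):
    """Largest observed ||f(x + d) - f(x)|| / ||d|| over random d with ||d|| = radius."""
    rng = np.random.default_rng(seed)
    fx = f(x)
    best = 0.0
    for _ in range(trials):
        d = rng.normal(size=x.shape)
        d *= radius / np.linalg.norm(d)        # rescale to the chosen radius
        best = max(best, np.linalg.norm(f(x + d) - fx) / radius)
    return best


def jacobian_spectral_norm(jvp, vjp, dim, iters=100, seed=0):
    """Estimate the largest singular value of J by power iteration on J^T J.

    `jvp(v)` must return J @ v and `vjp(u)` must return J.T @ u.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = vjp(jvp(v))                        # one application of J^T J
        norm = np.linalg.norm(w)
        if norm == 0.0:
            return 0.0
        v = w / norm
    return float(np.linalg.norm(jvp(v)))
```

For a linear map f(x) = Ax the Jacobian is A itself, so the power iteration estimate should agree with the matrix 2-norm of A and the sampled Lipschitz ratio should never exceed it. This gives a simple sanity check before applying the same procedure to a network module.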
The results demonstrate that the NODE architecture effectively suppresses perturbation amplification in complex feature extraction modules. Notably, the discrete ASPP module is highly sensitive to perturbations and exhibits a Lipschitz constant of 141.49. By replacing it with NODE ASPP, this metric drops drastically to 6.86, alongside a similar reduction in the spectral norm from 28.78 to 4.30. Furthermore, the NODE SE module successfully reduces the Lipschitz constant from 123.66 for the standard SE module to 91.52. These substantial reductions indicate that modeling multi-scale spatial pooling and channel recalibration as continuous dynamic processes fundamentally stabilizes the feature mappings.
Conversely, compared with the relatively shallow and smooth discrete ResNet50 baseline with a metric of 7.52, the NODE Conv module produces a higher theoretical metric of 85.04. This increase mainly stems from the additional implicit computational depth introduced by the ordinary differential equation solver, which simultaneously enhances the non-linear representational capacity of the feature extraction process. Although the higher Lipschitz-related metric suggests increased dynamic complexity, subsequent experimental results demonstrate that the NODE Conv module can still maintain favorable robustness under adversarial perturbations when combined with the proposed synergistic adversarial training strategy.
4.4.2. Performance Under Human-Induced Noise
Table 2 presents a performance comparison of various networks under a clean environment without adversarial training to establish a baseline for their segmentation abilities. Notably, the results demonstrate that the integration of the NODE architecture does not compromise the network’s inherent segmentation capability, maintaining a highly competitive performance compared to standard backbones in noise-free environments.
Table 3 demonstrates that the proposed NODE-based models consistently outperform traditional discrete architectures in adversarial robustness across all evaluated white-box attacks. Under FGSM, NODE SE achieves better robustness than the standard SE module, improving the mIoU from 0.8968 to 0.9232 while reducing the performance drop from 5.51 to 3.43. Iterative attacks reveal similar advantages: NODE Conv shows remarkable resilience against MI-FGSM with a minimal drop of 2.27, contrasting sharply with the severe 4.43 degradation observed in DeepLabv3. Finally, against the PGD attack, NODE ASPP proves to be the most stable configuration, yielding the lowest performance degradation (2.35) among all tested methods.
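For reference, the mIoU metric and a performance drop of the kind quoted above can be computed as in the following sketch. The mIoU definition is the standard one. The drop, however, is defined here as an absolute difference in percentage points, which is an assumption: this section does not spell out the formula behind the reported values.

```python
import numpy as np


def mean_iou(pred, target, num_classes=2):
    """Mean intersection-over-union over the classes present in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:                          # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))


def performance_drop(clean_miou, attacked_miou):
    """Absolute mIoU drop in percentage points (assumed definition)."""
    return 100.0 * (clean_miou - attacked_miou)
```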
In summary, the experimental results verify that re-formulating discrete feature mappings as continuous-time dynamic systems via NODEs effectively suppresses gradient-based perturbations. By smoothing feature trajectories and enhancing Lipschitz continuity, our methodology ensures more stable and reliable mangrove segmentation in the presence of adversarial noise.
Beyond the fixed-strength evaluation in
Table 3, a sensitivity analysis was conducted by progressively increasing the perturbation intensity, as shown in
Figure 6. Since different attack methods exhibit different degradation behaviors, the perturbation ranges were selected accordingly. Specifically, FGSM was evaluated with
, whereas MI-FGSM and PGD were evaluated with
.
Figure 6 presents the robustness comparison under progressively stronger adversarial perturbations. As the perturbation magnitude increases, the performance of all models gradually degrades due to the increasing distortion in the input feature distribution. However, the degradation patterns differ significantly across architectures.
Overall, the NODE-enhanced variants exhibit improved robustness and slower performance degradation compared with their discrete counterparts. For example, under FGSM with , NODE Conv achieves an accuracy of 0.5630, significantly outperforming ResNet50 (0.2898) and ASPP (0.3684). In particular, the NODE Conv model consistently maintains higher stability under MI-FGSM and PGD attacks, suggesting that continuous-time feature evolution can effectively mitigate the propagation of adversarial perturbations.
Interestingly, DeepLabv3 also shows relatively strong robustness under FGSM attacks across different perturbation intensities. This behavior is likely related to the relatively limited optimization strength of single-step attacks, under which the overall segmentation structure of DeepLabv3 can still be partially preserved. However, under stronger iterative attacks such as MI-FGSM and PGD, the robustness advantage of the proposed NODE-based architectures becomes significantly more evident, indicating their superior stability against progressively accumulated perturbations.
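The difference between single-step and iterative attacks can be seen in a compact sketch. The example below uses a toy linear scorer with a logistic loss in place of a segmentation network, so that the input gradient is available in closed form. The model, the step size alpha, the momentum factor mu, and the function names are assumptions made for illustration, and clipping to a valid pixel range is omitted.

```python
import numpy as np


def loss_and_grad(x, y, w):
    """Logistic loss of a linear scorer w.x for label y in {-1, +1}, and its gradient w.r.t. x."""
    margin = y * np.dot(w, x)
    loss = np.log1p(np.exp(-margin))
    grad = -y * w / (1.0 + np.exp(margin))
    return loss, grad


def fgsm(x, y, w, eps):
    """Single-step attack: one signed-gradient step of size eps."""
    _, g = loss_and_grad(x, y, w)
    return x + eps * np.sign(g)


def pgd(x, y, w, eps, alpha, steps, seed=0):
    """Iterative attack with random start and projection onto the L-inf ball of radius eps."""
    rng = np.random.default_rng(seed)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        _, g = loss_and_grad(x_adv, y, w)
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = x + np.clip(x_adv - x, -eps, eps)   # projection step
    return x_adv


def mi_fgsm(x, y, w, eps, steps, mu=1.0):
    """Iterative attack that accumulates L1-normalised gradients with momentum mu."""
    alpha = eps / steps
    x_adv = x.copy()
    momentum = np.zeros_like(x)
    for _ in range(steps):
        _, g = loss_and_grad(x_adv, y, w)
        momentum = mu * momentum + g / (np.sum(np.abs(g)) + 1e-12)
        x_adv = x + np.clip(x_adv + alpha * np.sign(momentum) - x, -eps, eps)
    return x_adv
```

On this linear toy model the gradient sign never changes, so all three attacks end at the same corner of the perturbation ball. On a nonlinear network the gradient is re-evaluated at each step, which is why iterative attacks accumulate stronger perturbations than a single FGSM step.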
The SE module also demonstrates strong robustness characteristics. Across the three attack settings, SE maintains relatively stable performance compared with most baseline architectures, indicating that channel-wise feature recalibration helps suppress noise-sensitive responses.
Under the stronger PGD attack, the advantage of the NODE-based design becomes more pronounced at higher perturbation levels. When , NODE ASPP achieves the best performance (0.6536), outperforming both the original ASPP module and other baseline models.
These results indicate that modeling feature transformations as continuous dynamical systems improves the stability of learned representations under adversarial perturbations, while the SE mechanism further contributes to robust feature recalibration.
4.4.3. Performance Under Natural Noise
The proposed Synergistic Adversarial Training (SAT) strategy is intended to defend against adversarial perturbations in a way that also generalizes to natural noise.
To test whether the learned robustness transfers to naturally occurring noise, two common noise models are introduced: Gaussian noise, which simulates environmental disturbances such as sensor noise and illumination variation, and salt-and-pepper noise, which represents sensor degradation and random pixel corruption.
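The two noise models described above can be sketched in NumPy as follows. The parameter names sigma and amount, the [0, 1] intensity range, and the equal split between salt and pepper are assumptions, since the section does not give the exact corruption settings.

```python
import numpy as np


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to an image in [0, 1]."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def add_salt_and_pepper(image, amount, rng):
    """Set a fraction `amount` of pixels to 0 (pepper) or 1 (salt) with equal probability."""
    noisy = image.copy()
    u = rng.random(image.shape[:2])                    # one draw per pixel, shared across channels
    noisy[u < amount / 2.0] = 0.0                      # pepper
    noisy[(u >= amount / 2.0) & (u < amount)] = 1.0    # salt
    return noisy
```

A robustness curve of the kind reported for natural noise would then be obtained by sweeping sigma or amount over a range of values and evaluating the segmentation metric at each level.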
The results are shown in
Figure 7. Without SAT, the segmentation performance degrades rapidly as the noise intensity increases. In contrast, models trained with SAT maintain significantly higher accuracy under both Gaussian noise and salt-and-pepper (sensor degradation) noise. Furthermore, the experimental results show that NODE-based architectures achieve more substantial robustness improvements under SAT compared with traditional discrete networks. Although adversarial training enhances the performance of all evaluated models, the NODE-enhanced variants consistently maintain higher segmentation accuracy and exhibit slower performance degradation as the corruption intensity increases.
Among the compared methods, NODE ASPP and NODE SE demonstrate particularly strong robustness under both Gaussian noise and sensor degradation noise. Even under severe corruption conditions, the NODE-based models are able to preserve more stable segmentation performance and clearer structural predictions than their conventional counterparts. These results indicate that the proposed SAT framework is especially effective when combined with NODE-based feature modeling, leading to better robustness generalization from adversarial perturbations to naturally occurring environmental disturbances.
Overall, the experimental findings demonstrate that integrating SAT with NODE architectures not only improves adversarial robustness, but also significantly enhances the stability and reliability of mangrove segmentation under complex real-world noise conditions.
4.4.4. Ablation Study on SE Depth and NODE Architecture
To verify that the performance superiority of NODE SE is driven by the continuous-time NODE mechanism rather than simple depth accumulation, we further introduce DeepSE as an ablation baseline by scaling the conventional SE block to comparable or even greater depths. Specifically, the original SE module is expanded to two and four stacked layers, denoted as DeepSE (Layer2) and DeepSE (Layer4), respectively. Among the evaluated adversarial attacks, we select PGD as the representative benchmark due to its strong iterative perturbation capability and wide adoption in robustness evaluation. The complexity comparison of the four models is reported in
Table 4.
After adversarial training, we compare the performance of SE (Layer1), DeepSE (Layer2), DeepSE (Layer4), and NODE SE under both clean and adversarial conditions. The experimental results are shown in
Figure 8.
As shown in
Figure 8, increasing the depth of the conventional SE module does improve the mIoU performance under attack-free conditions. However, under PGD attack, NODE SE still achieves a 1.79% higher mIoU than the deeper DeepSE (Layer4) configuration. This result indicates that the robustness enhancement of NODE SE is not merely attributable to increased network depth or parameter accumulation, but rather originates from the continuous dynamic feature evolution introduced by the NODE mechanism.
4.4.5. Inference Speed Comparison
Figure 9 reports the inference speed comparison of different network architectures measured in frames per second (FPS). As shown in the results, the introduction of NODE-based modules leads to a moderate reduction in inference speed due to the additional computation introduced by the ODE solver. For example, NODE Conv achieves 120.61 FPS compared with 138.12 FPS for the standard ResNet50 backbone.
Despite this additional computational cost, the NODE-enhanced models still maintain competitive inference efficiency. In particular, NODE SE achieves 124.8 FPS, which remains close to the original SE module (133.73 FPS). Similarly, NODE ASPP runs at 95.69 FPS compared with 101.16 FPS for ASPP. These results indicate that the proposed continuous-time modeling introduces only a limited computational overhead while preserving high inference efficiency.
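The size of this overhead can be made explicit from the FPS values quoted above. The short calculation below is only arithmetic on the reported numbers. The helper names are hypothetical, and the per-image latency assumes that FPS was measured on single images, which this section does not state.

```python
# FPS values as reported in the text: (discrete baseline, NODE variant).
FPS = {
    "ResNet50 vs NODE Conv": (138.12, 120.61),
    "SE vs NODE SE": (133.73, 124.80),
    "ASPP vs NODE ASPP": (101.16, 95.69),
}


def relative_slowdown(baseline_fps, node_fps):
    """Percentage reduction in throughput when moving from the baseline to the NODE variant."""
    return 100.0 * (1.0 - node_fps / baseline_fps)


def extra_latency_ms(baseline_fps, node_fps):
    """Additional per-image latency in milliseconds implied by the two FPS values."""
    return 1000.0 * (1.0 / node_fps - 1.0 / baseline_fps)


slowdowns = {name: relative_slowdown(*pair) for name, pair in FPS.items()}
```

With these inputs the throughput reductions come out at roughly 12.7% for NODE Conv, 6.7% for NODE SE, and 5.4% for NODE ASPP, so the ODE solver costs the backbone variant proportionally more than the two module-level variants.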
Compared with the full DeepLabv3 architecture, which runs at 59.31 FPS, all lightweight module-level variants achieve substantially higher inference speed. This demonstrates that the proposed NODE-based modifications provide improved robustness while maintaining practical computational efficiency for real-world segmentation tasks.
4.5. Visualization Results
In this section, we compare and visualize the segmentation results obtained from clean and noise-corrupted inputs. As shown in
Figure 10, our proposed network produces highly consistent segmentation maps under both conditions. Even when the input images are disturbed by noise, the predicted boundaries and region structures remain nearly identical to those obtained from clean images, showing only minimal deviations in fine details. This consistency demonstrates that our model is capable of resisting noise interference and maintaining stable segmentation performance. Such robustness is particularly valuable for mangrove remote sensing applications, where images are often affected by atmospheric noise, illumination changes, and sensor instability.
5. Limitations and Future Work
Although the proposed NODE-based framework demonstrates strong robustness against both adversarial perturbations and natural noise, several limitations still remain. First, due to the introduction of ODE solvers, the proposed framework inevitably incurs additional computational overhead compared with conventional discrete architectures, which may limit its applicability in resource-constrained or real-time remote sensing scenarios. Second, the current experimental setting adopts a commonly used random train/validation/test split strategy and mainly evaluates robustness under representative perturbations, including adversarial attacks and typical noise corruption. More challenging cross-region, cross-sensor, and remote sensing-specific corruption settings may further reveal the generalization capability of the proposed framework under domain shifts. In addition, the current study mainly focuses on spatial feature robustness and does not explicitly model temporal ecological dynamics.
Future work will therefore investigate lightweight NODE architectures for more efficient deployment, robustness evaluation under broader real-world remote sensing corruptions and cross-domain settings, as well as the integration of temporal modeling and multi-modal remote sensing information to better capture long-term ecological variations and further improve robustness in practical monitoring applications.