Article

DEM-Assisted Topography-Conditioned and Orientation-Adaptive Siamese Network for Cross-Region Landslide Change Detection

1 Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China
2 Jiangsu Hydraulic Research Institute, Nanjing 210029, China
3 School of Electrical Engineering, Naval University of Engineering, Wuhan 430033, China
4 GNSS Research Center, Wuhan University, Wuhan 430070, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(5), 702; https://doi.org/10.3390/rs18050702
Submission received: 18 January 2026 / Revised: 10 February 2026 / Accepted: 24 February 2026 / Published: 26 February 2026

Highlights

  • Proposes a DEM-assisted Siamese network that combines topography-conditioned feature modulation with orientation-adaptive convolutions for cross-region landslide change detection.
  • Demonstrates improved robustness and boundary delineation under a site-wise cross-region evaluation protocol, supported by ablation evidence for the two modules.
What are the main findings?
  • Topography-conditioned modulation reduces pseudo-change and stabilizes landslide change detection across regions with strong domain shifts.
  • Orientation-adaptive convolutions better capture elongated landslide structures, improving change-map coherence and boundary detail.
What are the implications of the main findings?
  • Terrain priors from DEM can be used as physically meaningful conditioning signals to enhance generalization of optical change detectors in mountainous and heterogeneous environments.
  • Geometry-aware operators are a practical design choice for hazard mapping tasks dominated by directional structures.

Abstract

Automated landslide change detection using remote sensing imagery is critical for rapid disaster response. However, landslide change detection using bi-temporal optical imagery is frequently degraded by cross-region domain shifts and by the elongated, anisotropic morphology of landslide boundaries, leading to substantial pseudo-change alarms. To suppress pseudo-changes and improve cross-region robustness, we propose a DEM-assisted topography-conditioned and orientation-adaptive Siamese network (DEMO-Net) that injects topographic inductive bias through terrain-conditioned feature modulation and orientation-adaptive convolutions. Specifically, DEM-derived multi-channel priors are encoded to predict spatially varying FiLM parameters that recalibrate shallow optical features, suppressing spurious changes while preserving discriminative cues. In addition, we introduce an adaptive-oriented attention convolution that leverages a DEM-derived aspect to guide sparse multi-orientation aggregation via shared-kernel transformation, enabling direction-aware receptive-field alignment for elongated and direction-varying landslide structures without costly global attention. Experiments on the GVLM benchmark under a 5-fold site-wise cross-region protocol show that DEMO-Net achieves 85.17% F1 and 74.26% mIoU, outperforming the strongest CNN baseline FC-EF by 5.05% and 7.20%, respectively. These results demonstrate the effectiveness of jointly leveraging terrain-conditioned calibration and physically consistent orientation-aligned feature extraction for robust cross-region landslide change detection.

1. Introduction

Landslides are among the most destructive geomorphic hazards, threatening lives, infrastructure, and ecosystems across mountainous and hilly regions worldwide [1,2]. Unlike instantaneous disasters, landslides are dynamic processes involving initiation, runout, and episodic reactivation, which continuously reshape the terrain over time [3,4]. Thus, rapid and reliable monitoring of these evolving footprints is critical for emergency response and long-term risk mitigation [5,6].
Traditional landslide detection requires substantial labor and time, which significantly constrains the speed and scale of disaster response. Landslide monitoring through satellite imagery has therefore become increasingly critical for disaster prevention and environmental management [7,8,9]. Multi-temporal change detection (CD) provides a robust mechanism to capture these dynamic processes by systematically contrasting observations across time. By computationally isolating active surface disturbances from the largely stable background, CD enables the precise identification of evolutionary stages—such as initiation, enlargement, and reactivation—effectively transforming landslide mapping from static, discrete inventories into a continuous, decision-ready monitoring system [10].
However, automated landslide CD remains a formidable challenge due to the intricate interplay between environmental noise and complex object morphology [11]. On the one hand, optical satellite imagery is inherently susceptible to atmospheric disturbances, such as clouds and shadows, as well as seasonal phenological changes [12,13,14]. These factors, combined with domain shifts across different regions, severely undermine the generalization capability of data-driven models [15,16]. On the other hand, landslides exhibit unique, anisotropic geometric patterns. Unlike man-made objects with relatively regular geometry, landslides are geomorphically constrained, manifesting as elongated debris trails or curved scarps that align with the topographic gradient [17,18,19]. These irregular shapes and varying orientations create significant ambiguity for algorithms that lack explicit geometric awareness [20].
Existing methodologies generally struggle to align with these physical and geometric realities. Traditional pixel-based and feature-based architectures, while computationally simple, rely on low-level spectral features [7,21,22]. They lack the spatial context required to distinguish true landslides from the environmental noise and registration errors described above [23,24]. In recent years, deep learning approaches, particularly Convolutional Neural Networks (CNNs), have dominated the field due to their hierarchical feature extraction capabilities [25,26,27]. However, standard CNNs employ fixed, isotropic receptive fields that are inherently grid-aligned [28,29]. This structural rigidity contradicts the natural morphology of landslides, which are often anisotropic and follow topographic gradients [30,31]. Consequently, standard kernels struggle to capture rotated or elongated features, leading to fragmented detections and boundary aliasing [32]. Orientation-aware convolutions have been widely explored in object detection and remote sensing recognition, especially for man-made targets such as buildings, ships, and vehicles with relatively regular shapes [33,34,35]. However, their combination with topographic priors for modeling highly anisotropic landslide runout patterns in a change detection setting has not been fully investigated.
To address these challenges, we propose a novel DEM-assisted topography-conditioned and orientation-adaptive Siamese network (DEMO-Net), which retains the efficiency of convolutional backbones while introducing the geometric flexibility required for elongated landslide structures. Different from optical-only models that rely on volatile spectral cues, DEMO-Net explicitly injects a topographic inductive bias to improve robustness under cross-region domain shifts.
(1) We develop a geomorphic-aware FiLM module for terrain-conditioned feature modulation. Instead of treating topographic data as additional input channels, the proposed module projects topographic priors into spatially varying affine parameters that recalibrate shallow optical features. This conditioning enables the network to suppress pseudo-changes over stable low relief areas and enhance sensitivity over steep, hazard-prone slopes, thereby improving the discrimination between true landslide changes and environmental noise.
(2) We introduce an Adaptive-Oriented Attention Convolution (AOAC) module to achieve direction-aware receptive field alignment. The module uses a DEM-derived aspect to guide a lightweight spatial attention mechanism that selects and aggregates a small set of oriented kernels generated from a shared base kernel. Compared with deformable convolutions that learn un-constrained offsets implicitly, AOAC imposes an explicit geometric inductive bias consistent with gravity-driven downslope motion, aligning receptive fields with slope-parallel runout directions. This design strengthens the representation of thin, anisotropic boundaries without the computational overhead of global self-attention.
(3) We evaluate DEMO-Net under a rigorous 5-fold site-wise cross-validation protocol on cross-region landslide datasets, which better reflects deployment in unseen areas. The results show consistent improvements over nine baselines, and ablation studies verify the effectiveness of both terrain-conditioned modulation and orientation-adaptive feature extraction.

2. Related Work

2.1. Attention Mechanisms in Change Detection

The distinction between genuine landslides and spectral pseudo-changes induced by factors such as seasonal phenology and illumination shifts poses a critical challenge in change detection. Traditional methods typically treat spectral differences uniformly, rendering them vulnerable to environmental variances. To address this limitation, attention mechanisms have been widely adopted for their ability to perform dynamic feature recalibration [36]. Functionally, attention mechanisms learn to suppress environmental noise by assigning lower weights to irrelevant spectral channels or spatial regions, while selectively amplifying the response of discriminative landslide features. This learnable filtering capability effectively enables the model to differentiate genuine hazards from background interference [37].
Several studies highlight the effectiveness of attention mechanisms in various aspects of remote sensing change detection. For instance, the Hybrid Self-Cross Attention Network (HSANet) represents a significant leap in modeling bi-temporal relationships [38]. It introduces a novel hybrid attention module that operates in dual modes simultaneously: it utilizes self-attention to capture long-range dependencies within single-temporal images (intra-image context) and cross-attention to align and compare features between the pre- and post-event images (inter-image difference). This design effectively fuses global semantic context with multi-scale features, thereby refining edge details and significantly improving detection performance in large-scale monitoring tasks. Similarly, addressing the scarcity of positive samples, the Hierarchical Attention Network (HANet) tackles the inherent class imbalance problem prevalent in CD datasets [39]. It integrates a spectral–spatial attention mechanism with a progressive foreground-balanced sampling strategy. Unlike standard training procedures that treat all pixels equally, this strategy forces the network to focus its computational resources more intensively on changed pixels during the training process. This “hard-sample mining” approach ensures that the model learns robust representations even for small or subtle changes in challenging scenarios. Furthermore, the advent of transformer-based models has pushed the boundaries of global context modeling [40,41]. Unlike CNNs, which are limited by local receptive fields, these models leverage robust multi-head self-attention mechanisms to model long-range dependencies across the entire image in both spatial and temporal dimensions. 
By tokenizing the image patches and interacting them globally, they facilitate a comprehensive understanding of the high-level semantic context, which is crucial for distinguishing real changes from pseudo-changes caused by seasonal variations or illumination shifts. However, a critical limitation persists in these methods: they primarily focus on feature reweighting while often neglecting geometric adaptation. This inductive bias limits their ability to capture the complex morphologies of landslides, which often exhibit strong directionality and irregular boundaries due to terrain constraints. This necessitates the development of efficient mechanisms that can explicitly adapt to the anisotropic geometry of terrain features without incurring the heavy computational burden of full self-attention.

2.2. Feature Modulation for Multimodal Analysis

Integrating complementary information from heterogeneous modalities is a critical research avenue for enhancing scene understanding in complex remote sensing environments. Since different sensors capture distinct physical properties of the Earth’s surface, effective fusion strategies are essential to leverage their unique strengths. Deep learning techniques have driven significant progress in moving beyond simple concatenation toward more sophisticated multimodal feature extraction and fusion paradigms [42].
For example, Multimodal Feature Decomposition and Fusion Network (MDFNet) employs a dedicated multi-stream architecture to address the discrepancy between modalities [43]. It explicitly decomposes features into modality-shared components and modality-specific components. By processing these distinct representations through separate encoders and then fusing them, the network can robustly handle challenging conditions and achieve superior semantic segmentation results. In the context of topography-aware analysis, the Hierarchical Feature Integration and Fusion (HFIF) framework demonstrates the utility of multi-scale fusion in remote sensing visual question answering tasks [44]. It addresses the semantic gap between visual signals and high-level concepts by hierarchically integrating visual features with semantic context at multiple resolutions. This ensures that information from different receptive fields is effectively combined, allowing the model to answer complex queries about the scene by reasoning over both local details and global structures. Moreover, recent works have shifted towards feature modulation techniques as a parameter-efficient alternative to computationally heavy fusion modules like cross-attention modules [45]. By injecting prior knowledge directly into the feature extraction process, feature modulation significantly improves the robustness and accuracy of change detection in diverse and unseen geographical environments.

3. Method

This section details the overall architecture and workflow of the proposed DEMO-Net. The complete framework is illustrated in Figure 1. The framework employs a Siamese encoder–decoder architecture that processes bitemporal images alongside a multi-channel topographic prior tensor (DEM, slope, and aspect). Specifically, a shared ResNet-D backbone extracts hierarchical features, which are first dynamically modulated at shallow stages by a parallel prior branch via geomorphic-aware FiLM to inject terrain context, and then refined at deeper stages by AOAC modules to capture anisotropic geological patterns. Subsequently, the bi-temporal features are fused using a difference-aware concatenation strategy and passed to a U-Net-style decoder equipped with Atrous Spatial Pyramid Pooling (ASPP) to generate the final primary landslide mask and auxiliary boundary map. The architecture is distinguished by two core innovations: a topographic prior injection mechanism to mitigate spectral ambiguity, and an AOAC module to handle anisotropic morphological variations.

3.1. Siamese Encoder and Change-Aware Feature Fusion

We employed a weight-sharing Siamese encoder $enc(\cdot)$ with a ResNet-D backbone. Compared with standard ResNet, ResNet-D replaces the 7 × 7 stem with three stacked 3 × 3 convolutions, which preserves fine-grained texture details critical for remote sensing. In addition, it integrates average pooling in the downsampling path to provide anti-aliasing, ensuring shift-invariance during feature extraction.
Given the two temporal images, the encoder produces a multi-scale feature pyramid:
$$\{F_{pre}^{2}, F_{pre}^{3}, F_{pre}^{4}, F_{pre}^{5}\} = enc(I_{pre})$$
$$\{F_{post}^{2}, F_{post}^{3}, F_{post}^{4}, F_{post}^{5}\} = enc(I_{post})$$
where $F^{k}$ denotes the feature map output by the backbone network at stage $k$.
To explicitly capture change signals while preserving temporal semantic context, we employ a robust fusion strategy. Unlike simple concatenation, we compute the absolute difference between temporal features to highlight change intensities. The fused feature map F f u s e d k is computed as:
$$F_{cat}^{k} = \mathrm{Concat}\big(F_{pre}^{k}, F_{post}^{k}, |F_{post}^{k} - F_{pre}^{k}|\big)$$
$$F_{fused}^{k} = \mathrm{GELU}\big(\mathrm{Conv}_{1\times 1}(F_{cat}^{k})\big)$$
where $\mathrm{Concat}(\cdot)$ denotes channel-wise concatenation; the 1 × 1 convolution with GELU activation reduces the channel dimensionality and aggregates temporal information. This fusion strategy both preserves the two sets of temporally independent contextual information and provides the network with explicit change signals. The fused features $\{F_{fused}^{2}, F_{fused}^{3}, F_{fused}^{4}, F_{fused}^{5}\}$ serve as the input to the AOAC blocks and the decoder.
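As a concrete illustration, the difference-aware fusion can be sketched in NumPy (the 1 × 1 convolution and GELU that follow are omitted; shapes are channel-first, and the toy feature values are illustrative only):

```python
import numpy as np

def difference_aware_fuse(f_pre: np.ndarray, f_post: np.ndarray) -> np.ndarray:
    """Stack pre/post features with their absolute difference along the channel axis."""
    diff = np.abs(f_post - f_pre)                          # explicit change-intensity signal
    return np.concatenate([f_pre, f_post, diff], axis=0)   # (3C, H, W)

# Toy bi-temporal features: C = 4 channels on an 8 x 8 grid
f_pre = np.zeros((4, 8, 8))
f_post = np.ones((4, 8, 8))
f_cat = difference_aware_fuse(f_pre, f_post)  # 12 channels before the 1x1 conv reduces them
```

The absolute-difference channel is what makes the change signal explicit before the 1 × 1 convolution compresses the concatenated tensor back to the working channel width.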

3.2. Geomorphic-Aware Feature Modulation

Existing topography-aware change detection and segmentation pipelines typically inject DEM-derived priors through either early fusion or late fusion strategies. Early fusion generally involves directly concatenating topographic data with optical features at the input level. This approach enlarges the channel space, often leading to modality interference and unstable generalization when facing cross-region domain shifts. Late fusion mixes modalities after high-level semantics are formed, but it frequently incurs non-trivial computational overhead due to attention-style interactions.
In this study, we injected terrain cues stage-wise into shallow encoder blocks using a spatially varying FiLM transform. Unlike self-conditioned reweighting, our modulation was driven by external terrain embeddings, allowing topographic context to explicitly guide optical feature extraction. Moreover, we predicted modulation parameters per spatial location, rather than using global or channel-wise scalars, enabling geo-morphically meaningful, location-dependent calibration while preserving the optical backbone’s representation learning.

3.2.1. Prior Branch: Terrain Embedding Encoder

The topographic tensor $P_{topo}$ is first processed by a lightweight prior branch $\phi_{prior}$, consisting of stacked convolutional layers. This maps raw physical values such as elevation, slope, and aspect into a high-dimensional semantic embedding space:
$$E_{topo} = \phi_{prior}(P_{topo})$$
where $\phi_{prior}$ denotes a small CNN with two strided 3 × 3 convolutions followed by a 3 × 3 refinement layer, GroupNorm, and GELU; $E_{topo} \in \mathbb{R}^{C_p \times H_2 \times W_2}$ with $C_p = 64$, and $(H_2, W_2) = (H/4, W/4)$ matches the resolution of the first encoder stage.

3.2.2. Dynamic Feature Modulation

Instead of using simple channel concatenation, we use a FiLM mechanism to inject geomorphic context. For each stage $k$, the terrain embedding $E_{topo}$ is resized to the corresponding resolution:
$$E^{k} = \mathrm{Resize}(E_{topo}, (H_k, W_k))$$
A small prediction network then transforms $E^{k}$ into per-location scaling and shifting parameters:
$$(\gamma, \beta) = \mathrm{Split}(\mathrm{FiLMParams}(E^{k}))$$
The optical features are modulated via a spatially varying affine transformation:
$$\tilde{F}_{img} = F_{img} \odot (1 + \tanh(\gamma)) + \beta$$
where $\odot$ denotes element-wise multiplication and $\tanh(\cdot)$ stabilizes the modulation amplitude. FiLM is applied to the shallow blocks ($k$ = 2, 3), where high-resolution geomorphic cues are most informative. This mechanism allows the network to learn spatially adaptive rules: in steep, geomorphically susceptible areas it amplifies change-related activations ($\gamma > 0$), whereas in flat or alluvial regions that rarely host landslides it suppresses features ($\gamma < 0$) to filter out pseudo-changes.
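The spatially varying affine transform above can be sketched in a few lines of NumPy (here γ and β are fixed toy values; in DEMO-Net they are predicted per location by the parameter network):

```python
import numpy as np

def film_modulate(f_img: np.ndarray, gamma: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Spatially varying FiLM: F' = F * (1 + tanh(gamma)) + beta, all shaped (C, H, W)."""
    return f_img * (1.0 + np.tanh(gamma)) + beta

f = np.ones((2, 4, 4))
# Positive gamma (e.g. steep slopes) amplifies activations;
# negative gamma (e.g. flat terrain) suppresses them toward zero.
steep = film_modulate(f, gamma=np.full_like(f, 2.0), beta=np.zeros_like(f))
flat = film_modulate(f, gamma=np.full_like(f, -2.0), beta=np.zeros_like(f))
```

Because tanh is bounded in (−1, 1), the multiplicative gate stays in (0, 2), which is what keeps the modulation amplitude stable during training.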

3.3. Orientation-Adaptive Attention Convolutions

Landslides exhibit extreme morphological diversity and anisotropy, often appearing as elongated debris flows strictly controlled by gravity and terrain aspect. Standard convolutions possess fixed, square-receptive fields, which are suboptimal for capturing such directional geometric patterns. Existing orientation-sensitive designs often rely on multi-branch architectures or banks of independently learned kernels for different angles, which increases parameters linearly with the number of orientations and may weaken optimization stability. By contrast, our AOAC module is built on a shared-kernel principle: it learns a single base kernel and generates K-oriented kernels through geometric rotation via sampling on an affine grid as shown in Figure 2. The responses of these oriented kernels are then combined using sparse, top-m direction weights so that only a few dominant orientations contribute at each spatial location. This yields explicit orientation selectivity with strong parameter sharing, enabling the model to align receptive fields with elongated landslide structures while keeping the computational and parameter overhead controlled.
Given a topography-modulated feature map $F \in \mathbb{R}^{H \times W \times C}$, AOAC proceeds in three stages, as shown in Figure 3.

3.3.1. Orientation Scoring

To accurately capture the anisotropic morphology of landslides while maintaining computational stability, we propose a two-stage orientation estimation strategy: coarse orientation proposal and fine-grained discrete refinement.
Stage A: Orientation Proposal
Given the input feature map F R H × W × C , AOAC first estimates a continuous principal direction of feature response θ that summarizes anisotropic responses along the vertical and horizontal axes, which is intrinsically aligned with the terrain aspect due to the preceding geomorphic-aware modulation.
To capture this efficiently, we employ a lightweight self-attention mechanism to aggregate global descriptors along orthogonal directions. We generate two 1D attention maps $A_H$ and $A_W$ via global average pooling (GAP) and learnable projections along the height ($H$) and width ($W$) dimensions, respectively:
$$A_H = \sigma\big(W_H \cdot \mathrm{GAP}_H(F)\big)$$
$$A_W = \sigma\big(W_W \cdot \mathrm{GAP}_W(F)^{T}\big)$$
where $\mathrm{GAP}_H$ and $\mathrm{GAP}_W$ denote global average pooling along the respective dimensions, $W_H$ and $W_W$ are learnable projection weights, and $\sigma$ represents the sigmoid activation.
$A_H$ emphasizes the dominant extent along the horizontal axis, while $A_W$ does so along the vertical axis. Their combination provides a coarse but efficient cue of the local anisotropic trend. AOAC then converts these two axis-wise attentions into a continuous dominant-orientation estimate by aggregating their attention-weighted centered coordinates, which serves as a prior for the approximate flow direction:
$$\theta = \arctan\left(\frac{\sum_{i=1}^{H} A_H(i)\,(i - H/2)}{\sum_{j=1}^{W} A_W(j)\,(j - W/2)}\right)$$
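Assuming the weighted coordinates in the orientation estimate are taken relative to the image center, Stage A can be sketched as follows (`arctan2` is used in place of plain `arctan` so that a zero denominator is handled gracefully):

```python
import numpy as np

def coarse_orientation(a_h: np.ndarray, a_w: np.ndarray) -> float:
    """Continuous dominant-direction estimate from two 1D axis attentions (Stage A)."""
    h, w = a_h.size, a_w.size
    num = np.sum(a_h * (np.arange(1, h + 1) - h / 2.0))  # attention-weighted row offset
    den = np.sum(a_w * (np.arange(1, w + 1) - w / 2.0))  # attention-weighted column offset
    return float(np.arctan2(num, den))

# Attention peaked at the bottom row and the rightmost column -> roughly diagonal trend
theta = coarse_orientation(np.eye(8)[7], np.eye(8)[7])
```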
Stage B: Discrete Direction Attention
While θ provides a global context, landslide boundaries often exhibit complex local variations that a single scalar cannot represent. Therefore, we utilize θ to guide a fine-grained discrete selection process.
We discretize the orientation space into K predefined anchor angles Θ . A scorer network then predicts a confidence vector S, representing the matching degree between the local features and these discrete anchors. Crucially, this step is prior-guided: the optimization is implicitly conditioned on the coarse trend to resolve ambiguities in texture-less regions.
$$\Theta = \{\theta_1, \theta_2, \ldots, \theta_K\}$$
$$S = \mathrm{Scorer}(F; \Theta)$$
To eliminate noise from irrelevant directions and enforce orientation sparsity, we apply a top-$m$ sparsification strategy. Only the top-$m$ orientation scores are retained:
$$S_{masked} = \mathrm{TopKMask}(S, m)$$
$$A_k = \frac{\exp(S_{k,masked}/\tau)}{\sum_{j=1}^{K} \exp(S_{j,masked}/\tau)}$$
where τ is the temperature hyper-parameter.
This hierarchical design ensures that the receptive field alignment is both globally consistent with the terrain trend and locally adaptive to boundary details.
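The top-m sparsification followed by a temperature softmax can be sketched as below (masked scores are set to −∞ so they contribute exactly zero weight after exponentiation):

```python
import numpy as np

def topm_direction_attention(scores: np.ndarray, m: int, tau: float = 1.0) -> np.ndarray:
    """Mask all but the top-m orientation scores, then apply a temperature softmax."""
    masked = np.full_like(scores, -np.inf, dtype=float)
    keep = np.argsort(scores)[-m:]                     # indices of the m largest scores
    masked[keep] = scores[keep]
    z = np.exp((masked - masked[keep].max()) / tau)    # exp(-inf) = 0 for masked entries
    return z / z.sum()

# K = 4 anchor directions, only the m = 2 strongest contribute
weights = topm_direction_attention(np.array([0.1, 2.0, -1.0, 1.5]), m=2)
```

Lowering τ sharpens the distribution toward the single dominant anchor; raising it blends the retained directions more evenly.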

3.3.2. Dynamic Kernel Adaptation and Aggregation

Instead of learning independent kernels for each direction, we generate rotated kernels from a single base kernel via geometric transformation.
The kernel adaptation stage rotates the base convolution kernel according to each anchor orientation $\theta_k$. The rotation employs an affine transformation matrix $T_{\theta_k}$:
$$T_{\theta_k} = \begin{bmatrix} \cos\theta_k & -\sin\theta_k \\ \sin\theta_k & \cos\theta_k \end{bmatrix}$$
Let $G_{base}$ denote the canonical sampling grid of the base convolution kernel. We generate the sampling grid $G_k$ for orientation $\theta_k$ by rotating $G_{base}$:
$$G_k = G_{base} \cdot T_{\theta_k}^{T}$$
Then let $W_{base}$ denote the learnable weights of the base kernel. For each discrete direction $\theta_k$, we obtain the rotated kernel weights $W_k$ by bilinearly resampling $W_{base}$ at the locations specified by $G_k$:
$$W_k = \mathrm{GridSample}(W_{base}, G_k)$$
Convolving the input feature map $F$ with each rotated kernel $W_k$ filters the feature map:
$$Y_k = \mathrm{DepthwiseConv}(W_k, F)$$
Finally, we perform weighted aggregation. The responses from the $K$ directions are fused into a single feature map:
$$Y_{agg} = \sum_{k=1}^{K} A_k \odot Y_k$$
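A NumPy sketch of the shared-kernel rotation: the base kernel is resampled on its canonical grid rotated by θ, with bilinear interpolation written out explicitly (a framework implementation would instead use an affine grid with a built-in `grid_sample`):

```python
import numpy as np

def rotate_kernel(w_base: np.ndarray, theta: float) -> np.ndarray:
    """Bilinearly resample a square base kernel on its canonical grid rotated by theta."""
    k = w_base.shape[0]
    c = (k - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(k) - c, np.arange(k) - c, indexing="ij")
    # Rotate each sampling location back into the base kernel's frame
    xr = np.cos(theta) * xs - np.sin(theta) * ys + c
    yr = np.sin(theta) * xs + np.cos(theta) * ys + c
    out = np.zeros_like(w_base, dtype=float)
    for i in range(k):
        for j in range(k):
            x, y = xr[i, j], yr[i, j]
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            for dy in (0, 1):          # bilinear interpolation over the 4 neighbours
                for dx in (0, 1):
                    yy, xx = y0 + dy, x0 + dx
                    if 0 <= yy < k and 0 <= xx < k:
                        out[i, j] += (1 - abs(x - xx)) * (1 - abs(y - yy)) * w_base[yy, xx]
    return out

w = np.arange(9, dtype=float).reshape(3, 3)
w_rot = rotate_kernel(w, np.pi / 2)   # quarter-turn of the shared base kernel
```

Because all K oriented kernels are derived from one learnable `w_base`, the parameter count stays constant regardless of how many anchor directions are used.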

3.3.3. Residual Feature Transformation

To enhance representation power while preserving stable optimization, AOAC applies a pointwise transformation followed by a residual connection:
$$F_{out} = \mathrm{GELU}\big(\mathrm{Conv}_{1\times 1}(Y_{agg})\big) + F$$
where GELU denotes the Gaussian Error Linear Unit activation. AOAC blocks are inserted at intermediate and deep encoder stages, enabling the network to capture fine-scale orientation cues near head scarps and coarse-scale flow directions in the run-out zone.

3.4. Decoder and Output Heads

To reconstruct dense, pixel-wise predictions from the compressed encoder features, we adopt a UNet decoder equipped with multi-scale context aggregation.

3.4.1. ASPP for High-Level Context

Multi-Scale Context Aggregation: Before initiating the reconstruction pathway, the high-level semantic features from the encoder’s bottleneck are processed by an Atrous Spatial Pyramid Pooling (ASPP) module. To accommodate the extreme scale variations typical of landslide targets, the ASPP module employs dilation rates {1, 6, 12, 18}. This design effectively enlarges the receptive field and aggregates multi-scale contextual information without losing spatial resolution.

3.4.2. Progressive Reconstruction Pathway

The decoder follows a U-shaped architecture, progressively recovering spatial details through a sequence of upsampling stages. Each stage consists of a transposed convolution that doubles the spatial resolution of the feature maps. To compensate for the loss of fine-grained spatial information during encoding, the upsampled feature at each stage is concatenated with the corresponding encoder feature ($F_{fused}^{4}$, $F_{fused}^{3}$, $F_{fused}^{2}$), followed by a 3 × 3 convolutional block that merges semantic and spatial details. This U-shaped reconstruction path recovers fine structures while preserving strong semantic discrimination.

3.4.3. Dual-Task Prediction

Instead of a standard single-output layer, we employ a dual-head design using a shared-encoder, decoupled-decoder strategy. The primary segmentation head predicts the landslide masks, while the auxiliary boundary head predicts the binary boundaries. The final decoder feature Z is fed into two parallel 1 × 1 convolutional heads:
Primary Segmentation Head: Generates the final change probability map, representing the main body of the detected landslides:
$$Pre_{seg} = f_{seg}(Z) \in \mathbb{R}^{1 \times H \times W}$$
Auxiliary Boundary Head: Specifically tasked with predicting the pixel-wise boundaries of change regions.
$$Pre_{bnd} = f_{bnd}(Z) \in \mathbb{R}^{1 \times H \times W}$$
Both outputs are bilinearly upsampled to the original image resolution (H,W). The auxiliary boundary head explicitly focuses on high-frequency edge information, which helps sharpen landslide boundaries and reduce label bleeding across class borders.
During training, the boundary head is used only for auxiliary supervision to enforce edge-aware learning, and its loss is added to the main segmentation loss. During inference, the final landslide change map is produced by the primary segmentation head, while the boundary prediction can be optionally retained for visualization but is not fed back into the segmentation stream.

3.5. Loss Function

Landslide change detection is highly class-imbalanced: positive landslide pixels are usually far fewer than negative background pixels. To address this, we employ the focal Tversky loss (FTL), which simultaneously addresses class imbalance and hard-sample mining. The Tversky index (TI) is defined as:
$$TI = \frac{TP}{TP + \alpha\,FN + \beta\,FP}$$
where TP, FN, FP denote true positives, false negatives, and false positives, respectively. α and β are hyperparameters controlling the penalty for false negatives versus false positives.
The final FTL is formulated as:
$$L_{FTL} = (1 - TI)^{1/\gamma}$$
where γ controls the focusing on hard examples.
We employ the FTL for the primary segmentation task. For the auxiliary boundary head, we use a binary classification loss computed between the predicted boundary probability and a boundary target derived from the ground-truth mask. In all experiments, total loss function L t o t a l is a weighted sum of the primary segmentation loss L F T L and the auxiliary boundary loss L b d y :
$$L_{total} = L_{FTL} + \lambda_{bdy}\,L_{bdy}$$
where λ b d y is a scalar weight that balances the contribution of the auxiliary boundary supervision.
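A NumPy sketch of the focal Tversky loss on soft prediction maps; the α, β, γ values shown are commonly used defaults in the FTL literature, not necessarily the paper's settings:

```python
import numpy as np

def focal_tversky_loss(pred, target, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """Focal Tversky loss: (1 - TI) ** (1 / gamma) over soft TP/FN/FP counts."""
    tp = np.sum(pred * target)
    fn = np.sum((1.0 - pred) * target)   # missed landslide pixels
    fp = np.sum(pred * (1.0 - target))   # false alarms
    ti = tp / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - ti) ** (1.0 / gamma)

target = np.array([1.0, 1.0, 0.0, 0.0])
```

Setting α > β penalizes false negatives more heavily than false positives, which is the usual choice when positive (landslide) pixels are scarce; γ < 1 shifts the gradient toward hard, poorly predicted examples.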

3.6. Evaluation Metrics and Baseline Methods

We compared DEMO-Net with nine representative change-detection baselines: bi-temporal CNN-based methods (FC-EF [46], SNUNet [47], FC-EF-W); bi-temporal CNNs with interaction mechanisms (DMINet [48], A2Net [49]); bi-temporal transformer-based methods (BIT [50], ChangeFormer [51]); a multi-temporal time-series method (SitsSCD [52]); and a diffusion-based method (DDPM-CD [53]). Among them, FC-EF-W is a widened variant of FC-EF in which we manually increase the network width to obtain a parameter-matched capacity baseline; it is introduced solely to control for model size when comparing against DEMO-Net, not as a structurally novel method. All methods were retrained from their official implementations, and all models were trained and evaluated under the same site-wise five-fold protocol and data preprocessing pipeline. For all baseline methods, we follow the original implementations and use only the two bitemporal optical images as inputs; DEMO-Net additionally exploits the DEM-derived topographic priors available in the dataset.
We adopted four standard metrics, computed at the pixel level across the test set: precision (P), recall (R), F1-score (F1), and mIoU, where mIoU averages the per-class IoU; overall accuracy (Acc) is also defined for completeness:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2PR}{P + R}$$
$$IoU = \frac{TP}{TP + FP + FN}$$
$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
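The pixel-level metrics follow directly from the confusion counts, as in this minimal sketch:

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray):
    """Precision, recall, F1, and IoU from binary prediction and ground-truth maps."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    iou = tp / (tp + fp + fn)
    return p, r, f1, iou

p, r, f1, iou = pixel_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
```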

4. Experiments and Results

4.1. Datasets

The dataset used in this study consists of 17 independent landslide sites. The data are fused from GVLM multi-temporal remote sensing imagery and NASA DEM topographic prior data.
GVLM is a large-scale, very-high-resolution (VHR) bi-temporal optical landslide change detection dataset. Each sample provides a pair of bitemporal images and the corresponding ground-truth map. The ground-truth labels in GVLM were manually annotated by image interpretation experts, and we conducted manual spot checks across all sites. Data for sites at 0.59 m spatial resolution were collected from Google Earth. The total mapped area is 163.77 km2, covering 17 diverse regions and land-cover types with wide variation in landslide size, shape, timing, and phenology, as shown in Figure 4 and Table 1. GVLM is specifically designed for evaluating cross-domain generalization capabilities due to its high diversity in geomorphology [21].
We use the NASADEM global digital elevation model released by NASA JPL. To bridge the significant resolution gap between the optical imagery (0.59 m) and the DEM (30 m), we first compute slope and aspect on the DEM grid and then resample the derived priors to the optical patch resolution using bilinear interpolation. Although the upsampled DEM lacks fine-grained micro-topographic detail, it effectively preserves the macro-scale geometric constraints. Since landslide mechanics are fundamentally governed by these regional gravitational controls, the 30 m topographic prior provides sufficient low-frequency guidance to condition the high-frequency optical features.
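The two-step prior derivation described above (slope and aspect on the coarse DEM grid, then bilinear upsampling to the optical resolution) can be sketched in numpy as follows; the function names and the aspect sign convention are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def slope_aspect(dem, cell_size=30.0):
    """Slope (degrees) and aspect (radians) from a DEM via finite differences."""
    dz_dy, dz_dx = np.gradient(dem, cell_size)            # elevation gradients (m/m)
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy))) # steepest-descent angle
    aspect = np.arctan2(-dz_dx, dz_dy)                    # downslope direction (convention varies)
    return slope, aspect

def bilinear_resample(grid, out_shape):
    """Bilinear upsampling of a coarse prior to the optical patch resolution."""
    h, w = grid.shape
    oh, ow = out_shape
    ys = np.linspace(0, h - 1, oh)
    xs = np.linspace(0, w - 1, ow)
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = grid[np.ix_(y0, x0)] * (1 - wx) + grid[np.ix_(y0, x1)] * wx
    bot = grid[np.ix_(y1, x0)] * (1 - wx) + grid[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy
```

In a real pipeline the resampling would typically be delegated to a geospatial library that also handles reprojection, but the sketch shows why the upsampled prior carries only low-frequency terrain information.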
After cropping the original images into nonoverlapping 512 × 512 patches, we obtained 7265 pairs of patches in total for training, validation, and testing.

4.2. Experimental Setup

4.2.1. Data Split Strategy

To rigorously and fairly assess cross-domain generalization, we adopt a site-wise K-fold cross-validation protocol.
Strict Site-Level Isolation: Unlike random splitting at the patch level, which leads to significant information leakage, we partition the data strictly at the site level. The 17 independent sites were randomly shuffled once and divided into five folds with site counts of [3, 3, 3, 4, 4].
Training/Testing Assignment: All 512 × 512 patches are grouped by site ID. To evaluate cross-region generalization, we strictly prevent leakage by ensuring that patches from the same landslide site never appear in both training and testing. Within each fold, patches from the held-out site are used exclusively for testing, while the remaining sites constitute the training set. A small validation subset is sampled only from the training sites for model selection and early stopping.
The protocol prevents site leakage, avoids misleading fold imbalance, and provides a fair basis for comparing cross-domain generalization.
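A minimal sketch of this site-wise partitioning, assuming each patch carries a site ID; the helper names and dictionary layout are hypothetical:

```python
import random

def site_wise_folds(site_ids, fold_sizes=(3, 3, 3, 4, 4), seed=0):
    """Shuffle the sites once, then partition them into folds by site, never by patch."""
    sites = list(site_ids)
    random.Random(seed).shuffle(sites)
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(sites[start:start + size])
        start += size
    return folds

def split_patches(patches, test_sites):
    """Assign every patch by its site ID: held-out sites go to testing only."""
    test = [p for p in patches if p["site"] in test_sites]
    train = [p for p in patches if p["site"] not in test_sites]
    return train, test
```

Because the split is decided before any patch is seen, no 512 × 512 patch from a held-out landslide site can leak into training.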

4.2.2. Topographic Prior Pre-Processing

The topographic priors used in our model are derived from three independent GeoTIFF sources: the NASADEM elevation model, a precomputed slope map, and a precomputed aspect map. Because these layers differ from the optical images in coordinate reference system, spatial resolution, and spatial extent, we define a uniformly spaced metric target grid in the local UTM system for each site using the optical metadata, and reproject the DEM, slope, and aspect rasters to this grid.
To make the priors suitable for neural processing, the three reprojected layers are converted into a four-channel feature tensor composed of elevation, slope, and the sine and cosine of aspect.
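Building the four-channel prior tensor can be sketched as follows; encoding aspect as (sin, cos) avoids the artificial 0°/360° discontinuity of the raw angle. The function name is hypothetical:

```python
import numpy as np

def prior_tensor(elevation, slope, aspect_rad):
    """Stack elevation, slope, sin(aspect), cos(aspect) into a (4, H, W) tensor.

    The (sin, cos) pair represents aspect continuously: 1 degree and 359
    degrees map to nearly identical values instead of opposite extremes.
    """
    return np.stack(
        [elevation, slope, np.sin(aspect_rad), np.cos(aspect_rad)], axis=0
    )
```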

4.2.3. Implementation Details

We trained the network for 100 epochs using the Adam optimizer with an initial learning rate of 1 × 10−4 and a batch size of 8. All experiments were conducted on an NVIDIA GeForce RTX 4060 Ti GPU. No additional balancing strategies were employed to address the skewed pixel distribution.

4.3. Quantitative Results and Analysis

To disentangle the effect of additional modalities from the effect of the proposed architecture, we further introduce an optical-only variant in the ablation experiments. Table 2 reports the mean ± standard deviation over folds for F1, mIoU, precision, recall, and accuracy, as well as model size, FLOPs, and per-patch latency.
As Table 2 shows, DEMO-Net achieves the best scores on all accuracy metrics. It attains an average F1 of 85.17%, surpassing the strongest baseline, FC-EF, by 5.05%, and the best-performing interaction-based method, DMINet, by 6.09%. Notably, DEMO-Net also surpasses the transformer-based baselines BIT (77.39%) and ChangeFormer (75.45%), the multi-temporal time-series method SitsSCD (75.02%), and the diffusion-based baseline DDPM-CD (79.05%), indicating robust advantages over diverse modeling paradigms. mIoU is a more stringent measure of segmentation-mask overlap; DEMO-Net achieves an mIoU of 74.26%, exceeding the second-best FC-EF by a significant margin of 7.20%. Higher mIoU reflects sharper boundaries and better localization of elongated runouts and fragmented failures, consistent with the orientation-adaptive design. In addition, we include FC-EF-W, a widened variant of FC-EF constructed by manually increasing the network width to provide a parameter-matched reference. Despite a parameter count comparable to DEMO-Net, FC-EF-W still lags behind in accuracy, suggesting that the observed gains arise primarily from architectural design choices rather than simply increased model capacity. The joint improvement in precision and recall shows that DEMO-Net reduces both false alarms and misses, and that its predicted landslide areas overlap the ground truth most closely.
The standard deviation reflects performance fluctuation across the five folds and is a key measure of generalization ability and robustness: a lower standard deviation indicates a more stable model. DEMO-Net has an F1 standard deviation of ±2.96% and an mIoU standard deviation of ±4.59%, indicating strong generalization across geographic sites and high performance consistency. In contrast, FC-EF, which has the second-highest average score, exhibits poor stability, with an F1 standard deviation as high as ±5.05% and an mIoU standard deviation of ±7.19%, indicating that its average performance is unreliable and highly sensitive to specific data distributions. SNUNet-CD (±1.84%) and ChangeFormer (±1.89%) show the highest stability, but at the cost of very low average performance. DEMO-Net's gains are consistent across folds, indicating robust generalization under cross-domain splits.
Beyond detection accuracy, practical landslide monitoring demands a balance between model performance and computational complexity. DEMO-Net uses a larger parameter budget (74.27 M) than the CNN baselines but maintains moderate computation, with 31.14 FLOPs and 8.49 ms latency per patch. Transformer-based models such as ChangeFormer incur a massive computational burden due to their quadratic self-attention mechanism, yet fail to yield proportional performance gains, likely because they lack an inductive bias for local topographic features. Conversely, lightweight architectures such as A2Net and BIT operate with low FLOPs but struggle to capture complex landslide boundaries, resulting in suboptimal F1-scores below 78%. In contrast, our method achieves the highest F1-score at a moderate computational cost. This demonstrates that the proposed AOAC module and geomorphic-aware modulation effectively enhance feature representation without the excessive parameter redundancy seen in vision transformers, making the framework well suited to large-scale remote sensing applications.
To compare the models more intuitively, we present the detection results of four representative samples from different test sites in Figure 5. From left to right, Figure 5 shows the T1 image, the T2 image, the ground truth (black denotes true landslide change), and the predictions of FC-EF, SNUNet-CD, BIT, DMINet, A2Net, ChangeFormer, FC-EF-W, SitsSCD, DDPM-CD, and our proposed model. For clarity, yellow denotes correctly detected landslide areas, white denotes prediction boundaries, red denotes false positives, and blue denotes false negatives.
Whether for large, continuous landslide areas or narrow riverbed landslides, our model maximizes ground-truth coverage with a very low false negative rate, consistent with its highest recall in Table 2. It also shows the smallest red FP area: where other models produce significant edge noise and false positives, our model's predictions are remarkably clean, consistent with its highest precision of 87.39% in Table 2. With both FPs and FNs at their lowest, the yellow TP area of DEMO-Net is the largest and its outline is closest to the ground truth.
In contrast, A2Net and ChangeFormer exhibit severe undersegmentation on this dataset: they almost completely miss large landslide areas, leaving the predicted maps covered by extensive blue FN regions, and this high false negative rate is the main reason for their low recall, F1, and mIoU scores. DMINet performs slightly better but still loses nearly half of the change regions. FC-EF, SNUNet-CD, and BIT tend to oversegment, while FC-EF-W, SitsSCD, and DDPM-CD tend to segment incorrectly; these models generate substantial red FP noise at the edges of landslide areas, especially on unchanged riverbanks, and this high false alarm rate lowers their precision scores.
The qualitative analysis agrees closely with the quantitative results. DEMO-Net is capable of simultaneously maintaining high integrity and high purity in highly challenging landslide detection scenarios, handling large-scale continuous changes, slender structures, and scattered small targets. This demonstrates its effectiveness and robustness while keeping computation within practical limits for large-scale landslide change detection.

4.4. Ablation Results and Analysis

To validate the effectiveness of the key components in our proposed model, we designed a series of ablation experiments. Using ResNet-D as the baseline, we constructed four variants by removing or replacing these components one at a time. All experiments use the average of the five-fold cross-validation and the same training configuration as the main experiment to ensure a fair comparison. Table 3 reports the averaged results.
The results of the ablation experiments clearly demonstrate the rationality of our model design; each component plays a crucial role in the final high performance.
Removing the prior information resulted in a 4.67% decrease in F1 and a 6.89% decrease in mIoU. This significant drop indicates that introducing terrain and slope priors is crucial for this task: they help the model eliminate false positives and focus on genuinely high-risk areas.
Replacing the Tversky loss with a BCE + Dice loss causes a 7.35% drop in F1 and a 12.96% drop in mIoU. This result strongly indicates severe class imbalance in the landslide detection task: the BCE + Dice loss performs poorly in this situation, whereas the Tversky loss more effectively balances the weights of false negatives and false positives, significantly improving the model's ability to detect the minority landslide class.
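For reference, a minimal sketch of the Tversky loss on soft predictions; the weights alpha and beta shown here are illustrative defaults, not values reported in the paper:

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-6):
    """Tversky loss: beta > alpha penalizes missed landslide pixels (FN)
    more heavily than false alarms (FP), which helps under class imbalance."""
    tp = np.sum(pred * target)
    fp = np.sum(pred * (1 - target))
    fn = np.sum((1 - pred) * target)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```

With alpha = beta = 0.5 this reduces to the Dice loss; tilting the weights toward beta is what recovers the minority class.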
After replacing FiLM fusion with simple concatenation, the F1 score dropped to 73.25% and mIoU decreased to 57.79%, an 11.92% drop in F1 and a 16.47% drop in mIoU, the second-largest performance degradation across all experiments. This strongly demonstrates that simple feature concatenation is insufficient for handling multimodal data. FiLM, as an adaptive affine transformation, incorporates prior information into the optical features far more effectively than simple fusion.
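The FiLM operation itself is a per-channel affine transform whose parameters are predicted from the conditioning signal, here the terrain priors. A minimal sketch, with gamma and beta assumed to be produced by a small prior-conditioned network (not shown):

```python
import numpy as np

def film_modulate(features, gamma, beta):
    """Feature-wise linear modulation of a (C, H, W) feature map.

    gamma and beta are (C,) vectors predicted from the topographic prior;
    each channel is rescaled and shifted, so the terrain can amplify or
    suppress optical responses instead of merely sitting beside them
    as extra concatenated channels.
    """
    return gamma[:, None, None] * features + beta[:, None, None]
```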
Removing the AOAC caused a performance collapse, with an F1 of only 72.30% and an mIoU of only 56.61%, a 12.87% drop in F1 and a 17.65% drop in mIoU, the largest degradation across all ablation experiments. This result strongly indicates that topography-aware, orientation-adaptive feature extraction yields significant accuracy gains and is crucial for distinguishing landslide areas from complex backgrounds.
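The core idea behind an orientation-adaptive convolution can be sketched as rotating a shared anisotropic sampling grid to a locally estimated angle; in practice such offsets would feed a deformable-convolution-style bilinear sampler. This is an illustrative sketch, not the exact AOAC implementation:

```python
import numpy as np

def rotated_offsets(theta, length=5, width=1):
    """Rotate an anisotropic (length x width) sampling grid to angle theta.

    One set of kernel weights is reused at every orientation; only the
    sampling locations rotate, which avoids learning an independent
    kernel for each angle.
    """
    dy, dx = np.meshgrid(
        np.arange(length) - length // 2,
        np.arange(width) - width // 2,
        indexing="ij",
    )
    base = np.stack([dy.ravel(), dx.ravel()], axis=1).astype(float)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return base @ rot.T  # (length * width, 2) rotated (dy, dx) offsets
```

At theta = 0 the grid samples a vertical strip; at the DEM-estimated slope orientation it aligns with the elongated runout, which is what gives the module its anisotropic inductive bias.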
Beyond verifying the individual effectiveness of each component, Table 3 also reveals the interaction between the geometry-aware AOAC and the terrain-conditioned modulation pathway. The baseline model without prior, FiLM, and AOAC achieves 67.39% F1 and 50.81% mIoU. Introducing AOAC alone substantially boosts performance to 80.50% F1 and 67.37% mIoU, indicating that explicitly modeling local orientation and enforcing anisotropic filtering is critical for delineating slope-parallel landslide runouts. However, removing FiLM while keeping AOAC reduces the accuracy to 73.25% F1 and 57.79% mIoU, implying that orientation-adaptive filtering by itself is vulnerable to domain shift and spurious appearance changes. Conversely, removing AOAC while retaining terrain-conditioned modulation yields 72.30% F1 and 56.62% mIoU, suggesting that terrain-aware feature reweighting alone cannot capture elongated, direction-consistent structures without an explicit anisotropic inductive bias. The full model achieves the best performance, improving over AOAC-only by +4.67% F1 and +6.89% mIoU, which demonstrates a clear complementarity. In essence, FiLM uses the topographic priors to modulate feature responses and stabilize the representation under cross-region appearance variations, while AOAC provides orientation-aligned receptive fields that strengthen the spatial coherence and boundary localization of elongated failures. Their coupling forms a coherent fusion framework in which terrain-conditioned semantics guide orientation-adaptive geometry, leading to consistent gains in overlap-based metrics and a better precision–recall trade-off.
All four ablation experiments show that removing or replacing any key component leads to a significant decrease in performance. Ranked by the magnitude of the F1 decrease, the AOAC module contributes +12.87% F1, FiLM fusion +11.92%, the Tversky loss +7.35%, and the prior information +4.67%. These results collectively validate the integrity and efficiency of the DEMO-Net design.
To intuitively verify each component's contribution to landslide extraction, we visualize the ablation results in Figure 6, which displays samples from three different scenarios. Each column shows the pre-event image (Img1), the post-event image (Img2), the ground-truth mask, and the predictions of the full DEMO-Net model (A0) versus the four variants (A1 to A4). The color coding is as follows: yellow denotes true positives, red denotes false positives, white denotes prediction boundaries, and blue denotes false negatives.
As shown in the A0 and A1 columns, the predictions of A1 contain obvious blue holes inside the main landslide bodies: without the constraint provided by the topographic priors, the model easily confuses spectrally similar but geomorphically stable areas with landslides. In contrast, A0, which fuses the prior information, effectively filters out this background noise and produces cleaner boundaries that better follow the underlying terrain. In the A2 column, especially in the edge regions of the third-row sample, a large number of red false positives emerge, indicating that although the model roughly locates the landslide, it fails to segment it as a coherent, continuous object, leading to severe leakage. These results suggest that the standard BCE + Dice loss is less effective at handling the extreme foreground–background imbalance inherent in landslide detection. The A3 column exhibits extensive blue false negative regions, implying that simple feature concatenation is insufficient for effective multimodal fusion. Finally, A4 not only suffers from pronounced leakage but also shows chaotic interleaving of red and blue pixels along the landslide boundaries. In comparison, A0, equipped with the AOAC module, adaptively aggregates directional features, fills these gaps, and yields dense, compact, and visually complete yellow prediction regions.
The qualitative visualization strongly corroborates the quantitative ablation findings: the AOAC module ensures the internal spatial integrity of landslides; the prior information eliminates background false positives; and FiLM fusion together with the Tversky loss establishes the foundation for robust feature extraction and classification. These four components are indispensable and collectively contribute to the superior detection performance of the proposed method.

5. Conclusions

This work tackles landslide mapping with bi-temporal remote sensing by addressing the geometric and domain-shift limitations of existing detectors. We introduce DEMO-Net for landslide change detection in multisource remote sensing imagery. The network adopts a Siamese encoder–decoder and integrates key components that couple geometry with semantics. First, an AOAC estimates local orientation with DEM guidance and dynamically rotates anisotropic kernels to follow slope-parallel structures; this design enforces strong parameter sharing across orientations and avoids the parameter explosion that can occur when learning independent kernels for each angle. Second, a topography-aware modulation pathway injects slope, aspect, and relief embeddings to aggregate cues within geomorphically coherent neighborhoods. Extensive experiments on the GVLM benchmark, covering 17 heterogeneous landslide sites under a rigorous site-wise five-fold cross-region protocol, demonstrate consistent gains over strong CNN and transformer baselines: DEMO-Net achieves 85.17% F1 and 74.26% mIoU, outperforming the strongest CNN baseline FC-EF by 5.05% in F1 and 7.20% in mIoU, while producing sharper boundaries and improved detection of small and elongated failures at moderate computational cost.
Ablation studies further verify the contribution of each component: AOAC improves orientation-sensitive delineation, terrain-conditioned modulation enhances boundary fidelity and suppresses spurious changes, and the stabilization strategy strengthens generalization to unseen regions.
DEMO-Net is primarily designed for landslide detection in topographically complex regions where relief, slope, and aspect provide informative priors for constraining change patterns and improving cross-site robustness. Future work will extend the framework to broader geohazards and richer modalities, including SAR, InSAR, LiDAR, and rainfall products. We will extend our evaluation by collecting additional datasets from low-relief plains and conducting relief-stratified benchmarking to further assess the generalization of DEM-guided change detection beyond mountainous terrains. These efforts aim to advance reliable, transferable, and efficient landslide monitoring at regional and global scales.

Author Contributions

Methodology, Z.F.; resources, S.W. and G.N.; writing—original draft preparation, J.W.; writing—review and editing, H.L.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Shaanxi Provincial Natural Science Foundation (grants 2025JC-YBQN-403 and 2025JC-YBMS-680).

Data Availability Statement

Publicly available datasets were analyzed in this study. The GVLM dataset can be found here: https://github.com/zxk688/GVLM?tab=readme-ov-file (24 January 2026). The NASADEM topographic data is available at https://www.earthdata.nasa.gov/topics/land-surface/digital-elevation-terrain-model-dem (24 January 2026). The processed data and source code supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Casagli, N.; Intrieri, E.; Tofani, V.; Gigli, G.; Raspini, F. Landslide Detection, Monitoring and Prediction with Remote-Sensing Techniques. Nat. Rev. Earth Environ. 2023, 4, 51–64. [Google Scholar] [CrossRef]
  2. Keefer, D.K.; Larsen, M.C. Assessing Landslide Hazards. Science 2007, 316, 1136–1138. [Google Scholar] [CrossRef]
  3. Hungr, O.; Leroueil, S.; Picarelli, L. The Varnes Classification of Landslide Types, an Update. Landslides 2013, 11, 167–194. [Google Scholar] [CrossRef]
  4. Guzzetti, F.; Mondini, A.C.; Cardinali, M.; Fiorucci, F.; Santangelo, M.; Chang, K.-T. Landslide Inventory Maps: New Tools for an Old Problem. Earth-Sci. Rev. 2012, 112, 42–66. [Google Scholar] [CrossRef]
  5. Murillo-García, F.G.; Alcántara-Ayala, I.; Ardizzone, F.; Cardinali, M.; Fiourucci, F.; Guzzetti, F. Satellite Stereoscopic Pair Images of Very High Resolution: A Step Forward for the Development of Landslide Inventories. Landslides 2014, 12, 277–291. [Google Scholar] [CrossRef]
  6. Martha, T.R.; van Westen, C.J.; Kerle, N.; Jetten, V.; Vinod Kumar, K. Landslide Hazard and Risk Assessment Using Semi-Automatically Created Landslide Inventories. Geomorphology 2013, 184, 139–150. [Google Scholar] [CrossRef]
  7. Li, Z.; Shi, W.; Myint, S.W.; Lu, P.; Wang, Q. Semi-Automated Landslide Inventory Mapping from Bitemporal Aerial Photographs Using Change Detection and Level Set Method. Remote Sens. Environ. 2016, 175, 215–230. [Google Scholar] [CrossRef]
  8. Ji, S.; Yu, D.; Shen, C.; Li, W.; Xu, Q. Landslide Detection from an Open Satellite Imagery and Digital Elevation Model Dataset Using Attention Boosted Convolutional Neural Networks. Landslides 2020, 17, 1337–1352. [Google Scholar] [CrossRef]
  9. Nava, L.; Bhuyan, K.; Meena, S.R.; Monserrat, O.; Catani, F. Rapid Mapping of Landslides on SAR Data by Attention U-Net. Remote Sens. 2022, 14, 1449. [Google Scholar] [CrossRef]
  10. Lu, P.; Qin, Y.; Li, Z.; Mondini, A.C.; Casagli, N. Landslide Mapping from Multi-Sensor Data through Improved Change Detection-Based Markov Random Field. Remote Sens. Environ. 2019, 231, 111235. [Google Scholar] [CrossRef]
  11. Mora, O.E.; Lenzano, M.G.; Toth, C.K.; Grejner-Brzezinska, D.A.; Fayne, J.V. Landslide Change Detection Based on Multi-Temporal Airborne LiDAR-Derived DEMs. Geosciences 2018, 8, 23. [Google Scholar] [CrossRef]
  12. Zhu, X.; Helmer, E.H. An Automatic Method for Screening Clouds and Cloud Shadows in Optical Satellite Image Time Series in Cloudy Regions. Remote Sens. Environ. 2018, 214, 135–153. [Google Scholar] [CrossRef]
  13. Li, Z.; Shen, H.; Weng, Q.; Zhang, Y.; Dou, P.; Zhang, L. Cloud and Cloud Shadow Detection for Optical Satellite Imagery: Features, Algorithms, Validation, and Prospects. ISPRS J. Photogramm. Remote Sens. 2022, 188, 89–108. [Google Scholar] [CrossRef]
  14. Gong, Z.; Ge, W.; Guo, J.; Liu, J. Satellite Remote Sensing of Vegetation Phenology: Progress, Challenges, and Opportunities. ISPRS J. Photogramm. Remote Sens. 2024, 217, 149–164. [Google Scholar] [CrossRef]
  15. Ji, C.; Tang, H. Towards Reliable Land Cover Mapping under Domain Shift: An Overview and Comprehensive Comparative Study on Uncertainty Estimation. Earth-Sci. Rev. 2025, 263, 105070. [Google Scholar] [CrossRef]
  16. Wei, R.; Li, Y.; Li, Y.; Zhang, B.; Wang, J.; Wu, C.; Yao, S.; Ye, C. A Universal Adapter in Segmentation Models for Transferable Landslide Mapping. ISPRS J. Photogramm. Remote Sens. 2024, 218, 446–465. [Google Scholar] [CrossRef]
  17. Gao, L.; Zhang, L.M.; Chen, H.X.; Fei, K.; Hong, Y. Topography and Geology Effects on Travel Distances of Natural Terrain Landslides: Evidence from a Large Multi-Temporal Landslide Inventory in Hong Kong. Eng. Geol. 2021, 292, 106266. [Google Scholar] [CrossRef]
  18. Guo, J.; Wang, Y.; Li, Y. Topographic Controls on the Initiation and Transport of Landslide-Triggered Debris Flows. Geomorphology 2025, 486, 109901. [Google Scholar] [CrossRef]
  19. Chen, T.; Trinder, J.C.; Niu, R. Object-Oriented Landslide Mapping Using ZY-3 Satellite Imagery, Random Forest and Mathematical Morphology, for the Three-Gorges Reservoir, China. Remote Sens. 2017, 9, 333. [Google Scholar] [CrossRef]
  20. Chen, J.; Liu, J.; Zeng, X.; Zhou, S.; Sun, G.; Rao, S.; Guo, Y.; Zhu, J. A Cross-Domain Landslide Extraction Method Utilizing Image Masking and Morphological Information Enhancement. Remote Sens. 2025, 17, 1464. [Google Scholar] [CrossRef]
  21. Zhang, X.; Yu, W.; Pun, M.-O.; Shi, W. Cross-Domain Landslide Mapping from Large-Scale Remote Sensing Images Using Prototype-Guided Domain-Aware Progressive Representation Learning. ISPRS J. Photogramm. Remote Sens. 2023, 197, 1–17. [Google Scholar] [CrossRef]
  22. Tavakkoli Piralilou, S.; Shahabi, H.; Jarihani, B.; Ghorbanzadeh, O.; Blaschke, T.; Gholamnia, K.; Meena, S.R.; Aryal, J. Landslide Detection Using Multi-Scale Image Segmentation and Different Machine Learning Models in the Higher Himalayas. Remote Sens. 2019, 11, 2575. [Google Scholar] [CrossRef]
  23. Zhiyong, L.; Liu, T.; Wang, R.Y.; Benediktsson, J.A.; Saha, S. Automatic Landslide Inventory Mapping Approach Based on Change Detection Technique with Very-High-Resolution Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6000805. [Google Scholar] [CrossRef]
  24. Li, Z.; Shi, W.; Lu, P.; Yan, L.; Wang, Q.; Miao, Z. Landslide Mapping from Aerial Photographs Using Change Detection-Based Markov Random Field. Remote Sens. Environ. 2016, 187, 76–90. [Google Scholar] [CrossRef]
  25. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep Learning in Remote Sensing Applications: A Meta-Analysis and Review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
  26. Novellino, A.; Pennington, C.; Leeming, K.; Taylor, S.; Alvarez, I.; McAllister, E.; Arnhardt, C.; Winson, A. Mapping Landslides from Space: A Review. Landslides 2024, 21, 1041–1052. [Google Scholar] [CrossRef]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  28. Tek, F.B.; Çam, İ.; Karlı, D. Adaptive Convolution Kernel for Artificial Neural Networks. J. Vis. Commun. Image Represent. 2021, 75, 103015. [Google Scholar] [CrossRef]
  29. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 29, Proceedings of the NIPS 2016; NeurIPS Foundation: San Diego, CA, USA, 2017. [Google Scholar]
  30. Li, L.; Lan, H.; Strom, A.; Macciotta, R. Landslide Longitudinal Shape: A New Concept for Complementing Landslide Aspect Ratio. Landslides 2022, 19, 1143–1163. [Google Scholar] [CrossRef]
  31. Du, J.; Song, C.; Li, Z.; Tomás, R.; Li, Z. Kinematic Behavior and Sliding Geometry of Large Anthropogenic-Induced Landslides Using Three-Dimensional Time Series InSAR: Insights from the Li-Kan Road Landslide. Landslides 2025, 22, 3319–3333. [Google Scholar] [CrossRef]
  32. Han, J.; Ding, J.; Xue, N.; Xia, G.-S. ReDet: A Rotation-Equivariant Detector for Aerial Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  33. Fu, K.; Chang, Z.; Zhang, Y.; Xu, G.; Zhang, K.; Sun, X. Rotation-Aware and Multi-Scale Convolutional Neural Network for Object Detection in Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2020, 161, 294–308. [Google Scholar] [CrossRef]
  34. Wang, K.; Wang, Z.; Li, Z.; Su, A.; Teng, X.; Pan, E.; Liu, M.; Yu, Q. Oriented Object Detection in Optical Remote Sensing Images Using Deep Learning: A Survey. Artif. Intell. Rev. 2025, 58, 350. [Google Scholar] [CrossRef]
  35. Wen, L.; Cheng, Y.; Fang, Y.; Li, X. A Comprehensive Survey of Oriented Object Detection in Remote Sensing Images. Expert Syst. Appl. 2023, 224, 119960. [Google Scholar] [CrossRef]
  36. Li, H.; Liu, X.; Li, H.; Dong, Z.; Xiao, X. MDFENet: A Multiscale Difference Feature Enhancement Network for Remote Sensing Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3104–3115. [Google Scholar] [CrossRef]
  37. Wu, Y.; Bai, Z.; Miao, Q.; Ma, W.; Yang, Y.; Gong, M. A Classified Adversarial Network for Multi-Spectral Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 2098. [Google Scholar] [CrossRef]
  38. Han, C.; Su, X.; Wei, Z.; Hu, M.; Xu, Y. HSANET: A Hybrid Self-Cross Attention Network for Remote Sensing Change Detection. In Proceedings of the 2025 IEEE International Geoscience and Remote Sensing Symposium, Brisbane, Australia, 3–8 August 2025. [Google Scholar]
  39. Han, C.; Wu, C.; Guo, H.; Hu, M.; Chen, H. HANet: A Hierarchical Attention Network for Change Detection with Bitemporal Very-High-Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3867–3878. [Google Scholar] [CrossRef]
  40. Lei, T.; Xu, Y.; Ning, H.; Lv, Z.; Min, C.; Jin, Y.; Nandi, A.K. Lightweight Structure-Aware Transformer Network for Remote Sensing Image Change Detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6000305. [Google Scholar] [CrossRef]
  41. Zhang, Z.; Liu, S.; Qin, Y.; Wang, H. MATNet: Multilevel Attention-Based Transformers for Change Detection in Remote Sensing Images. Image Vis. Comput. 2024, 151, 105294. [Google Scholar] [CrossRef]
  42. Sun, Y.; Fu, Z.; Sun, C.; Hu, Y.; Zhang, S. Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5404418. [Google Scholar] [CrossRef]
  43. Wei, T.; Chen, H.; Wang, J.; Liu, W. MDFNet: Multimodal Feature Decomposition and Fusion Network for Multimodal Remote Sensing Image Semantic Segmentation. In Proceedings of the 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Zhuhai, China, 22–24 November 2024; pp. 1–5. [Google Scholar]
  44. Feng, J.; Li, S.; Dong, S. Hierarchical Feature Integration and Fusion for Remote Sensing Visual Question Answering. Displays 2025, 90, 103099. [Google Scholar] [CrossRef]
  45. Lu, K.; Huang, X.; Xia, R.; Zhang, P.; Shen, J. Cross Attention Is All You Need: Relational Remote Sensing Change Detection with Transformer. GIScience Remote Sens. 2024, 61, 2380126. [Google Scholar] [CrossRef]
  46. Caye Daudt, R.; Le Saux, B.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
  47. Li, K.; Li, Z.; Fang, S. Siamese NestedUNet Networks for Change Detection of High Resolution Satellite Image. In Proceedings of the 2020 1st International Conference on Control, Robotics and Intelligent System, Xiamen, China, 27–29 October 2020; Association for Computing Machinery: New York, NY, USA, 2021; pp. 42–48. [Google Scholar]
  48. Li, Z.; Tang, C.; Liu, X.; Zhang, W.; Dou, J.; Wang, L.; Zomaya, A.Y. Lightweight remote sensing change detection with progressive feature aggregation and supervised attention. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602812. [Google Scholar] [CrossRef]
  49. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607514. [Google Scholar] [CrossRef]
  50. Feng, Y.; Jiang, J.; Xu, H.; Zheng, J. Change detection on remote sensing images using dual-branch multilevel intertemporal network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4401015. [Google Scholar] [CrossRef]
  51. Bandara, W.G.C.; Patel, V.M. A transformer-based Siamese network for change detection. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210. [Google Scholar]
  52. Vincent, E.; Ponce, J.; Aubry, M. Satellite image time series semantic change detection: Novel architecture and analysis of domain shift. arXiv 2024, arXiv:2407.07616. [Google Scholar] [CrossRef]
  53. Bandara, W.G.C.; Nair, N.G.; Patel, V.M. DDPM-CD: Denoising Diffusion Probabilistic Models for Change Detection. arXiv 2022, arXiv:2206.11892. [Google Scholar]
Figure 1. Architectural illustration of the proposed DEMO-Net.
Figure 2. Illustration of the sampling mechanism. (a) Standard Convolution. (b) Our proposed AOAC.
Figure 3. Architecture illustration of the proposed AOAC module.
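Figures 2 and 3 contrast the fixed square sampling grid of a standard convolution with orientation-adaptive sampling. As an illustrative sketch only (the paper's actual AOAC offset parameterization is not reproduced here), rotating a standard 3 × 3 kernel grid by a locally predicted angle captures the basic idea:

```python
import numpy as np

def rotated_sampling_grid(theta: float, k: int = 3) -> np.ndarray:
    """Rotate the (dy, dx) sampling offsets of a k x k convolution kernel
    by angle theta (radians). theta = 0 recovers the standard grid of
    Figure 2a; a nonzero theta aligns sampling with a local orientation,
    e.g. the downslope axis of an elongated landslide scar."""
    r = (k - 1) / 2
    ys, xs = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    offsets = np.stack([ys.ravel(), xs.ravel()], axis=1)  # shape (k*k, 2)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return offsets @ rot.T  # rotated (dy, dx) offsets
```

In a full AOAC layer the angle would be predicted per location (as in deformable convolution) and the feature map sampled at the rotated offsets via bilinear interpolation; only the grid geometry is sketched here.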
Figure 4. Spatial distribution of the landslide sites in the GVLM dataset [21].
Figure 5. Landslide detection results compared with six state-of-the-art approaches.
Figure 6. Landslide detection results of the ablation study.
Table 1. Detailed information of the landslide dataset used in this study.
| Country | City | Central Coordinates | Image Size | Time 1 | Time 2 | Triggers |
|---|---|---|---|---|---|---|
| Vietnam | A Luoi | 107.321°E, 16.406°N | 7346 × 4096 | 02/2018 | 02/2021 | Rainfall |
| Japan | Asakura | 130.78°E, 33.402°N | 5632 × 3584 | 04/2017 | 09/2017 | Earthquake |
| Iceland | Askja | 16.732°W, 65.106°N | 4151 × 2763 | 09/2017 | 08/2020 | Snow and glacier melting |
| United States | Big Sur | 121.43°W, 35.865°N | 1748 × 1748 | 04/2015 | 06/2017 | Loose soil and rock splitting |
| Zimbabwe | Chimanimani | 32.870°E, 19.818°S | 10,808 × 7424 | 11/2018 | 03/2019 | Tropical cyclone |
| China | Jiuzhaigou | 103.787°E, 33.288°N | 5888 × 6313 | 12/2015 | 08/2017 | Earthquake |
| New Zealand | Kaikoura | 173.824°E, 42.245°S | 4977 × 3897 | 03/2016 | 11/2016 | Earthquake |
| India | Kodagu | 75.636°E, 12.470°N | 8704 × 6912 | 03/2017 | 10/2018 | Rainfall |
| Indonesia | Kupang | 123.645°E, 10.206°S | 1946 × 1319 | 02/2021 | 04/2021 | Rainfall |
| Turkey | Kurucasile | 32.607°E, 41.802°N | 8192 × 4608 | 10/2015 | 06/2017 | Flood |
| Chile | Los Lagos | 72.384°W, 43.384°S | 8533 × 4077 | 09/2013 | 01/2018 | Glacier melting and rainfall |
| Kyrgyzstan | Osh | 73.308°E, 40.605°N | 8860 × 7193 | 06/2016 | 06/2018 | Melting snow and rainfall |
| Brazil | Santa Catarina | 49.604°W, 27.075°S | 4864 × 3072 | 11/2018 | 02/2021 | Torrential rain |
| China | Shimen | 110.652°E, 29.890°N | 1861 × 1749 | 02/2018 | 11/2020 | Rainfall |
| China | Taitung | 120.909°E, 22.851°N | 3840 × 3840 | 03/2010 | 10/2011 | Typhoon and rainfall |
| Georgia | Tbilisi | 44.674°E, 41.689°N | 5588 × 5632 | 08/2013 | 06/2015 | Flood |
| Mexico | Tenejapa | 92.551°W, 16.809°N | 4200 × 1301 | 07/2020 | 02/2021 | Hurricane |
Table 2. Performance comparison of different models in landslide detection.
| Model | F1 | mIOU | Precision | Recall | Acc | Params (M) | FLOPs | Time (ms) |
|---|---|---|---|---|---|---|---|---|
| FC-EF | 80.12 ± 5.05 | 67.06 ± 7.19 | 77.72 ± 5.19 | 83.11 ± 8.31 | 97.73 ± 0.45 | 7.745 | 12.586 | 1.73 |
| SNUNet-CD | 77.17 ± 1.84 | 62.85 ± 2.45 | 76.29 ± 10.39 | 80.49 ± 12.68 | 97.24 ± 1.01 | 10.276 | 23.09 | 3.03 |
| BIT | 77.39 ± 1.99 | 63.15 ± 2.67 | 75.76 ± 8.57 | 81.09 ± 12.03 | 97.26 ± 0.96 | 11.913 | 8.484 | 1.95 |
| DMINet | 79.08 ± 2.58 | 65.45 ± 3.57 | 77.94 ± 7.01 | 81.56 ± 10.45 | 97.54 ± 0.68 | 6.754 | 14.476 | 11.62 |
| A2Net | 72.12 ± 2.45 | 56.44 ± 2.96 | 67.79 ± 12.00 | 80.80 ± 14.19 | 96.57 ± 1.15 | 3.78 | 3.048 | 4.03 |
| ChangeFormer | 75.45 ± 1.89 | 60.61 ± 2.44 | 77.34 ± 11.71 | 76.66 ± 14.51 | 97.15 ± 1.14 | 41.015 | 202.624 | 13.07 |
| FC-EF-W | 78.32 ± 4.50 | 60.01 ± 5.24 | 67.70 ± 4.52 | 84.30 ± 8.49 | 97.98 ± 0.78 | 70.661 | 12.25 | 6.52 |
| SitsSCD | 75.02 ± 3.46 | 59.99 ± 5.49 | 67.76 ± 8.59 | 84.00 ± 7.45 | 97.54 ± 0.69 | 0.263 | 6.48 | 9.71 |
| DDPM-CD | 79.05 ± 5.37 | 65.36 ± 6.73 | 67.76 ± 8.59 | 87.42 ± 11.23 | 97.32 ± 0.86 | 35.82 | 390 | 340.08 |
| DEMO-Net (Ours) | 85.17 ± 2.96 | 74.26 ± 4.59 | 87.39 ± 4.45 | 83.57 ± 7.55 | 98.05 ± 0.17 | 74.274 | 31.14 | 8.49 |
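Assuming the standard binary-change definitions for the metrics reported in Table 2 (the paper's exact averaging protocol, e.g. per-site vs. per-pixel, is not specified here), F1, precision, recall, accuracy, and mIoU follow from a pixel-level confusion matrix:

```python
import numpy as np

def change_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Binary change-detection metrics from predicted and ground-truth
    change masks (1 = landslide change, 0 = background). mIoU averages
    the IoU of the change and background classes. Assumes both classes
    occur, so no division-by-zero guards are included."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)     # change pixels correctly detected
    fp = np.sum(pred & ~gt)    # background flagged as change
    fn = np.sum(~pred & gt)    # missed change
    tn = np.sum(~pred & ~gt)   # background correctly kept
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "Acc": (tp + tn) / (tp + fp + fn + tn),
        "mIOU": (tp / (tp + fp + fn) + tn / (tn + fp + fn)) / 2,
    }
```

A perfect change map yields 1.0 for every metric, which is a quick sanity check on the implementation.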
Table 3. Ablation comparison of different modules in landslide detection.
| Model | Description | Prior | FiLM | AOAC | Loss | F1 | mIOU | Precision | Acc | Recall |
|---|---|---|---|---|---|---|---|---|---|---|
| A0 | DEMO-Net (Ours) | ✓ | ✓ | ✓ | ✓ | 85.17 | 74.26 | 87.39 | 98.05 | 83.57 |
| A1 | AOAC only | × | × | ✓ | ✓ | 80.50 | 67.37 | 85.09 | 90.99 | 79.79 |
| A2 | w/o Focal Tversky Loss | ✓ | ✓ | ✓ | × | 77.82 | 61.30 | 82.48 | 93.04 | 73.54 |
| A3 | w/o FiLM Modulation | ✓ | × | ✓ | ✓ | 73.25 | 57.79 | 77.78 | 96.50 | 69.22 |
| A4 | w/o AOAC Module | ✓ | ✓ | × | ✓ | 72.30 | 56.62 | 76.80 | 97.51 | 68.30 |
| A5 | Baseline | × | × | × | ✓ | 67.39 | 50.81 | 69.81 | 90.15 | 65.13 |
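The FiLM column in Table 3 refers to Feature-wise Linear Modulation, in which DEM-derived terrain features produce a per-channel scale γ and shift β applied to the image features. A minimal sketch, assuming a single linear projection from a hypothetical pooled DEM embedding (the paper's actual conditioning network is not reproduced here):

```python
import numpy as np

def film_modulate(feat: np.ndarray, gamma: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """FiLM: scale and shift each channel of a (C, H, W) feature map with
    terrain-conditioned parameters gamma, beta of shape (C,)."""
    return feat * gamma[:, None, None] + beta[:, None, None]

# Hypothetical sizes: C feature channels, D-dimensional DEM embedding.
rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 16
feat = rng.standard_normal((C, H, W))
dem_emb = rng.standard_normal(D)  # e.g. pooled output of a slope/aspect encoder
W_gamma, W_beta = rng.standard_normal((D, C)), rng.standard_normal((D, C))
modulated = film_modulate(feat, dem_emb @ W_gamma, dem_emb @ W_beta)
```

With γ = 1 and β = 0 the modulation is the identity, which is one common initialization so that training starts from the unconditioned baseline.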
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, J.; Li, H.; Wu, S.; Nie, G.; Yu, Y.; Fan, Z. DEM-Assisted Topography-Conditioned and Orientation-Adaptive Siamese Network for Cross-Region Landslide Change Detection. Remote Sens. 2026, 18, 702. https://doi.org/10.3390/rs18050702



