Language-Guided Contrastive Learning and Difference Enhancement for Semantic Change Detection in Remote Sensing Images

Hu, Yongli; Ren, Lintian; Jiang, Huajie; Guo, Kan; Liu, Tengfei; Gao, Junbin; Sun, Yanfeng; Yin, Baocai

doi:10.3390/rs18060964

by Yongli Hu ¹,*, Lintian Ren ¹, Huajie Jiang ¹, Kan Guo ¹, Tengfei Liu ¹, Junbin Gao ², Yanfeng Sun ¹ and Baocai Yin ¹

¹ Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China

² Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, Camperdown, NSW 2006, Australia

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(6), 964; https://doi.org/10.3390/rs18060964

Submission received: 28 January 2026 / Revised: 4 March 2026 / Accepted: 10 March 2026 / Published: 23 March 2026

Highlights

What are the main findings?

We propose LGDENet, a lightweight framework that unifies Language-Guided Contrastive Learning with a difference enhancement mechanism.
The method achieves state-of-the-art accuracy on the SECOND and Landsat-SCD datasets while maintaining high computational efficiency compared to foundation model-based approaches.

What are the implications of the main findings?

The language-guided strategy aligns visual features with text prompts, effectively resolving directional semantic ambiguities in “from–to” transitions.
The Difference Enhancement Module (DEM) and the hybrid encoder decouple spatial and channel information to adaptively suppress pseudo-change noise, such as registration errors.

Abstract

Semantic change detection (SCD) in remote sensing images aims not only to localize changed regions but also to identify their specific “from–to” semantic transitions. This task remains challenging due to the inherent semantic ambiguity of spectral changes and the presence of pseudo-change noise. While recent vision–language models have shown promise in remote sensing, existing approaches like RemoteCLIP predominantly focus on static scene classification, lacking the ability to explicitly model dynamic temporal transitions. Other adaptations of foundation models (e.g., AdaptVFMs-RSCD) often rely on heavy backbones, incurring prohibitive computational costs. To address these limitations, this paper proposes LGDENet, a lightweight, end-to-end framework that unifies Language-Guided Temporal Contrastive Learning with a noise-robust difference enhancement mechanism. Specifically, we construct a temporal transition prompt learning strategy that aligns visual difference features with textual descriptions of dynamic processes, thereby resolving directional semantic ambiguities. Furthermore, we introduce a Difference Enhancement Module (DEM) that leverages the channel–spatial decoupling property of depthwise separable convolutions to adaptively isolate and suppress irrelevant variations (e.g., registration errors) before feature fusion. Experiments on the SECOND and Landsat-SCD datasets demonstrate that LGDENet achieves state-of-the-art performance, yielding a semantic F1 score ($F_{scd}$) of 87.90% and 88.71%, respectively. Moreover, with a modest parameter count of 33.45 M, it offers a superior trade-off between accuracy and efficiency compared to heavy foundation model-based approaches.

Keywords:

semantic change detection; remote sensing; multimodal learning; contrastive learning; vision transformer

1. Introduction

Remote sensing (RS) has become a central modality for documenting the surface dynamics of the Earth, largely due to rapid advances in the spatial and spectral resolution of satellite sensors [1,2]. The availability of high-resolution imagery further enables the fine-grained observation of land cover evolution [3,4]. Within this context, change detection (CD), a core task in RS image analysis, infers land cover transitions by contrasting two acquisitions of the same area captured at different times [5]. Such capability underpins a broad range of applications, including flood assessment [6], time-critical disaster response [7], military situational awareness, and evidence-based urban planning [8].

With the maturation of deep learning techniques, numerous convolution-based architectures have demonstrated strong representational capacity for CD tasks [9,10,11,12,13,14,15]. Beyond convolutional backbones, Transformer-centric change detectors have progressed rapidly, frequently utilizing CNN–Transformer co-designs to integrate fine-scale spatial cues with holistic context. Zhang et al. [16] proposed SwinSUNet, in which Swin Transformer blocks model long-range dependencies and a decoder recovers spatial resolution to yield accurate maps. Similarly, Li et al. [17] proposed TransUNetCD, combining a CNN encoder with Transformer modules for global context aggregation. Bandara et al. [18] further developed ChangeFormer, a pure Transformer-based architecture built on SegFormer, demonstrating state-of-the-art performance on large-scale datasets. In addition, unsupervised [19] and self-supervised [20] CD frameworks have emerged, highlighting the importance of reducing reliance on large-scale labeled datasets. Recent surveys document the progress of deep learning-based CD, yet also underscore unresolved challenges [21,22,23,24].

While traditional binary CD methods excel at localizing where change occurs, they fail to identify what has changed [25,26,27,28]. This limitation has driven the shift toward semantic change detection (SCD), which simultaneously localizes changes and identifies the semantic categories of the “from–to” transition (e.g., from “forest” to “farmland”). To address this complex task, existing SCD techniques have generally evolved into post-classification, direct classification, and multi-task structural learning approaches. While early post-classification strategies often suffer from cumulative errors, recent multi-task frameworks effectively mitigate this by jointly optimizing binary change boundaries and semantic transitions. Under this paradigm, representative approaches such as SCDNet [29], Bi-SRNet [30], and recent hybrid attention networks [31] have improved performance by employing dual-branch encoders and sophisticated attention mechanisms to align bi-temporal features. Multi-task structural methods further enhance accuracy by jointly learning land cover classification and change masks [32,33,34,35]. Methods like SSESN [33] and GAPL-SCD [34] have introduced multi-scale spatial interactions and prototype learning to enhance semantic coherence and resolve task conflicts. More recently, Transformer-based methods, including Pyramid-SCDFormer [36] and ChangeMask [32], have demonstrated notable improvements in modeling long-range dependencies and capturing fine-grained semantics.

Furthermore, contrastive learning paradigms have been successfully explored to extract robust change features and mitigate noise interference in other remote sensing modalities. For instance, recent studies have developed self-supervised contrastive learning frameworks to effectively suppress complex real noise and extract discriminative variation information in hyperspectral image change detection [37].

Despite these advancements, SCD remains plagued by two fundamental issues: semantic ambiguity and pseudo-change noise. First, visual features alone are often insufficient to distinguish spectrally similar but semantically distinct transitions (e.g., distinguishing “bare land to built-up” from “bare land to road” in low-contrast scenarios). Second, bitemporal images are rife with irrelevant variations caused by seasonal shifts, illumination differences, and registration errors, which conventional difference modules often conflate with real changes.

Recently, vision–language models (VLMs) like CLIP [38] have revolutionized computer vision by aligning visual and textual representations. In the remote sensing domain, Zhang et al. [39] proposed LDGnet, which incorporates both coarse and fine-grained textual prompts to enhance cross-scene generalization in hyperspectral classification. Wang et al. [40] extended this idea by utilizing explicit high-level semantics (EHS) from text to improve robustness to domain shifts. Models such as RemoteCLIP [41] and RS-CLIP [42] have successfully adapted this paradigm for zero-shot scene classification. However, these methods are designed for static scene understanding and lack the capability to model the dynamic temporal transitions required for SCD. More recent works like AdaptVFMs-RSCD [43] have attempted to bridge this gap by adapting heavy foundation models (e.g., SAM [44] and CLIP) for change detection. While effective, these approaches rely on computationally expensive backbones with hundreds of millions of parameters, limiting their practical deployability in resource-constrained RS applications.

To address these challenges without incurring the computational penalty of heavy foundation models, we propose LGDENet, a lightweight, end-to-end framework that introduces two key innovations: Language-Guided Temporal Contrastive Learning and a Difference Enhancement Module (DEM). Unlike previous static VLMs, our language branch specifically models temporal transitions by aligning visual difference features with text prompts describing dynamic processes (e.g., “a remote sensing image where [class A] changes to [class B]”). This explicit modeling of transition semantics helps resolve directional ambiguities that purely visual models struggle with.

Furthermore, we tackle the issue of pseudo-changes through a theoretically grounded Difference Enhancement Module (DEM). Unlike standard convolutions that mix channel information immediately—potentially propagating noise—our DEM leverages depthwise separable convolutions (DSConv) to decouple spatial filtering from channel fusion. This design allows the network to isolate and suppress high-frequency spatial noise (such as registration errors) in individual channels before integrating them into semantic features, effectively acting as a learnable noise filter. Combined with a hybrid encoder that balances the local texture extraction of asymmetric convolutions with the global context modeling of Swin Transformers, LGDENet achieves state-of-the-art performance with significantly fewer parameters than foundation model-based approaches.

The major contributions of this work are summarized as follows:

We propose a Language-Guided Contrastive Learning framework specifically for SCD. Unlike static scene classification models, our method aligns visual difference vectors with text prompts representing temporal transitions, effectively resolving semantic ambiguity in complex change scenarios.
We design a Difference Enhancement Module (DEM) that utilizes the channel–spatial decoupling property of DSConv to suppress irrelevant variations (e.g., registration noise) while enhancing genuine semantic changes.
A hybrid encoder architecture is proposed to effectively balance local feature refinement and global semantic reasoning by combining the asymmetric convolution, the Squeeze-and-Excitation (SE) module, and the hierarchical Swin Transformer.

The remainder of this paper is structured as follows. Section 2 surveys prior work on CD, SCD, and multimodal contrastive learning. Section 3 introduces the proposed LGDENet in detail. Section 4 reports the experimental settings and results. Section 5 concludes the study and outlines avenues for future research.

2. Materials and Methods

2.1. Proposed LGDENet Framework

In this section, the framework of the proposed LGDENet is first introduced, and then its main components are described in detail.

The framework of the proposed LGDENet is shown in Figure 1, in which the main components include the image encoder, the difference enhancement module, and the Language-Guided Contrastive Learning module.

Given a pair of bitemporal remote sensing images, LGDENet uses a hybrid image encoder to extract multi-level representations. This encoder combines asymmetric convolution for preserving directional details, employs channel attention for feature recalibration, and utilizes a hierarchical Swin Transformer backbone to effectively capture long-range dependencies and cross-temporal relationships. This design effectively balances local detail sensitivity and global semantic reasoning.

To emphasize reliable structural variations, a Difference Enhancement Module (DEM) is introduced by combining subtraction- and concatenation-based feature interactions followed by depthwise separable convolution, which adaptively enhances discriminative difference cues while suppressing background noise, ensuring that subsequent reasoning is focused on genuine semantic transitions. Additionally, LGDENet employs a multi-task prediction design: (i) a segmentation branch produces pixel-level semantic change maps, and (ii) an MLP-based classification branch predicts image-level change categories. This dual supervision enforces both global and local semantic consistency.

Differing from the current methods, LGDENet incorporates high-level semantic priors by a Language-Guided Contrastive Learning strategy, in which textual prompts describing land cover transitions (e.g., “a photo of [class1] changes to [class2]”) are encoded via a text encoder and projected into a shared semantic space with visual features. Contrastive alignment pulls together paired image–text embeddings while pushing apart mismatched ones, injecting the missing textual modality into change detection and improving inter-class discriminability while reducing semantic confusion.

By unifying difference-aware feature modeling, multi-task supervision, and cross-modal alignment, LGDENet is designed to be robust and semantically consistent across diverse and complex remote sensing scenarios. In the following, the main components of LGDENet are described in detail.

2.2. Image Encoder

Given a bitemporal image pair $I_{1}, I_{2}$, the image encoder extracts multi-level visual representations with shared parameters across time. The encoder adopts a hybrid design that integrates both convolutional blocks and Transformer modules, as illustrated in Figure 2. Specifically, (i) an asymmetric convolution stem with Squeeze-and-Excitation (SE) is employed for local texture refinement and channel recalibration, and (ii) a four-stage hierarchical Swin Transformer stack is utilized for progressive long-range semantic reasoning with selective cross-layer fusion.

2.2.1. Asymmetric Convolution Stem

The input image pair is first processed by an asymmetric convolution module. Specifically, a $3 \times 3$ convolution, which is followed by batch normalization (BN) and ReLU activation, is factorized into two parallel branches utilizing $1 \times 3$ and $3 \times 1$ kernels. This decomposition preserves the receptive field while reducing parameters and the computational overhead. Particularly, it is effective in capturing horizontal and vertical edge structures.
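
As a rough illustration of the parameter saving, the following sketch counts the weights of a square $3 \times 3$ convolution against the two parallel $1 \times 3$ and $3 \times 1$ branches. The channel widths and the helper name `conv_weights` are hypothetical and chosen only for illustration; bias and BN parameters are ignored.

```python
def conv_weights(c_in: int, c_out: int, kh: int, kw: int) -> int:
    """Number of weights in a kh x kw convolution (bias ignored)."""
    return c_in * c_out * kh * kw


# Hypothetical channel widths, not taken from the paper.
c_in, c_out = 3, 64

square = conv_weights(c_in, c_out, 3, 3)
asymmetric = conv_weights(c_in, c_out, 1, 3) + conv_weights(c_in, c_out, 3, 1)

print(square, asymmetric, asymmetric / square)
```

Under these assumptions the two asymmetric branches together hold two thirds of the weights of the square kernel, independent of the channel widths.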

To further adaptively recalibrate feature channels, as shown in the middle of Figure 2, we construct a channel attention mechanism that applies both max pooling and average pooling across the spatial dimensions; the pooled descriptors are then passed through fully connected layers and a sigmoid gating function to produce channel weights. These weights modulate the feature maps, enabling the model to emphasize semantically relevant features while suppressing redundant responses. The resulting output fuses spatial texture cues with channel-aware weighting, serving as a robust input for subsequent Transformer stages.
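
The channel recalibration described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the two pooled vectors are assumed to share one two-layer fully connected network whose outputs are summed before the sigmoid gate, and the weights `w1` and `w2` are hypothetical toy values.

```python
import math


def channel_attention(feat, w1, w2):
    """Scale each channel of feat (C x H x W nested lists) by a sigmoid gate.

    Assumption: the average- and max-pooled vectors share one two-layer MLP
    (w1: hidden x C, w2: C x hidden) and their outputs are summed.
    """
    def mlp(vec):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]  # FC + ReLU
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]           # FC

    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]  # global average pooling
    mx = [max(map(max, ch)) for ch in feat]                            # global max pooling
    gates = [1.0 / (1.0 + math.exp(-(a + m))) for a, m in zip(mlp(avg), mlp(mx))]
    scaled = [[[g * x for x in row] for row in ch] for g, ch in zip(gates, feat)]
    return scaled, gates


# Toy input: 2 channels of size 2 x 2, with hypothetical MLP weights.
feat = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 1.0]]]
w1 = [[0.5, -0.5]]      # 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]    # 1 hidden unit -> 2 channels
out, gates = channel_attention(feat, w1, w2)
print(gates)
```

Each gate lies in (0, 1), so a channel is attenuated or passed through almost unchanged rather than amplified.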

2.2.2. Hierarchical Swin Transformer Backbone

The refined features are tokenized and passed into a four-stage Swin Transformer backbone to capture long-range dependencies. Each stage consists of multiple Swin Transformer blocks interleaved with patch merging layers, as illustrated in Figure 2. The hierarchical design progressively reduces spatial resolution while expanding the feature dimension, enabling multi-scale representation. In particular, local window self-attention within each block models fine-grained interactions, while a shifted window mechanism allows cross-window communication, thereby improving the modeling of the global context.
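
To make the hierarchical resolution schedule concrete, the sketch below traces the token-grid shape entering each stage. It assumes the standard Swin Transformer patch merging rule (each spatial side halved, channel width doubled) and four stages, as stated for the encoder in Section 2.2; the starting grid of $128 \times 128$ tokens with 96 channels is a hypothetical value.

```python
def stage_shapes(height, width, channels, num_stages=4):
    """Token-grid shape (H, W, C) entering each stage of a hierarchical backbone.

    Patch merging between stages halves each spatial side and doubles the
    channel width, as in the standard Swin Transformer design.
    """
    shapes = []
    for _ in range(num_stages):
        shapes.append((height, width, channels))
        height, width, channels = height // 2, width // 2, channels * 2
    return shapes


# Hypothetical starting grid: 128 x 128 tokens with 96 channels.
for stage, shape in enumerate(stage_shapes(128, 128, 96), start=1):
    print(f"stage {stage}: {shape}")
```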

2.3. Difference Enhancement Module

To address the challenge of noisy and ambiguous difference signals in bitemporal feature extraction, we design a Difference Enhancement Module (DEM), which explicitly emphasizes discriminative changes while suppressing irrelevant variations.

Let $X_{1}, X_{2} \in \mathbb{R}^{M \times H \times W}$ denote the deep features extracted from the two temporal images. We first compute a coarse difference representation by element-wise subtraction:

$$X_{c} = X_{1} - X_{2}. \tag{1}$$

The coarse feature $X_{c}$ offers an initial cue to spatial discrepancies but unavoidably includes pseudo-changes arising from illumination, seasonal shifts, and sensor noise. To refine it, we concatenate $X_{c}$ with the original features $X_{1}$ and $X_{2}$ and apply depthwise separable convolutions (DSConv) to efficiently harvest complementary context:

$$\begin{aligned} X_{c1} &= \mathrm{DSConv}(\mathrm{Concat}(X_{c}, X_{1})), \\ X_{c2} &= \mathrm{DSConv}(\mathrm{Concat}(X_{c}, X_{2})). \end{aligned} \tag{2}$$

Subsequently, we employ $1 \times 1$ convolutions with sigmoid activation to generate channel-wise attention weights, which adaptively highlight informative change cues:

$$\begin{aligned} W_{1} &= \sigma(\mathrm{Conv}_{1 \times 1}(X_{c1})), \\ W_{2} &= \sigma(\mathrm{Conv}_{1 \times 1}(X_{c2})). \end{aligned} \tag{3}$$

The weighted feature maps $W_{1} \odot X_{c1}$ and $W_{2} \odot X_{c2}$ are then concatenated and processed by a final DSConv layer to yield the enhanced change feature:

$$\begin{aligned} U_{1} &= W_{1} \odot X_{c1}, \\ U_{2} &= W_{2} \odot X_{c2}, \end{aligned} \tag{4}$$

$$Y = \mathrm{DSConv}(\mathrm{Concat}(U_{1}, U_{2})). \tag{5}$$
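
The efficiency argument behind the DSConv layers in Eqs. (2) and (5) can be checked with simple weight counting. The sketch below is illustrative only: the channel count M = 256, the $3 \times 3$ kernel, and the assumption that each DSConv maps the 2M concatenated channels back to M are hypothetical, and the helper names are not from the paper.

```python
def standard_conv_weights(c_in, c_out, k):
    """Weights of a k x k standard convolution (bias ignored)."""
    return c_in * c_out * k * k


def dsconv_weights(c_in, c_out, k):
    """One depthwise k x k filter per input channel, then a 1 x 1 pointwise mix."""
    return c_in * k * k + c_in * c_out


# Hypothetical sizes: M feature channels and a 3 x 3 kernel.
M, k = 256, 3

# Eq. (2): Concat(X_c, X_1) has 2M channels; assume DSConv maps it back to M.
eq2_ds = dsconv_weights(2 * M, M, k)
eq2_std = standard_conv_weights(2 * M, M, k)

# Eq. (5): Concat(U_1, U_2) again has 2M channels entering the final DSConv.
eq5_ds = dsconv_weights(2 * M, M, k)

print(eq2_ds, eq2_std, round(eq2_std / eq2_ds, 2))
```

The depthwise term filters every channel independently in space, and only the pointwise term mixes channels, which is the channel–spatial decoupling the module relies on.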

For supervised training, we incorporate an additional classification branch that predicts the semantic transition type of each detected change region. Its objective is formulated with a standard cross-entropy loss:

$$L_{cls} = - \frac{1}{N H W} \sum_{i=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=0}^{1} y_{ihwc} \log \hat{y}_{ihwc}, \tag{6}$$

where N denotes the total number of samples, C denotes the total number of semantic categories, $y_{ihwc}$ represents the one-hot ground truth label, and $\hat{y}_{ihwc}$ is the predicted probability for class c.

In parallel, we introduce a semantic segmentation branch to enforce dense, pixel-level supervision. This branch employs a pixel-wise cross-entropy formulation:

$$L_{seg} = - \frac{1}{N H W} \sum_{i=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=0}^{C} u_{ihwc} \log \hat{u}_{ihwc}, \tag{7}$$

where $u_{ihwc}$ and $\hat{u}_{ihwc}$ correspond to the ground truth and predicted probability.
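
As a small worked instance of the pixel-wise cross-entropy in Eq. (7), the sketch below evaluates the loss on a toy batch. Because the ground truth is one-hot, the inner sum over classes reduces to a single log term per pixel. The function name and the toy probabilities are hypothetical.

```python
import math


def pixel_cross_entropy(probs, labels):
    """Mean cross-entropy over all pixels of all samples.

    probs[i][h][w] is a list of class probabilities; labels[i][h][w] is the
    ground-truth class index, so the one-hot sum keeps a single log term.
    """
    total, count = 0.0, 0
    for sample_p, sample_y in zip(probs, labels):
        for row_p, row_y in zip(sample_p, sample_y):
            for p, y in zip(row_p, row_y):
                total -= math.log(p[y])
                count += 1
    return total / count  # count equals N * H * W


# Toy batch: one 1 x 2 image with three classes (hypothetical values).
probs = [[[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]]]
labels = [[[0, 1]]]
loss = pixel_cross_entropy(probs, labels)
print(round(loss, 4))
```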

By jointly leveraging coarse differences, context-aware refinement, and channel-wise weighting, the DEM effectively suppresses irrelevant variations while amplifying structural and semantic changes. The resulting feature Y provides a fine-grained representation of change regions, which is subsequently fed into higher-level semantic modeling.

2.4. Language-Guided Contrastive Learning

2.4.1. Text Prompts and Encoder

To provide meaningful textual supervision, we design a set of prompt templates that describe land cover transitions between bitemporal images. Unlike patch-level templates used in hyperspectral scenarios, which often contain overly limited semantic cues, our templates explicitly capture change types in a natural language form. A representative template is defined as follows:

“A remote sensing image shows a change from [class_before] to [class_after].”

This formulation ensures that each text query encodes not only the categories involved but also their transition relationship, providing richer semantic priors for alignment with visual difference features. For example, “a change from farmland to residential area” explicitly embeds the direction and type of change, which assists the model in discriminating visually similar but semantically distinct transitions.

In addition to transition prompts, we also define a set of static prompts to describe unchanged regions, such as

“A remote sensing image remains [class_unchanged].”

This dual prompt strategy allows the model to cover both changed and unchanged cases within a unified contrastive framework. The combination of change-oriented and unchanged textual descriptions provides complementary semantic information, which guides the later contrastive learning to simultaneously reduce intra-class distance and enlarge inter-class separation across modalities.

For the text prompts, LGDENet utilizes the Transformer [45] architecture initialized with the pre-trained CLIP-ViT-B/32 model as the text encoder, comprising approximately 63 M parameters. Crucially, to maintain computational efficiency and prevent the catastrophic forgetting of its generalized linguistic representations, the parameters of the text encoder are kept strictly frozen during the entire training process.

To ensure that this frozen text space—predominantly pre-trained on natural images—can effectively capture the highly specific temporal “change” semantics in remote sensing, we introduce a domain adaptation strategy. Rather than forcing the text encoder to comprehend remote sensing pixels directly, we leverage its structured compositional language space (which naturally understands concepts like “changes to”) as a semantic anchor. The encoded textual features are then passed through a learnable projection head (MLP) to map them into a shared multi-modal semantic space. Simultaneously, the fully trainable image encoder learns to extract and align the bitemporal visual difference features to this semantic anchor.

Furthermore, the text branch functions exclusively as an auxiliary module during the training phase. During inference, the text encoder is completely discarded, and the network operates solely on the bitemporal image inputs. This asymmetric training–inference design ensures that LGDENet benefits from rich linguistic priors during optimization while strictly preserving its lightweight nature (33.45 M parameters) for practical deployment. These priors help to bridge the gap between raw visual differences and human-interpretable land cover transitions. The strategy also enhances the scalability of LGDENet to diverse datasets, as the prompt templates are generalizable and can be flexibly adapted to new domains.

2.4.2. Language-Guided Multi-Modal Contrastive Learning

Based on the visual features from bitemporal images and the corresponding textual embeddings of the semantic space, a visual–language multi-modal contrastive learning strategy is proposed to reduce the distance between paired image–text features belonging to the same category while enlarging the distance from mismatched pairs. To resolve the granularity mismatch between global text embeddings and localized pixel-level semantic changes, LGDENet employs a Region-Aware Alignment Strategy during training. Specifically, a single $512 \times 512$ input patch may simultaneously encapsulate multiple, distinct semantic transitions. To accurately align these heterogeneous transitions with their corresponding text prompts, we utilize the ground truth segmentation masks to perform region-specific feature aggregation.

For a specific transition class c present in the patch, we extract its binary ground truth mask $M_{c}$. We then apply masked average pooling over the dense visual difference feature map F to obtain a regional visual embedding $v_{c} = \mathrm{Pool}(F \odot M_{c})$. This operation ensures that the visual representation is strictly aggregated from the specific localized regions where the transition occurs. Finally, the aggregated regional visual embedding $v_{c}$ is aligned with the textual embedding $l_{c}$ of the corresponding prompt. This region-based formulation effectively prevents the feature suppression of minor or highly fragmented change regions, as each transition class is independently mapped into the multi-modal semantic space. In the multi-modal semantic space, the visual–language contrastive loss, denoted as $L_{clip}$, is computed between matched image–text pairs:

$$L_{clip} = - \sum_{i=0}^{N} \frac{1}{|P(i)|} \left( \sum_{p \in P_{v}(i)} \log \frac{\exp(l_{i}^{T} v_{p}^{+} / \tau)}{\sum_{n \in N_{v}(i)} \exp(l_{i}^{T} v_{n}^{-} / \tau)} + \sum_{p \in P_{l}(i)} \log \frac{\exp(v_{i}^{T} l_{p}^{+} / \tau)}{\sum_{n \in N_{l}(i)} \exp(v_{i}^{T} l_{n}^{-} / \tau)} \right), \tag{8}$$

where $P_{v}(i)$ and $N_{v}(i)$ represent the positive and negative image feature sets, $P_{l}(i)$ and $N_{l}(i)$ denote the positive and negative text feature sets, and $\tau$ is the temperature parameter. This loss not only constrains intra-class consistency across modalities but also enlarges the inter-class margins, resulting in a more discriminative embedding distribution.
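
The region-aware pooling and the contrastive objective of Eq. (8) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: each transition class is assumed to have exactly one matched visual and text embedding (so $|P(i)| = 1$), every other index is treated as a negative, embeddings are used as given without normalization, and the temperature value 0.07 is hypothetical. The denominator sums over negatives only, following the equation as printed; the widely used InfoNCE form would also include the positive term there.

```python
import math


def masked_average_pool(feat, mask):
    """Average the C-dim features of feat (C x H x W) over pixels where mask == 1."""
    area = sum(map(sum, mask))
    return [sum(f * m for frow, mrow in zip(ch, mask) for f, m in zip(frow, mrow)) / area
            for ch in feat]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(visual, text, tau=0.07):
    """Eq. (8) as printed, assuming one positive per class (|P(i)| = 1).

    visual[i] and text[i] form the matched pair for transition class i;
    every other index is a negative.
    """
    loss, n = 0.0, len(visual)
    for i in range(n):
        neg_v = sum(math.exp(dot(text[i], visual[j]) / tau) for j in range(n) if j != i)
        neg_l = sum(math.exp(dot(visual[i], text[j]) / tau) for j in range(n) if j != i)
        pos = dot(text[i], visual[i]) / tau
        loss -= (pos - math.log(neg_v)) + (pos - math.log(neg_l))
    return loss


# Regional embedding from a toy 2-channel, 2 x 2 feature map and a binary mask.
feat = [[[1.0, 3.0], [5.0, 7.0]],
        [[0.0, 2.0], [4.0, 6.0]]]
mask = [[1, 0], [0, 1]]
v_c = masked_average_pool(feat, mask)
print(v_c)

# Matched pairs give a lower loss than mismatched pairs.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(visual, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(visual, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```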

The final training objective of LGDENet can be obtained by combining the contrastive loss $L_{clip}$, the classification loss $L_{cls}$, and the segmentation loss $L_{seg}$ in a multi-task optimization framework as follows:

$$L_{total} = \lambda_{1} L_{clip} + \lambda_{2} L_{cls} + \lambda_{3} L_{seg}, \tag{9}$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are weighting coefficients balancing the contributions of each component. This joint design encourages the LGDENet model to simultaneously achieve robust cross-modal alignment, reliable category prediction, and fine-grained change localization.

2.5. Datasets

To evaluate the proposed LGDENet, we conduct SCD experiments on two commonly used RS benchmark datasets, SECOND [46] and Landsat-SCD [36], and compare it with a set of related methods.

SECOND: This dataset contains 4662 pairs of multi-temporal, high-resolution RS images collected from major urban areas—principally Shanghai, Chengdu, and Hangzhou. Image spatial resolution spans 0.5–3 m, and all tiles are uniformly cropped to $512 \times 512$ pixels for consistent analysis. Land cover is annotated using one “no change” label together with six semantic classes: water bodies, non-vegetated surfaces, low vegetation, trees, buildings, and sports fields. This labeling scheme supplies diverse context, enabling comprehensive semantic change detection. In the publicly available release, the split comprises 2968 pairs for training and 1694 pairs for testing, which supports robust model development and evaluation across the aforementioned land cover types.
Landsat-SCD: This dataset is constructed from Landsat satellite imagery acquired between 1990 and 2020, with a geographic focus on Tumushuke in Xinjiang, China, located along the margin of the Taklamakan Desert. Each bitemporal pair has a spatial resolution of 30 m and is annotated with one “no change” label together with four land cover classes—farmland, desert, buildings, and water bodies—where only regions exhibiting change are labeled. The original release contains 8468 pairs at $416 \times 416$ pixels; after removing redundant augmentation duplicates in the public version, 2425 unique pairs remain. In our experiments, we adopt 1455 pairs for training, 485 for validation, and 485 for testing. This partitioning supports comprehensive training, model selection, and evaluation, enabling the reliable assessment of change detection performance.

2.6. Experimental Setup

2.6.1. Text Prompt

To incorporate language guidance into semantic change detection, we explicitly associate each semantic transition with a natural language prompt.

Given a bitemporal patch whose land cover class at time $t_{1}$ is $class_{1}$ and at time $t_{2}$ is $class_{2}$, we define a semantic change type as the ordered pair $class_{1} \to class_{2}$ ($class_{1} \neq class_{2}$). For every valid transition in the training data, a text prompt is instantiated using the following template:

“A remote sensing image where $class_{1}$ changes to $class_{2}$.”

Here, $class_{1}$ and $class_{2}$ are dataset-specific category phrases defined below. In this way, all the semantic change types present in the datasets are systematically mapped to language descriptions. For unchanged regions, we use a special prompt:

“There is no change.”

The SECOND dataset contains one “no change” label and six land cover categories, which we describe with the following phrases:

Low vegetation;
Non-vegetated surface;
Tree;
Water;
Building;
Playground.

For each ordered pair $class_{1} \to class_{2}$ ($class_{1} \neq class_{2}$) drawn from the six land cover classes above, we generate a language prompt by plugging the corresponding phrases into the template. A few concrete examples used in our experiments are

“There is no change.”
“A remote sensing image where low vegetation changes to buildings.”
“A remote sensing image where non-vegetated surface changes to playground.”

All other semantic change types on SECOND are instantiated in the same way.

The Landsat-SCD dataset is annotated with one “no change” label and four land cover classes, described as

Farmland;
Desert;
Building;
Water.

Similar to SECOND, every semantic transition $class_{1} \to class_{2}$ ($class_{1} \neq class_{2}$) among the above classes is associated with a language prompt via the same template. Typical examples include

“There is no change.”
“A remote sensing image where farmland changes to buildings.”
“A remote sensing image where desert changes to water.”

By enumerating all valid $class_{1} \to class_{2}$ pairs in SECOND and Landsat-SCD and mapping them to the corresponding prompts, we obtain a complete set of text descriptions for all semantic change types used in our Language-Guided Contrastive Learning framework.
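
The prompt construction described in this subsection can be sketched as follows. The sketch enumerates every ordered pair of distinct classes, which is an upper bound on the prompt set, since only transitions that actually occur in the training data are instantiated in our experiments. The function and variable names are hypothetical, and the class phrases are taken in lowercase from the lists above without the pluralization seen in some of the quoted examples.

```python
from itertools import permutations

TEMPLATE = "A remote sensing image where {src} changes to {dst}."
NO_CHANGE = "There is no change."


def build_prompts(class_phrases):
    """Map every ordered transition (src, dst), src != dst, to a text prompt."""
    prompts = {None: NO_CHANGE}  # key None stands for unchanged regions
    for src, dst in permutations(class_phrases, 2):
        prompts[(src, dst)] = TEMPLATE.format(src=src, dst=dst)
    return prompts


SECOND_CLASSES = ["low vegetation", "non-vegetated surface", "tree",
                  "water", "building", "playground"]
LANDSAT_SCD_CLASSES = ["farmland", "desert", "building", "water"]

second_prompts = build_prompts(SECOND_CLASSES)
landsat_prompts = build_prompts(LANDSAT_SCD_CLASSES)
print(len(second_prompts), len(landsat_prompts))
print(landsat_prompts[("desert", "water")])
```

With six classes this yields at most 30 transition prompts plus the “no change” prompt for SECOND, and at most 12 plus one for Landsat-SCD.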

2.6.2. Comparison Baselines

To demonstrate the effectiveness of the proposed LGDENet, we compare it against several representative semantic change detection models.

SSESN [33]: A method that integrates multi-scale features through a pyramid structure and assigns spatial priorities to bi-temporal branches for change interpretation.
SCDNet [29]: A dual-branch encoder–decoder model, in which a multi-scale convolutional unit expands the receptive field, and the resulting features are fused with encoder representations before being decoded.
Bi-SRNet [30]: A method that incorporates self-attention for richer semantic interaction and a cross-temporal module for improving correspondence, surpassing its baseline counterparts.
SCanNet [8]: A method that explicitly models the temporal-to-temporal semantic transformations via a semantic change Transformer and applies spatio-temporal constraints to align with the task objective.
GAPL-SCD [34]: A method that leverages graph aggregation-based prototype learning under a multi-task optimization regime. By introducing adaptive weight allocation and gradient modulation strategies, it effectively mitigates conflicts among different training objectives, thereby enhancing the stability and efficiency of multi-task learning.

2.6.3. Assessment Criteria

For a quantitative assessment of semantic change detection, we report four standard metrics: the Overall Accuracy (OA), mean Intersection over Union (mIoU), Separation Kappa (SeK), and the semantic F1 score ($F_{scd}$).

Define the confusion matrix as $M = \{m_{ij}\}$, where $m_{ij}$ indicates the number of pixels assigned to class i given ground truth j; the index 0 is reserved for the “no change” class. Since the “no change” label often constitutes the majority of pixels, relying solely on OA can obscure a model’s capacity to discriminate among the semantic change categories, which motivates the inclusion of mIoU, SeK, and $F_{scd}$ in our evaluation protocol. The mIoU metric measures the overlap between the predicted and ground truth regions, thereby providing insights into both change localization and the model’s semantic accuracy. Similarly, the SeK coefficient assesses the model’s effectiveness in distinguishing between classes while accounting for the inherent class imbalance in the dataset. Together, these metrics provide a more balanced evaluation of semantic change detection performance than OA alone.
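
To make the roles of the four metrics concrete, the following minimal sketch computes them from a confusion matrix indexed as above (row = predicted class, column = ground truth, index 0 = “no change”). It assumes the definitions commonly used with the SECOND benchmark: a binary change/no-change mIoU, a kappa computed after removing the true “no change” count, and SeK defined as exp(IoU_c − 1) times that kappa. These formulas, the function name `scd_metrics`, and the toy matrix are illustrative assumptions and may differ in detail from the protocol used to produce the reported numbers.

```python
import math


def scd_metrics(m):
    """Compute OA, mIoU, SeK and F_scd from a confusion matrix.

    m[i][j] = number of pixels predicted as class i whose ground truth is j.
    Index 0 is the "no change" class. Degenerate cases (e.g. no changed
    pixels at all) are not handled in this sketch.
    """
    n = len(m)
    total = sum(sum(row) for row in m)
    diag = [m[i][i] for i in range(n)]

    # Overall accuracy: dominated by the "no change" class when it is the majority.
    oa = sum(diag) / total

    # Binary IoUs: "no change" vs. "any change".
    pred_nc = sum(m[0])
    gt_nc = sum(m[i][0] for i in range(n))
    iou_nc = m[0][0] / (pred_nc + gt_nc - m[0][0])
    iou_c = sum(m[i][j] for i in range(1, n) for j in range(1, n)) / (total - m[0][0])
    miou = (iou_nc + iou_c) / 2

    # Kappa with the true "no change" count removed, so the majority class
    # cannot inflate the agreement score.
    total_hat = total - m[0][0]
    p_obs = sum(diag[1:]) / total_hat
    rows = [sum(m[i]) - (m[0][0] if i == 0 else 0) for i in range(n)]
    cols = [sum(m[k][j] for k in range(n)) - (m[0][0] if j == 0 else 0) for j in range(n)]
    p_exp = sum(r * c for r, c in zip(rows, cols)) / total_hat ** 2
    kappa_hat = (p_obs - p_exp) / (1 - p_exp)
    sek = math.exp(iou_c - 1) * kappa_hat

    # Semantic F1: precision/recall of correctly labelled changed pixels.
    pred_changed = sum(sum(m[i]) for i in range(1, n))
    gt_changed = sum(m[i][j] for i in range(n) for j in range(1, n))
    precision = sum(diag[1:]) / pred_changed
    recall = sum(diag[1:]) / gt_changed
    f_scd = 2 * precision * recall / (precision + recall)

    return {"OA": oa, "mIoU": miou, "SeK": sek, "F_scd": f_scd}


# Toy 3-class example (hypothetical counts): "no change" plus two change classes.
toy = [[90, 2, 1],
       [3, 20, 2],
       [2, 1, 15]]
print(scd_metrics(toy))
```

In the toy matrix, OA is high mainly because of the 90 correctly predicted “no change” pixels, whereas SeK and F_scd depend only on how the changed pixels are handled.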

2.6.4. Implementation Details

In this study, LGDENet is implemented in PyTorch (version 1.12.1) and optimized with Adam, using an initial learning rate of $1 \times 10^{-4}$ and a weight decay of $1 \times 10^{-5}$. Following prior works [47,48,49,50], the patch size for tokenization is set to $13 \times 13$, which provides a good trade-off between efficiency and representational capacity. The model is trained for 200 epochs on SECOND and Landsat-SCD, with the learning rate decayed following a cosine annealing schedule and a linear warmup in the first five epochs. The text encoder is initialized with CLIP-ViT-B/32 to provide stable linguistic representations, while the image encoder is trained from scratch.
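
The learning-rate schedule described above can be sketched as a plain function of the epoch index. This is a minimal illustration, not the training code: the base rate, warmup length, and epoch count come from the text, while per-epoch stepping, a warmup that starts at `base_lr / warmup_epochs`, and a final rate of zero (`min_lr`) are assumptions made for the example.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-4, warmup_epochs=5, total_epochs=200, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr at the last warmup epoch (assumed shape).
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr towards min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for e in (0, 4, 5, 100, 199):
    print(e, lr_at_epoch(e))
```

In a PyTorch training loop, the same curve could be applied by writing the returned value into each optimizer parameter group at the start of every epoch.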

All the experiments are conducted on a single NVIDIA GeForce RTX 3090 Ti GPU. To ensure fairness, all the baseline models are trained and evaluated under the same hardware and software environment.

3. Results

3.1. Quantitative Results and Analysis

Table 1 details the comparative accuracies on SECOND, and Table 2 provides the analogous results on Landsat-SCD for both our method and the baselines. For clarity, the highest accuracy in each setting is emphasized in bold.

From Table 1, the proposed LGDENet attains the best overall performance on the SECOND dataset across all evaluation metrics. In terms of OA, LGDENet reaches 94.14%, exceeding the performance of all the baselines. In terms of mIoU, LGDENet reaches 76.46%, surpassing SCanNet and Bi-SRNet by 0.79% and 1.63%, respectively. Regarding semantic consistency (SeK), LGDENet attains 84.71%, which is 0.48% higher than GAPL-SCD and 1.51% higher than Bi-SRNet, confirming its stronger semantic recognition capability. For the semantic F1 score (F_scd), LGDENet also achieves the highest value of 87.90%. At the category level, improvements are observed in “low vegetation,” “non-vegetated surface,” and “buildings,” where LGDENet obtains 92.31%, 93.44%, and 75.82%, respectively. In particular, for the “building” category, LGDENet improves upon GAPL-SCD by 0.19% and SCanNet by 0.72%, indicating improved robustness in complex urban regions.

Table 2 shows the consistent superiority of the proposed LGDENet on Landsat-SCD for all four overall measures. LGDENet obtains the highest OA of 95.46%, outperforming all compared methods. In terms of mIoU, SeK, and F_scd, LGDENet reaches 87.48%, 54.23%, and 88.71%, respectively, each surpassing the best baseline result. At the category level, notable gains are observed in “farmland,” “desert,” and “building.” Specifically, for “farmland,” LGDENet achieves 84.02%, exceeding Bi-SRNet by 4.76% and SCanNet by 2.11%. For “desert,” the proposed approach attains 86.16%, with improvements of 2.25% over SCanNet and 4.85% over Bi-SRNet. In the “building” category, LGDENet reaches 81.34%, outperforming Bi-SRNet by 4.12% and SCanNet by 3.72%. Additionally, in the “water” category, the performance improves to 90.54%, which is 0.55% higher than GAPL-SCD. These results confirm the robustness of LGDENet in handling diverse land cover transitions, complex textures, and highly reflective regions, demonstrating its strong generalization for semantic change detection.

3.2. Visual Evaluation

SCD results of different methods on the SECOND dataset are visualized in Figure 3. As evidenced by the comparison, the proposed LGDENet delivers stronger results for multiple challenging classes, especially buildings, non-vegetated areas, and low vegetation (marked by black dashed boxes). Compared with the baselines, LGDENet attains higher accuracy on these categories while mitigating semantic ambiguity: building edges are more clearly delineated from non-vegetated areas, and low vegetation is more precisely distinguished from trees. These improvements demonstrate that LGDENet not only enhances category-specific recognition but also maintains higher robustness and consistency when addressing complex land cover changes.

Some semantic change detection results on the Landsat-SCD dataset are shown in Figure 4. As shown in the visual comparisons, the proposed LGDENet exhibits superior capability in detecting semantic changes across water, farmland, desert, and buildings (marked with black dashed boxes). In particular, our method identifies farmland and building changes more accurately, effectively distinguishing subtle variations between agricultural regions and urban structures. Moreover, for desert regions, our approach captures boundary details more faithfully, avoiding the misclassifications observed in other methods. These results indicate that the proposed LGDENet maintains strong adaptability and leading performance across multiple semantic change categories, even in complex environmental scenarios.

3.3. Ablation Study

To accurately quantify the contributions of each proposed component—specifically the hybrid encoder, the Difference Enhancement Module (DEM), and the Language-Guided Contrastive Learning (CL)—we designed a comprehensive ablation study consisting of three parts.

3.3.1. Impact of Core Architectural Components

We establish a baseline model using a standard ResNet-50 backbone with a simple subtraction-based difference module and a UNet-like segmentation head. We then progressively incorporate our proposed modules. The results on the SECOND dataset are presented in Table 3.

Effect of the Hybrid Encoder (Exp 1 vs. 2): Replacing the standard ResNet with our hybrid encoder (Asymmetric Conv + Swin Transformer) yields a 1.2% improvement in $F_{scd}$. This supports our hypothesis that combining local texture features with global context helps resolve the scale–texture paradox in remote sensing.

Effect of the Difference Enhancement Module (Exp 2 vs. 3): Integrating the DEM leads to a substantial gain of 1.6% in $F_{scd}$. This indicates that the channel–spatial decoupling mechanism of the DSConv layers in the DEM effectively suppresses registration noise and enhances true difference signals, supporting the theoretical claims made in the Introduction.

Effect of Language Guidance (Exp 3 vs. 4): Finally, adding the Language-Guided Contrastive Learning branch boosts performance to 87.90% $F_{scd}$. The gain of 1.1% over the strongest visual-only baseline (Exp 3) demonstrates that high-level semantic priors from language help disambiguate visually similar classes (e.g., classifying a change as “building” vs. “non-vegetated surface”), which visual features alone struggle to resolve.

3.3.2. Impact of Textual Alignment Validity

To ensure that the performance gain from the language branch stems from semantic alignment rather than merely from increased model parameters or regularization effects, we compared our CLIP-based prompts against a “no text” baseline and against random text embeddings.

Utilizing random text embeddings provides negligible improvement over the visual-only model (+0.05% $F_{scd}$), indicating that the structural capacity of the text branch contributes little on its own. In contrast, as shown in Table 4, the CLIP-based meaningful prompts provide a significant boost (+1.1% $F_{scd}$), which indicates that the performance improvement is driven by the explicit semantic knowledge transferred from the pre-trained language space to the visual domain.

3.4. Model Efficiency Analysis

To address concerns regarding computational cost, we compared the parameter count (Params) and Floating Point Operations (FLOPs) of LGDENet against other SOTA methods, including heavy foundation model-based approaches.

As illustrated in Table 5, LGDENet achieves state-of-the-art accuracy with a modest parameter count of 33.45 M, which is lower than that of SCDNet and comparable to those of other specialized lightweight networks. Crucially, its FLOPs (65.30 G) are significantly lower than those of Bi-SRNet (189.91 G) and orders of magnitude lower than those of AdaptVFMs-RSCD, which relies on the heavy Segment Anything Model (SAM). This efficiency is attributed to the use of depthwise separable convolutions (DSConv) in our DEM and the efficient window-based attention of Swin Transformer, making LGDENet highly suitable for practical deployment.
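
The saving attributed to depthwise separable convolutions can be illustrated with a short parameter count. The sketch below compares a standard k × k convolution with its depthwise separable counterpart (a per-channel k × k filter followed by a 1 × 1 pointwise convolution), ignoring biases. The channel widths are hypothetical and are not taken from the LGDENet configuration.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dsconv_params(c_in, c_out, k):
    """Weights in a depthwise separable convolution (bias ignored):
    one k x k filter per input channel, then a 1 x 1 pointwise mixing layer."""
    return k * k * c_in + c_in * c_out


# Hypothetical layer: 256 -> 256 channels with 3 x 3 kernels.
standard = conv_params(256, 256, 3)     # 589824
separable = dsconv_params(256, 256, 3)  # 67840
print(standard, separable, standard / separable)
```

The ratio of separable to standard weights is 1/c_out + 1/k², so for 3 × 3 kernels and wide layers the reduction approaches a factor of nine. The multiply-accumulate count scales in the same way, because both quantities are multiplied by the same spatial output size.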

It is worth noting that while the inference stage is highly lightweight, the inclusion of the frozen 63 M parameter CLIP text encoder during training introduces additional GPU memory overhead. In our experiments, optimizing LGDENet with a batch size of eight requires approximately 15.6 GB of GPU memory, which remains well within the capacity of standard consumer-grade hardware (e.g., a single NVIDIA RTX 3090 Ti).

3.5. Interpretability of the Enhancements via Grad-CAM

To intuitively understand the internal mechanisms of LGDENet and validate the theoretical claims regarding our core modules, we conduct a qualitative interpretability analysis using Gradient-weighted Class Activation Mapping (Grad-CAM). Figure 5 visualizes the feature response heatmaps of the network’s final output layer across different ablation stages. We compare three configurations: (a) the baseline model (pure vision; simple subtraction), (b) the baseline integrating the Difference Enhancement Module (+DEM), and (c) the full LGDENet incorporating Language-Guided Contrastive Learning (+CL).

The visual evolution of the class activation maps clearly demonstrates the distinct contributions of each module:

Suppression of Irrelevant Variations by DEM: As observed in column (a) of Figure 5, the pure visual baseline frequently exhibits diffuse and scattered attention. It is easily distracted by pseudo-changes such as seasonal shifts, illumination differences, and registration errors (especially evident in the vast backgrounds of the Landsat-SCD examples in the bottom two rows). Upon incorporating the DEM (column b), the scattered background noise is significantly attenuated. The attention maps become much cleaner and begin to group around the actual changed regions. This empirically substantiates that the channel–spatial decoupling mechanism within the DEM effectively acts as a noise filter, isolating and suppressing non-semantic, high-frequency spatial disturbances before feature fusion.
Semantic Disambiguation via Language Guidance: While the DEM successfully reduces background noise, the feature responses in (b) still exhibit somewhat blurry boundaries and occasionally weak activations on complex transitions. The integration of the text branch (column c) yields a clear qualitative shift. Guided by explicit transition prompts (e.g., “bare land changes to built-up”), the model’s feature responses change markedly: the heatmaps become highly concentrated, exhibiting intense peak activations (deep red regions) that closely align with the specific “from–to” semantic boundaries defined by the ground truth. This visual evidence illustrates that language guidance steers the visual encoder to focus on precise semantic transitions, thereby helping to resolve directional semantic ambiguities that visual features alone struggle to discern.
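
For readers unfamiliar with how such heatmaps are produced, the core Grad-CAM computation is sketched below. Given the activations of a chosen layer and the gradients of a target score with respect to them, each channel is weighted by its spatially averaged gradient, the weighted channels are summed, and negative values are clipped. The final rescaling to [0, 1] is a common visualization convention rather than part of the method, and the function name and toy inputs are illustrative.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from activations and gradients of shape [K][H][W]."""
    k_ch = len(activations)
    h, w = len(activations[0]), len(activations[0][0])

    # Channel importance: global average of the gradient over spatial positions.
    weights = [sum(sum(row) for row in gradients[k]) / (h * w) for k in range(k_ch)]

    # Weighted sum of channels followed by ReLU.
    cam = [[max(0.0, sum(weights[k] * activations[k][i][j] for k in range(k_ch)))
            for j in range(w)] for i in range(h)]

    # Rescale to [0, 1] for display (visualization convention).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: channel 0 supports the target score, channel 1 opposes it.
acts = [[[1.0, 0.0], [0.0, 2.0]],
        [[0.0, 3.0], [0.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(acts, grads))  # [[0.5, 0.0], [0.0, 1.0]]
```

The location that only the opposing channel activates is clipped to zero, which is why Grad-CAM highlights only regions that contribute positively to the selected class.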

4. Discussion

In this study, we developed LGDENet to tackle two pervasive challenges in semantic change detection (SCD): directional semantic ambiguity and pseudo-change noise. While the quantitative results and ablation studies (Section 3) demonstrate the superiority of our method across the SECOND and Landsat-SCD datasets, several underlying mechanisms and practical implications warrant further discussion.

4.1. The Role of Language Guidance in Semantic Disambiguation

Traditional visual-only SCD models, such as SCDNet and Bi-SRNet, often struggle to differentiate between spectrally similar but semantically distinct land cover transitions (e.g., distinguishing a transition to “buildings” versus “roads” in low-contrast environments). Our Language-Guided Contrastive Learning framework explicitly addresses this limitation by injecting high-level textual priors into the visual feature space.

As evidenced by our visual interpretability analyses, the integration of transition-specific prompts (e.g., “a change from bare land to built-up”) fundamentally alters the model’s feature responses. Instead of relying solely on pixel-level spectral differences, the model learns to associate specific visual discrepancies with their corresponding semantic concepts. This cross-modal alignment successfully suppresses diffuse background attention and sharply focuses the network on the exact boundaries of the semantic transition, thereby resolving the directional ambiguities that frequently confound purely visual approaches.
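
The alignment between visual difference features and transition prompts can be illustrated with a generic contrastive objective. The sketch below scores one visual embedding against a set of text embeddings by cosine similarity and applies a softmax cross-entropy, so that the matching prompt is pulled closer and the others are pushed away. It is an illustration of the general technique, not the loss used in LGDENet: the temperature of 0.07, the function names, and the two-dimensional toy embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(visual, texts, target, temperature=0.07):
    """Cross-entropy over similarities between one visual embedding and all prompts.

    visual: embedding of a visual difference feature.
    texts: one embedding per transition prompt.
    target: index of the prompt describing the true transition.
    """
    logits = [cosine(visual, t) / temperature for t in texts]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target]


# Toy prompts for two opposite directions, e.g. "A changes to B" and "B changes to A".
prompts = [[1.0, 0.0], [0.0, 1.0]]
feature = [0.9, 0.1]
print(contrastive_loss(feature, prompts, target=0))  # small: feature matches prompt 0
print(contrastive_loss(feature, prompts, target=1))  # large: wrong direction
```

Treating the two directions of a transition as separate prompts is what lets such an objective penalize a feature that detects the change but assigns it the wrong “from–to” order.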

4.2. Robustness Against Irrelevant Variations

A common vulnerability in bitemporal image analysis is the conflation of genuine land cover changes with irrelevant variations caused by seasonal shifts, illumination differences, and sensor artifacts. Our empirical findings indicate that simple subtraction-based difference modules amplify these high-frequency disturbances.

The proposed Difference Enhancement Module (DEM) overcomes this by leveraging the channel–spatial decoupling property of depthwise separable convolutions (DSConv). By isolating spatial filtering from channel fusion, the DEM acts as a learnable, adaptive noise filter. Our visualizations demonstrate that the DEM effectively attenuates widespread background pseudo-changes before they can propagate into the semantic reasoning stages. Consequently, LGDENet maintains high localization precision even in highly heterogeneous environments, such as the vast desert and farmland borders in the Landsat-SCD dataset.
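
The decoupling of spatial filtering from channel fusion can be made concrete with a small, dependency-free sketch. A depthwise step filters each channel of the bitemporal difference with its own 3 × 3 kernel, and a pointwise step then mixes channels at each pixel. This illustrates the DSConv principle only and is not the DEM architecture: the absolute-difference input, the fixed averaging kernel (learned in practice), and all function names are assumptions.

```python
def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 filtering with zero padding. x: [C][H][W], kernels: [C][3][3]."""
    c_n, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c_n)]
    for c in range(c_n):
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernels[c][di + 1][dj + 1] * x[c][ii][jj]
                out[c][i][j] = acc
    return out


def pointwise_conv(x, weights):
    """1x1 channel mixing. x: [C_in][H][W], weights: [C_out][C_in]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[sum(wt[c] * x[c][i][j] for c in range(c_in)) for j in range(w)]
             for i in range(h)] for wt in weights]


def difference_enhance(f1, f2, dw_kernels, pw_weights):
    """Absolute bitemporal difference, spatial filtering, then channel fusion."""
    diff = [[[abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
            for c1, c2 in zip(f1, f2)]
    return pointwise_conv(depthwise_conv3x3(diff, dw_kernels), pw_weights)


# One channel, 5x5 maps: an isolated one-pixel difference vs. a coherent 3x3 block.
zeros = [[[0.0] * 5 for _ in range(5)]]
spike = [[[1.0 if (i, j) == (2, 2) else 0.0 for j in range(5)] for i in range(5)]]
block = [[[1.0 if 1 <= i <= 3 and 1 <= j <= 3 else 0.0 for j in range(5)] for i in range(5)]]
mean_kernel = [[[1 / 9] * 3 for _ in range(3)]]
identity_mix = [[1.0]]

print(difference_enhance(spike, zeros, mean_kernel, identity_mix)[0][2][2])
print(difference_enhance(block, zeros, mean_kernel, identity_mix)[0][2][2])
```

With the averaging kernel, the isolated spike is attenuated to one ninth of its value while the centre of the coherent block is preserved, which is the kind of behaviour a learned spatial filter can exploit to separate pseudo-changes from genuine ones.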

4.3. Efficiency vs. Performance Trade-Off

Recent trends in remote sensing have increasingly leaned toward adapting large-scale foundation models (e.g., SAM and CLIP) for dense prediction tasks. While methods like AdaptVFMs-RSCD have shown promise, their reliance on heavy backbones with hundreds of millions of parameters severely restricts their deployability in resource-constrained or real-time Earth observation scenarios.

LGDENet demonstrates that an explicit, targeted integration of language priors during the training phase can achieve comparable or superior semantic reasoning without the inference time burden of a heavy text encoder. With only 33.45 M parameters and 65.30 G FLOPs, LGDENet provides a highly competitive alternative, proving that lightweight, hybrid CNN-Transformer architectures can still achieve state-of-the-art SCD performance when supported by intelligent multi-modal training strategies.

4.4. Limitations and Future Perspectives

Despite the promising results, this study has certain limitations that outline avenues for future research. First, the current language guidance relies on manually predefined, dataset-specific prompt templates. While this ensures precise transition descriptions, it lacks the flexibility required for open-world semantic change detection where novel, unseen categories might emerge. Future work will explore learnable prompt engineering (Prompt Tuning) to automate textual generation and enhance generalizability. Second, due to the patch-level alignment strategy in our contrastive learning framework, the feature representations of extremely small or highly fragmented change regions may occasionally be overshadowed by dominant semantic transitions within the same patch. Addressing this granularity imbalance—perhaps by introducing dense, pixel-level visual–textual alignment or integrating focal loss mechanisms for minor classes—will be a critical next step to further refine the model’s performance in highly complex scenes. Furthermore, while the DEM exhibits strong robustness against general pseudo-changes, its performance under extreme geometric registration errors—a persistent challenge in high-resolution UAV imagery—remains a constraint that warrants further investigation.

5. Conclusions

In this paper, we proposed LGDENet, a novel Language-Guided Contrastive Learning framework with difference enhancement for semantic change detection in remote sensing images. The framework integrates an improved image encoder, which combines asymmetric convolutions, Squeeze-and-Excitation modules, and multi-stage Transformers with selective cross-layer fusion, together with a Difference Enhancement Module to highlight truly changed regions while suppressing irrelevant variations. Furthermore, we introduced a language-guided contrastive learning strategy, where textual descriptions of change are aligned with visual representations in a shared semantic space, enabling the model to learn more discriminative and semantically consistent features. A multi-task objective combining classification, segmentation, and contrastive losses ensures comprehensive supervision.

LGDENet establishes new state-of-the-art results on both benchmarks, achieving an OA of 94.14%, mIoU of 76.46%, SeK of 84.71%, and $F_{scd}$ of 87.90% on the SECOND dataset, alongside similarly dominant metrics (OA: 95.46%, mIoU: 87.48%, SeK: 54.23%, and $F_{scd}$: 88.71%) on Landsat-SCD. Requiring only 33.45 M parameters and 65.30 G FLOPs, our model demonstrates a superior accuracy–efficiency trade-off, circumventing the prohibitive costs of heavy foundation models. Furthermore, ablation studies confirm that the enhanced visual encoder and language guidance are essential for capturing fine-grained changes and resolving semantic ambiguities.

In the future, we plan to further optimize LGDENet from several perspectives. First, adaptive prompt engineering and large-scale multi-modal pre-training could be explored to enhance the flexibility and generalization of language guidance. Second, knowledge distillation and model compression techniques may be introduced to design lightweight variants suitable for real-time applications in disaster monitoring, land resource management, and urban planning. Finally, extending the framework to handle multi-source and multi-resolution remote sensing data is a promising direction to improve robustness in more complex real-world scenarios.

Author Contributions

Conceptualization, Y.H. and L.R.; methodology, Y.H., L.R., H.J., K.G., T.L., J.G., Y.S. and B.Y.; software, L.R.; data curation, L.R.; writing—original draft, L.R.; writing—review and editing, Y.H.; visualization, L.R.; supervision, Y.H.; project administration, Y.H.; and funding acquisition, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62572017.

Data Availability Statement

Publicly available datasets were analyzed in this study. The SECOND dataset can be found at https://captain-whu.github.io/SCD/ (accessed on 9 March 2026), and the Landsat-SCD dataset is available at https://figshare.com/articles/figure/Landsat-SCD_dataset_zip/19946135/1?file=35495660 (accessed on 9 March 2026).

Conflicts of Interest

The authors declare no conflict of interest.

References

Wang, D.; Zhang, J.; Du, B.; Xia, G.S.; Tao, D. An empirical study of remote sensing pretraining. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5608020.
Miao, W.; Geng, J.; Jiang, W. Multigranularity decoupling network with pseudolabel selection for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5603813.
Geng, J.; Deng, X.; Ma, X.; Jiang, W. Transfer learning for SAR image classification via deep joint distribution adaptation networks. IEEE Trans. Geosci. Remote Sens. 2020, 58, 5377–5392.
Hu, J.; Zhang, Y. Seasonal change of land-use/land cover (LULC) detection using MODIS data in rapid urbanization regions: A case study of the pearl river delta region (China). IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 1913–1920.
Bai, T.; Wang, L.; Yin, D.; Sun, K.; Chen, Y.; Li, W.; Li, D. Deep learning for change detection in remote sensing: A review. Geo-Spat. Inf. Sci. 2023, 26, 262–288.
Saleh, T.; Weng, X.; Holail, S.; Hao, C.; Xia, G.S. DAM-Net: Flood detection from SAR imagery using differential attention metric-based vision transformers. ISPRS J. Photogramm. Remote Sens. 2024, 212, 440–453.
Wang, D.; Ma, G.; Zhang, H.; Wang, X.; Zhang, Y. Refined change detection in heterogeneous low-resolution remote sensing images for disaster emergency response. ISPRS J. Photogramm. Remote Sens. 2025, 220, 139–155.
Ding, L.; Zhang, J.; Guo, H.; Zhang, K.; Liu, B.; Bruzzone, L. Joint spatio-temporal modeling for semantic change detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610814.
Varghese, A.; Gubbi, J.; Ramaswamy, A.; Balamuralidhar, P. ChangeNet: A deep learning architecture for visual change detection. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops; Springer: Berlin/Heidelberg, Germany, 2018; pp. 129–145.
Xu, Z.; Jiang, W.; Geng, J. Dual-branch dynamic modulation network for hyperspectral and LiDAR data classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5514813.
Xu, Z.; Jiang, W.; Geng, J. Texture-aware causal feature extraction network for multimodal remote sensing data classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5103512.
Wu, H.; Geng, J.; Jiang, W. Multidomain constrained translation network for change detection in heterogeneous remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5616916.
Wang, G.; Cheng, G.; Zhou, P.; Han, J. Cross-level attentive feature aggregation for change detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 6051–6062.
Zhou, Y.; Wang, F.; Zhao, J.; Yao, R.; Chen, S.; Ma, H. Spatial-temporal based multihead self-attention for remote sensing image change detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6615–6626.
Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200.
Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224713.
Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A hybrid transformer network for change detection in optical remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622519.
Bandara, W.G.C.; Patel, V.M. A transformer-based siamese network for change detection. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium; IEEE: New York, NY, USA, 2022; pp. 207–210.
Saha, S.; Bovolo, F.; Bruzzone, L. Building change detection in VHR SAR images via unsupervised deep transcoding. IEEE Trans. Geosci. Remote Sens. 2020, 59, 1917–1929.
Hou, X.; Bai, Y.; Xie, Y.; Zhang, Y.; Fu, L.; Li, Y.; Shang, C.; Shen, Q. Self-supervised multimodal change detection based on difference contrast learning for remote sensing imagery. Pattern Recognit. 2025, 159, 111148.
Wang, Q.; Jing, W.; Chi, K.; Yuan, Y. Cross-difference semantic consistency network for semantic change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4406312.
Wang, L.; Zhang, M.; Gao, X.; Shi, W. Advances and challenges in deep learning-based change detection for remote sensing images: A review through various learning paradigms. Remote Sens. 2024, 16, 804.
Xie, W.; Shao, W.; Li, D.; Li, Y.; Fang, L. MIFNet: Multi-scale interaction fusion network for remote sensing image change detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 2725–2739.
Cui, B.; Peng, Y.; Zhang, Y.; Yin, H.; Fang, H.; Guo, S.; Du, P. Enhanced edge information and prototype constrained clustering for SAR change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5206116.
Long, J.; Li, M.; Wang, X.; Stein, A. Semantic change detection using a hierarchical semantic graph interaction network from high-resolution remote sensing images. ISPRS J. Photogramm. Remote Sens. 2024, 211, 318–335.
Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662.
Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5604816.
Li, Z.; Tang, C.; Liu, X.; Zhang, W.; Dou, J.; Wang, L.; Zomaya, A.Y. Lightweight remote sensing change detection with progressive feature aggregation and supervised attention. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602812.
Peng, D.; Bruzzone, L.; Zhang, Y.; Guan, H.; He, P. SCDNET: A novel convolutional network for semantic change detection in high resolution optical remote sensing imagery. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102465.
Ding, L.; Guo, H.; Liu, S.; Mou, L.; Zhang, J.; Bruzzone, L. Bi-temporal semantic reasoning for the semantic change detection in HR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5620014.
Chang, H.; Wang, P.; Diao, W.; Xu, G.; Sun, X. A triple-branch hybrid attention network with bitemporal feature joint refinement for remote-sensing image semantic change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5613816.
Zheng, Z.; Zhong, Y.; Tian, S.; Ma, A.; Zhang, L. ChangeMask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection. ISPRS J. Photogramm. Remote Sens. 2022, 183, 228–239.
Zhao, M.; Zhao, Z.; Gong, S.; Liu, Y.; Yang, J.; Xiong, X.; Li, S. Spatially and semantically enhanced Siamese network for semantic change detection in high-resolution remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2563–2573.
Xu, Z.; Wu, H.; Jiang, W.; Geng, J. Graph Aggregation Prototype Learning for Semantic Change Detection in Remote Sensing. arXiv 2025, arXiv:2507.10938.
Li, Z.; Wang, X.; Fang, S.; Zhao, J.; Yang, S.; Li, W. A decoder-focused multitask network for semantic change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5609115.
Yuan, P.; Zhao, Q.; Zhao, X.; Wang, X.; Long, X.; Zheng, Y. A transformer-based Siamese network and an open optical dataset for semantic change detection of remote sensing images. Int. J. Digit. Earth 2022, 15, 1506–1525.
Ou, X.; Liu, L.; Tan, S.; Zhang, G.; Li, W.; Tu, B. A hyperspectral image change detection framework with self-supervised contrastive learning pretrained model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7724–7739.
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763.
Xu, C. LDGNet: A Lightweight Difference Guiding Network for Remote Sensing Change Detection. arXiv 2025, arXiv:2504.05062.
Wang, X.; Dong, S.; Zheng, X.; Lu, R.; Jia, J. Explicit High-Level Semantic Network for Domain Generalization in Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5538314.
Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. arXiv 2024, arXiv:2306.11029.
Li, X.; Wen, C.; Hu, Y.; Zhou, N. RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103497.
Jiang, W.; Sun, Y.; Lei, L.; Kuang, G.; Ji, K. AdaptVFMs-RSCD: Advancing Remote Sensing Change Detection from binary to semantic with SAM and CLIP. ISPRS J. Photogramm. Remote Sens. 2025, 230, 304–317.
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 261–272.
Yang, K.; Xia, G.S.; Liu, Z.; Du, B.; Yang, W.; Pelillo, M.; Zhang, L. Asymmetric siamese networks for semantic change detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5609818.
Zhang, Y.; Zhang, M.; Li, W.; Wang, S.; Tao, R. Language-aware domain generalization network for cross-scene hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5501312.
Dong, L.; Geng, J.; Jiang, W. Spectral–spatial enhancement and causal constraint for hyperspectral image cross-scene classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5507013.
Zhao, H.; Zhang, J.; Lin, L.; Wang, J.; Gao, S.; Zhang, Z. Locally linear unbiased randomization network for cross-scene hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5526512.
Zhang, Y.; Li, W.; Sun, W.; Tao, R.; Du, Q. Single-source domain expansion network for cross-scene hyperspectral image classification. IEEE Trans. Image Process. 2023, 32, 1498–1512.

Figure 1. The framework of the proposed Language-Guided Contrastive Learning and Difference Enhancement Network (LGDENet) for semantic change detection in remote sensing images.

Figure 2. The image encoder for the input bitemporal images.

Figure 3. Visualization of experimental results on the SECOND dataset. The black dashed boxes highlight specific regions with significant improvements by our proposed method.

Figure 4. Visualization of experimental results on the Landsat-SCD dataset. The black dashed boxes highlight specific regions with significant improvements by our proposed method.

Figure 5. Grad-CAM visualization of feature responses on the SECOND (top two rows) and Landsat-SCD (bottom two rows) datasets. (a) Baseline: The model shows diffuse attention and recognition errors, often distracted by background pseudo-changes. (b) +DEM: Suppresses irrelevant background variations and noise. (c) Ours (+DEM +CL): The language guidance steers the model’s focus, yielding sharp activations that closely align with semantic boundaries. Red dashed boxes highlight specific regions where the baseline model’s recognition errors are corrected by the DEM and text branch.

Table 1. Comparison of different methods on the SECOND dataset. The best results are highlighted in bold.

Category	SSESN [33]	SCDNet [29]	Bi-SRNet [30]	SCanNet [8]	GAPL-SCD [34]	LGDENet
OA (%)	93.45	93.78	93.62	93.93	94.02	94.14
mIoU (%)	73.56	74.45	74.83	75.67	76.02	76.46
SeK (%)	82.17	82.88	83.20	83.88	84.23	84.71
$F_{scd}$ (%)	85.44	86.12	86.60	87.10	87.49	87.90
Non-change	93.13	92.22	92.10	93.25	93.56	93.68
Low vegetation	89.23	91.11	90.67	91.67	92.06	92.31
N.v.g. surface	92.52	92.25	92.35	93.02	93.22	93.44
Tree	85.67	86.22	86.45	86.90	87.15	87.24
Water	91.23	91.73	91.58	92.32	92.78	93.20
Building	74.56	73.45	74.33	75.10	75.63	75.82
Playground	83.33	82.89	83.22	84.05	84.47	84.70

Table 2. Comparison of different methods on the Landsat-SCD dataset. The best results are highlighted in bold.

Category	SSESN [33]	SCDNet [29]	Bi-SRNet [30]	SCanNet [8]	GAPL-SCD [34]	LGDENet
OA (%)	89.15	91.44	93.80	95.04	95.30	95.46
mIoU (%)	74.17	77.95	82.94	86.37	87.02	87.48
SeK (%)	24.28	32.46	44.27	52.63	53.88	54.23
$F_{scd}$ (%)	68.27	74.82	82.01	85.62	85.99	88.71
Non-change	95.10	95.61	96.75	97.84	97.63	97.78
Farmland	63.07	71.24	79.26	81.91	83.65	84.02
Desert	66.74	74.76	81.31	83.91	85.75	86.16
Building	40.18	59.45	77.22	77.62	81.21	81.34
Water	85.49	84.96	88.23	89.13	89.99	90.54

Table 3. Component analysis on the SECOND dataset. The checkmark (✓) indicates the inclusion of a specific module, while the cross (×) indicates its absence. The best results are highlighted in bold.

Exp.	Hybrid Encoder	DEM	Language-Guided CL	OA (%)	mIoU (%)	$F_{scd}$ (%)
1	× (ResNet)	× (Simple Subtraction)	×	92.50	73.15	84.00
2	✓	× (Simple Subtraction)	×	93.10	74.82	85.20
3	✓	✓	×	93.80	75.90	86.80
4	✓	✓	✓ (LGDENet)	94.14	76.46	87.90
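
The per-component contributions implied by Table 3 can be read off by differencing consecutive rows. The short Python sketch below does this arithmetic; the numbers are transcribed from the table, while the row labels and the helper name `incremental_gains` are illustrative assumptions, not part of the original work.

```python
# Values transcribed from Table 3 (SECOND dataset): (label, OA, mIoU, F_scd).
# The labels are illustrative shorthand for the experiment rows.
ablation = [
    ("Baseline (ResNet, simple subtraction)", 92.50, 73.15, 84.00),
    ("+ Hybrid encoder", 93.10, 74.82, 85.20),
    ("+ DEM", 93.80, 75.90, 86.80),
    ("+ Language-guided CL (LGDENet)", 94.14, 76.46, 87.90),
]


def incremental_gains(rows):
    """Return (label, (dOA, dmIoU, dF_scd)) for each row relative to the previous one."""
    gains = []
    for prev, curr in zip(rows, rows[1:]):
        deltas = tuple(round(c - p, 2) for p, c in zip(prev[1:], curr[1:]))
        gains.append((curr[0], deltas))
    return gains


for name, (d_oa, d_miou, d_f) in incremental_gains(ablation):
    print(f"{name}: OA {d_oa:+.2f}, mIoU {d_miou:+.2f}, F_scd {d_f:+.2f}")
```

Under this reading, each added component improves all three metrics, with the DEM contributing the largest single-step gain in $F_{scd}$ (+1.60 points) and the language-guided branch adding a further +1.10 points.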

Table 4. Text embedding validation on the SECOND dataset. The best results are highlighted in bold.

Text Strategy	OA (%)	mIoU (%)	$F_{scd}$ (%)
No Text (Visual Only)	93.80	75.90	86.80
CLIP Text Embedding (Ours)	94.14	76.46	87.90

Table 5. Model efficiency comparison. The best results are highlighted in bold.

Method	Backbone	Params (M)	FLOPs (G)	$F_{scd}$ (%)
SCDNet	ResNet-50	39.62	116.98	86.12
Bi-SRNet	ResNet-18	22.24	189.91	86.60
SCanNet	ResNet-18	27.90	-	87.10
AdaptVFMs-RSCD	SAM-B + CLIP	>150.00	>500.00	-
LGDENet (Ours)	Swin-T + CNN	33.45	65.30	87.90
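
To make the efficiency comparison in Table 5 concrete, the following Python sketch computes the relative FLOPs reduction of LGDENet with respect to each baseline that reports a FLOPs figure. The values are transcribed from the table; the dictionary layout and the function name `flops_reduction` are assumptions made for illustration. AdaptVFMs-RSCD is omitted because the table gives only lower bounds for it, and SCanNet has no reported FLOPs.

```python
# Values transcribed from Table 5: (Params in M, FLOPs in G, F_scd in %).
# None marks an entry the table leaves blank ("-").
models = {
    "SCDNet": (39.62, 116.98, 86.12),
    "Bi-SRNet": (22.24, 189.91, 86.60),
    "SCanNet": (27.90, None, 87.10),
    "LGDENet": (33.45, 65.30, 87.90),
}


def flops_reduction(reference, target="LGDENet"):
    """Percentage of FLOPs saved by `target` relative to `reference`, or None if unknown."""
    ref_flops = models[reference][1]
    tgt_flops = models[target][1]
    if ref_flops is None or tgt_flops is None:
        return None
    return round(100.0 * (1.0 - tgt_flops / ref_flops), 1)


for name in ("SCDNet", "Bi-SRNet", "SCanNet"):
    saved = flops_reduction(name)
    label = "n/a" if saved is None else f"{saved:.1f}%"
    print(f"LGDENet vs {name}: FLOPs reduction {label}")
```

On these figures, LGDENet uses roughly 44% fewer FLOPs than SCDNet and roughly 66% fewer than Bi-SRNet, although its parameter count (33.45 M) is higher than that of Bi-SRNet and SCanNet.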

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, Y.; Ren, L.; Jiang, H.; Guo, K.; Liu, T.; Gao, J.; Sun, Y.; Yin, B. Language-Guided Contrastive Learning and Difference Enhancement for Semantic Change Detection in Remote Sensing Images. Remote Sens. 2026, 18, 964. https://doi.org/10.3390/rs18060964

